JP3025732B2

JP3025732B2 - Control method of multiplex computer system

Info

Publication number: JP3025732B2
Application number: JP5176552A
Authority: JP
Inventors: 一洋島田; 俊正曾我部; 博之中山; 桂介河合
Original assignee: PFU Ltd
Current assignee: PFU Ltd
Priority date: 1993-07-16
Filing date: 1993-07-16
Publication date: 2000-03-27
Anticipated expiration: 2015-03-27
Also published as: JPH0736721A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a multiplexed computer system in which, when a failure occurs in the active system, another system (the new active system) takes over the work of the active system and also analyzes the failure, for example a multiplexed computer system applied to a bank online system or a traffic control system. In particular, it relates to a control method for a multiplexed computer system in which a business takeover storage device and a failure analysis storage device are common resources of the active system and the new active system. When a failure occurs, the business takeover storage device is connected to the new active system to speed up the continuation of work from the active system to the new active system, and the failure analysis storage device is connected to the new active system after the active system's information for failure analysis has been stored in it, so that failure analysis in the new active system is made reliable and fast. The computer system that operates as the new active system is the standby computer system while the active system is operating normally.

[0002] The invention also relates to a control method for a multiplexed computer system in which, when the active system (failed system) whose work was taken over by the new active system recovers, the business takeover storage device is reconnected to the active system only after confirming that the new active system is not using it, so that the active system resumes its work correctly.

[0003] In this specification, "standby system" is defined strictly with respect to the first task, for which the active system is responsible. While the active system is operating normally, a standby system does one of two things: it either prepares exclusively for a failure of the active system without executing any other, second task, or it operates as the active system for a second task. In the latter case, when the active system (responsible for the first task) goes down, the standby system becomes a "new active system" that executes the first task in addition to its original second task.

[0004]

[Prior Art] FIG. 5 is an explanatory diagram showing an application example of a general duplex computer system. Reference numeral 21 denotes a computer system (active system); 22, a computer system (standby system); 23 and 23', CPUs; 24 and 24', memories (main storage devices); 25 and 25', digital I/O controllers (DIOC); 26 and 26', system controllers; 27 and 27', file controllers; 28 and 28', line controllers; 29 and 29', LAN controllers; 30 and 30', system disk devices; 31 and 31', failure analysis disk devices serving as memory dump output destinations; 32, a duplex switching device; 33, an inter-system communication path; 34, a common disk device (business takeover disk device) to which both the active system 21 and the standby system 22 perform input and output; 35, a business takeover disk device corresponding to the active system 21; 36, a business takeover disk device corresponding to the standby system 22; 37, a line switching device; 38, a network such as a LAN or WAN; and 39, a terminal system.

[0005] In this specification, "inter-system communication channel" refers to either the duplex switching device 32 or the inter-system communication path 33. Throughout this specification, the failure analysis disk device and the business takeover disk device are each referred to simply as a disk device where appropriate.

[0006] The duplex switching device 32 provides a mutual monitoring function between the computer systems. When any of various failures occurs in the active system 21, such as a power failure, an abnormal software loop, or an internal software inconsistency, it notifies the standby system 22 with an interrupt bit indicating the failure. In addition, when the standby system 22 detects a failure of the active system 21 via the inter-system communication path 33, the duplex switching device 32 is used as the inter-system communication channel for inquiring about the cause.

[0007] The inter-system communication path 33 is duplicated. The standby system 22 performs periodic diagnosis of the active system 21, that is, it sends a predetermined message at a fixed interval and checks for the reply. When it cannot confirm the predetermined reply from the active system 21, it inquires of the active system 21 about the cause of the failure using the indicator bits of the duplex switching device 32.
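The periodic diagnosis described in [0007] can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the message contents, so the probe and reply strings, the `send_probe` callback, and both function names are hypothetical.

```python
from typing import Callable, Optional


def reply_is_valid(send_probe: Callable[[str], Optional[str]],
                   probe: str = "ARE-YOU-ALIVE",      # hypothetical message text
                   expected: str = "ALIVE") -> bool:  # hypothetical reply text
    """Send one diagnostic message and check the reply.

    send_probe stands in for the inter-system communication path 33. It
    returns the reply text, or None when no reply arrived in time.
    """
    return send_probe(probe) == expected


def next_action(diagnosis_ok: bool) -> str:
    """Choose what the standby system does after one diagnosis cycle."""
    if diagnosis_ok:
        return "wait for next cycle"
    # No valid reply: ask for the cause over the other route.
    return "inquire cause via duplex switching device 32"
```

A caller would run `reply_is_valid` once per diagnosis interval and act on `next_action`; the interval itself is not given in the text.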

[0008] Thus the standby system 22 checks for failures over two routes: the mutual monitoring route, over which the active system 21 reports a failure, and the periodic diagnosis route, which depends on whether the active system 21 returns the predetermined response to a message sent by the standby system 22.

[0009] When the standby system 22 confirms a failure via either the mutual monitoring route or the periodic diagnosis route, it inquires of the active system 21 about the cause of the failure via the other route.

[0010] This is to distinguish the case where only the hardware of the mutual monitoring route or the periodic diagnosis route has failed and the active system 21 is operating normally from the case where the software, power supply, or the like of the active system 21 has failed, and to switch the standby system over to become the new active system only in the latter case.

[0011] In the former case, the active system 21 responds to the inquiry by sending the standby system 22 indicator bits or the like showing that its power supply and software are operating normally; in the latter case no such response is sent. When a failure is confirmed via both the mutual monitoring route and the periodic diagnosis route, the standby system 22 concludes that the failure is inside the active system 21 without inquiring about the cause.
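The decision rule of paragraphs [0009] to [0011] can be written as a small function. This is a minimal sketch, not the patented implementation; the names `Verdict` and `classify` and the three boolean parameters are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    NO_FAILURE = "no failure detected"
    ROUTE_HARDWARE_FAILURE = "monitoring route failed; active system healthy"
    ACTIVE_SYSTEM_FAILURE = "failure inside the active system; switch over"


def classify(mutual_monitoring_alarm: bool,
             periodic_diagnosis_alarm: bool,
             normal_reply_via_other_route: bool) -> Verdict:
    """Classify a suspected failure from the two monitoring routes.

    normal_reply_via_other_route is True when the active system answered the
    cause inquiry, sent over the route that did NOT raise the alarm, with an
    indication that its power supply and software are operating normally.
    """
    if not (mutual_monitoring_alarm or periodic_diagnosis_alarm):
        return Verdict.NO_FAILURE
    if mutual_monitoring_alarm and periodic_diagnosis_alarm:
        # Both routes report a failure: no inquiry is made.
        return Verdict.ACTIVE_SYSTEM_FAILURE
    if normal_reply_via_other_route:
        # Only the alarming route's hardware is broken; do not switch over.
        return Verdict.ROUTE_HARDWARE_FAILURE
    return Verdict.ACTIVE_SYSTEM_FAILURE
```

The point of the cross-check is that a single alarming route is not enough to trigger a switchover; only silence on the second route, or alarms on both, is.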

[0012] FIG. 6 is an explanatory diagram showing a specific example of a general duplex switching device 32. Reference numerals 23 and 23' denote CPUs; 25 and 25', digital I/O controllers (DIOC); 41, the input bit (COMP) for power failure notification; 42, the input bit (CALL) for panic notification; 43, the output bit (PANIC) for panic notification; 44, the input bit (WDTI: Watch Dog Timer Input) for software fault notification; 45, a watchdog timer; and 46 and 46', external equipment interfaces.

[0013] When the signal of the external equipment interface 46 or 46' of a system whose power has been cut changes, the input bit (COMP) 41 of the other system changes from "1" to "0". When a "software abnormality (panic)" occurs and the panic function turns on the output bit (PANIC) 43 of its own system, the input bit (CALL) 42 of the other system changes from "0" to "1".

[0014] When the watchdog timer 45 detects a "software fault", such as an abnormal loop (infinite loop) or an internal inconsistency in the OS program that does not lead to a panic (a deadlock, for example), the value of the input bit (WDTI) 44 changes.

[0015] Software failures are thus broadly divided into a "software abnormality (panic)", in which the entire computer system becomes completely inoperative, and a "software fault", in which part of the system is likely to remain operating.
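The meaning of the three input bits in [0013] and [0014] can be illustrated with a short decoder. This is a sketch under assumptions: the text gives the polarity of COMP and CALL but only says that WDTI "changes", so the watchdog input is modeled as a changed/unchanged flag; `SwitchInputs` and `diagnose` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SwitchInputs:
    comp: int           # input bit (COMP) 41: 1 = peer powered, 0 = peer power lost
    call: int           # input bit (CALL) 42: 0 = normal, 1 = peer reported a panic
    wdti_changed: bool  # input bit (WDTI) 44: True once its value has changed


def diagnose(bits: SwitchInputs) -> List[str]:
    """Return the failure kinds signalled by the peer system's bits."""
    faults = []
    if bits.comp == 0:
        faults.append("power failure")
    if bits.call == 1:
        faults.append("software abnormality (panic)")
    if bits.wdti_changed:
        faults.append("software fault (watchdog)")
    return faults
```

An empty list means the mutual monitoring route currently reports nothing.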

[0016] In a duplex computer system of this kind, in order to investigate the cause of a failure in the active system 21, the data held in its memory 24, system disk device 30, and so on is output to the disk device 31; that is, a memory dump is performed.

[0017] The cause of the failure is then analyzed by the active system (failed system) 21 itself: after it has recovered from the failure and restarted, it refers to the contents of its own disk device 31.

[0018] The business takeover disk device 35 sequentially stores the most recent processing of the active system 21 in predetermined units. When a failure of the active system 21 causes the standby system 22 to switch over to become the new active system, the new active system (standby system) 22 refers to this disk device 35 to identify the data needed to take over the work, that is, the business data whose processing was interrupted when the active system 21 went down and the data related to it (hereinafter, business takeover data).

[0019] When the active system (failed system) 21 recovers from the down state to the normal state, it either connects the business takeover disk device 35 to itself at an arbitrary time and resumes its originally assigned work, or continues in its current state without connecting the business takeover disk device 35 to itself.

[0020] In the former case, the standby system (new active system) 22 is released from the work it took over from the active system 21 and executes only its own original work; in the latter case, it continues to execute both the taken-over work and its own originally assigned work.

[0021]

[Problems to Be Solved by the Invention] As described above, in the conventional control method for a multiplexed computer system, the memory dump is output to a failure analysis disk device inside the failed (active) system itself, and the analysis is performed by the active system after it recovers. Moreover, after recovering, the active system (failed system) either does not take its originally assigned work back from the new active system or, if it does, connects the business takeover disk device to itself at an arbitrary time, that is, without considering whether the new active system is using the disk device (see FIG. 7).

[0022] As a result, failure analysis is delayed and the load on the standby system is increased needlessly. Furthermore, the active system may connect the business takeover disk device while the new active system is using it. In that case, not only does the new active system's input and output to the business takeover disk device fail, but the logical structure of the data on the disk device may also become inconsistent, which can destroy all of the data.

[0023] Accordingly, an object of the present invention is to share the business takeover storage device and the failure analysis storage device between the active system and the new active system; to connect the business takeover storage device to the new active system when a failure occurs, thereby saving resources and speeding up the continuation of work from the active system to the new active system; and to connect the failure analysis storage device to the new active system after the active system's information for failure analysis has been stored in it, thereby making the failure analysis performed by the new active system reliable and fast. A further object is, when the active system (failed system) recovers, to reconnect the business takeover storage device to the active system only after confirming that the new active system is not using it, so that the active system resumes its work correctly.

[0024]

[Means for Solving the Problems] FIG. 1 is a basic configuration diagram of the present invention. In the figure, 1 denotes the active system (a computer system); 2, the standby system (the computer system that becomes the new active system); 3, the failure analysis storage device to which memory dumps are written; and 4, the business takeover storage device for managing business takeover data. Disk devices or the like are used as the failure analysis storage device 3 and the business takeover storage device 4.

[0025] The business takeover storage device 4 serves the work for which the active system 1 is responsible. The standby system 2 is a computer system with at least the same processing capacity as the active system 1, and mutual monitoring and periodic diagnosis are performed between the two as in the conventional system.

[0026] The basic processing performed by the active system and the standby system when a failure occurs in the active system and the active system later recovers is as follows. The standby system 2 executes the following series of steps.
(1') It confirms that a failure has occurred by means of the mutual monitoring function or the periodic diagnosis function (or determines that a failure has occurred, as described later) and sends a forced shutdown instruction to the active system 1.
(2') As the active system 1 goes down, it connects the business takeover storage device 4 to itself.
(3') After receiving notice from the active system 1 that output processing has finished, it connects the failure analysis storage device 3 to itself.
(4') In response to an inquiry from the active system 1 after recovery, it checks whether it is using the business takeover storage device 4. If it is not, it takes the business takeover storage device 4 offline and reports that to the active system 1; if it is, it reports that it is in use.

[0027] The active system 1, for its part, executes the following series of steps.
(1) It notifies the standby system 2 that a failure has occurred in itself.
(2) It performs forced shutdown processing on the instruction from the standby system 2 (step (1')); when the cause of the failure is the "software abnormality (panic)" described above, it shuts down spontaneously. At this time it outputs the data needed for failure analysis from the main storage device and so on to the failure analysis storage device 3.
(3) It notifies the standby system 2 that the output processing has finished.
(4) After recovery, it asks the standby system 2 whether the business takeover storage device 4 is in use and requests that the device be taken offline if it is not.
(5) Based on the response from the standby system 2 (step (4')), it reconnects the business takeover storage device 4 to itself if the response is "taken offline" and does not reconnect it if the response is "in use".

[0028]

[Operation] In the present invention, as described above, the business takeover storage device and the failure analysis storage device are storage devices common to the active system and the standby system. When a failure occurs in the active system, the standby system (new active system) connects the business takeover storage device to itself, and connects the failure analysis storage device to itself after confirming that the active system has finished outputting the failure analysis information to it. After recovery, the active system (failed system) connects the business takeover storage device to itself only after confirming that the new active system (standby system) has taken it offline.

[0029] Consequently, the resources used as the business takeover storage device and the failure analysis storage device are saved, and continuation of work and failure analysis are performed reliably and quickly. In addition, because the active system (failed system) resuming its work never connects a business takeover storage device that the new active system (standby system) is using, the data on the storage device is reliably protected and the active system resumes its work correctly. Although not illustrated, the present invention is of course also applicable where there are multiple standby systems.

[0030]

[Embodiment] An embodiment of the present invention will be described with reference to FIGS. 2 to 4. In the following description as well, a duplex computer system is used as an example of a multiplexed computer system.

[0031] FIG. 2 is an explanatory diagram showing an application example of a duplex computer system. In terms of hardware, it differs from FIG. 5 in that, among other things, the separate failure analysis disk devices 31 and 31' of the individual systems are removed and a failure analysis disk device 40 common to both systems is provided instead.

[0032] FIG. 3 is an explanatory diagram showing the memory-dump-related processing procedure when a failure occurs in the active system 21. The standby system 22 proceeds as follows.
(11) It confirms that a failure has occurred in the active system 21 or in the inter-system hardware (the duplex switching device 32, the inter-system communication path 33, and so on), based on a change in the input bits 41, 42, and 44 from the active system 21 at the duplex switching device 32 or on the result of periodic diagnosis over the inter-system communication path 33.
(12) It inquires of the active system 21 about the cause of the failure over the route that was not used for this confirmation: over the inter-system communication path 33 when the input bits 41, 42, or 44 changed, and through the duplex switching device 32 when the failure was confirmed by periodic diagnosis.
(13) If there is no response to this inquiry within a predetermined monitoring time, it determines that a software failure, power failure, or the like has occurred in the active system 21 and switches itself over to become the new active system.

[0033] In step (13), the standby system performs the following in order:
・It issues a forced shutdown instruction to the active system 21 via the DIOC 25'.
・It connects the business takeover disk device 35, which corresponds to the active system 21, to itself.
・It notifies users via the network 38 of the switchover to the new active system.
・It confirms that output of the memory dump is complete (that storage in the failure analysis disk device 40 is complete).
・It connects the failure analysis disk device 40 to itself.

[0034] Meanwhile, in the active system (failed system) 21, the spontaneous shutdown performed when the failure is a "software abnormality (panic)", or the forced shutdown performed on the instruction from the new active system (standby system 22), is followed by output of the memory dump. When the output is complete, the active system sends notice of completion to the new active system (standby system 22). The output bit (PANIC) 43 and the input bit (CALL) 42 of the duplex switching device 32 are used for this notice.
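The ordering constraint in [0033] and [0034], namely that the failure analysis disk device 40 is connected only after the dump completion notice arrives, can be sketched as a small event-driven class. This is an illustrative model only; the class `NewActiveSystem`, its method names, and the string labels are hypothetical.

```python
class NewActiveSystem:
    """Standby system 22 while it switches over to become the new active system."""

    def __init__(self) -> None:
        self.connected = set()  # devices currently connected to this system
        self.log = []           # actions taken, in order

    def on_failure_confirmed(self) -> None:
        """Run the first three actions of step (13)."""
        self.log.append("forced shutdown instruction sent via DIOC 25'")
        self.connected.add("business takeover disk device 35")
        self.log.append("business takeover disk device 35 connected")
        self.log.append("users notified via network 38")

    def on_dump_complete_notice(self) -> None:
        """Handle the completion notice signalled through bits 43 and 42."""
        self.connected.add("failure analysis disk device 40")
        self.log.append("failure analysis disk device 40 connected")

    def can_analyze(self) -> bool:
        """Analysis may start only once the dump disk is connected."""
        return "failure analysis disk device 40" in self.connected
```

Takeover of the work starts as soon as the failure is confirmed, while analysis waits for the dump, so the two activities do not block each other.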

[0035] FIG. 4 is an explanatory diagram showing the processing procedure when the recovered active system forcibly reserves the business takeover disk device 35. In response to an offline/release instruction for the disk device 35 sent from the recovered active system (failed system) 21 to the new active system (standby system) 22, the new active system 22 proceeds as follows.
(21) It determines whether it is using the disk device 35. If YES, it notifies the active system 21 of that; if NO, it proceeds to the next step.
(22) It places the disk device 35 in the use-prohibited state (offline) and proceeds to the next step.
(23) It issues a release to the disk device 35, determines whether the reserved state was cleared, and notifies the active system 21 of the result.

[0036] The offline/release instruction from the active system (failed system) 21 to the new active system (standby system) 22 and the response from the new active system 22 to the active system 21 are both sent via the inter-system communication path 33.

[0037] The recovered active system 21 then proceeds as follows.
(24) Based on the reply from the new active system 22, it determines whether the new active system 22 was able to release the disk device 35. If YES, it proceeds to the next step; if NO, it ends with the result that the forced reserve of the business takeover disk device 35 to itself did not succeed.
(25) It connects the disk device 35 to itself.

[0038]

[Effects of the Invention] In the present invention, as described above, the business takeover storage device and the failure analysis storage device are storage devices common to the active system and the standby system. When a failure occurs in the active system, the standby system (new active system) connects the business takeover storage device to itself, and connects the failure analysis storage device to itself after confirming that the active system has finished outputting the failure analysis information to it.

[0039] After recovery, when the active system is about to connect the business takeover storage device, which is under the control of the new active system, to itself and resume its originally assigned work, it asks the new active system whether the storage device has been taken offline, and connects the storage device to itself only when it has confirmed that it has.

[0040] Consequently, the resources used as the business takeover storage device and the failure analysis storage device can be saved, and continuation of work and failure analysis can be performed reliably and quickly. In addition, because the business takeover storage device is never accessible from both the active system (failed system) and the new active system (standby system) at the same time, the data on the storage device can be reliably protected and the active system can resume its work correctly.

[Brief Description of the Drawings]

FIG. 1 is a basic configuration diagram of the present invention.

FIG. 2 is an explanatory diagram showing an application example of a duplex computer system according to the present invention.

FIG. 3 is an explanatory diagram showing the memory-dump-related processing procedure when a failure occurs in the active system according to the present invention.

FIG. 4 is an explanatory diagram showing the processing procedure when the recovered active system forcibly reserves the business takeover disk device according to the present invention.

FIG. 5 is an explanatory diagram showing an application example of a general duplex computer system.

FIG. 6 is an explanatory diagram showing a specific example of a general duplex switching device.

FIG. 7 is an explanatory diagram showing how, in the conventional method, the recovered active system forcibly reserves the business takeover disk device.

[Explanation of Reference Numerals]

In FIG. 1: 1, active system (computer system); 2, standby system (computer system that becomes the new active system); 3, failure analysis storage device; 4, business takeover storage device.

Continuation of front page
(72) Inventor: Keisuke Kawai, c/o PFU Ltd. Yamato Factory, 4-2-49 Fukami-nishi, Yamato-shi, Kanagawa
(56) References: JP-A-2-77943 (JP, A); JP-A-2-83753 (JP, A)
(58) Fields searched (Int. Cl.7, DB name): G06F 11/16 - 11/20; G06F 11/34

Claims

(57) [Claims]

[Claim 1] A control method for a multiplexed computer system in which, when a failure occurs in an active system, another system serving as a new active system takes over the work of the active system and analyzes the failure, wherein:
a business takeover storage device that holds information for taking over the work and a failure analysis storage device that holds information for analyzing the failure are each common resources of the active system and the new active system; and
the new active system connects the business takeover storage device to itself upon confirming or determining that a failure has occurred in the active system, and connects the failure analysis storage device to itself upon notice from the active system that the active system has stored the information for failure analysis in the failure analysis storage device.

[Claim 2] A control method for a multiplexed computer system in which, when a failure occurs in an active system, another system serving as a new active system takes over the work of the active system, wherein:
a business takeover storage device that holds information for taking over the work is a common resource of the active system and the new active system; and
the active system, upon recovering from its failed state, asks the new active system whether it is using the business takeover storage device and, after confirming that the reply is "not in use", connects the business takeover storage device to itself and resumes its own work.

[Claim 3] The control method for a multiplexed computer system according to claim 2, wherein the new active system, upon confirming that it is not using the business takeover storage device, takes the device offline and then notifies the active system of the reply.