JPH0341852B2

Info

Publication number: JPH0341852B2
Application number: JP57166509A
Authority: JP
Priority date: 1982-09-27
Filing date: 1982-09-27
Publication date: 1991-06-25
Also published as: JPS5957349A

Description

Detailed Description of the Invention

[Field of application of the invention]

The present invention relates to a distributed processing system having a host processing device and a plurality of distributed processing devices such as sub-hosts. In particular, it relates to a software failure repair processing method suited to automatic operation of a distributed processing device, and to shortening failure repair time, when a program failure occurs in that device.

[Prior art]

FIG. 1 is a system configuration diagram showing an example of a distributed processing system having a host and a plurality of sub-hosts. The host 1 is connected to a plurality of sub-hosts 2 via communication lines 4. A plurality of terminal devices 3 are in turn connected to each sub-host 2 via communication lines 4.

In such a distributed processing system, when a program failure occurred in a sub-host 2 and the sub-host 2 site was unattended, there was no means of repairing the failure, and the host 1 side had no means of knowing what kind of failure had occurred in the sub-host. The sub-host 2 site therefore had to be attended, but even then the following problems remained.

(a) An operator is required to detect program failures and perform restart processing.

(b) When the cause of a program failure is investigated at the sub-host 2 site, it takes time for a specialist to reach the location of the sub-host 2.

(c) Even when the cause of a program failure is investigated on the host 1 side, one of the following problems arose.

(i) The failure information is carried by hand from the sub-host 2 to the location of the host 1, which takes transport time.

(ii) Even when the failure information is transferred over a communication line, the transfer takes a long time because the faulty part of the program has not been localized.

[Purpose of the invention]

An object of the present invention is to solve the problems of the prior art described above and to provide a software failure repair method suited to automatic operation of distributed processing devices such as sub-hosts in a distributed processing system, and to shortening failure repair time, when a program failure occurs.

[Summary of the invention]

The invention is characterized by a software failure repair method that proceeds as follows. A service processor connected to a distributed processing device detects a program failure and, on detection, automatically takes a memory dump and restarts the sub-host. The distributed processing device having the service processor then localizes the faulty part of the program and transfers the resulting failure information to the host processing device. The distributed processing device receives the patch information that the host side sends after it has determined the cause of the failure, and applies the patch to the faulty program. The service processor then restarts the system.

[Embodiments of the invention]

An embodiment of the present invention is described below with reference to FIGS. 2 to 8.

FIG. 2 is a system configuration diagram of a distributed processing system that is one embodiment of the present invention. In addition to the business file device 6, the sub-host 2 is connected to a current system residence device 7, an old system residence device 8, a memory dump program file device 9, and a memory dump file device 10.

FIG. 3 is a flowchart showing the operation of a distributed processing system that is one embodiment of the present invention.

Normal operation is described first, with reference to FIG. 2. Operation of the sub-host 2 begins when, on activation from the console service processor 5 or the host 1, the program stored in the current system residence device 7 is loaded into the sub-host 2 and starts executing.

During normal operation, for example, a message is entered at a terminal device 3 and reaches the sub-host 2 via a communication line 4. The sub-host 2 performs certain processing on it, and the message is then forwarded from the sub-host 2 to the host 1 via a communication line 4 so that the host 1 can process it. After the host 1 completes the business processing, the response message it creates is sent to the terminal device 3 via the sub-host 2. This is one example of the operation of the sub-host 2 during normal operation.

Next, the operation of one embodiment of the present invention when a program failure occurs in the sub-host 2 during normal operation is described with reference to FIG. 3. When the console service processor 5 detects a program failure during normal operation, it instructs the sub-host 2 to perform STOP and STORE STATUS (an operation that stores the relevant registers of the sub-host 2 in main memory), loads the memory dump program from the memory dump program file device 9, and stores the memory dump in the memory dump file device 10.

The console service processor 5 then checks a preset flag to determine whether the program has just been upgraded to a new version. If it has, the console service processor 5 starts up the sub-host 2 from the old system residence device 8. If it has not, the flag has already been cleared, and the console service processor 5 executes restart processing from the current system residence device 7; if that restart fails, it starts up the sub-host 2 from the old system residence device 8. The console service processor 5 performs these operations automatically in place of an operator, and it is well known that operations of this kind, which an operator can carry out, can be performed automatically.

After the restart, the service program in the sub-host 2 localizes the faulty part of the program on the basis of the contents of the memory dump file device 10 and sends the failure information to the host 1. The host 1, having received the failure information, determines the cause of the program failure and then sends patch information to the sub-host 2. On receiving the patch information, the sub-host 2 applies the patch to the current system residence device 7. When patching is complete, the service program notifies the console service processor 5, for example by outputting a console message. The console service processor 5 then continues operation if the sub-host 2 is running from the current system residence device 7, and switches operation to the current system residence device 7 if the sub-host 2 is running from the old system residence device 8.

In FIG. 3, step 31 is the part performed mainly by the console service processor 5, step 32 is the part performed mainly by software in the sub-host 2, and step 33 is the part performed by the console service processor 5 together with the service program in the sub-host 2.
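The recovery sequence of FIG. 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the `SubHost` class, its methods, the `current_boots` field, and the `just_upgraded` parameter (standing in for the preset flag) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

CURRENT = "current system residence device 7"
OLD = "old system residence device 8"


@dataclass
class SubHost:
    """Hypothetical model of sub-host 2 as driven by the console service processor."""
    current_boots: bool = True           # assumption: does a restart from device 7 succeed?
    running_from: Optional[str] = None
    log: list = field(default_factory=list)

    def stop_and_store_status(self) -> None:
        self.log.append("STOP / STORE STATUS")

    def take_memory_dump(self) -> None:
        self.log.append("memory dump -> memory dump file device 10")

    def boot(self, residence: str) -> bool:
        # Assumption: the old system always boots; the current one may not.
        ok = residence == OLD or self.current_boots
        self.log.append(f"boot from {residence}: {'ok' if ok else 'failed'}")
        if ok:
            self.running_from = residence
        return ok


def recover(sub_host: SubHost, just_upgraded: bool) -> Optional[str]:
    """Steps taken after a program failure is detected."""
    sub_host.stop_and_store_status()
    sub_host.take_memory_dump()
    if just_upgraded:
        # Right after a version upgrade, fall back to the old system.
        sub_host.boot(OLD)
    elif not sub_host.boot(CURRENT):
        # Restart from the current system failed, so use the old system.
        sub_host.boot(OLD)
    return sub_host.running_from


def after_patch(sub_host: SubHost) -> Optional[str]:
    """Once the patch has been applied to device 7, return to the current system."""
    sub_host.current_boots = True        # assumption: the patch repaired the program
    if sub_host.running_from != CURRENT:
        sub_host.boot(CURRENT)
    return sub_host.running_from
```

For example, `recover(SubHost(), just_upgraded=True)` leaves the sub-host running from the old system, and a later `after_patch` call switches it back to the current system.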

Next, the method of determining a program failure is described. FIG. 4 shows a flowchart of program failure determination according to one embodiment of the present invention. This determination is performed mainly by the console service processor 5.

A program failure manifests itself in one of the following four ways.

(1) Program ABEND: The failure is detected by software in the sub-host 2, and an ABEND code is output to the console service processor 5. ABEND codes are normally output as console messages, so it is easy for the console service processor 5 to monitor them.

(2) System WAIT: The system enters the WAIT state, and the WAIT code is stored in a register such as the PSW (Program Status Word). When a system WAIT occurs, the WAIT code is normally output to the console.

(3) Program loop: The program itself enters a loop.

(4) Incorrect result

Of the four cases above, (4) incorrect result usually requires human judgment and is outside the scope of the present invention. For ABEND and WAIT, the console service processor 5 checks whether the code matches a registered ABEND code or WAIT code, respectively, and decides accordingly. A loop is detected as follows: the console service processor 5 checks whether diagnostic instructions are being issued at fixed intervals, and if they are not, it raises an interrupt to the software in the sub-host 2 and detects the loop state from whether the response to the interrupt is normal or abnormal.
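The three automatically detectable cases can be sketched as a small classifier. This is a minimal sketch under assumptions: the registered codes, the interval value, and every function and parameter name below are hypothetical and are not taken from the patent.

```python
from typing import Optional

# Hypothetical registered codes; a real system would register its own.
REGISTERED_ABEND_CODES = {"ABEND-001", "ABEND-002"}
REGISTERED_WAIT_CODES = {"WAIT-0F"}


def classify_failure(console_code: Optional[str] = None,
                     wait_code: Optional[str] = None,
                     seconds_since_diagnostic: float = 0.0,
                     diagnostic_interval: float = 5.0,
                     interrupt_answered_normally: bool = True) -> Optional[str]:
    """Return 'ABEND', 'WAIT', 'LOOP', or None when no failure is detected."""
    if console_code in REGISTERED_ABEND_CODES:
        return "ABEND"
    if wait_code in REGISTERED_WAIT_CODES:
        return "WAIT"
    if seconds_since_diagnostic > diagnostic_interval:
        # The diagnostic instruction is overdue: probe the sub-host software
        # with an interrupt and judge by whether the response is normal.
        if not interrupt_answered_normally:
            return "LOOP"
    return None
```

A late diagnostic instruction alone is not treated as a loop here; the loop is reported only when the follow-up interrupt also fails to get a normal response, which mirrors the two-stage check described above.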

When the console service processor 5 detects a program failure, it stores the type of program failure described above, together with the failure code, at a predetermined location in main memory by simulating console key-in.

FIG. 5 shows a flowchart of program fault localization according to one embodiment of the present invention. The information used for localization is the memory dump that was taken after the program failure was detected and stored in the memory dump file device 10, together with the current system residence device 7.

The failure location is determined from the system trace information in the memory dump. If the failed program is part of the operating system, the operating-system-related tables are searched to check for broken chains and similar damage. In addition, for some programs the failure location can be localized by comparing the range around it in the memory dump against the current system residence.

When fault localization is not possible on the sub-host side alone, that is, when the checks above find no abnormality, information transfer time can still be shortened by sending only the range of the memory dump that the host specifies.
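Sending only a host-specified range of the dump can be pictured as a simple slice. The function name `extract_range` and the representation of the request as a start offset and a length are assumptions made for illustration; the patent does not specify the request format.

```python
def extract_range(dump: bytes, start: int, length: int) -> bytes:
    """Return only the part of the memory dump that the host asked for."""
    if start < 0 or length < 0 or start + length > len(dump):
        raise ValueError("requested range lies outside the memory dump")
    return dump[start:start + length]


# Illustrative use: a 1 KiB dump from which the host requests 16 bytes.
full_dump = bytes(range(256)) * 4
partial = extract_range(full_dump, 0x100, 16)
```

Only `len(partial)` bytes need to cross the communication line instead of `len(full_dump)`, which is the source of the shorter transfer time.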

The elements used in the determination above are explained here with reference to FIGS. 6 to 8.

FIG. 6 shows the system trace information. At each task dispatch, the operating system stores the address of the relevant task, interrupt information, and so on in the system trace area 11. The system trace area 11 has a fixed size and is managed in round-robin fashion so that the most recent entries are retained, and it can be extracted from a memory dump. Examining the system trace information therefore reveals the address or the state at the time of the failure.
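A fixed-size, round-robin trace area of this kind behaves like a ring buffer. The sketch below is illustrative only: the `SystemTrace` class and the two-field entry layout are assumptions, and it uses Python's `collections.deque` with `maxlen` rather than any structure described in the patent.

```python
from collections import deque


class SystemTrace:
    """Fixed-size trace area that keeps only the most recent entries."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen discards the oldest entry when it is full.
        self._entries = deque(maxlen=capacity)

    def record(self, task_address: int, interrupt_info: str) -> None:
        """Called at each task dispatch."""
        self._entries.append((task_address, interrupt_info))

    def snapshot(self) -> list:
        """Entries as they would appear in a memory dump, oldest first."""
        return list(self._entries)

    def latest(self):
        """Most recent entry, i.e. the state closest to the failure."""
        return self._entries[-1] if self._entries else None


trace = SystemTrace(capacity=3)
for n in range(5):
    trace.record(0x1000 + n, f"interrupt-{n}")
```

After five dispatches with a capacity of three, only the last three entries remain, and `trace.latest()` gives the entry nearest the point of failure.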

FIG. 7 shows the operating system tables 12 to 18 linked together by chains. Searching along these links finds a broken chain and identifies which part is faulty.
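The chain search can be sketched as a walk along forward pointers. This is a simplified model under assumptions: the tables are represented as a dictionary from a table number to the number of the next table (`None` at the end), a single linear chain is assumed, and `find_chain_break` is a hypothetical name. FIG. 7 may show a richer linkage than this.

```python
from typing import Dict, Optional


def find_chain_break(tables: Dict[int, Optional[int]],
                     head: int,
                     expected_count: int) -> Optional[int]:
    """Walk the chain from `head`.

    Returns the number of the table whose forward pointer is bad,
    or None if the chain is intact.
    """
    seen = []
    current: Optional[int] = head
    while current is not None:
        if current in seen:
            return seen[-1]          # pointer loops back: chain is corrupt
        seen.append(current)
        nxt = tables[current]
        if nxt is not None and nxt not in tables:
            return current           # pointer leads outside the known tables
        current = nxt
    if len(seen) < expected_count:
        return seen[-1]              # chain ended before all tables were reached
    return None


# Tables 12 to 18 chained in order, as a healthy example.
healthy = {12: 13, 13: 14, 14: 15, 15: 16, 16: 17, 17: 18, 18: None}
```

With the healthy chain the function returns `None`; if, say, the pointer in table 15 is cleared, it returns 15 as the faulty part.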

FIG. 8 shows an example of a program in executable form: (a) is the correct program and (b) is the erroneous program. Comparing the two reveals an error at address 0005C4.
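The comparison of the dumped program against the copy in the system residence amounts to a byte-by-byte difference. The sketch below is a hedged illustration: `mismatched_addresses` is a hypothetical helper, and the 16-byte images and the base address `0x5C0` are made-up values chosen so that the differing byte falls at 0005C4, matching the address quoted for FIG. 8.

```python
from typing import List


def mismatched_addresses(dump: bytes, residence: bytes,
                         base_address: int = 0) -> List[int]:
    """Addresses at which the dumped image differs from the residence copy."""
    return [base_address + offset
            for offset, (d, r) in enumerate(zip(dump, residence))
            if d != r]


# Made-up 16-byte images that differ in a single byte.
correct_image = bytes(range(16))
dumped_image = bytearray(correct_image)
dumped_image[4] ^= 0xFF              # corrupt one byte

bad = mismatched_addresses(bytes(dumped_image), correct_image, base_address=0x5C0)
```

Here `bad` contains the single address `0x5C4`. Note that `zip` stops at the shorter input, so a real comparison would also have to check that both images have the same length.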

Program fault localization itself is a known technique. For example, in a multiprogramming environment, when a problem program terminates abnormally (a program ABEND), it is common practice to dump only the main storage area occupied by that program. The issues in program fault localization are a matter of degree, namely how far to narrow the localized range, and how to handle cases in which the failure of a problem program is related to the operating system. How far to carry program fault localization using the means described above is a design matter left to the designer of the service program in the sub-host 2.

[Effect of the invention]

The present invention provides the following effects.

(a) Program failures in a distributed processing device can be detected automatically.

(b) A memory dump is taken automatically when a program failure occurs, so the cause of the program failure can be investigated.

(c) The system can be restarted automatically even when a program failure occurs.

(d) Because the distributed processing device can localize the faulty part, the time needed to transfer failure information is shortened.

(e) Specialists need not be dispersed across sites and can instead be concentrated at the host processing device, which makes investigation of failure causes more efficient.

As a result, the operation of distributed processing devices such as sub-hosts can be automated and failure repair time can be shortened.

[Brief description of the drawings]

FIG. 1 is a system configuration diagram showing an example of a distributed processing system having a host and a plurality of sub-hosts. FIG. 2 is a system configuration diagram of a distributed processing system that is one embodiment of the present invention. FIG. 3 is a flowchart showing the operation of one embodiment of the present invention. FIG. 4 is a flowchart of program failure determination in one embodiment of the present invention. FIG. 5 is a flowchart of localization of the faulty part of a program in one embodiment of the present invention. FIG. 6 shows the system trace information stored in the system trace area. FIG. 7 shows the OS tables chained together. FIG. 8 shows an example of a program in executable form.

1: host; 2: sub-host; 3: terminal device; 4: communication line; 5: console service processor; 7: current system residence device; 8: old system residence device; 9: memory dump program file device; 10: memory dump file device; 11: system trace area; 12 to 18: OS tables.

Claims

[Claims]

1. A software failure repair method for a distributed processing system having a host processing device and a distributed processing device connected to the host processing device, comprising: a service processor that detects a program failure in a program running on the distributed processing device and, in the event of a program failure, initiates collection of a memory dump and restarts the system after the memory dump; and the distributed processing device, on which is executed a service program that, after the system restart, performs program fault localization on the basis of the memory dump, sends failure information including the localized information to the host processing device, and, on receiving patch information from the host processing device, applies the patch to the failed program; wherein the service processor restarts the system after the patch has been applied.