JPH02207347A - Method for detecting fault of software

Info

Publication number: JPH02207347A
Application number: JP1027686A
Authority: JP
Inventors: Takao Dobashi; 土橋　孝男
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1989-02-08
Filing date: 1989-02-08
Publication date: 1990-08-17

Abstract

PURPOSE: To enable analysis of the cause of a fault by checking the execution time and the processing completion state of each program at regular intervals and preserving the contents of main memory at the moment an abnormality is found, so that data from the time of the fault is captured. CONSTITUTION: In computer 1, which serves as the active machine, a software fault detection processing part 4 is started periodically by a signal from a timer processing part 3. It checks the processing status of a main processing part 10 (the execution time and the processing completion state of each program) and detects the occurrence of a software fault. When a software fault occurs, a main memory preservation processing part 5 is operated and the contents of main memory are preserved. Software faults other than stall alarms and illegal instructions can thus be detected and the data at the time of the fault collected, so that the cause of the fault can be analyzed and the system recovered at an early stage.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Object of the Invention]

(Field of Industrial Application) The present invention relates to a software failure detection method used in computer systems and the like.

(Prior Art) In recent years, computer systems have been required to be ever more reliable, and hardware is duplicated in critical systems. Various measures are also taken to improve software reliability, but should a failure nevertheless occur, it is important to detect it immediately and take recovery action.

Various methods are used to detect the occurrence of such failures.

For example, a fixed-period timer is installed in the computer, and the computer's OS (operating system) manages the execution of each program using the time obtained from this timer. If control does not return to the OS for a certain period of time or longer, an alarm device is activated and an alarm (stall alarm) is output.
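The stall-alarm scheme described above can be pictured with a minimal Python sketch. The class name `StallWatchdog`, its method names, and the use of seconds as the time unit are illustrative assumptions and do not come from the patent.

```python
class StallWatchdog:
    """Signals a stall alarm when control has not returned to the OS in time."""

    def __init__(self, limit: float) -> None:
        self.limit = limit        # longest tolerated gap (seconds, an assumed unit)
        self.last_return = 0.0    # time at which control last came back to the OS

    def control_returned(self, now: float) -> None:
        """Called whenever a program hands control back to the OS."""
        self.last_return = now

    def stalled(self, now: float) -> bool:
        """Called on each timer tick; True means the alarm should be raised."""
        return now - self.last_return >= self.limit
```

For instance, with a limit of 5 and control last returned at time 10, `stalled(12)` is false and `stalled(15)` is true.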

Another method activates an alarm device when an unexecutable instruction (illegal instruction) is detected, for example as a result of a runaway program.

(Problems to be Solved by the Invention) However, the conventional failure detection methods described above have the following problems.

Namely, these methods cannot detect software failures other than stall alarms and illegal instructions, for example a loop that occurs while interrupts are enabled, a prolonged wait state, a deadlock, or a function that is never executed.

Consequently, when such a software failure occurs, it is handled by maintenance personnel only once they notice the abnormality. Even when the hardware is duplicated, switchover has therefore sometimes been delayed and incorrect processing has been performed.

Furthermore, because such a software failure cannot be detected at the moment it occurs, data from the time of the failure cannot be collected, and the cause of the failure often cannot be analyzed.

Moreover, the bugs that cause such software failures often manifest only under extremely rare timing. By the time the computer is restarted after the failure is discovered, the data produced by the bug has already been overwritten, and the cause of the failure has often remained unknown.

In view of the above circumstances, an object of the present invention is to provide a software failure detection method that can detect software failures other than stall alarms and illegal instructions, for example a loop that occurs while interrupts are enabled, a prolonged wait state, a deadlock, or a function that is never executed, and that thereby makes it possible to collect data at the time of the failure, analyze its cause, and recover from the failure at an early stage.

[Structure of the Invention]

(Means for Solving the Problems) To achieve the above object, the software failure detection method according to the present invention, for use in a computer system, is characterized in that the execution time and the processing completion state of each program are checked at regular intervals, an abnormality is determined when a program has not finished within a predetermined time or when not all of its processing has been completed at the end of execution, and at that point the contents of main memory needed to analyze the cause of the failure are saved.

(Function) In the above configuration, the execution time and the processing completion state of each program are checked at regular intervals, an abnormality is determined when a program has not finished within a predetermined time or when not all of its processing has been completed at the end of execution, and at that point the contents of main memory needed to analyze the cause of the failure are saved. Software failures other than stall alarms and illegal instructions, for example a loop that occurs while interrupts are enabled, a prolonged wait state, a deadlock, or a function that is never executed, can therefore be detected. This makes it possible to collect data at the time of the failure, analyze its cause, and recover from the failure at an early stage.
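The abnormality criterion stated above can be written as a small predicate. The following Python sketch is only an illustration: `ProgramStatus`, its field names, and `is_abnormal` are hypothetical names, and the patent does not say how a time exactly equal to the limit should be treated (here it is accepted).

```python
from dataclasses import dataclass


@dataclass
class ProgramStatus:
    elapsed: float       # time spent so far, or total time once finished
    max_time: float      # predetermined time limit for this program
    running: bool        # True while the program is still executing
    modules_done: int    # processing steps that reported completion
    modules_total: int   # processing steps the program consists of


def is_abnormal(p: ProgramStatus) -> bool:
    """True if the program overran its limit or finished with work missing."""
    if p.elapsed > p.max_time:
        return True      # has not finished within the predetermined time
    if not p.running and p.modules_done != p.modules_total:
        return True      # execution ended without all processing completed
    return False
```

A program that is still running is judged on time alone; the completeness check applies only once execution has ended.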

(Embodiment) FIG. 1 is a block diagram showing an example of a computer system to which an embodiment of the software failure detection method according to the present invention is applied.

The computer system shown in this figure has two computers, 1 and 2. One of them, for example computer 1, is used as the active machine, and the other is used as the standby machine.

Computer 1, which serves as the active machine, comprises: a main processing unit 10 that performs online processing; a timer processing unit 3 that keeps time; a software failure detection processing unit 4 that is started periodically by a signal from the timer processing unit 3, checks the processing status of the main processing unit 10, and detects the occurrence of a software failure; a main memory storage processing unit 5 that saves the contents of main memory when the software failure detection processing unit 4 detects a software failure; and a failure signal output device 6 that generates a failure signal when the software failure detection processing unit 4 detects a software failure.

The main processing unit 10 performs online processing based on a plurality of programs and, while doing so, records the processing status of each program in a processing status description table 20 provided inside it.

As shown in FIG. 2, the processing status description table 20 has a plurality of program execution status description areas 21a to 21n, each of which records the processing status of one program. Each of these areas 21a to 21n contains a program-executing flag 18, which indicates whether the corresponding program is currently running, and a plurality of processing-executed flags 19a to 19m, which are set one after another each time a module of the corresponding program has been processed.
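One way to picture the processing status description table is as a list of small records, one per program. In the Python sketch below, the class and function names are assumptions chosen for illustration; only the reference numerals in the comments come from the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionStatusArea:                 # one of the areas 21a..21n
    executing: bool = False                # program-executing flag 18
    done_flags: List[bool] = field(default_factory=list)  # flags 19a..19m

    def mark_next_module_done(self) -> None:
        """Set the first flag that is still clear, i.e. set flags in order.

        Raises ValueError if every flag is already set.
        """
        self.done_flags[self.done_flags.index(False)] = True


def make_status_table(n_programs: int, m_flags: int) -> List[ExecutionStatusArea]:
    """Build table 20 with n areas of m cleared flags each."""
    return [ExecutionStatusArea(done_flags=[False] * m_flags)
            for _ in range(n_programs)]
```

Counting the `True` entries in `done_flags` then gives the number of modules that have reported completion.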

The software failure detection processing unit 4 has an information table 13 that holds the information needed to determine whether each program is being processed normally.

As shown in FIG. 3, the information table 13 has a plurality of program execution information descriptors 12a to 12n, each of which records the execution information of one program. Each of these descriptors 12a to 12n contains an execution start time description area 14, which records the time at which the corresponding program started executing; an execution time description area 15, which records the execution time of the corresponding program; a maximum execution time description area 16, which records the maximum time needed to execute the corresponding program; and a module count description area 17, which records the number of modules that make up the corresponding program.
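The information table can be sketched in the same spirit. The Python names below are hypothetical, and representing "not yet recorded" as `None` is an assumption of this sketch; the patent only names the four areas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExecutionInfoDescriptor:            # one of the descriptors 12a..12n
    max_exec_time: float                  # area 16: maximum execution time
    module_count: int                     # area 17: number of modules
    start_time: Optional[float] = None    # area 14: execution start time
    exec_time: Optional[float] = None     # area 15: measured execution time


def make_info_table(limits: List[float],
                    module_counts: List[int]) -> List[ExecutionInfoDescriptor]:
    """Build table 13 from per-program limits and module counts."""
    return [ExecutionInfoDescriptor(limit, count)
            for limit, count in zip(limits, module_counts)]
```

Areas 16 and 17 are fixed per program, while areas 14 and 15 change on every run, which is why only the latter two are optional here.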

The software failure detection processing unit 4 is started by the timer processing unit 3 at predetermined intervals and checks the contents of the information table 13 and of the processing status description table 20 to determine whether a software failure has occurred.

When a software failure has occurred, it operates the main memory storage processing unit 5 to save the contents of main memory, and activates the failure signal output device 6 to generate a failure signal and output it to computer 2.

Computer 2 comprises a main processing unit 11 configured in the same way as the main processing unit 10 of computer 1, a failure signal input device 7 that receives the failure signal when it is output from computer 1, and a restart processing unit 8 that starts the main processing unit 11 when the failure signal input device 7 has taken in a failure signal.

When a failure signal is output from computer 1, computer 2 detects it, starts the main processing unit 11, and begins online processing in place of computer 1.
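The handoff between the two computers can be sketched as follows. This Python illustration models the failure signal line as a `queue.Queue`, which is purely an assumption for demonstration; the function names and the `"FAULT"` token are likewise hypothetical.

```python
import queue
from typing import Callable


def report_fault(line: "queue.Queue[str]",
                 save_main_memory: Callable[[], None]) -> None:
    """Computer 1 side: preserve the evidence, then raise the failure signal."""
    save_main_memory()          # main memory storage processing unit 5
    line.put("FAULT")           # failure signal output device 6


def standby_poll(line: "queue.Queue[str]",
                 start_main_processing: Callable[[], None]) -> bool:
    """Computer 2 side: take over if a failure signal has arrived."""
    try:
        line.get_nowait()       # failure signal input device 7
    except queue.Empty:
        return False            # nothing received; remain on standby
    start_main_processing()     # restart processing unit 8 starts unit 11
    return True
```

Saving main memory before raising the signal follows the order in which the text lists the two actions.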

Next, the operation of this embodiment will be described with reference to FIGS. 4 to 6.

First, the main processing unit 10 of computer 1 executes the processing of each program under the control of the OS.

In this processing, the main processing unit 10 first writes the maximum execution time preset for each program into the maximum execution time description area 16 of the corresponding program execution information descriptor 12a to 12n of the information table 13.

The main processing unit 10 then starts processing each program.

In processing each program, as shown in FIG. 4, the main processing unit 10 first sets the program-executing flag 18 in the program execution status description area corresponding to the program about to be executed (for the first program, area 21a) (step ST1), and clears the processing-executed flags 19a to 19m of the program execution descriptor 12a (step ST2).

Next, the main processing unit 10 writes the current time into the execution start time description area 14 of the program execution information descriptor 12a corresponding to the first program (step ST3). It then processes the modules that make up the first program in sequence and, each time a module completes, sets one more of the processing-executed flags 19a to 19m of the program execution descriptor 12a, starting from the end (steps ST4 to ST7).

When processing of all modules is complete, the main processing unit 10 compares the execution start time recorded in the execution start time description area 14 of the program execution descriptor 12a with the current time to obtain the time taken to process the first program (the execution time), and writes it into the execution time description area 15 of the program execution descriptor 12a (step ST8).

The main processing unit 10 then resets the program-executing flag 18 in the program execution status description area 21a (step ST9).
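Steps ST1 to ST9 amount to wrapping each program run in bookkeeping. The Python sketch below is a simplified illustration: `Status`, `Info`, and `run_program` are hypothetical names, the injectable `clock` is added only to make the example testable, and sizing the flag list to the number of modules is a simplification of the fixed flags 19a to 19m.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Status:                             # stands in for area 21a
    executing: bool = False               # flag 18
    done: List[bool] = field(default_factory=list)   # flags 19a..19m


@dataclass
class Info:                               # stands in for descriptor 12a
    start: Optional[float] = None         # area 14
    exec_time: Optional[float] = None     # area 15


def run_program(status: Status, info: Info,
                modules: List[Callable[[], None]],
                clock: Callable[[], float] = time.monotonic) -> None:
    status.executing = True                   # ST1: mark program as running
    status.done = [False] * len(modules)      # ST2: clear completion flags
    info.start = clock()                      # ST3: record start time
    for i, module in enumerate(modules):      # ST4 to ST7: run each module
        module()
        status.done[i] = True                 # one flag per finished module
    info.exec_time = clock() - info.start     # ST8: record execution time
    status.executing = False                  # ST9: mark program as finished
```

If a module never returns, the flag stays set and no execution time is written, which is exactly the state the periodic checker is meant to notice.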

In the same manner, under the control of the OS, the main processing unit 10 cyclically executes the remaining programs and the first program described above, as shown in FIG. 5. As it does so, it updates, for each program, the contents of the execution start time description area 14 and the execution time description area 15 of the corresponding program execution information descriptor 12a to 12n, and the contents of the program-executing flag 18 and the processing-executed flags 19a to 19m of the corresponding program execution status description area 21a to 21n.

In parallel with this operation, the software failure detection processing unit 4 of computer 1 is started by the timer processing unit 3 at predetermined intervals and executes the processing described below.

In this processing, as shown in FIG. 6, the software failure detection processing unit 4 first initializes a variable i that indicates the number of the program to be checked (step ST10).

Next, the software failure detection processing unit 4 checks whether the program-executing flag 18 is set in the program execution status description area corresponding to the program designated by the variable i (step ST11). Because the variable i has just been initialized, this is the area 21a corresponding to the first program.

If the program-executing flag 18 is set, that is, if the first program is currently running and its execution time has not yet been obtained, the software failure detection processing unit 4, acting in place of the main processing unit 10, compares the execution start time recorded in the execution start time description area 14 of the program execution information descriptor 12a with the current time to obtain the time elapsed since execution of the first program began (step ST12).

If the program-executing flag 18 has been reset, the software failure detection processing unit 4 judges that processing of the first program is complete and reads the execution time recorded in the execution time description area 15 of the program execution information descriptor 12a.

The software failure detection processing unit 4 then checks whether an execution time value has been calculated for the first program (step ST13). If it has, the unit reads the maximum execution time recorded in the maximum execution time description area 16 of the program execution information descriptor 12a and compares it with the execution time value (step ST14).

If the execution time value is greater than the maximum execution time, the software failure detection processing unit 4 judges that a software failure, such as a loop with interrupts enabled, a prolonged wait state, or a deadlock, occurred while the first program was being processed. It operates the main memory storage processing unit 5 to save the contents of main memory (step ST25), and activates the failure signal output device 6 to generate a failure signal and start the operation of computer 2 (step ST26).

If the execution time value is smaller than the maximum execution time, the software failure detection processing unit 4 erases the contents of the execution time description area 15 of the program execution information descriptor 12a (step ST15), and then checks whether the program-executing flag 18 in the program execution status description area 21a has been reset, that is, whether execution has finished (step ST16).

If the program-executing flag 18 has been reset, the software failure detection processing unit 4 reads the state of the processing-executed flags 19a to 19m in the program execution status description area 21a and checks whether the number of flags that are set matches the number of modules recorded in the module count description area 17 of the program execution descriptor 12a (step ST17).

If they do not match, the software failure detection processing unit 4 judges that a software failure occurred in the processing of the first program, such as one of the modules being skipped, and performs the main memory saving and the activation of computer 2 described above (steps ST25 and ST26).

If they match, the software failure detection processing unit 4 judges that all modules of the first program were processed normally and clears the processing-executed flags 19a to 19m in the program execution status description area 21a (step ST18).

In the operation described above, when the execution time has not been calculated (step ST13), both the execution time check and the module processing status check are skipped. When the program-executing flag 18 is set (the program is still running), the module processing status check is skipped.

Next, the software failure detection processing unit 4 increments the variable i (step ST19) and checks whether the value of i is less than or equal to the number of programs to be processed (step ST20).

When the value of the variable i is smaller than the total number of programs, the software failure detection processing unit 4 judges that not all programs have been checked yet and performs the check described above on the remaining programs.

When the above processing has been completed for all programs to be processed, the software failure detection processing unit 4 suspends checking until it is next started by the timer processing unit 3.
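Taken together, steps ST10 to ST20 and ST25 to ST26 form one timer-triggered pass over all programs. The Python sketch below is one possible reading of that flow, not the patented implementation: `Program`, `check_pass`, and the `on_fault` callback are hypothetical, an erased area 15 is modeled as `None`, a time exactly equal to the maximum is treated as acceptable, and ending the pass at the first fault is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Program:
    executing: bool               # flag 18
    done: List[bool]              # flags 19a..19m
    start: float                  # area 14
    exec_time: Optional[float]    # area 15 (None = not calculated or erased)
    max_time: float               # area 16
    module_count: int             # area 17


def check_pass(programs: List[Program], now: float,
               on_fault: Callable[[int], None]) -> bool:
    """Run one check pass; return True if every program looked healthy."""
    for i, p in enumerate(programs):                  # ST10, ST19, ST20
        # ST11/ST12: elapsed time if running, stored time if finished
        t = now - p.start if p.executing else p.exec_time
        if t is None:                                 # ST13: nothing to judge
            continue
        if t > p.max_time:                            # ST14: time limit exceeded
            on_fault(i)                               # ST25, ST26
            return False
        p.exec_time = None                            # ST15: erase area 15
        if p.executing:                               # ST16: still running
            continue
        if sum(p.done) != p.module_count:             # ST17: module was skipped
            on_fault(i)                               # ST25, ST26
            return False
        p.done = [False] * len(p.done)                # ST18: clear the flags
    return True
```

In a real system `on_fault` would trigger the main memory save and the failure signal; here it simply receives the index of the offending program.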

As described above, in this embodiment the execution time is checked for each program, so that a loop with interrupts enabled such as the process T21 in FIG. 5, a prolonged wait state, a deadlock, and the like can be detected. When such a software failure occurs, computer 2 can be started immediately and data from the time of the failure can be collected.

In addition, because the embodiment checks whether every module of each program has actually been processed, software failures such as a function that is never executed can also be detected. When such a failure occurs, data from the time of the failure can likewise be collected, the cause analyzed, and the failure recovered from at an early stage.

Although the embodiment above describes the present invention using a computer system with two computers 1 and 2, the invention can of course also be applied to a system with only one computer 1a, as shown for example in FIG. 7.

In that case, however, after the software failure detection processing unit 4 detects a software failure and the main memory storage unit 5 saves the contents of main memory, the restart processing unit 8 restarts the main processing unit 10.

[Effect of the Invention]

As described above, according to the present invention, software failures other than stall alarms and illegal instructions, for example a loop that occurs while interrupts are enabled, a prolonged wait state, a deadlock, or a function that is never executed, can be detected. This makes it possible to collect data at the time of the failure, analyze its cause, and recover from the failure at an early stage.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an example of a computer system to which an embodiment of the software failure detection method according to the present invention is applied; FIG. 2 is a schematic diagram showing an example of the processing status description table shown in FIG. 1; FIG. 3 is a schematic diagram showing an example of the information table shown in FIG. 1; FIG. 4 is a flowchart and operational schematic showing an example of the operation of the main processing unit shown in FIG. 1; FIG. 5 is a timing diagram of program processing in the same embodiment; FIG. 6 is a flowchart showing an example of the operation of the software failure detection processing unit shown in FIG. 1; and FIG. 7 is a block diagram showing another example of a computer system to which an embodiment of the software failure detection method according to the present invention is applied.

1: computer; 3: timer processing unit; 4: software failure detection processing unit; 5: main memory storage unit; 10: main processing unit; 13: information table; 20: processing status description table.

Claims

[Claims]

(1) A software failure detection method for use in a computer system, characterized in that the execution time and the processing completion state of each program are checked at regular intervals, an abnormality is determined when a program has not finished within a predetermined time or when not all of its processing has been completed at the end of execution, and at that point the contents of main memory needed to analyze the cause of the failure are saved.