JPH117431A

JPH117431A - Failure recovery system for job executed by plural computers

Info

Publication number: JPH117431A
Application number: JP9158304A
Authority: JP
Inventors: Ikuko Honma; 郁子本間; Hideto Kurose; 秀人黒瀬; Kazuko Narita; 和子成田; Tomoyuki Iwata; 智之岩田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-06-16
Filing date: 1997-06-16
Publication date: 1999-01-12

Abstract

PROBLEM TO BE SOLVED: To shorten the completion time of a job when a failure occurs, by the execution time of the job steps that have already completed normally, by notifying the time required for failure recovery and making the job steps that run after the failed step wait. SOLUTION: When a failure occurs in computer C, the monitoring server 1-1 detects it because job step 4 does not finish even after its maximum execution time has elapsed, or because the job step ends abnormally. The server 1-1 searches the job step management DB 4 for jobs 2 and 3, which are affected by the failure of computer C. The failure recovery time of computer C is notified to the monitoring server by entering the recovery time. The server 1-1 adds the delay time to the predicted execution time of job step 5, which runs after computer C in the affected job 2, registers it in the job step management DB 5, and makes the step wait. It also adds the delay time to the execution start time of job 3, which runs later, registers it, and rebuilds the execution schedule.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a recovery system for jobs executed across a plurality of computers when a failure occurs in one of the computers, and in particular to a method of searching for affected jobs, a method of early failure recovery, and a method of scheduling other jobs after the failure occurs.

[0002]

[Prior art] In a job that consists of a plurality of job steps executed on a plurality of computers, when a failure occurs in one computer, the job step waiting to run on the computer scheduled to execute the next job step of that job times out and ends abnormally without ever starting, so the job as a whole ends abnormally. To recover the job, the abnormally ended job had to be re-executed from the beginning after failure recovery processing on the failed computer.

[0003]

[Problem to be solved by the invention] With the prior art, when a failure occurs, the job in which the abnormality occurred can be identified, but other jobs affected by the failed computer cannot easily be identified.

[0004] In addition, only the abnormal end of the job step is known; the time required to recover the job is not, so the job as a whole ends abnormally. Recovering an abnormally ended job requires re-executing it from the beginning, which delays the job's completion by the execution time of the job steps that had already run. Nor is any consideration given to the case where the delay of this job leads to delays in other jobs.

[0005] The objects of the present invention are to make it easy to identify other jobs affected by a failed computer, to shorten a job's completion time when a failure occurs by re-executing it with top priority from the job step of the failed computer, and to rebuild the execution schedule of jobs that run after the failed job.

[0006]

[Means for solving the problem] Job steps are managed on the basis of a storage means, hereinafter called the job step management DB, in which the execution order, executing computer, execution log, and the like of each job's job steps are registered. This makes it easy to identify affected jobs and their executing computers. The time required for failure recovery is notified, and the job steps to be executed after the failed step are made to wait, so that the job ends normally. In addition, the execution schedule of the jobs that follow the failed job is rebuilt on the basis of the notified delay time.

[0007]

[Embodiments of the invention] FIG. 1 shows the flow of jobs 1, 2, and 3, each consisting of a plurality of job steps executed on a plurality of computers. The monitoring server manages the job steps and jobs of all computers on the basis of the job step management DB. The jobs are registered in the management DB so that job 1 is executed first, then job 2, and finally job 3. The execution flow of each job is as follows. Job 1 executes job step 1 on computer A and then job step 2 on computer B. Job 2 executes job step 3 on computer A and then job step 4 on computer C. Computer B waits for job step 3 and job step 4 to finish and then executes job step 5. Job 3 executes job step 6 on computer C and then job step 7 on computer B.
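The flow described for FIG. 1 can be written down as rows of a small table. The following is only an illustrative sketch: the `JobStep` record, its field names, and the helper `steps_on` are hypothetical and do not appear in the publication, which specifies only that execution order and executing computer are registered in the job step management DB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobStep:
    job: int          # job number (1, 2 or 3)
    step: int         # job step number (1 to 7)
    computer: str     # computer that executes the step
    waits_for: tuple  # step numbers that must finish before this one starts


# Rows transcribed from the execution flow described for FIG. 1.
JOB_STEPS = [
    JobStep(1, 1, "A", ()),
    JobStep(1, 2, "B", (1,)),
    JobStep(2, 3, "A", ()),
    JobStep(2, 4, "C", (3,)),
    JobStep(2, 5, "B", (3, 4)),
    JobStep(3, 6, "C", ()),
    JobStep(3, 7, "B", (6,)),
]


def steps_on(computer):
    """Return the step numbers assigned to the given computer."""
    return [s.step for s in JOB_STEPS if s.computer == computer]
```

With this layout, `steps_on("C")` lists the job steps that would be touched by a failure of computer C.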

[0008] FIG. 2 is the failure recovery processing flow of the monitoring server 1-1 when a failure occurs in computer C during execution of job 2.

[0009] FIG. 3 shows the elements of the job step management DB.

[0010] FIG. 4 shows the job step management DB when a failure occurs during execution of job 2.

[0011] FIG. 5 shows the job step management DB after jobs 2 and 3 have been rescheduled from the state of FIG. 4.

[0012] The description below refers to these five figures.

[0013] The execution order (3-1, 3-2) of the job steps of each job and their maximum execution time (3-3) are registered in advance in the job step management DB 3.

[0014] When execution of a job starts, the predicted execution time 3-4, derived from the job's start time and the maximum execution time of each job step, is registered in the job step management DB 3 so that no step waits indefinitely. (Step 1) The monitoring server 1-1 monitors the completion of the job steps on each computer.
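The publication does not give a formula for the predicted execution time, only that it is derived from the job start time and the maximum execution time of each job step. One plausible reading, shown here purely as an assumption, is a running sum: each step's predicted time is the job start plus the maximum execution times of all steps up to and including it. The function name `predicted_times` and the sample durations are hypothetical.

```python
from datetime import datetime, timedelta


def predicted_times(job_start, max_minutes):
    """Map each step number to its latest expected finish time.

    Assumes the steps run one after another, so each predicted time is the
    job start plus the maximum execution times of all steps up to it.
    """
    deadlines = {}
    elapsed = timedelta()
    for step, minutes in max_minutes:
        elapsed += timedelta(minutes=minutes)
        deadlines[step] = job_start + elapsed
    return deadlines


# Illustrative values for job 2 (steps 3, 4, 5); not taken from the publication.
start = datetime(1997, 6, 16, 9, 0)
deadlines = predicted_times(start, [(3, 10), (4, 20), (5, 15)])
```

A bounded predicted time of this kind is what lets a waiting step time out instead of waiting indefinitely.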

[0015] Each computer registers the execution start 3-5 and end 3-6 of each job step in the job step management DB 3.

[0016] When a failure occurs in computer C, job step 4 does not finish even after its maximum execution time has elapsed, or the job step ends abnormally, so the monitoring server 1-1 detects the failure of computer C.
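The two detection conditions in this paragraph (a step that overruns its maximum execution time, or a step that ends abnormally) can be sketched as a single check. The `Status` values and the function `has_failed` are hypothetical names chosen for illustration; the publication does not describe how step status is encoded.

```python
from datetime import datetime
from enum import Enum


class Status(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    ENDED_OK = "ended_ok"
    ENDED_ABNORMALLY = "ended_abnormally"


def has_failed(status, deadline, now):
    """Return True if the step indicates a failure on its computer."""
    if status is Status.ENDED_ABNORMALLY:
        return True
    # The step is still running although its predicted time has passed.
    return status is Status.RUNNING and now > deadline
```

In this sketch the monitoring server would call `has_failed` for each registered step on every polling pass; the polling itself is left out.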

[0017] The monitoring server 1-1 searches the job step management DB 4 for jobs 2 and 3, which are affected by the failure of computer C. (Step 2) The failure recovery time of computer C is notified to the monitoring server by entering the recovery time. (Step 3) The monitoring server 1-1 adds the delay time (5-1) to the predicted execution time of job step 5, which runs after computer C in the affected job 2, registers it in the job step management DB 5, and makes the step wait. (Step 4) The monitoring server 1-1 also adds the delay time (5-3) to the execution start time of job 3, which runs later, registers it in the job step management DB 5, and rebuilds the execution schedule. (Step 5) After failure recovery processing, computer C registers and executes job step 4 of job 2 again. Job steps 4 and 5 of this job are executed with the highest priority (5-2). (Step 6)

According to this embodiment, the delay time is reported to the monitoring server when a failure occurs in computer C, so job 2 does not end abnormally. After computer C recovers, job 2 is re-executed with top priority from job step 4 on computer C, without re-running job step 3, which already ran on computer A; this shortens the job's completion time when a failure occurs. Furthermore, because the other jobs are rescheduled on the basis of the delay time, the execution of the other job 3 can be kept unaffected.
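Steps 2, 4, 5 and 6 above can be sketched together as a search followed by a shift of the schedule. This is a minimal sketch under stated assumptions: the row layout, the `row` helper, the priority scale (1 is highest, 5 is the default), and all times are hypothetical, and shifting job step 4 itself by the delay is an interpretation rather than something the publication states.

```python
from datetime import datetime, timedelta


def row(job, step, computer, hh, mm, done=False):
    """Build one hypothetical job step management DB row."""
    return {"job": job, "step": step, "computer": computer,
            "predicted": datetime(1997, 6, 16, hh, mm),
            "done": done, "priority": 5}


def affected_jobs(rows, failed_computer):
    """Jobs that still have an unfinished step on the failed computer."""
    return sorted({r["job"] for r in rows
                   if r["computer"] == failed_computer and not r["done"]})


def reschedule(rows, failed_computer, failed_job, delay):
    """Shift unfinished steps of affected jobs and prioritise the failed job."""
    jobs = affected_jobs(rows, failed_computer)
    for r in rows:
        if r["job"] in jobs and not r["done"]:
            r["predicted"] += delay   # later steps wait instead of timing out
            if r["job"] == failed_job:
                r["priority"] = 1     # re-run the failed job first
    return rows


rows = [
    row(1, 1, "A", 8, 30, done=True), row(1, 2, "B", 8, 50, done=True),
    row(2, 3, "A", 9, 10, done=True), row(2, 4, "C", 9, 30),
    row(2, 5, "B", 9, 45), row(3, 6, "C", 10, 0), row(3, 7, "B", 10, 20),
]
jobs = affected_jobs(rows, "C")
reschedule(rows, "C", failed_job=2, delay=timedelta(minutes=30))
by_step = {r["step"]: r for r in rows}
```

Job step 3 is left untouched because it already finished on computer A, which mirrors the point that completed steps are not re-executed.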

[0018]

[Effects of the invention] According to the present invention, for a job consisting of a plurality of job steps executed on a plurality of computers, information is registered in the job step management DB, so that when a failure occurs in a computer the affected jobs can easily be found. In addition, the job is not re-executed from the beginning; after failure recovery processing it is re-executed with top priority from the job step of the failed computer, so the job's completion time when a failure occurs is shortened by the execution time of the job steps that had already completed normally. Furthermore, because the other jobs are rescheduled on the basis of the delay time, the execution of other jobs can be kept unaffected.

[Brief description of the drawings]

FIG. 1 is an overall view of a system according to one embodiment of the present invention, showing the flow of jobs executed on a plurality of computers.

FIG. 2 is the processing flow of the monitoring server of the system shown in FIG. 1.

FIG. 3 shows the elements of the job step management DB.

FIG. 4 shows the job step management DB when a failure occurs in computer C during execution of job 2 in the system shown in FIG. 1.

FIG. 5 shows the job step management DB after jobs 2 and 3 have been rescheduled following failure recovery of computer C in the system shown in FIG. 1.

[Explanation of symbols]

Continued from the front page: (72) Inventor: Tomoyuki Iwata, Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama-shi, Kanagawa

Claims

[Claims]

[Claim 1] A failure recovery system for jobs executed on a plurality of computers, comprising: storage means for managing the units of a job that are executed on each computer, hereinafter called job steps; re-execution means used after failure recovery; and a delay time management mechanism; wherein, when a failure occurs in one of the computers, the system easily identifies the affected jobs, notifies a monitoring server of the delay time so that the job steps to be executed after the job step on the failed computer are made to wait and the job is executed normally, and rebuilds the execution schedule of the jobs that follow the failed job on the basis of the notified delay time.