JPS6341943A - Error restoring system for logic unit

Info

Publication number: JPS6341943A
Application number: JP61186367A
Authority: JP
Inventors: Koemon Nigo; 仁後　公衛門
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1986-08-08
Filing date: 1986-08-08
Publication date: 1988-02-23

Abstract

PURPOSE: To reduce the risk that an intermittent fault recurring during a retry causes an error whose instruction cannot be retried, leading to a system-down, job abort, or the like, by having the instruction in error retried on another, normal logic unit rather than on the logic unit that caused the error. CONSTITUTION: When a fault occurs in a logic unit 11 and is detected by a monitoring means 14, the operation of the logic unit 11 is suspended, its internal state is preserved, and a diagnostic processor 2 is informed through a diagnostic interface 3 that the logic unit 11 has a fault. In response, the diagnostic processor 2 starts an internal state reading means 21, which reads out the internal state of the logic unit 11 preserved by the monitoring means 14. A judging means 22 of the diagnostic processor 2 then judges the detected fault from the internal state read out. When it is judged that retry is possible and functionally degraded operation is not possible, a normal logic unit 13 is made to re-execute the processing and continue the succeeding processing.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a method for performing error recovery processing when an error occurs in any logic unit of an information processing system in which a plurality of logic units share a main storage device.

[Prior Art]

An error recovery method of this kind has been disclosed, for example, in Japanese Patent Application Laid-Open No. Sho 57-064849 (Japanese Patent Application No. Sho 55-141323).

In this method, when an error occurs in a logic unit (central processing unit) and the instruction that caused the error is judged to be retryable in view of conditions such as memory rewriting, the logic unit in which the error occurred is first made to retry the instruction a predetermined number of times. Only if the error still has not been recovered is the instruction retried on another, normal logic unit.
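
The prior-art policy described above can be sketched as follows. This is a minimal illustration, not code from the cited publication: the function name, the callback parameters, and the default retry count of 3 are all assumptions made for the example.

```python
from typing import Callable


def prior_art_recover(
    retryable: bool,
    run_on_faulty_unit: Callable[[], bool],
    run_on_normal_unit: Callable[[], bool],
    max_local_retries: int = 3,  # hypothetical "predetermined number of times"
) -> str:
    """Prior-art policy: retry locally first, then fall back to a normal unit."""
    if not retryable:
        return "not_recovered"
    # Retry on the unit that raised the error, up to the predetermined count.
    for _ in range(max_local_retries):
        if run_on_faulty_unit():
            return "recovered_on_faulty_unit"
    # Only after all local retries fail is a normal unit asked to retry.
    if run_on_normal_unit():
        return "recovered_on_normal_unit"
    return "not_recovered"


# A permanent fault: every local retry fails, then the normal unit succeeds.
attempts = []
result = prior_art_recover(
    retryable=True,
    run_on_faulty_unit=lambda: attempts.append("faulty") or False,
    run_on_normal_unit=lambda: attempts.append("normal") or True,
)
```

In this run the faulty unit is exercised three more times before the normal unit is tried, which is the behavior the following section criticizes.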

[Problems to Be Solved by the Invention]

With this error recovery method, when the error is caused by an intermittent fault and the instruction is retryable, a system-down or the like can be avoided if the retry succeeds on the logic unit in which the error occurred. Even when the error is caused by a permanent fault, a system-down or the like can be avoided, provided the instruction is retryable, by a successful retry on another, normal logic unit.

However, an intermittent fault does not always leave the instruction retryable; depending on the timing of memory rewriting during instruction execution, a retry may be impossible. Moreover, a logic unit that has suffered an intermittent fault once is considered highly likely to suffer one again. Because this method performs the retry first on the logic unit in which the error occurred, another error due to an intermittent fault may occur in the middle of the retry, and this time the instruction in error may not be retryable. In that case, since no retry is possible, the result is a system-down, a job abort, or the like, which is a drawback of the method.
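
One rough way to quantify this drawback, under assumptions that do not appear in the source: suppose each retry on the faulty unit hits the intermittent fault again with probability $p$, and a recurring error falls at a non-retryable point with probability $q$, independently for each retry. With up to $n$ local retries, the probability that the local retries themselves end in a non-retryable error is then

```latex
% Attempt k+1 is reached only if the previous k attempts each failed
% in a still-retryable way (probability p(1-q) each).
P_{\text{down}}(n)
  = \sum_{k=0}^{n-1} \bigl(p(1-q)\bigr)^{k}\, p\,q
  = p\,q\,\frac{1-\bigl(p(1-q)\bigr)^{n}}{1-p(1-q)},
  \qquad p(1-q) < 1 .
```

Under this simplified model the risk is $pq$ for a single local retry and grows with $n$ whenever $p, q > 0$; performing no local retry at all removes this term entirely.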

As an improvement on this error recovery method, a method has been conceived in which, when an error occurs in a logic unit and the instruction in error is judged to be retryable, another, normal logic unit is made to retry from the instruction in error, without any retry on the logic unit in which the error occurred.

This method, however, disconnects an entire logic unit even when the fault is local to the logic unit and operation could continue with the faulty part functionally degraded, so it has the drawback of lowering system performance needlessly.

An object of the present invention is to provide an error recovery method for a logic unit that carries little risk of causing a system-down, job abort, or the like, and that recovers from logic unit errors while keeping performance degradation to a minimum.

[Means for Solving the Problems]

In the error recovery method for a logic unit according to the present invention, when an error occurs in a logic unit and it is judged that the instruction in error is retryable and that processing cannot be continued by degrading part of the functions of the logic unit in which the error occurred, another, normal logic unit is made to retry from the instruction in error, without any retry on the logic unit in which the error occurred.

[Operation]

Consequently, the risk is reduced that an error due to an intermittent fault occurs during a retry, leaves the instruction in error non-retryable, and leads to a system-down, job abort, or the like. In addition, for an error that permits functionally degraded operation, the logic unit can keep operating with only some of its functions degraded, without being disconnected, so performance degradation is kept to a minimum.
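
The decision rule can be written as a small truth table. The sketch below is illustrative only: the enum and function names are invented for the example, and the handling of a fault that is neither retryable nor degradable (disconnecting the unit) follows the embodiment described in this document rather than this paragraph.

```python
from enum import Enum


class Recovery(Enum):
    RETRY_ON_NORMAL_UNIT = "retry on another normal logic unit"
    DEGRADE_AND_CONTINUE = "degrade part of the faulty unit and continue"
    DISCONNECT_UNIT = "disconnect the faulty unit"


def choose_recovery(retryable: bool, degradable: bool) -> Recovery:
    """Map the two judgments onto a recovery action (hypothetical helper)."""
    if retryable and not degradable:
        # The case the invention targets: no retry on the faulty unit.
        return Recovery.RETRY_ON_NORMAL_UNIT
    if degradable:
        # Keep the unit in service with part of its functions degraded.
        return Recovery.DEGRADE_AND_CONTINUE
    return Recovery.DISCONNECT_UNIT


# Enumerate all four combinations of (retryable, degradable).
table = {
    (r, d): choose_recovery(r, d)
    for r in (True, False)
    for d in (True, False)
}
```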

[Embodiment]

Next, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram of an information processing system to which an embodiment of the error recovery method for a logic unit according to the present invention is applied, and FIG. 2 is a flowchart of an example of processing by the diagnostic processor 2.

This information processing system consists of an information processing device 1 and a diagnostic processor 2. The information processing device 1 includes a plurality of logic units 11 and 13, which are, for example, central processing units; a main storage device 12 that is connected to the logic units 11 and 13 and is accessible from both of them; and monitoring means 14 connected to the logic units 11 and 13 and to the main storage device 12. The information processing device 1 is controlled by a single operating system. The monitoring means 14 has a function of detecting a fault (error) in the logic units 11 and 13 and a function of, upon detecting a fault, saving the internal state of the faulty logic unit and notifying the diagnostic processor 2 of the fault via a diagnostic interface 3. The diagnostic processor 2 includes internal state reading means 21, means 22 for judging the possibility of retry and of functionally degraded operation, takeover information editing/creating means 23, takeover information setting means 24, and means 25 for instructing re-execution and continuation of the taken-over processing. The internal state reading means 21 reads out the internal state of the faulty logic unit saved by the monitoring means 14 when the above notification is received from the monitoring means 14.

The means 22 for judging the possibility of retry and of functionally degraded operation judges, on the basis of the internal state read out, whether the instruction in which the error occurred can be retried and whether functionally degraded operation is possible.

The takeover information editing/creating means 23 edits and creates, from the internal state, takeover information for the processing that was being executed on the faulty logic unit when the judging means 22 judges that a retry is possible. The takeover information setting means 24 sets the takeover information created by the editing/creating means 23 in a normal logic unit other than the one in which the fault was detected. The means 25 for instructing re-execution and continuation of the taken-over processing instructs that normal logic unit, on the basis of the information set by the setting means 24, to take over and re-execute the processing that was being performed on the logic unit in which the fault was detected.
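
The source does not specify what the saved internal state or the takeover information contains. As a sketch under that caveat, the following assumes hypothetical fields (a failing-instruction address and a register map) to show how takeover information might be derived from the saved state so that re-execution starts at the instruction in error.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class SavedInternalState:
    """State preserved by the monitoring means (all fields are hypothetical)."""
    unit_id: int
    failing_instruction_address: int
    registers: Dict[str, int] = field(default_factory=dict)


@dataclass(frozen=True)
class TakeoverInfo:
    """What a normal unit would need to resume the interrupted processing."""
    source_unit_id: int
    restart_address: int
    registers: Dict[str, int]


def edit_takeover_info(state: SavedInternalState) -> TakeoverInfo:
    """Build takeover information from the saved state (role of means 23)."""
    return TakeoverInfo(
        source_unit_id=state.unit_id,
        # Restart from the instruction that caused the error, not the next one.
        restart_address=state.failing_instruction_address,
        # Copy so later changes to the saved state cannot leak into the takeover.
        registers=dict(state.registers),
    )


state = SavedInternalState(unit_id=11, failing_instruction_address=0x1000,
                           registers={"r0": 7})
info = edit_takeover_info(state)
```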

Next, the operation of this embodiment will be described with reference to FIG. 2, taking as an example the case where a fault occurs in the logic unit 11.

When a fault occurs in the logic unit 11, it is detected by the monitoring means 14. On detecting the fault, the monitoring means 14 suspends the processing of the logic unit 11, saves its internal state, and notifies the diagnostic processor 2 via the diagnostic interface 3 that the logic unit 11 has failed. In response, the diagnostic processor 2 starts the internal state reading means 21, which reads out the internal state of the logic unit 11 saved by the monitoring means 14 (process 51).

Next, the diagnostic processor 2 uses the judging means 22 to judge, from the internal state read out, whether the detected fault is one for which a retry is possible and functionally degraded operation is not possible (process 52). If this condition does not hold, fault handling is performed for the faulty logic unit 11: if the fault permits functionally degraded operation, the function is degraded and operation continues; otherwise, the faulty logic unit 11 is disconnected from the system.

If, on the other hand, process 52 judges that a retry is possible and functionally degraded operation is not possible, the takeover information editing/creating means 23 edits and creates takeover information from the internal state read out (process 53), and the takeover information setting means 24 sets this takeover information in the normal logic unit 13 (process 54). The means 25 for instructing re-execution and continuation of the taken-over processing then instructs the normal logic unit 13 to take over the processing (process 55), causing the logic unit 13 to re-execute and continue the taken-over processing. As a result, the processing that was being executed on the logic unit 11 when the retryable error occurred is re-executed not on the logic unit 11, in which the fault was detected, but on the other, normal logic unit 13. The error that occurred in the logic unit 11 is thereby recovered, and the subsequent processing is likewise taken over by the normal logic unit 13 rather than by the faulty logic unit 11. When the normal logic unit 13 takes over the processing of the failed logic unit 11, the failed logic unit 11 is normally disconnected logically from the system.
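
Processes 51 to 55 can be sketched end to end as follows. This is a simplified model, not the patented implementation: the class names, the fields of the saved state, and the dictionary used as takeover information are assumptions, and the two judgments of process 52 are supplied as precomputed flags rather than derived from real hardware state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LogicUnit:
    unit_id: int
    online: bool = True
    degraded: bool = False
    takeover: Optional[Dict[str, int]] = None
    log: List[str] = field(default_factory=list)


@dataclass
class SavedState:
    """State preserved by the monitoring means; all fields are hypothetical."""
    failing_address: int
    retryable: bool
    degradable: bool


def diagnose(failed: LogicUnit, normal: LogicUnit, saved: SavedState) -> str:
    state = saved  # process 51: read out the saved internal state
    # Process 52: is a retry possible AND degraded operation not possible?
    if not (state.retryable and not state.degradable):
        if state.degradable:
            failed.degraded = True  # degrade the function and keep running
            return "degraded"
        failed.online = False  # otherwise disconnect the faulty unit
        return "disconnected"
    info = {"restart_address": state.failing_address}  # process 53: edit/create
    normal.takeover = info  # process 54: set in the normal unit
    # Process 55: instruct the normal unit to re-execute and continue.
    normal.log.append(f"re-execute from {info['restart_address']:#x}")
    failed.online = False  # the failed unit is normally disconnected afterwards
    return "taken_over"


unit11, unit13 = LogicUnit(11), LogicUnit(13)
outcome = diagnose(unit11, unit13,
                   SavedState(0x2000, retryable=True, degradable=False))
```

Note that the faulty unit is never asked to re-run the instruction in the takeover path; only the normal unit executes it again.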

The above description concerns the case where the logic unit 11 fails. When the logic unit 13 fails, the processing is likewise handed over, under the control of the diagnostic processor 2 as described above, to the normal logic unit 11, so that the processing that was being performed on the logic unit 13 can be continued.

Although this embodiment has two logic units, the present invention is also applicable to an information processing system equipped with three or more logic units.
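
With three or more logic units, a takeover target has to be chosen among the remaining normal units. The source does not describe a selection policy; the sketch below simply assumes that the first unit not known to be faulty is used, and the function name is hypothetical.

```python
from typing import Iterable, Optional, Set


def pick_normal_unit(unit_ids: Iterable[int], failed: Set[int]) -> Optional[int]:
    """Return the first unit not marked as failed, or None if none remains."""
    for uid in unit_ids:
        if uid not in failed:
            return uid
    return None


# Three units, with unit 11 faulty: unit 13 is chosen under this assumed policy.
target = pick_normal_unit([11, 13, 15], failed={11})
```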

[Effects of the Invention]

As described above, when an error occurs for which a retry is possible and functionally degraded operation is not possible, the present invention has the instruction in error retried on another, normal logic unit rather than on the logic unit in which the error occurred. This reduces the risk that an error due to an intermittent fault occurs during the retry, leaves the instruction in error non-retryable, and leads to a system-down, job abort, or the like. In addition, for an error that permits functionally degraded operation, the logic unit can keep operating with only some of its functions degraded, without being disconnected, so performance degradation is kept to a minimum.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of an information processing system to which an embodiment of the error recovery method for a logic unit according to the present invention is applied, and FIG. 2 is a flowchart of an example of processing by the diagnostic processor 2.

1: information processing device
2: diagnostic processor
3: diagnostic interface
11, 13: logic units
12: main storage device
14: monitoring means
21: internal state reading means
22: means for judging the possibility of retry and of functionally degraded operation
23: takeover information editing/creating means
24: takeover information setting means
25: means for instructing re-execution and continuation of the taken-over processing

Patent applicant: NEC Corporation

Claims

[Claims]

An error recovery method for a logic unit in an information processing system in which a plurality of logic units share a main storage device, wherein, when an error occurs in one of the logic units and it is judged that the instruction in error can be retried and that processing cannot be continued by degrading part of the functions of the logic unit in which the error occurred, another, normal logic unit is made to retry from the instruction in error, without any retry on the logic unit in which the error occurred.