JPS59144956A - Data processing system

Info

Publication number: JPS59144956A
Application number: JP58018632A
Authority: JP
Inventors: Noritaka Umeno; 典隆梅野
Original assignee: NEC Corp; Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1983-02-07
Filing date: 1983-02-07
Publication date: 1984-08-20

Abstract

PURPOSE: To enable recovery from a fault even when several instructions are executed between the occurrence of the fault and its report, by saving the pre-write data whenever data is written to the main storage device. CONSTITUTION: An instruction counter (IC) 11 is saved into an instruction counter backup (IC-BU) 21 when a checkpoint is set. To restore a general register group (GPR) 16 to its state at the latest checkpoint when a fault is reported, only those registers of GPR 16 whose V indication bits are set are rewritten from a register backup (BFR) 23. A memory backup (BFM) 41, which stacks in order of occurrence the address and pre-write data of every memory write made after the checkpoint was set, is provided for a memory module (MMU) 31. To restore MMU 31 to its original state when a fault is reported, the contents of BFM 41 are read out under control of a write counter and written back into MMU 31.

Description

[Detailed Description of the Invention]
The present invention relates to a data processing system capable of recovering from faults that occur intermittently while a central processing unit in the system is processing (hereinafter simply called faults).

<Prior Art>
Conventionally, when a fault occurred in the central processing unit (hereinafter CPU) of a data processing system and was reported by an error detection circuit, recovery was performed by re-executing the instruction that was being executed when the fault was reported, as disclosed for example in Japanese Patent Application No. Sho 50-189894 (hereinafter the instruction retry method). This method is effective only when the fault both occurs and is reported during execution of that instruction. If the cause of the fault arose before that instruction, for example if a fault corrupted a register write address so that a different register was written and no fault was reported at that time, then re-executing the instruction that was running when the fault was finally reported gives the same result however many times it is repeated, and recovery is impossible.

Another conventional method is disclosed in Japanese Examined Patent Publication No. Sho 53-11181. A checkpoint is set at a break in processing after several instructions have been executed, and the contents of the instruction counter and similar state at that point are saved. For state that cannot be saved immediately, such as general registers and memory, the pre-write value is saved to a backup memory whenever a write occurs after the checkpoint. When a fault occurs and is reported, the instruction counter, general registers, memory, and so on are returned to their state at the latest checkpoint using the saved information, and execution is then restarted. Because this recovery by re-execution from a checkpoint (hereinafter the checkpoint re-execution method) goes back and re-executes several instructions, it largely solves the drawback of the instruction retry method described above, namely that recovery is impossible when the cause of the fault arose before the instruction that was executing when the fault was reported.

However, a checkpoint must be set immediately once the backup memory can no longer hold the saved data. For this reason the backup memory has conventionally been integrated with the checkpoint setting logic and placed on the CPU side, in a CPU that employs a "store-in buffer", as disclosed in Japanese Examined Patent Publication No. Sho 53-11181. The store-in buffer is primarily a technique for making the apparent memory access time shorter, but because the memory data written by the CPU is always present in this buffer, the checkpoint re-execution method could be applied to it.

Besides the store-in buffer, another speed-up technique of this kind is the "store-through buffer". With a store-through buffer, the memory data being written by the CPU may not be present in the buffer, so saving the pre-write value requires a backup memory on the memory side. For this reason the checkpoint re-execution method has not been applied to CPUs that employ a store-through buffer. On the other hand, a store-through buffer always writes to the main storage device as well as to the buffer, so the latest memory data is maintained in main storage. Even if a fault occurs in the buffer, recovery is easy: the buffer is bypassed and processing continues using the data in main storage.

A store-in buffer, by contrast, writes only to the buffer, so the latest memory data is not maintained in main storage and recovery is not easy if a fault occurs in the buffer. In other words, although the conventional checkpoint re-execution method disclosed in Japanese Examined Patent Publication No. Sho 53-11181 solves the drawback of the instruction retry method, it requires a store-in buffer, and this causes a problem for fault recovery of the CPU as a whole, including the buffer.
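The difference between the two write policies discussed above can be sketched as follows. This is a minimal illustrative model, not circuitry from the publication; the class names and the list-based `main` storage are assumptions made for the example.

```python
# Illustrative sketch only: class and attribute names are hypothetical.

class StoreThroughBuffer:
    """Every write goes to the buffer AND to main storage."""

    def __init__(self, main):
        self.main = main      # list standing in for main storage
        self.lines = {}       # buffer contents: address -> data

    def write(self, addr, data):
        self.lines[addr] = data
        self.main[addr] = data    # main storage always holds the latest data


class StoreInBuffer:
    """Writes go to the buffer only; main storage is updated later."""

    def __init__(self, main):
        self.main = main
        self.lines = {}

    def write(self, addr, data):
        self.lines[addr] = data   # main storage is stale until a write-back


main_a = [0] * 4
StoreThroughBuffer(main_a).write(1, 9)

main_b = [0] * 4
StoreInBuffer(main_b).write(1, 9)
```

After these writes `main_a[1]` holds the new value while `main_b[1]` still holds the old one, which is why losing a store-in buffer loses data that main storage cannot supply.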

Furthermore, in a multiprocessor system with shared memory, processing proceeds under exclusive control of resources, so when a fault occurs in one CPU it is not possible simply to roll that CPU alone back to its checkpoint and re-execute.

<Object of the Invention>
An object of the present invention is to provide a data processing system that overcomes the drawback of the instruction retry method by applying the checkpoint re-execution method, so that recovery is possible even when several instructions are executed between the occurrence of a fault and its report. Another object is to provide a data processing system that overcomes the drawback that the checkpoint re-execution method could be applied only to a store-in buffer: by saving the pre-write data whenever main storage is written, it realizes an improved checkpoint re-execution method that can also be applied to a CPU without a buffer or to a CPU that employs a store-through buffer. Still another object is to provide a data processing system in which the checkpoint re-execution method can be realized relatively simply even in a multiprocessor system consisting of a plurality of CPUs that share resources.

<Embodiment>
General configuration
The present invention will now be described in detail with reference to the drawings.

In FIG. 1, which shows an embodiment of the present invention, the data processing system consists of two CPUs 1 and 1', a main storage unit (MSU) 3, and a fault recovery unit (DGU) 5. The description below focuses on CPU 1, but applies equally to CPU 1'. As its basic function, CPU 1 reads the instruction to be executed from main storage unit 3 at the instruction address held in an instruction counter (IC) 11 and stores it temporarily in an instruction register (IR) 12. An instruction execution control unit (EXC) 15 then executes instructions in sequence, reading or writing a general register group (GPR) 16 and main storage unit 3 as each instruction requires. Main storage unit 3 is accessed by setting the target address in an address register (AR) 13 and, for a write, setting the write data in a memory data register (MDR) 14. For a read, the data read out is set in memory data register MDR 14. The general register group (GPR) 16 consists of sixteen 32-bit registers, referred to as GPR 16-0, 16-1, ..., 16-15.

Main storage unit 3 realizes its basic memory function with a memory module (MMU) 31 having a capacity of n megabytes, a memory address register (MAR) 33 that holds the address used to access memory module MMU 31, a memory read register (MRR) 34 that holds data read from memory module MMU 31 on a read, a memory write register (MWR) 35 that holds the data to be written to memory module MMU 31 on a write, and a memory control unit (MMC) 32. A data bus 361, which carries memory data to and from CPUs 1 and 1', is bidirectional. A conversion circuit 36 passes data from CPU 1 or 1' to memory write register MWR 35, and passes data in memory read register MRR 34 to CPU 1 or 1'.

Because this is a multiprocessor system, a lock instruction and an unlock instruction are provided for exclusive control of shared resources. The lock instruction checks whether the value at the specified location is all zeros. If it is all zeros, the instruction stores non-zero data there; if it is non-zero, the location is left unchanged. Whether the value was initially all zeros is set in the condition code, so software can test it. In general the location is used as an in-use flag when a shared resource is to be used exclusively, and a non-zero value indicates that the resource is in use by one of the CPUs.

The unlock instruction sets the value at the specified location to all zeros in order to clear the in-use flag.

An ordinary store instruction would suffice if the only purpose were to clear the in-use flag, but as described later the unlock instruction plays a characteristic role in this invention.
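The lock and unlock semantics described above can be sketched as a small single-threaded model. The class name, method names, and the `tag` parameter are assumptions for illustration; in hardware the test and the store of the lock instruction would be performed as one indivisible operation, which this sketch does not attempt to model.

```python
# Hypothetical model of the lock/unlock instructions; names are illustrative.

class SharedStorage:
    def __init__(self, size_words):
        self.words = [0] * size_words

    def lock(self, loc, tag=1):
        """Test the location; store non-zero data only if it was all zeros.

        Returns True if the location was initially zero (the information the
        text says is reflected in the condition code), False otherwise.
        """
        if tag == 0:
            raise ValueError("lock data must be non-zero")
        was_zero = self.words[loc] == 0
        if was_zero:
            self.words[loc] = tag      # mark the resource as in use
        return was_zero                # non-zero location is left unchanged

    def unlock(self, loc):
        """Clear the in-use flag by setting the location to all zeros."""
        self.words[loc] = 0


storage = SharedStorage(4)
first = storage.lock(2, tag=1)    # acquires the flag
second = storage.lock(2, tag=2)   # fails: already in use, location unchanged
storage.unlock(2)
third = storage.lock(2, tag=2)    # succeeds after the unlock
```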

A checkpoint re-execution function is added to the data processing system that realizes the basic functions described above. In checkpoint re-execution, a checkpoint is set according to certain criteria and the state of the data processing system at that point is saved. Normal processing then continues, and another checkpoint is set when a criterion is next met; this cycle repeats. If a fault is reported during normal processing, processing is interrupted, the state of the data processing system is returned to that of the latest checkpoint using the saved information, and execution is restarted, thereby recovering from the fault. However, saving the state of the data processing system, in particular the general registers and memory, takes time and cannot be done at the moment a checkpoint is set. Instead, whenever a write occurs after a checkpoint has been set, the pre-write data is saved. The checkpoint re-execution function is described below with reference to CPU 1, but the same applies to CPU 1'.

Instruction counter IC 11 is saved into an instruction counter backup (hereinafter IC backup) (IC-BU) 21 when a checkpoint is set. For general register group GPR 16 there are provided a register backup (BFR) 23 of exactly the same configuration and V indication bits (BFR-V) 24, which indicate for each general register GPR 16-i whether it has been written since the latest checkpoint was set. The register backup word and V indication bit corresponding to general register GPR 16-i are denoted BFR 23-i and BFR-V 24-i (i = 0 to 15). When instruction execution control unit EXC 15 writes to general register GPR 16-i, the register select signal (GAR signal) 151 takes the value "i", general register GPR 16-i is selected, and its value is placed on data bus A (A bus) 161. A BFR control unit (BFRCTL) 22 uses GAR signal 151 and BFR-V 24-i to determine whether general register GPR 16-i has been written since the checkpoint was set. If it has been written (BFR-V 24-i = "1"), nothing is done. If it has not (BFR-V 24-i = "0"), the value of general register GPR 16-i is written via A bus 161 into backup register BFR 23-i and BFR-V 24-i is set to "1". The V indication bits BFR-V 24 are cleared to all zeros when a checkpoint is set. To return general register group GPR 16 to its state at the latest checkpoint when a fault is reported, it is enough to rewrite from register backup BFR 23 only those general registers GPR 16-i whose bit in BFR-V 24 is "1".
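The save-on-first-write behavior of the register backup can be sketched as below. This is a minimal model under stated assumptions: the class and method names are hypothetical, and buses, select signals, and timing are not modeled.

```python
# Illustrative sketch of GPR 16 / BFR 23 / BFR-V 24; names are hypothetical.

class RegisterFile:
    NUM_REGS = 16
    MASK32 = 0xFFFFFFFF

    def __init__(self):
        self.gpr = [0] * self.NUM_REGS      # general registers GPR 16-i
        self.bfr = [0] * self.NUM_REGS      # register backup BFR 23-i
        self.v = [False] * self.NUM_REGS    # V indication bits BFR-V 24-i

    def set_checkpoint(self):
        # Setting a checkpoint only clears the V bits; nothing is copied.
        self.v = [False] * self.NUM_REGS

    def write(self, i, value):
        # First write since the checkpoint: save the old value, set the V bit.
        if not self.v[i]:
            self.bfr[i] = self.gpr[i]
            self.v[i] = True
        self.gpr[i] = value & self.MASK32

    def restore(self):
        # Only registers whose V bit is set need to be rewritten.
        for i in range(self.NUM_REGS):
            if self.v[i]:
                self.gpr[i] = self.bfr[i]


rf = RegisterFile()
rf.write(3, 10)
rf.set_checkpoint()      # checkpoint state: GPR 3 holds 10
rf.write(3, 20)          # saves 10 into BFR 3
rf.write(3, 30)          # V bit already set, backup untouched
rf.write(5, 7)           # saves 0 into BFR 5
rf.restore()
```

Because only the first write after a checkpoint is saved, the backup always holds the checkpoint-time value, and setting a checkpoint costs no more than clearing sixteen bits.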

For memory module MMU 31 there is provided a memory backup (BFM) 41, which stacks, in order of occurrence, the address and the pre-write data of every memory write made since the checkpoint was set. Memory backup BFM 41 is a memory of 256 words with 54 bits per word (holding a 22-bit write address and 32 bits of pre-write data). It is addressed by a write counter (WCNT) 42 that counts up on every write to memory backup BFM 41, and a monitoring circuit (DTR) 43 monitors whether the value of write counter WCNT 42 is "248" or more. When CPU 1 issues a memory write request, memory control unit MMC 32 sets the write address in memory address register MAR 33 and selects the word of memory module MMU 31 to be written. The value of the selected word first appears on the memory read/write data bus (M bus) 311, so memory control unit MMC 32 sets that value in memory read register MRR 34 and then issues a write command that writes the data sent from CPU 1 into memory module MMU 31 through memory write register MWR 35. At the same time, memory control unit MMC 32 writes the values of memory address register MAR 33 and memory read register MRR 34 into the address of memory backup BFM 41 specified by write counter WCNT 42, and increments write counter WCNT 42 by 1. Write counter WCNT 42 is cleared to all zeros when a checkpoint is set.
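The write path described above amounts to an undo log that is appended to on every memory write. The following sketch models it under stated assumptions: the class and method names are hypothetical, log entries are kept as Python tuples rather than packed 54-bit words, and the explicit overflow check is an addition for the model (in the described system a checkpoint is set once the counter reaches 248, before the 256 words are exhausted).

```python
# Illustrative sketch of MMU 31 / BFM 41 / WCNT 42; names are hypothetical.

class MainStorage:
    BFM_WORDS = 256     # capacity of the memory backup
    NEARLY_FULL = 248   # threshold watched by the monitoring circuit

    def __init__(self, size_words):
        self.mmu = [0] * size_words
        # Each entry models one 54-bit word: 22-bit address + 32-bit old data.
        self.bfm = [None] * self.BFM_WORDS
        self.wcnt = 0

    def set_checkpoint(self):
        self.wcnt = 0                 # old log entries are simply abandoned

    def nearly_full(self):
        return self.wcnt >= self.NEARLY_FULL

    def write(self, addr, data):
        if self.wcnt >= self.BFM_WORDS:
            raise OverflowError("memory backup exhausted")  # model-only guard
        old = self.mmu[addr]                  # value captured before the write
        self.bfm[self.wcnt] = (addr, old)     # log address and pre-write data
        self.wcnt += 1
        self.mmu[addr] = data


ms = MainStorage(16)
ms.write(4, 11)
ms.set_checkpoint()
ms.write(4, 22)
ms.write(4, 33)
ms.write(9, 5)
```

Unlike the register backup, every write is logged, including repeated writes to the same address, which is why the log must later be replayed newest entry first.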

To return memory module MMU 31 to its state at the latest checkpoint when a fault is reported, memory backup BFM 41 is read backwards, starting at the address given by the value of write counter WCNT 42 at the time of the fault report minus 1 and continuing until the address reaches 0, and memory module MMU 31 is rewritten from the data in the order in which it is read. For CPU 1', main storage unit MSU 3 is likewise provided with a memory backup BFM 41', a write counter WCNT 42', and a monitoring circuit DTR 43', and the same operation is carried out independently of CPU 1.
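The reverse-order replay can be sketched as a single function. The function name and the tuple layout of log entries are assumptions for illustration.

```python
# Illustrative sketch of restoring MMU 31 from BFM 41; names are hypothetical.

def roll_back(mmu, bfm, wcnt):
    """Undo logged writes, newest first, from index wcnt - 1 down to 0."""
    for index in range(wcnt - 1, -1, -1):
        addr, old_data = bfm[index]
        mmu[addr] = old_data
    return mmu


# Address 2 held 11 at the checkpoint, then was written twice (22, then 33).
mmu = [0, 0, 33, 0]
bfm = [(2, 11), (2, 22)]
roll_back(mmu, bfm, wcnt=2)
```

The order matters when one address was written more than once: replaying newest first means the oldest entry, which holds the checkpoint-time value, is applied last. Replaying this log oldest first would leave 22 at address 2 instead of 11.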

As shown in FIG. 2, a checkpoint is set in the following cases:
(1) when some control operation has been performed on another device, such as an IO (input/output) device, so that execution can no longer be restarted from before that point (for example, an IO instruction);
(2) when the value of a register or memory for which no saving means is provided has been changed (not shown; for example, the interrupt mask state or the setting or resetting of an interrupt cause);
(3) when the BFM nearly full signal 431 from monitoring circuit DTR 43 is on, that is, when the value of write counter WCNT 42 has reached 248 or more;
(4) when an UNLOCK instruction has been executed on the CPU itself.

When a checkpoint control circuit (CPCTL) 25 detects any one of the above conditions, it sets a checkpoint transition state indication (CPIND) flip-flop 26. When the instruction execution end signal 153 from instruction execution control unit EXC 15 turns on, it turns on a set checkpoint (SCP) signal 251 and resets CPIND flip-flop 26. When SCP signal 251 turns on, the current value of instruction counter IC 11, that is, the address of the next instruction to be executed, is stored in instruction counter backup IC-BU 21; for register backup BFR 23, all the corresponding V indication bits BFR-V 24 are reset; and for memory backup BFM 41, write counter WCNT 42, which holds its address, is reset. The data processing system is thereby placed at a new checkpoint.
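The two-step behavior of the checkpoint control circuit (note a trigger, then set the checkpoint at the end of the current instruction) can be sketched as follows. The enum members, class name, and method names are assumptions for illustration. The sketch records only the instruction counter backup; in the described system the same SCP signal also clears the V indication bits and the write counter.

```python
# Illustrative sketch of CPCTL 25 and the CPIND flip-flop; names are hypothetical.
from enum import Enum, auto


class Trigger(Enum):
    IO_CONTROL = auto()             # (1) irreversible control of another device
    UNSAVED_STATE_CHANGE = auto()   # (2) state with no saving means was changed
    BFM_NEARLY_FULL = auto()        # (3) write counter reached 248 or more
    UNLOCK = auto()                 # (4) UNLOCK executed on this CPU


class CheckpointControl:
    def __init__(self):
        self.cpind = False          # checkpoint transition state indication
        self.ic_bu = 0              # instruction counter backup
        self.checkpoints_set = 0

    def detect(self, trigger):
        # Any one trigger is enough; the checkpoint itself is deferred.
        self.cpind = True

    def instruction_end(self, next_ic):
        """Called when the current instruction finishes; returns True on SCP."""
        if not self.cpind:
            return False
        self.ic_bu = next_ic        # address of the next instruction to execute
        self.cpind = False
        self.checkpoints_set += 1
        return True


ctl = CheckpointControl()
no_scp = ctl.instruction_end(100)   # no trigger pending: nothing happens
ctl.detect(Trigger.UNLOCK)
pending = ctl.cpind                 # True between detection and instruction end
scp = ctl.instruction_end(104)      # checkpoint set at the instruction boundary
```

Deferring the checkpoint to the instruction boundary means the saved instruction counter always points at a whole instruction, so re-execution never resumes in the middle of one.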

The configuration and operation described above are the same for CPU 1'. That is, CPU 1' has an instruction counter IC 11', an instruction register IR 12', an address register AR 13', a data register MDR 14', an instruction execution control unit EXC 15', a general register group GPR 16', an instruction counter backup IC-BU 21', a register backup control unit BFRCTL 22', a register backup BFR 23', V indication bits BFR-V 24', a checkpoint control circuit 25', and a CPIND flip-flop 26'. Main storage unit MSU 3 in addition has a memory backup BFM 41', a write counter 42', and a monitoring circuit 43'. These operate in exactly the way described for CPU 1, and independently of CPU 1.

Next, the fault recovery processing performed when a fault is reported will be described. When a fault within CPU 1 is detected by a fault detection mechanism 29, which is implemented by parity checking, comparison of duplicated circuits, illegal instruction detection, or other techniques, the fault report (FD) signal 101 to fault recovery unit DGU 5 is turned on, and CPU 1 and main storage unit MSU 3 stop immediately. When FD signal 101 turns on, fault recovery unit DGU 5 starts the fault recovery processing.

Fault recovery unit DGU 5 is a diagnostic processor for CPU 1 and main storage unit MSU 3. Through a diagnostic interface bus 51 it can read and write the registers and memory of CPU 1 and main storage unit MSU 3, and it uses this capability to carry out the fault recovery processing.

The fault recovery processing is described below with reference to FIG. 3. When FD signal 101 turns on, process 61 reads CPIND flip-flop 26 through diagnostic interface bus 51. If it is "0", recovery is judged to be possible and processing proceeds to process 62.

In process 62, instruction counter backup IC-BU 21 is read through diagnostic interface bus 51 and its value is set in instruction counter IC 11. Next, in process 63, register backup BFR 23 and V indication bits BFR-V 24 are read through diagnostic interface bus 51. For each bit position in BFR-V 24 that is set to "1" (say the i-th), the corresponding word of register backup BFR 23 (BFR 23-i) is stored into the corresponding word of the general register group (GPR 16-i). This is done for every bit of BFR-V 24 that is set to "1".

Next, in process 64, memory backup BFM 41 is read through diagnostic interface bus 51, using the value of write counter WCNT 42 at the time of the fault report to locate the starting address and decrementing write counter WCNT 42 by 1 at a time until it reaches 0. Then, in the order in which the entries were read, the write address field of each memory backup BFM 41 entry is used to address memory module MMU 31 and the data field is written there as the write data. In this way the states of instruction counter IC 11, general register group GPR 16, and memory module MMU 31 are returned to those of the latest checkpoint. In process 65 the error is reset and FD signal 101 is turned off, after which CPU 1 is started, so that execution resumes from the latest checkpoint. This operation is exactly the same for CPU 1'.
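Processes 61 to 65 can be summarized in one function. This is a sketch under stated assumptions: the dictionary keys and the function name are hypothetical, the diagnostic interface bus is not modeled, and the text does not say what is done when the CPIND flip-flop is "1", so the sketch simply reports that recovery was not attempted.

```python
# Illustrative sketch of the recovery flow of FIG. 3; all names are hypothetical.

def recover(cpu, msu):
    """Return True if the CPU was rolled back to its latest checkpoint."""
    # Process 61: a pending checkpoint trigger means rollback is not attempted.
    if cpu["cpind"]:
        return False

    # Process 62: restore the instruction counter from its backup.
    cpu["ic"] = cpu["ic_bu"]

    # Process 63: restore only the registers whose V bit is set.
    for i, written in enumerate(cpu["v"]):
        if written:
            cpu["gpr"][i] = cpu["bfr"][i]

    # Process 64: undo logged memory writes, newest entry first.
    for index in range(msu["wcnt"] - 1, -1, -1):
        addr, old_data = msu["bfm"][index]
        msu["mmu"][addr] = old_data

    # Process 65: reset the error and clear the fault report before restart.
    cpu["fd"] = False
    return True


cpu = {
    "cpind": False, "fd": True, "ic": 120, "ic_bu": 100,
    "gpr": [0] * 16, "bfr": [0] * 16, "v": [False] * 16,
}
cpu["gpr"][2], cpu["bfr"][2], cpu["v"][2] = 99, 5, True
msu = {"mmu": [0, 0, 33, 0], "bfm": [(2, 11), (2, 22)], "wcnt": 2}

recovered = recover(cpu, msu)
```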

The embodiment described above is a CPU without a buffer (buffer storage), as shown in FIG. 1. However, the present invention can be applied as it is to any CPU in which write requests generated within the CPU are also written to main storage, for example a CPU that employs a store-through buffer.

<Effects>
As described above, the present invention is configured to save the pre-write data whenever data is written to main storage. As a result, the checkpoint re-execution method can also be applied to CPUs other than those that employ a store-in buffer. A further effect is that the memory used to save the pre-write data of main storage can be realized with relatively slow memory operating at about the same speed as main storage. In addition, by providing an UNLOCK instruction that releases a resource to the other CPUs and setting a new checkpoint whenever the UNLOCK instruction is executed, the checkpoint re-execution method can easily be applied to a multiprocessor system as well.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing a data processing system that is an embodiment of the present invention. FIG. 2 is a diagram showing the checkpoint setting conditions in checkpoint control circuit 25 of FIG. 1. FIG. 3 is a flowchart showing the operation of the fault recovery processing.
1, 1': CPU; 11: instruction counter (IC); 12: instruction register (IR); 13: address register (AR); 14: data register (MDR); 15: instruction execution control unit (EXC); 16: general register group (GPR); 21: IC backup (IC-BU); 22: BFR control unit (BFRCTL); 23: register backup (BFR); 24: V indication bits (BFR-V); 25: checkpoint control circuit; 26: checkpoint transition state indication flip-flop (CPIND); 3: main storage unit (MSU); 31: memory module (MMU); 32: memory control unit (MMC); 33: memory address register (MAR); 34: memory read register (MRR); 35: memory write register (MWR); 41, 41': memory backup (BFM); 42, 42': write counter; 43, 43': monitoring circuit; 5: fault recovery unit (DGU).

Claims

[Claims]

(1) A data processing system comprising: a main storage device; a plurality of processing units that execute software instructions while sharing the main storage device; and a fault recovery control unit capable of reading and writing the internal states of the main storage device and the plurality of processing units, wherein each of the processing units includes an instruction counter, general registers, and an instruction execution control unit; a lock instruction and an unlock instruction are provided for controlling a partial area of the main storage device so that two or more of the processing units do not access it simultaneously; and the system further comprises: backup memory means, provided for each of the plurality of processing units, for sequentially storing the original contents of a designated location of the main storage device each time that location is written; backup register means for storing the state of the general registers; an instruction counter backup for storing the state of the instruction counter; a monitoring circuit for monitoring whether the number of stores into the backup memory means has exceeded a predetermined limit or the unlock instruction has been executed; a checkpoint control unit that, in response to the output of the monitoring circuit, sends a reset signal to the backup memory means, the backup register means, and the instruction counter backup; and a check circuit for detecting a fault within the processing unit.