JP6089766B2 - Information processing system and failure processing method for information processing apparatus

Info

Publication number: JP6089766B2
Application number: JP2013034373A
Authority: JP
Inventors: 恵子越智
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-02-25
Filing date: 2013-02-25
Publication date: 2017-03-08
Anticipated expiration: 2033-02-25
Also published as: JP2014164472A

Description

The present invention relates to an information processing system and to a failure processing method for an information processing apparatus.

Conventionally, failures in an information processing system have been detected by exchanging check data between the information processing apparatuses of a duplexed device and performing a health check on both apparatuses.

Various techniques related to this background are known (see, for example, Patent Documents 1 and 2).

For example, Patent Document 1 describes a fault monitoring scheme for a communication control apparatus that operates in a duplex configuration, and in particular a technique for improving a duplex control apparatus that must reliably detect and recover from failures even when, owing to the nature of the system, its processing load varies widely. Specifically, the apparatus includes watchdog timer means for notifying the partner system of a failure in the own system, and health check means for notifying the partner system that the own system is operating. It also includes loop monitoring means by which the own system monitors its own failures, inter-system communication state monitoring means for monitoring the state of the communication function between the own system and the partner system, and line switching device monitoring means for monitoring the line switching device that switches lines between the own system and the partner system. A failure is then judged by combining the abnormality notifications issued by these means with the current state of the own system (whether it is currently the main unit, and whether it has already received notification of another abnormality), and recovery processing is performed promptly.

As another example, Patent Document 2 describes a method for localizing a faulty part in a multiprocessor system in which a plurality of slave processors are connected to a data communication bus, and a master processor and those slave processors are connected to a control bus. That is, when a failure occurs, the method determines unambiguously whether the failure lies in the data communication bus or in a slave processor. Specifically, when communication becomes impossible while a first and a second slave processor are communicating with each other, the abnormal processor notification unit of a slave processor notifies the master processor that the counterpart processor is abnormal. An inter-slave health check instruction processing unit then selects an arbitrary normal third slave processor, and instructs the inter-slave health check processing unit of that third slave processor to perform a health check with both the slave processor that sent the notification and the slave processor that was reported. An inter-slave health check result determination processing unit receives the results, and when both health checks have failed it recognizes a failure of the data communication bus. When the health check with either the first or the second slave processor has succeeded, it recognizes the slave processor whose health check failed as the faulty side.
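The localization rule attributed to Patent Document 2 can be sketched as a small decision function. This is only an illustration: the names `Verdict` and `localize_fault` are hypothetical, and the case in which both health checks succeed is not covered by the description, so its handling here is an assumption.

```python
from enum import Enum


class Verdict(Enum):
    BUS_FAULT = "data communication bus"
    FIRST_SLAVE_FAULT = "first slave processor"
    SECOND_SLAVE_FAULT = "second slave processor"
    NOT_REPRODUCED = "no fault reproduced"  # assumption: not described in the text


def localize_fault(first_check_ok: bool, second_check_ok: bool) -> Verdict:
    """Classify a fault from the third slave's health checks of slaves 1 and 2."""
    if not first_check_ok and not second_check_ok:
        # Both checks failed: the shared data communication bus is suspected.
        return Verdict.BUS_FAULT
    if not first_check_ok:
        return Verdict.FIRST_SLAVE_FAULT
    if not second_check_ok:
        return Verdict.SECOND_SLAVE_FAULT
    return Verdict.NOT_REPRODUCED


if __name__ == "__main__":
    print(localize_fault(False, False).value)
    print(localize_fault(True, False).value)
```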

Patent Document 1: Japanese Patent No. 2578985
Patent Document 2: Japanese Patent Laid-Open No. 05-120048

As described above, failures in duplexed information processing apparatuses are detected by a health check in which check data is exchanged between the CPUs of the duplexed apparatuses. With this detection method, however, a failure of the I/O controller belonging to one information processing apparatus causes the health check to time out in both apparatuses, so it is impossible to determine which of the two apparatuses has failed.
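The ambiguity described above can be illustrated with a toy model. It assumes, for illustration only, that check data sent in either direction must cross the I/O controllers of both apparatuses; the function name and its parameters are hypothetical.

```python
from typing import Tuple


def cpu_to_cpu_health_check(io_ctrl_a_ok: bool, io_ctrl_b_ok: bool) -> Tuple[bool, bool]:
    """Return (timeout_at_a, timeout_at_b) for one round of the mutual check."""
    # Assumption: the check path is usable only if both I/O controllers work.
    path_ok = io_ctrl_a_ok and io_ctrl_b_ok
    timed_out = not path_ok
    # Both sides share the same path, so both observe the same outcome.
    return timed_out, timed_out


if __name__ == "__main__":
    fault_in_a = cpu_to_cpu_health_check(io_ctrl_a_ok=False, io_ctrl_b_ok=True)
    fault_in_b = cpu_to_cpu_health_check(io_ctrl_a_ok=True, io_ctrl_b_ok=False)
    # The two fault locations produce identical observations.
    print(fault_in_a == fault_in_b)
```

Because the two observations are identical, neither apparatus can tell from the timeout alone where the fault lies.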

For this reason, as a countermeasure against failures in an information processing system made up of a plurality of information processing apparatuses, rules are defined in advance that specify, for each predetermined case, which information processing apparatus is to be stopped. When a failure occurs in the system, the apparatus corresponding to the applicable case is stopped according to these rules, and degenerate operation continues with only the remaining apparatuses. This method does allow system operation to continue. However, it can produce a mismatch in which the apparatus that actually contains the faulty part is kept running in degenerate operation while the apparatus without any faulty part is stopped. When such a mismatch occurs, the apparatus that was kept running in degenerate operation must eventually also be stopped for maintenance and replacement of its faulty part. Consequently, this countermeasure has the problem that the entire system may be forced to stop or that the availability of the entire system may decrease, and solving this problem has been a challenge.

An object of the present invention is to provide an information processing system, and a failure processing method for an information processing apparatus, that solve the problem described above.

To solve the above problem, according to a first aspect of the present invention, there is provided an information processing system that performs processing using two systems in normal times and continues processing using only one of the systems when a failure occurs. Each system includes a computer and an I/O card, and each computer includes a CPU and an I/O controller. The I/O card has: means for periodically issuing an interrupt signal in order to perform a health check of the computer of its own system; means for transmitting, to the I/O card of the other system, a failure notification signal indicating that a failure has occurred in the computer of its own system, in response to the CPU of the computer of its own system not having processed the interrupt signal within a predetermined time; and means for transmitting, to the computer of its own system, a failure notification signal indicating that a failure has occurred in the computer of the other system, in response to receiving a failure notification signal transmitted from the I/O card of the other system. The CPU has: means for stopping its own apparatus in response to failing to receive the interrupt signal periodically transmitted from the I/O card of its own system; and means for causing its own apparatus to perform degenerate operation in response to receiving a failure notification signal transmitted from the I/O card of its own system.

According to a second aspect of the present invention, there is provided a failure processing method for an information processing apparatus that performs processing using two systems in normal times and continues processing using only one of the systems when a failure occurs. Each system includes a computer and an I/O card, and each computer includes a CPU and an I/O controller. The method includes: a step in which the I/O card periodically issues an interrupt signal in order to perform a health check of the computer of its own system; a step in which the I/O card transmits, to the I/O card of the other system, a failure notification signal indicating that a failure has occurred in the computer of its own system, in response to the CPU of the computer of its own system not having processed the interrupt signal within a predetermined time; a step in which the I/O card transmits, to the computer of its own system, a failure notification signal indicating that a failure has occurred in the computer of the other system, in response to receiving a failure notification signal transmitted from the I/O card of the other system; a step in which the CPU stops its own apparatus in response to failing to receive the interrupt signal periodically transmitted from the I/O card of its own system; and a step in which the CPU causes its own apparatus to perform degenerate operation in response to receiving a failure notification signal transmitted from the I/O card of its own system.

Note that the above summary of the invention does not enumerate all of the necessary features of the present invention. Sub-combinations of these feature groups may also constitute inventions.

As is apparent from the above description, according to the present invention, in duplexed information processing apparatuses, the apparatus containing the faulty part can be reliably stopped.

In addition to the above effect, the present invention avoids situations in which the entire system is forced to a complete stop, or in which the availability of the entire system decreases, when the faulty part is replaced for maintenance.

FIG. 1 is a configuration diagram showing the overall configuration of an information processing system according to an embodiment of the present invention.
FIG. 2 is a configuration diagram showing the configuration of the main components of the information processing system according to the embodiment of the present invention.
FIG. 3 is a sequence chart showing the operation when a failure occurs in a shared device in the information processing system according to the embodiment of the present invention.

Hereinafter, the present invention will be described through embodiments of the invention.

FIG. 1 is a configuration diagram showing the overall configuration of an information processing system according to an embodiment of the present invention. In FIG. 1, the information processing system of this embodiment (cluster system 100) includes a plurality of hosts 11-1 to 11-n (host information processing apparatuses) that access a shared resource 12. Each of the hosts 11-1 to 11-n includes a plurality of shared devices 13-1 to 13-n. Each of the shared devices 13-1 to 13-n includes two devices 14-1 and 14-2 (both information processing apparatuses). Each of the shared devices 13-1 to 13-n further includes two I/O cards 20-1 and 20-2.

In the above configuration, only the shared device is essential to the information processing system according to this embodiment; the other components are optional. In this embodiment, each of the two devices 14-1 and 14-2 includes a CPU and an I/O controller (reference numerals omitted). Furthermore, the two devices 14-1 and 14-2 that make up a shared device 13 are interconnected by an I/F (interface) cable via the I/O card 20-1 and the I/O card 20-2.

FIG. 2 is a configuration diagram showing the configuration of the main components of the information processing system according to the embodiment of the present invention. Each of the I/O card 20-1 and the I/O card 20-2 shown in FIG. 2 includes a check interrupt processing unit 60, a check interrupt reset monitoring unit 61, and a failure notification unit 62 for notifying the counterpart I/O card. Each of the I/O cards 20-1 and 20-2 also includes a failure notification reception processing unit 6A for notifications from the counterpart I/O card, and a failure notification unit 6B for notifying the higher-level device.

Each of the device 14-1 and the device 14-2 shown in FIG. 2 includes an interrupt reception processing unit 50, a check interrupt monitoring unit 51, a check interrupt reset unit 52, an own-device stop unit 53, a counterpart failure notification reception processing unit 5A for notifications from the I/O card, and an own-device degenerate operation transition unit 5B. As shown in FIG. 1, the higher-level device of the I/O card 20-1 is the device 14-1, and the higher-level device of the I/O card 20-2 is the device 14-2.

First, an overview of the functions of this system is given. In each I/O card, the check interrupt processing unit 60 periodically issues a check interrupt to its higher-level device (device 14-1 or device 14-2). The check interrupt reset monitoring unit 61 monitors whether the higher-level device resets the check interrupt issued by the check interrupt processing unit 60 within a predetermined fixed time. When this monitoring detects that the higher-level device did not reset the check interrupt within the predetermined time, the failure notification unit 62 sends a failure notification to the other I/O card.

This failure notification is received by the failure notification reception processing unit 6A of the other I/O card. The failure notification unit 6B of that I/O card then judges that the counterpart device is abnormal and sends a failure notification to its own higher-level device, which in turn shifts itself to degenerate operation.
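The I/O card behaviour outlined above can be sketched as follows. This is a minimal model, not the patented implementation: the class `IOCard`, its method names, and the tick-based `reset_deadline` are hypothetical, and the comments map each method to units 60, 61, 62, 6A and 6B only loosely.

```python
from typing import List, Optional


class IOCard:
    """Toy model of one I/O card (units 60, 61, 62, 6A and 6B)."""

    def __init__(self, name: str, reset_deadline: int) -> None:
        self.name = name
        self.reset_deadline = reset_deadline      # hypothetical unit: ticks
        self.peer: Optional["IOCard"] = None      # the other system's I/O card
        self.pending_since: Optional[int] = None  # tick of the unreset interrupt
        self.notifications_to_host: List[str] = []

    def raise_check_interrupt(self, now: int) -> None:
        """Unit 60: issue the periodic check interrupt to the higher-level device."""
        if self.pending_since is None:
            self.pending_since = now

    def reset_by_host(self) -> None:
        """Called when the higher-level device resets the check interrupt."""
        self.pending_since = None

    def poll(self, now: int) -> None:
        """Unit 61 watches the reset deadline; unit 62 notifies the other card."""
        if self.pending_since is None:
            return
        if now - self.pending_since > self.reset_deadline:
            self.pending_since = None
            if self.peer is not None:
                self.peer.receive_failure_notification(self.name)

    def receive_failure_notification(self, sender: str) -> None:
        """Units 6A and 6B: accept the notification and pass it to the own host."""
        self.notifications_to_host.append(sender)


if __name__ == "__main__":
    card1, card2 = IOCard("20-1", 3), IOCard("20-2", 3)
    card1.peer, card2.peer = card2, card1
    card1.raise_check_interrupt(now=0)   # device 14-1 never resets it
    card1.poll(now=4)                    # deadline exceeded
    print(card2.notifications_to_host)
```

Keeping the deadline check in a separate `poll` method mirrors the split between issuing the interrupt and monitoring its reset.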

The functions of the information processing system according to this embodiment are described in detail below, including a concrete way of realizing its characteristic health check. The interrupt reception processing unit 50 receives the periodic check interrupts from the subordinate I/O card 20-1 and I/O card 20-2. The check interrupt monitoring unit 51 monitors the periodic check interrupts from the subordinate I/O card 20-1 and I/O card 20-2. The check interrupt reset unit 52 resets the periodic check interrupts from the subordinate I/O card 20-1 and I/O card 20-2. The own-device stop unit 53 stops the own device. The counterpart failure notification reception processing unit 5A receives, via the I/O card, a notification of a failure in the counterpart device. In response, the own-device degenerate operation transition unit 5B shifts the own device to degenerate operation.
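The device-side units listed above can be sketched as a small state machine. The class `Device`, the `State` enum, and the tick-based `interrupt_timeout` are hypothetical names chosen for illustration; the description does not specify how long the CPU waits before treating the check interrupt as missing.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    DEGENERATE = "degenerate"
    STOPPED = "stopped"


class Device:
    """Toy model of device 14-1 or 14-2 (units 50, 51, 52, 53, 5A and 5B)."""

    def __init__(self, interrupt_timeout: int) -> None:
        self.interrupt_timeout = interrupt_timeout  # hypothetical unit: ticks
        self.last_interrupt = 0
        self.reset_count = 0
        self.state = State.NORMAL

    def on_check_interrupt(self, now: int) -> None:
        """Units 50 and 52: receive the check interrupt and reset it."""
        if self.state is State.STOPPED:
            return
        self.last_interrupt = now
        self.reset_count += 1

    def monitor(self, now: int) -> None:
        """Units 51 and 53: stop the own device if the interrupt is overdue."""
        if self.state is State.STOPPED:
            return
        if now - self.last_interrupt > self.interrupt_timeout:
            self.state = State.STOPPED

    def on_peer_failure_notification(self) -> None:
        """Units 5A and 5B: shift to degenerate operation."""
        if self.state is State.NORMAL:
            self.state = State.DEGENERATE


if __name__ == "__main__":
    survivor = Device(interrupt_timeout=13)
    survivor.on_peer_failure_notification()
    print(survivor.state.value)
```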

As described above, the information processing system according to this embodiment includes the shared resource 12, which is shared by the hosts 11-1 to 11-n. The system may be applied, for example, as an exclusive control device. In this information processing system having a shared resource, the devices (information processing apparatuses) belonging to each of the shared devices 13-1 to 13-n, which are shared by the hosts 11-1 to 11-n, are duplexed in order to achieve high system reliability. That is, each of the shared devices 13-1 to 13-n includes a duplexed pair consisting of the device 14-1 and the device 14-2.

Each of the shared devices 13-1 to 13-n is normally controlled so that the load is distributed between the device 14-1 and the device 14-2. When one of the duplexed devices fails, the shared device stops the failed device and runs the remaining device in degenerate operation. In addition, each of the I/O cards, which control the interface between the duplexed devices, is provided with a function for performing a health check with the CPU of its higher-level device (device 14-1 or device 14-2).
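The difference between normal load distribution and degenerate operation can be illustrated with a simple dispatcher. The description does not say how the load is distributed, so the round-robin policy and the function name `assign_requests` below are assumptions made purely for illustration.

```python
from itertools import cycle
from typing import Dict, List


def assign_requests(requests: List[str], active_devices: List[str]) -> Dict[str, str]:
    """Map each request to a device, cycling over the devices still running."""
    if not active_devices:
        raise RuntimeError("no device is available")
    # Assumption: round-robin stands in for the unspecified distribution policy.
    rotation = cycle(active_devices)
    return {request: next(rotation) for request in requests}


if __name__ == "__main__":
    work = ["r1", "r2", "r3"]
    print(assign_requests(work, ["14-1", "14-2"]))  # normal operation
    print(assign_requests(work, ["14-2"]))          # degenerate operation
```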

In the information processing system according to this embodiment, the health check function makes it possible to determine whether the higher-level device of each I/O card has failed. When the health check function determines that the higher-level device of an I/O card has itself failed, the system reliably disconnects that failed device from the control system. In this case, the system also shifts the device that has not failed to degenerate operation.

FIG. 3 is a sequence chart showing the operation when a failure occurs in a shared device in the information processing system according to the embodiment of the present invention. The operation in that case is described below on the basis of the sequence chart of FIG. 3, with reference to FIGS. 1 and 2.

First, the I/O card 20-1 periodically issues a check interrupt to the device 14-1 and then monitors whether the device 14-1 performs reset processing for that check interrupt within the predetermined fixed time. The device 14-1 monitors the periodic interrupts from the I/O card 20-1 and, when an interrupt arrives, performs reset processing for it within the predetermined time. The same processing is performed by the other half of the duplexed pair, the I/O card 20-2 and the device 14-2.

In the following description, it is assumed that a failure occurs in the I/O controller belonging to the device 14-1 of one of the shared devices. The CPU belonging to the device 14-1 and the I/O card 20-1 are both assumed to be normal.

Until the failure occurs, the device 14-1 and the I/O card 20-1 check each other. When, as assumed above, a failure occurs in the I/O controller of the device 14-1 during this period, the I/O card 20-1 detects that the periodic check interrupt issued to the device 14-1 has not been reset within the predetermined time. Likewise, the CPU of the device 14-1 detects that the periodic check interrupt has not arrived.

At this point, the I/O card 20-1 judges from this detection that a failure has occurred in its higher-level device 14-1, and sends a failure notification to the I/O card 20-2 on the other side of the duplexed pair. The device 14-1 likewise judges from its own detection that it has failed, and stops its own operation.

Meanwhile, the I/O card 20-2, having received the failure notification from the I/O card 20-1, notifies the device 14-2 of the failure by means of an interrupt. On receiving this failure notification (interrupt), the device 14-2 recognizes that a failure has occurred in the device 14-1 and shifts its own operating state to degenerate operation.
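The sequence described above can be traced end to end with a compact tick-based simulation. It is a sketch under several assumptions that are not in the description: time is counted in ticks, the check interrupt is issued every `period` ticks, the I/O card allows `deadline` ticks for the reset, and the CPU treats the interrupt as missing after `period + deadline` ticks. The function name `simulate` and all parameter names are hypothetical.

```python
from typing import Dict


def simulate(fail_tick: int, period: int = 10, deadline: int = 3,
             ticks: int = 40) -> Dict[str, str]:
    """Fail the I/O controller of device 14-1 at fail_tick; return final states."""
    devices = ("14-1", "14-2")
    other = {"14-1": "14-2", "14-2": "14-1"}
    state = {d: "normal" for d in devices}
    last_seen = {d: 0 for d in devices}    # last tick the CPU saw a check interrupt
    pending = {d: None for d in devices}   # tick of an interrupt awaiting reset
    reported = {d: False for d in devices}

    for t in range(1, ticks + 1):
        for d in devices:
            io_controller_ok = not (d == "14-1" and t >= fail_tick)

            # I/O card: issue the periodic check interrupt.
            if t % period == 0 and pending[d] is None and not reported[d]:
                pending[d] = t
                # It reaches the CPU only through a working I/O controller.
                if io_controller_ok and state[d] != "stopped":
                    last_seen[d] = t
                    pending[d] = None  # the CPU resets the interrupt

            # I/O card: reset monitoring, then notification via the other card.
            if pending[d] is not None and t - pending[d] > deadline:
                pending[d] = None
                reported[d] = True
                if state[other[d]] == "normal":
                    state[other[d]] = "degenerate"

            # CPU: stop the own device when the check interrupt is overdue.
            if state[d] != "stopped" and t - last_seen[d] > period + deadline:
                state[d] = "stopped"
    return state


if __name__ == "__main__":
    print(simulate(fail_tick=15))
```

With the default parameters, a failure at tick 15 leaves device 14-1 stopped and device 14-2 in degenerate operation, which matches the outcome of the sequence chart.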

As described above, the information processing system according to this embodiment makes it possible, in duplexed information processing apparatuses, to reliably stop the device that contains the faulty part.

In the information processing system according to this embodiment, the device that does not contain the faulty part is shifted automatically to degenerate operation and is never stopped. Therefore, in addition to the above effect, the system avoids situations in which the entire system is forced to a complete stop, or in which the availability of the entire system decreases, when the faulty part is replaced for maintenance.

12 Shared resource
50 Interrupt reception processing unit
51 Check interrupt monitoring unit
52 Check interrupt reset unit
53 Own-device stop unit
5A Counterpart failure notification reception processing unit (notifications from the I/O card)
5B Own-device degenerate operation transition unit
60 Check interrupt processing unit
61 Check interrupt reset monitoring unit
62 Failure notification unit (to the counterpart I/O card)
6A Failure notification reception processing unit (from the counterpart I/O card)
6B Failure notification unit (to the higher-level device)
11-1 to 11-n Hosts
13-1 to 13-n Shared devices
14-1, 14-2 Devices
20-1, 20-2 I/O cards
100 Cluster system

Claims

An information processing system that performs processing using two systems in normal times and continues processing using only one of the systems when a failure occurs, wherein
each of the systems comprises a computer and an I/O card,
each of the computers comprises a CPU and an I/O controller,
the I/O card has:
means for periodically issuing an interrupt signal in order to perform a health check of the computer of its own system;
means for transmitting, to the I/O card of the other system, a failure notification signal indicating that a failure has occurred in the computer of its own system, in response to the CPU of the computer of its own system not having processed the interrupt signal within a predetermined time; and
means for transmitting, to the computer of its own system, a failure notification signal indicating that a failure has occurred in the computer of the other system, in response to receiving a failure notification signal transmitted from the I/O card of the other system,
and the CPU has:
means for stopping its own apparatus in response to failing to receive the interrupt signal periodically transmitted from the I/O card of its own system; and
means for causing its own apparatus to perform degenerate operation in response to receiving a failure notification signal transmitted from the I/O card of its own system.

A failure processing method for an information processing apparatus that performs processing using two systems in normal times and continues processing using only one of the systems when a failure occurs, wherein
each of the systems comprises a computer and an I/O card, and
each of the computers comprises a CPU and an I/O controller,
the method comprising:
a step in which the I/O card periodically issues an interrupt signal in order to perform a health check of the computer of its own system;
a step in which the I/O card transmits, to the I/O card of the other system, a failure notification signal indicating that a failure has occurred in the computer of its own system, in response to the CPU of the computer of its own system not having processed the interrupt signal within a predetermined time;
a step in which the I/O card transmits, to the computer of its own system, a failure notification signal indicating that a failure has occurred in the computer of the other system, in response to receiving a failure notification signal transmitted from the I/O card of the other system;
a step in which the CPU stops its own apparatus in response to failing to receive the interrupt signal periodically transmitted from the I/O card of its own system; and
a step in which the CPU causes its own apparatus to perform degenerate operation in response to receiving a failure notification signal transmitted from the I/O card of its own system.