JP2008186173A - Fault monitoring system

Info

Publication number: JP2008186173A
Application number: JP2007018151A
Authority: JP
Inventors: Takashi Inoue; 貴司井上; Kazunari Otogawa; 一成乙川
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2007-01-29
Filing date: 2007-01-29
Publication date: 2008-08-14

Abstract

PROBLEM TO BE SOLVED: To provide a fault monitoring system capable of recording fault occurrences in detail by having a plurality of OSs mutually monitor one another's states.

SOLUTION: In a computer system 100 having a processor 1 that runs a plurality of OSs 50 and 51, the OSs 50 and 51 constituting the fault monitoring system each comprise: self-state recording means 500, 510 for recording the state of the OS 50, 51 in its own-OS correspondence area 20, 21 of a shared memory 2; other-state recording means 501, 511 for referring to the recorded content of the other-OS correspondence area 21, 20 of the shared memory 2 and, when that content indicates a predetermined state of the other OS 51, 50, recording predetermined content in the other-OS correspondence area 21, 20; and other-state monitoring means 502, 512 for monitoring the other OS 51, 50 on the basis of the recorded content of the own-OS correspondence area 20, 21.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a failure monitoring system in which a plurality of operating systems mutually monitor one another's states and which can record the circumstances of a failure in detail.

A computer that monitors the state of an operating system (hereinafter "OS") without adding dedicated hardware such as a watchdog is known in the prior art (see, for example, Patent Document 1).

This computer runs two independent OSs (a first OS and a second OS) on a processor and executes, on the second OS, a failure monitor, which is a software program that monitors the state of the first OS. The failure monitor determines whether a failure has occurred in the first OS on the basis of a predetermined signal sent from the first OS to the second OS via inter-OS communication means that enables communication between the two OSs. In other words, the computer has the second OS, which is software, stand in for a watchdog, which is hardware, so that failures of the first OS can be monitored by a function independent of the first OS.

Patent Document 1: JP 2001-101033 A

However, the computer described in Patent Document 1 cannot monitor the state of the second OS, so a failure in the second OS impairs the reliability of the computer as a whole. This is because the state of the first OS would then be monitored by a failure monitor running on the very OS that has failed.

In view of the above, an object of the present invention is to provide a failure monitoring system in which a plurality of operating systems mutually monitor one another's states and which can record the circumstances of a failure in detail.

To achieve the above object, a failure monitoring system according to a first invention causes the operating systems in a computer system having a processor that runs a plurality of operating systems to mutually monitor one another's states, wherein each operating system comprises: own-state recording means for recording the state of the own operating system in an own-operating-system corresponding area of a shared memory; other-state recording means for referring to the recorded content of an other-operating-system corresponding area of the shared memory and, when that content indicates a predetermined state of the other operating system, recording predetermined content in the other-operating-system corresponding area; and other-state monitoring means for monitoring the state of the other operating system on the basis of the recorded content of the own-operating-system corresponding area.

A second invention is the failure monitoring system according to the first invention, wherein the other-state recording means refers to the recorded content of the other-operating-system corresponding area each time the own-state recording means records the state of the own operating system.

A third invention is the failure monitoring system according to the first or second invention, wherein the processor is a multi-core processor and the plurality of operating systems are each executed on different cores.

By the above means, the present invention can provide a failure monitoring system in which a plurality of operating systems mutually monitor one another's states and which can record the circumstances of a failure in detail.

The best mode for carrying out the present invention is described below with reference to the drawings.

FIG. 1 is a diagram showing a configuration example of a computer system equipped with the failure monitoring system according to the present invention.

The computer system 100 comprises a multi-core processor 1, a volatile memory 2, and a non-volatile memory 3, which are interconnected via a system bus 4.

The multi-core processor 1 integrates two or more processor cores in a single package. The computer system 100 uses a multi-core single-processor configuration consisting of one multi-core processor 1, but it may instead use a multi-core multi-processor configuration consisting of several multi-core processors.

The computer system 100 uses the multi-core processor 1 to improve real-time performance by distributing multiple functions, to cut cost by reducing the part count, or to save power. The failure monitoring system according to the present invention is nevertheless also applicable to a single-core single-processor configuration consisting of one single-core processor and to a single-core multi-processor configuration consisting of several single-core processors.

The computer system 100 also uses a tightly coupled multiprocessor configuration with shared memory. It may be a symmetric multiprocessor configuration, in which all processors are peers, or an asymmetric multiprocessor configuration, in which the work assigned to each processor is fixed in advance.

The multi-core processor 1 has four CPU cores: a first CPU (Central Processing Unit) 10, a second CPU 11, a third CPU 12, and a fourth CPU 13. It uses a multi-OS configuration in which the first CPU 10 runs the first OS 50 and the other three CPU cores (the second CPU 11, the third CPU 12, and the fourth CPU 13) run the second OS 51.

A "multi-OS configuration" is a configuration in which a plurality of OSs run independently on a processor. One example is a hybrid OS that combines a real-time OS well suited to real-time processing and used in embedded systems, such as μITRON (Micro Industrial TRON (The Real-time Operating system Nucleus)), with a feature-rich general-purpose OS such as Windows (registered trademark) or Linux (registered trademark).

The first OS 50 and the second OS 51 may each be either a real-time OS or a general-purpose OS. They have own-state recording means 500 and 510, other-state recording means 501 and 511, and other-state monitoring means 502 and 512, respectively.

The volatile memory 2 is a semiconductor memory used as main storage, where fast access is required, for example a DRAM (Dynamic Random Access Memory) or an SRAM (Static RAM). It is shared by the first OS 50 and the second OS 51.

The non-volatile memory 3 is a memory that retains its contents after power is removed, for example a flash memory, an FRAM (Ferroelectric RAM), or an MRAM (Magneto-resistive RAM). Like the volatile memory 2, it is shared by the first OS 50 and the second OS 51. The non-volatile memory 3 may also be an auxiliary storage device such as a hard disk.

FIG. 2 is a diagram showing configuration examples of the memories: FIG. 2(A) shows a configuration example of the volatile memory 2, and FIG. 2(B) shows a configuration example of the non-volatile memory 3.

The volatile memory 2 has a first OS startup information area 20, which stores the startup state of the first OS 50, and a second OS startup information area 21, which stores the startup state of the second OS 51 (see FIG. 2(A)).

The non-volatile memory 3 has a first OS failure information area 30, which stores failure information for the first OS 50, and a second OS failure information area 31, which stores failure information for the second OS 51. The failure information areas 30 and 31 have failure occurrence count areas 300 and 310, which store the number of failures, and failure content areas 301 and 311, which store the content of the failures, respectively (see FIG. 2(B)).
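The memory layout described above can be sketched as a few plain data structures. This is only an illustrative model: the class and field names are assumptions introduced here, and strings and integers stand in for whatever encoding an actual implementation would use. The reference numerals in the comments are the ones used in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StartupInfoAreas:
    """Volatile memory 2: one startup information area per OS."""
    first_os: Optional[str] = None   # area 20; None models a value not yet set
    second_os: Optional[str] = None  # area 21


@dataclass
class FailureInfoArea:
    """One per-OS failure information area (30 or 31)."""
    occurrence_count: int = 0        # failure occurrence count area (300 / 310)
    content: Optional[str] = None    # failure content area (301 / 311)


@dataclass
class FailureInfoAreas:
    """Non-volatile memory 3: one failure information area per OS."""
    first_os: FailureInfoArea = field(default_factory=FailureInfoArea)   # area 30
    second_os: FailureInfoArea = field(default_factory=FailureInfoArea)  # area 31
```

Keeping the startup areas and the failure areas in separate structures mirrors the split between volatile and non-volatile memory: startup state is lost at power-off, while failure records survive it.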

The system bus 4 connects the CPUs and the memories. It may consist, for example, of a front-side bus connecting the CPUs to a chipset such as a northbridge and a memory bus connecting the chipset to the memories.

Next, the own-state recording means 500 and 510, the other-state recording means 501 and 511, and the other-state monitoring means 502 and 512 of the first OS 50 and the second OS 51 are described.

The own-state recording means 500 and 510 record the state of the own OS. For example, the own-state recording means 500 of the first OS 50 records the startup state of the first OS 50 in the first OS startup information area 20 of the volatile memory 2, and the own-state recording means 510 of the second OS 51 records the startup state of the second OS 51 in the second OS startup information area 21 of the volatile memory 2.

The other-state recording means 501 and 511 record the state of the other OS. For example, the other-state recording means 501 of the first OS 50 rewrites the value indicating the startup state of the second OS 51 (hereinafter the "state value") recorded in the second OS startup information area 21 of the volatile memory 2 to a predetermined state value, and the other-state recording means 511 of the second OS 51 rewrites the state value of the first OS 50 recorded in the first OS startup information area 20 of the volatile memory 2 to a predetermined state value.

The other-state monitoring means 502 and 512 monitor the state of the other OS. For example, the other-state monitoring means 502 of the first OS 50 detects an abnormality in the startup process of the second OS 51 (hereinafter simply a "failure") by monitoring the content of the first OS startup information area 20, which is rewritten by the other-state recording means 511 of the second OS 51. When a failure occurs in the second OS 51, its other-state recording means 511 can no longer rewrite the first OS startup information area 20, so the first OS 50 can detect the failure indirectly from the absence of a rewrite by the second OS 51.

Conversely, the other-state monitoring means 512 of the second OS 51 detects a failure of the first OS 50 by monitoring the content of the second OS startup information area 21, which is rewritten by the other-state recording means 501 of the first OS 50.

Next, the process by which the first OS 50 and the second OS 51 running on the computer system 100 monitor each other's startup state (hereinafter the "startup state monitoring process") is described with reference to FIG. 3, a flowchart showing the flow of that process.

First, the computer system 100 initializes the first CPU 10 and then starts the first OS 50 on the first CPU 10.

When the first OS 50 begins its startup process, it uses the own-state recording means 500 to record the state value indicating "startup started" in the first OS startup information area 20 of the volatile memory 2 (step S1).

At this point the first OS 50 runs a process that rewrites information about the partner OS being monitored (the second OS 51) (hereinafter the "other-state recording process", described later). Using the other-state recording means 501, it refers to the state value of the second OS 51 recorded in the second OS startup information area 21 of the volatile memory 2 and, if that value indicates "startup started" for the second OS 51, rewrites it to the state value indicating "startup start confirmed".

The first OS 50 executes the other-state recording process (described later) each time it records its own state value in the first OS startup information area 20 using the own-state recording means 500.

The first OS 50 then periodically refers to the state value recorded in the first OS startup information area 20 and waits until the "startup started" value recorded there is rewritten to the "startup start confirmed" value by the other-state recording process (described later) of the second OS 51 (step S2).

On detecting that the state value recorded in the first OS startup information area 20 has been rewritten to "startup start confirmed" (YES in step S2), the first OS 50 uses the own-state recording means 500 to record the state value indicating "checkpoint 1" in the first OS startup information area 20 (step S3) and starts the process of initializing the various settings of the first OS 50 (hereinafter the "OS initialization process") (step S4).

If, on the other hand, the state value recorded in the first OS startup information area 20 has not been rewritten to "startup start confirmed" (NO in step S2) and the number of references to the first OS startup information area 20 exceeds a predetermined number (YES in step S5), the first OS 50 uses the other-state recording means 501 to increment the failure count recorded in the failure occurrence count area 310 of the second OS failure information area 31 in the non-volatile memory 3 (step S6), records a value representing "startup confirmation failure" as the failure content in the failure content area 311 of the second OS failure information area 31 (step S7), and then starts the OS initialization process (step S4).

The first OS 50 may also record the time of the failure, its own state value, various other settings, and the like in the failure content area 311 of the second OS failure information area 31, for use in later debugging.

The first OS 50 may also send a control signal to the second OS 51 so that the second OS 51 executes a failure handling process addressing the failure that the first OS 50 recorded for it.

Here, the "failure handling process" is a process that makes the partner OS remove a failure that has occurred in it. Depending on the number of failures and their content, it may, for example, reset the partner OS, restart it, or degrade it. Degrading means restricting some functions of the OS while keeping others running, for example by stopping some of the processor cores on which the partner OS runs.
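One way to picture how the failure count could select among these actions is sketched below. The text says only that the choice depends on the number and content of failures; the function name, the two limits, and the escalation order (reset, then restart, then degrade) are all hypothetical and chosen purely for illustration.

```python
def choose_failure_action(failure_count: int,
                          reset_limit: int = 1,
                          restart_limit: int = 3) -> str:
    """Pick a failure handling action from the recorded failure count.

    Both limits and the escalation order are hypothetical; the source does
    not specify how the count maps to an action.
    """
    if failure_count <= reset_limit:
        return "reset"
    if failure_count <= restart_limit:
        return "restart"
    # Degrade: e.g. stop some of the cores on which the partner OS runs.
    return "degrade"
```

With the default limits, a first failure yields `"reset"`, the second and third yield `"restart"`, and any later failure yields `"degrade"`.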

The first OS 50 then uses the own-state recording means 500 to record the state value indicating "checkpoint 2" in the first OS startup information area 20 (step S8), confirms that the OS initialization process has completed (step S9), records the state value indicating "startup complete" in the first OS startup information area 20 using the own-state recording means 500 (step S10), and ends the startup state monitoring process.

The startup state monitoring process in which the second OS 51 monitors the first OS 50 is the same as the process described above in which the first OS 50 monitors the second OS 51, and runs on the second OS 51 in parallel with and independently of it.
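Steps S1 to S10 can be sketched as a single function run by each OS. This is a simplified single-threaded model, not the patented implementation: the shared memory is a dictionary, the state values are strings, and `partner_tick` and `os_init` are hypothetical hooks (the first lets a test stand in for the partner OS running concurrently, the second for the OS initialization process). Because `os_init` is a blocking call here, the start of initialization (S4) and its completion check (S9) collapse into that one call, and "checkpoint 2" (S8) is written just before it.

```python
START = "startup started"
CONFIRMED = "startup start confirmed"
CP1, CP2, DONE = "checkpoint 1", "checkpoint 2", "startup complete"


def start_with_monitoring(shared, failures, own, other, max_refs,
                          partner_tick=lambda: None, os_init=lambda: None):
    """Run steps S1-S10 for the OS named `own`, watching the OS named `other`."""

    def record_own(value):
        shared[own] = value
        # Other-state recording process, run after every own-state write.
        if shared.get(other) == START:
            shared[other] = CONFIRMED

    record_own(START)                                   # S1
    refs = 0
    while True:
        partner_tick()  # hypothetical hook: the partner OS gets a chance to run
        refs += 1
        if shared[own] == CONFIRMED:                    # S2: YES
            record_own(CP1)                             # S3
            break
        if refs > max_refs:                             # S5: YES
            failures[other]["count"] += 1               # S6
            failures[other]["content"] = "startup confirmation failure"  # S7
            break
    record_own(CP2)                                     # S8
    os_init()                                           # S4; returning models S9
    record_own(DONE)                                    # S10
```

Note that the timeout path does not block startup: after recording the partner's failure, the OS proceeds with its own initialization, which matches the behavior described for step S5.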

Next, the other-state recording process is described with reference to FIG. 4, a flowchart showing its flow. The first OS 50 is assumed to execute the other-state recording process each time it records its own state value in the first OS startup information area 20 using the own-state recording means 500.

First, the first OS 50 uses the other-state recording means 501 to refer to the second OS startup information area 21 in the volatile memory 2 and determines whether the state value of the second OS 51 is the value indicating "startup started" (step S20).

If the state value of the second OS 51 is "startup started" (YES in step S20), the first OS 50 rewrites the state value in the second OS startup information area 21 to the value indicating "startup start confirmed" (step S21).

If the state value of the second OS 51 is not "startup started" (NO in step S20), the first OS 50 leaves the state value in the second OS startup information area 21 unchanged and ends the other-state recording process.

The other-state recording process that the second OS 51 performs for the first OS 50 is the same as the process described above that the first OS 50 performs for the second OS 51, and runs on the second OS 51 in parallel with and independently of it.
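Steps S20 and S21 amount to a conditional rewrite of the partner's area. The sketch below models the shared memory as a dictionary and the state values as strings; the function name and the boolean return value are assumptions added for illustration.

```python
START = "startup started"
CONFIRMED = "startup start confirmed"


def record_other_state(shared: dict, other_area: str) -> bool:
    """Acknowledge the partner's 'startup started' value (steps S20-S21).

    Returns True if the partner's area was rewritten, False if it was left
    unchanged. The return value is an illustrative addition.
    """
    if shared.get(other_area) == START:   # S20: YES
        shared[other_area] = CONFIRMED    # S21
        return True
    return False                          # S20: NO, leave the value as it is
```

On real hardware with two OSs on different cores, the check and the write would presumably need to be atomic or otherwise ordered with respect to the partner's own writes; the source does not discuss this, so the sketch ignores it.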

Next, the process by which the first OS 50 and the second OS 51 running on the computer system 100 monitor each other for failures (hereinafter the "other-state monitoring process") is described with reference to FIG. 5, a flowchart showing its flow. Each OS is assumed to execute the other-state monitoring process repeatedly after completing its own startup process.

First, the first OS 50 uses the other-state monitoring means 502 to refer to the second OS startup information area 21 in the volatile memory 2 and determines whether the state value of the second OS 51 is the value indicating "startup complete" (step S30).

If the state value of the second OS 51 is "startup complete" (YES in step S30), the first OS 50 erases the information about failures of the second OS 51 (the failure count and failure content) recorded in the second OS failure information area 31 in the non-volatile memory 3 (step S31) and ends the other-state monitoring process.

This is because, even if the first OS 50 had detected a failure in the second OS 51 while the second OS 51 was starting, it has now detected that the second OS 51 ultimately completed its startup process.

Even when the state value of the second OS 51 becomes "startup complete", the first OS 50 may leave the content that the other-state recording means 501 recorded in the second OS failure information area 31 in place rather than erasing it, so that it remains available for later debugging and the like.

If the state value of the second OS 51 is not "startup complete" (NO in step S30), the first OS 50 increments the count of step S30 determinations and, if the count is below a predetermined number (NO in step S32), ends the other-state monitoring process for the time being.

If, on the other hand, the count exceeds the predetermined number (YES in step S32), the first OS 50 uses the other-state monitoring means 502 to increment the failure count recorded in the failure occurrence count area 310 of the second OS failure information area 31 in the non-volatile memory 3 (step S33) and records the state value held in the second OS startup information area 21 as the failure content in the failure content area 311 of the second OS failure information area 31 (step S34).

The first OS 50 then sends a control signal to the second OS 51 so that the second OS 51 executes a failure handling process addressing the failure that the first OS 50 recorded for it, and ends the other-state monitoring process.
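Steps S30 to S34 can be sketched as one function that is called repeatedly. As before, the dictionaries, string state values, return values, and the `send_control` hook are illustrative assumptions. The text describes the count as being "below" or "exceeding" the predetermined number without saying what happens when they are equal; this sketch treats equality as not yet exceeded. It also does not reset the determination count after a failure is recorded, since the source does not say whether it should be.

```python
def monitor_other(shared, failures, other, counts, max_checks,
                  send_control=lambda: None):
    """One pass of the other-state monitoring process (steps S30-S34)."""
    if shared.get(other) == "startup complete":             # S30: YES
        failures[other] = {"count": 0, "content": None}     # S31: erase records
        return "ok"
    counts[other] = counts.get(other, 0) + 1                # count S30 determinations
    if counts[other] <= max_checks:                         # S32: NO
        return "pending"
    failures[other]["count"] += 1                           # S33
    failures[other]["content"] = shared.get(other)          # S34: last state value
    send_control()  # hypothetical hook: ask the partner to handle its failure
    return "failure"
```

Recording the partner's last state value as the failure content (S34) is what lets the record show how far the partner's startup got, for example "checkpoint 1" rather than simply "failed".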

Next, an example of how the state value recorded in each OS startup information area changes is described with reference to FIG. 6, which shows an example of state-value transitions in the first OS startup information area 20. In this example, the "monitoring side" is the second OS 51, the monitoring OS, and the "monitored side" is the first OS 50, the monitored OS.

The first OS startup information area 20 has the following states: an initial state P1, in which the state value is indeterminate; a state P2, in which "startup started" is set; a state P3, in which "startup start confirmed" is set; a state P4, in which "checkpoint 1" is set; a state P5, in which "checkpoint 2" is set; a state P6, in which "startup complete" is set; and a state P7, in which a value indicating one of various failure conditions is set.

State P1 is the state after the computer system 100 has been powered on, or after the monitoring side has restarted the monitored side from any of states P4 to P7; for example, the state after the first CPU 10 has been initialized or after the second OS 51 has restarted the first OS 50.

The monitoring side restarts a monitored side (first OS 50) that is in state P6 only when the entire computer system 100 needs to be restarted, because the monitored side (first OS 50) has completed its startup process normally.

State P2 is the state after the monitored side, in state P1, has begun its own startup process; for example, the state in which the first OS 50 has begun its own startup process. The state value is rewritten by the monitored side, the first OS 50.

State P3 is the state after the monitoring side has confirmed that the monitored side is in state P2; for example, the state after the second OS 51 has confirmed that the first OS 50 has begun its own startup process. The state value is rewritten by the monitoring side, the second OS 51.

State P4 is the state after the monitored side has itself confirmed that it is in state P3; for example, the state after the second OS 51 has confirmed that the first OS 50 has begun its own startup process and the first OS 50 has in turn confirmed that confirmation. The state value is rewritten by the monitored side, the first OS 50.

If the monitored side does not move from state P2 to state P3 and instead remains in state P2 for a predetermined period, it forcibly moves from state P2 to state P4 and starts its own OS initialization process. This prevents the monitored side from stalling at state P3 because of a failure on the monitoring side. In this case, the monitored side records the fact that a failure has occurred on the monitoring side in the failure information area of the monitoring-side OS.

As a result, when the computer system 100 starts a plurality of OSs with different startup speeds at around the same time, it can start them quickly, in order from the fastest-starting OS, while having the OSs mutually monitor one another's startup states, without making a fast-starting OS wait until a slow-starting OS has started.

State P5 is the state in which the monitored side has started its own OS initialization processing, for example, the state in which the first OS 50 has started its own OS initialization processing. The state value is rewritten by the first OS 50, which is the monitored side.

State P6 is the state after the monitored side has completed its own OS initialization processing, for example, the state after the first OS 50 has completed its own OS initialization processing. The state value is rewritten by the first OS 50, which is the monitored side.

State P7 is the state after the monitored side has detected its own failure in state P4 or state P5, for example, the case where the first OS 50 has detected a failure in itself. The state value is rewritten by the first OS 50, which is the monitored side.
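
The state descriptions in this passage can be summarized as a small transition table that records which side rewrites the state value. The following Python sketch is illustrative only: the names `Side`, `TRANSITIONS`, and `rewrite_state` are hypothetical, the P4 to P5 edge is inferred from the ordering of the descriptions, and the table covers only the transitions mentioned in this passage, so the embodiment may define others.

```python
from enum import Enum


class Side(Enum):
    MONITORED = "monitored"    # e.g. the first OS 50
    MONITORING = "monitoring"  # e.g. the second OS 51


# (current state, new state) -> side that rewrites the state value.
# Assumption: only edges described in this passage are listed.
TRANSITIONS = {
    ("P1", "P2"): Side.MONITORED,   # startup processing started
    ("P2", "P3"): Side.MONITORING,  # monitoring side confirmed P2
    ("P3", "P4"): Side.MONITORED,   # monitored side confirmed P3
    ("P2", "P4"): Side.MONITORED,   # forced transition after a timeout
    ("P4", "P5"): Side.MONITORED,   # OS initialization started (inferred)
    ("P5", "P6"): Side.MONITORED,   # OS initialization completed
    ("P4", "P7"): Side.MONITORED,   # own failure detected
    ("P5", "P7"): Side.MONITORED,   # own failure detected
}


def rewrite_state(current, new, writer):
    """Return the new state value if `writer` may perform this rewrite."""
    allowed_writer = TRANSITIONS.get((current, new))
    if allowed_writer is None:
        raise ValueError(f"no transition {current} -> {new} in this sketch")
    if allowed_writer is not writer:
        raise PermissionError(
            f"{current} -> {new} is rewritten by the {allowed_writer.value} side")
    return new
```

Keeping the writer next to each edge makes it visible that P3 is the only state value in this passage that the monitoring side writes into the area of the monitored OS.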

With the above configuration, even when a failure occurs in one of its OSs because of a processor failure, a hardware failure, a software failure, or the like, the computer system 100 can detect the failure early and execute failure handling processing, and can thus prevent the failure from impairing the reliability of the system as a whole.

In addition, because the computer system 100 records the startup state of each OS in detail, it can identify the stage at which a failure in each OS occurred.

In addition, because the computer system 100 records the failure information of each OS in the nonvolatile memory 3, it can reliably retain that information even if the power supply to the computer system 100 is interrupted.

In addition, by adopting a multi-core design, the computer system 100 can continue processing on the remaining normal cores even when an abnormality occurs in some of the cores. Raising the probability that each OS operates normally makes the recording of failure information more reliable and enables early discovery of the cause of a failure.

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to those embodiments, and various modifications and substitutions can be made to them without departing from the scope of the present invention.

For example, in the above embodiment the computer system 100 has the OSs mutually monitor their startup states, but it may instead have them mutually monitor operating states other than the startup state, such as a transition to power-saving mode or a transition to normal mode.

In the above embodiment, the computer system 100 executes the other-state recording processing each time the first OS 50 records its own state value in the first OS startup information area 20 by means of the own-state recording means 500. Alternatively, it may execute the other-state recording processing only when the first OS 50 has recorded a predetermined state value, or at a predetermined cycle. This gives flexibility to the timing at which the other-state recording processing is executed.
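
The three timing options described above could be expressed as a small trigger policy. The Python sketch below is only one possible reading: the `Trigger` class, its mode names, and the `should_run` function are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Trigger:
    # Hypothetical mode names: "every_write", "on_states", or "periodic".
    mode: str
    states: FrozenSet[str] = frozenset()  # used by "on_states"
    period_s: float = 0.0                 # used by "periodic"


def should_run(trigger: Trigger, wrote_state: Optional[str],
               elapsed_s: float) -> bool:
    """Decide whether to execute the other-state recording processing now.

    wrote_state -- state value the OS has just recorded in its own area,
                   or None if nothing was recorded
    elapsed_s   -- seconds since the processing was last executed
    """
    if trigger.mode == "every_write":
        return wrote_state is not None
    if trigger.mode == "on_states":
        return wrote_state in trigger.states
    if trigger.mode == "periodic":
        return elapsed_s >= trigger.period_s
    raise ValueError(f"unknown trigger mode: {trigger.mode}")
```

For instance, `Trigger("on_states", frozenset({"P6"}))` would run the processing only when the OS records P6 for itself, while `Trigger("periodic", period_s=0.5)` ignores writes and runs on a fixed cycle.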

In the above embodiment, the computer system 100 has two OSs monitor each other's startup states. However, in a computer system that starts three OSs independently, for example, the first OS may monitor the startup states of the second and third OSs, and either or both of the second and third OSs may monitor the startup state of the first OS.
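
One way to picture the three-OS variation is as a mapping from each monitoring OS to the OSs it monitors, together with a check that no OS is left unmonitored. The Python sketch below is an assumption-laden illustration: the identifiers `os1` to `os3`, the `MONITORS` mapping, and the `unmonitored` helper are hypothetical.

```python
# Hypothetical monitoring assignment for three independently started OSs:
# the first OS monitors the second and third, and here only the second
# OS monitors the first (the text allows either or both to do so).
MONITORS = {
    "os1": {"os2", "os3"},
    "os2": {"os1"},
    "os3": set(),
}


def unmonitored(monitors):
    """Return the OSs that no other OS in the mapping monitors."""
    all_os = set(monitors)
    watched = set().union(*monitors.values())
    return all_os - watched
```

With the assignment above, `unmonitored(MONITORS)` is empty, so a failure of any one OS can still be noticed by another.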

Likewise, in a computer system that starts four or more OSs, the failure monitoring system according to the present invention allows the combinations of monitoring-side and monitored-side OSs to be set flexibly.

A diagram showing a configuration example of a computer system provided with the failure monitoring system.
A diagram showing configuration examples of the various memories.
A flowchart showing the flow of the startup state monitoring processing.
A flowchart showing the flow of the other-state recording processing.
A flowchart showing the flow of the other-state monitoring processing.
A diagram showing an example of the transitions of the state value recorded in an OS startup information area.

Explanation of symbols

1 Multi-core processor
2 Volatile memory
3 Nonvolatile memory
4 System bus
10 to 13 CPU
20, 21 OS startup information area
30, 31 OS failure information area
50, 51 Operating system
300, 310 Failure occurrence count area
301, 311 Failure content area
500, 510 Own-state recording means
501, 511 Other-state recording means
502, 512 Other-state monitoring means
P1 to P7 States of the OS startup information

Claims

A fault monitoring system that causes the states of operating systems to be mutually monitored in a computer system having a processor that runs a plurality of operating systems,
wherein each operating system comprises: own-state recording means for recording the state of the own operating system in the area of a shared memory corresponding to the own operating system; other-state recording means for referring to the recorded content of the area of the shared memory corresponding to another operating system and, when that recorded content indicates a predetermined state of the other operating system, recording predetermined content in the area corresponding to the other operating system; and other-state monitoring means for monitoring the state of the other operating system on the basis of the recorded content of the area corresponding to the own operating system.

The fault monitoring system according to claim 1, wherein the other-state recording means refers to the recorded content of the area corresponding to the other operating system each time the own-state recording means records the state of the own operating system.

The fault monitoring system according to claim 1 or 2, wherein the processor is a multi-core processor and the plurality of operating systems run on respectively different cores.