JP2004046474A

JP2004046474A - Multi-OS environmental computer system and program

Info

Publication number: JP2004046474A
Application number: JP2002202101A
Authority: JP
Inventors: Manabu Chikada; 近田　学; Sadaji Karasaki; 唐崎　貞二; Koshin Mori; 森　康臣; Yasuhiko Imai; 今井　康彦
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2002-07-11
Filing date: 2002-07-11
Publication date: 2004-02-12

Abstract

PROBLEM TO BE SOLVED: To construct, on a single computer and without adding dedicated hardware, a virtual cluster system that prevents system stoppage when an OS failure occurs.

SOLUTION: In a multi-OS environment computer comprising a first OS 26, a second OS 27 and a multi-OS control part 25, the multi-OS control part 25 issues an OS failure detection instruction to each OS 26, 27, and each OS 26, 27 performs OS failure detection using common disc areas in HDDs 15-17, which are computer resources. When an OS failure occurs, the multi-OS control part 25 controls the OSs so that the application programs run on the normally operating OS. When the failed OS is restored, the multi-OS control part 25 controls the OSs so as to return to the original processing state.

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、計算機システムのマルチＯＳ制御に関し、特にＯＳ障害監視およびＯＳ障害発生時の処理と復旧のためのマルチＯＳ環境の計算機システムおよびそのプログラムに関する。
【０００２】
【従来の技術】
計算機のＯＳに障害が発生した場合、システム停止となり、業務続行は不能状態となる。従来より、計算機が停止しても業務プログラムを他の計算機で処理させ、システム停止を防ぐクラスタシステムがよく用いられている。クラスタシステムは、企業の基幹システムを動作させる場合のように、いかなる事態が発生しても停止することが許されない計算機を構築する際の手段である。通常、２台ないしそれ以上の同スペックのマシンを用意して、それらが常にメモリやハードディスクの内容を同期させながら動作する。動作中のマシンが障害などでダウンすると、他のマシンが自動的に動作を引き継ぐことになる。
このように、一般的なクラスタシステムは、計算機を複数台と共有ディスク装置で構成される。また、アプリケーションプログラムを各計算機で分散して処理を行い、各計算機の負荷分散を行う方法もよく用いられる手法である。
【０００３】
一方、マルチＯＳ環境を持つ計算機システムとしては、例えば、特開２００１−１０１０３３公報で提案されている。これは、マルチＯＳ環境を持つ計算機により、第１のＯＳ、第２のＯＳ、第２のＯＳ上で動作する障害監視モニタおよびマルチＯＳ制御部でＯＳ障害を検出し、第１のＯＳ障害と再起動によっても情報が保持される記憶領域に第１のＯＳ上で動作しているアプリケーションプログラムのチェックポイント情報を保持し、再起動後はこのチェックポイント情報からアプリケーションプログラムを再開させるようにしたものである。
【０００４】
【発明が解決しようとする課題】
しかしながら、複数の計算機と共通ディスク装置を用いてクラスタシステムを構築するためには、複数台の計算機を配置しなければならず、高価になる。
また、前記公報に記載のシステムでは、ＯＳ障害を検出し、その後障害発生したＯＳを再起動し、アプリケーションプログラムをチェックポイント情報から再開することが可能であるが、ＯＳ障害からＯＳ再起動までの間はアプリケーションプログラムは停止し、業務自体は停止することになる。
【０００５】
そこで、本発明の目的は、これら従来の課題を解決し、マルチＯＳ環境の計算機１台だけで、専用ハードウェアを追加することなく、ＯＳ障害を検出することができ、ＯＳ障害発生時にアプリケーションプログラムの切り替えを意識せずに正常動作しているＯＳに引き継ぐことで、アプリケーションプログラムを停止させることなく、かつ障害発生したＯＳを復旧させる機能を有し、ＯＳ障害に対して信頼性の高いマルチＯＳ環境の計算機システムおよびそのプログラムを提供することにある。
【０００６】
【課題を解決するための手段】
上記目的を達成するため、本発明のマルチＯＳ環境の計算機システムは、第１のＯＳ、第２のＯＳ、計算機資源を管理するマルチＯＳ制御部から構成されるマルチＯＳ環境の計算機にて、マルチＯＳ制御部より第１のＯＳ、第２のＯＳにＯＳ障害監視命令を発行し、第１のＯＳ、第２のＯＳは、共通ディスク領域への生存情報の書き込みと、この生存情報を確認することでＯＳ障害の監視を行う。
【０００７】
また、本発明のマルチＯＳ環境の計算機システムは、マルチＯＳ制御部がＯＳ障害発生時に各ＯＳ上で動作していたアプリケーションプログラムを正常動作しているＯＳに引き継がせるように制御を行う。ＯＳ障害が発生したＯＳを予め設定した復旧方法により、ＯＳ障害の復旧を行い、ＯＳ障害発生していたＯＳが復旧したときには、本来動作していたＯＳ上でアプリケーションプログラムを動作させるように制御を行う。
【０００８】
これは、予め設定したアプリケーションプログラムの負荷を各ＯＳで分散させていたものを、ＯＳ障害復旧時に再びアプリケーションプログラムの負荷を各ＯＳに負荷分散させるためである。これにより、計算機１台でＯＳ障害発生時にもシステム停止することがない、クラスタシステムを構築することができるので、信頼性の高いシステムを実現することができる。
【０００９】
【発明の実施の形態】
以下、本発明の実施形態を、図面により詳細に説明する。
図１は、本発明の実施形態を示す計算機システムの構成図である。
図１に示すように、計算機は図１（ａ）に示すソフトウェア２４と図１（ｂ）に示すハードウェア１１に分けられる。ハードウェア１１は、ＣＰＵ１２、主メモリ１３、ＳＣＳＩアダプタ１４、ＬＡＮアダプタ１９およびＳＣＳＩアダプタ１４に接続されたハードディスク装置（ＨＤＤ）１５〜１８から構成される。
なお、ＬＡＮアダプタ１９に接続されたＬＡＮ回線２０には、複数の端末装置２１〜２３が接続される。
ソフトウェア２４は、第１のＯＳ２６、第２のＯＳ２７、マルチＯＳ制御部２５、業務アプリケーションプログラム（以下、ＡＰ）（１）２８、業務ＡＰ（２）３０、業務ＡＰ（３）２９、業務ＡＰ（４）３１から構成されている。
【００１０】
ソフトウェア２４の全ては、ハードウェア１１の主メモリ１３に保存されている。なお、マルチＯＳ制御部２５は、▲１▼ハードウェア資源分割機能と、▲２▼ＡＰ制御機能と、▲３▼ＯＳ障害検出および回復機能とを有している。また、業務ＡＰ（１）〜（４）は、一般業務処理を行うプログラムである。ここでは、アプリケーションプログラムの数は、一例として４つにしているが、数に限定されるものではなく、さらに多くの業務ＡＰを保存することもできる。また、第１のＯＳ２６、第２のＯＳ２７上で動作させるアプリケーションについても、限定されるものではない。
【００１１】
以下、マルチＯＳ制御部２５の３つの機能について説明する。
（１．ハードウェア資源分割機能）
ハードウェア１１からの割り込みや、ＣＰＵ処理時間、メモリなどの計算機資源を各ＯＳに振り分け、仮想ハードウェアとして各ＯＳに対して見せる。
また、各ＯＳが利用する外部デバイスのＩ／Ｏ資源を予約する機能を有する。
【００１２】
（２．アプリケーションプログラム制御機能）
アプリケーションプログラムを、ＯＳ障害発生時にＯＳ障害が発生していない方のＯＳに引き渡す機能である。障害回復時には、正常動作していた時のように、各ＯＳのアプリケーションプログラムの負荷を分散するために、アプリケーションプログラムを始め、動作していたＯＳの方に戻す機能も有する。
アプリケーションプログラムのインストール時に、アプリケーションプログラムの負荷を分散するために、どのＯＳ上で動作させるかの情報を予め共通ディスク領域に設定しておく。
【００１３】
（３．ＯＳ障害検出およびＯＳ障害回復機能）
ＯＳ障害検出は、各ＯＳにＯＳ障害検出命令を発行し、ＯＳ障害が発生していたとき、各ＯＳからの通知を受ける。ＯＳ障害回復時は各ＯＳより生存情報の通知を受けることにより、システムが正常状態に回復したことを判断する。ＯＳ障害回復は、障害発生したＯＳを再起動することにより、ＯＳの障害回復を行う機能を有する。
【００１４】
（障害時の処理）
ＬＡＮ回線２０に接続されている端末装置２１〜２３は、計算機にアクセスして実業務を行う装置である。ＯＳ障害が発生すると、正常動作している方のＯＳから計算機にアクセスしたこれらの端末装置２１〜２３に対してＯＳに障害が発生したことの通知を受ける。
ＯＳ障害検出、ＯＳ障害時の処置、および障害復旧時の処置の概要について述べる。
マルチＯＳ制御部２５は、第１のＯＳ２６、第２のＯＳ２７に対してＯＳ障害検出命令を発行する。命令を受けた第１のＯＳ２６および第２のＯＳ２７は、ＨＤＤ１５〜１８に存在する共通ディスク領域に、ある周期をもって自ＯＳが生存していることを書き込む。これが正常動作中の証明となり、この生存情報を確認すれば、どのＯＳに障害が発生したかが判別できる。
【００１５】
その後、第１のＯＳ２６および第２のＯＳ２７は、書き込まれた生存情報を読み込み、生存情報が正しくない場合には、マルチＯＳ制御部２５にＯＳ障害発生を通知する。このとき、ＬＡＮ回線２０に接続された端末装置２１〜２３にもＯＳ障害が発生したことの通知を行う。マルチＯＳ制御部２５は、予め設定されたアプリケーションプログラム制御方法に従って、障害発生したＯＳで動作していたアプリケーションプログラムを正常動作している方のＯＳ上で動作させるように命令を発行する。また、予め設定したＯＳ障害回復方法に従って、障害が発生したＯＳを回復する。
【００１６】
障害が発生したＯＳが回復された場合、マルチＯＳ制御部２５は各ＯＳより生存情報を正しく受けたことにより、障害回復が正しく行われたことを判断して、ＯＳ障害発生した時点で正常動作していたＯＳへ引き継がれたアプリケーションプログラムを、予め設定していたＯＳ上で動作させるように、各ＯＳに命令を発行する。また、障害発生していたＯＳは、ＬＡＮ回線２０に接続されている端末装置２１〜２３に対して復旧されたことを通知する。
【００１７】
（ＨＤＤの共通ディスク領域）
図２は、図１におけるＨＤＤ１５〜１８に存在する共通ディスク領域に記憶されているデータの図である。
ここでは、プログラム領域などは省略して、本発明に関する制御情報などについて示している。
生存情報確認領域４１は、第１のＯＳ２６および第２のＯＳ２７からアクセスされる領域であり、生存情報が各ＯＳより書き込まれる。生存情報には、書き込む時刻も同時に書き込まれる。ＯＳ障害検出開始時間４２は、マルチＯＳ制御部２５が各ＯＳにＯＳ障害検出命令を発行する時間が限納される。ＯＳ障害検出時間４３は、各ＯＳがマルチＯＳ制御部２５よりＯＳ障害検出命令を受けた後の、各ＯＳの生存書き込み情報の許容時間である。すなわち、障害検出命令を受けた時刻からここで定められた時間内に各ＯＳは生存情報を書き込む必要がある。
【００１８】
ＯＳ障害情報４４は、計算機の状態を示す情報である。計算機がＯＳ障害により動作しなくなった箇所、あるいは計算機の現在の状況は正常か、異常かなどを示す。
アプリケーションプログラム制御情報４５は、アプリケーションプログラムの制御情報であり、▲１▼各アプリケーションプログラムを動作させるＯＳの指定、▲２▼現在アプリケーションプログラムがどのＯＳ上で動作しているのか、を示す情報、および▲３▼アプリケーションプログラムが動作していたＯＳに障害発生した時の引継先ＯＳの情報が格納されている。アプリケーションプログラムインストール時に、各ＯＳの負荷分散を行うために、動作させるＯＳの指定とＯＳ障害が発生したときの引継先ＯＳの設定を行う。これらの▲１▼▲２▼▲３▼の各情報は、業務ＡＰ（１）〜（４）毎に示されている。
ＯＳ復旧方法４６には、第１のＯＳ２６、第２のＯＳ２７の障害発生時の復旧方法を示す情報が格納されている。これには、（ａ）メモリダンプなしのＯＳ再起動、（ｂ）メモリダンプ取得後のＯＳ再起動、（ｃ）ＯＳの特定機能のみの再起動、などに区分される。これらの（ａ）（ｂ）（ｃ）の各情報は、ＯＳ毎に示されている。
【００１９】
図３は、本発明における計算機の起動処理とＯＳ障害検出の手順を示すフローチャートである。
ハードウェア１１が起動された後、自動的な手続きの実行により第１のＯＳ２６、第２のＯＳ２７、マルチＯＳ制御部２５が起動すると、第１のＯＳ２６と第２のＯＳ２７が利用する計算機資源の割り当てが行われ、仮想ハードウェアとして各ＯＳに見せる（ステップ１０１）。第１のＯＳ２６と第２のＯＳ２７が、ＯＳ障害検出時間４３、ＯＳ復旧方法４６、ＯＳ障害情報４４を共有ディスク領域に設定する（ステップ１０２）。第１のＯＳ２６と第２のＯＳ２７は、共有ディスク領域内のアプリケーションプログラム制御情報４５のどのＯＳ上で動作させるのかを参照して、アプリケーションプログラムを開始させ、アプリケーションプログラム制御情報４５の『現在どのＯＳ上で動作しているか、』の欄にアプリケーションプログラムが動作しているＯＳを書き込む（ステップ１０３）。
【００２０】
マルチＯＳ制御部２５は、ＯＳ障害検出開始時間４２を設定する（ステップ１０４）。マルチＯＳ制御部２５は、第１のＯＳ２６と第２のＯＳ２７に対して、生存情報書き込みと確認の指示を発行する（ステップ１０５）。これにより、各ＯＳ２６，２７は、発行時刻からＯＳ障害検出時間４３（例えば、３分〜５分間隔）内に生存情報確認領域４１に生存情報の書き込みを行う必要がある。
第１のＯＳ２６と第２のＯＳ２７は、共有ディスク領域の生存確認情報領域４１に生存している情報を書き込む（ステップ１０６）。第１のＯＳ２６と第２のＯＳ２７は、ＯＳ障害情報４４を確認する（ステップ１０７）。確認した結果がＯＳ障害状態であった場合には、ＯＳ障害回復の確認処理と判断する（ステップ１０８）。第１のＯＳ２６と第２のＯＳ２７は、共有ディスク領域の生存情報確認領域４１に書き込まれた生存情報を読み込み、生存情報が生存情報書き込みと確認指示を受けた時間からＯＳ障害検出時間４３以内であることを確認する（ステップ１０９）。
【００２１】
第１のＯＳ２６は、第２のＯＳ２７の生存情報を読み込み、第２のＯＳ２７は逆に第１のＯＳ２６の生存情報を読み込む。第１のＯＳ２６と第２のＯＳ２７は、確認した結果から生存情報が正しい場合には、第１のＯＳ２６は、次回のＯＳ障害検出開始時間４２を設定する（ステップ１１１）。一方、生存情報を確認できなかった場合、生存情報が正しくない場合には、ＯＳ障害であるため、ＯＳ障害処理に移る（ステップ１１０）。この後は、ステップ１０５に戻り、ＯＳ障害検出を定期的に行う。
【００２２】
図４は、本発明におけるＯＳ障害発生時の処理の手順を示すフローチャートである。
ＯＳ障害が発生したとき、正常動作しているＯＳはマルチＯＳ制御部２５にＯＳ障害発生を通知し、ＯＳ障害情報４４にＯＳ障害発生しているＯＳを書き込む（ステップ２０１）。正常動作しているＯＳは、ＬＡＮ回線２０に接続された端末装置２１〜２３にＯＳ障害が発生していることを通知する（ステップ２０２）。
アプリケーションプログラムを正常動作しているＯＳ上で動作させるように処置を行う（ステップ２０３）。これは、マルチＯＳ制御部２５がＯＳ障害情報４４を確認し、アプリケーションプログラム制御情報４５の各アプリケーションプログラムの現在どのＯＳ上で動作しているのかを参照し、ＯＳ障害が発生したＯＳ上でアプリケーションプログラムが動作しているとき、アプリケーションプログラム制御情報４５のＯＳ障害時の引き継ぎ先情報に従って、アプリケーションプログラムを正常動作しているＯＳ上で動作させるように、正常動作しているＯＳに命令を発行する。
【００２３】
命令を受けたＯＳは、アプリケーションプログラム制御情報４５のＯＳ障害時の引き継ぎ先に従って、アプリケーションプログラムを引き継ぎ、アプリケーションプログラムを動作させる。また、アプリケーションプログラム制御情報４５のアプリケーションプログラムの現在の状態に引き継がれたＯＳを書き込む。これにより、意識しないでアプリケーションプログラムを正常動作しているＯＳへ引き継ぐことができるので、業務が停止することはない。
ＯＳ障害の復旧処理（ステップ２０４）は、マルチＯＳ制御部２５がＯＳ復旧方法４６に従って、ＯＳ障害が発生したＯＳを復旧する。
【００２４】
図５は、本発明において、障害発生したＯＳが回復したときの処理の手順を示すフローチャートである。
マルチＯＳ制御部２５は、ＯＳ障害検出開始時間４２を設定し、第１のＯＳ２６および第２のＯＳ２７に対して、ＯＳ障害検出処理命令を発行する（ステップ３０１）。第１のＯＳ２６と第２のＯＳ２７は、共有ディスク領域の生存情報確認領域に生存していることを書き込む（ステップ３０２）。第１のＯＳ２６と第２のＯＳ２７は、ＯＳ障害情報４４がＯＳ障害発生状態であることを確認する（ステップ３０３）。第１のＯＳ２６と第２のＯＳ２７は、共有ディスク領域の生存情報確認領域に書き込まれた生存情報を読み込み、生存情報が生存情報書き込みと確認指示を受けた時間からＯＳ障害検出時間４３以内であることを確認する（ステップ３０４）。
【００２５】
第１のＯＳ２６と第２のＯＳ２７は、確認した結果から生存情報が正しい場合、マルチＯＳ制御部２５にＯＳが復旧されたことを通知する（ステップ３０７）。一方、ステップ３０４において、生存情報が正しくない場合には、マルチＯＳ制御部２５にＯＳ復旧されていないことを通知する（ステップ３０５）。マルチＯＳ制御部２５は、ＯＳ復旧方法４６に従って、ＯＳ障害の復旧処理を行う（ステップ３０６）。ＯＳ障害から復旧されたＯＳは、ＬＡＮ回線２０に接続された端末装置２１〜２３にＯＳ復旧されたことを通知する（ステップ３０８）。
【００２６】
アプリケーションプログラムを元のＯＳ上で動作させるように制御する（ステップ３０９）。すなわち、マルチＯＳ制御部２５は、アプリケーションプログラム制御情報４５の『どのＯＳ上で動作させるのか』の欄と『現在どのＯＳ上で動作しているのか』の欄を参照し、参照した結果が異なるＯＳであったときには、ＯＳ障害が発生したときにアプリケーションプログラムが引き継がれたと判断する。その後、ＯＳ障害発生時に正常動作していたＯＳの方に引き継がれていたアプリケーションプログラムを、アプリケーションプログラム制御情報４５の『どのＯＳで動作させるか』の欄に従って、本来動作していたＯＳ上で動作させるように各ＯＳに命令を発行する。
【００２７】
命令を受けた第１のＯＳ２６、第２のＯＳ２７は、アプリケーションプログラム制御情報４５の『どのＯＳ上で動作させるのか』の欄に従って、アプリケーションプログラムを動作させ、アプリケーションプログラム制御情報４５のアプリケーションプログラムの現在の状態にアプリケーションプログラムが動作しているＯＳを書き込む。これにより、正常動作していたときのように、各ＯＳでのアプリケーションプログラムの負荷を分散することができる。
第１のＯＳ２６は、ＯＳ障害情報４４に正常状態の設定と、ＯＳ障害検出開始時間４２を設定する（ステップ３１０）。この後は、ステップ１０５に戻り、ＯＳ障害検出処理を定期的に行う。
【００２８】
図６は、本発明の動作概要を示す説明図である。
構成上は、図１とほぼ同じであるが、図６では、第１のＯＳ２６をホストＯＳ２６と呼び、第２のＯＳ２７をゲストＯＳ２７と呼んでいる。また、図６では、システム装置内に、ハードウェアと一緒にソフトウェアも含ませて記載されている。ハードウェアとしては、ＣＰＵ１２，１２Ａが２台、主メモリ１３，１３Ａが２台設けられ、ＶＧＡ（Ｖｉｄｅｏ　Ｇｒａｐｈｉｃｓ　Ａｒｒａｙ）３２も設けられている。ＶＧＡ３２は、グラフィック制御ＬＳＩの名称で知られており、横３２０×縦２００ドットで１６色を表示するものなどがある。
【００２９】
マルチＯＳ制御部２５は、１つのシステム装置にホストＯＳ２６とゲストＯＳ２７をインストールすることが可能であり、各ＯＳ２６，２７はハードウェアを共有することができる。ホストＯＳ２６とゲストＯＳ２７は、各々ユーザ要求の処理を行い、ここではホストＯＳ２６がデータベ−ス処理やメール処理など３３を担当し、ゲストＯＳ２７がプリンタ処理など３４を担当する。いま、一方のＯＳがハングアップした場合、他方のＯＳに処理を引き継ぐ。ここでは、ホストＯＳ２６に障害が発生したので、データベース処理、メール処理など３３をゲストＯＳ２７に引き継ぎ、ユーザには正常に処理できているように見せる。
【００３０】
システム装置内のＨＤＤ１８内に共通ディスク領域を準備しておき、ここにホストＯＳ２６とゲストＯＳ２７がそれぞれ正常時動作を書き込む。そして、ホストＯＳ２６とゲストＯＳ２７は、互いに監視し合い、書き込みがなかったとき、他のＯＳがハングアップしたものと判断する。ＯＳがハングアップしたことを検出したならば、他方のＯＳへ処理を引き継ぐ処理を行う。また、管理クライアント端末２１〜２３にＯＳがハングアップしたことを通知する。また、ＯＳがハングアップしたことを、リブートして復旧させるのかを共通ディスク領域に問い合わせる。ハングアップしたＯＳが復旧した場合には、ハングアップしていたＯＳが共通ディスク領域に書き込みが正常に行われたことを検出したときに、元の処理の状態に戻す。
【００３１】
（むすび）
以上のように、本実施例によれば、マルチＯＳ環境のシステム装置１台でＯＳ障害が発生しても、アプリケーションプログラムの切り替えを意識しないで、正常動作しているＯＳへ引き継ぐことができるので、システム停止することのないクラスタシステムを構築することができる。また、マルチＯＳ環境を用いることにより、アプリケーションプログラムの切り替えが従来のクラスタシステムより早くなる。また、ＯＳ障害が発生したとき自動的にＯＳ障害を回復することが可能であり、かつ、ＯＳ障害回復時にはアプリケーションプログラムを元の動作していたＯＳ上で動作させるようにするので、正常動作していた時のように、各ＯＳの負荷分散を行うことが可能になる。
【００３２】
（応用例）
本実施例では、第１のＯＳ２６と第２のＯＳ２７の両方のシステムを稼働していたが、第２のＯＳ２７は普通は動作させないで、第１のＯＳ２６の障害時に全ての処理を引き継ぐようなホットスペアの処理を行うことも可能である。
また、本実施例では、計算機へのＯＳの搭載数を２としたが、３つ以上のＯＳを組み込むことも可能であり、組み込むＯＳの数が多くなればシステム構築の幅も大きくなり、信頼性を高めることが可能である。
【００３３】
【発明の効果】
以上説明したように、本発明によれば、マルチＯＳ環境を利用してＯＳ障害監視、ＯＳ障害発生時にアプリケーションプログラムを正常動作しているＯＳへの引き継ぎ、ＯＳ障害の復旧処理およびＯＳ障害復旧後にアプリケーションプログラムを正常動作していたときのように、元のＯＳ上で動作するように制御を行うので、システム装置１台で安価にクラスタシステムを構築することができ、その結果、ＯＳ障害に対して信頼性の高いシステム装置を構築することが可能となる。さらに、計算機内のハードウェア資源であるＨＤＤを利用してＯＳ障害監視を行っているため、専用ハードウェアを追加する必要はない。
【図面の簡単な説明】
【図１】本発明の実施形態を示すマルチＯＳ環境の計算機システムのブロック図である。
【図２】図１におけるＨＤＤ内の共通ディスク領域に格納されたデータ構成図である。
【図３】本発明の実施形態を示す計算機の起動手順およびＯＳ障害検出手順のフローチャートである。
【図４】本発明の実施形態を示すＯＳ障害発生時の処理手順のフローチャートである。
【図５】本発明の実施形態を示すＯＳ障害回復時の処理手順のフローチャートである。
【図６】本発明の動作概要を示す説明図である。
【符号の説明】
１１…ハードウェア、１２…ＣＰＵ、１３…主メモリ、１４…ＳＣＳＩアダプタ、１５〜１８…ＨＤＤ、１９…ＬＡＮアダプタ、２０…ＬＡＮ回線、２１〜２３…端末装置、２４…ソフトウェア、２５…マルチＯＳ制御部、２６…第１のＯＳ、２７…第２のＯＳ、２８…業務ＡＰ（１）、２９…業務ＡＰ（３）、３０…業務ＡＰ（２）、３１…業務ＡＰ（４）、３２…ＶＧＡ、３３…データベース処理、メール処理など、３４…プリンタ処理など、４１…生存情報確認領域、４２…ＯＳ障害検出開始時間、４３…ＯＳ障害検出時間、４４…ＯＳ障害情報、４５…アプリケーションプログラム制御情報、４６…ＯＳ復旧方法。
[0001]
[Technical field of the invention]
The present invention relates to multi-OS control of a computer system, and more particularly to a computer system in a multi-OS environment, and a program therefor, for monitoring OS failures, handling an OS failure when it occurs, and recovering from it.
[0002]
[Prior art]
If a failure occurs in the OS of a computer, the system stops and business operations cannot continue. Cluster systems, in which business programs are processed by another computer when one computer stops so that the system as a whole does not stop, have long been widely used. A cluster system is a means of building a computer installation that must not stop under any circumstances, such as one running a company's core systems. Usually, two or more machines of the same specification are prepared, and they operate while constantly synchronizing the contents of memory and hard disks. If the running machine goes down due to a failure or the like, another machine automatically takes over its operation.
Thus, a typical cluster system consists of a plurality of computers and a shared disk device. Distributing application programs across the computers so as to balance the load on each computer is also a commonly used technique.
[0003]
On the other hand, a computer system having a multi-OS environment has been proposed in, for example, JP-A-2001-101033. In that system, a computer having a multi-OS environment detects an OS failure by means of a first OS, a second OS, a failure monitor running on the second OS, and a multi-OS control unit. Checkpoint information of an application program running on the first OS is held in a storage area whose contents are retained across a failure and restart of the first OS, and after the restart the application program is resumed from this checkpoint information.
[0004]
[Problems to be solved by the invention]
However, in order to construct a cluster system using a plurality of computers and a common disk device, a plurality of computers must be arranged, which is expensive.
Further, the system described in the above publication can detect an OS failure, restart the failed OS, and resume the application program from the checkpoint information. However, between the OS failure and the OS restart the application program is stopped, and so the business itself stops.
[0005]
Therefore, an object of the present invention is to solve these conventional problems and to provide a computer system in a multi-OS environment, and a program therefor, that is highly reliable against OS failures. Such a system can detect an OS failure with only one computer in a multi-OS environment and without adding dedicated hardware. When an OS failure occurs, it hands the application programs over to a normally operating OS without the switch being noticed, so that the application programs do not stop, and it has a function for restoring the failed OS.
[0006]
[Means for Solving the Problems]
In order to achieve the above object, in the computer system in a multi-OS environment according to the present invention, a multi-OS environment computer comprises a first OS, a second OS, and a multi-OS control unit that manages computer resources. The multi-OS control unit issues an OS failure monitoring instruction to the first OS and the second OS, and the first OS and the second OS monitor for OS failures by writing survival information to a common disk area and checking that survival information.
[0007]
Further, in the computer system in a multi-OS environment of the present invention, when an OS failure occurs the multi-OS control unit performs control so that the application programs that were running on each OS are taken over by the normally operating OS. The OS in which the failure occurred is recovered by a preset recovery method, and when that OS has been recovered, control is performed so that the application programs run on the OS on which they originally ran.
[0008]
This is so that the application program load, which was distributed across the OSs according to preset assignments, is distributed across the OSs again once the OS failure has been recovered. As a result, a cluster system that does not stop even when an OS failure occurs can be constructed with one computer, so a highly reliable system can be realized.
[0009]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a configuration diagram of a computer system showing an embodiment of the present invention.
As shown in FIG. 1, the computer is divided into software 24 shown in FIG. 1A and hardware 11 shown in FIG. 1B. The hardware 11 includes a CPU 12, a main memory 13, a SCSI adapter 14, a LAN adapter 19, and hard disk devices (HDDs) 15 to 18 connected to the SCSI adapter 14.
A plurality of terminal devices 21 to 23 are connected to a LAN line 20 connected to the LAN adapter 19.
The software 24 consists of a first OS 26, a second OS 27, a multi-OS control unit 25, a business application program (hereinafter, AP) (1) 28, a business AP (2) 30, a business AP (3) 29, and a business AP (4) 31.
[0010]
All of the software 24 is stored in the main memory 13 of the hardware 11. The multi-OS control unit 25 has (1) a hardware resource division function, (2) an AP control function, and (3) an OS failure detection and recovery function. The business APs (1) to (4) are programs for performing general business processing. Here, the number of application programs is four as an example. However, the number is not limited to four, and more business APs can be stored. Further, the applications operated on the first OS 26 and the second OS 27 are not limited.
[0011]
Hereinafter, three functions of the multi-OS control unit 25 will be described.
(1. Hardware resource division function)
Computer resources such as interrupts from the hardware 11, CPU processing time, and memory are allocated to each OS, and are presented to each OS as virtual hardware.
It also has a function of reserving the I/O resources of the external devices used by each OS.
[0012]
(2. Application program control function)
This is a function for handing an application program over, when an OS failure occurs, to the OS in which no failure has occurred. On recovery from the failure, it also has a function of returning the application program to the OS on which it was originally running, so that the application program load is distributed across the OSs as during normal operation.
At the time of installation of the application program, information on which OS to run on is set in advance in the common disk area in order to distribute the load of the application program.
[0013]
(3. OS failure detection and OS failure recovery function)
For OS failure detection, the multi-OS control unit issues an OS failure detection command to each OS and, if an OS failure has occurred, receives a notification from each OS. At OS failure recovery, it determines that the system has returned to the normal state by receiving notification of survival information from each OS. OS failure recovery is performed by restarting the failed OS.
[0014]
(Process at the time of failure)
The terminal devices 21 to 23 connected to the LAN line 20 are devices that access the computer to perform actual business tasks. When an OS failure occurs, the terminal devices 21 to 23 that have accessed the computer receive, from the normally operating OS, a notification that an OS failure has occurred.
An outline of OS failure detection, the action taken on an OS failure, and the action taken on failure recovery is given below.
The multi-OS control unit 25 issues an OS failure detection command to the first OS 26 and the second OS 27. On receiving the command, the first OS 26 and the second OS 27 each periodically write, to the common disk area on the HDDs 15 to 18, a record that the OS itself is alive. This record is proof of normal operation, and checking this survival information makes it possible to determine which OS has failed.
[0015]
After that, the first OS 26 and the second OS 27 read the written survival information, and if the survival information is incorrect, notify the multi-OS control unit 25 of the occurrence of the OS failure. At this time, the terminal devices 21 to 23 connected to the LAN line 20 are also notified that an OS failure has occurred. The multi-OS control unit 25 issues a command to operate the application program that was running on the failed OS on the normally operating OS according to a preset application program control method. In addition, the failed OS is recovered according to a preset OS failure recovery method.
[0016]
When the failed OS has been recovered, the multi-OS control unit 25 determines that recovery succeeded because it correctly receives survival information from each OS. It then issues a command to each OS so that the application programs that were handed over to the OS that was operating normally at the time of the failure run on their preset OSs. The OS that had failed notifies the terminal devices 21 to 23 connected to the LAN line 20 that it has been restored.
[0017]
(Common disk area of HDD)
FIG. 2 is a diagram of the data stored in the common disk area existing in the HDDs 15 to 18 in FIG. 1.
Here, the program area and the like are omitted, and control information and the like relating to the present invention are shown.
The survival information confirmation area 41 is an area accessed by the first OS 26 and the second OS 27, and each OS writes its survival information there. The time of writing is recorded together with the survival information. The OS failure detection start time 42 stores the time at which the multi-OS control unit 25 issues an OS failure detection command to each OS. The OS failure detection time 43 is the time allowed for each OS to write its survival information after receiving the OS failure detection command from the multi-OS control unit 25. That is, each OS must write its survival information within this time, measured from the time at which it received the failure detection command.
[0018]
The OS failure information 44 is information indicating the state of the computer. This indicates the location where the computer has stopped operating due to the OS failure, or whether the current status of the computer is normal or abnormal.
The application program control information 45 is control information for the application programs. It stores (1) the designation of the OS on which each application program is to run, (2) information indicating which OS the application program is currently running on, and (3) information on the takeover destination OS to be used when a failure occurs in the OS on which the application program is running. When an application program is installed, the OS on which it is to run and the takeover destination OS for an OS failure are set so as to distribute the load across the OSs. The items (1), (2) and (3) are held for each of the business APs (1) to (4).
The OS recovery method 46 stores information indicating a recovery method when a failure occurs in the first OS 26 and the second OS 27. This includes (a) restarting the OS without memory dump, (b) restarting the OS after acquiring the memory dump, and (c) restarting only a specific function of the OS. These pieces of information (a), (b), and (c) are shown for each OS.
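The control data of FIG. 2 can be pictured as a small record structure. The following is only an illustrative sketch: the description does not specify an on-disk format, so the class names, field names, the 180-second default and the AP-to-OS assignment used here are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class RecoveryMethod(Enum):
    """OS recovery method 46; options (a) to (c) as listed in the text."""
    REBOOT_WITHOUT_DUMP = "a"
    REBOOT_AFTER_DUMP = "b"
    RESTART_SPECIFIC_FUNCTION = "c"


@dataclass
class AppControl:
    """One per-AP entry of application program control information 45."""
    designated_os: str   # (1) OS the AP is assigned to run on
    current_os: str      # (2) OS the AP is running on now
    takeover_os: str     # (3) OS that takes over if the current OS fails


@dataclass
class CommonDiskArea:
    """Control data of FIG. 2 (all field names are hypothetical)."""
    survival: Dict[str, float] = field(default_factory=dict)  # area 41: OS -> write time
    detection_start: float = 0.0      # item 42: time the detection command is issued
    detection_time: float = 180.0     # item 43: allowed delay in seconds (assumed value)
    failed_os: Optional[str] = None   # item 44: None stands for the normal state
    apps: Dict[str, AppControl] = field(default_factory=dict)          # item 45
    recovery: Dict[str, RecoveryMethod] = field(default_factory=dict)  # item 46


# Illustrative contents; which AP runs on which OS is an assumption.
area = CommonDiskArea(
    apps={
        "AP1": AppControl("OS1", "OS1", "OS2"),
        "AP2": AppControl("OS2", "OS2", "OS1"),
    },
    recovery={
        "OS1": RecoveryMethod.REBOOT_AFTER_DUMP,
        "OS2": RecoveryMethod.REBOOT_WITHOUT_DUMP,
    },
)
```

Keeping the designated OS and the current OS as separate fields is what later lets the control unit tell whether an AP has been handed over.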
[0019]
FIG. 3 is a flowchart showing a procedure of computer startup processing and OS failure detection according to the present invention.
After the hardware 11 is started, the first OS 26, the second OS 27, and the multi-OS control unit 25 are started by an automatic procedure. The computer resources used by the first OS 26 and the second OS 27 are then allocated and presented to each OS as virtual hardware (step 101). The first OS 26 and the second OS 27 set the OS failure detection time 43, the OS recovery method 46, and the OS failure information 44 in the shared disk area (step 102). The first OS 26 and the second OS 27 refer to the entry in the application program control information 45 in the shared disk area that designates which OS each application program is to run on, start the application programs accordingly, and write the OS on which each application program is running into the "which OS it is currently running on" field of the application program control information 45 (step 103).
[0020]
The multi-OS control unit 25 sets the OS failure detection start time 42 (Step 104). The multi-OS control unit 25 issues instructions for writing and confirming survival information to the first OS 26 and the second OS 27 (step 105). Accordingly, each of the OSs 26 and 27 needs to write the survival information in the survival information confirmation area 41 within the OS failure detection time 43 (for example, every 3 to 5 minutes) from the issue time.
The first OS 26 and the second OS 27 write their survival information to the survival information confirmation area 41 of the shared disk area (step 106). The first OS 26 and the second OS 27 check the OS failure information 44 (step 107). If the check shows an OS failure state, the processing is judged to be confirmation of OS failure recovery (step 108). The first OS 26 and the second OS 27 read the survival information written in the survival information confirmation area 41 of the shared disk area and confirm that it was written within the OS failure detection time 43 of the time at which the instruction to write and confirm survival information was received (step 109).
[0021]
The first OS 26 reads the survival information of the second OS 27, and conversely the second OS 27 reads the survival information of the first OS 26. If the check shows that the survival information is correct, the first OS 26 sets the next OS failure detection start time 42 (step 111). If, on the other hand, the survival information cannot be confirmed or is incorrect, an OS failure has occurred, and processing moves to OS failure handling (step 110). Thereafter, the process returns to step 105, and OS failure detection is performed periodically.
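The survival check of steps 106 to 110 amounts to a timestamp comparison. The sketch below is a simplified illustration under stated assumptions: times are plain seconds, the shared area is an in-memory dictionary rather than a disk region, and a single function checks every OS, whereas in the description each OS checks the record of the other. All function names are hypothetical.

```python
from typing import Dict, List


def write_survival(survival: Dict[str, float], os_name: str, now: float) -> None:
    """Step 106: an OS records that it is alive, together with the write time."""
    survival[os_name] = now


def is_alive(survival: Dict[str, float], os_name: str,
             issued_at: float, detection_time: float) -> bool:
    """Step 109: the record must fall within detection_time of the command."""
    stamp = survival.get(os_name)
    if stamp is None:
        return False  # no record at all counts as a failure
    return issued_at <= stamp <= issued_at + detection_time


def find_failed(survival: Dict[str, float], os_names: List[str],
                issued_at: float, detection_time: float) -> List[str]:
    """Return the OSs whose survival record is missing or out of the window."""
    return [name for name in os_names
            if not is_alive(survival, name, issued_at, detection_time)]


# Records left over from the previous cycle; only OS2 answers the new command.
survival: Dict[str, float] = {"OS1": 40.0, "OS2": 50.0}
issued_at = 100.0
write_survival(survival, "OS2", 130.0)
failed = find_failed(survival, ["OS1", "OS2"], issued_at, 180.0)
```

Comparing against the command time, and not merely checking that a record exists, is what makes a stale record from an earlier cycle count as a failure.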
[0022]
FIG. 4 is a flowchart illustrating a procedure of processing when an OS failure occurs in the present invention.
When an OS failure occurs, the normally operating OS notifies the multi-OS control unit 25 of the occurrence of the OS failure, and writes the OS in which the OS failure has occurred in the OS failure information 44 (step 201). The normally operating OS notifies the terminal devices 21 to 23 connected to the LAN line 20 that an OS failure has occurred (step 202).
Action is taken so that the application programs run on the normally operating OS (step 203). Specifically, the multi-OS control unit 25 checks the OS failure information 44 and refers to the entry in the application program control information 45 that shows which OS each application program is currently running on. If an application program is running on the OS in which the failure occurred, the multi-OS control unit 25 issues a command to the normally operating OS, in accordance with the takeover destination information for OS failures in the application program control information 45, to run that application program on the normally operating OS.
[0023]
The OS that receives the command takes over the application program in accordance with the takeover destination for OS failures in the application program control information 45 and runs it. It also writes the OS that has taken over into the current-state entry for the application program in the application program control information 45. In this way, the application program is handed over to the normally operating OS without the switch being noticed, so business operations do not stop.
In the OS failure recovery process (step 204), the multi-OS control unit 25 recovers the OS in which the OS failure has occurred according to the OS recovery method 46.
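The takeover decision of step 203 can be sketched as a pass over the application program control information. This is a minimal illustration, assuming the control information is a dictionary with hypothetical keys `designated_os`, `current_os` and `takeover_os`; how an application program is actually started on the other OS is not modeled.

```python
from typing import Dict, List

AppTable = Dict[str, Dict[str, str]]


def fail_over(apps: AppTable, failed_os: str) -> List[str]:
    """Move every AP that is running on the failed OS to its takeover OS.

    Returns the names of the APs that were moved. Updating current_os
    corresponds to writing the takeover OS into the current-state entry.
    """
    moved = []
    for name, entry in apps.items():
        if entry["current_os"] == failed_os:
            entry["current_os"] = entry["takeover_os"]
            moved.append(name)
    return moved


apps: AppTable = {
    "AP1": {"designated_os": "OS1", "current_os": "OS1", "takeover_os": "OS2"},
    "AP2": {"designated_os": "OS2", "current_os": "OS2", "takeover_os": "OS1"},
}
moved = fail_over(apps, "OS1")
```

Note that `designated_os` is left untouched, so the original assignment remains available for the return to normal operation.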
[0024]
FIG. 5 is a flowchart showing a procedure of processing when the failed OS is recovered in the present invention.
The multi-OS control unit 25 sets the OS failure detection start time 42 and issues an OS failure detection processing command to the first OS 26 and the second OS 27 (step 301). The first OS 26 and the second OS 27 write that they are alive to the survival information confirmation area of the shared disk area (step 302). The first OS 26 and the second OS 27 confirm that the OS failure information 44 indicates that an OS failure has occurred (step 303). The first OS 26 and the second OS 27 read the survival information written in the survival information confirmation area of the shared disk area and confirm that it was written within the OS failure detection time 43 of the time at which the instruction to write and confirm survival information was received (step 304).
[0025]
The first OS 26 and the second OS 27 notify the multi-OS control unit 25 that the OS has been restored if the survival information is correct based on the result of the check (step 307). On the other hand, if the survival information is not correct in step 304, the multi-OS control unit 25 is notified that the OS has not been restored (step 305). The multi-OS control unit 25 performs an OS failure recovery process according to the OS recovery method 46 (Step 306). The OS recovered from the OS failure notifies the terminal devices 21 to 23 connected to the LAN line 20 that the OS has been recovered (step 308).
[0026]
The application programs are then controlled so as to run on their original OSs (step 309). That is, the multi-OS control unit 25 refers to the "which OS it is to run on" field and the "which OS it is currently running on" field of the application program control information 45, and if the two name different OSs, it determines that the application program was handed over when the OS failure occurred. It then issues a command to each OS so that the application programs that were handed over to the OS that was operating normally at the time of the failure run on the OS on which they originally ran, in accordance with the "which OS it is to run on" field of the application program control information 45.
[0027]
On receiving the command, the first OS 26 and the second OS 27 run the application programs in accordance with the "which OS it is to run on" field of the application program control information 45, and write the OS on which each application program is running into the current-state entry for that application program in the application program control information 45. As a result, the application program load can be distributed across the OSs as during normal operation.
The first OS 26 sets the OS failure information 44 to the normal state and sets the OS failure detection start time 42 (step 310). Thereafter, the process returns to step 105, and the OS failure detection processing is performed periodically.
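The return to the original assignment in step 309 reduces to comparing two fields per application program. The sketch below illustrates that comparison only; the dictionary keys and function name are hypothetical, and restarting the application program on its original OS is not modeled.

```python
from typing import Dict, List

AppTable = Dict[str, Dict[str, str]]


def fail_back(apps: AppTable) -> List[str]:
    """Return every handed-over AP to the OS it was designated to run on.

    An AP counts as handed over when its current OS differs from its
    designated OS. Returns the names of the APs that were moved back.
    """
    restored = []
    for name, entry in apps.items():
        if entry["current_os"] != entry["designated_os"]:
            entry["current_os"] = entry["designated_os"]
            restored.append(name)
    return restored


# State after OS1 failed and AP1 was taken over by OS2.
apps: AppTable = {
    "AP1": {"designated_os": "OS1", "current_os": "OS2", "takeover_os": "OS2"},
    "AP2": {"designated_os": "OS2", "current_os": "OS2", "takeover_os": "OS1"},
}
restored = fail_back(apps)
```

Because the check relies only on the two fields, calling it again once everything is back in place moves nothing.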
[0028]
FIG. 6 is an explanatory diagram showing an operation outline of the present invention.
Although the configuration is almost the same as that of FIG. 1, in FIG. 6, the first OS 26 is called a host OS 26 and the second OS 27 is called a guest OS 27. In FIG. 6, software is included in the system device together with hardware. As hardware, two CPUs 12 and 12A, two main memories 13 and 13A are provided, and a VGA (Video Graphics Array) 32 is also provided. The VGA 32 is known by the name of a graphic control LSI, and includes a type that displays 16 colors with 320 × 200 dots.
[0029]
The multi-OS control unit 25 allows the host OS 26 and the guest OS 27 to be installed on one system device, and the OSs 26 and 27 can share the hardware. The host OS 26 and the guest OS 27 each process user requests; here, the host OS 26 is in charge of database processing, mail processing, and the like 33, and the guest OS 27 is in charge of printer processing and the like 34. If one OS hangs, the other OS takes over its processing. Here, a failure has occurred in the host OS 26, so the database processing, mail processing, and the like 33 are taken over by the guest OS 27, and to the user processing appears to continue normally.
[0030]
A common disk area is prepared in the HDD 18 in the system device, and the host OS 26 and the guest OS 27 each write a record of normal operation there. The host OS 26 and the guest OS 27 monitor each other, and when no write has occurred, each judges that the other OS has hung. When a hang is detected, processing is handed over to the other OS, and the management client terminals 21 to 23 are notified that the OS has hung. The common disk area is also consulted to determine whether the hung OS is to be recovered by rebooting. When the hung OS has recovered, that is, when it is detected that the OS that had hung is again writing normally to the common disk area, processing is returned to its original state.
[0031]
(Conclusion)
As described above, according to the present embodiment, even if an OS failure occurs in one system device in a multi-OS environment, processing can be taken over by a normally operating OS without the user being aware that the application programs have been switched. Thus, a cluster system can be constructed in which the system does not stop. Also, because the multi-OS environment is used, switching of application programs is faster than in a conventional cluster system. Further, when an OS failure occurs, the OS can be recovered automatically, and upon recovery the application programs are made to operate on their original OS again, so that the load can be distributed among the OSs as it was while every OS was operating normally.
[0032]
(Application example)
In the present embodiment, both the first OS 26 and the second OS 27 are kept in operation; it is also possible, however, to perform hot spare processing.
In this embodiment, two OSs are installed in the computer; however, three or more OSs can be incorporated, which allows the system to be enhanced further.
[0033]
[Effect of the Invention]
As described above, according to the present invention, a multi-OS environment is used to monitor for OS failures, to hand an application program over to a normally operating OS when an OS failure occurs, to perform OS failure recovery processing, and, after recovery, to control the application program so that it operates on its original OS as in normal operation. A cluster system can therefore be constructed inexpensively with one system device, and a highly reliable system device can be constructed. Further, since OS failure monitoring is performed using the HDD, which is a hardware resource in the computer, it is not necessary to add dedicated hardware.
[Brief description of the drawings]
FIG. 1 is a block diagram of a computer system in a multi-OS environment according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a data structure stored in a common disk area in an HDD in FIG. 1.
FIG. 3 is a flowchart of a computer startup procedure and an OS failure detection procedure according to the embodiment of the present invention.
FIG. 4 is a flowchart of a processing procedure when an OS failure occurs according to the embodiment of the present invention.
FIG. 5 is a flowchart of a processing procedure at the time of OS failure recovery according to the embodiment of the present invention.
FIG. 6 is an explanatory diagram showing an operation outline of the present invention.
[Explanation of symbols]
11 ... Hardware; 12 ... CPU; 13 ... Main memory; 14 ... SCSI adapter; 15-18 ... HDD; 19 ... LAN adapter; 20 ... LAN line; 21-23 ... Terminal device; 24 ... Software; 25 ... Multi-OS control unit; 26 ... First OS; 27 ... Second OS; 28 ... Business AP (1); 29 ... Business AP (3); 30 ... Business AP (2); 31 ... Business AP (4); 32 ... VGA; 33 ... Database processing, mail processing, etc.; 34 ... Printer processing, etc.; 41 ... Survival information confirmation area; 42 ... OS failure detection start time; 43 ... OS failure detection time; 44 ... OS failure information; 45 ... Application program control information; 46 ... OS recovery method.

Claims

1. A computer system in a multi-OS environment, comprising:
a first OS that processes user requests and, when a failure occurs in another OS, takes over the processing of that other OS;
a second OS that processes user requests and, when a failure occurs in another OS, takes over the processing of that other OS;
multi-OS control means that allocates computer resources to each OS, issues an OS failure detection instruction to each OS, hands an application program over to a normally operating OS when a failure occurs, and performs control to return the application program to the original OS when the failure is recovered; and
storage means having a common disk area to which information for confirming survival is written by the first OS and the second OS, and from which the information for confirming survival written by the first and second OSs is read.

2. The computer system in a multi-OS environment according to claim 1, wherein three or more OSs for taking over the processing of the other OS exist, and the load of application programs is set in advance so as to be distributed among the OSs.

3. A program for causing a computer in a multi-OS environment to function as: means by which a first OS and a second OS write and read information for confirming survival to and from a common disk area; and means by which a multi-OS control unit performs control so that an application program running on an OS in which a failure has occurred is operated on a normally operating OS and, when the OS failure is recovered, the application program is operated on the original OS.