JP2004178296A

JP2004178296A - Knowledge based operation management system, method and program

Info

Publication number: JP2004178296A
Application number: JP2002344133A
Authority: JP
Inventors: Hideaki Asahi; 秀明朝日; Toshifumi Kamisaka; 利文上坂
Original assignee: NEC Corp; NEC Solution Innovators Ltd
Current assignee: NEC Corp; NEC Solution Innovators Ltd
Priority date: 2002-11-27
Filing date: 2002-11-27
Publication date: 2004-06-24
Anticipated expiration: 2022-11-27
Also published as: JP3916232B2

Abstract

PROBLEM TO BE SOLVED: To provide a knowledge-based operation management system, method and program capable of performing operation management while reducing dependence on a system administrator.
SOLUTION: From a monitoring terminal 3, the system administrator defines the information needed for operation management (messages to be monitored, recovery data and the like) and registers this information in a monitoring server 2 and a monitored server 1. When a failure occurs, the monitored server 1 sends failure information to the monitoring server 2, which acquires the recovery data corresponding to this failure information and sends the recovery command included in the recovery data to the monitored server 1. The monitored server 1 executes the recovery command and sends the recovery result to the monitoring server 2. The monitoring server 2 sends the received result to the monitoring terminal 3, which displays it.
COPYRIGHT: (C)2004,JPO

Description

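[0001]
[Technical Field of the Invention]
The present invention relates to an operation management system, method, and program for managing failures that occur during operation, and more particularly to a knowledge-based operation management system, method, and program that perform failure recovery by referring to past failure information accumulated in a database.
[0002]
[Prior Art]
In a conventional operation management system, when an operator discovers a failure but cannot respond immediately, for example because the operator does not know how to handle it, the operator notifies the system administrator, and the system administrator either handles the failure personally or instructs the operator in the recovery method.
[0003]
As one way of improving on this, Patent Document 1 discloses a network failure handling system and a network failure handling method for that system. Its aim is to reduce the burden on the operator by enabling prompt recovery when a failure occurs on a computer network. It accumulates and manages the history of past network failures together with their recovery methods and, when a network failure occurs, instructs the operator on the recovery method to be carried out based on the accumulated information.
[0004]
[Patent Document 1]
Japanese Unexamined Patent Application Publication No. H8-44641
[0005]
[Problems to Be Solved by the Invention]
However, in the conventional operation management system described above, the operator is only notified that a failure has occurred, so a failure of high importance has to be handled via the system administrator. In addition, when system consolidation and the like produce a large-scale multi-platform environment, recovery methods become correspondingly diverse and complex, so recovery takes a long time because the operator has to look up the recovery method in manuals and the like before acting.
[0006]
Business systems basically run 24 hours a day, and fast recovery is required when a failure occurs. Because stationing many system administrators with extensive know-how to monitor systems is costly, the most common arrangement is for operators to carry out monitoring under a small number of system administrators. What is desired, however, is an operation management method that keeps costs down while giving everyone who monitors the system know-how close to the level of a system administrator.
[0007]
The method of Patent Document 1 also has the problem that it can handle only network failures and is therefore not general-purpose. A further problem is that recovery always requires operator intervention.
[0008]
An object of the present invention is to provide a knowledge-based operation management system, method, and program that solve the above problems.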
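[0009]
[Means for Solving the Problems]
The knowledge-based operation management system of the first invention of the present application comprises: a monitoring terminal that defines recovery data for failure information, registers the recovery data in a monitoring server, and displays the result of recovery on a monitored server; a monitored server that, when a failure occurs, collects failure information and sends it to the monitoring server, inputs and executes a recovery command received from the monitoring server, and sends the recovery result to the monitoring server; and a monitoring server that acquires the recovery data corresponding to the failure information received from the monitored server, sends the recovery command included in the recovery data to the monitored server, and sends the recovery result received from the monitored server to the monitoring terminal.
[0010]
The knowledge-based operation management system of the second invention comprises: a monitoring terminal that defines managed failures, a failure handling category, a command input category, and recovery data, registers the managed failure definition in a monitored server, registers the failure handling category definition, the command input category definition, and the recovery data in the monitoring server, and displays the result of recovery on the monitored server; a monitored server that, when a failure occurs, collects failure information based on the managed failure definition and sends it to the monitoring server, inputs a recovery command received from the monitoring server, and sends the recovery result to the monitoring server; and a monitoring server that receives failure information from the monitored server and, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the automatic execution type, acquires the recovery data corresponding to the failure information, sends the recovery command included in the recovery data to the monitored server, and sends the recovery result received from the monitored server to the monitoring terminal.
[0011]
The knowledge-based operation management system of the third invention is characterized in that, in the second invention, the monitoring server sends the failure information to the monitoring terminal when the failure handling category definition for the failure information received from the monitored server is the display type, and the monitoring terminal displays the received failure message.
[0012]
The knowledge-based operation management system of the fourth invention is characterized in that, in the second invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information, sends the acquired recovery data to the monitoring terminal, and, on receiving a confirmation response from the monitoring terminal, sends the recovery command included in the recovery data to the monitored server; and the monitoring terminal displays the received recovery data for confirmation and sends a confirmation response to the monitoring server.
[0013]
The knowledge-based operation management system of the fifth invention is characterized in that, in the second invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information, sends the acquired recovery data to the monitoring terminal, receives updated recovery data from the monitoring terminal, updates the database in which recovery data is registered with the received updated recovery data, and sends the recovery command included in the updated recovery data to the monitored server; and the monitoring terminal displays and updates the received recovery data and sends the updated recovery data to the monitoring server.
[0014]
The knowledge-based operation management system of the sixth invention is characterized in that, in the third invention, the monitoring terminal defines recovery data for the displayed failure message and sends it to the monitoring server, and the monitoring server registers the received recovery data in the knowledge database.
[0015]
The knowledge-based operation management system of the seventh invention is characterized in that, in the first, second, third, fourth, fifth, or sixth invention, the monitoring server has a database in which recovery data is registered and acquires recovery data by searching the database with a message key included in the failure information.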
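[0016]
The knowledge-based operation management method of the eighth invention is a method in which a monitoring server operates and manages a monitored server, characterized in that: a monitoring terminal defines a managed failure definition, a failure handling category definition, a command input category definition, and recovery data, registers the managed failure definition in the monitored server, and registers the failure handling category definition, the command input category definition, and the recovery data in the monitoring server; the monitored server, when a failure occurs, collects failure information based on the managed failure definition and sends it to the monitoring server; the monitoring server receives the failure information from the monitored server and, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the automatic execution type, acquires the recovery data corresponding to the failure information and sends the recovery command included in the recovery data to the monitored server; the monitored server inputs the recovery command received from the monitoring server and sends the recovery result to the monitoring server; the monitoring server sends the recovery result received from the monitored server to the monitoring terminal; and the monitoring terminal displays the recovery result received from the monitoring server.
[0017]
The knowledge-based operation management method of the ninth invention is characterized in that, in the eighth invention, the monitoring server sends the failure information to the monitoring terminal when the failure handling category definition for the failure information received from the monitored server is the display type, and the monitoring terminal displays the received failure information.
[0018]
The knowledge-based operation management method of the tenth invention is characterized in that, in the eighth invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information and sends the acquired recovery data to the monitoring terminal; the monitoring terminal displays the received recovery data for confirmation and sends a confirmation response to the monitoring server; and the monitoring server, on receiving the confirmation response, sends the recovery command included in the recovery data to the monitored server.
[0019]
The knowledge-based operation management method of the eleventh invention is characterized in that, in the eighth invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information and sends the acquired recovery data to the monitoring terminal; the monitoring terminal displays and updates the received recovery data and sends the updated recovery data to the monitoring server; and the monitoring server receives the updated recovery data from the monitoring terminal, updates the database in which recovery data is registered with the received updated recovery data, and sends the recovery command included in the updated recovery data to the monitored server.
[0020]
The knowledge-based operation management method of the twelfth invention is characterized in that, in the ninth invention, the monitoring terminal defines recovery data for the displayed failure message and sends it to the monitoring server, and the monitoring server registers the received recovery data in the knowledge database.
[0021]
The knowledge-based operation management method of the thirteenth invention is characterized in that, in the eighth, ninth, tenth, eleventh, or twelfth invention, the monitoring server acquires recovery data by searching the database in which recovery data is registered with a message key included in the failure information.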
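[0022]
The knowledge-based operation management program of the fourteenth invention is a program for a knowledge-based operation management system in which a monitoring server operates and manages a monitored server, and causes computers to realize: a function whereby a monitoring terminal defines a managed failure definition, a failure handling category definition, a command input category definition, and recovery data; a function of registering the managed failure definition in the monitored server; a function of registering the failure handling category definition, the command input category definition, and the recovery data in the monitoring server; a function whereby the monitored server, when a failure occurs, collects failure information based on the managed failure definition and sends it to the monitoring server; a function whereby the monitoring server receives the failure information from the monitored server; a function of acquiring the recovery data corresponding to the failure information when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the automatic execution type; a function of sending the recovery command included in the recovery data to the monitored server; a function whereby the monitored server inputs the recovery command received from the monitoring server; a function of sending the recovery result to the monitoring server; a function whereby the monitoring server sends the recovery result received from the monitored server to the monitoring terminal; and a function whereby the monitoring terminal displays the recovery result received from the monitoring server.
[0023]
The knowledge-based operation management program of the fifteenth invention, in the fourteenth invention, realizes: a function whereby the monitoring server sends the failure information to the monitoring terminal when the failure handling category definition for the failure information received from the monitored server is the display type; and a function whereby the monitoring terminal displays the received failure information.
[0024]
The knowledge-based operation management program of the sixteenth invention, in the fourteenth invention, realizes: a function whereby the monitoring server, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, acquires the recovery data corresponding to the failure information and sends the acquired recovery data to the monitoring terminal; a function whereby the monitoring terminal displays the received recovery data for confirmation; a function of sending a confirmation response to the monitoring server; and a function whereby the monitoring server, on receiving the confirmation response, sends the recovery command included in the recovery data to the monitored server.
[0025]
The knowledge-based operation management program of the seventeenth invention, in the fourteenth invention, realizes: a function whereby the monitoring server, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, acquires the recovery data corresponding to the failure information and sends the acquired recovery data to the monitoring terminal; a function whereby the monitoring terminal displays and updates the received recovery data; a function of sending the updated recovery data to the monitoring server; a function whereby the monitoring server receives the updated recovery data from the monitoring terminal and updates the database in which recovery data is registered with the received updated recovery data; and a function of sending the recovery command included in the updated recovery data to the monitored server.
[0026]
The knowledge-based operation management program of the eighteenth invention, in the fifteenth invention, realizes: a function whereby the monitoring terminal defines recovery data for the displayed failure message and sends it to the monitoring server; and a function whereby the monitoring server registers the received recovery data in the knowledge database.
[0027]
The knowledge-based operation management program of the nineteenth invention, in the fourteenth, fifteenth, sixteenth, seventeenth, or eighteenth invention, realizes a function whereby the monitoring server acquires recovery data by searching the database in which recovery data is registered with a message key included in the failure information.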
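[0028]
[Embodiments of the Invention]
In the operation management of a business system running on computers, the present invention accumulates in a database the methods of handling failures that have occurred on monitored servers and the like. When the same failure subsequently occurs, the handling method is retrieved from the accumulated database, displayed as a guide message that is easy to understand for an operator who monitors only part of the business system (hereinafter simply "operator"), and used to navigate the operator through the recovery method.
[0029]
The invention provides a mechanism for registering in a database the system know-how held only by a system administrator who understands the configuration of the entire business system and is thoroughly familiar with its recovery methods (hereinafter simply "system administrator"). This makes a knowledge database easy to build. When a past failure recurs, the knowledge database is searched, so that even a low-skilled operator is assured of the same level of operation as a system administrator.
[0030]
An embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing the overall configuration of the embodiment of the present invention.
[0031]
Referring to FIG. 1, an example of an operation management system using the present invention comprises n monitored servers 1 (n being an integer of 1 or more) that run business operations, a monitoring server 2 that collects failure information from the monitored servers, and a monitoring terminal 3 that displays information from the monitoring server 2 and is operated by operators and others.
[0032]
The monitored server 1 is an information processing device that operates under program control and has first operation management means 11.
[0033]
The first operation management means 11 collects failure information within its own server and notifies the second operation management means 21. On receiving a failure recovery command input instruction from the second operation management means 21, it executes the command on its own server and reports the result.
[0034]
The monitoring server 2 is an information processing device that operates under program control and has second operation management means 21 and a knowledge database 22.
[0035]
The second operation management means 21 acquires failure information from the first operation management means 11, instructs the first operation management means 11 to input recovery commands to the monitored server 1, registers and searches system recovery know-how in cooperation with the knowledge database 22, and notifies the monitoring terminal 3 of the acquired information.
[0036]
The knowledge database 22 stores the recovery commands and guide messages that the system administrator has defined as information related to failure information. It stores them using the message ID, the node, or the message body, or a combination of these, as the key (message key). Hereinafter, the recovery command and guide message together are referred to as recovery data.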
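The keyed storage described for the knowledge database 22 can be illustrated with a minimal in-memory sketch. The class and method names (`KnowledgeDatabase`, `register`, `search`) and the convention that `None` in a key field means "match any value" are assumptions made for illustration; the patent does not specify a storage engine or matching rule.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# A message key is (message ID, node, message body).
# Assumption: None in a field means "this field is not part of the key".
MessageKey = Tuple[Optional[str], Optional[str], Optional[str]]


@dataclass(frozen=True)
class RecoveryData:
    """Recovery command plus guide message, as paragraph [0036] groups them."""
    recovery_command: str
    guide_message: str


class KnowledgeDatabase:
    """In-memory stand-in for knowledge database 22 (hypothetical API)."""

    def __init__(self) -> None:
        self._entries: Dict[MessageKey, RecoveryData] = {}

    def register(self, key: MessageKey, data: RecoveryData) -> None:
        self._entries[key] = data

    def search(self, message_id: str, node: str, body: str) -> Optional[RecoveryData]:
        """Return the first entry whose non-None key fields all match."""
        observed = (message_id, node, body)
        for key, data in self._entries.items():
            if all(k is None or k == o for k, o in zip(key, observed)):
                return data
        return None
```

A key such as `("DCRO03", None, None)` then matches that message ID on any node, while `("DCRO03", "server01", None)` would restrict the entry to one server.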
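[0037]
The first operation management means 11 and the second operation management means 21 are now described in detail with reference to FIG. 2.
FIG. 2 is a diagram showing the detailed configuration of the first operation management means and the second operation management means.
[0038]
First, the second operation management means 21 is described.
[0039]
Referring to FIG. 2, the second operation management means 21 includes a second management service unit 211, a second failure information acquisition service unit 212, a knowledge service unit 213, a database service unit 214, a second command input service unit 215, and an application unit 216.
[0040]
The second management service unit 211 has the following four functions.
(1) It reads the definition file of failure handling information defined by the system administrator and reflects it in the second failure information acquisition service unit 212 and the knowledge service unit 213.
(2) It sends to the second failure information acquisition service unit 212 the failure classification definition that determines whether the knowledge service unit 213 is used for a given piece of failure information.
(3) It sends to the first management service unit 111 the definition information for configuring the first failure information acquisition service unit 112 to collect the specified failure messages.
(4) It instructs the knowledge service unit 213 to register the defined recovery data in the knowledge database 22.
[0041]
The second failure information acquisition service unit 212 has the following two functions.
(1) It acquires the failure messages of the monitored server 1 sent from the first failure information acquisition service unit 112 and notifies the knowledge service unit 213.
(2) It notifies the application unit 216 of the message so that the failure message can be displayed on the monitoring terminal 3.
[0042]
The knowledge service unit 213 has the following four functions.
(1) Based on the failure message from the second failure information acquisition service unit 212, it issues a request to search the knowledge database 22 for the corresponding recovery data and obtains the result.
(2) When a recovery command is to be input, it requests the second command input service unit 215 to have the first command input service unit 113 in the specified monitored server 1 input the recovery command, and obtains the result.
(3) It notifies the application unit 216 of the recovery command input result so that it can be viewed on the monitoring terminal 3.
(4) When the monitoring terminal 3 issues an instruction to update (add, modify, or delete) recovery data, it passes the request to the database service unit 214.
[0043]
The database service unit 214 has the following two functions.
(1) In response to a search request from the knowledge service unit 213, it searches the knowledge database 22 and returns the search result to the knowledge service unit 213.
(2) When the knowledge service unit 213 requests an update (addition, modification, or deletion) of the recovery data for a specified failure message, it performs the update on the knowledge database 22.
[0044]
The second command input service unit 215 has the following two functions.
(1) In response to a failure recovery command input request from the knowledge service unit 213, it instructs the first command input service unit 113 in the specified monitored server 1 to input the recovery command.
(2) It notifies the knowledge service unit 213 of the input result from the first command input service unit 113.
[0045]
The application unit 216 mediates between the service units and the monitoring terminal 3: it notifies the monitoring terminal 3 of information from the service units described above and conveys operations from the monitoring terminal 3 to the knowledge service unit 213. This makes it possible, from the monitoring terminal 3, to view the failure messages of all monitored servers 1 in one place and to input failure handling commands (recovery commands) to a monitored server 1.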
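[0046]
Next, the first operation management means 11 is described.
[0047]
Referring to FIG. 2, the first operation management means 11 includes a first management service unit 111, a first failure information acquisition service unit 112, and a first command input service unit 113.
[0048]
The first management service unit 111 has the function of reflecting in the first failure information acquisition service unit 112 the definition information, sent from the second management service unit 211, that specifies which failure messages are to be collected.
[0049]
The first failure information acquisition service unit 112 has the function of collecting, based on the definition information from the first management service unit 111, the failure messages generated within the monitored server 1 and sending them to the second failure information acquisition service unit 212.
[0050]
The first command input service unit 113 has the function of receiving a recovery command input instruction from the second command input service unit 215, inputting the command to the monitored server 1, and notifying the second command input service unit 215 of the result (recovery result).
[0051]
The failure messages collected and sent by the first failure information acquisition service unit 112 are now described with reference to FIG. 3.
FIG. 3 is a diagram showing an example of the failure message format.
[0052]
Referring to FIG. 3, a failure message contains the date and time of occurrence, the message ID, the server on which the failure occurred (node), and the message body; the message ID is the identification number assigned to the message body. In the failure message examples of FIG. 3, the content of Example 1, "Business service A terminated abnormally", is assigned "DCRO03", and Examples 2-1 and 2-2, "File creation failed. Error code=?", are assigned "DCRO19". In the message body of Example 2, the "?" in "Error code=?" is replaced by a number identifying which of several possible causes made file creation fail: "1" if there was no write permission for the disk, and "4" if the disk had insufficient free space. The system administrator defines failure messages using the message ID, the error code in the message body, and the like as the conditions for collecting failure messages and for associating them with recovery data.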
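The failure message fields and the embedded error code can be sketched as a small record type with a parser. This is a sketch only: the field names, the single `occurred_at` text field, and the English "Error code=" spelling of the message body are assumptions, since FIG. 3 itself is not reproduced in the text.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailureMessage:
    occurred_at: str   # date and time of occurrence, kept as text here
    message_id: str    # e.g. "DCRO19"
    node: str          # server on which the failure occurred
    body: str          # message body


# Assumed English rendering of the "Error code=?" part of the message body.
_ERROR_CODE = re.compile(r"Error code\s*=\s*(\d+)")


def error_code(message: FailureMessage) -> Optional[int]:
    """Extract the numeric error code from the body, or None if absent."""
    match = _ERROR_CODE.search(message.body)
    return int(match.group(1)) if match else None
```

The pair `(message_id, error_code(message))` is then one plausible condition for matching a collected message against the administrator's definitions.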
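[0053]
The monitoring terminal 3 is a device that operates under program control and is operated by system administrators and operators. It also displays information from the monitoring server 2. From the monitoring terminal 3, the system administrator defines the following failure handling information.
(1) Definition of the messages subject to failure management (definition of managed failures, i.e. creation of the managed message list). A message is defined by a message key consisting of the message ID, the node, or the message body, or a combination of these.
(2) Definition of whether a defined failure message uses the knowledge service unit 213 (definition of the failure handling category). A message is defined as the display type when the monitoring terminal 3 only needs to be notified that an event has occurred, and as the knowledge type when failure handling is required. Display-type messages are notified to the application unit 216, and knowledge-type messages to the knowledge service unit 213.
(3) Definition of the recovery data (recovery command and guide message) for failure messages that use the knowledge service unit 213.
(4) Definition of whether the recovery command in the recovery data is input automatically (automatic execution type) or not (confirmation execution type) (definition of the command input category). When the recovery command is the confirmation execution type, the operator checks the content of the recovery data on the monitoring terminal 3 and, if an update is needed, adds to, modifies, or deletes the recovery data.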
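The four definitions above can be pictured as one record per managed message. The sketch below is illustrative only: the type names and the validation rule (a knowledge-type entry must state a command input category) are assumptions, and the patent does not describe the definition file format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HandlingCategory(Enum):
    DISPLAY = "display"        # only notify the monitoring terminal
    KNOWLEDGE = "knowledge"    # failure handling is required


class CommandInputCategory(Enum):
    AUTOMATIC = "automatic"    # recovery command is input automatically
    CONFIRM = "confirm"        # operator confirms before input


@dataclass(frozen=True)
class FailureDefinition:
    message_id: str                                   # (1) managed message
    handling: HandlingCategory                        # (2) handling category
    command_input: Optional[CommandInputCategory] = None   # (4)
    recovery_command: Optional[str] = None            # (3) recovery data
    guide_message: Optional[str] = None               # (3) recovery data

    def __post_init__(self) -> None:
        # Assumed rule: knowledge-type entries must say how the command is input.
        if self.handling is HandlingCategory.KNOWLEDGE and self.command_input is None:
            raise ValueError("knowledge-type definition needs a command input category")
```

Rejecting incomplete knowledge-type entries at definition time is a design choice of this sketch; it keeps the operation phase from meeting a message whose handling is undecided.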
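[0054]
Next, the operation of the embodiment of the present invention is described in detail with reference to FIGS. 2 to 5.
FIG. 4 is a diagram showing the flow of operation in the operation preparation phase.
FIG. 5 is a diagram showing the flow of operation in the operation phase.
[0055]
The operation of the embodiment is described in the following two parts.
- Operation in the operation preparation phase
- Operation in the operation phase
[0056]
First, the operation in the operation preparation phase is described with reference to FIG. 2 and FIG. 4.
[0057]
From the monitoring terminal 3, the system administrator defines which messages are to be treated as failures when they occur (definition of managed failures, i.e. creation of the managed message list) and, where the knowledge service unit 213 is to be used, defines the recovery data for the failure message, the failure handling category, and the command input category (FIG. 2: (1); FIG. 4: step 101). The definitions specify that notification goes to the application unit 216 when the monitoring terminal 3 only needs to be told that an event has occurred, and to the knowledge service unit 213 when failure handling is required. In the operation phase, the second failure information acquisition service unit 212 notifies the knowledge service unit 213 or the application unit 216 according to this definition.
[0058]
The defined information is distributed to each service unit through the second management service unit 211 (FIG. 2: (2)→(3); FIG. 4: step 102). That is, the managed failure definition is registered in the first failure information acquisition service unit 112 and the command input category definition in the knowledge service unit 213. In addition, the trigger for sending failure messages from the monitored server 1 to either the application unit 216 or the knowledge service unit 213 (the failure handling category definition) is reflected in the second failure information acquisition service unit 212.
[0059]
The knowledge service unit 213 is instructed to register the recovery data (FIG. 2: (4)), and the data is registered in the knowledge database 22 through the database service unit 214 (FIG. 2: (5); FIG. 4: step 103).
[0060]
These steps complete the definition of the failure handling information, and operation preparation ends. If a message that should newly become a management target appears during the operation phase, the above preparation procedure is carried out for that message.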
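[0061]
Next, the operation in the operation phase is described with reference to FIG. 2 and FIG. 5.
[0062]
When a failure occurs on a monitored server 1 (FIG. 5: step 201), the first failure information acquisition service unit 112 searches the managed failure definition (managed message list) using the message ID and error code of the message output by the failure to determine whether the message is a managed message. If it is, the unit assembles it into the failure message format shown in FIG. 3 and notifies the second failure information acquisition service unit 212 of the failure message (FIG. 2: (6)).
[0063]
If the notified failure message is defined as using the knowledge service unit 213, that is, if the failure handling category is the knowledge type (FIG. 5: step 202 → step 203), the second failure information acquisition service unit 212 notifies the knowledge service unit 213 of the failure message (FIG. 2: (7)).
[0064]
The knowledge service unit 213 issues to the database service unit 214 a request to search for the same failure message and its recovery data (FIG. 2: (4)), and the database service unit 214 searches the knowledge database 22 as requested and returns the result (recovery data) (FIG. 2: (5)→(8)→(9)).
[0065]
If the recovery command in the recovery data is registered to be input automatically, that is, if the command input category is the automatic execution type (FIG. 5: step 203 → step 204), the knowledge service unit 213 instructs the second command input service unit 215 to have the first command input service unit 113 of the monitored server 1 on which the failure occurred input the recovery command (FIG. 2: (10)→(11); FIG. 5: step 204).
[0066]
The second command input service unit 215 passes the recovery data to the first command input service unit 113, which inputs and executes the recovery command. On receiving the input result from the first command input service unit 113 (FIG. 2: (12)), the second command input service unit 215 notifies the knowledge service unit 213 (FIG. 2: (13)), and the knowledge service unit 213 notifies the application unit 216 so that the result is displayed on the monitoring terminal 3 (FIG. 2: (14)→(15); FIG. 5: step 205).
[0067]
If the recovery command in the recovery data is registered to be input after the operator has confirmed it, that is, if the command input category is the confirmation execution type (FIG. 5: step 203 → step 206), the content of the recovery data is displayed on the monitoring terminal 3 to prompt the operator to act (FIG. 2: (5)→(9)→(14)→(15); FIG. 5: step 206). The operator checks it with reference to the guide message, and if the recovery data needs no update, the command is input (FIG. 2: (16)→(17)→(10)→(11); FIG. 5: step 207 → step 204).
[0068]
If the content of the recovery data is to be updated (added to, modified, or deleted), the operator or another user updates the recovery data displayed on the monitoring terminal 3, and the updated recovery data is stored in the knowledge database 22 (FIG. 2: (16)→(17)→(4)→(5); FIG. 5: step 208 → step 209). Processing then proceeds to command input and notification of the command input result (FIG. 2: (10)→(11)→(12)→(13)→(14)→(15); FIG. 5: step 204 → step 205).
[0069]
When the knowledge service unit 213 is not used, that is, when the failure handling category is the display type, the failure message from the first failure information acquisition service unit 112 is notified to the monitoring terminal 3 via the application unit 216 (FIG. 2: (6)→(18)→(15); FIG. 5: step 202 → step 210).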
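The branching in the operation phase (display type versus knowledge type, and automatic versus confirmation execution) can be summarized in a short sketch. Everything here is hypothetical scaffolding: the knowledge database is reduced to a dictionary from message key to recovery command, and the two callables stand in for the monitored server and the operator at the monitoring terminal.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Handling(Enum):
    DISPLAY = "display"
    KNOWLEDGE = "knowledge"


class CommandInput(Enum):
    AUTOMATIC = "automatic"
    CONFIRM = "confirm"


def handle_failure(
    message_key: str,
    handling: Handling,
    command_input: CommandInput,
    knowledge_db: Dict[str, str],
    run_on_monitored_server: Callable[[str], str],
    confirm_on_terminal: Callable[[str], Optional[str]],
) -> str:
    """Return the text that would be shown on the monitoring terminal."""
    if handling is Handling.DISPLAY:
        # Display type: the failure message is only shown.
        return f"failure: {message_key}"

    command = knowledge_db[message_key]  # recovery data lookup
    if command_input is CommandInput.CONFIRM:
        # The operator returns None to accept as is, or an edited command.
        edited = confirm_on_terminal(command)
        if edited is not None:
            knowledge_db[message_key] = edited  # store the update first
            command = edited
    result = run_on_monitored_server(command)
    return f"recovery result: {result}"
```

As in paragraph [0068], an edited command is written back to the database before it is input, so the next occurrence of the same failure uses the updated recovery data.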
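[0070]
If a failure message notified to the monitoring terminal 3 is a failure for which recovery data should be defined (a new failure) (FIG. 5: step 211 → step 212), the system administrator carries out the same processing as in the operation preparation phase, defining the failure handling know-how (recovery data) and storing it in the knowledge database 22 (FIG. 2: (1)→(2)→(4)→(5); FIG. 5: step 212 → step 213 → step 214).
[0071]
For messages that are managed but have no recovery data defined, recovery data can also be registered in the knowledge database 22 by the following method.
[0072]
For example, if recovery data has been defined only for the failure message of Example 2-1 in FIG. 3, then when the failure message of Example 2-2 occurs the cause of the failure is different, so recovery data for Example 2-2 must be newly registered. There is, however, no need to add to or modify the definition file and reflect it in the knowledge service unit 213. Instead, the failure message is defined by message ID, and recovery data corresponding to the error code of the failure message that occurred is added by input on the monitoring terminal 3 screen. This updates the knowledge database 22 so that the data can be used from the next occurrence. In this way, recovery data can easily be added later for failure messages that have many error codes.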
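The incremental registration described above can be sketched as a table keyed by message ID and error code, to which entries are added at run time. The class name, method names, and command strings are illustrative assumptions, not the patent's data model.

```python
from typing import Dict, Optional, Tuple


class ErrorCodeRecoveryTable:
    """Recovery commands keyed by (message ID, error code); hypothetical."""

    def __init__(self) -> None:
        self._commands: Dict[Tuple[str, int], str] = {}

    def add(self, message_id: str, error_code: int, recovery_command: str) -> None:
        # Called from the terminal when a new error code needs recovery data.
        self._commands[(message_id, error_code)] = recovery_command

    def lookup(self, message_id: str, error_code: int) -> Optional[str]:
        return self._commands.get((message_id, error_code))
```

With only error code 1 registered for "DCRO19", a lookup for error code 4 returns `None`; once the operator adds an entry for code 4, the same lookup succeeds on the next occurrence without any change to the definition file.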
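[0073]
By accumulating failure handling methods in the knowledge database in this way, when a past failure recurs, searching the knowledge database lets even a low-skilled operator handle the failure at the same level as a system administrator.
[0074]
Next, another embodiment of the present invention is described.
[0075]
The configuration of the other embodiment adds, to the configuration of the embodiment described above, a local database on the monitored server in which recovery commands are registered, and adds a recovery command registration unit and a local database search unit to the first operation management means.
[0076]
The operation of the other embodiment is described below.
[0077]
In the operation preparation phase, the command input category definition and the failure handling category definition are also registered in the first failure information acquisition service unit.
[0078]
In the operation phase, when a failure occurs on the monitored server, the first failure information acquisition service unit collects failure information and, if the failure handling category is the knowledge type and the command input category is the automatic execution type, passes the message key of the collected failure information to the local database search unit. The local database search unit searches the local database with the message key and, if there is a match, retrieves the recovery command and passes it to the first command input service unit. The first command input service unit inputs and executes the recovery command and sends the recovery result to the monitoring terminal via the monitoring server.
[0079]
In all other cases, that is, unless the failure handling category is the knowledge type and the command input category is the automatic execution type, or when the local database search finds no match, the first failure information acquisition service unit sends the failure information to the monitoring server, and operation thereafter is the same as in the embodiment described above. The difference is that when the first command input service unit inputs a recovery command and the recovery result is normal, the recovery command registration unit registers the recovery command in the local database.
[0080]
In this way, for messages registered in the local database, a recurring failure can be handled by the monitored server alone.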
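The local database of this other embodiment behaves like a cache of recovery commands that have already succeeded. The sketch below assumes, for brevity, that every failure is knowledge type with automatic execution; the class name and the two callables (fetching a command from the monitoring server, running a command and reporting success) are hypothetical.

```python
from typing import Callable, Dict, Optional


class MonitoredServerAgent:
    """Monitored-server side with a local recovery command database."""

    def __init__(
        self,
        fetch_from_monitoring_server: Callable[[str], Optional[str]],
        run_command: Callable[[str], bool],
    ) -> None:
        self._fetch = fetch_from_monitoring_server
        self._run = run_command
        self.local_db: Dict[str, str] = {}

    def on_failure(self, message_key: str) -> str:
        """Return "local" or "server" to show which path handled the failure."""
        command = self.local_db.get(message_key)
        if command is not None:
            # Match in the local database: recover without the monitoring server.
            self._run(command)
            return "local"
        command = self._fetch(message_key)
        if command is not None and self._run(command):
            # Register only commands whose recovery result was normal.
            self.local_db[message_key] = command
        return "server"
```

The first occurrence of a failure goes through the monitoring server; if recovery succeeds, later occurrences of the same message key are handled on the monitored server alone.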
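[0081]
In the embodiments of the present invention described above, the programs and the like for executing the processing of the knowledge-based operation management system are recorded as data on a recording medium (not shown) such as a computer's magnetic disk or optical disk, and the recorded data is read out and used to operate the knowledge-based operation management system. By recording the data that operates the knowledge-based operation management system on a recording medium in this way and installing from that medium, the functions of the knowledge-based operation management system can be realized.
[0082]
[Effects of the Invention]
By combining an operation management system with a knowledge database, the present invention provides the following two effects.
[0083]
The first effect is that operators can share the know-how of the system administrator.
[0084]
This is because past failure information and recovery methods are accumulated in the knowledge database, so that for a failure that has occurred before, the knowledge database is searched and the recovery method and guide message are displayed.
[0085]
The second effect is that an autonomous operation management system can be built.
[0086]
This is because, by configuring the recovery commands registered in the knowledge database to be input automatically, system recovery can be carried out without operator intervention.
[Brief Description of the Drawings]
[FIG. 1] A diagram showing the overall configuration of the embodiment of the present invention
[FIG. 2] A diagram showing the detailed configuration of the operation management means
[FIG. 3] A diagram showing an example of the failure message format
[FIG. 4] A diagram showing the flow of operation in the operation preparation phase
[FIG. 5] A diagram showing the flow of operation in the operation phase
[Explanation of Reference Numerals]
1 Monitored server
2 Monitoring server
3 Monitoring terminal
11 First operation management means
21 Second operation management means
22 Knowledge database
111 First management service unit
112 First failure information acquisition service unit
113 First command input service unit
211 Second management service unit
212 Second failure information acquisition service unit
213 Knowledge service unit
214 Database service unit
215 Second command input service unit
216 Application unit
[0001]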
[Technical Field of the Invention]
The present invention relates to an operation management system, method, and program for managing failures that occur during operation, and more particularly to a knowledge-based operation management system, method, and program that perform failure recovery by referring to past failure information accumulated in a database.
[0002]
[Prior Art]
In a conventional operation management system, when an operator discovers a failure but cannot respond immediately, for example because the operator does not know how to handle it, the operator notifies the system administrator, and the system administrator either handles the failure personally or instructs the operator in the recovery method.
[0003]
As one way of improving on this, Patent Document 1 discloses a network failure handling system and a network failure handling method for that system. Its aim is to reduce the burden on the operator by enabling prompt recovery when a failure occurs on a computer network. It accumulates and manages the history of past network failures together with their recovery methods and, when a network failure occurs, instructs the operator on the recovery method to be carried out based on the accumulated information.
[0004]
[Patent Document 1]
JP-A-8-44641
[0005]
[Problems to Be Solved by the Invention]
However, in the conventional operation management system described above, the operator is only notified that a failure has occurred, so a failure of high importance has to be handled via the system administrator. In addition, when system consolidation and the like produce a large-scale multi-platform environment, recovery methods become correspondingly diverse and complex, so recovery takes a long time because the operator has to look up the recovery method in manuals and the like before acting.
[0006]
Business systems basically run 24 hours a day, and fast recovery is required when a failure occurs. Because stationing many system administrators with extensive know-how to monitor systems is costly, the most common arrangement is for operators to carry out monitoring under a small number of system administrators. What is desired, however, is an operation management method that keeps costs down while giving everyone who monitors the system know-how close to the level of a system administrator.
[0007]
The method of Patent Document 1 also has the problem that it can handle only network failures and is therefore not general-purpose. A further problem is that recovery always requires operator intervention.
[0008]
An object of the present invention is to provide a knowledge-based operation management system, method, and program that solve the above-mentioned problems.
[0009]
[Means for Solving the Problems]
The knowledge-based operation management system of the first invention of the present application comprises: a monitoring terminal that defines recovery data for failure information, registers the recovery data in a monitoring server, and displays the result of recovery on a monitored server; a monitored server that, when a failure occurs, collects failure information and sends it to the monitoring server, inputs and executes a recovery command received from the monitoring server, and sends the recovery result to the monitoring server; and a monitoring server that acquires the recovery data corresponding to the failure information received from the monitored server, sends the recovery command included in the recovery data to the monitored server, and sends the recovery result received from the monitored server to the monitoring terminal.
[0010]
The knowledge-based operation management system of the second invention comprises: a monitoring terminal that defines managed failures, a failure handling category, a command input category, and recovery data, registers the managed failure definition in a monitored server, registers the failure handling category definition, the command input category definition, and the recovery data in the monitoring server, and displays the result of recovery on the monitored server; a monitored server that, when a failure occurs, collects failure information based on the managed failure definition and sends it to the monitoring server, inputs a recovery command received from the monitoring server, and sends the recovery result to the monitoring server; and a monitoring server that receives failure information from the monitored server and, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the automatic execution type, acquires the recovery data corresponding to the failure information, sends the recovery command included in the recovery data to the monitored server, and sends the recovery result received from the monitored server to the monitoring terminal.
[0011]
The knowledge-based operation management system of the third invention is characterized in that, in the second invention, the monitoring server sends the failure information to the monitoring terminal when the failure handling category definition for the failure information received from the monitored server is the display type, and the monitoring terminal displays the received failure message.
[0012]
The knowledge-based operation management system of the fourth invention is characterized in that, in the second invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information, sends the acquired recovery data to the monitoring terminal, and, on receiving a confirmation response from the monitoring terminal, sends the recovery command included in the recovery data to the monitored server; and the monitoring terminal displays the received recovery data for confirmation and sends a confirmation response to the monitoring server.
[0013]
The knowledge-based operation management system of the fifth invention is characterized in that, in the second invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information, sends the acquired recovery data to the monitoring terminal, receives updated recovery data from the monitoring terminal, updates the database in which recovery data is registered with the received updated recovery data, and sends the recovery command included in the updated recovery data to the monitored server; and the monitoring terminal displays and updates the received recovery data and sends the updated recovery data to the monitoring server.
[0014]
The knowledge-based operation management system of the sixth invention is characterized in that, in the third invention, the monitoring terminal defines recovery data for the displayed failure message and sends it to the monitoring server, and the monitoring server registers the received recovery data in the knowledge database.
[0015]
The knowledge-based operation management system of the seventh invention is characterized in that, in the first, second, third, fourth, fifth, or sixth invention, the monitoring server has a database in which recovery data is registered and acquires recovery data by searching the database with a message key included in the failure information.
[0016]
The knowledge-based operation management method of the eighth invention is a method in which a monitoring server operates and manages a monitored server, characterized in that: a monitoring terminal defines a managed failure definition, a failure handling category definition, a command input category definition, and recovery data, registers the managed failure definition in the monitored server, and registers the failure handling category definition, the command input category definition, and the recovery data in the monitoring server; the monitored server, when a failure occurs, collects failure information based on the managed failure definition and sends it to the monitoring server; the monitoring server receives the failure information from the monitored server and, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the automatic execution type, acquires the recovery data corresponding to the failure information and sends the recovery command included in the recovery data to the monitored server; the monitored server inputs the recovery command received from the monitoring server and sends the recovery result to the monitoring server; the monitoring server sends the recovery result received from the monitored server to the monitoring terminal; and the monitoring terminal displays the recovery result received from the monitoring server.
[0017]
The knowledge-based operation management method of the ninth invention is characterized in that, in the eighth invention, the monitoring server sends the failure information to the monitoring terminal when the failure handling category definition for the failure information received from the monitored server is the display type, and the monitoring terminal displays the received failure information.
[0018]
The knowledge-based operation management method of the tenth invention is characterized in that, in the eighth invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information and sends the acquired recovery data to the monitoring terminal; the monitoring terminal displays the received recovery data for confirmation and sends a confirmation response to the monitoring server; and the monitoring server, on receiving the confirmation response, sends the recovery command included in the recovery data to the monitored server.
[0019]
The knowledge-based operation management method of the eleventh invention is characterized in that, in the eighth invention, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is the confirmation execution type, the monitoring server acquires the recovery data corresponding to the failure information and sends the acquired recovery data to the monitoring terminal; the monitoring terminal displays and updates the received recovery data and sends the updated recovery data to the monitoring server; and the monitoring server receives the updated recovery data from the monitoring terminal, updates the database in which recovery data is registered with the received updated recovery data, and sends the recovery command included in the updated recovery data to the monitored server.
[0020]
The knowledge-based operation management method of the twelfth invention is characterized in that, in the ninth invention, the monitoring terminal defines recovery data for the displayed failure message and sends it to the monitoring server, and the monitoring server registers the received recovery data in the knowledge database.
[0021]
The knowledge-based operation management method of the thirteenth invention is characterized in that, in the eighth, ninth, tenth, eleventh, or twelfth invention, the monitoring server acquires recovery data by searching the database in which recovery data is registered with a message key included in the failure information.
[0022]
A knowledge-type operation management program according to a fourteenth invention of the present application is a knowledge-type operation management program in a knowledge-type operation management system in which a monitoring server operates and manages a monitored server. A function for defining a troubleshooting section definition, a command input section definition, and recovery data, a function for registering the managed failure definition in a monitored server, and a section for storing the troubleshooting section definition, the command input section definition, and the recovery data. A function to register with the monitoring server, a function to collect failure information based on the managed failure definition when the failure occurs, and to send the failure information to the monitoring server. The function to receive, the fault handling division definition of the received fault information is a knowledge type, and the command When the input category definition is the automatic execution type, a function of acquiring recovery data corresponding to failure information, a function of transmitting a recovery command included in recovery data to the monitored server, and the monitored server A function of inputting a restoration command received from the server, a function of transmitting a restoration result to the monitoring server, a function of the monitoring server transmitting a restoration result received from the monitored server to the monitoring terminal, and a function of the monitoring terminal A function of displaying a recovery result received from the monitoring server.
[0023]
A knowledge-type operation management program according to a fifteenth aspect of the present invention is the program according to the fourteenth aspect, wherein the monitoring server sends the failure information to the monitoring terminal when the failure handling division definition of the failure information received from the monitored server is a display type. And a function of the monitoring terminal displaying the received fault information.
[0024]
A knowledge-type operation management program according to a sixteenth aspect of the present invention provides the knowledge-type operation management program according to the fourteenth aspect, wherein the monitoring server is configured such that the failure handling division definition of the received failure information is a knowledge type and the command input division definition is a confirmation execution type. Is a function of transmitting the recovery data acquired after acquiring the recovery data corresponding to the failure information to the monitoring terminal, a function of the monitoring terminal displaying and confirming the received recovery data, and transmitting an acknowledgment to the monitoring server. A function of transmitting a recovery command included in recovery data to the monitored server when the monitoring server receives an acknowledgment.
[0025]
A knowledge-type operation management program according to a seventeenth aspect of the present invention provides the knowledge-type operation management program according to the fourteenth aspect, wherein the monitoring server is configured so that the failure handling division definition of the received failure information is a knowledge type and the command input division definition is a confirmation execution type. Is a function of transmitting the recovery data obtained after obtaining the recovery data corresponding to the failure information to the monitoring terminal, a function of displaying and updating the received recovery data by the monitoring terminal, and updating the recovery data to the monitoring server. A function of transmitting, a function of the monitoring server receiving update recovery data from the monitoring terminal and updating a database in which recovery data is registered with the received update recovery data, and a function of monitoring a recovery command included in the update recovery data. Function to transmit to the target server.
[0026]
A knowledge-type operation management program according to an eighteenth aspect of the present invention is the program according to the fifteenth aspect, further realizing: a function of the monitoring terminal defining recovery data related to the displayed failure message and transmitting the recovery data to the monitoring server; and a function of the monitoring server registering the received recovery data in the knowledge database.
[0027]
A knowledge-type operation management program according to a nineteenth aspect of the present invention is the program according to the fourteenth, fifteenth, sixteenth, seventeenth, or eighteenth aspect, further realizing a function of the monitoring server acquiring recovery data by searching a database in which recovery data is registered, using a message key included in the failure information.
[0028]
BEST MODE FOR CARRYING OUT THE INVENTION
In the operation management of a business system running on computers, the present invention accumulates in a database the methods used to handle failures that have occurred on monitored servers and the like. When such a failure occurs again, the accumulated database is searched for the handling method, which is displayed as a guide message that is easy to understand for an operator who monitors only part of the business system (hereinafter, the operator), so that the operator is guided through the recovery.
[0029]
The present invention provides a mechanism for registering in a database the system know-how held only by a system administrator who knows the configuration of the entire business system and is familiar with its recovery methods (hereinafter, the system administrator). This makes the knowledge database easy to build and allows it to be searched when a past failure recurs, so that the same operation level as that of the system administrator is ensured even when the operator has low skill.
[0030]
An embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing an overall configuration of an embodiment of the present invention.
[0031]
Referring to FIG. 1, an example of an operation management system using the present invention includes n (n is an integer of 1 or more) monitored servers 1 on which business applications run, a monitoring server 2 that collects failure information from the monitored servers 1, and a monitoring terminal 3 that displays information from the monitoring server 2 and is operated by an operator or the like.
[0032]
The monitored server 1 is an information processing device that operates under program control and has first operation management means 11.
[0033]
The first operation management means 11 collects failure information in its own server and notifies the second operation management means 21 of the failure information. It also executes a recovery command when instructed by the second operation management means 21 and reports the result.
[0034]
The monitoring server 2 is an information processing device that operates under program control, and includes second operation management means 21 and a knowledge database 22.
[0035]
The second operation management means 21 acquires failure information from the first operation management means 11, instructs the first operation management means 11 to input a recovery command to the monitored server 1, registers and retrieves system recovery know-how in cooperation with the knowledge database 22, and notifies the monitoring terminal 3 of the acquired information.
[0036]
The knowledge database 22 stores a recovery command and a guide message that are related information of the failure information defined by the system administrator. The knowledge database 22 stores a recovery command and a guide message using a message ID, a node or a message text, or a combination thereof as a key (message key). Hereinafter, the recovery command and the guide message are collectively referred to as recovery data.
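The keyed storage described above can be illustrated with a minimal Python sketch. The class names, the in-memory dictionary, the placeholder command strings, and in particular the rule that the most specific matching key wins are assumptions made for illustration; the text only states that the key is a message ID, a node, a message text, or a combination thereof.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MessageKey:
    """Message key: any combination of message ID, node and message text."""
    message_id: Optional[str] = None
    node: Optional[str] = None
    text: Optional[str] = None


@dataclass
class RecoveryData:
    """Recovery command plus the guide message shown to the operator."""
    recovery_command: str
    guide_message: str


class KnowledgeDatabase:
    """In-memory stand-in for the knowledge database 22 (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def register(self, key, data):
        self._entries[key] = data

    def search(self, message_id, node, text):
        """Return recovery data for the most specific matching key, or None.

        Assumption: a key field left as None acts as a wildcard, and the key
        with the most non-None fields is preferred.
        """
        best, best_score = None, -1
        for key, data in self._entries.items():
            fields = [(key.message_id, message_id), (key.node, node), (key.text, text)]
            if any(k is not None and k != v for k, v in fields):
                continue  # a specified field does not match this failure
            score = sum(k is not None for k, _ in fields)
            if score > best_score:
                best, best_score = data, score
        return best


db = KnowledgeDatabase()
# "restart-a" is a placeholder command, not one named in the text.
db.register(MessageKey(message_id="DCRO03"),
            RecoveryData("restart-a", "Restart business service A."))
db.register(MessageKey(message_id="DCRO03", node="node1"),
            RecoveryData("restart-a --node1", "Restart business service A on node1."))
```

Keying on a frozen dataclass keeps the combination of fields hashable, so the same structure serves both registration and search.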
[0037]
Here, details of the first operation management means 11 and the second operation management means 21 will be described with reference to FIG.
FIG. 2 is a diagram showing a detailed configuration of the first operation management means and the second operation management means.
[0038]
First, the second operation management means 21 will be described.
[0039]
Referring to FIG. 2, the second operation management means 21 includes a second management service unit 211, a second failure information acquisition service unit 212, a knowledge service unit 213, a database service unit 214, a second command input service unit 215, and an application unit 216.
[0040]
The second management service unit 211 has the following four functions.
(1) The information of the definition file of the failure handling information defined by the system administrator is read and reflected in the second failure information acquisition service unit 212 and the knowledge service unit 213.
(2) Send to the second failure information acquisition service unit 212 the failure handling category definition, which determines whether the knowledge service unit 213 is used for given failure information.
(3) Send to the first management service unit 111 the definition information that sets which failure messages the first failure information acquisition service unit 112 is to collect.
(4) Instruct the knowledge service unit 213 to register the defined recovery data in the knowledge database 22.
[0041]
The second failure information acquisition service unit 212 has the following two functions.
(1) Obtain a failure message of the monitored server 1 sent from the first failure information acquisition service unit 112 and notify the knowledge service unit 213.
(2) In order to display the failure message on the monitoring terminal 3, the application unit 216 is notified of the message.
[0042]
The knowledge service unit 213 has the following four functions.
(1) Based on the failure message from the second failure information acquisition service unit 212, a request is issued to search the recovery data corresponding to the failure message from the knowledge database 22, and the result is acquired.
(2) When a recovery command is to be input, request the second command input service unit 215 to input the recovery command to the first command input service unit 113 in the designated monitored server 1, and obtain the result.
(3) Notify the application unit 216 of the restoration command input result so that the monitoring terminal 3 can refer to it.
(4) If there is an instruction from the monitoring terminal 3 to update (add / modify / delete) the recovery data, the request is issued to the database service unit 214.
[0043]
The database service unit 214 has the following two functions.
(1) Search the knowledge database 22 in response to a search request from the knowledge service unit 213, and return a search result to the knowledge service unit 213.
(2) If there is a request from the knowledge service unit 213 to update (add / modify / delete) the recovery data for the specified failure message, the request is executed for the knowledge database 22.
[0044]
The second command input service unit 215 has the following two functions.
(1) In response to a failure recovery command input request from the knowledge service unit 213, the recovery instruction is input to the first command input service unit 113 in the designated monitoring target server 1.
(2) The result of input from the first command input service unit 113 is notified to the knowledge service unit 213.
[0045]
The application unit 216 acts as an intermediary: it transmits information from each service unit to the monitoring terminal 3 and transmits operations from the monitoring terminal 3 to the knowledge service unit 213. As a result, the monitoring terminal 3 can be used to view the failure messages of all monitored servers 1 centrally and to input a failure handling command (recovery command) to a monitored server 1.
[0046]
Next, the first operation management means 11 will be described.
[0047]
Referring to FIG. 2, the first operation management means 11 includes a first management service unit 111, a first failure information acquisition service unit 112, and a first command input service unit 113.
[0048]
The first management service unit 111 has a function of reflecting, in the first failure information acquisition service unit 112, the definition information sent from the second management service unit 211 that indicates which failure messages are to be collected.
[0049]
The first failure information acquisition service unit 112 has a function of collecting failure messages generated in the monitored server 1 based on the definition information from the first management service unit 111 and transmitting them to the second failure information acquisition service unit 212.
[0050]
The first command input service unit 113 has a function of receiving a recovery command input instruction from the second command input service unit 215, inputting the command to the monitored server 1, and notifying the second command input service unit 215 of the result (recovery result).
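As a rough illustration of how a command input service of this kind might run a recovery command and report the result, the following Python sketch uses the standard `subprocess` module. The function name, the dictionary layout of the result, and the timeout are assumptions; the text does not specify how commands are executed or how results are encoded.

```python
import subprocess
import sys


def input_recovery_command(args, timeout_s=30):
    """Run a recovery command (argument list, no shell) and return its outcome.

    Hypothetical result layout: "ok" is True only when the exit code is 0.
    """
    try:
        done = subprocess.run(args, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # The command could not be started or did not finish in time.
        return {"ok": False, "returncode": None, "output": str(exc)}
    return {
        "ok": done.returncode == 0,
        "returncode": done.returncode,
        "output": done.stdout.strip(),
    }


# A harmless stand-in for a real recovery command.
result = input_recovery_command([sys.executable, "-c", "print('recovered')"])
```

Passing an argument list rather than a shell string avoids shell interpretation of recovery commands stored in a database, which seems a prudent choice for automatically executed commands.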
[0051]
Here, a failure message collected and transmitted by the first failure information acquisition service unit 112 will be described with reference to FIG.
FIG. 3 is a diagram illustrating a format example of a failure message.
[0052]
Referring to FIG. 3, the failure message includes the date and time of occurrence, the message ID, the server (node) on which the failure occurred, and the message text. The message ID is an identification number assigned to the message text. In the example of FIG. 3, the message text of Example 1, "Business service A has terminated abnormally", is assigned the message ID "DCRO03", and the message text of Examples 2-1 and 2-2, "File creation failed error code = ?", is assigned "DCRO19". The "?" in "error code = ?" in the message text of Example 2 is replaced with a number that identifies which of several known causes of file creation failure applies: "1" when there is no write permission to the disk, and "4" when the disk has insufficient free space. The system administrator defines items such as the message ID and the error code in the message text as the conditions for associating a failure message to be collected with recovery data.
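The message format and the use of the error code as part of the matching condition can be sketched in Python as follows. The field names, the date and time strings, the node name, and the regular expression are assumptions for illustration, since FIG. 3 itself is not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureMessage:
    """One failure message; the field layout is assumed, not taken from FIG. 3."""
    date: str
    time: str
    message_id: str
    node: str
    text: str

    def error_code(self) -> Optional[str]:
        """Extract the number that replaces "?" in "error code = ?", if present."""
        match = re.search(r"error code\s*=\s*(\d+)", self.text, re.IGNORECASE)
        return match.group(1) if match else None

    def message_key(self):
        """Condition used to associate the message with recovery data."""
        return (self.message_id, self.error_code())


# Illustrative values only; the date, time and node are made up.
example_1 = FailureMessage("2002-11-27", "10:00:00", "DCRO03", "node1",
                           "Business service A has terminated abnormally")
example_2 = FailureMessage("2002-11-27", "10:05:00", "DCRO19", "node1",
                           "File creation failed error code = 4")
```

With this layout, the two variants of "DCRO19" (error code 1 and error code 4) yield different keys and can therefore be mapped to different recovery data.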
[0053]
The monitoring terminal 3 is a device that operates under program control, is operated by a system administrator or an operator, and displays information from the monitoring server 2. The system administrator defines the following failure handling information from the monitoring terminal 3.
(1) Definition of the messages to be subjected to failure management (managed failure definition, that is, creation of a managed message list). A message is defined by a message key consisting of a message ID, a node or a message text, or a combination thereof.
(2) Definition of whether or not the defined failure message uses the knowledge service unit 213 (failure handling category definition). The display type is defined when the monitoring terminal 3 only needs to be notified that an event has occurred, and the knowledge type is defined when failure handling is required. A display type message is notified to the application unit 216, and a knowledge type message is notified to the knowledge service unit 213.
(3) Definition of recovery data (recovery command and guide message) for a failure message using the knowledge service unit 213.
(4) Definition of whether the recovery command in the recovery data is input automatically (automatic execution type) or not (confirmation execution type) (command input category definition). If the recovery command is of the confirmation execution type, the operator confirms the contents of the recovery data from the monitoring terminal 3 and, if an update is needed, adds to, modifies, or deletes the recovery data.
[0054]
Next, the operation of the embodiment of the present invention will be described in detail with reference to FIGS.
FIG. 4 is a diagram showing the flow of the operation in the operation preparation stage.
FIG. 5 is a diagram showing a flow of operation in the operation stage.
[0055]
The operation of the embodiment of the present invention will be described in the following two parts.
・ Operation at the operation preparation stage
・ Operation at the operation stage
[0056]
First, the operation in the operation preparation stage will be described with reference to FIGS.
[0057]
From the monitoring terminal 3, the system administrator defines which messages are to be treated as failures (managed failure definition, that is, creation of a managed message list) and, for failure messages that use the knowledge service unit 213, defines the recovery data, the failure handling category, and the command input category (FIG. 2: (1), FIG. 4: Step 101). The failure handling category is defined so that the application unit 216 is used when the monitoring terminal 3 only needs to be notified that an event has occurred, and the knowledge service unit 213 is used when failure handling is required. The second failure information acquisition service unit 212 notifies the knowledge service unit 213 or the application unit 216 based on this definition.
[0058]
The defined information is distributed to each service unit through the second management service unit 211 (FIG. 2: (2) → (3), FIG. 4: Step 102). That is, the management target failure definition is registered in the first failure information acquisition service unit 112, and the command input classification definition is registered in the knowledge service unit 213. In addition, the second failure information acquisition service unit 212 reflects a trigger (failure handling category definition) for transmitting a failure message from the monitored server 1 to the application unit 216 or the knowledge service unit 213.
[0059]
A recovery data registration instruction is issued to the knowledge service unit 213 (FIG. 2: (4)) and registered in the knowledge database 22 through the database service unit 214 (FIG. 2: (5), FIG. 4: step 103).
[0060]
With the above operation, the failure handling information is defined, and the operation preparation ends. When a new message to be managed appears at the operation stage, the above-described operation preparation processing is performed on the message.
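The preparation steps above (Steps 101 to 103) amount to splitting one set of definitions into the pieces that each component registers. The following Python sketch shows one way this could look; the type names, the use of a message ID alone as the key, and the ID "DCRO99" are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class HandlingCategory(Enum):
    DISPLAY = "display"        # only notify the monitoring terminal
    KNOWLEDGE = "knowledge"    # use the knowledge service unit


class CommandInputCategory(Enum):
    AUTOMATIC = "automatic"        # input the recovery command automatically
    CONFIRMATION = "confirmation"  # input after the operator confirms


@dataclass
class FailureDefinition:
    message_id: str
    handling: HandlingCategory
    command_input: Optional[CommandInputCategory] = None
    recovery_command: Optional[str] = None
    guide_message: Optional[str] = None


def distribute(definitions):
    """Split the definitions into what each component registers."""
    managed_messages = set()   # for the first failure information acquisition service unit 112
    handling_categories = {}   # for the second failure information acquisition service unit 212
    command_categories = {}    # for the knowledge service unit 213
    knowledge_db = {}          # for the knowledge database 22
    for d in definitions:
        managed_messages.add(d.message_id)
        handling_categories[d.message_id] = d.handling
        if d.handling is HandlingCategory.KNOWLEDGE:
            command_categories[d.message_id] = d.command_input
            knowledge_db[d.message_id] = (d.recovery_command, d.guide_message)
    return managed_messages, handling_categories, command_categories, knowledge_db


definitions = [
    FailureDefinition("DCRO03", HandlingCategory.KNOWLEDGE,
                      CommandInputCategory.AUTOMATIC,
                      "restart-a", "Restart business service A."),
    FailureDefinition("DCRO99", HandlingCategory.DISPLAY),  # hypothetical ID
]
managed, handling, commands, kdb = distribute(definitions)
```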
[0061]
Next, the operation in the operation stage will be described with reference to FIGS.
[0062]
When a failure occurs in the monitored server 1 (FIG. 4: step 201), the first failure information acquisition service unit 112 checks the message ID and error code of the message output due to the failure against the managed failure definition (managed message list) to determine whether the message is a managed message. If it is, the unit assembles it into the failure message format shown in FIG. 3 and notifies the second failure information acquisition service unit 212 of the failure message (FIG. 2: (6)).
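A minimal Python sketch of this filtering step is given below. The representation of the managed message list as a set of (message ID, error code) pairs, with `None` meaning "any error code", and the dictionary fields are assumptions for illustration only.

```python
def is_managed(message_id, error_code, managed_list):
    """True if the message matches an entry of the managed message list.

    Assumption: an entry (message_id, None) matches every error code.
    """
    return (message_id, error_code) in managed_list or (message_id, None) in managed_list


def collect(raw_messages, managed_list, node):
    """Keep only managed messages and tag them with the node they came from."""
    collected = []
    for raw in raw_messages:
        if is_managed(raw["message_id"], raw.get("error_code"), managed_list):
            collected.append({
                "message_id": raw["message_id"],
                "error_code": raw.get("error_code"),
                "node": node,
                "text": raw["text"],
            })
    return collected


managed_list = {("DCRO03", None), ("DCRO19", "4")}
raw = [
    {"message_id": "DCRO03", "text": "Business service A has terminated abnormally"},
    {"message_id": "DCRO19", "error_code": "1", "text": "File creation failed error code = 1"},
    {"message_id": "DCRO19", "error_code": "4", "text": "File creation failed error code = 4"},
]
to_send = collect(raw, managed_list, node="node1")
```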
[0063]
If the notified failure message is defined to use the knowledge service unit 213, that is, if the failure handling category is the knowledge type (FIG. 4: Step 202 → Step 203), the second failure information acquisition service unit 212 notifies the knowledge service unit 213 of the failure message (FIG. 2: (7)).
[0064]
The knowledge service unit 213 issues to the database service unit 214 a request to search for a matching failure message and its recovery data (FIG. 2: (4)). The database service unit 214 searches the knowledge database 22 according to the request and returns the result (recovery data) (FIG. 2: (5) → (8) → (9)).
[0065]
If the recovery command in the recovery data is registered so as to be input automatically, that is, if the command input category is the automatic execution type (FIG. 4: Step 203 → Step 204), the second command input service unit 215 is instructed to input the recovery command to the first command input service unit 113 of the monitored server 1 in which the failure occurred (FIG. 2: (10) → (11), FIG. 4: Step 204).
[0066]
The second command input service unit 215 passes the recovery data to the first command input service unit 113, which inputs and executes the recovery command. The second command input service unit 215 receives the input result from the first command input service unit 113 (FIG. 2: (12)) and notifies the knowledge service unit 213 (FIG. 2: (13)), which in turn notifies the application unit 216 so that the result is displayed on the monitoring terminal 3 (FIG. 2: (14) → (15), FIG. 4: Step 205).
[0067]
When the recovery command in the recovery data is registered so as to be input after the operator confirms it, that is, when the command input category is the confirmation execution type (FIG. 4: Step 203 → Step 206), the contents of the recovery data are displayed on the monitoring terminal 3 to prompt the operator to act (FIG. 2: (5) → (9) → (14) → (15), FIG. 4: Step 206). If the recovery data is not updated, the command is input (FIG. 2: (16) → (17) → (10) → (11), FIG. 4: Step 207 → Step 204).
[0068]
If the content of the recovery data is to be updated (added to / modified / deleted), the operator updates the recovery data displayed on the monitoring terminal 3, and the updated recovery data is stored in the knowledge database 22 (FIG. 2: (16) → (17) → (4) → (5), FIG. 4: Step 208 → Step 209). Thereafter, the process proceeds to command input and notification of the command input result (FIG. 2: (10) → (11) → (12) → (13) → (14) → (15); FIG. 4: Step 204 → Step 205).
[0069]
When the knowledge service unit 213 is not used, that is, when the failure handling category is the display type, the failure message from the first failure information acquisition service unit 112 is notified to the monitoring terminal 3 through the application unit 216 (FIG. 2: (6) → (18) → (15), FIG. 4: Step 202 → Step 210).
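The branching described in the operation stage (display type versus knowledge type, automatic versus confirmation execution) can be summarized in a short Python sketch. The function signature, the string values for the categories, the callback parameters, and the behavior when no recovery data is found are all assumptions for illustration; the text does not state what happens in the last case.

```python
def handle_failure(message_id, handling, command_input, knowledge_db,
                   execute, display, confirm):
    """Route one failure message on the monitoring server side.

    handling: "display" or "knowledge"; command_input: "automatic" or "confirmation".
    execute(command) returns a result, display(text) shows text on the terminal,
    confirm(recovery) returns the recovery data, possibly edited by the operator.
    """
    if handling == "display":
        display(f"failure {message_id}")
        return None
    recovery = knowledge_db.get(message_id)
    if recovery is None:
        # Assumption: without recovery data the message is only displayed.
        display(f"failure {message_id} (no recovery data)")
        return None
    if command_input == "confirmation":
        updated = confirm(dict(recovery))  # operator sees a copy and may edit it
        if updated != recovery:
            knowledge_db[message_id] = updated  # store the updated recovery data
            recovery = updated
    result = execute(recovery["command"])
    display(f"recovery result for {message_id}: {result}")
    return result


shown, ran = [], []
db = {"DCRO03": {"command": "restart-a", "guide": "Restart business service A."}}


def run(command):
    ran.append(command)
    return "normal"


handle_failure("DCRO03", "knowledge", "automatic", db, run, shown.append, lambda r: r)
```

Passing the execution, display, and confirmation steps as callbacks keeps the routing logic separate from how the monitored server and monitoring terminal are actually reached.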
[0070]
If the failure message notified to the monitoring terminal 3 is a failure for which recovery data should be defined (a new failure) (FIG. 4: Step 211 → Step 212), the system administrator defines the failure handling know-how (recovery data) by the same processing as in the operation preparation stage and accumulates it in the knowledge database 22 (FIG. 2: (1) → (2) → (4) → (5), FIG. 4: Step 212 → Step 213 → Step 214).
[0071]
Further, for a message that is managed but for which no recovery data is defined, the recovery data can be registered in the knowledge database 22 by the following method.
[0072]
For example, if recovery data is defined only for the failure message of Example 2-1 in FIG. 3, then when the failure message of Example 2-2 occurs, the cause of the failure is different, so recovery data would need to be registered for each cause. However, it is not necessary to add to or modify the definition file and reflect it in the knowledge service unit 213 again. Instead, the failure message is defined by its message ID, and recovery data corresponding to the error code of the failure message that occurred is added by an input operation on the screen of the monitoring terminal 3. The contents of the knowledge database 22 are thereby updated and can be used from the next occurrence. In this way, recovery data can easily be added later for a failure message that has many error codes.
[0073]
In this way, by accumulating failure handling methods in the knowledge database, the knowledge database can be searched when a past failure recurs, and the failure can be handled.
[0074]
Next, another embodiment of the present invention will be described.
[0075]
In the configuration of this other embodiment, a local database for registering recovery commands is added to the monitored server in addition to the configuration of the above-described embodiment, together with a recovery command registration unit and a local database search unit.
[0076]
The operation of another embodiment will be described.
[0077]
In the operation preparation stage, the command input section definition and the fault handling section definition are also registered in the first fault information acquisition service section.
[0078]
In the operation stage, when a failure occurs in the monitored server, the first failure information acquisition service unit collects the failure information and, if the failure handling category is the knowledge type and the command input category is the automatic execution type, passes the message key of the failure information to the local database search unit. The local database search unit searches the local database using the message key and, if there is a match, retrieves the recovery command and passes it to the first command input service unit. The first command input service unit inputs and executes the recovery command, and transmits the recovery result to the monitoring terminal via the monitoring server.
[0079]
When the failure handling category is the knowledge type but the command input category is not the automatic execution type, or when the local database is searched and no match is found, the first failure information acquisition service unit transmits the failure information to the monitoring server, and thereafter the same operation as in the embodiment described above is performed. However, when the first command input service unit has input the recovery command and the recovery result is normal, the recovery command registration unit registers the recovery command in the local database.
[0080]
In this way, with respect to the message registered in the local database, when the failure recurs, the failure can be processed only by the monitored server.
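The local database variant behaves like a cache of recovery commands on the monitored server. The Python sketch below shows the idea under several assumptions: the function and parameter names are hypothetical, a return value of 0 from `execute` stands for a normal recovery result, and the reporting of the result to the monitoring terminal is omitted.

```python
def handle_on_monitored_server(message_key, handling, command_input,
                               local_db, execute, send_to_monitoring_server):
    """Handle a failure locally when possible, otherwise via the monitoring server.

    send_to_monitoring_server(message_key) returns a recovery command or None.
    """
    if handling == "knowledge" and command_input == "automatic":
        command = local_db.get(message_key)
        if command is not None:
            return execute(command)  # handled by the monitored server alone
    command = send_to_monitoring_server(message_key)
    if command is None:
        return None  # e.g. display type: nothing to execute
    result = execute(command)
    if result == 0:
        # Register the command only after a normal recovery result.
        local_db[message_key] = command
    return result


local_db = {}
server_calls = []


def ask_server(key):
    server_calls.append(key)
    return "fix-disk"  # placeholder recovery command


key = ("DCRO19", "4")
first = handle_on_monitored_server(key, "knowledge", "automatic",
                                   local_db, lambda c: 0, ask_server)
second = handle_on_monitored_server(key, "knowledge", "automatic",
                                    local_db, lambda c: 0, ask_server)
```

On the second occurrence the command is found locally, so the monitoring server is not consulted again.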
[0081]
In the above-described embodiment of the present invention, a program or the like for executing the processing operations of the knowledge-type operation management system is recorded as data on a recording medium (not shown) of a computer, such as a magnetic disk or an optical disk, and the recorded data is read out and used to operate the knowledge-type operation management system. Because the data for operating the knowledge-type operation management system according to the present invention is recorded on a recording medium in this way, the functions of the knowledge-type operation management system can be realized by installing the data from the recording medium.
[0082]
[Effect of the invention]
According to the present invention, the following two effects can be obtained by combining the operation management system and the knowledge database.
[0083]
The first effect is that the operator can share the know-how of the system administrator.
[0084]
The reason is that, because past failure information and recovery methods are stored in the knowledge database, the knowledge database can be searched for a failure that has occurred before and the recovery method and guide message can be displayed.
[0085]
The second effect is that an independent operation management system can be constructed.
[0086]
The reason is that by setting the restoration command registered in the knowledge database to be automatically input, the system can be restored without the intervention of the operator.
[Brief description of the drawings]
FIG. 1 is a diagram showing an overall configuration of an embodiment of the present invention.
FIG. 2 is a diagram showing a detailed configuration of an operation management unit.
FIG. 3 is a diagram showing a format example of a failure message.
FIG. 4 is a diagram showing a flow of operation in an operation preparation stage.
FIG. 5 is a diagram showing a flow of operation in an operation stage.
[Explanation of symbols]
1 Monitored server
2 Monitoring server
3 Monitoring terminal
11 First operation management means
21 Second operation management means
22 Knowledge database
111 First management service unit
112 First failure information acquisition service unit
113 First command input service unit
211 Second management service unit
212 Second failure information acquisition service unit
213 Knowledge service unit
214 Database service unit
215 Second command input service unit
216 Application unit

Claims

1. A knowledge-type operation management system comprising:
a monitoring terminal that defines recovery data for failure information, registers the recovery data in a monitoring server, and displays a recovery result from a monitored server;
the monitored server, which, when a failure occurs, collects failure information and transmits it to the monitoring server, inputs and executes a recovery command received from the monitoring server, and transmits the recovery result to the monitoring server; and
the monitoring server, which acquires the recovery data corresponding to the failure information received from the monitored server, transmits the recovery command included in the recovery data to the monitored server, and transmits the recovery result received from the monitored server to the monitoring terminal.

2. A knowledge-type operation management system comprising:
a monitoring terminal that defines a managed failure, a failure handling category, a command input category, and recovery data, registers the managed failure definition in a monitored server, registers the failure handling category definition, the command input category definition, and the recovery data in a monitoring server, and displays a recovery result from the monitored server;
the monitored server, which, when a failure occurs, collects failure information based on the managed failure definition and transmits it to the monitoring server, inputs a recovery command received from the monitoring server, and transmits the recovery result to the monitoring server; and
the monitoring server, which receives the failure information from the monitored server and, when the failure handling category definition of the received failure information is the knowledge type and the command input category definition is the automatic execution type, acquires the recovery data corresponding to the failure information, transmits the recovery command included in the recovery data to the monitored server, and transmits the recovery result received from the monitored server to the monitoring terminal.

3. The knowledge-type operation management system according to claim 2, wherein the monitoring server transmits the failure information to the monitoring terminal when the failure handling category definition of the failure information received from the monitored server is the display type, and the monitoring terminal displays the received failure message.

4. The knowledge-type operation management system according to claim 2, wherein:
the monitoring server, when the failure handling category definition of the received failure information is the knowledge type and the command input category definition is the confirmation execution type, acquires the recovery data corresponding to the failure information, transmits the acquired recovery data to the monitoring terminal, and, upon receiving an acknowledgment from the monitoring terminal, transmits the recovery command included in the recovery data to the monitored server; and
the monitoring terminal displays the received recovery data for confirmation and transmits the acknowledgment to the monitoring server.

5. The knowledge-type operation management system according to claim 2, wherein:
the monitoring server, when the failure handling category definition of the received failure information is the knowledge type and the command input category definition is the confirmation execution type, acquires the recovery data corresponding to the failure information, transmits the acquired recovery data to the monitoring terminal, receives updated recovery data from the monitoring terminal, updates the database in which recovery data is registered with the received updated recovery data, and transmits the recovery command included in the updated recovery data to the monitored server; and
the monitoring terminal displays and updates the received recovery data and transmits the updated recovery data to the monitoring server.

6. The knowledge-type operation management system according to claim 3, wherein:
the monitoring terminal defines recovery data relating to the displayed failure message and transmits the recovery data to the monitoring server; and
the monitoring server registers the received recovery data in the knowledge database.

7. The knowledge-type operation management system according to claim 1, 2, 3, 4, 5, or 6, wherein the monitoring server has a database in which recovery data is registered, and acquires the recovery data by searching the database with a message key included in the failure information.

A knowledge-type operation management method in which a monitoring server operates and manages a monitored server, wherein:
a monitoring terminal defines a managed failure definition, a failure handling category definition, a command input category definition, and recovery data, registers the managed failure definition in the monitored server, and registers the failure handling category definition, the command input category definition, and the recovery data in the monitoring server;
when a failure occurs, the monitored server collects failure information based on the managed failure definition and transmits the failure information to the monitoring server;
the monitoring server receives the failure information from the monitored server and, when the failure handling category definition for the received failure information is a knowledge type and the command input category definition is an automatic execution type, obtains the recovery data corresponding to the failure information and transmits a recovery command included in the recovery data to the monitored server;
the monitored server executes the recovery command received from the monitoring server and transmits a recovery result to the monitoring server;
the monitoring server transmits the recovery result received from the monitored server to the monitoring terminal; and
the monitoring terminal displays the recovery result received from the monitoring server.
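
The automatic-execution path of this method can be illustrated with a minimal Python sketch. It is not the implementation disclosed in this publication: the class and field names, the category strings "knowledge", "auto", "display" and "confirm", the use of a message key to index the definitions, and the two callables standing in for the network links are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Definition:
    """Per-message settings registered in the monitoring server (illustrative names)."""
    handling_category: str   # assumed values: "knowledge" or "display"
    command_category: str    # assumed values: "auto" or "confirm"
    recovery_command: str    # recovery command held in the recovery data


class MonitoringServer:
    def __init__(
        self,
        definitions: Dict[str, Definition],
        execute_on_monitored: Callable[[str], str],
        send_to_terminal: Callable[[str], None],
    ) -> None:
        self.definitions = definitions
        # Hypothetical stand-ins for the links to the monitored server and terminal.
        self.execute_on_monitored = execute_on_monitored
        self.send_to_terminal = send_to_terminal

    def on_failure(self, message_key: str) -> Optional[str]:
        """Return the recovery result, or None when the automatic path does not apply."""
        definition = self.definitions.get(message_key)
        if definition is None:
            return None
        categories = (definition.handling_category, definition.command_category)
        if categories != ("knowledge", "auto"):
            return None  # other combinations are outside this sketch
        # Send the recovery command to the monitored server and collect its result.
        result = self.execute_on_monitored(definition.recovery_command)
        # Forward the recovery result to the monitoring terminal for display.
        self.send_to_terminal(result)
        return result


# Example run with in-memory stand-ins for the monitored server and the terminal.
displayed = []
server = MonitoringServer(
    {"DISK_FULL": Definition("knowledge", "auto", "cleanup /tmp")},
    execute_on_monitored=lambda command: f"ok: {command}",
    send_to_terminal=displayed.append,
)
result = server.on_failure("DISK_FULL")
```

In this sketch the monitoring server is the only component that decides whether a command is issued, which mirrors the division of roles recited above.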

The knowledge-type operation management method according to claim 8, wherein, when the failure handling category definition for the failure information received from the monitored server is a display type, the monitoring server transmits the failure information to the monitoring terminal, and the monitoring terminal displays the received failure information.

The knowledge-type operation management method according to claim 8, wherein:
when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is a confirmation execution type, the monitoring server obtains the recovery data corresponding to the failure information and then transmits the obtained recovery data to the monitoring terminal;
the monitoring terminal displays the received recovery data for confirmation and transmits a confirmation response to the monitoring server; and
upon receiving the confirmation response, the monitoring server transmits the recovery command included in the recovery data to the monitored server.

The knowledge-type operation management method according to claim 8, wherein:
when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is a confirmation execution type, the monitoring server obtains the recovery data corresponding to the failure information and then transmits the obtained recovery data to the monitoring terminal;
the monitoring terminal displays and updates the received recovery data and transmits the updated recovery data to the monitoring server; and
the monitoring server receives the updated recovery data from the monitoring terminal, updates the database in which the recovery data is registered with the received updated recovery data, and transmits a recovery command included in the updated recovery data to the monitored server.
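
The two confirmation-execution variants above (plain confirmation, and confirmation with an operator update) can be sketched together as follows. This is a hedged illustration only: `RecoveryData`, `confirm_and_recover`, the `ask_terminal` callback, and the convention that `None` means "no response from the terminal" are hypothetical and do not come from this publication.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class RecoveryData:
    message_key: str
    recovery_command: str


def confirm_and_recover(
    message_key: str,
    database: Dict[str, RecoveryData],
    ask_terminal: Callable[[RecoveryData], Optional[RecoveryData]],
    execute_on_monitored: Callable[[str], str],
) -> Optional[str]:
    """Confirmation-execution path of the monitoring server (illustrative).

    ask_terminal shows the recovery data to the operator and returns the same
    data (plain confirmation) or an edited copy (updated recovery data).
    Returning None models "no response", which is an assumption of this sketch.
    """
    current = database.get(message_key)
    if current is None:
        return None
    answer = ask_terminal(current)
    if answer is None:
        return None
    if answer != current:
        # The operator changed the data: update the database before recovering.
        database[message_key] = answer
    return execute_on_monitored(answer.recovery_command)


# Example: the operator edits the command before approving it.
db = {"DISK_FULL": RecoveryData("DISK_FULL", "cleanup /tmp")}
sent: List[str] = []


def run(command: str) -> str:
    sent.append(command)  # record what would be sent to the monitored server
    return "recovered"


outcome = confirm_and_recover(
    "DISK_FULL",
    db,
    lambda data: replace(data, recovery_command="cleanup /var/tmp"),
    run,
)
```

Treating a plain confirmation as "the same data returned unchanged" lets one function cover both variants; a real system would more likely use distinct message types for the confirmation response and the updated recovery data.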

The knowledge-type operation management method according to claim 9, wherein:
the monitoring terminal defines recovery data for the displayed failure message and transmits the recovery data to the monitoring server; and
the monitoring server registers the received recovery data in the knowledge database.

The knowledge-type operation management method according to claim 8, 9, 10, 11, or 12, wherein the monitoring server obtains the recovery data by searching the database in which the recovery data is registered with a message key included in the failure information.
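
Registering recovery data and retrieving it by message key can be sketched with Python's standard `sqlite3` module. The table name, column names, and function names below are assumptions for illustration; the publication does not specify a database product or schema.

```python
import sqlite3
from typing import Optional


def open_knowledge_db() -> sqlite3.Connection:
    """Create an in-memory database with one hypothetical recovery-data table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE recovery_data ("
        " message_key TEXT PRIMARY KEY,"
        " recovery_command TEXT NOT NULL)"
    )
    return conn


def register_recovery_data(
    conn: sqlite3.Connection, message_key: str, recovery_command: str
) -> None:
    """Register recovery data, replacing any earlier entry for the same key."""
    conn.execute(
        "INSERT OR REPLACE INTO recovery_data (message_key, recovery_command)"
        " VALUES (?, ?)",
        (message_key, recovery_command),
    )
    conn.commit()


def find_recovery_command(
    conn: sqlite3.Connection, message_key: str
) -> Optional[str]:
    """Search by the message key carried in the failure information."""
    row = conn.execute(
        "SELECT recovery_command FROM recovery_data WHERE message_key = ?",
        (message_key,),
    ).fetchone()
    return row[0] if row else None


# Example: register one entry, then look it up by its message key.
conn = open_knowledge_db()
register_recovery_data(conn, "DISK_FULL", "cleanup /tmp")
command = find_recovery_command(conn, "DISK_FULL")
```

Making the message key the primary key means one recovery entry per failure message, so a later registration for the same key overwrites the earlier one. That is a design choice of this sketch, not something the claims require.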

A knowledge-type operation management program for use in a knowledge-type operation management system in which a monitoring server operates and manages a monitored server, the program causing a computer to realize:
in a monitoring terminal,
a function of defining a managed failure definition, a failure handling category definition, a command input category definition, and recovery data,
a function of registering the managed failure definition in the monitored server, and
a function of registering the failure handling category definition, the command input category definition, and the recovery data in the monitoring server;
in the monitored server,
a function of, when a failure occurs, collecting failure information based on the managed failure definition and transmitting the failure information to the monitoring server;
in the monitoring server,
a function of receiving the failure information from the monitored server,
a function of obtaining the recovery data corresponding to the failure information when the failure handling category definition for the received failure information is a knowledge type and the command input category definition is an automatic execution type, and
a function of transmitting a recovery command included in the recovery data to the monitored server;
in the monitored server,
a function of executing the recovery command received from the monitoring server, and
a function of transmitting a recovery result to the monitoring server;
in the monitoring server,
a function of transmitting the recovery result received from the monitored server to the monitoring terminal; and
in the monitoring terminal,
a function of displaying the recovery result received from the monitoring server.

The knowledge-type operation management program according to claim 14, further causing the computer to realize:
in the monitoring server,
a function of transmitting the failure information to the monitoring terminal when the failure handling category definition for the failure information received from the monitored server is a display type; and
in the monitoring terminal,
a function of displaying the received failure information.

The knowledge-type operation management program according to claim 14, further causing the computer to realize:
in the monitoring server,
a function of, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is a confirmation execution type, obtaining the recovery data corresponding to the failure information and then transmitting the obtained recovery data to the monitoring terminal;
in the monitoring terminal,
a function of displaying the received recovery data for confirmation, and
a function of transmitting a confirmation response to the monitoring server; and
in the monitoring server,
a function of transmitting the recovery command included in the recovery data to the monitored server upon receiving the confirmation response.

The knowledge-type operation management program according to claim 14, further causing the computer to realize:
in the monitoring server,
a function of, when the failure handling category definition for the received failure information is the knowledge type and the command input category definition is a confirmation execution type, obtaining the recovery data corresponding to the failure information and then transmitting the obtained recovery data to the monitoring terminal;
in the monitoring terminal,
a function of displaying and updating the received recovery data, and
a function of transmitting the updated recovery data to the monitoring server; and
in the monitoring server,
a function of receiving the updated recovery data from the monitoring terminal and updating the database in which the recovery data is registered with the received updated recovery data, and
a function of transmitting a recovery command included in the updated recovery data to the monitored server.

The knowledge-type operation management program according to claim 15, further causing the computer to realize:
in the monitoring terminal,
a function of defining recovery data for the displayed failure message and transmitting the recovery data to the monitoring server; and
in the monitoring server,
a function of registering the received recovery data in the knowledge database.

The knowledge-type operation management program according to claim 14, 15, 16, 17, or 18, further causing the computer to realize:
in the monitoring server,
a function of obtaining the recovery data by searching the database in which the recovery data is registered with a message key included in the failure information.