JP2011175513A

JP2011175513A - Fault management system and method

Info

Publication number: JP2011175513A
Application number: JP2010039899A
Authority: JP
Inventors: Hisashi Shindo; 久進藤
Original assignee: NEC Computertechno Ltd
Current assignee: NEC Computertechno Ltd
Priority date: 2010-02-25
Filing date: 2010-02-25
Publication date: 2011-09-08
Anticipated expiration: 2030-02-25
Also published as: JP5505966B2

Abstract

PROBLEM TO BE SOLVED: To reduce working loads such as input operations and to provide information including an accurate suspicion rate while using the FRU table as initially set.

SOLUTION: When a fault phenomenon is detected by a service processor 3, fault history information is searched using information that specifies the fault phenomenon as a key. When the fault phenomenon coincides with one that occurred in the past, a fault factor part 2 having an inducement history of causing the fault phenomenon is extracted from the fault history information. When such a fault factor part 2 is extracted, a corrected suspicion rate is calculated by correcting the initially set suspicion rate that corresponds to the fault factor part 2 in the FRU table according to the inducement frequency of the fault factor part 2. When the relation between the fault phenomenon and the fault factor part 2 coincides between the FRU table and the fault history information, the corresponding inducement frequency in the fault history information is incremented.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a system and method for managing failures in an information processing system.

One technique for handling failures in an information processing system uses an FRU (Field Replaceable Unit) table. The FRU table associates the several types of failure events that can occur in the information processing system with the failure factor parts (processors, memories, node controllers, input/output devices, wiring, terminals, etc.) that may cause each failure event, and with the suspicion rate of each failure factor part. When a failure occurs, the failure factor parts corresponding to the failure event are extracted from the FRU table, and information on them is provided to maintenance personnel together with the suspicion rates.

Patent Document 1 discloses a configuration in which a failure check register information-FRU information correspondence table is registered in a single file. The table associates the code of each of a plurality of failure check registers, which detect the occurrence of a failure, with the code (FRU code) of the failure factor part corresponding to that register, and includes component replacement priority information for each FRU code.

Patent Document 2 discloses a configuration that optimizes the partition structure of the hardware resources of an information processing system based on the failure history of each hardware resource.

Patent Document 3 discloses a system that manages failures of an information processing system using a service processor and a fault dictionary (FRU table) containing failure rate data (suspicion rates), in which the failure rate data is updated based on the component replacement history and the like.

Patent Document 1: JP H11-249926 A
Patent Document 2: JP 2009-163646 A
Patent Document 3: JP H10-320241 A (see paragraph 0017, etc.)

Normally, the FRU table including the suspicion rates is set and registered as fixed values by a system designer or the like. As disclosed in Patent Document 3, however, its reliability can be improved by correcting it according to the actual component replacement history and the like.

However, the configuration of Patent Document 3 updates the fault dictionary (FRU table) itself based on information about actual component replacement work that maintenance personnel enter through an input/output device. Keeping the FRU table reliable therefore depends on input operations by maintenance personnel, which places an input burden on them. In addition, because the FRU table itself is updated, initialization or similar processing is needed whenever the FRU table as initially set is required.

Accordingly, an object of the present invention is to reduce the workload of input operations and the like, and to provide information including accurate suspicion rates while using the FRU table as initially set.

One aspect of the present invention is a failure management system comprising: a service processor that monitors the operation of each part constituting an information processing system; an FRU storage unit that stores an FRU table associating at least a plurality of types of failure events, identification information of the failure factor parts that may cause each failure event, and a suspicion rate indicating the likelihood that each failure factor part causes the failure event; a failure history storage unit that stores failure history information associating at least failure events that occurred in the past, the failure factor part that caused each failure event, and the inducement frequency with which the failure factor part caused the corresponding failure event; a failure history extraction unit that, when the service processor detects a failure event, searches the failure history information using information identifying the failure event as a key and, when the failure event matches a failure event that occurred in the past, extracts from the failure history information the failure factor part having an inducement history of causing that failure event; a correction unit that, when a failure factor part having the inducement history is extracted, calculates a corrected suspicion rate by correcting the initially set suspicion rate for that failure factor part in the FRU table according to the inducement frequency of the failure factor part; a failure history update unit that increments the corresponding inducement frequency in the failure history information when the relationship between the failure event and the failure factor part matches between the FRU table and the failure history information; and a console unit that displays the corrected suspicion rate or the initially set suspicion rate of the FRU table.

Another aspect of the present invention is a failure management method for managing failures of the information processing system with reference to an FRU table and to failure history information. The FRU table associates at least a plurality of types of failure events, identification information of the failure factor parts that may cause each failure event, and a suspicion rate indicating the likelihood that each failure factor part causes the corresponding failure event. The failure history information associates at least failure events that occurred in the past, the failure factor part that caused each failure event, and the inducement frequency with which the failure factor part caused the corresponding failure event. The method comprises the steps of: when a failure event is detected, searching the failure history information using information identifying the failure event as a key and, when the failure event matches a failure event that occurred in the past, extracting from the failure history information the failure factor part having an inducement history of causing that failure event; when a failure factor part having the inducement history is extracted, calculating a corrected suspicion rate by correcting the initially set suspicion rate for that failure factor part in the FRU table according to the inducement frequency of the failure factor part; incrementing the corresponding inducement frequency in the failure history information when the relationship between the failure event and the failure factor part matches between the FRU table and the failure history information; and displaying the corrected suspicion rate or the initially set suspicion rate of the FRU table.

With the above configuration, the suspicion rate is appropriately corrected according to the result of comparing the detected failure event with past failure events. Because the inducement history in the failure history information is incremented automatically when the detected failure event matches a past failure event, manual input work can be reduced. Furthermore, the FRU table can be kept in its initially set state without requiring initialization or similar processing.

FIG. 1 is a diagram showing the functional configuration of the failure management system according to the present embodiment.
FIG. 2 is a diagram illustrating a specific configuration of the failure management system according to the present embodiment.
FIG. 3 is a flowchart illustrating processing in the failure management system according to the present embodiment.
FIG. 4 is a diagram showing a situation in which a failure has occurred between two node controllers.
FIG. 5 is a chart illustrating an FRU table.
FIG. 6 is a diagram showing an example of calculating a corrected suspicion rate from the initial suspicion rate.
FIG. 7 is a diagram illustrating a situation in which the partition configuration of the information processing system is changed.
FIG. 8 is a chart illustrating the number of error occurrences (inducement frequency) in each part.

Embodiment 1
Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 shows the functional configuration of a failure management system 1 according to the present embodiment. The failure management system 1 manages failures of an information processing system that includes various parts such as processors, memories, node controllers, and input/output devices. It comprises failure factor parts 2, a service processor 3, an FRU storage unit 4, a failure history storage unit 5, a failure history extraction unit 6, a correction unit 7, a failure history update unit 8, and a console unit 9.

The failure factor parts 2 are hardware resources that constitute the information processing system and that may cause various failure events.

The service processor 3 monitors the operation of the plurality of failure factor parts 2. The service processor 3 is preferably able to operate independently of the information processing system.

The FRU storage unit 4 stores an FRU table that associates at least a plurality of types of failure events, identification information of the failure factor parts 2 that may cause each failure event, and a suspicion rate indicating the likelihood that each failure factor part 2 causes the failure event.

The failure history storage unit 5 stores failure history information that associates at least failure events that occurred in the past, the failure factor part 2 that caused each failure event, and the inducement frequency with which the failure factor part 2 caused the corresponding failure event.

When the service processor 3 detects a failure event, the failure history extraction unit 6 searches the failure history information using information identifying the failure event as a key. If the failure event matches a failure event that occurred in the past, the unit extracts from the failure history information the failure factor part 2 having an inducement history of causing that failure event.

When a failure factor part 2 having the inducement history is extracted, the correction unit 7 calculates a corrected suspicion rate by correcting the initially set suspicion rate for that failure factor part 2 in the FRU table according to the inducement frequency of the failure factor part 2.

The failure history update unit 8 increments the corresponding inducement frequency in the failure history information when the relationship between the failure event and the failure factor part matches between the FRU table and the failure history information.

The console unit 9 displays the corrected suspicion rate or the initially set suspicion rate of the FRU table.

With the above configuration, the suspicion rate displayed on the console unit 9 is appropriately corrected according to the result of comparing the failure event detected this time with past failure events. This correction is performed without changing the FRU table itself. In addition, when the failure event detected this time matches a past failure event, the failure history information is incremented automatically, which reduces the work of updating the failure history information manually.

FIG. 2 illustrates a specific configuration of the failure management system according to the present embodiment. The figure shows an information processing system 11 and a failure information management server 12.

The information processing system 11 comprises a main memory (MEM) 21, a plurality of processors (PROC) 22, a plurality of node controllers (NC) 23, and a plurality of input/output devices (IO) 24. When a failure is detected in one or more of these parts, the error is reported to a service processor (SVP) 25 via a signal line e001. On receiving an error report, the SVP 25 collects failure information from the MEM 21, PROC 22, NC 23, and IO 24.

The FRU table 30 holds, registered in advance, the error indicator flags that hold error signals, the failure factor parts covered by each error indicator flag (MEM 21, PROC 22, NC 23, IO 24, wiring, etc.), suspicion rates, auxiliary error information, manufacturing lot numbers, and the like.

The first failure history storage database (DB) 31 stores and retains the failures detected in the information processing system 11. When an error is detected again in the same part, only the error count field is updated.
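The update rule described for the first failure history DB 31 can be sketched as follows. This is a minimal illustration in Python, not the implementation of the invention: the class name `FailureHistoryDB`, its methods, and the choice of keying records by an (error indicator, part) pair are assumptions made for this example.

```python
from collections import defaultdict


class FailureHistoryDB:
    """Hypothetical in-memory stand-in for the first failure history DB 31."""

    def __init__(self):
        # (error indicator, failure factor part) -> error count
        self._counts = defaultdict(int)

    def record(self, indicator, part):
        """Record one detected failure and return the updated error count.

        A repeated failure in the same part does not add a new record;
        only its error count is incremented.
        """
        key = (indicator, part)
        self._counts[key] += 1
        return self._counts[key]

    def count(self, indicator, part):
        """Return how often this failure was recorded (0 if never)."""
        return self._counts.get((indicator, part), 0)

    def __len__(self):
        return len(self._counts)


db = FailureHistoryDB()
db.record("NO_EIF_3", "NC0_P1")
db.record("NO_EIF_3", "NC0_P1")  # same part again: the count rises, no new record
```

Keeping one record per part and raising only its counter keeps the history compact, and the counter can serve directly as the inducement frequency.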

The second failure history storage DB 32 stores failure information in the same way as the first failure history storage DB 31. In addition, it receives via signal lines n001 and n002, and reflects, the information held by the failure information management server 12 in an other-device failure DB 35 and in an inspection failure DB 36, the latter obtained from margin evaluation in which voltage and clock are varied.

The data collection unit 40 collects data from the FRU table 30, the first failure history DB 31, and the second failure history DB 32, using a received error notification as a trigger.

Based on the data from the data collection unit 40, the failure factor analysis unit 41 compares and analyzes the information on the reported error against the past failure history, the failure history of other information processing systems, manufacturing lots, and the like.

When the failure factor analysis unit 41 determines that the failure history contains an entry matching the error reported this time, the failure factor part suspicion rate calculation unit 42 corrects the failure factor parts and their suspicion rates. When there is no matching entry, it selects the data from the FRU table 30 and does not perform the correction.

The console 43 displays the information output by the failure factor part suspicion rate calculation unit 42.

The configuration illustrated in FIG. 2 also includes a configuration information analysis unit 44. When the SVP 25 indicates that the parts to be used (for example, one of several inter-node interfaces) will change because the information processing system is being expanded or degraded, the configuration information analysis unit 44 refers to the failure history of the candidate parts, extracts information for incorporating the parts with fewer failures into the system, and notifies the SVP 25. In other words, the information processing system 11 of this example has a function for adjusting the logical or physical partition configuration of its own parts 21, 22, 23, and 24.

The failure information management server 12 exchanges failure information with a plurality of information processing systems 11 via networks n001, n002, n003, and n004. Data received over these networks is stored in the other-device failure DB 35 and distributed to the information processing systems 11 so that the information is shared. Failure information from margin evaluation in which voltage and clock are varied is stored in the inspection failure DB 36 and, like the other-device failure DB 35, is shared with the plurality of information processing systems.

FIG. 3 illustrates the processing performed by the failure management system 1 according to the present embodiment. This processing determines the failure factor parts and their suspicion rates when an error is detected in the MEM 21, the PROCs 22, the NCs 23, or the IOs 24. Each step is described later.

FIG. 4 illustrates a case in which a failure has occurred between two node controllers NC0 and NC1. Each of nodes 0 and 1 comprises processors (PROC0, PROC1), a node controller (NC0 for node 0, NC1 for node 1), and input/output devices (IO0, IO1). Node 0 and node 1 exchange signals over CABLE_A, which connects port P1 of NC0 to port P1 of NC1. In this example, NC1 of node 1, on the receiving side, has detected an error in data transmitted from node 0.

FIG. 5 shows an example of the FRU table 30. It registers the error indicators that store error notifications from the MEM 21, PROC 22, NC 23, IO 24, and SVP 25 and, for the case where each error indicator is lit, the name of the failure factor part (NAME), the suspicion rate (RATE), the manufacturing lot or package comp (REV), and the vendor (ID). Only items for which a failure has been detected are accumulated, in this format, in the first and second failure history DBs 31 and 32, the other-device failure DB 35, and the inspection failure DB 36. If a failure history already exists when a failure is reported, the error counter is incremented by 1. Although this example shows a state in which four FRUs are stored, the present invention is not limited to this.
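As an illustration of the table layout, the following Python sketch models an FRU table keyed by error indicator. The part names and suspicion rates follow the NO_EIF_3 example given in the description of step S004. The type `FruEntry`, the function names, and the REV and vendor values are placeholders assumed for this example and are not taken from FIG. 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FruEntry:
    name: str  # NAME: failure factor part
    rate: int  # RATE: initially set suspicion rate, in percent
    rev: str   # REV: placeholder value, not from the source
    vid: str   # vendor ID: placeholder value, not from the source


# Error indicator -> candidate failure factor parts (read-only initial settings).
FRU_TABLE = {
    "NO_EIF_3": (
        FruEntry("NC0_P1", 49, "rev-a", "vendor-x"),
        FruEntry("NC1_P1", 50, "rev-a", "vendor-x"),
        FruEntry("CABLE_A", 1, "rev-b", "vendor-y"),
    ),
}


def lookup_fru(indicator):
    """Return the entries registered for a lit error indicator (empty if none)."""
    return FRU_TABLE.get(indicator, ())


def initial_rates(indicator):
    """Map each candidate part to its initially set suspicion rate."""
    return {entry.name: entry.rate for entry in lookup_fru(indicator)}
```

The entries are frozen and stored in tuples to mirror the requirement that the FRU table itself stays in its initially set state.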

FIG. 6 shows a calculation example for the case where a failure report hits a failure history for the same location and the suspicion rates are corrected. The suspected parts are NC0, NC1, and CABLE_A (see FIG. 4). If there is no failure history, the console displays the failure factor parts with the suspicion rates initially set in the FRU table 30: NC0 = 50%, NC1 = 49%, CABLE = 1%. If there is a failure history, the result depends on the number of past occurrences; for example, when the system was previously restored by replacing NC0, the initially set suspicion rates are corrected and the console displays NC0 = 67%, NC1 = 32%, CABLE = 1%.
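This passage does not state the correction formula, so the following Python sketch uses an assumed rule purely for illustration: each initial rate is weighted by (1 + inducement count) and the weights are rescaled to sum to 100. The function name `corrected_rates` and the rule itself are hypothetical, and the results need not reproduce the values of FIG. 6 exactly.

```python
def corrected_rates(initial, inducement_counts):
    """Return corrected suspicion rates (percent) without modifying `initial`.

    Assumed rule (not from the source): weight = initial rate * (1 + count),
    then rescale so that the rates again sum to 100.
    """
    weights = {
        part: rate * (1 + inducement_counts.get(part, 0))
        for part, rate in initial.items()
    }
    total = sum(weights.values())
    if total == 0:
        return dict(initial)
    return {part: 100.0 * weight / total for part, weight in weights.items()}


fru_rates = {"NC0": 50, "NC1": 49, "CABLE": 1}  # initially set values
history = {"NC0": 1}                            # NC0 was the cause once before
result = corrected_rates(fru_rates, history)
```

With these inputs the rate for NC0 rises to roughly two thirds and the other rates shrink proportionally, while `fru_rates` is left untouched, in line with the requirement that the FRU table is not rewritten.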

FIG. 7 illustrates a case in which a new resource is incorporated as the configuration of the information processing system 11 is expanded or degraded, that is, a situation in which the partition configuration is changed. When node 2 is added to an information processing system that consists of the two nodes 0 and 1, the SVP 25 looks up, in the first failure history DB 31 and the second failure history DB 32, the failure frequency of port 2 (P2) of node 0 (NC0), port 2 (P2) of node 2 (NC2), port 3 (P3) of node 0 (NC0), and port 3 (P3) of node 2 (NC2). It then selects the route with the lower failure frequency and instructs that this route be incorporated into the information processing system.

The operation of the failure management system configured as above is described below. The description covers the case where a failure occurs between the NCs 23, which connect the nodes in the information processing system 11 and control each node. Although this example deals with a failure between NCs 23, the same processing applies to failures between the MEM 21 and a PROC 22, between a PROC 22 and an NC 23, between an IO 24 and an NC 23, and between the SVP 25 and the MEM 21, PROC 22, NC 23, or IO 24, as well as to a failure within a single MEM 21, PROC 22, NC 23, IO 24, or SVP 25.

The flow from the failure report to the display of the failure factor parts and their suspicion rates on the console is described with reference to FIGS. 3 to 6. The flow of FIG. 3 is as follows.

S001: Failure detection. An error is detected in one of the parts of the information processing system (MEM 21, PROC 22, NC 23, IO 24, SVP 25).

S002: The error is reported to the SVP 25.

S003: The service processor log is collected. The error indicator flags (EIF), various state information, and auxiliary error information in the information processing system are gathered.

S004: The FRU table 30 is looked up using the error indicator flag (EIF) in the log collected in S003 as a key. As shown in FIG. 5, the FRU table 30 registers, for each error indicator flag, a plurality of failure factor part names (NAME), suspicion rates (RATE), revisions (REV), and vendor IDs (VID). For example, when NO_EIF_3 is set to "1", this indicates a failure between node 0 and node 1. Port 1 of node 0 (NC0_P1), port 1 of node 1 (NC1_P1), and the cable connecting the ports (CABLE_A) become the candidate failure factor parts, and their suspicion rates are read out as 49%, 50%, and 1%, respectively. The information associated with each failure factor part (NAME, RATE, REV, VID) is read out in the same way.

S005: Triggered by the error notification from the SVP 25, the information looked up in S004 is stored in the first failure history DB 31. At the same time, it is determined whether the same failure has been recorded in the past, and the data is sent to one of the branches S007 to S010 according to the result. If the same failure has been recorded in the past, the error counter field corresponding to NO_EIF_3 is incremented by 1.

S006: As in S005, triggered by the error notification from the SVP 25, it is determined whether the other-device failure DB 35 or the inspection failure DB 36 contains an entry that matches the failure that occurred this time.

The processing then branches into four cases, S007 to S010, according to the results of S005 and S006, and exactly one of them is executed.

S007: Based on the FRU lookup data and the failure history information read in S005 and S006, conditions such as manufacturing lot and vendor ID are compared and analyzed to determine whether the failure factor parts and their suspicion rates need to be corrected.

S008: Based on the FRU lookup data and the failure history information read in S005, conditions such as manufacturing lot and vendor ID are compared and analyzed to determine whether the failure factor parts and their suspicion rates need to be corrected.

S009: Based on the FRU lookup data and the failure history information read in S006, conditions such as manufacturing lot and vendor ID are compared and analyzed to determine whether the failure factor parts and their suspicion rates need to be corrected.

S010: Because neither S005 nor S006 found matching failure history information, the information from the FRU table is passed on unchanged.

S011: If S007 to S009 determined that the suspicion rates need to be corrected, they are corrected. In the case of S010, nothing is done. The correction method is described later.

S012: The information from S011 is displayed on the console to notify maintenance personnel of the failure factor parts.
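The four-way branch taken after S005 and S006 can be summarized in code. The Python sketch below is only an illustration of the branching described above; the function names and the boolean parameters `hit_local` and `hit_shared` are assumed for this example.

```python
def select_step(hit_local, hit_shared):
    """Choose the branch executed after S005 and S006.

    hit_local:  S005 found the same failure in the first failure history DB 31.
    hit_shared: S006 found a match in the other-device failure DB 35
                or the inspection failure DB 36.
    """
    if hit_local and hit_shared:
        return "S007"  # analyze using both histories
    if hit_local:
        return "S008"  # analyze using the S005 history only
    if hit_shared:
        return "S009"  # analyze using the S006 history only
    return "S010"      # no history: pass the FRU table data through


def may_correct(step):
    """Only S007 to S009 can lead to a correction in S011."""
    return step in ("S007", "S008", "S009")
```

Exactly one branch is returned for every combination of the two results, which matches the statement that only one of S007 to S010 is executed.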

Next, the operation for incorporating new components or paths when the system configuration of the information processing system 11 is expanded or degraded, that is, the operation for adjusting the logical or physical partition configuration of the information processing system 11, will be described.

Reference is made here to FIGS. 2, 7, and 8. The SVP 25 reads the information in the first and second failure history DBs 31 and 32 into the data collection unit 40, and the configuration information analysis unit 44 analyzes that information for any failure history at the location where a new component is to be installed or on the path to be incorporated. For example, when empty slots or free ports exist, the parts with fewer failure records are incorporated into the information processing system 11. As shown in FIG. 7, when node 2 is added to the two-node configuration of nodes 0 and 1, the failure-history frequencies for port 2 (P2) of node 0 (NC0), port 2 (P2) of node 2 (NC2), port 3 (P3) of node 0 (NC0), and port 3 (P3) of node 2 (NC2) are collected from the first and second failure history DBs 31 and 32 via the data collection unit 40, and the configuration information analysis unit 44 selects the path with the lower failure frequency. FIG. 8 shows an example of error occurrence frequencies. In this example, the error frequency between NC0_P2 and NC2_P2 is 17, whereas the error frequency between NC0_P3 and NC2_P3 is 3, so the path NC0_P3-NC2_P3 is selected as the one with the lower failure frequency. Based on this selection, the SVP 25 instructs that the path be incorporated into the information processing system 11. The SVP 25 delivers the configuration instruction to each MEM 21, PROC 22, NC 23, and IO 24 via the signal line c0001 (see FIG. 2), thereby configuring a more stable information processing system 11.
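The path selection in the FIG. 8 example amounts to choosing the candidate with the smallest error count. A minimal sketch follows, assuming the collected frequencies are held in a mapping from path name to count; the function name `select_path` and the mapping layout are illustrative assumptions, not part of the specification.

```python
from typing import Mapping


def select_path(error_counts: Mapping[str, int]) -> str:
    """Return the candidate path with the lowest recorded error frequency.

    If several paths share the lowest count, the first one in the
    mapping's iteration order is returned.
    """
    if not error_counts:
        raise ValueError("no candidate paths to choose from")
    return min(error_counts, key=lambda path: error_counts[path])


# Error frequencies taken from the FIG. 8 example in the text.
counts = {"NC0_P2-NC2_P2": 17, "NC0_P3-NC2_P3": 3}
print(select_path(counts))  # prints NC0_P3-NC2_P3
```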

With the above configuration, the suspect ratio is appropriately corrected according to the result of comparing the detected failure event with past failure events. In addition, the trigger history in the failure history information is incremented automatically when the detected failure event matches a past failure event, which reduces manual input work. Furthermore, the FRU table can be kept in its initially set state without requiring processing such as initialization.
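The update behavior summarized above can be sketched as follows: the trigger frequency in the history is incremented only when the same event and part pairing is present in both the FRU table and the history, and the FRU table itself is never written. The data layout, the function name `increment_if_matching`, and the sample values are all assumptions made for illustration. How a new history entry is first created is outside the scope of this sketch.

```python
from typing import Dict, Tuple

# Hypothetical layouts, assumed only for this sketch.
FruTable = Dict[str, Dict[str, float]]  # event -> {part: initial suspect ratio}
History = Dict[Tuple[str, str], int]    # (event, part) -> trigger frequency


def increment_if_matching(fru_table: FruTable, history: History,
                          event: str, part: str) -> bool:
    """Increment the trigger frequency when both sources pair event and part.

    Only `history` is modified; `fru_table` is read but never written,
    so its initially set suspect ratios stay intact.
    """
    in_fru_table = part in fru_table.get(event, {})
    in_history = (event, part) in history
    if in_fru_table and in_history:
        history[(event, part)] += 1
        return True
    return False


fru_table = {"EVENT_A": {"PROC": 0.6, "MEM": 0.4}}  # made-up values
history = {("EVENT_A", "MEM"): 2}                    # made-up values

increment_if_matching(fru_table, history, "EVENT_A", "MEM")   # count goes 2 -> 3
increment_if_matching(fru_table, history, "EVENT_A", "PROC")  # no history entry, no change
```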

The present invention is not limited to the above embodiment and may be modified as appropriate without departing from its spirit.

DESCRIPTION OF SYMBOLS
1 Failure management system
2 Failure factor part
3, 25 Service processor
4 FRU storage unit
5 Failure history storage unit
6 Failure history extraction unit
7 Correction unit
8 Failure history update unit
9 Console unit
11 Information processing system
12 Failure information management server
21 Main memory (MEM)
22 Processor (PROC)
23 Node controller (NC)
24 Input/output device (IO)
30 FRU table
31 First failure history database
32 Second failure history database
35 Other-device failure database
36 Inspection failure database
40 Data collection unit
41 Failure factor analysis unit
42 Failure factor part suspect ratio calculation unit
43 Console
44 Configuration information analysis unit

Claims

A failure management system comprising:
a service processor that monitors the operation of each part constituting an information processing system;
an FRU storage unit that stores an FRU table in which at least a plurality of types of failure events, identification information of failure factor parts that may cause each of the failure events, and a suspect ratio indicating the likelihood that each failure factor part causes the failure event are associated with one another;
a failure history storage unit that stores failure history information in which at least failure events that occurred in the past, the failure factor part that caused each failure event, and a trigger frequency with which the failure factor part caused the corresponding failure event are associated with one another;
a failure history extraction unit that, when a failure event is detected by the service processor, searches the failure history information using information identifying the failure event as a key and, when the failure event matches a failure event that occurred in the past, extracts from the failure history information the failure factor part having a trigger history of causing the failure event;
a correction unit that, when a failure factor part having the trigger history is extracted, calculates a corrected suspect ratio by correcting, according to the trigger frequency of the failure factor part, the initially set suspect ratio corresponding to the failure factor part in the FRU table;
a failure history update unit that, when the relationship between the failure event and the failure factor part matches between the FRU table and the failure history information, increments the corresponding trigger frequency in the failure history information; and
a console unit that displays the corrected suspect ratio or the initially set suspect ratio of the FRU table.

The failure management system according to claim 1, wherein
the failure history storage unit comprises a first failure history storage unit that stores the failure history information relating to the information processing system of the local machine, and a second failure history storage unit that stores the failure history information relating to an information processing system of another machine, and
the failure history extraction unit and the correction unit calculate the corrected suspect ratio based on the information stored in the first and second failure history storage units.

The failure management system according to claim 1 or 2, further comprising
a partition adjustment unit that adjusts the logical or physical partition configuration of the information processing system so that the parts with a low suspect ratio are used more frequently.

A failure management method for managing failures of an information processing system by referring to an FRU table, in which at least a plurality of types of failure events, identification information of failure factor parts that may cause each failure event, and a suspect ratio indicating the likelihood that each failure factor part causes the corresponding failure event are associated with one another, and to failure history information, in which at least failure events that occurred in the past, the failure factor part that caused each failure event, and a trigger frequency with which the failure factor part caused the corresponding failure event are associated with one another, the method comprising:
when a failure event is detected, searching the failure history information using information identifying the failure event as a key and, when the failure event matches a failure event that occurred in the past, extracting from the failure history information the failure factor part having a trigger history of causing the failure event;
when a failure factor part having the trigger history is extracted, calculating a corrected suspect ratio by correcting, according to the trigger frequency of the failure factor part, the initially set suspect ratio corresponding to the failure factor part in the FRU table;
when the relationship between the failure event and the failure factor part matches between the FRU table and the failure history information, incrementing the corresponding trigger frequency in the failure history information; and
displaying the corrected suspect ratio or the initially set suspect ratio of the FRU table.

The failure management method according to claim 4, wherein
the corrected suspect ratio is calculated by referring to a first failure history storage unit that stores the failure history information relating to the information processing system of the local machine and a second failure history storage unit that stores the failure history information relating to an information processing system of another machine.

The failure management method according to claim 4 or 5, further comprising
adjusting the logical or physical partition configuration of the information processing system so that the parts with a low corrected suspect ratio or a low suspect ratio are used more frequently.