WO2010116514A1 - RAID control device

Info

Publication number: WO2010116514A1
Application number: PCT/JP2009/057291
Authority: WO
Inventors: 敬治藤田; 佳樹伏見
Original assignee: 富士通株式会社
Priority date: 2009-04-09
Filing date: 2009-04-09
Publication date: 2010-10-14

Abstract

The RAID control device comprises a counting unit, which counts, for each storage medium, the number of media errors occurring within a specified time on a plurality of storage media that constitute a RAID, and a detecting unit, which detects, as a storage medium associated with RAID slowdown, a storage medium whose number of media errors within the specified time is equal to or greater than a threshold value.

Description

RAID control device

The present invention relates to a RAID (Redundant Arrays of Inexpensive Disks) control device.

RAID is a technique for operating a plurality of recording media, for example HDDs (Hard Disk Drives), as one virtual HDD. When data read errors and/or write errors (hereinafter also referred to as media errors) occur frequently on one of the HDDs constituting a RAID, the RAID control unit appears to have stopped operating. This phenomenon is called a system slowdown phenomenon or a RAID slowdown phenomenon.

JP 2004-252692 A
JP 2005-267056 A

Conventionally, when a RAID slowdown phenomenon occurred, the HDD on which media errors were occurring frequently was identified manually and then replaced manually in order to eliminate the slowdown. Resolving a slowdown phenomenon therefore took a great deal of time.

An object of one aspect of the present invention is to provide a technique capable of detecting a RAID slowdown phenomenon that is occurring now or will occur in the future.

One aspect of the present invention is a RAID control device. The RAID control device includes a counting unit that counts, for each recording medium, the number of media errors occurring within a predetermined time on a plurality of recording media constituting a RAID, and a detection unit that detects a recording medium whose number of media errors within the predetermined time is equal to or greater than a threshold as a recording medium related to a RAID slowdown phenomenon.

Another aspect of the present invention is a method by which the above-described RAID control device detects a faulty recording medium. Further aspects of the present invention are a program that causes a computer (information processing apparatus) to function as the above-described RAID control device, and a recording medium on which the program is recorded.

According to one aspect of the present invention, a current or future RAID slowdown phenomenon can be detected.

FIG. 1 is a diagram illustrating a hardware configuration example of an information processing apparatus that implements an embodiment of a RAID control device.
FIG. 2 is a block diagram schematically showing a RAID device realized by the information processing apparatus shown in FIG. 1.
FIG. 3 shows an example of a management log created by the RAID management unit.
FIG. 4 shows an example of periodic slowdown phenomenon determination.
FIG. 5 shows an example of the data structure of the threshold table.
FIG. 6 is a flowchart illustrating an operation example of the RAID control unit.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. The configurations in the following embodiments are examples, and the present invention is not limited to them.

FIG. 1 is a diagram illustrating a configuration example of an information processing apparatus to which a RAID control device according to an embodiment is applied. In FIG. 1, the information processing apparatus 10 is, for example, a dedicated or general-purpose server machine or a dedicated or general-purpose computer. The general-purpose computer is, for example, a personal computer (PC).

The information processing apparatus 10 includes a CPU (Central Processing Unit) 1 as a processor, a main storage device 2, RAID controllers 3A, 3B, and 3C, a LAN (Local Area Network) interface 4, and an input/output (I/O) unit 5. The CPU 1, the main storage device 2, the RAID controllers 3A, 3B, and 3C, the LAN interface 4, and the I/O unit 5 are connected to one another via a bus B.

The main storage device 2 includes a ROM (Read Only Memory) that stores programs and data and a RAM (Random Access Memory) used as a work area of the CPU 1. The RAM is referred to as the memory.

The RAID controller 3 controls a RAID unit (also referred to as a RAID system) 6. The example in FIG. 1 shows a RAID controller 3A that controls the RAID unit 6A, a RAID controller 3B that controls the RAID unit 6B, and a RAID controller 3C that controls the RAID unit 6C.

A plurality of hard disk drives (HDDs) 7 (referred to as a disk array), serving as the plurality of recording media constituting the RAID unit 6, are connected to the RAID controller 3. In the example shown in FIG. 1, the RAID unit 6A connected to the RAID controller 3A includes two HDDs, 7A and 7B.

The RAID controller 3 is an integrated circuit that writes data to and reads data from the RAID unit 6. The RAID controller 3 also records the history of write/read accesses to the RAID unit 6 as a management log.

The RAID unit 6A is configured for RAID level 1 (RAID 1), that is, mirroring. The RAID controller 3A therefore writes the data to be written to both the HDD 7A and the HDD 7B. The RAID controller 3A uses, for example, one of the HDD 7A and the HDD 7B as the active HDD and the other as the standby HDD; for example, the HDD 7A is the active HDD and the HDD 7B is the standby HDD. When a failure of the active HDD 7A is detected, the RAID controller 3A disconnects the failed HDD 7A and changes the settings so that the HDD 7B is used as the active HDD. The disconnection is performed by placing the HDD to be disconnected in the disabled state.
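The mirroring and disconnection behavior described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Disk` and `MirrorController` classes, their method names, and the in-memory block store are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    """Hypothetical stand-in for one HDD of the mirrored pair."""
    name: str
    enabled: bool = True
    blocks: dict = field(default_factory=dict)


class MirrorController:
    """Illustrative RAID 1 controller with an active and a standby disk."""

    def __init__(self, active: Disk, standby: Disk):
        self.active = active
        self.standby = standby

    def write(self, lba: int, data: bytes) -> None:
        # Mirroring: the same data goes to every disk that is still enabled.
        for disk in (self.active, self.standby):
            if disk.enabled:
                disk.blocks[lba] = data

    def read(self, lba: int) -> bytes:
        # Reads are served by the active disk.
        return self.active.blocks[lba]

    def disconnect_active(self) -> None:
        # Disconnection: disable the failed active disk, then promote the standby.
        self.active.enabled = False
        self.active, self.standby = self.standby, self.active
```

After `disconnect_active()` the former standby serves all reads and is the only disk that receives further writes, which mirrors the settings change described for the HDD 7A and the HDD 7B.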

Note that the RAID unit 6B controlled by the RAID controller 3B implements RAID level "RAID 10", and the RAID unit 6C controlled by the RAID controller 3C implements RAID level "RAID 50".

The LAN interface 4 is a communication interface circuit for transmitting data to and receiving data from the network N. The I/O unit 5 is a circuit for connecting peripheral devices such as an input device, an output device, and a portable recording medium to the information processing apparatus 10. The portable recording medium is, for example, a DVD 8 or a USB memory 9 as shown in FIG. 1.

For example, a DVD 8 or USB memory 9 storing programs or data can be connected to the I/O unit 5, and the programs or data can be installed on the RAID unit 6. A program installed on the RAID unit 6 can then be loaded by the CPU 1 into the memory of the main storage device 2 and executed. The ROM of the main storage device 2 or the RAID unit 6 stores an operating system (OS) and one or more application programs (referred to as applications). By executing the OS and the applications, the CPU 1 can cause the information processing apparatus 10 to function as a RAID device.

The ROM and RAM in the main storage device 2, the HDD 7, the DVD 8, and the USB memory 9 shown in FIG. 1 are examples of computer-readable recording media, and the types of recording media are not limited to these.

FIG. 2 is a block diagram schematically showing a RAID device realized by the information processing apparatus 10 shown in FIG. 1. When the CPU 1 executes a program loaded into the main storage device 2, the information processing apparatus 10 functions as a RAID device including the RAID unit 6 and the RAID control unit 20 shown in FIG. 2.

In FIG. 2, the RAID units 6A to 6C are those shown in FIG. 1. The RAID unit 6A shown in FIG. 2 is a mirroring system consisting of "HDD-0", which corresponds to the HDD 7A, and "HDD-1", which corresponds to the HDD 7B.

In the present embodiment, the RAID control unit 20 is a function realized by the CPU 1 executing the OS. More specifically, the RAID control unit 20, serving as the RAID control device, is realized by the CPU 1, the main storage device 2, and the RAID controller 3 shown in FIG. 1. The RAID control unit 20 includes RAID management units 21 (21A, 21B, 21C), timers 23A and 23B, a counting unit 24, a plurality of counters including counters 25 and 26, a detection unit 27, and a threshold table 28.

The RAID management units 21A, 21B, and 21C are realized as functions of the RAID controllers 3A, 3B, and 3C, respectively. The timers 23, the counting unit 24, and the detection unit 27 are realized as functions of the CPU 1 executing the OS. The counters 25 and 26 and the threshold table 28 are created, for example, in the memory of the main storage device 2.

The functions of the RAID controller 3 can be realized by a dedicated processor, or by a general-purpose processor such as a DSP (Digital Signal Processor), executing a program. Alternatively, they can be realized by the CPU 1 executing a program (for example, the OS).

Each RAID management unit 21 controls the writing and reading of data to and from the corresponding RAID unit 6 and the disconnection of an HDD from the RAID unit 6. Each RAID management unit 21 also records the history of accesses to the corresponding RAID unit 6 as a management log 22.

FIG. 3 shows an example of the management log 22. The management log 22 records, in chronological order, records that each include a time stamp indicating the date and time, the managed HDD, the result of an access to the managed HDD (that is, the result of reading data from or writing data to the HDD), and the identifier of the RAID unit 6.

In the example shown in FIG. 3, the managed HDD-0 is written as "TargetHDD-0" and the managed HDD-1 is written as "TargetHDD-1". The access result is recorded as "Normal" when the access succeeded and as "MediaError" when the access failed, that is, when a read error or a write error occurred. The identifier of the RAID unit can be written, for example, as "RAID X" (where X is a number or a symbol).
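A record of this kind could be parsed as in the sketch below. The exact layout of FIG. 3 is not reproduced in this text, so the whitespace-separated line format, the time stamp format, and the names `LogRecord` and `parse_record` are assumptions made purely for illustration; only the field set and the strings "TargetHDD-0", "Normal", "MediaError", and "RAID X" come from the description.

```python
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: datetime
    target: str    # e.g. "TargetHDD-0"
    result: str    # "Normal" or "MediaError"
    raid_id: str   # e.g. "RAID 1"


def parse_record(line: str) -> LogRecord:
    """Parse one line in the assumed layout:
    "<YYYY-MM-DD> <HH:MM:SS> <target> <result> <RAID identifier>".
    """
    date, time, target, result, raid_id = line.strip().split(maxsplit=4)
    if result not in ("Normal", "MediaError"):
        raise ValueError(f"unknown access result: {result!r}")
    timestamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return LogRecord(timestamp, target, result, raid_id)
```

Keeping the RAID identifier as the last field lets it contain a space, as in "RAID X", without needing a quoting convention.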

Each of the timers 23A and 23B shown in FIG. 2 measures a monitoring interval based on a predetermined monitoring cycle. In the present embodiment, the counting unit 24 and the detection unit 27 monitor the RAID unit 6 (the management log 22) at the predetermined monitoring cycle (see FIG. 4). In the example shown in FIG. 2, the RAID unit 6A and the RAID unit 6B are monitored periodically. The timer 23A is the timer for the RAID unit 6A, and the timer 23B is the timer for the RAID unit 6B.

As shown in FIG. 4, when media errors have occurred a predetermined number of times or more within a predetermined check period, it can be determined that a RAID slowdown phenomenon is occurring. Periodic monitoring is optional, however, and the timers 23A and 23B are not essential components.

Each of the timers 23A and 23B notifies the counting unit 24 and the detection unit 27 each time the monitoring cycle arrives. The monitoring cycle, that is, the monitoring interval, can be set to a period of several minutes to several tens of minutes. The monitoring cycle can be changed through settings supplied to the timers 23A and 23B from outside.

When the timer 23A or 23B notifies the counting unit 24 that the monitoring cycle has arrived, the counting unit 24 extracts the records that fall within a predetermined check period from the management log 22 (FIG. 3) of the RAID management unit 21 corresponding to that monitoring cycle, and counts the media errors in those records for each managed HDD. For example, the counts for HDD-0 and HDD-1, which constitute the RAID unit 6A, are stored in the counters 25 and 26. Likewise, counters (not shown) are provided for the four HDDs constituting the RAID unit 6B (RAID level "RAID 10").

The media error counts for HDD-0 and HDD-1 stored in the counters 25 and 26 indicate how many media errors each HDD experienced during the check period. The counter 25 holds the media error count of HDD-0, and the counter 26 holds the media error count of HDD-1.

When the timer 23 notifies the detection unit 27 that the monitoring cycle of a monitored RAID unit 6 (for example, the RAID unit 6A) has arrived, the detection unit 27 reads the media error counts from the counters 25 and 26 in synchronization with the processing of the counting unit 24. The detection unit 27 also reads the threshold to be compared with the media error counts from the threshold table 28 and determines whether each media error count is equal to or greater than the threshold.

If the detection unit 27 determines that the media error count of either HDD-0 or HDD-1 is equal to or greater than the threshold, it detects (identifies) that HDD as the HDD related to the RAID slowdown phenomenon. The detection unit 27 then instructs the RAID management unit 21 to disconnect the detected HDD from the RAID unit 6.

For example, as shown in FIG. 3, when the detection unit 27 determines that the media error count of HDD-0 in the check period is equal to or greater than the slowdown threshold of 10 [times], it detects HDD-0 as the HDD causing a current or future slowdown phenomenon and selects HDD-0 as the HDD to be disconnected.

FIG. 5 shows an example of the data structure of the threshold table 28. In the example shown in FIG. 5, the threshold table 28 consists of entries corresponding to the RAID units 6A, 6B, and 6C. An entry can be provided, for example, for each RAID unit 6. Alternatively, only the entries of the RAID units 6 to be monitored may be registered.

Each entry is assigned an entry number. Entry number 1 corresponds to the RAID unit 6A, entry number 2 to the RAID unit 6B, and entry number 3 to the RAID unit 6C.

Each entry can include a system number, a check period, a monitoring cycle, a warning threshold, a slowdown threshold, a controller identifier, and a monitoring target flag.

Here, the system number is the identifier of the RAID unit 6, which is a RAID system. The check period is the length of time used to extract (cut out) records from the management log 22.

The counting unit 24 may hold the check period permanently. Alternatively, the counting unit 24 may consult the threshold table 28 to obtain the check period when it extracts records from the management log 22. In that case, the check period is stored in the table as shown in FIG. 5. Storing the check period is optional.

The monitoring cycle indicates the length of the monitoring interval at which the RAID unit 6 is monitored, that is, at which media errors are counted and evaluated. The monitoring interval is set in the timers 23A and 23B, which give notification of the arrival of the monitoring cycle every monitoring cycle (for example, every 5 minutes).

Storing the monitoring cycle in the threshold table 28 is therefore optional. Furthermore, when no monitoring cycle is used, for example when the counting unit 24 and the detection unit 27 operate in response to a predetermined command, the monitoring cycle and the timers 23A and 23B can be omitted.

The monitoring cycle can be set to a different value for each RAID unit 6. In the example shown in FIG. 2, a timer 23A for the RAID unit 6A and a timer 23B for the RAID unit 6B are provided. If the RAID unit 6A and the RAID unit 6B can share the same monitoring cycle, however, a configuration with a single timer can be used.

The warning threshold is used to determine whether the number of media errors within the check period has reached a value at which a warning should be issued, and is referenced by the detection unit 27. A warning can be reported outside the information processing apparatus 10 by various means such as sound, light, or a message on a display. In the present embodiment, however, when the media error count is equal to or greater than the warning threshold, the warning is recorded in the management log 22 by the RAID management unit 21 corresponding to the RAID unit 6. If no warning record is kept, the warning threshold can be omitted.

The slowdown threshold is used to determine that the number of media errors is causing a RAID slowdown phenomenon, or is highly likely to cause one, and is referenced by the detection unit 27. The warning threshold and the slowdown threshold can also be set to different values for each RAID unit 6 or each RAID controller.

The controller identifier is the identifier of the RAID controller that controls the RAID unit specified by the system number. The monitoring target flag indicates whether the RAID unit 6 is monitored by the counting unit 24 and the detection unit 27.
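The entry layout described above might be modeled as in the following sketch. The class name `ThresholdEntry`, the field names, and the string identifiers "6A" and "3A" are illustrative assumptions; the fields described as optional (check period, monitoring cycle, warning threshold) default to `None`. The numeric values for the RAID unit 6A are those used in the operation example of FIG. 6.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class ThresholdEntry:
    """One row of the threshold table (hypothetical field names)."""
    entry_number: int
    system_number: str            # identifier of the RAID unit
    controller_id: str            # identifier of the controlling RAID controller
    slowdown_threshold: int       # media errors per check period
    monitored: bool               # monitoring target flag (ON/OFF)
    check_period: Optional[timedelta] = None       # optional in the table
    monitoring_cycle: Optional[timedelta] = None   # optional in the table
    warning_threshold: Optional[int] = None        # optional in the table


def monitored_entries(table):
    """Return only the entries whose monitoring target flag is ON."""
    return [entry for entry in table if entry.monitored]


# Entry 1 with the values the operation example uses for the RAID unit 6A.
entry_6a = ThresholdEntry(
    entry_number=1,
    system_number="6A",
    controller_id="3A",
    slowdown_threshold=100,
    monitored=True,
    check_period=timedelta(minutes=10),
    monitoring_cycle=timedelta(minutes=5),
    warning_threshold=50,
)
```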

Monitoring by the counting unit 24 and the detection unit 27 is limited to RAID units that support RAID level 1 (mirroring). In the example shown in FIGS. 1 and 2, the RAID unit 6A, which implements RAID level "RAID 1", and the RAID unit 6B, which implements RAID level "RAID 10", are therefore monitored, whereas the RAID unit 6C, which implements RAID level "RAID 50", is not.

The configuration of the RAID control unit 20 in the present embodiment is applicable not only to RAID units implementing RAID level "RAID 1" itself but also to RAID units implementing a RAID level that combines RAID level 1 with another RAID level. That is, the RAID levels monitored by the RAID control unit 20 can include at least RAID "1", "0+1", "1+0", "1+5", "5+1", "1+6", and "6+1".

In the threshold table 28, the monitoring target flag is set to "ON" for RAID units whose RAID level is monitored and to "OFF" for RAID units whose RAID level is not monitored. In the example shown in FIG. 2, the RAID control unit 20 therefore monitors the media error counts of the RAID unit 6A and the RAID unit 6B and does not monitor the RAID unit 6C.
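As a rough sketch of how the monitoring target flag could be derived from a RAID level label, the snippet below checks the label against the levels listed above. The description only states that the flag is stored in the table, so deriving it this way is an assumption, as are the function names and the alias map (which relies on the common shorthand RAID 10 = 1+0, RAID 01 = 0+1, RAID 50 = 5+0, RAID 60 = 6+0).

```python
# RAID levels named in the description as monitoring targets.
MONITORED_LEVELS = frozenset({"1", "0+1", "1+0", "1+5", "5+1", "1+6", "6+1"})

# Common shorthand for nested levels (assumed notation, not from the text).
ALIASES = {"10": "1+0", "01": "0+1", "50": "5+0", "60": "6+0"}


def normalize_level(label: str) -> str:
    """Turn a label such as "RAID 10" into the "a+b" form used above."""
    level = label.upper().replace("RAID", "").strip()
    return ALIASES.get(level, level)


def monitoring_flag(label: str) -> str:
    """Return "ON" if the level includes mirroring as listed, else "OFF"."""
    return "ON" if normalize_level(label) in MONITORED_LEVELS else "OFF"
```

With these assumptions, the RAID units 6A ("RAID 1") and 6B ("RAID 10") map to "ON" and the RAID unit 6C ("RAID 50") maps to "OFF", matching the flags described for FIG. 5.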

FIG. 6 is a flowchart showing an operation example of the RAID control unit 20 shown in FIG. 2. The process shown in FIG. 6 starts, for example, when the information processing apparatus 10 is powered on. As a premise of the process shown in FIG. 6, it is assumed that the threshold table 28 with the contents shown in FIG. 5 is statically stored in the main storage device 2 or in secondary storage (the HDD 7).

When the process shown in FIG. 6 starts, the RAID control unit 20 executes a RAID level check for each of the RAID units 6A, 6B, and 6C (step S01).

That is, the RAID control unit 20 refers to the threshold table 28 (FIG. 5) and checks the monitoring target flag of each of the RAID units 6A, 6B, and 6C. Here, as shown in FIG. 5, the monitoring target flag is "ON" for the RAID units 6A and 6B and "OFF" for the RAID unit 6C.

The RAID control unit 20 thereby determines that the RAID units 6A and 6B have a RAID level that ensures data redundancy, that is, a RAID level to be monitored. The RAID control unit 20 also determines that the RAID unit 6C is not to be monitored.

Next, the RAID control unit 20 configures the timers 23A and 23B according to this determination. Here, according to the contents of the threshold table 28, a monitoring cycle of "5 minutes" is set in the timer 23A for the RAID unit 6A, and a monitoring cycle of "5 minutes" is set in the timer 23B for the RAID unit 6B.

The RAID control unit 20 then starts the timers 23A and 23B. It can start the timers 23A and 23B so that the monitoring cycles of the RAID units 6A and 6B arrive at different times. Alternatively, the timers 23A and 23B can be started simultaneously.
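The effect of staggering the timer starts can be illustrated with a small schedule calculation. This is a sketch only: the function `notification_schedule`, its parameters, and the one-minute offset in the usage note are hypothetical, since the description gives no value for the offset between the two timers.

```python
import heapq
from datetime import timedelta


def notification_schedule(cycles, stagger, horizon):
    """List (elapsed_time, raid_unit) notifications up to `horizon`.

    cycles:  dict mapping RAID unit name -> monitoring cycle (timedelta)
    stagger: start offset between consecutive timers (timedelta)
    """
    if any(cycle <= timedelta(0) for cycle in cycles.values()):
        raise ValueError("monitoring cycles must be positive")
    heap = []
    for index, (unit, cycle) in enumerate(cycles.items()):
        # Each timer starts `stagger` later than the previous one.
        heapq.heappush(heap, (index * stagger + cycle, unit))
    events = []
    while heap and heap[0][0] <= horizon:
        when, unit = heapq.heappop(heap)
        events.append((when, unit))
        # A timer restarts its interval after each notification.
        heapq.heappush(heap, (when + cycles[unit], unit))
    return events
```

With two 5-minute cycles and a 1-minute stagger, the two units are notified at alternating times rather than simultaneously, so their management logs are never scanned in the same instant; with a stagger of zero both notifications coincide.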

The timers 23A and 23B then each measure the monitoring cycle time and, when that time elapses, notify the counting unit 24 and the detection unit 27 that the monitoring cycle has arrived (step S02). Assume, for example, that the timer 23A corresponding to the RAID unit 6A has notified the counting unit 24 and the detection unit 27 that the monitoring cycle has arrived.

The counting unit 24 then checks the management log (step S03). That is, the counting unit 24 refers to the threshold table 28 and obtains the check period of "10 minutes" for the RAID unit 6A. The counting unit 24 then refers to the management log 22 (FIG. 3) of the RAID management unit 21A and extracts, for example, the group of records whose time stamps fall within the "10 minute" check period counted back from the time stamp of the latest record.

Next, from the extracted records, the counting unit 24 counts the media errors for each HDD constituting the RAID unit 6A, that is, the number of media errors for HDD-0 (denoted "N1") and the number of media errors for HDD-1 (denoted "N2") (step S04). The counting unit 24 stores the media error count N1 in the counter 25 and the media error count N2 in the counter 26.
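Steps S03 and S04 can be sketched as a single windowed count. The tuple layout of the records and the function name `count_media_errors` are assumptions for illustration; the window is measured back from the time stamp of the latest record, as in the example above.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_media_errors(records, check_period):
    """Count "MediaError" results per HDD within the check period.

    records:      iterable of (timestamp, target, result) tuples
    check_period: timedelta measured back from the latest time stamp
    """
    records = list(records)
    if not records:
        return Counter()
    latest = max(timestamp for timestamp, _, _ in records)
    window_start = latest - check_period
    counts = Counter()
    for timestamp, target, result in records:
        if timestamp >= window_start and result == "MediaError":
            counts[target] += 1
    return counts
```

For the RAID unit 6A, `counts["TargetHDD-0"]` and `counts["TargetHDD-1"]` would play the roles of N1 and N2 held in the counters 25 and 26; a `Counter` returns 0 for an HDD with no media errors in the window.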

After the counting unit 24 has stored the media error counts in the counters 25 and 26, the detection unit 27 reads the media error counts N1 and N2 from the counters 25 and 26. The detection unit 27 also refers to the threshold table 28 (FIG. 5) and reads the warning threshold (50 times) and the slowdown threshold (100 times) for the RAID unit 6A (step S05).

Next, the detection unit 27 determines whether the media error count N1 of HDD-0 is less than the warning threshold (50 times) (step S06). If the media error count N1 of HDD-0 is less than the warning threshold (S06; YES), the process proceeds to step S11. If the media error count N1 of HDD-0 is equal to or greater than the warning threshold (S06; NO), the process proceeds to step S07.

In step S07, the detection unit 27 determines whether the media error count N1 of HDD-0 is at least 50 and less than 100. If the media error count N1 is within this range (S07; YES), the process proceeds to step S08. If the media error count N1 is not within this range (S07; NO), the process proceeds to step S09.

In step S08, the detection unit 27 instructs the RAID management unit 21A to write a warning record containing the media error count N1 to the management log 22. The RAID management unit 21A accordingly writes the warning record to the management log 22. The user of the information processing apparatus 10 can consult the warning record at a later date. When step S08 is complete, the process returns to step S02. Here, each time the timer 23 notifies the arrival of a monitoring cycle, it restarts timing of the monitoring interval, so that it is ready to notify the arrival of the next monitoring cycle.

When the process proceeds to step S09, the media error count N1 is equal to or greater than the slowdown threshold (100), so the detection unit 27 determines that a RAID slowdown phenomenon involving HDD-0 has occurred and identifies HDD-0 as its cause.

The detection unit 27 then instructs the RAID management unit 21A to disconnect HDD-0 (step S10). In accordance with the instruction, the RAID management unit 21A transitions HDD-0 (HDD 7A) of the RAID unit 6A to the disabled state, thereby disconnecting HDD-0. The process then returns to step S02.
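The three-way decision made for one HDD in steps S06 to S10 can be sketched as a small function. The thresholds 50 and 100 are the values given for the RAID unit 6A; the function name classify and the returned labels are hypothetical names chosen for this illustration.

```python
WARNING_THRESHOLD = 50    # warning threshold for RAID unit 6A (from the text)
SLOWDOWN_THRESHOLD = 100  # slowdown threshold for RAID unit 6A (from the text)

def classify(count, warning=WARNING_THRESHOLD, slowdown=SLOWDOWN_THRESHOLD):
    """Map a media error count to the branch taken in the flow.

    The labels returned here are hypothetical names for the three outcomes.
    """
    if count < warning:
        return "below_warning"  # S06; YES: move on to the next HDD
    if count < slowdown:
        return "warning"        # S07; YES: write a warning record (S08)
    return "slowdown"           # S07; NO: identify and disconnect the HDD (S09, S10)
```

Both comparisons use "less than", so a count exactly equal to a threshold falls into the more severe category, matching the "equal to or greater than" wording of the text.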

On the other hand, when the process proceeds to step S11, the detection unit 27 determines whether the media error count N2 of HDD-1 is less than the warning threshold (50). If N2 is less than the warning threshold (S11; YES), the RAID unit 6A is determined to be normal (step S12), and the process returns to step S02.

If, on the other hand, the media error count N2 of HDD-1 is equal to or greater than the warning threshold (S11; NO), the detection unit 27 determines whether N2 is in the range of 50 or more and less than 100 (step S13).

If the media error count N2 is within this range (S13; YES), the process proceeds to step S14. If N2 is not within this range (S13; NO), the process proceeds to step S15.

In step S14, the detection unit 27 instructs the RAID management unit 21A to write a warning record containing the media error count N2 to the management log 22. The RAID management unit 21A accordingly writes the warning record to the management log 22. The user of the information processing apparatus 10 can consult the warning record at a later date. The process then returns to step S02.

When the process proceeds to step S15, the media error count N2 is equal to or greater than the slowdown threshold (100), so the detection unit 27 determines that a RAID slowdown phenomenon involving HDD-1 has occurred and identifies HDD-1 as its cause.

The detection unit 27 then instructs the RAID management unit 21A to disconnect HDD-1 (step S16). In accordance with the instruction, the RAID management unit 21A transitions HDD-1 (HDD 7B) of the RAID unit 6A to the disabled state, thereby disconnecting HDD-1. The process then returns to step S02. When the monitoring cycle for the RAID unit 6B arrives, the timer 23B notifies the counting unit 24 and the detection unit 27 of the arrival, and the processing from step S02 onward is performed.
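The ordering of steps S06 to S16 within one monitoring cycle can be sketched as below. As described, HDD-1 is examined only when HDD-0 is below the warning threshold, and every other branch returns to step S02. The function name run_cycle and the returned action tuples are hypothetical, introduced only for this illustration.

```python
def run_cycle(n1, n2, warning=50, slowdown=100):
    """Return the single action taken in one monitoring cycle.

    n1 and n2 are the media error counts of HDD-0 and HDD-1. The action
    tuples are a hypothetical representation of the flow's outcomes.
    """
    for name, count in (("HDD-0", n1), ("HDD-1", n2)):
        if count < warning:
            continue                  # S06/S11; YES: check the next HDD
        if count < slowdown:
            return ("warn", name)     # S08/S14: write a warning record
        return ("disconnect", name)   # S09-S10 / S15-S16: disable the HDD
    return ("normal", None)           # S12: RAID unit judged normal
```

For example, run_cycle(60, 150) yields ("warn", "HDD-0"): because HDD-0 is already in the warning range, the cycle ends before HDD-1 is evaluated, and HDD-1 is picked up only in a later cycle in which HDD-0 is below the warning threshold.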

According to the operation example above, the timers can be set to the monitoring cycle values registered in the threshold table 28. The values in the threshold table 28 can therefore be updated by external input, and the monitoring cycles of the timers 23A and 23B can be changed at the next power-on or restart of the information processing apparatus. Likewise, the check period registered in the threshold table 28 can be updated by external input, and the check period (the range of records extracted from the management log 22) can be changed at the next power-on or restart of the information processing apparatus. Although the operation example above describes a case where the check period is longer than the monitoring cycle, the lengths of the check period and the monitoring cycle can be set as appropriate.
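How the check period bounds the records extracted from the management log can be sketched as follows. The record layout, the function name extract_records, and the 10-minute and 1-hour periods are all assumptions for illustration; the text does not fix these values here.

```python
from datetime import datetime, timedelta

def extract_records(log, now, check_period):
    """Return the log records whose timestamp lies within the check period.

    The "time" key and the list-of-dicts log are a hypothetical format.
    """
    start = now - check_period
    return [r for r in log if start <= r["time"] <= now]

now = datetime(2009, 4, 9, 12, 0, 0)
log = [
    {"time": now - timedelta(minutes=30), "hdd": "HDD-0"},
    {"time": now - timedelta(minutes=5), "hdd": "HDD-0"},
    {"time": now - timedelta(minutes=1), "hdd": "HDD-1"},
]
recent = extract_records(log, now, timedelta(minutes=10))  # hypothetical check period
wider = extract_records(log, now, timedelta(hours=1))      # a longer check period
```

Lengthening the check period widens the extraction range, so the same log yields more records and therefore higher per-HDD counts against unchanged thresholds.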

Furthermore, if an HDD is disconnected when its media error count is, for example, equal to or greater than the warning threshold, a slowdown phenomenon caused by that HDD can be prevented before it occurs.

Conventionally, recovering from a slowdown phenomenon required a long time to track down its cause, which had a large impact on users of the RAID device. Recovery from the slowdown phenomenon also required manual work.

According to the embodiment described above, when media errors occur frequently on one of the HDDs 7 constituting the RAID unit 6 and a slowdown phenomenon occurs, the detection unit 27 automatically detects the HDD 7 causing the slowdown phenomenon, and that HDD 7 is automatically disconnected. This allows early recovery from the slowdown phenomenon.

As described above, a slowdown phenomenon can be judged to have occurred when the number of media errors during the check period is equal to or greater than the slowdown threshold. The number of media errors can be counted using the management log 22 created by the RAID management unit 21. The occurrence of a slowdown phenomenon can thus be judged with a simple configuration, namely analysis of the management log 22.

The criterion for judging that a slowdown phenomenon has occurred depends on the type of RAID controller and on the RAID configuration, such as the RAID level and the number of HDDs constituting the RAID. In this embodiment, as shown in the processing flow of FIG. 6, the warning threshold and the slowdown threshold can be set to different values for each RAID controller 3 (RAID unit 6). Consequently, when a plurality of RAID controllers 3 are provided, as in the information processing apparatus 10 shown in FIG. 1, a warning threshold and a slowdown threshold suited to each RAID controller can be set, so that the slowdown phenomenon is judged and the HDD is disconnected appropriately. Note that it is not essential for the information processing apparatus to include a plurality of RAID units or a plurality of RAID controllers.
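A per-RAID-unit threshold lookup of this kind can be sketched as follows. Only the values for the RAID unit 6A (50 and 100) come from the text; the values for 6B, the table layout, and the function name slowdown_suspects are hypothetical.

```python
from typing import NamedTuple

class Thresholds(NamedTuple):
    warning: int
    slowdown: int

# Hypothetical in-memory form of the threshold table 28, keyed by RAID unit.
THRESHOLD_TABLE = {
    "6A": Thresholds(warning=50, slowdown=100),  # values given in the text
    "6B": Thresholds(warning=30, slowdown=60),   # hypothetical values
}

def slowdown_suspects(unit, counts, table=THRESHOLD_TABLE):
    """List the HDDs of `unit` whose count reaches that unit's slowdown threshold."""
    limit = table[unit].slowdown
    return [hdd for hdd, n in counts.items() if n >= limit]

counts = {"HDD-0": 80, "HDD-1": 10}
suspects_6a = slowdown_suspects("6A", counts)
suspects_6b = slowdown_suspects("6B", counts)
```

With these assumed values, the same count of 80 is below the slowdown threshold for 6A but above it for 6B, which is the effect of setting thresholds per RAID unit.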

In addition, the RAID unit (RAID system) 6C, which implements a RAID level that cannot ensure redundancy of the stored data, is judged to be outside the scope of monitoring, and the RAID units 6A and 6B, which perform processing that includes mirroring, are determined to be monitoring targets. The monitoring range can thereby be limited appropriately.

In the embodiment described above, the length of the monitoring cycle can also be made variable. A monitoring cycle that takes the characteristics of the RAID controller 3 into account can therefore be set. The length of the check period can likewise be changed, for example in consideration of the access frequency to the RAID unit 6.
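The selection of monitoring targets by RAID level can be sketched as below. The set of mirroring levels and the level assigned to each RAID unit are assumptions for illustration; this passage states only that 6C lacks redundancy and that 6A and 6B include mirroring.

```python
# Assumed set of RAID levels whose processing includes mirroring.
MIRRORING_LEVELS = {"RAID1", "RAID0+1", "RAID1+0"}

def monitoring_targets(raid_units):
    """Return the names of RAID units whose level includes mirroring."""
    return [name for name, level in raid_units.items() if level in MIRRORING_LEVELS]

# Hypothetical level assignment for the three RAID units.
units = {"6A": "RAID1", "6B": "RAID0+1", "6C": "RAID0"}
targets = monitoring_targets(units)
```

RAID 0 only stripes data and keeps no redundant copy, so disconnecting one of its HDDs would lose data; excluding such a unit from monitoring avoids that outcome.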

According to the embodiment described above, the time needed to recover from a slowdown phenomenon of the RAID device is shortened, which enables stable operation of the RAID device. In addition, because the processing for recovering from the slowdown phenomenon is performed automatically, manual work can be omitted.

DESCRIPTION OF SYMBOLS
1 ... CPU
2 ... Main storage device
3 ... RAID controller
4 ... LAN interface
5 ... Input/output unit
6 ... RAID unit
7 ... Hard disk drive
8 ... DVD
9 ... USB memory
10 ... Information processing apparatus
21 ... RAID management unit
22 ... Management log
23A, 23B ... Timer
24 ... Counting unit
25, 26 ... Counter
27 ... Detection unit
28 ... Threshold table

Claims

A RAID control device comprising:
a counting unit that counts, for each recording medium, the number of media errors within a predetermined time on a plurality of recording media constituting a RAID; and
a detection unit that detects a recording medium whose number of media errors within the predetermined time is equal to or greater than a threshold as a recording medium associated with a RAID slowdown phenomenon.
The RAID control device according to claim 1, further comprising a disconnection control unit that automatically disconnects the recording medium detected by the detection unit from the plurality of recording media constituting the RAID.
The RAID control device according to claim 1, wherein the threshold is provided for each recording medium.
The RAID control device according to claim 1, further comprising a determination unit that determines the RAID level of the RAID,
wherein the counting unit and the detection unit operate when the RAID level of the RAID includes RAID 1, which performs mirroring.
A program that causes a computer to execute:
a step of counting, for each recording medium, the number of media errors of data within a predetermined time on a plurality of recording media constituting a RAID; and
a step of detecting a recording medium whose number of media errors within the predetermined time is equal to or greater than a threshold as a recording medium associated with a RAID slowdown phenomenon.