WO2010116514A1 - Raid control device - Google Patents

Publication number
WO2010116514A1
Authority
WO
WIPO (PCT)
Prior art keywords
raid
unit
hdd
media
recording medium
Prior art date
Application number
PCT/JP2009/057291
Other languages
French (fr)
Japanese (ja)
Inventor
敬治 藤田
佳樹 伏見
Original Assignee
富士通株式会社
Priority date
Filing date
Publication date
Application filed by 富士通株式会社
Priority to PCT/JP2009/057291
Publication of WO2010116514A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0754 Error or fault detection not based on redundancy by exceeding limits
    • G06F11/076 Error or fault detection not based on redundancy by exceeding limits by exceeding a count or rate limit, e.g. word- or bit count limit

Definitions

  • the present invention relates to a RAID (Redundant Arrays of Inexpensive Disks) control device.
  • RAID is a technique for operating a plurality of recording media, for example, HDD (Hard Disk Drive) as one virtual HDD.
  • When data read errors and/or write errors (hereinafter also referred to as media errors) occur frequently in one of the plurality of HDDs constituting a RAID, the RAID control unit appears to stop operating.
  • Such a phenomenon is called a system slowdown phenomenon or a RAID slowdown phenomenon.
  • JP 2004-252692 A; JP 2005-267056 A
  • An object of one aspect of the present invention is to provide a technique capable of detecting a RAID slowdown phenomenon in the present or future.
  • The RAID control device includes a counting unit that counts, for each recording medium, the number of media errors occurring within a predetermined time on the plurality of recording media constituting the RAID, and a detection unit that detects a recording medium whose number of media errors within the predetermined time is equal to or greater than a threshold as a recording medium related to a RAID slowdown phenomenon.
  • Another aspect of the present invention is a method for detecting a faulty recording medium by the above-described RAID control apparatus.
  • Another aspect of the present invention is a program for a computer (information processing apparatus) to function as the above-described RAID control apparatus, or a recording medium on which the program is recorded.
  • a current or future RAID slowdown phenomenon can be detected.
  • FIG. 1 is a diagram illustrating a hardware configuration example of an information processing apparatus that implements an embodiment of a RAID control apparatus.
  • FIG. 2 is a block diagram schematically showing a RAID apparatus realized by the information processing apparatus shown in FIG. 1.
  • FIG. 3 shows an example of a management log created by the RAID management unit.
  • FIG. 4 shows an example of periodic slowdown phenomenon determination.
  • FIG. 5 shows an example of the data structure of the threshold table.
  • FIG. 6 is a flowchart illustrating an operation example of the RAID control unit.
  • FIG. 1 is a diagram illustrating a configuration example of an information processing apparatus to which a RAID control apparatus according to an embodiment is applied.
  • the information processing apparatus 10 is, for example, a dedicated or general-purpose server machine or a dedicated or general-purpose computer.
  • the general-purpose computer is, for example, a personal computer (PC).
  • The information processing apparatus 10 includes a CPU (Central Processing Unit) 1 as a processor, a main storage device 2, RAID controllers 3A, 3B, and 3C, a LAN (Local Area Network) interface 4, and an input/output (I/O) unit 5.
  • the CPU 1, the main storage device 2, the RAID controllers 3A, 3B, 3C, the LAN interface 4 and the input / output unit 5 are connected to each other via a bus B.
  • the main storage device 2 includes a ROM (Read Only Memory) storing programs and data and a RAM (Random Access Memory) used as a work area of the CPU 1.
  • the RAM is called a memory.
  • the RAID controller 3 controls a RAID unit (also referred to as a RAID system) 6.
  • A RAID controller 3A that controls the RAID unit 6A, a RAID controller 3B that controls the RAID unit 6B, and a RAID controller 3C that controls the RAID unit 6C are illustrated.
  • the RAID controller 3 is connected to a plurality of hard disk drives (HDD) 7 (referred to as disk arrays) as a plurality of recording media constituting the RAID unit 6.
  • the RAID unit 6A connected to the RAID controller 3A includes two HDDs 7A and 7B.
  • the RAID controller 3 is an integrated circuit that writes / reads data to / from the RAID unit 6. Further, the RAID controller 3 records a write / read access history for the RAID unit 6 as a management log.
  • the RAID unit 6A is set to perform RAID level 1 (RAID 1), that is, mirroring. Therefore, the RAID controller 3A writes the write target data to the HDD 7A and the HDD 7B.
  • RAID controller 3A uses, for example, one of the HDD 7A and the HDD 7B as a working HDD and the other as a standby HDD. For example, the HDD 7A is used as an active system, and the HDD 7B is used as a standby system.
  • When a failure of the active HDD 7A is detected, the RAID controller 3A disconnects the failed HDD 7A and changes the settings so that the HDD 7B is used as the active system.
  • the disconnection process is performed by disabling the HDD to be disconnected.
  • RAID unit 6B controlled by the RAID controller 3B implements the RAID level “RAID 10”
  • RAID unit 6C controlled by the RAID controller 3C implements the RAID level “RAID 50”.
  • the LAN interface 4 is a communication interface circuit for performing data transmission / reception processing with the network N.
  • the I / O unit 5 is a circuit for connecting a peripheral device such as an input device, an output device, and a portable recording medium to the information processing device 10.
  • the portable recording medium is, for example, a DVD 8 or a USB memory 9 as shown in FIG.
  • The DVD 8 or the USB memory 9 can be connected to the I/O unit 5, and a program or data stored on it can be installed in the RAID unit 6. Furthermore, the program installed in the RAID unit 6 can be loaded into the memory of the main storage device 2 by the CPU 1 and executed.
  • The ROM of the main storage device 2 or the RAID unit 6 stores an operating system (OS) and one or more application programs (referred to as applications). When the CPU 1 executes the OS and the applications, the information processing apparatus 10 can function as a RAID device.
  • the ROM, RAM, HDD 7, DVD 8, and USB memory 9 in the main storage device 2 shown in FIG. 1 are examples of computer-readable recording media, and the types of recording media are not limited to these.
  • RAID sections 6A to 6C are those shown in FIG. 2
  • the RAID unit 6A shown in FIG. 2 is a mirroring system including “HDD-0” corresponding to the HDD 7A and “HDD-1” corresponding to the HDD 7B.
  • The RAID control unit 20 is a function realized by the CPU 1 executing the OS. Specifically, the RAID control unit 20 as a RAID control device is realized by the CPU 1, the main storage device 2, and the RAID controller 3 shown in FIG. 1.
  • The RAID control unit 20 includes a RAID management unit 21 (21A, 21B, 21C), timers 23A and 23B, a counting unit 24, a plurality of counters including counters 25 and 26, a detection unit 27, and a threshold table 28.
  • The RAID management units 21A, 21B, and 21C are realized as functions of the RAID controllers 3A, 3B, and 3C, respectively.
  • The timer 23, the counting unit 24, and the detection unit 27 are realized as functions by the CPU 1 executing the OS.
  • the counters 25 and 26 and the threshold table 28 are created on the memory of the main storage device 2, for example.
  • The function of the RAID controller 3 can be realized by a dedicated processor or by a general-purpose processor such as a DSP (Digital Signal Processor) executing a program. Alternatively, it can be realized by the CPU 1 executing a program (for example, the OS).
  • Each RAID management unit 21 controls writing / reading of data to / from the corresponding RAID unit 6 and detachment of the HDD from the RAID unit 6.
  • Each RAID management unit 21 records a history of access to the corresponding RAID unit 6 as a management log 22.
  • FIG. 3 shows an example of the management log 22.
  • The management log 22 is a time series of records, each of which includes a time stamp indicating the date and time, the managed HDD, the result of an access to the managed HDD (that is, the result of reading data from or writing data to the HDD), and an identifier of the RAID unit 6.
  • the notation “TargetHDD-0” is used for the managed HDD-0
  • the notation “TargetHDD-1” is used for the managed HDD-1.
  • “Normal” is recorded when the access result is normal
  • “MediaError” is recorded when the access result is abnormal, that is, when a read error or a write error occurs.
  • the notation “RAID X” (X is a number or a symbol) can be used as the identifier of the RAID part.
  • Each of the timers 23A and 23B shown in FIG. 2 measures a monitoring interval based on a predetermined monitoring cycle.
  • the monitoring process of the RAID unit 6 (management log 22) is performed by the counting unit 24 and the detection unit 27 in a predetermined monitoring cycle (see FIG. 4).
  • the RAID unit 6A and the RAID unit 6B are periodically monitored.
  • the timer 23A is a timer for the RAID unit 6A
  • the timer 23B is a timer for the RAID unit 6B.
  • periodic monitoring is optional, and the timers 23A and 23B are not essential components.
  • the timers 23A and 23B notify the counting unit 24 and the detection unit 27 of the arrival of the monitoring cycle at every predetermined monitoring cycle.
  • The monitoring cycle, that is, the monitoring interval, can be set to a value from several minutes to several tens of minutes.
  • the monitoring cycle can be made variable through change setting by external input to the timers 23A and 23B.
  • The counting unit 24 extracts the records that fall within the predetermined check period from the management log 22 (FIG. 3) of the RAID management unit 21 corresponding to the monitoring cycle, and counts the number of media errors in those records for each managed HDD. For example, the count results for HDD-0 and HDD-1 constituting the RAID unit 6A are stored in the counters 25 and 26. Counters corresponding to the four HDDs constituting the RAID unit 6B (RAID level “RAID 10”) are also prepared (not shown).
  • the number of media errors (counting result) for HDD-0 and HDD-1 stored in the counters 25 and 26 indicates the number of media errors of each HDD during the check period.
  • the counter 25 holds the number of media errors of the HDD-0
  • the counter 26 holds the number of media errors of the HDD-1.
  • The detection unit 27 reads out the number of media errors from the counters 25 and 26 in synchronization with the processing of the counting unit 24. The detection unit 27 also reads a threshold value to be compared with the number of media errors from the threshold table 28, and determines whether the number of media errors is equal to or greater than the threshold value.
  • For example, when the number of media errors of HDD-0 is equal to or greater than the threshold value, the detection unit 27 detects HDD-0 as an HDD that is causing, or will cause in the future, a slowdown phenomenon, and determines HDD-0 as the HDD to be disconnected.
  • FIG. 5 shows an example of the data structure of the threshold table 28.
  • the threshold table 28 includes entries corresponding to the RAID units 6A, 6B, and 6C. An entry can be prepared for each RAID unit 6, for example. Alternatively, a configuration in which only the entry of the monitoring target RAID unit 6 is registered is applicable.
  • Each entry is given an entry number. Entry number 1 corresponds to the RAID part 6A, entry number 2 corresponds to the RAID part 6B, and entry number 3 corresponds to the RAID part 6C. Each entry can include a system number, a check period, a monitoring period, a warning threshold, a slowdown threshold, a controller identifier, and a monitoring target flag.
  • the system number is an identifier of the RAID unit 6 which is a RAID system.
  • the check period is a time for extracting (cutting out) records from the management log 22.
  • A configuration can be applied in which the counting unit 24 refers to the threshold table 28 to confirm the check period when extracting records from the management log 22.
  • The check period is stored as shown in FIG. 5. Storage of the check period is optional.
  • The monitoring period indicates the interval at which the RAID unit 6 is monitored, that is, the interval at which the number of media errors is counted and evaluated.
  • the monitoring interval is set in the timers 23A and 23B, and the timers 23A and 23B notify the arrival of the monitoring period every monitoring period (for example, 5 minutes).
  • storing the monitoring period for the threshold table 28 is optional. Furthermore, when the monitoring cycle is not provided, for example, when the counting unit 24 and the detection unit 27 operate according to a predetermined command, the monitoring cycle and the timers 23A and 23B can be omitted.
  • the monitoring cycle can be set to a different value for each RAID unit 6.
  • a timer 23A for the RAID unit 6A and a timer 23B for the RAID unit 6B are prepared.
  • When the monitoring period of the RAID unit 6A and the monitoring period of the RAID unit 6B can be shared, a configuration in which a single timer is provided can be applied.
  • the warning threshold is a threshold for determining whether or not the number of media errors within the check period is a value for issuing a warning, and is referred to by the detection unit 27.
  • the warning can be notified to the outside of the information processing apparatus 10 by various means such as sound, light, and display on a display.
  • a warning is recorded in the management log 22 by the RAID management unit 21 corresponding to the RAID unit 6.
  • the warning threshold value can be omitted.
  • The slowdown threshold is a threshold for determining that the number of media errors is a factor of the RAID slowdown phenomenon, or is likely to become a factor of the slowdown phenomenon, and is referred to by the detection unit 27.
  • The warning threshold value and the slowdown threshold value can also be set to different values for each RAID unit 6 or RAID controller.
  • the controller identifier is an identifier of a RAID controller that controls the RAID part specified by the system number.
  • the monitoring target flag is a flag indicating whether or not the RAID unit 6 is a monitoring target by the counting unit 24 and the detection unit 27.
  • The monitoring targets of the counting unit 24 and the detection unit 27 are limited to RAID units that support RAID level 1 (mirroring). Therefore, in the example shown in FIGS. 1 and 2, the RAID unit 6A that implements the RAID level “RAID 1” and the RAID unit 6B that implements the RAID level “RAID 10” are monitored, and the RAID unit 6C that implements the RAID level “RAID 50” is a non-monitoring target.
  • The configuration of the RAID control unit 20 in the present embodiment is applicable not only to the RAID level “RAID 1” itself but also to a RAID unit that implements a RAID level combining RAID level 1 with another RAID level. That is, the RAID levels monitored by the RAID control unit 20 can include at least RAID “1”, “0 + 1”, “1 + 0”, “1 + 5”, “5 + 1”, “1 + 6”, and “6 + 1”.
  • the monitoring target flag “ON” is set for the RAID portion of the RAID level to be monitored, and the monitoring target flag “OFF” is set for the RAID portion of the RAID level that is not to be monitored. Therefore, in the example shown in FIG. 2, the RAID control unit 20 monitors the number of media errors for the RAID unit 6A and the RAID unit 6B, and does not monitor the RAID unit 6C.
  • FIG. 6 is a flowchart showing an operation example of the RAID control unit 20 shown in FIG. 2.
  • the process illustrated in FIG. 6 is started, for example, when the information processing apparatus 10 is turned on.
  • the threshold table 28 having the contents shown in FIG. 5 is statically stored in the main storage device 2 or the secondary storage (HDD 7).
  • the RAID control unit 20 executes a RAID level check process for each of the RAID units 6A, 6B, and 6C (step S01).
  • the RAID control unit 20 refers to the threshold value table 28 (FIG. 5) and confirms the monitoring target flags corresponding to the RAID units 6A, 6B, and 6C.
  • the monitoring target flag for the RAID units 6A and 6B is “ON”, and the monitoring target flag for the RAID unit 6C is “OFF”.
  • the RAID control unit 20 determines that the RAID units 6A and 6B have a RAID level that ensures data redundancy, that is, a monitoring target RAID level. On the other hand, the RAID control unit 20 determines that the RAID unit 6C is not a monitoring target, that is, a non-monitoring target.
  • the RAID control unit 20 performs settings for the timers 23A and 23B according to the monitoring target determination result described above.
  • the monitoring period “5 minutes” is set in the timer 23A for the RAID section 6A
  • the monitoring period “5 minutes” is set in the timer 23B for the RAID section 6B.
  • the RAID control unit 20 starts the timers 23A and 23B.
  • the RAID control unit 20 can start the timers 23A and 23B so that the arrival of the monitoring periods for the RAID units 6A and 6B is shifted.
  • the timers 23A and 23B can be started at the same time.
  • the timers 23A and 23B each measure the monitoring cycle time, and when the monitoring cycle time is reached, the arrival of the monitoring cycle is notified to the counting unit 24 and the detection unit 27 (step S02). For example, it is assumed that the timer 23A corresponding to the RAID unit 6A notifies the counting unit 24 and the detection unit 27 that the monitoring period has arrived.
  • The counting unit 24 checks the management log (step S03). That is, the counting unit 24 refers to the threshold table 28 and confirms the check period “10 minutes” corresponding to the RAID unit 6A. Subsequently, the counting unit 24 refers to the management log 22 (FIG. 3) of the RAID management unit 21A and extracts, for example, the group of records whose time stamps fall within the check period “10 minutes” counted back from the time stamp of the latest record.
  • From the extracted record group, the counting unit 24 counts the number of media errors for each HDD constituting the RAID unit 6A, that is, the number of media errors for HDD-0 (referred to as “N1”) and the number of media errors for HDD-1 (referred to as “N2”) (step S04).
  • The counting unit 24 stores the media error number N1 in the counter 25, and stores the media error number N2 in the counter 26.
  • The detection unit 27 reads the media error numbers N1 and N2 from the counters 25 and 26 after the counting unit 24 stores the media error numbers in the counters 25 and 26.
  • the detection unit 27 refers to the threshold value table 28 (FIG. 5) and reads the warning threshold value (50 times) and the slowdown threshold value (100 times) corresponding to the RAID unit 6A (step S05).
  • the detection unit 27 determines whether or not the media error number N1 of the HDD-0 is less than the warning threshold value (50 times) (step S06). At this time, if the number N1 of media errors in HDD-0 is less than the warning threshold (S06; YES), the process proceeds to step S11. On the other hand, if the media error number N1 of HDD-0 is equal to or greater than the warning threshold (S06; NO), the process proceeds to step S07.
  • In step S07, the detection unit 27 determines whether the media error number N1 of the HDD-0 is in the range of 50 times or more and less than 100 times. At this time, if the media error number N1 is within the above range (S07; YES), the process proceeds to step S08. On the other hand, if the media error number N1 is not within the above range (S07; NO), the process proceeds to step S09.
  • In step S08, the detection unit 27 instructs the RAID management unit 21A to write a warning record including the media error number N1 to the management log 22.
  • the RAID management unit 21A writes a warning record in the management log 22.
  • the warning record can be used at a later date by the user of the information processing apparatus 10.
  • In step S09, the detection unit 27 determines that the RAID slowdown phenomenon related to HDD-0 has occurred because the media error number N1 is equal to or greater than the slowdown threshold (100 times), and determines that the cause of the slowdown phenomenon is HDD-0.
  • The detection unit 27 then instructs the RAID management unit 21A to disconnect the HDD-0 (step S10).
  • the RAID management unit 21A changes the HDD-0 (HDD 7A) of the RAID unit 6A to the disabled state in accordance with the instruction. As a result, the HDD-0 is disconnected. Thereafter, the process returns to step S02.
  • In step S11, the detection unit 27 determines whether the media error number N2 of the HDD-1 is less than the warning threshold (50 times). At this time, if the number of media errors in HDD-1 is less than the warning threshold (S11; YES), it is determined that the RAID unit 6A is normal (step S12), and the process returns to step S02.
  • On the other hand, if the media error number N2 is equal to or greater than the warning threshold (S11; NO), the detection unit 27 determines whether the media error number N2 of HDD-1 is in the range of 50 times or more and less than 100 times (step S13).
  • At this time, if the media error number N2 is within the above range (S13; YES), the process proceeds to step S14. On the other hand, if the media error number N2 is not within the above range (S13; NO), the process proceeds to step S15.
  • In step S14, the detection unit 27 instructs the RAID management unit 21A to write a warning record including the media error number N2 in the management log 22.
  • the RAID management unit 21A writes a warning record in the management log 22.
  • the warning record can be used later by the user of the information processing apparatus 10. Thereafter, the process returns to step S02.
  • In step S15, the detection unit 27 determines that the RAID slowdown phenomenon related to HDD-1 has occurred because the media error number N2 is equal to or greater than the slowdown threshold (100 times), and determines that the cause of the slowdown phenomenon is HDD-1.
  • the detection unit 27 instructs the RAID management unit 21A to disconnect the HDD-1 (step S16).
  • the RAID management unit 21A changes the HDD-1 (HDD 7B) of the RAID unit 6A to the disabled state according to the instruction. As a result, the HDD-1 is disconnected. Thereafter, the process returns to step S02. If the monitoring period for the RAID unit 6B has arrived, the timer 23B notifies the counting unit 24 and the detection unit 27 of the arrival, and the processing from step S02 is performed.
  • the timer can be set with the value of the monitoring period registered in the threshold table 28.
  • the value of the threshold table 28 can be updated by external input, and the monitoring cycle by the timers 23A and 23B can be changed when the power is turned on or the information processing apparatus is restarted.
  • the check period registered in the threshold table 28 can be updated by external input, and the check period (extraction range of records from the management log 22) can be changed when the power is turned on or the information processing apparatus is restarted.
  • An example in which the check period is longer than the monitoring period has been described.
  • the length of the check period and the monitoring period can be set as appropriate.
  • If the target HDD is disconnected when the number of media errors of a certain HDD is equal to or greater than, for example, the warning threshold, the occurrence of a slowdown phenomenon caused by that HDD can be prevented.
  • The detection unit 27 automatically detects the HDD 7 causing the slowdown phenomenon, and the disconnection process for that HDD 7 is executed automatically. As a result, recovery from the slowdown phenomenon can be achieved at an early stage.
  • When the number of media errors during the check period is equal to or greater than the slowdown threshold, it can be determined that the slowdown phenomenon has occurred.
  • the number of media errors can be counted using the management log 22 created by the RAID management unit 21. Accordingly, it is possible to determine the occurrence of the slowdown phenomenon with a simple configuration of analyzing the management log 22.
  • the criterion for determining the occurrence of the slowdown phenomenon depends on the RAID configuration such as the type of RAID controller, the RAID level, and the number of HDDs constituting the RAID.
  • the warning threshold value and the slowdown threshold value can be set to different values for each RAID controller 3 (RAID unit 6).
  • Because the warning threshold value and the slowdown threshold value corresponding to the RAID controller are set, appropriate slowdown phenomenon determination and HDD disconnection can be performed. Note that it is not an essential requirement that the information processing apparatus includes a plurality of RAID units and a plurality of RAID controllers.
  • the RAID unit (RAID system) 6C that implements the RAID level where the redundancy of the stored data cannot be ensured is determined not to be monitored, and the RAID units 6A and 6B that perform processing including mirroring are determined as monitoring targets.
  • the monitoring range can be appropriately limited.
  • the time of the monitoring cycle can be made variable. Therefore, it is possible to set a monitoring cycle in consideration of the characteristics of the RAID controller 3. Also, the length of the check period can be changed in consideration of the access frequency to the RAID unit 6, for example.
  • The time until recovery from a slowdown phenomenon of the RAID device is shortened.
  • the RAID apparatus can be stably operated. Further, since the processing related to recovery from the slowdown phenomenon is automatically performed, it is possible to omit manual work.
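
The threshold table entries and the per-HDD decision flow of steps S06 to S16 described in the list above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patent's implementation: the names `ThresholdEntry`, `Verdict`, `is_monitoring_target`, `classify`, and `evaluate_raid_unit` are hypothetical, the controller identifier value is invented for the example, and treating "10" as shorthand for "1 + 0" is an assumption. The numeric values (check period 10 minutes, monitoring period 5 minutes, warning threshold 50, slowdown threshold 100) are the ones the text gives for the RAID unit 6A.

```python
from dataclasses import dataclass
from enum import Enum

# RAID levels listed in the text as monitoring targets (all include mirroring).
# "10" is added here as an assumed shorthand for "1+0".
MONITORED_LEVELS = {"1", "0+1", "1+0", "1+5", "5+1", "1+6", "6+1", "10"}


class Verdict(Enum):
    NORMAL = "normal"
    WARNING = "warning"      # write a warning record to the management log
    SLOWDOWN = "slowdown"    # disconnect the HDD


@dataclass(frozen=True)
class ThresholdEntry:
    """One entry of the threshold table (field names are hypothetical)."""
    system_number: str
    check_period_min: int
    monitoring_period_min: int
    warning_threshold: int
    slowdown_threshold: int
    controller_id: str
    monitored: bool


def is_monitoring_target(raid_level: str) -> bool:
    """Return True for RAID levels that include RAID level 1 (mirroring)."""
    return raid_level.replace(" ", "") in MONITORED_LEVELS


def classify(error_count: int, entry: ThresholdEntry) -> Verdict:
    """Compare one HDD's media error count with the two thresholds."""
    if error_count < entry.warning_threshold:
        return Verdict.NORMAL
    if error_count < entry.slowdown_threshold:
        return Verdict.WARNING
    return Verdict.SLOWDOWN


def evaluate_raid_unit(entry: ThresholdEntry, counts: dict):
    """Check the HDDs in order and stop at the first one that is not normal.

    This mirrors the described flow, in which HDD-1 is examined only when
    HDD-0 is below the warning threshold.
    """
    for hdd, count in counts.items():
        verdict = classify(count, entry)
        if verdict is not Verdict.NORMAL:
            return hdd, verdict
    return None, Verdict.NORMAL
```

For example, with an entry built as `ThresholdEntry("RAID 1", 10, 5, 50, 100, "3A", is_monitoring_target("1"))`, calling `evaluate_raid_unit(entry, {"HDD-0": 10, "HDD-1": 120})` reports HDD-1 with a slowdown verdict, which corresponds to the disconnection instruction of step S16.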

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The RAID control device comprises a counting unit, which counts, for each storage medium, the number of media errors within a specified time by a plurality of storage media that constitute a RAID, and a detecting unit which detects, as a recording medium associated with RAID slowdown, a storage medium on which the number of media errors in the aforementioned specified time is above a threshold value.

Description

RAID control device
The present invention relates to a RAID (Redundant Arrays of Inexpensive Disks) control device.
RAID is a technique for operating a plurality of recording media, for example, HDDs (Hard Disk Drives), as one virtual HDD. When data read errors and/or write errors (hereinafter also referred to as media errors) occur frequently in one of the plurality of HDDs constituting a RAID, the RAID control unit appears to stop operating. Such a phenomenon is called a system slowdown phenomenon or a RAID slowdown phenomenon.
JP 2004-252692 A
JP 2005-267056 A
Conventionally, when a RAID slowdown phenomenon occurred, the HDD in which media errors were occurring frequently was identified manually and replaced manually in order to eliminate the slowdown phenomenon. For this reason, resolving the slowdown phenomenon took a great deal of time.
An object of one aspect of the present invention is to provide a technique capable of detecting a RAID slowdown phenomenon in the present or future.
One aspect of the present invention is a RAID control device. The RAID control device includes a counting unit that counts, for each recording medium, the number of media errors occurring within a predetermined time on the plurality of recording media constituting a RAID, and a detection unit that detects a recording medium whose number of media errors within the predetermined time is equal to or greater than a threshold as a recording medium related to a RAID slowdown phenomenon.
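
The counting unit and the detection unit can be illustrated with a minimal Python sketch. It assumes that each management log record has already been parsed into a time stamp, a managed HDD, an access result ("Normal" or "MediaError"), and a RAID identifier. The `LogRecord` type, the function names, and the inclusive handling of the window boundary are assumptions made for this illustration and are not taken from the publication.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogRecord:
    """One parsed management log record (hypothetical structure)."""
    timestamp: datetime
    target: str    # e.g. "TargetHDD-0"
    result: str    # "Normal" or "MediaError"
    raid_id: str   # e.g. "RAID 1"


def count_media_errors(records, check_period: timedelta) -> Counter:
    """Counting unit: count MediaError records per HDD within the check period.

    The window ends at the time stamp of the latest record and reaches back
    by `check_period`; the start boundary is treated as inclusive (assumption).
    """
    if not records:
        return Counter()
    latest = max(r.timestamp for r in records)
    start = latest - check_period
    return Counter(
        r.target
        for r in records
        if r.timestamp >= start and r.result == "MediaError"
    )


def detect_slowdown_media(counts: Counter, threshold: int) -> list:
    """Detection unit: return the HDDs whose count is at or above the threshold."""
    return sorted(hdd for hdd, n in counts.items() if n >= threshold)
```

A caller would pass the records of one RAID unit and that unit's check period to `count_media_errors`, then hand the resulting counts to `detect_slowdown_media` together with the threshold for that unit.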
Another aspect of the present invention is a method by which the above RAID control device detects a faulty recording medium. Further aspects are a program that causes a computer (information processing apparatus) to function as the above RAID control device, and a recording medium on which the program is recorded.
According to one aspect of the present invention, a current or future RAID slowdown phenomenon can be detected.
FIG. 1 is a diagram illustrating a hardware configuration example of an information processing apparatus that implements an embodiment of the RAID control device.
FIG. 2 is a block diagram schematically showing a RAID apparatus realized by the information processing apparatus shown in FIG. 1.
FIG. 3 shows an example of a management log created by a RAID management unit.
FIG. 4 shows an example of periodic slowdown phenomenon determination.
FIG. 5 shows an example of the data structure of the threshold table.
FIG. 6 is a flowchart illustrating an operation example of the RAID control unit.
Embodiments of the present invention are described below with reference to the drawings. The configurations in the following embodiments are examples, and the present invention is not limited to them.
FIG. 1 is a diagram illustrating a configuration example of an information processing apparatus to which the RAID control device according to the embodiment is applied. In FIG. 1, the information processing apparatus 10 is, for example, a dedicated or general-purpose server machine, or a dedicated or general-purpose computer. The general-purpose computer is, for example, a personal computer (PC).
The information processing apparatus 10 includes a CPU (Central Processing Unit) 1 as a processor, a main storage device 2, RAID controllers 3A, 3B, and 3C, a LAN (Local Area Network) interface 4, and an input/output (I/O) unit 5. The CPU 1, the main storage device 2, the RAID controllers 3A, 3B, and 3C, the LAN interface 4, and the I/O unit 5 are connected to one another via a bus B.
The main storage device 2 includes a ROM (Read Only Memory) that stores programs and data, and a RAM (Random Access Memory) used as a work area by the CPU 1. The RAM is referred to as the memory.
The RAID controller 3 controls a RAID unit (also referred to as a RAID system) 6. The example in FIG. 1 shows a RAID controller 3A that controls a RAID unit 6A, a RAID controller 3B that controls a RAID unit 6B, and a RAID controller 3C that controls a RAID unit 6C.
A plurality of hard disk drives (HDDs) 7, referred to as a disk array, are connected to the RAID controller 3 as the plurality of recording media constituting the RAID unit 6. In the example shown in FIG. 1, the RAID unit 6A connected to the RAID controller 3A includes two HDDs, 7A and 7B.
The RAID controller 3 is an integrated circuit that writes data to and reads data from the RAID unit 6. The RAID controller 3 also records the history of write and read accesses to the RAID unit 6 as a management log.
The RAID unit 6A is configured for RAID level 1 (RAID 1), that is, mirroring. The RAID controller 3A therefore writes each piece of write data to both the HDD 7A and the HDD 7B. The RAID controller 3A uses, for example, one of the HDD 7A and the HDD 7B as the active HDD and the other as the standby HDD. For example, the HDD 7A is used as the active drive and the HDD 7B as the standby drive. When a failure of the active HDD 7A is detected, the RAID controller 3A detaches the failed HDD 7A and changes the settings so that the HDD 7B is used as the active drive. Detachment is performed by placing the target HDD in a disabled state.
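The mirroring and detachment behavior described above can be illustrated with a minimal Python sketch. It is not the controller's actual implementation: the class name `MirrorPair`, the dictionary-based model of a drive, and the method names are all assumptions made for illustration only.

```python
class MirrorPair:
    """Toy model of a RAID 1 pair (hypothetical; drives are modelled as dicts)."""

    def __init__(self):
        # Block number -> data, one store per drive (an illustrative assumption).
        self.drives = {"HDD-0": {}, "HDD-1": {}}
        self.enabled = {"HDD-0": True, "HDD-1": True}
        self.active = "HDD-0"  # HDD-0 starts as the active drive

    def write(self, block, data):
        # Mirroring: the same data is written to every enabled drive.
        for name, store in self.drives.items():
            if self.enabled[name]:
                store[block] = data

    def read(self, block):
        # Reads are served from the active drive.
        return self.drives[self.active][block]

    def detach(self, name):
        # Detaching means disabling the drive; if it was the active drive,
        # the remaining enabled drive is promoted to active.
        self.enabled[name] = False
        if self.active == name:
            survivors = [n for n, ok in self.enabled.items() if ok]
            if not survivors:
                raise RuntimeError("no enabled drive left in the mirror")
            self.active = survivors[0]
```

After `detach("HDD-0")`, reads continue to succeed from HDD-1 because every earlier write was mirrored to it.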
The RAID unit 6B controlled by the RAID controller 3B implements RAID level "RAID 10", and the RAID unit 6C controlled by the RAID controller 3C implements RAID level "RAID 50".
The LAN interface 4 is a communication interface circuit for transmitting and receiving data to and from a network N. The I/O unit 5 is a circuit for connecting peripheral devices, such as input devices, output devices, and portable recording media, to the information processing apparatus 10. The portable recording medium is, for example, a DVD 8 or a USB memory 9 as shown in FIG. 1.
For example, by connecting the DVD 8 or the USB memory 9 to the I/O unit 5, programs and data stored on it can be installed in the RAID unit 6. A program installed in the RAID unit 6 can then be loaded by the CPU 1 into the memory of the main storage device 2 and executed. The ROM of the main storage device 2 or the RAID unit 6 stores an operating system (OS) and one or more application programs (referred to as applications), and the CPU 1 can cause the information processing apparatus 10 to function as a RAID apparatus by executing the OS and the applications.
The ROM and RAM in the main storage device 2, the HDDs 7, the DVD 8, and the USB memory 9 shown in FIG. 1 are examples of computer-readable recording media, and the types of recording media are not limited to these.
FIG. 2 is a block diagram schematically showing the RAID apparatus realized by the information processing apparatus 10 shown in FIG. 1. When the CPU 1 executes a program loaded into the main storage device 2, the information processing apparatus 10 functions as a RAID apparatus including the RAID units 6 and a RAID control unit 20, as illustrated in FIG. 2.
In FIG. 2, the RAID units 6A to 6C are those shown in FIG. 1. The RAID unit 6A shown in FIG. 2 is a mirroring system consisting of "HDD-0", which corresponds to the HDD 7A, and "HDD-1", which corresponds to the HDD 7B.
In the present embodiment, the RAID control unit 20 is a function realized by the CPU 1 executing the OS. More specifically, the RAID control unit 20, which serves as the RAID control device, is realized by the CPU 1, the main storage device 2, and the RAID controllers 3 shown in FIG. 1. The RAID control unit 20 includes RAID management units 21 (21A, 21B, 21C), timers 23A and 23B, a counting unit 24, a plurality of counters including counters 25 and 26, a detection unit 27, and a threshold table 28.
The RAID management units 21A, 21B, and 21C are realized as functions of the RAID controllers 3A, 3B, and 3C, respectively. The timers 23, the counting unit 24, and the detection unit 27 are realized as functions provided by the CPU 1 executing the OS. The counters 25 and 26 and the threshold table 28 are created, for example, in the memory of the main storage device 2.
The functions of the RAID controller 3 can be realized by a dedicated processor, or by a general-purpose processor such as a DSP (Digital Signal Processor), executing a program. Alternatively, they can be realized by the CPU 1 executing a program (for example, the OS).
Each RAID management unit 21 controls the writing and reading of data to and from its corresponding RAID unit 6, and the detachment of HDDs from that RAID unit 6. Each RAID management unit 21 also records the history of accesses to its corresponding RAID unit 6 as a management log 22.
FIG. 3 shows an example of the management log 22. The management log 22 records, in chronological order, records that each contain a time stamp indicating the date and time, the managed HDD, the result of the access to that HDD (that is, the result of reading data from or writing data to the HDD), and the identifier of the RAID unit 6.
In the example shown in FIG. 3, the managed HDD-0 is written as "TargetHDD-0" and the managed HDD-1 as "TargetHDD-1". As the access result, "Normal" is recorded when the access completed normally, and "MediaError" is recorded when the access failed, that is, when a read error or write error occurred. The RAID unit identifier can be written, for example, as "RAID X" (where X is a number or symbol).
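As an illustration of the record structure just described, the following Python sketch models one management log record and parses it from a text line. The field names and the whitespace-separated line layout are assumptions; the source specifies only which items a record contains, not the exact on-disk format of FIG. 3.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    timestamp: datetime  # date and time of the access
    target: str          # e.g. "TargetHDD-0"
    result: str          # "Normal" or "MediaError"
    raid_id: str         # e.g. "RAID 1"


def parse_record(line: str) -> LogRecord:
    # Assumed layout: "<ISO timestamp> <target> <result> <RAID identifier>".
    # maxsplit=3 keeps an identifier such as "RAID 1" together as one field.
    stamp, target, result, raid_id = line.split(maxsplit=3)
    return LogRecord(datetime.fromisoformat(stamp), target, result, raid_id)
```

For instance, `parse_record("2009-04-09T10:00:00 TargetHDD-0 MediaError RAID 1")` yields a record whose `result` is `"MediaError"`.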
Each of the timers 23A and 23B shown in FIG. 2 measures a monitoring interval based on a predetermined monitoring cycle. In the present embodiment, the counting unit 24 and the detection unit 27 monitor the RAID unit 6 (via the management log 22) at the predetermined monitoring cycle (see FIG. 4). In the example shown in FIG. 2, the RAID units 6A and 6B are monitored periodically. The timer 23A is the timer for the RAID unit 6A, and the timer 23B is the timer for the RAID unit 6B.
As shown in FIG. 4, when a predetermined number of media errors or more have occurred within a predetermined check period, it can be determined that a RAID slowdown phenomenon has occurred. Periodic monitoring is optional, however, and the timers 23A and 23B are not essential components.
At every monitoring cycle, each of the timers 23A and 23B notifies the counting unit 24 and the detection unit 27 that the monitoring cycle has arrived. The monitoring cycle, that is, the monitoring interval, can be set to a time from several minutes to several tens of minutes. The monitoring cycle can be changed through settings supplied to the timers 23A and 23B by external input.
When the timer 23A or 23B notifies the counting unit 24 that a monitoring cycle has arrived, the counting unit 24 extracts the records that fall within a predetermined check period from the management log 22 (FIG. 3) of the RAID management unit 21 corresponding to that monitoring cycle, and counts the media errors in those records for each managed HDD. For example, the counts for HDD-0 and HDD-1, which constitute the RAID unit 6A, are stored in the counters 25 and 26. For the RAID unit 6B (RAID level "RAID 10"), counters (not shown) corresponding to its four HDDs are provided.
The media error counts for HDD-0 and HDD-1 stored in the counters 25 and 26 indicate how many media errors occurred on each HDD during the check period. The counter 25 holds the media error count of HDD-0, and the counter 26 holds the media error count of HDD-1.
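The per-HDD counting over a check period can be sketched as follows. This is a minimal illustration under stated assumptions: records are simplified to `(timestamp, target, result)` tuples, the function name `count_media_errors` is hypothetical, and the check period is taken to end at the time stamp of the newest record.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_media_errors(records, check_period):
    """Count "MediaError" records per HDD inside the check period.

    records: iterable of (timestamp, target, result) tuples (assumed shape).
    check_period: timedelta measured back from the newest record (assumption).
    """
    records = list(records)
    if not records:
        return Counter()
    latest = max(ts for ts, _, _ in records)
    start = latest - check_period
    # The Counter plays the role of the per-HDD counters (such as 25 and 26).
    return Counter(
        target
        for ts, target, result in records
        if ts >= start and result == "MediaError"
    )
```

A `Counter` returns 0 for an HDD with no media errors, so the result can be compared against a threshold for every drive without special-casing missing keys.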
When a timer 23 notifies the detection unit 27 that the monitoring cycle of a monitored RAID unit 6 (for example, the RAID unit 6A) has arrived, the detection unit 27 reads the media error counts from the counters 25 and 26 in synchronization with the processing of the counting unit 24. The detection unit 27 also reads from the threshold table 28 the threshold to be compared with the media error counts, and determines whether each media error count is equal to or greater than the threshold.
If the detection unit 27 determines that the media error count of either HDD-0 or HDD-1 is equal to or greater than the threshold, it detects (identifies) that HDD as the HDD related to the RAID slowdown phenomenon. The detection unit 27 then instructs the RAID management unit 21 to detach the detected HDD from the RAID unit 6.
For example, as shown in FIG. 3, when the detection unit 27 determines that the media error count of HDD-0 in the check period is equal to or greater than a slowdown threshold of 10, it detects HDD-0 as the HDD causing a current or future slowdown phenomenon and selects HDD-0 as the HDD to be detached.
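The threshold comparison performed by the detection unit can be expressed in a few lines of Python. The function name `detect_slowdown_drives` and the dictionary input are illustrative assumptions; the comparison itself ("equal to or greater than the threshold") follows the text.

```python
def detect_slowdown_drives(error_counts, slowdown_threshold):
    """Return the HDDs whose media error count is at or above the threshold.

    error_counts: mapping of HDD name -> media error count in the check period.
    """
    return sorted(
        name for name, count in error_counts.items()
        if count >= slowdown_threshold
    )
```

With a slowdown threshold of 10, as in the example above, `detect_slowdown_drives({"HDD-0": 12, "HDD-1": 3}, 10)` selects only HDD-0 as the detachment target.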
FIG. 5 shows an example of the data structure of the threshold table 28. In the example shown in FIG. 5, the threshold table 28 consists of entries corresponding to the RAID units 6A, 6B, and 6C. An entry can be provided, for example, for each RAID unit 6. Alternatively, a configuration in which entries are registered only for the monitored RAID units 6 can be used.
Each entry is assigned an entry number. Entry number 1 corresponds to the RAID unit 6A, entry number 2 to the RAID unit 6B, and entry number 3 to the RAID unit 6C.
Each entry can include a system number, a check period, a monitoring cycle, a warning threshold, a slowdown threshold, a controller identifier, and a monitoring target flag.
The system number is the identifier of the RAID unit 6, which is a RAID system. The check period is the time span used to extract (cut out) records from the management log 22.
The counting unit 24 may be configured to hold the check period at all times. Alternatively, the counting unit 24 may consult the threshold table 28 to confirm the check period when extracting records from the management log 22. In the latter case, the check period is stored in the threshold table 28 as shown in FIG. 5. Storing the check period is optional.
The monitoring cycle indicates the interval at which the RAID unit 6 is monitored, that is, at which the media errors are counted and evaluated. The monitoring interval is set in the timers 23A and 23B, and each timer notifies the arrival of the monitoring cycle at every monitoring cycle (for example, every 5 minutes).
For this reason, storing the monitoring cycle in the threshold table 28 is optional. Furthermore, when no monitoring cycle is used, for example when the counting unit 24 and the detection unit 27 operate in response to a predetermined command, the monitoring cycle and the timers 23A and 23B can be omitted.
A different monitoring cycle can be set for each RAID unit 6. In the example shown in FIG. 2, the timer 23A for the RAID unit 6A and the timer 23B for the RAID unit 6B are provided. However, when the RAID units 6A and 6B can share the same monitoring cycle, a configuration with a single timer can be used.
The warning threshold is used to determine whether the number of media errors within the check period is high enough that a warning should be issued, and is referenced by the detection unit 27. A warning can be reported outside the information processing apparatus 10 by various means such as sound, light, or a message on a display. In the present embodiment, however, when the number of media errors is equal to or greater than the warning threshold, the warning is recorded in the management log 22 by the RAID management unit 21 corresponding to the RAID unit 6. If no warning record is to be kept, the warning threshold can be omitted.
The slowdown threshold is used to determine that the number of media errors is causing, or is highly likely to cause, a RAID slowdown phenomenon, and is referenced by the detection unit 27. Different warning and slowdown thresholds can be applied to each RAID unit 6 or RAID controller.
The controller identifier identifies the RAID controller that controls the RAID unit specified by the system number. The monitoring target flag indicates whether the RAID unit 6 is to be monitored by the counting unit 24 and the detection unit 27.
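The entry fields listed above map naturally onto a small record type. The Python sketch below is illustrative only: the class and field names are assumptions, and the concrete values for the RAID units 6B and 6C are placeholders, since the source gives worked values (10-minute check period, 5-minute monitoring cycle, thresholds of 50 and 100) only in the operation example for the RAID unit 6A.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ThresholdEntry:
    system_number: str           # identifier of the RAID unit
    check_period: timedelta      # log window examined on each pass
    monitoring_cycle: timedelta  # interval between passes
    warning_threshold: int       # count at which a warning is logged
    slowdown_threshold: int      # count at which the HDD is detached
    controller_id: str           # RAID controller for this RAID unit
    monitored: bool              # monitoring target flag (ON / OFF)


# Keys are entry numbers. Values for 6B and 6C are illustrative placeholders.
THRESHOLD_TABLE = {
    1: ThresholdEntry("6A", timedelta(minutes=10), timedelta(minutes=5), 50, 100, "3A", True),
    2: ThresholdEntry("6B", timedelta(minutes=10), timedelta(minutes=5), 50, 100, "3B", True),
    3: ThresholdEntry("6C", timedelta(minutes=10), timedelta(minutes=5), 50, 100, "3C", False),
}


def monitored_entries(table):
    """Return the entries whose monitoring target flag is ON."""
    return [entry for entry in table.values() if entry.monitored]
```

Keeping the flag in the same entry as the thresholds lets a single lookup decide both whether a RAID unit is monitored and which limits apply to it.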
Monitoring by the counting unit 24 and the detection unit 27 is limited to RAID units that support RAID level 1 (mirroring). In the example shown in FIGS. 1 and 2, therefore, the RAID unit 6A, which implements RAID level "RAID 1", and the RAID unit 6B, which implements RAID level "RAID 10", are monitored, while the RAID unit 6C, which implements RAID level "RAID 50", is not monitored.
The configuration of the RAID control unit 20 in the present embodiment is applicable not only to RAID level "RAID 1" itself but also to RAID units that implement a RAID level combining RAID level 1 with another RAID level. That is, the RAID levels monitored by the RAID control unit 20 can include at least RAID "1", "0+1", "1+0", "1+5", "5+1", "1+6", and "6+1".
In the threshold table 28, the monitoring target flag is set to "ON" for RAID units whose RAID level is to be monitored and to "OFF" for RAID units whose RAID level is not to be monitored. In the example shown in FIG. 2, therefore, the RAID control unit 20 monitors the number of media errors for the RAID units 6A and 6B and does not monitor the RAID unit 6C.
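One way to derive the monitoring target flag from a RAID level is to check whether RAID 1 is one of its component levels. The sketch below assumes that rule and a small alias table mapping the common shorthand "10" to "1+0" and "50" to "5+0"; both the function name and the rule are illustrative, since the source only lists the levels that are monitored.

```python
# Shorthand names for nested levels (RAID 10 is RAID 1+0, RAID 50 is RAID 5+0).
LEVEL_ALIASES = {"10": "1+0", "50": "5+0"}


def is_monitored_level(level: str) -> bool:
    """Return True when RAID 1 (mirroring) is a component of the given level."""
    normalized = LEVEL_ALIASES.get(level, level)
    return "1" in normalized.split("+")
```

Under this rule "1", "10", "0+1", "1+5", and "6+1" are monitored, while "50" is not, which matches the ON and OFF flags described for the RAID units 6A, 6B, and 6C.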
FIG. 6 is a flowchart showing an operation example of the RAID control unit 20 shown in FIG. 2. The processing shown in FIG. 6 starts, for example, when the information processing apparatus 10 is powered on. As a precondition for the processing shown in FIG. 6, the threshold table 28 with the contents shown in FIG. 5 is assumed to be stored statically in the main storage device 2 or in secondary storage (the HDDs 7).
When the processing shown in FIG. 6 starts, the RAID control unit 20 executes a RAID level check for each of the RAID units 6A, 6B, and 6C (step S01).
That is, the RAID control unit 20 refers to the threshold table 28 (FIG. 5) and checks the monitoring target flag for each of the RAID units 6A, 6B, and 6C. Here, as shown in FIG. 5, the monitoring target flags for the RAID units 6A and 6B are "ON", and the monitoring target flag for the RAID unit 6C is "OFF".
The RAID control unit 20 thereby determines that the RAID units 6A and 6B have a RAID level that ensures data redundancy, that is, a RAID level to be monitored. The RAID control unit 20 also determines that the RAID unit 6C is not a monitoring target.
Next, the RAID control unit 20 configures the timers 23A and 23B according to this determination. Here, following the contents of the threshold table 28, a monitoring cycle of "5 minutes" is set in the timer 23A for the RAID unit 6A, and a monitoring cycle of "5 minutes" is set in the timer 23B for the RAID unit 6B.
The RAID control unit 20 then starts the timers 23A and 23B. At this point, the RAID control unit 20 can start the timers 23A and 23B so that the monitoring cycles of the RAID units 6A and 6B arrive at different times. The timers 23A and 23B can also be started simultaneously.
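The effect of starting the two timers at staggered times can be seen by listing when each one fires. The helper below is a hypothetical illustration in whole minutes since power-on; the 2-minute offset used in the usage note is an arbitrary assumption, as the source does not specify how far apart the timers are started.

```python
def notification_times(cycle_min, start_offset_min, horizon_min):
    """Minutes since power-on at which a timer notifies, up to horizon_min.

    The timer is started at start_offset_min and fires every cycle_min minutes.
    """
    first = start_offset_min + cycle_min
    return list(range(first, horizon_min + 1, cycle_min))
```

With a 5-minute cycle, a timer started at minute 0 fires at 5, 10, 15, and 20, while one started at minute 2 fires at 7, 12, and 17, so the two monitoring passes never coincide.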
After that, the timers 23A and 23B each measure the monitoring cycle time and, when the monitoring cycle time is reached, notify the counting unit 24 and the detection unit 27 that the monitoring cycle has arrived (step S02). Assume, for example, that the timer 23A corresponding to the RAID unit 6A has notified the counting unit 24 and the detection unit 27 that the monitoring cycle has arrived.
The counting unit 24 then checks the management log (step S03). That is, the counting unit 24 refers to the threshold table 28 and confirms the check period of "10 minutes" for the RAID unit 6A. The counting unit 24 then refers to the management log 22 (FIG. 3) of the RAID management unit 21A and extracts, for example, the group of records whose time stamps fall within the "10 minute" check period measured from the time stamp of the latest record.
Next, from the extracted records, the counting unit 24 counts the media errors for each HDD constituting the RAID unit 6A, that is, the number of media errors for HDD-0 (denoted "N1") and the number of media errors for HDD-1 (denoted "N2") (step S04). The counting unit 24 stores the media error count N1 in the counter 25 and the media error count N2 in the counter 26.
After the counting unit 24 has stored the media error counts in the counters 25 and 26, the detection unit 27 reads the media error counts N1 and N2 from the counters 25 and 26. The detection unit 27 also refers to the threshold table 28 (FIG. 5) and reads the warning threshold (50) and the slowdown threshold (100) for the RAID unit 6A (step S05).
Next, the detection unit 27 determines whether the media error count N1 of HDD-0 is less than the warning threshold (50) (step S06). If N1 is less than the warning threshold (S06; YES), the processing proceeds to step S11. If N1 is equal to or greater than the warning threshold (S06; NO), the processing proceeds to step S07.
In step S07, the detection unit 27 determines whether the media error count N1 of HDD-0 is at least 50 and less than 100. If N1 is within this range (S07; YES), the processing proceeds to step S08. If N1 is not within this range (S07; NO), the processing proceeds to step S09.
In step S08, the detection unit 27 instructs the RAID management unit 21A to write a warning record containing the media error count N1 to the management log 22. The RAID management unit 21A accordingly writes the warning record to the management log 22. The warning record can be used later by the user of the information processing apparatus 10. When step S08 ends, the processing returns to step S02. Here, once a timer 23 has notified the arrival of the monitoring cycle, it restarts timing the monitoring interval, so that it is ready to notify the arrival of the next monitoring cycle.
When the processing proceeds to step S09, the media error count N1 is equal to or greater than the slowdown threshold (100), so the detection unit 27 determines that a RAID slowdown phenomenon related to HDD-0 has occurred and identifies HDD-0 as the cause of the slowdown phenomenon.
The detection unit 27 then instructs the RAID management unit 21A to detach HDD-0 (step S10). Following this instruction, the RAID management unit 21A changes HDD-0 (the HDD 7A) of the RAID unit 6A to the disabled state, which detaches HDD-0. The processing then returns to step S02.
When the processing proceeds to step S11, on the other hand, the detection unit 27 determines whether the media error count N2 of HDD-1 is less than the warning threshold (50). If N2 is less than the warning threshold (S11; YES), the RAID unit 6A is determined to be normal (step S12), and the processing returns to step S02.
If N2 is equal to or greater than the warning threshold (S11; NO), the detection unit 27 determines whether the media error count N2 of HDD-1 is at least 50 and less than 100 (step S13).
If N2 is within this range (S13; YES), the processing proceeds to step S14. If N2 is not within this range (S13; NO), the processing proceeds to step S15.
 ステップS14では、検出部27は、メディアエラー数N2を含む警告レコードを管理ログ22に書き込むことをRAID管理部21Aに指示する。これによって、RAID管理部21Aが、管理ログ22に警告レコードを書き込む。警告レコードは、情報処理装置10のユーザが後日利用することができる。その後、処理がステップS02に戻る。 In step S14, the detection unit 27 instructs the RAID management unit 21A to write a warning record including the media error number N2 in the management log 22. As a result, the RAID management unit 21A writes a warning record in the management log 22. The warning record can be used later by the user of the information processing apparatus 10. Thereafter, the process returns to step S02.
 処理がステップS15に進んだ場合には、検出部27は、メディアエラー数N2がスローダウン閾値(100回)以上であるので、HDD-1に係るRAIDのスローダウン現象が発生していると判定し、スローダウン現象の原因がHDD-1であると特定する。 When the process proceeds to step S15, the detection unit 27 determines that the RAID slowdown phenomenon related to HDD-1 has occurred because the media error number N2 is equal to or greater than the slowdown threshold (100 times). Then, it is determined that the cause of the slowdown phenomenon is HDD-1.
 すると、検出部27は、RAID管理部21Aに対し、HDD-1の切り離しを指示する(ステップS16)。RAID管理部21Aは、指示に従って、RAID部6AのHDD-1(HDD7B)をディスエーブル状態に遷移させる。これによって、HDD-1の切り離しが行われる。その後、処理がステップS02に戻る。なお、RAID部6Bに対する監視周期が到来した場合には、タイマ23Bが計数部24及び検出部27に当該到来を通知し、ステップS02以降の処理が行われる。 Then, the detection unit 27 instructs the RAID management unit 21A to disconnect the HDD-1 (step S16). The RAID management unit 21A changes the HDD-1 (HDD 7B) of the RAID unit 6A to the disabled state according to the instruction. As a result, the HDD-1 is disconnected. Thereafter, the process returns to step S02. If the monitoring period for the RAID unit 6B has arrived, the timer 23B notifies the counting unit 24 and the detection unit 27 of the arrival, and the processing from step S02 is performed.
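The three-way decision made for each HDD in steps S11 to S16 can be illustrated with a minimal Python sketch. The names `Action` and `classify` are hypothetical and do not appear in the specification; only the threshold values 50 and 100 and the three outcomes (normal, warning record, disconnection) are taken from the description above.

```python
from enum import Enum


class Action(Enum):
    """Outcome of the per-HDD check (hypothetical names)."""
    NORMAL = "normal"          # below the warning threshold (S12)
    WARN = "warn"              # write a warning record to the log (S14)
    DISCONNECT = "disconnect"  # slowdown detected, disable the HDD (S15, S16)


def classify(error_count, warning_threshold=50, slowdown_threshold=100):
    """Map a media error count to the action described in steps S11 to S16."""
    if error_count < warning_threshold:
        return Action.NORMAL
    if error_count < slowdown_threshold:
        return Action.WARN
    return Action.DISCONNECT
```

With the example thresholds, a count of 99 yields a warning record, while a count of 100 leads to disconnection of the HDD.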
According to the operation example above, the timers can be set using the monitoring period values registered in the threshold table 28. Consequently, by updating the values in the threshold table 28 through external input, the monitoring periods of the timers 23A and 23B can be changed the next time the power is turned on or the information processing apparatus is restarted. Likewise, by updating the check period registered in the threshold table 28 through external input, the check period (the range of records extracted from the management log 22) can be changed the next time the power is turned on or the information processing apparatus is restarted. Although the operation example above describes a case in which the check period is longer than the monitoring period, the lengths of the check period and the monitoring period can be set as appropriate.
Furthermore, if an HDD is disconnected when its media error count is, for example, equal to or greater than the warning threshold, a slowdown phenomenon caused by that HDD can be prevented before it occurs.
Conventionally, in the procedure leading to recovery from a slowdown phenomenon, a long time was needed to identify the cause of the slowdown, which had a large impact on users of the RAID device. In addition, recovery from the slowdown phenomenon required manual work.
According to the embodiment described above, when media errors occur frequently in one of the HDDs 7 constituting a RAID unit 6 and a slowdown phenomenon occurs, the detection unit 27 automatically detects the HDD 7 causing the slowdown, and the process of disconnecting that HDD 7 is executed automatically. This enables early recovery from the slowdown phenomenon.
As described above, a slowdown phenomenon can be determined to have occurred when the number of media errors during the check period is equal to or greater than the slowdown threshold. The number of media errors can be counted using the management log 22 created by the RAID management unit 21. The occurrence of a slowdown phenomenon can therefore be determined with a simple configuration, namely analysis of the management log 22.
The criterion for determining that a slowdown phenomenon has occurred depends on the type of RAID controller and on the RAID configuration, such as the RAID level and the number of HDDs constituting the RAID. In the present embodiment, as shown in the processing flow of FIG. 6, the warning threshold and the slowdown threshold can be set to different values for each RAID controller 3 (RAID unit 6). Accordingly, when a plurality of RAID controllers 3 are provided, as in the information processing apparatus 10 shown in FIG. 1, a warning threshold and a slowdown threshold suited to each RAID controller can be set, so that the slowdown determination and the HDD disconnection are performed appropriately. Note that it is not essential for the information processing apparatus to include a plurality of RAID units or a plurality of RAID controllers.
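As a rough illustration of a threshold table that is read once at start-up, the following Python sketch parses a per-RAID-unit table from JSON. The JSON layout, the field names, the unit keys `"6A"` and `"6B"`, and the period values are all assumptions made for this example; only the existence of a monitoring period, a check period, a warning threshold and a slowdown threshold in the threshold table 28 comes from the description.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdEntry:
    """One row of the threshold table (field names are hypothetical)."""
    monitoring_period_s: int   # interval at which the timer fires
    check_period_s: int        # window of log records to examine
    warning_threshold: int
    slowdown_threshold: int


def load_threshold_table(text):
    """Parse the table at start-up; edits take effect on the next restart."""
    table = {}
    for unit, fields in json.loads(text).items():
        entry = ThresholdEntry(**fields)
        if entry.warning_threshold > entry.slowdown_threshold:
            raise ValueError(f"{unit}: warning threshold exceeds slowdown threshold")
        table[unit] = entry
    return table


# Illustrative values only; the periods are not taken from the specification.
EXAMPLE_TABLE = """
{
  "6A": {"monitoring_period_s": 600, "check_period_s": 3600,
         "warning_threshold": 50, "slowdown_threshold": 100},
  "6B": {"monitoring_period_s": 300, "check_period_s": 1800,
         "warning_threshold": 30, "slowdown_threshold": 60}
}
"""
```

Because the table is keyed by RAID unit, each unit can carry its own periods and thresholds, and the timers would be armed from `monitoring_period_s` after loading.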
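Counting media errors per HDD from the management log within the check period might look like the following Python sketch. The record layout `(timestamp, hdd, event)` and the event label `"MEDIA_ERROR"` are assumptions for illustration; the actual format of the management log 22 is not given in this section.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_media_errors(records, now, check_period):
    """Count media errors per HDD among records inside the check period.

    records: iterable of (timestamp, hdd, event) tuples (hypothetical layout).
    now: end of the check period; check_period: a timedelta.
    """
    start = now - check_period
    counts = Counter()
    for timestamp, hdd, event in records:
        # Only media errors that fall inside the window are counted.
        if event == "MEDIA_ERROR" and start <= timestamp <= now:
            counts[hdd] += 1
    return counts
```

The resulting per-HDD counts are what would then be compared against the warning threshold and the slowdown threshold.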
Further, the RAID unit (RAID system) 6C, which implements a RAID level that cannot ensure redundancy of the stored data, is determined to be outside the scope of monitoring, while the RAID units 6A and 6B, which perform processing that includes mirroring, are determined to be monitoring targets. The monitoring range can thereby be limited appropriately.
In the embodiment described above, the length of the monitoring period is variable. A monitoring period that takes the characteristics of the RAID controller 3 into account can therefore be set. The length of the check period can also be changed, for example in consideration of the frequency of access to the RAID unit 6.
According to the embodiment described above, the time required to recover from a slowdown phenomenon of the RAID device is shortened, which enables stable operation of the RAID device. In addition, since the processing for recovery from the slowdown phenomenon is performed automatically, manual work can be omitted.
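Selecting monitoring targets by RAID level can be sketched as below. The string labels for RAID levels and the concrete levels assigned to units 6A, 6B and 6C in the usage example are assumptions; this section only states that 6C uses a level without redundancy and that 6A and 6B perform processing that includes mirroring.

```python
from typing import Dict, List

# Levels treated as "including mirroring" in this sketch (an assumption).
MIRRORING_LEVELS = {"RAID1", "RAID10", "RAID01"}


def select_monitoring_targets(units: Dict[str, str]) -> List[str]:
    """Return the names of RAID units whose level includes mirroring."""
    return [name for name, level in units.items() if level in MIRRORING_LEVELS]
```

For example, `select_monitoring_targets({"6A": "RAID1", "6B": "RAID10", "6C": "RAID0"})` keeps 6A and 6B and leaves 6C unmonitored.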
Description of Symbols
1 ... CPU
2 ... Main storage device
3 ... RAID controller
4 ... LAN interface
5 ... Input/output unit
6 ... RAID unit
7 ... Hard disk drive
8 ... DVD
9 ... USB memory
10 ... Information processing apparatus
21 ... RAID management unit
22 ... Management log
23A, 23B ... Timer
24 ... Counting unit
25, 26 ... Counter
27 ... Detection unit
28 ... Threshold table

Claims (5)

  1.  A RAID control device comprising:
     a counting unit that counts, for each recording medium, the number of media errors within a predetermined time for a plurality of recording media constituting a RAID; and
     a detection unit that detects a recording medium whose number of media errors within the predetermined time is equal to or greater than a threshold as a recording medium associated with a RAID slowdown phenomenon.
  2.  The RAID control device according to claim 1, further comprising a disconnection control unit that performs a process of automatically disconnecting the recording medium detected by the detection unit from the plurality of recording media constituting the RAID.
  3.  The RAID control device according to claim 1, wherein the threshold is provided for each recording medium.
  4.  The RAID control device according to claim 1, further comprising a determination unit that determines a RAID level of the RAID,
     wherein the counting unit and the detection unit operate when the RAID level of the RAID includes RAID 1, which performs mirroring.
  5.  A program that causes a computer to execute:
     a step of counting, for each recording medium, the number of data media errors within a predetermined time for a plurality of recording media constituting a RAID; and
     a step of detecting a recording medium whose number of media errors within the predetermined time is equal to or greater than a threshold as a recording medium associated with a RAID slowdown phenomenon.
PCT/JP2009/057291 2009-04-09 2009-04-09 Raid control device WO2010116514A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/057291 WO2010116514A1 (en) 2009-04-09 2009-04-09 Raid control device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/057291 WO2010116514A1 (en) 2009-04-09 2009-04-09 Raid control device

Publications (1)

Publication Number Publication Date
WO2010116514A1 true WO2010116514A1 (en) 2010-10-14

Family

ID=42935822

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/057291 WO2010116514A1 (en) 2009-04-09 2009-04-09 Raid control device

Country Status (1)

Country Link
WO (1) WO2010116514A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0467476A (en) * 1990-07-09 1992-03-03 Fujitsu Ltd Array disk controller
JPH07200191A (en) * 1994-01-10 1995-08-04 Fujitsu Ltd Disk array device
JPH10275060A (en) * 1997-03-31 1998-10-13 Nec Corp Array disk controller
JPH11282637A (en) * 1998-02-27 1999-10-15 Aiwa Co Ltd Method for reconstituting raid data storing system
JPH11345095A (en) * 1998-06-02 1999-12-14 Toshiba Corp Disk array device and control method therefor
JP2003140839A (en) * 2001-10-30 2003-05-16 Fujitsu Ltd Hard disk multiplex control device and hard disk multiplex control program
JP2006301714A (en) * 2005-04-15 2006-11-02 Toshiba Corp Array controller, information processor including this array controller, and disk array control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09843024

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09843024

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP