CN112162705A - RAID (redundant array of independent disks) disk group fault automatic offline and repair reporting method and system - Google Patents


Info

Publication number
CN112162705A
CN112162705A (application CN202011059284.3A)
Authority
CN
China
Prior art keywords
raid
group
offline
disk group
disk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011059284.3A
Other languages
Chinese (zh)
Other versions
CN112162705B (en)
Inventor
白淑贤
李国平
李源
邱春武
白成刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sina Technology China Co Ltd
Original Assignee
Sina Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sina Technology China Co Ltd filed Critical Sina Technology China Co Ltd
Priority to CN202011059284.3A priority Critical patent/CN112162705B/en
Priority claimed from CN202011059284.3A external-priority patent/CN112162705B/en
Publication of CN112162705A publication Critical patent/CN112162705A/en
Application granted granted Critical
Publication of CN112162705B publication Critical patent/CN112162705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0617Improving the reliability of storage systems in relation to availability
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0727Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653Monitoring storage devices or systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0689Disk arrays, e.g. RAID, JBOD

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An embodiment of the invention provides a method and system for automatic offline and repair reporting of RAID disk group faults. The method comprises: cyclically traversing the RAID disk groups on a server to obtain the read-write performance, state, and lifetime of each RAID disk group on the server; judging each RAID disk group according to its read-write performance, state, and lifetime, and removing and offlining RAID disk groups according to the judgment result; writing the log information of each removed and offlined RAID disk group into a local fault log; and sending a repair alarm to a repair interface according to the recorded local fault log. In the technical scheme of the invention, failed or suspected-failed RAID disk groups are judged automatically by monitoring the current usage state, lifetime, and read-write performance of each disk group; because detection runs once per set interval, faults are discovered in a timely manner and located accurately.

Description

RAID (redundant array of independent disks) disk group fault automatic offline and repair reporting method and system
Technical Field
The invention relates to the field of computers, and in particular to a method and system for automatic offline and repair reporting of RAID (redundant array of independent disks) disk group faults.
Background
A RAID disk group refers to a disk array: a large-capacity logical disk group composed of N independent disks. On the line, these disk groups mainly store the resources used by the CDN service and support normal operation of the online CDN service.
A RAID disk group fault refers to a situation where the disk hardware resource becomes abnormal in reading and writing, or even stops working, for some reason. If a server running on the line does not handle a failed disk group, or does not handle it in time, the service may be affected; therefore, to avoid affecting the service, failed disk groups should be taken offline as promptly as possible.
The existing offline and repair procedure for a failed RAID disk group is as follows:
Step 1: Exception scenario 1: monitor the state of the RAID disk groups and, if a failed disk group exists, send an alarm to the alarm system. Exception scenario 2: monitor the service indicators and, if a service indicator is abnormal, send an alarm to the alarm system.
Step 2: operation and maintenance personnel receive the alarm and judge whether the state of a RAID disk group is abnormal, or whether abnormal reading and writing of a RAID disk group has caused the service indicator to fluctuate.
Step 3: manually remove the failed RAID disk group.
Step 4: collect the detailed information of the failed RAID disk group and submit a repair application.
In the process of implementing the invention, the inventors found the following disadvantages in the prior art:
1. Fault discovery is not timely enough: detecting a fault through a manually received alarm is clearly delayed.
2. Fault localization is not accurate enough: monitoring only the status of a RAID disk group does not fully cover disk group anomalies. For example, the disk group status may be normal while the disk group is near the end of its warranted lifetime, which causes abnormal reading and writing; locating such a problem manually takes a lot of time.
3. Removal of failed RAID disk groups is not timely enough: the first two points mean that a failed RAID disk group is not removed at the first opportunity, which affects the CDN service.
4. Repair of failed RAID disk groups is not timely enough: the related abnormal log information must be collected manually and a repair application submitted. Because this step is manual, uncontrollable factors may cause it to be forgotten; the failed disk group is then not repaired in time, the number of disk groups on the server keeps shrinking, and the CDN service is indirectly affected.
Disclosure of Invention
The embodiment of the invention provides a method and system for automatic offline and repair reporting of RAID (redundant array of independent disks) disk group faults, which judge whether a RAID disk group has failed by combining multiple indicators. The program automatically identifies failed or suspected-failed RAID disk groups by monitoring the current usage state, lifetime, and read-write performance of each disk group; detection runs once per set interval, so faults are discovered in a timely manner and located accurately.
To achieve the above object, in one aspect, an embodiment of the present invention provides a method for automatic offline and repair reporting of RAID disk group faults, the method comprising:
cyclically traversing the RAID disk groups on the server to obtain the read-write performance, state, and lifetime of each RAID disk group on the server;
judging each RAID disk group according to its read-write performance, state, and lifetime, and removing and offlining RAID disk groups according to the judgment result;
writing the log information of each removed and offlined RAID disk group into a local fault log;
and sending a repair alarm to a repair interface according to the recorded local fault log.
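The four steps above can be sketched as a single monitoring pass in Python. Every name in this sketch is a hypothetical placeholder: the patent specifies steps, not an API.

```python
# Minimal sketch of the claimed cycle; all function and field names are
# illustrative assumptions, not part of the patent.

def monitoring_pass(raid_groups, get_metrics, is_faulty, offline, fault_log):
    """Traverse all groups, judge each, offline the faulty ones, and log them."""
    for group in raid_groups:
        io_wait_ms, state, lifetime = get_metrics(group)
        if is_faulty(io_wait_ms, state, lifetime):
            offline(group)                      # remove and take offline
            fault_log.append({"group": group, "io_wait_ms": io_wait_ms,
                              "state": state, "lifetime": lifetime})
    return fault_log

# Toy run: "md1" exceeds an assumed 500 ms I/O-wait threshold and is logged.
log = monitoring_pass(
    raid_groups=["md0", "md1"],
    get_metrics=lambda g: (1000, "optimal", 0.5) if g == "md1"
                          else (80, "optimal", 0.2),
    is_faulty=lambda io_wait_ms, state, life: io_wait_ms > 500,
    offline=lambda g: None,
    fault_log=[],
)
```

The pass is deliberately stateless between iterations; the fault log is the only output the later repair-reporting step consumes.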
In another aspect, an embodiment of the present invention provides a system for automatic offline and repair reporting of RAID disk group faults, the system comprising:
an information acquisition module, configured to cyclically traverse the RAID disk groups on the server and obtain the read-write performance, state, and lifetime of each RAID disk group on the server;
a fault judgment module, configured to judge each RAID disk group according to its read-write performance, state, and lifetime;
a removal module, configured to remove and offline RAID disk groups according to the judgment result, and to write the log information of each removed and offlined RAID disk group into a local fault log;
and a repair alarm module, configured to send a repair alarm to a repair interface according to the recorded local fault log.
The above technical scheme has the following beneficial effects:
In the technical scheme of the invention, fault localization is accurate and effective, so CDN service anomalies caused by RAID disk group faults are avoided and the possibility of service anomalies is removed at the source; failed RAID disk groups are removed in time, being taken offline promptly once discovered, so the online service is not affected; and failed RAID disk groups are reported for repair in time, with the repair application submitted promptly after the failed group is taken offline from the server, greatly reducing the uncontrollable factors of manual processing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flowchart of a method for automatic offline and repair reporting of RAID disk group faults according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a system for automatic offline and repair reporting of RAID disk group faults according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for automatic offline and repair reporting of RAID disk group faults according to an embodiment of the present invention. The method comprises:
Cyclically traversing the RAID disk groups on the server to obtain the read-write performance, state, and lifetime of each RAID disk group on the server, and recording the total number of RAID disk groups serving online.
Judging each RAID disk group according to its read-write performance, state, and lifetime, and removing and offlining RAID disk groups according to the judgment result. Specifically: according to the read-write performance of each RAID disk group, a group whose read-write performance is abnormal is judged to be faulty, and is removed and taken offline.
If the read-write performance of a RAID disk group is not abnormal, the state of the RAID disk group is judged. If the state is abnormal, it is further judged whether the number of RAID disk groups that would remain serving online after this group is offlined is greater than a set threshold; if so, the RAID disk group is removed and taken offline. If the number of RAID disk groups remaining in online service would not exceed the set threshold, alarm information requesting replacement of the RAID disk group is sent directly.
If the state of the RAID disk group is not abnormal, it is judged whether the lifetime of the RAID disk group has expired. If it has, it is further judged whether the number of RAID disk groups that would remain serving online after this group is offlined is greater than the set threshold; if so, the RAID disk group is removed and taken offline. If not, alarm information requesting replacement of the RAID disk group is sent directly.
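The three-branch judgment just described can be condensed into one function. The threshold value and the return labels are illustrative assumptions; the patent only requires that the minimum-online threshold be settable.

```python
# Sketch of the judgment order: read-write first, then state, then lifetime.
# SET_THRESHOLD and the string labels are assumptions for illustration.

SET_THRESHOLD = 2  # minimum RAID disk groups that must keep serving online

def judge_raid_group(rw_abnormal, state_abnormal, life_expired, online_count):
    """Return 'remove' (cull and offline), 'replace-alarm', or 'ok'."""
    if rw_abnormal:                          # highest severity: always remove
        return "remove"
    if state_abnormal or life_expired:       # remove only if capacity allows
        if online_count - 1 > SET_THRESHOLD:
            return "remove"
        return "replace-alarm"               # too few groups would remain
    return "ok"
```

The capacity check uses `online_count - 1` because the patent compares the threshold against the number of groups remaining *after* the candidate is taken offline.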
The log information of each removed and offlined RAID disk group is written into a local fault log.
A repair alarm is sent to the repair interface according to the local fault log. Specifically: the local fault log is checked at a fixed interval, and if newly added log information exists, a repair alarm is sent to the repair interface according to the newly added log information.
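The timed fault-log check can be sketched as a cursor over the log: only entries appended since the previous pass trigger a repair call. `submit_repair` is a stand-in for the unspecified repair interface.

```python
# Sketch of the periodic fault-log scan; only newly added records are reported.

def scan_fault_log(entries, cursor, submit_repair):
    """Report entries added since `cursor`; return the new cursor position."""
    for entry in entries[cursor:]:
        submit_repair(entry)        # one repair alarm per new log record
    return len(entries)

reported = []
fault_log = [{"group": "md0", "reason": "io"}]
cursor = scan_fault_log(fault_log, 0, reported.append)       # reports md0
fault_log.append({"group": "md2", "reason": "state"})
cursor = scan_fault_log(fault_log, cursor, reported.append)  # reports only md2
```

Tracking a cursor rather than re-reading the whole log keeps each check idempotent: an unchanged log produces no duplicate repair alarms.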
Corresponding to the above method, fig. 2 is a schematic structural diagram of a system for automatic offline and repair reporting of RAID disk group faults according to an embodiment of the present invention. The system comprises:
an information acquisition module 11, configured to cyclically traverse the RAID disk groups on the server and obtain the read-write performance, state, and lifetime of each RAID disk group on the server;
a fault judgment module 12, configured to judge each RAID disk group according to its read-write performance, state, and lifetime;
a removal module 13, configured to remove and offline RAID disk groups according to the judgment result, and to write the log information of each removed and offlined RAID disk group into a local fault log;
and a repair alarm module 14, configured to send a repair alarm to the repair interface according to the recorded local fault log.
Preferably, the information acquisition module 11 is further configured to record the total number of RAID disk groups serving online while cyclically traversing the RAID disk groups on the server.
Preferably, the fault judgment module 12 is specifically configured to:
judge, according to the read-write performance of each RAID disk group, that a group with abnormal read-write performance needs to be removed and taken offline;
if the read-write performance of a RAID disk group is not abnormal, judge whether the state of the RAID disk group is abnormal; if the state is abnormal, further judge whether the number of RAID disk groups that would remain serving online after this group is offlined is greater than a set threshold, and if so, judge that the RAID disk group is to be removed and taken offline;
if the state of the RAID disk group is not abnormal, judge whether the lifetime of the RAID disk group has expired; if it has, further judge whether the number of RAID disk groups that would remain serving online is greater than the set threshold, and if so, judge that the RAID disk group is to be removed and taken offline.
Preferably, the fault judgment module 12 is further configured to: for a RAID disk group with an abnormal state or an expired lifetime, if the number of RAID disk groups that would remain serving online after the group is offlined is not greater than the set threshold, directly send alarm information requesting replacement of the RAID disk group.
Preferably, the repair alarm module 14 is specifically configured to:
check the local fault log at a fixed interval and, if newly added log information exists, send a repair alarm to the repair interface according to the newly added log information.
A specific application example of the invention is as follows:
The fault judgment module determines the fault location by monitoring the indicators of the online RAID disk groups, and either alarms on a failed RAID disk group directly or writes it into the local fault log for the automatic removal module to use; it is therefore the basis for automatically removing failed disk groups.
Detailed workflow of the fault judgment module: the module's program runs once per minute. After starting, it obtains the list of all RAID disk groups serving online, loops over the list, and judges three indicators one by one in order of fault severity: disk group reading and writing, disk group state, and disk group lifetime.
For example, suppose the first disk group reads and writes slowly: the average wait time of each I/O operation on the server is 1000 ms, while the threshold is 500 ms (settable according to the service). Because the threshold is exceeded, the disk group is added directly to the removal list, removal and offlining are executed, and the failed disk group's information is written into the local fault log. This indicator has the highest severity: if it is not handled in time, the CDN service on this disk group is affected.
If the first disk group's reading and writing are not abnormal, the program judges whether the disk group's state is failed. A failed state indicates the disk group may be damaged (in some cases the cause is a dropped disk, which can come back online). The program then judges the number of disk groups that would remain after subtracting this one from the disk groups on the current server; in this example 3, which is greater than the threshold 2 (settable according to the service; based on online operating experience, 2 disk groups is the minimum, capacity not considered), so the disk group is added to the removal list, removal and offlining are executed, and the failure information is written into the local fault log. If the remaining number were below the threshold, an alarm would be sent directly to remind operation and maintenance personnel to replace the disk group in time. This indicator has medium severity: if it is not handled in time the service is generally not affected, but service quality is indirectly degraded, because fewer disk groups mean less storage; if storage is not replenished in time, user experience and service quality suffer indirectly.
If the first disk group's state is not abnormal, the program continues by judging the disk group's lifetime value against the life-expiration threshold of 1. If the lifetime has expired, it judges the number of disk groups that would remain after subtracting this one from the disk groups serving online on the current server; if that number is greater than 2 (settable according to the service), the disk group is added to the removal list, removal and offlining are executed, and the failure information is written into the local fault log; if not, an alarm is sent directly to remind operation and maintenance personnel to replace the disk group in time. This indicator has the lowest severity: a disk group past its lifetime can usually still run for some time, so the service is generally unaffected, but as above, fewer disk groups mean less storage, which indirectly affects user experience and service quality.
When all three indicators have been judged without abnormality, the first disk group is done, and the remaining 4 disk groups are judged in the same way by analogy. The disks online come from manufacturers such as Intel, Samsung, and Micron; the program uses a disk's SN number to determine its manufacturer and then issues different program instructions per manufacturer to obtain the corresponding indicator parameters.
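The worked example above can be condensed into a single classification function using the example values from the text (500 ms I/O-wait threshold, minimum of 2 online groups, life-expiration threshold of 1). The SN-prefix vendor lookup is a hypothetical stand-in for the per-manufacturer probing the text mentions; real SN formats differ.

```python
# Sketch using the example thresholds from the description; the vendor
# prefixes are invented for illustration and do not reflect real SN formats.

IO_WAIT_THRESHOLD_MS = 500   # "the threshold is 500 ms" in the example
MIN_ONLINE_GROUPS = 2        # "2 disk groups is the minimum"
LIFE_EXPIRED_AT = 1          # "the life expiration threshold is 1"

def classify_group(io_wait_ms, state_failed, life_value, online_count):
    if io_wait_ms > IO_WAIT_THRESHOLD_MS:
        return "remove"                      # slow I/O already hurts service
    if state_failed or life_value >= LIFE_EXPIRED_AT:
        if online_count - 1 > MIN_ONLINE_GROUPS:
            return "remove"
        return "replace-alarm"               # would fall below the minimum
    return "ok"

def vendor_from_sn(sn):
    # Hypothetical prefix map standing in for per-vendor probing commands.
    prefixes = {"INT": "Intel", "SAM": "Samsung", "MIC": "Micron"}
    return prefixes.get(sn[:3].upper(), "unknown")
```

With these values, the example's slow group (1000 ms average wait) is removed unconditionally, while a state-failed group on a 4-group server is removed only because 3 groups remain, which exceeds the minimum of 2.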
The fault judgment module of the invention has the following characteristics:
Reliable judgment indicators: the indicators are richer; the disk group state, disk group lifetime, and disk group read-write condition are all monitored, essentially covering all fault conditions, so judgment and localization are more accurate.
Controllable monitoring frequency: each monitoring pass is completed within 1 minute, ensuring rapid discovery of failed disk groups.
Type compatibility: the module is compatible with RAID disk group types from mainstream manufacturers such as Intel, Samsung, and Micron.
Fault removal combined with service: disk groups meeting the judgment criteria are not all removed unconditionally, so that as many disk groups as possible stay online to provide service. If a disk group's reading and writing are abnormal, it is removed directly. Otherwise, if the number of non-failed disk groups is greater than or equal to the threshold, all failed disk groups matching the indicator judgment are removed; if it is smaller than the threshold, disk groups are removed in order of their impact on the online service: first those with abnormal states, then those with expired lifetimes. For example: a disk group read-write anomaly already affects the service and must be removed; an abnormal disk group state may be a dropped disk, which can come back online and be used again, so its removal is optional; a lifetime-expired disk group has passed its warranty but not its end of life and may still be usable for some time, so its removal is also optional.
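The severity-ordered removal policy can be sketched as a two-pass selection; the field names and the exact pass order are an illustrative reading of the text, not a specified algorithm.

```python
# Sketch of service-aware culling: read-write faults are always removed;
# state faults, then expired-life groups, are removed only while enough
# groups remain online. All field names are assumptions.

MIN_GROUPS = 2

def select_removals(groups):
    """groups: dicts with 'name', 'rw_bad', 'state_bad', 'life_expired'."""
    removals = [g["name"] for g in groups if g["rw_bad"]]    # mandatory
    remaining = [g for g in groups if not g["rw_bad"]]
    # Optional culls in severity order: state anomalies, then expired life.
    for key in ("state_bad", "life_expired"):
        for g in list(remaining):
            if g[key] and len(remaining) - 1 >= MIN_GROUPS:
                removals.append(g["name"])
                remaining.remove(g)
    return removals

groups = [
    {"name": "a", "rw_bad": True,  "state_bad": False, "life_expired": False},
    {"name": "b", "rw_bad": False, "state_bad": True,  "life_expired": False},
    {"name": "c", "rw_bad": False, "state_bad": False, "life_expired": True},
    {"name": "d", "rw_bad": False, "state_bad": False, "life_expired": False},
    {"name": "e", "rw_bad": False, "state_bad": False, "life_expired": False},
]
```

On the five-group example, "a" is removed for read-write faults, then "b" and "c" are removed in severity order because two healthy groups still remain; with only three groups present, only "a" would go.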
The removal module of the invention executes the removal of a failed disk group upon the removal signal sent by the fault judgment module. This step does not affect the RAID disk groups serving online, and the failed group is taken offline smoothly; detailed RAID disk group information is provided to the automatic repair module. The automatic removal module has the following characteristics:
Data retention: the details of failed RAID disk groups are saved locally for the automatic repair module to use, and the local data is retained for one week, facilitating subsequent troubleshooting of related problems.
High availability: disk groups judged faulty are removed and automatically taken offline without affecting the online service provided by the non-failed disk groups, realizing highly available RAID disk group service.
The automatic repair alarm module calls the interface of the repair system by reading the local fault log, thereby achieving automatic repair reporting and alarming. The automatic repair alarm module has the following characteristics:
Data retention: the server details and the details of the failed RAID group are combined in the format required by the interface and submitted to the repair system; the data is also stored locally, facilitating subsequent troubleshooting of related problems.
Detailed information summary: if an abnormal RAID disk group is detected, the program's run result is sent to operation and maintenance personnel by email, reporting which RAID disk groups on which servers have been taken offline, and which lifetime-expired or state-abnormal groups were not removed because the number of disk groups on the server is limited; for the latter, a repair application should be submitted as soon as possible. In all these cases a detailed information summary is obtained without the complicated operation of logging in to each server.
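Assembling a repair report from server details and a fault-log entry can be sketched as below. The payload fields and the JSON format are assumptions; the patent only says the details are combined "in the format required by the interface".

```python
# Sketch of building one repair-report payload; field names are illustrative.

import json

def build_repair_report(hostname, entry):
    report = {
        "server": hostname,
        "raid_group": entry["group"],
        "fault_reason": entry["reason"],
        "detail": entry,
    }
    return json.dumps(report, sort_keys=True)  # submitted and also kept locally

payload = build_repair_report("cdn-node-01",
                              {"group": "md1", "reason": "io_wait"})
```

Serializing to a stable, sorted form makes the locally retained copy directly comparable with what was submitted, which supports the troubleshooting use the text describes.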
Compared with the prior art, the application has the following technical advantages:
1. High availability of the RAID disk groups is realized, and an abnormal disk group can be rapidly taken offline.
2. Offlining and repair reporting of failed RAID disk groups are fully automated; no manual operation, maintenance, or intervention is needed, greatly reducing operation and maintenance costs.
3. The system is compatible with disk groups of various models and RAID cards of various models; when a new model is added, no secondary development is needed.
4. Automatic offlining and automatic repair reporting of failed RAID disk groups are separated, achieving low coupling between program modules.
5. The fault judgment, automatic offline, and automatic repair-reporting modules additionally log and alarm on execution failures, so system anomalies are discovered immediately.
6. The information the system obtains is relatively fixed in content, so the development cost is low.
7. Single-machine deployment: the program only needs to be deployed on each server, and the programs on different servers work independently without affecting one another.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An automatic offline repair reporting method for RAID disk group faults, characterized by comprising the following steps:
cyclically traversing the RAID disk groups on a server to acquire the read-write performance, status, and service life of each RAID disk group on the server;
judging each RAID disk group according to its read-write performance, status, and service life, and removing and taking offline RAID disk groups according to the judgment result;
writing log information of the removed offline RAID disk groups into a local fault log;
sending a repair alarm to a repair-reporting interface according to the recorded local fault log.
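The four steps of claim 1 can be sketched as a single monitoring cycle. This is a minimal illustration only: the `RaidGroup` fields and all function names (`repair_cycle`, `send_repair_alarm`, etc.) are assumptions for the sketch, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class RaidGroup:
    name: str
    read_write_ok: bool   # read-write performance normal?
    status_ok: bool       # RAID group status normal?
    expired: bool         # service life expired?

def repair_cycle(groups, fault_log, send_repair_alarm):
    """One pass over all RAID disk groups on a server (claim 1)."""
    removed = []
    # Steps 1-2: traverse every group and judge it on read-write
    # performance, status, and service life; mark abnormal groups
    # for removal offline.
    for g in groups:
        if not (g.read_write_ok and g.status_ok and not g.expired):
            removed.append(g)
    # Steps 3-4: write log information of each removed group into the
    # local fault log, and send a repair alarm for the new entries.
    for g in removed:
        entry = f"offline: {g.name}"
        fault_log.append(entry)
        send_repair_alarm(entry)
    return removed
```

In practice the judgment step would consult RAID controller or SMART data rather than precomputed booleans; the flags here simply stand in for those checks.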
2. The method of claim 1, further comprising: recording the total number of RAID disk groups in service online while cyclically traversing the RAID disk groups on the server.
3. The automatic offline repair reporting method for RAID disk group faults according to claim 2, wherein judging each RAID disk group according to its read-write performance, status, and service life and removing groups according to the judgment result comprises:
judging, according to the read-write performance of a RAID disk group, a RAID disk group with abnormal read-write performance as one to be removed and taken offline;
if the read-write performance of the RAID disk group is not abnormal, judging whether the status of the RAID disk group is abnormal; if the status is abnormal, further judging whether the number of RAID disk groups remaining in service online after the RAID disk group is taken offline is greater than a set threshold, and if so, removing the RAID disk group and taking it offline;
if the status of the RAID disk group is not abnormal, judging whether the service life of the RAID disk group has expired; if so, further judging whether the number of RAID disk groups remaining in service online after the RAID disk group is taken offline is greater than the set threshold, and if so, removing the RAID disk group and taking it offline.
4. The method of claim 3, further comprising:
for a RAID disk group with an abnormal status or an expired service life, if the number of RAID disk groups remaining in service online after taking the group offline is judged not to be greater than the set threshold, directly sending alarm information requesting replacement of the RAID disk group.
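The decision cascade of claims 3–4 can be captured in one function. A hedged sketch: the return labels, parameter names, and the specific threshold comparison are illustrative assumptions, not fixed by the patent.

```python
def judge_group(perf_abnormal, state_abnormal, life_expired,
                online_count, threshold):
    """Return 'remove' to take the group offline, 'replace-alarm' to
    request replacement (claim 4), or 'keep' to leave it in service."""
    # Claim 3, first branch: abnormal read-write performance is removed
    # offline unconditionally.
    if perf_abnormal:
        return "remove"
    # Second and third branches: abnormal status or expired service life
    # is removed only if enough groups would remain in service online.
    if state_abnormal or life_expired:
        if online_count - 1 > threshold:   # groups left after offlining
            return "remove"
        # Claim 4: capacity too low to offline safely, so alert for
        # replacement instead.
        return "replace-alarm"
    return "keep"
```

The ordering matters: performance is checked first, status second, service life last, so a group failing an earlier check never reaches the later ones.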
5. The method of claim 4, wherein sending a repair alarm to the repair-reporting interface according to the recorded local fault log comprises:
periodically checking the local fault log, and if newly added log information exists, sending a repair alarm to the repair-reporting interface according to the newly added log information.
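The timed check in claim 5 amounts to tracking how far into the fault log the last scan reached and alarming only on entries added since. A minimal sketch, with all names assumed for illustration:

```python
def check_new_entries(fault_log, last_seen, send_repair_alarm):
    """Invoked on a timer; sends a repair alarm for each log entry
    added since index `last_seen`, then returns the new position."""
    for entry in fault_log[last_seen:]:
        send_repair_alarm(entry)
    return len(fault_log)
```

A real deployment would persist `last_seen` (or a byte offset into the log file) between runs so restarts do not re-alarm old entries.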
6. An automatic offline repair reporting system for RAID disk group faults, characterized by comprising:
an information acquisition module, configured to cyclically traverse the RAID disk groups on a server and acquire the read-write performance, status, and service life of each RAID disk group on the server;
a fault determination module, configured to judge each RAID disk group according to its read-write performance, status, and service life;
a removal module, configured to remove RAID disk groups according to the judgment result and write log information of the removed offline RAID disk groups into a local fault log;
a repair-reporting alarm module, configured to send a repair alarm to a repair-reporting interface according to the recorded local fault log.
7. The automatic offline repair reporting system for RAID disk group faults according to claim 6, wherein the information acquisition module is further configured to record the total number of RAID disk groups in service online while cyclically traversing the RAID disk groups on the server.
8. The automatic offline repair reporting system for RAID disk group faults according to claim 7, wherein the fault determination module is specifically configured to:
judge, according to the read-write performance of a RAID disk group, a RAID disk group with abnormal read-write performance as needing to be removed and taken offline;
if the read-write performance of the RAID disk group is not abnormal, judge whether the status of the RAID disk group is abnormal; if the status is abnormal, further judge whether the number of RAID disk groups remaining in service online after the RAID disk group is taken offline is greater than a set threshold, and if so, judge that the RAID disk group is to be removed and taken offline;
if the status of the RAID disk group is not abnormal, judge whether the service life of the RAID disk group has expired; if so, further judge whether the number of RAID disk groups remaining in service online after the RAID disk group is taken offline is greater than the set threshold, and if so, judge that the RAID disk group is to be removed and taken offline.
9. The automatic offline repair reporting system for RAID disk group faults according to claim 8, wherein the fault determination module is further specifically configured to:
for a RAID disk group with an abnormal status or an expired service life, if the number of RAID disk groups remaining in service online after taking the group offline is judged not to be greater than the set threshold, directly send alarm information requesting replacement of the RAID disk group.
10. The automatic offline repair reporting system for RAID disk group faults according to claim 9, wherein the repair-reporting alarm module is specifically configured to:
periodically check the local fault log, and if newly added log information exists, send a repair alarm to the repair-reporting interface according to the newly added log information.
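The four modules of claims 6–10 mirror the method steps and can be composed as in this sketch; the class name, constructor signature, and module callables are all assumptions for illustration, not structures named in the patent.

```python
class RepairReportingSystem:
    """Wires the four claimed modules into one monitoring pass."""

    def __init__(self, acquire, judge, remove, alarm):
        self.acquire = acquire   # information acquisition module
        self.judge = judge       # fault determination module
        self.remove = remove     # removal module (also writes the fault log)
        self.alarm = alarm       # repair-reporting alarm module

    def run_once(self, server):
        infos = self.acquire(server)                 # claim 6, module 1
        verdicts = [self.judge(i) for i in infos]    # module 2
        fault_log = self.remove(infos, verdicts)     # module 3
        self.alarm(fault_log)                        # module 4
        return fault_log
```

Splitting the pipeline this way lets each module be tested in isolation and swapped (e.g. a different acquisition backend) without touching the others.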
CN202011059284.3A 2020-09-30 Automatic offline repairing method and system for RAID disk group faults Active CN112162705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011059284.3A CN112162705B (en) 2020-09-30 Automatic offline repairing method and system for RAID disk group faults


Publications (2)

Publication Number Publication Date
CN112162705A true CN112162705A (en) 2021-01-01
CN112162705B CN112162705B (en) 2024-07-16



Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6934749B1 (en) * 2000-05-20 2005-08-23 Ciena Corporation Tracking distributed data retrieval in a network device
CN103207820A (en) * 2013-02-05 2013-07-17 北京百度网讯科技有限公司 Method and device for fault positioning of hard disk on basis of raid card log
CN105183609A (en) * 2015-09-16 2015-12-23 焦点科技股份有限公司 Real-time monitoring system and method applied to software system
CN106713007A (en) * 2016-11-15 2017-05-24 郑州云海信息技术有限公司 Alarm monitoring system and alarm monitoring method and device for server
CN109308238A (en) * 2018-12-03 2019-02-05 郑州云海信息技术有限公司 A kind of method, device and equipment that storage system disk array low-quality disk is adjusted
CN110489299A (en) * 2019-07-26 2019-11-22 广东睿江云计算股份有限公司 A kind of method and its system optimizing hard disk service life


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114245242A (en) * 2021-12-23 2022-03-25 海南神州泰岳软件有限公司 User offline detection method and device and electronic equipment
CN114245242B (en) * 2021-12-23 2023-10-27 海南神州泰岳软件有限公司 User offline detection method and device and electronic equipment

Similar Documents

Publication Publication Date Title
US7607043B2 (en) Analysis of mutually exclusive conflicts among redundant devices
CN111600740A (en) Remote operation and maintenance management system and method
US8035911B2 (en) Cartridge drive diagnostic tools
CN110650036A (en) Alarm processing method and device and electronic equipment
CN105607973B (en) Method, device and system for processing equipment fault in virtual machine system
CN110888763A (en) Disk fault diagnosis method and device, terminal equipment and computer storage medium
CN111796959A (en) Host machine container self-healing method, device and system
WO2017220013A1 (en) Service processing method and apparatus, and storage medium
CN103049345B (en) Based on Disk State transition detection method and the device of asynchronous mechanism
CN115718450A (en) Equipment wire-stopping monitoring method and device, electronic equipment and system
CN109408343B (en) Hard disk indicator lamp control method and device and related equipment
CN114003417B (en) Method, device and storage medium for realizing automatic fault transfer of RAID card
CN112162705B (en) Automatic offline repairing method and system for RAID disk group faults
CN112162705A (en) RAID (redundant array of independent disk) set fault automatic offline repair reporting method and system
CN103067101B (en) Communication terminal testing and monitoring method and device
CN109460311A (en) The management method and device of firmware abnormality
CN105025179A (en) Method and system for monitoring service agents of call center
CN112363860A (en) Batch processing operation abnormal interruption detection method and device
CN116501705A (en) RAS-based memory information collecting and analyzing method, system, equipment and medium
JP7057168B2 (en) Failure detection device and failure analysis method
JP2001154929A (en) Management method and system for substituting path system
CN114629786A (en) Log real-time analysis method, device, storage medium and system
CN111984844B (en) Automatic map filling method and system based on big data
CN111831511A (en) Detection processing method, device and medium for service host of cloud service
CN111857573A (en) Intelligent replacement method and system based on RAID (redundant array of independent disks) fault member disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230418

Address after: Room 501-502, 5/F, Sina Headquarters Scientific Research Building, Block N-1 and N-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Applicant after: Sina Technology (China) Co.,Ltd.

Address before: 100193 7th floor, scientific research building, Sina headquarters, plot n-1, n-2, Zhongguancun Software Park, Dongbei Wangxi Road, Haidian District, Beijing, 100193

Applicant before: Sina.com Technology (China) Co.,Ltd.

GR01 Patent grant