CN108446734A

CN108446734A - Disk failure automatic prediction method based on artificial intelligence

Info

Publication number: CN108446734A
Application number: CN201810228937.2A
Authority: CN
Inventors: 李新明; 刘斌
Original assignee: Zhongke Edge Intelligence Information Technology (suzhou) Co Ltd
Current assignee: Zhongke Edge Intelligence Information Technology (suzhou) Co Ltd
Priority date: 2018-03-20
Filing date: 2018-03-20
Publication date: 2018-08-24

Abstract

The present invention relates to an automatic disk failure prediction method based on artificial intelligence, comprising: collecting status data from a number of groups of disks as training data and training on it with a machine learning algorithm to generate a disk failure identification and classification system, which calculates a disk's early-warning failure time from the disk's status data; and collecting some or all of a disk's current status data at a first set time period, feeding it into the disk failure identification and classification system to obtain the disk's early-warning failure time at the current moment, and handling the disk according to that early-warning failure time using a preset alarm rule. The invention uses artificial intelligence technology to predict disk failures from disk status data collected via S.M.A.R.T., so that failing disks are handled in time and storage system reliability is improved.

Description

Disk failure automatic prediction method based on artificial intelligence

Technical field

The present invention relates to the field of artificial intelligence, and in particular to an automatic disk failure prediction method based on artificial intelligence.

Background art

The storage system is responsible for persisting data and is one of the main components of an information system; its reliability is key to the normal operation of the information system. Although technologies such as solid-state storage and biological storage have developed rapidly in recent years, the disk remains the core component of storage systems to date, and disk reliability directly affects storage system reliability. A disk is an electromechanical assembly of platters, heads, a motor and other parts, and this design inherently limits its reliability. In large data centers the number of disk devices typically reaches the hundreds of thousands or millions. Even if process improvements keep the failure rate of an individual disk low, the installed base is so large that disk failures still occur in large numbers, and they account for far more failure records than other components. If disk failures could be predicted in advance, operations and maintenance would be greatly simplified and the losses caused by disk failures greatly reduced.

To make storage systems highly available, RAID (Redundant Arrays of Inexpensive Disks) or distributed storage technologies such as HDFS (Hadoop Distributed File System) and MFS (MooseFS) are usually used; they improve the system's tolerance of disk failures through data redundancy and thereby improve storage system reliability. These passive fault-tolerance techniques, however, do not reduce the failure rate of the physical disks themselves; on the contrary, data redundancy consumes more disks and raises operating costs. Whatever method is used, a disk's service life is limited and failure is inevitable. From the perspective of operating cost, the first problem to solve is storage system reliability. The passive fault-tolerance approach described above solves it only partially (it improves fault tolerance but cannot eliminate the risk of data loss when hardware is damaged), and the higher the reliability, the more hardware resources are consumed. The second problem is reducing hardware maintenance cost. Accurately estimating disk failures allows disk inventory and routine maintenance to be planned sensibly, which is very important for data centers to reduce costs and improve service stability.

To support disk failure prediction, the various state parameters of a disk must be collected and its running state assessed comprehensively. Most disks today implement S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), which monitors multiple parameters during disk operation, including seek errors and parity errors. SMART can raise an alarm on a single indicator by comparing it with a set threshold. This approach is simple to apply, but its warning accuracy is low: in practice, a single S.M.A.R.T. attribute value, or a simple combination of them, cannot predict disk failure accurately.

Summary of the invention

The object of the present invention is to provide an automatic disk failure prediction method based on artificial intelligence that uses artificial intelligence technology to predict disk failures from disk status data collected via S.M.A.R.T., so that failing disks are handled in time and storage system reliability is improved.

To achieve the above object, the present invention provides the following technical solution:

An automatic disk failure prediction method based on artificial intelligence, comprising:

collecting status data from a number of groups of disks as training data and training on it with a machine learning algorithm to generate a disk failure identification and classification system, the disk failure identification and classification system being used to calculate a disk's early-warning failure time from the disk's status data;

collecting some or all of a disk's current status data at a first set time period, feeding it into the disk failure identification and classification system to obtain the disk's early-warning failure time at the current moment, and handling the disk according to that early-warning failure time using a preset alarm rule.

In a further embodiment, the method further comprises:

collecting the status data of the disk using S.M.A.R.T.

In a further embodiment, the status data includes at least one of: raw read error rate, spin-up time (the time the motor takes to reach rated speed), reallocated sector count, seek error rate, power-on hours, uncorrectable error count, head fly-height writes, disk temperature, hardware ECC recovered, and pending (awaiting reallocation) sector count.

In a further embodiment, the method of generating the disk failure identification and classification system comprises:

obtaining status data from a number of groups of disks, each status data item being provided with a corresponding state threshold, and quantifying the status data and state thresholds;

training on the quantified status data and state thresholds with the SVM algorithm to obtain a separating hyperplane for the disk state.

In a further embodiment, handling the disk according to its early-warning failure time using the preset alarm rule means:

in response to the early-warning failure time of any disk being less than or equal to a first set time threshold, collecting status data from that disk at a second set time period and thereby obtaining its early-warning failure time, the second set time period being shorter than the first set time period;

in response to the early-warning failure time of any disk being less than or equal to a second set time threshold, issuing a failure alarm, the second set time threshold being less than the first set time threshold.

In a further embodiment, the method by which the disk failure identification and classification system obtains a disk's early-warning failure time comprises:

setting a failure-time precision;

obtaining the disk's status data, dividing the calendar time from the current moment onward into a number of periods of the length of the failure-time precision, determining for each period in chronological order whether the disk will fail within it, and outputting the time range of the period in which a failure is determined to occur as the early-warning failure time.

In a further embodiment, the training data includes the disk's status data at the current moment, the change in and variance of the disk state over the preceding span of one failure-time precision, and the disk's average I/O load over that same span.

In a further embodiment, the failure-time precision is 5 days.

In a further embodiment, the method further comprises:

training on the training data with the logistic regression algorithm to generate the disk failure identification and classification system;

in response to the probability that a disk fails within any period exceeding a set failure probability threshold, outputting that period as the disk's early-warning failure time.

The beneficial effects of the present invention are:

1) The disk failure detection rate is improved through comprehensive analysis of multiple conditions.

2) Not only can the current state be used to judge whether a disk has failed, but the disk state and load can also be used to judge the disk's failure trend.

3) A means of balancing the disk failure detection rate (FDR) and the false alarm rate (FAR) is provided.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, preferred embodiments of the invention are described in detail below with reference to the accompanying drawings.

Description of the drawings

Fig. 1 is a flow chart of the automatic disk failure prediction method based on artificial intelligence of the present invention.

Fig. 2 is a schematic diagram of the Lead Time of the present invention.

Fig. 3 is a schematic diagram of the working principle of the disk failure identification and classification system of the present invention.

Fig. 4 is a flow chart of the early-warning failure time prediction process of the disk failure identification and classification system based on the classification working principle of the present invention.

Fig. 5 is a flow chart of the method of the present invention for obtaining the failure early warning by setting a failure probability threshold.

Detailed description of the embodiments

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the invention but do not limit its scope.

With reference to Fig. 1, the present invention relates to an automatic disk failure prediction method based on artificial intelligence, the method comprising:

Step 1: collect status data from a number of groups of disks as training data and train on it with a machine learning algorithm to generate a disk failure identification and classification system, which calculates a disk's early-warning failure time from the disk's status data.

Step 2: collect some or all of a disk's current status data at a first set time period, feed it into the disk failure identification and classification system to obtain the disk's early-warning failure time at the current moment, and handle the disk according to that early-warning failure time using a preset alarm rule.

The details of the proposed automatic disk failure prediction method based on artificial intelligence, and extensions of the related technical solutions, are set out below under five headings.

One, handling a disk according to its early-warning failure time

Handling the disk according to its early-warning failure time using the preset alarm rule means:

in response to the early-warning failure time of any disk being less than or equal to a first set time threshold, collecting status data from that disk at a second set time period and thereby obtaining its early-warning failure time, the second set time period being shorter than the first set time period.

With reference to Fig. 2, a failure warning is issued when the system predicts a disk failure. At that moment the disk is not yet completely unusable: a period of time separates the current moment from the moment the disk actually fails.

Suppose we define the time difference between the warning and the actual failure as the Lead Time. If the disk is replaced at the moment of the warning, some time still remains before the disk really fails, and the disk is usable during that time. That portion of the disk's normal service life is therefore wasted, and the waste equals the Lead Time.

Prediction can indicate how long a disk has before it fails. The further ahead the prediction is made, the higher the prediction accuracy, but the larger the Lead Time as well.

To improve disk utilization and reduce cost while also increasing the accuracy of the prediction algorithm, the present invention adopts the following technical means:

A two-level prediction mechanism is introduced into the prediction system: the first level predicts whether a disk is going to fail, and the second level predicts how long remains until the failure. Specifically, once a disk is predicted to fail (first-level prediction), the monitoring interval is shortened and failure-time prediction is performed on the disk, predicting within which of the next X days the failure will occur (second-level prediction). For example, if routine monitoring is once every 5 days, then once a disk is predicted to be about to fail, monitoring is switched to once a day, and a failure alarm is issued and the disk replaced only when the failure is predicted to occur within Y days (the alarm threshold).
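As a non-limiting illustration, the two-level mechanism can be sketched as a rule that maps a predicted early-warning failure time to a monitoring interval and an alarm decision. The 5-day and 1-day intervals follow the example above; the 30-day watch threshold and the 5-day alarm threshold Y are hypothetical values chosen only for the sketch, as are all names.

```python
from dataclasses import dataclass
from typing import Optional

# Intervals follow the example in the text; both thresholds are assumed values.
ROUTINE_INTERVAL_DAYS = 5     # first set time period
INTENSIVE_INTERVAL_DAYS = 1   # second set time period (shorter)
WATCH_THRESHOLD_DAYS = 30     # first set time threshold (assumption)
ALARM_THRESHOLD_DAYS = 5      # second set time threshold Y (assumption)


@dataclass
class Decision:
    next_check_in_days: int
    raise_alarm: bool


def apply_alarm_rule(predicted_days_to_failure: Optional[float]) -> Decision:
    """Map an early-warning failure time (None = no failure predicted) to an action."""
    if (predicted_days_to_failure is None
            or predicted_days_to_failure > WATCH_THRESHOLD_DAYS):
        # Level 1 says "healthy": keep the routine monitoring period.
        return Decision(ROUTINE_INTERVAL_DAYS, False)
    if predicted_days_to_failure <= ALARM_THRESHOLD_DAYS:
        # Failure expected within Y days: alarm so the disk can be replaced.
        return Decision(INTENSIVE_INTERVAL_DAYS, True)
    # Failure predicted but not imminent: monitor more often, no alarm yet.
    return Decision(INTENSIVE_INTERVAL_DAYS, False)
```

Delaying the alarm until the failure falls inside the alarm threshold is what limits the wasted Lead Time in this sketch.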

Two, generation of the disk failure identification and classification system

As the preceding steps show, the premise of the automatic disk failure prediction method based on artificial intelligence described here is to generate a disk failure identification and classification system that can calculate a disk's early-warning failure time from the disk's status data.

A disk's service life is affected by many factors, such as the disk's initial state and the damage it sustains during later use, and these factors are all reflected in the disk's status data.

The present invention therefore proposes collecting status data from a number of groups of disks as training data (sample data) and then training on that training data (sample data) with a machine learning algorithm to generate the disk failure identification and classification system.

It should be understood that the larger and more varied the training data (sample data), the higher the precision and accuracy of the resulting disk failure identification and classification system.

In some examples, the disk failure identification and classification system exchanges data with a cloud server over a network and periodically downloads training data (sample data) from the cloud server for continual self-learning and updating.

Regarding how training data (sample data) is collected, preferably, in step 1 the present invention uses S.M.A.R.T. to collect the status data of the disk.

As the disk's internal monitoring and self-test mechanism, SMART can detect and describe the disk's state characteristics well. It converts the current disk state into a set of specific numerical values that present the disk's current state features as a vector, which makes it convenient for a machine learning algorithm to learn their numerical characteristics.

SMART data contains 23 important data items; this method selects 10 main data items as the source of training data for disk failure prediction.

These 10 main status data items are: raw read error rate, spin-up time (the time the motor takes to reach rated speed), reallocated sector count, seek error rate, power-on hours, uncorrectable error count, head fly-height writes, disk temperature, hardware ECC recovered, and pending (awaiting reallocation) sector count.
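As a non-limiting illustration, the ten selected items can be arranged into a fixed-order feature vector. The text lists the items by name only; the SMART attribute IDs and the identifier names below are assumptions based on commonly used attribute numbering, and the treatment of missing attributes as 0.0 is likewise a choice made for the sketch.

```python
# Assumed mapping from common SMART attribute IDs to the ten items named above.
SELECTED_ATTRIBUTES = {
    1: "raw_read_error_rate",
    3: "spin_up_time",
    5: "reallocated_sectors_count",
    7: "seek_error_rate",
    9: "power_on_hours",
    187: "reported_uncorrectable_errors",
    189: "high_fly_writes",
    194: "temperature_celsius",
    195: "hardware_ecc_recovered",
    197: "current_pending_sector_count",
}


def to_feature_vector(smart_reading):
    """Return the selected attribute values ordered by ID.

    smart_reading maps attribute ID -> value; attributes that the disk
    does not report are filled with 0.0, and unselected IDs are ignored.
    """
    return [float(smart_reading.get(attr_id, 0.0))
            for attr_id in sorted(SELECTED_ATTRIBUTES)]
```

A reading such as `{1: 100, 5: 3, 194: 38}` then yields a 10-element vector with the reported values in positions 0, 2 and 7 and zeros elsewhere.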

In fact, as noted above, the more types and the greater the quantity of disk status data used, the higher the precision and accuracy of the resulting disk failure identification and classification system; at the same time, however, the greater the computation required during training and the longer the training time. To balance the two, the aforementioned 10 status data items that most affect disk failure are selected as training data.

After enough training data has been collected, the next step is to generate the disk failure identification and classification system.

The method of generating the disk failure identification and classification system comprises:

obtaining status data from a number of groups of disks, each status data item being provided with a corresponding state threshold, and quantifying the status data and state thresholds.

The essence of a disk failure prediction mechanism is to find a state threshold and raise a failure alarm when the status data exceeds its corresponding threshold. As noted above, the state here is not one specific indicator provided by SMART but a group of indicators, which are quantified to express the disk state and the state threshold. These attributes are then used as training data for a classification model, which is trained with the SVM algorithm to find a separating hyperplane for the disk state (this hyperplane can be regarded as a concrete form of the state threshold).

SVM is a classic supervised machine learning algorithm that performs well when classifying high-dimensional data. The implementation is based on the open-source LIBSVM. The classification accuracy of an SVM model depends mainly on the training data, the choice of kernel function and the tuning of the related parameters.
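As a non-limiting illustration of why a separating hyperplane can be regarded as a generalized state threshold, the sketch below evaluates the decision rule of an already trained model, assuming a linear kernel. It does not train anything and does not use LIBSVM; the weights, bias and function names are hypothetical.

```python
def hyperplane_score(features, weights, bias):
    """Signed score w . x + b of a feature vector relative to the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def is_failing(features, weights, bias):
    """Classify as failing when the vector lies on the positive side."""
    return hyperplane_score(features, weights, bias) > 0


# A single-attribute threshold "attribute 0 > 2.0" is the special case
# weights = [1.0, 0.0], bias = -2.0; a general hyperplane weighs several
# attributes together instead of testing each in isolation.
single_threshold = ([1.0, 0.0], -2.0)
combined = ([0.5, 0.5], -1.0)
```

With the hypothetical `combined` parameters, a disk with features `[1.5, 1.5]` is flagged even though neither attribute alone exceeds the single-attribute threshold of 2.0.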

Once the disk failure identification and classification system has been generated, failure early warning can be performed on disks, as follows:

For example, an alarm prompting disk replacement is raised when a disk's early-warning failure time falls below a set threshold; or the early-warning failure times of multiple disks are monitored simultaneously and the disks are ranked by early-warning failure time; or the aforementioned two-level prediction mechanism is used; and so on.

Three, working principle of the disk failure identification and classification system

The method by which the disk failure identification and classification system obtains a disk's early-warning failure time comprises:

setting a failure-time precision.

For a machine learning algorithm, a yes-or-no decision is simpler than calculating a specific early-warning failure time value: it requires less computation and runs faster, so the resulting disk failure identification and classification system can monitor a larger number of disks on modest hardware.

To exploit the strong performance of machine learning algorithms on classification problems, the present invention proposes converting failure-time prediction into a classification problem. Here the classification is not whether the disk is about to fail, but whether the disk failure will occur within a given upcoming period.

For example, first set the precision of failure-time prediction to X days; the prediction can then state that the failure will occur 0 to X days, X to 2X days, 2X to 3X days, and so on, after the prediction. It is only necessary to predict whether the disk failure will occur within given time ranges such as the next X days or the next 2X days, which converts failure-time prediction into a problem that a classification algorithm can solve.

With reference to Fig. 3, assume the failure-time precision is set to 5 days. First determine whether the disk that is about to fail will fail within the next 5 days. If so, a warning that the disk will fail within 5 days is issued, and the failure is placed in the range of 0 to 5 days after the prediction. If not, go on to determine whether the disk will fail within the next 10 days; if so, the early-warning failure time is predicted to be 5 to 10 days ahead. Proceeding in this way through successive decisions yields the disk's early-warning failure time, which is a time range.
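As a non-limiting illustration, the successive yes-or-no decisions can be sketched as a loop over horizons that are multiples of the failure-time precision. The callable `will_fail_within` stands in for the trained classifier, and the cap `max_windows` is a hypothetical bound added so the loop terminates; neither is specified in the text.

```python
def early_warning_window(will_fail_within, precision_days=5, max_windows=6):
    """Return the (start_day, end_day) range of the predicted failure.

    will_fail_within(days) -> bool is a placeholder for the classifier
    answering "will this disk fail within the next `days` days?".
    Returns None if no failure is predicted inside the examined horizons.
    """
    for k in range(1, max_windows + 1):
        horizon = k * precision_days
        if will_fail_within(horizon):
            # The previous horizon answered "no", so the failure lies in
            # the last precision-sized slice of the current horizon.
            return (horizon - precision_days, horizon)
    return None
```

For a stand-in classifier that answers yes from a horizon of 8 days onward, the 5-day query returns no and the 10-day query returns yes, so the result is the range 5 to 10 days.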

Four, generation of the disk failure identification and classification system based on the classification working principle

With reference to Fig. 4, converting disk failure-time prediction into a typical binary classification problem that machine learning algorithms can solve requires particular care: the training data must take into account the disk's current state, its rate of state change, its I/O load and other factors at the same time. Beyond the handling of training data, plain failure prediction classifies a disk as will or will not fail in the future, whereas here the disk must be classified as will or will not fail within X days, which adds a failure-time limit. Specifically, when choosing the classifier's training data, disk data is collected on a daily basis.

Specifically, the training data includes the disk's status data at the current moment, the change in and variance of the disk state over the preceding span of one failure-time precision, and the disk's average I/O load over that same span.

Assuming the failure-time precision is again 5 days, each training data collection point needs to collect the disk's current status data for the SMART data items and the disk I/O load, and to calculate, over the 5 days preceding that data point, the change in each state value, the variance of each data item and the average I/O load; all of these are recorded together as one input sample. That is, the data items recorded for each sample are: the current disk status data, the change in and variance of the disk state over the last 5 days, and the disk's average I/O load over the last 5 days.
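As a non-limiting illustration, one training sample can be assembled from a window of daily readings. The text does not say how the change or the variance is computed; the sketch assumes the change is the difference between the newest and oldest reading in the window and uses the population variance. Function and parameter names are hypothetical.

```python
from statistics import mean, pvariance


def build_sample(history, io_load):
    """Build one input sample from a window of daily data.

    history: list of per-day attribute vectors, oldest first, last = today.
    io_load: per-day I/O load values for the same days.
    Layout: current values, per-attribute change, per-attribute variance,
    then the mean I/O load over the window.
    """
    current = list(history[-1])
    oldest = history[0]
    change = [c - o for c, o in zip(current, oldest)]
    # zip(*history) turns the list of daily vectors into per-attribute columns.
    variance = [pvariance(column) for column in zip(*history)]
    return current + change + variance + [mean(io_load)]
```

For two attributes observed over 5 days, the sample has 2 + 2 + 2 + 1 = 7 entries.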

Preferably, the aforementioned training data is trained with the logistic regression algorithm to generate the disk failure identification and classification system. Logistic regression is a supervised learning algorithm that can be used for classification. Specifically: given a function with unknown parameters, training on the training data with an optimization method determines a set of parameters, and this set of parameters is the logistic regression classification model.

When unknown data is then input into the logistic regression classification model, the model classifies the unknown data and outputs the probability that it belongs to a given class.

With reference to Fig. 5, in disk failure-time prediction the trained logistic regression classification model determines the probability that a disk will fail within a coming period. A disk failure alarm is actually issued only when the probability that the disk fails within that period exceeds the set probability threshold. That is, the method by which the disk failure identification and classification system obtains a disk's early-warning failure time comprises: in response to the probability that the disk fails within any period exceeding a set failure probability threshold, outputting that period as the disk's early-warning failure time.

Setting the failure probability threshold is also the key technical feature for balancing the false alarm rate and the detection rate.
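As a non-limiting illustration, the probability output and the threshold test can be sketched as follows. The sketch assumes the model parameters (weights and bias) have already been fitted elsewhere and that one probability per candidate period is available; all names and numeric values are hypothetical.

```python
import math


def failure_probability(features, weights, bias):
    """Logistic (sigmoid) probability for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def first_window_over_threshold(window_probabilities, threshold):
    """Return the earliest period whose failure probability exceeds threshold.

    window_probabilities: list of ((start_day, end_day), probability) pairs
    in chronological order. Returns None when no period qualifies.
    """
    for window, probability in window_probabilities:
        if probability > threshold:
            return window
    return None
```

With periods `[((0, 5), 0.2), ((5, 10), 0.7), ((10, 15), 0.9)]`, a threshold of 0.6 selects the 5 to 10 day period, while a threshold of 0.95 produces no warning at all.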

Five, balancing the false alarm rate (FAR) and the failure detection rate (FDR)

FAR, the false alarm rate, is the probability that a healthy disk is wrongly predicted to be a disk that is about to fail.

FDR, the failure detection rate, is the ratio of the number of failures predicted to the total number of failures.

As can be seen from the foregoing, a high FDR may bring a high FAR.

In actual operation, replacing disks because of false alarms wastes disks, yet the point of disk failure prediction is to make the FDR as high as possible. Given the prediction basis (SMART data) and the machine learning method (SVM) used in this method, a 100% detection rate with a zero false alarm rate cannot be achieved, so a trade-off between FDR and FAR must be made according to actual needs.

In practical application, the prediction system of this method controls the FAR and FDR of the prediction results by adjusting the failure probability threshold according to the system's own characteristics, so that the prediction results can be tuned to the requirements on FDR and FAR.

By introducing a threshold mechanism, the prediction results are adjusted: according to the characteristics of the current prediction model, a failure probability threshold is set and an early warning is issued only when the failure probability exceeds that threshold, thereby adjusting FDR and FAR.
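As a non-limiting illustration, FDR and FAR can be computed for a given failure probability threshold from labelled outcomes, which makes the trade-off visible when the threshold is varied. The probabilities and labels below are made-up values for the sketch, not measured results, and the function name is hypothetical.

```python
def fdr_far(probabilities, actually_failed, threshold):
    """Return (FDR, FAR) when alarming on probability > threshold.

    FDR = detected failures / all failures.
    FAR = healthy disks alarmed / all healthy disks.
    """
    alarms = [p > threshold for p in probabilities]
    failed_total = sum(1 for f in actually_failed if f)
    healthy_total = len(actually_failed) - failed_total
    detected = sum(1 for a, f in zip(alarms, actually_failed) if a and f)
    false_alarms = sum(1 for a, f in zip(alarms, actually_failed) if a and not f)
    fdr = detected / failed_total if failed_total else 0.0
    far = false_alarms / healthy_total if healthy_total else 0.0
    return fdr, far


# Made-up example: three disks that failed, three that stayed healthy.
probs = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
failed = [True, True, True, False, False, False]
```

On this made-up data, lowering the threshold from 0.8 to 0.3 raises the FDR from one third to 1.0 but also raises the FAR from 0 to one third.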

The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments has been described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as falling within the scope of this specification.

The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art may make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.

Claims

1. An automatic disk failure prediction method based on artificial intelligence, characterized by comprising:

collecting status data from a number of groups of disks as training data and training on it with a machine learning algorithm to generate a disk failure identification and classification system, the disk failure identification and classification system being used to calculate a disk's early-warning failure time from the disk's status data;

collecting some or all of a disk's current status data at a first set time period, feeding it into the disk failure identification and classification system to obtain the disk's early-warning failure time at the current moment, and handling the disk according to that early-warning failure time using a preset alarm rule.

2. The automatic disk failure prediction method based on artificial intelligence according to claim 1, characterized in that the method further comprises:

collecting the status data of the disk using S.M.A.R.T.

3. The automatic disk failure prediction method based on artificial intelligence according to claim 1 or 2, characterized in that the status data includes at least one of: raw read error rate, spin-up time (the time the motor takes to reach rated speed), reallocated sector count, seek error rate, power-on hours, uncorrectable error count, head fly-height writes, disk temperature, hardware ECC recovered, and pending (awaiting reallocation) sector count.

4. The automatic disk failure prediction method based on artificial intelligence according to claim 1 or 2, characterized in that the method of generating the disk failure identification and classification system comprises:

obtaining status data from a number of groups of disks, each status data item being provided with a corresponding state threshold, and quantifying the status data and state thresholds;

training on the quantified status data and state thresholds with the SVM algorithm to obtain a separating hyperplane for the disk state.

5. The automatic disk failure prediction method based on artificial intelligence according to claim 1, characterized in that handling the disk according to its early-warning failure time using the preset alarm rule means:

in response to the early-warning failure time of any disk being less than or equal to a first set time threshold, collecting status data from that disk at a second set time period and thereby obtaining its early-warning failure time, the second set time period being shorter than the first set time period;

in response to the early-warning failure time of any disk being less than or equal to a second set time threshold, issuing a failure alarm, the second set time threshold being less than the first set time threshold.

6. The automatic disk failure prediction method based on artificial intelligence according to claim 1, characterized in that the method by which the disk failure identification and classification system obtains a disk's early-warning failure time comprises:

setting a failure-time precision;

obtaining the disk's status data, dividing the calendar time from the current moment onward into a number of periods of the length of the failure-time precision, determining for each period in chronological order whether the disk will fail within it, and outputting the time range of the period in which a failure is determined to occur as the early-warning failure time.

7. The automatic disk failure prediction method based on artificial intelligence according to claim 6, characterized in that the training data includes the disk's status data at the current moment, the change in and variance of the disk state over the preceding span of one failure-time precision, and the disk's average I/O load over that same span.

8. The automatic disk failure prediction method based on artificial intelligence according to claim 6, characterized in that the failure-time precision is 5 days.

9. The automatic disk failure prediction method based on artificial intelligence according to claim 1 or 6, characterized in that the method further comprises:

10. The automatic disk failure prediction method based on artificial intelligence according to claim 9, characterized in that the method by which the disk failure identification and classification system obtains a disk's early-warning failure time comprises:

in response to the probability that the disk fails within any period exceeding a set failure probability threshold, outputting that period as the disk's early-warning failure time.