CN111581072B - Disk fault prediction method based on SMART and performance log - Google Patents

Info

Publication number
CN111581072B
CN111581072B CN202010397456.1A CN202010397456A CN111581072B CN 111581072 B CN111581072 B CN 111581072B CN 202010397456 A CN202010397456 A CN 202010397456A CN 111581072 B CN111581072 B CN 111581072B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010397456.1A
Other languages
Chinese (zh)
Other versions
CN111581072A (en)
Inventor
徐敏
胡聪
刘翠玲
洪德华
张翠翠
王鹏
孙佳丽
薛晓茹
王国梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd
Application filed by State Grid Corp of China SGCC and Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd
Priority to CN202010397456.1A
Publication of CN111581072A
Application granted
Publication of CN111581072B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 Performance evaluation by modeling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 Performance evaluation by tracing or monitoring
    • G06F11/3485 Performance evaluation by tracing or monitoring for I/O devices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Abstract

The invention relates to the technical field of cloud storage and discloses a disk fault prediction method based on SMART and performance logs. The method comprises: (1) collecting disk SMART information, performance log data and external operating conditions, and training with a random forest algorithm to obtain the feature items and a judgment model for judging disk faults. Because the judgment model is obtained with a random forest algorithm, it analyzes multiple feature items together to decide whether a disk is faulty, and so judges more accurately than a threshold check on a single SMART feature item. The method also predicts future changes in a disk's feature items from the disk's existing data and feeds the predicted values into the judgment model, so that the disk's future operating condition is predicted in advance. This helps operation and maintenance personnel back up and replace hard disks in time, avoids data loss and server downtime, and reduces the resulting economic loss.

Description

Disk fault prediction method based on SMART and performance log
Technical Field
The invention relates to the technical field of cloud storage, and in particular to a disk fault prediction method based on SMART and performance logs.
Background
With the development of the information industry, large amounts of data are generated continuously, which drives the development of data storage services. The stability of a storage system is closely tied to the service provider's revenue, and a storage system failure can cause significant losses to users. To ensure that data is not lost, the cloud storage itself must first be protected. The number of disks in cloud storage is extremely large, and the hard disk is one of the server components with the highest hardware failure rate. If hard disk failures can be predicted in advance, maintenance personnel can be guided to respond, for example by backing up data or replacing the disk, which keeps the system running normally and reduces losses. At present, hard disk manufacturers generally monitor and analyze disk state with Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), but its fault detection rate is only 3%-10%.
SMART is a technology by which a disk analyzes and monitors itself. It was in general use by the late 1990s, is one of the standard requirements that the ATA standard places on every disk manufacturer, and is the method manufacturers commonly use to predict failing disks.
Each hard disk records a number of its own parameters while operating, including model, capacity, temperature, density, sectors, seek time, transmission, bit error rate, and so on. After a hard disk has run for thousands of hours, many of its intrinsic physical parameters change, and when a parameter exceeds its alarm threshold, the hard disk is close to failure. At that point the disk still works, but if the user ignores the alarm and keeps using it, the disk becomes very unreliable and may fail at any time.
The SMART-based threshold judgment method is too simple. In real operating environments it usually detects only 3%-10% of failing disks, which is too low to provide much practical early warning.
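The single-attribute threshold check described above can be illustrated with a minimal sketch. The attribute names and alarm limits below are hypothetical, chosen only to show the logic; they are not taken from the patent or from any vendor's SMART table.

```python
# Hypothetical attribute names and alarm limits, chosen only for illustration.
ALARM_THRESHOLDS = {
    "reallocated_sectors": 50,
    "seek_error_count": 100,
    "temperature_c": 60,
}

def threshold_alarms(sample, thresholds=ALARM_THRESHOLDS):
    """Return the attributes whose value exceeds its own alarm threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0) > limit
    )

# One attribute is over its limit, so an alarm is raised.
worn_disk = {"reallocated_sectors": 72, "seek_error_count": 10, "temperature_c": 41}
# Every attribute is elevated but still under its limit, so no alarm is raised.
degrading_disk = {"reallocated_sectors": 48, "seek_error_count": 95, "temperature_c": 58}

print(threshold_alarms(worn_disk))
print(threshold_alarms(degrading_disk))
```

The second disk shows the weakness the text points to: each attribute is judged in isolation, so a disk that is degrading on several attributes at once can pass every individual check.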
SMART information is also not updated in real time. It may go unrefreshed for a period of time, so SMART information alone is not sufficient to predict disk failure.
Disclosure of Invention
In view of the shortcomings described in the background, the invention provides a disk fault prediction method based on SMART and performance logs. Faults are predicted by a deep learning model trained on a data set, so that accuracy can be raised above 95% and the prediction rate is greatly improved.
The invention provides the following technical solution. A disk fault prediction method based on SMART and performance logs comprises:
(1) Collecting disk SMART information, performance log data and external operating conditions, and training with a random forest algorithm to obtain the feature items and a judgment model for judging disk faults;
(2) Fitting the curve of each feature item's value against time to obtain a library of change models for the value of each feature item;
(3) Comparing the feature-item value curve of a normally operating disk with the curves in the change model library, and using the closest model to predict the feature-item values at a future time N;
(4) Feeding the predicted values into the judgment model for analysis, to judge whether the disk will fail at time N and with what probability;
(5) Returning the prediction result and issuing early-warning information.
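Steps (1) and (4) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the patented implementation: each "tree" is reduced to a single-split stump, the three feature columns (reallocated sectors, IO latency, machine room temperature) and all sample values are invented, and the failure probability is taken as the average vote of the stumps.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def fit_stump(rows, labels, feature_ids):
    """Find the best single split (feature, threshold) within a feature subset."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # no usable split: fall back to the sample's failure rate
        rate = sum(labels) / len(labels)
        return (0, float("inf"), rate, rate)
    return best[1:]

def train_forest(rows, labels, n_trees=25, m=2, seed=0):
    """Bagging: each stump sees a bootstrap sample and m randomly chosen features."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        feats = rng.sample(range(len(rows[0])), m)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def failure_probability(forest, row):
    """Average the per-stump failure rates for one disk's feature values."""
    votes = [(p_left if row[f] <= t else p_right) for f, t, p_left, p_right in forest]
    return sum(votes) / len(votes)

# Invented samples: [reallocated sectors, IO latency in ms, room temperature in C].
rows = [[0, 5, 22], [1, 6, 23], [0, 4, 21], [2, 5, 24], [1, 7, 22], [0, 6, 23],
        [40, 30, 30], [55, 25, 31], [38, 35, 29], [60, 28, 32], [45, 33, 30], [50, 27, 31]]
labels = [0] * 6 + [1] * 6  # 0 = healthy, 1 = failed

forest = train_forest(rows, labels)
print(failure_probability(forest, [48, 29, 30]))  # values resembling the failed disks
print(failure_probability(forest, [0, 4, 21]))    # values resembling the healthy disks
```

In step (4) the row passed to `failure_probability` would hold the feature values predicted for time N rather than values measured today.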
Preferably, after the feature items and the judgment model are obtained, a recursive algorithm is used to rank the feature items by importance, and a prediction path is established according to the importance of the feature items.
Preferably, the external operating conditions include machine room temperature, humidity, machine density, machine room type, task type and task volume.
Preferably, in step (2), the disks are classified according to the importance of the feature items, and search labels for the change model library curves are set according to disk type.
Preferably, after a prediction result is given, the method tracks it, records the accuracy of its judgments, and maintains an exception database that collects data on wrongly predicted results.
Preferably, the training samples and test samples of step (1) and step (2) are extracted separately, and the test data of the test samples in step (2) are used as detection samples to check the model of step (1).
Preferably, the method is used to check server disks: it predicts whether a server's disks will fail from the SMART data and IO performance logs of each of the server's hard disks.
The invention has the following beneficial effects:
1. The method obtains its model for judging whether a disk is faulty with a random forest algorithm. Compared with a threshold check on a single SMART feature item, the model analyzes multiple feature items together to judge whether the disk is faulty, and its judgments are more accurate.
2. The method predicts future changes in a disk's feature items from the disk's existing data and feeds them into the judgment model, so the disk's future operating condition is predicted in advance. This helps operation and maintenance personnel back up and replace hard disks in time, avoids data loss and server downtime, and reduces the resulting economic loss.
Detailed Description
The embodiments of the invention are described clearly and completely below. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.
A disk fault prediction method based on SMART and performance logs comprises:
(1) Collecting disk SMART information, performance log data and external operating conditions, and training to obtain the feature items and a judgment model for judging disk faults, as follows:
standardizing the collected data and randomly extracting a training sample set and a test sample set, with the SMART information, performance logs and external operating conditions serving as the feature set;
selecting m features as a feature subset, determining the decision at each node of a decision tree, and building the decision tree;
training with the training sample set and evaluating the result with the test sample set;
combining the predictions of all decision trees to obtain the judgment model.
(2) Fitting the curve of each feature item's value against time to obtain a library of change models for the value of each feature item, uniformly selecting 5-8 time points, judging the degree of coincidence at those points, and classifying and archiving the curves;
(3) Comparing the feature-item value curve of a normally operating disk with the curves in the change model library, and using the closest model to predict the feature-item values at a future time N;
(4) Feeding the predicted values into the judgment model for analysis, to judge whether the disk will fail at time N and with what probability;
(5) Returning the prediction result and issuing early-warning information.
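Steps (2) and (3) can be sketched as below. Several choices here are assumptions made for illustration: each change model is a straight line fitted by least squares (the patent does not name the curve family), the degree of coincidence is measured as mean squared difference at 6 uniformly spaced time points (within the 5-8 range stated above), and the model names and sample values are invented.

```python
def fit_line(times, values):
    """Least-squares fit of value = slope * t + intercept."""
    n = len(times)
    mean_t, mean_v = sum(times) / n, sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / sxx
    return slope, mean_v - slope * mean_t

def uniform_indices(n, k):
    """k indices spread evenly over range(n), including both ends (needs n >= k >= 2)."""
    return [round(i * (n - 1) / (k - 1)) for i in range(k)]

def nearest_model(library, times, values, k=6):
    """Pick the library curve that best coincides with the observed curve at k points."""
    idx = uniform_indices(len(times), k)

    def distance(model):
        slope, intercept = model
        return sum((values[i] - (slope * times[i] + intercept)) ** 2 for i in idx) / k

    return min(library, key=lambda name: distance(library[name]))

def predict_at(library, name, n_time):
    """Extrapolate the chosen change model to the future time N."""
    slope, intercept = library[name]
    return slope * n_time + intercept

# Invented history of one feature item (e.g. a sector count) for three archived disks.
times = list(range(10))
history = {
    "stable": [5.0 for t in times],
    "slow_wear": [0.5 * t for t in times],
    "fast_wear": [3.0 * t + 1.0 for t in times],
}
library = {name: fit_line(times, values) for name, values in history.items()}

# Curve of a disk that is currently operating normally.
observed = [2.9 * t + 1.2 for t in times]
match = nearest_model(library, times, observed)
print(match, predict_at(library, match, 20))
```

The value returned by `predict_at` is what step (4) would pass to the judgment model, one predicted value per feature item.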
After the feature items and the judgment model are obtained, a recursive algorithm is used to rank the feature items by importance, and a prediction path is established according to that importance, which yields the contribution of each node on the decision tree to a given prediction.
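The patent does not specify how node contributions are computed. One common reading, shown here purely as an assumption, is to walk the decision path for a sample and credit each change in the node's failure rate to the feature that was split on. The tree below is hand-built with invented values.

```python
def path_contributions(tree, sample):
    """Walk one decision path; credit each change in node value to the split feature."""
    contributions = {}
    node = tree
    while "feature" in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        delta = child["value"] - node["value"]
        contributions[node["feature"]] = contributions.get(node["feature"], 0.0) + delta
        node = child
    return tree["value"], contributions

# Hand-built tree; "value" is the failure rate of the training samples at that node.
tree = {
    "value": 0.2, "feature": "reallocated_sectors", "threshold": 10,
    "left": {"value": 0.05},
    "right": {
        "value": 0.6, "feature": "io_latency_ms", "threshold": 20,
        "left": {"value": 0.3},
        "right": {"value": 0.9},
    },
}

sample = {"reallocated_sectors": 40, "io_latency_ms": 35}
bias, contrib = path_contributions(tree, sample)
ranking = sorted(contrib, key=lambda name: abs(contrib[name]), reverse=True)
print(bias, contrib, ranking)
```

The root value plus all contributions equals the leaf value reached by the sample, so the prediction decomposes into a baseline and one term per feature on the path.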
The external operating conditions include machine room temperature, humidity, machine density, machine room type, task type and task volume. The external operating environment strongly affects disk service life, so taking it into account improves prediction accuracy.
In step (2), the disks are classified according to the importance of the feature items, and search labels for the change model library curves are set according to disk type; classifying before matching reduces the amount of computation during matching.
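A minimal sketch of such search labels follows, assuming each archived curve carries a `disk_type` label. The label values and model names are hypothetical; the point is only that matching then compares against one bucket instead of the whole library.

```python
from collections import defaultdict

def build_index(models):
    """Group change-model names by their disk-type label."""
    index = defaultdict(list)
    for name, info in models.items():
        index[info["disk_type"]].append(name)
    return index

def candidates(index, disk_type):
    """Return only the models filed under this disk type (empty if none)."""
    return index.get(disk_type, [])

# Hypothetical library entries; the labels name the dominant feature item.
models = {
    "curve_a": {"disk_type": "sector_dominated"},
    "curve_b": {"disk_type": "sector_dominated"},
    "curve_c": {"disk_type": "latency_dominated"},
    "curve_d": {"disk_type": "temperature_dominated"},
}

index = build_index(models)
print(candidates(index, "sector_dominated"))
```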
After a prediction result is given, the method tracks it, records the accuracy of its judgments, and maintains an exception database that collects data on wrongly predicted results, which facilitates later evaluation and improvement.
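The tracking and exception database could be sketched with Python's built-in `sqlite3` module. The table layout, column names and disk identifiers are assumptions for illustration; the patent does not describe a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE predictions (disk_id TEXT, predicted_fail INTEGER, actual_fail INTEGER)"
)
conn.execute(
    "CREATE TABLE exceptions (disk_id TEXT, predicted_fail INTEGER, actual_fail INTEGER)"
)

def record_outcome(conn, disk_id, predicted_fail, actual_fail):
    """Store every tracked prediction; copy wrong ones into the exception table."""
    row = (disk_id, int(predicted_fail), int(actual_fail))
    conn.execute("INSERT INTO predictions VALUES (?, ?, ?)", row)
    if row[1] != row[2]:
        conn.execute("INSERT INTO exceptions VALUES (?, ?, ?)", row)

def accuracy(conn):
    """Fraction of tracked predictions that matched the observed outcome."""
    total, correct = conn.execute(
        "SELECT COUNT(*), SUM(predicted_fail = actual_fail) FROM predictions"
    ).fetchone()
    return correct / total if total else None

# Hypothetical tracked outcomes: three correct predictions and one miss.
record_outcome(conn, "disk-01", True, True)
record_outcome(conn, "disk-02", False, False)
record_outcome(conn, "disk-03", False, True)
record_outcome(conn, "disk-04", True, True)

print(accuracy(conn))
print(conn.execute("SELECT disk_id FROM exceptions").fetchall())
```

Rows in the exception table are the natural input for the later evaluation and improvement the text mentions, for example as extra training samples.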
The training samples and test samples of step (1) and step (2) are extracted separately, and the test data of the test samples in step (2) are used as detection samples to check the model of step (1).
The method is used to check server disks: it predicts whether a server's disks will fail from the SMART data and IO performance logs of each of the server's hard disks.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A disk failure prediction method based on SMART and performance logs, comprising:
(1) Collecting disk SMART information, performance log data and external operating conditions, and training with a random forest algorithm to obtain the feature items and a judgment model for judging disk faults;
(2) Fitting the curve of each feature item's value against time to obtain a library of change models for the value of each feature item;
(3) Comparing the feature-item value curve of a normally operating disk with the curves in the change model library, and using the closest model to predict the feature-item values at a future time N;
(4) Feeding the predicted values into the judgment model for analysis, to judge whether the disk will fail at time N and with what probability;
(5) Returning the prediction result and issuing early-warning information.
2. The disk failure prediction method based on SMART and performance logs according to claim 1, wherein: after the feature items and the judgment model are obtained, a recursive algorithm is used to rank the feature items by importance, and a prediction path is established according to the importance of the feature items.
3. The disk failure prediction method based on SMART and performance logs according to claim 1, wherein: the external operating conditions include machine room temperature, humidity, machine density, machine room type, task type and task volume.
4. The disk failure prediction method based on SMART and performance logs according to claim 2, wherein: in step (2), the disks are classified according to the importance of the feature items, and search labels for the change model library curves are set according to disk type.
5. The disk failure prediction method based on SMART and performance logs according to claim 1, wherein: after a prediction result is given, the method tracks it, records the accuracy of its judgments, and maintains an exception database that collects data on wrongly predicted results.
6. The disk failure prediction method based on SMART and performance logs according to claim 1, wherein: the method is used to check server disks, and predicts whether a server's disks will fail from the SMART data and IO performance logs of each of the server's hard disks.
CN202010397456.1A 2020-05-12 2020-05-12 Disk fault prediction method based on SMART and performance log Active CN111581072B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010397456.1A CN111581072B (en) 2020-05-12 2020-05-12 Disk fault prediction method based on SMART and performance log

Publications (2)

Publication Number Publication Date
CN111581072A (en) 2020-08-25
CN111581072B (en) 2023-08-15

Family

ID=72123036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010397456.1A Active CN111581072B (en) 2020-05-12 2020-05-12 Disk fault prediction method based on SMART and performance log

Country Status (1)

Country Link
CN (1) CN111581072B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327241A (en) * 2020-09-29 2022-04-12 伊姆西Ip控股有限责任公司 Method, electronic device and computer program product for managing disk
CN112199258A (en) * 2020-11-13 2021-01-08 新华三大数据技术有限公司 Method and device for monitoring magnetic disk, electronic equipment and medium
CN114595085A (en) * 2020-12-03 2022-06-07 中兴通讯股份有限公司 Disk failure prediction method, prediction model training method and electronic equipment
CN112527594A (en) * 2020-12-04 2021-03-19 浪潮电子信息产业股份有限公司 Hard disk inspection method, device and system
CN112951311B (en) * 2021-04-16 2023-11-10 中国民航大学 Hard disk fault prediction method and system based on variable weight random forest
CN115410638B (en) * 2022-07-28 2023-11-07 南京航空航天大学 Disk fault detection system based on contrast clustering
JP7281854B1 (en) 2022-08-17 2023-05-26 デジタルデータソリューション株式会社 Information processing device, information processing method and program
CN116755910B (en) * 2023-08-16 2023-11-03 中移(苏州)软件技术有限公司 Host machine high availability prediction method and device based on cold start and electronic equipment

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260279A (en) * 2015-11-04 2016-01-20 四川效率源信息安全技术股份有限公司 Method and device of dynamically diagnosing hard disk failure based on S.M.A.R.T (Self-Monitoring Analysis and Reporting Technology) data
CN106355208A (en) * 2016-08-31 2017-01-25 广州精点计算机科技有限公司 Data prediction analysis method based on COX model and random survival forest
CN107392320A (en) * 2017-07-28 2017-11-24 郑州云海信息技术有限公司 A kind of method that hard disk failure is predicted using machine learning
CN108470225A (en) * 2018-03-21 2018-08-31 广东省交通规划设计研究院股份有限公司 The sedimentation information forecasting method and forecasting system of roadbed
CN109919335A (en) * 2019-03-11 2019-06-21 西安电子科技大学 Disk failure forecasting system based on deep learning
CN110377449A (en) * 2019-07-19 2019-10-25 苏州浪潮智能科技有限公司 A kind of disk failure prediction technique, device and electronic equipment and storage medium
CN110399238A (en) * 2019-06-27 2019-11-01 浪潮电子信息产业股份有限公司 A kind of disk failure method for early warning, device, equipment and readable storage medium storing program for executing
CN110399237A (en) * 2019-06-29 2019-11-01 苏州浪潮智能科技有限公司 A kind of disk failure prediction technique, system, terminal and storage medium
CN110414155A (en) * 2019-07-31 2019-11-05 北京天泽智云科技有限公司 A kind of detection of fan part temperature anomaly and alarm method with single measuring point
CN110427311A (en) * 2019-06-26 2019-11-08 华中科技大学 Disk failure prediction technique and system based on temporal aspect processing and model optimization
CN110766059A (en) * 2019-10-14 2020-02-07 四川西部能源股份有限公司郫县水电厂 Transformer fault prediction method, device and equipment
CN110889255A (en) * 2019-10-31 2020-03-17 国网湖北省电力有限公司 Power system transient stability evaluation method based on cascaded deep forest

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7730364B2 (en) * 2007-04-05 2010-06-01 International Business Machines Corporation Systems and methods for predictive failure management
US11423327B2 (en) * 2018-10-10 2022-08-23 Oracle International Corporation Out of band server utilization estimation and server workload characterization for datacenter resource optimization and forecasting

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Random-forest-based failure prediction for hard disk drives;Jing Shen 等;《Uncertain Data Mining in Internet of Things》;第14卷(第11期);第1-15页 *

Also Published As

Publication number Publication date
CN111581072A (en) 2020-08-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant