CN113176962B - Computer room IT equipment fault accurate detection method and system for data center

Info

Publication number: CN113176962B
Application number: CN202110400918.5A
Authority: CN
Inventors: 赵希峰
Original assignee: Beijing Zhongda Kehui Technology Development Co ltd
Current assignee: Beijing Zhongda Kehui Technology Development Co ltd
Priority date: 2021-04-14
Filing date: 2021-04-14
Publication date: 2024-05-07
Anticipated expiration: 2041-04-14
Also published as: CN113176962A

Abstract

The invention discloses a method and a system for accurately detecting faults of machine room IT equipment in a data center. The method comprises: collecting performance index information of target IT equipment according to a target period to obtain a periodic performance data sequence; normalizing the periodic performance data sequence; clustering the normalized periodic performance data according to preset fault types to obtain a clustering result; calculating a target abnormal value score for the periodic performance data subsequence corresponding to each preset fault type in the clustering result; and judging, from the target abnormal value score of each preset fault type, whether the target IT equipment has failed and, if so, the specific fault information. Because the judgment is made from the calculated scores, faults need not be checked manually one by one: the specific fault of the target IT equipment is determined quickly and accurately, labor cost is saved, and maintenance personnel can proceed promptly with follow-up maintenance.

Description

Computer room IT equipment fault accurate detection method and system for data center

Technical Field

The invention relates to the technical field of equipment management, in particular to a method and a system for accurately detecting faults of computer room IT equipment for a data center.

Background

With the rapid development of the information society, modern information technology and automation equipment are growing explosively, data centers keep increasing in scale, and the requirements on their safety and stability are rising accordingly. The machine room is a core part of the data center, so detecting and eliminating faults in it is critical. Hidden dangers in machine room equipment arise mainly from long periods of powered-on operation, equipment aging, and manual misoperation of electrical equipment such as network devices, storage devices, and servers. If faults of asset equipment caused by these conditions do not raise a timely alarm that points out the abnormal condition in machine room operation, the information data is exposed to potential safety hazards.

The existing fault detection method for machine room IT equipment collects the working parameters of the equipment and then checks for faults manually, one by one, against the collected parameters. This seriously wastes labor cost and greatly reduces subsequent maintenance efficiency; as a result, periodic maintenance of the machine room equipment is not carried out properly and the service life of the equipment is shortened.

Disclosure of Invention

In view of the above problems, the invention provides a method and a system for accurately detecting faults of machine room IT equipment for a data center. They address the problems described in the background art: checking faults manually one by one against collected working parameters seriously wastes labor cost and greatly reduces subsequent maintenance efficiency, so that periodic maintenance of the machine room equipment is not carried out properly and the service life of the equipment is shortened.

A machine room IT equipment fault accurate detection method for a data center comprises the following steps:

Acquiring performance index information of target IT equipment according to a target period to obtain a periodic performance data sequence;

normalizing the periodic performance data sequence;

clustering the processed periodic performance data according to a preset fault type to obtain a clustering result;

calculating a target abnormal value score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result;

And judging whether the target IT equipment fails or not and specific failure information when the failure occurs according to the target abnormal value score corresponding to each preset failure type.

Preferably, the collecting performance index information of the target IT device according to the target period to obtain a periodic performance data sequence includes:

Determining a used time length of the target IT device;

determining a performance detection period of target IT equipment according to the used time length, and determining the performance detection period as the target period;

Acquiring working parameters of the target IT equipment according to the target period;

combining target working parameters corresponding to each performance index with each other;

After the combination is finished, sub-performance index values of each performance index under different dimensions are collected;

generating a performance data subsequence of each performance index according to the sub-performance index values of each performance index in different dimensions;

and generating a periodic performance data sequence of the target IT equipment according to the performance data subsequences of all the performance indexes.
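The collection steps above can be sketched as follows. This is a minimal illustration only: the text does not say how the used time length maps to a detection period, so the thresholds in `detection_period_hours` are hypothetical, as are the function names and the choice to order dimensions alphabetically.

```python
from typing import Dict, List


def detection_period_hours(used_hours: float) -> float:
    """Pick a detection period from the used time length.

    Hypothetical rule: the longer the device has been in use,
    the shorter (more frequent) the detection period.
    """
    if used_hours < 10_000:
        return 24.0
    if used_hours < 30_000:
        return 12.0
    return 6.0


def build_periodic_sequence(
    sub_values: Dict[str, Dict[str, float]]
) -> Dict[str, List[float]]:
    """Build one performance data subsequence per performance index.

    `sub_values` maps each performance index to its sub-performance
    index values under different dimensions. Dimensions are ordered
    alphabetically so every subsequence has a stable layout.
    """
    return {
        index: [dims[name] for name in sorted(dims)]
        for index, dims in sub_values.items()
    }
```

The dictionary returned by `build_periodic_sequence` plays the role of the periodic performance data sequence: one subsequence per performance index.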

Preferably, the normalizing the periodic performance data sequence includes: and carrying out maximum and minimum normalization processing on the periodic performance data sequence.
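Maximum and minimum (min-max) normalization rescales each value to the range [0, 1] using (x - min) / (max - min). A minimal sketch follows; the function name is illustrative, and mapping a constant sequence to all zeros is an assumption, since the text does not address that case.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values to [0, 1] with min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant sequence carries no spread, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

For example, `min_max_normalize([10.0, 20.0, 30.0])` yields `[0.0, 0.5, 1.0]`.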

Preferably, the clustering the processed periodic performance data according to the preset fault type to obtain a clustering result includes:

Determining a target performance index associated with each preset fault type;

Obtaining target performance data corresponding to the target performance index from the periodic performance data;

and clustering the processed periodic performance data according to the target performance data corresponding to each preset fault type to obtain a clustering result.
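As described, the clustering step groups the performance data by the performance indexes associated with each preset fault type. The sketch below illustrates that grouping under stated assumptions: the fault types and index names in `FAULT_TYPE_INDEXES` are hypothetical examples, and a plain dictionary lookup stands in for whatever clustering procedure an implementation actually uses.

```python
from typing import Dict, List

# Hypothetical mapping from preset fault types to their associated
# target performance indexes.
FAULT_TYPE_INDEXES: Dict[str, List[str]] = {
    "overheating": ["cpu_temp", "fan_speed"],
    "disk_failure": ["disk_io_latency"],
}


def cluster_by_fault_type(
    periodic_data: Dict[str, List[float]],
    fault_type_indexes: Dict[str, List[str]],
) -> Dict[str, Dict[str, List[float]]]:
    """Group performance data subsequences under each preset fault type."""
    result: Dict[str, Dict[str, List[float]]] = {}
    for fault_type, indexes in fault_type_indexes.items():
        # Keep only the subsequences whose index is tied to this fault type.
        result[fault_type] = {
            name: periodic_data[name] for name in indexes if name in periodic_data
        }
    return result
```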

Preferably, the calculating the target outlier score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result includes:

Determining a working function level value of the target IT device;

Confirming whether the working function level value is greater than or equal to a preset threshold value; if so, confirming that the accuracy of the abnormal value score of the periodic performance data subsequence in each clustering result is 100; otherwise, confirming that the accuracy of the abnormal value score of the periodic performance data subsequence in each clustering result is 90;

Performing a division calculation on the first subsequence value of the periodic performance data subsequence corresponding to each preset fault type and the second subsequence value of the performance index of the target IT equipment in the normal working state, to obtain the first abnormal value score of the periodic performance data subsequence corresponding to each preset fault type;

when the accuracy is confirmed to be 100, confirming the first abnormal value score of the periodic performance data subsequence corresponding to each preset fault type as a target abnormal value score of the periodic performance data subsequence in the clustering result;

And when the accuracy is confirmed to be 90, multiplying the first outlier score of the periodic performance data subsequence corresponding to each preset fault type by a preset proportion to obtain a second outlier score of the periodic performance data subsequence corresponding to each preset fault type, and confirming the second outlier score of the periodic performance data subsequence corresponding to each preset fault type as a target outlier score of the periodic performance data subsequence corresponding to the preset fault type.
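The score calculation above can be sketched as follows. Several details are assumptions, because the text leaves them open: the "subsequence value" is taken to be the mean of the subsequence, the division is taken as observed over normal, and the default preset proportion of 0.9 is a hypothetical value.

```python
from typing import List


def target_outlier_score(
    observed: List[float],
    normal: List[float],
    function_level: float,
    level_threshold: float,
    preset_ratio: float = 0.9,  # hypothetical preset proportion
) -> float:
    """Compute the target abnormal value (outlier) score for one fault type."""
    # Assumption: a subsequence is summarized by its mean.
    first_value = sum(observed) / len(observed)
    second_value = sum(normal) / len(normal)
    if second_value == 0:
        raise ValueError("normal-state subsequence value must be non-zero")

    # First outlier score: observed value divided by normal-state value.
    first_score = first_value / second_value

    # Accuracy is 100 when the function level reaches the threshold, else 90.
    accuracy = 100 if function_level >= level_threshold else 90
    if accuracy == 100:
        return first_score
    # Accuracy 90: scale the first score by the preset proportion.
    return first_score * preset_ratio
```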

Preferably, the determining whether the target IT device fails or not and the specific failure information when the failure occurs according to the target abnormal value score corresponding to each preset failure type includes:

confirming whether the target abnormal value score corresponding to each preset fault type is larger than or equal to the preset abnormal value score corresponding to the preset fault type, if so, confirming that the target IT equipment has no fault, otherwise, confirming that the target IT equipment has fault;

And determining a target fault type with the target abnormal value score smaller than the preset abnormal value score, and determining specific fault information of the target IT equipment according to the target fault type and the performance index information of the target IT equipment.
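The judgment rule above reduces to a per-fault-type comparison: a fault type is flagged when its target score falls below its preset score, and the equipment is fault-free when nothing is flagged. A minimal sketch, with illustrative function and fault type names:

```python
from typing import Dict, List


def judge_faults(
    target_scores: Dict[str, float],
    preset_scores: Dict[str, float],
) -> List[str]:
    """Return the fault types whose target score is below the preset score.

    An empty list means the target IT equipment has no fault.
    """
    return [
        fault_type
        for fault_type, score in target_scores.items()
        if score < preset_scores[fault_type]
    ]
```

The returned fault types, together with the performance index information, are what the method uses to derive the specific fault information.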

Preferably, the method further comprises:

Generating a fault code of the target IT equipment according to the specific fault information of the target IT equipment;

Searching a fault point corresponding to the fault code, positioning the fault point, and obtaining an electronic positioning result;

inquiring a fault solution corresponding to the fault point;

And displaying the fault solution and the electronic positioning result.
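The fault code, fault point, and solution lookup can be pictured as two table lookups followed by a display step. The sketch below uses in-memory dictionaries; the codes, locations, and solutions are hypothetical placeholders, and a real system would presumably query an equipment or maintenance database.

```python
from typing import Dict, NamedTuple


class FaultRecord(NamedTuple):
    location: str  # electronic positioning result for the fault point
    solution: str  # fault solution for that fault point


# Hypothetical lookup tables.
FAULT_CODES: Dict[str, str] = {"overheating": "E101", "disk_failure": "E202"}
FAULT_TABLE: Dict[str, FaultRecord] = {
    "E101": FaultRecord("rack 3 / unit 12 / fan module", "replace the fan module"),
    "E202": FaultRecord("rack 5 / unit 2 / disk bay 4", "replace the disk"),
}


def generate_fault_code(fault_type: str) -> str:
    """Map specific fault information (here, a fault type) to a fault code."""
    return FAULT_CODES[fault_type]


def locate_and_resolve(code: str) -> FaultRecord:
    """Look up the fault point and its solution for a fault code."""
    return FAULT_TABLE[code]


def display(record: FaultRecord) -> str:
    """Format the positioning result and solution for maintenance staff."""
    return f"Fault located at {record.location}; suggested fix: {record.solution}"
```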

Preferably, after the processed periodic performance data are clustered according to the preset fault type to obtain the clustering result, the method further includes:

Acquiring a target business operation related to the clustering result, acquiring a business input instruction related to the target business operation, and retrieving first data A₁ and second data A₂ related to the business input instruction from an equipment database and a management database;

Determining a first capacity of the first data A₁ and a second capacity of the second data A₂, and judging whether the first capacity and the second capacity are empty; if so, sending invalid feedback to the sub-component of the target IT device;

otherwise, determining a first code χ₁ of the first data A₁ and performing a first classification process;

Meanwhile, determining a second code χ₂ of the second data A₂ and performing a second classification process;

The formulas of the first and second classification processes are not reproduced in this text; their terms are defined as follows: i = 1, 2; the cumulative product is taken over the index values χ_ij of the different feature indexes j = 1, 2, ..., n in the i-th data; r_i(j+1) represents the feature code value of the (j+1)-th feature index in the i-th data; r_ij represents the feature code value of the j-th feature index in the i-th data; k1 represents the number of feature coding sequences of the j-th feature index in the i-th data; the formulas also use the sequence value of the k-th feature coding sequence in the j-th feature index in the i-th data, whose symbol is not reproduced; α_k represents the sequence weight of the k-th feature coding sequence in the j-th feature index in the i-th data; and n represents the number of feature indexes;

And according to the classification areas corresponding to the first classification result S₁ and the second classification result S₂, sending effective feedback to the corresponding sub-components in the target IT equipment, and sending the corresponding feedback information for display.

A machine room IT equipment fault accurate detection system for a data center, the system comprising:

The acquisition module is used for acquiring the performance index information of the target IT equipment according to the target period to obtain a periodic performance data sequence;

the processing module is used for carrying out normalization processing on the periodic performance data sequence;

The clustering module is used for clustering the processed periodic performance data according to a preset fault type to obtain a clustering result;

the calculation module is used for calculating the target abnormal value score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result;

And the judging module is used for judging whether the target IT equipment fails or not and specific failure information when the failure occurs according to the target abnormal value score corresponding to each preset failure type.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention.

FIG. 1 is a workflow diagram of a method for accurately detecting faults of computer room IT equipment in a data center;

FIG. 2 is another workflow diagram of a method for accurately detecting faults of machine room IT equipment in a data center according to the present invention;

FIG. 3 is a further workflow diagram of a method for accurately detecting a failure of a machine room IT device in a data center according to the present invention;

FIG. 4 is a schematic structural diagram of a system for accurately detecting faults of computer room IT equipment for a data center according to the present invention.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.

The existing fault detection method for machine room IT equipment collects the working parameters of the equipment and then checks for faults manually, one by one, against the collected parameters. This seriously wastes labor cost and greatly reduces subsequent maintenance efficiency; as a result, periodic maintenance of the machine room equipment is not carried out properly and the service life of the equipment is shortened. To solve these problems, this embodiment discloses a method for accurately detecting faults of machine room IT equipment for a data center.

A machine room IT equipment fault accurate detection method for a data center, as shown in FIG. 1, comprises the following steps:

step S101, acquiring performance index information of target IT equipment according to a target period to obtain a periodic performance data sequence;

Step S102, carrying out normalization processing on the periodic performance data sequence;

step S103, clustering the processed periodic performance data according to a preset fault type to obtain a clustering result;

step S104, calculating a target abnormal value score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result;

and step S105, judging whether the target IT equipment fails or not and specific failure information when the failure occurs according to the target abnormal value score corresponding to each preset failure type.

The working principle of the technical scheme is as follows: acquiring performance index information of target IT equipment according to a target period, obtaining a periodic performance data sequence, carrying out normalization processing on the periodic performance data sequence, clustering the processed periodic performance data according to preset fault types, obtaining a clustering result, calculating a target abnormal value score of a periodic performance data subsequence corresponding to each preset fault type in the clustering result, and judging whether the target IT equipment has faults or not and specific fault information when the faults occur according to the target abnormal value score corresponding to each preset fault type.

The beneficial effects of the technical scheme are as follows: by clustering the periodic performance data sequence and calculating the target abnormal value score corresponding to each preset fault type, whether the target IT equipment has failed, and the specific fault information when it has, can be determined intelligently from the calculated scores. Faults need not be checked manually one by one, so the specific fault of the target IT equipment is determined quickly and accurately, labor cost is saved, maintenance staff can proceed promptly with follow-up maintenance, and the service life of the target IT equipment is prolonged. This solves the problems of the prior art, in which checking faults manually one by one against the collected working parameters seriously wastes labor cost and greatly reduces subsequent maintenance efficiency, so that periodic maintenance of the machine room equipment is not carried out properly and the service life of the equipment is shortened.

In one embodiment, the collecting the performance index information of the target IT device according to the target period to obtain the periodic performance data sequence includes:

Determining a used time length of the target IT device;

The beneficial effects of the technical scheme are as follows: the periodic performance data sequence of the target IT equipment is more accurate and practical, which provides a data guarantee for the subsequent fault judging process. In addition, because the detection period is set intelligently, its length can be determined flexibly according to how long the target IT equipment has been in use, so that the target IT equipment can be checked for faults frequently and its service life is further prolonged.

In one embodiment, the normalizing the periodic performance data sequence includes: and carrying out maximum and minimum normalization processing on the periodic performance data sequence.

In one embodiment, as shown in FIG. 2, the clustering the processed periodic performance data according to the preset fault type to obtain a clustering result includes:

Step S201, determining a target performance index related to each preset fault type;

step S202, obtaining target performance data corresponding to the target performance index from the periodic performance data;

step S203, clustering the processed periodic performance data according to the target performance data corresponding to each preset fault type to obtain a clustering result.

The beneficial effects of the technical scheme are as follows: the performance data corresponding to each preset fault type in the periodic performance data can be quickly and intuitively determined by classifying the performance data, so that the classification process is simpler and more convenient.

In one embodiment, the calculating the target outlier score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result includes:

Determining a functional level value of the target IT device;

The beneficial effects of the technical scheme are as follows: by adjusting the calculated outlier score according to the working function level value of the target IT device, whether the target IT device is malfunctioning can be evaluated realistically, taking into account the working capability of the device itself.

In one embodiment, the determining whether the target IT device fails and specific failure information when the failure occurs according to the target outlier score corresponding to each preset failure type includes:

The beneficial effects of the technical scheme are as follows: comparing scores gives a direct and intuitive way to determine whether the target IT equipment has failed, which improves the judging efficiency.

In one embodiment, as shown in FIG. 3, the method further comprises:

step S301, generating a fault code of the target IT equipment according to the specific fault information of the target IT equipment;

Step S302, searching a fault point corresponding to the fault code, positioning the fault point and obtaining an electronic positioning result;

Step S303, inquiring a fault solution corresponding to the fault point;

And step S304, displaying the fault solution and the electronic positioning result.

The beneficial effects of the technical scheme are as follows: the maintenance personnel can quickly and accurately know the fault point of the fault of the target IT equipment and a specific maintenance scheme, and further can quickly maintain the target IT equipment so as to ensure the working efficiency and the service quality of the target IT equipment, and the experience of a user is further improved.

In one embodiment, the clustering the processed periodic performance data according to the preset fault type, after obtaining the clustering result, further includes:

The beneficial effects of the technical scheme are as follows: whether the classification of the clustering result is valid, and whether it has a reasonable effect on the target IT equipment, can be confirmed accurately. Furthermore, by sending effective or invalid feedback to the corresponding sub-components in the target IT equipment, whether the target IT equipment has a particular fault can be confirmed accurately from the feedback result. This also realizes fault judgment of the target IT equipment from another angle and improves the fault judgment accuracy.

The embodiment also discloses a machine room IT equipment fault accurate detection system for a data center. As shown in FIG. 4, the system includes:

The acquisition module 401 is configured to acquire performance index information of the target IT device according to a target period, and obtain a periodic performance data sequence;

A processing module 402, configured to normalize the periodic performance data sequence;

The clustering module 403 is configured to cluster the processed periodic performance data according to a preset fault type, and obtain a clustering result;

a calculation module 404, configured to calculate a target outlier score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result;

And the judging module 405 is configured to judge whether the target IT device fails or not and specific failure information when the failure occurs according to the target outlier score corresponding to each preset failure type.

The working principle and the beneficial effects of the above technical solution are described in the method claims, and are not repeated here.

It will be appreciated by those skilled in the art that the first and second aspects of the present invention refer to different phases of application.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A machine room IT equipment fault accurate detection method for a data center is characterized by comprising the following steps:

normalizing the periodic performance data sequence;

Judging whether the target IT equipment fails or not and specific failure information when the failure occurs according to the target abnormal value score corresponding to each preset failure type;

The calculating the target abnormal value score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result comprises the following steps:

Determining a functional level value of the target IT device;

2. The method for accurately detecting the fault of the machine room IT equipment for the data center according to claim 1, wherein the step of collecting the performance index information of the target IT equipment according to the target period to obtain the periodic performance data sequence includes:

Determining a used time length of the target IT device;

3. The method for accurately detecting the fault of the computer room IT equipment in the data center according to claim 1, wherein the normalizing the periodic performance data sequence includes: and carrying out maximum and minimum normalization processing on the periodic performance data sequence.

4. The method for accurately detecting the faults of the computer room IT equipment in the data center according to claim 1, wherein the clustering the processed periodic performance data according to the preset fault type to obtain a clustering result includes:

Determining a target performance index associated with each preset fault type;

5. The method for accurately detecting the fault of the machine room IT equipment in the data center according to claim 1, wherein the judging whether the target IT equipment fails or not and the specific fault information when the fault occurs according to the target outlier score corresponding to each preset fault type includes:

6. The machine room IT equipment fault accurate detection method for a data center of claim 1, further comprising:

inquiring a fault solution corresponding to the fault point;

And displaying the fault solution and the electronic positioning result.

7. The method for accurately detecting faults of computer room IT equipment in a data center according to claim 1, wherein the clustering of the processed periodic performance data according to a preset fault type is performed, and after a clustering result is obtained, the method further comprises:

Otherwise, determining a first code χ₁ of the first data A₁ and performing first classification processing;

At the same time, determining a second code χ₂ of the second data A₂ and performing second classification processing;

The formulas of the first and second classification processing are not reproduced in this text; their terms are defined as follows: i = 1, 2; the cumulative product is taken over the index values χ_ij of the different feature indexes j = 1, 2, ..., n in the i-th data; r_i(j+1) represents the feature code value of the (j+1)-th feature index in the i-th data; r_ij represents the feature code value of the j-th feature index in the i-th data; k1 represents the number of feature coding sequences of the j-th feature index in the i-th data; the formulas also use the sequence value of the k-th feature coding sequence in the j-th feature index in the i-th data, whose symbol is not reproduced; α_k represents the sequence weight of the k-th feature coding sequence in the j-th feature index in the i-th data; and n is the number of feature indexes;

And according to the classification areas corresponding to the first classification result S₁ and the second classification result S₂, effectively feeding back to the corresponding sub-components in the target IT equipment and sending the corresponding feedback information for display.

8. A machine room IT equipment fault accurate detection system for a data center, characterized in that the system includes:

The judging module is used for judging whether the target IT equipment fails or not and specific failure information when the failure occurs according to the target abnormal value score corresponding to each preset failure type;

The calculation module is used for calculating a target abnormal value score of the periodic performance data subsequence corresponding to each preset fault type in the clustering result, and the method comprises the following steps:

Determining a functional level value of the target IT device;