CN113296994A - Fault diagnosis system and method based on domestic computing platform - Google Patents
Fault diagnosis system and method based on domestic computing platform
- Publication number
- CN113296994A (application number CN202110540400.1A)
- Authority
- CN
- China
- Prior art keywords
- fault
- data
- characteristic
- module
- computing platform
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0775—Content or structure details of the error report, e.g. specific table structure, specific error fields
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3051—Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Quality & Reliability (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Animal Behavior & Ethology (AREA)
- Evolutionary Biology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Test And Diagnosis Of Digital Computers (AREA)
Abstract
The invention relates to a fault diagnosis system and method based on a domestic computing platform, and belongs to the technical field of computer fault diagnosis. The invention implements platform characteristic data acquisition based on the IPMI standard protocol and a custom protocol, characteristic data processing based on multiple fusion strategies, and fault diagnosis based on a fault knowledge base, and combines multiple optimization strategies to obtain a complete system and method for diagnosing faults of the computing platform.
Description
Technical Field
The invention relates to the technical field of information security, in particular to a fault diagnosis system and method based on a domestic computing platform.
Background
At present, the complexity, integration, and degree of intelligence of domestic computing platform systems are continuously increasing, and the costs of development, production, and especially maintenance and support keep rising. Meanwhile, the growing number of constituent parts and influencing factors gradually increases the probability of faults and functional failures of the whole computing platform. Against this background, almost all domestic computing systems urgently need accurate fault diagnosis of their devices. However, the application of fault diagnosis technology to domestic computing platforms remains at a fairly basic level, while such platforms place high demands on the speed and accuracy of equipment fault diagnosis. A complete fault diagnosis system and method based on a domestic computing platform is therefore needed.
Disclosure of Invention
(I) Technical problem to be solved
The technical problem to be solved by the invention is as follows: how to design a complete set of fault diagnosis system and method based on a domestic computing platform.
(II) Technical scheme
In order to solve the technical problem, the invention provides a fault diagnosis system based on a domestic computing platform, which comprises a platform characteristic data acquisition module, a data fusion processing module and a state monitoring and fault diagnosis module;
the platform characteristic data acquisition module is used for acquiring data of a specific part of a domestic computing platform in real time through a sensor;
the data fusion processing module is used for analyzing and processing the collected characteristic data and removing redundant meaningless characteristic parameters;
and the state monitoring and fault diagnosis module classifies the fused data according to the parameter indexes of the domestic computing platform to be monitored and the type information related to the domestic computing platform, performs similarity matching calculation on the classified information, and takes the information with the highest matching degree as the analysis result.
Preferably, before feature extraction, the data fusion processing module first comprehensively processes the characteristic data of each module of the domestic computing platform, then determines the parameter indexes to be monitored so that they cover, to the greatest possible extent, the parameter features that can indicate whether the equipment is faulty, and then starts feature attribute selection.
Preferably, the state monitoring and fault diagnosis module draws on the concept of an expert system and uses it to evaluate the health of the domestic computing platform according to the state information data of each module of the platform. The state monitoring and fault diagnosis module comprises a fault knowledge database, a platform fault reasoning submodule, and a platform fault database management submodule. The fault knowledge database stores the data and engineering technical information required for state monitoring and fault diagnosis of the computing platform, including fault cases, characteristic parameters, fault-related factors, fault phenomena, and fault handling operations. The platform fault reasoning submodule applies fault-case-based reasoning, with different strategies set according to the damage mechanisms and data analysis requirements of the different modules of the domestic computing platform. The platform fault database management submodule provides management operations on the fault knowledge data, including addition, deletion, modification, and query.
Preferably, the platform characteristic data acquisition module defines a custom communication protocol on the operating system of the domestic computing platform; it extracts and encapsulates IPMI data, parses the data, classifies the parsed information, and transmits the result to the domestic computing platform.
The invention also provides a method for realizing fault diagnosis by using the fault diagnosis system, which comprises the following steps:
step one, constructing a fault knowledge database
The fault knowledge database comprises the following five parts: fault case number, fault phenomenon, fault location, system to which the fault belongs, and fault handling operation. During the design of each module of the domestic computing platform, the software, hardware, and environmental factors that may cause, or are known to cause, a board or whole-machine fault are analyzed, a relational logic diagram is drawn, the fault cases and fault characteristic factors of each module are analyzed, and the fault knowledge is stored in the platform fault knowledge database in list form;
step two, when the domestic computing platform is abnormal in operation, the platform characteristic data acquisition module acquires characteristic data information of the domestic computing platform in real time through IPMI and a custom communication protocol, the data fusion processing module performs data abstraction on the characteristic data after fusion processing to obtain a list of related characteristic attributes, and the data fusion process comprises two key processing operations: firstly, word segmentation processing is carried out, the feature data of the complete character string are decomposed into independent words, and redundant data or features in the words are deleted; second, substitution treatment, the entry after word segmentation treatment is replaced by the entry in the established professional term;
step three, the state monitoring and fault diagnosis module classifies the characteristic attributes obtained in step two according to the fault data information and the type information related to the domestic computing platform, and then performs a preliminary matching retrieval; because fault phenomena are stored in the fault knowledge database as character strings, the preliminary matching uses a fuzzy query algorithm based on string matching to search the platform fault knowledge database for fault cases similar to the current data;
step four, the state monitoring and fault diagnosis module performs similarity matching calculation on the classified information to obtain a similarity matrix b;
and step five, multiplying the obtained similarity matrix b by the weight value of the corresponding characteristic attribute by the state monitoring and fault diagnosis module to obtain a final result.
Preferably, the principle by which the state monitoring and fault diagnosis module performs similarity matching calculation on the classified information is as follows: the characteristic attributes in the fault knowledge database are assumed to be distributed as points in an n-dimensional feature space according to a certain rule; a retrieval system built on these points takes a characteristic attribute as input and finds similar points, i.e., similar characteristic information, according to spatial distance; the distance between points in the space serves as the evaluation scale, and weighting is used to determine the similarity between a new fault feature and the existing fault features in the fault knowledge database.
Preferably, the specific implementation manner of the state monitoring and fault diagnosing module performing similarity matching calculation on the classified information is as follows:
comparing the characteristic attributes of the classified information one by one with the characteristic attributes of the n fault cases in the fault knowledge database to obtain a similarity matrix $b = (b_{nm})$,
where n denotes the n-th fault case in the fault knowledge database and m denotes the m-th characteristic attribute of fault case n. The similarity $b_{nm}$ is calculated as follows:
let $x = \{x_1, x_2, \ldots, x_n\}$, where x is the set of all characteristic attributes of a fault case and $x_i$ ($1 \le i \le n$) is its i-th characteristic attribute. For two points (i.e., two fault cases) $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$ in the n-dimensional feature space, the distance between them is computed from the weights $w_i$ and the comparison values $a(x_i, y_i)$, where
$w_i$ is the i-th weight value in the fault case, and $x_i$ and $y_i$ are the i-th characteristic attributes of fault cases x and y, respectively. When $x_i \ne y_i$, $a(x_i, y_i)$ takes the value 1; when $x_i = y_i$, $a(x_i, y_i)$ takes the value 0. This distance is computed between the characteristic attributes $x_i$ of the classified fault case x and the characteristic attributes $y_i$ of a fault case y in the fault knowledge database, and the similarity between x and y is finally derived from it.
Replacing x with n and y with m gives $b_{nm}$;
Preferably, in step five, the state monitoring and fault diagnosis module multiplies the obtained similarity matrix b by the weight values of the corresponding characteristic attributes to obtain the final result matrix t.
The entries of the matrix t are then sorted, and the entry X with the maximum matching degree is taken as the analysis result.
The result X is the best matching solution, and the fault case with the highest similarity can then be extracted.
Preferably, after step five, the handling measures corresponding to the fault case are retrieved from the fault knowledge database, the fault is analyzed, judged, and corrected according to the diagnosis conclusion and the proposed handling measures, and a solution to the current fault is found.
Preferably, if the fault information is not found in the fault knowledge database, correspondingly executing an adding operation, and adding corresponding new knowledge to the fault knowledge database by using an instance learning method, wherein the operation steps are as follows:
step 1, adding a new fault in a fault knowledge database;
step 2, finding a new fault diagnosis action interface;
step 3, adding a new diagnosis state for the diagnosis action;
step 4, adding a new diagnosis action, and circularly executing the step 3;
and step 5, adding a fault processing action.
(III) Advantageous effects
The invention implements platform characteristic data acquisition based on the IPMI standard protocol and a custom protocol, characteristic data processing based on multiple fusion strategies, and fault diagnosis based on a fault knowledge base, and combines multiple optimization strategies to obtain a complete system and method for diagnosing faults of the computing platform.
Drawings
FIG. 1 is a schematic diagram of a fault diagnosis system of the present invention;
FIG. 2 is a schematic diagram of a fault knowledge base implementation of the present invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
As shown in FIG. 1 and FIG. 2, the fault diagnosis system of the present invention performs fault diagnosis on a domestic computing platform. The domestic computing platform is built on the domestic Phytium (FeiTeng) platform and the Galaxy Kylin operating system, and is fully compatible with other domestic software and hardware platforms. The system comprises a platform characteristic data acquisition module, a data fusion processing module, and a state monitoring and fault diagnosis module;
the platform characteristic data acquisition module acquires data of a specific part of a domestic computing platform in real time through a sensor;
the data fusion processing module analyzes and processes the collected characteristic data and removes redundant meaningless characteristic parameters;
and the state monitoring and fault diagnosis module quickly classifies the fused data according to the parameter indexes of the domestic computing platform to be monitored and the type information related to the domestic computing platform, performs similarity matching calculation, and takes the result with the highest matching degree as the analysis result.
The domestic computing platform comprises a computing module, a power supply module, a switching module, and a Baseboard Management Controller (BMC). The platform characteristic data acquisition module also transmits the acquired data to a specified location through a protocol. Acquisition of the characteristic data of each module of the domestic computing platform relies on the standard IPMI protocol and other custom protocols. The BMC plays the main role: it manages communication between the other modules of the domestic computing platform and the platform characteristic data acquisition module of the fault diagnosis system, and provides monitoring and control functions for the computing module, the power supply module, and the switching module. Through the BMC, the platform characteristic data acquisition module can acquire information such as the temperature, voltage, current, load, and fan status of the computing module, the power supply module, and the switching module in real time.
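To make the data flow concrete, the following is a minimal sketch of how readings relayed by the BMC might be represented and grouped per module before fusion. The `SensorReading` type, its field names, and the sample values are hypothetical illustrations; the patent does not define a record format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorReading:
    """One sample relayed by the BMC (hypothetical record layout)."""
    module: str   # e.g. "computing", "power", "switching"
    sensor: str   # e.g. "temperature", "voltage", "fan"
    value: float
    unit: str


def group_by_module(readings):
    """Group readings per platform module so each module can be analysed separately."""
    grouped = defaultdict(list)
    for reading in readings:
        grouped[reading.module].append(reading)
    return dict(grouped)


# Illustrative values only, not measurements from a real platform.
readings = [
    SensorReading("computing", "temperature", 71.5, "degC"),
    SensorReading("power", "voltage", 11.9, "V"),
    SensorReading("computing", "fan", 5200.0, "RPM"),
]
by_module = group_by_module(readings)
```

Grouping by module mirrors the text's split into computing, power supply, and switching modules; a real acquisition layer would populate these records from IPMI responses rather than literals.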
The data fusion processing module realizes characteristic data analysis and processing according to two strategies:
First, in a complex domestic computing platform the modules to be monitored have many characteristic parameter indexes, and without preprocessing the subsequent computation would be unnecessarily complex. Therefore, before feature extraction, the characteristic data of each module of the domestic computing platform are comprehensively processed and redundant characteristic parameters are removed; the parameter indexes to be monitored are then determined so that they cover, to the greatest possible extent, the parameter features that can indicate whether the equipment is faulty; feature selection then begins.
Second, during data extraction several identical pieces of characteristic data sometimes appear, and recording all of them in the log causes data redundancy. Identical pieces of data are therefore merged into one, which simplifies generation and later analysis of the system log. In addition, when monitoring finds fault data that has never been encountered before, the new information is fused with the existing fault information: the new fault category is written into the database and matched with a corresponding solution, so as to achieve intelligent operation and maintenance.
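The merging of identical records described above can be sketched as follows. This is only an illustration under the assumption that records are comparable strings; the function name `merge_duplicates` and the sample log entries are hypothetical.

```python
from collections import Counter


def merge_duplicates(records):
    """Collapse identical records into (record, count) pairs, keeping first-seen order."""
    counts = Counter(records)
    # dict.fromkeys removes repeats while preserving insertion order.
    return [(record, counts[record]) for record in dict.fromkeys(records)]


# Hypothetical log entries for illustration.
log = ["fan speed low", "cpu temperature high", "fan speed low", "fan speed low"]
merged = merge_duplicates(log)
```

Keeping the occurrence count alongside the merged record preserves information that could be useful to later analysis, although the patent text only requires that duplicates be combined into one entry.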
The state monitoring and fault diagnosis module draws on the concept of an expert system and uses it to evaluate the health of the domestic computing platform according to the state information data of each module of the platform. It mainly comprises a platform fault knowledge database, a platform fault reasoning submodule, and a platform fault database management submodule.
While the computing platform is running, the BMC initializes the data acquisition sensors and monitors and records data such as the internal hardware temperature, fan operation, and CPU utilization of the domestic computing platform. The system event log records all states over the observable period, and the system can query the event log periodically. Meanwhile, the platform characteristic data acquisition module defines a custom communication protocol on the operating system of the domestic computing platform; it extracts and encapsulates IPMI data, parses the data, classifies the parsed information, and transmits the result to the domestic computing platform.
FIG. 2 is a flow chart of the operation of the condition monitoring and fault diagnosis module. The fault knowledge database stores data information and engineering technical data required by state monitoring and fault diagnosis of the computing platform, wherein the data information and the engineering technical data comprise fault cases, characteristic parameters, fault related factors, fault phenomena, fault processing operation and the like; the platform fault reasoning submodule is a reasoning method with different strategies set according to damage mechanisms and data analysis requirements of different modules of a domestic computing platform, and the reasoning method based on fault cases is mainly adopted; the platform fault database management sub-module mainly provides management operations for the fault knowledge database, including addition, deletion, modification, query and the like of the fault knowledge database.
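The management operations (addition, deletion, modification, query) can be illustrated with a small in-memory store. The sketch below uses Python's standard `sqlite3` module and a table whose five columns follow the five parts named in step one of the method (case number, phenomenon, location, system, handling operation). The table name, column names, helper functions, and sample rows are all assumptions made for illustration; the patent does not specify a storage engine or schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE fault_case (
           case_no    INTEGER PRIMARY KEY,  -- fault case number
           phenomenon TEXT NOT NULL,        -- fault phenomenon (stored as a string)
           location   TEXT NOT NULL,        -- fault location
           subsystem  TEXT NOT NULL,        -- system to which the fault belongs
           handling   TEXT NOT NULL         -- fault handling operation
       )"""
)


def add_case(phenomenon, location, subsystem, handling):
    cur = conn.execute(
        "INSERT INTO fault_case (phenomenon, location, subsystem, handling) "
        "VALUES (?, ?, ?, ?)",
        (phenomenon, location, subsystem, handling),
    )
    return cur.lastrowid


def modify_handling(case_no, handling):
    conn.execute("UPDATE fault_case SET handling = ? WHERE case_no = ?", (handling, case_no))


def delete_case(case_no):
    conn.execute("DELETE FROM fault_case WHERE case_no = ?", (case_no,))


def query_by_subsystem(subsystem):
    return conn.execute(
        "SELECT case_no, phenomenon, handling FROM fault_case "
        "WHERE subsystem = ? ORDER BY case_no",
        (subsystem,),
    ).fetchall()


# Hypothetical cases for illustration.
first = add_case("fan speed low", "fan 2", "computing module", "replace fan")
second = add_case("voltage out of range", "12 V rail", "power module", "check supply")
modify_handling(first, "clean or replace fan")
delete_case(second)
rows = query_by_subsystem("computing module")
```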
The fault diagnosis method realized by the fault diagnosis system comprises the following steps:
Step one: construct a fault knowledge database. The fault knowledge database of the computing platform mainly comprises five parts: fault case number, fault phenomenon, fault location, system to which the fault belongs, and fault handling operation. During the design of each module of the domestic computing platform, the software, hardware, environmental, and other factors that may cause, or are known to cause, a board or whole-machine fault are analyzed, a relational logic diagram is drawn, the common fault cases and fault characteristic factors of each module are analyzed, and the fault knowledge is stored in the platform fault knowledge database in list form.
Step two: when the domestic computing platform operates abnormally, the platform characteristic data acquisition module acquires characteristic data of the platform in real time through IPMI and the custom communication protocol. The data fusion processing module fuses the characteristic data and then abstracts it into a list of relevant characteristic attributes. The data fusion process mainly comprises two key operations: first, word segmentation, in which characteristic data such as complete character strings are split into individual words and redundant data or features among the words are deleted; second, substitution, in which each segmented term is replaced with the corresponding entry in the established glossary of professional terms.
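The two fusion operations in step two (word segmentation with removal of redundant words, then substitution with glossary terms) can be sketched as below. Whitespace splitting stands in for a real word segmenter, and the `STOP_WORDS` set, the `GLOSSARY` mapping, and the sample input are hypothetical; the patent does not list the actual vocabulary.

```python
# Hypothetical redundant words and glossary entries, for illustration only.
STOP_WORDS = {"the", "is", "a", "of", "on"}
GLOSSARY = {"temp": "temperature", "cpu": "processor", "hot": "overtemperature"}


def to_feature_attributes(raw):
    """Split a raw string into words, drop redundant ones, map the rest to glossary terms."""
    words = raw.lower().replace(",", " ").split()          # word segmentation (simplified)
    kept = [word for word in words if word not in STOP_WORDS]  # delete redundant words
    return [GLOSSARY.get(word, word) for word in kept]     # substitution

attrs = to_feature_attributes("The CPU temp is hot")
```

Normalizing to a fixed glossary before matching means that differently worded reports of the same condition produce the same attribute list, which is what makes the later equality-based comparison of attributes meaningful.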
Step three: the state monitoring and fault diagnosis module quickly classifies the characteristic attributes obtained in step two according to the fault data information and the type information related to the domestic computing platform, and then performs a preliminary matching retrieval that focuses on the similarity of fault phenomena. Because fault phenomena are stored in the fault knowledge database as character strings, the preliminary matching uses a fuzzy query algorithm based on string matching to search the platform fault knowledge database for fault cases similar to the current data.
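The patent does not name the fuzzy query algorithm used for preliminary matching. As one possible stand-in, the sketch below scores stored fault phenomena against the current description with `difflib.SequenceMatcher` from the Python standard library and keeps those above a cutoff. The function name, the cutoff of 0.6, and the sample strings are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_candidates(query, phenomena, cutoff=0.6):
    """Return stored phenomena whose string similarity to the query reaches the cutoff,
    ordered from most to least similar."""
    scored = [(SequenceMatcher(None, query, text).ratio(), text) for text in phenomena]
    return [text for score, text in sorted(scored, reverse=True) if score >= cutoff]


# Hypothetical stored fault phenomena.
stored = [
    "fan speed too low",
    "power supply voltage high",
    "cpu temperature too high",
]
candidates = fuzzy_candidates("fan speed low", stored)
```

This preliminary pass only narrows the set of candidate cases; the weighted attribute comparison in step four then decides among them.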
Step four: the state monitoring and fault diagnosis module performs similarity matching calculation on the classified information. The main principle is as follows: the characteristic attributes in the fault knowledge database are assumed to be distributed as points in an n-dimensional feature space according to a certain rule. A retrieval system built on these points takes a characteristic attribute as input and quickly finds similar points, i.e., similar characteristic information, according to spatial distance. The distance between points in the space serves as the evaluation scale, and weighting is used to determine the similarity between a new fault feature and the existing fault features in the fault knowledge database. The specific implementation is as follows:
The characteristic attributes of the classified information are compared one by one with the characteristic attributes of the n fault cases in the fault knowledge database to obtain a similarity matrix $b = (b_{nm})$,
where n denotes the n-th fault case in the fault knowledge database and m denotes the m-th characteristic attribute of fault case n. The similarity $b_{nm}$ is calculated as follows:
Let $x = \{x_1, x_2, \ldots, x_n\}$, where x is the set of all characteristic attributes of a fault case and $x_i$ ($1 \le i \le n$) is its i-th characteristic attribute. For two points (i.e., two fault cases) $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$ in the n-dimensional feature space, the distance between them is computed from the weights $w_i$ and the comparison values $a(x_i, y_i)$, where
$w_i$ is the i-th weight value in the fault case, and $x_i$ and $y_i$ ($1 \le i \le n$) are the i-th characteristic attributes of fault cases x and y, respectively. When $x_i \ne y_i$, $a(x_i, y_i)$ takes the value 1; when $x_i = y_i$, $a(x_i, y_i)$ takes the value 0. This distance is computed between the characteristic attributes $x_i$ of the classified fault case x and the characteristic attributes $y_i$ of a fault case y in the fault knowledge database. The similarity between x and y is finally derived from this distance.
Replacing x with n and y with m gives $b_{nm}$.
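The distance and similarity formulas themselves are not reproduced in this text, so the following is only a sketch under stated assumptions: the distance is taken to be the weighted sum $d(x, y) = \sum_i w_i\, a(x_i, y_i)$ with weights that sum to 1, and the similarity is taken to be $1 - d(x, y)$. Only the weights $w_i$ and the 0/1 comparison $a(x_i, y_i)$ come from the text; the exact functional form, the function names, and the sample values are assumptions.

```python
def mismatch(xi, yi):
    """a(x_i, y_i) from the text: 1 when the attributes differ, 0 when they are equal."""
    return 0 if xi == yi else 1


def weighted_distance(x, y, weights):
    """Assumed form of the distance: sum of w_i * a(x_i, y_i)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return sum(w * mismatch(xi, yi) for w, xi, yi in zip(weights, x, y))


def similarity(x, y, weights):
    """Assumed form of the similarity: 1 - distance, with weights summing to 1."""
    return 1.0 - weighted_distance(x, y, weights)


# Hypothetical attribute tuples and weights.
new_case = ("computing module", "temperature", "high")
stored_case = ("computing module", "temperature", "low")
weights = (0.5, 0.25, 0.25)
sim = similarity(new_case, stored_case, weights)
```

With these assumptions, identical cases have distance 0 and similarity 1, and each differing attribute lowers the similarity by its weight.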
Step five: the state monitoring and fault diagnosis module multiplies the obtained similarity matrix b by the weight values of the corresponding characteristic attributes to obtain the final result matrix t.
The entries of the matrix t are then sorted, and the entry X with the maximum matching degree is taken as the analysis result.
The result X is the best matching solution, and the fault case with the highest similarity can then be extracted. The handling measures corresponding to that fault case are then retrieved from the fault knowledge database, similar faults are analyzed, judged, and corrected according to the diagnosis conclusion and the proposed handling measures, and a solution to the current fault is found.
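Step five can be sketched as follows, assuming that each entry $b_{nm}$ is 1 when attribute m of the query equals attribute m of stored case n and 0 otherwise, and that the score of case n is the weighted sum of its row. The binary form of $b_{nm}$, the function name `score_cases`, and the sample data are assumptions, since the formulas for b and t are not reproduced in this text.

```python
def score_cases(query, cases, weights):
    """Return (case_id, score) pairs sorted from best to worst match."""
    scores = []
    for case_id, attrs in cases.items():
        # One row of the similarity matrix b (assumed binary per attribute).
        b_row = [1.0 if q == a else 0.0 for q, a in zip(query, attrs)]
        # Weighted row sum: the entry of t for this case.
        t = sum(b * w for b, w in zip(b_row, weights))
        scores.append((case_id, t))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical knowledge base entries and weights.
cases = {
    1: ("computing module", "temperature", "high"),
    2: ("power module", "voltage", "low"),
    3: ("switching module", "fan", "low"),
}
weights = (0.5, 0.25, 0.25)
query = ("computing module", "temperature", "low")
ranking = score_cases(query, cases, weights)
best_case, best_score = ranking[0]
```

The first element of `ranking` plays the role of X in the text: the stored case with the highest matching degree, whose handling measures would then be looked up.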
Meanwhile, the platform fault database management submodule mainly performs addition, modification, and deletion on the fault knowledge database. First, an initial fault knowledge database is established and populated with information such as historical debugging data and engineering experience. When the system starts diagnosis, the processed characteristic attributes of each module of the domestic computing platform are input to the platform fault reasoning submodule, which runs reasoning and judgment as required by the diagnosis process, performs similarity matching against the relevant fault information in the fault knowledge database, and evaluates the health of each module of the domestic computing platform. If a fault occurs and the fault knowledge database contains no matching fault information, an addition operation is executed, and the corresponding new knowledge is added to the fault knowledge database by an example learning method. The main operation steps are as follows:
Step 1: add a new fault to the fault knowledge database;
Step 2: find the new fault diagnosis action interface;
Step 3: add a new diagnosis state for the diagnosis action;
Step 4: add a new diagnosis action, and repeat step 3;
Step 5: add a fault processing action.
The modify and delete operations in the fault knowledge database are similar to the above operational steps.
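The addition steps above can be sketched as a small in-memory data structure. This is an illustration only: the classes `FaultCase`, `DiagnosisAction`, and `FaultKnowledgeBase`, their fields, and the sample fault are all hypothetical, and the text does not specify a storage schema or how the diagnosis action interface is located.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiagnosisAction:
    name: str
    states: List[str] = field(default_factory=list)  # diagnosis states


@dataclass
class FaultCase:
    name: str
    diagnosis_actions: List[DiagnosisAction] = field(default_factory=list)
    processing_actions: List[str] = field(default_factory=list)


class FaultKnowledgeBase:
    """Hypothetical in-memory stand-in for the fault knowledge database."""

    def __init__(self) -> None:
        self.cases: Dict[str, FaultCase] = {}

    def add_fault(self, name: str, actions: Dict[str, List[str]],
                  processing: List[str]) -> FaultCase:
        """Add a fault that is not yet in the knowledge base.

        actions maps each diagnosis action name to its list of states.
        """
        if name in self.cases:
            raise KeyError(f"fault {name!r} already exists")
        case = FaultCase(name)                        # step 1: new fault
        for action_name, states in actions.items():   # steps 2 and 4
            action = DiagnosisAction(action_name)
            action.states.extend(states)              # step 3: new states
            case.diagnosis_actions.append(action)
        case.processing_actions.extend(processing)    # step 5
        self.cases[name] = case
        return case


kb = FaultKnowledgeBase()
case = kb.add_fault(
    "fan speed below threshold",
    {"read fan tachometer": ["reading normal", "reading low"]},
    ["replace fan module"],
)
print(case.diagnosis_actions[0].states)
```

Modification and deletion would operate on the same dictionary of cases, which is one way to read the statement that they are similar to addition.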
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A fault diagnosis system based on a domestic computing platform is characterized by comprising a platform characteristic data acquisition module, a data fusion processing module and a state monitoring and fault diagnosis module;
the platform characteristic data acquisition module is used for acquiring data of a specific part of a domestic computing platform in real time through a sensor;
the data fusion processing module is used for analyzing and processing the collected characteristic data and removing redundant and meaningless characteristic parameters;
and the state monitoring and fault diagnosis module is used for classifying, for the parameter indexes of the domestic computing platform to be monitored, the data information obtained after the fusion processing according to that data information and the type information related to the domestic computing platform, performing similarity matching calculation on the classified information, and taking the information with the maximum matching degree as the analysis result.
2. The system of claim 1, wherein, before feature extraction, the data fusion processing module first comprehensively processes the characteristic data of each module of the domestic computing platform, then determines the parameter indexes to be monitored so as to cover, to the greatest extent possible, the parameter features capable of indicating whether the equipment is faulty, and then begins characteristic attribute selection.
3. The system of claim 1, wherein the state monitoring and fault diagnosis module adopts an expert-system approach to evaluate the health condition of the domestic computing platform according to the state information data of each module of the domestic computing platform, and comprises a fault knowledge database, a platform fault reasoning submodule, and a platform fault database management submodule; the fault knowledge database stores the data information and engineering technical data required for state monitoring and fault diagnosis of the computing platform, including fault cases, characteristic parameters, fault-related factors, fault phenomena, and fault processing operations; the platform fault reasoning submodule applies fault-case-based reasoning, with different strategies set according to the damage mechanisms and data analysis requirements of the different modules of the domestic computing platform; the platform fault database management submodule provides management operations on the fault knowledge data, including addition, deletion, modification, and query of the fault knowledge database.
4. The system of claim 1, wherein the platform characteristic data acquisition module defines a communication protocol on the operating system of the domestic computing platform, is responsible for extracting and packaging IPMI data, parses the data and classifies the parsed information, and transmits the result to the domestic computing platform.
5. A method for performing fault diagnosis using the fault diagnosis system according to any one of claims 1 to 4, comprising the steps of:
step one, constructing a fault knowledge database:
the fault knowledge database comprises five parts; during the design of each module of the domestic computing platform, the software, hardware, and environmental factors that may cause, or are known to cause, a fault of a board card or of the whole machine are analyzed, a relational logic diagram is drawn, the fault cases and fault characteristic factors of each module are analyzed, and the fault knowledge is stored in the platform fault knowledge database in list form;
step two, when the domestic computing platform operates abnormally, the platform characteristic data acquisition module acquires characteristic data information of the domestic computing platform in real time through IPMI and a custom communication protocol, and the data fusion processing module, after fusion processing, abstracts the characteristic data to obtain a list of related characteristic attributes, the data fusion process comprising two key processing operations: first, word segmentation, in which characteristic data in the form of a complete character string is decomposed into independent words and redundant data or features among the words are deleted; second, substitution, in which the entries produced by word segmentation are replaced with entries from the established professional terminology;
step three, the state monitoring and fault diagnosis module classifies the characteristic attributes obtained in step two according to fault data information and type information related to the domestic computing platform, and then performs a preliminary matching retrieval; since fault phenomena are stored in the fault knowledge database as character strings, the preliminary matching uses a fuzzy query algorithm based on character string matching to search the platform fault knowledge database for fault cases similar to the current data information;
step four, the state monitoring and fault diagnosis module performs similarity matching calculation on the classified information to obtain a similarity matrix b;
and step five, the state monitoring and fault diagnosis module multiplies the obtained similarity matrix b by the weight value of the corresponding characteristic attribute to obtain a final result.
6. The method of claim 5, wherein the similarity matching calculation performed on the classified information by the state monitoring and fault diagnosis module follows this principle: the characteristic attributes in the fault knowledge database are assumed to be distributed as points in an n-dimensional feature space according to a certain rule; after characteristic attributes are input, a retrieval system built on the point-based characteristic information finds similar points, that is, similar characteristic information, according to spatial distance; the distance between points in the space serves as the evaluation scale, and weighting is used as the judgment method to determine the similarity between the new fault characteristics and the existing fault characteristics in the fault knowledge database.
7. The method of claim 6, wherein the similarity matching calculation performed on the classified information by the state monitoring and fault diagnosis module is implemented as follows:
the characteristic attributes of the classified information are compared one by one with the characteristic attributes of the n fault cases in the fault knowledge database to obtain a similarity matrix b with entries $b_{nm}$,
where n denotes the nth fault case in the fault knowledge database and m denotes the mth characteristic attribute of fault case n; the similarity $b_{nm}$ is calculated as follows:
let $x = \{x_1, x_2, \ldots, x_n\}$, where x is the set of all characteristic attributes of a fault case and $x_i$ is the ith characteristic attribute of the fault case, $1 \le i \le n$; for two points (that is, two fault cases) $x = \{x_1, x_2, \ldots, x_n\}$ and $y = \{y_1, y_2, \ldots, y_n\}$ in the n-dimensional feature space, a weighted distance in the feature space is defined,
in which $w_i$ denotes the ith weight value of the fault case, and $x_i$ and $y_i$ are the ith characteristic attributes of fault cases x and y, respectively; $a(x_i, y_i)$ takes the value 1 when $x_i \ne y_i$ and the value 0 when $x_i = y_i$; the distance formula gives the distance between the characteristic attribute $x_i$ of the classified fault case x and the characteristic attribute $y_i$ of fault case y in the fault knowledge database, from which the similarity between x and y is finally calculated;
replacing x with n and y with m gives $b_{nm}$.
8. The method according to claim 7, wherein, in step five, the state monitoring and fault diagnosis module multiplies the obtained similarity matrix b by the weight value of the corresponding characteristic attribute to obtain the final result, the matrix t;
the matrix t is statistically sorted, and the entry X with the maximum matching degree is taken as the analysis result;
the result X is the best-matching solution, and the fault case with the highest similarity can then be extracted.
9. The method of claim 7, wherein, after step five, the processing measures for the corresponding fault case are further retrieved from the fault knowledge database, the similar fault is analyzed, judged, and revised, the fault is corrected according to the conclusion reached and the proposed processing measures, and a solution to the current fault is found.
10. The method of claim 5, wherein, if the fault knowledge database contains no information about the fault, an addition operation is performed and the corresponding new knowledge is added to the fault knowledge database using an instance learning method, with the following operation steps:
step 1, adding a new fault to the fault knowledge database;
step 2, finding the new fault diagnosis action interface;
step 3, adding a new diagnosis state for the diagnosis action;
step 4, adding a new diagnosis action, and repeating step 3;
and step 5, adding a fault processing action.
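Steps two and three of claim 5 (word segmentation, term substitution, and fuzzy string retrieval) can be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions: the term table `TERMS`, the redundant-word set `STOP_WORDS`, the function names, the 0.6 cutoff, and the English sample strings are all hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever fuzzy query algorithm the method actually uses. Segmenting Chinese text would require a dedicated word segmenter rather than the regular expression used here.

```python
import difflib
import re
from typing import Dict, List, Tuple

# Hypothetical table mapping raw tokens to established terminology.
TERMS: Dict[str, str] = {"temp": "temperature", "cpu0": "processor", "hi": "high"}
# Hypothetical tokens treated as redundant.
STOP_WORDS = {"the", "is", "a", "of"}


def normalize(raw: str) -> List[str]:
    """Split a raw feature string into words, drop redundant ones,
    and replace the rest with established terms where one exists."""
    tokens = re.findall(r"[a-z0-9]+", raw.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [TERMS.get(t, t) for t in kept]


def fuzzy_candidates(query: str, phenomena: List[str],
                     cutoff: float = 0.6) -> List[Tuple[str, float]]:
    """Return stored fault phenomena whose string similarity to the
    query is at least cutoff, best match first."""
    scored = []
    for text in phenomena:
        ratio = difflib.SequenceMatcher(None, query, text).ratio()
        if ratio >= cutoff:
            scored.append((text, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stored fault phenomena and a raw reading to look up.
stored = ["fan speed low",
          "processor temperature too high",
          "processor temperature high"]
query = " ".join(normalize("The CPU0 temp is hi"))
matches = fuzzy_candidates(query, stored)
for text, ratio in matches:
    print(f"{ratio:.2f}  {text}")
```

The candidates returned by this preliminary retrieval would then be passed to the weighted similarity calculation of step four.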
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110540400.1A CN113296994A (en) | 2021-05-18 | 2021-05-18 | Fault diagnosis system and method based on domestic computing platform |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113296994A true CN113296994A (en) | 2021-08-24 |
Family
ID=77322653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110540400.1A Pending CN113296994A (en) | 2021-05-18 | 2021-05-18 | Fault diagnosis system and method based on domestic computing platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113296994A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||