Disclosure of Invention
The invention aims to provide a multi-source fault collaborative analysis and positioning method for a large-scale system, which improves the capability to analyze and locate system faults automatically and solves the problem of accurately locating faults in large-scale parallel systems.
To achieve this purpose, the invention adopts the following technical scheme: a multi-source fault collaborative analysis and positioning method for a large-scale system, based on the following modules:
an IPMI protocol fault monitoring module, used for collecting faults from commercial servers;
an SNMP protocol fault monitoring module, used for collecting faults from network equipment;
a node heartbeat fault monitoring module, used for collecting software faults of special-purpose equipment;
a maintenance access fault monitoring module, used for collecting hardware faults of special-purpose equipment;
a system core fault monitoring module, used for monitoring various system environment faults and Panic information and notifying the maintenance system service by way of registers so that they are registered in the database;
a resource management fault monitoring module, used for monitoring software and hardware states and registering faults in the database through the resource management service;
a job launch and runtime fault monitoring module, used for checking service states when a job is launched and while it runs, detecting various faults of the file system and the running environment, and registering them in the database through the job system service;
a runtime environment anomaly detection module, used for detecting faults and anomalies of the component services in use, passing them outward through an interface provided by the job system, and registering them in the database;
a customized out-of-band maintenance system module, used for monitoring hardware faults of computing resources and registering them in the database through the maintenance system service;
a fault analysis system, used for de-duplicating, screening and merging the data collected by the fault monitoring modules;
a system information base, used for storing the data collected by the fault monitoring modules and the data generated by the analysis of the fault analysis system;
the analysis and positioning method comprises the following steps:
S1, the fault analysis system uniformly classifies the faults collected by the fault monitoring modules, defines a fault code Fid for each fault, and defines an upper association list Fuplist and a lower association list Fdownlist for each fault, wherein the upper association list Fuplist contains the fault codes Fid of the faults that can induce this fault, and the lower association list Fdownlist contains the fault codes Fid of the faults that this fault can induce;
S2, the fault analysis system receives the faults sent by the fault monitoring modules and forms a current fault list;
S3, the fault analysis system performs upper-lower association analysis on the current fault list;
S4, the fault analysis system reads a fault Fk from the current fault list; if the read succeeds, go to S5; if the read fails, jump to S2;
S5, the fault analysis system examines the upper and lower association lists of Fk; if both lists are empty, jump to S10; otherwise, go to S6;
S6, the fault analysis system reads the lower association list Fdownlist of Fk; if Fdownlist is empty, jump to S8; otherwise, go to S7;
S7, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the lower association list Fdownlist of Fk; if so, that fault is deleted from the current fault list and its fault site information is merged into the fault site of Fk;
S8, the fault analysis system reads the upper association list Fuplist of Fk; if Fuplist is empty, jump to S10; otherwise, go to S9;
S9, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the upper association list Fuplist of Fk; if so, the fault site of Fk is merged into the fault site of that fault, Fk is deleted from the current fault list, and the flow jumps to S4;
S10, the fault analysis system completes the accurate positioning of the fault Fk and jumps to S4.
Owing to the above technical scheme, the invention has the following advantages over the prior art:
the multi-source fault collaborative analysis and positioning method for a large-scale system improves the capability to analyze and locate system faults automatically and solves the problem of accurately locating faults in large-scale parallel systems. On top of fault detection by traditional means such as IPMI, SNMP and heartbeat, when the system runs abnormally the method combines several customized fault monitoring means, including out-of-band maintenance fault monitoring, operating system anomaly monitoring, resource management service fault monitoring, job launch and runtime detection, and runtime environment anomaly detection, to sense and discover faults and anomalies of the large-scale parallel system. By uniformly classifying the faults, establishing their upper-lower association characteristics, and relying on an information interaction mechanism for software-hardware cooperative control, it performs multi-level collaborative fault analysis and positioning. Fault de-duplication and screening then remove derived faults and anomalies, so that the faults are located accurately. This addresses the difficulty that traditional fault detection and positioning methods have in locating the various deep fault sources in a system quickly, comprehensively and accurately, provides basic support for system fault-tolerant handling, and improves system usability and user experience.
Detailed Description
Embodiment: a multi-source fault collaborative analysis and positioning method for a large-scale system, based on a large-scale heterogeneous system and on the following modules:
an IPMI protocol fault monitoring module, used for collecting faults from commercial servers;
an SNMP protocol fault monitoring module, used for collecting faults from network equipment;
a node heartbeat fault monitoring module, used for collecting software faults of special-purpose equipment;
a maintenance access fault monitoring module, used for collecting hardware faults of special-purpose equipment;
a system core fault monitoring module, used for monitoring various system environment faults and Panic information and notifying the maintenance system service by way of registers so that they are registered in the database;
a resource management fault monitoring module, used for monitoring software and hardware states and registering faults in the database through the resource management service;
a job launch and runtime fault monitoring module, used for checking service states when a job is launched and while it runs, detecting various faults of the file system and the running environment, and registering them in the database through the job system service;
a runtime environment anomaly detection module, used for detecting faults and anomalies of the component services in use, passing them outward through an interface provided by the job system, and registering them in the database;
a customized out-of-band maintenance system module, used for monitoring hardware faults of computing resources and registering them in the database through the maintenance system service;
a fault analysis system, used for de-duplicating, screening and merging the data collected by the fault monitoring modules;
a system information base, used for storing the data collected by the fault monitoring modules and the data generated by the analysis of the fault analysis system;
the analysis and positioning method comprises the following steps:
S1, the fault analysis system uniformly classifies the faults collected by the fault monitoring modules, defines a fault code Fid for each fault, and defines an upper association list Fuplist and a lower association list Fdownlist for each fault, wherein the upper association list Fuplist contains the fault codes Fid of the faults that can induce this fault, and the lower association list Fdownlist contains the fault codes Fid of the faults that this fault can induce;
S2, the fault analysis system receives the faults sent by the fault monitoring modules and forms a current fault list;
S3, the fault analysis system performs upper-lower association analysis on the current fault list;
S4, the fault analysis system reads a fault Fk from the current fault list; if the read succeeds, go to S5; if the read fails, jump to S2;
S5, the fault analysis system examines the upper and lower association lists of Fk; if both lists are empty, jump to S10; otherwise, go to S6;
S6, the fault analysis system reads the lower association list Fdownlist of Fk; if Fdownlist is empty, jump to S8; otherwise, go to S7;
S7, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the lower association list Fdownlist of Fk; if so, that fault is deleted from the current fault list and its fault site information is merged into the fault site of Fk;
S8, the fault analysis system reads the upper association list Fuplist of Fk; if Fuplist is empty, jump to S10; otherwise, go to S9;
S9, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the upper association list Fuplist of Fk; if so, the fault site of Fk is merged into the fault site of that fault, Fk is deleted from the current fault list, and the flow jumps to S4;
S10, the fault analysis system completes the accurate positioning of the fault Fk and jumps to S4.
The fault site is the data collected by each fault monitoring module when a fault occurs.
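As an illustration only, the loop formed by steps S4 to S10 can be sketched in Python as follows. The sketch is not the claimed implementation. The class name Fault, the function name locate and the fault codes HW_NODE, HEARTBEAT_LOST and JOB_ABORT are hypothetical, and two simplifications are assumed: a fault that reaches S10 is moved out of the working list into a result list, and in S9 only the first matching upper fault receives the fault site of Fk.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison, so list.remove() removes exactly this report
class Fault:
    fid: str                                   # fault code Fid
    uplist: frozenset = frozenset()            # Fuplist: codes of faults that can induce this fault
    downlist: frozenset = frozenset()          # Fdownlist: codes of faults this fault can induce
    site: list = field(default_factory=list)   # fault site: data collected when the fault occurred


def locate(current):
    """Run steps S4-S10 over a current fault list and return the located faults."""
    faults = list(current)
    located = []
    while faults:                                # S4: read a fault Fk; stop when none is left
        fk = faults[0]
        if fk.downlist or fk.uplist:             # S5: skip to S10 when both lists are empty
            if fk.downlist:                      # S6 -> S7: absorb derived faults into Fk
                derived = [f for f in faults if f is not fk and f.fid in fk.downlist]
                for f in derived:
                    fk.site.extend(f.site)
                    faults.remove(f)
            if fk.uplist:                        # S8 -> S9: hand Fk over to an inducing fault
                upper = next((f for f in faults
                              if f is not fk and f.fid in fk.uplist), None)
                if upper is not None:
                    upper.site.extend(fk.site)
                    faults.remove(fk)
                    continue                     # jump back to S4
        faults.remove(fk)                        # S10: Fk is located
        located.append(fk)
    return located


# Hypothetical scenario: one node hardware fault reported at three levels.
hw = Fault("HW_NODE", downlist=frozenset({"HEARTBEAT_LOST", "JOB_ABORT"}),
           site=["ipmi: ECC error"])
hb = Fault("HEARTBEAT_LOST", uplist=frozenset({"HW_NODE"}),
           site=["heartbeat timeout"])
job = Fault("JOB_ABORT", uplist=frozenset({"HW_NODE", "HEARTBEAT_LOST"}),
            site=["rank 3 exited"])

roots = locate([hb, job, hw])
print([f.fid for f in roots])
```

In this scenario the heartbeat and job faults are recognized as derived faults and only the hardware fault remains, carrying the merged fault site of all three reports.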
The embodiment is further explained below:
the multi-source fault collaborative analysis and positioning method for a large-scale system mainly comprises two parts, namely multi-path fault sensing of the system and multi-level collaborative fault analysis and positioning; the processing flow is shown schematically in figure 1.
(1) Multi-path fault sensing of the system
Fault sensing mainly takes the following forms:
a) discovering computing-resource and network-resource faults by traditional means such as the IPMI and SNMP protocols and node heartbeat;
b) monitoring hardware faults of computing resources through the customized out-of-band maintenance system, and registering them in the database through the maintenance service;
c) monitoring various system environment faults and Panic information through the operating system core, and notifying the maintenance system by way of registers so that they are registered in the database;
d) checking and monitoring software and hardware states through resource management state polling, and registering faults in the database through the resource management service;
e) detecting various faults of the file system and the running environment through service state checks at job launch, and registering them in the database through the job system service;
f) while a job is running, detecting faults and anomalies of the component services in use based on the runtime state of the job, passing them outward through an interface provided by the job system, and registering them in the database.
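As a minimal sketch of how reports arriving over these paths might be registered in one place, the following Python fragment stores every report in a single table and groups reports of the same fault by component. It assumes an in-memory SQLite database in place of the system information base; the table layout, the source names in SOURCES, the function names, and the component names and fault codes in the usage lines are all hypothetical and are not taken from the method itself.

```python
import sqlite3
import time

# Hypothetical labels for the sensing paths a) to f) listed above.
SOURCES = ("ipmi", "snmp", "heartbeat", "out_of_band",
           "os_core", "resource_mgmt", "job_launch", "runtime")


def open_info_base(path=":memory:"):
    """Open the (assumed) system information base and create the fault table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS fault ("
        " source TEXT NOT NULL, component TEXT NOT NULL,"
        " fid TEXT NOT NULL, detail TEXT, ts REAL NOT NULL)"
    )
    return db


def register_fault(db, source, component, fid, detail="", ts=None):
    """Register one fault report from one sensing path."""
    if source not in SOURCES:
        raise ValueError(f"unknown fault source: {source}")
    db.execute(
        "INSERT INTO fault VALUES (?, ?, ?, ?, ?)",
        (source, component, fid, detail, time.time() if ts is None else ts),
    )
    db.commit()


def current_faults(db):
    """Group reports by (component, fid) and record which paths reported each."""
    grouped = {}
    for component, fid, source in db.execute(
            "SELECT component, fid, source FROM fault"):
        grouped.setdefault((component, fid), set()).add(source)
    return grouped


db = open_info_base()
register_fault(db, "ipmi", "node17", "HW_NODE", "ECC error")
register_fault(db, "out_of_band", "node17", "HW_NODE", "board sensor alarm")
register_fault(db, "snmp", "switch3", "LINK_DOWN", "port 12 down")
print(current_faults(db))
```

Keeping the reporting path alongside each record lets a later analysis stage see that two paths reported the same fault on the same component, which is the starting point for de-duplication.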
(2) Multi-level collaborative fault analysis and positioning
When a component of the system fails, the fault may be reported at several levels; for example, a node hardware fault may cause the maintenance monitoring system, the node heartbeat system and the job control system each to report a fault. In this case, the resource states of the nodes, the network and so on are monitored and analyzed and, together with software such as the operating system, the job resource management system and the runtime system, a multi-level information interaction mechanism for software-hardware cooperative control is established. Faults are classified uniformly, their upper-lower association characteristics are set, fault de-duplication and screening are carried out, derived faults and anomalies are removed, and the faults are located accurately. The multi-level collaborative fault analysis and positioning flow is mainly as follows:
a) when a running fault or anomaly is detected while the system is running, the analysis and positioning flow is started;
b) the fault analysis system drives the components related to the fault site to perform deep detection of fault and running states, and queries whether a deterministic fault has occurred;
c) each part of the system performs local state fault detection; when an anomaly is found but cannot be located accurately, the relevant components can be driven to perform deep detection according to the anomaly type (for example, when runtime detection finds abnormal message transmission, a status check of the network components can be driven);
d) the fault analysis system collects the deep detection results sent by all parts and merges and analyzes the fault sites based on the upper-lower association characteristics of the faults;
e) on the basis of the fault site analysis, the fault analysis system de-duplicates and screens the faults, removes derived faults and anomalies, and locates the faults accurately.
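Steps b) to d) amount to dispatching deep-detection checks by anomaly type and collecting what they find. The following Python fragment is a sketch under stated assumptions: the class DeepDetector, the probe check_network, the anomaly type MSG_TRANSFER_ABNORMAL, the fault code LINK_DOWN and the link_state table standing in for real network status are all hypothetical.

```python
from typing import Callable, Dict, List

# A probe inspects one component and returns the fault codes it can confirm.
Probe = Callable[[str], List[str]]


class DeepDetector:
    """Maps anomaly types to the deep-detection probes they should trigger."""

    def __init__(self) -> None:
        self._probes: Dict[str, List[Probe]] = {}

    def on(self, anomaly_type: str, probe: Probe) -> None:
        """Register a probe to be driven when this anomaly type is seen."""
        self._probes.setdefault(anomaly_type, []).append(probe)

    def drive(self, anomaly_type: str, component: str) -> List[str]:
        """Run every probe registered for the anomaly type and collect the findings."""
        findings: List[str] = []
        for probe in self._probes.get(anomaly_type, []):
            findings.extend(probe(component))
        return findings


# Stand-in for the real status of network components (assumed data).
link_state = {"node17": "down", "node18": "up"}


def check_network(component: str) -> List[str]:
    """Hypothetical network status check driven by a message-transmission anomaly."""
    return ["LINK_DOWN"] if link_state.get(component) == "down" else []


detector = DeepDetector()
detector.on("MSG_TRANSFER_ABNORMAL", check_network)
print(detector.drive("MSG_TRANSFER_ABNORMAL", "node17"))
```

An anomaly type with no registered probe simply yields no findings, which corresponds to an anomaly that cannot yet be turned into a deterministic fault.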
When the multi-source fault collaborative analysis and positioning method for a large-scale system is adopted, the capability to analyze and locate system faults automatically is improved, and the problem of accurately locating faults in large-scale parallel systems is solved. On top of fault detection by traditional means such as IPMI, SNMP and heartbeat, when the system runs abnormally the method combines several customized fault monitoring means, including out-of-band maintenance fault monitoring, operating system anomaly monitoring, resource management service fault monitoring, job launch and runtime detection, and runtime environment anomaly detection, to sense and discover faults and anomalies of the large-scale parallel system. By uniformly classifying the faults, establishing their upper-lower association characteristics, and relying on an information interaction mechanism for software-hardware cooperative control, it performs multi-level collaborative fault analysis and positioning. Fault de-duplication and screening then remove derived faults and anomalies, so that the faults are located accurately. This addresses the difficulty that traditional fault detection and positioning methods have in locating the various deep fault sources in a system quickly, comprehensively and accurately, provides basic support for system fault-tolerant handling, and improves system usability and user experience.
To facilitate a better understanding of the invention, the terms used herein are briefly explained as follows:
Parallel computing: the process of solving a computing problem by using multiple computing resources simultaneously; multiple nodes/processors work on the same problem concurrently and cooperatively, which improves computing speed and processing capacity.
Parallel job: generally, a set of task processes written with a parallel programming model such as MPI and running on the computing resources of a parallel computer; it is launched and controlled by the job management system, and its processes cooperate to solve the same problem.
Resource management system: a management and control system that runs in a parallel computer and performs functions such as resource state monitoring, resource allocation and resource recovery.
Job management system: a management and control system that runs in a parallel computer and performs functions such as parallel job scheduling, task launch, control and recovery.
IPMI: the Intelligent Platform Management Interface, an industry standard for managing server equipment; a user can use IPMI to monitor the state of a server.
Fault upper-lower association characteristics: the derivative relations that exist among the various faults in the system.
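To make the derivative relation concrete, the sketch below assumes that the upper and lower association lists are mirror images of each other, so that if fault A appears in the upper list of fault B then B appears in the lower list of A. This mirroring is an assumption made for illustration and is not stated in the method. The function name invert and the fault codes are hypothetical.

```python
def invert(uplists):
    """Derive a lower association list for every fault code from the upper lists."""
    downlists = {fid: set() for fid in uplists}
    for fid, causes in uplists.items():
        for cause in causes:
            # 'cause' can induce 'fid', so 'fid' belongs in the lower list of 'cause'.
            downlists.setdefault(cause, set()).add(fid)
    return downlists


# Hypothetical upper association lists for three fault codes.
uplists = {
    "HW_NODE": set(),
    "HEARTBEAT_LOST": {"HW_NODE"},
    "JOB_ABORT": {"HW_NODE", "HEARTBEAT_LOST"},
}
downlists = invert(uplists)
print(downlists)
```

Maintaining only one direction by hand and deriving the other is one way to keep the two lists from contradicting each other.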
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.