CN112100019B

CN112100019B - Multi-source fault collaborative analysis positioning method for large-scale system

Info

Publication number: CN112100019B
Application number: CN201910863431.3A
Authority: CN
Inventors: 高剑刚; 龚道永; 宋长明; 钱宇; 李伟东; 张宏宇; 刘沙
Original assignee: Wuxi Jiangnan Computing Technology Institute
Current assignee: Wuxi Jiangnan Computing Technology Institute
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2021-03-23
Anticipated expiration: 2039-09-12
Also published as: CN112100019A

Abstract

The invention discloses a multi-source fault collaborative analysis and positioning method for a large-scale system, comprising the following steps: S1, uniformly classifying the faults acquired by each fault monitoring module, defining a fault code Fid for each fault, and defining an upper association list Fuplist and a lower association list Fdownlist for each fault, wherein the upper association list Fuplist comprises the fault codes Fid of the faults that can induce this fault, and the lower association list Fdownlist comprises the fault codes Fid of the faults that this fault can induce; S2, the fault analysis system receives the faults sent by each fault monitoring module to form a current fault list; S3, the fault analysis system performs upper and lower association analysis on the current fault list; S10, the fault analysis system completes the accurate positioning of a fault Fk and jumps to S4. The invention improves the capability of automatically analyzing and positioning system faults and solves the problem of accurately positioning faults in large-scale parallel systems.

Description

Multi-source fault collaborative analysis positioning method for large-scale system

Technical Field

The invention relates to a multi-source fault collaborative analysis and positioning method for a large-scale system, and belongs to the technical field of availability management of large-scale parallel systems.

Background

With the continuous development of high-performance computing technology, large-scale parallel systems keep growing in scale. Their resources are numerous and varied, their software and hardware structures are complex, and the number of components keeps increasing. As a result, the mean time between failures (MTBF) of a large-scale system drops markedly, component failure becomes a normal event during system operation, and the fault-tolerance system becomes a basic supporting system for stable and reliable operation. Rapid discovery and accurate positioning of system faults are the basis of fault handling and fault-tolerant operation.

In a large-scale parallel system, the structure is complex and faults can involve management resources, computing resources, network resources, the operating system, the running environment and other aspects. Because faults are diverse and often hidden, the fault site they present externally is complicated; moreover, a single system fault often causes widespread anomalies in the related software and hardware environment, so accurate positioning is hard to achieve. Traditional fault detection methods mainly monitor computing and network resources through node heartbeat monitoring, IPMI information monitoring, SNMP protocol monitoring and the like, and can generally discover and position only conventional explicit hardware faults and node-down faults. For deep-seated system faults, the wide range of influence and the complicated fault site make the fault source difficult to position accurately, so manual intervention is often needed for fault positioning and analysis when a fault occurs, which greatly affects system availability and user experience.

Disclosure of Invention

The invention aims to provide a multi-source fault collaborative analysis and positioning method for a large-scale system that improves the capability of automatically analyzing and positioning system faults and solves the problem of accurately positioning faults in large-scale parallel systems.

In order to achieve the purpose, the invention adopts the technical scheme that: a multi-source fault collaborative analysis positioning method for a large-scale system is based on the following modules:

the IPMI protocol fault monitoring module is used for carrying out fault acquisition on the commercial server;

the SNMP protocol fault monitoring module is used for carrying out fault acquisition on the network equipment;

the node heartbeat fault monitoring module is used for carrying out fault acquisition on special equipment software;

the maintenance access fault monitoring module is used for fault acquisition of special equipment hardware;

the system core fault monitoring module is used for monitoring various system environment faults and Panic information and for notifying the maintenance system service, by means of registers, so that they are registered in the database;

the resource management fault monitoring module is used for monitoring software and hardware states and registering faults in the database through the resource management service;

the job startup and runtime fault monitoring module is used for checking service states during job startup and running, detecting faults of the file system and the running environment, and registering them in the database through the job system service;

the runtime environment anomaly detection module is used for detecting faults and anomalies of the component services in use, passing them outward through an interface provided by the job system, and registering them in the database;

the customized out-of-band maintenance system module is used for monitoring hardware faults of the computing resources and registering them in the database through the maintenance system service;

the fault analysis system is used for carrying out duplicate removal, screening and merging analysis on the data acquired by each fault monitoring module;

the system information base is used for storing the data collected by each fault monitoring module and the data generated by the analysis of the fault analysis system;

the analysis positioning method comprises the following steps:

S1, the fault analysis system uniformly classifies the faults acquired by each fault monitoring module, defines a fault code Fid for each fault, and defines an upper association list Fuplist and a lower association list Fdownlist for each fault, wherein the upper association list Fuplist comprises the fault codes Fid of the faults that can induce this fault, and the lower association list Fdownlist comprises the fault codes Fid of the faults that this fault can induce;

S2, the fault analysis system receives the faults sent by each fault monitoring module to form a current fault list;

S3, the fault analysis system performs upper and lower association analysis on the current fault list;

S4, the fault analysis system reads a fault Fk from the current fault list; if the read succeeds, it proceeds to S5, and if the read fails, it jumps to S2;

S5, the fault analysis system analyzes the upper and lower association lists of Fk; if both lists are empty, it jumps to S10, otherwise it proceeds to S6;

S6, the fault analysis system reads the lower association list Fdownlist of Fk; if Fdownlist is empty, it jumps to S8, otherwise it proceeds to S7;

S7, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the lower association list Fdownlist of Fk; if so, that fault is deleted from the current fault list and its fault site information is merged into the fault site of Fk;

S8, the fault analysis system reads the upper association list Fuplist of Fk; if Fuplist is empty, it jumps to S10, otherwise it proceeds to S9;

S9, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the upper association list Fuplist of Fk; if so, the fault site of Fk is merged into the fault site of that fault, Fk is deleted from the current fault list, and the method jumps to S4;

S10, the fault analysis system completes the accurate positioning of the fault Fk and jumps to S4.
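Steps S4 to S10 can be illustrated with a minimal Python sketch. It reads the upper association list as the faults that can induce a fault and the lower association list as the faults it can induce, which is what S7 and S9 imply. The `Fault` class, the function name `locate_faults`, the dictionary representation of the association lists and all fault codes are hypothetical choices made for illustration; the patent does not prescribe any of them. The sketch processes one snapshot of the current fault list rather than looping back to S2.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    fid: str                                   # fault code Fid assigned in S1
    site: list = field(default_factory=list)   # fault-site data from the monitors


def locate_faults(current, fuplist, fdownlist):
    """Apply S4-S10 to one current fault list and return the located faults."""
    pending = list(current)      # working copy of the current fault list
    located = []
    while pending:               # S4: read a fault Fk; an empty list ends the pass
        fk = pending.pop(0)
        down = fdownlist.get(fk.fid, [])
        up = fuplist.get(fk.fid, [])
        # S5: if both lists are empty, neither branch below runs and Fk reaches S10.
        if down:                 # S6 -> S7: absorb derived faults that are present
            remaining = []
            for other in pending:
                if other.fid in down:
                    fk.site.extend(other.site)   # merge its fault site into Fk
                else:
                    remaining.append(other)
            pending = remaining
        if up:                   # S8 -> S9: hand Fk over to an inducing fault
            cause = next((f for f in pending if f.fid in up), None)
            if cause is not None:
                cause.site.extend(fk.site)       # merge Fk's fault site into it
                continue         # Fk is dropped from the list; back to S4
        located.append(fk)       # S10: Fk is a located fault source
    return located


# Hypothetical association lists for a node hardware fault and its derived faults.
FDOWNLIST = {"NODE_HW": ["NODE_HEARTBEAT", "JOB_ABORT"]}
FUPLIST = {"NODE_HEARTBEAT": ["NODE_HW"], "JOB_ABORT": ["NODE_HW"]}

reported = [
    Fault("JOB_ABORT", ["job 17 aborted"]),
    Fault("NODE_HW", ["out-of-band: memory error"]),
    Fault("NODE_HEARTBEAT", ["heartbeat lost"]),
    Fault("NET_LINK", ["SNMP: port down"]),
]
result = locate_faults(reported, FUPLIST, FDOWNLIST)
print([f.fid for f in result])   # ['NODE_HW', 'NET_LINK']
```

In this example the job and heartbeat faults are folded into the node hardware fault, whose fault site then carries all three reports, while the unrelated network fault is located on its own. Because the sketch follows the steps literally, whether an indirectly derived fault is absorbed depends on how the association lists are populated, which the text does not specify.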

Due to the application of the technical scheme, compared with the prior art, the invention has the following advantages:

The multi-source fault collaborative analysis and positioning method for a large-scale system improves the capability of automatically analyzing and positioning system faults and solves the problem of accurately positioning faults in large-scale parallel systems. On top of traditional fault detection based on IPMI, SNMP, heartbeat and the like, the method combines several customized fault monitoring means, such as out-of-band maintenance fault monitoring, operating system anomaly monitoring, resource management service fault monitoring, job startup and runtime detection, and runtime environment anomaly detection, so that faults and anomalies of the large-scale parallel system are perceived and discovered when the system runs abnormally. By uniformly classifying the faults and setting their upper and lower association characteristics, and relying on an information interaction mechanism of software and hardware cooperative control, the method performs multi-level collaborative fault analysis and positioning. Fault deduplication and screening are thereby achieved, derived faults and anomalies are removed, and the faults are positioned accurately. This overcomes the difficulty that traditional fault detection and positioning methods have in positioning deep-seated fault sources quickly, comprehensively and accurately, provides basic support for system fault-tolerant handling, and improves system availability and user experience.

Drawings

FIG. 1 is a schematic diagram of the principle of the multi-source fault collaborative analysis and positioning method for a large-scale system according to the present invention;

FIG. 2 is a flow chart of a multi-source fault collaborative analysis positioning method for a large-scale system.

Detailed Description

Embodiment: a multi-source fault collaborative analysis and positioning method for a large-scale system, applied to a large-scale heterogeneous system and based on the following modules:

the analysis positioning method comprises the following steps:

The fault site is the data collected by each fault monitoring module when a fault occurs.

The examples are further explained below:

The multi-source fault collaborative analysis and positioning method for a large-scale system mainly comprises two parts: multi-path fault perception and multi-level collaborative fault analysis and positioning. The processing principle is shown in FIG. 1.

(1) System multi-path fault perception

Faults are perceived mainly in the following ways:

a) computing resource and network resource faults are discovered by traditional means based on IPMI, SNMP and similar protocols and on node heartbeat;

b) computing resource hardware faults are monitored by a customized out-of-band maintenance system and registered in the database through the maintenance service;

c) various system environment faults and Panic information are monitored by the operating system core, which notifies the maintenance system by means of registers so that they are registered in the database;

d) software and hardware states are checked and monitored by resource management polling and registered in the database through the resource management service;

e) faults of the file system and the running environment are detected by service state checks during job startup and registered in the database through the job system service;

f) while a job is running, faults and anomalies of the component services it uses are detected from the job's running state, passed outward through an interface provided by the job system, and registered in the database.
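All of the perception paths above end with the fault being registered in a common store. A minimal sketch of such unified registration follows, using an in-memory SQLite database as a stand-in for that store. The table layout, the source tags, the function names and the fault codes are assumptions made for illustration and are not taken from the patent.

```python
import sqlite3

# Hypothetical source tags, one or more per perception path a) to f).
SOURCES = ("ipmi", "snmp", "heartbeat", "out_of_band", "os_core",
           "resource_mgmt", "job_startup", "job_runtime")


def open_info_base():
    """Create an in-memory stand-in for the database the faults are registered in."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE fault ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " fid TEXT NOT NULL, source TEXT NOT NULL,"
        " component TEXT NOT NULL, detail TEXT NOT NULL)"
    )
    return db


def register_fault(db, fid, source, component, detail):
    """Register one fault report, whichever monitoring path produced it."""
    if source not in SOURCES:
        raise ValueError(f"unknown monitoring source: {source}")
    db.execute(
        "INSERT INTO fault (fid, source, component, detail) VALUES (?, ?, ?, ?)",
        (fid, source, component, detail),
    )
    db.commit()


def faults_for(db, component):
    """Return (fid, source) pairs reported for one component, in arrival order."""
    rows = db.execute(
        "SELECT fid, source FROM fault WHERE component = ? ORDER BY seq",
        (component,),
    )
    return rows.fetchall()


db = open_info_base()
register_fault(db, "NODE_HW", "out_of_band", "node42", "memory error")
register_fault(db, "NODE_HEARTBEAT", "heartbeat", "node42", "heartbeat lost")
register_fault(db, "NET_LINK", "snmp", "switch3", "port down")
print(faults_for(db, "node42"))
# [('NODE_HW', 'out_of_band'), ('NODE_HEARTBEAT', 'heartbeat')]
```

Recording the reporting path next to the fault code keeps reports from different monitors comparable by Fid while preserving where each one came from.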

(2) Multi-level collaborative fault analysis positioning

When a component of the system fails, the fault may be reported at several levels; for example, a node hardware fault may cause faults to be reported by the maintenance monitoring system, the node heartbeat system and the job control system. In this case, a multi-level information interaction mechanism of software and hardware cooperative control is established by monitoring and analyzing the states of resources such as nodes and the network, in combination with software such as the operating system, the job resource management system and the runtime system. Faults are uniformly classified, their upper and lower association characteristics are set, fault deduplication and screening are performed, derived faults and anomalies are removed, and the faults are positioned accurately. The multi-level collaborative fault analysis and positioning flow is mainly as follows.

a) when a job fault or anomaly is detected while the system is running, the analysis and positioning flow is started;

b) the fault analysis system drives the components related to the fault site to perform deep detection of fault and running states and to query whether a deterministic fault has occurred;

c) each part of the system performs local state fault detection; when an anomaly is found but cannot be positioned accurately, the related components can be driven to perform deep detection according to the anomaly type (for example, when runtime detection finds that message transmission is abnormal, a state check of the network components can be driven);

d) the fault analysis system collects the deep detection results sent by all parts and merges and analyzes the fault sites based on the upper and lower association characteristics of the faults;

e) the fault analysis system performs fault deduplication and screening based on the fault site analysis, removes derived faults and anomalies, and achieves accurate positioning of the faults.
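The type-driven deep detection in step c) can be pictured as a routing table from anomaly types to the component checks that should be driven. The sketch below is only an illustration under that assumption: the anomaly type names, the probe functions and the fault codes are hypothetical, and the probes are stand-ins that look up canned results instead of querying real components.

```python
def make_probe(confirmed_by_target):
    """Build a stand-in deep-detection probe from {target: [Fid, ...]}."""
    def probe(target):
        return list(confirmed_by_target.get(target, []))
    return probe


# Hypothetical routing: anomaly type -> probes of the components to be driven.
network_probe = make_probe({"node42": ["NET_LINK"]})
node_hw_probe = make_probe({"node7": ["NODE_HW"]})
PROBES_BY_ANOMALY = {
    "MESSAGE_TRANSFER_ABNORMAL": [network_probe],
    "NODE_UNRESPONSIVE": [node_hw_probe, network_probe],
}


def deep_detect(anomaly_type, target, probes_by_anomaly=PROBES_BY_ANOMALY):
    """Drive every probe registered for the anomaly type and collect confirmed Fids."""
    confirmed = []
    for probe in probes_by_anomaly.get(anomaly_type, []):
        for fid in probe(target):
            if fid not in confirmed:      # keep each confirmed fault once
                confirmed.append(fid)
    return confirmed


print(deep_detect("MESSAGE_TRANSFER_ABNORMAL", "node42"))  # ['NET_LINK']
print(deep_detect("NODE_UNRESPONSIVE", "node42"))          # ['NET_LINK']
print(deep_detect("NODE_UNRESPONSIVE", "node7"))           # ['NODE_HW']
```

The confirmed fault codes returned here correspond to the deep detection results that the fault analysis system collects in step d) before merging fault sites.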

To facilitate a better understanding of the invention, the terms used herein will be briefly explained as follows:

Parallel computing: the process of solving a computing problem by using multiple computing resources simultaneously; multiple nodes/processors solve the same problem concurrently and cooperatively, which improves computing speed and processing capacity.

Parallel job: generally, a set of task processes written with a parallel programming interface such as MPI and running on the computing resources of a parallel computer; it is started and controlled by the job system, and the processes cooperate to solve the same problem.

Resource management system: a management and control system running in a parallel computer that performs functions such as resource state monitoring, resource allocation and resource recovery.

Job management system: a management and control system running in a parallel computer that performs functions such as parallel job scheduling, task startup, control and recovery.

IPMI: Intelligent Platform Management Interface, an industry standard for managing server equipment; a user can monitor the state of a server by using IPMI.

Fault upper and lower association characteristics: the derivation relationships that exist among the various faults in the system.
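If the upper association list of a fault holds the faults that can induce it and the lower association list holds the faults it can induce, the two lists describe the same derivation relation from opposite ends. Under that reading, one can be derived from the other, as the following sketch shows. The function name and the fault codes are hypothetical, and the patent does not state how the lists are built or maintained.

```python
def invert_association(fuplist):
    """Derive lower association lists from upper association lists."""
    fdownlist = {}
    for fid, causes in fuplist.items():
        for cause in causes:
            derived = fdownlist.setdefault(cause, [])
            if fid not in derived:        # list each derived fault once
                derived.append(fid)
    return fdownlist


# Hypothetical upper association lists: fault -> faults that can induce it.
FUPLIST = {
    "NODE_HEARTBEAT": ["NODE_HW"],
    "JOB_ABORT": ["NODE_HW", "NET_LINK"],
}
FDOWNLIST = invert_association(FUPLIST)
print(FDOWNLIST)
# {'NODE_HW': ['NODE_HEARTBEAT', 'JOB_ABORT'], 'NET_LINK': ['JOB_ABORT']}
```

Deriving one direction from the other is a design choice that keeps the two lists consistent, so that a derived fault is recognized regardless of which of the two faults is examined first.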

The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. A multi-source fault collaborative analysis and positioning method for a large-scale system, characterized in that the method is based on the following modules:

the analysis positioning method comprises the following steps: