CN112100019B - Multi-source fault collaborative analysis positioning method for large-scale system - Google Patents

Multi-source fault collaborative analysis positioning method for large-scale system

Info

Publication number
CN112100019B
Authority
CN
China
Prior art keywords
fault
list
analysis
faults
monitoring module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910863431.3A
Other languages
Chinese (zh)
Other versions
CN112100019A (en)
Inventor
高剑刚 (Gao Jiangang)
龚道永 (Gong Daoyong)
宋长明 (Song Changming)
钱宇 (Qian Yu)
李伟东 (Li Weidong)
张宏宇 (Zhang Hongyu)
刘沙 (Liu Sha)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Jiangnan Computing Technology Institute
Original Assignee
Wuxi Jiangnan Computing Technology Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-09-12
Filing date
2019-09-12
Publication date
2021-03-23
Application filed by Wuxi Jiangnan Computing Technology Institute
Priority to CN201910863431.3A
Publication of CN112100019A
Application granted
Publication of CN112100019B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3058 Monitoring arrangements for monitoring environmental properties or parameters of the computing system or of the computing system component, e.g. monitoring of power, currents, temperature, humidity, position, vibrations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a multi-source fault collaborative analysis positioning method for a large-scale system, comprising the following steps: S1, the faults collected by each fault monitoring module are classified uniformly, a fault code Fid is defined for each fault, and an upper association list Fuplist and a lower association list Fdownlist are defined for each fault, wherein Fuplist contains the fault codes Fid of the faults that can induce the fault, and Fdownlist contains the fault codes Fid of the faults that the fault can induce; S2, the fault analysis system receives the faults sent by each fault monitoring module to form a current fault list; S3, the fault analysis system performs upper-lower association analysis on the current fault list; S10, the fault analysis system completes the accurate locating of a fault Fk and jumps to S4. The invention improves the capability to analyze and locate system faults automatically and solves the problem of accurately locating faults in a large-scale parallel system.

Description

Multi-source fault collaborative analysis positioning method for large-scale system
Technical Field
The invention relates to a multi-source fault collaborative analysis positioning method for a large-scale system and belongs to the technical field of availability management of large-scale parallel systems.
Background
With the continuous development of high-performance computing, large-scale parallel systems keep growing in scale: resources are diverse and numerous, the software and hardware structures are complex, and the number of components keeps increasing. As a result, the mean time between failures (MTBF) of a large-scale system drops significantly, component failure becomes a normal event during system operation, and the fault-tolerance system becomes a basic supporting system for stable and reliable operation. Rapid discovery and accurate locating of system faults is the basis of fault handling and fault-tolerant operation.
In a large-scale parallel system, the structure is complex and faults involve management resources, computing resources, network resources, the operating system, the runtime environment and other aspects. The diversity and concealment of faults make the externally visible fault site complex, and a system fault often causes widespread anomalies in the related software and hardware environments, so the fault is difficult to locate accurately. Traditional fault detection methods mainly monitor computing resources and network resources through node heartbeat monitoring, IPMI information monitoring, SNMP protocol monitoring and the like, and can generally discover and locate only conventional explicit hardware faults and node-down faults. For the various deep faults of the system, the influence range is wide and the fault site is complex, so the fault source is generally difficult to locate accurately. Manual intervention is often needed to locate and analyze a fault when it occurs, which greatly affects system availability and user experience.
Disclosure of Invention
The invention aims to provide a multi-source fault collaborative analysis positioning method for a large-scale system, which improves the capability to analyze and locate system faults automatically and solves the problem of accurately locating faults in a large-scale parallel system.
To achieve this purpose, the invention adopts the following technical scheme: a multi-source fault collaborative analysis positioning method for a large-scale system, based on the following modules:
the IPMI protocol fault monitoring module, used for collecting faults from commercial servers;
the SNMP protocol fault monitoring module, used for collecting faults from network equipment;
the node heartbeat fault monitoring module, used for collecting software faults of special-purpose equipment;
the maintenance access fault monitoring module, used for collecting hardware faults of special-purpose equipment;
the system core fault monitoring module, used for monitoring various system environment faults and Panic information and notifying the maintenance system service, by means of registers, to register them in the database;
the resource management fault monitoring module, used for monitoring software and hardware states and registering faults in the database through the resource management service;
the job startup and runtime fault monitoring module, used for checking service states when a job starts and while it runs, detecting various faults of the file system and the runtime environment, and registering them in the database through the job system service;
the runtime environment anomaly detection module, used for detecting faults and anomalies of the component services in use, passing them outward through an interface provided by the job system, and registering them in the database;
the customized out-of-band maintenance system module, used for monitoring hardware faults of the computing resources and registering them in the database through the maintenance system service;
the fault analysis system, used for deduplicating, screening and merging the data collected by each fault monitoring module;
the system information base, used for storing the data collected by each fault monitoring module and the data generated by the analysis of the fault analysis system;
the analysis positioning method comprises the following steps:
S1, the fault analysis system uniformly classifies the faults collected by each fault monitoring module, defines a fault code Fid for each fault, and defines an upper association list Fuplist and a lower association list Fdownlist for each fault, wherein the upper association list Fuplist contains the fault codes Fid of the faults that can induce the fault, and the lower association list Fdownlist contains the fault codes Fid of the faults that the fault can induce;
S2, the fault analysis system receives the faults sent by each fault monitoring module to form a current fault list;
S3, the fault analysis system performs upper-lower association analysis on the current fault list;
S4, the fault analysis system reads a fault Fk from the current fault list; if the read succeeds, it proceeds to S5, and if the read fails, it jumps to S2;
S5, the fault analysis system analyzes the upper and lower association lists of Fk; if both lists are empty, it jumps to S10, otherwise it proceeds to S6;
S6, the fault analysis system reads the lower association list Fdownlist of Fk; if Fdownlist is empty, it jumps to S8, otherwise it proceeds to S7;
S7, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the lower association list Fdownlist of Fk; if so, that fault is deleted from the current fault list and its fault site information is merged into the fault site of Fk;
S8, the fault analysis system reads the upper association list Fuplist of Fk; if Fuplist is empty, it jumps to S10, otherwise it proceeds to S9;
S9, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the upper association list Fuplist of Fk; if so, the fault site of Fk is merged into the fault site of that fault, Fk is deleted from the current fault list, and the flow jumps to S4;
S10, the fault analysis system completes the accurate locating of the fault Fk and jumps to S4.
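The loop formed by steps S2 to S10 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the `Fault` class, the `locate` function, the dictionary layout of the association lists and all fault codes are hypothetical, at most one reported fault per Fid is assumed, and because the text does not say what happens in S9 when no fault from Fuplist is present, the sketch assumes the flow then continues to S10.

```python
from dataclasses import dataclass, field


@dataclass
class Fault:
    fid: str                                   # fault code Fid from S1
    site: list = field(default_factory=list)   # fault-site records (layout assumed)


def locate(faults, fuplist, fdownlist):
    """Reduce one batch of reported faults to the faults left after S4-S10.

    fuplist[fid]   lists the codes of faults that can induce `fid`;
    fdownlist[fid] lists the codes of faults that `fid` can induce.
    Assumes at most one reported fault per Fid; `site` lists are merged in place.
    """
    current = {f.fid: f for f in faults}       # S2: current fault list
    for fid in list(current):                  # S4: read the next fault Fk
        fk = current.get(fid)
        if fk is None:                         # already merged into another fault
            continue
        # S6-S7: absorb derived faults that are also in the current list.
        for down in fdownlist.get(fid, ()):
            derived = current.pop(down, None) if down != fid else None
            if derived is not None:
                fk.site.extend(derived.site)
        # S8-S9: if a possible cause is in the list, fold Fk into it.
        cause = next((current[up] for up in fuplist.get(fid, ())
                      if up in current and up != fid), None)
        if cause is not None:
            cause.site.extend(fk.site)
            del current[fid]
            continue                           # back to S4
        # S10: Fk stays in the list as a located fault.
    return list(current.values())


# Hypothetical codes: a node hardware fault that also surfaces as heartbeat
# and job faults, plus one unrelated network fault.
FUPLIST = {"HEARTBEAT_LOST": ["HW_NODE"], "JOB_ABORT": ["HW_NODE"]}
FDOWNLIST = {"HW_NODE": ["HEARTBEAT_LOST", "JOB_ABORT"]}

reported = [
    Fault("HEARTBEAT_LOST", ["heartbeat: node 12 silent"]),
    Fault("HW_NODE", ["out-of-band: node 12 board fault"]),
    Fault("JOB_ABORT", ["job system: job on node 12 aborted"]),
    Fault("SWITCH_PORT_DOWN", ["snmp: port down"]),
]
located = locate(reported, FUPLIST, FDOWNLIST)
```

With these inputs only `HW_NODE` and `SWITCH_PORT_DOWN` remain, and the site of `HW_NODE` carries the records of both derived faults. The S5 shortcut (both lists empty) is covered implicitly because both inner loops then do nothing. Note that with association lists that hold only direct relations, the result for longer derivation chains can depend on the order in which faults are read.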
Owing to the above technical scheme, the invention has the following advantages over the prior art:
The multi-source fault collaborative analysis positioning method for a large-scale system improves the capability to analyze and locate system faults automatically and solves the problem of accurately locating faults in a large-scale parallel system. On top of fault detection by traditional means such as IPMI, SNMP and heartbeat, when the system runs abnormally, faults and anomalies of the large-scale parallel system are sensed and discovered by combining several customized monitoring means: out-of-band maintenance fault monitoring, operating system anomaly monitoring, resource management service fault monitoring, job startup and runtime detection, and runtime environment anomaly detection. By classifying faults uniformly, setting the upper-lower association characteristics of faults, and relying on an information interaction mechanism for software-hardware cooperative control, multi-level collaborative fault analysis and locating is carried out. Fault deduplication and screening are thereby achieved, derived faults and anomalies are removed, and faults are located accurately. This addresses the difficulty that traditional fault detection and locating methods have in locating the various deep fault sources of the system quickly, comprehensively and accurately, provides basic support for fault-tolerant handling, and improves system availability and user experience.
Drawings
FIG. 1 is a schematic diagram of the principle of a multi-source fault co-analysis positioning method for a large-scale system according to the present invention;
FIG. 2 is a flow chart of a multi-source fault collaborative analysis positioning method for a large-scale system.
Detailed Description
Example: a multi-source fault collaborative analysis positioning method for a large-scale system, applied to a large-scale heterogeneous system and based on the following modules:
the IPMI protocol fault monitoring module, used for collecting faults from commercial servers;
the SNMP protocol fault monitoring module, used for collecting faults from network equipment;
the node heartbeat fault monitoring module, used for collecting software faults of special-purpose equipment;
the maintenance access fault monitoring module, used for collecting hardware faults of special-purpose equipment;
the system core fault monitoring module, used for monitoring various system environment faults and Panic information and notifying the maintenance system service, by means of registers, to register them in the database;
the resource management fault monitoring module, used for monitoring software and hardware states and registering faults in the database through the resource management service;
the job startup and runtime fault monitoring module, used for checking service states when a job starts and while it runs, detecting various faults of the file system and the runtime environment, and registering them in the database through the job system service;
the runtime environment anomaly detection module, used for detecting faults and anomalies of the component services in use, passing them outward through an interface provided by the job system, and registering them in the database;
the customized out-of-band maintenance system module, used for monitoring hardware faults of the computing resources and registering them in the database through the maintenance system service;
the fault analysis system, used for deduplicating, screening and merging the data collected by each fault monitoring module;
the system information base, used for storing the data collected by each fault monitoring module and the data generated by the analysis of the fault analysis system;
the analysis positioning method comprises the following steps:
S1, the fault analysis system uniformly classifies the faults collected by each fault monitoring module, defines a fault code Fid for each fault, and defines an upper association list Fuplist and a lower association list Fdownlist for each fault, wherein the upper association list Fuplist contains the fault codes Fid of the faults that can induce the fault, and the lower association list Fdownlist contains the fault codes Fid of the faults that the fault can induce;
S2, the fault analysis system receives the faults sent by each fault monitoring module to form a current fault list;
S3, the fault analysis system performs upper-lower association analysis on the current fault list;
S4, the fault analysis system reads a fault Fk from the current fault list; if the read succeeds, it proceeds to S5, and if the read fails, it jumps to S2;
S5, the fault analysis system analyzes the upper and lower association lists of Fk; if both lists are empty, it jumps to S10, otherwise it proceeds to S6;
S6, the fault analysis system reads the lower association list Fdownlist of Fk; if Fdownlist is empty, it jumps to S8, otherwise it proceeds to S7;
S7, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the lower association list Fdownlist of Fk; if so, that fault is deleted from the current fault list and its fault site information is merged into the fault site of Fk;
S8, the fault analysis system reads the upper association list Fuplist of Fk; if Fuplist is empty, it jumps to S10, otherwise it proceeds to S9;
S9, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the upper association list Fuplist of Fk; if so, the fault site of Fk is merged into the fault site of that fault, Fk is deleted from the current fault list, and the flow jumps to S4;
S10, the fault analysis system completes the accurate locating of the fault Fk and jumps to S4.
The fault site is the data collected by each fault monitoring module when a fault occurs.
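Merging one fault site into another, as steps S7 and S9 require, could look like the following minimal sketch. The layout of a fault site as a mapping from monitoring module name to a list of records is an assumption made for illustration, as are the function name `merge_sites` and the sample records; the text only states that the fault site is the data collected by the monitoring modules.

```python
def merge_sites(target, source):
    """Merge fault site `source` into `target` in place and return `target`.

    A fault site is modelled as {module name: [records]} (assumed layout).
    """
    for module, records in source.items():
        kept = target.setdefault(module, [])
        for record in records:
            if record not in kept:      # drop exact duplicate records
                kept.append(record)
    return target


# Hypothetical sites: Fk is a hardware fault, the other site belongs to a
# derived heartbeat fault that repeats one out-of-band record.
fk_site = {"out_of_band": ["node 12: board power fault"]}
derived_site = {
    "heartbeat": ["node 12: no heartbeat"],
    "out_of_band": ["node 12: board power fault"],
}
merge_sites(fk_site, derived_site)
```

Keeping the records grouped by source module preserves which monitoring path reported what, which may help when the merged site is later inspected.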
The example is further explained below:
The multi-source fault collaborative analysis positioning method for a large-scale system mainly comprises two parts, multi-path fault sensing and multi-level collaborative fault analysis and locating; the processing principle is shown in FIG. 1.
(1) System multipath fault awareness
Fault sensing mainly uses the following paths:
a) fault discovery for computing resources and network resources by traditional means such as the IPMI and SNMP protocols and node heartbeat is supported;
b) hardware faults of computing resources are monitored by a customized out-of-band maintenance system and registered in the database through the maintenance service;
c) various system environment faults and Panic information are monitored by the operating system core, and the maintenance system is notified by means of registers to register them in the database;
d) software and hardware states are checked and monitored by resource management state polling and registered in the database through the resource management service;
e) various faults of the file system and the runtime environment are detected by the service state check at job startup and registered in the database through the job system service;
f) while a job is running, faults and anomalies of the component services in use are detected from the runtime state, passed outward through an interface provided by the job system, and registered in the database.
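The sensing paths above all end in a registration in the database. A minimal in-memory sketch of such a registry is shown below; the class `InfoBase`, its methods, the key `(fid, component)` and all sample values are hypothetical and only illustrate how reports of the same fault arriving over several paths might be collected under one entry.

```python
class InfoBase:
    """In-memory stand-in for the system information base (illustrative only)."""

    def __init__(self):
        self._faults = {}   # (fid, component) -> {"sources": [...], "site": [...]}

    def register(self, source, fid, component, detail):
        """Record one report; return True if this fault was not known before."""
        key = (fid, component)
        is_new = key not in self._faults
        entry = self._faults.setdefault(key, {"sources": [], "site": []})
        if source not in entry["sources"]:
            entry["sources"].append(source)
        entry["site"].append(f"{source}: {detail}")
        return is_new

    def current_fault_list(self):
        """Return the distinct (fid, component) pairs in order of first report."""
        return list(self._faults)


base = InfoBase()
first = base.register("ipmi", "HW_NODE", "node12", "power supply sensor critical")
second = base.register("out_of_band", "HW_NODE", "node12", "board power fault")
base.register("snmp", "SWITCH_PORT_DOWN", "sw3/p7", "port reported down")
```

Here the second report of `HW_NODE` on `node12` does not create a new entry; it only adds a source and a site record to the existing one.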
(2) Multi-level collaborative fault analysis positioning
When a component of the system fails, the fault may be reported at several levels; for example, a node hardware fault may cause faults to be reported by the maintenance monitoring system, the node heartbeat system and the job control system. In this case, by monitoring and analyzing the resource states of nodes, the network and so on, and by combining software such as the operating system, the job resource management system and the runtime system, a multi-level information interaction mechanism for software-hardware cooperative control is established. Faults are classified uniformly, the upper-lower association characteristics of faults are set, faults are deduplicated and screened, derived faults and anomalies are removed, and the faults are located accurately. The multi-level collaborative fault analysis and locating flow is mainly as follows.
a) when a running fault or anomaly is detected during system operation, the analysis and locating flow is started;
b) the fault analysis system drives the components related to the fault site to perform deep detection of fault and running states and queries whether a deterministic fault has occurred;
c) each component of the system performs local state fault detection; when an anomaly is found but cannot be located accurately, the related components can be driven to perform deep detection according to the anomaly type (for example, when runtime detection finds a message transmission anomaly, a state check of the network components can be driven);
d) the fault analysis system collects the deep detection results sent by all components and merges and analyzes the fault sites based on the upper-lower association characteristics of the faults;
e) on the basis of the fault site analysis, the fault analysis system deduplicates and screens the faults, removes derived faults and anomalies, and locates the faults accurately.
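Steps b) to d) of this flow, in which an anomaly type drives deep detection in the related components, can be sketched as a simple dispatch table. Everything named here is an assumption for illustration: the table `DEEP_CHECKS`, the anomaly types, the component names and the probe callables, which return a fault code for a deterministic fault or `None` when the component is healthy.

```python
# Hypothetical mapping from anomaly type to the components worth probing.
DEEP_CHECKS = {
    "message_transfer": ["network"],
    "io_error": ["file_system", "network"],
}


def deep_detect(anomaly_type, probes):
    """Run the deep checks for one anomaly type.

    `probes` maps a component name to a zero-argument callable that returns
    a fault code (str) or None. Returns the (component, fault code) pairs
    for every deterministic fault that was found.
    """
    findings = []
    for component in DEEP_CHECKS.get(anomaly_type, []):
        fid = probes[component]()
        if fid is not None:
            findings.append((component, fid))
    return findings


# Stub probes standing in for real component state checks.
probes = {
    "network": lambda: "SWITCH_PORT_DOWN",
    "file_system": lambda: None,
}
message_findings = deep_detect("message_transfer", probes)
io_findings = deep_detect("io_error", probes)
```

The findings returned this way would then be handed to the fault analysis system, which merges them with the other reported faults.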
When the multi-source fault collaborative analysis positioning method for a large-scale system is adopted, the capability to analyze and locate system faults automatically is improved, and the problem of accurately locating faults in a large-scale parallel system is solved. On top of fault detection by traditional means such as IPMI, SNMP and heartbeat, when the system runs abnormally, faults and anomalies of the large-scale parallel system are sensed and discovered by combining several customized monitoring means: out-of-band maintenance fault monitoring, operating system anomaly monitoring, resource management service fault monitoring, job startup and runtime detection, and runtime environment anomaly detection. By classifying faults uniformly, setting the upper-lower association characteristics of faults, and relying on an information interaction mechanism for software-hardware cooperative control, multi-level collaborative fault analysis and locating is carried out. Fault deduplication and screening are thereby achieved, derived faults and anomalies are removed, and faults are located accurately. This addresses the difficulty that traditional fault detection and locating methods have in locating the various deep fault sources of the system quickly, comprehensively and accurately, provides basic support for fault-tolerant handling, and improves system availability and user experience.
To facilitate a better understanding of the invention, the terms used herein are briefly explained as follows:
Parallel computing: the process of solving a computing problem by using multiple computing resources simultaneously; multiple nodes/processors solve the same problem concurrently and cooperatively, which improves computing speed and processing capacity.
Parallel job: generally a set of task processes written with a parallel programming interface such as MPI and running on the computing resources of a parallel computer; it is started and controlled by the job system, and the processes cooperate to solve the same problem.
Resource management system: a management and control system running in a parallel computer that performs functions such as resource state monitoring, resource allocation and resource recovery.
Job management system: a management and control system running in a parallel computer that performs functions such as parallel job scheduling, task startup, control and recovery.
IPMI: the Intelligent Platform Management Interface, an industry standard for managing server equipment; a user can monitor the state of a server through IPMI.
Fault upper-lower association characteristics: the derivation relations that exist among the various faults in the system.
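Because the upper and lower association lists describe the same derivation relation from two sides, one list can in principle be derived from the other. The sketch below inverts a Fuplist table into the matching Fdownlist table; the function name `invert` and the fault codes are hypothetical, and the text does not say whether the lists are maintained this way.

```python
def invert(fuplist):
    """Build the Fdownlist table that matches a Fuplist table.

    fuplist[fid] lists the faults that can induce `fid`; the result maps
    each fault to the faults it can induce.
    """
    fdownlist = {fid: [] for fid in fuplist}
    for fid, causes in fuplist.items():
        for cause in causes:
            fdownlist.setdefault(cause, []).append(fid)
    return fdownlist


# Hypothetical fault codes with a two-level derivation chain.
fuplist = {
    "HW_NODE": [],
    "HEARTBEAT_LOST": ["HW_NODE"],
    "JOB_ABORT": ["HW_NODE", "HEARTBEAT_LOST"],
}
fdownlist = invert(fuplist)
```

Deriving one table from the other keeps the two views consistent, so a fault listed as a cause always lists the derived fault in return.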
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (1)

1. A multi-source fault collaborative analysis positioning method for a large-scale system, characterized in that it is based on the following modules:
the IPMI protocol fault monitoring module, used for collecting faults from commercial servers;
the SNMP protocol fault monitoring module, used for collecting faults from network equipment;
the node heartbeat fault monitoring module, used for collecting software faults of special-purpose equipment;
the maintenance access fault monitoring module, used for collecting hardware faults of special-purpose equipment;
the system core fault monitoring module, used for monitoring various system environment faults and Panic information and notifying the maintenance system service, by means of registers, to register them in the database;
the resource management fault monitoring module, used for monitoring software and hardware states and registering faults in the database through the resource management service;
the job startup and runtime fault monitoring module, used for checking service states when a job starts and while it runs, detecting various faults of the file system and the runtime environment, and registering them in the database through the job system service;
the runtime environment anomaly detection module, used for detecting faults and anomalies of the component services in use, passing them outward through an interface provided by the job system, and registering them in the database;
the customized out-of-band maintenance system module, used for monitoring hardware faults of the computing resources and registering them in the database through the maintenance system service;
the fault analysis system, used for deduplicating, screening and merging the data collected by each fault monitoring module;
the system information base, used for storing the data collected by each fault monitoring module and the data generated by the analysis of the fault analysis system;
the analysis positioning method comprises the following steps:
S1, the fault analysis system uniformly classifies the faults collected by each fault monitoring module, defines a fault code Fid for each fault, and defines an upper association list Fuplist and a lower association list Fdownlist for each fault, wherein the upper association list Fuplist contains the fault codes Fid of the faults that can induce the fault, and the lower association list Fdownlist contains the fault codes Fid of the faults that the fault can induce;
S2, the fault analysis system receives the faults sent by each fault monitoring module to form a current fault list;
S3, the fault analysis system performs upper-lower association analysis on the current fault list;
S4, the fault analysis system reads a fault Fk from the current fault list; if the read succeeds, it proceeds to S5, and if the read fails, it jumps to S2;
S5, the fault analysis system analyzes the upper and lower association lists of Fk; if both lists are empty, it jumps to S10, otherwise it proceeds to S6;
S6, the fault analysis system reads the lower association list Fdownlist of Fk; if Fdownlist is empty, it jumps to S8, otherwise it proceeds to S7;
S7, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the lower association list Fdownlist of Fk; if so, that fault is deleted from the current fault list and its fault site information is merged into the fault site of Fk;
S8, the fault analysis system reads the upper association list Fuplist of Fk; if Fuplist is empty, it jumps to S10, otherwise it proceeds to S9;
S9, the fault analysis system checks, by Fid, whether the current fault list contains any fault in the upper association list Fuplist of Fk; if so, the fault site of Fk is merged into the fault site of that fault, Fk is deleted from the current fault list, and the flow jumps to S4;
S10, the fault analysis system completes the accurate locating of the fault Fk and jumps to S4.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910863431.3A CN112100019B (en) | 2019-09-12 | 2019-09-12 | Multi-source fault collaborative analysis positioning method for large-scale system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910863431.3A CN112100019B (en) | 2019-09-12 | 2019-09-12 | Multi-source fault collaborative analysis positioning method for large-scale system

Publications (2)

Publication Number | Publication Date
CN112100019A (en) | 2020-12-18
CN112100019B (en) | 2021-03-23

Family

ID=73748895

Family Applications (1)

Application Number | Status | Priority Date | Filing Date | Title
CN201910863431.3A CN112100019B (en) | Active | 2019-09-12 | 2019-09-12 | Multi-source fault collaborative analysis positioning method for large-scale system

Country Status (1)

Country | Link
CN (1) | CN112100019B (en)

Legal Events

Date Code Title Description
PB01 Publication
CB03 Change of inventor or designer information
Inventor after: Gao Jiangang; Gong Daoyong; Song Changming; Qian Yu; Li Weidong; Zhang Hongyu; Liu Sha
Inventor before: Gong Daoyong; Song Changming; Qian Yu; Li Weidong; Zhang Hongyu; Liu Sha
SE01 Entry into force of request for substantive examination
GR01 Patent grant