CN116418653A - Fault positioning method and device based on multi-index root cause positioning algorithm

Info

Publication number: CN116418653A
Application number: CN202310262617.XA
Authority: CN
Inventors: 王敬宇; 黄成明; 吕雯鑫; 曹金刚
Original assignee: St Max Intelligent Technology Jiangsu Co ltd
Current assignee: St Max Intelligent Technology Jiangsu Co ltd
Priority date: 2023-03-17
Filing date: 2023-03-17
Publication date: 2023-07-11

Abstract

The invention provides a fault positioning method and device based on a multi-index root cause positioning algorithm, and relates to the technical field of fault positioning. The method comprises the following steps: operation and maintenance data of a service system are obtained, and a service system topological graph and a fault system topological graph are constructed according to the operation and maintenance data; at least one target index for positioning a fault root cause is screened according to the service system topological graph and the fault system topological graph; the fault position and the fault reason of the service system are determined according to the target index; the fault grade of the service system is determined according to the fault position and the fault reason of the service system; and a corresponding alarm strategy is executed according to the fault grade of the service system and the fault system topological graph. Because a plurality of indexes for root cause positioning are screened out through the service system topological graph and the fault system topological graph, and root cause positioning is performed based on these indexes, the positioning speed and the positioning accuracy of fault diagnosis of the service system are improved.

Description

Fault positioning method and device based on multi-index root cause positioning algorithm

Technical Field

The invention relates to the technical field of fault location, in particular to a fault location method and device based on a multi-index root cause location algorithm.

Background

With the continuous development of business, micro-service architectures are increasingly favored by large enterprises, which poses greater challenges to traditional operation and maintenance. The multi-dimensional KPI indexes are numerous and have complex relationships with one another, so operation and maintenance personnel urgently need to locate a fault as soon as it occurs.

In the related art, existing root cause positioning algorithms often take a single index as the entry point for root cause positioning and thereby ignore the correlations between indexes, so the positioning speed and the positioning accuracy are poor.

Disclosure of Invention

The embodiment of the invention provides a fault positioning method and device based on a multi-index root cause positioning algorithm, aiming at solving the problems in the background technology.

In order to solve the technical problems, the invention is realized as follows:

in a first aspect, an embodiment of the present invention provides a fault location method based on a multi-index root cause location algorithm, where the method includes:

acquiring operation and maintenance data of a service system, and constructing a service system topological graph and a fault system topological graph according to the operation and maintenance data;

screening at least one target index for positioning the root cause of the fault according to the service system topological graph and the fault system topological graph;

determining the fault position and the fault reason of the service system according to the target index;

determining the fault grade of the service system according to the fault position and the fault reason of the service system;

and executing a corresponding alarm strategy according to the fault level of the service system and the fault system topological graph.

Optionally, the operation and maintenance data includes abnormal log data and key data of network element equipment; the step of constructing a service system topological graph and a fault system topological graph according to the operation and maintenance data comprises the following steps:

acquiring service processing logic between network element devices, and determining a data interaction relationship between the network element devices according to the service processing logic relationship between the network element devices;

constructing the service system topological graph according to the key data of the network element equipment and the data interaction relation between the network element equipment;

determining an abnormal propagation direction between network element equipment according to the abnormal log data;

and updating the service system topology graph according to the abnormal propagation direction between the network element devices to obtain the fault system topology graph.

Optionally, the step of screening at least one target index for fault root cause positioning according to the service system topological graph and the fault system topological graph includes:

determining a feature matrix of a topological graph of a first dimension according to the topological graph of the service system, wherein the feature matrix of the topological graph of the first dimension represents service association relations among network element devices in the service system;

determining a feature matrix of the topological graph of the second dimension according to the topological graph of the fault system; the feature matrix of the topological graph of the second dimension characterizes abnormal resource calling relations among network element devices in the fault system;

inputting the feature matrix of the topological graph in the first dimension and the feature matrix of the topological graph in the second dimension into a preset neural network model to respectively obtain a first index screening result and a second index screening result, wherein the first index screening result and the second index screening result comprise a plurality of indexes to be selected and screening probabilities corresponding to the indexes to be selected;

and screening at least one target index for positioning the root cause of the fault according to the size relation between the screening probability of each index to be selected in the first index screening result and the second index screening result and a preset first threshold value.

Optionally, the step of determining the fault location and the fault cause of the service system according to the target index includes:

determining a correlation coefficient between each monitoring index and the target index in the service system;

calculating the association degree score of each monitoring index and the target index according to the correlation coefficient and the first weight corresponding to the correlation coefficient;

screening at least one fault location monitoring index from the monitoring indexes according to the magnitude relation between the association degree score and a preset second threshold value;

and determining the fault position and the fault reason of the service system according to the fault positioning monitoring index.

Optionally, the step of determining the fault location and the fault cause of the service system according to the fault location monitoring index includes:

determining fault location sub-areas mapped by each fault location monitoring index;

determining the intersection of the fault location sub-areas as a target fault location area, wherein the target fault location area characterizes the fault location of the service system;

and determining a corresponding fault reason according to the target fault location area and the fault positioning monitoring index.
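
The intersection step above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `locate_fault_area`, the index names, and the network element names are all hypothetical, and each fault location sub-area is assumed to be representable as a set of network element identifiers.

```python
def locate_fault_area(sub_areas):
    """Intersect the sub-areas mapped from each fault location monitoring index.

    sub_areas maps an index name to the set of network elements it points at
    (an assumed representation; the source does not fix a data structure).
    """
    area_sets = [set(area) for area in sub_areas.values()]
    if not area_sets:
        return set()
    return set.intersection(*area_sets)


# Hypothetical sub-areas, for illustration only.
sub_areas = {
    "cpu_usage": {"orders", "db"},
    "error_rate": {"gateway", "orders"},
    "latency": {"orders", "auth"},
}
target_area = locate_fault_area(sub_areas)
```

In this sketch only the network element shared by every sub-area survives, which corresponds to the target fault location area.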

Optionally, the step of determining the fault level of the service system according to the fault location and the fault reason of the service system includes:

determining a first fault evaluation score of the service system according to the fault position in the hierarchy of the service system;

determining a second fault evaluation score of the service system according to the influence capacity of the fault reason on the service system;

calculating a final fault evaluation score of the service system according to the first fault evaluation score and the second fault evaluation score, and the second weight corresponding to the first fault evaluation score and the second fault evaluation score;

and determining the fault grade of the service system according to the final fault evaluation score of the service system.
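
One possible reading of the scoring steps above is a weighted sum followed by a threshold. The sketch below assumes exactly that; the weighted-sum form, the function names, the weights, and the threshold value are illustrative assumptions and are not specified by the source.

```python
def final_fault_score(position_score, cause_score, position_weight, cause_weight):
    """Combine the two evaluation scores with their weights (assumed weighted sum)."""
    return position_weight * position_score + cause_weight * cause_score


def fault_level(score, high_threshold):
    """Map the final score to a level; the single threshold is an assumption."""
    return "high" if score >= high_threshold else "low"


# Hypothetical scores and weights, for illustration only.
score = final_fault_score(position_score=80, cause_score=40,
                          position_weight=0.6, cause_weight=0.4)
level = fault_level(score, high_threshold=60)
```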

Optionally, the fault level includes a low-level fault and a high-level fault, and the step of executing a corresponding alarm policy according to the fault level of the service system and the fault system topology map includes:

executing a fault self-healing processing strategy under the condition that the fault level of the service system is low, and not giving an alarm;

and under the condition that the fault grade of the service system is high, executing a fault alarm processing strategy and sending the fault position and the fault reason of the service system to a manager of the service system.
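
The two alarm strategies above amount to a dispatch on the fault level. The following sketch assumes the self-healing action and the manager notification are supplied as callables; `execute_alarm_strategy`, `self_heal`, and `notify_manager` are hypothetical names, and the role of the fault system topology graph in the strategy is left out because this passage does not detail it.

```python
def execute_alarm_strategy(level, location, cause, self_heal, notify_manager):
    """Dispatch on the fault level: self-heal silently, or alarm the manager."""
    if level == "low":
        self_heal(location, cause)       # low-level fault: no alarm is raised
        return "self_healed"
    if level == "high":
        notify_manager(location, cause)  # high-level fault: report where and why
        return "alarm_sent"
    raise ValueError(f"unknown fault level: {level!r}")


# Hypothetical handlers that only record what they were asked to do.
healed, alarms = [], []
low_outcome = execute_alarm_strategy(
    "low", "orders", "connection pool exhausted",
    self_heal=lambda loc, why: healed.append((loc, why)),
    notify_manager=lambda loc, why: alarms.append((loc, why)),
)
high_outcome = execute_alarm_strategy(
    "high", "db", "disk failure",
    self_heal=lambda loc, why: healed.append((loc, why)),
    notify_manager=lambda loc, why: alarms.append((loc, why)),
)
```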

A second aspect of the embodiment of the present invention provides a fault locating device based on a multi-index root cause locating algorithm, where the device includes:

the acquisition module is used for acquiring operation and maintenance data of the service system and constructing a service system topological graph and a fault system topological graph according to the operation and maintenance data;

the screening module is used for screening at least one target index for positioning the root cause of the fault according to the service system topological graph and the fault system topological graph;

the first determining module is used for determining the fault position and the fault reason of the service system according to the target index;

the second determining module is used for determining the fault grade of the service system according to the fault position and the fault reason of the service system;

and the alarm module is used for executing a corresponding alarm strategy according to the fault level of the service system and the fault system topological graph.

Optionally, the acquiring module includes:

the data interaction relation determining sub-module is used for acquiring service processing logic between network element devices and determining the data interaction relation between the network element devices according to the service processing logic relation between the network element devices;

a service system topological graph construction sub-module, configured to construct the service system topological graph according to the key data of the network element device and the data interaction relationship between the network element devices;

an abnormal propagation direction determining sub-module, configured to determine an abnormal propagation direction between network element devices according to the abnormal log data;

and the fault system topology diagram construction submodule is used for updating the service system topology diagram according to the abnormal propagation direction between the network element devices to obtain the fault system topology diagram.

Optionally, the screening module includes:

the first computing sub-module is used for determining a feature matrix of the topological graph of the first dimension according to the topological graph of the service system, wherein the feature matrix of the topological graph of the first dimension represents service association relations among network element devices in the service system;

the second calculation sub-module is used for determining a feature matrix of the topological graph of the second dimension according to the topological graph of the fault system; the feature matrix of the topological graph of the second dimension characterizes abnormal resource calling relations among network element devices in the fault system;

an input sub-module, configured to input a feature matrix of the topological graph in the first dimension and a feature matrix of the topological graph in the second dimension into a preset neural network model, to obtain a first index screening result and a second index screening result, where the first index screening result and the second index screening result include a plurality of indexes to be selected and screening probabilities corresponding to the indexes to be selected;

and the screening sub-module is used for screening at least one target index for positioning the root cause of the fault according to the size relation between the screening probability of each index to be selected in the first index screening result and the second index screening result and a preset first threshold value.

Optionally, the first determining module includes:

the correlation coefficient determination submodule is used for determining the correlation coefficient between each monitoring index and the target index in the service system;

the relevance calculating submodule is used for calculating the relevance score of each monitoring index and the target index according to the correlation coefficient and the first weight corresponding to the correlation coefficient;

the fault location monitoring index screening sub-module is used for screening at least one fault location monitoring index from the monitoring indexes according to the magnitude relation between the association degree score and a preset second threshold value;

Optionally, the fault location monitoring index screening sub-module includes:

the fault location sub-area determining unit is used for determining a fault location sub-area mapped by each fault location monitoring index;

a target fault location area determining unit, configured to determine an intersection of the fault location sub-areas as a target fault location area, where the target fault location area represents a fault location of a service system;

and the fault cause determining unit is used for determining the corresponding fault cause according to the target fault location area and the fault positioning monitoring index.

Optionally, the second determining module includes:

a first fault evaluation score determining sub-module, configured to determine a first fault evaluation score of the service system according to the level of the fault location in the service system;

a second fault evaluation score determining submodule, configured to determine a second fault evaluation score of the service system according to the influence capability of the fault cause on the service system;

a final fault evaluation score determining sub-module, configured to calculate a final fault evaluation score of the service system according to the first fault evaluation score and the second fault evaluation score, and a second weight corresponding to the first fault evaluation score and the second fault evaluation score;

and the fault grade determining sub-module is used for determining the fault grade of the service system according to the final fault evaluation score of the service system.

Optionally, the alarm module includes:

the first alarm sub-module is used for executing a fault self-healing processing strategy and not giving an alarm under the condition that the fault level of the service system is low;

and the second alarm sub-module is used for executing a fault alarm processing strategy and sending the fault position and the fault reason of the service system to a manager of the service system under the condition that the fault grade of the service system is high.

A third aspect of the embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing the method steps provided by the first aspect of the embodiment of the invention when executing the program stored in the memory.

A fourth aspect of the embodiments of the present invention proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as proposed in the first aspect of the embodiments of the present invention.

The embodiment of the invention has the following advantages: firstly, operation and maintenance data of a service system are obtained, a service system topological graph and a fault system topological graph are constructed according to the operation and maintenance data, at least one target index for positioning a fault root cause is screened according to the service system topological graph and the fault system topological graph, then the fault position and the fault reason of the service system are determined according to the target index, the fault grade of the service system is determined according to the fault position and the fault reason of the service system, and finally, a corresponding alarm strategy is executed according to the fault grade of the service system and the fault system topological graph. In the method, a plurality of indexes for root cause positioning are screened out by generating a service system topological graph representing the normal running state of the service system and a fault system topological graph representing the abnormal information interaction state of the service system, the root cause positioning is performed based on the indexes, the association relation among different indexes is considered, and the positioning speed and the positioning accuracy of fault diagnosis of the service system are improved.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic structural diagram of an electronic device in a hardware running environment according to an embodiment of the present application.

Fig. 2 is a schematic diagram of a system architecture according to an embodiment of the present application.

Fig. 3 is a flowchart of steps of a fault location method based on a multi-index root cause location algorithm according to an embodiment of the present application.

Fig. 4 is a schematic functional block diagram of a fault locating device based on a multi-index root cause locating algorithm according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The following further describes the aspects of the present application with reference to the accompanying drawings.

Referring to fig. 1, fig. 1 is a schematic structural diagram of an electronic device in a hardware running environment according to an embodiment of the present application.

As shown in fig. 1, the electronic device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.

Those skilled in the art will appreciate that the structure shown in fig. 1 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

As shown in fig. 1, an operating system, a data storage module, a network communication module, a user interface module, and an electronic program may be included in the memory 1005 as one type of storage medium.

In the electronic device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 may be provided in the electronic device, and the electronic device invokes, through the processor 1001, the fault locating device based on the multi-index root cause locating algorithm stored in the memory 1005, and executes the fault locating method based on the multi-index root cause locating algorithm provided in the embodiment of the present invention.

Referring to fig. 2, a system architecture diagram of an embodiment of the present application is shown. As shown in fig. 2, the system architecture may include a first device 201, a second device 202, a third device 203, a fourth device 204, and a network 205. The network 205 serves as a medium that provides communication links between the first device 201, the second device 202, the third device 203, and the fourth device 204. The network 205 may include various connection types, such as wired communication links, wireless communication links, or fiber optic cables.

In this embodiment, the first device 201, the second device 202, the third device 203, and the fourth device 204 may be hardware devices or software that support network connection to provide various network services. When the device is hardware, it may be a variety of electronic devices including, but not limited to, smartphones, tablets, laptop portable computers, desktop computers, servers, and the like. In this case, the hardware device may be realized as a distributed device group composed of a plurality of devices, or may be realized as a single device. When the device is software, it can be installed in the above-listed devices. In this case, as software, it may be implemented as a plurality of software or software modules for providing distributed services, for example, or as a single software or software module. The present invention is not particularly limited herein.

In a specific implementation, the device may provide the corresponding network service by installing a corresponding client application or server application. After the device has installed the client application, it may be embodied as a client in network communication. Accordingly, after the server application is installed, it may be embodied as a server in network communications.

As an example, in fig. 2, the first device 201 is embodied as a server, and the second device 202, the third device 203, and the fourth device 204 are embodied as clients. Specifically, the second device 202, the third device 203, and the fourth device 204 may be clients installed with an information browsing-type application, and the first device 201 may be a background server of the information browsing-type application. It should be noted that the fault locating method based on the multi-index root cause locating algorithm provided in the embodiment of the present application may be executed by the first device 201.

It should be understood that the number of networks and devices in fig. 2 is merely illustrative. There may be any number of networks and devices as desired for an implementation.

S301: acquiring operation and maintenance data of the service system, and constructing a service system topological graph and a fault system topological graph according to the operation and maintenance data.

In this embodiment, the operation and maintenance data refers to the normal transaction data and the abnormal interaction data between network element devices that are generated while the service system is running. The service system topology graph is a dynamic diagram showing how the network element devices in the service system interact with one another, and the fault system topology graph is a dynamic diagram showing how the abnormal network element devices interact. The step of constructing the service system topology graph and the fault system topology graph according to the operation and maintenance data includes:

S301-1: acquiring service processing logic between network element devices, and determining a data interaction relationship between the network element devices according to the service processing logic relationship between the network element devices;

S301-2: constructing a service system topological graph according to key data of network element equipment and a data interaction relationship between the network element equipment;

S301-3: determining an abnormal propagation direction between network element devices according to the abnormal log data;

S301-4: updating the service system topological graph according to the abnormal propagation direction between the network element devices to obtain a fault system topological graph.

In the embodiments of S301-1 to S301-4, the service processing logic between network element devices is the data processing logic used when one network element device performs service interaction with other network element devices. The data interaction relationship between network element devices specifies, for any network element device, which network element device it needs to acquire data from and which network element device it needs to send the processed data to. The key data of a network element device refers to its hardware attribute information, such as the interface protocol and the network address. Once the key data of the network element devices and the data interaction relationships between them have been acquired, the service system topology graph can be obtained according to the execution sequence of the data interactions. The abnormal log data can reflect malicious data competition relationships among network element devices, that is, the propagation direction of abnormal data. After the service system topology graph is obtained, it is updated according to the propagation direction of the abnormal data: the data interaction relationships among normal network element devices that are irrelevant to the propagation direction of the abnormal data are deleted, which yields the fault system topology graph.
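
Steps S301-1 to S301-4 can be sketched with a plain adjacency structure. The example below is a simplified illustration under stated assumptions: data interactions and abnormal propagation directions are given as (source, target) pairs, the key data (hardware attributes) of each network element is omitted, and every name in it (`build_service_topology`, `build_fault_topology`, the device names) is hypothetical.

```python
from collections import defaultdict


def build_service_topology(interactions):
    """Build a directed graph {source: set(targets)} from data-interaction pairs."""
    graph = defaultdict(set)
    for src, dst in interactions:
        graph[src].add(dst)
        graph.setdefault(dst, set())  # make sure pure receivers appear as nodes
    return dict(graph)


def build_fault_topology(service_graph, abnormal_edges):
    """Delete every interaction that is unrelated to abnormal data propagation."""
    abnormal = set(abnormal_edges)
    fault_graph = {}
    for src, targets in service_graph.items():
        kept = {dst for dst in targets if (src, dst) in abnormal}
        if kept:
            fault_graph[src] = kept
    return fault_graph


# Hypothetical network elements and interactions, for illustration only.
interactions = [("gateway", "auth"), ("gateway", "orders"),
                ("orders", "db"), ("auth", "db")]
abnormal_edges = [("gateway", "orders"), ("orders", "db")]

service_graph = build_service_topology(interactions)
fault_graph = build_fault_topology(service_graph, abnormal_edges)
```

Here the fault graph keeps only the edges along which abnormal data propagates, mirroring the deletion of normal interactions described above.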

S302: screening at least one target index for positioning the root cause of the fault according to the service system topological graph and the fault system topological graph.

In this embodiment, after the service system topology graph and the fault system topology graph are obtained, the target index used to determine where and why the fault occurred may be selected from a plurality of indexes of the service system according to the two topology graphs. The indexes of the service system include performance indexes, such as delay, throughput, CPU occupancy rate, and RAM, as well as service indexes associated with the services of the service system. The step of screening at least one target index for positioning the root cause of the fault according to the service system topology graph and the fault system topology graph includes:

S302-1: determining a feature matrix of the topological graph of the first dimension according to the topological graph of the service system;

S302-2: determining a feature matrix of the topological graph of the second dimension according to the topological graph of the fault system;

S302-3: inputting a feature matrix of the topological graph in the first dimension and a feature matrix of the topological graph in the second dimension into a preset neural network model to respectively obtain a first index screening result and a second index screening result, wherein the first index screening result and the second index screening result comprise a plurality of indexes to be selected and screening probabilities corresponding to the indexes to be selected;

S302-4: screening at least one target index for fault root cause positioning according to the size relation between the screening probability of each index to be selected in the first index screening result and the second index screening result and a preset first threshold value.

In the embodiments of S302-1 to S302-4, feature extraction is performed on the service system topology graph and the fault system topology graph to obtain a feature matrix of the topology graph in the first dimension and a feature matrix of the topology graph in the second dimension. The feature matrix in the first dimension represents the service association relations among network element devices in the service system, and the feature matrix in the second dimension represents the abnormal resource calling relations among network element devices in the fault system, so that both the normal and the abnormal data interaction relations among network element devices are converted into corresponding feature values. The two feature matrices are input into a preset, trained neural network model to obtain a first index screening result corresponding to the first-dimension feature matrix and a second index screening result corresponding to the second-dimension feature matrix. The first index screening result comprises the indexes for root cause positioning, with their screening probabilities, determined according to the service system topology graph; the second index screening result comprises the indexes for root cause positioning, with their screening probabilities, determined according to the fault system topology graph. The screening probability of each index is then compared with a preset screening probability threshold, and the indexes whose screening probability is greater than the threshold are determined as target indexes.
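
The final thresholding step (S302-4) can be sketched as below. The neural network model itself is not reproduced; its two outputs are assumed to be available as dictionaries mapping an index name to a screening probability. Taking the union of indexes that pass the threshold in either result is also an assumption, since the source does not say how the two results are combined. All names and values are hypothetical.

```python
def screen_target_indexes(first_result, second_result, threshold):
    """Return the indexes whose screening probability exceeds the threshold
    in either screening result (union of both results is an assumption)."""
    selected = set()
    for result in (first_result, second_result):
        for name, probability in result.items():
            if probability > threshold:
                selected.add(name)
    return sorted(selected)


# Hypothetical model outputs, for illustration only.
first_result = {"latency": 0.91, "throughput": 0.42, "cpu_usage": 0.77}
second_result = {"latency": 0.88, "error_rate": 0.81, "cpu_usage": 0.35}

targets = screen_target_indexes(first_result, second_result, threshold=0.75)
```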

S303: determining the fault position and the fault reason of the service system according to the target index.

In this embodiment, after screening out the target index for root cause positioning, the fault location and the fault cause of the service system may be determined according to the target index, which specifically includes the steps of:

S303-1: determining a correlation coefficient between each monitoring index and a target index in a service system;

S303-2: calculating the association degree score of each monitoring index and the target index according to the correlation coefficient and the first weight corresponding to the correlation coefficient;

S303-3: screening at least one fault location monitoring index from the monitoring indexes according to the magnitude relation between the association degree score and a preset second threshold value;

In the embodiments of S303-1 to S303-3, a monitoring index refers to an index having an association relationship with the target index, that is, a change of the monitoring index may cause a change of the target index. When the target index changes abnormally, the abnormality is likely to be caused by a change of a monitoring index having a high association with the target index. For example, when service indexes such as response time and transaction volume are observed to be abnormal, the performance indexes having a high association with those service indexes may be checked, and such performance indexes are likely to be the root cause of the fault. Therefore, the correlation coefficient between each monitoring index and the target index is calculated first, and the association degree score of each monitoring index is then calculated from the correlation coefficient and the weight coefficient corresponding to the importance of that monitoring index. After the association degree scores of all monitoring indexes have been calculated in this way, the monitoring indexes are screened according to the magnitude relation between the association degree score of each monitoring index and the preset association degree score threshold (the second threshold), and the fault location monitoring indexes screened out are the monitoring indexes strongly associated with the target index.
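Steps S303-1 to S303-3 may be sketched as follows. The description does not name a specific correlation coefficient, so the Pearson coefficient is assumed here; taking the absolute value of the coefficient, multiplying it by the weight, and keeping scores strictly greater than the second threshold are likewise assumptions, as are all function names, sample series and numeric values.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def screen_monitoring_indexes(
    target: List[float],
    monitors: Dict[str, List[float]],
    first_weights: Dict[str, float],
    second_threshold: float,
) -> Dict[str, float]:
    """Score each monitoring index and keep those above the second threshold."""
    scores = {}
    for name, series in monitors.items():
        coefficient = abs(pearson(series, target))                 # S303-1
        scores[name] = coefficient * first_weights.get(name, 1.0)  # S303-2
    return {n: s for n, s in scores.items() if s > second_threshold}  # S303-3


# Hypothetical samples: response time is the target index.
response_time = [100.0, 120.0, 180.0, 260.0, 400.0]
monitors = {
    "cpu_usage": [20.0, 25.0, 40.0, 60.0, 90.0],
    "fan_speed": [5.0, 3.0, 6.0, 4.0, 5.0],
}
weights = {"cpu_usage": 0.9, "fan_speed": 0.5}

selected = screen_monitoring_indexes(response_time, monitors, weights, second_threshold=0.6)
print(sorted(selected))
```

With these sample values only cpu_usage, which rises together with the response time, is retained as a fault location monitoring index.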

S303-4: determining the fault location and the fault cause of the service system according to the fault location monitoring index.

In this embodiment, after the fault location monitoring index is obtained, the fault location and the fault cause of the service system may be determined according to the fault location monitoring index, which specifically includes the steps of:

S303-4-1: determining the fault location sub-area mapped by each fault location monitoring index;

S303-4-2: determining the intersection of the fault location sub-areas as a target fault location area, wherein the target fault location area represents the fault location of the service system;

S303-4-3: determining a corresponding fault cause according to the target fault location area and the fault location monitoring index.

In the embodiments of S303-4-1 to S303-4-3, the target fault location area represents the fault location of the service system. The fault areas corresponding to each fault location monitoring index are usually a fixed set of areas, so the fault location sub-areas can be obtained by mapping from the fault location monitoring index. The intersection of the fault location sub-areas is then determined as the target fault location area. After the target fault location area is determined, the fault cause can be inferred from the faults that commonly occur in the target fault location area and from the fault location monitoring indexes.

As an example, suppose there are a fault location monitoring index A and a fault location monitoring index B, the fault location sub-areas mapped by index A are numbered a, b, and c, and the fault location sub-areas mapped by index B are numbered c, d, and e. The target fault location area is then c, that is, area c is determined as the diagnosis result for the fault location. Suppose further that faults in area c generally have three causes, e, f, and g, where cause e is related to both index A and index B, cause f is related only to index A, and cause g is related only to index B. Cause e can then be taken as the diagnosis result for the fault cause.
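The example can be reproduced with ordinary set operations. In the following sketch, index A maps to sub-areas a, b, c and index B maps to sub-areas c, d, e; causes e, f and g of area c are written as cause_e, cause_f and cause_g to avoid confusion with the area names. The function names and the rule "keep the causes related to every fault location monitoring index" are illustrative assumptions.

```python
from typing import Dict, List, Set


def locate_fault(sub_areas_by_index: Dict[str, Set[str]]) -> Set[str]:
    """S303-4-2: the target area is the intersection of all mapped sub-areas."""
    area_sets = [set(areas) for areas in sub_areas_by_index.values()]
    return set.intersection(*area_sets) if area_sets else set()


def infer_causes(
    candidate_causes: Dict[str, Set[str]], active_indexes: Set[str]
) -> List[str]:
    """S303-4-3: keep the causes related to every active monitoring index."""
    return sorted(
        cause
        for cause, related in candidate_causes.items()
        if active_indexes <= related
    )


# Sub-areas mapped by each fault location monitoring index.
sub_areas = {"A": {"a", "b", "c"}, "B": {"c", "d", "e"}}

# Typical causes in area c and the indexes each cause is related to.
causes_in_area_c = {
    "cause_e": {"A", "B"},
    "cause_f": {"A"},
    "cause_g": {"B"},
}

target_area = locate_fault(sub_areas)
fault_causes = infer_causes(causes_in_area_c, set(sub_areas))
print(target_area, fault_causes)
```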

S304: determining the fault level of the service system according to the fault location and the fault cause of the service system.

In this embodiment, after the fault location and the fault cause of the service system are determined, the fault level of the service system may be determined according to the fault location and the fault cause, and the specific steps include:

S304-1: determining a first fault evaluation score of the service system according to the level of the fault location in the service system;

S304-2: determining a second fault evaluation score of the service system according to the influence capacity of the fault cause on the service system;

S304-3: calculating the final fault evaluation score of the service system according to the first fault evaluation score, the second fault evaluation score, and the second weights corresponding to the first fault evaluation score and the second fault evaluation score;

S304-4: determining the fault level of the service system according to the final fault evaluation score of the service system.

In the embodiments of S304-1 to S304-4, first, if the fault location is at an upper level of the service system, such as the resource layer or the data layer, the corresponding first fault evaluation score is high; if the fault location is at a lower level, such as the service layer or the user layer, the corresponding first fault evaluation score is low. That is, the level of the fault location in the hierarchy of the service system is positively correlated with the first fault evaluation score. Second, if the fault cause has little influence on the service system, for example the feedback speed of the system is slightly delayed and the response time increases, the corresponding second fault evaluation score is low; if the fault cause has a great influence on the service system, for example the authenticity and accuracy of the data are affected, the corresponding second fault evaluation score is high. That is, the influence capacity of the fault cause on the service system is positively correlated with the second fault evaluation score. Therefore, after the first fault evaluation score and the second fault evaluation score are obtained, the final fault evaluation score can be calculated according to the weights corresponding to the first fault evaluation score and the second fault evaluation score, and the fault level of the service system is then determined according to the score interval in which the final fault evaluation score falls.
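The scoring of S304-1 to S304-4 may be sketched as a weighted sum followed by an interval lookup. The description gives no concrete scores, weights or interval boundaries, so the score tables, the weights 0.4 and 0.6, the cutoff 60 and all names below are hypothetical values chosen only to respect the stated ordering (upper layers and high-impact causes score higher).

```python
# Hypothetical score tables; only their ordering follows the description.
LAYER_SCORES = {"resource": 90.0, "data": 80.0, "service": 40.0, "user": 20.0}
IMPACT_SCORES = {"response_delay": 20.0, "data_integrity": 90.0}


def final_fault_score(
    layer: str, cause: str, w_location: float = 0.4, w_cause: float = 0.6
) -> float:
    """S304-1 to S304-3: weighted sum of the two fault evaluation scores."""
    first_score = LAYER_SCORES[layer]    # score from the level of the fault location
    second_score = IMPACT_SCORES[cause]  # score from the influence of the fault cause
    return w_location * first_score + w_cause * second_score


def fault_grade(score: float, high_cutoff: float = 60.0) -> str:
    """S304-4: map the final score to a grade by its score interval."""
    return "high" if score >= high_cutoff else "low"


severe = final_fault_score("data", "data_integrity")
minor = final_fault_score("user", "response_delay")
print(fault_grade(severe), fault_grade(minor))
```

A two-interval split is used here because the later steps distinguish only low-level and high-level faults; more intervals could be added in the same way.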

S305: executing a corresponding alarm strategy according to the fault level of the service system and the fault system topological graph.

In this embodiment, after determining the failure level of the service system, a corresponding alarm policy may be executed according to the failure level of the service system and the topology map of the failure system, which includes the specific steps of:

S305-1: under the condition that the fault level of the service system is low, executing a fault self-healing processing strategy and not giving an alarm;

S305-2: under the condition that the fault level of the service system is high, executing a fault alarm processing strategy and sending the fault location and the fault cause of the service system to a manager of the service system.

In the embodiments of S305-1 to S305-2, the fault level includes a low-level fault and a high-level fault. A low-level fault represents a fault that does not affect the normal operation of the service system, and a high-level fault represents a fault that may affect the normal operation of the service system. If the fault level of the service system is low, in order to ensure that the service system continues to operate normally, a fault self-healing policy may be issued to the service system without giving an alarm to service personnel, and the service system executes the fault self-healing policy by itself, for example by closing the faulty data interface. If the fault level of the service system is high, the service system may stop running at any time, so the manager, that is, the system operation and maintenance personnel, needs to be notified in time to remove the fault. An alarm processing message may therefore be sent to the operation and maintenance personnel through a preset mailbox interface. The alarm processing message includes the fault location and the fault cause, thereby helping the operation and maintenance personnel to quickly complete the root cause diagnosis of the fault in the service system and to determine the processing strategy.
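The branching of S305-1 and S305-2 may be sketched as a simple dispatch. The self-healing action and the mailbox interface are represented by caller-supplied callbacks, since their actual interfaces are not given in this description; the Diagnosis class, the function name and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Diagnosis:
    level: str     # "low" or "high"
    location: str  # fault location of the service system
    cause: str     # fault cause of the service system


def execute_alarm_policy(
    diagnosis: Diagnosis,
    self_heal: Callable[[Diagnosis], None],
    notify: Callable[[str], None],
) -> str:
    """Run the self-healing policy for low-level faults, otherwise send an alarm."""
    if diagnosis.level == "low":
        self_heal(diagnosis)  # S305-1: no alarm is given
        return "self_heal"
    # S305-2: the alarm message carries the fault location and the fault cause.
    notify(f"Fault at {diagnosis.location}: {diagnosis.cause}")
    return "alarm"


# Stand-ins for the real self-healing action and mailbox interface.
healed: List[str] = []
sent: List[str] = []

low = Diagnosis("low", "data interface 3", "interface timeout")
high = Diagnosis("high", "data layer", "data integrity violation")

actions = [
    execute_alarm_policy(d, lambda x: healed.append(x.location), sent.append)
    for d in (low, high)
]
print(actions, healed, sent)
```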

According to the fault positioning method based on the multi-index root cause positioning algorithm, operation and maintenance data of a service system are firstly obtained, a service system topological graph and a fault system topological graph are constructed according to the operation and maintenance data, at least one target index for positioning the root cause of the fault is screened out according to the service system topological graph and the fault system topological graph, then the fault position and the fault cause of the service system are determined according to the target index, the fault grade of the service system is determined according to the fault position and the fault cause of the service system, and finally a corresponding alarm strategy is executed according to the fault grade of the service system and the fault system topological graph. In the method, a plurality of indexes for root cause positioning are screened out by generating a service system topological graph representing the normal running state of the service system and a fault system topological graph representing the abnormal information interaction state of the service system, the root cause positioning is performed based on the indexes, the association relation among different indexes is considered, and the positioning speed and the positioning accuracy of fault diagnosis of the service system are improved.

Referring to fig. 4, a second aspect of the embodiment of the present invention proposes a fault locating device 400 based on a multi-index root cause locating algorithm, the device comprising:

The acquiring module 401 is configured to acquire operation and maintenance data of a service system, and construct a service system topology graph and a fault system topology graph according to the operation and maintenance data;

a screening module 402, configured to screen at least one target index for positioning a root cause of a fault according to the service system topology map and the fault system topology map;

a first determining module 403, configured to determine a fault location and a fault cause of the service system according to the target indicator;

a second determining module 404, configured to determine a failure level of the service system according to a failure location and a failure cause of the service system;

and the alarm module 405 is configured to execute a corresponding alarm policy according to the fault level of the service system and the fault system topology map.

In one possible embodiment, the acquisition module includes:

the data interaction relation determining sub-module is used for acquiring service processing logic between the network element devices and determining the data interaction relation between the network element devices according to the service processing logic relation between the network element devices;

the service system topological graph construction sub-module is used for constructing a service system topological graph according to the key data of the network element equipment and the data interaction relationship between the network element equipment;

And the fault system topological graph construction submodule is used for updating the service system topological graph according to the abnormal propagation direction between the network element devices to obtain a fault system topological graph.

In one possible embodiment, the screening module includes:

the second calculation sub-module is used for determining a feature matrix of the topological graph of the second dimension according to the topological graph of the fault system; the feature matrix of the topological graph of the second dimension represents abnormal resource calling relations among network element devices in the fault system;

the input sub-module is used for inputting the feature matrix of the topological graph in the first dimension and the feature matrix of the topological graph in the second dimension into a preset neural network model to respectively obtain a first index screening result and a second index screening result, wherein the first index screening result and the second index screening result comprise a plurality of indexes to be selected and screening probabilities corresponding to the indexes to be selected;

and the screening sub-module is used for screening at least one target index for positioning the fault root cause according to the size relation between the screening probability of each index to be selected in the first index screening result and the second index screening result and a preset first threshold value.

In one possible embodiment, the first determining module includes:

the correlation calculation submodule is used for calculating the correlation score of each monitoring index and the target index according to the correlation coefficient and the first weight corresponding to the correlation coefficient;

In a possible implementation manner, the fault location monitoring index screening sub-module includes:

a target fault location area determining unit, configured to determine an intersection of the fault location sub-areas as a target fault location area, where the target fault location area represents a fault location of the service system;

In one possible embodiment, the second determining module includes:

the first fault evaluation score determining submodule is used for determining a first fault evaluation score of the service system according to the level of the fault position in the service system;

the second fault evaluation score determining submodule is used for determining a second fault evaluation score of the service system according to the influence capacity of the fault reasons on the service system;

the final fault evaluation score determining sub-module is used for calculating the final fault evaluation score of the service system according to the first fault evaluation score, the second fault evaluation score and the second weight corresponding to the first fault evaluation score and the second fault evaluation score;

In one possible implementation, the alert module includes:

It should be noted that, the specific implementation of the fault locating device 400 based on the multi-index root cause locating algorithm according to the embodiment of the present application refers to the specific implementation of the fault locating method based on the multi-index root cause locating algorithm set forth in the first aspect of the embodiment of the present application, and is not described herein again.

It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (apparatus), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The term "and/or" means that either or both of the listed items may be selected. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal device comprising the element.

The fault locating method and device based on the multi-index root cause locating algorithm provided by the invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and core idea of the invention. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the idea of the invention. In view of the above, the content of this description should not be construed as limiting the invention.

Claims

1. The fault positioning method based on the multi-index root cause positioning algorithm is characterized by comprising the following steps:

2. The fault location method based on multi-index root cause location algorithm according to claim 1, wherein the operation data includes exception log data and key data of network element equipment; the step of constructing a service system topological graph and a fault system topological graph according to the operation and maintenance data comprises the following steps:

3. The fault location method based on multi-index root cause location algorithm according to claim 1, wherein the step of screening at least one target index for fault root cause location according to the service system topology map and the fault system topology map comprises:

4. The fault location method based on multi-index root cause location algorithm according to claim 1, wherein the step of determining the fault location and the fault cause of the service system according to the target index comprises:

5. The fault location method based on multi-index root cause location algorithm according to claim 1, wherein the step of determining the fault location and the fault cause of the service system according to the fault location monitoring index comprises:

6. The fault location method based on multi-index root cause location algorithm according to claim 1, wherein the step of determining the fault level of the service system according to the fault location and the fault cause of the service system comprises:

7. The fault location method based on multi-index root cause location algorithm according to claim 1, wherein the fault level includes a low-level fault and a high-level fault, and the step of executing a corresponding alarm policy according to the fault level of the service system and the fault system topology map includes:

8. A fault locating device based on a multi-index root cause locating algorithm, the device comprising:

9. The fault locating device based on multi-index root cause locating algorithm according to claim 8, wherein the operation data includes exception log data and key data of network element equipment; the acquisition module comprises:

10. The multi-index root cause positioning algorithm-based fault location device of claim 8, wherein the screening module comprises: