CN113900844A - Service code level-based fault root cause positioning method, system and storage medium - Google Patents

Publication number
CN113900844A
Authority
CN
China
Prior art keywords
fault
root cause
heterogeneous
node
calling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111127982.7A
Other languages
Chinese (zh)
Other versions
CN113900844B (en
Inventor
沈梦家
曹立
隋楷心
刘大鹏
王继斌
张文池
吴楠
陈恒茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bishi Technology Co ltd
Original Assignee
Beijing Bishi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bishi Technology Co ltd filed Critical Beijing Bishi Technology Co ltd
Priority to CN202111127982.7A priority Critical patent/CN113900844B/en
Publication of CN113900844A publication Critical patent/CN113900844A/en
Application granted granted Critical
Publication of CN113900844B publication Critical patent/CN113900844B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing
    • G06F11/0775Content or structure details of the error report, e.g. specific table structure, specific error fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a service-code-level fault root cause positioning method, system and storage medium. The method comprises the following steps: constructing a global heterogeneous topological graph comprising inter-system calling relations and service code calling relations; constructing a time series anomaly detection model based on multi-dimensional indexes, and performing anomaly detection on each calling edge of the global heterogeneous topological graph; generating a heterogeneous fault graph based on the anomaly detection result of each calling edge; and performing fault root cause positioning on the resulting heterogeneous fault graph with a random walk object-level ranking algorithm. The heterogeneous topological graph shows the finer-grained calling relations and membership relations of service codes concisely and clearly; fusing the correlation features of the multi-dimensional indexes improves the accuracy of index anomaly detection on the calling edges of the heterogeneous topology; and the node ranking algorithm for heterogeneous graphs improves the accuracy of fault root cause positioning.

Description

Service code level-based fault root cause positioning method, system and storage medium
Technical Field
The invention relates to fault root cause location, in particular to fault root cause location based on service code level.
Background
With the rapid development of technologies such as cloud computing and service computing, and with growing business demand, more and more modern enterprises deploy their applications and system services in cloud computing environments as distributed cloud applications or micro-services. Compared with a traditional centralized architecture, a distributed architecture offers better component scalability, higher development productivity and lower cost.
To ensure high availability and reliability, application providers must deploy link monitoring systems that collect key performance metrics for each service, such as network response time, service response rate and success rate, so that complex distributed environments can meet availability constraints and stringent service level objectives. However, as business requirements grow more complex and the number of micro-services increases, a single fault generates a large number of index alarms because of the multi-level call dependencies across systems. Faced with this mass of alarm information, a system administrator relying only on manual analysis can hardly find the key alarm index and the corresponding fault root cause system quickly. Monitoring index data and the system topology therefore need to be processed and analyzed automatically with machine learning algorithms so that the fault root cause system can be located quickly.
However, most existing link tracing and monitoring systems collect only inter-system call relation data and perform fault root cause positioning at the system level, without considering the service code, which is key information about a system call. Existing schemes therefore struggle to locate fine-grained fault root causes, and anomalies are easily hidden by coarse-grained, system-level data aggregation.
In addition, because services are complex and periodic, existing simple anomaly detection strategies based on a fixed threshold or k-sigma produce many false positives or false negatives. For example, an alarm rule such as "response rate below 90% for more than 3 minutes" does not perform satisfactorily across different services. Most current anomaly detection algorithms trigger alarms on a single index only and ignore the complex dependencies among multiple key performance indexes, which easily causes false alarms; the false alarm rate is especially high when detecting index anomalies on fine-grained calling edges in a heterogeneous topology.
Finally, for data that combines systems and service codes, academia and industry currently mostly analyze call data of a single level, whereas most real scenarios involve call data of several different levels and are often more complicated. A fault root cause positioning scheme that fuses system-level and service-code-level information is therefore needed.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides:
A fault root cause positioning method based on service code level mainly comprises the following steps:
S1, constructing a global heterogeneous topological graph comprising an intersystem calling relation and a service code calling relation;
S2, constructing a time series anomaly detection model based on multi-dimensional indexes, and performing anomaly detection on each calling edge of the global heterogeneous topological graph;
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
S4, performing fault root cause positioning on the obtained heterogeneous fault graph based on a random walk object-level ranking algorithm.
A fault root cause positioning system based on service code level mainly comprises the following modules:
a global heterogeneous topological graph generation module, used for constructing a global heterogeneous topological graph comprising an intersystem calling relation and a service code calling relation;
an anomaly detection module, used for constructing a time series anomaly detection model based on multi-dimensional indexes and performing anomaly detection on each calling edge of the global heterogeneous topological graph;
a heterogeneous fault graph generation module, used for generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
and a fault root cause positioning module, used for performing fault root cause positioning on the obtained heterogeneous fault graph based on a random walk object-level ranking algorithm.
A storage medium storing a computer program; when the computer program is executed by a processor in a computer device, the computer device performs the method as described in any one of the above.
By constructing a heterogeneous topological graph, the invention shows the finer-grained calling relations and membership relations of service codes concisely and clearly. By fusing the correlation features of the multi-dimensional indexes, it constructs a time series anomaly detection model based on multi-dimensional indexes and performs anomaly detection on the calling edges of the global heterogeneous topological graph; compared with the prior art, in which detecting anomalies on a single index leads to a high false alarm rate, this improves the accuracy of index anomaly detection on calling edges in the heterogeneous topology. Further, a node ranking algorithm for heterogeneous graphs, combined with automatic processing by machine learning algorithms, yields the heterogeneous fault graph and the root cause system corresponding to the current alarm, which are presented as a visual graph and as root cause recommendations for subsequent analysis and handling. This helps an administrator locate the fault root cause efficiently and improves the accuracy of fault root cause positioning.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the method of the present invention
FIG. 2 is a heterogeneous topological graph of the system and service code calling relations of the present invention
FIG. 3 is a time series anomaly detection model based on multi-dimensional indexes
FIG. 4 is a schematic diagram of the index anomaly detection result of the present invention
FIG. 5 is a heterogeneous fault graph of the present invention
FIG. 6 is a visual interface for fault root cause positioning in accordance with the present invention
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and thus the scope of the present invention is not limited by the specific embodiments disclosed below.
Example one
In order to solve the problems in the prior art, the present embodiment provides a fault root cause positioning method based on service code level. Its flowchart is shown in fig. 1, and the method mainly includes the following steps:
S1, constructing a global heterogeneous topological graph comprising an intersystem calling relation and a service code calling relation.
To locate anomalies and root causes at the finer-grained service code level, the invention provides a graph construction strategy for the mixed relation between service codes and application systems. In addition, where system calls are forwarded through the enterprise service bus system ESB_F5, the service code calling relations and service code memberships of the upstream and downstream systems can be obtained by collating the CMDB service call mapping table. The construction of the heterogeneous topological graph is described below using actual sample data.
The service monitoring system records the call data and status of each service transaction in its logs. For example, the call log at a certain alarm time, after parsing, is shown in Table 1 below:
Figure BDA0003279499510000051
TABLE 1 parsed transaction detail data
It can be seen that the data at this time include the system nodes S1, S2, S3 and S4 and the service code nodes T1 and T2 that they call. Taking all calling relations among these nodes into account, a heterogeneous topological graph containing the calling relations of system nodes and service code nodes is constructed, as shown in fig. 2, which reflects the global calling relations of systems and service codes. Each calling edge carries time series index data obtained by aggregating the transaction detail data at a set time granularity. The indexes used by the invention comprise: transaction amount, success amount, response amount, failure amount, non-response amount, success rate, response time. Compared with the prior art, which involves only inter-system calling topological graphs, the heterogeneous topological graph used by the invention contains the calling relations among systems and service codes, and therefore captures the finer-grained calling relations and membership relations of service codes in a concise and clear representation.
Because traffic in real services is large and complex, the resulting global heterogeneous topological graph is often complex as well. In an actual production environment, however, a fault affects only part of the services, so the invention performs anomaly detection on the calling edges of the global heterogeneous topological graph in order to obtain the local heterogeneous topological graph in which the fault occurs.
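The graph construction in S1 can be illustrated with a minimal sketch. Since the contents of Table 1 are not reproduced in this text, the record layout below (caller system, service code, providing system, success flag) and all sample values are assumptions made for illustration only; the function name `build_heterogeneous_graph` is likewise hypothetical.

```python
from collections import defaultdict

# Hypothetical parsed transaction records:
# (caller system, service code, providing system, transaction succeeded).
# The real layout of Table 1 is not available, so this is an assumed shape.
records = [
    ("S1", "T1", "S2", True),
    ("S1", "T1", "S2", False),
    ("S3", "T1", "S2", True),
    ("S2", "T2", "S4", True),
]


def build_heterogeneous_graph(records):
    """Return a node-type map and per-edge transaction counters."""
    node_type = {}
    edges = defaultdict(lambda: {"transactions": 0, "successes": 0})
    for caller, code, provider, ok in records:
        node_type[caller] = "system"
        node_type[provider] = "system"
        node_type[code] = "service_code"
        # Each record contributes one call edge (system -> service code)
        # and one membership edge (service code -> providing system).
        for edge in ((caller, code), (code, provider)):
            edges[edge]["transactions"] += 1
            edges[edge]["successes"] += int(ok)
    return node_type, dict(edges)


node_type, edges = build_heterogeneous_graph(records)
# One of the aggregated indexes carried by each calling edge.
success_rate = {e: c["successes"] / c["transactions"] for e, c in edges.items()}
```

In a fuller version, the counters would be bucketed by the chosen time granularity so that every calling edge carries a time series for each index rather than a single aggregate.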
S2, constructing a time series anomaly detection model based on multi-dimensional indexes, and performing anomaly detection on each calling edge of the global heterogeneous topological graph. The model is built on a graph attention mechanism, as shown in fig. 3. S2 specifically includes the following steps:
S2.1, normalizing the time series within the time window for each of the n indexes.
Here n denotes the number of KPI indexes collected on each calling edge. To take the correlation among the indexes into account, the n KPI indexes are represented as nodes, i.e., the i-th index corresponds to node $v_i$. Min-max normalization is applied to obtain the input features $\{v_1, v_2, \ldots, v_n\}$ corresponding to the n KPI indexes, where
$$v_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$
with $x_i$ denoting the raw time series of the i-th index within the window. Node $v_i$ is the w-dimensional feature vector of the i-th KPI index, where the dimension w equals the length of the time window.
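As a small illustration of the min-max normalization in S2.1, the sketch below scales each index window to the range [0, 1]. The index names and values are invented, and mapping a constant window to zeros is an assumption made here to avoid division by zero; the source does not say how that case is handled.

```python
def min_max_normalize(window):
    """Scale one index window to [0, 1] with min-max normalization."""
    lo, hi = min(window), max(window)
    if hi == lo:
        # Assumption: a constant window carries no variation, so map it to zeros.
        return [0.0 for _ in window]
    return [(x - lo) / (hi - lo) for x in window]


# Hypothetical windows (w = 4) for two KPI indexes on one calling edge.
kpi_windows = {
    "response_time": [120.0, 130.0, 125.0, 400.0],
    "success_rate": [0.99, 0.98, 0.99, 0.60],
}
features = {name: min_max_normalize(w) for name, w in kpi_windows.items()}
```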
S2.2, learning the fused features of the nodes through a graph attention mechanism.
The fused feature $h_i$ of node $v_i$ is calculated by the following formula:
$$h_i = \sigma\Big(\sum_{j \in N(i)} \alpha_{ij} v_j\Big)$$
where $N(i)$ denotes the set of neighbor nodes of node $v_i$, $v_j$ denotes a neighbor node of node $v_i$, i.e., the w-dimensional feature vector of the j-th index, $\sigma$ denotes the sigmoid activation function, and $\alpha_{ij}$ denotes the association weight between node $v_i$ and node $v_j$. The association weight $\alpha_{ij}$ is calculated by the following formulas:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}$$
$$e_{ij} = \mathrm{LeakyReLU}\big(W \cdot (v_i \oplus v_j)\big)$$
where $e_{ij}$ denotes the attention value of the edge between node $v_i$ and node $v_j$, $e_{il}$ denotes the attention value of the edge between node $v_i$ and node $v_l$, $\oplus$ denotes the feature concatenation operation, LeakyReLU is an activation function, $W$ denotes a learnable parameter matrix, $L$ denotes the number of neighbor nodes of node $v_i$, and $l$ is the index of a neighbor node of node $v_i$.
The fused features of all nodes, denoted $H_i$, are obtained by this calculation.
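The attention-based fusion of S2.2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: every node is treated as a neighbor of every node (including itself), the learnable parameter is a fixed vector of length 2w rather than a trained matrix, and the LeakyReLU slope of 0.2 is an arbitrary choice.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gat_fuse(nodes, weight):
    """Fuse n feature vectors of length w with one attention layer.

    nodes  : list of n lists, each of length w
    weight : list of length 2w (stands in for the learnable parameters)
    """
    fused = []
    for v_i in nodes:
        # Attention value for each neighbor: LeakyReLU(weight . (v_i concat v_j)).
        scores = [
            leaky_relu(sum(p * x for p, x in zip(weight, v_i + v_j)))
            for v_j in nodes
        ]
        # Softmax over the neighbors gives the association weights.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Fused feature: sigmoid of the weighted sum of neighbor features.
        fused.append([
            sigmoid(sum(a * v_j[k] for a, v_j in zip(alphas, nodes)))
            for k in range(len(v_i))
        ])
    return fused


nodes = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # n = 3 indexes, w = 2
weight = [0.1, -0.2, 0.3, 0.05]                # hypothetical parameters
fused = gat_fuse(nodes, weight)                # n x w fused features
```

With an all-zero weight vector every neighbor receives the same association weight, so each fused feature reduces to the sigmoid of the plain average of the node features, which is a convenient sanity check.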
S2.3, learning the embedded features of the time series of the different indexes based on the obtained fused features $H_i$ of all nodes.
After graph attention learning, the fused features $H_i$ of all nodes have output dimension n × w. They are concatenated with the original sequence features to obtain features of dimension n × 2w, which are then input into an LSTM module to encode long-term temporal dependencies, and the embedded features of the time series of the different indexes are thereby learned.
S2.4, obtaining the predicted values $\hat{x}(t)$ of the time series of all indexes at time t based on the obtained embedded features of the time series of the different indexes.
Specifically, the embedded features of all indexes are input into a multi-layer perceptron (MLP) to obtain the predicted values $\hat{x}(t) = (\hat{x}_1(t), \ldots, \hat{x}_n(t))$ of all time series at time t.
The mean square error (MSE) loss function is used as the optimization function:
$$\mathrm{Loss}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \big(x_i(t) - \hat{x}_i(t)\big)^2$$
where n denotes the number of predicted indexes, and $x_i(t)$ and $\hat{x}_i(t)$ denote the observed and predicted values of the i-th index at time t.
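For reference, the mean square error over the n predicted indexes at one time step can be computed as in the sketch below; the function name and the sample values are illustrative only.

```python
def mse_loss(actual, predicted):
    """Mean square error between observed and predicted index values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Two indexes at time t: the first is predicted exactly, the second is off by 2.
loss = mse_loss([1.0, 2.0], [1.0, 4.0])
```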
S2.5, calculating the anomaly score $score_i(t)$, which represents the degree of deviation of each index, based on the obtained predicted values $\hat{x}_i(t)$ of the time series of all indexes at time t.
The deviation value of the i-th index is calculated by the following formula:
$$Err_i(t) = \left| x_i(t) - \hat{x}_i(t) \right|$$
The deviation value of the index is normalized by the following formula:
$$score_i(t) = \frac{Err_i(t) - \tilde{\mu}_i}{\tilde{\sigma}_i}$$
where $score_i(t)$ is the anomaly score, $Err_i(t)$ is the deviation value of the i-th index, and $\tilde{\mu}_i$ and $\tilde{\sigma}_i$ denote the median and the interquartile range of the deviation values, used in place of the mean and the standard deviation; experiments show that this normalization performs best. With the multi-index time series anomaly detection model, the invention makes the degree of deviation of each index easier to observe.
S2.6, judging whether the calling edge is abnormal based on the obtained anomaly score $score_i(t)$. Specifically, the anomaly score $score_i(t)$, which represents the degree of deviation of the index, is compared with a preset threshold; when $score_i(t)$ is greater than the threshold, the detection result of the calling edge is judged to be abnormal. The detection result is shown in fig. 4, where red edges indicate anomalies and black edges indicate normal behavior.
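The scoring and thresholding of S2.5 and S2.6 can be sketched as follows. Several points are assumptions: the deviation is taken as the absolute prediction error, the "quartile" statistic is read as the interquartile range, a small constant guards against a zero range, and the threshold of 3.0 is a made-up value. The observed and predicted series are invented sample data.

```python
from statistics import median, quantiles


def anomaly_scores(errors, eps=1e-8):
    """Normalize deviation values with the median and interquartile range."""
    q1, _, q3 = quantiles(errors, n=4)   # quartile cut points
    iqr = q3 - q1
    med = median(errors)
    return [(e - med) / (iqr + eps) for e in errors]


# Hypothetical observed and predicted values of one index on one calling edge.
actual = [0.50, 0.52, 0.49, 0.51, 0.50, 0.95]
predicted = [0.50, 0.51, 0.50, 0.50, 0.51, 0.50]
errors = [abs(a - p) for a, p in zip(actual, predicted)]

scores = anomaly_scores(errors)
THRESHOLD = 3.0                           # assumed preset threshold
edge_is_abnormal = scores[-1] > THRESHOLD  # decision for the latest time step
```

Using the median and the interquartile range keeps a single large deviation from inflating the normalization statistics, which is why the last point stands out clearly against the earlier ones.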
Compared with traditional time series anomaly detection methods, the time series anomaly detection model based on multi-dimensional indexes constructed by the invention does not rely on any assumption about the data distribution and takes into account the correlations and dependencies among the multi-dimensional indexes of service calls, making anomaly detection more accurate and efficient.
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge.
Specifically, the abnormal calling edges in the heterogeneous topological graph are obtained from S2, and the calling edges whose detection result is normal are filtered out of the global heterogeneous topological graph, giving a heterogeneous fault graph that shows only the faulty part. For example, filtering the global heterogeneous topological graph of fig. 2 yields the heterogeneous fault graph shown in fig. 5.
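A minimal sketch of the filtering step in S3 is given below. The edge list and the per-edge detection results are invented, and treating an edge with no detection result as normal is an assumption.

```python
def build_fault_graph(edges, abnormal):
    """Keep only abnormal calling edges and the nodes they touch."""
    # Edges missing from `abnormal` are assumed to be normal.
    fault_edges = [e for e in edges if abnormal.get(e, False)]
    fault_nodes = sorted({node for edge in fault_edges for node in edge})
    return fault_nodes, fault_edges


# Hypothetical global graph and anomaly detection results.
edges = [("S1", "T1"), ("T1", "S2"), ("S2", "T2"), ("T2", "S4")]
abnormal = {("S1", "T1"): True, ("T1", "S2"): True, ("S2", "T2"): False}

fault_nodes, fault_edges = build_fault_graph(edges, abnormal)
```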
S4, performing fault root cause positioning on the obtained heterogeneous fault graph based on a random walk object-level ranking algorithm.
Specifically, S4 includes the following steps:
S4.1, determining an object set V and an object type set A based on the heterogeneous fault graph generated in S3.
Specifically, the heterogeneous fault graph generated in S3 can be formally expressed as
$$\mathcal{G} = (\mathcal{V}, \mathcal{E})$$
where $\mathcal{V}$ and $\mathcal{E}$ denote the object set and the relation set, respectively. Because the heterogeneous graph contains multiple types of objects, an object type mapping function is defined:
$$\phi: \mathcal{V} \rightarrow A$$
where A denotes the set of distinct object types after mapping; multiple object instances of the same type are mapped to the corresponding object type by the mapping function.
S4.2, assigning corresponding anomaly propagation factors to the different object types based on the obtained object type set A.
Anomaly propagation factors are assigned to the different object types according to their importance in the heterogeneous fault graph. Specifically, the factors can be assigned using expert knowledge, or learned from historical data with a search optimization algorithm such as simulated annealing.
Compared with the prior art, which does not consider the differences in anomaly propagation among object types, setting anomaly propagation factors between object types expresses the differences in their anomaly propagation weights and improves the accuracy and specificity of the subsequent root cause score calculation.
S4.3, based on the obtained object set V, iteratively calculating the pivot value of each object with the PageRank algorithm and using it as the initial root cause score $R_{ea}$ of the object,
where a denotes any object in the object set V.
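The PageRank iteration used for the initial root cause scores in S4.3 can be sketched with a plain power iteration. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from nodes without outgoing edges, and the caller-to-callee edge direction are all assumptions; the source does not specify them. The small fault graph is invented.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out[src]:
                new_rank[dst] += damping * rank[src] / len(out[src])
        rank = new_rank
    return rank


# Hypothetical fault graph: S1 and S3 call service code T1, which belongs to S2.
nodes = ["S1", "S3", "T1", "S2"]
edges = [("S1", "T1"), ("S3", "T1"), ("T1", "S2")]
rank = pagerank(nodes, edges)   # initial root cause score per object
```

With edges pointing from caller to callee, rank accumulates on the downstream objects that many abnormal calls converge on, which matches the intuition that such objects are likely root cause candidates.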
S4.4, determining the root cause score $R_x$ of each object based on the obtained anomaly propagation factors and the initial root cause scores.
Specifically, the root cause score $R_x$ of object x is obtained by the following formula:
Figure BDA0003279499510000091
where X and Y denote the set of objects of type X and the set of objects of type Y in the object type set A, x denotes an object in the set of type X, and y denotes an object in the set of type Y; $R_x$ and $R_y$ denote the root cause scores of object x and object y, respectively; $M_{xY}$ is an adjacency matrix whose elements are denoted $m_{xY}$: if a relation exists between object x and object type Y, then $m_{xY} = Num(x, Y)$; if no relation exists between object x and object type Y, then $m_{xY} = 0$; $Num(x, Y)$ denotes the total number of relations between object x and all objects in the set of type Y; $\gamma_{XY}$ denotes the anomaly propagation factor between object type X and object type Y,
Figure BDA0003279499510000092
ε represents the attenuation factor, selected based on expert knowledge.
By incorporating the object ranking algorithm for heterogeneous graphs, the invention addresses the problem that the initial root cause score does not consider the relations between different object types.
S4.5, based on the obtained root cause score of each object, selecting the objects with the top-K root cause scores as the fault root cause positioning result.
Here the top-K root cause scores are the K largest root cause scores.
Specifically, the obtained fault root cause positioning result is displayed visually, as shown in fig. 6, for reference by the system administrator.
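Selecting the top-K objects in S4.5 amounts to taking the K largest root cause scores. A short sketch with invented scores, using the standard-library `heapq.nlargest`:

```python
import heapq


def top_k_root_causes(scores, k):
    """Return the k objects with the largest root cause scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Hypothetical root cause scores per object.
root_cause_scores = {"S1": 0.10, "T1": 0.30, "S2": 0.60}
candidates = top_k_root_causes(root_cause_scores, k=2)
```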
By adopting the heterogeneous topological graph, the invention shows the finer-grained calling relations and membership relations of service codes concisely and clearly. By fusing the correlation features of the multi-dimensional indexes, it improves the accuracy of index anomaly detection on the calling edges of the heterogeneous topology. Root cause positioning with the node ranking algorithm for heterogeneous graphs considers not only the pivot values of anomaly propagation of the objects in the heterogeneous graph but also the anomaly propagation factors among different object types. After the system monitoring data pass through this algorithmic framework, the heterogeneous fault graph and the root cause system corresponding to the current alarm are obtained through automatic processing by machine learning algorithms and are presented in visual form and as root cause recommendations for analysis and handling. This helps an administrator locate the fault root cause efficiently and improves the accuracy of fault root cause positioning.
Example two
The embodiment provides a fault root cause positioning system based on service code level, which mainly comprises the following modules:
The global heterogeneous topological graph generation module is used for constructing a global heterogeneous topological graph comprising an intersystem calling relation and a service code calling relation.
To locate anomalies and root causes at the finer-grained service code level, the invention provides a graph construction strategy for the mixed relation between service codes and application systems. In addition, where system calls are forwarded through the enterprise service bus system ESB_F5, the service code calling relations and service code memberships of the upstream and downstream systems can be obtained by collating the CMDB service call mapping table.
The anomaly detection module is used for constructing a time series anomaly detection model based on multi-dimensional indexes and performing anomaly detection on each calling edge of the global heterogeneous topological graph. The anomaly detection model is built on a graph attention mechanism, as shown in fig. 3. The anomaly detection module implements the following functions:
First, the time series within the time window for each of the n indexes are normalized.
Here n denotes the number of KPI indexes collected on each calling edge. To take the correlation among the indexes into account, the n KPI indexes are represented as nodes, i.e., the i-th index corresponds to node $v_i$. Min-max normalization is applied to obtain the input features $\{v_1, v_2, \ldots, v_n\}$ corresponding to the n KPI indexes, where
$$v_i = \frac{x_i - \min(x_i)}{\max(x_i) - \min(x_i)}$$
with $x_i$ denoting the raw time series of the i-th index within the window. Node $v_i$ is the w-dimensional feature vector of the i-th KPI index, where the dimension w equals the length of the time window.
Next, the fused features of the different nodes are learned through a graph attention mechanism.
Specifically, the fused feature $h_i$ of node $v_i$ is calculated by the following formula:
$$h_i = \sigma\Big(\sum_{j \in N(i)} \alpha_{ij} v_j\Big)$$
where $N(i)$ denotes the set of neighbor nodes of node $v_i$, $v_j$ denotes a neighbor node of node $v_i$, i.e., the w-dimensional feature vector of the j-th index, $\sigma$ denotes the sigmoid activation function, and $\alpha_{ij}$ denotes the association weight between node $v_i$ and node $v_j$. The association weight $\alpha_{ij}$ is calculated by the following formulas:
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{L} \exp(e_{il})}$$
$$e_{ij} = \mathrm{LeakyReLU}\big(W \cdot (v_i \oplus v_j)\big)$$
where $e_{ij}$ denotes the attention value of the edge between node $v_i$ and node $v_j$, $e_{il}$ denotes the attention value of the edge between node $v_i$ and node $v_l$, $\oplus$ denotes the feature concatenation operation, LeakyReLU is an activation function, $W$ denotes a learnable parameter matrix, $L$ denotes the number of neighbor nodes of node $v_i$, and $l$ is the index of a neighbor node of node $v_i$. The fused features of all nodes, denoted $H_i$, are obtained by this calculation.
The embedded features of the time series of the different indexes are then learned based on the obtained fused features $H_i$ of all nodes.
After graph attention learning, the fused features $H_i$ of all nodes have output dimension n × w. They are concatenated with the original sequence features to obtain features of dimension n × 2w, which are then input into an LSTM module to encode long-term temporal dependencies, and the embedded features of the time series of the different indexes are thereby learned.
The predicted values $\hat{x}(t)$ of the time series of all indexes at time t are obtained based on the embedded features of the time series of the different indexes.
Specifically, the embedded features of all indexes are input into a multi-layer perceptron (MLP) to obtain the predicted values $\hat{x}(t) = (\hat{x}_1(t), \ldots, \hat{x}_n(t))$ of all time series at time t.
The mean square error (MSE) loss function is used as the optimization function:
$$\mathrm{Loss}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \big(x_i(t) - \hat{x}_i(t)\big)^2$$
where n denotes the number of predicted indexes, and $x_i(t)$ and $\hat{x}_i(t)$ denote the observed and predicted values of the i-th index at time t. With the multi-index time series anomaly detection model, the invention makes the degree of deviation of each index easier to observe.
Based on the obtained predicted values of the time series of all indexes at time t
Figure BDA0003279499510000124
the abnormality score $score_i(t)$, which represents the degree of deviation of each index, is calculated.
The deviation value of the i-th index is calculated by the following formula:
Figure BDA0003279499510000125
The deviation value of the index is then normalized by the following formula:
Figure BDA0003279499510000126
where $score_i(t)$ is the abnormality score,
Figure BDA0003279499510000131
and
Figure BDA0003279499510000132
denote the median and the quartile, respectively, which are used in place of the mean and the standard deviation; experiments show that this normalization performs best. By adopting a multi-index time series anomaly detection model, the invention makes the degree of deviation of each index easier to observe.
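The normalization can be sketched as follows. Since the formulas are available only as figures, the code assumes that the deviation is the absolute difference between the observed and predicted value, and that "quartile" refers to the interquartile range, so the score is (deviation − median) / IQR over a window of deviations. The small `eps` term guarding against a zero IQR and all names are illustrative additions.

```python
import statistics

def deviation(observed, predicted):
    # Assumed form: absolute prediction error of one index at one time step
    return abs(observed - predicted)

def robust_scores(deviations, eps=1e-9):
    """Normalize deviations with the median and interquartile range."""
    med = statistics.median(deviations)
    q1, _, q3 = statistics.quantiles(deviations, n=4)
    iqr = q3 - q1
    return [(d - med) / (iqr + eps) for d in deviations]

# One index over five time steps; the last observation deviates strongly
observed = [10.0, 10.2, 9.9, 10.1, 15.0]
predicted = [10.0, 10.0, 10.0, 10.0, 10.0]
devs = [deviation(o, p) for o, p in zip(observed, predicted)]
scores = robust_scores(devs)
```

Using the median and interquartile range keeps a single large deviation from inflating the scale, which is the usual motivation for preferring them to the mean and standard deviation.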
Based on the obtained abnormality score $score_i(t)$, whether the calling edge is abnormal is determined.
Specifically, the obtained abnormality score $score_i(t)$, which represents the degree of deviation of the index, is compared with a preset threshold; when $score_i(t)$ exceeds the threshold, the detection result of the calling edge is judged to be abnormal. The detection result is shown in Fig. 4, where red edges indicate abnormal calls and black edges indicate normal calls.
Compared with traditional time series anomaly detection methods, the time series anomaly detection model based on multi-dimensional indexes constructed by the invention does not depend on any assumption about the data distribution, and it takes into account the correlation and dependence among the multi-dimensional indexes of service calls, so that anomaly detection is more accurate and efficient.
The heterogeneous fault graph generation module is configured to generate a heterogeneous fault graph based on the anomaly detection result of each calling edge.
Specifically, the abnormal calling edges in the heterogeneous topological graph are obtained from the anomaly detection module, and the data of calling edges whose detection result is normal are filtered out of the global heterogeneous topological graph, yielding a heterogeneous fault graph that displays only the faulty part. For example, filtering the global heterogeneous topological graph of Fig. 2 results in the heterogeneous fault graph shown in Fig. 5.
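The thresholding and filtering steps can be sketched together. The code assumes one abnormality score per calling edge and keeps only the edges whose score exceeds the threshold; the node names, scores, and threshold value are hypothetical.

```python
def build_fault_graph(edges, scores, threshold):
    """Keep only the calling edges whose abnormality score exceeds the threshold."""
    return [edge for edge in edges if scores.get(edge, 0.0) > threshold]

# Global topology as (caller, callee) pairs with toy abnormality scores
edges = [("gateway", "order"), ("order", "payment"), ("order", "inventory")]
scores = {
    ("gateway", "order"): 0.4,
    ("order", "payment"): 3.1,
    ("order", "inventory"): 2.2,
}
fault_edges = build_fault_graph(edges, scores, threshold=2.0)
# Nodes that still appear in the fault graph after filtering
fault_nodes = sorted({node for edge in fault_edges for node in edge})
```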
The fault root cause positioning module is configured to locate the fault root cause in the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
Specifically, the fault root cause positioning module implements the following functions:
Determining an object set V and an object type set A based on the heterogeneous fault graph generated by the heterogeneous fault graph generation module.
Specifically, the heterogeneous fault graph generated by the heterogeneous fault graph generation module can be formally expressed as
Figure BDA0003279499510000141
where ν and ε denote the object set and the relationship set, respectively. Because the heterogeneous graph contains multiple types of objects, an object type mapping function is defined:
Figure BDA0003279499510000142
where A denotes the set of distinct object types obtained after mapping; multiple object instances of the same type are mapped to their common object type by the mapping function.
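A minimal sketch of the object type mapping follows. It assumes, purely for illustration, that each object identifier carries its type as a "type:name" prefix and that the two object types are systems and service codes; the identifiers themselves are hypothetical.

```python
def object_type(obj_id):
    """Map an object instance to its type (assumed 'type:name' convention)."""
    return obj_id.split(":", 1)[0]

# Object set V and relationship set E of a toy heterogeneous fault graph
V = [
    "system:core-banking",
    "system:payments",
    "service_code:TX1001",
    "service_code:TX2002",
]
E = [
    ("system:payments", "system:core-banking"),
    ("service_code:TX1001", "service_code:TX2002"),
    ("service_code:TX1001", "system:payments"),
]
# Object type set A: the distinct types after mapping every object in V
A = sorted({object_type(v) for v in V})
```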
Assigning corresponding abnormal propagation factors to different object types based on the obtained object type set A.
The abnormal propagation factors are assigned according to the importance of each object type in the heterogeneous fault graph. Specifically, they can be assigned from expert knowledge, or learned from historical data with a search optimization algorithm such as simulated annealing.
Compared with prior-art methods that do not consider the differences in abnormal propagation between object types, the invention sets abnormal propagation factors between object types to express the differences in their abnormal propagation weights, which effectively improves the accuracy and pertinence of the subsequent root cause score calculation.
Based on the obtained object set V, the PageRank algorithm is used to iteratively calculate the pivot value of each object, which serves as the initial root cause score $R_{ea}$ of that object,
where a denotes any object in the object set V.
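The iterative calculation can be illustrated with a plain power-iteration PageRank. This is a generic sketch rather than the patent's exact procedure: the damping factor of 0.85, the handling of nodes without outgoing edges, the convergence tolerance, and the choice to let score flow along the call direction are all assumptions, and the node names are hypothetical.

```python
def pagerank(nodes, edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    n = len(nodes)
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Toy fault graph: gateway calls order, order calls payment
nodes = ["gateway", "order", "payment"]
edges = [("gateway", "order"), ("order", "payment")]
initial_scores = pagerank(nodes, edges)
```

With edges oriented along the call direction, score accumulates at the most downstream object, here `payment`.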
Determining the root cause score $R_x$ of each object based on the obtained abnormal propagation factors and the initial root cause scores.
Specifically, the root cause score $R_x$ of object x is obtained by the following formula:
Figure BDA0003279499510000143
where X and Y denote, respectively, the set of objects of type X and the set of objects of type Y in the object type set A; x denotes an object in the set of objects of type X, and y denotes an object in the set of objects of type Y; $R_x$ and $R_y$ denote the root cause scores of object x and object y, respectively; $M_{xY}$ is an adjacency matrix whose elements are denoted $m_{xY}$: if a relationship exists between object x and object type Y, then $m_{xY} = num(x, Y)$; if no relationship exists between object x and object type Y, then $m_{xY} = 0$; $num(x, Y)$ denotes the total number of relationships between object x and all objects in the set of objects of type Y; $\gamma_{XY}$ denotes the abnormal propagation factor between object type X and object type Y,
Figure BDA0003279499510000151
ε represents the attenuation factor, selected based on expert knowledge.
By incorporating an object ranking algorithm for heterogeneous graphs, the invention effectively addresses the problem that the initial root cause score does not consider the relationships between different object types.
Selecting the objects corresponding to the top-K root cause scores as the fault root cause positioning result, based on the obtained root cause score of each object.
Here, the top-K root cause scores are the K largest root cause scores.
Specifically, the obtained fault root cause positioning result is displayed in a visual form, as shown in fig. 6, for reference by a system administrator.
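The top-K selection can be sketched as below. The object identifiers and score values are hypothetical, and the scores are assumed to have been computed already.

```python
import heapq

def top_k_root_causes(root_scores, k):
    """Return the k objects with the largest root cause scores, best first."""
    return heapq.nlargest(k, root_scores.items(), key=lambda item: item[1])

root_scores = {
    "system:payments": 0.12,
    "service_code:TX1001": 0.47,
    "service_code:TX2002": 0.31,
    "system:core-banking": 0.10,
}
result = top_k_root_causes(root_scores, k=2)
```

The resulting ranked list is what would be handed to the visualization step for the administrator.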
Example three:
This embodiment provides a storage medium storing a computer program; when the computer program is executed by a processor in a computer device, the computer device performs any of the methods described above.
By adopting the heterogeneous topological graph, the invention shows, simply and clearly, the calling relationships and membership relationships of service codes at a finer granularity. By fusing the correlation features of the multi-dimensional indexes, it effectively improves the accuracy of index anomaly detection on the calling edges of the heterogeneous topology. By using a node ranking algorithm for heterogeneous graphs for root cause positioning, it considers not only the pivot values of abnormal propagation of objects in the heterogeneous graph but also the abnormal propagation factors between different object types. After the system monitoring data pass through this algorithmic processing framework, the heterogeneous fault graph and the root cause system corresponding to the current alarm are obtained through automatic processing by the machine learning algorithms and are presented to system administrators, in visual form and as root cause recommendations, for analysis and handling. This helps administrators locate the fault root cause efficiently and effectively improves the accuracy of fault root cause positioning.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that the embodiments may be practiced without the specific details. Thus, the foregoing descriptions of specific embodiments described herein are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed. It will be apparent to those skilled in the art that many modifications and variations are possible in light of the above teaching. Further, as used herein to refer to the position of a component, the terms above and below, or their synonyms, do not necessarily refer to an absolute position relative to an external reference, but rather to a relative position of the component with reference to the drawings.
Moreover, the foregoing drawings and description include many concepts and features that may be combined in various ways to achieve various benefits and advantages. Thus, features, components, elements and/or concepts from various different figures may be combined to produce embodiments or implementations not necessarily shown or described in this specification. Furthermore, not all features, components, elements and/or concepts shown in a particular figure or description are necessarily required to be in any particular embodiment and/or implementation. It is to be understood that such embodiments and/or implementations fall within the scope of the present description.

Claims (10)

1. A fault root cause positioning method based on service code level, characterized by comprising the following steps:
S1, constructing a global heterogeneous topological graph comprising intersystem calling relationships and service code calling relationships;
S2, constructing a time series anomaly detection model based on multi-dimensional indexes, and performing anomaly detection on each calling edge of the global heterogeneous topological graph;
S3, generating a heterogeneous fault graph based on the anomaly detection result of each calling edge;
S4, performing fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
2. The method according to claim 1, wherein each calling edge of the global heterogeneous topological graph carries time-series index data generated by aggregating transaction detail data at a set time granularity, and the index data comprise a combination of at least two of the following: transaction amount, success amount, response amount, failure amount, non-response amount, success rate, response rate, and response time.
3. The method for locating a root cause of a fault based on a service code level as claimed in claim 1, wherein S2 further comprises the following steps:
S2.1, normalizing the time series within the time window corresponding to the n indexes;
S2.2, learning the fusion features of the nodes through a graph attention mechanism;
S2.3, learning the embedded features of the time series corresponding to the different indexes, based on the obtained fusion features $H_i$ of all nodes;
S2.4, obtaining, based on the obtained embedded features of the time series corresponding to the different indexes, the predicted values of the time series of all indexes at time t
Figure FDA0003279499500000011
S2.5, calculating, based on the obtained predicted values of the time series of all indexes at time t
Figure FDA0003279499500000012
an abnormality score $score_i(t)$ representing the degree of deviation of each index;
S2.6, determining whether the calling edge is abnormal based on the obtained abnormality score $score_i(t)$.
4. The method according to claim 3, wherein learning the fusion features of the nodes through the graph attention mechanism in S2.2 comprises:
the fusion feature $h_i$ of node i is calculated by the following formula:
Figure FDA0003279499500000021
where $N(i)$ denotes the set of neighbor nodes of node $v_i$, $v_j$ denotes a neighbor node of node $v_i$, $a$ denotes the sigmoid activation function, $a_{ij}$ denotes the association weight between node $v_i$ and node $v_j$, and node $v_j$ represents the w-dimensional feature vector corresponding to the j-th KPI index;
the association weight $a_{ij}$ is calculated by the following formula:
Figure FDA0003279499500000022
wherein
Figure FDA0003279499500000023
$e_{ij}$ denotes the attention value of the calling edge between node $v_i$ and node $v_j$, $e_{il}$ denotes the attention value of the calling edge between node $v_i$ and node $v_l$,
Figure FDA0003279499500000024
denotes the feature concatenation operation, LeakyReLU is an activation function, $W$ denotes a learnable parameter matrix, $L$ denotes the number of neighbor nodes of node $v_j$, and $l$ denotes the index of a neighbor node of node $v_i$.
5. The method for locating a root cause of a fault based on a service code level as claimed in claim 1, wherein S4 comprises the following steps:
S4.1, determining an object set V and an object type set A based on the heterogeneous fault graph generated in S3;
S4.2, assigning corresponding abnormal propagation factors to different object types based on the obtained object type set A;
S4.3, iteratively calculating, based on the obtained object set V and using the PageRank algorithm, a pivot value of each object as the initial root cause score $R_{ea}$ of each object;
S4.4, determining a root cause score $R_x$ of each object based on the obtained abnormal propagation factors and the initial root cause scores;
S4.5, selecting the objects corresponding to the top-K root cause scores as the fault root cause positioning result, based on the obtained root cause score of each object.
6. The method for fault root cause location based on service code level of claim 5, wherein S4.2 comprises: assigning the abnormal propagation factors through expert knowledge, or learning them from historical data in combination with a search optimization algorithm.
7. The method for fault root cause location based on service code level of claim 5, wherein S4.4 comprises:
the root cause score $R_x$ of object x is calculated by the following formula:
Figure FDA0003279499500000031
where X and Y denote, respectively, the set of objects of type X and the set of objects of type Y in the object type set A; x denotes an object in the set of objects of type X, and y denotes an object in the set of objects of type Y; $R_x$ and $R_y$ denote the root cause scores of object x and object y, respectively; $M_{xY}$ is an adjacency matrix whose elements are denoted $m_{xY}$: if a relationship exists between object x and object type Y, then $m_{xY} = num(x, Y)$; if no relationship exists between object x and object type Y, then $m_{xY} = 0$; $num(x, Y)$ denotes the total number of relationships between object x and all objects in the set of objects of type Y; $\gamma_{XY}$ denotes the abnormal propagation factor between object type X and object type Y,
Figure FDA0003279499500000032
ε represents the attenuation factor.
8. The method for fault root cause location based on service code level of claim 5, wherein the top-K root cause scores are the K largest root cause scores; and S4.5 further comprises: displaying the obtained fault root cause positioning result in a visual form.
9. A fault root cause positioning system based on service code level, characterized in that the system mainly comprises the following modules:
a global heterogeneous topological graph generation module, configured to construct a global heterogeneous topological graph comprising intersystem calling relationships and service code calling relationships;
an anomaly detection module, configured to construct a time series anomaly detection model based on multi-dimensional indexes and to perform anomaly detection on each calling edge of the global heterogeneous topological graph;
a heterogeneous fault graph generation module, configured to generate a heterogeneous fault graph based on the anomaly detection result of each calling edge;
and a fault root cause positioning module, configured to perform fault root cause positioning on the obtained heterogeneous fault graph based on a random-walk object-level ranking algorithm.
10. A storage medium, characterized in that it stores a computer program; when the computer program is executed by a processor in a computer device, the computer device performs the method of any one of claims 1-8.
CN202111127982.7A 2021-09-26 2021-09-26 Fault root cause positioning method, system and storage medium based on service code level Active CN113900844B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111127982.7A CN113900844B (en) 2021-09-26 2021-09-26 Fault root cause positioning method, system and storage medium based on service code level

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111127982.7A CN113900844B (en) 2021-09-26 2021-09-26 Fault root cause positioning method, system and storage medium based on service code level

Publications (2)

Publication Number Publication Date
CN113900844A true CN113900844A (en) 2022-01-07
CN113900844B CN113900844B (en) 2024-07-09

Family

ID=79029270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111127982.7A Active CN113900844B (en) 2021-09-26 2021-09-26 Fault root cause positioning method, system and storage medium based on service code level

Country Status (1)

Country Link
CN (1) CN113900844B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160181A (en) * 2015-09-02 2015-12-16 华中科技大学 Detection method of abnormal data of numerical control system instruction field sequence
CN110888755A (en) * 2019-11-15 2020-03-17 亚信科技(中国)有限公司 Method and device for searching abnormal root node of micro-service system
WO2021179643A1 (en) * 2020-03-12 2021-09-16 华为技术有限公司 Fault processing method, apparatus and system
CN111597070A (en) * 2020-07-27 2020-08-28 北京必示科技有限公司 Fault positioning method and device, electronic equipment and storage medium
CN112698975A (en) * 2020-12-14 2021-04-23 北京大学 Fault root cause positioning method and system of micro-service architecture information system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615019A (en) * 2022-02-15 2022-06-10 北京云集智造科技有限公司 Anomaly detection method and system based on micro-service topological relation generation
CN114615019B (en) * 2022-02-15 2024-01-16 北京云集智造科技有限公司 Anomaly detection method based on micro-service topological relation generation
CN114598539A (en) * 2022-03-16 2022-06-07 京东科技信息技术有限公司 Root cause positioning method and device, storage medium and electronic equipment
CN114598539B (en) * 2022-03-16 2024-03-01 京东科技信息技术有限公司 Root cause positioning method and device, storage medium and electronic equipment
CN115333921A (en) * 2022-08-20 2022-11-11 海南大学 Micro-service abnormal root cause positioning method and device
CN115333921B (en) * 2022-08-20 2024-03-29 海南大学 Micro-service abnormal root cause positioning method and device
CN115514617A (en) * 2022-09-13 2022-12-23 上海驻云信息科技有限公司 Universal abnormal root cause positioning and analyzing method and device
CN115509789B (en) * 2022-09-30 2023-08-11 中国科学院重庆绿色智能技术研究院 Method and system for predicting faults of computing system based on component call analysis

Also Published As

Publication number Publication date
CN113900844B (en) 2024-07-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant