CN111669281A

CN111669281A - Alarm analysis method, device, equipment and storage medium

Info

Publication number: CN111669281A
Application number: CN201910175478.0A
Authority: CN
Inventors: 松鸿蒙
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2019-03-08
Filing date: 2019-03-08
Publication date: 2020-09-15
Anticipated expiration: 2039-03-08
Also published as: CN111669281B

Abstract

The application provides an alarm analysis method, apparatus, device, and storage medium, and belongs to the field of network technologies. It provides a method for dynamically adjusting the length of a time window during alarm analysis. By mining the correspondence between alarm features and time window length, a predicted time window length is obtained from the features of the alarms in the current time window, and the current time window length is adjusted using the predicted length, which improves the flexibility of alarm analysis. As the actual conditions of the live network change, the alarm features change and so does the predicted time window length, so the amount by which the time window is adjusted tracks the changes in network conditions. This improves the adaptability of the alarm analysis method, so that the alarm analysis process can cope with the complex and variable conditions of the live network.

Description

Alarm analysis method, device, equipment and storage medium

Technical Field

The present application relates to the field of network technologies, and in particular, to an alarm analysis method, apparatus, device, and storage medium.

Background

An alarm is a notification message generated by a network device when it detects an abnormal event. As networks continue to grow in scale and network architectures become more complex, the number of alarms reported by network devices every day is increasing explosively. These alarms need to be analyzed so that faults can be diagnosed and repaired according to the analysis result.

Currently, computer devices perform alarm analysis based on a fixed-length time window. Specifically, the user presets the length of the time window, which typically cannot be changed once set. During alarm analysis, the computer device determines the time window according to the current time point and the length of the time window, acquires the alarms that fall within the time window according to the occurrence time point of each alarm, and performs alarm analysis on those alarms. In this process, the computer device may buffer alarms that fall outside the time window for use the next time alarm analysis is performed. Each time a period equal to the length of the time window elapses, the computer device repeats the steps of determining the time window, acquiring the alarms, and performing alarm analysis.
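The fixed-length scheme described above can be illustrated with the following minimal sketch. The `Alarm` class, the function name, the half-open interval, and the choice to buffer only alarms at or after the window end are assumptions made for illustration and are not taken from the related art itself.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    # Hypothetical alarm record: a name and an occurrence time in seconds.
    name: str
    occurred_at: float


def split_by_fixed_window(alarms, window_start, window_length):
    """Split alarms into those inside the fixed window and those buffered.

    The window is assumed to be the half-open interval
    [window_start, window_start + window_length).
    """
    window_end = window_start + window_length
    inside = [a for a in alarms if window_start <= a.occurred_at < window_end]
    # Alarms at or after the window end are kept for the next analysis round.
    buffered = [a for a in alarms if a.occurred_at >= window_end]
    return inside, buffered
```

Because `window_length` never changes in this scheme, the number of alarms handled per round depends entirely on the alarm frequency in the network.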

When alarms are analyzed in this way, if the preset length of the time window is too small and the alarm frequency in the network increases, a large number of alarms must be processed each time alarm analysis is performed according to the time window, which degrades analysis performance. If the preset length of the time window is too large and the alarm frequency in the network decreases, the waiting time before alarm analysis is performed according to the time window is too long, which degrades analysis efficiency. Alarm analysis based on a fixed-length time window is therefore inflexible and has difficulty coping with complex and variable network conditions.

Disclosure of Invention

The embodiment of the application provides an alarm analysis method, an alarm analysis device, alarm analysis equipment and a storage medium, and can solve the technical problem of poor flexibility caused by executing alarm analysis according to a time window with a fixed length in the related art. The technical scheme is as follows:

In a first aspect, an alarm analysis method is provided, where the method includes:

acquiring the characteristics of an alarm in a first time window; acquiring the length of a predicted time window according to the alarm characteristics; adjusting the length of the first time window according to the length of the predicted time window to obtain a second time window; and analyzing the alarm in the second time window.

The present embodiment provides a method for dynamically adjusting the length of a time window during alarm analysis. By mining the relationship between the alarm characteristics and the length of the time window, the length of the prediction time window is obtained according to the alarm characteristics in the current time window, and the current length of the time window is adjusted by using the length of the prediction time window, so that the flexibility of alarm analysis can be improved. Along with the continuous change of the actual condition of the existing network, the alarm characteristics change, and then the length of the predicted time window can change, so that the adjustment amplitude of the time window can change along with the corresponding change of the existing network condition, the adaptability of the alarm analysis method is improved, and the alarm analysis process can be guaranteed to cope with the complex and variable conditions of the existing network.
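The four steps of the first aspect can be sketched as follows. This is only an illustrative outline: the function and parameter names are hypothetical, alarms are assumed to be dictionaries with a `"time"` key, the feature extractor, predictor, and analyzer are passed in as callables, and the adjustment step simply adopts the predicted length, which is one of several possible adjustments.

```python
def analyze_with_adaptive_window(alarms, window_start, first_length,
                                 extract_features, predict_length, analyze):
    """Sketch of: get features, predict a length, adjust the window, analyze."""
    # Step 1: features of the alarms in the first time window.
    first_end = window_start + first_length
    in_first = [a for a in alarms if window_start <= a["time"] < first_end]
    if not in_first:
        # Assumption: with no alarms there is nothing to predict from,
        # so the first window length is kept.
        return first_length, analyze([])
    features = [extract_features(a) for a in in_first]

    # Step 2: predicted time window length from the features.
    predicted = predict_length(features)

    # Step 3: adjust the first window to obtain the second window
    # (assumption: adopt the predicted length directly).
    second_length = predicted
    second_end = window_start + second_length

    # Step 4: analyze the alarms in the second time window.
    in_second = [a for a in alarms if window_start <= a["time"] < second_end]
    return second_length, analyze(in_second)
```

Passing the predictor as a callable keeps the sketch independent of the kind of prediction model that is used.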

Optionally, the obtaining the length of the predicted time window according to the characteristic of the alarm includes: inputting the alarm characteristics into a prediction model, and outputting the length of the prediction time window, wherein the prediction model is used for predicting the length of the time window according to the alarm characteristics.

By this implementation, the effects achieved at least can include: by means of the prediction model, the relation between the characteristics of the alarm and the length of the prediction time window can be mined. Therefore, in the alarm analysis process, the prediction model can be used for automatically and accurately predicting the length of the prediction time window according to the characteristics of the alarm in the existing network, so that the length of the prediction time window is matched with the characteristics of the alarm. In the alarm analysis process, the prediction model can dynamically provide the length of the time window according to the actual situation of the current network alarm, so that the self-adaptive capacity of the time window length is improved.

Optionally, before inputting the characteristic of the alarm into a prediction model and outputting the length of the prediction time window, the method further includes: obtaining characteristics of sample alarms and labels of the sample alarms, wherein the labels represent the length of a target time window, and the target time window comprises root alarms in the sample alarms and each derived alarm corresponding to the root alarms; and training to obtain the prediction model according to the characteristics of the sample alarm and the label of the sample alarm.

The effects achieved by this implementation include at least the following. In the model training stage, the prediction model is trained on the features and labels of the sample alarms. Because each label represents the length of the target time window, the prediction model can learn, from a large number of samples, the correspondence between alarm features and target time window length. During alarm analysis, the predicted time window length that the model outputs from the alarm features approaches the target time window length, so the predicted time window contains the root cause alarm and all of its derived alarms as far as possible. After the time window length is adjusted using the predicted length, the adjusted time window therefore has a high probability of containing the root cause alarm and all of its derived alarms, which markedly reduces the probability that a root cause alarm and its derived alarms are divided into different time windows. When alarms are acquired according to the second time window, the root cause alarm and all of its derived alarms can be acquired as far as possible, and analyzing them together improves the accuracy of alarm analysis.

Optionally, the prediction model comprises at least one of a neural network model, a random forest model, a logistic regression model, and a ridge regression model; the node of the input layer in the neural network model is used for receiving the input alarm characteristics, and the node of the output layer in the neural network model is used for outputting the length of the prediction time window; a root node of a decision tree in the random forest model is used for receiving input characteristics of an alarm, and leaf nodes of the decision tree are used for outputting the length of a prediction time window; the independent variable of the logistic regression model is the characteristic of an alarm, and the dependent variable of the logistic regression model is the length of a prediction time window; the independent variable of the ridge regression model is the characteristic of the alarm, and the dependent variable of the ridge regression model is the length of the prediction time window.

By this implementation, the effects achieved at least can include: the method for predicting the time window length through various machine learning models is provided, the machine learning model used for predicting the time window length can be selected according to actual requirements, and flexibility is improved.
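As one hedged illustration of the ridge regression option, the sketch below fits a single-feature ridge model in closed form, with the alarm feature as the independent variable and the time window length as the dependent variable. The choice of a single numeric feature, the unpenalized intercept, and all names are assumptions for illustration; a practical model would normally use several encoded features and an established library.

```python
def fit_ridge_1d(xs, ys, lam=1.0):
    """Fit y = w * x + b by ridge regression with an unpenalized intercept.

    xs: one numeric alarm feature per sample (hypothetical, e.g. alarm count).
    ys: labeled target time window lengths.
    lam: regularization strength; should be > 0 if all xs are equal.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w = sxy / (sxx + lam)      # closed-form ridge slope on centered data
    b = y_mean - w * x_mean    # intercept recovered from the means
    return w, b


def predict_window_length(model, x):
    """Predicted time window length for feature value x."""
    w, b = model
    return w * x + b
```

With `lam=0` the fit reduces to ordinary least squares; larger values shrink the slope toward zero.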

Optionally, the characteristic of the alarm includes at least one of a name of the alarm, a level of the alarm, an event type of the alarm, an occurrence time point of the alarm, and a number of alarms occurring at the occurrence time point.

The effects achieved by this implementation include at least the following. An alarm generally contains a large amount of information. If all of it were used to predict the time window length, the amount of computation would be too large, which would reduce computing efficiency and speed, and invalid or noisy information would interfere with the accuracy of the result. Extracting only several key items, namely the name of the alarm, the level of the alarm, the event type of the alarm, the occurrence time point of the alarm, and the number of alarms occurring at that time point, as the features of the alarm reduces the amount of computation, which improves computing efficiency and speed. It also filters out interference from invalid information, so that the subsequent steps can predict the time window length from the alarm features more quickly and accurately.
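The feature extraction described above might look like the following sketch. The dictionary keys (`"name"`, `"level"`, `"event_type"`, `"occurred_at"`) and the output key `"count_at_time"` are hypothetical field names chosen for illustration; the application does not specify a data format.

```python
from collections import Counter


def extract_features(alarms):
    """Keep only the five key items of each alarm and drop everything else."""
    # Number of alarms sharing each occurrence time point.
    per_time = Counter(a["occurred_at"] for a in alarms)
    return [
        {
            "name": a["name"],
            "level": a["level"],
            "event_type": a["event_type"],
            "occurred_at": a["occurred_at"],
            "count_at_time": per_time[a["occurred_at"]],
        }
        for a in alarms
    ]
```

Any other fields carried by the raw alarm are simply not copied, which is how the invalid or noisy information is filtered out in this sketch.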

Optionally, the adjusting the length of the first time window according to the length of the predicted time window includes: and when the difference value between the length of the predicted time window and the length of the first time window is out of a preset error range, adjusting the length of the first time window according to the length of the predicted time window.

Optionally, when the difference between the length of the predicted time window and the length of the first time window is outside a preset error range, adjusting the length of the first time window according to the length of the predicted time window includes any one of: comparing a first difference between the length of the predicted time window and the length of the first time window with a first error threshold, and expanding the length of the first time window when the first difference is greater than the first error threshold, wherein the minuend of the first difference is the length of the predicted time window, and the subtrahend of the first difference is the length of the first time window; and comparing a second difference between the length of the predicted time window and the length of the first time window with a second error threshold, and reducing the length of the first time window when the second difference is greater than the second error threshold, wherein the minuend of the second difference is the length of the first time window, and the subtrahend of the second difference is the length of the predicted time window.

The effects achieved by this implementation include at least the following. First, the length of the first time window can be expanded when it is too small and reduced when it is too large, which achieves flexible adjustment. Second, the error threshold allows the current time window length a certain degree of error, accepting that a limited number of alarms in the current time window may be divided incorrectly. This balances prediction accuracy against prediction efficiency for the time window length, avoids the situation in which a root cause alarm and its derived alarms are separated in every predicted time window, avoids an endless loop of readjustment caused by each prediction carrying some error, guarantees that the time window length prediction process converges, improves the performance of that process, and makes time window extraction more robust. Third, the operation and maintenance personnel can customize the error threshold and thereby control the number of incorrectly divided alarms that is tolerated in a time window, which improves flexibility.
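The threshold-based adjustment can be sketched as follows, with the first difference taken as the predicted length minus the first-window length and the second difference as the reverse. The function name is hypothetical, and setting the window to the predicted length when a threshold is exceeded is an assumption; the text above only states that the window is expanded or reduced.

```python
def adjust_window_length(first_length, predicted_length,
                         first_error_threshold, second_error_threshold):
    """Return the adjusted window length, or first_length if within tolerance."""
    # First difference: predicted length minus first-window length.
    first_difference = predicted_length - first_length
    if first_difference > first_error_threshold:
        return predicted_length  # expand (assumed: jump to the predicted length)

    # Second difference: first-window length minus predicted length.
    second_difference = first_length - predicted_length
    if second_difference > second_error_threshold:
        return predicted_length  # reduce (assumed: jump to the predicted length)

    # Inside the preset error range: leave the window unchanged.
    return first_length
```

For example, with both thresholds at 10 seconds, a 60-second window is kept when the prediction is 65 seconds but changed when the prediction is 100 or 30 seconds.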

Optionally, the adjusting the length of the first time window according to the length of the predicted time window to obtain a second time window includes: acquiring a target length according to the length of the predicted time window and the relative duration of the alarm, wherein the relative duration represents the position of the alarm in the first time window; and adjusting the length of the first time window to the target length to obtain the second time window.

Optionally, the relative duration is a difference between an occurrence time point of the alarm and a start time point of the first time window.

Optionally, the obtaining of the target length according to the length of the predicted time window and the relative duration of the alarm includes any one of: acquiring the sum of the length of the predicted time window and the relative duration of the alarm as the target length; when the number of the alarms is multiple, for each alarm in the multiple alarms, obtaining the sum of the length of the predicted time window corresponding to the alarm and the relative duration of the alarm to obtain multiple sums, and obtaining the maximum value of the multiple sums to be used as the target length.
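The target length computation can be sketched as below, covering both the single-alarm and the multiple-alarm case. The function and parameter names are hypothetical, and times are assumed to be plain numbers in a common unit such as seconds.

```python
def target_length(window_start, alarm_times, predicted_lengths):
    """Maximum over alarms of (relative duration + predicted window length).

    alarm_times[i] is the occurrence time point of alarm i, and
    predicted_lengths[i] is the predicted time window length for alarm i.
    """
    if not alarm_times or len(alarm_times) != len(predicted_lengths):
        raise ValueError("need one predicted length per alarm, and at least one alarm")
    sums = [
        (t - window_start) + p  # relative duration plus predicted length
        for t, p in zip(alarm_times, predicted_lengths)
    ]
    return max(sums)
```

With a single alarm the maximum is just that alarm's sum, so the same function covers the first option as well.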

In a second aspect, an alarm analysis apparatus is provided, the apparatus being configured to execute the above alarm analysis method. Specifically, the alarm analysis apparatus includes functional modules for performing the alarm analysis method according to the first aspect or any one of the optional manners of the first aspect.

In a third aspect, a computer device is provided, where the computer device includes a processor and a memory, where the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the alarm analysis method according to the first aspect or any one of the optional manners of the first aspect.

In a fourth aspect, there is provided a computer-readable storage medium having at least one instruction stored therein, where the instruction is loaded and executed by a processor to implement the alarm analysis method according to the first aspect or any one of the optional manners of the first aspect.

In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer device, enable the computer device to carry out the alarm analysis method of the first aspect or any of the alternatives of the first aspect.

A sixth aspect provides a chip, where the chip includes a processor and/or program instructions, and when the chip runs, the alarm analysis method according to the first aspect or any one of the alternatives of the first aspect is implemented.

Drawings

FIG. 1 is an architectural diagram of an implementation environment provided by embodiments of the present application;

FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure;

FIG. 3 is a system architecture diagram of a cluster of computer devices according to an embodiment of the present application;

FIG. 4 is a system architecture diagram of another cluster of computer devices provided by an embodiment of the present application;

FIG. 5 is a flow chart of a predictive model training method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of a tag labeling a sample alarm according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a neural network model provided in an embodiment of the present application;

FIG. 8 is a flowchart of an alarm analysis method provided in an embodiment of the present application;

FIG. 9 is a flowchart of an alarm analysis method provided in an embodiment of the present application;

FIG. 10 is a diagram of a logical architecture of an alarm analysis method provided by an embodiment of the present application;

FIG. 11 is a diagram illustrating a logical architecture of a time window extraction service provided by an embodiment of the present application;

FIG. 12 is a schematic structural diagram of an alarm analysis device according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

Fig. 1 is an architecture diagram of an implementation environment provided by an embodiment of the present application, where the implementation environment includes a computer device 101 and at least one network device 102, and the computer device 101 and the at least one network device 102 may be connected through a network.

The computer device 101 may be configured to perform at least one of the alarm analysis method illustrated in FIG. 8 described below and the predictive model training method illustrated in FIG. 5 described below. The computer device 101 may be an entity having processing resources as well as storage resources. In one possible implementation, the computer device 101 may be a physical device, and the structure of the computer device 101 may be as described in the embodiment of fig. 2 below. In another possible implementation, the computer device 101 may be a virtualized device. For example, the computer device 101 may be an elastic cloud server, virtual machine, container, application, service, microservice, module, and the like. The present embodiment does not limit the form of the computer device 101. The computer device 101 may be located locally, or may be deployed in a cloud environment, an edge environment, a terminal environment, or other operating environments, and the operating environment of the computer device 101 is not limited in this embodiment.

The network device 102 may be deployed in an operator, an enterprise network, a campus network, an edge network, a terminal environment, or other environments, and the present embodiment does not limit the operating environment of the network device 102. If the network device 102 fails, the network device generates an alarm and sends the alarm to the computer device. The network device 102 may be a server, a switch, a router, a relay, a bridge, a firewall, a mobile terminal, a personal computer, a notebook computer, a Service Gateway (SGW), a packet data network gateway (PGW), an Optical Network Terminal (ONT), an Optical Network Unit (ONU), an optical splitter, or an internet of things terminal, and the specific form of the network device 102 is not limited in this embodiment.

Fig. 2 is a schematic structural diagram of a computer device 200 according to an embodiment of the present disclosure. The computer device 200 may vary considerably depending on its configuration or performance, and may include one or more processors (central processing units, CPUs) 201 and one or more volatile or non-volatile memories 202, where the one or more volatile or non-volatile memories 202 store at least one instruction, and the at least one instruction is loaded and executed by the one or more processors 201 to implement at least one of the alarm analysis method and the prediction model training method provided in the following method embodiments. The computer device 200 may also have components such as a wired or wireless network interface and an input/output interface for performing input and output, and may include other components for implementing device functions, which are not described herein again. The operating system running on the computer device 200 may be a Linux operating system, a Windows operating system, or the like, and the operating system of the computer device is not limited in this embodiment.

Fig. 3 is a system architecture diagram of a computer device cluster according to an embodiment of the present application, where the computer device cluster includes at least one computer device 200. For the structure of each computer device 200, refer to the embodiment in fig. 2, which is not described herein again.

The cluster of computer devices may be used to perform at least one of the predictive model training method described below in fig. 5 and the alarm analysis method described below in fig. 8. In particular, each computer device 200 in FIG. 3 may perform any one or more steps of an alarm analysis method, and different computer devices 200 in FIG. 3 may perform different steps of the alarm analysis method described below. For example, one of the computer devices 200 of FIG. 3 may perform step 801, another of the computer devices 200 of FIG. 3 may perform step 802, and yet another of the computer devices 200 of FIG. 3 may perform step 803. Each computer device 200 in FIG. 3 may perform any one or more of the steps of the predictive model training method described below, and different computer devices 200 in FIG. 3 may perform different steps of the predictive model training method described below. For example, one of the computer devices 200 of FIG. 3 may perform step 501, another of the computer devices 200 of FIG. 3 may perform step 502, and another of the computer devices 200 of FIG. 3 may perform step 503.

Fig. 4 is a system architecture diagram of another computer device cluster provided in an embodiment of the present application, where the computer device cluster includes a cloud computing service and at least one computer device 200. In fig. 4, the cloud computing service may be implemented by a cloud server cluster, for example, by renting one or more virtual servers provided by a cloud computing service provider. The cloud computing service can extend the computing capacity of operation through a virtualization technology to share software and hardware resources and information, and provides the software and hardware resources and information to each node device in the cloud computing service as required, so that each node device can exert the maximum efficiency. Please refer to the embodiment in fig. 2 for the structure of each computer device 200 in fig. 4, which is not described herein.

The cloud computing service in fig. 4 may perform any one or more steps of the alarm analysis method, and each computer device 200 in fig. 4 may also perform any one or more steps of the alarm analysis method. The steps performed by the cloud computing service and the computer device 200 may be different or the same during the alarm analysis. For example, the cloud computing service in fig. 4 may perform steps 801 and 804, while one computer device 200 in fig. 4 performs step 802 and another computer device 200 in fig. 4 performs step 803.

The cloud computing service in fig. 4 may perform any one or more steps of the predictive model training method, and each computer device 200 in fig. 4 may also perform any one or more steps of the predictive model training method. The steps performed by the cloud computing service and the computer device 200 may be different or the same during the predictive model training. For example, the cloud computing service in fig. 4 may perform step 506 described below, one computer device 200 in fig. 4 may perform step 501 described below, and another computer device 200 in fig. 4 may perform steps 502, 503, 504, and 505 described below.

In an exemplary embodiment, a computer-readable storage medium, such as a memory, is also provided that includes instructions executable by a processor to perform at least one of an alarm analysis method and a predictive model training method in the following embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a Random Access Memory (RAM), a compact disc-read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.

Fig. 5 is a flowchart of a predictive model training method provided in an embodiment of the present application, and as shown in fig. 5, the method includes steps 501 to 507 executed by a computer device:

501. The computer device obtains sample alarms.

The sample alarms may be pre-stored in an alarm database, and the computer device may read the sample alarms from it. Optionally, the sample alarms may be historical alarms. Accordingly, obtaining the sample alarms may include: the computer device reads, according to the occurrence time points of the alarms and a historical time period, the historical alarms whose occurrence time points fall within the historical time period, and uses those historical alarms as the sample alarms. The ending time point of the historical time period may be the current time point, and the length of the historical time period may be set as required; for example, the historical time period may be the past week or the past 3 days.

With respect to the timing of obtaining the sample alarm, the computer device may receive a training instruction, and trigger obtaining the sample alarm and subsequent steps according to the training instruction. The training instruction may be triggered by an input operation of the operation and maintenance worker, for example, may be triggered by a foreground interface or a background instruction.

Optionally, the computer device may sort the sample alarms by occurrence time point from earliest to latest, and perform the subsequent steps on the sorted sample alarms.
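Step 501 can be illustrated by the following sketch, which selects the historical alarms that fall within a historical time period ending at the current time point and sorts them by occurrence time point. The function name, the `"occurred_at"` key, and the inclusive interval bounds are assumptions made for illustration.

```python
def select_sample_alarms(alarms, now, history_length):
    """Return alarms within [now - history_length, now], earliest first."""
    start = now - history_length
    selected = [a for a in alarms if start <= a["occurred_at"] <= now]
    # Sort from the earliest to the latest occurrence time point.
    return sorted(selected, key=lambda a: a["occurred_at"])
```

Times here are plain numbers in a common unit; for a historical period of the past 3 days expressed in seconds, `history_length` would be `3 * 24 * 3600`.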

502. The computer device divides the sample alarms caused by the same fault into the same class according to a classification instruction.

The classification instruction indicates the class of each sample alarm and can be triggered by an input operation of the operation and maintenance personnel. A class contains the sample alarms caused by the same fault, and different classes correspond to different faults. The class to which a sample alarm belongs can be determined according to one or more of the name of the sample alarm, the level of the sample alarm, the event type of the sample alarm, the occurrence time point of the sample alarm, and the topological distance between network elements.

503. The computer device labels, according to a first labeling instruction, the root cause alarm and the derived alarms in each class and the correspondence between the root cause alarm and the derived alarms.

The first labeling instruction comprises the identifier of the root cause alarm, the identifier of the derivative alarm and the corresponding relation between the root cause alarm and the derivative alarm. The first labeling instruction can be triggered according to the input operation of the operation and maintenance personnel.

A root cause alarm is an alarm that causes other alarms. A derived alarm is an alarm caused by a root cause alarm. There is a causal relationship between the two: the root cause alarm is the cause and the derived alarm is the result. For example, for alarm A, alarm B and alarm C, if alarm A causes alarm B and alarm C, then alarm A is the root cause alarm, and alarm B and alarm C are both derived alarms corresponding to alarm A. A root cause alarm may correspond to one or more derived alarms. In one exemplary scenario, assume that front-end service A invokes middle-tier service B, middle-tier service B invokes middle-tier service C, and middle-tier service C invokes back-end service D. When back-end service D fails and its latency becomes large, back-end service D generates alarm 1. If middle-tier service C times out without receiving a response from back-end service D, the latency of middle-tier service C becomes large, and middle-tier service C generates alarm 2. If middle-tier service B times out without receiving a response from middle-tier service C, the latency of middle-tier service B becomes large, and middle-tier service B generates alarm 3. If front-end service A times out without receiving a response from middle-tier service B, the latency of front-end service A becomes large, and front-end service A generates alarm 4. In this scenario, alarm 1, alarm 2, alarm 3, and alarm 4 are generated in the live network. Alarm 1 is the root cause alarm, and alarm 2, alarm 3 and alarm 4 are all derived alarms corresponding to alarm 1.

The root cause alarm, the derived alarm and the corresponding relationship between the root cause alarm and the derived alarm can be determined according to the service characteristics of the alarms, and the service characteristics can include the characteristics of the alarms, including at least one of the name of the alarm, the level of the alarm, the event type of the alarm, the occurrence time point of the alarm and the number of the alarms occurring at the occurrence time point.

For example, referring to fig. 6, root cause alarm 1 and derived alarm 1-1 may be labeled in class 1, together with the correspondence between root cause alarm 1 and derived alarm 1-1. Similarly, root cause alarm 2, derived alarm 2-1 and derived alarm 2-2 are labeled in class 1, together with the correspondence between root cause alarm 2 and derived alarms 2-1 and 2-2. Similarly, root cause alarm 3, derived alarm 3-1, derived alarm 3-2 and derived alarm 3-3 are labeled in class 1, together with the correspondence between root cause alarm 3 and derived alarms 3-1, 3-2 and 3-3. Root cause alarm 4 and derived alarm 4-1 may be labeled in class 2, together with the correspondence between root cause alarm 4 and derived alarm 4-1. Similarly, root cause alarm 5, derived alarm 5-1 and derived alarm 5-2 are labeled in class 2, together with the correspondence between root cause alarm 5 and derived alarms 5-1 and 5-2. In the figure, a solid black circle represents a root cause alarm, a hollow circle represents a derived alarm, and a line connecting a solid black circle and a hollow circle represents the correspondence between the root cause alarm and the derived alarm.

504. The computer device labels the target time window according to the second labeling instruction.

The target time window includes root cause alarms in the sample alarms and each derived alarm corresponding to the root cause alarm. If the root cause alarm in the sample alarms corresponds to N derived alarms, the target time window may include the root cause alarm and the N derived alarms, where N is a positive integer.

The starting time point of the target time window may be the occurrence time point of the root cause alarm. The end time point of the target time window may be the occurrence time point of the last derived alarm corresponding to the root cause alarm. The length of the target time window may be the difference between the occurrence time point of the last derived alarm corresponding to the root cause alarm and the occurrence time point of the root cause alarm. Wherein, the last derived alarm corresponding to the root cause alarm refers to the derived alarm with the latest occurrence time point in all the derived alarms corresponding to the root cause alarm.

For example, the sample alarms include root cause alarm 1 and derived alarm 1-1, derived alarm 1-2 and derived alarm 1-3 corresponding to root cause alarm 1. Root cause alarm 1 occurs at 18:00, derived alarm 1-1 occurs at 18:03, derived alarm 1-2 occurs at 18:05, and derived alarm 1-3 occurs at 18:10, so the last derived alarm corresponding to root cause alarm 1 is derived alarm 1-3. The starting time point of the target time window may be the occurrence time point 18:00 of root cause alarm 1, and the ending time point of the target time window may be the occurrence time point 18:10 of derived alarm 1-3. The length of the target time window may be the difference between 18:10 and 18:00, i.e. 10 minutes.
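The calculation of the target time window can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `target_time_window` is hypothetical, and the calendar date is arbitrary because the example only gives times of day.

```python
from datetime import datetime, timedelta

def target_time_window(root_time, derived_times):
    """Return (start, end, length) of the window that covers a root cause
    alarm and all of its derived alarms."""
    start = root_time            # occurrence time point of the root cause alarm
    end = max(derived_times)     # occurrence time point of the last derived alarm
    return start, end, end - start

day = datetime(2019, 3, 8)       # arbitrary date, for illustration only
root = day.replace(hour=18, minute=0)
derived = [day.replace(hour=18, minute=m) for m in (3, 5, 10)]

start, end, length = target_time_window(root, derived)
print(start.time(), end.time(), length)
```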

The second labeling instruction is used to indicate the target time window, and may include an identifier of the target time window to which the sample alarm belongs. The second labeling instruction can be triggered by an input operation of the operation and maintenance personnel. For example, for a root cause alarm in the sample alarms, the operation and maintenance personnel may, according to the root cause alarms, the derived alarms and the correspondence between them labeled in step 503, construct a time window that just includes the root cause alarm and all of its derived alarms to obtain the target time window, and trigger an input operation according to the target time window, so that the computer device receives the second labeling instruction.

For example, referring to fig. 6, after the class 1 is labeled with the root cause alarm 1 and the derivative alarm 1-1, and the corresponding relationship between the root cause alarm 1 and the derivative alarm 1-1 is labeled, the root cause alarm 1 and the derivative alarm 1-1 may be divided into the target time window 1, and the second labeling instruction is triggered, so that the computer device may mark the target time window 1 including the root cause alarm 1 and the derivative alarm 1-1 according to the second labeling instruction. Similarly, dividing the root cause alarm 2, the derivative alarm 2-1 and the derivative alarm 2-2 into a target time window 2, triggering a second labeling instruction, and enabling the computer equipment to mark the target time window 2 to comprise the root cause alarm 2, the derivative alarm 2-1 and the derivative alarm 2-2 according to the second labeling instruction, and so on.

505. The computer device labels the sample alarm with a label according to the third labeling instruction.

The third labeling instruction comprises a label of the sample alarm, and the third labeling instruction can be triggered according to the input operation of the operation and maintenance personnel.

The label indicates the length of the target time window. In one possible implementation, the tags may include a first tag corresponding to a root cause alarm and a second tag corresponding to a derived alarm. The first label and the second label may be different, and the first label is a difference between an occurrence time point of the root cause alarm and an occurrence time point of a last derived alarm corresponding to the root cause alarm. The second label is a preset value. The preset value may be 0, but may be other small values. The numerical units of the first label and the second label may be minutes.

For example, assume that the sample alarms include root cause alarm 1, root cause alarm 2 and root cause alarm 3 together with their derived alarms. Root cause alarm 1 occurs at 18:00, and the corresponding derived alarm 1-1, derived alarm 1-2 and derived alarm 1-3 occur at 18:03, 18:05 and 18:10 respectively; root cause alarm 2 occurs at 18:03, and the corresponding derived alarm 2-1 and derived alarm 2-2 occur at 18:12 and 18:20 respectively; root cause alarm 3 occurs at 18:20, and the corresponding derived alarm 3-1 and derived alarm 3-2 occur at 18:25 and 18:30 respectively.

In this scenario, for root cause alarm 1, derived alarm 1-1, derived alarm 1-2 and derived alarm 1-3, the length of the target time window corresponding to these four sample alarms may be the difference between 18:10 and 18:00, i.e. 10 minutes, so the label of root cause alarm 1 may be 10 minutes, and the labels of derived alarm 1-1, derived alarm 1-2 and derived alarm 1-3 may be 0. For root cause alarm 2, derived alarm 2-1 and derived alarm 2-2, the length of the target time window corresponding to these three sample alarms may be the difference between 18:20 and 18:03, i.e. 17 minutes, so the label of root cause alarm 2 may be 17 minutes, and the labels of derived alarm 2-1 and derived alarm 2-2 may be 0. For root cause alarm 3, derived alarm 3-1 and derived alarm 3-2, the length of the target time window corresponding to these three sample alarms may be the difference between 18:30 and 18:20, i.e. 10 minutes, so the label of root cause alarm 3 may be 10 minutes, and the labels of derived alarm 3-1 and derived alarm 3-2 may be 0.
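The labeling rule of this scenario can be sketched as follows. The helper names `hm` and `label_group` are hypothetical, times are expressed as minutes since midnight for simplicity, and the preset value for derived alarms is assumed to be 0.

```python
def hm(hour, minute):
    """Convert a time of day to minutes since midnight."""
    return hour * 60 + minute

def label_group(root_name, root_time, derived, preset=0):
    """Label one root cause alarm and its derived alarms.

    `derived` maps derived alarm names to occurrence time points.
    The root cause alarm gets the target window length in minutes (first
    label); each derived alarm gets the preset value (second label).
    """
    labels = {root_name: max(derived.values()) - root_time}
    labels.update({name: preset for name in derived})
    return labels

labels = {}
labels.update(label_group("root 1", hm(18, 0),
                          {"1-1": hm(18, 3), "1-2": hm(18, 5), "1-3": hm(18, 10)}))
labels.update(label_group("root 2", hm(18, 3),
                          {"2-1": hm(18, 12), "2-2": hm(18, 20)}))
labels.update(label_group("root 3", hm(18, 20),
                          {"3-1": hm(18, 25), "3-2": hm(18, 30)}))
print(labels)
```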

For example, referring to fig. 6, after target time window 1 is labeled in class 1, the difference between the occurrence time point of root cause alarm 1 and the occurrence time point of derivative alarm 1-1 may be obtained. If the difference is 1 minute, a third labeling instruction is triggered, where the third labeling instruction includes that the label of root cause alarm 1 is 1 and the label of derivative alarm 1-1 is 0, and the computer device may label root cause alarm 1 with 1 and derivative alarm 1-1 with 0 according to the third labeling instruction. In fig. 6, L denotes a label; for example, L = 1 means that the label is 1. Similarly, the difference between the occurrence time point of root cause alarm 2 and the occurrence time point of derivative alarm 2-2 may be obtained. If the difference is 2 minutes, a third labeling instruction is triggered, where the third labeling instruction includes that the label of root cause alarm 2 is 2, the label of derivative alarm 2-1 is 0, and the label of derivative alarm 2-2 is 0, and the computer device labels root cause alarm 2 with 2, derivative alarm 2-1 with 0, and derivative alarm 2-2 with 0 according to the third labeling instruction, and so on.

By labeling the root cause alarm and the derivative alarm with different labels, the achieved effect at least can include: the root cause alarm and the derivative alarm can be distinguished through different labels, so that the difference between the root cause alarm and the derivative alarm in the sample alarm can be learned by the prediction model, the root cause alarm and the derivative alarm can be prevented from being confused by the prediction model, for example, the derivative alarm corresponding to a certain root cause alarm is judged to be the root cause alarm corresponding to other derivative alarms by mistake, and the accuracy of the prediction model can be improved.

In one possible implementation, after labeling the label of the sample alarm, the sample alarm with the label may be stored in a local file, so that the sample alarm with the label is read from the local file when the model needs to be trained.

506. The computer device obtains characteristics of the sample alarms and labels of the sample alarms.

The characteristics of the sample alarm may include at least one of a name of the sample alarm, a level of the sample alarm, an event type of the sample alarm, an occurrence time point of the sample alarm, and a number of sample alarms occurring at the occurrence time point. In one possible implementation, if the features of the sample alarm are in text form, the text form features may be encoded to obtain the features in digital form.

The name of a sample alarm may describe the phenomenon in which the sample alarm occurred. The name of the sample alarm may be an identification number (ID) of the sample alarm, a number, and the like. Alternatively, the name of the sample alarm may be an identifier of the sample alarm source to which the sample alarm corresponds. The identifier of the sample alarm source may include an ID, an Internet Protocol (IP) address, a name, a number, a serial number, and the like of the sample alarm source.

The event type of the sample alarm may indicate the type of fault that caused the sample alarm. Taking the sample alarm source of the sample alarm as a disk as an example, the event type of the sample alarm may include a bad track of the disk, a loss of metadata, an excessively slow read-write speed, and the like. Taking the sample alarm source of the sample alarm as the router as an example, the event types of the sample alarm may include port failure, link disconnection, network card failure, and the like. Taking the sample alarm source of the sample alarm as an example, the event type of the sample alarm may include over-temperature, over-humidity, smoke detection, etc. Taking the sample alarm source of the sample alarm as the server as an example, the event types of the sample alarm may include overload, traffic overload, service processing failure, and the like. Of course, the event types of the sample alarms are only examples, and the specific event types of the sample alarms may be determined according to the product form of the sample alarm source, which is not limited in this embodiment.

It should be noted that the event type of the sample alarm may be represented by a number, a name, or another identifier, with each event type corresponding to one identifier. For example, the identifier 1 may be used to indicate a bad track of the disk, the identifier 2 may be used to indicate a loss of metadata, the identifier 3 may be used to indicate that the read-write speed is too slow, and so on.

The level of the sample alarm may indicate the time and scope of the impact of the sample alarm. The higher the level of the sample alarm, the longer the time representing the influence of the sample alarm, and the wider the range of the influence of the sample alarm. Generally speaking, the time affected by an alarm is long, and the range affected by the alarm is wide.

The occurrence time point of the sample alarm refers to the time point of the sample alarm reported by the equipment. The occurrence time point of the sample alarm may describe a time sequence in which the sample alarm occurs. Generally speaking, for a root cause alarm and a derived alarm corresponding to the root cause alarm, the occurrence time point of the root cause alarm is before, and the occurrence time point of the derived alarm is after. The occurrence time point of the sample alarm may be represented using a time stamp.

The number of alarms occurring at the occurrence time point describes the total number of sample alarms reported in the network at the moment the device reports the sample alarm, where these alarms have the same occurrence time point, for example, the same timestamp. For example, assuming that the occurrence time point of a certain sample alarm is 18:00, the number of alarms occurring at the occurrence time point corresponding to this sample alarm may be the total number of sample alarms reported in the network at 18:00.
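This feature can be derived by counting alarms per timestamp. The sketch below is illustrative only: the alarm records and the field names `name`, `time` and `count_at_time` are hypothetical.

```python
from collections import Counter

# Hypothetical sample alarms; only the fields needed here are shown.
alarms = [
    {"name": "alarm-1", "time": "18:00"},
    {"name": "alarm-2", "time": "18:00"},
    {"name": "alarm-3", "time": "18:00"},
    {"name": "alarm-4", "time": "18:03"},
]

# Count how many alarms share each occurrence time point.
per_time = Counter(alarm["time"] for alarm in alarms)

# Attach the count to every alarm as one of its features.
for alarm in alarms:
    alarm["count_at_time"] = per_time[alarm["time"]]

print([(a["name"], a["count_at_time"]) for a in alarms])
```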

By using the above information as a characteristic of the sample alarm, the achieved effect may at least include: the sample alarm generally contains a large amount of information, and if the length of the prediction time window is predicted by using all information of the sample alarm, the operation amount is too large, so that the operation efficiency and the operation speed are affected, and the operation accuracy is affected due to interference of invalid information and noise information. By extracting several key information, such as the name of the sample alarm, the level of the sample alarm, the event type of the sample alarm, the occurrence time point of the sample alarm and the alarm quantity occurring at the occurrence time point, as the characteristics of the sample alarm, the operation amount can be reduced, so that the operation efficiency is improved, and the operation speed is accelerated. And the interference of invalid information is shielded, so that the length of a prediction time window can be predicted more quickly and accurately through the characteristics of sample alarm in the subsequent steps, and the operation speed and the operation accuracy are improved.

It should be noted that the name of the sample alarm, the level of the sample alarm, the event type of the sample alarm, the occurrence time point of the sample alarm, and the number of alarms occurring at the occurrence time point are only exemplary descriptions of the characteristics of the sample alarm, and other information of the sample alarm may also be used as the characteristics of the sample alarm, so as to predict the length of the prediction time window by using the other information of the sample alarm.

In one possible implementation, a dimension setting instruction may be received, a dimension of a feature of the sample alarm may be determined according to the dimension setting instruction, and information corresponding to the dimension may be selected from information of the sample alarm as the feature of the sample alarm. Wherein the dimension setting instruction is used for indicating the dimension of the characteristic of the used sample alarm, and the dimension setting instruction can be triggered by the input operation of the user.

By this implementation, the effects achieved at least can include: the method can support the function of feature expansion, can dynamically modify the dimension of the features used in the process of predicting the length of the time window, and enables a user to select which features are adopted to predict the length of the time window in a self-defined manner according to requirements, thereby improving the flexibility.

In one possible implementation, if the features of the sample alarm are in text form, the features in text form may be encoded to obtain features in digital form. As for the encoding method, one-hot encoding may be adopted to encode the features in text form, and the resulting features in digital form consist of 0s and 1s. Of course, one-hot encoding is only an exemplary description of the encoding method, and other encoding methods may also be used, for example, weight of evidence (WOE) encoding; the encoding method is not limited in this embodiment.
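As a minimal sketch of such an encoding, the following pure-Python function maps each distinct text value to a vector of 0s and 1s with exactly one 1. The function name `one_hot` and the event type strings are hypothetical, and ordering the categories alphabetically is an assumption made only to keep the result deterministic.

```python
def one_hot(values):
    """Encode a list of text values as one-hot vectors.

    Returns the sorted category list and one vector per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1        # exactly one position is set to 1
        encoded.append(row)
    return categories, encoded

event_types = ["link_down", "port_failure", "link_down"]
categories, encoded = one_hot(event_types)
print(categories)
print(encoded)
```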

In one possible implementation, the features of the sample alarm may be in the form of a vector. The bits of the vector correspond one to one to the dimensions of the feature, and the value of each bit represents the value of the feature in the corresponding dimension. For example, the features of the sample alarm may take the form (X₁, X₂, X₃, X₄, X₅), where X₁ represents the name of the sample alarm, X₂ represents the level of the sample alarm, X₃ represents the event type of the sample alarm, X₄ represents the occurrence time point of the sample alarm, and X₅ represents the number of sample alarms occurring at the occurrence time point. If there are multiple sample alarms, the features of the sample alarms may take the form of a matrix, where each row of the matrix corresponds to one sample alarm, each column of the matrix corresponds to one dimension of the feature, and each element of the matrix is the value of one dimension of the feature of one sample alarm.

Taking the example that the features of the sample alarm include 5 dimensions, the features of n sample alarms may form a matrix with n rows and 5 columns whose i-th row is (X_i1, X_i2, X_i3, X_i4, X_i5). Here n represents the number of sample alarms, (X_i1, X_i2, X_i3, X_i4, X_i5) is the feature of sample alarm i, X_i1 is the feature of the 1st dimension of sample alarm i, X_i2 is the feature of the 2nd dimension of sample alarm i, X_i3 is the feature of the 3rd dimension of sample alarm i, and so on, where i is a positive integer not greater than n.
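The row-per-alarm, column-per-dimension layout can be sketched with plain nested lists. The dimension names in `DIMENSIONS`, the function `feature_matrix` and the numeric values are hypothetical, and the features are assumed to have already been encoded in digital form.

```python
# Hypothetical dimension order; one column of the matrix per dimension.
DIMENSIONS = ("name", "level", "event_type", "time", "count")

def feature_matrix(alarms):
    """Build an n x 5 matrix: one row per sample alarm, one column per dimension."""
    return [[alarm[d] for d in DIMENSIONS] for alarm in alarms]

# Two already-encoded sample alarms (values are made up for illustration).
sample_alarms = [
    {"name": 7, "level": 3, "event_type": 1, "time": 1080, "count": 4},
    {"name": 9, "level": 1, "event_type": 2, "time": 1083, "count": 1},
]

X = feature_matrix(sample_alarms)
print(X)
print(X[1][2])   # feature of the 3rd dimension of the 2nd sample alarm
```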

507. The computer device trains a prediction model according to the characteristics of the sample alarm and the label of the sample alarm.

The labels of the sample alarms may take the form Y = (Y_1, Y_2, …, Y_n).

Here Y represents the labels of the sample alarms, n represents the number of sample alarms, and i represents the identifier of a sample alarm. For example, Y_i is the label of sample alarm i.

The prediction model is used for predicting the length of the time window according to the characteristics of the alarm. The input parameter of the predictive model may be a characteristic of the alarm and the output parameter of the predictive model may be the length of the prediction time window. The predictive model may be stored in a model library from which the computer device may load the predictive model, wherein the model library is a database for storing the predictive model. The predictive model may be in the form of a file. This embodiment is not limited to this.

The prediction model may be a regression model, but may also be a classification model. The predictive model may be a machine learning model. In one possible implementation, the predictive model may include any one or a combination of more of a neural network model, a random forest model, a logistic regression model, and a ridge regression model, see in particular (1) to (4) below.

(1) Neural network model

The neural network model includes an input layer, one or more hidden layers, and an output layer. Taking the example that the neural network model includes 1 input layer, 2 hidden layers and 1 output layer, the structure of the neural network model may be as shown in fig. 7.

The input layer may be a first layer in the neural network model, and the input layer may include one or more nodes for receiving the characteristics of the input alarm. The nodes of the input layer can operate the alarm characteristics and output the operation result to the first hidden layer.

In one possible implementation, each node of the input layer may have a one-to-one correspondence with each dimension of the feature of the alarm, and any one node may be used to receive the feature of one dimension of the alarm. For example, node 1 of the input layer is used for receiving the feature of the alarm in dimension 1, node 2 of the input layer is used for receiving the feature of the alarm in dimension 2, node i of the input layer is used for receiving the feature of the alarm in dimension i, and i is a positive integer. In one exemplary scenario, the features of an alarm may include 5 dimensions, where dimension 1 is an alarm name, dimension 2 is a level of the alarm, dimension 3 is an event type of the alarm, dimension 4 is an occurrence time point of the alarm, and dimension 5 is a number of alarms occurring at the occurrence time point. Accordingly, referring to fig. 7, a node 1 of the input layer is for receiving an alarm name, a node 2 of the input layer is for receiving a level of an alarm, a node 3 of the input layer is for receiving an event type of the alarm, a node 4 of the input layer is for receiving an occurrence time point of the alarm, and a node 5 of the input layer is for receiving an amount of the alarm occurring at the occurrence time point. The number of nodes of the input layer may be equal to the number of dimensions of the feature, and if the feature of the alarm includes n dimensions, the input layer includes n nodes, where n is a positive integer and n is greater than or equal to i.

The hidden layer can be a layer between the input layer and the output layer, the hidden layer can comprise one or more nodes, the nodes of the hidden layer can operate according to the operation result of the previous layer to obtain the operation result of the layer, and the operation result of the layer is output to the next layer. Wherein the previous layer of the first hidden layer may be an input layer and the next layer of the last hidden layer may be an output layer. The number of the hidden layers can be set according to requirements, and the number of the hidden layers is not limited in the embodiment. For example, the number of hidden layers may be 2. The number of nodes of the hidden layer may be set according to requirements, and the number of nodes of the hidden layer is not limited in this embodiment. For example, the number of nodes of the hidden layer may be 200.

The output layer may be the last layer in the neural network model. The output layer may include one or more nodes, where the node of the output layer may be configured to perform an operation on an operation result of the last hidden layer to obtain a length of the predicted time window, and the node of the output layer may be configured to output the length of the predicted time window.

In one possible implementation, referring to fig. 7, the neural network model may be a regression model, and the output layer of the regression model may include a node that may calculate the length of the prediction time window.
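The forward computation of such a regression network (5 input nodes, 2 hidden layers, 1 output node) can be sketched in pure Python. Everything specific here is an assumption made for illustration: the hidden layer width of 8, the ReLU activation, the random untrained weights, and the helper names `dense`, `init_layer` and `predict`. The output is therefore not a meaningful window length, it only shows how a value flows from the input layer to the single output node.

```python
import random

def relu(z):
    return z if z > 0 else 0.0

def dense(inputs, weights, biases, activation=None):
    """Compute one layer: each node takes a weighted sum of the previous layer."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def init_layer(n_in, n_out, rng):
    """Random, untrained weights and zero biases for a layer with n_out nodes."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
# 5 input features -> hidden layer 1 -> hidden layer 2 -> 1 output node
layers = [init_layer(5, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]

def predict(features):
    values = features
    for k, (weights, biases) in enumerate(layers):
        is_output_layer = k == len(layers) - 1
        values = dense(values, weights, biases, None if is_output_layer else relu)
    return values[0]                 # the single output node: predicted length

x = [0.2, 0.5, 0.1, 0.75, 0.3]       # made-up, already-normalized features
y = predict(x)
print(y)
```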

In another possible implementation, the neural network model may be a classification model, an output layer of the classification model may include a plurality of nodes, each node of the output layer may correspond to a length, and each node of the output layer may calculate a probability that the length of the prediction time window is the length corresponding to the node. The probability calculated by each node of the output layer can be obtained, and a plurality of probabilities are obtained. The maximum probability among the plurality of probabilities may be obtained, the node of the output layer in which the maximum probability is calculated may be determined, and the length corresponding to the node may be used as the length of the prediction time window. For example, assuming that the output layer includes node 1, node 2, and node 3, node 1 may correspond to 3 minutes, node 2 may correspond to 10 minutes, and node 3 may correspond to 20 minutes. If the alarm is input into the neural network model, the probability that the length of the prediction time window calculated by the node 1 is 3 minutes is 0.1, the probability that the length of the prediction time window calculated by the node 2 is 10 minutes is 0.6, and the probability that the length of the prediction time window calculated by the node 3 is 20 minutes is 0.3, the maximum probability calculated by the node 2 in the 3 nodes can be determined, so that the 10 minutes corresponding to the node 2 can be used as the length of the prediction time window.
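The selection step of the classification variant reduces to picking the length whose output node reports the largest probability. A minimal sketch using the numbers of the example above (the variable names are hypothetical):

```python
# Length (in minutes) associated with each output node, and the probability
# each node computed for the current alarm, as in the example.
node_lengths = [3, 10, 20]
node_probabilities = [0.1, 0.6, 0.3]

# Index of the output node with the maximum probability.
best_node = max(range(len(node_probabilities)), key=node_probabilities.__getitem__)

predicted_length = node_lengths[best_node]
print(predicted_length)
```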

(2) Random forest model

The random forest model may include one or more decision trees. The decision tree may include a root node, one or more non-leaf nodes, and one or more leaf nodes.

The root node of the decision tree is used to receive the characteristics of the incoming alarm. The root node may make a determination based on the characteristics of the alarm, select a branch from one or more branches below the root node, and output the characteristics of the alarm to the non-leaf node corresponding to the branch.

The non-leaf node is used for judging according to the input alarm characteristics, selecting a branch from one or more branches under the non-leaf node, outputting the alarm characteristics to the next layer of non-leaf node corresponding to the branch, and outputting the alarm characteristics to the leaf node by analogy.

The leaf nodes of the decision tree are used to output the length of the prediction time window. In one possible implementation, each leaf node of the decision tree may correspond to a length. The leaf node reached by the alarm characteristic can be determined, the length corresponding to the leaf node is used as the length of the prediction time window, and the length of the prediction time window is output. For example, leaf node 1 may correspond to 5 minutes, leaf node 2 may correspond to 10 minutes, leaf node 3 may correspond to 15 minutes, and if the alarm signature eventually reaches leaf node 3, the length of the predicted time window may be determined to be 15 minutes.
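The traversal from the root node to a leaf node can be sketched as follows. The tree structure, the split features and the thresholds are hypothetical and hand-written for illustration, not learned from data, and the sketch covers a single decision tree rather than a whole forest.

```python
# A hand-written decision tree: internal nodes test one feature against a
# threshold, leaf nodes carry a window length in minutes.
tree = {
    "feature": "level", "threshold": 2,
    "low": {"length": 5},                          # leaf node 1
    "high": {
        "feature": "count", "threshold": 10,
        "low": {"length": 10},                     # leaf node 2
        "high": {"length": 15},                    # leaf node 3
    },
}

def predict_length(node, alarm):
    """Follow one branch per node until a leaf is reached, then return its length."""
    while "length" not in node:
        branch = "low" if alarm[node["feature"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["length"]

print(predict_length(tree, {"level": 3, "count": 25}))
```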

(3) Logistic regression model

The independent variable of the logistic regression model is the characteristic of the alarm, and the dependent variable of the logistic regression model is the length of the prediction time window. The logistic regression model may include one or more parameters for fitting the relationship between the characteristics of the alarms and the length of the prediction time window. If the characteristics of the alarm include multiple dimensions, the logistic regression model may include multiple independent variables, each independent variable being the feature of one dimension of the alarm. For example, if the features of the alarm include 5 dimensions, the independent variables of the logistic regression model may include X₁, X₂, X₃, X₄ and X₅.

(4) Ridge regression model

The independent variable of the ridge regression model is the characteristic of the alarm, and the dependent variable of the ridge regression model is the length of the prediction time window. The ridge regression model may include one or more parameters for fitting a relationship between the features of the alarms and the length of the predicted time window.

Regarding the specific process of model training, in one possible implementation, initial parameters may be set for the prediction model, the characteristics of the sample alarm and the label of the sample alarm may be input into the prediction model, and the prediction model outputs the length of the prediction time window. The deviation between the label of the sample alarm and the length of the prediction time window may then be obtained, and it is determined whether the deviation is smaller than a preset threshold. When the deviation is not smaller than the preset threshold, the parameters of the prediction model are adjusted. The steps of inputting the characteristics and the label of the sample alarm into the prediction model, outputting the length of the prediction time window, obtaining the deviation and adjusting the parameters are repeated until the deviation calculated from the length of the prediction time window output by the prediction model is smaller than the preset threshold, at which point the training of the prediction model is complete. The preset threshold may be determined according to actual requirements; for example, the higher the required accuracy of the prediction model, the lower the preset threshold is set. In another possible implementation, the number of iterations may be counted: each time the prediction model outputs the length of the prediction time window of the sample alarm and the parameters of the prediction model are adjusted once, the number of iterations is incremented by one, and the training of the prediction model is complete when the number of iterations reaches a preset number.
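The two stopping rules can be sketched with a deliberately small model. The assumptions are all illustrative: the model is a one-feature linear model trained by gradient descent, the deviation is the mean squared error, the training data are synthetic, and the function name `train` and its default values are hypothetical. The sketch combines both rules, stopping either when the deviation falls below the preset threshold or when the preset number of iterations is reached.

```python
def train(features, labels, threshold=1e-6, max_iterations=10000, learning_rate=0.1):
    """Fit length = w * feature + b; return (w, b, deviation, iterations)."""
    w, b = 0.0, 0.0                          # initial parameters
    n = len(features)
    deviation = float("inf")
    iterations = 0
    while iterations < max_iterations:       # rule 2: preset number of iterations
        predictions = [w * x + b for x in features]
        errors = [p - y for p, y in zip(predictions, labels)]
        deviation = sum(e * e for e in errors) / n
        if deviation < threshold:            # rule 1: deviation below the threshold
            break
        # Adjust the parameters along the negative gradient of the deviation.
        grad_w = 2 * sum(e * x for e, x in zip(errors, features)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        iterations += 1
    return w, b, deviation, iterations

# Synthetic data for illustration only: label = 2 * feature + 1.
features = [0.0, 1.0, 2.0, 3.0]
labels = [1.0, 3.0, 5.0, 7.0]
w, b, deviation, iterations = train(features, labels)
print(w, b, deviation, iterations)
```

A lower `threshold` demands a closer fit and therefore more iterations, which mirrors the trade-off described for the preset threshold.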

In one possible implementation, the model training mode may include any one or more of the following (1) to (4):

(1) A neural network model is obtained by training with a back propagation algorithm according to the characteristics of the sample alarm and the label of the sample alarm.

(2) A random forest model is obtained by training with a random forest training algorithm according to the characteristics of the sample alarm and the label of the sample alarm.

(3) A logistic regression model is obtained by training with a gradient descent algorithm according to the characteristics of the sample alarm and the label of the sample alarm.

(4) A ridge regression model is obtained by training with a gradient descent algorithm according to the characteristics of the sample alarm and the label of the sample alarm.

In one possible implementation, after the predictive model is trained, the predictive model may be stored in a model library. The model may be in the form of a file and, correspondingly, the storage medium of the model library may be a file system. The file system includes, but is not limited to, a local file system, a Hadoop Distributed File System (HDFS), a File Allocation Table 32 (FAT32) system, a New Technology File System (NTFS), a third extended file system (ext3), an ext4 system, an extended file allocation table (exFAT) system, a Resilient File System (ReFS), a Network Attached Storage (NAS), and the like.
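Storing a model as a file and loading it back can be sketched as follows. This assumes, purely for illustration, that the model library is a directory on a local file system and that the trained model can be represented by a small dictionary of parameters serialized as JSON; the function names `save_model` and `load_model` and the model name are hypothetical.

```python
import json
import os
import tempfile

def save_model(library_dir, name, parameters):
    """Write the model parameters to <library_dir>/<name>.json and return the path."""
    path = os.path.join(library_dir, name + ".json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(parameters, f)
    return path

def load_model(library_dir, name):
    """Read the model parameters back from the model library."""
    path = os.path.join(library_dir, name + ".json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# A temporary directory stands in for the model library.
with tempfile.TemporaryDirectory() as library:
    save_model(library, "window_length", {"w": 2.0, "b": 1.0})
    restored = load_model(library, "window_length")

print(restored)
```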

In the method provided by the embodiment, the prediction model is obtained by training by using the characteristics of the sample alarm and the label of the sample alarm, so that the prediction model can extract the corresponding relation between the characteristics of the alarm and the length of the prediction time window through the sample alarm and the label. In the alarm analysis process, the prediction model can be used for dynamically predicting the length of the prediction time window according to the alarm characteristics, so that the length of the time window is matched with the characteristics of the current network alarm, and the accuracy of the length of the time window is improved. In addition, the length of the time window can be dynamically provided according to the actual situation of the current network alarm, so that the self-adaptive capacity of the time window length is improved.

Fig. 8 is a flowchart of an alarm analysis method provided in an embodiment of the present application, and as shown in fig. 8, the method includes steps 801 to 804 executed by a computer device:

801. The computer device obtains characteristics of the alarm within the first time window.

An alert refers to a notification message generated by a network device when an abnormal event is detected. The network device that generates the alert may be referred to as the alert source of the alert. The alarm may comprise at least one of a root cause alarm, a derived alarm. In a possible implementation, the network device may also report the alarm to the computer device in real time after generating the alarm, and the computer device may receive the alarm reported by the network device in real time, so as to analyze the alarm reported in real time by performing the following steps. In another possible implementation, the alarms may be pre-stored in a database, and the computer device may read the alarms from the database to analyze the alarms in the database. The present embodiment does not limit the manner of obtaining the alarm.

The dimension of the feature of the alarm may be the same as the dimension of the feature of the sample alarm obtained in step 507 above. In particular, the characteristics of the alarm may include at least one of a name of the alarm, a level of the alarm, an event type of the alarm, an occurrence time point of the alarm, and a number of alarms occurring at the occurrence time point. The alarm characteristics of each dimension are similar to the alarm characteristics of the sample in the dimension, and please refer to step 507 above, which is not described herein again.

A first time window: refers to the time window before the length adjustment; the first time window may be the current time window. As to the manner of obtaining the first time window, in one possible implementation, the current time point and the length of the first time window may be obtained; the sum of the current time point and the length of the first time window is taken as the end time point of the first time window; and the time range formed by the current time point and the end time point is taken as the first time window. The sum is itself a time point. For example, assuming that the current time point is 18:00 and the length of the first time window is 10 minutes, the sum of 18:00 and 10 minutes is 18:10, and the first time window is 18:00 to 18:10.
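The construction of the first time window described above can be sketched as follows. This is only an illustrative sketch: the function name `first_time_window` and the calendar date are assumptions introduced here and do not come from the embodiment.

```python
from datetime import datetime, timedelta
from typing import Tuple


def first_time_window(current: datetime, length: timedelta) -> Tuple[datetime, datetime]:
    """Return (start, end) of the window, where end = current time point + length."""
    return current, current + length


# Example from the text: current time point 18:00, window length 10 minutes.
# The calendar date is arbitrary and only needed to build a datetime.
start, end = first_time_window(datetime(2019, 3, 8, 18, 0), timedelta(minutes=10))
print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))  # 18:00 to 18:10
```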

Alternatively, the length of the first time window may be a preset length. The preset length can be set according to requirements. In one possible implementation, the preset length may be a minimum length of the time window, for example, may be 1 minute. By taking the minimum length of the time window as the preset length, the length of the time window can be expanded through the length adjustment step subsequently. In another possible implementation, the preset length may be a maximum length of the time window. By taking the maximum length of the time window as the preset length, the length of the time window can be reduced through the length adjustment step subsequently.

Alternatively, the length of the first time window may not be a preset length, but may be a length that has been adjusted one or more times. In a possible implementation, if the method provided in the embodiment of fig. 8 is performed for the first time, the preset length may be used as the length of the first time window, and the adjusted length is obtained by adjusting the preset length. If the method provided in the embodiment of fig. 8 is performed for the second time, the third time, or, more generally, the Nth time, the length obtained by the previous adjustment may be used as the length of the first time window, where N is a positive integer greater than 1.

An alarm within the first time window is an alarm whose point in time is within the first time window. Specifically, the alarm within the first time window occurs at a time point later than or equal to the start time point of the first time window, and the alarm within the first time window occurs at a time point earlier than or equal to the end time point of the first time window. The number of alarms within the first time window may be one or more. The number of alarms in the first time window can be positively correlated with the reporting frequency of the alarms in the network, and the higher the reporting frequency of the alarms is, the more the number of alarms in the first time window can be. In addition, the number of alarms in the first time window may be positively correlated with the length of the first time window, and the larger the length of the first time window is, the larger the number of alarms in the first time window may be.
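As an illustration of the inclusive bounds described above, the following sketch selects the alarms whose occurrence time point lies within a window. The dictionary layout of an alarm (`name`, `time`) and the alarm names are hypothetical and used only for this example.

```python
from datetime import datetime


def alarms_in_window(alarms, start, end):
    """Keep alarms whose occurrence time point lies in [start, end], both ends inclusive."""
    return [a for a in alarms if start <= a["time"] <= end]


day = (2019, 3, 8)  # arbitrary date, only needed to build datetimes
alarms = [
    {"name": "link_down", "time": datetime(*day, 8, 0)},   # equal to the start point
    {"name": "port_error", "time": datetime(*day, 8, 5)},  # equal to the end point
    {"name": "cpu_high", "time": datetime(*day, 8, 6)},    # after the end point
]
selected = alarms_in_window(alarms, datetime(*day, 8, 0), datetime(*day, 8, 5))
print([a["name"] for a in selected])  # ['link_down', 'port_error']
```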

In one possible implementation, the process of obtaining alarms within the first time window may include: starting from the current time point, waiting for the length of the first time window, and caching the received alarm; and when the time passes the length of the first time window, obtaining the cached alarm. For example, assuming that the current time point is 8:00, the length of the first time window is 5 minutes, it may start from 8:00, wait for 5 minutes, and buffer the received alarm; when the time passes 5 minutes, the cached alarm with the occurrence time point of 8:00 to 8:05 can be acquired as the alarm in the first time window.

In one possible implementation, the alarms within the first time window may be obtained by means of a delay queue. Specifically, the length of the first time window may be used as a delay duration, and a delay queue may be created according to the delay duration; in the process of alarm analysis, when an alarm is acquired, the alarm is stored in the delay queue, and when the delay duration has elapsed, the alarms are read from the delay queue. The delay queue is a kind of message queue (MQ).
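The buffering behaviour of such a delay queue can be sketched in memory as follows. This is an assumption-laden stand-in, not a real message queue: the class `DelayBuffer` and its methods are hypothetical, and the current time is passed in explicitly so that the example does not actually wait.

```python
from datetime import datetime, timedelta


class DelayBuffer:
    """Hypothetical in-memory stand-in for a delay queue.

    Alarms are buffered from `start` and released only once the
    delay duration (the length of the first time window) has elapsed.
    """

    def __init__(self, start: datetime, delay: timedelta):
        self.release_at = start + delay
        self._items = []

    def put(self, alarm):
        self._items.append(alarm)

    def poll(self, now: datetime):
        """Return the buffered alarms if the delay has elapsed, else an empty list."""
        if now < self.release_at:
            return []
        items, self._items = self._items, []
        return items


t0 = datetime(2019, 3, 8, 8, 0)  # arbitrary date, start at 8:00
buf = DelayBuffer(t0, timedelta(minutes=5))
buf.put("alarm A")
buf.put("alarm B")
early = buf.poll(t0 + timedelta(minutes=3))  # delay not yet elapsed
ready = buf.poll(t0 + timedelta(minutes=5))  # delay elapsed
print(early, ready)  # [] ['alarm A', 'alarm B']
```

A production system would more likely rely on the delayed-delivery feature of an existing message queue; the sketch only shows the timing rule.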

In one possible implementation, before the characteristics of the alarm are obtained, a noise reduction rule may be applied to the alarms, so as to retain valid alarms and filter out invalid alarms. The invalid alarms may include repeated alarms, flash alarms, oscillation alarms, and the like. A flash alarm (a transient-disconnection alarm) refers to an alarm generated during the period from a network disconnection to the network reconnection, where the time interval from the time point of the network disconnection to the time point of the network reconnection is less than a preset time interval. An oscillation alarm refers to an alarm caused by a fault that repeatedly occurs and recovers within a short period of time.
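A minimal sketch of such a noise reduction rule is given below. It covers only repeated alarms and flash alarms, and it rests on assumptions that are not in the embodiment: each alarm is a dictionary with hypothetical fields `source`, `name`, `raised_s` and `cleared_s` (seconds, `None` if not cleared), and an alarm counts as a flash alarm when it is cleared sooner than a preset interval.

```python
def reduce_noise(alarms, min_duration_s):
    """Drop flash alarms and repeated alarms; keep the rest in arrival order."""
    seen = set()
    kept = []
    for a in alarms:
        cleared = a["cleared_s"]
        if cleared is not None and cleared - a["raised_s"] < min_duration_s:
            continue  # flash alarm: cleared sooner than the preset interval
        key = (a["source"], a["name"])
        if key in seen:
            continue  # repeated alarm from the same alarm source
        seen.add(key)
        kept.append(a)
    return kept


raw = [
    {"source": "R1", "name": "link_down", "raised_s": 0, "cleared_s": None},
    {"source": "R1", "name": "link_down", "raised_s": 10, "cleared_s": None},  # repeated
    {"source": "R2", "name": "link_down", "raised_s": 20, "cleared_s": 22},    # flash
    {"source": "R3", "name": "fan_fail", "raised_s": 30, "cleared_s": 600},
]
valid = reduce_noise(raw, min_duration_s=5)
print([a["source"] for a in valid])  # ['R1', 'R3']
```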

In one possible implementation, if the features of the alarm are in text form, the text form features may be encoded to obtain the features in digital form. The method for encoding the alarm characteristics is the same as the method for encoding the sample alarm characteristics, and is not described herein again.

802. The computer device obtains the length of the predicted time window according to the characteristics of the alarm.

The length of the predicted time window refers to the length of the time window predicted according to the characteristics of the alarm. In one possible implementation, the computer device may load the predictive model, input the characteristics of the alarm into the predictive model, and output the length of the predicted time window. The computer device may obtain the prediction model from the model library, or store the prediction model in advance, and the manner of obtaining the prediction model is not limited in this embodiment.

In one possible implementation, if there are multiple alarms in the first time window, the computer device may input the features of the multiple alarms into the prediction model and output a vector. The vector may include a number of elements, and the number of elements of the vector may be equal to the number of alarms. The elements of the vector correspond to the alarms one to one, and the value of each element is the length of the prediction time window corresponding to the alarm that the element corresponds to. Specifically, the ith element of the vector may correspond to alarm i, where alarm i is the ith alarm input to the prediction model in a prediction process, and i is a positive integer. The value of the ith element is the length of the prediction time window corresponding to alarm i. Correspondingly, for any one of the alarms, the element corresponding to the alarm can be determined from the elements of the vector, and the value of that element is used as the length of the prediction time window corresponding to the alarm.

For example, if 3 alarms are included in the first time window, denoted as alarm 1, alarm 2, and alarm 3, the features of alarm 1, alarm 2, and alarm 3 are sequentially input into the prediction model, and the output vector is (8, 5, 3). For alarm 1, the corresponding element is the 1st element of the vector, whose value is 8, so the length of the prediction time window corresponding to alarm 1 is 8 minutes; for alarm 2, the corresponding element is the 2nd element of the vector, whose value is 5, so the length of the prediction time window corresponding to alarm 2 is 5 minutes; for alarm 3, the corresponding element is the 3rd element of the vector, whose value is 3, so the length of the prediction time window corresponding to alarm 3 is 3 minutes.
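The one-to-one correspondence between the output vector and the input alarms can be illustrated with the short sketch below. The helper name `lengths_by_alarm` is hypothetical, and the vector is written directly as a list rather than produced by a real prediction model.

```python
def lengths_by_alarm(alarm_ids, output_vector):
    """Pair the i-th alarm with the i-th value of the output vector (minutes)."""
    if len(alarm_ids) != len(output_vector):
        raise ValueError("the vector must contain exactly one value per alarm")
    return dict(zip(alarm_ids, output_vector))


# Values taken from the example in the text.
predicted = lengths_by_alarm(["alarm 1", "alarm 2", "alarm 3"], [8, 5, 3])
print(predicted["alarm 2"])  # 5
```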

Obtaining the length of the prediction time window by using the prediction model is only an example, and the manner of obtaining the length of the prediction time window is not limited in this embodiment. For example, a preset correspondence between alarm characteristics and lengths of the predicted time window may be established, and the length of the predicted time window corresponding to an alarm characteristic may be looked up in the preset correspondence according to that alarm characteristic. The preset correspondence may include at least one alarm characteristic and at least one predicted time window length, and the preset correspondence may be pre-stored in the computer device.
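A preset correspondence of this kind could be as simple as a lookup table. In the sketch below, the choice of (alarm name, alarm level) as the key, every entry in the table, and the fallback value are hypothetical and serve only to illustrate the lookup.

```python
# Hypothetical preset correspondence: (alarm name, alarm level) -> length in minutes.
PRESET_LENGTHS_MIN = {
    ("link_down", "critical"): 20,
    ("port_error", "major"): 10,
}
# Assumed fallback for characteristics that are not in the table.
DEFAULT_LENGTH_MIN = 5


def lookup_length(name, level):
    """Return the preset predicted time window length for the given characteristics."""
    return PRESET_LENGTHS_MIN.get((name, level), DEFAULT_LENGTH_MIN)


print(lookup_length("link_down", "critical"))  # 20
print(lookup_length("fan_fail", "minor"))      # 5
```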

By using a prediction model to obtain the length of the prediction time window, at least the following effects can be achieved: in the process of alarm analysis, if the root cause alarm and the derivative alarm are separated in different time windows, all the derivative alarms corresponding to the root cause alarm cannot be obtained according to the time windows, and meanwhile, the probability of mistakenly identifying the derivative alarm corresponding to a certain root cause alarm as the derivative alarms corresponding to other root cause alarms is very high, so that the accuracy of alarm analysis is influenced. For example, if the root cause alarm 1 corresponds to the derivative alarm 1-1, the derivative alarm 1-2, and the derivative alarm 1-3, if the time window is too small, the ending time point of the time window is later than the root cause alarm 1, the derivative alarm 1-1, and the derivative alarm 1-2, but earlier than the derivative alarm 1-3, then only the root cause alarm 1, the derivative alarm 1-1, and the derivative alarm 1-2 are obtained according to the time window, and the derivative alarm 1-3 is not obtained, so that when analysis is performed according to the root cause alarm 1, the derivative alarm 1-1, and the derivative alarm 1-2, the accuracy is poor due to omission of the derivative alarm 1-3.

In this embodiment, the prediction model is obtained in the model training stage by training on the characteristics of the sample alarms and the labels of the sample alarms, and the label of a sample alarm represents the length of the target time window, so that the prediction model can learn the correspondence between the characteristics of an alarm and the length of the target time window from a large number of samples. In the process of alarm analysis, the length of the prediction time window output by the prediction model according to the characteristics of the alarms approaches the length of the target time window, so that the prediction time window contains the root cause alarm and all of its derivative alarms as far as possible. After the length of the time window is adjusted according to the length of the prediction time window, the probability that the adjusted time window contains the root cause alarm and all of its derivative alarms is high, so that the probability that a root cause alarm and its corresponding derivative alarms are divided into different time windows can be significantly reduced. As a result, the root cause alarm and all of its derivative alarms can be acquired as far as possible when alarms are acquired according to the second time window, and analyzing them together improves the accuracy of alarm analysis.

803. The computer device adjusts the length of the first time window according to the length of the predicted time window to obtain a second time window.

The second time window is the time window after the length adjustment. The starting point in time of the second time window may be later than the starting point in time of the first time window. In a possible implementation, when the length of the predicted time window is obtained, the current time point may be used as the starting time point of the second time window, so that when the length of the predicted time window is obtained by prediction, the length is used to adjust the time window and perform alarm analysis, thereby implementing a function of real-time alarm analysis. In another possible implementation, the alarm analysis instruction may be received, and a time point at which the alarm analysis instruction is received is used as a starting time point of the second time window, so that when there is a demand for alarm analysis, the time window is adjusted and alarm analysis is performed by using the predicted length.

Optionally, the process of adjusting the length of the first time window may include the following steps one to two:

Step one: obtain a target length according to the length of the predicted time window and the relative duration of the alarm.

The relative duration is used to indicate the location of the alarm in the first time window. In one possible implementation, the relative duration is the difference between the time of occurrence of the alarm and the starting time point of the first time window. For example, assuming that the starting time point of the first time window is 18:00, the occurrence time point of the alarm is 18:10, and the difference between 18:10 and 18:00 is 10 minutes, the relative duration of the alarm is 10 minutes. As another example, assuming that the starting time point of the first time window is 19:30 and the occurrence time point of the alarm is 19:35, the difference between 19:35 and 19:30 is 5 minutes and the relative duration of the alarm is 5 minutes.
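The two examples above can be reproduced with ordinary date-time subtraction. The function name `relative_duration` and the calendar date are assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def relative_duration(occurred: datetime, window_start: datetime) -> timedelta:
    """Difference between the alarm's occurrence time point and the window's start point."""
    return occurred - window_start


day = (2019, 3, 8)  # arbitrary date, only needed to build datetimes
d1 = relative_duration(datetime(*day, 18, 10), datetime(*day, 18, 0))
d2 = relative_duration(datetime(*day, 19, 35), datetime(*day, 19, 30))
print(d1, d2)  # 0:10:00 0:05:00
```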

Of course, this way of calculating the relative duration is only an example, and the relative duration may also be calculated by using other algorithms. For example, after the difference between the occurrence time point of the alarm and the start time point of the first time window is calculated, one or more coefficients may be added to the difference, subtracted from the difference, multiplied by the difference, or used to divide the difference, and the resulting sum, difference, product, or quotient is used as the relative duration. As another example, the relative duration may be the difference between the occurrence time point of the alarm and the end time point of the first time window. The algorithm for the relative duration is not limited in this embodiment.

The target length refers to the length to which the length of the first time window needs to be adjusted. The manner of obtaining the target length may include either of the following modes one and two:

Mode one: the sum of the length of the prediction time window and the relative duration of the alarm is obtained as the target length.

The sum is a duration. For example, assuming that the relative duration of the alarm is 10 minutes and the length of the predicted time window is 20 minutes, the sum of 10 minutes and 20 minutes is 30 minutes, and the target length is 30 minutes.

Mode two: when there are multiple alarms, for each of the multiple alarms, the sum of the length of the predicted time window corresponding to the alarm and the relative duration of the alarm is obtained, so that multiple sums are obtained, and the maximum value of the multiple sums is taken as the target length.

In one exemplary scenario, the first time window is 18:00 to 19:00, and the alarms within the first time window include root cause alarm 1, derived alarm 1-1, derived alarm 1-2, root cause alarm 2, derived alarm 2-1, and derived alarm 2-2. The occurrence time point of root cause alarm 1 is 18:00, so its relative duration is 0 minutes. The occurrence time point of derived alarm 1-1 is 18:10, so its relative duration is 10 minutes. The occurrence time point of derived alarm 1-2 is 18:15, so its relative duration is 15 minutes. The occurrence time point of root cause alarm 2 is 18:10, so its relative duration is 10 minutes. The occurrence time point of derived alarm 2-1 is 18:20, so its relative duration is 20 minutes. The occurrence time point of derived alarm 2-2 is 18:40, so its relative duration is 40 minutes. The length of the predicted time window obtained may be 15 minutes for root cause alarm 1 and its derived alarms, and 30 minutes for root cause alarm 2 and its derived alarms.

In this scenario, for root cause alarm 1, the sum of 15 minutes (the length of the predicted time window) and 0 minutes (relative duration) is obtained, resulting in 15 minutes; for derived alarm 1-1, the sum of 15 minutes (the length of the predicted time window) and 10 minutes (relative duration) is obtained, resulting in 25 minutes; for derived alarm 1-2, the sum of 15 minutes (the length of the predicted time window) and 15 minutes (relative duration) is obtained, resulting in 30 minutes; for root cause alarm 2, the sum of 30 minutes (the length of the predicted time window) and 10 minutes (relative duration) is obtained, resulting in 40 minutes; for derived alarm 2-1, the sum of 30 minutes (the length of the predicted time window) and 20 minutes (relative duration) is obtained, resulting in 50 minutes; for derived alarm 2-2, the sum of 30 minutes (the length of the predicted time window) and 40 minutes (relative duration) is obtained, resulting in 70 minutes. Therefore, the maximum value of 15 minutes, 25 minutes, 30 minutes, 40 minutes, 50 minutes, and 70 minutes is 70 minutes, and the target length is 70 minutes.
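The calculation in this scenario (mode two) can be checked with a few lines of code. The function name `target_length` and the representation of each alarm as a pair (predicted length, relative duration) in minutes are assumptions made for this sketch; the numbers are those of the scenario.

```python
def target_length(alarms):
    """Mode two: maximum over alarms of (predicted length + relative duration), in minutes."""
    return max(pred + rel for pred, rel in alarms)


# (length of the predicted time window, relative duration), both in minutes.
scenario = {
    "root cause alarm 1": (15, 0),
    "derived alarm 1-1": (15, 10),
    "derived alarm 1-2": (15, 15),
    "root cause alarm 2": (30, 10),
    "derived alarm 2-1": (30, 20),
    "derived alarm 2-2": (30, 40),
}
sums = {name: pred + rel for name, (pred, rel) in scenario.items()}
for name, total in sums.items():
    print(name, total)
print("target length:", target_length(scenario.values()))  # target length: 70
```

With a single alarm, the same function reduces to mode one, since the maximum of one sum is that sum.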

Step two: adjust the length of the first time window to the target length to obtain the second time window.

The length of the second time window is the target length. The starting time point of the second time window may be the same as the starting time point of the first time window. For example, if the first time window is 18:00 to 18:10, the length of the first time window is 10 minutes, the target length obtained is 20 minutes, and the length of the first time window is extended from 10 minutes to 20 minutes, then the second time window is 18:00 to 18:20.

Optionally, a preset error range may be set, a difference between the length of the predicted time window and the length of the first time window is obtained, the difference is compared with the preset error range, and when the difference between the length of the predicted time window and the length of the first time window is outside the preset error range, the length of the first time window is adjusted according to the length of the predicted time window. When the difference between the length of the predicted time window and the length of the first time window is within a preset error range, the alarm in the first time window may be analyzed.

Adjusting the length of the time window only when the difference between the length of the predicted time window and the length of the first time window is outside the preset error range allows the current length of the time window to carry a certain degree of error, that is, it accepts that some alarms in the current time window are wrongly divided. This balances the accuracy of the time window length prediction against the efficiency of the prediction. It also avoids an endless loop in which every predicted time window still separates some alarm from its corresponding derivative alarms and therefore triggers a further adjustment, which guarantees that the time window length prediction process converges, improves the performance of that process, and enhances the robustness of time window extraction. In addition, operation and maintenance personnel can customize the error threshold, and thereby control the number of wrongly divided alarms allowed in the time window, which improves flexibility.

In a possible implementation, referring to fig. 9, the above steps 801 to 803 may be an iterative process, and after the length of the first time window is adjusted once, the above steps 801 to 803 may be continuously performed one or more times, so as to continuously adjust the length of the first time window until the difference between the length of the predicted time window and the length of the first time window is within a preset error range, and then a target length is output, so that the following step 804 is performed according to a second time window obtained by the target length.

In this way, the effects achieved can at least include the following. Through iteration, the optimal time window length matching the current network condition can be continuously found, the length of the time window can be adjusted along with the dynamic change of the current network condition, and the adaptability and flexibility of the length of the time window are improved. When the alarm reporting frequency decreases, a smaller predicted time window length can be generated and the length of the time window can be reduced in time, so that the waiting time of fault analysis is effectively shortened, the fault processing efficiency is improved, and the mean time to recovery (MTTR) is improved. When the alarm reporting frequency of the existing network is higher, a larger predicted time window length can be generated and the length of the time window can be enlarged in time, so that, within a certain allowable error range, the probability that an alarm and its derived alarms are divided into different time windows is reduced, and the accuracy of alarm analysis is improved. Meanwhile, the length of the generated time window is not too large, which avoids analyzing the full set of alarms because the time window is too long, avoids the large system overhead that such an analysis would cause, and effectively improves the performance of alarm analysis.
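The iterative process can be sketched as a simple loop. The sketch makes several simplifying assumptions that are not part of the embodiment: `predict` is a hypothetical callable standing in for steps 801 and 802 (it returns a predicted length for the current window length), the window is resized directly to the predicted length rather than to a target length that also includes relative durations, and `max_iters` is an added safety bound.

```python
def converge_window_length(initial, predict, threshold, max_iters=10):
    """Repeat prediction and adjustment until the difference is within the threshold.

    All lengths are in minutes. Returns the window length used for analysis.
    """
    length = initial
    for _ in range(max_iters):
        predicted = predict(length)
        if abs(predicted - length) <= threshold:
            return length  # within the preset error range: analyze with this window
        length = predicted  # outside the range: adjust and iterate again
    return length


# Stub predictor that always proposes 20 minutes, starting from a 1-minute window.
final = converge_window_length(1, lambda length: 20, threshold=3)
print(final)  # 20
```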

The preset error range may be determined according to an error threshold. Specifically, the adjustment using the preset error range may include any one of the following adjustment modes one to four:

Adjustment mode one: a first difference between the length of the predicted time window and the length of the first time window may be obtained and compared with a first error threshold. When the first difference is greater than the first error threshold, it indicates that the length of the first time window is too small relative to the length of the predicted time window, and that alarm analysis according to the current first time window would result in poor accuracy; the computer device therefore determines that the difference between the length of the predicted time window and the length of the first time window is outside the preset error range, and the length of the first time window can be enlarged. When the first difference is not greater than the first error threshold, it indicates that the accuracy of the length of the first time window is within an acceptable level; the computer device determines that the difference between the length of the predicted time window and the length of the first time window is within the preset error range, and can directly analyze the alarm in the first time window. When the above steps 801 to 803 are executed next time and it is determined that the first difference obtained next time is greater than the first error threshold, the length of the first time window is expanded again.

The first error threshold may be considered as the maximum allowed error. Regarding the manner of obtaining the first error threshold, in one possible implementation, the computer device may receive a setting instruction according to which the first error threshold is obtained, the setting instruction including the first error threshold. In another possible implementation, the first error threshold may be pre-stored in the computer device, and the computer device may read the pre-stored first error threshold, and this embodiment does not limit the manner of obtaining the first error threshold.

The preset error range in adjustment mode one is that the first difference is not greater than the first error threshold. The first difference is obtained by subtracting the length of the first time window from the length of the predicted time window, where the length of the predicted time window is the minuend and the length of the first time window is the subtrahend. That is, when the length of the predicted time window is greater than the length of the first time window, the first difference is the absolute value of the difference between the length of the predicted time window and the length of the first time window.

The manner of obtaining the first error threshold is as described above, and is not described herein again.

For example, if the length of the first time window is 10 minutes, the length of the predicted time window is 20 minutes, the target length is 15 minutes, and the first error threshold is 3 minutes, the difference between 20 minutes and 10 minutes is 10 minutes, which is the first difference. Comparing 10 minutes with 3 minutes, 10 minutes is greater than 3 minutes, so the length of the first time window may be expanded from 10 minutes to 15 minutes.

Adjustment mode two: a second difference between the length of the first time window and the length of the predicted time window may be obtained and compared with a second error threshold. When the second difference is greater than the second error threshold, it indicates that the length of the first time window is too large relative to the length of the predicted time window, and that alarm analysis according to the current first time window would result in poor accuracy; the computer device therefore determines that the difference between the length of the predicted time window and the length of the first time window is outside the preset error range, and the length of the first time window can be reduced. When the second difference is not greater than the second error threshold, it indicates that the accuracy of the length of the first time window is within an acceptable level; the computer device determines that the difference between the length of the predicted time window and the length of the first time window is within the preset error range, and can directly analyze the alarm in the first time window. When the above steps 801 to 803 are executed next time and it is determined that the second difference obtained next time is greater than the second error threshold, the length of the first time window is reduced again.

The preset error range in adjustment mode two is that the second difference is not greater than the second error threshold. The second difference is obtained by subtracting the length of the predicted time window from the length of the first time window, where the length of the first time window is the minuend and the length of the predicted time window is the subtrahend. That is, when the length of the first time window is greater than the length of the predicted time window, the second difference is the absolute value of the difference between the length of the first time window and the length of the predicted time window.

The obtaining manner of the second error threshold is the same as the obtaining manner of the first error threshold, and is not described herein again.

For example, if the length of the first time window is 20 minutes, the length of the predicted time window is 10 minutes, the target length is 15 minutes, and the second error threshold is 3 minutes, the difference between 20 minutes and 10 minutes may be obtained to obtain 10 minutes, which is the second difference value, 10 minutes and 3 minutes may be compared, 10 minutes is greater than 3 minutes, and thus the length of the first time window may be reduced from 20 minutes to 15 minutes.

In one possible implementation, after the length of the first time window is reduced once, the above steps 801 to 803 may be continuously performed one or more times, so as to continuously reduce the length of the first time window until a second difference between the length of the predicted time window and the length of the first time window is not greater than a second error threshold, and the target length is output, so that the following step 804 is performed according to the second time window.
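Adjustment modes one and two can be combined in one decision function, sketched below under assumptions: the function name `adjust_length` and its parameter names are hypothetical, all lengths are plain numbers of minutes, and the window is resized to a target length that is supplied by the caller.

```python
def adjust_length(first_len, predicted_len, target_len, expand_threshold, shrink_threshold):
    """Return the window length to use after comparing predicted and current lengths.

    Mode one: predicted - first > expand_threshold  -> window too small, expand.
    Mode two: first - predicted > shrink_threshold  -> window too large, shrink.
    Otherwise the difference is within the preset error range and the length is kept.
    """
    if predicted_len - first_len > expand_threshold:
        return target_len
    if first_len - predicted_len > shrink_threshold:
        return target_len
    return first_len


# The two worked examples from the text, followed by an in-range case.
print(adjust_length(10, 20, 15, 3, 3))  # 15 (expanded from 10)
print(adjust_length(20, 10, 15, 3, 3))  # 15 (reduced from 20)
print(adjust_length(10, 12, 15, 3, 3))  # 10 (kept)
```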

In another possible implementation, a difference between the target length and the length of the first time window may also be obtained, and it is determined whether the difference is outside a preset error range. When the difference between the target length and the length of the first time window is outside the preset error range, the length of the first time window is adjusted according to the target length; for details, see adjustment modes three and four below. When the difference between the target length and the length of the first time window is within the preset error range, the alarm in the first time window may be analyzed.

Adjustment mode three: a third difference between the target length and the length of the first time window may be obtained and compared with a third error threshold. When the third difference is greater than the third error threshold, it indicates that the length of the first time window is too small relative to the target length, and that alarm analysis according to the current first time window would result in poor accuracy; the computer device therefore determines that the difference between the target length and the length of the first time window is outside the preset error range, and the length of the first time window can be enlarged. When the third difference is not greater than the third error threshold, it indicates that the accuracy of the length of the first time window is within an acceptable level; the computer device determines that the difference between the target length and the length of the first time window is within the preset error range, and may directly analyze the alarm in the first time window. When the above steps 801 to 803 are executed next time and it is determined that the third difference obtained next time is greater than the third error threshold, the length of the first time window is expanded again.

The preset error range in adjustment mode three is that the third difference is not greater than the third error threshold. The third difference is obtained by subtracting the length of the first time window from the target length, where the target length is the minuend and the length of the first time window is the subtrahend. That is, when the target length is greater than the length of the first time window, the third difference is the absolute value of the difference between the target length and the length of the first time window.

For example, if the length of the first time window is 10 minutes, the target length is 20 minutes, and the third error threshold is 3 minutes, the difference between 20 minutes and 10 minutes may be obtained, resulting in 10 minutes, which is the third difference value, 10 minutes may be compared to 3 minutes, 10 minutes being greater than 3 minutes, and thus the length of the first time window may be expanded from 10 minutes to 20 minutes.

In a possible implementation, after the length of the first time window is expanded once, steps 801 to 803 may be performed one or more further times, so as to continue expanding the length of the first time window until the third difference between the target length and the length of the first time window is not greater than the third error threshold. The target length is then output, and the following step 804 is performed according to the second time window.

The fourth adjustment mode is that a fourth difference between the target length and the length of the first time window may be obtained and compared with a fourth error threshold. When the fourth difference is greater than the fourth error threshold, the length of the first time window is too large relative to the target length, and alarm analysis performed according to the current first time window may be too inaccurate; the computer device therefore determines that the difference between the target length and the length of the first time window is outside the preset error range, and may reduce the length of the first time window. When the fourth difference is not greater than the fourth error threshold, the accuracy of the length of the first time window is acceptable; the computer device determines that the difference between the target length and the length of the first time window is within the preset error range, and may directly analyze the alarm in the first time window. When steps 801 to 803 are executed next time and the fourth difference obtained then is greater than the fourth error threshold, the length of the first time window is reduced again.

In the fourth adjustment mode, the difference is outside the preset error range when the fourth difference is greater than the fourth error threshold. The fourth difference is obtained by subtracting the target length from the length of the first time window, that is, the length of the first time window is the minuend and the target length is the subtrahend. Because the length of the first time window is greater than the target length, the fourth difference equals the absolute value of the difference between the length of the first time window and the target length.

The obtaining manner of the fourth error threshold is the same as the obtaining manner of the first error threshold, and is not described herein again.

For example, if the length of the first time window is 20 minutes, the target length is 10 minutes, and the fourth error threshold is 3 minutes, the fourth difference is 20 minutes minus 10 minutes, that is, 10 minutes. Because 10 minutes is greater than 3 minutes, the length of the first time window may be reduced from 20 minutes to 10 minutes.
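The third and fourth adjustment modes can be illustrated with a minimal sketch. The function name `adjust_window_length` and its parameter names are assumptions introduced here for illustration and do not appear in the embodiment; durations are plain numbers in minutes, and the window is set directly to the target length when the difference exceeds the threshold, as in the two examples above.

```python
def adjust_window_length(first_len, target_len, expand_threshold, shrink_threshold):
    """Return the window length after one pass of the third/fourth adjustment modes.

    All values are durations in minutes. The names used here are
    illustrative assumptions, not identifiers from the embodiment.
    """
    if target_len > first_len:
        # Third adjustment mode: minuend = target length, subtrahend = window length.
        third_difference = target_len - first_len
        if third_difference > expand_threshold:
            return target_len  # window too small: expand to the target length
    elif first_len > target_len:
        # Fourth adjustment mode: minuend = window length, subtrahend = target length.
        fourth_difference = first_len - target_len
        if fourth_difference > shrink_threshold:
            return target_len  # window too large: reduce to the target length
    # Difference within the preset error range: keep the window unchanged.
    return first_len


print(adjust_window_length(10, 20, 3, 3))  # 20, the third-mode example
print(adjust_window_length(20, 10, 3, 3))  # 10, the fourth-mode example
print(adjust_window_length(10, 12, 3, 3))  # 10, a difference of 2 is within range
```

Separate thresholds are passed for expansion and reduction because the embodiment does not require the third and fourth error thresholds to be equal.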

In a possible implementation, after the length of the first time window is reduced once, steps 801 to 803 may be performed one or more further times, so as to continue reducing the length of the first time window until the fourth difference between the target length and the length of the first time window is not greater than the fourth error threshold. The target length is then output, and the following step 804 is performed according to the second time window.

It should be noted that, in this embodiment, the magnitude relationship among the first error threshold, the second error threshold, the third error threshold, and the fourth error threshold is not limited. The first error threshold, the second error threshold, the third error threshold, and the fourth error threshold may be the same or different.

Through the first adjustment mode to the fourth adjustment mode, the length of the first time window can be enlarged when the length of the first time window is too small, and the length of the first time window can be reduced when the length of the first time window is too large, so that the effect of flexible adjustment is achieved. In one possible implementation, the target length and the length of the first time window may conform to a target size relationship, the target size relationship including any one of:

(1) The target length is greater than the length of the first time window, and a first difference between the target length and the length of the first time window is greater than a first error threshold, where the minuend of the first difference is the target length and the subtrahend of the first difference is the length of the first time window. As an example, an implementation manner of making the target length and the length of the first time window conform to the target size relationship may be the first adjustment mode or the third adjustment mode.

(2) The target length is less than the length of the first time window, and a second difference between the target length and the length of the first time window is greater than a second error threshold, where the minuend of the second difference is the length of the first time window and the subtrahend of the second difference is the target length. As an example, an implementation manner of making the target length and the length of the first time window conform to the target size relationship may be the second adjustment mode or the fourth adjustment mode.

804. The computer device analyzes the alarms within the second time window.

Regarding the manner in which alarms within the second time window are obtained, for any one or more alarms in the alarm data stream, the computer device may determine whether the time point of occurrence of the one or more alarms falls within the second time window, and obtain the one or more alarms if the time point of occurrence of the one or more alarms falls within the second time window.
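The selection described above can be sketched as a simple filter over the alarm data stream. The `Alarm` class, the field `occurred_at`, the function `alarms_in_window`, and the sample alarm names are hypothetical; treating both window boundaries as inclusive is also an assumption, since the embodiment does not state how boundary time points are handled.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    name: str
    occurred_at: float  # occurrence time point, in seconds from an arbitrary origin


def alarms_in_window(alarm_stream, window_start, window_length):
    """Keep the alarms whose occurrence time point falls within the window."""
    window_end = window_start + window_length
    # Assumption: both boundaries of the window are inclusive.
    return [a for a in alarm_stream if window_start <= a.occurred_at <= window_end]


stream = [Alarm("LINK_DOWN", 5.0), Alarm("BGP_PEER_LOST", 65.0), Alarm("FAN_FAIL", 700.0)]
selected = alarms_in_window(stream, window_start=0.0, window_length=600.0)
print([a.name for a in selected])  # ['LINK_DOWN', 'BGP_PEER_LOST']
```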

Regarding the manner in which alarms are analyzed, in one possible implementation, root cause analysis may be performed on the alarms to obtain root cause analysis results. The root cause analysis results include, but are not limited to, one or more of: a root cause alarm among the alarms, a derivative alarm among the alarms, and a correspondence between the root cause alarm and the derivative alarm. An embodiment of root cause analysis may include: clustering the alarms in the second time window to obtain at least one class; and inputting the alarms in the same class into a root cause analysis model, which outputs a root cause analysis result. The root cause analysis model is used for predicting a root cause analysis result according to the alarms.

It should be noted that the above description only takes a single computer device executing the alarm analysis method as an example. Alternatively, the alarm analysis method may be executed by a computer device cluster, whose system architecture may be as shown in fig. 3 or fig. 4. Different computer devices may execute different or the same steps in the embodiment of fig. 8, and may be located at different places or at the same place.

In combination with the embodiment of fig. 5 and the embodiment of fig. 8 above, the embodiments of the present application further provide a logical architecture as shown in fig. 10. As shown in fig. 10, the logical architecture includes a network, a network management system, an alarm analysis system, and a big data storage and processing engine. The network management system may be provided as the computer device in the above embodiment.

A network may be an operator or enterprise deployed network environment that includes various levels of network devices and network links between different network devices. If the network equipment or the network link has a fault, the network equipment reports an alarm to the network management system.

The network management system mainly comprises an alarm acquisition module, an alarm database module, an alarm noise reduction module, a noise reduction rule database module, and the like. The alarm acquisition module is used for acquiring newly reported alarms in the network, summarizing the acquired alarms, unifying their formats, and storing the alarms with unified formats into the alarm database; the alarm database module is used for storing alarms; the alarm noise reduction module is used for loading noise reduction rules and reducing noise of the alarms by using the noise reduction rules, so that invalid alarms are filtered out; the noise reduction rule database module is used for storing noise reduction rules. The alarm noise reduction module may reduce noise of the alarms by using a rule execution engine, which may be Drools (an open source business rule engine from JBoss that makes enterprise policies easy to access, adjust, and manage).
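As a rough illustration of rule-based noise reduction, the sketch below models each noise reduction rule as a predicate that marks an alarm as invalid. It is plain Python rather than Drools, and the rules, field names (`level`, `cleared`), and function names are hypothetical examples, since the embodiment does not specify the content of the noise reduction rules.

```python
def load_rules():
    """Return hypothetical noise reduction rules.

    Each rule is a predicate that returns True when an alarm should be
    filtered out as invalid. The real rules are not given in the embodiment.
    """
    return [
        lambda alarm: alarm.get("level") == "info",  # drop purely informational alarms
        lambda alarm: alarm.get("cleared", False),   # drop alarms that are already cleared
    ]


def denoise(alarms, rules):
    """Keep only the alarms that no rule marks as invalid."""
    return [a for a in alarms if not any(rule(a) for rule in rules)]


alarms = [
    {"name": "LINK_DOWN", "level": "critical"},
    {"name": "LOGIN_OK", "level": "info"},
    {"name": "FAN_FAIL", "level": "minor", "cleared": True},
]
kept = denoise(alarms, load_rules())
print([a["name"] for a in kept])  # ['LINK_DOWN']
```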

The alarm analysis system comprises an alarm dump service, a big data engine, a time window extraction service, a class library, a big data storage and processing engine, a class (configuration) management service, a user interface (UI), an alarm subscription module, a feature extraction module, a prediction model inference module, a prediction model training module, a model library, and a time window length judging module.

The alarm dump service is used for receiving the alarm sent by the network management system, caching the alarm and sending the alarm to the big data engine.

The time window extraction service is used for subscribing to denoised alarms from the network management system, acquiring the length of the time window by using the method provided in this embodiment, and sending the length of the time window to the alarm analysis module. The time window extraction service may also send the denoised alarms to the big data engine. The specific logical architecture of the time window extraction service may be as shown in fig. 11.

Big data engine: used for storing alarms in real time and providing an alarm query service to other modules. The big data engine may be Druid (Apache Druid is an open source real-time analytics data store).

An alarm analysis module: used for obtaining the time window length from the time window extraction service, obtaining alarms from the big data engine according to the time window length, and analyzing the alarms. The analysis may include alarm aggregation and root cause analysis. The alarm analysis module can obtain the class corresponding to each alarm and write the class into the class library. The alarm analysis module may periodically obtain the time window length from the time window extraction service, thereby periodically analyzing alarms.

Class library: used for storing the classes obtained by the alarm analysis module, and may also be called a configuration database. The class library may provide the stored classes to the class management service.

Class management service: may also be referred to as a configuration management service, and interfaces with the class library. The class management service is used for querying and collecting statistics on classes and alarms, thereby supporting the functions required by the UI service.

Big data storage and processing engine: the underlying framework of each module. The big data storage and processing engine includes one or more data engines and their management services, and may provide big data storage and analysis capabilities and a Pipeline running framework. A Pipeline begins with data input from one data source and ends with data output to another data source. The Pipeline running framework can implement a periodic and controllable automatic data processing flow, which makes it convenient for the alarm analysis module to run periodically. As shown by the dashed box inside the system in fig. 10, one operation flow of the Pipeline running framework is as follows: the big data engine reads the alarms and outputs them to the alarm analysis module, the alarm analysis module analyzes the alarms, and the analysis result is written into the class library.
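One pass of such a Pipeline can be sketched as a source, a list of processing stages, and a sink. Everything named below (`run_pipeline`, `analyze`, `stored_alarms`, `class_library`) is a hypothetical stand-in: the in-memory lists take the place of the big data engine and the class library, and the toy `analyze` stage merely groups alarms by name rather than performing the aggregation and root cause analysis of the embodiment.

```python
from typing import Callable, Iterable, List


def run_pipeline(read_source: Callable[[], Iterable[dict]],
                 stages: List[Callable[[list], list]],
                 write_sink: Callable[[list], None]) -> None:
    """Run one pass: read from the input source, apply each stage, write to the sink."""
    data = list(read_source())
    for stage in stages:
        data = stage(data)
    write_sink(data)


# Hypothetical stand-ins for the big data engine and the class library.
stored_alarms = [
    {"name": "LINK_DOWN", "device": "r1"},
    {"name": "LINK_DOWN", "device": "r2"},
    {"name": "FAN_FAIL", "device": "r1"},
]
class_library = []


def analyze(alarms):
    """Toy analysis stage: one class per distinct alarm name."""
    classes = {}
    for alarm in alarms:
        classes.setdefault(alarm["name"], []).append(alarm["device"])
    return [{"class": name, "devices": devices} for name, devices in classes.items()]


run_pipeline(lambda: stored_alarms, [analyze], class_library.extend)
print(class_library)
```

A scheduler that calls `run_pipeline` at a fixed interval would give the periodic behavior described above; it is omitted here to keep the sketch short.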

UI service: used for displaying alarms and the results of alarm analysis.

An alarm subscription module: used for acquiring newly added alarms from the network management system, screening the alarms according to the current length of the time window, and predicting the length of the time window according to the characteristics of the alarms. The alarm subscription module can also pull historical alarms from the alarm library and write them to a local file, so that operation and maintenance personnel can label the historical alarms manually and the prediction model can be trained on the labeled historical alarms.

A feature extraction module: used for receiving the alarms sent by the alarm subscription module and extracting five key features of each alarm. The feature extraction module can support feature extension.
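A minimal sketch of such a feature extraction step follows. It assumes that the five key features are the ones enumerated in claim 5 (name, level, event type, occurrence time point, and number of alarms occurring at that time point); that mapping, the dictionary field names, and the `extra_extractors` hook used to illustrate feature extension are all assumptions made for this example.

```python
from collections import Counter


def extract_features(alarms, extra_extractors=None):
    """Return one feature dict per alarm.

    extra_extractors maps a feature name to a function of the alarm; it is a
    hypothetical hook illustrating how feature extension could be supported.
    """
    extra_extractors = extra_extractors or {}
    # Number of alarms that share each occurrence time point.
    per_time_point = Counter(a["occurred_at"] for a in alarms)
    rows = []
    for a in alarms:
        row = {
            "name": a["name"],
            "level": a["level"],
            "event_type": a["event_type"],
            "occurred_at": a["occurred_at"],
            "count_at_time_point": per_time_point[a["occurred_at"]],
        }
        for feature_name, extractor in extra_extractors.items():
            row[feature_name] = extractor(a)
        rows.append(row)
    return rows


alarms = [
    {"name": "LINK_DOWN", "level": "critical", "event_type": "communication", "occurred_at": 100},
    {"name": "BGP_PEER_LOST", "level": "major", "event_type": "communication", "occurred_at": 100},
    {"name": "FAN_FAIL", "level": "minor", "event_type": "equipment", "occurred_at": 160},
]
features = extract_features(alarms)
print(features[0]["count_at_time_point"])  # 2
```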

A prediction model inference module: used for loading a prediction model from the model library, predicting the length of the corresponding prediction time window based on the characteristics of each alarm, and sending the length of the prediction time window to the time window length judging module.

A prediction model training module: used for training, by a model training algorithm, a prediction model from the input characteristics of the sample alarms and the labels of the sample alarms, and storing the prediction model in the model library so that the prediction model inference module can call it.

Model library: for managing and storing the predictive models.

Time window length judging module: used for calculating the length of the predicted time window of each alarm and judging, according to the error threshold, whether the current length of the time window is within the allowable error range, that is, whether the time window just contains the alarm and all of its derived alarms. For the specific processing steps of the time window length judging module, refer to step 803 above.

Referring to fig. 11, the logical architecture of the time window extraction service includes a prediction model inference module and a prediction model training module. The prediction model inference module is used to implement the embodiment of fig. 8 and can operate in real time. The prediction model training module is used to perform the embodiment of fig. 5 and may run periodically.

For the operation flow of the prediction model inference module, as shown in fig. 11, the alarm subscription module subscribes to newly added alarms from the network management system, acquires the alarms within the time window, and sends them to the feature extraction module; the feature extraction module extracts the features of the alarms and sends the features to the prediction model inference module. The prediction model inference module loads a prediction model from the model library and uses it to predict the length of the corresponding prediction time window for each alarm; calculates the difference corresponding to each alarm according to the occurrence time point of the alarm; and adds each alarm's difference to the length of its prediction time window to obtain a plurality of sums. The time window length judging module compares the plurality of sums with the current length of the time window and judges whether the current length needs to be expanded to the maximum value of the plurality of sums. If so, the length of the prediction time window is predicted again for the alarms in the expanded time window; otherwise, the current length of the time window is output directly.
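The inference flow above can be sketched as a loop. This is an illustration under stated assumptions, not the implementation of the embodiment: `settle_window_length` and its parameters are hypothetical names, `predict_length` is a stand-in for the prediction model, the difference corresponding to each alarm is taken to be its occurrence time point minus the start of the time window (the relative duration of claim 8), and the `max_rounds` guard is added only to guarantee termination.

```python
def settle_window_length(alarms, window_start, window_length, predict_length,
                         error_threshold, max_rounds=10):
    """Expand the window until the largest sum no longer exceeds it by more than the threshold.

    alarms: list of (occurrence_time, features) tuples.
    predict_length: hypothetical stand-in for the prediction model; it maps
    the features of an alarm to a predicted time window length.
    """
    for _ in range(max_rounds):
        window_end = window_start + window_length
        in_window = [(t, f) for t, f in alarms if window_start <= t <= window_end]
        if not in_window:
            return window_length
        # For each alarm: relative duration plus predicted time window length.
        sums = [(t - window_start) + predict_length(f) for t, f in in_window]
        target = max(sums)
        if target - window_length > error_threshold:
            window_length = target  # expand, then predict again on the larger window
        else:
            return window_length    # within the error range: output the current length
    return window_length


# Hypothetical data: times and lengths in minutes.
predicted = {"root": 18, "derived": 5, "late": 8}
alarms = [(0, "root"), (4, "derived"), (15, "late")]
result = settle_window_length(alarms, window_start=0, window_length=10,
                              predict_length=predicted.get, error_threshold=3)
print(result)  # 23
```

In this run the window grows from 10 to 18 minutes, which brings the alarm at minute 15 into the window; its sum of 23 then triggers a second expansion, after which the length is stable.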

For the operation flow of the prediction model training module, as shown in fig. 11, the alarm subscription module pulls the alarms within a certain time range from the alarm library and writes them into a local file. After the alarms are manually labeled, the labeled alarms are sent to the feature extraction module for feature extraction. The prediction model training module then trains the prediction model through a model training algorithm and stores the prediction model in the model library, so that the prediction model inference module can call the prediction model from the model library.

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.

Fig. 12 is a schematic structural diagram of an alarm analysis apparatus provided in an embodiment of the present application, where the alarm analysis apparatus may be applied to a computer device in the foregoing embodiments, as shown in fig. 12, the apparatus includes:

an obtaining module 1201, configured to perform step 801; the obtaining module 1201 is further configured to execute the step 802; an adjusting module 1202, configured to perform the step 803; an analyzing module 1203 is configured to perform the step 804.

Optionally, the obtaining module 1201 is configured to input the characteristic of the alarm into a prediction model, and output the length of the prediction time window.

Optionally, the obtaining module 1201 is configured to perform the step 506; the apparatus further includes a model training module, configured to perform the step 507.

Optionally, the adjusting module 1202 is configured to adjust a length of the time window using a prediction error condition.

Optionally, the adjusting module 1202 is configured to execute any one of the first adjusting manner to the fourth adjusting manner in step 803.

Optionally, the adjusting module 1202 is configured to execute the first step and the second step in the step 803.

Optionally, the obtaining module 1201 is configured to perform any one of the following: acquiring the sum of the length of the predicted time window and the relative duration of the alarm as the target length; when the number of the alarms is multiple, for each alarm in the multiple alarms, obtaining the sum of the length of the predicted time window corresponding to the alarm and the relative duration of the alarm to obtain multiple sums, and obtaining the maximum value of the multiple sums as the target length.

It should be noted that, when analyzing an alarm, the alarm analysis apparatus provided in the embodiment of fig. 12 is only illustrated by the division of the functional modules, and in practical applications, the above function allocation may be completed by different functional modules according to needs, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the above described functions. In addition, the alarm analysis device and the alarm analysis method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments and are not described herein again.

It should be appreciated that the apparatus 1200 herein is embodied in the form of functional modules. The term module herein may refer to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality. In an alternative example, as will be understood by those skilled in the art, the apparatus 1200 may be embodied as a computer device in the foregoing embodiment, and the apparatus 1200 may be configured to perform each process and/or step corresponding to the computer device in the foregoing method embodiment, that is, the apparatus 1200 has a function of implementing the corresponding step performed by the computer device in the foregoing method; the function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the above functions; for example, the obtaining module and the analyzing module may be replaced by a processor, and processing operations in the method embodiments are respectively executed, which is not described herein again to avoid repetition.

All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described in detail herein.

In an exemplary embodiment, the present application further provides a computer program product comprising instructions which, when run on a computer device, enable the computer device to implement at least one of the alarm analysis method and the predictive model training method of the above embodiments.

In an exemplary embodiment, the present application further provides a chip, which includes a processor, configured to call and execute instructions stored in a memory, so that a device in which the chip is installed performs at least one of the alarm analysis method and the prediction model training method.

In one exemplary embodiment, the present application further provides a chip comprising: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing at least one of the alarm analysis method and the prediction model training method.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Video Disk (DVD)), or a semiconductor medium (e.g., solid state disk), among others.

The term "and/or" in this application is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in the present application generally indicates that the former and latter related objects are in an "or" relationship.

The term "plurality" in this application means two or more, e.g., a plurality of packets means two or more packets.

The terms "first," "second," and the like in the present application are used for distinguishing between identical or similar items that have substantially the same function, and those skilled in the art will understand that the terms "first," "second," and the like do not denote any order or importance.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. An alarm analysis method, characterized in that the method comprises:

acquiring the characteristics of an alarm in a first time window;

acquiring the length of a predicted time window according to the alarm characteristics;

adjusting the length of the first time window according to the length of the predicted time window to obtain a second time window;

and analyzing the alarm in the second time window.

2. The method of claim 1, wherein obtaining the length of the predicted time window according to the characteristic of the alarm comprises:

inputting the alarm characteristics into a prediction model, and outputting the length of the prediction time window, wherein the prediction model is used for predicting the length of the time window according to the alarm characteristics.

3. The method of claim 2, wherein before inputting the characteristic of the alarm into a predictive model and outputting the length of the predicted time window, the method further comprises:

obtaining characteristics of sample alarms and labels of the sample alarms, wherein the labels represent the length of a target time window, and the target time window comprises root alarms in the sample alarms and each derived alarm corresponding to the root alarms;

and training to obtain the prediction model according to the characteristics of the sample alarm and the label of the sample alarm.

4. The method of any one of claims 1 to 3, wherein the predictive model comprises at least one of a neural network model, a random forest model, a logistic regression model, and a ridge regression model;

the node of the input layer in the neural network model is used for receiving the input alarm characteristics, and the node of the output layer in the neural network model is used for outputting the length of the prediction time window;

a root node of a decision tree in the random forest model is used for receiving input characteristics of an alarm, and leaf nodes of the decision tree are used for outputting the length of a prediction time window;

the independent variable of the logistic regression model is the characteristic of an alarm, and the dependent variable of the logistic regression model is the length of a prediction time window;

the independent variable of the ridge regression model is the characteristic of the alarm, and the dependent variable of the ridge regression model is the length of the prediction time window.

5. The method according to any of claims 1 to 4, wherein the characteristics of the alarm comprise at least one of a name of the alarm, a level of the alarm, an event type of the alarm, a time point of occurrence of the alarm, and a number of alarms occurring at the time point of occurrence.

6. The method according to any of claims 1 to 5, wherein said adjusting the length of the first time window in accordance with the length of the predicted time window comprises:

and when the difference value between the length of the predicted time window and the length of the first time window is out of a preset error range, adjusting the length of the first time window according to the length of the predicted time window.

7. The method according to any one of claims 1 to 6, wherein said adjusting the length of the first time window according to the length of the predicted time window to obtain a second time window comprises:

acquiring a target length according to the length of the predicted time window and the relative duration of the alarm, wherein the relative duration represents the position of the alarm in the first time window;

and adjusting the length of the first time window to the target length to obtain the second time window.

8. The method of claim 7, wherein the relative duration is a difference between a time point of occurrence of the alert and a starting time point of the first time window.

9. The method according to claim 7 or 8, wherein the obtaining of the target length according to the length of the predicted time window and the relative duration of the alarm comprises any one of the following:

acquiring the sum of the length of the predicted time window and the relative duration of the alarm as the target length;

when the number of the alarms is multiple, for each alarm in the multiple alarms, obtaining the sum of the length of the predicted time window corresponding to the alarm and the relative duration of the alarm to obtain multiple sums, and obtaining the maximum value of the multiple sums to be used as the target length.

10. An alarm analysis apparatus, characterized in that the apparatus comprises:

the acquisition module is used for acquiring the characteristics of the alarm in the first time window;

the obtaining module is further configured to obtain a length of a predicted time window according to the characteristic of the alarm;

the adjusting module is used for adjusting the length of the first time window according to the length of the predicted time window to obtain a second time window;

and the analysis module is used for analyzing the alarm in the second time window.

11. The apparatus of claim 10, wherein the obtaining module is configured to input the characteristic of the alarm into a prediction model and output the length of the predicted time window, and the prediction model is configured to predict the length of the time window according to the characteristic of the alarm.

12. The apparatus of claim 11,

the obtaining module is configured to obtain characteristics of sample alarms and tags of the sample alarms, where the tags indicate lengths of target time windows, and the target time windows include root alarms in the sample alarms and each derivative alarm corresponding to the root alarms;

the device further comprises: and the model training module is used for training to obtain the prediction model according to the characteristics of the sample alarm and the label of the sample alarm.

13. The apparatus of any one of claims 10 to 12, wherein the predictive model comprises at least one of a neural network model, a random forest model, a logistic regression model, and a ridge regression model;

14. The apparatus according to any one of claims 10 to 13, wherein the characteristics of the alarm include at least one of a name of the alarm, a level of the alarm, an event type of the alarm, an occurrence time point of the alarm, and a number of alarms occurring at the occurrence time point.

15. The apparatus according to any one of claims 10 to 14, wherein the adjusting module is configured to adjust the length of the first time window according to the length of the predicted time window when a difference between the length of the predicted time window and the length of the first time window is outside a preset error range.

16. The apparatus of any one of claims 10 to 15, wherein the adjusting module is configured to:

17. The apparatus of claim 16, wherein the relative duration is a difference between a time point of occurrence of the alert and a starting time point of the first time window.

18. The apparatus according to claim 16 or 17, wherein the obtaining module is configured to perform any one of:

19. A computer device comprising one or more processors and one or more volatile or non-volatile memories having stored therein at least one instruction that is loaded and executed by the one or more processors to perform an operation performed by the alarm analysis method of any of claims 1 to 9.

20. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor to perform operations performed by the alarm analysis method of any one of claims 1 to 9.