CN113537760A

CN113537760A - Intelligent recommendation method and system for fault handling plan

Info

Publication number: CN113537760A
Application number: CN202110792779.5A
Authority: CN
Inventors: 冷迪; 陈瑞; 黄建华
Original assignee: Shenzhen Power Supply Co ltd
Current assignee: Shenzhen Power Supply Co ltd
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2021-10-22

Abstract

The invention discloses an intelligent recommendation system and method for a fault handling plan. A data processing module acquires data such as historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules, performs OLAP processing on the data, and stores the data in different types of storage systems. A model module clusters and converges faults according to the fault handling plan data using machine-learning algorithms, and the resulting model is then applied to a production system to realize intelligent judgment of fault decisions. A sorting module sorts the candidate set of plans generated by the model module through an algorithm, with relevant hyperparameters set in combination with manual experience, and finally obtains the ranking of the fault handling plans, so that the corresponding intelligent processing scheme can be obtained quickly and with a low threshold when a fault or an alarm occurs.

Description

Intelligent recommendation method and system for fault handling plan

Technical Field

The invention belongs to the technical field of IT intelligent operation and maintenance, and particularly relates to an intelligent recommendation method and system for a fault handling plan.

Background

The operation and maintenance of enterprise information systems suffers from slow fault location and long fault processing times. For example, nearly a hundred invalid alarms are generated every day on average, screening the alarms and locating the problem takes more than 30 minutes on average, and the average fault processing time is about 1 hour. In addition, once the time consumed by work-order approval and the dispatching flow on the transportation and dispatching system is included, the overall processing time is about 90 minutes.

In the current daily operation and maintenance process, statistics on alarm information and alarm counts alone do not give an overview of the health of a service system and its supporting resources, so operation and maintenance personnel can only start troubleshooting from a specific fault, which affects the timeliness of fault handling. Moreover, relying only on indicators such as current alarm events, operation and maintenance personnel cannot use historical data to make effective predictions and handle problems in advance. At present, more than one hundred events a year develop into faults and affect business operation because there was no intervention at an earlier stage.

When a service system is interrupted by the failure of one or more components, operation and maintenance personnel diagnose the failure and then remove it by relying on a knowledge base and personal experience, and the problem is often solved only after several methods have been tried. The whole process depends on manual judgment and manual execution, and the recovery time of the service system is hard to guarantee when different schemes are tried in turn. In the area of fault early warning in particular, the optimal processing scheme cannot be obtained quickly when a fault or alarm occurs, which increases the losses caused by service faults and reduces the availability of the whole service system.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present invention is to provide an intelligent recommendation method and system for a fault handling plan, so as to improve the information operation and maintenance efficiency.

In order to solve the above technical problem, the present invention provides an intelligent recommendation system for a fault handling plan, comprising:

the data processing module is used for acquiring historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules, performing online analytical processing on these data, and storing them in different types of storage systems to form a knowledge base;

the model module is used for clustering and converging faults according to the fault handling plan data through machine-learning algorithms, finally obtaining a neural network model for large-scale scenarios, and then applying the model to a production system to realize intelligent judgment of fault decisions;

and the sorting module is used for sorting the candidate set of plans generated by the model module through an algorithm, setting relevant hyperparameters in combination with manual experience, and finally obtaining the ranking of the fault handling plans.

Further, the intelligent recommendation system for fault handling plans further comprises:

the alarm event acquisition module is used for acquiring an alarm or a fault event;

the alarm classification module is used for classifying the alarm or fault event acquired by the alarm event acquisition module;

the plan acquisition module is used for acquiring a plan corresponding to the alarm or fault event;

and the association module is used for associating the alarm or fault event with a plan.

Further, the intelligent recommendation system for fault handling plans further comprises a storage module used for storing historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules.

Further, the data processing module comprises an acquisition unit, a data processing unit and an allocation unit. The acquisition unit is used for directly acquiring the historical faults, the historical alarms and their corresponding processing schemes, the alarm convergence rules, and the system predetermined convergence and defense rules from the storage module; the data processing unit is used for performing online analytical processing on these data; and the allocation unit is used for storing these data in different types of storage systems to form a knowledge base.

Further, the sorting module is further configured to recommend an optimal fault handling scheme, including recommending the handling scheme with the highest priority, or recommending handling schemes in order from the highest priority to the lowest priority.

Furthermore, the alarm event acquisition module is specifically used for viewing and acquiring alarm events of the relevant indexes in the operating system portrait by selecting a specified time range.

The invention also provides an intelligent recommendation method for the fault handling plan, which comprises the following steps:

acquiring data such as historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules, performing online analytical processing on these data, and storing them in different types of storage systems to form a knowledge base;

clustering and converging the faults according to the fault handling plan data through machine-learning algorithms, finally obtaining a neural network model for large-scale scenarios, and then applying the model to a production system to realize intelligent judgment of fault decisions;

and sorting the candidate set of plans generated by the model module through an algorithm, setting relevant hyperparameters in combination with manual experience, and finally obtaining the ranking of the fault handling plans.

Further, the intelligent recommendation method for the fault handling plan further comprises the following steps:

monitoring and acquiring alarms or fault events in real time;

classifying the alarm or fault event;

acquiring a plan corresponding to an alarm or fault event;

associating the alarm or fault event with a predetermined plan;

and storing historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules.

Further, the intelligent recommendation method for the fault handling plan further comprises: setting a configuration function for the plans of a fault or alarm event, wherein the configuration function comprises setting each fault or alarm event to match a plurality of plans, or limiting the operation content or the operation method of the plans.

The implementation of the invention has the following beneficial effects: the invention addresses fault early-warning plans based on intelligent operation and maintenance technology, and can push an optimal fault handling scheme;

the invention uses OLAP technology to process complex data such as historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules; the processing is fast, which minimizes the losses caused by service faults and improves the robustness of the whole service system;

the invention sorts the generated candidate set of plans, sets the relevant hyperparameters in combination with manual experience, and finally obtains the ranking of the fault handling plans, so that an optimal fault handling scheme can be pushed and problems are solved effectively, accurately and with a low threshold;

the invention can greatly improve the information operation and maintenance efficiency and reduce the average downtime of common faults from 2 hours to about 20 minutes.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of an intelligent recommendation system for a fault handling plan according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of an alarm event acquiring module acquiring an alarm or fault event in the embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a data processing module according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of a plan setup configuration according to an embodiment of the present invention.

Fig. 5 is a schematic flow chart of a method for intelligently recommending a fault handling plan according to a second embodiment of the present invention.

Detailed Description

The following description of the embodiments refers to the accompanying drawings, which are included to illustrate specific embodiments in which the invention may be practiced.

Referring to fig. 1, an embodiment of the present invention provides an intelligent recommendation system for a fault handling plan, including:

the alarm classification module is used for classifying the alarm or fault event acquired by the alarm event acquisition module;

the association module is used for associating the alarm or fault event with a predetermined plan;

the storage module is used for storing historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules;

the data processing module is used for acquiring historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules, performing online analytical processing (OLAP) on these data, and storing them in different types of storage systems to form a knowledge base;

It can be understood that the alarm event acquisition module, the alarm classification module, the plan acquisition module, the association module and the storage module belong to the preparatory stage, while the data processing module, the model module and the sorting module start executing as soon as a fault or alarm event occurs.

Specifically, as shown in Fig. 2, alarm events of the relevant indexes can be viewed and acquired in the operating system portrait by selecting a specified time range.
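The time-range selection described above can be sketched as a simple filter. This is only an illustration under stated assumptions, not the implementation of the embodiment: the `AlarmEvent` record, its field names and the `select_events` helper are hypothetical, and the sample data are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AlarmEvent:
    """Hypothetical alarm record; the field names are assumptions."""
    metric: str            # monitored index, e.g. "cpu_usage"
    occurred_at: datetime  # time at which the alarm was raised
    message: str


def select_events(events, metric, start, end):
    """Return events for `metric` with start <= occurred_at <= end, oldest first."""
    hits = [e for e in events
            if e.metric == metric and start <= e.occurred_at <= end]
    return sorted(hits, key=lambda e: e.occurred_at)


# Made-up sample data for illustration only.
events = [
    AlarmEvent("cpu_usage", datetime(2021, 7, 14, 9, 30), "CPU above 90%"),
    AlarmEvent("disk_free", datetime(2021, 7, 14, 10, 0), "disk below 10%"),
    AlarmEvent("cpu_usage", datetime(2021, 7, 14, 8, 15), "CPU above 95%"),
    AlarmEvent("cpu_usage", datetime(2021, 7, 13, 23, 0), "CPU above 92%"),
]
window = select_events(events, "cpu_usage",
                       datetime(2021, 7, 14, 0, 0),
                       datetime(2021, 7, 14, 23, 59))
```

Here `window` holds the two CPU alarms raised on 14 July, ordered by time; the disk alarm and the CPU alarm from the previous day are excluded.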

As shown in Fig. 3, in this embodiment the data processing module includes an obtaining unit, a data processing unit and an allocating unit. The obtaining unit is configured to directly obtain the historical faults, the historical alarms and their corresponding processing schemes, the alarm convergence rules, and the system predetermined convergence and defense rules from the storage module. The data processing unit is configured to perform OLAP processing on these data. The allocating unit is configured to store these data in different types of storage systems to form a knowledge base that is convenient for downstream algorithms and models to use. For example, faults or alarms of the CPU, the disk and the memory are stored in different types of storage systems respectively. Further, the faults or alarms of each system are also stored separately. For example, the CPU, disk and memory faults or alarms of a Linux system are stored in a Linux storage system, which establishes separate databases to store the CPU, disk and memory faults or alarms together with the corresponding data such as processing schemes, alarm convergence rules, and system predetermined convergence and defense rules. The same is true for a Windows system.

In addition, when the obtaining unit obtains data such as historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules, it obtains the data from the database of the corresponding system, for example the Linux system or the Windows system.
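The partitioned storage and retrieval described above can be sketched as follows. This is a minimal in-memory illustration under assumptions: the `KnowledgeBase` class, its `allocate` and `fetch` methods and the record keys (`os`, `component`, `fault`, `plan`) are hypothetical names, and a real deployment would use separate databases rather than nested dictionaries.

```python
from collections import defaultdict


class KnowledgeBase:
    """In-memory stand-in for the 'different types of storage systems'.

    Assumption: one store per operating system and, inside it, one
    collection per component (CPU, disk, memory).
    """

    def __init__(self):
        self._stores = defaultdict(lambda: defaultdict(list))

    def allocate(self, record):
        """Store a record under its (os, component) partition."""
        self._stores[record["os"]][record["component"]].append(record)

    def fetch(self, os_name, component):
        """Read back the records of one partition (empty list if none)."""
        return list(self._stores.get(os_name, {}).get(component, []))


# Made-up records for illustration only.
kb = KnowledgeBase()
kb.allocate({"os": "linux", "component": "disk",
             "fault": "disk full", "plan": "clean logs"})
kb.allocate({"os": "windows", "component": "cpu",
             "fault": "CPU saturated", "plan": "restart service"})
kb.allocate({"os": "linux", "component": "disk",
             "fault": "inode exhaustion", "plan": "expand volume"})
linux_disk = kb.fetch("linux", "disk")
```

Keeping the partition key explicit means a downstream model only reads the history of the system and component it is asked about, which is the point of storing Linux and Windows data separately.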

The model module applies the model to the production system to realize intelligent judgment of fault decisions. For example, the LSTM algorithm is used, which mainly addresses the problem that in an ordinary recurrent neural network the weights on earlier inputs have little effect (the vanishing-gradient problem). Different kinds of abnormal data are taken as input, the LSTM is used to train the neural network convergence algorithm, and a neural network model for large-scale scenarios is finally obtained; the model is then applied to the production system to realize intelligent judgment of fault decisions and finally obtain an accurate fault decision. It should be understood that there may be more than one decision.
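To make the LSTM mechanism mentioned above concrete, the forward pass of a single scalar LSTM cell can be sketched in plain Python. This is a teaching sketch under assumptions, not the model of the embodiment: the function names, the parameter layout and the parameter values are hypothetical, the parameters are untrained, and training itself is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One forward step of a scalar LSTM cell.

    `params` maps each gate name ("i", "f", "o", "g") to a tuple
    (w_x, w_h, b): input weight, recurrent weight and bias.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate
    f = sigmoid(pre("f"))      # forget gate
    o = sigmoid(pre("o"))      # output gate
    g = math.tanh(pre("g"))    # candidate cell value
    c = f * c_prev + i * g     # additive update lets old information persist
    h = o * math.tanh(c)       # hidden state passed to the next step
    return h, c


# Illustrative, untrained parameters (assumed values, not from the patent).
params = {
    "i": (0.5, 0.1, 0.0),
    "f": (0.0, 0.0, 2.0),
    "o": (0.3, 0.1, 0.0),
    "g": (1.0, 0.2, 0.0),
}
h, c = 0.0, 0.0
for value in [0.2, 0.9, 0.1]:  # a short window of a normalized metric
    h, c = lstm_step(value, h, c, params)
```

The cell state `c` is updated additively through the forget gate rather than being repeatedly multiplied by a weight matrix, which is why an LSTM retains the influence of earlier inputs better than an ordinary recurrent network.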

The sorting module obtains the ranking of the fault handling plans, so that the corresponding intelligent processing scheme can be obtained quickly and with a low threshold when a fault or an alarm occurs. For example, the candidate set of plans generated by the model layer is sorted by a linear algorithm (such as logistic regression) or a nonlinear algorithm (such as GBDT), the relevant hyperparameters are set manually in combination with manual experience, and the ranking of the fault handling plans is finally obtained.

The sorting module is further configured to intelligently recommend an optimal fault handling scheme, specifically by recommending the handling scheme with the highest priority, or by recommending handling schemes in order from the highest priority to the lowest priority.
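The sorting and recommendation described above can be sketched with a logistic-regression style score. This is a hedged illustration only: the feature names, the candidate plans and the weight values are made up, and `logistic_score` and `rank_plans` are hypothetical helpers. In a trained logistic regression the weights would be learned from data; here they are fixed by hand purely to stand in for the manually set values that the embodiment mentions.

```python
import math


def logistic_score(features, weights, bias=0.0):
    """Logistic-regression style score in (0, 1) for one candidate plan."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_plans(candidates, weights, bias=0.0):
    """Return plan names sorted from the highest score to the lowest."""
    scored = [(logistic_score(feats, weights, bias), name)
              for name, feats in candidates.items()]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical features per candidate plan; all values are made up.
candidates = {
    "clean logs":      {"model_confidence": 0.9, "past_success_rate": 0.8},
    "expand volume":   {"model_confidence": 0.7, "past_success_rate": 0.95},
    "restart service": {"model_confidence": 0.2, "past_success_rate": 0.5},
}
# Hand-set weights standing in for the manually tuned settings.
weights = {"model_confidence": 2.0, "past_success_rate": 1.0}

ordered = rank_plans(candidates, weights)  # full list, highest priority first
best = ordered[0]                          # single highest-priority plan
```

`best` corresponds to recommending only the highest-priority scheme, and `ordered` to recommending schemes from the highest priority to the lowest.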

In this embodiment, plans can be added, deleted, queried and modified; a plan supports rich-text editing and attachment uploading; and a plan can be processed in two modes, manual and automatic, where the automatic mode is realized by interfacing with a standard operation and maintenance flow.

Further, a configuration function may be set for the plans of a fault or alarm event, covering the number of matched plans and the content requirements of the plans, as shown in Fig. 4. Specifically, if the fault is diagnosed as insufficient disk space:

1. each fault or alarm event may be set to match 3-5 plans;

2. the operation content or the operation method of the plans can be limited, so that only plans meeting the user-defined content are matched, for example plans whose content is capacity expansion, plans whose content is log cleaning, or plans whose monitoring score reaches 50 points.
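The two configuration options above can be sketched as a filter applied to an already ranked plan list. This is an illustration under assumptions: `match_plans`, the plan dictionary keys and the sample plans are hypothetical, and treating the content rules as alternatives (a plan passes if it satisfies any one of them) is one possible reading of the text.

```python
def match_plans(ranked_plans, max_plans=5, required_keywords=(), min_score=None):
    """Apply the plan configuration to an already ranked list of plans.

    ranked_plans: list of dicts with "title", "content" and "score" keys
    (a hypothetical shape). A plan is kept if it satisfies any configured
    content rule: it mentions a required keyword, or reaches `min_score`.
    """
    def allowed(plan):
        if not required_keywords and min_score is None:
            return True  # no content restriction configured
        by_keyword = any(k in plan["content"] for k in required_keywords)
        by_score = min_score is not None and plan["score"] >= min_score
        return by_keyword or by_score

    # Keep the ranking order and cap the number of matched plans.
    return [p for p in ranked_plans if allowed(p)][:max_plans]


# Made-up plans for a fault diagnosed as insufficient disk space.
plans = [
    {"title": "Expand data volume", "content": "capacity expansion of /data", "score": 80},
    {"title": "Clean old logs", "content": "log cleaning under /var/log", "score": 40},
    {"title": "Reboot host", "content": "restart the server", "score": 30},
    {"title": "Archive files", "content": "move cold files to archive", "score": 55},
]
matched = match_plans(plans, max_plans=3,
                      required_keywords=("capacity expansion", "log cleaning"),
                      min_score=50)
```

With these settings the reboot plan is dropped because it neither mentions a required keyword nor reaches 50 points, and the remaining three plans are returned in their original ranked order.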

Referring to Fig. 5, a second embodiment of the present invention further provides an intelligent recommendation method for a fault handling plan, which corresponds to the intelligent recommendation system for a fault handling plan in the first embodiment of the present invention and includes:

Step S101, monitoring and acquiring alarms or fault events in real time. It should be understood that the alarm or fault event may be acquired directly from the system or pushed by a third party.

Of course, in step S101 the alarm or fault event can also be viewed and acquired by selecting a specified time range. Alternatively, different types of alarms or fault events may be acquired as required.

Step S102, classifying the alarm or fault event. Specifically, different types of alarms or fault events may be separated, or alarms or fault events may be classified according to different systems, such as a Linux system or a Windows system.

Step S103, acquiring a plan corresponding to the alarm or fault event, wherein each alarm or fault event corresponds to a plurality of plans.

Step S104, the alarm or fault event is associated with the plan.
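Steps S102 to S104 above can be sketched as a grouping step followed by a lookup. This is only a sketch under assumptions: the event dictionary keys, the `plan_catalog` mapping and the `classify` and `associate` helpers are hypothetical names, and the sample data are made up.

```python
from collections import defaultdict


def classify(events):
    """Group events by (operating system, event type), as in step S102."""
    groups = defaultdict(list)
    for event in events:
        groups[(event["os"], event["type"])].append(event)
    return dict(groups)


def associate(events, plan_catalog):
    """Link each event id to the plans registered for its type (steps S103-S104).

    One event may be linked to several plans; unknown types get an empty list.
    """
    return {e["id"]: list(plan_catalog.get(e["type"], [])) for e in events}


# Made-up events and plan catalog for illustration only.
events = [
    {"id": "a1", "os": "linux", "type": "disk_full"},
    {"id": "a2", "os": "windows", "type": "cpu_high"},
    {"id": "a3", "os": "linux", "type": "disk_full"},
]
plan_catalog = {
    "disk_full": ["clean logs", "expand volume"],
    "cpu_high": ["restart service"],
}
groups = classify(events)
links = associate(events, plan_catalog)
```

Each disk event ends up linked to two plans, which matches the statement that one alarm or fault event corresponds to a plurality of plans.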

Step S105, storing data such as historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules. It should be understood that this step is performed at intervals rather than in real time, whereas steps S101-S104 above are performed in real time.

Step S106, acquiring data such as historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules, performing OLAP processing on these data, and storing them in different types of storage systems to form a knowledge base that is convenient for downstream algorithms and models to use.

It should be understood that step S106 is performed when a fault or alarm event occurs, while steps S101-S105 belong to the preparatory stage.

For example, faults or alarms of the CPU, the disk and the memory are stored in different types of storage systems respectively. In more detail, the faults or alarms of each system are also stored separately. For example, the CPU, disk and memory faults or alarms of a Linux system are stored in a Linux storage system, which establishes separate databases to store the CPU, disk and memory faults or alarms together with the corresponding data such as processing schemes, alarm convergence rules, and system predetermined convergence and defense rules. The same applies to a Windows system.

In addition, when data such as historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules are acquired, the data are acquired from the database of the corresponding system, for example the Linux system or the Windows system.

Step S107, clustering and converging the faults according to the fault handling plan data through machine-learning algorithms, finally obtaining a neural network model for large-scale scenarios, and then applying the model to a production system to realize intelligent judgment of fault decisions.
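The text does not specify which clustering algorithm is used for convergence. As a stand-in, the idea of converging similar alarms into groups can be illustrated with a simple greedy clustering on token overlap (Jaccard similarity). This is a swapped-in technique chosen for brevity, not the method of the embodiment; `jaccard`, `converge`, the threshold value and the sample messages are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def converge(messages, threshold=0.5):
    """Greedy single-pass clustering of alarm messages.

    Each message joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative_tokens, member_messages)
    for msg in messages:
        tokens = set(msg.lower().split())
        for rep_tokens, members in clusters:
            if jaccard(tokens, rep_tokens) >= threshold:
                members.append(msg)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append((tokens, [msg]))
    return [members for _, members in clusters]


# Made-up alarm messages for illustration only.
alarms = [
    "disk usage high on host-1",
    "disk usage high on host-2",
    "cpu load high on host-1",
    "disk usage high on host-3",
]
converged = converge(alarms)
```

The three disk alarms collapse into one group and the CPU alarm forms its own, so an operator or a downstream model sees two converged faults instead of four raw alarms.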

For example, the LSTM algorithm is used, which mainly addresses the problem that in an ordinary recurrent neural network the weights on earlier inputs have little effect (the vanishing-gradient problem). Different kinds of abnormal data are taken as input, the LSTM is used to train the neural network convergence algorithm, and a neural network model for large-scale scenarios is finally obtained; the model is then applied to the production system to realize intelligent judgment of fault decisions and finally obtain an accurate fault decision. It should be understood that there may be more than one decision.

Step S108, sorting the candidate set of plans generated by the model module through an algorithm, setting relevant hyperparameters in combination with manual experience, and finally obtaining the ranking of the fault handling plans, so that the corresponding intelligent processing scheme can be obtained quickly and with a low threshold when a fault or an alarm occurs.

For example, the candidate set of plans generated by the model layer is sorted by a linear algorithm (such as logistic regression) or a nonlinear algorithm (such as GBDT), the relevant hyperparameters are set manually in combination with manual experience, and the ranking of the fault handling plans is finally obtained.

The optimal fault handling scheme is then selected intelligently: specifically, the handling scheme with the highest priority can be recommended, or handling schemes can be recommended in order from the highest priority to the lowest priority.

Further, the method further comprises: setting a configuration function for the plans of a fault or alarm event, covering the number of matched plans and the content requirements of the plans. Specifically, if the fault is diagnosed as insufficient disk space:

1. each fault or alarm event may be set to match 3-5 plans,

For the working principle and process of this embodiment, refer to the description of the first embodiment of the present invention; they are not described again here.

As can be seen from the above description, compared with the prior art, the beneficial effects of the present invention are: the invention addresses fault early-warning plans based on intelligent operation and maintenance technology, and can push an optimal fault handling scheme;

The above disclosure describes only the preferred embodiments of the present invention and is not intended to limit the scope of the claims of the present invention.

Claims

1. An intelligent failure handling plan recommendation system, comprising:

2. The intelligent failure handling plan recommendation system according to claim 1, further comprising:

3. The intelligent failure handling plan recommendation system according to claim 2, further comprising:

4. The intelligent failure handling plan recommendation system according to claim 3, further comprising: a storage module used for storing historical faults, historical alarms and their corresponding processing schemes, alarm convergence rules, and system predetermined convergence and defense rules.

5. The intelligent recommendation system for fault handling plans according to claim 4, wherein the data processing module comprises an obtaining unit, a data processing unit and an allocating unit, wherein the obtaining unit is used for directly obtaining the historical faults, the historical alarms and their corresponding processing schemes, the alarm convergence rules, and the system predetermined convergence and defense rules from the storage module, the data processing unit is used for performing online analysis processing on the historical faults, the historical alarms and their corresponding processing schemes, the alarm convergence rules, and the system predetermined convergence and defense rules, and the allocating unit is used for storing the historical faults, the historical alarms and their corresponding processing schemes, the alarm convergence rules, and the system predetermined convergence and defense rules in different types of storage systems to form a knowledge base.

6. The intelligent failure handling plan recommendation system according to claim 1, wherein the sorting module is further configured to recommend an optimal failure handling plan, including recommending the handling plan with the highest priority, or recommending handling plans in order from the highest priority to the lowest priority.

7. The intelligent recommendation system for failure handling plans according to claim 2, wherein the alarm event acquisition module is specifically configured to view and acquire alarm events of the relevant indexes in the operating system portrait by selecting a specified time range.

8. An intelligent recommendation method for a fault handling plan comprises the following steps:

9. The intelligent recommendation method for fault handling plans according to claim 8, characterized by further comprising:

monitoring and acquiring alarms or fault events in real time;

classifying the alarm or fault event;

acquiring a plan corresponding to an alarm or fault event;

associating the alarm or fault event with a predetermined plan;

10. The intelligent recommendation method for fault handling plans according to claim 8, characterized by further comprising: setting a configuration function for the plans of a fault or alarm event, wherein the configuration function comprises setting each fault or alarm event to match a plurality of plans, or limiting the operation content or the operation method of the plans.