CN115174350B - Operation and maintenance alarm method, device, equipment and medium - Google Patents

Info

Publication number
CN115174350B
Authority
CN
China
Prior art keywords
target
alarm
resources
monitoring data
preset
Prior art date
Legal status
Active
Application number
CN202210764564.7A
Other languages
Chinese (zh)
Other versions
CN115174350A (en
Inventor
路小敏
Current Assignee
Inspur Jinan data Technology Co ltd
Original Assignee
Inspur Jinan data Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Inspur Jinan data Technology Co ltd
Priority to CN202210764564.7A
Publication of CN115174350A
Application granted
Publication of CN115174350B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Monitoring And Testing Of Exchanges (AREA)

Abstract

The application discloses an operation and maintenance alarm method, apparatus, device and medium, and relates to the field of information technology. The method comprises: collecting, through a preset monitoring collector, monitoring data corresponding to resources in a target service, and, if the monitoring data are collected, comparing the monitoring data with a preset alarm threshold; acquiring the target resources whose monitoring data exceed the preset alarm threshold, and judging, from the target monitoring data corresponding to the target resources, whether the target resources are resources of the same type; and, if the target resources are of the same type, using the target monitoring data to determine, according to the association relation between the resources in the target service, a target alarm root cause of the alarm event for the target resources, so that the corresponding alarm and fault repair are carried out according to the target alarm root cause. With this technical scheme, alarms generated by a large-scale cluster can be handled quickly and the root cause of an alarm or fault can be analyzed, which makes operation and maintenance more convenient.

Description

Operation and maintenance alarm method, device, equipment and medium
Technical Field
The present invention relates to the field of information technologies, and in particular, to an operation and maintenance alarm method, apparatus, device, and medium.
Background
With the rapid development of cloud computing, cloud platforms have matured, system scale has grown, and performance requirements have risen. As systems become larger, with more modules and more dependencies, delimiting and analyzing problems becomes increasingly difficult and complex. How to respond quickly to the alarms generated by a large-scale cluster, and in particular how to handle the flood of alarms raised when one problem triggers further problems, is an issue that urgently needs to be solved. Existing alarm processing of monitoring data sets a threshold for monitoring resource abnormality and generates an alarm when the threshold is exceeded. At present, to reduce flood alarms, multiple pieces of collected monitoring alarm information with the same attribute are merged.
However, the prior art has several drawbacks. First, a fixed alarm threshold must be set manually and alarms can only be raised against that manually set threshold, so overall accuracy and recall are low. Second, the number of alarms is huge, which reduces their usefulness; in particular, a problem in one key component or resource can trigger further problems and thereby a flood of alarms, which makes analysis and troubleshooting difficult. Third, when a problem is merely resolved without its cause being found, the cause still has to be investigated afterwards. In summary, how to respond quickly to the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault, so as to make operation and maintenance more convenient, remains to be solved.
Disclosure of Invention
Therefore, the present invention aims to provide an operation and maintenance alarm method, apparatus, device and medium, which can rapidly cope with alarms generated by a large-scale cluster and analyze the root cause of the alarms or faults so as to increase operation and maintenance convenience. The specific scheme is as follows:
In a first aspect, the application discloses an operation and maintenance alarm method, comprising:
collecting, through a preset monitoring collector, monitoring data corresponding to resources in a target service, and, if the monitoring data are collected, comparing the monitoring data with a preset alarm threshold;
acquiring the target resources whose monitoring data exceed the preset alarm threshold, and judging, according to the target monitoring data corresponding to the target resources, whether the target resources are resources of the same type;
and, if the target resources are of the same type, using the target monitoring data to determine, according to the association relation between the resources in the target service, a target alarm root cause of the alarm event for the target resources, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
Optionally, after the collecting, through the preset monitoring collector, of the monitoring data corresponding to the resources in the target service, the method further comprises:
if the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, searching for and saving, through request call-link analysis, the error information of the data call request that the preset monitoring collector sent to collect the data;
correspondingly, the determining, using the target monitoring data and according to the association relation between the resources in the target service, of the target alarm root cause of the alarm event for the target resources, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause, comprises:
using the error information to determine, according to the association relation between the resources in the target service, the target alarm root cause of the alarm event for the target resources, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
Optionally, the searching for and saving, through request call-link analysis, of the error information of the data call request that the preset monitoring collector sent to collect the data comprises:
determining the path of the data call request in the system according to the unique request identifier (ID) of the data call request that the preset monitoring collector sent to collect the data;
searching the path for a call failure event that occurred while the data call request was being processed, and saving the error information corresponding to the call failure event.
Optionally, after the acquiring of the target resources whose monitoring data exceed the preset alarm threshold and the judging, according to the target monitoring data corresponding to the target resources, of whether the target resources are of the same type, the method further comprises:
if the target resources are not of the same type, determining the target alarm importance level of the alarm event for the target resources according to the relation between the target monitoring data and a preset alarm importance;
and raising the alarm for the alarm event of the target resources according to the target alarm importance level.
Optionally, after the determining, if the target resources are of the same type, of the target alarm root cause of the alarm event for the target resources using the target monitoring data and according to the association relation between the resources in the target service, the method further comprises:
determining, based on the preset alarm importance and according to the target alarm root cause, the target alarm importance level of the alarm event for the target resources;
if the target alarm importance level is higher than a preset level threshold, raising the alarm for the alarm event of the target resources immediately;
and, if the target alarm importance level is not higher than the preset level threshold, waiting until the alarm time point of the next preset alarm period and then raising the alarm for the alarm event of the target resources.
Optionally, the determining, using the target monitoring data and according to the association relation between the resources in the target service, of the target alarm root cause of the alarm event for the target resources comprises:
using the target monitoring data to determine the corresponding alarm event, and deducing the current state of the target resources from the alarm event;
analyzing the cause of the alarm event from the current state of the target resources, and determining the target alarm root cause of the alarm event for the target resources according to the association relation between the resources in the target service.
Optionally, after the determining, if the target resources are of the same type, of the target alarm root cause of the alarm event for the target resources using the target monitoring data and according to the association relation between the resources in the target service, the method further comprises:
determining the corresponding preset fault repair script according to the target alarm root cause, and repairing the fault corresponding to the target alarm root cause through the preset fault repair script;
if the fault corresponding to the target alarm root cause is repaired successfully, performing alarm recovery and pushing a corresponding fault-repair-success event;
if the fault corresponding to the target alarm root cause is not repaired successfully, pushing the target alarm root cause to the relevant operation and maintenance personnel so that they adjust the preset fault repair script or repair the fault manually.
In a second aspect, the present application discloses an operation and maintenance alarm device, including:
a data acquisition module, configured to collect, through a preset monitoring collector, monitoring data corresponding to resources in a target service, and, if the monitoring data are collected, to compare the monitoring data with a preset alarm threshold;
a type judging module, configured to acquire the target resources whose monitoring data exceed the preset alarm threshold and to judge, according to the target monitoring data corresponding to the target resources, whether the target resources are of the same type;
and a root cause determining module, configured to, if the target resources are of the same type, use the target monitoring data to determine, according to the association relation between the resources in the target service, the target alarm root cause of the alarm event for the target resources, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
In a third aspect, the present application discloses an electronic device, comprising:
a memory for storing a computer program;
and a processor for executing the computer program to implement the steps of the operation and maintenance alarm method disclosed above.
In a fourth aspect, the present application discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the operation and maintenance alarm method disclosed above.
When performing an operation and maintenance alarm, the application first collects, through a preset monitoring collector, the monitoring data corresponding to the resources in a target service and, if the monitoring data are collected, compares them with a preset alarm threshold. It then acquires the target resources whose monitoring data exceed the preset alarm threshold and judges, from the corresponding target monitoring data, whether the target resources are of the same type. If they are, the target monitoring data are used to determine, according to the association relation between the resources in the target service, the target alarm root cause of the alarm event for the target resources, and the corresponding alarm and fault repair are carried out according to that root cause.
Therefore, when performing an operation and maintenance alarm, the application makes a preliminary judgment of the target resources causing the alarm by means of the preset alarm threshold, judges whether the alarm event is a flood alarm by judging whether the target resources are of the same type, and, if it is a flood alarm, uses the target monitoring data to determine the target alarm root cause of the alarm event for the target resources according to the association relation between the resources in the target service. Judging whether the current alarm event is a flood alarm addresses the problem that a fault in one key component or resource triggers further problems and a huge number of flood alarms, which reduces the usefulness of alarms and makes analysis and troubleshooting difficult. Furthermore, because root cause analysis is performed on the target monitoring data according to the association relation between the resources in the target service, the root cause can be analyzed promptly when an alarm problem occurs, and the subsequent alarm recovery and fault repair are carried out based on the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In conclusion, the application can respond quickly to the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault, making operation and maintenance more convenient.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an operation and maintenance alarming method provided by the application;
FIG. 2 is a schematic diagram of the root cause analysis of the target alarms provided by the application;
FIG. 3 is a flowchart of a specific operation and maintenance alarm method provided by the present application;
FIG. 4 is a flowchart of a specific operation and maintenance alarm method provided by the present application;
FIG. 5 is a schematic diagram of an alarm flow provided by the present application;
FIG. 6 is a schematic diagram of an operation and maintenance alarm device according to the present application;
FIG. 7 is a block diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Existing alarm processing of monitoring data sets a threshold for monitoring resource abnormality and generates an alarm when the threshold is exceeded. At present, to reduce flood alarms, multiple pieces of collected monitoring alarm information with the same attribute are merged. However, the prior art has several drawbacks. First, a fixed alarm threshold must be set manually and alarms can only be raised against that manually set threshold, so overall accuracy and recall are low. Second, the number of alarms is huge, which reduces their usefulness; in particular, a problem in one key component or resource can trigger further problems and thereby a flood of alarms, which makes analysis and troubleshooting difficult. Third, when a problem is merely resolved without its cause being found, the cause still has to be investigated afterwards. The application therefore provides an operation and maintenance alarm method that can respond quickly to the alarms generated by a large-scale cluster and analyze the root cause of an alarm or fault, making operation and maintenance more convenient.
An embodiment of the invention discloses an operation and maintenance alarm method which, as shown in FIG. 1, comprises the following steps:
Step S11: collecting, through a preset monitoring collector, monitoring data corresponding to resources in a target service, and, if the monitoring data are collected, comparing the monitoring data with a preset alarm threshold.
In this embodiment, the preset alarm threshold is an alarm threshold obtained in advance through a preset interface. It can be understood that the user can adjust the preset alarm threshold according to actual conditions, can set it manually based on experience, and can also view the result of automatic threshold adjustment. When the monitoring data do not exceed the preset alarm threshold, no alarm event is triggered. Conventionally, monitoring data that exceed the preset alarm threshold trigger the corresponding alarm event; in this embodiment, however, the alarm event is not triggered immediately once the threshold is exceeded, and further judgment is required.
In this embodiment, a preset monitoring collector deployed in advance in the target service collects the monitoring data of the resources in that service, and the monitoring data obtained are then compared with the preset alarm threshold. In a specific embodiment, taking as an example the monitoring of a virtual machine that runs services deployed on OpenStack and has a storage volume mounted, Telegraf is used to acquire the actual data of each index of the virtual machine; if the monitoring data are collected, they are classified as the link-success type. With this technical scheme, the monitoring data of the resources in the target service are obtained and compared with the preset alarm threshold, so that the target resources causing the alarm can be judged preliminarily through the threshold, and the target resources that caused the alarm event can then be judged further.
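The comparison in step S11 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `Sample` record, its field names, the metric names and the threshold values are all hypothetical and do not come from the patent, and the collector itself (Telegraf in the example above) is replaced by an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One monitored value for one resource (all field names are hypothetical)."""
    resource_id: str
    resource_type: str
    metric: str
    value: float


def find_target_resources(samples, thresholds):
    """Return the samples whose value exceeds the preset threshold for their metric.

    Metrics without a configured threshold are ignored. No alarm is raised here:
    the result is only the candidate set passed on to the later judgment steps.
    """
    targets = []
    for sample in samples:
        limit = thresholds.get(sample.metric)
        if limit is not None and sample.value > limit:
            targets.append(sample)
    return targets


# Hypothetical collected data and thresholds.
samples = [
    Sample("vm-1", "virtual_machine", "cpu_usage_percent", 97.0),
    Sample("vm-2", "virtual_machine", "cpu_usage_percent", 41.0),
    Sample("vol-1", "storage_volume", "io_latency_ms", 250.0),
]
thresholds = {"cpu_usage_percent": 90.0, "io_latency_ms": 100.0}
targets = find_target_resources(samples, thresholds)
print([s.resource_id for s in targets])
```

Returning the candidates instead of alarming directly mirrors the point made above that exceeding the threshold does not trigger the alarm immediately.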
Step S12: acquiring the target resources whose monitoring data exceed the preset alarm threshold, and judging, according to the target monitoring data corresponding to the target resources, whether the target resources are of the same type.
In this embodiment, the monitoring data are compared with the preset alarm threshold, the target resources whose monitoring data exceed the threshold are acquired, and the target monitoring data corresponding to those resources are used to judge whether the target resources are of the same type. Specifically, if the target resources that caused the alarm event are not of the same type, the alarm event caused by the current target resources is an ordinary alarm event and the alarm flow can continue. If the target resources that caused the alarm event are of the same type, the large number of alarm events they caused constitutes a flood alarm, and the root cause of the alarm events must be found and judged further. With this technical scheme, judging whether the target resources are of the same type determines whether the current alarm event is a flood alarm, which addresses the problem that a fault in one key component or resource triggers further problems and a huge number of flood alarms, reducing the usefulness of alarms and making analysis and troubleshooting difficult.
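A minimal Python sketch of the same-type judgment in step S12 follows. The function name, the type labels and in particular the `min_count` parameter are assumptions made for illustration; the source only states that same-type target resources indicate a flood alarm and gives no minimum count.

```python
def is_flood_alarm(target_types, min_count=2):
    """Judge whether the over-threshold resources look like a flood alarm.

    target_types: the resource type of each over-threshold resource.
    min_count: assumed minimum number of resources before a burst is treated
    as a flood; the source does not give such a number.
    """
    if len(target_types) < min_count:
        return False
    # A flood is assumed when every over-threshold resource has the same type.
    return len(set(target_types)) == 1


same = is_flood_alarm(["storage_volume", "storage_volume", "storage_volume"])
mixed = is_flood_alarm(["storage_volume", "virtual_machine"])
print(same, mixed)
```

In this sketch a `True` result would route the event to root cause analysis, while `False` would let the ordinary alarm flow continue.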
Step S13: if the target resources are of the same type, using the target monitoring data to determine, according to the association relation between the resources in the target service, the target alarm root cause of the alarm event for the target resources, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
In this embodiment, determining the target alarm root cause of the alarm event for the target resources, using the target monitoring data and according to the association relation between the resources in the target service, includes: using the target monitoring data to determine the corresponding alarm event, and deducing the current state of the target resources from the alarm event; then analyzing the cause of the alarm event from the current state of the target resources, and determining the target alarm root cause of the alarm event for the target resources according to the association relation between the resources in the target service. Furthermore, alarms or alarm items of a high importance level and alarm items with a wide influence range are stored in advance in a preset scene evaluation database, and the association relations between the resources in the target service are entered for the associated scenes. The preset scene evaluation database includes, but is not limited to: automatically identified records, such as a monitoring item or other monitoring index reaching a certain value and causing downtime; scenes with a wide influence range, such as a storage back end becoming inaccessible so that storage volumes cannot be accessed; and modifications made from manually identified feedback. Root cause analysis finds the root cause of an alarm or fault according to rules or algorithms; when the root-cause alarm disappears, the other alarms it generated disappear with it. Specifically, as shown in FIG. 2, if the target resources are of the same type, data from modules such as compute, storage and network are received to generate resource entities, and the data of these modules are monitored to obtain alarm entities. The alarm entities are mounted on the corresponding resource entities according to their correspondence to form a directed entity graph. A scene evaluator then deduces alarms and states according to the preset scene evaluation database and the formulated rules, performing causal root analysis and associating root causes so as to determine the target alarm root cause of the alarm event for the target resources.
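One way to picture root cause analysis over a directed entity graph is the Python sketch below. It is not the scene evaluator described above: it substitutes a single simple rule (an alarming resource is a root cause if none of the resources it depends on, directly or transitively, is alarming), and the graph, the resource names and the assumption that the dependency graph is acyclic are all hypothetical.

```python
def find_root_causes(depends_on, alarming):
    """Return alarming resources whose alarm is not explained by an alarming dependency.

    depends_on maps a resource to the resources it depends on (directed edges).
    alarming is the collection of resources that currently carry an alarm entity.
    The dependency graph is assumed to be acyclic.
    """
    alarming = set(alarming)
    roots = set()
    for resource in alarming:
        stack = list(depends_on.get(resource, ()))
        seen = {resource}
        explained = False
        # Walk the dependencies depth-first until an alarming one is found.
        while stack and not explained:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in alarming:
                explained = True
            else:
                stack.extend(depends_on.get(dep, ()))
        if not explained:
            roots.add(resource)
    return roots


# Hypothetical scene: virtual machines mount volumes served by one storage back end.
depends_on = {
    "vm-1": ["vol-1"],
    "vm-2": ["vol-2"],
    "vol-1": ["storage-backend"],
    "vol-2": ["storage-backend"],
}
alarming = {"vm-1", "vm-2", "vol-1", "vol-2", "storage-backend"}
roots = find_root_causes(depends_on, alarming)
print(sorted(roots))
```

The example matches the wide-influence scene mentioned in the text: once the storage back end alarm is cleared, the dependent volume and virtual machine alarms would be expected to clear as well.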
In this embodiment, after the target alarm root cause of the alarm event for the target resources has been determined using the target monitoring data and according to the association relation between the resources in the target service, the method further includes: determining, based on the preset alarm importance and according to the target alarm root cause, the target alarm importance level of the alarm event for the target resources; if the target alarm importance level is higher than a preset level threshold, raising the alarm for the alarm event of the target resources immediately; and, if the target alarm importance level is not higher than the preset level threshold, waiting until the alarm time point of the next preset alarm period and then raising the alarm. Specifically, the monitoring item data are compared with the alarm threshold, the data information in the monitoring item data is managed in combination with labels, the corresponding alarm task is triggered to generate an alarm, and the alarm information is sent to the corresponding services, such as elastic scaling and hot expansion. For alarms that root cause analysis shows to have a wide influence range and a high importance level, the event alarm is pushed actively without waiting for the alarm task period.
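The level-based timing rule can be sketched in Python as follows. This is a minimal illustration under assumptions: levels are plain integers where a larger number means more important, times are integer seconds, and alarm periods are assumed to start at time 0; none of these details is specified in the source.

```python
def alarm_send_time(level, level_threshold, now, period):
    """Return the time (in seconds) at which the alarm should be sent.

    Levels above the threshold are pushed immediately. All other alarms wait
    for the next boundary of the periodic alarm task that lies strictly after
    `now`. Periods are assumed to start at time 0.
    """
    if level > level_threshold:
        return now
    return (now // period + 1) * period


# Hypothetical values: threshold level 2, a 60-second alarm task period.
urgent = alarm_send_time(level=3, level_threshold=2, now=125, period=60)
routine = alarm_send_time(level=2, level_threshold=2, now=125, period=60)
print(urgent, routine)
```

Pushing high-level alarms at `now` corresponds to the active push described above, which avoids the delay of the periodic alarm task for important alarms.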
In this embodiment, after the target alarm root cause of the alarm event for the target resources has been determined using the target monitoring data and according to the association relation between the resources in the target service, the method further includes: determining the corresponding preset fault repair script according to the target alarm root cause, and repairing the fault corresponding to the target alarm root cause through that script; if the fault is repaired successfully, performing alarm recovery and pushing a corresponding fault-repair-success event; if the fault is not repaired successfully, pushing the target alarm root cause to the relevant operation and maintenance personnel so that they adjust the preset fault repair script or repair the fault manually. Specifically, various fault repair scripts are injected in advance, and when a corresponding fault occurs the relevant script is called to repair it. After a successful repair, a repair-success event or alarm recovery is pushed and the result is sent to the operation and maintenance personnel. With SMS, e-mail and similar channels configured, a message is sent to the responsible persons when an alarm or event occurs or when a fault repair fails, so that they can adjust the script or repair the fault manually.
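The repair flow can be sketched in Python as a lookup from root cause to a pre-registered repair action. This is an illustration only: the repair scripts are stood in for by plain callables, the root cause names and message texts are hypothetical, and `notify` represents whatever SMS or e-mail channel is configured.

```python
def repair(root_cause, repair_scripts, notify):
    """Run the preset repair script for a root cause and report the outcome.

    repair_scripts maps a root-cause name to a callable returning True on success.
    notify is a callable that receives one message string.
    Returns True only if a script existed and reported success.
    """
    script = repair_scripts.get(root_cause)
    succeeded = False
    if script is not None:
        try:
            succeeded = bool(script())
        except Exception:
            # A crashing script is treated the same as a failed repair.
            succeeded = False
    if succeeded:
        notify(f"repair succeeded, alarm recovered: {root_cause}")
    else:
        notify(f"repair failed, manual handling needed: {root_cause}")
    return succeeded


# Hypothetical registry of pre-injected repair actions.
messages = []
scripts = {
    "storage_backend_unreachable": lambda: True,
    "disk_full": lambda: False,
}
ok = repair("storage_backend_unreachable", scripts, messages.append)
failed = repair("disk_full", scripts, messages.append)
missing = repair("unknown_cause", scripts, messages.append)
print(messages)
```

A missing script is reported like a failed repair in this sketch, so the responsible person is notified in both cases and can add or adjust a script.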
With this technical scheme, root cause analysis is performed on the target monitoring data according to the association relation between the resources in the target service, so the root cause can be analyzed promptly when an alarm problem occurs, and the subsequent alarm recovery and fault repair are carried out based on the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In addition, alarm items with a high alarm level, important alarm items and alarm items with a wide influence range are labeled, so that an important alarm with a wide influence range can be pushed in real time when it is generated, which reduces the delay of important alarms.
It can be seen that, in this embodiment, when performing an operation and maintenance alarm, monitoring data corresponding to resources in a target service are first collected through a preset monitoring collector and compared with a preset alarm threshold; target resources whose monitoring data exceed the preset alarm threshold are then obtained, and it is determined whether the target resources are the same type of resource; if so, the target alarm root cause of the current alarm event is analyzed, and the corresponding alarm and repair are carried out based on that root cause. Thus, when performing an operation and maintenance alarm, the application first identifies the target resources causing the alarm by means of the preset alarm threshold, then determines whether the alarm event is a flood alarm by checking whether the target resources are the same type of resource, and, if it is a flood alarm, uses the target monitoring data to determine the target alarm root cause of the alarm event for the target resources according to the association relation between the resources in the target service. Identifying flood alarms in this way addresses the problem that a fault in one key component or resource triggers further problems and a huge number of alarms, which reduces the effectiveness of alarming and makes analysis and troubleshooting difficult. Furthermore, because root cause analysis is performed on the target monitoring data according to the association relation between the resources in the target service, the root cause of an alarm can be identified promptly when the alarm occurs, and subsequent alarm recovery and fault repair are carried out based on the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In conclusion, the application can respond quickly to alarms generated by a large-scale cluster and analyze the root cause of the alarms or faults, thereby making operation and maintenance more convenient.
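The threshold comparison and same-type check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Sample` fields, the metric names, and the threshold values are hypothetical, and the check follows the text literally, so a single alarming resource trivially counts as "same type".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    resource_id: str    # hypothetical identifier of a monitored resource
    resource_type: str  # hypothetical type label, e.g. "storage_volume"
    metric: str
    value: float


def over_threshold(samples, thresholds):
    """Return the samples whose value exceeds the preset alarm threshold of their metric."""
    return [
        s for s in samples
        if s.metric in thresholds and s.value > thresholds[s.metric]
    ]


def is_same_type(targets):
    """True when there is at least one target resource and all targets share one type."""
    return len({t.resource_type for t in targets}) == 1


# Hypothetical monitoring data and thresholds, for illustration only.
samples = [
    Sample("vol-1", "storage_volume", "latency_ms", 120.0),
    Sample("vol-2", "storage_volume", "latency_ms", 95.0),
    Sample("vm-1", "vm", "cpu_pct", 40.0),
]
thresholds = {"latency_ms": 80.0, "cpu_pct": 90.0}

targets = over_threshold(samples, thresholds)
flood = is_same_type(targets)  # same-type targets are treated as a flood alarm
```

A production system would probably also require a minimum number of alarming resources before calling an event a flood; the source does not state such a limit.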
Referring to fig. 3, an embodiment of the present invention discloses a specific operation and maintenance alarm method; compared with the previous embodiment, this embodiment further describes and optimizes the technical solution.
Step S21: collecting monitoring data corresponding to the resources in the target service through a preset monitoring collector.
Step S22: if the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, searching for and storing, through request call link analysis, the error information of the data call request that the preset monitoring collector sent to collect the data.
In this embodiment, searching for and storing, through request call link analysis, the error information of the data call request sent by the preset monitoring collector to collect the data includes: determining the path of the data call request in the system according to the unique request identity number of that data call request; and searching the path for a call failure event that occurred while the data call request was being called, and storing the error information corresponding to the call failure event. Specifically, the call link is recorded whenever a request is invoked. Each call request generates a globally unique ID (i.e., Identity Document, an identification number) that identifies the request; the ID does not change during the call, and because it is passed on through each layer of the call, it strings together the path of the user request through the system. If the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, the case is classified as a link failure; when a call fails at some point in the call process, the error information is stored and root cause analysis is performed on it.
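The lookup by request ID can be illustrated with a short sketch. It assumes that call records are already available as a flat list; the `CallRecord` fields, in particular the `seq` hop counter and the component names, are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CallRecord:
    request_id: str              # globally unique ID, unchanged along the call chain
    seq: int                     # hypothetical hop order within the chain
    component: str               # hypothetical name of the called component
    ok: bool
    error: Optional[str] = None  # error information, present when the call failed


def call_path(records, request_id):
    """Reconstruct the path of one request by filtering on its ID and ordering by hop."""
    return sorted(
        (r for r in records if r.request_id == request_id),
        key=lambda r: r.seq,
    )


def find_call_failures(records, request_id):
    """Return the failed calls on the path of the given request, in call order."""
    return [r for r in call_path(records, request_id) if not r.ok]


# Hypothetical records for two interleaved requests.
records = [
    CallRecord("req-2", 1, "collector", True),
    CallRecord("req-1", 2, "storage-gateway", False, "connection refused"),
    CallRecord("req-1", 1, "collector", True),
    CallRecord("req-1", 3, "storage-volume", False, "timeout"),
]

failures = find_call_failures(records, "req-1")
saved_errors = [(f.component, f.error) for f in failures]  # error info kept for root cause analysis
```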
In one embodiment, if the storage network is unavailable, it is inferred that the storage volumes on that network are unavailable, and the unavailable storage network is taken as the first-ranked root cause; if the path to a storage volume is unreachable, the unreachable storage volume path is taken as the second-ranked root cause; if the paths to both the storage and the storage volume are unreachable, the unreachable storage path is taken as the root cause; and the root causes are associated with one another according to their order, hierarchical relation and the like. By analyzing the call link of the failed data request, this technical scheme removes the difficulty of troubleshooting such problems, so that root cause analysis and troubleshooting can still be carried out when the monitoring data corresponding to the resources in the target service cannot be obtained.
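One way to picture root cause deduction over a hierarchy such as network, path, and volume is a walk over a dependency map. The sketch below is an assumption about how the association relation might be represented: the `depends_on` mapping and the resource names are hypothetical, and the rule "a failing resource with no failing upstream dependency is a root cause" is an illustrative choice rather than the rule stated in the source.

```python
def root_causes(failing, depends_on):
    """Return failing resources that have no failing resource upstream of them.

    failing: set of resource names currently in a failed state.
    depends_on: mapping from a resource to the resources it directly depends on.
    """
    def has_failing_upstream(node, seen):
        for dep in depends_on.get(node, ()):
            if dep in seen:
                continue  # already visited; also guards against cycles
            seen.add(dep)
            if dep in failing or has_failing_upstream(dep, seen):
                return True
        return False

    return sorted(r for r in failing if not has_failing_upstream(r, {r}))


# Hypothetical association relation: volumes depend on a path, the path on the network.
depends_on = {
    "volume-a": ["storage-path"],
    "volume-b": ["storage-path"],
    "storage-path": ["storage-network"],
}

causes = root_causes({"volume-a", "volume-b", "storage-network"}, depends_on)
```

With these inputs both volumes are explained by the failed network, so only the network is reported.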
Step S23: using the error information to determine the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
In this embodiment, the error information of the data call request sent by the preset monitoring collector to collect the data is searched for and stored through request call link analysis, and the error information is used to determine the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service. Analyzing the call link of the failed data request removes the difficulty of troubleshooting, which makes operation and maintenance more convenient, reduces the impact of faults, and makes the system more intelligent and friendly for customers to use.
Referring to fig. 4, an embodiment of the present invention discloses a specific operation and maintenance alarm method; compared with the previous embodiment, this embodiment further describes and optimizes the technical solution.
Step S31: collecting monitoring data corresponding to the resources in the target service through a preset monitoring collector, and comparing the monitoring data with a preset alarm threshold if the monitoring data are collected.
Step S32: obtaining the target resources whose monitoring data exceed the preset alarm threshold, and determining, from the target monitoring data corresponding to the target resources, whether the target resources are the same type of resource.
Step S33: if the target resources are not the same type of resource, determining the target alarm importance level of the alarm event for the target resource according to the relation between the target monitoring data and the preset alarm importance.
Specifically, whether the target resources are the same type of resource is determined from the target monitoring data corresponding to the target resources; if the target resources causing the alarm event are not the same type of resource, the alarm event caused by the current target resource is an ordinary alarm event, and the alarm procedure simply continues.
Step S34: alarming the alarm event for the target resource according to the target alarm importance level.
Specifically, if the target alarm importance level is higher than the preset level threshold, the alarm event for the target resource is alarmed immediately; if the target alarm importance level is not higher than the preset level threshold, the alarm event for the target resource is alarmed at the alarm time point reached after waiting for the next preset alarm period. Further, fig. 5 shows the alarm flow: monitoring data corresponding to resources in a target service are collected through a preset monitoring collector; if the monitoring data are collected, they are compared with a preset alarm threshold, the target resources whose monitoring data exceed the preset alarm threshold are obtained, and whether the target resources are the same type of resource is determined from the target monitoring data corresponding to the target resources; if the target resources are the same type of resource, the target monitoring data are used to determine the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service; if the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, the error information of the data call request sent by the preset monitoring collector to collect the data is searched for and stored through request call link analysis, and the error information is used to determine the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service; if the target resources are not the same type of resource, the target alarm importance level of the alarm event for the target resource is determined according to the relation between the target monitoring data and the preset alarm importance, and the alarm event for the target resource is alarmed according to the target alarm importance level.
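The branching in this flow can be summarized as a small routing function. It is a hedged sketch only: the `Route` names, the tuple layout of a sample, and the `NO_ALARM` branch (the case where nothing exceeds a threshold, which the source does not describe) are assumptions introduced for illustration.

```python
from enum import Enum


class Route(Enum):
    LINK_ANALYSIS = "link_analysis"              # monitoring data could not be collected
    NO_ALARM = "no_alarm"                        # assumption: nothing exceeded a threshold
    ROOT_CAUSE_ANALYSIS = "root_cause_analysis"  # targets are the same type of resource
    IMPORTANCE_GRADING = "importance_grading"    # targets are of different types


def route(samples, thresholds):
    """Pick the branch of the alarm flow for one collection cycle.

    samples: None when collection failed, otherwise a list of
             (resource_type, metric, value) tuples.
    thresholds: mapping from metric name to preset alarm threshold.
    """
    if samples is None:
        return Route.LINK_ANALYSIS
    targets = [s for s in samples if s[2] > thresholds.get(s[1], float("inf"))]
    if not targets:
        return Route.NO_ALARM
    if len({s[0] for s in targets}) == 1:
        return Route.ROOT_CAUSE_ANALYSIS
    return Route.IMPORTANCE_GRADING


# Hypothetical cycle: two storage volumes exceed the latency threshold.
decision = route(
    [("storage_volume", "latency_ms", 120.0), ("storage_volume", "latency_ms", 95.0)],
    {"latency_ms": 80.0},
)
```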
Referring to fig. 6, the embodiment of the application discloses an operation and maintenance alarm device, which comprises:
a data acquisition module 11, configured to collect monitoring data corresponding to resources in a target service through a preset monitoring collector, and to compare the monitoring data with a preset alarm threshold if the monitoring data are collected;
a type judging module 12, configured to obtain the target resources whose monitoring data exceed the preset alarm threshold, and to determine, from the target monitoring data corresponding to the target resources, whether the target resources are the same type of resource;
and a root cause determining module 13, configured to use the target monitoring data, if the target resources are the same type of resource, to determine the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
It can be seen that, in this embodiment, when performing an operation and maintenance alarm, monitoring data corresponding to resources in a target service are first collected through a preset monitoring collector and compared with a preset alarm threshold; target resources whose monitoring data exceed the preset alarm threshold are then obtained, and it is determined whether the target resources are the same type of resource; if so, the target alarm root cause of the current alarm event is analyzed, and the corresponding alarm and repair are carried out based on that root cause. Thus, when performing an operation and maintenance alarm, the application first identifies the target resources causing the alarm by means of the preset alarm threshold, then determines whether the alarm event is a flood alarm by checking whether the target resources are the same type of resource, and, if it is a flood alarm, uses the target monitoring data to determine the target alarm root cause of the alarm event for the target resources according to the association relation between the resources in the target service. Identifying flood alarms in this way addresses the problem that a fault in one key component or resource triggers further problems and a huge number of alarms, which reduces the effectiveness of alarming and makes analysis and troubleshooting difficult. Furthermore, because root cause analysis is performed on the target monitoring data according to the association relation between the resources in the target service, the root cause of an alarm can be identified promptly when the alarm occurs, and subsequent alarm recovery and fault repair are carried out based on the target alarm root cause, which speeds up alarm recovery and makes operation and maintenance more convenient. In conclusion, the application can respond quickly to alarms generated by a large-scale cluster and analyze the root cause of the alarms or faults, thereby making operation and maintenance more convenient.
In some specific embodiments, the operation and maintenance alarm device further includes:
a link analysis module, configured to search for and store, through request call link analysis, the error information of the data call request sent by the preset monitoring collector to collect the data, if the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector;
Correspondingly, the root cause determining module 13 is specifically configured to use the error information to determine the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service, so as to carry out the corresponding alarm and fault repair according to the target alarm root cause.
In some embodiments, the link analysis module specifically includes:
a path determining unit, configured to determine the path of the data call request in the system according to the unique request identity number of the data call request sent by the preset monitoring collector to collect the data;
a failure event searching unit, configured to search the path for a call failure event that occurred while the data call request was being called, and to store the error information corresponding to the call failure event.
In some specific embodiments, the operation and maintenance alarm device further includes:
a first level determining module, configured to determine the target alarm importance level of the alarm event for the target resource according to the relation between the target monitoring data and the preset alarm importance, if the target resources are not the same type of resource;
and a first alarm module, configured to alarm the alarm event for the target resource according to the target alarm importance level.
In some specific embodiments, the operation and maintenance alarm device further includes:
a second level determining module, configured to determine, based on the preset alarm importance and according to the target alarm root cause, the target alarm importance level of the alarm event for the target resource;
a timely alarm module, configured to alarm the alarm event for the target resource immediately if the target alarm importance level is higher than the preset level threshold;
and a delay alarm module, configured to alarm the alarm event for the target resource at the alarm time point reached after waiting for the next preset alarm period, if the target alarm importance level is not higher than the preset level threshold.
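The split between the timely and delay alarm modules amounts to choosing a send time. The sketch below assumes numeric importance levels, times expressed in seconds, and a fixed alarm period counted from a hypothetical `period_start`; it also assumes that a deferred alarm goes out at the first periodic alarm point strictly after the current time, which is one reading of the source.

```python
def alarm_time(now, level, level_threshold, period_start, period):
    """Return the time at which an alarm event should be sent.

    Levels above the threshold are sent immediately; all others wait for the
    next periodic alarm point strictly after `now`.
    """
    if level > level_threshold:
        return now
    periods_elapsed = (now - period_start) // period
    return period_start + (periods_elapsed + 1) * period


# Hypothetical values: 60-second alarm period starting at t=0, threshold level 2.
urgent_at = alarm_time(now=125, level=3, level_threshold=2, period_start=0, period=60)
deferred_at = alarm_time(now=125, level=2, level_threshold=2, period_start=0, period=60)
```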
In some specific embodiments, the root cause determining module 13 specifically includes:
a state determining unit, configured to use the target monitoring data to determine the alarm event corresponding to the target monitoring data, and to infer the current state of the target resource from the alarm event;
and an alarm root cause determining unit, configured to analyze the cause of the alarm event according to the current state of the target resource, and to determine the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service.
In some specific embodiments, the operation and maintenance alarm device further includes:
a fault repair unit, configured to determine the corresponding preset fault repair script according to the target alarm root cause, and to repair the fault corresponding to the target alarm root cause through the preset fault repair script;
an alarm recovery module, configured to perform alarm recovery and push the corresponding fault repair success event if the fault corresponding to the target alarm root cause is repaired successfully;
and a root cause pushing module, configured to push the target alarm root cause to the relevant operation and maintenance personnel, so that they can adjust the preset fault repair script or carry out the corresponding fault repair manually, if the fault corresponding to the target alarm root cause is not repaired successfully.
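The interplay of these three modules can be sketched as a lookup followed by a success or failure notification. Everything named here is hypothetical: the `scripts` registry maps a root cause to a callable standing in for a preset repair script, and `push` stands in for whatever notification channel the system uses.

```python
def repair_and_notify(root_cause, scripts, push):
    """Run the preset repair script for a root cause and push the outcome.

    scripts: mapping from root cause name to a zero-argument callable that
             returns True when the repair succeeded (hypothetical registry).
    push: callable taking (event_kind, root_cause); hypothetical notifier.
    """
    script = scripts.get(root_cause)
    repaired = False
    if script is not None:
        try:
            repaired = bool(script())
        except Exception:
            repaired = False  # a crashing script counts as an unsuccessful repair
    if repaired:
        push("repair_succeeded", root_cause)      # alarm recovery path
    else:
        push("manual_repair_needed", root_cause)  # hand over to operators
    return repaired


# Hypothetical usage with an in-memory event list as the notifier.
events = []
scripts = {"storage-network-down": lambda: True}
ok = repair_and_notify("storage-network-down", scripts, lambda kind, cause: events.append((kind, cause)))
```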
Fig. 7 shows an electronic device 20 according to an embodiment of the present application. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is configured to store a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the operation and maintenance alarm method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in this embodiment may specifically be a computer.
In this embodiment, the power supply 23 is used to supply operating voltage to each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is used for acquiring input data from the outside or outputting data to the outside, and its specific interface type may be selected according to the specific application requirements, which is not limited herein.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, and the like, and the storage may be temporary storage or permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, etc. In addition to the computer program that performs the operation and maintenance alarm method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs that perform other specific tasks.
Further, the application also discloses a computer readable storage medium for storing a computer program; wherein the computer program when executed by the processor implements the previously disclosed operation and maintenance alerting method. For specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, and no further description is given here.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The operation and maintenance alarm method, device, equipment and medium provided by the invention have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. Meanwhile, those skilled in the art may, in accordance with the idea of the invention, make changes to the specific embodiments and the scope of application. In view of the above, the content of this description should not be construed as limiting the invention.

Claims (9)

1. An operation and maintenance alarm method, comprising:
Collecting monitoring data corresponding to resources in a target service through a preset monitoring collector, and comparing the monitoring data with a preset alarm threshold value if the monitoring data are collected;
Acquiring target resources of which the monitoring data exceeds the preset alarm threshold value, and judging whether the target resources are the same type of resources or not according to target monitoring data corresponding to the target resources;
If the target resource is the same type of resource, determining a target alarm root cause of an alarm event aiming at the target resource according to the association relation between the resources in the target service by utilizing the target monitoring data so as to carry out corresponding alarm and fault restoration according to the target alarm root cause;
correspondingly, after the monitoring data corresponding to the resources in the target service are collected through the preset monitoring collector, the method further comprises the following steps:
If the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, searching and storing error information of a data calling request corresponding to the collected data sent by the preset monitoring collector through requesting to call link analysis;
Correspondingly, the determining, by using the target monitoring data, a target alarm root cause of an alarm event for the target resource according to an association relationship between resources in the target service, so as to perform corresponding alarm and fault repair according to the target alarm root cause, includes:
And determining a target alarm root cause of an alarm event aiming at the target resource according to the association relation between the resources in the target service by utilizing the error information so as to carry out corresponding alarm and fault restoration according to the target alarm root cause.
2. The operation and maintenance alarm method according to claim 1, wherein the searching and saving the error information of the data call request corresponding to the collected data sent by the preset monitoring collector through the request call link analysis includes:
determining a corresponding path of the data calling request in a system according to a unique request identity number corresponding to a data calling request corresponding to the collected data sent by the preset monitoring collector;
Searching for a call failure event of the data call request in the call process in the path, and storing error information corresponding to the call failure event.
3. The operation and maintenance alarm method according to claim 1, wherein after the obtaining the target resource whose monitoring data exceeds the preset alarm threshold, and determining whether the target resource is the same type of resource according to the target monitoring data corresponding to the target resource, further comprises:
If the target resource is not the same type of resource, judging a target alarm importance level corresponding to the alarm event aiming at the target resource according to the relation between the target monitoring data and the preset alarm importance;
And alarming the alarm event for the target resource according to the target alarm importance level.
4. The operation and maintenance alarm method according to claim 3, wherein if the target resource is the same type of resource, after determining a target alarm root cause of an alarm event for the target resource according to an association relationship between resources in the target service by using the target monitoring data, further comprising:
Judging the target alarm importance level corresponding to the alarm event aiming at the target resource according to the target alarm root cause based on the preset alarm importance;
If the target alarm importance level is higher than a preset level threshold, alarming the alarm event aiming at the target resource;
And if the target alarm importance level is not higher than the preset level threshold, alarming the alarm event for the target resource at the alarm time point reached after waiting for the next preset alarm period.
5. The operation and maintenance alarm method according to claim 1, wherein the determining, by using the target monitoring data, a target alarm root cause of an alarm event for the target resource according to an association relationship between resources in the target service includes:
Determining the alarm event corresponding to the target monitoring data by utilizing the target monitoring data, and deducing the current state of the target resource according to the alarm event;
analyzing the cause of the alarm event according to the current state of the target resource, and determining the target alarm root cause of the alarm event for the target resource according to the association relation between the resources in the target service.
6. The operation and maintenance alarm method according to any one of claims 1 to 5, wherein if the target resource is the same type of resource, after determining a target alarm root cause of an alarm event for the target resource according to an association relationship between resources in the target service by using the target monitoring data, the operation and maintenance alarm method further comprises:
Determining a corresponding preset fault restoration script according to the target alarm root cause, and restoring a fault corresponding to the target alarm root cause through the preset fault restoration script;
if the fault corresponding to the target alarm root cause is repaired successfully, carrying out alarm recovery and pushing a corresponding fault repair success event;
If the fault corresponding to the target alarm root cause is not repaired successfully, pushing the target alarm root cause to the relevant operation and maintenance personnel, so as to adjust the preset fault repair script or carry out the corresponding fault repair manually.
7. An operation and maintenance alarm device, comprising:
the data acquisition module is used for collecting monitoring data corresponding to resources in the target service through a preset monitoring collector, and comparing the monitoring data with a preset alarm threshold value if the monitoring data are collected;
The type judging module is used for acquiring target resources of which the monitoring data exceeds the preset alarm threshold value and judging whether the target resources are the same type of resources or not according to the target monitoring data corresponding to the target resources;
The root cause determining module is used for determining a target alarm root cause of an alarm event aiming at the target resource according to the association relation between the resources in the target service by utilizing the target monitoring data if the target resource is the same type of resource, so as to carry out corresponding alarm and fault repair according to the target alarm root cause;
correspondingly, after the monitoring data corresponding to the resources in the target service are collected through the preset monitoring collector, the method further comprises the following steps:
If the monitoring data corresponding to the resources in the target service cannot be collected through the preset monitoring collector, searching and storing error information of a data calling request corresponding to the collected data sent by the preset monitoring collector through requesting to call link analysis;
Correspondingly, the determining, by using the target monitoring data, a target alarm root cause of an alarm event for the target resource according to an association relationship between resources in the target service, so as to perform corresponding alarm and fault repair according to the target alarm root cause, includes:
And determining a target alarm root cause of an alarm event aiming at the target resource according to the association relation between the resources in the target service by utilizing the error information so as to carry out corresponding alarm and fault restoration according to the target alarm root cause.
8. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the steps of the operation and maintenance alert method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program; wherein the computer program when executed by a processor implements the steps of the operation and maintenance alerting method of any one of claims 1 to 6.
CN202210764564.7A 2022-06-30 2022-06-30 Operation and maintenance alarm method, device, equipment and medium Active CN115174350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210764564.7A CN115174350B (en) 2022-06-30 2022-06-30 Operation and maintenance alarm method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210764564.7A CN115174350B (en) 2022-06-30 2022-06-30 Operation and maintenance alarm method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115174350A CN115174350A (en) 2022-10-11
CN115174350B CN115174350B (en) 2024-07-02

Family

ID=83490040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210764564.7A Active CN115174350B (en) 2022-06-30 2022-06-30 Operation and maintenance alarm method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115174350B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115471215B (en) * 2022-10-31 2023-03-28 江西省地质局地理信息工程大队 Business process processing method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327964A (en) * 2020-10-10 2022-04-12 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for processing fault reasons of service system

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043661B2 (en) * 2000-10-19 2006-05-09 Tti-Team Telecom International Ltd. Topology-based reasoning apparatus for root-cause analysis of network faults
US8738972B1 (en) * 2011-02-04 2014-05-27 Dell Software Inc. Systems and methods for real-time monitoring of virtualized environments
CN111814999B (en) * 2020-07-08 2024-01-16 上海燕汐软件信息科技有限公司 Fault work order generation method, device and equipment
CN112148772A (en) * 2020-09-24 2020-12-29 创新奇智(成都)科技有限公司 Alarm root cause identification method, device, equipment and storage medium
CN114356499A (en) * 2021-12-27 2022-04-15 山东浪潮科学研究院有限公司 Kubernetes cluster alarm root cause analysis method and device
CN114443437A (en) * 2022-01-28 2022-05-06 中国建设银行股份有限公司 Alarm root cause output method, apparatus, device, medium, and program product

Also Published As

Publication number Publication date
CN115174350A (en) 2022-10-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant