CN115473783A - Prometheus-based index alarm management system and method - Google Patents

Info

Publication number
CN115473783A
Authority
CN
China
Prior art keywords
alarm
information
sending
log
prometheus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210931409.XA
Other languages
Chinese (zh)
Inventor
余杭卿
侯俊栋
陈善君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Group Co Ltd
Original Assignee
Inspur Software Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Group Co Ltd
Priority to CN202210931409.XA
Publication of CN115473783A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/08 Configuration management of networks or network elements

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Maintenance And Management Of Digital Transmission (AREA)

Abstract

The invention discloses a Prometheus-based index alarm management system and method, belongs to the technical field of operation and maintenance monitoring alarms, and aims to solve the technical problem of providing a simple, easy-to-use and highly reliable alarm mode for system operation and maintenance alarms. The system comprises: an alarm index collection module, used for acquiring from Prometheus, based on an index acquisition request, the index names of the indexes required by system alarms; an alarm rule management module, used for supporting a user in configuring and editing alarm rule information, generating an alarm file and sending the alarm file to Prometheus, and also used for supporting a user in configuring the alarm parameters required in the alarm file to obtain an alarm file configured with the parameters; and an alarm information sending and processing module, used for analyzing alarm information and sending an alarm notification based on the analyzed alarm information, and also used for analyzing the alarm file configured with the parameters and sending an alarm notification based on the analyzed alarm file configured with the parameters.

Description

Prometheus-based index alarm management system and method
Technical Field
The invention relates to the technical field of operation and maintenance monitoring alarms, in particular to a Prometheus-based index alarm management system and method.
Background
Alarming is a crucial link in the operation and maintenance of large-scale systems. Because such systems are complex, if no dedicated collection of index states and alarming is in place when a problem occurs, operation and maintenance personnel can hardly notice the problem and resolve it in time. A highly reliable, timely and easy-to-use alarm tool is therefore necessary for the operation and maintenance of every large-scale system. An alarm tool bases its alarm decisions on the various indexes of the system, and many of the index architectures currently in use on the market are based on the open source tool Prometheus.
Prometheus is an open source system monitoring and alarm tool originally built at SoundCloud. Since its inception in 2012, Prometheus has been adopted by many companies and organizations, and the project has a very active community of developers and users. It is now a standalone open source project maintained independently of any company. To emphasize this and to clarify the governance structure of the project, Prometheus joined the Cloud Native Computing Foundation in 2016 as the second hosted project after Kubernetes.
Prometheus collects and stores its metrics as time series data, i.e., metric information is stored together with the timestamp at which it was recorded and optional key-value pairs called labels. The index data configured in the system can then be queried through the fixed interfaces of Prometheus to carry out operation and maintenance work such as monitoring of and alarming on system performance. Beyond the time series processing done by Prometheus itself, a matching professional and comprehensive monitoring tool, Grafana, and an alarm tool, AlertManager, are available. However, completing index alarming requires deploying all 3 components together and performing a series of configurations, so the whole system is too bloated when only the single function of index alarming is needed, the deployment process is complex and cumbersome, and system performance is affected to a greater or lesser extent in some scenarios.
How to provide a simple and easy-to-use and highly reliable alarm mode for system operation and maintenance alarm is a technical problem to be solved.
Disclosure of Invention
The technical task of the invention is to provide a Prometheus-based index alarm management system and method aiming at the defects, so as to solve the technical problem of providing a simple, easy-to-use and high-reliability alarm mode for system operation and maintenance alarm.
In a first aspect, the present invention provides a Prometheus-based indicator alarm management system, configured to obtain alarm indicators and related data, generate and manage alarm rules, and process and send alarm information, where the system includes:
the alarm index collection module interacts with Prometheus and is used for acquiring the index name of an index required by system alarm from Prometheus based on an index acquisition request;
the alarm rule management module interacts with Prometheus and with a user through an alarm rule management interface, is used for supporting the user to configure and edit alarm rule information, and is used for generating an alarm file based on the alarm rule information and sending the alarm file to Prometheus; it is also used for supporting a user to configure alarm parameters required in the alarm file to obtain the alarm file configured with the parameters;
the alarm information sending and processing module is used for acquiring alarm information from Prometheus at regular intervals, analyzing the alarm information and sending an alarm notification based on the analyzed alarm information; it is also used for acquiring the alarm file configured with the parameters, analyzing the alarm file configured with the parameters and sending an alarm notification based on the analyzed alarm file configured with the parameters.
Preferably, the alarm rule information includes an alarm rule name, an alarm index, an alarm threshold, an alarm level, an alarm range, an alarm group and a corresponding alarm channel;
each alarm grade is used for limiting the sending time interval of the alarm notice, and the sending time intervals corresponding to different alarm grades are different;
the alarm groups are groups to which alarm rules belong, and each alarm group is adapted to a corresponding alarm scene;
the Prometheus obtains the alarm file and works in the alarm range, and after the alarm is triggered, the Prometheus sends alarm information to the alarm information sending and processing module at regular time;
the alarm information sending and processing module is used for analyzing alarm information to obtain alarm parameters and alarm sources, selecting an alarm channel based on alarm grouping, selecting a sending time interval based on alarm levels, and sending alarm notifications at regular time according to the sending time interval through the alarm channel.
Preferably, the alarm information sending and processing module includes:
the alarm information acquisition unit is used for acquiring alarm information from Prometheus;
the alarm information processing unit interacts with the alarm information acquisition unit and is used for analyzing the alarm information to obtain alarm parameters, wherein the source of the alarm is distinguished by analyzing the name of the alarm rule and the labels carried by the alarm rule, and the alarm type is distinguished by analyzing the time in the alarm information, the alarm types comprising common alarm information and alarm information indicating that the index has recovered;
and the alarm sending unit interacts with the alarm information processing unit, is used for receiving the alarm parameters, and is used for selecting the corresponding alarm channel according to the alarm grouping and judging whether that alarm channel is enabled; if so, it selects the corresponding sending time interval according to the alarm level and judges whether the sending time interval is in a sending time interval specified by the alarm information; and if so, it sends the alarm notification at regular intervals, according to the corresponding sending time interval, through the corresponding alarm channel.
Preferably, the system further comprises:
the log alarm module interacts with a user through a log alarm interface and is used for configuring log alarms through that interface, including configuring the log path of the monitored system, the keyword labels of alarm log entries, the scanning time interval of the log path, and the alarm channel for sending log alarms; the log alarm module is used for regularly scanning the log files under the specified log path at the specified scanning time interval and, for newly added logs that require an alarm, for sending log alarms through the specified alarm channel.
Preferably, the system further comprises:
the history recording module is interacted with a user through a history recording interface, is used for configuring log clearing rules, is used for recording each alarm as an alarm log in a local folder, and is used for deleting the alarm log based on the log clearing rules to prevent the log from being recorded too much;
the alarm log comprises alarm information, alarm parameters and alarm notifications corresponding to the alarm information, and also comprises log alarms.
Preferably, the alarm rule management module is configured to generate and export an alarm template based on the configured alarm file, and to import the alarm template and perform secondary configuration on the alarm template.
In a second aspect, the present invention provides a Prometheus-based indicator alarm management method, which is used for performing alarm management by using the Prometheus-based indicator alarm management system according to any one of the first aspects, and includes acquiring alarm indicators and related data, generating and managing alarm rules, and processing and sending alarm information, and the method includes the following steps:
acquiring an index name of an index required by system alarm from Prometheus based on an index acquisition request;
alarm rule information is configured and edited, an alarm file is generated based on the alarm rule information, and the alarm file is sent to Prometheus;
and acquiring alarm information from Prometheus at regular time, analyzing the alarm information and sending an alarm notice based on the analyzed alarm information.
Preferably, the alarm rule information includes an alarm rule name, an alarm index, an alarm threshold, an alarm level, an alarm range, an alarm group and a corresponding alarm channel;
each alarm grade is used for limiting the sending time interval of the alarm notice, and the sending time intervals corresponding to different alarm grades are different;
the alarm groups are groups to which alarm rules belong, and each alarm group is adapted to a corresponding alarm scene;
the Prometheus obtains the alarm file and works in the alarm range, and after the alarm is triggered, the Prometheus sends alarm information to the alarm information sending and processing module at regular time;
after alarm information is obtained from Prometheus at regular time, analyzing the alarm information to obtain alarm parameters and alarm sources, selecting an alarm channel based on alarm grouping, selecting a sending time interval based on alarm level, and sending an alarm notification at regular time according to the sending time interval through the alarm channel;
analyzing the alarm information to obtain alarm parameters, comprising:
distinguishing the source of the alarm by analyzing the name of the alarm rule and the label carried by the alarm rule;
distinguishing alarm types by analyzing the time in the alarm message, wherein the alarm types comprise common alarm information and alarm information with recovered indexes;
selecting an alarm channel based on the alarm grouping and selecting a transmission time interval based on the alarm level, comprising the steps of:
selecting a corresponding alarm channel according to the alarm grouping and judging whether the corresponding alarm channel is opened or not;
if yes, selecting a corresponding sending time interval according to the alarm grade, and judging whether the sending time interval is in a sending time interval specified by the alarm information;
if yes, the alarm notice is sent according to the corresponding sending time interval through the corresponding alarm channel.
Preferably, the method further comprises the steps of:
configuring log alarm, including configuring log path of monitoring system, configuring key word label of alarm log entry, configuring scanning time interval of log path and configuring alarm channel for sending log alarm;
regularly scanning log files under a designated log path based on a designated scanning time interval;
for newly added logs needing to be alarmed, sending log alarms based on a specified alarm channel;
the method further comprises the steps of:
configuring log cleaning rules;
recording each alarm in a local folder as an alarm log, wherein the alarm log comprises alarm information, alarm parameters and alarm notifications corresponding to the alarm information, and log alarms;
and deleting the alarm log based on the log clearing rule to prevent the log from recording too much.
Preferably, the method further comprises the steps of:
configuring and editing alarm rule information, and after generating an alarm file based on the alarm rule information, configuring alarm parameters required in the alarm file to obtain an alarm file configured with the parameters;
acquiring an alarm file configured with parameters, analyzing the alarm file configured with the parameters, and sending an alarm notification based on the analyzed alarm file configured with the parameters;
the method further comprises the steps of:
generating and exporting an alarm template based on the configured alarm file;
and when alarm rule information is configured and edited, importing the alarm template, performing secondary configuration on the alarm template, generating an alarm file based on the configured alarm rule information, and sending the alarm file to Prometheus.
The Prometheus-based index alarm management system and method provided by the invention have the following advantages:
1. without relying on Grafana and AlertManager, the functions of acquiring alarm indexes and related data, generating and managing alarm rules, and processing and sending alarm information are realized based on Prometheus, which lowers the difficulty of setting up operation and maintenance alarms for a system and saves the time and space otherwise required by a large amount of alarm configuration;
2. configuration, modification and editing of alarm rules are supported, and the alarm requirements of various scenarios can be met through highly customized configuration;
3. for scenarios in which Prometheus is not used, the log alarm function can be used to provide the most basic alarm function, which improves the extensibility and usability of the alarm function;
4. each alarm is recorded, so that the historical alarms can be conveniently recorded and analyzed, and meanwhile, the log record is deleted according to the configured log clearing rule, so that the phenomenon that too many logs are generated and the working efficiency is influenced is prevented;
5. the method supports the export of the configured alarm rule generation template, facilitates the import of the configuration into another set of tools, and does not need to perform the same secondary configuration.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a block diagram of a workflow of a Prometheus-based index alarm management system according to embodiment 1;
fig. 2 is a block diagram of a workflow of an alarm information sending processing module in the Prometheus-based indicator alarm management system in embodiment 1;
fig. 3 is a flowchart of the Prometheus-based index alarm management method according to embodiment 1.
Detailed Description
The present invention is further described below with reference to the accompanying drawings and specific embodiments so that those skilled in the art can better understand the present invention and can implement the present invention, but the embodiments are not intended to limit the present invention, and the embodiments and technical features of the embodiments can be combined with each other without conflict.
The embodiment of the invention provides a Prometheus-based index alarm management system and method, which are used for solving the technical problem of providing a simple, easy-to-use and high-reliability alarm mode for system operation and maintenance alarm.
Example 1:
the invention relates to a Prometheus-based index alarm management system, which comprises an alarm index collection module, an alarm rule management module and an alarm information sending and processing module, and is used for acquiring alarm indexes and related data, generating and managing alarm rules and processing and sending alarm information as shown in figure 1.
And the alarm index collection module is interacted with Prometheus and is used for acquiring the index name of the index required by the system alarm from the Prometheus based on the index acquisition request.
For a monitored operation and maintenance system, prometheus provides acquisition interfaces of all index types of the operation and maintenance system, and after the management system is started, the management system sends a request to the acquisition interfaces provided by Prometheus to acquire all index names usable by alarms of the operation and maintenance system, and provides the index names for the subsequent alarm rule generation.
If the large operation and maintenance system object has a specific cluster node or other distinguishing reference labels, the corresponding label is obtained for the rule to distinguish the range of the alarm object.
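The index-name request described above can be sketched as follows. The patent does not name the acquisition interface; this sketch assumes the standard Prometheus HTTP API endpoint that lists the values of the reserved `__name__` label. The function names and the base URL are illustrative and do not come from the patent.

```python
import json
import urllib.request
from typing import List
from urllib.parse import urljoin


def metric_names_url(base_url: str) -> str:
    """Build the URL that lists all values of the reserved __name__ label."""
    return urljoin(base_url.rstrip("/") + "/", "api/v1/label/__name__/values")


def parse_metric_names(body: str) -> List[str]:
    """Extract the metric (index) names from a Prometheus HTTP API JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus request failed: {payload.get('error')}")
    return sorted(payload["data"])


def fetch_metric_names(base_url: str, timeout: float = 5.0) -> List[str]:
    """Ask a running Prometheus server for every metric name it knows about."""
    with urllib.request.urlopen(metric_names_url(base_url), timeout=timeout) as resp:
        return parse_metric_names(resp.read().decode("utf-8"))
```

A call such as `fetch_metric_names("http://localhost:9090")` (hypothetical address) would return the names offered to the user when a rule is created. Labels such as a cluster node name could be fetched in a similar way to narrow the alarm range.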
The alarm rule management module interacts with Prometheus and interacts with a user through an alarm rule management interface, is used for supporting the user to configure and edit alarm rule information, is used for generating an alarm file based on the alarm rule information and sends the alarm file to Prometheus.
Through the alarm rule management interface, the user fills in the name of the required alarm rule, the alarm index, the alarm threshold, the alarm level and other alarm-related information such as the alarm details and the alarm summary, fills in the alarm group to which the alarm rule belongs, and specifies the corresponding alarm channel. After the user selects the alarm range and clicks finish, the alarm file required by Prometheus is generated automatically and passed to Prometheus, which then performs alarm processing for the corresponding index according to the alarm file.
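As a rough illustration of how the filled-in fields could be turned into an alarm file, the sketch below renders them in the Prometheus alerting-rule file layout (`groups`, `rules`, `alert`, `expr`, `labels`, `annotations`). The `AlarmRule` class, its field names, and the decision to carry the alarm level and alarm group as labels are assumptions made for this example; the patent does not specify the mapping.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlarmRule:
    name: str              # alarm rule name
    metric: str            # alarm index (metric name)
    operator: str          # comparison operator, e.g. ">"
    threshold: float       # alarm threshold
    level: str             # alarm level, carried as a label (assumption)
    group: str             # alarm group, also used as rule group name (assumption)
    summary: str = ""      # alarm summary
    description: str = ""  # alarm details
    scope: Dict[str, str] = field(default_factory=dict)  # alarm range as label matchers


def build_expr(rule: AlarmRule) -> str:
    """Build a simple threshold expression such as: cpu_usage{cluster="a"} > 0.9"""
    selector = ",".join(f'{k}="{v}"' for k, v in sorted(rule.scope.items()))
    target = f"{rule.metric}{{{selector}}}" if selector else rule.metric
    return f"{target} {rule.operator} {rule.threshold:g}"


def render_rule_file(rules: List[AlarmRule]) -> str:
    """Render the rules as YAML text; scalars are JSON-quoted, which YAML accepts."""
    q = json.dumps
    grouped: Dict[str, List[AlarmRule]] = {}
    for rule in rules:
        grouped.setdefault(rule.group, []).append(rule)
    lines = ["groups:"]
    for group_name, members in grouped.items():
        lines.append(f"  - name: {q(group_name)}")
        lines.append("    rules:")
        for r in members:
            lines += [
                f"      - alert: {q(r.name)}",
                f"        expr: {q(build_expr(r))}",
                "        labels:",
                f"          level: {q(r.level)}",
                f"          group: {q(r.group)}",
                "        annotations:",
                f"          summary: {q(r.summary)}",
                f"          description: {q(r.description)}",
            ]
    return "\n".join(lines) + "\n"
```

Carrying the level and group as labels means they travel with every alarm that Prometheus later emits, so the sending side can pick the channel and interval without a separate lookup.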
The alarm rule management module supports editing of the alarm file, namely modifying and deleting an individual alarm rule after the alarm rule file has been generated. The alarm rules are stored and managed by Prometheus, so the corresponding modification and deletion can be completed by the tool accessing the alarm interface of Prometheus.
Meanwhile, the alarm rule management module is used for supporting a user to configure the required alarm parameters in the alarm file to obtain the alarm file configured with the parameters.
That is, a user who is familiar with PromQL, the query language of Prometheus, can also fully customize the expression of an alarm rule, or directly fill the parameters required for the alarm into the configuration file for the management system to use.
The alarm information sending and processing module is used for acquiring alarm information from Prometheus at regular time, analyzing the alarm information and sending an alarm notification based on the analyzed alarm information.
As shown in fig. 2, in this embodiment, the alarm information sending and processing module is configured to parse the alarm information to obtain alarm parameters and the alarm source, to select an alarm channel based on the alarm grouping, and to select a sending time interval based on the alarm level, so as to send the alarm notification through the alarm channel at regular intervals according to the sending time interval.
As a specific embodiment, the alarm information sending processing module includes an alarm information obtaining unit, an alarm information processing unit, and an alarm sending unit. Aiming at the alarm information returned by Prometheus, the alarm information acquisition unit, the alarm information processing unit and the alarm sending unit in the alarm information sending processing module have the following functions.
The alarm information acquisition unit is used for acquiring alarm information from Prometheus.
After Prometheus obtains the alarm configuration file, the information is read to carry out alarm work in the set range. Meanwhile, when an alarm is triggered, the Prometheus sends alarm information to a specific port every 15 seconds (the value can be configured), and the management system can acquire the alarm information through the interface and perform subsequent processing.
The alarm information processing unit interacts with the alarm information acquisition unit and is used for analyzing the alarm information to obtain alarm parameters: the source of an alarm is distinguished by analyzing the name of the alarm rule and the labels carried by the alarm rule, and the alarm type is distinguished by analyzing the time in the alarm information, the alarm types comprising common alarm information and alarm information indicating that the index has recovered.
Namely, after receiving the alarm message, the alarm information processing unit in the management system analyzes the alarm message provided by Prometheus, distinguishes the source of the alarm by the name of the alarm rule and the label carried by the alarm rule, and distinguishes whether the alarm message is used as a normal alarm or the alarm message with the recovered index by analyzing the time in the alarm message. After the processing is finished, the parameters of each alarm are sent to the alarm sending unit for processing.
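The parsing step can be sketched as below. The patent does not give the message format; this example assumes each alarm arrives as a JSON object with `labels`, `startsAt` and `endsAt` fields, which is the shape Prometheus uses when pushing alerts to an Alertmanager-compatible endpoint. Under that assumption, an `endsAt` time that has already passed marks the index as recovered. The function names and the returned dictionary layout are illustrative.

```python
import re
from datetime import datetime, timezone

# RFC 3339 timestamp; fractional seconds are accepted but discarded.
_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.\d+)?(Z|[+-]\d{2}:\d{2})$")


def parse_timestamp(text: str) -> datetime:
    """Parse an RFC 3339 timestamp into a timezone-aware datetime."""
    match = _TS.match(text)
    if match is None:
        raise ValueError(f"unsupported timestamp: {text!r}")
    base, offset = match.groups()
    return datetime.fromisoformat(base + ("+00:00" if offset == "Z" else offset))


def parse_alert(alert: dict, now: datetime) -> dict:
    """Split one alarm message into rule name, source labels and alarm type."""
    labels = dict(alert.get("labels", {}))
    rule_name = labels.pop("alertname", "")  # the rule name travels as a label
    resolved = False
    ends_at = alert.get("endsAt")
    if ends_at:
        end = parse_timestamp(ends_at)
        # Year 1 is treated as "not set"; a past end time means recovered.
        resolved = end.year > 1 and end <= now
    return {
        "rule": rule_name,
        "source": labels,  # remaining labels identify where the alarm came from
        "type": "recovered" if resolved else "normal",
        "started_at": parse_timestamp(alert["startsAt"]),
    }
```

The caller passes `now` explicitly (for example `datetime.now(timezone.utc)`), which keeps the classification easy to test.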
The alarm sending unit is interacted with the alarm information processing unit, is used for receiving alarm parameters, is used for selecting a corresponding alarm channel according to the alarm grouping and judging whether the corresponding alarm channel is opened, if so, is used for selecting a corresponding sending time interval according to the alarm level and judging whether the sending time interval is in a sending time interval specified by the alarm information, and if so, is used for sending an alarm notification at regular time according to the corresponding sending time interval through the corresponding alarm channel.
The user needs to enter the alarm grouping and the corresponding alarm channel in the configured rule file. Alarm groupings provide different groups of alarms, so that an alarm can send its alarm messages to the alarm channel of its defined group to meet the needs of different alarm scenarios. A series of mainstream alarm channels such as email, SNMP, syslog and WeChat are supported, and webhooks are supported for extending to other alarm channels.
In addition, the user can customize the email template when using email alarms; the customization only requires that the required keywords be kept in the email. When an alarm is sent, the configured alarm level is read, and by setting alarm levels the user can set different sending time intervals for alarm information of different levels. For example, if the normal level is set to send an alarm notification once every 3 hours, then after a normal-level alarm message is first sent, the next alarm notification is sent only after 3 hours, which prevents excessive useless alarm messages from being sent.
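One possible reading of the sending logic (channel chosen by group, channel switch checked, repeat notifications suppressed until the level's interval has elapsed) is sketched below. The `AlarmSender` class, its constructor arguments and the choice to throttle per rule and source are assumptions made for illustration, not details given in the patent.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Tuple


class AlarmSender:
    """Pick a channel by alarm group and throttle repeats by alarm level."""

    def __init__(
        self,
        channel_by_group: Dict[str, str],
        channel_enabled: Dict[str, bool],
        interval_by_level: Dict[str, timedelta],
        senders: Dict[str, Callable[[str], None]],
    ) -> None:
        self.channel_by_group = channel_by_group
        self.channel_enabled = channel_enabled
        self.interval_by_level = interval_by_level
        self.senders = senders
        self._last_sent: Dict[Tuple[str, str], datetime] = {}

    def handle(self, rule: str, source: str, group: str, level: str,
               message: str, now: datetime) -> bool:
        """Send the message if allowed; return True when a notification went out."""
        channel = self.channel_by_group.get(group)
        if channel is None or not self.channel_enabled.get(channel, False):
            return False  # no channel for this group, or the channel is switched off
        interval = self.interval_by_level.get(level, timedelta(0))
        key = (rule, source)
        last = self._last_sent.get(key)
        if last is not None and now - last < interval:
            return False  # still inside the quiet period for this level
        self.senders[channel](message)
        self._last_sent[key] = now
        return True
```

With a 3-hour interval for the normal level, a repeat of the same alarm one hour after the first notification is dropped, and the next one goes out once 3 hours have passed. Each channel is just a callable here, so email, syslog or a webhook could be plugged in behind the same interface.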
Aiming at the alarm file which is sent by the alarm rule management module and is provided with parameters, an alarm information acquisition unit, an alarm information processing unit and an alarm sending unit in the alarm information sending processing module have the following functions.
The alarm information acquisition unit is used for acquiring the alarm file configured with the parameters from the alarm rule management module.
The alarm information processing unit interacts with the alarm information acquisition unit and is used for analyzing the alarm file configured with the parameters to obtain alarm parameters: the source of an alarm is distinguished by analyzing the name of the alarm rule and the labels carried by the alarm rule, and the alarm type is distinguished by analyzing the time in the alarm message, the alarm types comprising common alarm information and alarm information indicating that the index has recovered.
The alarm sending unit interacts with the alarm information processing unit, is used for receiving the alarm parameters, and is used for selecting the corresponding alarm channel according to the alarm grouping and judging whether that alarm channel is enabled; if so, it selects the corresponding sending time interval according to the alarm level and judges whether the sending time interval is in a sending time interval specified by the alarm information; and if so, it sends the alarm notification at regular intervals, according to the corresponding sending time interval, through the corresponding alarm channel.
The user needs to enter the alarm grouping and the corresponding alarm channel in the configured rule file. Alarm groupings provide different groups of alarms, so that an alarm can send its alarm messages to the alarm channel of its defined group to meet the needs of different alarm scenarios. A series of mainstream alarm channels such as email, SNMP, syslog and WeChat are supported, and webhooks are supported for extending to other alarm channels.
In addition, the user can customize the email template when using email alarms; the customization only requires that the required keywords be kept in the email. When an alarm is sent, the configured alarm level is read, and by setting alarm levels the user can set different sending time intervals for alarm information of different levels. For example, if the normal level is set to send an alarm notification once every 3 hours, then after a normal-level alarm message is first sent, the next alarm notification is sent only after 3 hours, which prevents excessive useless alarm messages from being sent.
The system further comprises a log alarm module, which interacts with the user through a log alarm interface. Through this interface the user configures log alarms, including the log path of the monitored system, the keyword tags of the log entries to be alarmed, the scanning time interval for the log path, and the alarm channel used to send log alarms. The module periodically scans the log files under the specified log path at the specified scanning time interval and, for newly added logs that need to be alarmed, sends log alarms through the specified alarm channel.
This function does not require Prometheus. The management system scans the logs under the log path of the monitored system through the log alarm module (the path is configured in advance by the user), and the user must provide the keyword tags of the log entries to be alarmed. When a newly added log that needs to be alarmed is generated, the log alarm module sends a log alarm through the alarm channel.
The log alarm module scans the logs under the specified log path at the time interval configured by the user and sends all logs in that interval that need to be alarmed through the configured alarm channel in a single batch. This makes it convenient for the user to monitor at the log level and gives the management system a wider and more comprehensive scope of application.
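As an illustration of the periodic scan, the following minimal sketch remembers how far each log file has already been read, so that each scan returns only newly added entries containing a configured keyword. The class name `LogScanner` and the byte-offset bookkeeping are assumptions; the text does not say how new entries are detected, and log rotation or truncation is not handled here.

```python
import os


class LogScanner:
    """Hypothetical scanner: keeps a byte offset per file so that each scan
    only sees lines appended since the previous scan."""

    def __init__(self, log_dir, keywords):
        self.log_dir = log_dir    # log path configured by the user
        self.keywords = keywords  # keyword tags of entries to be alarmed
        self.offsets = {}         # file path -> bytes already scanned

    def scan(self):
        """Return all newly added lines that contain at least one keyword."""
        matches = []
        for name in sorted(os.listdir(self.log_dir)):
            path = os.path.join(self.log_dir, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                handle.seek(self.offsets.get(path, 0))
                for raw in handle:
                    line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                    if any(keyword in line for keyword in self.keywords):
                        matches.append(line)
                self.offsets[path] = handle.tell()
        return matches
```

A caller would invoke `scan()` once per scanning time interval and pass the returned batch to the configured alarm channel in a single message.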
As an improvement, the system further comprises a history recording module, which interacts with the user through a history recording interface. It is used for configuring log cleaning rules, for recording each alarm as an alarm log in a local folder, and for deleting alarm logs based on the log cleaning rules to prevent excessive log records. The alarm log comprises the alarm information, the alarm parameters and alarm notifications corresponding to the alarm information, and log alarms.
Based on the history recording module, each successfully sent alarm is recorded as a log in the tool's folder, so that the user can conveniently view and analyze the entire alarm history in one place. Log cleaning rules can also be configured to prevent excessive log records.
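The recording and cleaning behaviour can be sketched as follows. The text does not specify the log format or the form of a cleaning rule, so the one-file-per-day JSON-lines layout, the age-based rule and the function names `record_alarm` and `clean_alarm_logs` are all assumptions made for illustration.

```python
import json
import os
import time


def record_alarm(folder, alarm, now=None):
    """Append one alarm as a JSON line to a per-day file (hypothetical layout)."""
    now = time.time() if now is None else now
    name = time.strftime("alarm-%Y%m%d.log", time.gmtime(now))
    path = os.path.join(folder, name)
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(alarm, ensure_ascii=False) + "\n")
    return path


def clean_alarm_logs(folder, max_age_seconds, now=None):
    """Hypothetical cleaning rule: delete files not modified within the limit."""
    now = time.time() if now is None else now
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and now - os.path.getmtime(path) > max_age_seconds:
            os.remove(path)
            removed.append(name)
    return removed
```

Other cleaning rules, such as a cap on total size or on the number of files, would fit the same shape.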
The alarm rule management module is further used for generating and exporting an alarm template based on the configured alarm file, and for importing an alarm template and performing secondary configuration on it.
The alarm rule management module supports import and export of alarm templates. After a set of alarm rules, alarm channels and so on has been configured, this function exports the configuration information with one click, so that the configuration can be imported into another instance of the tool without repeating the same configuration.
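A minimal sketch of template export and import is given below. The JSON layout, the field names and the functions `export_template` and `import_template` are hypothetical; the text only states that configured rules and channels can be exported and later imported and adjusted.

```python
import json


def export_template(rules, channels):
    """Serialize the configured rules and channels to a JSON string."""
    return json.dumps(
        {"version": 1, "rules": rules, "channels": channels},
        ensure_ascii=False, indent=2, sort_keys=True,
    )


def import_template(text, overrides=None):
    """Load a template and apply per-rule overrides (the secondary configuration)."""
    template = json.loads(text)
    for rule in template["rules"]:
        # Overrides are keyed by rule name; rules without an entry stay as exported.
        rule.update((overrides or {}).get(rule["name"], {}))
    return template
```

For example, a template exported from one deployment could be imported into another with only the threshold of a single rule overridden, leaving the remaining rules and channels unchanged.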
Example 2:
The invention provides a Prometheus-based index alarm management method, which performs alarm management through the Prometheus-based index alarm management system disclosed in Example 1 and comprises acquiring alarm indexes and related data, generating and managing alarm rules, and processing and sending alarm information. As shown in Fig. 3, the method comprises the steps of:
S100, acquiring from Prometheus, based on an index acquisition request, the index names of the indexes required for system alarms;
S200, configuring and editing alarm rule information, generating an alarm file based on the alarm rule information, and sending the alarm file to Prometheus;
S300, acquiring alarm information from Prometheus at regular intervals, parsing the alarm information, and sending an alarm notification based on the parsed alarm information.
For a monitored operation and maintenance system, Prometheus provides interfaces for acquiring all index types of that system. In this embodiment, step S100 sends a request to the acquisition interface provided by Prometheus to obtain all index names that can be used for alarms on the operation and maintenance system, so that they are available for subsequent alarm rule generation.
If the monitored object is a large operation and maintenance system with specific cluster nodes or other distinguishing labels, the corresponding labels are also obtained so that a rule can delimit the range of alarm objects.
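Step S100 can be sketched as follows. The text does not name the interface it calls; this sketch assumes the label-values endpoint of the Prometheus HTTP API (`/api/v1/label/<label_name>/values`), where the label `__name__` yields index names and any other label (for example a cluster label) yields its values. The function names and the base URL in the usage note are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def label_values_url(base_url, label="__name__"):
    """Build the label-values URL; "__name__" lists all index (metric) names."""
    return f"{base_url.rstrip('/')}/api/v1/label/{quote(label, safe='')}/values"


def parse_label_values(body):
    """Extract the value list from a Prometheus API response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus returned status {payload.get('status')!r}")
    return sorted(payload["data"])


def fetch_label_values(base_url, label="__name__", timeout=10):
    """Request the values of one label from a running Prometheus server."""
    with urlopen(label_values_url(base_url, label), timeout=timeout) as response:
        return parse_label_values(response.read().decode("utf-8"))
```

Calling `fetch_label_values("http://localhost:9090")` against a reachable server would return the index names offered to the user when a rule is created; the address shown is only a placeholder.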
In step S200, the user configures the alarm rule name, alarm index, alarm threshold, alarm level and other alarm-related information such as the alarm details and alarm summary, fills in the alarm group to which the alarm rule belongs, and specifies the corresponding alarm channel. After the alarm range is selected, the alarm file required by Prometheus is generated from the alarm rule information and sent to Prometheus, which subsequently performs alarm processing for the corresponding indexes according to the alarm file.
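The generation of the alarm file in step S200 can be sketched as rendering the user's rule information into the YAML layout of a Prometheus alerting rule file (`groups`, `rules`, `alert`, `expr`, `labels`, `annotations`). The input dictionary keys, the custom labels `level` and `group`, and the function `render_rule_file` are assumptions for illustration; the text does not disclose the exact file content.

```python
import json


def render_rule_file(group, rules):
    """Render rule dictionaries as Prometheus alerting-rule YAML text."""
    q = json.dumps  # a JSON string literal is also a valid YAML double-quoted scalar
    lines = ["groups:", f"  - name: {q(group)}", "    rules:"]
    for rule in rules:
        # The alarm range is expressed as label matchers on the index.
        matchers = ",".join(f'{k}="{v}"' for k, v in sorted(rule["range"].items()))
        selector = rule["index"] + ("{" + matchers + "}" if matchers else "")
        expr = f'{selector} {rule["op"]} {rule["threshold"]}'
        lines += [
            f"      - alert: {q(rule['name'])}",
            f"        expr: {q(expr)}",
            "        labels:",
            f"          level: {q(rule['level'])}",  # hypothetical custom label
            f"          group: {q(group)}",          # hypothetical custom label
            "        annotations:",
            f"          summary: {q(rule['summary'])}",
            f"          description: {q(rule['details'])}",
        ]
    return "\n".join(lines) + "\n"
```

Carrying the alarm level and alarm group as labels is one way for the information to travel with each alarm so that the sending side can later select the channel and the sending time interval.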
In step S300, the alarm information is parsed to obtain the alarm parameters and the alarm source, an alarm channel is selected based on the alarm group, a sending time interval is selected based on the alarm level, and alarm notifications are sent through the alarm channel at the sending time interval. The specific operations are as follows:
(1) Acquiring alarm information from Prometheus;
After Prometheus obtains the alarm configuration file, it reads the information and performs alarm work within the configured range. When an alarm is triggered, Prometheus sends alarm information to a specific port every 15 seconds (this value is configurable), and the management system acquires the alarm information through that interface for subsequent processing;
(2) Parsing the alarm information to obtain alarm parameters: the alarm source is identified by parsing the alarm rule name and the labels carried by the alarm rule, and the alarm type is identified by parsing the time in the alarm message, the alarm types comprising ordinary alarm information and alarm information indicating that the index has recovered;
(3) Receiving the alarm parameters, selecting the corresponding alarm channel according to the alarm group, and judging whether that alarm channel is enabled; if so, selecting the corresponding sending time interval according to the alarm level and judging whether the current time falls within the sending time interval specified by the alarm information; if so, sending alarm notifications through the corresponding alarm channel at the corresponding sending time interval.
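Operations (2) and (3) can be sketched as follows. The sketch assumes that each alarm arrives as a JSON object in the shape Prometheus uses when posting alerts (`labels`, `startsAt`, `endsAt`), that timestamps carry at most microsecond precision, and that an `endsAt` time that is not in the future marks a recovered index. The labels `level` and `group` and both function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_time(value):
    """Parse an RFC 3339 timestamp such as "2022-08-04T11:30:00Z"."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def classify_alert(alert, now):
    """Derive alarm parameters: rule name, source labels, level, group and type."""
    labels = alert.get("labels", {})
    ends_at = alert.get("endsAt")
    ends = parse_time(ends_at) if ends_at else None
    # Year 1 is the zero timestamp some senders use for "not set".
    recovered = ends is not None and ends.year > 1 and ends <= now
    reserved = ("alertname", "level", "group")
    return {
        "rule": labels.get("alertname", ""),
        "source": {k: v for k, v in labels.items() if k not in reserved},
        "level": labels.get("level", "normal"),
        "group": labels.get("group", "default"),
        "type": "recovered" if recovered else "firing",
    }


def select_channels(params, group_channels, enabled):
    """Return the channels of the alarm's group that are currently enabled."""
    channels = group_channels.get(params["group"], [])
    return [channel for channel in channels if enabled.get(channel, False)]
```

The resulting parameters would then be checked against the level's sending time interval before a notification is handed to each selected channel.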
The user must specify the alarm group and the corresponding alarm channel in the configured rule file. Alarm groups partition alarms so that, for different alarm scenarios, alarm messages are sent to the alarm channels defined for the group. Mainstream alarm channels such as email, SNMP, syslog and WeChat are supported, and other alarm channels can be added through a webhook.
In addition, when using email alarms the user can customize the email template; the customization only requires that the required keywords be kept in the email. When an alarm is sent, the configured alarm level is read; by setting alarm levels, the user sets the sending time interval for alarm information of each level. For example, if alarms of the normal level are sent once every 3 hours, then after an alarm message of the normal level is first sent, the next alarm notification is sent only after 3 hours, which prevents excessive, useless alarm messages.
As an improvement, a user familiar with PromQL, the query language of Prometheus, can also fully customize alarm rules, filling the parameters required for the alarm directly into the configuration file for subsequent alarm parsing and sending.
For this improvement, in step S200 of this embodiment, alarm rule information is configured and edited, and after the alarm file is generated based on the alarm rule information, the required alarm parameters are configured in the alarm file to obtain an alarm file configured with parameters.
Step S300 then acquires the alarm file configured with parameters, parses it, and sends an alarm notification based on the parsed alarm file. The specific flow of step S300 is as follows:
(1) Acquiring the alarm file configured with parameters from the alarm rule management module;
(2) Parsing the alarm file configured with parameters to obtain alarm parameters: the alarm source is identified by parsing the alarm rule name and the labels carried by the alarm rule, and the alarm type is identified by parsing the time in the alarm message, the alarm types comprising ordinary alarm information and alarm information indicating that the index has recovered;
(3) Receiving the alarm parameters, selecting the corresponding alarm channel according to the alarm group, and judging whether that alarm channel is enabled; if so, selecting the corresponding sending time interval according to the alarm level and judging whether the current time falls within the sending time interval specified by the alarm information; if so, sending alarm notifications through the corresponding alarm channel at the corresponding sending time interval.
The user must specify the alarm group and the corresponding alarm channel in the configured rule file. Alarm groups partition alarms so that, for different alarm scenarios, alarm messages are sent to the alarm channels defined for the group. Mainstream alarm channels such as email, SNMP, syslog and WeChat are supported, and other alarm channels can be added through a webhook.
In addition, when using email alarms the user can customize the email template; the customization only requires that the required keywords be kept in the email. When an alarm is sent, the configured alarm level is read; by setting alarm levels, the user sets the sending time interval for alarm information of each level. For example, if alarms of the normal level are sent once every 3 hours, then after an alarm message of the normal level is first sent, the next alarm notification is sent only after 3 hours, which prevents excessive, useless alarm messages.
As an improvement, the method further comprises a log alarm operation, with the following specific flow:
(1) Configuring log alarms, including configuring the log path of the monitored system, the keyword tags of the log entries to be alarmed, the scanning time interval for the log path, and the alarm channel for sending log alarms;
(2) Periodically scanning the log files under the specified log path at the specified scanning time interval;
(3) For newly added logs that need to be alarmed, sending log alarms through the specified alarm channel.
As an improvement, the method further comprises alarm history recording, with the following specific operations:
(1) Configuring log cleaning rules;
(2) Recording each alarm as an alarm log in a local folder, wherein the alarm log comprises the alarm information, the alarm parameters and alarm notifications corresponding to the alarm information, and log alarms;
(3) Deleting alarm logs based on the log cleaning rules to prevent excessive log records.
As an improvement, in step S200 of the method of this embodiment, an alarm template is generated and exported based on the configured alarm file; when alarm rule information is configured and edited, an alarm template is imported and secondarily configured, an alarm file is generated based on the configured alarm rule information, and the alarm file is sent to Prometheus.
This improvement supports import and export of alarm templates. After a set of alarm rules, alarm channels and so on has been configured, this function exports the configuration information with one click, so that the configuration can be imported into another instance of the tool without repeating the same configuration.
While the invention has been shown and described in detail in the drawings and the preferred embodiments, the invention is not limited to the disclosed embodiments. It will be apparent to those skilled in the art that the technical means of the various embodiments described above may be combined to obtain further embodiments of the invention, which also fall within the scope of the invention.

Claims (10)

1. A Prometheus-based index alarm management system, characterized in that it is used for acquiring alarm indexes and related data, generating and managing alarm rules, and processing and sending alarm information, and comprises:
an alarm index collection module, which interacts with Prometheus and is used for acquiring from Prometheus, based on an index acquisition request, the index names of the indexes required for system alarms;
an alarm rule management module, which interacts with Prometheus and interacts with a user through an alarm rule management interface, is used for supporting the user in configuring and editing alarm rule information, and is used for generating an alarm file based on the alarm rule information and sending the alarm file to Prometheus; and is further used for supporting the user in configuring the required alarm parameters in the alarm file to obtain an alarm file configured with parameters;
an alarm information sending and processing module, which is used for acquiring alarm information from Prometheus at regular intervals, parsing the alarm information and sending an alarm notification based on the parsed alarm information; and is further used for acquiring the alarm file configured with parameters from the alarm rule management module, parsing the alarm file configured with parameters and sending an alarm notification based on the parsed alarm file.
2. The Prometheus-based index alarm management system of claim 1, wherein the alarm rule information comprises an alarm rule name, an alarm index, an alarm threshold, an alarm level, an alarm range, an alarm group, and a corresponding alarm channel;
each alarm level defines the sending time interval of alarm notifications, and different alarm levels correspond to different sending time intervals;
the alarm group is the group to which an alarm rule belongs, and each alarm group is adapted to a corresponding alarm scenario;
Prometheus obtains the alarm file and operates within the alarm range, and after an alarm is triggered, Prometheus sends alarm information to the alarm information sending and processing module at regular intervals;
the alarm information sending and processing module is used for parsing the alarm information to obtain the alarm parameters and the alarm source, selecting an alarm channel based on the alarm group and a sending time interval based on the alarm level, and sending alarm notifications through the alarm channel at the sending time interval.
3. The Prometheus-based index alarm management system of claim 2, wherein the alarm information sending and processing module comprises:
an alarm information acquisition unit, which is used for acquiring alarm information from Prometheus;
an alarm information processing unit, which interacts with the alarm information acquisition unit and is used for parsing the alarm information to obtain alarm parameters, wherein the alarm source is identified by parsing the alarm rule name and the labels carried by the alarm rule, and the alarm type is identified by parsing the time in the alarm information, the alarm types comprising ordinary alarm information and alarm information indicating that the index has recovered;
and an alarm sending unit, which interacts with the alarm information processing unit, is used for receiving the alarm parameters, is used for selecting the corresponding alarm channel according to the alarm group and judging whether that alarm channel is enabled, is used for, if so, selecting the corresponding sending time interval according to the alarm level and judging whether the current time falls within the sending time interval specified by the alarm information, and is used for, if so, sending alarm notifications through the corresponding alarm channel at the corresponding sending time interval.
4. The Prometheus-based index alarm management system according to any one of claims 1-3, characterized in that the system further comprises:
a log alarm module, which interacts with a user through a log alarm interface, is used for configuring log alarms through the log alarm interface, including configuring the log path of the monitored system, the keyword tags of the log entries to be alarmed, the scanning time interval for the log path, and the alarm channel for sending log alarms, is used for periodically scanning the log files under the specified log path at the specified scanning time interval, and is used for sending log alarms through the specified alarm channel for newly added logs that need to be alarmed.
5. The Prometheus-based index alarm management system of claim 4, wherein the system further comprises:
a history recording module, which interacts with a user through a history recording interface, is used for configuring log cleaning rules, is used for recording each alarm as an alarm log in a local folder, and is used for deleting alarm logs based on the log cleaning rules to prevent excessive log records;
the alarm log comprises the alarm information, the alarm parameters and alarm notifications corresponding to the alarm information, and log alarms.
6. The Prometheus-based index alarm management system of any one of claims 1-3, wherein the alarm rule management module is used for generating and exporting an alarm template based on the configured alarm file, and for importing an alarm template and performing secondary configuration on it.
7. A Prometheus-based index alarm management method, which performs alarm management through the Prometheus-based index alarm management system of any one of claims 1 to 6 and comprises acquiring alarm indexes and related data, generating and managing alarm rules, and processing and sending alarm information, the method comprising the steps of:
acquiring from Prometheus, based on an index acquisition request, the index names of the indexes required for system alarms;
configuring and editing alarm rule information, generating an alarm file based on the alarm rule information, and sending the alarm file to Prometheus;
and acquiring alarm information from Prometheus at regular intervals, parsing the alarm information, and sending an alarm notification based on the parsed alarm information.
8. The Prometheus-based index alarm management method of claim 7, wherein the alarm rule information comprises an alarm rule name, an alarm index, an alarm threshold, an alarm level, an alarm range, an alarm group, and a corresponding alarm channel;
each alarm level defines the sending time interval of alarm notifications, and different alarm levels correspond to different sending time intervals;
the alarm group is the group to which an alarm rule belongs, and each alarm group is adapted to a corresponding alarm scenario;
Prometheus obtains the alarm file and operates within the alarm range, and after an alarm is triggered, Prometheus sends alarm information to the alarm information sending and processing module at regular intervals;
after alarm information is acquired from Prometheus at regular intervals, the alarm information is parsed to obtain the alarm parameters and the alarm source, an alarm channel is selected based on the alarm group, a sending time interval is selected based on the alarm level, and alarm notifications are sent through the alarm channel at the sending time interval;
parsing the alarm information to obtain alarm parameters comprises:
identifying the alarm source by parsing the alarm rule name and the labels carried by the alarm rule;
identifying the alarm type by parsing the time in the alarm message, the alarm types comprising ordinary alarm information and alarm information indicating that the index has recovered;
selecting an alarm channel based on the alarm group and selecting a sending time interval based on the alarm level comprises the steps of:
selecting the corresponding alarm channel according to the alarm group and judging whether that alarm channel is enabled;
if so, selecting the corresponding sending time interval according to the alarm level and judging whether the current time falls within the sending time interval specified by the alarm information;
if so, sending alarm notifications through the corresponding alarm channel at the corresponding sending time interval.
9. The Prometheus-based index alarm management method according to claim 7 or 8, wherein the method further comprises the steps of:
configuring log alarms, including configuring the log path of the monitored system, the keyword tags of the log entries to be alarmed, the scanning time interval for the log path, and the alarm channel for sending log alarms;
periodically scanning the log files under the specified log path at the specified scanning time interval;
for newly added logs that need to be alarmed, sending log alarms through the specified alarm channel;
the method further comprises the steps of:
configuring log cleaning rules;
recording each alarm as an alarm log in a local folder, wherein the alarm log comprises the alarm information, the alarm parameters and alarm notifications corresponding to the alarm information, and log alarms;
and deleting alarm logs based on the log cleaning rules to prevent excessive log records.
10. The Prometheus-based index alarm management method of any one of claims 7-8, wherein the method further comprises the steps of:
configuring and editing alarm rule information and, after the alarm file is generated based on the alarm rule information, configuring the required alarm parameters in the alarm file to obtain an alarm file configured with parameters;
acquiring the alarm file configured with parameters, parsing it, and sending an alarm notification based on the parsed alarm file;
the method further comprises the steps of:
generating and exporting an alarm template based on the configured alarm file;
and, when alarm rule information is configured and edited, importing the alarm template, performing secondary configuration on the alarm template, generating an alarm file based on the configured alarm rule information, and sending the alarm file to Prometheus.
CN202210931409.XA 2022-08-04 2022-08-04 Prometheus-based index alarm management system and method Pending CN115473783A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210931409.XA CN115473783A (en) 2022-08-04 2022-08-04 Prometheus-based index alarm management system and method

Publications (1)

Publication Number Publication Date
CN115473783A true CN115473783A (en) 2022-12-13

Family

ID=84367649

Country Status (1)

Country Link
CN (1) CN115473783A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination