CN114422339A

CN114422339A - Automatic scheduling distributed data monitoring system and method

Info

Publication number: CN114422339A
Application number: CN202210314581.0A
Authority: CN
Inventors: 郭飞; 胡玮; 管永权; 董晓军
Original assignee: Xi'an Tali Technology Co ltd
Current assignee: Xi'an Tali Technology Co ltd
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2022-04-29
Anticipated expiration: 2042-03-29
Also published as: CN114422339B

Abstract

The invention provides an automatically scheduled distributed data monitoring system and method, and belongs to the technical field of monitoring systems. The system comprises an Nginx reverse proxy server, a plurality of alarm rule modules, a Kafka queue and a plurality of alarm calculation modules. The invention addresses the performance problems that arise when a massive number of alarm rules must be calculated, and the unavailability of the system when an alarm calculation module goes down. It also addresses the information loss caused when alarm rules created in batches are not processed in time during peak periods. Finally, the invention provides a method for configuring alarm rules along dimensions such as the user, time and the monitored object.

Description

Automatic scheduling distributed data monitoring system and method

Technical Field

The invention belongs to the technical field of monitoring systems, and particularly relates to an automatic scheduling distributed data monitoring system and method.

Background

In cloud computing and Internet of Things systems, all business modules need to work cooperatively, and their internal data are heterogeneous and loosely coupled. These systems all require a stable, available monitoring system to ensure their health and stability. The data monitoring system plays an important role in guaranteeing data security and data service quality, and research on monitoring systems is therefore very meaningful.

A common monitoring system generally adopts a data interface mode: a third-party service pushes data to the monitoring system, or the monitoring system pulls data from the third-party service, and the monitoring system then performs centralized rule calculation to judge whether to raise an alarm. Because of this architecture, when the third-party services to be connected have a large number of data points and a large number of alarm rules operate on the data, the monitoring system easily reaches a performance bottleneck in data access and data processing, and the scalability of the system is weak.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides an automatic scheduling distributed data monitoring system and method.

In order to achieve the above purpose, the invention provides the following technical scheme:

an automatically scheduled distributed data monitoring system comprising:

the Nginx reverse proxy server is in communication connection with the monitored object through the HTTP protocol, the MQTT protocol and the WebSocket protocol;

the plurality of alarm rule modules are deployed using a Master-slave architecture and are connected with the Nginx reverse proxy server through the HTTP protocol;

the Kafka queue is connected with the plurality of alarm rule modules through the HTTP protocol;

the plurality of alarm calculation modules are connected over the TCP protocol through a Zookeeper server; each alarm calculation module is in communication connection with the Kafka queue.

Preferably, the Master-slave architecture includes:

a plurality of slave nodes, which are in communication connection with the Zookeeper server and are used for computing services;

a Master node, which is in communication connection with the Zookeeper server and is mainly used for managing the plurality of slave nodes while also bearing computing services.

Preferably, the system further comprises:

the MYSQL database is in communication connection with the plurality of alarm rule modules;

the time sequence database OpenTsdb is in communication connection with the Kafka queue and the plurality of alarm calculation modules;

and the centralized cache module is in communication connection with the Kafka queue and the plurality of alarm calculation modules.

An automatically scheduled distributed data monitoring method comprises the following steps:

the Nginx reverse proxy server acquires a monitoring request of a monitored object and stores the monitoring request in a time sequence database OpenTsdb;

when an alarm rule module is down, the plurality of alarm rule modules are redeployed; otherwise, the plurality of alarm rule modules respectively formulate alarm rules and multidimensional data logic models according to the monitoring request of the monitored object, and store the alarm rules and the data logic models in the MYSQL database;

the Kafka queue decouples the plurality of alarm rule modules and the plurality of alarm calculation modules;

a plurality of alarm rules and data logic models enter a Kafka queue;

the Zookeeper server coordinates the plurality of alarm calculation modules so that each acquires from the Kafka queue the alarm rules and data logic models that it is to maintain, and the plurality of alarm calculation modules each perform calculation according to their alarm rules and data logic models, obtaining a plurality of monitoring alarm results in parallel;

and the monitoring alarm results are pushed to the monitored object or a third-party system through the Kafka queue.

Preferably, when an alarm rule module goes down, the step of redeploying the plurality of alarm rule modules includes:

the cluster detection on the alarm rule module of the Master node detects that a slave is down;

the alarm rule module of the Master node obtains the IP address of the down slave alarm rule module from the Zookeeper server;

the alarm rule module of the Master node acquires the alarm rules corresponding to the IP address of the down slave alarm rule module from the cache data of the centralized cache module;

the alarm rule module of the Master node sends the alarm rules corresponding to the IP address of the down machine to the Kafka queue;

and the alarm rule module of the Master node and the surviving slave alarm rule modules form a new Master-slave architecture.

Preferably, when an alarm rule module goes down, the step of redeploying the plurality of alarm rule modules includes:

the cluster detection on the alarm rule module of any slave node detects that the alarm rule module of the Master node is down;

the Zookeeper server selects a slave alarm rule module as the new Master alarm rule module; the new Master alarm rule module obtains the IP address of the down Master alarm rule module from the Zookeeper server;

the new Master alarm rule module acquires the alarm rules corresponding to the IP address of the down Master alarm rule module from the cache;

the new Master alarm rule module sends the alarm rules corresponding to the IP address of the down Master to the Kafka queue;

and the alarm rule module of the new Master node and the plurality of slave alarm rule modules form a new Master-slave architecture.

Preferably, the step of obtaining a plurality of monitoring alarm results in parallel by the plurality of alarm calculation modules according to the alarm rules and the data logic model operation comprises:

the alarm calculation module converts the data logic model into a monitoring request and obtains the data of the monitored object from the monitoring request;

and the alarm calculation module calculates the data of the monitored object by using the alarm rule to obtain a monitoring alarm result.

Preferably, the step of entering the Kafka queue by the plurality of alarm rules and the data logic model comprises:

a plurality of alarm rules and a data logic model are entered into a Kafka queue from a plurality of alarm rule modules.

Preferably, the step of entering the Kafka queue by the plurality of alarm rules and the data logic model includes:

a plurality of alarm rules and data logic models are loaded from the MYSQL database into the Kafka queue.

Preferably, the data logic model includes:

a plurality of namespaces, wherein the number of the namespaces is determined by a monitoring request of a monitored object;

a plurality of metrics respectively subordinate to the plurality of namespaces; the number of metrics is determined by the monitoring request of the monitored object;

a plurality of dimensions respectively subordinate to the plurality of metrics; the number of dimensions is determined by the monitoring request of the monitored object.

The automatically scheduled distributed data monitoring system and method provided by the invention have the following beneficial effects:

(1) High availability, particularly in large-data-volume environments: the distributed design enables the system to handle a greater number of alarms per unit time, so the system can be applied to big data services.

(2) High reliability: in the distributed cluster design, when any alarm computing node goes down, the alarm rule set that it maintained for calculation is dynamically transferred to other computing nodes. The data held in system memory are therefore not lost because of hardware instability in a complex environment.

(3) Multi-dimensional alarm rule configuration can be established for different monitored objects, with different time granularities and different alarm calculation methods, thereby meeting the differing alarm monitoring requirements of monitored objects in complex scenarios.

Drawings

In order to more clearly illustrate the embodiments of the present invention and the design thereof, the drawings required for the embodiments will be briefly described below. The drawings in the following description are only some embodiments of the invention and it will be clear to a person skilled in the art that other drawings can be derived from them without inventive effort.

Fig. 1 is a structural diagram of an automatically scheduled distributed data monitoring system according to embodiment 1 of the present invention;

Fig. 2 is a flowchart of an automatically scheduled distributed data monitoring method according to embodiment 1 of the present invention;

Fig. 3 is a production and consumption diagram of alarm rules in embodiment 1 of the present invention;

Fig. 4 is a flowchart of alarm rule handling when a slave goes down according to embodiment 1 of the present invention;

Fig. 5 is a flowchart of alarm rule handling when the Master goes down according to embodiment 1 of the present invention;

Fig. 6 is an interaction diagram of an alarm calculation module according to embodiment 1 of the present invention;

Fig. 7 is a tree structure diagram of a data logic model in embodiment 1 of the present invention.

Detailed Description

In order that those skilled in the art will better understand the technical solutions of the present invention and can practice the same, the present invention will be described in detail with reference to the accompanying drawings and specific examples. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "axial", "radial", "circumferential", etc. indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of describing technical solutions of the present invention and simplifying the description, but do not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.

Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In the description of the present invention, it should be noted that, unless explicitly stated or limited otherwise, the terms "connected" and "connected" are to be interpreted broadly, e.g., as a fixed connection, a detachable connection, or an integral connection; can be mechanically or electrically connected; may be directly connected or indirectly connected through an intermediate. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations. In the description of the present invention, unless otherwise specified, "a plurality" means two or more, and will not be described in detail herein.

Example 1

Referring to fig. 1, an automatically scheduled distributed data monitoring system includes: an Nginx reverse proxy server, a plurality of alarm rule modules, a Kafka queue, a plurality of alarm calculation modules, a MYSQL database, a time sequence database OpenTsdb and a centralized cache module. The Nginx reverse proxy server is in communication connection with the monitored object through the HTTP protocol, the MQTT protocol and the WebSocket protocol. The plurality of alarm rule modules are deployed in a distributed load balancing mode and are connected with the Nginx reverse proxy server through the HTTP protocol. The Kafka queue is connected with the plurality of alarm rule modules through the HTTP protocol. The plurality of alarm calculation modules are connected over the TCP protocol through the Zookeeper server. The MYSQL database is in communication connection with the plurality of alarm rule modules. The time sequence database OpenTsdb is in communication connection with the Kafka queue and the plurality of alarm calculation modules. The centralized cache module is in communication connection with the Kafka queue and the plurality of alarm calculation modules.

In this embodiment, the distributed load balancing manner is a Master-slave architecture, and the Master-slave architecture includes: a plurality of slave nodes and a Master node. The slave nodes are in communication connection with the Zookeeper server and used for computing services. The Master node is in communication connection with the Zookeeper server, is mainly used for managing a plurality of slave nodes and simultaneously bears computing services.

Referring to fig. 2, an automatically scheduled distributed data monitoring method includes the following steps: the Nginx reverse proxy server acquires a monitoring request of a monitored object and stores the monitoring request in the time sequence database OpenTsdb; when an alarm rule module is down, the plurality of alarm rule modules are redeployed; otherwise, the plurality of alarm rule modules respectively formulate alarm rules and multidimensional data logic models according to the monitoring request of the monitored object, and store the alarm rules and the data logic models in the MYSQL database; the Kafka queue decouples the plurality of alarm rule modules from the plurality of alarm calculation modules; the alarm rules and data logic models enter the Kafka queue; the Zookeeper server coordinates the plurality of alarm calculation modules so that each acquires from the Kafka queue the alarm rules and data logic models that it is to maintain, and the plurality of alarm calculation modules each perform calculation according to their alarm rules and data logic models, obtaining a plurality of monitoring alarm results in parallel; and the monitoring alarm results are pushed to the monitored object or a third-party system through the Kafka queue.

In this embodiment, the alarm rule module is mainly responsible for setting alarm rules and maintaining the data management of the monitored object. Through HTTP requests over a local area network or a wide area network, a series of alarm rule models are established, according to actual requirements, for the monitored objects that the user is concerned with. An alarm threshold can be configured, with threshold comparison rules such as "greater than", "greater than or equal to" and "less than or equal to", and the supported data aggregation modes are Sum, Average, Max, Min and Variance. An alarm rule can also configure the calculation period of the monitored data and the number of times the calculation result must exceed the threshold.
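The rule configuration described above can be illustrated with a minimal sketch. It is not the patented implementation: the names `AlarmRule` and `evaluate` are hypothetical, the calculation period is assumed to be a number of samples, "Variance" is assumed to mean population variance, a strict "less than" comparator is added for symmetry, and the exceed count is assumed to mean consecutive periods.

```python
import operator
import statistics
from dataclasses import dataclass

# Threshold comparison rules; "<" is an assumption added for symmetry.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}

# Aggregation modes named in the description.
AGGREGATORS = {
    "Sum": sum,
    "Average": statistics.fmean,
    "Max": max,
    "Min": min,
    "Variance": statistics.pvariance,  # assumption: population variance
}


@dataclass
class AlarmRule:
    aggregation: str   # key of AGGREGATORS
    comparator: str    # key of COMPARATORS
    threshold: float
    period: int        # samples per calculation period (assumed unit)
    breach_count: int  # consecutive breaching periods required to alarm


def evaluate(rule, samples):
    """Return True if the last `breach_count` complete periods all breach."""
    aggregate = AGGREGATORS[rule.aggregation]
    compare = COMPARATORS[rule.comparator]
    # Split the samples into complete, non-overlapping periods.
    last_start = len(samples) - rule.period + 1
    periods = [samples[i:i + rule.period]
               for i in range(0, last_start, rule.period)]
    if len(periods) < rule.breach_count:
        return False
    recent = periods[-rule.breach_count:]
    return all(compare(aggregate(p), rule.threshold) for p in recent)


rule = AlarmRule("Average", ">", 80.0, period=3, breach_count=2)
print(evaluate(rule, [70, 75, 72, 85, 90, 88, 91, 95, 89]))
```

In this example the averages of the last two periods both exceed 80, so the rule fires; if the breaching periods were not consecutive, it would not.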

Once the alarm calculation modules adopt a distributed architecture, each alarm calculation module node maintains only a subset of the alarm rules, as shown in fig. 3. If a node goes down, the subset of alarm rules it maintains would in principle be lost entirely, making the system unreliable.

To avoid this, the method uses the service coordination capability of Zookeeper and adopts a Master-slave mode in which the Master node manages all nodes. When a slave is detected to be down, information such as the IP address of the down machine can be obtained. Using this information, the Master node looks up all alarm rules corresponding to that IP address in the cache or the database, and then sends the alarm rule set of the down node to the Kafka queue again. The surviving alarm rule calculation nodes consume the message queue again, each consuming a subset of that rule set in random, unordered fashion. After the message queue has been consumed, the remaining alarm rule calculation nodes together still maintain the complete set of alarm rules, so the distributed system is highly reliable.

Referring to fig. 4, when a slave alarm rule module goes down, the step of redeploying the plurality of alarm rule modules includes: the cluster detection on the alarm rule module of the Master node detects that a slave is down; the alarm rule module of the Master node obtains the IP address of the down slave alarm rule module from the Zookeeper server; the alarm rule module of the Master node acquires the alarm rules corresponding to the IP address of the down slave alarm rule module from the cache data of the centralized cache module; the alarm rule module of the Master node sends the alarm rules corresponding to the IP address of the down machine to the Kafka queue; and the alarm rule module of the Master node and the surviving slave alarm rule modules form a new Master-slave architecture.
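The slave-failure flow can be sketched with in-memory stand-ins. In this simulation a `dict` keyed by IP address plays the role of the centralized cache and a `deque` plays the role of the Kafka queue; no real Kafka or Zookeeper client is used. The function names and IP addresses are hypothetical, and round-robin assignment replaces the random, unordered consumption described in the text so that the result is deterministic.

```python
from collections import deque


def requeue_rules_of_failed_node(failed_ip, rule_cache, queue):
    """Master side: find the rules owned by the failed IP and re-send them."""
    orphaned = rule_cache.pop(failed_ip, [])
    queue.extend(orphaned)
    return len(orphaned)


def consume_requeued_rules(queue, survivors, rule_cache):
    """Surviving nodes drain the queue (round-robin is an assumption)."""
    index = 0
    while queue:
        rule_id = queue.popleft()
        owner = survivors[index % len(survivors)]
        rule_cache.setdefault(owner, []).append(rule_id)
        index += 1


# Three nodes, each maintaining a subset of the rules.
rule_cache = {
    "10.0.0.1": ["r1", "r2"],
    "10.0.0.2": ["r3"],
    "10.0.0.3": ["r4", "r5"],
}
queue = deque()

# Node 10.0.0.3 goes down; its rules are re-sent and picked up by survivors.
moved = requeue_rules_of_failed_node("10.0.0.3", rule_cache, queue)
consume_requeued_rules(queue, ["10.0.0.1", "10.0.0.2"], rule_cache)
print(moved, rule_cache)
```

After recovery the two surviving nodes together still hold all five rules, which is the property the description relies on for reliability.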

Referring to fig. 5, when the Master alarm rule module goes down, the step of redeploying the plurality of alarm rule modules includes: the cluster detection on the alarm rule module of any slave node detects that the alarm rule module of the Master node is down; the Zookeeper server selects a slave alarm rule module as the new Master alarm rule module; the new Master alarm rule module obtains the IP address of the down Master alarm rule module from the Zookeeper server; the new Master alarm rule module acquires the alarm rules corresponding to the IP address of the down Master alarm rule module from the cache; the new Master alarm rule module sends the alarm rules corresponding to the IP address of the down Master to the Kafka queue; and the alarm rule module of the new Master node and the plurality of slave alarm rule modules form a new Master-slave architecture.
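A minimal sketch of the Master-failure flow follows. The text does not say how the Zookeeper server chooses the new Master; this sketch assumes the commonly used convention that the surviving node with the lowest registration sequence number wins, and simulates it with a plain `dict` rather than a real Zookeeper client. All names and addresses are hypothetical.

```python
def elect_master(live_nodes):
    """Pick the node with the lowest sequence number (assumed policy).

    live_nodes maps node IP -> registration sequence number.
    """
    if not live_nodes:
        raise ValueError("no surviving nodes to elect from")
    return min(live_nodes, key=live_nodes.get)


def handle_master_failure(nodes, master_ip, rule_cache, queue):
    """Elect a new Master and re-send the rules owned by the old one."""
    survivors = {ip: seq for ip, seq in nodes.items() if ip != master_ip}
    new_master = elect_master(survivors)
    # The new Master looks up the old Master's rules by IP and re-queues them.
    queue.extend(rule_cache.pop(master_ip, []))
    return new_master, survivors


nodes = {"10.0.0.1": 0, "10.0.0.2": 1, "10.0.0.3": 2}
rule_cache = {"10.0.0.1": ["r1"], "10.0.0.2": ["r2"], "10.0.0.3": ["r3"]}
queue = []

new_master, survivors = handle_master_failure(nodes, "10.0.0.1", rule_cache, queue)
print(new_master, sorted(survivors), queue)
```

The re-queued rules are then consumed by the surviving nodes in the same way as after a slave failure.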

The storage logic of the open source time sequence database OpenTsdb consists of the following parts: Metric, Tags, Value and Timestamp. When the data are modeled, the namespace and dimension of the monitoring data logic model both need to be converted into Tags in the OpenTsdb storage logic, and the metric of the monitoring data logic model is converted into the Metric in OpenTsdb. The acquired data values are then stored in the OpenTsdb database according to tags, metrics and time series.
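The mapping from the logic model to the storage structure can be sketched as follows. The output has the four parts named above (metric, timestamp, value, tags), which is also the shape of a data point accepted by the OpenTSDB HTTP `/api/put` endpoint. The function name and the tag keys `namespace`, `area` and `device` are assumptions made for illustration; the source does not specify them.

```python
import json


def to_opentsdb_point(namespace, metric, dimensions, value, timestamp):
    """Convert one logic-model sample into a metric/timestamp/value/tags dict.

    Namespace and dimensions both become tags; the metric name is kept as is.
    """
    if "namespace" in dimensions:
        raise ValueError("dimension key 'namespace' is reserved in this sketch")
    tags = {"namespace": namespace}  # hypothetical tag key
    tags.update(dimensions)
    return {
        "metric": metric,
        "timestamp": timestamp,  # Unix epoch seconds
        "value": value,
        "tags": tags,
    }


point = to_opentsdb_point(
    namespace="Subsystem-1",
    metric="temperature",
    dimensions={"area": "area-1", "device": "device-1"},
    value=36.5,
    timestamp=1648512000,
)
print(json.dumps(point, indent=2))
```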

As shown in fig. 6, the step of the alarm calculation module obtaining the monitoring alarm result according to the alarm rule and the data logic model operation includes: the alarm calculation module converts the data logic model into a monitoring request and obtains the data of the monitored object from the monitoring request; and the alarm calculation module calculates the data of the monitored object by using the alarm rule to obtain a monitoring alarm result.

In this embodiment, the alarm rules and data logic models enter the Kafka queue in two ways: they are entered into the Kafka queue from the plurality of alarm rule modules, or they are loaded into the Kafka queue from the MYSQL database.

The alarm calculation module acts as a consumer of the Kafka queue: the alarm rules are consumed from the Kafka queue and stored in the memory of the process, and a timer task then periodically calculates the alarm data according to the alarm calculation period in each alarm rule. The Kafka queue maintains the complete set of alarm rules of the whole system, and each alarm calculation module process consumes only a subset of them, which greatly reduces the load on each alarm calculation module and helps keep the system running healthily and stably.
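The consumer behaviour described above can be sketched as a small simulation. A `deque` stands in for the Kafka queue, round-robin hand-out stands in for however the queue actually distributes messages among consumers, and the timer task is reduced to a function that reports which rules are due at a given second. The class and function names are hypothetical.

```python
from collections import deque
from itertools import cycle


class AlarmCalculator:
    """One alarm calculation process holding its rule subset in memory."""

    def __init__(self, name):
        self.name = name
        self.rules = {}  # rule_id -> calculation period in seconds

    def consume(self, rule_id, period_seconds):
        self.rules[rule_id] = period_seconds

    def due_rules(self, now_seconds):
        """Rules whose calculation period elapses at this tick."""
        return sorted(rule_id for rule_id, period in self.rules.items()
                      if now_seconds % period == 0)


def distribute(rule_queue, calculators):
    """Hand each queued rule to exactly one calculator (round-robin assumed)."""
    targets = cycle(calculators)
    while rule_queue:
        rule_id, period = rule_queue.popleft()
        next(targets).consume(rule_id, period)


rule_queue = deque([("r1", 60), ("r2", 300), ("r3", 60), ("r4", 120)])
calc_a, calc_b = AlarmCalculator("calc-a"), AlarmCalculator("calc-b")
distribute(rule_queue, [calc_a, calc_b])

print(calc_a.due_rules(120), calc_b.due_rules(120))
```

Each calculator only evaluates its own subset at each tick, while the union of the subsets is the full rule set held by the queue.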

In this embodiment, the data logic model includes: a plurality of namespaces, a plurality of metrics and a plurality of dimensions. The metrics are respectively subordinate to the namespaces, and the dimensions are respectively subordinate to the metrics. The numbers of namespaces, metrics and dimensions are determined by the monitoring request of the monitored object. Referring to fig. 7, suppose for example that the monitored object is an Internet of Things system with two subsystems, Subsystem-1 and Subsystem-2, which are two different namespaces. If the monitoring system needs to monitor the temperature and the equipment rotating speed in the Internet of Things system, the metrics are temperature and speed. Subsystem-1 monitors device 1 in area 1, so the metrics in Subsystem-1 have two dimensions, which can be named area-1 and device-1 respectively. Subsystem-2 has only one area and monitors the data on device 2, so its metrics have only one dimension, device-2.
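The example tree can be written down as a nested mapping of namespace, then metric, then dimensions. This is only one possible representation. It assumes that both metrics (temperature and speed) apply to both subsystems, which the text implies but does not state outright; the helper name `iter_series` is hypothetical.

```python
# namespace -> metric -> list of dimensions
data_logic_model = {
    "Subsystem-1": {
        "temperature": ["area-1", "device-1"],
        "speed": ["area-1", "device-1"],
    },
    "Subsystem-2": {
        "temperature": ["device-2"],
        "speed": ["device-2"],
    },
}


def iter_series(model):
    """Walk the tree and yield one (namespace, metric, dimensions) per leaf."""
    for namespace, metrics in model.items():
        for metric, dimensions in metrics.items():
            yield namespace, metric, tuple(dimensions)


for series in iter_series(data_logic_model):
    print(series)
```

Each yielded tuple identifies one monitored series, which is the unit an alarm rule would be attached to.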

The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, and any simple changes or equivalent substitutions of the technical solutions that can be obviously obtained by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. An automatically scheduled distributed data monitoring system, comprising:

2. The automatically scheduled distributed data monitoring system of claim 1, wherein the Master-slave architecture comprises:

3. The automatically scheduled distributed data monitoring system of claim 1, further comprising:

4. An automatically scheduled distributed data monitoring method is characterized by comprising the following steps:

a plurality of alarm rules and data logic models enter a Kafka queue;

5. The method according to claim 4, wherein the step of redeploying the plurality of alarm rule modules when an alarm rule module is down comprises:

6. The method according to claim 4, wherein the step of redeploying the plurality of alarm rule modules when an alarm rule module is down comprises:

7. The distributed data monitoring method of claim 4, wherein the step of obtaining a plurality of monitoring alarm results in parallel by a plurality of alarm calculation modules according to the alarm rules and the data logic model operation comprises:

8. The automatically scheduled distributed data monitoring method of claim 4, wherein the step of entering the plurality of alarm rules and data logic models into the Kafka queue comprises:

9. The automatically scheduled distributed data monitoring method of claim 4, wherein the step of entering the plurality of alarm rules and data logic models into the Kafka queue comprises:

10. The automatically scheduled distributed data monitoring method of claim 4, wherein the data logic model comprises:

a plurality of dimensions respectively subordinate to the plurality of metrics; the number of dimensions is determined by the monitoring request of the monitored object.