CN113609139A

CN113609139A - Monitoring data management method and device, electronic equipment and storage medium

Info

Publication number: CN113609139A
Application number: CN202111164311.8A
Authority: CN
Inventors: 孙辽东
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2021-11-05
Also published as: WO2023050705A1

Abstract

The application discloses a monitoring data management method, a monitoring data management device, an electronic device and a computer readable storage medium, wherein the method comprises the following steps: creating a plurality of InfluxDB clusters; calculating the index relationship between the servers and the InfluxDB clusters by using a Hash algorithm according to the number of the servers and the number of the InfluxDB clusters, and storing the index relationship into a relational database; and collecting monitoring data of a target server, determining a target InfluxDB cluster corresponding to the target server according to the index relation, and writing the monitoring data into the target InfluxDB cluster. Therefore, the monitoring data management method provided by the application realizes the rapid storage of the monitoring data of the huge number of servers.

Description

Monitoring data management method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of storage technologies, and in particular, to a method and an apparatus for managing monitoring data, an electronic device, and a computer-readable storage medium.

Background

The artificial intelligence platform stores monitoring data in an InfluxDB database. Because of the limitations of InfluxDB itself, a single node cannot support fast writing and fast querying of the monitoring data of a huge number of servers in a supercomputing scenario, which seriously degrades the user experience.

Therefore, how to implement fast reading and writing of monitoring data of a huge number of servers is a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The present application aims to provide a monitoring data management method, a monitoring data management device, an electronic device, and a computer-readable storage medium, which implement fast reading and writing of monitoring data of a large number of servers.

In order to achieve the above object, the present application provides a monitoring data management method, including:

creating a plurality of InfluxDB clusters;

calculating the index relationship between the servers and the InfluxDB clusters by using a Hash algorithm according to the number of the servers and the number of the InfluxDB clusters, and storing the index relationship into a relational database;

and collecting monitoring data of a target server, determining a target InfluxDB cluster corresponding to the target server according to the index relation, and writing the monitoring data into the target InfluxDB cluster.

Wherein, the method further includes:

acquiring a data query command;

analyzing the data query command into a plurality of data query subcommands; each data query subcommand corresponds to a single InfluxDB cluster;

distributing each data query subcommand to the corresponding InfluxDB cluster for execution to obtain response sub-information corresponding to each data query subcommand;

and summarizing and analyzing all the response sub-information by utilizing an analysis function to obtain response information corresponding to the data query command.

After the acquiring of the data query command, the method further includes:

judging whether the data query command contains an accurate query condition; the accurate query condition comprises a single server or a single InfluxDB cluster needing to be queried;

if yes, directly responding to the data query command;

if not, the step of analyzing the data query command into a plurality of data query subcommands is executed.

Wherein, the method further includes:

monitoring the monitoring data written into the InfluxDB cluster by utilizing an alarm engine in each InfluxDB cluster according to threshold information so as to generate alarm information;

and summarizing alarm information generated by the InfluxDB cluster.

Wherein, the method further includes:

monitoring all the InfluxDB clusters for abnormity;

and if the abnormal InfluxDB cluster is monitored, recovering the abnormal InfluxDB cluster according to the abnormal type.

Wherein the recovering of the abnormal InfluxDB cluster according to the abnormal type includes:

and if the abnormal type is node abnormality, selecting a normal node in the abnormal InfluxDB cluster to take over the abnormal node.

Wherein the recovering of the abnormal InfluxDB cluster according to the abnormal type includes:

if the abnormal type is cluster abnormality, a new InfluxDB cluster is created, the index relation of the abnormal InfluxDB cluster is obtained from the relational database, and cached monitoring data is obtained from a server corresponding to the abnormal InfluxDB cluster, so that the new InfluxDB cluster replaces the abnormal InfluxDB cluster.

In order to achieve the above object, the present application provides a monitoring data management apparatus, including:

the creating module is used for creating a plurality of InfluxDB clusters;

the computing module is used for computing the index relationship between the servers and the InfluxDB clusters by utilizing a Hash algorithm according to the number of the servers and the number of the InfluxDB clusters and storing the index relationship into a relational database;

and the write-in module is used for acquiring monitoring data of a target server, determining a target InfluxDB cluster corresponding to the target server according to the index relation, and writing the monitoring data into the target InfluxDB cluster.

To achieve the above object, the present application provides an electronic device including:

a memory for storing a computer program;

and a processor for implementing the steps of the monitoring data management method when executing the computer program.

To achieve the above object, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the monitoring data management method as described above.

According to the scheme, the monitoring data management method provided by the application comprises the following steps: creating a plurality of InfluxDB clusters; calculating the index relationship between the servers and the InfluxDB clusters by using a Hash algorithm according to the number of the servers and the number of the InfluxDB clusters, and storing the index relationship into a relational database; and collecting monitoring data of a target server, determining a target InfluxDB cluster corresponding to the target server according to the index relation, and writing the monitoring data into the target InfluxDB cluster.

According to the monitoring data management method, the monitoring data of a large number of servers are stored through the plurality of InfluxDB clusters, and the monitoring data of each server are written into the corresponding InfluxDB cluster in parallel, so that the monitoring data are written rapidly. Therefore, the monitoring data management method provided by the application realizes the rapid storage of the monitoring data of the huge number of servers. The application also discloses a monitoring data management device, an electronic device and a computer readable storage medium, which can also realize the technical effects.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:

FIG. 1 is a flow diagram illustrating a method of monitoring data management according to an exemplary embodiment;

FIG. 2 is a diagram illustrating an indexing relationship of a server to an InfluxDB cluster in accordance with an illustrative embodiment;

FIG. 3 is a flow diagram illustrating another monitoring data management method in accordance with an exemplary embodiment;

FIG. 4 is a block diagram illustrating a monitoring data management device in accordance with an exemplary embodiment;

FIG. 5 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In addition, in the embodiments of the present application, "first", "second", and the like are used for distinguishing similar objects, and are not necessarily used for describing a specific order or a sequential order.

The embodiment of the application discloses a monitoring data management method, which realizes the quick reading and writing of monitoring data of a large number of servers.

Referring to FIG. 1, a flowchart of a monitoring data management method according to an exemplary embodiment is shown. As shown in FIG. 1, the method includes:

S101: creating a plurality of InfluxDB clusters;

S102: calculating the index relationship between the servers and the InfluxDB clusters by using a Hash algorithm according to the number of the servers and the number of the InfluxDB clusters, and storing the index relationship into a relational database;

In this embodiment, a plurality of InfluxDB clusters are created to store the monitoring data of a large number of servers; that is, dynamic expansion of the InfluxDB clusters is realized based on the idea of database sharding, and Ansible (an automated operation and maintenance tool) can be used to complete one-click deployment and dynamic sharding expansion of the entire artificial intelligence platform. In specific implementation, the number of InfluxDB clusters to be created is calculated from the total number of servers and the number of servers corresponding to a single InfluxDB cluster. For example, if the number of servers is 900 and one InfluxDB cluster stores the monitoring data of 200 servers, that is, the number of servers corresponding to a single InfluxDB cluster is 200, then the number of InfluxDB clusters to be created is 5 (900/200 rounded up), and their indexes are 0, 1, 2, 3, and 4, respectively. The index relationship between the servers and the InfluxDB clusters is calculated by a Hash algorithm according to the number n of servers and the number of InfluxDB clusters. The calculation, shown in FIG. 2, is (Hash(cluster index number) + Hash(node name)). FIG. 2 contains 2^32 positions in total, in which a hollow circle represents a node position. When a new InfluxDB cluster is added, a position with no associated node must be found, and the index is then calculated in reverse from that position.
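The sizing rule and the ring placement described above can be sketched as follows. This is only an illustrative sketch: the description gives the expression Hash(cluster index number) + Hash(node name) and the 2^32 positions, but the concrete hash function (truncated MD5 here), the reduction modulo 2^32, the clockwise lookup of a server's cluster, and all function and server names are assumptions and are not taken from the application.

```python
import bisect
import hashlib
import math

RING_SIZE = 2 ** 32  # FIG. 2 describes 2^32 positions


def clusters_needed(num_servers, servers_per_cluster):
    """Number of clusters to create, rounding up (900 servers / 200 -> 5)."""
    return math.ceil(num_servers / servers_per_cluster)


def h32(key):
    """Assumed hash: first 4 bytes of MD5, giving a value in [0, 2^32)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


def build_ring(num_clusters, node_name="influx-node"):
    """Place each cluster at Hash(cluster index) + Hash(node name) on the ring.

    Reducing the sum modulo 2^32 is an assumption made to stay on the ring.
    """
    ring = [((h32(str(i)) + h32(node_name)) % RING_SIZE, i) for i in range(num_clusters)]
    ring.sort()
    return ring


def cluster_for(server_name, ring):
    """Assumed lookup: the first cluster position at or after the server's hash."""
    positions = [pos for pos, _ in ring]
    slot = bisect.bisect_left(positions, h32(server_name)) % len(ring)
    return ring[slot][1]


# Build the index relationship for the 900-server example (names are hypothetical).
ring = build_ring(clusters_needed(900, 200))
index = {f"server-{n:03d}": cluster_for(f"server-{n:03d}", ring) for n in range(900)}
```

Note that a plain ring lookup like this does not by itself cap a cluster at 200 servers; a real deployment would need an additional balancing rule, which the description does not specify.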

It can be understood that each server stores the index relationship between itself and its corresponding InfluxDB cluster, and each InfluxDB cluster also stores the index relationship between itself and its corresponding servers. All the index relationships may be stored in a relational database, such as MariaDB, for backup. When the relational database has a plurality of nodes, direct in-memory data synchronization between the nodes through RPC (Remote Procedure Call) in an incremental-update manner is supported. Further, three scripted deployment modes are built in for each InfluxDB cluster: direct deployment on virtual/physical machines, containerized deployment, and k8s (Kubernetes) deployment.
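As a rough illustration of backing up the index relationship in a relational database, the sketch below uses Python's built-in sqlite3 module as a stand-in for a database such as MariaDB. The table name, column names, and helper functions are hypothetical, and the upsert statement uses SQLite syntax that would need adapting for another database.

```python
import sqlite3


def open_index_store(path=":memory:"):
    """Open the backing store and create the (hypothetical) index table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS server_cluster_index ("
        "server_name TEXT PRIMARY KEY, cluster_index INTEGER NOT NULL)"
    )
    return conn


def save_index(conn, index):
    """Insert or update server -> cluster rows in one transaction."""
    with conn:
        conn.executemany(
            # SQLite upsert syntax; other databases use a different statement.
            "INSERT OR REPLACE INTO server_cluster_index VALUES (?, ?)",
            list(index.items()),
        )


def lookup_cluster(conn, server_name):
    """Return the cluster index for a server, or None if it is not registered."""
    row = conn.execute(
        "SELECT cluster_index FROM server_cluster_index WHERE server_name = ?",
        (server_name,),
    ).fetchone()
    return None if row is None else row[0]


def servers_of(conn, cluster_index):
    """Return the servers mapped to one cluster (the reverse relationship)."""
    rows = conn.execute(
        "SELECT server_name FROM server_cluster_index "
        "WHERE cluster_index = ? ORDER BY server_name",
        (cluster_index,),
    )
    return [r[0] for r in rows]
```

Keeping both directions queryable from one table matches the description's point that the servers and the clusters each hold their side of the relationship while the database serves as the backup copy.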

S103: and collecting monitoring data of a target server, determining a target InfluxDB cluster corresponding to the target server according to the index relation, and writing the monitoring data into the target InfluxDB cluster.

In this embodiment, each server includes the collection component Telegraf, which collects the server's own monitoring data and outputs it to an InfluxDB cluster. In specific implementation, the data output mode of Telegraf is modified so that the output of monitoring data is completed dynamically: a new configuration item is added to the configuration file of the original Telegraf component to define a new data output mode. In this output mode, the target InfluxDB cluster corresponding to a target server is determined according to the index relationship, and the monitoring data of the target server is written into that target InfluxDB cluster.
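The routing behaviour of the new output mode can be sketched independently of Telegraf itself. The class below is a hypothetical illustration, not Telegraf's plugin API: it assumes the index relationship is available as a dictionary and that each cluster is represented by a callable that accepts a batch of points (shown here as InfluxDB line protocol strings).

```python
class IndexedOutput:
    """Route a server's monitoring data to its cluster via the index relationship."""

    def __init__(self, index, writers):
        self.index = index      # server name -> cluster index
        self.writers = writers  # cluster index -> callable taking a list of points

    def write(self, server_name, points):
        """Write the points to the target cluster and return that cluster's index."""
        cluster = self.index[server_name]  # raises KeyError for an unknown server
        self.writers[cluster](points)
        return cluster


# Usage with in-memory lists standing in for two clusters.
received = {0: [], 1: []}
output = IndexedOutput(
    index={"server-001": 0, "server-002": 1},
    writers={c: received[c].extend for c in received},
)
output.write("server-002", ["cpu,host=server-002 usage=12.5"])
```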

As a preferred embodiment, an adapter may also be customized to convert the collected monitoring data into a data format that the InfluxDB cluster can store. It should be noted that data warehousing supports both periodic writing and writing on cache-space overflow, whichever condition is met first. If the maximum number of attempts is exceeded or the corresponding InfluxDB cluster temporarily cannot be written, the data is stored in a cache pool; if the data still cannot be written after a preset time, the data is discarded.
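The retry, cache-pool, and discard behaviour described above might look roughly like the following. This is a sketch under stated assumptions: the class name, the use of ConnectionError to signal an unavailable cluster, and the parameter names max_attempts and max_age_s are all hypothetical, and the periodic and overflow triggers that would call flush are left out.

```python
import time
from collections import deque


class BufferedWriter:
    """Retry writes, park failures in a cache pool, and drop stale cached data."""

    def __init__(self, write_fn, max_attempts=3, max_age_s=60.0, clock=time.monotonic):
        self.write_fn = write_fn          # callable that writes one point to a cluster
        self.max_attempts = max_attempts  # maximum number of write attempts
        self.max_age_s = max_age_s        # preset time after which cached data is discarded
        self.clock = clock
        self.cache_pool = deque()         # entries are (time cached, point)

    def write(self, point):
        """Try to write; on repeated failure, store the point in the cache pool."""
        for _ in range(self.max_attempts):
            try:
                self.write_fn(point)
                return True
            except ConnectionError:
                continue
        self.cache_pool.append((self.clock(), point))
        return False

    def flush(self):
        """Retry cached points; return (number written, number discarded as too old)."""
        now = self.clock()
        kept = deque()
        written = dropped = 0
        for cached_at, point in self.cache_pool:
            if now - cached_at > self.max_age_s:
                dropped += 1
                continue
            try:
                self.write_fn(point)
                written += 1
            except ConnectionError:
                kept.append((cached_at, point))
        self.cache_pool = kept
        return written, dropped
```

Passing the clock in as a parameter is a design choice made here so the expiry rule can be exercised without waiting in real time.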

According to the monitoring data management method provided by the embodiment of the application, monitoring data of a large number of servers are stored through a plurality of InfluxDB clusters, and the monitoring data of each server are written into the corresponding InfluxDB clusters in parallel, so that the monitoring data are written quickly. Therefore, the monitoring data management method provided by the application realizes the rapid storage of the monitoring data of the huge number of servers.

This embodiment introduces a data query method, specifically:

Referring to FIG. 3, a flowchart of another monitoring data management method according to an exemplary embodiment is shown. As shown in FIG. 3, the method includes:

S201: acquiring a data query command;

In this embodiment, the data query command may specifically be SQL (Structured Query Language). A global interceptor (following the aspect idea in Java) is added at the InfluxDB layer to intercept all data query commands, that is, all DAO methods, and to judge whether the data query command contains an accurate query condition; the accurate query condition comprises a single server or a single InfluxDB cluster needing to be queried. If yes, the data query command is responded to directly; if not, the process proceeds to step S202.
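The interceptor's decision can be illustrated with a small sketch. The Query type and its fields are hypothetical; the description only states that an accurate query condition names a single server or a single InfluxDB cluster, and does not say how the condition is extracted from the query command.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Query:
    """Hypothetical parsed form of a data query command."""
    measurement: str
    server: Optional[str] = None   # set when a single server is queried
    cluster: Optional[int] = None  # set when a single cluster is queried


def is_accurate(query):
    """True if the query names a single server or a single cluster."""
    return query.server is not None or query.cluster is not None


def target_clusters(query, index, all_clusters):
    """Clusters the query must reach: one for an accurate query, otherwise all."""
    if query.cluster is not None:
        return [query.cluster]
    if query.server is not None:
        return [index[query.server]]  # resolve the server through the index relationship
    return list(all_clusters)
```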

S202: analyzing the data query command into a plurality of data query subcommands; each data query subcommand corresponds to a single InfluxDB cluster;

S203: distributing each data query subcommand to the corresponding InfluxDB cluster for execution to obtain response sub-information corresponding to each data query subcommand;

In specific implementation, the data query command is parsed into a plurality of data query subcommands according to the InfluxDB clusters to be queried. Each data query subcommand queries one InfluxDB cluster and is distributed to the corresponding InfluxDB cluster; the subcommands are executed in parallel, and blocking callback is supported, so that the response sub-information of each InfluxDB cluster is obtained.
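A minimal sketch of the parallel distribution, using a thread pool from the Python standard library. The function names are hypothetical, the subcommand for each cluster is simply a copy of the original query text, and the execute callable stands in for whatever client actually talks to an InfluxDB cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_query(query, clusters):
    """One subcommand per cluster; in this sketch each reuses the same query text."""
    return {cluster: query for cluster in clusters}


def fan_out(subcommands, execute):
    """Run execute(cluster, subcommand) for every cluster in parallel.

    Blocks until all subcommands finish and returns cluster -> response sub-information.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(subcommands))) as pool:
        futures = {c: pool.submit(execute, c, q) for c, q in subcommands.items()}
        return {c: f.result() for c, f in futures.items()}
```

Calling result() on each future is what makes the call blocking here; an exception raised by one cluster's subcommand propagates to the caller at that point.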

S204: and summarizing and analyzing all the response sub-information by utilizing an analysis function to obtain response information corresponding to the data query command.

In specific implementation, the SQL is dynamically parsed to determine whether data summarization using an analysis function is required. Annotations are added in the Object Relational Mapping (ORM) object to mark the attributes that need to participate in the calculation, and the analysis functions are customized on the basis of the functions of InfluxDB; they include the mean, maximum, minimum, variance, latest value, and the like. Taking the mean function as an example, each InfluxDB cluster obtains a calculation result, i.e., response sub-information, using its own mean function; the calculation results generated by all the InfluxDB clusters are then summed and divided by the number of InfluxDB clusters to obtain the final calculation result, i.e., the response information corresponding to the data query command.
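The summarizing step for a few of the listed analysis functions can be sketched as below. merge_mean follows the description literally (the per-cluster means divided by the number of clusters). merge_mean_weighted is an addition that is not in the description: it assumes each cluster also reports how many points its mean covers, which gives the exact overall mean when clusters hold different amounts of data. All function names are hypothetical.

```python
def merge_mean(cluster_means):
    """Mean as described: sum of per-cluster means divided by the number of clusters."""
    return sum(cluster_means) / len(cluster_means)


def merge_mean_weighted(partials):
    """Assumed variant: partials are (mean, count) pairs, weighted by count."""
    total = sum(count for _, count in partials)
    return sum(mean * count for mean, count in partials) / total


def merge_max(cluster_maxima):
    """The overall maximum is the largest of the per-cluster maxima."""
    return max(cluster_maxima)


def merge_min(cluster_minima):
    """The overall minimum is the smallest of the per-cluster minima."""
    return min(cluster_minima)
```

The two mean functions agree whenever every cluster contributes the same number of points, and can differ otherwise.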

Therefore, the embodiment realizes the parallel data query of a plurality of InfluxDB clusters, and the query results of all the InfluxDB clusters are summarized and analyzed by using the analysis function, so that the data query efficiency is improved.

On the basis of the above embodiment, as a preferred implementation, the method further includes: monitoring the monitoring data written into the InfluxDB cluster by utilizing an alarm engine in each InfluxDB cluster according to threshold information so as to generate alarm information; and summarizing alarm information generated by the InfluxDB cluster.

In specific implementation, an alarm engine is deployed in each InfluxDB cluster, and when a new InfluxDB cluster is added, the deployment of an alarm engine module is dynamically completed. The alarm engine is used for monitoring the written monitoring data according to threshold information to generate alarm information, the threshold information can be threshold range, enable/disable, alarm frequency and the like, and is issued to each alarm engine by the service module, and in addition, the service module can also update the threshold information in each alarm engine. Furthermore, the alarm information summarizing component is deployed for unified processing of the alarm information, and the unified processing may include data deduplication, alarm mail generation, alarm information storage and the like.
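A simplified view of a per-cluster alarm check and the summarizing component is sketched below. The Threshold fields mirror the threshold range and enable/disable settings mentioned above; alarm frequency, mail generation, and storage are omitted, and the representation of an alarm as a (server, metric, value) tuple is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Hypothetical threshold information issued to an alarm engine."""
    metric: str
    low: float
    high: float
    enabled: bool = True


def check(threshold, server, value):
    """Return an alarm tuple if the value is outside the enabled range, else None."""
    if not threshold.enabled or threshold.low <= value <= threshold.high:
        return None
    return (server, threshold.metric, value)


def summarize(alarms_per_cluster):
    """Merge the alarms of all clusters, removing duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for alarms in alarms_per_cluster:
        for alarm in alarms:
            if alarm is not None and alarm not in seen:
                seen.add(alarm)
                merged.append(alarm)
    return merged
```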

On the basis of the above embodiment, as a preferred implementation, the method further includes: monitoring all the InfluxDB clusters for abnormity; and if the abnormal InfluxDB cluster is monitored, recovering the abnormal InfluxDB cluster according to the abnormal type.

In specific implementation, to ensure that the plurality of InfluxDB clusters can store monitoring data normally, all the InfluxDB clusters are checked periodically, and if an abnormal state is detected, a fast-recovery monitoring event is triggered. Each server caches its own monitoring data for the most recent period of time for use in abnormality recovery.

As a feasible implementation manner, if the abnormal type is node abnormality, that is, an abnormal node is detected in the abnormal InfluxDB cluster, a normal node is selected from the abnormal InfluxDB cluster to take over the abnormal node.

As another feasible implementation manner, if the abnormal type is cluster abnormality, a new InfluxDB cluster is created, the index relationship of the abnormal InfluxDB cluster is obtained from the relational database, and the cached monitoring data is obtained from the servers corresponding to the abnormal InfluxDB cluster, so that the new InfluxDB cluster replaces the abnormal InfluxDB cluster.
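The two recovery paths can be sketched with plain data structures. This is an illustrative sketch only: how a normal node is chosen, how a new cluster is actually created, and the shapes of the inputs (dictionaries for node states, the index relationship, and the per-server caches) are assumptions that the description does not fix.

```python
def take_over(node_states):
    """Node abnormality: map each abnormal node to a normal node of the same cluster.

    node_states maps node name -> True if the node is normal.
    Picking the first normal node by name is an arbitrary choice for this sketch.
    """
    normal = sorted(name for name, ok in node_states.items() if ok)
    if not normal:
        raise RuntimeError("no normal node left; handle as a cluster abnormality")
    return {name: normal[0] for name, ok in sorted(node_states.items()) if not ok}


def replace_cluster(failed_index, index, server_caches):
    """Cluster abnormality: assemble what a replacement cluster needs.

    index is the server -> cluster relationship read back from the relational
    database; server_caches holds each server's recently cached monitoring data.
    """
    servers = sorted(s for s, c in index.items() if c == failed_index)
    restored = []
    for server in servers:
        restored.extend(server_caches.get(server, []))
    return {"index": failed_index, "servers": servers, "points": restored}
```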

In the following, a monitoring data management apparatus provided in an embodiment of the present application is introduced, and a monitoring data management apparatus described below and a monitoring data management method described above may be referred to each other.

Referring to fig. 4, a block diagram of a monitoring data management apparatus according to an exemplary embodiment is shown, as shown in fig. 4, including:

a creating module 401, configured to create a plurality of InfluxDB clusters;

a calculating module 402, configured to calculate the index relationship between the servers and the InfluxDB clusters by using a Hash algorithm according to the number of servers and the number of InfluxDB clusters, and store the index relationship in a relational database;

a writing module 403, configured to collect monitoring data of a target server, determine the target InfluxDB cluster corresponding to the target server according to the index relationship, and write the monitoring data into the target InfluxDB cluster.

The monitoring data management device provided by the embodiment of the application stores monitoring data of a large number of servers through a plurality of InfluxDB clusters, and the monitoring data of each server is written into the corresponding InfluxDB cluster in parallel, so that the monitoring data is written quickly. Therefore, the monitoring data management device provided by the application realizes the rapid storage of the monitoring data of the huge servers.

On the basis of the above embodiment, as a preferred implementation, the apparatus further includes:

the acquisition module is used for acquiring a data query command;

the analysis module is used for analyzing the data query command into a plurality of data query subcommands; each data query subcommand corresponds to a single InfluxDB cluster;

the execution module is used for distributing each data query subcommand to the corresponding InfluxDB cluster for execution to obtain the response sub-information corresponding to each data query subcommand;

and the first summarizing module is used for summarizing and analyzing all the response sub-information by utilizing an analysis function to obtain the response information corresponding to the data query command.

the judging module is used for judging whether the data query command contains an accurate query condition; the accurate query condition comprises a single server or a single InfluxDB cluster needing to be queried; if yes, starting the working process of the response module; if not, starting the working process of the analysis module;

and the response module is used for responding to the data query command.

the monitoring module is used for monitoring the monitoring data written into the InfluxDB cluster by utilizing an alarm engine in each InfluxDB cluster according to threshold information so as to generate alarm information;

and the second summarizing module is used for summarizing the alarm information generated by the InfluxDB cluster.

the monitoring module is used for monitoring all the InfluxDB clusters for abnormity;

and the recovery module is used for recovering the abnormal InfluxDB cluster according to the abnormal type when the abnormal InfluxDB cluster is monitored.

On the basis of the above embodiment, as a preferred implementation manner, if the abnormal type is node abnormality, the recovery module is specifically a module that selects a normal node in the abnormal InfluxDB cluster to take over the abnormal node.

On the basis of the above embodiment, as a preferred implementation manner, each server caches its monitoring data of a latest preset duration; if the abnormal type is cluster abnormality, the recovery module is specifically a module that creates a new InfluxDB cluster, obtains the index relationship of the abnormal InfluxDB cluster from the relational database, and obtains the cached monitoring data from the servers corresponding to the abnormal InfluxDB cluster, so that the new InfluxDB cluster replaces the abnormal InfluxDB cluster.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

Based on the hardware implementation of the program module, and in order to implement the method according to the embodiment of the present application, an embodiment of the present application further provides an electronic device, and fig. 5 is a structural diagram of an electronic device according to an exemplary embodiment, as shown in fig. 5, the electronic device includes:

a communication interface 1 capable of information interaction with other devices such as network devices and the like;

and the processor 2 is connected with the communication interface 1 to realize information interaction with other equipment, and is used for executing the monitoring data management method provided by one or more technical schemes when running a computer program. And the computer program is stored on the memory 3.

In practice, of course, the various components in the electronic device are coupled together by the bus system 4. It will be appreciated that the bus system 4 is used to enable connection communication between these components. The bus system 4 comprises, in addition to a data bus, a power bus, a control bus and a status signal bus. For the sake of clarity, however, the various buses are labeled as bus system 4 in fig. 5.

The memory 3 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device. Examples of such data include: any computer program for operating on an electronic device.

It will be appreciated that the memory 3 may be either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 3 described in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.

The method disclosed in the above embodiment of the present application may be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 2. The processor 2 described above may be a general purpose processor, a DSP, or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like. The processor 2 may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium located in the memory 3, and the processor 2 reads the program in the memory 3 and in combination with its hardware performs the steps of the aforementioned method.

When the processor 2 executes the program, the corresponding processes in the methods according to the embodiments of the present application are realized, and for brevity, are not described herein again.

In an exemplary embodiment, the present application further provides a storage medium, i.e. a computer storage medium, specifically a computer readable storage medium, for example, including a memory 3 storing a computer program, which can be executed by a processor 2 to implement the steps of the foregoing method. The computer readable storage medium may be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for managing monitored data, comprising:

creating a plurality of InfluxDB clusters;

2. The monitoring data management method according to claim 1, further comprising:

acquiring a data query command;

3. The monitoring data management method according to claim 2, wherein after the obtaining the data query command, the method further comprises:

if yes, directly responding to the data query command;

4. The monitoring data management method according to claim 1, further comprising:

and summarizing alarm information generated by the InfluxDB cluster.

5. The monitoring data management method according to any one of claims 1 to 4, characterized by further comprising:

monitoring all the InfluxDB clusters for abnormity;

6. The monitoring data management method according to claim 5, wherein the recovering of the abnormal InfluxDB cluster according to the abnormal type includes:

7. The monitoring data management method according to claim 5, wherein monitoring data of a latest preset duration of the server is cached in the server, and the recovering of the abnormal InfluxDB cluster according to the abnormal type includes:

8. A monitoring data management apparatus, comprising:

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the monitoring data management method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the monitoring data management method according to any one of claims 1 to 7.