CN117596126B - Monitoring method for high-speed network abnormality in high-performance cluster

Info

Publication number: CN117596126B
Application number: CN202410079549.8A
Authority: CN
Inventors: 戴煜; 刘翀; 康浩鹏; 张家杰; 姚胜
Original assignee: Hefei Advanced Computing Center Operations Management Co ltd
Current assignee: Hefei Advanced Computing Center Operations Management Co ltd
Priority date: 2024-01-19
Filing date: 2024-01-19
Publication date: 2024-03-26
Anticipated expiration: 2044-01-19
Also published as: CN117596126A

Abstract

The invention relates to the technical field of distributed computing and discloses a monitoring method for high-speed network abnormality in a high-performance cluster, comprising the following steps: acquiring information of cluster nodes with abnormal network connections; acquiring information of the parallel communication domains in which a first cluster node is located; generating domain relations according to the information of the parallel communication domains in which the first cluster node is located, a domain relation being generated for two cluster nodes if they communicate with each other within one parallel communication domain; generating a characterization feature for each cluster node; and inputting the characterization features of the cluster nodes into an anomaly identification model, which outputs a result representing the network anomaly cause type of the first cluster node. By training a model to learn the patterns in which network anomalies of cluster nodes occur in the communication domains at each layer, the invention identifies the cause of a cluster node's network anomaly, so that a targeted response and adjustment can be made in time and the service response speed of the high-performance computing cluster is stabilized.

Description

Monitoring method for high-speed network abnormality in high-performance cluster

Technical Field

The invention relates to the technical field of distributed computing, in particular to a monitoring method for high-speed network abnormality in a high-performance cluster.

Background

A high-performance computing cluster is an architecture in which multiple computers are used for parallel computing, and messages are passed between nodes through a cross-language communication protocol. As cluster scale and cluster performance continue to grow, the impact of network delay on the high-performance computing cluster becomes increasingly significant. In the related art, the network state of a node is judged from the delay of data packets recorded in logs; for example, the invention with publication number CN115002001B discloses a method, apparatus, device and readable medium for detecting sub-health of a cluster network, in which nodes with an abnormal network state are switched according to the delay and packet loss rate of data transmission;

with this general approach, it is only possible to determine whether a node of the high-performance computing cluster has a network anomaly; the cause of the anomaly cannot be analyzed. Operation and maintenance personnel must inspect logs and debug, and analyzing the specific cause takes a great deal of time. To keep the cluster running normally, the network anomaly of a node can therefore only be handled by switching nodes. Because the cause remains unclear, however, the problem may persist after the switch, which leads to frequent node switching and degrades the service response speed of the high-performance computing cluster.

Disclosure of Invention

The invention provides a monitoring method for high-speed network abnormality in a high-performance cluster, which solves the technical problem in the related art that, because the cause of a network anomaly cannot be analyzed, the network anomaly of a node can only be handled by switching nodes, which degrades the service response speed of the high-performance computing cluster.

The invention provides a monitoring method for high-speed network abnormality in a high-performance cluster, which comprises the following steps:

step 101, obtaining information of cluster nodes with abnormal network connection, defining the cluster nodes with abnormal network connection as first cluster nodes, and defining the cluster nodes outside the first cluster nodes as second cluster nodes;

step 102, obtaining information of a parallel communication domain where a first cluster node is located;

step 103, generating a domain relation according to the information of the parallel communication domain where the first cluster node is located, and if a relation of mutual communication exists between two cluster nodes in one parallel communication domain, generating the domain relation for the two cluster nodes;

step 104, generating a characterization feature for each cluster node;

step 105, inputting the characterization features of the cluster nodes into an anomaly identification model, and outputting a result representing the type of network anomaly cause of the first cluster node.

Further, the definition of the network connection abnormality of the cluster node is that the establishment of the network connection exceeds a set first time.

Further, the characterization feature of a cluster node is generated by encoding information about the subtasks executed by the cluster node.

Further, the anomaly identification model includes:

the feature fusion layer has the following calculation formula:

[formulas not reproduced in the source text]

where the symbols of the formulas denote, in order: the fusion feature of the i-th cluster node of the k-th parallel communication domain; the characterization features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the first weight parameter and the second weight parameter; the weight vector of the first hidden layer; the set of cluster nodes connected by a domain relation to the i-th cluster node in the k-th parallel communication domain; the activation function; the fusion weight between the i-th and j-th cluster nodes of the k-th parallel communication domain; the intermediate features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the exponential function with the natural constant as its base; and vector concatenation. T denotes transposition;

the cross-domain fusion layer has the following calculation formula:

[formula not reproduced in the source text]

where the symbols of the formula denote, in order: the fusion features of the first cluster node in the k-th and h-th parallel communication domains; the characterization feature of the first cluster node; the first bias parameter; the third weight parameter; and the cross-domain fusion feature of the first cluster node. M denotes the total number of parallel communication domains in which the first cluster node is located;

the output layer has the following calculation formula:

[formula not reproduced in the source text]

where the symbols of the formula denote, in order: the fourth weight parameter, the second bias parameter, and the output vector.

Further, each component of the output vector corresponds to one network anomaly cause type of the first cluster node, and the network anomaly cause type represented by the largest component of the output vector is taken as the output result.

The invention provides a monitoring system for high-speed network abnormality in a high-performance cluster, which comprises:

the abnormality identification module is used for identifying cluster nodes with abnormal network connection;

the information acquisition module is used for acquiring information of cluster nodes with abnormal network connection and acquiring information of parallel communication domains where the first cluster nodes are located;

the domain relation generating module generates a domain relation according to the information of the parallel communication domain where the first cluster node is located;

a characterization module for generating a characterization feature for each cluster node;

and the pattern recognition module is used for inputting the characterization characteristics of the cluster nodes into the anomaly recognition model and outputting a result representing the type of network anomaly cause of the first cluster node.

Further, the types of network anomaly causes of the first cluster node include communication domain allocation errors and excessive CPU usage.

A communication domain allocation error means: a process of the first cluster node executes a certain service, but the parallel communication domain assigned to the cluster nodes organized by that service does not contain the first cluster node;

excessively high CPU usage means: the first cluster node uses a virtual network service, and an excessive number of data packets causes the CPU usage of running the virtual network service to be too high.

Further, the monitoring system also comprises an abnormal node control module, which executes a corresponding control strategy to control the cluster nodes based on the network anomaly cause type of the first cluster node.

Further, if the network abnormality cause type of the first cluster node is that the CPU usage is too high, one of the following policies is adopted: transferring part of subtasks of the first cluster node to the second cluster node until the abnormal network connection of the first cluster node disappears;

transferring all subtasks of the first cluster node to a second cluster node with higher CPU processing performance;

if the network abnormality cause type of the first cluster node is a communication domain allocation error, the following strategy is adopted: the first cluster node is registered in the parallel communication domain organized by the service executed by the corresponding process.

The present invention provides a storage medium storing non-transitory computer readable instructions that, when executed by a computer, are capable of performing the steps of a method of monitoring for high-speed network anomalies in a high-performance cluster as described above.

The beneficial effects of the invention are as follows: by training a model to learn the patterns in which network anomalies of cluster nodes occur in the communication domains at each layer, the invention identifies the cause of a cluster node's network anomaly, so that a targeted response and adjustment can be made in time and the service response speed of the high-performance computing cluster is stabilized.

Drawings

FIG. 1 is a flow chart of a method of monitoring for high speed network anomalies in a high performance cluster in accordance with the present invention;

FIG. 2 is a schematic diagram of a monitoring system for high-speed network anomalies in a high-performance cluster according to one embodiment of the present invention;

FIG. 3 is a second schematic module diagram of the monitoring system for high-speed network anomalies in a high-performance cluster according to the present invention.

In the figure: the system comprises an anomaly identification module 201, an information acquisition module 202, a domain relation generation module 203, a characterization module 204, a pattern identification module 205 and an anomaly node control module 206.

Detailed Description

The subject matter described herein will now be discussed with reference to example embodiments. It is to be understood that these embodiments are merely discussed so that those skilled in the art may better understand and implement the subject matter described herein and that changes may be made in the function and arrangement of the elements discussed without departing from the scope of the disclosure herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.

In at least one embodiment of the present invention, a method for monitoring a high-speed network anomaly in a high-performance cluster is provided, as shown in fig. 1, including the following steps:

here, a cluster node is defined as having a network connection anomaly when the time taken to establish the network connection exceeds a set first time. For a high-performance computing cluster, the processing response time of a cluster node is typically 100 ms, so the first time can be set to 10 ms; by the barrel effect (overall performance is bounded by the slowest node), a connection delay above this level has a large impact on the performance of the high-performance computing cluster.

The time taken to establish a network connection is typically reflected in the delays of the data packets recorded in the log;

with sporadic network connection anomalies, the delays of individual data packets differ. To improve robustness, the packet delays in the log of a cluster node are sorted in descending order, and the delay at rank L is taken as the time for establishing the network connection of that cluster node; the default value of L is 3.

Other statistical definitions of delay known in the art may of course be used instead. Note that the delay of a lost packet is generally defined as infinite.
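The rank-L rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the millisecond units, and the fallback used when fewer than L samples exist are assumptions introduced here, while the default L = 3 and the 10 ms first time come from the description.

```python
import math


def connection_time(delays_ms, rank=3):
    """Return the rank-th largest packet delay (1-based rank).

    Lost packets are expected to be recorded as math.inf.
    """
    if not delays_ms:
        raise ValueError("no delay samples")
    ordered = sorted(delays_ms, reverse=True)  # largest delay first
    # Assumption: with fewer than `rank` samples, use the smallest one available.
    index = min(rank, len(ordered)) - 1
    return ordered[index]


def is_connection_abnormal(delays_ms, first_time_ms=10.0, rank=3):
    """A node is abnormal when its connection time exceeds the first time."""
    return connection_time(delays_ms, rank) > first_time_ms


# One lost packet and one spike do not flag the node ...
sporadic = [2.1, 3.0, 55.0, 2.4, math.inf, 2.2]
# ... but three slow packets do.
persistent = [12.0, 15.0, 30.0, 2.0]
```

Taking the L-th largest value rather than the maximum means that up to L - 1 outliers, including lost packets, are ignored before a node is flagged.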

A parallel communication domain comprises a group of processes that can communicate and cooperate with one another. The processes are run by cluster nodes, so each process corresponds to a cluster node, and the relationships between the processes of a parallel communication domain can be mapped to relationships between cluster nodes;

here, the relationships between processes in a parallel communication domain include communicating and cooperating with each other.

The first cluster node may belong to several different parallel communication domains, so an ID is defined for each parallel communication domain, and domain relations can be distinguished by the ID of the parallel communication domain from which they originate.
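A minimal sketch of how domain relations keyed by domain ID might be built is given below. The input layout (a mapping from domain ID to communicating process pairs, plus a process-to-node mapping) and all names are hypothetical; the description does not specify a data format.

```python
def build_domain_relations(domains, process_to_node, first_node):
    """Map each domain ID to the set of node pairs that communicate in it.

    Only domains that contain `first_node` are kept.
    """
    relations = {}
    for domain_id, process_pairs in domains.items():
        edges = set()
        for proc_a, proc_b in process_pairs:
            node_a = process_to_node[proc_a]
            node_b = process_to_node[proc_b]
            if node_a != node_b:  # two processes on one node yield no relation
                edges.add(frozenset((node_a, node_b)))
        members = {node for edge in edges for node in edge}
        if first_node in members:
            relations[domain_id] = edges
    return relations


process_to_node = {"p0": "n1", "p1": "n2", "p2": "n3", "p3": "n4"}
domains = {
    "comm_A": [("p0", "p1"), ("p1", "p2")],
    "comm_B": [("p2", "p3")],
}
relations = build_domain_relations(domains, process_to_node, "n1")
```

Keeping one edge set per domain ID preserves the origin of each relation, so the same pair of nodes can be related in several domains without ambiguity.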

Step 104, generating a characterization feature for each cluster node;

in one embodiment of the invention, the characterization features are not hand-crafted features and are generated by methods such as one-hot encoding.

In one embodiment of the invention, the encoding is performed based on information of subtasks performed by the cluster nodes to generate characterization features;

specifically, the content information of the subtasks is expressed in a text form, and the characterization features are obtained through semantic coding modes such as word vector coding and the like.
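As an illustration of the one-hot option, the sketch below encodes a per-node subtask label. The label names and the choice to build the vocabulary from the observed labels are assumptions; the word-vector variant would replace this with a semantic encoder, which is not shown.

```python
def one_hot_features(subtask_types):
    """Encode each node's subtask label as a one-hot vector.

    The vocabulary is the sorted set of labels seen in the input.
    """
    vocabulary = sorted(set(subtask_types.values()))
    index = {label: position for position, label in enumerate(vocabulary)}
    features = {}
    for node, label in subtask_types.items():
        vector = [0.0] * len(vocabulary)
        vector[index[label]] = 1.0
        features[node] = vector
    return features


features = one_hot_features(
    {"n1": "matrix_multiply", "n2": "halo_exchange", "n3": "matrix_multiply"}
)
```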

Step 105, inputting the characterization features of the cluster nodes into an anomaly identification model, wherein the anomaly identification model comprises:

the feature fusion layer has the following calculation formula:

[formulas not reproduced in the source text]

where the symbols of the formulas denote, in order: the fusion feature of the i-th cluster node of the k-th parallel communication domain; the characterization features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the first weight parameter and the second weight parameter; the weight vector of the first hidden layer; the set of cluster nodes connected by a domain relation to the i-th cluster node in the k-th parallel communication domain; the activation function; the fusion weight between the i-th and j-th cluster nodes of the k-th parallel communication domain; the intermediate features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the exponential function with the natural constant as its base; and vector concatenation. T denotes transposition;

the cross-domain fusion layer has the following calculation formula:

[formula not reproduced in the source text]

the output layer has the following calculation formula:

[formula not reproduced in the source text]
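The symbol list of the feature fusion layer (fusion weights built from an exponential, a weight vector applied to concatenated intermediate features, a neighbor set) resembles a graph-attention layer. The sketch below is one possible reading under that assumption and is not the patent's formula: it assumes intermediate features h = W2·x, fusion weights given by a softmax over tanh(aᵀ[h_i ‖ h_j]), and a fused feature tanh(Σ_j α_ij · W1·x_j). The use of tanh as the activation and the roles assigned to W1 and W2 are guesses.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_node(i, features, neighbors, w1, w2, a):
    """Attention-style fusion of node i with its neighbors in one domain.

    Returns (fused_feature, fusion_weights).
    """
    if not neighbors:
        raise ValueError("node has no domain relations in this domain")
    # Assumed intermediate features: h = W2 x.
    hidden = {n: matvec(w2, features[n]) for n in [i, *neighbors]}
    # Score each neighbor from the concatenation [h_i || h_j].
    scores = [
        math.tanh(sum(ak * v for ak, v in zip(a, hidden[i] + hidden[j])))
        for j in neighbors
    ]
    # Softmax over neighbors (shifted by the maximum for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of transformed neighbor features, then the activation.
    summed = [0.0] * len(w1)
    for alpha, j in zip(alphas, neighbors):
        for d, value in enumerate(matvec(w1, features[j])):
            summed[d] += alpha * value
    return [math.tanh(v) for v in summed], alphas


node_features = {"n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [1.0, 0.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
fused, alphas = fuse_node(
    "n1", node_features, ["n2", "n3"], identity, identity, [0.5] * 4
)
```

In this reading the layer is applied once per parallel communication domain, so a first cluster node that belongs to M domains ends up with M fused features for the cross-domain fusion layer to combine.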

Each component of the output vector corresponds to one network anomaly cause type of the first cluster node, and the network anomaly cause type represented by the largest component of the output vector is taken as the output result.
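Selecting the result from the output vector amounts to an argmax over its components. In the sketch below, the two cause types are those named in this document, but their order in the vector and the identifier strings are assumptions.

```python
# Assumed ordering of output components; the document does not fix one.
CAUSE_TYPES = ["communication_domain_allocation_error", "cpu_usage_too_high"]


def cause_from_output(output_vector, cause_types=CAUSE_TYPES):
    """Return the cause type whose component in the output vector is largest."""
    if len(output_vector) != len(cause_types):
        raise ValueError("output vector and cause type list differ in length")
    best = max(range(len(output_vector)), key=output_vector.__getitem__)
    return cause_types[best]


result = cause_from_output([0.2, 0.8])
```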

The trained anomaly identification model has the capability to output results representative of the type of network anomaly cause for the first cluster node.

If there is more than one first cluster node, the first cluster nodes are processed serially or in parallel: at any one time, or on any one channel, the fusion features of only one first cluster node are input to the cross-domain fusion layer, and the network abnormality cause type of only that first cluster node is output.

In one embodiment of the present invention, to accommodate a dynamic high performance computer cluster, another anomaly identification model is provided, and the calculation formula of the output layer is as follows:

[formula not reproduced in the source text]

where the symbols of the formula denote, in order: the fourth weight parameter; the second bias parameter; the output vector; the vector concatenation function; and the service feature of the first cluster node;

the service feature is a representation of the service in which the first cluster node participates. The service can be described by text or by code, and the service feature can be obtained from that description by word vectors or other semantic encodings.
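A sketch of this alternative output layer follows. It assumes the form y = softmax(W4 · concat(c, s) + b2), where c is the cross-domain fusion feature and s the service feature; the softmax and the exact placement of the concatenation are assumptions, since the formula itself is not available in this text.

```python
import math


def output_layer(cross_domain_feature, service_feature, w4, b2):
    """Linear layer over the concatenated features, followed by a softmax."""
    joined = cross_domain_feature + service_feature  # vector concatenation
    logits = [
        sum(w * x for w, x in zip(row, joined)) + bias
        for row, bias in zip(w4, b2)
    ]
    peak = max(logits)  # shift by the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


y = output_layer(
    cross_domain_feature=[0.5, -0.5],
    service_feature=[1.0],
    w4=[[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]],
    b2=[0.0, 0.0],
)
```

Feeding the service feature directly into the output layer lets the classifier condition on which service a node currently participates in, which is one way to read the stated goal of adapting to a dynamic cluster.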

The weight parameters and the bias parameters are obtained through backpropagation updates, following the usual neural network training procedure.

At least one embodiment of the present invention provides a storage medium storing non-transitory computer readable instructions that, when executed by a computer, are capable of performing the steps of a method for monitoring high-speed network anomalies in a high-performance cluster as described above.

At least one embodiment of the present invention provides a monitoring system for high-speed network anomalies in a high-performance cluster, as shown in fig. 2, including:

an anomaly identification module 201 for identifying cluster nodes in which network connection anomalies occur;

an information obtaining module 202, configured to obtain information of cluster nodes with abnormal network connection, and obtain information of a parallel communication domain where a first cluster node is located;

a domain relation generating module 203, configured to generate a domain relation according to information of a parallel communication domain where the first cluster node is located;

a characterization module 204 for generating characterization features for each cluster node;

the pattern recognition module 205 is configured to input the characterization feature of the cluster node into the anomaly recognition model, and output a result indicating the type of network anomaly cause of the first cluster node.

In one embodiment of the invention, the types of network anomaly causes for the first cluster node include communication domain allocation errors and CPU usage that is too high.

excessively high CPU usage means: the first cluster node uses a virtual network service, and an excessive number of data packets causes the CPU usage of running the virtual network service to be too high;

of course, the network anomaly cause type of the first cluster node of the present invention may also include other types for indicating the network anomaly cause of other cluster nodes known to those skilled in the art.

It should be noted that insufficient network bandwidth is a mismatch between tasks and allocated resources, which generally does not occur in a normally operating cluster, and that factors attributable to the network service provider lie outside the cluster; neither is counted among the network abnormality causes of the first cluster node.

In one embodiment of the present invention, a monitoring system for high-speed network anomalies in a high-performance cluster is provided, as shown in fig. 3, further including an anomaly node control module 206, which executes a corresponding control policy based on a network anomaly cause type of a first cluster node;

specifically, when the cause is excessively high CPU usage, the following strategy is adopted: transferring part of the subtasks of the first cluster node to the second cluster node until the abnormal network connection of the first cluster node disappears;

for communication domain allocation errors, the following strategy is adopted: the first cluster node is registered in a parallel communication domain organized by services performed by the corresponding process.
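The "transfer until the anomaly disappears" strategy for high CPU usage can be sketched as a simple loop. The `is_abnormal` callback stands in for re-running the connection check after each transfer; it, the task lists, and the function name are hypothetical.

```python
def migrate_until_recovered(first_node_tasks, second_node_tasks, is_abnormal):
    """Move subtasks one at a time from the first node to a second node.

    Stops as soon as `is_abnormal` reports the first node as healthy,
    or when the first node has no subtasks left.
    """
    remaining = list(first_node_tasks)  # work on copies, leave inputs intact
    received = list(second_node_tasks)
    while remaining and is_abnormal(remaining):
        received.append(remaining.pop())
    return remaining, received


# Toy stand-in: the node counts as abnormal while it runs more than two subtasks.
remaining, received = migrate_until_recovered(
    ["t1", "t2", "t3", "t4"], [], lambda tasks: len(tasks) > 2
)
```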

The embodiments have been described above, but the invention is not limited to the specific implementations described, which are illustrative rather than restrictive. Many variations can be made by those of ordinary skill in the art with the benefit of this disclosure, and all of them fall within the scope of protection.

Claims

1. The monitoring method for the high-speed network abnormality in the high-performance cluster is characterized by comprising the following steps of:

step 104, generating a characterization feature for each cluster node, and encoding based on the information of the subtasks executed by the cluster node to generate the characterization feature of the cluster node;

step 105, inputting the characterization features of the cluster nodes into an anomaly identification model, and outputting a result representing the type of network anomaly cause of the first cluster node;

the anomaly identification model includes:

the feature fusion layer has the following calculation formula:

[formulas not reproduced in the source text]

where the symbols of the formulas denote, in order: the fusion feature of the i-th cluster node of the k-th parallel communication domain; the characterization features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the first weight parameter and the second weight parameter; the weight vector of the first hidden layer; the set of cluster nodes connected by a domain relation to the i-th cluster node in the k-th parallel communication domain; the activation function; the fusion weight between the i-th and j-th cluster nodes of the k-th parallel communication domain; the intermediate features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the exponential function with the natural constant as its base; and vector concatenation. T denotes transposition;

the cross-domain fusion layer has the following calculation formula:

[formula not reproduced in the source text]

the output layer has the following calculation formula:

[formula not reproduced in the source text]

2. A method of monitoring for high-speed network anomalies in a high-performance cluster according to claim 1, characterized in that a cluster node network connection anomaly is defined as the establishment of the network connection exceeding a set first time.

3. The method according to claim 1, wherein each component of the output vector corresponds to one network anomaly cause type of the first cluster node, and the network anomaly cause type represented by the largest component of the output vector is used as the output result.

4. A monitoring system for high-speed network anomalies in a high-performance cluster, characterized in that it is configured to perform the steps of a method for monitoring for high-speed network anomalies in a high-performance cluster according to any one of claims 1 to 3, comprising:

the characterization module is used for generating characterization features for each cluster node, and encoding the characterization features for the cluster node based on the information of the subtasks executed by the cluster node;

the pattern recognition module is used for inputting the characterization features of the cluster nodes into the anomaly recognition model and outputting a result representing the type of network anomaly cause of the first cluster node;

the anomaly identification model includes:

the feature fusion layer has the following calculation formula:

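[formulas not reproduced in the source text]

the cross-domain fusion layer has the following calculation formula:

[formula not reproduced in the source text]

the output layer has the following calculation formula:

[formula not reproduced in the source text]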

5. The monitoring system for high-speed network anomalies in a high-performance cluster according to claim 4, wherein the network anomaly cause types of the first cluster node include communication domain allocation error and excessively high CPU usage;

6. The system of claim 4, further comprising an anomaly node control module that executes a corresponding control policy to control the cluster nodes based on the network anomaly cause type of the first cluster node.

7. The system of claim 5, wherein, if the network anomaly cause type of the first cluster node is excessively high CPU usage, one of the following policies is adopted: transferring part of subtasks of the first cluster node to the second cluster node until the abnormal network connection of the first cluster node disappears;

8. A storage medium storing non-transitory computer readable instructions which, when executed by a computer, are capable of performing the steps of a method of monitoring for high speed network anomalies in a high-performance cluster according to any one of claims 1 to 3.