CN117596126B - Monitoring method for high-speed network abnormality in high-performance cluster - Google Patents
- Publication number
- CN117596126B (application CN202410079549.8A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- representing
- cluster node
- domain
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/142—Network analysis or design using statistical or mathematical methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/16—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
Abstract
The invention relates to the technical field of distributed computing and discloses a monitoring method for high-speed network abnormality in a high-performance cluster, comprising the following steps: acquiring information of cluster nodes with abnormal network connections (first cluster nodes); acquiring information of the parallel communication domains in which a first cluster node is located; generating domain relations from that information, where a domain relation is generated for two cluster nodes if they communicate with each other within one parallel communication domain; generating a characterization feature for each cluster node; and inputting the characterization features of the cluster nodes into an anomaly identification model, which outputs a result representing the network anomaly cause type of the first cluster node. By training a model that learns the patterns in which network abnormalities of cluster nodes occur in the communication domains at each layer, the invention identifies the cause of a cluster node's network abnormality, so that a targeted response and adjustment can be made in time and the service response speed of the high-performance computing cluster is stabilized.
Description
Technical Field
The invention relates to the technical field of distributed computing, in particular to a monitoring method for high-speed network abnormality in a high-performance cluster.
Background
A high-performance computing cluster is an architecture that uses multiple computers for parallel computing, with messages passed between nodes through a cross-language communication protocol. As cluster scale and performance keep growing, network delay has an increasingly large impact on high-performance computing clusters. For example, the invention with publication number CN115002001B discloses a method, device, equipment and readable medium for detecting sub-health of a cluster network, in which the network state of a node is judged from the delay of data packets in a log, and nodes with an abnormal network state are switched according to the delay and packet loss rate of data transmission;
such general methods can only judge whether a node of the high-performance computing cluster has a network abnormality; they cannot analyze its cause. Operation and maintenance personnel must inspect logs and debug, and analyzing the specific cause takes a great deal of time, so to keep the cluster running normally the abnormality is handled simply by switching nodes. Because the cause remains unclear, however, the problem may persist after switching, leading to frequent node switching and degrading the service response speed of the high-performance computing cluster.
Disclosure of Invention
The invention provides a monitoring method for high-speed network abnormality in a high-performance cluster. It solves the technical problem in the related art that, because the cause of a network abnormality cannot be determined, the abnormality can only be handled by switching nodes, which degrades the service response speed of the high-performance computing cluster.
The invention provides a monitoring method for high-speed network abnormality in a high-performance cluster, which comprises the following steps:
step 101, obtaining information of cluster nodes with abnormal network connection, defining the cluster nodes with abnormal network connection as first cluster nodes, and defining the cluster nodes outside the first cluster nodes as second cluster nodes;
step 102, obtaining information of a parallel communication domain where a first cluster node is located;
step 103, generating a domain relation according to the information of the parallel communication domain where the first cluster node is located, and if a relation of mutual communication exists between two cluster nodes in one parallel communication domain, generating the domain relation for the two cluster nodes;
step 104, generating a characterization feature for each cluster node;
step 105, inputting the characterization features of the cluster nodes into an anomaly identification model, and outputting a result representing the type of network anomaly cause of the first cluster node.
Further, a cluster node is defined as having a network connection abnormality when the time taken to establish its network connection exceeds a set first time.
Further, the characterization feature of a cluster node is generated by encoding information of the subtasks executed by the cluster node.
Further, the anomaly identification model includes:
the feature fusion layer has the following calculation formula:
[three formulas not preserved in the source text]
where the symbols, in the order they are defined in the original, denote: the fusion feature of the i-th cluster node of the k-th parallel communication domain; the characterization features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the first weight parameter and the second weight parameter; the weight vector of the first hidden layer; the set of cluster nodes that have a domain relation (edge) with the i-th cluster node in the k-th parallel communication domain; the activation function; the fusion weight between the i-th and j-th cluster nodes of the k-th parallel communication domain; the intermediate features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the exponential function with the natural constant as base; vector concatenation; and T, transposition;
the cross-domain fusion layer has the following calculation formula:
[two formulas not preserved in the source text]
where the symbols, in order, denote: the fusion features of the first cluster node in the k-th and h-th parallel communication domains; M, the total number of parallel communication domains in which the first cluster node is located; the characterization feature of the first cluster node; the first bias parameter; the third weight parameter; and the cross-domain fusion feature of the first cluster node;
the output layer has the following calculation formula:
[formula not preserved in the source text]
where the symbols, in order, denote the fourth weight parameter, the second bias parameter, and the output vector.
Further, each component of the output vector corresponds to one network anomaly cause type of the first cluster node, and the cause type corresponding to the largest component of the output vector is taken as the output result.
The invention provides a monitoring system for high-speed network abnormality in a high-performance cluster, which comprises:
the abnormality identification module is used for identifying cluster nodes with abnormal network connection;
the information acquisition module is used for acquiring information of cluster nodes with abnormal network connection and acquiring information of parallel communication domains where the first cluster nodes are located;
the domain relation generating module generates a domain relation according to the information of the parallel communication domain where the first cluster node is located;
a characterization module for generating a characterization feature for each cluster node;
and the pattern recognition module is used for inputting the characterization characteristics of the cluster nodes into the anomaly recognition model and outputting a result representing the type of network anomaly cause of the first cluster node.
Further, the network anomaly cause types of the first cluster node include communication domain allocation error and excessive CPU usage.
A communication domain allocation error means that the process corresponding to the first cluster node performs a certain service, but the parallel communication domain allocated to the cluster nodes organized by that service does not contain the first cluster node;
excessive CPU usage means that the first cluster node uses a virtual network service, and an excessive number of data packets causes the CPU usage of running the virtual network service to be too high.
Further, the monitoring system also comprises an abnormal node control module, which executes a control strategy corresponding to the network anomaly cause type of the first cluster node to control the cluster node.
Further, if the network anomaly cause type of the first cluster node is excessive CPU usage, one of the following strategies is adopted:
transferring part of the subtasks of the first cluster node to a second cluster node until the network connection abnormality of the first cluster node disappears;
transferring all subtasks of the first cluster node to a second cluster node with higher CPU processing performance;
if the network anomaly cause type of the first cluster node is a communication domain allocation error, the following strategy is adopted: registering the first cluster node in the parallel communication domain organized by the service performed by its corresponding process.
The present invention provides a storage medium storing non-transitory computer readable instructions that, when executed by a computer, are capable of performing the steps of a method of monitoring for high-speed network anomalies in a high-performance cluster as described above.
The invention has the following beneficial effects: by training a model that learns the patterns in which network abnormalities of cluster nodes occur in the communication domains at each layer, the invention identifies the cause of a cluster node's network abnormality, so that a targeted response and adjustment can be made in time and the service response speed of the high-performance computing cluster is stabilized.
Drawings
FIG. 1 is a flow chart of a method of monitoring for high speed network anomalies in a high performance cluster in accordance with the present invention;
FIG. 2 is a schematic diagram of a monitoring system for high-speed network anomalies in a high-performance cluster according to one embodiment of the present invention;
FIG. 3 is a second module schematic diagram of the monitoring system for high-speed network anomalies in a high-performance cluster according to the present invention.
In the figures: anomaly identification module 201, information acquisition module 202, domain relation generation module 203, characterization module 204, pattern recognition module 205, abnormal node control module 206.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It is to be understood that these embodiments are merely discussed so that those skilled in the art may better understand and implement the subject matter described herein and that changes may be made in the function and arrangement of the elements discussed without departing from the scope of the disclosure herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.
In at least one embodiment of the present invention, a method for monitoring a high-speed network anomaly in a high-performance cluster is provided, as shown in fig. 1, including the following steps:
step 101, obtaining information of cluster nodes with abnormal network connection, defining the cluster nodes with abnormal network connection as first cluster nodes, and defining the cluster nodes outside the first cluster nodes as second cluster nodes;
here, a cluster node is defined as having a network connection abnormality when the time taken to establish its network connection exceeds a set first time. The processing response time of a cluster node in a high-performance computing cluster is generally about 100 ms, so the first time can be set to 10 ms; by the barrel effect (the slowest node limits the whole), a connection slower than this has a comparatively large impact on the performance of the cluster.
The time taken to establish a network connection is typically reflected by the delays of the data packets recorded in the log;
for sporadic network connection abnormalities the packet delays vary, so to improve robustness the packet delays in a cluster node's log are sorted from largest to smallest, and the delay ranked L-th is read as the time taken to establish that cluster node's network connection; the default value of L is 3.
Other statistical definitions of delay known in the art may of course be used. Note that the delay of a lost packet is generally defined as infinite.
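The ranked-delay rule above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of `math.inf` for lost packets, and the fallback to the smallest sample when fewer than L samples exist are assumptions made for the example.

```python
import math

def connection_time(delays_ms, rank=3):
    """Return the delay ranked `rank`-th from the largest (1-based).

    Lost packets are expected to be recorded as math.inf.
    Assumption: with fewer than `rank` samples, the smallest sample is used.
    """
    if not delays_ms:
        raise ValueError("no delay samples")
    ordered = sorted(delays_ms, reverse=True)  # largest delay first
    return ordered[min(rank, len(ordered)) - 1]

def is_first_cluster_node(delays_ms, first_time_ms=10.0, rank=3):
    """A node is a 'first cluster node' when its connection time exceeds the first time."""
    return connection_time(delays_ms, rank) > first_time_ms

delays = [2.1, 3.4, math.inf, 50.0, 1.9, 12.5]  # math.inf marks a lost packet
print(connection_time(delays))        # 12.5
print(is_first_cluster_node(delays))  # True
```

Reading the L-th largest delay instead of the maximum means that one or two isolated spikes, including a single lost packet, do not by themselves mark the node as abnormal.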
Step 102, obtaining information of a parallel communication domain where a first cluster node is located;
a parallel communication domain comprises a group of processes that run on the cluster nodes and can communicate and cooperate with one another. Since cluster nodes correspond to processes, the relationships among the processes of a parallel communication domain can be mapped to relationships among the cluster nodes;
here, the relationships among processes in a parallel communication domain include communication and cooperation.
Step 103, generating a domain relation according to the information of the parallel communication domain where the first cluster node is located, and if a relation of mutual communication exists between two cluster nodes in one parallel communication domain, generating the domain relation for the two cluster nodes;
the first cluster node may belong to several different parallel communication domains, so an ID is defined for each parallel communication domain, and domain relations can be distinguished by the ID of the parallel communication domain from which they originate.
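A minimal sketch of step 103 follows, assuming the domain membership and the pairwise communication records are already available. The data layout (`domains`, `links`), the `communicates` callback, and all node and domain names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def build_domain_relations(domains, first_node, communicates):
    """Build domain relations (edges) keyed by parallel communication domain ID.

    domains: {domain_id: [node, ...]}
    communicates(domain_id, a, b) -> bool, True if a and b communicate in that domain.
    Only domains that contain the first cluster node are considered.
    """
    relations = {}
    for domain_id, nodes in domains.items():
        if first_node not in nodes:
            continue
        relations[domain_id] = {
            (a, b)
            for a, b in combinations(sorted(nodes), 2)
            if communicates(domain_id, a, b)
        }
    return relations

# Hypothetical cluster: n1 is the first cluster node.
domains = {"comm0": ["n1", "n2", "n3"], "comm1": ["n1", "n4"], "comm2": ["n2", "n3"]}
links = {("comm0", "n1", "n2"), ("comm0", "n2", "n3"),
         ("comm1", "n1", "n4"), ("comm2", "n2", "n3")}

def communicates(domain_id, a, b):
    return (domain_id, a, b) in links or (domain_id, b, a) in links

relations = build_domain_relations(domains, "n1", communicates)
print(sorted(relations))  # ['comm0', 'comm1']
```

Keying the edges by domain ID keeps relations from different parallel communication domains apart, even when the same pair of nodes communicates in more than one domain.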
Step 104, generating a characterization feature for each cluster node;
in one embodiment of the invention, the characterization features are not hand-crafted features and are generated by methods such as one-hot encoding.
In one embodiment of the invention, the characterization features are generated by encoding information of the subtasks executed by the cluster nodes;
specifically, the content of the subtasks is expressed as text, and the characterization features are obtained by semantic encoding such as word-vector encoding.
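The one-hot variant can be sketched as below. It is only an illustration: it assumes each cluster node is described by a single subtask label, and the label strings are hypothetical. A word-vector encoder would replace this function when subtasks are described by free text.

```python
def one_hot_features(node_subtasks):
    """Map {node: subtask label} to {node: one-hot vector over the sorted label vocabulary}."""
    vocab = sorted(set(node_subtasks.values()))
    index = {label: i for i, label in enumerate(vocab)}
    return {
        node: [1.0 if i == index[label] else 0.0 for i in range(len(vocab))]
        for node, label in node_subtasks.items()
    }

features = one_hot_features({"n1": "allreduce", "n2": "halo_exchange", "n3": "allreduce"})
print(features["n1"])  # [1.0, 0.0]
print(features["n2"])  # [0.0, 1.0]
```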
Step 105, inputting the characterization features of the cluster nodes into an anomaly identification model, wherein the anomaly identification model comprises:
the feature fusion layer has the following calculation formula:
[three formulas not preserved in the source text]
where the symbols, in the order they are defined in the original, denote: the fusion feature of the i-th cluster node of the k-th parallel communication domain; the characterization features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the first weight parameter and the second weight parameter; the weight vector of the first hidden layer; the set of cluster nodes that have a domain relation (edge) with the i-th cluster node in the k-th parallel communication domain; the activation function; the fusion weight between the i-th and j-th cluster nodes of the k-th parallel communication domain; the intermediate features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the exponential function with the natural constant as base; vector concatenation; and T, transposition;
the cross-domain fusion layer has the following calculation formula:
[two formulas not preserved in the source text]
where the symbols, in order, denote: the fusion features of the first cluster node in the k-th and h-th parallel communication domains; M, the total number of parallel communication domains in which the first cluster node is located; the characterization feature of the first cluster node; the first bias parameter; the third weight parameter; and the cross-domain fusion feature of the first cluster node;
the output layer has the following calculation formula:
[formula not preserved in the source text]
where the symbols, in order, denote the fourth weight parameter, the second bias parameter, and the output vector.
Each component of the output vector corresponds to one network anomaly cause type of the first cluster node, and the cause type corresponding to the largest component of the output vector is taken as the output result.
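Selecting the output result from the output vector amounts to an argmax over its components. The sketch below assumes the two cause types named in this document and an arbitrary ordering of them; the identifier strings and the ordering are illustrative assumptions.

```python
# Assumed ordering of cause types; the document names these two types but not their order.
CAUSE_TYPES = ["communication_domain_allocation_error", "cpu_usage_too_high"]

def cause_from_output(output_vector, cause_types=CAUSE_TYPES):
    """Return the cause type whose component in the output vector is largest."""
    if len(output_vector) != len(cause_types):
        raise ValueError("output vector length must match the number of cause types")
    best = max(range(len(output_vector)), key=output_vector.__getitem__)
    return cause_types[best]

print(cause_from_output([0.2, 1.7]))  # cpu_usage_too_high
```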
Once trained, the anomaly identification model can output a result representing the network anomaly cause type of the first cluster node.
If there is more than one first cluster node, they are processed serially or in parallel: each pass (or each channel) feeds the fusion features of only one first cluster node into the cross-domain fusion layer and outputs the network anomaly cause type of only that first cluster node.
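The symbol definitions of the feature fusion layer (intermediate features, a hidden-layer weight vector applied to concatenated pairs, exponential normalization over the nodes sharing a domain relation) suggest a graph-attention style computation. Because the formulas themselves are not preserved in this text, the sketch below is one plausible reading, not the patented formula. In particular, applying the first weight parameter to obtain intermediate features, applying the second weight parameter to neighbor features before summation, and using `tanh` as the activation are all assumptions, as are the function and variable names.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fuse_in_domain(node, features, neighbors, W1, W2, a):
    """Attention-style fusion for `node` inside one parallel communication domain.

    features: {node: characterization feature}; neighbors: nodes sharing a domain
    relation with `node`. Returns (fusion feature, fusion weights per neighbor).
    """
    if not neighbors:
        raise ValueError("node has no domain relations in this domain")
    # Intermediate features (assumed: first weight parameter times characterization feature).
    z = {n: matvec(W1, features[n]) for n in [node, *neighbors]}
    # Score each neighbor with the weight vector applied to the concatenation [z_i, z_j].
    scores = {j: sum(ai * v for ai, v in zip(a, z[node] + z[j])) for j in neighbors}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    alpha = {j: e / total for j, e in exps.items()}  # fusion weights, sum to 1
    # Weighted sum of transformed neighbor features, then the activation (assumed tanh).
    fused = [0.0] * len(W2)
    for j in neighbors:
        message = matvec(W2, features[j])
        fused = [f + alpha[j] * m for f, m in zip(fused, message)]
    return [math.tanh(v) for v in fused], alpha

identity = [[1.0, 0.0], [0.0, 1.0]]
feats = {"n1": [1.0, 0.0], "n2": [0.0, 1.0], "n3": [1.0, 1.0]}
h, alpha = fuse_in_domain("n1", feats, ["n2", "n3"], identity, identity, [0.0, 0.0, 1.0, 0.0])
print(alpha["n3"] > alpha["n2"])  # True
```

The cross-domain fusion layer and the output layer are not covered here; they would combine the per-domain results for the first cluster node across its M parallel communication domains.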
In one embodiment of the present invention, to accommodate a dynamic high-performance computing cluster, another anomaly identification model is provided, whose output layer has the following calculation formula:
[formula not preserved in the source text]
where the symbols, in order, denote the fourth weight parameter, the second bias parameter, the output vector, the vector concatenation function, and the service feature of the first cluster node;
the service feature is a representation of the service in which the first cluster node participates; the service can be described by text or code, and the service feature can be obtained from it by word vectors or other semantic encoding.
The weight parameters and the bias parameters can be obtained by back-propagation updates following the usual neural network training process.
At least one embodiment of the present invention provides a storage medium storing non-transitory computer readable instructions that, when executed by a computer, are capable of performing the steps of a method for monitoring high-speed network anomalies in a high-performance cluster as described above.
At least one embodiment of the present invention provides a monitoring system for high-speed network anomalies in a high-performance cluster, as shown in fig. 2, including:
an anomaly identification module 201 for identifying cluster nodes in which network connection anomalies occur;
an information obtaining module 202, configured to obtain information of cluster nodes with abnormal network connection, and obtain information of a parallel communication domain where a first cluster node is located;
a domain relation generating module 203, configured to generate a domain relation according to information of a parallel communication domain where the first cluster node is located;
a characterization module 204 for generating characterization features for each cluster node;
the pattern recognition module 205 is configured to input the characterization feature of the cluster node into the anomaly recognition model, and output a result indicating the type of network anomaly cause of the first cluster node.
In one embodiment of the invention, the network anomaly cause types of the first cluster node include communication domain allocation error and excessive CPU usage.
A communication domain allocation error means that the process corresponding to the first cluster node performs a certain service, but the parallel communication domain allocated to the cluster nodes organized by that service does not contain the first cluster node;
excessive CPU usage means that the first cluster node uses a virtual network service, and an excessive number of data packets causes the CPU usage of running the virtual network service to be too high;
of course, the network anomaly cause types of the first cluster node may also include other types, known to those skilled in the art, that indicate other causes of cluster node network anomalies.
It should be noted that insufficient network bandwidth is a mismatch between tasks and allocated resources, which generally does not occur in a normally operating cluster, and that factors attributable to the network service provider lie outside the cluster; neither is counted as a network anomaly cause of the first cluster node.
In one embodiment of the present invention, the monitoring system for high-speed network anomalies in a high-performance cluster, as shown in fig. 3, further includes an abnormal node control module 206, which executes a control strategy corresponding to the network anomaly cause type of the first cluster node;
specifically, for excessive CPU usage, one of the following strategies is adopted: transferring part of the subtasks of the first cluster node to a second cluster node until the network connection abnormality of the first cluster node disappears;
or transferring all subtasks of the first cluster node to a second cluster node with higher CPU processing performance;
for a communication domain allocation error, the following strategy is adopted: registering the first cluster node in the parallel communication domain organized by the service performed by its corresponding process.
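The mapping from cause type to control strategy can be sketched as a simple dispatch. The document only states that one of the two CPU strategies is adopted; choosing between them by whether a second cluster node with higher CPU processing performance is available is an assumption of this example, and the identifier strings are hypothetical.

```python
def choose_policy(cause_type, stronger_node_available=False):
    """Return the name of the control strategy for a diagnosed cause type."""
    if cause_type == "cpu_usage_too_high":
        # Assumption: prefer full migration only when a stronger second node exists.
        if stronger_node_available:
            return "migrate_all_subtasks_to_stronger_second_node"
        return "migrate_some_subtasks_until_anomaly_clears"
    if cause_type == "communication_domain_allocation_error":
        return "register_node_in_service_communication_domain"
    return "no_policy_defined"

print(choose_policy("cpu_usage_too_high"))
# migrate_some_subtasks_until_anomaly_clears
print(choose_policy("communication_domain_allocation_error"))
# register_node_in_service_communication_domain
```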
The embodiments have been described above, but the invention is not limited to the specific implementations described, which are illustrative rather than restrictive; the many variations that those of ordinary skill in the art may make with the benefit of this disclosure fall within its scope.
Claims (8)
1. The monitoring method for the high-speed network abnormality in the high-performance cluster is characterized by comprising the following steps of:
step 101, obtaining information of cluster nodes with abnormal network connection, defining the cluster nodes with abnormal network connection as first cluster nodes, and defining the cluster nodes outside the first cluster nodes as second cluster nodes;
step 102, obtaining information of a parallel communication domain where a first cluster node is located;
step 103, generating a domain relation according to the information of the parallel communication domain where the first cluster node is located, and if a relation of mutual communication exists between two cluster nodes in one parallel communication domain, generating the domain relation for the two cluster nodes;
step 104, generating a characterization feature for each cluster node, the characterization feature of a cluster node being generated by encoding information of the subtasks executed by the cluster node;
step 105, inputting the characterization features of the cluster nodes into an anomaly identification model, and outputting a result representing the type of network anomaly cause of the first cluster node;
the anomaly identification model includes:
the feature fusion layer has the following calculation formula:
[three formulas not preserved in the source text]
wherein the symbols, in the order they are defined in the original, denote: the fusion feature of the i-th cluster node of the k-th parallel communication domain; the characterization features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the first weight parameter and the second weight parameter; the weight vector of the first hidden layer; the set of cluster nodes that have a domain relation (edge) with the i-th cluster node in the k-th parallel communication domain; the activation function; the fusion weight between the i-th and j-th cluster nodes of the k-th parallel communication domain; the intermediate features of the i-th and j-th cluster nodes of the k-th parallel communication domain; the exponential function with the natural constant as base; vector concatenation; and T, transposition;
the cross-domain fusion layer has the following calculation formula:
[two formulas not preserved in the source text]
wherein the symbols, in order, denote: the fusion features of the first cluster node in the k-th and h-th parallel communication domains; M, the total number of parallel communication domains in which the first cluster node is located; the characterization feature of the first cluster node; the first bias parameter; the third weight parameter; and the cross-domain fusion feature of the first cluster node;
the output layer has the following calculation formula:
[formula not preserved in the source text]
wherein the symbols, in order, denote the fourth weight parameter, the second bias parameter, and the output vector.
2. The monitoring method for high-speed network anomalies in a high-performance cluster according to claim 1, characterized in that a cluster node is defined as having a network connection abnormality when the time taken to establish its network connection exceeds a set first time.
3. The method according to claim 1, wherein each component of the output vector corresponds to one network anomaly cause type of the first cluster node, and the cause type corresponding to the largest component of the output vector is taken as the output result.
4. A monitoring system for high-speed network anomalies in a high-performance cluster, characterized in that it is configured to perform the steps of a method for monitoring for high-speed network anomalies in a high-performance cluster according to any one of claims 1 to 3, comprising:
the abnormality identification module is used for identifying cluster nodes with abnormal network connection;
the information acquisition module is used for acquiring information of cluster nodes with abnormal network connection and acquiring information of parallel communication domains where the first cluster nodes are located;
the domain relation generating module generates a domain relation according to the information of the parallel communication domain where the first cluster node is located;
the characterization module is used for generating a characterization feature for each cluster node, the characterization feature of a cluster node being generated by encoding information of the subtasks executed by the cluster node;
the pattern recognition module is used for inputting the characterization features of the cluster nodes into the anomaly recognition model and outputting a result representing the type of network anomaly cause of the first cluster node;
the anomaly identification model includes:
the feature fusion layer has the following calculation formula:
;
;
,/>;
wherein the method comprises the steps ofFusion characteristics of the ith cluster node representing the kth parallel communication domain, +.>And->Characterization features of the ith and jth cluster nodes, respectively, representing the kth parallel communication domain,/->And->Respectively representing a first weight parameter and a second weight parameter,/respectively>Weight vector representing the first hidden layer, +.>A set of cluster nodes representing an edge domain relationship with an ith cluster node in a kth parallel communication domain,/v>Representing an activation function->Fusion weights representing the ith and jth cluster nodes of the kth parallel communication domain,/, are->And->Intermediate characteristics of the ith and jth cluster nodes respectively representing the kth parallel communication domain,/, are represented by->Represents an exponential function based on natural constants, < ->Representing vector stitching, T representing transposition;
the cross-domain fusion layer has the following calculation formula:
[The formulas of this layer are not preserved in this text.]
wherein the symbols denote, in order: the fusion features of the first cluster node in the k-th and h-th parallel communication domains; M, the total number of parallel communication domains in which the first cluster node is located; the characterization feature of the first cluster node; the first bias parameter; the third weight parameter; and the cross-domain fusion feature of the first cluster node;
the output layer has the following calculation formula:
[The formula of this layer is not preserved in this text.]
wherein the symbols denote, in order: the fourth weight parameter; the second bias parameter; and the output vector.
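The formulas of the three layers are not legible in this text, so the following Python sketch shows only one conventional attention-style form that is consistent with the symbol definitions of claim 4. It is an illustration under stated assumptions, not the claimed formulas: the function names, the softmax normalisation, the LeakyReLU activation, the dot-product scoring in the cross-domain layer, and the toy data are all hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    """Normalise scores with the natural exponential (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def leaky_relu(v, slope=0.2):
    """Assumed activation function; the text only says 'an activation function'."""
    return v if v > 0 else slope * v

def fuse_in_domain(features, neighbors, i, W1, W2, a):
    """Fusion feature of node i inside one parallel communication domain.

    ASSUMED form: intermediate features z = W2 x; fusion weights are a softmax
    of a^T [z_i || z_j] over the neighbor set of i; the fusion feature is
    activation(sum_j weight_ij * W1 x_j).
    """
    z = {n: matvec(W2, x) for n, x in features.items()}
    nbrs = neighbors[i]
    # z[i] + z[j] is list concatenation, i.e. vector stitching.
    scores = [sum(p * q for p, q in zip(a, z[i] + z[j])) for j in nbrs]
    alpha = softmax(scores)
    fused = [0.0] * len(W1)
    for weight, j in zip(alpha, nbrs):
        msg = matvec(W1, features[j])
        fused = [f + weight * m for f, m in zip(fused, msg)]
    return [leaky_relu(f) for f in fused]

def fuse_across_domains(domain_features, x_first, W3, b1):
    """Cross-domain fusion feature of the first cluster node.

    ASSUMED form: each of the M per-domain fusion features is scored against
    (W3 x + b1), the scores are softmax-normalised over the M domains, and the
    per-domain features are summed with those weights.
    """
    query = [q + b for q, b in zip(matvec(W3, x_first), b1)]
    scores = [sum(q * h for q, h in zip(query, hk)) for hk in domain_features]
    beta = softmax(scores)
    dim = len(domain_features[0])
    return [sum(b * hk[d] for b, hk in zip(beta, domain_features)) for d in range(dim)]

def output_layer(c, W4, b2):
    """ASSUMED form: softmax(W4 c + b2), one score per anomaly cause type."""
    return softmax([l + b for l, b in zip(matvec(W4, c), b2)])

# Toy data: first cluster node "n0" belongs to two domains; features are 2-dimensional.
I2 = [[1.0, 0.0], [0.0, 1.0]]
feats = {"n0": [1.0, 0.0], "n1": [0.0, 1.0], "n2": [1.0, 1.0]}
h_k1 = fuse_in_domain(feats, {"n0": ["n1", "n2"]}, "n0", I2, I2, [0.0] * 4)
h_k2 = fuse_in_domain(feats, {"n0": ["n1"]}, "n0", I2, I2, [0.0] * 4)
c = fuse_across_domains([h_k1, h_k2], feats["n0"], I2, [0.0, 0.0])
y = output_layer(c, I2, [0.0, 0.0])
```

With identity weights and a zero attention vector, the neighbors of a node are averaged uniformly, which makes the toy values easy to check by hand. In a trained model the weight and bias parameters would be learned.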
5. The monitoring system for high-speed network anomalies in a high-performance cluster according to claim 4, wherein the network anomaly cause types of the first cluster node include a communication domain allocation error and excessive CPU usage;
the communication domain allocation error means that a process corresponding to the first cluster node executes a certain service, but the parallel communication domain allocated to the cluster nodes organized by that service does not contain the first cluster node;
excessive CPU usage means that the first cluster node uses a virtual network service, and an excessive number of data packets causes the CPU usage of running the virtual network service to be too high.
6. The system of claim 4, further comprising an anomaly node control module that executes a corresponding control policy to control the cluster nodes based on the network anomaly cause type of the first cluster node.
7. The system of claim 5, wherein, when the network anomaly cause type of the first cluster node is excessive CPU usage, one of the following policies is adopted: transferring part of the subtasks of the first cluster node to a second cluster node until the abnormal network connection of the first cluster node disappears; or
transferring all subtasks of the first cluster node to a second cluster node with higher CPU processing performance;
and when the network anomaly cause type of the first cluster node is a communication domain allocation error, the following policy is adopted: registering the first cluster node in the parallel communication domain organized by the service executed by the corresponding process.
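The control policies of claims 6 and 7 can be pictured as a dispatch on the anomaly cause type. The Python sketch below is a minimal illustration only: `AnomalyCause`, `apply_policy`, `migrate_until_healthy`, the task lists, and the `is_abnormal` callback are hypothetical names that do not appear in the patent, and real subtask migration or communication domain registration would go through the cluster's own scheduler and communication library.

```python
from enum import Enum

class AnomalyCause(Enum):
    """The two network anomaly cause types named in claim 5."""
    CPU_USAGE_TOO_HIGH = "cpu_usage_too_high"
    COMM_DOMAIN_ALLOCATION_ERROR = "comm_domain_allocation_error"

def migrate_until_healthy(first_tasks, second_tasks, is_abnormal):
    """Move subtasks one at a time until the first node is no longer abnormal."""
    while first_tasks and is_abnormal(first_tasks):
        second_tasks.append(first_tasks.pop())

def apply_policy(cause, node, first_tasks, second_tasks, domain_members,
                 is_abnormal, migrate_all=False):
    """Apply the control policy that matches the anomaly cause type."""
    if cause is AnomalyCause.CPU_USAGE_TOO_HIGH:
        if migrate_all:
            # Policy: transfer all subtasks to a (faster) second node.
            second_tasks.extend(first_tasks)
            first_tasks.clear()
        else:
            # Policy: transfer part of the subtasks until the anomaly disappears.
            migrate_until_healthy(first_tasks, second_tasks, is_abnormal)
    elif cause is AnomalyCause.COMM_DOMAIN_ALLOCATION_ERROR:
        # Policy: register the node in the communication domain of its service.
        domain_members.add(node)
    else:
        raise ValueError(f"unknown anomaly cause: {cause!r}")

# Toy usage: the first node counts as abnormal while it holds more than two subtasks.
first, second = ["t1", "t2", "t3", "t4"], []
apply_policy(AnomalyCause.CPU_USAGE_TOO_HIGH, "n0", first, second, set(),
             lambda tasks: len(tasks) > 2)

domain = {"n1", "n2"}
apply_policy(AnomalyCause.COMM_DOMAIN_ALLOCATION_ERROR, "n0", [], [], domain,
             lambda tasks: False)
```

The `is_abnormal` callback stands in for a fresh check of the first node's network connection after each migration step, which is how "until the abnormal network connection disappears" is modelled here.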
8. A storage medium storing non-transitory computer readable instructions which, when executed by a computer, are capable of performing the steps of a method of monitoring for high speed network anomalies in a high-performance cluster according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410079549.8A CN117596126B (en) | 2024-01-19 | 2024-01-19 | Monitoring method for high-speed network abnormality in high-performance cluster |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117596126A (en) | 2024-02-23 |
CN117596126B (en) | 2024-03-26 |
Family
ID=89917057
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410079549.8A Active CN117596126B (en) | 2024-01-19 | 2024-01-19 | Monitoring method for high-speed network abnormality in high-performance cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117596126B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115550139A (en) * | 2022-09-19 | 2022-12-30 | 中国电信股份有限公司 | Fault root cause positioning method, device and system, electronic equipment and storage medium |
CN115865625A (en) * | 2022-11-28 | 2023-03-28 | 武汉烽火技术服务有限公司 | Method and device for analyzing fault root cause of communication equipment |
CN116450399A (en) * | 2023-06-13 | 2023-07-18 | 西华大学 | Fault diagnosis and root cause positioning method for micro service system |
CN116684253A (en) * | 2023-05-24 | 2023-09-01 | 电子科技大学 | Network anomaly management and control method based on intelligent operation and maintenance |
CN117376092A (en) * | 2023-10-24 | 2024-01-09 | 中国联合网络通信集团有限公司 | Fault root cause positioning method, device, equipment and storage medium |
CN117376084A (en) * | 2022-06-30 | 2024-01-09 | 华为技术有限公司 | Fault detection method, electronic equipment and medium thereof |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10193780B2 (en) * | 2015-10-09 | 2019-01-29 | Futurewei Technologies, Inc. | System and method for anomaly root cause analysis |
US11252014B2 (en) * | 2019-09-30 | 2022-02-15 | Dynatrace Llc | Forming root cause groups of incidents in clustered distributed system through horizontal and vertical aggregation |
US11894969B2 (en) * | 2021-07-12 | 2024-02-06 | Ciena Corporation | Identifying root causes of network service degradation |
Non-Patent Citations (1)
Title |
---|
GAD: Topology-aware time series anomaly detection; Qi Qi; Shen Runye; Wang Jingyu; Journal on Communications (通信学报); 2020-06-24 (06); pp. 152-160 * |
Also Published As
Publication number | Publication date |
---|---|
CN117596126A (en) | 2024-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP3796176B1 (en) | Fault root cause analysis method and apparatus | |
US8676965B2 (en) | Tracking high-level network transactions | |
CN112468347B (en) | Security management method and device for cloud platform, electronic equipment and storage medium | |
CN109462590B (en) | Unknown protocol reverse analysis method based on fuzzy test | |
US11586981B2 (en) | Failure analysis device, failure analysis method, and failure analysis program | |
RU2697648C2 (en) | Traffic classification system | |
Dai et al. | Identifying and estimating persistent items in data streams | |
CN112087334A (en) | Alarm root cause analysis method, electronic device and storage medium | |
US7844443B2 (en) | Network subscriber experience modeling | |
CN116723136B (en) | Network data detection method applying FCM clustering algorithm | |
CN117596126B (en) | Monitoring method for high-speed network abnormality in high-performance cluster | |
CN116828087B (en) | Information security system based on block chain connection | |
CN112543145A (en) | Method and device for selecting communication path of equipment node for sending data | |
CN110266554B (en) | Testing method of private communication protocol | |
Chiu et al. | A genetic algorithm for reliability-oriented task assignment with k/spl tilde/duplications in distributed systems | |
JP7173273B2 (en) | Failure analysis device, failure analysis method and failure analysis program | |
CN110798371A (en) | Testing method of private communication protocol | |
CN115766518A (en) | Anomaly detection model training and anomaly detection method and system for cloud side end system | |
Sheng et al. | How to fingerprint attack traffic against industrial control system network | |
CN112600753B (en) | Equipment node communication path selection method and device according to equipment access amount | |
CN114330363A (en) | Industrial control protocol vulnerability mining method based on vulnerability semantic intelligent analysis | |
JP2018537009A (en) | Whitelist generator, whitelist evaluator, whitelist generator / evaluator, whitelist generation method, whitelist evaluation method, and whitelist generation / evaluation method | |
CN111950853A (en) | Power running state white list generation method based on information physical bilateral data | |
Cheng et al. | Analysis of policy anomalies in distributed firewalls | |
CN113055388B (en) | Deep packet detection method and system based on generation countermeasure network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||