CN115941676A - Multi-node high-performance computing cluster network traffic load balancing scheduling method and device - Google Patents


Info

Publication number
CN115941676A
Authority
CN
China
Prior art keywords
path
load
flow
host
paths
Prior art date
Legal status
Pending
Application number
CN202211180131.3A
Other languages
Chinese (zh)
Inventor
张宇超
黄浩强
阙喜戎
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202211180131.3A
Publication of CN115941676A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a multi-node high-performance computing cluster network traffic load balancing scheduling method and device (ArCa). ArCa implements all of its components on the hosts, without any changes to the multi-node high performance computing cluster network. Hosts send probe packets to one another and obtain the mapping between paths and five-tuples as well as the load of each path. This information forms a network load table, on the basis of which ArCa estimates the short-term load of the network. Based on this estimate, each host makes traffic-planning decisions and selects paths for sending messages with the help of the skip-flow algorithm. Each module runs as an independent process, and the modules exchange information through inter-process communication (e.g., MMAP), which improves the overall concurrency of the mechanism as well as its reaction speed and decision efficiency.

Description

Multi-node high-performance computing cluster network traffic load balancing scheduling method and device
Technical Field
The invention relates to the technical field of communications, and in particular to a method and device for load-balanced scheduling of network traffic in a multi-node high-performance computing cluster.
Background
With the rapid development of the internet, data has reached a staggering scale, and the pressure of data processing grows by the day. According to a white paper of the International Data Corporation (IDC), the total amount of data will grow to 163 ZB by 2025. Enterprises develop various multi-node High Performance Computing (HPC) clusters to handle large data processing tasks, such as data storage, genome sequencing, and Distributed Machine Learning (DML), and develop high-speed communication techniques, such as All-Reduce and Remote Direct Memory Access (RDMA), to improve cluster performance.
The multi-node HPC cluster is developed mainly for distributed storage, DML, and other distributed high-performance services. Each compute node of the cluster is a host, and the network uses switches to connect the hosts in a Clos, Fat-Tree, or Leaf-Spine topology. As host performance grows, network performance must grow with it. Compared with a traditional network, the multi-node high-performance computing cluster network has the following four characteristics:
(1) Scalability. A multi-node HPC cluster requires high scalability to control deployment overhead, so Clos is widely deployed because it is simple and easy to expand. In recent years Leaf-Spine has also gained popularity, as it can be scaled out to build very large multi-node high-performance computing cluster networks such as EFLOPS. Commodity network hardware is common in high performance computing cluster networks because it ensures broad compatibility, while specialized hardware is rarely used.
(2) Kernel bypass. To match the high computing performance of the hosts, the multi-node high performance computing cluster network introduces RDMA transport to improve throughput. RDMA allows a Network Interface Card (NIC) to copy memory directly from user space, without system interrupts or CPU-driven copies. In this process RDMA bypasses the system kernel entirely, including TCP; to some extent, RDMA replaces TCP as the transport protocol. Mechanisms built on TCP are therefore no longer effective.
(3) Asymmetry. Although vendors attempt to keep network resources symmetric, asymmetry still arises for many reasons, such as network failures and heterogeneous resources. One such situation is unbalanced striping in Clos, which is almost unavoidable in practice: asymmetry occurs whenever the number of switches in one layer is not evenly divisible by the number of switches in an adjacent layer. When asymmetry is high, an algorithm that balances only on local load information lowers the average utilization of the network globally.
(4) Concurrency. A multi-node high performance computing cluster network is full of concurrent flows. Taking DML as an example, it initiates communication from many hosts simultaneously, and many of those hosts share the same sub-network resources. When several hosts send traffic to their shared switch at the same time and select the same output port, the flows contend for bandwidth. The main cause is that the load balancing method on each host selects its "best" path regardless of the choices made by other hosts.
Communication performance directly determines cluster performance, and load balancing is one of the most important techniques for improving network performance. In most high performance computing cluster networks, Equal-Cost Multi-Path routing (ECMP) is the default load balancing algorithm deployed in commercial network switches. Typically, the switch takes the five-tuple of each packet as a hash key and hashes it to a random output port. ECMP provides basic load balancing capability and is commonly used because of its simplicity; however, hash collisions and poor resilience to asymmetry prevent ECMP from achieving the expected performance in many cases. Previous load balancing efforts have focused on improving the balancing performance of Data Center Networks (DCNs) and other cloud-scale networks. On the one hand, some designs are mainly dedicated to upgrading hardware: for example, Ananta introduces a dedicated multiplexer (Mux), and SilkRoad performs balancing in a switching Application-Specific Integrated Circuit (ASIC). The goal of this type of work is to develop hardware dedicated to load balancing that outperforms commercial switches. On the other hand, designs dedicated mainly to software are common; they discuss how switches and backend servers can be programmed to achieve balance on existing hardware.
While these efforts work well in traditional networks, they struggle to achieve the same performance on a high performance computing cluster network because of some of its unique features, as follows:
(1) Scalability. Designs that develop dedicated hardware load balancers are difficult to deploy and scale up because of their high cost.
(2) Kernel bypass. RDMA is widely deployed in high performance computing cluster networks as a kernel-bypass transport that improves network throughput. With RDMA, clients directly access the server's memory and bypass the kernel transport stack, including the TCP protocol. Most existing load balancing work is designed around flowlets, which depend on characteristics of TCP. In an RDMA environment flowlets cannot be used, so the balancing performance of these designs degrades to the level of ECMP.
(3) Asymmetric network resources. To save costs, enterprises purchase different kinds and versions of hard disks for the HPC cluster, and hosts with different computing capabilities process the same DML task together. Hosts under different top-of-rack (ToR) switches may have different available bandwidth and thus obtain unequal network resources. Link failures can also lead to asymmetry. In an asymmetric network, locally optimal load balancing may cost global bandwidth utilization.
(4) Concurrent flows. Distributed applications such as DML initiate communication from many hosts, with large amounts of data flowing into the network at the same time. What happens if the load balancing mechanism always selects the minimum-load path for each host? Most traffic is arranged onto a few paths. We call this concurrent path contention. Such contention causes even more serious imbalance and congestion. Introducing a serialization process and a centralized scheduler might help, but reacts far too slowly.
Disclosure of Invention
To address the above challenges, the invention provides a multi-node high-performance computing cluster network traffic load balancing scheduling method (ArCa) and device, designed to provide complete load balancing capability for a high performance computing cluster network. The method is a simple application-layer communication module and can be implemented in any high performance computing cluster network. The only dependency of ArCa is ECMP, which is widely deployed as the default mechanism of such networks; ArCa therefore achieves high scalability and tolerates kernel-bypass communication. ArCa detects the global network load through in-band network telemetry (INT) packets and maintains a global view even in the presence of asymmetry. In addition, we introduce a new distributed load balancing algorithm, skip-flow, to avoid path contention. As the name suggests, the skip-flow algorithm spreads flows like flowing water and skips between paths at random; this simple introduction of randomness effectively avoids path contention.
In order to achieve the above purpose, the invention provides the following technical scheme:
In one aspect, the invention provides a method for load-balanced scheduling of network traffic in a multi-node high-performance computing cluster, which comprises the following steps:
S1, load detection: probe packets with different five-tuples are sent between the hosts, and each host obtains the mapping between the paths to other hosts and the five-tuples, as well as the load of those paths, forming a network load table;
S2, load estimation: the current path load of the network is estimated from historical load data;
S3, balancing decision: using the skip-flow algorithm, the host distributes traffic to the paths to complete load-balanced traffic scheduling.
Further, step S1 includes:
S101, sending probe packets A with different five-tuples from the sending-end host i in an end-to-end manner, wherein a five-tuple comprises a source IP, a destination IP, a source port number, a destination port number and a protocol; the switch maps each five-tuple uniquely to an output port through ECMP (Equal-Cost Multi-Path routing), and the five-tuple of probe packet A is expressed as <IP_i, IP_j, p_i, p_j, UDP>;
S102, when probe packet A reaches each switch along its fixed path, the switch appends its own information at the tail of the probe packet according to the INT protocol, the information comprising: the ingress port through which the packet entered the switch, the egress port through which it left the switch, the queuing delay of the switch, and the fixed bandwidth of the switch;
S103, after receiving probe packet A, the receiving-end host j immediately copies and sends an identical probe packet B whose five-tuple is expressed as <IP_j, IP_i, p_j, p_i, UDP>; probe packet B returns to the sending-end host i along the original path;
S104, the receiving-end host j parses probe packet A and establishes the mapping: five-tuple → path → load, where the path is the ordered sequence of ingress and egress ports of the switches through which the probe packet passed, and the load is represented by the bandwidth-delay product;
S105, the sending-end host i receives probe packet B, parses it into the five-tuple → path → load mapping in the same way as the receiving-end host j, and stores the mapping in the program.
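The following is a minimal sketch of step S101, assuming plain UDP sockets; the payload marker, port range, and addresses are illustrative assumptions, and a real implementation would carry an INT-capable probe header rather than an opaque payload:

```python
import socket

# Illustrative payload marker (an assumption, not from the patent).
PROBE_PAYLOAD = b"ARCA-INT-PROBE"

def send_probes(dst_ip: str, dst_port: int, src_ports: range) -> None:
    """Send one UDP probe per candidate five-tuple toward host j.

    Only the source port varies, so each probe carries a distinct
    five-tuple and ECMP pins it to one fixed switch path.
    """
    for src_port in src_ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            sock.bind(("", src_port))   # vary only the source port
            sock.sendto(PROBE_PAYLOAD, (dst_ip, dst_port))
        finally:
            sock.close()

# Example: probe 16 candidate paths toward a host assumed at 10.0.0.2.
send_probes("10.0.0.2", 9000, range(40000, 40016))
```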
Further, in step S104, it is assumed that the probe packet passes through a path from the sender host i to the receiver host j
Figure BDA0003863638160000041
Has a bandwidth of->
Figure BDA0003863638160000042
A total delay from the sender to the receiver of &>
Figure BDA0003863638160000043
Then the bandwidth-delay product is calculated by the following formula:
Figure BDA0003863638160000044
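As a concrete reading of this metric, here is a hedged sketch that derives the path identifier and the bandwidth-delay product from the per-switch INT records of step S102; taking the path bandwidth as the bottleneck (minimum) switch bandwidth and the delay as the sum of queuing delays is an assumption, since the patent defines only the product itself:

```python
from dataclasses import dataclass

@dataclass
class IntRecord:
    """One per-switch INT record from step S102 (field names assumed)."""
    in_port: int          # ingress port at this switch
    out_port: int         # egress port at this switch
    queue_delay_s: float  # queuing delay at this switch, in seconds
    bandwidth_bps: float  # fixed bandwidth of this switch, in bits/s

def path_id(records: list[IntRecord]) -> tuple:
    """The path is the ordered sequence of (ingress, egress) port pairs."""
    return tuple((r.in_port, r.out_port) for r in records)

def path_load(records: list[IntRecord]) -> float:
    """Bandwidth-delay product: path bandwidth times end-to-end delay."""
    bandwidth = min(r.bandwidth_bps for r in records)  # bottleneck link
    delay = sum(r.queue_delay_s for r in records)      # summed queuing
    return bandwidth * delay
```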
Further, step S2 includes:
S201, obtaining a predicted load based on an exponentially weighted moving average (EWMA);
S202, adjusting the EWMA parameter according to load feedback.
Further, in step S201, the EWMA formula is:
Figure BDA0003863638160000051
wherein
Figure BDA0003863638160000052
Is at decision time t, path->
Figure BDA0003863638160000053
The estimated load of (2); and delta t is the interval between the load data updating time of the latest received detection packet and the decision-making time.
Further, in step S202, each time the probe packet brings back the network load, according to the actual data of the probe packet, the parameters of the EWMA are adjusted based on the following formula:
Figure BDA0003863638160000054
where Δ t' is the time interval between the current and the last received probe packet.
Further, step S3 includes:
S301, path arrangement: a number is assigned to each path in turn; paths with adjacent numbers are neighbor paths, a path with a smaller number than the current path is a left neighbor, and a path with a larger number is a right neighbor;
S302, traffic scheduling: assuming there are m paths from the current sending-end host i to the receiving-end host j and a series of traffic demands F waiting to be sent to host j, the skip-flow algorithm is adopted.
Further, the skip-flow algorithm of step S3 includes:
S3021, obtaining a copy of the predicted path loads at the current moment, storing it in the current process as a path → load mapping, and allocating the traffic waiting to be scheduled to the paths of the copy one by one;
S3022, randomly selecting a path k and comparing its load with its left and right neighbor paths; after recursive comparison, obtaining the lowest-load path reachable from path k on either side, and allocating part of the traffic demand waiting for transmission to that path so that its load becomes close to, or slightly higher than, its neighboring paths;
S3023, if traffic demand remains unallocated, returning to step S3022 to randomly select a path again; otherwise the traffic demand is fully scheduled;
S3024, for each traffic demand, modifying the five-tuple according to the five-tuple → path mapping, so that on the basis of the ECMP mechanism the traffic is deterministically transmitted over the pre-scheduled path;
S3025, establishing a connection between the sending-end host i and the receiving-end host j and transmitting the traffic;
S3026, applying steps S3021 to S3025 to the traffic demand of each host.
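A runnable sketch of steps S3021 to S3023 follows, assuming the paths are numbered 0 to n-1 on a ring as in step S301; the fill step `unit`, the single-target allocation, and all names are illustrative simplifications (the worked example in the detailed description also spreads traffic across both neighbors):

```python
import random

def skip_flow(loads: list[float], demand: float, unit: float = 1.0,
              rng: random.Random | None = None) -> list[float]:
    """Distribute `demand` over a working copy of the predicted loads."""
    rng = rng or random.Random()
    loads = list(loads)          # S3021: work on a copy of the estimates
    n = len(loads)
    while demand > 0:
        k = rng.randrange(n)     # S3022: randomly select a path k
        # Walk left and right while the next neighbor is less loaded:
        # the recursive comparison that finds the lowest-load path
        # reachable from k on either side.
        left = k
        while loads[(left - 1) % n] < loads[left]:
            left = (left - 1) % n
        right = k
        while loads[(right + 1) % n] < loads[right]:
            right = (right + 1) % n
        target = left if loads[left] <= loads[right] else right
        # Raise the chosen path to slightly above its least-loaded
        # neighbor, bounded by the remaining demand (S3022's fill rule).
        neighbor = min(loads[(target - 1) % n], loads[(target + 1) % n])
        step = min(demand, max(neighbor + unit - loads[target], unit))
        loads[target] += step
        demand -= step           # S3023: repeat until demand is scheduled
    return loads
```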
On the other hand, the invention also provides a device for load-balanced scheduling of network traffic in a multi-node high-performance computing cluster, which comprises the following modules to implement the above method:
the load detection module, used for sending probe packets with different five-tuples between the hosts, obtaining on each host the mapping between the paths to other hosts and the five-tuples as well as the load of those paths, and forming a network load table;
the load estimation module, used for estimating the current path load of the network from historical load data;
and the balancing decision module, used for distributing traffic to the paths on each host and completing load-balanced traffic scheduling by means of the skip-flow algorithm module.
Further, the skip flow algorithm module comprises:
S3021, obtaining a copy of the predicted path loads at the current moment, storing it in the current process as a path → load mapping, and allocating the traffic waiting to be scheduled to the paths of the copy one by one;
S3022, randomly selecting a path k and comparing its load with its left and right neighbor paths; after recursive comparison, obtaining the lowest-load path reachable from path k on either side, and allocating part of the traffic demand waiting for transmission to that path so that its load becomes close to, or slightly higher than, its neighboring paths;
S3023, if traffic demand remains unallocated, returning to step S3022 to randomly select a path again; otherwise the traffic demand is fully scheduled;
S3024, for each traffic demand, modifying the five-tuple according to the five-tuple → path mapping, so that on the basis of the ECMP mechanism the traffic is deterministically transmitted over the pre-scheduled path;
S3025, establishing a connection between the sending-end host i and the receiving-end host j and transmitting the traffic;
S3026, applying steps S3021 to S3025 to the traffic demand of each host.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a multi-node high-performance computing cluster network traffic load balancing scheduling method that detects network path loads through in-band network telemetry, establishing a global view and avoiding the local blind spots caused by asymmetry; that estimates path loads based on EWMA and the path-load detection results, improving the accuracy and timeliness of the load balancing mechanism; and that, based on the skip-flow algorithm, actively adjusts the five-tuples of outgoing traffic on the server, exploits the stability of the switches' ECMP mechanism to complete load-balanced traffic scheduling, and adds randomness to routing to avoid concurrent path contention.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for the embodiments are briefly described below. It should be apparent that the drawings in the following description are merely some embodiments of the present invention, and other drawings can be derived from them by those skilled in the art.
Fig. 1 is a schematic block diagram of a method for load balancing and scheduling of network traffic of a multi-node high-performance computing cluster according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the skip-flow algorithm according to an embodiment of the present invention.
FIG. 3 is a graph comparing the performance of ArCa and ECMP on the switch side according to an embodiment of the present invention.
Fig. 4 is a comparison diagram of performance of ArCa and ECMP at the server side according to an embodiment of the present invention.
Detailed Description
For a better understanding of the present solution, the method of the present invention is described in detail below with reference to the accompanying drawings.
The invention discloses a multi-node high-performance computing cluster network traffic load balancing scheduling method (ArCa), which, as shown in fig. 1, is a modular system comprising three parts: network load detection, load estimation, and balancing decision. ArCa implements all parts on the hosts, without any changes to the multi-node high performance computing cluster network. Hosts send probe packets to one another and obtain the mapping between paths and five-tuples as well as the load of each path. This information forms a network load table, on the basis of which ArCa estimates the short-term load of the network. Based on this estimate, each host makes traffic-planning decisions and selects paths for sending messages with the help of the skip-flow algorithm. Each module runs as an independent process, and the modules exchange information through inter-process communication (e.g., MMAP), which improves the overall concurrency of the mechanism as well as its reaction speed and decision efficiency.
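As a minimal sketch of the MMAP-style inter-process communication mentioned above, the detection process could publish the network load table through a shared memory-mapped file that the estimation and decision processes map read-only; the file path, fixed table size, and byte-level serialization are illustrative assumptions:

```python
import mmap

TABLE_FILE = "/dev/shm/arca_load_table"   # hypothetical shared file
TABLE_SIZE = 4096                         # fixed mapping size (assumed)

def publish_table(table_bytes: bytes) -> None:
    """Writer side (load detection process): overwrite the table in place."""
    with open(TABLE_FILE, "w+b") as f:
        f.truncate(TABLE_SIZE)            # pin the file to the mapping size
        with mmap.mmap(f.fileno(), TABLE_SIZE) as m:
            m[:len(table_bytes)] = table_bytes

def read_table() -> bytes:
    """Reader side (estimation / decision processes): map read-only."""
    with open(TABLE_FILE, "rb") as f:
        with mmap.mmap(f.fileno(), TABLE_SIZE,
                       access=mmap.ACCESS_READ) as m:
            return bytes(m)
```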
Specifically, ArCa comprises the following steps:
Step 1: (Module 1) Load detection. Send INT probe packets and detect the load of each path.
Step 1.1: in an end-to-end fashion, a probe packet with a different quintuple is sent out at the sending end (host i). Wherein the five-membered group consists of<Source IP, destination IP, source port number, destination port number, protocol>The switch sends the quintuple uniquely corresponding to a certain port through the ECMP. Therefore, when the quintuple is fixed, the switch port through which the flow passes is also fixed, thereby obtaining a fixed path. It is assumed here that the quintuple of a certain probe packet A is<IP i ,IP j ,p i ,p j ,UDP>。
Step 1.2: when the detection packet a reaches each switch along a fixed path, the switch adds the relevant information of the switch to the end of the detection packet according to the INT protocol. Such information includes: the access port of the local switch, the output port of the local switch, the queuing delay of the local switch and the fixed bandwidth of the local switch.
Step 1.3: after receiving the detection packet A, the receiving end (host j) immediately copies a same packet B, exchanges the source IP and the destination IP, the source port number and the destination port number, and the quintuple of the B is<IP j ,IP i ,p j ,p i ,UDP>. B is sent out and will return to host i along the original path.
Step 1.4: the receiving end analyzes the detection packet A and establishes a mapping relation: quintuple → path → load. Wherein, the path is the ordered arrangement of the access ports of the switch through which the detection packets pass; the load is expressed using the bandwidth-delay product. Suppose a probe packet traverses a path from host i to j
Figure BDA0003863638160000081
Has a bandwidth of->
Figure BDA0003863638160000082
The total delay from the transmitting end to the receiving end is pick>
Figure BDA0003863638160000083
The bandwidth-delay product can be calculated by the following formula.
Figure BDA0003863638160000084
Step 1.5: the sending end receives the detection packet B, and analyzes the detection packet B into a mapping relation of five-tuple → path → load in the same way as the receiving end, and stores the mapping relation in a program. Through the steps, the network load and path mapping relation to other hosts is obtained on each host.
Step 2: (Module 2) Load estimation. Predict the current path load from historical load data. Because there is an interval between the detection time and the time of the load balancing decision, the detected data cannot accurately reflect the current network load, so decision accuracy must be improved by estimation.
Step 2.1: based on EWMA, a predicted load is obtained. In the present invention, the EWMA formula is:
Figure BDA0003863638160000085
wherein
Figure BDA0003863638160000086
Is at decision time t, path->
Figure BDA0003863638160000087
The estimated load of (2); and delta t is the interval between the load data updating time of the latest received detection packet and the decision-making time.
Step 2.2: and adjusting the weight parameter alpha of the EWMA according to the load feedback. When the detection packet is brought back to the network load each time, the parameters of the EWMA are adjusted based on the following formula according to the actual data of the detection packet, and the prediction accuracy is improved.
Figure BDA0003863638160000091
Where Δ t' is the time interval between the current and the last received probe packet.
Step 3: (Module 3) Balanced scheduling. Use the skip-flow algorithm to distribute traffic over the paths and complete load-balanced traffic scheduling. Assume there are m paths from the current host i to host j and a series of traffic demands F waiting to be sent to host j. In fact, each path is a different sequence of switch ingress and egress ports; even if two paths pass through the same switch, they do not affect each other as long as their ingress and egress ports differ, so we regard them as two different paths. The traffic scheduling steps are then as follows:
step 3.1: and (6) path arrangement. Each path is numbered in turn so that it has a "neighbor" path: the adjacent paths with the numbers are neighbor paths, the paths with the numbers smaller than the current paths are left neighbors, and the paths with the numbers larger than the current paths are right neighbors. For example, for a path numbered k, the ith bit neighbor to the left is (k-i)% n and the ith bit neighbor to the right is (k + i)% n. Where n is the total number of paths,% represents the modulo operation.
Step 3.2: traffic scheduling
The specific working flow of the skip-flow algorithm of the present invention is shown in fig. 2. The method specifically comprises the following steps:
and 3.2.1, communicating with a module II, acquiring a predicted value copy of the path load at the current moment, and storing the predicted value copy in the current process in a path-load mapping mode. Traffic waiting to be scheduled will be allocated to the path of the replica one by one.
Step 3.2.2, a path k is randomly selected, load comparison is carried out on the path k and the adjacent paths k-1 and k +1, the path k-1 is compared with the path k-2, and the path k +1 is compared with the path k + 2. The traffic demand waiting for transmission is distributed in part to the path with the lowest load, making the load on the path close to or slightly higher than its neighboring paths.
Step 3.2.3: If traffic demand remains unallocated, return to step 3.2.2 and randomly select a path again; otherwise the traffic demand is fully scheduled.
Step 3.2.4: For each traffic demand, modify the five-tuple according to the five-tuple → path mapping, so that under the ECMP mechanism the traffic is deterministically transmitted over the pre-scheduled path.
Step 3.2.5: Establish a connection between host i and host j and transmit the traffic.
Step 3.2.6: Apply the above load balancing mechanism to the traffic demand of every host, achieving load balance across the whole network.
We explain with a specific example. There are 8 paths between sending-end host i and receiving-end host j, numbered 0 to 7 in black, and each bar represents the capacity and load of one path: the red portion is the path's load, the light yellow portion its bandwidth, and the white number at the bottom of each bar the load value. The solid box marks the randomly selected path, and the two dashed boxes mark the flow path and the jump path, respectively. For convenience, assume each traffic demand is 1 unit in size and the bandwidth of each path is 10 units. Initially there is a traffic demand of 6 units to be scheduled.
Step 1: Path 1 is chosen at random. Compared with its neighbors, path 1's load is less than path 0's but greater than path 2's, so traffic is routed to path 2, raising path 2's load to that of path 1. At this point 1 unit of traffic demand has been allocated. A random number 61 is generated, so the path selected for the second round is (1 + 61) % 8 = 6.
Step 2: The remaining 5 units of traffic demand are allocated. For path 6, both neighbors, paths 5 and 7, are less loaded, so traffic is distributed to paths 5 and 7, consuming 5 units of demand. A random number 82 is generated, so the path selected for the third round is (6 + 82) % 8 = 0.
Step 3: No traffic demand remains, and the scheduling algorithm terminates.
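The walk-through can be exercised with the skip_flow sketch given after steps S3021 to S3026 above; the initial loads below are hypothetical stand-ins for the figure's values, which survive only in the image:

```python
import random

# Eight paths, 6 units of demand, unit size 1 (values are illustrative).
initial = [5.0, 4.0, 3.0, 4.0, 5.0, 2.0, 6.0, 2.0]
final = skip_flow(initial, demand=6.0, unit=1.0, rng=random.Random(7))
print(final)   # the 6 units land on the lightly loaded paths
```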
The performance of ArCa according to the present invention is compared below with that of the existing ECMP:
we have completed some experiments in a practical production environment. Fig. 3 shows the achieved load performance of the two load balancing schemes at the exchange side. The red bar shows that ECMP forwards at least 400Gbps of traffic to port number 10 and port number 9, which results in an imbalance of port numbers 0-4, while port number 0 sends almost no traffic. Traffic that would have been forwarded to ports 0-4 is actually forwarded to ports 7-10. It can be said that stream collisions occur at different levels at the ports 7 to 10. ArCa achieves a higher degree of equilibrium. In the case of coarse-grained elephant traffic, the skip-flow algorithm schedules traffic approximately evenly on each path. Specifically, the skip-flow algorithm schedules traffic through different output ports of the switch on average. If we evaluate the balance with the standard deviation of the throughput, the value of ECMP is 148.06, while the value of ArCa is 2.96, almost 49 times higher.
Fig. 4 shows the server-side throughput, where each bar is the communication throughput between a pair of hosts. To our knowledge, switch-side imbalance does not always degrade host-side performance, which is why load balancing improvements are often hard to translate into network performance gains; the throughput differences at the hosts are not as pronounced as at the switch output ports. Nevertheless, the flow collisions and imbalance under ECMP eventually degrade host performance, as the red bars show. ArCa keeps host throughput within a 6 Gbps band (191-197 Gbps), compared with a 38 Gbps band (160-198 Gbps) under ECMP, an almost 6.3-fold reduction in spread. Meanwhile, the standard deviation of throughput under ECMP is 13.61 versus 1.78 under ArCa, a 7.08-fold improvement, which is much smaller than the improvement at the switch. In addition, the overall throughput of ArCa is slightly higher than that of ECMP: the average server-side throughput of ArCa is 22.5% higher.
In summary, we have identified the main challenges of implementing load balancing on a multi-node HPC cluster network and proposed our solution, ArCa, a software-based load balancing method with high scalability, kernel-bypass tolerance, asymmetry resilience, and contention avoidance. ArCa is built on top of the widely deployed ECMP, is very easy to deploy thanks to its application-layer implementation, and solves the load balancing failures caused by kernel-bypass RDMA transmission. ArCa builds a view of the global load with the help of INT probing, providing resilience to asymmetric network topologies. We also designed the skip-flow algorithm to avoid concurrent path contention by adding randomness to load-balanced flow scheduling. Finally, our evaluation shows that, compared with native ECMP, ArCa nearly achieves full load balance and improves network throughput by 22.5%.
The above examples are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their technical features, without departing from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-node high-performance computing cluster network traffic load balancing scheduling method is characterized by comprising the following steps:
S1, load detection: probe packets with different five-tuples are sent between the hosts, and each host obtains the mapping between the paths to other hosts and the five-tuples, as well as the load of those paths, forming a network load table;
S2, load estimation: the current path load of the network is estimated from historical load data;
S3, balancing decision: using the skip-flow algorithm, the host distributes traffic to the paths to complete load-balanced traffic scheduling.
2. The method for load balancing and scheduling traffic of a multi-node high performance computing cluster network according to claim 1, wherein step S1 comprises:
S101, sending probe packets A with different five-tuples from the sending-end host i in an end-to-end manner, wherein a five-tuple comprises a source IP, a destination IP, a source port number, a destination port number and a protocol; the switch maps each five-tuple uniquely to an output port through ECMP (Equal-Cost Multi-Path routing), and the five-tuple of probe packet A is expressed as <IP_i, IP_j, p_i, p_j, UDP>;
S102, when probe packet A reaches each switch along its fixed path, the switch appends its own information at the tail of the probe packet according to the INT protocol, the information comprising: the ingress port at the switch, the egress port at the switch, the queuing delay of the switch, and the fixed bandwidth of the switch;
S103, after receiving probe packet A, the receiving-end host j immediately copies and sends an identical probe packet B whose five-tuple is expressed as <IP_j, IP_i, p_j, p_i, UDP>; probe packet B returns to the sending-end host i along the original path;
S104, the receiving-end host j parses probe packet A and establishes the mapping: five-tuple → path → load, where the path is the ordered sequence of ingress and egress ports of the switches through which the probe packet passed, and the load is represented by the bandwidth-delay product;
S105, the sending-end host i receives probe packet B, parses it into the five-tuple → path → load mapping in the same way as the receiving-end host j, and stores the mapping in the program.
3. The method as claimed in claim 2, wherein in step S104, assuming the probe packet traverses a path P from the sending-end host i to the receiving-end host j, with the bandwidth of the path being B_P and the total delay from the sending end to the receiving end being D_P, the bandwidth-delay product is calculated by the following formula:

Load_P = B_P × D_P
4. the method for load balancing and scheduling traffic of a multi-node high performance computing cluster network according to claim 1, wherein the step S2 comprises:
s201, obtaining a predicted load based on EWMA;
s202, adjusting the EWMA parameters according to the load feedback.
5. The method for load balancing and scheduling of network traffic of a multi-node high-performance computing cluster according to claim 4, wherein in step S201 the EWMA formula is:

EstLoad_P(t) = α^Δt × Load_P(t - Δt)

where EstLoad_P(t) is the estimated load of path P at decision time t, Load_P(t - Δt) is the load reported by the most recently received probe packet, and Δt is the interval between the load-data update time of that probe packet and the decision time.
6. The method of claim 4, wherein in step S202, each time the probe packet brings back a network load, the EWMA parameter is adjusted from the probe packet's actual data based on the following formula:

α = (Load_P(t) / Load_P(t - Δt'))^(1/Δt')

where Δt' is the time interval between the current and the previously received probe packet.
7. The method for network traffic load balancing scheduling of multi-node high-performance computing clusters according to claim 1, wherein step S3 comprises:
S301, path arrangement: a number is assigned to each path in turn; paths with adjacent numbers are neighbor paths, a path with a smaller number than the current path is a left neighbor, and a path with a larger number is a right neighbor;
S302, traffic scheduling: assuming there are m paths from the current sending-end host i to the receiving-end host j and a series of traffic demands F waiting to be sent to host j, the skip-flow algorithm is adopted.
8. the method for load balancing and scheduling traffic of a multi-node high performance computing cluster network according to claim 1, wherein the leap-flow algorithm of step S3 comprises:
S3021, obtaining a copy of the predicted path loads at the current moment, storing it in the current process as a path → load mapping, and allocating the traffic waiting to be scheduled to the paths of the copy one by one;
S3022, randomly selecting a path k and comparing its load with its left and right neighbor paths; after recursive comparison, obtaining the lowest-load path reachable from path k on either side, and allocating part of the traffic demand waiting for transmission to that path so that its load becomes close to, or slightly higher than, its neighboring paths;
S3023, if traffic demand remains unallocated, returning to step S3022 to randomly select a path again; otherwise the traffic demand is fully scheduled;
S3024, for each traffic demand, modifying the five-tuple according to the five-tuple → path mapping, so that on the basis of the ECMP mechanism the traffic is deterministically transmitted over the pre-scheduled path;
S3025, establishing a connection between the sending-end host i and the receiving-end host j and transmitting the traffic;
S3026, applying steps S3021 to S3025 to the traffic demand of each host.
9. A multi-node high-performance computing cluster network traffic load balancing scheduling device, characterized by comprising the following modules to implement the method of any one of claims 1 to 8:
the load detection module, used for sending probe packets with different five-tuples between the hosts, obtaining on each host the mapping between the paths to other hosts and the five-tuples as well as the load of those paths, and forming a network load table;
the load estimation module, used for estimating the current path load of the network from historical load data;
and the balancing decision module, used for distributing traffic to the paths on each host and completing load-balanced traffic scheduling by means of the skip-flow algorithm module.
10. The apparatus of claim 9, wherein the skip-flow algorithm module comprises:
S3021, obtaining a copy of the predicted path loads at the current moment, storing it in the current process as a path → load mapping, and allocating the traffic waiting to be scheduled to the paths of the copy one by one;
S3022, randomly selecting a path k and comparing its load with its left and right neighbor paths; after recursive comparison, obtaining the lowest-load path reachable from path k on either side, and allocating part of the traffic demand waiting for transmission to that path so that its load becomes close to, or slightly higher than, its neighboring paths;
S3023, if traffic demand remains unallocated, returning to step S3022 to randomly select a path again; otherwise the traffic demand is fully scheduled;
S3024, for each traffic demand, modifying the five-tuple according to the five-tuple → path mapping, so that on the basis of the ECMP mechanism the traffic is deterministically transmitted over the pre-scheduled path;
S3025, establishing a connection between the sending-end host i and the receiving-end host j and transmitting the traffic;
S3026, applying steps S3021 to S3025 to the traffic demand of each host.
CN202211180131.3A 2022-09-26 2022-09-26 Multi-node high-performance computing cluster network traffic load balancing scheduling method and device Pending CN115941676A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211180131.3A CN115941676A (en) 2022-09-26 2022-09-26 Multi-node high-performance computing cluster network traffic load balancing scheduling method and device


Publications (1)

Publication Number Publication Date
CN115941676A 2023-04-07

Family

ID=86649628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211180131.3A Pending CN115941676A (en) 2022-09-26 2022-09-26 Multi-node high-performance computing cluster network traffic load balancing scheduling method and device

Country Status (1)

Country Link
CN (1) CN115941676A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination