CN115462049B - Forwarding path planning method for large-scale data center network - Google Patents

Forwarding path planning method for large-scale data center network

Info

Publication number
CN115462049B
Authority
CN
China
Prior art keywords
forwarding path
data
computing node
data packet
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202080100357.0A
Other languages
Chinese (zh)
Other versions
CN115462049A
Inventor
梁建国
齐辰晨
郑海洋
施学美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Publication of CN115462049A publication Critical patent/CN115462049A/en
Application granted granted Critical
Publication of CN115462049B publication Critical patent/CN115462049B/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/12 Shortest path evaluation
    • H04L45/121 Shortest path evaluation by minimising delays
    • H04L45/123 Evaluation of link metrics
    • H04L45/28 Routing or path finding of packets in data switching networks using route fault recovery
    • H04L45/74 Address processing for routing
    • H04L45/745 Address table lookup; Address filtering
    • H04L45/7453 Address table lookup; Address filtering using hashing
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/11 Identifying congestion
    • H04L47/12 Avoiding congestion; Recovering from congestion
    • H04L47/122 Avoiding congestion; Recovering from congestion by diverting traffic away from congested entities
    • H04L49/00 Packet switching elements
    • H04L49/15 Interconnection of switching modules
    • H04L49/1553 Interconnection of ATM switching modules, e.g. ATM switching fabrics
    • H04L49/1569 Clos switching fabrics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks

Abstract

A method for planning forwarding paths in a large-scale data center network is provided. The method is implemented by a first computing node in the network. The method includes obtaining information associated with an algorithm implemented by at least one second computing node; obtaining stored network topology data associated with the at least one second computing node; receiving, at the first computing node, a first data packet from a source device to be forwarded to a destination device; determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data; and transmitting the first data packet to the destination device over the forwarding path.

Description

Forwarding path planning method for large-scale data center network
Background
Data center networks typically use a tightly interconnected network topology to provide high bandwidth for internal data exchanges. In such networks, an efficient load balancing scheme is essential so that all available bandwidth resources can be utilized. To utilize all of the available bandwidth, data flows must be spread across the network rather than overloading a single path. An Equal Cost Multipath (ECMP) path planning algorithm may use multiple equal-cost paths from a source node to a destination node in the network. The advantage of using this algorithm is that data flows can be split more evenly throughout the network, avoiding congestion and increasing bandwidth utilization.
In existing data center networks, such as two-tier or three-tier Clos networks, multiple servers are connected to the network through a first-tier switch, such as a leaf switch or an access switch. When a data flow arrives, a server forwards its data packets along various paths in the network to their respective destinations. Packet forwarding may be determined based on "server version" topology information and pre-computed values associated with the various paths. When a packet arrives at the next-hop leaf switch, the leaf switch performs dynamic path planning to distribute the packet based on "switch version" topology information and a path planning algorithm. Since the server is not configured to perform dynamic path planning, the "server version" topology information may not represent real-time network topology information. When network congestion occurs, the server cannot determine the source of the network congestion and respond in time to reroute the data flow.
Summary
Methods and systems for dynamic path planning in a large-scale data center network are provided. The present disclosure enables the dynamic path planning capability of a switch (e.g., a leaf switch or access switch of a two-tier or three-tier Clos network) on a compute node (e.g., a server device) of the network. The compute node communicates with one or more switches through various protocols, such as the Link Layer Discovery Protocol (LLDP), to obtain information about the path planning or routing algorithms used by the switches and the network topologies associated with the switches. The computing node also configures the path planning or routing algorithm it uses based on the information related to the path planning or routing algorithm used by the switches. The path planning or routing algorithm used by the switches may include an Equal Cost Multipath (ECMP) planning algorithm. The computing node further synchronizes the network topology associated with the computing node with the network topology associated with the switches. The present disclosure enables a computing node to dynamically route received data packets before they reach the switches of a large-scale data center network, thereby effectively avoiding data flow collisions in the network. Further, a computing node according to the present disclosure can effectively detect network congestion and respond in time by rerouting data flows to bypass the congestion.
The detailed description refers to the accompanying drawings. In the figures, the leftmost digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference symbols in different drawings indicates similar or identical items.
Fig. 1 illustrates an example environment in which a forwarding path planning system may be used according to embodiments of the present disclosure.
Fig. 2 illustrates an example network architecture of a large-scale data center network according to an embodiment of the present disclosure.
Fig. 3 illustrates an example of a failure occurring in a large-scale data center network according to an embodiment of the present disclosure.
Fig. 4 illustrates an example configuration of a computing node for implementing a forwarding path planning method according to an embodiment of the present disclosure.
Fig. 5 illustrates an example forwarding path planning according to embodiments of the present disclosure.
Fig. 6 illustrates another example forwarding path planning according to embodiments of the present disclosure.
Fig. 7 illustrates an example Equal Cost Multipath (ECMP) plan according to an embodiment of the disclosure.
Fig. 8 illustrates an exemplary forwarding path planning algorithm according to an embodiment of the present disclosure.
Fig. 9 illustrates another example forwarding path planning algorithm according to an embodiment of the present disclosure.
Fig. 10 illustrates another example forwarding path planning algorithm according to an embodiment of the present disclosure.
Fig. 11 illustrates another example forwarding path planning algorithm according to an embodiment of the present disclosure.
Detailed Description
The present application describes multiple and varied embodiments and implementations. The following section describes an example framework suitable for practicing various implementations. The present application then describes example systems, devices, and processes for implementing a forwarding path planning system.
Fig. 1 illustrates an example environment in which a forwarding path planning system may be used according to embodiments of the present disclosure. The environment 100 may include a data center network 102. In this example, the data center network 102 may include a plurality of compute nodes or servers 104-1, 104-2, …, 104-K (hereinafter collectively referred to as compute nodes 104), where K is a positive integer greater than 1. In an embodiment, multiple computing nodes 104 may communicate data with each other via a communication network 106.
The computing node 104 may be implemented as any of a variety of computing devices having computing/processing and communication capabilities, which may include, but are not limited to, servers, desktop computers, notebook or portable computers, handheld devices, netbooks, internet devices, tablet computers, mobile devices (e.g., mobile phones, personal digital assistants, smart phones, etc.), and the like, or a combination thereof.
The communication network 106 may be a wireless or wired network, or a combination of both. The network 106 may be a collection of individual networks interconnected with each other and functioning as a single large network (e.g., the Internet or an intranet). Examples of such individual networks include, but are not limited to, telephone networks, wired networks, Local Area Networks (LANs), Wide Area Networks (WANs), and Metropolitan Area Networks (MANs). Further, each individual network may be a wireless or wired network, or a combination of both. A wired network may include electrical carrier connections (such as communication cables) and/or optical carriers or connections (such as fiber optic connections). A wireless network may include, for example, a WiFi network, other radio frequency networks (e.g., Zigbee, etc.), and the like. In implementations, the communication network 106 may include a plurality of inter-node interconnects or switches 108-1, 108-2, …, 108-L (hereinafter collectively referred to as inter-node switches 108) for providing connectivity between the computing nodes 104, where L is a positive integer greater than 1.
In an implementation, the environment 100 may also include a plurality of client devices 110-1, 110-2, …, 110-N (hereinafter collectively referred to as client devices 110), where N is a positive integer greater than 1. In an embodiment, users of the client devices 110 may communicate with each other via the communication network 106 or access online resources and services. These online resources and services may be implemented at the compute nodes 104. Data streams generated by users of the client devices 110 may be distributed to multiple routing paths and routed to destination devices through one or more of the multiple paths. In an implementation, a destination device may include another client device 110 or a computing node 104. In an implementation, each of the plurality of routing paths may include one or more compute nodes 104 and switches 108 interconnected by physical links.
Fig. 2 illustrates an example network architecture of a large-scale data center network according to an embodiment of the present disclosure. The network architecture of the large-scale data center network 200 provides a detailed view of an environment in which the forwarding path planning system may be used. In an implementation, the network architecture of the large-scale data center network is a three-tier Clos network architecture with a full-mesh topology. The first tier corresponds to a layer of leaf switches 206, also referred to as access switches or top-of-rack (ToR) switches. The compute nodes 208 are directly connected to the leaf switches 206, with each compute node 208 connected to at least two leaf switches 206. In an implementation, a compute node 208 may include one or more network interface controllers (e.g., four network interface controllers) connected to one or more ports (e.g., four ports) of a leaf switch 206. In an implementation, the number of network interface controllers in each computing node 208 may be the same or different. The second tier corresponds to a layer of aggregation switches 204 (also referred to as spine switches 204), each connected to one or more leaf switches 206. In an implementation, a plurality of compute nodes 208, the interconnected leaf switches 206, and the interconnected aggregation switches 204 may form a point of delivery (PoD) unit, e.g., PoD-1 and PoD-2 as shown. The third tier corresponds to a layer of core switches 202 connected to one or more aggregation switches 204. The core switches 202 are located at the top of the cloud data center network hierarchy and may include Wide Area Network (WAN) connections to an external carrier network.
In an implementation, if two processing units or processes included in different computing nodes 208 are connected under the same leaf switch 206, data packets transmitted between the two processing units or processes will pass through that leaf switch 206 without passing through any aggregation switch 204 or core switch 202. Alternatively, if two processing units or processes in different computing nodes are connected under different leaf switches, the data packets transmitted between the two processing units or processes will pass through one of the aggregation switches. In an implementation, packets transmitted between two processing units or processes may be directed through a designated aggregation switch by setting an appropriate combination of source and destination ports in the packets. Route management for congestion avoidance may aim to pass data flows from the same source leaf switch to different destination leaf switches through different aggregation switches, and/or to pass data flows from different source leaf switches to the same destination leaf switch through different aggregation switches, thereby avoiding collisions between data flows and keeping the aggregation switches free of network congestion.
In an embodiment, a processing unit or process in a computing node 208 may send data to or receive data from a processing unit or process in another computing node through a Network Interface Controller (NIC). In implementations, a processing unit or process in a computing node 208 may be associated with a single network interface controller or multiple network interface controllers for transmitting data to processing units or processes in other computing nodes. Additionally or alternatively, multiple processing units or processes may be associated with a single network interface controller and employ the network interface controller for transmitting data to the processing units or processes in other computing nodes. In an implementation, a plurality of rules for packet forwarding/routing may be implemented on the computing node 208. The plurality of rules may include, but are not limited to, a priority of a processing unit or process in a first computing node to select an adjacent processing unit or process, a condition of a network interface controller in the first computing node to send or receive data, a condition of a network interface controller in the first computing node to route data to or from a network interface controller in a second computing node, and the like.
In an implementation, route management may assign a Network Interface Controller (NIC) identifier to each network interface controller connected or linked to the same leaf switch. In some examples, when the network interface controller of a processing unit or process and the network interface controller of the next processing unit or process are located in the same computing node or are directly connected or linked to the same leaf switch, the routing identifier may be determined to be a default value or identifier. The default routing identifier indicates that the data is routed either within the compute node or through the leaf switch, but not through any aggregation switch in the communication network. Otherwise, the routing identifier may be determined to be equal to the NIC identifier, or another predefined value, for the processing unit or process. Based on a mapping between routing identifiers and aggregation identifiers, the aggregation identifier may be determined from the determined routing identifier. In an implementation, the mapping relationship between routing identifiers and aggregation identifiers may be pre-determined, for example, using a probe-based routing mechanism (e.g., sending probe packets between computing nodes). In other words, data flows between processing units (or processes) whose network interface controllers are located in the same computing node or connected to the same leaf switch will not pass through any aggregation switch in the communication network. On the other hand, data flows between processing units (or processes) whose network interface controllers are located in different computing nodes and connected to different leaf switches will pass through designated aggregation switches according to the predetermined mapping relationship, so the data flows can be managed and distributed to different aggregation switches to avoid network congestion.
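For illustration only, the following Python sketch shows one way the identifier mapping described above could be expressed; the helper names, the dictionary-based NIC records, the default identifier value, and the route-to-aggregation table are assumptions and are not taken from the patent.

```python
# Minimal sketch of the NIC/routing/aggregation identifier mapping described
# above. All names and the default identifier value are illustrative
# assumptions, not part of the patent text.
from typing import Optional

DEFAULT_ROUTE_ID = 0  # data stays inside the node or its leaf switch

# Hypothetical mapping from routing identifier to aggregation switch
# identifier, e.g. pre-built with a probe-based routing mechanism.
ROUTE_TO_AGG = {1: "agg-1", 2: "agg-2", 3: "agg-1", 4: "agg-2"}


def routing_identifier(src_nic: dict, dst_nic: dict) -> int:
    """Return the routing identifier for a flow between two NICs."""
    same_node = src_nic["node"] == dst_nic["node"]
    same_leaf = src_nic["leaf"] == dst_nic["leaf"]
    if same_node or same_leaf:
        # Traffic never leaves the node or its leaf switch.
        return DEFAULT_ROUTE_ID
    # Otherwise reuse the source NIC identifier as the routing identifier.
    return src_nic["nic_id"]


def aggregation_switch(src_nic: dict, dst_nic: dict) -> Optional[str]:
    """Return the designated aggregation switch, or None for local traffic."""
    route_id = routing_identifier(src_nic, dst_nic)
    if route_id == DEFAULT_ROUTE_ID:
        return None
    return ROUTE_TO_AGG[route_id]


if __name__ == "__main__":
    a = {"node": "host-1", "leaf": "leaf-1", "nic_id": 2}
    b = {"node": "host-7", "leaf": "leaf-3", "nic_id": 1}
    print(aggregation_switch(a, b))  # -> "agg-2"
```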
It should be understood that the three-tier Clos network shown in fig. 2 is merely an example network architecture for a large-scale data center network. Other network architectures, including but not limited to two-tier Clos networks, may also be employed to build large-scale data center networks.
Fig. 3 illustrates an example failure occurring in a large-scale data center network according to an embodiment of the present disclosure. After the configuration of the data center network 300 is delivered, network anomalies may occur, including but not limited to link failures (e.g., failures 312, 316, and 320), compute node failures (e.g., failure 310), leaf switch failures (e.g., failure 318), aggregation switch failures (e.g., failure 314), or core switch failures (e.g., failure 322), resulting in packet loss and congestion on certain routing paths. To detect such anomalies in a data center network, detection techniques such as Network Quality Analysis (NQA) tracking may be introduced. In an implementation, one or more compute nodes in each point of delivery (PoD) unit may implement an NQA tracking scheme to detect that another compute node in the same or a different PoD unit, a leaf switch in the same or a different PoD unit, an aggregation switch in the same or a different PoD unit, or a core switch is inaccessible, etc.
One example detection method is to utilize one compute node in a point of delivery (PoD) unit as a detection source. By way of example and not limitation, the computing node 308-1 in PoD-1 may be assigned as the detection source. The computing node 308-1 may periodically ping other compute nodes, leaf switches, aggregation switches, and core switches by sending Internet Control Message Protocol (ICMP) echo request packets, and then wait for ICMP echo replies from each of these nodes and switches. If no echo reply is received from a computing node or switch within a preset period of time, also referred to as a time-to-live (TTL) period, the computing node 308-1 determines that the computing node or switch is inaccessible. In one example, when packet loss occurs for a large number of computing nodes connected to the same leaf switch, the detection source (i.e., the computing node designated for anomaly detection) may further determine that the leaf switch may have failed. In another example, when packet loss occurs only for sporadic computing nodes connected to the same leaf switch, the detection source may further determine that those sporadic computing nodes may be overloaded or that the corresponding ports on the leaf switch may be saturated. In yet another example, when packet loss occurs for a large number of computing nodes located in different point of delivery (PoD) units, the detection source may further determine that a failure may have occurred in one or more respective aggregation switches located therein and/or one or more respective core switches connected thereto. In yet another example, when packet loss occurs only for sporadic computing nodes located in different point of delivery (PoD) units, the detection source may further determine that those sporadic computing nodes may be overloaded or that corresponding ports on the aggregation switches and/or the core switches may be saturated.
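As a rough illustration of the ICMP echo probing and the loss-pattern reasoning above, the sketch below shells out to the system `ping` utility (Linux-style flags) rather than crafting raw ICMP packets, which would require elevated privileges; the host lists, names, and classification rules are illustrative assumptions.

```python
# Minimal sketch of an ICMP-echo based detection source. Targets and the
# loss-pattern rules are illustrative assumptions.
import subprocess
from collections import defaultdict

LEAF_MEMBERS = {
    "leaf-1": ["10.0.1.1", "10.0.1.2", "10.0.1.3"],
    "leaf-2": ["10.0.2.1", "10.0.2.2", "10.0.2.3"],
}


def reachable(host: str, timeout_s: int = 1) -> bool:
    """Send one echo request and report whether a reply came back in time."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def classify_failures(leaf_members: dict) -> dict:
    """Roughly localize failures from which probes went unanswered."""
    report = defaultdict(list)
    for leaf, hosts in leaf_members.items():
        lost = [h for h in hosts if not reachable(h)]
        if not lost:
            continue
        if len(lost) == len(hosts):
            # Widespread loss under one leaf: suspect the leaf switch itself.
            report["suspect_leaf_switch"].append(leaf)
        else:
            # Sporadic loss: suspect overloaded nodes or saturated ports.
            report["suspect_nodes_or_ports"].extend(lost)
    return dict(report)


if __name__ == "__main__":
    print(classify_failures(LEAF_MEMBERS))
```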
Another example detection method is to have various computing nodes at different locations cooperate with each other as detection sources, with different detection strategies deployed across these detection sources. Each detection source may implement an agent that is additionally capable of detecting anomalies associated with OSI layers 4 through 7. A detection source may accept input control commands to dynamically configure its detection strategy. The method may establish a TCP connection to further measure one or more parameters associated with the transport layer (i.e., OSI layer 4), such as a transmission delay or a transmission rate. The detection method can further pinpoint the exact location of an anomaly by combining network topology information with the measured transmission delay or transmission rate and the packet loss rate. In one example, when packets routed by a first leaf switch to a group of compute nodes experience a high packet loss rate, but packets routed to the same compute nodes by a different leaf switch are received without significant delay, the detection source may determine that the compute nodes are operating properly but that the first leaf switch may be experiencing an anomaly. In another example, when packets routed by a first leaf switch to a first group of compute nodes experience a long delay, but packets routed by the same leaf switch to a second group of compute nodes experience no delay, the detection source may determine that the first leaf switch is operating properly but that the ports in the first leaf switch corresponding to the first group of compute nodes may be congested.
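The layer-4 part of this cooperative scheme can be approximated with ordinary TCP connection probes, as in the hedged sketch below; the probe port, delay threshold, and peer list are assumptions for illustration, not values from the patent.

```python
# Minimal sketch of a layer-4 detection source: open a TCP connection and
# measure the connect delay. Thresholds, ports, and peers are assumptions.
import socket
import time

DELAY_THRESHOLD_S = 0.2  # assumed threshold for "long delay"


def tcp_probe(host: str, port: int = 22, timeout_s: float = 1.0):
    """Return (reachable, connect_delay_seconds) for one TCP probe."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start
    except OSError:
        return False, None


def probe_peers(peers: list) -> dict:
    """Probe a list of (host, port) peers and flag slow or unreachable ones."""
    status = {}
    for host, port in peers:
        ok, delay = tcp_probe(host, port)
        if not ok:
            status[host] = "unreachable"
        elif delay > DELAY_THRESHOLD_S:
            status[host] = "congested (delay %.3fs)" % delay
        else:
            status[host] = "ok"
    return status


if __name__ == "__main__":
    print(probe_peers([("127.0.0.1", 22)]))
```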
It should be understood that the network failure detection methods described above are for illustrative purposes only. Other methods, including but not limited to Random Early Detection (RED), Weighted Random Early Detection (WRED), Robust Random Early Detection (RRED), Explicit Congestion Notification (ECN), and Backward ECN (BECN), may also be implemented to detect network congestion.
Fig. 4 illustrates an example configuration of a computing node for implementing a forwarding path planning method according to an embodiment of the present disclosure. In an implementation, the example configuration 400 of the computing node 402 may include, but is not limited to, one or more processing units 404, one or more network interfaces 406, input/output (I/O) interfaces 408, and memory 412. In an implementation, the computing node 402 may further include one or more intra-node interconnects or switches 410.
In an embodiment, the processing unit 404 may be configured to execute instructions stored in the memory 412 and/or received from the input/output interface 408 and/or the network interface 406. In an embodiment, the processing unit 404 may be implemented as one or more hardware processors including, for example, a microprocessor, a special-purpose instruction set processor, a Physical Processing Unit (PPU), a Central Processing Unit (CPU), a graphics processing unit, a digital signal processor, a tensor processing unit, and the like. Additionally or alternatively, the functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
Memory 412 may include machine-readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash RAM. Memory 412 is an example of a machine-readable medium.
A machine-readable medium may include volatile or nonvolatile types, removable or non-removable media, which may implement storage of information using any method or technology. The information may include machine-readable instructions, data structures, program modules, or other data. Examples of machine-readable media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other internal storage technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing node.
In an embodiment, the network interface 406 may be configured to connect the computing node 402 to other computing nodes via the communication network 106. In an implementation, the network interface 406 may be established through a Network Interface Controller (NIC), which may employ both hardware and software to connect the computing node 402 to the communication network 106. In implementations, each type of NIC may use a different type of fabric or connector to connect to the physical medium associated with the communication network 106. Examples of such fabric or connector types may be found in the IEEE 802 specifications and may include, for example, Ethernet (defined in IEEE 802.3), Token Ring (defined in IEEE 802.5), wireless networks (defined in IEEE 802.11), InfiniBand, and the like.
In an implementation, the intra-node switch 410 may include various types of interconnects or switches, which may include, but are not limited to, a high-speed serial computer expansion bus (e.g., PCIe), a serial multi-lane near-range communication link (e.g., NVLink, a wire-based serial multi-lane near-range communication protocol), a switch chip with multiple ports (e.g., NVSwitch), a point-to-point processor interconnect (e.g., Intel QPI/UPI), and the like.
In an embodiment, the computing node 402 may also include other hardware components and/or other software components, such as program modules 414 executing instructions stored in memory 412 for performing various operations, and program data 416 for storing data received for path planning, anomaly detection, and the like. In an implementation, program modules 414 may include a topology aware module 418, a path planning module 420, and an anomaly detection module 422.
The topology aware module 418 may be configured to maintain topology data associated with the network 106. The topology data may be generated and implemented on each element of the network 106 when the network architecture is delivered. The topology data includes the arrangement of the elements of the network, such as compute nodes, leaf switches, aggregation switches, and core switches, and indicates the connections/links between these elements. In an Equal Cost Multipath (ECMP) algorithm, the topology data may be represented as an undirected graph and stored as an adjacency list. All paths that route data packets from a source node to a destination node may be configured to have equal costs. Thus, the data flows from the source node to the destination node may be evenly distributed over all of the paths. In implementations, one or more available paths from a source node to a destination node may be configured to reserve bandwidth, and data flows may be forwarded only to the available paths. The topology aware module 418 can communicate with one or more switches (e.g., leaf switches 306) in the large-scale data center network in real time, periodically, or at preset time intervals to obtain topology data associated with the network 106. In implementations, the topology data associated with the network 106 can be stored in one or more separate storage devices, and the topology aware module 418 can communicate with the one or more separate storage devices to obtain real-time topology data. In implementations, the topology data associated with the network 106 can be dynamically updated to the topology aware module 418 in response to a notification of network congestion. In implementations, communication and topology data exchange between the compute nodes and the leaf switches may be implemented using protocols including, but not limited to, Link Layer Discovery Protocol (LLDP), Link Aggregation Control Protocol (LACP), Generic Remote Procedure Call (GRPC), and the like. In implementations, communication and topology data exchange between the compute nodes and the leaf switches may also be implemented through remote software control.
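As a minimal illustration of keeping the topology as an undirected graph stored as an adjacency list and applying updates learned from the leaf switches, the sketch below uses an assumed update format (lists of links that came up or went down); it is not the module's actual interface.

```python
# Minimal sketch of an adjacency-list topology store with link updates.
# The update message format is an illustrative assumption.
from collections import defaultdict


class Topology:
    def __init__(self):
        # adjacency list: node name -> set of directly linked neighbors
        self.adj = defaultdict(set)

    def add_link(self, a: str, b: str) -> None:
        self.adj[a].add(b)
        self.adj[b].add(a)

    def remove_link(self, a: str, b: str) -> None:
        self.adj[a].discard(b)
        self.adj[b].discard(a)

    def apply_update(self, update: dict) -> None:
        """Apply a topology update reported by a leaf switch."""
        for a, b in update.get("links_up", []):
            self.add_link(a, b)
        for a, b in update.get("links_down", []):
            self.remove_link(a, b)


if __name__ == "__main__":
    topo = Topology()
    topo.add_link("node-1", "leaf-1")
    topo.add_link("leaf-1", "agg-1")
    # A leaf switch reports that its uplink to agg-1 went down.
    topo.apply_update({"links_down": [("leaf-1", "agg-1")]})
    print(dict(topo.adj))
```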
The path planning module 420 may be configured to determine a routing path to forward a data packet and to assign the data packet to the routing path so as to balance the data flows in the network 106. In an embodiment, when a packet arrives, the path planning module 420 may obtain the source address, destination address, and protocol from the IP header of the TCP/IP packet, and the source port and destination port from the TCP portion of the packet. The source address, destination address, source port, destination port, and protocol form a so-called five-tuple (or 5-tuple). A five-tuple uniquely identifies a data stream in which all data packets have exactly the same five-tuple. The path planning module 420 may determine all possible routing paths from the source node to the destination node. A data stream whose packets all share the same five-tuple uses one of the possible routing paths at a time. In practice, when a different path needs to be selected, for example because a network anomaly has occurred, the network topology data changes as a result of the anomaly. The topology aware module 418 can update the network topology data stored in the program data 416 to reflect the changes. In an embodiment, the network topology data associated with the compute node 402 may be stored separately from the program data 416 and may be updated in response to topology data changes caused by anomalies. The path planning module 420 may then recalculate using a hash algorithm based on the updated network topology data and select a different path through another source port. It should be understood that uniquely identifying a data stream with the five-tuple (or 5-tuple) described above is for illustration purposes only, and the present disclosure is not intended to be limiting. Alternatively, the path planning module 420 may construct a triplet (or 3-tuple) including a source IP address, a destination IP address, and an ICMP identifier that uniquely identifies an ICMP query session to indicate the data flow.
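For concreteness, the sketch below extracts a five-tuple from a raw IPv4 packet in the manner described above; it assumes an IPv4 header (honoring the IHL field) followed by a TCP or UDP header and is illustrative only.

```python
# Minimal sketch of five-tuple extraction from a raw IPv4 packet. It assumes
# a well-formed IPv4 header followed by TCP/UDP and is illustrative only.
import socket
import struct


def five_tuple(packet: bytes):
    """Return (src_ip, dst_ip, src_port, dst_port, protocol) of an IPv4 packet."""
    ihl = (packet[0] & 0x0F) * 4          # IP header length in bytes
    protocol = packet[9]                   # 6 = TCP, 17 = UDP
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    src_port, dst_port = struct.unpack("!HH", packet[ihl:ihl + 4])
    return src_ip, dst_ip, src_port, dst_port, protocol


if __name__ == "__main__":
    # A hand-built 20-byte IPv4 header followed by a 20-byte TCP header.
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
                     socket.inet_aton("10.0.0.1"), socket.inet_aton("10.0.0.2"))
    tcp = struct.pack("!HHLLBBHHH", 40000, 80, 0, 0, 5 << 4, 0, 0, 0, 0)
    print(five_tuple(ip + tcp))
```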
In an embodiment, the path planning module 420 may determine all possible routing paths from the source node to the destination node. The routing paths may be determined based on various path-finding algorithms, including but not limited to shortest path algorithms. Examples of shortest path algorithms include, but are not limited to, Dijkstra's algorithm, the Viterbi algorithm, the Floyd-Warshall algorithm, the Bellman-Ford algorithm, and the like. The path planning module 420 may further perform a hash operation on the five-tuple to obtain a corresponding five-tuple hash value, and determine a routing path from all possible shortest routing paths according to the five-tuple hash value. Because the hash maps a five-tuple to a unique path, all packets with the same five-tuple are routed through the same path. Various hash algorithms may be implemented by the path planning module 420, including but not limited to message digest algorithms (MD, MD2, MD4, MD5, and MD6), RIPEMD (RIPEMD-128 and RIPEMD-160), Whirlpool (Whirlpool-0, Whirlpool-T, and Whirlpool), or secure hash functions (SHA-0, SHA-1, SHA-2, and SHA-3). By performing the hash operation on the five-tuple data, different data flows can be evenly distributed over all possible routing paths between the source node and the destination node to avoid network congestion.
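The combination of shortest-path enumeration and five-tuple hashing can be sketched as follows, here with a BFS over an unweighted adjacency list and MD5 (one of the hash algorithms listed above); the topology and flow values are illustrative assumptions.

```python
# Minimal sketch of ECMP-style selection: enumerate equal-cost shortest paths
# with BFS on hop count, then index into them with an MD5 hash of the
# five-tuple. Topology and flow below are illustrative assumptions.
import hashlib
from collections import deque


def all_shortest_paths(adj: dict, src: str, dst: str) -> list:
    """Enumerate all shortest (by hop count) simple paths from src to dst."""
    paths, best = [], None
    queue = deque([[src]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # all remaining candidates are longer than the shortest
        node = path[-1]
        if node == dst:
            best = len(path)
            paths.append(path)
            continue
        for nxt in adj.get(node, ()):
            if nxt not in path:
                queue.append(path + [nxt])
    return paths


def pick_path(flow: tuple, paths: list) -> list:
    """Map a five-tuple onto one of the equal-cost paths."""
    digest = hashlib.md5(repr(flow).encode()).hexdigest()
    return paths[int(digest, 16) % len(paths)]


if __name__ == "__main__":
    adj = {
        "node-1": ["leaf-1"],
        "leaf-1": ["node-1", "agg-1", "agg-2"],
        "agg-1": ["leaf-1", "leaf-2"],
        "agg-2": ["leaf-1", "leaf-2"],
        "leaf-2": ["agg-1", "agg-2", "node-2"],
        "node-2": ["leaf-2"],
    }
    flow = ("10.0.0.1", "10.0.0.2", 40000, 80, 6)
    paths = all_shortest_paths(adj, "node-1", "node-2")
    print(pick_path(flow, paths))
```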
In current path planning methods, the compute nodes forward data packets based on pre-computed hash values respectively corresponding to all possible routing paths, while the switches (i.e., leaf switches) perform path planning based on a dynamic network topology and hash algorithms. Typically, the topology data associated with the compute nodes and the leaf switches are not synchronized, and the hash algorithms implemented on the compute nodes and the leaf switches have different configurations. Thus, the five-tuple hash values calculated by compute nodes and switches (i.e., leaf switches or aggregation switches) may direct different data flows to the same routing path, resulting in possible congestion in the network. In this embodiment, the path planning module 420 of the compute node 402 may obtain information associated with the hash algorithm implemented on one or more leaf switches and use the obtained information to configure the hash algorithm implemented on the compute node 402. The information may include one or more parameters configured for the hash algorithm implemented on the one or more leaf switches. The path planning module 420 of the compute node 402 may also obtain stored network topology data associated with the one or more leaf switches and update its network topology data based on the obtained network topology data. In implementations, communication and data exchange between the compute nodes and the leaf switches may be implemented using protocols including, but not limited to, Link Layer Discovery Protocol (LLDP), Link Aggregation Control Protocol (LACP), Generic Remote Procedure Call (GRPC), and the like. Because the topology data and the hash algorithm configuration are synchronized between the compute nodes and the leaf switches, collisions that map different five-tuple hash values to the same path can be reduced and possible flow congestion can be avoided. Furthermore, when a network anomaly occurs, the computing node can effectively determine the elements in the network that are involved in the anomaly and re-plan the forwarding paths for the data flows, because the computing node maintains an updated topology from the perspective of the leaf switches as well as the active sessions of the data flows.
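The effect of synchronizing the hash configuration can be illustrated with the sketch below, which applies parameters assumed to have been learned from a leaf switch (algorithm name, seed, and hashed fields, a hypothetical exchange format) to the node's own flow hashing so both sides map a flow to the same index.

```python
# Minimal sketch of hash-configuration synchronization. The configuration
# dictionary is a hypothetical exchange format, not a real switch API.
import hashlib

# Hypothetical configuration as it might be learned from a leaf switch.
SWITCH_HASH_CONFIG = {
    "algorithm": "md5",
    "seed": "0x5a5a",
    "fields": ["src_ip", "dst_ip", "src_port", "dst_port", "protocol"],
}


def flow_hash(packet_fields: dict, config: dict) -> int:
    """Hash the configured fields with the configured algorithm and seed."""
    hasher = hashlib.new(config["algorithm"])
    hasher.update(config["seed"].encode())
    for name in config["fields"]:
        hasher.update(str(packet_fields[name]).encode())
    return int(hasher.hexdigest(), 16)


if __name__ == "__main__":
    fields = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
              "src_port": 40000, "dst_port": 80, "protocol": 6}
    # With identical configuration, node and switch derive the same value.
    print(flow_hash(fields, SWITCH_HASH_CONFIG) % 4)  # e.g. index into 4 paths
```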
The anomaly detection module 422 may be configured to detect anomalies occurring in the network 106. The anomaly detection module 422 may implement the detection methods described above with respect to fig. 3, which are therefore not repeated in detail here.
Program data 416 may be configured to store topology information 424, configuration information 426, and routing information 428. The topology information 424 may include the network elements and the connection states of the network elements, and may be dynamically updated based on path planning and data exchanges between the compute node 402 and the leaf switches. The configuration information 426 may include, for example, the versions and parameters of the algorithms implemented on the computing node 402, such as the routing algorithms and hash algorithms. The routing information 428 may include all possible routing paths between the source node and the destination node, and may also include a mapping between five-tuple hash values and corresponding forwarding paths.
Fig. 5 illustrates an example forwarding path planning according to embodiments of the present disclosure. The example forwarding path plan 500 is shown among various computing nodes and leaf switches that ultimately connect to a single aggregation switch. The data packets from computing node 506-1 to computing node 506-2 are distributed into two data flows over two routing paths. Path A includes four hops, path A-1, path A-2, path A-3, and path A-4, and passes through compute node 506-1, leaf switch 504-1, aggregation switch 502, leaf switch 504-2, and compute node 506-2. Path B includes four hops, path B-1, path B-2, path B-3, and path B-4, and passes through compute node 506-1, leaf switch 504-1, aggregation switch 502, leaf switch 504-3, and compute node 506-2. During transmission of the data streams, computing node 506-1 detects an anomaly in path A and further determines that path A-3 and path A-4 are involved in the anomaly. The anomaly may be associated with leaf switch 504-3 and/or a port of leaf switch 504-3. The network topology data associated with the computing node 506-1 may be updated to reflect the dynamic changes in the network caused by the anomaly. Based on the updated network topology data, the computing node 506-1 may recalculate the hash and select a different path by going through another source port. The compute node 506-1 then transmits the data stream using a different path (including path A-1, path A-2, path A-3', and path A-4') that passes through leaf switch 504-4.
Fig. 6 illustrates another example forwarding path planning according to embodiments of the present disclosure. The example forwarding path plan 600 is shown among various computing nodes and leaf switches that ultimately connect to two aggregation switches. The data packets from compute node 606-1 to compute node 606-2 are distributed into two data streams over two routing paths. Path A includes four hops: path A-1, path A-2, path A-3, and path A-4, and passes through compute node 606-1, leaf switch 604-3, aggregation switch 602-2, leaf switch 604-4, and compute node 606-2. Path B includes four hops: path B-1, path B-2, path B-3, and path B-4, and passes through compute node 606-1, leaf switch 604-1, aggregation switch 602-1, leaf switch 604-2, and compute node 606-2. During transmission of the data streams, compute node 606-1 detects an anomaly in path A and further determines that leaf switch 604-4 is involved in the anomaly. The network topology data associated with the computing node 606-1 may be updated to reflect the dynamic changes in the network caused by the anomaly. Based on the updated network topology data, the computing node 606-1 may recalculate the hash and select a different path by going through another source port. The compute node 606-1 then transmits the data stream using a different path (including path A-1, path A-2, path A-3', and path A-4') that passes through leaf switch 604-2.
Fig. 7 illustrates an example Equal Cost Multipath (ECMP) plan according to an embodiment of the disclosure. ECMP load balancing refers to identifying data flows and distributing them evenly over different routing paths by using a load balancing algorithm. As shown in ECMP plan 700, four equal-cost paths A, B, C, and D may be used to route data packets from computing node 706-1 to computing node 706-2. By applying a hash algorithm to the five-tuple data for routing, all four paths are used to route data packets from compute node 706-1 to compute node 706-2. In the initial path planning state, this helps to distribute the data flows evenly across the network and reduces possible congestion. In addition, when one of the four paths fails, the data flows may be distributed among the other three paths. It should be understood that the five-tuple and the ECMP load balancing shown in fig. 7 are for illustration purposes only, and the present disclosure is not intended to be limiting. In an implementation, a data flow may be indicated with a triplet including a source IP address, a destination IP address, and an ICMP identifier that uniquely identifies an ICMP query session. Further, one or more paths may be set aside to reserve bandwidth, in which case data flows are allocated only to the available paths, excluding the reserved paths.
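A short sketch of this behavior: hashing spreads flows over the four equal-cost paths, and removing a failed or reserved path from the candidate list redistributes the affected flows over the remaining ones; path names and flows below are illustrative.

```python
# Minimal sketch of flow distribution over equal-cost paths, and of the
# redistribution that follows when a path fails or is reserved.
import hashlib


def assign(flows, paths):
    digest = lambda f: int(hashlib.md5(repr(f).encode()).hexdigest(), 16)
    return {f: paths[digest(f) % len(paths)] for f in flows}


flows = [("10.0.0.1", "10.0.0.2", p, 80, 6) for p in range(40000, 40008)]
all_paths = ["A", "B", "C", "D"]

print(assign(flows, all_paths))                           # spread over four paths
print(assign(flows, [p for p in all_paths if p != "C"]))  # path C failed/reserved
```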
Fig. 8 illustrates an exemplary forwarding path planning algorithm according to an embodiment of the present disclosure. Fig. 9 illustrates another example forwarding path planning algorithm according to an embodiment of the present disclosure. Fig. 10 illustrates another example forwarding path planning algorithm according to an embodiment of the present disclosure. Fig. 11 illustrates another example forwarding path planning algorithm according to an embodiment of the present disclosure. The methods described in fig. 8-11 may be implemented in the environment of fig. 1 and/or the network architecture of fig. 2. However, the present disclosure is not intended to be limiting. The methods described in fig. 8-11 may alternatively be implemented in other environments and/or network architectures.
The methods described in fig. 8-11 are described in the general context of machine-executable instructions. Generally, machine-executable instructions may include routines, programs, objects, components, data structures, procedures, modules, functions, and the like that perform particular functions or implement particular abstract data types. Furthermore, each example method is shown as a collection of blocks in a logic flow diagram that represents a sequence of operations that may be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as a limitation, and any number of the described method blocks can be combined in any order to implement the method, or an alternative method. In addition, various blocks may be omitted from the methods without departing from the spirit and scope of the subject matter described herein. In the context of software, the blocks represent computer instructions that, when executed by one or more processors, perform the recited operations. In the context of hardware, some or all of the blocks may represent Application Specific Integrated Circuits (ASICs) or other physical components that perform the operations.
Referring to the method 800 depicted in fig. 8, at block 802, a first computing node (e.g., computing node 104) may obtain information associated with an algorithm implemented by at least one second computing node.
In an implementation, the first computing node may implement the same algorithm as the at least one second computing node. The algorithm implemented on the first computing node may be configured differently than the algorithm implemented on the second computing node. The algorithm may include various hash algorithms for routing path planning based on five-tuples associated with data packets in a data stream. In an embodiment, the at least one second computing node may be a leaf switch in a three-tier Clos network, and the first computing node is connected to the data center network through the leaf switch.
At block 804, a first computing node (e.g., computing node 104) may obtain network topology data stored in association with at least one second computing node.
In an embodiment, the first computing node may obtain the information associated with the algorithm implemented by the at least one second computing node described at block 802 and the network topology data associated with the at least one second computing node described at block 804 via various network protocols, such as the Link Layer Discovery Protocol (LLDP). The network topology data may be represented as an undirected graph showing the network elements and the connection states associated therewith.
At block 806, a first computing node (e.g., computing node 104) may receive a first data packet from a source device to be forwarded to a destination device.
In an implementation, the source device and the destination device may refer to the client devices 110 of fig. 1. In one example, a data packet is generated when a user operates a client device 110 to communicate with another user operating a different client device. In another example, a data packet is generated when a user accesses an online resource or uses an online service. In yet another example, the computing node 104 may upload data to or download data from cloud storage space, thereby generating a stream of data packets.
At block 808, a first computing node (e.g., computing node 104) may determine a set of values associated with the first data packet from information associated with the algorithm.
In an implementation, the set of values associated with the first data packet may include a five-tuple extracted from the first data packet. The set of values may include a source IP address, a destination IP address, a source port number, a destination port number, and a protocol for communication. The set of values, i.e., the five-tuple, may be hashed using a hash algorithm implemented by the first computing node. In an implementation, a data flow may alternatively be indicated with a triplet including a source IP address, a destination IP address, and an ICMP identifier that uniquely identifies an ICMP query session.
At block 810, a first computing node (e.g., computing node 104) may determine a forwarding path from a source device to a destination device based on the set of values associated with the first data packet and the network topology data.
In an implementation, the first computing node may update the network topology data associated therewith using the network topology data obtained from a memory associated with the at least one second computing node. The first computing node may determine all shortest routing paths from the updated network topology data and select the forwarding path from the plurality of shortest routing paths according to the set of values, e.g., the five-tuple hash value.
At block 812, a first computing node (e.g., computing node 104) may send a first data packet to a destination device via the forwarding path.
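Tying blocks 802 through 812 together, the sketch below stubs out the switch queries, the packet source, and the send step with canned data so that only the control flow of method 800 is shown; every helper name and value is an assumption rather than an API defined by the patent.

```python
# Minimal end-to-end sketch of the control flow of method 800 (blocks
# 802-812). All helpers are stubs with canned, illustrative data.
import hashlib


def obtain_switch_algorithm_info():        # block 802 (stubbed)
    return {"algorithm": "md5", "seed": "0x5a5a"}


def obtain_switch_topology():              # block 804 (stubbed)
    return {"paths": [["leaf-1", "agg-1", "leaf-2"],
                      ["leaf-1", "agg-2", "leaf-2"]]}


def receive_packet():                      # block 806 (stubbed)
    return {"five_tuple": ("10.0.0.1", "10.0.0.2", 40000, 80, 6)}


def determine_values(packet, algo_info):   # block 808
    h = hashlib.new(algo_info["algorithm"])
    h.update(algo_info["seed"].encode())
    h.update(repr(packet["five_tuple"]).encode())
    return int(h.hexdigest(), 16)


def determine_forwarding_path(value, topology):  # block 810
    paths = topology["paths"]
    return paths[value % len(paths)]


def send(packet, path):                    # block 812 (stubbed)
    print("forwarding", packet["five_tuple"], "via", path)


if __name__ == "__main__":
    algo_info = obtain_switch_algorithm_info()
    topology = obtain_switch_topology()
    packet = receive_packet()
    value = determine_values(packet, algo_info)
    send(packet, determine_forwarding_path(value, topology))
```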
Returning to the method 900 described in fig. 9, at block 902, a first computing node (e.g., computing node 104) may determine five-tuple data associated with a first data packet, the five-tuple data including a source IP address, a source port number, a destination IP address, a destination port number, and a protocol.
In an implementation, the first computing node may extract a source IP address, a destination IP address, and a protocol from an IP header of the data packet, and further extract a source port number and a destination port number from the TCP portion. The protocol may include any type of IP protocol including, but not limited to, IPv4 and IPv6. In other embodiments, the first computing node may extract the source IP address, the destination IP address, and the ICMP identifier to generate a triplet (or 3-tuple) to indicate the data flow.
At block 904, a first computing node (e.g., computing node 104) may update a configuration of a hash algorithm implemented by the first computing node using information associated with an algorithm implemented by at least one second computing node. In implementations, the information associated with the algorithm implemented by the at least one second computing node may include a version of the algorithm, one or more parameter configurations of the algorithm, and the like.
At block 906, a first computing node (e.g., computing node 104) may calculate a five-tuple hash value corresponding to the five-tuple data as a set of values associated with the first data packet using an updated hash algorithm. In an implementation, when the first computing node generates a triplet to identify a data flow, the first computing node uses an updated hash algorithm to compute a triplet hash value corresponding to the triplet data as a set of values associated with the first data packet.
Returning to the method 1000 described in fig. 10, at block 1002, a first computing node (e.g., computing node 104) may determine one or more paths from a source device to a destination device based on network topology data. In an embodiment, the one or more paths from the source device to the destination device may include one or more shortest paths. The first computing node may implement various shortest path algorithms, such as Dijkstra's algorithm, the Viterbi algorithm, the Floyd-Warshall algorithm, or the Bellman-Ford algorithm.
At block 1004, a first computing node (e.g., computing node 104) may perform a modulo operation on the five-tuple hash value with respect to the one or more paths. In an embodiment, the modulo operation may generate one or more different modulus values that respectively correspond to the one or more paths. Data packets that all have the same five-tuple form a data stream. Packets directed from a source IP address to a destination IP address may be assigned to different data flows depending on the source port through which the packets pass. The data streams may be assigned to the one or more paths according to the one or more different modulus values that respectively correspond to the one or more paths.
At block 1006, a first computing node (e.g., computing node 104) may determine a forwarding path from one or more paths based on a result of the modulo operation. In an embodiment, the first computing node may select one path mapped to the data flow represented by the five-tuple as the forwarding path. Arriving packets with the same five-tuple can use the same forwarding path. In other embodiments, the first computing node may designate one of the one or more paths as the forwarding path based on traffic on those paths.
Returning to the method 1100 depicted in fig. 11, at block 1102, a first computing node (e.g., computing node 104) may determine at least a first forwarding path and a second forwarding path from the one or more paths. The first computing node may implement an Equal Cost Multipath (ECMP) algorithm to determine all possible paths between the source device and the destination device. In implementations, the one or more paths may be ordered based on the associated one or more different modulus values. The first computing node may select a path mapped to the data stream represented by the five-tuple to forward the data stream. Alternatively, the first computing node may designate more than one path to forward the data stream. In an embodiment, the first computing node may allocate the data packets from the source device to the destination device to different data streams such that the data streams are transmitted to all possible paths between the source device and the destination device. In other embodiments, the first computing node may distribute the data stream to a set of all possible paths.
At block 1104, a first computing node (e.g., computing node 104) may receive a plurality of second data packets from a source device to be forwarded to a destination device. The plurality of second data packets may have the same or different five-tuples. Packets having the same five-tuple are transmitted as one data stream through one of the first forwarding path and the second forwarding path at a time. Packets with different five-tuples may form different data flows that go through different forwarding paths. In an embodiment, when one of the plurality of second data packets has the same five-tuple as the first data packet described in fig. 8, that second data packet is transmitted through the same forwarding path determined according to the embodiment shown in fig. 8.
At block 1106, a first computing node (e.g., computing node 104) may allocate a plurality of second data packets to a first forwarding path and a second forwarding path, each of the first forwarding path and the second forwarding path carrying a portion of the plurality of second data packets. In an embodiment, the first computing node may evenly distribute the data flow to all possible paths (i.e., the first forwarding path and the second forwarding path) between the source device and the destination device based on the hash computation. In other embodiments, the data flows carried by the first forwarding path and the second forwarding path may be non-uniform.
At block 1108, a first computing node (e.g., computing node 104) may determine an anomaly in one of the first forwarding path and the second forwarding path. The anomaly may be associated with a computing node, a switch, a port of a computing node, a port of a switch, etc., resulting in network congestion. The first computing node may detect the anomaly using the detection methods described above with respect to fig. 3. In implementations, the first computing node may generate one or more sessions respectively corresponding to the one or more forwarding paths, and may detect the anomaly when a session timeout occurs in one of the one or more sessions.
At block 1110, a first computing node (e.g., computing node 104) may determine a third forwarding path from the source device to the destination device to reroute the data flow, i.e., the portion of the plurality of second data packets carried by the forwarding path involved in the anomaly. In an embodiment, the first computing node may recalculate using the hash algorithm based on the updated network topology data and select a different path through another source port.
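Blocks 1108 and 1110 can be illustrated with the short sketch below: flows are tracked per forwarding path, a (simulated) session timeout marks one path anomalous, and the affected flows are rehashed onto the remaining healthy paths plus a newly determined third path; all names and values are illustrative assumptions.

```python
# Minimal sketch of rerouting flows off an anomalous forwarding path.
# Paths, flows, and the timeout signal are illustrative assumptions.
import hashlib


def path_of(flow, paths):
    idx = int(hashlib.md5(repr(flow).encode()).hexdigest(), 16) % len(paths)
    return paths[idx]


flows = [("10.0.0.1", "10.0.0.2", p, 80, 6) for p in range(40000, 40006)]
paths = ["path-A", "path-B"]
assignment = {f: path_of(f, paths) for f in flows}

# Block 1108: a session timeout on path-A marks it anomalous.
failed = "path-A"

# Block 1110: reassign affected flows over the healthy paths plus a newly
# determined third path.
healthy = [p for p in paths if p != failed] + ["path-C"]
for f, p in assignment.items():
    if p == failed:
        assignment[f] = path_of(f, healthy)

print(assignment)
```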
Although the method blocks described above are described as being performed in a particular order, in some implementations, some or all of the method blocks may be performed in other orders or in parallel.
Although implementations have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter. Additionally or alternatively, some or all of the operations may be implemented by one or more ASICs, FPGAs, or other hardware.
Example clauses
A. A method implemented by a first computing node, the method comprising: obtaining, via the network, information associated with an algorithm implemented by the at least one second computing node; obtaining, via the network, stored network topology data associated with the at least one second computing node; receiving, at a first computing node, a first data packet from a source device to be forwarded to a destination device; determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data; and transmitting the first data packet to the destination device over the forwarding path.
B. The method of paragraph A, wherein the information associated with the first data packet includes a set of values, and determining, at the first computing node, a set of values associated with the first data packet from the information associated with the algorithm further includes: executing a hash algorithm implemented by the first computing node; applying at least information associated with an algorithm implemented by the at least one second computing node to the hash algorithm; and calculating the set of values associated with the first data packet using the hash algorithm.
C. The method of paragraph A, wherein the information associated with the first data packet includes a set of values, and determining, at the first computing node, a set of values associated with the first data packet from the information associated with the algorithm further includes: determining five-tuple data associated with the first data packet, the five-tuple data including a source IP address associated with the source device, a source port number associated with the source device, a destination IP address associated with the destination device, a destination port number associated with the destination device, and a protocol for communicating in the network; updating a hash algorithm implemented by the first computing node using information associated with an algorithm implemented by the at least one second computing node; and calculating a five-tuple hash value corresponding to the five-tuple data as the set of values associated with the first data packet using an updated hash algorithm.
D. The method of paragraph C, wherein determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data further comprises: determining one or more paths from the source device to the destination device based on the network topology data; and determining the forwarding path from the one or more paths according to the five-tuple hash value.
E. The method of paragraph D, wherein determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data further comprises: performing a modulo operation on the five-tuple hash value for the one or more paths; and determining the forwarding path from the one or more paths according to the result of the modulo operation.
F. The method of paragraph A, wherein the forwarding path from the source device to the destination device includes at least a first forwarding path and a second forwarding path, and the method further comprises receiving a plurality of second data packets from the source device to be forwarded to the destination device; distributing the plurality of second data packets to the first forwarding path and the second forwarding path, each of the first forwarding path and the second forwarding path carrying at least a portion of the plurality of second data packets; detecting an abnormality occurring in one of the first forwarding path and the second forwarding path; and determining a third forwarding path from the source device to the destination device to reroute the portion of the plurality of second data packets carried by the one of the first forwarding path and the second forwarding path associated with the anomaly.
G. The method of paragraph A, wherein determining a forwarding path from the source device to the destination device from the set of values associated with the first data packet and the network topology data is based on an equal-cost multipath (ECMP) planning algorithm.
H. One or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform actions comprising obtaining, via a network, information associated with an algorithm implemented by at least one second computing node; obtaining, via the network, stored network topology data associated with the at least one second computing node; receiving, at a first computing node, a first data packet from a source device to be forwarded to a destination device; determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data; and transmitting the first data packet to the destination device over the forwarding path.
I. The one or more machine-readable media of paragraph H, wherein the information associated with the first data packet comprises a set of values, and the acts further comprise: executing a hash algorithm implemented by the first computing node; applying at least information associated with an algorithm implemented by the at least one second computing node to the hash algorithm; and calculating a set of values associated with the first data packet using the hash algorithm.
J. The one or more machine-readable media of paragraph H, wherein the information associated with the first data packet comprises a set of values, and the acts further comprise: determining five-tuple data associated with the first data packet, the five-tuple data including a source IP address associated with the source device, a source port number associated with the source device, a destination IP address associated with the destination device, a destination port number associated with the destination device, and a protocol for communicating in the network; updating a hash algorithm implemented by the first computing node using information associated with an algorithm implemented by the at least one second computing node; and calculating a five-tuple hash value corresponding to the five-tuple data as the set of values associated with the first data packet using an updated hash algorithm.
K. The one or more machine-readable media of paragraph J, the acts further comprising determining one or more paths from the source device to the destination device from the network topology data; and determining the forwarding path from the one or more paths according to the five-tuple hash value.
L. The one or more machine-readable media of paragraph K, the acts further comprising performing a modulo operation on the five-tuple hash value for the one or more paths; and determining the forwarding path from the one or more paths according to the result of the modulo operation.
M. The one or more machine-readable media of paragraph H, the actions further comprising receiving, from the source device, a plurality of second data packets to be forwarded to the destination device; distributing the plurality of second data packets to the first forwarding path and the second forwarding path, each of the first forwarding path and the second forwarding path carrying at least a portion of the plurality of second data packets; detecting an abnormality occurring in one of the first forwarding path and the second forwarding path; and determining a third forwarding path from the source device to the destination device to reroute the portion of the plurality of second data packets carried by the one of the first forwarding path and the second forwarding path associated with the anomaly.
N. The one or more machine-readable media of paragraph H, wherein determining a forwarding path from the source device to the destination device from the set of values associated with the first data packet and the network topology data is based on an equal-cost multipath (ECMP) planning algorithm.
O. A first computing node comprising: one or more processing units; and a memory storing machine-executable instructions that, when executed by the one or more processing units, cause the one or more processing units to perform actions comprising: obtaining, via the network, information associated with an algorithm implemented by the at least one second computing node; obtaining, via the network, stored network topology data associated with the at least one second computing node; receiving, at a first computing node, a first data packet from a source device to be forwarded to a destination device; determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data; and transmitting the first data packet to the destination device over the forwarding path.
P. The first computing node of paragraph O, wherein the information associated with the first data packet comprises a set of values, and the acts further comprise: determining five-tuple data associated with the first data packet, the five-tuple data including a source IP address associated with the source device, a source port number associated with the source device, a destination IP address associated with the destination device, a destination port number associated with the destination device, and a protocol for communicating in the network; updating a hash algorithm implemented by the first computing node using information associated with an algorithm implemented by the at least one second computing node; and calculating a five-tuple hash value corresponding to the five-tuple data as the set of values associated with the first data packet using an updated hash algorithm.
Q. The first computing node of paragraph P, wherein the information associated with the first data packet comprises a set of values, and the actions further comprise determining one or more paths from the source device to the destination device based on the network topology data; and determining the forwarding path from the one or more paths according to the five-tuple hash value.
R. The first computing node of paragraph Q, the actions further comprising performing a modulo operation on the five-tuple hash value for the one or more paths; and determining the forwarding path from the one or more paths according to the result of the modulo operation.
S. The first computing node of paragraph O, the actions further comprising receiving, from the source device, a plurality of second data packets to be forwarded to the destination device; distributing the plurality of second data packets to the first forwarding path and the second forwarding path, each of the first forwarding path and the second forwarding path carrying at least a portion of the plurality of second data packets; detecting an abnormality occurring in one of the first forwarding path and the second forwarding path; and determining a third forwarding path from the source device to the destination device to reroute the portion of the plurality of second data packets carried by the one of the first forwarding path and the second forwarding path associated with the anomaly.
T. The first computing node of paragraph O, wherein determining a forwarding path from the source device to the destination device from the set of values associated with the first data packet and the network topology data is based on an equal-cost multipath (ECMP) planning algorithm.

Claims (20)

1. A method implemented by a first computing node, the method comprising:
obtaining, via the network, information associated with an algorithm implemented by the at least one second computing node;
obtaining, via the network, stored network topology data associated with the at least one second computing node;
receiving, at a first computing node, a first data packet from a source device to be forwarded to a destination device;
determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data; and
transmitting the first data packet to the destination device via the forwarding path.
2. The method of claim 1, wherein the information associated with the first data packet comprises a set of values, and determining, at the first computing node, a set of values associated with the first data packet from the information associated with the algorithm further comprises:
executing a hash algorithm implemented by the first computing node;
applying at least information associated with an algorithm implemented by the at least one second computing node to the hash algorithm; and
calculating the set of values associated with the first data packet using the hash algorithm.
3. The method of claim 1, wherein the information associated with the first data packet comprises a set of values, and determining, at the first computing node, a set of values associated with the first data packet from the information associated with the algorithm further comprises:
determining five-tuple data associated with the first data packet, the five-tuple data including a source IP address associated with the source device, a source port number associated with the source device, a destination IP address associated with the destination device, a destination port number associated with the destination device, and a protocol for communicating in the network;
updating a hash algorithm implemented by the first computing node using information associated with an algorithm implemented by the at least one second computing node; and
calculating a five-tuple hash value corresponding to the five-tuple data as the set of values associated with the first data packet using an updated hash algorithm.
4. The method of claim 3, wherein determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data further comprises:
determining one or more paths from the source device to the destination device based on the network topology data; and
determining the forwarding path from the one or more paths according to the five-tuple hash value.
5. The method of claim 4, wherein determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data further comprises:
performing a modulo operation on the five-tuple hash value for the one or more paths; and
determining the forwarding path from the one or more paths according to the result of the modulo operation.
6. The method of claim 1, wherein the forwarding path from the source device to the destination device comprises at least a first forwarding path and a second forwarding path, and the method further comprises:
receiving a plurality of second data packets from the source device to be forwarded to the destination device;
distributing the plurality of second data packets to the first forwarding path and the second forwarding path, each of the first forwarding path and the second forwarding path carrying at least a portion of the plurality of second data packets;
detecting an abnormality occurring in one of the first forwarding path and the second forwarding path; and
determining a third forwarding path from the source device to the destination device to reroute the portion of the plurality of second data packets carried by the one of the first forwarding path and the second forwarding path associated with the anomaly.
7. The method of claim 1, wherein determining a forwarding path from the source device to the destination device from a set of values associated with the first data packet and the network topology data is based on an equal cost multipath planning algorithm.
8. One or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform actions comprising:
obtaining, via the network, information associated with an algorithm implemented by the at least one second computing node;
obtaining, via the network, stored network topology data associated with the at least one second computing node;
receiving, at a first computing node, a first data packet from a source device to be forwarded to a destination device;
determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data; and
transmitting the first data packet to the destination device via the forwarding path.
9. The one or more machine readable media of claim 8, wherein the information associated with the first data packet comprises a set of values, and the acts further comprise:
executing a hash algorithm implemented by the first computing node;
applying at least information associated with an algorithm implemented by the at least one second computing node to the hash algorithm; and
calculating the set of values associated with the first data packet using the hash algorithm.
10. The one or more machine readable media of claim 8, wherein the information associated with the first data packet comprises a set of values, and the acts further comprise:
determining five-tuple data associated with the first data packet, the five-tuple data including a source IP address associated with the source device, a source port number associated with the source device, a destination IP address associated with the destination device, a destination port number associated with the destination device, and a protocol for communicating in the network;
updating a hash algorithm implemented by the first computing node using information associated with an algorithm implemented by the at least one second computing node; and
calculating a five-tuple hash value corresponding to the five-tuple data as the set of values associated with the first data packet using an updated hash algorithm.
11. The one or more machine-readable media of claim 10, wherein the acts further comprise:
determining one or more paths from the source device to the destination device based on the network topology data; and
determining the forwarding path from the one or more paths according to the five-tuple hash value.
12. The one or more machine-readable media of claim 11, the acts further comprising:
performing a modulo operation on the five-tuple hash value for the one or more paths; and
determining the forwarding path from the one or more paths according to the result of the modulo operation.
13. The one or more machine-readable media of claim 8, the acts further comprising:
receiving a plurality of second data packets from the source device to be forwarded to the destination device;
distributing the plurality of second data packets to a first forwarding path and a second forwarding path, each of the first forwarding path and the second forwarding path carrying at least a portion of the plurality of second data packets;
detecting an abnormality occurring in one of the first forwarding path and the second forwarding path; and
determining a third forwarding path from the source device to the destination device to reroute the portion of the plurality of second data packets carried by the one of the first forwarding path and the second forwarding path associated with the anomaly.
14. The one or more machine readable media of claim 8, wherein determining a forwarding path from the source device to the destination device from a set of values associated with the first data packet and the network topology data is based on an equal cost multipath planning algorithm.
15. A first computing node, comprising:
one or more processing units; and
a memory storing machine-executable instructions that, when executed by one or more processing units, cause the one or more processing units to perform actions comprising:
obtaining, via the network, information associated with an algorithm implemented by the at least one second computing node;
obtaining, via the network, stored network topology data associated with the at least one second computing node;
receiving, at a first computing node, a first data packet from a source device to be forwarded to a destination device;
determining a forwarding path from the source device to the destination device based on information associated with the first data packet and the network topology data; and
transmitting the first data packet to the destination device via the forwarding path.
16. The first computing node of claim 15, wherein the information associated with the first data packet comprises a set of values, and the acts further comprise:
determining five-tuple data associated with the first data packet, the five-tuple data including a source IP address associated with the source device, a source port number associated with the source device, a destination IP address associated with the destination device, a destination port number associated with the destination device, and a protocol for communicating in the network;
updating a hash algorithm implemented by the first computing node using information associated with an algorithm implemented by the at least one second computing node; and
calculating a five-tuple hash value corresponding to the five-tuple data as the set of values associated with the first data packet using an updated hash algorithm.
17. The first computing node of claim 16, wherein the information associated with the first data packet comprises a set of values, and the acts further comprise:
determining one or more paths from the source device to the destination device based on the network topology data; and
determining the forwarding path from the one or more paths according to the five-tuple hash value.
18. The first computing node of claim 17, the acts further comprising:
performing a modulo operation on the five-tuple hash value for the one or more paths; and
determining the forwarding path from the one or more paths according to the result of the modulo operation.
19. The first computing node of claim 15, the acts further comprising:
receiving a plurality of second data packets from the source device to be forwarded to the destination device;
distributing the plurality of second data packets to a first forwarding path and a second forwarding path, each of the first forwarding path and the second forwarding path carrying at least a portion of the plurality of second data packets;
detecting an abnormality occurring in one of the first forwarding path and the second forwarding path; and
determining a third forwarding path from the source device to the destination device to reroute the portion of the plurality of second data packets carried by the one of the first forwarding path and the second forwarding path associated with the anomaly.
20. The first computing node of claim 15, wherein determining a forwarding path from the source device to the destination device from a set of values associated with the first data packet and the network topology data is based on an equal cost multipath planning algorithm.
CN202080100357.0A 2020-05-18 2020-05-18 Forwarding path planning method for large-scale data network center Active CN115462049B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/090827 WO2021232190A1 (en) 2020-05-18 2020-05-18 Forward path planning method in massive data center networks

Publications (2)

Publication Number Publication Date
CN115462049A CN115462049A (en) 2022-12-09
CN115462049B true CN115462049B (en) 2024-03-08

Family

ID=78708969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080100357.0A Active CN115462049B (en) 2020-05-18 2020-05-18 Forwarding path planning method for large-scale data network center

Country Status (2)

Country Link
CN (1) CN115462049B (en)
WO (1) WO2021232190A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004075487A1 (en) * 2003-02-18 2004-09-02 Motorola Inc Communication or computing node and method of routing data
US6898183B1 (en) * 2000-03-14 2005-05-24 Cisco Technology, Inc. Method of determining a data link path in a managed network
CN101645850A (en) * 2009-09-25 2010-02-10 杭州华三通信技术有限公司 Forwarding route determining method and equipment
CN102801614A (en) * 2012-07-17 2012-11-28 杭州华三通信技术有限公司 Convergence method for equivalent routing and network device
CN106559324A (en) * 2015-09-24 2017-04-05 华为技术有限公司 A kind of method E-Packeted based on equal cost multipath and the network equipment
CN107360092A (en) * 2016-05-05 2017-11-17 香港城市大学 For the balanced loaded system and method in data network
WO2018036100A1 (en) * 2016-08-26 2018-03-01 华为技术有限公司 Data message forwarding method and apparatus
US10218629B1 (en) * 2014-12-23 2019-02-26 Juniper Networks, Inc. Moving packet flows between network paths
WO2020073685A1 (en) * 2018-10-11 2020-04-16 平安科技(深圳)有限公司 Forwarding path determining method, apparatus and system, computer device, and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7965642B2 (en) * 2007-09-06 2011-06-21 Cisco Technology, Inc. Computing path information to a destination node in a data communication network
JP6217138B2 (en) * 2013-05-22 2017-10-25 富士通株式会社 Packet transfer apparatus and packet transfer method
US10924381B2 (en) * 2015-02-19 2021-02-16 Arista Networks, Inc. System and method of processing in-place adjacency updates
CN106357547A (en) * 2016-09-08 2017-01-25 重庆邮电大学 Software-defined network congestion control algorithm based on stream segmentation
US10454828B2 (en) * 2016-12-21 2019-10-22 Cisco Technology, Inc. Machine learning-derived entropy path graph from in-situ OAM (iOAM) data
US10924352B2 (en) * 2018-01-17 2021-02-16 Nicira, Inc. Data center network topology discovery
CN108390820B (en) * 2018-04-13 2021-09-14 华为技术有限公司 Load balancing method, equipment and system
CN110391982B (en) * 2018-04-20 2022-03-11 伊姆西Ip控股有限责任公司 Method, apparatus and computer program product for transmitting data

Also Published As

Publication number Publication date
CN115462049A (en) 2022-12-09
WO2021232190A1 (en) 2021-11-25

Similar Documents

Publication Publication Date Title
JP7417825B2 (en) slice-based routing
JP6369698B2 (en) Traffic switching method, device, and system
US9806994B2 (en) Routing via multiple paths with efficient traffic distribution
US8996683B2 (en) Data center without structural bottlenecks
US8160063B2 (en) Data center interconnect and traffic engineering
EP3399703B1 (en) Method for implementing load balancing, apparatus, and network system
CN104335537B (en) For the system and method for the multicast multipath of layer 2 transmission
US9647928B2 (en) OSPF point-to-multipoint over broadcast or NBMA mode
US9596094B2 (en) Managing multicast distribution using multicast trees
EP2001167A1 (en) A root path computation method in shortest path bridge
US10931530B1 (en) Managing routing resources of a network
WO2015153361A1 (en) System and method for software defined routing of traffic within and between autonomous systems with enhanced flow routing, scalability and security
CN104378297A (en) Message forwarding method and device
TWI759571B (en) Data transfer method based on flow table
WO2020135339A1 (en) Network path convergence method and related device
US10205661B1 (en) Control messages for scalable satellite device clustering control in a campus network
JP2015514374A (en) Method for receiving information, method for transmitting information and apparatus thereof
CN115462049B (en) Forwarding path planning method for large-scale data network center
Dumba et al. A virtual ID routing protocol for future dynamics networks and its implementation using the SDN paradigm
WO2015170179A2 (en) Method and apparatus for selecting a next hop
EL KHADIRI et al. Comparative Study Between Dynamic IPv6 Routing Protocols of Distance Vectors and Link States
US10284468B1 (en) E-channel identifiers (ECIDS) for scalable satellite device clustering control in a campus network
Sun et al. FAR: A fault-avoidance routing method for data center networks with regular topology
Ushakov et al. Service-oriented routing in distributed self-organizing software-defined networks
Usman et al. Review of interior gateway routing protocols

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230118

Address after: Room 554, floor 5, building 3, No. 969, Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: Fourth Floor, One Capital Place, P.O. Box 847, Grand Cayman, Cayman Islands (British Overseas Territory)

Applicant before: ALIBABA GROUP HOLDING Ltd.

GR01 Patent grant