Disclosure of Invention
In order to solve the above technical problems, the present invention provides an NPU cluster network structure and a network interconnection method, which address the prior-art problem that data interaction efficiency limits the computing power of an NPU cluster.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in a first aspect, the present invention provides an NPU cluster network architecture, including:
a node, serving as the hardware structure for training a neural network model, which comprises a first group of processors and a second group of processors, wherein the NPUs of the first group of processors are electrically connected to the NPUs of the second group of processors;
a first network plane electrically connected to the first set of processors;
and a second network plane electrically connected to the second set of processors.
In one implementation, the node further includes a plurality of CPUs, and a high-speed serial computer expansion bus is connected between each CPU and each NPU of the first group of processors and each NPU of the second group of processors; and HCCS buses are connected among the plurality of CPUs.
In one implementation, an HCCS bus is connected between the NPUs of the first set of processors, an HCCS bus is connected between the NPUs of the second set of processors, and an HCCS bus is connected between the NPUs of the first set of processors and the NPUs of the second set of processors.
In one implementation, the NPUs of different nodes communicate via the RoCE network protocol through the first network plane or the second network plane.
In one implementation, the number of NPUs included in the first set of processors and the second set of processors is four.
In a second aspect, an embodiment of the present invention provides a network interconnection method, including:
selecting one NPU from a plurality of NPUs forming each node, and marking the NPU as a bridging NPU;
controlling a plurality of NPUs inside each node to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs;
and controlling data transmission among the nodes after internal interconnection.
In one implementation, the controlling of the NPUs inside each node to be interconnected through the bridging NPU, where the interconnection is mutual data transmission among the NPUs, includes:
obtaining, from the bridging NPUs, a first bridging NPU and a second bridging NPU, wherein the first bridging NPU is located in the first group of processors and the second bridging NPU is located in the second group of processors;
controlling the plurality of NPUs within the first group of processors to be interconnected through the first bridging NPU;
controlling the plurality of NPUs within the second group of processors to be interconnected through the second bridging NPU;
and controlling the first bridging NPU and the second bridging NPU to be interconnected.
In one implementation, after the controlling of the NPUs inside each node to be interconnected through the bridging NPU, where the interconnection is mutual data transmission among the NPUs, the method further includes:
controlling the interconnection of the first bridging NPU and the CPU in the node;
and controlling the second bridging NPU to be interconnected with the CPU in the node.
In one implementation, the controlling of data transmission among the nodes after internal interconnection includes:
controlling the bridging NPU of each node after interconnection to receive target data;
and controlling the bridging NPU of each node to send the received target data to a target computing node through a RoCE network protocol.
In one implementation, after the controlling of the bridging NPU of each node to send the received data to the target computing node through the RoCE network protocol, the method further includes:
controlling the bridging NPU of the target computing node to receive the target data;
and controlling the bridging NPU of the target computing node to distribute the received target data to the individual NPUs within the target computing node.
In a third aspect, an embodiment of the present invention further provides a network interconnection device, where the device includes the following components:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node and marking the NPU as a bridge NPU;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridging NPU, and the interconnection is mutual data transmission among the NPUs;
and the data transmission module is used for controlling data transmission among the nodes after internal interconnection.
In a fourth aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a network interconnection program stored in the memory and capable of running on the processor, and when the processor executes the network interconnection program, the steps of the network interconnection method described above are implemented.
In a fifth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a network interconnection program is stored, where the network interconnection program, when executed by a processor, implements the steps of the network interconnection method described above.
The beneficial effects are as follows: the invention divides each node used for training the neural network model into two groups, namely a first group of processors and a second group of processors, and divides the network plane into two planes, where each network plane is responsible only for the data transmission generated by one group of processors during training of the neural network model. This improves the data transmission efficiency of the NPU cluster formed by the NPUs.
Detailed Description
The technical solution of the present invention is described clearly and completely below with reference to the examples and the drawings. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention, without inventive effort, fall within the scope of the invention.
Research shows that when artificial intelligence (AI) technology is used to train a model, the higher the model complexity, the higher the requirement on AI computing power; as shown in FIG. 2, this requirement increases year by year. The latest AI models call for E-class (exascale) AI computing power, i.e., the FP16 computing power used in single-model training exceeds 1E OPS.
E-class AI computing power imposes stringent requirements on computational efficiency and communication efficiency. A neural processing unit (NPU) can meet the computing-efficiency requirement of E-class AI computing power; for example, the Ascend 910 chip integrates multiple NPUs to form an NPU cluster that satisfies this requirement. But the increase in the number of NPUs also means that more AI chips (multiple NPUs and CPUs make up an AI chip, also called a node) need to be used in parallel. As shown in FIG. 3, as AI parallelism (the number of nodes) increases, network traffic (the amount of data that must be transmitted over the network as communication information) grows in a near-linear trend; once traffic reaches a certain level, communication efficiency (data interaction efficiency) can hinder the computation of the NPU cluster.
In order to solve these technical problems, the invention provides an NPU cluster network structure and a network interconnection method, which address the prior-art problem that data interaction efficiency limits NPU cluster computing power. In implementation, each node is first divided into a first group of processors and a second group of processors, and the network plane is likewise divided into a first network plane and a second network plane, with the first network plane responsible only for the data interaction of the first group of processors and the second network plane responsible only for the data interaction of the second group. The invention can thereby improve the data interaction efficiency of the processors, and in turn the computing power of the NPU cluster they form.
For example, as shown in FIG. 4, the nodes and switches form two identical network planes (labeled the first network plane and the second network plane, respectively). As shown in FIG. 5, one cabinet contains 8 nodes. The internal structure of each node is shown in FIG. 6 (each node also contains CPUs in addition to NPUs; the CPUs are omitted from FIG. 6, which shows only the NPUs as the core processors): each node has NPU0, NPU1, NPU2, NPU3, NPU4, NPU5, NPU6 and NPU7. NPU0, NPU1, NPU4 and NPU5 are divided into the first group of processors, and NPU2, NPU3, NPU6 and NPU7 are divided into the second group of processors. As shown in FIG. 5, the Spine and Leaf switch layers are divided into two network planes, each composed of its own independent Spine switches SW0, SW1, ..., SW31 and Leaf switches SW0, SW1, ..., SW63. Each network plane is responsible only for transmitting the parameters generated by one group of processors in each node. Taking node #1 in FIG. 5 as an example, the solid lines denote the first network plane, which is responsible only for the parameters generated by training the neural network model on NPU0, NPU1, NPU4 and NPU5; that is, the first network plane transmits the parameters produced on NPU0, NPU1, NPU4 and NPU5 of node #1 only to other nodes on the same network plane. Likewise, the second network plane is responsible only for the interaction of parameter data on NPU2, NPU3, NPU6 and NPU7 with other nodes on the same plane.
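The plane assignment described above for node #1 can be sketched as a lookup (a minimal illustration; the NPU-to-plane grouping is taken from the text, while the function name and representation are hypothetical):

```python
# NPU-to-network-plane grouping from fig. 5 / fig. 6 of the text.
FIRST_PLANE_NPUS = {0, 1, 4, 5}   # solid-line (first) network plane
SECOND_PLANE_NPUS = {2, 3, 6, 7}  # dashed-line (second) network plane

def plane_of(npu_id: int) -> str:
    """Return which network plane carries a given NPU's parameter traffic."""
    if npu_id in FIRST_PLANE_NPUS:
        return "first"
    if npu_id in SECOND_PLANE_NPUS:
        return "second"
    raise ValueError(f"unknown NPU id {npu_id}")
```

With this mapping, parameters produced on NPU0 of node #1 travel only over the first (solid-line) plane, and parameters produced on NPU7 only over the second (dashed-line) plane.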
Application scenario of this embodiment: the architecture and model parameters of a neural network model for image recognition, target detection and natural language processing are transmitted to the eight nodes in FIG. 5, and the training data sets are likewise loaded onto the eight nodes. The NPUs in the eight nodes train the model in parallel using the training sets; after each training round, the NPUs in each node transmit the model parameters obtained in that round to the NPUs in other nodes through the network planes, so that the model parameters of each round are exchanged among the nodes.
Exemplary Structure
The embodiment provides an NPU cluster network structure, which comprises nodes, a first network plane and a second network plane.
In one embodiment, as shown in FIG. 7, each node includes eight NPUs and four CPUs. NPU1, NPU2, NPU3 and NPU4 with ports -0, -1, -2 and -3 form the first group of processors (NPU1, NPU2, NPU3 and NPU4 form an NPU group), and NPU1, NPU2, NPU3 and NPU4 with ports -4, -5, -6 and -7 form the second group of processors.
In this embodiment, the first network plane and the second network plane form a biplane networking mode; that is, the network is divided into two symmetrical network planes. The solid-line and dashed-line planes each contain 2048 NPUs and each independently form a Spine-Leaf network communication structure (fat-tree structure). The bisection bandwidth of each network plane reaches 2048/2 × 100 Gb/s = 12.8 TB/s. The 8 ports within each compute node are assigned to different network planes (e.g., ports 0, 1, 4 and 5 to the solid-line plane; ports 2, 3, 6 and 7 to the dashed-line plane). The advantages of this embodiment are: (1) high bandwidth utilization; (2) predictable network latency; (3) good scalability; (4) reduced requirements on the switches; (5) high security and availability. In addition, compared with merging the solid-line and dashed-line planes into a single network, networking them separately reduces the number of ports and switches required by each node, further optimizing hardware resource utilization.
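The bisection-bandwidth figure quoted above can be checked with plain arithmetic (no assumptions beyond the stated numbers; TB is taken as 1000 GB, matching the 12.8 TB/s in the text):

```python
# Per-plane bisection bandwidth: 2048 NPUs per plane, one 100 Gb/s port each.
npus_per_plane = 2048
port_gbps = 100  # Gb/s per NPU port

# Half the NPUs sit on each side of the bisection, so 1024 links cross it.
bisection_gbps = npus_per_plane / 2 * port_gbps
bisection_tb_per_s = bisection_gbps / 8 / 1000  # Gb/s -> GB/s -> TB/s
```

This yields 102,400 Gb/s, i.e., 12.8 TB/s per network plane, consistent with the text.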
Viewed from outside the compute node, the solid-line and dashed-line network planes do not interwork. To realize full-machine interconnection, this embodiment further constructs a cross-network-plane NPU interconnection architecture inside each node. As shown in FIG. 6, the high-speed interconnect technology of the Huawei Cache Coherent System (HCCS) enables high-speed communication between NPUs located in different network planes within the same compute node. In addition, multiple NPUs in the same plane can realize intra-group interconnection through HCCS or inter-group interconnection through PCI-Express (PCIe), thereby achieving full-machine NPU interconnection. Therefore, by adopting the biplane networking mode and combining cross-plane and co-plane NPU interconnection within the nodes, all NPUs can be interconnected under collective-communication optimization, achieving full-machine large-scale AI training.
In one embodiment, HCCS buses connect the four NPUs within the first group of processors, HCCS buses likewise connect the four NPUs within the second group of processors, and HCCS buses also connect the NPUs of the first group of processors to the NPUs of the second group. A high-speed serial computer expansion bus, PCIe (PCI-Express), connects the NPUs of the first group of processors to the CPUs of the node, and PCIe likewise connects the NPUs of the second group of processors to the CPUs of the node. NPUs in different nodes communicate using the RoCE network protocol (RDMA over Converged Ethernet, i.e., remote direct memory access over converged Ethernet).
In this embodiment, the connections among the NPUs within each node (the first-group and second-group NPUs together constitute the node's NPUs) and the connections among the CPUs within each node are intra-node homogeneous computing-unit interconnections, i.e., interconnections between CPUs, or between NPUs, within the same compute node. Communication over the cache-coherent bus protocol HCCS allows direct on-chip memory access between the interconnected units, realizing high-speed data interaction. Specifically, each node contains 4 CPUs, all fully interconnected through 3 HCCS links each; each node contains 8 NPUs forming 2 NPU groups (the first-group and second-group NPUs), and each NPU is fully interconnected with the other NPUs in its group through 3 HCCS links. There is no direct HCCS interconnect between the NPU groups. The single-link unidirectional bandwidth of HCCS is 30 GB/s, so the unidirectional aggregate bandwidth of each CPU/NPU is 90 GB/s.
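The HCCS aggregate-bandwidth arithmetic above can be verified directly (plain arithmetic from the stated figures):

```python
# Each CPU/NPU has 3 HCCS links, each 30 GB/s unidirectional.
hccs_links_per_unit = 3
hccs_gb_per_s_per_link = 30

# Unidirectional aggregate bandwidth per CPU/NPU.
aggregate_gb_per_s = hccs_links_per_unit * hccs_gb_per_s_per_link
```

This gives the 90 GB/s unidirectional aggregate bandwidth per CPU/NPU quoted in the text.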
In this embodiment, the connections between the NPUs and the CPUs inside each node are intra-node heterogeneous computing-unit interconnections, i.e., communication between heterogeneous computing units (CPU and NPU) within the same compute node. They are implemented through the high-speed serial bus PCIe: each NPU is connected to one CPU through one PCIe 4.0 x16 link. Each compute node contains 4 CPUs and 8 NPUs, so each CPU is point-to-point interconnected with two NPUs through the PCIe protocol, which in turn enables full interconnection of the NPUs across the two planes within the same node. Each PCIe 4.0 x16 link provides a theoretical unidirectional bandwidth of 32 GB/s. System-wide, PCIe can provide a bisection bandwidth of up to 512 × 4 × 32 GB/s = 64 TB/s, where 512 is the number of interconnection regions: the E-class NPU cluster comprises 2 fat-tree planes, each plane contains 512 interconnection regions, and each interconnection region is defined as a fully interconnected structure of 4 NPUs.
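The system-wide PCIe figure can likewise be re-derived (plain arithmetic; note that matching the text's 64 TB/s requires the binary convention 1 TB = 1024 GB here):

```python
# 512 interconnection regions per plane, 4 NPUs per region,
# 32 GB/s unidirectional per PCIe 4.0 x16 link.
regions = 512
npus_per_region = 4
pcie_gb_per_s = 32

pcie_bisection_tb_per_s = regions * npus_per_region * pcie_gb_per_s / 1024
```

512 × 4 × 32 = 65,536 GB/s, i.e., the 64 TB/s quoted in the text.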
In this embodiment, the connections between NPUs of different nodes are inter-node interconnections: each compute node provides 8 on-board 100GE Ethernet ports interconnected with high-speed switches, and direct cross-network-level access between compute nodes is realized with RoCE (RDMA over Converged Ethernet) v2 technology. RoCE is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. RoCE v2 supports routing; the inter-node interconnection makes full use of the oversubscription-free fat-tree network built on RoCE v2, and a single connection can reach a bidirectional communication bandwidth of 24 GB/s. The biplane network is implemented over the RoCE v2 network protocol, with each plane containing 2048 NPUs.
In this embodiment, PCIe and HCCS together provide multiple communication protocols, enabling the whole system to match the performance of a full fat-tree network and laying the foundation for efficient full-machine communication. Besides full-machine-scale training, model training at various scales can also be effectively supported. This multi-protocol bridging networking technology, by effectively combining inter-chip and inter-node interconnection, improves the dimensionality, flexibility and performance of communication.
Exemplary method
The network interconnection method of the embodiment can be applied to terminal equipment. In this embodiment, as shown in fig. 1, the network interconnection method specifically includes the following steps:
s100, selecting one NPU from a plurality of NPUs forming each node, and recording the NPU as a bridging NPU.
One NPU in each compute node (denoted NPU0) is selected as a "bridging" NPU serving as a temporary transit buffer: through the node-internal PCIe and HCCS channels, respectively, it exchanges data with the CPUs in the node and with the NPUs of the other plane (e.g., so that data can pass between the first group of processors and the NPUs in the second group of processors).
S200, controlling the NPUs inside the nodes to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs.
In one embodiment, step S200 includes steps S201 to S206 as follows:
s201, according to the bridging NPU, a first bridging NPU and a second bridging NPU in the bridging NPU are obtained, wherein the first bridging NPU is located in the first group of processors, and the second bridging NPU is located in the second group of processors.
S202, controlling a plurality of NPUs in the first group of processors to be interconnected through the first bridging NPU.
S203, controlling a plurality of NPUs in the second group of processors to be interconnected through the second bridging NPU.
One NPU is selected from the NPUs of each processor group as the bridging NPU, and the other NPUs of that group exchange data with it.
S204, the first bridge NPU and the second bridge NPU are controlled to be interconnected.
The two processor groups exchange data with each other through their bridging NPUs.
S205, the first bridge NPU is controlled to be interconnected with the CPU in the node.
S206, controlling the interconnection of the second bridge NPU and the CPU in the node.
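Steps S201 to S206 can be sketched as follows (a minimal illustration; the function names and the (src, dst) link representation are hypothetical, not from any real NPU SDK):

```python
def select_bridges(first_group, second_group):
    """S201: pick one bridging NPU per processor group (here: the lowest id)."""
    return min(first_group), min(second_group)

def intra_node_links(first_group, second_group, cpu_id):
    """Return the logical interconnections set up by steps S202-S206
    as (src, dst) pairs."""
    b1, b2 = select_bridges(first_group, second_group)
    links = [(npu, b1) for npu in first_group if npu != b1]    # S202
    links += [(npu, b2) for npu in second_group if npu != b2]  # S203
    links.append((b1, b2))                                     # S204
    links.append((b1, cpu_id))                                 # S205
    links.append((b2, cpu_id))                                 # S206
    return links
```

For the grouping of FIG. 6 (first group {0, 1, 4, 5}, second group {2, 3, 6, 7}), NPU0 and NPU2 are chosen as bridges, and nine logical links are established in total.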
And S300, controlling data transmission among the nodes after internal interconnection.
In one embodiment, the specific procedure of step S300 is as follows: controlling the bridging NPU of each node after interconnection to receive target data; and controlling the bridging NPU of each node to send the received target data to a target computing node through a RoCE network protocol.
The "bridging" NPU sends the target data (model parameters generated during training of the neural network model) over the RoCE network to the memory of the destination compute node, where the target data are saved.
In another embodiment, the bridging NPU of the target compute node is controlled to receive the target data, and is then controlled to distribute the received target data to the individual NPUs within the target compute node. In this embodiment, the "bridging" NPU of the destination compute node completes the data distribution through PCIe and HCCS. Communication with NPUs that have no direct external network connection is realized through the HCCS links between NPUs of different planes within the node and the PCIe links between the CPUs and the NPUs, thereby realizing full-machine parameter-plane communication.
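The send-and-distribute flow of step S300 can be sketched as below (illustrative only; the function names and the dict-based "transfer" model are assumptions, not a real RoCE API):

```python
def send_over_roce(src_bridge, dst_bridge, target_data):
    """Model the RoCE transfer between two nodes' bridging NPUs as a handoff."""
    return {"src": src_bridge, "dst": dst_bridge, "payload": target_data}

def distribute_in_node(payload, node_npus):
    """The destination bridging NPU fans the payload out to every NPU in its
    node over the intra-node HCCS/PCIe links; returns a delivery map."""
    return {npu: payload for npu in node_npus}
```

For example, parameters from node #1's bridge (NPU0) are handed to node #2's bridge, which then delivers one copy to each NPU of node #2.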
The following describes the performance of the network interconnection method of the present invention, taking 4096 NPU cards as an example:
the theoretical bisection bandwidth of the fat-tree structure of each of the 2 network planes can reach 12.8 TB/s, and the bisection bandwidth of PCIe can reach 64 TB/s. The key network metrics of the invention are shown in Table 1:
TABLE 1
The E-class-computing-power intelligent networking mode provided by the invention has successfully supported the training of ultra-large-scale AI models with trillions of parameters; in addition, the platform also supports practical applications in nearly twenty scenarios such as urban management and intelligent transportation.
In one embodiment, as shown in FIG. 8, the cluster consists of 4 AI sub-clusters, each containing 16 racks; each rack contains 8 nodes, and each node consists of 8 NPUs. For network interconnection, a classical Spine-Leaf two-layer switching network architecture is adopted: the Leaf layer consists of access switches, with each cabinet connected to 2 Leaf switches, which aggregate traffic from the NPU compute nodes and connect directly to the Spine-layer switches; the Spine switches interconnect all Leaf switches in a two-layer fat-tree topology. Each AI sub-cluster uses 32 Leaf switches and 16 Spine switches in total. The advantage of this networking mode is that the network scale does not expand excessively.
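The sizes quoted for this embodiment are internally consistent, as a quick check shows (plain arithmetic, no assumptions beyond the numbers above):

```python
# Cluster sizing from the FIG. 8 embodiment.
sub_clusters = 4
racks_per_cluster = 16
nodes_per_rack = 8
npus_per_node = 8
leaf_switches_per_rack = 2  # each cabinet connects to 2 Leaf switches

total_npus = sub_clusters * racks_per_cluster * nodes_per_rack * npus_per_node
leaf_per_cluster = racks_per_cluster * leaf_switches_per_rack
```

This yields 4096 NPUs in total, matching the 4096-card example above, and 32 Leaf switches per sub-cluster, matching the stated switch count.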
In another embodiment, as shown in FIG. 9, ***'s tensor processing units (TPUs) form a TPU Pod, typically interconnected in a Mesh or 3D Torus topology.
Exemplary apparatus
The embodiment also provides a network interconnection device, which comprises the following components:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node and marking the NPU as a bridge NPU;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridging NPU, and the interconnection is mutual data transmission among the NPUs;
and the data transmission module is used for controlling data transmission among the nodes after internal interconnection.
Based on the above embodiment, the present invention also provides a terminal device, and a functional block diagram thereof may be shown in fig. 10. The terminal equipment comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein the processor of the terminal device is adapted to provide computing and control capabilities. The memory of the terminal device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the terminal device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a network interconnection method. The display screen of the terminal equipment can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal equipment is preset in the terminal equipment and is used for detecting the running temperature of the internal equipment.
It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal device to which the present inventive arrangements are applied, and that a particular terminal device may include more or less components than those shown, or may combine some of the components, or may have a different arrangement of components.
In one embodiment, there is provided a terminal device including a memory, a processor, and a network interconnection program stored in the memory and executable on the processor, the processor implementing the following operation instructions when executing the network interconnection program:
selecting one NPU from a plurality of NPUs forming each node, and marking the NPU as a bridging NPU;
controlling a plurality of NPUs inside each node to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs;
and controlling data transmission among the nodes after internal interconnection.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.