CN115809685A - NPU cluster network structure and network interconnection method - Google Patents

NPU cluster network structure and network interconnection method

Info

Publication number
CN115809685A
Authority
CN
China
Prior art keywords: npu, network, bridging, npus, processors
Prior art date
Legal status
Granted
Application number
CN202310088059.XA
Other languages
Chinese (zh)
Other versions
CN115809685B (en)
Inventor
田永鸿 (Tian Yonghong)
陈文光 (Chen Wenguang)
高文 (Gao Wen)
王丙强 (Wang Bingqiang)
林哲 (Lin Zhe)
章弋嘉 (Zhang Gejia)
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202310088059.XA
Publication of CN115809685A
Application granted
Publication of CN115809685B
Status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)

Abstract

The invention relates to the technical field of communication, and in particular to an NPU cluster network structure and a network interconnection method. The invention divides each node used for training a neural network model into two groups, a first group of processors and a second group of processors, and divides the network into two planes, each of which is responsible only for transmitting the data generated by one group of processors during training. This improves the data transmission efficiency of the NPU cluster formed by the NPU processors; once the transmission efficiency is improved, it no longer constrains the computing power of the NPU cluster, so the computing power of the NPU cluster is improved.

Description

NPU cluster network structure and network interconnection method
Technical Field
The invention relates to the technical field of communication, in particular to an NPU cluster network structure and a network interconnection method.
Background
In the training of artificial intelligence (AI) models, the greater the complexity of the model, the higher the requirement on AI computing power; as shown in fig. 2, the AI computing power requirement increases year by year. The latest AI models have reached the level of E-class AI computing power, i.e., the FP16 computing power used in single-model training exceeds 1E OPS (10^18 operations per second).
E-level AI computing power places severe demands on both computational efficiency and communication efficiency. Neural-network processing units (NPUs) can satisfy the computational-efficiency requirement of E-level AI computing power: for example, a plurality of NPUs integrated in an Ascend 910 chip form an NPU cluster that meets this requirement. However, increasing the number of NPUs also means that more AI chips must be used in parallel (a plurality of NPUs and CPUs constitute an AI chip, and one AI chip is also called a node). As shown in fig. 3, as the AI parallelism (number of nodes) increases, network traffic (the amount of data the network must carry as communication information) grows nearly linearly, and once it reaches a certain volume, communication efficiency (data interaction efficiency) begins to hinder the computing power of the NPU cluster.
In summary, the data interaction efficiency in the prior art limits the computation power of the NPU cluster.
Thus, there is a need for improvements and enhancements in the art.
Disclosure of Invention
In order to solve the technical problems, the invention provides an NPU cluster network structure and a network interconnection method, and solves the problem that the computational power of an NPU cluster is limited by the data interaction efficiency in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides an NPU cluster network structure, including:
the node is used as a hardware structure for training a neural network model and comprises a first group of processors and a second group of processors, and the NPU of the first group of processors is electrically connected with the NPU of the second group of processors;
a first network plane electrically connected to the first set of processors;
a second network plane electrically connected to the second set of processors.
In one implementation manner, the node further comprises a plurality of CPUs, and a high-speed serial computer expansion bus is connected between each CPU and the NPU of the first group of processors and the NPU of the second group of processors; HCCS buses are connected among a plurality of CPUs.
In one implementation, an HCCS bus is connected between the NPUs of the first set of processors, an HCCS bus is connected between the NPUs of the second set of processors, and an HCCS bus is connected between the NPUs of the first set of processors and the NPUs of the second set of processors.
In one implementation, the NPUs of different nodes communicate with each other using the RoCE network protocol over the first network plane or the second network plane.
In one implementation, the first set of processors and the second set of processors each include four of the NPUs.
In a second aspect, an embodiment of the present invention provides a network interconnection method, including:
selecting one NPU from a plurality of NPUs forming each node, and recording the NPU as a bridging NPU;
controlling a plurality of the NPUs within each of the nodes to be interconnected through the bridging NPU, wherein the interconnection is the mutual data transmission among the plurality of the NPUs;
and controlling data transmission among the nodes after internal interconnection.
In one implementation, controlling the NPUs within each of the nodes to be interconnected through the bridging NPU, the interconnection being mutual data transmission among the NPUs, includes:
obtaining, from the bridging NPUs, a first bridging NPU and a second bridging NPU, wherein the first bridging NPU is located in the first group of processors, and the second bridging NPU is located in the second group of processors;
controlling a number of the NPUs within the first set of processors to be interconnected by the first bridging NPU;
controlling a number of the NPUs within the second set of processors to be interconnected by the second bridging NPU;
controlling the first bridging NPU to interconnect with the second bridging NPU.
In one implementation, controlling the NPUs within each of the nodes to be interconnected through the bridging NPU, the interconnection being mutual data transmission among the NPUs, further includes:
controlling the first bridging NPU to interconnect with a CPU within the node;
controlling the second bridging NPU to interconnect with a CPU within the node.
In one implementation, the controlling data transmission between the nodes after the internal interconnection includes:
controlling the bridging NPU of each interconnected node to receive target data;
and controlling the bridging NPU of each node to send the received target data to a target computing node through the RoCE network protocol.
In one implementation, after the bridging NPU of each of the nodes sends the received target data to the target computing node via the RoCE network protocol, the method further includes:
controlling the bridging NPU of the target computing node to receive the target data;
controlling the bridging NPU of the target computing node to distribute the received target data to each NPU within the target computing node.
In a third aspect, an embodiment of the present invention further provides a network interconnection apparatus, where the apparatus includes the following components:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node and recording the selected NPU as a bridge NPU;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridge NPU, and the interconnection is realized by mutually transmitting data among the NPUs;
and the data transmission module is used for controlling data transmission among all the nodes after internal interconnection.
In a fourth aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a network interconnection program that is stored in the memory and is executable on the processor, and when the processor executes the network interconnection program, the steps of the network interconnection method are implemented.
In a fifth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a network interconnection program is stored on the computer-readable storage medium, and when the network interconnection program is executed by a processor, the network interconnection program implements the steps of the network interconnection method described above.
Advantageous effects: the invention divides each node used for training a neural network model into two groups, a first group of processors and a second group of processors, and divides the network into two planes, each of which is responsible only for transmitting the data generated by one group of processors during training. This improves the data transmission efficiency of the NPU cluster formed by the NPU processors; once the transmission efficiency is improved, it no longer constrains the computing power of the NPU cluster, so the computing power of the NPU cluster is improved.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a diagram illustrating AI computational power increase in the background art;
FIG. 3 is a diagram illustrating the relationship between AI parallelism and data interaction efficiency in the background art;
FIG. 4 is a diagram of a bi-plane network architecture in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the Spine and Leaf switch layers divided into two network planes according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of the internal interconnection of nodes in an embodiment of the present invention;
FIG. 7 is a block diagram of a node in an embodiment of the present invention;
FIG. 8 is a schematic diagram of a group network of 4096 NPUs consisting of 4 clusters in an embodiment of the present invention;
FIG. 9 is a diagram of a TPU-v3 Pod interconnect architecture in an embodiment of the invention;
FIG. 10 is a schematic block diagram of the internal structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solution of the invention is described clearly and completely below in conjunction with the embodiments and the accompanying drawings of the specification. All other embodiments obtained by a person of ordinary skill in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Research shows that in the training of artificial intelligence (AI) models, the greater the complexity of the model, the higher the requirement on AI computing power; as shown in FIG. 2, the AI computing power requirement increases year by year. The latest AI models have reached the level of E-class AI computing power, i.e., the FP16 computing power used in single-model training exceeds 1E OPS (10^18 operations per second).
E-level AI computing power places severe demands on both computational efficiency and communication efficiency. Neural-network processing units (NPUs) can satisfy the computational-efficiency requirement of E-level AI computing power: for example, a plurality of NPUs integrated in an Ascend 910 chip form an NPU cluster that meets this requirement. However, increasing the number of NPUs also means that more AI chips must be used in parallel (a plurality of NPUs and CPUs constitute an AI chip, and one AI chip is also called a node). As shown in fig. 3, as the AI parallelism (node count) increases, network traffic (the amount of data the network must carry as communication information) grows nearly linearly, and once it reaches a certain volume, communication efficiency (data interaction efficiency) begins to hinder the computing power of the NPU cluster.
In order to solve the above technical problems, the invention provides an NPU cluster network structure and a network interconnection method, solving the prior-art problem that the computing power of an NPU cluster is limited by the data interaction efficiency. In a specific implementation, the nodes are divided into a first group of processors and a second group of processors, and the network plane is likewise divided into a first network plane and a second network plane; the first network plane is only responsible for the data interaction of the first group of processors, and the second network plane only for that of the second group. The invention can thus improve the data interaction efficiency of the processors and thereby the computing power of the NPU cluster they form.
For example, as shown in FIG. 4, the nodes and switches form two identical network planes (labeled the first network plane and the second network plane). As shown in FIG. 5, one cabinet contains 8 nodes. The internal structure of each node is shown in FIG. 6 (each node also contains CPUs, which are omitted from FIG. 6; only the NPUs are shown as core processors): each node includes NPU0, NPU1, NPU2, NPU3, NPU4, NPU5, NPU6 and NPU7. NPU0, NPU1, NPU4 and NPU5 form the first group of processors; NPU2, NPU3, NPU6 and NPU7 form the second group of processors. As shown in FIG. 5, the Spine and Leaf switch layers are divided into two network planes, each composed of independent Spine switches SW0, SW1, ..., SW31 and Leaf switches SW0 of column 0 through column 63. Each network plane is responsible only for transmitting the parameters generated by one group of processors at each node. Taking node #1 in FIG. 5 as an example, the solid line marks the first network plane: it transmits only the parameters generated by training the neural network model on NPU0, NPU1, NPU4 and NPU5 of node #1 to other nodes on the same plane; in other words, it handles only the interaction of the parameter data of those four NPUs with other nodes of the same plane. Likewise, the second network plane handles only the interaction of the parameter data on NPU2, NPU3, NPU6 and NPU7 with other nodes of the same plane.
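To make the grouping concrete, the following Python sketch (our illustration, not part of the patent; all names are ours) maps each NPU index of an 8-NPU node to the network plane that carries its parameter traffic:

# A minimal sketch, assuming the plane assignment described above:
# NPU0/1/4/5 belong to the first (solid-line) plane and NPU2/3/6/7
# to the second (dotted-line) plane.

FIRST_PLANE_NPUS = {0, 1, 4, 5}   # first group of processors
SECOND_PLANE_NPUS = {2, 3, 6, 7}  # second group of processors

def plane_of(npu_index: int) -> int:
    """Return 1 or 2: the network plane carrying this NPU's traffic."""
    if npu_index in FIRST_PLANE_NPUS:
        return 1
    if npu_index in SECOND_PLANE_NPUS:
        return 2
    raise ValueError(f"a node has NPUs 0..7, got {npu_index}")

# Cross-node traffic stays within one plane: NPU5 of node #1 exchanges
# parameters only with NPUs of other nodes that sit on the same plane.
assert plane_of(5) == plane_of(0) == 1
assert plane_of(6) == 2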
An application scenario of this embodiment: the architecture and parameters of a neural network model for image recognition, object detection or natural language processing are delivered to the eight nodes in FIG. 5, a training data set is loaded onto the eight nodes, and the NPUs in the eight nodes train the model in parallel on the training set. After each training round, the NPU in each node transmits the model parameters obtained in that round through the network plane to the NPUs in the other nodes, realizing the inter-node exchange of the parameters produced by each round of training.
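As a rough illustration of this round-by-round flow, the sketch below (our illustration; the aggregation by averaging is an assumption added only for concreteness and is not stated in the patent) mimics eight nodes training in parallel and merging parameters after every round:

# A minimal sketch of per-round parameter exchange across 8 nodes.
# train_round and exchange are stand-ins for local training and for the
# inter-node interaction over the network planes.

def train_round(node_id: int, params: list[float]) -> list[float]:
    # stand-in for one round of local neural-network training
    return [p + 0.01 * node_id for p in params]

def exchange(all_params: dict[int, list[float]]) -> list[float]:
    # stand-in for inter-node parameter interaction: average the
    # per-node parameters (our assumed aggregation rule)
    n = len(all_params)
    return [sum(vals) / n for vals in zip(*all_params.values())]

params = {node: [0.0, 0.0] for node in range(8)}   # 8 nodes, as in FIG. 5
for _ in range(3):                                 # a few training rounds
    params = {node: train_round(node, p) for node, p in params.items()}
    merged = exchange(params)
    params = {node: list(merged) for node in params}  # broadcast back
print(params[0])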
Exemplary Structure
The embodiment provides an NPU cluster network structure, which includes a node, a first network plane, and a second network plane.
In one embodiment, as shown in FIG. 7, each node includes eight NPUs and four CPUs. The four NPUs whose network ports are ports 0, 1, 2 and 3 form the first group of processors (these four NPUs also form an NPU cluster), and the four NPUs whose network ports are ports 4, 5, 6 and 7 form the second group of processors.
In this embodiment, the first and second network planes form a biplane networking mode: the network is divided into two symmetric planes. The solid-line and dotted-line planes each contain 2048 NPUs, and each independently forms a Spine-Leaf communication structure (a fat-tree structure). The bisection bandwidth per network plane reaches 2048/2 × 100 Gb/s = 12.8 TB/s. The 8 ports inside each computing node belong to different network planes (for example, ports 0, 1, 4 and 5 belong to the solid-line plane, and ports 2, 3, 6 and 7 to the dotted-line plane). The advantages of this embodiment are: (1) high bandwidth utilization; (2) predictable network delay; (3) good scalability; (4) reduced requirements on the switches; (5) high safety and availability. In addition, compared with a single network that merges the solid-line and dotted-line planes, deploying two planes reduces the number of ports and switches needed per node, further optimizing the utilization of hardware resources.
The solid-line and dotted-line network planes do not communicate with each other outside the compute nodes. To achieve full-machine interconnection, this embodiment further constructs a cross-plane NPU interconnection architecture inside each node. As shown in FIG. 6, high-speed communication between NPUs on different network planes within the same compute node is implemented through the high-bandwidth cache-coherent interconnect HCCS. In addition, NPUs on the same plane can be interconnected within a group through HCCS, or between groups through PCI-Express (PCIe), thereby achieving full-machine NPU interconnection. Therefore, with the biplane networking mode, combined with the in-node cross-plane and coplanar NPU interconnections, all NPUs can be interconnected under collective-communication optimization, completing the goal of full-machine large-scale AI training.
In one embodiment, an HCCS bus (Huawei Cache Coherent System) connects the four NPUs inside the first group of processors, an HCCS bus likewise connects the four NPUs inside the second group of processors, and an HCCS bus also connects the NPUs of the first group with those of the second group. A high-speed serial computer expansion bus, PCIe (PCI-Express), connects the NPUs of the first group with the CPUs of the first group, and likewise the NPUs of the second group with the CPUs of the second group. NPUs of different nodes communicate using the RoCE network protocol (RDMA over Converged Ethernet).
In this embodiment, the connections among the NPUs within each node (the first and second groups of processor NPUs together constitute the node's NPUs) and among the CPUs within each node belong to the intra-node homogeneous computing-unit interconnect, i.e., interconnection between CPUs, or between NPUs, within the same compute node. Communication uses the cache-coherence bus protocol HCCS, so the CPUs/NPUs can perform on-chip direct memory access to each other, realizing high-speed data interaction. Specifically, each node contains 4 CPUs, fully interconnected with 3 HCCS links per CPU; each node contains 8 NPUs, divided into 2 NPU groups (the first and second groups of processor NPUs), and each NPU is fully interconnected with the other NPUs of its group through 3 HCCS links. There is no direct HCCS interconnection between NPU groups. The single-link unidirectional bandwidth of HCCS is 30 GB/s, so the unidirectional aggregate bandwidth of each CPU/NPU is 90 GB/s.
In this embodiment, the connection between the NPUs and CPUs inside each node belongs to the intra-node heterogeneous computing-unit interconnect, i.e., communication between heterogeneous computing units (CPU and NPU) within the same compute node. It is implemented over the high-speed serial bus PCIe: each NPU is connected to one CPU through one PCIe 4.0 x16 link. Each compute node contains 4 CPUs and 8 NPUs, so each CPU is interconnected point-to-point with two NPUs over PCIe, which in turn enables full interconnection of the NPUs of the two planes within the same node. Each PCIe 4.0 x16 link provides a theoretical unidirectional bandwidth of 32 GB/s. The PCIe of the whole system thus provides a halved (bisection) bandwidth of 512 × 4 × 32 GB/s = 64 TB/s, where 512 is the number of interconnection domains: the E-level NPU cluster comprises 2 fat-tree planes, each plane contains 512 interconnection domains, and each interconnection domain is defined as a fully interconnected structure of 4 NPUs.
In this embodiment, the connections between NPUs of different nodes belong to the inter-node interconnect: each compute node provides 8 onboard 100GE Ethernet ports interconnected with high-speed switches, and, combined with RoCE (RDMA over Converged Ethernet) v2, compute nodes can access each other directly across network tiers. RoCE is a network protocol that allows remote direct memory access (RDMA) over Ethernet; RoCE v2 adds a routing capability. The inter-node interconnect makes full use of a non-blocking fat-tree network built on RoCE v2, and a single connection can reach a bidirectional communication bandwidth of 24 GB/s. The biplane network is implemented with the RoCE v2 protocol, each plane containing 2048 NPUs.
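As a back-of-the-envelope check (our sketch; the patent mixes bit and byte units and decimal and binary prefixes, so the conversions below simply follow the figures quoted above), the headline bandwidth numbers can be reproduced as follows:

# Reproducing the quoted figures: 90 GB/s aggregate per CPU/NPU over HCCS,
# 12.8 TB/s bisection per fat-tree plane, and 64 TB/s PCIe halved bandwidth.

HCCS_LINK_GBPS = 30            # unidirectional bandwidth per HCCS link
HCCS_LINKS_PER_UNIT = 3        # each CPU/NPU has 3 HCCS links in its group
print(HCCS_LINKS_PER_UNIT * HCCS_LINK_GBPS, "GB/s")         # 90 GB/s

NPUS_PER_PLANE = 2048
PORT_GBITPS = 100              # one 100GE port per NPU
bisection_GBps = (NPUS_PER_PLANE // 2) * PORT_GBITPS / 8    # bits -> bytes
print(bisection_GBps / 1000, "TB/s")                        # 12.8 TB/s (decimal prefix)

DOMAINS_PER_PLANE = 512        # fully interconnected groups of 4 NPUs
PCIE_GBPS = 32                 # PCIe 4.0 x16, unidirectional
print(DOMAINS_PER_PLANE * 4 * PCIE_GBPS / 1024, "TB/s")     # 64 TB/s (binary prefix)

Note that the 12.8 TB/s figure follows a decimal prefix (1 TB = 1000 GB) while the 64 TB/s figure follows a binary one (1 TB = 1024 GB); both match the patent's stated values under those conventions.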
In this embodiment, PCIe and HCCS together form a multi-protocol communication fabric, allowing the whole system to match the performance of a full fat-tree network and providing the basis for efficient full-machine communication. Besides full-machine-scale training, it also effectively supports model training at various scales. This multi-protocol bridging networking technique improves communication dimensionality, flexibility and performance by effectively combining inter-chip and inter-node interconnects.
Exemplary method
The network interconnection method of the embodiment can be applied to terminal equipment. In this embodiment, as shown in fig. 1, the network interconnection method specifically includes the following steps:
s100, selecting one NPU from a plurality of NPUs forming each node, and marking the selected NPU as a bridging NPU.
One NPU in the compute node (denoted NPU0) is selected as a "bridge", i.e., a temporary relay buffer. Through the intra-node PCIe and HCCS channels it exchanges data with the CPUs in the node and with the NPUs of the other plane (i.e., data interaction between the CPUs of the first group of processors and the NPUs of the second group of processors).
And S200, controlling the NPUs in each node to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs.
In one embodiment, step S200 includes steps S201 to S206 as follows:
s201, obtaining a first bridging NPU and a second bridging NPU in the bridging NPU according to the bridging NPU, wherein the first bridging NPU is located in the first group of processors, and the second bridging NPU is located in the second group of processors.
S202, controlling a plurality of the NPUs in the first group of processors to be interconnected through the first bridging NPU.
S203, controlling a plurality of the NPUs in the second group of processors to be interconnected through the second bridging NPU.
One NPU is selected from the NPUs of each group of processors as the bridging NPU, and the other NPUs of that group exchange data with it.
S204, controlling the first bridging NPU and the second bridging NPU to be interconnected.
The two groups of processors then exchange data with each other through their bridging NPUs.
And S205, controlling the first bridge NPU to be interconnected with the CPU in the node.
S206, controlling the second bridge NPU to be interconnected with the CPU in the node.
And S300, controlling the data transmission between the nodes after the internal interconnection.
In one embodiment, the specific process of step S300 is as follows: controlling the bridging NPU of each interconnected node to receive target data, and controlling the bridging NPU of each node to send the received target data to the target computing node through the RoCE network protocol.
The "bridging" NPU sends the target data (the model parameters generated while training the neural network model) over the RoCE network to the memory of the destination computing node for storage.
In another embodiment, the bridging NPU of the target computing node is controlled to receive the target data and then to distribute it to each NPU within the target computing node. In this embodiment, the distribution is accomplished by the "bridging" NPU of the destination node over PCIe and HCCS. The HCCS links between NPUs of different planes within a node, together with the PCIe links between CPUs and NPUs, enable communication between NPUs that appear disconnected at the network level, thereby realizing communication across the full-machine parameter plane.
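The relay flow of steps S100 to S300 can be sketched as follows (our illustration, not the patent's code; the transport calls are stubs standing in for HCCS, PCIe and RoCE transfers, and choosing the first NPU of each group as the bridge is an assumption made only for the example):

from typing import Dict, List

class Node:
    def __init__(self, node_id: int):
        self.node_id = node_id
        # two groups of four NPUs, as in FIG. 6
        self.groups: Dict[int, List[int]] = {1: [0, 1, 4, 5], 2: [2, 3, 6, 7]}
        # S100: pick one NPU per group as the bridging NPU (here: the first)
        self.bridge: Dict[int, int] = {g: npus[0] for g, npus in self.groups.items()}

    def gather_to_bridge(self, group: int, data_per_npu: Dict[int, bytes]) -> bytes:
        # S201-S203: the in-group NPUs hand their data to the bridge (HCCS stub)
        return b"".join(data_per_npu[n] for n in self.groups[group])

    def send_over_roce(self, payload: bytes, dst: "Node", group: int) -> None:
        # S300: the bridge ships the gathered payload to the peer node's bridge
        dst.receive_at_bridge(group, payload)

    def receive_at_bridge(self, group: int, payload: bytes) -> None:
        # the destination bridge fans the data out to its in-group NPUs
        b = self.bridge[group]
        print(f"node {self.node_id}: bridge NPU{b} got {len(payload)} bytes over RoCE")
        chunk = len(payload) // len(self.groups[group])
        for npu in self.groups[group]:
            print(f"node {self.node_id}: NPU{npu} received {chunk} bytes")

src, dst = Node(1), Node(2)
payload = src.gather_to_bridge(1, {n: b"x" * 4 for n in src.groups[1]})
src.send_over_roce(payload, dst, 1)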
The performance of the network interconnection method of the present invention is illustrated by taking 4096 NPU cards as an example:
theoretically, the bisection bandwidth of the 2 network-plane fat-tree structures can reach 12.8TB/s, and the PCIe bisection bandwidth can reach 64TB/s. Through actual measurement, the network key indexes of the invention are shown in table 1:
TABLE 1
[Table 1 is provided as an image in the original publication; its measured values are not reproduced here.]
This E-level computing-power networking mode has successfully supported billion-parameter ultra-large-scale AI model training, and the platform also supports practical applications in nearly twenty scenarios such as urban management and intelligent transportation.
In one embodiment, as shown in FIG. 8, the system consists of 4 AI sub-clusters; each sub-cluster contains 16 cabinets, each cabinet contains 8 nodes, and each node consists of 8 NPUs. For network interconnection, a classic two-layer Spine-Leaf switching architecture is used: the Leaf layer consists of access switches; each cabinet connects to 2 Leaf switches, which aggregate the traffic from the NPU compute nodes and connect directly to the Spine-layer switches; the Spine switches interconnect all Leaf switches in a two-layer fat-tree topology. Each AI sub-cluster uses 32 Leaf switches and 16 Spine switches in total. The advantage of this networking approach is that the size of the network does not grow excessively.
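A quick tally (our sketch, using only the counts quoted above; the total switch count is derived arithmetic, not a figure stated in the patent) confirms the organization of FIG. 8:

# Counting the FIG. 8 organization: 4 sub-clusters of 16 cabinets,
# 8 nodes per cabinet, 8 NPUs per node, 2 Leaf switches per cabinet.

SUB_CLUSTERS = 4
CABINETS_PER_CLUSTER = 16
NODES_PER_CABINET = 8
NPUS_PER_NODE = 8
LEAF_PER_CABINET = 2
SPINE_PER_CLUSTER = 16

npus_total = SUB_CLUSTERS * CABINETS_PER_CLUSTER * NODES_PER_CABINET * NPUS_PER_NODE
leaf_per_cluster = CABINETS_PER_CLUSTER * LEAF_PER_CABINET

print(npus_total)                                    # 4096 NPUs in the full machine
print(leaf_per_cluster)                              # 32 Leaf switches per sub-cluster
print(SUB_CLUSTERS * (leaf_per_cluster + SPINE_PER_CLUSTER))  # 192 switches overall (derived)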
In another embodiment, as shown in FIG. 9, the TPU Pods formed by ***'s Tensor Processing Units (TPUs) usually adopt a Mesh or 3D-Torus interconnect topology.
Exemplary devices
The present embodiment further provides a network interconnection apparatus, which includes the following components:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node and marking the selected NPU as a bridge NPU;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridge NPU, and the interconnection is realized by mutually transmitting data among the NPUs;
and the data transmission module is used for controlling the data transmission among the nodes after the internal interconnection.
Based on the above embodiments, the present invention further provides a terminal device, and a schematic block diagram thereof may be as shown in fig. 10. The terminal equipment comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein the processor of the terminal device is configured to provide computing and control capabilities. The memory of the terminal equipment comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the terminal device is used for connecting and communicating with an external terminal through a network. The computer program is executed by a processor to implement a network interconnection method. The display screen of the terminal equipment can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal equipment is arranged in the terminal equipment in advance and used for detecting the operating temperature of the internal equipment.
It will be understood by those skilled in the art that the block diagram of fig. 10 is only a block diagram of a part of the structure related to the solution of the present invention, and does not constitute a limitation to the terminal equipment to which the solution of the present invention is applied, and a specific terminal equipment may include more or less components than those shown in the figure, or may combine some components, or have different arrangements of components.
In one embodiment, a terminal device is provided, where the terminal device includes a memory, a processor, and a network interconnection program stored in the memory and executable on the processor, and the processor executes the network interconnection program to implement the following operation instructions:
selecting one NPU from a plurality of NPUs forming each node, and recording the NPU as a bridging NPU;
controlling a plurality of NPUs inside each node to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the plurality of NPUs;
and controlling data transmission among the nodes after internal interconnection.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (13)

1. An NPU cluster network architecture, comprising:
the node is used as a hardware structure for training a neural network model and comprises a first group of processors and a second group of processors, and the NPU of the first group of processors is electrically connected with the NPU of the second group of processors;
a first network plane electrically connected to the first set of processors;
a second network plane electrically connected to the second set of processors.
2. The NPU cluster network structure of claim 1, wherein the node further comprises a plurality of CPUs, each CPU having a high speed serial computer expansion bus connected to the NPU of the first set of processors and the NPU of the second set of processors; HCCS buses are connected among a plurality of CPUs.
3. The NPU cluster network structure of claim 2, wherein an HCCS bus is connected between a number of the NPUs of the first set of processors, an HCCS bus is connected between a number of the NPUs of the second set of processors, and an HCCS bus is connected between the NPUs of the first set of processors and the NPUs of the second set of processors.
4. The NPU cluster network structure of claim 3, wherein the NPUs of different nodes communicate with each other using the RoCE network protocol over the first network plane or the second network plane.
5. The NPU cluster network structure of claim 1, wherein the first set of processors and the second set of processors each include four of the NPUs.
6. A method of interconnecting a network based on the NPU cluster network architecture of any of claims 1 to 5, comprising:
selecting one NPU from a plurality of NPUs forming each node, and recording the NPU as a bridging NPU;
controlling a plurality of the NPUs within each of the nodes to be interconnected through the bridging NPU, wherein the interconnection is the mutual data transmission among the plurality of the NPUs;
and controlling data transmission among the nodes after internal interconnection.
7. The method of claim 6, wherein controlling the NPUs within each of the nodes to be interconnected through the bridging NPU, the interconnection being mutual data transmission among the NPUs, comprises:
obtaining, from the bridging NPUs, a first bridging NPU and a second bridging NPU, wherein the first bridging NPU is located in the first group of processors, and the second bridging NPU is located in the second group of processors;
controlling a number of the NPUs within the first set of processors to be interconnected by the first bridging NPU;
controlling a number of the NPUs within the second set of processors to be interconnected by the second bridging NPU;
controlling the first bridging NPU to interconnect with the second bridging NPU.
8. The method of claim 7, wherein controlling the NPUs within each of the nodes to be interconnected through the bridging NPU, the interconnection being mutual data transmission among the NPUs, further comprises:
controlling the first bridging NPU to interconnect with a CPU within the node;
controlling the second bridging NPU to interconnect with a CPU within the node.
9. The method as claimed in claim 6, wherein said controlling data transmission between said nodes after internal interconnection comprises:
controlling the bridging NPU of each interconnected node to receive target data;
and controlling the bridging NPU of each node to send the received target data to a target computing node through the RoCE network protocol.
10. The method of claim 9, wherein after the bridging NPU of each of the nodes sends the received target data to the target computing node via the RoCE network protocol, the method further comprises:
controlling the bridging NPU of the target computing node to receive the target data;
controlling the bridging NPU of the target computing node to distribute the received target data to each NPU within the target computing node.
11. A network interconnection apparatus, comprising:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node and marking the selected NPU as a bridge NPU;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridge NPU, and the interconnection is realized by mutually transmitting data among the NPUs;
and the data transmission module is used for controlling data transmission among all the nodes after internal interconnection.
12. A terminal device, characterized in that the terminal device comprises a memory, a processor and a network interconnection program stored in the memory and operable on the processor, and the processor implements the steps of the network interconnection method according to any one of claims 6 to 10 when executing the network interconnection program.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a network interconnection program, which, when executed by a processor, implements the steps of the network interconnection method according to any one of claims 6 to 10.
CN202310088059.XA 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method Active CN115809685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310088059.XA CN115809685B (en) 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310088059.XA CN115809685B (en) 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method

Publications (2)

Publication Number Publication Date
CN115809685A (en) 2023-03-17
CN115809685B (en) 2023-07-25

Family

ID=85487823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310088059.XA Active CN115809685B (en) 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method

Country Status (1)

Country Link
CN (1) CN115809685B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030046512A1 (en) * 2001-08-29 2003-03-06 Hitachi, Ltd. Parallel computer system and method for assigning processor groups to the parallel computer system
CN102138138A (en) * 2008-08-18 2011-07-27 国际商业机器公司 Method and system for implementing a stream processing computer architecture
CN102904943A (en) * 2012-09-28 2013-01-30 无锡江南计算技术研究所 Cluster computing system hybrid communication method based on embedded processor memory interface
WO2019214128A1 (en) * 2018-05-08 2019-11-14 济南浪潮高新科技投资发展有限公司 Dynamic reconfigurable intelligent computing cluster and configuration method therefor
CN114902201A (en) * 2020-01-03 2022-08-12 微软技术许可有限责任公司 Distributed processing architecture

Also Published As

Publication number Publication date
CN115809685B (en) 2023-07-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Tian Yonghong; Gao Wen; Wang Bingqiang; Lin Zhe; Zhang Gejia

Inventor before: Tian Yonghong; Chen Wenguang; Gao Wen; Wang Bingqiang; Lin Zhe; Zhang Gejia
