CN115809685B - NPU cluster network structure and network interconnection method - Google Patents

NPU cluster network structure and network interconnection method

Info

Publication number
CN115809685B
Authority
CN
China
Prior art keywords
npu
processors
npus
network
bridging
Prior art date
Legal status
Active
Application number
CN202310088059.XA
Other languages
Chinese (zh)
Other versions
CN115809685A (en)
Inventor
田永鸿 (Tian Yonghong)
陈文光 (Chen Wenguang)
高文 (Gao Wen)
王丙强 (Wang Bingqiang)
林哲 (Lin Zhe)
章弋嘉 (Zhang Gejia)
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202310088059.XA
Publication of CN115809685A
Application granted
Publication of CN115809685B
Legal status: Active (current)
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Multi Processors (AREA)

Abstract

The invention relates to the technical field of communication, in particular to an NPU cluster network structure and a network interconnection method. The invention divides the processors of each node used for training a neural network model into two groups, namely a first group of processors and a second group of processors, and likewise divides the network into two network planes, each network plane being responsible only for the data transmission generated by one group of processors during training of the neural network model, thereby improving the data transmission efficiency of the NPU cluster formed by the NPU processors.

Description

NPU cluster network structure and network interconnection method
Technical Field
The invention relates to the technical field of communication, in particular to an NPU cluster network structure and a network interconnection method.
Background
Artificial intelligence (AI) technology is used to train models, and the higher the model complexity, the higher the demand on AI computing power; as shown in fig. 2, the AI computing power requirement increases year by year. The latest AI models have created a need for E-class (exascale) AI computing power, i.e. the FP16 computation used in training a single model exceeds 1 EOPS.
E-class AI computing power places stringent requirements on both computational efficiency and communication efficiency. Neural-network processing units (NPUs) can meet the computational-efficiency requirement of E-class AI computing power; for example, the Ascend 910 chip integrates a plurality of NPUs to form an NPU cluster that satisfies this requirement. However, an increase in the number of NPUs also means that more AI chips (a plurality of NPUs and CPUs make up an AI chip, also called a node) need to be used in parallel. As shown in fig. 3, as AI parallelism (the number of nodes) increases, network traffic (the amount of data that must be transmitted over the network as communication information) grows nearly linearly, and once the network traffic reaches a certain level, communication efficiency (data interaction efficiency) may become a bottleneck for the computation of the NPU cluster.
In summary, the data interaction efficiency in the prior art limits the computation power of the NPU cluster.
Accordingly, there is a need for improvement and advancement in the art.
Disclosure of Invention
In order to solve the technical problems, the invention provides an NPU cluster network structure and a network interconnection method, which solve the problem that the data interaction efficiency in the prior art limits the NPU cluster computing power.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
in a first aspect, the present invention provides an NPU cluster network architecture, including:
the node is used as a hardware structure for training a neural network model and comprises a first group of processors and a second group of processors, and the NPU of the first group of processors is electrically connected with the NPU of the second group of processors;
a first network plane electrically connected to the first set of processors;
and a second network plane electrically connected to the second set of processors.
In one implementation, the node further includes a plurality of CPUs, and a high-speed serial computer expansion bus is connected between each CPU and each NPU of the first group of processors and each NPU of the second group of processors; and HCCS buses are connected among the plurality of CPUs.
In one implementation, an HCCS bus is connected between the NPUs of the first set of processors, an HCCS bus is connected between the NPUs of the second set of processors, and an HCCS bus is connected between the NPUs of the first set of processors and the NPUs of the second set of processors.
In one implementation, the NPUs of different nodes communicate in a RoCE network protocol through the first network plane or the second network plane.
In one implementation, the number of NPUs included in the first set of processors and the second set of processors is four.
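A minimal sketch of the claimed structure, assuming the grouping described later in the embodiments (all class and field names below are illustrative, not part of the patent): each node holds two groups of four NPUs, and each network plane is bound to exactly one group of every node.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NPU:
    node_id: int
    npu_id: int      # 0..7 within the node
    group: int       # 0 = first group of processors, 1 = second group of processors

@dataclass
class Node:
    node_id: int
    first_group: List[NPU] = field(default_factory=list)
    second_group: List[NPU] = field(default_factory=list)

    @classmethod
    def build(cls, node_id: int) -> "Node":
        # Per the detailed description, NPU0/1/4/5 form the first group and NPU2/3/6/7 the second.
        node = cls(node_id)
        for i in range(8):
            group = 0 if i in (0, 1, 4, 5) else 1
            (node.first_group if group == 0 else node.second_group).append(NPU(node_id, i, group))
        return node

@dataclass
class NetworkPlane:
    plane_id: int                 # 0 = first network plane, 1 = second network plane
    members: List[NPU] = field(default_factory=list)

def build_planes(nodes: List[Node]) -> List[NetworkPlane]:
    """Bind each network plane to exactly one processor group of every node."""
    return [
        NetworkPlane(0, [npu for node in nodes for npu in node.first_group]),
        NetworkPlane(1, [npu for node in nodes for npu in node.second_group]),
    ]

nodes = [Node.build(i) for i in range(8)]     # e.g. one cabinet of 8 nodes
planes = build_planes(nodes)
assert all(len(p.members) == 8 * 4 for p in planes)
```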
In a second aspect, an embodiment of the present invention provides a network interconnection method, including:
selecting one NPU from a plurality of NPUs forming each node, and marking the NPU as a bridging NPU;
controlling a plurality of NPUs inside each node to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs;
and controlling data transmission among the nodes after internal interconnection.
In one implementation, the controlling the NPUs inside the nodes to interconnect with each other through the bridging NPU, where the interconnection is mutual data transmission between the NPUs, includes:
obtaining a first bridging NPU and a second bridging NPU from among the bridging NPUs, wherein the first bridging NPU is located in the first group of processors and the second bridging NPU is located in the second group of processors;
controlling a number of the NPUs within the first set of processors to be interconnected by the first bridging NPU;
controlling a number of the NPUs within the second set of processors to be interconnected by the second bridging NPU;
and controlling the interconnection of the first bridging NPU and the second bridging NPU.
In one implementation, the controlling of the NPUs inside each node to be interconnected through the bridging NPU, where the interconnection is mutual data transmission between the NPUs, further includes:
controlling the interconnection of the first bridging NPU and the CPU in the node;
and controlling the second bridging NPU to be interconnected with the CPU in the node.
In one implementation, the controlling of data transmission among the nodes after internal interconnection includes:
controlling the bridging NPU of each node after interconnection to receive target data;
and controlling the bridging NPU of each node to send the received target data to a target computing node through a RoCE network protocol.
In one implementation, after the bridging NPU of each node is controlled to send the received target data to the target computing node through the RoCE network protocol, the method further includes:
controlling the bridging NPU of the target computing node to receive the target data;
the bridging NPU controlling the target computing node distributes the received target data to individual NPUs within the target computing node.
In a third aspect, an embodiment of the present invention further provides a network interconnection device, where the device includes the following components:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node and marking the NPU as a bridge NPU;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridging NPU, and the interconnection is mutual data transmission among the NPUs;
and the data transmission module is used for controlling data transmission among the nodes after internal interconnection.
In a fourth aspect, an embodiment of the present invention further provides a terminal device, where the terminal device includes a memory, a processor, and a network interconnection program stored in the memory and capable of running on the processor, and when the processor executes the network interconnection program, the steps of the network interconnection method described above are implemented.
In a fifth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a network interconnection program is stored, where the network interconnection program, when executed by a processor, implements the steps of the network interconnection method described above.
The beneficial effects are that: the invention divides the processors of each node used for training the neural network model into two groups, namely the first group of processors and the second group of processors, and likewise divides the network into two network planes, each network plane being responsible only for the data transmission generated by one group of processors during training of the neural network model, thereby improving the data transmission efficiency of the NPU cluster formed by the NPU processors.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a diagram of the year-by-year increase in AI computing power described in the background art;
FIG. 3 is a diagram showing the relationship between AI parallelism and data interaction efficiency in the background art;
FIG. 4 is a diagram of a biplane network architecture in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a split of a Spine and Leaf switch layer into two network planes in an embodiment of the present invention;
FIG. 6 is a schematic diagram of an internal interconnection of nodes in an embodiment of the present invention;
FIG. 7 is a block diagram of a node in an embodiment of the invention;
fig. 8 is a schematic diagram of a networking of 4096 NPUs composed of 4 clusters in an embodiment of the present invention;
FIG. 9 is a diagram of a TPU-v3 Pod interconnect architecture in an embodiment of the present invention;
fig. 10 is a schematic block diagram of an internal structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is clearly and completely described below with reference to the examples and the drawings. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Through research, it has been found that artificial intelligence (AI) technology is used to train models, and the higher the model complexity, the higher the demand on AI computing power; as shown in FIG. 2, the AI computing power requirement increases year by year. The latest AI models have created a need for E-class (exascale) AI computing power, i.e. the FP16 computation used in training a single model exceeds 1 EOPS.
E-class AI computing power places stringent requirements on both computational efficiency and communication efficiency. Neural-network processing units (NPUs) can meet the computational-efficiency requirement of E-class AI computing power; for example, the Ascend 910 chip integrates a plurality of NPUs to form an NPU cluster that satisfies this requirement. However, an increase in the number of NPUs also means that more AI chips (a plurality of NPUs and CPUs make up an AI chip, also called a node) need to be used in parallel. As shown in fig. 3, as AI parallelism (the number of nodes) increases, network traffic (the amount of data that must be transmitted over the network as communication information) grows nearly linearly, and once the network traffic reaches a certain level, communication efficiency (data interaction efficiency) may become a bottleneck for the computation of the NPU cluster.
In order to solve the above technical problems, the invention provides an NPU cluster network structure and a network interconnection method, which solve the problem that the data interaction efficiency in the prior art limits the computing power of NPU clusters. In a specific implementation, each node is first divided into a first group of processors and a second group of processors, and the network plane is likewise divided into a first network plane and a second network plane; the first network plane is responsible only for the data interaction of the first group of processors, and the second network plane is responsible only for the data interaction of the second group of processors. The invention can thereby improve the data interaction efficiency of the processors and, in turn, the computing power of the NPU cluster they form.
For example, as shown in fig. 4, the nodes and switches form two identical network planes (labeled the first network plane and the second network plane). As shown in fig. 5, one cabinet contains 8 nodes. The internal structure of each node is shown in FIG. 6 (besides NPUs, each node also contains CPUs; the CPUs are not shown in FIG. 6, which shows only the NPUs as the core processors): each node has NPU0, NPU1, NPU2, NPU3, NPU4, NPU5, NPU6 and NPU7. NPU0, NPU1, NPU4 and NPU5 form the first group of processors, and NPU2, NPU3, NPU6 and NPU7 form the second group of processors. As shown in fig. 5, the Spine and Leaf switch layers are divided into two network planes, each consisting of its own independent Spine switches SW0, SW1, ..., SW31 and Leaf switches SW0, ..., SW63. Each network plane is responsible only for transmitting the parameters generated by one of the two groups of processors of each node. Taking node #1 in fig. 5 as an example, the solid line is the first network plane; it is responsible only for the parameters generated by training the neural network model on NPU0, NPU1, NPU4 and NPU5, i.e. the first network plane only transmits the parameters produced by model training on NPU0, NPU1, NPU4 and NPU5 of node #1 to other nodes on the same network plane; in other words, it handles only the interaction of the parameter data on NPU0, NPU1, NPU4 and NPU5 with other nodes on the same network plane. Likewise, the second network plane is responsible only for the interaction of the parameter data on NPU2, NPU3, NPU6 and NPU7 with other nodes of the same network plane.
Application scenario of this embodiment: the architecture and model parameters of a neural network model for image recognition, target detection or natural language processing are transmitted to the eight nodes in fig. 5, and the training data sets are likewise loaded into the eight nodes. The NPUs in the eight nodes train the model in parallel using the training sets, and after each round of training is completed, the NPUs in each node transmit the model parameters obtained in that round to the NPUs in other nodes through the network planes, so that the model parameters obtained in each round are exchanged among the nodes.
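A toy sketch of this application scenario (the function names and the simplified averaging step are assumptions; a real system would use a collective-communication library over the RoCE planes): after each training round, every NPU exchanges its parameters only with same-plane NPUs of other nodes.

```python
from typing import Dict, List, Tuple

def plane_of(npu_id: int) -> int:
    """NPU0/1/4/5 use the first network plane, NPU2/3/6/7 the second."""
    return 0 if npu_id in (0, 1, 4, 5) else 1

def exchange_round(params: Dict[Tuple[int, int], List[float]]) -> Dict[Tuple[int, int], List[float]]:
    """Average each NPU's parameters with its same-plane peers on other nodes (simplified all-reduce)."""
    result = {}
    for (node_id, npu_id), own in params.items():
        peers = [p for (n, i), p in params.items()
                 if n != node_id and plane_of(i) == plane_of(npu_id)]
        group = [own] + peers
        result[(node_id, npu_id)] = [sum(vals) / len(group) for vals in zip(*group)]
    return result

# Eight nodes, eight NPUs each, one scalar "parameter" per NPU for illustration.
params = {(n, i): [float(n)] for n in range(8) for i in range(8)}
params = exchange_round(params)
```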
Exemplary Structure
The embodiment provides an NPU cluster network structure, which comprises nodes, a first network plane and a second network plane.
In one embodiment, as shown in FIG. 7, each node includes eight NPUs and four CPUs. The NPU1, NPU2, NPU3 and NPU4 whose ports are -0, -1, -2 and -3 form the first group of processors (NPU1, NPU2, NPU3 and NPU4 form an NPU group), and the NPU1, NPU2, NPU3 and NPU4 whose ports are -4, -5, -6 and -7 form the second group of processors.
In this embodiment, the first network plane and the second network plane form a biplane networking mode, i.e. the network is divided into two symmetrical network planes. The solid-line plane and the dashed-line plane each contain 2048 NPUs, and each independently forms a Spine-Leaf network communication structure (fat-tree structure). The bisection bandwidth of each network plane reaches 2048/2 × 100 Gb/s = 12.8 TB/s. The 8 ports within each compute node are assigned to different network planes (e.g. ports 0, 1, 4 and 5 are assigned to the solid-line network plane, and ports 2, 3, 6 and 7 to the dashed-line network plane). The advantages of this embodiment are: (1) high bandwidth utilization, (2) predictable network delay, (3) good scalability, (4) reduced requirements on the switches, and (5) high security and availability. In addition, networking the solid-line and dashed-line planes separately, compared with merging them into a single network, reduces the number of ports and switches required per node, further optimizing the utilization of hardware resources.
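The bisection-bandwidth figure above can be reproduced with straightforward arithmetic (a sketch; variable names are illustrative):

```python
NPUS_PER_PLANE = 2048    # NPUs in one network plane
LINK_GBPS = 100          # one 100GE port per NPU, in gigabits per second

# Bisection cut: half the plane's NPUs send across the cut, each over one 100 Gb/s link.
bisection_gbps = (NPUS_PER_PLANE // 2) * LINK_GBPS    # 102_400 Gb/s
bisection_tbps = bisection_gbps / 8 / 1000            # bits -> bytes, then GB -> TB
print(bisection_tbps)                                 # 12.8 (TB/s per network plane)
```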
Viewed from outside the compute nodes, the solid-line and dashed-line network planes do not intercommunicate. In order to realize full-machine interconnection, this embodiment further constructs a cross-network-plane NPU interconnection architecture inside each node. As shown in fig. 6, by using the high-bandwidth cache-coherent high-speed interconnection technology HCCS (Huawei Cache Coherent System), high-speed communication can be realized between NPUs of the same compute node that are located in different network planes. In addition, multiple NPUs in the same plane may also realize intra-group interconnection through HCCS or inter-group interconnection through PCI-Express (PCIe), thereby realizing full-machine NPU interconnection. Therefore, by adopting the biplane networking mode and combining cross-plane and co-plane NPU interconnection within the nodes, all NPUs can be interconnected in a collective-communication-optimized manner, achieving the goal of full-machine large-scale AI task training.
In one embodiment, an HCCS bus (Huawei Cache Coherent System) is connected between the NPUs in the first group of processors, an HCCS bus is likewise connected between the NPUs in the second group of processors, and an HCCS bus is connected between the NPUs of the first group of processors and the NPUs of the second group of processors; a high-speed serial computer expansion bus PCIe (PCI-Express) is connected between the NPUs of the first group of processors and the node's CPUs, and a high-speed serial computer expansion bus PCIe is likewise connected between the NPUs of the second group of processors and the node's CPUs. The NPUs of different nodes communicate using the RoCE network protocol (remote direct memory access over Converged Ethernet, RDMA over Converged Ethernet, RoCE).
In this embodiment, the connections between the NPUs within each node (the NPUs of the first group of processors and of the second group of processors constitute the NPUs within the node) and the connections between the CPUs within each node all belong to intra-node homogeneous computing unit interconnection, i.e. interconnection between CPUs or between NPUs within the same compute node. The cache-coherent bus protocol HCCS is used for communication, so that direct memory access can be performed between the interconnected chips, realizing high-speed data interaction. Specifically, each node contains 4 CPUs, which are all interconnected through 3 HCCS links each; each node contains 8 NPUs forming 2 NPU groups (the first group of processors and the second group of processors), and each NPU is fully interconnected with the other NPUs in its group through 3 HCCS links. There is no direct HCCS interconnection between the NPU groups. The single-link unidirectional bandwidth of HCCS is 30 GB/s, so the unidirectional aggregate bandwidth of each CPU/NPU is 90 GB/s.
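An illustrative sketch of the intra-node homogeneous interconnect just described (names assumed): the 4 CPUs are fully interconnected over HCCS, each group of 4 NPUs is fully interconnected within itself, and in this embodiment no HCCS link crosses the two NPU groups. It also reproduces the per-unit aggregate bandwidth.

```python
from itertools import combinations

HCCS_LINK_GBS = 30      # unidirectional bandwidth of one HCCS link, GB/s
LINKS_PER_UNIT = 3      # each CPU/NPU terminates 3 HCCS links (full mesh of 4)

def full_mesh(units):
    """All pairwise HCCS links inside one fully interconnected group of units."""
    return list(combinations(units, 2))

cpu_links = full_mesh([f"CPU{i}" for i in range(4)])
group0_links = full_mesh([f"NPU{i}" for i in (0, 1, 4, 5)])   # first group of processors
group1_links = full_mesh([f"NPU{i}" for i in (2, 3, 6, 7)])   # second group of processors

# No HCCS edge crosses the two NPU groups; cross-group traffic uses PCIe / the bridging NPU.
assert not ({u for e in group0_links for u in e} & {u for e in group1_links for u in e})

print(len(cpu_links), len(group0_links))          # 6 links each (full mesh of 4)
print(LINKS_PER_UNIT * HCCS_LINK_GBS)             # 90 GB/s unidirectional aggregate per CPU/NPU
```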
In this embodiment, the connection between NPU and CPU inside each node belongs to intra-node heterogeneous computing unit interconnection, i.e. communication between heterogeneous computing units inside the same compute node, namely communication between CPU and NPU. It is implemented through the high-speed serial bus PCIe: each NPU is connected to one CPU through one PCIe 4.0 x16 link. Each compute node contains 4 CPUs and 8 NPUs, so each CPU is interconnected point-to-point with two NPUs through the PCIe protocol, which in turn enables full interconnection of the NPUs of the two planes within the same node. Each PCIe 4.0 x16 link provides a theoretical unidirectional data-transmission bandwidth of 32 GB/s. Across the system, PCIe can provide a bisection bandwidth of up to 512 × 4 × 32 GB/s = 64 TB/s, where 512 is the number of interconnection regions: the E-class NPU cluster comprises two planes each consisting of a fat-tree structure, each plane contains 512 interconnection regions, and each interconnection region is defined as a fully interconnected structure of 4 NPUs.
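The PCIe figure works out the same way (a sketch; the grouping into 512 four-NPU regions per plane follows the description above):

```python
REGIONS_PER_PLANE = 512   # fully interconnected regions of 4 NPUs in each network plane
NPUS_PER_REGION = 4
PCIE_GBS = 32             # theoretical unidirectional bandwidth of one PCIe 4.0 x16 link, GB/s

npus_per_plane = REGIONS_PER_PLANE * NPUS_PER_REGION     # 2048
pcie_bandwidth_gbs = REGIONS_PER_PLANE * NPUS_PER_REGION * PCIE_GBS
print(npus_per_plane)                                    # 2048
print(pcie_bandwidth_gbs / 1024)                         # 64.0 TB/s (here 1 TB = 1024 GB)
```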
In this embodiment, the connections between NPUs of different nodes belong to inter-node interconnection: each compute node provides 8 on-board 100GE Ethernet ports interconnected with high-speed switches, and direct access between compute nodes across network levels is realized in combination with RoCE (RDMA over Converged Ethernet) v2 technology. RoCE is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. RoCE v2 additionally supports routing; the inter-node interconnection makes full use of the non-oversubscribed fat-tree network built on RoCE v2, and a single connection can reach a bidirectional communication bandwidth of 24 GB/s. The biplane network is implemented through the RoCE v2 network protocol, with each plane containing 2048 NPUs.
In this embodiment, PCIe and HCCS together provide multiple communication protocols, so that the whole system can achieve performance equivalent to that of a full fat-tree network, laying a foundation for efficient full-machine communication. Besides full-machine-scale training, model training at various scales can also be effectively supported. This multi-protocol bridging networking technique improves the dimensionality, flexibility and performance of communication by effectively combining inter-chip interconnection with inter-node interconnection.
Exemplary method
The network interconnection method of the embodiment can be applied to terminal equipment. In this embodiment, as shown in fig. 1, the network interconnection method specifically includes the following steps:
s100, selecting one NPU from a plurality of NPUs forming each node, and recording the NPU as a bridging NPU.
One NPU in the compute node (denoted NPU0) is selected as a "bridging" temporary buffer that performs the main transit: through the PCIe and HCCS channels inside the node it exchanges data with the CPUs of the node and with the NPUs of the other plane, respectively (i.e. data of the NPUs in the first group of processors can thereby be exchanged with the NPUs in the second group of processors).
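A minimal sketch of step S100 (the helper name is an assumption, and the "lowest index" selection rule is only an illustrative choice consistent with NPU0 acting as the bridging NPU in the description): one bridging NPU is picked per processor group of a node.

```python
def select_bridges(groups):
    """Pick one bridging NPU per processor group of a node.

    groups: mapping of group name -> list of NPU indices in that group.
    Returns group name -> bridging NPU index (lowest index, e.g. NPU0 as in the description).
    """
    return {name: min(npus) for name, npus in groups.items()}

bridges = select_bridges({"first_group": [0, 1, 4, 5], "second_group": [2, 3, 6, 7]})
print(bridges)    # {'first_group': 0, 'second_group': 2}
```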
S200, controlling the NPUs inside the nodes to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs.
In one embodiment, step S200 includes steps S201 to S206 as follows:
s201, according to the bridging NPU, a first bridging NPU and a second bridging NPU in the bridging NPU are obtained, wherein the first bridging NPU is located in the first group of processors, and the second bridging NPU is located in the second group of processors.
S202, controlling a plurality of NPUs in the first group of processors to be interconnected through the first bridging NPU.
S203, controlling a plurality of NPUs in the second group of processors to be interconnected through the second bridging NPU.
One NPU is selected from the NPUs of each group of processors as a bridging NPU, and the other NPUs of that group exchange data with the bridging NPU.
S204, the first bridge NPU and the second bridge NPU are controlled to be interconnected.
The two groups of processors exchange data with each other through their bridging NPUs.
S205, the first bridge NPU is controlled to be interconnected with the CPU in the node.
S206, controlling the interconnection of the second bridge NPU and the CPU in the node.
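Putting steps S201 to S206 together, a sketch of the logical links this sets up inside one node (helper and device names are assumptions; in the actual hardware each NPU attaches to one specific CPU over PCIe):

```python
def intra_node_links(first_group, second_group, bridge1, bridge2, cpu):
    """Logical data paths inside one node after steps S201-S206."""
    links = []
    links += [(npu, bridge1, "HCCS") for npu in first_group if npu != bridge1]    # S202
    links += [(npu, bridge2, "HCCS") for npu in second_group if npu != bridge2]   # S203
    links.append((bridge1, bridge2, "HCCS"))                                      # S204
    links.append((bridge1, cpu, "PCIe"))                                          # S205
    links.append((bridge2, cpu, "PCIe"))                                          # S206
    return links

links = intra_node_links(
    first_group=["NPU0", "NPU1", "NPU4", "NPU5"],
    second_group=["NPU2", "NPU3", "NPU6", "NPU7"],
    bridge1="NPU0", bridge2="NPU2", cpu="CPU0",
)
print(len(links))    # 3 + 3 + 1 + 2 = 9 logical links
```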
And S300, controlling data transmission among the nodes after internal interconnection.
In one embodiment, the specific procedure of step S300 is as follows: controlling the bridging NPU of each node after interconnection to receive target data; and controlling the bridging NPU of each node to send the received target data to a target computing node through a RoCE network protocol.
The "bridging" NPU sends target data (model parameters generated during training of the neural network model) to the memory of the destination computing node over the RoCE network to save the target data.
In another embodiment, the bridging NPU of the target computing node is controlled to receive the target data, and the bridging NPU of the target computing node is controlled to distribute the received target data to the individual NPUs within the target computing node. In this embodiment, data distribution is completed by the "bridging" NPU of the destination computing node through PCIe and HCCS. Communication with NPUs that are "not connected to the network" from the outside is realized through the HCCS links between NPUs of different planes within the node and the PCIe links between CPU and NPU, thereby realizing full-machine communication on the parameter plane.
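A sketch of step S300 end to end (class and method names are assumptions; a real implementation would sit on RoCE v2 verbs or a collective-communication library): the source node's bridging NPU gathers the target data, ships it to the destination node's bridging NPU over the RoCE plane, and the destination bridge fans it out to the local NPUs over HCCS/PCIe.

```python
class BridgeNPU:
    def __init__(self, node_id, local_npus):
        self.node_id = node_id
        self.local_npus = local_npus   # same-node NPUs served by this bridge
        self.buffer = []               # temporary transit buffer on the bridging NPU

    def gather_local(self, data_by_npu):
        """Collect target data from same-node NPUs (stands in for HCCS/PCIe transfers)."""
        self.buffer = [data_by_npu[npu] for npu in self.local_npus]

    def send_over_roce(self, dest_bridge):
        """Ship the buffered target data to the destination node's bridge (stands in for RoCE v2)."""
        dest_bridge.receive(list(self.buffer))

    def receive(self, target_data):
        self.buffer = target_data

    def distribute(self):
        """Fan the received target data out to this node's NPUs (HCCS/PCIe in hardware)."""
        return {npu: self.buffer for npu in self.local_npus}

src = BridgeNPU(node_id=1, local_npus=["NPU1", "NPU4", "NPU5"])
dst = BridgeNPU(node_id=2, local_npus=["NPU1", "NPU4", "NPU5"])
src.gather_local({"NPU1": [0.1], "NPU4": [0.2], "NPU5": [0.3]})
src.send_over_roce(dst)
print(dst.distribute())
```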
The following describes the performance of the network interconnection method of the present invention using 4096 NPU cards as an example:
the theoretical halving bandwidth of the fat tree structure of 2 network surfaces can reach 12.8TB/s, and the halving bandwidth of PCIe can reach 64TB/s. The network key indexes of the invention are shown in table 1:
TABLE 1
The E-class computing power intelligent networking mode provided by the invention has successfully supported the training of ultra-large-scale AI models with trillions of parameters; in addition, the platform also supports practical applications in nearly twenty scenarios such as urban management and intelligent transportation.
In one embodiment, as shown in FIG. 8, the cluster consists of 4 AI sub-clusters, each sub-cluster containing 16 racks, each rack containing 8 nodes, and each node consisting of 8 NPUs. For network interconnection, a classical two-layer Spine-Leaf switching architecture is adopted: the Leaf layer consists of access switches, each cabinet is connected to 2 Leaf switches, which aggregate traffic from the NPU compute nodes and connect directly to the switches of the Spine layer; the Spine switches interconnect all Leaf switches in a two-layer fat-tree topology. Each AI sub-cluster uses a total of 32 Leaf switches and 16 Spine switches. The advantage of this networking mode is that the scale of the network does not expand excessively.
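The 4096-NPU figure and the per-sub-cluster switch counts follow directly from the stated topology (a sketch of the arithmetic; names are illustrative):

```python
SUB_CLUSTERS = 4
RACKS_PER_CLUSTER = 16
NODES_PER_RACK = 8
NPUS_PER_NODE = 8
LEAF_PER_RACK = 2          # each cabinet connects to 2 Leaf switches

total_npus = SUB_CLUSTERS * RACKS_PER_CLUSTER * NODES_PER_RACK * NPUS_PER_NODE
leaf_per_cluster = RACKS_PER_CLUSTER * LEAF_PER_RACK
print(total_npus)          # 4096 NPUs in total
print(leaf_per_cluster)    # 32 Leaf switches per AI sub-cluster (plus 16 Spine switches)
```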
In another embodiment, as shown in fig. 9, *** tensor processing units (TPUs) form a TPU Pod, which typically uses a Mesh or 3D Torus interconnect.
Exemplary apparatus
The embodiment also provides a network interconnection device, which comprises the following components:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node and marking the NPU as a bridge NPU;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridging NPU, and the interconnection is mutual data transmission among the NPUs;
and the data transmission module is used for controlling data transmission among the nodes after internal interconnection.
Based on the above embodiment, the present invention also provides a terminal device, and a functional block diagram thereof may be shown in fig. 10. The terminal equipment comprises a processor, a memory, a network interface, a display screen and a temperature sensor which are connected through a system bus. Wherein the processor of the terminal device is adapted to provide computing and control capabilities. The memory of the terminal device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the terminal device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a network interconnection method. The display screen of the terminal equipment can be a liquid crystal display screen or an electronic ink display screen, and the temperature sensor of the terminal equipment is preset in the terminal equipment and is used for detecting the running temperature of the internal equipment.
It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 10 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal device to which the present inventive arrangements are applied, and that a particular terminal device may include more or less components than those shown, or may combine some of the components, or may have a different arrangement of components.
In one embodiment, there is provided a terminal device including a memory, a processor, and a network interconnection program stored in the memory and executable on the processor, the processor implementing the following operation instructions when executing the network interconnection program:
selecting one NPU from a plurality of NPUs forming each node, and marking the NPU as a bridging NPU;
controlling a plurality of NPUs inside each node to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs;
and controlling data transmission among the nodes after internal interconnection.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. The network interconnection method based on the NPU cluster network structure is characterized by comprising the following steps of:
selecting one NPU from a plurality of NPUs forming each node, marking the NPU as a bridging NPU, setting a neural network model on the NPU, and performing neural network model parallel training on the NPU to generate model parameters;
controlling a plurality of NPUs inside each node to be interconnected through the bridging NPU, wherein the interconnection is mutual data transmission among the NPUs;
controlling data transmission among the nodes after internal interconnection;
the controlling the NPUs inside the nodes to be interconnected through the bridging NPUs, wherein the interconnection is that the NPUs mutually transmit data, and the method comprises the following steps:
obtaining a first bridging NPU and a second bridging NPU from among the bridging NPUs, wherein the first bridging NPU is located in a first group of processors and the second bridging NPU is located in a second group of processors, the NPUs of the first group of processors are electrically connected with the NPUs of the second group of processors, the NPUs in each group of processors are connected with one another in pairs, the first group of processors and the second group of processors form the node, and the node is used as a hardware structure for training a neural network model;
controlling a number of the NPUs within the first set of processors to be interconnected by the first bridging NPU;
controlling a number of the NPUs within the second set of processors to be interconnected by the second bridging NPU;
controlling the first bridging NPU to interconnect with the second bridging NPU;
controlling the interconnection of the first bridging NPU and the CPU in the node;
controlling the interconnection of the second bridging NPU and the CPU in the node;
and data transmission is carried out among the nodes after the internal interconnection control, and the method comprises the following steps:
controlling the bridging NPU of each node after interconnection to receive target data;
the bridging NPU controlling each node sends the received target data to target calculation through the RoCE network protocol
Controlling the bridging NPU of the target computing node to receive the target data;
controlling the bridging NPU of the target computing node to distribute the received target data into the target computing node
Is not limited by the NPU;
further comprises:
a first network plane electrically connected to the first set of processors;
a second network plane electrically connected to the second set of processors, the second network plane and the first network plane not being interworking;
the node also comprises a plurality of CPUs, and a high-speed serial computer expansion bus is respectively connected between each CPU and the NPU of the first group of processors and the NPU of the second group of processors; an HCCS bus is connected among the plurality of CPUs;
an HCCS bus is connected between the plurality of NPUs of the first group of processors, an HCCS bus is connected between the plurality of NPUs of the second group of processors, and an HCCS bus is connected between the NPUs of the first group of processors and the NPUs of the second group of processors; the NPUs of different nodes communicate through the first network plane or the second network plane using a RoCE network protocol, a PCIe communication protocol or an HCCS communication protocol, and the PCIe communication protocol and the HCCS communication protocol together form a full fat-tree network;
the NPU trains the neural network model by using a training set, and after each round of training is completed, the NPU in each node transmits model parameters obtained by training to the NPU in other nodes through a network plane.
2. A network interconnection apparatus, the apparatus comprising:
the bridge NPU screening module is used for selecting one NPU from a plurality of NPUs forming each node, marking the NPU as a bridge NPU, setting a neural network model on the NPU, and carrying out neural network model parallel training on the NPU to generate model parameters;
the interconnection control module is used for controlling a plurality of NPUs in each node to be interconnected through the bridging NPU, and the interconnection is mutual data transmission among the NPUs;
the data transmission module is used for controlling data transmission among the nodes after internal interconnection;
the controlling the NPUs inside the nodes to be interconnected through the bridging NPUs, wherein the interconnection is that the NPUs mutually transmit data, and the method comprises the following steps:
obtaining a first bridging NPU and a second bridging NPU from among the bridging NPUs, wherein the first bridging NPU is located in a first group of processors and the second bridging NPU is located in a second group of processors, the NPUs of the first group of processors are electrically connected with the NPUs of the second group of processors, the NPUs in each group of processors are connected with one another in pairs, the first group of processors and the second group of processors form the node, and the node is used as a hardware structure for training a neural network model;
controlling a number of the NPUs within the first set of processors to be interconnected by the first bridging NPU;
controlling a number of the NPUs within the second set of processors to be interconnected by the second bridging NPU;
controlling the first bridging NPU to interconnect with the second bridging NPU;
controlling the interconnection of the first bridging NPU and the CPU in the node;
controlling the interconnection of the second bridging NPU and the CPU in the node;
and data transmission is carried out among the nodes after the internal interconnection control, and the method comprises the following steps:
controlling the bridging NPU of each node after interconnection to receive target data;
the bridging NPU controlling each node sends the received target data to target calculation through the RoCE network protocol
Controlling the bridging NPU of the target computing node to receive the target data;
controlling the bridging NPU of the target computing node to distribute the received target data into the target computing node
Is not limited by the NPU;
further comprises:
a first network plane electrically connected to the first set of processors;
a second network plane electrically connected to the second set of processors, the second network plane and the first network plane not being interworking;
the node also comprises a plurality of CPUs, and a high-speed serial computer expansion bus is respectively connected between each CPU and the NPU of the first group of processors and the NPU of the second group of processors; an HCCS bus is connected among the plurality of CPUs;
an HCCS bus is connected between the plurality of NPUs of the first group of processors, an HCCS bus is connected between the plurality of NPUs of the second group of processors, and an HCCS bus is connected between the NPUs of the first group of processors and the NPUs of the second group of processors; the NPUs of different nodes communicate through the first network plane or the second network plane using a RoCE network protocol, a PCIe communication protocol or an HCCS communication protocol, and the PCIe communication protocol and the HCCS communication protocol together form a full fat-tree network;
the NPU trains the neural network model by using a training set, and after each round of training is completed, the NPU in each node transmits model parameters obtained by training to the NPU in other nodes through a network plane.
3. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a network interconnection program stored in the memory and operable on the processor, wherein the processor, when executing the network interconnection program, implements the steps of the network interconnection method according to claim 1.
4. A computer-readable storage medium, characterized in that a network interconnection program is stored on the computer-readable storage medium, and the network interconnection program, when executed by a processor, implements the steps of the network interconnection method according to claim 1.
CN202310088059.XA 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method Active CN115809685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310088059.XA CN115809685B (en) 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310088059.XA CN115809685B (en) 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method

Publications (2)

Publication Number Publication Date
CN115809685A CN115809685A (en) 2023-03-17
CN115809685B true CN115809685B (en) 2023-07-25

Family

ID=85487823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310088059.XA Active CN115809685B (en) 2023-02-09 2023-02-09 NPU cluster network structure and network interconnection method

Country Status (1)

Country Link
CN (1) CN115809685B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102904943A (en) * 2012-09-28 2013-01-30 无锡江南计算技术研究所 Cluster computing system hybrid communication method based on embedded processor memory interface
CN114902201A (en) * 2020-01-03 2022-08-12 微软技术许可有限责任公司 Distributed processing architecture

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003067354A (en) * 2001-08-29 2003-03-07 Hitachi Ltd Parallel computer system and interprocessor communication processing method
US7856544B2 (en) * 2008-08-18 2010-12-21 International Business Machines Corporation Stream processing in super node clusters of processors assigned with stream computation graph kernels and coupled by stream traffic optical links
CN108628800A (en) * 2018-05-08 2018-10-09 济南浪潮高新科技投资发展有限公司 A kind of the intelligence computation cluster and its configuration method of dynamic reconfigurable

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102904943A (en) * 2012-09-28 2013-01-30 无锡江南计算技术研究所 Cluster computing system hybrid communication method based on embedded processor memory interface
CN114902201A (en) * 2020-01-03 2022-08-12 微软技术许可有限责任公司 Distributed processing architecture

Also Published As

Publication number Publication date
CN115809685A (en) 2023-03-17

Similar Documents

Publication Publication Date Title
US8769459B2 (en) High-end fault-tolerant computer system and method for same
US8307122B2 (en) Close-coupling shared storage architecture of double-wing expandable multiprocessor
CN113011591A (en) Quantum measurement and control system for multi-bit quantum feedback control
US20130346933A1 (en) Prototype verification system and verification method for high-end fault-tolerant computer
WO2020133317A1 (en) Computing resource allocation technology and neural network system
DE102022104207A1 (en) Pooling of network processing resources
CN101739241A (en) On-chip multi-core DSP cluster and application extension method
CN102685017A (en) On-chip network router based on field programmable gate array (FPGA)
CN113261015A (en) Neural network system and data processing technology
CN104125293B (en) A kind of Cloud Server and its application method
WO2021244168A1 (en) System on chip, data transmission method, and broadcast modules
WO2016078205A1 (en) Directory structure implementation method and system for host system
JP2022543886A (en) Incorporating Rings into Circular Computer Networks
CN117493237B (en) Computing device, server, data processing method, and storage medium
CN114564434B (en) General multi-core brain processor, acceleration card and computer equipment
CN116541338A (en) Computing system, model training method, device and product
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN115809685B (en) NPU cluster network structure and network interconnection method
CN113902111A (en) Multi-chip interconnection system and neural network accelerated processing method
CN117312215B (en) Server system, job execution method, device, equipment and medium
CN104104736A (en) Cloud server and use method thereof
US20230403232A1 (en) Data Transmission System and Method, and Related Device
CN116225177B (en) Memory system, memory resource adjusting method and device, electronic equipment and medium
JP2023508791A (en) Quantum measurement and control system for multi-bit quantum feedback control
US10614026B2 (en) Switch with data and control path systolic array

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Tian Yonghong

Inventor after: Gao Wen

Inventor after: Wang Bingqiang

Inventor after: Lin Zhe

Inventor after: Zhang Gejia

Inventor before: Tian Yonghong

Inventor before: Chen Wenguang

Inventor before: Gao Wen

Inventor before: Wang Bingqiang

Inventor before: Lin Zhe

Inventor before: Zhang Gejia
