WO2020199143A1 - AI training network and method - Google Patents

AI training network and method

Info

Publication number
WO2020199143A1
WO2020199143A1 (PCT/CN2019/081161)
Authority
WO
WIPO (PCT)
Prior art keywords
processing unit
server
oxc
training
channel
Prior art date
Application number
PCT/CN2019/081161
Other languages
English (en)
French (fr)
Inventor
沈胜宇
吴聿旻
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to PCT/CN2019/081161 priority Critical patent/WO2020199143A1/zh
Priority to CN201980004858.6A priority patent/CN112042168B/zh
Priority to EP19923093.9A priority patent/EP3934205B1/en
Priority to PCT/CN2019/113175 priority patent/WO2020199560A1/zh
Publication of WO2020199143A1 publication Critical patent/WO2020199143A1/zh
Priority to US17/485,833 priority patent/US20220012590A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/045 Combinations of networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/067 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using optical means
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04Q SELECTING
    • H04Q11/00 Selecting arrangements for multiplex systems
    • H04Q11/0001 Selecting arrangements for multiplex systems using optical switching
    • H04Q11/0005 Switch and router aspects

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an AI training network and method.
  • In AI training in the field of artificial intelligence, a large number of accelerators (for example GPUs or CPUs, which provide computing power) are used to calculate the optimal structural parameters of a neural network, so that the network can complete specific tasks.
  • So-called "AI training" means "feeding" a machine a large amount of data so that it gradually learns to recognize and distinguish objects.
  • ImageNet1K classification is a common scenario: 1.28 million pictures are given, containing 1000 different object categories, and each picture carries the correct label, that is, the category of the object shown in it.
  • The task of AI training is to find a suitable neural network architecture (such as AlexNet) and an assignment of each of its parameters, so that the network can identify the objects in the pictures as accurately as possible.
  • In a specific implementation, multiple accelerators use a training algorithm to perform calculations separately, merge their learning results, distribute the merged result back to each accelerator, and then enter the next iteration. After many rounds of such iterative calculation, the machine learns more key details and appears more intelligent.
  • Compared with the central processing unit (CPU), the graphics processing unit (GPU) is better suited to this kind of iterative operation, so GPUs are more commonly used in AI training.
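  • The iterate-merge-redistribute pattern above can be sketched as follows. This is a minimal illustrative data-parallel loop, assuming gradient averaging as the merge step (the application does not prescribe a particular merge rule), not code from the application itself:

```python
import numpy as np

# One common choice for the "merge" step is averaging the accelerators'
# gradients before redistributing the result.
def merge(grads_per_accelerator):
    return np.mean(grads_per_accelerator, axis=0)

# Toy linear model: each accelerator holds its own shard of the data.
rng = np.random.default_rng(0)
w = np.zeros(4)
shards = [rng.normal(size=(32, 4)) for _ in range(4)]
targets = [shard @ np.ones(4) for shard in shards]

for step in range(100):
    # Each accelerator computes on its own shard separately...
    grads = [2 * s.T @ (s @ w - t) / len(s) for s, t in zip(shards, targets)]
    # ...the learning results are merged, redistributed to every
    # accelerator, and the next iteration begins.
    w -= 0.1 * merge(grads)

print(np.round(w, 2))   # approaches [1, 1, 1, 1]
```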
  • In a first aspect, an AI training method is provided, applied to an artificial intelligence (AI) training network. The AI training network includes a first server, a second server, and an optical cross-connect (OXC), where the first server includes a first graphics processing unit, the second server includes a second graphics processing unit, and the first server and the second server are each connected to the OXC. The method includes: the first graphics processing unit performs AI training calculation on a first data set according to a first data flow graph; before the first graphics processing unit completes the AI training calculation on the first data set, the OXC is triggered to start channel switching, and after the channel switching is completed, the optical channel between the first graphics processing unit and the second graphics processing unit is successfully established; after the first graphics processing unit completes the calculation, it sends the calculation result to the second graphics processing unit through the established optical channel; and the second graphics processing unit performs AI training calculation on the calculation result using a second data flow graph.
  • In the prior art, the first graphics processing unit starts channel establishment only after completing its own calculation (that is, only once there is data to transmit), and therefore has to wait out the entire channel establishment time.
  • In this embodiment, channel establishment is started before data needs to be transmitted.
  • As soon as the first graphics processing unit located on the first server completes its own calculation, it can send the calculation result to the graphics processing unit of the next server, with no wait, or only a short wait, for channel establishment, thus saving time in AI training.
  • The AI training network further includes a main server.
  • The channel switching performed by the OXC specifically includes: the OXC receives a channel establishment instruction from the main server, the channel establishment instruction carrying adjustment parameters; the OXC switches the optical channel according to the adjustment parameters.
  • This solution provides a specific way of adjusting the OXC.
  • Optionally, the main server periodically sends the channel establishment instruction to the OXC.
  • For example, the main server obtains the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC.
  • This provides a way of periodically instructing the OXC to switch channels based on the regularity of the data sent between the two graphics processing units.
  • The OXC is a micro-electro-mechanical system (MEMS) or a silicon photonics (SiP) device.
  • In a second aspect, an AI training network is provided, which corresponds to the above AI training method and has corresponding beneficial effects.
  • In a third aspect, an optical cross-connect management method is provided.
  • The optical cross-connect (OXC) is connected to a first server in an AI training network and to a second server in the AI training network, where the first server includes a first graphics processing unit and the second server includes a second graphics processing unit. The method includes: obtaining the sending period of a channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC; and, according to the sending period, periodically sending a channel switching instruction to the OXC, instructing the OXC to establish the channel between the first graphics processing unit and the second graphics processing unit.
  • This solution describes how to periodically establish the optical channel in the OXC so that the data the first graphics processing unit needs to send to the second graphics processing unit is forwarded in time. After the first graphics processing unit located on the first server completes its own calculation, it can send the calculation result to the graphics processing unit of the next server immediately, or after only a short wait, thus saving time in AI training.
  • The channel switching performed by the OXC specifically includes: the OXC receives a channel establishment instruction from the main server, the channel establishment instruction carrying adjustment parameters; the OXC adjusts the optical channel according to the adjustment parameters.
  • Optionally, the main server periodically sends the channel establishment instruction to the OXC.
  • In addition, before the first graphics processing unit performs AI training calculation on the first data set according to the first data flow graph, the method may further include: obtaining the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC.
  • The channel switching is completed before the first graphics processing unit completes the calculation.
  • In a fourth aspect, an optical cross-connect management server is provided, for example the aforementioned main server, which can execute the optical cross-connect management method and has corresponding technical effects.
  • Figure 1 provides an architecture diagram of an AI training network embodiment
  • Figure 2 provides a diagram of measured data traffic between graphics processing units
  • Figure 3 provides a flowchart of an AI training embodiment
  • Figure 4 provides a schematic diagram of an embodiment of switching optical channels in a microelectromechanical system.
  • An artificial intelligence (AI) training network consists of a server array made up of multiple servers, which executes AI training by running AI programs.
  • FIG. 1 provides an AI training network architecture.
  • The array includes a server 11, a server 12, a server 13, and a server 14.
  • The array also includes an optical cross-connect 15, an optical cross-connect 16, and an optical cross-connect 17.
  • A server in the embodiments of the present invention may be a dedicated server, a general-purpose server, a workstation, a notebook computer, or another device with computing capability.
  • The servers can communicate through a data exchange network 18, which is, for example, Ethernet or Fibre Channel (FC).
  • Among the servers interconnected through the data exchange network, one server in the figure can serve as the master server and the remaining servers as slave servers.
  • The master server can send commands to the other servers through the data exchange network 18.
  • The master server can also receive AI training instructions and raw data from outside the array through the data exchange network 18.
  • The master server may be elected by the servers through a program or designated by a third party.
  • For convenience, the server 11 is defined as the master server, and the other servers are slave servers. It should be noted that an actual array contains more components and devices, such as the servers' network interface cards, RAM, and input/output devices, and the Ethernet switches and routers in the data exchange network 18; for brevity, these are not shown in Figure 1.
  • A complete AI training run iterates the following steps until the calculation results converge to sufficient accuracy: (1) forward propagation: TF feeds the input data into the neural network from the left side of the graph and, following the operator dependencies, runs each operator in order until the result is obtained at the right end of the graph; (2) loss calculation: the difference between the result of step (1) and the correct answer is taken as the loss; (3) backward propagation: following the chain rule of differentiation, the loss from step (2) is back-propagated level by level to obtain the gradients of all parameters; (4) when the loss value of each iteration flattens out and no longer drops sharply, the training has converged.
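  • For illustration, steps (1)-(3) map onto a training step in TensorFlow 2 style roughly as sketched below; the model, optimizer, and loss are placeholder assumptions, since the application describes the steps only abstractly:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(x, y):
    # (1) Forward propagation: run the operators of the data flow graph.
    with tf.GradientTape() as tape:
        logits = model(x, training=True)
        # (2) Loss: the difference between the result and the correct answer.
        loss = loss_fn(y, logits)
    # (3) Backward propagation: the chain rule yields every parameter's gradient.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# (4) Iterate until the loss flattens out, i.e. the training has converged.
```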
  • A server contains central processing units (CPUs) and graphics processing units (GPUs).
  • Taking server 11 as an example, it contains CPU111, CPU112, GPU113, and GPU114.
  • The CPU111 and the CPU112 may communicate via a bus (for example, a QuickPath Interconnect (QPI) bus or a HyperTransport (HT) bus) or via a node controller (NC).
  • The CPU111 and the GPU113, as well as the CPU112 and the GPU114, can communicate via the Peripheral Component Interconnect Express (PCIe) bus.
  • Besides PCIe, ISA, PCI, AGP, AGI, and AGU are also feasible GPU interface standards.
  • The CPU issues calculation commands to the GPU, and the GPU completes the calculation commands issued by the CPU.
  • Each server can run an operating system (OS), and an AI training program can run on the OS.
  • AI training programs include, for example, TensorFlow (TF), CNTK, Caffe, and MXNet.
  • The AI training software TF requires the user to first provide a neural network structure, called a data flow graph.
  • The data flow graph includes multiple operators.
  • An operator can be a matrix multiplication, an average, a maximum, the sigmoid activation function, and so on.
  • Some operators have a dependency relationship, that is, the output result calculated by one operator is used as the input data of another operator.
  • The array has a large number of GPUs; to improve computing efficiency, the operators must be distributed across multiple GPUs so that multiple GPUs jointly complete the calculation of the data flow graph.
  • Dependencies between the operators assigned to different GPUs create dependencies between the GPUs themselves: the output of the previous GPU becomes the input data of the next GPU. These dependencies require communication between GPUs, so the operators assigned to the two GPUs include not only calculation operators (calculation operators are used for function calculation) but also communication operators (communication operators are used for communication between GPUs).
  • The two GPUs that communicate may belong to the same server, or they may belong to different servers.
  • When they belong to the same server, they can communicate through the bus inside the server.
  • When they belong to different servers, they must rely on a communication channel outside the servers, that is, the optical cross-connects (OXC) in Figure 1, as sketched below.
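  • The following minimal sketch shows one way such a partitioned data flow graph can be represented, with calculation operators and the communication operators that express a cross-server dependency; the class and operator names are illustrative assumptions, not the application's data structures:

```python
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str                    # e.g. "matmul", "send", "recv"
    device: str                  # e.g. "gpu113" (server 11), "gpu123" (server 12)
    inputs: list = field(default_factory=list)

# GPU113 computes; a "send" operator ships the result across servers,
# where GPU123's "recv" operator feeds its own calculation operators.
matmul  = Operator("matmul",  "gpu113")
send    = Operator("send",    "gpu113", inputs=[matmul])
recv    = Operator("recv",    "gpu123", inputs=[send])
sigmoid = Operator("sigmoid", "gpu123", inputs=[recv])
```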
  • For example, after passing through OXC15, OXC16, and OXC17 in turn, the data sent by GPU113 can reach GPU144.
  • The OXC is also connected to the data exchange network 18 via Ethernet or FC, so that it can receive commands from the server CPUs over the Ethernet and adjust the connections between the inputs and outputs of the optical switch according to those commands.
  • Optical cross-connect (OXC) devices include but are not limited to microelectromechanical systems (MEMS) and silicon photonics (SiP).
  • A MEMS is a micrometer-scale mechanical system whose fabrication technology is adapted from semiconductor processing; its operating range is in the micrometer regime.
  • The MEMS optical switch mentioned in this application is manufactured by the MEMS process and is an array of mirrors that can be deflected according to external instructions, used to reflect an incident light beam in a specific direction; the beam can propagate in free space.
  • The disadvantage of MEMS is that channel switching (switching from the original channel to a newly established channel) is very slow, about 10ms, some six orders of magnitude slower than the nanosecond level of electrical switching.
  • Silicon photonics is an optical system that uses silicon as the light-transmission medium. Unlike MEMS, the silicon chip relies on waveguide channels to propagate the light beam and maintain its direction at the same time. Silicon photonics can provide faster channel switching speeds than MEMS.
  • However, for both MEMS and silicon photonics, the time to switch from the original channel to a new channel is never negligible relative to the time consumed by AI training. The time it takes to transfer data between GPU chips includes two parts: the time to switch channels and the time to actually transmit the data.
  • The embodiments of the present invention can switch the channel in advance, before data needs to be transmitted, and use the ready-made channel directly when data does need to be transmitted, thereby reducing the impact of the switching process on the AI training computation time.
  • Figure 2 is a plot of data traffic between two GPUs, captured through a software interface during a real AI training run.
  • The abscissa is time (unit: seconds) and the ordinate is data size (unit: megabytes).
  • Each marked point represents one data transmission. The figure shows that data transmission has obvious temporal periodicity: after roughly every 200ms of idle time, a burst of frequent data transmission occurs, and each such burst ends after roughly 500ms.
  • Transmission sizes are concentrated below 5MB, with many between 5MB and 10MB and a small number between 10MB-20MB and 30MB-40MB; according to the statistics, the total volume of messages transmitted in each cycle is on the order of gigabytes.
  • Exploiting the highly repetitive and predictable character of AI training traffic, the embodiments send the channel switching instruction to the OXC in advance to trigger the channel switch.
  • First channel switching timing: the OXC can be instructed to establish the transmission channel before the previous GPU completes its calculation, with the switching also completed before the calculation finishes, so that the data can be sent directly to the next GPU once switching is complete. This avoids establishing the channel only when there is data to transmit and avoids the high delay caused by the low switching speed of the OXC. From the statistics of Figure 2 it is known that data generation is periodic.
  • Based on the generation times of past data, the main server can predict when subsequent data will be generated (that is, when traffic will occur) and, given the time channel switching takes, calculate the latest moment at which the OXC must be triggered to start channel establishment. As long as the OXC starts channel switching at, or slightly earlier than, this latest moment, the new channel is established before the data to be transmitted is generated.
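  • As a minimal sketch of this rule (using the figures from the worked example later in this document, which are assumptions of that example, not fixed by the method), the latest trigger moment is simply the predicted traffic time minus the switching time:

```python
# Data is predicted to be sent every 2 s at t = 10, 12, 14, ...;
# the OXC's channel switching takes 0.4 s in the example.
switch_time = 0.4                       # seconds the OXC needs to switch
predicted_traffic = [10.0, 12.0, 14.0]  # predicted data-generation times

# Latest moment to trigger channel establishment for each transmission:
latest_triggers = [t - switch_time for t in predicted_traffic]
print(latest_triggers)                  # [9.6, 11.6, 13.6]
```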
  • Second channel switching timing: establishment of the transmission channel is started before the previous GPU completes its calculation (without requiring that the switching complete before the calculation finishes). In this case, the timing is more flexible.
  • The channel switch may complete before the previous GPU finishes its calculation, or only after the data has been generated, so the second channel switching timing covers the first. If the second timing is used to trigger the OXC's channel switching, the previous GPU may already have completed its calculation before the switching is finished.
  • Since data can be transmitted only after the switching completes, the previous GPU then needs to wait for a while before sending data to the next GPU; but compared with the prior art (where switching is triggered only when there is data to transmit), time is still saved because the start of channel switching has been brought forward.
  • Step S11: the CPU111 of the main server 11 runs the AI training program and uses it to load the training data set and the data flow graph.
  • The master server 11 splits the training data set and the data flow graph into several parts and sends them through the data exchange network 18 to the slave server 12, the slave server 13, and the slave server 14, so that each server shares part of the training task.
  • The part of the data flow graph each slave server receives is used to calculate the part of the data set it receives; that is, there is a correspondence between the received part of the data flow graph and the received part of the data set. Combining the training tasks of all servers reconstitutes the training data set and the data flow graph.
  • Besides scheduling, the main server 11 may also take on the calculation of part of the training data set and part of the data flow graph; alternatively, it may take on no calculation task and perform only the scheduling function.
  • The main server 11 has a processor and an interface, the interface being used to communicate with the OXC; if the main server takes on computing tasks, it may further include a graphics processing unit. In this embodiment there are four servers in total; assuming computing tasks are divided evenly among them, each server processes 1/4 of the training data set and the 1/4 of the data flow graph corresponding to that 1/4 of the training data.
  • The part of the training data set a single server is responsible for is called a first-level data subset, and the part of the data flow graph a single server is responsible for is called a first-level data flow subgraph.
  • Step S12: each slave server receives the first-level training data subset and the first-level data flow subgraph sent by the main server 11.
  • The CPU of the slave server splits the first-level training data subset and the first-level data flow subgraph again according to the number of GPUs: a first-level data subset is split into multiple second-level data subsets, and a first-level data flow subgraph into multiple second-level data flow subgraphs. The second-level data subsets and second-level data flow subgraphs are then sent to the corresponding GPUs, and each GPU is instructed to calculate its received second-level data subset according to its received second-level data flow subgraph.
  • Each server starts calculating the first-level data subset it is responsible for according to its own first-level data flow subgraph.
  • The specific calculation is performed by the GPUs.
  • Taking server 12 as an example, after receiving through the data exchange network 18 the 1/4 of the training data set and the 1/4 of the data flow graph it must compute, the CPU of server 12 (CPU121 and/or CPU122) allocates the computing tasks to its GPUs; for example, GPU123 and GPU124 each take on 1/8 of the training data set and 1/8 of the data flow graph.
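  • A minimal sketch of this two-level splitting follows; splitting of the data flow graph itself is elided, and the even division is an assumption taken from the example above:

```python
def split(items, n):
    """Split a sequence into n contiguous, near-equal parts."""
    k, r = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        parts.append(items[start:end])
        start = end
    return parts

dataset = list(range(1024))                        # placeholder training data
first_level = split(dataset, 4)                    # one subset per server (1/4 each)
second_level = [split(s, 2) for s in first_level]  # two GPUs per server (1/8 each)
```

  • Here the main server would produce first_level (step S11) and each slave server's CPU would produce its own row of second_level (step S12).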
  • Step S13: the main server 11 (for example, the CPU111 or the CPU112) sends a channel establishment instruction to the OXC according to a preset time period. As noted above, dependencies may exist between GPUs and cause them to periodically send each other large amounts of data, so the main server 11 sends the instruction to the OXC periodically.
  • In the example of Figure 2, each transmission burst lasts approximately 0.5s (500ms) and the interval after a burst is approximately 0.2s (200ms), so the time period can be taken as roughly 200ms + 500ms = 700ms; it then suffices to establish the corresponding channel once every 700ms.
  • The channel establishment instruction includes adjustment parameters, which instruct the OXC to adjust the optical channel accordingly.
  • In this embodiment, the adjustment parameters include the number of the mirror to be adjusted and the angle to which it should be deflected (see Figure 4).
  • This embodiment assumes that the input of GPU123 (the second graphics processing unit) depends on the output of GPU113 (the first graphics processing unit); the mirrors to be adjusted then belong to the OXC between GPU123 and GPU113, that is, MEMS15. MEMS15 includes a micro-electro-mechanical controller 150 and two arrays of reflecting mirrors; each array includes multiple mirrors whose deflection angles are physically adjustable.
  • After the electrical signal sent by GPU113 is converted into an optical signal, it reaches GPU124 through fiber channel 151, mirror 152, mirror 153, and fiber channel 154. As shown in FIG. 4, before the adjustment the deflection angle of the mirror is 45° and the reflection path of the optical signal is 155-156-158; if GPU113 sends data at this point, the data reaches GPU124. Adjusting the angle of mirror 152 and/or mirror 153 modifies the reflection path, and once the reflection path has been modified, the new channel is successfully established. In this embodiment, mirror 153 is adjusted: once its reflection angle is adjusted to 30°, the channel between GPU113 and GPU123 is established.
  • The adjustment parameters carried in the channel establishment instruction that the main server 11 sends to OXC15 are, for example, {mirror 153, mirror angle 30°}.
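  • A minimal sketch of the instruction's content and its handling follows; the field names and controller interface are illustrative assumptions, since the application does not specify a message format:

```python
from dataclasses import dataclass

@dataclass
class ChannelEstablishInstruction:
    mirror_id: int       # number of the mirror to deflect, e.g. 153
    angle_deg: float     # target deflection angle, e.g. 30.0

class MemsController:
    """Stands in for micro-electro-mechanical controller 150."""
    def __init__(self):
        self.mirror_angles = {152: 45.0, 153: 45.0}

    def apply(self, instr: ChannelEstablishInstruction) -> str:
        # Deflecting the addressed mirror changes the reflection path;
        # once the path changes, the new optical channel is established.
        self.mirror_angles[instr.mirror_id] = instr.angle_deg
        return "channel-established"    # response returned to the main server

controller = MemsController()
print(controller.apply(ChannelEstablishInstruction(mirror_id=153, angle_deg=30.0)))
```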
  • The training data set and data flow graph that GPU113 (the first graphics processing unit) is responsible for can also be called the first training data set and the first data flow graph, respectively; the training data set that GPU123 (the second graphics processing unit) takes on (the calculation result of GPU113) and its data flow graph are called the second training data set and the second data flow graph, respectively.
  • The time at which the main server 11 (such as CPU111 or CPU112) sends the channel establishment instruction to the OXC according to the preset time period may be earlier than the time at which GPU113 sends data to GPU123, so that channel establishment is triggered in advance. Once GPU113 completes its calculation, it can immediately send the signal to GPU123 through this channel. This embodiment may therefore add the restriction that the channel between GPU113 and GPU123 must be established before GPU113 completes the calculation of its allocated training data set. For example, the period at which GPU113 sends data to GPU123 is 2 seconds; specifically, data must be sent to GPU123 at 10 seconds, 12 seconds, 14 seconds, and so on, and establishing the channel takes 0.4 seconds.
  • The main server 11 can then notify the OXC to start establishing the channel between GPU113 and GPU123 at or before 9.6 seconds, 11.6 seconds, 13.6 seconds, and so on. In this example, 0.4 seconds of channel establishment time is saved compared with the prior art.
  • In other embodiments, the restriction that "the channel between GPU113 and GPU123 must be established before GPU113 completes the calculation of its allocated training data set" is not necessary: it suffices that channel establishment is started before GPU113 completes its calculation, without requiring that it also complete by then. For example: GPU113 needs to send data to GPU123 at 10 seconds, 12 seconds, 14 seconds, and so on, and channel establishment takes 0.4 seconds; the main server 11 can then notify the OXC to start establishing the channel between GPU113 and GPU123 at 9.7 seconds, 11.7 seconds, 13.7 seconds, and so on. In this case GPU113 must wait 0.1 seconds after completing its calculation before sending data through the channel, which still saves 0.3 seconds of channel establishment time compared with the prior art.
  • Alternatively, the main server 11 may notify the OXC to start establishing the channel between GPU113 and GPU123 at 9 seconds, 11 seconds, 13 seconds, and so on; in that case GPU113 completes its calculation only 0.2 seconds after channel establishment finishes, which saves 0.4 seconds of channel establishment time compared with the prior art.
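  • The arithmetic of these trigger variants can be checked with a small sketch (data ready at t = 10 s and switching time 0.4 s, as in the examples above; the helper is illustrative, not part of the application):

```python
def outcome(trigger, data_ready, switch_time=0.4):
    """Wait incurred by the GPU, and saving versus the prior art, which
    starts switching only once the data is ready (a full 0.4 s wait)."""
    established = trigger + switch_time
    wait = max(0.0, established - data_ready)
    return wait, switch_time - wait

for trigger in (9.6, 9.7, 9.0):
    wait, saving = outcome(trigger, data_ready=10.0)
    print(f"trigger at {trigger}s: wait {wait:.1f}s, saving {saving:.1f}s")
# trigger at 9.6s: wait 0.0s, saving 0.4s
# trigger at 9.7s: wait 0.1s, saving 0.3s
# trigger at 9.0s: wait 0.0s, saving 0.4s
```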
  • The functions performed by the main server 11 in steps S13-S15, such as sending channel establishment instructions and receiving the responses to them, are not limited to execution by the main server 11; they may instead be performed by another server in the cluster or by a third-party device.
  • There is no dependency between step S12 and step S13.
  • The two can be executed in parallel, or either one can be executed first.
  • Step S14: MEMS15 receives the channel establishment instruction containing the adjustment parameters {mirror 153, mirror angle 30°}; the MEMS controller 150 adjusts the angle of mirror 153 according to the instruction, after which the light is reflected at an angle of 30°.
  • After the adjustment, the reflection path 155-156-157 of the optical signal is established; that is, the channel switching between GPU113 and GPU123 is completed.
  • MEMS15 sends a response message indicating successful channel establishment to the main server 11, informing the main server 11 that the channel between GPU113 and GPU123 has been successfully established.
  • On receiving the response, the CPU of the main server 11 notifies GPU113 that the channel has been successfully established.
  • Step S15: GPU113 receives the notification sent by the main server 11. If it has already completed its calculation, it can immediately send the calculation result to GPU123 through the optical path 155-156-157, and GPU123 uses the received data for subsequent calculation. If the calculation is not yet complete, the result can be sent to GPU123 through the optical path 155-156-157 as soon as it finishes. After GPU123 receives GPU113's calculation result, it performs the next calculation step according to its own data flow subgraph.
  • As the above steps show, once GPU113 completes its calculation there is already a ready-made channel in MEMS15 for it to use, so MEMS15 can be used to send the data to GPU123 immediately; since no time is spent waiting for channel establishment, time is saved.
  • For MEMS, each cross-server GPU communication can save about 10ms.
  • During one AI training run, the server array needs to frequently transmit signals between GPUs of different servers, so applying this embodiment can save a great deal of time.
  • Step S13 mentions a "preset time period".
  • The mirrors in the OXC are deflected in advance on the basis of this time period.
  • How this period can be obtained is described below by way of example. It should be emphasized that there are further ways to obtain the time period; two methods are provided here to deepen the understanding of the embodiments by those skilled in the art.
  • Method 1: set by the administrator. For the same type of AI training the time period is similar from run to run, so the administrator can come to know this period from experience and set its value manually in the software.
  • Method 2: obtained by the server array itself.
  • For a GPU that needs to communicate with other GPUs, the sub-data-flow graph it receives may contain not only calculation operators but also communication operators, which describe the dependencies between GPUs.
  • A GPU that needs to send data has a "send" operator in its sub-data-flow graph, and a GPU that needs to receive data has a "receive" operator in the sub-data-flow graph it receives.
  • When a GPU performs a data transmission using a communication operator, it can record the relevant information about that transmission: the source GPU, the destination GPU, the size of the transmitted data, the time at which it occurred, and so on. From this information (which can be recorded by the source GPU, the destination GPU, or both) the inter-GPU traffic graph shown in FIG. 2 can be obtained, and the regularity of data transmission between GPUs can thus be grasped.
  • This information can be stored in the memory of the server where the GPU resides, or aggregated in a unified storage location, for example the memory of the main server, or on a third-party device outside the server array.
  • After recording continuously for some time, the software can determine the time period and save it to a storage location from which it can be read.
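  • Method 2 can be sketched as follows; the in-memory log and the median-gap estimator are illustrative assumptions (the text specifies what is recorded, not how the period is computed from the records):

```python
import statistics

traffic_log = []   # entries: (source_gpu, dest_gpu, size_bytes, time_s)

def record_transmission(src, dst, size_bytes, time_s):
    """Called whenever a communication operator performs a transmission."""
    traffic_log.append((src, dst, size_bytes, time_s))

def estimate_period(src, dst):
    """Median gap between recorded transmissions of one GPU pair."""
    times = sorted(t for s, d, _, t in traffic_log if (s, d) == (src, dst))
    gaps = [b - a for a, b in zip(times, times[1:])]
    return statistics.median(gaps) if gaps else None

record_transmission("gpu113", "gpu123", 5 << 20, 10.0)
record_transmission("gpu113", "gpu123", 8 << 20, 12.0)
record_transmission("gpu113", "gpu123", 6 << 20, 14.0)
print(estimate_period("gpu113", "gpu123"))   # 2.0 seconds
```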
  • The present invention also provides an embodiment of a program product that runs in a main server. The program product includes program code; by running the program code, the main server manages the OXC, for example: obtaining the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC; and, according to the sending period, periodically sending the channel switching instruction to the OXC to instruct the OXC to establish the channel between the first graphics processing unit and the second graphics processing unit.
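  • A minimal sketch of such program code is given below; send_instruction stands in for however the main server reaches the OXC over the data exchange network, and the scheduling rule (trigger one switching time before each predicted transmission) follows the examples above. All names and parameters are illustrative assumptions:

```python
import time

def manage_oxc(first_send_at, data_period, switch_time, send_instruction, cycles):
    """Issue each channel switching instruction one OXC switching time
    before the predicted transmission, then advance by the data period."""
    next_send = first_send_at
    for _ in range(cycles):
        delay = (next_send - switch_time) - time.time()
        if delay > 0:
            time.sleep(delay)
        send_instruction({"mirror_id": 153, "angle_deg": 30.0})
        next_send += data_period
```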

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

An artificial intelligence training technique, applied to an artificial intelligence (AI) training network. Before graphics processing units located on different servers need to communicate, establishment of the optical channel used for the communication is started in advance, so that once the graphics processing unit of the previous server completes its own calculation, it can immediately send the calculation result to the graphics processing unit of the next server with no wait, or only a short wait, thereby saving time in AI training.

Description

AI training network and method

Technical Field

This application relates to the field of artificial intelligence, and in particular to an AI training network and method.

Background

In AI training in the field of artificial intelligence, a large number of accelerators (accelerators may be, for example, GPUs or CPUs, and provide computing power) are used to calculate the optimal structural parameters of a neural network, so that the network can complete specific tasks. So-called "AI training" means "feeding" a machine a large amount of data so that it gradually learns to recognize and distinguish objects. For example, ImageNet1K classification is a common scenario: 1.28 million pictures are given, containing 1000 different object categories, and each picture carries the correct label, that is, the category of the object in the picture. The task of AI training is then to find a suitable neural network architecture (such as AlexNet) and an assignment of each of its parameters, so that the network can identify the objects in the pictures as accurately as possible.

In a specific implementation, multiple accelerators use a training algorithm to perform calculations separately, merge their respective learning results, distribute the merged result back to each accelerator, and then enter the next iteration. After many rounds of such iterative calculation, the machine learns more key details and thus appears more intelligent. Compared with the central processing unit (CPU), the graphics processing unit (GPU) is better suited to this kind of iterative operation, so GPUs are more widely used in AI training.

As the demands of application scenarios grow, the scale of neural networks and of data sets has increased sharply, and large-scale accelerator server clusters such as the Nvidia DGX-2 and the Google TPU have emerged to provide stronger computing power. As high-computing-power accelerator clusters grow larger, data is transferred between GPU chips more and more frequently, so the speed of data transfer between GPU chips has an increasingly obvious influence on the time consumed by the whole training process. How to reduce the time spent establishing the optical channels over which data is transferred between GPU chips is therefore a problem that urgently needs to be solved.

Summary

In a first aspect, an AI training method is provided, applied to an artificial intelligence (AI) training network. The AI training network includes a first server, a second server, and an optical cross-connect (OXC), where the first server includes a first graphics processing unit, the second server includes a second graphics processing unit, and the first server and the second server are each connected to the OXC. The method includes: the first graphics processing unit performs AI training calculation on a first data set according to a first data flow graph; before the first graphics processing unit completes the AI training calculation on the first data set, the OXC is triggered to start channel switching, and after the channel switching is completed, the optical channel between the first graphics processing unit and the second graphics processing unit is successfully established; after the first graphics processing unit completes the calculation, it sends the calculation result to the second graphics processing unit through the established optical channel; and the second graphics processing unit performs AI training calculation on the calculation result using a second data flow graph.

In the prior art, the first graphics processing unit starts channel establishment only after completing its own calculation (that is, only once there is data to transmit), and therefore has to wait out the entire channel establishment time. In this embodiment, channel establishment is started before there is data to transmit: once the first graphics processing unit located on the first server completes its own calculation, it can immediately send the calculation result to the graphics processing unit of the next server, waiting either not at all or only briefly for channel establishment, thus saving time in AI training.

In a first possible implementation of the first aspect, the AI training network further includes a main server. The channel switching performed by the OXC specifically includes: the OXC receives a channel establishment instruction from the main server, the channel establishment instruction carrying adjustment parameters; the OXC switches the optical channel according to the adjustment parameters.

This solution provides a specific way of adjusting the OXC.

Based on the first possible implementation of the first aspect, optionally, the main server periodically sends the channel establishment instruction to the OXC. For example, the main server obtains the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC.

This provides a way of periodically instructing the OXC to switch channels based on the regularity of the data sent between the two graphics processing units.

In a second possible implementation of the first aspect, the OXC is a micro-electro-mechanical system (MEMS) or a silicon photonics (SiP) device.

In a second aspect, an AI training network is provided, which corresponds to the above AI training method and has corresponding beneficial effects.

In a third aspect, an optical cross-connect management method is provided. An optical cross-connect (OXC) is connected to a first server in an AI training network and to a second server in the AI training network, where the first server includes a first graphics processing unit and the second server includes a second graphics processing unit. The method includes: obtaining the sending period of a channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC; and, according to the sending period, periodically sending a channel switching instruction to the OXC, instructing the OXC to establish the channel between the first graphics processing unit and the second graphics processing unit.

This solution describes how to periodically establish the optical channel in the OXC so that the data the first graphics processing unit needs to send to the second graphics processing unit is forwarded in time. After the first graphics processing unit located on the first server completes its own calculation, it can send the calculation result to the graphics processing unit of the next server immediately, or after only a short wait, saving time in AI training.

In a first possible implementation of the third aspect, the channel switching performed by the OXC specifically includes: the OXC receives a channel establishment instruction from the main server, the channel establishment instruction carrying adjustment parameters; the OXC adjusts the optical channel according to the adjustment parameters.

Optionally, based on the first possible implementation of the third aspect: the main server periodically sends the channel establishment instruction to the OXC. In addition, before the first graphics processing unit performs AI training calculation on the first data set according to the first data flow graph, the method may further include: obtaining the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC.

In a second possible implementation of the third aspect, the channel switching is completed before the first graphics processing unit completes the calculation.

In a third possible implementation of the third aspect, the OXC is one of a micro-electro-mechanical system (MEMS) and a silicon photonics (SiP) device.

In a fourth aspect, an optical cross-connect management server is provided, for example the aforementioned main server, which can execute the optical cross-connect management method and has corresponding technical effects.

Brief Description of the Drawings

Figure 1 provides an architecture diagram of an AI training network embodiment;
Figure 2 provides a diagram of measured data traffic between graphics processing units;
Figure 3 provides a flowchart of an AI training embodiment;
Figure 4 provides a schematic diagram of an embodiment of switching optical channels in a micro-electro-mechanical system.

Detailed Description

An artificial intelligence (AI) training network consists of a server array made up of multiple servers, which executes AI training by running AI programs. Figure 1 provides an AI training network architecture. As shown, the array includes a server 11, a server 12, a server 13, and a server 14; the array also includes an optical cross-connect 15, an optical cross-connect 16, and an optical cross-connect 17. A server in the embodiments of the present invention may be a dedicated server, a general-purpose server, a workstation, a notebook computer, or another device with computing capability. The servers can communicate through a data exchange network 18, which is, for example, Ethernet or Fibre Channel (FC). Among these servers interconnected through the data exchange network, one server in the figure can serve as the master server and the remaining servers as slave servers; the master server can send commands to the other servers through the data exchange network 18. In addition, the master server can receive AI training instructions and raw data from outside the array through the data exchange network 18. The master server may be elected by the servers through a program or designated by a third party; for convenience, the server 11 is defined as the master server and the remaining servers as slave servers. It should be noted that an actual array contains more components and devices, such as the servers' network interface cards, RAM, and input/output devices, and the Ethernet switches and routers in the data exchange network 18; for brevity, these are not shown in Figure 1.

A complete AI training run iterates the following steps until the calculation results converge to sufficient accuracy: (1) forward propagation: TF feeds the input data into the neural network from the left side of the graph and, following the operator dependencies, runs each operator in order until the result is obtained at the right end of the graph; (2) loss calculation: the difference between the result of step (1) and the correct answer is taken as the loss; (3) backward propagation: following the chain rule of differentiation, the loss from step (2) is back-propagated level by level to obtain the gradients of all parameters; (4) when the loss value of each iteration flattens out and no longer drops sharply, the training has converged. From the iterative character of steps (1), (2), and (3), it can be seen that both the computation and the communication patterns of AI training repeat from iteration to iteration. After a few iterations, it is possible to predict fairly precisely when a message of what size will be sent, and from which GPU to which GPU.

A server contains central processing units (CPUs) and graphics processing units (GPUs). Taking server 11 as an example, it contains CPU111, CPU112, GPU113, and GPU114. CPU111 and CPU112 may communicate via a bus (for example, a QuickPath Interconnect (QPI) bus or a HyperTransport (HT) bus) or via a node controller (NC). CPU111 and GPU113, as well as CPU112 and GPU114, can communicate via the Peripheral Component Interconnect Express (PCIe) bus. Besides PCIe, ISA, PCI, AGP, AGI, and AGU are also feasible GPU interface standards. The CPU issues calculation commands to the GPU, and the GPU completes the calculation commands issued by the CPU.

Each server can run an operating system (OS), on which an AI training program can run, for example TensorFlow (TF), CNTK, Caffe, or MXNet. The AI training software TF requires the user to first give a neural network structure called a data flow graph. The data flow graph includes multiple operators; an operator can be a matrix multiplication, an average, a maximum, the sigmoid activation function, and so on. Some operators have dependency relationships, meaning that the output calculated by one operator serves as the input data of another operator. The array has a large number of GPUs, and to improve computing efficiency the operators must be distributed across multiple GPUs so that multiple GPUs jointly complete the calculation of the data flow graph. The dependencies between operators assigned to different GPUs create dependencies between the GPUs themselves: the output of the previous GPU becomes the input data of the next GPU. These dependencies require communication between GPUs, so the operators assigned to the two GPUs include not only calculation operators (used for function calculation) but also communication operators (used for communication between GPUs).

The two communicating GPUs may belong to the same server or to different servers. When the two communicating GPUs belong to the same server, they can communicate through the bus inside the server. When they belong to different servers, they must rely on a communication channel outside the servers, that is, the optical cross-connects (OXC) in Figure 1; for example, the data sent by GPU113 can reach GPU144 after passing through OXC15, OXC16, and OXC17 in turn. The OXC is also connected to the data exchange network 18 via Ethernet or FC, so that it can receive commands from the server CPUs over the Ethernet and adjust the connections between the inputs and outputs of the optical switch according to those commands. Optical cross-connect (OXC) devices include, but are not limited to, micro-electro-mechanical systems (MEMS) and silicon photonics (SiP). A MEMS is a micrometer-scale mechanical system whose fabrication technology is adapted from semiconductor processing, with an operating range in the micrometer regime. The MEMS optical switch mentioned in this application is manufactured by the MEMS process and is an array of mirrors that can be deflected according to external instructions, used to reflect an incident light beam in a specific direction; the beam can propagate in free space. The disadvantage of MEMS is that channel switching (switching from the original channel to a newly established channel) is very slow, about 10ms, some six orders of magnitude slower than the nanosecond level of electrical switching. Silicon photonics is an optical system that uses silicon as the light-transmission medium; unlike MEMS, the silicon chip relies on waveguide channels to propagate the light beam and maintain its direction at the same time. Silicon photonics can provide faster channel switching than MEMS.

However, whether for MEMS or for silicon photonics, the time to switch from the original channel to a new channel is never negligible relative to the time consumed by AI training, so shrinking this switching time, and thereby improving the overall duration of AI training, is a problem that needs to be solved. The time spent transferring data between GPU chips consists of two parts: the time to switch channels and the time to actually transmit the data. The embodiments of the present invention can switch the channel in advance, before data needs to be transmitted; when data does need to be transmitted, the ready-made channel can be used directly, reducing the impact of the switching process on the AI training computation time.

Figure 2 is a plot of data traffic between two GPUs, captured through a software interface during a real AI training run. The abscissa is time (unit: seconds) and the ordinate is data size (unit: megabytes); each marked point represents one data transmission. The figure shows that data transmission has obvious temporal periodicity: after roughly every 200ms of idle time, a burst of frequent data transmission occurs, and each such burst ends after roughly 500ms. Transmission sizes are concentrated below 5MB, with many between 5MB and 10MB and a small number between 10MB-20MB and 30MB-40MB; according to the statistics, the total volume of messages transmitted in each cycle is on the order of gigabytes. In other implementation scenarios there may occasionally be traffic that does not follow the periodic pattern, but in most cases the traffic is still periodic, so the solutions provided by the embodiments of the present invention still yield benefits.

Therefore, the embodiments of the present invention exploit the highly repetitive and predictable character of AI training traffic by sending the channel switching instruction to the OXC in advance, triggering the channel switch early. First channel switching timing: the OXC can be instructed to establish the transmission channel before the previous GPU completes its calculation, with the switching also completed before the calculation finishes, so that once switching is complete the data can be sent directly to the next GPU. This avoids establishing the channel only once there is data to transmit and avoids the high delay caused by the OXC's low switching speed. From the statistics of Figure 2 it is known that data generation is periodic; based on the generation times of past data, the main server can predict when subsequent data will be generated (that is, when traffic will occur) and, given the time channel switching takes, calculate the latest moment at which the OXC must be triggered to begin channel establishment. As long as the OXC starts switching at, or slightly earlier than, this latest moment, the new channel is established before the data to be transmitted is generated.

Second channel switching timing: establishment of the transmission channel is started before the previous GPU completes its calculation (without requiring that the switching be completed before the calculation finishes). In this case the timing is more flexible: the switch may complete before the previous GPU finishes its calculation, or only after the data has been generated; the second channel switching timing therefore covers the first. If the second timing is used to trigger the OXC's channel switching, the previous GPU may finish its calculation before the switching completes. Since data can be transmitted only after the switching completes, the previous GPU then has to wait for a while before sending data to the next GPU; but compared with the prior art (where switching is triggered only when there is data to transmit), time is still saved because the start of channel switching has been brought forward.

The flow of the AI training embodiment of the present invention is described in more detail below with reference to Figure 3.

Step S11: the CPU111 of the main server 11 runs the AI training program and uses it to load the training data set and the data flow graph. The main server 11 splits the training data set and the data flow graph into several parts and sends them through the data exchange network 18 to the slave server 12, the slave server 13, and the slave server 14, so that each server shares part of the training task. The part of the data flow graph each slave server receives is used to calculate the part of the data set it receives; that is, there is a correspondence between the two. Combining the training tasks of all servers reconstitutes the training data set and the data flow graph.

Besides performing scheduling, the main server 11 may also take on the calculation of part of the training data set and part of the data flow graph; alternatively, it may take on no calculation task and perform only the scheduling function. The main server 11 has a processor and an interface, the interface being used to communicate with the OXC; if the main server takes on computing tasks, it may further include a graphics processing unit. In this embodiment there are four servers in total; assuming computing tasks are divided evenly among them, each server processes 1/4 of the training data set and the 1/4 of the data flow graph corresponding to that 1/4 of the training data. For ease of the following description, the part of the training data set a single server is responsible for is called a first-level data subset, and the part of the data flow graph a single server is responsible for is called a first-level data flow subgraph.

Step S12: the slave servers receive the first-level training data subsets and the first-level data flow subgraphs sent by the main server 11. The CPU of a slave server splits its first-level data subset and first-level data flow subgraph again according to the number of GPUs: a first-level data subset is split into multiple second-level data subsets and a first-level data flow subgraph into multiple second-level data flow subgraphs. The second-level data subsets and second-level data flow subgraphs are then sent to the corresponding GPUs, and each GPU is instructed to calculate its received second-level data subset according to its received second-level data flow subgraph.

Each server starts calculating the first-level data subset it is responsible for according to its own first-level data flow subgraph. The specific calculation is performed by the GPUs. Taking server 12 as an example, after receiving through the data exchange network 18 the 1/4 of the training data set and 1/4 of the data flow graph it must compute, the CPU of server 12 (CPU121 and/or CPU122) allocates the computing tasks to its GPUs; for example, GPU123 and GPU124 each take on 1/8 of the training data set and 1/8 of the data flow graph.

Step S13: the main server 11 (for example CPU111 or CPU112) sends a channel establishment instruction to the OXC according to a preset time period. As mentioned above, dependencies may exist between GPUs, and because of these dependencies GPUs periodically send large amounts of data to one another. Taking Figure 2 as an example, the main server 11 periodically sends the instruction to the OXC: in the example of Figure 2 each transmission burst lasts roughly 0.5s (500ms) and the interval after a burst is roughly 0.2s (200ms), so the time period can be roughly 200ms + 500ms = 700ms; it then suffices to establish the corresponding channel once every 700ms.

The channel establishment instruction includes adjustment parameters, which instruct the OXC to adjust the optical channel accordingly. In this embodiment the adjustment parameters include the number of the mirror to be adjusted and the angle to which it should be deflected. Referring to Figure 4, this embodiment assumes that the input of GPU123 (the second graphics processing unit) depends on the output of GPU113 (the first graphics processing unit); the mirrors to be adjusted then belong to the OXC between GPU123 and GPU113, that is, MEMS15. MEMS15 includes a micro-electro-mechanical controller 150 and two arrays of reflecting mirrors; each array includes multiple mirrors whose deflection angles are physically adjustable. After the electrical signal sent by GPU113 is converted into an optical signal, it reaches GPU124 through fiber channel 151, mirror 152, mirror 153, and fiber channel 154. As shown in Figure 4, before the adjustment the deflection angle of the mirror is 45° and the reflection path of the optical signal is 155-156-158; if GPU113 sends data at this point, the data reaches GPU124. Adjusting the angle of mirror 152 and/or mirror 153 modifies the reflection path, and once the reflection path has been modified the new channel is established. In this embodiment mirror 153 is adjusted: once its reflection angle is adjusted to 30°, the channel between GPU113 and GPU123 is successfully established. The adjustment parameters carried in the channel establishment instruction the main server 11 sends to OXC15 are, for example, {mirror 153, mirror angle 30°}.

For convenience, the training data set and data flow graph GPU113 (the first graphics processing unit) is responsible for may also be called the first training data set and the first data flow graph, respectively; the training data set GPU123 (the second graphics processing unit) takes on (the calculation result of GPU113) and its data flow graph are called the second training data set and the second data flow graph, respectively.

It should be noted that in this embodiment, the time at which the main server 11 (for example CPU111 or CPU112) sends the channel establishment instruction to the OXC according to the preset time period may be earlier than the time at which GPU113 sends data to GPU123, so that channel establishment is triggered in advance. Once GPU113 completes its calculation, it can immediately send its signal to GPU123 through this channel. This embodiment may therefore add the restriction that the channel between GPU113 and GPU123 must be established before GPU113 completes the calculation of its allocated training data set. For example, the period at which GPU113 sends data to GPU123 is 2 seconds; specifically, data must be sent to GPU123 at 10 seconds, 12 seconds, 14 seconds, and so on, and establishing the channel takes 0.4 seconds; the main server 11 can then notify the OXC to start establishing the channel between GPU113 and GPU123 at or before 9.6 seconds, 11.6 seconds, 13.6 seconds, and so on. In this example, 0.4 seconds of channel establishment time is saved compared with the prior art.

It should be noted that in other embodiments the restriction that "the channel between GPU113 and GPU123 must be established before GPU113 completes the calculation of its allocated training data set" is not essential. In other embodiments it is not required that channel establishment be completed before GPU113 completes its calculation, only that it be started before GPU113 completes its calculation. For example: GPU113 must send data to GPU123 at 10 seconds, 12 seconds, 14 seconds, and so on, and channel establishment takes 0.4 seconds; the main server 11 can then notify the OXC to start establishing the channel between GPU113 and GPU123 at 9.7 seconds, 11.7 seconds, 13.7 seconds, and so on. In this example GPU113 must wait 0.1 seconds after completing its calculation before it can send data through the channel, still saving 0.3 seconds of channel establishment time compared with the prior art. Alternatively, the main server 11 may notify the OXC to start establishing the channel between GPU113 and GPU123 at 9 seconds, 11 seconds, 13 seconds, and so on; in that example GPU113 completes its calculation only 0.2 seconds after channel establishment finishes, saving 0.4 seconds of channel establishment time compared with the prior art.

It should be particularly noted that the functions performed by the main server 11 in steps S13-S15, such as sending channel establishment instructions and receiving the responses to them, are not limited to execution by the main server 11 and may instead be performed by another server in the cluster or by a third-party device.

It should be noted that there is no dependency between step S12 and step S13; the two may be executed in parallel, or either may be executed first.

Step S14: MEMS15 receives the channel establishment instruction containing the adjustment parameters {mirror 153, mirror angle 30°}; the MEMS controller 150 adjusts the angle of mirror 153 according to the instruction, after which the light is reflected at an angle of 30°.

After the adjustment, the reflection path 155-156-157 of the optical signal is established; that is, the channel switch between GPU113 and GPU123 is complete. The micro-electro-mechanical system 15 sends a response message indicating successful channel establishment to the main server 11, informing the main server 11 that the channel between GPU113 and GPU123 has been successfully established. On receiving the response, the CPU of the main server 11 notifies GPU113 that the channel has been established.

Step S15: GPU113 receives the notification sent by the main server 11. If it has already completed its calculation, it can immediately send the calculation result to GPU123 through the optical path 155-156-157, and GPU123 uses the received data for subsequent calculation; if the calculation is not yet complete, the result can be sent to GPU123 through the optical path 155-156-157 as soon as it finishes. After GPU123 receives GPU113's calculation result, it performs the next calculation step according to its own data flow subgraph.

As the above steps show, once GPU113 completes its calculation there is already a ready-made channel in MEMS15 for it to use, so it can immediately send its data to GPU123 through MEMS15; since no time is spent waiting for channel establishment, time is saved. For MEMS, each cross-server GPU communication saves about 10ms, and during one AI training run the server array must frequently transmit signals between GPUs of different servers, so applying this embodiment saves a great deal of time.

Step S13 mentions a "preset time period"; the mirrors in the OXC are deflected in advance on the basis of this period. How this period can be obtained is described below by way of example. It should be emphasized that there are further ways of obtaining the time period; two are provided here to deepen the understanding of the embodiments by those skilled in the art.

Method 1: set by the administrator. For the same type of AI training the time period is similar from run to run, so the administrator can come to know this period from experience and set its value manually in the software.

Method 2: obtained by the server array itself. For a GPU that needs to communicate with other GPUs, the sub-data-flow graph it receives may contain not only calculation operators but also communication operators, which describe the dependencies between GPUs. A GPU that needs to send data has a "send" operator in its sub-data-flow graph, and a GPU that needs to receive data has a "receive" operator in the sub-data-flow graph it receives. When a GPU performs a data transmission using a communication operator, the relevant information about that transmission can be recorded: the source GPU, the destination GPU, the size of the transmitted data, the time at which it occurred, and so on. From this information (which may be recorded by the source GPU, the destination GPU, or both) the inter-GPU traffic graph shown in Figure 2 can be obtained, and the regularity of data transmission between GPUs can thus be grasped. The information can be stored in the memory of the server where the GPU resides, or aggregated in a unified storage location, for example the memory of the main server, or on a third-party device outside the server array. After recording continuously for some time, the software can determine the time period and save it to a storage location from which it can be read.

The present invention also provides an embodiment of a program product that runs in a main server. The program product includes program code; by running the program code, the main server can manage the OXC, for example: obtaining the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC; and, according to the sending period, periodically sending the channel switching instruction to the OXC, instructing the OXC to establish the channel between the first graphics processing unit and the second graphics processing unit.

Claims (14)

  1. An AI training method, applied to an artificial intelligence (AI) training network, the AI training network comprising a first server, a second server, and an optical cross-connect (OXC), wherein the first server comprises a first graphics processing unit, the second server comprises a second graphics processing unit, and the first server and the second server are each connected to the OXC, the method comprising:
    the first graphics processing unit performing AI training calculation on a first data set according to a first data flow graph; before the first graphics processing unit completes the AI training calculation on the first data set, triggering the OXC to start channel switching, wherein after the channel switching is completed, the optical channel between the first graphics processing unit and the second graphics processing unit is successfully established;
    after the first graphics processing unit completes the calculation, sending the calculation result to the second graphics processing unit through the established optical channel;
    the second graphics processing unit performing AI training calculation on the calculation result using a second data flow graph.
  2. The AI training method according to claim 1, wherein the AI training network further comprises a main server, and the channel switching performed by the OXC specifically comprises:
    the OXC receiving a channel establishment instruction from the main server, the channel establishment instruction carrying adjustment parameters;
    the OXC switching the optical channel according to the adjustment parameters.
  3. The AI training method according to claim 2, wherein the method further comprises:
    the main server periodically sending the channel establishment instruction to the OXC.
  4. The AI training method according to claim 3, wherein, before the first graphics processing unit performs AI training calculation on the first data set according to the first data flow graph, the method further comprises:
    obtaining the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC.
  5. The AI training method according to any one of claims 1-4, wherein the channel switching is completed:
    before the first graphics processing unit completes the calculation.
  6. The AI training method according to any one of claims 1-4, wherein the OXC is:
    one of a micro-electro-mechanical system (MEMS) and a silicon photonics (SiP) device.
  7. An AI training network, comprising a first server, a second server, and an optical cross-connect (OXC), wherein the first server comprises a first graphics processing unit, the second server comprises a second graphics processing unit, and the first server and the second server are each connected to the OXC, wherein:
    the first graphics processing unit is configured to: perform AI training calculation on a first data set according to a first data flow graph, and send the calculation result to the second graphics processing unit through the established optical channel;
    the OXC is configured to: start channel switching before the first graphics processing unit completes the AI training calculation on the first data set, wherein after the channel switching is completed, the optical channel between the first graphics processing unit and the second graphics processing unit is successfully established;
    the second graphics processing unit is configured to: perform AI training calculation on the calculation result using a second data flow graph.
  8. The AI training network according to claim 7, wherein the AI training network further comprises a main server, the main server being configured to:
    send a channel establishment instruction to the OXC, the channel establishment instruction carrying adjustment parameters;
    the OXC switching the optical channel according to the adjustment parameters.
  9. The AI training network according to claim 8, wherein the main server is further configured to:
    periodically send the channel establishment instruction to the OXC.
  10. The AI training network according to claim 9, wherein the main server is further configured to:
    obtain the sending period of the channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC.
  11. The AI training network according to any one of claims 7-10, wherein the OXC is further configured to:
    complete the channel switching before the first graphics processing unit completes the AI training calculation on the first data set.
  12. The AI training network according to any one of claims 7-10, wherein the OXC is:
    one of a micro-electro-mechanical system (MEMS) and a silicon photonics (SiP) device.
  13. An optical cross-connect management method, wherein an optical cross-connect (OXC) is connected to a first server in an AI training network and to a second server in the AI training network, the first server comprising a first graphics processing unit and the second server comprising a second graphics processing unit, the method comprising:
    obtaining the sending period of a channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC;
    according to the sending period, periodically sending a channel switching instruction to the OXC, instructing the OXC to establish the channel between the first graphics processing unit and the second graphics processing unit before the first graphics processing unit completes the AI training calculation on the first data set.
  14. An optical cross-connect management server, communicating with an optical cross-connect (OXC), the OXC communicating with a first server in an AI training network and a second server in the AI training network, wherein the first server comprises a first graphics processing unit and the second server comprises a second graphics processing unit, the optical cross-connect management server comprising a processor configured to:
    obtain the sending period of a channel establishment instruction according to the time period at which the first graphics processing unit sends data to the second graphics processing unit and the channel switching time of the OXC;
    according to the sending period, periodically send a channel switching instruction to the OXC, instructing the OXC to establish the channel between the first graphics processing unit and the second graphics processing unit before the first graphics processing unit completes the AI training calculation on the first data set.
PCT/CN2019/081161 2019-04-03 2019-04-03 AI training network and method WO2020199143A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
PCT/CN2019/081161 WO2020199143A1 (zh) 2019-04-03 2019-04-03 AI training network and method
CN201980004858.6A CN112042168B (zh) 2019-04-03 2019-10-25 AI training network and method
EP19923093.9A EP3934205B1 (en) 2019-04-03 2019-10-25 Ai training network and method
PCT/CN2019/113175 WO2020199560A1 (zh) 2019-04-03 2019-10-25 AI training network and method
US17/485,833 US20220012590A1 (en) 2019-04-03 2021-09-27 Ai training network and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/081161 WO2020199143A1 (zh) 2019-04-03 2019-04-03 AI training network and method

Publications (1)

Publication Number Publication Date
WO2020199143A1 true WO2020199143A1 (zh) 2020-10-08

Family

ID=72664600

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2019/081161 WO2020199143A1 (zh) 2019-04-03 2019-04-03 AI training network and method
PCT/CN2019/113175 WO2020199560A1 (zh) 2019-04-03 2019-10-25 AI training network and method

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/113175 WO2020199560A1 (zh) 2019-04-03 2019-10-25 AI training network and method

Country Status (4)

Country Link
US (1) US20220012590A1 (zh)
EP (1) EP3934205B1 (zh)
CN (1) CN112042168B (zh)
WO (2) WO2020199143A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023071193A1 * 2021-10-28 2023-05-04 华为技术有限公司 Model training system and method

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116741019A * 2023-08-11 2023-09-12 成都飞航智云科技有限公司 AI-based flight model training method and training system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030026524A1 (en) * 2001-08-06 2003-02-06 Sunao Kakizaki Optical switching apparatus with optical reflection monitor and reflection monitoring system
CN1984006A * 2006-05-30 2007-06-20 北京邮电大学 Optical parallel transmission method oriented to grid applications
CN102546749A * 2010-12-08 2012-07-04 中国电信股份有限公司 Method for accessing a mobile IP network, and IP bearer network
CN106941633A * 2017-02-20 2017-07-11 武汉邮电科学研究院 SDN-based all-optical switching data center network control system and implementation method therefor
CN107885762A * 2017-09-19 2018-04-06 北京百度网讯科技有限公司 Intelligent big data system, and method and device for providing intelligent big data services
CN108353217A * 2015-09-10 2018-07-31 环球互连及数据中心公司 Automated fiber cross-connect service within a multi-tenant interconnection facility

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3925272B2 * 2002-03-29 2007-06-06 Kddi株式会社 Data transmission system and node
JP4284199B2 * 2004-01-26 2009-06-24 株式会社日立コミュニケーションテクノロジー Optical cross-connect device and network management device
US8036531B2 (en) * 2006-12-14 2011-10-11 Verizon Patent And Licensing Inc. Hybrid switch for optical networks
CN101888276A * 2010-04-09 2010-11-17 西安电子科技大学 Quantum router and routing method for a multi-user optical quantum communication network
US10223997B2 (en) * 2011-12-07 2019-03-05 Ubitus Inc. System and method of leveraging GPU resources to increase performance of an interact-able content browsing service
CN102740177B * 2012-07-17 2014-08-27 上海汇珏网络通信设备有限公司 Non-blocking expandable multistage optical switch array and working method thereof
CN105871498B * 2015-01-21 2019-03-29 中兴通讯股份有限公司 Optical cross-connect scheduling apparatus and method, and hybrid optical-electrical cross-connect system
CN106664236B * 2015-06-10 2019-11-12 华为技术有限公司 Signal transmission method, controller, and signal transmission system
CN105117170A * 2015-08-24 2015-12-02 浪潮(北京)电子信息产业有限公司 Computer system architecture
US11138494B2 (en) * 2017-05-02 2021-10-05 International Business Machines Corporation Storage controller acceleration for neural network training and inference
CN107632953A * 2017-09-14 2018-01-26 郑州云海信息技术有限公司 GPU box PCIe expansion interconnection topology device
CN208013975U * 2018-04-23 2018-10-26 苏州超集信息科技有限公司 Hardware device for an online intelligent capability platform


Also Published As

Publication number Publication date
CN112042168B (zh) 2022-03-04
CN112042168A (zh) 2020-12-04
EP3934205B1 (en) 2024-02-14
WO2020199560A1 (zh) 2020-10-08
EP3934205A1 (en) 2022-01-05
US20220012590A1 (en) 2022-01-13
EP3934205A4 (en) 2022-04-27

Similar Documents

Publication Publication Date Title
US11861203B2 (en) Method, apparatus and electronic device for cloud service migration
Khani et al. SiP-ML: high-bandwidth optical network interconnects for machine learning training
WO2020199143A1 (zh) 2020-10-08 AI training network and method
Wang et al. TopoOpt: Co-optimizing Network Topology and Parallelization Strategy for Distributed Training Jobs
CN104301391B (zh) Multi-domain optical network data center resource virtualization mapping method
JP2022545103A (ja) Distributed storage system and data processing method
CN110113271B (zh) MPI application acceleration system and method based on a hybrid optical-electrical switching network
WO2017179537A1 (ja) Software update control device, software update control system, software update control method, and recording medium storing a software update control program
CN111435315A (zh) Method, apparatus, device, and computer-readable medium for allocating resources
Wang et al. Integrating coflow and circuit scheduling for optical networks
WO2017050036A1 (zh) Method and device for sending resource configuration information and distributing data
Li et al. Co-Scheduler: A coflow-aware data-parallel job scheduler in hybrid electrical/optical datacenter networks
Watanabe et al. Contmec: An architecture of multi-access edge computing for offloading container-based mobile applications
CN106933654B (zh) Cache-based virtual machine startup method
WO2020159269A1 (en) Processing computational models in parallel
CN106357800B (zh) QoE-based cloud computing service architecture
Qin et al. Interference and topology-aware VM live migrations in software-defined networks
WO2022111466A1 (zh) Task scheduling method, control method, electronic device, and computer-readable medium
Jin et al. runData: Re-Distributing Data via Piggybacking for Geo-Distributed Data Analytics Over Edges
CN108388470B (zh) Big data task processing method and computer device
WO2022029926A1 (ja) Computer system and arithmetic processing method
EP3918477A1 (en) Processing computational models in parallel
US20220086103A1 (en) Network bandwidth adjustment method and related product
CN110764922A (zh) Data processing method, single board, and computer storage medium
Zhan et al. Dynamic bandwidth allocation for switching FCAE-1553 network in avionics system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19922712

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19922712

Country of ref document: EP

Kind code of ref document: A1