WO2021238734A1 - Method for training neural network, and related device - Google Patents

Method for training neural network, and related device Download PDF

Info

Publication number
WO2021238734A1
Authority
WO
WIPO (PCT)
Prior art keywords
micro
batch data
accelerator
data
training
Prior art date
Application number
PCT/CN2021/094579
Other languages
French (fr)
Chinese (zh)
Inventor
陈仙萍
秦勇
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2021238734A1 publication Critical patent/WO2021238734A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a neural network training method and related equipment.
  • Deep neural network (DNN) training usually uses accelerators for calculation, and the calculation process generally includes forward calculation and reverse calculation.
  • Fig. 1 is a schematic diagram of DNN performing forward calculation
  • Fig. 2 is a schematic diagram of DNN performing reverse calculation.
  • The forward calculation is performed layer by layer in the order first layer → second layer → third layer → fourth layer.
  • The feature values obtained by the forward calculation of each layer are stored in the accelerator.
  • The reverse calculation is performed layer by layer in the order fourth layer → third layer → second layer → first layer.
  • The reverse calculation of each layer needs to use the feature values saved during the forward calculation of the corresponding layer for the training data. Therefore, each time the reverse calculation of a layer is completed, the accelerator storage occupied by the feature values of the corresponding layer is released. Only when the reverse calculation of all the training data has been completed are all the feature values saved by the accelerator completely released.
  • In the above calculation process, the forward calculation of all the training data needs to be completed before the reverse calculation begins.
  • As a result, the accelerator needs to save all the feature values obtained from the forward calculation of the training data, which causes the storage occupancy of the accelerator to stay at a large value for a long time.
  • Consequently, the training efficiency of the network is low.
  • the embodiment of the present application provides a neural network training method and related equipment, which can keep the peak storage occupancy of the accelerator at a low value and improve the training efficiency of the neural network.
  • the first aspect of the embodiments of the present application provides a neural network training method.
  • the training method is applied to N accelerators, each accelerator is loaded with the same neural network, and the N accelerators train the neural network in a data-parallel manner.
  • The training method includes: each accelerator first obtains M micro-batch data from the processor, and the N×M micro-batch data together form the training data, where each micro-batch data usually contains at least one sample data to be trained.
  • In the process of each accelerator training the neural network based on the M micro-batch data, after each accelerator performs the forward calculation on the i-th micro-batch data, it directly performs the reverse calculation on the forward calculation result of the i-th micro-batch data, until the reverse calculation of the M micro-batch data is completed to obtain the result of the reverse calculation.
  • each accelerator updates the parameters of the neural network according to the results of the reverse calculation to complete the training of the neural network.
  • N ≥ 2, M ≥ 2, i = 1, 2, ..., M.
  • After each accelerator completes the forward calculation of the i-th micro-batch data, it immediately performs the reverse calculation on the forward calculation result of the i-th micro-batch data.
  • Once each accelerator starts the reverse calculation, it can start to release the feature values generated by the forward calculation of the i-th micro-batch data, until the reverse calculation of the i-th micro-batch data is completed (at this time, the feature values generated by the i-th micro-batch data in the forward calculation are completely released). Therefore, the peak storage occupancy of each accelerator occurs when the reverse calculation of the i-th micro-batch data starts, and at this time each accelerator only needs to save the feature values generated by the i-th micro-batch data in the forward calculation. During the entire calculation process, the peak storage occupancy of each accelerator appears periodically (that is, the peak appears at the beginning of the reverse calculation of each micro-batch data) and can be kept at a low value, which can improve the training efficiency of the neural network.
  • The result of the reverse calculation includes the gradient accumulation value corresponding to each accelerator, where the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs the reverse calculation on the M micro-batch data: M gradients are obtained, and the M gradients are accumulated to obtain the gradient accumulation value.
  • Each accelerator updating the parameters of the neural network according to the result of the reverse calculation includes: each accelerator first performs an averaging calculation on the gradient accumulation values corresponding to the N accelerators to obtain the target gradient accumulation value, and then updates the parameters of the neural network according to the target gradient accumulation value. Specifically, each accelerator performs the averaging calculation based on its own gradient accumulation value and the gradient accumulation values corresponding to the other accelerators to obtain the target gradient accumulation value, and updates the parameters of the neural network based on the target gradient accumulation value to complete the training of the neural network.
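  • The first-aspect flow can be illustrated with a short sketch. The following is a minimal, hedged example in PyTorch-style Python, assuming a data-parallel process group has already been initialized and that `model`, `loss_fn`, `optimizer`, and `micro_batches` are placeholder names not taken from the patent; it is a sketch of the idea, not the patented implementation.

```python
import torch.distributed as dist

def train_step(model, loss_fn, optimizer, micro_batches):
    """One data-parallel training step on one accelerator, given its M micro-batches."""
    optimizer.zero_grad()
    for inputs, labels in micro_batches:      # i = 1, 2, ..., M
        outputs = model(inputs)               # forward calculation of the i-th micro-batch
        loss = loss_fn(outputs, labels)
        loss.backward()                       # immediate reverse calculation: the feature
                                              # values saved for this micro-batch are released,
                                              # and the gradient accumulates into param.grad
    # Average the gradient accumulation values over the N accelerators
    # to obtain the target gradient accumulation value.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
    optimizer.step()                          # update the neural network parameters
```

  • Because the backward pass is issued immediately after each micro-batch's forward pass in this sketch, the activations saved for that micro-batch are freed before the next micro-batch begins, which is what keeps the peak storage occupancy low.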
  • the training method is also applied to the processor, and before each accelerator obtains M micro-batch data, the training method further includes: the processor obtains the training data.
  • The processor determines the size of the micro-batch data according to the target storage capacity threshold and the size of the training data. If the N accelerators are the same, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators. If there are at least P accelerators that are different, the target storage capacity threshold is the smallest value among the storage capacity thresholds of the at least P accelerators, N ≥ P ≥ 2.
  • The processor divides the training data into N×M micro-batch data according to the size of the micro-batch data.
  • the processor may determine the optimal micro-batch data size based on the target storage capacity threshold and the size of the training data, thereby dividing the training data into N ⁇ M micro-batch data. Since the size of the micro-batch data is optimal, the storage occupancy of the feature values generated by the micro-batch data after the forward calculation can be reduced, the storage resources of the accelerator can be saved, and the training efficiency of the neural network can be improved.
  • The storage occupancy corresponding to each micro-batch data is less than or equal to the target storage capacity threshold, and the size of each micro-batch data is less than or equal to the size of the training data. The size of the micro-batch data is determined according to the foregoing two conditions.
  • the cluster linearity corresponding to each micro-batch data is the largest, and the optimal micro-batch data size can be determined through the foregoing conditions.
  • If the ratio of the size of the training data to the size of the micro-batch data is not an integer, M is the value obtained by rounding that ratio up; if the ratio is an integer, M is the ratio itself.
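  • As a rough illustration of the sizing rule above, the following Python sketch assumes two helper functions that the patent does not define in this excerpt, `storage_occupancy(b)` (feature-value storage for a micro-batch of size b) and `cluster_linearity(b)`; both are hypothetical stand-ins.

```python
import math

def choose_micro_batch_size(training_size, target_capacity_threshold,
                            storage_occupancy, cluster_linearity):
    # Candidate sizes: fit within the target storage capacity threshold
    # and are no larger than the training data itself.
    candidates = [b for b in range(1, training_size + 1)
                  if storage_occupancy(b) <= target_capacity_threshold]
    # Among the feasible sizes, keep the one with the largest cluster linearity.
    best = max(candidates, key=cluster_linearity)
    # M is the ratio rounded up (it equals the exact ratio when it divides evenly).
    m = math.ceil(training_size / best)
    return best, m
```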
  • The second aspect of the embodiments of the present application provides a neural network training method applied to N accelerators, where each accelerator loads a part of the layers of the neural network, the N accelerators together load the complete neural network, and the N accelerators train the neural network in a pipeline-parallel manner.
  • the training method includes: the first accelerator among the N accelerators first obtains M micro-batch data from the processor, and the M micro-batch data forms the training data.
  • In the process of the N accelerators training the neural network based on the M micro-batch data, after the N accelerators jointly complete the forward calculation of the i-th micro-batch data, they directly perform the reverse calculation on the forward calculation result of the i-th micro-batch data, until the reverse calculation of the M micro-batch data is completed to obtain the result of the reverse calculation.
  • the N accelerators update the parameters of the neural network according to the results of the reverse calculation.
  • N ≥ 2, M ≥ 2, i = 1, 2, ..., M.
  • The result of the reverse calculation includes the gradient accumulation value corresponding to each accelerator, where the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs the reverse calculation on the M micro-batch data: M gradients are obtained, and the M gradients are accumulated to obtain the gradient accumulation value.
  • The N accelerators updating the parameters of the neural network according to the result of the reverse calculation includes: each accelerator updates the parameters of the partial layers of the neural network loaded on it according to its corresponding gradient accumulation value. Specifically, each accelerator updates the parameters of the part of the neural network it loads based on its corresponding gradient accumulation value, so as to complete the training of the neural network.
  • the training method is also applied to the processor, and before the N accelerators obtain M micro-batch data, the training method further includes: the processor first obtains the training data. Then, the processor determines the size of the micro-batch data according to the storage capacity threshold of each accelerator and the size of the training data. Finally, the processor divides the training data into M micro-batch data according to the size of the micro-batch data. Specifically, the processor may determine the optimal micro-batch data size based on the storage capacity threshold of each accelerator and the size of the training data, thereby dividing the training data into M micro-batch data. Since the size of the micro-batch data is optimal, the storage occupancy of the feature values generated by the micro-batch data after the forward calculation can be reduced, the storage resources of the accelerator can be saved, and the training efficiency of the neural network can be improved.
  • The peak storage occupancy of each accelerator is less than or equal to the storage capacity threshold of that accelerator, where the peak storage occupancy of an accelerator is the storage occupied, before the accelerator starts the reverse calculation of the first micro-batch data, by the several micro-batch data on which the accelerator has already completed the forward calculation; and the size of each micro-batch data is less than or equal to the size of the training data.
  • According to the foregoing conditions, the size of the micro-batch data can be determined.
  • the cluster linearity corresponding to each micro-batch data is the largest, and the optimal micro-batch data size can be determined through the foregoing conditions.
  • If the ratio of the size of the training data to the size of the micro-batch data is not an integer, M is the value obtained by rounding that ratio up; if the ratio is an integer, M is the ratio itself.
  • the third aspect of the embodiments of the present application provides a neural network training device, the training device includes: N accelerators, each accelerator loads the same neural network, and the N accelerators train the neural network in a data-parallel manner .
  • Each accelerator is used to obtain M micro-batch data, and the N×M micro-batch data constitute the training data.
  • Each accelerator is also used to perform the forward calculation on the i-th micro-batch data and directly perform the reverse calculation on the forward calculation result of the i-th micro-batch data, until the reverse calculation of the M micro-batch data is completed to obtain the result of the reverse calculation.
  • The result of the reverse calculation includes the gradient accumulation value corresponding to each accelerator, where the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs the reverse calculation on the M micro-batch data.
  • each accelerator is also used for averaging calculation according to the gradient accumulation value corresponding to the N accelerators to obtain the target gradient accumulation value.
  • Each accelerator is also used to update the parameters of the neural network according to the accumulated value of the target gradient.
  • the training device further includes a processor, and the processor is used to obtain training data.
  • The processor is also used to determine the size of the micro-batch data according to the target storage capacity threshold and the size of the training data. If the N accelerators are the same, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators. If there are at least P different accelerators among the N accelerators, the target storage capacity threshold is the smallest value among the storage capacity thresholds of the at least P accelerators, N ≥ P ≥ 2.
  • The processor is also used to divide the training data into N×M micro-batch data according to the size of the micro-batch data.
  • the storage occupancy corresponding to each micro-batch data is less than or equal to the target storage capacity threshold, and the size of each micro-batch data is less than or equal to the size of the training data.
  • the cluster linearity corresponding to each micro-batch data is the largest.
  • If the ratio of the size of the training data to the size of the micro-batch data is not an integer, M is the value obtained by rounding that ratio up; if the ratio is an integer, M is the ratio itself.
  • The fourth aspect of the embodiments of the present application provides a neural network training device. The training device includes N accelerators, each accelerator loads a part of the layers of the neural network, the N accelerators together load the complete neural network, and the N accelerators train the neural network in a pipeline-parallel manner.
  • the first accelerator among the N accelerators is used to obtain M micro-batch data, and the M micro-batch data constitute training data.
  • The N accelerators are used to directly perform the reverse calculation on the forward calculation result of the i-th micro-batch data after they jointly complete the forward calculation of the i-th micro-batch data, until the reverse calculation of the M micro-batch data is completed to obtain the result of the reverse calculation.
  • The result of the reverse calculation includes the gradient accumulation value corresponding to each accelerator, where the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs the reverse calculation on the M micro-batch data.
  • each accelerator is used to update the parameters of the partial layer of the neural network it loads according to its corresponding gradient accumulation value.
  • the training device further includes a processor, and the processor is used to obtain training data.
  • the processor is also used to determine the size of the micro-batch data according to the storage capacity threshold of each accelerator and the size of the training data.
  • the processor is also used to divide the training data into M micro-batch data according to the size of the micro-batch data.
  • The peak storage occupancy of each accelerator is less than or equal to the storage capacity threshold of that accelerator, where the peak storage occupancy of an accelerator is the storage occupied, before the accelerator starts the reverse calculation of the first micro-batch data, by the several micro-batch data on which the accelerator has already completed the forward calculation; and the size of each micro-batch data is less than or equal to the size of the training data.
  • the cluster linearity corresponding to each micro-batch data is the largest.
  • If the ratio of the size of the training data to the size of the micro-batch data is not an integer, M is the value obtained by rounding that ratio up; if the ratio is an integer, M is the ratio itself.
  • the fifth aspect of the embodiments of the present application provides a neural network training device, which includes:
  • One or more processors, a memory, a bus system, and one or more programs, where the processor and the memory are connected through the bus system;
  • the one or more programs are stored in the memory, and the one or more programs include instructions that, when executed by the training device, cause the training device to perform operations as described in the first aspect and the second aspect.
  • the sixth aspect of the embodiments of the present application provides a computer-readable storage medium, including instructions, which when run on a computer, cause the computer to execute the training method according to any one of the first aspect and the second aspect.
  • the embodiment of the application provides a neural network training method and related equipment.
  • After each accelerator completes the forward calculation of the i-th micro-batch data, it immediately performs the reverse calculation on the forward calculation result of the i-th micro-batch data, until the reverse calculation of the M micro-batch data is completed.
  • Once each accelerator starts the reverse calculation, it can start to release the feature values generated by the forward calculation of the i-th micro-batch data, until the reverse calculation of the i-th micro-batch data is completed.
  • the storage occupancy peak of each accelerator occurs when the reverse calculation of the i-th micro-batch data starts, and at this time, each accelerator only needs to save the feature value generated by the i-th micro-batch data in the forward calculation.
  • the peak storage occupancy of each accelerator appears periodically and can be kept at a low value, which can improve the training efficiency of the neural network.
  • Figure 1 is a schematic diagram of DNN performing forward calculation
  • Figure 2 is a schematic diagram of DNN performing reverse calculation
  • Figure 3 is a schematic diagram of data parallelism provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of parallel pipelines provided by an embodiment of the application.
  • Fig. 5 is a schematic structural diagram of a neural network training system provided by an embodiment of the application.
  • FIG. 6 is a schematic flowchart of a neural network training method provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of a first application example of a neural network training method provided by an embodiment of this application.
  • FIG. 8 is a schematic flowchart of a first application example provided by an embodiment of this application.
  • FIG. 9 is a schematic diagram of another process of the first application example provided by an embodiment of the application.
  • FIG. 10 is a schematic diagram of another process of a neural network training method provided by an embodiment of the application.
  • FIG. 11 is a schematic diagram of a second application example of a neural network training method provided by an embodiment of this application.
  • FIG. 12 is a schematic flowchart of a second application example provided by an embodiment of this application.
  • FIG. 13 is a schematic diagram of another process of the second application example provided by an embodiment of this application.
  • FIG. 14 is a schematic diagram of a calculation process of a second application example provided by an embodiment of the application.
  • FIG. 15 is a schematic structural diagram of a neural network training device provided by an embodiment of the application.
  • FIG. 16 is a schematic diagram of another structure of a neural network training device provided by an embodiment of the application.
  • FIG. 17 is a schematic diagram of another structure of the neural network training device provided by an embodiment of the application.
  • the embodiment of the present application provides a neural network training method and related equipment, which can keep the peak storage occupancy of the accelerator at a low value and improve the training efficiency of the neural network.
  • the embodiments of the present application will be described below in conjunction with the drawings. A person of ordinary skill in the art knows that with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
  • AI is a theory, method, technology and application system that uses digital computers or digital computer-controlled machines to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Training refers to training a neural network model through a large number of labeled samples, so that the neural network model can have specific functions.
  • Inference also called prediction or inference, refers to using a trained neural network model to infer various conclusions using new business data.
  • AI parameters refer to the parameters in the AI model determined through AI training.
  • an AI model can be regarded as a function, and AI parameters can be regarded as coefficients in the function.
  • the AI parameter can be the weight of the convolution kernel in the neural network.
  • For example, if the AI model is a support vector machine, the AI parameters can be the support vectors in the support vector machine; for another example, if the AI model is a linear regression model or a logistic regression model, the AI parameters can be the coefficients of the linear regression or logistic regression model.
  • the AI models listed are only examples.
  • AI models can also be other types of models, such as decision tree models, random forest models, confidence networks, reinforcement learning models, transfer learning models, inductive learning models, and their combination
  • AI parameters may also be parameters in other types of models, and the embodiments of the present application do not limit the specific types of AI parameters and AI models.
  • AI parameters can also be referred to as neural network parameters.
  • The adjustment process of AI parameters is very important for AI calculation. Specifically, in the process of AI calculation, the business data in the data set is usually input to the AI model, and the AI model infers and predicts on the business data based on the AI parameters to obtain a prediction result. According to the error between the predicted result and the real result, the AI parameters are adjusted, so that when the next inference prediction is made based on the adjusted AI parameters, the error can be reduced. Through cyclic execution of this AI parameter adjustment process, the AI parameters can be adjusted to become gradually more accurate. When the training is over, the AI model containing the accurate parameters can be used to achieve accurate inference and prediction, such as accurately performing face recognition on a face image.
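  • As a toy illustration of this adjust-by-error cycle (not taken from the patent), the following sketch fits a simple linear model by repeatedly predicting, measuring the error, and nudging the parameters; the data, learning rate, and parameter names are made-up examples.

```python
def adjust_parameters(samples, labels, w=0.0, b=0.0, lr=0.01, epochs=100):
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = w * x + b        # inference with the current AI parameters
            error = pred - y        # error between the prediction and the real result
            w -= lr * error * x     # adjust the parameters so the next prediction
            b -= lr * error         # has a smaller error
    return w, b
```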
  • Neural networks include, for example, the deep neural network (DNN).
  • DNN: deep neural network
  • ANN: artificial neural network
  • MLP: multilayer perceptron
  • Fig. 3 is a schematic diagram of data parallelism provided by an embodiment of the application. As shown in Fig. 3, accelerator 1, accelerator 2, and accelerator 3 are provided, and accelerator 1, accelerator 2, and accelerator 3 are all loaded with the same complete neural network.
  • Accelerator 1, accelerator 2, and accelerator 3 calculate their respective training data to obtain their respective calculation results.
  • Taking accelerator 1 as an example, after accelerator 1 performs the forward calculation on all the data allocated to it, the reverse calculation is performed on the forward calculation results of all the data, so as to obtain the reverse calculation result of accelerator 1.
  • the accelerator 2 and the accelerator 3 can also perform the same operation, which will not be repeated here.
  • accelerator 1, accelerator 2, and accelerator 3 can update the parameters of the neural network loaded respectively.
  • Pipeline parallelism is a way of model parallelism.
  • Pipeline parallelism usually means that each accelerator of multiple accelerators is loaded with part of the neural network layer, and multiple accelerators jointly load the complete neural network. After receiving the training data, each accelerator is responsible for the parameter training of the partial layer of the neural network. Multiple accelerators can jointly train the neural network based on the training data.
  • Fig. 4 is a schematic diagram of the parallel pipeline provided by an embodiment of the application. The thin line frame in Fig. 4 represents the forward calculation of data, and the thick line frame represents the reverse calculation of data.
  • accelerator 1 loads the first layer of the neural network
  • accelerator 2 loads the second layer of the neural network
  • accelerator 3 loads the third layer of the neural network
  • During the forward calculation, the forward calculation result of accelerator 1 can be input to accelerator 2, and the forward calculation result of accelerator 2 can be input to accelerator 3.
  • During the reverse calculation, the reverse calculation result of accelerator 3 can be input to accelerator 2, and the reverse calculation result of accelerator 2 can be input to accelerator 1.
  • After accelerator 1 receives data 1, data 2, and data 3 from the processor, forward and reverse calculations can be performed on the three pieces of data. Specifically, after data 1 is forward-calculated by accelerator 1, accelerator 2, and accelerator 3 in turn, the forward calculation result of data 1 can be obtained, that is, data 1 that has been forward-calculated by accelerator 3.
  • While a later accelerator is forward-calculating data 1, accelerator 1 can perform the forward calculation on data 2 at the same time.
  • In this way, the forward calculation results of data 1, data 2, and data 3 can be obtained, and the reverse calculation can then be performed on the forward calculation results of the three pieces of data.
  • the backward calculation is the reverse process of the forward calculation, and reference may be made to the foregoing description of the forward calculation, which will not be repeated here.
  • After the reverse calculation is completed, accelerator 1 can update the parameters of the first layer based on the reverse calculation results it obtains, accelerator 2 can update the parameters of the second layer based on the reverse calculation results it obtains, and accelerator 3 can update the parameters of the third layer based on the reverse calculation results it obtains.
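  • The forward flow 1 → 2 → 3 and reverse flow 3 → 2 → 1 described above can be sketched as follows. This is an illustrative, single-process PyTorch-style approximation in which `stage1`, `stage2`, and `stage3` are hypothetical module names standing in for accelerators 1, 2, and 3; real pipeline parallelism would overlap these steps across devices rather than run them sequentially.

```python
def pipeline_forward_reverse(stage1, stage2, stage3, data, label, loss_fn):
    h1 = stage1(data)     # accelerator 1: forward on its layers, output goes to accelerator 2
    h2 = stage2(h1)       # accelerator 2: forward, output goes to accelerator 3
    out = stage3(h2)      # accelerator 3: forward calculation result of the data
    loss = loss_fn(out, label)
    loss.backward()       # reverse calculation flows back 3 -> 2 -> 1 through the stages
    return loss
```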
  • FIG. 5 is a schematic structural diagram of the neural network training system provided by an embodiment of the application.
  • the neural network system includes a plurality of training devices 501, and the training devices 501 can communicate with each other through a switch 502.
  • Each training device 501 includes a central processing unit (CPU), hereinafter referred to as a processor 5011, and multiple accelerators 5012.
  • the accelerator 5012 can be implemented by an acceleration device such as a graphics processing unit (GPU) or a field programmable gate array (Field Programmable Gate Array, FPGA), which is not limited here.
  • the processor 5011 may send sample data for training the neural network to the accelerator 5012, and the accelerator 5012 may train the neural network based on the sample data.
  • FIG. 6 is a schematic flowchart of a neural network training method provided by an embodiment of the application. Please refer to FIG. 6.
  • the training method is applied to a training device including a processor and N accelerators.
  • the processor can provide data for neural network training for each accelerator, each accelerator is loaded with the same neural network, and N accelerators train the neural network in a data-parallel manner.
  • the training method includes:
  • Each accelerator obtains M micro-batch data.
  • Specifically, the processor may first obtain the training data and divide the training data into N×M micro-batch data, where each micro-batch data includes at least one sample data to be trained. Then, the processor sends M micro-batch data to each accelerator. Among them, N ≥ 2 and M ≥ 2.
  • After each accelerator performs the forward calculation on the i-th micro-batch data, it directly performs the reverse calculation on the forward calculation result of the i-th micro-batch data, until the reverse calculation of the M micro-batch data is completed to obtain the result of the reverse calculation.
  • When it is stated that the accelerator completes the reverse calculation of a certain piece (or pieces) of micro-batch data, it should be understood that the accelerator has completed the forward calculation of that micro-batch data and has completed the reverse calculation of the result of that forward calculation.
  • Similarly, when it is stated that the accelerator performs the reverse calculation on a certain piece (or pieces) of micro-batch data, it should be understood that the accelerator has completed the forward calculation of that micro-batch data and performs the reverse calculation on that forward calculation result; this will not be repeated in the following.
  • The result of the reverse calculation may include the gradient accumulation value corresponding to each accelerator, where the gradient accumulation value corresponding to an accelerator is the sum of the M gradients obtained after that accelerator performs the reverse calculation on the M micro-batch data.
  • Specifically, in the process of each accelerator training the neural network based on the M micro-batch data, each accelerator performs the forward calculation and the reverse calculation on the i-th micro-batch data to obtain the i-th gradient, then performs the forward calculation and the reverse calculation on the (i+1)-th micro-batch data to obtain the (i+1)-th gradient, and so on until the reverse calculation of the M micro-batch data is completed and M gradients are obtained.
  • Finally, each accelerator performs an accumulation calculation on the M gradients to obtain the gradient accumulation value. It is worth noting that when the accelerator completes the forward calculation of a certain micro-batch data, it stores the feature values generated in the forward calculation process. When the accelerator starts the reverse calculation of that micro-batch data, it starts to release the feature values generated by the forward calculation of the micro-batch data (because the reverse calculation needs to use those feature values). When the reverse calculation of the micro-batch data is completed, the feature values generated by the micro-batch data in the forward calculation are completely released, that is, the storage occupied by those feature values (the storage occupancy corresponding to the micro-batch data) is released.
  • For example, after a certain accelerator in the training device completes the forward calculation of the first micro-batch data, it immediately performs the reverse calculation on the first micro-batch data to obtain the first gradient, and then performs the forward calculation on the second micro-batch data.
  • The reverse calculation is then performed on the second micro-batch data immediately to obtain the second gradient.
  • Repeating this process, M gradients can be obtained.
  • The accelerator can then superimpose the M gradients to obtain the gradient accumulation value corresponding to the accelerator.
  • the training device may also include other accelerators, and the other accelerators may also implement the foregoing process to obtain the gradient accumulation values corresponding to the other accelerators, which will not be repeated here.
  • Each accelerator updates the parameters of the neural network according to the result of the reverse calculation.
  • Each accelerator first performs an averaging calculation according to its corresponding gradient accumulation value and the gradient accumulation values corresponding to other accelerators, and obtains the target gradient accumulation value that is finally used to update the neural network. Then, each accelerator updates the parameters of the neural network according to the accumulated value of the target gradient.
  • the gradient accumulation value corresponding to each accelerator may be different (each accelerator loads the same neural network, but in the micro-batch data received by each accelerator, the sample data may be different, resulting in different calculation results) , In order to achieve the most efficient training effect, each accelerator can average all gradient accumulation values to obtain the same target gradient accumulation value. In this way, all accelerators can update the same neural network based on the same target gradient accumulation value, and complete the training of the neural network.
  • the size of the micro-batch data can be set to save the storage resources of the accelerator.
  • the training method may further include:
  • the processor first obtains the training data. It should be noted that the training data is a collection of all sample data input to an accelerator, and the size of the training data is greater than or equal to the size of the micro-batch data. Then determine the size of the micro-batch data according to the target storage capacity threshold and the size of the training data.
  • Among them, if the N accelerators are the same, the target storage capacity threshold is the storage capacity threshold of any accelerator among the N accelerators; if at least P of the N accelerators are different (that is, the storage capacity thresholds of at least P accelerators are different), the target storage capacity threshold is the smallest value among the storage capacity thresholds of the at least P accelerators, where N ≥ P ≥ 2.
  • The processor divides the training data into N×M micro-batch data according to the size of the micro-batch data.
  • the size of micro-batch data should meet the following conditions: (1) The storage occupancy corresponding to the size of each micro-batch data is less than or equal to the target storage capacity threshold; (2) The size of each micro-batch data is less than or equal to training The size of the data; (3) The cluster linearity corresponding to each micro-batch data is the largest.
  • If the ratio between the size of the training data and the size of the micro-batch data is not an integer, the number M of micro-batch data is the value obtained by rounding that ratio up; if the ratio is an integer, the number M of micro-batch data is that ratio.
  • In this way, the size of the micro-batch data can be set to the optimal value to reduce the storage occupancy of the feature values generated by the micro-batch data after the forward calculation, which can further save the storage resources of the accelerator and improve the training efficiency of the neural network.
  • After each accelerator completes the forward calculation of the i-th micro-batch data, it immediately performs the reverse calculation on the forward calculation result of the i-th micro-batch data.
  • Once each accelerator starts the reverse calculation, it can start to release the feature values generated by the i-th micro-batch data in the forward calculation, until the reverse calculation of the i-th micro-batch data is completed (at which point the feature values generated by the forward calculation of the i-th micro-batch data are completely released). Therefore, the peak storage occupancy of each accelerator occurs when the reverse calculation of the i-th micro-batch data starts, and at this time each accelerator only needs to save the feature values generated by the i-th micro-batch data in the forward calculation. During the entire calculation process, the peak storage occupancy of each accelerator appears periodically (that is, the peak appears at the beginning of the reverse calculation of each micro-batch data) and can be kept at a low value, which can improve the training efficiency of the neural network.
  • FIG. 7 is a schematic diagram of a first application example of the neural network training method provided by an embodiment of the application. Please refer to FIG. 7.
  • the training device for training the target neural network is provided with a processor, GPU1, GPU2, GPU3, and GPU4.
  • the same target neural network is loaded on GPU1, GPU2, GPU3, and GPU4.
  • the target neural network has a multi-layer structure, and the size and calculation time of each layer are uniformly set.
  • the externally input training data contains 1024 sample data.
  • FIG. 8 is a schematic flowchart of the first application example provided by the embodiment of the application. As shown in FIG. 8, the process includes:
  • S1: The processor determines the video memory capacity threshold Cmax of GPU1 and the size of the batch data.
  • S2: The processor selects a size of the micro-batch data according to the size of the batch data, and determines the video memory occupancy C1 corresponding to the micro-batch data on GPU1.
  • S3: The processor judges whether C1 ≤ Cmax is satisfied; if it is not satisfied, S2 is executed again, and if it is satisfied, S4 is executed.
  • S4: The processor determines all the values of the size of the micro-batch data that satisfy C1 ≤ Cmax, and among all these values takes the size of the micro-batch data with the largest cluster linearity L as the final choice.
  • For example, the processor first selects the size of the micro-batch data as 256.
  • If the condition C1 ≤ Cmax is not satisfied when the size of the micro-batch data is 256, the processor sets the size of the micro-batch data to 128.
  • If the condition is still not satisfied, the processor again sets the size of the micro-batch data to 64, and so on.
  • The processor can also perform the same process as S1-S4 for GPU2, GPU3, and GPU4. Since GPU1, GPU2, GPU3, and GPU4 are GPUs with the same performance, the size of the micro-batch data finally determined for each GPU is 32, and the number of micro-batch data is 8.
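  • A rough sketch of the S1-S4 selection in this example is given below, assuming the halving order 256 → 128 → 64 → 32 shown above and hypothetical helpers `memory_usage(b)` (the per-micro-batch video memory C1) and `cluster_linearity(b)`; it is an illustration of the search, not the exact procedure claimed in the patent.

```python
def select_micro_batch_size(per_gpu_batch, c_max, memory_usage, cluster_linearity):
    # S2/S3: halve the candidate size until its video memory usage C1 fits under Cmax.
    size = per_gpu_batch                       # 256 samples per GPU in this example
    while size > 1 and memory_usage(size) > c_max:
        size //= 2                             # 256 -> 128 -> 64 -> 32
    # S4: among the remaining feasible sizes, keep the one with the
    # largest cluster linearity L (32 in this example).
    feasible = []
    b = size
    while b >= 1:
        feasible.append(b)
        b //= 2
    best = max(feasible, key=cluster_linearity)
    return best, per_gpu_batch // best         # -> (32, 8) in this application example
```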
  • FIG. 9 is another schematic flow diagram of the first application example provided by an embodiment of the application. As shown in FIG. 9, the process includes:
  • W1: Perform the forward calculation on the first micro-batch data and store the feature values generated by the forward calculation.
  • W2: After determining that the forward calculation of the first micro-batch data has ended, perform the reverse calculation on the forward calculation result of the first micro-batch data, and start to release the video memory occupied by the first micro-batch data (that is, release the feature values generated by the forward calculation of the first micro-batch data).
  • When the reverse calculation of the first micro-batch data is completed, the video memory occupied by the first micro-batch data is completely released, and the first gradient is obtained.
  • W3: Perform the forward calculation and the reverse calculation on the second micro-batch data to obtain the second gradient.
  • For the calculation process of the second micro-batch data please refer to W1 and W2, which will not be repeated here.
  • By analogy, 8 gradients can be obtained, and the 8 gradients are accumulated to obtain the gradient accumulation value.
  • W4: Update the target neural network according to the gradient accumulation value.
  • each GPU can obtain its corresponding gradient accumulation value. After the gradient accumulation value is obtained, the neural network can be updated. Specifically, each GPU first performs an averaging calculation according to its corresponding gradient accumulation value and the gradient accumulation values corresponding to other GPUs, to obtain the target gradient accumulation value that is finally used to update the neural network. Then, each GPU updates the parameters of the neural network according to the accumulated value of the target gradient. For example, GPU1 may perform an average calculation on its corresponding gradient accumulation value and the gradient accumulation values corresponding to GPU2, GPU3, and GPU4 to obtain the target gradient accumulation value. In the same way, GPU2, GPU3, and GPU4 can also obtain the target gradient accumulation value. Finally, GPU1, GPU2, GPU3, and GPU4 can update the parameters of the neural network loaded respectively according to the target gradient accumulation value.
  • a certain micro-batch data is first calculated in the forward direction during the training process, and then immediately calculated in the reverse direction.
  • the backward calculation of the micro-batch data is completed before the forward calculation of the next micro-batch data is started. Therefore, the peak video memory usage in this application example occurs when the reverse calculation of any micro-batch data starts.
  • At that moment, the accelerator only needs to save the feature values generated by the forward calculation of one micro-batch data. During the entire calculation process, the peak value of the video memory occupancy appears periodically until the forward calculation and reverse calculation of all the micro-batch data are completed. Whenever a video memory occupancy peak occurs, the accelerator only needs to save the feature values generated by the forward calculation of one micro-batch data, keeping the peak video memory usage at a low value, which can improve the training efficiency of the neural network.
  • FIG. 10 is a schematic diagram of another flow chart of a neural network training method provided by an embodiment of the application.
  • the training method is applied to a training device including a processor and N accelerators.
  • the processor can provide data for neural network training for each accelerator.
  • Each accelerator loads a partial layer of a neural network
  • N accelerators load the complete neural network together
  • the N accelerators train the neural network in a pipelined parallel manner.
  • the N accelerators can jointly train the neural network based on the sample data.
  • the training device is equipped with three accelerators, and the neural network has 15 layers.
  • accelerator 1 is loaded with layers 1 to 5 of the neural network
  • accelerator 2 is loaded with layers 6 to 10 of the neural network
  • accelerator 3 is loaded with layers 11 to 15 of the neural network
  • Accelerator 1, Accelerator 2, and Accelerator 3 can train the neural network in a pipelined parallel manner.
  • the training method includes:
  • the first accelerator among the N accelerators obtains M micro-batch data.
  • the processor may first obtain training data, and divide the training data into M micro-batch data, and each micro-batch contains at least one sample data to be trained. Then, the processor sends M micro-batch data to the first accelerator among the N accelerators.
  • There are N accelerators in the training device, but the N accelerators act as a whole (because the N accelerators together load one neural network), and the first accelerator serves as the input of that whole; therefore the processor only needs to prepare M micro-batch data and send the M micro-batch data to the first accelerator. Among them, N ≥ 2 and M ≥ 2.
  • After the N accelerators jointly complete the forward calculation of the i-th micro-batch data, they directly perform the reverse calculation on the forward calculation result of the i-th micro-batch data, until the reverse calculation of the M micro-batch data is completed to obtain the result of the reverse calculation.
  • The j-th accelerator performing the forward calculation on the i-th micro-batch data should be understood as the j-th accelerator performing the forward calculation on the i-th micro-batch data that has already been forward-calculated by the (j-1)-th accelerator.
  • The j-th accelerator completing the forward calculation of the i-th micro-batch data should be understood to mean that the j-th accelerator completes the forward calculation of the i-th micro-batch data that has been forward-calculated by the (j-1)-th accelerator.
  • The k-th accelerator performing the reverse calculation on the i-th micro-batch data should be understood as the k-th accelerator performing the reverse calculation on the i-th micro-batch data that has already been reverse-calculated by the (k+1)-th accelerator.
  • The k-th accelerator completing the reverse calculation of the i-th micro-batch data should be understood as the k-th accelerator completing the reverse calculation of the i-th micro-batch data that has been reverse-calculated by the (k+1)-th accelerator.
  • The N-th accelerator performing the reverse calculation on the i-th micro-batch data should be understood as the N-th accelerator performing the reverse calculation on the i-th micro-batch data that has been forward-calculated by the N-th accelerator itself; this will not be described in detail again later.
  • The result of the reverse calculation may include the gradient accumulation value corresponding to each accelerator, where the gradient accumulation value corresponding to an accelerator is the sum of the M gradients obtained after that accelerator performs the reverse calculation on the M micro-batch data.
  • For ease of understanding, the foregoing example is still used for explanation.
  • After accelerator 1 receives the M micro-batch data, it performs the forward calculation on the first micro-batch data. After completing the forward calculation, accelerator 1 sends the calculated first micro-batch data to accelerator 2, so that accelerator 2 performs the forward calculation on the first micro-batch data.
  • After the first micro-batch data has been forward-calculated by all three accelerators, accelerator 3 starts to perform the reverse calculation on the first micro-batch data.
  • Accelerator 3 can then obtain the first gradient, and sends the first micro-batch data after accelerator 3's reverse calculation to accelerator 2, so that accelerator 2 performs the reverse calculation on the first micro-batch data.
  • In this way, accelerator 2 and accelerator 1 can also each obtain their respective first gradients.
  • the three accelerators can also perform the aforementioned calculation process on the second micro-batch data to the M-th micro-batch data, so the accelerator 1 can obtain M gradients and accumulate the M gradients to obtain the gradient accumulation value .
  • the accelerator 2 and the accelerator 3 can also obtain M gradients respectively, and the accumulated value of the gradients can be obtained through accumulation calculation.
  • the accelerator when the accelerator completes the forward calculation of a certain micro-batch data, it stores the characteristic values generated in the forward calculation process.
  • the accelerator starts the reverse calculation of the micro-batch data, it starts to release the feature value generated by the forward calculation of the micro-batch data (because the reverse calculation needs to use the feature value generated by the forward calculation).
  • the reverse calculation of the micro-batch data is completed, at this time the feature value generated by the forward calculation of the micro-batch data is completely released, that is, the storage amount occupied by the part of the feature value is released.
  • For example, when accelerator 3 performs the reverse calculation on the first micro-batch data, it has only completed the forward calculation of the first micro-batch data. Therefore, accelerator 3 stores only the feature values generated by the first micro-batch data in the forward calculation.
  • When accelerator 2 performs the reverse calculation on the first micro-batch data, it is assumed that it has completed the forward calculation of three micro-batch data (because while accelerator 3 performs the forward calculation and the reverse calculation on the first micro-batch data, accelerator 2 can synchronously perform the forward calculation on the remaining micro-batch data, for example the second micro-batch data and the third micro-batch data), so accelerator 2 stores the feature values generated by these three micro-batch data in the forward calculation.
  • Similarly, when accelerator 1 performs the reverse calculation on the first micro-batch data, it is assumed that it has completed the forward calculation of five micro-batch data, so accelerator 1 stores the feature values generated by the forward calculation of these five micro-batch data. Therefore, the peak storage occupancies of accelerator 1, accelerator 2, and accelerator 3 all appear at the beginning of the reverse calculation of the first micro-batch data, and the peak storage occupancy of accelerator 1 is greater than that of accelerator 2, which is in turn greater than that of accelerator 3.
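  • The 1 / 3 / 5 counts above suggest, for this schedule, an in-flight count of roughly 2*(N-j)+1 micro-batches for accelerator j out of N at the start of its first reverse calculation. The following back-of-the-envelope sketch uses that inferred count (it is not a formula stated verbatim in the patent) together with an assumed uniform per-micro-batch feature-value size `c1`.

```python
def pipeline_peak_storage(n_accelerators, c1):
    """Estimated peak feature-value storage per accelerator at its first reverse calculation."""
    peaks = {}
    for j in range(1, n_accelerators + 1):          # accelerator index j = 1..N
        in_flight = 2 * (n_accelerators - j) + 1    # micro-batches already forward-calculated
        peaks[j] = in_flight * c1
    return peaks

# With N = 3: accelerator 1 -> 5*c1, accelerator 2 -> 3*c1, accelerator 3 -> 1*c1.
```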
  • the N accelerators update the parameters of the neural network according to the results of the reverse calculation.
  • Each accelerator updates part of the neural network layers it loads according to its corresponding gradient accumulation value.
  • the accelerator 1 updates the parameters of the first to fifth layers of the neural network according to its corresponding gradient accumulation value.
  • the accelerator 2 updates the parameters of the 6th to the 10th layer of the neural network according to its corresponding gradient accumulation value.
  • the accelerator 3 updates the parameters of the 11th to 15th layers of the neural network according to its corresponding gradient accumulation value.
  • the size of the micro-batch data can be set to save the storage resources of the accelerator.
  • The training method may further include: the processor first obtains the training data. Then, the processor determines the size of the micro-batch data according to the storage capacity threshold of each accelerator and the size of the training data. Finally, the processor divides the training data into M micro-batch data according to the size of the micro-batch data.
  • The size of the micro-batch data should meet the following conditions: (1) the peak storage occupancy of each accelerator is less than or equal to the storage capacity threshold of that accelerator, where the peak storage occupancy of an accelerator is the storage occupied, before the accelerator starts the reverse calculation of the first micro-batch data, by the several micro-batch data on which the accelerator has already completed the forward calculation. As in the above example, when accelerator 3 performs the reverse calculation on the first micro-batch data, it has only completed the forward calculation of the first micro-batch data, so the storage occupancy corresponding to the first micro-batch data (that is, the peak storage occupancy of accelerator 3) should be less than or equal to the storage capacity threshold of accelerator 3.
  • When accelerator 2 performs the reverse calculation on the first micro-batch data, it has completed the forward calculation of three micro-batch data, so the storage occupancy corresponding to those three micro-batch data (that is, the peak storage occupancy of accelerator 2) should be less than or equal to the storage capacity threshold of accelerator 2, and so on. (2) The size of each micro-batch data is less than or equal to the size of the training data. (3) The cluster linearity corresponding to each micro-batch data is the largest.
  • If the ratio between the size of the training data and the size of the micro-batch data is not an integer, the number M of micro-batch data is the value obtained by rounding that ratio up; if the ratio is an integer, the number M of micro-batch data is that ratio.
  • In this way, the size of the micro-batch data can be set to the optimal value to reduce the storage occupancy of the feature values generated by the micro-batch data after the forward calculation, which can further save the storage resources of the accelerator and improve the training efficiency of the neural network.
  • After the N accelerators complete the forward calculation on the i-th micro-batch data, they immediately perform the reverse calculation on the forward calculation result of the i-th micro-batch data.
  • Once each accelerator starts the reverse calculation of the i-th micro-batch data, it can start to release the feature values generated by the i-th micro-batch data in the forward calculation, until the reverse calculation of the i-th micro-batch data is completed (at which point the feature values generated by the forward calculation of the i-th micro-batch data are completely released). Therefore, the peak storage occupancy of each accelerator appears when the reverse calculation of the first micro-batch data starts, and at this time each accelerator only needs to save the feature values generated by the forward calculation of some of the micro-batch data. During the entire calculation process, the peak storage occupancy of each accelerator can be kept at a low value, which can improve the training efficiency of the neural network.
  • FIG. 11 is a schematic diagram of a second application example of the neural network training method provided by an embodiment of the application.
  • the training device for training the neural network is provided with a processor, GPU1, GPU2, GPU3, and GPU4.
  • the neural network has a 32-layer structure, and the size and calculation time of each layer are uniformly set.
  • GPU1 is loaded with the 1st to 8th layers of the neural network
  • GPU2 is loaded with the 9th to 16th layers of the neural network
  • GPU3 is loaded with the 17th to 24th layers of the neural network
  • GPU4 is loaded with the 25th to 32nd layers of the neural network.
  • the externally input training data contains 256 sample data. Because GPU1, GPU2, GPU3, and GPU4 perform training in parallel in a pipeline, the processor sends the training data to GPU1 (GPU1 serves as the input port of the entire target neural network), so that the four GPUs train the target neural network based on the training data.
  • the training data can be further divided into multiple micro-batch data.
  • the size of the micro-batch data needs to be determined, and GPU1, GPU2, GPU3, and GPU4 are assumed to be GPUs with the same performance.
  • for the description of determining the size of the micro-batch data, please refer to the relevant part of the aforementioned first application example; details are not repeated here. It should be noted that, because GPU1, GPU2, GPU3, and GPU4 are considered as a whole and are GPUs with the same performance, the processor only needs to perform the following process on GPU1 to determine the size of the micro-batch data. The process of determining the size of the micro-batch data is described below with reference to FIG. 12.
  • FIG. 12 is a schematic flowchart of a second application example provided by an embodiment of this application. As shown in FIG. 12, the determination process includes:
  • T1 The processor determines the video memory capacity threshold Cmax of GPU1 and the size of the training data.
  • T2 The processor selects the size of a micro-batch data according to the size of the training data, and determines the memory usage C1 corresponding to the micro-batch data in GPU1.
  • T3 The processor determines whether the peak video memory usage of GPU1 satisfies N×C1≤Cmax; if the condition is not satisfied, T2 is re-executed, and if it is satisfied, T4 is executed.
  • T4 The processor determines all values of the micro-batch data size that satisfy N×C1≤Cmax, and among all these values, the micro-batch data size with the largest cluster linearity L is taken as the final choice.
  • it should be noted that the processor does not simply select the size of the micro-batch data as 256 (that is, the size of the entire training data); after performing the above process, the processor sets the size of the micro-batch data to 32, so that the 256 sample data are divided into 8 micro-batch data.
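  • A hedged sketch of the T1–T4 selection procedure is given below; `measure_memory_per_microbatch` and `cluster_linearity` stand in for profiling steps that this application does not name, and the feasibility test mirrors the N×C1≤Cmax condition of T3.

```python
import math

def choose_microbatch_size(train_size, c_max, n, candidate_sizes,
                           measure_memory_per_microbatch, cluster_linearity):
    """Return the micro-batch size with the largest cluster linearity L among
    the candidates whose estimated peak video memory fits within c_max."""
    feasible = []
    for b in candidate_sizes:                        # T2: pick a candidate size
        if b > train_size:
            continue                                 # a micro-batch cannot exceed the training data
        c1 = measure_memory_per_microbatch(b)        # T2: memory usage C1 of one micro-batch
        if n * c1 <= c_max:                          # T3: peak-memory condition
            feasible.append(b)
    if not feasible:
        raise ValueError("no candidate micro-batch size satisfies the memory condition")
    b_best = max(feasible, key=cluster_linearity)    # T4: maximize cluster linearity L
    m = math.ceil(train_size / b_best)               # number of micro-batch data M
    return b_best, m
```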
  • FIG. 13 is another schematic flowchart of the second application example provided by an embodiment of this application.
  • FIG. 14 is a schematic diagram of the calculation process of the second application example provided by an embodiment of this application.
  • the thin line frames in Figure 14 represent the forward calculation of the micro-batch data
  • the thick line frames represent the backward calculation of the micro-batch data
  • each micro-batch data is marked as MB; for example, the first micro-batch data is MB1, the second micro-batch data is MB2, and so on.
  • the calculation process is as follows:
  • P1 GPU1 performs forward calculation on the first micro-batch data, and saves the feature values generated by the forward calculation in its video memory.
  • GPU1 sends the calculated first micro-batch data to GPU2, so that GPU2 performs forward calculation on the first micro-batch data (while GPU2 performs forward calculation on the first micro-batch data, GPU1 simultaneously performs forward calculation on the second micro-batch data).
  • after GPU4 completes the forward calculation of the first micro-batch data, it can immediately perform backward calculation on the first micro-batch data, while the remaining GPUs are still performing forward calculation on the remaining micro-batch data.
  • GPU4 starts the backward calculation of the first micro-batch data, and starts to release the video memory occupied by the first micro-batch data in GPU4 (that is, it starts to release the feature values generated by the first micro-batch data during the forward calculation).
  • after GPU4 completes the backward calculation of the first micro-batch data, GPU4 obtains the first gradient and sends the first micro-batch data to GPU3, so that GPU3 performs backward calculation on the first micro-batch data (at this time, GPU3 has completed the forward calculation of the third micro-batch data).
  • after GPU3 completes the backward calculation of the first micro-batch data, GPU3 obtains the first gradient and sends the first micro-batch data to GPU2, so that GPU2 performs backward calculation on the first micro-batch data (at this time, GPU2 has completed the forward calculation of the fifth micro-batch data). By analogy, this continues until GPU1 finishes the backward calculation of the first micro-batch data and obtains the first gradient.
  • each GPU can obtain 8 gradients, and accumulates these 8 gradients to obtain its gradient accumulation value (a brief sketch of this accumulation-and-update step is given after the per-GPU list below).
  • GPU1 updates the parameters of the first to the eighth layers of the neural network according to its corresponding gradient accumulation value.
  • GPU2 updates the parameters of the 9th to 16th layers of the neural network according to its corresponding gradient accumulation value.
  • GPU3 updates the parameters of the 17th to 24th layers of the neural network according to its corresponding gradient accumulation value.
  • GPU4 updates the parameters of the 25th to 32nd layers of the neural network according to its corresponding gradient accumulation value.
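  • As referenced above, the accumulation-and-update step can be sketched as follows (assumed plain-Python names; each GPU touches only the parameters of the layers it loads, for example layers 1 to 8 on GPU1, and `lr` is a hypothetical learning rate).

```python
def update_own_layers(own_params, micro_batch_grads, lr):
    # own_params: parameters of the layers loaded on this GPU.
    # micro_batch_grads: one gradient list per micro-batch (8 lists in this
    # example), aligned with own_params.
    # The gradient accumulation value is the sum over the micro-batches, and
    # only this GPU's own layer parameters are updated with it.
    for idx, param in enumerate(own_params):
        grad_accum = sum(grads[idx] for grads in micro_batch_grads)
        own_params[idx] = param - lr * grad_accum
    return own_params
```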
  • the peak video memory usage of each GPU appears at the beginning of the backward calculation of the first micro-batch data (that is, at the arrow in the figure; from that point the video memory usage gradually decreases until the backward calculation of the first micro-batch data is completed, and thereafter the peak appears periodically). Each GPU does not need to save all the feature values generated by the micro-batch data during the forward calculation, which keeps the peak video memory usage of each GPU at a low value (in contrast with the prior art shown in Figure 4, in which each GPU needs to save all the feature values generated by the micro-batch data during the forward calculation, as shown by the arrow in Figure 4), thereby improving the training efficiency of the neural network.
  • FIG. 15 is a schematic structural diagram of a neural network training device provided by an embodiment of the application. Please refer to FIG. 15.
  • the training device includes a processor 1501 and N accelerators 1502.
  • each accelerator 1502 loads the same neural network, and N accelerators 1502 train the neural network in a data parallel manner.
  • Each accelerator 1502 is used to obtain M micro-batch data from the processor 1501, and N ⁇ M micro-batch data form training data.
  • Each accelerator 1502 is also used to, after completing the forward calculation of the i-th micro-batch data, directly perform backward calculation on the forward calculation result of the i-th micro-batch data, until the backward calculation of the M micro-batch data is completed, so as to obtain the result of the backward calculation.
  • Each accelerator 1502 is also used to update the parameters of the neural network according to the result of the backward calculation.
  • N ⁇ 2, M ⁇ 2, i 1, 2,...,M.
  • the result of the backward calculation includes the gradient accumulation value corresponding to each accelerator 1502, and the gradient accumulation value corresponding to each accelerator 1502 is the sum of the M gradients obtained after that accelerator 1502 performs backward calculation on the M micro-batch data respectively.
  • each accelerator 1502 is also used to perform an average calculation according to the gradient accumulation values corresponding to the N accelerators 1502 to obtain the target gradient accumulation value.
  • Each accelerator 1502 is also used to update the parameters of the neural network according to the accumulated value of the target gradient.
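  • A minimal data-parallel sketch of this averaging-then-update step is given below. torch.distributed is used only as one illustrative way to exchange gradients across the N accelerators (this application does not specify a communication primitive); the sketch assumes an already-initialized process group of size N, with each replica holding its own gradient accumulation value in the parameters' .grad fields.

```python
import torch.distributed as dist

def average_and_update(model, optimizer, world_size):
    # Average the gradient accumulation values over the N accelerators to get
    # the target gradient accumulation value, then update the (identical)
    # neural network parameters on every replica.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum over the N accelerators
            p.grad.div_(world_size)                        # target (averaged) gradient
    optimizer.step()
```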
  • the processor 1501 is further configured to determine the size of the micro-batch data according to the target storage capacity threshold and the size of the training data. If the N accelerators 1502 are the same, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators 1502. If at least P of the N accelerators 1502 are different, the target storage capacity threshold is the smallest value among the storage capacity thresholds of the at least P accelerators 1502, N≥P≥2. The processor 1501 is also configured to divide the training data into N×M micro-batch data according to the size of the micro-batch data.
  • the storage occupancy corresponding to each micro-batch data is less than or equal to the target storage capacity threshold, and the size of each micro-batch data is less than or equal to the size of the training data.
  • the cluster linearity corresponding to each micro-batch data is the largest.
  • if the ratio between the size of the training data and the size of the micro-batch data is a non-integer, M is the value obtained by rounding the ratio up; if the ratio is an integer, M is the ratio.
  • FIG. 16 is a schematic diagram of another structure of a neural network training device provided by an embodiment of the application.
  • the training device includes a processor 1601 and N accelerators 1602. Each accelerator 1602 loads some layers of the neural network, the N accelerators 1602 jointly load the neural network, and the N accelerators 1602 train the neural network in a pipeline-parallel manner.
  • the first accelerator 1602 among the N accelerators 1602 is used to obtain M micro-batch data, and the M micro-batch data constitute training data.
  • the N accelerators 1602 are used to, after jointly completing the forward calculation of the i-th micro-batch data, directly perform backward calculation on the forward calculation result of the i-th micro-batch data, until the backward calculation of the M micro-batch data is completed, so as to obtain the result of the backward calculation.
  • the result of the backward calculation includes the gradient accumulation value corresponding to each accelerator 1602, and the gradient accumulation value corresponding to each accelerator 1602 is the sum of the M gradients obtained after that accelerator 1602 performs backward calculation on the M micro-batch data respectively.
  • each accelerator 1602 is used to update the parameters of the partial layer of the neural network it loads according to its corresponding gradient accumulation value.
  • the processor 1601 is used to obtain training data.
  • the processor 1601 is also configured to determine the size of the micro-batch data according to the storage capacity threshold of each accelerator 1602 and the size of the training data.
  • the processor 1601 is also configured to divide the training data into M micro-batch data according to the size of the micro-batch data.
  • the peak storage occupancy of each accelerator 1602 is less than or equal to the storage capacity threshold of that accelerator 1602, where the peak storage occupancy of each accelerator 1602 is the storage occupancy corresponding to the several micro-batch data whose forward calculation has already been completed by that accelerator 1602 before it performs backward calculation on the first micro-batch data, and the size of each micro-batch data is less than or equal to the size of the training data.
  • the cluster linearity corresponding to each micro-batch data is the largest.
  • if the ratio between the size of the training data and the size of the micro-batch data is a non-integer, M is the value obtained by rounding the ratio up; if the ratio is an integer, M is the ratio.
  • FIG. 17 is a schematic diagram of another structure of the neural network training device provided by an embodiment of the application.
  • the training device includes: one or more central processing units 1701, a memory 1702, an input/output interface 1703, a wired or wireless network interface 1704, and a power supply 1705.
  • the memory 1702 may be short-term storage or persistent storage. Furthermore, the central processing unit 1701 may be configured to communicate with the memory 1702, and execute a series of instruction operations in the memory 1702 on the training device.
  • the central processing unit 1701 can perform operations performed by the training device in the embodiment shown in FIG. 6 or FIG. 10, and details are not described herein again.
  • the specific functional module division of the central processing unit 1701 may be similar to the functional module division of the processor and accelerator described in FIG. 15 or FIG. 16, and will not be repeated here.
  • An embodiment of the present application also provides a computer-readable storage medium, including instructions, which when run on a computer, cause the computer to execute the training method in the foregoing embodiment shown in FIG. 6 or FIG. 10.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division into units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Feedback Control In General (AREA)

Abstract

A method for training a neural network, and a related device. In the method, after completing the forward calculation of one micro-batch of data, an accelerator immediately performs backward calculation on the forward calculation result of that micro-batch. When the accelerator starts the backward calculation, it can begin to release the feature values generated by the micro-batch during the forward calculation, and by the time the backward calculation of the micro-batch is completed, those feature values are completely released. The accelerator can then perform forward and backward calculation on the next micro-batch, until the backward calculation of all the micro-batches is completed. Therefore, during the whole calculation process the accelerator does not need to store the feature values generated during the forward calculation of all the micro-batches, so the peak memory occupancy of the accelerator can be kept at a lower value and the training efficiency of the neural network can be improved.

Description

Neural network training method and related device
This application claims priority to the Chinese patent application No. 202010479541.2, filed with the Chinese Patent Office on May 29, 2020 and entitled "Neural network training method and related device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a neural network training method and related device.
Background
In the field of artificial intelligence (AI), training of a deep neural network (DNN) is usually performed by accelerators, and the calculation process generally includes forward calculation and backward calculation.
Because a DNN is layered, its calculation is generally performed layer by layer. FIG. 1 is a schematic diagram of forward calculation performed by a DNN, and FIG. 2 is a schematic diagram of backward calculation performed by a DNN. As shown in FIG. 1, assume the DNN has a four-layer structure. After the training data is input into the DNN, forward calculation is performed layer by layer in the order first layer → second layer → third layer → fourth layer. After the training data undergoes forward calculation, the feature values obtained by the forward calculation of each layer are saved in the accelerator. After all the training data has completed the foregoing forward calculation, backward calculation is performed layer by layer in the order fourth layer → third layer → second layer → first layer. As shown in FIG. 2, when backward calculation is performed on the training data, the backward calculation of each layer needs to use the feature values saved during the forward calculation of the corresponding layer. Therefore, each time the backward calculation of a layer is completed, the accelerator storage occupied by the feature values of that layer is released. Only after all the training data has completed the backward calculation are all the feature values saved by the accelerator completely released.
However, in the above calculation process, the forward calculation of all the training data needs to be completed first. At that point, the accelerator has to save the feature values obtained from the forward calculation of all the training data, which causes the storage occupancy of the accelerator to remain at a large value for a long time, so the training efficiency of the neural network is low.
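As a rough illustration of this prior-art ordering (a minimal PyTorch-style sketch, not taken from this application; `model`, `loss_fn`, `batches`, and `optimizer` are assumed names), all forward calculations are finished before any backward calculation starts, so the feature values of every batch stay alive at the same time:

```python
import torch

def prior_art_iteration(model, loss_fn, batches, optimizer):
    optimizer.zero_grad()
    losses = []
    for inputs, labels in batches:
        # Forward calculations only: the feature values (activations) of
        # every batch are kept in memory.
        losses.append(loss_fn(model(inputs), labels))
    for loss in losses:
        # Backward calculations start only after all forward calculations are
        # done, so activation memory begins to be released only from here.
        (loss / len(batches)).backward()
    optimizer.step()
```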
Summary of the Invention
The embodiments of this application provide a neural network training method and related device, which can keep the peak storage occupancy of an accelerator at a low value and improve the training efficiency of the neural network.
A first aspect of the embodiments of this application provides a neural network training method. The training method is applied to N accelerators, each accelerator is loaded with the same neural network, and the N accelerators train the neural network in a data-parallel manner. The training method includes: each accelerator first obtains M micro-batch data from a processor, where N×M micro-batch data form the training data, and a micro-batch data usually contains at least one sample data to be trained. In the process in which each accelerator trains the neural network based on its M micro-batch data, after each accelerator performs forward calculation on the i-th micro-batch data, it directly performs backward calculation on the forward calculation result of the i-th micro-batch data, until the backward calculation of the M micro-batch data is completed, so as to obtain the result of the backward calculation. Finally, each accelerator updates the parameters of the neural network according to the result of the backward calculation, so as to complete the training of the neural network.
Here, N ≥ 2, M ≥ 2, and i = 1, 2, ..., M.
It can be seen from the above training method that after each accelerator completes the forward calculation of the i-th micro-batch data, it immediately performs backward calculation on the forward calculation result of the i-th micro-batch data. When each accelerator starts the backward calculation, it can begin to release the feature values generated by the i-th micro-batch data during the forward calculation, until the backward calculation of the i-th micro-batch data is completed (at which point the feature values generated by the forward calculation of the i-th micro-batch data are completely released). Therefore, the peak storage occupancy of each accelerator appears when the backward calculation of the i-th micro-batch data starts, and at that moment each accelerator only needs to save the feature values generated by the forward calculation of the i-th micro-batch data. During the entire calculation process, the peak storage occupancy of each accelerator appears periodically (that is, the peak appears when the backward calculation of each micro-batch data starts) and can be kept at a low value, which can improve the training efficiency of the neural network.
In a possible implementation, the result of the backward calculation includes a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs backward calculation on the M micro-batch data respectively. Specifically, after each accelerator performs backward calculation on the forward calculation results of the M micro-batch data, M gradients can be obtained, and the M gradients are accumulated to obtain the gradient accumulation value.
In a possible implementation, each accelerator updating the parameters of the neural network according to the result of the backward calculation includes: each accelerator first performs an averaging calculation according to the gradient accumulation values corresponding to the N accelerators to obtain a target gradient accumulation value; then, each accelerator updates the parameters of the neural network according to the target gradient accumulation value. Specifically, each accelerator performs the averaging calculation based on its own gradient accumulation value and the gradient accumulation values corresponding to the other accelerators to obtain the target gradient accumulation value, and then updates the parameters of the neural network based on the target gradient accumulation value, so as to complete the training of the neural network.
In a possible implementation, the training method is also applied to a processor. Before each accelerator obtains the M micro-batch data, the training method further includes: the processor obtains the training data; the processor determines the size of the micro-batch data according to a target storage capacity threshold and the size of the training data, where, if the N accelerators are the same, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators, and if at least P of the N accelerators are different, the target storage capacity threshold is the smallest value among the storage capacity thresholds of the at least P accelerators, N ≥ P ≥ 2; and the processor divides the training data into N×M micro-batch data according to the size of the micro-batch data. Specifically, the processor can determine the optimal size of the micro-batch data based on the target storage capacity threshold and the size of the training data, and thereby divide the training data into N×M micro-batch data. Because the size of the micro-batch data is the optimal value, the storage occupancy of the feature values generated by each micro-batch data after forward calculation can be reduced, which saves storage resources of the accelerator and improves the training efficiency of the neural network.
In a possible implementation, the storage occupancy corresponding to each micro-batch data is less than or equal to the target storage capacity threshold, and the size of each micro-batch data is less than or equal to the size of the training data; the size of the micro-batch data can be determined through these two conditions.
In a possible implementation, the cluster linearity corresponding to each micro-batch data is maximized; through this condition, the optimal size of the micro-batch data can be determined.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is a non-integer, M is the value obtained by rounding the ratio up.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is an integer, M is the ratio.
A second aspect of the embodiments of this application provides a neural network training method. The training method is applied to N accelerators, each accelerator loads some layers of the neural network, the N accelerators jointly load the neural network, and the N accelerators train the neural network in a pipeline-parallel manner. The training method includes: the first accelerator among the N accelerators first obtains M micro-batch data from a processor, where the M micro-batch data form the training data. In the process in which the N accelerators train the neural network based on the M micro-batch data, after the N accelerators jointly complete the forward calculation of the i-th micro-batch data, they directly perform backward calculation on the forward calculation result of the i-th micro-batch data, until the backward calculation of the M micro-batch data is completed, so as to obtain the result of the backward calculation. The N accelerators update the parameters of the neural network according to the result of the backward calculation.
Here, N ≥ 2, M ≥ 2, and i = 1, 2, ..., M.
It can be seen from the above training method that after the N accelerators jointly complete the forward calculation of the i-th micro-batch data, they immediately perform backward calculation on the forward calculation result of the i-th micro-batch data. When each accelerator starts the backward calculation of the i-th micro-batch data, it can begin to release the feature values generated by the i-th micro-batch data during the forward calculation, until the backward calculation of the i-th micro-batch data is completed (at which point the feature values generated by the forward calculation of the i-th micro-batch data are completely released). Moreover, because the N accelerators calculate the M micro-batch data in a pipeline-parallel manner, the peak storage occupancy of each accelerator appears when the backward calculation of the first micro-batch data starts, and at that moment each accelerator only needs to save the feature values generated by the forward calculation of part of the micro-batch data. During the entire calculation process, the peak storage occupancy of each accelerator can be kept at a low value, which improves the training efficiency of the neural network.
In a possible implementation, the result of the backward calculation includes a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs backward calculation on the M micro-batch data respectively. Specifically, after each accelerator performs backward calculation on the forward calculation results of the M micro-batch data, M gradients can be obtained, and the M gradients are accumulated to obtain the gradient accumulation value.
In a possible implementation, the N accelerators updating the parameters of the neural network according to the result of the backward calculation includes: each accelerator updates, according to its corresponding gradient accumulation value, the parameters of the layers of the neural network that it loads. Specifically, each accelerator updates the parameters of its loaded layers of the neural network based on its corresponding gradient accumulation value, so as to complete the training of the neural network.
In a possible implementation, the training method is also applied to a processor. Before the N accelerators obtain the M micro-batch data, the training method further includes: the processor first obtains the training data; then the processor determines the size of the micro-batch data according to the storage capacity threshold of each accelerator and the size of the training data; finally, the processor divides the training data into M micro-batch data according to the size of the micro-batch data. Specifically, the processor can determine the optimal size of the micro-batch data based on the storage capacity threshold of each accelerator and the size of the training data, and thereby divide the training data into M micro-batch data. Because the size of the micro-batch data is the optimal value, the storage occupancy of the feature values generated by each micro-batch data after forward calculation can be reduced, which saves storage resources of the accelerator and improves the training efficiency of the neural network.
In a possible implementation, the peak storage occupancy of each accelerator is less than or equal to the storage capacity threshold of that accelerator, where the peak storage occupancy of each accelerator is the storage occupancy corresponding to the several micro-batch data whose forward calculation has already been completed by that accelerator before it performs backward calculation on the first micro-batch data, and the size of each micro-batch data is less than or equal to the size of the training data; the size of the micro-batch data can be determined through these two conditions.
In a possible implementation, the cluster linearity corresponding to each micro-batch data is maximized; through this condition, the optimal size of the micro-batch data can be determined.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is a non-integer, M is the value obtained by rounding the ratio up.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is an integer, M is the ratio.
A third aspect of the embodiments of this application provides a neural network training device. The training device includes N accelerators, each accelerator is loaded with the same neural network, and the N accelerators train the neural network in a data-parallel manner. Specifically, each accelerator is configured to obtain M micro-batch data, where N×M micro-batch data form the training data. Each accelerator is further configured to, after performing forward calculation on the i-th micro-batch data, directly perform backward calculation on the forward calculation result of the i-th micro-batch data, until the backward calculation of the M micro-batch data is completed, so as to obtain the result of the backward calculation. Each accelerator is further configured to update the parameters of the neural network according to the result of the backward calculation. Here, N ≥ 2, M ≥ 2, and i = 1, 2, ..., M.
In a possible implementation, the result of the backward calculation includes a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs backward calculation on the M micro-batch data respectively.
In a possible implementation, each accelerator is further configured to perform an averaging calculation according to the gradient accumulation values corresponding to the N accelerators to obtain a target gradient accumulation value, and each accelerator is further configured to update the parameters of the neural network according to the target gradient accumulation value.
In a possible implementation, the training device further includes a processor. The processor is configured to obtain the training data. The processor is further configured to determine the size of the micro-batch data according to a target storage capacity threshold and the size of the training data, where, if the N accelerators are the same, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators, and if at least P of the N accelerators are different, the target storage capacity threshold is the smallest value among the storage capacity thresholds of the at least P accelerators, N ≥ P ≥ 2. The processor is further configured to divide the training data into N×M micro-batch data according to the size of the micro-batch data.
In a possible implementation, the storage occupancy corresponding to each micro-batch data is less than or equal to the target storage capacity threshold, and the size of each micro-batch data is less than or equal to the size of the training data.
In a possible implementation, the cluster linearity corresponding to each micro-batch data is maximized.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is a non-integer, M is the value obtained by rounding the ratio up.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is an integer, M is the ratio.
A fourth aspect of the embodiments of this application provides a neural network training device. The training device includes N accelerators, each accelerator loads some layers of the neural network, the N accelerators jointly load the neural network, and the N accelerators train the neural network in a pipeline-parallel manner. Specifically, the first accelerator among the N accelerators is configured to obtain M micro-batch data, where the M micro-batch data form the training data. The N accelerators are configured to, after jointly completing the forward calculation of the i-th micro-batch data, directly perform backward calculation on the forward calculation result of the i-th micro-batch data, until the backward calculation of the M micro-batch data is completed, so as to obtain the result of the backward calculation. The N accelerators are further configured to update the parameters of the neural network according to the result of the backward calculation. Here, N ≥ 2, M ≥ 2, and i = 1, 2, ..., M.
In a possible implementation, the result of the backward calculation includes a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of the M gradients obtained after that accelerator performs backward calculation on the M micro-batch data respectively.
In a possible implementation, each accelerator is configured to update, according to its corresponding gradient accumulation value, the parameters of the layers of the neural network that it loads.
In a possible implementation, the training device further includes a processor. The processor is configured to obtain the training data. The processor is further configured to determine the size of the micro-batch data according to the storage capacity threshold of each accelerator and the size of the training data. The processor is further configured to divide the training data into M micro-batch data according to the size of the micro-batch data.
In a possible implementation, the peak storage occupancy of each accelerator is less than or equal to the storage capacity threshold of that accelerator, where the peak storage occupancy of each accelerator is the storage occupancy corresponding to the several micro-batch data whose forward calculation has already been completed by that accelerator before it performs backward calculation on the first micro-batch data, and the size of each micro-batch data is less than or equal to the size of the training data.
In a possible implementation, the cluster linearity corresponding to each micro-batch data is maximized.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is a non-integer, M is the value obtained by rounding the ratio up.
In a possible implementation, if the ratio between the size of the training data and the size of the micro-batch data is an integer, M is the ratio.
A fifth aspect of the embodiments of this application provides a neural network training device. The training device includes: one or more processors, a memory, a bus system, and one or more programs, where the processor and the memory are connected through the bus system; the one or more programs are stored in the memory and include instructions which, when executed by the training device, cause the training device to perform the training method according to any one of the first aspect and the second aspect.
A sixth aspect of the embodiments of this application provides a computer-readable storage medium, including instructions which, when run on a computer, cause the computer to perform the training method according to any one of the first aspect and the second aspect.
It can be seen from the above technical solutions that the embodiments of this application have the following advantages:
The embodiments of this application provide a neural network training method and related device. In the method, after each accelerator completes the forward calculation of the i-th micro-batch data, it performs backward calculation on the forward calculation result of the i-th micro-batch data, until the backward calculation of the M micro-batch data is completed. In the foregoing process, after each accelerator completes the forward calculation of the i-th micro-batch data, it immediately performs backward calculation on the forward calculation result of the i-th micro-batch data. When each accelerator starts the backward calculation, it can begin to release the feature values generated by the i-th micro-batch data during the forward calculation, until the backward calculation of the i-th micro-batch data is completed. Therefore, the peak storage occupancy of each accelerator appears when the backward calculation of the i-th micro-batch data starts, and at that moment each accelerator only needs to save the feature values generated by the forward calculation of the i-th micro-batch data. During the entire calculation process, the peak storage occupancy of each accelerator appears periodically and can be kept at a low value, which can improve the training efficiency of the neural network.
Description of the Drawings
FIG. 1 is a schematic diagram of forward calculation performed by a DNN;
FIG. 2 is a schematic diagram of backward calculation performed by a DNN;
FIG. 3 is a schematic diagram of data parallelism according to an embodiment of this application;
FIG. 4 is a schematic diagram of pipeline parallelism according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of a neural network training system according to an embodiment of this application;
FIG. 6 is a schematic flowchart of a neural network training method according to an embodiment of this application;
FIG. 7 is a schematic diagram of a first application example of the neural network training method according to an embodiment of this application;
FIG. 8 is a schematic flowchart of the first application example according to an embodiment of this application;
FIG. 9 is another schematic flowchart of the first application example according to an embodiment of this application;
FIG. 10 is another schematic flowchart of the neural network training method according to an embodiment of this application;
FIG. 11 is a schematic diagram of a second application example of the neural network training method according to an embodiment of this application;
FIG. 12 is a schematic flowchart of the second application example according to an embodiment of this application;
FIG. 13 is another schematic flowchart of the second application example according to an embodiment of this application;
FIG. 14 is a schematic diagram of the calculation process of the second application example according to an embodiment of this application;
FIG. 15 is a schematic structural diagram of a neural network training device according to an embodiment of this application;
FIG. 16 is another schematic structural diagram of the neural network training device according to an embodiment of this application;
FIG. 17 is yet another schematic structural diagram of the neural network training device according to an embodiment of this application.
Detailed Description
The embodiments of this application provide a neural network training method and related device, which can keep the peak storage occupancy of an accelerator at a low value and improve the training efficiency of the neural network. The embodiments of this application are described below with reference to the accompanying drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in the embodiments of this application are equally applicable to similar technical problems.
The terms "first", "second", and the like in the specification, the claims, and the accompanying drawings of this application are used to distinguish between similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that the terms used in this way are interchangeable where appropriate; this is merely the manner used in the embodiments of this application to distinguish between objects with the same attribute. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to the process, method, product, or device.
The embodiments of this application can be applied in the AI field. AI is a theory, method, technology, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and the like.
Generally speaking, the realization of AI includes two stages: training and inference. Training refers to training a neural network model with a large number of labeled samples, so that the neural network model can have a specific function. Inference, also called prediction, refers to using a trained neural network model to infer various conclusions from new business data.
AI parameters refer to the parameters in an AI model that are determined through AI training. In plain terms, an AI model can be regarded as a function, and the AI parameters can be regarded as the coefficients in the function. For example, if the AI model is a neural network, the AI parameters may be the weights of the convolution kernels in the neural network. For another example, if the AI model is a support vector machine, the AI parameters may be the support vectors in the support vector machine; for another example, if the AI model is a linear regression model or a logistic regression model, the AI parameters may be the coefficients in the linear regression or logistic regression model. Of course, the listed AI models are only examples; the AI model may also be another type of model, for example, one of a decision tree model, a random forest model, a belief network, a reinforcement learning model, a transfer learning model, an inductive learning model, or a combination thereof. Correspondingly, the AI parameters may also be parameters in other types of models. The embodiments of this application do not limit the specific types of the AI parameters and the AI model. AI parameters may also be referred to as neural network parameters.
The process of adjusting the AI parameters is crucial to AI calculation. Specifically, during AI calculation, business data in a data set is usually input into the AI model, and the AI model performs inference and prediction on the business data based on the AI parameters to obtain a prediction result. The AI parameters are adjusted according to the error between the prediction result and the true result, so that the error is reduced when the next inference and prediction is performed based on the adjusted AI parameters. By cyclically performing this AI parameter adjustment process, the AI parameters gradually become accurate. When the training ends, the AI model containing the accurate parameters can be used to implement accurate inference and prediction, for example, to accurately recognize the face in a face image.
With the rapid development of artificial intelligence technology, neural networks (for example, DNNs) have achieved great success in recent years in the processing and analysis of various media signals such as images, videos, and speech. A neural network may also be called an artificial neural network (ANN) or a neural-network-like model. In the fields of machine learning and cognitive science, it is a mathematical or computational model that imitates the structure and function of a biological neural network (the central nervous system of an animal, especially the brain) and is used to estimate or approximate functions. Artificial neural networks include neural networks such as the convolutional neural network (CNN), the deep neural network (DNN), and the multilayer perceptron (MLP). A neural network with excellent performance often requires a complicated training process. To complete the training of a neural network, two approaches may be used: data parallelism and pipeline parallelism. These two approaches are introduced below with reference to the prior art.
The basic idea of data parallelism is to use model replicas on multiple devices to train on the training data simultaneously, and to synchronize the model parameters across the replicas at the end of an iteration. Specifically, each device is loaded with the same neural network; after receiving its training data, each accelerator can train the neural network it has loaded based on that training data, where the device may be an accelerator and the training data is a training data subset, that is, part of the overall training data. FIG. 3 is a schematic diagram of data parallelism according to an embodiment of this application. As shown in FIG. 3, accelerator 1, accelerator 2, and accelerator 3 are provided, and accelerator 1, accelerator 2, and accelerator 3 all load the same complete neural network. After the processor allocates training data (containing multiple data) to the three accelerators, accelerator 1, accelerator 2, and accelerator 3 perform calculation on their respective training data to obtain their respective calculation results. Taking accelerator 1 as an example, accelerator 1 performs forward calculation on all the data allocated to it, and then performs backward calculation on the forward calculation results of all the data, so as to obtain the backward calculation result of accelerator 1. Similarly, accelerator 2 and accelerator 3 can perform the same operations, and details are not described here again. Based on the three backward calculation results, accelerator 1, accelerator 2, and accelerator 3 can update the parameters of the neural networks they have loaded.
Pipeline parallelism is one form of model parallelism. Pipeline parallelism usually means that each of multiple accelerators loads some of the layers of a neural network and the multiple accelerators together load the complete network; after receiving training data, each accelerator is responsible for training the parameters of its layers, and the accelerators jointly train the network on that data. Fig. 4 is a schematic diagram of pipeline parallelism provided by an embodiment of this application; the thin boxes in Fig. 4 represent forward computation of data and the thick boxes represent backward computation of data. As shown in Fig. 4, assume the neural network has three layers: accelerator 1 loads the first layer, accelerator 2 loads the second layer, accelerator 3 loads the third layer, and the accelerators are connected in sequence. The forward result of accelerator 1 is therefore input to accelerator 2, the forward result of accelerator 2 is input to accelerator 3, the backward result of accelerator 3 is input to accelerator 2, and the backward result of accelerator 2 is input to accelerator 1. After accelerator 1 receives data 1, data 2, and data 3 from the processor, forward and backward computation can be performed on the three data items. Specifically, after data 1 passes through the forward computation of accelerator 1, accelerator 2, and accelerator 3 in turn, the forward result of data 1 is obtained, that is, data 1 as forward-computed by accelerator 3. It should be noted that while accelerator 2 performs forward computation on data 1 (already forward-computed by accelerator 1), accelerator 1 can simultaneously perform forward computation on data 2; by analogy, the forward results of data 1, data 2, and data 3 are obtained. After the forward computation is completed, backward computation is performed on the forward results of the three data items. It should be understood that backward computation is the inverse process of forward computation; reference may be made to the foregoing description of forward computation, which is not repeated here. After the backward computation of the three data items is completed, accelerator 1 updates the parameters of the first layer based on its backward results, accelerator 2 updates the parameters of the second layer based on its backward results, and accelerator 3 updates the parameters of the third layer based on its backward results.
To improve the training efficiency of a neural network, this application provides a neural network training method. The training method can be applied to a neural network training system. Fig. 5 is a schematic structural diagram of the neural network training system provided by an embodiment of this application. As shown in Fig. 5, the system includes multiple training apparatuses 501, and the training apparatuses 501 communicate with one another through a switch 502. Each training apparatus 501 includes a central processing unit (CPU), hereinafter referred to as processor 5011, and multiple accelerators 5012. The accelerator 5012 may be implemented by an acceleration device such as a graphics processing unit (GPU) or a field programmable gate array (FPGA), which is not limited here. The processor 5011 sends sample data for training the neural network to the accelerators 5012, and the accelerators 5012 train the neural network based on the sample data.
Fig. 6 is a schematic flowchart of the neural network training method provided by an embodiment of this application. Referring to Fig. 6, the training method is applied to a training apparatus that includes a processor and N accelerators. In the training apparatus, the processor provides each accelerator with data for neural network training, each accelerator loads the same neural network, and the N accelerators train the neural network in a data-parallel manner. The training method includes:
601. Each accelerator obtains M micro-batches of data.
When the neural network needs to be trained, the processor first obtains the training data and divides it into N×M micro-batches (microbatch), each micro-batch containing at least one sample to be trained. The processor then sends M micro-batches to each accelerator, where N≥2 and M≥2.
602. After performing forward computation on the i-th micro-batch, each accelerator directly performs backward computation on the forward result of the i-th micro-batch, until the backward computation of all M micro-batches is completed and the backward computation result is obtained.
After each accelerator receives the M micro-batches from the processor, training of the neural network begins. Specifically, each accelerator performs forward computation on the i-th micro-batch to obtain the forward result of the i-th micro-batch, and then performs backward computation on that forward result. It then performs forward and backward computation on the (i+1)-th micro-batch, and so on, until the backward computation of all M micro-batches is completed and the backward computation result is obtained, where i = 1, 2, ..., M.
For ease of description, in this embodiment, an accelerator "completing the backward computation of one (or more) micro-batches" should be understood as the accelerator having completed the forward computation of that micro-batch and having completed the backward computation of its forward result. Similarly, an accelerator "performing backward computation on one (or more) micro-batches" should be understood as the accelerator having completed the forward computation of that micro-batch and performing backward computation on its forward result; this is not repeated below.
Further, the backward computation result may include a gradient accumulation value for each accelerator; the gradient accumulation value of an accelerator is the sum of the M gradients obtained after that accelerator performs backward computation on the M micro-batches. Specifically, while training the neural network on the M micro-batches, each accelerator performs forward and backward computation on the i-th micro-batch to obtain the i-th gradient, then performs forward and backward computation on the (i+1)-th micro-batch to obtain the (i+1)-th gradient, until the backward computation of all M micro-batches is completed and M gradients are obtained. Each accelerator then accumulates the M gradients to obtain its gradient accumulation value. It is worth noting that when an accelerator completes the forward computation of a micro-batch, it stores the feature values produced during that forward computation. When the accelerator starts the backward computation of that micro-batch, it begins to release the feature values produced by the forward computation (since the backward computation needs those feature values), and by the time the backward computation of that micro-batch is completed, those feature values are fully released, that is, the storage they occupied (the storage footprint corresponding to that micro-batch) is freed.
For example, after an accelerator in the training apparatus completes the forward computation of the 1st micro-batch, it immediately performs backward computation on the 1st micro-batch. After the backward computation of the 1st micro-batch is completed, the 1st gradient is obtained, and forward computation is then performed on the 2nd micro-batch. After the forward computation of the 2nd micro-batch is completed, backward computation is immediately performed on it to obtain the 2nd gradient. By analogy, once the backward computation of the M-th micro-batch is completed, M gradients are obtained. Finally, the accelerator sums the M gradients to obtain its gradient accumulation value. It is worth noting that the training apparatus may also include other accelerators, which perform the same process to obtain their respective gradient accumulation values; this is not repeated here.
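As an illustration of step 602 on a single accelerator, the following minimal sketch performs the forward pass of each micro-batch, immediately runs the backward pass, releases the stored feature values, and accumulates the M gradients. The functions forward and backward and the arithmetic inside them are hypothetical stand-ins for the real layer computations and are not defined by this application.

    def forward(params, micro_batch):
        # Feature values produced by the forward pass must be kept until the
        # backward pass of this micro-batch; they are the main storage cost.
        activations = [x * p for x in micro_batch for p in params]
        loss = sum(activations)
        return loss, activations

    def backward(params, activations):
        # Dummy gradient; a real backward pass consumes the stored activations.
        return [sum(activations)] * len(params)

    def train_one_iteration(params, micro_batches):
        grad_accum = [0.0] * len(params)
        for mb in micro_batches:                      # i = 1 .. M
            loss, acts = forward(params, mb)          # store feature values
            grad = backward(params, acts)             # backward starts right after forward
            acts = None                               # feature values of this micro-batch released
            grad_accum = [a + g for a, g in zip(grad_accum, grad)]
        return grad_accum                             # sum of the M gradients

Because the backward pass of micro-batch i starts before micro-batch i+1 is forward-computed, at most one micro-batch's feature values are held at any time in this sketch, which is the storage behavior described in this embodiment.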
603. Each accelerator updates the parameters of the neural network according to the backward computation result.
Each accelerator first averages its own gradient accumulation value with the gradient accumulation values of the other accelerators to obtain the target gradient accumulation value that is ultimately used to update the neural network. Each accelerator then updates the parameters of the neural network according to the target gradient accumulation value.
The gradient accumulation values of the accelerators may differ (each accelerator loads the same neural network, but the sample data in the micro-batches received by each accelerator may differ, leading to different computation results). To achieve the most effective training, each accelerator averages all the gradient accumulation values to obtain the same target gradient accumulation value. In this way, all accelerators update the same neural network based on the same target gradient accumulation value, completing the training of the neural network.
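The averaging in step 603 can be pictured with the small sketch below. In practice the exchange of gradient accumulation values between accelerators would typically be done with a collective communication primitive (for example, an all-reduce); the plain loop here only illustrates the arithmetic, and all concrete numbers are made up for illustration.

    def target_gradient(accum_values):
        # Average the N gradient-accumulation values element-wise; every
        # accelerator computes the same target value from the same inputs.
        n = len(accum_values)
        dims = len(accum_values[0])
        return [sum(v[d] for v in accum_values) / n for d in range(dims)]

    accum_per_accelerator = [[4.0, -2.0], [6.0, -1.0], [2.0, -3.0]]   # from accelerators 1..N
    target = target_gradient(accum_per_accelerator)                   # [4.0, -2.0]
    lr = 0.01
    params = [1.0, 1.0]
    params = [p - lr * g for p, g in zip(params, target)]             # identical update everywhere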
In addition, the size of the micro-batches can be set so as to save the accelerators' storage resources. Before the processor sends the M micro-batches to each accelerator, the training method may further include:
The processor first obtains the training data; it should be noted that here the training data is the set of all sample data input to one accelerator, and the size of the training data is greater than or equal to the size of a micro-batch. The processor then determines the micro-batch size according to a target storage capacity threshold and the size of the training data. If the N accelerators are identical (that is, the storage capacity thresholds of the N accelerators are all the same), the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators; if at least P of the N accelerators differ (that is, the storage capacity thresholds of at least P accelerators differ), the target storage capacity threshold is the minimum of the storage capacity thresholds of those at least P accelerators, where N≥P≥2. Finally, the processor divides the training data into N×M micro-batches according to the micro-batch size.
The micro-batch size should satisfy the following conditions: (1) the storage footprint corresponding to the micro-batch size is less than or equal to the target storage capacity threshold; (2) the micro-batch size is less than or equal to the size of the training data; (3) the cluster linearity corresponding to the micro-batch size is maximal.
Further, if the ratio of the training data size to the micro-batch size is not an integer, the number M of micro-batches is that ratio rounded up; if the ratio is an integer, M is that ratio.
Through the above process, the micro-batch size can be set to an optimal value so as to reduce the storage footprint of the feature values produced by a micro-batch during forward computation, which further saves the accelerators' storage resources and improves the training efficiency of the neural network.
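A sketch of how a processor could search for a micro-batch size satisfying conditions (1) to (3) is given below. The helper functions memory_of and linearity_of stand for the measured (or estimated) per-micro-batch storage footprint and cluster linearity; they, and the example values used in the call, are assumptions introduced for illustration (the dummy numbers are chosen to match the first application example that follows), not an API defined by this application.

    import math

    def choose_micro_batch_size(candidates, train_size, cmax, memory_of, linearity_of):
        feasible = [s for s in candidates
                    if memory_of(s) <= cmax and s <= train_size]   # conditions (1) and (2)
        best = max(feasible, key=linearity_of)                     # condition (3)
        m = math.ceil(train_size / best)                           # round up when not an exact divisor
        return best, m

    size, m = choose_micro_batch_size(
        candidates=[256, 128, 64, 32, 16],
        train_size=256,
        cmax=16,                                   # GByte
        memory_of=lambda s: s / 4,                 # dummy: 64 -> 16 GByte, 32 -> 8 GByte
        linearity_of=lambda s: {64: 0.727, 32: 0.762, 16: 0.70}.get(s, 0.0))
    # size == 32, m == 8, matching the first application example below.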
In this embodiment, after completing the forward computation of the i-th micro-batch, each accelerator immediately performs backward computation on the forward result of the i-th micro-batch. When an accelerator starts the backward computation, it can begin releasing the feature values produced by the forward computation of the i-th micro-batch, until the backward computation of the i-th micro-batch is completed (at which point those feature values are fully released). Therefore, the peak storage usage of each accelerator occurs at the moment the backward computation of the i-th micro-batch begins, and at that moment each accelerator only needs to hold the feature values produced by the forward computation of the i-th micro-batch. Throughout the computation, the peak storage usage of each accelerator appears periodically (that is, the peak appears at the start of the backward computation of each micro-batch) and remains low, which improves the training efficiency of the neural network.
To facilitate understanding, an application example is provided below to further describe the training method shown in Fig. 6. Fig. 7 is a schematic diagram of a first application example of the neural network training method provided by an embodiment of this application. Referring to Fig. 7, the training apparatus used to train the target neural network is provided with a processor, GPU1, GPU2, GPU3, and GPU4. The same target neural network is loaded on GPU1, GPU2, GPU3, and GPU4; the target neural network has a multi-layer structure in which the size and computation time of each layer are uniform.
Assume the externally input training data contains 1024 samples. After the processor determines that there are 1024 samples in total, and since GPU1, GPU2, GPU3, and GPU4 train in a data-parallel manner, the processor first determines the size of the batch each GPU is responsible for, 1024/4 = 256, and divides the training data evenly into 4 batches. The processor therefore provides each GPU with 256 samples (one batch), so that each GPU trains the target neural network based on the samples it is responsible for.
To save GPU storage resources, a batch can be further divided into multiple micro-batches. The micro-batch size must first be determined. Assume GPU1, GPU2, GPU3, and GPU4 are GPUs of identical performance; the process of determining the micro-batch size is described below with reference to Fig. 8, taking GPU1 as an example. Fig. 8 is a schematic flowchart of the first application example provided by an embodiment of this application. As shown in Fig. 8, the process includes:
S1: The processor determines the video memory capacity threshold Cmax of GPU1 and the batch size.
S2: The processor selects a micro-batch size according to the batch size and determines the video memory footprint C1 of that micro-batch on GPU1.
S3: The processor judges whether C1 ≤ Cmax is satisfied; if not, S2 is executed again, and if so, S4 is executed.
S4: Under the condition C1 ≤ Cmax, the processor determines all candidate values of the micro-batch size and, among them, takes the micro-batch size with the largest cluster linearity L as the final choice.
Specifically, the processor determines that the video memory capacity threshold of GPU1 is Cmax = 16 GByte and that the batch size is 256.
The processor first selects a micro-batch size of 256; with a micro-batch size of 256, the video memory footprint of the micro-batch on GPU1 is C1 = 64 GByte. Since C1 ≤ Cmax is not satisfied, the processor sets the micro-batch size to 128; with a micro-batch size of 128, the footprint on GPU1 is C1 = 32 GByte, which still does not satisfy C1 ≤ Cmax. The processor then sets the micro-batch size to 64, at which point the footprint on GPU1 is C1 = 16 GByte, satisfying C1 ≤ Cmax. At this point, the number of micro-batches GPU1 needs to compute is 256/64 = 4.
Once the micro-batch size of 64 satisfies C1 ≤ Cmax, the cluster linearity corresponding to that size can be computed. Specifically, with a micro-batch size of 64, the corresponding computation time is T1 = 32 ms and the remaining time (for example, the time for feature value transmission, parameter updates, and so on) is T2 = 12 ms, so the cluster linearity for a micro-batch size of 64 is L = T1/(T1+T2) = 0.727. The processor further sets the micro-batch size to 32, which still satisfies C1 ≤ Cmax, and computes its cluster linearity L = 0.762. In the same way, the processor can continue to compute the cluster linearity L corresponding to the remaining candidate sizes.
Assume that, among all the candidate values, the cluster linearity L is largest when the micro-batch size is 32. The processor therefore finally determines the micro-batch size to be 32, in which case the number of micro-batches GPU1 needs to compute is 256/32 = 8.
It should be understood that the processor also performs the process of S1-S4 for GPU2, GPU3, and GPU4. Since GPU1, GPU2, GPU3, and GPU4 are GPUs of identical performance, the micro-batch size finally determined for each GPU is 32 and the number of micro-batches is 8.
After the micro-batch size is determined, GPU1, GPU2, GPU3, and GPU4 start computing in a data-parallel manner. The computation process is described below with reference to Fig. 9. Fig. 9 is another schematic flowchart of the first application example provided by an embodiment of this application. As shown in Fig. 9, the process includes:
W1: Perform forward computation on the 1st micro-batch and store the feature values produced by the forward computation.
W2: After the forward computation of the 1st micro-batch ends, perform backward computation on the forward result of the 1st micro-batch and begin releasing the video memory occupied by the 1st micro-batch (that is, begin releasing the feature values produced by its forward computation). When the backward computation of the 1st micro-batch ends, the video memory occupied by the 1st micro-batch is fully released and the 1st gradient is obtained.
W3: Perform forward and backward computation on the 2nd micro-batch to obtain the 2nd gradient; for the computation of the 2nd micro-batch, refer to W1 and W2, which are not repeated here. By analogy, once all 8 micro-batches have completed forward and backward computation, 8 gradients are obtained and accumulated to obtain the gradient accumulation value.
W4: Update the target neural network according to the gradient accumulation value.
Since every GPU performs steps W1-W3, every GPU obtains its corresponding gradient accumulation value. After the gradient accumulation values are obtained, the neural network can be updated. Specifically, each GPU first averages its own gradient accumulation value with the gradient accumulation values of the other GPUs to obtain the target gradient accumulation value ultimately used to update the neural network, and then updates the parameters of the neural network according to that target value. For example, GPU1 averages its own gradient accumulation value with those of GPU2, GPU3, and GPU4 to obtain the target gradient accumulation value; GPU2, GPU3, and GPU4 likewise obtain the same target value. Finally, GPU1, GPU2, GPU3, and GPU4 update the parameters of their loaded neural networks according to the target gradient accumulation value.
In this application example, each micro-batch is forward-computed during training and then immediately backward-computed; the forward computation of the next micro-batch starts only after the backward computation of the current micro-batch is completed. The peak video memory usage in this application example therefore occurs at the moment the backward computation of any micro-batch begins, at which time the accelerator only needs to hold all the feature values produced by the forward computation of a single micro-batch. Throughout the computation, the peak video memory usage appears periodically until the forward and backward computation of all micro-batches is completed. Moreover, when the peak occurs, only the feature values produced by the forward computation of one micro-batch need to be kept on the accelerator, keeping the peak video memory usage low and improving the training efficiency of the neural network.
Fig. 10 is another schematic flowchart of the neural network training method provided by an embodiment of this application. Referring to Fig. 10, the training method is applied to a training apparatus that includes a processor and N accelerators. In the training apparatus, the processor provides each accelerator with data for neural network training; each accelerator loads some of the layers of the neural network, the N accelerators together load the complete network, and the N accelerators train the neural network in a pipeline-parallel manner. After receiving the samples to be trained, the N accelerators jointly train the network on those samples. For example, the training apparatus is provided with three accelerators and the neural network has 15 layers: accelerator 1 loads layers 1 to 5, accelerator 2 loads layers 6 to 10, and accelerator 3 loads layers 11 to 15, and accelerator 1, accelerator 2, and accelerator 3 train the neural network in a pipeline-parallel manner. The training method includes:
1001. The 1st of the N accelerators obtains M micro-batches of data.
When the neural network needs to be trained, the processor first obtains the training data and divides it into M micro-batches, each containing at least one sample to be trained. The processor then sends the M micro-batches to the 1st of the N accelerators. Although the training apparatus contains N accelerators, the N accelerators act as a whole (since they jointly load one neural network) and the 1st accelerator is the input of that whole, so the processor only needs to prepare M micro-batches and send them to the 1st accelerator, where N≥2 and M≥2.
1002. After the N accelerators jointly complete the forward computation of the i-th micro-batch, they directly perform backward computation on the forward result of the i-th micro-batch, until the backward computation of all M micro-batches is completed and the backward computation result is obtained.
When the N accelerators receive the M micro-batches from the processor, training of the neural network begins. Specifically, once the N accelerators have jointly completed the forward computation of the i-th micro-batch, backward computation is performed on the forward result of the i-th micro-batch (that is, the i-th micro-batch as forward-computed by the N-th accelerator), until the backward computation of all M micro-batches is completed and the backward computation result is obtained, where M≥2 and i = 1, 2, ..., M.
For ease of description, in this embodiment, "the j-th accelerator performs forward computation on the i-th micro-batch" should be understood as the j-th accelerator performing forward computation on the i-th micro-batch as already forward-computed by the (j-1)-th accelerator. Likewise, "the j-th accelerator completes the forward computation of the i-th micro-batch" should be understood as the j-th accelerator completing the forward computation of the i-th micro-batch as already forward-computed by the (j-1)-th accelerator. Similarly, "the k-th accelerator performs backward computation on the i-th micro-batch" should be understood as the k-th accelerator performing backward computation on the i-th micro-batch as already backward-computed by the (k+1)-th accelerator, and "the k-th accelerator completes the backward computation of the i-th micro-batch" should be understood as the k-th accelerator completing the backward computation of the i-th micro-batch as already backward-computed by the (k+1)-th accelerator, where j = 2, ..., N and k = 1, ..., N-1. In addition, "the N-th accelerator performs backward computation on the i-th micro-batch (or on the forward result of the i-th micro-batch)" should be understood as the N-th accelerator performing backward computation on the i-th micro-batch as already forward-computed by the N-th accelerator; this is not repeated below.
Further, the backward computation result may include a gradient accumulation value for each accelerator; the gradient accumulation value of an accelerator is the sum of the M gradients obtained after that accelerator performs backward computation on the M micro-batches. For ease of understanding, the above example is used again. After receiving the M micro-batches, accelerator 1 performs forward computation on the 1st micro-batch. After completing the forward computation, accelerator 1 sends the computed 1st micro-batch to accelerator 2 so that accelerator 2 performs forward computation on it. By analogy, once accelerator 3 completes the forward computation of the 1st micro-batch, accelerator 3 starts backward computation on the 1st micro-batch. After completing the backward computation, accelerator 3 obtains the 1st gradient and sends the 1st micro-batch, as backward-computed by accelerator 3, to accelerator 2 so that accelerator 2 performs backward computation on it. After accelerator 2 and accelerator 1 complete the backward computation of the 1st micro-batch, they each obtain the 1st gradient as well. In the same way, the three accelerators perform the above computation on the 2nd through M-th micro-batches, so accelerator 1 obtains M gradients and accumulates them to obtain its gradient accumulation value; accelerator 2 and accelerator 3 likewise each obtain M gradients and obtain their gradient accumulation values through accumulation.
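The relay of a micro-batch through the pipeline stages in steps 1001-1002 can be sketched as follows. For clarity the schedule is written serially, one micro-batch at a time, so it does not show the overlap in which earlier stages already forward-compute later micro-batches while a current micro-batch is still in flight; stage_forward and stage_backward and their arithmetic are hypothetical placeholders introduced here, not functions defined by this application.

    def stage_forward(stage_id, x):
        return x + stage_id                        # dummy per-stage activation

    def stage_backward(stage_id, grad_in):
        # Returns (gradient passed to the previous stage, this stage's local gradient).
        return grad_in * 0.5, grad_in

    def pipeline_iteration(num_stages, micro_batches):
        grad_accum = [0.0] * num_stages
        for mb in micro_batches:                           # i = 1 .. M
            acts = mb
            for j in range(1, num_stages + 1):             # forward through stages 1..N
                acts = stage_forward(j, acts)
            grad = 1.0                                      # loss gradient at the last stage
            for k in range(num_stages, 0, -1):              # backward through stages N..1
                grad, local = stage_backward(k, grad)
                grad_accum[k - 1] += local                  # per-stage gradient accumulation
        return grad_accum                                   # each stage updates its own layers

    print(pipeline_iteration(3, [0.1, 0.2, 0.3]))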
It is worth noting that when an accelerator completes the forward computation of a micro-batch, it stores the feature values produced during that forward computation. When the accelerator starts the backward computation of that micro-batch, it begins to release the feature values produced by the forward computation (since the backward computation needs those feature values), and by the time the backward computation of that micro-batch is completed, those feature values are fully released, that is, the storage they occupied is freed.
Continuing the above example, when accelerator 3 performs backward computation on the 1st micro-batch, it has only completed the forward computation of the 1st micro-batch, so accelerator 3 holds the feature values produced by the forward computation of the 1st micro-batch. When accelerator 2 performs backward computation on the 1st micro-batch, assume it has completed the forward computation of 3 micro-batches (while accelerator 3 performs forward and backward computation on the 1st micro-batch, accelerator 2 can simultaneously perform forward computation on the remaining micro-batches, for example the 2nd and 3rd micro-batches), so accelerator 2 holds the feature values produced by the forward computation of these 3 micro-batches. When accelerator 1 performs backward computation on the 1st micro-batch, assume it has completed the forward computation of 5 micro-batches, so accelerator 1 holds the feature values produced by the forward computation of these 5 micro-batches. Therefore, the peak storage usages of accelerator 1, accelerator 2, and accelerator 3 all occur at the moment the backward computation of the 1st micro-batch begins, and the peak of accelerator 1 is greater than the peak of accelerator 2, which is greater than the peak of accelerator 3.
1003. The N accelerators update the parameters of the neural network according to the backward computation result.
Each accelerator updates the layers of the neural network it has loaded according to its gradient accumulation value. Continuing the above example, accelerator 1 updates the parameters of layers 1 to 5 according to its gradient accumulation value, accelerator 2 updates the parameters of layers 6 to 10 according to its gradient accumulation value, and accelerator 3 updates the parameters of layers 11 to 15 according to its gradient accumulation value.
In addition, the size of the micro-batches can be set so as to save the accelerators' storage resources. Before the processor sends the M micro-batches to the 1st of the N accelerators, the training method may further include: the processor first obtains the training data; the processor then determines the micro-batch size according to the storage capacity threshold of each accelerator and the size of the training data; finally, the processor divides the training data into M micro-batches according to the micro-batch size.
The micro-batch size should satisfy the following conditions: (1) the peak storage usage of each accelerator is less than or equal to that accelerator's storage capacity threshold, where the peak storage usage of an accelerator is the storage footprint of the micro-batches whose forward computation that accelerator has completed before it performs backward computation on the 1st micro-batch. Continuing the above example, when accelerator 3 performs backward computation on the 1st micro-batch, it has only completed the forward computation of the 1st micro-batch, so the storage footprint of the 1st micro-batch (that is, the peak storage usage of accelerator 3) should be less than or equal to accelerator 3's storage capacity threshold; likewise, when accelerator 2 performs backward computation on the 1st micro-batch, it has completed the forward computation of 3 micro-batches, so the storage footprint of those 3 micro-batches (that is, the peak storage usage of accelerator 2) should be less than or equal to accelerator 2's storage capacity threshold, and so on; (2) the micro-batch size is less than or equal to the size of the training data; (3) the cluster linearity corresponding to the micro-batch size is maximal.
Further, if the ratio of the training data size to the micro-batch size is not an integer, the number M of micro-batches is that ratio rounded up; if the ratio is an integer, M is that ratio.
Through the above process, the micro-batch size can be set to an optimal value so as to reduce the storage footprint of the feature values produced by a micro-batch during forward computation, which further saves the accelerators' storage resources and improves the training efficiency of the neural network.
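The per-accelerator peak check of condition (1) can be sketched as follows. Here in_flight gives, for each accelerator, the number of micro-batches it has forward-computed before its first backward pass (for example 7, 5, 3, 1 in the second application example below), and memory_of is an assumed per-micro-batch footprint function; neither is an API defined by this application, and the concrete values are illustrative only.

    def feasible_for_pipeline(size, train_size, in_flight, thresholds, memory_of):
        if size > train_size:                                  # condition (2)
            return False
        c1 = memory_of(size)                                   # footprint of one micro-batch
        return all(n * c1 <= cmax                              # condition (1), per accelerator
                   for n, cmax in zip(in_flight, thresholds))

    # Illustrative values matching the second application example: 4 GPUs with
    # 64 GByte each and a dummy footprint of size/4 GByte per micro-batch.
    ok_64 = feasible_for_pipeline(64, 256, [7, 5, 3, 1], [64] * 4, lambda s: s / 4)   # False
    ok_32 = feasible_for_pipeline(32, 256, [7, 5, 3, 1], [64] * 4, lambda s: s / 4)   # True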
In this embodiment, after the N accelerators complete the forward computation of the i-th micro-batch, backward computation is immediately performed on the forward result of the i-th micro-batch. When an accelerator starts the backward computation of the i-th micro-batch, it can begin releasing the feature values produced by the forward computation of that micro-batch, until the backward computation of the i-th micro-batch is completed (at which point those feature values are fully released). Therefore, the peak storage usage of each accelerator occurs at the moment the backward computation of the 1st micro-batch begins, and at that moment each accelerator only needs to hold the feature values produced by the forward computation of some of the micro-batches. Throughout the computation, the peak storage usage of each accelerator remains low, which improves the training efficiency of the neural network.
To facilitate understanding, an application example is provided below to further describe the training method shown in Fig. 10. Fig. 11 is a schematic diagram of a second application example of the neural network training method provided by an embodiment of this application. Referring to Fig. 11, the training apparatus used to train the neural network is provided with a processor, GPU1, GPU2, GPU3, and GPU4. The neural network has a 32-layer structure in which the size and computation time of each layer are uniform. GPU1 loads layers 1 to 8 of the neural network, GPU2 loads layers 9 to 16, GPU3 loads layers 17 to 24, and GPU4 loads layers 25 to 32.
Assume the externally input training data contains 256 samples. Since GPU1, GPU2, GPU3, and GPU4 train in a pipeline-parallel manner, the processor sends the training data to GPU1 (GPU1 serving as the input port of the entire target neural network), so that the four GPUs train the target neural network based on the training data.
To save GPU storage resources, the training data can be further divided into multiple micro-batches. The micro-batch size must first be determined. Assume GPU1, GPU2, GPU3, and GPU4 are GPUs of identical performance. For the description of determining the micro-batch size, reference may be made to the relevant part of the first application example, which is not repeated here. It should be noted that, since GPU1, GPU2, GPU3, and GPU4 are regarded as a whole and are GPUs of identical performance, the processor only needs to perform the following process for GPU1 to determine the micro-batch size. The process of determining the micro-batch size is described below with reference to Fig. 12. Fig. 12 is a schematic flowchart of the second application example provided by an embodiment of this application. As shown in Fig. 12, the determination process includes:
T1: The processor determines the video memory capacity threshold Cmax of GPU1 and the size of the training data.
T2: The processor selects a micro-batch size according to the size of the training data and determines the video memory footprint C1 of that micro-batch on GPU1.
T3: The processor judges whether the peak video memory usage of GPU1 (in this example 7×C1, the footprint of the micro-batches whose forward computation GPU1 has completed before its first backward computation) satisfies 7×C1 ≤ Cmax; if not, T2 is executed again, and if so, T4 is executed.
T4: Under the condition 7×C1 ≤ Cmax, the processor determines all candidate values of the micro-batch size and, among them, takes the micro-batch size with the largest cluster linearity L as the final choice.
Specifically, the processor determines that the video memory capacity threshold of the GPU is Cmax = 64 GByte and that the size of the training data is 256.
The processor first selects a micro-batch size of 256; with a micro-batch size of 256, the video memory footprint of the micro-batch on the GPU is C1 = 64 GByte. Since GPU1 has completed the forward computation of 7 micro-batches before it performs backward computation on the 1st micro-batch, and 7×C1 ≤ Cmax is not satisfied, the processor sets the micro-batch size to 128; with a micro-batch size of 128, the footprint on GPU1 is C1 = 32 GByte, which still does not satisfy 7×C1 ≤ Cmax. By analogy, the processor eventually sets the micro-batch size to 32, at which point the footprint on GPU1 is C1 = 8 GByte, satisfying 7×C1 ≤ Cmax. At this point, the number of micro-batches GPU1 needs to compute is 256/32 = 8.
Once the micro-batch size of 32 satisfies 7×C1 ≤ Cmax, the cluster linearity corresponding to that size can be computed. Specifically, with a micro-batch size of 32, the corresponding computation time is T1 = 32 ms and the remaining time (for example, the time for feature value transmission, parameter updates, and so on) is T2 = 10 ms, so the cluster linearity for a micro-batch size of 32 is L = T1/(T1+T2) = 0.762. The processor further evaluates a smaller micro-batch size (for example, 16), which still satisfies 7×C1 ≤ Cmax, and computes its cluster linearity L = 0.726. In the same way, the processor can continue to compute the cluster linearity L corresponding to the remaining candidate sizes.
Assume that, among all the candidate values, the cluster linearity L is largest when the micro-batch size is 32. The processor therefore finally determines the micro-batch size to be 32, in which case the number of micro-batches GPU1 needs to compute is 256/32 = 8.
After the size and number of the micro-batches are determined, the processor sends the 8 micro-batches to GPU1, so that GPU1, GPU2, GPU3, and GPU4 start computing in a pipeline-parallel manner. The computation process is described below with reference to Fig. 13 and Fig. 14. Fig. 13 is another schematic flowchart of the second application example provided by an embodiment of this application, and Fig. 14 is a schematic diagram of the computation process of the second application example provided by an embodiment of this application. It should be noted that, for ease of illustration, the thin boxes in Fig. 14 represent the forward computation of a micro-batch, the thick boxes represent the backward computation of a micro-batch, and the micro-batches are labeled MB, for example, the 1st micro-batch is MB1, the 2nd micro-batch is MB2, and so on. As shown in Fig. 13 and Fig. 14, the computation process is as follows:
P1: GPU1 performs forward computation on the 1st micro-batch and stores the feature values produced by the forward computation in video memory.
P2: GPU1 sends the computed 1st micro-batch to GPU2, so that GPU2 performs forward computation on the 1st micro-batch (while GPU2 performs forward computation on the 1st micro-batch, GPU1 simultaneously performs forward computation on the 2nd micro-batch). By analogy, once GPU4 completes the forward computation of the 1st micro-batch, it can perform backward computation on the 1st micro-batch, while the other GPUs are still performing forward computation on the remaining micro-batches.
P3: GPU4 starts the backward computation of the 1st micro-batch and begins releasing the video memory occupied by the 1st micro-batch on GPU4 (that is, begins releasing the feature values produced by the forward computation of the 1st micro-batch). When GPU4 finishes the backward computation of the 1st micro-batch, GPU4 obtains the 1st gradient and sends the 1st micro-batch to GPU3, so that GPU3 performs backward computation on the 1st micro-batch (at this point GPU3 has completed the forward computation of the 3rd micro-batch). When GPU3 finishes the backward computation of the 1st micro-batch, GPU3 obtains the 1st gradient and sends the 1st micro-batch to GPU2, so that GPU2 performs backward computation on the 1st micro-batch (at this point GPU2 has completed the forward computation of the 5th micro-batch). And so on, until GPU1 finishes the backward computation of the 1st micro-batch and obtains the 1st gradient.
P4: Once all GPUs have completed the backward computation of the 8 micro-batches, each GPU obtains 8 gradients and accumulates them to obtain its gradient accumulation value.
P5: After each GPU obtains its gradient accumulation value, it updates the parameters of the layers it has loaded.
For example, GPU1 updates the parameters of layers 1 to 8 of the neural network according to its gradient accumulation value, GPU2 updates the parameters of layers 9 to 16 according to its gradient accumulation value, GPU3 updates the parameters of layers 17 to 24 according to its gradient accumulation value, and GPU4 updates the parameters of layers 25 to 32 according to its gradient accumulation value.
As can be seen from Fig. 14, before GPU1 starts the backward computation of the 1st micro-batch, it has completed the forward computation of 7 micro-batches; before GPU2 starts the backward computation of the 1st micro-batch, it has completed the forward computation of 5 micro-batches; before GPU3 starts, it has completed the forward computation of 3 micro-batches; and before GPU4 starts, it has completed the forward computation of 1 micro-batch. For each of the 4 GPUs, the peak video memory usage occurs at the moment the backward computation of the 1st micro-batch begins (that is, at the arrow in the figure; from that moment the usage gradually decreases until the backward computation of the 1st micro-batch is completed, after which the peak appears periodically), and no GPU needs to keep the feature values produced by the forward computation of all micro-batches, so the peak video memory usage of each GPU is kept low (in contrast to the prior art shown in Fig. 4, in which each GPU must keep the feature values produced by the forward computation of all micro-batches, as indicated by the arrow in Fig. 4), improving the training efficiency of the neural network.
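Reading the counts 7, 5, 3, 1 from Fig. 14, the peak video memory usage of the pipeline stages can be summarized by the relation below; the closed form for n_j is an observation inferred from those counts for N = 4, not a formula stated explicitly in this application.

    \[
      n_j = 2\,(N - j) + 1, \qquad \mathrm{peak}_j \approx n_j \cdot C_1, \qquad j = 1, \dots, N,
    \]

so for N = 4 the peaks are approximately 7·C1 > 5·C1 > 3·C1 > 1·C1, which is consistent with GPU1's peak exceeding GPU2's, and so on.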
The above is a detailed description of the neural network training method provided in the embodiments of this application; the neural network training apparatus provided in the embodiments of this application is described below. Fig. 15 is a schematic structural diagram of a neural network training apparatus provided in an embodiment of this application. Referring to Fig. 15, the training apparatus includes a processor 1501 and N accelerators 1502.
Each accelerator 1502 loads the same neural network, and the N accelerators 1502 train the neural network in a data-parallel manner.
Each accelerator 1502 is configured to obtain M micro-batches of data from the processor 1501, where N×M micro-batches of data make up the training data.
Each accelerator 1502 is further configured to, after performing forward computation on the i-th micro-batch, directly perform backward computation on the forward computation result of the i-th micro-batch, until backward computation on the M micro-batches is completed, so as to obtain a backward computation result.
Each accelerator 1502 is further configured to update the parameters of the neural network according to the backward computation result, where N≥2, M≥2, and i=1, 2, ..., M.
In a possible implementation, the backward computation result includes a gradient accumulation value corresponding to each accelerator 1502, and the gradient accumulation value corresponding to each accelerator 1502 is the sum of the M gradients obtained after that accelerator 1502 separately performs backward computation on the M micro-batches.
In a possible implementation, each accelerator 1502 is further configured to average the gradient accumulation values corresponding to the N accelerators 1502 to obtain a target gradient accumulation value, and to update the parameters of the neural network according to the target gradient accumulation value.
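As a rough, self-contained illustration of this data-parallel flow, the plain-Python/NumPy sketch below simulates N accelerators that each process M micro-batches (forward followed immediately by backward), sum their M gradients into a gradient accumulation value, average across accelerators, and apply the same update. The gradient function, learning rate, and toy data are placeholders introduced only for the example; in a real system each accelerator would run in parallel and the averaging would use a collective communication step rather than a Python loop.

```python
import numpy as np

def train_step_data_parallel(params, micro_batches_per_acc, grad_fn, lr=0.01):
    """One data-parallel step: N accelerators, M micro-batches each, gradient accumulation + averaging."""
    per_acc_accumulated = []
    for micro_batches in micro_batches_per_acc:   # simulate the N accelerators
        acc = np.zeros_like(params)
        for mb in micro_batches:                  # forward then immediate backward per micro-batch
            acc += grad_fn(params, mb)            # sum of M gradients = gradient accumulation value
        per_acc_accumulated.append(acc)
    target = np.mean(per_acc_accumulated, axis=0) # average over N accelerators -> target gradient accumulation value
    return params - lr * target                   # every accelerator applies the same parameter update

# Toy usage (all values hypothetical): 2 accelerators, 3 micro-batches each,
# and a quadratic loss whose gradient is simply (params - mb).
params = np.zeros(4)
data = [[np.full(4, v) for v in (1.0, 2.0, 3.0)],
        [np.full(4, v) for v in (4.0, 5.0, 6.0)]]
params = train_step_data_parallel(params, data, lambda p, mb: p - mb)
```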
In a possible implementation, the processor 1501 is further configured to determine the size of a micro-batch according to a target storage capacity threshold and the size of the training data, where, if the N accelerators 1502 are identical, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators 1502, and if at least P of the N accelerators 1502 differ from one another, the target storage capacity threshold is the minimum of the storage capacity thresholds of the at least P accelerators 1502, with N≥P≥2. The processor 1501 is further configured to divide the training data into the N×M micro-batches according to the size of a micro-batch.
In a possible implementation, the storage occupancy corresponding to each micro-batch is less than or equal to the target storage capacity threshold, and the size of each micro-batch is less than or equal to the size of the training data.
In a possible implementation, the cluster linearity corresponding to each micro-batch is maximized.
In a possible implementation, if the ratio of the size of the training data to the size of a micro-batch is not an integer, M is the value obtained by rounding the ratio up.
In a possible implementation, if the ratio of the size of the training data to the size of a micro-batch is an integer, M is the ratio.
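A minimal sketch of this sizing logic for the data-parallel case is given below. The per-sample memory estimate (`bytes_per_sample`) and the specific numbers are hypothetical assumptions introduced only for the example; the embodiment does not prescribe how the storage occupancy of a micro-batch is estimated.

```python
import math

def split_into_microbatches(batch_size, accelerator_thresholds, bytes_per_sample):
    """Choose a micro-batch size that fits the tightest accelerator, then derive M.

    batch_size:             number of samples handed to one accelerator
    accelerator_thresholds: storage capacity thresholds of the N accelerators (bytes)
    bytes_per_sample:       assumed storage occupancy of one sample (bytes)
    """
    target_threshold = min(accelerator_thresholds)  # identical accelerators: any one; differing: the minimum
    micro_batch_size = max(1, min(batch_size, target_threshold // bytes_per_sample))
    ratio = batch_size / micro_batch_size
    return micro_batch_size, math.ceil(ratio)       # M is rounded up when the ratio is not an integer

# Hypothetical numbers: 1024 samples, thresholds of 16 GB and 12 GB, ~50 MB per sample.
print(split_into_microbatches(1024, [16 * 2**30, 12 * 2**30], 50 * 2**20))
```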
It should be noted that the information exchange and execution procedures between the processor and accelerators in the above apparatus are based on the same concept as the method embodiment shown in Fig. 6 of this application and bring the same technical effects; for details, refer to the description of that method embodiment, which is not repeated here.
Fig. 16 is another schematic structural diagram of a neural network training apparatus provided in an embodiment of this application. Referring to Fig. 16, the training apparatus includes a processor 1601 and N accelerators 1602. Each accelerator 1602 loads some of the layers of the neural network, the N accelerators 1602 together load the whole neural network, and the N accelerators 1602 train the neural network in a pipeline-parallel manner.
The 1st accelerator 1602 among the N accelerators 1602 is configured to obtain M micro-batches of data, where the M micro-batches of data make up the training data.
The N accelerators 1602 are configured to, after jointly completing forward computation on the i-th micro-batch, directly perform backward computation on the forward computation result of the i-th micro-batch, until backward computation on the M micro-batches is completed, so as to obtain a backward computation result.
The N accelerators 1602 are further configured to update the parameters of the neural network according to the backward computation result, where N≥2, M≥2, and i=1, 2, ..., M.
In a possible implementation, the backward computation result includes a gradient accumulation value corresponding to each accelerator 1602, and the gradient accumulation value corresponding to each accelerator 1602 is the sum of the M gradients obtained after that accelerator 1602 separately performs backward computation on the M micro-batches.
In a possible implementation, each accelerator 1602 is configured to update, according to its corresponding gradient accumulation value, the parameters of the layers of the neural network that it has loaded.
In a possible implementation, the processor 1601 is configured to obtain the training data, to determine the size of a micro-batch according to the storage capacity threshold of each accelerator 1602 and the size of the training data, and to divide the training data into the M micro-batches according to the size of a micro-batch.
In a possible implementation, the peak storage occupancy of each accelerator 1602 is less than or equal to the storage capacity threshold of that accelerator 1602, where the peak storage occupancy of an accelerator 1602 is the storage occupancy corresponding to the several micro-batches whose forward computation that accelerator 1602 has completed before it performs backward computation on the 1st micro-batch, and the size of each micro-batch is less than or equal to the size of the training data.
In a possible implementation, the cluster linearity corresponding to each micro-batch is maximized.
In a possible implementation, if the ratio of the size of the training data to the size of a micro-batch is not an integer, M is the value obtained by rounding the ratio up.
In a possible implementation, if the ratio of the size of the training data to the size of a micro-batch is an integer, M is the ratio.
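For the pipeline-parallel case, one way the processor could search for a micro-batch size satisfying the per-accelerator peak-occupancy constraint is sketched below. The per-stage activation cost per sample, the exhaustive search, and the assumption that stage k holds 2×(N−k)+1 forward results before its first backward pass (as in Fig. 14) are illustrative assumptions only, not limitations of the embodiment.

```python
import math

def microbatch_size_for_pipeline(batch_size, stage_thresholds, stage_bytes_per_sample):
    """Largest micro-batch size whose peak per-stage occupancy stays under every stage's threshold."""
    n = len(stage_thresholds)
    for mb_size in range(batch_size, 0, -1):                 # try the largest size first
        m = math.ceil(batch_size / mb_size)
        fits = True
        for k in range(1, n + 1):
            held = min(m, 2 * (n - k) + 1)                   # micro-batches held before the 1st backward pass
            if held * mb_size * stage_bytes_per_sample[k - 1] > stage_thresholds[k - 1]:
                fits = False
                break
        if fits:
            return mb_size, m                                # (micro-batch size, M)
    return 1, batch_size

# Hypothetical 4-stage example: 64-sample batch, 1 GB threshold per stage,
# ~30 MB of activations per sample per stage; prints (4, 16).
print(microbatch_size_for_pipeline(64, [2**30] * 4, [30 * 2**20] * 4))
```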
It should be noted that the information exchange and execution procedures between the processor and accelerators in the above apparatus are based on the same concept as the method embodiment shown in Fig. 10 of this application and bring the same technical effects; for details, refer to the description of that method embodiment, which is not repeated here.
Fig. 17 is yet another schematic structural diagram of a neural network training apparatus provided in an embodiment of this application. Referring to Fig. 17, the training apparatus includes one or more central processing units 1701, a memory 1702, an input/output interface 1703, a wired or wireless network interface 1704, and a power supply 1705.
The memory 1702 may be transient storage or persistent storage. Further, the central processing unit 1701 may be configured to communicate with the memory 1702 and to execute, on the training apparatus, a series of instruction operations stored in the memory 1702.
In this embodiment, the central processing unit 1701 may perform the operations performed by the training apparatus in the embodiments shown in Fig. 6 or Fig. 10, which are not repeated here.
In this embodiment, the division of functional modules in the central processing unit 1701 may be similar to the division of functional modules such as the processor and accelerators described in Fig. 15 or Fig. 16, which is not repeated here.
An embodiment of this application further provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform the training method in the embodiments shown in Fig. 6 or Fig. 10.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system, apparatus, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other ways of division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (26)

  1. A neural network training method, wherein the training method is applied to N accelerators, each accelerator loads the same neural network, and the N accelerators train the neural network in a data-parallel manner, the training method comprising:
    obtaining, by each accelerator, M micro-batches of data, wherein N×M micro-batches of data make up training data;
    after performing forward computation on an i-th micro-batch, directly performing, by each accelerator, backward computation on a forward computation result of the i-th micro-batch, until backward computation on the M micro-batches is completed, to obtain a backward computation result;
    updating, by each accelerator, parameters of the neural network according to the backward computation result;
    wherein N≥2, M≥2, and i=1, 2, ..., M.
  2. The training method according to claim 1, wherein the backward computation result comprises a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of M gradients obtained after that accelerator separately performs backward computation on the M micro-batches.
  3. The training method according to claim 2, wherein the updating, by each accelerator, the parameters of the neural network according to the backward computation result comprises:
    averaging, by each accelerator, the gradient accumulation values corresponding to the N accelerators to obtain a target gradient accumulation value;
    updating, by each accelerator, the parameters of the neural network according to the target gradient accumulation value.
  4. The training method according to any one of claims 1 to 3, wherein the training method is further applied to a processor, and before each accelerator obtains the M micro-batches of data, the training method further comprises:
    obtaining, by the processor, the training data;
    determining, by the processor, the size of a micro-batch according to a target storage capacity threshold and the size of the training data, wherein, if the N accelerators are identical, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators, and if at least P of the N accelerators differ from one another, the target storage capacity threshold is the minimum of the storage capacity thresholds of the at least P accelerators, N≥P≥2;
    dividing, by the processor, the training data into the N×M micro-batches according to the size of a micro-batch.
  5. The training method according to claim 4, wherein the storage occupancy corresponding to each micro-batch is less than or equal to the target storage capacity threshold, and the size of each micro-batch is less than or equal to the size of the training data.
  6. The training method according to claim 5, wherein the cluster linearity corresponding to each micro-batch is maximized.
  7. The training method according to claim 6, wherein, if the ratio of the size of the training data to the size of a micro-batch is not an integer, M is the value obtained by rounding the ratio up.
  8. The training method according to claim 6, wherein, if the ratio of the size of the training data to the size of a micro-batch is an integer, M is the ratio.
  9. A neural network training method, wherein the training method is applied to N accelerators, each accelerator loads some of the layers of a neural network, the N accelerators together load the neural network, and the N accelerators train the neural network in a pipeline-parallel manner, the training method comprising:
    obtaining, by a 1st accelerator among the N accelerators, M micro-batches of data, wherein the M micro-batches of data make up training data;
    after jointly completing forward computation on an i-th micro-batch, directly performing, by the N accelerators, backward computation on a forward computation result of the i-th micro-batch, until backward computation on the M micro-batches is completed, to obtain a backward computation result;
    updating, by the N accelerators, parameters of the neural network according to the backward computation result;
    wherein N≥2, M≥2, and i=1, 2, ..., M.
  10. The training method according to claim 9, wherein the backward computation result comprises a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of M gradients obtained after that accelerator separately performs backward computation on the M micro-batches.
  11. The training method according to claim 10, wherein the updating, by the N accelerators, the parameters of the neural network according to the backward computation result comprises:
    updating, by each accelerator according to its corresponding gradient accumulation value, the parameters of the layers of the neural network that it has loaded.
  12. The training method according to any one of claims 9 to 11, wherein the training method is further applied to a processor, and before the N accelerators obtain the M micro-batches of data, the training method further comprises:
    obtaining, by the processor, the training data;
    determining, by the processor, the size of a micro-batch according to the storage capacity threshold of each accelerator and the size of the training data;
    dividing, by the processor, the training data into the M micro-batches according to the size of a micro-batch.
  13. The training method according to claim 12, wherein the peak storage occupancy of each accelerator is less than or equal to the storage capacity threshold of that accelerator, the peak storage occupancy of each accelerator is the storage occupancy corresponding to the several micro-batches whose forward computation that accelerator has completed before it performs backward computation on the 1st micro-batch, and the size of each micro-batch is less than or equal to the size of the training data.
  14. The training method according to claim 13, wherein the cluster linearity corresponding to each micro-batch is maximized.
  15. The training method according to claim 14, wherein, if the ratio of the size of the training data to the size of a micro-batch is not an integer, M is the value obtained by rounding the ratio up.
  16. The training method according to claim 14, wherein, if the ratio of the size of the training data to the size of a micro-batch is an integer, M is the ratio.
  17. A neural network training apparatus, wherein the training apparatus comprises N accelerators, each accelerator loads the same neural network, and the N accelerators train the neural network in a data-parallel manner;
    each accelerator is configured to obtain M micro-batches of data, wherein N×M micro-batches of data make up training data;
    each accelerator is further configured to, after performing forward computation on an i-th micro-batch, directly perform backward computation on a forward computation result of the i-th micro-batch, until backward computation on the M micro-batches is completed, to obtain a backward computation result;
    each accelerator is further configured to update parameters of the neural network according to the backward computation result;
    wherein N≥2, M≥2, and i=1, 2, ..., M.
  18. The training apparatus according to claim 17, wherein the backward computation result comprises a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of M gradients obtained after that accelerator separately performs backward computation on the M micro-batches.
  19. The training apparatus according to claim 18, wherein each accelerator is further configured to average the gradient accumulation values corresponding to the N accelerators to obtain a target gradient accumulation value;
    each accelerator is further configured to update the parameters of the neural network according to the target gradient accumulation value.
  20. The training apparatus according to any one of claims 17 to 19, wherein the training apparatus further comprises a processor;
    the processor is configured to obtain the training data;
    the processor is further configured to determine the size of a micro-batch according to a target storage capacity threshold and the size of the training data, wherein, if the N accelerators are identical, the target storage capacity threshold is the storage capacity threshold of any one of the N accelerators, and if at least P of the N accelerators differ from one another, the target storage capacity threshold is the minimum of the storage capacity thresholds of the at least P accelerators, N≥P≥2;
    the processor is further configured to divide the training data into the N×M micro-batches according to the size of a micro-batch.
  21. A neural network training apparatus, wherein the training apparatus comprises N accelerators, each accelerator loads some of the layers of a neural network, the N accelerators together load the neural network, and the N accelerators train the neural network in a pipeline-parallel manner;
    a 1st accelerator among the N accelerators is configured to obtain M micro-batches of data, wherein the M micro-batches of data make up training data;
    the N accelerators are configured to, after jointly completing forward computation on an i-th micro-batch, directly perform backward computation on a forward computation result of the i-th micro-batch, until backward computation on the M micro-batches is completed, to obtain a backward computation result;
    the N accelerators are further configured to update parameters of the neural network according to the backward computation result;
    wherein N≥2, M≥2, and i=1, 2, ..., M.
  22. The training apparatus according to claim 21, wherein the backward computation result comprises a gradient accumulation value corresponding to each accelerator, and the gradient accumulation value corresponding to each accelerator is the sum of M gradients obtained after that accelerator separately performs backward computation on the M micro-batches.
  23. The training apparatus according to claim 22, wherein each accelerator is configured to update, according to its corresponding gradient accumulation value, the parameters of the layers of the neural network that it has loaded.
  24. The training apparatus according to any one of claims 21 to 23, wherein the training apparatus further comprises a processor;
    the processor is configured to obtain the training data;
    the processor is further configured to determine the size of a micro-batch according to the storage capacity threshold of each accelerator and the size of the training data;
    the processor is further configured to divide the training data into the M micro-batches according to the size of a micro-batch.
  25. A neural network training apparatus, comprising:
    one or more processors, a memory, a bus system, and one or more programs, wherein the processors and the memory are connected through the bus system;
    wherein the one or more programs are stored in the memory, and the one or more programs comprise instructions that, when executed by the training apparatus, cause the training apparatus to perform the training method according to any one of claims 1 to 16.
  26. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the training method according to any one of claims 1 to 16.
PCT/CN2021/094579 2020-05-29 2021-05-19 Method for training neural network, and related device WO2021238734A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010479541.2 2020-05-29
CN202010479541.2A CN113743570B (en) 2020-05-29 2020-05-29 Neural network training method and related equipment

Publications (1)

Publication Number Publication Date
WO2021238734A1 true WO2021238734A1 (en) 2021-12-02

Family

ID=78725142

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094579 WO2021238734A1 (en) 2020-05-29 2021-05-19 Method for training neural network, and related device

Country Status (2)

Country Link
CN (1) CN113743570B (en)
WO (1) WO2021238734A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115437795A (en) * 2022-11-07 2022-12-06 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506695A (en) * 2017-07-28 2017-12-22 武汉理工大学 Video monitoring equipment failure automatic detection method
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705705B (en) * 2019-09-25 2022-04-22 浪潮电子信息产业股份有限公司 Convolutional neural network model synchronous training method, cluster and readable storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107506695A (en) * 2017-07-28 2017-12-22 武汉理工大学 Video monitoring equipment failure automatic detection method
CN110795228A (en) * 2018-08-03 2020-02-14 伊姆西Ip控股有限责任公司 Adaptive batch dataset partitioning for distributed deep learning using accelerator mixture sets

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU WENCHAO; PANG YUXIN; YANG YANQIN; LIU YANBO: "Human Activity Recognition Based On Convolutional Neural Network", 2018 24TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), IEEE, 20 August 2018 (2018-08-20), pages 165 - 170, XP033454144, DOI: 10.1109/ICPR.2018.8545435 *
YANPING HUANG, CHENG YOULONG, BAPNA ANKUR, FIRAT ORHAN, CHEN MIA XU, CHEN DEHAO, LEE HYOUKJOONG, NGIAM JIQUAN, LE QUOC V, WU YONGH: "GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism", CORR (ARXIV), CORNELL UNIVERSITY LIBRARY, vol. 1811.06965, no. v5, pages 1 - 11, XP055730504 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115437795A (en) * 2022-11-07 2022-12-06 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception
CN115437795B (en) * 2022-11-07 2023-03-24 东南大学 Video memory recalculation optimization method and system for heterogeneous GPU cluster load perception

Also Published As

Publication number Publication date
CN113743570A (en) 2021-12-03
CN113743570B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
US20210166112A1 (en) Method for neural network and apparatus performing same method
EP4036803A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
US11232356B2 (en) Training giant neural networks using pipeline parallelism
CN115456160A (en) Data processing method and data processing equipment
WO2020073211A1 (en) Operation accelerator, processing method, and related device
CN113159073B (en) Knowledge distillation method and device, storage medium and terminal
EP4145351A1 (en) Neural network construction method and system
CN107622303A (en) For the method for neutral net and the equipment of execution this method
CN114330699A (en) Neural network structure searching method and device
CN112149809A (en) Model hyper-parameter determination method and device, calculation device and medium
CN111325222A (en) Image normalization processing method and device and storage medium
WO2023280113A1 (en) Data processing method, training method for neural network model, and apparatus
CN114492723A (en) Neural network model training method, image processing method and device
CN116362325A (en) Electric power image recognition model lightweight application method based on model compression
CN114356540A (en) Parameter updating method and device, electronic equipment and storage medium
CN113792621A (en) Target detection accelerator design method based on FPGA
WO2022252694A1 (en) Neural network optimization method and apparatus
WO2021238734A1 (en) Method for training neural network, and related device
JP7150651B2 (en) Neural network model reducer
CN111652349A (en) Neural network processing method and related equipment
CN114519425A (en) Convolution neural network acceleration system with expandable scale
WO2022227024A1 (en) Operational method and apparatus for neural network model and training method and apparatus for neural network model
CN114358250A (en) Data processing method, data processing apparatus, computer device, medium, and program product
CN114298329A (en) Model training method, device, equipment and storage medium
US11928598B2 (en) Method and system for distributed neural network training

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21813460

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21813460

Country of ref document: EP

Kind code of ref document: A1