CN115688867A - Method, apparatus, device and storage medium for training neural network



Publication number
CN115688867A
Authority
CN
China
Prior art keywords
gradients
network layers
global
network
training step
Prior art date
Legal status
Pending
Application number
CN202211431463.4A
Other languages
Chinese (zh)
Inventor
朱亦博
陈扬锐
谢聪
顾骏程
彭杨华
林海滨
Current Assignee
Douyin Vision Co Ltd
Lemon Inc Cayman Island
Original Assignee
Douyin Vision Co Ltd
Lemon Inc Cayman Island
Priority date
Filing date
Publication date
Application filed by Douyin Vision Co Ltd, Lemon Inc Cayman Island
Priority to CN202211431463.4A
Publication of CN115688867A
Priority to PCT/CN2023/130540 (WO2024104232A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Methods, apparatuses, devices and storage media for training a neural network are provided according to embodiments of the present disclosure. In the method, at a first working node of a plurality of working nodes, a first set of global gradients for a first set of network layers in the neural network is obtained. The first set of global gradients is aggregated from local gradients determined by the plurality of worker nodes for the first set of network layers in the current training step. The plurality of working nodes are configured to jointly train a neural network. Further, a second set of global gradients for a second set of network layers in the neural network is obtained. The second set of network layers is different from the first set of network layers. The second set of global gradients is aggregated from local gradients determined by the plurality of worker nodes for the second set of network layers in a previous training step prior to the current training step. Further, parameters of the neural network are updated based on the first set of global gradients and the second set of global gradients. In this way, the convergence and accuracy of the trained neural network may be improved.

Description

Method, apparatus, device and storage medium for training neural network
Technical Field
Example embodiments of the present disclosure generally relate to the field of computers, and in particular, to methods, apparatuses, devices, and computer-readable storage media for training neural networks.
Background
With the rapid development of computer technology, neural networks (NNs) are increasingly used in fields such as computer vision and natural language processing. Meanwhile, the exponential growth in model size and data size makes training of neural networks time- and resource-consuming.
Currently, a common method to accelerate neural network training is data parallelism, which uses multiple computing devices to train the neural network. Although data parallelism greatly accelerates training, it is difficult to achieve a linear speedup due to frequent communication among the computing devices.
Disclosure of Invention
According to a first aspect of the present disclosure, a method for training a neural network is provided. In the method, at a first working node of a plurality of working nodes, a first set of global gradients for a first set of network layers in a neural network is obtained. The first set of global gradients is aggregated from local gradients determined by the plurality of worker nodes for the first set of network layers in the current training step. The plurality of working nodes are configured to jointly train a neural network. Further, a second set of global gradients for a second set of network layers in the neural network is obtained. The second set of network layers is different from the first set of network layers. The second set of global gradients is aggregated from local gradients determined by the plurality of worker nodes for the second set of network layers in a previous training step prior to the current training step. Further, parameters of the neural network are updated based on the first set of global gradients and the second set of global gradients.
According to a second aspect of the present disclosure, an apparatus for training a neural network is provided. The apparatus comprises a first obtaining module, a second obtaining module, and an updating module. The first obtaining module is configured to: at a first working node of the plurality of working nodes, obtain a first set of global gradients for a first set of network layers in the neural network, the first set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the first set of network layers in a current training step, the plurality of working nodes being configured to jointly train the neural network. The second obtaining module is configured to: obtain a second set of global gradients for a second set of network layers in the neural network, the second set of network layers being different from the first set of network layers, the second set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the second set of network layers in a previous training step prior to the current training step. The updating module is configured to: update parameters of the neural network based on the first set of global gradients and the second set of global gradients.
According to a third aspect of the present disclosure, an electronic device is provided. The electronic device comprises at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the electronic device to perform a method according to the first aspect of the present disclosure.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program executable by a processor to perform the method according to the first aspect of the present disclosure.
It should be understood that the statements herein set forth in this summary are not intended to limit the essential or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various implementations of the present disclosure will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 illustrates a flow diagram of a method for training a neural network, in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates a neural network training pipeline, in accordance with some embodiments of the present disclosure;
FIG. 4 illustrates a block diagram of an example apparatus for training a neural network, in accordance with some embodiments of the present disclosure; and
FIG. 5 illustrates a block diagram of a computing device in which one or more embodiments of the disclosure may be implemented.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are illustrated in the accompanying drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more complete and thorough understanding of the disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In describing embodiments of the present disclosure, the term "include" and its derivatives should be interpreted as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". Other explicit and implicit definitions may also be included below.
It will be appreciated that the data referred to in this disclosure, including but not limited to the data itself, the acquisition or use of the data, should comply with the requirements of the applicable laws and regulations and related regulations.
It is understood that before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed of the type, the use range, the use scene, etc. of the personal information related to the present disclosure and obtain the authorization of the user through an appropriate manner according to the relevant laws and regulations.
For example, in response to receiving an active request from a user, a prompt message is sent to the user to explicitly inform the user that the requested operation would require the acquisition and use of the user's personal information. Thus, the user can autonomously choose whether to provide personal information to the software or hardware, such as an electronic device, application program, server, or storage medium, that performs the operations of the disclosed technical solution, according to the prompt information.
As an optional but non-limiting implementation manner, in response to receiving an active request of the user, the prompt information is sent to the user, for example, a pop-up window manner may be used, and the prompt information may be presented in a text manner in the pop-up window. In addition, a selection control for providing personal information to the electronic device by the user selecting "agree" or "disagree" can be carried in the pop-up window.
It is understood that the above notification and user authorization process is only illustrative and is not intended to limit the implementation of the present disclosure, and other ways of satisfying the relevant laws and regulations may be applied to the implementation of the present disclosure.
The term "responsive" as used herein means that a corresponding event occurs or a condition is satisfied. It will be appreciated that the timing of the performance of a subsequent action performed in response to the event or condition and the time at which the event occurred or the condition was true may not be strongly correlated. For example, in some cases, follow-up actions may be performed immediately upon the occurrence of an event or the satisfaction of a condition; in other cases, however, the follow-up action may be performed after a period of time has elapsed since the occurrence of the event or the establishment of the condition.
As used herein, the term "model" may learn from training data the associations between respective inputs and outputs, such that after training is complete, for a given input, a corresponding output may be generated. The generation of the model may be based on machine learning techniques. Deep learning is a machine learning algorithm that processes inputs and provides corresponding outputs using multiple layers of processing units. Neural network models are one example of deep learning based models. The "model" may also be referred to herein as a "machine learning model", "machine learning network", or "learning network", these terms being used interchangeably herein.
A "neural network" is a deep learning based machine learning network. Neural networks are capable of processing inputs and providing corresponding outputs, and typically include an input layer and an output layer and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, thereby increasing the depth of the network. The layers of the neural network are connected in sequence such that the output of a previous layer is provided as the input of a subsequent layer, wherein the input layer receives the input of the neural network and the output of the output layer is the final output of the neural network. Each layer of the neural network includes one or more nodes (also referred to as processing nodes or neurons), each node processing an input from a previous layer. In the context of the present disclosure, the input layer, the output layer, and the hidden layer may also be referred to individually or collectively as the network layer.
In general, machine learning can roughly include three phases, namely a training phase, a testing phase, and an application phase (also referred to as an inference phase). In the training phase, a given model may be trained using a large amount of training data, with parameter values being updated iteratively until the model is able to derive consistent inferences from the training data that meet the desired objectives. By training, the model may be considered to be able to learn from the training data the association between inputs to outputs (also referred to as input to output mapping). Parameter values of the trained model are determined. In the testing phase, test inputs are applied to the trained model to test whether the model can provide the correct outputs, thereby determining the performance of the model. In the application phase, the model may be used to process the actual inputs to determine the corresponding outputs based on the trained parameter values.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. Environment 100 relates to a data-parallel neural network training environment, which includes N working nodes 110-1, 110-2, …, 110-N (where N is an integer greater than 1) and a service node 120. Working nodes 110-1, 110-2, …, 110-N may maintain respective local training data sets 112-1, 112-2, …, 112-N, and are configured to jointly train a neural network. For ease of discussion, working nodes 110-1, 110-2, …, 110-N may be referred to individually or collectively as working nodes 110, and local training data sets 112-1, 112-2, …, 112-N may be referred to individually or collectively as local training data sets 112.
Data parallelism is a method of accelerating neural network training. In data-parallel neural network training, the training task is split across multiple working nodes 110; each working node 110 maintains the same model parameters and the same computational task but processes different training data. The service node 120 may aggregate the gradient data from each working node 110 and synchronize the aggregated gradient data to each working node 110. In this way, the training and computation under the same global training data are split across different working nodes 110, thereby relieving the computation and storage pressure at a single working node 110. Working nodes are also sometimes referred to as clients, working node devices, terminal nodes, terminal devices, edge devices, and the like.
In data-parallel neural network training, working nodes 110 store respective local neural networks 132, such as local neural networks 132-1, 132-2, …, 132-N (hereinafter referred to individually or collectively as local neural networks 132 for ease of discussion). Working nodes 110 perform local training using their corresponding local training data sets 112. A working node 110 sends the gradients for the various network layers determined during the local training process to the service node 120. The service node 120 obtains a global gradient by aggregating the gradients from the respective working nodes 110, and synchronizes the aggregated global gradient to the respective working nodes 110. The working node 110 may update the local neural network 132 based on the synchronized global gradient to perform the next training step. The above process may be repeated until the neural network training converges. Since the neural networks finally trained at the respective working nodes 110 are identical to each other, the local neural network may also be simply referred to as a neural network hereinafter for convenience of description.
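For illustration only, the following Python sketch simulates one such data-parallel training step at the gradient level; the toy linear model, node count, learning rate, and function names are assumptions made for the example rather than part of the disclosed system.

```python
import numpy as np

def local_gradient(params, batch):
    # Stand-in for local back-propagation at one working node:
    # gradient of a mean-squared-error loss for a linear model.
    x, y = batch
    return 2.0 * x.T @ (x @ params - y) / len(x)

def training_step(params, local_batches, lr=0.01):
    # Each working node computes a local gradient on its own data shard.
    local_grads = [local_gradient(params, batch) for batch in local_batches]
    # The service node aggregates the local gradients into a global gradient
    # and synchronizes it back to every working node.
    global_grad = np.mean(local_grads, axis=0)
    # Every working node applies the same update, so the replicas stay identical.
    return params - lr * global_grad

rng = np.random.default_rng(0)
params = rng.normal(size=(4, 1))
shards = [(rng.normal(size=(8, 4)), rng.normal(size=(8, 1))) for _ in range(3)]
params = training_step(params, shards)
```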
In some embodiments, the worker node 110 and/or the service node 120 may be implemented at a terminal device or a server. The terminal device may be any type of mobile terminal, fixed terminal, or portable terminal including a mobile handset, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, media computer, multimedia tablet, personal Communication System (PCS) device, personal navigation device, personal Digital Assistant (PDA), audio/video player, digital camera/camcorder, positioning device, television receiver, radio broadcast receiver, electronic book device, gaming device, or any combination of the preceding, including accessories and peripherals of these devices, or any combination thereof. In some embodiments, the terminal device can also support any type of interface to the user (such as "wearable" circuitry, etc.). Servers are various types of computing systems/servers capable of providing computing power, including but not limited to mainframes, edge computing nodes, computing devices in a cloud environment, and so forth.
It should be understood that FIG. 1 illustrates only an example data-parallel neural network training environment. The environment may differ depending on the neural network training and the actual application needs. For example, although the service node 120 is shown as a separate node in FIG. 1, in some applications the service node 120 may, in addition to acting as a central node, also perform local training as a working node 110, and so forth. The scope of the present disclosure is not limited in this respect.
As described above, although the data-parallel neural network training method can greatly increase the training speed, it is difficult to achieve a linear speedup due to frequent communication among the plurality of working nodes 110. At present, there are methods that, in the current training step, update all network layers of the neural network using the gradients determined in the previous training step, so as to eliminate the dependency between the gradient communication and the computation of the current training step, and to hide the communication overhead by constructing a pipeline in which communication and computation proceed in parallel. However, introducing gradient delay for all network layers of the neural network can adversely affect the convergence and accuracy of the model.
The inventors have found through research that, because of the hierarchical structure of the neural network and the reverse order of forward propagation and backward propagation, the communication overhead can be hidden by introducing gradient delay only for some network layers in the neural network, rather than for all network layers. In this regard, embodiments of the present disclosure propose an improved gradient-delay-based neural network training scheme. Specifically, according to embodiments of the present disclosure, the network layers in the neural network are divided into a first set of network layers and a second set of network layers, and a first set of gradients for the first set of network layers and a second set of gradients for the second set of network layers are respectively obtained, where the first set of gradients is aggregated from local gradients determined by the plurality of working nodes 110 for the first set of network layers in the current training step, and the second set of gradients is aggregated from local gradients determined by the plurality of working nodes 110 for the second set of network layers in a previous training step. Further, parameters of the neural network are updated based on the first set of gradients and the second set of gradients.
As will be more clearly understood from the following description, according to embodiments of the present disclosure, communication overhead is hidden by introducing gradient delays only for part of the network layers in the neural network. In this way, the influence of the introduced gradient delay on the model convergence and accuracy can be reduced as much as possible, thereby ensuring the convergence and accuracy of the model while hiding communication overhead.
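As a rough illustration of this scheme (the layer names, SGD-style update rule, and numeric values below are assumptions made for the example, not the disclosed implementation), the per-step update at a working node mixes current-step global gradients for the first set of network layers with previous-step global gradients for the second set:

```python
import numpy as np

def update_with_partial_delay(params, fresh_global, stale_global, lr=0.01):
    """params: mapping layer name -> parameter array.
    fresh_global: global gradients aggregated in the current training step
                  (first set of network layers, no delay).
    stale_global: global gradients aggregated in the previous training step
                  (second set of network layers, delayed by one step)."""
    updated = {}
    for name, value in params.items():
        grad = fresh_global.get(name, stale_global.get(name))
        updated[name] = value - lr * grad
    return updated

params = {"layer1": np.ones(3), "layer2": np.ones(3), "layer3": np.ones(3)}
stale = {"layer1": np.full(3, 0.1), "layer2": np.full(3, 0.2)}  # second set, delayed
fresh = {"layer3": np.full(3, 0.3)}                             # first set, current step
params = update_with_partial_delay(params, fresh, stale)
```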
Some example embodiments of the disclosure will now be described with continued reference to the accompanying drawings.
Partial delay
Fig. 2 illustrates a flow diagram of a method 200 for training a neural network, in accordance with some embodiments of the present disclosure. In some embodiments, method 200 may be performed by a worker node 110 as shown in FIG. 1. It should be understood that method 200 may also include additional blocks not shown and/or may omit certain block(s) shown, as the scope of the present disclosure is not limited in this respect.
At block 202, the working node 110 obtains a first set of global gradients for a first set of network layers in the neural network. The first set of global gradients is aggregated from the local gradients determined by the plurality of worker nodes 110 for the first set of network layers in the current training step, i.e., the global gradients for the first set of network layers are not delayed. At block 204, the working node 110 obtains a second set of global gradients for a second set of network layers in the neural network. The second set of network layers is different from the first set of network layers. The second set of global gradients is aggregated from local gradients determined by the plurality of working nodes 110 for the second set of network layers in a previous training step prior to the current training step, i.e. the gradients for the second set of network layers are delayed. In some embodiments, the previous training step may be the last training step prior to the current training step. It should be understood that the previous training steps may also include any other suitable training step prior to the current training step, and the scope of the present disclosure is not limited in this respect.
For ease of explanation, reference will be made to fig. 3 hereinafter. Fig. 3 illustrates a neural network training pipeline 300 in accordance with some embodiments of the present disclosure. The pipeline 300 is illustrated with a neural network having 3 network layers as an example. It should be understood that aspects in accordance with embodiments of the present disclosure may be applied to neural networks having any other suitable number of network layers, and the scope of the present disclosure is not limited in this respect.
In the example of fig. 3, the first set of network layers includes an output layer in the neural network, i.e., a third network layer. The second set of network layers includes the first two network layers from the input layer in the neural network, i.e., the first network layer and the second network layer. As shown in FIG. 3, blocks 302-1, 302-2, and 302-3 schematically illustrate the back propagation computation process in the current training step for the first, second, and third network layers, respectively, in the neural network. Note that since the back propagation computation process starts from the output layer and propagates back to the input layer, i.e., first computes the gradient for the third network layer, then computes the gradient for the second network layer, and finally computes the gradient for the first network layer, block 302-3 is shown before block 302-2 and block 302-2 is shown before block 302-1 in fig. 3. Blocks 304-1, 304-2, and 304-3 schematically illustrate the forward propagation computation process of a first, second, and third network layer, respectively, in a neural network in the next training step after the current training step. Blocks 306-1, 306-2, and 306-3 schematically illustrate the round-trip communication process of the first, second, and third network layers, respectively, in the neural network during the current training step. The round-trip communication procedure corresponds to the working node 110 sending the locally determined local gradients to the serving node 120 and receiving from the serving node 120 global gradients aggregated from the local gradients determined by the plurality of working nodes 110 for the respective network layers. For the sake of simplicity, the "round-trip communication process" may also be referred to as "communication process" in the following.
As shown in fig. 3, since the gradient for the first set of network layers (i.e., the third network layer) is not delayed, the working node 110 may obtain the gradient for the first set of network layers from the serving node 120 via the communication process 306-3. Further, since the gradients for the second set of network layers (i.e., the first network layer and the second network layer) are delayed, the working node 110 may retrieve from the buffer 310 the second set of global gradients for the second set of network layers that were stored in the buffer 310 in the last training step prior to the current training step.
Referring back to FIG. 2, at block 206, the working node 110 updates parameters of the neural network based on the first set of global gradients and the second set of global gradients. In some embodiments, the worker node 110 may directly use the acquired global gradient to update the corresponding layer in the neural network. Referring to fig. 3, the worker node 110 may update parameters of the first network layer and the second network layer using the global gradients for the first network layer and the second network layer retrieved from the buffer 310, the update corresponding to block 308-1 in fig. 3. The worker node 110 may then perform forward propagation calculations for the first network layer and the second network layer in the next training step based on the updated parameters. Further, the worker node 110 may update the parameters of the third network layer using the global gradient for the third network layer obtained via communication process 306-3, which corresponds to block 308-2 in fig. 3. The worker node 110 may in turn perform forward propagation calculations for the third network layer in the next training step based on the updated parameters.
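For illustration, the following sketch walks through one step of the pipeline of FIG. 3 at a single working node; the three-layer split, the in-memory buffer dictionary, and the aggregate function standing in for the round-trip communication are assumptions made for the example.

```python
import numpy as np

def fake_aggregate(layer, local_grad):
    # Placeholder for the round-trip communication with the service node,
    # which would aggregate the local gradients of all working nodes.
    return local_grad

def pipeline_step(params, buffer, local_grads, aggregate, lr=0.01):
    # Delayed layers (first and second): update with the global gradients that
    # were stored in the buffer during the previous training step.
    for name in ("layer1", "layer2"):
        params[name] -= lr * buffer[name]
    # Undelayed layer (third): wait for this step's aggregated global gradient.
    params["layer3"] -= lr * aggregate("layer3", local_grads["layer3"])
    # Send the delayed layers' local gradients now; the aggregated results land
    # in the buffer and are only consumed in the next training step.
    for name in ("layer1", "layer2"):
        buffer[name] = aggregate(name, local_grads[name])
    return params, buffer

params = {f"layer{i}": np.ones(2) for i in (1, 2, 3)}
buffer = {"layer1": np.zeros(2), "layer2": np.zeros(2)}   # filled in the previous step
local_grads = {f"layer{i}": np.full(2, 0.1 * i) for i in (1, 2, 3)}
params, buffer = pipeline_step(params, buffer, local_grads, fake_aggregate)
```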
By introducing gradient delay only for part of network layers in the neural network instead of all network layers to hide communication overhead, the influence of the introduced gradient delay on the model convergence and accuracy can be reduced as much as possible, thereby ensuring the convergence and accuracy of the model while hiding the communication overhead.
In some embodiments, a minimum number of network layers for gradient delay may also be determined such that the communication overhead is entirely hidden, thereby ensuring that the training pipeline does not stall, i.e., that there is no waiting at the working node 110. The inventors have found through research that, for a hierarchically structured neural network, in a training pipeline without stalls, if a front network layer is delayed, then all network layers preceding that layer (i.e., closer to the input layer) are delayed as well. Thus, the minimum number of network layers that need to introduce gradient delay in order for the communication overhead to be fully hidden can be determined by solving an optimization problem of the following form: minimize k, subject to a first constraint and a second constraint, where k denotes the number of network layers that need to introduce gradient delay, v_i denotes the round-trip communication time between the working node 110 and the service node 120 for the i-th network layer, b_i denotes the back-propagation computation time for the i-th network layer, u_i denotes the forward-propagation computation time for the i-th network layer, and m denotes the total number of network layers in the neural network. The first constraint in the optimization problem ensures that communication and computation overlap completely, i.e., that the communication overhead is completely hidden; the second constraint ensures that each undelayed gradient is synchronized before the forward-propagation computation of its corresponding network layer begins.
In some embodiments, the forward-propagation computation time, the backward-propagation computation time, and the round-trip communication time for each network layer in the neural network may be determined by monitoring the durations of the forward-propagation computation process, the backward-propagation computation process, and the communication process of the respective network layers during a warm-up training phase of the neural network. The working node 110 may then determine the number of network layers to use for gradient delay based on the determined forward-propagation computation time, backward-propagation computation time, and round-trip communication time. In one example, the working node 110 may find the minimum number of network layers that need to introduce gradient delay by solving the optimization problem described above, and determine the number of network layers for gradient delay based on that minimum number. For example, the number of network layers for gradient delay may be determined as any suitable number that is equal to or greater than the minimum number and less than the total number of network layers in the neural network.
Further, the worker node 110 may determine the first set of network layers and the second set of network layers based on the determined number of network layers. In some embodiments, the working node 110 may select the determined number of network layers from the input layers of the neural network as the second set of network layers and determine the remaining network layers in the neural network as the first set of network layers. Referring to fig. 3, in the case where the determined number of network layers for gradient delay is 2, the working node 110 may determine the first 2 network layers (i.e., first and second network layers) of the neural network from the input layer as the second set of network layers, and determine the remaining network layers (i.e., third network layers) of the neural network as the first set of network layers.
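For illustration, the sketch below searches for the smallest number of delayed layers using per-layer times profiled during the warm-up phase; the feasibility test encodes only the second constraint, in a simplified form (each undelayed layer's round trip must finish before its next forward pass), which is an assumption made for the example rather than the exact optimization problem above.

```python
def min_delayed_layers(b, u, v):
    """b[i], u[i]: back-/forward-propagation times of layer i (0-indexed from the
    input layer); v[i]: round-trip communication time of layer i."""
    m = len(b)

    def feasible(k):
        # Layers k..m-1 are undelayed: their communication starts when their
        # backward pass ends and must finish before their forward pass of the
        # next training step begins.
        for i in range(k, m):
            backward_tail = sum(b[i:])             # time until layer i's backward ends
            next_forward_start = sum(b) + sum(u[:i])
            if backward_tail + v[i] > next_forward_start:
                return False
        return True

    return next(k for k in range(m + 1) if feasible(k))

# Illustrative per-layer times (milliseconds) measured in the warm-up phase.
b = [4.0, 3.0, 2.0]
u = [2.0, 2.0, 1.0]
v = [6.0, 5.0, 3.0]
print(min_delayed_layers(b, u, v))   # -> 1: delay only the first network layer
```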
Delay compensation
In some embodiments, the working node 110 may predict a third set of global gradients, i.e., the global gradients that would be determined for the second set of network layers in the current training step, based on the second set of global gradients for the second set of network layers. In one example, the inventors have found through research that the third set of global gradients may be predicted based on a Taylor expansion of the gradient function, as shown by equation (1) below:

g_t = DC_λ(g_{t'}, x_{t-1}, x_{t'-1}) = g_{t'} + λ · g_{t'} · g_{t'}^T · (x_{t-1} − x_{t'-1})    (1)

where g_t represents the predicted third set of global gradients, g_{t'} represents the second set of global gradients, λ is a predetermined hyper-parameter, g_{t'}^T represents the transpose of g_{t'}, x_{t-1} represents the parameters of the second set of network layers in the current training step, and x_{t'-1} represents the parameters of the second set of network layers in the previous training step. Further, the working node 110 may update the parameters of the second set of network layers with the predicted third set of global gradients. In this way, the introduced gradient delay can be compensated by means of the predicted global gradient, further reducing the influence of the introduced gradient delay on model convergence and accuracy.
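A minimal sketch of this delay-compensation step, assuming the form of equation (1) as written above (the hyper-parameter value and array shapes are illustrative):

```python
import numpy as np

def delay_compensate(stale_grad, params_now, params_prev, lam=0.04):
    """DC_lambda from equation (1): predict the current-step gradient from a
    stale gradient via a first-order correction, approximating the Hessian by
    lam * g g^T.  Uses g g^T s == g * (g . s) to avoid the explicit outer product."""
    g = stale_grad.reshape(-1)
    shift = (params_now - params_prev).reshape(-1)
    return (g + lam * g * (g @ shift)).reshape(stale_grad.shape)

rng = np.random.default_rng(1)
stale_global = rng.normal(size=(4,))          # second set of global gradients
x_now, x_prev = rng.normal(size=(4,)), rng.normal(size=(4,))
predicted_global = delay_compensate(stale_global, x_now, x_prev)
```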
In some embodiments, to compensate for the lag introduced by the gradient delay, the gradient of a backward-predicted step (that is, a gradient evaluated at model parameters predicted one step ahead) may be obtained by predicting the model parameters in the next training step. To this end, the working node 110 may locally store two identical neural networks, one of which is used for training purposes (hereinafter also referred to as the training neural network) and the other of which is used for predicting model parameters (hereinafter also referred to as the prediction neural network).
In one embodiment, the worker node 110 may obtain a first set of local gradients for a second set of network layers. The first set of local gradients is determined for the second set of network layers in a previous training step. The working nodes 110 may predict parameters of the neural network in a next training step after the current training step based on the first set of local gradients, and determine a second set of local gradients for a second set of network layers in the current training step based on the predicted parameters for aggregation with the local gradients determined by the remaining working nodes 110 of the plurality of working nodes 110 to obtain a global gradient for the second set of network layers.
Illustratively, the working node 110 may retrieve from the buffer 310 the first set of local gradients determined for the second set of network layers in the previous training step, and use the first set of local gradients, together with the parameters of the neural network in the current training step and the learning rate, to update the parameters of the prediction neural network, thereby obtaining a prediction of the parameters of the neural network in the next training step. Further, the working node 110 may determine a second set of local gradients for the second set of network layers in the current training step based on the updated parameters of the prediction neural network. Note that this second set of local gradients is the gradients of the backward-predicted step. The working node 110 may then have this second set of local gradients aggregated with the local gradients determined by the remaining working nodes 110 of the plurality of working nodes 110 to obtain the global gradients for the second set of network layers, which are used to update the parameters of the training neural network in the next training step. In this way, the parameters of the network layers to which the gradient delay is introduced can be updated with the gradients of the backward-predicted step, so that the introduced gradient delay can be offset, and the influence of the introduced gradient delay on model convergence and accuracy can be reduced.
In another embodiment, the working nodes 110 may predict parameters of the neural network in a next training step after the current training step based on the second set of global gradients, and determine a set of local gradients for the second set of network layers in the current training step based on the predicted parameters for aggregating with the local gradients determined by the remaining working nodes 110 of the plurality of working nodes 110 to obtain global gradients for the second set of network layers.
Illustratively, the working node 110 may use the second set of global gradients obtained from the buffer 310, together with the parameters of the neural network in the current training step and the learning rate, to update the parameters of the prediction neural network, thereby obtaining a prediction of the parameters of the neural network in the next training step. Further, the working node 110 may determine a set of local gradients for the second set of network layers in the current training step based on the updated parameters of the prediction neural network. Note that this set of local gradients is the gradients of the backward-predicted step. The working node 110 may then have this set of local gradients aggregated with the local gradients determined by the remaining working nodes 110 of the plurality of working nodes 110 to obtain the global gradients for the second set of network layers, which are used to update the parameters of the training neural network in the next training step. In this way, the parameters of the network layers to which the gradient delay is introduced can be updated with the gradients of the backward-predicted step, so that the introduced gradient delay can be offset, and the influence of the introduced gradient delay on model convergence and accuracy can be reduced.
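The two variants above differ only in which stale gradient drives the parameter prediction: the buffered first set of local gradients, or the buffered second set of global gradients. A minimal sketch of the shared pattern, assuming an SGD-style prediction rule and a toy quadratic loss standing in for the prediction neural network (all names and values are illustrative assumptions):

```python
import numpy as np

def backward_prediction_step_gradient(params, stale_grad, local_grad_fn, lr=0.01):
    """Update the prediction neural network one step ahead with a stale gradient
    (local or global), then evaluate the local gradient at the predicted
    parameters.  The result is the backward-predicted-step gradient that is
    sent to the service node for aggregation."""
    predicted_params = params - lr * stale_grad
    return local_grad_fn(predicted_params)

# Toy stand-in for back-propagation on this working node's data shard.
target = np.array([1.0, -2.0, 0.5])
local_grad_fn = lambda p: 2.0 * (p - target)

params = np.zeros(3)
stale_local = np.array([0.3, 0.1, -0.2])    # buffered local gradients (first variant)
stale_global = np.array([0.2, 0.0, -0.1])   # buffered global gradients (second variant)
send_1 = backward_prediction_step_gradient(params, stale_local, local_grad_fn)
send_2 = backward_prediction_step_gradient(params, stale_global, local_grad_fn)
```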
In yet another embodiment, the worker node 110 may obtain a first set of local gradients for a second set of network layers. The first set of local gradients is determined for the second set of network layers in a previous training step. The working node 110 may predict a third set of global gradients determined for the second set of network layers in the current training step based on the first set of local gradients and the second set of global gradients. Further, the working nodes 110 may predict parameters of the neural network in a next training step after the current training step based on the third set of global gradients, and determine a second set of local gradients for the second set of network layers in the current training step based on the predicted parameters for aggregation with the local gradients determined by the remaining working nodes 110 of the plurality of working nodes 110 to obtain global gradients for the second set of network layers.
Illustratively, the worker node 110 may retrieve from the buffer 310 the first set of local gradients determined for the second set of network layers in a previous training step. The worker node 110 may predict a third set of global gradients determined for the second set of network layers in the current training step using the first set of local gradients and the second set of global gradients based on equation (2) below:
ĝ_t = DC_λ(g_{t-1} − g_{i,t-1}/n, x_{t-1}, x_{t-2}) + g_{i,t}/n    (2)

where ĝ_t represents the predicted third set of global gradients, g_{t-1} represents the second set of global gradients, g_{i,t-1} represents the local gradients used in the aggregation that produced the second set of global gradients, g_{i,t} represents the first set of local gradients, n represents the total number of the plurality of working nodes 110, x_{t-1} represents the parameters of the neural network in the current training step, x_{t-2} represents the parameters of the neural network in the previous training step, and the function DC_λ(·) is defined in equation (1).
Further, the working node 110 may use the predicted third set of global gradients, together with the parameters of the neural network in the current training step and the learning rate, to update the parameters of the prediction neural network, thereby obtaining a prediction of the parameters of the neural network in the next training step. Further, the working node 110 may determine a second set of local gradients for the second set of network layers in the current training step based on the updated parameters of the prediction neural network. Note that this second set of local gradients is the gradients of the backward-predicted step. The working node 110 may then have this second set of local gradients aggregated with the local gradients determined by the remaining working nodes 110 of the plurality of working nodes 110 to obtain the global gradients for the second set of network layers, which are used to update the parameters of the training neural network in the next training step. In this way, the parameters of the network layers to which the gradient delay is introduced can be updated with the gradients of the backward-predicted step, so that the introduced gradient delay can be offset, and the influence of the introduced gradient delay on model convergence and accuracy can be reduced.
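For illustration, assuming equation (2) takes the combined form written above (this reading of the aggregation is an assumption; only the listed symbols are taken from the description), the prediction of the third set of global gradients could look like the following sketch:

```python
import numpy as np

def predict_global_gradient(stale_global, own_stale_local, own_fresh_local,
                            params_now, params_prev, n, lam=0.04):
    """Delay-compensate the other working nodes' stale contribution with
    DC_lambda from equation (1) and add this node's own fresh contribution."""
    others = stale_global - own_stale_local / n
    shift = params_now - params_prev
    others_compensated = others + lam * others * (others @ shift)
    return others_compensated + own_fresh_local / n

n = 4
g_stale = np.array([0.20, -0.10, 0.05])    # second set of global gradients
g_i_prev = np.array([0.25, -0.05, 0.00])   # own local gradient behind that aggregate
g_i_now = np.array([0.30, -0.12, 0.02])    # first set of local gradients
x_now, x_prev = np.ones(3), np.full(3, 0.9)
g_pred = predict_global_gradient(g_stale, g_i_prev, g_i_now, x_now, x_prev, n)
```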
While various possible ways to compensate for gradient delays have been described above, it should be understood that the introduced gradient delays may also be compensated in any other suitable way, and the scope of the present disclosure is not limited in this respect.
System design
By introducing gradient delay for the second set of network layers, the second set of global gradients for the second set of network layers may be stored in buffer 310. This allows the second set of global gradients to be retrieved together. Thus, in the case where the second set of network layers includes multiple network layers, the parameter updates for these network layers may be performed simultaneously. In some embodiments, the working node 110 may use the same kernel to update the parameters of the second set of network layers. In this way, compared with existing schemes that need to invoke a different kernel for each network layer to update its parameters, system resources and time overhead can be saved, further improving the training speed of the neural network.
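One way to realize such a single-kernel update is to keep the delayed layers' parameters and their buffered global gradients in one contiguous array, so that a single vectorized operation updates all of them at once; the layout below is an illustrative assumption sketched in NumPy rather than an actual GPU kernel.

```python
import numpy as np

shapes = {"layer1": (4, 3), "layer2": (3, 2)}             # second set of network layers
sizes = {name: int(np.prod(shape)) for name, shape in shapes.items()}

flat_params = np.ones(sum(sizes.values()))                 # contiguous parameter buffer
flat_grads = np.full_like(flat_params, 0.1)                # buffered global gradients

# One fused, vectorized update stands in for a single kernel launch that
# replaces one launch per network layer.
flat_params -= 0.01 * flat_grads

# Per-layer views share storage with the flat buffer, so no copies are needed.
views, offset = {}, 0
for name, size in sizes.items():
    views[name] = flat_params[offset:offset + size].reshape(shapes[name])
    offset += size
```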
In some embodiments, the buffer 310 may include two different buffer regions, a first buffer region and a second buffer region. The working node 110 may store the second set of global gradients in the first buffer of the buffer 310 in a previous training step and read the second set of global gradients directly from the first buffer in a current training step. Further, in the current training step, the working node 110 acquires a third set of global gradients for the second set of network layers through a communication process. The third set of global gradients is aggregated from the local gradients determined by the plurality of worker nodes 110 for the second set of network layers in the current training step. The worker node 110 may in turn store the third set of global gradients in the second buffer of the buffer 310 and read it directly from the second buffer in the next training step. In this way, compared with the existing scheme, the method can avoid the system overhead caused by copying the gradient data from one buffer area to another buffer area in the buffer, thereby improving the training speed of the neural network.
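A minimal sketch of such double buffering (the class and method names are illustrative assumptions):

```python
class DoubleGradientBuffer:
    """Two buffer regions: the current training step reads the previous step's
    global gradients from one region while this step's aggregated gradients are
    written into the other, so nothing is ever copied between regions."""

    def __init__(self):
        self.regions = [{}, {}]
        self.read_idx = 0          # region holding the previous step's gradients

    def read(self, layer):
        return self.regions[self.read_idx][layer]

    def write(self, layer, grad):
        self.regions[1 - self.read_idx][layer] = grad

    def swap(self):
        # Called once per training step: what was just written becomes readable.
        self.read_idx = 1 - self.read_idx

buf = DoubleGradientBuffer()
buf.write("layer1", [0.1, 0.2])     # stored during the previous training step
buf.swap()
stale = buf.read("layer1")          # consumed in the current training step
buf.write("layer1", [0.3, 0.4])     # this step's aggregated global gradient
```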
In some embodiments, in a previous training step, the working node 110 may store the second set of global gradients in the buffer 310 and lock the buffer 310 after the second set of global gradients is stored in the buffer 310. Further, when the second set of global gradients needs to be obtained in the current training step, the working node 110 may unlock the buffer 310 and read the second set of global gradients from the buffer 310. Subsequently, the working node 110 may store the third set of global gradients for the second set of network layers acquired in the current training step in the buffer 310. In this manner, a shared buffer 310 may be implemented, thereby reducing the overhead of memory resources by the buffer 310.
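A minimal sketch of the shared, lockable buffer (the names and the exact locking discipline are illustrative assumptions; here the lock is taken before overwriting so that stored gradients cannot be replaced until they have been read):

```python
import threading

class SharedGradientBuffer:
    """A single shared buffer region reused across training steps."""

    def __init__(self):
        self._data = None
        self._lock = threading.Lock()

    def store_and_lock(self, grads):
        # Blocks if the previously stored gradients have not been read yet,
        # then stores the new gradients and keeps the buffer locked.
        self._lock.acquire()
        self._data = grads

    def unlock_and_read(self):
        # Hands back the stored global gradients and unlocks the buffer so the
        # same region can hold the next training step's gradients.
        data = self._data
        self._lock.release()
        return data

buf = SharedGradientBuffer()
buf.store_and_lock({"layer1": [0.1, 0.2]})   # previous training step
stale = buf.unlock_and_read()                 # current training step
buf.store_and_lock({"layer1": [0.3, 0.4]})   # same region, reused
```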
In some embodiments, a working node 110 may include two different processing units, a first processing unit and a second processing unit, where the first processing unit is configured to implement training of a neural network and the second processing unit is configured to control communications at the working node 110. The working node 110 may determine a total computation time and a total communication time in a training step for the neural network based on pre-warm training of the neural network, and select a buffer from a first buffer associated with a first processing unit and a second buffer associated with a second processing unit based on a comparison of the total computation time and the total communication time. Further, the worker node 110 may store the second set of global gradients in the selected buffer.
Illustratively, the working node 110 may include a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU), where the CPU may be used to control communications at the working node 110 and the GPU may be used to train the neural network. If it is determined that the total computation time of the neural network in a training step is greater than the total communication time, a buffer associated with the GPU may be selected to store the second set of global gradients. If it is determined that the total computation time of the neural network in a training step is less than the total communication time, a buffer associated with the CPU may be selected to store the second set of global gradients. In this way, transmission overhead can be moved from a relatively slow pipeline stage to a relatively fast pipeline stage, thereby balancing the neural network training pipeline and increasing the training speed of the neural network.
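A sketch of this placement decision (the function name and the timing values are illustrative; the comparison rule follows the description above):

```python
def choose_gradient_buffer(total_compute_ms, total_comm_ms):
    """Select where the delayed global gradients are buffered: GPU-side when
    computation dominates the step, CPU-side when communication dominates."""
    return "gpu_buffer" if total_compute_ms > total_comm_ms else "cpu_buffer"

# Totals measured during the warm-up (pre-heating) training phase.
print(choose_gradient_buffer(total_compute_ms=120.0, total_comm_ms=95.0))  # gpu_buffer
print(choose_gradient_buffer(total_compute_ms=80.0, total_comm_ms=140.0))  # cpu_buffer
```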
As can be seen from the above description in conjunction with fig. 1-3, a method for training a neural network according to an embodiment of the present disclosure hides communication overhead by introducing gradient delays for only a portion of the network layers in the neural network. In this way, the influence of the introduced gradient delay on the model convergence and accuracy can be reduced as much as possible, thereby ensuring the convergence and accuracy of the model while hiding communication overhead.
Example implementations of the method according to the present disclosure have been described in detail above with reference to fig. 1 to 3, and implementations of the respective apparatuses and devices will be described hereinafter with reference to fig. 4 and 5.
Example apparatus and devices
Fig. 4 illustrates a block diagram of an example apparatus 400 for training a neural network, in accordance with some embodiments of the present disclosure. The apparatus 400 may be used, for example, to implement a worker node 110 as shown in fig. 1. The apparatus 400 may include a first acquisition module 402 configured to: at a first working node of the plurality of working nodes, a first set of global gradients for a first set of network layers in the neural network is obtained, the first set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the first set of network layers in a current training step, the plurality of working nodes being configured to jointly train the neural network. Further, the apparatus 400 may include a second obtaining module 404 configured to: obtaining a second set of global gradients for a second set of network layers in the neural network, the second set of network layers being different from the first set of network layers, the second set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the second set of network layers in a previous training step prior to the current training step. Further, the apparatus 400 may also include an update module 406 configured to: parameters of the neural network are updated based on the first set of global gradients and the second set of global gradients.
In some embodiments, the first set of global gradients and the second set of global gradients are aggregated at the service node, the apparatus 400 further comprising: a time determination module configured to: determining a forward propagation computation time, a backward propagation computation time, and a round-trip communication time between the first working node and the service node for each network layer in the neural network based on pre-heating training of the neural network; a number determination module configured to: determining a number of network layers for the gradient delay based on the determined forward propagation computation time, backward propagation computation time, and round trip communication time; and a network layer determination module configured to: based on the determined number of network layers, a first set of network layers and a second set of network layers are determined.
In some embodiments, the update module 406 includes: a first parameter update module configured to: updating parameters of the first set of network layers with the first set of global gradients; a gradient prediction module configured to: predicting a third set of global gradients determined for the second set of network layers in the current training step based on the second set of global gradients; and a second parameter update module configured to: parameters of the second set of network layers are updated with a third set of global gradients.
In some embodiments, the apparatus 400 further comprises: a first gradient acquisition module configured to: obtaining a first set of local gradients for the second set of network layers, the first set of local gradients determined for the second set of network layers in a previous training step; a first parameter prediction module configured to: predicting parameters of the neural network in a next training step after the current training step based on the first set of local gradients; and a first gradient determination module configured to: based on the predicted parameters, a second set of local gradients for the second set of network layers in the current training step is determined for aggregation with the local gradients determined for the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
In some embodiments, the apparatus 400 further comprises: a second parameter prediction module configured to: predicting parameters of the neural network in a next training step after the current training step based on the second set of global gradients; and a second gradient determination module configured to: based on the predicted parameters, a set of local gradients for the second set of network layers in the current training step is determined for aggregation with the local gradients determined for the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
In some embodiments, the apparatus 400 further comprises: a second gradient acquisition module configured to: obtaining a first set of local gradients for the second set of network layers, the first set of local gradients being determined for the second set of network layers in a previous training step; a global gradient prediction module configured to: predicting a third set of global gradients determined for the second set of network layers in the current training step based on the first set of local gradients and the second set of global gradients; a third parameter prediction module configured to: predicting parameters of the neural network in a next training step after the current training step based on the third set of global gradients; and a third gradient determination module configured to: based on the predicted parameters, a second set of local gradients for the second set of network layers in the current training step is determined for aggregating with the local gradients determined for the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
The modules and/or units included in the apparatus 400 may be implemented in various ways, including software, hardware, firmware, or any combination thereof. In some embodiments, one or more of the modules and/or units may be implemented using software and/or firmware, such as machine-executable instructions stored on a storage medium. In addition to, or in the alternative to, machine-executable instructions, some or all of the modules and/or units in apparatus 400 may be implemented at least in part by one or more hardware logic components. By way of example, and not limitation, exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so forth.
These modules and/or units illustrated in fig. 4 may be partially or wholly implemented as hardware modules, software modules, firmware modules, or any combination thereof. In particular, in certain embodiments, the processes, methods, or procedures described above may be implemented by hardware in a storage system or a host corresponding to the storage system or other computing device independent of the storage system.
FIG. 5 illustrates a block diagram of a computing device 500 in which one or more embodiments of the disclosure may be implemented. It should be understood that the computing device 500 illustrated in FIG. 5 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The computing device 500 shown in FIG. 5 may be used to implement the working node 110 of FIG. 1.
As shown in fig. 5, computing device 500 is in the form of a general purpose computing device. Components of computing device 500 may include, but are not limited to, one or more processors or processing units 510, memory 520, storage 530, one or more communication units 540, one or more input devices 550, and one or more output devices 560. The processing unit 510 may be a real or virtual processor and may be capable of performing various processes according to programs stored in the memory 520. In a multi-processor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capabilities of computing device 500.
Computing device 500 typically includes a number of computer storage media. Such media may be any available media that are accessible by computing device 500, including but not limited to volatile and non-volatile media, and removable and non-removable media. Memory 520 may be volatile memory (e.g., registers, cache, Random Access Memory (RAM)), non-volatile memory (e.g., Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or some combination thereof. Storage 530 may be a removable or non-removable medium and may include a machine-readable medium, such as a flash drive, a magnetic disk, or any other medium that can be used to store information and/or data (e.g., training data) and that can be accessed within computing device 500.
Computing device 500 may further include additional removable/non-removable, volatile/nonvolatile storage media. Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. Memory 520 may include a computer program product 525 having one or more program modules configured to perform the various methods or acts of the various embodiments of the disclosure.
The communication unit 540 enables communication with other computing devices over a communication medium. Additionally, the functionality of the components of computing device 500 may be implemented in a single computing cluster or multiple computing machines, which are capable of communicating over a communications connection. Thus, computing device 500 may operate in a networked environment using logical connections to one or more other servers, network Personal Computers (PCs), or another network node.
The input device 550 may be one or more input devices, such as a mouse, a keyboard, or a trackball. The output device 560 may be one or more output devices, such as a display, speakers, or a printer. Through the communication unit 540, computing device 500 may also communicate, as desired, with one or more external devices (not shown) such as storage devices or display devices, with one or more devices that enable a user to interact with computing device 500, or with any device (e.g., a network card, a modem, etc.) that enables computing device 500 to communicate with one or more other computing devices. Such communication may be performed via input/output (I/O) interfaces (not shown).
According to an exemplary implementation of the present disclosure, a computer-readable storage medium having stored thereon computer-executable instructions is provided, wherein the computer-executable instructions are executed by a processor to implement the above-described method. According to an exemplary implementation of the present disclosure, there is also provided a computer program product, tangibly stored on a non-transitory computer-readable medium and comprising computer-executable instructions that are executed by a processor to implement the method described above.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, devices and computer program products implemented in accordance with the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing has described implementations of the present disclosure. The above description is illustrative rather than exhaustive, and the present disclosure is not limited to the implementations disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen to best explain the principles of the implementations, the practical application, or the improvement over technologies available in the marketplace, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims (20)

1. A method for training a neural network, comprising:
obtaining, at a first working node of a plurality of working nodes, a first set of global gradients for a first set of network layers in a neural network, the first set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the first set of network layers in a current training step, the plurality of working nodes configured to jointly train the neural network;
obtaining a second set of global gradients for a second set of network layers in the neural network, the second set of network layers being different from the first set of network layers, the second set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the second set of network layers in a previous training step prior to the current training step; and
updating parameters of the neural network based on the first set of global gradients and the second set of global gradients.
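As a purely illustrative reading of claim 1, the Python sketch below applies a plain SGD update in which the first set of layers uses the freshly aggregated gradients of the current step while the second set uses the aggregate delayed from the previous step. The SGD rule, the learning rate, and the dictionary layout are assumptions introduced for illustration and are not part of the claim.

```python
def apply_mixed_update(params, fresh_global, stale_global, lr=0.01):
    # First set of network layers: global gradients aggregated in the
    # current training step.
    for name, grad in fresh_global.items():
        params[name] = params[name] - lr * grad
    # Second set of network layers: global gradients aggregated in the
    # previous training step (the gradient delay that hides communication).
    for name, grad in stale_global.items():
        params[name] = params[name] - lr * grad
    return params
```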
2. The method of claim 1, wherein the first set of global gradients and the second set of global gradients are aggregated at a service node, the method further comprising:
determining a forward propagation computation time, a backward propagation computation time, and a round trip communication time between the first working node and the service node for each network layer in the neural network based on warm-up training of the neural network;
determining a number of network layers for gradient delay based on the determined forward propagation computation time, the backward propagation computation time, and the round trip communication time; and
determining the first set of network layers and the second set of network layers based on the determined number of network layers.
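One way the layer counting of this claim could be realized is sketched below: per-layer timings from warm-up training are scanned from the input layer onward, and a layer is counted as delayed when the round-trip communication of its gradients cannot be hidden behind the computation that remains around its backward pass. The specific overlap criterion is an assumption for illustration, not something this claim prescribes.

```python
def count_delayed_layers(fwd_time, bwd_time, round_trip):
    # fwd_time[l], bwd_time[l], round_trip[l]: per-layer timings measured
    # during warm-up training; layer 0 is the input layer.
    delayed = 0
    for l in range(len(round_trip)):
        # Computation that could still overlap layer l's gradient exchange:
        # the backward passes of the layers below it plus the forward passes
        # of layers 0..l at the start of the next step.
        overlap_budget = sum(bwd_time[:l]) + sum(fwd_time[:l + 1])
        if round_trip[l] > overlap_budget:
            delayed += 1
    return delayed
```

Under this sketch, the layers counted as delayed, taken from the input layer onward, would form the second set of network layers, with the remaining layers forming the first set (cf. claim 3).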
3. The method of claim 2, wherein determining the first set of network layers and the second set of network layers comprises:
selecting the determined number of network layers as the second set of network layers from an input layer of the neural network; and
determining remaining network layers in the neural network as the first set of network layers.
4. The method of claim 1, wherein updating parameters of the neural network comprises:
updating parameters of the first set of network layers with the first set of global gradients;
predicting a third set of global gradients determined for the second set of network layers in the current training step based on the second set of global gradients; and
updating parameters of the second set of network layers with the third set of global gradients.
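Claim 4 does not fix how the current step's global gradients are predicted from the stale aggregate; the sketch below shows two simple candidate estimators, both of which are assumptions introduced here for illustration only.

```python
def predict_third_global(prev_global, older_global=None):
    if older_global is None:
        # Simplest estimator: reuse the previous step's aggregate unchanged.
        return dict(prev_global)
    # If the aggregate from two steps back is also retained, extrapolate
    # the per-parameter trend linearly.
    return {k: 2.0 * prev_global[k] - older_global[k] for k in prev_global}
```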
5. The method of claim 1, further comprising:
obtaining a first set of local gradients for the second set of network layers, the first set of local gradients determined for the second set of network layers in the previous training step;
predicting parameters of the neural network in a next training step after the current training step based on the first set of local gradients; and
determining a second set of local gradients for the second set of network layers in the current training step based on the predicted parameters, for aggregating with the local gradients determined by the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
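As an illustration of claim 5, the sketch below predicts where the delayed layers' parameters will be using the node's own previous local gradients and then evaluates this step's local gradients at that predicted point. The SGD-style prediction, the learning rate, and the hypothetical `loss_and_grad` callable are assumptions for illustration.

```python
def local_grads_at_predicted_params(params, prev_local, batch,
                                    loss_and_grad, lr=0.01):
    # Predict the parameters of the next training step by applying this
    # node's own previous local gradients to the delayed layers.
    predicted = {k: v - lr * prev_local.get(k, 0.0)
                 for k, v in params.items()}
    # Compute the current step's local gradients at the predicted point;
    # these are what get aggregated with the other working nodes' gradients.
    _, local_grads = loss_and_grad(predicted, batch)
    return local_grads
```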
6. The method of claim 1, further comprising:
predicting parameters of the neural network in a next training step after the current training step based on the second set of global gradients; and
determining a set of local gradients for the second set of network layers in the current training step based on the predicted parameters, for aggregating with the local gradients determined by the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
7. The method of claim 1, further comprising:
obtaining a first set of local gradients for the second set of network layers, the first set of local gradients determined for the second set of network layers in the previous training step;
predicting a third set of global gradients determined for the second set of network layers in the current training step based on the first set of local gradients and the second set of global gradients;
predicting parameters of the neural network in a next training step after the current training step based on the third set of global gradients; and
determining a second set of local gradients for the second set of network layers in the current training step based on the predicted parameters, for aggregating with the local gradients determined by the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
8. The method of any of claims 1 to 7, wherein the parameters of the second set of network layers are updated using the same kernel.
9. The method of any of claims 1 to 7, wherein the second set of global gradients is stored in a first buffer region of a buffer in the previous training step, and wherein obtaining the second set of global gradients comprises:
reading the second set of global gradients from the first buffer region.
10. The method of claim 9, further comprising:
obtaining a third set of global gradients for the second set of network layers in the neural network, the third set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the second set of network layers in the current training step; and
storing the third set of global gradients in a second buffer region of the buffer, the second buffer region being different from the first buffer region.
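Claims 9 and 10 together describe a double-buffering scheme for the delayed global gradients. A minimal sketch, assuming a simple two-slot container, is given below; the class name and layout are illustrative, not the patent's data structure.

```python
class GradientDoubleBuffer:
    # Two buffer regions: the aggregate of the current step is written into
    # one region while the aggregate of the previous step is read from the
    # other, so neither overwrites the other.
    def __init__(self):
        self.regions = [None, None]
        self.write_idx = 0

    def store(self, global_grads):
        # Claim 10: write the newly aggregated gradients into the region
        # not holding the previous step's aggregate.
        self.regions[self.write_idx] = global_grads
        self.write_idx = 1 - self.write_idx

    def load_previous(self):
        # Claim 9: read the aggregate that was stored in the previous step.
        return self.regions[1 - self.write_idx]
```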
11. The method of any of claims 1 to 7, wherein the second set of global gradients is stored in a buffer in the previous training step, the method further comprising:
locking the buffer in response to the second set of global gradients being stored in the buffer; and
wherein obtaining the second set of global gradients comprises:
unlocking the buffer in the current training step; and
reading the second set of global gradients from the buffer.
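A minimal sketch of the lock-protected single-buffer variant of claim 11 follows, using a plain threading lock. The exact ordering of locking relative to the write is an assumption made here so that an unread aggregate cannot be overwritten; the claim itself only recites locking once the gradients are stored and unlocking before they are read.

```python
import threading

class LockedGradientBuffer:
    # Single buffer protected by a lock: the buffer stays locked from the
    # moment an aggregate is stored until the next step unlocks and reads it.
    def __init__(self):
        self._lock = threading.Lock()
        self._grads = None

    def store(self, global_grads):
        # Acquire before writing so that a pending, unread aggregate is
        # never overwritten (illustrative reordering of the claim wording).
        self._lock.acquire()
        self._grads = global_grads

    def load(self):
        grads = self._grads      # read the previously stored aggregate
        self._lock.release()     # unlock in the current training step
        return grads
```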
12. The method of any one of claims 1 to 7, wherein the training of the neural network is implemented at a first processing unit, the method further comprising:
determining a total computation time and a total communication time in a training step for the neural network based on warm-up training of the neural network;
selecting, based on a comparison of the total computation time and the total communication time, a buffer from a first buffer associated with the first processing unit and a second buffer associated with a second processing unit, the second processing unit being configured to control communication at the first working node; and
storing the second set of global gradients in the selected buffer.
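Claim 12 only requires that the buffer be chosen by comparing the total computation time with the total communication time; the rule sketched below is one plausible choice and is stated here purely as an assumption.

```python
def choose_gradient_buffer(total_compute_time, total_comm_time,
                           first_buffer, second_buffer):
    # Illustrative rule: when a step is computation-bound, communication can
    # be hidden and the gradients can stay in the buffer of the processing
    # unit doing the training (e.g. a GPU); when it is communication-bound,
    # parking them in the buffer of the unit that controls communication
    # (e.g. a CPU) keeps them off the training unit's critical path.
    if total_compute_time >= total_comm_time:
        return first_buffer
    return second_buffer
```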
13. An apparatus for training a neural network, comprising:
a first acquisition module configured to: obtaining, at a first working node of a plurality of working nodes, a first set of global gradients for a first set of network layers in a neural network, the first set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the first set of network layers in a current training step, the plurality of working nodes configured to jointly train the neural network;
a second acquisition module configured to: obtaining a second set of global gradients for a second set of network layers in the neural network, the second set of network layers being different from the first set of network layers, the second set of global gradients being aggregated from local gradients determined by the plurality of working nodes for the second set of network layers in a previous training step prior to the current training step; and
an update module configured to: updating parameters of the neural network based on the first set of global gradients and the second set of global gradients.
14. The apparatus of claim 13, wherein the first set of global gradients and the second set of global gradients are aggregated at a service node, the apparatus further comprising:
a time determination module configured to: determining a forward propagation computation time, a backward propagation computation time, and a round trip communication time between the first working node and the service node for each network layer in the neural network based on warm-up training of the neural network;
a number determination module configured to: determining a number of network layers for gradient delay based on the determined forward propagation computation time, the backward propagation computation time, and the round trip communication time; and
a network layer determination module configured to: determining the first set of network layers and the second set of network layers based on the determined number of network layers.
15. The apparatus of claim 13, wherein the update module comprises:
a first parameter update module configured to: updating parameters of the first set of network layers with the first set of global gradients;
a gradient prediction module configured to: predicting a third set of global gradients determined for the second set of network layers in the current training step based on the second set of global gradients; and
a second parameter update module configured to: updating parameters of the second set of network layers with the third set of global gradients.
16. The apparatus of claim 13, further comprising:
a first gradient acquisition module configured to: obtaining a first set of local gradients for the second set of network layers, the first set of local gradients determined for the second set of network layers in the previous training step;
a first parameter prediction module configured to: predicting parameters of the neural network in a next training step after the current training step based on the first set of local gradients; and
a first gradient determination module configured to: determining a second set of local gradients for the second set of network layers in the current training step based on the predicted parameters, for aggregating with the local gradients determined by the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
17. The apparatus of claim 13, further comprising:
a second parameter prediction module configured to: predicting parameters of the neural network in a next training step after the current training step based on the second set of global gradients; and
a second gradient determination module configured to: determining a set of local gradients for the second set of network layers in the current training step based on the predicted parameters, for aggregating with the local gradients determined by the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
18. The apparatus of claim 13, further comprising:
a second gradient acquisition module configured to: obtaining a first set of local gradients for the second set of network layers, the first set of local gradients determined for the second set of network layers in the previous training step;
a global gradient prediction module configured to: predicting a third set of global gradients determined for the second set of network layers in the current training step based on the first set of local gradients and the second set of global gradients;
a third parameter prediction module configured to: predicting parameters of the neural network in a next training step after the current training step based on the third set of global gradients; and
a third gradient determination module configured to: determining a second set of local gradients for the second set of network layers in the current training step based on the predicted parameters, for aggregating with the local gradients determined by the remaining working nodes of the plurality of working nodes to obtain a global gradient for the second set of network layers.
19. An electronic device, comprising:
at least one processing unit; and
at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method of any of claims 1 to 12.
20. A computer-readable storage medium, having stored thereon a computer program executable by a processor to implement a method according to any one of claims 1 to 12.
CN202211431463.4A 2022-11-15 2022-11-15 Method, apparatus, device and storage medium for training neural network Pending CN115688867A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211431463.4A CN115688867A (en) 2022-11-15 2022-11-15 Method, apparatus, device and storage medium for training neural network
PCT/CN2023/130540 WO2024104232A1 (en) 2022-11-15 2023-11-08 Method and apparatus for training neural network, and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211431463.4A CN115688867A (en) 2022-11-15 2022-11-15 Method, apparatus, device and storage medium for training neural network

Publications (1)

Publication Number Publication Date
CN115688867A true CN115688867A (en) 2023-02-03

Family

ID=85054395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211431463.4A Pending CN115688867A (en) 2022-11-15 2022-11-15 Method, apparatus, device and storage medium for training neural network

Country Status (2)

Country Link
CN (1) CN115688867A (en)
WO (1) WO2024104232A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024104232A1 (en) * 2022-11-15 2024-05-23 抖音视界有限公司 Method and apparatus for training neural network, and device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11928598B2 (en) * 2019-10-24 2024-03-12 Alibaba Group Holding Limited Method and system for distributed neural network training
CN113449839A (en) * 2020-03-25 2021-09-28 阿里巴巴集团控股有限公司 Distributed training method, gradient communication device and computing equipment
US11948352B2 (en) * 2020-03-26 2024-04-02 Amazon Technologies, Inc. Speculative training using partial gradients update
CN111723933B (en) * 2020-06-03 2024-04-16 上海商汤智能科技有限公司 Training method of neural network model and related products
CN113435604B (en) * 2021-06-16 2024-05-07 清华大学 Federal learning optimization method and device
CN115688867A (en) * 2022-11-15 2023-02-03 抖音视界有限公司 Method, apparatus, device and storage medium for training neural network

Also Published As

Publication number Publication date
WO2024104232A1 (en) 2024-05-23

Similar Documents

Publication Publication Date Title
US20220222531A1 (en) Asynchronous neural network training
US10002402B2 (en) Learning convolution neural networks on heterogeneous CPU-GPU platform
US11263539B2 (en) Distributed machine learning method and system
CN114787824A (en) Combined hybrid model
US11630994B2 (en) Optimized asynchronous training of neural networks using a distributed parameter server with eager updates
US20210295168A1 (en) Gradient compression for distributed training
US11222277B2 (en) Enhancing robustness of pseudo-relevance feedback models using query drift minimization
US9367293B2 (en) System and method for compiler assisted parallelization of a stream processing operator
US11948352B2 (en) Speculative training using partial gradients update
WO2024104232A1 (en) Method and apparatus for training neural network, and device and storage medium
CN113469355A (en) Multi-model training pipeline in distributed system
US20240176906A1 (en) Methods, apparatuses, and systems for collaboratively updating model by multiple parties for implementing privacy protection
US20240232630A1 (en) Neural network training in a distributed system
CN114911596B (en) Scheduling method and device for model training, electronic equipment and storage medium
Xu et al. Dynamic backup workers for parallel machine learning
TWI758223B (en) Computing method with dynamic minibatch sizes and computing system and computer-readable storage media for performing the same
US10949353B1 (en) Data iterator with automatic caching
JP2020003860A (en) Learning system, processing device, processing method, and program
US9720675B2 (en) Version mismatch delay and update for a distributed system
CN112860719B (en) Data processing method and device and electronic equipment
US11989172B2 (en) System and method for managing transition between inference models across multiple data processing systems
CN118093203B (en) Data handling method, distributed training system, electronic device, and storage medium
CN116542324B (en) Distributed asynchronous protocol method and device for intelligent computing
Zhao et al. Scaling Machine Learning with a Ring-based Distributed Framework
CN111709513B (en) Training system and method for long-term and short-term memory network LSTM and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination