CN115019079B - Method for accelerating deep learning training by distributed outline optimization for image recognition - Google Patents

Method for accelerating deep learning training by distributed outline optimization for image recognition

Info

Publication number
CN115019079B
CN115019079B
Authority
CN
China
Prior art keywords
optimization
matrix
distributed
neural network
pictures
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110239799.XA
Other languages
Chinese (zh)
Other versions
CN115019079A (en)
Inventor
文再文
杨明瀚
许东
陈铖
田永鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN202110239799.XA
Publication of CN115019079A
Application granted
Publication of CN115019079B
Legal status: Active (current)
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for accelerating deep learning training by distributed sketching optimization for image recognition, characterized in that a plurality of computing nodes are set up for distributed computation; when the second-order matrix in the deep neural network model for image recognition is computed during training, block diagonal approximation is adopted, and a sketching method is used to compute matrix-vector products implicitly and in a distributed manner; each computing node feeds pictures in parallel into the deep neural network model for image recognition trained by this method and predicts the output vector of each picture, i.e., the probability that the picture belongs to each label, thereby realizing image recognition. The method greatly reduces the amount of computation, increases the training speed of the deep neural network, shortens the training time, and enhances the advantage of the second-order algorithm over the first-order algorithm.

Description

Method for accelerating deep learning training by distributed outline optimization for image recognition
Technical Field
The invention relates to a deep neural network training method in image recognition technology, and in particular to a method for accelerating deep neural network training by distributed sketching-based second-order optimization in image recognition applications.
Background
Deep neural networks are one of the most important technologies in the current image recognition field. A deep neural network must be trained on data before it is put into use. Generally, the more pictures used for training and the richer their variety, the better the image recognition performance of the trained deep neural network. However, the huge amount of data also means that training consumes many resources; training a large network often takes days or even weeks.
A deep neural network can be seen as a function with a large number of trainable parameters, so training a deep neural network can be viewed as an optimization problem over the parameters of that function. During training, if the set of all parameters is regarded as a vector, each iteration essentially computes a direction along which the parameter vector is updated. Current neural network training methods can be roughly divided into first-order and second-order optimization algorithms.
First-order algorithms are the most common methods in deep neural network training. Typically, a first-order algorithm trains using only the gradient information (i.e., the first derivative) of the deep neural network viewed as a function, the most typical example being stochastic gradient descent. However, first-order optimization algorithms converge more slowly, require more training iterations, are sensitive to hyperparameter values, and lose generalization ability in massively parallel training.
Compared with first-order algorithms trained with gradient information only, a second-order algorithm also uses second-derivative information of the function, namely the Hessian matrix of a multivariate function. With a second-order algorithm, the number of iterations required for training is typically far smaller than with a first-order algorithm. However, in deep neural network training the parameters are extremely large in scale, and the Hessian matrix is so large and complex that it is difficult to compute or store; applying a second-order algorithm to deep neural network training therefore requires constructing an approximation of the Hessian to reduce the computation and storage complexity. At the same time, since computing the Hessian-related quantities usually involves the gradient information of all samples, it is very difficult to distribute this computation.
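To make the scale concrete (an illustrative figure, not one taken from the patent): a network with p = 10^7 parameters has a Hessian with p^2 = 10^14 entries; stored as 4-byte floating-point numbers this alone would occupy about 4 × 10^14 bytes, roughly 400 TB, which is why explicit computation or storage of the Hessian is infeasible and approximations are required.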
Disclosure of Invention
The invention aims to provide a distributed sketching optimization method for accelerating deep learning training for image recognition, a new second-order optimization method applied to large-scale deep neural network training for image recognition, which shortens the training time of the deep neural network and improves the final result obtained by training.
The principle of the invention is as follows: in large-scale deep neural network training for image recognition, when the second-order matrix approximating the Hessian is computed, block diagonal approximation is adopted and distributed sketching optimization is performed, i.e., the products of the matrix with vectors are computed implicitly and in a distributed manner using a sketching method, which greatly reduces the amount of computation, shortens the training time, and enhances the advantage of the second-order algorithm over the first-order algorithm.
The deep neural network used to recognize images must be trained before it can be applied. In convolutional neural networks and their variants (e.g., ResNet, VGGNet), an image is treated as a tensor of three two-dimensional matrices corresponding to the pixels of the three RGB colors. Each two-dimensional matrix is processed with a plurality of convolution kernels. Each convolution kernel is a small matrix of weights whose function is to take a weighted average over a small region of the same size on the image, thereby extracting the features of that region. Applying a convolution kernel to every small region yields an output image after feature extraction, which is again a tensor composed of several two-dimensional matrices. After operations such as a nonlinearity (the ReLU operation), the output image serves as the input to the next convolutional layer, where the same operations are performed. After several convolutional layers, the output is passed through a fully connected layer to obtain the final output vector; this vector reflects the probability that the image belongs to each label, and the label with the highest probability is taken as the recognition result.
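The convolution–ReLU–fully-connected pipeline described above can be illustrated with a minimal PyTorch-style sketch; the layer sizes, the 32×32 input resolution and the 10 labels are illustrative assumptions, not values fixed by the patent, and this is not the patent's network.

    import torch
    import torch.nn as nn

    class TinyConvNet(nn.Module):
        """Minimal illustration of the conv -> ReLU -> ... -> fully connected pipeline."""
        def __init__(self, num_classes=10):                           # num_classes is an assumed value
            super().__init__()
            self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # RGB tensor in, 16 feature maps out
            self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
            self.pool = nn.MaxPool2d(2)
            self.fc = nn.Linear(32 * 8 * 8, num_classes)              # assumes 32x32 input images

        def forward(self, x):
            x = self.pool(torch.relu(self.conv1(x)))   # convolution kernels + nonlinearity (ReLU)
            x = self.pool(torch.relu(self.conv2(x)))
            x = x.flatten(1)
            logits = self.fc(x)                        # fully connected layer -> final output vector
            return torch.softmax(logits, dim=1)        # probability that the image belongs to each label

    probs = TinyConvNet()(torch.randn(4, 3, 32, 32))   # 4 pictures -> 4 probability vectors over 10 labels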
For an untrained deep neural network, the predicted label represented by the final output vector is often far from the true label of the image, so the various parameters in the deep neural network need to be adjusted; this is the training of the deep neural network. The invention provides a new technical scheme, accelerating deep learning training by distributed sketching optimization, for the training process of the deep neural networks dominated by convolutional layers that are used in image recognition.
The technical scheme provided by the invention is as follows:
A method for accelerating deep learning training by distributed sketching optimization, used to train a deep neural network model for image recognition; a plurality of computing nodes are set up for distributed computation, each computing node feeds pictures in parallel into the neural network trained by this method, and the final predicted output vector of each picture, i.e., the probability that the picture belongs to each label, is obtained, thereby realizing image recognition;
The method comprises the following steps:
1) A large number of pictures with known labels are first prepared. The pictures are divided into a plurality of batches, each consisting of a small number of pictures, and the neural network model training tasks are performed on them.
A full pass of the input pictures used for training the deep neural network is called a round; the pictures are divided into groups containing a small number of pictures (e.g., 64 pictures form one group), and one group is called a batch. The input of one batch of pictures can be taken as the minimum unit of deep neural network training.
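A minimal sketch of this batching and of spreading one batch over the computing nodes; the batch size of 64, the 8 nodes and the round-robin split are illustrative assumptions.

    def make_batches(pictures, batch_size=64):
        """Split the labeled pictures into batches of batch_size (the last batch may be smaller)."""
        return [pictures[i:i + batch_size] for i in range(0, len(pictures), batch_size)]

    def distribute_to_nodes(batch, num_nodes=8):
        """Spread one batch as evenly as possible across the computing nodes (round-robin)."""
        return [batch[node::num_nodes] for node in range(num_nodes)]

    batches = make_batches(list(range(1000)))      # placeholder "pictures": indices 0..999
    per_node = distribute_to_nodes(batches[0])     # 8 lists of 8 pictures each for the first batch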
In convolutional neural networks and their variants (e.g., ResNet, VGGNet), an image is processed as a tensor of three two-dimensional matrices corresponding to the RGB channels, and each two-dimensional matrix is processed with a plurality of convolution kernels.
2) A distributed computing system with a plurality of computing nodes is set up and used during neural network training to optimize and adjust the parameters in the neural network with a loss function that measures the gap between the predicted probability vector and the true image label, thereby obtaining the optimized parameters of the neural network;
The parameters in the neural network include the weight values of the convolution kernels of the convolutional layers and the weight values of the fully connected layers. Each computing node feeds its pictures in parallel into the trained neural network to obtain the final predicted probability vector of each picture.
Regarding all parameters in the neural network as one vector, the invention adopts the distributed sketching optimization method for accelerating deep learning training so that the sum of the loss functions becomes smaller after the parameters are optimized and adjusted.
The optimized parameters are computed as follows.
21) The optimization calculation of the parameters of each layer in the deep neural network is expressed by the parameter optimization direction calculation formula, in which d is the final optimization direction of the parameter vector; after this round of adjustment, the parameter vector is adjusted along the direction of d. U is the gradient matrix formed by the per-picture gradients; each column vector of U is the gradient of the parameters of this layer, obtained by backpropagation of the loss function of one picture. b is a column vector. g is the average of all column vectors of U, i.e., the average gradient of the loss functions of all pictures. λ is a preset weight representing the magnitude by which the parameter vector is adjusted along the optimization direction d. That is, the average gradient direction and a correction term formed by a linear combination of the per-picture gradient directions of the loss function together constitute the final optimization direction.
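The quantities defined above (d, U, b, g, λ) admit, for example, the following damped instantiation, given here only as a hedged NumPy sketch: it uses the Woodbury identity so that the correction term U·b comes from a small batch-sized system built from U^T U and λ. The exact formula in the patent may differ, and the matrix sizes below are illustrative.

    import numpy as np

    def optimization_direction(U, lam):
        """Hedged sketch: one plausible form of the direction, d = -(g + U @ b) / lam,
        where b solves a small (batch x batch) damped system built from U^T U.
        U has one column per picture (the per-sample gradient of this layer's parameters)."""
        n = U.shape[1]                       # number of pictures in the batch
        g = U.mean(axis=1)                   # average gradient of the loss over all pictures
        # b = -(lam*I + U^T U)^{-1} U^T g : a small n x n solve instead of a huge Hessian solve
        b = -np.linalg.solve(lam * np.eye(n) + U.T @ U, U.T @ g)
        d = -(g + U @ b) / lam               # average gradient plus a linear combination of per-picture gradients
        return d

    # Illustrative shapes: 1000 parameters in this layer, a batch of 64 pictures.
    rng = np.random.default_rng(0)
    U = rng.standard_normal((1000, 64))
    d = optimization_direction(U, lam=0.1)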
22) The distributed sketching optimization method for accelerating deep learning training is adopted so that the computational cost of the optimization process is reduced;
Since U is very large, all operations involving U consume a large amount of computing resources when computing a suitable b. The invention adopts the distributed sketching optimization method to reduce the computational cost of the optimization process, comprising the following steps:
221) For the existing U, the columns of U are randomly sampled with the distributed sketching optimization method, i.e., a small number of columns of the matrix U are selected to obtain a new matrix Ũ. For convenience of implementation, the number of selected columns is taken as the number of samples in each batch. Using Ũ in place of U in the calculation greatly reduces the computational overhead of the matrix operations.
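A minimal sketch of this column sampling, assuming uniform sampling without replacement and a simple 1/sqrt(probability) rescaling commonly used in sketching methods (the rescaling factor is an assumption, not something stated in the patent).

    import numpy as np

    def sketch_columns(U, num_cols, rng=None):
        """Randomly select num_cols columns of U to form a thinner matrix U_tilde."""
        rng = rng or np.random.default_rng()
        n = U.shape[1]
        idx = rng.choice(n, size=num_cols, replace=False)   # pick a few columns at random
        scale = np.sqrt(n / num_cols)                        # keeps U_tilde @ U_tilde.T ≈ U @ U.T in expectation (assumed)
        return U[:, idx] * scale

    U = np.random.default_rng(1).standard_normal((1000, 256))
    U_tilde = sketch_columns(U, num_cols=64)                 # use U_tilde in place of U downstream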
222) The matrix U itself is computed in a sketched manner.
The gradient of each sample at each convolutional layer is computed from the input matrix A of that layer and the derivative matrix G of the loss function with respect to the output of that layer. Since these two matrices may be quite large, the sketching method can be applied to randomly select columns of the matrix A and of the derivative matrix G, forming new matrices with fewer columns that replace them in the subsequent calculation.
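As an illustration (a sketch under stated assumptions, not the patent's exact procedure): for a fully connected layer the batch-summed weight gradient is G @ A.T, where A stacks the layer inputs column-wise and G stacks the derivatives of the loss with respect to the layer outputs; sampling the same column indices from both matrices keeps the product consistent. The conv-layer case would flatten spatial positions into columns in the same spirit.

    import numpy as np

    def sketched_layer_gradient(A, G, num_cols, rng=None):
        """Approximate the layer gradient G @ A.T by sampling the same columns of A and G.
        A: (in_dim, n) layer inputs; G: (out_dim, n) derivatives w.r.t. layer outputs."""
        rng = rng or np.random.default_rng()
        n = A.shape[1]
        idx = rng.choice(n, size=num_cols, replace=False)
        scale = n / num_cols                                  # unbiased rescaling (assumed)
        return scale * (G[:, idx] @ A[:, idx].T)              # much smaller matrices, same result shape

    rng = np.random.default_rng(2)
    A = rng.standard_normal((512, 4096))   # e.g. flattened inputs over samples/positions (illustrative)
    G = rng.standard_normal((256, 4096))
    approx_grad = sketched_layer_gradient(A, G, num_cols=512)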
23) The calculation of the neural network parameters is distributed;
In the aforementioned distributed scenario, the computation for each picture in the neural network is performed simultaneously on the respective computing nodes. Constructing the complete U would therefore require every node to transmit each of its computed gradients to every other node, and the required transmission volume is too large. Therefore, on the basis of the sketching optimization, the invention further distributes the calculation of the neural network parameters, using at least one of the two distributed optimization methods below.
The distributed optimization scheme provided by the invention comprises the following two methods:
231) Optimizing independently and then aggregating: each node computes an optimization direction d based only on the pictures on that node and their output results; the directions are then summed and averaged to obtain the combined optimization direction.
Assume twenty pictures are evenly distributed over ten nodes, two per node. Each node independently computes an optimization direction from its two pictures according to the above procedure, giving ten optimization directions, which are gathered and averaged to obtain the final optimization direction. Compared with synchronizing the loss-function gradients of all twenty pictures to every node, this distributed optimization method omits that step and greatly reduces the communication volume; a minimal simulation of the scheme is sketched below.
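A self-contained NumPy simulation of the optimize-then-average scheme. The per-node direction function here is only a stand-in (the negative local average gradient); in the actual method each node would compute its direction d with the sketched second-order formula. In a real deployment the averaging would be a collective operation (e.g. an all-reduce) rather than a Python loop, and the node and batch sizes are illustrative.

    import numpy as np

    def averaged_direction(per_node_gradients, direction_fn):
        """Each node computes its own optimization direction from its local gradient matrix U_i;
        the per-node directions are then averaged to form the combined optimization direction."""
        directions = [direction_fn(U_i) for U_i in per_node_gradients]
        return np.mean(directions, axis=0)       # only one direction vector per node needs to be communicated

    rng = np.random.default_rng(3)
    # ten nodes, two pictures each, 1000 parameters in this layer (illustrative sizes)
    per_node_gradients = [rng.standard_normal((1000, 2)) for _ in range(10)]
    # stand-in direction: the negative local average gradient (the patent's d would be used here instead)
    d_final = averaged_direction(per_node_gradients, lambda U_i: -U_i.mean(axis=1))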
232) Obtaining an approximate matrix of U^T U by block diagonal approximation;
When the computational overhead is dominated by the matrix multiplication U^T U, a block diagonal approximation of U^T U can be adopted to obtain an approximate matrix of U^T U. The picture gradient matrix U can be expressed as:
U = [U_1, U_2, ..., U_i, ..., U_N]
where U_i is the block of columns of U computed on node i and N is the number of distributed nodes. U^T U then expands into the block matrix whose (i, j) block is U_i^T U_j.
Since this expansion requires products U_i^T U_j between pairs of nodes, the column blocks computed on each node would have to be synchronized among all nodes, which is a huge communication cost. U^T U is therefore block-diagonally approximated, i.e., the block diagonal matrix diag(U_1^T U_1, U_2^T U_2, ..., U_N^T U_N) is used as an approximate matrix in place of U^T U. Because the calculation of the approximate matrix involves only the product of each U_i with its own transpose, no synchronization with other nodes is needed in advance; the local results only need to be gathered, the approximate matrix of U^T U reassembled, and the subsequent calculation carried out.
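A minimal NumPy sketch of assembling this block diagonal approximation, assuming each node holds its own column block U_i; the node and layer sizes are illustrative.

    import numpy as np

    def block_diagonal_gram(per_node_blocks):
        """Approximate U^T U by diag(U_1^T U_1, ..., U_N^T U_N): each node computes its own
        small Gram block locally; only these blocks are gathered, never the full U."""
        blocks = [U_i.T @ U_i for U_i in per_node_blocks]    # no cross-node products U_i^T U_j are needed
        n = sum(B.shape[0] for B in blocks)
        approx = np.zeros((n, n))
        start = 0
        for B in blocks:
            k = B.shape[0]
            approx[start:start + k, start:start + k] = B     # place each local block on the diagonal
            start += k
        return approx

    rng = np.random.default_rng(4)
    # 8 nodes, 8 pictures per node, 1000 parameters in this layer (illustrative sizes)
    per_node_blocks = [rng.standard_normal((1000, 8)) for _ in range(8)]
    approx_gram = block_diagonal_gram(per_node_blocks)       # 64 x 64 block diagonal approximation of U^T U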
3) After the optimization direction is computed according to the calculation of the optimized parameters in step 2), the parameter vector of the neural network is numerically adjusted along the optimization direction.
After all batches of pictures have been used as input for the optimization calculation of step 2) and the parameter values of the neural network have been adjusted, the accuracy of the neural network in recognizing images is improved compared with the previous round, and subsequent rounds of training are carried out on the batches of pictures again. After each round of training, the image recognition accuracy of the neural network is tested on another set of pictures with known labels; an image recognition accuracy threshold is set, and if the accuracy is not less than the threshold, the network meets the standard, training is stopped, and the deep neural network is put into practical use.
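A minimal sketch of this outer training loop. The callables train_one_batch and evaluate_accuracy are caller-supplied stand-ins for the per-batch optimization of step 2) and the held-out accuracy test; the threshold and round limit are illustrative, not values from the patent.

    def train_until_threshold(model, train_batches, val_pictures,
                              train_one_batch, evaluate_accuracy,
                              accuracy_threshold=0.9, max_rounds=100):
        """Run rounds of training until the held-out accuracy reaches the preset threshold."""
        for _ in range(max_rounds):
            for batch in train_batches:
                train_one_batch(model, batch)            # step 2): compute the direction d and adjust parameters
            acc = evaluate_accuracy(model, val_pictures) # test on another set of pictures with known labels
            if acc >= accuracy_threshold:                # recognition meets the standard: stop training
                return model
        return model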
Compared with the prior art, the invention has the following beneficial effects:
With the technical scheme of the invention, the amount of computation is greatly reduced, the training speed of the deep neural network is greatly increased, the training time is shortened, and the advantage of the second-order algorithm over the first-order algorithm is enhanced. In image recognition applications, the time required to reach the same accuracy with the deep neural network training process of the invention is reduced by 40% compared with a first-order algorithm.
Drawings
FIG. 1 is a block flow diagram of training a batch of pictures based on a deep neural network according to the present invention.
FIG. 2 is a flow chart of the distributed optimization training method that optimizes independently and then aggregates, in accordance with the present invention.
FIG. 3 is a flow chart of the distributed optimization training method that obtains an approximate matrix of U^T U by block diagonal approximation.
Detailed Description
The invention is further described below by way of examples with reference to the accompanying drawings, which in no way limit the scope of the invention.
The invention provides a method for accelerating deep learning training by distributed sketching optimization, which is used to train a deep neural network for image recognition. First, a large number of pictures with known labels must be prepared. The pictures are divided into a plurality of batches, each consisting of a small number of pictures, and the training tasks are then carried out on them.
In a distributed scenario where multiple computing nodes are available for computation, a batch of pictures can be distributed as evenly as possible across the nodes. Each computing node feeds the pictures assigned to it in parallel into the neural network and obtains the final predicted output vector of each picture. As described above, this vector corresponds to the probability, as estimated by the neural network, that the picture belongs to each label. The predicted label represented by this vector is not necessarily identical to the true label of the image, so a loss function is required to measure the gap between the predicted probability vector and the true image label.
Summing the loss function values of the pictures distributed over the nodes in one batch gives a continuous function value that measures the prediction accuracy of the neural network for this batch of pictures: the larger the value, the lower the prediction accuracy, and conversely.
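A minimal sketch of such a batch loss, assuming cross-entropy between the predicted probability vectors and the true labels and a simple sum over the pictures held by one node (cross-entropy is an assumption; the patent only requires a loss that measures the gap). The batch loss is then the sum of these per-node sums gathered from all computing nodes.

    import numpy as np

    def node_loss_sum(probabilities, true_labels, eps=1e-12):
        """Cross-entropy loss summed over the pictures on one node.
        probabilities: (n_pictures, n_labels) predicted vectors; true_labels: (n_pictures,) label indices."""
        picked = probabilities[np.arange(len(true_labels)), true_labels]
        return float(-np.sum(np.log(picked + eps)))

    rng = np.random.default_rng(5)
    probs = rng.dirichlet(np.ones(10), size=8)     # 8 pictures on this node, 10 labels (illustrative)
    labels = rng.integers(0, 10, size=8)
    local_sum = node_loss_sum(probs, labels)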
The training process of the neural network reduces this function value as much as possible, which requires adjusting the parameters in the neural network, in particular the weight values of the convolution kernels of the convolutional layers and the weight values of the fully connected layers. Regarding all parameters in the neural network as one vector, we need to obtain an optimization direction such that the above loss function sum becomes smaller after the parameter vector is adjusted along this direction.
For the calculation of the optimization direction, the invention adopts the following method.
We first give the calculation formula for the optimization direction of the parameters of each layer in the deep neural network, in which each column vector of U is the gradient of the parameters of this layer obtained by backpropagation of the loss function of one picture, b is a column vector, g is the average of all column vectors of U, i.e., the average gradient of the loss functions of all pictures, and λ is a preset weight. That is, the average gradient direction and a correction term formed by a linear combination of the per-picture gradient directions of the loss function together constitute the final optimization direction.
Since U is very large, all operations involving U consume a large amount of computing resources when computing a suitable b. The present scheme therefore reduces the computational cost of the optimization process with the sketching method, which is mainly divided into the following two cases:
1) For the existing U, the columns of U are randomly sampled with the sketching method, i.e., a small number of columns of the matrix U are selected to obtain a new matrix Ũ. Using Ũ in place of U in the calculation greatly reduces the computational overhead of the matrix operations.
2) The matrix U itself is computed in a sketched manner. The gradient of each sample at each convolutional layer is computed from the input A of that layer and the derivative G of the loss function with respect to the output of that layer. Since these two matrices may be quite large, the same method as in 1) can be used to randomly sample their columns, obtaining approximate matrices with fewer columns for the substitution calculation.
In the aforementioned distributed scenario, the computation for each picture in the neural network is performed simultaneously on the respective computing nodes. Constructing the complete U would require every node to transmit each of its computed gradients to every other node, and the required transmission volume is too large. Therefore, on the basis of the sketching optimization, the invention adds two distributed optimization methods for parameter optimization: optimizing independently and then aggregating, and obtaining an approximate matrix of U^T U by block diagonal approximation.
The calculation of the neural network parameters is further distributed using at least one of the two distributed optimization methods:
1) Optimizing independently and then aggregating: each node computes an optimization direction d based only on the pictures on that node and their output results; the directions are then summed and averaged to obtain the combined optimization direction.
For example, assume twenty pictures are evenly distributed over ten nodes, two per node. Under this distributed optimization scheme, each node computes an optimization direction from its two pictures according to the above procedure, giving ten optimization directions, which are gathered and averaged to obtain the final optimization direction. Compared with synchronizing the loss-function gradients of all twenty pictures to every node, this scheme omits that step and greatly reduces the communication volume.
2) Obtaining an approximate matrix of U^T U by block diagonal approximation: in this scheme the computational overhead is dominated by the matrix multiplication U^T U, so the approximate matrix of U^T U is obtained by block diagonal approximation. In the original scheme, one of the main purposes of synchronizing U is to compute U^T U; this calculation requires vector products between all columns of U, so the loss-function gradient of every picture would have to be synchronized between nodes and the calculation of U^T U redistributed to different nodes. Instead, U^T U is block-diagonally approximated by the block diagonal matrix diag(U_1^T U_1, ..., U_N^T U_N), where U_i is formed by the loss-function gradients of all pictures on the i-th node. Thus, approximating U^T U does not require synchronizing all gradients across all nodes in advance; each node computes its own part of the block diagonal matrix.
After the optimization direction is calculated, the parameter vector of the neural network is subjected to numerical adjustment along the optimization direction. After the above operation is performed on all batches of pictures, the accuracy of the neural network identification image is improved compared with the previous round, and the second round of training is performed on the batches of pictures again. After multiple rounds of training, another set of pictures of known labels are used to test the image recognition capability of the neural network, which can be put into practice if it meets the standards.
The distributed optimization method of optimizing independently and then aggregating is implemented as follows:
A. Firstly, acquiring a plurality of picture samples for training, and distributing the picture samples to each computing node;
B. Each computing node obtains, by backpropagation, the gradient for each of its picture samples and the average gradient over all of its picture samples;
C. On each computing node, the parameter optimization direction calculation formula is evaluated with the sketching method to obtain an optimization direction based on that node's samples;
D. synchronizing the optimization directions of all the computing nodes, and taking the average value as a final optimization direction;
E. and optimizing and adjusting parameters of the deep neural network by using the final optimization direction.
The distributed optimization scheme that obtains an approximate matrix of U^T U by block diagonal approximation is implemented as follows:
A. Firstly, acquiring a plurality of picture samples used for training, and distributing pictures to each computing node;
B. Each computing node obtains the gradient of the loss function of the picture through backward propagation, and the average gradient of the loss functions of all pictures of the computing node;
C. For the pictures on each computing node, the local product U_i^T U_i is computed from the matrix U_i formed by combining the gradients of those pictures;
D. The local products obtained by all computing nodes are synchronized and the block diagonal approximation of the original symmetric matrix is formed, i.e., the block diagonal matrix diag(U_1^T U_1, ..., U_N^T U_N);
E. b is obtained from the block diagonal approximation, and the final optimization direction d is computed;
F. And optimizing and adjusting parameters of the deep neural network by using the final optimization direction.
The invention is further illustrated by the following examples:
Consider a convolutional neural network for which 64 pictures are taken each time the parameters are optimized, and 8 computing nodes are available for parallel computation. The parameters of this network need to be adjusted based on the outputs of the 64 pictures in this convolutional neural network (i.e., the 64 pictures and their respective labels) so that the network's predicted labels move closer to the true labels of the 64 pictures.
Take a certain convolutional layer of this convolutional neural network as an example. For each picture, backpropagation gives the gradient of this layer's convolution kernels with respect to that picture, so there are 64 gradients. In a first-order optimization algorithm, the average of these 64 gradient vectors, i.e., the average gradient, is the final optimization direction of the parameters. In the present invention, this average gradient also needs to be obtained first, but it is then further adjusted.
All the picture gradients form a gradient matrix U, which is then used in the calculation. In distributed computing, the pictures are distributed as evenly as possible over the computing nodes, i.e., 8 pictures per node, and the 8 corresponding gradient vectors are computed on each node. Under the original scheme, obtaining on every node the gradient matrix U consisting of all 64 gradient vectors would require transmission among the 8 nodes, and the transmission cost is too high; the two distributed schemes described in the invention are therefore adopted.
With the distributed optimization method of optimizing independently and then aggregating, the nodes do not synchronize their gradient vectors; instead, each node carries out the subsequent calculation locally from its own 8 gradient vectors and the average gradient to obtain an optimization direction d, so that the 8 nodes obtain 8 optimization directions, whose average is the final optimization direction of the parameters of this convolutional layer. This scheme ultimately only needs to gather 8 optimization directions and never synchronizes the matrix U.
The distributed optimization method that obtains an approximate matrix of U^T U by block diagonal approximation instead approximates the subsequent calculations involving U. The purpose of synchronizing U is to compute U^T U, a step that, as described above, requires enormous computing and transmission resources. The block diagonal approximation is therefore used to obtain an approximate matrix of U^T U: each computing node computes a local product from its 8 gradient vectors, the local U_i at this point consisting of 8 columns. The matrices computed by the nodes are then synchronized to obtain an approximate matrix consisting of 8 diagonal blocks, which replaces U^T U in the subsequent calculation.
Finally, each parameter of the convolutional neural network is adjusted according to the obtained final optimization direction.
Specifically, after all batches of pictures have been used as input for the calculation of the second step and the parameter values of the neural network have been adjusted, the accuracy of the neural network in recognizing images is improved compared with the previous round, and the batches of pictures are trained again in subsequent rounds. After each round of training, the image recognition accuracy of the neural network is tested on another set of pictures with known labels; an image recognition accuracy threshold is set, and if the accuracy is not less than the threshold, the network meets the standard, training is stopped, and the deep neural network is put into practical image recognition application.
It should be noted that the purpose of the disclosed embodiments is to aid further understanding of the present invention, but those skilled in the art will appreciate that: various alternatives and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments, but rather the scope of the invention is defined by the appended claims.

Claims (5)

1. A method for accelerating deep learning training by distributed sketching optimization for image recognition, wherein a plurality of computing nodes are set up for distributed computation; when the second-order matrix in the deep neural network model for image recognition is computed during training, block diagonal approximation is adopted, and a sketching method is used to perform distributed implicit computation of matrix-vector products; each computing node feeds pictures in parallel into the deep neural network model for image recognition trained by the distributed sketching optimization method for accelerating deep learning training, and the output vector of each picture, i.e., the probability that the picture belongs to each label, is predicted, thereby realizing image recognition;
The method comprises the following steps:
1) First, pictures with known labels are prepared, divided into a plurality of batches, and used as input pictures for the respective neural network model training tasks; an input picture is expressed as a tensor composed of three two-dimensional matrices corresponding to the RGB pixels, and each two-dimensional matrix of the image is processed with a plurality of convolution kernels of the neural network model;
2) Setting a distributed computing system comprising a plurality of computing nodes, and training a neural network by using the distributed computing system; optimizing, calculating and adjusting parameters in the neural network through the loss function, so that optimized parameters of the neural network are obtained;
The parameters in the neural network comprise the weight values of the convolution kernels of the convolutional layers and the weight values of the fully connected layers; all parameters in the neural network are regarded as one vector; each computing node feeds the pictures in parallel into the trained neural network to obtain the final predicted probability vector of each picture;
The calculation process of the optimization parameters comprises the following steps:
21) The optimization calculation of the parameters of each layer in the deep neural network is expressed by the parameter optimization direction calculation formula, in which d is the final optimization direction of the parameter vector and comprises the average gradient direction and a correction term formed by a linear combination of the per-picture gradient directions of the loss function; U is the gradient matrix formed by the per-picture gradients; each column vector of U is the gradient of the parameters of this layer obtained by backpropagation of the loss function of one picture; b is a column vector; g is the average of all column vectors of U, i.e., the average gradient of the loss functions of all pictures; λ is a preset weight representing the magnitude by which the parameter vector is adjusted along the optimization direction d;
22) The column vector b is obtained with the distributed sketching optimization method for accelerating deep learning training, so that the computational cost of the optimization process is reduced; this comprises the following steps:
221) For the existing U, the columns of U are randomly sampled with the distributed sketching optimization method, i.e., a small number of columns of the matrix U are selected to obtain a new matrix Ũ; Ũ is used in place of U in the calculation, reducing the computational overhead of the matrix operations;
222) The matrix U is computed in a sketched manner;
The gradient of each sample at each convolutional layer is computed from the input matrix A of that convolutional layer and the derivative matrix G of the loss function with respect to the output of that convolutional layer;
Columns of the input matrix A and of the derivative matrix G are randomly selected to form new matrices with fewer columns, which replace the input matrix A and the derivative matrix G respectively in the subsequent calculation;
23) The calculation of the neural network parameters is distributed, comprising the distributed optimization method of optimizing independently and then aggregating and/or the distributed optimization method of obtaining an approximate matrix of U^T U by block diagonal approximation;
The distributed optimization method of optimizing independently and then aggregating is as follows: each node computes an optimization direction based only on the pictures on that node and their output results, and the optimization directions are then summed and averaged to obtain the combined optimization direction; this specifically comprises the following steps:
A. Firstly, acquiring a plurality of picture samples for training, and distributing the picture samples to each computing node;
B. Each computing node obtains the gradient of the picture sample through backward propagation, and the average gradient of all the picture samples of the computing node;
C. On each computing node, the parameter optimization direction is calculated with the sketching method to obtain an optimization direction based on that node's samples;
D. synchronizing the optimization directions of all the computing nodes, and taking the average value as a final optimization direction;
E. Optimizing and adjusting parameters of the deep neural network by using the final optimization direction;
the method of obtaining an approximate matrix of U^T U by block diagonal approximation is as follows: when the matrix multiplication U^T U is computed, U^T U is block-diagonally approximated to obtain an approximate matrix of U^T U;
the picture gradient matrix U is expressed as U = [U_1, U_2, ..., U_i, ..., U_N], wherein U_i is the block of columns of U computed on node i and N is the number of distributed nodes; U^T U expands into the block matrix whose (i, j) block is U_i^T U_j;
U^T U is block-diagonally approximated, i.e., the block diagonal matrix diag(U_1^T U_1, ..., U_N^T U_N) is used as the approximate matrix in place of U^T U;
with the block diagonal approximation of U^T U, the calculation of the approximate matrix involves only the product of each U_i with its own transpose, so no synchronization with other nodes is needed in advance; the local results only need to be gathered and the approximate matrix of U^T U reassembled for the subsequent calculation;
3) After the optimization direction is computed according to the calculation of the optimization parameters in step 2), the parameter vector of the neural network is numerically adjusted along the optimization direction.
2. The method for accelerating deep learning training by distributed sketching optimization according to claim 1, wherein, when image recognition is carried out, all batches of pictures are taken as input, the optimization calculation of step 2) is performed and the parameter values of the neural network are adjusted, and subsequent rounds of training are carried out on the batches of pictures again; after each round of training, whether to stop training is determined by testing the image recognition accuracy of the neural network.
3. The method for accelerating deep learning training by distributed sketching optimization according to claim 1, wherein each batch of pictures comprises 64 pictures.
4. The method for accelerating deep learning training by distributed sketching optimization according to claim 1, wherein step 2) specifically uses a loss function that measures the gap between the predicted probability vector and the true image label.
5. The method for accelerating deep learning training by distributed sketching optimization according to claim 1, wherein in step 23), the distributed optimization scheme for obtaining an approximate matrix of U^T U by block diagonal approximation specifically comprises the following steps:
A. Firstly, acquiring a plurality of picture samples used for training, and distributing pictures to each computing node;
B. each computing node obtains the gradient of the loss function of the picture through backward propagation, and the average gradient of the loss functions of all the pictures of the computing node;
C. For the pictures on each computing node, the local product U_i^T U_i is computed from the matrix U_i formed by combining the gradients of those pictures;
D. The local products obtained by all computing nodes are synchronized and the block diagonal approximation of the original symmetric matrix is formed, i.e., the block diagonal matrix diag(U_1^T U_1, ..., U_N^T U_N);
E. b is obtained from the block diagonal approximation, and the final optimization direction d is computed;
F. And optimizing and adjusting parameters of the deep neural network by using the final optimization direction.
CN202110239799.XA 2021-03-04 2021-03-04 Method for accelerating deep learning training by distributed outline optimization for image recognition Active CN115019079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239799.XA CN115019079B (en) 2021-03-04 2021-03-04 Method for accelerating deep learning training by distributed outline optimization for image recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239799.XA CN115019079B (en) 2021-03-04 2021-03-04 Method for accelerating deep learning training by distributed outline optimization for image recognition

Publications (2)

Publication Number Publication Date
CN115019079A CN115019079A (en) 2022-09-06
CN115019079B (en) 2024-05-28

Family

ID=83064797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239799.XA Active CN115019079B (en) 2021-03-04 2021-03-04 Method for accelerating deep learning training by distributed outline optimization for image recognition

Country Status (1)

Country Link
CN (1) CN115019079B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019199307A1 (en) * 2018-04-12 2019-10-17 Siemens Aktiengesellschaft Second-order optimization methods for avoiding saddle points during the training of deep neural networks
CN110349126A (en) * 2019-06-20 2019-10-18 武汉科技大学 A kind of Surface Defects in Steel Plate detection method based on convolutional neural networks tape label
CN111160474A (en) * 2019-12-30 2020-05-15 合肥工业大学 Image identification method based on deep course learning
CN111814963A (en) * 2020-07-17 2020-10-23 中国科学院微电子研究所 Image identification method based on deep neural network model parameter modulation


Also Published As

Publication number Publication date
CN115019079A (en) 2022-09-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant