CN118052260A - Dynamic layering gradient compression method for neural network model - Google Patents

Dynamic layering gradient compression method for neural network model

Info

Publication number
CN118052260A
Authority
CN
China
Prior art keywords
layer
gradient
time
communication
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410387893.3A
Other languages
Chinese (zh)
Other versions
CN118052260B
Inventor
Ju Tao
Kang Heting
Zhang Yan
Liu Shuai
Huo Jiuyuan
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou Jiaotong University
Original Assignee
Lanzhou Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou Jiaotong University filed Critical Lanzhou Jiaotong University
Priority to CN202410387893.3A priority Critical patent/CN118052260B/en
Priority claimed from CN202410387893.3A external-priority patent/CN118052260B/en
Publication of CN118052260A publication Critical patent/CN118052260A/en
Application granted granted Critical
Publication of CN118052260B publication Critical patent/CN118052260B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A dynamic hierarchical gradient compression method for a deep neural network model combines gradient sparsification with pipeline parallelism: it matches a suitable compression threshold to each layer of the network and adaptively compresses each layer's transmission gradient by dynamically adjusting the threshold in subsequent iterations. Given the model structure and hardware configuration information, a heuristic dynamic programming algorithm then solves for the optimal layer-gradient merging communication scheme, combining the small gradient tensors of multiple layers into a single communication. Finally, the solved optimal layer-gradient combinations are applied to the actual training iterations, improving the training speed of large-scale deep neural network models while preserving training accuracy, maximizing the overlap of computation and communication, and raising the utilization of computing resources, thereby providing an effective solution for fully exploiting hardware computing resources and accelerating deep neural network training.

Description

Dynamic layering gradient compression method for neural network model
Technical Field
The invention belongs to the field of deep learning neural networks, relates to a communication optimization method between nodes in data parallel neural network model training, and particularly relates to a dynamic hierarchical gradient compression method of a deep neural network model based on synchronous data parallelism.
Background
In recent years, with the rapid development of deep learning across application fields, increasingly complex model architectures and growing data volumes mean that training a large-scale deep neural network requires ever more computational resources. Because the compute power and memory of a single accelerator cannot meet the requirements of large-scale model training, distributed parallel training has become the established approach. At present, synchronous data-parallel training based on a parameter-server architecture is most common: every computing node keeps a consistent copy of the model parameters and, in each training iteration, computes gradients on a different mini-batch of samples. Concretely, each node sends its local gradient to the server node after finishing the local gradient computation; the server node then aggregates and averages the local gradients from all computing nodes into a global gradient and applies it to update the model parameters; finally, each computing node fetches the latest model parameters from the server node for the next iteration, until convergence. Synchronous stochastic gradient descent converges well and is therefore the optimization method commonly adopted in distributed deep learning training. However, because each computing node must send its local gradient to the server node after back propagation and then fetch the latest model parameters, the frequent gradient transfers between computing nodes and the server node often make the communication overhead exceed the computation overhead, and this overhead has gradually become the main factor limiting the acceleration of distributed deep neural network training. How to achieve efficient communication between the server node and the computing nodes without affecting model training accuracy is therefore a key problem in synchronous data-parallel training.
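As a concrete illustration of the synchronous parameter-server update just described, the following minimal Python sketch (purely illustrative; the function names and the use of NumPy are assumptions, not part of the patent) averages the workers' local gradients on a simulated server and applies one parameter update:

```python
import numpy as np

def sync_sgd_step(params, worker_batches, grad_fn, lr=0.01):
    """One synchronous data-parallel step with a simulated parameter server.

    params         : model parameters (np.ndarray), identical copies on all workers
    worker_batches : list of mini-batches, one per computing node
    grad_fn        : grad_fn(params, batch) -> local gradient (np.ndarray)
    """
    # Each computing node computes a local gradient on its own mini-batch.
    local_grads = [grad_fn(params, batch) for batch in worker_batches]
    # The server node aggregates and averages the local gradients into a global gradient.
    global_grad = np.mean(local_grads, axis=0)
    # The server applies the global gradient; workers then fetch the updated parameters.
    return params - lr * global_grad
```

In this setting, sending the local gradients to the server and fetching the updated parameters back are exactly the transfers whose cost the rest of the description tries to reduce.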
Much work exists on optimizing distributed data-parallel training of deep neural networks. Gradient compression is a commonly adopted lossy optimization technique that reduces the amount of gradient data transmitted during training. Sparsification is an aggressive form of communication compression: it suppresses the transmission of gradients that contribute little to the update and selects only the important gradients for transmission, markedly reducing the communication volume while preserving model convergence. The fixed-threshold gradient sparsification method [2] compresses every layer's gradients uniformly, but ignores the heterogeneity of the different layers of a model; a reasonable threshold is hard to set for an arbitrary deep neural network, and the approach generalizes poorly. Local gradient selection with dynamic threshold adjustment [3][4] transmits only a chosen fraction of the gradient values, but at high compression ratios this easily degrades training accuracy. Important-gradient (Top-K) methods [5][6][7] apply a Top-K operation to all gradients and transmit only the values with the largest influence on the model update; however, for large deep neural networks the gradient tensors are so large that the Top-K operation itself incurs substantial time overhead. A variant samples a fixed proportion of gradient values, runs Top-K on the sample to obtain a sparsification threshold, and then screens all gradients with that threshold; but the estimated threshold is not accurate enough, and the method requires two Top-K passes over subsets of the original gradient vector, which raises its overhead. Top-K methods can markedly reduce the communication volume, but they ignore the layered structure of deep neural networks: sparsification is usually performed only after the gradients of all layers have been computed, so computation and communication cannot be well overlapped.
To compress gradient tensors efficiently and reduce the time cost of gradient compression, some works estimate the compression threshold from the gradient distribution observed during deep neural network training. The GaussianK methods [8][9] show, through theoretical analysis and experiments, that most gradients lie near zero and approximately follow a Gaussian distribution during training, which greatly reduces the time cost of threshold estimation. However, as training proceeds the gradient values move ever closer to zero, the threshold predicted from the Gaussian assumption becomes too large, and the model becomes difficult to converge within the same training time. To fit the gradient distribution more accurately, a multi-stage threshold estimation method [10] models the gradients during training as double-exponential, double-gamma and double-generalized-Pareto distributions, but when the compression ratio grows large the fit to the true gradient distribution deviates considerably and model convergence suffers.
In the back propagation of a deep neural network, gradients are computed layer by layer from back to front: the gradient computation of an earlier layer does not depend on the communication of the later layer, and the parameter update of a later layer does not depend on the earlier layers. Part of the communication overhead can therefore be hidden by overlapping communication tasks with computation tasks. Overlapping does not actually shorten the communication itself; it executes computation and communication in parallel as far as possible and thereby improves overall training efficiency. The wait-free back propagation method WFBP [11] pipelines inter-layer gradient communication with computation, but because computation and communication times differ between layers, the extra communication overhead is hard to hide in high-latency or low-bandwidth networks. Priority-transmission methods partition each layer's gradient tensor according to a preset threshold and schedule the transmissions by task priority so that computation overlaps with communication, but they require frequent hyper-parameter tuning for different runtime environments and are hard to deploy on dynamically changing hardware and software resources. Because the communication of a large tensor may block the execution of higher-priority tensors, existing work splits large tensors into many small tensors to achieve greater computation-communication overlap; however, each small tensor requires an additional push operation to be inserted, and splitting into too many small tensors increases the total start-up time and lowers bandwidth utilization.
For these problems in distributed data-parallel training of large-scale deep neural networks, there is at present no efficient gradient compression method that maximally overlaps computation and communication, reduces the inter-node communication volume during data-parallel training, and accelerates data-parallel training of deep neural network models while guaranteeing model convergence and training accuracy.
Disclosure of Invention
The invention provides a dynamic hierarchical gradient compression method of a neural network model, which reduces the communication time between nodes in the data parallel training process and improves the resource utilization rate on the premise of ensuring model convergence and training accuracy, thereby providing an effective solution for fully utilizing network bandwidth resources and accelerating the training speed of the data parallel deep neural network model.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
a dynamic hierarchical gradient compression method of a neural network model comprises the following steps:
1) Design a dynamic layered gradient compression method, i.e. compute a suitable compression threshold for each layer's gradient, specifically: first, after back propagation finishes computing the gradient of some layer l, that gradient is added to the layer's gradient residual, i.e. the sum of all previous gradients accumulated locally for layer l on the computing node; then a Top-K gradient selection strategy is applied to layer l's gradient to obtain the layer's compression threshold, the gradient computed by layer l is compressed with that threshold, and the compressed gradient is used for communication; the gradient residual of this iteration is then accumulated into the gradients computed in subsequent iterations; comparing each layer's threshold over several iterations shows that it changes slowly, with only slight differences, so to further reduce the gradient compression time cost the layer threshold is reused, exploiting the slow change of gradients during training; in subsequent training the layer threshold is dynamically adjusted every s iterations, the per-layer threshold information is stored, and the thresholds are reused during the following s-1 iterations;
2) Training a plurality of input small batch mini-batch data by adopting a specific deep neural network model and a training data set according to a hardware computing resource environment supported by an operation platform;
3) Detecting forward and backward propagation calculation time spending of different layers of the model and memory occupation information in the training process, and recording and storing by using corresponding data structures;
4) The various overhead information of the layers recorded in the step 3) is applied, a heuristic dynamic programming algorithm is adopted to combine the compressed gradient information of the layers to communicate together, communication delay overhead in the layered gradient communication process is reduced, and the model training speed is further accelerated;
5) And taking the user model and the planning result thereof as input, and executing on a plurality of distributed devices by adopting a synchronous data parallel training method.
Further:
In the step 2), the hardware computing resource environment is determined according to the number of GPUs of the different clusters and the communication bandwidth; the specific deep neural network model is determined according to the input model type, and the method is also applicable to different convolutional neural network models, recurrent neural network models and Transformer models; the input mini-batch data is divided according to the adopted data set.
In step 3), the following data structure defines how the model's internal data is stored: dictionary key-value storage is used, with the layer type as the key and the forward and backward propagation computation times and memory occupation information as the values, where the occupation information mainly comprises: the output activation size of each layer, the gradient size in back propagation, and the number of parameters of each layer.
The layer gradient information merging method in the step 4) comprises the following steps:
4-1) Optimization objective: according to the input network model and the specific hardware resources, merge as many compressed gradient tensors as possible into a single communication during training, maximizing the overlap of gradient computation and gradient communication and reducing the communication delay overhead of layered gradient transmission; the final optimization objective is to minimize the training iteration time by reducing the communication delay overhead through layer-gradient merging;
In data-parallel training of a deep neural network, the time of one training iteration consists of four parts: the forward loss computation time, the backward layer-by-layer gradient computation time, the per-layer gradient compression time, and the communication time of the compressed gradients. Expressed as formulas:

$$t_{iter} = t_f + \sum_{l=1}^{L} t_b^{l} + \sum_{l=1}^{L} t_{sp}^{l} + t_c^{no} \qquad (1)$$

$$\tau_b^{l} = \begin{cases} \tau_f, & l = L \\ \tau_{sp}^{l+1} + t_{sp}^{l+1}, & l < L \end{cases} \qquad (2)$$

$$\tau_{sp}^{l} = \tau_b^{l} + t_b^{l} \qquad (3)$$

$$\tau_c^{l} = \max\left(\tau_{sp}^{l} + t_{sp}^{l},\ \tau_c^{l+1} + t_c^{l+1}\right) \qquad (4)$$

Equation (1) states that the time of one iteration is composed of the forward computation time $t_f$, the backward layer-by-layer gradient computation times $t_b^{l}$, the layer-by-layer gradient sparsification times $t_{sp}^{l}$, and the non-overlapped communication time $t_c^{no}$. Equation (2) states that the moment $\tau_b^{l}$ at which layer l starts computing its gradient is the moment the forward pass ends (for the last layer L) or the moment layer l+1 finishes sparsification. Equation (3) states that the sparsification of layer l starts at the moment its gradient computation ends. Equation (4) states that the moment layer l starts communicating is determined by the sparsification end time of layer l and the communication end time of layer l+1 (with $\tau_c^{L+1} + t_c^{L+1} = 0$ for the last layer);
4-2) Layer gradient merging algorithm: a heuristic dynamic programming algorithm determines an optimal set of layer-gradient merge combinations m; traversing and comparing layer by layer from back to front, the current layer's gradient is merged into the gradient of the previous layer whenever the conditions are met; during merging, the resulting layer-number combinations are recorded for the subsequent training process. Specifically, the merge layer is defined as follows: if at moment τ the gradient of layer l is merged into layer l-1 and the gradient of layer l is not compressed and communicated, then layer l is a merge layer. A merge layer is denoted $l_m$ and a normal layer $l_n$, and $l_m$ satisfies the following constraints:

$$l_m:\quad l > 1,\quad t_{sp}^{l_m} = t_c^{l_m} = 0,\quad d_{l-1} = d_l + d_{l-1},\quad \tau_b^{l-1} = \tau_b^{l} + t_b^{l} \qquad (5)$$

In equation (5), $l > 1$ indicates that layer 1 cannot be a merge layer because there is no previous layer to merge into; $t_{sp}^{l_m} = t_c^{l_m} = 0$ indicates that $l_m$ is neither compressed nor communicated after its gradient computation completes; $d_{l-1} = d_l + d_{l-1}$ indicates that the gradient tensor $d_l$ of the merge layer $l_m$ is accumulated into the previous layer's tensor $d_{l-1}$; and $\tau_b^{l-1} = \tau_b^{l} + t_b^{l}$ indicates that the gradient computation of layer l-1 begins immediately after the gradient computation of layer l completes.
Compared with the prior art, the invention has the following beneficial effects:
For the communication-optimization problem of large-scale data-parallel deep neural network training, the invention first proposes a dynamic layered gradient compression method that matches a suitable threshold to each layer of the network and dynamically adjusts it in subsequent iterations, realizing adaptive compression of each layer's transmission gradient; this preserves model accuracy while addressing the large communication volume of data-parallel training. Then, for the low bandwidth utilization and excessive communication delay that arise when the small tensors produced by layered compression are communicated layer by layer, a heuristic dynamic programming algorithm merges the small gradient tensors of several layers into a single communication and solves for the optimal layer-gradient merging scheme, reducing the extra communication delay overhead introduced by the layered gradient decisions. Finally, the solved optimal layer-gradient combinations are applied to the actual training iterations, maximizing the overlap of computation and communication and reducing communication time, which provides an effective solution for improving the training speed of deep neural network models.
Drawings
FIG. 1 is a schematic diagram of a gradient compression method of the present invention;
FIG. 2 is a schematic diagram of layer gradient merging communication according to the present invention;
FIG. 3 is a schematic diagram of a layer gradient merging process according to the present invention;
FIG. 4 shows the training accuracy of different compression methods on ResNet-20;
FIG. 5 shows the training loss of different compression methods on ResNet-20;
FIG. 6 shows the training accuracy of different compression methods on VGG16;
FIG. 7 shows the training loss of different compression methods on VGG16;
FIG. 8 shows the training time composition of different compression methods on ResNet-20;
FIG. 9 shows the training time composition of different compression methods on VGG16.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a dynamic hierarchical gradient compression method for a neural network model includes the following steps:
1) A dynamic layered gradient threshold strategy computes a suitable compression threshold for each layer's gradient. This not only compresses the communication volume but also lets gradient computation overlap with communication during back propagation, solving the problem that existing gradient compression methods cannot execute gradient computation and communication in parallel, and improving resource utilization. FIG. 1 shows the layered gradient compression process. Specifically: first, after back propagation finishes computing the gradient of some layer l, that gradient is added to the layer's gradient residual, i.e. the sum of all previous gradients accumulated locally for layer l on the computing node; then a Top-K gradient selection strategy is applied to layer l's gradient to obtain the layer's compression threshold, the gradient is compressed with that threshold, and the compressed gradient is communicated; the values that were not selected are accumulated into the layer's residual for use in subsequent iterations. Comparing each layer's threshold over several iterations shows that it changes slowly, with only slight differences between iterations; to further reduce the gradient compression time cost, the layer threshold is therefore reused, exploiting the slow change of gradients during training. In subsequent training the layer threshold is dynamically adjusted every s iterations, the per-layer threshold information is stored, and the stored thresholds are reused during the following s-1 iterations, which greatly reduces the time cost of gradient compression and accelerates data-parallel training of the deep neural network.
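The per-layer compression loop can be sketched as follows; this is a hypothetical PyTorch-style illustration of step 1, not the patented implementation, and the class name `LayerCompressor` and the parameters `compress_ratio` and `reuse_period` (the period s) are assumptions. Each layer keeps its own residual and threshold, recomputes the threshold with a Top-K selection every s iterations, and reuses it in between:

```python
import torch

class LayerCompressor:
    """Per-layer sparsifying compressor with residual accumulation and threshold reuse."""

    def __init__(self, compress_ratio=0.01, reuse_period=10):
        self.compress_ratio = compress_ratio  # fraction of gradient values kept
        self.reuse_period = reuse_period      # s: recompute the layer threshold every s iterations
        self.residual = None                  # locally accumulated gradient residual of this layer
        self.threshold = None                 # cached compression threshold of this layer

    def compress(self, grad, iteration):
        # Add the layer's residual (sum of previously unsent gradient values) to the new gradient.
        if self.residual is None:
            self.residual = torch.zeros_like(grad)
        grad = grad + self.residual

        # Recompute the layer threshold with Top-K every `reuse_period` iterations,
        # otherwise reuse the stored threshold (gradients change slowly between iterations).
        if self.threshold is None or iteration % self.reuse_period == 0:
            flat = grad.abs().flatten()
            k = max(1, int(flat.numel() * self.compress_ratio))
            self.threshold = flat.topk(k).values.min()

        mask = grad.abs() >= self.threshold
        # Values below the threshold stay in the residual for later iterations.
        self.residual = grad * (~mask)
        # Only the selected (sparse) values are communicated.
        return grad * mask, mask
```

One compressor instance would be kept per layer and called from that layer's backward hook, so compression and communication of layer l can begin as soon as its gradient is computed, without waiting for the remaining layers.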
Step 1) solves the problems that the existing gradient compression method depends on global gradient, and can not enable gradient calculation and communication to be executed in parallel, so that the training efficiency is low; and through threshold reuse, the cost of gradient compression time is reduced while the model precision is ensured, and the training time is further reduced.
2) According to the hardware computing resource environment supported by the operation platform, a specific deep neural network model is adopted to train the user's input mini-batch data using the PyTorch deep learning framework. The hardware computing resource environment is determined by the number of GPUs of the different clusters and the communication bandwidth; the specific deep neural network model is determined by the input model type, and the method is also applicable to different convolutional neural network models, recurrent neural network models and Transformer models; there is no limitation on the input mini-batch data, which is divided according to the employed data set.
3) A Python script records the forward and backward propagation time overheads of the model's different layers and the memory occupation information during training. The information is stored in a dictionary with the layer type as key and the forward/backward computation times and memory occupation information (the output activation size of each layer, the gradient size in back propagation, and the number of parameters of each layer) as values.
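A sketch of such a record is shown below; the field names are illustrative assumptions, with the layer type as the dictionary key and the timing and memory quantities as the value:

```python
# Illustrative per-layer profiling record keyed by layer type,
# populated from timing hooks during a few profiled training iterations.
layer_profile = {
    "conv2d_3": {
        "forward_time_ms": 1.8,      # forward computation time of this layer
        "backward_time_ms": 3.1,     # backward (gradient) computation time of this layer
        "activation_bytes": 524288,  # size of the layer's output activation
        "gradient_bytes": 262144,    # size of the gradient tensor in back propagation
        "num_parameters": 65536,     # number of parameters in this layer
    },
    # ... one entry per layer of the model
}
```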
4) Using the data structure defined in step 3) and the recorded information, a heuristic dynamic programming algorithm [1] merges multiple layers of the deep neural network, minimizing the communication delay overhead while executing gradient computation and communication in parallel as far as possible. The specific layer-gradient merging process is as follows:
(1) Optimization objective. Given the input network model and the specific hardware resources, run gradient computation and communication in parallel as far as possible while merging several layers' compressed gradients into a single communication, minimizing the communication delay overhead of transmission; the optimization objective is to minimize the time of one training iteration.
In data-parallel training of a deep neural network, the time of one training iteration consists of four parts: the forward loss computation time, the backward layer-by-layer gradient computation time, the per-layer gradient compression time, and the communication time of the compressed gradients. Expressed as formulas:

$$t_{iter} = t_f + \sum_{l=1}^{L} t_b^{l} + \sum_{l=1}^{L} t_{sp}^{l} + t_c^{no} \qquad (1)$$

$$\tau_b^{l} = \begin{cases} \tau_f, & l = L \\ \tau_{sp}^{l+1} + t_{sp}^{l+1}, & l < L \end{cases} \qquad (2)$$

$$\tau_{sp}^{l} = \tau_b^{l} + t_b^{l} \qquad (3)$$

$$\tau_c^{l} = \max\left(\tau_{sp}^{l} + t_{sp}^{l},\ \tau_c^{l+1} + t_c^{l+1}\right) \qquad (4)$$

Equation (1) states that the time of one iteration is composed of the forward computation time $t_f$, the backward layer-by-layer gradient computation times $t_b^{l}$, the layer-by-layer gradient sparsification times $t_{sp}^{l}$, and the non-overlapped communication time $t_c^{no}$. Equation (2) states that the moment $\tau_b^{l}$ at which layer l starts computing its gradient is the moment the forward pass ends (for the last layer L) or the moment layer l+1 finishes sparsification. Equation (3) states that the sparsification of layer l starts at the moment its gradient computation ends. Equation (4) states that the moment layer l starts communicating is determined by the sparsification end time of layer l and the communication end time of layer l+1 (with $\tau_c^{L+1} + t_c^{L+1} = 0$ for the last layer).
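The timing model of equations (1)-(4) can be evaluated with a small cost-model sketch; this is an illustrative assumption rather than the patent's code, with variable names mirroring the symbols in the formulas. Layers are processed from L down to 1, as in back propagation:

```python
def iteration_time(t_f, t_b, t_sp, t_c):
    """Estimate one iteration time from the per-layer cost model of eqs. (1)-(4).

    t_f  : forward pass time
    t_b  : {layer l: backward gradient computation time of layer l}
    t_sp : {layer l: sparsification (compression) time of layer l}
    t_c  : {layer l: communication time of the compressed gradient of layer l}
    Layers are numbered 1..L and processed L, L-1, ..., 1 during back propagation.
    """
    L = max(t_b)
    tau_b, tau_sp_end, tau_c_end = {}, {}, {}
    prev_comm_end = 0.0
    for l in range(L, 0, -1):
        # Eq. (2): layer l starts its gradient computation when the forward pass
        # ends (l == L) or when layer l+1 finishes sparsification.
        tau_b[l] = t_f if l == L else tau_sp_end[l + 1]
        # Eq. (3): sparsification of layer l starts when its gradient computation ends.
        sp_start = tau_b[l] + t_b[l]
        tau_sp_end[l] = sp_start + t_sp[l]
        # Eq. (4): communication of layer l starts after its own sparsification ends
        # and after the communication of layer l+1 has finished.
        comm_start = max(tau_sp_end[l], prev_comm_end)
        tau_c_end[l] = comm_start + t_c[l]
        prev_comm_end = tau_c_end[l]
    # Eq. (1): the iteration ends when the communication of layer 1 completes, i.e.
    # t_f plus all backward and sparsification times plus the non-overlapped communication.
    return tau_c_end[1]
```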
(2) Layer gradient merging algorithm. According to a given network model and specific hardware resources, a heuristic dynamic programming algorithm is used to find an optimal layer gradient merging combination mode, so that communication delay overhead is minimized.
In the layer-gradient merging process, merging several layers' gradient tensors into a single communication reduces the communication delay overhead, but it delays the moment at which the preceding layer or layers start sparsification and communication. For example, in FIG. 2(a), layer L starts communicating its gradient at moment 1, right after its gradient is computed and compressed; in FIG. 2(b), layers L and L-1 are merged into one communication, so the moment layer L's gradient starts to be communicated is delayed to moment 2. If all layer gradients were merged into a single transmission, the communication delay overhead, and hence the communication cost, would be lowest, but communication could only start after all layer gradients had been computed, so gradient computation and communication could not be executed in parallel. The goal of layer-gradient merging is therefore to merge multiple layers' gradient tensors while still achieving the best possible overlap of gradient communication and computation during training.
The merge layer is defined as follows: if at moment τ the gradient of layer l is merged into layer l-1 and the gradient of layer l is not compressed and communicated, then layer l is a merge layer. A merge layer is denoted $l_m$ and a normal layer $l_n$, and $l_m$ satisfies the following constraints:

$$l_m:\quad l > 1,\quad t_{sp}^{l_m} = t_c^{l_m} = 0,\quad d_{l-1} = d_l + d_{l-1},\quad \tau_b^{l-1} = \tau_b^{l} + t_b^{l} \qquad (5)$$

In equation (5), $l > 1$ indicates that layer 1 cannot be a merge layer because there is no previous layer to merge into; $t_{sp}^{l_m} = t_c^{l_m} = 0$ indicates that $l_m$ is neither compressed nor communicated after its gradient computation completes; $d_{l-1} = d_l + d_{l-1}$ indicates that the gradient tensor $d_l$ of the merge layer $l_m$ is accumulated into the previous layer's tensor $d_{l-1}$; and $\tau_b^{l-1} = \tau_b^{l} + t_b^{l}$ indicates that the gradient computation of layer l-1 begins immediately after the gradient computation of layer l completes.
In the iterative solution of the optimization objective, the key question is whether merging a layer reduces the iteration time $t_{iter}$: if so, the layer is set as a merge layer, otherwise its gradient is not merged. FIG. 3 depicts the iterative planning and merging process. Suppose layers L, L-1 and L-2 have been combined into the layer-gradient combination $m_1 = [L, L-1, L-2]$. Within this combination, when the gradient computations of layers L and L-1 complete, they are not immediately compressed and communicated; instead their gradient tensors are accumulated into layer L-2, i.e. $d_{L-2} = d_{L-2} + d_{L-1} + d_L$. Then, starting from layer L-3, the remaining layers are planned iteratively to form the next layer-gradient combination $m_j$ ($1 \le j \le x$). After all layers have been traversed, the layers are partitioned into the layer-gradient combinations $m = [m_1, m_2, \ldots, m_x]$. In subsequent training, the layers contained in each combination $m_j$ are communicated together in a single communication.
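A greedy sketch of this back-to-front planning is given below; it is an illustrative approximation of the heuristic planning step under an assumed linear cost model with a per-message latency `alpha`, a per-element transfer time `beta` and a per-element sparsification time `gamma` (none of these names come from the patent). A layer is added to the current group only if merging does not increase the estimated iteration time:

```python
def estimate_iter_time(groups, t_f, t_b, d, alpha, beta, gamma):
    """Estimate one iteration time for a given back-to-front layer-gradient grouping.

    groups : list of lists of layer indices, e.g. [[L, L-1, L-2], [L-3], ...]
    t_b    : {layer: backward gradient computation time}
    d      : {layer: gradient tensor size (elements) of the layer}
    Within a group only the front-most (last listed) layer sparsifies and communicates
    the accumulated tensor; the other layers of the group are merge layers (eq. (5)).
    """
    time_compute = t_f   # current time on the computation stream
    comm_free_at = 0.0   # time at which the communication link becomes free
    for group in groups:
        # Gradients of the group's layers are computed back to front without any
        # compression or communication in between (the merge-layer constraints).
        time_compute += sum(t_b[l] for l in group)
        size = sum(d[l] for l in group)
        # The accumulated tensor is sparsified once, on the computation stream.
        time_compute += gamma * size
        # ...and communicated once, after the previous message has finished.
        start = max(time_compute, comm_free_at)
        comm_free_at = start + alpha + beta * size
    return max(time_compute, comm_free_at)  # the iteration ends with the last message


def plan_layer_merging(num_layers, t_f, t_b, d, alpha, beta, gamma):
    """Greedy back-to-front planning: merge a layer only if it shortens the iteration."""
    groups = [[num_layers]]                    # start from the last layer L
    for l in range(num_layers - 1, 0, -1):
        merged = [g[:] for g in groups]
        merged[-1].append(l)                   # candidate: the current group accumulates into layer l
        separate = [g[:] for g in groups] + [[l]]  # candidate: layer l communicates on its own
        if estimate_iter_time(merged, t_f, t_b, d, alpha, beta, gamma) <= \
           estimate_iter_time(separate, t_f, t_b, d, alpha, beta, gamma):
            groups = merged                    # merging reduces t_iter: keep layer l in the group
        else:
            groups = separate                  # otherwise layer l starts a new group
    return groups                              # m = [m_1, m_2, ..., m_x], back to front
```

Each returned group m_j corresponds to one communication in the subsequent training iterations, with all but its front-most layer acting as merge layers in the sense of equation (5).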
5) And taking the user model and the planning result thereof as input, and executing on a plurality of distributed devices by adopting a synchronous data parallel training method.
The invention combines gradient sparsification compression with pipeline parallelism: a suitable threshold is matched to each layer of the network and dynamically adjusted in subsequent iterations, realizing adaptive compression of each layer's transmission gradient, which preserves model accuracy while reducing the communication volume. Then, to address the low bandwidth utilization and high communication delay caused by layer-by-layer communication of small tensors after layered gradient compression, a heuristic dynamic programming algorithm solves for the optimal layer-gradient merging communication scheme from the given model structure and hardware configuration, merging multiple layers' small gradient tensors into a single communication and reducing the excessive communication delay overhead introduced by the layered gradient decisions. Finally, the solved optimal layer-gradient combinations are applied to the actual data-parallel training iterations, improving the training speed of large-scale deep neural network models while preserving training accuracy, maximizing the overlap of computation and communication, raising the utilization of computing resources, and providing an effective solution for fully exploiting hardware computing resources and accelerating deep neural network training.
To verify the effectiveness of the present invention, it was compared with several reference algorithms on multiple performance metrics; the experimental results are shown in FIGS. 4 to 9 and Table 1. A brief analysis follows:
(1) To verify the effectiveness of the method (DLGS) of the invention in improving model training accuracy and reducing training loss, the training accuracy and loss of different compression methods at a compression rate of 0.1 were tested on the CIFAR-10 dataset using ResNet-20 and VGG16. The convergence of the method of the invention was compared with the global Top-K method, the LAGS method and the uncompressed (dense) method. The batch size is set to 32 and each model is trained for 140 epochs; the training accuracy and loss of ResNet-20 are shown in FIGS. 4 and 5, and those of VGG16 in FIGS. 6 and 7. The experimental results show that the invention achieves training accuracy and loss similar to the uncompressed method while reducing the communication volume.
(2) To examine how the method (DLGS) of the invention improves training speed, the time of one iteration is decomposed into computation time, sparsification time and non-overlapped communication time, and the per-part training time composition of one iteration is compared with the uncompressed method, the Top-K method and the LAGS method at compression rates of 0.1 and 0.01. The comparison results are shown in FIGS. 8 and 9. The experiments show that the method (DLGS) takes the least time of all methods at both compression rates: on the one hand it reduces the communication volume and thus the communication time while overlapping computation with communication, hiding part of the communication overhead inside computation; on the other hand it reduces the sparsification overhead through threshold reuse and the communication delay overhead through layer-gradient merging, which together greatly reduce the time of a whole training iteration.
(3) To evaluate the effectiveness of the method (DLGS) of the invention in reducing training time, the training times of different compression methods on the CIFAR-10 dataset were compared using ResNet-20 and VGG16 at compression rates of 0.1 and 0.01, as shown in Table 1. The data show that the training time of the method is shorter than that of the other compression methods at both compression rates, because threshold reuse and merging multiple layers' sparsified gradients into a single communication reduce both the compression cost and the communication cost during model training, accelerating the overall training process.
Table 1 shows training time spent on CIFAR-10 datasets for different compression methods
The references of the invention are as follows:
[1] LIU D, XUE S, ZHAO B, et al. Adaptive dynamic programming for control: A survey and recent advances[J]. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2020, 51(1): 142-160.
[2] STROM N. Scalable distributed DNN training using commodity GPU cloud computing[C]//Proceedings of Interspeech 2015. Baixas, France: ISCA, 2015: 1488-1492.
[3] DRYDEN N, MOON T, JACOBS S A, et al. Communication quantization for data-parallel training of deep neural networks[C]//2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC). Salt Lake City, UT, USA: IEEE, 2016: 1-8.
[4] CHEN C Y, CHOI J, BRAND D, et al. AdaComp: Adaptive residual gradient compression for data-parallel distributed training[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
[5] RENGGLI C, ASHKBOOS S, AGHAGOLZADEH M, et al. SparCML: High-performance sparse communication for machine learning[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2019: 1-15.
[6] SATTLER F, WIEDEMANN S, MÜLLER K R, et al. Sparse binary compression: Towards distributed deep learning with minimal communication[C]//2019 International Joint Conference on Neural Networks (IJCNN). Budapest, Hungary: IEEE, 2019: 1-8.
[7] SHI S, WANG Q, ZHAO K, et al. A distributed synchronous SGD algorithm with global Top-K sparsification for low-bandwidth networks[C]//2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). Dallas, TX, USA: IEEE, 2019: 2238-2247.
[8] SHI S, CHU X, CHEUNG K C, et al. Understanding Top-K sparsification in distributed deep learning[J/OL]. [2023-04-16]. https://arxiv.org/abs/1911.08772.
[9] GUO B, LIU Y, ZHANG C. A partition based gradient compression algorithm for distributed training in AIoT[J]. Sensors, 2021, 21(6): 1943.
[10] ABDELMONIEM A M, ELZANATY A, ALOUINI M S, et al. An efficient statistical-based gradient compression technique for distributed training systems[J]. Proceedings of Machine Learning and Systems, 2021, 3: 297-322.
[11] ZHANG H, ZHENG Z, XU S, et al. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters[C]//2017 USENIX Annual Technical Conference (USENIX ATC 17). 2017: 181-193.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (4)

1. The dynamic hierarchical gradient compression method for the neural network model is characterized by comprising the following steps of:
1) Design a dynamic layered gradient compression method, i.e. compute a suitable compression threshold for each layer's gradient, specifically: first, after back propagation finishes computing the gradient of some layer l, that gradient is added to the layer's gradient residual, i.e. the sum of all previous gradients accumulated locally for layer l on the computing node; then a Top-K gradient selection strategy is applied to layer l's gradient to obtain the layer's compression threshold, the gradient computed by layer l is compressed with that threshold, and the compressed gradient is used for communication; the gradient residual of this iteration is then accumulated into the gradients computed in subsequent iterations; comparing each layer's threshold over several iterations shows that it changes slowly, with only slight differences, so to further reduce the gradient compression time cost the layer threshold is reused, exploiting the slow change of gradients during training; in subsequent training the layer threshold is dynamically adjusted every s iterations, the per-layer threshold information is stored, and the thresholds are reused during the following s-1 iterations;
2) Training a plurality of input small batch mini-batch data by adopting a specific deep neural network model and a training data set according to a hardware computing resource environment supported by an operation platform;
3) Detecting forward and backward propagation calculation time spending of different layers of the model and memory occupation information in the training process, and recording and storing by using corresponding data structures;
4) The various overhead information of the layers recorded in the step 3) is applied, a heuristic dynamic programming algorithm is adopted to combine the compressed gradient information of the layers to communicate together, communication delay overhead in the layered gradient communication process is reduced, and the model training speed is further accelerated;
5) And taking the user model and the planning result thereof as input, and executing on a plurality of distributed devices by adopting a synchronous data parallel training method.
2. The method for dynamic hierarchical gradient compression of a neural network model according to claim 1, wherein: in the step 2), the hardware computing resource environment is determined according to the number of GPUs of the different clusters and the communication bandwidth; the specific deep neural network model is determined according to the input model type, and the method is also applicable to different convolutional neural network models, recurrent neural network models and Transformer models; the input mini-batch data is divided according to the adopted data set.
3. The method for dynamic hierarchical gradient compression of a neural network model according to claim 1, wherein: in step 3), the following data structure defines how the model's internal data is stored: dictionary key-value storage is used, with the layer type as the key and the forward and backward propagation computation times and memory occupation information as the values, where the occupation information mainly comprises: the output activation size of each layer, the gradient size in back propagation, and the number of parameters of each layer.
4. The method for dynamic hierarchical gradient compression of a neural network model according to claim 1, wherein the method comprises the steps of: the layer gradient information merging method in the step 4) comprises the following steps:
4-1) Optimization objective: according to the input network model and the specific hardware resources, merge as many compressed gradient tensors as possible into a single communication during training, maximizing the overlap of gradient computation and gradient communication and reducing the communication delay overhead of layered gradient transmission; the final optimization objective is to minimize the training iteration time by reducing the communication delay overhead through layer-gradient merging;
In data-parallel training of a deep neural network, the time of one training iteration consists of four parts: the forward loss computation time, the backward layer-by-layer gradient computation time, the per-layer gradient compression time, and the communication time of the compressed gradients. Expressed as formulas:

$$t_{iter} = t_f + \sum_{l=1}^{L} t_b^{l} + \sum_{l=1}^{L} t_{sp}^{l} + t_c^{no} \qquad (1)$$

$$\tau_b^{l} = \begin{cases} \tau_f, & l = L \\ \tau_{sp}^{l+1} + t_{sp}^{l+1}, & l < L \end{cases} \qquad (2)$$

$$\tau_{sp}^{l} = \tau_b^{l} + t_b^{l} \qquad (3)$$

$$\tau_c^{l} = \max\left(\tau_{sp}^{l} + t_{sp}^{l},\ \tau_c^{l+1} + t_c^{l+1}\right) \qquad (4)$$

Equation (1) states that the time of one iteration is composed of the forward computation time $t_f$, the backward layer-by-layer gradient computation times $t_b^{l}$, the layer-by-layer gradient sparsification times $t_{sp}^{l}$, and the non-overlapped communication time $t_c^{no}$. Equation (2) states that the moment $\tau_b^{l}$ at which layer l starts computing its gradient is the moment the forward pass ends (for the last layer L) or the moment layer l+1 finishes sparsification. Equation (3) states that the sparsification of layer l starts at the moment its gradient computation ends. Equation (4) states that the moment layer l starts communicating is determined by the sparsification end time of layer l and the communication end time of layer l+1 (with $\tau_c^{L+1} + t_c^{L+1} = 0$ for the last layer);
4-2) Layer gradient merging algorithm: a heuristic dynamic programming algorithm determines an optimal set of layer-gradient merge combinations m; traversing and comparing layer by layer from back to front, the current layer's gradient is merged into the gradient of the previous layer whenever the conditions are met; during merging, the resulting layer-number combinations are recorded for the subsequent training process. Specifically, the merge layer is defined as follows: if at moment τ the gradient of layer l is merged into layer l-1 and the gradient of layer l is not compressed and communicated, then layer l is a merge layer. A merge layer is denoted $l_m$ and a normal layer $l_n$, and $l_m$ satisfies the following constraints:

$$l_m:\quad l > 1,\quad t_{sp}^{l_m} = t_c^{l_m} = 0,\quad d_{l-1} = d_l + d_{l-1},\quad \tau_b^{l-1} = \tau_b^{l} + t_b^{l} \qquad (5)$$

In equation (5), $l > 1$ indicates that layer 1 cannot be a merge layer because there is no previous layer to merge into; $t_{sp}^{l_m} = t_c^{l_m} = 0$ indicates that $l_m$ is neither compressed nor communicated after its gradient computation completes; $d_{l-1} = d_l + d_{l-1}$ indicates that the gradient tensor $d_l$ of the merge layer $l_m$ is accumulated into the previous layer's tensor $d_{l-1}$; and $\tau_b^{l-1} = \tau_b^{l} + t_b^{l}$ indicates that the gradient computation of layer l-1 begins immediately after the gradient computation of layer l completes.
CN202410387893.3A 2024-04-01 Dynamic layering gradient compression method for neural network model Active CN118052260B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410387893.3A CN118052260B (en) 2024-04-01 Dynamic layering gradient compression method for neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410387893.3A CN118052260B (en) 2024-04-01 Dynamic layering gradient compression method for neural network model

Publications (2)

Publication Number Publication Date
CN118052260A true CN118052260A (en) 2024-05-17
CN118052260B CN118052260B (en) 2024-08-02


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021247118A1 (en) * 2020-06-01 2021-12-09 Microsoft Technology Licensing, Llc Model compression by sparsity-inducing regularization optimization
CN114611656A (en) * 2020-12-08 2022-06-10 华为技术有限公司 Gradient sparse method and gradient sparse device of neural network
CN113515370A (en) * 2021-04-28 2021-10-19 之江实验室 Distributed training method for large-scale deep neural network
WO2022251317A1 (en) * 2021-05-27 2022-12-01 Rutgers, The State University Of New Jersey Systems of neural networks compression and methods thereof
CN113988266A (en) * 2021-11-01 2022-01-28 南京大学 Top-k-based adaptive distributed gradient compression method supporting complex network conditions

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RAVID SHWARTZ-ZIV et al.: "REPRESENTATION COMPRESSION AND GENERALIZATION IN DEEP NEURAL NETWORKS", CONFERENCE PAPER AT ICLR 2019, 9 June 2019 (2019-06-09) *
ZHOU YUTAO: "Research and Implementation of Distributed Deep Learning Optimization Methods for Edge Computing", China Master's Theses Full-text Database, Information Science and Technology, no. 06, 15 June 2022 (2022-06-15) *

Similar Documents

Publication Publication Date Title
CN108268638B (en) Distributed implementation method for generating countermeasure network based on Spark framework
CN110533183B (en) Task placement method for heterogeneous network perception in pipeline distributed deep learning
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN114756383B (en) Distributed computing method, system, equipment and storage medium
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN113191484A (en) Federal learning client intelligent selection method and system based on deep reinforcement learning
Wang et al. A comprehensive survey on training acceleration for large machine learning models in IoT
CN109635922B (en) Distributed deep learning parameter quantification communication optimization method and system
CN112884236B (en) Short-term load prediction method and system based on VDM decomposition and LSTM improvement
CN113159287A (en) Distributed deep learning method based on gradient sparsity
CN114580636A (en) Neural network lightweight deployment method based on three-target joint optimization
CN116167436A (en) Neural network pipeline parallel training method for optimizing model division
CN113627519A (en) Distributed random gradient descent method with compression and delay compensation
Zhuang et al. Accumulated decoupled learning with gradient staleness mitigation for convolutional neural networks
CN118052260B (en) Dynamic layering gradient compression method for neural network model
CN116502774B (en) Time sequence prediction method based on time sequence decomposition and Legend projection
Zhang et al. Evaluation and optimization of gradient compression for distributed deep learning
CN118052260A (en) Dynamic layering gradient compression method for neural network model
Wu et al. Efficient federated learning on resource-constrained edge devices based on model pruning
Zhang et al. Optimizing federated edge learning on Non-IID data via neural architecture search
CN117892769B (en) Neural network training method, video memory scheduling method, system, equipment and product
CN118133929B (en) Method and device for accelerating neural network training based on node freezing
CN117707795B (en) Graph-based model partitioning side collaborative reasoning method and system
CN111598399B (en) Ultra-large-scale power transmission network expansion planning method based on distributed computing platform
Zhao et al. Bridging the Gap Between Memory and Communication Efficiency on Distributed Deep Learning Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant