CN114118381B - Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication - Google Patents
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N20/00—Machine learning
Abstract
The invention relates to the field of distributed learning and discloses a learning method, device, equipment, and medium based on adaptive aggregation sparse communication. The method acquires an adaptive aggregation rule and determines target nodes according to it; performs sparse processing on the information corresponding to the target nodes; calculates a convergence result from a preset sequence combined with a Lyapunov function; and trains a deep neural network model to obtain the learning method. The adaptive selection rule skips some communication rounds, and sparsifying the transmitted information further reduces the number of communication bits. To handle the bias of the top-k sparsification operator, the algorithm uses an error-feedback scheme, thereby achieving the technical effect of fully utilizing the computing capacity of a distributed cluster.
Description
Technical Field
The present disclosure relates to the field of distributed learning, and in particular, to a learning method, apparatus, device, and medium based on adaptive aggregation sparse communication.
Background
Stochastic optimization algorithms implemented on distributed computing architectures are increasingly used to address large-scale machine learning problems. One key bottleneck in such systems is the communication overhead of exchanging information, such as stochastic gradients, between different nodes. Memory-preserving sparse communication methods and adaptive aggregation methods are among the frameworks proposed to address this problem. Intuitively, having multiple processors co-train on a task should speed up the training process and reduce training time. However, the cost of communication between processors often hinders the scalability of a distributed system. Worse yet, when the ratio of computation to communication is low, multiple processors may perform worse than a single processor.
Therefore, how to fully utilize the computing power of the distributed clusters is a technical problem to be solved.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main object of the present invention is to provide a learning method, device, equipment, and medium based on adaptive aggregation sparse communication, aiming to solve the technical problem that the prior art cannot fully utilize the computing capacity of distributed clusters.
In order to achieve the above object, the present invention provides a learning method based on adaptive aggregation sparse communication, the method comprising:
acquiring an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
performing sparse processing on target information corresponding to the target node;
calculating a convergence result according to a preset sequence and a Lyapunov function;
the deep neural network model is trained to obtain a learning method.
Optionally, the step of acquiring the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule includes:
acquiring a preset self-adaptive aggregation rule;
dividing all nodes into two disjoint sets, M_t and its complement, according to whether they communicate with the server under the adaptive aggregation rule; and
at the t-th iteration, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the remaining nodes, thereby reducing the communication rounds per iteration from M to |M_t| and determining the target node.
Optionally, the step of performing sparse processing on the target information corresponding to the target node includes:
selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that zero-valued elements need not be communicated.
Optionally, the step of selecting the top-k gradient components of the target information corresponding to the target node at each iteration and zeroing the remaining components, so that zero-valued elements need not be communicated, further includes:
using an error-feedback technique that incorporates the error produced by sparsification into the next iteration to ensure convergence; and
defining a helper sequence v_t in terms of the errors e_t^m, where e_t^m is the error at the t-th iteration on node m.
Optionally, the step of calculating the convergence result from the preset sequence in combination with the Lyapunov function includes:
denoting the corresponding Lyapunov function and selecting the learning rate in terms of a constant c_γ > 0, thereby obtaining the convergence result.
Optionally, the step of training the deep neural network model to obtain a learning method includes:
performing the training using the iterative format described above, applied to the sparsified, error-corrected gradients.
optionally, after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further includes:
in combination with the adaptive aggregation rule, iterating with an update that aggregates fresh gradients from the communicating nodes and stale gradients from the others,
where M_t and its complement are the working sets of nodes that do and do not communicate with the server at the t-th iteration, respectively.
In addition, in order to achieve the above object, the present invention also proposes a learning device based on adaptive aggregation sparse communication, which is characterized in that the device includes:
the node determining module is used for acquiring the self-adaptive aggregation rule and determining a target node according to the self-adaptive aggregation rule;
the sparse processing module is used for carrying out sparse processing on the target information corresponding to the target node;
the result acquisition module is used for calculating a convergence result according to a preset sequence and combining with the Lyapunov function;
and the model training module is used for training the deep neural network model to obtain a learning method.
In addition, to achieve the above object, the present invention also proposes a computer apparatus including: the system comprises a memory, a processor and an adaptive aggregate sparse communication-based learning program stored on the memory and executable on the processor, the adaptive aggregate sparse communication-based learning program configured to implement the adaptive aggregate sparse communication-based learning method as described above.
In addition, in order to achieve the above object, the present invention also proposes a medium having stored thereon a learning program based on adaptive aggregation sparse communication, which when executed by a processor, implements the steps of the learning method based on adaptive aggregation sparse communication as described above.
The method comprises the steps of obtaining an adaptive aggregation rule and determining target nodes according to it; performing sparse processing on the information corresponding to the target nodes; calculating a convergence result from a preset sequence combined with a Lyapunov function; and training a deep neural network model to obtain the learning method. The adaptive selection rule skips some communication rounds, and sparsifying the transmitted information further reduces the number of communication bits. To handle the bias of the top-k sparsification operator, the algorithm uses an error-feedback scheme, thereby achieving the technical effect of fully utilizing the computing capacity of the distributed cluster.
Drawings
FIG. 1 is a schematic structural diagram of a learning device based on adaptive aggregation sparse communication in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the learning method based on adaptive aggregation sparse communication according to the present invention;
fig. 3 is a comparison diagram of four algorithms based on adaptive aggregated sparse communications according to an embodiment of the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a learning device based on adaptive aggregation sparse communication in a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the learning device based on adaptive aggregation sparse communication may include: a processor 1001, such as a central processing unit (Central Processing Unit, CPU), a communication bus 1002, a user interface 1003, a network interface 1004, a memory 1005. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a Wireless interface (e.g., a Wireless-Fidelity (WI-FI) interface). The Memory 1005 may be a high-speed random access Memory (Random Access Memory, RAM) Memory or a stable nonvolatile Memory (NVM), such as a disk Memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.
Those skilled in the art will appreciate that the structure shown in fig. 1 does not constitute a limitation of the learning device based on adaptive aggregated sparse communications, and may include more or fewer components than illustrated, or may combine certain components, or may be a different arrangement of components.
As shown in fig. 1, the memory 1005, as a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, and a learning program based on adaptive aggregation sparse communication.
In the learning device based on adaptive aggregation sparse communication shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 in the learning device based on the adaptive aggregation sparse communication according to the present invention may be disposed in the learning device based on the adaptive aggregation sparse communication, where the learning device based on the adaptive aggregation sparse communication invokes the learning program based on the adaptive aggregation sparse communication stored in the memory 1005 through the processor 1001, and executes the learning method based on the adaptive aggregation sparse communication provided by the embodiment of the present invention.
The embodiment of the invention provides a learning method based on self-adaptive aggregation sparse communication, and referring to fig. 2, fig. 2 is a flow diagram of a first embodiment of the learning method based on self-adaptive aggregation sparse communication.
In this embodiment, the learning method based on adaptive aggregation sparse communication includes the following steps:
step S10: and acquiring an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule. A step of
It should be noted that the size and complexity of machine learning (ML) models and datasets have increased significantly over the last decades, leading to higher computational intensity and thus more time-consuming training processes. With the development of distributed training, multiple processors are used to accelerate it. A large number of distributed machine learning tasks can be written as the minimization problem
min_ω f(ω) := (1/M) Σ_{m=1}^{M} E[f_m(ω; ξ_m)],
where ω ∈ R^d is the parameter vector to be learned, d is the dimension of the parameter, M := {1, ..., M} represents the set of distributed nodes, f_m is a smooth (not necessarily convex) loss function on node m, and the ξ_m are independent random data samples following the probability distribution associated with node m.
It will be appreciated that, for simplicity, shorthand notation is defined for the averaged loss and the stochastic gradients.
In a specific implementation, the stochastic gradient descent (SGD) algorithm is the workhorse for solving this problem, with the iterative format
ω_{t+1} = ω_t − (γ/M) Σ_{m=1}^{M} ∇f_m(ω_t; ξ_t^m),
where γ is the learning rate and ξ_t^m is the mini-batch of data that node m selects at the t-th iteration.
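The synchronous distributed SGD update just described can be sketched in a few lines; the function name and the toy dimensions below are illustrative, not taken from the patent:

```python
import numpy as np

def sgd_step(omega, grads, gamma):
    """One synchronous distributed SGD step: the server averages the
    stochastic gradients sent by the M nodes and updates the parameters."""
    return omega - gamma * np.mean(grads, axis=0)

# toy example: d = 3 parameters, M = 4 nodes
rng = np.random.default_rng(0)
omega = np.zeros(3)
grads = [rng.normal(size=3) for _ in range(4)]
omega = sgd_step(omega, grads, gamma=0.1)
```

In this baseline, every node communicates its full gradient every iteration — exactly the cost that the adaptive aggregation and sparsification below are designed to cut.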
Further, the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule includes: acquiring a preset adaptive aggregation rule; dividing all nodes into two disjoint sets, M_t and its complement, according to whether they communicate with the server under the adaptive aggregation rule; and, at the t-th iteration, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the remaining nodes, thereby reducing the communication rounds per iteration from M to |M_t| and determining the target node.
Further, after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further includes:
in combination with the adaptive aggregation rule, iterating with an update that aggregates fresh gradients from the communicating nodes and stale gradients from the others,
where M_t and its complement are the working sets of nodes that do and do not communicate with the server at the t-th iteration, respectively.
It should be noted that reducing the number of communication rounds is important for improving communication efficiency. Higher-order information (Newton-type methods) can replace conventional gradient information to reduce the number of communication rounds, and a distributed preconditioned accelerated gradient method has been proposed for the same purpose. Many new aggregation techniques, such as periodic aggregation and adaptive aggregation, have also been developed to skip certain communications: each node is allowed to perform local model updates independently, and the resulting models are averaged periodically. The lazy aggregated gradient (LAG) method updates the model on the server side, and nodes adaptively upload only information with sufficient informational content. Unfortunately, while LAG performs well in deterministic settings (i.e., with full gradients), its performance degrades significantly in stochastic settings. More recent efforts address adaptive aggregation in stochastic settings: the communication-censored distributed stochastic gradient descent algorithm (CSGD) increases the batch size to mitigate the effect of stochastic gradient noise, and the lazy aggregated stochastic gradient (LASG) algorithm designs a family of new adaptive communication rules tailored to stochastic gradients, achieving good experimental results.
In a specific implementation, a communication-efficient algorithm, SASG, which combines sparse communication with adaptively aggregated stochastic gradients, is presented herein. The SASG method saves both communication bits and communication rounds without sacrificing the required convergence properties. Considering that in a distributed learning system not all communication rounds between the server and the nodes are equally important, the frequency of communication between a node and the server can be adjusted according to the importance of the information that node transmits. More specifically, to reduce the number of communication rounds, an adaptive selection rule is formulated that divides the set of nodes M into two disjoint sets, M_t and its complement. At the t-th iteration, only the new gradient information of the nodes selected into M_t is used, while the old compressed gradient information of the remaining nodes is reused, reducing the communication rounds per iteration from M to |M_t|. On the other hand, since quantization methods can achieve at most a 32-fold compression ratio on common single-precision floating-point values, the algorithm adopts the more effective approach of sparsification. In particular, the top-k gradient components (by absolute value) are selected at each iteration and the remaining components are set to zero, so that zero-valued elements need not be communicated, significantly reducing the number of communication bits.
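The top-k operator described above can be sketched as follows (a minimal illustration; transmitting only the surviving index–value pairs is an assumption about how the bit savings are realized):

```python
import numpy as np

def top_k(g, k):
    """Keep the k largest-magnitude components of g; zero the rest.
    Only the k surviving (index, value) pairs need to be transmitted."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]  # indices of the top-k components by |.|
    out[idx] = g[idx]
    return out

g = np.array([0.1, -2.0, 0.3, 1.5, -0.2])
sparse_g = top_k(g, 2)  # only -2.0 and 1.5 survive; the rest are zeroed
```

With k set to 1% of the dimension d (as in the experiments below), the payload shrinks from d values to roughly d/100 index–value pairs.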
In a specific implementation, note that in a distributed learning system the communication rounds between the server and the nodes do not all contribute equally, so an adaptive aggregation method is employed to develop an aggregation rule that can skip inefficient communication rounds. This adaptive aggregation method derives from the lazy aggregated gradient (LAG) method, which designs an adaptive criterion to detect nodes whose gradients change little and to reuse their old gradients. Combining such an adaptive aggregation rule yields an iterative format that aggregates fresh gradients from the communicating nodes and stale gradients from the others,
where M_t and its complement are the working sets of nodes that do and do not communicate with the server at the t-th iteration, respectively.
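As a rough sketch of such a LAG-style selection rule — the threshold form and the constant `alpha` here are illustrative assumptions, not the patent's exact condition:

```python
import numpy as np

def select_nodes(new_grads, old_grads, omega_diffs, alpha=1.0):
    """Split nodes into M_t (upload fresh gradients) and its complement
    (server reuses stale gradients). A node communicates only when its
    gradient has changed enough relative to the recent parameter movement
    -- an illustrative LAG-style criterion."""
    threshold = alpha * sum(np.sum(d ** 2) for d in omega_diffs)
    M_t, skipped = [], []
    for m, (g_new, g_old) in enumerate(zip(new_grads, old_grads)):
        if np.sum((g_new - g_old) ** 2) > threshold:
            M_t.append(m)      # informative change: communicate this round
        else:
            skipped.append(m)  # reuse the stale gradient at the server
    return M_t, skipped
```

The server then aggregates fresh gradients from `M_t` and the cached stale gradients from `skipped`, so each iteration costs only |M_t| communication rounds instead of M.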
Step S20: and carrying out sparse processing on the target information corresponding to the target node.
It should be noted that such studies mainly develop around the ideas of quantization and sparsification. Quantization methods compress information by transmitting lower-bit representations instead of the original 32-bit data. The quantized stochastic gradient descent algorithm (QSGD) uses adjustable quantization levels, providing additional flexibility to trade off per-iteration communication cost against convergence speed. Ternary gradients reduce the communication data size by reducing each component of the gradient to its sign bit. Sparsification methods instead aim to reduce the number of elements transmitted per iteration, and fall into two main categories: random sparsification and deterministic sparsification. Random sparsification selects components for communication at random; this method is named random-k, where k denotes the number of selected components. Such random selection is typically an unbiased estimate of the original gradient, which makes it friendly to theoretical analysis. Unlike random sparsification, deterministic sparsification considers the magnitude of each component and retains only the k largest components of the stochastic gradient; this method is known as top-k. Compared with the unbiased approach, top-k must use an error-feedback or accumulation procedure to ensure that all gradient information is eventually, if with some delay, incorporated into the model.
In particular implementations, after the adaptive selection process, the selected node sends sparse information derived by the top-k operator to the parameter server.
Further, the step of performing sparse processing on the target information corresponding to the target node includes: selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that zero-valued elements need not be communicated.
Further, the step of selecting the top-k gradient components of the target information corresponding to the target node and zeroing the remaining components, so that zero-valued elements need not be communicated, further includes: using an error-feedback technique that incorporates the error produced by sparsification into the next iteration to ensure convergence; and defining a helper sequence v_t in terms of the errors e_t^m, where e_t^m is the error at the t-th iteration on node m.
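A minimal sketch of the error-feedback step on a single node follows; variable names are illustrative, and the exact update behind the patent's helper sequence is not reproduced here:

```python
import numpy as np

def top_k(g, k):
    """Keep the k largest-magnitude components of g; zero the rest."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]
    out[idx] = g[idx]
    return out

def ef_compress(g, e, k):
    """Error-feedback step on one node: fold the residual e carried over
    from the previous iteration into the fresh gradient, compress, and
    store the new residual, so gradient information is delayed but never
    permanently lost."""
    corrected = g + e            # re-inject the old sparsification error
    sent = top_k(corrected, k)   # what is actually transmitted
    e_new = corrected - sent     # residual kept locally for the next round
    return sent, e_new
```

Over successive iterations the residuals telescope — `sent + e_new` always equals `g + e` — which is exactly the property the helper sequence in the convergence analysis exploits.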
Step S30: and calculating a convergence result according to a preset sequence and combining the Lyapunov function.
It should be noted that this embodiment applies a biased top-k sparsification operator, and the introduced compression error makes the convergence analysis more complicated. We define an auxiliary sequence {v_t}_{t=0,1,...}, which can be regarded as an error-corrected approximation of {ω_t}_{t=0,1,...}. By analyzing this sequence we obtain the convergence result of the SASG algorithm, whose convergence rate matches that of the original SGD.
Further, the step of calculating the convergence result from the preset sequence in combination with the Lyapunov function includes: denoting the corresponding Lyapunov function and selecting the learning rate in terms of a constant c_γ > 0, which yields the convergence result.
It will be appreciated that, despite skipping many communication rounds and performing communication compression, the algorithm guarantees convergence and achieves a sublinear convergence rate. In other words, with well-designed adaptive aggregation rules and sparse communication techniques, the SASG algorithm still attains a convergence rate of the same order of magnitude as the SGD method.
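For nonconvex smooth objectives, sublinear guarantees of this kind typically take the following form (a sketch of the standard rate; the patent's exact Lyapunov function and constants are not reproduced here):

```latex
\min_{t=0,\dots,T-1} \mathbb{E}\,\|\nabla f(\omega_t)\|^2
  \;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right),
\qquad \text{with } \gamma = \frac{c_\gamma}{\sqrt{T}},\; c_\gamma > 0,
```

which matches the O(1/√T) rate of vanilla SGD up to constants, consistent with the claim that skipped rounds and compression do not change the order of convergence.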
Step S40: the deep neural network model is trained to obtain a learning method.
Further, the training the deep neural network model to obtain the learning method includes the steps of:
performing the training using the iterative format described above.
in a specific implementation, the SASG algorithm is benchmarked using an inert-polymeric random gradient (LASG) method, a sparsification method, and a distributed SGD. Experience shows that up to 99% gradient information is not necessary in each iteration, so we use top-1% sparsification operators in the SASG algorithm and the sparsification method. In all experiments, the training data was distributed among m=10 nodes, each node performing one training iteration using 10 samples. We completed the following three set-up evaluations, each experiment being repeated five times. MNIST the MNIST dataset contains 70,000 handwritten digits in 10 categories, with 60,000 examples in the training set and 10,000 examples in the test set. We consider a two-layer Fully Connected (FC) neural network model, with 512 neurons in the second layer for class 10 classification on MNIST. For all algorithms we choose a learning rate γ=0.005. For the adaptive aggregation algorithms SASG and last, we set d=10, αd=1/2γ, d=1, 2. CIFAR-10[39 ]]The dataset consisted of 60,000 color images in 10 categories, each category having 6,000 images. We tested the res net18 model on the CIFAR-10 dataset using all algorithms described above. This experiment performed common data enhancement techniques such as random clipping, random flipping, and normalization. The basic learning rate was set to γ=0.01, and the learning rate was attenuated to 0.001 at the 20 th lot. For SASG and LASG, we set d=10, αd=1/γ, d=1, 2. CIFAR-100 the CIFAR-100 dataset contains 60,000 color images of 100 categories of 600 images each. There are 500 training images and 100 test images for each category. We tested VGG16 model on CIFAR-100 dataset [41]. The experiment performed a similar data enhancement technique. The basic learning rate was set to γ=0.01, and the learning rate was attenuated to 0.001 at the 30 th lot. For SASG and LASG we set d=10, αd=4/D/γ 2 D=1, 2,..10. 
Our experimental results are based on the PyTorch implementation of all methods run on a Ubuntu 20.04 machine equipped with a Nvidia RTX-2080Ti GPU.
By computing the parameter counts of the different models, the number of communication bits each algorithm needs to reach the same baseline can be obtained immediately. The last column of fig. 3 shows that the SASG algorithm, combining the adaptive aggregation technique with sparse communication, significantly reduces the number of communication bits required for the model to achieve the same performance, far outperforming the LASG and sparsification algorithms.
This embodiment obtains an adaptive aggregation rule and determines target nodes according to it; performs sparse processing on the information corresponding to the target nodes; calculates a convergence result from a preset sequence combined with a Lyapunov function; and trains a deep neural network model to obtain the learning method. The adaptive selection rule skips some communication rounds, and sparsifying the transmitted information further reduces the number of communication bits. To handle the bias of the top-k sparsification operator, the algorithm uses an error-feedback scheme, thereby achieving the technical effect of fully utilizing the computing capacity of the distributed cluster.
In addition, the embodiment of the invention also provides a medium, wherein the medium is stored with a learning program based on the adaptive aggregation sparse communication, and the learning program based on the adaptive aggregation sparse communication realizes the steps of the learning method based on the adaptive aggregation sparse communication when being executed by a processor.
Other embodiments or specific implementation manners of the learning device based on adaptive aggregation sparse communication according to the present invention may refer to the above method embodiments, and will not be described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.
Claims (4)
1. A learning method based on adaptive aggregation sparse communication, the method comprising:
acquiring an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
performing sparse processing on target information corresponding to the target node;
calculating a convergence result according to a preset sequence and a Lyapunov function;
training the deep neural network model to obtain a learning method;
the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule comprises the following steps:
acquiring a preset self-adaptive aggregation rule;
dividing, according to the adaptive aggregation rule, all nodes into two disjoint sets: the set M_t of nodes that communicate with the server, and its complement;
at the t-th iteration, using the new gradient information of the nodes selected in M_t while reusing the old compressed gradient information of the remaining nodes, thereby reducing the number of nodes communicating per iteration from M to |M_t|, to determine the target node;
the step of performing sparse processing on the target information corresponding to the target node includes:
selecting, at each iteration, the top-k gradient components of the target information corresponding to the target node and setting the remaining gradient components to zero, so that zero-valued elements need not be communicated;
the step of selecting the top-k gradient components of the target information corresponding to the target node and setting the remaining gradient components to zero, so that zero-valued elements need not be communicated, is followed by:
using an error feedback technique to incorporate the error produced by sparsification into the next iteration, ensuring convergence;
defining an auxiliary sequence that incorporates the error at the t-th iteration on node m;
the step of calculating the convergence result according to the preset sequence and combining with the Lyapunov function comprises the following steps:
denoting the preset sequence, the learning rate is selected as follows:
wherein c_γ > 0 is a constant, giving:
a convergent calculation result;
the training of the deep neural network model to obtain the learning method comprises the following steps:
training is performed using the following iterative format:
wherein,
after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further comprises the following steps:
in combination with the adaptive aggregation rule, iterating by using the following iteration format:
wherein M_t and its complement are, respectively, the working sets of nodes that do and do not communicate with the server at the t-th iteration; M := {1, ..., M} denotes the set of distributed nodes; and the remaining symbol denotes the mini-batch of data selected by node m at the t-th iteration.
2. A learning apparatus based on adaptive aggregation sparse communication, employing the learning method based on adaptive aggregation sparse communication according to claim 1, the apparatus comprising:
the node determining module is used for acquiring the self-adaptive aggregation rule and determining a target node according to the self-adaptive aggregation rule;
the sparse processing module is used for carrying out sparse processing on the target information corresponding to the target node;
the result acquisition module is used for calculating a convergence result according to a preset sequence and combining with the Lyapunov function;
and the model training module is used for training the deep neural network model to obtain a learning method.
3. A learning device based on adaptive aggregated sparse communications, the device comprising: a memory, a processor, and an adaptive aggregated sparse communication based learning program stored on the memory and executable on the processor, the adaptive aggregated sparse communication based learning program configured to implement the steps of the adaptive aggregated sparse communication based learning method of claim 1.
4. A medium having stored thereon a learning program based on adaptive aggregated sparse communication, which when executed by a processor, implements the steps of the learning method based on adaptive aggregated sparse communication of claim 1.
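For illustration only, the mechanism the claims describe can be sketched numerically as follows. This is a minimal sketch, not the claimed implementation: the LAG-style change threshold in `adaptive_working_set` is an assumed instantiation of the adaptive aggregation rule (the claims do not spell one out), and all function names, variable names, and parameters here are assumptions.

```python
import numpy as np

def topk_sparsify(g, k):
    """Keep the k largest-magnitude components of g and zero the rest,
    so that zero-valued entries need not be communicated."""
    out = np.zeros_like(g)
    idx = np.argsort(np.abs(g))[-k:]   # indices of the k largest magnitudes
    out[idx] = g[idx]
    return out

def adaptive_working_set(grads, last_sent, threshold):
    """Assumed LAG-style rule: node m joins M_t only if its gradient has
    changed enough since the message it last transmitted."""
    return {m for m, (g, prev) in enumerate(zip(grads, last_sent))
            if np.linalg.norm(g - prev) ** 2 > threshold}

def train_step(x, grads, last_sent, errors, lr, k, threshold):
    """One iteration: nodes in M_t send fresh top-k gradients with error
    feedback; the server reuses the stale compressed gradients of all
    other nodes, cutting communication from M nodes to |M_t| nodes."""
    M = len(grads)
    M_t = adaptive_working_set(grads, last_sent, threshold)
    agg = np.zeros_like(x)
    for m in range(M):
        if m in M_t:                          # communicates this round
            corrected = grads[m] + errors[m]  # error feedback
            sent = topk_sparsify(corrected, k)
            errors[m] = corrected - sent      # residual carried forward
            last_sent[m] = sent               # cached for stale reuse
        agg += last_sent[m]                   # fresh or stale message
    return x - lr * agg / M, M_t
```

In a real system the sparsified message would be transmitted as index–value pairs, and nodes outside M_t would transmit nothing at all in that round, which is where the communication saving comes from.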
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111470644.3A CN114118381B (en) | 2021-12-03 | 2021-12-03 | Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114118381A CN114118381A (en) | 2022-03-01 |
CN114118381B true CN114118381B (en) | 2024-02-02 |
Family
ID=80366670
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111470644.3A Active CN114118381B (en) | 2021-12-03 | 2021-12-03 | Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114118381B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116341628B (en) * | 2023-02-24 | 2024-02-13 | 北京大学长沙计算与数字经济研究院 | Gradient sparsification method, system, equipment and storage medium for distributed training |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111784002A (en) * | 2020-09-07 | 2020-10-16 | 腾讯科技(深圳)有限公司 | Distributed data processing method, device, computer equipment and storage medium |
CN112424797A (en) * | 2018-05-17 | 2021-02-26 | 弗劳恩霍夫应用研究促进协会 | Concept for the transmission of distributed learning of neural networks and/or parametric updates thereof |
CN112766502A (en) * | 2021-02-27 | 2021-05-07 | 上海商汤智能科技有限公司 | Neural network training method and device based on distributed communication and storage medium |
CN113159287A (en) * | 2021-04-16 | 2021-07-23 | 中山大学 | Distributed deep learning method based on gradient sparsity |
CN113315604A (en) * | 2021-05-25 | 2021-08-27 | 电子科技大学 | Adaptive gradient quantization method for federated learning |
CN113467949A (en) * | 2021-07-07 | 2021-10-01 | 河海大学 | Gradient compression method for distributed DNN training in edge computing environment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11501160B2 (en) * | 2019-03-28 | 2022-11-15 | International Business Machines Corporation | Cloud computing data compression for allreduce in deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN114118381A (en) | 2022-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lei et al. | GCN-GAN: A non-linear temporal link prediction model for weighted dynamic networks | |
US11726769B2 (en) | Training user-level differentially private machine-learned models | |
Ma et al. | Layer-wised model aggregation for personalized federated learning | |
US20230376856A1 (en) | Communication Efficient Federated Learning | |
EP3362918A1 (en) | Systems and methods of distributed optimization | |
CN113191484A (en) | Federal learning client intelligent selection method and system based on deep reinforcement learning | |
CN109597965B (en) | Data processing method, system, terminal and medium based on deep neural network | |
CN111158912A (en) | Task unloading decision method based on deep learning in cloud and mist collaborative computing environment | |
CN113850272A (en) | Local differential privacy-based federal learning image classification method | |
CN111737743A (en) | Deep learning differential privacy protection method | |
CN113778691B (en) | Task migration decision method, device and system | |
CN114118381B (en) | Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication | |
US20230342606A1 (en) | Training method and apparatus for graph neural network | |
CN112417500A (en) | Data stream statistical publishing method with privacy protection function | |
US20210326757A1 (en) | Federated Learning with Only Positive Labels | |
CN112884513A (en) | Marketing activity prediction model structure and prediction method based on depth factorization machine | |
US7617172B2 (en) | Using percentile data in business analysis of time series data | |
WO2024066143A1 (en) | Molecular collision cross section prediction method and apparatus, device, and storage medium | |
CN114830137A (en) | Method and system for generating a predictive model | |
CN110768825A (en) | Service flow prediction method based on network big data analysis | |
Chen et al. | Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices | |
Kushwaha et al. | Optimal device selection in federated learning for resource-constrained edge networks | |
CN113760407A (en) | Information processing method, device, equipment and storage medium | |
Zhang et al. | FedMPT: Federated Learning for Multiple Personalized Tasks Over Mobile Computing | |
CN113034343A (en) | Parameter-adaptive hyperspectral image classification GPU parallel method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||