CN114118381A - Learning method, device, equipment and medium based on adaptive aggregation sparse communication

Learning method, device, equipment and medium based on adaptive aggregation sparse communication

Info

Publication number
CN114118381A
CN114118381A
Authority
CN
China
Prior art keywords
adaptive
communication
sparse
target node
adaptive aggregation
Prior art date
Legal status
Granted
Application number
CN202111470644.3A
Other languages
Chinese (zh)
Other versions
CN114118381B (en)
Inventor
邓晓歌 (Deng Xiaoge)
李东升 (Li Dongsheng)
孙涛 (Sun Tao)
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111470644.3A
Publication of CN114118381A
Application granted
Publication of CN114118381B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N20/00 - Machine learning
    • G06N3/08 - Learning methods


Abstract

The invention relates to the field of distributed learning and discloses a learning method, device, equipment and medium based on adaptive aggregation sparse communication. An adaptive aggregation rule is obtained, and a target node is determined according to the adaptive aggregation rule; sparse processing is performed on the target information corresponding to the target node; a convergence result is calculated by combining a preset sequence with a Lyapunov function; and a deep neural network model is trained to obtain the learning method. Some communication rounds are skipped adaptively through the adaptive selection rule, and the number of communication bits is further reduced by sparsifying the transmitted information. To handle the bias of the top-k sparsification operator, an error feedback scheme is used in the algorithm, thereby achieving the technical effect of fully utilizing the computing power of the distributed cluster.

Description

Learning method, device, equipment and medium based on adaptive aggregation sparse communication
Technical Field
The present application relates to the field of distributed learning, and in particular, to a learning method, apparatus, device, and medium based on adaptive aggregation sparse communication.
Background
Stochastic optimization algorithms implemented on distributed computing architectures are increasingly used to handle large-scale machine learning problems. A key bottleneck in such systems is the communication overhead of exchanging information, such as stochastic gradients, between different nodes. Sparse communication with memory and adaptive aggregation are two classes of techniques that have been proposed to address this problem. Intuitively, training a task collaboratively on multiple processors should speed up the training process and reduce training time. However, the cost of communication between processors often hinders the scalability of distributed systems. Worse, when the ratio of computation to communication is low, the performance of multiple processors may even be lower than that of a single processor.
Therefore, how to fully utilize the computing power of a distributed cluster has become an urgent technical problem.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a learning method, device, equipment and medium based on adaptive aggregation sparse communication, so as to solve the problem in the prior art that the computing power of a distributed cluster cannot be fully utilized.
In order to achieve the above object, the present invention provides a learning method based on adaptive aggregation sparse communication, including:
obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
performing sparse processing on target information corresponding to the target node;
calculating a convergence result by combining a preset sequence with a Lyapunov function;
and training a deep neural network model to obtain the learning method.
Optionally, the step of obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule includes:
acquiring a preset adaptive aggregation rule;
dividing all nodes communicating with the server into two disjoint sets M_t and its complement M_t^c according to the adaptive aggregation rule;
when the t-th iteration is detected, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|, thereby determining the target node.
Optionally, the step of performing sparse processing on the target information corresponding to the target node includes:
selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that the zero elements need not be communicated.
Optionally, after the step of selecting the top-k gradient components of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero so that the zero elements need not be communicated, the method further includes:
using an error feedback technique that carries the error produced by sparsification into the next iteration to ensure convergence;
defining an auxiliary sequence {v^t} from the iterates and the accumulated sparsification errors, where e_m^t is the error at the t-th iteration on node m.
Optionally, the step of calculating a convergence result by combining a preset sequence with a Lyapunov function includes:
denoting a Lyapunov function of the preset sequence, and selecting the learning rate in terms of a constant c_γ > 0, which gives a convergence bound for the algorithm, thereby obtaining the convergence result.
Optionally, the step of training the deep neural network model to obtain the learning method includes:
training with an iterative format in which, at each iteration, the server aggregates the freshly sparsified gradient information uploaded by the nodes in M_t together with the stale compressed gradient information retained for the nodes in M_t^c, and updates the model parameters with the learning rate γ.
Optionally, after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further includes:
performing the iteration in combination with the adaptive aggregation rule, using an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
In addition, to achieve the above object, the present invention further provides an adaptive aggregation sparse communication based learning apparatus, including:
the node determining module is used for obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
the sparse processing module is used for carrying out sparse processing on the target information corresponding to the target node;
the result acquisition module is used for calculating a convergence result according to a preset sequence and the Lyapunov function;
and the model training module is used for training the deep neural network model to obtain a learning method.
In addition, to achieve the above object, the present invention also provides a computer device, including: a memory, a processor, and a learning program based on adaptive aggregation sparse communication stored on the memory and operable on the processor, wherein the learning program based on adaptive aggregation sparse communication is configured to implement the learning method based on adaptive aggregation sparse communication as described above.
Furthermore, to achieve the above object, the present invention further proposes a medium having stored thereon an adaptive aggregated sparse communication based learning program, which when executed by a processor implements the steps of the adaptive aggregated sparse communication based learning method as described above.
The method includes obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule; performing sparse processing on the target information corresponding to the target node; calculating a convergence result by combining a preset sequence with a Lyapunov function; and training a deep neural network model to obtain the learning method. Some communication rounds are skipped adaptively through the adaptive selection rule, and the number of communication bits is further reduced by sparsifying the transmitted information. To handle the bias of the top-k sparsification operator, an error feedback scheme is used in the algorithm, thereby achieving the technical effect of fully utilizing the computing power of a distributed cluster.
Drawings
FIG. 1 is a schematic structural diagram of a learning device based on adaptive aggregation sparse communication in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the learning method based on adaptive aggregation sparse communication according to the present invention;
fig. 3 is a comparison diagram of four algorithms based on adaptive aggregation sparse communication according to an embodiment of the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a learning device based on adaptive aggregation sparse communication in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the learning device based on adaptive aggregation sparse communication may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (WI-FI) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation of an adaptive aggregate sparse communication based learning device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and a learning program based on adaptive aggregation sparse communication.
In the learning apparatus based on adaptive aggregation sparse communication shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user; the learning apparatus calls, through the processor 1001, the learning program based on adaptive aggregation sparse communication stored in the memory 1005 and executes the learning method based on adaptive aggregation sparse communication provided by the present invention.
The embodiment of the invention provides a learning method based on adaptive aggregation sparse communication, and referring to fig. 2, fig. 2 is a schematic flow diagram of a first embodiment of the learning method based on adaptive aggregation sparse communication.
In this embodiment, the learning method based on adaptive aggregation sparse communication includes the following steps:
step S10: and acquiring a self-adaptive aggregation rule and determining a target node according to the self-adaptive aggregation rule. A
It should be noted that over the past decades, machine learning (ML) models and datasets have grown significantly in size and complexity, resulting in higher computational intensity and therefore a more time-consuming training process. This has driven the development of distributed training, which uses multiple processors for acceleration. A large number of distributed machine learning tasks can be described as

min_{ω∈R^d} f(ω) := Σ_{m∈M} f_m(ω),

where ω is the parameter to be learned, d is the dimension of the parameter, M = {1, ..., M} represents the set of distributed nodes, f_m is the smooth loss function (not necessarily convex) at node m, and ξ_m is an independent random data sample associated with the probability distribution D_m.
It is to be understood that, for simplicity, the local objective at node m is defined as f_m(ω) := E_{ξ_m~D_m}[f(ω; ξ_m)]. In a specific implementation, the stochastic gradient descent (SGD) algorithm is the workhorse for solving this problem, with the iteration format

ω_{t+1} = ω_t − γ Σ_{m∈M} ∇f(ω_t; ξ_m^t),

where γ is the learning rate and ξ_m^t is the mini-batch of data selected by node m at the t-th iteration.
Further, the step of obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule includes: acquiring a preset adaptive aggregation rule; dividing all nodes communicating with the server into two disjoint sets M_t and its complement M_t^c according to the adaptive aggregation rule; and, when the t-th iteration is detected, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|, thereby determining the target node.
Further, after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further includes:
performing the iteration in combination with the adaptive aggregation rule, using an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
It should be noted that reducing the number of communication rounds is important for improving communication efficiency. Higher-order information (Newton-type methods) can be used in place of traditional gradient information to reduce the number of communication rounds, and a distributed preconditioned accelerated gradient method has been proposed for the same purpose. Many new aggregation techniques, such as periodic aggregation and adaptive aggregation, have also been developed to skip certain communications: each node is allowed to perform local model updates independently, and the resulting models are averaged periodically. The lazily aggregated gradient (LAG) method updates the model on the server side, and a node uploads its information adaptively only when that information is sufficiently informative. Unfortunately, while LAG performs well in the deterministic setting (i.e., with full gradients), its performance drops significantly in the stochastic setting. More recent work has adapted the aggregation rules to the stochastic setting: the communication-censored distributed stochastic gradient descent algorithm (CSGD) increases the batch size to mitigate the effect of stochastic gradient noise, and the lazily aggregated stochastic gradient algorithm (LASG) designs a set of new adaptive communication rules tailored to stochastic gradients and achieves good experimental results.
In a specific implementation, we present an efficient communication algorithm, SASG, that combines sparse communication with adaptively aggregated stochastic gradients. Our SASG approach saves both the number of communication bits and the number of communication rounds without sacrificing the required convergence properties. Considering that, in a distributed learning system, not all communication rounds between the server and the nodes are equally important, we can adjust the communication frequency between a node and the server according to the importance of the information the node transmits. More specifically, to reduce the number of communication rounds, an adaptive selection rule is established to divide the node set M into two disjoint sets M_t and M_t^c. At the t-th iteration, we use only the new gradient information of the selected nodes in M_t and reuse the compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|. On the other hand, quantization methods can reach at most a 32-fold compression ratio under common single-precision floating-point arithmetic, so a more effective sparsification method is adopted in the algorithm. Specifically, we select the top-k gradient components (in terms of absolute value) at each iteration and set the remaining gradient components to zero, so that the zero-valued elements need not be communicated, which significantly reduces the number of communicated bits.
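A minimal sketch of the top-k operator described here is given below; the helper name and the example tensor are illustrative, the 1% ratio matches the experimental setting reported later, and in practice only the kept values and their indices would actually be transmitted.

```python
# Keep the k largest-magnitude gradient components, zero the rest (illustrative sketch).
import torch

def top_k(vector: torch.Tensor, k: int) -> torch.Tensor:
    sparse = torch.zeros_like(vector)
    _, idx = torch.topk(vector.abs(), k)
    sparse[idx] = vector[idx]            # only these k (index, value) pairs need be sent
    return sparse

g = torch.randn(10_000)                  # a flattened stochastic gradient
g_sparse = top_k(g, k=len(g) // 100)     # top-1%: 100 nonzeros out of 10,000
```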
In a specific implementation, we note that the communication rounds between the server and the nodes do not all contribute equally in a distributed learning system, so we use an adaptive aggregation approach to develop aggregation rules that can skip inefficient communication rounds. This adaptive aggregation method, derived from the lazily aggregated gradient (LAG) method, adaptively detects nodes whose gradients change little and reuses their old gradients. Combining such adaptive aggregation rules yields an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
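The following sketch illustrates the spirit of such a rule: a node is placed in M_t (and communicates) only when its fresh gradient differs sufficiently from the stale message the server already holds. The squared-norm threshold test used here is a simplified stand-in, not the exact selection condition of the patent, and all names are illustrative.

```python
# Simplified lazy-aggregation rule: decide which nodes communicate this round.
import torch

def split_working_sets(new_grads, stale_grads, threshold):
    """Return (M_t, M_t_c): indices of nodes that do / do not upload fresh gradients."""
    M_t, M_t_c = [], []
    for m, (g_new, g_old) in enumerate(zip(new_grads, stale_grads)):
        if torch.sum((g_new - g_old) ** 2) > threshold:
            M_t.append(m)       # large change: upload fresh (sparsified) information
        else:
            M_t_c.append(m)     # small change: the server reuses the old message
    return M_t, M_t_c
```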
Step S20: performing sparse processing on the target information corresponding to the target node.
It should be noted that this line of research has mainly developed around the ideas of quantization and sparsification. Quantization methods compress information by transmitting lower-precision representations instead of the original 32-bit data. The quantized stochastic gradient descent algorithm (QSGD) provides additional flexibility to control the trade-off between per-iteration communication cost and convergence speed through an adjustable quantization level. Ternary gradients reduce the communication data size by reducing each gradient component to its sign bit (one bit). Sparsification methods, in contrast, aim to reduce the number of elements transmitted per iteration and can be divided into two broad categories: stochastic sparsification and deterministic sparsification. Stochastic sparsification randomly selects some components for communication; this method is named random-k, where k denotes the number of selected components. Such random selection is usually an unbiased estimate of the original gradient, which makes it friendly to theoretical analysis. Unlike stochastic sparsification, deterministic sparsification considers the magnitude of each component and retains only the k components of the stochastic gradient with the largest magnitudes; this method is called top-k. Compared with the unbiased scheme, this method clearly requires an error feedback or accumulation procedure to ensure that all gradient information is eventually added to the model, albeit with some delay.
In a specific implementation, after the adaptive selection process, the selected nodes send sparse information derived by the top-k operator to the parameter server.
Further, the step of performing sparse processing on the target information corresponding to the target node includes: selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that the zero elements need not be communicated.
Further, after the step of selecting the top-k gradient components of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero so that the zero elements need not be communicated, the method further includes: using an error feedback technique that carries the error produced by sparsification into the next iteration to ensure convergence; and defining an auxiliary sequence {v^t} from the iterates and the accumulated sparsification errors, where e_m^t is the error at the t-th iteration on node m.
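A minimal sketch of the error feedback step on one node is shown below: the residual discarded by the top-k compressor is stored as the error e_m^t and folded back into the next gradient before compression. The helper names are illustrative.

```python
# Error feedback around a biased top-k compressor (illustrative sketch).
import torch

def top_k(v, k):
    out = torch.zeros_like(v)
    _, idx = torch.topk(v.abs(), k)
    out[idx] = v[idx]
    return out

def compress_with_error_feedback(grad, error, k):
    corrected = grad + error            # fold in the residual e_m^t from the last round
    message = top_k(corrected, k)       # sparse vector actually transmitted
    new_error = corrected - message     # residual e_m^{t+1} kept locally for next time
    return message, new_error
```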
Step S30: calculating a convergence result by combining a preset sequence with a Lyapunov function.
It should be noted that, in this embodiment, a biased top-k sparsification operator is applied, and the compression error it introduces makes the convergence analysis more complicated. We define an auxiliary sequence {v^t}_{t=0,1,...}, which can be viewed as an approximation of the iterate sequence {ω^t}_{t=0,1,...}. By analyzing this sequence, we obtain the convergence result of the SASG algorithm, and its convergence rate matches that of the original SGD.
Further, the step of calculating the convergence result by combining the preset sequence with the Lyapunov function includes: denoting a Lyapunov function of the preset sequence, and selecting the learning rate in terms of a constant c_γ > 0, which gives a convergence bound for the algorithm, thereby obtaining the convergence result.
It will be appreciated that our algorithm guarantees convergence and achieves a sub-linear convergence rate despite skipping many communication rounds and performing communication compression. In other words, the SASG algorithm uses well-designed adaptive aggregation rules and sparse communication techniques, and still achieves the same order of convergence speed as the SGD method.
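The exact bound is the one given by the formula in the original filing; purely as an illustration of the standard form such guarantees take in nonconvex error-feedback SGD analyses, a sublinear result typically reads as follows (a sketch, not the patent's exact statement):

```latex
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla f(\omega_t)\right\|^2
\;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right),
\qquad \text{with learning rate } \gamma=\frac{c_\gamma}{\sqrt{T}},\ c_\gamma>0 .
```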
Step S40: training the deep neural network model to obtain the learning method.
Further, the step of training the deep neural network model to obtain the learning method includes:
training with the SASG iterative format, in which, at each iteration, the server aggregates the freshly sparsified gradient information uploaded by the nodes in M_t together with the stale compressed gradient information retained for the nodes in M_t^c, and updates the model parameters with the learning rate γ.
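Putting the pieces together, the sketch below shows one server-side update of this kind: fresh sparsified messages from the nodes in M_t replace their stored copies, the stored (possibly stale) messages of all nodes are summed, and the parameters are updated with the learning rate γ. All names are illustrative, and the aggregation is a simplified rendering of the iterative format rather than its exact formula.

```python
# One simplified SASG-style server update (illustrative sketch).
import torch

def sasg_server_step(params, lr, M_t, fresh_msgs, stale_msgs):
    """fresh_msgs: dict {m: sparse grad} for m in M_t; stale_msgs: list over all nodes."""
    agg = torch.zeros_like(params)
    for m in range(len(stale_msgs)):
        if m in M_t:
            stale_msgs[m] = fresh_msgs[m]   # node m communicated: refresh its message
        agg += stale_msgs[m]                # skipped nodes contribute their old message
    return params - lr * agg                # ω_{t+1} = ω_t − γ · aggregate
```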
in a specific implementation, the SASG algorithm is benchmark tested using an inert aggregated random gradient (LASG) method, a sparsification method, and a distributed SGD. Experience has shown that up to 99% of the gradient information is not necessary in each iteration, so we use the top-1% sparsification operator in the SASG algorithm and sparsification method. In all experiments, the training data was distributed among 10 nodes, each node using 10 samples for one training iteration. We completed the evaluation under the following three settings, and each experiment was repeated five times. MNIST data set contains 70,000 handwritten digits in 10 categories, with 60,000 examples in the training set and 10,000 examples in the test set. We consider a two-layer fully-connected (FC) neural network model, the second layer having 512 neurons for class 10 classification on MNIST. For all algorithms, we choose the learning rate γ to be 0.005. For the adaptive aggregation algorithms SASG and LASG, we set D-10, α D-1/2 γ, D-1, 2. CIFAR-10[39 ]]The data set consisted of 60,000 color images in 10 categories, each with 6,000 images. We tested the ResNet18 model using all of the algorithms described above on the CIFAR-10 dataset. The experiment performed common data enhancement techniques such as random cropping, random flipping, and normalization. The basic learning rate was set to γ of 0.01, and the learning rate was attenuated to 0.001 at the 20 th batch. For SASG and LASG, we set D to 10, α D to 1/γ, D to 1,2,...,10. CIFAR-100 data set contains 60,000 color images in 100 categories, 600 images in each category. Each category has 500 training images and 100 test images. We tested the VGG16 model [41 ] on the CIFAR-100 dataset]. This experiment performed similar data enhancement techniques. The basic learning rate was set to γ of 0.01, and the learning rate was attenuated to 0.001 at the 30 th batch. For SASG and LASG, we set D10, α D4/D/γ 21,2, 10. Our experimental results were based on PyTorch implementation of all methods run on a Ubuntu 20.04 machine equipped with an Nvidia RTX-2080Ti GPU.
Accordingly, the number of communication bits required by the different algorithms to reach the same baseline can be obtained by counting the number of parameters of the different models. The last column of FIG. 3 shows that the SASG algorithm, which combines the adaptive aggregation technique with sparse communication, significantly reduces the number of communication bits required for the model to achieve the same performance, far outperforming the LASG and sparsification algorithms.
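For intuition on how such bit counts are obtained, the sketch below compares a dense float32 gradient (32 bits per parameter) with a top-1% sparse message (a value plus an index per kept component) for a model of roughly ResNet18 size; the accounting is a simplified illustration, not the exact counting used for FIG. 3.

```python
# Rough per-round communication cost: dense gradient vs. top-1% sparse message.
import math

def dense_bits(num_params: int) -> int:
    return 32 * num_params                          # one float32 per parameter

def topk_bits(num_params: int, k: int) -> int:
    index_bits = math.ceil(math.log2(num_params))   # bits to address one component
    return k * (32 + index_bits)                    # value + index per kept component

n = 11_000_000                                      # roughly the size of ResNet18
print(dense_bits(n), topk_bits(n, k=n // 100))
```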
In this embodiment, an adaptive aggregation rule is obtained and a target node is determined according to the adaptive aggregation rule; sparse processing is performed on the target information corresponding to the target node; a convergence result is calculated by combining a preset sequence with a Lyapunov function; and a deep neural network model is trained to obtain the learning method. Some communication rounds are skipped adaptively through the adaptive selection rule, and the number of communication bits is further reduced by sparsifying the transmitted information. To handle the bias of the top-k sparsification operator, an error feedback scheme is used in the algorithm, thereby achieving the technical effect of fully utilizing the computing power of a distributed cluster.
Furthermore, an embodiment of the present invention further provides a medium, where the medium stores an adaptive aggregation sparse communication based learning program, and the adaptive aggregation sparse communication based learning program, when executed by a processor, implements the steps of the adaptive aggregation sparse communication based learning method as described above.
Other embodiments or specific implementation manners of the learning device based on adaptive aggregation sparse communication according to the present invention may refer to the above method embodiments, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A learning method based on adaptive aggregation sparse communication is characterized by comprising the following steps:
obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
performing sparse processing on target information corresponding to the target node;
calculating a convergence result by combining a preset sequence with a Lyapunov function;
and training a deep neural network model to obtain the learning method.
2. The method of claim 1, wherein the step of obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule comprises:
acquiring a preset adaptive aggregation rule;
dividing all nodes communicating with the server into two disjoint sets M_t and its complement M_t^c according to the adaptive aggregation rule;
when the t-th iteration is detected, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|, thereby determining the target node.
3. The method of claim 1, wherein the step of sparsifying target information corresponding to the target node comprises:
selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that the zero elements need not be communicated.
4. The method of claim 3, wherein after the step of selecting the top-k gradient components of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero so that the zero elements need not be communicated, the method further comprises:
using an error feedback technique that carries the error produced by sparsification into the next iteration to ensure convergence;
defining an auxiliary sequence {v^t} from the iterates and the accumulated sparsification errors, wherein e_m^t is the error at the t-th iteration on node m.
5. The method of claim 1, wherein the step of calculating the convergence result by combining the preset sequence with the Lyapunov function comprises:
denoting a Lyapunov function of the preset sequence, and selecting the learning rate in terms of a constant c_γ > 0, which gives a convergence bound, thereby obtaining the convergence result.
6. The method of claim 1, wherein the step of training the deep neural network model to obtain the learning method comprises:
training with an iterative format in which, at each iteration, the server aggregates the freshly sparsified gradient information uploaded by the nodes in M_t together with the stale compressed gradient information retained for the nodes in M_t^c, and updates the model parameters with the learning rate γ.
7. The method of any one of claims 1 to 6, wherein after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further comprises:
performing the iteration in combination with the adaptive aggregation rule, using an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
8. An apparatus for learning based on adaptive aggregated sparse communication, the apparatus comprising:
the node determining module is used for obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
the sparse processing module is used for carrying out sparse processing on the target information corresponding to the target node;
the result acquisition module is used for calculating a convergence result according to a preset sequence and the Lyapunov function;
and the model training module is used for training the deep neural network model to obtain a learning method.
9. An adaptive aggregated sparse communication based learning device, the device comprising: a memory, a processor and an adaptive aggregated sparse communication based learning program stored on the memory and executable on the processor, the adaptive aggregated sparse communication based learning program being configured to implement the steps of the adaptive aggregated sparse communication based learning method of any one of claims 1 to 7.
10. A medium having stored thereon an adaptive aggregated sparse communication based learning program, which when executed by a processor implements the steps of the adaptive aggregated sparse communication based learning method of any one of claims 1 to 7.
CN202111470644.3A 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication Active CN114118381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111470644.3A CN114118381B (en) 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111470644.3A CN114118381B (en) 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication

Publications (2)

Publication Number Publication Date
CN114118381A true CN114118381A (en) 2022-03-01
CN114118381B CN114118381B (en) 2024-02-02

Family

ID=80366670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111470644.3A Active CN114118381B (en) 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication

Country Status (1)

Country Link
CN (1) CN114118381B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112424797A (en) * 2018-05-17 2021-02-26 弗劳恩霍夫应用研究促进协会 Concept for the transmission of distributed learning of neural networks and/or parametric updates thereof
US20200311539A1 (en) * 2019-03-28 2020-10-01 International Business Machines Corporation Cloud computing data compression for allreduce in deep learning
CN111784002A (en) * 2020-09-07 2020-10-16 腾讯科技(深圳)有限公司 Distributed data processing method, device, computer equipment and storage medium
CN112766502A (en) * 2021-02-27 2021-05-07 上海商汤智能科技有限公司 Neural network training method and device based on distributed communication and storage medium
CN113159287A (en) * 2021-04-16 2021-07-23 中山大学 Distributed deep learning method based on gradient sparsity
CN113315604A (en) * 2021-05-25 2021-08-27 电子科技大学 Adaptive gradient quantization method for federated learning
CN113467949A (en) * 2021-07-07 2021-10-01 河海大学 Gradient compression method for distributed DNN training in edge computing environment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341628A (en) * 2023-02-24 2023-06-27 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training
CN116341628B (en) * 2023-02-24 2024-02-13 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training

Also Published As

Publication number Publication date
CN114118381B (en) 2024-02-02


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant