CN114118381A - Learning method, device, equipment and medium based on adaptive aggregation sparse communication

Learning method, device, equipment and medium based on adaptive aggregation sparse communication

Info

Publication number
CN114118381A
CN114118381A
Authority
CN
China
Prior art keywords
adaptive
communication
sparse
target node
adaptive aggregation
Prior art date
Legal status
Granted
Application number
CN202111470644.3A
Other languages
Chinese (zh)
Other versions
CN114118381B (en)
Inventor
邓晓歌 (Deng Xiaoge)
李东升 (Li Dongsheng)
孙涛 (Sun Tao)
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202111470644.3A
Publication of CN114118381A
Application granted
Publication of CN114118381B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N20/00 - Machine learning
    • G06N3/08 - Learning methods


Abstract

The invention relates to the field of distributed learning and discloses a learning method, device, equipment and medium based on adaptive aggregation sparse communication. An adaptive aggregation rule is obtained, and a target node is determined according to the adaptive aggregation rule; sparse processing is performed on the target information corresponding to the target node; a convergence result is calculated by combining a preset sequence with a Lyapunov function; and a deep neural network model is trained to obtain the learning method. Some communication rounds are skipped adaptively through the adaptive selection rule, and the number of communication bits is further reduced by sparsifying the transmitted information. To handle the bias of the top-k sparsification operator, an error feedback scheme is used in the algorithm, thereby achieving the technical effect of fully utilizing the computing power of the distributed cluster.

Description

Learning method, device, equipment and medium based on adaptive aggregation sparse communication
Technical Field
The present application relates to the field of distributed learning, and in particular, to a learning method, apparatus, device, and medium based on adaptive aggregation sparse communication.
Background
Stochastic optimization algorithms implemented on distributed computing architectures are increasingly used to handle large-scale machine learning problems. A key bottleneck in such systems is the communication overhead of exchanging information, such as stochastic gradients, between different nodes. Sparse communication with memory and adaptive aggregation are two classes of techniques that have been proposed to address this problem. Intuitively, training a task collaboratively on multiple processors should speed up the training process and reduce training time. However, the cost of communication between processors often hinders the scalability of distributed systems. Worse, when the ratio of computation to communication is low, the performance of multiple processors may even be lower than that of a single processor.
Therefore, how to fully utilize the computing power of a distributed cluster has become an urgent technical problem.
The above is only for the purpose of assisting understanding of the technical aspects of the present invention, and does not represent an admission that the above is prior art.
Disclosure of Invention
The invention mainly aims to provide a learning method, device, equipment and medium based on adaptive aggregation sparse communication, so as to solve the problem in the prior art that the computing power of a distributed cluster cannot be fully utilized.
In order to achieve the above object, the present invention provides a learning method based on adaptive aggregation sparse communication, including:
obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
performing sparse processing on target information corresponding to the target node;
calculating a convergence result by combining a preset sequence with a Lyapunov function;
and training a deep neural network model to obtain the learning method.
Optionally, the step of obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule includes:
acquiring a preset adaptive aggregation rule;
dividing all nodes communicating with the server into two disjoint sets M_t and its complement M_t^c according to the adaptive aggregation rule;
when the t-th iteration is detected, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|, thereby determining the target node.
Optionally, the step of performing sparse processing on the target information corresponding to the target node includes:
selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that the zero elements need not be communicated.
Optionally, after the step of selecting the top-k gradient components of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero so that the zero elements need not be communicated, the method further includes:
using an error feedback technique that carries the error produced by sparsification into the next iteration to ensure convergence;
defining an auxiliary sequence {v^t} from the iterates and the accumulated sparsification errors, where e_m^t is the error at the t-th iteration on node m.
Optionally, the step of calculating a convergence result by combining a preset sequence with a Lyapunov function includes:
denoting a Lyapunov function of the preset sequence, and selecting the learning rate in terms of a constant c_γ > 0, which gives a convergence bound for the algorithm, thereby obtaining the convergence result.
Optionally, the step of training the deep neural network model to obtain the learning method includes:
training with an iterative format in which, at each iteration, the server aggregates the freshly sparsified gradient information uploaded by the nodes in M_t together with the stale compressed gradient information retained for the nodes in M_t^c, and updates the model parameters with the learning rate γ.
Optionally, after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further includes:
performing the iteration in combination with the adaptive aggregation rule, using an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
In addition, to achieve the above object, the present invention further provides an adaptive aggregation sparse communication based learning apparatus, including:
the node determining module is used for obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
the sparse processing module is used for carrying out sparse processing on the target information corresponding to the target node;
the result acquisition module is used for calculating a convergence result according to a preset sequence and the Lyapunov function;
and the model training module is used for training the deep neural network model to obtain a learning method.
In addition, to achieve the above object, the present invention also provides a computer device, including: a memory, a processor, and a learning program based on adaptive aggregation sparse communication stored on the memory and operable on the processor, wherein the learning program based on adaptive aggregation sparse communication is configured to implement the learning method based on adaptive aggregation sparse communication as described above.
Furthermore, to achieve the above object, the present invention further proposes a medium having stored thereon an adaptive aggregated sparse communication based learning program, which when executed by a processor implements the steps of the adaptive aggregated sparse communication based learning method as described above.
The method includes obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule; performing sparse processing on the target information corresponding to the target node; calculating a convergence result by combining a preset sequence with a Lyapunov function; and training a deep neural network model to obtain the learning method. Some communication rounds are skipped adaptively through the adaptive selection rule, and the number of communication bits is further reduced by sparsifying the transmitted information. To handle the bias of the top-k sparsification operator, an error feedback scheme is used in the algorithm, thereby achieving the technical effect of fully utilizing the computing power of a distributed cluster.
Drawings
FIG. 1 is a schematic structural diagram of a learning device based on adaptive aggregation sparse communication in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of the learning method based on adaptive aggregation sparse communication according to the present invention;
fig. 3 is a comparison diagram of four algorithms based on adaptive aggregation sparse communication according to an embodiment of the present invention.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a learning device based on adaptive aggregation sparse communication in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the learning device based on adaptive aggregation sparse communication may include: a processor 1001, such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (WI-FI) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the architecture shown in fig. 1 does not constitute a limitation of an adaptive aggregate sparse communication based learning device and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 1, the memory 1005, which is a storage medium, may include therein an operating system, a data storage module, a network communication module, a user interface module, and a learning program based on adaptive aggregation sparse communication.
In the learning apparatus based on adaptive aggregation sparse communication shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user; the learning apparatus calls, through the processor 1001, the learning program based on adaptive aggregation sparse communication stored in the memory 1005 and executes the learning method based on adaptive aggregation sparse communication provided by the present invention.
The embodiment of the invention provides a learning method based on adaptive aggregation sparse communication, and referring to fig. 2, fig. 2 is a schematic flow diagram of a first embodiment of the learning method based on adaptive aggregation sparse communication.
In this embodiment, the learning method based on adaptive aggregation sparse communication includes the following steps:
step S10: and acquiring a self-adaptive aggregation rule and determining a target node according to the self-adaptive aggregation rule. A
It should be noted that over the past decades, machine learning (ML) models and datasets have grown significantly in size and complexity, resulting in higher computational intensity and therefore a more time-consuming training process. This has driven the development of distributed training, which uses multiple processors for acceleration. A large number of distributed machine learning tasks can be described as

min_{ω∈R^d} f(ω) := Σ_{m∈M} f_m(ω),

where ω is the parameter to be learned, d is the dimension of the parameter, M = {1, ..., M} represents the set of distributed nodes, f_m is the smooth loss function (not necessarily convex) at node m, and ξ_m is an independent random data sample associated with the probability distribution D_m.
It is to be understood that, for simplicity, the local objective at node m is defined as f_m(ω) := E_{ξ_m~D_m}[f(ω; ξ_m)]. In a specific implementation, the stochastic gradient descent (SGD) algorithm is the workhorse for solving this problem, with the iteration format

ω_{t+1} = ω_t − γ Σ_{m∈M} ∇f(ω_t; ξ_m^t),

where γ is the learning rate and ξ_m^t is the mini-batch of data selected by node m at the t-th iteration.
Further, the step of obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule includes: acquiring a preset adaptive aggregation rule; dividing all nodes communicating with the server into two disjoint sets M_t and its complement M_t^c according to the adaptive aggregation rule; and, when the t-th iteration is detected, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|, thereby determining the target node.
Further, after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further includes:
performing the iteration in combination with the adaptive aggregation rule, using an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
It should be noted that reducing the number of communication rounds is important for improving communication efficiency. Higher-order information (Newton-type methods) can be used in place of traditional gradient information to reduce the number of communication rounds, and a distributed preconditioned accelerated gradient method has been proposed for the same purpose. Many new aggregation techniques, such as periodic aggregation and adaptive aggregation, have also been developed to skip certain communications: each node is allowed to perform local model updates independently, and the resulting models are averaged periodically. The lazily aggregated gradient (LAG) method updates the model on the server side, and a node uploads its information adaptively only when that information is sufficiently informative. Unfortunately, while LAG performs well in the deterministic setting (i.e., with full gradients), its performance drops significantly in the stochastic setting. More recent work has adapted the aggregation rules to the stochastic setting: the communication-censored distributed stochastic gradient descent algorithm (CSGD) increases the batch size to mitigate the effect of stochastic gradient noise, and the lazily aggregated stochastic gradient algorithm (LASG) designs a set of new adaptive communication rules tailored to stochastic gradients and achieves good experimental results.
In a specific implementation, we present an efficient communication algorithm, SASG, that combines sparse communication with adaptively aggregated stochastic gradients. Our SASG approach saves both the number of communication bits and the number of communication rounds without sacrificing the required convergence properties. Considering that, in a distributed learning system, not all communication rounds between the server and the nodes are equally important, we can adjust the communication frequency between a node and the server according to the importance of the information the node transmits. More specifically, to reduce the number of communication rounds, an adaptive selection rule is established to divide the node set M into two disjoint sets M_t and M_t^c. At the t-th iteration, we use only the new gradient information of the selected nodes in M_t and reuse the compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|. On the other hand, quantization methods can reach at most a 32-fold compression ratio under common single-precision floating-point arithmetic, so a more effective sparsification method is adopted in the algorithm. Specifically, we select the top-k gradient components (in terms of absolute value) at each iteration and set the remaining gradient components to zero, so that the zero-valued elements need not be communicated, which significantly reduces the number of communicated bits.
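A minimal sketch of the top-k operator described here is given below; the helper name and the example tensor are illustrative, the 1% ratio matches the experimental setting reported later, and in practice only the kept values and their indices would actually be transmitted.

```python
# Keep the k largest-magnitude gradient components, zero the rest (illustrative sketch).
import torch

def top_k(vector: torch.Tensor, k: int) -> torch.Tensor:
    sparse = torch.zeros_like(vector)
    _, idx = torch.topk(vector.abs(), k)
    sparse[idx] = vector[idx]            # only these k (index, value) pairs need be sent
    return sparse

g = torch.randn(10_000)                  # a flattened stochastic gradient
g_sparse = top_k(g, k=len(g) // 100)     # top-1%: 100 nonzeros out of 10,000
```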
In a specific implementation, we note that the communication rounds between the server and the nodes do not all contribute equally in a distributed learning system, so we use an adaptive aggregation approach to develop aggregation rules that can skip inefficient communication rounds. This adaptive aggregation method, derived from the lazily aggregated gradient (LAG) method, adaptively detects nodes whose gradients change little and reuses their old gradients. Combining such adaptive aggregation rules yields an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
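The following sketch illustrates the spirit of such a rule: a node is placed in M_t (and communicates) only when its fresh gradient differs sufficiently from the stale message the server already holds. The squared-norm threshold test used here is a simplified stand-in, not the exact selection condition of the patent, and all names are illustrative.

```python
# Simplified lazy-aggregation rule: decide which nodes communicate this round.
import torch

def split_working_sets(new_grads, stale_grads, threshold):
    """Return (M_t, M_t_c): indices of nodes that do / do not upload fresh gradients."""
    M_t, M_t_c = [], []
    for m, (g_new, g_old) in enumerate(zip(new_grads, stale_grads)):
        if torch.sum((g_new - g_old) ** 2) > threshold:
            M_t.append(m)       # large change: upload fresh (sparsified) information
        else:
            M_t_c.append(m)     # small change: the server reuses the old message
    return M_t, M_t_c
```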
Step S20: performing sparse processing on the target information corresponding to the target node.
It should be noted that this line of research has mainly developed around the ideas of quantization and sparsification. Quantization methods compress information by transmitting lower-precision representations instead of the original 32-bit data. The quantized stochastic gradient descent algorithm (QSGD) provides additional flexibility to control the trade-off between per-iteration communication cost and convergence speed through an adjustable quantization level. Ternary gradients reduce the communication data size by reducing each gradient component to its sign bit (one bit). Sparsification methods, in contrast, aim to reduce the number of elements transmitted per iteration and can be divided into two broad categories: stochastic sparsification and deterministic sparsification. Stochastic sparsification randomly selects some components for communication; this method is named random-k, where k denotes the number of selected components. Such random selection is usually an unbiased estimate of the original gradient, which makes it friendly to theoretical analysis. Unlike stochastic sparsification, deterministic sparsification considers the magnitude of each component and retains only the k components of the stochastic gradient with the largest magnitudes; this method is called top-k. Compared with the unbiased scheme, this method clearly requires an error feedback or accumulation procedure to ensure that all gradient information is eventually added to the model, albeit with some delay.
In a specific implementation, after the adaptive selection process, the selected nodes send sparse information derived by the top-k operator to the parameter server.
Further, the step of performing sparse processing on the target information corresponding to the target node includes: selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that the zero elements need not be communicated.
Further, after the step of selecting the top-k gradient components of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero so that the zero elements need not be communicated, the method further includes: using an error feedback technique that carries the error produced by sparsification into the next iteration to ensure convergence; and defining an auxiliary sequence {v^t} from the iterates and the accumulated sparsification errors, where e_m^t is the error at the t-th iteration on node m.
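A minimal sketch of the error feedback step on one node is shown below: the residual discarded by the top-k compressor is stored as the error e_m^t and folded back into the next gradient before compression. The helper names are illustrative.

```python
# Error feedback around a biased top-k compressor (illustrative sketch).
import torch

def top_k(v, k):
    out = torch.zeros_like(v)
    _, idx = torch.topk(v.abs(), k)
    out[idx] = v[idx]
    return out

def compress_with_error_feedback(grad, error, k):
    corrected = grad + error            # fold in the residual e_m^t from the last round
    message = top_k(corrected, k)       # sparse vector actually transmitted
    new_error = corrected - message     # residual e_m^{t+1} kept locally for next time
    return message, new_error
```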
Step S30: calculating a convergence result by combining a preset sequence with a Lyapunov function.
It should be noted that, in this embodiment, a biased top-k sparsification operator is applied, and the compression error it introduces makes the convergence analysis more complicated. We define an auxiliary sequence {v^t}_{t=0,1,...}, which can be viewed as an approximation of the iterate sequence {ω^t}_{t=0,1,...}. By analyzing this sequence, we obtain the convergence result of the SASG algorithm, and its convergence rate matches that of the original SGD.
Further, the step of calculating the convergence result by combining the preset sequence with the Lyapunov function includes: denoting a Lyapunov function of the preset sequence, and selecting the learning rate in terms of a constant c_γ > 0, which gives a convergence bound for the algorithm, thereby obtaining the convergence result.
It will be appreciated that our algorithm guarantees convergence and achieves a sub-linear convergence rate despite skipping many communication rounds and performing communication compression. In other words, the SASG algorithm uses well-designed adaptive aggregation rules and sparse communication techniques, and still achieves the same order of convergence speed as the SGD method.
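The exact bound is the one given by the formula in the original filing; purely as an illustration of the standard form such guarantees take in nonconvex error-feedback SGD analyses, a sublinear result typically reads as follows (a sketch, not the patent's exact statement):

```latex
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\left\|\nabla f(\omega_t)\right\|^2
\;\le\; \mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right),
\qquad \text{with learning rate } \gamma=\frac{c_\gamma}{\sqrt{T}},\ c_\gamma>0 .
```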
Step S40: training the deep neural network model to obtain the learning method.
Further, the step of training the deep neural network model to obtain the learning method includes:
training with the SASG iterative format, in which, at each iteration, the server aggregates the freshly sparsified gradient information uploaded by the nodes in M_t together with the stale compressed gradient information retained for the nodes in M_t^c, and updates the model parameters with the learning rate γ.
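Putting the pieces together, the sketch below shows one server-side update of this kind: fresh sparsified messages from the nodes in M_t replace their stored copies, the stored (possibly stale) messages of all nodes are summed, and the parameters are updated with the learning rate γ. All names are illustrative, and the aggregation is a simplified rendering of the iterative format rather than its exact formula.

```python
# One simplified SASG-style server update (illustrative sketch).
import torch

def sasg_server_step(params, lr, M_t, fresh_msgs, stale_msgs):
    """fresh_msgs: dict {m: sparse grad} for m in M_t; stale_msgs: list over all nodes."""
    agg = torch.zeros_like(params)
    for m in range(len(stale_msgs)):
        if m in M_t:
            stale_msgs[m] = fresh_msgs[m]   # node m communicated: refresh its message
        agg += stale_msgs[m]                # skipped nodes contribute their old message
    return params - lr * agg                # ω_{t+1} = ω_t − γ · aggregate
```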
in a specific implementation, the SASG algorithm is benchmark tested using an inert aggregated random gradient (LASG) method, a sparsification method, and a distributed SGD. Experience has shown that up to 99% of the gradient information is not necessary in each iteration, so we use the top-1% sparsification operator in the SASG algorithm and sparsification method. In all experiments, the training data was distributed among 10 nodes, each node using 10 samples for one training iteration. We completed the evaluation under the following three settings, and each experiment was repeated five times. MNIST data set contains 70,000 handwritten digits in 10 categories, with 60,000 examples in the training set and 10,000 examples in the test set. We consider a two-layer fully-connected (FC) neural network model, the second layer having 512 neurons for class 10 classification on MNIST. For all algorithms, we choose the learning rate γ to be 0.005. For the adaptive aggregation algorithms SASG and LASG, we set D-10, α D-1/2 γ, D-1, 2. CIFAR-10[39 ]]The data set consisted of 60,000 color images in 10 categories, each with 6,000 images. We tested the ResNet18 model using all of the algorithms described above on the CIFAR-10 dataset. The experiment performed common data enhancement techniques such as random cropping, random flipping, and normalization. The basic learning rate was set to γ of 0.01, and the learning rate was attenuated to 0.001 at the 20 th batch. For SASG and LASG, we set D to 10, α D to 1/γ, D to 1,2,...,10. CIFAR-100 data set contains 60,000 color images in 100 categories, 600 images in each category. Each category has 500 training images and 100 test images. We tested the VGG16 model [41 ] on the CIFAR-100 dataset]. This experiment performed similar data enhancement techniques. The basic learning rate was set to γ of 0.01, and the learning rate was attenuated to 0.001 at the 30 th batch. For SASG and LASG, we set D10, α D4/D/γ 21,2, 10. Our experimental results were based on PyTorch implementation of all methods run on a Ubuntu 20.04 machine equipped with an Nvidia RTX-2080Ti GPU.
Accordingly, the number of communication bits required by the different algorithms to reach the same baseline can be obtained by counting the number of parameters of the different models. The last column of FIG. 3 shows that the SASG algorithm, which combines the adaptive aggregation technique with sparse communication, significantly reduces the number of communication bits required for the model to achieve the same performance, far outperforming the LASG and sparsification algorithms.
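For intuition on how such bit counts are obtained, the sketch below compares a dense float32 gradient (32 bits per parameter) with a top-1% sparse message (a value plus an index per kept component) for a model of roughly ResNet18 size; the accounting is a simplified illustration, not the exact counting used for FIG. 3.

```python
# Rough per-round communication cost: dense gradient vs. top-1% sparse message.
import math

def dense_bits(num_params: int) -> int:
    return 32 * num_params                          # one float32 per parameter

def topk_bits(num_params: int, k: int) -> int:
    index_bits = math.ceil(math.log2(num_params))   # bits to address one component
    return k * (32 + index_bits)                    # value + index per kept component

n = 11_000_000                                      # roughly the size of ResNet18
print(dense_bits(n), topk_bits(n, k=n // 100))
```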
In this embodiment, an adaptive aggregation rule is obtained and a target node is determined according to the adaptive aggregation rule; sparse processing is performed on the target information corresponding to the target node; a convergence result is calculated by combining a preset sequence with a Lyapunov function; and a deep neural network model is trained to obtain the learning method. Some communication rounds are skipped adaptively through the adaptive selection rule, and the number of communication bits is further reduced by sparsifying the transmitted information. To handle the bias of the top-k sparsification operator, an error feedback scheme is used in the algorithm, thereby achieving the technical effect of fully utilizing the computing power of a distributed cluster.
Furthermore, an embodiment of the present invention further provides a medium, where the medium stores an adaptive aggregation sparse communication based learning program, and the adaptive aggregation sparse communication based learning program, when executed by a processor, implements the steps of the adaptive aggregation sparse communication based learning method as described above.
Other embodiments or specific implementation manners of the learning device based on adaptive aggregation sparse communication according to the present invention may refer to the above method embodiments, and are not described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases the former is the better implementation. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A learning method based on adaptive aggregation sparse communication is characterized by comprising the following steps:
obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
performing sparse processing on target information corresponding to the target node;
calculating a convergence result by combining a preset sequence with a Lyapunov function;
and training a deep neural network model to obtain the learning method.
2. The method of claim 1, wherein the step of obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule comprises:
acquiring a preset adaptive aggregation rule;
dividing all nodes communicating with the server into two disjoint sets M_t and its complement M_t^c according to the adaptive aggregation rule;
when the t-th iteration is detected, using the new gradient information of the nodes in M_t while reusing the old compressed gradient information of the nodes in M_t^c, so that the number of communication rounds per iteration is reduced from M to |M_t|, thereby determining the target node.
3. The method of claim 1, wherein the step of sparsifying target information corresponding to the target node comprises:
selecting the top-k gradient components (by absolute value) of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero, so that the zero elements need not be communicated.
4. The method of claim 3, wherein after the step of selecting the top-k gradient components of the target information corresponding to the target node at each iteration and setting the remaining gradient components to zero so that the zero elements need not be communicated, the method further comprises:
using an error feedback technique that carries the error produced by sparsification into the next iteration to ensure convergence;
defining an auxiliary sequence {v^t} from the iterates and the accumulated sparsification errors, wherein e_m^t is the error at the t-th iteration on node m.
5. The method of claim 1, wherein the step of calculating the convergence result by combining the preset sequence with the Lyapunov function comprises:
denoting a Lyapunov function of the preset sequence, and selecting the learning rate in terms of a constant c_γ > 0, which gives a convergence bound, thereby obtaining the convergence result.
6. The method of claim 1, wherein the step of training the deep neural network model to obtain the learning method comprises:
training with an iterative format in which, at each iteration, the server aggregates the freshly sparsified gradient information uploaded by the nodes in M_t together with the stale compressed gradient information retained for the nodes in M_t^c, and updates the model parameters with the learning rate γ.
7. The method of any one of claims 1 to 6, wherein after the step of obtaining the adaptive aggregation rule and determining the target node according to the adaptive aggregation rule, the method further comprises:
performing the iteration in combination with the adaptive aggregation rule, using an iterative format in which M_t and M_t^c are the working sets that do and do not communicate with the server in the t-th iteration, respectively.
8. An apparatus for learning based on adaptive aggregated sparse communication, the apparatus comprising:
the node determining module is used for obtaining an adaptive aggregation rule and determining a target node according to the adaptive aggregation rule;
the sparse processing module is used for carrying out sparse processing on the target information corresponding to the target node;
the result acquisition module is used for calculating a convergence result according to a preset sequence and the Lyapunov function;
and the model training module is used for training the deep neural network model to obtain a learning method.
9. An adaptive aggregated sparse communication based learning device, the device comprising: a memory, a processor and an adaptive aggregated sparse communication based learning program stored on the memory and executable on the processor, the adaptive aggregated sparse communication based learning program being configured to implement the steps of the adaptive aggregated sparse communication based learning method of any one of claims 1 to 7.
10. A medium having stored thereon an adaptive aggregated sparse communication based learning program, which when executed by a processor implements the steps of the adaptive aggregated sparse communication based learning method of any one of claims 1 to 7.
CN202111470644.3A 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication Active CN114118381B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111470644.3A CN114118381B (en) 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111470644.3A CN114118381B (en) 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication

Publications (2)

Publication Number Publication Date
CN114118381A true CN114118381A (en) 2022-03-01
CN114118381B CN114118381B (en) 2024-02-02

Family

ID=80366670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111470644.3A Active CN114118381B (en) 2021-12-03 2021-12-03 Learning method, device, equipment and medium based on self-adaptive aggregation sparse communication

Country Status (1)

Country Link
CN (1) CN114118381B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112424797A (en) * 2018-05-17 2021-02-26 弗劳恩霍夫应用研究促进协会 Concept for the transmission of distributed learning of neural networks and/or parametric updates thereof
US20200311539A1 (en) * 2019-03-28 2020-10-01 International Business Machines Corporation Cloud computing data compression for allreduce in deep learning
CN111784002A (en) * 2020-09-07 2020-10-16 腾讯科技(深圳)有限公司 Distributed data processing method, device, computer equipment and storage medium
CN112766502A (en) * 2021-02-27 2021-05-07 上海商汤智能科技有限公司 Neural network training method and device based on distributed communication and storage medium
CN113159287A (en) * 2021-04-16 2021-07-23 中山大学 Distributed deep learning method based on gradient sparsity
CN113315604A (en) * 2021-05-25 2021-08-27 电子科技大学 Adaptive gradient quantization method for federated learning
CN113467949A (en) * 2021-07-07 2021-10-01 河海大学 Gradient compression method for distributed DNN training in edge computing environment

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116341628A (en) * 2023-02-24 2023-06-27 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training
CN116341628B (en) * 2023-02-24 2024-02-13 北京大学长沙计算与数字经济研究院 Gradient sparsification method, system, equipment and storage medium for distributed training

Also Published As

Publication number Publication date
CN114118381B (en) 2024-02-02


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant