CN112001501A - Parameter updating method, device and equipment of AI distributed training system - Google Patents

Parameter updating method, device and equipment of AI distributed training system

Info

Publication number
CN112001501A
CN112001501A (application CN202010820131.XA)
Authority
CN
China
Prior art keywords
target
worker node
node
updating
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010820131.XA
Other languages
Chinese (zh)
Other versions
CN112001501B (en)
Inventor
郭振华
范宝余
曹芳
赵雅倩
李仁刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202010820131.XA priority Critical patent/CN112001501B/en
Publication of CN112001501A publication Critical patent/CN112001501A/en
Application granted granted Critical
Publication of CN112001501B publication Critical patent/CN112001501B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network

Abstract

The application discloses a parameter updating method of an AI distributed training system, which comprises the following steps: starting a training task of an AI algorithm model on a target worker node of a distributed heterogeneous system, controlling the node to load model parameters, randomly selecting sample data of the kth iterative training for the node, carrying out a gradient update on the model parameters, randomly creating a target node set, carrying out a non-zero value update on an adjacency matrix by using the set, and updating the model parameters on each node in the set by using the updated adjacency matrix; and when the kth iterative training is finished, if the AI algorithm model has converged, repeating the iterative training on the node until the node finishes M iterations of training, at which point the distributed heterogeneous system is judged to have finished the AI acceleration task. By the method, the requirement of each worker node in the distributed computing cluster on communication bandwidth during parameter synchronization can be reduced while a mixed heterogeneous distributed computing environment is supported.

Description

Parameter updating method, device and equipment of AI distributed training system
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for updating parameters of an AI distributed training system.
Background
In practical applications, a distributed cluster is often used to accelerate the training task of an AI (Artificial Intelligence) algorithm model. When a plurality of worker nodes in the distributed cluster are used to perform data-parallel training on the AI algorithm model, the same AI algorithm model is first deployed on each worker node, and the labeled training data are processed in batch-wise iterations. In each iteration, a batch of training data is divided into N micro-batches according to the number of worker nodes, the N micro-batches of training data are then distributed to different worker nodes for model training, and finally, after all the worker nodes complete training on their micro-batches of training data, the model parameters on each worker node are synchronously updated.
At present, the parameter synchronization methods of an AI distributed training system mainly comprise the Parameter-Server algorithm, the All-Reduce algorithm and the Ring-All-Reduce algorithm. All three parameter synchronization methods are oriented to homogeneous distributed computing environments: all worker nodes in the distributed computing system are required to be equipped with exactly the same computing devices, and all worker nodes are required to use communication links with the same bandwidth, so that the linear speedup of the whole distributed computing system can be improved. However, in practical applications, various new computing devices are continuously added to the worker nodes of the AI distributed training system, and in this case, if the three algorithms are used to update parameters in the AI distributed training system, the overall performance of the AI distributed training system is limited by the worker node with the slowest computing performance or the communication link with the slowest transmission in the distributed computing environment. At present, there is no effective solution to this technical problem.
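As a purely illustrative sketch of the synchronization step that these prior algorithms share (the function name, shapes, and averaging-based reduction are assumptions for illustration, not taken from the patent), the following shows why every worker must exchange its full parameter vector in each iteration, so that the slowest worker or slowest link gates the whole step:

```python
import numpy as np

def allreduce_average(worker_params):
    """Synchronous parameter averaging across ALL workers (All-Reduce style).

    Each worker contributes its full parameter vector every iteration, so the
    step cannot finish before the slowest worker or slowest link does.
    """
    stacked = np.stack(worker_params)             # shape: (N, T)
    mean = stacked.mean(axis=0)                   # global average over all workers
    return [mean.copy() for _ in worker_params]   # every worker receives the same result

# Toy usage: 4 workers, 5 parameters each
params = [np.random.randn(5) for _ in range(4)]
params = allreduce_average(params)
```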
Therefore, how to enable the AI distributed training system to support a mixed heterogeneous distributed computing environment while reducing the requirement of each worker node in the distributed computing cluster on communication bandwidth during parameter synchronization is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, an object of the present invention is to provide a method, an apparatus, a device, and a medium for updating parameters of an AI distributed training system, so that the AI distributed training system can support a hybrid heterogeneous distributed computing environment and can also reduce a requirement for a communication bandwidth when each worker node in a distributed computing cluster performs parameter synchronization. The specific scheme is as follows:
a parameter updating method of an AI distributed training system comprises the following steps:
when the distributed heterogeneous system needs to finish an AI acceleration task, starting a training task of an AI algorithm model on a target worker node of the distributed heterogeneous system, and initializing model parameters and an adjacent matrix of the target worker node; an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system;
controlling the target worker node to load preset target model parameters, and randomly selecting sample data of a kth iterative training for the target worker node; wherein k is more than or equal to 1;
gradient updating is carried out on the target model parameters based on the sample data, and a target node set is randomly created for the target worker nodes;
updating a non-zero value of the adjacent matrix by using the target node set to obtain an updated adjacent matrix, and updating model parameters on each worker node in the target node set by using the updated adjacent matrix;
when the target worker node completes the kth iterative training, judging whether an AI algorithm model on the target worker node converges;
if yes, the target worker node is judged to have completed the kth AI algorithm model training task, and the step of controlling the target worker node to load preset target model parameters is repeatedly executed until the target worker node completes M iterations of training, at which point the distributed heterogeneous system is judged to have completed the AI acceleration task; wherein M is more than or equal to 2, and M is the preset number of iterations.
Preferably, the method further comprises the following steps:
and building the distributed heterogeneous system by using a plurality of worker nodes provided with the GPU and/or AI chips and/or the FPGA.
Preferably, after the process of determining whether the AI algorithm model on the target worker node converges, the method further includes:
and if not, re-executing the step of controlling the target worker node to load preset target model parameters and randomly selecting sample data of the kth iterative training for the target worker node.
Preferably, the process of initializing the model parameters of the target worker node and the adjacency matrix includes:
and initializing the adjacent matrix of the target worker node to be a zero matrix E.
Preferably, the process of updating the adjacency matrix with a non-zero value by using the target node set to obtain an updated adjacency matrix includes:
carrying out non-zero value updating on the zero matrix E by utilizing a node set g to obtain an updated adjacent matrix;
wherein the expression of updating the adjacency matrix is as follows:
E_g^k(i, j) = 1/l,  if v_i ∈ g and v_j ∈ g
E_g^k(i, j) = 1,    if i = j = u and v_u ∉ g
E_g^k(i, j) = 0,    otherwise
in the formula, g is the node set, l is the number of worker nodes in g, i and j are respectively the row index and the column index in the zero matrix E, and u denotes a row-column index at which i and j are the same (a diagonal position) for a node outside g.
Preferably, the process of updating the model parameters on each worker node in the target node set by using the updated adjacency matrix includes:
updating model parameters on each worker node in the target node set by using the updated adjacency matrix to obtain updated parameters;
wherein the expression of the update parameter is:
X_g^{k+1} = X^k · E_g^k
in the formula, X_g^{k+1} is the parameter matrix updated based on the parameters of the node set g in the kth iteration, and E_g^k is the adjacency matrix updated based on the node set g in the kth iteration.
Correspondingly, the invention also discloses a parameter updating device of the AI distributed training system, which comprises:
the model initialization module is used for starting a training task of an AI algorithm model on a target worker node of the distributed heterogeneous system when the distributed heterogeneous system needs to finish an AI acceleration task, and initializing model parameters and an adjacent matrix of the target worker node; an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system;
the sample selection module is used for controlling the target worker node to load preset target model parameters and randomly selecting sample data of the kth iterative training for the target worker node; wherein k is more than or equal to 1;
the gradient updating module is used for carrying out gradient updating on the target model parameters based on the sample data and randomly creating a target node set for the target worker node;
the parameter updating module is used for updating a nonzero value of the adjacent matrix by using the target node set to obtain an updated adjacent matrix and updating model parameters on each worker node in the target node set by using the updated adjacent matrix;
the convergence judging module is used for judging whether an AI algorithm model on the target worker node converges or not when the target worker node completes the kth iterative training;
the task training module is used for judging that the target worker node completes the kth AI algorithm model training task when the judgment result of the convergence judgment module is yes, repeatedly executing the step of controlling the target worker node to load preset target model parameters until the target worker node completes M iterations of training, and judging that the distributed heterogeneous system completes the AI acceleration task; wherein M is more than or equal to 2, and M is the preset number of iterations.
Correspondingly, the invention also discloses a parameter updating device of the AI distributed training system, which comprises:
a memory for storing a computer program;
a processor for implementing a parameter updating method of an AI distributed training system as disclosed in the foregoing when executing the computer program.
Accordingly, the present invention also discloses a computer readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the parameter updating method of the AI distributed training system as disclosed in the foregoing.
Therefore, in the present invention, when the distributed heterogeneous system needs to complete an AI acceleration task, an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system, and the model parameters and the adjacency matrix of a target worker node are initialized. The target worker node is then controlled to load preset target model parameters, and sample data of the kth iterative training is randomly selected for the target worker node. Next, a gradient update is performed on the target model parameters based on the sample data, and a target node set is randomly created for the target worker node; after the target node set is created, it is used to perform a non-zero value update on the adjacency matrix to obtain an updated adjacency matrix, and the model parameters on each worker node in the target node set are updated by using the updated adjacency matrix. When the target worker node completes the kth iterative training, whether the AI algorithm model on the target worker node has converged is judged; if the AI algorithm model on the target worker node has converged, the target worker node has completed the kth AI algorithm model training task. Finally, the step of controlling the target worker node to load preset target model parameters is repeatedly executed until the target worker node finishes M iterations of training, which indicates that the distributed heterogeneous system has finished the AI acceleration task. Obviously, in the parameter synchronization method provided by the invention, because the AI acceleration task is performed on a distributed heterogeneous system, and because, during parameter synchronization of the AI model, the target worker node randomly creates a target node set and only updates the parameters of the randomly selected worker nodes, the method enables the AI distributed training system to support a mixed heterogeneous distributed computing environment while reducing the requirement of each worker node in the distributed computing cluster on communication bandwidth during parameter synchronization. Correspondingly, the parameter updating device, equipment and medium of the AI distributed training system provided by the invention have the same beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a parameter updating method of an AI distributed training system according to an embodiment of the present invention;
fig. 2 is a structural diagram of a distributed heterogeneous system according to an embodiment of the present invention;
fig. 3 is a structural diagram of a parameter updating apparatus of an AI distributed training system according to an embodiment of the present invention;
fig. 4 is a structural diagram of a parameter updating device of an AI distributed training system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a parameter updating method of an AI distributed training system according to an embodiment of the present invention, where the parameter updating method includes:
step S11: when the distributed heterogeneous system needs to finish an AI acceleration task, starting a training task of an AI algorithm model on a target worker node of the distributed heterogeneous system, and initializing model parameters and an adjacent matrix of the target worker node;
an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system;
in this embodiment, a novel parameter updating method for an AI distributed training system is provided, and the method can also reduce the requirement on communication bandwidth when parameter synchronization is performed on each worker node in a distributed computing cluster while supporting a heterogeneous distributed computer environment.
Specifically, in this embodiment, an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system, and when the distributed heterogeneous system needs to complete an AI acceleration task, a training task of the AI algorithm model is started on a target worker node of the distributed heterogeneous system, and model parameters of the target worker node and an adjacency matrix are initialized. It should be noted that the target worker node refers to any one of the computing nodes in the distributed heterogeneous system, and in this embodiment, the number of worker nodes in the distributed heterogeneous system is not limited.
Referring to fig. 2, fig. 2 is a structural diagram of a distributed heterogeneous system according to an embodiment of the present invention, in which there are 6 worker nodes. In practical application, the distributed heterogeneous system may be represented as G = {V, E}, where V represents the worker nodes in the distributed heterogeneous system and E represents the adjacency matrix between the worker nodes in the distributed heterogeneous system.
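A minimal sketch of this representation is given below, assuming the 6-node system of fig. 2; the variable names (num_nodes, adjacency, workers) are illustrative choices rather than anything fixed by the patent:

```python
import numpy as np

num_nodes = 6                                  # V: the six worker nodes of fig. 2
adjacency = np.zeros((num_nodes, num_nodes))   # E: adjacency matrix, initialized to the zero matrix
workers = [f"w{i}" for i in range(1, num_nodes + 1)]   # w1 ... w6
```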
Step S12: controlling a target worker node to load preset target model parameters, and randomly selecting sample data of a kth iterative training for the target worker node; wherein k is more than or equal to 1;
step S13: gradient updating is carried out on target model parameters based on sample data, and a target node set is randomly created for target worker nodes;
step S14: carrying out non-zero value updating on the adjacent matrix by using the target node set to obtain an updated adjacent matrix, and updating model parameters on each worker node in the target node set by using the updated adjacent matrix;
after initializing the model parameters of the target worker node and the adjacent matrix in the distributed heterogeneous system, controlling the target worker node to load preset target model parameters, randomly selecting sample data of kth iterative training for the target worker node, and when randomly selecting the sample data of the kth iterative training for the target worker node, performing gradient updating on the target model parameters by using the selected sample data and randomly creating a target node set for the target worker node.
In practical application, the target model parameter of a target worker node worker_i may be defined as x_i, where x_i contains T parameters; the model parameters of all worker nodes in the distributed heterogeneous system can then be spliced into a parameter matrix X = [x_1, x_2, x_3, ..., x_N] ∈ R^(T×N), where N denotes the number of worker nodes in the distributed heterogeneous system. Furthermore, ξ_i^k may be defined as the result of the random sampling process of training samples for the target worker node, where k denotes the kth iteration of the AI algorithm model and i denotes the id corresponding to the target worker node; that is, ξ_i^k is the set of sample data randomly selected for the ith worker node during the kth iteration. Assuming the sample data selected by the target worker node is ξ_i^k, the process of performing a gradient update on the target model parameters can be expressed as x_i^k ← x_i^k − γ·∇F(x_i^k; ξ_i^k), where γ is the learning rate and ∇F(x_i^k; ξ_i^k) is the stochastic gradient computed on ξ_i^k.
A target node set g = {w_1, w_2, ..., w_l} is then randomly created for the target worker node, a non-zero value update is performed on the adjacency matrix by using the target node set to obtain an updated adjacency matrix, and the model parameters on each worker node in the target node set are updated by using the updated adjacency matrix. Obviously, the purpose of this step is to let the target worker node know which model parameters on which worker nodes need to be updated.
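A sketch of this per-iteration step under the notation above follows; the gradient function, learning rate, and node-set size are illustrative assumptions, and the target worker is kept inside g in line with the worked example later in the description:

```python
import numpy as np

rng = np.random.default_rng()

def local_gradient_step(x_i, batch, grad_fn, lr=0.01):
    """One worker's update x_i^k <- x_i^k - lr * grad F(x_i^k; xi_i^k)."""
    return x_i - lr * grad_fn(x_i, batch)

def random_node_set(num_nodes, target_id, set_size):
    """Randomly create the target node set g = {w_1, ..., w_l} for the target worker.

    The target worker itself is included in g so that its own parameters
    take part in the subsequent update.
    """
    others = [n for n in range(num_nodes) if n != target_id]
    picked = rng.choice(others, size=set_size - 1, replace=False)
    return sorted([target_id, *picked.tolist()])
```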
Step S15: when the target worker node completes the kth iterative training, judging whether an AI algorithm model on the target worker node converges;
step S16: if yes, the target worker node is judged to complete the kth AI algorithm model training task, the step of controlling the target worker node to load preset target model parameters is repeatedly executed, and the distributed heterogeneous system is judged to complete the AI acceleration task until the target worker node completes M times of iterative training;
wherein M is more than or equal to 2, and M is the preset iteration frequency.
When the target worker node completes the kth iterative training, whether the AI algorithm model on the target worker node has converged is judged; if so, the target worker node has completed the kth iterative training process, and in this case, the process returns to step S12 until the target worker node completes M iterations of training, which indicates that the distributed heterogeneous system has completed the AI acceleration task.
It should be noted that, in this embodiment, the number M of iterations of training performed by the target worker node is determined by the training precision required of the AI algorithm model: the higher the required training precision, the larger the value of M should be set; the lower the required training precision, the smaller the value of M can be.
Compared with the prior art, in the embodiment, each iteration process only needs the parameter synchronization between the target worker node and the randomly selected worker node, and does not need the parameter synchronization with all the worker nodes, so that the requirement of each worker node in the distributed computing cluster on the communication bandwidth during the parameter synchronization can be reduced by the method provided by the embodiment. Moreover, the parameter updating method of the AI distributed training system provided by the embodiment is directed to a distributed heterogeneous system, and is not limited to a homogeneous distributed computing environment, so that the method provided by the embodiment can also improve the heterogeneity and the extensibility of the AI distributed training system.
In addition, the technical solution provided by this embodiment starts from improving the heterogeneity and extensibility of the distributed heterogeneous system, and randomly selects the parameter synchronization nodes on the basis of the All-Reduce algorithm. The high extensibility of the All-Reduce algorithm is therefore preserved, while the heterogeneous execution mode among the computing nodes avoids the defect that the overall performance of the AI distributed training system is limited by the short board of the worker node with the slowest computing performance in the distributed computing environment, as well as the defect that the communication pressure on all worker nodes is too large. The bandwidth bottleneck of the Parameter-Server algorithm is thereby broken through, and the problem that the Ring-All-Reduce algorithm cannot fully exploit the communication bandwidth because the parameter data blocks are too small is also avoided.
As can be seen, in this embodiment, when the distributed heterogeneous system needs to complete an AI acceleration task, an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system, and the model parameters and the adjacency matrix of a target worker node are initialized. The target worker node is then controlled to load preset target model parameters, and sample data of the kth iterative training is randomly selected for the target worker node. Next, a gradient update is performed on the target model parameters based on the sample data, and a target node set is randomly created for the target worker node; after the target node set is created, it is used to perform a non-zero value update on the adjacency matrix to obtain an updated adjacency matrix, and the model parameters on each worker node in the target node set are updated by using the updated adjacency matrix. When the target worker node completes the kth iterative training, whether the AI algorithm model on the target worker node has converged is judged; if the AI algorithm model on the target worker node has converged, the target worker node has completed the kth AI algorithm model training task. Finally, the step of controlling the target worker node to load preset target model parameters is repeatedly executed until the target worker node finishes M iterations of training, which indicates that the distributed heterogeneous system has finished the AI acceleration task. Obviously, in the parameter synchronization method provided in this embodiment, because the AI acceleration task is performed on a distributed heterogeneous system, and because, during parameter synchronization of the AI model, the target worker node randomly creates a target node set and only updates the parameters of the randomly selected worker nodes, the method enables the AI distributed training system to support a mixed heterogeneous distributed computing environment while also reducing the requirement of each worker node in the distributed computing cluster on communication bandwidth during parameter synchronization.
Based on the foregoing embodiment, this embodiment further describes and optimizes the technical solution, and as a preferred implementation, the parameter updating method of the AI distributed training system further includes:
and building a distributed heterogeneous system by using a plurality of worker nodes provided with a GPU and/or AI chips and/or an FPGA.
It can be understood that, in practical applications, GPUs (Graphics Processing Units), AI (Artificial Intelligence) chips and FPGAs (Field Programmable Gate Arrays) are relatively common new computing devices and are widely used in practice. Therefore, in this embodiment, the distributed heterogeneous system is built by using a plurality of worker nodes on which GPUs and/or AI chips and/or FPGAs are installed.
Obviously, the technical solution provided by this embodiment can make the parameter updating method of the AI distributed training system provided by this application more universal.
Based on the above embodiments, this embodiment further describes and optimizes the technical solution, and as a preferred implementation, the above steps: after the process of judging whether the AI algorithm model on the target worker node converges, the method further comprises the following steps:
and if not, re-executing the steps of controlling the target worker node to load preset target model parameters and randomly selecting sample data of the kth iterative training for the target worker node.
In this embodiment, when the target worker node completes the kth iterative training, if it is determined that the AI algorithm model on the target worker node does not converge, the step S12 needs to be executed again: and controlling the target worker node to load preset target model parameters, and randomly selecting sample data of the kth iterative training for the target worker node until the target worker node completes the kth AI algorithm model training task.
Obviously, the technical solution provided by this embodiment can further ensure the integrity of the parameter updating method for the AI distributed training system provided by this application.
Based on the above embodiments, this embodiment further describes and optimizes the technical solution, and as a preferred implementation, the above steps: the process of initializing the model parameters and the adjacency matrix of the target worker node comprises the following steps:
and initializing the adjacent matrix of the target worker node to be a zero matrix E.
Specifically, in this embodiment, the adjacency matrix of the target worker node is initialized to the zero matrix E. It can be appreciated that when the adjacency matrix of the target worker node is initialized to the zero matrix E, the subsequent parameter updating process can be greatly simplified, so that the resource overhead required by the AI distributed training system during parameter updating can be further reduced.
As a preferred embodiment, the process of updating the adjacency matrix with a non-zero value by using the target node set to obtain an updated adjacency matrix includes:
carrying out non-zero value updating on the zero matrix E by using the node set g to obtain an updated adjacent matrix;
wherein, the expression for updating the adjacency matrix is as follows:
E_g^k(i, j) = 1/l,  if v_i ∈ g and v_j ∈ g
E_g^k(i, j) = 1,    if i = j = u and v_u ∉ g
E_g^k(i, j) = 0,    otherwise
in the formula, g is the node set, l is the number of worker nodes in g, i and j are respectively the row index and the column index in the zero matrix E, and u denotes a row-column index at which i and j are the same (a diagonal position) for a node outside g.
In the present embodiment, a specific implementation of updating the zero matrix E with non-zero values by using the node set g is provided, taking the distributed heterogeneous system shown in fig. 2 as an example. Assume that the 3rd worker node in the distributed heterogeneous system shown in fig. 2 is started and is denoted as w_3. The worker node w_3 is initialized, the preset model parameter x_3 is loaded for w_3, sample data ξ_3^k is randomly selected for w_3, and a gradient update is performed on the model parameter x_3 of w_3 based on the sample data ξ_3^k, that is, x_3^k ← x_3^k − γ·∇F(x_3^k; ξ_3^k). A node set g = {w_1, w_3, w_4} is then created for w_3, and finally the zero matrix E is updated with non-zero values by using the node set g to obtain the updated adjacency matrix, whose entries are 1/3 at the rows and columns of the nodes in g, 1 on the diagonal for the nodes outside g, and 0 elsewhere, namely, for the six-node system of fig. 2:
E_g^k =
[ 1/3  0  1/3  1/3  0  0 ]
[  0   1   0    0   0  0 ]
[ 1/3  0  1/3  1/3  0  0 ]
[ 1/3  0  1/3  1/3  0  0 ]
[  0   0   0    0   1  0 ]
[  0   0   0    0   0  1 ]
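A sketch that builds this updated adjacency matrix is shown below. It follows the reconstruction above (1/l on the g-block, 1 on the diagonal for nodes outside g), so it is an illustration under those assumptions rather than a verbatim rendering of the patent's figure; the helper name and the 0-based indexing are also illustrative:

```python
import numpy as np

def build_adjacency(num_nodes, node_set):
    """Non-zero-value update of the zero matrix E using the node set g."""
    l = len(node_set)
    E = np.zeros((num_nodes, num_nodes))
    E[np.ix_(node_set, node_set)] = 1.0 / l   # the g-block gets 1/l
    for u in range(num_nodes):
        if u not in node_set:
            E[u, u] = 1.0                     # nodes outside g keep their own parameters
    return E

# fig. 2 example: 6 workers, g = {w1, w3, w4} expressed as 0-based indices
E_k_g = build_adjacency(6, [0, 2, 3])
print(E_k_g)
```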
as a preferred embodiment, the above steps: the process of updating the model parameters on each worker node in the target node set by using the updated adjacency matrix comprises the following steps:
updating model parameters on each worker node in the target node set by using the updated adjacency matrix to obtain updated parameters;
wherein, the expression of the update parameter is:
X_g^{k+1} = X^k · E_g^k
in the formula, X_g^{k+1} is the parameter matrix updated based on the parameters of the node set g in the kth iteration, and E_g^k is the adjacency matrix updated based on the node set g in the kth iteration.
In this embodiment, a method for updating the model parameters on each worker node in the target node set by using the updated adjacency matrix is provided, again taking the 3rd worker node of the distributed heterogeneous system shown in fig. 2 as an example. When the model parameters on each node in the node set g = {w1, w3, w4} are updated by using the updated adjacency matrix, the model parameters of the worker nodes w1, w3 and w4 are in effect accumulated and averaged, that is, x_3^{k+1} ← (x_1 + x_3 + x_4)/3.
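A self-contained sketch of this matrix-form update follows, using the parameter matrix X ∈ R^(T×N) defined earlier; the shapes and the random data are illustrative, and the final check simply reproduces the (x_1 + x_3 + x_4)/3 average for w3:

```python
import numpy as np

T, N = 5, 6                             # T parameters per worker, N workers
X = np.random.randn(T, N)               # parameter matrix: column i holds x_i
g = [0, 2, 3]                           # g = {w1, w3, w4} as 0-based indices
l = len(g)

E = np.zeros((N, N))
E[np.ix_(g, g)] = 1.0 / l               # non-zero block for the node set
for u in range(N):
    if u not in g:
        E[u, u] = 1.0                   # nodes outside g keep their parameters

X_next = X @ E                          # X_g^{k+1} = X^k . E_g^k
expected = X[:, g].mean(axis=1)         # (x_1 + x_3 + x_4) / 3
assert np.allclose(X_next[:, 2], expected)   # w3 now holds the average
```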
Based on the technical content disclosed in the above embodiments, in order to enable those skilled in the art to more clearly know the inventive principle of the present invention, the present embodiment specifically describes the parameter updating process of the target worker node through the following pseudo code.
Step 1: initialize the model parameters and the adjacency matrix E for the target worker node worker_i, and control the target worker node worker_i to load the model parameters x_i;
Step 2: randomly select the sample data ξ_i^k of the kth iterative training for the worker node worker_i;
Step 3: perform a gradient update on the target worker node worker_i based on the computed gradient, that is, x_i^k ← x_i^k − γ·∇F(x_i^k; ξ_i^k);
Step 4: randomly create a target node set g = {w_1, w_2, ..., w_l} for the target worker node;
Step 5: update the adjacency matrix E according to the target node set g = {w_1, w_2, ..., w_l}, that is, set E_g^k(i, j) = 1/l for v_i, v_j ∈ g and E_g^k(u, u) = 1 for v_u ∉ g;
Step 6: update the model parameters on each worker node in the target node set by using the updated adjacency matrix, that is, X_g^{k+1} = X^k · E_g^k;
Step 7: complete the parameter update of the kth iteration.
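Putting the steps together, the following is a minimal single-process simulation of one worker's update loop; the loss function, synthetic data, learning rate, node-set size and iteration count are all illustrative assumptions and are not prescribed by the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

N, T, M = 6, 5, 100              # workers, parameters per worker, preset number of iterations
lr, set_size = 0.01, 3           # illustrative learning rate and |g|
X = np.zeros((T, N))             # Step 1: initialize model parameters, one column per worker
target = 2                       # the target worker node worker_i (w3 of fig. 2, 0-based)

def grad(x, batch):
    """Illustrative least-squares gradient; stands in for the AI model's gradient."""
    A, b = batch
    return A.T @ (A @ x - b)

for k in range(M):
    A = rng.standard_normal((8, T))
    b = A @ np.ones(T) + 0.1 * rng.standard_normal(8)
    batch = (A, b)                                     # Step 2: sample data for iteration k

    X[:, target] -= lr * grad(X[:, target], batch)     # Step 3: gradient update of x_i

    others = [n for n in range(N) if n != target]
    g = sorted([target, *rng.choice(others, size=set_size - 1, replace=False).tolist()])
                                                       # Step 4: random target node set g
    E = np.zeros((N, N))                               # Step 5: non-zero update of E from g
    E[np.ix_(g, g)] = 1.0 / len(g)
    for u in range(N):
        if u not in g:
            E[u, u] = 1.0

    X = X @ E                                          # Step 6: update parameters of the nodes in g
                                                       # Step 7: parameter update of iteration k done
```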
Referring to fig. 3, fig. 3 is a structural diagram of a parameter updating apparatus of an AI distributed training system according to an embodiment of the present invention, where the parameter updating apparatus includes:
the model initialization module 21 is configured to start a training task of an AI algorithm model on a target worker node of the distributed heterogeneous system when the distributed heterogeneous system needs to complete an AI acceleration task, and initialize model parameters and an adjacency matrix of the target worker node; an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system;
the sample selection module 22 is used for controlling the target worker node to load preset target model parameters and randomly selecting sample data of the kth iterative training for the target worker node; wherein k is more than or equal to 1;
the gradient updating module 23 is configured to perform gradient updating on the target model parameters based on the sample data, and randomly create a target node set for the target worker node;
the parameter updating module 24 is configured to perform non-zero value updating on the adjacent matrix by using the target node set to obtain an updated adjacent matrix, and update the model parameters on each worker node in the target node set by using the updated adjacent matrix;
the convergence judging module 25 is configured to judge whether an AI algorithm model on the target worker node converges when the target worker node completes the kth iterative training;
the task training module 26 is configured to, when the determination result of the convergence determination module is yes, determine that the target worker node completes the kth AI algorithm model training task, repeatedly execute the step of controlling the target worker node to load preset target model parameters until the target worker node completes M iterations of training, and then determine that the distributed heterogeneous system completes the AI acceleration task; wherein M is more than or equal to 2, and M is the preset number of iterations.
The parameter updating device of the AI distributed training system provided by the embodiment of the invention has the beneficial effects of the parameter updating method of the AI distributed training system disclosed by the embodiment of the invention.
Referring to fig. 4, fig. 4 is a structural diagram of a parameter updating device of an AI distributed training system according to an embodiment of the present invention, where the parameter updating device includes:
a memory 31 for storing a computer program;
a processor 32 for implementing the parameter updating method of the AI distributed training system as disclosed in the foregoing when executing the computer program.
The parameter updating device of the AI distributed training system provided by the embodiment of the invention has the beneficial effects of the parameter updating method of the AI distributed training system disclosed by the embodiment of the invention.
Accordingly, the present invention also discloses a computer readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method for updating parameters of the AI distributed training system as disclosed above is implemented.
The computer-readable storage medium provided by the embodiment of the invention has the beneficial effects of the parameter updating method of the AI distributed training system disclosed in the foregoing.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The parameter updating method, device, equipment and medium of the AI distributed training system provided by the invention are introduced in detail, and a specific example is applied in the text to explain the principle and the implementation of the invention, and the description of the above embodiment is only used to help understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (9)

1. A parameter updating method of an AI distributed training system is characterized by comprising the following steps:
when the distributed heterogeneous system needs to finish an AI acceleration task, starting a training task of an AI algorithm model on a target worker node of the distributed heterogeneous system, and initializing model parameters and an adjacent matrix of the target worker node; an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system;
controlling the target worker node to load preset target model parameters, and randomly selecting sample data of a kth iterative training for the target worker node; wherein k is more than or equal to 1;
gradient updating is carried out on the target model parameters based on the sample data, and a target node set is randomly created for the target worker nodes;
updating a non-zero value of the adjacent matrix by using the target node set to obtain an updated adjacent matrix, and updating model parameters on each worker node in the target node set by using the updated adjacent matrix;
when the target worker node completes the kth iterative training, judging whether an AI algorithm model on the target worker node converges;
if yes, the target worker node is judged to have completed the kth AI algorithm model training task, and the step of controlling the target worker node to load preset target model parameters is repeatedly executed until the target worker node completes M iterations of training, at which point the distributed heterogeneous system is judged to have completed the AI acceleration task; wherein M is more than or equal to 2, and M is the preset number of iterations.
2. The parameter updating method according to claim 1, further comprising:
and building the distributed heterogeneous system by using a plurality of worker nodes provided with the GPU and/or AI chips and/or the FPGA.
3. The method according to claim 1, wherein after the process of determining whether the AI algorithm model on the target worker node converges, the method further comprises:
and if not, re-executing the step of controlling the target worker node to load preset target model parameters and randomly selecting sample data of the kth iterative training for the target worker node.
4. The parameter updating method according to claim 1, wherein the process of initializing the model parameters and the adjacency matrix of the target worker node comprises:
and initializing the adjacent matrix of the target worker node to be a zero matrix E.
5. The method according to claim 4, wherein the updating the adjacency matrix with the non-zero value by using the target node set to obtain an updated adjacency matrix comprises:
carrying out non-zero value updating on the zero matrix E by utilizing a node set g to obtain an updated adjacent matrix;
wherein the expression of updating the adjacency matrix is as follows:
E_g^k(i, j) = 1/l,  if v_i ∈ g and v_j ∈ g
E_g^k(i, j) = 1,    if i = j = u and v_u ∉ g
E_g^k(i, j) = 0,    otherwise
in the formula, g is the node set, l is the number of worker nodes in g, i and j are respectively the row index and the column index in the zero matrix E, and u denotes a row-column index at which i and j are the same (a diagonal position) for a node outside g.
6. The parameter updating method according to claim 5, wherein the process of updating the model parameters on each worker node in the target node set by using the updated adjacency matrix comprises:
updating model parameters on each worker node in the target node set by using the updated adjacency matrix to obtain updated parameters;
wherein the expression of the update parameter is:
X_g^{k+1} = X^k · E_g^k
in the formula, X_g^{k+1} is the parameter matrix updated based on the parameters of the node set g in the kth iteration, and E_g^k is the adjacency matrix updated based on the node set g in the kth iteration.
7. A parameter updating apparatus of an AI distributed training system, comprising:
the model initialization module is used for starting a training task of an AI algorithm model on a target worker node of the distributed heterogeneous system when the distributed heterogeneous system needs to finish an AI acceleration task, and initializing model parameters and an adjacent matrix of the target worker node; an AI algorithm model is deployed on all worker nodes of the distributed heterogeneous system;
the sample selection module is used for controlling the target worker node to load preset target model parameters and randomly selecting sample data of the kth iterative training for the target worker node; wherein k is more than or equal to 1;
the gradient updating module is used for carrying out gradient updating on the target model parameters based on the sample data and randomly creating a target node set for the target worker node;
the parameter updating module is used for updating a nonzero value of the adjacent matrix by using the target node set to obtain an updated adjacent matrix and updating model parameters on each worker node in the target node set by using the updated adjacent matrix;
the convergence judging module is used for judging whether an AI algorithm model on the target worker node converges or not when the target worker node completes the kth iterative training;
the task training module is used for judging that the target worker node completes the kth AI algorithm model training task when the judgment result of the convergence judgment module is yes, repeatedly executing the step of controlling the target worker node to load preset target model parameters until the target worker node completes M iterations of training, and judging that the distributed heterogeneous system completes the AI acceleration task; wherein M is more than or equal to 2, and M is the preset number of iterations.
8. A parameter updating apparatus of an AI distributed training system, comprising:
a memory for storing a computer program;
a processor for implementing a parameter updating method of an AI distributed training system as claimed in any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, implements a parameter updating method of an AI distributed training system according to any one of claims 1 to 6.
CN202010820131.XA 2020-08-14 2020-08-14 Parameter updating method, device and equipment of AI distributed training system Active CN112001501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010820131.XA CN112001501B (en) 2020-08-14 2020-08-14 Parameter updating method, device and equipment of AI distributed training system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010820131.XA CN112001501B (en) 2020-08-14 2020-08-14 Parameter updating method, device and equipment of AI distributed training system

Publications (2)

Publication Number Publication Date
CN112001501A true CN112001501A (en) 2020-11-27
CN112001501B CN112001501B (en) 2022-12-23

Family

ID=73473503

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010820131.XA Active CN112001501B (en) 2020-08-14 2020-08-14 Parameter updating method, device and equipment of AI distributed training system

Country Status (1)

Country Link
CN (1) CN112001501B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766498A (en) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 Model training method and device
CN112965888A (en) * 2021-03-03 2021-06-15 山东英信计算机技术有限公司 Method, system, device and medium for predicting task quantity based on deep learning
CN113128700A (en) * 2021-03-23 2021-07-16 同盾控股有限公司 Method and system for accelerating safe multi-party computing federal model training
CN113485805A (en) * 2021-07-01 2021-10-08 曙光信息产业(北京)有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN113656494A (en) * 2021-07-27 2021-11-16 中南大学 Synchronization method and system of parameter server and readable storage medium
CN115250253A (en) * 2022-06-22 2022-10-28 西南交通大学 Bandwidth perception reduction processing method and AI model training method
CN115879543A (en) * 2023-03-03 2023-03-31 浪潮电子信息产业股份有限公司 Model training method, device, equipment, medium and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492753A (en) * 2018-11-05 2019-03-19 中山大学 A kind of method of the stochastic gradient descent of decentralization
US20190318268A1 (en) * 2018-04-13 2019-10-17 International Business Machines Corporation Distributed machine learning at edge nodes
CN111027708A (en) * 2019-11-29 2020-04-17 杭州电子科技大学舟山同博海洋电子信息研究院有限公司 Distributed machine learning-oriented parameter communication optimization method
CN111353582A (en) * 2020-02-19 2020-06-30 四川大学 Particle swarm algorithm-based distributed deep learning parameter updating method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190318268A1 (en) * 2018-04-13 2019-10-17 International Business Machines Corporation Distributed machine learning at edge nodes
CN109492753A (en) * 2018-11-05 2019-03-19 中山大学 A kind of method of the stochastic gradient descent of decentralization
CN111027708A (en) * 2019-11-29 2020-04-17 杭州电子科技大学舟山同博海洋电子信息研究院有限公司 Distributed machine learning-oriented parameter communication optimization method
CN111353582A (en) * 2020-02-19 2020-06-30 四川大学 Particle swarm algorithm-based distributed deep learning parameter updating method

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766498A (en) * 2021-01-29 2021-05-07 北京达佳互联信息技术有限公司 Model training method and device
CN112766498B (en) * 2021-01-29 2022-11-22 北京达佳互联信息技术有限公司 Model training method and device
CN112965888A (en) * 2021-03-03 2021-06-15 山东英信计算机技术有限公司 Method, system, device and medium for predicting task quantity based on deep learning
CN113128700A (en) * 2021-03-23 2021-07-16 同盾控股有限公司 Method and system for accelerating safe multi-party computing federal model training
CN113485805A (en) * 2021-07-01 2021-10-08 曙光信息产业(北京)有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN113485805B (en) * 2021-07-01 2024-02-06 中科曙光(南京)计算技术有限公司 Distributed computing adjustment method, device and equipment based on heterogeneous acceleration platform
CN113656494A (en) * 2021-07-27 2021-11-16 中南大学 Synchronization method and system of parameter server and readable storage medium
CN115250253A (en) * 2022-06-22 2022-10-28 西南交通大学 Bandwidth perception reduction processing method and AI model training method
CN115250253B (en) * 2022-06-22 2024-02-27 西南交通大学 Reduction processing method for bandwidth perception and training method for AI model
CN115879543A (en) * 2023-03-03 2023-03-31 浪潮电子信息产业股份有限公司 Model training method, device, equipment, medium and system
CN115879543B (en) * 2023-03-03 2023-05-05 浪潮电子信息产业股份有限公司 Model training method, device, equipment, medium and system

Also Published As

Publication number Publication date
CN112001501B (en) 2022-12-23

Similar Documents

Publication Publication Date Title
CN112001501B (en) Parameter updating method, device and equipment of AI distributed training system
CN112101530B (en) Neural network training method, device, equipment and storage medium
CN108290704B (en) Method and apparatus for determining allocation decisions for at least one elevator
CN109857532B (en) DAG task scheduling method based on Monte Carlo tree search
JP2021532437A (en) Improving machine learning models to improve locality
CN110490319B (en) Distributed deep reinforcement learning method based on fusion neural network parameters
WO2020237798A1 (en) Upgrade method and device
CN111144571A (en) Deep learning reasoning operation method and middleware
CN111147541B (en) Node processing method, device and equipment based on parameter server and storage medium
CN112306452A (en) Method, device and system for processing service data by merging and sorting algorithm
CN115437372B (en) Robot path planning method and device, electronic equipment and storage medium
CN112329941B (en) Deep learning model updating method and device
CN108897619A (en) A kind of multi-layer resource flexibility configuration method for supercomputer
KR101595062B1 (en) Maximal matching method for graph
WO2020105161A1 (en) Edge device machine learning model switching system, edge device machine learning model switching method, program, and edge device
CN116627659B (en) Model check point file storage method, device, equipment and storage medium
WO2006033967A2 (en) A method and apparatus for modeling systems
CN113537406B (en) Method, system, medium and terminal for enhancing image automatic data
CN112990332B (en) Sub-graph scale prediction and distributed training method and device and electronic equipment
CN116237935B (en) Mechanical arm collaborative grabbing method, system, mechanical arm and storage medium
CN112507197B (en) Model searching method, device, electronic equipment, storage medium and program product
CN111177474B (en) Graph data processing method and related device
US20160174127A1 (en) Information processing apparatus, control method, and program
CN117439810A (en) Honey network node deployment method, system and storable medium for electric power Internet of things
CN112506658A (en) Dynamic resource allocation and task scheduling method in service chain

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant