CN112085074B

CN112085074B - Model parameter updating system, method and device

Info

Publication number: CN112085074B
Application number: CN202010863228.9A
Authority: CN
Inventors: 沈力; 陈淙靓; 黄浩智; 刘威
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-08-25
Filing date: 2020-08-25
Publication date: 2024-05-07
Anticipated expiration: 2040-08-25
Also published as: CN112085074A

Abstract

The application provides a model parameter updating system, method and device, relates to the technical field of artificial intelligence, and is used for optimizing a training process of a neural network model. The method comprises the following steps: performing interactive operation of a plurality of iteration rounds by each auxiliary computing node and the main computing node to obtain local model parameters of target models on each auxiliary computing node and the main computing node; in the interactive operation of any iteration round, each auxiliary computing node determines a first gradient of each local model parameter, performs target processing on the first gradient based on a first error compensation value to obtain a second gradient, and updates each local model parameter by using a received third gradient; and the main computing node performs target processing on the second gradient based on the second error compensation value to obtain a third gradient, sends the third gradient to each auxiliary computing node, and updates local model parameters by using the third gradient. The method improves the convergence of the neural network model in the distributed training, and further improves the training efficiency of the neural network model.

Description

Model parameter updating system, method and device

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to a system, a method, and an apparatus for updating model parameters.

Background

With the continuous development of artificial intelligence technology, neural network models have more and more applications in the directions of information classification, information identification, information processing and the like; in order to improve the training efficiency of the neural network model, the parameters and training data of the neural network model are often subjected to distributed training in the related art, meanwhile, in order to further improve the training efficiency, the training nodes in the distributed training system perform some target processing for reducing the data quantity on the parameters of the neural network model, but errors are generated in the target processing process, the loss function value of the neural network model in the training process is affected, so that the neural network model cannot converge or the convergence speed is too slow, the training efficiency of the neural network model is reduced, and therefore, how to improve the training efficiency of the distributed training on the neural network model becomes a problem to be considered.

Disclosure of Invention

The embodiment of the application provides a model parameter updating system, a model parameter updating method and a model parameter updating device, which are used for improving the training efficiency of distributed training of a neural network model.

In a first aspect of the present application, a model parameter updating system is provided, comprising a main computing node and at least two auxiliary computing nodes,

Performing interactive operation of a plurality of iteration rounds by each auxiliary computing node and the main computing node to obtain local model parameters on each auxiliary computing node and local model parameters on the main computing node; the local model parameters include model parameters of a target model, wherein the interoperation of any one of the plurality of iterative rounds includes:

each auxiliary computing node determines a first gradient of each local model parameter according to a training sample of the target model, performs target processing on the first gradient based on a first error compensation value of target processing corresponding to the auxiliary computing node to obtain a second gradient, and sends the second gradient to the main computing node; and updating respective local model parameters with a third gradient received from the master computing node; the first gradient is used for indicating the change degree of local model parameters on the auxiliary computing node, and the target processing comprises the operation of reducing the data quantity of the data;

the main computing node performs target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient, and sends the third gradient to each auxiliary computing node; and updating local model parameters on the master computing node using the third gradient.

In a second aspect of the present application, a method for updating model parameters is provided, including:

The main computing node performs target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient; the second gradient is obtained by performing target processing on the first gradients of the local model parameters based on first error compensation values of target processing corresponding to the auxiliary computing nodes in each of the at least two auxiliary computing nodes; the local model parameters comprise model parameters of a target model, the first gradient is determined by each auxiliary computing node according to training samples of the target model, and the first gradient is used for indicating the change degree of the local model parameters on the auxiliary computing nodes;

The main computing node sends the third gradient to each auxiliary computing node so that each auxiliary computing node updates each local model parameter by using the third gradient; and

The master computing node updates local model parameters on the master computing node using the third gradient.

In one possible implementation, the method further includes:

in the t iteration round, the main computing node updates the second error compensation value corresponding to the main computing node in the t iteration round according to the following formula to obtain the second error compensation value corresponding to the t+1th iteration round of the main computing node:

wherein, E _t+1 is a second error compensation value corresponding to the main computing node in the t+1st iteration round.

In one possible implementation, the updating, by the master computing node, local model parameters on the master computing node using the third gradient includes:

and in the t iteration round, determining the difference value between the local model parameter updated by the main computing node in the t-1 iteration round and the third gradient as the local model parameter updated by the main computing node in the t iteration round.

In one possible implementation, the target process satisfies the following condition:

The target processing error value is not greater than the error threshold; the target processing error value is an offset value of a data amount of original data and a data amount of processing data corresponding to the original data, and the processing data is obtained by performing the target processing on the original data; the error threshold is determined according to the original data and preset error parameters; and

And the data quantity of the processed data obtained by the target processing is not more than the original data corresponding to the processed data.

In a third aspect of the present application, a method for updating model parameters is provided, including:

The auxiliary computing node determines a first gradient of a local model parameter on the auxiliary computing node according to a training sample of a target model; the local model parameters include model parameters of the target model, the first gradient being used to indicate a degree of variation of the local model parameters on the secondary computing node;

The auxiliary computing node performs target processing on the first gradient based on a first error compensation value of target processing corresponding to the auxiliary computing node to obtain a second gradient; the target process includes an operation of reducing a data amount of data;

The auxiliary computing node sends the second gradients to the main computing node, so that the main computing node performs target processing on the second gradients of at least two auxiliary computing nodes based on second error compensation values of target processing corresponding to the main computing node to obtain a third gradient, and the main computing node updates local model parameters on the main computing node according to the third gradient;

the secondary computing node updates local model parameters on the secondary computing node with a third gradient received from the primary computing node.

In one possible implementation, the method further includes:

in the t iteration round, the auxiliary computing node updates a first error compensation value corresponding to the auxiliary computing node in the t iteration round according to the following formula to obtain a first error compensation value corresponding to the auxiliary computing node in the t+1th iteration round:

Wherein e _t+1 is a first error compensation value corresponding to the secondary computing node in the t+1st iteration round.

In one possible implementation, the secondary computing node updates local model parameters on the secondary computing node with a third gradient received from the primary computing node, comprising:

and in the t iteration round, determining the difference value between the local model parameter updated by the auxiliary computing node in the t-1 iteration round and the third gradient as the local model parameter updated by the auxiliary computing node in the t iteration round.

The target processing error is not greater than the error threshold; the target processing error is a deviation value of the data quantity of original data and the data quantity of processing data corresponding to the original data, and the processing data is obtained by performing target processing on the original data; the error threshold is determined according to the original data and preset error parameters; and

In a fourth aspect of the present application, there is provided a model parameter updating apparatus comprising:

The gradient processing unit is used for carrying out target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient; the second gradient is obtained by performing target processing on the first gradients of the local model parameters based on first error compensation values of target processing corresponding to the auxiliary computing nodes in each of the at least two auxiliary computing nodes; the local model parameters comprise model parameters of a target model, the first gradient is determined by each auxiliary computing node according to training samples of the target model, and the first gradient is used for indicating the change degree of the local model parameters on the auxiliary computing nodes;

The information sending unit is used for sending the third gradient to each auxiliary computing node so that each auxiliary computing node can update each local model parameter by using the third gradient; and

And the parameter updating unit is used for updating the local model parameters on the main computing node by utilizing the third gradient.

In one possible implementation manner, the main computing node performs an interactive operation of a plurality of iterative rounds with the auxiliary computing nodes, wherein a second error compensation value corresponding to the main computing node in the t-th iterative round is an accumulated value of a second error of the target processing performed by the main computing node in the 1 st to t-1 st iterative rounds; and when t is a positive integer and t is 1, the corresponding second error compensation value of the main calculation node in the 1 st iteration round is a preset compensation value.

In one possible implementation, the gradient processing unit is specifically configured to:

determining an average gradient of a second gradient of the at least two auxiliary computing nodes in the t-th iteration round;

And performing the target processing on the average gradient based on a second error compensation value corresponding to the main computing node in the t iteration round according to the following formula, so as to obtain the third gradient:

Wherein the said A third gradient for the master computing node obtained in the t-th iteration round; said Q _s () is the algorithm of said target process on said master computing node; said/>Is the average gradient, and E _t is a second error compensation value corresponding to the main computation node in the t-th iteration round.

In a possible implementation, the gradient processing unit is further configured to: in the t iteration round, updating the second error compensation value corresponding to the main calculation node in the t iteration round according to the following formula to obtain the second error compensation value corresponding to the t+1th iteration round of the main calculation node:

In a possible implementation manner, the parameter updating unit is specifically configured to:

And in the t iteration round, determining a difference value between the local model parameter updated by the main computing node in the t-1 iteration round and the third gradient as the local model parameter updated by the main computing node in the t iteration round.

In one possible implementation, the target model is a distributed training neural network model,

When the model parameter updating device is applied to the field of information classification, the target model is an information classification model, and the model parameters are parameters indicating the corresponding relation between information characteristics and information types in the information classification model; or (b)

When the model parameter updating device is applied to the field of information detection, the target model is an information detection model, and the model parameters are parameters indicating the corresponding relation between information characteristics and detection results in the information prediction model.

In a fifth aspect of the present application, there is provided a model parameter updating apparatus comprising:

The first gradient processing unit is used for determining a first gradient of the local model parameter on the auxiliary computing node according to the training sample of the target model; the local model parameters include model parameters of the target model, the first gradient being used to indicate a degree of variation of the local model parameters on the secondary computing node;

The second gradient processing unit is used for carrying out target processing on the first gradient based on a first error compensation value of target processing corresponding to the auxiliary computing node to obtain a second gradient; the target process includes an operation of reducing a data amount of data;

The information sending unit is used for sending the second gradients to the main computing node, so that the main computing node performs target processing on the second gradients of at least two auxiliary computing nodes based on second error compensation values of target processing corresponding to the main computing node to obtain a third gradient, and the main computing node updates local model parameters on the main computing node according to the third gradient;

and a parameter updating unit, configured to update local model parameters on the secondary computing node with a third gradient received from the primary computing node.

In one possible implementation, the secondary computing node performs a plurality of iterative rounds of inter-operation with the primary computing node; the first error compensation value corresponding to the auxiliary computing node in the t iteration round is an accumulated value of a difference value between the first error of the target processing and the second gradient obtained in the t-1 iteration round in the 1 st to t-1 iteration rounds of the auxiliary computing node; and when t is a positive integer and t is 1, the first error compensation value of the 1 st iteration round is a preset compensation value.

In one possible implementation manner, the second gradient processing unit is specifically configured to:

In the t iteration round, performing the target processing on the first gradient based on the following formula and a first error compensation value corresponding to the auxiliary computing node in the t iteration round to obtain the second gradient:

Wherein Δ _t is a second gradient obtained by the secondary computing node in the t-th iteration round, and Q _w () is an algorithm of the target processing on the secondary computing node; the alpha _t is a learning rate of the local model parameters of the secondary computing node in the t-th iteration round, and the learning rate is determined based on an initial learning rate and an iteration round; the e _t is a first error compensation value corresponding to the auxiliary computing node in the t iteration round; the epsilon is a preset super parameter; the m _t is a gradient momentum of the secondary computing node in the t-th iteration round, the gradient momentum is determined according to the first gradient, and the gradient momentum is used for indicating the extent to which the first gradient advances toward a change direction indicated by an objective function; the V _t is a gradient sliding average value of the auxiliary computing node in the t-th iteration round, the gradient sliding average value is used for indicating the change degree of the first gradient, the V _t is obtained by processing the gradient sliding average value V _t-1 of the auxiliary computing node in the t-1 th iteration round according to the first gradient and an index moving average coefficient, and the index moving average coefficient is obtained according to an initial moving average coefficient and the iteration round.

In a possible implementation manner, the second gradient processing unit is further configured to:

In the t iteration round, updating a first error compensation value corresponding to the auxiliary computing node in the t iteration round according to the following formula to obtain a first error compensation value corresponding to the auxiliary computing node in the t+1th iteration round:

In a sixth aspect, the present application provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of the second aspect and any one of the possible implementations when executing the program.

In a seventh aspect, the present application provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of the third aspect and any one of the possible implementations when executing the program.

In an eighth aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from a computer-readable storage medium by a processor of a computer device, which executes the computer instructions, causing the computer device to perform the method provided in the various possible implementations of the second or third aspect described above.

In a ninth aspect of the application, there is provided a computer readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform a method as described in any one of the first aspect and any one of the possible implementations.

Due to the adoption of the technical scheme, the embodiment of the application has at least the following technical effects:

In the distributed training system provided by the embodiment of the application, when the calculation nodes (the main calculation node and the auxiliary calculation node) perform target processing on the relevant gradient of the model parameters of the target model, the influence of errors generated by the target processing is reduced by adopting the corresponding error compensation value, so that the errors of the target processing of the calculation nodes are reduced, the influence of the errors of the target processing on the loss function value of the target model is further reduced, the possibility of non-convergence of the target model in the distributed training process is reduced, and the training efficiency of the target model is improved; and the influence of the error of the target processing on the loss function value of the target model is reduced, so that the convergence speed of the target model in the distributed training process is further increased, and the training efficiency of the target model is further improved.

Drawings

FIG. 1 is a schematic diagram of a model parameter updating system according to an embodiment of the present application;

FIG. 2 is a process diagram of initializing each computing node according to an embodiment of the present application;

FIG. 3 is a schematic diagram illustrating an initialization process of each computing node according to an embodiment of the present application;

FIG. 4 is a schematic process diagram of a method for updating model parameters according to an embodiment of the present application;

FIG. 5 is a schematic process diagram of a method for updating model parameters according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating an interaction process between a primary computing node and a secondary computing node according to an embodiment of the present application;

FIG. 7 is a schematic diagram of experimental results of a model parameter updating method according to an embodiment of the present application;

FIG. 8 is a schematic diagram of experimental results of a model parameter updating method according to an embodiment of the present application;

FIG. 9 is a schematic diagram of experimental results of a model parameter updating method according to an embodiment of the present application;

FIG. 10 is a schematic diagram of experimental results of a model parameter updating method according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a model parameter updating apparatus according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of a model parameter updating apparatus according to an embodiment of the present application;

fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to better understand the technical solutions provided by the embodiments of the present application, the following detailed description will be given with reference to the accompanying drawings and specific embodiments.

In order to facilitate the technical solution of the present application to be better understood by those skilled in the art, the following description of technical terms related to the present application is provided.

Artificial intelligence (ARTIFICIAL INTELLIGENCE, AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results; the artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level; the artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning and other directions.

Convergence of neural network model: given a maximum number of iterations in the training process of the neural network model, a loss threshold value of the loss function value (i.e. a limit value of an error of the model), the loss function value is reduced by adjusting the model parameters in each training of the neural network model, the loss function value converges when the training sample is theoretically large enough and the iteration number tends to be infinite, but in practice, when the iteration number reaches the set maximum iteration number, the loss function value is still larger than the loss threshold value or does not reach the loss threshold value, or after the iteration number reaches a certain number, the loss function value does not converge and unstable floating occurs near a certain value, and in this case, the training of the neural network model fails, and the phenomenon becomes that the neural network model does not converge or has poor convergence.

Target treatment: the target processing in the embodiment of the present application refers to an operation of reducing the data amount; the target process may be compression processing for deleting the minimum value in the data, or quantization processing for quantizing data of one data format to data of another data format, such as quantization processing for quantizing data of float 32 format to data of 6bit format, or the like, or other operations capable of reducing the data amount.

The following describes the design concept of the present application.

The neural network model in machine learning has more and more applications in the directions of information classification such as image classification, information recognition such as target detection, information processing such as natural language processing or speech synthesis and the like; however, in order to obtain a neural network model with higher recognition accuracy and classification accuracy, a large amount of parameters and training data are often used for training the neural network model; however, the data size of the parameters and the training data greatly affects the training efficiency of the neural network model, so a parameter server (parameter-server) model is often adopted in the related art to perform distributed training on the parameters and the training data of the neural network model, but in the operation process of the parameter server, a great deal of communication exists between an auxiliary computing node worker and a main computing node server in the server, so the training speed of the neural network model is severely limited, although the parameters of the neural network model are quantized or compressed on the server or the worker in the related art at present, the error of the data of the parameters is caused in the quantization or compression process, the loss function value of the neural network model is reduced in the training process, the neural network model is not converged or is too slow in convergence speed, the training efficiency of the neural network model is affected, the performance of the neural network model obtained by training is unstable, the accuracy of the neural network model obtained by training is affected, and the training efficiency of the neural network model for performing distributed training on the neural network model is improved, and the problem of taking the accuracy of the neural network obtained by training is improved is considered.

In view of this, the inventors devised a model parameter updating method and apparatus in machine learning involving artificial intelligence; in consideration of improving the training efficiency of the neural network model, the neural network model is subjected to distributed training in the embodiment of the application, and training related data is subjected to target processing for reducing the data volume on a main computing node and an auxiliary computing node in a distributed training system so as to improve the training efficiency; in the embodiment of the application, the training efficiency and accuracy of the neural network model are further improved by improving the convergence of the neural network model in the training process in consideration of the high correlation between the training efficiency and accuracy of the neural network model and the convergence of the neural network model in the training process; because the convergence of the neural network model in the training process is greatly influenced by errors caused by the fact that the computing node performs target processing on training related data, the error compensation value of target processing is utilized on the computing node to compensate errors generated by the fact that the computing node performs target processing is considered in the embodiment of the application, and further the influence of the errors generated by the target processing on the convergence of the neural network model is reduced; meanwhile, in order to improve the convergence rate of the neural network model, in the embodiment of the application, on a main computing node and each auxiliary computing node in the distributed training system, corresponding error compensation values are adopted to compensate errors generated by target processing; specifically:

In the embodiment of the application, the distributed training system comprises a main computing node and at least two auxiliary computing nodes, wherein each auxiliary computing node performs interactive operation of a plurality of iterative rounds with the main computing node to obtain model parameters of a target model, each auxiliary computing node determines first gradients of local model parameters according to training samples of the target model in interactive operation of any iterative round, performs target processing on the first gradients based on first error compensation values of target processing corresponding to the auxiliary computing nodes to obtain second gradients, and sends the second gradients to the main computing node; and updating respective local model parameters with a third gradient received from the master computing node; the first gradient is used for indicating the change degree of local model parameters of the target model on the auxiliary computing node; the main computing node performs target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient, and sends the third gradient to each auxiliary computing node; and updating local model parameters of a target model on the main computing node by using a third gradient, wherein the target model is the neural network model.

In order to more clearly understand the design concept of the present application, the model parameter updating method provided by the embodiment of the present application is described below by way of example with reference to the accompanying drawings.

Referring to fig. 1, an embodiment of the present application provides a model parameter updating system, which includes a main computing node 110 and at least two auxiliary computing nodes 120, wherein the main computing node 110 and the auxiliary computing nodes 120 may be parameter servers, the parameter servers may be, but are not limited to, distributed servers, blockchain servers, cloud servers, etc., communication between the main computing node 110 and the auxiliary computing nodes 120 may be performed through a communication network, and identities of the main computing node 110 and the auxiliary computing nodes 120 may be switched under different scenarios, and the communication network may be, but is not limited to, a local area network, a wide area network, etc.

When training the target model, each auxiliary computing node 120 has local model parameters of the target model, and each auxiliary computing node 120 trains the target model according to training samples; the local model parameters of the target model are also stored on the main computing node 110, and the main computing node 110 adjusts the local model parameters on the main computing node according to the training results of the auxiliary computing nodes, where the local model parameters include the model parameters of the target model.

In the process of training the target model, each auxiliary computing node 120 performs a plurality of iterative rounds of interaction with the main computing node 110 to obtain local model parameters on each auxiliary computing node 120 and local model parameters on the main computing node 110, wherein in each iterative round of interaction, because each auxiliary computing node 120 trains the target model, the local model parameters on each auxiliary computing node 120 in the same iterative round may be the same or different; the local model parameters on the primary computing node 110 and the local model parameters on the respective secondary computing nodes 120 may also be the same or different in the same iteration round.

In the interaction of any one of the above-described iteration rounds, each of the secondary computing nodes 120 and the primary computing node 110 perform the following interaction: each auxiliary computing node 120 determines a first gradient of each local model parameter according to the training sample of the target model, performs target processing on the first gradient based on a first error compensation value of target processing corresponding to the auxiliary computing node to obtain a second gradient, and sends the second gradient to the main computing node; and updating respective local model parameters with a third gradient received from the master computing node; the first gradient is used for indicating the change degree of local model parameters on the auxiliary computing node, and the upper target processing comprises the operation of reducing the data quantity of the data;

The main computing node 110 performs target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient, and sends the third gradient to each auxiliary computing node; and updating its own local model parameters with the third gradient.

As an embodiment, before each secondary computing node 120 performs the interactive operation with the primary computing node 110 for a plurality of iterative rounds, a node initialization operation is further required to obtain a network structure of the target model, a training sample of the target model, and initial model parameters of the target model; the method specifically comprises the following two node initialization modes:

a first node initialization mode;

Referring to fig. 2, the main computing node 110 obtains a network structure of the target model, a training sample of the target model, and initial model parameters of the target model, takes the obtained initial model parameters as local model parameters on the main computing node in the 1 st iteration round, and sends the obtained network structure of the target model, the training sample of the target model, and the initial model parameters of the target model to each auxiliary computing node 120;

Each secondary computing node 120 uses the received initial model parameters as local model parameters in the 1 st iteration round, uses the received network structure as the network structure of the target model, and uses the received training samples as the training data of the target model.

The second node initialization mode:

Referring to fig. 3, one auxiliary computing node 120 of the at least two auxiliary computing nodes acquires a network structure of the target model, a training sample of the target model and initial model parameters of the target model, and takes the acquired initial model parameters as local model parameters of the auxiliary computing node in the 1 st iteration round; and sending the obtained network structure of the target model, the training sample of the target model and the initial model parameters of the target model to other auxiliary computing nodes 120, so that the other auxiliary computing nodes 120 take the received initial model parameters as local model parameters in the 1 st iteration round, take the received network structure as the network structure of the target model and take the received training sample as training data of the target model; and sending the obtained initial model parameters to the main computing node 110, so that the main computing node uses the received initial model parameters as the local model parameters of the main computing node in the 1 st iteration round.

The following describes the target processing on the primary computing node and the secondary computing node in the embodiment of the present application.

The target process is an operation of reducing the data amount; in the process of training the target model, in order to improve training efficiency, in the embodiment of the application, the target processing is performed on the training related data on the main computing node and the auxiliary computing node so as to reduce the data size of the training related data, and when the main computing node and the auxiliary computing node transmit the training related data, the transmitted data size can be reduced, the transmission speed can be increased, and further, the training efficiency can be improved.

In the embodiment of the application, the target processing meets the following two conditions:

Condition 1: the target process error of the target process is not greater than the error threshold.

The target processing error is a deviation value of a data amount of original data and a data amount of processing data corresponding to the original data, wherein the original data is data before target processing, and the processing data is data obtained by target processing of the original data; the error threshold is determined according to the original data and a preset error parameter.

Specifically, the target process satisfies the following equation 1:

equation 1: the I is less than or equal to (1-c ₁)||x||+c₂) and the I is x-Q (x);

In the formula 1, Q () is an algorithm corresponding to a target process, where the target process may be a target process on a primary computing node, or may be a target process on a secondary computing node; x is original data, and Q (x) represents processing data obtained by performing target processing on the original data x; sign "| 'I' I representing norming; the term ||x-Q (x) | characterizes the target processing error described above; (1-c ₁)||x||+c₂ represents the error threshold, c1 and c2 are preset error parameters, c1 is a value greater than zero, and c2 is a value greater than or equal to zero.

Condition 2: the data volume of the processed data obtained by the target processing is smaller than the original data corresponding to the processed data.

Specifically, the target process satisfies the following equation 2:

equation 2: the value of Q (x) is less than or equal to the value of x;

In the formula 2, Q () is an algorithm corresponding to a target process, where the target process may be a target process on a primary computing node, or may be a target process on a secondary computing node; x is original data, and Q (x) represents processing data obtained by performing target processing on the original data x; sign "| 'I' I representing norming; the ||q (x) | characterizes the data amount of the above-described processing data; the expression of |x|| characterizes the above. Data amount of original data.

Satisfying the above condition 1 and condition 2 such as the target processing may be, but not limited to, quantization processing, compression processing, or other operations capable of reducing the amount of data, etc., and may be set by those skilled in the art according to actual demands.

As an embodiment, the target model in the embodiment of the present application may be, but is not limited to, a distributed training neural network model, and when applied to different implementation scenarios, the target model may be a different neural network model, and specific examples of the target model and corresponding model parameters in several specific implementation scenarios are given below:

Example 1: the model parameter updating method provided by the embodiment of the application is applied to the field of information classification, the target model is an information classification model, and the model parameters of the target model are parameters indicating the corresponding relation between the information characteristics and the information types in the information classification model.

Specifically, when applied to image classification, the target model may be an image classification model, and the model parameters of the target model may be parameters indicating the correspondence between image features and image categories in the image classification model; when the method is applied to text classification, the target model can be a text classification model, and model parameters of the target model can be parameters of the text classification model, wherein the parameters indicate the corresponding relation between text characteristics and text categories.

Example 2: when the model parameter updating method provided by the embodiment of the application is applied to the field of information detection, the target model can be an information detection model, and the model parameters of the target model are parameters indicating the corresponding relation between the information characteristics and the detection results in the information prediction model.

Specifically, when applied to target detection in an image, the target model may be a target detection model, and model parameters of the target model may be parameters indicating a correspondence between target features and target results in the target detection model; when the method is applied to text detection, the target model can be a text detection model, and model parameters of the target model can be parameters of the text detection model, which indicate the corresponding relation between text features and text results.

Example 3: when the model parameter updating method provided by the embodiment of the application is applied to the field of information processing, the target model can be an information processing model, and the model parameters of the target model are parameters indicating the corresponding relation between the information characteristics and the processing results in the information processing model.

Specifically, when applied to speech recognition, the target model may be a speech recognition model, and the model parameters of the target model may be parameters indicating correspondence between speech features and speech recognition results in the speech recognition model.

In examples 1 through 3 above, the information feature may include, but is not limited to, at least one feature of text features, image features, audio features, video features, contest winning features, behavior features, device identification, account identification.

It should be noted that the specific implementation scenarios in examples 1 to 3 and the corresponding target models and model parameters are merely illustrative, and those skilled in the art may apply the model parameter updating system and method provided in the present application to other implementation scenarios and other target models as needed according to actual needs.

Based on the application scenario of fig. 1 and the above specific implementation scenarios, an exemplary description is given below of a model parameter updating method according to an embodiment of the present application; referring to fig. 4, an exemplary diagram of a method for updating model parameters is provided in an embodiment of the present application; the method is applied to each auxiliary computing node 120 in the interactive operation of each iteration round in the plurality of iteration rounds, and specifically comprises the following steps:

step S401, determining a first gradient of local model parameters on an auxiliary computing node according to a training sample of a target model; the local model parameters include model parameters of the target model, and the first gradient is used to indicate a degree of variation of the local model parameters on the secondary computing node.

Specifically, in the parameter model updating system, each auxiliary computing node and the main computing node perform interactive operation of multiple iteration rounds, in each iteration round, the auxiliary computing node can randomly select a part from training samples, a gradient of a local model parameter x _t point on the auxiliary computing node is used as a first gradient in a t-th iteration round by using a loss function of a counter propagation computing target model, namely, a derivative of the loss function of the computing target model in the t-th iteration round is used as a first gradient, wherein t is a local model parameter of the auxiliary computing node in the t-th iteration round.

Step S402, performing target processing on the first gradient based on a first error compensation value of target processing corresponding to the auxiliary computing node to obtain a second gradient; the target processing includes an operation of reducing the data amount of the data.

Specifically, the first error compensation value corresponding to the auxiliary computing node in the t iteration round is the accumulated value of the difference value between the first error of the target processing and the second gradient obtained in the t-1 iteration round in the 1 st to t-1 iteration rounds of the auxiliary computing node; and when t is a positive integer and t is 1, the first error compensation value of the 1 st iteration round is a preset compensation value.

Step S403, the second gradient is sent to the main computing node, so that the main computing node performs target processing on the second gradients of the at least two auxiliary computing nodes based on the second error compensation value of the target processing corresponding to the main computing node to obtain a third gradient, and the main computing node updates the local model parameters on the main computing node according to the third gradient.

The process of the primary computing node performing the target processing on the second gradient of the at least two secondary computing nodes is described in the following embodiments of the present application.

Step S404, updating the local model parameters of the host computing node by using the third gradient received from the host computing node.

Specifically, for one auxiliary computing node in each auxiliary computing node, in the t iteration round, determining the difference value between the local model parameter updated by the auxiliary computing node in the t-1 iteration round and the third gradient as the local model parameter updated by the auxiliary computing node in the t iteration round; specifically, the secondary computing node may update its own local model parameters by the following equation 3:

equation 3: x _t+1＝x_t-Δ_st;

In the formula 3, x _t+1 is a local model parameter obtained by updating the auxiliary computing node in the t iteration round, and x _t is a local model parameter obtained by updating the auxiliary computing node in the t-1 iteration round; Δ _st is the third gradient sent by the master computing node.

Since there are a plurality of secondary computing nodes, for each secondary computing node, the respective local model parameters can be updated according to the principle of the above formula 3, so the above formula 3 can be modified into the following formula 4:

Equation 4:

in equation 4, i is the identity of the secondary computing node, Is the local model parameter updated by the auxiliary computing node identified as i in the t iteration round,/>The local model parameters are updated by the auxiliary computing node marked as i in the t-1 iteration round; Δ _st is the third gradient sent by the master computing node.

The following describes in detail the process of performing the target processing on the first gradient by the auxiliary computing node based on the first error compensation value in step S402.

In the embodiment of the application, in order to improve the training efficiency of the target model, when the first gradient is subjected to target processing, the first gradient can be processed by combining the first error compensation value with the learning rate of the model parameters, the sliding condition of the first gradient of the model parameters and the momentum acceleration condition; wherein:

In order to improve the learning efficiency of the model parameters, the embodiment of the application is designed to enable the learning rate of the model parameters to be self-adaptive in different iteration rounds; specifically, in the t iteration round, the auxiliary computing node may determine, through an adaptive gradient algorithm, a learning rate of model parameters of the target model in the t iteration round, for example, but not limited to, a learning rate of local model parameters of the auxiliary computing node in the t iteration round may be determined based on the initial learning rate and the iteration round by using a principle of the following formula 5:

equation 5:

In formula 5, α _t is the learning rate of the local model parameter of the auxiliary computing node in the t-th iteration round, α is the preset initial learning rate, and t is the iteration round.

In order to accelerate the updating speed of the model parameters, in the t iteration round, the auxiliary computing node can also determine the gradient sliding average value of the auxiliary computing node in the t iteration round through a self-adaptive gradient algorithm; the gradient moving average value is used for indicating the change degree of the first gradient, and the gradient moving average value of the auxiliary computing node in the t-1 th iteration round can be processed according to the first gradient and the index moving average coefficient, so that the gradient moving average value of the auxiliary computing node in the t-1 th iteration round is obtained, and the index moving average coefficient is obtained according to the initial moving average coefficient and the iteration round.

Specifically, the exponential moving average coefficient of the secondary computing node in the t-th iteration round can be determined by the principle shown in formula 6:

equation 6:

in the formula 6, θ _t is an exponential average coefficient of the auxiliary computing node in the t-th iteration round, θ is a preset initial moving average coefficient, and t is the iteration round.

It should be noted that, in each iteration round, the learning rate and the index moving average coefficient of the model parameters of each auxiliary computing node may be set to be the same or different, and those skilled in the art may set according to actual requirements

Further, the first gradient may be processed by using the index moving average coefficient of the auxiliary computing node in the t iteration round through the following principle of formula 7, to obtain a gradient moving average value of the auxiliary computing node in the t iteration round:

Equation 7: v _t＝θ_t×V_t-1+(1-θ_t)×(g_t)²;

In formula 7, V _t is the gradient running average of the auxiliary computing node in the t-th iteration round, V _t-1 is the gradient running average of the auxiliary computing node in the t-1 st iteration round, and g _t is the first gradient.

Since there are a plurality of auxiliary computing nodes, for each auxiliary computing node, the respective first gradients may be processed according to the principle of the above formula 7 to obtain respective gradient sliding average values, so the above formula 7 may be modified into the following formula 8:

Equation 8:

in formula 8, i is the identification of the auxiliary computing node; is the gradient running average of the secondary compute node identified as i in the iteration round of the t-th,/> Is the gradient running average of the secondary computing node identified as i in the iteration round of t-1; /(I)Is the first gradient of the secondary compute node identified as i in the t-th iteration round.

Further, the gradient momentum of the auxiliary computing node in the t iteration round can be determined by using the preset super-parameters of the momentum according to the following formula 9:

Equation 9: m _t＝β×m_t-1+(1-β)×g_t;

In equation 9, m _t is the gradient momentum of the secondary computing node in the t-th iteration round, m _t-1 is the gradient momentum of the secondary computing node in the t-1 st iteration round, β is the preset hyper-parameter of momentum, and g _t is the first gradient of the local model parameter on the secondary computing node.

Since there are a plurality of auxiliary computing nodes, for each auxiliary computing node, the respective first gradients may be processed according to the principle of the above formula 9 to obtain respective gradient momentums, so the above formula 10 may be transformed into the following formula 10:

Equation 10:

in formula 10, i is the identification of the auxiliary computing node; is the gradient momentum of the secondary compute node identified as i in the iteration round of the nth,/> Is the gradient momentum of the secondary computing node identified as i in the iteration round of t-1; /(I)Is the first gradient of the secondary compute node identified as i in the t-th iteration round.

Further, each auxiliary computing node performs target processing on the first gradient based on the first error compensation value of the target processing corresponding to the auxiliary computing node to obtain a second gradient, including:

in the t iteration round, the auxiliary computing node performs target processing on the first gradient based on the following formula 11 and a first error compensation value corresponding to the auxiliary computing node in the t iteration round to obtain a second gradient:

Equation 11:

In formula 11, Δ _t is the second gradient obtained by the secondary computing node in the t iteration round, and Q _w () is the algorithm of the target processing on the secondary computing node; alpha _t is the learning rate of the local model parameters of the auxiliary computing node in the t iteration round; e _t is the first error compensation value corresponding to the auxiliary computing node in the t iteration round; the E is a preset super parameter; m _t is the gradient momentum of the secondary computing node in the t-th iteration round, and V _t is the gradient running average of the secondary computing node in the t-th iteration round.

Since there are a plurality of auxiliary computing nodes, for each auxiliary computing node, the respective second gradients may be obtained according to the principle of the above formula 11, and thus the above formula 12 may be modified into the following formula 12:

equation 12:

In formula 12, i is the identification of the auxiliary computing node; Is a second gradient obtained by the auxiliary computing node identified as i in the t iteration round; q _w () is the algorithm of target processing on the secondary compute node; /(I) Is the gradient momentum of the secondary computing node identified as i in the t-th iteration round; /(I)Is the gradient running average of the auxiliary computing node identified as i in the t-th iteration round; /(I)Is the first gradient of the secondary computing node identified as i in the t-th iteration round; /(I)Is a first error compensation value corresponding to the auxiliary computing node marked as i in the t iteration round; alpha _t is the learning rate of the local model parameters of the auxiliary computing node in the t iteration round; the e is a preset hyper-parameter.

As an embodiment, after step S402, in the t-th iteration, the auxiliary computing node may further update the first error compensation value corresponding to the t-th iteration according to the following formula 13, to obtain the first error compensation value corresponding to the auxiliary computing node in the t+1th iteration:

equation 13: />

In formula 13, e _t+1 is a first error compensation value corresponding to the auxiliary computing node in the t+1st iteration round, and α _t is a learning rate of the local model parameter of the auxiliary computing node in the t iteration round; m _t is the gradient momentum of the secondary computing node in the t-th iteration round, and V _t is the gradient running average of the secondary computing node in the t-th iteration round; the E is a preset super parameter; e _t is the corresponding first error compensation value of the secondary computing node in the t-th iteration round, and delta _t is the second gradient obtained by the secondary computing node in the t-th iteration round.

Since there are a plurality of auxiliary calculation nodes, for each auxiliary calculation node, the respective first error compensation values may be updated according to the principle of the above formula 13, and thus the above formula 11 may be modified into the following formula 14:

equation 14:

in formula 14, i is the label of the auxiliary computing node; The first error compensation value corresponding to the auxiliary computing node marked as i in the t+1th iteration round, and alpha _t is the learning rate of the local model parameter of the auxiliary computing node in the t iteration round; /(I) Is the gradient momentum of the secondary compute node identified as i in the t-th iteration round,/>Is the gradient running average of the auxiliary computing node identified as i in the t iteration round; the E is a preset super parameter; /(I)Is the corresponding first error compensation value of the auxiliary computing node identified as i in the t iteration round,/>Is the second gradient that the secondary computing node identified as i gets in the t-th iteration round.

Based on the application scenario of fig. 1, an embodiment of the present application provides a method for updating model parameters of a main computing node 110, please refer to fig. 5, and an exemplary diagram of a method for updating model parameters is provided, and the method is applied to the main computing node 110 in the interaction operation of each iteration round in the above-mentioned multiple iteration rounds, and specifically includes the following steps:

step S501, performing target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient; the second gradient is obtained by performing target processing on the first gradients of the local model parameters based on the first error compensation values of the target processing corresponding to the auxiliary computing nodes in each of the at least two auxiliary computing nodes; the local model parameters include model parameters of the target model, and the first gradient is determined by the auxiliary computing nodes according to training samples of the target model.

Specifically, the second error compensation value corresponding to the main calculation node in the t iteration round is the accumulated value of the second error of the target processing in the 1 st to t-1 iteration rounds of the main calculation node; and when t is a positive integer and t is 1, the second error compensation value corresponding to the main calculation node in the 1 st iteration round is a preset compensation value.

Step S502, the third gradient is sent to each auxiliary computing node, so that each auxiliary computing node updates each local model parameter by using the third gradient.

In particular, the respective secondary computing nodes update their respective local model parameters using the third gradient, see the above description, which is not repeated here.

In step S503, the local model parameters of the host are updated by using the third gradient.

Specifically, in the t-th iteration round, the master computing node may determine a difference value between the local model parameter updated by the master computing node in the t-1 st iteration round and the third gradient as a model parameter updated by the master computing node in the t-th iteration round; specifically, the host computing node may update its own local model parameters by equation 15 as follows:

Equation 15: x _t+1＝X_t-Δ_st;

In formula 15, X _t+1 is a local model parameter updated by the main computing node in the t-1 th iteration round, and X _t is a local model parameter updated by the main computing node in the t-1 th iteration round; Δ _st is the third gradient described above.

In the following, a procedure of performing target processing on the second gradient by the main calculation node based on the second error compensation value in the above step S501 will be described.

Specifically, in the t-th iteration round in step S501, the master computing node may determine an average gradient of the second gradients of the at least two slave computing nodes; the average gradient of the second gradient is determined as follows by equation 16:

Equation 16:

in formula 16, i is the identity of the auxiliary computing nodes, and N is the total number of the auxiliary computing nodes; Is the average gradient of the second gradients of the at least two auxiliary computing nodes; /(I) Is the second gradient that the secondary compute node identified as i gets in the t-th iteration round,/>The calculation method of (2) can be described with reference to the above formula 11 and formula 12, and the description thereof will not be repeated here.

Further, the main computing node performs the target processing on the average gradient based on the second error compensation value corresponding to the main computing node in the t-th iteration round according to the following formula 17, to obtain the third gradient:

Equation 17:

in equation 17 A third gradient obtained in the t iteration round for the main computing node; q _s () is the algorithm of the target process on the above-described main computing node; /(I)Is the average gradient of the second gradients of the at least two auxiliary computing nodes; e _t is the corresponding second error compensation value of the master computing node in the t-th iteration round.

As an embodiment, after step S501, in the t-th iteration, the main computing node may further update the second error compensation value corresponding to the t-th iteration according to the following formula 18, to obtain the second error compensation value corresponding to the main computing node in the t+1th iteration:

Equation 18:

In equation 18, E _t+1 is the second error compensation value corresponding to the main computation node in the t+1st iteration round; e _t is a second error compensation value corresponding to the main computing node in the t iteration round; Is the average gradient of the second gradients of the at least two auxiliary computing nodes; /(I) And (3) calculating a third gradient of the node in the t iteration round for the main computing node.

A specific example of the interaction of the primary and secondary computing nodes is provided below.

In this example, if the target process of the main computing node (server) and each auxiliary computing node (worker) is quantization process, the process of updating the image classification model on the main computing node and the auxiliary computing node is as follows:

And (one) initializing each computing node.

Initialization of the master computing node 110: obtaining a network structure of the target model, a training sample of the target model, initial model parameters of the target model, and an algorithm Q _s () for setting target processing on the main computing node, taking the obtained initial model parameters as local model parameters on the main computing node in the 1 st iteration round, and sending the obtained network structure of the target model, the training sample of the target model, and the initial model parameters of the target model to each auxiliary computing node 120.

Initialization of each secondary computing node 120: taking the received initial model parameters as local model parameters in the 1st iteration round, taking the received network structure as the network structure of the target model, and taking the received training sample as training data of the target model; and setting an algorithm Q _w (), an initial learning rate alpha _t, an initial moving average coefficient theta, a preset super parameter beta of momentum and a super parameter epsilon of target processing on the main computing node.

Setting a maximum iteration round T of the interaction operation of the main computing node and the auxiliary computing node, so that the training of the target model is finished after the main computing node and the auxiliary computing node perform the T iteration round.

And (II) interaction operation process of each computing node.

Referring to fig. 6, in each iteration round of the multiple iteration rounds of the primary and secondary computing nodes, the interaction procedure of the primary and secondary computing nodes is as follows:

in step S601, each secondary computing node determines a learning rate of the local model parameter in each t-th iteration round based on the initial learning rate and the iteration round.

Specific procedures can be found in the above description of equation 5, and the description thereof will not be repeated here.

In step S602, each secondary computing node determines an exponential moving average coefficient of the local model parameter in each of the t-th iteration round based on the initial exponential average coefficient and the iteration round.

Specific procedures can be found in the above description of equation 6, and the description thereof will not be repeated here.

In step S603, each secondary computing node processes each first gradient using the index moving average coefficient in each t-th iteration round, to obtain a gradient moving average value in each t-th iteration round.

Specific procedures can be found in the above description of formula 7 and formula 8, and the description will not be repeated here.

In step S604, each secondary computing node determines the gradient momentum in each iteration round t using the preset hyper-parameters of momentum.

Specific procedures can be found in the above description of formula 9 and formula 10, and the description will not be repeated here.

In step S605, each auxiliary computing node performs target processing on each first gradient based on the corresponding first error compensation value in each t-th iteration round, to obtain each second gradient.

Specific procedures can be found in the above description of formula 11 and formula 12, and the description will not be repeated here.

In step S606, each secondary computing node sends the respective second gradient to the primary computing node.

In step S607, each auxiliary computing node updates the first error compensation value corresponding to each iteration round t by using the second gradient, to obtain the first error compensation value corresponding to each iteration round t+1.

Specific procedures are described with reference to the above-mentioned formulas 13 and 14, and the description thereof will not be repeated here.

In step S608, the primary computing node determines an average gradient of the second gradients of all the secondary computing nodes.

Specific procedures are described with reference to equation 16 above, and will not be repeated here.

In step S609, the main computing node performs target processing on the average gradient by using the second error compensation value corresponding to the t iteration round of the main computing node, so as to obtain a third gradient.

Specific procedures can be found in the description of equation 17 above, and the description will not be repeated here.

In step S610, the primary computing node transmits the third gradient to each secondary computing node.

In step S611, each secondary computing node updates the respective local model parameters using the received third gradient.

Specific procedures can be found in the above description of the formula 3 and the formula 4, and the description thereof will not be repeated here.

In step S612, the master computing node uses the third gradient to further its own local model parameters.

Specific procedures are described with reference to equation 15 above, and will not be repeated here.

In step S613, the main computing node updates the second error compensation value corresponding to the t-th iteration round by using the third gradient, to obtain the second error compensation value corresponding to the main computing node in the t+1th iteration round.

Specific procedures are described with reference to equation 18 above and will not be repeated here.

It should be noted that, step S607 may be any step after step S605, step S610 and step S611 may be any step after step S609, and step S612 may be no sequence.

The experimental effect of the model parameter updating method provided by the embodiment of the application is described below.

The following uses the model parameter updating method in the embodiment of the present application in the image classification field, the target model is an image classification model with a network structure of RestNet-18, the target process is quantization process, and experimental data (training sample and test sample of the target model) are CIFAR-100 as an example to describe the experimental effect, please refer to fig. 7 and 8.

FIG. 7 shows the experimental results on training samples, wherein the horizontal values in FIG. 7 are the number of iterative rounds, and the vertical values are the loss values of the image classification model; the left graph shows experimental data quantized to different numbers of bits, and the right graph shows experimental results of whether error compensation values are adopted on the main calculation node and the auxiliary calculation node. The left graph comprises a change condition of the loss value of the image classification model along with the increase of iteration rounds under the conditions of a first parameter training mode (flaot), a second parameter training mode (6 bits) which is quantized to 6bits and a third parameter training mode (5 bits) which is quantized to 5 bits, wherein the loss value of the image classification model in the third parameter training mode (5 bits) is slightly higher than the loss value of the image classification model in the second parameter training mode (6 bits) along with the increase of iteration rounds; the right graph comprises a first processing mode (flaot) adopting full-precision parameter training, a second processing mode (6 bits) adopting bidirectional quantization and bidirectional error compensation, a third processing mode (6 bits.err_w) adopting bidirectional quantization and error compensation on an auxiliary computing node, and a fourth processing mode (6 bits.err_s) adopting bidirectional quantization and error compensation on a main computing node, so that the change condition of the loss value of the image classification model along with the increase of iteration rounds can be seen, and the two modes with the highest loss value of the image classification model along with the increase of iteration rounds are respectively 6bits.err_w and 3924.err_w, and the loss values of the two modes of flaot and 6bits are not obviously different.

It should be noted that, in the embodiment of the present application, bidirectional quantization refers to performing quantization processing on related data on a main computing node and on an auxiliary computing node; the bidirectional error compensation in the embodiment of the present application refers to performing quantization processing by using the second error compensation value at the main computing node and performing quantization processing by using the first error compensation value at the auxiliary computing node, where the quantization processing is one of the target processing in the embodiment of the present application.

FIG. 8 shows the results of experiments on test specimens; in fig. 8, the horizontal numerical value is the number of iterative rounds, and the vertical numerical value is the classification accuracy of image classification by the image classification model; the left graph shows experimental data quantized to different numbers of bits, and the right graph shows experimental results of whether error compensation values are adopted on the main calculation node and the auxiliary calculation node. The left graph comprises a change condition of the classification accuracy of the image classification model along with the increase of iteration rounds under the conditions of a first parameter training mode (flaot), a second parameter training mode (6 bits) which is quantized to 6bits and a third parameter training mode (5 bits) which is quantized to 5 bits, and the classification accuracy of the image classification model can be seen to be 6bits, a flag 32 and 5 bits from high to low along with the increase of iteration rounds; the right graph comprises a first processing mode (flaot) adopting full-precision parameter training, a second processing mode (6 bits) adopting bidirectional quantization and bidirectional error compensation, a third processing mode (6bits.err_w) adopting bidirectional quantization and error compensation on an auxiliary computing node, and a fourth processing mode (6bits.err_s) adopting bidirectional quantization and error compensation on a main computing node, and the classification accuracy of the image classification model is changed along with the increase of iteration rounds, so that the classification accuracy of the image classification model is respectively 6bits, flag 32, bits.err_w and 6bits.err_s from high to low along with the increase of iteration rounds, and the classification accuracy of the image classification model is not obviously different from that of the 6bits and flaot.

The following description uses the model parameter updating method in the embodiment of the present application to be applied to the field of binary emotion classification in natural language processing, where the target model is a binary emotion classification model with a network structure of a GRU (gated circular convolution network), the experimental data (training sample and test sample of the target model) is an IMDB, and the experimental data includes 50000 movie reviews as an example, and the experimental effect is shown in fig. 9 and 10.

FIG. 9 shows the experimental results on training samples, wherein the horizontal values in FIG. 9 are the number of iterative rounds, and the vertical values are the loss values of the image classification model; the left graph shows experimental data quantized to different numbers of bits, and the right graph shows experimental results of whether error compensation values are adopted on the main calculation node and the auxiliary calculation node. The left graph comprises a change condition of a loss value of the binary emotion classification model along with the increase of iteration rounds under the conditions of adopting a fourth parameter training mode (flaot) with full precision, a fifth parameter training mode (5 bits) with 5bits quantization and a sixth parameter training mode (3 bits) with 3 bits quantization, and the loss value of the binary emotion classification model can be seen to be 3 bits, 5bits and flaot from high to low sequentially along with the increase of iteration rounds; the right graph includes a fifth processing mode (flaot bits) adopting full-precision parameter training, a sixth processing mode (5 bits) adopting bidirectional quantization and bidirectional error compensation, a seventh processing mode (5 bits. Err_w) adopting bidirectional quantization and error compensation on an auxiliary computing node, and a change condition that the loss value of the image classification model occurs along with the increase of iteration rounds under the eighth processing mode (5 bits. Err_s) adopting bidirectional quantization and error compensation on a main computing node, so that the loss value of the binary emotion classification model is sequentially from high to low as the iteration rounds increase, namely, 5bits. Err_w, 5bits and flaot.

FIG. 10 shows the results of experiments on test specimens; in fig. 10, the horizontal numerical value is the number of iterative rounds, and the vertical numerical value is the classification accuracy of the binary emotion classification model for classification; the left graph shows experimental data quantized to different numbers of bits, and the right graph shows experimental results of whether error compensation values are adopted on the main calculation node and the auxiliary calculation node. The left graph comprises a fourth parameter training mode (flaot) with full precision, a fifth parameter training mode (5 bits) with 5bits of quantization and a sixth parameter training mode (3 bits) with 3 bits of quantization, and the classification accuracy of the binary emotion classification model is changed along with the increase of iteration rounds, so that the classification accuracy of the binary emotion classification model in the three modes of flaot, 5bits and 3 bits is not obviously different along with the increase of iteration rounds; the right graph comprises a fifth processing mode (flaot bits) adopting full-precision parameter training, a sixth processing mode (5 bits) adopting bidirectional quantization and bidirectional error compensation, a seventh processing mode (5 bits. Err_w) adopting bidirectional quantization and error compensation on an auxiliary computing node, and a change condition that the classification accuracy of the binary emotion classification model occurs along with the increase of iteration rounds under the eighth processing mode (5 bits. Err_s) adopting bidirectional quantization and error compensation on a main computing node, so that the classification accuracy of the binary emotion classification model is highest under the 5bits and flaot modes along with the increase of iteration rounds, and the classification accuracy of the binary emotion classification model under the 5bits and flaot modes is not obviously different, and the classification accuracy of the binary emotion classification model under the 5bits. Err_s is obviously higher than that of the 5bits.

The experimental results show that when the model parameter updating method in the embodiment of the application is used for training the target model, the training error and the generalization error which are the same as those achieved by a full-precision algorithm (float) can be ensured by using only 1/5 of the traffic per month, so that the method in the embodiment of the application greatly reduces the requirement of the communication quality of the distributed training; in addition, as can be seen from the experimental results, the bidirectional error compensation strategy introduced in the embodiment of the application remarkably maintains the stability of the training algorithm and ensures the convergence of the training algorithm, thereby reducing the possibility that the target model is not converged in the distributed training process, and remarkably improving the training efficiency of the target model; on the other hand, the influence of the error of the target processing on the loss function value of the target model is reduced, so that the convergence speed of the target model in the distributed training process is further increased, the training efficiency of the target model is further improved, and the accuracy of the target model is also improved by reducing the error of the target processing.

Referring to fig. 11, based on the same inventive concept, an embodiment of the present application provides a model parameter updating apparatus 1100, including:

A gradient processing unit 1101, configured to perform the target processing on the second gradient based on a second error compensation value of the target processing corresponding to the main computing node, to obtain a third gradient; the second gradient is obtained by performing target processing on the first gradients of the local model parameters based on first error compensation values of target processing corresponding to the auxiliary computing nodes in each of the at least two auxiliary computing nodes; the local model parameters comprise model parameters of a target model, the first gradient is determined by the auxiliary computing nodes according to training samples of the target model, and the first gradient is used for indicating the change degree of the local model parameters on the auxiliary computing nodes;

An information sending unit 1102, configured to send the third gradient to each of the auxiliary computing nodes, so that each of the auxiliary computing nodes updates each local model parameter by using the third gradient; and

A parameter updating unit 1103, configured to update the local model parameter on the main computing node by using the third gradient.

As one embodiment, the main computing node performs the interactive operation of multiple iteration rounds with the auxiliary computing nodes, wherein the second error compensation value corresponding to the main computing node in the t iteration round is the accumulated value of the second error of the target processing performed by the main computing node in the 1 st to t-1 th iteration rounds; and when t is a positive integer and t is 1, the second error compensation value corresponding to the main calculation node in the 1 st iteration round is a preset compensation value.

As an embodiment, the gradient processing unit 1101 is specifically configured to:

determining an average gradient of the second gradients of the at least two auxiliary computing nodes in the t iteration round;

And performing the target processing on the average gradient based on a second error compensation value corresponding to the t iteration round by the main computing node according to the following formula to obtain the third gradient:

wherein, the above A third gradient obtained in the t-th iteration round for the master computing node; the Q _s () is an algorithm of the target process on the main computing node; above/>Is the average gradient, and E _t is a second error compensation value corresponding to the main computation node in the t-th iteration round.

As an embodiment, the gradient processing unit 1101 is further configured to: in the t iteration round, updating the second error compensation value corresponding to the main calculation node in the t iteration round according to the following formula to obtain the second error compensation value corresponding to the t+1th iteration round of the main calculation node:

As an embodiment, the parameter updating unit 1103 is specifically configured to:

As one example, the target process satisfies the following condition:

the target processing error value is not greater than the error threshold; the target processing error value is an error value between a data amount of original data and a data amount of processing data corresponding to the original data, the processing data being obtained by performing the target processing on the original data; the error threshold is determined according to the original data and a preset error parameter; and

The data amount of the processed data obtained by the target processing is not more than the original data corresponding to the processed data.

As one example, the target model is a distributed training neural network model,

When the model parameter updating device is applied to the information classification field, the target model is an information classification model, and the model parameters are parameters indicating the corresponding relation between information characteristics and information types in the information classification model; or (b)

When the model parameter updating device is applied to the information detection field, the target model is an information detection model, and the model parameters are parameters indicating the corresponding relation between information characteristics and detection results in the information prediction model.

As an example, the apparatus of FIG. 11 may be used to implement any of the model parameter updating methods discussed above on the master computing node.

Referring to fig. 12, based on the same inventive concept, an embodiment of the present application provides a model parameter updating apparatus 1200, including:

A first gradient processing unit 1201, configured to determine a first gradient of a local model parameter on the auxiliary computing node according to a training sample of the target model; the local model parameters include model parameters of the target model, and the first gradient is used for indicating the change degree of the local model parameters on the auxiliary computing node;

a second gradient processing unit 1202 that performs target processing on the first gradient based on a first error compensation value of target processing corresponding to the auxiliary computing node to obtain a second gradient; the above-described target processing includes an operation of reducing the data amount of the data;

An information sending unit 1203 configured to send the second gradient to the master computing node, so that the master computing node performs the target processing on the second gradients of at least two auxiliary computing nodes to obtain a third gradient based on a second error compensation value of the target processing corresponding to the master computing node, and so that the master computing node updates a local model parameter on the master computing node according to the third gradient;

A parameter updating unit 1204, configured to update the local model parameter on the secondary computing node with the third gradient received from the primary computing node.

As one embodiment, the auxiliary computing node performs interactive operation of multiple iteration rounds with the main computing node; the first error compensation value corresponding to the auxiliary calculation node in the t iteration round is an accumulated value of a difference value between the first error of the target processing and the second gradient obtained in the t-1 iteration round in the 1 st to t-1 iteration rounds of the auxiliary calculation node; and when t is a positive integer and t is 1, the first error compensation value of the 1 st iteration round is a preset compensation value.

As an embodiment, the second gradient processing unit 1202 is specifically configured to:

In the t iteration round, performing the target processing on the first gradient based on the following formula and a first error compensation value corresponding to the auxiliary computing node in the t iteration round, to obtain the second gradient:

Wherein, Δ _t is the second gradient obtained by the secondary computing node in the t-th iteration round, and Q _w () is the algorithm of the target processing on the secondary computing node; the α _t is a learning rate of the local model parameter in the t-th iteration round by the auxiliary computing node, and the learning rate is determined based on an initial learning rate and the iteration round; the e _t is a first error compensation value corresponding to the auxiliary computing node in the t-th iteration round; the E is a preset super parameter; the m _t is a gradient momentum of the auxiliary computing node in the t-th iteration round, the gradient momentum is determined according to the first gradient, and the gradient momentum is used for indicating the degree of the first gradient advancing to the change direction indicated by the objective function; the V _t is a gradient running average of the auxiliary computing node in the t-th iteration round, the gradient running average is used for indicating the change degree of the first gradient, the V _t is obtained by processing the gradient running average V _t-1 of the auxiliary computing node in the t-1 th iteration round according to the first gradient and an index running average coefficient, and the index running average coefficient is obtained according to an initial running average coefficient and the iteration round.

As an embodiment, the second gradient processing unit 1202 is further configured to:

In the t iteration round, updating the first error compensation value corresponding to the auxiliary calculation node in the t iteration round according to the following formula to obtain the first error compensation value corresponding to the auxiliary calculation node in the t+1th iteration round:

wherein e _t+1 is a first error compensation value corresponding to the auxiliary node in the t+1st iteration round.

As an embodiment, the parameter updating unit 1204 is specifically configured to:

As one example, the above target process satisfies the following condition:

The target processing error is not greater than the error threshold; the target processing error is a deviation value between a data amount of original data and a data amount of processed data corresponding to the original data, the processed data being obtained by performing the target processing on the original data; the error threshold is determined according to the original data and a preset error parameter; and

As an embodiment, the apparatus in fig. 12 may be used to implement any of the model parameter updating methods on the secondary computing nodes discussed above.

The above-mentioned model parameter updating apparatus 1100 is a computer device as shown in fig. 13 as an example of a hardware entity, and the computer device includes a processor 1301, a storage medium 1302, and at least one external communication interface 1303; the processor 1301, the storage medium 1302, and the external communication interface 1303 are all connected through a bus 1304.

The storage medium 1302 has stored therein a computer program;

Processor 1301, when executing the computer program, implements the model parameter updating method of master computing node 110 discussed previously.

One processor 1301 is illustrated in fig. 13, but the number of processors 1301 is not limited in practice.

Wherein the storage medium 1302 may be a volatile memory (RAM) such as a random-access memory (RAM); the storage medium 1302 may also be a non-volatile memory medium (non-volatile memory), such as a read-only memory medium, a flash memory medium (flash memory), a hard disk (HARD DISK DRIVE, HDD) or a solid state disk (solid-STATE DRIVE, SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The storage medium 1302 may be a combination of the above storage media.

As an example of the hardware entity of the model parameter updating apparatus 1200, reference may be made to a computer device shown in fig. 13, wherein when the computer device shown in fig. 13 is used as the hardware entity of the model parameter updating apparatus 1200, a computer program is stored in the storage medium 1302; processor 1301, when executing the computer program, implements the model parameter updating method of secondary computing node 120 discussed previously.

Based on the same technical idea, embodiments of the present application also provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the model parameter updating method provided by the embodiment of the application.

Based on the same technical idea, an embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the model parameter updating method as previously discussed.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A model parameter updating system is characterized by comprising a main computing node and at least two auxiliary computing nodes,

each auxiliary computing node determines a first gradient of each local model parameter according to a training sample of the target model, performs target processing on the first gradient based on a first error compensation value of target processing corresponding to the auxiliary computing node to obtain a second gradient, and sends the second gradient to the main computing node; and using a third gradient received from the master computing node, combining the formulas: x _t+1＝x_t-Δ_st, updating respective local model parameters, wherein x _t+1 is a local model parameter updated in the t-th iteration round, x _t is a local model parameter updated in the t-1 st iteration round, and delta _st is the third gradient; the first gradient is used for indicating the change degree of local model parameters on the auxiliary computing node, and the target processing comprises the operation of reducing the data quantity of the data;

The main computing node performs target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient, and sends the third gradient to each auxiliary computing node; and using the third gradient, combining the formula: x _t+1＝x_t-Δ_st, updating local model parameters on the main computing node, wherein x _t+1 is a local model parameter updated in the t iteration round, x _t is a local model parameter updated in the t-1 iteration round, and delta _st is the third gradient; and combining the formula using the third gradient Updating the second error compensation value, E _t+1 is the corresponding second error compensation value in the (t+1) th iteration round, E _t is the corresponding second error compensation value in the (t) th iteration round,/>Is the average gradient of the second gradients of at least two of said secondary computing nodes,/>The third gradient is that t is a positive integer, and when t is 1, the second error compensation value in the 1 st iteration round is a preset compensation value;

the main computing node performs the target processing on the second gradient based on a second error compensation value of the target processing corresponding to the main computing node to obtain a third gradient, which includes:

In the t-th iteration round, the primary computing node determines an average gradient of a second gradient of the at least two secondary computing nodes;

The main computing node performs the target processing on the average gradient based on a second error compensation value corresponding to the main computing node in the t iteration round according to the following formula, so as to obtain the third gradient:

Wherein the said A third gradient for the master computing node obtained in the t-th iteration round; said Q _s () is the algorithm of said target process on said master computing node; said/>Is the average gradient, and E _t is a second error compensation value corresponding to the main computing node in the t iteration round;

before each auxiliary computing node and the main computing node perform interactive operation of a plurality of iteration rounds, performing initialization operation in a node interactive mode; the initializing operation is performed in a node interaction mode, and specifically includes:

The main computing node acquires a network structure, a training sample and initial model parameters of the target model, takes the acquired initial model parameters as local model parameters on the main computing node in a first iteration round, and sends the acquired network structure, the training sample and the initial model parameters to each auxiliary computing node, so that each auxiliary computing node takes the received initial model parameters as the local model parameters in the first iteration round, takes the received network structure as the network structure of the target model, and takes the received training sample as the training sample of the target model.

2. The system of claim 1, wherein the target model is a distributed training neural network model,

When the model parameter updating method is applied to the field of information classification, the target model is an information classification model, and the model parameters are parameters indicating the corresponding relation between information characteristics and information types in the information classification model; or (b)

When the model parameter updating method is applied to the field of information detection, the target model is an information detection model, and the model parameters are parameters indicating the corresponding relation between information characteristics and detection results in the information detection model.

3. A method for updating model parameters, comprising:

The main computing node performs target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient; the second gradient is obtained by performing target processing on the first gradients of the local model parameters based on first error compensation values of target processing corresponding to the auxiliary computing nodes in each of the at least two auxiliary computing nodes; the local model parameters comprise model parameters of a target model, the first gradient is determined by each auxiliary computing node according to training samples of the target model, and the first gradient is used for indicating the change degree of the local model parameters on the auxiliary computing nodes; the initial local model parameters on the auxiliary computing nodes are received from the main computing nodes in a node interaction mode;

The master computing node utilizes the third gradient in combination with the formula: x _t+1＝x_t-Δ_st, updating local model parameters on the main computing node, wherein x _t+1 is a local model parameter updated in the t iteration round, x _t is a local model parameter updated in the t-1 iteration round, and delta _st is the third gradient;

The main computing node utilizes the third gradient to combine the formula Updating the second error compensation value, E _t+1 is the corresponding second error compensation value in the (t+1) th iteration round, E _t is the corresponding second error compensation value in the (t) th iteration round,/>Is the average gradient of the second gradients of at least two of said secondary computing nodes,/>The third gradient is that t is a positive integer, and when t is 1, the second error compensation value in the 1 st iteration round is a preset compensation value;

wherein the target process satisfies the following condition: the target processing error of the target processing is not greater than the error threshold; the data volume of the processed data obtained by the target processing is smaller than the original data corresponding to the processed data.

4. The method of claim 3, wherein the target model is a distributed training neural network model,

5. A method for updating model parameters, comprising:

The auxiliary computing node determines a first gradient of a local model parameter on the auxiliary computing node according to a training sample of a target model; the local model parameters include model parameters of the target model, the first gradient being used to indicate a degree of variation of the local model parameters on the secondary computing node; the training samples and the initial local model parameters are received by the auxiliary computing node from the main computing node in a node interaction mode;

the auxiliary computing node sends the second gradient to a main computing node so that the main computing node performs target processing on the second gradients of at least two auxiliary computing nodes based on second error compensation values of target processing corresponding to the main computing node to obtain a third gradient, and the main computing node updates local model parameters on the main computing node according to the third gradient;

The secondary computing node utilizes a third gradient received from the primary computing node in conjunction with the formula: x _t+1＝x_t-Δ_st, updating local model parameters on the auxiliary computing node, wherein x _t+1 is a local model parameter updated in the t iteration round, x _t is a local model parameter updated in the t-1 iteration round, and delta _st is the third gradient;

The main computing node utilizes the third gradient to combine the formula Updating the second error compensation value, E _t+1 is the corresponding second error compensation value in the (t+1) th iteration round, E _t is the corresponding second error compensation value in the (t) th iteration round,/>Is the average gradient of the second gradients of at least two of said secondary computing nodes,/>And when t is a positive integer and t is 1, the second error compensation value in the 1 st iteration round is a preset compensation value.

6. The method of claim 5, wherein the secondary computing node interoperates with the primary computing node for a plurality of iterative rounds; the first error compensation value corresponding to the auxiliary computing node in the t iteration round is an accumulated value of a difference value between the first error of the target processing and the second gradient obtained in the t-1 iteration round in the 1 st to t-1 iteration rounds of the auxiliary computing node; and when t is a positive integer and t is 1, the first error compensation value of the 1 st iteration round is a preset compensation value.

7. The method of claim 6, wherein the secondary computing node performing the target processing on the first gradient to obtain a second gradient based on a first error compensation value of the target processing corresponding to the secondary computing node, comprising:

In the t iteration round, the auxiliary computing node performs the target processing on the first gradient based on the following formula and a first error compensation value corresponding to the auxiliary computing node in the t iteration round to obtain the second gradient:

8. The method of any one of claim 5-7, wherein the target model is a distributed training neural network model,

9. A model parameter updating apparatus, characterized by comprising:

The gradient processing unit is used for carrying out target processing on the second gradient based on a second error compensation value of target processing corresponding to the main computing node to obtain a third gradient; the second gradient is obtained by performing target processing on the first gradients of the local model parameters based on first error compensation values of target processing corresponding to the auxiliary computing nodes in each of the at least two auxiliary computing nodes; the local model parameters comprise model parameters of a target model, the first gradient is determined by each auxiliary computing node according to training samples of the target model, and the first gradient is used for indicating the change degree of the local model parameters on the auxiliary computing nodes; the initial local model parameters on the auxiliary computing nodes are received from the main computing nodes in a node interaction mode;

A parameter updating unit, configured to use the third gradient to combine the formula: x _t+1＝x_t-Δ_st, updating local model parameters on the main computing node, wherein x _t+1 is a local model parameter updated in the t iteration round, x _t is a local model parameter updated in the t-1 iteration round, and delta _st is the third gradient; and combining the formula using the third gradientUpdating the second error compensation value, E _t+1 is the corresponding second error compensation value in the (t+1) th iteration round, E _t is the corresponding second error compensation value in the (t) th iteration round,/>Is the average gradient of the second gradients of at least two of said secondary computing nodes,/>The third gradient is that t is a positive integer, and when t is 1, the second error compensation value in the 1 st iteration round is a preset compensation value;

Wherein, the gradient processing unit is specifically configured to:

10. A model parameter updating apparatus, characterized by comprising:

The first gradient processing unit is used for determining a first gradient of the local model parameter on the auxiliary computing node according to the training sample of the target model; the local model parameters include model parameters of the target model, the first gradient being used to indicate a degree of variation of the local model parameters on the secondary computing node; the training samples and the initial local model parameters are received by the auxiliary computing node from the main computing node in a node interaction mode;

the information sending unit is used for sending the second gradients to a main computing node, so that the main computing node performs target processing on the second gradients of at least two auxiliary computing nodes based on second error compensation values of target processing corresponding to the main computing node to obtain a third gradient, and the main computing node updates local model parameters on the main computing node according to the third gradient;

A parameter updating unit for using the third gradient received from the master computing node in combination with the formula: x _t+1＝x_t-Δ_st, updating local model parameters on the auxiliary computing node, wherein x _t+1 is a local model parameter updated in the t iteration round, x _t is a local model parameter updated in the t-1 iteration round, and delta _st is the third gradient;