CN113283596B - Model parameter training method, server, system and storage medium - Google Patents

Model parameter training method, server, system and storage medium Download PDF

Info

Publication number
CN113283596B
Authority
CN
China
Prior art keywords
parameters
parameter
server
training
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110542415.1A
Other languages
Chinese (zh)
Other versions
CN113283596A (en)
Inventor
董星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110542415.1A priority Critical patent/CN113283596B/en
Publication of CN113283596A publication Critical patent/CN113283596A/en
Application granted granted Critical
Publication of CN113283596B publication Critical patent/CN113283596B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure relates to a model parameter training method, a server, a system and a storage medium, and relates to the technical field of machine learning. The embodiments of the disclosure at least solve the problem in the related art that parameter training of a prediction model takes a long time. The method is applied to a working server of a distributed system and comprises the following steps: acquiring current embedded parameters corresponding to training samples of a current batch; acquiring network parameters currently stored by the working server from the working server; performing iterative training on the current embedded parameters and the network parameters currently stored by the working server based on the training samples of the current batch, so as to obtain embedded parameter gradients and network parameter gradients; updating the current embedded parameters based on the embedded parameter gradients, and synchronizing the updated embedded parameters to a parameter server; and updating the network parameters currently stored by the working server based on the network parameter gradients.

Description

Model parameter training method, server, system and storage medium
Technical Field
The disclosure relates to the technical field of machine learning, and in particular relates to a model parameter training method, a server, a system and a storage medium.
Background
Currently, the related art uses a deep neural network as a prediction model for determining a click-through rate (CTR), where the network parameters in the prediction model are obtained by training a large-scale sparse model based on a distributed parameter server (Parameter Server, PS) architecture. Specifically, the PS architecture includes a parameter server and a working server. During training, the working server obtains training samples from the outside, obtains model parameters from the parameter server, performs iterative training on the model parameters, and sends the gradients of the model parameters obtained by training to the parameter server; the parameter server then updates the model parameters it stores according to those gradients.
However, the parameter server mainly uses a central processing unit (CPU) to update the model parameters, while the working server mainly uses a graphics processing unit (GPU) to perform the iterative training. During iterative training, a large number of model parameters and gradients need to be transmitted between the parameter server and the working server, so the data transmission volume between the CPU and the GPU is large, which results in a long overall training time and a low hardware resource utilization rate.
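As context only, the following Python sketch illustrates the conventional PS-architecture loop described above, in which model parameters and gradients cross the parameter-server/worker boundary on every batch; the function and method names (get_batch, pull_params, push_gradients, train) are hypothetical stand-ins and are not defined by this disclosure.

```python
# Illustrative sketch only: the prior-art PS loop, showing why model parameters and
# gradients are transmitted between the CPU-side parameter server and the GPU-side
# worker on every batch. All interfaces are assumed for illustration.
def prior_art_training_loop(sample_server, parameter_server, train, num_batches):
    for _ in range(num_batches):
        batch = sample_server.get_batch()              # training samples from the outside
        params = parameter_server.pull_params(batch)   # CPU-side PS -> GPU-side worker
        grads = train(batch, params)                   # iterative training on the GPU
        parameter_server.push_gradients(grads)         # GPU-side worker -> CPU-side PS
        # The parameter server then updates its stored model parameters from the gradients.
```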
Disclosure of Invention
The disclosure provides a model parameter training method, a server, a system and a storage medium, so as to at least solve the problem in the related art that parameter training of a prediction model takes a long time. The technical solution of the present disclosure is as follows:
According to a first aspect of embodiments of the present disclosure, there is provided a parameter training method of a prediction model, applied to a working server of a distributed system, including: acquiring current embedded parameters corresponding to training samples of a current batch; acquiring network parameters currently stored in a working server from the working server; based on the training samples of the current batch, carrying out iterative training on the current embedded parameters and the network parameters currently stored by the working server so as to obtain embedded parameter gradients and network parameter gradients; updating the current embedded parameters based on the embedded parameter gradient, and synchronizing the updated embedded parameters to a parameter server; based on the network parameter gradient, the network parameters currently stored by the working server are updated.
The current embedded parameters are pre-stored in a working server, and the parameter training method further comprises the following steps: under the condition that the current stored embedded parameters of the working server comprise part of the embedded parameters of the current embedded parameters, acquiring the difference embedded parameters from the parameter server and storing the difference embedded parameters; the difference embedding parameters comprise embedding parameters of the current embedding parameters except for partial embedding parameters;
Or under the condition that the currently stored embedded parameters of the working server do not comprise the current embedded parameters, acquiring the current embedded parameters from the parameter server and storing the current embedded parameters.
Optionally, in the case that the current batch is the first batch, before acquiring the network parameters currently stored in the working server from the working server, the parameter training method further includes: network parameters of the predictive model from the parameter server are received and stored.
Optionally, in the case that the current lot is a lot other than the first lot, before acquiring the network parameters currently stored in the work server from the work server, the parameter training method further includes:
combining the plurality of historical network parameter gradients and storing network parameters obtained by combining; the plurality of historical network parameter gradients includes network parameter gradients trained based on training samples of a previous batch of the current batch.
Optionally, the above parameter training method further includes: and in the process of carrying out iterative training on the current embedded parameters and the network parameters currently stored by the working server, acquiring a training sample of the next batch, and acquiring the embedded parameters corresponding to the training sample of the next batch based on the training sample of the next batch.
Optionally, the above parameter training method further includes:
Acquiring target network parameters and sending the target network parameters to a parameter server; the target network parameters include network parameters that are trained based on the last batch of training samples.
According to a second aspect of the embodiments of the present disclosure, there is provided a work server including an acquisition unit, a training unit, an update unit, and a transmission unit; the acquisition unit is used for acquiring current embedded parameters corresponding to the training samples of the current batch; the acquisition unit is also used for acquiring the network parameters currently stored by the working server from the working server; the training unit is used for carrying out iterative training on the current embedded parameters acquired by the acquisition unit and the network parameters currently stored by the working server based on the training samples of the current batch so as to acquire embedded parameter gradients and network parameter gradients; the updating unit is used for updating the current embedding parameters based on the embedding parameter gradient obtained by training of the training unit; the sending unit is used for synchronizing the embedded parameters updated by the updating unit to the parameter server; and the updating unit is also used for updating the network parameters currently stored by the working server based on the network parameter gradient.
Optionally, the current embedded parameter is pre-stored in a working server, and the working server further includes a storage unit; the acquisition unit is also used for acquiring the difference embedding parameters from the parameter server under the condition that the partial embedding parameters of the current embedding parameters are included in the embedding parameters stored in the current working server; the storage unit is used for storing the difference embedding parameters; the difference embedding parameters comprise embedding parameters of the current embedding parameters except for partial embedding parameters;
Or the obtaining unit is further used for obtaining the current embedded parameter from the parameter server under the condition that the embedded parameter currently stored in the working server does not comprise the current embedded parameter; and the storage unit is used for storing the current embedded parameters.
Optionally, the working server further includes a receiving unit and a storage unit; the receiving unit is used for receiving the network parameters of the prediction model from the parameter server before the obtaining unit obtains the network parameters currently stored by the working server from the working server under the condition that the current batch is the first batch; and the storage unit is used for storing the network parameters of the prediction model from the parameter server, which are received by the receiving unit.
Optionally, in the case that the current lot is a lot other than the first lot, before the obtaining unit obtains the network parameter currently stored in the work server from the work server, the updating unit is specifically configured to: combining the plurality of historical network parameter gradients and storing network parameters obtained by combining; the plurality of historical network parameter gradients includes network parameter gradients trained based on training samples of a previous batch of the current batch.
Optionally, the acquiring unit is further configured to acquire a training sample of a next batch in a process of performing iterative training on the current embedded parameter and the network parameter currently stored in the working server, and acquire an embedded parameter corresponding to the training sample of the next batch based on the training sample of the next batch.
Optionally, the acquiring unit is further configured to acquire a target network parameter; the target network parameters comprise network parameters obtained by training based on training samples of the last batch; and the sending unit is also used for sending the target network parameters to the parameter server.
According to a third aspect of embodiments of the present disclosure, there is provided a work server, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the parameter training method of the prediction model provided in the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform a parameter training method of a predictive model as provided in the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a distributed parameter training system, including a plurality of parameter servers and a plurality of work servers; any one of a plurality of working servers is used to perform the parameter training method of the predictive model of the first aspect.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product comprising instructions, characterized in that the instructions, when executed by a processor, implement the parameter training method of the predictive model of the first aspect.
The technical solution provided by the disclosure at least brings the following beneficial effects: because the working server can store the network parameters required for iterative training in advance, these network parameters can be obtained directly from the cache of the working server when iterative training is required. Compared with the prior art, the network parameters do not need to be fetched frequently from the parameter server, which reduces the transmission of model parameters between the parameter server and the working server. Meanwhile, the current embedded parameters and the currently stored network parameters can be updated on the working server side, and the gradient values obtained in each training round do not need to be sent to the parameter server, which correspondingly reduces the transmission of training gradients, further shortens the iterative training time, and improves the hardware resource utilization rate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a distributed parameter training system, shown in accordance with an exemplary embodiment;
FIG. 2 is one of the flow diagrams of a method for parameter training of a predictive model, according to an exemplary embodiment;
FIG. 3 is a second flow chart of a method of parameter training of a predictive model, according to an exemplary embodiment;
FIG. 4 is a third flow chart of a method of parameter training of a predictive model, according to an exemplary embodiment;
FIG. 5 is a flow diagram illustrating a method of parameter training for a predictive model in accordance with an illustrative embodiment;
FIG. 6 is a flow diagram illustrating a method of parameter training for a predictive model in accordance with an illustrative embodiment;
FIG. 7 is a flowchart illustrating a method of parameter training for a predictive model, according to an exemplary embodiment;
FIG. 8 is one of the schematic structural diagrams of a work server shown according to an exemplary embodiment;
fig. 9 is a second schematic diagram of a configuration of a working server according to an exemplary embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
In addition, in the description of the embodiments of the present disclosure, "/" means or, unless otherwise indicated, for example, a/B may mean a or B. "and/or" herein is merely an association relationship describing an association object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In addition, in the description of the embodiments of the present disclosure, "a plurality" means two or more than two.
The parameter training method of the prediction model provided by the embodiments of the present disclosure (for convenience of description, hereinafter simply referred to as the parameter training method) may be applied to a distributed parameter training system (in practical applications, also referred to as a distributed system). Fig. 1 shows a schematic structure of the distributed parameter training system. As shown in fig. 1, the distributed parameter training system 10 is used to train the model parameters of the resulting prediction model. The distributed parameter training system 10 includes a plurality of parameter servers (illustratively, 3 parameter servers 111, 112, 113 are shown in fig. 1; in practice, a greater or smaller number of parameter servers may be used) and a plurality of working servers (illustratively, 3 working servers 121, 122, 123 are shown in fig. 1; in practice, a greater or smaller number of working servers may be used). The parameter servers and the working servers in the distributed parameter training system adopt a distributed architecture: the plurality of parameter servers are respectively connected to the plurality of working servers, and the working servers may be connected to each other in a bus or ring topology. A parameter server and a working server may be connected in a wired manner or in a wireless manner, which is not limited in the embodiments of the present disclosure.
The parameter server is used for storing the full set of model parameters and for sending model parameters to the working server when the working server initiates an iterative training request.
The parameter server is also used for updating the model parameters. The parameter server receives the model gradient sent by the working server, and updates the model parameters according to the received model gradient.
The parameter server according to the embodiments of the present disclosure may be any one of the plurality of parameter servers, or may be a main parameter server for managing the plurality of parameter servers. The parameter server mainly uses its CPU to store the full set of model parameters and to update the model parameters. The model parameters include embedded parameters (Embedding parameters) and network parameters.
The embedded parameters include the parameters of the embedding layer in the prediction model, which is used to convert the sample vectors of sparse training samples into dense vectors of fixed size. The network parameters include parameters in the prediction model such as W (weight) and b (bias). By way of example, the parameter server may be a parameter server (Parameter Server, PS) in a distributed parameter server architecture.
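As an illustrative sketch only (the disclosure does not specify code), the following Python fragment shows what an embedding lookup that converts sparse feature IDs into a fixed-size dense vector, together with network parameters W and b, might look like; the dimensions and the sum pooling are assumptions.

```python
# A minimal sketch, assuming a sum-pooled embedding lookup and a single dense layer;
# sizes and pooling are illustrative, not taken from the disclosure.
import numpy as np

VOCAB_SIZE, EMBED_DIM = 10_000, 8          # assumed sizes
embedding_table = np.random.randn(VOCAB_SIZE, EMBED_DIM).astype(np.float32)

def embed(sparse_feature_ids):
    """Convert a sparse training sample (a list of feature IDs) into a
    fixed-size dense vector by summing the looked-up embedding rows."""
    return embedding_table[sparse_feature_ids].sum(axis=0)

# Network parameters of the prediction model, e.g. weight W and bias b.
W = np.random.randn(EMBED_DIM, 1).astype(np.float32)
b = np.zeros(1, dtype=np.float32)
```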
The working server is used for obtaining training samples from outside and obtaining network parameters of the prediction model from the parameter server for iterative training.
The working server is also used for sending the model parameters obtained by the iterative training to the parameter server and storing the model parameters obtained by the iterative training.
The working server according to the embodiments of the present disclosure may be any one of the plurality of working servers, or may be a master working server for managing the plurality of working servers. The working server includes a graphics card cache for storing model parameters and model gradients, and a computation module for iterative training.
By way of example, the work server may be an execution server (Worker) in a distributed parameter server architecture.
The parameter training method provided by the embodiment of the present disclosure may be specifically applied to any one of the working servers in the distributed parameter training system, and is described below with reference to fig. 1.
As shown in fig. 2, the parameter training method provided in the embodiment of the present disclosure specifically includes the following S201 to S206.
S201, the working server acquires current embedded parameters corresponding to training samples of a current batch.
As a possible implementation manner, the working server may query the embedded parameters corresponding to the training samples of the current batch from the parameter server according to the training samples of the current batch after obtaining the training samples of the current batch from the external sample server.
As another possible implementation manner, the working server may query the embedded parameters corresponding to the training samples of the current batch from the working server according to the training samples of the current batch after obtaining the training samples of the current batch from the external sample server.
It should be noted that the sample server stores multiple batches of training sample sets, and the training samples of the current batch include a plurality of mini-batch samples obtained by dividing the samples in the training sample set of the current batch.
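By way of a hedged illustration (the mini-batch size is an assumption, not specified by the disclosure), such a division could be sketched in Python as follows:

```python
# Split one batch's sample set into mini-batches; the size 256 is assumed.
def split_into_mini_batches(samples, mini_batch_size=256):
    return [samples[i:i + mini_batch_size]
            for i in range(0, len(samples), mini_batch_size)]
```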
The embedding parameters are used to convert the sample vector of sparse training samples into a fixed-size dense vector. The embedding parameters corresponding to the training samples may include a plurality of embedding sub-parameters, and the embedding sub-parameters in the corresponding embedding parameters may be different from one training sample to another.
For a specific implementation of this step, reference may be made to the following description of the embodiments of the present disclosure, which is not repeated here.
S202, the working server acquires the network parameters currently stored by the working server from the working server.
As a possible implementation, the working server queries the currently stored network parameters from the cache of the working server after obtaining the training samples of the current batch from the external sample server.
It should be noted that, the buffer of the working server is a graphics card buffer, which is a region for storing data in the GPU of the working server. The network parameters include parameters in the predictive model, such as W (weight), b (bias), etc.
In practical applications, the working server may execute S201 first and then S202, execute S202 first and then S201, or execute S201 and S202 simultaneously, which is not limited by the embodiments of the present disclosure.
And S203, the working server carries out iterative training on the current embedded parameters and the network parameters currently stored by the working server based on the training samples of the current batch so as to obtain embedded parameter gradients and network parameter gradients.
As one possible implementation manner, the working server uses the training samples of the current batch, the current embedding parameters and the network parameters currently stored by the working server as input layers, and performs forward computation and backward propagation computation based on the sparse model to obtain the embedding parameter gradient of the current embedding parameters and the network parameter gradient of the network parameters currently stored by the working server.
For a specific implementation manner in this step, reference may be made specifically to the training process of the worker in the existing PS architecture, which is not described herein.
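For illustration only, the following sketch shows a forward computation and back-propagation step of the kind performed in S203, assuming a logistic-regression-style CTR head on top of the dense embedding; the actual model structure and loss function are not specified by the disclosure.

```python
# Hedged sketch of one training step: forward pass, then gradients for the network
# parameters (W, b) and for the embedding that produced dense_input.
import numpy as np

def train_step(dense_input, label, W, b):
    # Forward computation.
    logit = (dense_input @ W + b).item()
    pred = 1.0 / (1.0 + np.exp(-logit))
    # Back-propagation of the binary cross-entropy loss.
    d_logit = pred - label
    grad_W = np.outer(dense_input, d_logit)   # network parameter gradient (for W)
    grad_b = np.array([d_logit])              # network parameter gradient (for b)
    grad_embed = (W * d_logit).ravel()        # embedding parameter gradient
    return grad_embed, grad_W, grad_b
```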
S204, the working server updates the current embedded parameters based on the embedded parameter gradient.
The specific implementation manner of this step may refer to the process of updating the embedded parameters according to the embedded parameter gradient by the parameter server in the prior art, and will not be described herein.
In one case, the working server stores the updated current embedded parameters in a cache of the working server after updating the current embedded parameters.
For example, in the case where the current embedded parameters corresponding to the training samples of the current batch include the embedding sub-parameters L1, L3, and L5, after the working server updates the current embedded parameters, the updated current embedded parameters include updated L1, updated L3, and updated L5. In this case, the working server deletes L1, L3, and L5 from the cache, and stores the updated L1, updated L3, and updated L5.
S205, the working server synchronizes the updated embedded parameters to the parameter server.
As a possible implementation, the working server sends the updated embedded parameters to the parameter server.
Illustratively, after the working server obtains updated L1, updated L3, and updated L5, the updated L1, updated L3, and updated L5 are sent to the parameter server to enable the parameter server to update its stored full amount of embedded parameters.
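A minimal sketch of S204-S205, assuming an SGD-style update rule and a hypothetical push_embeddings call to the parameter server (neither is specified by the disclosure), might look like this:

```python
# Only the embedding sub-parameters touched by the current batch are updated in the
# worker's cache and then pushed to the parameter server.
LEARNING_RATE = 0.01   # assumed

def update_and_sync_embeddings(gpu_cache, embed_grads_by_id, parameter_server):
    updated = {}
    for feature_id, grad in embed_grads_by_id.items():
        gpu_cache[feature_id] = gpu_cache[feature_id] - LEARNING_RATE * grad  # S204
        updated[feature_id] = gpu_cache[feature_id]
    parameter_server.push_embeddings(updated)   # S205, hypothetical RPC
    return updated
```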
S206, the working server updates the network parameters currently stored by the working server based on the network parameter gradient.
As one possible implementation, the working server determines updated network parameters based on the network parameter gradients and the network parameter gradients trained by other working servers in the distributed parameter training system, and updates the network parameters currently stored in the cache of the working server by using the updated network parameters.
For a specific implementation of this step, reference may be made to the following detailed description of the embodiments of the disclosure, which is not repeated here.
The technical solution provided by this embodiment at least has the following beneficial effects: because the working server can store the network parameters required for iterative training in advance, these network parameters can be obtained directly from the cache of the working server when iterative training is required. Compared with the prior art, the network parameters do not need to be fetched frequently from the parameter server, which reduces the transmission of model parameters between the parameter server and the working server. Meanwhile, the current embedded parameters and the currently stored network parameters can be updated on the working server side, and the gradient values obtained in each training round do not need to be sent to the parameter server, which correspondingly reduces the transmission of training gradients, further shortens the iterative training time, and improves the hardware resource utilization rate.
In one design, the current embedding parameters corresponding to the training samples of the current batch may be embedding parameters stored in the working server in advance, and in S201 provided in the embodiment of the present disclosure, the working server may specifically obtain the current embedding parameters from the memory thereof. In order to store the current embedded parameters in the working server in advance, as shown in fig. 3, the parameter training method provided in the embodiment of the present disclosure may further include the following S301 to S305.
S301, the working server determines whether the currently stored embedded parameters comprise the current embedded parameters.
As a possible implementation manner, in the process of performing iterative training on the embedded parameters and the network parameters corresponding to the training samples of the previous batch, the working server obtains the training samples of the current batch from the external sample server, and inquires whether the embedded subparameter included in the current embedded parameters is stored in the memory of the working server.
S302, under the condition that the embedding parameters stored in the working server at present comprise part of the embedding parameters of the current embedding parameters, the working server acquires the difference embedding parameters from the parameter server.
Wherein the differential embedding parameters include embedding parameters of the current embedding parameters other than the partial embedding parameters.
For example, the current embedded parameters include the embedding sub-parameters L1, L2, L3, and L4. If the embedded parameters currently stored by the working server include L1 and L3, the working server determines that L1 and L3 are partial embedding parameters of the current embedded parameters. Meanwhile, the working server determines that L2 and L4 are difference embedding parameters, and obtains the difference embedding parameters L2 and L4 from the parameter server.
S303, the working server stores the difference embedding parameters.
As one possible implementation manner, the working server stores the acquired difference embedding parameter into a graphics card buffer of the working server.
It can be understood that after the working server stores the differential embedding parameters, the current embedding parameters corresponding to the training samples of the current batch are included in the graphics card buffer of the working server.
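As a hedged sketch of S301-S305 (the graphics card cache is modelled as a dictionary and pull_embeddings is a hypothetical parameter-server call, not an API from the disclosure), the cache check and difference fetch could look like this:

```python
# Check the cache for the embedding sub-parameters needed by the current batch and
# pull only the missing (difference) ones from the parameter server.
def fetch_current_embeddings(needed_ids, gpu_cache, parameter_server):
    missing = [fid for fid in needed_ids if fid not in gpu_cache]   # S301
    if missing:                                                     # partial or no cache hit
        gpu_cache.update(parameter_server.pull_embeddings(missing)) # S302/S304 fetch, S303/S305 store
    return {fid: gpu_cache[fid] for fid in needed_ids}              # current embedded parameters
```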
The technical solution provided by this embodiment at least has the following beneficial effects: the working server can determine whether the currently stored embedded parameters include part of the current embedded parameters, and obtain the difference embedding parameters corresponding to the training samples of the current batch from the parameter server when only part of the current embedded parameters is stored. Compared with the prior art, the part of the embedded parameters that has already been stored and updated does not need to be obtained from the parameter server again, which further reduces the amount of embedded parameters transmitted between the parameter server and the working server and saves transmission resources.
S304, under the condition that the embedded parameters currently stored by the working server do not comprise the current embedded parameters, the working server acquires the current embedded parameters from the parameter server.
As a possible implementation manner, if the working server determines that any part of the current embedding parameters are not included in the currently stored embedding parameters, the current embedding parameters are acquired from the parameter server.
S305, the working server stores the current embedded parameters.
The specific embodiment of this step may refer to the specific description in S303, and will not be described herein.
In another case, if the working server determines that the currently stored embedded parameters include the current embedded parameters, the working server directly obtains the current embedded parameters from the graphics card buffer after the previous batch of training samples are trained, and trains the current embedded parameters based on the current batch of training samples.
Further, after the working server completes iterative training on the embedded parameters corresponding to the training samples of the previous batch and the network parameters, the working server acquires the embedded parameters corresponding to the training samples of the current batch from the cache of the working server according to the training samples of the current batch.
The technical scheme provided by the embodiment at least has the following beneficial effects: under the condition that the currently stored embedded parameters are determined not to include the current embedded parameters, the working server can acquire the current embedded parameters from the parameter server in advance and store the current embedded parameters into the display card cache, so that the asynchronous prefetching effect is realized, the time of training parameters of the working server can be saved, and the training efficiency is improved.
In one design, in order to obtain and prestore the network parameters of the prediction model from the working server, as shown in fig. 4, in the case that the current lot is the first lot, the parameter training method provided in the embodiment of the present disclosure further includes the following steps S207 to S208 before S202.
S207, the working server receives network parameters of the prediction model from the parameter server.
As one possible implementation, the working server requests the parameter server to send the network parameters of the predictive model to the working server after obtaining the training samples of the current batch.
S208, the working server stores network parameters of the prediction model from the parameter server.
As one possible implementation, the working server stores the received network parameters in a graphics card cache of the working server.
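For illustration, a sketch of the one-time pull in S207-S208 (first batch only) is shown below; the dictionary cache and the pull_network_params call are assumptions.

```python
# Pull the prediction model's network parameters from the parameter server once,
# before training on the first batch, and keep them in the graphics card cache.
def init_network_params(gpu_cache, parameter_server):
    if "network_params" not in gpu_cache:           # only before the first batch
        gpu_cache["network_params"] = parameter_server.pull_network_params()  # S207 + S208
    return gpu_cache["network_params"]
```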
The technical scheme provided by the embodiment at least has the following beneficial effects: for the first batch of training samples, network parameters can be obtained from a parameter server and stored, and a data basis can be provided for subsequent iterative training in the working server. It can be understood that the working server only needs to acquire the network parameters of the prediction model from the parameter server when training is performed for the first time, and the network parameters are not required to be transmitted with the parameter server when training is performed each time, so that the transmission pressure between the parameter server and the working server can be reduced, and the efficiency of parameter training is improved.
In one design, in order to obtain, from a working server, network parameters corresponding to training samples of a current lot stored in advance during iterative training of the working server, as shown in fig. 5, in the case that the current lot is another lot except the first lot, the parameter training method provided in the embodiment of the present disclosure further includes the following steps S209-S210 before S202.
S209, the working server performs combination processing on the plurality of historical network gradient values to determine network parameters corresponding to training samples of the current batch.
The plurality of historical network gradient values comprise gradient values of network parameters corresponding to the training samples of the previous batch, wherein the gradient values are obtained based on the training samples of the previous batch.
As a possible implementation, the working server averages, transmits, and merges the historical network gradient values with the other working servers in the distributed parameter training system based on a preset reduction (AllReduce) algorithm, until each working server stores the network parameters updated based on the gradient values.
The specific implementation of this step can be specifically described with reference to ring-allreduce in the prior art.
It will be appreciated that in this case all working servers in the distributed parameter training system are connected in a ring connection.
As another possible implementation manner, the working server receives the historical network gradient values sent by the other working servers, merges and averages all the received historical network gradient values to obtain updated network parameters, and sends the updated network parameters to the other working servers until each working server stores the updated network parameters based on the gradient values.
It should be noted that the communication between the working server and the other working servers may use NVLink, remote direct memory access (Remote Direct Memory Access, RDMA), or DPDK communication modes between the GPUs of the working servers to perform gradient value transmission.
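Stripped of the transport details (ring all-reduce over NVLink/RDMA/DPDK), the merging in S209 amounts to averaging the historical network parameter gradients from all working servers; a minimal numpy sketch, with an assumed learning rate, is shown below.

```python
# Average the historical network parameter gradients gathered from every worker and
# apply the result to the cached network parameters; the transport is abstracted away.
import numpy as np

def merge_network_gradients(gradients_from_all_workers):
    return np.mean(np.stack(gradients_from_all_workers), axis=0)   # S209

def apply_merged_gradient(cached_network_params, merged_grad, learning_rate=0.01):
    # S210: store the network parameters obtained by the merge back into the GPU cache.
    return cached_network_params - learning_rate * merged_grad
```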
The implementation of this step may also serve as a specific implementation of updating the currently stored network parameters based on the network parameter gradient in S206; the difference is that the batches of training samples corresponding to the updated network parameters are different.
S210, the working server stores the network parameters obtained by the combination processing.
As one possible implementation, the working server stores the network parameters corresponding to the training samples of the current batch into the cache of the working server.
The technical solution provided by this embodiment at least has the following beneficial effects: compared with the prior art, the working servers can update the network parameters in place of the parameter server, and the historical network gradients do not need to be sent to the parameter server one by one, which reduces the parameter training time of the working servers. Meanwhile, the AllReduce algorithm is used to merge the historical network gradients among the working servers, which speeds up the merging and further reduces the parameter training time.
In one design, in order to save the time of parameter training and improve the effect of parameter training, as shown in fig. 6, the parameter training method provided in the embodiment of the disclosure further includes the following steps S401 to S403.
S401, the working server acquires a training sample of the next batch in the process of performing iterative training on the current embedded parameters and the network parameters currently stored by the working server.
As a possible implementation, the working server may start to obtain the training samples of the next batch from the external sample server in the process of performing S203 described above.
The specific implementation manner of obtaining the training samples of the next batch in this step may refer to the specific description in S201 in the embodiment of the disclosure, and will not be described herein. The difference is that the batches of training samples obtained are different.
S402, the working server acquires embedded parameters corresponding to the training samples of the next batch from the parameter server based on the training samples of the next batch.
For a specific implementation manner of this step, reference may be made to a specific description of any one implementation manner of S201 provided in the embodiment of the present disclosure, which is not described herein. The difference is that the training samples corresponding to the embedding parameters acquired in S402 are different in lot number.
S403, the working server stores the obtained embedded parameters.
As one possible implementation manner, the working server stores the acquired embedded parameters corresponding to the training samples of the next batch into a graphics card cache of the working server.
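As a hedged sketch of the asynchronous prefetch in S401-S403 (the sample_server and parameter_server interfaces, and the use of a background thread, are assumptions for illustration):

```python
# Prefetch the next batch and its embedding parameters while the current batch trains.
import threading

def prefetch_next_batch(sample_server, parameter_server, gpu_cache, out):
    def _worker():
        next_batch = sample_server.get_batch()                           # S401
        needed_ids = sorted({fid for sample in next_batch for fid in sample})
        gpu_cache.update(parameter_server.pull_embeddings(needed_ids))   # S402 + S403
        out["next_batch"] = next_batch

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread   # join after the current batch's iterative training finishes
```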
The technical scheme provided by the embodiment at least has the following beneficial effects: the method can asynchronously acquire the training samples of the next batch and the embedded parameters corresponding to the training samples of the next batch in the training process, realize 'parameter prefetching', save the time of iterative training of model parameters and improve the training speed.
In one design, in order to enable updating of network parameters, as shown in fig. 7, the parameter training method provided in the embodiment of the disclosure further includes the following steps S501 to S502.
S501, the working server acquires the target network parameters.
The target network parameters include the network parameters obtained by training based on the training samples of the last batch.
As a possible implementation manner, the working server acquires the updated network parameters as the target network parameters after performing iterative training and updating on the network parameters based on the training samples of the last batch.
S502, the working server sends the target network parameters to the parameter server.
As a possible implementation, the working server sends the target network parameters to the parameter server. Correspondingly, after receiving the target network parameters, the parameter server takes the target network parameters, together with the embedded parameters that have been gradient-updated according to the training samples of the last batch, as the model parameters of the prediction model.
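For illustration, the final synchronization in S501-S502 might be sketched as follows; store_network_params is a hypothetical parameter-server call.

```python
# After the last batch, push the final (target) network parameters to the parameter
# server, which keeps them together with the latest gradient-updated embeddings.
def finalize_network_params(gpu_cache, parameter_server):
    target_network_params = gpu_cache["network_params"]   # trained on the last batch, S501
    parameter_server.store_network_params(target_network_params)   # S502
    return target_network_params
```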
The technical solution provided by this embodiment at least has the following beneficial effects: after the training of the training samples of the last batch is completed, the updated network parameters can be sent to the parameter server, which provides a basis for subsequently obtaining the prediction model on the parameter server side. In this way, the network parameters of the prediction model obtained by the last update can be stored in the parameter server.
In addition, the present disclosure also provides a work server applied to the distributed system, and referring to fig. 8, the work server 60 includes an acquisition unit 601, a training unit 602, an update unit 603, and a transmission unit 604.
The obtaining unit 601 is configured to obtain a current embedded parameter corresponding to a current batch of training samples. For example, as shown in fig. 2, the acquisition unit 601 may be used to perform S201.
The obtaining unit 601 is further configured to obtain, from the working server 60, a network parameter currently stored by the working server 60. For example, as shown in fig. 2, the acquisition unit 601 may be used to perform S202.
The training unit 602 is configured to perform iterative training on the current embedding parameter acquired by the acquiring unit 601 and the network parameter currently stored by the working server 60 based on the training samples of the current batch, so as to obtain an embedding parameter gradient and a network parameter gradient. For example, as shown in fig. 2, training unit 602 may be used to perform S203.
An updating unit 603, configured to update the current embedding parameter based on the embedding parameter gradient obtained by training by the training unit 602. For example, as shown in fig. 2, the updating unit 603 may be used to perform S204.
A sending unit 604, configured to synchronize the embedded parameters updated by the updating unit 603 to the parameter server. For example, as shown in fig. 2, the sending unit 604 may be used to perform S205.
The updating unit 603 is further configured to update the network parameters currently stored by the working server 60 based on the network parameter gradient. For example, as shown in fig. 2, the updating unit 603 may be used to perform S206.
Optionally, as shown in fig. 8, the current embedded parameter provided by the embodiment of the present disclosure is pre-stored in the working server 60, and the working server 60 further includes a storage unit 605.
The obtaining unit 601 is further configured to obtain, in a case where a partial embedding parameter of the current embedding parameter is included in the embedding parameters currently stored in the working server 60, a differential embedding parameter from the parameter server. The differential embedding parameters include embedding parameters of the current embedding parameters other than the partial embedding parameters. For example, as shown in fig. 3, the acquisition unit 601 may be used to perform S302.
A storage unit 605 for storing the difference embedding parameter. For example, as shown in fig. 3, the storage unit 605 may be used to perform S303.
Or alternatively
The obtaining unit 601 is further configured to obtain the current embedding parameter from the parameter server when the embedding parameter currently stored in the working server 60 does not include the current embedding parameter. For example, as shown in fig. 3, the acquisition unit 601 may be used to perform S304.
A storage unit 605 for storing the current embedding parameters. For example, as shown in fig. 3, the storage unit 605 may be used to perform S305.
Optionally, as shown in fig. 8, the working server 60 provided by the embodiment of the present disclosure further includes a receiving unit 606.
A receiving unit 606, configured to receive, in a case where the current lot is the first lot, the network parameters of the prediction model from the parameter server before the obtaining unit 601 obtains the network parameters currently stored by the work server 60 from the work server 60. For example, as shown in fig. 4, the receiving unit 606 may be used to perform S207.
A storage unit 605 is configured to store the network parameters of the prediction model received by the receiving unit 606 from the parameter server. For example, as shown in fig. 4, the storage unit 605 may be used to perform S208.
Optionally, as shown in fig. 8, in the working server 60 provided in the embodiment of the present disclosure, in a case where the current batch is another batch except the first batch, before the obtaining unit 601 obtains the network parameter currently stored in the working server 60 from the working server 60, the updating unit 603 is specifically configured to:
And combining the plurality of historical network parameter gradients, and storing the network parameters obtained by the combining process. The plurality of historical network parameter gradients includes network parameter gradients trained based on training samples of a previous batch of the current batch. For example, as shown in fig. 5, the updating unit 603 may be used to perform S209-S210.
Optionally, as shown in fig. 8, the obtaining unit 601 provided by the embodiment of the present disclosure is further configured to obtain a training sample of a next batch in the process of performing iterative training on the current embedding parameter and the network parameter currently stored in the working server 60, and obtain the embedding parameter corresponding to the training sample of the next batch based on the training sample of the next batch. For example, as shown in fig. 6, the obtaining unit 601 may be used to perform S401-S402.
Optionally, as shown in fig. 8, the obtaining unit 601 provided by the embodiment of the present disclosure is further configured to obtain a target network parameter. The target network parameters include network parameters that are trained based on the last batch of training samples. For example, as shown in fig. 7, the acquisition unit 601 may be used to perform S501.
The sending unit 604 is further configured to send the target network parameter to the parameter server. For example, as shown in fig. 7, the transmission unit 604 may be used to perform S502.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be described in detail here.
Fig. 9 is a schematic structural diagram of another working server provided by the present disclosure. As shown in fig. 9, the work server 70 may include at least one processor 701 and a memory 703 for storing processor-executable instructions. Wherein the processor 701 is configured to execute instructions in the memory 703 to implement the parameter training method in the above-described embodiments.
In addition, work server 70 may also include a communication bus 702 and at least one communication interface 704.
The processor 701 may be a GPU, a micro-processing unit, an ASIC, or one or more integrated circuits for controlling the execution of the programs of the present disclosure.
Communication bus 702 may include a path to transfer information between the aforementioned components.
The communication interface 704 uses any transceiver-like device for communicating with other devices or communication networks, such as an Ethernet, a radio access network (radio access network, RAN), or a wireless local area network (wireless local area networks, WLAN).
The memory 703 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory may be stand-alone and connected to the processing unit by a bus. The memory may also be integrated with the processing unit as a volatile storage medium in the GPU.
The memory 703 is used for storing instructions for executing the disclosed aspects and is controlled by the processor 701 for execution. The processor 701 is configured to execute instructions stored in the memory 703 to implement the functions in the methods of the present disclosure.
In a particular implementation, as one embodiment, processor 701 may include one or more GPUs, such as GPU0 and GPU1 in fig. 9.
In a particular implementation, as one embodiment, work server 70 may include multiple processors, such as processor 701 and processor 707 in FIG. 9. Each of these processors may be a single-core (single-GPU) processor or a multi-core (multi-GPU) processor. A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).
In a particular implementation, as one embodiment, work server 70 may also include an output device 705 and an input device 706. The output device 705 communicates with the processor 701 and may display information in a variety of ways. For example, the output device 705 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 706 communicates with the processor 701 and may accept user input in a variety of ways. For example, the input device 706 may be a mouse, a keyboard, a touch screen device, or a sensing device.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is not limiting of the work server 70 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In addition, the present disclosure also provides a computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform a method of parameter training of a predictive model as provided by the above embodiments.
In addition, the present disclosure also provides a computer program product comprising instructions that, when executed by a processor, cause the processor to perform a method of parameter training of a predictive model as provided by the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (14)

1. A method for training parameters of a predictive model, applied to a working server of a distributed system, comprising:
Acquiring current embedded parameters corresponding to training samples of a current batch;
Combining the historical network parameter gradients under the condition that the current batch is other batches except the first batch, and storing the network parameters obtained by combining; the plurality of historical network parameter gradients comprise network parameter gradients trained based on training samples of a previous batch of the current batch;
acquiring network parameters currently stored by the working server from the working server;
Performing iterative training on the current embedded parameters and the network parameters currently stored by the working server based on the training samples of the current batch to obtain embedded parameter gradients and network parameter gradients;
updating the current embedded parameters based on the embedded parameter gradient, and synchronizing the updated embedded parameters to a parameter server;
and updating the network parameters currently stored by the working server based on the network parameter gradient.
2. The method for training parameters of a predictive model according to claim 1, wherein the current embedded parameters are pre-stored in the work server, and the method further comprises:
in a case where the embedded parameters currently stored by the work server comprise a part of the current embedded parameters, acquiring difference embedded parameters from the parameter server and storing the difference embedded parameters, wherein the difference embedded parameters comprise embedded parameters of the current embedded parameters other than the part of the embedded parameters;
or,
in a case where the embedded parameters currently stored by the work server do not comprise the current embedded parameters, acquiring the current embedded parameters from the parameter server and storing the current embedded parameters.
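Claim 2 above amounts to fetching from the parameter server only the embedded parameters the work server does not already hold. The sketch below shows that differential fetch under assumed names (InMemoryParameterServer, get_current_embeddings); these are illustrative stand-ins, not interfaces defined by the patent.

```python
from typing import Dict, Iterable

import numpy as np


class InMemoryParameterServer:
    """Stand-in for the remote parameter server (an assumption for this sketch)."""

    def __init__(self, dim: int = 8):
        self.table: Dict[int, np.ndarray] = {}
        self.dim = dim

    def pull(self, ids: Iterable[int]) -> Dict[int, np.ndarray]:
        # Lazily create unseen rows so the example runs without extra setup.
        return {i: self.table.setdefault(i, np.zeros(self.dim)) for i in ids}


def get_current_embeddings(feature_ids: Iterable[int],
                           cache: Dict[int, np.ndarray],
                           ps: InMemoryParameterServer) -> Dict[int, np.ndarray]:
    needed = set(feature_ids)
    missing = needed - cache.keys()      # the "difference" embedded parameters
    if missing:                          # covers both the partial-hit and full-miss cases
        cache.update(ps.pull(missing))   # fetch only what the work server lacks
    return {i: cache[i] for i in needed}


# Usage: the second call only pulls ids 4 and 5 from the parameter server.
ps = InMemoryParameterServer()
cache: Dict[int, np.ndarray] = {}
batch1 = get_current_embeddings([1, 2, 3], cache, ps)
batch2 = get_current_embeddings([2, 3, 4, 5], cache, ps)
```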
3. The method for training parameters of a predictive model according to claim 1, wherein, in a case where the current batch is the first batch, before the network parameters currently stored by the work server are acquired from the work server, the method further comprises:
receiving and storing network parameters of the predictive model from the parameter server.
4. The method for training parameters of a predictive model according to claim 1, further comprising:
acquiring a next batch of training samples during the iterative training on the current embedded parameters and the network parameters currently stored by the work server, and acquiring, based on the next batch of training samples, embedded parameters corresponding to the next batch of training samples.
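Claim 4 above overlaps communication with computation: while the current batch is training, the embedded parameters for the next batch are already being fetched. A minimal sketch of that overlap using a background thread; the callables fetch_embeddings and train_step are assumptions supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def run_epoch(batches, fetch_embeddings, train_step):
    """While batch i trains, batch i+1's embedded parameters are pulled in the background."""
    if not batches:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_embeddings, batches[0])
        for i, batch in enumerate(batches):
            embeds = future.result()              # embedded parameters for this batch
            if i + 1 < len(batches):
                # Start pulling the next batch's embeddings before training begins.
                future = pool.submit(fetch_embeddings, batches[i + 1])
            train_step(batch, embeds)             # iterative training on the current batch
```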
5. The method for training parameters of a predictive model according to claim 1, further comprising:
acquiring target network parameters and sending the target network parameters to the parameter server, wherein the target network parameters comprise network parameters obtained by training based on training samples of a last batch.
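Claim 5 above has the work server push its locally updated ("target") network parameters back to the parameter server once the last batch has been trained. A short hedged sketch; push_network_params is an assumed client method, not an interface defined by the patent.

```python
def finish_training(local_net_params, ps_client, is_last_batch: bool) -> None:
    """After the last batch, send the target network parameters to the parameter server."""
    if is_last_batch:
        ps_client.push_network_params(dict(local_net_params))
```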
6. A work server, comprising an acquisition unit, a training unit, an updating unit, and a sending unit, wherein:
the acquisition unit is configured to acquire current embedded parameters corresponding to training samples of a current batch;
the acquisition unit is further configured to acquire, from the work server, network parameters currently stored by the work server;
the training unit is configured to perform iterative training on the current embedded parameters acquired by the acquisition unit and the network parameters currently stored by the work server based on the training samples of the current batch, to obtain an embedded parameter gradient and a network parameter gradient;
the updating unit is configured to update the current embedded parameters based on the embedded parameter gradient obtained by the training unit;
the sending unit is configured to synchronize the embedded parameters updated by the updating unit to a parameter server;
the updating unit is further configured to update the network parameters currently stored by the work server based on the network parameter gradient; and
in a case where the current batch is a batch other than a first batch, the updating unit is further configured to, before the acquisition unit acquires the network parameters currently stored by the work server:
combine a plurality of historical network parameter gradients and store network parameters obtained by the combining, wherein the plurality of historical network parameter gradients comprise a network parameter gradient obtained by training based on training samples of a batch previous to the current batch.
7. The work server according to claim 6, wherein the current embedded parameters are pre-stored in the work server, and the work server further comprises a storage unit;
the acquisition unit is further configured to, in a case where the embedded parameters currently stored by the work server comprise a part of the current embedded parameters, acquire difference embedded parameters from the parameter server, and the storage unit is configured to store the difference embedded parameters, wherein the difference embedded parameters comprise embedded parameters of the current embedded parameters other than the part of the embedded parameters;
or,
the acquisition unit is further configured to, in a case where the embedded parameters currently stored by the work server do not comprise the current embedded parameters, acquire the current embedded parameters from the parameter server, and the storage unit is configured to store the current embedded parameters.
8. The work server according to claim 6, further comprising a receiving unit and a storage unit, wherein:
the receiving unit is configured to receive, in a case where the current batch is the first batch, network parameters of a predictive model from the parameter server before the acquisition unit acquires the network parameters currently stored by the work server; and
the storage unit is configured to store the network parameters of the predictive model received by the receiving unit from the parameter server.
9. The work server according to claim 6, wherein the acquisition unit is further configured to acquire a next batch of training samples during the iterative training on the current embedded parameters and the network parameters currently stored by the work server, and to acquire, based on the next batch of training samples, embedded parameters corresponding to the next batch of training samples.
10. The work server according to claim 6, wherein the acquisition unit is further configured to acquire target network parameters, the target network parameters comprising network parameters obtained by training based on training samples of a last batch; and
the sending unit is further configured to send the target network parameters to the parameter server.
11. A work server, comprising: a processor; and a memory configured to store instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the method for training parameters of a predictive model according to any one of claims 1-5.
12. A computer-readable storage medium comprising instructions that, when executed by a processor, cause the processor to perform the method for training parameters of a predictive model according to any one of claims 1-5.
13. A distributed parameter training system, comprising a plurality of parameter servers and a plurality of work servers, wherein any one of the plurality of work servers is configured to perform the method for training parameters of a predictive model according to any one of claims 1-5.
14. A computer program product comprising instructions that, when executed by a processor, implement the method for training parameters of a predictive model according to any one of claims 1-5.
CN202110542415.1A 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium Active CN113283596B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110542415.1A CN113283596B (en) 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110542415.1A CN113283596B (en) 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium

Publications (2)

Publication Number Publication Date
CN113283596A CN113283596A (en) 2021-08-20
CN113283596B true CN113283596B (en) 2024-06-04

Family

ID=77279725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110542415.1A Active CN113283596B (en) 2021-05-18 2021-05-18 Model parameter training method, server, system and storage medium

Country Status (1)

Country Link
CN (1) CN113283596B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117436499A (en) * 2022-07-22 2024-01-23 华为技术有限公司 Model training system, method and related equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111971694A (en) * 2018-04-04 2020-11-20 诺基亚技术有限公司 Collaborative heterogeneous processing of training data for deep neural networks
CN109117953A (en) * 2018-09-11 2019-01-01 北京迈格威科技有限公司 Network parameter training method and system, server, client and storage medium
CN111881358A (en) * 2020-07-31 2020-11-03 北京达佳互联信息技术有限公司 Object recommendation system, method and device, electronic equipment and storage medium
CN112052950A (en) * 2020-08-24 2020-12-08 北京达佳互联信息技术有限公司 Neural network training method, model calculation server and storage medium

Also Published As

Publication number Publication date
CN113283596A (en) 2021-08-20

Similar Documents

Publication Publication Date Title
CN110442579B (en) State tree data storage method, synchronization method and equipment and storage medium
CN106959894B (en) Resource allocation method and device
CN110058936B (en) Method, apparatus and computer program product for determining an amount of resources of a dedicated processing resource
KR102147356B1 (en) Cache memory system and operating method for the same
US11651221B2 (en) Method, device, and computer program product for deep learning
EP3678030B1 (en) Distributed system for executing machine learning, and method therefor
EP4119896A2 (en) Method and apparatus for processing high-definition map data, electronic device, medium and product
CN113094430B (en) Data processing method, device, equipment and storage medium
CN112528995B (en) Method for training target detection model, target detection method and device
CN112631775B (en) Model training method, device, electronic equipment and computer readable storage medium
CN111225010A (en) Data processing method, data processing system and device
US11356334B2 (en) Communication efficient sparse-reduce in distributed machine learning
CN113283596B (en) Model parameter training method, server, system and storage medium
CN111597035B (en) Simulation engine time propulsion method and system based on multithreading
WO2023142399A1 (en) Information search methods and apparatuses, and electronic device
CN114911596B (en) Scheduling method and device for model training, electronic equipment and storage medium
WO2022016981A1 (en) Image processing methods and apparatus, storage medium, and electronic device
CN104144212A (en) Virtual desktop image transmission method, device and system
CN112688991B (en) Method for performing point cloud scanning operation, related apparatus and storage medium
CN103200237A (en) Method and device for maintaining remote desktop synchronization
CN112596820A (en) Resource loading method, device, equipment and storage medium
KR20210042992A (en) Method and apparatus for training a deep learning model
CN115378937B (en) Distributed concurrency method, device, equipment and readable storage medium for tasks
CN114168494A (en) Cache processing method and device, electronic equipment and storage medium
CN115103024A (en) Serial number generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant