CN110929885A - Smart campus-oriented distributed machine learning model parameter aggregation method - Google Patents

Smart campus-oriented distributed machine learning model parameter aggregation method

Info

Publication number
CN110929885A
Authority
CN
China
Prior art keywords
training
model
data
calculation process
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911197322.9A
Other languages
Chinese (zh)
Inventor
张纪林
范禹辰
万健
周丽
任永坚
张俊聪
魏振国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Shuguang Information Technology Co Ltd
Hangzhou Dianzi University
Hangzhou Electronic Science and Technology University
Original Assignee
Zhejiang Shuguang Information Technology Co Ltd
Hangzhou Electronic Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Shuguang Information Technology Co Ltd and Hangzhou Electronic Science and Technology University
Priority to CN201911197322.9A
Publication of CN110929885A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/20 Education

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • General Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • Medical Informatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Educational Technology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed machine learning model parameter aggregation method for a smart campus, intended to solve the problem that model training falls into a local optimal solution under a data parallel strategy. Starting from the model aggregation step of a distributed machine learning algorithm, the invention lets the parameter server determine, from the loss function value of each computing process, the proportion that process's local model takes when the local model parameters are aggregated, which improves training accuracy; each computing process obtains its training data by sampling without replacement, which reduces communication overhead. When the method is applied to synchronous models such as the bulk synchronous parallel (BSP) model and the stale synchronous parallel (SSP) model, training accuracy is effectively improved without affecting training speed, and applying the trained model in the smart campus effectively improves the accuracy of service recommendation.

Description

Smart campus-oriented distributed machine learning model parameter aggregation method
Technical Field
The invention relates to a distributed machine learning model parameter aggregation method for a smart campus, and in particular to a parameter aggregation method that, in the smart campus field, addresses the problem of a model falling into a local optimal solution.
Background
With the arrival of the big data era, traditional machine learning is increasingly inadequate when facing massive data, and distributed machine learning has emerged in response. Compared with training on a single machine, distributed machine learning can make full use of the resources of a high-performance computing cluster. Existing distributed machine learning systems generally follow the parameter server paradigm, in which a parameter server and several computing nodes are set up for training. The parameter server is responsible for collecting and merging the training results of the computing nodes and sending the merged result back to them; each computing node holds a portion of the training data, trains local parameters, synchronizes with the parameter server once a synchronization condition is reached, and then receives the global parameters from the parameter server. The method by which the parameter server aggregates the local model parameters has a great influence on training accuracy.
A smart campus generates a large amount of data every day in education, daily life, administration and other respects. To provide accurate service recommendation to students, teachers and other staff, these data must be used to train a distributed machine learning model. The trained model can then make personalized recommendations according to the type of user, such as personalized course recommendation for students, scientific-research service recommendation for teachers, and administrative service recommendation for other staff. This requires the accuracy of service recommendation to meet a high standard; otherwise user experience and efficiency are reduced.
Existing data parallel model aggregation methods generally use parameter averaging, that is, the parameter server directly averages the model parameters of the computing processes to obtain the global model parameters. However, this approach has a drawback when the problem is non-convex: if the problem has multiple local optimal solutions, training may fall into a local optimum and be unable to escape, greatly reducing model accuracy.
Therefore, given these characteristics of distributed machine learning under the current data parallel strategy, a model parameter aggregation method that can cope with non-convex problems under the data parallel strategy needs to be invented.
Disclosure of Invention
The invention aims to solve the problem that distributed machine learning training falls into a local optimum on non-convex problems, and provides a distributed machine learning model parameter aggregation method oriented to the smart campus.
The technical scheme adopted for solving the technical problem comprises the following steps:
The parameter server determines weights from the loss function values sent by the computing processes and then takes a weighted average of their local models, which improves training accuracy. The method is realized through the following steps:
Step 1: collect the daily behavior information of users generated by the smart campus and convert it into a uniform data format.
Step 2: each computing process randomly selects the same amount of data from the training data by sampling without replacement, and trains on it.
Step 3: according to the synchronization strategy preset before training, each computing process sends the local model parameters it is training to the parameter server (the main process) every fixed number of iterations.
Step 4: the parameter server sets a merging weight 1/L for each computing process according to the loss function value L of the model sent by that process, and computes a weighted average of all local model parameters to obtain the global model parameters.
Step 5: the parameter server sends the global model parameters to all computing processes, and each computing process continues training after receiving the new global model.
Step 6: return to step 2 until the training result of the distributed machine learning model converges.
The invention has the following beneficial effects:
1. The invention can use random sampling without replacement, which guarantees that every computing process holds the same amount of different data, maximizes the utilization of the data, and removes the need for the parameter server to send data to the computing processes, thereby reducing communication traffic and improving training accuracy.
2. The invention can use a weighted average when the parameter server aggregates parameters, so that the parameter server determines each weight from the loss function value of the corresponding computing process's local model; this allows model training to cope with non-convex problems and improves training accuracy compared with direct averaging.
3. The method can be applied to various synchronous parallel strategies under the data parallel strategy, such as the bulk synchronous parallel (BSP) model and the stale synchronous parallel (SSP) model, so its application scenarios are wider than those of other model parameter aggregation methods.
4. Compared with other training algorithms, the stochastic gradient descent method based on loss-function weight reordering can reduce communication traffic and effectively improve training accuracy.
Drawings
FIG. 1 is a diagram illustrating the steps of the stochastic gradient descent method based on loss-function weight reordering according to the present invention.
Fig. 2 is a diagram illustrating a synchronization barrier.
Fig. 3 is a parameter synchronization explanatory diagram.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings. The specific steps are shown in fig. 1, wherein:
Step 1: the data generated by the daily behavior of teachers, students and other staff are cleaned, converted and stored in a memory-mapped database for training.
Step 2: the main process reads a configuration file containing the training parameters and the model network. The training parameters mainly include the initial learning rate, the learning rate adjustment policy, the momentum value, the maximum number of iterations and so on; the model network is a layer-by-layer model description file in prototxt format. Each computing process then randomly selects its local training data from the full training data by sampling without replacement, so that in the end every process holds the same amount of different data; the training data are labeled, formatted pictures.
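A minimal sketch of this sampling scheme (illustrative only; the function and variable names are not from the patent): every process shuffles the full index list with a common seed and keeps a disjoint, equally sized slice, which realizes sampling without replacement across processes.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>
// Assign process `rank` (0 .. num_procs-1) an equal-sized, disjoint slice of
// sample indices, i.e. sampling without replacement across processes.
std::vector<std::size_t> LocalIndices(std::size_t num_samples, int num_procs,
                                      int rank, unsigned seed) {
  std::vector<std::size_t> idx(num_samples);
  std::iota(idx.begin(), idx.end(), 0);        // 0, 1, ..., num_samples-1
  std::mt19937 rng(seed);                      // same seed in every process
  std::shuffle(idx.begin(), idx.end(), rng);   // identical permutation everywhere
  std::size_t per_proc = num_samples / num_procs;   // equal share per process
  return std::vector<std::size_t>(idx.begin() + rank * per_proc,
                                  idx.begin() + (rank + 1) * per_proc);
}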
Step 3: the parameter server and the computing processes exchange data according to the chosen model synchronization strategy. Taking the bulk synchronous parallel (BSP) model as an example, all computing processes send their local model parameters to the parameter server together after completing each iteration, and the parameter server performs model parameter aggregation after it has received the local model parameters of all computing processes, thereby obtaining the global model parameters.
Because the computing processes run at different speeds, the BSP model waits for the slowest process to finish its iteration before starting synchronization; as shown in fig. 2, a synchronization barrier is established after the slowest processes (No. 1 and No. 5), and the other processes wait during this period. The stale synchronous parallel (SSP, delayed synchronization) model instead establishes the synchronization barrier according to a preset staleness threshold s: when the fastest process has trained s more iterations than the slowest process, all processes enter the synchronization barrier and synchronize.
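For illustration, a minimal check of the SSP staleness condition described above (a sketch; the counter layout and the function name are assumptions, not the patent's code):
#include <algorithm>
#include <vector>
// iteration_count[i] is the number of iterations process i has completed.
// Under SSP with staleness threshold s, a process must stop at the
// synchronization barrier once it is more than s iterations ahead of the
// slowest process.
bool MustWaitAtBarrier(const std::vector<int>& iteration_count, int rank, int s) {
  int slowest = *std::min_element(iteration_count.begin(), iteration_count.end());
  return iteration_count[rank] - slowest > s;   // true: enter the barrier and wait
}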
Specifically, synchronization between the parameter server and the computing processes comprises the following steps (a worker-side sketch follows this list):
1. performing iterative computation in a computation process;
2. the calculation process enters a synchronization barrier according to a set model synchronization strategy;
3. the calculation process sends the local model parameters to the parameter server;
4. the parameter server computes the global model and broadcasts it to all computing processes.
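The worker-side view of these four steps, sketched below for illustration (the helper functions are placeholders, not the patent's implementation):
#include <vector>
std::vector<float> TrainOneIteration(const std::vector<float>& params) {
  return params;                                                    // placeholder: one local training iteration
}
void EnterSynchronizationBarrier() {}                               // placeholder: barrier per synchronization strategy
void SendLocalModel(const std::vector<float>& params, float loss) {}   // placeholder: send parameters and loss
std::vector<float> ReceiveGlobalModel() { return {}; }              // placeholder: receive the broadcast
void WorkerLoop(std::vector<float> local_params, int rounds) {
  for (int r = 0; r < rounds; ++r) {
    local_params = TrainOneIteration(local_params);   // 1. iterative computation
    EnterSynchronizationBarrier();                    // 2. enter the synchronization barrier
    SendLocalModel(local_params, 0.0f);               // 3. send local parameters (and loss)
    local_params = ReceiveGlobalModel();              // 4. receive the global model and continue
  }
}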
In order to complete the communication process, the invention defines a Blob basic data structure, which has the following structure:
(The full Blob class definition is reproduced as an image, Figure BDA0002294996750000031, in the original filing.)
where the main variables are defined as:
shared_ptr<SyncedMemory> data_;
shared_ptr<SyncedMemory> diff_;
shared_ptr<SyncedMemory> shape_data_;
vector<int> shape_;
int count_;
int capacity_;
The data_ pointer is a shared_ptr (from the Boost library) and is mainly used to allocate memory for the data of the forward pass; diff_ is likewise a smart pointer and stores the updated parameters; shape_data_ and shape_ store the shape of the Blob; count_ stores the number of elements in the Blob; and because a Blob may be restructured many times as needed, capacity_ records the currently allocated capacity.
Step 4: the parameter server determines the aggregation weight of each computing process's local model from the loss function value contained in the data received from that process. As shown in fig. 3, the invention uses the loss function value of each computing process to decide how much that process should contribute to the current global model. Assuming there are 4 computing processes, each one also uploads the loss function value of its local model when it uploads its parameters to the parameter server; in the figure, the height of the rectangle to the right of each computing process represents the magnitude of its loss function value, a larger rectangle meaning a larger loss and vice versa. After receiving the parameters and loss function values of every computing process, the parameter server sorts the local models by loss function value in ascending order and then computes a weighted average of all local model parameters, taking the reciprocal of each loss function value as the weight, so that higher-quality local models (those with smaller loss function values) take as large a share of the global model as possible.
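A minimal sketch of this aggregation (function and variable names are illustrative, not the patent's implementation; it assumes the 1/L weights are normalized so that they sum to one, i.e. the global parameters equal (sum_i (1/L_i) w_i) / (sum_i 1/L_i)):
#include <cstddef>
#include <vector>
// Aggregate local model parameters into global parameters, weighting each
// computing process by 1 / loss as described in step 4. local_params[i] and
// loss[i] come from process i; all parameter vectors have the same length,
// and all losses are assumed positive.
std::vector<float> AggregateByInverseLoss(
    const std::vector<std::vector<float>>& local_params,
    const std::vector<float>& loss) {
  std::size_t dim = local_params[0].size();
  std::vector<float> global(dim, 0.0f);
  float weight_sum = 0.0f;
  for (std::size_t i = 0; i < local_params.size(); ++i) {
    float w = 1.0f / loss[i];                 // smaller loss -> larger weight
    weight_sum += w;
    for (std::size_t d = 0; d < dim; ++d)
      global[d] += w * local_params[i][d];
  }
  for (std::size_t d = 0; d < dim; ++d)
    global[d] /= weight_sum;                  // normalized weighted average
  return global;
}
In this sketch the ascending sort described above is omitted: a weighted average does not depend on summation order, so the 1/L weights alone determine each model's share.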
Communication between the parameter server and the computing processes is carried out through threads started on the main process: each computing process corresponds to one communication thread, and one additional thread is started for updating the global parameters, that is, for broadcasting the new global parameters to the computing processes. During the overall iteration, the parameter server assigns the current iteration number to an idle computing process, and a computing process that has finished its current iteration step is placed in the idleQ queue, ready for the next round of training.
The threads of the parameter server are created with pthread; the global parameter update thread is started as:
pthread_create(&threads,NULL,ComputeValueThreadServer<Dtype>,&pramas);
and each communication thread is started as:
pthread_create(&threadc[i],NULL,ComputeValueThreadClient<Dtype>,&pramac[i]);
the thread array stores thread numbers and is used for determining the corresponding relation between the starting threads and the computing process. ComputeValueThreadServer and computevaluethreadthread are defined function handles in which all operations required during thread start are defined, the former calculates global parameters separately by the network layer after the calculation process sends the completion local parameters, and the latter repeats the loop to obtain the local parameters sent from different processes for each iteration.
In addition, because the threads share memory, the invention uses a lock to prevent conflicting reads and writes and to guarantee data correctness, defined as follows:
pthread_mutex_lock(&mutexData);
function(); // the protected read/write operation goes here
pthread_mutex_unlock(&mutexData);
Alternatively, a lock that coordinates multiple variables is controlled with a broadcast of the following form:
pthread_cond_broadcast(&mutexData);
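A self-contained illustration of this locking pattern (names are illustrative; the broadcast is shown on a pthread_cond_t, which is the type pthread_cond_broadcast expects):
#include <pthread.h>
pthread_mutex_t mutexData = PTHREAD_MUTEX_INITIALIZER;  // guards the shared state
pthread_cond_t  condData  = PTHREAD_COND_INITIALIZER;   // signals "global model ready"
bool global_ready = false;
// Called by the update thread once the new global parameters are in place.
void PublishGlobalModel() {
  pthread_mutex_lock(&mutexData);
  global_ready = true;                  // modify shared state under the lock
  pthread_cond_broadcast(&condData);    // wake every waiting communication thread
  pthread_mutex_unlock(&mutexData);
}
// Called by a communication thread before reading the global parameters.
void WaitForGlobalModel() {
  pthread_mutex_lock(&mutexData);
  while (!global_ready)                 // re-check the condition after each wakeup
    pthread_cond_wait(&condData, &mutexData);
  pthread_mutex_unlock(&mutexData);
}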
the pseudo code of the random gradient descent method based on loss function weight reordering of the invention is as follows:
(The pseudo code is reproduced as an image, Figure BDA0002294996750000051, in the original filing.)
Step 5: after the parameter server has computed the global model parameters, it sends them to all computing processes, and each computing process uses the new global model as its new local model and continues training.
Step 6: repeat from step 3 until the model converges.

Claims (3)

1. A distributed machine learning model parameter aggregation method for a smart campus, characterized by comprising the following steps:
step 1: collecting the daily behavior information of users generated by the smart campus, and converting it into a uniform data format;
step 2: each calculation process randomly selecting the same amount of data from the training data, and training;
step 3: according to a synchronization strategy preset before training, each calculation process sending the local model parameters being trained to the main process where the parameter server is located every fixed number of iterations;
step 4: the parameter server setting a merging weight 1/L for each calculation process according to the loss function value L of the model sent by that process, and carrying out a weighted average over all local model parameters to obtain the global model parameters;
step 5: the parameter server sending the global model parameters to all calculation processes, and each calculation process continuing training after receiving the new global model;
step 6: returning to step 2 until the training result of the neural network converges.
2. The method of claim 1, wherein: the acquisition of data by each calculation process in step 2 is realized by sampling without replacement.
3. The method of claim 1, wherein the method comprises: in step 4, the host process orders the loss function values for all of the computing processes.
CN201911197322.9A 2019-11-29 2019-11-29 Smart campus-oriented distributed machine learning model parameter aggregation method Pending CN110929885A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911197322.9A CN110929885A (en) 2019-11-29 2019-11-29 Smart campus-oriented distributed machine learning model parameter aggregation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911197322.9A CN110929885A (en) 2019-11-29 2019-11-29 Smart campus-oriented distributed machine learning model parameter aggregation method

Publications (1)

Publication Number Publication Date
CN110929885A true CN110929885A (en) 2020-03-27

Family

ID=69847615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911197322.9A Pending CN110929885A (en) 2019-11-29 2019-11-29 Smart campus-oriented distributed machine learning model parameter aggregation method

Country Status (1)

Country Link
CN (1) CN110929885A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111917648A (en) * 2020-06-30 2020-11-10 华南理工大学 Transmission optimization method for rearrangement of distributed machine learning data in data center
CN113177645A (en) * 2021-06-29 2021-07-27 腾讯科技(深圳)有限公司 Federal learning method and device, computing equipment and storage medium
CN114726861A (en) * 2022-04-02 2022-07-08 中国科学技术大学苏州高等研究院 Model aggregation acceleration method and device based on idle server

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491992A (en) * 2017-08-25 2017-12-19 哈尔滨工业大学(威海) A kind of intelligent Service proposed algorithm based on cloud computing
CN110321422A (en) * 2018-03-28 2019-10-11 腾讯科技(深圳)有限公司 Method, method for pushing, device and the equipment of on-line training model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491992A (en) * 2017-08-25 2017-12-19 哈尔滨工业大学(威海) A kind of intelligent Service proposed algorithm based on cloud computing
CN110321422A (en) * 2018-03-28 2019-10-11 腾讯科技(深圳)有限公司 Method, method for pushing, device and the equipment of on-line training model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUCHEN FAN et al.: "Model Aggregation Method for Data Parallelism in Distributed Real-Time Machine Learning of Smart Sensing Equipment", IEEE ACCESS *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111917648A (en) * 2020-06-30 2020-11-10 华南理工大学 Transmission optimization method for rearrangement of distributed machine learning data in data center
CN111917648B (en) * 2020-06-30 2021-10-26 华南理工大学 Transmission optimization method for rearrangement of distributed machine learning data in data center
CN113177645A (en) * 2021-06-29 2021-07-27 腾讯科技(深圳)有限公司 Federal learning method and device, computing equipment and storage medium
CN113177645B (en) * 2021-06-29 2021-09-28 腾讯科技(深圳)有限公司 Federal learning method and device, computing equipment and storage medium
CN114726861A (en) * 2022-04-02 2022-07-08 中国科学技术大学苏州高等研究院 Model aggregation acceleration method and device based on idle server

Similar Documents

Publication Publication Date Title
CN109271015B (en) Method for reducing energy consumption of large-scale distributed machine learning system
CN110929885A (en) Smart campus-oriented distributed machine learning model parameter aggregation method
CN111694656B (en) Cluster resource scheduling method and system based on multi-agent deep reinforcement learning
CN110968426B (en) Edge cloud collaborative k-means clustering model optimization method based on online learning
CN104168318A (en) Resource service system and resource distribution method thereof
CN111444021B (en) Synchronous training method, server and system based on distributed machine learning
CN111079921A (en) Efficient neural network training and scheduling method based on heterogeneous distributed system
CN103699433B (en) One kind dynamically adjusts number of tasks purpose method and system in Hadoop platform
CN109471847B (en) I/O congestion control method and control system
CN111274036A (en) Deep learning task scheduling method based on speed prediction
CN110119421A (en) A kind of electric power stealing user identification method based on Spark flow sorter
CN113515351A (en) Resource scheduling implementation method based on energy consumption and QoS (quality of service) cooperative optimization
CN113472597B (en) Distributed convolutional neural network fine-grained parameter transmission scheduling method and device
CN109445386A (en) A kind of most short production time dispatching method of the cloud manufacturing operation based on ONBA
CN109548161A (en) A kind of method, apparatus and terminal device of wireless resource scheduling
CN113568759B (en) Cloud computing-based big data processing method and system
CN113435125A (en) Model training acceleration method and system for federal Internet of things system
Zaman et al. Scenario-based solution approach for uncertain resource constrained scheduling problems
CN101236565A (en) Multiple meaning digital picture search method based on representation conversion
CN117202264A (en) 5G network slice oriented computing and unloading method in MEC environment
CN112446484A (en) Multitask training cluster intelligent network system and cluster network optimization method
CN115115064A (en) Semi-asynchronous federal learning method and system
CN113220311A (en) Mobile-aware cloud-edge-side collaborative application unloading method and system and storage medium thereof
CN117251276B (en) Flexible scheduling method and device for collaborative learning platform
Zhou et al. DRL-Based Workload Allocation for Distributed Coded Machine Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200327