CN109492753A - A kind of method of the stochastic gradient descent of decentralization - Google Patents
A kind of method of the stochastic gradient descent of decentralization
- Publication number
- CN109492753A CN109492753A CN201811309202.9A CN201811309202A CN109492753A CN 109492753 A CN109492753 A CN 109492753A CN 201811309202 A CN201811309202 A CN 201811309202A CN 109492753 A CN109492753 A CN 109492753A
- Authority
- CN
- China
- Prior art keywords
- working node
- node
- parameter
- local
- gradient descent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 28
- 238000012549 training Methods 0.000 claims abstract description 19
- 238000011478 gradient descent method Methods 0.000 claims abstract description 12
- 239000011159 matrix material Substances 0.000 claims description 20
- 238000009826 distribution Methods 0.000 claims description 3
- 238000005070 sampling Methods 0.000 claims description 3
- 230000011218 segmentation Effects 0.000 claims description 3
- 238000013135 deep learning Methods 0.000 abstract description 11
- 238000004891 communication Methods 0.000 abstract description 8
- 230000007423 decrease Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000013473 artificial intelligence Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 1
- 230000004087 circulation Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000002360 preparation method Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 230000001360 synchronised effect Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 238000005303 weighing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a decentralized stochastic gradient descent method. The centralized parallel stochastic gradient descent used in conventional distributed deep learning frameworks is replaced by decentralized parallel stochastic gradient descent for training: the central server node is removed, and each remaining working node communicates only with its adjacent working nodes to perform local model training and parameter updates. Through repeated training on multiple working nodes, the parameters are continuously tuned until a local optimum is obtained, thereby completing distributed deep learning.
Description
Technical field
The present invention relates to the field of deep learning technology, and more particularly to a decentralized stochastic gradient descent method.
Background technique
Today, with artificial intelligence developing continuously, deep learning has become one of its most important fields. Distributed deep learning algorithms are iterative: the model is not updated in a single pass but refined over many loop iterations. They are fault tolerant: even if some errors occur in an iteration, the final convergence of the model is unaffected. Their parameters also converge non-uniformly: some parameters stop changing after a few iterations, while others take a long time to converge. These characteristics mean that deep learning applied to machine learning cannot scale linearly with the number of machines, because a large share of resources is wasted on communication, waiting and coordination. To make up for this shortcoming, the parameter server was proposed as a framework dedicated to large-scale optimization and to training on large-scale data, for example at the TB or even PB level, with large-scale model parameters. In a large-scale optimization framework there are often billions or even hundreds of billions of parameters to estimate; when designing a system that faces this challenge, algorithms that rely on SGD or L-BFGS to optimize large-scale topic models must cope with the enormous bandwidth consumed by frequently accessing and modifying model parameters, and must address problems such as improving parallelism, the delay caused by synchronous waiting, and fault tolerance. For this reason, a parallel stochastic gradient descent algorithm with a centralized parameter server is frequently used for distributed deep learning.
However, a distributed deep learning framework with a parameter server only achieves good results when the network is unobstructed. In practice the network environment is not always optimal: under low-bandwidth, high-latency network conditions performance drops significantly, because the parameter server node has to communicate with all nodes, so that network congestion arises at the server when the network is poor and the operating speed decreases. In addition, as network models become more and more complex, communication takes an ever larger share of the total time; heavy communication puts greater pressure on the parameter server, and communication time becomes the bottleneck.
Therefore, how to reduce the communication time in distributed deep learning training and improve operational efficiency is an urgent problem for those skilled in the art.
Summary of the invention
In view of this, the present invention provides a decentralized stochastic gradient descent method that can be applied in data-parallel distributed deep learning frameworks. The central node of the traditional parallel stochastic gradient descent method, i.e. the parameter server node, is removed, thereby saving communication time and improving network transmission speed.
To achieve the above objects, the present invention adopts the following technical scheme:
A decentralized stochastic gradient descent method comprises the following specific steps:
Step 1: split the data set to be trained on into n blocks and assign each individual block to a specific working node;
Step 2: each working node samples training data from its assigned block to train the local model of the working node;
Step 3: all working nodes simultaneously use an iterative procedure and parallel stochastic gradient descent to compute the parameter update of the working node;
wherein the specific steps of the working node parameter update are as follows:
Step 31: first initialize the local working node: set the initial parameter value x0; set the step size γ; set the weight matrix W; set the number of iterations K;
Step 32: randomly select data for this iteration from the local data set of the local working node;
Step 33: apply stochastic gradient descent to the selected data and the parameters of the local working node, computing the gradient u of this iteration, i.e. the stochastic gradient of the loss evaluated at the parameters xi on the sampled data, where xi is the parameter of the local node updated in the previous iteration;
Step 34: obtain the parameters of the local working node and of its adjacent working nodes, obtain the weights of the adjacent working nodes and of the working node from the weight matrix W, and weight the parameters to obtain the provisional parameter x';
Step 35: substitute the gradient u obtained in step 33 and the provisional parameter x' obtained in step 34 into the stochastic gradient descent formula x = x' - γu to obtain the updated parameter x of the local working node and perform the update;
Step 36: examine the gradient u of the local working node and of the adjacent working nodes, and redistribute the weights in the weight matrix W according to the ratio between the gradient of the working node and the gradients of the adjacent working nodes. The specific calculation is: compare the gradients of the local working node and of an adjacent working node and take the ratio of the smaller to the larger; multiply the weight of that adjacent working node by this ratio to obtain its adjusted weight; then subtract the sum of the adjusted weights of the adjacent working nodes from 1 to obtain the adjusted weight of the local working node;
Step 37: judge whether K iterations have been completed; if not, return to step 32; if so, the parameter update and the weight distribution of the working node are completed, and the local model training process ends.
Preferably, the weight matrix W is fully initialized, with the weights of the local working node and of its adjacent working nodes initialized to 1/3.
Preferably, the adjusted working node weights are saved back into the weight matrix W.
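The following Python sketch illustrates one working node's loop for steps 31 to 37 above. It is a minimal sketch under stated assumptions, not the patented implementation: `grad_fn`, `local_data`, `neighbors` and the `fetch` callables are hypothetical stand-ins for the model's loss gradient, the node's data block and node-to-node communication, and gradients are compared by their norms in step 36, a detail the description leaves open.

```python
import numpy as np

def decentralized_sgd_worker(grad_fn, local_data, neighbors, x0, gamma, K):
    """One working node's update loop, steps 31-37 (illustrative sketch).

    grad_fn(x, sample) -> stochastic gradient of the loss at x on one sample
    neighbors: dict {node_id: fetch} where fetch() returns the neighbor's
               current (parameters, gradient); communication is abstracted away.
    """
    rng = np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    # Step 31: the local node and each adjacent node start with weight 1/3.
    w_local = 1.0 / 3.0
    w = {j: 1.0 / 3.0 for j in neighbors}

    for _ in range(K):
        # Step 32: randomly select data for this iteration from the local block.
        sample = local_data[rng.integers(len(local_data))]
        # Step 33: stochastic gradient u at the current parameters.
        u = grad_fn(x, sample)
        # Step 34: weighted combination of own and neighbors' parameters -> x'.
        x_prime = w_local * x
        neighbor_grads = {}
        for j, fetch in neighbors.items():
            x_j, u_j = fetch()
            x_prime = x_prime + w[j] * x_j
            neighbor_grads[j] = u_j
        # Step 35: SGD step on the provisional parameter: x = x' - gamma * u.
        x = x_prime - gamma * u
        # Step 36: scale each neighbor's weight by the smaller-to-larger
        # gradient ratio; the local node keeps the remainder of the row.
        for j, u_j in neighbor_grads.items():
            ratio = min(np.linalg.norm(u), np.linalg.norm(u_j)) / \
                    max(np.linalg.norm(u), np.linalg.norm(u_j))
            w[j] = w[j] * ratio
        w_local = 1.0 - sum(w.values())
    # Step 37: K iterations completed; local model training is done.
    return x
```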
As can be seen from the above technical scheme, compared with the prior art, the present invention provides a decentralized stochastic gradient descent method in which the central server node of the traditional parallel stochastic gradient descent method is removed, so that every parameter update of a working node is carried out locally and with its adjacent working nodes. Working nodes exchange information with one another; the weights are redistributed according to the ratio between the gradient of the local working node and the gradients of the adjacent working nodes, and the influence weights between working nodes are stored in a weight matrix. The parameter update is iterative: in each iteration every working node performs one step of the stochastic gradient descent algorithm, and before the gradient influences the parameters, the parameters of the local working node and of the adjacent working nodes are first weighted to obtain a provisional parameter; the provisional parameter and the gradient of the local working node are then used to update the parameters of the local working node, finally completing the model training process.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flow chart provided by the present invention;
Fig. 2 is a schematic diagram of the working node communication structure provided by the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a decentralized stochastic gradient descent method comprising the following specific steps:
S1: split the data set to be trained on into n blocks and assign each individual block to a specific working node;
S2: each working node samples training data from its assigned block to train the node's local model;
S3: all working nodes simultaneously use an iterative procedure and parallel stochastic gradient descent to compute the parameter update of the working node;
wherein the specific steps of the working node parameter update are as follows:
S31: first initialize the local working node: set the initial parameter value x0; set the step size γ; set the weight matrix W; set the number of iterations K;
S32: randomly select data for this iteration from the local data set of the local working node;
S33: apply stochastic gradient descent to the data and the parameters of the local working node to compute the gradient u of this iteration, i.e. the stochastic gradient of the loss evaluated at the parameters xi on the sampled data, where xi is the parameter of the local node updated in the previous iteration;
S34: obtain the parameters of the local working node and of its adjacent working nodes, obtain the weights of the local working node and of the adjacent working nodes from the weight matrix W, and weight the parameters to obtain the provisional parameter x';
S35: using the gradient u obtained in S33 and the provisional parameter x' obtained in S34, apply the stochastic gradient descent formula x = x' - γu to obtain the updated parameter x of the local working node and perform the update;
S36: examine the gradient u of the local working node and of the adjacent working nodes, and redistribute the weights in the weight matrix W according to the ratio between the gradient u of the working node and the gradients u of the adjacent working nodes. The specific calculation is: compare the gradients u of the local working node and of an adjacent working node and take the ratio of the smaller to the larger; multiply the weight of that adjacent working node by this ratio to obtain its adjusted weight; then subtract the sum of the adjusted weights of the adjacent working nodes from 1 to obtain the adjusted weight of the local working node;
S37: judge whether K iterations have been completed; if not, return to step 32; if so, the parameter update and the weight distribution of the working node are completed and the local model training process ends. Redistributing the weight matrix adjusts the weights between different working nodes so that adjacent nodes whose gradient u is more similar to that of the local node obtain a larger weight, which accelerates the convergence of model training.
In order to further optimize the above technical scheme, the weight matrix W is fully initialized, with the weights of the local working node and of the adjacent working nodes initialized to 1/3.
In order to further optimize the above technical scheme, the adjusted working node weights are saved back into the weight matrix W.
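As a worked illustration of the weight reallocation in S36, the short snippet below uses hypothetical gradient magnitudes for a local node and its two neighbors; the patent does not specify how two gradient vectors are compared, so scalar magnitudes are assumed here.

```python
# Hypothetical gradient magnitudes for the local node and its two neighbors.
u_local, u_left, u_right = 4.0, 3.0, 8.0
w = {"left": 1 / 3, "right": 1 / 3}      # weights before adjustment

# Ratio of the smaller to the larger gradient, per neighbor (S36).
w["left"] *= min(u_local, u_left) / max(u_local, u_left)      # 1/3 * 3/4 = 0.25
w["right"] *= min(u_local, u_right) / max(u_local, u_right)   # 1/3 * 4/8 ~= 0.167

# The local node keeps the remainder, so the row of W still sums to 1.
w_local = 1.0 - (w["left"] + w["right"])                       # ~= 0.583
```

The neighbor whose gradient is closer to the local gradient (ratio 3/4) keeps a larger weight than the dissimilar one (ratio 1/2), which is exactly the behaviour described in S37: more similar adjacent nodes obtain larger weights.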
Embodiment
During decentralized stochastic gradient descent model training, the essence is to remove the central parameter server node of the conventional centralized stochastic gradient descent method, so that each working node's parameter updates are carried out locally and between adjacent working nodes. The training data set is first divided into n blocks and each individual block is assigned to a specific working node, so that every working node can train the model locally. All working nodes perform local model training simultaneously. A weight matrix connecting all working nodes is defined first, and then the initial parameters, step size and number of iterations of each working node are initialized. During local model training, each working node first computes its local gradient and then exchanges parameters with its adjacent working nodes, i.e. the parameters of the local working node and of the adjacent working nodes are weighted with the weights of the local and adjacent working nodes to obtain a provisional parameter; stochastic gradient descent is then applied to the provisional parameter and the gradient of the local working node to obtain the updated parameter of the local working node and perform the local parameter update. At the same time, the difference between the gradients of the local working node and of the adjacent working nodes is examined, and the weights are redistributed according to the gradient ratio.
The weight matrix W is an n*n matrix; each row represents the weight influence relationship between one local working node and all working nodes. When each row is initialized, the weights of the local working node and of its two adjacent working nodes are set to 1/3 and the remaining entries are set to 0, meaning that the local working node has weight 0 with respect to all non-adjacent working nodes and there is no weight influence between them.
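A minimal sketch of this initialization is given below, assuming the n working nodes form a ring so that each node has exactly two adjacent working nodes; the function name and the ring layout are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def initial_weight_matrix(n):
    """Initial n*n weight matrix W for n working nodes on a ring (n >= 3).

    Row i holds node i's influence weights: 1/3 for itself and for each of
    its two adjacent working nodes, 0 for every non-adjacent working node.
    """
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0   # left neighbor
        W[i, (i + 1) % n] = 1.0 / 3.0   # right neighbor
    return W

# Example: 5 working nodes; each row sums to 1 and has exactly three
# non-zero entries (the node itself and its two neighbors).
print(initial_weight_matrix(5))
```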
Since the present invention has no central server node, the communication complexity of the busiest working node is determined by the complexity of the graph corresponding to the model being trained. Although the communication time of each working node is higher than that of a working node in the centralized stochastic gradient descent method, the number of stochastic gradient descent parameter updates per working node is unchanged and the central server node is eliminated; therefore, while the computation time differs very little, the overall time of the present invention is shorter, and under low-bandwidth, high-latency network conditions the advantage in communication time is even more obvious.
As for communication requirements, in the conventional centralized stochastic gradient descent method every working node must communicate with the same central server node, so in asynchronous communication the data difference between all working nodes must not be too large. The present invention only requires communication with adjacent working nodes, so only the data similarity between adjacent working nodes needs to be guaranteed; the scope of application of the present invention is therefore wider.
The present invention provides a decentralized stochastic gradient descent method for distributed deep learning. The central server node in the traditional data-parallel stochastic gradient descent method is removed, so that every parameter update of a working node is carried out locally between the node and its adjacent working nodes, and working nodes exchange information with one another. The gradient value of each working node is computed; the parameters of the adjacent working nodes are first weighted with the weights of the adjacent working nodes and of the local working node, and the weighted parameter value then influences the parameters of the local working node; the weights are redistributed according to the ratio between the local working node gradient and the adjacent working node gradients and saved into the weight matrix, thereby completing the parameter update of the local working node and realizing distributed deep learning training.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may refer to each other. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple, and for relevant details reference may be made to the description of the method.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be realized in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (3)
1. A decentralized stochastic gradient descent method, characterized by comprising the following specific steps:
Step 1: splitting the data set to be trained on into n blocks and assigning each individual block to a specific working node;
Step 2: each working node sampling training data from its assigned block to train the local model of the working node;
Step 3: all working nodes simultaneously using an iterative procedure and stochastic gradient descent to compute the parameter update of the working node;
wherein the specific steps of the working node parameter update are as follows:
Step 31: first initializing the local working node: setting the initial parameter value x0; setting the step size γ; setting the weight matrix W; setting the number of iterations K;
Step 32: randomly selecting data for this iteration from the local data set of the local working node;
Step 33: applying stochastic gradient descent to the selected data and the parameters of the local working node to compute the gradient u of this iteration;
Step 34: obtaining the parameters of the adjacent working nodes and of the local working node, obtaining the weights of the adjacent working nodes and of the local working node from the weight matrix W, and weighting them to obtain the provisional parameter x';
Step 35: obtaining the updated parameter of the local working node by stochastic gradient descent from the gradient u obtained in step 33 and the provisional parameter x' obtained in step 34;
Step 36: examining the gradient u and the gradients of the adjacent working nodes, and redistributing the weights in the weight matrix W according to the ratio between the gradient of the local working node and the gradients of the adjacent working nodes;
Step 37: judging whether K iterations have been completed; if not, returning to step 32; if so, completing the parameter update and the weight distribution of the working node and completing the process of the local model training.
2. The decentralized stochastic gradient descent method according to claim 1, characterized in that the weight matrix W is fully initialized, with the weights of the local working node and of the adjacent working nodes initialized to 1/3.
3. The decentralized stochastic gradient descent method according to claim 1, characterized in that each of the working nodes has two of the adjacent working nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811309202.9A CN109492753A (en) | 2018-11-05 | 2018-11-05 | A kind of method of the stochastic gradient descent of decentralization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811309202.9A CN109492753A (en) | 2018-11-05 | 2018-11-05 | A kind of method of the stochastic gradient descent of decentralization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109492753A true CN109492753A (en) | 2019-03-19 |
Family
ID=65693869
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811309202.9A Pending CN109492753A (en) | 2018-11-05 | 2018-11-05 | A kind of method of the stochastic gradient descent of decentralization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109492753A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929878A (en) * | 2019-10-30 | 2020-03-27 | 同济大学 | Distributed random gradient descent method |
CN110956265A (en) * | 2019-12-03 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Model training method and related device |
CN111178503A (en) * | 2019-12-16 | 2020-05-19 | 北京邮电大学 | Mobile terminal-oriented decentralized target detection model training method and system |
CN112001501A (en) * | 2020-08-14 | 2020-11-27 | 苏州浪潮智能科技有限公司 | Parameter updating method, device and equipment of AI distributed training system |
CN112688809A (en) * | 2020-12-21 | 2021-04-20 | 声耕智能科技(西安)研究院有限公司 | Diffusion adaptive network learning method, system, terminal and storage medium |
CN112861991A (en) * | 2021-03-09 | 2021-05-28 | 中山大学 | Learning rate adjusting method for neural network asynchronous training |
CN113254215A (en) * | 2021-06-16 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Data processing method and device, storage medium and electronic equipment |
CN113870588A (en) * | 2021-08-20 | 2021-12-31 | 深圳市人工智能与机器人研究院 | Traffic light control method based on deep Q network, terminal and storage medium |
CN114398949A (en) * | 2021-12-13 | 2022-04-26 | 鹏城实验室 | Training method of impulse neural network model, storage medium and computing device |
US11875256B2 (en) | 2020-07-09 | 2024-01-16 | International Business Machines Corporation | Dynamic computation in decentralized distributed deep learning training |
US11886969B2 (en) | 2020-07-09 | 2024-01-30 | International Business Machines Corporation | Dynamic network bandwidth in distributed deep learning training |
US11977986B2 (en) | 2020-07-09 | 2024-05-07 | International Business Machines Corporation | Dynamic computation rates for distributed deep learning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104714852A (en) * | 2015-03-17 | 2015-06-17 | 华中科技大学 | Parameter synchronization optimization method and system suitable for distributed machine learning |
US20160180162A1 (en) * | 2014-12-22 | 2016-06-23 | Yahoo! Inc. | Generating preference indices for image content |
US9659248B1 (en) * | 2016-01-19 | 2017-05-23 | International Business Machines Corporation | Machine learning and training a computer-implemented neural network to retrieve semantically equivalent questions using hybrid in-memory representations |
CN107194396A (en) * | 2017-05-08 | 2017-09-22 | 武汉大学 | Method for early warning is recognized based on the specific architecture against regulations in land resources video monitoring system |
CN107578094A (en) * | 2017-10-25 | 2018-01-12 | 济南浪潮高新科技投资发展有限公司 | The method that the distributed training of neutral net is realized based on parameter server and FPGA |
CN108122027A (en) * | 2016-11-29 | 2018-06-05 | 华为技术有限公司 | A kind of training method of neural network model, device and chip |
CN108287763A (en) * | 2018-01-29 | 2018-07-17 | 中兴飞流信息科技有限公司 | Parameter exchange method, working node and parameter server system |
- 2018
  - 2018-11-05 CN CN201811309202.9A patent/CN109492753A/en active Pending
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160180162A1 (en) * | 2014-12-22 | 2016-06-23 | Yahoo! Inc. | Generating preference indices for image content |
CN104714852A (en) * | 2015-03-17 | 2015-06-17 | 华中科技大学 | Parameter synchronization optimization method and system suitable for distributed machine learning |
US9659248B1 (en) * | 2016-01-19 | 2017-05-23 | International Business Machines Corporation | Machine learning and training a computer-implemented neural network to retrieve semantically equivalent questions using hybrid in-memory representations |
CN108122027A (en) * | 2016-11-29 | 2018-06-05 | 华为技术有限公司 | A kind of training method of neural network model, device and chip |
CN107194396A (en) * | 2017-05-08 | 2017-09-22 | 武汉大学 | Method for early warning is recognized based on the specific architecture against regulations in land resources video monitoring system |
CN107578094A (en) * | 2017-10-25 | 2018-01-12 | 济南浪潮高新科技投资发展有限公司 | The method that the distributed training of neutral net is realized based on parameter server and FPGA |
CN108287763A (en) * | 2018-01-29 | 2018-07-17 | 中兴飞流信息科技有限公司 | Parameter exchange method, working node and parameter server system |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110929878A (en) * | 2019-10-30 | 2020-03-27 | 同济大学 | Distributed random gradient descent method |
CN110929878B (en) * | 2019-10-30 | 2023-07-04 | 同济大学 | Distributed random gradient descent method |
CN110956265A (en) * | 2019-12-03 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Model training method and related device |
CN111178503A (en) * | 2019-12-16 | 2020-05-19 | 北京邮电大学 | Mobile terminal-oriented decentralized target detection model training method and system |
US11977986B2 (en) | 2020-07-09 | 2024-05-07 | International Business Machines Corporation | Dynamic computation rates for distributed deep learning |
US11886969B2 (en) | 2020-07-09 | 2024-01-30 | International Business Machines Corporation | Dynamic network bandwidth in distributed deep learning training |
US11875256B2 (en) | 2020-07-09 | 2024-01-16 | International Business Machines Corporation | Dynamic computation in decentralized distributed deep learning training |
CN112001501B (en) * | 2020-08-14 | 2022-12-23 | 苏州浪潮智能科技有限公司 | Parameter updating method, device and equipment of AI distributed training system |
CN112001501A (en) * | 2020-08-14 | 2020-11-27 | 苏州浪潮智能科技有限公司 | Parameter updating method, device and equipment of AI distributed training system |
CN112688809B (en) * | 2020-12-21 | 2023-10-03 | 声耕智能科技(西安)研究院有限公司 | Diffusion self-adaptive network learning method, system, terminal and storage medium |
CN112688809A (en) * | 2020-12-21 | 2021-04-20 | 声耕智能科技(西安)研究院有限公司 | Diffusion adaptive network learning method, system, terminal and storage medium |
CN112861991A (en) * | 2021-03-09 | 2021-05-28 | 中山大学 | Learning rate adjusting method for neural network asynchronous training |
CN113254215A (en) * | 2021-06-16 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Data processing method and device, storage medium and electronic equipment |
CN113870588A (en) * | 2021-08-20 | 2021-12-31 | 深圳市人工智能与机器人研究院 | Traffic light control method based on deep Q network, terminal and storage medium |
CN114398949A (en) * | 2021-12-13 | 2022-04-26 | 鹏城实验室 | Training method of impulse neural network model, storage medium and computing device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109492753A (en) | A kind of method of the stochastic gradient descent of decentralization | |
CN106297774B (en) | A kind of the distributed parallel training method and system of neural network acoustic model | |
CN108460457A (en) | A kind of more asynchronous training methods of card hybrid parallel of multimachine towards convolutional neural networks | |
CN104714852B (en) | A kind of parameter synchronization optimization method and its system suitable for distributed machines study | |
CN106156810A (en) | General-purpose machinery learning algorithm model training method, system and calculating node | |
CN108829441A (en) | A kind of parameter update optimization system of distribution deep learning | |
CN110533183B (en) | Task placement method for heterogeneous network perception in pipeline distributed deep learning | |
WO2023240845A1 (en) | Distributed computation method, system and device, and storage medium | |
CN103561055B (en) | Web application automatic elastic extended method under conversation-based cloud computing environment | |
CN107291550B (en) | A kind of Spark platform resource dynamic allocation method and system for iterated application | |
CN110046048B (en) | Load balancing method based on workload self-adaptive fast redistribution | |
CN104881322B (en) | A kind of cluster resource dispatching method and device based on vanning model | |
Zhan et al. | Pipe-torch: Pipeline-based distributed deep learning in a gpu cluster with heterogeneous networking | |
CN105930591A (en) | Realization method for register clustering in clock tree synthesis | |
CN106020944B (en) | It is a kind of to configure the method and system for carrying out data downloading based on background data base | |
CN104639626A (en) | Multi-level load forecasting and flexible cloud resource configuring method and monitoring and configuring system | |
CN108986063A (en) | The method, apparatus and computer readable storage medium of gradient fusion | |
CN106250240A (en) | A kind of optimizing and scheduling task method | |
CN110059829A (en) | A kind of asynchronous parameters server efficient parallel framework and method | |
CN104346214B (en) | Asynchronous task managing device and method for distributed environment | |
CN107016214A (en) | A kind of parameter based on finite state machine relies on the generation method of model | |
CN108958852A (en) | A kind of system optimization method based on FPGA heterogeneous platform | |
CN107436865A (en) | A kind of word alignment training method, machine translation method and system | |
CN104462329A (en) | Operation process digging method suitable for diversified environment | |
Cao et al. | SAP-SGD: Accelerating distributed parallel training with high communication efficiency on heterogeneous clusters |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned | ||
Effective date of abandoning: 20211029 |