CN116629350A - Improved horizontal synchronous federal learning aggregation acceleration method - Google Patents

Improved horizontal synchronous federal learning aggregation acceleration method

Info

Publication number
CN116629350A
CN116629350A (Application CN202310721384.5A)
Authority
CN
China
Prior art keywords
client
model
local
server
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310721384.5A
Other languages
Chinese (zh)
Inventor
王鑫
丁雪爽
吴浩宇
雷涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi University of Science and Technology
Original Assignee
Shaanxi University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi University of Science and Technology filed Critical Shaanxi University of Science and Technology
Priority to CN202310721384.5A
Publication of CN116629350A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/098 Distributed learning, e.g. federated learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/50 Network services
    • H04L 67/56 Provisioning of proxy services
    • H04L 67/568 Storing data temporarily at an intermediate stage, e.g. caching

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention relates to a horizontal synchronous federated learning aggregation acceleration method comprising the following steps: a client receives the global model and the client contribution value sent by the server; the client trains a local model on its local data set, calculates a divergence threshold, and prepares to send the trained local model; the client compresses the local model using a difference matrix and a gradient compression algorithm and sends the compressed model to the server, which stores all received models in a buffer and calculates their information entropy; the server calculates the contribution values of the previous round's clients, stores the information entropy, takes all local models out of the buffer, and obtains a new global model by weighted aggregation based on the local models' contribution values; the server sends the previous round's contribution values and the new global model to the clients, and each client receives the global model and performs the next round of training. The method improves the accuracy of horizontally synchronized federated training models and reduces the client-to-server communication resources and time overhead.

Description

Improved horizontal synchronous federal learning aggregation acceleration method
Technical Field
The invention belongs to the technical field of horizontal federated learning. It is particularly suited to addressing the problems of large inter-device data differences, poor computational efficiency, slow time response, and low communication efficiency that arise while training models and aggregating model updates in synchronous federated learning, and constitutes an improved horizontal synchronous federated learning aggregation acceleration method.
Background
With the advent of the big data age, attention to data privacy and security has become a worldwide trend. Compared with traditional machine learning (ML), which creates data-security risks by uploading local data to a central server to train a model, federated learning (FL) protects user data privacy through an architecture in which the data stays put and the model moves. It supports joint modeling by multiple institutions under data privacy, security, and regulatory requirements, and offers a feasible way to solve the 'data island' problem.
The main framework of horizontal synchronous federated learning is the federated averaging algorithm (FedAvg), which suits scenarios where the participants' data features overlap substantially, their sample spaces overlap little, their geographic distribution is close, and the communication environment is good. In the FedAvg algorithm, each participating client (Party) first receives the initial model issued by the server and trains synchronously on its local data for several iterations to obtain a local model; it then uploads the local model to the server for averaged aggregation, and the server obtains a new global model and issues it to the clients for a new round of training. During training, each client trains its private local model locally and uploads an encrypted model to the server, using encryption, noise addition, and similar means, in order to update the global model.
The federated averaging algorithm can effectively fuse data features from multiple clients through several rounds of local model iteration, but as the data volume and diversity of the user devices' data sets grow, so do the computing requirements of the local devices and the size of the trained local models, which leads to a larger time overhead when a client uploads its local model to the server. The time overhead of synchronous federated learning falls into two parts: computation time and communication time. The computation time overhead comprises the server's aggregation time and the clients' local training time. The communication time overhead comprises the time spent transmitting data from the client to the server, the time spent waiting for clients to transmit data, and the time spent transmitting data from the server to the clients. In federated learning application scenarios, the server role is typically played by a party with high computational power and stable network resources, so the server's computational and transmission overheads are almost negligible. In synchronous federated learning, reducing the clients' local training overhead and the client-to-server communication overhead therefore becomes the main way to accelerate federated learning. It is thus necessary to propose acceleration schemes both for the clients' local model training and for the server's model aggregation, so as to improve the accuracy of horizontally synchronized federated training models and reduce client-to-server communication resources and time overhead.
Disclosure of Invention
To overcome the above deficiencies of the prior art and meet the needs of the above scenarios, the object of the present invention is to provide an improved horizontal synchronous federated learning acceleration scheme. Under the overall framework of the federated averaging algorithm, the amount of data a client sends to the server is reduced using the ideas of a difference matrix and gradient compression, optimizing the communication efficiency of uploading local models from the clients to the server. Based on D-S evidence theory, a participant optimization scheme is proposed that assigns each participant a scoring weight according to its influence on the global model; in the aggregation stage the server aggregates according to the clients' weight coefficients instead of simple averaging. Finally, a dynamic update strategy based on information entropy and a divergence threshold is introduced: in the stage where a participant sends its local model to the global server, the divergence threshold is used as the criterion for deciding whether to send, which gives the method better flexibility.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
the horizontal synchronous federal learning aggregation acceleration method comprises the following steps:
step 1, a client receives a global model and a client contribution value sent by a server;
step 2, the client trains a local model by using a local data set, calculates a divergence threshold value and prepares to send the trained local model;
step 3, the client optimizes the local model by using a difference matrix and a gradient compression algorithm, and sends the optimized local model to the server, and the server stores all received models into a buffer area and calculates information entropy;
step 4, the server calculates the contribution value of the previous round of client, stores information entropy, takes out all local models from the buffer area, and obtains a new global model based on weighted aggregation of the contribution values of the local models;
and 5, the server sends the contribution value of the previous round and the new global model to the client, and the client receives the global model sent by the server and performs the next round of training.
In step 1, the global model is obtained by aggregating the local models of all clients, as follows: each client trains locally on its local data set to obtain a local model; after applying the deep gradient compression algorithm and the divergence-threshold update criterion, the local model is uploaded to the server buffer; the local models uploaded to the server buffer are then aggregated, weighted by the contribution value obtained for each client, to produce the new round's global model. The client contribution value is calculated based on D-S evidence theory (Dempster-Shafer evidence theory, D-S); it represents the degree to which a client contributed to the global model in the previous round and is an important criterion in the weighted aggregation of the global model. At the start of federated learning, the clients participating in training are assigned an equal average contribution value. In the server aggregation stage, the server obtains the relevant parameters of each client, including the ratio of the client's data set to the overall data set, the client's contribution value in the previous round, the divergence between the client's trained local model and the global model, and the number of rounds in which the client has participated in training so far. Considering that data heterogeneity may cause a Matthew effect in the client contribution values during federated aggregation, the invention uses an evidence-theory algorithm based on the Pearson correlation coefficient to fuse these client parameters into the new round's contribution value, so as to mitigate inaccurate fusion results when the evidence conflict is large.
In step 2, the local data set is the data stored locally at the client and contains the data features of the client's local users; different clients have local data sets of different types and sizes and do not share data with each other. The local model is the model a client trains on the basis of the global model using its local data set. In the initial training stage of federated learning, the server issues a random model to the clients as the global model; each client receives the global model issued by the server and trains it on its local data set to obtain a model carrying the client's local data features, which is called the local model.
The divergence threshold in step 2 is calculated as follows:
In the local model training stage, the client calculates the divergence value of its local model to judge whether the local model has been sufficiently updated; when it has, the client sends the local model to the server, and otherwise it continues local training.
Each time the i-th client Party i participating in federated training completes a round of local iteration, a counter records the local iteration round n, and the divergence value Δ_i of client Party i is calculated from the iteration count n, the contribution value γ_i, and the distance ‖w_i^t − w*‖ between the local model and the reference model. After the participant completes a batch of local model iterations and the divergence threshold condition on Δ_i is satisfied, client Party i sends its local model w_i^t to the server.
Here n is the number of local training rounds performed by the client; Δ denotes the model divergence, with subscript i, so Δ_i is the divergence value of the i-th client's local model; w denotes a federated learning model, w_i^t is the local model trained by the i-th client in round t, w* is the reference model, γ_i is the contribution value of client Party i, and ‖w_i^t − w*‖ describes the distance between the local model and the reference model.
In horizontal synchronous federated learning, to speed up communication between client and server, the global round count and the local round count are set to fixed values. The global round is the number of times the server's global model is updated, and the local round is the number of times a client trains on its local data set, starting from the global model, to obtain a local model.
In the invention, every client participating in training uploads its local model to the server buffer after it satisfies the divergence threshold condition on Δ_i and its local training has reached the set fixed number of rounds, thereby completing the local model training of the current round.
The method for optimizing the local model in step 3 is as follows:
(1) The client calculates a difference matrix. Suppose M clients participate in training in each round, i ∈ [1, M], where M is the maximum number of participating clients. When the difference between the local model and the global model is small, i.e. ‖w_i^t − w^{t−1}‖ < ε for an arbitrary constant ε approaching 0, sending the whole locally trained model w_i^t to the server wastes communication resources. The client therefore sends only the difference between the local model and the global model to reduce the amount of data transmitted at this stage; that is, the client sends the difference matrix between the global model and the local model to the server.
(2) The result is sent to the server through a deep gradient compression algorithm. From the difference matrix it can be seen that, when the client sends its local model to the server, the matrix contains many values with only small changes. In the current round these small values have little influence on the server model, yet sending such small numbers still consumes network bandwidth. A deep gradient compression strategy is therefore introduced: a compression threshold constant th is chosen; values smaller than th are not sent during training but are stored and accumulated with the values at the same positions of the next round's matrix, and once an accumulated value becomes large enough, i.e. greater than th, it is sent to the server.
The information entropy in step 3 is calculated as follows:
The server calculates the information entropy H(U) of the local model uploaded by the t-th round client Party i, where H(U) = E[−log p(U)], E(·) is the mathematical expectation and log(·) is the logarithmic operation; only local models whose entropy is large enough take part in the global weighted aggregation operation.
In step 4, the contribution value of the previous round's clients is calculated as follows:
First, the contribution-value correlation matrix S of the clients is calculated; its entry S_ij is the Pearson correlation coefficient between the contribution values of clients i and j.
Second, the credibility cred(γ_i) of client Party i is calculated from the correlation matrix.
Here μ denotes the expectation, with the client contribution value γ_i as its subscript, so μ_{γ_i} is the expectation of client i's contribution value γ_i; σ denotes the variance, so σ_{γ_i} is the variance of client i's contribution value γ_i. The credibility cred(γ_i) of the client is used to compute the modified basic probability assignment (BPA). X denotes the parameters of the client, namely the number of historical training rounds τ in which the client has participated, the proportion D_i/D of client i's data set to the overall client data set, and the KL divergence; these are fused, and the fused result γ_i is taken as the contribution value of client Party i to the global model in round t.
and 4, the method for obtaining the new global model by weighting and aggregation comprises the following steps:
(1) In the global model aggregation stage, when the server completes all the participantsAfter the contribution value and the information entropy of (1) are calculated, the contribution degree of each client and the data set duty ratio of each client are taken as weights, and a global model w of the (t+1) th round is carried out t Aggregation, the formula is as follows:
(2) The server obtains a t+1 global model w t+1 Thereafter, w is t And a client contribution value gamma i The data are issued to the clients participating in training, and the clients receive the global model w t+1 A new round of training is performed.
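For illustration, a minimal sketch of the contribution- and data-size-weighted aggregation follows; normalizing the weights γ_i·|D_i|/|D| so that they sum to one is an assumption, and the function name is illustrative.

```python
import numpy as np

def weighted_aggregate(local_models, contributions, dataset_sizes):
    """Aggregate local models with weights proportional to gamma_i * |D_i| / |D|.
    local_models: list of 1-D parameter vectors; contributions: gamma_i per client;
    dataset_sizes: |D_i| per client. Normalizing the weights to sum to 1 is an assumption."""
    total = float(sum(dataset_sizes))
    weights = np.array([g * (d / total) for g, d in zip(contributions, dataset_sizes)])
    weights = weights / weights.sum()          # assumed normalization
    stacked = np.stack(local_models, axis=0)   # shape: (num_clients, num_params)
    return (weights[:, None] * stacked).sum(axis=0)

# Example: three clients with different contributions and data sizes
models = [np.ones(4) * k for k in (1.0, 2.0, 3.0)]
print(weighted_aggregate(models, contributions=[0.2, 0.5, 0.3], dataset_sizes=[100, 300, 200]))
```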
Compared with existing horizontal synchronous federated learning, the invention offers the following advantages in efficiency and security:
1) Based on the difference matrix and the gradient compression idea, the amount of data a client sends to the server is reduced, optimizing the communication efficiency of uploading local models from the clients to the server;
2) Based on D-S evidence theory, a participant optimization scheme is provided that assigns each participant a scoring weight according to its influence on the global model and selects high-quality client local models for aggregation, improving model accuracy;
3) A dynamic update strategy based on information entropy and a divergence threshold accelerates the aggregation of high-quality clients' local models and the updating of the global model.
Drawings
FIG. 1 is the overall framework of the improved horizontal synchronous federated learning aggregation acceleration method of the present invention.
FIG. 2 is a schematic visualization of the deep gradient compression method.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings and examples.
Federated learning, as a paradigm of distributed machine learning, enables multi-party joint modeling while meeting user privacy protection and data security requirements, and thus effectively solves the 'data island' problem by modeling jointly without sharing local data.
Horizontal synchronous federated learning is an efficient framework for federated model sharing, suited to application scenarios with geographically close samples and good communication conditions. Multiple devices update their local models through synchronized local iterations, and the server collects the local models sent by the devices to aggregate and update the global model. However, the heterogeneity of user devices and differences in data distribution increase computation and communication overhead, leading to low global model aggregation efficiency and high training time overhead. How to optimize the aggregation and communication overhead of the global model in horizontal synchronous federated learning, and so improve the overall training speed, is therefore particularly important. The invention designs a horizontal synchronous federated learning aggregation optimization method covering the pipeline from local model training to accelerated global model aggregation: the participants' data is compressed using a difference matrix and a deep gradient compression algorithm, a participant optimization scheme based on D-S evidence theory (Dempster-Shafer evidence theory, D-S) is proposed, and a dynamic delayed-update scheme based on a divergence threshold and information entropy is designed to reduce redundant communication.
First, preliminary knowledge required for understanding the present invention will be described:
1. Horizontal federated learning
Horizontal federated learning applies when the participants' data features overlap substantially. The federated learning framework studied in this invention is mainly FedAvg, which is chiefly applicable to scenarios where the participants' sample data features are similar but their sample spaces overlap little.
Assume M and T are positive integers denoting, respectively, the maximum number of clients participating in training and the maximum number of iterations; let x and y denote a data sample input and the predicted output, and let i denote the client index, i ∈ [1, M]. In the federated averaging algorithm, the objective function F(w) is the loss loss(x_i; y_i) between the predicted output y_i and the true sample x_i, which can be written F(w) = loss(x_i, y_i; w); the loss(·) function describes the gap between the prediction and the real data, and F(·) describes the objective to be optimized in horizontal federated learning. ∇F(w) denotes the gradient and describes the direction of steepest descent of the objective F(w); η is the learning rate, i.e. the step size used in gradient descent; and Σ(·) denotes the summation operation.
When M clients with the same data structure jointly train a comprehensive model, the process generally comprises the following 4 steps:
(1) In round t, client Party i runs the stochastic gradient descent algorithm (SGD) locally: w_i^t ← w_i^{t−1} − η∇F(w_i^{t−1}), where w_i^{t−1} is the local model of round t−1, η is the SGD learning rate, ∇F(·) is the gradient of the objective function, and the symbol '←' denotes a model update. The gradient information is hidden with a security technique and the hidden result is sent to the server;
(2) After the server receives the local models sent by all participants, it aggregates the gradients through a secure aggregation operation and performs the averaging of FedAvg (summing the models sent by the participants and dividing by the number of participants) to obtain the round-(t+1) global model w^{t+1} = (1/M) Σ_{i=1}^{M} w_i^t;
(3) The server encrypts the aggregated round-(t+1) global model w^{t+1} and sends it to each participant;
(4) Each participant Party i accepts the round-(t+1) global model w^{t+1} sent by the server, updates its local model, and repeats the above steps.
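For reference, a minimal NumPy sketch of the standard FedAvg round just described follows (local SGD, then unweighted averaging); the least-squares loss and the synthetic data are illustrative choices rather than the patent's.

```python
import numpy as np

def local_sgd(w, X, y, lr=0.05, epochs=5):
    """One client's local update: plain gradient steps on a least-squares loss (illustrative)."""
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5*||Xw - y||^2 / n
        w = w - lr * grad                   # w <- w - eta * grad F(w)
    return w

def fedavg_round(global_w, client_data):
    """One FedAvg round: every client trains locally, the server averages the results."""
    local_models = [local_sgd(global_w.copy(), X, y) for X, y in client_data]
    return np.mean(local_models, axis=0)    # w_{t+1} = (1/M) * sum_i w_i^t

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
clients = []
for _ in range(3):                          # three clients with synthetic IID data
    X = rng.normal(size=(50, 3))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

w = np.zeros(3)
for t in range(20):
    w = fedavg_round(w, clients)
print(w)                                    # approaches true_w
```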
2. Synchronous update
Most horizontal federated learning studied at present is based on a synchronous update strategy. A synchronous federated learning system generally comprises one global server and several participant nodes: each participant trains on its local data and sends its local model to the server, and the server obtains or updates the global model by aggregating the local models of all participants. In a synchronous federated learning system, assume there are M participants whose data sets share the same data structure. At the beginning of global training, the global server determines a unified prediction task target and prediction model, determines the parameter structure of the model, and sends it to each participant to start training.
Let α ∈ [0, 1] denote the proportion of selected clients. The training process of synchronous federated learning can be divided into 4 phases. After the global model and the local models are initialized, the global server sends the current round's global model to the clients (Party), which then execute the following phases in a loop:
(1) The client receives the global model from the server and, if the model is in ciphertext form, decrypts it before use.
(2) The client replaces its local model with the received global model and starts local model iterations on it using the local data set, until the model converges or the maximum number of iterations is reached.
(3) After the local iterations finish, the client sends the local model of the last local iteration to the global server. Depending on privacy requirements, the model is encrypted or perturbed at this stage.
(4) The server randomly selects a subset of αM clients out of the M participants to take part in global aggregation. After the server has received the local models of all selected clients, it starts global model aggregation and distributes the aggregated global model to all clients.
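For illustration, a tiny sketch of the selection step in phase (4) follows, picking ⌈αM⌉ of the M clients at random; the rounding rule is an assumption.

```python
import math
import random

def select_clients(num_clients, alpha, seed=None):
    """Server-side random selection of roughly alpha*M participants for this round."""
    rng = random.Random(seed)
    k = max(1, math.ceil(alpha * num_clients))   # assumed rounding: ceil, at least one client
    return sorted(rng.sample(range(num_clients), k))

print(select_clients(num_clients=10, alpha=0.3, seed=42))   # e.g. 3 of 10 client indices
```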
3. Deep gradient compression (DGC)
DGC addresses the communication bandwidth problem by sending only the important gradients and compressing the rest. To preserve the accuracy of the uploaded local model, DGC stipulates that only gradients larger than a threshold th (where th denotes the gradient threshold) are sent to the server; gradients smaller than th are accumulated locally until they become large enough to transmit. Thus, during DGC, large gradients are sent immediately while small gradients are temporarily accumulated and eventually grow large over successive rounds, which avoids information loss.
4. D-S evidence theory
D-S evidence theory (Dempster-Shafer evidence theory, D-S) is a method for reasoning about and fusing uncertain information. The symbol θ denotes the frame of discernment (FoD) of evidence theory, defined as θ = {H_1, H_2, …, H_N}, where N is the number of hypotheses in the recognition system; every hypothesis H in the recognition system, and every decision scheme the system can make, is a subset of the power set of the frame θ.
The basic probability assignment (BPA) is a mapping m: 2^θ → [0, 1] under the frame θ that satisfies the constraints m(∅) = 0 and Σ_{A⊆θ} m(A) = 1, where ∅ is the empty set and m(·) is the basic probability assignment function. The character A denotes one or more propositions in the frame θ, and m(A) represents the degree to which the evidence supports proposition A.
Proposition A has a belief function (Bel) under the frame θ, describing the degree of belief that proposition A is true, defined as Bel(A) = Σ_{B⊆A} m(B), where propositions A and B are both elements of 2^θ over the frame θ and m is the basic probability assignment on θ.
The plausibility function (Pl) under the frame θ is defined for proposition A as Pl(A) = Σ_{B∩A≠∅} m(B). Computing the belief function Bel(A) and the plausibility function Pl(A) of a proposition A in the frame yields the belief interval [Bel(A), Pl(A)], which expresses the degree of confirmation of proposition A.
Let m_1 and m_2 be two mutually independent basic probability assignments on the frame θ, and let A, B, and C be propositions under θ. The basic probabilities of propositions B and C can be fused with the Dempster-Shafer combination rule to obtain a fused result and make a decision: m(A) = (1/(1−k)) Σ_{B∩C=A} m_1(B) m_2(C), where k = Σ_{B∩C=∅} m_1(B) m_2(C), k ∈ [0, 1], is the evidence conflict factor.
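For illustration, a small sketch of the Dempster-Shafer combination rule just described follows, with propositions represented as frozensets; the example mass functions are illustrative.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two basic probability assignments over the same frame of discernment.
    m1, m2: dicts mapping frozenset propositions to masses that each sum to 1."""
    combined = {}
    k = 0.0                                   # conflict factor
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            combined[inter] = combined.get(inter, 0.0) + mb * mc
        else:
            k += mb * mc                      # mass falling on the empty intersection
    if k >= 1.0:
        raise ValueError("total conflict: evidence cannot be combined")
    return {a: v / (1.0 - k) for a, v in combined.items()}

# Illustrative example over the frame {H1, H2}
m1 = {frozenset({"H1"}): 0.6, frozenset({"H1", "H2"}): 0.4}
m2 = {frozenset({"H1"}): 0.3, frozenset({"H2"}): 0.5, frozenset({"H1", "H2"}): 0.2}
print(dempster_combine(m1, m2))
```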
5. Pearson correlation coefficient
Let x̄ and ȳ denote the means of the two samples. The Pearson correlation coefficient r is commonly used to measure the degree of correlation between samples; the Pearson correlation coefficient r between two samples X = {x_1, …, x_n} and Y = {y_1, …, y_n} is defined as their covariance divided by the product of their standard deviations: r = cov(X, Y)/(σ_X σ_Y) = Σ_i (x_i − x̄)(y_i − ȳ) / (√(Σ_i (x_i − x̄)^2) · √(Σ_i (y_i − ȳ)^2)).
6. Information entropy
Entropy is used to describe the uncertainty of an event, and information entropy measures the uncertainty of information: the more ordered a system is, the lower its information entropy, and conversely the more chaotic the system, the higher its entropy. Let the character X denote a piece of information whose possible contents are {x_1, x_2, …, x_n}; H(X) denotes the information entropy of X = {x_1, x_2, …, x_n}, defined as H(X) = −Σ_{i=1}^{n} p(x_i) log p(x_i), where p(x_i) is the probability that x_i occurs in the information and log(·) is the logarithmic operation.
7. KL divergence
The definition of KL divergence is based on entropy; it measures the degree of difference between sample distributions. Let p and q denote two random variables with probability distributions p(x) and q(x), and let D_KL(p‖q) denote the relative entropy of p with respect to q; then the KL divergence can be expressed as D_KL(p‖q) = Σ_x p(x) log(p(x)/q(x)).
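For illustration, a short sketch computing the two quantities just defined for discrete distributions follows; the example distributions are illustrative.

```python
import math

def entropy(p, base=2):
    """H(X) = -sum_i p(x_i) * log p(x_i); terms with p = 0 contribute 0."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum_i p(x_i) * log(p(x_i) / q(x_i)); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(entropy(p))            # 1.5 bits
print(kl_divergence(p, q))   # small positive value
```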
as shown in fig. 1, the present invention proposes an improved flow of a horizontal synchronous federal learning acceleration scheme, in which the thickened part is the modification of the present design on the conventional federal average algorithm, these modifications all have module properties, and the present acceleration scheme can be disassembled into 3 main modules according to the needs in the actual scene: the calculation amount of the client training local model is reduced through algorithm optimization, the data amount sent by the client is reduced through data compression, and the aggregation efficiency of the server model is improved through data screening. The federal learning acceleration of the lateral synchronization client-server side is achieved.
In the invention, the notation w denotes a model trained during federated learning, D a client's local data set, i the index of a client participating in training, and t the federated training round number. M and T are positive integers denoting, respectively, the maximum number of participating clients and the maximum number of iterations, with i ∈ [1, M] and t ∈ [0, T]. The local model of the i-th client participating in round t is therefore written w_i^t, the global model aggregated by the server is written w^t, and the i-th client's local data set is D_i. The notation γ denotes the contribution value of a client's local model to the global model, Δ denotes the divergence used to describe the difference between models, and w* is the reference model; the contribution value of client i is γ_i, and Δ_i denotes the divergence value between client Party i's local model and the reference model w*.
Through the data compression strategy, a client uses the difference matrix to transmit only the difference between its local model and the global model; following the idea of deep gradient compression (DGC), a compression threshold is then chosen, values larger than the threshold are sent to the server immediately, and smaller values are accumulated and sent once their accumulation exceeds the threshold.
Communication between client and server is unnecessary when the local model differs little from the global model; conversely, when the local model differs significantly from the global model, more communication should be spent on updating it. To address this, a dynamic update acceleration strategy based on a divergence threshold is proposed, which uses the divergence threshold Δ as a dynamic measure to decide whether a client needs to communicate with the server and send its local model.
Client Party i calculates the divergence value Δ_i: in the local model training stage, the client calculates the divergence value of its local model to judge whether the local model has been sufficiently updated; when it has, the client sends the local model to the server, and otherwise it continues local training. Each time client i completes a round of local iteration, a counter records the local iteration round n, and the divergence value Δ_i of client i is calculated from the iteration count n, the contribution value γ_i, and the distance between the local model w_i^t and the reference model w*, where Δ_i denotes the divergence value of the i-th client's local model, w_i^t denotes the local model, w* denotes the reference model, and γ_i denotes the client's contribution value. When a participant completes a batch of local model iterations and the divergence threshold condition on Δ_i is satisfied, the client sends its local model w_i^t to the server. Compared with a conventional fixed threshold, the contribution-value-based divergence threshold designed in the invention is adjustable.
In horizontal synchronous federated learning, the local models trained by different clients on their local data sets all influence the global model, so selecting high-quality clients benefits global aggregation. For this purpose, the invention scores each client comprehensively based on D-S evidence theory, combining the client's performance over the past several training rounds, and gives its local model the weight γ_i; in the aggregation stage, the server aggregates the local models according to this weight together with the client's data-set proportion, thereby reducing the influence of inferior participants (malicious participants, or normal participants that degrade the global model) on the accuracy and convergence speed of the global model. In the stage where the server aggregates the local models, the aggregation weight of client i combines γ_i with |D_i|/|D|, where |·| counts the number of data elements in a set and |D_i|/|D| is the proportion of client i's data set to the overall data set.
Based on the preliminary knowledge, the invention performs the following steps:
1) The client receives the initial global model sent by the server and the contribution value of the client.
(1.1) The client receives the global model: in the initial training phase (round 0, t = 0), the server selects a randomly initialized model as the global model w^0 to begin training. Assume that in the t-th training round of federated learning, the server issues the global model w^{t−1} aggregated in the previous round (round t−1) as the initial model of round t to the M participants Party i, where t ∈ [0, T], T is the maximum number of training rounds, M is the number of clients participating in training, i ∈ [1, M] is the index of the i-th participant, and w denotes a globally or locally trained model.
(1.2) Let γ_i denote the contribution value of client i to the global model. Client i receives the contribution value γ_i sent by the server, where γ_i is calculated according to the D-S evidence theory formula explained in (4.1).
2) The client trains the local model by using the local data set, calculates a divergence threshold value, and prepares to send the trained local model.
(2.1) The client trains the local model on its local data set: the i-th client (Party i) accepts the global model w^{t−1} issued by the server and performs multiple rounds of stochastic gradient descent (SGD) on its local data set to train the local model w_i^t. Let w* denote the reference model. At the start of training, the server takes the models with higher contribution values γ_i, performs a weighted average, and uses the result as the reference model w*; during actual training, client Party i calculates the difference between the reference model and its local model to measure whether the amplitude of change of its local model has reached the aggregation standard. In the first iteration of the global model, the server randomly chooses one participant's model among all participants as the reference model w*, i.e. w* = w_j^0 for some randomly chosen j ∈ [1, M]. In subsequent training, the reference model is selected according to the contribution values γ_i of the clients Party i: the server averages the local models with higher contribution values and takes the result as the new round's reference model, where γ_i and γ_j denote the contribution values of the i-th and j-th clients and the average is taken over the selected models.
(2.2) Client i calculates the divergence value Δ_i: in the local model training stage, the client calculates the divergence value of its local model to judge whether the local model has been sufficiently updated; when it has, the client sends the local model to the server, and otherwise it continues local training. Each time client Party i completes a round of local iteration, a counter records the local iteration round n, and the divergence value Δ_i of client Party i is calculated from the iteration count n, the contribution value γ_i, and the distance between the local model w_i^t and the reference model w*, where Δ_i denotes the divergence value of the i-th client's local model, w_i^t denotes the local model, w* denotes the reference model, and γ_i denotes the client's contribution value. When a participant completes a batch of local model iterations and the divergence threshold condition on Δ_i is satisfied, the client sends its local model w_i^t to the server.
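For illustration, a minimal sketch of the reference-model selection described in (2.1) follows; averaging the models of the clients whose contribution values rank in the upper half is an assumption (the patent's selection formula is an image), and the function and variable names are illustrative.

```python
import numpy as np

def reference_model(local_models, contributions, first_round, rng=np.random.default_rng(0)):
    """Return the reference model w*.
    local_models: list of parameter vectors; contributions: gamma_i per client."""
    if first_round:
        return local_models[rng.integers(len(local_models))]   # random pick in the first round
    order = np.argsort(contributions)[::-1]                     # highest contribution first
    top = order[: max(1, len(order) // 2)]                      # assumed cutoff: upper half
    return np.mean([local_models[i] for i in top], axis=0)

models = [np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([4.0, 4.0])]
print(reference_model(models, contributions=[0.1, 0.6, 0.3], first_round=False))
```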
3) The client optimizes and transmits the local model using a difference matrix and a gradient compression algorithm, and the server stores the received local models in a buffer and calculates the information entropy.
(3.1) The client calculates a difference matrix: when a client's single-round model update differs little from the global model, i.e. ‖w_i^t − w^{t−1}‖ < ε with ε approaching 0, sending the whole locally trained model w_i^t to the server wastes communication resources. At this stage the client therefore sends only the difference between the local model and the global model, reducing the data sent; that is, the client sends the difference matrix between the global model and the local model to the server. Suppose M clients participate in training in each round, with i ∈ [1, M] (M is the maximum number of participating clients).
(3.2) The calculation result is sent to the server through a deep gradient compression algorithm: from the difference matrix it can be seen that, when the client sends its local model to the server, some values in the matrix change only slightly. In the current round these values have little influence on the server model, yet sending such small numbers still consumes network bandwidth. A deep gradient compression strategy is therefore introduced: a compression threshold constant th is chosen; values smaller than th are not sent during training but are stored and accumulated with the values at the same positions of the next round's matrix, and once an accumulated value becomes large enough (i.e. greater than th) it is sent to the server.
FIG. 2 shows a visualization of the deep gradient compression approach. The notation G denotes the gradient matrix of a client's local model, with the round index t as its subscript, so G_t denotes the gradient matrix of the client's local model in round t. At the start of round t of federated learning, a compression threshold th = 7 is selected. When client Party i finishes local training, it does not send the local model w_i^t to the server directly but instead runs the deep gradient compression algorithm. The algorithm proceeds as follows: first, the client compares its gradient matrix G_t (the gradient matrix of the client's local model w_i^t) with the threshold th, setting all positions of G_t greater than th to 1 and the remaining positions to 0, which yields the mask matrix Mask. The client then multiplies the gradient matrix G_t element-wise with the mask matrix Mask and sends the result of this element-wise product, i.e. all gradients larger than the threshold th, to the server. After sending, the participant inverts the mask matrix Mask and multiplies it element-wise with the gradient matrix G_t to obtain the gradient matrix G_t' that was not transmitted this round, which is kept until the sending phase of the next training round. Here G_t' denotes the gradient matrix not transmitted to the server; the mask matrix Mask has the same dimensions as the gradient matrix and its elements are only 0 and 1, so it acts as the threshold filter.
When the sending phase of the next round (round t+1) starts, the gradient matrix G_t' not transmitted in the current round (round t) is added to the gradient matrix G_{t+1} of round t+1, and the sum is taken as the round-(t+1) gradient matrix, i.e. G_{t+1} ← G_{t+1} + G_t', where the symbol '←' denotes the gradient matrix update, after which the deep gradient compression algorithm is executed again. Fine-tuning the local model sent by the client through the difference matrix and the deep gradient compression algorithm avoids the redundancy caused by ineffective updates in federated learning, and thus reduces the communication cost of federated learning.
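For illustration, a compact sketch of the mask-based deep gradient compression step described above follows, including local accumulation of the withheld residual; comparing absolute values against th (the usual DGC convention) and the array sizes are assumptions.

```python
import numpy as np

def dgc_step(gradient, residual, th):
    """One deep-gradient-compression send step.
    gradient: this round's gradient matrix G_t; residual: values withheld last round;
    th: compression threshold. Returns (values sent to the server, new residual)."""
    g = gradient + residual                    # fold in previously withheld values
    mask = (np.abs(g) > th).astype(g.dtype)    # 1 where |value| > th, else 0
    sent = g * mask                            # sparse update actually transmitted
    new_residual = g * (1 - mask)              # kept locally and accumulated next round
    return sent, new_residual

rng = np.random.default_rng(1)
residual = np.zeros((3, 3))
for t in range(3):                             # three rounds of compressed sending
    grad = rng.normal(scale=5.0, size=(3, 3))
    sent, residual = dgc_step(grad, residual, th=7.0)
    print(f"round {t}: sent {int((sent != 0).sum())} of 9 entries")
```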
4) The server calculates the contribution value of the previous round of clients based on the D-S evidence theory, calculates the information entropy, finds out the local model meeting the information entropy condition from the buffer area, and performs the weighted aggregation operation based on the contribution value to obtain a new global model.
(4.1) The server computes the client contribution values based on D-S evidence theory: at the start of federated learning, and whenever a new client joins in a training phase (for example round t), the newly added client i is assigned an average contribution value γ_i. In the server aggregation stage, the server obtains the relevant parameters of each client i, including the ratio D_i/D of client Party i's data set D_i to the overall data set D, the previous round's contribution value γ_i of client i, the KL divergence of the client's model, and the number of rounds τ in which the client has participated in training so far. Here γ_i (0 ≤ γ_i < 1) is the contribution value of client Party i, obtained by using the D-S evidence theory algorithm to fuse the scores given by the client's historical participation rounds τ, the data-set proportion D_i/D, and the KL divergence; τ denotes the number of historical rounds in which client i has participated in training, τ ∈ [0, T].
The contribution value γ_i of a client is calculated according to the D-S evidence theory formulas, taking into account that data heterogeneity may cause a Matthew effect in the client contribution values during federated aggregation, so as to mitigate the possibility of inaccurate fusion results when propositions conflict strongly. Here the Matthew effect refers to the polarization phenomenon that can occur when client Party i is evaluated with D-S evidence theory.
The notation μ denotes the expectation, with the client contribution value γ_i as its subscript, so μ_{γ_i} is the expectation of client i's contribution value γ_i; the notation σ denotes the variance, with the client contribution value γ_i as its subscript, so σ_{γ_i} is the variance of client i's contribution value γ_i.
The contribution-value correlation matrix S of the different clients is first calculated according to formula (4-1), where the entry S_ij expresses the correlation between the contribution values of client i and client j and is computed as the Pearson correlation coefficient S_ij = cov(γ_i, γ_j)/(σ_{γ_i} σ_{γ_j}), in which cov(γ_i, γ_j) denotes the covariance of the contribution values of clients i and j and E(·) is the mathematical expectation. Note that when S_ij = 0, S_ij is set to 0.001 to overcome the zero-confidence conflict of evidence theory. From expression (4-3), the credibility cred(γ_i) of client i is then calculated.
The character cred denotes the credibility of a client and is used to calculate the modified basic probability assignment (BPA). The notation X denotes the parameters of the client: the number of historical training rounds τ in which the client has participated, the proportion D_i/D of client i's data set to the overall client data set, the KL divergence, and so on.
The client's parameters are fused according to formulas (4-4) and (4-5), and the fused result γ_i is used as the aggregation weight of client i.
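The credibility and fusion formulas (4-1) to (4-5) are images in the published text, so the sketch below only illustrates the correlation-based credibility idea under stated assumptions: S is the Pearson correlation matrix of the clients' contribution histories, each client's credibility is its normalized average non-negative support from the other clients, and the fused contribution combines the data-set share, the participation history, and the KL divergence. Everything beyond the Pearson matrix and the S_ij = 0 replacement is an assumption, not the patent's formula.

```python
import numpy as np

def contribution_scores(history, data_share, rounds_participated, kl_div):
    """Illustrative contribution scoring (assumed formulas, see lead-in note).
    history: (num_clients, num_rounds) array of past contribution values gamma_i.
    data_share: |D_i|/|D| per client; rounds_participated: tau_i; kl_div: KL divergences."""
    S = np.corrcoef(history)                 # Pearson correlation matrix S_ij
    S[S == 0] = 0.001                        # patent: replace S_ij = 0 to avoid 0-confidence conflict
    support = np.clip(S, 0.0, None)          # assumption: negative correlation gives no support
    cred = support.mean(axis=1)              # assumed credibility: average support from all clients
    cred = cred / cred.sum()

    # Assumed evidence per client: more data, more rounds, lower KL divergence -> higher score
    tau = np.asarray(rounds_participated, dtype=float)
    evidence = np.asarray(data_share) * (tau / tau.max()) / (1.0 + np.asarray(kl_div))
    gamma = cred * evidence
    return gamma / gamma.sum()               # contribution values normalized to sum to 1

hist = np.array([[0.20, 0.25, 0.30],
                 [0.40, 0.35, 0.30],
                 [0.35, 0.40, 0.45]])
print(contribution_scores(hist, data_share=[0.2, 0.5, 0.3],
                          rounds_participated=[3, 3, 2], kl_div=[0.1, 0.05, 0.4]))
```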
(4.2) The server calculates the information entropy: the server calculates the information entropy H(U) of the local model uploaded by the t-th round client i, and only local models whose entropy is large enough take part in the global weighted aggregation operation.
(4.3) Global model aggregation: after the server has finished calculating the contribution values and information entropy of all participants, it aggregates the round-(t+1) global model according to formula (4-6), using each client's weight.
5) The server sends the contribution values of the previous round and the new global model to the clients, and each client receives the global model sent by the server and performs the next round of training.
After the server obtains the round-(t+1) global model w^{t+1}, it issues w^{t+1} and the client contribution values γ_i to the clients i participating in training, and the clients receive the global model w^{t+1} and perform a new round of training.

Claims (9)

1. A horizontal synchronous federated learning aggregation acceleration method, characterized by comprising the following steps:
step 1, a client receives the global model and the client contribution value sent by the server;
step 2, the client trains a local model on its local data set, calculates a divergence threshold, and prepares to send the trained local model;
step 3, the client optimizes the local model for transmission using a difference matrix and a gradient compression algorithm and sends the optimized local model to the server; the server stores all received models in a buffer and calculates the information entropy;
step 4, the server calculates the contribution values of the previous round's clients, stores the information entropy, takes all local models out of the buffer, and obtains a new global model by weighted aggregation based on the local models' contribution values;
step 5, the server sends the previous round's contribution values and the new global model to the clients, and each client receives the global model sent by the server and performs the next round of training.
2. The horizontal synchronous federated learning aggregation acceleration method according to claim 1, wherein in step 1 the global model is obtained by aggregating the local models of all clients, as follows: each client trains locally on its local data set to obtain a local model; after applying the deep gradient compression algorithm and the divergence-threshold update criterion, the local model is uploaded to the server buffer; the local models uploaded to the server buffer are aggregated, weighted by the contribution value obtained for each client, to produce the new round's global model; the client contribution value is calculated based on D-S evidence theory and represents the degree to which the client contributed to the global model in the previous round, and the clients participating in training are assigned an average contribution value when federated learning starts; in the server aggregation stage, the server obtains the relevant parameters of each client, including the ratio of the client's data set to the overall data set, the client's contribution value in the previous round, the divergence between the client's trained local model and the global model, and the number of rounds in which the client has participated in training so far.
3. The method according to claim 1, wherein in step 2, different clients have local data sets with different data types and sizes, and do not share data with each other.
4. The horizontal synchronous federated learning aggregation acceleration method according to claim 1, wherein the divergence threshold in step 2 is calculated as follows:
in the local model training stage, the client calculates the divergence value of its local model to judge whether the local model has been sufficiently updated; when it has, the client sends the local model to the server, and otherwise it continues local training;
each time the i-th client Party i participating in federated training completes a round of local iteration, a counter records the local iteration round n, and the divergence value Δ_i of client Party i is calculated from the iteration count n, the contribution value γ_i, and the distance ‖w_i^t − w*‖ between the local model and the reference model; after the participant completes a batch of local model iterations and the divergence threshold condition on Δ_i is satisfied, client Party i sends its local model w_i^t to the server;
wherein n denotes the number of local training rounds performed by the client, Δ denotes the model divergence with subscript i, so Δ_i is the divergence value of the i-th client's local model, w denotes a federated learning model, w_i^t denotes the local model trained by the i-th client in round t, w* denotes the reference model, γ_i denotes the contribution value of client Party i, and ‖w_i^t − w*‖ describes the distance between the local model and the reference model.
5. The horizontal synchronous federated learning aggregation acceleration method according to claim 4, wherein in horizontal synchronous federated learning the global round count and the local round count of training are set to fixed values, the global round being the number of times the server's global model is updated and the local round being the number of times the client trains on its local data set, starting from the global model, to obtain a local model;
every client participating in training uploads its local model to the server buffer after it satisfies the divergence threshold condition on Δ_i and its local training has reached the set fixed number of rounds, thereby completing the local model training of the current round.
6. The horizontal synchronous federated learning aggregation acceleration method according to claim 1, wherein the method for optimizing the local model in step 3 is as follows:
(1) the client calculates a difference matrix: suppose M clients participate in training in each round, i ∈ [1, M], where M is the maximum number of participating clients; when the difference between the local model and the global model is small, i.e. ‖w_i^t − w^{t−1}‖ < ε for an arbitrary constant ε approaching 0, sending the whole locally trained model w_i^t to the server wastes communication resources, so the client sends only the difference between the local model and the global model to reduce the amount of data sent to the server, i.e. the client sends the difference matrix between the global model and the local model to the server;
(2) the calculation result is sent to the server through a deep gradient compression algorithm: a deep gradient compression strategy is introduced in which a compression threshold constant th is selected, values smaller than th are not sent during training but are stored and accumulated with the values at the same positions of the next round's matrix, and once an accumulated value becomes large enough, i.e. greater than th, it is sent to the server.
7. The horizontal synchronous federated learning aggregation acceleration method according to claim 1, wherein the information entropy in step 3 is calculated as follows:
the server calculates the information entropy H(U) of the local model uploaded by the t-th round client Party i, where H(U) = E[−log p(U)], E(·) is the mathematical expectation and log(·) is the logarithmic operation, and only local models whose entropy is large enough take part in the global weighted aggregation operation.
8. The horizontal synchronous federated learning aggregation acceleration method according to claim 1, wherein in step 4 the contribution value of the previous round's clients is calculated as follows:
first, the contribution-value correlation matrix S of the clients is calculated, whose entry S_ij is the Pearson correlation coefficient between the contribution values of clients i and j;
second, the credibility cred(γ_i) of client Party i is calculated from the correlation matrix;
wherein μ denotes the expectation, with the client contribution value γ_i as its subscript, so μ_{γ_i} is the expectation of client i's contribution value γ_i, and σ denotes the variance, with the client contribution value γ_i as its subscript, so σ_{γ_i} is the variance of client i's contribution value γ_i; the credibility cred(γ_i) of the client is used to calculate the modified basic probability assignment BPA; X denotes the parameters of the client, namely the number of historical training rounds τ in which the client has participated, the proportion D_i/D of client i's data set to the overall client data set, and the KL divergence, which are fused, and the fused result γ_i is taken as the contribution value of client Party i to the global model in round t.
9. The horizontal synchronous federated learning aggregation acceleration method according to claim 1, wherein the method for obtaining the new global model by weighted aggregation in step 4 is as follows:
(1) in the global model aggregation stage, after the server has finished calculating the contribution values and information entropy of all participants, the contribution degree of each client together with its data-set proportion is used as the weight to aggregate the global model of round t+1;
(2) after obtaining the round-(t+1) global model w^{t+1}, the server issues w^{t+1} and the client contribution values γ_i to the clients participating in training, and the clients receive the global model w^{t+1} and perform a new round of training.
CN202310721384.5A 2023-06-16 2023-06-16 Improved horizontal synchronous federal learning aggregation acceleration method Pending CN116629350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310721384.5A CN116629350A (en) 2023-06-16 2023-06-16 Improved horizontal synchronous federal learning aggregation acceleration method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310721384.5A CN116629350A (en) 2023-06-16 2023-06-16 Improved horizontal synchronous federal learning aggregation acceleration method

Publications (1)

Publication Number Publication Date
CN116629350A true CN116629350A (en) 2023-08-22

Family

ID=87590651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310721384.5A Pending CN116629350A (en) 2023-06-16 2023-06-16 Improved horizontal synchronous federal learning aggregation acceleration method

Country Status (1)

Country Link
CN (1) CN116629350A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521783A (en) * 2023-11-23 2024-02-06 北京天融信网络安全技术有限公司 Federal machine learning method, apparatus, storage medium and processor
CN117575291A (en) * 2024-01-15 2024-02-20 湖南科技大学 Federal learning data collaborative management method based on edge parameter entropy
CN117575291B (en) * 2024-01-15 2024-05-10 湖南科技大学 Federal learning data collaborative management method based on edge parameter entropy
CN118278545A (en) * 2024-05-29 2024-07-02 山东智和创信息技术有限公司 Federal study test frame for protecting power grid data privacy

Similar Documents

Publication Publication Date Title
CN116629350A (en) Improved horizontal synchronous federal learning aggregation acceleration method
Liu et al. FedCPF: An efficient-communication federated learning approach for vehicular edge computing in 6G communication networks
Yun et al. DQN-based optimization framework for secure sharded blockchain systems
Lin et al. Task offloading for wireless VR-enabled medical treatment with blockchain security using collective reinforcement learning
CN113705610B (en) Heterogeneous model aggregation method and system based on federal learning
CN111629380B (en) Dynamic resource allocation method for high concurrency multi-service industrial 5G network
WO2024032121A1 (en) Deep learning model reasoning acceleration method based on cloud-edge-end collaboration
CN113806735A (en) Execution and evaluation dual-network personalized federal learning intrusion detection method and system
CN113518007B (en) Multi-internet-of-things equipment heterogeneous model efficient mutual learning method based on federal learning
CN114301935A (en) Reputation-based method for selecting edge cloud collaborative federated learning nodes of Internet of things
CN116471286A (en) Internet of things data sharing method based on block chain and federal learning
CN114357676A (en) Aggregation frequency control method for hierarchical model training framework
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
Zhang et al. Towards data-independent knowledge transfer in model-heterogeneous federated learning
CN113283778A (en) Layered convergence federated learning method based on security evaluation
CN116187474A (en) Contribution degree evaluation method for participants in horizontal federal learning
Zehtabi et al. Decentralized event-triggered federated learning with heterogeneous communication thresholds
CN116796864A (en) Power data distributed federation learning system and method based on data similarity aggregation
Yuan et al. Profit-optimized computation offloading with autoencoder-assisted evolution in large-scale mobile-edge computing
Xiang et al. Federated learning with dynamic epoch adjustment and collaborative training in mobile edge computing
CN116451806A (en) Federal learning incentive distribution method and device based on block chain
Panigrahi et al. A reputation-aware hierarchical aggregation framework for federated learning
CN110807251A (en) Network public opinion polarization method and system integrating individual heterogeneity and dynamic dependency
Zhu et al. Shapley-value-based Contribution Evaluation in Federated Learning: A Survey
CN117812564B (en) Federal learning method, device, equipment and medium applied to Internet of vehicles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination