CN115587633A

CN115587633A - Personalized federal learning method based on parameter layering

Info

Publication number: CN115587633A
Application number: CN202211382618.XA
Authority: CN
Inventors: 肖云鹏; 彭锦华; 李茜; 庞育才; 李暾; 王国胤
Original assignee: Chongqing University of Post and Telecommunications
Current assignee: Chongqing University of Post and Telecommunications
Priority date: 2022-11-07
Filing date: 2022-11-07
Publication date: 2023-01-10

Abstract

The invention belongs to the field of application of the federal learning technology, and particularly relates to a personalized federal learning method based on parameter layering; the invention comprises the following steps: the method comprises the steps that a client-side carries out parameter division on a local model before federal learning to obtain base layer parameters and personalized layer parameters, the base layer parameters and the personalized layer parameters are updated in each federal learning, the client-side is clustered and divided based on the updated base layer parameters, so that group average weight of each group is obtained and uploaded to a server, and the server updates the base layer parameters; after the federal learning is completed, obtaining optimal basic layer parameters and issuing the optimal basic layer parameters to a client, and the client trains a local model by adopting local data to obtain an individualized local model; according to the invention, the heterogeneity problem caused by the non-independent and uniformly distributed data of each client can be relieved through parameter layering and clustering division in federal training, and the final model of each client is more suitable for the local data.

Description

Personalized federal learning method based on parameter layering

Technical Field

The invention belongs to the field of application of federal learning technology, relates to adjustment of a global model and a local model, and particularly relates to a personalized federal learning method based on parameter layering.

Background

With the further development of big data, people's knowledge and concern about data privacy are also continuously improved; accordingly, federal learning has gained widespread attention since its introduction and has been applied in some scenarios. Federal learning is a distributed machine learning framework with privacy protection and security encryption technology, and aims to enable scattered participants to collaboratively perform model training of machine learning on the premise that privacy data are not disclosed to other participants. However, due to the problem of high data heterogeneity, it is difficult to train a global model suitable for all clients through federal learning.

With the advances in federal learning research, it has been proposed to use personalized federal learning approaches to address the problem of data heterogeneity. The core idea of personalized federal learning is to follow different direction researches according to heterogeneous data distribution by capturing personalized information of each client so as to obtain a high-quality personalized model. Currently, researchers have divided personalized federal learning into two categories: global model personalization and learning of personalized models. The global model is individualized into two stages, firstly a shared global FL model is trained, and then additional training is carried out on local data, so that the purpose of individuation is achieved. The purpose of learning the personalized model is to build the personalized model by modifying the aggregation process of the FL model.

In recent years, more and more researchers have studied personalized federal learning in the field of federal learning. Aspects of the study are primarily based on multitask learning, base and personalized layer layering processes, and transfer learning. The multitask-based learning is mainly characterized in that an independent weight vector is trained for each node by learning an independent model of each node and using an arbitrary convex loss function; and the statistical problem in the federal environment is solved by considering the correlation among the node models, and the sample capacity is improved. The layering processing mainly considers the difference of data distribution among all nodes, and meanwhile, the higher the layer number of the neural network is, the stronger the individuation is. The transfer learning-based learning is a learning process which mainly applies the model learned in the source field to the target field by utilizing the similarity among data, tasks or models.

While numerous scholars have conducted extensive research into the field of personalized federal learning with considerable success, there are still some challenges:

1. the client-side non-independent co-distributed data causes slow convergence of the global model. In the federal learning environment, the devices participating in the federal learning have the problem of large data distribution difference and the problem of communication cost, so that a good global model is difficult to train quickly.

2. Federal calculations are a high complexity problem. In the process of dividing by calculating the parameter similarity, the problem of high computation complexity of mass data exists, and the computation efficiency is greatly reduced.

3. Local distribution diversity problem. Due to the difference in data distribution of the clients, the preferences captured from the raw data are different, which makes the trained global model not well generalized in various data. Therefore, how to train the personalized model of each client on the basis of the global model becomes a main research direction.

For the problem of client data distribution difference, personalized federal learning is gradually becoming the mainstream choice for solving the problem. Zhu et al (Zhu, zhuangdi, junyuan Hong, and Jianyu Zhou. "Data-free knowledge distillation for heterologous contaminated learning." International Conference on Machine learning. PMLR, 2021.) proposed a distillation method without Data knowledge to solve the Data heterogeneity problem, and adjusted the local training by learning knowledge as inductive bias, achieving fewer communication turns to promote FL to have better generalization performance. Inspired by the paper, the invention provides an individualized federated learning method based on iterative partitioning and parameter layering, wherein parameters of a model base layer are used for participating in federated training, parameters of an individualized layer are adapted to local data distribution, meanwhile, the problem of weight divergence is solved through clustering partitioning, and rapid convergence of a global model is realized with fewer communication turns.

Disclosure of Invention

In order to solve the problems, the invention provides a personalized federal learning method based on parameter layering, which comprises the following steps:

s1, constructing an individualized federated learning system comprising N clients and a server, wherein the server is provided with a main model after parameter initialization;

s2, downloading a main model from a server by a client as a local model, wherein parameters of the main model are divided into basic layer parameters and personalized layer parameters;

s3, the client side improves the base layer parameters and the personalized layer parameters of the local model through random gradient descent based on the local data, and obtains base layer weight updating vectors;

s4, updating the vector of the base layer weight, reducing the dimension to obtain a ternary vector matrix, and measuring the ternary vector matrix by a ternary cosine similarity method to obtain a ternary cosine similarity matrix;

s5, calculating similarity distances among the clients through ternary cosine similarity matrixes of the clients, clustering and dividing the clients by adopting a K-Medoids algorithm according to the similarity distances and the base layer weight updating vectors to obtain K groups, and internally aggregating each group to obtain corresponding group average weight;

s6, uploading all the group average weights to a server for global aggregation, and enabling the server to obtain updated basic layer parameters and send the updated basic layer parameters to a client;

s7, judging whether a federal learning iteration threshold is reached, if so, entering a step S8, otherwise, returning to the step S3;

s8, fixing the basic layer parameters of the local model by the client, and performing random gradient descent on the local model through local data to improve personalized parameters to finally obtain the personalized model of the client.

Furthermore, each client improves the base layer parameters and the personalized layer parameters of the local model through random gradient descent, so that the self base layer weight updating vector is obtained, and the updating processes of the base layer parameters between the clients are mutually independent; the calculation formula of the weight update vector of the base layer is expressed as follows:

wherein, the first and the second end of the pipe are connected with each other,

representing the weight of the base layer obtained after the client i, i = {1,2,, N } adopts random gradient descent in the t federal learning round,

represents the personalized layer weight, W, obtained by the client i after the client i adopts random gradient descent in the t-th federal learning turn _B ^(t-1) Base layer parameters representing server updates after the t-1 federal learning round,

representing the individualized layer weight obtained by the client i after random gradient descent in the t-1 federal learning turn, C _i Representing bulk data sampled from local data of client i, SGD _i Representing the random gradient descent method adopted by the client i,

the base layer weight update vector representing the client i in the tth federal learning round.

Further, the process of acquiring the ternary cosine similarity matrix of the client i, i = {1,2,, N } in step S3 includes:

s31, updating the vector dimensionality reduction of the base layer weight of the client i by adopting a singular value decomposition algorithm to obtain a ternary vector matrix of the client i, wherein the expression is as follows:

wherein, V _i Ternary vector matrix, v, representing client i _i1 、v _i2 And v _i3 Representing the cardinal direction vectors in the ternary vector matrix for client i,

representing a base layer weight updating vector of the client i in the t federal learning turn;

s32, defining the similarity of the ternary cosine of the client i based on the ternary vector matrix, wherein the similarity is expressed as:

wherein the content of the first and second substances,

representing the ternary cosine similarity, v, of client i _scale An inverse matrix representing the product of the base layer weight update vector and the ternary vector matrix,

a product operator representing a hadamard matrix;

s33, normalizing the ternary cosine similarity of the client i to obtain a ternary cosine similarity matrix of the client i, wherein the ternary cosine similarity matrix is expressed as follows:

wherein, M _i A ternary cosine similarity matrix representing client i.

Further, the step S4 of obtaining the group average weight of each small group includes:

s41, calculating the similarity distance between every two clients according to the ternary cosine similarity matrix, wherein the similarity distance is expressed as follows:

wherein alpha is _i,j Represents the similarity distance, M, of client i and client j _i Ternary cosine similarity matrix, M, representing client i _j A ternary cosine similarity matrix representing client j;

s42, randomly selecting ternary cosine similarity matrixes of K clients as clustering centers, clustering and dividing through similarity distances, measuring clustering quality by adopting a cost function, and finally obtaining K small groups;

s43, each group carries out safety aggregation in the group to obtain a corresponding group average weight, and the calculation formula is as follows:

represents group g in the t federal learning round _k The group average weight of (a) is,

represents the base layer weight update vector of the client i in the t-th federal learning turn, c _i Which represents the client-side i-the client,

represents group g _k A set of group members of (1);

indicating that client i is group g _k Group member of (1), n _i Representing the number of samples on client i and n representing the total number of samples for all clients in a group.

Further, the Cost function Cost is expressed as:

Cost＝E _m -E _m-1

wherein E is _m Evaluation score representing the m-th group update result, E _m-1 The evaluation scores of the group updating results of the (m-1) th time are shown, p represents a ternary cosine similarity matrix of the client sides except the cluster center,

denotes the kth subgroup, o, in the m-th subgroup update _k The cluster center of the kth subgroup is indicated, and K indicates the number of subgroups.

Further, an objective function is set in step S7 with the goal of minimizing the average personalized population loss, expressed as:

wherein, W _B Representing the final base layer parameters obtained after federal training,

representing the personalization layer parameters owned locally by the first client, N representing the number of all clients participating in federal training,

represents the mathematical expectation of the ith client personalization loss function, (x, y) represents the data sample distribution of client i,

the personalized layer weight of the ith client is represented, f represents an output function, and l represents a personalized loss function common to all the clients.

The invention has the beneficial effects that:

the problem of high data heterogeneity in federal learning makes it difficult to train a global model suitable for all clients through federal learning. Meanwhile, federal learning also has a problem of client weight divergence when training a global model, which is also a problem caused by data heterogeneity. Because the local data distribution of each client is different, the model optimization directions are different, and the convergence speed and the convergence effect of the global model are greatly reduced. Aiming at the problems, the invention improves the parameters and the training process of each client end which needs to be uploaded to the federal learning, and provides a personalized federal learning method based on parameter layering. According to the method, model parameters are layered, the model parameters are divided into basic layer parameters and personalized layer parameters, the basic layer parameters are used for participating in federal training, the personalized layer parameters are reserved locally, and unique personalized features of each client are reserved. The problem of different local data distribution is solved, and the trained model is more suitable for the local client. Meanwhile, groups are dynamically divided according to the similarity of parameter updating in the training process, and the convergence speed of the global model is accelerated.

Drawings

FIG. 1 is a schematic diagram of a personalized Federal learning framework based on parameter stratification according to the present invention;

fig. 2 is a flow chart of the personalized federal learning method based on parameter stratification according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The invention provides a parameter layering-based personalized federated learning method, which is used for accelerating the convergence speed of a global model by grouping the update direction similarity of parameters of a base layer, relieving the problem of local distribution difference by locally utilizing parameters of a personalized layer, and finally realizing the customization of the personalized model of a local client. As shown in fig. 1, the present invention mainly comprises 3 parts:

s1, before federal learning training, a server issues a main model for initializing parameters, the parameters of the main model are divided into basic layer parameters and personalized layer parameters, and a client downloads the main model issued by the server as a local model;

s2, after early preparation of the Federal learning training is completed, the client side trains a local model by adopting local data to obtain a basic layer parameter updating direction; clustering the client based on the updating direction of the parameters of the basic layer, so that the clustering result is more accurate and effective; then calculating the group average weight of the group and uploading the group average weight to a server for global aggregation to obtain updated basic layer parameters, downloading the updated basic layer parameters obtained by the server again by the client, and repeating the operation of S2 until the optimal basic layer parameters are obtained;

and S3, the data distribution of different clients is different, so that the global parameters obtained through the federal training are not suitable for each client. Therefore, the server side initializes the basic layer parameters at the beginning, the client side initializes the personalized layer parameters of the client side, the basic layer parameters are used for participating in federal training to obtain more generalized global basic layer parameters, and the personalized layer parameters participate in the training of each iteration. And finally, each client uses local data to update SGD through the trained basic layer parameters and personalized layer parameters to obtain a personalized model more suitable for local data distribution.

In an embodiment, a specific process of the personalized federal learning method based on parameter stratification, as shown in fig. 2, includes the following steps:

s10, constructing an individualized federated learning system comprising N clients and a server, wherein the server is provided with a main model after parameter initialization;

s20, downloading a main model from a server by a client as a local model, wherein parameters of the main model are divided into basic layer parameters and personalized layer parameters;

s30, the client improves the basic layer parameters and the personalized layer parameters of the local model through random gradient descent based on the local data to obtain basic layer weight updating vectors;

s40, updating the vector dimension reduction of the base layer weight to obtain a ternary vector matrix, and measuring the ternary vector matrix by a ternary cosine similarity method to obtain a ternary cosine similarity matrix;

s50, calculating similarity distances among the clients through ternary cosine similarity matrixes of the clients, clustering and dividing the clients by adopting a K-Medoids algorithm according to the similarity distances and the base layer weight updating vectors to obtain K groups, and internally aggregating each group to obtain corresponding group average weight;

s60, uploading all the group average weights to a server for global aggregation, and enabling the server to obtain updated basic layer parameters and send the updated basic layer parameters to a client;

s70, judging whether the federal learning iteration threshold is reached, if so, entering a step S80, otherwise, returning to the step S30;

s80, the client fixes the base layer parameters of the local model, and random gradient descent is carried out on the local model through local data to improve personalized parameters, so that the personalized model of the client is obtained finally.

Preferably, in multiple loop iterations of federal learning, each client needs to first download updated base layer parameters of the server in each round, and then perform multiple random gradient descent on the base layer parameters (i.e. the latest base layer parameters downloaded by the client from the server) and the personalized layer parameters of the local model by using local data to improve the parameters of the local model, so as to obtain updated base layer parameters and updated personalized layer parameters of the local model, thereby obtaining an update direction of the base layer parameters of the local model, wherein a calculation formula of a base layer weight update vector (i.e. an update direction of the base layer parameters of the local model) is represented as:

representing the individualized layer weight W obtained by the client i after random gradient descent in the t federal learning turn _B ^(t-1) Base layer parameters representing server updates after the t-1 federal learning turn,

representing the personalized layer weight obtained by the client i after random gradient descent in the t-1 th federal learning turn, C _i Representing bulk data sampled from local data of client i, SGD _i Representing the random gradient descent method adopted by the client i,

Specifically, each client improves the base layer parameters and the personalized layer parameters of the local model through random gradient descent, so that the self base layer weight updating vector is obtained, and the base layer parameter updating processes among the clients are independent.

Preferably, after obtaining the update direction of the parameters of the base layer of the local model, the client needs to perform dimension reduction on the parameters of the base layer, and then obtains a ternary cosine similarity matrix of the client as a basis for clustering, which specifically comprises the following steps:

s31, in order to reduce the calculation complexity and facilitate the subsequent representation of the ternary cosine similarity matrix, updating the vector dimensionality reduction of the base layer weight of the client i by adopting a singular value decomposition algorithm to obtain the ternary vector matrix of the client i, wherein the representation is as follows:

wherein, V _i Ternary vector matrix, v, representing client i _i1 、v _i2 And v _i3 A basic direction vector in a ternary vector matrix representing the client i is used for representing the optimization direction of the client base layer parameters,

s32, in order to reduce the calculation cost, a measurement method is provided, namely the ternary cosine similarity is used for measuring the optimization direction of the updated basic layer parameters; namely, the ternary cosine similarity of the client i is defined based on the ternary vector matrix, and is expressed as:

a product operator representing a hadamard matrix;

s33, normalizing the ternary cosine similarity of the client i to an interval [0,1], and obtaining a ternary cosine similarity matrix of the client i, wherein the ternary cosine similarity matrix is expressed as:

wherein M is _i A ternary cosine similarity matrix representing client i.

Preferably, on the basis of the ternary cosine similarity matrix, the embodiment constructs a module for updating similarity based on the client base layer parameters, in consideration of the advantage of clustering in solving the problem of weight divergence. The method comprises the steps of firstly calculating similarity distance through a cosine formula, then clustering and dividing clients participating in federal training through a K-Medoids algorithm, dividing the clients into a plurality of groups based on a parameter updating direction, finally carrying out safety aggregation in each group to obtain group average weight of the group, carrying out global aggregation on each group average weight at a server to obtain base layer weight of the next round, and distributing the base layer weight to each client for the next round of federal training.

Specifically, the process of obtaining the group average weight of each group and uploading the group average weight to the server, and the updating of the base layer parameters by the server includes:

s41, calculating the similarity distance between every two clients according to the ternary cosine similarity matrix, and taking the similarity distance as the clustering distance basis of clustering division, which is beneficial to dividing the clients with the same updating direction into a group and accelerating the rapid convergence of the average weight in the group, wherein the similarity distance calculation formula is expressed as:

wherein alpha is _i,j Represents the similarity distance, M, between client i and client j _i Ternary cosine similarity matrix, M, representing client i _j A ternary cosine similarity matrix representing client j;

s42, randomly selecting ternary cosine similarity matrixes of K clients as clustering centers, clustering and dividing through similarity distances, measuring clustering quality by adopting a cost function, and finally obtaining K subgroups;

specifically, cluster division is performed through similarity distances, that is, division is performed through the sizes of the similarity distances between the remaining clients and the clients corresponding to the cluster center, the clients not corresponding to the cluster center are allocated to the group with the smallest similarity distance, and then an updating process is started: and (3) at each updating, randomly selecting one group member as a clustering center for each group, replacing the original clustering center, restarting clustering grouping, judging whether the updated clustering effect is improved, if so, keeping the replacement, and otherwise, recovering the result to the last result. When the clustering effect of the replacement is not improved any more, the updating is stopped.

Specifically, a Cost function is adopted to measure the quality of the clustering result, and the Cost function Cost is expressed as:

Cost＝E _m -E _m-1

wherein E is _m Evaluation score representing the m-th group update result, E _m-1 The evaluation scores of the (m-1) th group updating results are shown, p represents a ternary cosine similarity matrix of the client sides except the cluster center,

denotes the kth subgroup, o, in the m-th subgroup update _k The cluster center of the kth subgroup is indicated, and K indicates the number of subgroups. When the cost function is not changed any more, all the central points are not changed any more or the set maximum iteration times are reached, and the optimal group division G = { G } is obtained according to the clustering division algorithm ₁ ,g ₂ ,...,g _K }，

wherein the content of the first and second substances,

represents the base layer weight update vector of the client i in the t-th federal learning turn, c _i Which represents the client-side i-to,

represents group g _k A set of group members of (1);

The group mean weights for each group were finally obtained and are expressed as:

and S44, uploading the group average weight of each group to a central server for global aggregation to obtain the latest basic layer parameters, and redistributing the latest basic layer parameters to each client for the next round of federal training.

In an embodiment, a method for layering client parameters is designed for the problem of data difference, a base layer is uploaded to a central server for global aggregation, a personalized layer is trained on local data, and for a local model of each client, the number of layers is defined as:

wherein, K _B Representing the number of base layer parameters, K _P Representing the number of personalization level parameters.

Secondly, defining a forward transmission mode of the local model data, and expressing as follows:

the base layer weight matrix (i.e. base layer parameters) representing the client,

representing layer 1 parameters in the base layer parameters, parameters of different layers in the base layer parameters may have different dimensions, the base layer weight matrices for different clients are the same,

a personalization level weight matrix (i.e., personalization level parameters) representing the client i, a representing an activation function, specifically represented as a certain level according to subscripts; the data of the client end passes through the base layer firstly and then passes through the personalized layer to finally obtain output, and forward transmission can be simplifiedThe single description is as follows:

specifically, after the federal learning is completed, the optimal base layer parameters are obtained, then personalized layer parameters need to be optimized, namely, the client optimizes the personalized layer parameters of the client by using local data, and at the moment, an objective function is set with the purpose of minimizing the average personalized population loss, and the objective function is expressed as follows:

the personalized layer weight of the ith client is represented, f represents that a sample x of the client i firstly passes through the base layer and then passes through an output function of the personalized layer, and l represents a personalized loss function common to all the clients.

Distribution P due to real data generation _i Unknown during training, we will use the personalized loss function of the ith device as a proxy for the simulated global loss function (minimizing the average personalized population loss), the loss on the ith device (client) being defined as:

wherein, W _B Represents the final base layer parameters, W, obtained after federal training _P Representing a personalization layer parameter, n, unique to the ith device _i Represents the sample size of the ith device, (x) _i,j ,y _i,j ) Representing the jth sample size in the ith device's data distribution.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A personalized federal learning method based on parameter stratification is characterized by comprising the following steps:

s2, a client downloads a main model from a server as a local model, and parameters of the main model are divided into basic layer parameters and personalized layer parameters;

s5, calculating similarity distances among the clients through ternary cosine similarity matrixes of the clients, clustering and dividing the clients according to the similarity distances and the base layer weight updating vectors by adopting a K-Medoids algorithm to obtain K groups, and internally aggregating each group to obtain corresponding group average weights;

s8, the client fixes the base layer parameters of the local model, and random gradient descent is carried out on the local model through local data to improve personalized parameters, so that the personalized model of the client is obtained finally.

2. The individualized federated learning method based on parameter stratification according to claim 1, wherein each client improves the base layer parameters and individualized layer parameters of the local model through random gradient descent, thereby obtaining the base layer weight update vector thereof, and the base layer parameter update processes between the clients are independent from each other; the calculation formula of the base layer weight updating vector is expressed as follows:

wherein the content of the first and second substances,

representing the weight of a base layer obtained after the client i, i = {1,2,, N } adopts random gradient descent in the t-th federal learning turn,

representing the individualized layer weight obtained by the client i after random gradient descent in the t-1 federal learning turn, C _i Representing bulk data sampled from local data of client i, SGD _i Represents the random gradient descent method adopted by the client i,

and representing the base layer weight updating vector of the client i in the t federal learning turn.

3. The personalized federal learning method based on parameter stratification according to claim 1, wherein the process of obtaining the ternary cosine similarity matrix of the client i, i = {1,2,, N } in step S3 comprises:

s31, updating vector dimensionality reduction on the base layer weight of the client i by adopting a singular value decomposition algorithm to obtain a ternary vector matrix of the client i, wherein the expression is as follows:

wherein the content of the first and second substances,

a product operator representing a hadamard matrix;

wherein M is _i A ternary cosine similarity matrix representing client i.

4. The personalized federal learning method based on parameter stratification according to claim 1, wherein the step S4 of obtaining the group average weight of each group comprises:

s41, calculating the similarity distance between every two clients according to the ternary cosine similarity matrix, wherein the similarity distance is expressed as:

wherein the content of the first and second substances,

base layer weight update vector representing client i in the t federal learning round, c _i Which represents the client-side i-to,

represents group g _k A set of group members of (1);

indicating that client i is the group g _k Group member of (1), n _i Representing the number of samples on client i and n representing the total number of samples for all clients in a group.

5. The personalized federal learning method based on parameter stratification according to claim 4, wherein the Cost function Cost is expressed as:

Cost＝E _m -E _m-1

6. The personalized federal learning method based on parameter stratification according to claim 1, wherein an objective function is set in step S7 with the purpose of minimizing average personalized population loss, which is expressed as:

representing the personalization layer parameters owned locally by the first client, N representing the number of all clients participating in the federal training,