CN113705610A - Heterogeneous model aggregation method and system based on federal learning - Google Patents

Heterogeneous model aggregation method and system based on federal learning

Info

Publication number
CN113705610A
Authority
CN
China
Prior art keywords
model
client
data
data set
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110844739.0A
Other languages
Chinese (zh)
Other versions
CN113705610B (en
Inventor
陈孔阳
张炜斌
陈卓荣
严基杰
黄耀
李进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN202110844739.0A priority Critical patent/CN113705610B/en
Publication of CN113705610A publication Critical patent/CN113705610A/en
Application granted granted Critical
Publication of CN113705610B publication Critical patent/CN113705610B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of federal learning, and in particular to a heterogeneous model aggregation method and system based on federal learning. The method comprises the following steps: initializing a neural network model; each client contributes part of its local data and uploads it to a server to form a shared data set, on which a CGAN model is trained; each client trains its local model using the local data set and the data set generated by the CGAN model, predicts each data of the shared data set, and uploads the prediction scores to the server; the server calculates the degree to which each client's prediction scores deviate from the others', takes the reciprocal of the result as the weight, calculates a global prediction score, and uses the global prediction score to perform knowledge distillation on the server model; each client downloads the prediction scores of the other client models from the server to perform cooperative training; the models converge after multiple iterations. The invention can solve the problem of heterogeneous client data, and because the client models upload and download only the prediction scores of the shared data set, the communication volume between the clients and the server is reduced.

Description

Heterogeneous model aggregation method and system based on federal learning
Technical Field
The invention relates to the field of federal learning, and in particular to a heterogeneous model aggregation method and system based on federal learning.
Background
Deep learning is developing rapidly, but it has an obvious drawback: a large amount of data is required for training to achieve good performance. In recent years, attention to data privacy and security has become a worldwide trend, and at the same time most industrial data exist as isolated data islands. How to jointly train an excellent model while satisfying user privacy protection, data security, and government regulations is therefore a key technical problem.
Federated learning still faces many challenges. The two most important are the heterogeneity of client models and the differences among clients' local data. Because the clients are not necessarily identical and are located in different places, their communication capacity, computing power, and data differ greatly, and these differences can seriously degrade the quality of the jointly trained model.
In recent years, a method has been proposed that allows different clients to design different network structures according to their computing power: in each round, every client downloads the average prediction score of all client models on a shared data set and uses knowledge distillation to fit its local model to this average prediction score, thereby learning the global consensus. This method still has the following disadvantages:
1. the method still does not well solve the problem of client data heterogeneity, and under the condition that local data sets of the clients are not independently and identically distributed, the model performance of the clients is greatly different, and fairness is poor.
2. The average prediction score of the shared data set is calculated by all the client models only by adopting a simple average method without considering the performance difference of all the client models, so that the quality of the average prediction score is seriously influenced if some client models have poor performance.
Another approach has also been proposed in recent years. In this method, several clients are randomly selected in each round and sent the parameters of the previous round's aggregated model; the clients update the parameters with local data and send them back to the server. The server averages the weights of the received model parameters and then performs ensemble distillation using unlabeled data or data produced by a generator (such as a GAN) to obtain the aggregated model parameters of the current round. This method still has the following disadvantages:
1. the client still needs to upload and download the model parameters to the server, and the problem of communication volume is not effectively solved.
2. In the method, the server side needs to average the model parameters of each client side, so that the client side models are not completely heterogeneous, some client side models are required to be isomorphic, and only the client side models with the same model structure can be aggregated.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention aims to provide a federal learning method which can better solve the problem of heterogeneous client data, allow the client model to be heterogeneous and send a prediction score to a server to reduce the communication volume.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a heterogeneous model aggregation method and system based on federated learning comprise the following steps:
s1, each client side contains a local data set and initializes a neural network model, and the server side initializes a neural network model;
s2, each client contributes a small part of local data set and uploads the small part of local data set to the server;
s3, forming a batch of shared data sets by the service end, and training a CGAN model by using the shared data sets;
s4, each client downloads the shared data set and the CGAN model from the server to the local;
s5, the server randomly selects a plurality of clients;
s6, the client trains a local client model by using an enhanced iteration method by using local data and CGAN generated data;
s7, the client uses the local client model to predict each data of the shared data set in turn and uploads the predicted score to the server;
s8, the server side calculates the deviation degree of the prediction scores of the clients and other clients by using a JS function, and the reciprocal of the calculated result is used as the weight of the prediction scores;
s9, calculating global prediction scores by using a weighted average method according to the prediction score weights and the prediction scores of the clients calculated in the step S7;
s10, the server model carries out knowledge distillation on the server model by using the global prediction score;
s11, downloading the prediction scores of other client models from the server by the client, and then performing cooperative training on the local model through the prediction scores of the other client models and the prediction scores of the local model;
s12, iterating S5 to S11 for multiple times, and finally converging the server model and the client model;
and S13, downloading the server model to the local client by each client.
Preferably, in step S1, the model structure and the model parameters of each client are different, and the data distribution of the local data set of each client is different.
Preferably, in step S3, the data classes of the shared data set are balanced, and only one class of data can be generated by one CGAN model.
Preferably, in step S5, only k clients are selected from each round of federal learning to perform local training, where k is generally smaller than the total number of clients.
Another object of the present invention is to provide a heterogeneous model aggregation system based on federal learning, which includes:
the enhanced iteration module, which is used for ensuring that, during batch iteration, the client model sees a complete and uniformly distributed set of data classes, so that the client model can continuously correct its gradient descent direction during training and the gradient descends toward the optimal solution;
the cooperative training module, which is used for solving the problem of insufficient data representation at the client during training; it adds an extra term to the loss function to guide the gradient descent direction of the model, so that the client models achieve a cooperative effect;
the knowledge distillation module, which is used for solving the reliability problem of the client models under heterogeneous conditions; in each round, every client sends its prediction scores to the server, the server calculates the weight of each client model's prediction scores with the JS function, then calculates the global prediction score by weighted averaging, and the server model performs knowledge distillation using the global prediction score.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method first trains the CGAN model using the shared data set, and each client uses the CGAN model to generate the data it is missing, so that the local data become approximately independently and identically distributed; this better solves the problem of heterogeneous client data.
2. The invention allows the client models to be heterogeneous, and each client model uses enhanced iteration and cooperative training in turn, so that every client model can correct its gradient descent direction and learn the knowledge of the other client models, which further improves the client models' performance and reduces the differences between them. In terms of communication volume, the client models upload and download only the prediction scores of the shared data set, which greatly reduces the communication volume compared with federated learning schemes that exchange model parameters.
3. The method calculates, in turn, the JS divergence between each client model's prediction scores and the average prediction scores of the remaining client models, and uses the reciprocal of this JS divergence as the weight of that client model's prediction scores, so that an excellent client model, i.e., one whose prediction scores have a small JS divergence from the average prediction scores of the remaining clients, receives a larger weight, and a poor one a smaller weight; finally the weights of the client models' prediction scores are normalized. The quality of the weighted prediction score obtained with the JS function is much higher than that of the simple average prediction score used in previous federated learning.
Drawings
FIG. 1 is a flow chart of the training of federated learning-based heterogeneous model aggregation in an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the effect of enhanced iteration in an embodiment of the present invention;
FIG. 3 is a graph showing the experimental effect of the accuracy of the model of the server under different values of lambda in the embodiment of the present invention;
fig. 4 is a diagram of an effect of an experiment on client model accuracy under client model heterogeneity in the embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described in further detail with reference to the accompanying drawings and examples, and it is obvious that the described examples are some, but not all, examples of the present invention, and the embodiments of the present invention are not limited thereto. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below:
A neural network: a mathematical model that simulates the behavioral characteristics of biological neural networks and performs distributed, parallel information processing. Such a network processes information by adjusting the interconnections among a large number of internal nodes, depending on the complexity of the system.
Federal learning: federal machine learning is a machine learning framework, and can enable multiple parties to develop efficient machine learning under the condition of meeting the requirements of user privacy protection, data security and government regulations.
Knowledge distillation: a deep learning technique that introduces a complex teacher network with strong inference performance and uses the soft targets produced by the teacher as part of the optimization objective, thereby guiding the training of a simplified student network and realizing knowledge transfer.
JS divergence: a variant of the KL divergence that measures the similarity of two probability distributions. The JS divergence solves the asymmetry problem of the KL divergence: it is symmetric, and its value lies between 0 and 1 (with base-2 logarithms). For two distributions P and Q it is defined as
JS(P ‖ Q) = (1/2)·KL(P ‖ M) + (1/2)·KL(Q ‖ M), where M = (1/2)·(P + Q).
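As a concrete illustration, the following is a minimal numpy sketch of the JS divergence between two discrete distributions, assuming both are given as probability vectors over the same support (base-2 logarithms, so the value stays in [0, 1]):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete probability vectors p and q (base-2 log)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log2(p / q)))

def js_divergence(p, q):
    """Symmetric JS divergence; 0 for identical distributions, at most 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Example: two softmax-style prediction scores over three classes.
print(js_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # small positive value
```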
Conditional Generative Adversarial Network (CGAN): a deep learning model consisting of (at least) two modules, a generative model and a discriminative model, which learn through a mutual adversarial game and eventually produce high-quality output. After training, feeding specific condition information into the generative model makes it generate data of the corresponding kind.
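To make the conditioning mechanism concrete, here is a toy PyTorch sketch (an illustrative generator only, not the network used by the invention) in which a noise vector is concatenated with a one-hot class label, so that after adversarial training the requested label steers what is generated:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyConditionalGenerator(nn.Module):
    """Toy CGAN generator: input is noise z concatenated with a one-hot label."""
    def __init__(self, noise_dim=64, num_classes=10, out_dim=784):
        super().__init__()
        self.num_classes = num_classes
        self.net = nn.Sequential(
            nn.Linear(noise_dim + num_classes, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
            nn.Tanh(),  # outputs in [-1, 1], matching normalized image pixels
        )

    def forward(self, z, labels):
        onehot = F.one_hot(labels, self.num_classes).float()
        return self.net(torch.cat([z, onehot], dim=1))

# Sampling 16 examples "of class 3" (meaningful only after adversarial training):
gen = ToyConditionalGenerator()
z = torch.randn(16, 64)
labels = torch.full((16,), 3, dtype=torch.long)
fake = gen(z, labels)  # shape (16, 784)
```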
In the field of internet of things, a plurality of devices are located at different spatial positions, so that the data acquired by the devices are distributed differently, the network bandwidths of the devices are different, and the performances and the computing capabilities of the devices are different. If the traditional federal learning algorithm is directly used in the above scenario, the performance of the constructed combined model cannot reach the expected index because the differences of the device data distribution, the performance and the network bandwidth are not fully considered. The invention provides a heterogeneous model aggregation method and system based on federal learning. In addition, the invention allows the client models to be heterogeneous, and each client model uses the enhanced iteration and the cooperative training in sequence, so that each client model can correct the gradient descending direction and learn the knowledge of other client models, thereby further improving the performance of the client models and reducing the difference between the client models. In terms of communication volume, the client model uploads and downloads the prediction scores of the shared data set, and compared with federal learning, the communication volume is greatly reduced.
Example 1
As shown in fig. 1, the federated learning-based heterogeneous model aggregation method in this embodiment uses the MNIST data set and comprises the following steps:
s1, setting a local data set D at each clientiAnd initializing a neural network model MiWherein i is a client serial number, and a neural network model M is initialized at a server; wherein, the neural network model M of each clientiAre allowed to be different, each client local data set DiThe data distribution is not the same.
In this embodiment, each client is a computer with a certain computing power. For the client numbered i, the local data set Di consists of data of a subset of the classes in the MNIST data set, and the structure of the initialized neural network model Mi is shown in Table 1:
TABLE 1 (network structure; given as an image in the original publication and not reproduced here)
The server is a data center with strong performance and large communication capacity; its model is M, whose structure is also shown in Table 1. Because the client computers are of different brands and located in different places, their data sets, network bandwidths, and computing power differ.
And S2, each client contributes a small part of local data and uploads the local data to the server.
In this embodiment, each computer wirelessly uploads a small portion of data of the local client to the data center of the server.
S3, the uploaded local data form a shared data set D at the server, and the shared data set D is used to train CGAN models Gj, where j is a data category and the data categories of the shared data set D are balanced; each CGAN model Gj can only generate data of category j, and the number of CGAN models Gj equals the number of data categories.
In this embodiment, the data center of the server receives the local data uploaded by all computers and integrates the uploaded local data into the shared data set D. Using its strong computing power, the data center trains high-performance CGAN models Gj, where j is the data class.
S4, each client downloads the shared data set D and the CGAN models Gj from the server to the local machine, where the number of CGAN models Gj each client downloads equals the number m of CGAN models trained in step S3.
In this embodiment, each computer wirelessly downloads the shared data set D and the CGAN models Gj from the data center to its local storage.
S5, the server randomly selects several clients and sends them selection instructions; in each round of federated learning the server randomly selects only k clients to perform local training, where k is generally smaller than the total number of clients.
In this embodiment, the data center randomly selects several computers and sends the selection instructions to those computers.
S6, client i uses the local data set Di and the data sets generated by the CGAN models Gj to train the local model Mi with an enhanced iteration method. The purpose of the enhanced iteration is to correct the direction of gradient descent; a schematic diagram of its effect is shown in fig. 2. The specific steps of the enhanced iteration method are:
Step S61: for each data class j, the CGAN model Gj is used in turn to generate a data set dj labelled with class j, where the size of dj is N. The data set dj is used to train the local client model Mi for one round according to
θi ← θi − η·∇θi Σn CrossEntropyLoss(Mi(dj,n), j),
where CrossEntropyLoss() is the cross entropy loss function, θi are the parameters of the local client model Mi, η is the learning rate, Mi(dj,n) is the prediction of the local client model Mi for the nth data of data set dj, and j is the data label. After training with data set dj, the average prediction score P̄j of the local client model Mi over all data of dj during this training is obtained according to
P̄j = (1/N)·Σn Mi(dj,n).
Step S62: the local data set Di is used to train the local client model Mi for multiple rounds according to
θi ← θi − η·∇θi Σn [ α·KLDivLoss(Mi(Di,n), P̄yn) + (1−α)·CrossEntropyLoss(Mi(Di,n), yn) ],
where KLDivLoss() is the relative entropy loss function, CrossEntropyLoss() is the cross entropy loss function, θi are the parameters of the local client model Mi, η is the learning rate, α is a regularization parameter, Mi(Di,n) is the prediction of the local client model Mi for the nth data of the local data set Di, yn is the label of the nth data of Di, and P̄yn is the average prediction score corresponding to the label class of the nth data of Di.
In this embodiment, the computer that receives the selection instruction uses the local original data set Di and the data generated by the CGAN models Gj to train the local client model Mi.
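The two stages above can be sketched compactly in PyTorch as follows; the generator interface, the batching, and the hyper-parameter values are illustrative assumptions rather than the reference implementation of the invention:

```python
import torch
import torch.nn.functional as F

def enhanced_iteration(model, generators, local_loader, optimizer,
                       num_classes, n_per_class=128, alpha=0.5,
                       local_epochs=3, noise_dim=64):
    """Sketch of steps S61/S62: one pass per class on CGAN-generated data,
    then local training regularized toward the recorded average scores."""
    avg_scores = {}

    # S61: for each class j, generate a batch from G_j, train one round with
    # cross entropy, and record the average prediction score for that class.
    for j in range(num_classes):
        z = torch.randn(n_per_class, noise_dim)
        labels = torch.full((n_per_class,), j, dtype=torch.long)
        x_fake = generators[j](z).detach()   # assumed: G_j only generates class j
        loss = F.cross_entropy(model(x_fake), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            avg_scores[j] = F.softmax(model(x_fake), dim=1).mean(dim=0)

    # S62: several epochs on the local data set, with a KL term pulling every
    # prediction toward the average score of its own label class.
    for _ in range(local_epochs):
        for x, y in local_loader:
            logits = model(x)
            log_probs = F.log_softmax(logits, dim=1)
            targets = torch.stack([avg_scores[int(c)] for c in y])
            loss = (alpha * F.kl_div(log_probs, targets, reduction="batchmean")
                    + (1 - alpha) * F.cross_entropy(logits, y))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return avg_scores
```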
S7, client i uses the local client model Mi to predict each data of the shared data set D in turn and uploads the prediction score Pi to the server.
In this embodiment, the computer numbered i that received the selection instruction uses the local client model Mi to predict the shared data set D and wirelessly uploads the prediction score Pi to the data center of the server.
S8, the server calculates, with the JS divergence function, the degree to which each selected client's prediction scores deviate from those of the other clients, and uses the reciprocal of the JS divergence result as the weight Wi of that client's prediction scores, so that a superior client model, i.e., one whose prediction scores have a small JS divergence from the average prediction scores of the remaining client models, has a larger weight, and conversely a smaller weight.
In this embodiment, the data center receives the prediction scores uploaded by the selected computers, calculates with the JS divergence function the deviation of each selected computer's prediction scores from the other computers' prediction scores, and takes the reciprocal of the result as the weight Wi of that computer's prediction scores. The data center judges the similarity of the client models with the JS divergence function and thereby determines the weight of each client model.
S9, the global prediction score P is calculated with a weighted average method from the prediction scores Pi and the weights Wi of the selected clients' prediction scores calculated in step S8.
In this embodiment, the data center uses the weights Wi of each computer's prediction scores obtained in step S8 to compute the global prediction score P over all selected computers by weighted averaging.
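A minimal numpy sketch of steps S8 and S9 (reusing the js_divergence helper sketched earlier; comparing each client against the mean prediction of the remaining clients is an assumption consistent with the description) might look like this:

```python
import numpy as np

def global_prediction_score(client_scores, eps=1e-8):
    """client_scores: array of shape (k, n_shared, n_classes) holding each
    selected client's softmax scores on the shared data set."""
    scores = np.asarray(client_scores, dtype=float)
    k = scores.shape[0]
    weights = np.empty(k)
    for i in range(k):
        others_mean = scores[np.arange(k) != i].mean(axis=0)
        # S8: average JS deviation of client i from the consensus of the others.
        deviation = np.mean([js_divergence(p, q)
                             for p, q in zip(scores[i], others_mean)])
        weights[i] = 1.0 / (deviation + eps)  # reciprocal: small deviation -> large weight
    weights /= weights.sum()                  # normalize the weights
    # S9: weighted average over clients gives the global prediction score P.
    return np.tensordot(weights, scores, axes=1), weights
```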
S10, the server performs knowledge distillation on the server model M using the shared data set D and the global prediction score P. The specific operations of the knowledge distillation are:
S101, the server model M is trained for multiple rounds on the shared data set D and the global prediction score P according to
θ ← θ − η·∇θ Σn [ α·T²·KLDivLoss(M(Dn), Pn) + (1−α)·CrossEntropyLoss(M(Dn), yn) ],
where KLDivLoss() is the relative entropy loss function, CrossEntropyLoss() is the cross entropy loss function, θ are the parameters of the server model M, η is the learning rate, α is a regularization parameter, M(Dn) is the prediction of the server model M for the nth data of the shared data set D, yn is the label of the nth data of D, Pn is the nth score of the global prediction score P, and T is the temperature of the knowledge distillation. It should be noted that different values of the regularization parameter α lead to different final model accuracy; the corresponding experimental accuracy is shown in fig. 3.
S102, the server model performs knowledge distillation with the global prediction score, and the loss function of the knowledge distillation is set as:
L(M) = α·T²·KLDivLoss(M(Dn), Pn) + (1−α)·CrossEntropyLoss(M(Dn), yn)
where KLDivLoss() is the relative entropy loss function, CrossEntropyLoss() is the cross entropy loss function, L(M) is the training loss of the server model M, θ are the parameters of the server model M, η is the learning rate, α is a regularization parameter, M(Dn) is the prediction of the server model M for the nth data of the shared data set D, yn is the label of the nth data of D, Pn is the nth score of the global prediction score P, and T is the temperature of the knowledge distillation.
In this embodiment, the data center trains its model M by knowledge distillation, using the shared data set D obtained in step S3 and the global prediction score P obtained in step S9.
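A PyTorch sketch of one distillation epoch following the loss above is given below; how the temperature is applied to the global scores is an assumption (the text only states that T is the distillation temperature), and the data loader is assumed to also yield the sample index so each batch can look up its rows of P:

```python
import torch
import torch.nn.functional as F

def distill_server_model(server_model, shared_loader, global_scores, optimizer,
                         alpha=0.5, temperature=2.0):
    """Sketch of step S10: fit the server model to the global prediction score P
    (soft targets) and the true labels of the shared data set (hard targets)."""
    server_model.train()
    for x, y, idx in shared_loader:           # loader yields (data, label, index)
        logits = server_model(x)
        soft_student = F.log_softmax(logits / temperature, dim=1)
        soft_target = global_scores[idx]      # rows of P, assumed to be probabilities
        kd_loss = F.kl_div(soft_student, soft_target, reduction="batchmean")
        ce_loss = F.cross_entropy(logits, y)
        loss = alpha * temperature ** 2 * kd_loss + (1 - alpha) * ce_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```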
S11, client i downloads the prediction scores Pj (j ≠ i) of the other client models from the server, and then uses these prediction scores together with the prediction score Pi of the local client model to perform cooperative training on the local client model Mi. The specific operations of the cooperative training are:
The shared data set D is used to train the local client model Mi for multiple rounds according to
θi ← θi − η·∇θi Σn [ CrossEntropyLoss(Mi(Dn), yn) + α·λ^epoch·Σ_{j≠i} JS(Mi(Dn), Pj,n) ],
where CrossEntropyLoss() is the cross entropy loss function, η is the learning rate, α is the weight factor of the extra term of the loss function, Mi(Dn) is the prediction of the local client model Mi for the nth data of the shared data set D, yn is the label of the nth data of D, and JS(Mi(Dn), Pj,n) is the JS divergence between that prediction and the prediction score Pj,n of client model Mj (j ≠ i) for the nth data of D. Considering that the predictions of the other client models are not updated while the client i model is being trained, the weighting factor λ^epoch is added, where 0 < λ < 1 and epoch is the index of the current training epoch; as the number of training epochs increases, the influence of the other client models' predictions on the training of the client i model gradually decreases. The client judges the similarity between its own model and the other client models with the JS divergence function, which makes it convenient to train cooperatively with multiple client models.
The cooperative training is based on the idea of ensemble learning: by integrating different clients' feature representations of the same group of data, the features of the data are represented more fully. For the client models to achieve this cooperative effect, an extra term is added to the loss function to guide the gradient descent direction of the model. The invention sets the loss function as:
L(Mi) = CrossEntropyLoss(Mi(Dn), yn) + α·λ^epoch·Σ_{j≠i} JS(Mi(Dn), Pj,n)
where CrossEntropyLoss() is the cross entropy loss function, L(Mi) is the training loss of the local client model Mi, η is the learning rate, α is the weight factor of the extra term of the loss function, Mi(Dn) is the prediction of the local client model Mi for the nth data of the shared data set D, yn is the label of the nth data of D, JS(Mi(Dn), Pj,n) is the JS divergence between that prediction and the prediction score Pj,n of client model Mj (j ≠ i) for the nth data of D, and λ^epoch is the weighting factor, with 0 < λ < 1 and epoch the index of the current training epoch; as the number of training epochs increases, the influence of the other client models' predictions on the training of the client i model gradually decreases.
In this embodiment, the computer numbered i that received the selection instruction downloads the other computers' prediction scores Pj (j ≠ i) on the shared data set from the data center and uses these prediction scores together with the shared data set D to cooperatively train the local model Mi.
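One cooperative-training epoch following the reconstructed loss above can be sketched in PyTorch as follows; the torch-side JS helper and the exact placement of the α·λ^epoch factor are assumptions consistent with the description:

```python
import torch
import torch.nn.functional as F

def js_div(p, q, eps=1e-12):
    """Batched JS divergence (base-2) between probability vectors p and q."""
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * torch.log2((p + eps) / (m + eps)), dim=1)
    kl_qm = torch.sum(q * torch.log2((q + eps) / (m + eps)), dim=1)
    return 0.5 * kl_pm + 0.5 * kl_qm

def cooperative_epoch(model, shared_loader, other_scores, optimizer,
                      alpha=0.5, lam=0.9, epoch=1):
    """Sketch of step S11: cross entropy on the shared data set plus a decaying
    JS penalty toward every other client's prediction scores.
    other_scores: dict {client_id: tensor of shape (n_shared, n_classes)}."""
    model.train()
    for x, y, idx in shared_loader:           # loader yields (data, label, index)
        logits = model(x)
        probs = F.softmax(logits, dim=1)
        js_term = sum(js_div(probs, scores[idx]).mean()
                      for scores in other_scores.values())
        loss = F.cross_entropy(logits, y) + alpha * lam ** epoch * js_term
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```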
And S12, iterating steps S5 to S11 for multiple times, and finally converging the server model and the client model.
After multiple iterations of steps S5 to S11, the server model and the client models gradually converge, and the server model becomes a high-performance model that has learned the knowledge contained in the local data of all computers. The accuracy of the heterogeneous client models is shown in fig. 4.
And S13, downloading the server model M to the computer of the client by each client.
The computer of each client wirelessly downloads the server model M from the data center of the server and deploys it on the local machine to run.
Example 2
Based on the same inventive concept as that of embodiment 1, this embodiment further provides a heterogeneous model aggregation system based on federal learning, including:
and the enhanced iteration module is used for ensuring that the data types owned by the client model in the batch iteration process are complete and the types are uniformly distributed, so that the client model can continuously correct the gradient descending direction in the training process, and the gradient descends towards the optimal solution direction.
And the knowledge distillation module is used for solving the reliability problem of the client model under the heterogeneous condition. And each client sends the prediction score to the server in each round, the server calculates the weight of the prediction score of each client model by using a JS function, and then calculates the global prediction score by weighted average. The service-side model carries out knowledge distillation through the global prediction score, wherein the loss function of the knowledge distillation is set as follows:
L(M) = α·T²·KLDivLoss(M(Dn), Pn) + (1−α)·CrossEntropyLoss(M(Dn), yn)
where KLDivLoss() is the relative entropy loss function, CrossEntropyLoss() is the cross entropy loss function, L(M) is the training loss of the server model M, θ are the parameters of the server model M, η is the learning rate, α is a regularization parameter, M(Dn) is the prediction of the server model M for the nth data of the shared data set D, yn is the label of the nth data of D, Pn is the nth score of the global prediction score P, and T is the temperature of the knowledge distillation.
And the cooperative training module is used for solving the problem of insufficient data representation at the client during training. In a federated learning system, the local computing resources of the clients are scarce, and this shortage causes the data feature representation of the client model to be insufficient. The cooperative training is based on the idea of ensemble learning: by integrating different clients' feature representations of the same group of data, the features of the data are represented more fully. For the client models to achieve this cooperative effect, an extra term is added to the loss function to guide the gradient descent direction of the model. The invention sets the loss function as:
L(Mi) = CrossEntropyLoss(Mi(Dn), yn) + α·λ^epoch·Σ_{j≠i} JS(Mi(Dn), Pj,n)
where CrossEntropyLoss() is the cross entropy loss function, L(Mi) is the training loss of the local client model Mi, η is the learning rate, α is the weight factor of the extra term of the loss function, Mi(Dn) is the prediction of the local client model Mi for the nth data of the shared data set D, yn is the label of the nth data of D, JS(Mi(Dn), Pj,n) is the JS divergence between that prediction and the prediction score Pj,n of client model Mj (j ≠ i) for the nth data of D, and λ^epoch is the weighting factor, with 0 < λ < 1 and epoch the index of the current training epoch; as the number of training epochs increases, the influence of the other client models' predictions on the training of the client i model gradually decreases.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A heterogeneous model aggregation method based on federated learning is characterized by comprising the following steps:
s1, setting a local data set at each client and initializing a neural network model, and initializing a neural network model at the server;
s2, each client contributes a small part of local data and uploads the local data to the server;
s3, forming a batch of shared data sets at the service end by the uploaded local data, and training a CGAN model by using the shared data sets;
s4, each client downloads the shared data set and the CGAN model from the server to the local;
s5, the server randomly selects a plurality of clients;
s6, the client trains a local client model by using an enhanced iteration method by using the local data set and the data generated by the CGAN model;
s7, the client uses the local client model to predict each data of the shared data set in turn and uploads the predicted score to the server;
s8, the server side calculates the deviation degree of the prediction scores of the clients and other clients by using a JS function, and the reciprocal of the calculated result is used as the weight of the prediction scores;
s9, according to the prediction score PiStep S8, calculating the global prediction score by using a weighted average method according to the prediction score weight and the prediction score of each client calculated in the step S8;
s10, the server model carries out knowledge distillation on the server model by using the shared data set and the global prediction score;
s11, downloading the prediction scores of other client models from the server by the client, and then performing cooperative training on the local client model through the prediction scores of the other client models and the prediction score of the local client model;
s12, iterating steps S5 to S11 for multiple times, and finally converging the server model and the client model;
and S13, downloading the server model to the computer of the client by each client.
2. The heterogeneous model aggregation method according to claim 1, wherein in step S1, the model structure and the model parameters of the neural network model of each client are different, and the data distribution of the local data set of each client is different.
3. The heterogeneous model aggregation method of claim 1, wherein in step S3, the data classes of the shared data set are balanced, and only one class of data can be generated by one CGAN model.
4. The method for aggregating heterogeneous models according to claim 1, wherein in step S5, the number of clients randomly selected by the server is smaller than the total number of clients.
5. The heterogeneous model aggregation method according to claim 1, wherein in step S6, the enhancement iteration method includes the steps of:
S61: for each data class j, a CGAN model Gj is used to generate a data set dj labelled with class j, where the size of dj is N, and the data set dj is used to train the local model Mi for one round according to
θi ← θi − η·∇θi Σn CrossEntropyLoss(Mi(dj,n), j),
where CrossEntropyLoss() is the cross entropy loss function, θi are the parameters of the local client model Mi, η is the learning rate, Mi(dj,n) is the prediction of the local client model Mi for the nth data of data set dj, and j is the data label; after training with data set dj, the average prediction score P̄j of the local client model Mi over all data of dj during this training is obtained according to
P̄j = (1/N)·Σn Mi(dj,n);
S62: the local data set Di is used to train the local model Mi for multiple rounds according to
θi ← θi − η·∇θi Σn [ α·KLDivLoss(Mi(Di,n), P̄yn) + (1−α)·CrossEntropyLoss(Mi(Di,n), yn) ],
where KLDivLoss() is the relative entropy loss function, CrossEntropyLoss() is the cross entropy loss function, θi are the parameters of the local client model Mi, η is the learning rate, α is a regularization parameter, Mi(Di,n) is the prediction of the local client model Mi for the nth data of the local data set Di, yn is the label of the nth data of Di, and P̄yn is the average prediction score corresponding to the label class of the nth data of Di.
6. The heterogeneous model aggregation method of claim 1, wherein in step S10, the operation of knowledge distillation comprises:
using the shared data set D and the global prediction score P, the server model M is trained for multiple rounds according to
θ ← θ − η·∇θ Σn [ α·T²·KLDivLoss(M(Dn), Pn) + (1−α)·CrossEntropyLoss(M(Dn), yn) ],
where KLDivLoss() is the relative entropy loss function, CrossEntropyLoss() is the cross entropy loss function, θ are the parameters of the server model M, η is the learning rate, α is a regularization parameter, M(Dn) is the prediction of the server model M for the nth data of the shared data set D, yn is the label of the nth data of D, Pn is the nth score of the global prediction score P, and T is the temperature of the knowledge distillation.
7. The heterogeneous model aggregation method according to claim 1, wherein in step S11, the specific operation of the cooperative training includes:
using the shared data set D, the local model Mi is trained for multiple rounds according to
θi ← θi − η·∇θi Σn [ CrossEntropyLoss(Mi(Dn), yn) + α·λ^epoch·Σ_{j≠i} JS(Mi(Dn), Pj,n) ],
where CrossEntropyLoss() is the cross entropy loss function, η is the learning rate, α is the weight factor of the extra term of the loss function, Mi(Dn) is the prediction of the local client model Mi for the nth data of the shared data set D, yn is the label of the nth data of D, JS(Mi(Dn), Pj,n) is the JS divergence between that prediction and the prediction score Pj,n of client model Mj for the nth data of D, with j ≠ i, and λ^epoch is a weighting factor with 0 < λ < 1, epoch being the index of the current training epoch.
8. A federated learning-based heterogeneous model aggregation system, comprising:
the enhanced iteration module is used for ensuring that the data types owned by the client model in the batch iteration process are complete and are uniformly distributed, so that the client model can continuously correct the gradient descending direction in the training process, and the gradient descends towards the optimal solution direction;
the knowledge distillation module is used for solving the reliability problem of the client model under the heterogeneous condition; each client sends a prediction score to the server in each round, the server calculates the weight of the prediction score of each client model by using a JS function, then calculates a global prediction score by weighted average, and the server model carries out knowledge distillation by the global prediction score;
and the cooperation training module is used for solving the problem of insufficient data representation of the client in the training process, and adding additional items by adopting a loss function to guide the gradient descending direction of the model so as to enable the client model to achieve the cooperation effect.
9. The heterogeneous model aggregation system of claim 8, wherein the cooperative training module employs a loss function of:
L(Mi) = CrossEntropyLoss(Mi(Dn), yn) + α·λ^epoch·Σ_{j≠i} JS(Mi(Dn), Pj,n)
where CrossEntropyLoss() is the cross entropy loss function, L(Mi) is the training loss of the local client model Mi, η is the learning rate, α is the weight factor of the extra term of the loss function, Mi(Dn) is the prediction of the local client model Mi for the nth data of the shared data set D, yn is the label of the nth data of D, JS(Mi(Dn), Pj,n) is the JS divergence between that prediction and the prediction score Pj,n of client model Mj for the nth data of D, with j ≠ i, and λ^epoch is a weighting factor with 0 < λ < 1, epoch being the index of the current training epoch.
10. The heterogeneous model polymerization system of claim 8, wherein the loss function of the knowledge distillation is:
L(M) = α·T²·KLDivLoss(M(Dn), Pn) + (1−α)·CrossEntropyLoss(M(Dn), yn)
where KLDivLoss() is the relative entropy loss function, CrossEntropyLoss() is the cross entropy loss function, L(M) is the training loss of the server model M, θ are the parameters of the server model M, η is the learning rate, α is a regularization parameter, M(Dn) is the prediction of the server model M for the nth data of the shared data set D, yn is the label of the nth data of D, Pn is the nth score of the global prediction score P, and T is the temperature of the knowledge distillation.
CN202110844739.0A 2021-07-26 2021-07-26 Heterogeneous model aggregation method and system based on federal learning Active CN113705610B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110844739.0A CN113705610B (en) 2021-07-26 2021-07-26 Heterogeneous model aggregation method and system based on federal learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110844739.0A CN113705610B (en) 2021-07-26 2021-07-26 Heterogeneous model aggregation method and system based on federal learning

Publications (2)

Publication Number Publication Date
CN113705610A true CN113705610A (en) 2021-11-26
CN113705610B CN113705610B (en) 2024-05-24

Family

ID=78650475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110844739.0A Active CN113705610B (en) 2021-07-26 2021-07-26 Heterogeneous model aggregation method and system based on federal learning

Country Status (1)

Country Link
CN (1) CN113705610B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154647A (en) * 2021-12-07 2022-03-08 天津大学 Multi-granularity federated learning based method
CN114429223A (en) * 2022-01-26 2022-05-03 上海富数科技有限公司 Heterogeneous model establishing method and device
CN114492849A (en) * 2022-01-24 2022-05-13 光大科技有限公司 Model updating method and device based on federal learning
CN114626550A (en) * 2022-03-18 2022-06-14 支付宝(杭州)信息技术有限公司 Distributed model collaborative training method and system
CN114844889A (en) * 2022-04-14 2022-08-02 北京百度网讯科技有限公司 Video processing model updating method and device, electronic equipment and storage medium
CN114863169A (en) * 2022-04-27 2022-08-05 电子科技大学 Image classification method combining parallel ensemble learning and federal learning
CN115145966A (en) * 2022-09-05 2022-10-04 山东省计算中心(国家超级计算济南中心) Comparison federal learning method and system for heterogeneous data
CN115511108A (en) * 2022-09-27 2022-12-23 河南大学 Data set distillation-based federal learning personalized method
CN115775010A (en) * 2022-11-23 2023-03-10 国网江苏省电力有限公司信息通信分公司 Electric power data sharing method based on horizontal federal learning
TWI800304B (en) * 2022-03-16 2023-04-21 英業達股份有限公司 Fedrated learning system using synonym
CN116822647A (en) * 2023-05-25 2023-09-29 大连海事大学 Model interpretation method based on federal learning
CN117390448A (en) * 2023-10-25 2024-01-12 西安交通大学 Client model aggregation method and related system for inter-cloud federal learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052334A (en) * 2021-04-14 2021-06-29 中南大学 Method and system for realizing federated learning, terminal equipment and readable storage medium
CN113112027A (en) * 2021-04-06 2021-07-13 杭州电子科技大学 Federal learning method based on dynamic adjustment model aggregation weight

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112027A (en) * 2021-04-06 2021-07-13 杭州电子科技大学 Federal learning method based on dynamic adjustment model aggregation weight
CN113052334A (en) * 2021-04-14 2021-06-29 中南大学 Method and system for realizing federated learning, terminal equipment and readable storage medium

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154647A (en) * 2021-12-07 2022-03-08 天津大学 Multi-granularity federated learning based method
CN114492849A (en) * 2022-01-24 2022-05-13 光大科技有限公司 Model updating method and device based on federal learning
CN114492849B (en) * 2022-01-24 2023-09-08 光大科技有限公司 Model updating method and device based on federal learning
CN114429223A (en) * 2022-01-26 2022-05-03 上海富数科技有限公司 Heterogeneous model establishing method and device
CN114429223B (en) * 2022-01-26 2023-11-07 上海富数科技有限公司 Heterogeneous model building method and device
TWI800304B (en) * 2022-03-16 2023-04-21 英業達股份有限公司 Fedrated learning system using synonym
CN114626550A (en) * 2022-03-18 2022-06-14 支付宝(杭州)信息技术有限公司 Distributed model collaborative training method and system
CN114844889A (en) * 2022-04-14 2022-08-02 北京百度网讯科技有限公司 Video processing model updating method and device, electronic equipment and storage medium
CN114863169A (en) * 2022-04-27 2022-08-05 电子科技大学 Image classification method combining parallel ensemble learning and federal learning
CN115145966A (en) * 2022-09-05 2022-10-04 山东省计算中心(国家超级计算济南中心) Comparison federal learning method and system for heterogeneous data
CN115511108A (en) * 2022-09-27 2022-12-23 河南大学 Data set distillation-based federal learning personalized method
CN115511108B (en) * 2022-09-27 2024-07-12 河南大学 Federal learning individualization method based on data set distillation
CN115775010A (en) * 2022-11-23 2023-03-10 国网江苏省电力有限公司信息通信分公司 Electric power data sharing method based on horizontal federal learning
CN115775010B (en) * 2022-11-23 2024-03-19 国网江苏省电力有限公司信息通信分公司 Power data sharing method based on transverse federal learning
CN116822647A (en) * 2023-05-25 2023-09-29 大连海事大学 Model interpretation method based on federal learning
CN116822647B (en) * 2023-05-25 2024-01-16 大连海事大学 Model interpretation method based on federal learning
CN117390448A (en) * 2023-10-25 2024-01-12 西安交通大学 Client model aggregation method and related system for inter-cloud federal learning
CN117390448B (en) * 2023-10-25 2024-04-26 西安交通大学 Client model aggregation method and related system for inter-cloud federal learning

Also Published As

Publication number Publication date
CN113705610B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN113705610A (en) Heterogeneous model aggregation method and system based on federal learning
Mills et al. Communication-efficient federated learning for wireless edge intelligence in IoT
Wang et al. Fast adaptive task offloading in edge computing based on meta reinforcement learning
Itahara et al. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-iid private data
Zhang et al. MR-DRO: A fast and efficient task offloading algorithm in heterogeneous edge/cloud computing environments
US11715044B2 (en) Methods and systems for horizontal federated learning using non-IID data
CN110113190A (en) Time delay optimization method is unloaded in a kind of mobile edge calculations scene
Lu et al. Auction-based cluster federated learning in mobile edge computing systems
CN108873936B (en) Autonomous aircraft formation method based on potential game
Djigal et al. Machine and deep learning for resource allocation in multi-access edge computing: A survey
CN113469325A (en) Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control
CN107103359A (en) The online Reliability Prediction Method of big service system based on convolutional neural networks
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN114091667A (en) Federal mutual learning model training method oriented to non-independent same distribution data
Long et al. Fedsiam: Towards adaptive federated semi-supervised learning
CN115879542A (en) Federal learning method oriented to non-independent same-distribution heterogeneous data
Liu et al. Enhancing federated learning with intelligent model migration in heterogeneous edge computing
CN116645130A (en) Automobile order demand prediction method based on combination of federal learning and GRU
CN115686846A (en) Container cluster online deployment method for fusing graph neural network and reinforcement learning in edge computing
Huang et al. Active client selection for clustered federated learning
CN112667912B (en) Task amount prediction method of edge server
Hu et al. Communication-efficient federated learning in channel constrained internet of things
Yuan et al. Accuracy rate maximization in edge federated learning with delay and energy constraints
CN114022731A (en) Federal learning node selection method based on DRL
Balevi et al. Synergies between cloud-fag-thing and brain-spinal cord-nerve networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant