CN118070876A - Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium - Google Patents

Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium

Info

Publication number
CN118070876A
Authority
CN
China
Prior art keywords
model
low
rank
student
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410473762.7A
Other languages
Chinese (zh)
Other versions
CN118070876B (en)
Inventor
刘倩
李国庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Athena Eyes Co Ltd
Original Assignee
Athena Eyes Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Athena Eyes Co Ltd filed Critical Athena Eyes Co Ltd
Priority to CN202410473762.7A priority Critical patent/CN118070876B/en
Publication of CN118070876A publication Critical patent/CN118070876A/en
Application granted granted Critical
Publication of CN118070876B publication Critical patent/CN118070876B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a large model knowledge distillation low-rank adaptive federal learning method. A student model and a teacher model are deployed in each client, and a student model with the same structure is deployed in a central server connected with each client. In each client, the total loss of the teacher model and the total loss of the student model are calculated based on the student model, the teacher model and preset local data; a low-rank gradient matrix is obtained from the total loss of the student model by a low-rank adaptation method and uploaded to the central server; the parameters of the student model are then updated with the parameter gradient matrix aggregated by the central server, and the process is repeated until the teacher model converges. Compared with the prior art, the method decomposes the parameter gradient matrix by a low-rank adaptation method, which significantly reduces the number of parameters communicated between the clients and the server during federal learning and greatly improves communication efficiency.

Description

Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a large model knowledge distillation low-rank adaptive federal learning method, electronic equipment and a readable storage medium.
Background
With the development of big data technology, the data of each industry are widely distributed across different institutions, but competition and privacy-protection concerns among institutions give rise to the phenomenon of data islands. Meanwhile, with the continuous improvement of large-scale machine learning algorithms and computer hardware performance, recent research and applications of large models have made remarkable progress, and how to safely and efficiently integrate the data of each medical institution into the training of a large model has become a key link in deploying large medical models.
Federal learning is a key means by which machine learning realizes privacy-preserving computation and resolves data islands, allowing each client to participate in model training jointly without revealing private data. During federal training, in each iteration every client must send the update information of its local model parameters to a central server for aggregation, after which the aggregated result is distributed back to each client; however, a large model has a huge number of parameters and usually needs many rounds of communication before it converges, so high communication cost is a non-negligible problem of federal learning. Pre-trained models, knowledge distillation and low-rank adaptive federal learning techniques can be used to improve federal learning efficiency.
First, by deploying a pre-trained teacher model and a pre-trained student model at each client, training starts from pre-trained weights rather than randomly initialized ones, which reduces the number of training rounds needed for the models to converge and therefore the number of federal learning communication rounds, improving both federal learning communication efficiency and training efficiency.
Knowledge distillation transfers the knowledge of a large teacher model to a student model with a smaller number of parameters and lets the student model participate in federal learning instead, thereby reducing the amount of parameter information transmitted between the clients and the central server during federal learning of the large teacher model. In each iteration, every client computes the sample losses and the adaptive mutual distillation losses of its teacher model and student model on its local data and updates their respective parameters; the student model, with its small number of parameters, then participates in federal learning and cooperatively learns the data knowledge of the other clients. This federal learning training mode of pre-trained-model knowledge distillation reduces the number of communication rounds. When the teacher model structures of the clients are inconsistent, setting the student model structures of all clients to be consistent also solves the heterogeneity problem across clients.
In the prior art, communication pressure is relieved by transmitting prediction results instead of parameter update information when the client data volume is small, but this condition is too restrictive and the approach no longer applies once the client data exceed a certain volume; data-free knowledge distillation methods reduce communication pressure, but the data produced by their lightweight generators cannot meet the requirements of training a large local language model; pre-trained model fine-tuning techniques reduce communication pressure but do not adapt to model heterogeneity; and knowledge distillation can realize federal learning for conventional-scale machine learning models, but its communication efficiency is too low for federal learning of large-scale models, making federal learning of a large medical model difficult to realize.
Accordingly, it would be desirable to provide a large model knowledge distillation low-rank adaptive federal learning method, an electronic device and a readable storage medium that can effectively solve the above-mentioned problems.
Disclosure of Invention
The invention aims to provide a large model knowledge distillation low-rank adaptive federal learning method that is logically clear, safe, effective, reliable and simple to operate, greatly improves federal learning communication efficiency and aggregation efficiency, and efficiently realizes federal learning of a large medical model over multi-party private medical data.
Based on the above purpose, the technical scheme provided by the invention is as follows:
A large model knowledge distillation low-rank adaptive federal learning method comprises the following steps:
S1, a student model and a teacher model are arranged in each client, the student model is used as a local model in federal learning, the teacher model is used as a local private model, and the student model with the same structure is deployed in a central server connected with each client;
S2, acquiring total loss of the teacher model based on the output result of the student model, the output result of the teacher model and preset local data in each client, and updating parameters of the teacher model according to the total loss of the teacher model;
S3, in each client, based on the output result of the student model, the output result of the teacher model and preset local data, obtaining total loss of the student model, calculating a parameter gradient matrix of the student model after low-rank decomposition according to the total loss of the student model and a low-rank adaptation method, obtaining a low-rank gradient matrix, and uploading the low-rank gradient matrix to the student model in a central server;
S4, in the central server, aggregating a plurality of low-rank gradient matrixes through a federal aggregation algorithm to obtain an aggregated parameter gradient matrix, and issuing the aggregated parameter gradient matrix to student models in all the clients;
S5, in each client, updating parameters of the student model according to the aggregated parameter gradient matrix;
And repeating the steps S2 to S5 until the teacher models in the clients are converged.
Preferably, the step S2 includes the steps of:
according to the output result of the student model, a first error and a second error are respectively acquired with respect to the preset local data and the teacher model;
And acquiring the total loss of the teacher model according to a first loss calculation formula, the first error and the second error, and updating the parameters of the teacher model according to the total loss of the teacher model.
Preferably, the preset first loss calculation formula is:
Loss_tea_i = loss1_i + loss2_i
wherein loss1_i is the first error in the i-th client, loss2_i is the second error in the i-th client, and Loss_tea_i is the total loss of the teacher model in the i-th client.
Preferably, the calculating, based on the output result of the student model, the output result of the teacher model and preset local data, obtains total loss of the student model, and calculates a parameter gradient matrix after low-rank decomposition of the student model according to the total loss of the student model and a low-rank adaptation method, to obtain a low-rank gradient matrix, including the following steps:
According to the output result of the teacher model, respectively obtaining a third error and a fourth error with the preset local data and the student model;
acquiring total loss of the student model according to a second loss calculation formula, the third error and the fourth error;
calculating and obtaining the parameter gradient matrix G of the student model according to the total loss of the student model and a preset model original parameter matrix W;
decomposing the parameter gradient matrix G of the student model according to the low-rank adaptation method to obtain the low-rank gradient matrices P and Q.
Preferably, the preset second loss calculation formula is:
Loss_stu_i = loss3_i + loss4_i
wherein loss3_i is the third error in the i-th client, loss4_i is the fourth error in the i-th client, and Loss_stu_i is the total loss of the student model in the i-th client.
Preferably, the low-rank decomposition of the parameter gradient matrix G of the student model according to the low-rank adaptation method to obtain the low-rank gradient matrices P and Q specifically comprises:
G = P × Q, P ∈ R^(m×k), Q ∈ R^(k×n)
wherein k is the low-dimensional intrinsic rank of the parameter gradient matrix during training, G is the parameter gradient matrix of the student model, P and Q are the low-rank gradient matrices, m is the number of rows of the parameter gradient matrix, and n is the number of columns of the parameter gradient matrix.
An electronic device, comprising: a processor, a memory, and a communication bus;
The communication bus is used for realizing connection communication between the processor and the memory;
the processor is configured to execute a verification program of the newly added functional module stored in the memory, so as to implement the step of the large model knowledge distillation low rank adaptive federal learning method according to any one of the above.
A readable storage medium having stored therein computer executable instructions which, when loaded and executed by a processor, implement the steps of the large model knowledge distillation low rank adaptive federal learning method of any of the above.
The invention provides a large model knowledge distillation low-rank adaptive federal learning method. A student model and a teacher model are deployed in each client, and a student model with the same structure is deployed in a central server connected with each client. In each client, the total loss of the teacher model and the total loss of the student model are calculated based on the student model, the teacher model and preset local data; a low-rank gradient matrix is obtained from the total loss of the student model by a low-rank adaptation method and uploaded to the central server; the parameters of the student model are then updated with the parameter gradient matrix aggregated by the central server, and the process is repeated until the teacher model converges.
Compared with the prior art, the method decomposes the parameter gradient matrix by a low-rank adaptation method, can obviously reduce the number of communication parameters between the client and the server during federal learning, and greatly improves the communication efficiency.
The invention also discloses an electronic device and a readable storage medium, which belong to the same technical concept as the above method, solve the same technical problems and provide the same beneficial effects, which are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a large model knowledge distillation low-rank adaptive federal learning method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of step S2 provided in an embodiment of the present invention;
fig. 3 is a flowchart of acquiring a low rank gradient matrix in step S3 according to an embodiment of the present invention;
fig. 4 is a schematic diagram of implementing large model knowledge distillation low-rank adaptive federal learning data trend according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of the model parameter structure of the llama 7B student model under low-rank adaptive federal learning and under normal federal learning according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention is written in a progressive manner.
The embodiment of the invention provides a large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and a readable storage medium. The method mainly solves the technical problems of low communication efficiency and low aggregation efficiency of the existing federal learning model in the prior art.
As shown in fig. 1, a large model knowledge distillation low-rank adaptive federal learning method includes the following steps:
S1, a student model and a teacher model are arranged in each client, the student model is used as a local model in federal learning, the teacher model is used as a local private model, and the student model with the same structure is deployed in a central server connected with each client;
S2, acquiring total loss of the teacher model based on output results of the student model, output results of the teacher model and preset local data in each client, and updating parameters of the teacher model according to the total loss of the teacher model;
S3, in each client, based on the output result of the student model, the output result of the teacher model and preset local data, obtaining total loss of the student model, calculating a low-rank decomposed parameter gradient matrix of the student model according to the total loss of the student model and a low-rank adaptation method, obtaining a low-rank gradient matrix, and uploading the low-rank gradient matrix to the student model in the central server;
S4, in the central server, aggregating a plurality of low-rank gradient matrixes through a federal aggregation algorithm to obtain an aggregated parameter gradient matrix, and issuing the aggregated parameter gradient matrix to student models in all clients;
S5, in each client, updating parameters of the student model according to the aggregated parameter gradient matrix;
and repeating the steps S2 to S5 until the teacher models in the clients are converged.
In step S1, N clients and a central server are deployed, and each client deploys a pre-trained teacher large model together with a pre-trained student large model whose number of parameters is smaller than that of the teacher large model; the server side deploys a student model with the same structure as that of the clients, and when the teacher model is K times the size of the student model, the communication efficiency of each client is improved by about K times;
In this embodiment, 2 clients and a central server are specifically provided. The 2 clients deploy pre-trained teacher models T1 and T2, and pre-trained student models S1 and S2 whose parameter counts are smaller than those of the teacher large models; the server side deploys a student model with the same structure as that of the clients. When the teacher model adopts the 13B llama large model and the student model adopts the 7B llama large model, the student has about 7 billion parameters; after each client adopts the knowledge distillation method, the parameter gradient volume each client contributes to federal communication is reduced by about 6 billion, and across the 2 clients the transmission of about 12 billion parameter gradient values is avoided, improving communication efficiency by about 1.85 times;
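The back-of-the-envelope arithmetic behind these figures can be checked with the short Python sketch below. It is purely illustrative: the 13-billion and 7-billion values are the nominal llama 13B and llama 7B parameter counts assumed here, not exact figures from the embodiment.

```python
# Rough check of the knowledge-distillation communication saving described above.
# The nominal 13B / 7B parameter counts are assumptions, not exact model sizes.
teacher_params = 13_000_000_000   # gradient values communicated if the teacher joined federal learning
student_params = 7_000_000_000    # gradient values communicated when only the student joins
num_clients = 2

saved_per_client = teacher_params - student_params   # about 6 billion values per client
saved_total = num_clients * saved_per_client          # about 12 billion values in total
improvement = teacher_params / student_params         # about 1.86x per client

print(f"saved per client        : {saved_per_client:,}")
print(f"saved over {num_clients} clients    : {saved_total:,}")
print(f"communication efficiency: about {improvement:.2f}x")
```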
In step S4, the central server receives the parameter gradient matrices P and Q uploaded by each client; the server aggregates the P and Q matrices of all clients with a federal aggregation algorithm and distributes the aggregated parameter gradient matrices to each client.
In step S5, each client receives the P matrix and the Q matrix distributed by the central server and updates the parameters of its local student model according to the aggregated parameter gradient matrix, so that the student model learns the knowledge of the other clients through federal learning and, in the next round of local mutual-learning distillation, passes that knowledge on to the teacher model, thereby realizing federal learning of the teacher model.
Steps S2 to S5 are performed cyclically until the local teacher models converge. The communication overhead of each round is reduced by about 3099 times, from about 26 billion parameter gradient values to about 8.39 million parameter gradient values.
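For readers who prefer code, the Python sketch below walks through one reading of the S3 to S5 communication pattern with two clients and toy linear student weight matrices. Everything in it, including the synthetic local data, the squared-error stand-in for the student total loss, the SVD-based factorization and plain FedAvg over the P and Q factors, is an illustrative assumption rather than the patent's actual implementation; the teacher models and the mutual distillation losses of step S2 are omitted for brevity.

```python
# Minimal sketch of the S3-S5 loop; all sizes, data and losses are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 64, 32, 4               # student weight matrix is m x n, assumed intrinsic rank k
num_clients, rounds, lr = 2, 20, 0.1

# Each client holds synthetic local data and a student weight matrix W.
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(128, m))
    Y = rng.normal(size=(128, n))
    clients.append({"X": X, "Y": Y, "W": np.zeros((m, n))})

def student_gradient(c):
    """Gradient G of a toy squared-error 'student total loss' with respect to W (m x n)."""
    X, Y, W = c["X"], c["Y"], c["W"]
    return X.T @ (X @ W - Y) / len(X)

def low_rank_factors(G, k):
    """Factor G ~= P @ Q with P (m x k) and Q (k x n); truncated SVD is one possible choice."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :]

for _ in range(rounds):
    # S3: each client computes its low-rank gradient factors and "uploads" P and Q.
    uploads = [low_rank_factors(student_gradient(c), k) for c in clients]
    # S4: the central server aggregates the factors (plain FedAvg here) and issues them.
    P_agg = sum(P for P, _ in uploads) / num_clients
    Q_agg = sum(Q for _, Q in uploads) / num_clients
    # S5: each client rebuilds the aggregated gradient matrix and updates its student model.
    G_agg = P_agg @ Q_agg
    for c in clients:
        c["W"] -= lr * G_agg

print("values uploaded per client per round:", m * k + k * n)   # low-rank factors P and Q
print("values in the full gradient matrix  :", m * n)           # what uploading G would cost
```

Averaging P and Q separately, as above, is only one plausible reading of how the central server combines the uploaded factors; averaging the reconstructed per-client products P·Q into a single aggregated gradient matrix is an equally valid variant of step S4.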
As shown in fig. 2, preferably, step S2 includes the steps of:
A1. respectively obtaining a first error and a second error according to the output result of the student model, preset local data and a teacher model;
A2. and acquiring the total loss of the teacher model according to the first loss calculation formula, the first error and the second error, and updating the parameters of the teacher model according to the total loss of the teacher model.
Preferably, the preset first loss calculation formula is:
Loss_tea_i = loss1_i + loss2_i
wherein loss1_i is the first error in the i-th client, loss2_i is the second error in the i-th client, and Loss_tea_i is the total loss of the teacher model in the i-th client.
As shown in fig. 3, preferably, the step S3 of obtaining the low rank gradient matrix includes the following steps:
B1. Respectively obtaining a third error and a fourth error according to the output result of the teacher model, preset local data and the student model;
B2. Acquiring total loss of the student model according to the second loss calculation formula, the third error and the fourth error;
B3. Calculating and obtaining the parameter gradient matrix G of the student model according to the total loss of the student model and a preset model original parameter matrix W;
B4. Decomposing the parameter gradient matrix G of the student model according to the low-rank adaptation method to obtain the low-rank gradient matrices P and Q.
Preferably, the preset second loss calculation formula is:
Loss_stu_i = loss3_i + loss4_i
wherein loss3_i is the third error in the i-th client, loss4_i is the fourth error in the i-th client, and Loss_stu_i is the total loss of the student model in the i-th client.
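To make the hard-label and soft-label error terms concrete, the sketch below computes one plausible form of the mutual distillation losses for a single client: a cross-entropy against the local labels plus a KL divergence against the other model's output distribution. The unweighted sum, the specific error functions and the random toy outputs are all assumptions for illustration and may differ from the exact first to fourth error definitions of the method.

```python
# Illustrative mutual-distillation losses for one client; formulas are assumptions.
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Hard-label error of a model's output against the local data labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def kl_divergence(p, q):
    """Soft-label error between two models' output distributions, KL(p || q)."""
    return np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1))

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=8)                 # local hard labels (toy data)
teacher_probs = softmax(rng.normal(size=(8, 10)))    # teacher model outputs
student_probs = softmax(rng.normal(size=(8, 10)))    # student model outputs

# Teacher total loss: hard error on local data + soft error against the student output.
loss_teacher = cross_entropy(teacher_probs, labels) + kl_divergence(student_probs, teacher_probs)
# Student total loss: hard error on local data + soft error against the teacher output.
loss_student = cross_entropy(student_probs, labels) + kl_divergence(teacher_probs, student_probs)

print(f"teacher total loss: {loss_teacher:.4f}")
print(f"student total loss: {loss_student:.4f}")
```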
As shown in fig. 4, in the present embodiment the two clients hold teacher models T1, T2 and student models S1, S2. Based on the local data loss and the knowledge distillation mutual supervision loss, the teacher model parameters and the student model parameters are calculated and updated, so that through mutual learning the teacher model and the student model learn both the knowledge of the local data and the knowledge of the data of other clients.
(1) The total loss of the client 1 teacher model T1 in the parameter iterative update is Loss_tea_1.
Client 1 calculates gradients and updates the teacher model parameters based on the loss Loss_tea_1. Because Loss_tea_1 takes into account the hard label error loss1_1 against the local data of client 1 and the soft label error loss2_1 against the output of the client 1 student model, the teacher model of client 1 can learn the local data knowledge of client 1 as well as the knowledge output by the student model of client 1.
(2) The total loss of the client 1 student model S1 in the parameter iterative update is Loss_stu_1.
Client 1 calculates gradients and updates the student model parameters based on the loss Loss_stu_1. Because Loss_stu_1 takes into account the hard label error loss3_1 against the local data of client 1 and the soft label error loss4_1 against the output of the client 1 teacher model, the student model of client 1 can learn the local data knowledge of client 1 as well as the knowledge of the client 1 teacher model.
The same applies to client 2:
(3) The total loss of the client 2 teacher model T2 in the parameter iterative update is Loss_tea_2.
Client 2 calculates gradients and updates the teacher model parameters based on the loss Loss_tea_2. Because Loss_tea_2 takes into account the hard label error loss1_2 against the local data of client 2 and the soft label error loss2_2 against the output of the client 2 student model, the teacher model of client 2 can learn the local data knowledge of client 2 as well as the knowledge output by the student model of client 2.
(4) The total loss of the client 2 student model S2 in the parameter iterative update is Loss_stu_2.
Client 2 calculates gradients and updates the student model parameters based on the loss Loss_stu_2. Because Loss_stu_2 takes into account the hard label error loss3_2 against the local data of client 2 and the soft label error loss4_2 against the output of the client 2 teacher model, the student model of client 2 can learn the local data knowledge of client 2 as well as the knowledge of the client 2 teacher model.
Preferably, step B4, specifically, is:
G = P × Q, P ∈ R^(m×k), Q ∈ R^(k×n)
wherein k is the low-dimensional intrinsic rank of the parameter gradient matrix during training, G is the parameter gradient matrix of the student model, P and Q are the low-rank gradient matrices, m is the number of rows of the parameter gradient matrix, and n is the number of columns of the parameter gradient matrix.
In this embodiment, each client student model is trained with the low-rank adaptation method: the parameters outside specific layers are frozen, only the specific layers of the large model are trained, and the parameter gradient matrix of those layers is low-rank decomposed, wherein G is the parameter gradient matrix and P and Q are its low-rank factors, so that the amount of parameter gradient data to be transmitted is reduced.
In step B3, the total loss of the student model is the objective function, W is the original parameter matrix of the pre-trained model, G is the parameter gradient matrix, and P and Q are the matrices obtained by the low-rank decomposition of G.
(1) Low-rank decomposition of the client 1 gradient parameter matrix G1:
G1 = P1 × Q1, P1 ∈ R^(m×k), Q1 ∈ R^(k×n)
(2) Low-rank decomposition of the client 2 gradient parameter matrix G2:
G2 = P2 × Q2, P2 ∈ R^(m×k), Q2 ∈ R^(k×n)
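The two decompositions above can be reproduced on a toy gradient matrix as follows; the matrix sizes and the use of a truncated SVD to obtain P and Q are assumptions chosen only to illustrate the shapes G (m × n), P (m × k), Q (k × n) and the resulting compression.

```python
# Toy low-rank decomposition of a client gradient matrix G into P (m x k) and Q (k x n).
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 512, 256, 8                      # assumed sizes with k much smaller than min(m, n)

# Build a gradient matrix that has a low-dimensional intrinsic rank by construction.
G = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))

U, s, Vt = np.linalg.svd(G, full_matrices=False)
P = U[:, :k] * s[:k]                       # P has shape (m, k)
Q = Vt[:k, :]                              # Q has shape (k, n)

relative_error = np.linalg.norm(G - P @ Q) / np.linalg.norm(G)
print("P shape:", P.shape, " Q shape:", Q.shape)
print(f"relative reconstruction error: {relative_error:.2e}")
print("values transmitted (P and Q) :", m * k + k * n)
print("values in the full gradient G:", m * n)
print(f"compression factor           : {(m * n) / (m * k + k * n):.1f}x")
```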
When a large medical model is trained in multi-party private medical data scenarios, the parameter matrix of the model is very large and the data are strongly domain-specific, so the parameter gradients produced by model updates have a low-dimensional intrinsic rank. Because the federal learning clients transmit P and Q after low-rank decomposition instead of the full gradient when uploading to the server, very large communication pressure can be relieved, giving an obvious communication efficiency advantage. During fine-tuning, a single client improves its communication efficiency by a factor of B:
B = (m × n) / (C × k × (m + n))
wherein C is the proportion of the trainable model structure of the student model to the structural parameters of the original student model;
Since the parameter gradient matrix of the large model during training has a low-dimensional intrinsic rank, k ≪ min(m, n) and therefore (m + n) × k ≪ m × n.
Therefore, when federal learning is performed on a large model, low-rank adaptive federal learning significantly reduces the communication pressure of the federal learning process.
As shown in fig. 5, which is a schematic diagram of the parameter structure of the llama 7B model under low-rank adaptive federal learning and under normal federal learning, the parameters in the bold black boxes are the parameters trained by low-rank adaptation. When the 7B llama student large model performs low-rank adaptive federal learning, the value layer in the llama self-attention mechanism layers is trained; according to the 7B llama model structure, the parameter amount reduced for each client is calculated as follows:
The number of parameters that needs to be transmitted when using full-parameter training is:
32000 × 4096 + 32 × (4096 × 4096 + 4096 × 4096 + 4096 × 4096 + 4096 × 4096 + 4096 × 11008 + 4096 × 11008 + 11008 × 4096 + 4096 + 4096) + 4096 + 4096 × 32000 = 6738415616
When the rank is taken to be 8, the number of parameters to be trained and transmitted when using low rank adaptation training is as follows:
(4096 × 8 + 8 × 4096 + 4096 × 8 + 8 × 4096) × 32 = 4194304
The communication efficiency improvement is 6738415616 / 4194304 ≈ 1606 times (with 2 clients, both the pre-optimization parameter count and the post-optimization parameter count are multiplied by 2, so the result is still about 1606 times).
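The parameter counts and the roughly 1606-fold improvement quoted above can be re-derived with the short calculation below. The llama 7B layer dimensions used (4096 hidden size, 11008 feed-forward size, 32 layers, 32000-token vocabulary) follow the figures already given in the text, and the final lines cross-check the single-client efficiency factor B discussed above; the breakdown into named projection matrices is an assumption about the llama structure used only for readability.

```python
# Re-deriving the llama 7B parameter counts and the ~1606x communication improvement.
hidden, ffn, layers, vocab, rank = 4096, 11008, 32, 32000, 8

per_layer = (4 * hidden * hidden        # four attention projection matrices
             + 3 * hidden * ffn         # three feed-forward projection matrices
             + 2 * hidden)              # two normalization weight vectors
full_params = vocab * hidden + layers * per_layer + hidden + hidden * vocab
print("full-parameter transmission:", full_params)          # 6738415616

# Low-rank factors for two hidden x hidden matrices per layer, matching the four terms above.
lora_params = layers * 2 * (hidden * rank + rank * hidden)
print("low-rank transmission      :", lora_params)          # 4194304

print(f"improvement                : {full_params / lora_params:.1f}x")   # about 1606x

# Cross-check against B = (m x n) / (C x k x (m + n)) for a single client.
trainable_fraction = layers * 2 * hidden * hidden / full_params           # C
B = (hidden * hidden) / (trainable_fraction * rank * (hidden + hidden))
print(f"efficiency factor B        : {B:.1f}")                            # about 1606
```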
An electronic device, comprising: a processor, a memory, and a communication bus;
A communication bus for implementing a connection communication between the processor and the memory;
And the processor is used for executing the verification program of the newly-added functional module stored in the memory so as to realize the steps of the large model knowledge distillation low-rank adaptive federal learning method.
A readable storage medium having stored therein computer executable instructions which, when loaded and executed by a processor, implement the steps of the large model knowledge distillation low rank adaptive federal learning method of any of the above.
In addition, each functional module in each embodiment of the present invention may be integrated in one processor, or each module may be separately used as one device, or two or more modules may be integrated in one device; the functional modules in the embodiments of the present invention may be implemented in hardware, or may be implemented in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the above method embodiments may be implemented by program instructions and associated hardware, where the program instructions may be stored in a computer readable storage medium, and where the program instructions, when executed, perform steps comprising the above method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read Only Memory (ROM), a magnetic disk or an optical disk, or the like, which can store program codes.
As used in the specification and in the claims, the terms "a," "an," "the," and/or "the" are not specific to a singular, but may include a plurality, unless the context clearly dictates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the steps and elements are explicitly identified, and they do not constitute an exclusive list, as other steps or elements may be included in a method or apparatus. The inclusion of an element defined by the phrase "comprising one … …" does not preclude the presence of additional identical elements in a process, method, article, or apparatus that comprises an element.
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature.
If a flowchart is used in the present application, the flowchart is used to describe the operations performed by a system according to an embodiment of the present application. It should be appreciated that the preceding or following operations are not necessarily performed in order precisely. Rather, the steps may be processed in reverse order or simultaneously. Also, other operations may be added to or removed from these processes.
The large model knowledge distillation low-rank adaptive federal learning method provided by the invention has been described in detail above.

Claims (8)

1. A large model knowledge distillation low-rank adaptive federal learning method is characterized by comprising the following steps:
S1, a student model and a teacher model are arranged in each client, the student model is used as a local model in federal learning, the teacher model is used as a local private model, and the student model with the same structure is deployed in a central server connected with each client;
S2, acquiring total loss of the teacher model based on the output result of the student model, the output result of the teacher model and preset local data in each client, and updating parameters of the teacher model according to the total loss of the teacher model;
S3, in each client, based on the output result of the student model, the output result of the teacher model and preset local data, obtaining total loss of the student model, calculating a parameter gradient matrix of the student model after low-rank decomposition according to the total loss of the student model and a low-rank adaptation method, obtaining a low-rank gradient matrix, and uploading the low-rank gradient matrix to the student model in a central server;
S4, in the central server, aggregating a plurality of low-rank gradient matrixes through a federal aggregation algorithm to obtain an aggregated parameter gradient matrix, and issuing the aggregated parameter gradient matrix to student models in all the clients;
S5, in each client, updating parameters of the student model according to the aggregated parameter gradient matrix;
And repeating the steps S2 to S5 until the teacher models in the clients are converged.
2. The large model knowledge distillation low rank adaptive federal learning method according to claim 1, wherein said step S2 comprises the steps of:
according to the output result of the student model, a first error and a second error are respectively acquired with respect to the preset local data and the teacher model;
And acquiring the total loss of the teacher model according to a first loss calculation formula, the first error and the second error, and updating the parameters of the teacher model according to the total loss of the teacher model.
3. The large model knowledge distillation low rank adaptive federal learning method according to claim 2, wherein the preset first loss calculation formula is:
Loss_tea_i = loss1_i + loss2_i
wherein loss1_i is the first error in the i-th client, loss2_i is the second error in the i-th client, and Loss_tea_i is the total loss of the teacher model in the i-th client.
4. The large model knowledge distillation low rank adaptive federal learning method according to claim 1, wherein the calculating to obtain the total loss of the student model based on the output result of the student model and the output result of the teacher model and the preset local data, and calculating the parameter gradient matrix after the low rank decomposition of the student model according to the total loss of the student model and the low rank adaptive method to obtain the low rank gradient matrix comprises the following steps:
According to the output result of the teacher model, respectively obtaining a third error and a fourth error with the preset local data and the student model;
acquiring total loss of the student model according to a second loss calculation formula, the third error and the fourth error;
calculating and obtaining the parameter gradient matrix G of the student model according to the total loss of the student model and a preset model original parameter matrix W;
decomposing the parameter gradient matrix G of the student model according to the low-rank adaptation method to obtain the low-rank gradient matrices P and Q.
5. The large model knowledge distillation low rank adaptive federal learning method according to claim 4, wherein the predetermined second loss calculation formula is:
Loss_stu_i = loss3_i + loss4_i
wherein loss3_i is the third error in the i-th client, loss4_i is the fourth error in the i-th client, and Loss_stu_i is the total loss of the student model in the i-th client.
6. The large model knowledge distillation low rank adaptive federal learning method according to claim 4, wherein the low-rank decomposition of the parameter gradient matrix G of the student model according to the low-rank adaptation method to obtain the low-rank gradient matrices P and Q is specifically:
G = P × Q, P ∈ R^(m×k), Q ∈ R^(k×n)
wherein k is the low-dimensional intrinsic rank of the parameter gradient matrix during training, G is the parameter gradient matrix of the student model, P and Q are the low-rank gradient matrices, m is the number of rows of the parameter gradient matrix, and n is the number of columns of the parameter gradient matrix.
7. An electronic device, comprising: a processor, a memory, and a communication bus;
The communication bus is used for realizing connection communication between the processor and the memory;
The processor is configured to execute a verification program of the newly added functional module stored in the memory, so as to implement the steps of the large model knowledge distillation low rank adaptive federal learning method according to any one of claims 1 to 6.
8. A readable storage medium having stored therein computer executable instructions which when loaded and executed by a processor implement the steps of the large model knowledge distillation low rank adaptive federal learning method of any one of claims 1 to 6.
CN202410473762.7A 2024-04-19 2024-04-19 Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium Active CN118070876B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410473762.7A CN118070876B (en) 2024-04-19 2024-04-19 Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410473762.7A CN118070876B (en) 2024-04-19 2024-04-19 Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN118070876A true CN118070876A (en) 2024-05-24
CN118070876B CN118070876B (en) 2024-07-19

Family

ID=91111536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410473762.7A Active CN118070876B (en) 2024-04-19 2024-04-19 Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN118070876B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200387782A1 (en) * 2019-06-07 2020-12-10 Tata Consultancy Services Limited Sparsity constraints and knowledge distillation based learning of sparser and compressed neural networks
US20230196067A1 (en) * 2021-12-17 2023-06-22 Lemon Inc. Optimal knowledge distillation scheme
CN117313832A (en) * 2022-06-17 2023-12-29 新奥新智科技有限公司 Combined learning model training method, device and system based on bidirectional knowledge distillation
CN116681144A (en) * 2023-06-09 2023-09-01 安徽师范大学 Federal learning model aggregation method based on dynamic self-adaptive knowledge distillation
CN117150021A (en) * 2023-09-19 2023-12-01 华东师范大学 Small sample text classification method based on semi-supervised teacher student model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵树阳 (ZHAO Shuyang) et al., "Low-rank image generation method based on generative adversarial networks", Acta Automatica Sinica (《自动化学报》), vol. 44, no. 5, 31 December 2018 (2018-12-31), pages 829-839 *

Also Published As

Publication number Publication date
CN118070876B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
EP3459017B1 (en) Progressive neural networks
Soltanolkotabi et al. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks
US11544573B2 (en) Projection neural networks
EP3619651B1 (en) System and method for batch-normalized recurrent highway networks
WO2022126683A1 (en) Method and platform for automatically compressing multi-task-oriented pre-training language model
CN108319988B (en) Acceleration method of deep neural network for handwritten Chinese character recognition
CN111695696A (en) Method and device for model training based on federal learning
CN109361404A (en) A kind of LDPC decoding system and interpretation method based on semi-supervised deep learning network
CN108009635A (en) A kind of depth convolutional calculation model for supporting incremental update
CN106951960A (en) A kind of learning method of neutral net and the neutral net
CN113190688A (en) Complex network link prediction method and system based on logical reasoning and graph convolution
CN114491039B (en) Primitive learning few-sample text classification method based on gradient improvement
CN116316591A (en) Short-term photovoltaic power prediction method and system based on hybrid bidirectional gating cycle
CN110930996A (en) Model training method, voice recognition method, device, storage medium and equipment
CN115329744A (en) Natural language processing method, system, equipment and storage medium
CN118070876B (en) Large-model knowledge distillation low-rank adaptive federal learning method, electronic equipment and readable storage medium
CN107292322B (en) Image classification method, deep learning model and computer system
CN117670586A (en) Power grid node carbon factor prediction method and system based on graph neural network
CN116187401B (en) Compression method and device for neural network, electronic equipment and storage medium
CN117095217A (en) Multi-stage comparative knowledge distillation process
CN117036901A (en) Small sample fine adjustment method based on visual self-attention model
WO2020087254A1 (en) Optimization method for convolutional neural network, and related product
CN115936108A (en) Knowledge distillation-based neural network compression method for multivariate time series prediction graph
CN114595815A (en) Transmission-friendly cloud-end cooperation training neural network model method
TWI763975B (en) System and method for reducing computational complexity of artificial neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant