CN116245141B - Transfer learning architecture, method, electronic device and storage medium - Google Patents

Transfer learning architecture, method, electronic device and storage medium Download PDF

Info

Publication number
CN116245141B
CN116245141B
Authority
CN
China
Prior art keywords
expert
task
upstream
model
layers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310074506.6A
Other languages
Chinese (zh)
Other versions
CN116245141A (en)
Inventor
徐枫
薄子豪
郭雨晨
戴琼海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310074506.6A
Publication of CN116245141A
Application granted
Publication of CN116245141B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of deep learning, and in particular to a transfer learning architecture, a transfer learning method, an electronic device and a storage medium. The transfer learning architecture comprises: one or more upstream task models, each upstream task model comprising multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer; and a downstream task model comprising expert fusion layers equal in number to the multi-head attention mechanism layers, each expert fusion layer corresponding to one multi-head attention mechanism layer, the expert networks in each expert fusion layer being obtained by migrating the corresponding multi-head attention mechanism layers of all the upstream task models. The architecture thereby addresses problems in the related art such as being able to migrate only a single upstream task model, being unable to use multiple upstream models simultaneously, and placing high demands on upstream model capability.

Description

Transfer learning architecture, method, electronic device and storage medium
Technical Field
The present application relates to the field of deep learning technologies, and in particular to a transfer learning architecture, a transfer learning method, an electronic device, and a storage medium.
Background
Deep learning has developed rapidly in recent years, with a clear trend towards larger data sets and larger models, and it is increasingly applied across industries as the technology matures. Transfer learning refers to migrating a deep model trained on an upstream task data set to a new downstream task data set for further training, so that the downstream task model improves its performance by exploiting knowledge from the upstream task.
Existing transfer learning methods generally require the upstream task to be fairly general, with a large amount of training data, strong feature representations and strong performance; for example, a large model pre-trained on ImageNet or an even larger-scale data set is typically taken as the pre-trained model used to improve downstream task performance. However, these requirements on data set and model size mean that relatively few data sets qualify as transfer learning upstream tasks, so the many small data sets, and the simple models trained on them, are not effectively utilized. A small model trained by an ordinary research institution on a small data set for a domain-specific task is difficult to reuse through transfer learning.
Mixture-of-Experts (MoE) methods, referred to herein as expert fusion, are often used to improve the parameter efficiency of large models for vision, natural language processing and cross-modal tasks. For example, the multi-task model Uni-Perceiver-MoE uses expert fusion so that different experts handle inputs with different task characteristics. However, existing expert fusion methods are not designed for the transfer learning scenario, and parts of their network structure are not suitable for migrating a model across tasks.
Disclosure of Invention
The application provides a transfer learning architecture, a transfer learning method, an electronic device and a storage medium, which address problems in the related art such as being able to migrate only a single upstream task model, being unable to use multiple upstream models simultaneously, and placing high demands on upstream model capability.
An embodiment of the first aspect of the present application provides a transfer learning architecture, including: one or more upstream task models, each upstream task model comprising multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer; and a downstream task model comprising expert fusion layers equal in number to the multi-head attention mechanism layers, each expert fusion layer corresponding to one multi-head attention mechanism layer, the expert networks in each expert fusion layer being obtained by migrating the corresponding multi-head attention mechanism layers of all the upstream task models.
Optionally, an expert fusion layer of the downstream task model is constructed according to the multi-head attention mechanism layers of all the upstream task models, wherein the number of layers of the expert fusion layer is the same as the number of the upstream task models.
Optionally, the expert fusion layer includes: a selection module for selecting, for each token of the input sequence, one or more expert networks to process that token; an expert module comprising the expert networks of the upstream task models and an expert network belonging to the downstream task, each expert network processing the tokens assigned to it by the selection module; and a fusion module for determining a weight for each expert network according to the probability values output by the selection module and performing a weighted summation of the outputs of all the expert networks according to these weights.
Optionally, the selection module includes a fully connected network and a classification network.
Optionally, the training of the downstream task model includes: acquiring a downstream task data set; fixing the parameters of the expert networks that come from the upstream task models; and training the parameters of the expert networks belonging to the downstream task, the selection modules and the task head with the downstream task data set until training is completed.
Optionally, the training of the upstream task model includes: acquiring an upstream task data set; and training all parameters of the upstream task model by using the upstream task data set until training is completed.
Optionally, during training of the upstream task model, each multi-headed attention mechanism layer randomly discards tokens of the input sequence.
An embodiment of the second aspect of the present application provides a transfer learning method, applied to the transfer learning architecture of the foregoing embodiment and including the following steps: acquiring one or more trained upstream task models, wherein each upstream task model comprises multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer; and constructing the expert fusion layers of the downstream task model from the multi-head attention mechanism layers of all the upstream task models, wherein each expert fusion layer corresponds to one multi-head attention mechanism layer, and the expert networks in each expert fusion layer are obtained by migrating the corresponding multi-head attention mechanism layers of all the upstream task models.
An embodiment of the third aspect of the present application provides an electronic device, including a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the transfer learning method of the above embodiment.
An embodiment of the fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon which, when executed by a processor, implements the transfer learning method of the above embodiment.
Therefore, the application has at least the following beneficial effects:
The application enlarges the scope of influence of the expert selection part of the expert fusion method: the conventional expert structure, based only on the feed-forward network (FFN) module of a Transformer layer, is extended to include the self-attention module, so that an entire Transformer layer serves as an expert. This allows the expert fusion method to be applied in a transfer learning framework in which multiple upstream models are treated as multiple domain experts and the downstream task dynamically selects suitable domain experts, improving downstream model performance. The dynamic selection allows domain-specific features learned from small data sets to be used selectively, which lowers the capability requirements on the upstream task models; providing multiple domain experts makes it possible to use multiple upstream task models; and the method can integrate the specific capabilities of many small models trained on domain-specific tasks and small data sets, lowering the entry threshold for transfer learning upstream models and helping to build general artificial intelligence with deep learning. The technical problems in the related art, namely that only one upstream task model can be migrated, that multiple upstream models cannot be used simultaneously and that high upstream model capability is required, are thereby solved.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a block diagram of a transfer learning architecture according to an embodiment of the present application;
Fig. 2 is a framework diagram of a transfer learning architecture according to an embodiment of the present application;
Fig. 3 is a flowchart of a transfer learning method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative; they are intended to explain the present application and should not be construed as limiting it.
A transfer learning architecture, a transfer learning method, an electronic device and a storage medium according to embodiments of the present application are described below with reference to the accompanying drawings. As noted in the background, existing transfer learning methods generally require the upstream task to be general, with a large amount of training data, strong feature representations and strong performance; the resulting requirements on data set and model size mean that few data sets qualify as transfer learning upstream tasks, and many small data sets and the simple models trained on them cannot be effectively utilized. To address this, the application provides a transfer learning architecture in which multiple upstream models are treated as multiple domain experts, so that the downstream task dynamically selects suitable domain experts to improve downstream model performance. The dynamic selection allows domain-specific features trained on small data sets to be used selectively, lowering the capability requirements on upstream task models; providing multiple domain experts makes it possible to use multiple upstream task models and to integrate the specific capabilities of many small models trained on domain-specific tasks and small data sets, lowering the entry threshold for transfer learning upstream models and helping to build general artificial intelligence with deep learning. The problems in the related art that only one upstream task model can be migrated, that multiple upstream models cannot be used simultaneously and that high upstream model capability is required are thereby solved.
Specifically, Fig. 1 is a block diagram of a transfer learning architecture according to an embodiment of the present application.
As shown in fig. 1, the transfer learning architecture 10 includes: one or more upstream task models 11 and a downstream task model 12.
Each upstream task model 11 comprises multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer. The downstream task model 12 includes expert fusion layers equal in number to the multi-head attention mechanism layers, each expert fusion layer corresponding to one multi-head attention mechanism layer, with the expert networks in each expert fusion layer obtained by migrating the corresponding multi-head attention mechanism layers of all the upstream task models.
It can be understood that the transfer learning architecture of the present application includes n upstream task models and one downstream task model, where n is a positive integer; the number of upstream task models may be set according to actual requirements and is not specifically limited here.
Specifically, each upstream task model includes multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer (a Transformer layer). That is, the upstream task models use a Transformer or Vision Transformer network structure: each model contains the same number m of Transformer layers plus a task-specific head. A Transformer layer consists of a self-attention sub-layer and a feed-forward network sub-layer.
The downstream task model may likewise be a Transformer or Vision Transformer network structure. The number of expert fusion (MoE) layers equals the number of multi-head attention mechanism layers, with a one-to-one correspondence between layers, and the expert networks in each expert fusion layer are obtained by migrating the corresponding multi-head attention mechanism layers of the upstream task models. Model migration therefore means that the m Transformer layers of the n upstream models are used to construct a downstream model with m corresponding MoE layers, where the n expert networks in each MoE layer come from the corresponding Transformer layers of the n upstream models, i.e. their parameters are copied.
It should be noted that the expert fusion (MoE) method in the related art is not suitable for a transfer learning framework, because only the feed-forward network sub-layer of each Transformer layer is split into multiple expert networks while the self-attention sub-layer is not, so the whole model cannot be migrated to the downstream task. The embodiment of the present application extends the expert network to cover the whole Transformer layer, making it suitable for the transfer learning method.
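To make this concrete, the following is a minimal PyTorch sketch (not taken from the patent; the class name TransformerLayerExpert, the dimensions and the post-norm layout are illustrative assumptions) of an expert that spans an entire Transformer layer, i.e. both the self-attention sub-layer and the feed-forward sub-layer, rather than the feed-forward sub-layer alone as in conventional MoE:

```python
import torch
import torch.nn as nn

class TransformerLayerExpert(nn.Module):
    """One expert = a full Transformer layer (self-attention + feed-forward),
    not just the feed-forward sub-layer as in conventional MoE."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_model); as an expert it may receive only a subset of tokens
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))
```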
In an embodiment of the present application, the expert fusion layer includes a selection module, an expert module and a fusion module.
The selection module is used to select, for each token of the input sequence, one or more expert networks to process that token. The expert module comprises the expert networks from the upstream task models and an expert network belonging to the downstream task, each expert network processing the tokens assigned to it by the selection module. The fusion module determines a weight for each expert network from the probability values output by the selection module and computes a weighted sum of the outputs of all expert networks according to these weights.
It can be understood that, in the embodiment of the present application, the expert fusion layer includes a selection module (gate), an expert module (experts) and a fusion module (fusion), where the selection module consists of a fully connected network and a classification network.
Specifically, the selection module (gate): for each token of the input sequence, this module selects k (k <= n+1) suitable experts to process it; it is implemented as a fully connected network followed by a softmax classification network.
Expert module (experts): comprises n+1 expert networks, namely the corresponding Transformer layer from each of the n upstream task models plus 1 new expert network belonging to the downstream task; each expert network receives only the subset of tokens that the selection module has assigned to it.
Fusion module (fusion): performs a weighted summation of the outputs of the n+1 expert networks, where the weights are the probability values output by the selection module.
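The three modules can be sketched together as one expert fusion (MoE) layer. The sketch below is an illustrative PyTorch implementation under the assumptions that the experts are whole Transformer layers as above, that the gate is a single linear layer followed by softmax, and that routing is done per token with a simple per-sample loop; it is not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFusionLayer(nn.Module):
    """One MoE layer of the downstream model: selection (gate), experts, fusion.
    `experts` holds the n Transformer-layer experts copied from the upstream
    models plus one new expert owned by the downstream task."""
    def __init__(self, experts, d_model: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        self.gate = nn.Linear(d_model, len(self.experts))  # fully connected net; softmax applied below
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        probs = F.softmax(self.gate(x), dim=-1)             # per-token probability over the n+1 experts
        _, topk_i = probs.topk(self.top_k, dim=-1)          # select k <= n+1 suitable experts per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            routed = (topk_i == e).any(dim=-1)               # (B, T) mask of tokens assigned to expert e
            for b in range(B):                               # per-sample loop keeps the sketch readable
                idx = routed[b].nonzero(as_tuple=True)[0]
                if idx.numel() == 0:
                    continue
                y = expert(x[b:b+1, idx, :])                 # the expert only sees its token subset
                w = probs[b, idx, e].unsqueeze(-1)           # fusion weight = gate probability
                out[b, idx, :] += w * y.squeeze(0)
        return out
```

A production implementation would batch the dispatch, but the explicit loop keeps the selection–expert–fusion flow visible.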
In an embodiment of the present application, the training of the downstream task model includes: acquiring a downstream task data set; fixing, in the downstream task model, the parameters of the expert networks that come from the upstream task models; and training the parameters of the expert networks belonging to the downstream task, the selection modules and the task head with the downstream task data set until training is completed.
It can be understood that, in the downstream task model training of the embodiment of the present application, the downstream task data set is used to train the downstream model; during training, the parameters of the n×m expert networks copied from the upstream task models are fixed, and only the m new expert networks, the m selection modules and the task head parameters are trained.
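A minimal sketch of this freezing step is shown below; the attribute names moe_layers and experts, and the convention that the copied upstream experts occupy the first n slots, are assumptions made for illustration rather than details given in the patent.

```python
import torch

def freeze_upstream_experts(downstream_model, n_upstream: int) -> None:
    """Freeze the n*m expert networks copied from the upstream models so that only
    the new experts, the selection (gating) modules and the task head are trained."""
    for moe_layer in downstream_model.moe_layers:        # assumed attribute holding the m MoE layers
        for expert in moe_layer.experts[:n_upstream]:     # assumed ordering: upstream experts come first
            for p in expert.parameters():
                p.requires_grad = False

# Only the remaining trainable parameters are handed to the optimiser, e.g.:
# optimizer = torch.optim.AdamW(
#     (p for p in downstream_model.parameters() if p.requires_grad), lr=1e-4)
```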
In an embodiment of the present application, training of the upstream task model includes: acquiring an upstream task data set; and training all parameters of the upstream task model by using the upstream task data set until the training is completed.
It can be appreciated that the embodiment of the present application acquires n upstream task data sets and trains the n upstream models on them, with all parameters of each model participating in the training.
During the training of the upstream task models, each multi-head attention mechanism layer randomly discards some tokens of the input sequence.
It should be noted that, in the downstream task, each migrated Transformer layer serves as only one expert network among several and therefore receives a reduced number of input tokens; this mismatch with the upstream training process, in which every layer saw the full token sequence, would degrade performance. Therefore, during upstream training each Transformer layer randomly discards some input tokens, so that the network adapts to incomplete token-sequence inputs.
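A simple way to realise this random token dropping during upstream training is sketched below; the keep ratio of 0.7 is an illustrative assumption, as the patent does not specify a value.

```python
import torch

def random_token_drop(x: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """Randomly keep a subset of the input tokens during upstream training so each
    Transformer layer learns to operate on incomplete token sequences, matching what
    it will later see as a single expert inside a downstream MoE layer."""
    B, T, D = x.shape
    n_keep = max(1, int(T * keep_ratio))
    keep = torch.rand(B, T, device=x.device).argsort(dim=1)[:, :n_keep]  # random token subset per sample
    keep, _ = keep.sort(dim=1)                                           # preserve the original token order
    return torch.gather(x, 1, keep.unsqueeze(-1).expand(-1, -1, D))
```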
In summary, as shown in Fig. 2, the transfer learning framework of the present application mainly includes the following modules:
1. Upstream task models: the n upstream task models all use a Transformer or Vision Transformer network structure, i.e. each model contains the same number m of Transformer layers plus a task-specific head. All models have been trained on their respective upstream task data sets.
2. Downstream task model: a single downstream task model, also with a Transformer or Vision Transformer network architecture. The Transformer layers of this model are replaced by MoE layers, each MoE layer comprising the following structure:
(1) Gating dynamic selection module (gate): for each token of the input sequence, this module selects k (k <= n+1) suitable experts to process it. The module is implemented as a fully connected network followed by a softmax network.
(2) Expert module (experts): comprises n+1 expert networks, namely the corresponding Transformer layer from each of the n upstream task models plus 1 new expert network belonging to the downstream task. Each expert network receives only the subset of tokens that the gating dynamic selection module has assigned to it.
(3) Fusion module (fusion): performs a weighted summation of the outputs of the n+1 expert networks, where the weights are the probability values output by the gating dynamic selection module.
The transfer learning framework of the application is mainly divided into three training phases:
1. Upstream model training: the n upstream models are trained on the n upstream task data sets, with all parameters of each model participating in the training.
2. Model migration: the m Transformer layers of the n upstream models are used to construct the downstream model with its m corresponding MoE layers. The n expert networks in each MoE layer are taken from the Transformer layer at the corresponding depth in each of the n upstream models, i.e. their parameters are copied (see the sketch after this list).
3. Downstream model training: the downstream model is trained with the downstream task data set. During training, the parameters of the n×m expert networks copied from the upstream tasks are fixed, and only the m new expert networks, the m gating dynamic selection modules and the task head parameters are trained.
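Reusing the TransformerLayerExpert and ExpertFusionLayer sketches above, the model-migration stage can be illustrated as a plain parameter copy; the assumption that each upstream model exposes its Transformer layers as a .layers attribute is made purely for illustration and is not stated in the patent.

```python
import copy
import torch.nn as nn

def build_downstream_moe_layers(upstream_models, d_model: int, top_k: int = 2) -> nn.ModuleList:
    """Stage 2 (model migration): for each of the m layer positions, copy the
    corresponding Transformer layer from every upstream model as an expert and
    add one fresh expert owned by the downstream task."""
    m = len(upstream_models[0].layers)        # assumes every upstream model has the same m layers
    moe_layers = []
    for i in range(m):
        experts = [copy.deepcopy(model.layers[i]) for model in upstream_models]  # parameter copy
        experts.append(TransformerLayerExpert(d_model))                          # new downstream expert
        moe_layers.append(ExpertFusionLayer(experts, d_model, top_k=top_k))
    return nn.ModuleList(moe_layers)
```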
It should be noted that the embodiments of the present application may be implemented on a hardware system such as a common PC or workstation.
The application of the transfer learning framework is now described with a specific embodiment; the following scenario includes 2 upstream tasks:
1. Natural image classification task (image classification): the input data is a two-dimensional colour natural image, for example an image of an animal such as a cow or a sheep, and the task model must determine which category the input image belongs to, e.g. cow or sheep. The model comprises 6 Vision Transformer layers and a classification task head, and outputs a probability value for each category.
2. Natural language masked self-supervision task (NLP): the data is a large corpus of a certain language (e.g. English or Chinese), and a natural language model is trained with a masked self-supervised training scheme. The model comprises 6 Transformer layers and a text task head; the input is the word sequence of a sentence in which some words are zeroed out by a mask, and the output is the complete word sequence of the sentence.
The downstream task is set as follows:
Image captioning task (image caption): the data consists of image–text pairs; the model takes one image as input and outputs a sentence describing its content. The model contains 6 MoE layers, each containing 3 expert networks (2 of which come from the upstream task models), and a text task head.
The three-stage training method of the above embodiment is used for the transfer learning training. The performance of the downstream model trained with this transfer learning approach will exceed that of a Transformer model trained directly on the downstream task data set, and also that of each of the 2 upstream task models fine-tuned directly on the downstream task.
The transfer learning framework provided by the embodiment of the application enlarges the scope of influence of the expert selection part of the expert fusion method: the expert structure, previously based only on the feed-forward network module of a Transformer network, is extended to include the self-attention module, so that an entire Transformer layer serves as an expert. The expert fusion method can therefore be applied in a transfer learning framework in which multiple upstream models are treated as multiple domain experts and the downstream task dynamically selects suitable domain experts to improve downstream model performance. The dynamic selection allows domain-specific features trained on small data sets to be used selectively, lowering the capability requirements on the upstream task models; providing multiple domain experts makes it possible to use multiple upstream task models; and the method can integrate the specific capabilities of many small models trained on domain-specific tasks and small data sets, lowering the entry threshold for transfer learning upstream models and helping to build general artificial intelligence with deep learning.
The transfer learning method provided by the embodiment of the application is applied to the transfer learning architecture described in the above embodiment.
Fig. 3 is a flowchart of a transfer learning method according to an embodiment of the present application.
As shown in fig. 3, the transfer learning method includes the steps of:
In step S101, one or more trained upstream task models are obtained, where each upstream task model includes multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer.
The training process of the upstream task model is already described in the above embodiments, and will not be described herein.
In step S102, the expert fusion layers of the downstream task model are constructed from the multi-head attention mechanism layers of all the upstream task models, wherein each expert fusion layer corresponds to one multi-head attention mechanism layer, and the expert networks in each expert fusion layer are obtained by migrating the corresponding multi-head attention mechanism layers of all the upstream task models.
It can be understood that the embodiment of the application uses the multi-head attention mechanism layers of the upstream task models to construct the corresponding expert fusion layers of the downstream task model, with the expert networks in each expert fusion layer coming from the multi-head attention mechanism layers of the upstream task models.
It should be noted that the foregoing explanation of the embodiment of the transfer learning architecture is also applicable to the transfer learning method of this embodiment, and will not be repeated here.
The transfer learning method provided by the embodiment of the application enlarges the scope of influence of the expert selection part of the expert fusion method: the traditional expert structure based on the feed-forward network module of a Transformer network is extended to include the self-attention module, so that an entire Transformer layer serves as an expert. The expert fusion method can therefore be applied in a transfer learning framework in which multiple upstream models are treated as multiple domain experts and the downstream task dynamically selects suitable domain experts to improve downstream model performance. The method can integrate the specific capabilities of many small models trained on domain-specific tasks and small data sets, lowering the entry threshold for transfer learning upstream models and helping to build general artificial intelligence with deep learning.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
Memory 401, processor 402, and a computer program stored on memory 401 and executable on processor 402.
The processor 402 implements the transfer learning method provided in the above embodiment when executing the program.
Further, the electronic device further includes:
A communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing a computer program executable on the processor 402.
The memory 401 may include high-speed RAM (Random Access Memory) and may also include non-volatile memory, such as at least one disk storage.
If the memory 401, the processor 402 and the communication interface 403 are implemented independently, they may be connected to each other by a bus and communicate with each other. The bus may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in Fig. 4, but this does not mean there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may perform communication with each other through internal interfaces.
The processor 402 may be a CPU (Central Processing Unit), an ASIC (Application-Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present application.
The embodiment of the application also provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the transfer learning method described above.
In the description of the present specification, a description referring to terms such as "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification, and the features of those different embodiments or examples, may be combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or N executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present application includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those skilled in the art of the embodiments of the present application.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.

Claims (7)

1. A transfer learning architecture, comprising:
one or more upstream task models, each upstream task model comprising multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer, wherein training of the upstream task model comprises: acquiring an upstream task data set; and training all parameters of the upstream task model with the upstream task data set until training is completed, wherein the upstream task data sets comprise image and text data, and the upstream task models comprise an image classification model and a natural language model;
a downstream task model, which comprises expert fusion layers equal in number to the multi-head attention mechanism layers, each expert fusion layer corresponding to one multi-head attention mechanism layer, and the expert networks in each expert fusion layer being obtained by migrating the corresponding multi-head attention mechanism layers of all upstream task models; the expert fusion layer comprises: a selection module for selecting, for each token of the input sequence, one or more expert networks to process that token; an expert module comprising the expert networks of the upstream task models and an expert network belonging to the downstream task, each expert network processing the corresponding tokens selected by the selection module; and a fusion module for determining a weight for each expert network according to the probability values output by the selection module, and performing a weighted summation of the outputs of all the expert networks according to the weight of each expert network; the training of the downstream task model comprises: acquiring a downstream task data set; and fixing, in the downstream task model, the parameters of the expert networks from the upstream task models, and training the parameters of the expert network belonging to the downstream task, the selection modules and the task head with the downstream task data set until training is completed, so that the downstream task model selects the corresponding expert networks, wherein the downstream task data set consists of image and text pairs, and the downstream task model is an image captioning model used to output text describing the image.
2. The transfer learning architecture of claim 1, wherein expert fusion layers of downstream task models are built from multi-headed attention mechanism layers of all upstream task models, wherein the number of layers of the expert fusion layers is the same as the number of upstream task models.
3. The transfer learning architecture of claim 1, wherein the selection module comprises a fully connected network and a classification network.
4. The transfer learning architecture of claim 1, wherein each multi-head attention mechanism layer randomly discards tokens of the input sequence during training of the upstream task model.
5. A method of transfer learning, wherein the method is applied to the transfer learning architecture of any one of claims 1-4, and wherein the method comprises the steps of:
acquiring one or more trained upstream task models, wherein each upstream task model comprises multi-head attention mechanism layers, each of which is extended in its entirety into an expert network layer; training of the upstream task model comprises: acquiring an upstream task data set; and training all parameters of the upstream task model with the upstream task data set until training is completed, wherein the upstream task data sets comprise image and text data, and the upstream task models comprise an image classification model and a natural language model;
constructing the expert fusion layers of the downstream task model according to the multi-head attention mechanism layers of all upstream task models, wherein each expert fusion layer corresponds to one multi-head attention mechanism layer, and the expert networks in each expert fusion layer are obtained by migrating the corresponding multi-head attention mechanism layers of all upstream task models; the expert fusion layer comprises: a selection module for selecting, for each token of the input sequence, one or more expert networks to process that token; an expert module comprising the expert networks of the upstream task models and an expert network belonging to the downstream task, each expert network processing the corresponding tokens selected by the selection module; and a fusion module for determining a weight for each expert network according to the probability values output by the selection module, and performing a weighted summation of the outputs of all the expert networks according to the weight of each expert network; the training of the downstream task model comprises: acquiring a downstream task data set; and fixing, in the downstream task model, the parameters of the expert networks from the upstream task models, and training the parameters of the expert network belonging to the downstream task, the selection modules and the task head with the downstream task data set until training is completed, so that the downstream task model selects the corresponding expert networks, wherein the downstream task data set consists of image and text pairs, and the downstream task model is an image captioning model used to output text describing the image.
6. An electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the transfer learning method of claim 5.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the transfer learning method according to claim 5.
CN202310074506.6A 2023-01-13 2023-01-13 Transfer learning architecture, method, electronic device and storage medium Active CN116245141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310074506.6A CN116245141B (en) 2023-01-13 2023-01-13 Transfer learning architecture, method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310074506.6A CN116245141B (en) 2023-01-13 2023-01-13 Transfer learning architecture, method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN116245141A CN116245141A (en) 2023-06-09
CN116245141B (en) 2024-06-04

Family

ID=86634360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310074506.6A Active CN116245141B (en) 2023-01-13 2023-01-13 Transfer learning architecture, method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN116245141B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704388A (en) * 2021-03-05 2021-11-26 腾讯科技(深圳)有限公司 Training method and device for multi-task pre-training model, electronic equipment and medium
CN114743041A (en) * 2022-03-09 2022-07-12 中国科学院自动化研究所 Construction method and device of pre-training model decimation frame
CN115438176A (en) * 2022-11-08 2022-12-06 阿里巴巴达摩院(杭州)科技有限公司 Method and equipment for generating downstream task model and executing task
CN115516461A (en) * 2020-06-30 2022-12-23 谷歌有限责任公司 Attention neural network with conditional computation
CN115510757A (en) * 2022-10-12 2022-12-23 南京航空航天大学 Design method for long-time sequence prediction based on gated convolution and time attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220108171A1 (en) * 2020-10-02 2022-04-07 Google Llc Training neural networks using transfer learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115516461A (en) * 2020-06-30 2022-12-23 谷歌有限责任公司 Attention neural network with conditional computation
CN113704388A (en) * 2021-03-05 2021-11-26 腾讯科技(深圳)有限公司 Training method and device for multi-task pre-training model, electronic equipment and medium
CN114743041A (en) * 2022-03-09 2022-07-12 中国科学院自动化研究所 Construction method and device of pre-training model decimation frame
CN115510757A (en) * 2022-10-12 2022-12-23 南京航空航天大学 Design method for long-time sequence prediction based on gated convolution and time attention mechanism
CN115438176A (en) * 2022-11-08 2022-12-06 阿里巴巴达摩院(杭州)科技有限公司 Method and equipment for generating downstream task model and executing task

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations; Chiou, Meng-Jiun et al.; IEEE Access; 2021-03-26; full text *
Short-text sentiment analysis based on self-attention and capsule networks; Xu Long; Computer and Modernization; 2020-07-15 (07); full text *
Research on cross-domain named entity recognition methods; Feng Ben; China Master's Theses Full-text Database, Information Science and Technology; 2022-03-15; full text *

Also Published As

Publication number Publication date
CN116245141A (en) 2023-06-09

Similar Documents

Publication Publication Date Title
WO2018171717A1 (en) Automated design method and system for neural network processor
CN111047563B (en) Neural network construction method applied to medical ultrasonic image
WO2019001418A1 (en) Data sharing system and data sharing method therefor
US11900243B2 (en) Spiking neural network-based data processing method, computing core circuit, and chip
JP7457125B2 (en) Translation methods, devices, electronic equipment and computer programs
US9536206B2 (en) Method and apparatus for improving resilience in customized program learning network computational environments
JP2021022367A (en) Image processing method and information processor
CN112507106B (en) Deep learning model training method and device and FAQ similarity discrimination method
Pradeep et al. Edgenet: Squeezenet like convolution neural network on embedded fpga
US20200226458A1 (en) Optimizing artificial neural network computations based on automatic determination of a batch size
Zhang et al. Weighted spiking neural P systems with rules on synapses
US20180113951A1 (en) Graph traversal using automata processor
EP3502974A1 (en) Method for realizing a neural network
CN116245141B (en) Transfer learning architecture, method, electronic device and storage medium
US10929764B2 (en) Boolean satisfiability
CN117786412A (en) Elastic training method, cluster system, product and medium for large language model
US10990525B2 (en) Caching data in artificial neural network computations
JP7488375B2 (en) Method, device and computer-readable storage medium for generating neural networks
US9336498B2 (en) Method and apparatus for improving resilience in customized program learning network computational environments
CN115186738B (en) Model training method, device and storage medium
WO2022111231A1 (en) Cnn training method, electronic device, and computer readable storage medium
CN116258871A (en) Fusion feature-based target network model acquisition method and device
US20220391761A1 (en) Machine learning device, information processing method, and recording medium
CN114267062B (en) Training method of face analysis model, electronic equipment and storage medium
CN113747480B (en) Processing method and device for 5G slice faults and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant