CN117273068B - Model initialization method based on linearly expandable learning genes - Google Patents

Model initialization method based on linearly expandable learning genes

Info

Publication number
CN117273068B
CN117273068B (Application CN202311264810.3A)
Authority
CN
China
Prior art keywords
model
learning
linearly
layer
offspring
Prior art date
Legal status
Active
Application number
CN202311264810.3A
Other languages
Chinese (zh)
Other versions
CN117273068A (en)
Inventor
耿新
夏诗禹
杨旭
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202311264810.3A
Publication of CN117273068A
Application granted
Publication of CN117273068B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 3/096: Transfer learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a model initialization method based on a linearly expandable learning gene, which comprises the following steps: first, an auxiliary Transformer that is linearly expanded from the learning gene is created and trained by a distillation method; then Transformers of different depths are initialized by linearly expanding the trained learning gene so as to adapt to different downstream tasks. The method trains a universal, linearly expandable learning gene that can initialize offspring models of different depths while jointly considering model performance and resource constraints, without pre-training each specific model. Once the learning gene has been extracted from the ancestor model, the ancestor model is no longer needed, which saves additional cost. Transformers of different depths initialized with the method of the invention perform well on downstream tasks.

Description

Model initialization method based on linearly expandable learning genes
Technical Field
The invention relates to a model initialization method based on a linearly expandable learning gene, and belongs to the technical field of machine learning and computer vision.
Background
Deep neural networks, such as the Vision Transformer, have achieved excellent performance on a variety of computer vision tasks. Parameter initialization is an important step before training a model and plays a key role in determining the quality of the final model. Today, large-scale pre-training on massive data yields huge foundation models and provides good initialization for fine-tuning on various downstream tasks. However, in the popular pre-training and fine-tuning pipeline, the parameters of the entire original model must be stored and updated separately for each downstream task, which is very expensive and time consuming given the ever-growing capacity of current vision models. More importantly, this approach cannot flexibly initialize models of different depths to meet various application requirements, such as edge and Internet-of-Things devices with limited computing resources. A question therefore arises naturally: can we initialize a model for each downstream task while taking both its performance and its resource budget into account?
Researchers have proposed a wide range of parameter initialization schemes, such as random initialization, Xavier initialization, Kaiming initialization, and self-distillation. Today, large-scale pre-training on massive data provides a good initialization for fine-tuning models on various downstream tasks. However, this approach requires reusing the entire original model every time a different downstream task is faced. Furthermore, when a model of a different depth is needed, pre-training must be carried out again, which wastes computational and memory resources. In addition, a great deal of work has studied knowledge distillation techniques. What these techniques have in common is that every time a new student model is trained, a forward pass through the teacher model is required, which inevitably incurs additional overhead.
Disclosure of Invention
Consider the evolution of organisms, in which the various biological features of offspring are initialized from genes condensed from their ancestors so as to adapt to different environments. For example, different descendants of felines have evolved different body types, hunting habits, and so forth. Invariably, however, these biological features originate from and are extended by feline genes condensed by their common ancestors over years of evolution. Mimicking the behavior of biological genes, researchers have proposed a new learning paradigm called the learning gene, which first learns condensed knowledge, called learning genes, from ancestral models and then inherits this small part to initialize offspring models. Existing work extracts some complete layers as learning genes from the gradient information of ancestor models, and then initializes offspring models by stacking randomly initialized shallow layers with the extracted learning-gene layers. However, previous work has three main limitations. First, heuristic extraction of the learning genes inevitably leads to suboptimal performance. Second, offspring models of different sizes are not considered. Third, only convolutional neural network structures were explored, which clearly limits the potential of learning genes in the Transformer era.
Aiming at the problem that existing methods cannot flexibly initialize model parameters, the invention provides a model initialization method based on a linearly expandable learning gene, which expands a shared Transformer module to construct and initialize Transformers of different depths so as to adapt to downstream tasks with different resource budgets. By analogy with the scalability of genes, we call this module a learning gene. To determine the expansion pattern, we studied the relationship between the position of a layer and the corresponding layer parameter values and found that a linear function approximates this relationship well. In light of this, we propose a new method for flexibly constructing and initializing Transformers. Specifically, to learn the learning gene, we first create an auxiliary Transformer that is linearly expanded from the learning gene and then train it by distillation. We can then initialize Transformers of different depths by linearly expanding the trained learning gene to accommodate different downstream tasks. Although the pre-training and fine-tuning approach dominates model parameter initialization, it cannot flexibly initialize models of different depths to meet various application requirements, and pre-training each particular model is very expensive and time consuming.
The invention provides the following technical scheme: a model initialization method based on a linearly expandable learning gene, comprising the following steps:
first, creating an auxiliary Transformer that is linearly expanded from the learning gene and training the Transformer by a distillation method; then initializing Transformers of different depths by linearly expanding the trained learning gene so as to adapt to different downstream tasks.
Further, the method specifically comprises the following steps:
S1, creating an auxiliary Transformer model that is linearly expanded from the learning gene according to the following formula:
θ_l = (l / L) · θ_a + θ_b, l = 1, ..., L
where L denotes the total number of layers of the auxiliary model, θ_l denotes the parameters of the l-th layer, and θ_a and θ_b denote the learning gene parameters, which will be inherited into different offspring models to handle their specific tasks;
S2, linearly expanding the parameter matrices in the multi-head self-attention module; taking W_Q^(l) as an example, its parameters are linearly expanded by the following formula:
W_Q^(l) = (l / L) · W_Q^(a) + W_Q^(b)
where W_Q^(a) and W_Q^(b) correspond to the parameter matrices of the multi-head self-attention module in θ_a and θ_b; W_K^(l) and W_V^(l) are obtained in a similar manner;
the multi-head self-attention module allows the model to jointly process information from different embedding subspaces, and its output is:
MSA(Q, K, V) = Concat(head_1, ..., head_h) W_O
where Concat(·) denotes the concatenation of the outputs of all heads; for the k-th head, the output is:
head_k = softmax(Q_k K_k^T / √d) V_k
where Q_k, K_k and V_k correspond to the queries, keys and values of the k-th head, W_O denotes the parameter matrix that projects the concatenated outputs of all heads, h denotes the number of heads in the multi-head self-attention module, d denotes the output dimension of each head, and √d is the corresponding normalization constant;
S3, linearly expanding the parameter matrices W_1, b_1, W_2 and b_2 in the multi-layer perceptron module, where D' denotes the output dimension of the first linear transformation; the parameters of W_1 are linearly expanded as follows:
W_1^(l) = (l / L) · W_1^(a) + W_1^(b)
where W_1^(a) and W_1^(b) correspond to the parameter matrices of the multi-layer perceptron module in θ_a and θ_b; b_1, W_2 and b_2 are obtained in a similar manner; the output of the multi-layer perceptron is:
MLP(x) = σ(xW_1 + b_1) W_2 + b_2
where σ(·) denotes the activation function;
S4, linearly expanding the learnable parameter vectors γ and β in the layer normalization module; the parameters of γ are linearly expanded by the following formula:
γ^(l) = (l / L) · γ^(a) + γ^(b)
where γ^(a) and γ^(b) correspond to the parameter vectors of the layer normalization module in θ_a and θ_b; β is obtained in a similar manner; the output of layer normalization is expressed as follows:
LN(x) = γ ∘ ((x − μ) / δ) + β
where μ and δ are the mean and standard deviation of the corresponding representation, and ∘ denotes element-wise multiplication;
S5, creating the auxiliary model that is linearly expanded from the learning gene according to steps S1 to S4, and then facilitating the training of the auxiliary model by a distillation method; the distillation loss is:
L_dist = τ² · KL(ψ(z_s / τ), ψ(z_t / τ))
where z_t denotes the output of the pre-trained teacher model, z_s denotes the output of the auxiliary model, τ denotes the distillation temperature hyper-parameter, ψ denotes the softmax function, and KL denotes the Kullback-Leibler divergence function; combined with the classification loss, the total training loss is defined as:
L_total = (1 − λ) · CE(ψ(z_s), y) + λ · L_dist
where y denotes the ground-truth label, CE denotes the cross-entropy loss, and λ is a hyper-parameter that adjusts the weights of the two losses; gradient descent is computed with the total loss function to update the parameters of the auxiliary model;
S6, after learning θ_a and θ_b with the auxiliary model, inheriting them into different offspring models to handle the specific tasks of the environments in which they are deployed; offspring models with different numbers of layers are initialized with the learning gene:
θ_l^(ds) = (l / L_ds) · θ_a + θ_b, l = 1, ..., L_ds
where L_ds denotes the number of layers of the offspring model and θ_l^(ds) denotes the parameters of its l-th layer; the offspring model is then trained normally.
Further, in the step S1, when the full linear expansion strategy is adopted, l ranges from 1 to L; when a partial linear expansion strategy is adopted, l takes several discrete values whose maximum does not exceed L.
Further, in the step S3, D' > D is set.
Further, in the step S4, layer normalization modules and residual connections are employed before and after the multi-head self-attention module and the multi-layer perceptron module.
Further, in the step S5, distillation loss is introduced based on a soft distillation strategy.
Further, in step S6, the learning process of the offspring model adopts different training strategies according to different downstream tasks.
The invention further provides a model initialization method for image classification tasks, which adopts the above model initialization method based on the linearly expandable learning gene: the trained learning gene parameters are linearly expanded to initialize an offspring model, the offspring model is equipped with a linear layer as a classification head to handle the image classification task, and training is performed to obtain image classification results.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The method trains a universal, linearly expandable learning gene that can initialize offspring models of different depths while jointly considering model performance and resource constraints, without pre-training each specific model. Once the learning gene has been extracted from the ancestor model, the ancestor model is no longer needed, which saves additional cost. In the invention, the linear expansion of the multi-head self-attention module not only makes the multi-head self-attention parameters of each layer different, but also helps preserve the most stable common knowledge from the ancestor model to the offspring models; the linear expansion of the multi-layer perceptron diversifies its outputs while improving parameter efficiency; and the linear expansion of the layer normalization module improves parameter efficiency and contributes to stable training of the Transformer model. Transformers of different depths initialized with the method of the invention perform well on downstream tasks.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a graph comparing the classification performance of the method of the present invention with random initialization, pre-training and fine-tuning (Pre-Fin), Mini-Init, Share-Init, He-LG, etc., when the pre-trained model and the downstream model are of the same size and the downstream model is tiny (Des-Ti) or small (Des-S), wherein Mini-Init denotes initializing the downstream model with the shared parameters obtained by training with weight transformation.
Detailed Description
The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
The invention provides a model initialization method based on a linearly expandable learning gene, whose framework is shown in FIG. 1. The invention is motivated by the expandability of genes; in the field of machine learning, we call this module a learning gene. To determine the expansion pattern, we studied the relationship between the position of a layer and the corresponding layer parameter values and found that a linear function approximates this relationship well. In light of this, we propose a new method for flexibly constructing and initializing Transformers. To learn the learning gene, we create and train an auxiliary Transformer that is linearly expanded from the learning gene. Then, to accommodate different downstream tasks, we can initialize Transformers of different depths by linearly expanding the trained learning gene. Specifically, the method of the invention comprises the following steps:
In step S1, to obtain the learning gene, we create an auxiliary Transformer model that is linearly expanded from the learning gene according to the following formula:
θ_l = (l / L) · θ_a + θ_b, l = 1, ..., L
When the full linear expansion strategy is adopted, l ranges from 1 to L; when a partial linear expansion strategy is adopted, l takes several discrete values whose maximum does not exceed L. Here L denotes the total number of layers of the auxiliary model, θ_l denotes the parameters of the l-th layer, and θ_a and θ_b denote the learning gene parameters, which will be inherited into different offspring models to handle their specific tasks.
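For illustration only, a minimal PyTorch-style sketch of how such a linear expansion of a single parameter tensor could be realized is given below; the function name and tensor shapes are hypothetical and are not part of the claimed method.

```python
import torch

def linearly_expand(theta_a: torch.Tensor, theta_b: torch.Tensor,
                    layer_idx: int, num_layers: int) -> torch.Tensor:
    """Build the parameters of layer `layer_idx` (1-based) as a linear
    function of the layer position, with the learning gene supplying the
    slope (theta_a) and intercept (theta_b)."""
    return (layer_idx / num_layers) * theta_a + theta_b

# Example: expand one weight tensor across a 12-layer auxiliary model.
theta_a = torch.randn(768, 768)   # learning-gene slope (illustrative shape)
theta_b = torch.randn(768, 768)   # learning-gene intercept
layers = [linearly_expand(theta_a, theta_b, l, 12) for l in range(1, 13)]
```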
Besides the encoder of the Transformer (which comprises a multi-head self-attention module, a multi-layer perceptron module, and a layer normalization module), the Transformer first splits the input image into patches and forms a patch sequence, which is then mapped by a linear projection into a sequence of D-dimensional patch representations. Position encodings are added to the patch representations before they are fed into the Transformer encoder. These N patch representations are then input to the encoder, where N denotes the number of patches.
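For context, the patch-embedding pipeline just described can be sketched as follows; the module name, image size and patch size are illustrative assumptions rather than values fixed by the invention.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, project each patch to a D-dimensional
    token and add a learnable position encoding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution patchifies and projects in a single step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, D)
        return x + self.pos_embed                      # add position encodings
```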
Before proposing the strategy of linearly expanding the learning gene, we first carried out the following analysis. The learning gene aims to preserve the most generalizable part of the ancestral model, which naturally motivates us to first consider eliminating redundant parameters. As one of the most representative approaches, weight sharing directly shares the parameters of all layers to maximally eliminate parameter redundancy. Although simple, this fully shared scheme noticeably harms model capability in the visual domain. To alleviate this problem, researchers have applied weight transformation strategies that impose a learnable function on the shared weights to increase the diversity of the parameters. Interestingly, if we treat the parameters of each layer as a high-dimensional tensor, we can estimate the relationship between a layer's position and the corresponding parameter values. Specifically, the weight-sharing method corresponds to a "horizontal line", i.e., every layer uses the same parameters, whereas the weight-transformation method makes the layer parameters completely different because of the inter-layer mapping functions. To obtain some empirical observations, we reduced each tensor to a one-dimensional data point using PCA, and selected a trained ViT model for analysis. We can see that most one-dimensional data points are not arranged irregularly, but form an approximately increasing trend. Among the various fitting functions, the linear function is the simplest one that approximately reflects this trend. In light of this, we propose to linearly expand the corresponding modules to build and initialize the Transformer model.
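The observation described above can be reproduced with a short script such as the one below, which flattens one weight tensor per encoder layer of a trained ViT, projects the layers onto their first principal component, and fits a line against the layer index; the use of a torchvision checkpoint and scikit-learn is an assumption made for this sketch, not a requirement of the method.

```python
import numpy as np
import torch
import torchvision
from sklearn.decomposition import PCA

# One flattened weight tensor per encoder block of a trained ViT
# (here the fused query/key/value projection, as an example).
vit = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")
mats = [blk.self_attention.in_proj_weight.detach().flatten().numpy()
        for blk in vit.encoder.layers]

points = PCA(n_components=1).fit_transform(np.stack(mats)).ravel()  # one point per layer
slope, intercept = np.polyfit(np.arange(1, len(points) + 1), points, deg=1)
print("approximately linear trend:", slope, intercept)
```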
For a Transformer model, we linearly expand the parameter tensors of the multi-head self-attention module, the multi-layer perceptron module, and the layer normalization module included in the encoder. This strategy gives each layer a linear difference, improving the diversity of the parameters while retaining parameter efficiency.
In step S2, for the linearly expanded multi-head self-attention module described in step S1, unlike a conventional multi-head self-attention module, we linearly expand its parameter matrices. Taking W_Q^(l) as an example, its parameters are linearly expanded as follows:
W_Q^(l) = (l / L) · W_Q^(a) + W_Q^(b)
where W_Q^(a) and W_Q^(b) correspond to the parameter matrices of the multi-head self-attention module in θ_a and θ_b. Similarly, we can derive the linear expansions of W_K^(l) and W_V^(l). This not only makes the multi-head self-attention parameters of each layer different, but also helps preserve the most stable common knowledge from the ancestor model to the offspring models. The multi-head self-attention module allows the model to jointly process information from different embedding subspaces; its output is:
MSA(Q, K, V) = Concat(head_1, ..., head_h) W_O
where Concat(·) denotes the concatenation of the outputs of all heads. For the k-th head, the output is:
head_k = softmax(Q_k K_k^T / √d) V_k
where Q_k, K_k and V_k correspond to the queries, keys and values of the k-th head, W_O denotes the parameter matrix that projects the concatenated outputs of all heads, h denotes the number of heads in the multi-head self-attention module, d denotes the output dimension of each head, and √d is the corresponding normalization constant. We also impose the linear expansion constraint on W_O when training the auxiliary model.
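A minimal sketch of a multi-head self-attention block whose projection matrices are produced by the linear expansion of step S2 is given below; the dictionary layout of the shared learning-gene tensors and the parameter names are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandedMSA(nn.Module):
    """Multi-head self-attention whose W_Q, W_K, W_V and W_O are linear
    functions of the layer index, parameterized by learning-gene pairs."""
    def __init__(self, dim, num_heads, layer_idx, num_layers, gene):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.scale = layer_idx / num_layers
        self.gene = gene   # shared tensors, e.g. gene[("q", "a")], gene[("q", "b")]

    def weight(self, name):
        return self.scale * self.gene[(name, "a")] + self.gene[(name, "b")]

    def forward(self, x):                              # x: (B, N, D)
        B, N, D = x.shape
        q, k, v = (x @ self.weight(n) for n in ("q", "k", "v"))
        # reshape to (B, h, N, d) so attention is computed per head
        q, k, v = (t.view(B, N, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return out @ self.weight("o")                  # output projection W_O
```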
In step S3, for the linearly expanded multi-layer perceptron module described in step S1, we linearly expand its parameter matrices W_1, b_1, W_2 and b_2, where D' denotes the output dimension of the first linear transformation. Taking W_1 as an example, its parameters are linearly expanded as follows:
W_1^(l) = (l / L) · W_1^(a) + W_1^(b)
where W_1^(a) and W_1^(b) correspond to the parameter matrices of the multi-layer perceptron module in θ_a and θ_b. Similarly, we obtain b_1, W_2 and b_2. Through linear expansion, the outputs of the multi-layer perceptron become more diverse while parameter efficiency is improved. The output of the multi-layer perceptron is:
MLP(x) = σ(xW_1 + b_1) W_2 + b_2
where σ(·) denotes the activation function. We typically set D' > D.
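Likewise, a sketch of the multi-layer perceptron of step S3 with linearly expanded weights and biases, with GELU standing in for the activation σ; the parameter naming is again hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandedMLP(nn.Module):
    """Two-layer perceptron whose W1, b1, W2 and b2 are linearly expanded."""
    def __init__(self, layer_idx, num_layers, gene):
        super().__init__()
        self.scale = layer_idx / num_layers
        self.gene = gene   # shared learning-gene tensors, e.g. gene[("w1", "a")]

    def p(self, name):
        return self.scale * self.gene[(name, "a")] + self.gene[(name, "b")]

    def forward(self, x):
        h = F.gelu(x @ self.p("w1") + self.p("b1"))   # (B, N, D) -> (B, N, D'), D' > D
        return h @ self.p("w2") + self.p("b2")        # project back to (B, N, D)
```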
In step S4, besides the multi-head self-attention module and the multi-layer perceptron module of steps S2 and S3, we also linearly expand the layer normalization module, whose learnable parameter vectors include γ and β. Taking γ as an example, its parameters are linearly expanded as follows:
γ^(l) = (l / L) · γ^(a) + γ^(b)
where γ^(a) and γ^(b) correspond to the parameter vectors of the layer normalization module in θ_a and θ_b. Similarly, we obtain β. Layer normalization and residual connections are employed before and after the multi-head self-attention and multi-layer perceptron modules, which is critical for stable training and fast convergence of the Transformer model. We express the output of layer normalization as follows:
LN(x) = γ ∘ ((x − μ) / δ) + β
where μ and δ are the mean and standard deviation of the corresponding representation, and ∘ denotes element-wise multiplication.
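A corresponding sketch of the layer normalization of step S4 with linearly expanded affine vectors γ and β; the epsilon constant is an illustrative choice.

```python
import torch
import torch.nn as nn

class ExpandedLayerNorm(nn.Module):
    """LayerNorm whose affine vectors gamma and beta are linearly expanded."""
    def __init__(self, layer_idx, num_layers, gene, eps=1e-6):
        super().__init__()
        self.scale, self.gene, self.eps = layer_idx / num_layers, gene, eps

    def forward(self, x):                              # x: (B, N, D)
        gamma = self.scale * self.gene[("gamma", "a")] + self.gene[("gamma", "b")]
        beta = self.scale * self.gene[("beta", "a")] + self.gene[("beta", "b")]
        mu = x.mean(dim=-1, keepdim=True)
        delta = x.std(dim=-1, keepdim=True)
        return gamma * (x - mu) / (delta + self.eps) + beta
```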
In step S5, to train the learning gene, namely θ_a and θ_b, we create the auxiliary model that is linearly expanded from the learning gene according to steps S1 to S4, and then facilitate its training by distillation. It should be noted that the parameters of the corresponding modules in the auxiliary model are linearly expanded from the learning gene. For simplicity we consider only the soft distillation strategy, which minimizes the Kullback-Leibler (KL) divergence between the outputs of the teacher model and the student model. Using this strategy, we introduce the distillation loss:
L_dist = τ² · KL(ψ(z_s / τ), ψ(z_t / τ))
where z_t denotes the output of the pre-trained teacher model, e.g. LeViT-384, z_s denotes the output of the auxiliary model, τ denotes the distillation temperature hyper-parameter, ψ denotes the softmax function, and KL denotes the Kullback-Leibler (KL) divergence function. Combined with the classification loss, our total training loss is defined as:
L_total = (1 − λ) · CE(ψ(z_s), y) + λ · L_dist
where y denotes the ground-truth label, CE denotes the cross-entropy loss, and λ is a hyper-parameter that adjusts the weights of the two losses. Gradient descent is computed with the total loss function to update the parameters of the auxiliary model.
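The training objective of step S5 can be written compactly as below; the (1 − λ)/λ weighting between the cross-entropy and KL terms follows the common soft-distillation recipe and is an assumption of this sketch rather than the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def total_loss(z_s, z_t, y, tau=3.0, lam=0.5):
    """Cross-entropy on the labels plus temperature-scaled KL to the teacher."""
    kl = F.kl_div(F.log_softmax(z_s / tau, dim=-1),
                  F.softmax(z_t / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    ce = F.cross_entropy(z_s, y)
    return (1 - lam) * ce + lam * kl
```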
We use ImageNet-1K to train the auxiliary model that is linearly expanded from the learning gene. During training, we apply an SGD optimizer with a cosine learning-rate schedule; the learning rate is set to 5e-4, the weight decay to 0.05, and the batch size to 128. The model is trained for 100 epochs on NVIDIA Tesla V100 GPUs.
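A sketch of this training configuration, reusing the total_loss helper above; the momentum value and the function boundaries are assumptions, and aux_model, teacher and train_loader are placeholders.

```python
import torch

def train_auxiliary(aux_model, teacher, train_loader, epochs=100):
    """Distill a frozen teacher into the linearly expanded auxiliary model."""
    opt = torch.optim.SGD(aux_model.parameters(), lr=5e-4,
                          weight_decay=0.05, momentum=0.9)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                z_t = teacher(images)                  # frozen pre-trained teacher
            loss = total_loss(aux_model(images), z_t, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
```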
In step S6, after learning θ_a and θ_b with the auxiliary model, we inherit them into different offspring models to handle the specific tasks of the environments in which they are deployed. Benefiting from the flexibility of the linear expansion strategy, we can initialize offspring models with different numbers of layers using the learning gene:
θ_l^(ds) = (l / L_ds) · θ_a + θ_b, l = 1, ..., L_ds
where L_ds denotes the number of layers of the offspring model and θ_l^(ds) denotes the parameters of its l-th layer. We linearly expand the learning gene to create the auxiliary model, so the parameters of the auxiliary model are constrained by the linear expansion during training. In contrast, when initializing an offspring model we linearly expand the trained learning gene parameters once and then train normally, so the offspring model is not constrained by the linear expansion. Clearly, the ancestor model only needs to be reused once, after which offspring models of different depths can be initialized. In terms of roles, the ancestor model corresponds to the pre-trained teacher model, and the offspring model corresponds to the downstream task model. The learning process of the offspring model adopts different training strategies according to the downstream task, for example an image classification task: the trained learning gene parameters are linearly expanded to initialize the offspring model, which is equipped with a linear layer as a classification head to handle the image classification task. During training, the model is trained with the cross-entropy loss.
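Finally, a sketch of step S6: once θ_a and θ_b are trained, an offspring model of any depth can be initialized by evaluating the same linear rule at each of its layers and then fine-tuned without the linear constraint; the attribute layout (descendant.blocks, the gene dictionary) and the classification-head snippet are hypothetical.

```python
import torch

def init_descendant(descendant, gene, num_layers_ds):
    """Write linearly expanded learning-gene tensors into the blocks of an
    offspring model; afterwards the offspring model is trained normally."""
    with torch.no_grad():
        for l, block in enumerate(descendant.blocks, start=1):
            scale = l / num_layers_ds
            params = dict(block.named_parameters())
            for name, (theta_a, theta_b) in gene.items():
                params[name].copy_(scale * theta_a + theta_b)
    return descendant

# For image classification, a linear head is attached and the model is
# fine-tuned with cross-entropy, e.g.
# head = torch.nn.Linear(embed_dim, num_classes)
```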
The present method provides competitive results when the downstream model has the same size as the pre-trained model. We compare the present method with random initialization and with pre-training and fine-tuning (Pre-Fin), whose performance is regarded as the upper limit reachable by an initialization method. Furthermore, we design Mini-Init and Share-Init using the most advanced compression methods in this setting. We also compare with the previous learning-gene work, denoted He-LG.
As shown in FIG. 2, the present method achieves a significant performance improvement over random initialization on all downstream datasets, which verifies the effectiveness of initializing with a learning gene. Specifically, on Mini-ImageNet, Tiny-ImageNet and CIFAR-100 with Des-S, the method is 25.19%, 10.08% and 10.32% higher than random initialization, respectively. Compared with Pre-Fin, the method achieves competitive performance on all datasets, which shows that the common knowledge, i.e., the learning gene, can be well learned and extracted. Note that the trained learning gene can initialize offspring models of different depths, whereas Pre-Fin needs to pre-train multiple times when initializing models of different depths. Furthermore, compared with Pre-Fin, the method only needs to transfer about one third of the parameters to initialize the offspring model, which is more efficient. Compared with Mini-Init and Share-Init, the method achieves better performance on all datasets, which demonstrates the superiority of the learning gene; for example, the method is 7.53%, 6.76% and 7.66% higher on Mini-ImageNet, Tiny-ImageNet and CIFAR-100, respectively, while only transferring about one third of the parameters to initialize the offspring model. Compared with He-LG, the method achieves improved performance on all datasets, which shows the advantage of initializing the offspring model with the present method; for example, the method is 21.05%, 8.95% and 5.51% higher than He-LG on Mini-ImageNet, Tiny-ImageNet and CIFAR-100, respectively.
The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiment, but also include technical schemes formed by any combination of the above technical features. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention are also regarded as falling within the scope of the present invention.

Claims (6)

1. A model initialization method based on a linearly expandable learning gene, characterized by comprising the following steps:
first, creating an auxiliary Transformer that is linearly expanded from the learning gene and training the Transformer by a distillation method; then initializing Transformers of different depths by linearly expanding the trained learning gene so as to adapt to different downstream tasks; the method specifically comprises the following steps:
S1, creating an auxiliary Transformer model that is linearly expanded from the learning gene according to the following formula:
θ_l = (l / L) · θ_a + θ_b, l = 1, ..., L
where L denotes the total number of layers of the auxiliary model, θ_l denotes the parameters of the l-th layer, and θ_a and θ_b denote the learning gene parameters, which will be inherited into different offspring models to handle their specific tasks;
S2, linearly expanding the parameter matrices in the multi-head self-attention module; taking W_Q^(l) as an example, its parameters are linearly expanded by the following formula:
W_Q^(l) = (l / L) · W_Q^(a) + W_Q^(b)
where W_Q^(a) and W_Q^(b) correspond to the parameter matrices of the multi-head self-attention module in θ_a and θ_b; W_K^(l) and W_V^(l) are obtained in a similar manner; the multi-head self-attention module allows the model to jointly process information from different embedding subspaces, and its output is:
MSA(Q, K, V) = Concat(head_1, ..., head_h) W_O
where Concat(·) denotes the concatenation of the outputs of all heads; for the k-th head, the output is:
head_k = softmax(Q_k K_k^T / √d) V_k
where Q_k, K_k and V_k correspond to the queries, keys and values of the k-th head, W_O denotes the parameter matrix that projects the concatenated outputs of all heads, h denotes the number of heads in the multi-head self-attention module, d denotes the output dimension of each head, and √d is the corresponding normalization constant;
S3, linearly expanding the parameter matrices W_1, b_1, W_2 and b_2 in the multi-layer perceptron module, where D' denotes the output dimension of the first linear transformation; the parameters of W_1 are linearly expanded as follows:
W_1^(l) = (l / L) · W_1^(a) + W_1^(b)
where W_1^(a) and W_1^(b) correspond to the parameter matrices of the multi-layer perceptron module in θ_a and θ_b; b_1, W_2 and b_2 are obtained in a similar manner; the output of the multi-layer perceptron is:
MLP(x) = σ(xW_1 + b_1) W_2 + b_2
where σ(·) denotes the activation function;
S4, linearly expanding the learnable parameter vectors γ and β in the layer normalization module; the parameters of γ are linearly expanded by the following formula:
γ^(l) = (l / L) · γ^(a) + γ^(b)
where γ^(a) and γ^(b) correspond to the parameter vectors of the layer normalization module in θ_a and θ_b; β is obtained in a similar manner; the output of layer normalization is expressed as follows:
LN(x) = γ ∘ ((x − μ) / δ) + β
where μ and δ are the mean and standard deviation of the corresponding representation, and ∘ denotes element-wise multiplication;
S5, creating the auxiliary model that is linearly expanded from the learning gene according to steps S1 to S4, and then facilitating the training of the auxiliary model by a distillation method; the distillation loss is:
L_dist = τ² · KL(ψ(z_s / τ), ψ(z_t / τ))
where z_t denotes the output of the pre-trained teacher model, z_s denotes the output of the auxiliary model, τ denotes the distillation temperature hyper-parameter, ψ denotes the softmax function, and KL denotes the Kullback-Leibler divergence function; combined with the classification loss, the total training loss is defined as:
L_total = (1 − λ) · CE(ψ(z_s), y) + λ · L_dist
where y denotes the ground-truth label, CE denotes the cross-entropy loss, and λ is a hyper-parameter that adjusts the weights of the two losses; gradient descent is computed with the total loss function to update the parameters of the auxiliary model;
S6, after learning θ_a and θ_b with the auxiliary model, inheriting them into different offspring models to handle the specific tasks of the environments in which they are deployed; offspring models with different numbers of layers are initialized with the learning gene:
θ_l^(ds) = (l / L_ds) · θ_a + θ_b, l = 1, ..., L_ds
where L_ds denotes the number of layers of the offspring model and θ_l^(ds) denotes the parameters of its l-th layer, and the offspring model is then trained normally;
and the trained learning gene parameters are linearly expanded to initialize an offspring model, the offspring model is equipped with a linear layer as a classification head to handle an image classification task, and training is performed to obtain an image classification result.
2. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein in said step S1, when the full linear expansion strategy is adopted, l ranges from 1 to L; when a partial linear expansion strategy is adopted, l takes several discrete values whose maximum does not exceed L.
3. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein D' > D is set in the step S3.
4. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein in the step S4, a layer normalization module and a residual connection are used before and after the multi-head self-attention module and the multi-layer perceptron module.
5. The method for initializing a model based on a linearly expandable learning gene according to claim 1, characterized in that in said step S5, distillation loss is introduced based on a soft distillation strategy.
6. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein in step S6, the learning process of the offspring model adopts different training strategies according to the downstream tasks.
CN202311264810.3A 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes Active CN117273068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311264810.3A CN117273068B (en) 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311264810.3A CN117273068B (en) 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes

Publications (2)

Publication Number Publication Date
CN117273068A CN117273068A (en) 2023-12-22
CN117273068B true CN117273068B (en) 2024-04-16

Family

ID=89211911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311264810.3A Active CN117273068B (en) 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes

Country Status (1)

Country Link
CN (1) CN117273068B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN113537365A (en) * 2021-07-20 2021-10-22 北京航空航天大学 Multitask learning self-adaptive balancing method based on information entropy dynamic weighting
JP2022133872A (en) * 2021-03-02 2022-09-14 株式会社Jvcケンウッド Machine learning device, inference device, machine learning method, and machine learning program
CN116628510A (en) * 2023-07-25 2023-08-22 自然语义(青岛)科技有限公司 Self-training iterative artificial intelligent model training method
CN116644316A (en) * 2023-05-31 2023-08-25 杭州电子科技大学 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11782401B2 (en) * 2019-08-02 2023-10-10 Aspentech Corporation Apparatus and methods to build deep learning controller using non-invasive closed loop exploration

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
JP2022133872A (en) * 2021-03-02 2022-09-14 株式会社Jvcケンウッド Machine learning device, inference device, machine learning method, and machine learning program
CN113537365A (en) * 2021-07-20 2021-10-22 北京航空航天大学 Multitask learning self-adaptive balancing method based on information entropy dynamic weighting
CN116644316A (en) * 2023-05-31 2023-08-25 杭州电子科技大学 Multi-mode multi-task learning oriented lightweight adaptive network learning method
CN116628510A (en) * 2023-07-25 2023-08-22 自然语义(青岛)科技有限公司 Self-training iterative artificial intelligent model training method

Also Published As

Publication number Publication date
CN117273068A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN110321926B (en) Migration method and system based on depth residual error correction network
US10037457B2 (en) Methods and systems for verifying face images based on canonical images
Deco et al. An information-theoretic approach to neural computing
CN109389207A (en) A kind of adaptive neural network learning method and nerve network system
WO2022006919A1 (en) Activation fixed-point fitting-based method and system for post-training quantization of convolutional neural network
Anselmi et al. Deep convolutional networks are hierarchical kernel machines
CN107622303A (en) For the method for neutral net and the equipment of execution this method
CN111797911B (en) Multi-label classification method for image data
CN113434699A (en) Pre-training method of BERT model, computer device and storage medium
CN111753995A (en) Local interpretable method based on gradient lifting tree
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
CN109063725B (en) Multi-view clustering-oriented multi-graph regularization depth matrix decomposition method
CN105260736A (en) Fast image feature representing method based on normalized nonnegative sparse encoder
CN113408610B (en) Image identification method based on adaptive matrix iteration extreme learning machine
CN117273068B (en) Model initialization method based on linearly expandable learning genes
CN106779062A (en) A kind of multi-layer perception (MLP) artificial neural network based on residual error network
CN109034387A (en) A kind of approximation method for quickly training self-encoding encoder based on pseudo- reversal learning
CN111461229A (en) Deep neural network optimization and image classification method based on target transfer and line search
CN117409456A (en) Non-aligned multi-view multi-mark learning method based on graph matching mechanism
CN115188055A (en) Lightweight expression identification method for NNIE neural network accelerator
CN115081516A (en) Internet of things flow prediction method based on biological connection group time-varying convolution network
CN113935473A (en) Deep learning neural network optimization method and application method
Imai et al. Stepwise pathnet: Transfer learning algorithm to improve network structure versatility
CN112580783B (en) Cross-dimension knowledge migration method for migrating knowledge from high-dimension deep learning model to low dimension
CN115131599B (en) Image classification method based on deviation resistance and robustness knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant