CN117273068B - Model initialization method based on linearly expandable learning genes - Google Patents

Model initialization method based on linearly expandable learning genes

Info

Publication number
CN117273068B
CN117273068B (Application CN202311264810.3A)
Authority
CN
China
Prior art keywords
model
learning
linearly
layer
offspring
Prior art date
Legal status
Active
Application number
CN202311264810.3A
Other languages
Chinese (zh)
Other versions
CN117273068A (en)
Inventor
耿新
夏诗禹
杨旭
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Priority to CN202311264810.3A
Publication of CN117273068A
Application granted
Publication of CN117273068B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 3/096: Transfer learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a model initialization method based on a linearly expandable learning gene, which comprises the following steps: first, an auxiliary Transformer that is linearly expanded from the learning gene is created and trained by a distillation method; then Transformers of different depths are initialized by linearly expanding the trained learning gene so as to adapt to different downstream tasks. The method trains a universal, linearly expandable learning gene that can initialize offspring models of different depths while jointly considering model performance and resource constraints, without pre-training each specific model. Once the learning gene has been extracted from the ancestor model, the ancestor model is no longer needed, which saves additional cost. Transformers of different depths initialized with the method of the invention perform well on downstream tasks.

Description

Model initialization method based on linearly expandable learning genes
Technical Field
The invention relates to a model initialization method based on a linearly expandable learning gene, and belongs to the technical field of machine learning and computer vision.
Background
Deep neural networks, such as the Vision Transformer, have achieved excellent performance on a variety of computer vision tasks. Parameter initialization is an important step before training a model and plays a key role in determining the quality of the final model. Today, large-scale pre-training on massive data yields huge foundation models and provides good initialization for fine-tuning on various downstream tasks. However, in the popular pre-training and fine-tuning pipeline, the parameters of the entire original model must be stored and updated separately for each downstream task, which is very expensive and time consuming given the ever-growing capacity of current vision models. More importantly, this approach cannot flexibly initialize models of different depths to meet various application requirements, such as edge and Internet-of-Things devices with limited computing resources. A question therefore arises naturally: can we initialize a model for each downstream task while taking both its performance and its resource budget into account?
Researchers have proposed a wide range of parameter initialization schemes, such as random initialization, Xavier initialization, Kaiming initialization, and self-distillation. Today, large-scale pre-training on massive data provides a good initialization for fine-tuning models on various downstream tasks. However, this approach requires reusing the entire original model every time a different downstream task is faced. Furthermore, when a model of a different depth is needed, pre-training must be carried out again, which wastes computational and memory resources. In addition, a great deal of work has studied knowledge distillation techniques. What these techniques have in common is that every time a new student model is trained, a forward pass through the teacher model is required, which inevitably incurs additional overhead.
Disclosure of Invention
Consider the evolution of organisms, in which the various biological features of offspring are initialized from genes condensed from their ancestors so as to adapt to different environments. For example, different descendants of felines have evolved different body types, hunting habits, and so forth. Invariably, however, these biological features originate from and are extended by feline genes condensed by their common ancestors over years of evolution. Mimicking the behavior of biological genes, researchers have proposed a new learning paradigm called the learning gene, which first learns condensed knowledge, called learning genes, from ancestral models and then inherits this small part to initialize offspring models. Existing work extracts some complete layers as learning genes from the gradient information of ancestor models, and then initializes offspring models by stacking randomly initialized shallow layers with the extracted learning-gene layers. However, previous work has three main limitations. First, heuristic extraction of the learning genes inevitably leads to suboptimal performance. Second, offspring models of different sizes are not considered. Third, only convolutional neural network structures were explored, which clearly limits the potential of learning genes in the Transformer era.
Aiming at the problem that existing methods cannot flexibly initialize model parameters, the invention provides a model initialization method based on a linearly expandable learning gene, which expands a shared Transformer module to construct and initialize Transformers of different depths so as to adapt to downstream tasks with different resource budgets. By analogy with the scalability of genes, we call this module a learning gene. To determine the expansion pattern, we studied the relationship between the position of a layer and the corresponding layer parameter values and found that a linear function approximates this relationship well. In light of this, we propose a new method for flexibly constructing and initializing Transformers. Specifically, to learn the learning gene, we first create an auxiliary Transformer that is linearly expanded from the learning gene and then train it by distillation. We can then initialize Transformers of different depths by linearly expanding the trained learning gene to accommodate different downstream tasks. Although the pre-training and fine-tuning approach dominates model parameter initialization, it cannot flexibly initialize models of different depths to meet various application requirements, and pre-training each particular model is very expensive and time consuming.
The invention provides the following technical scheme: a model initialization method based on a linearly expandable learning gene, comprising the following steps:
first, creating an auxiliary Transformer that is linearly expanded from the learning gene and training the Transformer by a distillation method; then initializing Transformers of different depths by linearly expanding the trained learning gene so as to adapt to different downstream tasks.
Further, the method specifically comprises the following steps:
S1, creating an auxiliary Transformer model that is linearly expanded from the learning gene according to the following formula:
θ_l = (l / L) · θ_a + θ_b, l = 1, ..., L
where L denotes the total number of layers of the auxiliary model, θ_l denotes the parameters of the l-th layer, and θ_a and θ_b denote the learning gene parameters, which will be inherited into different offspring models to handle their specific tasks;
S2, linearly expanding the parameter matrices in the multi-head self-attention module; taking W_Q^(l) as an example, its parameters are linearly expanded by the following formula:
W_Q^(l) = (l / L) · W_Q^(a) + W_Q^(b)
where W_Q^(a) and W_Q^(b) correspond to the parameter matrices of the multi-head self-attention module in θ_a and θ_b; W_K^(l) and W_V^(l) are obtained in a similar manner;
the multi-head self-attention module allows the model to jointly process information from different embedding subspaces, and its output is:
MSA(Q, K, V) = Concat(head_1, ..., head_h) W_O
where Concat(·) denotes the concatenation of the outputs of all heads; for the k-th head, the output is:
head_k = softmax(Q_k K_k^T / √d) V_k
where Q_k, K_k and V_k correspond to the queries, keys and values of the k-th head, W_O denotes the parameter matrix that projects the concatenated outputs of all heads, h denotes the number of heads in the multi-head self-attention module, d denotes the output dimension of each head, and √d is the corresponding normalization constant;
S3, linearly expanding the parameter matrices W_1, b_1, W_2 and b_2 in the multi-layer perceptron module, where D' denotes the output dimension of the first linear transformation; the parameters of W_1 are linearly expanded as follows:
W_1^(l) = (l / L) · W_1^(a) + W_1^(b)
where W_1^(a) and W_1^(b) correspond to the parameter matrices of the multi-layer perceptron module in θ_a and θ_b; b_1, W_2 and b_2 are obtained in a similar manner; the output of the multi-layer perceptron is:
MLP(x) = σ(xW_1 + b_1) W_2 + b_2
where σ(·) denotes the activation function;
S4, linearly expanding the learnable parameter vectors γ and β in the layer normalization module; the parameters of γ are linearly expanded by the following formula:
γ^(l) = (l / L) · γ^(a) + γ^(b)
where γ^(a) and γ^(b) correspond to the parameter vectors of the layer normalization module in θ_a and θ_b; β is obtained in a similar manner; the output of layer normalization is expressed as follows:
LN(x) = γ ∘ ((x − μ) / δ) + β
where μ and δ are the mean and standard deviation of the corresponding representation, and ∘ denotes element-wise multiplication;
S5, creating the auxiliary model that is linearly expanded from the learning gene according to steps S1 to S4, and then facilitating the training of the auxiliary model by a distillation method; the distillation loss is:
L_dist = τ² · KL(ψ(z_s / τ), ψ(z_t / τ))
where z_t denotes the output of the pre-trained teacher model, z_s denotes the output of the auxiliary model, τ denotes the distillation temperature hyper-parameter, ψ denotes the softmax function, and KL denotes the Kullback-Leibler divergence function; combined with the classification loss, the total training loss is defined as:
L_total = (1 − λ) · CE(ψ(z_s), y) + λ · L_dist
where y denotes the ground-truth label, CE denotes the cross-entropy loss, and λ is a hyper-parameter that adjusts the weights of the two losses; gradient descent is computed with the total loss function to update the parameters of the auxiliary model;
S6, after learning θ_a and θ_b with the auxiliary model, inheriting them into different offspring models to handle the specific tasks of the environments in which they are deployed; offspring models with different numbers of layers are initialized with the learning gene:
θ_l^(ds) = (l / L_ds) · θ_a + θ_b, l = 1, ..., L_ds
where L_ds denotes the number of layers of the offspring model and θ_l^(ds) denotes the parameters of its l-th layer; the offspring model is then trained normally.
Further, in the step S1, when the full linear expansion strategy is adopted, l ranges from 1 to L; when a partial linear expansion strategy is adopted, l takes several discrete values whose maximum does not exceed L.
Further, in the step S3, D' > D is set.
Further, in the step S4, layer normalization modules and residual connections are employed before and after the multi-head self-attention module and the multi-layer perceptron module.
Further, in the step S5, distillation loss is introduced based on a soft distillation strategy.
Further, in step S6, the learning process of the offspring model adopts different training strategies according to different downstream tasks.
The invention further provides a model initialization method for image classification tasks, which adopts the above model initialization method based on the linearly expandable learning gene: the trained learning gene parameters are linearly expanded to initialize an offspring model, the offspring model is equipped with a linear layer as a classification head to handle the image classification task, and training is performed to obtain image classification results.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The method trains a universal, linearly expandable learning gene that can initialize offspring models of different depths while jointly considering model performance and resource constraints, without pre-training each specific model. Once the learning gene has been extracted from the ancestor model, the ancestor model is no longer needed, which saves additional cost. In the invention, the linear expansion of the multi-head self-attention module not only makes the multi-head self-attention parameters of each layer different, but also helps preserve the most stable common knowledge from the ancestor model to the offspring models; the linear expansion of the multi-layer perceptron diversifies its outputs while improving parameter efficiency; and the linear expansion of the layer normalization module improves parameter efficiency and contributes to stable training of the Transformer model. Transformers of different depths initialized with the method of the invention perform well on downstream tasks.
Drawings
FIG. 1 is a block diagram of the method of the present invention;
FIG. 2 is a graph comparing the classification performance of the method of the present invention with random initialization, pre-training and fine-tuning (Pre-Fin), Mini-Init, Share-Init, He-LG, etc., when the pre-trained model and the downstream model are of the same size and the downstream model is tiny (Des-Ti) or small (Des-S), wherein Mini-Init denotes initializing the downstream model with the shared parameters obtained by training with weight transformation.
Detailed Description
The technical scheme provided by the present invention will be described in detail with reference to the following specific examples, and it should be understood that the following specific examples are only for illustrating the present invention and are not intended to limit the scope of the present invention.
The invention provides a model initialization method based on a linearly expandable learning gene, whose framework is shown in FIG. 1. The invention is motivated by the expandability of genes; in the field of machine learning, we call this module a learning gene. To determine the expansion pattern, we studied the relationship between the position of a layer and the corresponding layer parameter values and found that a linear function approximates this relationship well. In light of this, we propose a new method for flexibly constructing and initializing Transformers. To learn the learning gene, we create and train an auxiliary Transformer that is linearly expanded from the learning gene. Then, to accommodate different downstream tasks, we can initialize Transformers of different depths by linearly expanding the trained learning gene. Specifically, the method of the invention comprises the following steps:
In step S1, to obtain the learning gene, we create an auxiliary Transformer model that is linearly expanded from the learning gene according to the following formula:
θ_l = (l / L) · θ_a + θ_b, l = 1, ..., L
When the full linear expansion strategy is adopted, l ranges from 1 to L; when a partial linear expansion strategy is adopted, l takes several discrete values whose maximum does not exceed L. Here L denotes the total number of layers of the auxiliary model, θ_l denotes the parameters of the l-th layer, and θ_a and θ_b denote the learning gene parameters, which will be inherited into different offspring models to handle their specific tasks.
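For illustration only, a minimal PyTorch-style sketch of how such a linear expansion of a single parameter tensor could be realized is given below; the function name and tensor shapes are hypothetical and are not part of the claimed method.

```python
import torch

def linearly_expand(theta_a: torch.Tensor, theta_b: torch.Tensor,
                    layer_idx: int, num_layers: int) -> torch.Tensor:
    """Build the parameters of layer `layer_idx` (1-based) as a linear
    function of the layer position, with the learning gene supplying the
    slope (theta_a) and intercept (theta_b)."""
    return (layer_idx / num_layers) * theta_a + theta_b

# Example: expand one weight tensor across a 12-layer auxiliary model.
theta_a = torch.randn(768, 768)   # learning-gene slope (illustrative shape)
theta_b = torch.randn(768, 768)   # learning-gene intercept
layers = [linearly_expand(theta_a, theta_b, l, 12) for l in range(1, 13)]
```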
Besides the encoder of the Transformer (which comprises a multi-head self-attention module, a multi-layer perceptron module, and a layer normalization module), the Transformer first splits the input image into patches and forms a patch sequence, which is then mapped by a linear projection into a sequence of D-dimensional patch representations. Position encodings are added to the patch representations before they are fed into the Transformer encoder. These N patch representations are then input to the encoder, where N denotes the number of patches.
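For context, the patch-embedding pipeline just described can be sketched as follows; the module name, image size and patch size are illustrative assumptions rather than values fixed by the invention.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, project each patch to a D-dimensional
    token and add a learnable position encoding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution patchifies and projects in a single step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                              # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)    # (B, N, D)
        return x + self.pos_embed                      # add position encodings
```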
Before proposing the strategy of linearly expanding the learning gene, we first carried out the following analysis. The learning gene aims to preserve the most generalizable part of the ancestral model, which naturally motivates us to first consider eliminating redundant parameters. As one of the most representative approaches, weight sharing directly shares the parameters of all layers to maximally eliminate parameter redundancy. Although simple, this fully shared scheme noticeably harms model capability in the visual domain. To alleviate this problem, researchers have applied weight transformation strategies that impose a learnable function on the shared weights to increase the diversity of the parameters. Interestingly, if we treat the parameters of each layer as a high-dimensional tensor, we can estimate the relationship between a layer's position and the corresponding parameter values. Specifically, the weight-sharing method corresponds to a "horizontal line", i.e., every layer uses the same parameters, whereas the weight-transformation method makes the layer parameters completely different because of the inter-layer mapping functions. To obtain some empirical observations, we reduced each tensor to a one-dimensional data point using PCA, and selected a trained ViT model for analysis. We can see that most one-dimensional data points are not arranged irregularly, but form an approximately increasing trend. Among the various fitting functions, the linear function is the simplest one that approximately reflects this trend. In light of this, we propose to linearly expand the corresponding modules to build and initialize the Transformer model.
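The observation described above can be reproduced with a short script such as the one below, which flattens one weight tensor per encoder layer of a trained ViT, projects the layers onto their first principal component, and fits a line against the layer index; the use of a torchvision checkpoint and scikit-learn is an assumption made for this sketch, not a requirement of the method.

```python
import numpy as np
import torch
import torchvision
from sklearn.decomposition import PCA

# One flattened weight tensor per encoder block of a trained ViT
# (here the fused query/key/value projection, as an example).
vit = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")
mats = [blk.self_attention.in_proj_weight.detach().flatten().numpy()
        for blk in vit.encoder.layers]

points = PCA(n_components=1).fit_transform(np.stack(mats)).ravel()  # one point per layer
slope, intercept = np.polyfit(np.arange(1, len(points) + 1), points, deg=1)
print("approximately linear trend:", slope, intercept)
```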
For a Transformer model, we linearly expand the parameter tensors of the multi-head self-attention module, the multi-layer perceptron module, and the layer normalization module included in the encoder. This strategy gives each layer a linear difference, improving the diversity of the parameters while retaining parameter efficiency.
In step S2, for the linearly expanded multi-head self-attention module described in step S1, unlike a conventional multi-head self-attention module, we linearly expand its parameter matrices. Taking W_Q^(l) as an example, its parameters are linearly expanded as follows:
W_Q^(l) = (l / L) · W_Q^(a) + W_Q^(b)
where W_Q^(a) and W_Q^(b) correspond to the parameter matrices of the multi-head self-attention module in θ_a and θ_b. Similarly, we can derive the linear expansions of W_K^(l) and W_V^(l). This not only makes the multi-head self-attention parameters of each layer different, but also helps preserve the most stable common knowledge from the ancestor model to the offspring models. The multi-head self-attention module allows the model to jointly process information from different embedding subspaces; its output is:
MSA(Q, K, V) = Concat(head_1, ..., head_h) W_O
where Concat(·) denotes the concatenation of the outputs of all heads. For the k-th head, the output is:
head_k = softmax(Q_k K_k^T / √d) V_k
where Q_k, K_k and V_k correspond to the queries, keys and values of the k-th head, W_O denotes the parameter matrix that projects the concatenated outputs of all heads, h denotes the number of heads in the multi-head self-attention module, d denotes the output dimension of each head, and √d is the corresponding normalization constant. We also impose the linear expansion constraint on W_O when training the auxiliary model.
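A minimal sketch of a multi-head self-attention block whose projection matrices are produced by the linear expansion of step S2 is given below; the dictionary layout of the shared learning-gene tensors and the parameter names are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandedMSA(nn.Module):
    """Multi-head self-attention whose W_Q, W_K, W_V and W_O are linear
    functions of the layer index, parameterized by learning-gene pairs."""
    def __init__(self, dim, num_heads, layer_idx, num_layers, gene):
        super().__init__()
        self.h, self.d = num_heads, dim // num_heads
        self.scale = layer_idx / num_layers
        self.gene = gene   # shared tensors, e.g. gene[("q", "a")], gene[("q", "b")]

    def weight(self, name):
        return self.scale * self.gene[(name, "a")] + self.gene[(name, "b")]

    def forward(self, x):                              # x: (B, N, D)
        B, N, D = x.shape
        q, k, v = (x @ self.weight(n) for n in ("q", "k", "v"))
        # reshape to (B, h, N, d) so attention is computed per head
        q, k, v = (t.view(B, N, self.h, self.d).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return out @ self.weight("o")                  # output projection W_O
```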
In step S3, for the linearly expanded multi-layer perceptron module described in step S1, we linearly expand its parameter matrices W_1, b_1, W_2 and b_2, where D' denotes the output dimension of the first linear transformation. Taking W_1 as an example, its parameters are linearly expanded as follows:
W_1^(l) = (l / L) · W_1^(a) + W_1^(b)
where W_1^(a) and W_1^(b) correspond to the parameter matrices of the multi-layer perceptron module in θ_a and θ_b. Similarly, we obtain b_1, W_2 and b_2. Through linear expansion, the outputs of the multi-layer perceptron become more diverse while parameter efficiency is improved. The output of the multi-layer perceptron is:
MLP(x) = σ(xW_1 + b_1) W_2 + b_2
where σ(·) denotes the activation function. We typically set D' > D.
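Likewise, a sketch of the multi-layer perceptron of step S3 with linearly expanded weights and biases, with GELU standing in for the activation σ; the parameter naming is again hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandedMLP(nn.Module):
    """Two-layer perceptron whose W1, b1, W2 and b2 are linearly expanded."""
    def __init__(self, layer_idx, num_layers, gene):
        super().__init__()
        self.scale = layer_idx / num_layers
        self.gene = gene   # shared learning-gene tensors, e.g. gene[("w1", "a")]

    def p(self, name):
        return self.scale * self.gene[(name, "a")] + self.gene[(name, "b")]

    def forward(self, x):
        h = F.gelu(x @ self.p("w1") + self.p("b1"))   # (B, N, D) -> (B, N, D'), D' > D
        return h @ self.p("w2") + self.p("b2")        # project back to (B, N, D)
```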
In step S4, besides the multi-head self-attention module and the multi-layer perceptron module of steps S2 and S3, we also linearly expand the layer normalization module, whose learnable parameter vectors include γ and β. Taking γ as an example, its parameters are linearly expanded as follows:
γ^(l) = (l / L) · γ^(a) + γ^(b)
where γ^(a) and γ^(b) correspond to the parameter vectors of the layer normalization module in θ_a and θ_b. Similarly, we obtain β. Layer normalization and residual connections are employed before and after the multi-head self-attention and multi-layer perceptron modules, which is critical for stable training and fast convergence of the Transformer model. We express the output of layer normalization as follows:
LN(x) = γ ∘ ((x − μ) / δ) + β
where μ and δ are the mean and standard deviation of the corresponding representation, and ∘ denotes element-wise multiplication.
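A corresponding sketch of the layer normalization of step S4 with linearly expanded affine vectors γ and β; the epsilon constant is an illustrative choice.

```python
import torch
import torch.nn as nn

class ExpandedLayerNorm(nn.Module):
    """LayerNorm whose affine vectors gamma and beta are linearly expanded."""
    def __init__(self, layer_idx, num_layers, gene, eps=1e-6):
        super().__init__()
        self.scale, self.gene, self.eps = layer_idx / num_layers, gene, eps

    def forward(self, x):                              # x: (B, N, D)
        gamma = self.scale * self.gene[("gamma", "a")] + self.gene[("gamma", "b")]
        beta = self.scale * self.gene[("beta", "a")] + self.gene[("beta", "b")]
        mu = x.mean(dim=-1, keepdim=True)
        delta = x.std(dim=-1, keepdim=True)
        return gamma * (x - mu) / (delta + self.eps) + beta
```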
In step S5, to train the learning gene, namely θ_a and θ_b, we create the auxiliary model that is linearly expanded from the learning gene according to steps S1 to S4, and then facilitate its training by distillation. It should be noted that the parameters of the corresponding modules in the auxiliary model are linearly expanded from the learning gene. For simplicity we consider only the soft distillation strategy, which minimizes the Kullback-Leibler (KL) divergence between the outputs of the teacher model and the student model. Using this strategy, we introduce the distillation loss:
L_dist = τ² · KL(ψ(z_s / τ), ψ(z_t / τ))
where z_t denotes the output of the pre-trained teacher model, e.g. LeViT-384, z_s denotes the output of the auxiliary model, τ denotes the distillation temperature hyper-parameter, ψ denotes the softmax function, and KL denotes the Kullback-Leibler (KL) divergence function. Combined with the classification loss, our total training loss is defined as:
L_total = (1 − λ) · CE(ψ(z_s), y) + λ · L_dist
where y denotes the ground-truth label, CE denotes the cross-entropy loss, and λ is a hyper-parameter that adjusts the weights of the two losses. Gradient descent is computed with the total loss function to update the parameters of the auxiliary model.
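The training objective of step S5 can be written compactly as below; the (1 − λ)/λ weighting between the cross-entropy and KL terms follows the common soft-distillation recipe and is an assumption of this sketch rather than the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def total_loss(z_s, z_t, y, tau=3.0, lam=0.5):
    """Cross-entropy on the labels plus temperature-scaled KL to the teacher."""
    kl = F.kl_div(F.log_softmax(z_s / tau, dim=-1),
                  F.softmax(z_t / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    ce = F.cross_entropy(z_s, y)
    return (1 - lam) * ce + lam * kl
```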
We use ImageNet-1K to train the auxiliary model that is linearly expanded from the learning gene. During training, we apply an SGD optimizer with a cosine learning-rate schedule; the learning rate is set to 5e-4, the weight decay to 0.05, and the batch size to 128. The model is trained for 100 epochs on NVIDIA Tesla V100 GPUs.
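A sketch of this training configuration, reusing the total_loss helper above; the momentum value and the function boundaries are assumptions, and aux_model, teacher and train_loader are placeholders.

```python
import torch

def train_auxiliary(aux_model, teacher, train_loader, epochs=100):
    """Distill a frozen teacher into the linearly expanded auxiliary model."""
    opt = torch.optim.SGD(aux_model.parameters(), lr=5e-4,
                          weight_decay=0.05, momentum=0.9)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                z_t = teacher(images)                  # frozen pre-trained teacher
            loss = total_loss(aux_model(images), z_t, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
```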
In step S6, after learning θ_a and θ_b with the auxiliary model, we inherit them into different offspring models to handle the specific tasks of the environments in which they are deployed. Benefiting from the flexibility of the linear expansion strategy, we can initialize offspring models with different numbers of layers using the learning gene:
θ_l^(ds) = (l / L_ds) · θ_a + θ_b, l = 1, ..., L_ds
where L_ds denotes the number of layers of the offspring model and θ_l^(ds) denotes the parameters of its l-th layer. We linearly expand the learning gene to create the auxiliary model, so the parameters of the auxiliary model are constrained by the linear expansion during training. In contrast, when initializing an offspring model we linearly expand the trained learning gene parameters once and then train normally, so the offspring model is not constrained by the linear expansion. Clearly, the ancestor model only needs to be reused once, after which offspring models of different depths can be initialized. In terms of roles, the ancestor model corresponds to the pre-trained teacher model, and the offspring model corresponds to the downstream task model. The learning process of the offspring model adopts different training strategies according to the downstream task, for example an image classification task: the trained learning gene parameters are linearly expanded to initialize the offspring model, which is equipped with a linear layer as a classification head to handle the image classification task. During training, the model is trained with the cross-entropy loss.
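Finally, a sketch of step S6: once θ_a and θ_b are trained, an offspring model of any depth can be initialized by evaluating the same linear rule at each of its layers and then fine-tuned without the linear constraint; the attribute layout (descendant.blocks, the gene dictionary) and the classification-head snippet are hypothetical.

```python
import torch

def init_descendant(descendant, gene, num_layers_ds):
    """Write linearly expanded learning-gene tensors into the blocks of an
    offspring model; afterwards the offspring model is trained normally."""
    with torch.no_grad():
        for l, block in enumerate(descendant.blocks, start=1):
            scale = l / num_layers_ds
            params = dict(block.named_parameters())
            for name, (theta_a, theta_b) in gene.items():
                params[name].copy_(scale * theta_a + theta_b)
    return descendant

# For image classification, a linear head is attached and the model is
# fine-tuned with cross-entropy, e.g.
# head = torch.nn.Linear(embed_dim, num_classes)
```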
The present method provides competitive results when the downstream model has the same size as the pre-trained model. We compare the present method with random initialization and with pre-training and fine-tuning (Pre-Fin), whose performance is regarded as the upper limit reachable by an initialization method. Furthermore, we design Mini-Init and Share-Init using the most advanced compression methods in this setting. We also compare with the previous learning-gene work, denoted He-LG.
As shown in FIG. 2, the present method achieves a significant performance improvement over random initialization on all downstream datasets, which verifies the effectiveness of initializing with a learning gene. Specifically, on Mini-ImageNet, Tiny-ImageNet and CIFAR-100 with Des-S, the method is 25.19%, 10.08% and 10.32% higher than random initialization, respectively. Compared with Pre-Fin, the method achieves competitive performance on all datasets, which shows that the common knowledge, i.e., the learning gene, can be well learned and extracted. Note that the trained learning gene can initialize offspring models of different depths, whereas Pre-Fin needs to pre-train multiple times when initializing models of different depths. Furthermore, compared with Pre-Fin, the method only needs to transfer about one third of the parameters to initialize the offspring model, which is more efficient. Compared with Mini-Init and Share-Init, the method achieves better performance on all datasets, which demonstrates the superiority of the learning gene; for example, the method is 7.53%, 6.76% and 7.66% higher on Mini-ImageNet, Tiny-ImageNet and CIFAR-100, respectively, while only transferring about one third of the parameters to initialize the offspring model. Compared with He-LG, the method achieves improved performance on all datasets, which shows the advantage of initializing the offspring model with the present method; for example, the method is 21.05%, 8.95% and 5.51% higher than He-LG on Mini-ImageNet, Tiny-ImageNet and CIFAR-100, respectively.
The technical means disclosed in the scheme of the invention are not limited to those disclosed in the above embodiment, but also include technical schemes formed by any combination of the above technical features. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention are also regarded as falling within the scope of the present invention.

Claims (6)

1. A model initialization method based on a linearly expandable learning gene, characterized by comprising the following steps:
first, creating an auxiliary Transformer that is linearly expanded from the learning gene and training the Transformer by a distillation method; then initializing Transformers of different depths by linearly expanding the trained learning gene so as to adapt to different downstream tasks; the method specifically comprises the following steps:
S1, creating an auxiliary Transformer model that is linearly expanded from the learning gene according to the following formula:
θ_l = (l / L) · θ_a + θ_b, l = 1, ..., L
where L denotes the total number of layers of the auxiliary model, θ_l denotes the parameters of the l-th layer, and θ_a and θ_b denote the learning gene parameters, which will be inherited into different offspring models to handle their specific tasks;
S2, linearly expanding the parameter matrices in the multi-head self-attention module; taking W_Q^(l) as an example, its parameters are linearly expanded by the following formula:
W_Q^(l) = (l / L) · W_Q^(a) + W_Q^(b)
where W_Q^(a) and W_Q^(b) correspond to the parameter matrices of the multi-head self-attention module in θ_a and θ_b; W_K^(l) and W_V^(l) are obtained in a similar manner; the multi-head self-attention module allows the model to jointly process information from different embedding subspaces, and its output is:
MSA(Q, K, V) = Concat(head_1, ..., head_h) W_O
where Concat(·) denotes the concatenation of the outputs of all heads; for the k-th head, the output is:
head_k = softmax(Q_k K_k^T / √d) V_k
where Q_k, K_k and V_k correspond to the queries, keys and values of the k-th head, W_O denotes the parameter matrix that projects the concatenated outputs of all heads, h denotes the number of heads in the multi-head self-attention module, d denotes the output dimension of each head, and √d is the corresponding normalization constant;
S3, linearly expanding the parameter matrices W_1, b_1, W_2 and b_2 in the multi-layer perceptron module, where D' denotes the output dimension of the first linear transformation; the parameters of W_1 are linearly expanded as follows:
W_1^(l) = (l / L) · W_1^(a) + W_1^(b)
where W_1^(a) and W_1^(b) correspond to the parameter matrices of the multi-layer perceptron module in θ_a and θ_b; b_1, W_2 and b_2 are obtained in a similar manner; the output of the multi-layer perceptron is:
MLP(x) = σ(xW_1 + b_1) W_2 + b_2
where σ(·) denotes the activation function;
S4, linearly expanding the learnable parameter vectors γ and β in the layer normalization module; the parameters of γ are linearly expanded by the following formula:
γ^(l) = (l / L) · γ^(a) + γ^(b)
where γ^(a) and γ^(b) correspond to the parameter vectors of the layer normalization module in θ_a and θ_b; β is obtained in a similar manner; the output of layer normalization is expressed as follows:
LN(x) = γ ∘ ((x − μ) / δ) + β
where μ and δ are the mean and standard deviation of the corresponding representation, and ∘ denotes element-wise multiplication;
S5, creating the auxiliary model that is linearly expanded from the learning gene according to steps S1 to S4, and then facilitating the training of the auxiliary model by a distillation method; the distillation loss is:
L_dist = τ² · KL(ψ(z_s / τ), ψ(z_t / τ))
where z_t denotes the output of the pre-trained teacher model, z_s denotes the output of the auxiliary model, τ denotes the distillation temperature hyper-parameter, ψ denotes the softmax function, and KL denotes the Kullback-Leibler divergence function; combined with the classification loss, the total training loss is defined as:
L_total = (1 − λ) · CE(ψ(z_s), y) + λ · L_dist
where y denotes the ground-truth label, CE denotes the cross-entropy loss, and λ is a hyper-parameter that adjusts the weights of the two losses; gradient descent is computed with the total loss function to update the parameters of the auxiliary model;
S6, after learning θ_a and θ_b with the auxiliary model, inheriting them into different offspring models to handle the specific tasks of the environments in which they are deployed; offspring models with different numbers of layers are initialized with the learning gene:
θ_l^(ds) = (l / L_ds) · θ_a + θ_b, l = 1, ..., L_ds
where L_ds denotes the number of layers of the offspring model and θ_l^(ds) denotes the parameters of its l-th layer, and the offspring model is then trained normally;
and the trained learning gene parameters are linearly expanded to initialize an offspring model, the offspring model is equipped with a linear layer as a classification head to handle an image classification task, and training is performed to obtain an image classification result.
2. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein in said step S1, when the full linear expansion strategy is adopted, l ranges from 1 to L; when a partial linear expansion strategy is adopted, l takes several discrete values whose maximum does not exceed L.
3. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein D' > D is set in the step S3.
4. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein in the step S4, a layer normalization module and a residual connection are used before and after the multi-head self-attention module and the multi-layer perceptron module.
5. The method for initializing a model based on a linearly expandable learning gene according to claim 1, characterized in that in said step S5, distillation loss is introduced based on a soft distillation strategy.
6. The method for initializing a model based on a linearly expandable learning gene according to claim 1, wherein in step S6, the learning process of the offspring model adopts different training strategies according to the downstream tasks.
CN202311264810.3A 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes Active CN117273068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311264810.3A CN117273068B (en) 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311264810.3A CN117273068B (en) 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes

Publications (2)

Publication Number Publication Date
CN117273068A CN117273068A (en) 2023-12-22
CN117273068B true CN117273068B (en) 2024-04-16

Family

ID=89211911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311264810.3A Active CN117273068B (en) 2023-09-28 2023-09-28 Model initialization method based on linearly expandable learning genes

Country Status (1)

Country Link
CN (1) CN117273068B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
CN113537365A (en) * 2021-07-20 2021-10-22 北京航空航天大学 Multitask learning self-adaptive balancing method based on information entropy dynamic weighting
JP2022133872A (en) * 2021-03-02 2022-09-14 株式会社Jvcケンウッド Machine learning device, inference device, machine learning method, and machine learning program
CN116628510A (en) * 2023-07-25 2023-08-22 自然语义(青岛)科技有限公司 Self-training iterative artificial intelligent model training method
CN116644316A (en) * 2023-05-31 2023-08-25 杭州电子科技大学 Multi-mode multi-task learning oriented lightweight adaptive network learning method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11782401B2 (en) * 2019-08-02 2023-10-10 Aspentech Corporation Apparatus and methods to build deep learning controller using non-invasive closed loop exploration

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146921A (en) * 2018-07-02 2019-01-04 华中科技大学 A kind of pedestrian target tracking based on deep learning
JP2022133872A (en) * 2021-03-02 2022-09-14 株式会社Jvcケンウッド Machine learning device, inference device, machine learning method, and machine learning program
CN113537365A (en) * 2021-07-20 2021-10-22 北京航空航天大学 Multitask learning self-adaptive balancing method based on information entropy dynamic weighting
CN116644316A (en) * 2023-05-31 2023-08-25 杭州电子科技大学 Multi-mode multi-task learning oriented lightweight adaptive network learning method
CN116628510A (en) * 2023-07-25 2023-08-22 自然语义(青岛)科技有限公司 Self-training iterative artificial intelligent model training method

Also Published As

Publication number Publication date
CN117273068A (en) 2023-12-22

Similar Documents

Publication Publication Date Title
CN110321926B (en) Migration method and system based on depth residual error correction network
US10037457B2 (en) Methods and systems for verifying face images based on canonical images
Deco et al. An information-theoretic approach to neural computing
CN109389207A (en) A kind of adaptive neural network learning method and nerve network system
WO2022006919A1 (en) Activation fixed-point fitting-based method and system for post-training quantization of convolutional neural network
Anselmi et al. Deep convolutional networks are hierarchical kernel machines
CN107622303A (en) For the method for neutral net and the equipment of execution this method
CN111797911B (en) Multi-label classification method for image data
CN113434699A (en) Pre-training method of BERT model, computer device and storage medium
CN111753995A (en) Local interpretable method based on gradient lifting tree
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
CN109063725B (en) Multi-view clustering-oriented multi-graph regularization depth matrix decomposition method
CN105260736A (en) Fast image feature representing method based on normalized nonnegative sparse encoder
CN113408610B (en) Image identification method based on adaptive matrix iteration extreme learning machine
CN117273068B (en) Model initialization method based on linearly expandable learning genes
CN106779062A (en) A kind of multi-layer perception (MLP) artificial neural network based on residual error network
CN109034387A (en) A kind of approximation method for quickly training self-encoding encoder based on pseudo- reversal learning
CN111461229A (en) Deep neural network optimization and image classification method based on target transfer and line search
CN117409456A (en) Non-aligned multi-view multi-mark learning method based on graph matching mechanism
CN115188055A (en) Lightweight expression identification method for NNIE neural network accelerator
CN115081516A (en) Internet of things flow prediction method based on biological connection group time-varying convolution network
CN113935473A (en) Deep learning neural network optimization method and application method
Imai et al. Stepwise pathnet: Transfer learning algorithm to improve network structure versatility
CN112580783B (en) Cross-dimension knowledge migration method for migrating knowledge from high-dimension deep learning model to low dimension
CN115131599B (en) Image classification method based on deviation resistance and robustness knowledge distillation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant