WO2022126683A1 - Method and platform for automatically compressing multi-task-oriented pre-training language model
- Publication number
- WO2022126683A1 (PCT/CN2020/138016)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Definitions
- the invention belongs to the field of language model compression, and in particular relates to a multitask-oriented pre-training language model automatic compression method and platform.
- the layer-by-layer knowledge distillation method minimizes the distance between the feature maps of the two networks.
- this method can usually achieve better results.
- the core challenge of neural network compression with few samples is that the compressed model easily overfits the few training instances, which produces a large estimation error between its inference and that of the original network. Estimation errors may accumulate and propagate layer by layer, eventually corrupting the network output.
- the existing knowledge distillation methods are mainly data-driven sparse constraints or manually designed distillation strategies. A BERT network usually has 12 layers of Transformer units, each unit contains 8 self-attention units, and the self-attention units can be connected in hundreds of millions of possible ways. Given limited computing resources, it is almost impossible to design all possible distillation structures by hand and find the optimal one.
- the purpose of the present invention is to provide a multi-task oriented pre-training language model automatic compression method and platform aiming at the deficiencies of the prior art.
- a multi-task-oriented pre-training language model automatic compression method comprising three stages:
- the first stage is to construct a knowledge distillation encoding vector based on Transformer layer sampling: a Bernoulli distribution is used to perform layer sampling on all Transformer units of the BERT model, generating the knowledge distillation encoding vector;
- the second stage is to train the meta-learning knowledge distillation network: define the search space, input the knowledge distillation encoding vectors constructed in the first stage into the search space, and eliminate those that do not meet the conditions; define the structure generator, which takes the filtered knowledge distillation encoding vector as input, outputs the weight matrix used to construct the distillation structure model, and generates the corresponding distillation structure model; train the generated distillation structure model so as to update the structure generator;
- the third stage is to search for distillation structure models with an evolutionary algorithm: input multiple knowledge distillation encoding vectors that satisfy specific constraints into the structure generator updated in the second stage to generate the corresponding weight matrices and obtain multiple distillation structure models; evaluate the accuracy of each distillation structure model; use the evolutionary algorithm to search for the distillation structure model with the highest accuracy that satisfies the specific constraints, which yields a general compression architecture.
- the first stage is specifically: Bernoulli sampling is performed on the 12 Transformer layers of the BERT model in turn to generate the knowledge distillation encoding vector, with one random variable per layer; when the probability of the random variable being 1 is greater than or equal to 0.5, the corresponding element of the knowledge distillation encoding vector is 1, meaning that the current Transformer unit takes part in transfer learning; when that probability is less than 0.5, the corresponding element is 0, meaning that the current Transformer unit does not take part in transfer learning.
- the search space is defined as follows: the knowledge distillation encoding vector must contain at least 6 elements equal to 1.
- the structure generator is defined as follows: it consists of two fully connected layers; its input is the knowledge distillation encoding vector constructed in the first stage, and its output is the weight matrix used to generate the distillation structure model.
- training the generated distillation structure model so as to update the structure generator includes the following sub-steps: step (2.1), input the knowledge distillation encoding vector into the structure generator and output the weight matrix; step (2.2), construct the distillation structure model from the weight matrix output by the structure generator; step (2.3), jointly train the structure generator and the distillation structure model.
- step (2.2) is specifically: each element of the knowledge distillation encoding vector constructed in the first stage corresponds to one Transformer layer; layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network, and the weights of the teacher-model Transformer units whose element in the encoding vector is 1 are used to initialize the Transformer units migrated to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights; the knowledge distillation encoding vector establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated according to it.
- the method for training the structure generator in combination with Bernoulli distribution sampling is specifically: a Bernoulli distribution is used to perform layer sampling on each layer of Transformer units to construct different knowledge distillation encoding vectors, and the training data set is used for multiple training iterations; in each iteration, the structure generator and the distillation structure model are trained simultaneously on one knowledge distillation encoding vector, and by varying the input encoding vector a structure generator is learned that can generate weight matrices for different distillation structure models.
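- the third stage includes the following sub-steps: step (3.1), define the knowledge distillation encoding vector as the gene of the distillation structure model and randomly select a series of genes that satisfy the specific constraints as the initial population; step (3.2), evaluate the accuracy of the distillation structure model corresponding to each gene in the current population and select the top k genes by accuracy; step (3.3), apply gene recombination and gene mutation to those top k genes to generate new genes and add them to the population; step (3.4), repeat steps (3.2) and (3.3) for a set number of rounds to obtain the gene that satisfies the specific constraints and has the highest accuracy.
- in step (3.3), gene mutation means randomly changing the values of some elements in the gene; gene recombination means randomly recombining the genes of two parents; new genes that do not satisfy the specific constraints are eliminated.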
- a platform based on the above multi-task-oriented pre-training language model automatic compression method including the following components:
- Data loading component used to obtain training samples of multi-task-oriented pre-trained language models, where the training samples are labeled text samples that satisfy supervised learning tasks;
- Automatic compression component used to automatically compress multi-task-oriented pre-trained language models, including knowledge distillation vector encoding module, distillation network generation module, structure generator and distillation network joint training module, distillation network search module and task-specific fine-tuning module;
- the knowledge distillation vector encoding module includes the layer sampling vector of the Transformer; during forward propagation, the knowledge distillation encoding vector is input into the structure generator, which generates the distillation network of the corresponding structure and the weight matrix of the structure generator;
- the distillation network generation module uses the structure generator to construct the distillation network corresponding to the currently input knowledge distillation encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it matches the number of encoder units at the input and output of the distillation structure corresponding to that encoding vector;
- the structure generator and distillation network joint training module trains the structure generator end to end. Specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a mini-batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated;
- the distillation network search module uses an evolutionary algorithm to search for the distillation network with the highest accuracy that satisfies specific constraints; the knowledge distillation encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy;
- in the evolutionary search, each distillation network is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene of the distillation network; under the specific constraints, a series of knowledge distillation encoding vectors is first selected as genes, and the accuracy of each corresponding distillation network is obtained by evaluation on the validation set; then the top k genes by accuracy are selected, and gene recombination and mutation are used to generate new genes; repeating the top-k selection and the generation of new genes iteratively yields the gene that satisfies the constraints and has the highest accuracy;
- the task-specific fine-tuning module builds a downstream task network on the pre-trained distillation network generated by the automatic compression component, uses the feature layer and output layer of the distillation network to fine-tune on the downstream task scenario, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a specified container from which the logged-in user can download it, and the platform's compressed-model output page presents a comparison of the model size before and after compression;
- Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to run inference on new data for the natural language processing downstream task, uploaded by the logged-in user, on a data set from the actual scenario; the platform's compressed-model inference page presents a comparison of the inference speed before and after compression.
- the beneficial effects of the present invention are: first, the present invention studies how to generate a general compression architecture for multiple pre-trained language models based on meta-learning knowledge distillation; second, on the basis of the trained meta-learning network, an evolutionary algorithm is used to search for the optimal compression structure, which yields an optimal general compression architecture for task-independent pre-trained language models.
- with the multi-task-oriented pre-trained language model automatic compression platform of the present invention, a general multi-task-oriented architecture of the pre-trained language model is generated by compression, the compressed model architecture is fully used to improve the compression efficiency of downstream tasks, and large-scale natural language processing models can be deployed on end-side devices with small memory and limited resources, which promotes the industrial adoption of general-purpose deep language models.
- Figure 1 is an overall architecture diagram of the compression method of the present invention in conjunction with a specific task
- Figure 2 is the training flow chart of the meta-learning knowledge distillation network
- Figure 3 is an architecture diagram of building a distillation network based on a structure generator
- Figure 4 is a diagram of the joint training process of the structure generator and the distillation network
- Figure 5 is a diagram of a distillation network search architecture based on an evolutionary algorithm.
- the present invention studies a general compression architecture for multiple pre-trained language models generated by meta-learning-based knowledge distillation.
- first, a knowledge distillation encoding vector based on Transformer layer sampling is constructed, and the knowledge structure of the large model is distilled at different levels.
- a meta-network of structure generator is designed, and the structure generator is used to generate a distilled structure model corresponding to the currently input encoding vector.
- a Bernoulli distribution sampling method is proposed to train the structure generator.
- in each iteration, Bernoulli sampling decides which encoder units are migrated, which forms the corresponding encoding vector.
- by changing the encoding vector fed to the structure generator together with the mini-batch of training data, and by jointly training the structure generator and the corresponding distillation structure, a structure generator that can generate weights for different distillation structures is learned.
- an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression structure of the task-independent pre-trained language model.
- the invention addresses the overfitting and low generalization ability of the compressed model when the BERT model is compressed with few samples, explores the feasibility and key techniques of language understanding with large-scale deep language models under few-sample conditions, and improves the flexibility and effectiveness of the compressed model across a variety of downstream tasks.
- the knowledge distillation of meta-learning can completely liberate manpower from tedious hyperparameter tuning, while allowing the use of multiple target metrics to directly optimize the compression model.
- knowledge distillation of meta-learning can easily enforce conditional constraints when searching for the desired compression structure without manual tuning of reinforcement learning hyperparameters.
- the application technical route of the compression method of the present invention is shown in Figure 1.
- the present invention is a multi-task-oriented pre-trained language model automatic compression method, and the whole process is divided into three stages: the first stage is to construct a knowledge distillation encoding vector based on Transformer layer sampling; the second stage is to train the meta-learning knowledge distillation network; the third stage is to search for the optimal compression structure based on an evolutionary algorithm; specifically:
- Stage 1 Construct knowledge distillation encoding vector based on Transformer layer sampling.
- the Bernoulli distribution is used to perform layer sampling on all Transformer units of the BERT model to generate a layer sampling vector, that is, the knowledge distillation encoding vector.
- for the i-th Transformer unit (encoder layer), the random variable X_i ~ Bernoulli(p) is an independent Bernoulli random variable that takes the value 1 with probability p and the value 0 with probability 1 - p.
- the 12 Transformer layers of the BERT model are Bernoulli-sampled in turn using the random variables X_i, which produces a vector of 12 elements, each 0 or 1.
- when the probability p of X_i being 1 is greater than or equal to 0.5, the corresponding element of the layer sampling vector is 1, meaning that the current Transformer unit takes part in transfer learning; when that probability is less than 0.5, the corresponding element is 0, meaning that the current Transformer unit does not take part in transfer learning.
- applying this Bernoulli sampling to all Transformer units of the BERT model yields the knowledge distillation encoding vector, denoted layer_sample.
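A minimal Python sketch of the layer sampling described above, under stated assumptions: each element is drawn as an independent Bernoulli(p) variable (one reading of the 0.5 threshold wording), and the helper names `sample_encoding_vector`, `satisfies_search_space` and `sample_valid_encoding_vector` are hypothetical, not taken from the patent. The requirement of at least 6 ones is the search-space condition that the text states for the second stage.

```python
import random

NUM_LAYERS = 12  # Transformer layers of the BERT model, per the text
MIN_ACTIVE = 6   # search-space condition: at least 6 elements equal to 1


def sample_encoding_vector(p=0.5, rng=random):
    """Draw X_i ~ Bernoulli(p) independently for each of the 12 layers."""
    return [1 if rng.random() < p else 0 for _ in range(NUM_LAYERS)]


def satisfies_search_space(vector):
    """True when the vector migrates at least MIN_ACTIVE Transformer units."""
    return sum(vector) >= MIN_ACTIVE


def sample_valid_encoding_vector(p=0.5, rng=random):
    """Rejection-sample until the vector meets the search-space condition."""
    while True:
        vector = sample_encoding_vector(p, rng)
        if satisfies_search_space(vector):
            return vector
```

Rejection sampling is only one way to combine sampling with the constraint; filtering a pre-sampled batch of vectors would work equally well.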
- Stage 2 Train a meta-learned knowledge distillation network.
- define the search space: input the knowledge distillation encoding vectors constructed in the first stage into the search space, and eliminate the vectors that do not meet the constraints.
- define the structure generator: take the filtered knowledge distillation encoding vector as input, output the weight matrix used to construct the distillation network, and generate the corresponding distillation structure model; train the generated distillation structure on batches of data, updating the distillation structure and thereby the structure generator; finally, output the weights produced by the structure generator after the iterative updates.
- the structure generator is a meta-network consisting of two fully connected layers; the input is the knowledge distillation encoding vector constructed in the first stage, and the output is the weight matrix used to generate the distilled structure model.
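The two-layer meta-network can be sketched in plain Python as follows. This is an illustrative toy, not the patent's implementation: the class name `StructureGenerator`, the hidden size, the 4 x 4 output shape, the Gaussian initialization and the ReLU between the two layers are all assumptions, since the text only specifies two fully connected layers mapping an encoding vector to a weight matrix.

```python
import random


class StructureGenerator:
    """Toy meta-network: encoding vector -> hidden layer (ReLU) -> weight matrix."""

    def __init__(self, num_layers=12, hidden=16, out_rows=4, out_cols=4, seed=0):
        rng = random.Random(seed)
        self.out_rows, self.out_cols = out_rows, out_cols
        out_dim = out_rows * out_cols
        # First fully connected layer: num_layers -> hidden
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(num_layers)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        # Second fully connected layer: hidden -> out_rows * out_cols
        self.w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    @staticmethod
    def _linear(weights, bias, x):
        """Compute weights @ x + bias for list-of-lists weights."""
        return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

    def forward(self, encoding_vector):
        hidden = [max(0.0, h) for h in self._linear(self.w1, self.b1, encoding_vector)]
        flat = self._linear(self.w2, self.b2, hidden)
        # Reshape the flat output into an out_rows x out_cols weight matrix.
        cols = self.out_cols
        return [flat[r * cols:(r + 1) * cols] for r in range(self.out_rows)]
```

For example, `StructureGenerator().forward([1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1])` returns a 4 x 4 list of floats; a real generator would emit matrices sized for the student's Transformer units.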
- Training the structure generator includes the following sub-steps:
- Step 1 In the forward propagation process, the knowledge distillation encoding vector is input into the structure generator and the weight matrix is output.
- Step 2 Figure 3 shows the process of building a distillation structure model based on the structure generator:
- each element l_i of the knowledge distillation encoding vector constructed in the first stage corresponds to one Transformer layer; layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network, and the weights of the teacher-model Transformer units whose element in the encoding vector is 1 are used to initialize the Transformer units migrated to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights.
- the knowledge distillation encoding vector establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated according to it.
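The teacher-to-student mapping in Step 2 can be illustrated with a short sketch. It assumes, purely for illustration, that each teacher layer is represented by a flat list of weights; the function name `build_student_layers` is hypothetical.

```python
def build_student_layers(teacher_layers, encoding_vector):
    """Return (teacher_index, weights) pairs for layers whose element is 1.

    The position of each pair in the result is the student layer index, so the
    result records the one-to-one mapping between student and teacher layers.
    """
    if len(teacher_layers) != len(encoding_vector):
        raise ValueError("one encoding element is required per teacher layer")
    return [
        (index, list(weights))  # copy so the student can be trained independently
        for index, (weights, keep) in enumerate(zip(teacher_layers, encoding_vector))
        if keep == 1
    ]
```

With the vector `[1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]`, the student receives six layers initialized from teacher layers 0, 2, 4, 6, 8 and 10.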
- Step 3 Figure 4 shows the process of jointly training the structure generator and the distillation structure model:
- Input a small batch of training data into the distillation structure model generated in step 2 for model training.
- the structure generator is updated along with the updated parameters; that is, during back propagation the distillation structure model and the structure generator are updated together; the gradients with respect to the weights output by the structure generator can be computed with the chain rule, so the structure generator can be trained end to end.
- a Bernoulli distribution sampling method is proposed to train the structure generator, specifically: a Bernoulli distribution is used to sample each layer of Transformer units to construct different knowledge distillation encoding vectors, and the same training data set is used for multiple training iterations; in each iteration, the structure generator and the distillation structure model are trained simultaneously on one knowledge distillation encoding vector, and by varying the input encoding vector a structure generator is learned that can generate weight matrices for different distillation structure models.
- the shape of the weight matrix output by the structure generator needs to be adjusted so that it matches the number of encoder units at the input and output of the distillation structure corresponding to the knowledge distillation encoding vector, which keeps it consistent with the encoding vector obtained by layer sampling.
- the shape of the weight matrix output by the structure generator is adjusted according to the number and position of Transformer units whose elements are 1 in the coding vector.
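The end-to-end update of the structure generator through the weights it generates can be shown on a deliberately tiny example. Everything here is a simplifying assumption rather than the patent's method: the generator is a single linear map with parameters `theta` (not two fully connected layers), the "student" is a scalar model y = w * x whose only weight comes from the generator, and the loss is a squared error.

```python
def generate_weight(theta, vector):
    """Toy generator: the student weight is the sum of theta over active layers."""
    return sum(t * bit for t, bit in zip(theta, vector))


def joint_training_step(theta, vector, x, y, lr=0.1):
    """One gradient step on the generator parameters via the chain rule.

    loss = (w * x - y) ** 2 with w = generate_weight(theta, vector), so
    dloss/dtheta_i = dloss/dw * dw/dtheta_i = 2 * (w * x - y) * x * vector[i].
    """
    w = generate_weight(theta, vector)
    error = w * x - y
    grad_w = 2.0 * error * x
    new_theta = [t - lr * grad_w * bit for t, bit in zip(theta, vector)]
    return new_theta, error ** 2
```

In a training loop following the text, a fresh Bernoulli-sampled encoding vector would be passed to `joint_training_step` at every iteration, so that the same generator parameters learn to serve many different distillation structures.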
- Stage 3 Search for the distillation structure model based on an evolutionary algorithm. Figure 5 shows the process of the distillation network search:
- Step 1 Each distillation structure model is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene G of the distillation structure model, and a series of genes that satisfy the constraint C is randomly selected as the initial population.
- Step 2 Evaluate, on the validation set, the inference accuracy of the distillation structure model corresponding to each gene G_i in the current population, and select the top k genes with the highest accuracy.
- Step 3 Apply gene recombination and gene mutation to the top k genes selected in step 2 to generate new genes, and add the new genes to the current population.
- Gene mutation means randomly changing the values of some elements in the gene; gene recombination means randomly recombining the genes of two parents to produce offspring; constraint C is easily enforced by eliminating genes that violate it.
- Step 4 Repeat steps 2 and 3 for N rounds of iterations, select the top k genes with the highest accuracy in the existing population and generate new genes, until the gene that satisfies the constraint C and has the highest accuracy is obtained.
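Steps 1 to 4 can be sketched as a small evolutionary loop. This is an illustrative sketch under assumptions: `evaluate` stands in for validation-set accuracy, the population is kept at a fixed size (parents plus children) rather than growing, recombination is uniform crossover, and all function names and default values are hypothetical.

```python
import random


def mutate(gene, rate, rng):
    """Gene mutation: flip each element independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in gene]


def recombine(parent_a, parent_b, rng):
    """Gene recombination: take each element at random from one of two parents."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def evolutionary_search(evaluate, constraint, num_layers=12, population_size=20,
                        top_k=5, rounds=10, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)

    def random_gene():
        while True:
            gene = [rng.randint(0, 1) for _ in range(num_layers)]
            if constraint(gene):
                return gene

    # Step 1: random initial population of genes satisfying the constraint.
    population = [random_gene() for _ in range(population_size)]
    for _ in range(rounds):
        # Step 2: rank genes by the accuracy proxy and keep the top k.
        parents = sorted(population, key=evaluate, reverse=True)[:top_k]
        # Step 3: create new genes, discarding any that violate the constraint.
        children = []
        while len(children) < population_size - top_k:
            child = recombine(rng.choice(parents), rng.choice(parents), rng)
            child = mutate(child, mutation_rate, rng)
            if constraint(child):
                children.append(child)
        population = parents + children
    # Step 4: after the final round, return the best gene found.
    return max(population, key=evaluate)
```

A call such as `evolutionary_search(lambda g: -abs(sum(g) - 6), lambda g: sum(g) >= 6)` uses a made-up fitness in place of real accuracy; in the method described here, `evaluate` would build the distillation structure model from the gene via the trained structure generator and measure it on the validation set.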
- a multitask-oriented pre-training language model automatic compression platform of the present invention includes the following components:
- Data loading component used to obtain the BERT model to be compressed and the training samples of the multi-task-oriented pre-trained language model, uploaded by the logged-in user and containing specific natural language processing downstream tasks; the training samples are labeled text samples that satisfy supervised learning tasks.
- Automatic compression component used to automatically compress multi-task-oriented pre-trained language models, including knowledge distillation vector encoding module, distillation network generation module, structure generator and distillation network joint training module, distillation network search module, and task-specific fine-tuning module.
- the knowledge distillation vector encoding module includes the layer sampling vector of the Transformer. During forward propagation, the distillation network encoding vector is input into the structure generator, which generates the distillation network of the corresponding structure and the weight matrix of the structure generator.
- the distillation network generation module uses the structure generator to construct the distillation network corresponding to the currently input encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it matches the number of encoder units at the input and output of the distillation structure corresponding to the encoding vector.
- the structure generator and distillation network joint training module trains the structure generator end to end. Specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a mini-batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated.
- the distillation network search module uses an evolutionary algorithm to search for the distillation network with the highest accuracy that satisfies the specific constraints.
- each distillation network is encoded and generated by the encoding vector containing the sampling based on the Transformer layer, so the distillation network encoding vector is defined as the gene of the distillation network.
- a series of distillation network encoding vectors are first selected as the genes of the distillation network, and the accuracy of the corresponding distillation network is obtained by evaluating on the validation set. Then, the top k genes with the highest precision are selected, and gene recombination and mutation are used to generate new genes. By further repeating the process of selecting the top k optimal genes and the process of generating new genes, the genes that satisfy the constraints and have the highest accuracy are obtained.
- the task-specific fine-tuning module builds a downstream task network on the pre-trained distillation network generated by the automatic compression component, uses the feature layer and output layer of the distillation network to fine-tune on the downstream task scenario, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user.
- the compressed model is output to a specified container from which the logged-in user can download it, and the platform's compressed-model output page presents a comparison of the model size before and after compression.
- Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to run inference on new data for the natural language processing downstream task, uploaded by the logged-in user, on a data set from the actual scenario. The platform's compressed-model inference page presents a comparison of the inference speed before and after compression.
- the logged-in user can directly download the trained pre-trained language model provided by the platform of the present invention and, according to the user's specific natural language processing downstream task, build the downstream task network on the compressed pre-trained model architecture generated by the platform, fine-tune it, and finally deploy it on end devices. Inference on natural language processing downstream tasks can also be performed directly on the platform.
- the platform loads the BERT pre-trained model generated by the automatic compression component, and a model for the text classification task is constructed on the generated pre-trained model;
- the result is the compressed model of the BERT model that contains the text classification task.
- the compressed model is output to the designated container from which the logged-in user can download it, and the platform's compressed-model output page presents a comparison of the model size before and after compression.
- the size of the model is 110M before compression and 56M after compression, a reduction of 49%, as shown in Table 1 below.
- the compressed model output by the platform is used to run inference on the SST-2 test set data uploaded by the logged-in user, and the platform's compressed-model inference page shows that inference after compression is 2.01 times faster than before compression, and that the inference accuracy improves from 91.5% before compression to 92.0%.
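As a quick arithmetic check of the reported figures, assuming 110M and 56M are measured in the same unit:

```latex
\frac{110 - 56}{110} = \frac{54}{110} \approx 0.491 \;(\text{about } 49\%), \qquad 92.0\% - 91.5\% = 0.5 \text{ percentage points}
```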
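Text classification task (SST-2, 67K samples) | Before compression | After compression | Comparison |
---|---|---|---|
Model size | 110M | 56M | Compressed by 49% |
Inference accuracy | 91.5% | 92.0% | Improved by 0.5% |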
Claims (10)
- 一种面向多任务的预训练语言模型自动压缩方法,其特征在于,包括三个阶段:第一阶段,构建基于Transformer层采样的知识蒸馏编码向量:采用伯努利分布对BERT模型的所有Transformer单元进行层采样,生成知识蒸馏编码向量;第二阶段,训练元学习的知识蒸馏网络:定义搜索空间,将第一阶段构建的知识蒸馏编码向量输入该搜索空间,剔除不符合条件的知识蒸馏编码向量;定义结构生成器,将经过筛选的知识蒸馏编码向量作为输入,输出用于构建蒸馏结构模型的权重矩阵,并生成对应的蒸馏结构模型;训练生成的蒸馏结构模型从而更新结构生成器;第三阶段,基于进化算法的蒸馏结构模型搜索的过程:将多个满足特定约束的知识蒸馏编码向量输入第二阶段更新后的结构生成器生成对应的权重矩阵,得到多个蒸馏结构模型;评估每个蒸馏结构模型的精度;采用进化算法搜索其中满足特定约束的精度最高的蒸馏结构模型,得到通用压缩架构。
- 如权利要求1所述面向多任务的预训练语言模型自动压缩方法,其特征在于,所述第一阶段具体为:依次对BERT模型的12层Transformer单元进行伯努利采样生成知识蒸馏编码向量,每一层对应一个随机变量;当随机变量为1的概率大于等于0.5时,知识蒸馏编码向量对应的元素为1,代表当前Transformer单元进行迁移学习;当随机变量为1的概率值小于0.5时,层采样向量对应的元素为0,代表当前Transformer单元不进行迁移学习。
- 如权利要求2所述面向多任务的预训练语言模型自动压缩方法,其特征在于,所述定义搜索空间具体为:知识蒸馏编码向量中元素为1的数量不少于6。
- 如权利要求3所述面向多任务的预训练语言模型自动压缩方法,其特征在于,所述定义结构生成器具体为:结构生成器由两个全连接层组成,输入为第一阶段构建的知识蒸馏编码向量,输出为用于生成蒸馏结构模型的权重矩阵。
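- The automatic compression method for multitask-oriented pre-trained language models according to claim 4, characterized in that training the generated distillation structure model so as to update the structure generator comprises the following sub-steps: step (2.1): inputting the knowledge distillation encoding vector into the structure generator, which outputs a weight matrix; step (2.2): constructing the distillation structure model from the weight matrix output by the structure generator; step (2.3): jointly training the structure generator and the distillation structure model: the training data are input into the distillation structure model generated in step (2.2) for model training, and the structure generator is updated at the same time; the structure generator is additionally trained in combination with the Bernoulli-distribution sampling method.
- The automatic compression method for multitask-oriented pre-trained language models according to claim 5, characterized in that step (2.2) is specifically: according to the knowledge distillation encoding vector constructed in the first stage, each element of which corresponds to one layer of Transformer units, layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network, and the weights of the teacher model's Transformer units whose corresponding element in the encoding vector is 1 are used to initialize the Transformer units transferred to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights; the knowledge distillation encoding vector establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated from the knowledge distillation encoding vector.
- The automatic compression method for multitask-oriented pre-trained language models according to claim 6, characterized in that training the structure generator in combination with the Bernoulli-distribution sampling method is specifically: a Bernoulli distribution is used to layer-sample the Transformer units of each layer to construct different knowledge distillation encoding vectors; the training data set is used for multiple rounds of iterative training; in each iteration, the structure generator and the distillation structure model are trained simultaneously on the basis of one knowledge distillation encoding vector; by varying the input knowledge distillation encoding vector, a structure generator capable of generating weight matrices for different distillation structure models is learned.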
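- The automatic compression method for multitask-oriented pre-trained language models according to claim 7, characterized in that the third stage comprises the following sub-steps: step (3.1): defining the knowledge distillation encoding vector as the gene of a distillation structure model, and randomly selecting a series of genes satisfying the specific constraints as the initial population; step (3.2): evaluating the accuracy of the distillation structure model corresponding to each gene in the current population, and selecting the top k genes by accuracy; step (3.3): performing gene recombination and gene mutation on the top k genes selected in step (3.2) to generate new genes, and adding the new genes to the current population; step (3.4): repeating steps (3.2) to (3.3) for a set number of rounds, each time selecting the top k genes by accuracy in the current population and generating new genes, finally obtaining the gene that satisfies the specific constraints and has the highest accuracy.
- The automatic compression method for multitask-oriented pre-trained language models according to claim 8, characterized in that, in step (3.3), gene mutation means randomly changing the values of some of the elements in a gene; gene recombination means randomly recombining the genes of two parents; and new genes that do not satisfy the specific constraints are discarded.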
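- A platform based on the automatic compression method for multitask-oriented pre-trained language models according to any one of claims 1-9, characterized by comprising the following components: a data loading component, used to obtain training samples for the multitask-oriented pre-trained language model, the training samples being labeled text samples suitable for supervised learning tasks; an automatic compression component, used to automatically compress the multitask-oriented pre-trained language model, comprising a knowledge distillation vector encoding module, a distillation network generation module, a module for jointly training the structure generator and the distillation network, a distillation network search module, and a task-specific fine-tuning module; the knowledge distillation vector encoding module comprises the Transformer layer sampling vector; during forward propagation, the knowledge distillation encoding vector is input into the structure generator, generating the distillation network of the corresponding structure and the weight matrix of the structure generator; the distillation network generation module constructs, based on the structure generator, the distillation network corresponding to the currently input knowledge distillation encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it matches the number of input and output encoder units of the distillation structure corresponding to the knowledge distillation encoding vector; the joint training module trains the structure generator end to end; specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a mini-batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated; the distillation network search module searches, by means of an evolutionary algorithm, for the highest-accuracy distillation network that satisfies specific constraints; the knowledge distillation encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy; in the evolutionary search algorithm used in the meta-learning distillation network, each distillation network is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene of the distillation network; under the specific constraints, a series of knowledge distillation encoding vectors is first selected as the genes of distillation networks, and the accuracy of each corresponding distillation network is obtained by evaluation on the validation set; then the top k genes by accuracy are selected, and new genes are generated by gene recombination and mutation; by repeating the selection of the top k genes and the generation of new genes, the gene that satisfies the constraints and has the highest accuracy is obtained; the task-specific fine-tuning module builds a downstream task network on the pre-trained-model distillation network generated by the automatic compression component, fine-tunes for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, i.e., the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a designated container for the logged-in user to download, and a comparison of model size before and after compression is presented on the platform's compressed-model output page; an inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to run inference, on data sets from real scenarios, on new data of natural language processing downstream tasks uploaded by the logged-in user; and a comparison of inference speed before and after compression is presented on the platform's compressed-model inference page.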
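The layer sampling, search-space filter, and evolutionary search described in the claims can be illustrated with a small, self-contained Python sketch. It is only one possible reading of the claims, not the patented implementation. The 12-layer vector and the "at least 6 ones" filter come from the claims. Everything else is an assumption made for illustration: the function names, the population size, `top_k`, the mutation rate, and especially `toy_fitness`, which stands in for the validation accuracy that the claimed method obtains from a distillation network built by the trained structure generator.

```python
import random

NUM_LAYERS = 12  # the claims sample the 12 Transformer layers of BERT
MIN_KEPT = 6     # search-space constraint: at least 6 elements equal to 1


def sample_encoding(rng, p=0.5):
    """Stage 1: one Bernoulli(p) draw per layer; 1 means the layer is transferred."""
    return tuple(1 if rng.random() < p else 0 for _ in range(NUM_LAYERS))


def is_valid(gene):
    """Search-space filter: keep vectors with at least MIN_KEPT ones."""
    return len(gene) == NUM_LAYERS and sum(gene) >= MIN_KEPT


def sample_valid(rng):
    """Rejection-sample until an encoding vector passes the filter."""
    while True:
        gene = sample_encoding(rng)
        if is_valid(gene):
            return gene


def mutate(gene, rng, rate=0.1):
    """Gene mutation: flip each element independently with probability `rate`."""
    return tuple(1 - g if rng.random() < rate else g for g in gene)


def crossover(a, b, rng):
    """Gene recombination: take each element from one of the two parents at random."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def toy_fitness(gene):
    """Hypothetical stand-in for validation accuracy (an assumption, not from the source).

    Layer i (1-based) contributes i - 5 when kept, so upper layers help and
    the lowest layers cost more than they add.
    """
    return sum((i - 5) * g for i, g in enumerate(gene, start=1))


def evolve(fitness, rng, pop_size=20, top_k=5, rounds=10):
    """Stage 3: top-k selection, recombination, mutation, repeated for `rounds`."""
    population = [sample_valid(rng) for _ in range(pop_size)]
    for _ in range(rounds):
        parents = sorted(population, key=fitness, reverse=True)[:top_k]
        for _ in range(pop_size):
            child = mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            # New genes that violate the constraint are discarded.
            if is_valid(child) and child not in population:
                population.append(child)
    return max(population, key=fitness)


if __name__ == "__main__":
    best = evolve(toy_fitness, random.Random(0))
    print(best, "layers kept:", sum(best), "fitness:", toy_fitness(best))
```

In a real pipeline, `toy_fitness` would be replaced by a function that feeds the gene to the trained structure generator, builds the distillation network from the generated weights, and returns its accuracy on the validation set. The selection and variation loop would stay the same.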
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2214196.4A GB2619569A (en) | 2020-12-15 | 2020-12-21 | Method and platform for automatically compressing multi-task-oriented pre-training language model |
JP2022570738A JP7381814B2 (ja) | 2020-12-15 | 2020-12-21 | Automatic compression method and platform for multitask-oriented pre-trained language models |
US17/564,071 US11526774B2 (en) | 2020-12-15 | 2021-12-28 | Method for automatically compressing multitask-oriented pre-trained language model and platform thereof |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011470331.3 | 2020-12-15 | ||
CN202011470331.3A CN112232511B (zh) | 2020-12-15 | 2020-12-15 | Automatic compression method and platform for multitask-oriented pre-trained language models |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/564,071 Continuation US11526774B2 (en) | 2020-12-15 | 2021-12-28 | Method for automatically compressing multitask-oriented pre-trained language model and platform thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022126683A1 WO2022126683A1 (zh) | 2022-06-23 |
Family
ID=74123619
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/138016 WO2022126683A1 (zh) | Automatic compression method and platform for multitask-oriented pre-trained language models | 2020-12-15 | 2020-12-21 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112232511B (zh) |
WO (1) | WO2022126683A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
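CN117216225A (zh) * | 2023-10-19 | 2023-12-12 | 四川大学 | 3D visual question answering method based on tri-modal knowledge distillation |
CN117787922A (zh) * | 2024-02-27 | 2024-03-29 | 东亚银行(中国)有限公司 | Anti-money-laundering business processing method, ***, device, and medium based on distillation learning and automatic learning |
CN117807235A (zh) * | 2024-01-17 | 2024-04-02 | 长春大学 | Text classification method based on distillation of internal model features |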
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2619569A (en) * | 2020-12-15 | 2023-12-13 | Zhejiang Lab | Method and platform for automatically compressing multi-task-oriented pre-training language model |
US11527074B1 (en) | 2021-11-24 | 2022-12-13 | Continental Automotive Technologies GmbH | Systems and methods for deep multi-task learning for embedded machine vision applications |
CN114298224B (zh) * | 2021-12-29 | 2024-06-18 | 云从科技集团股份有限公司 | Image classification method, apparatus, and computer-readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
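CN111062489A (zh) * | 2019-12-11 | 2020-04-24 | 北京知道智慧信息技术有限公司 | Multilingual model compression method and device based on knowledge distillation |
CN111767711A (zh) * | 2020-09-02 | 2020-10-13 | 之江实验室 | Compression method and platform for pre-trained language models based on knowledge distillation |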
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111611377B (zh) * | 2020-04-22 | 2021-10-29 | 淮阴工学院 | Method and device for training a multi-layer neural network language model based on knowledge distillation |
- 2020
  - 2020-12-15 CN CN202011470331.3A patent/CN112232511B/zh active Active
  - 2020-12-21 WO PCT/CN2020/138016 patent/WO2022126683A1/zh active Application Filing
Non-Patent Citations (1)
Title |
---|
ANTONIO BARBALAU; ADRIAN COSMA; RADU TUDOR IONESCU; MARIUS POPESCU: "Black-Box Ripper: Copying black-box models using generative evolutionary algorithms", arXiv.org, Cornell University Library, 201 Olin Library, Cornell University, Ithaca, NY 14853, 21 October 2020 (2020-10-21), XP081792289 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
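CN117216225A (zh) * | 2023-10-19 | 2023-12-12 | 四川大学 | 3D visual question answering method based on tri-modal knowledge distillation |
CN117216225B (zh) * | 2023-10-19 | 2024-06-04 | 四川大学 | 3D visual question answering method based on tri-modal knowledge distillation |
CN117807235A (zh) * | 2024-01-17 | 2024-04-02 | 长春大学 | Text classification method based on distillation of internal model features |
CN117807235B (zh) * | 2024-01-17 | 2024-05-10 | 长春大学 | Text classification method based on distillation of internal model features |
CN117787922A (zh) * | 2024-02-27 | 2024-03-29 | 东亚银行(中国)有限公司 | Anti-money-laundering business processing method, ***, device, and medium based on distillation learning and automatic learning |
CN117787922B (zh) * | 2024-02-27 | 2024-05-31 | 东亚银行(中国)有限公司 | Anti-money-laundering business processing method, ***, device, and medium based on distillation learning and automatic learning |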
Also Published As
Publication number | Publication date |
---|---|
CN112232511A (zh) | 2021-01-15 |
CN112232511B (zh) | 2021-03-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
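WO2022126683A1 (zh) | Automatic compression method and platform for multitask-oriented pre-trained language models | |
WO2022141754A1 (zh) | Automatic pruning method and platform for a general compression architecture of convolutional neural networks | |
WO2022126797A1 (zh) | Automatic compression method and platform for pre-trained language models based on multilevel knowledge distillation | |
US10984308B2 (en) | Compression method for deep neural networks with load balance | |
US11501171B2 (en) | Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation | |
US20190034796A1 (en) | Fixed-point training method for deep neural networks based on static fixed-point conversion scheme | |
JP7381814B2 (ja) | Automatic compression method and platform for multitask-oriented pre-trained language models | |
KR102592585B1 (ko) | Method and apparatus for building a translation model | |
CN113033786B (zh) | Method and device for constructing a fault diagnosis model based on a temporal convolutional network | |
CN112578089B (zh) | Air pollutant concentration prediction method based on an improved TCN | |
CN114398976A (zh) | Machine reading comprehension method based on BERT and a gated attention-like enhancement network | |
CN112347756A (zh) | Inferential reading comprehension method and *** based on serialized evidence extraction | |
CN107579816A (zh) | Password dictionary generation method based on a recurrent neural network | |
CN111058840A (zh) | Organic carbon content (TOC) evaluation method based on a high-order neural network | |
CN117539977A (zh) | Method and device for training a language model | |
Duggal et al. | High performance SqueezeNext for CIFAR-10 | |
CN114139674A (zh) | Behavior cloning method, electronic device, storage medium, and program product | |
CN113051353A (zh) | Knowledge graph path reachability prediction method based on an attention mechanism | |
WO2023082045A1 (zh) | Method and device for neural network architecture search | |
CN114637863B (zh) | Propagation-based knowledge graph recommendation method | |
CN117807235B (zh) | Text classification method based on distillation of internal model features | |
CN118133680A (zh) | Method for identifying transmission network branch parameters based on a graph attention network | |
Li | Research on University Book Purchasing Model Based on Genetic-Neural Network Algorithm | |
Zhang et al. | Optimization of Neural Networks Based on Genetic Algorithms for SL Datasets and Applications | |
CN118014126A (zh) | Substation carbon emission prediction method and *** based on an ISSA-BP neural network model |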
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20965696; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 202214196; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20201221 |
ENP | Entry into the national phase | Ref document number: 2022570738; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | EP: PCT application non-entry in European phase | Ref document number: 20965696; Country of ref document: EP; Kind code of ref document: A1 |
32PN | EP: public notification in the EP bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 061223) |