WO2022126683A1 - Multi-task-oriented automatic compression method and platform for pre-trained language models - Google Patents

Multi-task-oriented automatic compression method and platform for pre-trained language models (面向多任务的预训练语言模型自动压缩方法及平台) Download PDF

Info

Publication number
WO2022126683A1
WO2022126683A1 PCT/CN2020/138016 CN2020138016W WO2022126683A1 WO 2022126683 A1 WO2022126683 A1 WO 2022126683A1 CN 2020138016 W CN2020138016 W CN 2020138016W WO 2022126683 A1 WO2022126683 A1 WO 2022126683A1
Authority
WO
WIPO (PCT)
Prior art keywords
distillation
model
network
knowledge
task
Prior art date
Application number
PCT/CN2020/138016
Other languages
English (en)
French (fr)
Inventor
王宏升
单海军
傅家庆
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Priority to GB2214196.4A priority Critical patent/GB2619569A/en
Priority to JP2022570738A priority patent/JP7381814B2/ja
Priority to US17/564,071 priority patent/US11526774B2/en
Publication of WO2022126683A1 publication Critical patent/WO2022126683A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models

Definitions

  • the invention belongs to the field of language model compression, and in particular relates to a multitask-oriented pre-training language model automatic compression method and platform.
  • most knowledge distillation strategies used in existing model compression are layer-by-layer knowledge distillation: given a teacher network and a student network, the student is supervised by minimizing the distance between the feature maps of the two networks, layer by layer.
  • when training data is sufficient, this method can usually achieve good results; with small-sample data, however, training is prone to overfitting, and the estimation error increases significantly and propagates layer by layer.
  • the core challenge of neural network compression in the case of small samples is therefore that the compressed model easily overfits the few training instances, resulting in a large estimation error relative to the original network during inference; estimation errors may accumulate and propagate layer by layer, eventually corrupting the network output.
  • the existing knowledge distillation methods are mainly data-driven sparse constraints or manually designed distillation strategies; considering that a BERT network usually has 12 layers of Transformer units, each containing an 8-head self-attention unit, the possible connection configurations of the self-attention units number in the hundreds of millions. Owing to the limitation of computing resources, it is almost impossible to manually design all possible distillation structures and find the optimal structure.
  • the purpose of the present invention is to provide a multi-task-oriented automatic compression method and platform for pre-trained language models, addressing the deficiencies of the prior art.
  • a multi-task-oriented automatic compression method for pre-trained language models comprises three stages:
  • the first stage is to construct a knowledge distillation encoding vector based on Transformer layer sampling: a Bernoulli distribution is used to perform layer sampling on all Transformer units of the BERT model to generate the knowledge distillation encoding vector;
  • the second stage is to train the meta-learning knowledge distillation network: define a search space, input the knowledge distillation encoding vectors constructed in the first stage into the search space, and eliminate the knowledge distillation encoding vectors that do not meet the conditions; define a structure generator that takes a knowledge distillation encoding vector as input, outputs the weight matrix used to construct the distillation structure model, and generates the corresponding distillation structure model; the generated distillation structure model is trained to update the structure generator;
  • the third stage is the process of searching distillation structure models based on an evolutionary algorithm: input multiple knowledge distillation encoding vectors that satisfy the specific constraint into the structure generator updated in the second stage to generate the corresponding weight matrices and obtain multiple distillation structure models; evaluate the accuracy of each distillation structure model; use the evolutionary algorithm to search for the distillation structure model with the highest accuracy that satisfies the specific constraint, obtaining a general compression architecture.
  • the first stage is specifically: Bernoulli sampling is performed in turn on the 12 layers of Transformer units of the BERT model to generate the knowledge distillation encoding vector, with each layer corresponding to a random variable; when the probability of the random variable being 1 is greater than or equal to 0.5, the corresponding element of the knowledge distillation encoding vector is 1, meaning that the current Transformer unit performs transfer learning; when the probability of the random variable being 1 is less than 0.5, the corresponding element of the layer sampling vector is 0, meaning that the current Transformer unit does not perform transfer learning.
  • defining the search space is specifically: the number of elements equal to 1 in the knowledge distillation encoding vector is not less than 6.
  • defining the structure generator is specifically: the structure generator is composed of two fully connected layers, the input is the knowledge distillation encoding vector constructed in the first stage, and the output is the weight matrix used to generate the distillation structure model.
  • training the generated distillation structure model to update the structure generator includes the following sub-steps: step (2.1), input the knowledge distillation encoding vector into the structure generator and output the weight matrix; step (2.2), construct the distillation structure model based on the weight matrix output by the structure generator; step (2.3), jointly train the structure generator and the distillation structure model, with the structure generator also trained in combination with Bernoulli distribution sampling.
  • the step (2.2) is specifically: according to the knowledge distillation encoding vector constructed in the first stage, in which each element corresponds to one layer of Transformer units, layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network; the weights of the Transformer units whose corresponding elements of the knowledge distillation encoding vector are 1 in the teacher model are used to initialize the Transformer units migrated to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights; the knowledge distillation encoding vector thus establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated according to the knowledge distillation encoding vector.
  • the method of training the structure generator in combination with Bernoulli distribution sampling is specifically: a Bernoulli distribution is used to perform layer sampling on each layer of Transformer units to construct different knowledge distillation encoding vectors, and the training data set is used for multiple rounds of iterative training; in each iteration, the structure generator and the distillation structure model are trained simultaneously based on one knowledge distillation encoding vector, and a structure generator that can generate weight matrices for different distillation structure models is obtained by varying the input knowledge distillation encoding vector.
  • the third stage includes the following sub-steps:
  • gene mutation refers to randomly changing the values of some elements in a gene; gene recombination refers to randomly recombining the genes of two parents; new genes that do not satisfy the specific constraint are eliminated.
  • a platform based on the above multi-task-oriented automatic compression method for pre-trained language models includes the following components:
  • Data loading component: used to obtain training samples of the multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy the supervised learning task;
  • Automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, including a knowledge distillation vector encoding module, a distillation network generation module, a structure generator and distillation network joint training module, a distillation network search module and a task-specific fine-tuning module;
  • the knowledge distillation vector encoding module includes the Transformer layer sampling vector; in the forward propagation process, the knowledge distillation encoding vector is input into the structure generator to generate the distillation network of the corresponding structure and the weight matrix of the structure generator;
  • the distillation network generation module constructs, based on the structure generator, a distillation network corresponding to the currently input knowledge distillation encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of encoder units in the input and output of the distillation structure corresponding to the knowledge distillation encoding vector;
  • the structure generator and distillation network joint training module trains the structure generator end to end; specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a small batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated;
  • the distillation network search module searches for the distillation network with the highest accuracy that satisfies the specific constraint, and an evolutionary algorithm is proposed for this search; the knowledge distillation encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy;
  • each distillation network is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene of the distillation network; under the specific constraint, a series of knowledge distillation encoding vectors are first selected as the genes of the distillation networks, and the accuracy of each corresponding distillation network is obtained by evaluation on the validation set; then the top k genes with the highest accuracy are selected, and gene recombination and mutation are used to generate new genes; by further repeating the process of selecting the top k optimal genes and the process of generating new genes, the gene that satisfies the constraint and has the highest accuracy is obtained iteratively;
  • the task-specific fine-tuning module builds a downstream task network on the pre-trained distillation network generated by the automatic compression component, fine-tunes it for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a specified container, from which the logged-in user can download it, and the comparison of the model size before and after compression is presented on the compressed-model output page of the platform;
  • Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference, on a real-world dataset, on new data of the natural language processing downstream task uploaded by the logged-in user; the comparison of the inference speed before and after compression is presented on the compressed-model inference page of the platform.
  • the beneficial effects of the present invention are: first, the present invention studies knowledge distillation based on meta-learning to generate a general compression architecture for multiple pre-trained language models; second, on the basis of the trained meta-learning network, an evolutionary algorithm is used to search for the optimal compression structure, resulting in an optimal general compression architecture for task-independent pre-trained language models.
  • with the multi-task-oriented automatic compression platform for pre-trained language models of the present invention, a general architecture of the multi-task-oriented pre-trained language model is generated by compression, the already compressed model architecture is fully reused to improve the compression efficiency of downstream tasks, and large-scale natural language processing models can be deployed on end-side devices with small memory and limited resources, which promotes the industrial deployment of general-purpose deep language models.
  • Figure 1 is an overall architecture diagram of the compression method of the present invention combined with a specific task;
  • Figure 2 is a training flowchart of the meta-learning knowledge distillation network;
  • Figure 3 is an architecture diagram of building a distillation network based on the structure generator;
  • Figure 4 is a diagram of the joint training process of the structure generator and the distillation network;
  • Figure 5 is a diagram of the distillation network search architecture based on the evolutionary algorithm.
  • the present invention studies knowledge distillation based on meta-learning to generate a general compression architecture for multiple pre-trained language models.
  • specifically, a knowledge distillation encoding vector based on Transformer layer sampling is first constructed, and the knowledge structure of the large model is distilled at different levels.
  • a meta-network of a structure generator is designed, and the structure generator is used to generate the distillation structure model corresponding to the currently input encoding vector.
  • at the same time, a method of Bernoulli distribution sampling is proposed to train the structure generator.
  • in each iteration, Bernoulli distribution sampling is used to decide which encoder units are migrated, forming the corresponding encoding vector.
  • by varying the encoding vector input to the structure generator and the mini-batch of training data, and jointly training the structure generator and the corresponding distillation structure, a structure generator that can generate weights for different distillation structures can be learned.
  • on the basis of the trained meta-learning network, an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression structure of the task-independent pre-trained language model.
  • the invention solves the problems of overfitting and low generalization ability of the compressed model during compression of the BERT model with few-sample data, explores in depth the feasibility and key technologies of language understanding by large-scale deep language models under few-sample conditions, and improves the flexibility and effectiveness of the compressed model when used for a variety of downstream tasks.
  • compared with existing knowledge distillation methods, the knowledge distillation of meta-learning can completely free manpower from tedious hyperparameter tuning, while allowing the use of multiple objective metrics to directly optimize the compressed model.
  • compared with other automatic machine learning methods, the knowledge distillation of meta-learning can easily enforce conditional constraints when searching for the desired compression structure, without manual tuning of reinforcement learning hyperparameters.
  • the application technical route of the compression method of the present invention is shown in Figure 1.
  • the present invention is a multi-task-oriented automatic compression method for pre-trained language models, and the whole process is divided into three stages: the first stage is to construct the knowledge distillation encoding vector based on Transformer layer sampling; the second stage is to train the meta-learning knowledge distillation network; the third stage is to search for the optimal compression structure based on an evolutionary algorithm; specifically:
  • Stage 1: construct the knowledge distillation encoding vector based on Transformer layer sampling.
  • the Bernoulli distribution is used to perform layer sampling on all Transformer units of the BERT model to generate a layer sampling vector, that is, the knowledge distillation encoding vector.
  • specifically, suppose the i-th Transformer unit (encoder) is currently considered for migration; the random variable X_i ~ Bernoulli(p) is an independent Bernoulli random variable whose probability of being 1 is p and whose probability of being 0 is 1 - p.
  • the 12 layers of Transformer units of the BERT model are Bernoulli-sampled in turn using the random variables X_i, generating a vector consisting of 12 elements that are 0 or 1.
  • when the probability p of the random variable X_i being 1 is greater than or equal to 0.5, the corresponding element of the layer sampling vector is 1, representing that the current Transformer unit performs transfer learning; when the probability of X_i being 1 is less than 0.5, the corresponding element of the layer sampling vector is 0, meaning that the current Transformer unit does not perform transfer learning.
  • the Bernoulli sampling method is applied in turn to all Transformer units contained in the BERT model to form the knowledge distillation encoding vector layer_sample = [l_1 … l_i … l_12], where l_i is the i-th element, i = 1 to 12.
  • Stage 2: train the meta-learning knowledge distillation network.
  • as shown in Figure 2, a search space is defined; the knowledge distillation encoding vectors constructed in the first stage are input into the search space, and the vectors that do not meet the constraint are eliminated; a structure generator is defined, which takes the filtered knowledge distillation encoding vector as input, outputs the weight matrix used to construct the distillation network, and generates the corresponding distillation structure model; the generated distillation structure is trained with batches of data and updated, thereby updating the structure generator; finally, the weights output by the iteratively updated structure generator are output.
  • the structure generator is a meta-network consisting of two fully connected layers; the input is the knowledge distillation encoding vector constructed in the first stage, and the output is the weight matrix used to generate the distillation structure model.
  • Training the structure generator includes the following sub-steps:
  • Step 1: in the forward propagation process, the knowledge distillation encoding vector is input into the structure generator, which outputs the weight matrix.
  • Step 2: Figure 3 shows the process of building the distillation structure model based on the structure generator: according to the knowledge distillation encoding vector constructed in the first stage, in which each element l_i corresponds to one layer of Transformer units, layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network; the weights of the Transformer units whose corresponding elements of the knowledge distillation encoding vector are 1 in the teacher model are used to initialize the Transformer units migrated to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights; the knowledge distillation encoding vector establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated according to the knowledge distillation encoding vector.
  • Step 3: Figure 4 shows the process of jointly training the structure generator and the distillation structure model: a small batch of training data is input into the distillation structure model generated in step 2 for model training; after the distillation structure model updates its parameters (the weight matrix), the structure generator is also updated according to the updated parameters; that is, in the process of back-propagation, the distillation structure model and the structure generator are updated together; the gradients of the weights output by the structure generator can be computed using the chain rule, so the structure generator can be trained end to end.
  • a method of Bernoulli distribution sampling is proposed to train the structure generator, specifically: a Bernoulli distribution is used to perform layer sampling on each layer of Transformer units to construct different knowledge distillation encoding vectors, and the same training data set is used for multiple rounds of iterative training; in each iteration, the structure generator and the distillation structure model are trained simultaneously based on one knowledge distillation encoding vector, and a structure generator that can generate weight matrices for different distillation structure models is obtained by varying the input knowledge distillation encoding vector.
  • the shape of the weight matrix output by the structure generator needs to be adjusted to be consistent with the number of encoder units in the input and output of the distillation structure corresponding to the knowledge distillation encoding vector, i.e., consistent with the encoding vector obtained by layer sampling; specifically, the shape is adjusted according to the number and positions of the Transformer units whose elements are 1 in the encoding vector.
  • Figure 5 shows the process of distillation network search based on the evolutionary algorithm:
  • Step 1: each distillation structure model is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene G of the distillation structure model, and a series of genes that satisfy the constraint C are randomly selected as the initial population.
  • Step 2: evaluate the inference accuracy, on the validation set, of the distillation structure model corresponding to each gene G_i in the existing population, and select the top k genes with the highest accuracy.
  • Step 3: use the top k genes with the highest accuracy selected in step 2 to perform gene recombination and gene mutation to generate new genes, and add the new genes to the existing population.
  • gene mutation means mutating by randomly changing the values of some elements in a gene; gene recombination means randomly recombining the genes of two parents to produce offspring; the constraint C can easily be enforced by eliminating unqualified genes.
  • Step 4: repeat steps 2 and 3 for N rounds of iteration, selecting the top k genes with the highest accuracy in the existing population and generating new genes, until the gene that satisfies the constraint C and has the highest accuracy is obtained.
  • the multi-task-oriented automatic compression platform for pre-trained language models of the present invention includes the following components:
  • Data loading component: used to obtain the BERT model to be compressed, which contains a specific natural language processing downstream task and is uploaded by the logged-in user, and the training samples of the multi-task-oriented pre-trained language model; the training samples are labeled text samples that satisfy the supervised learning task.
  • Automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, including a knowledge distillation vector encoding module, a distillation network generation module, a structure generator and distillation network joint training module, a distillation network search module, and a task-specific fine-tuning module.
  • the knowledge distillation vector encoding module includes the Transformer layer sampling vector. In the forward propagation process, the distillation network encoding vector is input into the structure generator, generating the distillation network of the corresponding structure and the weight matrix of the structure generator.
  • the distillation network generation module constructs, based on the structure generator, the distillation network corresponding to the currently input encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of encoder units in the input and output of the distillation structure corresponding to the encoding vector.
  • the structure generator and distillation network joint training module trains the structure generator end to end; specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a small batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated.
  • the distillation network search module searches for the distillation network with the highest accuracy that satisfies the specific constraint, and an evolutionary algorithm is proposed for this search.
  • in the evolutionary search algorithm, each distillation network is generated from an encoding vector based on Transformer layer sampling, so the distillation network encoding vector is defined as the gene of the distillation network.
  • a series of distillation network encoding vectors are first selected as the genes of the distillation networks, and the accuracy of each corresponding distillation network is obtained by evaluation on the validation set. Then, the top k genes with the highest accuracy are selected, and gene recombination and mutation are used to generate new genes. By further repeating the process of selecting the top k optimal genes and the process of generating new genes, the gene that satisfies the constraint and has the highest accuracy is obtained.
  • the task-specific fine-tuning module builds a downstream task network on the pre-trained distillation network generated by the automatic compression component, fine-tunes it for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user.
  • the compressed model is output to a specified container, from which the logged-in user can download it, and the comparison of the model size before and after compression is presented on the compressed-model output page of the platform.
  • Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference, on a real-world dataset, on new data of the natural language processing downstream task uploaded by the logged-in user. The comparison of the inference speed before and after compression is presented on the compressed-model inference page of the platform.
  • the logged-in user can directly download the trained pre-trained language model provided by the platform of the present invention, build a downstream task network on the basis of the compressed pre-trained model architecture generated by the platform according to the user's need for a specific natural language processing downstream task, fine-tune it, and finally deploy it on terminal devices; inference on natural language processing downstream tasks can also be performed directly on the platform.
  • in one embodiment, the platform obtains the BERT model for a single-sentence text classification task and the sentiment analysis dataset SST-2 uploaded by the logged-in user; the BERT pre-trained model generated by the automatic compression component is loaded by the platform, a model for the text classification task is built on the generated pre-trained model, and the platform finally outputs the compressed BERT model containing the text classification task required by the logged-in user.
  • the compressed model is output to the designated container, from which the logged-in user can download it, and the comparison of the model size before and after compression is presented on the compressed-model output page of the platform.
  • the size of the model before compression is 110M and the size after compression is 56M, a 49% reduction, as shown in Table 1.
  • the compressed model output by the platform is used to perform inference on the SST-2 test set data uploaded by the logged-in user, and the compressed-model inference page of the platform shows that the inference speed after compression is 2.01 times faster than before compression, and the inference accuracy improves from 91.5% before compression to 92.0% after compression.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Feedback Control In General (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a multi-task-oriented automatic compression method and platform for pre-trained language models. The method designs a meta-network of a structure generator and constructs a knowledge distillation encoding vector based on a Transformer-layer-sampling knowledge distillation method; the structure generator is used to generate the distillation structure model corresponding to the currently input encoding vector. At the same time, a Bernoulli distribution sampling method is proposed to train the structure generator: in each iteration, Bernoulli distribution sampling is used to decide which encoder units are migrated, forming the corresponding encoding vector. By varying the encoding vector input to the structure generator and the mini-batch of training data, and jointly training the structure generator and the corresponding distillation structure, a structure generator that can generate weights for different distillation structures can be learned. On the basis of the trained meta-learning network, an evolutionary algorithm is then used to search for the optimal compression structure, thereby obtaining the optimal general compression architecture of the task-independent pre-trained language model.

Description

Multi-task-oriented automatic compression method and platform for pre-trained language models
Technical Field
The present invention belongs to the field of language model compression, and in particular relates to a multi-task-oriented automatic compression method and platform for pre-trained language models.
Background Art
Large-scale pre-trained language models have achieved excellent performance on both natural language understanding and generation tasks; however, deploying pre-trained language models with massive numbers of parameters on devices with limited memory remains a great challenge. In the field of model compression, existing language model compression methods all compress language models for specific tasks. Although task-specific knowledge distillation is effective, fine-tuning and inference of the large model remain time-consuming and laborious, and the computational cost is high. When facing other downstream tasks, a pre-trained model generated by task-specific knowledge distillation still requires re-fine-tuning the large model and regenerating the associated large-model knowledge.
Most knowledge distillation strategies used in existing model compression are layer-by-layer knowledge distillation: given a teacher network and a student network, the student network is supervised by minimizing the distance between the feature maps of the two networks, layer by layer. When training data is sufficient, this method can usually achieve good results. With small-sample data, however, training is prone to overfitting, and the estimation error increases significantly and propagates layer by layer. Therefore, the core challenge of neural network compression in the small-sample case is that the compressed model easily overfits the few training instances, resulting in a large estimation error relative to the original network during inference. The estimation errors may accumulate and propagate layer by layer, eventually corrupting the network output.
In addition, existing knowledge distillation methods mainly rely on data-driven sparsity constraints or manually designed distillation strategies; considering that a BERT network usually has 12 layers of Transformer units, each containing an 8-head self-attention unit, the possible connection configurations of the self-attention units number in the hundreds of millions. Owing to limitations such as computing resources, manually designing all possible distillation structures and finding the optimal one is practically impossible.
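As a rough illustration of the scale involved (a back-of-envelope count, not a figure taken from the patent): even restricting attention to which of the 12 Transformer layers are migrated already gives 2^12 = 4096 candidate structures, 2510 of which keep at least 6 layers; the hundreds of millions of cases cited above arise once the connection patterns of the attention units inside each layer are also considered.

```python
from math import comb

total_layer_choices = 2 ** 12                                 # all 0/1 layer-sampling vectors
valid_layer_choices = sum(comb(12, k) for k in range(6, 13))  # vectors keeping at least 6 layers
print(total_layer_choices, valid_layer_choices)               # 4096 2510
```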
Summary of the Invention
The purpose of the present invention is to provide a multi-task-oriented automatic compression method and platform for pre-trained language models, addressing the deficiencies of the prior art.
The purpose of the present invention is achieved through the following technical solution: a multi-task-oriented automatic compression method for pre-trained language models, comprising three stages:
In the first stage, a knowledge distillation encoding vector based on Transformer layer sampling is constructed: a Bernoulli distribution is used to perform layer sampling on all Transformer units of the BERT model to generate the knowledge distillation encoding vector;
In the second stage, the meta-learning knowledge distillation network is trained: a search space is defined, the knowledge distillation encoding vectors constructed in the first stage are input into the search space, and the encoding vectors that do not satisfy the conditions are eliminated; a structure generator is defined, which takes the filtered knowledge distillation encoding vector as input, outputs the weight matrix used to construct the distillation structure model, and generates the corresponding distillation structure model; the generated distillation structure model is trained so as to update the structure generator;
The third stage is the process of searching distillation structure models based on an evolutionary algorithm: multiple knowledge distillation encoding vectors satisfying the specific constraint are input into the structure generator updated in the second stage to generate the corresponding weight matrices, yielding multiple distillation structure models; the accuracy of each distillation structure model is evaluated; the evolutionary algorithm is used to search for the distillation structure model with the highest accuracy that satisfies the specific constraint, yielding a general compression architecture.
Further, the first stage is specifically: Bernoulli sampling is performed in turn on the 12 layers of Transformer units of the BERT model to generate the knowledge distillation encoding vector, with each layer corresponding to a random variable; when the probability of the random variable being 1 is greater than or equal to 0.5, the corresponding element of the knowledge distillation encoding vector is 1, indicating that the current Transformer unit performs transfer learning; when the probability of the random variable being 1 is less than 0.5, the corresponding element of the layer sampling vector is 0, indicating that the current Transformer unit does not perform transfer learning.
Further, defining the search space is specifically: the number of elements equal to 1 in the knowledge distillation encoding vector is not less than 6.
Further, defining the structure generator is specifically: the structure generator consists of two fully connected layers, its input is the knowledge distillation encoding vector constructed in the first stage, and its output is the weight matrix used to generate the distillation structure model.
Further, training the generated distillation structure model so as to update the structure generator includes the following sub-steps:
Step (2.1): input the knowledge distillation encoding vector into the structure generator and output the weight matrix;
Step (2.2): construct the distillation structure model based on the weight matrix output by the structure generator;
Step (2.3): jointly train the structure generator and the distillation structure model: input training data into the distillation structure model generated in step (2.2) for model training, with the structure generator updated together with it; at the same time, the structure generator is trained in combination with the Bernoulli distribution sampling method.
Further, step (2.2) is specifically: according to the knowledge distillation encoding vector constructed in the first stage, in which each element corresponds to one layer of Transformer units, layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network, and the weights of the Transformer units whose corresponding elements of the knowledge distillation encoding vector are 1 in the teacher model are used to initialize the Transformer units migrated to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights; the knowledge distillation encoding vector establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated according to the knowledge distillation encoding vector.
Further, the method of training the structure generator in combination with Bernoulli distribution sampling is specifically: a Bernoulli distribution is used to perform layer sampling on each layer of Transformer units to construct different knowledge distillation encoding vectors, and the training data set is used for multiple rounds of iterative training; in each iteration, the structure generator and the distillation structure model are trained simultaneously based on one knowledge distillation encoding vector, and a structure generator capable of generating weight matrices for different distillation structure models is learned by varying the input knowledge distillation encoding vector.
Further, the third stage includes the following sub-steps:
Step (3.1): define the knowledge distillation encoding vector as the gene of the distillation structure model, and randomly select a series of genes satisfying the specific constraint as the initial population;
Step (3.2): evaluate the accuracy of the distillation structure model corresponding to each gene in the existing population, and select the top k genes with the highest accuracy;
Step (3.3): use the top k genes selected in step (3.2) to generate new genes through gene recombination and gene mutation, and add the new genes to the existing population;
Step (3.4): repeat steps (3.2) to (3.3) for a set number of iterations, selecting the top k genes with the highest accuracy in the existing population and generating new genes, until the gene that satisfies the specific constraint and has the highest accuracy is finally obtained.
Further, in step (3.3), gene mutation refers to randomly changing the values of some elements in a gene; gene recombination refers to randomly recombining the genes of two parents; new genes that do not satisfy the specific constraint are eliminated.
A platform based on the above multi-task-oriented automatic compression method for pre-trained language models includes the following components:
Data loading component: used to obtain training samples of the multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy the supervised learning task;
Automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, including a knowledge distillation vector encoding module, a distillation network generation module, a structure generator and distillation network joint training module, a distillation network search module, and a task-specific fine-tuning module;
The knowledge distillation vector encoding module includes the Transformer layer sampling vector; in the forward propagation process, the knowledge distillation encoding vector is input into the structure generator to generate the distillation network of the corresponding structure and the weight matrix of the structure generator;
The distillation network generation module constructs, based on the structure generator, the distillation network corresponding to the currently input knowledge distillation encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of encoder units in the input and output of the distillation structure corresponding to the knowledge distillation encoding vector;
The structure generator and distillation network joint training module trains the structure generator end to end; specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a small batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated;
The distillation network search module searches for the distillation network with the highest accuracy that satisfies the specific constraint, and an evolutionary algorithm is proposed for this search; the knowledge distillation encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy; in the evolutionary search algorithm used in the meta-learning distillation network, each distillation network is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene of the distillation network; under the specific constraint, a series of knowledge distillation encoding vectors are first selected as genes of distillation networks, and the accuracy of each corresponding distillation network is obtained by evaluation on the validation set; then the top k genes with the highest accuracy are selected, and gene recombination and mutation are used to generate new genes; by further repeating the process of selecting the top k optimal genes and generating new genes, the gene that satisfies the constraint and has the highest accuracy is obtained;
The task-specific fine-tuning module builds a downstream task network on the pre-trained distillation network generated by the automatic compression component, fine-tunes it for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a specified container for the logged-in user to download, and the comparison of the model size before and after compression is presented on the compressed-model output page of the platform;
Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference, on a real-world dataset, on new data of the natural language processing downstream task uploaded by the logged-in user; the comparison of the inference speed before and after compression is presented on the compressed-model inference page of the platform.
The beneficial effects of the present invention are as follows: first, the present invention studies knowledge distillation based on meta-learning to generate a general compression architecture for multiple pre-trained language models; second, on the basis of the trained meta-learning network, an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression architecture of the task-independent pre-trained language model. With the multi-task-oriented automatic compression platform for pre-trained language models of the present invention, a general architecture of the multi-task-oriented pre-trained language model is generated by compression, the already compressed model architecture is fully reused to improve the compression efficiency of downstream tasks, and large-scale natural language processing models can be deployed on end-side devices with small memory and limited resources, which promotes the industrial deployment of general-purpose deep language models.
Brief Description of the Drawings
Figure 1 is an overall architecture diagram of the compression method of the present invention combined with a specific task;
Figure 2 is a training flowchart of the meta-learning knowledge distillation network;
Figure 3 is an architecture diagram of building a distillation network based on the structure generator;
Figure 4 is a diagram of the joint training process of the structure generator and the distillation network;
Figure 5 is a diagram of the distillation network search architecture based on the evolutionary algorithm.
Detailed Description of the Embodiments
Inspired by neural architecture search, and in particular by the fact that, in few-sample settings, automatic machine learning can perform automatic knowledge distillation iteratively based on a feedback loop, the present invention studies knowledge distillation based on meta-learning to generate a general compression architecture for multiple pre-trained language models. Specifically, the present invention first constructs a knowledge distillation encoding vector based on Transformer layer sampling, distilling the knowledge structure of the large model at different levels. A meta-network of a structure generator is designed, and the structure generator is used to generate the distillation structure model corresponding to the currently input encoding vector. At the same time, a Bernoulli distribution sampling method is proposed to train the structure generator. In each iteration, Bernoulli distribution sampling is used to decide which encoder units are migrated, forming the corresponding encoding vector. By varying the encoding vector input to the structure generator and the mini-batch of training data, and jointly training the structure generator and the corresponding distillation structure, a structure generator capable of generating weights for different distillation structures can be learned. At the same time, on the basis of the trained meta-learning network, an evolutionary algorithm is used to search for the optimal compression structure, thereby obtaining the optimal general compression architecture of the task-independent pre-trained language model. The present invention solves the problems of overfitting and low generalization ability of the compressed model during compression of the BERT model with few-sample data, explores in depth the feasibility and key technologies of language understanding by large-scale deep language models under few-sample conditions, and improves the flexibility and effectiveness of the compressed model when used for a variety of downstream tasks. Compared with existing knowledge distillation methods, meta-learning knowledge distillation completely frees people from tedious hyperparameter tuning, while allowing multiple objective metrics to be used to directly optimize the compressed model. Compared with other automatic machine learning methods, meta-learning knowledge distillation can easily enforce constraint conditions when searching for the desired compression structure, without manually tuning reinforcement learning hyperparameters. The application route of the compression method of the present invention is shown in Figure 1: based on large-scale text datasets, knowledge distillation based on meta-learning and automatic distillation network search based on an evolutionary algorithm are studied, and the multi-task-oriented large-scale pre-trained language model is automatically compressed through meta-distillation learning into a task-independent general architecture that satisfies different hard constraints (such as the number of floating-point operations); when this general architecture is used, a downstream task network is built on top of the meta-distillation learning network, the downstream task dataset is input, and only the specific downstream task is fine-tuned, which saves computational cost and improves efficiency.
The multi-task-oriented automatic compression method for pre-trained language models of the present invention is divided into three stages: the first stage is to construct the knowledge distillation encoding vector based on Transformer layer sampling; the second stage is to train the meta-learning knowledge distillation network; the third stage is to search for the optimal compression structure based on an evolutionary algorithm. Specifically:
Stage 1: construct the knowledge distillation encoding vector based on Transformer layer sampling. A Bernoulli distribution is used to perform layer sampling on all Transformer units of the BERT model, generating a layer sampling vector, i.e., the knowledge distillation encoding vector.
Specifically, suppose the i-th Transformer unit (encoder) is currently considered for migration; the random variable X_i ~ Bernoulli(p) is an independent Bernoulli random variable whose probability of being 1 is p and whose probability of being 0 is 1 - p. The random variables X_i are used to perform Bernoulli sampling on the 12 layers of Transformer units of the BERT model in turn, generating a vector consisting of 12 elements that are 0 or 1. When the probability p of the random variable X_i being 1 is greater than or equal to 0.5, the corresponding element of the layer sampling vector is 1, indicating that the current Transformer unit performs transfer learning; when the probability of X_i being 1 is less than 0.5, the corresponding element of the layer sampling vector is 0, indicating that the current Transformer unit does not perform transfer learning. Bernoulli sampling is applied in turn to all Transformer units contained in the BERT model to form the knowledge distillation encoding vector layer_sample; in this embodiment layer_sample = [l_1 … l_i … l_12], where l_i is the i-th element of layer_sample, i = 1 to 12.
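A minimal sketch of this layer-sampling step is shown below (illustrative only; the use of NumPy, the function name, and the default p = 0.5 are assumptions rather than details taken from the patent):

```python
import numpy as np

def sample_layer_vector(num_layers: int = 12, p: float = 0.5, rng=None):
    """Draw one knowledge-distillation encoding vector layer_sample.

    Each element l_i is an independent draw of X_i ~ Bernoulli(p):
    1 means the i-th Transformer unit of the BERT teacher is migrated
    (transfer learning), 0 means it is not.
    """
    rng = np.random.default_rng() if rng is None else rng
    return (rng.random(num_layers) < p).astype(int)

layer_sample = sample_layer_vector()
print(layer_sample)   # e.g. [1 0 1 1 0 1 1 1 0 1 0 1]
```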
Stage 2: train the meta-learning knowledge distillation network. As shown in Figure 2, a search space is defined; the knowledge distillation encoding vectors constructed in the first stage are input into the search space, and the vectors that do not satisfy the constraint are eliminated; a structure generator is defined, which takes the filtered knowledge distillation encoding vector as input, outputs the weight matrix used to construct the distillation network, and generates the corresponding distillation structure model; the generated distillation structure is trained with batches of data and updated, thereby updating the structure generator; finally, the weights output by the iteratively updated structure generator are output.
Defining the search space: to prevent the number of Transformer units migrated by layer sampling (l_i = 1) from being too small, a layer-sampling constraint is introduced:
s.t. sum(l_i == 1) ≥ 6
That is, each time a knowledge distillation network structure is generated, a constraint is imposed at the layer-sampling stage over all Transformer units of the BERT model, so that the number of elements equal to 1 in the knowledge distillation encoding vector is not less than 6; otherwise, layer sampling is performed again.
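The constraint can be enforced by simply re-sampling, as in the hedged sketch below (self-contained for clarity; min_ones = 6 mirrors the constraint above, everything else is an illustrative assumption):

```python
import numpy as np

def sample_with_constraint(num_layers: int = 12, p: float = 0.5,
                           min_ones: int = 6, rng=None, max_tries: int = 1000):
    """Draw encoding vectors until sum(l_i == 1) >= min_ones, i.e. until the
    layer-sampling constraint of the search space is satisfied."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(max_tries):
        vec = (rng.random(num_layers) < p).astype(int)
        if vec.sum() >= min_ones:
            return vec
    raise RuntimeError("layer-sampling constraint not satisfied")

print(sample_with_constraint())   # a 0/1 vector with at least six 1s
```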
Defining the structure generator: the structure generator is a meta-network consisting of two fully connected layers; its input is the knowledge distillation encoding vector constructed in the first stage, and its output is the weight matrix used to generate the distillation structure model.
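One possible reading of this meta-network is sketched below in PyTorch (the hidden size, the per-layer weight dimension unit_dim, and the way rows are selected for the kept layers are illustrative assumptions; the patent only fixes the two-fully-connected-layer structure and its input/output roles):

```python
import torch
import torch.nn as nn

class StructureGenerator(nn.Module):
    """Meta-network of two fully connected layers: maps a 12-element
    knowledge-distillation encoding vector to the weight matrix used to
    build the distillation structure model."""

    def __init__(self, num_layers: int = 12, hidden_dim: int = 64, unit_dim: int = 768):
        super().__init__()
        self.unit_dim = unit_dim
        self.fc1 = nn.Linear(num_layers, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_layers * unit_dim)

    def forward(self, encoding: torch.Tensor) -> torch.Tensor:
        # encoding: float tensor of shape (num_layers,) with 0/1 entries
        h = torch.relu(self.fc1(encoding))
        flat = self.fc2(h).view(-1, self.unit_dim)  # one weight block per teacher layer
        # keep only the blocks for layers whose element is 1, so the output
        # shape matches the number and positions of migrated Transformer units
        return flat[encoding.bool()]

gen = StructureGenerator()
enc = torch.tensor([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1], dtype=torch.float32)
print(gen(enc).shape)   # torch.Size([8, 768])
```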
Training the structure generator includes the following sub-steps:
Step 1: in the forward propagation process, the knowledge distillation encoding vector is input into the structure generator, which outputs the weight matrix.
Step 2: Figure 3 shows the process of building the distillation structure model based on the structure generator:
According to the knowledge distillation encoding vector constructed in the first stage, in which each element l_i corresponds to one layer of Transformer units, layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network; the weights of the Transformer units whose corresponding elements of the knowledge distillation encoding vector are 1 in the teacher model are used to initialize the Transformer units migrated to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights; the knowledge distillation encoding vector establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated according to the knowledge distillation encoding vector.
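The layer migration described above can be illustrated with generic PyTorch encoder layers standing in for BERT's Transformer units (a sketch under assumptions: in the patent the copied weights come from the pre-trained BERT teacher and the student units are produced by the structure generator; here plain nn.TransformerEncoderLayer modules are used so the snippet runs on its own):

```python
import copy
import torch.nn as nn

def build_student(teacher_layers: nn.ModuleList, layer_sample) -> nn.ModuleList:
    """For every element of the encoding vector equal to 1, copy the
    corresponding teacher layer so its weights initialise the migrated
    student unit; elements equal to 0 are dropped from the student."""
    kept = [copy.deepcopy(layer)
            for layer, bit in zip(teacher_layers, layer_sample) if bit == 1]
    return nn.ModuleList(kept)

# Toy 12-layer teacher with 8-head self-attention, mirroring BERT-base sizes.
teacher = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
    for _ in range(12)
)
student = build_student(teacher, [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1])
print(len(student))   # 8 Transformer units migrated in this example
```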
Step 3: Figure 4 shows the process of jointly training the structure generator and the distillation structure model:
A small batch of training data is input into the distillation structure model generated in step 2 for model training; after the distillation structure model updates its parameters (the weight matrix), the structure generator is also updated according to the updated parameters; that is, during back-propagation the distillation structure model and the structure generator are updated together; the gradients of the weights output by the structure generator can be computed using the chain rule, so the structure generator can be trained end to end.
At the same time, a Bernoulli distribution sampling method is proposed to train the structure generator, specifically: a Bernoulli distribution is used to perform layer sampling on each layer of Transformer units to construct different knowledge distillation encoding vectors, and the same training data set is used for multiple rounds of iterative training; in each iteration, the structure generator and the distillation structure model are trained simultaneously based on one knowledge distillation encoding vector, and a structure generator capable of generating weight matrices for different distillation structure models is learned by varying the input knowledge distillation encoding vector.
Moreover, the shape of the weight matrix output by the structure generator needs to be adjusted so that it is consistent with the number of encoder units in the input and output of the distillation structure corresponding to the knowledge distillation encoding vector, i.e., consistent with the encoding vector obtained by layer sampling; specifically, the shape of the weight matrix output by the structure generator is adjusted according to the number and positions of the Transformer units whose elements are 1 in the encoding vector.
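The sketch below strings the previous pieces together into one heavily simplified joint-training loop; it reuses StructureGenerator and sample_with_constraint from the sketches above, collapses the distillation network into a stack of generated weight blocks, and uses random tensors in place of the real mini-batches and teacher feature maps, all of which are assumptions made for brevity:

```python
import torch
import torch.nn.functional as F

NUM_LAYERS, DIM = 12, 32                       # toy sizes; real BERT units are far larger
gen = StructureGenerator(num_layers=NUM_LAYERS, hidden_dim=64, unit_dim=DIM * DIM)
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

def distill_forward(x, weight_blocks):
    """Stand-in distillation network: each generated block acts as one layer."""
    for w in weight_blocks:
        x = torch.relu(F.linear(x, w.view(DIM, DIM)))
    return x

for step in range(100):                        # multiple rounds of iterative training
    enc = sample_with_constraint(NUM_LAYERS)   # a new encoding vector each round
    enc_t = torch.tensor(enc, dtype=torch.float32)
    weights = gen(enc_t)                       # blocks only for the migrated layers
    x = torch.randn(8, DIM)                    # stand-in mini-batch
    teacher_feat = torch.randn(8, DIM)         # stand-in teacher feature map
    loss = F.mse_loss(distill_forward(x, weights), teacher_feat)
    opt.zero_grad()
    loss.backward()                            # the chain rule reaches the generator,
    opt.step()                                 # so generator and generated distillation
                                               # weights are updated together
```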
Stage 3: Figure 5 shows the process of distillation network search based on the evolutionary algorithm:
On the basis of the meta-learning knowledge distillation network trained in the second stage, multiple knowledge distillation encoding vectors satisfying the specific constraint are input into the structure generator to generate the corresponding weight matrices, yielding multiple distillation structure models; each distillation structure model is evaluated on the validation set to obtain its accuracy; the evolutionary algorithm is used to search for the distillation structure model with the highest accuracy that satisfies the specific constraint (such as the number of floating-point operations), thereby obtaining the general compression architecture of the task-independent pre-trained language model, such as Network_2 marked with a box in Figure 5. The specific steps of the evolutionary search algorithm are as follows:
Step 1: each distillation structure model is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene G of the distillation structure model, and a series of genes satisfying the constraint C are randomly selected as the initial population.
Step 2: evaluate the inference accuracy, on the validation set, of the distillation structure model corresponding to each gene G_i in the existing population, and select the top k genes with the highest accuracy.
Step 3: use the top k genes with the highest accuracy selected in step 2 to generate new genes through gene recombination and gene mutation, and add the new genes to the existing population. Gene mutation means mutating by randomly changing the values of some elements in a gene; gene recombination means randomly recombining the genes of two parents to produce offspring; and the constraint C can easily be enforced by eliminating unqualified genes.
Step 4: repeat steps 2 and 3 for N rounds of iteration, selecting the top k genes with the highest accuracy in the existing population and generating new genes, until the gene that satisfies the constraint C and has the highest accuracy is obtained.
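Steps 1 to 4 can be condensed into the following sketch (function names, population size, mutation rate, and the generic fitness callback, for example one that builds the distillation network with the trained structure generator and measures validation accuracy, are all illustrative assumptions):

```python
import random

MIN_ONES = 6   # constraint C: at least six Transformer units migrated

def mutate(gene, p_mut=0.1):
    """Gene mutation: flip each element with probability p_mut, re-drawing
    until the new gene still satisfies constraint C."""
    while True:
        child = [1 - g if random.random() < p_mut else g for g in gene]
        if sum(child) >= MIN_ONES:
            return child

def recombine(g1, g2):
    """Gene recombination: uniform crossover of two parent genes, again
    discarding offspring that violate constraint C."""
    while True:
        child = [random.choice(pair) for pair in zip(g1, g2)]
        if sum(child) >= MIN_ONES:
            return child

def evolutionary_search(fitness, num_layers=12, pop_size=20, top_k=5, rounds=10):
    """fitness(gene) -> validation accuracy of the distillation model built
    from that gene (i.e. from that knowledge-distillation encoding vector)."""
    population = [mutate([1] * num_layers, p_mut=0.5) for _ in range(pop_size)]
    best = None
    for _ in range(rounds):                     # N rounds of steps 2 and 3
        elite = sorted(population, key=fitness, reverse=True)[:top_k]
        best = elite[0]
        mutants = [mutate(random.choice(elite)) for _ in range(pop_size // 2)]
        crosses = [recombine(*random.sample(elite, 2))
                   for _ in range(pop_size - top_k - len(mutants))]
        population = elite + mutants + crosses
    return best

# Usage with a placeholder fitness; in practice this would evaluate the
# corresponding distillation network on the validation set.
print(evolutionary_search(lambda g: random.random()))
```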
The multi-task-oriented automatic compression platform for pre-trained language models of the present invention includes the following components:
Data loading component: used to obtain the BERT model to be compressed, which contains a specific natural language processing downstream task and is uploaded by the logged-in user, and the training samples of the multi-task-oriented pre-trained language model; the training samples are labeled text samples that satisfy the supervised learning task.
Automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, including a knowledge distillation vector encoding module, a distillation network generation module, a structure generator and distillation network joint training module, a distillation network search module, and a task-specific fine-tuning module.
The knowledge distillation vector encoding module includes the Transformer layer sampling vector. In the forward propagation process, the distillation network encoding vector is input into the structure generator to generate the distillation network of the corresponding structure and the weight matrix of the structure generator.
The distillation network generation module constructs, based on the structure generator, the distillation network corresponding to the currently input encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of encoder units in the input and output of the distillation structure corresponding to the encoding vector.
The structure generator and distillation network joint training module trains the structure generator end to end; specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a small batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated.
The distillation network search module searches for the distillation network with the highest accuracy that satisfies the specific constraint, and an evolutionary algorithm is proposed for this search. The network encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy. In the evolutionary search algorithm used in the meta-learning distillation network, each distillation network is generated from an encoding vector based on Transformer layer sampling, so the distillation network encoding vector is defined as the gene of the distillation network. Under the specific constraint, a series of distillation network encoding vectors are first selected as genes of distillation networks, and the accuracy of each corresponding distillation network is obtained by evaluation on the validation set. Then the top k genes with the highest accuracy are selected, and gene recombination and mutation are used to generate new genes. By further repeating the process of selecting the top k optimal genes and generating new genes, the gene that satisfies the constraint and has the highest accuracy is obtained.
The task-specific fine-tuning module builds a downstream task network on the pre-trained distillation network generated by the automatic compression component, fine-tunes it for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user. The compressed model is output to a specified container for the logged-in user to download, and the comparison of the model size before and after compression is presented on the compressed-model output page of the platform.
Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference, on a real-world dataset, on new data of the natural language processing downstream task uploaded by the logged-in user. The comparison of the inference speed before and after compression is presented on the compressed-model inference page of the platform.
The logged-in user can directly download the trained pre-trained language model provided by the platform of the present invention, build a downstream task network on the basis of the compressed pre-trained model architecture generated by the platform according to the user's needs for a specific natural language processing downstream task, fine-tune it, and finally deploy it on terminal devices. Inference on natural language processing downstream tasks can also be performed directly on the platform.
The technical solution of the present invention is further described in detail below using a sentiment classification task on movie reviews.
The data loading component of the platform obtains the BERT model for the single-sentence text classification task and the sentiment analysis dataset SST-2 uploaded by the logged-in user;
The multi-task-oriented BERT pre-trained language model is generated by the automatic compression component of the platform;
The platform loads the BERT pre-trained model generated by the automatic compression component, and a model for the text classification task is built on the generated pre-trained model;
The student model obtained by the task-specific fine-tuning module of the automatic compression component is fine-tuned: the feature layer and output layer of the BERT pre-trained model generated by the automatic compression component are used to fine-tune the downstream text classification task scenario; finally, the platform outputs the compressed BERT model containing the text classification task required by the logged-in user.
The compressed model is output to a specified container for the logged-in user to download, and the comparison of the model size before and after compression is presented on the compressed-model output page of the platform: the model size before compression is 110M, the size after compression is 56M, a 49% reduction, as shown in Table 1 below.
Table 1: Comparison of the BERT model for the text classification task before and after compression
Text classification task (SST-2, 67K samples) | Before compression | After compression | Comparison
Model size                                    | 110M               | 56M               | 49% smaller
Inference accuracy                            | 91.5%              | 92.0%             | +0.5%
Through the inference component of the platform, the compressed model output by the platform is used to perform inference on the SST-2 test set data uploaded by the logged-in user, and the compressed-model inference page of the platform shows that the inference speed after compression is 2.01 times faster than before compression, and the inference accuracy improves from 91.5% before compression to 92.0% after compression.

Claims (10)

  1. A multi-task-oriented automatic compression method for pre-trained language models, characterized in that it comprises three stages:
    In the first stage, a knowledge distillation encoding vector based on Transformer layer sampling is constructed: a Bernoulli distribution is used to perform layer sampling on all Transformer units of the BERT model to generate the knowledge distillation encoding vector;
    In the second stage, the meta-learning knowledge distillation network is trained: a search space is defined, the knowledge distillation encoding vectors constructed in the first stage are input into the search space, and the encoding vectors that do not satisfy the conditions are eliminated; a structure generator is defined, which takes the filtered knowledge distillation encoding vector as input, outputs the weight matrix used to construct the distillation structure model, and generates the corresponding distillation structure model; the generated distillation structure model is trained so as to update the structure generator;
    The third stage is the process of searching distillation structure models based on an evolutionary algorithm: multiple knowledge distillation encoding vectors satisfying the specific constraint are input into the structure generator updated in the second stage to generate the corresponding weight matrices, yielding multiple distillation structure models; the accuracy of each distillation structure model is evaluated; the evolutionary algorithm is used to search for the distillation structure model with the highest accuracy that satisfies the specific constraint, yielding a general compression architecture.
  2. The multi-task-oriented automatic compression method for pre-trained language models according to claim 1, characterized in that the first stage is specifically: Bernoulli sampling is performed in turn on the 12 layers of Transformer units of the BERT model to generate the knowledge distillation encoding vector, with each layer corresponding to a random variable; when the probability of the random variable being 1 is greater than or equal to 0.5, the corresponding element of the knowledge distillation encoding vector is 1, indicating that the current Transformer unit performs transfer learning; when the probability of the random variable being 1 is less than 0.5, the corresponding element of the layer sampling vector is 0, indicating that the current Transformer unit does not perform transfer learning.
  3. The multi-task-oriented automatic compression method for pre-trained language models according to claim 2, characterized in that defining the search space is specifically: the number of elements equal to 1 in the knowledge distillation encoding vector is not less than 6.
  4. The multi-task-oriented automatic compression method for pre-trained language models according to claim 3, characterized in that defining the structure generator is specifically: the structure generator consists of two fully connected layers, its input is the knowledge distillation encoding vector constructed in the first stage, and its output is the weight matrix used to generate the distillation structure model.
  5. The multi-task-oriented automatic compression method for pre-trained language models according to claim 4, characterized in that training the generated distillation structure model so as to update the structure generator includes the following sub-steps:
    Step (2.1): input the knowledge distillation encoding vector into the structure generator and output the weight matrix;
    Step (2.2): construct the distillation structure model based on the weight matrix output by the structure generator;
    Step (2.3): jointly train the structure generator and the distillation structure model: input training data into the distillation structure model generated in step (2.2) for model training, with the structure generator updated together with it; at the same time, the structure generator is trained in combination with the Bernoulli distribution sampling method.
  6. The multi-task-oriented automatic compression method for pre-trained language models according to claim 5, characterized in that step (2.2) is specifically: according to the knowledge distillation encoding vector constructed in the first stage, in which each element corresponds to one layer of Transformer units, layer-sampling knowledge distillation is performed on each Transformer layer of the teacher network, and the weights of the Transformer units whose corresponding elements of the knowledge distillation encoding vector are 1 in the teacher model are used to initialize the Transformer units migrated to the student model; that is, for each element sampled as 1, the structure generator generates the corresponding Transformer unit of the student model and its weights; the knowledge distillation encoding vector establishes a one-to-one mapping between the teacher model and the student model, and the corresponding distillation network structure is generated according to the knowledge distillation encoding vector.
  7. The multi-task-oriented automatic compression method for pre-trained language models according to claim 6, characterized in that the method of training the structure generator in combination with Bernoulli distribution sampling is specifically: a Bernoulli distribution is used to perform layer sampling on each layer of Transformer units to construct different knowledge distillation encoding vectors, and the training data set is used for multiple rounds of iterative training; in each iteration, the structure generator and the distillation structure model are trained simultaneously based on one knowledge distillation encoding vector, and a structure generator capable of generating weight matrices for different distillation structure models is learned by varying the input knowledge distillation encoding vector.
  8. The multi-task-oriented automatic compression method for pre-trained language models according to claim 7, characterized in that the third stage includes the following sub-steps:
    Step (3.1): define the knowledge distillation encoding vector as the gene of the distillation structure model, and randomly select a series of genes satisfying the specific constraint as the initial population;
    Step (3.2): evaluate the accuracy of the distillation structure model corresponding to each gene in the existing population, and select the top k genes with the highest accuracy;
    Step (3.3): use the top k genes selected in step (3.2) to generate new genes through gene recombination and gene mutation, and add the new genes to the existing population;
    Step (3.4): repeat steps (3.2) to (3.3) for a set number of iterations, selecting the top k genes with the highest accuracy in the existing population and generating new genes, until the gene that satisfies the specific constraint and has the highest accuracy is finally obtained.
  9. The multi-task-oriented automatic compression method for pre-trained language models according to claim 8, characterized in that in step (3.3), gene mutation refers to randomly changing the values of some elements in a gene; gene recombination refers to randomly recombining the genes of two parents; new genes that do not satisfy the specific constraint are eliminated.
  10. A platform based on the multi-task-oriented automatic compression method for pre-trained language models according to any one of claims 1-9, characterized in that it includes the following components:
    Data loading component: used to obtain training samples of the multi-task-oriented pre-trained language model, where the training samples are labeled text samples that satisfy the supervised learning task;
    Automatic compression component: used to automatically compress the multi-task-oriented pre-trained language model, including a knowledge distillation vector encoding module, a distillation network generation module, a structure generator and distillation network joint training module, a distillation network search module, and a task-specific fine-tuning module;
    The knowledge distillation vector encoding module includes the Transformer layer sampling vector; in the forward propagation process, the knowledge distillation encoding vector is input into the structure generator to generate the distillation network of the corresponding structure and the weight matrix of the structure generator;
    The distillation network generation module constructs, based on the structure generator, the distillation network corresponding to the currently input knowledge distillation encoding vector, and adjusts the shape of the weight matrix output by the structure generator so that it is consistent with the number of encoder units in the input and output of the distillation structure corresponding to the knowledge distillation encoding vector;
    The structure generator and distillation network joint training module trains the structure generator end to end; specifically, the knowledge distillation encoding vector based on Transformer layer sampling and a small batch of training data are input into the distillation network, and the weights of the distillation structure and the weight matrix of the structure generator are updated;
    The distillation network search module searches for the distillation network with the highest accuracy that satisfies the specific constraint, and an evolutionary algorithm is proposed for this search; the knowledge distillation encoding vector is input into the trained structure generator to generate the weights of the corresponding distillation network, and the distillation network is evaluated on the validation set to obtain its accuracy; in the evolutionary search algorithm used in the meta-learning distillation network, each distillation network is generated from a knowledge distillation encoding vector based on Transformer layer sampling, so the knowledge distillation encoding vector is defined as the gene of the distillation network; under the specific constraint, a series of knowledge distillation encoding vectors are first selected as genes of distillation networks, and the accuracy of each corresponding distillation network is obtained by evaluation on the validation set; then the top k genes with the highest accuracy are selected, and gene recombination and mutation are used to generate new genes; by further repeating the process of selecting the top k optimal genes and generating new genes, the gene that satisfies the constraint and has the highest accuracy is obtained;
    The task-specific fine-tuning module builds a downstream task network on the pre-trained distillation network generated by the automatic compression component, fine-tunes it for the downstream task scenario using the feature layer and output layer of the distillation network, and outputs the final fine-tuned student model, that is, the compressed pre-trained language model containing the downstream task required by the logged-in user; the compressed model is output to a specified container for the logged-in user to download, and the comparison of the model size before and after compression is presented on the compressed-model output page of the platform;
    Inference component: the logged-in user obtains the pre-trained compressed model from the platform and uses the compressed model output by the automatic compression component to perform inference, on a real-world dataset, on new data of the natural language processing downstream task uploaded by the logged-in user; the comparison of the inference speed before and after compression is presented on the compressed-model inference page of the platform.
PCT/CN2020/138016 2020-12-15 2020-12-21 面向多任务的预训练语言模型自动压缩方法及平台 WO2022126683A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
GB2214196.4A GB2619569A (en) 2020-12-15 2020-12-21 Method and platform for automatically compressing multi-task-oriented pre-training language model
JP2022570738A JP7381814B2 (ja) 2020-12-15 2020-12-21 マルチタスク向けの予めトレーニング言語モデルの自動圧縮方法及びプラットフォーム
US17/564,071 US11526774B2 (en) 2020-12-15 2021-12-28 Method for automatically compressing multitask-oriented pre-trained language model and platform thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011470331.3 2020-12-15
CN202011470331.3A CN112232511B (zh) 2020-12-15 2020-12-15 面向多任务的预训练语言模型自动压缩方法及平台

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/564,071 Continuation US11526774B2 (en) 2020-12-15 2021-12-28 Method for automatically compressing multitask-oriented pre-trained language model and platform thereof

Publications (1)

Publication Number Publication Date
WO2022126683A1 true WO2022126683A1 (zh) 2022-06-23

Family

ID=74123619

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138016 WO2022126683A1 (zh) 2020-12-15 2020-12-21 面向多任务的预训练语言模型自动压缩方法及平台

Country Status (2)

Country Link
CN (1) CN112232511B (zh)
WO (1) WO2022126683A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216225A (zh) * 2023-10-19 2023-12-12 四川大学 一种基于三模态知识蒸馏的3d视觉问答方法
CN117787922A (zh) * 2024-02-27 2024-03-29 东亚银行(中国)有限公司 基于蒸馏学习和自动学习的反洗钱业务处理方法、***、设备和介质
CN117807235A (zh) * 2024-01-17 2024-04-02 长春大学 一种基于模型内部特征蒸馏的文本分类方法

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2619569A (en) * 2020-12-15 2023-12-13 Zhejiang Lab Method and platform for automatically compressing multi-task-oriented pre-training language model
US11527074B1 (en) 2021-11-24 2022-12-13 Continental Automotive Technologies GmbH Systems and methods for deep multi-task learning for embedded machine vision applications
CN114298224B (zh) * 2021-12-29 2024-06-18 云从科技集团股份有限公司 图像分类方法、装置以及计算机可读存储介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062489A (zh) * 2019-12-11 2020-04-24 北京知道智慧信息技术有限公司 一种基于知识蒸馏的多语言模型压缩方法、装置
CN111767711A (zh) * 2020-09-02 2020-10-13 之江实验室 基于知识蒸馏的预训练语言模型的压缩方法及平台

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611377B (zh) * 2020-04-22 2021-10-29 淮阴工学院 基于知识蒸馏的多层神经网络语言模型训练方法与装置

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062489A (zh) * 2019-12-11 2020-04-24 北京知道智慧信息技术有限公司 一种基于知识蒸馏的多语言模型压缩方法、装置
CN111767711A (zh) * 2020-09-02 2020-10-13 之江实验室 基于知识蒸馏的预训练语言模型的压缩方法及平台

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANTONIO BARBALAU; ADRIAN COSMA; RADU TUDOR IONESCU; MARIUS POPESCU: "Black-Box Ripper: Copying black-box models using generative evolutionary algorithms", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 21 October 2020 (2020-10-21), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081792289 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117216225A (zh) * 2023-10-19 2023-12-12 四川大学 一种基于三模态知识蒸馏的3d视觉问答方法
CN117216225B (zh) * 2023-10-19 2024-06-04 四川大学 一种基于三模态知识蒸馏的3d视觉问答方法
CN117807235A (zh) * 2024-01-17 2024-04-02 长春大学 一种基于模型内部特征蒸馏的文本分类方法
CN117807235B (zh) * 2024-01-17 2024-05-10 长春大学 一种基于模型内部特征蒸馏的文本分类方法
CN117787922A (zh) * 2024-02-27 2024-03-29 东亚银行(中国)有限公司 基于蒸馏学习和自动学习的反洗钱业务处理方法、***、设备和介质
CN117787922B (zh) * 2024-02-27 2024-05-31 东亚银行(中国)有限公司 基于蒸馏学习和自动学习的反洗钱业务处理方法、***、设备和介质

Also Published As

Publication number Publication date
CN112232511A (zh) 2021-01-15
CN112232511B (zh) 2021-03-30

Similar Documents

Publication Publication Date Title
WO2022126683A1 (zh) 面向多任务的预训练语言模型自动压缩方法及平台
WO2022141754A1 (zh) 一种卷积神经网络通用压缩架构的自动剪枝方法及平台
WO2022126797A1 (zh) 基于多层级知识蒸馏预训练语言模型自动压缩方法及平台
US10984308B2 (en) Compression method for deep neural networks with load balance
US11501171B2 (en) Method and platform for pre-trained language model automatic compression based on multilevel knowledge distillation
US20190034796A1 (en) Fixed-point training method for deep neural networks based on static fixed-point conversion scheme
JP7381814B2 (ja) マルチタスク向けの予めトレーニング言語モデルの自動圧縮方法及びプラットフォーム
KR102592585B1 (ko) 번역 모델 구축 방법 및 장치
CN113033786B (zh) 基于时间卷积网络的故障诊断模型构建方法及装置
CN112578089B (zh) 一种基于改进tcn的空气污染物浓度预测方法
CN114398976A (zh) 基于bert与门控类注意力增强网络的机器阅读理解方法
CN112347756A (zh) 一种基于序列化证据抽取的推理阅读理解方法及***
CN107579816A (zh) 基于递归神经网络的密码字典生成方法
CN111058840A (zh) 一种基于高阶神经网络的有机碳含量(toc)评价方法
CN117539977A (zh) 一种语言模型的训练方法及装置
Duggal et al. High performance squeezenext for cifar-10
CN114139674A (zh) 行为克隆方法、电子设备、存储介质和程序产品
CN113051353A (zh) 一种基于注意力机制的知识图谱路径可达性预测方法
WO2023082045A1 (zh) 一种神经网络架构搜索的方法和装置
CN114637863B (zh) 一种基于传播的知识图谱推荐方法
CN117807235B (zh) 一种基于模型内部特征蒸馏的文本分类方法
CN118133680A (zh) 一种基于图注意力网络的输电网支路参数辨识方法
Li Research on University Book Purchasing Model Based on Genetic-Neural Network Algorithm
Zhang et al. Optimization of Neural Networks Based on Genetic Algorithms for SL Datasets and Applications
CN118014126A (zh) 基于issa-bp神经网络模型的变电站碳排放预测方法及***

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20965696

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 202214196

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20201221

ENP Entry into the national phase

Ref document number: 2022570738

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20965696

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20965696

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 061223)