CN116306869A - Method for training text classification model, text classification method and corresponding device - Google Patents

Method for training text classification model, text classification method and corresponding device

Info

Publication number
CN116306869A
CN116306869A
Authority
CN
China
Prior art keywords
text
module
model
training
text classification
Prior art date
Legal status
Pending
Application number
CN202310240367.XA
Other languages
Chinese (zh)
Inventor
陆金星
张长浩
张睿
赵智源
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310240367.XA priority Critical patent/CN116306869A/en
Publication of CN116306869A publication Critical patent/CN116306869A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a method for training a text classification model, a text classification method, and corresponding apparatus. The main technical scheme comprises: taking a trained first text classification model as a teacher model; constructing a student model from the embedding module, the decoding module, and an encoding module with a reduced number of layers of the first classification model, and adding a mapping module between the encoding module and the decoding module of the student model; training the student model by feeding the text samples in the training samples to both the teacher model and the student model, and obtaining a second text classification model from the trained student model. The training target is to minimize the difference between the first feature representation distribution output by the encoding module of the teacher model and the third feature representation distribution output by the mapping module of the student model. The embodiments of this specification can reduce the number of hyper-parameters while preserving model robustness, thereby reducing the overall time and resource consumption incurred in determining the optimal hyper-parameters.

Description

Method for training text classification model, text classification method and corresponding device
Technical Field
One or more embodiments of the present disclosure relate to the field of natural language processing, and in particular, to a method for training a text classification model, a text classification method, and a corresponding apparatus.
Background
Text classification is one of the most basic techniques in NLP (Natural Language Processing). Text classification is not simply rule-based classification; it requires understanding the semantics expressed in the text so as to grasp the "subject" the text expresses and classify it accordingly. Text classification is widely applicable to a variety of application scenarios, such as intelligent dialogue, emotion recognition, content understanding, content risk control, and the like.
Most current text classification methods are implemented with a pre-trained language model as the encoding module. However, because pre-trained language models such as BERT (Bidirectional Encoder Representation from Transformers) have a huge number of parameters, model deployment and inference latency come under great pressure. To obtain a good model compression effect, the industry currently mostly adopts knowledge distillation, but existing knowledge distillation methods generally involve a large number of hyper-parameters, and the distillation result is very sensitive to those hyper-parameters. To obtain a good and stable distillation result, a grid search is used to search over a large space of hyper-parameter values. This inevitably makes the overall model training process excessively long and resource-intensive, degrading overall performance.
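As a rough illustration of that search cost, the following sketch (candidate values are hypothetical and not part of this specification) enumerates the combinations a grid search over three typical distillation hyper-parameters, e.g. a temperature and two loss weights, would have to train and evaluate:

```python
from itertools import product

# Hypothetical candidate values for three distillation hyper-parameters.
temperatures = [1.0, 2.0, 5.0, 10.0]
alphas = [0.1, 0.5, 1.0, 2.0]
betas = [0.1, 0.5, 1.0, 2.0]

# Each combination requires one complete distillation run, so the cost
# grows multiplicatively with the number of hyper-parameters.
combinations = list(product(temperatures, alphas, betas))
print(len(combinations))  # 4 * 4 * 4 = 64 full training runs
```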
Disclosure of Invention
In view of this, one or more embodiments of the present disclosure disclose a method of training a text classification model, a text classification method, and corresponding apparatus, so as to reduce the overall time and resource consumption of the model training process.
According to a first aspect, embodiments of the present specification provide a method of training a text classification model, the method comprising:
taking the trained first text classification model as a teacher model;
obtaining a student model by utilizing an embedding module, a decoding module and a coding module with reduced layers in the first classification model, wherein a mapping module is added between the coding module and the decoding module of the student model, and the mapping module is used for carrying out mapping processing on the second characteristic representation distribution output by the coding module and outputting a third characteristic representation distribution;
taking the text sample in the training sample as the input of the teacher model and the student model to train the student model, and obtaining a second text classification model by using the student model obtained by training; the training targets are as follows: minimizing a difference between a first characteristic representation distribution output by an encoding module of the teacher model and a third characteristic representation distribution output by a mapping module of the student model;
The second text classification model is used for acquiring the category corresponding to the input text to be classified.
According to an implementation manner of the embodiments of the present disclosure, before taking the trained first text classification model as the teacher model, the method further includes:
training a first text classification model by using a plurality of training samples, wherein the text samples in the training samples are used as the input of the first text classification model, and the class labels marked by the text samples in the training samples are used as the target output of the first text classification model;
the first text classification model comprises an embedding module, an encoding module and a decoding module; the embedding module extracts the embedded representation of each element Token in the text sample input into the first text classification model; the encoding module encodes the embedded representation of each Token to obtain feature vectors of each Token so as to form a first feature representation distribution of the text sample; the decoding module predicts a category of the text sample using the first characteristic representation distribution of the text sample.
According to an implementation manner in the embodiments of the present disclosure, before the training, the student model multiplexes parameters of the embedding module, the decoding module, and the encoding module in the first classification model, and randomly initializes or initializes parameters of the mapping module to a preset initial value.
According to an implementation manner in the embodiments of the present disclosure, the mapping module includes: multilayer perceptron MLP.
According to an implementation manner in the embodiments of the present disclosure, in each iteration of the training process, parameters of the embedding module, the encoding module, and the mapping module in the student model are updated by using the loss function value corresponding to the training target, while the parameters of the decoding module are kept unchanged.
According to an implementation manner in the embodiments of the present disclosure, the loss function value is obtained by using a mean square error between the first characteristic representation distribution and the third characteristic representation distribution, or by using a KL divergence between the first characteristic representation distribution and the third characteristic representation distribution.
According to an implementation manner in the embodiments of the present disclosure, the method is applied to the field of risk control, the text to be classified includes text resources on a network, and the category indicates whether a risk exists or indicates a risk level.
According to a second aspect, there is provided a text classification method, the method comprising:
acquiring a text to be classified;
inputting the text to be classified into a second text classification model, wherein the second text classification model comprises an embedding module, a coding module, a mapping module and a decoding module; the embedding module acquires embedded representations of the Token of each element in the text to be classified; the encoding module encodes the embedded representation of each Token to obtain feature vectors of each Token so as to form second feature representation distribution of the text to be classified; the mapping module performs mapping processing on the second characteristic representation distribution and outputs a third characteristic representation distribution of the text to be classified; the decoding module classifies the text to be classified by using the third characteristic representation distribution to obtain the category of the text to be classified;
Wherein the second text classification model is pre-trained using the method described in the first aspect.
According to a third aspect, there is provided an apparatus for training a text classification model, the apparatus comprising:
a sample acquisition unit configured to acquire a training sample;
a model construction unit configured to take the first text classification model which has been trained as a teacher model; obtaining a student model by utilizing an embedding module, a decoding module and a coding module with reduced layers in the first classification model, wherein a mapping module is added between the coding module and the decoding module of the student model, and the mapping module is used for carrying out mapping processing on the second characteristic representation distribution output by the coding module and outputting a third characteristic representation distribution;
the first training unit is configured to train the student model by taking a text sample in the training sample as input of the teacher model and the student model, and obtain a second text classification model by utilizing the trained student model; the training targets are as follows: minimizing a difference between a first characteristic representation distribution output by an encoding module of the teacher model and a third characteristic representation distribution output by a mapping module of the student model;
The second text classification model is used for acquiring the category corresponding to the input text to be classified.
According to a fourth aspect, there is provided a text classification apparatus, the apparatus comprising:
a text acquisition unit configured to acquire a text to be classified;
a text classification unit configured to input the text to be classified into a second text classification model, the second text classification model including an embedding module, an encoding module, a mapping module, and a decoding module; the embedding module acquires embedded representations of the Token of each element in the text to be classified; the encoding module encodes the embedded representation of each Token to obtain feature vectors of each Token so as to form second feature representation distribution of the text to be classified; the mapping module performs mapping processing on the second characteristic representation distribution and outputs a third characteristic representation distribution of the text to be classified; the decoding module classifies the text to be classified by using the third characteristic representation distribution to obtain the category of the text to be classified;
wherein the second text classification model is pre-trained by the apparatus as described in the third aspect above.
According to a fifth aspect, embodiments of the present description provide a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method according to the first and second aspects above.
According to a sixth aspect, embodiments of the present specification provide a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements the method as described in the first and second aspects above.
According to the technical scheme, the embodiment of the specification can have the following advantages:
1) In the embodiments of the present disclosure, a mapping module is added between the encoding module and the decoding module of the student model, which reduces the difference between the feature representations fed to the decoding modules of the student model and of the teacher model. The intermediate layers of the encoding modules of the student model and the teacher model therefore do not need to be aligned during training; only the output of the mapping module of the student model and the output of the encoding module of the teacher model need to be aligned. This reduces the number of model hyper-parameters and hence the overall time and resource consumption incurred in determining the optimal hyper-parameters.
2) In the embodiments of the present disclosure, the student model can directly multiplex the decoding module of the teacher model, whose parameters are kept unchanged during training, so the temperature parameter used in the decoding module does not need to be tuned, further reducing the model hyper-parameters.
3) The loss function used to train the student model is obtained only by aligning the output of the mapping module of the student model with the output of the encoding module of the teacher model, so no weight hyper-parameters for combining multiple loss terms are required, which again reduces the model hyper-parameters and the overall time and resource consumption incurred in determining the optimal hyper-parameters.
Of course, not all of the above-described advantages need be achieved at the same time in practicing any one of the embodiments of the present description.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an existing fine-tuning stage distillation process;
FIG. 2 is an exemplary system architecture diagram to which embodiments of the present description may be applied;
FIG. 3 is a flow chart of a method of training a text classification model provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a process for fine-tuning stage distillation provided in the examples of the present disclosure;
FIG. 5 is a flow chart of a method for text classification provided by an embodiment of the present disclosure;
FIG. 6 is a block diagram of an apparatus for training a text classification model provided in an embodiment of the present disclosure;
fig. 7 is a block diagram of a text classification apparatus according to an embodiment of the present disclosure.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)", depending on the context.
Knowledge distillation is mainly divided into pre-training stage distillation and fine-tuning stage distillation. Pre-training stage distillation requires training on a large amount of unlabeled corpora unrelated to the downstream task, so it is computationally demanding and time consuming. Fine-tuning stage distillation is trained directly on the specific downstream task, so training is fast and the performance gain is significant. The method provided in the embodiments of this specification therefore adopts fine-tuning stage distillation.
Some fine-tuning stage distillation methods already exist, such as the one shown in fig. 1. A trained text classification model serves as the teacher model, and the student model is basically the same as the teacher model except that its encoding module has a reduced number of layers. In the figure, the encoding module is exemplified as containing multiple Transformer layers (the conversion layers in the figure): the teacher model contains 8 Transformer layers and the student model contains 4. When training the student model, training samples are used, each comprising a text sample X and a class label y annotated to the text sample. The text sample is fed to the teacher model and the student model simultaneously, and the value of the total loss function Loss in each iteration is determined by L_CE, L_logits and L_features, for example:

Loss = L_CE + α·L_logits + β·L_features

where L_CE is determined by the difference between the category predicted by the student model for the text sample and the corresponding category label; L_logits is determined by the difference between the probability distribution over categories predicted by the teacher model for the text sample and the probability distribution over categories predicted by the student model; and L_features is determined by the difference between the output of each Transformer layer in the encoding module of the student model and the output of the corresponding Transformer layer in the teacher model.
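A minimal PyTorch sketch of this prior-art combined loss (tensor names are hypothetical; T, α and β are exactly the hyper-parameters discussed below):

```python
import torch
import torch.nn.functional as F

def prior_art_distillation_loss(student_logits, teacher_logits,
                                student_layer_feats, teacher_layer_feats,
                                labels, T=2.0, alpha=0.5, beta=0.5):
    # L_CE: difference between the student's prediction and the class label.
    l_ce = F.cross_entropy(student_logits, labels)

    # L_logits: difference between the teacher's and the student's probability
    # distributions over categories, softened by the temperature T.
    l_logits = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

    # L_features: alignment between corresponding Transformer layers of the
    # student and teacher encoding modules.
    l_features = sum(F.mse_loss(s, t)
                     for s, t in zip(student_layer_feats, teacher_layer_feats))

    return l_ce + alpha * l_logits + beta * l_features
```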
The manner shown in fig. 1 has the following two main disadvantages:
On the one hand, the hyper-parameters T, α and β must additionally be introduced, where T is the temperature coefficient used by the softmax in the decoding module. Knowledge distillation is very sensitive to these model hyper-parameters, which must be tuned and tested in order to obtain a good and stable distillation result, i.e., to find the optimal hyper-parameters. For example, different hyper-parameter values are combined by grid search, which makes the overall training excessively time consuming and resource intensive.
On the other hand, the dimensions of the corresponding Transformer layers in the teacher model and the student model must be exactly the same; otherwise large performance deviations result.
In view of this, the embodiments of the present disclosure provide a new idea, and solve the above problems in a smart manner. To facilitate an understanding of the embodiments of the present specification, a brief description of a system architecture on which the embodiments of the present specification are based will first be described.
Fig. 2 shows an exemplary system architecture to which embodiments of the present description may be applied. The system architecture includes a model training device and a text classification device.
After the model training device acquires the training sample, the first text classification model is used as a teacher model to carry out knowledge distillation training on the student model in an offline stage by adopting the mode provided by the embodiment of the specification, and a second text classification model is obtained.
The text classification device classifies the text to be classified on line by using the established second text classification model to obtain the text category.
The model training device and the text classification device may each be set up as an independent server, may be located in the same server or server group, or may be located in independent or shared cloud servers. A cloud server, also called a cloud computing server or cloud host, is a host product in a cloud computing service system that addresses the drawbacks of traditional physical hosts and VPS (Virtual Private Server) services, namely difficult management and weak scalability. The model training device and the text classification device may also be set up on a computer terminal with strong computing power.
Note that, in addition to classifying text online, the above-mentioned text classification device may also perform text classification offline, for example on a batch of texts to be classified.
It should be understood that the numbers of model training devices, text classification devices, first text classification models, and second text classification models in fig. 2 are merely illustrative. There may be any number of model training devices, text classification devices, first text classification models, and second text classification models, as required by the implementation.
Fig. 3 is a flowchart of a method for training a text classification model according to an embodiment of the present disclosure. It will be appreciated that the method may be performed by a model training apparatus in the system shown in fig. 2. Referring to fig. 3, the method may include:
step 301: and taking the trained first text classification model as a teacher model.
Step 303: and obtaining a student model by utilizing an embedding module, a decoding module and an encoding module with reduced layers in the first classification model, and adding a mapping module between the encoding module and the decoding module of the student model, wherein the mapping module is used for mapping the second characteristic representation distribution output by the encoding module and outputting a third characteristic representation distribution.
Step 305: taking the text sample in the training sample as input of a teacher model and a student model to train the student model, and obtaining a second text classification model by utilizing the trained student model; the training targets are as follows: the difference between the first characteristic representation distribution output by the coding module of the teacher model and the third characteristic representation distribution output by the mapping module of the student model is minimized.
As can be seen from the technical content provided by the above embodiments, in the embodiments of the present disclosure a mapping module is added between the encoding module and the decoding module of the student model, which reduces the difference between the feature representations fed to the decoding modules of the student model and of the teacher model. The intermediate layers of the encoding modules of the student model and the teacher model do not need to be aligned during training; only the output of the mapping module of the student model needs to be aligned with the output of the encoding module of the teacher model, which reduces the model hyper-parameters and hence the overall time and resource consumption incurred in determining the optimal hyper-parameters.
It should be noted that terms such as "first" and "second" in the embodiments of the present disclosure do not limit size, order, or number; they are used only to distinguish items by name. For example, "first text classification model" and "second text classification model" distinguish the two text classification models by name. For another example, "first feature representation distribution", "second feature representation distribution", and "third feature representation distribution" distinguish the three feature representation distributions by name.
The respective steps shown in fig. 3 are explained below. The above step 301, i.e. "the trained first text classification model is used as the teacher model", will be described in detail in connection with an embodiment.
First, a supervised training manner is adopted for training the first text classification model in the embodiments of the present specification. Training data comprising a plurality of training samples is obtained, each training sample comprising a text sample and a category label annotated to the text sample. The specific category labels may vary with the application scenario; for example, category labels in the area of risk control may indicate whether a risk exists or indicate a risk level. For another example, in the e-commerce field, the text sample may be a product description text and the category label may be a product category label. For another example, in the intelligent dialogue field, the text sample may be user-input dialogue content obtained from a history log, and the category label may be a user intention label or an emotion label.
The structure of the first text classification model provided in the embodiments of the present disclosure may include an embedding module, an encoding module, and a decoding module, as shown in fig. 1 and fig. 4. The first text classification model may be a text classification model built on a pre-trained language model. For example, the encoding module may take a pre-trained language model such as BERT (Bidirectional Encoder Representation from Transformers), XLNet (an autoregressive model that obtains bidirectional context information through permutation language modeling), or a GPT (Generative Pre-Training) model as the initial encoding module, on which further training is performed. BERT is a bidirectional pre-trained language model that uses the Transformer Encoder as its model structure and can make good use of context information for feature learning. XLNet is a BERT-like, more generalized autoregressive pre-training model. GPT uses the Transformer Decoder structure, and only the masked multi-head attention is retained in the Transformer Decoder.
The Embedding module is used for extracting the embedded representation of each Token in the input text sample, namely the Embedding module performs Token-based Embedding processing on the input text sample.
The text sample referred to in the embodiments of the present specification may be composed of words, phrases, sentences, and the like, in which each Token may be a character, a start symbol, or a separator. The Token-based Embedding processing at least comprises: word Embedding and position Embedding.
Word Embedding: each Token is encoded into a word vector to obtain its word vector representation.
Position Embedding: the position of each Token in the text sample is encoded to obtain its position representation.
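A minimal sketch of such Token-level embedding (the vocabulary size, maximum length, and hidden dimension below are assumed values, not values prescribed by this specification):

```python
import torch
import torch.nn as nn

class EmbeddingModule(nn.Module):
    # Assumed sizes for illustration only.
    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, hidden)    # word Embedding
        self.position_embedding = nn.Embedding(max_len, hidden)   # position Embedding

    def forward(self, token_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Embedded representation of each Token = word vector + position vector.
        return self.word_embedding(token_ids) + self.position_embedding(positions)
```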
The encoding module is used for encoding the embedded representation of each Token to obtain the feature vector of each Token, so as to form the first feature representation distribution of the text sample. The encoding module may comprise multiple Transformer layers; in the figure, 12 Transformer layers are taken as an example.
The decoding module predicts a category of the text sample using the first characteristic representation distribution of the text sample. As one of the possible ways, the decoding module may include a pooling network (pooler) and a classification network (Classifier), where the pooling network is configured to pool the first feature representation distribution of the text samples (i.e., feature vectors of each Token). The classification network predicts the probability of the text sample on each category by using the characteristic representation distribution obtained after pooling, and accordingly obtains the category of the text sample.
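As one possible form of this decoding module, the following sketch pools the feature representation distribution and then classifies it; the hidden size and number of categories are assumptions, and pooling over the first Token's vector is one common choice for BERT-style encoders rather than a requirement of this specification:

```python
import torch.nn as nn

class DecodingModule(nn.Module):
    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        # Pooling network (pooler): a dense layer applied to the first Token's
        # feature vector, as commonly done for BERT-style encoders.
        self.pooler = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh())
        # Classification network (Classifier): predicts per-category logits.
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feature_distribution):          # [batch, seq_len, hidden]
        pooled = self.pooler(feature_distribution[:, 0])
        return self.classifier(pooled)                 # [batch, num_classes]
```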
In the embodiment of the present disclosure, a first text classification model is subjected to a supervised training, a text sample in a training sample is used as an input of the first text classification model, and a class label marked by the text sample in the training sample is used as a target output of the first text classification model. Namely, training targets are as follows: the difference between the predicted result of the first text classification model and the class label to which the text sample is labeled is minimized.
In the present embodiment, the loss function may be constructed according to the training object described above, for example, using a cross entropy loss function. And updating model parameters in a gradient descending mode by using the value of the loss function in each round of iteration until a preset training ending condition is met. The training ending condition may include, for example, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset number of times threshold, etc.
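A minimal supervised fine-tuning loop consistent with this description, assuming a `teacher_model` composed of the three modules above and a `train_loader` yielding (token_ids, label) batches (both names are hypothetical):

```python
import torch
import torch.nn.functional as F

def train_teacher(teacher_model, train_loader, epochs=3, lr=2e-5):
    optimizer = torch.optim.AdamW(teacher_model.parameters(), lr=lr)
    teacher_model.train()
    for _ in range(epochs):
        for token_ids, labels in train_loader:
            logits = teacher_model(token_ids)
            # Training target: minimize the difference between the prediction
            # and the class label annotated to the text sample (cross entropy).
            loss = F.cross_entropy(logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return teacher_model
```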
The first text classification model obtained through training serves as the teacher model in knowledge distillation. Knowledge distillation is a learning paradigm in which a larger model serves as a teacher model to guide the training of a smaller student model. The teacher model has many parameters and the student model few; the resulting student model consumes fewer computing resources and less inference time while approaching the basic effect of the teacher model.
The step 303, namely, "the student model is obtained by using the embedding module, the decoding module and the encoding module with reduced layers in the first classification model, and the mapping module is added between the encoding module and the decoding module of the student model" is described in detail below in connection with the embodiment.
This step is the process of building a student model. When the student model is constructed, the student model is constructed based on the following characteristics mainly after the following characteristics are observed:
First, an encoding module based on a pre-trained language model such as BERT typically carries highly coupled, progressive structural information derived from its memorization and understanding of the samples used in the original pre-training; cross-layer or layer-wise alignment of the Transformer layers as shown in fig. 1 therefore reduces the effectiveness of knowledge distillation.
Second, the large structure and parameter count of the teacher model are mainly embodied in the multiple Transformer layers of its encoding module, so when the structure and parameters are reduced to obtain the student model, it is mainly the number of Transformer layers in the encoding module that is reduced.
Third, because of the difference in parameter count, the student model and the teacher model also differ somewhat in their ability to represent data. After the encoding module, the teacher model's representation of the semantics is more abstract than that of the student model; if the output of the encoding module of the student model is simply aligned with the output of the encoding module of the teacher model, the student model can hardly learn the teacher model's abstract semantic information.
Fourth, ablation experiments show that although the decoding module has a simple structure, it plays a great role in downstream tasks.
Fifth, since the coding module of the teacher model is larger than the coding module of the student model, the decoding module learned by the teacher model generally has a strong understanding ability.
Based on the above features, the student model constructed in the embodiments of the present specification includes an embedding module, an encoding module, a mapping module, and a decoding module.
First, the decoding module in the student model (which may include, for example, a pooling network and a classification network) fully multiplexes the decoding module of the teacher model. The embedding module likewise multiplexes the parameters of the teacher model's embedding module at the initial stage. The encoding module of the student model is of a smaller scale than that of the teacher model, i.e., it has fewer layers than the encoding module of the teacher model. In fig. 4, the encoding module of the student model is exemplified by 4 Transformer layers extracted from the encoding module of the teacher model.
A projection (projector) module, i.e., the mapping module, is added between the encoding module and the decoding module of the student model and is used to improve the representation capability of the student model. The projection module may be an MLP (Multilayer Perceptron) with a small number of parameters, such as a single-layer MLP or a multi-layer MLP. A CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or the like may also be used as the projection module. Experiments show, however, that a single-layer MLP is already effective, so a single-layer MLP is preferred. The hidden-layer dimension of the projection module may be half of the hidden-layer dimension of the encoding module, or may take other values.
The step 305, i.e. "the text sample in the training sample is used as input of the teacher model and the student model to train the student model, and the second text classification model is obtained by using the trained student model" will be described in detail in connection with the embodiment.
When the student model is initialized before training, it multiplexes the parameters of the embedding module, the corresponding Transformer layers of the encoding module, and the decoding module of the trained first text classification model. The parameters of the mapping module may be randomly initialized or initialized to preset initial values.
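Under the assumption of a BERT-style teacher whose encoding module exposes a list of Transformer layers, the construction and initialization described above can be sketched as follows (module and attribute names are illustrative, not the API of any particular library; which teacher layers to reuse is a design choice):

```python
import copy
import torch.nn as nn

def build_student(teacher, student_layer_indices=(0, 4, 8, 11),
                  hidden=768, projector_hidden=384):
    student = nn.Module()
    # Multiplex the teacher's embedding module and decoding module.
    student.embedding = copy.deepcopy(teacher.embedding)
    student.decoder = teacher.decoder            # shared; kept frozen during training

    # Encoding module: a subset of the teacher's Transformer layers.
    student.encoder_layers = nn.ModuleList(
        copy.deepcopy(teacher.encoder_layers[i]) for i in student_layer_indices
    )

    # Mapping (projection) module: a single-hidden-layer MLP whose hidden
    # dimension is half of the encoder hidden dimension; initialized randomly.
    student.projector = nn.Sequential(
        nn.Linear(hidden, projector_hidden),
        nn.ReLU(),
        nn.Linear(projector_hidden, hidden),
    )
    return student
```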
In training the student model, only the teacher model is used to guide training of the student model. Namely, training targets are as follows: the difference between the first characteristic representation distribution output by the coding module of the teacher model and the third characteristic representation distribution output by the mapping module of the student model is minimized. As shown in fig. 4, the alignment of the middle layer in the coding module between the student model and the teacher model is abandoned, and only the output of the mapping module of the student model is aligned with the output of the coding module of the teacher model, thereby reducing the hyper-parameters of the model.
The processing of the text samples in the training samples by the teacher model in the training process is described in step 301. The processing of the text sample by the student model comprises the following steps: the method comprises the steps that an embedding module obtains embedded representations of elements Token in a text sample; the encoding module encodes the embedded representation of each Token to obtain a feature vector of each Token to form a second feature representation distribution of the text sample; the mapping module maps the second characteristic representation distribution of the text sample and outputs a third characteristic representation distribution of the text sample; the decoding module classifies the text samples according to the third characteristic representation distribution of the text samples to obtain the categories of the text samples.
The loss function can be constructed according to the training target, the value of the loss function is utilized in each iteration, and model parameters are updated in a mode such as gradient descent until a preset training ending condition is met. The training ending condition may include, for example, the value of the loss function being less than or equal to a preset loss function threshold, the number of iterations reaching a preset number of times threshold, etc.
The loss function Loss may be an alignment loss function L_align. The alignment loss function may use an MSE (mean square error) function or a KL divergence. Taking MSE as an example:

L_align = MSE(H_T(x), H_S(x))

where H_S(x) is the output of the mapping module in the student model, H_T(x) is the output of the encoding module in the teacher model, and MSE() is the mean square error function.
During training, only the parameters of the embedding module, the encoding module, and the mapping module are updated, and the parameters of the decoding module are kept unchanged. Because the decoding module fully multiplexes the decoding module of the teacher model and its parameters do not change during training, the hyper-parameter T (temperature parameter) used by the decoding module does not need to be tuned. In addition, since the loss function involves only L_align, the hyper-parameters α and β involved in the method shown in fig. 1 are also unnecessary. The number of hyper-parameter combinations is therefore significantly reduced, which reduces the overall time and resource consumption incurred in determining the optimal hyper-parameters. Meanwhile, the mapping module, with its small number of parameters, significantly reduces the difference between the outputs of the encoding modules of the student model and the teacher model, thereby improving the training effect of knowledge distillation.
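A sketch of one such training iteration, following the illustrative student structure above (the `encode` callables and attribute names are assumptions; only the student's embedding, encoding, and mapping parameters are handed to the optimizer, so the multiplexed decoding module stays frozen):

```python
import torch
import torch.nn.functional as F

def make_optimizer(student, lr=5e-5):
    # Only embedding, encoder, and projector parameters are optimized;
    # the multiplexed decoding module is left untouched.
    params = (list(student.embedding.parameters())
              + list(student.encoder_layers.parameters())
              + list(student.projector.parameters()))
    return torch.optim.AdamW(params, lr=lr)

def distill_step(teacher_encode, student_encode, student, optimizer, token_ids):
    # First feature representation distribution H_T(x) from the teacher's
    # encoding module; no gradients are needed on the teacher side.
    with torch.no_grad():
        h_teacher = teacher_encode(token_ids)

    # Second feature representation distribution from the student's reduced
    # encoder, then the third one output by the mapping (projection) module.
    h_student = student_encode(token_ids)
    h_mapped = student.projector(h_student)

    # Alignment loss L_align = MSE(H_T(x), H_S(x)); a KL divergence could be
    # substituted here, as this specification allows.
    loss = F.mse_loss(h_mapped, h_teacher)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```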
And after training, taking the student model obtained by training as a second text classification model to classify the text.
A method of text classification based on the second text classification model may be as shown in fig. 5, which may be performed by the text classification apparatus in the system shown in fig. 2. As shown in fig. 5, the following steps may be included:
step 501: and obtaining the text to be classified.
Step 503: inputting the text to be classified into a second text classification model, and obtaining the category of the text to be classified obtained by the second text classification model.
The second text classification model comprises an embedding module, an encoding module, a mapping module and a decoding module. The method comprises the steps that an embedding module obtains embedded representations of Token of each element in a text to be classified; the encoding module encodes the embedded representation of each Token to obtain the feature vector of each Token so as to form a second feature representation distribution of the text to be classified; the mapping module maps the second characteristic representation distribution and outputs a third characteristic representation distribution of the text to be classified; and the decoding module classifies the text by using the third characteristic representation distribution to obtain the category of the text to be classified.
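Putting the four modules together, inference with the second text classification model can be sketched as follows (attribute names follow the illustrative student structure above and are assumptions):

```python
import torch

@torch.no_grad()
def classify(student, token_ids):
    # Embedded representations of each Token in the text to be classified.
    h = student.embedding(token_ids)
    # Encoding: feature vectors of each Token, i.e. the second feature
    # representation distribution of the text to be classified.
    for layer in student.encoder_layers:
        h = layer(h)
    # Mapping: the third feature representation distribution.
    h_mapped = student.projector(h)
    # Decoding: per-category logits, from which the category is taken.
    logits = student.decoder(h_mapped)
    return torch.argmax(logits, dim=-1)
```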
The method provided in the embodiments of the present disclosure may be applied to a variety of application scenarios, including but not limited to the following:
Application scenario 1: Risk control field
With the wide popularization of the Internet, almost everyone can produce content, and high-risk content inevitably has adverse effects on the whole network environment. Therefore, text resources on the network can be taken as the text to be classified, and classified with the second text classification model trained according to the embodiments of this specification. The text resource may be, for example, a comment, a blog post, or a merchandise introduction. The classification may be a binary result, such as risky or risk-free, or a multi-class result, such as high risk, medium risk, low risk, or no risk. The classification result may further be provided to a risk control system to intercept the text resource in time.
Application scenario 2: e-commerce field
With the progress of science and technology, the development of the market economy, and the improvement of people's living standards, a large number of e-commerce platforms have emerged, on which the sheer number and variety of commodities make commodity classification difficult. Therefore, the commodity description text can be taken as the text to be classified, and classified with the second text classification model trained according to the embodiments of this specification. The category obtained is a commodity category, so that the commodity can automatically be put on the shelf under the corresponding category.
Application scenario 3: intelligent dialogue field
An intelligent dialogue system can answer questions and, on the basis of understanding the user's natural language, solve the questions the user raises or provide corresponding services; it is widely applied in intelligent customer service, smart speakers, and the like. In this scenario, the dialogue content input by the user may be taken as the text to be classified, and classified with the second text classification model trained according to the embodiments of this specification. The category obtained may be a user intention or a user emotion. The user may then be served according to the classification result; for example, when a music-search intention is identified, the music resource the user is looking for is returned to the user. For another example, when the user's emotion is identified as anger, the conversation may be switched to a human agent.
The above is a detailed description of the method provided by the embodiments of the present specification, and the following describes in detail the apparatus provided by the embodiments of the present specification.
Fig. 6 is a block diagram of an apparatus for training a text classification model according to an embodiment of the present disclosure, which is a model training apparatus in the system architecture shown in fig. 2. As shown in fig. 6, the apparatus 600 may include: the sample acquisition unit 601, the model construction unit 602, and the first training unit 603 may further include a second training unit 604. Wherein the main functions of each constituent unit are as follows:
The sample acquisition unit 601 is configured to acquire training samples.
A model construction unit 602 configured to take the first text classification model that has been trained as a teacher model; and obtaining a student model by utilizing an embedding module, a decoding module and an encoding module with reduced layers in the first classification model, and adding a mapping module between the encoding module and the decoding module of the student model, wherein the mapping module is used for mapping the second characteristic representation distribution output by the encoding module and outputting a third characteristic representation distribution.
A first training unit 603 configured to use a text sample in the training samples as input of a teacher model and a student model to train the student model, and obtain a second text classification model by using the student model obtained by training; the training targets are as follows: the difference between the first characteristic representation distribution output by the coding module of the teacher model and the third characteristic representation distribution output by the mapping module of the student model is minimized.
The second training unit 604 is configured to perform training in advance by using a plurality of training samples to obtain a first text classification model, wherein the text samples in the training samples are used as input of the first text classification model, and the class labels marked by the text samples in the training samples are used as target output of the first text classification model.
The first text classification model comprises an embedding module, an encoding module and a decoding module; the embedding module extracts the embedded representation of each element Token in the text sample input into the first text classification model; the encoding module encodes the embedded representation of each Token to obtain feature vectors of each Token to form a first feature representation distribution of the text sample; the decoding module predicts a category of the text sample using the first characteristic representation distribution of the text sample.
As one of the realizable modes, the student model multiplexes the parameters of the embedding module, the decoding module and the encoding module in the first classification model before training, and randomly initializes or initializes the parameters of the mapping module to a preset initial value.
As one of the realizable modes, the mapping module may be an MLP with a small number of parameters, for example a single-layer MLP or a multi-layer MLP. A CNN, an RNN, or the like may also be used as the projection module.
As one of the realizable modes, in each iteration of the training process of the student model, the parameters of the embedded module, the encoding module and the mapping module in the student model are updated by using the loss function value corresponding to the training target, and the parameters of the decoding module are kept unchanged.
Wherein the loss function value is obtained by using a mean square error between the first characteristic representation distribution and the third characteristic representation distribution, or by using a KL divergence between the first characteristic representation distribution and the third characteristic representation distribution.
Fig. 7 shows a block diagram of a text classification apparatus according to an embodiment of the present specification. As shown in fig. 7, the apparatus 700 may include: a text acquisition unit 701 and a text classification unit 702. Wherein the main functions of each constituent unit are as follows:
the text acquisition unit 701 is configured to acquire a text to be classified.
A text classification unit 702 configured to input text to be classified into a second text classification model, the second text classification model comprising an embedding module, an encoding module, a mapping module, and a decoding module; the method comprises the steps that an embedding module obtains embedded representations of Token of each element in a text to be classified; the encoding module encodes the embedded representation of each Token to obtain the feature vector of each Token so as to form a second feature representation distribution of the text to be classified; the mapping module maps the second characteristic representation distribution and outputs a third characteristic representation distribution of the text to be classified; and the decoding module classifies the text by using the third characteristic representation distribution to obtain the category of the text to be classified.
Wherein the second text classification model is pre-trained by the apparatus described in fig. 6.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The present description also provides a computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the method of any of the preceding method embodiments.
And an electronic device comprising:
One or more processors; and
a memory associated with the one or more processors for storing program instructions that, when read for execution by the one or more processors, perform the steps of the method of any of the preceding method embodiments.
The present description embodiment also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the preceding method embodiments.
The Memory may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory ), a static storage device, a dynamic storage device, or the like.
From the foregoing description of embodiments, it will be apparent to those skilled in the art that the present embodiments may be implemented in software plus a necessary general purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present specification may be embodied in essence or what contributes to the prior art in the form of a computer program product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present specification.
The foregoing detailed description of the invention has been presented for purposes of illustration and description, and it should be understood that the foregoing is by way of illustration and description only, and is not intended to limit the scope of the invention.

Claims (12)

1. A method of training a text classification model, the method comprising:
taking the trained first text classification model as a teacher model;
obtaining a student model by utilizing an embedding module, a decoding module and a coding module with reduced layers in the first classification model, wherein a mapping module is added between the coding module and the decoding module of the student model, and the mapping module is used for carrying out mapping processing on the second characteristic representation distribution output by the coding module and outputting a third characteristic representation distribution;
taking the text sample in the training sample as the input of the teacher model and the student model to train the student model, and obtaining a second text classification model by using the student model obtained by training; the training targets are as follows: minimizing a difference between a first characteristic representation distribution output by an encoding module of the teacher model and a third characteristic representation distribution output by a mapping module of the student model;
The second text classification model is used for determining the category corresponding to the input text to be classified.
2. The method of claim 1, further comprising, prior to said taking the trained first text classification model as a teacher model:
training a first text classification model by using a plurality of training samples, wherein the text samples in the training samples are used as the input of the first text classification model, and the class labels marked by the text samples in the training samples are used as the target output of the first text classification model;
the first text classification model comprises an embedding module, an encoding module and a decoding module; the embedding module extracts the embedded representation of each element Token in the text sample input into the first text classification model; the encoding module encodes the embedded representation of each Token to obtain feature vectors of each Token so as to form a first feature representation distribution of the text sample; the decoding module predicts a category of the text sample using the first characteristic representation distribution of the text sample.
3. The method of claim 1, wherein the student model multiplexes parameters of the embedding module, the decoding module, and the encoding module in the first classification model and randomly initializes or initializes parameters of the mapping module to preset initial values before the training.
4. The method of claim 1, wherein the mapping module comprises: multilayer perceptron MLP.
5. The method of claim 1, wherein parameters of the embedding module, the encoding module, and the mapping module in the student model are updated with the loss function value corresponding to the training objective in each iteration of the training process, leaving parameters of the decoding module unchanged.
6. The method according to claim 5, wherein the loss function value is derived using a mean squared error between the first and third characteristic representation distributions or using a KL-divergence between the first and third characteristic representation distributions.
7. The method of any one of claims 1 to 6, wherein the method is applied in the field of risk control, the text to be classified comprises text resources on a network, and the category comprises whether a risk exists or a risk level.
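Claims 5 and 6 specify that each training iteration updates only the student's embedding, encoding and mapping parameters, using an MSE or KL-divergence loss between the teacher's first feature representation distribution and the student's third feature representation distribution. Below is a minimal sketch of one such iteration, assuming the modules from the previous sketch; the names distill_step, make_optimizer and the batch key "input_ids" are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_optimizer(student_model, lr=1e-4):
    # Only embedding, encoder and mapping parameters are optimized (claim 5);
    # the decoder parameters stay unchanged.
    params = (
        list(student_model.embedding.parameters())
        + list(student_model.encoder.parameters())
        + list(student_model.mapping.parameters())
    )
    return torch.optim.Adam(params, lr=lr)

def distill_step(teacher_model, student_model, batch, optimizer, use_kl=False):
    teacher_model.eval()
    student_model.train()

    with torch.no_grad():
        # First feature representation distribution: teacher encoder output.
        first_dist = teacher_model.encoder(teacher_model.embedding(batch["input_ids"]))

    # Second feature representation distribution: output of the shortened encoder.
    second_dist = student_model.encoder(student_model.embedding(batch["input_ids"]))
    # Third feature representation distribution: output of the mapping module.
    third_dist = student_model.mapping(second_dist)

    if use_kl:
        # KL divergence between the two distributions (claim 6).
        loss = F.kl_div(
            F.log_softmax(third_dist, dim=-1),
            F.softmax(first_dist, dim=-1),
            reduction="batchmean",
        )
    else:
        # Mean squared error between the two distributions (claim 6).
        loss = F.mse_loss(third_dist, first_dist)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```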
8. A text classification method, the method comprising:
acquiring a text to be classified;
inputting the text to be classified into a second text classification model, wherein the second text classification model comprises an embedding module, an encoding module, a mapping module and a decoding module; the embedding module acquires an embedded representation of each element (Token) in the text to be classified; the encoding module encodes the embedded representation of each Token to obtain a feature vector of each Token, thereby forming the second feature representation distribution of the text to be classified; the mapping module maps the second feature representation distribution and outputs the third feature representation distribution of the text to be classified; and the decoding module classifies the text to be classified using the third feature representation distribution to obtain the category of the text to be classified;
wherein the second text classification model is pre-trained using the method of any one of claims 1 to 7.
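The inference path of claim 8 simply chains the four modules of the second text classification model. A minimal sketch, assuming the trained student from the earlier sketches serves as the second text classification model; tokenize() is a hypothetical tokenizer returning Token ids and is not part of the claims.

```python
import torch

@torch.no_grad()
def classify(second_model, text: str, tokenize) -> int:
    second_model.eval()
    input_ids = tokenize(text)                       # ids of each element (Token)
    emb = second_model.embedding(input_ids)          # embedded representation of each Token
    second_dist = second_model.encoder(emb)          # second feature representation distribution
    third_dist = second_model.mapping(second_dist)   # third feature representation distribution
    logits = second_model.decoder(third_dist)        # decoding module predicts the category
    # Assumes the decoder pools over Tokens and returns one score per category.
    return int(logits.argmax(dim=-1))
```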
9. An apparatus for training a text classification model, the apparatus comprising:
a sample acquisition unit configured to acquire training samples;
a model construction unit configured to take a trained first text classification model as a teacher model, and to obtain a student model by using the embedding module, the decoding module and an encoding module with a reduced number of layers from the first text classification model, wherein a mapping module is added between the encoding module and the decoding module of the student model, and the mapping module is used for mapping the second feature representation distribution output by the encoding module and outputting a third feature representation distribution;
a first training unit configured to train the student model by taking the text samples in the training samples as input to both the teacher model and the student model, and to obtain a second text classification model from the trained student model, wherein the training objective is to minimize the difference between the first feature representation distribution output by the encoding module of the teacher model and the third feature representation distribution output by the mapping module of the student model;
wherein the second text classification model is used for obtaining the category corresponding to an input text to be classified.
10. A text classification apparatus, the apparatus comprising:
a text acquisition unit configured to acquire a text to be classified;
a text classification unit configured to input the text to be classified into a second text classification model, the second text classification model comprising an embedding module, an encoding module, a mapping module and a decoding module; the embedding module acquires an embedded representation of each element (Token) in the text to be classified; the encoding module encodes the embedded representation of each Token to obtain a feature vector of each Token, thereby forming the second feature representation distribution of the text to be classified; the mapping module maps the second feature representation distribution and outputs the third feature representation distribution of the text to be classified; and the decoding module classifies the text to be classified using the third feature representation distribution to obtain the category of the text to be classified;
wherein the second text classification model is pre-trained by the apparatus of claim 9.
11. A computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method of any one of claims 1 to 8.
12. A computing device comprising a memory and a processor, wherein the memory stores executable code which, when executed by the processor, implements the method of any one of claims 1 to 8.
CN202310240367.XA 2023-03-07 2023-03-07 Method for training text classification model, text classification method and corresponding device Pending CN116306869A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310240367.XA CN116306869A (en) 2023-03-07 2023-03-07 Method for training text classification model, text classification method and corresponding device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310240367.XA CN116306869A (en) 2023-03-07 2023-03-07 Method for training text classification model, text classification method and corresponding device

Publications (1)

Publication Number Publication Date
CN116306869A true CN116306869A (en) 2023-06-23

Family

ID=86812720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310240367.XA Pending CN116306869A (en) 2023-03-07 2023-03-07 Method for training text classification model, text classification method and corresponding device

Country Status (1)

Country Link
CN (1) CN116306869A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861302A (en) * 2023-09-05 2023-10-10 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method
CN116861302B (en) * 2023-09-05 2024-01-23 吉奥时空信息技术股份有限公司 Automatic case classifying and distributing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination