CN111078876A - Short text classification method and system based on multi-model integration

Short text classification method and system based on multi-model integration

Info

Publication number
CN111078876A
CN111078876A
Authority
CN
China
Prior art keywords
classification
model
training
short text
models
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911229492.0A
Other languages
Chinese (zh)
Inventor
段东圣
井雅琪
任博雅
时磊
孙旷怡
李扬曦
佟玲玲
习健
宋永浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Computing Technology of CAS
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS and National Computer Network and Information Security Management Center
Priority to CN201911229492.0A
Publication of CN111078876A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a short text classification method based on multi-model integration, which comprises the following steps: selecting a plurality of classification models for classifying short texts; sampling the training samples to generate training sets in one-to-one correspondence with the classification models; training each classification model on its corresponding training set to obtain a corresponding final model; classifying the target text with all the final models to obtain a plurality of classification result vectors; and integrating all the classification result vectors into a final result vector, and taking the class represented by the largest element of the final result vector as the class of the target text.

Description

Short text classification method and system based on multi-model integration
Technical Field
The invention relates to the field of deep learning, and in particular to a method and system for classifying Chinese short text information with multiple models.
Background
With the rapid development of social media such as microblogs (Weibo) and WeChat, short texts have become an important form of information in daily life. Properly classifying short text messages (i.e., assigning each sample a category from a predefined set of subject categories) has wide application, such as identifying information of specific categories and multi-dimensional classification of product reviews.
Chinese patent CN109739986A discloses a complaint short text classification method based on deep ensemble learning: a BTM topic model and a convolutional neural network are used to extract text features separately, and the combined features are input into an ensemble random forest model. Compared with that method, which uses a random forest for integration, the present invention integrates submodels of different types and structures (Bert, TextRNN, TextCNN and SVM); these submodels differ greatly in structure and are rich in diversity, and can extract and encode the differentiated features of short text data samples from different angles, so that the extracted feature distribution is closer to the overall feature distribution of the data. Chinese patent CN107292348A discloses a Bagging-BSJ short text classification method, which applies the Bagging ensemble idea to perform semantic feature expansion on short texts and classifies the expanded texts by combining a Bayesian algorithm, a support vector machine algorithm and the J48 algorithm.
Using deep learning models to classify short text messages has become common in recent years. In particular, the Bert model released by the Google AI team in 2018 is a large model built on a deep bidirectional Transformer with more than 300 million parameters; it achieved the best performance of its time on 11 NLP tasks and caused a huge stir in the NLP community. Subsequently, organizations such as OpenAI and FastAI successively released their own large models, and well-known models such as GPT, GPT-2 and ELMo have refreshed NLP leaderboards many times.
However, large models represented by Bert still have unsolved problems in real-world short text classification; only one of them is analyzed here. Because the number of parameters to be trained is huge, a large model needs a large amount of training data even when it is fine-tuned from a pre-trained model, and in practical applications it is difficult to collect labeled data in quantities that match the model's capacity. Because a large model has extremely strong fitting capability, overfitting often occurs when data is insufficient, so the generalization capability is insufficient: the trained model classifies the training data well, but its classification performance on unknown data drops sharply.
At present, no relevant method or scheme has been found for improving the generalization capability of the Bert model. In traditional machine learning and deep learning applications, generalization is usually improved by expanding the training data set: with more training samples, the distribution of the training set better approximates the overall distribution of the data, so the trained model fits that overall distribution more accurately and generalizes better. In real-world applications, however, it is often difficult to collect enough training data, which requires high time and labor costs, so increasing the generalization capability of Bert in this way is expensive.
Disclosure of Invention
To address the insufficient generalization of Bert-based short text classification when the scale of the training data cannot match the number of parameters of the Bert model in practical applications, the present method trains a plurality of short text classification models separately and then integrates their classification results to obtain the final classification result.
Specifically, the short text classification method based on multi-model integration comprises the following steps: selecting a plurality of classification models for classifying short texts; sampling the training samples to generate a plurality of training sets in one-to-one correspondence with the classification models; training each classification model on its corresponding training set to obtain a corresponding final model; classifying the target text with all the final models to obtain a plurality of classification result vectors; and integrating all the classification result vectors into a final result vector, and taking the class represented by the largest element of the final result vector as the class of the target text.
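For illustration only, the following minimal Python sketch shows the shape of this pipeline; the model objects and their fit/predict_proba interface are hypothetical stand-ins for the Bert, TextRNN, TextCNN and SVM classifiers, not code prescribed by the patent:

```python
from typing import Sequence


class EnsembleShortTextClassifier:
    """Sketch of the claimed pipeline: N models, N training sets, fused output."""

    def __init__(self, models: Sequence):
        self.models = models  # e.g. wrappers around Bert, TextRNN, TextCNN, SVM

    def fit(self, training_sets: Sequence[tuple]) -> None:
        # one training set per model, in one-to-one correspondence
        for model, (texts, labels) in zip(self.models, training_sets):
            model.fit(texts, labels)

    def predict(self, text: str) -> int:
        # each model returns a two-element vector [P(class 0), P(class 1)]
        vectors = [model.predict_proba(text) for model in self.models]
        # element-wise average as the final result vector (weights omitted here)
        final = [sum(v[i] for v in vectors) / len(vectors) for i in range(2)]
        # the class represented by the largest element is the prediction
        return max(range(2), key=lambda i: final[i])
```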
In the short text classification method of the invention, the classification models comprise: a Bert model, a TextRNN model, a TextCNN model, and an SVM model.
In the short text classification method of the invention, the classification result vector is a two-element vector: its first value represents the probability that the target text belongs to a first class, and its second value represents the probability that the target text belongs to a second class; the final result vector, also a two-element vector, is obtained by taking the weighted average of all the classification result vectors.
In the short text classification method of the invention, sampling the training samples comprises: sampling data from the training samples multiple times with replacement to generate the training sets; when the number of training samples is greater than a sampling threshold, the generated training sets are independent of each other, and when the number of training samples is less than or equal to the sampling threshold, the generated training sets are identical.
The invention also provides a short text classification system based on multi-model integration, which comprises: a classification model selection module for selecting a plurality of classification models for classifying short texts; a training data acquisition module for sampling the training samples and generating a plurality of training sets in one-to-one correspondence with the classification models; a classification model training module for training each classification model on its corresponding training set to obtain a plurality of final models; a target text classification module for classifying the target text with all the final models to obtain a plurality of classification result vectors; and a classification result integration module for integrating all the classification result vectors into a final result vector and taking the class represented by the largest element of the final result vector as the class of the target text.
In the short text classification system of the invention, the classification models comprise: a Bert model, a TextRNN model, a TextCNN model, and an SVM model.
In the short text classification system of the invention, in the target text classification module, the classification result vector is a two-element vector whose first value represents the probability that the target text belongs to a first class and whose second value represents the probability that the target text belongs to a second class; in the classification result integration module, the final result vector, also a two-element vector, is obtained by taking the weighted average of all the classification result vectors.
In the short text classification system of the invention, the training data acquisition module samples data from the training samples multiple times with replacement to generate the training sets; when the number of training samples is greater than a sampling threshold, the generated training sets are independent of each other, and when the number of training samples is less than or equal to the sampling threshold, the generated training sets are identical.
The present invention also provides a computer-readable storage medium storing executable instructions for performing the short text classification method based on multi-model integration as described above.
The invention also provides a data processing device comprising the above computer-readable storage medium; a processor of the data processing device retrieves and executes the executable instructions in the storage medium to perform short text classification based on multi-model integration.
In the short text classification method of the invention, the classification models are trained on separate training sets and their classification results are then weighted and averaged into the final classification result, so that unknown data is classified better and better generalization capability is obtained.
Drawings
FIG. 1 is a flow chart of the short text classification method based on multi-model integration of the present invention.
FIG. 2 is a flow chart of training sample sampling of the short text classification method of the present invention.
FIG. 3 is a schematic diagram of the training of the classification model of the short text classification method of the present invention.
FIG. 4 is a schematic diagram of the multi-model ensemble classification of the present invention.
FIG. 5 is a schematic diagram of a data processing apparatus of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the short text classification method and system based on multi-model integration proposed by the present invention are further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention aims to solve the problem that the generalization capability of a Bert model is insufficient for short text classification when, in practical applications, the scale of the training data cannot match the number of parameters of the model, and provides a multi-model integration framework for this purpose.
The invention integrates classification models of different types and structures (Bert, TextRNN, TextCNN and SVM); the models differ greatly in structure and are rich in diversity, and can extract and encode the differentiated features of short text data samples from different angles, so that the extracted feature distribution is closer to the overall feature distribution of the data. The integration considers not only the traditional SVM model with a non-deep network structure, but also adds a Bert model built on a Transformer and an attention mechanism, a TextRNN model based on RNN, and a TextCNN model based on CNN. By selecting short text classification models of different types, differentiated extraction and encoding of text features is achieved, so that the overall distribution of the data is fitted better.
The technical key points of the model integration framework and system for improving the generalization capability of short text classification mainly comprise the selection of multiple short text classification models, training data sampling and model training, and the fusion of multi-model classification results:
1. Selecting multiple short text classification models. The key is to select short text classification models of different types, so as to achieve differentiated extraction and encoding of text features and better fit the overall distribution of the data. Because the Bert model is a deep model built on a Transformer and an attention mechanism, this scheme selects three models whose mechanisms differ from Bert's: a TextRNN model based on RNN, a TextCNN model based on CNN, and an SVM model with a non-deep network structure (all three are open-source models). They extract and encode the differentiated features of short text data samples from different angles, so that the extracted feature distribution is closer to the feature distribution of the data population.
2. For a given training set, sampling data multiple times with replacement to generate multiple data sets. Matching the 4 short text classification models of the specific embodiment of the present invention: for a data set with more than 20,000 samples, each of the 4 generated data sets contains 15,000 samples; for a data set with 20,000 samples or fewer, no sampling is needed (i.e., the 4 data sets are the same original data set). The 4 selected models are then trained with the 4 generated data sets, respectively. The technical effect of this step is as follows: when training data is insufficient, dividing the data and separately training 3 small-scale auxiliary models avoids overfitting of the Bert model to the data characteristics of the training samples.
3. Fusing the multi-model classification results. For short text data to be classified, the 4 trained models each perform classification; each model outputs a vector of per-class probability values, and the 4 result vectors are weighted and averaged to generate the final result vector. In the final result vector, the class represented by the largest element is the class of the short text data. Because the 4 selected models extract and encode the features of the input data from different angles, the weighted average reduces the high variance caused by overfitting of a single model and improves the accuracy of text classification.
To address the insufficient generalization of Bert-based short text classification when the scale of the training data cannot match the number of parameters of the Bert model in practical applications, the invention designs the multi-model integration framework and system for improving the generalization capability of short text classification, which achieve a better classification effect on unknown data and better generalization capability. FIG. 1 is a flow chart of the short text classification method based on multi-model integration of the present invention. As shown in FIG. 1, the present invention targets binary (two-class) classification; a specific embodiment is as follows:
Step S1: selecting multiple short text classification models. The invention takes the Bert model as the base model and selects three models whose mechanisms differ from Bert's: a TextRNN model based on RNN, a TextCNN model based on CNN, and an SVM model with a non-deep network structure (all three are open-source models). They extract and encode the differentiated features of the data samples from different angles, so that the extracted feature distribution is closer to the overall feature distribution of the data.
Step S2: sampling the training samples to generate four independent training sets. FIG. 2 is a flow chart of training sample sampling for the short text classification method of the present invention. As shown in FIG. 2, the sampling algorithm proceeds as follows (a code sketch is given after the steps):
Step S21, reading all the labeled data samples;
Step S22, judging whether the total number of labeled data samples exceeds 20,000: if it is 20,000 or fewer, no sampling is needed (the 4 classification models are trained with the original training set) and the sampling process ends; if it exceeds 20,000, the next step is executed;
Step S23, randomly drawing one sample from the sample set and putting it into the first result set;
Step S24, repeating step S23 until the number of samples in the first result set reaches 15,000, then generating the second, third and fourth result data sets in the same way until all data sets are complete.
At this point, four independent sets of training samples have been generated.
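A minimal Python sketch of steps S21 to S24, under the thresholds stated above (20,000 samples to trigger sampling, 15,000 samples per result set); `random.choices` draws with replacement:

```python
import random

SAMPLE_THRESHOLD = 20_000  # S22: at or below this, the original set is reused
RESULT_SET_SIZE = 15_000   # S24: size of each bootstrap result set
NUM_MODELS = 4


def build_training_sets(labeled_samples: list) -> list:
    """Steps S21-S24: generate four training sets by sampling with replacement."""
    if len(labeled_samples) <= SAMPLE_THRESHOLD:
        # S22: not enough data -- all four models share the original training set
        return [labeled_samples] * NUM_MODELS
    # S23/S24: draw 15,000 samples with replacement for each of the four sets
    return [random.choices(labeled_samples, k=RESULT_SET_SIZE)
            for _ in range(NUM_MODELS)]
```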
Step S3: training the classification models. The four training data sets generated in step S2 are used to train the four text classification models (Bert, TextRNN, TextCNN and SVM) respectively, until the parameters converge. FIG. 3 is a schematic diagram of classification model training in the short text classification method of the present invention. As shown in FIG. 3, the training text data must first be cleaned, which includes arranging the text into a format that satisfies the input requirements and removing special symbols that are meaningless for the text, such as asterisks (*) and pound signs (#). After data cleaning, the training process differs per model. The SVM and the TextRNN both require word segmentation, and a user-defined dictionary is added in the segmentation step to improve segmentation accuracy. The segmented text data is vectorized by a pre-trained Word2Vec language model and then input into the SVM and TextRNN models for training. The Word2Vec language model is trained with nearly 5 million pieces of in-domain corpus data, so that the model is more aware of and sensitive to characteristics such as the composition and distribution of text in the specific domain, improving recognition accuracy. Bert and TextCNN take the text data directly as input, without the segmentation and vectorization steps; Bert additionally goes through a "pre-training" process before training (fine-tuning), which also uses the nearly 5 million pieces of in-domain corpus data.
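The following Python sketch illustrates the cleaning, segmentation and vectorization path for the SVM/TextRNN branch using jieba and gensim; the user-dictionary path and the averaging of word vectors into a single text vector are illustrative assumptions, not details fixed by the patent:

```python
import re

import jieba
import numpy as np
from gensim.models import Word2Vec

# hypothetical path to the user-defined segmentation dictionary
jieba.load_userdict("user_dict.txt")


def clean(text: str) -> str:
    # remove symbols that are meaningless for the text, e.g. '*' and '#'
    return re.sub(r"[*#]", "", text)


def vectorize(text: str, w2v: Word2Vec) -> np.ndarray:
    """Segment the cleaned text and turn it into one vector for SVM/TextRNN."""
    tokens = [t for t in jieba.lcut(clean(text)) if t in w2v.wv]
    if not tokens:
        return np.zeros(w2v.vector_size)
    # average the pre-trained word vectors (one common vectorization choice)
    return np.mean([w2v.wv[t] for t in tokens], axis=0)
```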
Step S4: integrating the multi-model classification results. When classifying, the input text (target text) is classified simultaneously by the multiple models generated in step S3. FIG. 4 is a schematic diagram of the multi-model integrated classification of the present invention. As shown in FIG. 4, each classification model produces a classification result that is a two-element vector: the first value represents the probability that the input text belongs to the first class, and the second value represents the probability that the input text belongs to the second class. The four two-element vectors are then weighted and averaged element-wise to obtain a new two-element vector (the final result vector), and the class represented by the largest element of the final result vector is the class of the short text data. In embodiments of the present invention, the classification results may also be integrated in other ways, such as summation, and the invention is not limited in this respect.
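A short sketch of this fusion step in Python/NumPy; the weight values in the usage example are purely illustrative, since the patent does not specify them:

```python
import numpy as np


def fuse(result_vectors, weights=None) -> int:
    """Element-wise weighted average of the per-model two-element vectors;
    the index of the largest element of the final vector is the class."""
    vectors = np.asarray(result_vectors)               # shape: (4, 2)
    final = np.average(vectors, axis=0, weights=weights)
    return int(np.argmax(final))


# e.g. Bert, TextRNN, TextCNN, SVM outputs for one target text
outputs = [[0.9, 0.1], [0.7, 0.3], [0.8, 0.2], [0.6, 0.4]]
print(fuse(outputs))                                 # unweighted -> class 0
print(fuse(outputs, weights=[0.4, 0.2, 0.2, 0.2]))   # illustrative weights
```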
FIG. 5 is a schematic diagram of a data processing apparatus of the present invention. As shown in FIG. 5, embodiments of the present invention also provide a computer-readable storage medium and a data processing apparatus. The computer-readable storage medium stores executable instructions which, when executed by a processor of the data processing apparatus, implement the above short text classification method based on multi-model integration. Those skilled in the art will understand that all or part of the steps of the above method may be implemented by instructing relevant hardware (e.g., a processor, an FPGA, an ASIC, etc.) through a program, and the program may be stored in a readable storage medium such as a read-only memory or a magnetic or optical disk. All or some of the steps of the above embodiments may also be implemented with one or more integrated circuits. Accordingly, the modules in the above embodiments may be implemented in hardware, for example by an integrated circuit, or in software, for example by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific combination of hardware and software.
The method and system can be applied to short text classification scenarios, such as classifying SMS messages, screening specific categories of data on Weibo, filtering spam mail, and classifying queries for chatbots. To address the insufficient generalization of Bert-based short text classification when the scale of the training data cannot match the number of parameters of the Bert model in practical applications, the invention designs a multi-model integration framework and system for improving the generalization capability of Bert. The integrated system has undergone extensive practical testing in Weibo data screening applications; the results show that although the integrated model takes more time to train, the resulting composite model classifies unknown data with better accuracy and stability, achieving better generalization.
The above embodiments are only for illustrating the invention and are not to be construed as limiting it; those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, all equivalent technical solutions also fall within the scope of the invention, which is defined by the claims.

Claims (10)

1. A short text classification method based on multi-model integration is characterized by comprising the following steps:
selecting a plurality of classification models for classifying the short texts;
sampling the training samples to generate a plurality of training sets in one-to-one correspondence with the classification models;
training each classification model on its corresponding training set to obtain a corresponding final model;
classifying the target text through all the final models to obtain a plurality of classification result vectors;
and integrating all the classification result vectors to obtain a final result vector, and taking the class represented by the element with the maximum value in the final result vector as the class of the target text.
2. The short text classification method of claim 1, wherein the classification models comprise: a Bert model, a TextRNN model, a TextCNN model, and an SVM model.
3. The short text classification method according to claim 1 or 2, wherein the classification result vector is a two-element vector, a first value of which represents the probability that the target text belongs to a first class and a second value of which represents the probability that the target text belongs to a second class; and the final result vector, also a two-element vector, is obtained by taking the weighted average of all the classification result vectors.
4. The short text classification method of claim 1, wherein sampling the training samples comprises: sampling data from the training samples multiple times with replacement to generate the training sets; when the number of training samples is greater than a sampling threshold, the generated training sets are independent of each other, and when the number of training samples is less than or equal to the sampling threshold, the generated training sets are identical.
5. A short text classification system based on multi-model integration, comprising:
the classification model selection module is used for selecting a plurality of classification models for classifying the short texts;
the training data acquisition module is used for sampling the training samples and generating a plurality of training sets in one-to-one correspondence with the classification models;
the classification model training module is used for training each classification model on its corresponding training set to obtain a plurality of final models;
the target text classification module is used for classifying the target texts through all the final models to obtain a plurality of classification result vectors;
and the classification result integration module is used for integrating all the classification result vectors to obtain a final result vector, and taking the class represented by the element with the maximum value in the final result vector as the class of the target text.
6. The short text classification system of claim 5, wherein the classification models comprise: a Bert model, a TextRNN model, a TextCNN model, and an SVM model.
7. The short text classification system according to claim 5 or 6, wherein, in the target text classification module, the classification result vector is a two-element vector, a first value of which represents the probability that the target text belongs to a first class and a second value of which represents the probability that the target text belongs to a second class;
in the classification result integration module, the final result vector, also a two-element vector, is obtained by taking the weighted average of all the classification result vectors.
8. The short text classification system of claim 5, wherein the training data acquisition module samples data from the training samples multiple times with replacement to generate the training sets; when the number of training samples is greater than a sampling threshold, the generated training sets are independent of each other, and when the number of training samples is less than or equal to the sampling threshold, the generated training sets are identical.
9. A computer readable storage medium storing executable instructions for performing the short text classification method based on multi-model integration according to any one of claims 1 to 4.
10. A data processing apparatus comprising the computer-readable storage medium of claim 9, wherein a processor of the data processing apparatus retrieves and executes the executable instructions in the storage medium to perform short text classification based on multi-model integration.
CN201911229492.0A 2019-12-04 2019-12-04 Short text classification method and system based on multi-model integration Pending CN111078876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911229492.0A CN111078876A (en) 2019-12-04 2019-12-04 Short text classification method and system based on multi-model integration

Publications (1)

Publication Number Publication Date
CN111078876A true CN111078876A (en) 2020-04-28

Family

ID=70312849

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911229492.0A Pending CN111078876A (en) 2019-12-04 2019-12-04 Short text classification method and system based on multi-model integration

Country Status (1)

Country Link
CN (1) CN111078876A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468713A (en) * 2015-11-19 2016-04-06 西安交通大学 Multi-model fused short text classification method
CN105912716A (en) * 2016-04-29 2016-08-31 国家计算机网络与信息安全管理中心 Short text classification method and apparatus
CN108280462A (en) * 2017-12-11 2018-07-13 北京三快在线科技有限公司 A kind of model training method and device, electronic equipment
CN108959265A (en) * 2018-07-13 2018-12-07 深圳市牛鼎丰科技有限公司 Cross-domain texts sensibility classification method, device, computer equipment and storage medium
CN109034233A (en) * 2018-07-18 2018-12-18 武汉大学 A kind of high-resolution remote sensing image multi classifier combination classification method of combination OpenStreetMap

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111597328A (en) * 2020-05-27 2020-08-28 青岛大学 New event theme extraction method
CN111782804A (en) * 2020-06-09 2020-10-16 中科院成都信息技术股份有限公司 TextCNN-based same-distribution text data selection method, system and storage medium
CN111782804B (en) * 2020-06-09 2023-05-02 中科院成都信息技术股份有限公司 Text CNN-based co-distributed text data selection method, system and storage medium
CN111737473A (en) * 2020-07-17 2020-10-02 浙江口碑网络技术有限公司 Text classification method, device and equipment
WO2021189974A1 (en) * 2020-10-21 2021-09-30 平安科技(深圳)有限公司 Model training method and apparatus, text classification method and apparatus, computer device and medium
CN112307212A (en) * 2020-11-11 2021-02-02 上海昌投网络科技有限公司 Public opinion delivery monitoring method for advertisement delivery
CN113780338A (en) * 2021-07-30 2021-12-10 国家计算机网络与信息安全管理中心 Confidence evaluation method, system, equipment and storage medium in big data analysis based on support vector machine
CN113780338B (en) * 2021-07-30 2024-04-09 国家计算机网络与信息安全管理中心 Confidence evaluation method, system, equipment and storage medium in big data analysis based on support vector machine
CN118069852A (en) * 2024-04-22 2024-05-24 数据空间研究院 Multi-model fusion data classification prediction method and system

Similar Documents

Publication Publication Date Title
CN111078876A (en) Short text classification method and system based on multi-model integration
CN110609897B (en) Multi-category Chinese text classification method integrating global and local features
Prusa et al. The effect of dataset size on training tweet sentiment classifiers
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN112069310B (en) Text classification method and system based on active learning strategy
CN108287858A (en) The semantic extracting method and device of natural language
CN108255813B (en) Text matching method based on word frequency-inverse document and CRF
CN109271514B (en) Generation method, classification method, device and storage medium of short text classification model
CN112906397B (en) Short text entity disambiguation method
CN112231477A (en) Text classification method based on improved capsule network
CN111368096A (en) Knowledge graph-based information analysis method, device, equipment and storage medium
CN111046979A (en) Method and system for discovering badcase based on small sample learning
CN114490953B (en) Method for training event extraction model, method, device and medium for extracting event
CN110647995A (en) Rule training method, device, equipment and storage medium
CN113609289A (en) Multi-mode dialog text-based emotion recognition method
CN117150026B (en) Text content multi-label classification method and device
CN115374845A (en) Commodity information reasoning method and device
Mishra et al. Twitter sentiment analysis using naive bayes algorithm
Jayakody et al. Sentiment analysis on product reviews on twitter using Machine Learning Approaches
CN114065749A (en) Text-oriented Guangdong language recognition model and training and recognition method of system
CN105183807A (en) emotion reason event identifying method and system based on structure syntax
CN112463964B (en) Text classification and model training method, device, equipment and storage medium
CN114254622A (en) Intention identification method and device
CN110162629B (en) Text classification method based on multi-base model framework
Li et al. Multilingual toxic text classification model based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200428)