CN111177376B

CN111177376B - Chinese text classification method based on BERT and CNN hierarchical connection

Info

Publication number: CN111177376B
Application number: CN201911302047.2A
Authority: CN
Inventors: 马强; 赵鸣博; 孔维健; 王晓峰; 孙嘉瞳; 邓开连
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2019-12-17
Filing date: 2019-12-17
Publication date: 2023-08-15
Anticipated expiration: 2039-12-17
Also published as: CN111177376A

Abstract

The application relates to a Chinese text classification method based on the hierarchical connection of BERT and CNN, intended mainly for Chinese text classification problems such as sentiment analysis, core sentence recognition, and relation recognition. A CNN model is hierarchically connected to a BERT model to obtain a new model, BERT-CNN. Because the CNN model is added, the sentence features extracted by the BERT model are processed further, yielding a more effective semantic representation of the sentence and therefore a better result in text classification tasks.

Description

Chinese text classification method based on BERT and CNN hierarchical connection

Technical Field

The application belongs to the technical field of natural language processing, and particularly relates to a Chinese text classification method based on the hierarchical connection of the deep learning models BERT and CNN.

Background

With the rapid development of the economy and the internet, more and more people post opinions online. Given the large volume of text data on the network, how to efficiently extract valuable information from it has become a research hotspot. Question-answering robots, search, machine translation, and sentiment analysis are key application areas of natural language processing; none of them can do without text classification, which underlies them all. Precisely because text classification is foundational, its accuracy requirements are high. Text classification has therefore long been both a research hotspot and a difficult problem.

With the rapid development of machine learning and deep learning, text classification no longer depends on time-consuming and labor-intensive manual work and has shifted to automatic techniques. As accuracy has improved, it has been widely applied to sentiment analysis and spam text recognition. However, it still performs poorly in some areas, such as illegal advertisement recognition, and sentiment analysis and spam text recognition also urgently need higher accuracy.

At present, deep learning achieves good results in text classification, but its effectiveness depends on extracting the semantic features of sentences. Conventional deep learning models use vectorized words or characters of a sentence as model input, so the result is sensitive to the vectorization; vectors often have to be trained separately for texts in each field, which is time-consuming and laborious. The model introduced here not only performs better but also does not require words or characters to be vectorized separately for each field.

Disclosure of Invention

The purpose of the application is to further improve the classification performance on Chinese text.

In order to achieve the above purpose, the technical scheme of the application is to provide a Chinese text classification method based on the hierarchical connection of BERT and CNN, which is characterized by comprising the following steps:

step 1, pre-training a BERT model on a large number of public Chinese text data sets to obtain and store all parameters of the BERT model, wherein the BERT model consists of 12 Transformer layers;

step 2, hierarchically connecting a CNN model to the BERT model to obtain a BERT-CNN model, wherein the output at the first position of each of the 12 layers of the BERT model is used as the input of the CNN model, so that the input width is 12; in the BERT-CNN model, the CNN model applies convolution and max-pooling operations to the input matrix of width 12 to obtain a new, more effective sentence semantic feature vector, which is then fed into a fully connected layer and finally passed through a classifier;

step 3, initializing the parameters of the BERT part with the values obtained in the earlier pre-training, and initializing the parameters of the CNN part with random values drawn from a normal distribution;

step 4, performing data preprocessing on the classification training set;

step 5, retraining the BERT-CNN model on the preprocessed data set.

Preferably, in step 1, the Chinese text data sets for pre-training the BERT model comprise an intra-sentence prediction training set and a sentence-pair continuity training set, wherein:

the intra-sentence prediction training set is constructed as follows:

after the data is split into sentences, 15% of the words in each sentence are randomly selected for masking; of these, 80% are replaced by [MASK], 10% are kept as the original words, and the remaining 10% are replaced by a random word; a [CLS] character is prepended to the sentence, and the resulting new sentence is used as the input of the BERT model to predict the 15% of words that were masked;

the sentence-pair continuity training set is constructed as follows:

after the data is split into sentences, any two sentences are joined into one through [SEP], and a [CLS] character is prepended; the resulting new sentence is used as the input of the BERT model to predict whether the two sentences are consecutive in the article, and the output of the BERT model is a probability value indicating the probability that the two sentences are consecutive.

Preferably, in step 2, the core component of the Transformer encoder is a multi-head attention mechanism composed of 8 self-attention mechanisms, and the output of the Transformer encoder is formed by concatenating the outputs of the 8 self-attention mechanisms.

Preferably, in step 4, the data preprocessing comprises removing some invalid character strings from the sentence and then splitting the sentence into individual characters.
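The preprocessing in step 4 can be sketched as follows. The application does not list which character strings count as invalid, so the pattern below (URLs, HTML-like tags, and whitespace) is purely an assumption for illustration, as is the function name `preprocess`.

```python
import re

# Assumption: "invalid character strings" are taken here to be URLs,
# HTML-like tags and whitespace; the source does not enumerate them.
_INVALID = re.compile(r"https?://\S+|<[^>]+>|\s+")


def preprocess(sentence: str) -> list[str]:
    """Remove the assumed-invalid substrings, then split per character."""
    cleaned = _INVALID.sub("", sentence)
    return list(cleaned)
```

Splitting per character rather than per word means no Chinese word segmenter is needed. Note that the URL pattern stops only at whitespace, so a URL immediately followed by text would swallow that text; a real cleaning rule would need to be tuned to the data.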

The application provides a Chinese text classification method based on the hierarchical connection of Bidirectional Encoder Representations from Transformers (BERT for short) and Convolutional Neural Networks (CNN for short), in which a BERT model and a CNN model are hierarchically connected to further improve the model's ability to extract sentence semantic features.

By adding a CNN model when extracting sentence semantic features, the BERT-CNN based method obtains more effective semantic features and achieves better text classification results than some current Chinese text classification models.

Drawings

FIG. 1 is a flow diagram of a method for classifying Chinese text based on BERT and CNN hierarchical connection according to the present application;

FIG. 2 is an internal structural diagram of the BERT-CNN model of the present application.

Detailed Description

The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

The specific embodiment of the application relates to a Chinese text classification method based on the hierarchical connection of BERT and CNN, comprising: pre-training a BERT model on a Chinese Wikipedia text data set to obtain and store all parameters of the BERT model; hierarchically connecting a CNN model to the BERT model to obtain a new model, BERT-CNN; initializing the parameters of the BERT part with the values obtained in the earlier pre-training, and the parameters of the CNN part with random values drawn from a normal distribution; performing data preprocessing on the classification training set; and finally retraining the BERT-CNN model on the preprocessed data set.

FIG. 1 shows a flow diagram of the Chinese text classification method based on the hierarchical connection of BERT and CNN according to the present application.

As shown in FIG. 1, after the flow begins, the first step is to pre-train the BERT model. Pre-training consists of two parts: first, constructing the training sets; second, training the BERT model on the constructed training sets.

Two training sets are constructed: an intra-sentence prediction training set and a sentence-pair continuity training set. The specific implementation steps are as follows:

The intra-sentence prediction training set is constructed by randomly selecting 15% of the words in a sentence for masking. Of these 15%, 80% are replaced with [MASK], 10% are kept as the original words, and the remaining 10% are replaced with a random word. A [CLS] character is prepended to the sentence, and the new sentence constructed in this way is used as the model input to predict the 15% of words that were masked.
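The masking procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `build_masked_example`, the `vocab` argument, and the rule that at least one token is always selected are hypothetical, and the 80/10/10 split is applied per selected token, so it holds in expectation rather than exactly.

```python
import random

MASK, CLS = "[MASK]", "[CLS]"


def build_masked_example(tokens, vocab, rng, mask_rate=0.15):
    """Return (model_input, labels).

    labels maps a position in model_input to the original token there.
    """
    if not tokens:
        return [CLS], {}
    # Assumption: select at least one token so short sentences still train.
    n_select = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos + 1] = tokens[pos]  # +1 because [CLS] is prepended
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK  # about 80%: replace with [MASK]
        elif roll < 0.9:
            pass  # about 10%: keep the original token
        else:
            corrupted[pos] = rng.choice(vocab)  # about 10%: random token
    return [CLS] + corrupted, labels
```

Passing a seeded `random.Random` instance as `rng` makes the construction reproducible. The model is then trained to recover `labels` from `model_input`.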

The sentence-pair continuity training set is constructed by joining any two sentences in an article into one through [SEP] and prepending a [CLS] character; the resulting new sentence is used as the model input to predict whether the two sentences are consecutive in the article. The output of the model is a probability value representing the probability that the two sentences are consecutive.
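A minimal Python sketch of the sentence-pair construction is given below. The application only says that any two sentences are joined, so the 50/50 balance between consecutive and non-consecutive pairs, the per-character tokenization, and the name `build_pair_examples` are assumptions made for illustration.

```python
import random

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_examples(sentences, rng):
    """Return a list of (tokens, label); label 1 means consecutive in the article."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw negative pairs")
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            second, label = sentences[i + 1], 1
        else:
            # Pick any sentence other than sentence i and its direct successor.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            second, label = sentences[rng.choice(candidates)], 0
        tokens = [CLS] + list(sentences[i]) + [SEP] + list(second)
        examples.append((tokens, label))
    return examples
```

Each example follows the layout described above: [CLS], the first sentence, [SEP], then the second sentence.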

The BERT model is pre-trained on these two training sets, and the trained weight parameters are saved to serve as the initial values of the weight parameters of the BERT part of the BERT-CNN model.

The second step, as shown in FIG. 1, is to construct the BERT-CNN model, whose internal structure is shown in FIG. 2. The output at the first position of each layer in the 12-layer structure of the BERT model is taken as the input of the CNN model, giving an input width of 12; the CNN model applies convolution and max-pooling operations to this input matrix of width 12 to obtain a new and more effective sentence semantic feature vector.

The BERT model employed above consists of a 12-layer Transformer encoder. The core component of the Transformer encoder is a multi-head attention mechanism composed of 8 self-attention mechanisms, whose output is formed by concatenating the outputs of the 8 self-attention mechanisms. The purpose is to enable the model to learn relevant information in different representation subspaces.

Wherein, the self-attention is calculated as follows:

In self-attention, $Q = K = V$, all of which are the input matrix of the attention mechanism; $W^Q$, $W^K$, and $W^V$ are the three weight matrices corresponding to $Q$, $K$, and $V$, and are weight parameters that the model must learn. $d_k$ refers to the dimension of the row vectors of the input matrix and serves to keep the inner-product result from becoming too large.
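The self-attention formula itself does not appear in this text. Assuming it is the standard scaled dot-product attention, which uses exactly the symbols defined above, it can be written as follows; the placement of the projections $W^Q$, $W^K$, $W^V$ and the use of $\sqrt{d_k}$ as the scaling factor are taken from that standard form and are not confirmed by the source.

$$
\mathrm{attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{(QW^Q)\,(KW^K)^{\top}}{\sqrt{d_k}}\right) VW^V
$$

The softmax is applied row by row, so each position's output is a weighted average of the projected value vectors.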

Wherein, the calculation formula of the multi-head attention is as follows:

$$
\mathrm{multihead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^O
$$

Here, concat() concatenates the matrices along their row vectors; $\mathrm{head}_i$ is the result of the $i$-th self-attention in the multi-head attention; and $W^O$ is the weight parameter connecting the output of the multi-head attention to the next layer.
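The multi-head computation can be illustrated with a small dependency-free Python sketch. It assumes the standard scaled dot-product form of self-attention; the helper names (`matmul`, `softmax_rows`, `self_attention`, `multihead`) and the choice of taking d_k as the width of the projected keys are illustrative assumptions rather than details stated in the application.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, wq, wk, wv):
    """One head: Q = K = V = x, each projected by its own weight matrix."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(k[0])  # assumption: width of the projected keys
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multihead(x, heads, wo):
    """heads: list of (wq, wk, wv); outputs are concatenated row-wise, then times wo."""
    outputs = [self_attention(x, wq, wk, wv) for wq, wk, wv in heads]
    concat = [sum((head[i] for head in outputs), []) for i in range(len(x))]
    return matmul(concat, wo)
```

With 8 entries in `heads`, this mirrors the 8 self-attention mechanisms of the embodiment: each head attends in its own projected subspace, and `wo` mixes the concatenated results.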

The CNN model employed above is a one-dimensional convolutional neural network. In a one-dimensional convolution, the kernel only slides downwards over the input matrix and never moves left or right, so each complete pass of one kernel over the input matrix yields a one-dimensional vector. The CNN model comprises a convolutional layer and a pooling layer; the convolutional layer is made up of convolution kernels, and this embodiment uses three kinds of kernels with window sizes 2, 3, and 4, while the pooling layer uses max pooling.
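The convolution and pooling over the 12 layer outputs can be sketched in plain Python as follows. The function names, the absence of an activation function, and the use of one kernel per window size in the usage example are assumptions for illustration; the application does not state how many kernels of each size are used.

```python
def stack_cls_outputs(layer_outputs):
    """layer_outputs: 12 items, each [seq_len][hidden]; keep position 0 of each layer."""
    return [layer[0] for layer in layer_outputs]


def conv1d_over_layers(matrix, kernel, bias=0.0):
    """Slide a (window x hidden) kernel downwards over the rows, one scalar per step."""
    window = len(kernel)
    out = []
    for top in range(len(matrix) - window + 1):
        total = bias
        for r in range(window):
            total += sum(x * w for x, w in zip(matrix[top + r], kernel[r]))
        out.append(total)
    return out


def cnn_features(matrix, kernels):
    """Max-pool each kernel's output vector; the result has one feature per kernel."""
    return [max(conv1d_over_layers(matrix, kernel)) for kernel in kernels]
```

For a 12-row input, a kernel with window size 2 produces a vector of length 11, window size 3 gives length 10, and window size 4 gives length 9; max pooling reduces each of these to a single value, and the pooled values together form the sentence feature vector.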

The sentence semantic feature vector obtained in this way is used as the input of the fully connected layer, and the class probabilities of the sentence are finally obtained through the softmax layer.
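A short sketch of this final stage, assuming a single fully connected layer followed by softmax; the names `dense` and `softmax` and the weight layout (one row of weights per class) are illustrative choices, not taken from the application.

```python
import math


def dense(features, weights, bias):
    """Fully connected layer; weights has one row of per-feature weights per class."""
    return [sum(f * w for f, w in zip(features, row)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn class scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For a two-class task such as sentiment polarity, `softmax(dense(features, weights, bias))` yields two probabilities, and the larger one determines the predicted class.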

The third step, as shown in FIG. 1, is to initialize the weight parameters of the BERT-CNN model. First, the weight parameters of the BERT model are initialized with the weights pre-trained and saved in the first step. Then the weight parameters of the CNN model are initialized with randomly generated values that follow a normal distribution.
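The two-part initialization can be sketched as follows. The parameter container (a plain dictionary), the function name `init_bert_cnn`, the standard deviation of 0.02, and the number of kernels per window size are all assumptions for illustration; the application only states that the CNN parameters follow a normal distribution.

```python
import copy
import random


def init_bert_cnn(pretrained_bert, window_sizes, hidden, kernels_per_size, rng, std=0.02):
    """Return initial parameters for the combined model.

    pretrained_bert: parameters saved after pre-training, copied unchanged.
    CNN kernels: drawn from a normal distribution with mean 0 and the given std.
    """
    cnn = {
        w: [[[rng.gauss(0.0, std) for _ in range(hidden)] for _ in range(w)]
            for _ in range(kernels_per_size)]
        for w in window_sizes
    }
    return {"bert": copy.deepcopy(pretrained_bert), "cnn": cnn}
```

Only the CNN part starts from random values; the BERT part starts from what was learned during pre-training, which is what allows the later training step to act as fine-tuning.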

The fourth step, as shown in FIG. 1, is to train the BERT-CNN classifier. In training, a sentence is fed in, for example "the product quality is not good", which is clearly a negative review, so the model is expected to output a probability value below 0.5; the closer this value is to 0, the more accurate the prediction. For a positive review, the predicted probability is expected to exceed 0.5, and the closer it is to 1 the better. The application therefore uses cross entropy as the loss function and Adam as the optimizer, updating the weights continuously so that the model reaches an optimal set of weight parameters on the training data. The parameter updates are not limited to the CNN model: the weight parameters of the BERT model are also updated continuously, that is, fine-tuned for the task.
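The behaviour of the cross-entropy loss described above can be illustrated with a small sketch for the two-class case. The function name `binary_cross_entropy` and the clipping constant `eps` are assumptions; the Adam update itself is not shown.

```python
import math


def binary_cross_entropy(p_positive, label, eps=1e-12):
    """Loss for one sentence; label is 1 for a positive review, 0 for a negative one."""
    p = min(max(p_positive, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

For a negative review (label 0), a predicted probability of 0.1 gives a smaller loss than 0.4, which in turn gives a smaller loss than 0.9, so minimizing the loss pushes the output toward 0 exactly as the paragraph describes.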

The application uses the combined BERT and CNN model to extract effective sentence features. The pre-trained BERT model not only has a powerful semantic representation of words and sentences but can also be applied directly to tasks in any field without being re-trained on field-specific data, which gives it an advantage over the Word2Vec model. Because the BERT model uses an attention mechanism to handle long-distance dependencies, it also avoids the RNN model's inability to compute in parallel. On this basis, the application introduces a CNN model to further fuse the features output by the BERT model, so that more effective sentence semantic features can be obtained.

Claims

1. A Chinese text classification method based on the hierarchical connection of BERT and CNN, characterized by comprising the following steps:

step 1, pre-training a BERT model on a large number of public Chinese text data sets to obtain and store all parameters of the BERT model, wherein the BERT model consists of 12 Transformer layers, and the Chinese text data sets for pre-training the BERT model comprise an intra-sentence prediction training set and a sentence-pair continuity training set, wherein:

after the data is split into sentences, 15% of the words in each sentence are randomly selected for masking; of these, 80% are replaced by [MASK], 10% are kept as the original words, and the remaining 10% are replaced by a random word; a [CLS] character is prepended to the sentence, and the resulting new sentence is used as the input of the BERT model to predict the 15% of words that were masked;

after the data is split into sentences, any two sentences are joined into one through [SEP], and a [CLS] character is prepended; the resulting new sentence is used as the input of the BERT model to predict whether the two sentences are consecutive in the article, wherein the output of the BERT model is a probability value representing the probability that the two sentences are consecutive;

step 4, data preprocessing is carried out on the classification training set;

step 5, retraining the BERT-CNN model on the preprocessed data set.

2. The method of claim 1, wherein in step 2, the core component of the Transformer encoder is a multi-head attention mechanism composed of 8 self-attention mechanisms, and the output of the Transformer encoder is formed by concatenating the outputs of the 8 self-attention mechanisms.

3. The Chinese text classification method based on the hierarchical connection of BERT and CNN as claimed in claim 1, wherein in step 4, the data preprocessing comprises removing some invalid character strings from the sentences and then splitting each sentence into individual characters.