CN112070139A - Text classification method based on BERT and improved LSTM - Google Patents

Text classification method based on BERT and improved LSTM

Info

Publication number
CN112070139A
Authority
CN
China
Prior art keywords: text, bert, vector, lstm, gate
Prior art date
Legal status
Granted
Application number
CN202010898906.5A
Other languages
Chinese (zh)
Other versions
CN112070139B (en)
Inventor
戚力鑫
万书振
Current Assignee
China Three Gorges University CTGU
Original Assignee
China Three Gorges University CTGU
Priority date
Filing date
Publication date
Application filed by China Three Gorges University CTGU
Priority to CN202010898906.5A
Publication of CN112070139A
Application granted
Publication of CN112070139B
Legal status: Active


Classifications

    • G06F18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods


Abstract

The invention belongs to the field of text recognition and discloses a text classification method based on BERT and an improved LSTM, which comprises the following steps: preprocessing the input text data; inputting the preprocessed text data into a BERT model to obtain a word vector sequence; deep-encoding the vector sequence with an improved LSTM network to obtain a feature vector; reducing the dimension of the feature vector with a fully connected layer; and classifying the dimension-reduced feature vector with a classifier. The improved LSTM distinguishes the importance of the words in the text, improving the learning quality and efficiency of the neurons, so the text classification model fits quickly and classifies well; the BERT component captures context information, making ambiguous words easier to identify, laying a good foundation for feature extraction and helping to improve the text classification accuracy.

Description

Text classification method based on BERT and improved LSTM
Technical Field
The invention belongs to the field of text recognition, and particularly relates to a text classification method based on BERT and improved LSTM.
Background
Text classification is mainly applied to microblog sentiment analysis, user comment mining, information retrieval, newsgroup classification, word sense disambiguation, and the like. Before the 1990s, automatic text classification mainly followed a knowledge-engineering approach, i.e., manual classification by professionals, which was costly, time-consuming and labor-intensive. Since the 1990s, researchers have applied various statistical and machine learning methods to automatic text classification, such as the Support Vector Machine (SVM), AdaBoost, naive Bayes, KNN, and logistic regression. In recent years, with the rapid development of deep learning and neural network models, text classification methods based on deep learning have attracted close attention and research in academia and industry, and the recurrent neural networks LSTM and GRU and the convolutional neural network CNN are widely applied to text classification.
In current text classification methods, the input is usually a static word or character vector that cannot change with its context, so the information it covers is relatively limited; the feature extractors are mostly the CNN and RNN models of deep learning, which lack fine-grained adjustment of the different importance levels of the input information stream along the input dimension.
Disclosure of Invention
The invention aims to solve the above problems and provides a text classification method based on BERT and an improved LSTM, which adds a contribution gate to the existing LSTM unit to attend to the importance of different elements of the text's vector sequence, improving the learning efficiency of the neurons, accelerating the fitting of the text classification model, and improving its classification effect.
The technical scheme of the invention is a text classification method based on BERT and an improved LSTM. The text classification model comprises BERT, the improved LSTM and a classifier connected in sequence, and the text classification method comprises the following steps:
step 1: preprocessing input text data;
step 2: inputting the preprocessed text data into a BERT model for processing to obtain a word vector sequence;
step 3: deep-encoding the vector sequence with an improved LSTM network to obtain a feature vector;
step 4: reducing the dimension of the feature vector with a fully connected layer;
step 5: classifying the dimension-reduced feature vector with a classifier.
Further, in step 1, the preprocessing of the text data comprises punctuation filtering, abbreviation expansion, space deletion and illegal-character filtering.
Further, the step 2 specifically includes:
1) using the trained BERT model to segment the text of the preprocessed text data set T', obtaining the word vector set T'' = {t_1'', t_2'', ..., t_n''}, in which each text of the data set is converted into a word vector t_i'' = {w_1, w_2, ..., w_L} of fixed length;
2) inputting the word vector set T'' into the Token embedding layer, Segment embedding layer and Position embedding layer of BERT to obtain the vector encoding V_1, the sentence encoding V_2 and the position encoding V_3, respectively;
3) adding V_1, V_2 and V_3 and inputting the sum into the bidirectional Transformer of BERT to obtain the word vector sequence S = {s_1, s_2, ..., s_n}.
Further, the LSTM unit of the improved LSTM network comprises a contribution gate, a forget gate, an input gate and an output gate. From the cell state c_{t-1} and hidden state h_{t-1} of the previous time step and the input information of the current time step, the contribution gate generates an attention vector a_t with the same dimension as the input vector x_t; a_t is combined with x_t to obtain the optimized input vector x_t', which serves as the input to the forget gate, the input gate and the output gate.
Further, step 5 performs probability classification on the dimension-reduced feature vector of step 4 with a softmax classifier and outputs the probability prediction vector P = {p_1, p_2, ..., p_C}, where p_i (i = 1, 2, ..., C) denotes the probability that the text belongs to class i and C is the total number of classes; the class corresponding to the largest p_i is determined as the category of the text.
Compared with the prior art, the invention has the following beneficial effects:
1) the text classification method distinguishes the importance of the words in the text through the improved LSTM, improving the learning quality and efficiency of the neurons; the text classification model therefore fits quickly and classifies well;
2) the BERT component of the text classification model captures context information, making ambiguous words easier to identify, laying a good foundation for feature extraction and helping to improve the text classification accuracy;
3) the text classification model combining BERT with the improved LSTM generalizes well and is suitable for text classification in different technical fields.
Drawings
The invention is further illustrated by the following figures and examples.
Fig. 1 is a flowchart of the text classification method according to an embodiment of the invention.
Fig. 2 is a schematic structural diagram of a text classification model according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of an LSTM unit in accordance with an embodiment of the present invention.
FIG. 4 compares the validation accuracy of the improved LSTM with that of sRNN, LSTM and GRU.
FIG. 5 compares the validation loss of the improved LSTM with that of sRNN, LSTM and GRU.
Detailed Description
The embodiment selects the Chinese dataset THUCNews for the classification test. As shown in FIG. 2, the text classification model of the embodiment comprises BERT, the improved LSTM and a classifier, the classifier consisting of a fully connected layer and a Softmax layer.
As shown in FIG. 1, the text classification method based on BERT and the improved LSTM comprises the following steps:
step 1: inputting a text data set T and preprocessing the sentences in the text, including punctuation filtering, abbreviation expansion, space deletion and illegal-character filtering; a sentence-length threshold is determined from the length distribution and mean square deviation of the sentences in the data set so as to unify the sentence length, finally yielding the text data set T';
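As an illustration of step 1, the following Python sketch implements the four cleaning operations and the length unification; the abbreviation table, the filtering regular expressions, and the threshold of 128 characters are assumptions for illustration (the patent derives the threshold from the corpus length distribution and mean square deviation):

```python
import re

# Hypothetical abbreviation table; the patent does not list one.
ABBREVIATIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def preprocess(sentence: str, length_threshold: int = 128) -> str:
    for abbr, full in ABBREVIATIONS.items():
        sentence = sentence.replace(abbr, full)          # abbreviation expansion
    sentence = re.sub(r"[^\w\s]", "", sentence)          # punctuation / illegal-character filtering
    sentence = re.sub(r"\s+", " ", sentence).strip()     # space deletion
    return sentence[:length_threshold]                   # unify sentence length at the threshold

print(preprocess("it's a   test!!  sentence..."))        # -> "it is a test sentence"
```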
step 2: vectorizing the text data set T' by using the trained BERT model;
segmenting the text in T' with the trained BERT model to obtain the data T'' = {t_1'', t_2'', ..., t_n''}, where each text is converted into a word vector t_i'' = {w_1, w_2, ..., w_L} of fixed length L;
putting the word vectors in T'' into the Token Embedding layer, Segment Embedding layer and Position Embedding layer of BERT to obtain the vector encoding V_1, the sentence encoding V_2 and the position encoding V_3, respectively;
adding V_1, V_2 and V_3 and inputting the sum into the bidirectional Transformer of BERT, which outputs the word vector sequence S = {s_1, s_2, ..., s_n} corresponding to T'', where each word vector subsequence s_i is composed of the word vectors v(w_j) of the i-th text, j indexing the words of the text;
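A minimal sketch of step 2 using the Hugging Face transformers package (an assumption; the patent only references Google's BERT, and the bert-base-chinese checkpoint stands in for the trained model). Note that BERT's embedding layer internally performs the V_1 + V_2 + V_3 sum before the bidirectional Transformer:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

# Tokenize to a fixed length L (here L = 128) and encode.
encoded = tokenizer("这是一条待分类的新闻文本", return_tensors="pt",
                    padding="max_length", truncation=True, max_length=128)
with torch.no_grad():
    outputs = bert(**encoded)

S = outputs.last_hidden_state   # word vector sequence S = {s_1, ..., s_n}
print(S.shape)                  # torch.Size([1, 128, 768])
```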
the BERT model of the examples was referred to the BERT disclosed in the paper "Pre-training of Deep biological transformations for Language Understanding", published by the Google research and development team 2018.
And step 3: the improved LSTM network is used for carrying out feature learning on the word vector sequence S, potential features F are extracted, and the structure of an LSTM unit of the improved LSTM network is shown in figure 3;
and 4, step 4: reducing the dimension of the feature vector F by using a full connection layer;
and 5: performing probability classification on the feature vector of the dimensionality reduction step 4 by using a softmax classifier, and outputting a probability prediction vector P ═ P1,p2,...,pC},piI 1,2, C denotes the probability that the text belongs to a particular class, C being the total number of classes; and determining the corresponding classification with the maximum probability value as the text category.
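Steps 4 and 5 amount to a linear projection followed by softmax; a minimal sketch (the feature dimension 256 and the class count C = 10 are illustrative assumptions, not values fixed by the patent):

```python
import torch
import torch.nn as nn

C = 10                            # total number of classes (assumed)
fc = nn.Linear(256, C)            # dimension-reducing fully connected layer
F = torch.randn(1, 256)           # stand-in for the feature vector from step 3

P = torch.softmax(fc(F), dim=-1)  # probability prediction vector P = {p_1, ..., p_C}
category = P.argmax(dim=-1)       # class with the largest p_i
print(P, category)
```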
The LSTM unit of the improved LSTM network comprises a contribution gate, a forget gate, an input gate and an output gate. From the cell state c_{t-1} and hidden state h_{t-1} of the previous time step and the input information of the current time step, the contribution gate generates an attention vector a_t with the same dimension as the input vector x_t; a_t is combined with x_t to obtain the optimized input vector x_t', which serves as the input to the forget gate, the input gate and the output gate:
a_t = σ_a(W_a x_t + U_a h_{t-1} + M_a c_{t-1} + b_a)
x_t' = (x_t + h_{t-1}) ∘ a_t
Forget gate:
f_t = σ_g(W_f x_t' + b_f)
Input gate:
i_t = σ_g(W_i x_t' + b_i)
Output gate:
o_t = σ_g(W_o x_t' + b_o)
Cell state:
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c x_t' + b_c)
Hidden state:
h_t = o_t ∘ σ_h(c_t)
where h_t is the hidden state at the current time t and c_t is the cell state at the current time t; W_a, U_a, M_a, W_f, W_i, W_o, W_c are weight matrices; b_a, b_f, b_i, b_o, b_c are bias terms; σ_a, σ_g, σ_c, σ_h are activation functions; and ∘ denotes element-wise (Hadamard) multiplication.
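A PyTorch sketch of one time step of the improved LSTM unit, written directly from the equations above. Where the patent leaves details unspecified, the sketch assumes sigmoid for σ_a and σ_g, tanh for σ_c and σ_h, and equal input and hidden dimensions (required by the sum x_t + h_{t-1}):

```python
import torch
import torch.nn as nn

class ImprovedLSTMCell(nn.Module):
    """One step of the LSTM unit with a contribution gate (a sketch)."""

    def __init__(self, size: int):
        super().__init__()
        # Contribution gate: a_t = sigma_a(W_a x_t + U_a h_{t-1} + M_a c_{t-1} + b_a)
        self.W_a = nn.Linear(size, size)              # carries bias b_a
        self.U_a = nn.Linear(size, size, bias=False)
        self.M_a = nn.Linear(size, size, bias=False)
        # Gate and candidate parameters, all applied to the optimized input x_t'
        self.W_f = nn.Linear(size, size)              # forget gate
        self.W_i = nn.Linear(size, size)              # input gate
        self.W_o = nn.Linear(size, size)              # output gate
        self.W_c = nn.Linear(size, size)              # cell candidate

    def forward(self, x_t, h_prev, c_prev):
        a_t = torch.sigmoid(self.W_a(x_t) + self.U_a(h_prev) + self.M_a(c_prev))
        x_opt = (x_t + h_prev) * a_t                  # x_t' = (x_t + h_{t-1}) ∘ a_t
        f_t = torch.sigmoid(self.W_f(x_opt))          # forget gate
        i_t = torch.sigmoid(self.W_i(x_opt))          # input gate
        o_t = torch.sigmoid(self.W_o(x_opt))          # output gate
        c_t = f_t * c_prev + i_t * torch.tanh(self.W_c(x_opt))  # cell state
        h_t = o_t * torch.tanh(c_t)                   # hidden state
        return h_t, c_t

# Encoding a sequence: iterate the cell over the word vectors and keep the
# final hidden state as the feature vector F.
cell = ImprovedLSTMCell(size=768)
S = torch.randn(128, 1, 768)                          # word vector sequence from BERT
h = c = torch.zeros(1, 768)
for s_t in S:
    h, c = cell(s_t, h, c)
F = h                                                 # latent feature vector
```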
Compared with models such as word2vec, whose word vectors are fixed once trained, BERT assigns polysemous words different representations according to their contextual information, generating more accurate feature representations and thereby improving model performance.
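This context dependence can be checked empirically; the following sketch (again assuming the transformers package and the bert-base-chinese checkpoint) compares the vectors BERT produces for the polysemous word 苹果 ("apple" the fruit vs. the company) in two different sentences:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def char_vector(sentence: str, char: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(char)]   # vector of the character's first occurrence

v_fruit = char_vector("我今天吃了一个苹果", "苹")     # fruit sense
v_brand = char_vector("苹果发布了新款手机", "苹")     # company sense
print(torch.cosine_similarity(v_fruit, v_brand, dim=0))  # noticeably below 1.0
```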
Because the improved LSTM can attend to the important elements of the input information, it directs the neurons' attention to those elements along the input dimension, which accelerates the fitting of the model and improves its training effect.
To verify the effectiveness of the improved LSTM, it was compared on the Chinese dataset THUCNews with the neural network models sRNN, LSTM and GRU. Each model was trained for 10, 30 and 50 epochs, and the model accuracy and fitting speed (convergence) were compared; the experimental results are shown in Table 1.
TABLE 1. Accuracy comparison of the improved LSTM with the sRNN, LSTM and GRU models
[Table 1 is provided as an image in the original publication; it compares the accuracy and convergence of the improved LSTM, sRNN, LSTM and GRU at 10, 30 and 50 training epochs.]
As can be seen from Table 1, the text classification model whose feature extraction layer uses the improved LSTM achieves the best accuracy at every number of training epochs: relative to LSTM, the improved LSTM improves accuracy by 1.86% after 10 epochs, by 1.14% after 30 epochs, and by 0.94% after 50 epochs.
As can be seen from FIGS. 4 and 5, after 6 iterations the validation accuracy curve of the improved LSTM no longer changes drastically and the corresponding validation loss reaches a minimum; the model is close to fitting and achieves its best effect.
Compared with the improved LSTM, the accuracy of sRNN, GRU and LSTM varies over a wider range across different numbers of training epochs. The training effect of the improved LSTM is similar under different numbers of epochs: the model accuracy is 92.02% after 10 epochs and 92.86% after 50 epochs, a difference of only 0.84%.
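For reference, a generic sketch of the evaluation protocol behind Table 1 and FIGS. 4-5: each model is trained for a fixed number of epochs while the validation accuracy and loss are recorded after every epoch. The optimizer, learning rate and data loaders are placeholders, since the patent does not specify them:

```python
import torch

def validate(model, loader, loss_fn):
    model.eval()
    correct, total, loss_sum = 0, 0, 0.0
    with torch.no_grad():
        for x, y in loader:
            logits = model(x)
            loss_sum += loss_fn(logits, y).item() * y.size(0)
            correct += (logits.argmax(dim=-1) == y).sum().item()
            total += y.size(0)
    return correct / total, loss_sum / total   # validation accuracy, validation loss

def train_and_track(model, train_loader, val_loader, epochs=50):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumed optimizer settings
    loss_fn = torch.nn.CrossEntropyLoss()
    history = []
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        history.append(validate(model, val_loader, loss_fn))  # one point per epoch
    return history   # plot to reproduce curves like FIGS. 4 and 5
```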
The above description covers only preferred embodiments of the invention, and the protection scope of the invention is not limited thereto. Any substitution or change readily conceivable by a person skilled in the art within the technical scope disclosed by the invention, based on the technical solution and inventive concept of the invention, falls within the protection scope of the invention.

Claims (5)

1. The text classification method based on BERT and improved LSTM is characterized by comprising the following steps:
step 1: preprocessing input text data;
step 2: inputting the preprocessed text data into a BERT model for processing to obtain a word vector sequence;
step 3: deep-encoding the vector sequence with an improved LSTM network to obtain a feature vector;
step 4: reducing the dimension of the feature vector with a fully connected layer;
step 5: classifying the dimension-reduced feature vector with a classifier.
2. The text classification method based on BERT and improved LSTM according to claim 1, wherein the preprocessing of the text data in step 1 comprises punctuation filtering, abbreviation expansion, space deletion and illegal-character filtering.
3. The text classification method based on BERT and improved LSTM according to claim 1, wherein step 2 specifically comprises:
1) using the trained BERT model to segment the text of the preprocessed text data set T', obtaining the word vector set T'' = {t_1'', t_2'', ..., t_n''}, in which the text of the text data set is converted into a word vector t_i'' = {w_1, w_2, ..., w_L} of fixed length;
2) inputting the word vector set T'' into the Token embedding layer, Segment embedding layer and Position embedding layer of BERT to obtain the vector encoding V_1, the sentence encoding V_2 and the position encoding V_3, respectively;
3) adding V_1, V_2 and V_3 and inputting the sum into the bidirectional Transformer of BERT to obtain the word vector sequence S = {s_1, s_2, ..., s_n}.
4. The text classification method based on BERT and improved LSTM according to claim 1, wherein the LSTM unit of the improved LSTM network comprises a contribution gate, a forget gate, an input gate and an output gate; from the cell state c_{t-1} and hidden state h_{t-1} of the previous time step and the input information of the current time step, the contribution gate generates an attention vector a_t with the same dimension as the input vector x_t; a_t is combined with x_t to obtain the optimized input vector x_t', which serves as the input to the forget gate, the input gate and the output gate:
a_t = σ_a(W_a x_t + U_a h_{t-1} + M_a c_{t-1} + b_a)
x_t' = (x_t + h_{t-1}) ∘ a_t
forget gate:
f_t = σ_g(W_f x_t' + b_f)
input gate:
i_t = σ_g(W_i x_t' + b_i)
output gate:
o_t = σ_g(W_o x_t' + b_o)
cell state:
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c x_t' + b_c)
hidden state:
h_t = o_t ∘ σ_h(c_t)
where h_t is the hidden state at the current time t and c_t is the cell state at the current time t; W_a, U_a, M_a, W_f, W_i, W_o, W_c are weight matrices; b_a, b_f, b_i, b_o, b_c are bias terms; σ_a, σ_g, σ_c, σ_h are activation functions; and ∘ denotes element-wise multiplication.
5. The text classification method based on BERT and improved LSTM according to any one of claims 1-4, wherein step 5 performs probability classification on the dimension-reduced feature vector of step 4 with a softmax layer and outputs the probability prediction vector P = {p_1, p_2, ..., p_C}, where p_i (i = 1, 2, ..., C) denotes the probability that the text belongs to class i and C is the total number of classes; the class corresponding to the largest p_i is determined as the category of the text.
CN202010898906.5A 2020-08-31 2020-08-31 Text classification method based on BERT and improved LSTM Active CN112070139B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010898906.5A CN112070139B (en) 2020-08-31 2020-08-31 Text classification method based on BERT and improved LSTM


Publications (2)

Publication Number Publication Date
CN112070139A 2020-12-11
CN112070139B 2023-12-26

Family

ID=73665222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010898906.5A Active CN112070139B (en) 2020-08-31 2020-08-31 Text classification method based on BERT and improved LSTM

Country Status (1)

Country Link
CN (1) CN112070139B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
CN109101584A (en) * 2018-07-23 2018-12-28 湖南大学 A kind of sentence classification improved method combining deep learning with mathematical analysis
CN109344244A (en) * 2018-10-29 2019-02-15 山东大学 A kind of the neural network relationship classification method and its realization system of fusion discrimination information
CN109918491A (en) * 2019-03-12 2019-06-21 焦点科技股份有限公司 A kind of intelligent customer service question matching method of knowledge based library self study
CN109992783A (en) * 2019-04-03 2019-07-09 同济大学 Chinese term vector modeling method
CN109992648A (en) * 2019-04-10 2019-07-09 北京神州泰岳软件股份有限公司 The word-based depth text matching technique and device for migrating study
CN110362817A (en) * 2019-06-04 2019-10-22 中国科学院信息工程研究所 A kind of viewpoint proneness analysis method and system towards product attribute
CN110717334A (en) * 2019-09-10 2020-01-21 上海理工大学 Text emotion analysis method based on BERT model and double-channel attention
CN110781306A (en) * 2019-10-31 2020-02-11 山东师范大学 English text aspect layer emotion classification method and system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818123A (en) * 2021-02-08 2021-05-18 河北工程大学 Emotion classification method for text
CN113177120A (en) * 2021-05-11 2021-07-27 中国人民解放军国防科技大学 Method for quickly editing information based on Chinese text classification
CN113177120B (en) * 2021-05-11 2024-03-08 中国人民解放军国防科技大学 Quick information reorganizing method based on Chinese text classification
CN113821637A (en) * 2021-09-07 2021-12-21 北京微播易科技股份有限公司 Long text classification method and device, computer equipment and readable storage medium
CN115048447A (en) * 2022-06-27 2022-09-13 华中科技大学 Database natural language interface system based on intelligent semantic completion
CN115048447B (en) * 2022-06-27 2023-06-16 华中科技大学 Database natural language interface system based on intelligent semantic completion

Also Published As

Publication number Publication date
CN112070139B (en) 2023-12-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant