CN111177376B - Chinese text classification method based on BERT and CNN hierarchical connection - Google Patents
- Publication number
- CN111177376B
- Authority
- CN
- China
- Prior art keywords
- bert
- model
- sentence
- cnn
- sentences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biomedical Technology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Machine Translation (AREA)
Abstract
The application relates to a Chinese text classification method based on a hierarchical connection of BERT and CNN, mainly intended for Chinese text classification problems such as sentiment analysis, core sentence recognition, and relation recognition. A CNN model is hierarchically connected to a BERT model to obtain a new model, BERT-CNN. The added CNN further processes the sentence features extracted by the BERT model, yielding a more effective semantic representation of the sentence and therefore a better classification result in text classification tasks.
Description
Technical Field
The application belongs to the technical field of natural language processing, and in particular relates to a Chinese text classification method based on a hierarchical connection of the deep learning models BERT and CNN.
Background
With the rapid development of the economy and the internet, more and more people post their opinions online. Given the large amount of text data on the network, how to efficiently extract valuable information from it has become a research hotspot. Question-answering robots, search, machine translation, and sentiment analysis are key application fields of natural language processing, and all of them rely on text classification as a foundation. Precisely because text classification is foundational, its accuracy requirements are high. Text classification has therefore long been both a research hotspot and a difficult problem.
With the rapid development of machine learning and deep learning, text classification no longer depends on time-consuming and labor-intensive manual work and has shifted to automatic techniques. As accuracy has steadily improved, automatic text classification has been widely applied to sentiment analysis and spam text recognition. However, results remain poor in some areas, such as illegal advertisement recognition, and sentiment analysis and spam text recognition themselves still urgently need higher accuracy.
At present, deep learning achieves the better results in text classification, but its effectiveness depends on extracting the semantic features of sentences. Conventional deep learning models take vectorized words or characters (embeddings) as model input, so their results are sensitive to the quality of the vectorization; vectors often have to be trained separately for texts in each field, which is time-consuming and laborious. The model introduced here not only performs better but also does not require vectorizing words or characters separately for each field.
Disclosure of Invention
The purpose of the application is to further improve the classification of Chinese text.
To achieve this purpose, the technical solution of the application is a Chinese text classification method based on a hierarchical connection of BERT and CNN, characterized by comprising the following steps:
step 1, pre-training a BERT model on a large number of public Chinese text data sets to obtain and store all parameters of the BERT model, wherein the BERT model consists of 12 Transformer encoder layers;
step 2, hierarchically connecting a CNN model to the BERT model to obtain a BERT-CNN model, wherein the output at the first position of each of the 12 layers of the BERT model is used as the input of the CNN model, so that the input width is 12; in the BERT-CNN model, the CNN model applies convolution and max pooling to this width-12 input matrix to obtain a new, more effective sentence semantic feature vector, which is then fed into a fully connected layer and finally a classifier;
step 3, initializing the parameters of the BERT part with the values obtained in pre-training, and initializing the parameters of the CNN part with random values drawn from a normal distribution;
step 4, preprocessing the classification training set;
step 5, retraining the BERT-CNN model on the preprocessed data set.
Preferably, in step 1, the Chinese text data sets for pre-training the BERT model include an intra-sentence prediction training set and a sentence-pair continuity training set, wherein:
the intra-sentence prediction training set is constructed as follows:
after the data is split into sentences, 15% of the words in each sentence are randomly selected for masking; of these selected words, 80% are replaced by [MASK], 10% keep the original word, and the remaining 10% are replaced by a random word; a [CLS] token is prepended to the sentence, and the resulting sentence is used as the input of the BERT model to predict the 15% of words that were selected;
the sentence-pair continuity training set is constructed as follows:
after the data is split into sentences, any two sentences are joined into one through [SEP], and a [CLS] token is prepended; the resulting sentence is used as the input of the BERT model to predict whether the two sentences are consecutive in the article, and the output of the BERT model is a probability value indicating the probability that the two sentences are consecutive.
Preferably, in step 2, the core component of the Transformer encoder is a multi-head attention mechanism composed of 8 self-attention mechanisms, and the output of the Transformer encoder is formed by concatenating the outputs of the 8 self-attention mechanisms.
Preferably, in step 4, the data preprocessing includes removing some invalid character strings from each sentence and then segmenting the sentence character by character.
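The preprocessing in step 4 can be sketched roughly as follows. The patent does not say which character strings count as invalid, so the patterns below (URLs, HTML tags, whitespace) and the names `INVALID_PATTERNS` and `preprocess` are illustrative assumptions only.

```python
import re
from typing import List

# Hypothetical patterns for "invalid character strings"; the source does not list them.
INVALID_PATTERNS = [
    re.compile(r"https?://\S+"),  # URLs (assumed invalid)
    re.compile(r"<[^>]+>"),       # HTML tags (assumed invalid)
    re.compile(r"\s+"),           # runs of whitespace (assumed invalid)
]

def preprocess(sentence: str) -> List[str]:
    """Remove the assumed-invalid substrings, then split the sentence into characters."""
    for pattern in INVALID_PATTERNS:
        sentence = pattern.sub("", sentence)
    return list(sentence)

tokens = preprocess("这个产品 <br> 质量不好 http://example.com/x")
```

Character-level segmentation avoids the need for a Chinese word segmenter, at the cost of longer input sequences.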
The application provides a Chinese text classification method based on a hierarchical connection of Bidirectional Encoder Representations from Transformers (BERT) and Convolutional Neural Networks (CNN), in which a BERT model and a CNN model are hierarchically connected to further improve the model's ability to extract sentence semantic features.
By adding a CNN model when extracting sentence semantic features, the BERT-CNN method obtains more effective sentence semantic features and achieves better text classification results than some current Chinese text classification models.
Drawings
FIG. 1 is a flow diagram of the Chinese text classification method based on BERT and CNN hierarchical connection according to the present application;
FIG. 2 is an internal structural diagram of the BERT-CNN model of the present application.
Detailed Description
The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.
The specific embodiment of the application relates to a Chinese text classification method based on a hierarchical connection of BERT and CNN, which comprises: pre-training a BERT model on a Chinese Wikipedia text data set to obtain and store all parameters of the BERT model; hierarchically connecting a CNN model to the BERT model to obtain a new model, BERT-CNN; initializing the parameters of the BERT part with the values obtained in pre-training, and the parameters of the CNN part with random values drawn from a normal distribution; preprocessing the classification training set; and finally retraining the BERT-CNN model on the preprocessed data set.
FIG. 1 shows a flow diagram of the Chinese text classification method based on BERT and CNN hierarchical connection according to the present application.
As shown in FIG. 1, after the flow begins, the first step is to pre-train the BERT model. Pre-training consists of two parts: first, constructing the training sets; second, training the BERT model on the constructed training sets.
Two training sets are constructed: an intra-sentence prediction training set and a sentence-pair continuity training set. The specific implementation steps are as follows:
The intra-sentence prediction training set is constructed by randomly selecting 15% of the words in each sentence for masking. Of these selected words, 80% are replaced with [MASK], 10% keep the original word, and the remaining 10% are replaced with a random word. A [CLS] token is prepended to the sentence, and the resulting sentence is used as the model input to predict the 15% of words that were selected.
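A minimal sketch of this masking scheme is shown below. It treats each character as a token and rounds the 15% selection to a whole number of positions; both choices, along with the function name `mask_sentence` and the tiny replacement vocabulary, are assumptions made for illustration and not details given in the patent.

```python
import random

def mask_sentence(tokens, vocab, rng, select_rate=0.15):
    """Return (inputs, labels). labels[i] holds the original token at selected
    positions and None elsewhere; index 0 is the prepended [CLS] token."""
    n_select = max(1, round(len(tokens) * select_rate))
    selected = set(rng.sample(range(len(tokens)), n_select))
    inputs, labels = ["[CLS]"], [None]
    for i, tok in enumerate(tokens):
        if i not in selected:
            inputs.append(tok)
            labels.append(None)
            continue
        labels.append(tok)            # the model must predict this token
        draw = rng.random()
        if draw < 0.8:                # 80%: replace with [MASK]
            inputs.append("[MASK]")
        elif draw < 0.9:              # 10%: keep the original token
            inputs.append(tok)
        else:                         # 10%: replace with a random token
            inputs.append(rng.choice(vocab))
    return inputs, labels

tokens = list("今天天气很好我们一起去公园散步吧好吗朋友")  # 20 characters
inputs, labels = mask_sentence(tokens, vocab=list("的一是不了人我在有他"), rng=random.Random(0))
```

Keeping or randomly replacing some selected tokens, rather than always inserting [MASK], means the model cannot rely on the [MASK] symbol alone to know which positions it must predict.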
The sentence-pair continuity training set is constructed by joining any two sentences in an article into one sentence through [SEP] and prepending a [CLS] token; the resulting sentence is used as the model input to predict whether the two sentences are consecutive in the article. The output of the model is a probability value representing the probability that the two sentences are consecutive.
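The pair construction might look like the sketch below. The patent only says that any two sentences are joined; pairing each sentence with its true successor (label 1) and with one randomly chosen non-successor (label 0) is an assumed sampling strategy, and `build_pair_examples` is a hypothetical helper name.

```python
import random

def build_pair_examples(sentences, rng):
    """Build ([CLS] a [SEP] b, label) examples; label 1 means b directly follows a."""
    examples = []
    for i in range(len(sentences) - 1):
        first = list(sentences[i])
        # Positive example: the sentence that really follows.
        following = list(sentences[i + 1])
        examples.append((["[CLS]"] + first + ["[SEP]"] + following, 1))
        # Negative example: any other sentence that is not the successor (assumed strategy).
        candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
        if candidates:
            other = list(sentences[rng.choice(candidates)])
            examples.append((["[CLS]"] + first + ["[SEP]"] + other, 0))
    return examples

article = ["我喜欢读书", "书店在街角", "今天下雨了"]
examples = build_pair_examples(article, random.Random(0))
```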
The BERT model is pre-trained on these two training sets, and the trained weight parameters are stored for use as the initial values of the BERT part of the BERT-CNN model.
The second step, as shown in FIG. 1, is to construct the BERT-CNN model, whose internal structure is shown in FIG. 2. The output at the first position of each of the 12 layers of the BERT model is used as the input of the CNN model, giving an input width of 12; the CNN model applies convolution and max pooling to this width-12 input matrix to obtain a new and more effective sentence semantic feature vector.
The BERT model employed above consists of 12 Transformer encoder layers. The core component of the Transformer encoder is a multi-head attention mechanism composed of 8 self-attention mechanisms, and its output is formed by concatenating the outputs of the 8 self-attention mechanisms. This allows the model to learn relevant information in different representation subspaces.
Self-attention is calculated as follows:
$$\mathrm{attention}(Q, K, V) = \mathrm{softmax}\left(\frac{(QW^{Q})(KW^{K})^{T}}{\sqrt{d_k}}\right) VW^{V}$$
In self-attention, $Q = K = V$, and all three are the input matrix of the attention mechanism; $W^{Q}$, $W^{K}$, $W^{V}$ are the three weight matrices corresponding to $Q$, $K$, $V$, and are weight parameters learned by the model. $d_k$ is the dimension of the row vectors of the input matrix; it appears in the denominator to keep the inner products from becoming too large.
Multi-head attention is calculated as follows:
$$\mathrm{multihead}(Q, K, V) = \mathrm{concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^{O}$$
where concat() concatenates the matrices along their row vectors; $\mathrm{head}_i$ is the result of the $i$-th self-attention in the multi-head attention; and $W^{O}$ is the weight matrix connecting the output of the multi-head attention to the next layer.
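The two attention formulas can be illustrated with a small pure-Python sketch. It assumes the standard scaled dot-product form (softmax of the scaled inner products, applied to the projected values); the toy dimensions, identity projections, and averaging output matrix are chosen only to keep the example checkable and are not taken from the patent.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multihead(x, heads, w_o):
    """heads is a list of (W_Q, W_K, W_V) triples; head outputs are concatenated
    row by row and then projected by W_O."""
    outs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv)) for wq, wk, wv in heads]
    concat = [sum((out[i] for out in outs), []) for i in range(len(x))]
    return matmul(concat, w_o)

x = [[1.0, 0.0], [0.0, 1.0]]                  # 2 tokens, dimension 2 (toy input)
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, identity, identity)] * 8  # 8 heads, as in the text
w_o = [[0.125 if r % 2 == c else 0.0 for c in range(2)] for r in range(16)]  # averages the heads
out = multihead(x, heads, w_o)
```

With identical heads and an averaging `w_o`, the multi-head output equals the single-head output, which makes the concatenate-then-project step easy to verify by hand.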
The CNN model employed above is a one-dimensional convolutional neural network. In a one-dimensional convolution, the kernel only moves downward across the input matrix, never left or right, so one convolution pass over the input matrix yields a one-dimensional vector. The CNN model consists of a convolutional layer and a pooling layer; in this embodiment, the convolutional layer uses three types of convolution kernels with window sizes of 2, 3, and 4, and the pooling layer uses max pooling.
The resulting sentence semantic feature vector is then used as the input of the fully connected layer, and the class probabilities of the sentence are finally obtained through the softmax layer.
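The CNN head described above can be sketched as follows. The 12 rows stand for the first-position outputs of the 12 BERT layers; the hidden size of 4, one kernel per window size, random placeholder values, and two output classes are assumptions made to keep the example small, and the names `conv1d_maxpool` and `cls_per_layer` are hypothetical.

```python
import math
import random

def conv1d_maxpool(matrix, kernel, bias=0.0):
    """Slide a (window x hidden) kernel down the rows only, then max-pool to one scalar."""
    window = len(kernel)
    outputs = []
    for start in range(len(matrix) - window + 1):
        total = bias
        for r in range(window):
            total += sum(m * k for m, k in zip(matrix[start + r], kernel[r]))
        outputs.append(total)
    return max(outputs)

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
hidden = 4  # toy hidden size; a real BERT hidden size is much larger
# Placeholder for the first-position output of each of the 12 BERT layers.
cls_per_layer = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(12)]
# One kernel for each window size 2, 3, and 4.
kernels = [[[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(w)] for w in (2, 3, 4)]
features = [conv1d_maxpool(cls_per_layer, kernel) for kernel in kernels]
# Fully connected layer followed by softmax over two classes.
fc = [[rng.gauss(0.0, 0.1) for _ in range(len(features))] for _ in range(2)]
logits = [sum(w * f for w, f in zip(row, features)) for row in fc]
probs = softmax(logits)
```

A window of size w over 12 rows produces 12 - w + 1 values before pooling, so the three kernels yield 11, 10, and 9 values respectively, each reduced to one feature by max pooling.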
The third step, as shown in FIG. 1, is to initialize the weight parameters of the BERT-CNN model. First, the weight parameters of the BERT model are initialized with the weights that were pre-trained and stored in the first step. The weight parameters of the CNN model are then initialized with randomly generated values that follow a normal distribution.
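As a rough illustration of this two-part initialization, the sketch below copies stored BERT weights and draws CNN weights from a normal distribution. The flat dictionary layout, the parameter names, and the standard deviation of 0.02 are assumptions; the patent specifies only that the CNN values follow a normal distribution.

```python
import random

def init_bert_cnn(pretrained_bert, cnn_sizes, rng, std=0.02):
    """Copy pre-trained BERT weights and draw CNN weights from N(0, std^2).
    std=0.02 is an assumed value, not one given in the source."""
    params = {"bert." + name: list(values) for name, values in pretrained_bert.items()}
    for name, size in cnn_sizes.items():
        params["cnn." + name] = [rng.gauss(0.0, std) for _ in range(size)]
    return params

pretrained = {"layer0.weight": [0.1, 0.2]}  # stand-in for the stored pre-trained weights
params = init_bert_cnn(pretrained, {"conv2.weight": 8}, random.Random(0))
```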
The fourth step, as shown in FIG. 1, is to train the BERT-CNN classifier. Training proceeds by inputting sentences. For a sentence such as "the product quality is not good", which is clearly a negative review, the model is expected to output a probability below 0.5, and the closer the value is to 0, the more accurate the prediction. For a positive review, the predicted probability is expected to be above 0.5, and the closer to 1 the better. The application therefore uses cross entropy as the loss function and Adam as the optimizer, updating the weights continuously so that the model reaches an optimal set of weight parameters on the training data. Moreover, the updates apply not only to the CNN model: the weight parameters of the BERT model are also updated, that is, fine-tuned for the task.
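The loss and optimizer can be illustrated for a single example and a single scalar parameter. The sketch assumes binary cross entropy and the commonly used Adam defaults (learning rate 0.001, beta1 0.9, beta2 0.999); the patent names cross entropy and Adam but gives no hyperparameters, so these values are assumptions.

```python
import math

def cross_entropy(p_positive, label):
    """Binary cross entropy for one example; label is 1 (positive) or 0 (negative)."""
    eps = 1e-12
    p = min(max(p_positive, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# A negative review (label 0): a low predicted probability gives a low loss.
loss_good = cross_entropy(0.1, 0)
loss_bad = cross_entropy(0.9, 0)
theta, m, v = adam_step(theta=0.0, grad=2.0, m=0.0, v=0.0, t=1)
```

On the first step Adam moves the parameter by roughly the learning rate in the direction opposite the gradient, regardless of the gradient's magnitude.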
The application uses the combined BERT and CNN model to extract effective sentence features. The pre-trained BERT model not only provides powerful semantic representations of words and sentences, but can also be applied directly to tasks in any field without being re-trained on domain data, which gives it an advantage over the word2vec model. The BERT model also uses an attention mechanism to handle long-distance dependencies, which avoids the inability of conventional RNN models to compute in parallel. On this basis, the application introduces a CNN model to further fuse the features output by the BERT model, so that more effective sentence semantic features are obtained.
Claims (3)
1. A Chinese text classification method based on BERT and CNN hierarchical connection is characterized by comprising the following steps:
step 1, pre-training a BERT model on a large number of public Chinese text data sets to obtain and store all parameters of the BERT model, wherein the BERT model consists of 12 Transformer encoder layers, and the Chinese text data sets for pre-training the BERT model include an intra-sentence prediction training set and a sentence-pair continuity training set, wherein:
the intra-sentence prediction training set is constructed as follows:
after the data is split into sentences, 15% of the words in each sentence are randomly selected for masking; of these selected words, 80% are replaced by [MASK], 10% keep the original word, and the remaining 10% are replaced by a random word; a [CLS] token is prepended to the sentence, and the resulting sentence is used as the input of the BERT model to predict the 15% of words that were selected;
the sentence-pair continuity training set is constructed as follows:
after the data is split into sentences, any two sentences are joined into one through [SEP], and a [CLS] token is prepended; the resulting sentence is used as the input of the BERT model to predict whether the two sentences are consecutive in the article, wherein the output of the BERT model is a probability value representing the probability that the two sentences are consecutive;
step 2, hierarchically connecting a CNN model to the BERT model to obtain a BERT-CNN model, wherein the output at the first position of each of the 12 layers of the BERT model is used as the input of the CNN model, so that the input width is 12; in the BERT-CNN model, the CNN model applies convolution and max pooling to this width-12 input matrix to obtain a new, more effective sentence semantic feature vector, which is then fed into a fully connected layer and finally a classifier;
step 3, initializing the parameters of the BERT part with the values obtained in pre-training, and initializing the parameters of the CNN part with random values drawn from a normal distribution;
step 4, preprocessing the classification training set;
step 5, retraining the BERT-CNN model on the preprocessed data set.
2. The Chinese text classification method based on BERT and CNN hierarchical connection as claimed in claim 1, wherein in step 2, the core component of the Transformer encoder is a multi-head attention mechanism composed of 8 self-attention mechanisms, and the output of the Transformer encoder is formed by concatenating the outputs of the 8 self-attention mechanisms.
3. The Chinese text classification method based on BERT and CNN hierarchical connection as claimed in claim 1, wherein in step 4, the data preprocessing includes removing some invalid character strings from each sentence and then segmenting the sentence character by character.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911302047.2A CN111177376B (en) | 2019-12-17 | 2019-12-17 | Chinese text classification method based on BERT and CNN hierarchical connection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111177376A (en) | 2020-05-19 |
CN111177376B (en) | 2023-08-15 |
Family
ID=70657375
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201911302047.2A | Active | CN111177376B (en) | 2019-12-17 | 2019-12-17 | Chinese text classification method based on BERT and CNN hierarchical connection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111177376B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111858848B (en) * | 2020-05-22 | 2024-03-15 | 青岛创新奇智科技集团股份有限公司 | Semantic classification method and device, electronic equipment and storage medium |
CN111737512B (en) * | 2020-06-04 | 2021-11-12 | 东华大学 | Silk cultural relic image retrieval method based on depth feature region fusion |
CN111737475B (en) * | 2020-07-21 | 2021-06-22 | 南京擎盾信息科技有限公司 | Unsupervised network public opinion spam long text recognition method |
CN112101027A (en) * | 2020-07-24 | 2020-12-18 | 昆明理工大学 | Chinese named entity recognition method based on reading understanding |
CN111930952A (en) * | 2020-09-21 | 2020-11-13 | 杭州识度科技有限公司 | Method, system, equipment and storage medium for long text cascade classification |
CN113342970B (en) * | 2020-11-24 | 2023-01-03 | 中电万维信息技术有限责任公司 | Multi-label complex text classification method |
CN112463965A (en) * | 2020-12-03 | 2021-03-09 | 上海欣方智能***有限公司 | Method and system for semantic understanding of text |
CN112559730B (en) * | 2020-12-08 | 2021-08-24 | 北京京航计算通讯研究所 | Text abstract automatic generation method and system based on global feature extraction |
CN112732916B (en) * | 2021-01-11 | 2022-09-20 | 河北工业大学 | BERT-based multi-feature fusion fuzzy text classification system |
CN113032539A (en) * | 2021-03-15 | 2021-06-25 | 浙江大学 | Causal question-answer pair matching method based on pre-training neural network |
CN113312568B (en) * | 2021-03-25 | 2022-06-17 | 罗普特科技集团股份有限公司 | Web information extraction method and system based on HTML source code and webpage snapshot |
CN113468324A (en) * | 2021-06-03 | 2021-10-01 | 上海交通大学 | Text classification method and system based on BERT pre-training model and convolutional network |
CN114242159B (en) * | 2022-02-24 | 2022-06-07 | 北京晶泰科技有限公司 | Method for constructing antigen peptide presentation prediction model, and antigen peptide prediction method and device |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110209824A (en) * | 2019-06-13 | 2019-09-06 | 中国科学院自动化研究所 | Text emotion analysis method based on built-up pattern, system, device |
-
2019
- 2019-12-17 CN CN201911302047.2A patent/CN111177376B/en active Active
Non-Patent Citations (1)
Title |
---|
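Chinese textual entailment recognition method based on a hybrid attention mechanism; Huang Shengbin et al.; Journal of Beijing Information Science and Technology University (Natural Science Edition); 2020-06-15 (No. 03); full text * |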
Also Published As
Publication number | Publication date |
---|---|
CN111177376A (en) | 2020-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111177376B (en) | Chinese text classification method based on BERT and CNN hierarchical connection | |
CN112214599B (en) | Multi-label text classification method based on statistics and pre-training language model | |
CN109918671B (en) | Electronic medical record entity relation extraction method based on convolution cyclic neural network | |
CN109992783B (en) | Chinese word vector modeling method | |
Chen et al. | Research on text sentiment analysis based on CNNs and SVM | |
CN110609891A (en) | Visual dialog generation method based on context awareness graph neural network | |
CN110866117A (en) | Short text classification method based on semantic enhancement and multi-level label embedding | |
CN110619034A (en) | Text keyword generation method based on Transformer model | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN109558576B (en) | Punctuation mark prediction method based on self-attention mechanism | |
CN108388560A (en) | GRU-CRF meeting title recognition methods based on language model | |
CN111143563A (en) | Text classification method based on integration of BERT, LSTM and CNN | |
CN108170848B (en) | Chinese mobile intelligent customer service-oriented conversation scene classification method | |
CN113297364B (en) | Natural language understanding method and device in dialogue-oriented system | |
CN113673254B (en) | Knowledge distillation position detection method based on similarity maintenance | |
CN110555084A (en) | remote supervision relation classification method based on PCNN and multi-layer attention | |
CN114817494B (en) | Knowledge search type dialogue method based on pre-training and attention interaction network | |
CN111027292B (en) | Method and system for generating limited sampling text sequence | |
CN113239690A (en) | Chinese text intention identification method based on integration of Bert and fully-connected neural network | |
CN112070139A (en) | Text classification method based on BERT and improved LSTM | |
CN114153973A (en) | Mongolian multi-mode emotion analysis method based on T-M BERT pre-training model | |
CN112287106A (en) | Online comment emotion classification method based on dual-channel hybrid neural network | |
CN111581392B (en) | Automatic composition scoring calculation method based on statement communication degree | |
CN114417851A (en) | Emotion analysis method based on keyword weighted information | |
CN111949762A (en) | Method and system for context-based emotion dialogue, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||