CN113220890A - Deep learning method combining news headlines and news long text contents based on pre-training - Google Patents
Info
- Publication number
- CN113220890A (application CN202110645654.XA)
- Authority
- CN
- China
- Prior art keywords
- news
- training
- text
- model
- long
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a deep learning method, based on pre-training, that combines news headlines and long news text contents, with the aim of improving the prediction accuracy of news text classification. The method mainly comprises the following steps: preprocessing the data; loading the vocabulary, model parameters, and pre-trained model; training a news text classification model based on the pre-trained model using news headlines; training a news text classification model combining the pre-trained model with a classification algorithm using long news text contents; and verifying the trained pre-training-based news text classification model on a test set. Text classification methods based on traditional deep learning predict Chinese news texts poorly. To address this, a pre-training-based deep learning method combining news headlines and long news text contents is proposed on a self-collected data set, which can effectively improve the prediction accuracy for Chinese news texts.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a deep learning method combining news headlines and news long text contents based on pre-training.
Background
Text classification is a popular problem in natural language processing. With the continued economic development of China, information is growing explosively. News texts are diverse and complex: their contents overlap and resemble one another, similarity between categories is high, and category boundaries are unclear. Rapidly classifying massive volumes of news text is therefore of great significance.
In recent years, research on Chinese text classification has developed rapidly. Zhou et al. combined a convolutional neural network with a long short-term memory network to propose the C-LSTM (A C-LSTM Neural Network for Text Classification) text classification algorithm. C-LSTM uses the convolutional neural network to extract high-level phrase representations, which are then fed into the long short-term memory network to obtain sentence representations; C-LSTM can thus capture both local phrase features and sentence-level semantic information. Lai et al. combined a convolutional neural network with a recurrent neural network to propose the TextRCNN (Recurrent Convolutional Neural Networks for Text Classification) text classification model. TextRCNN extends C-LSTM by using a bidirectional long short-term memory network to obtain context information: the hidden-layer outputs of the bidirectional LSTM are concatenated with the word vectors, the concatenated vectors are nonlinearly mapped to a lower dimension, and the value at each position of the vector is max-pooled over all time steps to obtain the final feature vector.
Classical algorithms for long texts include TextCNN (Convolutional Neural Networks for Sentence Classification) and TextRNN (Recurrent Neural Networks for Text Classification with Multi-Task Learning). These algorithms are optimized for the high dimensionality, word order, and running time of text classification. Convolutional neural networks can extract local spatial or short-term structural relationships; for sentence modeling they are good at extracting n-gram features at different positions in a sentence and can learn both short-range and long-range relationships through pooling operations, but their feature extraction capability for sequential data is weak. Recurrent neural networks, conversely, handle sequential data well but cannot extract local spatial or short-term structural relationships.
At present, traditional deep learning models still cannot achieve high accuracy in classifying news texts: semantic understanding in text classification is insufficient, and the deep features of long news texts are difficult to obtain. To address these shortcomings, we propose a deep learning method, based on pre-training, that combines news headlines and long news text contents.
Disclosure of Invention
The method loads pre-trained language model parameters and builds on the TextRCNN text classification model. A convolutional neural network is used to identify key phrases in the text, while the recurrent part treats the text as a sequence of words and captures the dependency relationships and structure among those words. By fusing these models, the method extracts local features of the text, analyzes sentence-level semantic features related to the text context, and supplements the local and global interactive information of the text, which considerably improves the classification accuracy of the text model. The method comprises the following steps.
Step S1: data preprocessing. Clean the crawled news texts, store the data in label-plus-title and label-plus-content form, and split the data set into 80% training set, 10% validation set, and 10% test set.
Step S2: load the vocabulary required by the method, the parameters of the pre-trained model, and the BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding) pre-trained model.
Step S3: the BERT-based news text classification model is trained using a news headline training set, and the BERT-and TextRCNN-based news text classification model is trained using a news long-text content training set.
Step S4: verify the trained pre-training-based news text classification model on the test set, and compute the accuracy, recall, and F1 value of the pre-training-based news text classification model.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 compares, on the self-collected news text data set, different content models combined with the same title model.
Detailed Description
The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention. The invention will now be described in further detail by means of the figures and examples.
The embodiments of the invention assume that the data set is a news text data set collected by the authors.
Fig. 1 is a schematic flow chart of a text classification model based on pre-training according to an embodiment of the present invention. As shown in fig. 1, the present embodiment mainly includes the following steps:
step S1: data pre-processing
Clean the crawled news texts and keep only news whose content text exceeds 200 characters. The data set contains ninety thousand news samples divided into 9 categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment. Store the data in label-plus-title and label-plus-content form, and split the data set into 80% training set, 10% validation set, and 10% test set;
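The cleaning and 80/10/10 split described above can be sketched in Python. This is a minimal illustration under assumptions: the sample dictionary fields, the helper name `clean_and_split`, and the fixed random seed are not specified by the patent.

```python
import random

# The 9 categories named in the description
LABELS = ["finance", "real estate", "education", "science and technology",
          "military", "automobile", "sports", "games", "entertainment"]

def clean_and_split(samples, min_len=200, seed=0):
    """Keep news whose content exceeds min_len characters, store
    (label, title) and (label, content) pairs, and split 80/10/10."""
    kept = [s for s in samples if len(s["content"]) > min_len]
    title_data = [(s["label"], s["title"]) for s in kept]      # label + title
    content_data = [(s["label"], s["content"]) for s in kept]  # label + content
    rng = random.Random(seed)
    idx = list(range(len(kept)))
    rng.shuffle(idx)
    n_train = int(0.8 * len(idx))
    n_val = int(0.1 * len(idx))
    split = {
        "train": idx[:n_train],
        "val": idx[n_train:n_train + n_val],
        "test": idx[n_train + n_val:],
    }
    return title_data, content_data, split
```

The same index split is used for both the title and content views of each sample, so the two branches of the model are trained on identical partitions.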
step S2: vocabulary required by loading method, parameters of pre-training model and BERT pre-training model
The pre-trained model is BERT. The inputs of the network model are the news headlines and the long text contents of the news texts; mask and truncation operations are applied to the headlines and the contents, and the model outputs word vectors for the headlines and word vectors for the contents.
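The mask and truncation operations can be illustrated with a BERT-style input-preparation sketch. The special-token ids used here (101 for [CLS], 102 for [SEP], 0 for [PAD]) follow the standard BERT vocabulary convention and are assumptions for illustration; the patent does not give them.

```python
def prepare_input(token_ids, max_len):
    """Truncate a token-id sequence to fit max_len (reserving room for
    [CLS] and [SEP]), pad to max_len, and build the attention mask
    (1 = real token, 0 = padding), as in BERT-style preprocessing."""
    CLS, SEP, PAD = 101, 102, 0  # conventional ids in the standard BERT vocab
    body = token_ids[: max_len - 2]   # truncation
    ids = [CLS] + body + [SEP]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    ids += [PAD] * pad                # padding to fixed length
    mask += [0] * pad                 # mask out the padding positions
    return ids, mask
```

Both the headline and the content are run through the same routine, with a larger `max_len` for the long content text.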
Step S3: training a BERT-based news text classification model by using a news headline training set, and training a BERT-and TextRCNN-based news text classification model by using a news long text content training set;
the method comprises the steps of connecting a bidirectional long-time and short-time memory neural network with a BERT, inputting word vectors serving as contents into a bidirectional long-time and short-time memory neural network model, processing the word vectors through the network model to obtain feature word vectors based on contexts, splicing initial word vectors and trained feature word vectors based on the contexts, activating the word vectors by using a relu function, then performing maximum pooling operation on the convolutional neural network to obtain local feature word vectors, compressing dimensionality of data by using a sequence function to obtain one-dimensional vectors, and transmitting the one-dimensional vectors into a full connection layer to obtain vector representation of the contents. After the word vectors of the titles are transmitted into the full-connection layer, the vector representation of the titles is obtained, and the vector representation dimensions of the titles and the content are the same and are consistent with the number of the classification labels. And splicing the results obtained by the title and the content, and obtaining a final representation through a SoftMax function.
Step S4: verifying the trained news text classification model based on the pre-training by using a test set, and calculating the accuracy, the recall rate and the F1 value of the news text classification model based on the pre-training;
the above embodiments are only for illustrating the invention and not for limiting the same, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the invention, so that all equivalent technical solutions also belong to the scope of the invention, and the scope of the invention should be defined by the claims.
Example 1 experimental results of the invention on a self-collected news text dataset
The data set consists of ninety thousand long news texts divided into 9 categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment. The content of each news item is longer than 200 characters. The data were crawled from websites rendered in the translation as Pink Silk Screen, Minnan Web, Game Grand View, and Upper Square Web, and are used for news text classification.
Accuracy, recall, and F1 are selected as evaluation metrics; their calculation formulas are as follows:
where TP is the number of samples of a class correctly predicted as that class, TN is the number of samples correctly identified as not belonging to the class, FP is the number of samples misclassified into the class, and FN is the number of samples that belong to the class but are classified into other classes.
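The formulas themselves are not reproduced in this rendering. Assuming the conventional per-class definitions consistent with the TP/FP/FN notation above (with the reported per-class "accuracy" computed as precision from TP and FP), they are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}
           {\mathrm{Precision} + \mathrm{Recall}}
```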
TABLE 1 Parameters of the parts of the content model (BERT_RCNN) in the experiment
TABLE 2 Parameters of the parts of the title model in the experiment
Example 1: the model of the invention was tested and validated on the data set. Accuracy, recall, and F1 were selected as evaluation metrics and compared against 3 classical pre-training-based text classification methods: BERT(title)+BERT(content), BERT(title)+BERT_CNN(content), and BERT(title)+BERT_RNN(content). All comparison methods were run with their respective optimal parameters; the experimental comparison results are shown in Table 3.
TABLE 3 experimental comparison results
From the experimental results in Table 3, the compared BERT(title)+BERT(content) method achieved an accuracy of 94.13%, a recall of 94.25%, and an F1 value of 94.14%. The compared BERT(title)+BERT_CNN(content) method achieved an accuracy of 93.64%, a recall of 93.80%, and an F1 value of 93.69%, indicating that the BERT_CNN method extracts only local semantic features from the news text data and extracts features from sequential text data poorly. The compared BERT(title)+BERT_RNN(content) method achieved an accuracy of 91.84%, a recall of 92.07%, and an F1 value of 91.82%, indicating that the LSTM cannot extract local spatial or short-term structural features. The BERT(title)+BERT_RCNN(content) method used in the invention achieved accuracy, recall, and F1 values of 94.76%, 94.86%, and 94.76%, respectively. The experimental results show that BERT_RCNN not only extracts local features of the text but also analyzes sentence-level semantic features related to the text context and supplements the local and global interactive information of the text, thereby improving classification accuracy to a certain extent.
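The accuracy, recall, and F1 evaluation used for these comparisons can be sketched as a per-class computation from the TP/FP/FN counts, macro-averaged over the 9 classes; the function name and the macro-averaging choice are assumptions for illustration.

```python
def prf1(y_true, y_pred):
    """Per-class precision, recall, and F1 from TP, FP, FN counts,
    macro-averaged over the classes that appear in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Running this over the test-set predictions of each compared model yields the three columns reported in Table 3.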
Claims (3)
1. A deep learning method based on pre-training and combining news headlines and news long text contents is characterized by comprising the following steps:
step S1: data preprocessing: cleaning the crawled news texts and keeping only news whose content text exceeds 200 characters, the data set comprising ninety thousand news samples divided into 9 categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment;
storing the data according to the forms of labels plus titles and labels plus contents, and dividing a data set according to the proportion of 80% of a training set, 10% of a verification set and 10% of a test set;
step S2: loading a vocabulary required by the method, parameters of a pre-training model and a BERT pre-training model;
step S3: training a BERT-based news text classification model by using a news headline training set, and training a BERT-and RCNN-based news text classification model by using a news long text content training set;
step S4: and verifying the trained news text classification model based on the pre-training by using the test set, and calculating the accuracy, the recall rate and the F1 value of the news text classification model based on the pre-training.
2. The pre-training-based deep learning method combining news headlines and news long-text content as claimed in claim 1, wherein: in step S2, the pre-trained model is BERT; the inputs of the network model are the news headlines and the long text contents of the news texts; mask and truncation operations are applied to the headlines and the contents, and word vectors of the headlines and word vectors of the contents are output.
3. The pre-training-based deep learning method combining news headlines and news long-text content as claimed in claim 1, wherein: in step S3, a bidirectional long short-term memory neural network is connected after BERT; the content word vectors are input into the bidirectional long short-term memory neural network model, which processes them to obtain context-based feature word vectors; the initial word vectors are concatenated with the trained context-based feature word vectors and activated with a ReLU function; a max-pooling operation, as in a convolutional neural network, is then used to obtain local feature word vectors, and a squeeze operation compresses the data dimensions into a one-dimensional vector, which is passed into the fully connected layer to obtain the vector representation of the content; the headline word vectors are passed into the fully connected layer to obtain the vector representation of the headline, and the headline and content vector representations have the same dimension, equal to the number of classification labels; the headline and content results are concatenated, and the final representation is obtained through a SoftMax function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110645654.XA CN113220890A (en) | 2021-06-10 | 2021-06-10 | Deep learning method combining news headlines and news long text contents based on pre-training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110645654.XA CN113220890A (en) | 2021-06-10 | 2021-06-10 | Deep learning method combining news headlines and news long text contents based on pre-training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113220890A true CN113220890A (en) | 2021-08-06 |
Family
ID=77083520
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110645654.XA Withdrawn CN113220890A (en) | 2021-06-10 | 2021-06-10 | Deep learning method combining news headlines and news long text contents based on pre-training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220890A (en) |
- 2021-06-10: application CN202110645654.XA filed, published as CN113220890A, status not active (withdrawn)
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113743081A (en) * | 2021-09-03 | 2021-12-03 | 西安邮电大学 | Recommendation method of technical service information |
CN113743081B (en) * | 2021-09-03 | 2023-08-01 | 西安邮电大学 | Recommendation method of technical service information |
CN113821637A (en) * | 2021-09-07 | 2021-12-21 | 北京微播易科技股份有限公司 | Long text classification method and device, computer equipment and readable storage medium |
CN113987171A (en) * | 2021-10-20 | 2022-01-28 | 绍兴达道生涯教育信息咨询有限公司 | News text classification method and system based on pre-training model variation |
CN115269854A (en) * | 2022-08-30 | 2022-11-01 | 重庆理工大学 | False news detection method based on theme and structure perception neural network |
CN115269854B (en) * | 2022-08-30 | 2024-02-02 | 重庆理工大学 | False news detection method based on theme and structure perception neural network |
CN116628171A (en) * | 2023-07-24 | 2023-08-22 | 北京惠每云科技有限公司 | Medical record retrieval method and system based on pre-training language model |
CN116628171B (en) * | 2023-07-24 | 2023-10-20 | 北京惠每云科技有限公司 | Medical record retrieval method and system based on pre-training language model |
CN117743585A (en) * | 2024-02-20 | 2024-03-22 | 广东海洋大学 | News text classification method |
CN117743585B (en) * | 2024-02-20 | 2024-04-26 | 广东海洋大学 | News text classification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829757B (en) | Intelligent service method, server and storage medium for chat robot | |
CN113220890A (en) | Deep learning method combining news headlines and news long text contents based on pre-training | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN107943784B (en) | Relationship extraction method based on generation of countermeasure network | |
CN112732916B (en) | BERT-based multi-feature fusion fuzzy text classification system | |
CN110321563B (en) | Text emotion analysis method based on hybrid supervision model | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN111143563A (en) | Text classification method based on integration of BERT, LSTM and CNN | |
CN109271524B (en) | Entity linking method in knowledge base question-answering system | |
CN111414481A (en) | Chinese semantic matching method based on pinyin and BERT embedding | |
CN113239663B (en) | Multi-meaning word Chinese entity relation identification method based on Hopkinson | |
CN112199503B (en) | Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method | |
CN114676255A (en) | Text processing method, device, equipment, storage medium and computer program product | |
CN114020906A (en) | Chinese medical text information matching method and system based on twin neural network | |
CN117236338B (en) | Named entity recognition model of dense entity text and training method thereof | |
CN113705315A (en) | Video processing method, device, equipment and storage medium | |
CN111339772B (en) | Russian text emotion analysis method, electronic device and storage medium | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
CN111680529A (en) | Machine translation algorithm and device based on layer aggregation | |
CN113408619B (en) | Language model pre-training method and device | |
Anjum et al. | Exploring Humor in Natural Language Processing: A Comprehensive Review of JOKER Tasks at CLEF Symposium 2023. | |
CN117610567A (en) | Named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF | |
CN112329441A (en) | Legal document reading model and construction method | |
CN112989839A (en) | Keyword feature-based intent recognition method and system embedded in language model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210806 |