CN113220890A - Deep learning method combining news headlines and news long text contents based on pre-training - Google Patents

Deep learning method combining news headlines and news long text contents based on pre-training

Info

Publication number
CN113220890A
Authority
CN
China
Prior art keywords
news
training
text
model
long
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110645654.XA
Other languages
Chinese (zh)
Inventor
王贵参
伍俊霖
王红梅
党源源
张丽杰
王桂娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Technology
Original Assignee
Changchun University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Technology filed Critical Changchun University of Technology
Priority to CN202110645654.XA priority Critical patent/CN113220890A/en
Publication of CN113220890A publication Critical patent/CN113220890A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a deep learning method, based on pre-training, that combines news headlines with news long text contents, and aims to improve the prediction accuracy of news texts. The method mainly comprises the following steps: preprocess the data; load the vocabulary, the model parameters, and the pre-training model; train a news text classification model based on the pre-training model using the news headlines; train a news text classification model based on the combination of the pre-training model and a classification algorithm using the news long text contents; and verify the trained pre-training-based news text classification model with a test set. Text classification methods based on traditional deep learning predict Chinese news texts poorly. To address this, a pre-training-based deep learning method combining news headlines and news long text contents is proposed on a self-collected data set, which can effectively improve the prediction accuracy of Chinese news texts.

Description

Deep learning method combining news headlines and news long text contents based on pre-training
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a deep learning method combining news headlines and news long text contents based on pre-training.
Background
Text classification is a popular problem in natural language processing. With the continuous development of China's economy, the volume of information has grown explosively. News texts are diverse and complex: their contents overlap and resemble one another, similarity between categories is high, and category boundaries are unclear. Rapidly and accurately classifying massive news texts is therefore of great significance.
In recent years, research on Chinese text classification has developed rapidly. Zhou et al. combined a convolutional neural network with a long short-term memory network to propose the C-LSTM text classification algorithm (A C-LSTM Neural Network for Text Classification). C-LSTM uses the convolutional neural network to extract high-level phrase representations and feeds them into the long short-term memory network to obtain sentence representations, so it captures both local phrase features and sentence-level semantic information. Lai et al. combined a convolutional neural network with a recurrent neural network to propose the TextRCNN text classification model (Recurrent Convolutional Neural Networks for Text Classification). TextRCNN goes beyond C-LSTM by using a bidirectional long short-term memory network to obtain context information: the hidden-layer outputs of the bidirectional LSTM are spliced with the word vectors, the spliced vectors are nonlinearly mapped to a low dimension, and each position of the vector takes its maximum value over all time steps to produce the final feature vector, as sketched below.
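As a rough illustration of the TextRCNN combination step just described, the following PyTorch sketch splices BiLSTM context features with the original word vectors, maps them nonlinearly to a low dimension, and takes the maximum over all time steps. All module names and dimensions are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn as nn

class TextRCNNBlock(nn.Module):
    """Sketch of the TextRCNN idea: BiLSTM context spliced with the
    original word vectors, a nonlinear low-dimensional map, then a
    max over all time steps to form the final feature vector."""

    def __init__(self, emb_dim=300, hidden=128, proj=64, num_classes=9):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden,
                              bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden + emb_dim, proj)
        self.fc = nn.Linear(proj, num_classes)

    def forward(self, emb):                  # emb: (batch, seq_len, emb_dim)
        ctx, _ = self.bilstm(emb)            # (batch, seq_len, 2*hidden)
        cat = torch.cat([ctx, emb], dim=-1)  # splice context and word vectors
        y = torch.tanh(self.proj(cat))       # nonlinear map to low dimension
        pooled, _ = y.max(dim=1)             # max value over all time steps
        return self.fc(pooled)
```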
Classical algorithms for long texts include TextCNN (Convolutional Neural Networks for Sentence Classification) and TextRNN (Recurrent Neural Networks for Text Classification with Multi-Task Learning). These algorithms optimize for the high dimensionality of text data, word order, training time, and so on. Convolutional neural networks can extract local spatial or short-time structural relationships; in sentence models they are good at extracting n-gram features at different positions in a sentence and can learn short-range and long-range relationships through pooling operations, but their ability to extract features from sequential data is poor. Recurrent neural networks, conversely, handle sequential data well but cannot extract local spatial or short-time structural relationships. A TextCNN-style sketch follows.
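For comparison, the minimal TextCNN-style sketch below runs parallel convolutions over the word embeddings to extract n-gram features at every sentence position, then keeps the strongest response per filter with max-over-time pooling. Filter counts and kernel sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextCNNBlock(nn.Module):
    """Sketch of TextCNN: convolutions over word embeddings extract
    n-gram features; max-over-time pooling summarizes each filter."""

    def __init__(self, emb_dim=300, n_filters=100,
                 kernel_sizes=(2, 3, 4), num_classes=9):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), num_classes)

    def forward(self, emb):                 # emb: (batch, seq_len, emb_dim)
        x = emb.transpose(1, 2)             # Conv1d wants (batch, channels, seq)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))
```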
At present, traditional deep learning models still fail to classify news texts with high accuracy: semantic understanding in text classification is insufficient, and the deep features of long news texts are difficult to obtain. To address these shortcomings, we propose a deep learning method combining news headlines and news long text contents based on pre-training.
Disclosure of Invention
The method loads the parameters of a pre-trained language model and builds on the TextRCNN text classification model. The convolutional part recognizes distinguishing patterns of the text, for example key phrases, while the recurrent part treats the text as a sequence of words and aims to capture the dependency relationships between words and the structure of the text. Fusing the models extracts local features of the text, analyzes sentence-level semantic features related to the text context, and supplements the interactive information between the local and the whole text, so the classification accuracy of the text model improves considerably. The method comprises the following steps.
Step S1: data preprocessing. Clean the crawled news text, store the data in label+title and label+content form, and divide the data set into 80% training set, 10% validation set, and 10% test set.
Step S2: load the vocabulary required by the method, the parameters of the pre-training model, and the BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding) pre-training model.
Step S3: train a BERT-based news text classification model using the news headline training set, and train a news text classification model based on BERT and TextRCNN using the news long text content training set.
Step S4: verify the trained pre-training-based news text classification model with the test set, and calculate its accuracy, recall rate, and F1 value.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 compares, on the self-collected news text data set, models that share the same title model but differ in the content model.
Detailed Description
The following examples are intended to illustrate the invention, not to limit its scope. The invention is described in further detail below with reference to the figures and examples.
The embodiments of the invention assume that the data set is the news text data set collected by the authors.
Fig. 1 is a schematic flow chart of a text classification model based on pre-training according to an embodiment of the present invention. As shown in fig. 1, the present embodiment mainly includes the following steps:
Step S1: data preprocessing
Clean the crawled news text, keeping only news whose content text exceeds 200 characters. The data set contains ninety thousand news samples in nine categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment. Store the data in label+title and label+content form, and divide the data set into 80% training set, 10% validation set, and 10% test set, as sketched below;
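A minimal Python sketch of this preprocessing step is given below. The (label, title, content) tuple format and the file layout are assumptions; the description only specifies the 200-character filter, the label+title / label+content storage, and the 80/10/10 split.

```python
import random

def preprocess_and_split(samples, seed=42):
    """Minimal sketch of Step S1. `samples` is an assumed list of
    (label, title, content) tuples produced by the crawler."""
    # Keep only news whose content exceeds 200 characters, as described.
    kept = [s for s in samples if len(s[2]) > 200]
    random.Random(seed).shuffle(kept)
    n = len(kept)
    train = kept[:int(n * 0.8)]           # 80% training set
    valid = kept[int(n * 0.8):int(n * 0.9)]  # 10% validation set
    test = kept[int(n * 0.9):]            # 10% test set
    return train, valid, test

def save(split, title_path, content_path):
    """Store the data as label+title and label+content lines."""
    with open(title_path, "w", encoding="utf-8") as ft, \
         open(content_path, "w", encoding="utf-8") as fc:
        for label, title, content in split:
            ft.write(f"{label}\t{title}\n")
            fc.write(f"{label}\t{content}\n")
```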
Step S2: load the vocabulary required by the method, the parameters of the pre-training model, and the BERT pre-training model
The pre-training model is the BERT model. The inputs of the network model are the news headline and the news long text content of each news text; mask and truncation operations are applied to the headline and the content, and the word vectors of the headline and the word vectors of the content are output, as in the loading sketch below.
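The sketch below shows one plausible way to load the vocabulary and pre-trained weights and to produce headline and content word vectors with masking and truncation, using the Hugging Face transformers library and the bert-base-chinese checkpoint. The description names neither the library nor the checkpoint, so both are assumptions.

```python
from transformers import BertModel, BertTokenizer

# Checkpoint name is an assumption; the description only says "BERT".
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # vocabulary
bert = BertModel.from_pretrained("bert-base-chinese")           # pre-trained weights

def encode(text, max_len):
    """Pad/truncate to a fixed length and return BERT word vectors."""
    # Padding produces the attention mask; truncation caps the length.
    batch = tokenizer(text, padding="max_length", truncation=True,
                      max_length=max_len, return_tensors="pt")
    out = bert(**batch)
    return out.last_hidden_state          # (1, max_len, 768) word vectors

title_vecs = encode("某新闻标题", max_len=32)        # headline word vectors
content_vecs = encode("某新闻正文……", max_len=256)   # content word vectors
```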
Step S3: training a BERT-based news text classification model by using a news headline training set, and training a BERT-and TextRCNN-based news text classification model by using a news long text content training set;
the method comprises the steps of connecting a bidirectional long-time and short-time memory neural network with a BERT, inputting word vectors serving as contents into a bidirectional long-time and short-time memory neural network model, processing the word vectors through the network model to obtain feature word vectors based on contexts, splicing initial word vectors and trained feature word vectors based on the contexts, activating the word vectors by using a relu function, then performing maximum pooling operation on the convolutional neural network to obtain local feature word vectors, compressing dimensionality of data by using a sequence function to obtain one-dimensional vectors, and transmitting the one-dimensional vectors into a full connection layer to obtain vector representation of the contents. After the word vectors of the titles are transmitted into the full-connection layer, the vector representation of the titles is obtained, and the vector representation dimensions of the titles and the content are the same and are consistent with the number of the classification labels. And splicing the results obtained by the title and the content, and obtaining a final representation through a SoftMax function.
Step S4: verifying the trained news text classification model based on the pre-training by using a test set, and calculating the accuracy, the recall rate and the F1 value of the news text classification model based on the pre-training;
the above embodiments are only for illustrating the invention and not for limiting the same, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the invention, so that all equivalent technical solutions also belong to the scope of the invention, and the scope of the invention should be defined by the claims.
Example 1: experimental results of the invention on the self-collected news text data set
The data set consists of ninety thousand long news texts in 9 categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment. The content of each news item is longer than 200 characters. The data were crawled from websites such as Pink Silk Screen, Minnan Web, Game Grand View, and Upper Square Web, and are used for news text classification.
Accuracy, recall rate, and F1 are selected as the evaluation criteria; they are calculated as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of samples of a class correctly predicted as that class, TN is the number of samples correctly identified as not belonging to the class, FP is the number of samples wrongly classified into the class, and FN is the number of samples that belong to the class but are classified into other classes.
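Under the standard definitions above, the metrics can be computed as in the sketch below; macro averaging over the nine classes is an assumption, since the description does not state the averaging scheme.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels for illustration only; real evaluation would use the test set.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, average="macro")  # averaging: assumption
f1 = f1_score(y_true, y_pred, average="macro")
print(f"accuracy={acc:.4f} recall={rec:.4f} f1={f1:.4f}")
```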
TABLE 1 Parameters of each part of the content model (BERT_RCNN) in the experiment [table provided as an image in the original]
TABLE 2 Parameters of each part of the title model in the experiment [table provided as an image in the original]
In Example 1, the model of the invention was applied to the data set for test validation; accuracy, recall, and F1 were selected as the evaluation indices and compared with 3 classical pre-training-based text classification methods: BERT (title) + BERT (content), BERT (title) + BERT_CNN (content), and BERT (title) + BERT_RNN (content). All comparison methods were run under their respective optimal parameters; the experimental comparison results are shown in Table 3.
TABLE 3 Experimental comparison results

  Method                               Accuracy   Recall    F1
  BERT (title) + BERT (content)        94.13%     94.25%    94.14%
  BERT (title) + BERT_CNN (content)    93.64%     93.80%    93.69%
  BERT (title) + BERT_RNN (content)    91.84%     92.07%    91.82%
  BERT (title) + BERT_RCNN (content)   94.76%     94.86%    94.76%
From the experimental results in Table 3, the compared BERT (title) + BERT (content) method reaches an accuracy of 94.13%, a recall of 94.25%, and an F1 value of 94.14%. The compared BERT (title) + BERT_CNN (content) method reaches an accuracy of 93.64%, a recall of 93.80%, and an F1 value of 93.69%, which indicates that the BERT_CNN method extracts only local semantic features from the news text data and extracts features from sequential text data poorly. The compared BERT (title) + BERT_RNN (content) method reaches an accuracy of 91.84%, a recall of 92.07%, and an F1 value of 91.82%, which indicates that the LSTM cannot extract local spatial or short-time structural relationships. The accuracy, recall, and F1 value of the BERT (title) + BERT_RCNN (content) method used in the invention are 94.76%, 94.86%, and 94.76%, respectively. The experimental results show that BERT_RCNN not only extracts the local features of the text but also analyzes the sentence-level semantic features related to the text context and supplements the interactive information between the local and the whole text, so the classification accuracy of the text model improves to a certain extent.

Claims (3)

1. A deep learning method based on pre-training and combining news headlines and news long text contents is characterized by comprising the following steps:
step S1: data preprocessing: the crawled news text is cleaned, and only news whose content text exceeds 200 characters is kept; the data set contains ninety thousand news samples in nine categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment;
the data are stored in label+title and label+content form, and the data set is divided into 80% training set, 10% validation set, and 10% test set;
step S2: loading a vocabulary required by the method, parameters of a pre-training model and a BERT pre-training model;
step S3: training a BERT-based news text classification model using the news headline training set, and training a news text classification model based on BERT and TextRCNN using the news long text content training set;
step S4: verifying the trained pre-training-based news text classification model with the test set, and calculating its accuracy, recall rate, and F1 value.
2. The pre-training-based deep learning method combining news headlines and news long-text content as claimed in claim 1, wherein: in step S2, the pre-training model is BERT; the inputs of the network model are the news headline and the news long text content of the news text, mask and truncation operations are performed on the headline and the content, and the word vectors of the headline and the word vectors of the content are output.
3. The pre-training-based deep learning method combining news headlines and news long-text content as claimed in claim 1, wherein: in step S3, a bidirectional long short-term memory neural network is connected to BERT; the content word vectors are input into the bidirectional long short-term memory neural network model, which processes them into context-based feature word vectors; the initial word vectors are spliced with the trained context-based feature word vectors and activated with the ReLU function; the max-pooling operation of the convolutional neural network is then used to obtain local feature word vectors, and the dimensionality of the data is compressed with a squeeze function into a one-dimensional vector; the one-dimensional vector is passed into the fully connected layer to obtain the vector representation of the content, and the word vector of the title is passed into the fully connected layer to obtain the vector representation of the title; the vector representations of the title and the content have the same dimension, which equals the number of classification labels; the results obtained for the title and the content are spliced, and the final representation is obtained through a SoftMax function.
CN202110645654.XA 2021-06-10 2021-06-10 Deep learning method combining news headlines and news long text contents based on pre-training Withdrawn CN113220890A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110645654.XA CN113220890A (en) 2021-06-10 2021-06-10 Deep learning method combining news headlines and news long text contents based on pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110645654.XA CN113220890A (en) 2021-06-10 2021-06-10 Deep learning method combining news headlines and news long text contents based on pre-training

Publications (1)

Publication Number Publication Date
CN113220890A true CN113220890A (en) 2021-08-06

Family

ID=77083520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110645654.XA Withdrawn CN113220890A (en) 2021-06-10 2021-06-10 Deep learning method combining news headlines and news long text contents based on pre-training

Country Status (1)

Country Link
CN (1) CN113220890A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743081A (en) * 2021-09-03 2021-12-03 西安邮电大学 Recommendation method of technical service information
CN113743081B (en) * 2021-09-03 2023-08-01 西安邮电大学 Recommendation method of technical service information
CN113821637A (en) * 2021-09-07 2021-12-21 北京微播易科技股份有限公司 Long text classification method and device, computer equipment and readable storage medium
CN113987171A (en) * 2021-10-20 2022-01-28 绍兴达道生涯教育信息咨询有限公司 News text classification method and system based on pre-training model variation
CN115269854A (en) * 2022-08-30 2022-11-01 重庆理工大学 False news detection method based on theme and structure perception neural network
CN115269854B (en) * 2022-08-30 2024-02-02 重庆理工大学 False news detection method based on theme and structure perception neural network
CN116628171A (en) * 2023-07-24 2023-08-22 北京惠每云科技有限公司 Medical record retrieval method and system based on pre-training language model
CN116628171B (en) * 2023-07-24 2023-10-20 北京惠每云科技有限公司 Medical record retrieval method and system based on pre-training language model
CN117743585A (en) * 2024-02-20 2024-03-22 广东海洋大学 News text classification method
CN117743585B (en) * 2024-02-20 2024-04-26 广东海洋大学 News text classification method

Similar Documents

Publication Publication Date Title
CN108829757B (en) Intelligent service method, server and storage medium for chat robot
CN113220890A (en) Deep learning method combining news headlines and news long text contents based on pre-training
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN107943784B (en) Relationship extraction method based on generation of countermeasure network
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN110287323B (en) Target-oriented emotion classification method
CN111143563A (en) Text classification method based on integration of BERT, LSTM and CNN
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN113239663B (en) Multi-meaning word Chinese entity relation identification method based on Hopkinson
CN112199503B (en) Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
CN113705315A (en) Video processing method, device, equipment and storage medium
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN111680529A (en) Machine translation algorithm and device based on layer aggregation
CN113408619B (en) Language model pre-training method and device
Anjum et al. Exploring Humor in Natural Language Processing: A Comprehensive Review of JOKER Tasks at CLEF Symposium 2023.
CN117610567A (en) Named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF
CN112329441A (en) Legal document reading model and construction method
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210806

WW01 Invention patent application withdrawn after publication