CN113220890A - Deep learning method combining news headlines and news long text contents based on pre-training - Google Patents
Info
- Publication number
- CN113220890A (application CN202110645654.XA)
- Authority
- CN
- China
- Prior art keywords
- news
- training
- text
- model
- long
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a deep learning method, based on pre-training, that combines news headlines and long news text contents, with the aim of improving the prediction accuracy of news text classification. The method mainly comprises the following steps: preprocessing the data; loading the vocabulary, model parameters, and pre-trained model; training a news text classification model based on the pre-trained model using news headlines; training a news text classification model combining the pre-trained model with a classification algorithm using long news text contents; and verifying the trained pre-training-based news text classification model on a test set. Text classification methods based on traditional deep learning predict Chinese news texts poorly. To address this, a pre-training-based deep learning method combining news headlines and long news text contents is proposed on a self-collected data set, which can effectively improve the prediction accuracy for Chinese news texts.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a deep learning method combining news headlines and news long text contents based on pre-training.
Background
Text classification is a popular problem in natural language processing. With the continued economic development of China, information is growing explosively. News texts are diverse and complex: their contents overlap and resemble one another, similarity between categories is high, and category boundaries are unclear. Rapidly classifying massive volumes of news text is therefore of great significance.
In recent years, research on Chinese text classification has developed rapidly. Zhou et al. combined a convolutional neural network with a long short-term memory network to propose the C-LSTM (A C-LSTM Neural Network for Text Classification) text classification algorithm. C-LSTM uses the convolutional neural network to extract high-level phrase representations, which are then fed into the long short-term memory network to obtain sentence representations; C-LSTM can thus capture both local phrase features and sentence-level semantic information. Lai et al. combined a convolutional neural network with a recurrent neural network to propose the TextRCNN (Recurrent Convolutional Neural Networks for Text Classification) text classification model. TextRCNN extends C-LSTM by using a bidirectional long short-term memory network to obtain context information: the hidden-layer outputs of the bidirectional LSTM are concatenated with the word vectors, the concatenated vectors are nonlinearly mapped to a lower dimension, and the value at each position of the vector is max-pooled over all time steps to obtain the final feature vector.
Classical algorithms for long texts include TextCNN (Convolutional Neural Networks for Sentence Classification) and TextRNN (Recurrent Neural Networks for Text Classification with Multi-Task Learning). These algorithms are optimized for the high dimensionality, word order, and running time of text classification. Convolutional neural networks can extract local spatial or short-term structural relationships; for sentence modeling they are good at extracting n-gram features at different positions in a sentence and can learn both short-range and long-range relationships through pooling operations, but their feature extraction capability for sequential data is weak. Recurrent neural networks, conversely, handle sequential data well but cannot extract local spatial or short-term structural relationships.
At present, traditional deep learning models still cannot achieve high accuracy in classifying news texts: semantic understanding in text classification is insufficient, and the deep features of long news texts are difficult to obtain. To address these shortcomings, we propose a deep learning method, based on pre-training, that combines news headlines and long news text contents.
Disclosure of Invention
The method loads pre-trained language model parameters and builds on the TextRCNN text classification model. A convolutional neural network is used to identify key phrases in the text, while the recurrent part treats the text as a sequence of words and captures the dependency relationships and structure among those words. By fusing these models, the method extracts local features of the text, analyzes sentence-level semantic features related to the text context, and supplements the local and global interactive information of the text, which considerably improves the classification accuracy of the text model. The method comprises the following steps.
Step S1: data preprocessing. Clean the crawled news texts, store the data in label-plus-title and label-plus-content form, and split the data set into 80% training set, 10% validation set, and 10% test set.
Step S2: load the vocabulary required by the method, the parameters of the pre-trained model, and the BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding) pre-trained model.
Step S3: the BERT-based news text classification model is trained using a news headline training set, and the BERT-and TextRCNN-based news text classification model is trained using a news long-text content training set.
Step S4: verify the trained pre-training-based news text classification model on the test set, and compute the accuracy, recall, and F1 value of the pre-training-based news text classification model.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 compares, on the self-collected news text data set, different content models combined with the same title model.
Detailed Description
The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention. The invention will now be described in further detail by means of the figures and examples.
The embodiments of the invention assume that the data set is a news text data set collected by the authors.
Fig. 1 is a schematic flow chart of a text classification model based on pre-training according to an embodiment of the present invention. As shown in fig. 1, the present embodiment mainly includes the following steps:
step S1: data pre-processing
Clean the crawled news texts and keep only news whose content text exceeds 200 characters. The data set contains ninety thousand news samples divided into 9 categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment. Store the data in label-plus-title and label-plus-content form, and split the data set into 80% training set, 10% validation set, and 10% test set;
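The cleaning and 80/10/10 split described above can be sketched in Python. This is a minimal illustration under assumptions: the sample dictionary fields, the helper name `clean_and_split`, and the fixed random seed are not specified by the patent.

```python
import random

# The 9 categories named in the description
LABELS = ["finance", "real estate", "education", "science and technology",
          "military", "automobile", "sports", "games", "entertainment"]

def clean_and_split(samples, min_len=200, seed=0):
    """Keep news whose content exceeds min_len characters, store
    (label, title) and (label, content) pairs, and split 80/10/10."""
    kept = [s for s in samples if len(s["content"]) > min_len]
    title_data = [(s["label"], s["title"]) for s in kept]      # label + title
    content_data = [(s["label"], s["content"]) for s in kept]  # label + content
    rng = random.Random(seed)
    idx = list(range(len(kept)))
    rng.shuffle(idx)
    n_train = int(0.8 * len(idx))
    n_val = int(0.1 * len(idx))
    split = {
        "train": idx[:n_train],
        "val": idx[n_train:n_train + n_val],
        "test": idx[n_train + n_val:],
    }
    return title_data, content_data, split
```

The same index split is used for both the title and content views of each sample, so the two branches of the model are trained on identical partitions.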
step S2: vocabulary required by loading method, parameters of pre-training model and BERT pre-training model
The pre-trained model is BERT. The inputs of the network model are the news headlines and the long text contents of the news texts; mask and truncation operations are applied to the headlines and the contents, and the model outputs word vectors for the headlines and word vectors for the contents.
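The mask and truncation operations can be illustrated with a BERT-style input-preparation sketch. The special-token ids used here (101 for [CLS], 102 for [SEP], 0 for [PAD]) follow the standard BERT vocabulary convention and are assumptions for illustration; the patent does not give them.

```python
def prepare_input(token_ids, max_len):
    """Truncate a token-id sequence to fit max_len (reserving room for
    [CLS] and [SEP]), pad to max_len, and build the attention mask
    (1 = real token, 0 = padding), as in BERT-style preprocessing."""
    CLS, SEP, PAD = 101, 102, 0  # conventional ids in the standard BERT vocab
    body = token_ids[: max_len - 2]   # truncation
    ids = [CLS] + body + [SEP]
    mask = [1] * len(ids)
    pad = max_len - len(ids)
    ids += [PAD] * pad                # padding to fixed length
    mask += [0] * pad                 # mask out the padding positions
    return ids, mask
```

Both the headline and the content are run through the same routine, with a larger `max_len` for the long content text.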
Step S3: training a BERT-based news text classification model by using a news headline training set, and training a BERT-and TextRCNN-based news text classification model by using a news long text content training set;
the method comprises the steps of connecting a bidirectional long-time and short-time memory neural network with a BERT, inputting word vectors serving as contents into a bidirectional long-time and short-time memory neural network model, processing the word vectors through the network model to obtain feature word vectors based on contexts, splicing initial word vectors and trained feature word vectors based on the contexts, activating the word vectors by using a relu function, then performing maximum pooling operation on the convolutional neural network to obtain local feature word vectors, compressing dimensionality of data by using a sequence function to obtain one-dimensional vectors, and transmitting the one-dimensional vectors into a full connection layer to obtain vector representation of the contents. After the word vectors of the titles are transmitted into the full-connection layer, the vector representation of the titles is obtained, and the vector representation dimensions of the titles and the content are the same and are consistent with the number of the classification labels. And splicing the results obtained by the title and the content, and obtaining a final representation through a SoftMax function.
Step S4: verifying the trained news text classification model based on the pre-training by using a test set, and calculating the accuracy, the recall rate and the F1 value of the news text classification model based on the pre-training;
the above embodiments are only for illustrating the invention and not for limiting the same, and those skilled in the relevant art can make various changes and modifications without departing from the spirit and scope of the invention, so that all equivalent technical solutions also belong to the scope of the invention, and the scope of the invention should be defined by the claims.
Example 1 experimental results of the invention on a self-collected news text dataset
The data set consists of ninety thousand long news texts divided into 9 categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment. The content of each news item is longer than 200 characters. The data were crawled from websites rendered in the translation as Pink Silk Screen, Minnan Web, Game Grand View, and Upper Square Web, and are used for news text classification.
Accuracy, recall, and F1 are selected as evaluation metrics; their calculation formulas are as follows:
where TP is the number of samples of a class correctly predicted as that class, TN is the number of samples correctly identified as not belonging to the class, FP is the number of samples misclassified into the class, and FN is the number of samples that belong to the class but are classified into other classes.
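The formulas themselves are not reproduced in this rendering. Assuming the conventional per-class definitions consistent with the TP/FP/FN notation above (with the reported per-class "accuracy" computed as precision from TP and FP), they are:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}
           {\mathrm{Precision} + \mathrm{Recall}}
```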
TABLE 1 Parameters of the parts of the content model (BERT_RCNN) in the experiment
TABLE 2 Parameters of the parts of the title model in the experiment
Example 1: the model of the invention was tested and validated on the data set. Accuracy, recall, and F1 were selected as evaluation metrics and compared against 3 classical pre-training-based text classification methods: BERT(title)+BERT(content), BERT(title)+BERT_CNN(content), and BERT(title)+BERT_RNN(content). All comparison methods were run with their respective optimal parameters; the experimental comparison results are shown in Table 3.
TABLE 3 experimental comparison results
From the experimental results in Table 3, the compared BERT(title)+BERT(content) method achieved an accuracy of 94.13%, a recall of 94.25%, and an F1 value of 94.14%. The compared BERT(title)+BERT_CNN(content) method achieved an accuracy of 93.64%, a recall of 93.80%, and an F1 value of 93.69%, indicating that the BERT_CNN method extracts only local semantic features from the news text data and extracts features from sequential text data poorly. The compared BERT(title)+BERT_RNN(content) method achieved an accuracy of 91.84%, a recall of 92.07%, and an F1 value of 91.82%, indicating that the LSTM cannot extract local spatial or short-term structural features. The BERT(title)+BERT_RCNN(content) method used in the invention achieved accuracy, recall, and F1 values of 94.76%, 94.86%, and 94.76%, respectively. The experimental results show that BERT_RCNN not only extracts local features of the text but also analyzes sentence-level semantic features related to the text context and supplements the local and global interactive information of the text, thereby improving classification accuracy to a certain extent.
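The accuracy, recall, and F1 evaluation used for these comparisons can be sketched as a per-class computation from the TP/FP/FN counts, macro-averaged over the 9 classes; the function name and the macro-averaging choice are assumptions for illustration.

```python
def prf1(y_true, y_pred):
    """Per-class precision, recall, and F1 from TP, FP, FN counts,
    macro-averaged over the classes that appear in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Running this over the test-set predictions of each compared model yields the three columns reported in Table 3.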
Claims (3)
1. A deep learning method based on pre-training and combining news headlines and news long text contents is characterized by comprising the following steps:
step S1: data preprocessing: cleaning the crawled news texts and keeping only news whose content text exceeds 200 characters, the data set comprising ninety thousand news samples divided into 9 categories: finance, real estate, education, science and technology, military, automobile, sports, games, and entertainment;
storing the data according to the forms of labels plus titles and labels plus contents, and dividing a data set according to the proportion of 80% of a training set, 10% of a verification set and 10% of a test set;
step S2: loading a vocabulary required by the method, parameters of a pre-training model and a BERT pre-training model;
step S3: training a BERT-based news text classification model by using a news headline training set, and training a BERT-and RCNN-based news text classification model by using a news long text content training set;
step S4: and verifying the trained news text classification model based on the pre-training by using the test set, and calculating the accuracy, the recall rate and the F1 value of the news text classification model based on the pre-training.
2. The pre-training-based deep learning method combining news headlines and news long-text content as claimed in claim 1, wherein: in step S2, the pre-trained model is BERT; the inputs of the network model are the news headlines and the long text contents of the news texts; mask and truncation operations are applied to the headlines and the contents, and word vectors of the headlines and word vectors of the contents are output.
3. The pre-training-based deep learning method combining news headlines and news long-text content as claimed in claim 1, wherein: in step S3, a bidirectional long short-term memory neural network is connected after BERT; the content word vectors are input into the bidirectional long short-term memory neural network model, which processes them to obtain context-based feature word vectors; the initial word vectors are concatenated with the trained context-based feature word vectors and activated with a ReLU function; a max-pooling operation, as in a convolutional neural network, is then used to obtain local feature word vectors, and a squeeze operation compresses the data dimensions into a one-dimensional vector, which is passed into the fully connected layer to obtain the vector representation of the content; the headline word vectors are passed into the fully connected layer to obtain the vector representation of the headline, and the headline and content vector representations have the same dimension, equal to the number of classification labels; the headline and content results are concatenated, and the final representation is obtained through a SoftMax function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110645654.XA CN113220890A (en) | 2021-06-10 | 2021-06-10 | Deep learning method combining news headlines and news long text contents based on pre-training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110645654.XA CN113220890A (en) | 2021-06-10 | 2021-06-10 | Deep learning method combining news headlines and news long text contents based on pre-training |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113220890A true CN113220890A (en) | 2021-08-06 |
Family
ID=77083520
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110645654.XA Withdrawn CN113220890A (en) | 2021-06-10 | 2021-06-10 | Deep learning method combining news headlines and news long text contents based on pre-training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220890A (en) |
- 2021-06-10: application CN202110645654.XA filed, published as CN113220890A, status not active (withdrawn)
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113743081A (en) * | 2021-09-03 | 2021-12-03 | 西安邮电大学 | Recommendation method of technical service information |
CN113743081B (en) * | 2021-09-03 | 2023-08-01 | 西安邮电大学 | Recommendation method of technical service information |
CN113821637A (en) * | 2021-09-07 | 2021-12-21 | 北京微播易科技股份有限公司 | Long text classification method and device, computer equipment and readable storage medium |
CN113987171A (en) * | 2021-10-20 | 2022-01-28 | 绍兴达道生涯教育信息咨询有限公司 | News text classification method and system based on pre-training model variation |
CN115269854A (en) * | 2022-08-30 | 2022-11-01 | 重庆理工大学 | False news detection method based on theme and structure perception neural network |
CN115269854B (en) * | 2022-08-30 | 2024-02-02 | 重庆理工大学 | False news detection method based on theme and structure perception neural network |
CN116628171A (en) * | 2023-07-24 | 2023-08-22 | 北京惠每云科技有限公司 | Medical record retrieval method and system based on pre-training language model |
CN116628171B (en) * | 2023-07-24 | 2023-10-20 | 北京惠每云科技有限公司 | Medical record retrieval method and system based on pre-training language model |
CN117743585A (en) * | 2024-02-20 | 2024-03-22 | 广东海洋大学 | News text classification method |
CN117743585B (en) * | 2024-02-20 | 2024-04-26 | 广东海洋大学 | News text classification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108829757B (en) | Intelligent service method, server and storage medium for chat robot | |
CN113220890A (en) | Deep learning method combining news headlines and news long text contents based on pre-training | |
CN107798140B (en) | Dialog system construction method, semantic controlled response method and device | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN107943784B (en) | Relationship extraction method based on generation of countermeasure network | |
CN112732916B (en) | BERT-based multi-feature fusion fuzzy text classification system | |
CN110321563B (en) | Text emotion analysis method based on hybrid supervision model | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN111143563A (en) | Text classification method based on integration of BERT, LSTM and CNN | |
CN109271524B (en) | Entity linking method in knowledge base question-answering system | |
CN111414481A (en) | Chinese semantic matching method based on pinyin and BERT embedding | |
CN113239663B (en) | Multi-meaning word Chinese entity relation identification method based on Hopkinson | |
CN112199503B (en) | Feature-enhanced unbalanced Bi-LSTM-based Chinese text classification method | |
CN114676255A (en) | Text processing method, device, equipment, storage medium and computer program product | |
CN114020906A (en) | Chinese medical text information matching method and system based on twin neural network | |
CN117236338B (en) | Named entity recognition model of dense entity text and training method thereof | |
CN113705315A (en) | Video processing method, device, equipment and storage medium | |
CN111339772B (en) | Russian text emotion analysis method, electronic device and storage medium | |
CN115759119A (en) | Financial text emotion analysis method, system, medium and equipment | |
CN111680529A (en) | Machine translation algorithm and device based on layer aggregation | |
CN113408619B (en) | Language model pre-training method and device | |
Anjum et al. | Exploring Humor in Natural Language Processing: A Comprehensive Review of JOKER Tasks at CLEF Symposium 2023. | |
CN117610567A (en) | Named entity recognition algorithm based on ERNIE3.0_Att_IDCNN_BiGRU_CRF | |
CN112329441A (en) | Legal document reading model and construction method | |
CN112989839A (en) | Keyword feature-based intent recognition method and system embedded in language model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20210806 |