CN117743585B - News text classification method

News text classification method

Info

Publication number: CN117743585B
Authority: CN (China)
Prior art keywords: content, text, standard, text content, association weight
Legal status: Active (granted)
Application number: CN202410189847.2A
Other languages: Chinese (zh)
Other versions: CN117743585A
Inventors: 冯卓文, 王观承, 王颢静, 徐广珺
Current assignee: Guangdong Ocean University
Original assignee: Guangdong Ocean University
Priority date / Filing date: 2024-02-20
Application filed by Guangdong Ocean University
Publication of CN117743585A: 2024-03-22
Application granted and publication of CN117743585B: 2024-04-26

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a news text classification method, which belongs to the technical field of data processing and comprises the following steps: S1, generating a first content association weight between the title and the first text content according to the title and the first text content of a news text; S2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text; S3, generating a third content association weight according to the remaining text content of the news text other than the first text content and the last text content; S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text. The classification process grasps the overall content structure of the news text and fully considers the influence of keywords, so the classification error is small and the accuracy is high.

Description

News text classification method
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a news text classification method.
Background
Currently, news reading products generally organize and distribute news according to the field to which the content belongs, for example by hot topics, by domestic and international news, or by science, military, entertainment, and so on. At present this classification and distribution is usually performed manually, which not only wastes manpower but also makes the classification result strongly dependent on individual subjective judgment, so the result is not accurate enough.
Disclosure of Invention
To solve the above problems, the invention provides a news text classification method.
The technical scheme of the invention is as follows: a news text classification method comprises the following steps:
S1, generating a first content association weight between a title and first text content according to the title and the first text content of a news text;
s2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text;
s3, generating a third content association weight according to the rest text contents except the first text content and the last text content in the news text;
S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text.
Further, S1 comprises the following sub-steps:
S11, eliminating stop words from the title and the first text content of the news text to obtain a standard title and a standard first text content, respectively;
S12, extracting, from the standard first text content, the words that also appear in the standard title to generate a first training word sequence;
S13, determining the first segment occupation ratio according to the first training word sequence;
and S14, generating a first content association weight between the title and the first text content according to the first segment occupation ratio.
The beneficial effects of the above further scheme are: in the invention, the first paragraph of a news text generally gives a brief summary of the news event and therefore has high reference value, and the news headline is a condensed summary of the entire news text; the invention therefore constructs a first content association weight to reflect the degree of association between the news headline and the content of the first paragraph. The first content association weight draws on the proportion of words in the first text content that also appear in the headline, together with the word vectors of those shared words, so it can effectively reflect the degree of semantic association.
Further, in S13, the first segment occupation ratio z_f is calculated from P, M and J (the formula is not reproduced in this text), wherein P represents the number of words of the first training word sequence, M represents the number of words of the standard first text content, and J represents the number of words of the standard title.
Further, in S14, the first content association weight σ_1 between the title and the first text content is calculated from P, C_p, z_f and X_p (the formula is not reproduced in this text), wherein P represents the number of words in the first training word sequence, C_p represents the word frequency of the p-th word of the first training word sequence in the standard first text content, z_f represents the first segment occupation ratio, and X_p represents the word vector of the p-th word in the first training word sequence.
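For illustration only, a minimal Python sketch of S11-S14 is given below. It assumes the title and first paragraph are already segmented into word lists (for example by a Chinese word segmenter), and that a stop-word set and pretrained word vectors are supplied by the caller. Because the original formulas are not reproduced in this text, the expressions used for z_f and σ_1 are assumed placeholders built only from the quantities named above (P, M, J, C_p, X_p); they are not the patented equations. The function returns both z_f and σ_1 so that the later S2 sketch can reuse them.

    import numpy as np

    def first_content_association_weight(title_tokens, first_tokens, stop_words, word_vectors):
        # S11: remove stop words to obtain the standard title and standard first text content
        std_title = [w for w in title_tokens if w not in stop_words]
        std_first = [w for w in first_tokens if w not in stop_words]

        # S12: first training word sequence = distinct words of the first paragraph
        # that also appear in the standard title
        shared = [w for w in dict.fromkeys(std_first) if w in set(std_title)]

        P, M, J = len(shared), len(std_first), len(std_title)
        if P == 0 or M == 0 or J == 0:
            return 0.0, 0.0

        # S13: first segment occupation ratio z_f -- assumed form: mean coverage of the
        # standard first content and of the standard title by the shared words
        z_f = 0.5 * (P / M + P / J)

        # S14: first content association weight sigma_1 -- assumed form: z_f scaled by the
        # frequency-weighted mean vector norm of the shared words (C_p, X_p as defined above;
        # assumes every word has an entry in word_vectors)
        total = 0.0
        for w in shared:
            C_p = std_first.count(w)              # word frequency in the standard first content
            X_p = np.asarray(word_vectors[w])     # word vector of the shared word
            total += C_p * float(np.linalg.norm(X_p))
        sigma_1 = z_f * total / P
        return z_f, sigma_1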
Further, S2 comprises the following sub-steps:
s21, eliminating stop words of the end text content to obtain standard end text content;
S22, extracting, from the standard end text content, the words that also appear in the standard title to generate a second training word sequence;
s23, determining the end segment occupation ratio according to the second training word sequence;
s24, calculating a content association tag value between the standard first text content and the standard last text content according to the first segment occupation ratio, the last segment occupation ratio and the first content association weight;
s26, generating a second content association weight between the title and the final text content according to the content association tag value between the standard first text content and the standard final text content.
The beneficial effects of the above further scheme are: in the invention, considering that a semantic association may exist between the first and last paragraphs of a news text and that the last paragraph may itself be a condensed summary of the news content, the invention combines parameters such as the end segment occupation ratio and the first content association weight to construct the second content association weight. When the end segment occupation ratio is more than twice the first segment occupation ratio, the first paragraph is treated as less important and the influence of the first content association weight on the second content association weight is appropriately reduced, so that the second content association weight reflects the degree of association between the last paragraph and the news headline.
Further, in S23, the end segment occupation ratio z_l is calculated from Q, N and J (the formula is not reproduced in this text), wherein Q represents the number of words of the second training word sequence, N represents the number of words of the standard end text content, and J represents the number of words of the standard title.
Further, in S24, the content association tag value b between the standard first text content and the standard end text content is calculated from z_f, z_l and σ_1 (the formula is not reproduced in this text), wherein z_f denotes the first segment occupation ratio, z_l denotes the end segment occupation ratio, and σ_1 denotes the first content association weight between the title and the first text content.
Further, in S26, the second content association weight σ_2 between the title and the end text content is calculated from Q, C_q, z_l, X_q and b (the formula is not reproduced in this text), wherein Q represents the number of words in the second training word sequence, C_q represents the word frequency of the q-th word of the second training word sequence in the standard end text content, z_l represents the end segment occupation ratio, X_q represents the word vector of the q-th word in the second training word sequence, and b represents the content association tag value between the standard first text content and the standard end text content.
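Continuing the sketch above under the same assumptions (tokenised input, caller-supplied stop words and word vectors), S21-S26 can be outlined as follows. The expressions for z_l, b and σ_2 are again placeholders using only the quantities named in the text (Q, N, J, C_q, X_q, z_f, z_l, σ_1); in particular, b simply halves σ_1 when z_l exceeds twice z_f, which is one possible reading of the "appropriately reduced" rule described above.

    import numpy as np

    def second_content_association_weight(title_tokens, last_tokens, stop_words, word_vectors,
                                           z_f, sigma_1):
        # S21/S22: standard end text content and the second training word sequence
        std_title = [w for w in title_tokens if w not in stop_words]
        std_last = [w for w in last_tokens if w not in stop_words]
        shared = [w for w in dict.fromkeys(std_last) if w in set(std_title)]

        Q, N, J = len(shared), len(std_last), len(std_title)
        if Q == 0 or N == 0 or J == 0:
            return 0.0

        # S23: end segment occupation ratio z_l -- assumed form, mirroring z_f
        z_l = 0.5 * (Q / N + Q / J)

        # S24: content association tag value b -- when z_l > 2 * z_f the first paragraph is
        # treated as less important, so sigma_1's influence is reduced (halved by assumption)
        b = 0.5 * sigma_1 if z_l > 2.0 * z_f else sigma_1

        # S26: second content association weight sigma_2 -- assumed form analogous to sigma_1,
        # averaged with the tag value b
        total = 0.0
        for w in shared:
            C_q = std_last.count(w)               # word frequency in the standard end content
            X_q = np.asarray(word_vectors[w])     # word vector of the shared word
            total += C_q * float(np.linalg.norm(X_q))
        return 0.5 * (z_l * total / Q + b)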
Further, in S3, the specific method for generating the third content association weight is as follows: selecting the paragraph with the largest number of words among the remaining text content other than the first text content and the last text content as a secondary paragraph, extracting keywords of the secondary paragraph, and generating the third content association weight according to all keywords of the secondary paragraph and the standard title;
the third content association weight σ_3 is calculated from X_j, the word vector of the word with the highest word frequency in the secondary paragraph, exp(·) and J (the formula is not reproduced in this text), wherein X_j represents the word vector of the j-th word in the standard title, exp(·) represents the exponential function, and J represents the number of words of the standard title.
The beneficial effects of the above further scheme are: in the present invention, the paragraph with the most text content (i.e., the largest number of words) usually gives a detailed description of the news event, so the invention uses the word vector distribution of this paragraph to construct the third content association weight.
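A corresponding sketch of S3 is shown below. It selects the longest of the remaining paragraphs as the secondary paragraph and compares the standard-title word vectors with the vector of that paragraph's most frequent word. Since the σ_3 formula itself is not reproduced in this text, the aggregation used here (the mean of exp of the negative vector distance) is an assumption, chosen only because it combines the quantities the text names (X_j, the top-frequency word's vector, exp(·), J).

    from collections import Counter
    import numpy as np

    def third_content_association_weight(title_tokens, middle_paragraphs, stop_words, word_vectors):
        std_title = [w for w in title_tokens if w not in stop_words]
        J = len(std_title)
        if J == 0 or not middle_paragraphs:
            return 0.0

        # Secondary paragraph: the remaining paragraph (neither first nor last) with the most words
        secondary = max(middle_paragraphs, key=len)
        std_secondary = [w for w in secondary if w not in stop_words]
        if not std_secondary:
            return 0.0

        # Word vector of the word with the highest word frequency in the secondary paragraph
        top_word, _ = Counter(std_secondary).most_common(1)[0]
        x_top = np.asarray(word_vectors[top_word])

        # Assumed aggregation over the standard-title words: mean of exp(-||X_j - x_top||)
        sims = [np.exp(-np.linalg.norm(np.asarray(word_vectors[w]) - x_top)) for w in std_title]
        return float(np.mean(sims))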
Further, the objective function Loss of the text processing model is expressed in terms of O, ln(·), σ_1, σ_2, σ_3 and r (the expression is not reproduced in this text), where O represents the number of words in the news text, ln(·) represents the logarithmic function, σ_1 represents the first content association weight, σ_2 represents the second content association weight, σ_3 represents the third content association weight, and r represents a hyperparameter of the support vector machine.
The text processing model may employ a support vector machine.
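Since the custom objective Loss (built from O, ln(·), σ_1, σ_2, σ_3 and r) is not reproduced in this text, the sketch below simply feeds the three association weights to a standard soft-margin SVM from scikit-learn, treating r as the usual regularisation constant. It illustrates only the "three weights in, category out" structure of S4, not the patented objective.

    import numpy as np
    from sklearn.svm import SVC

    def train_text_processing_model(weight_triples, labels, r=1.0):
        """Fit an SVM on (sigma_1, sigma_2, sigma_3) feature vectors; labels are news categories."""
        X = np.asarray(list(weight_triples), dtype=float)
        y = np.asarray(labels)
        model = SVC(C=r, kernel="rbf")   # r used as the SVM regularisation hyperparameter (assumption)
        model.fit(X, y)
        return model

    def classify_news(model, sigma_1, sigma_2, sigma_3):
        # S4: input the three content association weights to obtain the classification result
        return model.predict(np.asarray([[sigma_1, sigma_2, sigma_3]]))[0]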
The beneficial effects of the invention are as follows: the invention discloses a news text classification method that sequentially performs text analysis on the important first and last paragraphs and on the paragraph with the most text content of a news text to obtain three content association weights representing semantic information, and then inputs the three weights into the constructed text processing model to obtain an accurate category for the news text. The classification process grasps the overall content structure of the news text and fully considers the influence of keywords, so the classification error is small and the accuracy is high.
Drawings
Fig. 1 is a flowchart of a news text classification method.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a news text classification method, which includes the following steps:
S1, generating a first content association weight between a title and first text content according to the title and the first text content of a news text;
s2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text;
s3, generating a third content association weight according to the rest text contents except the first text content and the last text content in the news text;
S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text.
In an embodiment of the present invention, S1 comprises the following sub-steps:
S11, eliminating stop words from the title and the first text content of the news text to obtain a standard title and a standard first text content, respectively;
S12, extracting, from the standard first text content, the words that also appear in the standard title to generate a first training word sequence;
S13, determining the first segment occupation ratio according to the first training word sequence;
and S14, generating a first content association weight between the title and the first text content according to the first segment occupation ratio.
In the invention, the first paragraph of a news text generally gives a brief summary of the news event and therefore has high reference value, and the news headline is a condensed summary of the entire news text; the invention therefore constructs a first content association weight to reflect the degree of association between the news headline and the content of the first paragraph. The first content association weight draws on the proportion of words in the first text content that also appear in the headline, together with the word vectors of those shared words, so it can effectively reflect the degree of semantic association.
In the embodiment of the present invention, in S13, the first segment occupation ratio z_f is calculated from P, M and J (the formula is not reproduced in this text), wherein P represents the number of words of the first training word sequence, M represents the number of words of the standard first text content, and J represents the number of words of the standard title.
In the embodiment of the present invention, in S14, the first content association weight σ_1 between the title and the first text content is calculated from P, C_p, z_f and X_p (the formula is not reproduced in this text), wherein P represents the number of words in the first training word sequence, C_p represents the word frequency of the p-th word of the first training word sequence in the standard first text content, z_f represents the first segment occupation ratio, and X_p represents the word vector of the p-th word in the first training word sequence.
In an embodiment of the present invention, S2 comprises the following sub-steps:
s21, eliminating stop words of the end text content to obtain standard end text content;
S22, extracting, from the standard end text content, the words that also appear in the standard title to generate a second training word sequence;
s23, determining the end segment occupation ratio according to the second training word sequence;
s24, calculating a content association tag value between the standard first text content and the standard last text content according to the first segment occupation ratio, the last segment occupation ratio and the first content association weight;
s26, generating a second content association weight between the title and the final text content according to the content association tag value between the standard first text content and the standard final text content.
In the invention, considering that a semantic association may exist between the first and last paragraphs of a news text and that the last paragraph may itself be a condensed summary of the news content, the invention combines parameters such as the end segment occupation ratio and the first content association weight to construct the second content association weight. When the end segment occupation ratio is more than twice the first segment occupation ratio, the first paragraph is treated as less important and the influence of the first content association weight on the second content association weight is appropriately reduced, so that the second content association weight reflects the degree of association between the last paragraph and the news headline.
In the embodiment of the present invention, in S23, the end segment occupation ratio z_l is calculated from Q, N and J (the formula is not reproduced in this text), wherein Q represents the number of words of the second training word sequence, N represents the number of words of the standard end text content, and J represents the number of words of the standard title.
In the embodiment of the present invention, in S24, the content association tag value b between the standard first text content and the standard end text content is calculated from z_f, z_l and σ_1 (the formula is not reproduced in this text), wherein z_f denotes the first segment occupation ratio, z_l denotes the end segment occupation ratio, and σ_1 denotes the first content association weight between the title and the first text content.
In the embodiment of the present invention, in S26, the second content association weight σ_2 between the title and the end text content is calculated from Q, C_q, z_l, X_q and b (the formula is not reproduced in this text), wherein Q represents the number of words in the second training word sequence, C_q represents the word frequency of the q-th word of the second training word sequence in the standard end text content, z_l represents the end segment occupation ratio, X_q represents the word vector of the q-th word in the second training word sequence, and b represents the content association tag value between the standard first text content and the standard end text content.
In the embodiment of the present invention, in S3, the specific method for generating the third content association weight is as follows: selecting the paragraph with the largest number of words among the remaining text content other than the first text content and the last text content as a secondary paragraph, extracting keywords of the secondary paragraph, and generating the third content association weight according to all keywords of the secondary paragraph and the standard title;
the third content association weight σ_3 is calculated from X_j, the word vector of the word with the highest word frequency in the secondary paragraph, exp(·) and J (the formula is not reproduced in this text), wherein X_j represents the word vector of the j-th word in the standard title, exp(·) represents the exponential function, and J represents the number of words of the standard title.
In the present invention, the paragraph with the most text content (i.e., the largest number of words) usually gives a detailed description of the news event, so the invention uses the word vector distribution of this paragraph to construct the third content association weight.
In the embodiment of the invention, the objective function Loss of the text processing model is expressed in terms of O, ln(·), σ_1, σ_2, σ_3 and r (the expression is not reproduced in this text), where O represents the number of words in the news text, ln(·) represents the logarithmic function, σ_1 represents the first content association weight, σ_2 represents the second content association weight, σ_3 represents the third content association weight, and r represents a hyperparameter of the support vector machine.
The text processing model may employ a support vector machine.
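Putting the sketches above together, a hypothetical end-to-end flow for one news text would look as follows. The title and paragraphs are assumed to be pre-segmented token lists, and stop_words, word_vectors, train_docs, train_labels and test_doc are placeholder names for data supplied by the caller; none of them come from the patent itself.

    def document_weights(title, paragraphs, stop_words, word_vectors):
        # Split the news text into first paragraph, last paragraph and the remaining paragraphs
        first, last, middle = paragraphs[0], paragraphs[-1], paragraphs[1:-1]
        z_f, s1 = first_content_association_weight(title, first, stop_words, word_vectors)
        s2 = second_content_association_weight(title, last, stop_words, word_vectors, z_f, s1)
        s3 = third_content_association_weight(title, middle, stop_words, word_vectors)
        return s1, s2, s3

    # Hypothetical labelled corpus: train_docs is a list of (title_tokens, paragraph_token_lists)
    features = [document_weights(t, ps, stop_words, word_vectors) for t, ps in train_docs]
    model = train_text_processing_model(features, train_labels, r=1.0)

    new_title, new_paragraphs = test_doc
    print(classify_news(model, *document_weights(new_title, new_paragraphs, stop_words, word_vectors)))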
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (7)

1. A news text classification method, comprising the steps of:
S1, generating a first content association weight between a title and first text content according to the title and the first text content of a news text;
s2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text;
s3, generating a third content association weight according to the rest text contents except the first text content and the last text content in the news text;
S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text;
The step S1 comprises the following substeps:
S11, eliminating stop words from the title and the first text content of the news text to obtain a standard title and a standard first text content, respectively;
S12, extracting, from the standard first text content, the words that also appear in the standard title to generate a first training word sequence;
S13, determining the first segment occupation ratio according to the first training word sequence;
s14, generating a first content association weight between the title and the first text content according to the first segment occupation ratio;
the step S2 comprises the following substeps:
s21, eliminating stop words of the end text content to obtain standard end text content;
S22, extracting, from the standard end text content, the words that also appear in the standard title to generate a second training word sequence;
s23, determining the end segment occupation ratio according to the second training word sequence;
s24, calculating a content association tag value between the standard first text content and the standard last text content according to the first segment occupation ratio, the last segment occupation ratio and the first content association weight;
s26, generating a second content association weight between the title and the final text content according to the content association tag value between the standard first text content and the standard final text content;
In the step S3, the specific method for generating the third content association weight is as follows: selecting the paragraph with the largest number of words among the remaining text content other than the first text content and the last text content as a secondary paragraph, extracting keywords of the secondary paragraph, and generating the third content association weight according to all keywords of the secondary paragraph and the standard title;
the third content association weight σ_3 is calculated from X_j, the word vector of the word with the highest word frequency in the secondary paragraph, exp(·) and J (the formula is not reproduced in this text), wherein X_j represents the word vector of the j-th word in the standard title, exp(·) represents the exponential function, and J represents the number of words of the standard title.
2. The news text classification method according to claim 1, wherein in S13, the first segment occupation ratio z_f is calculated from P, M and J (the formula is not reproduced in this text), wherein P represents the number of words of the first training word sequence, M represents the number of words of the standard first text content, and J represents the number of words of the standard title.
3. The news text classification method according to claim 1, wherein in S14, the first content association weight σ_1 between the title and the first text content is calculated from P, C_p, z_f and X_p (the formula is not reproduced in this text), wherein P represents the number of words in the first training word sequence, C_p represents the word frequency of the p-th word of the first training word sequence in the standard first text content, z_f represents the first segment occupation ratio, and X_p represents the word vector of the p-th word in the first training word sequence.
4. The news text classification method according to claim 1, wherein in S23, the end segment occupation ratio z_l is calculated from Q, N and J (the formula is not reproduced in this text), wherein Q represents the number of words of the second training word sequence, N represents the number of words of the standard end text content, and J represents the number of words of the standard title.
5. The news text classification method according to claim 1, wherein in S24, the content association tag value b between the standard first text content and the standard end text content is calculated from z_f, z_l and σ_1 (the formula is not reproduced in this text), wherein z_f denotes the first segment occupation ratio, z_l denotes the end segment occupation ratio, and σ_1 denotes the first content association weight between the title and the first text content.
6. The news text classification method according to claim 1, wherein in S26, the second content association weight σ_2 between the title and the end text content is calculated from Q, C_q, z_l, X_q and b (the formula is not reproduced in this text), wherein Q represents the number of words in the second training word sequence, C_q represents the word frequency of the q-th word of the second training word sequence in the standard end text content, z_l represents the end segment occupation ratio, X_q represents the word vector of the q-th word in the second training word sequence, and b represents the content association tag value between the standard first text content and the standard end text content.
7. The news text classification method of claim 1, wherein the objective function Loss of the text processing model is expressed in terms of O, ln(·), σ_1, σ_2, σ_3 and r (the expression is not reproduced in this text), where O represents the number of words in the news text, ln(·) represents the logarithmic function, σ_1 represents the first content association weight, σ_2 represents the second content association weight, σ_3 represents the third content association weight, and r represents a hyperparameter of the support vector machine.

Priority Applications (1)

Application Number: CN202410189847.2A
Priority Date: 2024-02-20
Filing Date: 2024-02-20
Title: News text classification method


Publications (2)

Publication Number Publication Date
CN117743585A (en) 2024-03-22
CN117743585B (en) 2024-04-26

Family

ID=90261320


Country Status (1)

Country Link
CN (1) CN117743585B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220890A (en) * 2021-06-10 2021-08-06 长春工业大学 Deep learning method combining news headlines and news long text contents based on pre-training
CN114969324A (en) * 2022-04-15 2022-08-30 河南大学 Chinese news title classification method based on subject word feature expansion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11455464B2 (en) * 2019-09-18 2022-09-27 Accenture Global Solutions Limited Document content classification and alteration


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A logical paragraph division method based on semantic features and its application; 朱振方; 刘培玉; 王金龙; Computer Science (计算机科学); 2009-12-15 (No. 12); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant