CN117743585B - News text classification method

News text classification method

Info

Publication number: CN117743585B
Authority: CN (China)
Prior art keywords: content, text, standard, text content, association weight
Legal status: Active (granted)
Application number: CN202410189847.2A
Other languages: Chinese (zh)
Other versions: CN117743585A
Inventors: 冯卓文, 王观承, 王颢静, 徐广珺
Current assignee: Guangdong Ocean University
Original assignee: Guangdong Ocean University
Priority date / Filing date: 2024-02-20
Application filed by Guangdong Ocean University
Publication of CN117743585A: 2024-03-22
Application granted and publication of CN117743585B: 2024-04-26

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a news text classification method, which belongs to the technical field of data processing and comprises the following steps: S1, generating a first content association weight between the title and the first text content according to the title and the first text content of a news text; S2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text; S3, generating a third content association weight according to the remaining text content of the news text other than the first text content and the last text content; S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text. The classification process grasps the overall content structure of the news text and fully considers the influence of keywords, so the classification error is small and the accuracy is high.

Description

News text classification method
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a news text classification method.
Background
Currently, news reading products generally organize and distribute news according to the field to which the content belongs, for example by hot topics, by domestic and international news, or by science, military, entertainment, and so on. At present this classification and distribution is usually performed manually, which not only wastes manpower but also makes the classification result strongly dependent on individual subjective judgment, so the result is not accurate enough.
Disclosure of Invention
To solve the above problems, the invention provides a news text classification method.
The technical scheme of the invention is as follows: a news text classification method comprises the following steps:
S1, generating a first content association weight between a title and first text content according to the title and the first text content of a news text;
s2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text;
s3, generating a third content association weight according to the rest text contents except the first text content and the last text content in the news text;
S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text.
Further, S1 comprises the following sub-steps:
S11, eliminating stop words from the title and the first text content of the news text to obtain a standard title and a standard first text content, respectively;
S12, extracting, from the standard first text content, the words that also appear in the standard title to generate a first training word sequence;
S13, determining the first segment occupation ratio according to the first training word sequence;
and S14, generating a first content association weight between the title and the first text content according to the first segment occupation ratio.
The beneficial effects of the above further scheme are: in the invention, the first paragraph of a news text generally gives a brief summary of the news event and therefore has high reference value, and the news headline is a condensed summary of the entire news text; the invention therefore constructs a first content association weight to reflect the degree of association between the news headline and the content of the first paragraph. The first content association weight draws on the proportion of words in the first text content that also appear in the headline, together with the word vectors of those shared words, so it can effectively reflect the degree of semantic association.
Further, in S13, the first segment occupation ratio z_f is calculated from P, M and J (the formula is not reproduced in this text), wherein P represents the number of words of the first training word sequence, M represents the number of words of the standard first text content, and J represents the number of words of the standard title.
Further, in S14, the first content association weight σ_1 between the title and the first text content is calculated from P, C_p, z_f and X_p (the formula is not reproduced in this text), wherein P represents the number of words in the first training word sequence, C_p represents the word frequency of the p-th word of the first training word sequence in the standard first text content, z_f represents the first segment occupation ratio, and X_p represents the word vector of the p-th word in the first training word sequence.
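For illustration only, a minimal Python sketch of S11-S14 is given below. It assumes the title and first paragraph are already segmented into word lists (for example by a Chinese word segmenter), and that a stop-word set and pretrained word vectors are supplied by the caller. Because the original formulas are not reproduced in this text, the expressions used for z_f and σ_1 are assumed placeholders built only from the quantities named above (P, M, J, C_p, X_p); they are not the patented equations. The function returns both z_f and σ_1 so that the later S2 sketch can reuse them.

    import numpy as np

    def first_content_association_weight(title_tokens, first_tokens, stop_words, word_vectors):
        # S11: remove stop words to obtain the standard title and standard first text content
        std_title = [w for w in title_tokens if w not in stop_words]
        std_first = [w for w in first_tokens if w not in stop_words]

        # S12: first training word sequence = distinct words of the first paragraph
        # that also appear in the standard title
        shared = [w for w in dict.fromkeys(std_first) if w in set(std_title)]

        P, M, J = len(shared), len(std_first), len(std_title)
        if P == 0 or M == 0 or J == 0:
            return 0.0, 0.0

        # S13: first segment occupation ratio z_f -- assumed form: mean coverage of the
        # standard first content and of the standard title by the shared words
        z_f = 0.5 * (P / M + P / J)

        # S14: first content association weight sigma_1 -- assumed form: z_f scaled by the
        # frequency-weighted mean vector norm of the shared words (C_p, X_p as defined above;
        # assumes every word has an entry in word_vectors)
        total = 0.0
        for w in shared:
            C_p = std_first.count(w)              # word frequency in the standard first content
            X_p = np.asarray(word_vectors[w])     # word vector of the shared word
            total += C_p * float(np.linalg.norm(X_p))
        sigma_1 = z_f * total / P
        return z_f, sigma_1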
Further, S2 comprises the following sub-steps:
s21, eliminating stop words of the end text content to obtain standard end text content;
S22, extracting, from the standard end text content, the words that also appear in the standard title to generate a second training word sequence;
s23, determining the end segment occupation ratio according to the second training word sequence;
s24, calculating a content association tag value between the standard first text content and the standard last text content according to the first segment occupation ratio, the last segment occupation ratio and the first content association weight;
s26, generating a second content association weight between the title and the final text content according to the content association tag value between the standard first text content and the standard final text content.
The beneficial effects of the above further scheme are: in the invention, considering that a semantic association may exist between the first and last paragraphs of a news text and that the last paragraph may itself be a condensed summary of the news content, the invention combines parameters such as the end segment occupation ratio and the first content association weight to construct the second content association weight. When the end segment occupation ratio is more than twice the first segment occupation ratio, the first paragraph is treated as less important and the influence of the first content association weight on the second content association weight is appropriately reduced, so that the second content association weight reflects the degree of association between the last paragraph and the news headline.
Further, in S23, the end segment occupation ratio z_l is calculated from Q, N and J (the formula is not reproduced in this text), wherein Q represents the number of words of the second training word sequence, N represents the number of words of the standard end text content, and J represents the number of words of the standard title.
Further, in S24, the content association tag value b between the standard first text content and the standard end text content is calculated from z_f, z_l and σ_1 (the formula is not reproduced in this text), wherein z_f denotes the first segment occupation ratio, z_l denotes the end segment occupation ratio, and σ_1 denotes the first content association weight between the title and the first text content.
Further, in S26, the second content association weight σ_2 between the title and the end text content is calculated from Q, C_q, z_l, X_q and b (the formula is not reproduced in this text), wherein Q represents the number of words in the second training word sequence, C_q represents the word frequency of the q-th word of the second training word sequence in the standard end text content, z_l represents the end segment occupation ratio, X_q represents the word vector of the q-th word in the second training word sequence, and b represents the content association tag value between the standard first text content and the standard end text content.
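Continuing the sketch above under the same assumptions (tokenised input, caller-supplied stop words and word vectors), S21-S26 can be outlined as follows. The expressions for z_l, b and σ_2 are again placeholders using only the quantities named in the text (Q, N, J, C_q, X_q, z_f, z_l, σ_1); in particular, b simply halves σ_1 when z_l exceeds twice z_f, which is one possible reading of the "appropriately reduced" rule described above.

    import numpy as np

    def second_content_association_weight(title_tokens, last_tokens, stop_words, word_vectors,
                                           z_f, sigma_1):
        # S21/S22: standard end text content and the second training word sequence
        std_title = [w for w in title_tokens if w not in stop_words]
        std_last = [w for w in last_tokens if w not in stop_words]
        shared = [w for w in dict.fromkeys(std_last) if w in set(std_title)]

        Q, N, J = len(shared), len(std_last), len(std_title)
        if Q == 0 or N == 0 or J == 0:
            return 0.0

        # S23: end segment occupation ratio z_l -- assumed form, mirroring z_f
        z_l = 0.5 * (Q / N + Q / J)

        # S24: content association tag value b -- when z_l > 2 * z_f the first paragraph is
        # treated as less important, so sigma_1's influence is reduced (halved by assumption)
        b = 0.5 * sigma_1 if z_l > 2.0 * z_f else sigma_1

        # S26: second content association weight sigma_2 -- assumed form analogous to sigma_1,
        # averaged with the tag value b
        total = 0.0
        for w in shared:
            C_q = std_last.count(w)               # word frequency in the standard end content
            X_q = np.asarray(word_vectors[w])     # word vector of the shared word
            total += C_q * float(np.linalg.norm(X_q))
        return 0.5 * (z_l * total / Q + b)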
Further, in S3, the specific method for generating the third content association weight is as follows: selecting the paragraph with the largest number of words among the remaining text content other than the first text content and the last text content as a secondary paragraph, extracting keywords of the secondary paragraph, and generating the third content association weight according to all keywords of the secondary paragraph and the standard title;
the third content association weight σ_3 is calculated from X_j, the word vector of the word with the highest word frequency in the secondary paragraph, exp(·) and J (the formula is not reproduced in this text), wherein X_j represents the word vector of the j-th word in the standard title, exp(·) represents the exponential function, and J represents the number of words of the standard title.
The beneficial effects of the above further scheme are: in the present invention, the paragraph with the most text content (i.e., the largest number of words) usually gives a detailed description of the news event, so the invention uses the word vector distribution of this paragraph to construct the third content association weight.
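A corresponding sketch of S3 is shown below. It selects the longest of the remaining paragraphs as the secondary paragraph and compares the standard-title word vectors with the vector of that paragraph's most frequent word. Since the σ_3 formula itself is not reproduced in this text, the aggregation used here (the mean of exp of the negative vector distance) is an assumption, chosen only because it combines the quantities the text names (X_j, the top-frequency word's vector, exp(·), J).

    from collections import Counter
    import numpy as np

    def third_content_association_weight(title_tokens, middle_paragraphs, stop_words, word_vectors):
        std_title = [w for w in title_tokens if w not in stop_words]
        J = len(std_title)
        if J == 0 or not middle_paragraphs:
            return 0.0

        # Secondary paragraph: the remaining paragraph (neither first nor last) with the most words
        secondary = max(middle_paragraphs, key=len)
        std_secondary = [w for w in secondary if w not in stop_words]
        if not std_secondary:
            return 0.0

        # Word vector of the word with the highest word frequency in the secondary paragraph
        top_word, _ = Counter(std_secondary).most_common(1)[0]
        x_top = np.asarray(word_vectors[top_word])

        # Assumed aggregation over the standard-title words: mean of exp(-||X_j - x_top||)
        sims = [np.exp(-np.linalg.norm(np.asarray(word_vectors[w]) - x_top)) for w in std_title]
        return float(np.mean(sims))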
Further, the objective function Loss of the text processing model is expressed in terms of O, ln(·), σ_1, σ_2, σ_3 and r (the expression is not reproduced in this text), where O represents the number of words in the news text, ln(·) represents the logarithmic function, σ_1 represents the first content association weight, σ_2 represents the second content association weight, σ_3 represents the third content association weight, and r represents a hyperparameter of the support vector machine.
The text processing model may employ a support vector machine.
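Since the custom objective Loss (built from O, ln(·), σ_1, σ_2, σ_3 and r) is not reproduced in this text, the sketch below simply feeds the three association weights to a standard soft-margin SVM from scikit-learn, treating r as the usual regularisation constant. It illustrates only the "three weights in, category out" structure of S4, not the patented objective.

    import numpy as np
    from sklearn.svm import SVC

    def train_text_processing_model(weight_triples, labels, r=1.0):
        """Fit an SVM on (sigma_1, sigma_2, sigma_3) feature vectors; labels are news categories."""
        X = np.asarray(list(weight_triples), dtype=float)
        y = np.asarray(labels)
        model = SVC(C=r, kernel="rbf")   # r used as the SVM regularisation hyperparameter (assumption)
        model.fit(X, y)
        return model

    def classify_news(model, sigma_1, sigma_2, sigma_3):
        # S4: input the three content association weights to obtain the classification result
        return model.predict(np.asarray([[sigma_1, sigma_2, sigma_3]]))[0]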
The beneficial effects of the invention are as follows: the invention discloses a news text classification method that sequentially performs text analysis on the important first and last paragraphs and on the paragraph with the most text content of a news text to obtain three content association weights representing semantic information, and then inputs the three weights into the constructed text processing model to obtain an accurate category for the news text. The classification process grasps the overall content structure of the news text and fully considers the influence of keywords, so the classification error is small and the accuracy is high.
Drawings
Fig. 1 is a flowchart of a news text classification method.
Detailed Description
Embodiments of the present invention are further described below with reference to the accompanying drawings.
As shown in fig. 1, the present invention provides a news text classification method, which includes the following steps:
S1, generating a first content association weight between a title and first text content according to the title and the first text content of a news text;
s2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text;
s3, generating a third content association weight according to the rest text contents except the first text content and the last text content in the news text;
S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text.
In an embodiment of the present invention, S1 comprises the following sub-steps:
S11, eliminating stop words from the title and the first text content of the news text to obtain a standard title and a standard first text content, respectively;
S12, extracting, from the standard first text content, the words that also appear in the standard title to generate a first training word sequence;
S13, determining the first segment occupation ratio according to the first training word sequence;
and S14, generating a first content association weight between the title and the first text content according to the first segment occupation ratio.
In the invention, the first paragraph of a news text generally gives a brief summary of the news event and therefore has high reference value, and the news headline is a condensed summary of the entire news text; the invention therefore constructs a first content association weight to reflect the degree of association between the news headline and the content of the first paragraph. The first content association weight draws on the proportion of words in the first text content that also appear in the headline, together with the word vectors of those shared words, so it can effectively reflect the degree of semantic association.
In the embodiment of the present invention, in S13, the first segment occupation ratio z_f is calculated from P, M and J (the formula is not reproduced in this text), wherein P represents the number of words of the first training word sequence, M represents the number of words of the standard first text content, and J represents the number of words of the standard title.
In the embodiment of the present invention, in S14, the first content association weight σ_1 between the title and the first text content is calculated from P, C_p, z_f and X_p (the formula is not reproduced in this text), wherein P represents the number of words in the first training word sequence, C_p represents the word frequency of the p-th word of the first training word sequence in the standard first text content, z_f represents the first segment occupation ratio, and X_p represents the word vector of the p-th word in the first training word sequence.
In an embodiment of the present invention, S2 comprises the following sub-steps:
s21, eliminating stop words of the end text content to obtain standard end text content;
S22, extracting, from the standard end text content, the words that also appear in the standard title to generate a second training word sequence;
s23, determining the end segment occupation ratio according to the second training word sequence;
s24, calculating a content association tag value between the standard first text content and the standard last text content according to the first segment occupation ratio, the last segment occupation ratio and the first content association weight;
s26, generating a second content association weight between the title and the final text content according to the content association tag value between the standard first text content and the standard final text content.
In the invention, considering that a semantic association may exist between the first and last paragraphs of a news text and that the last paragraph may itself be a condensed summary of the news content, the invention combines parameters such as the end segment occupation ratio and the first content association weight to construct the second content association weight. When the end segment occupation ratio is more than twice the first segment occupation ratio, the first paragraph is treated as less important and the influence of the first content association weight on the second content association weight is appropriately reduced, so that the second content association weight reflects the degree of association between the last paragraph and the news headline.
In the embodiment of the present invention, in S23, the end segment occupation ratio z_l is calculated from Q, N and J (the formula is not reproduced in this text), wherein Q represents the number of words of the second training word sequence, N represents the number of words of the standard end text content, and J represents the number of words of the standard title.
In the embodiment of the present invention, in S24, the content association tag value b between the standard first text content and the standard end text content is calculated from z_f, z_l and σ_1 (the formula is not reproduced in this text), wherein z_f denotes the first segment occupation ratio, z_l denotes the end segment occupation ratio, and σ_1 denotes the first content association weight between the title and the first text content.
In the embodiment of the present invention, in S26, the second content association weight σ_2 between the title and the end text content is calculated from Q, C_q, z_l, X_q and b (the formula is not reproduced in this text), wherein Q represents the number of words in the second training word sequence, C_q represents the word frequency of the q-th word of the second training word sequence in the standard end text content, z_l represents the end segment occupation ratio, X_q represents the word vector of the q-th word in the second training word sequence, and b represents the content association tag value between the standard first text content and the standard end text content.
In the embodiment of the present invention, in S3, the specific method for generating the third content association weight is as follows: selecting the paragraph with the largest number of words among the remaining text content other than the first text content and the last text content as a secondary paragraph, extracting keywords of the secondary paragraph, and generating the third content association weight according to all keywords of the secondary paragraph and the standard title;
the third content association weight σ_3 is calculated from X_j, the word vector of the word with the highest word frequency in the secondary paragraph, exp(·) and J (the formula is not reproduced in this text), wherein X_j represents the word vector of the j-th word in the standard title, exp(·) represents the exponential function, and J represents the number of words of the standard title.
In the present invention, the paragraph with the most text content (i.e., the largest number of words) usually gives a detailed description of the news event, so the invention uses the word vector distribution of this paragraph to construct the third content association weight.
In the embodiment of the invention, the objective function Loss of the text processing model is expressed in terms of O, ln(·), σ_1, σ_2, σ_3 and r (the expression is not reproduced in this text), where O represents the number of words in the news text, ln(·) represents the logarithmic function, σ_1 represents the first content association weight, σ_2 represents the second content association weight, σ_3 represents the third content association weight, and r represents a hyperparameter of the support vector machine.
The text processing model may employ a support vector machine.
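Putting the sketches above together, a hypothetical end-to-end flow for one news text would look as follows. The title and paragraphs are assumed to be pre-segmented token lists, and stop_words, word_vectors, train_docs, train_labels and test_doc are placeholder names for data supplied by the caller; none of them come from the patent itself.

    def document_weights(title, paragraphs, stop_words, word_vectors):
        # Split the news text into first paragraph, last paragraph and the remaining paragraphs
        first, last, middle = paragraphs[0], paragraphs[-1], paragraphs[1:-1]
        z_f, s1 = first_content_association_weight(title, first, stop_words, word_vectors)
        s2 = second_content_association_weight(title, last, stop_words, word_vectors, z_f, s1)
        s3 = third_content_association_weight(title, middle, stop_words, word_vectors)
        return s1, s2, s3

    # Hypothetical labelled corpus: train_docs is a list of (title_tokens, paragraph_token_lists)
    features = [document_weights(t, ps, stop_words, word_vectors) for t, ps in train_docs]
    model = train_text_processing_model(features, train_labels, r=1.0)

    new_title, new_paragraphs = test_doc
    print(classify_news(model, *document_weights(new_title, new_paragraphs, stop_words, word_vectors)))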
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and that the scope of the invention is not limited to these specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.

Claims (7)

1. A news text classification method, comprising the steps of:
S1, generating a first content association weight between a title and first text content according to the title and the first text content of a news text;
s2, generating a second content association weight between the title and the last text content according to the title, the last text content and the first content association weight of the news text;
s3, generating a third content association weight according to the rest text contents except the first text content and the last text content in the news text;
S4, constructing a text processing model, and inputting the first content association weight, the second content association weight and the third content association weight into the text processing model to obtain a classification result of the news text;
The step S1 comprises the following substeps:
S11, eliminating stop words from the title and the first text content of the news text to obtain a standard title and a standard first text content, respectively;
S12, extracting, from the standard first text content, the words that also appear in the standard title to generate a first training word sequence;
S13, determining the first segment occupation ratio according to the first training word sequence;
s14, generating a first content association weight between the title and the first text content according to the first segment occupation ratio;
the step S2 comprises the following substeps:
s21, eliminating stop words of the end text content to obtain standard end text content;
S22, extracting, from the standard end text content, the words that also appear in the standard title to generate a second training word sequence;
s23, determining the end segment occupation ratio according to the second training word sequence;
s24, calculating a content association tag value between the standard first text content and the standard last text content according to the first segment occupation ratio, the last segment occupation ratio and the first content association weight;
s26, generating a second content association weight between the title and the final text content according to the content association tag value between the standard first text content and the standard final text content;
In the step S3, the specific method for generating the third content association weight is as follows: selecting the paragraph with the largest number of words among the remaining text content other than the first text content and the last text content as a secondary paragraph, extracting keywords of the secondary paragraph, and generating the third content association weight according to all keywords of the secondary paragraph and the standard title;
the third content association weight σ_3 is calculated from X_j, the word vector of the word with the highest word frequency in the secondary paragraph, exp(·) and J (the formula is not reproduced in this text), wherein X_j represents the word vector of the j-th word in the standard title, exp(·) represents the exponential function, and J represents the number of words of the standard title.
2. The news text classification method according to claim 1, wherein in S13, the first segment occupation ratio z_f is calculated from P, M and J (the formula is not reproduced in this text), wherein P represents the number of words of the first training word sequence, M represents the number of words of the standard first text content, and J represents the number of words of the standard title.
3. The news text classification method according to claim 1, wherein in S14, the first content association weight σ_1 between the title and the first text content is calculated from P, C_p, z_f and X_p (the formula is not reproduced in this text), wherein P represents the number of words in the first training word sequence, C_p represents the word frequency of the p-th word of the first training word sequence in the standard first text content, z_f represents the first segment occupation ratio, and X_p represents the word vector of the p-th word in the first training word sequence.
4. The news text classification method according to claim 1, wherein in S23, the end segment occupation ratio z_l is calculated from Q, N and J (the formula is not reproduced in this text), wherein Q represents the number of words of the second training word sequence, N represents the number of words of the standard end text content, and J represents the number of words of the standard title.
5. The news text classification method according to claim 1, wherein in S24, the content association tag value b between the standard first text content and the standard end text content is calculated from z_f, z_l and σ_1 (the formula is not reproduced in this text), wherein z_f denotes the first segment occupation ratio, z_l denotes the end segment occupation ratio, and σ_1 denotes the first content association weight between the title and the first text content.
6. The news text classification method according to claim 1, wherein in S26, the second content association weight σ_2 between the title and the end text content is calculated from Q, C_q, z_l, X_q and b (the formula is not reproduced in this text), wherein Q represents the number of words in the second training word sequence, C_q represents the word frequency of the q-th word of the second training word sequence in the standard end text content, z_l represents the end segment occupation ratio, X_q represents the word vector of the q-th word in the second training word sequence, and b represents the content association tag value between the standard first text content and the standard end text content.
7. The news text classification method of claim 1, wherein the objective function Loss of the text processing model is expressed in terms of O, ln(·), σ_1, σ_2, σ_3 and r (the expression is not reproduced in this text), where O represents the number of words in the news text, ln(·) represents the logarithmic function, σ_1 represents the first content association weight, σ_2 represents the second content association weight, σ_3 represents the third content association weight, and r represents a hyperparameter of the support vector machine.

Priority Applications (1)

Application Number: CN202410189847.2A
Priority Date: 2024-02-20
Filing Date: 2024-02-20
Title: News text classification method


Publications (2)

Publication Number Publication Date
CN117743585A (en) 2024-03-22
CN117743585B (en) 2024-04-26

Family

ID=90261320


Country Status (1)

Country Link
CN (1) CN117743585B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220890A (en) * 2021-06-10 2021-08-06 长春工业大学 Deep learning method combining news headlines and news long text contents based on pre-training
CN114969324A (en) * 2022-04-15 2022-08-30 河南大学 Chinese news title classification method based on subject word feature expansion

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11455464B2 (en) * 2019-09-18 2022-09-27 Accenture Global Solutions Limited Document content classification and alteration


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A logical paragraph division method based on semantic features and its application; 朱振方; 刘培玉; 王金龙; Computer Science (计算机科学); 2009-12-15 (No. 12); full text *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant