CN108415959B - Text classification method and device

Info

Publication number: CN108415959B
Application number: CN201810118843.XA
Authority: CN
Inventors: 王富田; 李健; 张连毅; 武卫东
Original assignee: Beijing Sinovoice Technology Co Ltd
Current assignee: Beijing Sinovoice Technology Co Ltd
Priority date: 2018-02-06
Filing date: 2018-02-06
Publication date: 2021-06-25
Anticipated expiration: 2038-02-06
Also published as: CN108415959A

Abstract

Embodiments of the invention provide a text classification method and device. With this method, texts to be classified can be classified without manual participation, which reduces labor cost. In addition, because the text keywords of the text to be classified and the abstract keywords of its abstract can accurately summarize the core content of the text, determining the category of the text by combining the text keywords and the abstract keywords helps improve classification accuracy.

Description

Text classification method and device

Technical Field

The invention relates to the technical field of internet, in particular to a text classification method and device.

Background

With the continuous development of internet technology, the amount of information carried as text, such as online news, blog posts, advertorials, and status updates, has grown explosively, and users can search the internet for texts that interest them. For a user to find a text of interest quickly, the various texts on the internet need to be classified in advance.

In the prior art, a worker typically reads a text manually, determines the subject of its content, determines the category of the text according to that subject, and then stores the text under that category.

However, the inventors found that classifying texts in this way involves a heavy workload and high labor cost.

Disclosure of Invention

To solve the above technical problem, embodiments of the present invention provide a text classification method and apparatus.

In a first aspect, an embodiment of the present invention provides a text classification method, the method including:

extracting at least one text keyword in a text to be classified, wherein the text keyword is used for describing the subject content of the text to be classified;

according to the at least one text keyword, acquiring a first possibility that the text to be classified belongs to each preset text category respectively;

extracting an abstract in the text to be classified, wherein the abstract comprises partial sentences in the text to be classified, and the partial sentences are used for describing the subject content of the text to be classified;

extracting at least one abstract keyword in the abstract, wherein the abstract keyword is used for describing the subject content of the abstract;

acquiring, according to the at least one abstract keyword, a second possibility that the text to be classified belongs to each preset text category respectively;

determining a third possibility that the text to be classified respectively belongs to each preset text category according to the first possibility and the second possibility;

and determining a preset text category with the maximum third possibility to which the text to be classified belongs as a target text category to which the text to be classified belongs.

In an optional implementation manner, the obtaining, according to the at least one abstract keyword, a second possibility that the text to be classified respectively belongs to each preset text category includes:

for each preset text category, acquiring a preset keyword set belonging to the preset text category;

determining abstract keywords in the preset keyword set from the at least one abstract keyword;

acquiring the weight of each determined abstract keyword in the preset keyword set;

counting the number of occurrences of each determined abstract keyword in the abstract;

and determining a second possibility that the text to be classified belongs to the preset text category according to the weights and the numbers of occurrences.

In an optional implementation manner, the determining, according to the weight and the number of times, a second possibility that the text to be classified belongs to the preset text category includes:

for each determined abstract keyword, calculating the product of the weight of the abstract keyword in the preset keyword set and the number of occurrences of the abstract keyword in the abstract, and taking the product as the fourth possibility that the abstract keyword belongs to the preset text category;

and summing the fourth possibilities of all the determined abstract keywords to obtain the second possibility that the text to be classified belongs to the preset text category.
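The calculation above can be written compactly. The notation here is introduced only for illustration and does not appear in the original text: for a preset text category c, let K_c be its preset keyword set, w_c(k) the weight of keyword k in that set, A the set of extracted abstract keywords, and n_abs(k) the number of occurrences of k in the abstract.

```latex
P_4(k, c) = w_c(k)\, n_{\mathrm{abs}}(k),
\qquad
P_2(c) = \sum_{k \in A \cap K_c} P_4(k, c)
```

As a hypothetical example, if two abstract keywords fall in K_c with weights 0.8 and 0.5 and occur 2 times and 1 time in the abstract, then P_2(c) = 0.8 × 2 + 0.5 × 1 = 2.1.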

In an optional implementation manner, the determining, according to the first likelihood and the second likelihood, a third likelihood that the text to be classified belongs to each preset text category respectively includes:

acquiring a preset text weight and a preset abstract weight;

for each preset text category, calculating a first product between a first possibility that the text to be classified belongs to the preset text category and the text weight;

calculating a second product between a second possibility that the text to be classified belongs to the preset text category and the abstract weight;

and summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category.
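In formula form, with symbols chosen here purely for illustration (they are not part of the original text): let P_1(c) and P_2(c) be the first and second possibilities for preset text category c, α the preset text weight, and β the preset abstract weight.

```latex
P_3(c) = \alpha\, P_1(c) + \beta\, P_2(c)
```

For instance, with the hypothetical values α = 0.4, β = 0.6, P_1(c) = 3.0 and P_2(c) = 2.1, the third possibility is P_3(c) = 0.4 × 3.0 + 0.6 × 2.1 = 2.46. The text does not state whether the two weights must sum to 1.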

In a second aspect, an embodiment of the present invention shows a text classification apparatus, including:

the first extraction module is used for extracting at least one text keyword in a text to be classified, wherein the text keyword is used for describing the subject content of the text to be classified;

the first obtaining module is used for obtaining a first possibility that the text to be classified belongs to each preset text category respectively according to the at least one text keyword;

the second extraction module is used for extracting an abstract in the text to be classified, wherein the abstract comprises partial sentences in the text to be classified, and the partial sentences are used for describing the subject content of the text to be classified;

the third extraction module is used for extracting at least one abstract keyword in the abstract, and the abstract keyword is used for describing the subject content of the abstract;

the second obtaining module is used for obtaining, according to the at least one abstract keyword, a second possibility that the text to be classified belongs to each preset text category respectively;

the first determining module is used for determining a third possibility that the text to be classified belongs to each preset text category according to the first possibility and the second possibility;

and the second determining module is used for determining the preset text category with the maximum third possibility to which the text to be classified belongs as the target text category to which the text to be classified belongs.

In an optional implementation manner, the second obtaining module includes:

the first acquisition unit is used for acquiring a preset keyword set belonging to each preset text category;

the first determining unit is used for determining abstract keywords in the preset keyword set from the at least one abstract keyword;

the second obtaining unit is used for obtaining the weight of each determined abstract keyword in the preset keyword set;

the statistical unit is used for counting the number of occurrences of each determined abstract keyword in the abstract;

and the second determining unit is used for determining a second possibility that the text to be classified belongs to the preset text category according to the weights and the numbers of occurrences.

In an optional implementation manner, the second determining unit includes:

a calculating subunit, configured to calculate, for each determined abstract keyword, a product of a weight of the abstract keyword in the preset keyword set and a number of times that the abstract keyword appears in the abstract, and use the product as a fourth possibility that the abstract keyword belongs to the preset text category;

and the summing subunit is used for summing the fourth possibility that each determined abstract keyword belongs to the preset text category to obtain a second possibility that the text to be classified belongs to the preset text category.

In an optional implementation manner, the first determining module includes:

the third acquiring unit is used for acquiring a preset text weight and a preset abstract weight;

the first calculation unit is used for calculating a first product between a first possibility that the text to be classified belongs to the preset text category and the text weight for each preset text category;

the second calculation unit is used for calculating a second product between a second possibility that the text to be classified belongs to the preset text category and the abstract weight;

and the summation unit is used for summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category.

In a third aspect, an embodiment of the present invention shows an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the text classification method according to the first aspect are implemented.

In a fourth aspect, the present invention shows a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the text classification method according to the first aspect.

Compared with the prior art, the embodiment of the invention has the following advantages:

In the embodiment of the invention, at least one text keyword in the text to be classified is extracted, wherein the text keyword is used for describing the subject content of the text to be classified; a first possibility that the text to be classified belongs to each preset text category respectively is acquired according to the at least one text keyword; an abstract in the text to be classified is extracted, wherein the abstract comprises partial sentences in the text to be classified, and the partial sentences are used for describing the subject content of the text to be classified; at least one abstract keyword in the abstract is extracted; a second possibility that the text to be classified belongs to each preset text category respectively is acquired according to the at least one abstract keyword; a third possibility that the text to be classified belongs to each preset text category respectively is determined according to the first possibility and the second possibility; and the preset text category with the maximum third possibility is determined as the target text category to which the text to be classified belongs.

With this method, texts to be classified can be classified without manual participation, which reduces labor cost. In addition, because the text keywords of the text to be classified and the abstract keywords of its abstract can accurately summarize the core content of the text, determining the category of the text by combining the text keywords and the abstract keywords helps improve classification accuracy.

Drawings

FIG. 1 is a flowchart of the steps of an embodiment of a text classification method of the present invention;

FIG. 2 is a block diagram of a text classification apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Referring to FIG. 1, a flowchart of the steps of an embodiment of a text classification method according to the present invention is shown, which may specifically include the following steps:

In step S101, at least one text keyword in the text to be classified is extracted, where the text keyword is used to describe the subject content of the text to be classified;

In the embodiment of the invention, the text to be classified comprises a plurality of sentences, and each sentence consists of at least one word. The words include nouns, verbs, adjectives, adverbs and the like, and at least one text keyword can be extracted from the text to be classified.

In step S102, according to at least one text keyword, obtaining a first possibility that a text to be classified belongs to each preset text category;

This step may be implemented through the following process:

11) acquiring a preset keyword set belonging to any preset text category;

In the embodiment of the present invention, a plurality of preset text categories are set in advance, for example, a military category, an education category, a sports category, and an online game category.

A preset keyword set is also set in advance for each preset text category.

For example, the preset keyword set belonging to the sports category may include the keywords: NBA, football, UEFA Champions League, Chinese Super League, skiing, table tennis, etc.

The preset keyword set belonging to the online game category may include the keywords: Dota, League of Legends, island fang-forces, Super Mario, legends, etc.

12) Determining, among the at least one text keyword, the text keywords that are in the preset keyword set;

Each of the at least one text keyword is checked in turn to determine whether it is in the preset keyword set.

13) Acquiring the weight of each determined text keyword in the preset keyword set respectively;

In the embodiment of the present invention, a weight may be set in advance for each keyword in the preset keyword set. The larger the weight of a keyword, the stronger the association between the keyword and the preset text category to which the preset keyword set belongs; the smaller the weight, the weaker the association.

14) Counting the number of occurrences of each determined text keyword in the text to be classified;

For each determined text keyword, the number of times the keyword occurs in the text to be classified is counted.

15) Determining the first possibility that the text to be classified belongs to the preset text category according to the weights and the numbers of occurrences.

Specifically, for each determined text keyword, the product of the weight of the text keyword in the preset keyword set and the number of occurrences of the text keyword in the text to be classified is calculated and taken as the fifth possibility that the text keyword belongs to the preset text category. The fifth possibilities of all the determined text keywords are then summed to obtain the first possibility that the text to be classified belongs to the preset text category.

Further, the first possibility that the text to be classified belongs to each of the other preset text categories can be calculated according to the above flow 11) to 15).
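The flow 11) to 15) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the pre-tokenized input, the category names and all weight values are assumptions made here, not details given in the text.

```python
from collections import Counter


def first_possibility(text_tokens, text_keywords, preset_keywords):
    """Score one preset text category following steps 11) to 15).

    text_tokens     -- tokens of the text to be classified (assumed pre-tokenized)
    text_keywords   -- text keywords extracted in step S101
    preset_keywords -- dict mapping keyword -> weight for one preset category
    """
    counts = Counter(text_tokens)
    # 12) keep only the text keywords that are in the preset keyword set
    matched = [k for k in set(text_keywords) if k in preset_keywords]
    # 13) to 15) sum weight * number of occurrences over the matched keywords
    return sum(preset_keywords[k] * counts[k] for k in matched)


# Hypothetical preset keyword sets and weights, for illustration only.
PRESET = {
    "sports": {"NBA": 0.9, "football": 0.8, "skiing": 0.6},
    "online games": {"Dota": 0.9, "Super Mario": 0.7},
}

tokens = ["NBA", "finals", "football", "NBA", "Dota"]
keywords = ["NBA", "football", "Dota"]

# Repeat the flow for every preset text category.
scores = {c: first_possibility(tokens, keywords, kw) for c, kw in PRESET.items()}
```

Here the sports score is 0.9 × 2 + 0.8 × 1 and the online games score is 0.9 × 1; a keyword that is absent from a category's preset set contributes nothing to that category.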

In step S103, extracting an abstract in the text to be classified, where the abstract includes a part of sentences in the text to be classified, and the part of sentences is used to describe the subject content of the text to be classified;

In the embodiment of the present invention, a text summarization algorithm may be used to extract the abstract from the text to be classified; for example, the PageRank algorithm may be used.

The abstract in the embodiment of the invention may comprise 3 to 4 sentences of the text to be classified.
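The text names PageRank but does not say how the graph is built, so the following is a minimal sketch under stated assumptions: sentences are the nodes, the edge weight between two sentences is the Jaccard overlap of their word sets, and sentences are split on English end-of-sentence punctuation (Chinese text would need a different splitter and tokenizer). The function name and all parameters are hypothetical.

```python
import re


def summarize(text, num_sentences=3, damping=0.85, iterations=50):
    """Pick the top-ranked sentences using PageRank on a sentence-similarity graph."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= num_sentences:
        return sentences
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    # Edge weight (an assumption): Jaccard overlap between two sentences' word sets.
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and words[i] and words[j]:
                sim[i][j] = len(words[i] & words[j]) / len(words[i] | words[j])
    out_weight = [sum(row) for row in sim]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            # Rank flowing into sentence i from every sentence j that has outgoing edges.
            incoming = sum(
                rank[j] * sim[j][i] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            new_rank.append((1 - damping) / n + damping * incoming)
        rank = new_rank
    top = sorted(range(n), key=lambda i: rank[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]  # keep the original sentence order
```

Sentences that share no words with any other sentence receive only the base score (1 - damping) / n, so they are the last to be selected. The default of three sentences matches the 3 to 4 sentences mentioned above.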

In step S104, at least one abstract key word in the abstract is extracted, where the abstract key word is used to describe the subject content of the abstract;

In the embodiment of the present invention, the at least one abstract keyword may be extracted from the abstract by using any keyword extraction algorithm in the prior art.
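Since the text leaves the keyword extraction algorithm open, the sketch below uses the simplest possible stand-in, term frequency after stop-word removal, purely for illustration. The stop-word list, the function name and the `top_k` parameter are assumptions; a real system might instead use TF-IDF or a graph-based method.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "for", "on"}


def extract_keywords(text, top_k=5):
    """Return the top_k most frequent non-stop-word tokens, most frequent first."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in STOP_WORDS]
    return [word for word, _ in Counter(tokens).most_common(top_k)]
```

The same function could be applied to the full text in step S101 and to the abstract in step S104, although the text does not require the two steps to use the same algorithm.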

In step S105, according to at least one abstract keyword, obtaining a second possibility that the text to be classified belongs to each preset text category;

This step may be implemented through the following process:

21) acquiring a preset keyword set belonging to any preset text category;

22) determining, among the at least one abstract keyword, the abstract keywords that are in the preset keyword set;

Each of the at least one abstract keyword is checked in turn to determine whether it is in the preset keyword set.

23) Acquiring the weight of each determined abstract keyword in the preset keyword set;

24) counting the number of occurrences of each determined abstract keyword in the abstract;

For each determined abstract keyword, the number of times the keyword occurs in the abstract is counted.

25) Determining a second possibility that the text to be classified belongs to the preset text category according to the weights and the numbers of occurrences.

Specifically, for each determined abstract keyword, the product of the weight of the abstract keyword in the preset keyword set and the number of occurrences of the abstract keyword in the abstract is calculated and taken as the fourth possibility that the abstract keyword belongs to the preset text category. The fourth possibilities of all the determined abstract keywords are then summed to obtain the second possibility that the text to be classified belongs to the preset text category.

Further, the second possibility that the text to be classified belongs to each of the other preset text categories can be calculated according to the above flow 21) to 25).

In step S106, determining a third likelihood that the text to be classified belongs to each preset text category according to the first likelihood and the second likelihood;

Specifically, this step may be implemented through the following process:

31) acquiring a preset text weight and a preset abstract weight;

The preset text weight and the preset abstract weight are set locally in advance.

32) calculating a first product between a first possibility that the text to be classified belongs to any one preset text category and a preset text weight;

33) calculating a second product between a second possibility that the text to be classified belongs to the preset text category and a preset abstract weight;

34) and summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category.

Further, the third possibility that the text to be classified respectively belongs to each of the other preset text categories can be calculated according to the flow of 31) to 34).

In step S107, a preset text category with the maximum third possibility to which the text to be classified belongs is determined as a target text category to which the text to be classified belongs.
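Steps S106 and S107 amount to a weighted sum followed by an arg-max. A minimal sketch follows; the function name, the default weight values and the tie-breaking rule are assumptions, since the text only says that the weights are preset and does not address ties.

```python
def classify(first, second, text_weight=0.4, abstract_weight=0.6):
    """Combine per-category scores (steps 31 to 34) and pick the target category (S107).

    first / second -- dicts mapping category -> first / second possibility
    text_weight / abstract_weight -- preset weights; the defaults are illustrative only
    """
    third = {
        category: text_weight * first.get(category, 0.0)
        + abstract_weight * second.get(category, 0.0)
        # Sorted so that ties resolve to the alphabetically first category (an assumption).
        for category in sorted(set(first) | set(second))
    }
    target = max(third, key=third.get)
    return target, third


# Hypothetical first and second possibilities for two preset categories.
target, third = classify(
    {"sports": 2.6, "online games": 0.9},
    {"sports": 1.7, "online games": 0.0},
)
```

With these example values the sports category receives 0.4 × 2.6 + 0.6 × 1.7 and the online games category 0.4 × 0.9, so sports is selected as the target text category.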

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or a combination of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, since some steps may be performed in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily all required by the embodiments of the present invention.

Referring to FIG. 2, a block diagram of the structure of an embodiment of the text classification apparatus of the present invention is shown, and the apparatus may specifically include the following modules:

the first extraction module 11 is configured to extract at least one text keyword in a text to be classified, where the text keyword is used to describe the subject content of the text to be classified;

a first obtaining module 12, configured to obtain, according to the at least one text keyword, a first possibility that the text to be classified belongs to each preset text category respectively;

a second extraction module 13, configured to extract an abstract in the text to be classified, where the abstract includes a part of the sentences in the text to be classified, and the part of the sentences is used to describe the subject content of the text to be classified;

a third extraction module 14, configured to extract at least one abstract keyword in the abstract, where the abstract keyword is used to describe the subject content of the abstract;

a second obtaining module 15, configured to obtain, according to the at least one abstract keyword, a second possibility that the text to be classified belongs to each preset text category;

the first determining module 16 is configured to determine, according to the first likelihood and the second likelihood, a third likelihood that the text to be classified belongs to each preset text category respectively;

a second determining module 17, configured to determine a preset text category with the maximum third possibility to which the text to be classified belongs as a target text category to which the text to be classified belongs.
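One way to picture how modules 11 to 17 fit together is a small class whose fields stand in for the individual modules. This is a structural sketch only: the class name, the callable signatures and the default weights are assumptions, and the text does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, float]


@dataclass
class TextClassifier:
    """Wires modules 11 to 17 together; each callable stands in for one module."""

    extract_text_keywords: Callable[[str], List[str]]      # module 11
    score_text: Callable[[str, List[str]], Scores]         # module 12
    extract_abstract: Callable[[str], str]                 # module 13
    extract_abstract_keywords: Callable[[str], List[str]]  # module 14
    score_abstract: Callable[[str, List[str]], Scores]     # module 15
    text_weight: float = 0.5      # illustrative value
    abstract_weight: float = 0.5  # illustrative value

    def classify(self, text: str) -> str:
        first = self.score_text(text, self.extract_text_keywords(text))
        abstract = self.extract_abstract(text)
        second = self.score_abstract(abstract, self.extract_abstract_keywords(abstract))
        # Module 16: weighted sum of the two possibilities for each category.
        third = {
            c: self.text_weight * first[c] + self.abstract_weight * second.get(c, 0.0)
            for c in first
        }
        # Module 17: the category with the largest third possibility.
        return max(third, key=third.get)
```

Passing the modules in as callables keeps the sketch independent of any specific keyword extraction or summarization algorithm, which the text leaves open.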

In an optional implementation manner, the second obtaining module 15 includes:

In an optional implementation manner, the second determining unit includes:

In an optional implementation, the first determining module 16 includes:

Since the device embodiment is basically similar to the method embodiment, it is described briefly; for relevant details, refer to the corresponding description of the method embodiment.

The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The text classification method and the text classification device provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of these examples is intended only to help in understanding the method and its core idea. For a person skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the present invention. In summary, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A method of text classification, the method comprising:

the text to be classified comprises a plurality of sentences, each sentence is composed of at least one vocabulary, and each vocabulary at least comprises a noun, a verb, an adjective and an adverb;

extracting an abstract in the text to be classified, wherein the abstract comprises partial sentences in the text to be classified, the partial sentences are used for describing the subject content of the text to be classified, and the abstract is extracted from the text to be classified by using a text summarization algorithm;

determining a third possibility that the text to be classified respectively belongs to each preset text category according to the first possibility and the second possibility, wherein the third possibility comprises the following steps:

acquiring a preset text weight and a preset abstract weight, wherein the preset text weight and the preset abstract weight are locally set in advance;

summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category;

2. The method according to claim 1, wherein the obtaining a second possibility that the text to be classified respectively belongs to each preset text category according to the at least one abstract keyword comprises:

3. The method according to claim 2, wherein the determining the second possibility that the text to be classified belongs to the preset text category according to the weight and the number of times comprises:

4. An apparatus for classifying text, the apparatus comprising:

the first extraction module is used for extracting at least one text keyword in a text to be classified, and the text keyword is used for describing the subject content of the text to be classified; the text to be classified comprises a plurality of sentences, each sentence is composed of at least one vocabulary, and each vocabulary at least comprises a noun, a verb, an adjective and an adverb;

the second extraction module is used for extracting an abstract in the text to be classified, wherein the abstract comprises partial sentences in the text to be classified, the partial sentences are used for describing the subject content of the text to be classified, and the abstract is extracted from the text to be classified by using a text summarization algorithm;

wherein the first determining module comprises:

the third acquiring unit is used for acquiring a preset text weight and a preset abstract weight; the preset text weight and the preset abstract weight are set locally in advance;

the summation unit is used for summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category;

5. The apparatus of claim 4, wherein the second obtaining module comprises:

6. The apparatus according to claim 5, wherein the second determining unit comprises:

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the text classification method according to any one of claims 1 to 3 are implemented by the processor when executing the program.

8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the text classification method according to one of the claims 1 to 3.