CN108415959B - Text classification method and device - Google Patents

Info

Publication number
CN108415959B
Authority
CN
China
Prior art keywords
text
abstract
classified
preset
possibility
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810118843.XA
Other languages
Chinese (zh)
Other versions
CN108415959A (en)
Inventor
王富田
李健
张连毅
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN201810118843.XA
Publication of CN108415959A
Application granted
Publication of CN108415959B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a text classification method and device. By the method, the texts to be classified can be classified without manual participation, and labor cost is reduced. Secondly, because the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified can accurately summarize the core content of the text to be classified, when the text to be classified is classified, the category to which the text to be classified belongs is determined by combining the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified, which is beneficial to improving the accuracy of classification.

Description

Text classification method and device
Technical Field
The invention relates to the technical field of internet, in particular to a text classification method and device.
Background
With the continuous development of internet technology, the volume of information carried as text, such as online news, blog articles, advertorials, and status updates, has grown explosively, and users can search the internet for texts that interest them. For a user to find a text of interest quickly, the various texts on the internet need to be classified in advance.
In the prior art, generally, a worker manually reads a text, determines the subject of the content in the text, determines the category of the text according to the subject of the content in the text, and then stores the text in a classified manner.
However, the inventor finds that classifying texts in this way involves a heavy workload and high labor cost.
Disclosure of Invention
In order to solve the above technical problems, embodiments of the present invention show a text classification method and apparatus.
In a first aspect, an embodiment of the present invention shows a text classification method, where the method includes:
extracting at least one text keyword in a text to be classified, wherein the text keyword is used for describing the subject content of the text to be classified;
according to the at least one text keyword, acquiring a first possibility that the text to be classified belongs to each preset text category respectively;
extracting an abstract in the text to be classified, wherein the abstract comprises some of the sentences in the text to be classified, and those sentences are used for describing the subject content of the text to be classified;
extracting at least one abstract keyword in the abstract, wherein the abstract keyword is used for describing the subject content of the abstract;
acquiring a second possibility that the text to be classified respectively belongs to each preset text category according to the at least one abstract keyword;
determining a third possibility that the text to be classified respectively belongs to each preset text category according to the first possibility and the second possibility;
and determining a preset text category with the maximum third possibility to which the text to be classified belongs as a target text category to which the text to be classified belongs.
In an optional implementation manner, the obtaining, according to the at least one abstract keyword, a second possibility that the text to be classified respectively belongs to each preset text category includes:
for each preset text category, acquiring a preset keyword set belonging to the preset text category;
determining abstract keywords in the preset keyword set from the at least one abstract keyword;
acquiring the weight of each determined abstract keyword in the preset keyword set;
counting the number of times each determined abstract keyword occurs in the abstract;
and determining a second possibility that the text to be classified belongs to the preset text category according to the weights and the occurrence counts.
In an optional implementation manner, the determining, according to the weights and the occurrence counts, a second possibility that the text to be classified belongs to the preset text category includes:
for each determined abstract keyword, calculating the product of the weight of the abstract keyword in the preset keyword set and the number of times the abstract keyword occurs in the abstract, and taking the product as the fourth possibility that the abstract keyword belongs to the preset text category;
and summing the fourth possibilities of all determined abstract keywords to obtain a second possibility that the text to be classified belongs to the preset text category.
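The product-and-sum rule above can be written compactly. The notation is introduced here for illustration only and does not appear in the patent: K_c is the set of abstract keywords found in the preset keyword set of category c, w_c(k) is the weight of keyword k in that set, and n_A(k) is the number of times k occurs in the abstract.

```latex
P_4(k, c) = w_c(k)\, n_A(k),
\qquad
P_2(c) = \sum_{k \in K_c} P_4(k, c) = \sum_{k \in K_c} w_c(k)\, n_A(k)
```

Under this reading, a category whose preset keyword set contains none of the abstract keywords has an empty sum, so its second possibility is 0.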
In an optional implementation manner, the determining, according to the first likelihood and the second likelihood, a third likelihood that the text to be classified belongs to each preset text category respectively includes:
acquiring a preset text weight and a preset abstract weight;
for each preset text category, calculating a first product between a first possibility that the text to be classified belongs to the preset text category and the text weight;
calculating a second product between a second possibility that the text to be classified belongs to the preset text category and the abstract weight;
and summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category.
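As a formula, the weighted combination above reads as follows. The symbols are illustrative assumptions, not the patent's notation: P_1(c) and P_2(c) are the first and second possibilities for category c, and λ_T and λ_A are the preset text weight and the preset abstract weight.

```latex
P_3(c) = \lambda_T \, P_1(c) + \lambda_A \, P_2(c)
```

The patent does not state any constraint on the two weights (for example, that they sum to 1); choosing λ_A larger than λ_T would give keywords from the abstract more influence on the result.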
In a second aspect, an embodiment of the present invention shows a text classification apparatus, including:
the first extraction module is used for extracting at least one text keyword in a text to be classified, wherein the text keyword is used for describing the subject content of the text to be classified;
the first obtaining module is used for obtaining a first possibility that the text to be classified belongs to each preset text category respectively according to the at least one text keyword;
the second extraction module is used for extracting an abstract in the text to be classified, wherein the abstract comprises some of the sentences in the text to be classified, and those sentences are used for describing the subject content of the text to be classified;
the third extraction module is used for extracting at least one abstract keyword in the abstract, and the abstract keyword is used for describing the subject content of the abstract;
the second obtaining module is used for obtaining a second possibility that the text to be classified respectively belongs to each preset text category according to the at least one abstract keyword;
the first determining module is used for determining a third possibility that the text to be classified belongs to each preset text category according to the first possibility and the second possibility;
and the second determining module is used for determining the preset text category with the maximum third possibility to which the text to be classified belongs as the target text category to which the text to be classified belongs.
In an optional implementation manner, the second obtaining module includes:
the first acquisition unit is used for acquiring a preset keyword set belonging to each preset text category;
the first determining unit is used for determining abstract keywords in the preset keyword set from the at least one abstract keyword;
the second obtaining unit is used for obtaining the weight of each determined abstract keyword in the preset keyword set;
the statistical unit is used for counting the number of times each determined abstract keyword occurs in the abstract;
and the second determining unit is used for determining a second possibility that the text to be classified belongs to the preset text category according to the weights and the occurrence counts.
In an optional implementation manner, the second determining unit includes:
a calculating subunit, configured to calculate, for each determined abstract keyword, a product of a weight of the abstract keyword in the preset keyword set and a number of times that the abstract keyword appears in the abstract, and use the product as a fourth possibility that the abstract keyword belongs to the preset text category;
and the summing subunit is used for summing the fourth possibility that each determined abstract keyword belongs to the preset text category to obtain a second possibility that the text to be classified belongs to the preset text category.
In an optional implementation manner, the first determining module includes:
the third acquiring unit is used for acquiring a preset text weight and a preset abstract weight;
the first calculation unit is used for calculating a first product between a first possibility that the text to be classified belongs to the preset text category and the text weight for each preset text category;
the second calculation unit is used for calculating a second product between a second possibility that the text to be classified belongs to the preset text category and the abstract weight;
and the summation unit is used for summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category.
In a third aspect, an embodiment of the present invention shows an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the text classification method according to the first aspect are implemented.
In a fourth aspect, the present invention shows a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the text classification method according to the first aspect.
Compared with the prior art, the embodiment of the invention has the following advantages:
in the embodiment of the invention, at least one text keyword is extracted from the text to be classified, where the text keyword describes the subject content of the text to be classified; a first possibility that the text to be classified belongs to each preset text category is obtained according to the at least one text keyword; an abstract is extracted from the text to be classified, where the abstract comprises some of the sentences in the text to be classified, and those sentences describe the subject content of the text to be classified; at least one abstract keyword is extracted from the abstract; a second possibility that the text to be classified belongs to each preset text category is obtained according to the at least one abstract keyword; a third possibility that the text to be classified belongs to each preset text category is determined according to the first possibility and the second possibility; and the preset text category with the maximum third possibility is determined as the target text category to which the text to be classified belongs.
By the method, the texts to be classified can be classified without manual participation, and labor cost is reduced. Secondly, because the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified can accurately summarize the core content of the text to be classified, when the text to be classified is classified, the category to which the text to be classified belongs is determined by combining the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified, which is beneficial to improving the accuracy of classification.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of a text classification method of the present invention;
FIG. 2 is a block diagram of a text classification apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a text classification method according to the present invention is shown, which may specifically include the following steps:
in step S101, at least one text keyword in the text to be classified is extracted, where the text keyword is used to describe the subject content of the text to be classified;
in the embodiment of the invention, the text to be classified comprises a plurality of sentences, each sentence consists of at least one word, and the words include nouns, verbs, adjectives, adverbs and the like; at least one text keyword can be extracted from the text to be classified.
In step S102, according to at least one text keyword, obtaining a first possibility that a text to be classified belongs to each preset text category;
wherein, this step can be realized through the following process, including:
11) acquiring a preset keyword set belonging to any preset text category;
in the embodiment of the present invention, a plurality of preset text categories, for example, a military category, an education category, a sports category, and an online game category, etc., are set in advance.
And preset keyword sets respectively belonging to each preset text category are set in advance.
For example, the preset keyword set belonging to the sports category may include the keywords NBA, football, Champions League, Chinese Super League, skiing, table tennis, etc.
The preset keyword set belonging to the online game category may include the keywords Dota, League of Legends, island fang-forces, Super Mario, legends, etc.
12) Determining text keywords in the preset keyword set in at least one text keyword;
and for any one text keyword in the at least one text keyword, determining whether the text keyword is located in the preset keyword set, and for each other text keyword in the at least one text keyword, executing the above operation.
13) Acquiring the weight of each determined text keyword in the preset keyword set respectively;
in the embodiment of the present invention, a weight may be set in advance for each keyword included in the preset keyword set, when the weight of a keyword is larger, the association between the keyword and the preset text category to which the preset keyword set belongs is larger, and when the weight of a keyword is smaller, the association between the keyword and the preset text category to which the preset keyword set belongs is smaller.
14) Counting the number of times each determined text keyword occurs in the text to be classified;
for each determined text keyword, the number of times the text keyword occurs in the text to be classified is counted.
15) And determining the first possibility that the text to be classified belongs to the preset text category according to the weights and the occurrence counts.
Specifically, for each determined text keyword, the product of the weight of the text keyword in the preset keyword set and the number of times the text keyword occurs in the text to be classified is calculated and taken as the fifth possibility that the text keyword belongs to the preset text category. The fifth possibilities of all determined text keywords are then summed to obtain the first possibility that the text to be classified belongs to the preset text category.
Further, the first possibility that the text to be classified respectively belongs to each of the other preset text categories can be calculated according to the above-mentioned flow 11) to 15).
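Flow 11) to 15) can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the keyword sets, the weights, the whitespace tokenization, and the names `PRESET_KEYWORDS`, `category_score` and `first_possibilities` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical preset keyword sets: category -> {keyword: weight}.
# The patent gives example keywords but no weights; these values are made up.
PRESET_KEYWORDS = {
    "sports": {"NBA": 2.0, "football": 1.5, "skiing": 1.0},
    "online games": {"Dota": 2.0, "legends": 1.0},
}


def category_score(keywords, tokens, weighted_set):
    """Score one category: sum of weight * occurrence count (steps 12 to 15)."""
    counts = Counter(tokens)                        # step 14: occurrence counts
    matched = set(keywords) & weighted_set.keys()   # step 12: keywords in the preset set
    # Steps 13 and 15: look up each weight, multiply by the count, and sum.
    return sum(weighted_set[k] * counts[k] for k in matched)


def first_possibilities(keywords, tokens, preset=PRESET_KEYWORDS):
    """Repeat the scoring for every preset text category."""
    return {cat: category_score(keywords, tokens, ws) for cat, ws in preset.items()}


# Toy input: tokens of the text to be classified and its extracted keywords.
tokens = "NBA finals tonight NBA fans also watch football and play Dota".split()
keywords = ["NBA", "football", "Dota"]
scores = first_possibilities(keywords, tokens)
print(scores)
```

With these made-up weights, "NBA" occurs twice and "football" once, so the sports category scores 2.0 × 2 + 1.5 × 1 = 5.5, while the single "Dota" hit gives the online games category 2.0.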
In step S103, extracting an abstract in the text to be classified, where the abstract includes a part of sentences in the text to be classified, and the part of sentences is used to describe the subject content of the text to be classified;
in the embodiment of the present invention, a text summarization algorithm may be used to extract the abstract from the text to be classified; for example, the PageRank algorithm may be used.
The abstract in the embodiment of the invention can comprise 3-4 sentences in the text to be classified.
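The patent names PageRank but does not say how it is applied to sentences. One common reading is to rank sentences on a similarity graph and keep the top few. The sketch below follows that reading under stated assumptions: the sentence splitter, the word-overlap similarity, the damping factor, the iteration count, and all function names are illustrative choices, not details from the patent.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a, b):
    """Word-overlap similarity between two sentences (an assumed edge weight)."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / (len(wa) + len(wb))


def pagerank_summary(text, num_sentences=3, damping=0.85, iterations=50):
    """Return the top-ranked sentences, kept in their original order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= num_sentences:
        return sentences
    # Weighted graph: edge i -> j carries the similarity of sentences i and j.
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives rank from its neighbours in proportion to edge weight.
        rank = [
            (1 - damping) / n + damping * sum(
                rank[j] * weights[j][i] / out_sum[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: rank[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]


text = (
    "The team won the football match. "
    "The football match drew a large crowd. "
    "The crowd cheered the team after the match. "
    "My cat sleeps all day. "
    "The team will play another football match next week."
)
summary = pagerank_summary(text, num_sentences=3)
print(summary)
```

In this toy text the sentence about the cat shares no words with the others, so it receives only the baseline rank and is left out of the three-sentence abstract. For simplicity, sentences with no outgoing edges are not redistributed as in full PageRank.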
In step S104, at least one abstract keyword in the abstract is extracted, where the abstract keyword is used to describe the subject content of the abstract;
in the embodiment of the present invention, the at least one abstract keyword may be extracted from the abstract by using any keyword extraction algorithm in the prior art.
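Since the patent leaves the keyword extraction algorithm open, the following sketch uses the simplest possible stand-in: the most frequent non-stopword terms. The stopword list, the `top_k` parameter and the function name `extract_keywords` are assumptions for illustration; a real system might use TF-IDF or a graph-based method instead.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a practical one would be much larger.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "was", "for", "on"}


def extract_keywords(text, top_k=3):
    """Return the top_k most frequent words that are not stopwords."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_k)]


abstract = (
    "The football final was the biggest football match of the season. "
    "Fans of football filled the stadium for the final."
)
abstract_keywords = extract_keywords(abstract, top_k=2)
print(abstract_keywords)
```

Here "football" occurs three times and "final" twice, so those two words are returned. The regular expression only matches ASCII letters, which would need to change for Chinese text, where a word segmenter is required first.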
In step S105, according to at least one abstract keyword, obtaining a second possibility that the text to be classified belongs to each preset text category;
wherein, this step can be realized through the following process, including:
21) acquiring a preset keyword set belonging to any preset text category;
22) determining abstract keywords in the preset keyword set in at least one abstract keyword;
for each of the at least one abstract keyword, whether the abstract keyword is located in the preset keyword set is determined.
23) Acquiring the weight of each determined abstract keyword in the preset keyword set;
24) counting the number of times each determined abstract keyword occurs in the abstract;
for each determined abstract keyword, the number of times the abstract keyword occurs in the abstract is counted.
25) And determining a second possibility that the text to be classified belongs to the preset text category according to the weights and the occurrence counts.
Specifically, for each determined abstract keyword, the product of the weight of the abstract keyword in the preset keyword set and the number of times the abstract keyword occurs in the abstract is calculated and taken as the fourth possibility that the abstract keyword belongs to the preset text category. The fourth possibilities of all determined abstract keywords are then summed to obtain a second possibility that the text to be classified belongs to the preset text category.
Further, the second possibility that the text to be classified respectively belongs to each of the other preset text categories can be calculated according to the above-mentioned flow 21) to 25).
In step S106, determining a third likelihood that the text to be classified belongs to each preset text category according to the first likelihood and the second likelihood;
specifically, the step can be implemented by the following process, including:
31) acquiring a preset text weight and a preset abstract weight;
wherein the preset text weight and the preset abstract weight are set locally in advance.
32) calculating a first product between a first possibility that the text to be classified belongs to any one preset text category and a preset text weight;
33) calculating a second product between a second possibility that the text to be classified belongs to the preset text category and a preset abstract weight;
34) and summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category.
Further, the third possibility that the text to be classified respectively belongs to each of the other preset text categories can be calculated according to the flow of 31) to 34).
In step S107, a preset text category with the maximum third possibility to which the text to be classified belongs is determined as a target text category to which the text to be classified belongs.
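Steps S106 and S107 amount to a weighted sum followed by an argmax. The sketch below assumes the first and second possibilities are already available as dictionaries keyed by category; the weight values 0.4 and 0.6, the example scores, and the function names are hypothetical and not taken from the patent.

```python
def third_possibilities(first, second, text_weight=0.4, abstract_weight=0.6):
    """Step S106: weighted sum of the first and second possibility per category."""
    categories = first.keys() | second.keys()
    return {
        c: text_weight * first.get(c, 0.0) + abstract_weight * second.get(c, 0.0)
        for c in categories
    }


def target_category(third):
    """Step S107: the preset text category with the maximum third possibility."""
    return max(third, key=third.get)


# Made-up scores for two preset categories.
first = {"sports": 5.5, "online games": 2.0}
second = {"sports": 3.0, "online games": 0.0}
third = third_possibilities(first, second)
print(target_category(third))
```

The patent does not say how ties are broken; `max` here simply returns whichever tied category it encounters first, so a production system would want an explicit rule.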
In the embodiment of the invention, at least one text keyword is extracted from the text to be classified, where the text keyword describes the subject content of the text to be classified; a first possibility that the text to be classified belongs to each preset text category is obtained according to the at least one text keyword; an abstract is extracted from the text to be classified, where the abstract comprises some of the sentences in the text to be classified, and those sentences describe the subject content of the text to be classified; at least one abstract keyword is extracted from the abstract; a second possibility that the text to be classified belongs to each preset text category is obtained according to the at least one abstract keyword; a third possibility that the text to be classified belongs to each preset text category is determined according to the first possibility and the second possibility; and the preset text category with the maximum third possibility is determined as the target text category to which the text to be classified belongs.
By the method, the texts to be classified can be classified without manual participation, and labor cost is reduced. Secondly, because the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified can accurately summarize the core content of the text to be classified, when the text to be classified is classified, the category to which the text to be classified belongs is determined by combining the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified, which is beneficial to improving the accuracy of classification.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that the present invention is not limited by the described order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily all required by the embodiments of the present invention.
Referring to fig. 2, a block diagram of a structure of an embodiment of the text classification apparatus of the present invention is shown, and the apparatus may specifically include the following modules:
the first extraction module 11 is configured to extract at least one text keyword in a text to be classified, where the text keyword is used to describe the subject content of the text to be classified;
a first obtaining module 12, configured to obtain, according to the at least one text keyword, a first possibility that the text to be classified belongs to each preset text category respectively;
a second extraction module 13, configured to extract an abstract in the text to be classified, where the abstract includes some of the sentences in the text to be classified, and those sentences are used to describe the subject content of the text to be classified;
a third extraction module 14, configured to extract at least one abstract keyword in the abstract, where the abstract keyword is used to describe the subject content of the abstract;
a second obtaining module 15, configured to obtain, according to the at least one abstract keyword, a second possibility that the text to be classified belongs to each preset text category;
the first determining module 16 is configured to determine, according to the first likelihood and the second likelihood, a third likelihood that the text to be classified belongs to each preset text category respectively;
a second determining module 17, configured to determine a preset text category with the maximum third possibility to which the text to be classified belongs as a target text category to which the text to be classified belongs.
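One way to picture how modules 11 to 17 fit together is as pluggable callables wired into a single object. The sketch below is only a structural illustration: the class name `TextClassifier`, its field names, and the toy stand-in functions are hypothetical, and the patent does not prescribe any particular software structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, float]


@dataclass
class TextClassifier:
    extract_text_keywords: Callable[[str], List[str]]      # first extraction module 11
    score_text: Callable[[List[str], str], Scores]         # first obtaining module 12
    extract_abstract: Callable[[str], str]                 # second extraction module 13
    extract_abstract_keywords: Callable[[str], List[str]]  # third extraction module 14
    score_abstract: Callable[[List[str], str], Scores]     # second obtaining module 15
    combine: Callable[[Scores, Scores], Scores]            # first determining module 16

    def classify(self, text: str) -> str:
        first = self.score_text(self.extract_text_keywords(text), text)
        abstract = self.extract_abstract(text)
        second = self.score_abstract(self.extract_abstract_keywords(abstract), abstract)
        third = self.combine(first, second)
        return max(third, key=third.get)                   # second determining module 17


# Toy stand-ins so the wiring can be exercised; none of these come from the patent.
def toy_keywords(text):
    return [w for w in text.lower().split() if w in {"nba", "dota"}]


def toy_score(keywords, _text):
    return {"sports": float(keywords.count("nba")),
            "online games": float(keywords.count("dota"))}


classifier = TextClassifier(
    extract_text_keywords=toy_keywords,
    score_text=toy_score,
    extract_abstract=lambda text: text.split(".")[0],  # first sentence as the abstract
    extract_abstract_keywords=toy_keywords,
    score_abstract=toy_score,
    combine=lambda a, b: {c: 0.5 * a[c] + 0.5 * b[c] for c in a},
)
result = classifier.classify("nba nba dota. more text")
print(result)
```

Keeping each module as a separate callable mirrors the block diagram and lets any single stage (for example, the summarizer) be swapped without touching the rest.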
In an optional implementation manner, the second obtaining module 15 includes:
the first acquisition unit is used for acquiring a preset keyword set belonging to each preset text category;
the first determining unit is used for determining abstract keywords in the preset keyword set from the at least one abstract keyword;
the second obtaining unit is used for obtaining the weight of each determined abstract keyword in the preset keyword set;
the statistical unit is used for counting the number of times each determined abstract keyword occurs in the abstract;
and the second determining unit is used for determining a second possibility that the text to be classified belongs to the preset text category according to the weights and the occurrence counts.
In an optional implementation manner, the second determining unit includes:
a calculating subunit, configured to calculate, for each determined abstract keyword, a product of a weight of the abstract keyword in the preset keyword set and a number of times that the abstract keyword appears in the abstract, and use the product as a fourth possibility that the abstract keyword belongs to the preset text category;
and the summing subunit is used for summing the fourth possibility that each determined abstract keyword belongs to the preset text category to obtain a second possibility that the text to be classified belongs to the preset text category.
In an optional implementation, the first determining module 16 includes:
the third acquiring unit is used for acquiring a preset text weight and a preset abstract weight;
the first calculation unit is used for calculating a first product between a first possibility that the text to be classified belongs to the preset text category and the text weight for each preset text category;
the second calculation unit is used for calculating a second product between a second possibility that the text to be classified belongs to the preset text category and the abstract weight;
and the summation unit is used for summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category.
In the embodiment of the invention, at least one text keyword is extracted from the text to be classified, where the text keyword describes the subject content of the text to be classified; a first possibility that the text to be classified belongs to each preset text category is obtained according to the at least one text keyword; an abstract is extracted from the text to be classified, where the abstract comprises some of the sentences in the text to be classified, and those sentences describe the subject content of the text to be classified; at least one abstract keyword is extracted from the abstract; a second possibility that the text to be classified belongs to each preset text category is obtained according to the at least one abstract keyword; a third possibility that the text to be classified belongs to each preset text category is determined according to the first possibility and the second possibility; and the preset text category with the maximum third possibility is determined as the target text category to which the text to be classified belongs.
By the method, the texts to be classified can be classified without manual participation, and labor cost is reduced. Secondly, because the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified can accurately summarize the core content of the text to be classified, when the text to be classified is classified, the category to which the text to be classified belongs is determined by combining the text keywords in the text to be classified and the abstract keywords of the abstract in the text to be classified, which is beneficial to improving the accuracy of classification.
Since the device embodiment is basically similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The text classification method and the text classification device provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and the implementation of the invention, and the description of these examples is intended only to help in understanding the method and the core idea of the invention. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A method of text classification, the method comprising:
the text to be classified comprises a plurality of sentences, each sentence consists of at least one word, and the words comprise at least nouns, verbs, adjectives and adverbs;
extracting at least one text keyword in a text to be classified, wherein the text keyword is used for describing the subject content of the text to be classified;
according to the at least one text keyword, acquiring a first possibility that the text to be classified belongs to each preset text category respectively;
extracting an abstract from the text to be classified, wherein the abstract comprises some of the sentences in the text to be classified, these sentences describe the subject content of the text to be classified, and the abstract is extracted from the text to be classified by using a text abstract algorithm;
extracting at least one abstract keyword from the abstract, wherein the abstract keyword is used for describing the subject content of the abstract;
acquiring, according to the at least one abstract keyword, a second possibility that the text to be classified belongs to each preset text category respectively;
determining, according to the first possibility and the second possibility, a third possibility that the text to be classified belongs to each preset text category respectively, wherein the determining comprises:
acquiring a preset text weight and a preset abstract weight, wherein the preset text weight and the preset abstract weight are locally set in advance;
for each preset text category, calculating a first product between a first possibility that the text to be classified belongs to the preset text category and the text weight;
calculating a second product between a second possibility that the text to be classified belongs to the preset text category and the abstract weight;
summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category;
and determining the preset text category with the maximum third possibility as the target text category to which the text to be classified belongs.
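The weighted combination recited in claim 1 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `fuse_and_pick`, the dictionary representation of the possibilities, and the example weights of 0.5 are hypothetical; the claim itself only states that the text weight and the abstract weight are preset locally.

```python
from typing import Dict, Tuple


def fuse_and_pick(
    first: Dict[str, float],
    second: Dict[str, float],
    text_weight: float,
    abstract_weight: float,
) -> Tuple[str, Dict[str, float]]:
    """Combine per-category possibilities and pick the best category.

    `first` holds the possibility from the text keywords, `second` the
    possibility from the abstract keywords (both keyed by category name).
    """
    third: Dict[str, float] = {}
    for category, first_possibility in first.items():
        # First product: text-keyword possibility times the text weight.
        first_product = first_possibility * text_weight
        # Second product: abstract-keyword possibility times the abstract weight.
        # A category missing from `second` is treated as 0.0 (an assumption).
        second_product = second.get(category, 0.0) * abstract_weight
        # Third possibility: sum of the two products.
        third[category] = first_product + second_product
    # Target category: the one with the maximum third possibility.
    target = max(third, key=third.get)
    return target, third


# Hypothetical example values.
first = {"sports": 3.0, "finance": 5.0}
second = {"sports": 4.0, "finance": 1.0}
target, third = fuse_and_pick(first, second, text_weight=0.5, abstract_weight=0.5)
print(target, third)
```

In this example the text keywords alone favor "finance", but the abstract keywords shift the combined result to "sports", which is the kind of correction the combination is meant to allow.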
2. The method according to claim 1, wherein the obtaining a second possibility that the text to be classified respectively belongs to each preset text category according to the at least one abstract keyword comprises:
for each preset text category, acquiring a preset keyword set belonging to the preset text category;
determining abstract keywords in the preset keyword set from the at least one abstract keyword;
acquiring the weight of each determined abstract keyword in the preset keyword set;
counting the number of times each determined abstract keyword appears in the abstract;
and determining a second possibility that the text to be classified belongs to the preset text category according to the weight and the number of times.
3. The method according to claim 2, wherein the determining the second possibility that the text to be classified belongs to the preset text category according to the weight and the number of times comprises:
for each determined abstract keyword, calculating the product of the weight of the abstract keyword in the preset keyword set and the number of times the abstract keyword appears in the abstract, and taking the product as the fourth possibility that the abstract keyword belongs to the preset text category;
and summing the fourth possibilities that the determined abstract keywords belong to the preset text category to obtain a second possibility that the text to be classified belongs to the preset text category.
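The scoring in claims 2 and 3 (weight times number of occurrences, summed over the matched keywords) can be sketched as below. The names `second_possibility` and `preset_keyword_weights`, the tokenized form of the abstract, and the example weights are assumptions made for illustration; the claims do not specify how the abstract is tokenized or how the preset weights are obtained.

```python
from collections import Counter
from typing import Dict, Iterable, List


def second_possibility(
    abstract_tokens: List[str],
    abstract_keywords: Iterable[str],
    preset_keyword_weights: Dict[str, float],
) -> float:
    """Score one preset category from the abstract keywords.

    `preset_keyword_weights` maps each keyword in the category's preset
    keyword set to its weight in that set.
    """
    # Number of times each token appears in the abstract.
    counts = Counter(abstract_tokens)
    # Keep only the abstract keywords that are in the preset keyword set.
    matched = {kw for kw in abstract_keywords if kw in preset_keyword_weights}
    # Fourth possibility per keyword = weight * occurrence count;
    # the second possibility is the sum over the matched keywords.
    return sum(preset_keyword_weights[kw] * counts[kw] for kw in matched)


# Hypothetical abstract, keywords, and preset keyword set for "sports".
tokens = ["match", "goal", "team", "goal", "bank"]
keywords = ["goal", "team", "bank"]
sports_weights = {"goal": 2.0, "team": 1.0}
score = second_possibility(tokens, keywords, sports_weights)
print(score)  # 2.0 * 2 + 1.0 * 1 = 5.0
```

Calling the function once per preset text category yields the per-category second possibilities used in claim 1.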
4. An apparatus for classifying text, the apparatus comprising:
the first extraction module is used for extracting at least one text keyword from a text to be classified, wherein the text keyword is used for describing the subject content of the text to be classified; the text to be classified comprises a plurality of sentences, each sentence consists of at least one word, and the words comprise at least nouns, verbs, adjectives and adverbs;
the first obtaining module is used for obtaining a first possibility that the text to be classified belongs to each preset text category respectively according to the at least one text keyword;
the second extraction module is used for extracting an abstract from the text to be classified, wherein the abstract comprises some of the sentences in the text to be classified, these sentences describe the subject content of the text to be classified, and the abstract is extracted from the text to be classified by using a text abstract algorithm;
the third extraction module is used for extracting at least one abstract keyword from the abstract, wherein the abstract keyword is used for describing the subject content of the abstract;
the second obtaining module is used for obtaining a second possibility that the text to be classified belongs to each preset text category respectively according to the at least one abstract keyword;
the first determining module is used for determining a third possibility that the text to be classified belongs to each preset text category according to the first possibility and the second possibility;
wherein the first determining module comprises:
the third acquiring unit is used for acquiring a preset text weight and a preset abstract weight; the preset text weight and the preset abstract weight are set locally in advance;
the first calculation unit is used for calculating a first product between a first possibility that the text to be classified belongs to the preset text category and the text weight for each preset text category;
the second calculation unit is used for calculating a second product between a second possibility that the text to be classified belongs to the preset text category and the abstract weight;
the summation unit is used for summing the first product and the second product to obtain a third possibility that the text to be classified belongs to the preset text category;
and the second determining module is used for determining the preset text category with the maximum third possibility to which the text to be classified belongs as the target text category to which the text to be classified belongs.
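One way to picture how the modules of claim 4 fit together is a small class whose fields stand in for the modules. Everything below is a hypothetical sketch: the class `TextClassifier`, the stand-in helpers `simple_keywords`, `first_sentence`, and `keyword_score`, the `PRESET` keyword sets, and the choice to reuse one keyword extractor for both the first and third extraction modules are assumptions, since the claim does not prescribe any particular keyword or abstract algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scores = Dict[str, float]

# Hypothetical preset keyword sets with weights, one per preset category.
PRESET: Dict[str, Dict[str, float]] = {
    "sports": {"goal": 2.0, "team": 1.0},
    "finance": {"bank": 2.0, "loan": 1.0},
}


def simple_keywords(text: str) -> List[str]:
    # Stand-in keyword extractor: lowercase words longer than three letters.
    words = (w.strip(".,").lower() for w in text.split())
    return [w for w in words if len(w) > 3]


def first_sentence(text: str) -> str:
    # Stand-in abstract extractor: take the first sentence.
    return text.split(".")[0]


def keyword_score(keywords: List[str], source: str) -> Scores:
    # Weight times occurrence count, summed per preset category.
    tokens = simple_keywords(source)
    return {
        cat: sum(w * tokens.count(k) for k, w in weights.items() if k in keywords)
        for cat, weights in PRESET.items()
    }


@dataclass
class TextClassifier:
    extract_keywords: Callable[[str], List[str]]  # first/third extraction modules
    extract_abstract: Callable[[str], str]        # second extraction module
    score: Callable[[List[str], str], Scores]     # first/second obtaining modules
    text_weight: float
    abstract_weight: float

    def classify(self, text: str) -> str:
        first = self.score(self.extract_keywords(text), text)
        abstract = self.extract_abstract(text)
        second = self.score(self.extract_keywords(abstract), abstract)
        # First determining module: weighted sum per category.
        third = {
            c: first[c] * self.text_weight + second[c] * self.abstract_weight
            for c in first
        }
        # Second determining module: category with the maximum third possibility.
        return max(third, key=third.get)


clf = TextClassifier(simple_keywords, first_sentence, keyword_score, 0.5, 0.5)
result = clf.classify("The team scored a late goal. The bank sponsor offered the team a loan.")
print(result)
```

Passing the extractors in as callables keeps the sketch independent of any specific keyword or summarization algorithm, so real implementations could be swapped in without changing `classify`.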
5. The apparatus of claim 4, wherein the second obtaining module comprises:
the first acquisition unit is used for acquiring a preset keyword set belonging to each preset text category;
the first determining unit is used for determining abstract keywords in the preset keyword set from the at least one abstract keyword;
the second obtaining unit is used for obtaining the weight of each determined abstract keyword in the preset keyword set;
the statistical unit is used for counting the number of times each determined abstract keyword appears in the abstract;
and the second determining unit is used for determining a second possibility that the text to be classified belongs to the preset text category according to the weight and the number of times.
6. The apparatus according to claim 5, wherein the second determining unit comprises:
a calculating subunit, configured to calculate, for each determined abstract keyword, a product of a weight of the abstract keyword in the preset keyword set and a number of times that the abstract keyword appears in the abstract, and use the product as a fourth possibility that the abstract keyword belongs to the preset text category;
and the summing subunit is used for summing the fourth possibility that each determined abstract keyword belongs to the preset text category to obtain a second possibility that the text to be classified belongs to the preset text category.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the text classification method according to any one of claims 1 to 3 are implemented by the processor when executing the program.
8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the text classification method according to one of the claims 1 to 3.
CN201810118843.XA 2018-02-06 2018-02-06 Text classification method and device Active CN108415959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810118843.XA CN108415959B (en) 2018-02-06 2018-02-06 Text classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810118843.XA CN108415959B (en) 2018-02-06 2018-02-06 Text classification method and device

Publications (2)

Publication Number Publication Date
CN108415959A CN108415959A (en) 2018-08-17
CN108415959B CN108415959B (en) 2021-06-25

Family

ID=63126883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810118843.XA Active CN108415959B (en) 2018-02-06 2018-02-06 Text classification method and device

Country Status (1)

Country Link
CN (1) CN108415959B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388714B (en) * 2018-10-23 2020-11-24 东软集团股份有限公司 Text labeling method, device, equipment and computer readable storage medium
CN112579784B (en) * 2021-03-01 2021-06-01 江西师范大学 Cloud edge collaborative document classification system and method based on deep reinforcement learning

Citations (6)

Publication number Priority date Publication date Assignee Title
CA2371688C (en) * 1999-05-05 2008-09-09 West Publishing Company D/B/A West Group Document-classification system, method and software
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval
CN104915356A (en) * 2014-03-13 2015-09-16 ***通信集团上海有限公司 Text classification correcting method and device
CN105095223A (en) * 2014-04-25 2015-11-25 阿里巴巴集团控股有限公司 Method for classifying texts and server
CN106897262A (en) * 2016-12-09 2017-06-27 阿里巴巴集团控股有限公司 A kind of file classification method and device and treating method and apparatus
WO2017167067A1 (en) * 2016-03-30 2017-10-05 阿里巴巴集团控股有限公司 Method and device for webpage text classification, method and device for webpage text recognition

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US20030050927A1 (en) * 2001-09-07 2003-03-13 Araha, Inc. System and method for location, understanding and assimilation of digital documents through abstract indicia
US20060059121A1 (en) * 2004-08-31 2006-03-16 Microsoft Corporation Method and system for identifying an author of a paper
CN101063975A (en) * 2007-02-15 2007-10-31 刘二中 Method and system for electronic text-processing and searching
CN101676907A (en) * 2008-09-16 2010-03-24 北京雷速科技有限公司 Method and system of directionally acquiring Internet resources
CN106294425B (en) * 2015-05-26 2019-11-19 富泰华工业(深圳)有限公司 The automatic image-text method of abstracting and system of commodity network of relation article
CN105243130A (en) * 2015-09-29 2016-01-13 中国电子科技集团公司第三十二研究所 Text processing system and method for data mining
US10037365B2 (en) * 2016-01-29 2018-07-31 Integral Search International Ltd. Computer-implemented patent searching method in connection to matching degree
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method
CN106708959A (en) * 2016-11-30 2017-05-24 重庆大学 Combination drug recognition and ranking method based on medical literature database

Non-Patent Citations (1)

Title
An improved feature selection method (一种改进的特征选取方法); Yuan Junying (苑俊英) et al.; 《科技信息》; 2009-04-30; Vol. 04 (2009); pp. 172-173 *

Also Published As

Publication number Publication date
CN108415959A (en) 2018-08-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant