CN109002443B - Text information classification method and device - Google Patents

Publication number
CN109002443B
Authority
CN
China
Prior art keywords
root
text information
preset
word
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710420071.0A
Other languages
Chinese (zh)
Other versions
CN109002443A (en)
Inventor
葛婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text information classification method and device, relating to the technical field of information classification, whose main purpose is to identify keywords in material information and classify the material information based on those keywords. The main technical scheme is as follows: segmenting the acquired text information to obtain a plurality of roots, wherein the roots are preset keywords for classifying the text information; combining the roots into a plurality of root groups according to the positions of the roots in the text information; matching the root groups against preset labels, wherein each preset label is an identifier corresponding to a combination of roots; selecting, according to a preset rule, the preset label corresponding to one of the successfully matched root groups as the classification label of the text information; and classifying the text information according to the classification label. The invention is mainly used for classifying text information.

Description

Text information classification method and device
Technical Field
The present invention relates to the field of information classification technologies, and in particular, to a method and an apparatus for classifying text information.
Background
The basic idea of SEM (Search Engine Marketing) is to let users find information through search and click through to a website or web page to learn more about what they need. The role of an SEM system is to obtain the maximum volume of visits from search engines with the minimum investment and to generate corresponding business value. During information retrieval, a user's search terms trigger preset keywords; the keywords bring up related advertisement creatives, and by clicking a creative the user enters the advertiser's website, producing traffic or a conversion.
Data such as keywords and creatives in an advertiser's SEM account are collectively referred to as material, and business personnel often need to update material information. For new material information, the question is how to assign it to the corresponding existing accounts or to new accounts. The current approach is to identify the material information manually and assign it to the corresponding accounts. For large volumes of material information, however, this requires a great deal of labor to remain effective, and the accuracy of material classification drops sharply under such a heavy workload.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for classifying text information, and mainly aims to identify keywords in material information and classify the material information based on the keywords.
To achieve this purpose, the invention mainly provides the following technical scheme:
In one aspect, the present invention provides a method for classifying text information, including:
segmenting the acquired text information to obtain a plurality of roots, wherein the roots are preset keywords for classifying the text information;
combining the roots into a plurality of root groups according to the positions of the roots in the text information;
matching the root groups against preset labels, wherein each preset label is an identifier corresponding to a combination of roots;
selecting, according to a preset rule, the preset label corresponding to one of the successfully matched root groups as the classification label of the text information;
and classifying the text information according to the classification label.
Preferably, before the acquired text information is segmented, the method further includes:
setting a preset rule, specifically:
setting the combination of roots in each preset label, including the order and number of the roots;
setting different priorities for preset labels having the same number of roots.
Preferably, selecting, according to a preset rule, the preset label corresponding to one of the successfully matched root groups as the classification label of the text information includes:
determining the number of roots in each successfully matched root group;
when the root group with the largest number of roots is unique, taking the preset label corresponding to that root group as the classification label of the text information;
and when there are multiple root groups with the largest number of roots, taking the preset label corresponding to the highest-priority root group among them as the classification label of the text information.
Preferably, matching the root groups against preset labels includes:
matching the preset labels according to the number of roots in the root group and/or the order of the roots.
Preferably, segmenting the acquired text information to obtain a plurality of roots includes:
when a segmented word in the segmentation result is a preset keyword, taking that word as a root;
and when a segmented word in the segmentation result contains a plurality of preset keywords, extracting each of those preset keywords as a root of the text information.
Preferably, combining the roots into a plurality of root groups according to the positions of the roots in the text information includes:
determining the position at which each root appears in the text information;
and generating root groups, each composed of a plurality of roots, according to the order of those positions.
In another aspect, the invention also provides a text information classification device, comprising:
a word segmentation unit, configured to segment the acquired text information to obtain a plurality of roots, the roots being preset keywords for classifying the text information;
a combining unit, configured to combine the roots obtained by the word segmentation unit into a plurality of root groups according to their positions in the text information;
a matching unit, configured to match the root groups produced by the combining unit against preset labels, each preset label being an identifier corresponding to a combination of roots;
a selecting unit, configured to select, according to a preset rule, the preset label corresponding to one of the root groups successfully matched by the matching unit as the classification label of the text information;
and a labeling unit, configured to classify the text information according to the classification label determined by the selecting unit.
Preferably, the device further comprises:
a setting unit, configured to set a preset rule before the word segmentation unit segments the acquired text information; specifically, the setting unit comprises:
a first setting module, configured to set the combination of roots in each preset label, including the order and number of the roots;
and a second setting module, configured to set different priorities for preset labels having the same number of roots.
Preferably, the selecting unit includes:
a determining module, configured to determine the number of roots in each successfully matched root group;
a selecting module, configured to, when the root group with the largest number of roots determined by the determining module is unique, take the preset label corresponding to that root group as the classification label of the text information;
the selecting module being further configured to, when there are multiple root groups with the largest number of roots, take the preset label corresponding to the highest-priority root group among them as the classification label of the text information.
Preferably, the matching unit is further configured to match the preset labels according to the number of roots in the root group and/or the order of the roots.
Preferably, the word segmentation unit is further configured to: when a segmented word in the segmentation result is a preset keyword, take that word as a root; and when a segmented word contains a plurality of preset keywords, extract each of those preset keywords as a root of the text information.
Preferably, the combining unit includes:
a determining module, configured to determine the position at which each root appears in the text information;
and a generating module, configured to generate root groups, each composed of a plurality of roots, according to the order of those positions.
In order to achieve the above object, according to another aspect of the present invention, there is provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the above text information classification method.
In order to achieve the above object, according to another aspect of the present invention, a processor for executing a program is provided, wherein the program executes the method for classifying text information described above.
The text information classification method and device provided by the invention are mainly used, when advertisements are delivered in search engine marketing, to classify the delivered advertisement text information automatically and record the account to which each piece of advertisement text information belongs. The method marks each piece of text information with a single classification label, so the text information is classified accurately and the duplicate-assignment errors that arise when the same text information is placed in several accounts are avoided. It also avoids the misclassification and low efficiency of manual operation, is particularly suited to large-scale advertisement delivery, and improves classification efficiency while maintaining accuracy.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
Fig. 1 is a flowchart illustrating a method for classifying text information according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating another text information classification method according to an embodiment of the present invention;
Fig. 3 is a block diagram showing a text information classification apparatus according to an embodiment of the present invention;
Fig. 4 is a block diagram showing another text information classification apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
An embodiment of the invention provides a method for classifying text information. As shown in Fig. 1, the method is applied to classify advertisement text information delivered in search engine marketing, assigning each piece of advertisement text information to its corresponding account for subsequent statistics. The specific steps are:
101. Segment the acquired text information to obtain a plurality of roots.
The text information is not limited to advertisement text supplied by the user; a large batch of text information can also be generated from user-supplied keywords with a keyword expansion tool or method.
Each piece of acquired text information is segmented in turn. The embodiment does not limit the segmentation method: the algorithm may be based on string matching, on understanding, or on statistics. The segmentation result is then screened against the preset keywords to find the preset keywords contained in the text information, that is, to determine its roots; a root is a preset keyword used for classifying text information. The preset keywords are generally set by the user before the method is executed, and at least one preset keyword may be set for each account.
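The screening in step 101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: segmentation is assumed to have already produced a list of tokens, and the function name `extract_roots` and the sample tokens are hypothetical.

```python
def extract_roots(tokens, keywords):
    """Return the preset keywords (roots) found in the tokens, in order of appearance."""
    roots = []
    for token in tokens:
        if token in keywords:
            # The segmented word is itself a preset keyword.
            roots.append(token)
        else:
            # The segmented word may contain several preset keywords;
            # keep each one, ordered by where it starts inside the word.
            hits = sorted((token.find(k), k) for k in keywords if k in token)
            roots.extend(k for _, k in hits)
    return roots


# Hypothetical tokens: the first one contains two preset keywords.
tokens = ["bankfinancing", "product", "which", "good"]
keywords = {"bank", "financing", "which"}
print(extract_roots(tokens, keywords))  # ['bank', 'financing', 'which']
```

Substring lookup is only one way to detect keywords inside a segmented word; the text does not prescribe how this detection is done.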
102. Combine the roots into a plurality of root groups according to the positions of the roots in the text information.
After the text information has been segmented and its roots determined, the roots are combined to generate a plurality of root groups. The embodiment does not limit the specific way of combining. If no combination mode is preset, a set of root groups covering every possible combination of the roots contained in the text information is generated. If a combination mode is preset, the root groups can be generated directly in that mode, or the root groups that comply with the preset rule can be filtered out of the set containing every combination.
Note that the root groups in this step are combined according to the positions of the roots in the text information; that is, the combination mode is defined in terms of where the roots occur in the text. The positional constraint may be the order in which the roots appear, or the paragraph or sentence in which they appear, in which case roots appearing in the same paragraph or sentence are combined together as a batch.
103. Match the root groups against the preset labels.
A preset label is an identifier corresponding to a combination of roots, that is, to a preset root group; each preset root group has exactly one corresponding identifier, which is its preset label.
The root groups obtained are matched one by one against the preset labels. Matching checks whether the roots contained in a root group are the same as the roots of the preset label, and whether the roots are arranged in the same order as in the preset label. What is checked can be adjusted to the application scenario, which controls how many root groups match a preset label. For example, suppose the matching rule requires only that a root group contain the roots of the preset label in the same order. If the roots of the preset label are (1,2) and the root groups to be matched are (4,1,2), (1,2) and (2,3,1), then (4,1,2) and (1,2) satisfy the matching rule.
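The relaxed matching rule in the example above can be sketched as an ordered-containment test. The sketch assumes that "same order" means the label's roots appear in the root group as a subsequence, not necessarily adjacent; the text does not settle this point, and the function name `contains_in_order` is hypothetical.

```python
def contains_in_order(group, label_roots):
    """True if every root of the label occurs in the group, in the label's order."""
    remaining = iter(group)
    # `root in remaining` advances the iterator, so each later root
    # must be found after the position of the previous one.
    return all(root in remaining for root in label_roots)


label_roots = (1, 2)
candidates = [(4, 1, 2), (1, 2), (2, 3, 1)]
print([g for g in candidates if contains_in_order(g, label_roots)])
# [(4, 1, 2), (1, 2)]
```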
Depending on the matching result, when the text information contains no root group that matches a preset label, the text information is marked with an unknown label; the unknown label tells the user that the text information cannot be identified and classified with the keywords currently provided. When the text information contains exactly one root group that matches a preset label, the preset label matched by that root group is taken as the classification label of the text information. When the text information contains multiple root groups that match preset labels, step 104 is executed to determine the classification label.
Note that, from the point of view of the preset labels, a successful match means that the root groups of the text information correspond to the roots of one preset label or to the roots of several preset labels.
104. Select, according to a preset rule, the preset label corresponding to one of the successfully matched root groups as the classification label of the text information.
When root groups in the text information match several preset labels, this step selects one of those preset labels as the classification label of the text information. That is, the classification label marked on the text information is one of the matched preset labels.
The selection follows a preset rule, which the user can define. The rule may be based on the number of roots contained in each preset label; alternatively, priorities may be assigned to the preset labels and the rule set to select by priority.
105. Classify the text information according to the classification label.
Once the text information has been marked with its classification label, it can be assigned to the corresponding account according to that label. An account may correspond to one classification label or to several.
For text information marked with the unknown label, a separate account can be created to collect it; after classification labels or keywords have been added, the text information in that account is redistributed.
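A minimal sketch of step 105, assuming each text already carries its classification label. The mapping `label_to_account`, the holding-account name `"unclassified"` and the sample data are hypothetical; several labels may point to the same account, as the text allows.

```python
from collections import defaultdict


def assign_to_accounts(labeled_texts, label_to_account, unknown_account="unclassified"):
    """Group texts by account; labels with no account go to a holding account."""
    accounts = defaultdict(list)
    for text, label in labeled_texts:
        accounts[label_to_account.get(label, unknown_account)].append(text)
    return dict(accounts)


labeled = [("text A", "bank financing"), ("text B", "unknown"), ("text C", "bank financing")]
print(assign_to_accounts(labeled, {"bank financing": "account-1"}))
# {'account-1': ['text A', 'text C'], 'unclassified': ['text B']}
```

Texts collected in the holding account can be passed through classification again once new labels or keywords have been configured.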
It can be seen from the above that the text information classification method of this embodiment classifies text information by marking it with a classification label. The classification label is obtained by analyzing the keywords contained in the text information, so that the preset label that best matches the content of the text information is marked as its classification label, achieving automatic labeling and automatic classification of text information.
To describe the proposed text information classification method in more detail, an embodiment is described below with a specific example. As shown in Fig. 2, the method parses the text information and matches it against the preset labels in the following steps:
201. Set a preset rule.
The preset rule can be customized by the user through a settings interface. It defines the principle to be followed later when a single preset label has to be selected.
In this embodiment, the settings include:
1. Set the combination of roots in each preset label, including the order and number of the roots.
That is, the preset labels used for matching are determined; for the roots contained in each preset label, the number of roots and their order must also be specified.
2. Set different priorities for preset labels having the same number of roots.
In this embodiment, different preset labels have different priorities. Among preset labels with different numbers of roots, the more roots a label has, the higher its priority. Among preset labels with the same number of roots, the priority is determined by user settings: the user either assigns priorities to the preset labels directly, or assigns priorities to the roots, from which the priority of each preset label is derived.
202. Segment the acquired text information to obtain a plurality of roots.
This step is the same as step 101 in the above embodiment, which may be consulted for the specific implementation. In this embodiment, root extraction is based on the roots of the preset labels set in step 201: those roots serve as the preset keywords and are compared with each segmented word in the segmentation result. When a segmented word is a preset keyword, the text information is determined to contain that root; when a segmented word contains several preset keywords, all of them are taken as roots of the text information.
203. Combine the roots into a plurality of root groups according to the positions of the roots in the text information.
In this step, roots are combined in the order in which they appear in the text information.
Because the word corresponding to a root may occur several times in the text information, the position of a root is taken to be the position of its first occurrence. The distinct roots are then sorted by position, and root groups are generated according to that ordering. For example, if the roots appear in a piece of text information in the order (1,2,3), the resulting root groups are exactly four: (1,2), (2,3), (1,3) and (1,2,3). Combinations that do not follow the order of appearance, such as (1,3,2) and (2,1), are excluded.
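The ordered combination in step 203 can be sketched with `itertools.combinations`, which preserves the order of its input. The sketch assumes root groups contain at least two roots, following the example above; the function name `ordered_root_groups` is hypothetical.

```python
from itertools import combinations


def ordered_root_groups(roots_in_text):
    """Build every root group that respects the order of first appearance."""
    # dict.fromkeys drops repeats while keeping the first occurrence of each root.
    first_seen = list(dict.fromkeys(roots_in_text))
    groups = []
    for size in range(2, len(first_seen) + 1):
        groups.extend(combinations(first_seen, size))
    return groups


# Root 1 occurs twice; only its first position counts.
print(ordered_root_groups([1, 2, 1, 3]))
# [(1, 2), (1, 3), (2, 3), (1, 2, 3)]
```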
204. Match the preset labels according to the number of roots in the root group and/or the order of the roots.
The root groups obtained are matched against the roots of the preset labels. In this step, matching may be by the number of roots, by the order of the roots, or by both together. For scenarios that require exact matching, the last mode is used to determine whether the content of the text information fits a preset label: the root group must have the same number of roots as the preset label, arranged in the same order. A match therefore succeeds only when the root group is identical to the roots of the preset label.
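For the exact-match mode, one possible sketch indexes the preset labels by their ordered tuple of roots, so that a root group matches only when it is identical to a label's roots. The dictionary layout, the function name `match_exact` and the sample labels are assumptions made for illustration.

```python
def match_exact(root_groups, preset_labels):
    """preset_labels maps a label name to its ordered tuple of roots."""
    by_roots = {tuple(roots): name for name, roots in preset_labels.items()}
    matched = []
    for group in root_groups:
        name = by_roots.get(tuple(group))
        # Record each label once, however many root groups hit it.
        if name is not None and name not in matched:
            matched.append(name)
    return matched


preset_labels = {"label A": ("x", "y"), "label B": ("x", "y", "z")}
groups = [("x", "y"), ("y", "z"), ("x", "y", "z"), ("x", "y")]
print(match_exact(groups, preset_labels))  # ['label A', 'label B']
```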
Note that once a preset label has been successfully matched by one root group of the text information, the other root groups are no longer matched against that preset label. In other words, what is determined is only whether the text information matches the preset label, not how many root groups match it.
205. Determine the unique classification label for the text information and classify according to it.
Based on the matching in step 204: if no preset label is matched, the text information is marked with an unknown label; if exactly one preset label is matched, that preset label is marked on the text information as its classification label; if several preset labels are matched, one of them is selected according to the rule set in step 201 and marked on the text information as its classification label, that is, the preset label with the largest number of roots or the highest priority is selected.
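The three outcomes of step 205 can be sketched as below. The sketch assumes priorities are integers where a larger value wins and are consulted only to break ties between labels with the same number of roots; the names `select_label`, `UNKNOWN` and the sample labels are hypothetical.

```python
UNKNOWN = "unknown"


def select_label(matched, priority):
    """matched: {label name: tuple of roots}; priority: {label name: int}."""
    if not matched:
        return UNKNOWN  # no preset label fits the text
    if len(matched) == 1:
        return next(iter(matched))  # the single matched label
    # Several matches: most roots first, then user-assigned priority.
    return max(matched, key=lambda name: (len(matched[name]), priority.get(name, 0)))


matched = {"A": ("x", "y"), "B": ("x", "z"), "C": ("x", "y", "z")}
print(select_label(matched, {"A": 1, "B": 2}))  # C
print(select_label({"A": ("x", "y"), "B": ("x", "z")}, {"A": 1, "B": 2}))  # B
print(select_label({}, {}))  # unknown
```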
After labeling is complete, each piece of text information is assigned to the corresponding account according to its classification label.
A specific application is described below:
When advertising information is delivered for a client that sells financial products, suppose the delivered text information is "capital-preservation financing products: which bank is good", where "capital-preservation" (bao ben), "financing", "bank" and "which" are roots. The preset labels set on the basis of these roots are as follows:
Capital-preservation financing (capital-preservation, financing)
Bank financing (bank, financing)
Capital-preservation financing question word (capital-preservation, financing, which)
Bank financing question word (bank, financing, which)
Segmenting the text information gives the words: capital-preservation, financing, product, which, bank, good. The roots determined from them, in order of appearance, are: capital-preservation, financing, which, bank.
The root groups obtained by combining the roots in order are matched against the preset labels, giving two matched preset labels: (capital-preservation, financing) and (capital-preservation, financing, which). According to the number of roots in the preset labels, the preset label "capital-preservation financing question word" is determined to be the classification label of the text information and is marked on the text information for subsequent classification.
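The worked example can be traced end to end with a short sketch. It is illustrative only: the text is assumed to be already segmented, the four roots are written as plain stand-ins (`baoben`, `financing`, `bank`, `which`), exact matching is used, and priority tie-breaking is omitted because the two matched labels differ in root count. The names `PRESET_LABELS` and `classify` are hypothetical.

```python
from itertools import combinations

# Label name -> ordered roots, mirroring the four preset labels of the example.
PRESET_LABELS = {
    "baoben financing": ("baoben", "financing"),
    "bank financing": ("bank", "financing"),
    "baoben financing question word": ("baoben", "financing", "which"),
    "bank financing question word": ("bank", "financing", "which"),
}


def classify(tokens, preset_labels):
    keywords = {root for roots in preset_labels.values() for root in roots}
    # Keep each root once, at the position of its first occurrence.
    roots = list(dict.fromkeys(t for t in tokens if t in keywords))
    # All root groups of two or more roots that respect the order of appearance.
    groups = {g for n in range(2, len(roots) + 1) for g in combinations(roots, n)}
    matched = [name for name, label_roots in preset_labels.items() if label_roots in groups]
    if not matched:
        return "unknown"
    # The label with the most roots wins.
    return max(matched, key=lambda name: len(preset_labels[name]))


tokens = ["baoben", "financing", "product", "which", "bank", "good"]
print(classify(tokens, PRESET_LABELS))  # baoben financing question word
```

The two bank labels do not match because "bank" appears after "financing" and "which" in the text, so no ordered root group starts with it.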
To achieve the above purpose, according to another aspect of the invention, an embodiment of the invention further provides a storage medium that includes a stored program, wherein the program, when run, controls the device on which the storage medium is located to execute the above text information classification method.
To achieve the above purpose, according to another aspect of the invention, an embodiment of the invention further provides a processor configured to run a program, wherein the program, when run, executes the above text information classification method.
Further, as an implementation of the above method, an embodiment of the invention provides a text information classification device. This device embodiment corresponds to the method embodiment above; for ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the device of this embodiment can implement everything in the method embodiment. The device is used in equipment that analyzes or acquires text information and, as shown in Fig. 3, comprises:
a word segmentation unit 31, configured to segment words of the acquired text information to obtain multiple roots, where the roots are preset keywords used for text information classification;
a combining unit 32, configured to combine the roots obtained by the word segmentation unit 31 into a plurality of root groups according to their positions in the text information;
a matching unit 33, configured to match the root group combined by the combining unit 32 with a preset tag, where the preset tag is a combination identifier corresponding to multiple root combinations;
a selecting unit 34, configured to select a preset tag corresponding to one root group from the root groups successfully matched by the matching unit 33 according to a preset rule, as a classification tag of the text information;
and a labeling unit 35, configured to classify the text information according to the classification label determined by the selecting unit 34.
Further, as shown in fig. 4, the apparatus further includes:
the setting unit 36 is configured to set a preset rule before the word segmentation unit 31 performs word segmentation on the acquired text information, and specifically includes:
the first setting module 361 is configured to set the combined root in the preset tag, including the order and number of the root;
a second setting module 362, configured to set different priorities for preset tags having the same root number.
Further, as shown in fig. 4, the selecting unit 34 includes:
a determining module 341, configured to determine the number of roots in the root group for which matching is successful;
a selecting module 342, configured to, when the root group with the largest number of roots determined by the determining module 341 is unique, take a preset tag corresponding to the root group as a classification tag of the text information;
the selecting module 342 is further configured to, when there are multiple root groups with the largest number of roots, extract a preset tag corresponding to a root group with the highest priority in the multiple root groups as the classification tag of the text information.
Further, the matching unit 33 is further configured to match the preset tag according to the number of roots in the root group and/or the arrangement order of the roots.
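One possible reading of matching "according to the number of roots and/or the arrangement order" is sketched below in Python. The function name `group_matches_tag` and the `ordered` switch are hypothetical; the patent text does not specify how the two criteria are combined.

```python
def group_matches_tag(root_group, tag_roots, ordered=True):
    """Check whether a root group matches the roots of a preset tag.

    The number of roots must always agree. With ordered=True the
    arrangement order must agree as well; with ordered=False only
    the roots themselves are compared.
    """
    if len(root_group) != len(tag_roots):
        return False
    if ordered:
        return tuple(root_group) == tuple(tag_roots)
    return sorted(root_group) == sorted(tag_roots)
```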
Further, the word segmentation unit 31 is further configured to: when a word in the word segmentation result is a preset keyword, use the word as a root; and when a word in the word segmentation result contains a plurality of preset keywords, extract each of those preset keywords as a root of the text information.
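The two cases of root extraction can be illustrated with the Python sketch below. It assumes the word segmentation result is already available as a list of strings and that "contains" means substring containment; the function name `extract_roots` and the sample words are hypothetical.

```python
def extract_roots(tokens, keywords):
    """Turn a word segmentation result into a list of roots.

    Case 1: a word that is itself a preset keyword becomes a root.
    Case 2: a word that contains several preset keywords contributes
            each of them as a separate root, in order of occurrence.
    """
    roots = []
    for token in tokens:
        if token in keywords:
            roots.append(token)
            continue
        # Collect (position, keyword) pairs for keywords inside the word.
        found = sorted((token.find(k), k) for k in keywords if k in token)
        if len(found) > 1:  # only the "plurality of keywords" case
            roots.extend(k for _, k in found)
    return roots
```

A word that merely contains a single keyword is ignored here, because the text only describes the case of a word containing a plurality of preset keywords.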
Further, as shown in fig. 4, the combining unit 32 includes:
a determining module 321, configured to determine a position where each root word appears in the text information;
a generating module 322, configured to generate a root group formed by combining multiple roots according to the sequence of the root appearance positions determined by the determining module 321.
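A hedged Python sketch of the combining step follows. It assumes that a root group is any order-preserving combination of at least two roots and that the position of a root is its first occurrence in the text; neither detail is fixed by the description above. The name `build_root_groups` and the parameter `min_size` are hypothetical.

```python
from itertools import combinations


def build_root_groups(text, roots, min_size=2):
    """Combine roots into root groups, ordered by position in the text."""
    # Determine where each distinct root first appears in the text.
    positioned = sorted((text.find(r), r) for r in set(roots) if r in text)
    ordered = [r for _, r in positioned]
    # Generate every combination that keeps the order of appearance.
    groups = []
    for size in range(min_size, len(ordered) + 1):
        groups.extend(combinations(ordered, size))
    return groups
```

`itertools.combinations` emits tuples in the order of its input, so each generated group preserves the order in which the roots appear in the text.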
In summary, the text information classification method and device according to the embodiments of the present invention classify text information by labeling it with classification tags. The classification tag is determined by analyzing the keywords contained in the text information, so that the preset tag that best matches the content of the text information is assigned as its classification tag, thereby achieving automatic labeling and automatic classification of the text information.
The text information classification device comprises a processor and a memory, wherein the word segmentation unit, the combination unit, the matching unit, the selection unit, the labeling unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided; by adjusting kernel parameters, the keywords in the text information are identified and the text information is classified based on those keywords.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the classification method of text information when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the classification method of the text information is executed when the program runs.
An embodiment of the present invention provides a device including a processor, a memory, and a program that is stored in the memory and can run on the processor, where the processor implements the following steps when executing the program: performing word segmentation on the acquired text information to obtain a plurality of roots, where the roots are preset keywords for classifying the text information; combining the roots into a plurality of root groups according to the positions of the roots in the text information; matching the root groups with preset tags, where a preset tag is an identifier corresponding to a combination of multiple roots; selecting, according to a preset rule, the preset tag corresponding to one of the successfully matched root groups as the classification tag of the text information; and classifying the text information according to the classification tag.
Further, the method also includes setting a preset rule, specifically: setting the roots combined in each preset tag, including the order and number of the roots; and setting different priorities for preset tags having the same number of roots.
Further, selecting, according to a preset rule, the preset tag corresponding to one of the successfully matched root groups as the classification tag of the text information includes: determining the number of roots in each successfully matched root group; when the root group with the largest number of roots is unique, taking the preset tag corresponding to that root group as the classification tag of the text information; and when several root groups share the largest number of roots, taking the preset tag corresponding to the highest-priority root group among them as the classification tag of the text information.
Further, matching the root groups with preset tags includes: matching the preset tags according to the number of roots in the root group and/or the arrangement order of the roots.
Further, performing word segmentation on the acquired text information to obtain a plurality of roots includes: when a word in the word segmentation result is a preset keyword, taking the word as a root; and when a word in the word segmentation result contains a plurality of preset keywords, extracting each of those preset keywords as a root of the text information.
Further, combining the roots into a plurality of root groups according to the positions of the roots in the text information includes: determining the position at which each root appears in the text information; and generating root groups, each composed of multiple roots, according to the order of the appearance positions.
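The steps executed by the processor can be put together in a short end-to-end Python sketch. It is an illustration under several assumptions rather than the patented implementation: whitespace splitting stands in for a real word segmenter, root groups are taken to be order-preserving combinations of at least two roots, a larger number means a higher priority, and the tag table `PRESET_TAGS` and function `classify` are made-up names.

```python
from itertools import combinations

# Hypothetical preset rule: tag -> (ordered roots, priority).
PRESET_TAGS = {
    "engine-oil-leak": (("engine", "oil", "leak"), 1),
    "engine-noise": (("engine", "noise"), 2),
    "oil-leak": (("oil", "leak"), 1),
}
KEYWORDS = {root for roots, _ in PRESET_TAGS.values() for root in roots}


def classify(text):
    """Return the classification tag for `text`, or None if nothing matches."""
    # 1. Segment the text (stand-in: split on whitespace) and keep the
    #    preset keywords as roots, in order of first appearance.
    roots = []
    for token in text.lower().split():
        if token in KEYWORDS and token not in roots:
            roots.append(token)
    # 2. Combine the roots into order-preserving root groups.
    groups = {
        combo
        for size in range(2, len(roots) + 1)
        for combo in combinations(roots, size)
    }
    # 3. Match the root groups against the preset tags.
    matched = [
        (tag, len(tag_roots), priority)
        for tag, (tag_roots, priority) in PRESET_TAGS.items()
        if tag_roots in groups
    ]
    if not matched:
        return None
    # 4. Select: most roots first, then highest priority.
    return max(matched, key=lambda m: (m[1], m[2]))[0]
```

Sorting on the pair (number of roots, priority) expresses the selection rule in one step: the root group with the most roots wins, and priority only matters among groups of equal size.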
The devices herein may be servers, PCs, tablets (PADs), mobile phones, and the like.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: performing word segmentation on the acquired text information to obtain a plurality of roots, where the roots are preset keywords for classifying the text information; combining the roots into a plurality of root groups according to the positions of the roots in the text information; matching the root groups with preset tags, where a preset tag is an identifier corresponding to a combination of multiple roots; selecting, according to a preset rule, the preset tag corresponding to one of the successfully matched root groups as the classification tag of the text information; and classifying the text information according to the classification tag.
Further, the method also includes setting a preset rule, specifically: setting the roots combined in each preset tag, including the order and number of the roots; and setting different priorities for preset tags having the same number of roots.
Further, selecting, according to a preset rule, the preset tag corresponding to one of the successfully matched root groups as the classification tag of the text information includes: determining the number of roots in each successfully matched root group; when the root group with the largest number of roots is unique, taking the preset tag corresponding to that root group as the classification tag of the text information; and when several root groups share the largest number of roots, taking the preset tag corresponding to the highest-priority root group among them as the classification tag of the text information.
Further, matching the root groups with preset tags includes: matching the preset tags according to the number of roots in the root group and/or the arrangement order of the roots.
Further, performing word segmentation on the acquired text information to obtain a plurality of roots includes: when a word in the word segmentation result is a preset keyword, taking the word as a root; and when a word in the word segmentation result contains a plurality of preset keywords, extracting each of those preset keywords as a root of the text information.
Further, combining the roots into a plurality of root groups according to the positions of the roots in the text information includes: determining the position at which each root appears in the text information; and generating root groups, each composed of multiple roots, according to the order of the appearance positions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A method for classifying textual information, the method comprising:
setting a preset rule, specifically:
setting a combined root in a preset label, wherein the combined root comprises the sequence and the number of the roots;
setting different priorities for preset labels with the same root number;
segmenting the acquired text information to obtain a plurality of roots, wherein the roots are preset keywords for classifying the text information;
combining a plurality of root word groups according to the positions of the root words in the text information;
matching the root word group with a preset label, wherein the preset label is a combined identifier corresponding to a plurality of root word groups;
selecting a preset label corresponding to one root word group from the successfully matched root word groups according to a preset rule as a classification label of the text information;
and classifying the text information according to the classification label.
2. The method of claim 1, wherein the selecting a preset tag corresponding to a root word group from successfully matched root word groups according to a preset rule as the classification tag of the text information comprises:
determining the number of the roots in the root group which is successfully matched;
when the root group with the largest number of roots is unique, taking a preset label corresponding to the root group as a classification label of the text information;
and when the root group with the maximum root number is multiple, extracting a preset label corresponding to the root group with the highest priority in the multiple root groups as the classification label of the text information.
3. The method of claim 1 or 2, wherein matching the root set with preset labels comprises:
and matching the preset labels according to the root number and/or the arrangement sequence of the roots in the root group.
4. The method of claim 1, wherein the segmenting the obtained text information into a plurality of roots comprises:
when the word segmentation in the word segmentation result is the preset keyword, taking the word segmentation as a root word;
and when the word segmentation in the word segmentation result contains a plurality of preset keywords, extracting the plurality of preset keywords in the word segmentation as the root words of the text information respectively.
5. The method of claim 1 or 4, wherein combining a plurality of root word groups according to the position of the root word in the text message comprises:
determining the position of each root word appearing in the text information;
and generating a root group composed of a plurality of root combinations according to the sequence of the appearance positions.
6. An apparatus for classifying text information, the apparatus comprising:
the setting unit is used for setting a preset rule before the word segmentation unit performs word segmentation on the acquired text information, and specifically, the setting unit further comprises:
the first setting module is used for setting the combined root in the preset label, including the sequence and the number of the root;
the second setting module is used for setting different priorities for the preset labels with the same root number;
the word segmentation unit is used for segmenting the acquired text information to obtain a plurality of roots, and the roots are preset keywords for classifying the text information;
the combining unit is used for combining a plurality of root word groups according to the positions of the root words obtained by the word segmentation unit in the text information;
the matching unit is used for matching the root group combined by the combining unit with a preset label, and the preset label is a combined identifier corresponding to a plurality of root groups;
the selecting unit is used for selecting a preset label corresponding to one root word group from the root word groups successfully matched by the matching unit according to a preset rule as a classification label of the text information;
and the marking unit is used for classifying the text information according to the classification label determined by the selection unit.
7. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the classification method of text information according to any one of claims 1-5.
8. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the method for classifying text information according to any one of claims 1 to 5 when the program is run.
CN201710420071.0A 2017-06-06 2017-06-06 Text information classification method and device Active CN109002443B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710420071.0A CN109002443B (en) 2017-06-06 2017-06-06 Text information classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710420071.0A CN109002443B (en) 2017-06-06 2017-06-06 Text information classification method and device

Publications (2)

Publication Number Publication Date
CN109002443A CN109002443A (en) 2018-12-14
CN109002443B true CN109002443B (en) 2021-12-28

Family

ID=64572985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710420071.0A Active CN109002443B (en) 2017-06-06 2017-06-06 Text information classification method and device

Country Status (1)

Country Link
CN (1) CN109002443B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666405B (en) * 2019-03-06 2023-07-07 百度在线网络技术(北京)有限公司 Method and device for identifying text implication relationship
CN110472015B (en) * 2019-08-13 2022-12-13 腾讯科技(深圳)有限公司 Text information extraction method, text information extraction device, terminal and storage medium
CN111191428B (en) * 2019-12-27 2022-02-25 北京百度网讯科技有限公司 Comment information processing method and device, computer equipment and medium
CN111259117B (en) * 2020-01-16 2023-11-21 广州拉卡拉信息技术有限公司 Short text batch matching method and device
CN111291195B (en) * 2020-01-21 2021-08-10 腾讯科技(深圳)有限公司 Data processing method, device, terminal and readable storage medium
CN112287042A (en) * 2020-11-22 2021-01-29 长沙修恒信息科技有限公司 Material name processing system in ERP system
CN116701616B (en) * 2022-12-07 2024-07-12 荣耀终端有限公司 Text classification method and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
US8756234B1 (en) * 2004-11-16 2014-06-17 The General Hospital Corporation Information theory entropy reduction program
CN104063502A (en) * 2014-07-08 2014-09-24 中南大学 WSDL semi-structured document similarity analyzing and classifying method based on semantic model
CN105183838A (en) * 2015-09-02 2015-12-23 有戏(厦门)网络科技有限公司 Text editing method and system based on material obtaining
CN105224695A (en) * 2015-11-12 2016-01-06 中南大学 A kind of text feature quantization method based on information entropy and device and file classification method and device
CN105573968A (en) * 2015-12-10 2016-05-11 天津海量信息技术有限公司 Text indexing method based on rules
CN106354872A (en) * 2016-09-18 2017-01-25 广州视源电子科技股份有限公司 Text clustering method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2598300C2 (en) * 2015-01-27 2016-09-20 Общество с ограниченной ответственностью "Аби Девелопмент" Methods and systems for automatic recognition of characters using forest solutions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
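Research and Implementation of a Function-Oriented Web Service Classification System; Wang Hualan et al.; Journal of Chinese Computer Systems (小型微型计算机系统); 2013-01-15; pp. 46-53 *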

Also Published As

Publication number Publication date
CN109002443A (en) 2018-12-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing
Applicant after: Beijing Guoshuang Technology Co.,Ltd.
Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing
Applicant before: Beijing Guoshuang Technology Co.,Ltd.
GR01 Patent grant