CN111259158B - Text classification method, device and medium - Google Patents


Info

Publication number
CN111259158B
Authority
CN
China
Prior art keywords
word, target text, sample, words, determining
Prior art date
Legal status
Active
Application number
CN202010114084.7A
Other languages
Chinese (zh)
Other versions
CN111259158A (en)
Inventor
鲁骁
孟二利
王斌
Current Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010114084.7A
Publication of CN111259158A
Application granted
Publication of CN111259158B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

Disclosed herein are a text classification method, apparatus, and medium. The method comprises: determining the composition mode of a target text according to a dictionary; generating a word vector of the target text according to the word vector generation method corresponding to that composition mode; and classifying the target text according to the word vector of the target text and a sample word mapping set. The sample word mapping set comprises a plurality of subsets; each subset comprises one-to-one mappings between a plurality of sample words and word vectors; sample words contained in different subsets belong to different categories; and all word vectors in the sample word mapping set have the same dimension. The dictionary is a subset of all sample words in the sample word mapping set. Based on the dictionary and the sample word mapping set, the method maps target texts of different lengths onto a vector space of the same dimension, unifying them at the representation layer, so that target texts of different lengths can be classified by the same classification algorithm, which can effectively improve classification accuracy.

Description

Text classification method, device and medium
Technical Field
The disclosure relates to the field of text classification, and in particular relates to a text classification method, device and medium.
Background
At present, text processing services often need a large number of rule texts to manage content; these rule texts include keywords, phrases, sentences, regular expressions, and the like. Each rule text must also maintain corresponding classification information, which indicates the category of the data filtered out by that rule text. This lets operators carry out subsequent processing of the data, including log statistics, data reports, rule error correction, and routing data of different categories through different processing channels. The accuracy of rule text classification therefore largely determines the effectiveness of subsequent data processing flows and, in turn, the operating efficiency of the service system.
Currently, rule texts can be classified by two methods: manual classification and automatic classification.
Manual classification depends on service personnel's understanding of the rule table and the category system. It requires unified classification standards, and personnel must be retrained whenever the rule table is updated or the category system is adjusted. In practice, several service personnel often maintain the rules at the same time, and classification errors caused by their inconsistent understanding of the standards are common. During service operation, operators manually update the rule table after discovering data to be filtered through user feedback, log tracking, system inspection, and other channels; after long accumulation, the rule texts grow large, making manual classification impractical.
Automatic classification is generally implemented by extracting text features from the rule table through text analysis and then applying a suitable classification algorithm. Short texts and long texts, however, require different feature extraction methods. Because the rule table has a complex structure and contains keywords, phrases, sentences, regular expressions, and other forms of text, a unified method is hard to achieve, and rule texts of different lengths yield inconsistent feature representations, which hurts classification accuracy. Moreover, traditional classification methods require a certain amount of manually annotated data, from which dictionary-based features are extracted. Since the rule table differs from long articles and its word distribution is very sparse, most words are not covered by the annotated training data. Whenever a new keyword appears in the rule table, an unregistered (out-of-vocabulary) word arises, the feature representation of the rule text fails, and the accuracy of the classification algorithm suffers directly.
In the related art, the category of an unregistered word may be determined by computing the similarity between the context of the unregistered word and the context of each category. This approach relies on a synonym dictionary and requires the synonyms to carry category attributes; in practice, category systems are complex and no corresponding synonym dictionary is available. Furthermore, the context information used by the related art is applicable to service scenarios with continuous sentences, articles, and the like, but not to rule texts, which are mostly keywords and phrase fragments without corresponding context.
Disclosure of Invention
To overcome the problems in the related art, a text classification method, apparatus, and medium are provided herein.
According to a first aspect of embodiments herein, there is provided a text classification method comprising:
determining the composition mode of the target text according to a dictionary;
generating a word vector of the target text according to the word vector generation method corresponding to the composition mode; and
classifying the target text according to the word vector of the target text and a sample word mapping set;
wherein the sample word mapping set comprises a plurality of subsets, each subset comprises one-to-one mappings between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and all word vectors in the sample word mapping set have the same dimension; the dictionary is a subset of all sample words in the sample word mapping set.
In another embodiment, determining the composition mode of the target text according to the dictionary includes: when the target text is determined to be a sample word in the dictionary, determining that the composition mode of the target text is a first mode;
the word vector generation method corresponding to the first mode includes: querying the one-to-one mappings in the sample word mapping set to determine the word vector corresponding to the target text.
In another embodiment, determining the composition mode of the target text according to the dictionary includes: determining that the target text is not a sample word in the dictionary and, when the length of the target text is greater than or equal to a set length, determining that the composition mode of the target text is a second mode;
the word vector generation method corresponding to the second mode includes: performing a word segmentation operation on the target text to obtain at least one effective composition word, selecting the effective composition words belonging to the dictionary, querying the one-to-one mappings in the sample word mapping set to determine the word vector corresponding to each selected effective composition word, and determining the word vector corresponding to the target text according to the word vectors of the selected effective composition words.
In another embodiment, determining the composition mode of the target text according to the dictionary includes: determining that the target text is not a sample word in the dictionary and, when the length of the target text is smaller than the set length, determining that the composition mode of the target text is a third mode;
the word vector generation method corresponding to the third mode includes: performing at least one sliding-window split on the target text, where different sliding windows have different window lengths, selecting the unit words belonging to the dictionary from the unit words obtained by each sliding-window split, querying the one-to-one mappings in the sample word mapping set to determine the word vectors corresponding to all selected unit words, and determining the word vector corresponding to the target text according to the word vectors of all selected unit words.
In another embodiment, the sliding windows used in the at least one sliding-window split have N distinct lengths chosen from the M lengths ranging from 1 character to M characters, where M is an integer greater than 1 and N is less than or equal to M.
In another embodiment, classifying the target text according to the word vector of the target text and the sample word mapping set includes: calculating the similarity between the word vector of the target text and the word vectors in the sample word mapping set, and determining the category to which the target text belongs according to the similarity.
According to a second aspect of embodiments herein, there is provided a text classification apparatus comprising:
the first determining module is used for determining the composition mode of the target text according to a dictionary;
the generating module is used for generating the word vector of the target text according to the word vector generation method corresponding to the composition mode;
the classification module is used for classifying the target text according to the word vector of the target text and a sample word mapping set;
wherein the sample word mapping set comprises a plurality of subsets, each subset comprises one-to-one mappings between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and all word vectors in the sample word mapping set have the same dimension; the dictionary is a subset of all sample words in the sample word mapping set.
In one embodiment, the first determining module includes:
the second determining module is used for determining that the composition mode of the target text is a first mode when the target text is a sample word in the dictionary;
the generation module comprises:
the first execution module is used for generating the word vector of the target text according to the word vector generation method corresponding to the first mode;
the first execution module includes:
and the first query module is used for querying the one-to-one mapping relation in the sample word mapping set to determine the word vector corresponding to the target text.
In one embodiment, the first determining module includes:
a third determining module, configured to determine that the target text is not a sample word in the dictionary and, when the length of the target text is greater than or equal to a set length, to determine that the composition mode of the target text is a second mode;
the generation module comprises:
the second execution module is used for generating the word vector of the target text according to the word vector generation method corresponding to the second mode;
the second execution module includes:
the word segmentation module is used for performing a word segmentation operation on the target text to obtain at least one effective composition word;
the first selection module is used for selecting the effective composition words belonging to the dictionary from the effective composition words;
the second query module is used for querying the one-to-one mappings in the sample word mapping set to determine the word vectors corresponding to the selected effective composition words;
and the fourth determining module is used for determining the word vector corresponding to the target text according to the word vector corresponding to each selected effective composition word.
In one embodiment, the first determining module includes:
a fourth determining module, configured to determine that the target text is not a sample word in the dictionary and, when the length of the target text is less than the set length, to determine that the composition mode of the target text is a third mode;
the generation module comprises:
the third execution module is used for generating the word vector of the target text according to the word vector generation method corresponding to the third mode;
the third execution module includes:
the splitting module is used for performing at least one sliding-window split on the target text, where different sliding windows have different window lengths;
the second selection module is used for selecting the unit words belonging to the dictionary from the unit words obtained by each sliding-window split;
the third query module is used for querying the one-to-one mappings in the sample word mapping set to determine the word vectors corresponding to all selected unit words;
and the fifth determining module is used for determining the word vector corresponding to the target text according to the word vectors corresponding to all the selected unit words.
In one embodiment, the sliding windows used in the at least one sliding-window split have N distinct lengths chosen from the M lengths ranging from 1 character to M characters, where M is an integer greater than 1 and N is less than or equal to M.
In one embodiment, the classification module comprises:
the computing module is used for computing the similarity between the word vector of the target text and the word vectors in the sample word mapping set;
and the determining module is used for determining the category to which the target text belongs according to the similarity.
According to a third aspect of embodiments herein, there is provided a non-transitory computer-readable storage medium having instructions stored thereon which, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a text classification method, the method comprising:
determining the composition mode of the target text according to a dictionary;
generating a word vector of the target text according to the word vector generation method corresponding to the composition mode;
classifying the target text according to the word vector of the target text and a sample word mapping set;
wherein the sample word mapping set comprises a plurality of subsets, each subset comprises one-to-one mappings between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and all word vectors in the sample word mapping set have the same dimension; the dictionary is a subset of all sample words in the sample word mapping set.
The technical solutions provided by the embodiments herein may have the following beneficial effect: based on the dictionary and the sample word mapping set, target texts of different lengths are mapped onto a vector space of the same dimension, unifying them at the representation layer, so that target texts of different lengths can be classified by the same classification algorithm, which can effectively improve classification accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent herewith and together with the description, serve to explain the principles herein.
FIG. 1 is a flow chart illustrating a method of text classification according to an exemplary embodiment;
FIG. 2 is a flowchart illustrating a method of generating a word vector corresponding to the second mode in step S13 of FIG. 1, according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a method of generating a word vector corresponding to the third mode in step S13 of FIG. 1, according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating a method of text classification according to an exemplary embodiment;
FIG. 5 is a block diagram of a text classification device according to an exemplary embodiment;
FIG. 6 is a block diagram of a text classification device according to an exemplary embodiment;
FIG. 7 is a block diagram of a text classification device according to an exemplary embodiment;
FIG. 8 is a block diagram of a text classification device according to an exemplary embodiment;
fig. 9 is a block diagram illustrating a text classification apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this document; rather, they are merely examples of apparatus and methods consistent with some aspects hereof as detailed in the appended claims.
The methods herein classify target texts, where a target text may be a business rule, business content, technical content, network interaction content, or the like.
Embodiments herein provide a text classification method. Referring to fig. 1, fig. 1 is a flowchart illustrating a text classification method according to an exemplary embodiment. As shown in fig. 1, the text classification method includes:
Step S11, determining the composition mode of the target text according to a dictionary.
Step S12, generating a word vector of the target text according to the word vector generation method corresponding to the composition mode.
Step S13, classifying the target text according to the word vector of the target text and a sample word mapping set.
The sample word mapping set comprises a plurality of subsets, each subset comprises one-to-one mappings between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and all word vectors in the sample word mapping set have the same dimension; the dictionary is a subset of all sample words in the sample word mapping set.
Here, a subset of all sample words in the sample word mapping set means either a part of, or all of, the sample words in the sample word mapping set.
A sample word may contain one word, two words, or more than two words; that is, a sample word is a word or a phrase. The categories in the sample word mapping set follow a preset general category division or are defined in a custom manner.
For example:
The sample word mapping set includes 50 categories, of which the following are examples:
The mobile phone screen category, under which the following one-to-one mappings are included:
touch screen-vector 101, folding screen-vector 102, curved screen-vector 103, flexible screen-vector 104, screen-vector 105, etc.
The communication device category, under which the following one-to-one mappings are included:
mobile terminal-vector 201, mobile phone-vector 202, router-vector 203, base station-vector 204, set-top box-vector 205, etc.
The network hotness category, under which the following one-to-one mappings are included:
traffic-vector 301, click volume-vector 302, forward volume-vector 303, fan count-vector 304, like count-vector 305, etc.
The dictionary includes sample words from the sample word mapping set, for example: touch screen, folding screen, curved screen, flexible screen, mobile terminal, mobile phone, router, base station, traffic, click volume, forward volume, fan count.
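As a concrete (and entirely invented) illustration of the structures above, the sample word mapping set, the dictionary derived from it, and the first-mode lookup might be sketched in Python as follows. The vectors are 4-dimensional placeholders standing in for the opaque labels such as "vector 103"; only the property that all vectors share one dimension matters here.

```python
# Toy sample word mapping set: category -> {sample word: word vector}.
# All names and numbers are illustrative, not taken from the patent.
SAMPLE_WORD_MAPPING = {
    "mobile phone screen": {
        "touch screen":  [0.9, 0.1, 0.0, 0.0],   # stands in for vector 101
        "curved screen": [0.8, 0.2, 0.0, 0.1],   # stands in for vector 103
    },
    "communication device": {
        "router":       [0.1, 0.9, 0.1, 0.0],    # stands in for vector 203
        "base station": [0.0, 0.8, 0.2, 0.1],    # stands in for vector 204
    },
    "network hotness": {
        "traffic":   [0.0, 0.1, 0.9, 0.2],       # stands in for vector 301
        "fan count": [0.1, 0.0, 0.8, 0.3],       # stands in for vector 304
    },
}

# The dictionary is a subset of all sample words in the mapping set;
# in this sketch it simply contains all of them.
DICTIONARY = {word for words in SAMPLE_WORD_MAPPING.values() for word in words}

def lookup_word_vector(word):
    """First mode: the target text is itself a sample word, so its word
    vector comes from a direct one-to-one mapping lookup."""
    for words in SAMPLE_WORD_MAPPING.values():
        if word in words:
            return words[word]
    return None
```

With this layout, a first-mode target text such as "curved screen" resolves to its vector in a single dictionary-membership check plus lookup.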
With this method, target texts of different lengths are mapped, based on the dictionary and the sample word mapping set, onto a vector space of the same dimension, unifying them at the representation layer; target texts of different lengths can thus be classified by the same classification algorithm, which can effectively improve classification accuracy.
Embodiments herein further provide a method for obtaining the sample word mapping set. The sample word mapping set is either a preset set obtained directly, or is built by combining manual intervention with automatic word expansion, as follows:
step 1, manually selecting a plurality of seed words for each category, and selecting a first number (for example, 50) of seed words for each category in order to ensure the classification effect.
And 2, expanding according to the seed words to obtain the paraphrasing words with preset quantity.
And step 3, merging similar words of all seed words in each category, and removing repeated sample words to obtain a sample word set of each category.
And 4, determining word vectors of each sample word in the sample word set of each category, and finally forming a sample mapping set. The vectors of the sample words are set to have the same dimension, for example, the dimension is 40 dimensions.
The expansion in step 2 uses either of the following modes:
In the first mode, the word vector of each seed word is determined, the similarity between the seed word and other candidate words is computed from the word vectors, the candidates are sorted by similarity, and a preset number of the most similar words for each seed word are extracted as expansions.
In the second mode, a synonym word list (such as the published lists Cilin ("word forest") or WordNet) is queried for the synonyms of each seed word, a preset number of synonyms is selected per seed word, and the results are merged and de-duplicated.
The main difference between the two modes lies in how candidate similarity is computed during expansion, and hence in computational complexity. The first mode computes seed-candidate similarity from word vectors; its complexity is higher, but the word vectors used during expansion and those used by the subsequent classification algorithm share the same vector space, giving high consistency. The second mode looks the synonyms up directly in a synonym list; its complexity is low, but the list is manually maintained and does not match the data distribution of the application scenario, so its effect is less stable than the first mode.
For the word vectors of all sample words to have the same dimension under both modes, the word vector determined for each sample word in step 4 must be given the same dimension.
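The first expansion mode, ranking candidate words by word-vector similarity to a seed word, might be sketched as follows. Cosine similarity is assumed here as the measure, although the text does not fix one, and all names and data are illustrative.

```python
def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_seed_word(seed_vector, candidate_vectors, preset_number):
    """Mode one of step 2: sort candidate words by similarity to the seed
    word's vector and keep the preset number of most similar words."""
    ranked = sorted(
        candidate_vectors,
        key=lambda w: cosine_similarity(seed_vector, candidate_vectors[w]),
        reverse=True,
    )
    return ranked[:preset_number]
```

The mode-two alternative would replace the similarity ranking with a direct table lookup in a synonym list.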
This embodiment also provides a text classification method in which determining the composition mode of the target text according to the dictionary in step S11 of fig. 1 includes: when the target text is determined to be a sample word in the dictionary, determining that the composition mode of the target text is a first mode. The word vector generation method corresponding to the first mode in step S13 of fig. 1 includes: querying the one-to-one mappings in the sample word mapping set to determine the word vector corresponding to the target text.
Determining whether the target text is a sample word in the dictionary includes: comparing the target text with each sample word in the dictionary one by one; if an identical sample word is found, the target text is considered a sample word in the dictionary, and otherwise it is not.
Examples are as follows:
the target text is a curved screen, and each sample word in the dictionary is compared with the target text one by one, so that the target text is matched with the dictionary to be identical, namely the dictionary comprises the curved screen. And determining a word vector corresponding to the target text as a vector 103 according to a one-to-one mapping relation in the query sample word mapping set, and taking the vector 103 as the word vector of the target text.
This embodiment also provides a text classification method in which determining the composition mode of the target text according to the dictionary in step S11 of fig. 1 includes: determining that the target text is not a sample word in the dictionary and, when the length of the target text is greater than or equal to the set length, determining that the composition mode of the target text is a second mode.
Referring to fig. 2, fig. 2 is a flowchart of the word vector generation method corresponding to the second mode in step S13 of fig. 1. As shown in fig. 2, the method includes:
Step S21, performing a word segmentation operation on the target text to obtain at least one effective composition word;
Step S22, selecting the effective composition words belonging to the dictionary from the effective composition words;
Step S23, querying the one-to-one mappings in the sample word mapping set to determine the word vectors corresponding to the selected effective composition words;
Step S24, determining the word vector corresponding to the target text according to the word vector corresponding to each selected effective composition word.
Obtaining at least one effective composition word after the word segmentation operation includes: performing word segmentation on the target text to obtain a set of composition words, and removing invalid words (function words, auxiliary words, and the like) from the set to obtain the effective composition words.
Determining the word vector corresponding to the target text according to the word vector of each selected effective composition word includes: computing the average of the word vectors of the selected effective composition words and taking that average vector as the word vector of the target text; or computing a weighted average of those word vectors and taking the weighted average vector as the word vector of the target text. In the weighted average, different effective composition words carry different weights; for example, sample words that are relatively basic or frequently used are preset with larger weights.
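The plain and weighted averaging just described might look like the following sketch. The function and weight names are hypothetical, and word segmentation is assumed to have already produced the effective composition words.

```python
def second_mode_vector(effective_words, dictionary, word_vectors, weights=None):
    """Second mode: keep the effective composition words found in the
    dictionary, look up their vectors, and return their (optionally
    weighted) average as the target text's word vector."""
    selected = [w for w in effective_words if w in dictionary]
    if not selected:
        return None  # no dictionary word survived; the text leaves this case open
    if weights is None:
        weights = {}  # default: every selected word has weight 1.0
    total = sum(weights.get(w, 1.0) for w in selected)
    dim = len(word_vectors[selected[0]])
    avg = [0.0] * dim
    for w in selected:
        coeff = weights.get(w, 1.0)
        for i, x in enumerate(word_vectors[w]):
            avg[i] += coeff * x
    return [x / total for x in avg]
```

Passing a weights mapping that favours basic or frequent sample words yields the weighted-average variant; omitting it yields the plain average.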
Examples are as follows:
the length is set to be 6 words, and the target text is that the screen of the prototype is a curved screen. After judging that the formation mode of the target text is the second mode, performing word segmentation operation on the target text to obtain a composition word set, wherein the composition word set comprises: prototype, screen, yes, curved screen. Removing invalid words from the composition word set to obtain valid composition words, including: prototype, screen, curved screen. And finally determining the category of the target text according to the composition word as a mobile phone screen.
This embodiment also provides a text classification method in which determining the composition mode of the target text according to the dictionary in step S11 of fig. 1 includes: determining that the target text is not a sample word in the dictionary and, when the length of the target text is smaller than the set length, determining that the composition mode of the target text is a third mode.
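Taken together, the mode decisions of the embodiments above (dictionary hit, long non-dictionary text, short non-dictionary text) reduce to a small dispatch. A sketch follows; the function and mode names are invented for illustration.

```python
def composition_mode(target_text, dictionary, set_length):
    """Decide the composition mode of the target text: a dictionary hit is
    the first mode; otherwise the set length separates the second mode
    (longer texts, handled by word segmentation) from the third mode
    (shorter texts, handled by sliding-window splitting)."""
    if target_text in dictionary:
        return "first"
    if len(target_text) >= set_length:
        return "second"
    return "third"
```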
Referring to fig. 3, fig. 3 is a flowchart of a word vector generating method corresponding to the third mode in step S13 shown in fig. 1 provided in this embodiment, and as shown in fig. 3, the method includes:
step S31, performing at least one sliding-window split on the target text, where window lengths of different sliding windows are different;
step S32, selecting unit words belonging to the dictionary from the unit words obtained after splitting by using each sliding window;
step S33, inquiring a one-to-one mapping relation in the sample word mapping set to determine word vectors corresponding to all selected unit words;
step S34, determining the word vector corresponding to the target text according to the word vectors corresponding to all the selected unit words.
The lengths of the sliding windows used in the at least one sliding-window split are N character lengths out of the M character lengths from 1 character to M characters, where M is an integer greater than 1 and N is less than or equal to M. That is, N kinds of sliding windows are selected from the M kinds of sliding windows whose lengths range from 1 character to M characters. For example: a 1-character sliding window, a 2-character sliding window, and a 3-character sliding window. As another example: a 1-character sliding window and a 3-character sliding window.
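The splitting and selection of steps S31–S32 can be sketched as follows: windows of each chosen length slide over the text one character at a time, and unit words not found in the dictionary are discarded. The text follows the example in this section ("屏幕是曲面屏", "the screen is a curved screen"); the dictionary is hypothetical.

```python
# Sketch of sliding-window splitting: all substrings produced by
# windows of the given lengths, then filtered against the dictionary.
def sliding_split(text, window_lengths):
    """Return every substring produced by sliding windows of the given lengths."""
    return [text[i:i + n]
            for n in window_lengths
            for i in range(len(text) - n + 1)]

DICTIONARY = {"屏幕", "曲面屏"}  # hypothetical: "screen", "curved screen"
units = sliding_split("屏幕是曲面屏", [1, 2, 3])
selected = [u for u in units if u in DICTIONARY]
print(selected)  # ['屏幕', '曲面屏']
```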
Determining the word vector corresponding to the target text according to the word vectors corresponding to all the selected unit words includes: calculating an average vector of the word vectors corresponding to all the selected unit words and taking the average vector as the word vector of the target text, or calculating a weighted average vector of those word vectors and taking the weighted average vector as the word vector of the target text. When the weighted average vector is calculated, different unit words may carry different weights; for example, sample words that are relatively basic or frequently used are preset with larger weights.
Examples are as follows:
the set length is 6 words, and the target text is "the screen is a curved screen". After the formation mode of the target text is judged to be the third mode, the target text is split with a 1-character sliding window, a 2-character sliding window, and a 3-character sliding window respectively. The 1-character sliding window yields the individual characters of the target text as unit words. The 2-character sliding window yields every pair of adjacent characters as a unit word, including "screen". The 3-character sliding window yields every run of three adjacent characters as a unit word, including "curved screen". From the unit words obtained by all the sliding windows, those belonging to the dictionary are selected: screen and curved screen. The category of the target text is finally determined from these unit words as "mobile phone screen".
With the plurality of sliding windows used in the third mode, this method represents third-mode text as vectors in a vector space of the same dimension, so that the subsequent classification method is compatible with the first mode and the second mode.
The embodiment also provides a text classification method. In this method, classifying the target text according to the word vector and the sample mapping set of the target text in step S13 shown in fig. 1 includes: calculating the similarity between the word vector of the target text and the word vectors in the sample mapping set, determining the word vectors in the sample mapping set that satisfy a similarity condition, and determining the category to which the target text belongs according to the categories of those word vectors.
There are various ways to calculate the similarity between the word vector of the target text and the word vectors in the sample mapping set, for example by calculating the distance between them, where the distance may be a Euclidean distance, Manhattan distance, Chebyshev distance, Minkowski distance, Mahalanobis distance, Hamming distance, and the like; the similarity may also be calculated by a classification algorithm.
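A few of the distances listed above can be sketched as follows, applied to two small illustrative vectors (the values are not from the patent).

```python
# Sketch of common vector distances between a target-text vector and a
# sample word vector. Euclidean and Manhattan are the p=2 and p=1
# special cases of the Minkowski distance.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = [1.0, 2.0], [4.0, 6.0]
print(euclidean(a, b))     # 5.0
print(manhattan(a, b))     # 7.0
print(chebyshev(a, b))     # 4.0
print(minkowski(a, b, 1))  # 7.0
```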
The classification algorithm is an algorithm for classifying vectors, such as the K-nearest-neighbor (kNN) classification algorithm. "K nearest neighbors" means that each sample can be represented by its K closest neighboring samples. The core idea of the kNN algorithm is that if most of the k samples nearest to a given sample in the feature space belong to a certain category, then that sample also belongs to that category and shares the characteristics of the samples in it. In making a classification decision, the method determines the category of the sample to be classified only from the categories of the one or more nearest samples; it thus relies on a small number of surrounding neighbors rather than on discriminating class regions.
When the kNN classification algorithm is used, the similarity between the word vector of the target text and each word vector in the sample mapping set is calculated and sorted; the K value in the kNN classification method is set to a preset value (for example, 20); the top-ranked K word vectors are determined from the sorted list, the categories of the sample words corresponding to these K word vectors are determined, and the category occurring most often among the K categories is taken as the category to which the target text belongs.
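The kNN decision just described can be sketched as follows: rank the sample words by distance to the target text's vector, keep the top K, and take a majority vote over their categories. The sample set and categories are hypothetical, and a real system would use the preset K (e.g. 20) over a much larger sample mapping set.

```python
# Sketch of kNN classification over (vector, category) sample pairs.
# Similarity ranking is done here by Euclidean distance (smaller = more
# similar); the samples below are hypothetical.
import math
from collections import Counter

def knn_classify(target, samples, k):
    """samples: list of (vector, category) pairs; returns the majority
    category among the k samples nearest to target."""
    nearest = sorted(samples, key=lambda s: math.dist(target, s[0]))[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]  # most frequent category among the K

samples = [([0.0, 0.0], "screen"), ([0.1, 0.0], "screen"),
           ([1.0, 1.0], "battery"), ([0.2, 0.1], "screen")]
print(knn_classify([0.05, 0.0], samples, k=3))  # 'screen'
```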
The following is a detailed description of one embodiment.
Specific examples:
in this embodiment, the target text to be processed is a rule to be processed, for example a processing rule about a mobile terminal.
Referring to fig. 4, fig. 4 is a flow chart of a text classification method in a specific embodiment. As shown in fig. 4, this text classification method includes a sample mapping set generation process S4-1 and a classification process S4-2.

The sample mapping set generation process S4-1 includes steps S411 to S415:
in step S411, a plurality of seed words are selected for each category; 50 seed words are selected per category to ensure the classification effect.
Step S412, the seed words are expanded to obtain similar words: a plurality of (e.g. 10) near-synonyms are obtained for each seed word, and the similar words under each category are used as sample words, yielding a plurality of (e.g. 500) sample words for each category.
When obtaining the near-synonyms, the similarity between each seed word and other candidate words is calculated from their word vectors, the similarities are sorted, and a preset number of the most similar words for each seed word are extracted to realize the expansion.
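The expansion step can be sketched as follows: rank candidate words by cosine similarity to a seed word's vector and keep the top few as near-synonyms. All words and vectors here are hypothetical illustrations, not values from the patent.

```python
# Sketch of seed-word expansion by word-vector similarity (step S412):
# sort candidates by cosine similarity to the seed vector and keep the
# top_n. Candidate words and vectors are hypothetical.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def expand(seed_vec, candidates, top_n):
    """candidates: dict word -> vector; return the top_n most similar words."""
    return sorted(candidates,
                  key=lambda w: cosine(seed_vec, candidates[w]),
                  reverse=True)[:top_n]

candidates = {"display": [0.9, 0.1], "monitor": [0.8, 0.3], "battery": [0.1, 0.9]}
print(expand([1.0, 0.0], candidates, top_n=2))  # ['display', 'monitor']
```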
Step S413, merging the similar words of all the seed words in each category, and removing the repeated sample words to obtain a sample word set of each category.
In step S414, a word vector of each sample word in the sample word set of each category is determined, and finally a sample mapping set is formed, where the dimensions of the vector of each sample word are the same (e.g. 40 dimensions).
In step S415, a dictionary is determined, the dictionary including sample words in the set of sample mappings.
The classification process S4-2 includes steps S421 to S437:
step S421, matching the rule to be processed with each word in the dictionary.
Step S422, it is determined whether the rule to be processed hits the dictionary, i.e., whether the rule to be processed is a word in the dictionary, and if yes, go to step S423, and if no, go to step S425.
In step S423, it is determined that the constituent form of the rule to be processed is a first type constituent form (or referred to as a word type constituent form).
Step S424, a word vector of the rule to be processed is generated; specifically, the one-to-one mapping relation in the sample word mapping set is queried to determine the word vector corresponding to the rule to be processed; then go to step S436.
In step S425, the length of the rule to be processed, i.e. the number of words contained in the rule to be processed, is determined.
Step S426, it is judged whether the length of the rule to be processed is greater than or equal to the preset length (e.g. 6); if yes, go to step S427; if no, go to step S432.
In step S427, it is determined that the constituent form of the rule to be processed is a second type constituent form (or referred to as phrase type constituent form).
Step S428, word segmentation operation is carried out on the rule to be processed to obtain a plurality of effective composition words;
step S429, selecting the effective constituent words belonging to the dictionary from the plurality of effective constituent words.
Step S430, the one-to-one mapping relation in the sample word mapping set is queried to determine the word vectors corresponding to the selected effective composition words.
Step S431, a word vector of the rule to be processed is generated through an averaging operation: the average vector of the word vectors of the selected effective composition words is calculated and taken as the word vector of the rule to be processed; then go to step S436.
In step S432, it is determined that the formation of the rule to be processed is a third type formation (or referred to as an unregistered word type formation).
Step S433, three sliding window type splitting is carried out on the rule to be processed; for example: the first sliding window has a length of 1 character, the second sliding window has a length of 2 characters, and the third sliding window has a length of 3 characters.
And S434, selecting unit words belonging to the dictionary from the unit words split through the three sliding windows, and discarding the unit words not belonging to the dictionary from the split unit words.
In step S435, a word vector of the rule to be processed is generated through a weighted average operation, that is, an average vector of word vectors corresponding to all the selected unit words is calculated, and the average vector is used as the word vector of the rule to be processed.
Step S436, the category of the rule to be processed is determined from its word vector and the sample mapping set using the kNN classification method.
Step S437, outputting the category.
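The dispatch in steps S421–S426 can be sketched as follows: a rule found in the dictionary takes the first (word-type) form; otherwise its length decides between the second (phrase-type) and third (unregistered-word-type) forms. The dictionary and threshold are hypothetical, and length is measured here in characters for illustration, whereas the embodiment counts words.

```python
# Sketch of the mode dispatch (S421-S426). DICTIONARY and SET_LENGTH
# are hypothetical stand-ins for the embodiment's dictionary and preset
# length; len() here counts characters rather than segmented words.
DICTIONARY = {"screen", "battery"}
SET_LENGTH = 6

def formation_mode(rule):
    if rule in DICTIONARY:         # S422 hit -> S423 (word-type form)
        return "first"
    if len(rule) >= SET_LENGTH:    # S426 yes -> S427 (phrase-type form)
        return "second"
    return "third"                 # S426 no -> S432 (unregistered-word form)

print(formation_mode("screen"))              # 'first'
print(formation_mode("curved glass panel"))  # 'second'
print(formation_mode("curvd"))               # 'third'
```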
The embodiment also provides a text classification device. Referring to fig. 5, fig. 5 is a block diagram illustrating a text classification apparatus according to an exemplary embodiment. As shown in fig. 5, the text classification apparatus includes:
a first determining module 501, configured to determine the formation mode of the target text according to the dictionary;
a generating module 502, configured to generate a word vector of the target text according to a word vector generating method corresponding to the configuration mode;
a classification module 503, configured to classify the target text according to the word vector and the sample mapping set of the target text;
the sample word mapping set comprises a plurality of subsets, each subset comprises a one-to-one mapping relation between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and the dimensions of all word vectors in the sample word mapping set are the same; the dictionary is a subset of all sample words in the set of sample word mappings.
The embodiment also provides a text classification device. In this apparatus, the first determining module 501 includes: a second determining module, configured to determine that the formation mode of the target text is the first mode when the target text is determined to be a sample word in the dictionary.

The generating module 502 includes: a first execution module, configured to generate the word vector of the target text according to the word vector generation method corresponding to the first mode.

The first execution module includes: a first query module, configured to query the one-to-one mapping relation in the sample word mapping set to determine the word vector corresponding to the target text.
The embodiment also provides a text classification device. Referring to fig. 6, fig. 6 is a block diagram illustrating a text classification apparatus according to an exemplary embodiment. As shown in fig. 6, the first determining module 501 in this apparatus includes: a third determining module 601, configured to determine that the target text is not a sample word in the dictionary, and to determine that the formation mode of the target text is the second mode when the length of the target text is greater than or equal to the set length.
The generating module 502 includes: a second execution module 602, configured to generate the word vector of the target text according to the word vector generation method corresponding to the second mode.
A second execution module 602, comprising:
the word segmentation module 603 is configured to obtain at least one valid composition word after performing a word segmentation operation on the target text;
a first selection module 604, configured to select valid component words belonging to the dictionary from the valid component words;
a second query module 605, configured to query a one-to-one mapping relationship in the sample word mapping set to determine a word vector corresponding to the selected valid component word;
a fourth determining module 606, configured to determine a word vector corresponding to the target text according to the word vector corresponding to each selected valid component word.
The embodiment also provides a text classification device. Referring to fig. 7, fig. 7 is a block diagram illustrating a text classification apparatus according to an exemplary embodiment. As shown in fig. 7, the first determining module 501 in this apparatus includes: a fourth determining module 701, configured to determine that the target text is not a sample word in the dictionary, and determine that the configuration mode of the target text is a third mode when the length of the target text is less than a set length.
The generating module 502 includes: a third execution module 702, configured to generate the word vector of the target text according to the word vector generation method corresponding to the third mode.
The third execution module 702 includes:
the splitting module 703 is configured to split the target text by at least one sliding window respectively, where window lengths of different sliding windows are different;
a second selection module 704, configured to select a unit word belonging to the dictionary from unit words obtained by splitting using each sliding window;
a third query module 705, configured to query a one-to-one mapping relationship in the sample word mapping set to determine word vectors corresponding to all the selected unit words;
a fifth determining module 706, configured to determine a word vector corresponding to the target text according to the word vectors corresponding to all the selected unit words.
The length of the sliding window used in the at least one sliding window type split is N character lengths from 1 character length to M character length, M being an integer greater than 1, N being less than or equal to M.
The embodiment also provides a text classification device. Referring to fig. 8, fig. 8 is a block diagram illustrating a text classification apparatus according to an exemplary embodiment. As shown in fig. 8, the classification module 503 includes:
a calculating module 801, configured to calculate a similarity between a word vector of the target text and a word vector in the sample mapping set;
a sixth determining module 802, configured to determine, according to the similarity, the category to which the target text belongs.
Embodiments herein also provide a non-transitory computer-readable storage medium storing instructions that, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a text classification method, the method comprising:
determining the formation mode of the target text according to the dictionary;
generating word vectors of the target text according to the word vector generation method corresponding to the formation mode;
classifying the target text according to the word vector and the sample mapping set of the target text;
the sample word mapping set comprises a plurality of subsets, each subset comprises a one-to-one mapping relation between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and the dimensions of all word vectors in the sample word mapping set are the same; the dictionary is a subset of all sample words in the set of sample word mappings.
Fig. 9 is a block diagram illustrating a text classification device 900 according to an exemplary embodiment. For example, apparatus 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 9, apparatus 900 may include one or more of the following components: a processing component 902, a memory 904, a power component 906, a multimedia component 908, an audio component 910, an input/output (I/O) interface 912, a sensor component 914, and a communication component 916.
The processing component 902 generally controls overall operations of the apparatus 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 902 may include one or more processors 920 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 902 can include one or more modules that facilitate interaction between the processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
The memory 904 is configured to store various types of data to support operations at the device 900. Examples of such data include instructions for any application or method operating on the device 900, contact data, phonebook data, messages, pictures, videos, and the like. The memory 904 may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 906 provides power to the various components of the device 900. Power components 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for device 900.
The multimedia component 908 comprises a screen between the device 900 and the user that provides an output interface. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 900 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 910 is configured to output and/or input audio signals. For example, the audio component 910 includes a Microphone (MIC) configured to receive external audio signals when the device 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 904 or transmitted via the communication component 916. In some embodiments, the audio component 910 further includes a speaker for outputting audio signals.
The I/O interface 912 provides an interface between the processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 914 includes one or more sensors for providing status assessment of various aspects of the apparatus 900. For example, the sensor assembly 914 may detect the on/off state of the device 900 and the relative positioning of components such as the display and keypad of the apparatus 900; it may also detect a change in position of the apparatus 900 or of one of its components, the presence or absence of user contact with the apparatus 900, the orientation or acceleration/deceleration of the apparatus 900, and a change in temperature of the apparatus 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 916 is configured to facilitate communication between the apparatus 900 and other devices in a wired or wireless manner. The device 900 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 916 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, apparatus 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory 904 including instructions executable by the processor 920 of the apparatus 900 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit herein being indicated by the following claims.

Claims (13)

1. A method of text classification, comprising:
determining the formation mode of the target text according to the dictionary;
generating word vectors of the target text according to the word vector generation method corresponding to the formation mode;
classifying the target text according to the word vector and the sample mapping set of the target text;
the sample word mapping set comprises a plurality of subsets, each subset comprises a one-to-one mapping relation between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and the dimensions of all word vectors in the sample word mapping set are the same; the dictionary is a subset of all sample words in the set of sample word mappings.
2. The text classification method of claim 1,
the method for determining the formation mode of the target text according to the dictionary comprises the following steps: when the target text is determined to be a sample word in the dictionary, determining that the construction mode of the target text is a first mode;
the word vector generation method corresponding to the first mode comprises the following steps: and inquiring a one-to-one mapping relation in the sample word mapping set to determine a word vector corresponding to the target text.
3. The text classification method of claim 1,
the method for determining the formation mode of the target text according to the dictionary comprises the following steps: determining that the target text is not a sample word in the dictionary, and determining that the construction mode of the target text is a second mode when the length of the target text is greater than or equal to a set length;
the word vector generation method corresponding to the second mode comprises the following steps: and obtaining at least one effective composition word after word segmentation operation is carried out on the target text, selecting the effective composition word belonging to the dictionary from the effective composition words, inquiring a one-to-one mapping relation in a sample word mapping set to determine word vectors corresponding to the selected effective composition word, and determining word vectors corresponding to the target text according to the word vectors corresponding to each selected effective composition word.
4. The text classification method of claim 1,
the method for determining the formation mode of the target text according to the dictionary comprises the following steps: determining that the target text is not a sample word in the dictionary, and determining that the formation mode of the target text is a third mode when the length of the target text is smaller than a set length;
The word vector generation method corresponding to the third mode comprises the following steps: and respectively carrying out at least one sliding window type splitting on the target text, wherein window lengths of different sliding windows are different, selecting unit words belonging to the dictionary from unit words obtained by splitting through each sliding window type, inquiring a one-to-one mapping relation in the sample word mapping set to determine word vectors corresponding to all selected unit words, and determining word vectors corresponding to the target text according to the word vectors corresponding to all selected unit words.
5. The text classification method of claim 4,
the length of the sliding window used in the at least one sliding window type split is N character lengths from 1 character length to M character length, M being an integer greater than 1, N being less than or equal to M.
6. The text classification method of claim 1,
classifying the target text using the word vector and sample mapping set of the target text, comprising: and calculating the similarity between the word vector of the target text and the word vector in the sample mapping set, and determining the category to which the target text belongs according to the similarity.
7. A text classification device, comprising:
the first determining module is used for determining the formation mode of the target text according to the dictionary;
the generating module is used for generating the word vector of the target text according to the word vector generating method corresponding to the composition mode;
the classification module is used for classifying the target text according to the word vector and the sample mapping set of the target text;
the sample word mapping set comprises a plurality of subsets, each subset comprises a one-to-one mapping relation between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and the dimensions of all word vectors in the sample word mapping set are the same; the dictionary is a subset of all sample words in the set of sample word mappings.
8. The text classification apparatus of claim 7,
the first determining module includes:
the second determining module is used for determining that the composition mode of the target text is a first mode when the target text is a sample word in the dictionary;
the generation module comprises:
the first execution module is used for generating the word vector of the target text according to the word vector generation method corresponding to the first mode;
The first execution module includes:
and the first query module is used for querying the one-to-one mapping relation in the sample word mapping set to determine the word vector corresponding to the target text.
9. The text classification apparatus of claim 7,
the first determining module includes:
a third determining module, configured to determine that the target text is not a sample word in the dictionary, and determine that a configuration mode of the target text is a second mode when a length of the target text is greater than or equal to a set length;
the generation module comprises:
the second execution module is used for generating the word vector of the target text according to the word vector generation method corresponding to the second mode;
the second execution module includes:
the word segmentation module is used for obtaining at least one effective composition word after carrying out word segmentation operation on the target text;
a first selection module, configured to select an effective component word belonging to the dictionary from the effective component words;
the second query module is used for querying one-to-one mapping relation in the sample word mapping set to determine word vectors corresponding to the selected effective component words;
and the fourth determining module is used for determining the word vector corresponding to the target text according to the word vector corresponding to each selected effective composition word.
10. The text classification apparatus of claim 7,
the first determining module includes:
a fourth determining module, configured to determine that the composition mode of the target text is a third mode when the target text is not a sample word in the dictionary and the length of the target text is less than a set length;
the generation module comprises:
the third execution module is used for generating the word vector of the target text according to the word vector generation method corresponding to the third mode;
the third execution module includes:
the splitting module is used for splitting the target text with at least one sliding window, wherein different sliding windows have different window lengths;
the second selection module is used for selecting unit words belonging to the dictionary from the unit words obtained after splitting by using each sliding window;
the third query module is used for querying the one-to-one mapping relation in the sample word mapping set to determine word vectors corresponding to all selected unit words;
and a fifth determining module, configured to determine a word vector corresponding to the target text according to the word vectors corresponding to all the selected unit words.
11. The text classification apparatus of claim 10,
the lengths of the sliding windows used in the at least one sliding-window split are N character lengths ranging from 1 character length to M character lengths, M being an integer greater than 1 and N being less than or equal to M.
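The sliding-window split of claims 10 and 11 is, in effect, character n-gram generation for every window length from 1 up to M. A minimal sketch (the dictionary filtering and vector lookup of claim 10 would then be applied to the returned unit words):

```python
def sliding_window_units(target_text, max_window):
    """Split the target text with sliding windows of each length from
    1 character to max_window characters (claim 11's lengths 1..M);
    each window position yields one candidate unit word."""
    units = []
    for n in range(1, max_window + 1):
        for i in range(len(target_text) - n + 1):
            units.append(target_text[i:i + n])
    return units
```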
12. The text classification apparatus of claim 7,
the classification module comprises:
the computing module is used for computing the similarity between the word vector of the target text and the word vectors in the sample word mapping set;
and the determining module is used for determining the category to which the target text belongs according to the similarity.
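Claim 12 classifies by similarity but leaves the similarity measure unspecified; cosine similarity with a nearest-neighbour decision is one common reading, sketched here under that assumption:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal dimension."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def classify(target_vector, mapping):
    """Assign the category whose sample word vector is most similar to the
    target text's word vector (a nearest-neighbour reading of claim 12)."""
    best_category, best_score = None, float("-inf")
    for category, subset in mapping.items():
        for vector in subset.values():
            score = cosine_similarity(target_vector, vector)
            if score > best_score:
                best_category, best_score = category, score
    return best_category
```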
13. A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a text classification method, the method comprising:
determining the composition mode of the target text according to a dictionary;
generating a word vector of the target text according to a word vector generation method corresponding to the composition mode; and
classifying the target text according to the word vector of the target text and the sample word mapping set;
the sample word mapping set comprises a plurality of subsets, each subset comprises one-to-one mapping relations between a plurality of sample words and word vectors, sample words contained in different subsets belong to different categories, and all word vectors in the sample word mapping set have the same dimension; the dictionary is a subset of all sample words in the sample word mapping set.
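The mode determination that the claims dispatch on reduces to two tests: dictionary membership, then length against the set length. A minimal sketch (the default set length of 2 is an assumption for illustration; the patent does not fix its value):

```python
def composition_mode(target_text, dictionary, set_length=2):
    """Determine the composition mode per claims 8-10: first mode when the
    target text is itself a sample word in the dictionary; otherwise second
    mode when its length reaches the set length, third mode when shorter."""
    if target_text in dictionary:
        return "first"
    return "second" if len(target_text) >= set_length else "third"
```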
CN202010114084.7A 2020-02-25 2020-02-25 Text classification method, device and medium Active CN111259158B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010114084.7A CN111259158B (en) 2020-02-25 2020-02-25 Text classification method, device and medium

Publications (2)

Publication Number Publication Date
CN111259158A CN111259158A (en) 2020-06-09
CN111259158B true CN111259158B (en) 2023-06-02

Family

ID=70951311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010114084.7A Active CN111259158B (en) 2020-02-25 2020-02-25 Text classification method, device and medium

Country Status (1)

Country Link
CN (1) CN111259158B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797226B (en) 2020-06-30 2024-04-05 北京百度网讯科技有限公司 Conference summary generation method and device, electronic equipment and readable storage medium
CN111611394B (en) * 2020-07-03 2021-09-07 中国电子信息产业集团有限公司第六研究所 Text classification method and device, electronic equipment and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011093925A1 (en) * 2010-02-01 2011-08-04 Alibaba Group Holding Limited Method and apparatus of text classification
CN104820703A (en) * 2015-05-12 2015-08-05 武汉数为科技有限公司 Text fine classification method
CN110334209A (en) * 2019-05-23 2019-10-15 平安科技(深圳)有限公司 File classification method, device, medium and electronic equipment
CN110717040A (en) * 2019-09-18 2020-01-21 平安科技(深圳)有限公司 Dictionary expansion method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑开雨; 竹翠. Naive Bayes text classification algorithm based on contextual semantics. 计算机与现代化 (Computer and Modernization), 2018, (06), full text. *


Similar Documents

Publication Publication Date Title
CN107102746B (en) Candidate word generation method and device and candidate word generation device
US20170154104A1 (en) Real-time recommendation of reference documents
CN107870677B (en) Input method, input device and input device
CN107305438B (en) Method and device for sorting candidate items
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN111832316B (en) Semantic recognition method, semantic recognition device, electronic equipment and storage medium
CN107844199A (en) A kind of input method, system and the device for input
CN111259158B (en) Text classification method, device and medium
CN110069624B (en) Text processing method and device
CN111831806A (en) Semantic integrity determination method and device, electronic equipment and storage medium
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN111222316B (en) Text detection method, device and storage medium
CN110781689B (en) Information processing method, device and storage medium
CN110780749B (en) Character string error correction method and device
CN112199032A (en) Expression recommendation method and device and electronic equipment
WO2023092975A1 (en) Image processing method and apparatus, electronic device, storage medium, and computer program product
CN108073294B (en) Intelligent word forming method and device for intelligent word forming
CN107291259B (en) Information display method and device for information display
CN106959970B (en) Word bank, processing method and device of word bank and device for processing word bank
CN112987941B (en) Method and device for generating candidate words
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN113589954A (en) Data processing method and device and electronic equipment
CN108227952B (en) Method and system for generating custom word and device for generating custom word
CN107665206B (en) Method and system for cleaning user word stock and device for cleaning user word stock
CN107765884B (en) Sliding input method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information

Address after: 100085 unit C, building C, lin66, Zhufang Road, Qinghe, Haidian District, Beijing

Applicant after: Beijing Xiaomi pinecone Electronic Co.,Ltd.

Address before: 100085 unit C, building C, lin66, Zhufang Road, Qinghe, Haidian District, Beijing

Applicant before: BEIJING PINECONE ELECTRONICS Co.,Ltd.

SE01 Entry into force of request for substantive examination
GR01 Patent grant