CN112861974A - Text classification method and device, electronic equipment and storage medium

Text classification method and device, electronic equipment and storage medium

Info

Publication number
CN112861974A
Authority
CN
China
Prior art keywords
text
text set
feature vector
category
degree
Prior art date
Legal status
Pending
Application number
CN202110183059.9A
Other languages
Chinese (zh)
Inventor
李东根
田原
易仕伟
张伟
Current Assignee
Workway Shenzhen Information Technology Co ltd
Original Assignee
Workway Shenzhen Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Workway Shenzhen Information Technology Co ltd filed Critical Workway Shenzhen Information Technology Co ltd
Priority to CN202110183059.9A
Publication of CN112861974A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147Distances to closest patterns, e.g. nearest neighbour classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of machine learning, and discloses a text classification method and apparatus, an electronic device, and a storage medium, wherein the text classification method comprises the following steps: obtaining a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category and comprises feature vectors of text data belonging to the same category; for each text set, obtaining the K nearest feature vectors of the target feature vector from that text set, obtaining the polymerization degree between the target feature vector and that text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result between the polymerization degree and the category polymerization degree of that text set; determining a target text set from the at least two text sets based on the comparison result corresponding to each text set; and determining the category of the target text set as the category of the text to be processed, thereby improving the accuracy of text classification.

Description

Text classification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a text classification method and apparatus, an electronic device, and a storage medium.
Background
The text classification model is one of the important applications in the field of artificial intelligence and can identify the category to which a text belongs. Text classification models are widely used in intelligent customer service, news recommendation, intention recognition systems, and the like; that is, the text classification model is a basic component of such complex systems.
At present, in many text classification tasks, the KNN (K-nearest neighbor) algorithm is usually selected to quickly build a text classification model from existing text data. Its basic idea is as follows: given the categories of the text data in a training set, compare the feature vector of the text to be processed with the feature vectors of the text data in the training set, find the K pieces of text data in the training set that are most similar to the text to be processed, and take the category occurring most frequently among these K pieces of text data as the category to which the text to be processed belongs.
However, the classification result of a text classification model based on the KNN algorithm is extremely sensitive to the choice of the value of K and to the distribution of the text data in the training set. In practical applications, for application scenarios with few samples, the text data of the different categories is easily distributed unevenly, which seriously reduces the accuracy of text classification.
Disclosure of Invention
The embodiments of the application provide a text classification method, a text classification device, an electronic device, and a storage medium, which can reduce the influence of the value of K on the classification result, mitigate the inaccurate classification results caused by differing data distributions or unbalanced data, and improve the accuracy of text classification.
In a first aspect, an embodiment of the present application provides a text classification method, including:
obtaining a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category and comprises feature vectors of text data belonging to the same category;
for each text set, obtaining the K nearest feature vectors of the target feature vector from that text set, obtaining a polymerization degree between the target feature vector and that text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result between the polymerization degree and the category polymerization degree of that text set; the polymerization degree represents how densely the target feature vector and the K nearest feature vectors are packed together, and the category polymerization degree represents how densely the feature vectors within the same text set are distributed;
determining a target text set from the at least two text sets based on a comparison result corresponding to each text set;
and determining the category of the target text set as the category of the text to be processed.
Optionally, the obtaining K nearest feature vectors of the target feature vector from each text set specifically includes:
obtaining the similarity between each feature vector in each text set and the target feature vector;
and sorting the similarities in descending order, determining the top-K feature vectors as the K nearest feature vectors of the target feature vector.
Optionally, the obtaining, based on the K nearest feature vectors and the target feature vector, the polymerization degree between the target feature vector and each text set specifically includes:
obtaining the similarity between each of the K nearest feature vectors and the target feature vector;
and determining the average of the similarities corresponding to the K nearest feature vectors as the polymerization degree between the target feature vector and each text set.
Optionally, the comparison result is the ratio of the polymerization degree to the category polymerization degree, and the determining a target text set from the at least two text sets based on the comparison result corresponding to each text set specifically includes:
determining the classification probability that the text to be processed belongs to each text set based on the comparison result corresponding to each text set;
and determining the text set with the largest classification probability as the target text set.
Optionally, the category polymerization degree of each text set is obtained as follows:
for each feature vector in each text set, obtaining the similarity between that feature vector and each other feature vector in the same text set, sorting the obtained similarities in descending order, and determining the polymerization degree between that feature vector and the text set based on the top-K similarities;
and determining the category polymerization degree corresponding to each text set based on the polymerization degree corresponding to each feature vector in each text set.
Optionally, the determining, based on the top-K similarities, the polymerization degree between each feature vector and each text set specifically includes:
determining the average of the top-K similarities as the polymerization degree between each feature vector and each text set.
Optionally, the determining, based on the polymerization degree corresponding to each feature vector in each text set, the category polymerization degree corresponding to each text set specifically includes:
determining the average of the polymerization degrees corresponding to the feature vectors in each text set as the category polymerization degree corresponding to each text set.
Optionally, each category corresponds to a user intention.
In a second aspect, an embodiment of the present application provides a text classification apparatus, including:
the acquisition module is used for acquiring a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category, and each text set comprises feature vectors of text data belonging to the same category;
the polymerization degree calculation module is used for obtaining, for each text set, the K nearest feature vectors of the target feature vector from that text set, obtaining the polymerization degree between the target feature vector and that text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result between the polymerization degree and the category polymerization degree of that text set; the polymerization degree represents how densely the target feature vector and the K nearest feature vectors are packed together, and the category polymerization degree represents how densely the feature vectors within the same text set are distributed;
and the classification module is used for determining a target text set from the at least two text sets based on the comparison result corresponding to each text set, and determining the category of the target text set as the category of the text to be processed.
Optionally, the polymerization degree calculation module is specifically configured to: obtain the similarity between each feature vector in each text set and the target feature vector; and sort the similarities in descending order and determine the top-K feature vectors as the K nearest feature vectors of the target feature vector.
Optionally, the polymerization degree calculation module is specifically configured to: obtain the similarity between each of the K nearest feature vectors and the target feature vector; and determine the average of the similarities corresponding to the K nearest feature vectors as the polymerization degree between the target feature vector and each text set.
Optionally, the comparison result is the ratio of the polymerization degree to the category polymerization degree, and the classification module is specifically configured to: determine the classification probability that the text to be processed belongs to each text set based on the comparison result corresponding to each text set; and determine the text set with the largest classification probability as the target text set.
Optionally, the text classification apparatus further includes a training module, configured to obtain the category polymerization degree of each text set as follows:
for each feature vector in each text set, obtaining the similarity between that feature vector and each other feature vector in the same text set, sorting the obtained similarities in descending order, and determining the polymerization degree between that feature vector and the text set based on the top-K similarities;
and determining the category polymerization degree corresponding to each text set based on the polymerization degree corresponding to each feature vector in each text set.
Optionally, the training module is specifically configured to: determine the average of the top-K similarities as the polymerization degree between each feature vector and each text set.
Optionally, the training module is specifically configured to: determine the average of the polymerization degrees corresponding to the feature vectors in each text set as the category polymerization degree corresponding to each text set.
Optionally, each category corresponds to a user intention.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any one of the methods when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the steps of any of the methods described above.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a processor, implement the steps of any of the methods described above.
The text classification method and device, electronic device, and storage medium provided by the embodiments of the present application can automatically learn, from the text sets, how the text data of each category is distributed in the feature space, and obtain a category polymerization degree representing how densely the feature vectors in a text set are distributed. When classifying a text, the text data of the different categories in the training set is divided into the text sets corresponding to those categories, the K pieces of text data most similar to the text to be processed are searched for separately within the text set of each category, the polymerization degree of the text to be processed with respect to each category is calculated, and the most suitable category for the text to be processed is determined by comparing, for each category, that polymerization degree with the category polymerization degree. On the one hand, searching for the K nearest pieces of text data and the subsequent processing are carried out separately in the text sets of the different categories, which reduces the influence of the value of K on the classification result; on the other hand, classifying by comparing the polymerization degree of the text to be processed for each category with that category's polymerization degree mitigates the inaccurate classification results caused by differing data distributions or unbalanced data. Through these two optimizations, the accuracy of text classification is improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic diagram of text classification based on a KNN algorithm when K is 3 according to an embodiment of the present application;
fig. 2 is a schematic diagram of text classification based on a KNN algorithm when K is 5 according to an embodiment of the present application;
FIG. 3 is a diagram of an example of a distribution of text data in a feature space according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a text classification method according to an embodiment of the present application;
fig. 5 is a schematic flowchart of obtaining a category polymerization degree of a text collection according to an embodiment of the present application;
FIG. 6 is a diagram of an example of a distribution of text data in a feature space according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a text classification apparatus provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, all other embodiments that can be derived by one of ordinary skill in the art from the embodiments disclosed herein without making any creative effort fall within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
For convenience of understanding, terms referred to in the embodiments of the present application are explained below:
stop Words (Stop Words) refer to the automatic filtering of some Words or phrases before or after processing natural language data (or text) in order to save storage space and improve search efficiency in information retrieval. Stop words are all manually input and are not automatically generated, and the generated stop words form a stop word list.
Chinese word segmentation is the process of dividing a sequence of Chinese characters into several independent words, i.e., recombining a continuous character sequence into a word sequence according to a certain standard. A common Chinese word segmentation tool is jieba.
Word2vec is a group of related models used to generate word vectors. These models are shallow two-layer neural networks trained to reconstruct the linguistic context of words: the network is given a word and guesses the words in adjacent positions, and under the bag-of-words assumption used in word2vec, the order of the words is unimportant. After training, the word2vec model can be used to map each word to a vector representing the relationships between words; this vector is a hidden layer of the neural network.
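As an illustration, such word vectors can be trained with the gensim library; the sketch below is a minimal example under that assumption, and the toy corpus and parameter values are illustrative rather than taken from this application:

```python
# A minimal sketch, assuming the gensim library; the toy corpus and parameters are
# illustrative and not taken from this application.
from gensim.models import Word2Vec

# Each training sentence is a list of segmented words (e.g. produced by jieba).
sentences = [["I", "want", "to", "transfer", "money"],
             ["check", "my", "account", "balance"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vec = model.wv["transfer"]  # maps the word to its 100-dimensional vector
```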
The basic idea of the KNN algorithm is as follows: given the categories of the text data in a training set, compare the feature vector of the text to be processed with the feature vectors of the text data in the training set, find the K pieces of text data in the training set most similar to the text to be processed, and take the category occurring most frequently among these K pieces of text data as the category to which the text to be processed belongs. Taking fig. 1 as an example, each point in fig. 1 represents the feature vector of a piece of text data: the triangular points belong to category one, the circular points belong to category two, and the square point represents the feature vector of the text to be processed. Assuming K=3, the KNN algorithm finds the three points closest to the square point (i.e., the points within the dotted circle in fig. 1) and determines which category each of the three points belongs to; since more of the three points belong to category one, the text to be processed (i.e., the square point) is classified into category one.
However, the classification result of the text classification model based on the KNN algorithm is extremely sensitive to the selection of K values and the distribution of text data in the training set.
In practical applications, different values of K may produce different classification results. Taking fig. 1 as an example, when K=3 the KNN algorithm finds the three points closest to the feature vector of the text to be processed (i.e., the square point) and classifies the text to be processed into category one (the category of the triangular points). Taking fig. 2 as an example, when K=5 the KNN algorithm finds the five points closest to the feature vector of the text to be processed, and since 3 of the five points are circular points, the text to be processed is classified into category two.
Further, the distributions of text data of different categories may differ, or the amount and density of data may differ between categories due to unbalanced sample numbers; the KNN algorithm takes none of this into account. Taking fig. 3 as an example, each dot represents a piece of text data: the black dots belong to category one, the white dots belong to category two, the black dots lie in a narrow, densely populated region, and the white dots lie in a wide, sparsely populated region. For the gray point to be classified, the nearest neighbors found by KNN are all black points, so the gray point is classified into category one. However, analyzing the distribution of the dots in fig. 3, the black dots are all right next to each other while the gray point is clearly distant from the nearest black points; conversely, the spacing between the gray point and the white dots is consistent with the distribution of the white dots. Therefore, the gray point should be classified into category two.
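This sensitivity is easy to reproduce. The sketch below is a minimal illustration of plain KNN voting with made-up two-dimensional points standing in for the feature vectors of fig. 1 and fig. 2; it shows the baseline behavior, not the method of this application:

```python
# A minimal sketch of plain KNN voting with Euclidean distance; the points and labels
# are made-up stand-ins for the feature vectors in fig. 1 and fig. 2.
import numpy as np

def knn_classify(query, points, labels, k):
    # Rank training points by distance to the query and vote among the top k.
    dists = np.linalg.norm(points - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

points = np.array([[1.0, 1.0], [1.2, 0.9], [2.0, 2.1], [2.2, 2.0], [2.1, 2.3]])
labels = ["one", "one", "two", "two", "two"]
query = np.array([1.4, 1.2])
print(knn_classify(query, points, labels, k=3))  # "one"
print(knn_classify(query, points, labels, k=5))  # "two": the result flips with k
```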
Therefore, in application scenarios with few samples, the text data of the different categories is easily distributed unevenly, which seriously reduces the accuracy of text classification.
Based on this, an embodiment of the present application provides a text classification method, which specifically includes the following steps with reference to fig. 4:
s401, obtaining a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category, and each text set comprises feature vectors of text data belonging to the same category.
In a specific implementation, text data used or appearing in the application scenario can be collected, divided into a plurality of categories, and text data of the same category placed into the same text set. The categories of the text data may be set according to application requirements; for example, for the intelligent customer service of a shopping website, the text input by users may be divided into commodity consultation, price consultation, payment operation consultation, activity consultation, after-sale consultation, and the like. The embodiment of the present application is not limited in this respect.
In a specific implementation, the text data in each text set may be preprocessed in advance, where the purpose of the preprocessing is to remove noise from the text data. The preprocessing may specifically include: removing special characters (characters other than Chinese and English) and stop words from the text data, and then performing word segmentation on the text data to obtain a plurality of segmented words. The segmented words are then converted into a feature vector, i.e., the feature vector of that text data, and the feature vector of each piece of text data is stored in the corresponding text set for convenient use during text classification.
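A minimal sketch of this preprocessing is shown below; it assumes the jieba segmenter mentioned earlier, and the tiny stop-word set is a placeholder for a real stop-word list:

```python
# A minimal preprocessing sketch, assuming the jieba segmenter; the tiny stop-word
# set is a placeholder for a real stop-word list.
import re
import jieba

STOP_WORDS = {"的", "了", "吗"}

def preprocess(text):
    # Keep only Chinese and English characters, then segment and drop stop words.
    text = re.sub(r"[^\u4e00-\u9fa5a-zA-Z]", "", text)
    return [w for w in jieba.cut(text) if w not in STOP_WORDS]

print(preprocess("我想转账!"))  # e.g. ['我', '想', '转账']
```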
Specifically, the text can be converted into a feature vector by combining a FastText word vector model with SIF weighting, with the formula v = Σ_w α_w · FastText(w). First, the FastText word vector model is used to convert every segmented word in the text data into a word vector; each word vector is then multiplied by its SIF weight α_w = a / (a + P(w)), where a = 0.01 and P(w) is the probability of occurrence of the word w; finally, the weighted word vectors are summed to obtain the vector representation of the text data. Similarly, the text to be processed can be converted into a feature vector by the same combination of the FastText word vector model and SIF weighting, giving the target feature vector.
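The weighting and summation can be sketched as follows; word_vector and word_prob are assumed lookup functions standing in for the FastText model and the word-frequency statistics, not a concrete library API:

```python
# A sketch of the SIF-weighted sentence vector v = sum_w a/(a + P(w)) * FastText(w);
# word_vector and word_prob are assumed lookup functions, not a concrete library API.
import numpy as np

A = 0.01  # the constant a in the SIF weight a / (a + P(w))

def sentence_vector(words, word_vector, word_prob):
    # Multiply each word vector by its SIF weight, then sum over the sentence.
    weighted = [(A / (A + word_prob(w))) * word_vector(w) for w in words]
    return np.sum(weighted, axis=0)
```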
Of course, other ways may also be adopted to convert the text into a feature vector, such as a word2vec model; the embodiment of the present application is not limited in this respect.
S402, for each text set, the K nearest feature vectors of the target feature vector are obtained from the text set, the polymerization degree between the target feature vector and the text set is obtained based on the K nearest feature vectors and the target feature vector, and a comparison result between the polymerization degree and the category polymerization degree of the text set is obtained.
The polymerization degree represents how densely the target feature vector and the K nearest feature vectors are packed together, i.e., how strongly the text to be processed aggregates with the text set. The category polymerization degree represents how densely the feature vectors within the same text set are distributed.
The K nearest feature vectors are the K feature vectors closest to the target feature vector. In a specific implementation, the obtaining of the K nearest feature vectors of the target feature vector from the text set in step S402 specifically includes: obtaining the similarity between each feature vector in the text set and the target feature vector, sorting the similarities in descending order, and determining the top-K feature vectors as the K nearest feature vectors of the target feature vector. Alternatively, the K nearest feature vectors may be found using a kd-tree, which is prior art and is not described in detail here.
In a specific implementation, the obtaining of the polymerization degree between the target feature vector and the text set based on the K nearest feature vectors and the target feature vector in step S402 specifically includes: obtaining the similarity between each of the K nearest feature vectors and the target feature vector, and determining the average of the similarities corresponding to the K nearest feature vectors as the polymerization degree between the target feature vector and the text set.
Taking the text set V_c of a category c as an example, the similarity between the target feature vector J and each feature vector in V_c is calculated, where the similarity between the target feature vector J and the i-th feature vector v_{c,i} in V_c is denoted S_{c,i} = Sim(v_{c,i}, J), and the set of all S_{c,i} is denoted S_c; then the K most similar feature vectors are taken from S_c, giving the set of K nearest feature vectors K_s = Max(S_c, K); finally, the average similarity over the K nearest feature vectors is calculated as the polymerization degree between the target feature vector J and the text set V_c: h_{J,c} = Mean(K_s).
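The computation above can be sketched directly; cosine similarity is assumed for Sim, V_c is an (n, d) array of the feature vectors in the text set, and j is the target feature vector J:

```python
# A sketch of the polymerization degree h_{J,c} = Mean(Max(S_c, K)), assuming cosine
# similarity for Sim; V_c is an (n, d) array of feature vectors and j is the target.
import numpy as np

def polymerization_degree(j, V_c, k):
    # S_c: similarity between j and every feature vector in the text set.
    sims = V_c @ j / (np.linalg.norm(V_c, axis=1) * np.linalg.norm(j))
    k_nearest = np.sort(sims)[-k:]    # K_s = Max(S_c, K)
    return float(np.mean(k_nearest))  # h_{J,c} = Mean(K_s)
```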
S403, determining a target text set from at least two text sets based on the comparison result corresponding to each text set.
In a specific implementation, the ratio of the polymerization degree to the category polymerization degree can be used as the comparison result. The classification probability that the text to be processed belongs to each text set can then be determined based on the comparison result corresponding to each text set, and the text set with the largest classification probability is determined as the target text set.
For example, let the category polymerization degree of a text set V_c be a_c, and let the polymerization degree between the target feature vector J and the text set V_c be h_c; the comparison result between the polymerization degree h_c and the category polymerization degree a_c is then h_c / a_c. Based on the Softmax function P = Softmax(h_c / a_c), the classification probability that the target feature vector J belongs to category c is obtained, and the text set with the largest classification probability is determined as the target text set.
In practice, the closer h_c / a_c is to 1, the closer the placement of the target feature vector J within the text set V_c is to the distribution of the feature vectors in V_c. Therefore, among the comparison results h_c / a_c corresponding to the text sets, the h_c / a_c closest to 1 can also be selected, and the text set corresponding to that h_c / a_c is determined as the target text set.
Dividing the polymerization degree between the target feature vector J and the text set V_c by the category polymerization degree of V_c converts the absolute polymerization degree between the target feature vector J and the text data into a relative quantity. Classifying based on the ratio of the two fully takes the distribution of the text data of each category into account and improves the accuracy of text classification.
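A sketch of this decision step under the Softmax variant is given below; the two categories and their h and a values are made-up illustrative numbers:

```python
# A sketch of steps S403 and S404 under the Softmax variant P = Softmax(h_c / a_c);
# the two categories and their h and a values are made-up illustrative numbers.
import numpy as np

def classify(h, a, categories):
    ratios = np.asarray(h) / np.asarray(a)          # comparison results h_c / a_c
    probs = np.exp(ratios) / np.exp(ratios).sum()   # classification probabilities
    return categories[int(np.argmax(probs))], probs

category, probs = classify(h=[0.80, 0.55], a=[0.90, 0.50], categories=["one", "two"])
print(category, probs)
```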
S404, determining the category of the target text set as the category of the text to be processed.
In a specific implementation, referring to fig. 5, the category polymerization degree of each text set may be obtained as follows:
S501, for each feature vector in the text set, the similarity between that feature vector and each other feature vector in the text set is obtained, the obtained similarities are sorted in descending order, and the polymerization degree between that feature vector and the text set is determined based on the top-K similarities.
Specifically, the average of the top-K similarities may be determined as the polymerization degree between the feature vector and the text set. Alternatively, the median of the top-K similarities may be taken as the polymerization degree between the feature vector and the text set.
S502, based on the polymerization degree corresponding to each feature vector in the text set, determining the category polymerization degree corresponding to the text set.
Specifically, the average of the polymerization degrees corresponding to the feature vectors in the text set may be determined as the category polymerization degree corresponding to the text set. Alternatively, the median of the polymerization degrees corresponding to the feature vectors in the text set may be taken as the category polymerization degree.
For example, assume K=3 and that the text set V_c contains the feature vectors of 20 pieces of text data. Taking one of the feature vectors in V_c as an example, the similarities between this feature vector and the other 19 feature vectors are calculated, the 3 largest similarities are taken, and the average of these 3 similarities is taken as the polymerization degree between this feature vector and the text set. The polymerization degrees corresponding to all 20 feature vectors are obtained in this way, and the average of these 20 polymerization degrees is then taken as the category polymerization degree corresponding to the text set.
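The training-stage computation of fig. 5 can be sketched as follows, again assuming cosine similarity; V_c is an (n, d) array of the feature vectors of one category:

```python
# A sketch of fig. 5, assuming cosine similarity; V_c is an (n, d) array holding the
# feature vectors of one category, and k must be at most n - 1.
import numpy as np

def category_polymerization_degree(V_c, k):
    # Pairwise cosine similarities between all feature vectors in the set.
    V = V_c / np.linalg.norm(V_c, axis=1, keepdims=True)
    sims = V @ V.T
    np.fill_diagonal(sims, -np.inf)   # exclude each vector's similarity to itself
    # Per-vector polymerization degree: mean of the k largest similarities to others.
    per_vector = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    # Category polymerization degree: average over all vectors in the set.
    return float(per_vector.mean())
```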
Taking fig. 6 as an example, each point represents a piece of text data: the black points belong to category one and the white points belong to category two. Text data of the same category is distributed similarly in the feature space and aggregates into a category region; the black-point region is narrow and dense, while the white-point region is wide and sparse. Following the steps shown in fig. 5, the category polymerization degrees of category one and category two are calculated; they are illustrated by line segments in fig. 6, where the length of a segment is inversely related to the category polymerization degree: the longer the segment, the smaller the category polymerization degree, indicating that the text data of that category is more sparsely distributed. For example, the black points are densely distributed and yield a high category polymerization degree, while the white points are sparsely distributed and yield a low one. Then, following the steps shown in fig. 4, the polymerization degree h1 between the gray point and category one and the polymerization degree h2 between the gray point and category two are calculated, h1 is compared with the category polymerization degree of category one, and h2 is compared with the category polymerization degree of category two. Based on the comparison results, h2 is closer to the category polymerization degree of category two, i.e., the gray point better conforms to the distribution of the white points, so the gray point is assigned to category two.
The text classification method described above can automatically learn, from the text sets, how the text data of each category is distributed in the feature space, and obtain a category polymerization degree representing how densely the feature vectors in a text set are distributed. When classifying a text, the text data of the different categories in the training set is divided into the text sets corresponding to those categories, the K pieces of text data most similar to the text to be processed are searched for separately within the text set of each category, the polymerization degree of the text to be processed with respect to each category is calculated, and the most suitable category for the text to be processed is determined by comparing, for each category, that polymerization degree with the category polymerization degree. On the one hand, searching for the K nearest pieces of text data and the subsequent processing are carried out separately in the text sets of the different categories, which reduces the influence of the value of K on the classification result; on the other hand, classifying by comparing the polymerization degree of the text to be processed for each category with that category's polymerization degree mitigates the inaccurate classification results caused by differing data distributions or unbalanced data. Through these two optimizations, the accuracy of text classification is improved.
The text classification method can be applied to application scenarios such as intelligent customer service, news recommendation, and intention recognition. A text classification model with high classification accuracy can be quickly built from the text data of the application scenario, and texts can be classified well even in few-sample scenarios.
Next, the classification method of the embodiment of the present application is described taking a banking service as an example.
In the intention recognition, each category in the classification method corresponds to a user intention, and text data indicating the same intention is included in a text set of each category.
For example, consider the unmanned banking service of a bank: the user's intention must be identified from a sentence input by voice or text, so that the corresponding service can be provided or the user can be guided to perform the corresponding operation. The intention categories can be determined according to application requirements; for example, 6 categories may be transfer, deposit, withdraw, confirm, modify, and cancel. Text data corresponding to the various intentions is then collected and stored in the corresponding text sets. Here, the text data of the three categories transfer, deposit, and withdraw consists of long sentences, large in amount and sparsely distributed, while the text data of the three categories confirm, modify, and cancel consists of short phrases or single words, small in amount and densely distributed.
Training is performed on the basis of a text set corresponding to each intention so as to obtain a category polymerization degree corresponding to each intention.
Firstly, preprocessing text data in a text set.
The preprocessing aims to remove noise data in the text data, wherein the preprocessing specifically comprises the following steps: the method comprises the steps of removing special characters (other characters except Chinese and English) and stop words in text data, and then performing word segmentation processing on the text data to obtain a plurality of word segments.
And secondly, converting the preprocessed text data into a feature vector.
The segmented words of each piece of text data are converted into a feature vector, i.e., the feature vector of that text data, and the feature vector of each piece of text data is stored in the corresponding text set for convenient use during text classification.
And thirdly, calculating the category polymerization degree.
The polymerization degree of each piece of text data with respect to the text set of its intention is calculated. Taking a category c as an example, let V_c denote the text set of category c. In the manner described with reference to fig. 5, the polymerization degree of each piece of text data in V_c with respect to V_c is calculated, and the average of these polymerization degrees is then taken as the category polymerization degree of category c. In this way, the category polymerization degrees of all categories are obtained.
And performing intention recognition on the text to be processed input by the user based on the category polymerization degree obtained in the training stage.
Firstly, preprocessing a text to be processed input by a user.
The preprocessing is performed in the same way as in the training stage.
And secondly, converting the preprocessed text into a feature vector.
The text to be processed is represented as a vector using the same method as in the training stage, generating the vector v of the text to be processed.
And thirdly, calculating the polymerization degree of the text to be processed with respect to each intention category.
See step S402 in fig. 4 for details.
And fourthly, determining an intention recognition result.
Based on the Softmax function P = Softmax(h_c / a_c), the classification probability that the text to be processed belongs to each intention category is obtained, and the text set with the largest classification probability is determined as the target text set. The intention recognition result returned by the Softmax function can be represented as: {"class_0": p0, "class_1": p1, ...}, where class_0 and class_1 denote intention categories and p0 and p1 denote classification probabilities.
For example, the text input by the user is: "I want to transfer money". The word segmentation result obtained through data preprocessing is ["I", "want", "transfer money"]. Vector characterization of the word segmentation result gives the target feature vector v. The category polymerization degrees of all categories form the category polymerization degree vector [a_0, a_1, a_2, a_3, a_4, a_5], whose corresponding intention categories are ["transfer", "deposit", "withdraw", "cancel", "modify", "confirm"]. Comparing the polymerization degrees of the text to be processed against these category polymerization degrees yields the category probability vector [0.91, 0.04, 0.02, 0.01, 0.01, 0.01], and the final intention recognition result is: {"transfer": 0.91, "deposit": 0.04, "withdraw": 0.02, "cancel": 0.01, "modify": 0.01, "confirm": 0.01}, i.e., the user's intention is recognized as "transfer".
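Assembling the recognition result shown above takes only a few lines; the intent names and probabilities are the example values from this paragraph:

```python
# An illustrative assembly of the intention recognition result, reusing the example
# values from the paragraph above.
intents = ["transfer", "deposit", "withdraw", "cancel", "modify", "confirm"]
probs = [0.91, 0.04, 0.02, 0.01, 0.01, 0.01]

result = dict(zip(intents, probs))
print(result)                       # {'transfer': 0.91, 'deposit': 0.04, ...}
print(max(result, key=result.get))  # 'transfer'
```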
As shown in fig. 7, based on the same inventive concept as the text classification method, the embodiment of the present application further provides a text classification device 70, which specifically includes:
the acquiring module 701 is configured to acquire a target feature vector of a text to be processed and at least two text sets, where each text set corresponds to one category and each text set includes feature vectors of text data belonging to the same category;
a polymerization degree calculation module 702, configured to obtain, for each text set, the K nearest feature vectors of the target feature vector from that text set, obtain the polymerization degree between the target feature vector and that text set based on the K nearest feature vectors and the target feature vector, and obtain a comparison result between the polymerization degree and the category polymerization degree of that text set; the polymerization degree represents how densely the target feature vector and the K nearest feature vectors are packed together, and the category polymerization degree represents how densely the feature vectors within the same text set are distributed;
a classification module 703, configured to determine, based on a comparison result corresponding to each text set, a target text set from the at least two text sets, and determine a category of the target text set as a category of the text to be processed.
Optionally, the polymerization degree calculation module 702 is specifically configured to: obtain the similarity between each feature vector in each text set and the target feature vector; and sort the similarities in descending order and determine the top-K feature vectors as the K nearest feature vectors of the target feature vector.
Optionally, the polymerization degree calculation module 702 is specifically configured to: obtain the similarity between each of the K nearest feature vectors and the target feature vector; and determine the average of the similarities corresponding to the K nearest feature vectors as the polymerization degree between the target feature vector and each text set.
Optionally, the comparison result is the ratio of the polymerization degree to the category polymerization degree, and the classification module 703 is specifically configured to: determine the classification probability that the text to be processed belongs to each text set based on the comparison result corresponding to each text set; and determine the text set with the largest classification probability as the target text set.
Optionally, the text classification device 70 further includes a training module, configured to obtain the category polymerization degree of each text set as follows:
for each feature vector in each text set, obtaining the similarity between that feature vector and each other feature vector in the same text set, sorting the obtained similarities in descending order, and determining the polymerization degree between that feature vector and the text set based on the top-K similarities;
and determining the category polymerization degree corresponding to each text set based on the polymerization degree corresponding to each feature vector in each text set.
Optionally, the training module is specifically configured to: determine the average of the top-K similarities as the polymerization degree between each feature vector and each text set.
Optionally, the training module is specifically configured to: determine the average of the polymerization degrees corresponding to the feature vectors in each text set as the category polymerization degree corresponding to each text set.
Optionally, each category corresponds to a user intention.
The text classification device and the text classification method provided by the embodiment of the application adopt the same inventive concept, can obtain the same beneficial effects, and are not repeated herein.
Based on the same inventive concept as the text classification method, the embodiment of the present application further provides an electronic device, which may be specifically a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), a server, and the like. As shown in fig. 8, the electronic device 80 may include a processor 801 and a memory 802.
The Processor 801 may be a general-purpose Processor, such as a Central Processing Unit (CPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware components, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of the text classification method disclosed in the embodiments of the present application may be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.
The memory 802, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read Only Memory (PROM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a magnetic memory, a magnetic disk, an optical disk, and so on. The memory may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 802 of the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Embodiments of the present application provide a computer-readable storage medium for storing computer program instructions for the electronic device, which includes a program for executing the text classification method.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including but not limited to magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above embodiments are only intended to describe the technical solutions of the present application in detail and to help understand the methods of the embodiments of the present application; they should not be construed as limiting the embodiments of the present application. Modifications and substitutions that are readily apparent to those skilled in the art are intended to be included within the scope of the embodiments of the present application.

Claims (10)

1. A method of text classification, comprising:
obtaining a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category and comprises feature vectors of text data belonging to the same category;
for each text set, obtaining the K nearest feature vectors of the target feature vector from that text set, obtaining a polymerization degree between the target feature vector and that text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result between the polymerization degree and the category polymerization degree of that text set; the polymerization degree represents how densely the target feature vector and the K nearest feature vectors are packed together, and the category polymerization degree represents how densely the feature vectors within the same text set are distributed;
determining a target text set from the at least two text sets based on a comparison result corresponding to each text set;
and determining the category of the target text set as the category of the text to be processed.
2. The method according to claim 1, wherein the obtaining K nearest feature vectors of the target feature vector from each text set specifically comprises:
obtaining the similarity between each feature vector in each text set and the target feature vector;
and sorting the similarities in descending order, determining the top-K feature vectors as the K nearest feature vectors of the target feature vector.
3. The method according to claim 1, wherein the obtaining a polymerization degree between the target feature vector and each text set based on the K nearest feature vectors and the target feature vector specifically comprises:
obtaining the similarity of each feature vector in the K nearest feature vectors and the target feature vector;
and determining the average value of the similarity corresponding to the K nearest feature vectors as the polymerization degree between the target feature vector and each text set.
4. The method according to claim 1, wherein the comparison result is the ratio of the polymerization degree to the category polymerization degree, and the determining a target text set from the at least two text sets based on the comparison result corresponding to each text set specifically comprises:
determining the classification probability of the text to be processed belonging to each text set based on the comparison result corresponding to each text set;
and determining the text set with the maximum classification probability as a target text set.
5. The method according to any one of claims 1 to 4, wherein the category polymerization degree of each text set is obtained as follows:
for each feature vector in each text set, obtaining the similarity between that feature vector and each other feature vector in the same text set, sorting the obtained similarities in descending order, and determining the polymerization degree between that feature vector and the text set based on the top-K similarities;
and determining the category polymerization degree corresponding to each text set based on the polymerization degree corresponding to each feature vector in each text set.
6. The method according to claim 5, wherein the determining, based on the top-K similarities, the polymerization degree between each feature vector and each text set specifically comprises:
determining the average of the top-K similarities as the polymerization degree between each feature vector and each text set.
7. The method according to claim 5, wherein the determining the category polymerization degree corresponding to each text set based on the polymerization degree corresponding to each feature vector in each text set specifically comprises:
determining the average of the polymerization degrees corresponding to the feature vectors in each text set as the category polymerization degree corresponding to each text set.
8. A text classification apparatus, comprising:
an acquisition module, used for acquiring a target feature vector of a text to be processed and at least two text sets, wherein each text set corresponds to one category, and each text set comprises feature vectors of text data belonging to the same category;
a polymerization degree calculation module, used for obtaining, for each text set, the K nearest feature vectors of the target feature vector from that text set, obtaining the polymerization degree between the target feature vector and that text set based on the K nearest feature vectors and the target feature vector, and obtaining a comparison result between the polymerization degree and the category polymerization degree of that text set; the polymerization degree represents how densely the target feature vector and the K nearest feature vectors are packed together, and the category polymerization degree represents how densely the feature vectors within the same text set are distributed;
and the classification module is used for determining a target text set from the at least two text sets based on the comparison result corresponding to each text set, and determining the category of the target text set as the category of the text to be processed.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium having computer program instructions stored thereon, which, when executed by a processor, implement the steps of the method of any one of claims 1 to 7.
CN202110183059.9A 2021-02-08 2021-02-08 Text classification method and device, electronic equipment and storage medium Pending CN112861974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110183059.9A CN112861974A (en) 2021-02-08 2021-02-08 Text classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110183059.9A CN112861974A (en) 2021-02-08 2021-02-08 Text classification method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112861974A true CN112861974A (en) 2021-05-28

Family

ID=75988309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110183059.9A Pending CN112861974A (en) 2021-02-08 2021-02-08 Text classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112861974A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869408A (en) * 2021-09-27 2021-12-31 北京迪力科技有限责任公司 Classification method and computer equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination