CN113704479B - Unsupervised text classification method and device, electronic equipment and storage medium - Google Patents

Unsupervised text classification method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113704479B
CN113704479B (application CN202111249214.9A)
Authority
CN
China
Prior art keywords
training
clustering
training set
loss
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111249214.9A
Other languages
Chinese (zh)
Other versions
CN113704479A (en)
Inventor
张剑
程刚
黄仁杰
刘代琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN202111249214.9A priority Critical patent/CN113704479B/en
Publication of CN113704479A publication Critical patent/CN113704479A/en
Application granted granted Critical
Publication of CN113704479B publication Critical patent/CN113704479B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an unsupervised text classification method, which comprises the following steps: obtaining a text training set and constructing a first training set and a second training set based on the text training set; obtaining a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set; calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model; and classifying texts to be classified by using the standard clustering model to obtain a classification result. The invention also provides an unsupervised text classification device, an electronic device, and a computer-readable storage medium. The method and the device can solve the problem of low accuracy of text classification.

Description

Unsupervised text classification method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an unsupervised text classification method and device, electronic equipment and a computer readable storage medium.
Background
Clustering is one of the most fundamental challenges in unsupervised learning and has been widely used for text classification. Clustering methods in the prior art have the following defects: 1. traditional clustering algorithms such as the K-means algorithm and Gaussian mixture models rely too heavily on distances measured in the original data space, which is often ineffective for high-dimensional data; 2. deep neural networks are expected to map data to a low-dimensional and better-separable representation space, but in the presence of large amounts of complex data, even with deep neural networks, there is still significant overlap of data between classes before clustering begins. Therefore, the clustering methods in the prior art lead to low accuracy of text classification.
Disclosure of Invention
The application provides an unsupervised text classification method and device, electronic equipment and a storage medium, and aims to solve the problem of low accuracy of text classification.
In a first aspect, the present application provides an unsupervised text classification method, comprising:
acquiring a text training set, and constructing a first training set and a second training set based on the text training set;
acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model;
and classifying the texts to be classified by using the standard clustering model to obtain a classification result.
In detail, the constructing a first training set and a second training set based on the text training set includes:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
and summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set.
In detail, the clustering the first training set by using the clustering layer of the original clustering model and calculating the clustering loss of the clustered first training set includes:
acquiring a preset category set, and calculating the category probability of each training instance in the first training set being divided into a specific category in the category set;
performing auxiliary optimization on the category probability by using the clustering layer to obtain an auxiliary probability;
calculating the class probability and the KL divergence value of the auxiliary probability of each training example in the first training set;
and taking the KL divergence value as a loss value of the training example, and summarizing the loss values of all the training examples to obtain the clustering loss of the first training set.
In detail, the calculating a class probability that each training instance in the first training set is classified into a specific class in the class set includes:
performing feature vector mapping on each training instance in the first training set by using a preset feature mapping model to obtain a first vector set;
calculating, using the following t-distribution formula, the class probability that each vectorized training instance in the first vector set is classified into the k-th class in the class set:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $q_{ik}$ is the class probability, $z_i$ denotes the i-th vectorized training instance in the first vector set, $K$ is the preset number of categories, $\mu_k$ denotes the class center of the k-th category, and $\alpha$ is a fixed parameter.
In detail, the calculating, by using the contrast learning layer of the original clustering model, the contrast loss of the second training set includes:
randomly selecting training examples in the second training set and augmented examples corresponding to the training examples as a pair of positive samples, and taking all unselected examples as negative samples;
and calculating the contrast loss of the positive sample compared with the negative sample until the training examples in the second training set are all selected as positive samples, and summarizing the contrast loss of all the positive samples to obtain the contrast loss of the second training set.
In detail, the calculating the contrast loss of the positive samples compared to the negative samples until all training examples in the second training set are selected as positive samples, and summarizing the contrast losses of all positive samples to obtain the contrast loss of the second training set includes:
performing feature vector mapping on the positive samples and the negative samples in the second training set by using the feature mapping model to obtain a second vector set;
calculating the contrast loss of the vectorized positive sample in the second vector set compared with the vectorized negative sample by using a preset contrast loss function;
and when the training examples in the second training set are all selected as positive samples, summarizing the contrast losses of all the positive samples to obtain the contrast loss of the second training set.
In detail, the calculating, by using a preset contrast loss function, the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set includes:
calculating the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set using the following contrast loss function:

$$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_{i'}) / \tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$

where $\ell_i$ is the contrast loss of the positive sample, $(z_i, z_{i'})$ is a pair of vectorized positive samples, $i$ denotes the training instance, $i'$ denotes the corresponding augmented instance, $\tau$ is the temperature parameter, $\mathbb{1}$ is the indicator function, $2M$ is the number of training instances in the second training set, and $z_j$ denotes a vectorized negative sample.
In detail, the training of the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model comprises:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
In detail, the calculating a joint loss value of the clustering loss and the contrast loss based on the pre-constructed balance factor includes:
calculating a joint loss value for the clustering loss and the contrast loss using a joint loss function of:
$$\mathcal{L} = \mathcal{L}_{contrast} + \eta \, \mathcal{L}_{cluster}$$

where $\mathcal{L}$ is the joint loss value, $\eta$ is the balance factor, $\mathcal{L}_{contrast}$ is the contrast loss, and $\mathcal{L}_{cluster}$ is the clustering loss.
In a second aspect, the present application provides an unsupervised text classification apparatus, the apparatus comprising:
the training set constructing module is used for acquiring a text training set and constructing a first training set and a second training set based on the text training set;
the clustering loss calculation module is used for acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
the joint training module is used for calculating the comparison loss of the second training set by using a comparison learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the comparison loss to obtain a standard clustering model;
and the text classification module is used for classifying the texts to be classified by utilizing the standard clustering model to obtain a classification result.
In a third aspect, an electronic device is provided, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor configured to implement the steps of the unsupervised text classification method according to any of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the unsupervised text classification method according to any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the method, the first training set and the second training set are constructed through the text training set, so that the data volume of model training can be improved, the data distribution is more uniform, and the accuracy of model training is improved. And clustering the first training set to enable the examples from the same text category to be clustered together, and calculating the contrast loss of the second training set by using the contrast learning layer to enable the data from different examples to be far away from each other in the representation space, so that the overlapping of the data among the categories can be obviously reduced, and the accuracy of text classification is improved. Therefore, the unsupervised text classification method, the unsupervised text classification device, the electronic equipment and the computer readable storage medium can solve the problem of low accuracy of text classification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of an unsupervised text classification method according to an embodiment of the present application;
fig. 2 is a block diagram of an unsupervised text classification apparatus according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device for unsupervised text classification according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of an unsupervised text classification method according to an embodiment of the present application. In this embodiment, the unsupervised text classification method includes:
s1, obtaining a text training set, and constructing a first training set and a second training set based on the text training set.
In the embodiment of the present invention, the text training set may be an open-source short text data set, for example, an AgNews text data set, a SearchSnippets text data set, or the like.
Specifically, the constructing a first training set and a second training set based on the text training set includes:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
and summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set.
In the embodiment of the invention, an AgNews text data set is taken as an example, each text document in the AgNews text data set can be taken as a training example, and a batch of text documents with the size of M are randomly selected from the AgNews text data set to be taken as a first training set.
In the embodiment of the invention, the first training set is expanded through Data Augmentation (Data Augmentation), which is a technology for automatically expanding training Data, and as a deep learning model needs a large amount of Data to support training, problems of poor model robustness and poor model performance caused by Data lack and unbalanced Data distribution often occur during actual model training, and the problems can be effectively avoided through Data Augmentation.
In the embodiment of the present invention, the preset mask language model may be a BERT model, a RoBERTa model, or the like. For example, if the first training set has M training examples, the top-n words are found in each training example, words are inserted into or replaced in each training example by using the mask language model to generate a new training example for each of the M training examples, and the second training set after data augmentation contains 2M examples.
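As a rough illustration of this augmentation step (not the patent's exact procedure), the sketch below uses a Hugging Face fill-mask pipeline as the preset mask language model; the checkpoint name, the number of replaced words, and the example sentences are assumptions.

```python
import random
from transformers import pipeline

# Fill-mask pipeline used as the "preset mask language model"; the checkpoint
# and the number of replaced words n are illustrative choices.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token

def augment(text: str, n: int = 3) -> str:
    """Mask n randomly chosen words and let the MLM propose replacements."""
    words = text.split()
    if len(words) < 2:
        return text
    for pos in random.sample(range(len(words)), k=min(n, len(words))):
        masked = words.copy()
        masked[pos] = MASK
        words[pos] = fill_mask(" ".join(masked), top_k=1)[0]["token_str"].strip()
    return " ".join(words)

# First training set: a batch of M text documents; the second training set pairs
# each training instance with its augmented instance (2M instances in total).
first_training_set = [
    "wall street stocks rallied after the quarterly earnings report",
    "the national team won the championship game on saturday",
]
second_training_set = [(x, augment(x)) for x in first_training_set]
```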
S2, obtaining a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set.
In the embodiment of the invention, the original clustering model comprises a clustering layer and a contrast learning layer. The clustering layer consists of a linear mapping layer, which linearly combines feature vectors and improves the clustering of feature vectors of the same kind; its dimensionality is 768 × K, where K is the preset number of categories. The contrast learning layer consists of a single-layer multilayer perceptron (MLP) with an input dimension of 768, an output dimension of 128, and ReLU as the activation function.
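A minimal PyTorch sketch of such an original clustering model is given below; the encoder object, the use of learnable class centers in the clustering layer, and the default dimensions are assumptions drawn from this description rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

class OriginalClusteringModel(nn.Module):
    """Sketch: shared text encoder + clustering layer holding K class centers +
    single-layer MLP contrastive head (768 -> 128 with ReLU)."""

    def __init__(self, encoder, hidden_dim=768, num_classes=4, proj_dim=128, alpha=1.0):
        super().__init__()
        self.encoder = encoder                              # maps texts -> (batch, 768) tensors
        self.centers = nn.Parameter(torch.randn(num_classes, hidden_dim))  # class centers (e.g. 4 for AgNews)
        self.projector = nn.Sequential(nn.Linear(hidden_dim, proj_dim), nn.ReLU())
        self.alpha = alpha                                  # fixed t-distribution parameter

    def embed(self, texts):
        return self.encoder(texts)

    def cluster_probs(self, z):
        # Student's t-distribution soft assignment (the class probability formula below).
        dist_sq = torch.cdist(z, self.centers) ** 2
        q = (1.0 + dist_sq / self.alpha) ** (-(self.alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

    def project(self, z):
        # 128-dimensional representation used by the contrast learning layer.
        return self.projector(z)
```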
Specifically, the clustering the first training set by using the clustering layer of the original clustering model and calculating the clustering loss of the clustered first training set includes:
acquiring a preset category set, and calculating the category probability of each training instance in the first training set being divided into a specific category in the category set;
performing auxiliary optimization on the category probability by using the clustering layer to obtain an auxiliary probability;
calculating the class probability and the KL divergence value of the auxiliary probability of each training example in the first training set;
and taking the KL divergence value as a loss value of the training example, and summarizing the loss values of all the training examples to obtain the clustering loss of the first training set.
Further, the performing auxiliary optimization on the class probability by using the clustering layer to obtain an auxiliary probability includes:
and performing auxiliary optimization on the class probability by using auxiliary distribution in the clustering layer to obtain an auxiliary probability.
Optionally, the auxiliary probability is calculated as:

$$p_{ik} = \frac{q_{ik}^2 / f_k}{\sum_{k'=1}^{K} q_{ik'}^2 / f_{k'}}, \qquad f_k = \sum_{i=1}^{M} q_{ik}$$

where $p_{ik}$ is the auxiliary probability, $q_{ik}$ is the class probability, $f_k$ is the auxiliary distribution, $K$ is the preset number of categories, and $M$ is the number of training instances in the first training set.
In an optional embodiment of the present invention, the KL divergence value is calculated by the following formula:

$$\ell_i = KL(p_i \,\|\, q_i) = \sum_{k=1}^{K} p_{ik} \log \frac{p_{ik}}{q_{ik}}$$

where $\ell_i$ is the KL divergence value, $p_i$ denotes the set of auxiliary probabilities of the i-th training instance, $q_i$ denotes the set of class probabilities of the i-th training instance, $p_{ik}$ denotes the auxiliary probability of the k-th category, and $q_{ik}$ denotes the class probability of the k-th category.
In an alternative embodiment of the present invention, the loss values of all training examples are summarized by the following formula:

$$\mathcal{L}_{cluster} = \frac{1}{M} \sum_{i=1}^{M} \ell_i$$

where $\mathcal{L}_{cluster}$ is the clustering loss of the first training set and $M$ is the size of the first training set.
In detail, the calculating a class probability that each training instance in the first training set is classified into a specific class in the class set includes:
performing feature vector mapping on each training instance in the first training set by using a preset feature mapping model to obtain a first vector set;
calculating, using the following t-distribution formula, the class probability that each vectorized training instance in the first vector set is classified into the k-th class in the class set:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $q_{ik}$ is the class probability, $z_i$ denotes the i-th vectorized training instance in the first vector set, $K$ is the preset number of categories, $\mu_k$ denotes the class center of the k-th category, and $\alpha$ is a fixed parameter.
In an optional embodiment of the present invention, the preset feature mapping model may be a sequence Transformer model, which maps an input text document to the feature space, that is, converts the input text document into a vector representation $z_i$.
In the embodiment of the invention, the auxiliary distribution is used for carrying out auxiliary optimization on the class probability, so that the class probability is closer to the corresponding class, and the classification accuracy is improved.
In the embodiment of the invention, the training examples with the same semantic category can be clustered together through the linear mapping layer of the clustering layer, and the category probability of each training example is optimized through auxiliary distribution, so that the accuracy of text clustering is further improved.
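A minimal sketch of this clustering-loss computation, reusing the model sketched above, is shown below; the detached (fixed) target distribution and the batch-mean reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    # Auxiliary probabilities: sharpen q and normalise by the soft cluster frequency f_k.
    weight = q ** 2 / q.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_loss(model, batch_texts) -> torch.Tensor:
    z = model.embed(batch_texts)             # feature vectors of the first training set
    q = model.cluster_probs(z)               # class probabilities (t-distribution)
    p = target_distribution(q).detach()      # auxiliary probabilities, treated as fixed targets
    # KL(p || q) per training instance, averaged over the batch: the clustering loss.
    return F.kl_div(q.log(), p, reduction="batchmean")
```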
And S3, calculating the contrast loss of the second training set by using the contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model.
In this embodiment of the present invention, the calculating the contrast loss of the second training set by using the contrast learning layer of the original clustering model includes:
randomly selecting training examples in the second training set and augmented examples corresponding to the training examples as a pair of positive samples, and taking all unselected examples as negative samples;
and calculating the contrast loss of the positive sample compared with the negative sample until the training examples in the second training set are all selected as positive samples, and summarizing the contrast loss of all the positive samples to obtain the contrast loss of the second training set.
In detail, the calculating the contrast loss of the positive samples compared to the negative samples until all training examples in the second training set are selected as positive samples, and summarizing the contrast losses of all positive samples to obtain the contrast loss of the second training set includes:
performing feature vector mapping on the positive samples and the negative samples in the second training set by using the feature mapping model to obtain a second vector set;
calculating the contrast loss of the vectorized positive sample in the second vector set compared with the vectorized negative sample by using a preset contrast loss function;
and when the training examples in the second training set are all selected as positive samples, summarizing the contrast losses of all the positive samples to obtain the contrast loss of the second training set.
In an optional embodiment of the present invention, the calculating, by using a preset contrast loss function, the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set includes:
calculating the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set using the following contrast loss function:

$$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_{i'}) / \tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$

where $\ell_i$ is the contrast loss of the positive sample, $(z_i, z_{i'})$ is a pair of vectorized positive samples, $i$ denotes the training instance, $i'$ denotes the corresponding augmented instance, $\tau$ is the temperature parameter, $\mathbb{1}$ is the indicator function, $2M$ is the number of training instances in the second training set, and $z_j$ denotes a vectorized negative sample.
In an alternative embodiment of the present invention, the contrast losses of all positive samples are summarized using the following formula:

$$\mathcal{L}_{contrast} = \frac{1}{2M} \sum_{i=1}^{2M} \ell_i$$

where $\mathcal{L}_{contrast}$ is the contrast loss of the second training set and $2M$ is the number of training instances in the second training set.
In the embodiment of the present invention, the second training set obtained by data augmentation contains 2M instances. For each pair of positive samples, the remaining 2M-2 samples are all regarded as negative samples, and the contrast loss separates the pair of positive samples from the other 2M-2 negative samples, so that augmented instances derived from the same instance are close to each other in the representation space while augmented instances derived from different instances are far from each other, improving the accuracy of text classification.
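The following sketch implements such a contrastive loss over the 2M-instance batch in the NT-Xent style implied by the formula above; the cosine similarity, the default temperature, and the batch-ordering convention are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """h: (2M, d) projections ordered so that h[i] and h[i + M] form a positive pair."""
    two_m = h.size(0)
    m = two_m // 2
    h = F.normalize(h, dim=1)
    sim = h @ h.t() / temperature                          # pairwise cosine similarities
    mask = torch.eye(two_m, dtype=torch.bool, device=h.device)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-similarity
    # Index of each row's positive partner: i <-> i + M.
    pos = torch.cat([torch.arange(m, two_m), torch.arange(m)]).to(h.device)
    # Cross-entropy against the positive index equals -log softmax at the positive,
    # averaged over all 2M instances (the summarized contrast loss).
    return F.cross_entropy(sim, pos)
```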
In detail, the training of the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model comprises:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
In an optional embodiment of the present invention, the calculating a joint loss value of the clustering loss and the contrast loss based on the pre-constructed balance factor includes:
calculating a joint loss value for the clustering loss and the contrast loss using a joint loss function of:
$$\mathcal{L} = \mathcal{L}_{contrast} + \eta \, \mathcal{L}_{cluster}$$

where $\mathcal{L}$ is the joint loss value, $\eta$ is the balance factor, $\mathcal{L}_{contrast}$ is the contrast loss, and $\mathcal{L}_{cluster}$ is the clustering loss.
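A sketch of one joint training step, reusing the loss functions sketched earlier, might look as follows; the choice to scale the clustering term by the balance factor eta is an assumption.

```python
import torch

def train_step(model, optimizer, first_batch, pair_batch, eta: float = 1.0):
    """One joint update; `pair_batch` is a list of (instance, augmented instance) pairs."""
    # Clustering loss on the first training set.
    l_cluster = clustering_loss(model, first_batch)
    # Contrastive loss on the second (augmented) training set.
    originals, augments = zip(*pair_batch)
    h = model.project(model.embed(list(originals) + list(augments)))
    l_contrast = contrastive_loss(h)
    # Joint loss combining the two losses through the balance factor eta.
    loss = l_contrast + eta * l_cluster
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```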
In an optional embodiment of the present invention, the preset training conditions include reaching a maximum number of training iterations, convergence of the cluster class centers, and the like.
In another optional embodiment of the present invention, after obtaining the standard clustering model, the method further includes:
and selecting a preset number of text documents from the text training set as test cases, clustering the test cases by using the standard clustering model, and calculating the evaluation index result of the clustered test cases.
In the embodiment of the invention, the evaluation indexes may be accuracy, normalized mutual information, or other metrics, and the performance of the standard clustering model is quantified according to the evaluation index results.
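A sketch of this evaluation step is shown below, assuming a labelled test split is available for measurement; clustering accuracy is computed with the Hungarian matching from SciPy and normalized mutual information with scikit-learn.

```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Best one-to-one mapping between cluster ids and labels (Hungarian algorithm).
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / y_true.size

def evaluate(model, test_texts, test_labels):
    with torch.no_grad():
        q = model.cluster_probs(model.embed(test_texts))
    y_pred = q.argmax(dim=1).cpu().numpy()
    y_true = np.asarray(test_labels)
    return {"accuracy": clustering_accuracy(y_true, y_pred),
            "nmi": normalized_mutual_info_score(y_true, y_pred)}
```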
And S4, classifying the texts to be classified by using the standard clustering model to obtain a classification result.
In the embodiment of the invention, the text to be classified may be, for example, user emotion analysis text or user intention recognition text obtained through speech-to-text conversion. For example, user emotion text is classified by the standard clustering model into the two classification results "happy" and "unhappy", or user intention recognition text is classified by the standard clustering model into different intention classification results.
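A minimal inference sketch with the trained standard clustering model might look as follows; the cluster-to-label mapping shown here is hypothetical and would in practice be assigned after inspecting the clusters.

```python
import torch

# Hypothetical mapping from cluster ids to human-readable labels for a
# two-class emotion example.
CLUSTER_LABELS = {0: "happy", 1: "unhappy"}

def classify(model, texts):
    with torch.no_grad():
        q = model.cluster_probs(model.embed(texts))   # class probabilities
    return [CLUSTER_LABELS[int(k)] for k in q.argmax(dim=1)]

# Usage (assuming a trained `standard_model`):
# classify(standard_model, ["the agent resolved my issue quickly, thank you"])
```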
According to the invention, constructing the first training set and the second training set from the text training set increases the amount of training data and makes the data distribution more uniform, thereby improving the accuracy of model training. Clustering the first training set causes instances from the same text category to be grouped together, and calculating the contrast loss of the second training set with the contrast learning layer pushes data from different instances apart in the representation space, which significantly reduces the overlap of data between categories and improves the accuracy of text classification. Therefore, the unsupervised text classification method provided by the invention can solve the problem of low accuracy of text classification.
As shown in fig. 2, an embodiment of the present application provides a block schematic diagram of an unsupervised text classification apparatus 10, where the unsupervised text classification apparatus 10 includes: the training set constructing module 11, the cluster loss calculating module 12, the joint training module 13, and the text classifying module 14.
The training set constructing module 11 is configured to obtain a text training set, and construct a first training set and a second training set based on the text training set;
the clustering loss calculating module 12 is configured to obtain a pre-constructed original clustering model, perform clustering processing on the first training set by using a clustering layer of the original clustering model, and calculate a clustering loss of the clustered first training set;
the joint training module 13 is configured to calculate a comparison loss of the second training set by using a comparison learning layer of the original clustering model, and train the original clustering model by combining the clustering loss and the comparison loss to obtain a standard clustering model;
the text classification module 14 is configured to classify the text to be classified by using the standard clustering model to obtain a classification result.
In detail, when the modules in the unsupervised text classification device 10 in the embodiment of the present application are used, the same technical means as the unsupervised text classification method described in fig. 1 above are adopted, and the same technical effects can be produced, and are not described again here.
As shown in fig. 3, an electronic device provided in the embodiment of the present application includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete communication with each other through the communication bus 114;
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, when executing the program stored in the memory 113, is configured to implement the unsupervised text classification method provided in any one of the foregoing method embodiments, including:
acquiring a text training set, and constructing a first training set and a second training set based on the text training set;
acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model;
and classifying the texts to be classified by using the standard clustering model to obtain a classification result.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the unsupervised text classification method as provided in any of the foregoing method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An unsupervised text classification method, characterized in that the method comprises:
acquiring a text training set, and constructing a first training set and a second training set based on the text training set;
acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model;
classifying the texts to be classified by using the standard clustering model to obtain a classification result;
wherein the constructing a first training set and a second training set based on the text training set comprises:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set;
wherein, the training of the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model comprises:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
2. The unsupervised text classification method according to claim 1, wherein the clustering the first training set using the clustering layer of the original clustering model and calculating the clustering loss of the clustered first training set comprises:
acquiring a preset category set, and calculating the category probability of each training instance in the first training set being divided into a specific category in the category set;
performing auxiliary optimization on the category probability by using the clustering layer to obtain an auxiliary probability;
calculating the class probability and the KL divergence value of the auxiliary probability of each training example in the first training set;
and taking the KL divergence value as a loss value of the training example, and summarizing the loss values of all the training examples to obtain the clustering loss of the first training set.
3. The unsupervised text classification method of claim 2, wherein the calculating a class probability that each training instance in the first training set is classified into a particular class in the set of classes comprises:
performing feature vector mapping on each training instance in the first training set by using a preset feature mapping model to obtain a first vector set;
calculating, using the following t-distribution formula, the class probability that each vectorized training instance in the first vector set is classified into the k-th class in the class set:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $q_{ik}$ is the class probability, $z_i$ denotes the i-th vectorized training instance in the first vector set, $K$ is the preset number of categories, $\mu_k$ denotes the class center of the k-th category, and $\alpha$ is a fixed parameter.
4. The unsupervised text classification method of claim 1, wherein the calculating the contrast loss of the second training set using the contrast learning layer of the original clustering model comprises:
randomly selecting training examples in the second training set and augmented examples corresponding to the training examples as a pair of positive samples, and taking all unselected examples as negative samples;
and calculating the contrast loss of the positive sample compared with the negative sample until the training examples in the second training set are all selected as positive samples, and summarizing the contrast loss of all the positive samples to obtain the contrast loss of the second training set.
5. The unsupervised method of text classification according to claim 4, wherein the calculating the contrast loss of the positive samples compared to the negative samples until all training instances in the second training set are selected as positive samples, and summarizing the contrast losses of all positive samples to obtain the contrast loss of the second training set comprises:
performing feature vector mapping on the positive samples and the negative samples in the second training set by using the feature mapping model to obtain a second vector set;
calculating the contrast loss of the vectorized positive sample in the second vector set compared with the vectorized negative sample by using a preset contrast loss function;
and when the training examples in the second training set are all selected as positive samples, summarizing the contrast losses of all the positive samples to obtain the contrast loss of the second training set.
6. The unsupervised text classification method as claimed in claim 5, wherein said calculating, by using a preset contrast loss function, the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set comprises:
calculating the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set using the following contrast loss function:

$$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_{i'}) / \tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$

where $\ell_i$ is the contrast loss of the positive sample, $(z_i, z_{i'})$ is a pair of vectorized positive samples, $i$ denotes the training instance, $i'$ denotes the corresponding augmented instance, $\tau$ is the temperature parameter, $\mathbb{1}$ is the indicator function, $2M$ is the number of training instances in the second training set, and $z_j$ denotes a vectorized negative sample.
7. The unsupervised text classification method according to claim 1, wherein the calculating a joint loss value for the cluster loss and the contrast loss based on a pre-constructed balance factor comprises:
calculating a joint loss value for the clustering loss and the contrast loss using a joint loss function of:
$$\mathcal{L} = \mathcal{L}_{contrast} + \eta \, \mathcal{L}_{cluster}$$

where $\mathcal{L}$ is the joint loss value, $\eta$ is the balance factor, $\mathcal{L}_{contrast}$ is the contrast loss, and $\mathcal{L}_{cluster}$ is the clustering loss.
8. An unsupervised text classification apparatus, characterized in that the apparatus comprises:
the training set constructing module is used for acquiring a text training set and constructing a first training set and a second training set based on the text training set;
the clustering loss calculation module is used for acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
the joint training module is used for calculating the comparison loss of the second training set by using a comparison learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the comparison loss to obtain a standard clustering model;
the text classification module is used for classifying the texts to be classified by utilizing the standard clustering model to obtain a classification result;
wherein the training set constructing module is specifically configured to:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set;
wherein, the joint training module is specifically configured to:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the unsupervised text classification method of any one of claims 1-7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the unsupervised text classification method according to any one of claims 1 to 7.
CN202111249214.9A 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium Active CN113704479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111249214.9A CN113704479B (en) 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111249214.9A CN113704479B (en) 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113704479A (en) 2021-11-26
CN113704479B (en) 2022-02-18

Family

ID=78647032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111249214.9A Active CN113704479B (en) 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113704479B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522958A (en) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and device
CN112214602A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Text classification method and device based on humor, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111295669A (en) * 2017-06-16 2020-06-16 马克波尔公司 Image processing system
CN110298415B (en) * 2019-08-20 2019-12-03 视睿(杭州)信息科技有限公司 A kind of training method of semi-supervised learning, system and computer readable storage medium
US11263476B2 (en) * 2020-03-19 2022-03-01 Salesforce.Com, Inc. Unsupervised representation learning with contrastive prototypes
US11386302B2 (en) * 2020-04-13 2022-07-12 Google Llc Systems and methods for contrastive learning of visual representations
CN113378632B (en) * 2021-04-28 2024-04-12 南京大学 Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522958A (en) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and device
CN112214602A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Text classification method and device based on humor, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113704479A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
TWI677852B (en) A method and apparatus, electronic equipment, computer readable storage medium for extracting image feature
CN109871446B (en) Refusing method in intention recognition, electronic device and storage medium
TWI752455B (en) Image classification model training method, image processing method, data classification model training method, data processing method, computer device, and storage medium
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN108475262A (en) Electronic equipment and method for text-processing
CN111274371B (en) Intelligent man-machine conversation method and equipment based on knowledge graph
CN113220886A (en) Text classification method, text classification model training method and related equipment
JPWO2014136316A1 (en) Information processing apparatus, information processing method, and program
Yu et al. Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
CN110827797B (en) Voice response event classification processing method and device
CN106156120B (en) Method and device for classifying character strings
CN109726291B (en) Loss function optimization method and device of classification model and sample classification method
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN113449084A (en) Relationship extraction method based on graph convolution
CN113254637A (en) Grammar-fused aspect-level text emotion classification method and system
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
JP2020098592A (en) Method, device and storage medium of extracting web page content
CN114282513A (en) Text semantic similarity matching method and system, intelligent terminal and storage medium
CN112446405A (en) User intention guiding method for home appliance customer service and intelligent home appliance
CN109727091A (en) Products Show method, apparatus, medium and server based on dialogue robot
CN110969005A (en) Method and device for determining similarity between entity corpora
CN113704479B (en) Unsupervised text classification method and device, electronic equipment and storage medium
CN111552810A (en) Entity extraction and classification method and device, computer equipment and storage medium
CN116738956A (en) Prompt template generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant