CN113704479B - Unsupervised text classification method and device, electronic equipment and storage medium - Google Patents

Unsupervised text classification method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113704479B
CN113704479B (application CN202111249214.9A)
Authority
CN
China
Prior art keywords
training
clustering
training set
loss
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111249214.9A
Other languages
Chinese (zh)
Other versions
CN113704479A (en)
Inventor
张剑
程刚
黄仁杰
刘代琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Raisound Technology Co ltd
Original Assignee
Shenzhen Raisound Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Raisound Technology Co ltd filed Critical Shenzhen Raisound Technology Co ltd
Priority to CN202111249214.9A priority Critical patent/CN113704479B/en
Publication of CN113704479A publication Critical patent/CN113704479A/en
Application granted granted Critical
Publication of CN113704479B publication Critical patent/CN113704479B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses an unsupervised text classification method, which comprises the following steps: obtaining a text training set and constructing a first training set and a second training set based on the text training set; obtaining a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set; calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model; and classifying texts to be classified by using the standard clustering model to obtain a classification result. The invention also provides an unsupervised text classification device, an electronic device, and a computer-readable storage medium. The method and the device can solve the problem of low accuracy of text classification.

Description

Unsupervised text classification method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an unsupervised text classification method and device, electronic equipment and a computer readable storage medium.
Background
Clustering is one of the most fundamental challenges in unsupervised learning and has been widely used for text classification. Clustering methods in the prior art have the following defects: 1. traditional clustering algorithms such as the K-means algorithm and Gaussian mixture models rely too heavily on distances measured in the original data space, which is often ineffective for high-dimensional data; 2. deep neural networks are expected to map data to a low-dimensional and better-separable representation space, but in the presence of large amounts of complex data, even with deep neural networks, there is still significant overlap of data between classes before clustering begins. Therefore, the clustering methods in the prior art lead to low accuracy of text classification.
Disclosure of Invention
The application provides an unsupervised text classification method and device, electronic equipment and a storage medium, and aims to solve the problem of low accuracy of text classification.
In a first aspect, the present application provides an unsupervised text classification method, comprising:
acquiring a text training set, and constructing a first training set and a second training set based on the text training set;
acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model;
and classifying the texts to be classified by using the standard clustering model to obtain a classification result.
In detail, the constructing a first training set and a second training set based on the text training set includes:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
and summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set.
In detail, the clustering the first training set by using the clustering layer of the original clustering model and calculating the clustering loss of the clustered first training set includes:
acquiring a preset category set, and calculating the category probability of each training instance in the first training set being divided into a specific category in the category set;
performing auxiliary optimization on the category probability by using the clustering layer to obtain an auxiliary probability;
calculating the class probability and the KL divergence value of the auxiliary probability of each training example in the first training set;
and taking the KL divergence value as a loss value of the training example, and summarizing the loss values of all the training examples to obtain the clustering loss of the first training set.
In detail, the calculating a class probability that each training instance in the first training set is classified into a specific class in the class set includes:
performing feature vector mapping on each training instance in the first training set by using a preset feature mapping model to obtain a first vector set;
calculating, using the following t-distribution formula, the class probability that each vectorized training instance in the first vector set is classified into the k-th class in the class set:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $q_{ik}$ is the class probability, $z_i$ denotes the i-th vectorized training instance in the first vector set, $K$ is the preset number of categories, $\mu_k$ denotes the class center of the k-th category, and $\alpha$ is a fixed parameter.
In detail, the calculating, by using the contrast learning layer of the original clustering model, the contrast loss of the second training set includes:
randomly selecting training examples in the second training set and augmented examples corresponding to the training examples as a pair of positive samples, and taking all unselected examples as negative samples;
and calculating the contrast loss of the positive sample compared with the negative sample until the training examples in the second training set are all selected as positive samples, and summarizing the contrast loss of all the positive samples to obtain the contrast loss of the second training set.
In detail, the calculating the contrast loss of the positive samples compared to the negative samples until all training examples in the second training set are selected as positive samples, and summarizing the contrast losses of all positive samples to obtain the contrast loss of the second training set includes:
performing feature vector mapping on the positive samples and the negative samples in the second training set by using the feature mapping model to obtain a second vector set;
calculating the contrast loss of the vectorized positive sample in the second vector set compared with the vectorized negative sample by using a preset contrast loss function;
and when the training examples in the second training set are all selected as positive samples, summarizing the contrast losses of all the positive samples to obtain the contrast loss of the second training set.
In detail, the calculating, by using a preset contrast loss function, the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set includes:
calculating the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set using the following contrast loss function:

$$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_{i'}) / \tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$

where $\ell_i$ is the contrast loss of the positive sample, $(z_i, z_{i'})$ is a pair of vectorized positive samples, $i$ denotes the training instance, $i'$ denotes the corresponding augmented instance, $\tau$ is the temperature parameter, $\mathbb{1}$ is the indicator function, $2M$ is the number of training instances in the second training set, and $z_j$ denotes a vectorized negative sample.
In detail, the training of the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model comprises:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
In detail, the calculating a joint loss value of the clustering loss and the contrast loss based on the pre-constructed balance factor includes:
calculating a joint loss value for the clustering loss and the contrast loss using a joint loss function of:
$$\mathcal{L} = \mathcal{L}_{contrast} + \eta \, \mathcal{L}_{cluster}$$

where $\mathcal{L}$ is the joint loss value, $\eta$ is the balance factor, $\mathcal{L}_{contrast}$ is the contrast loss, and $\mathcal{L}_{cluster}$ is the clustering loss.
In a second aspect, the present application provides an unsupervised text classification apparatus, the apparatus comprising:
the training set constructing module is used for acquiring a text training set and constructing a first training set and a second training set based on the text training set;
the clustering loss calculation module is used for acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
the joint training module is used for calculating the comparison loss of the second training set by using a comparison learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the comparison loss to obtain a standard clustering model;
and the text classification module is used for classifying the texts to be classified by utilizing the standard clustering model to obtain a classification result.
In a third aspect, an electronic device is provided, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor configured to implement the steps of the unsupervised text classification method according to any of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the unsupervised text classification method according to any of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
according to the method, the first training set and the second training set are constructed through the text training set, so that the data volume of model training can be improved, the data distribution is more uniform, and the accuracy of model training is improved. And clustering the first training set to enable the examples from the same text category to be clustered together, and calculating the contrast loss of the second training set by using the contrast learning layer to enable the data from different examples to be far away from each other in the representation space, so that the overlapping of the data among the categories can be obviously reduced, and the accuracy of text classification is improved. Therefore, the unsupervised text classification method, the unsupervised text classification device, the electronic equipment and the computer readable storage medium can solve the problem of low accuracy of text classification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of an unsupervised text classification method according to an embodiment of the present application;
fig. 2 is a block diagram of an unsupervised text classification apparatus according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device for unsupervised text classification according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flowchart of an unsupervised text classification method according to an embodiment of the present application. In this embodiment, the unsupervised text classification method includes:
s1, obtaining a text training set, and constructing a first training set and a second training set based on the text training set.
In the embodiment of the present invention, the text training set may be an open-source short text data set, for example, an AgNews text data set, a SearchSnippets text data set, or the like.
Specifically, the constructing a first training set and a second training set based on the text training set includes:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
and summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set.
In the embodiment of the invention, an AgNews text data set is taken as an example, each text document in the AgNews text data set can be taken as a training example, and a batch of text documents with the size of M are randomly selected from the AgNews text data set to be taken as a first training set.
In the embodiment of the invention, the first training set is expanded through Data Augmentation (Data Augmentation), which is a technology for automatically expanding training Data, and as a deep learning model needs a large amount of Data to support training, problems of poor model robustness and poor model performance caused by Data lack and unbalanced Data distribution often occur during actual model training, and the problems can be effectively avoided through Data Augmentation.
In the embodiment of the present invention, the preset mask language model may be a BERT model, a RoBERTa model, or the like. For example, if the first training set has M training examples, the top-n words are found in each training example, words are inserted into or replaced in each training example by using the mask language model to generate a new training example for each of the M training examples, and the second training set after data augmentation contains 2M examples.
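As a rough illustration of this augmentation step (not the patent's exact procedure), the sketch below uses a Hugging Face fill-mask pipeline as the preset mask language model; the checkpoint name, the number of replaced words, and the example sentences are assumptions.

```python
import random
from transformers import pipeline

# Fill-mask pipeline used as the "preset mask language model"; the checkpoint
# and the number of replaced words n are illustrative choices.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token

def augment(text: str, n: int = 3) -> str:
    """Mask n randomly chosen words and let the MLM propose replacements."""
    words = text.split()
    if len(words) < 2:
        return text
    for pos in random.sample(range(len(words)), k=min(n, len(words))):
        masked = words.copy()
        masked[pos] = MASK
        words[pos] = fill_mask(" ".join(masked), top_k=1)[0]["token_str"].strip()
    return " ".join(words)

# First training set: a batch of M text documents; the second training set pairs
# each training instance with its augmented instance (2M instances in total).
first_training_set = [
    "wall street stocks rallied after the quarterly earnings report",
    "the national team won the championship game on saturday",
]
second_training_set = [(x, augment(x)) for x in first_training_set]
```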
S2, obtaining a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set.
In the embodiment of the invention, the original clustering model comprises a clustering layer and a contrast learning layer. The clustering layer consists of a linear mapping layer, which linearly combines feature vectors and improves the clustering of feature vectors of the same kind; its dimensionality is 768 × K, where K is the preset number of categories. The contrast learning layer consists of a single-layer multilayer perceptron (MLP) with an input dimension of 768, an output dimension of 128, and ReLU as the activation function.
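A minimal PyTorch sketch of such an original clustering model is given below; the encoder object, the use of learnable class centers in the clustering layer, and the default dimensions are assumptions drawn from this description rather than the patent's exact implementation.

```python
import torch
import torch.nn as nn

class OriginalClusteringModel(nn.Module):
    """Sketch: shared text encoder + clustering layer holding K class centers +
    single-layer MLP contrastive head (768 -> 128 with ReLU)."""

    def __init__(self, encoder, hidden_dim=768, num_classes=4, proj_dim=128, alpha=1.0):
        super().__init__()
        self.encoder = encoder                              # maps texts -> (batch, 768) tensors
        self.centers = nn.Parameter(torch.randn(num_classes, hidden_dim))  # class centers (e.g. 4 for AgNews)
        self.projector = nn.Sequential(nn.Linear(hidden_dim, proj_dim), nn.ReLU())
        self.alpha = alpha                                  # fixed t-distribution parameter

    def embed(self, texts):
        return self.encoder(texts)

    def cluster_probs(self, z):
        # Student's t-distribution soft assignment (the class probability formula below).
        dist_sq = torch.cdist(z, self.centers) ** 2
        q = (1.0 + dist_sq / self.alpha) ** (-(self.alpha + 1) / 2)
        return q / q.sum(dim=1, keepdim=True)

    def project(self, z):
        # 128-dimensional representation used by the contrast learning layer.
        return self.projector(z)
```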
Specifically, the clustering the first training set by using the clustering layer of the original clustering model and calculating the clustering loss of the clustered first training set includes:
acquiring a preset category set, and calculating the category probability of each training instance in the first training set being divided into a specific category in the category set;
performing auxiliary optimization on the category probability by using the clustering layer to obtain an auxiliary probability;
calculating the class probability and the KL divergence value of the auxiliary probability of each training example in the first training set;
and taking the KL divergence value as a loss value of the training example, and summarizing the loss values of all the training examples to obtain the clustering loss of the first training set.
Further, the performing auxiliary optimization on the class probability by using the clustering layer to obtain an auxiliary probability includes:
and performing auxiliary optimization on the class probability by using auxiliary distribution in the clustering layer to obtain an auxiliary probability.
Optionally, the auxiliary probability is calculated as:

$$p_{ik} = \frac{q_{ik}^2 / f_k}{\sum_{k'=1}^{K} q_{ik'}^2 / f_{k'}}, \qquad f_k = \sum_{i=1}^{M} q_{ik}$$

where $p_{ik}$ is the auxiliary probability, $q_{ik}$ is the class probability, $f_k$ is the auxiliary distribution, $K$ is the preset number of categories, and $M$ is the number of training instances in the first training set.
In an optional embodiment of the present invention, the KL divergence value is calculated by the following formula:

$$\ell_i = KL(p_i \,\|\, q_i) = \sum_{k=1}^{K} p_{ik} \log \frac{p_{ik}}{q_{ik}}$$

where $\ell_i$ is the KL divergence value, $p_i$ denotes the set of auxiliary probabilities of the i-th training instance, $q_i$ denotes the set of class probabilities of the i-th training instance, $p_{ik}$ denotes the auxiliary probability of the k-th category, and $q_{ik}$ denotes the class probability of the k-th category.
In an alternative embodiment of the present invention, the loss values of all training examples are summarized by the following formula:

$$\mathcal{L}_{cluster} = \frac{1}{M} \sum_{i=1}^{M} \ell_i$$

where $\mathcal{L}_{cluster}$ is the clustering loss of the first training set and $M$ is the size of the first training set.
In detail, the calculating a class probability that each training instance in the first training set is classified into a specific class in the class set includes:
performing feature vector mapping on each training instance in the first training set by using a preset feature mapping model to obtain a first vector set;
calculating, using the following t-distribution formula, the class probability that each vectorized training instance in the first vector set is classified into the k-th class in the class set:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $q_{ik}$ is the class probability, $z_i$ denotes the i-th vectorized training instance in the first vector set, $K$ is the preset number of categories, $\mu_k$ denotes the class center of the k-th category, and $\alpha$ is a fixed parameter.
In an optional embodiment of the present invention, the preset feature mapping model may be a sequence Transformer model, which maps an input text document to the feature space, that is, converts the input text document into a vector representation $z_i$.
In the embodiment of the invention, the auxiliary distribution is used for carrying out auxiliary optimization on the class probability, so that the class probability is closer to the corresponding class, and the classification accuracy is improved.
In the embodiment of the invention, the training examples with the same semantic category can be clustered together through the linear mapping layer of the clustering layer, and the category probability of each training example is optimized through auxiliary distribution, so that the accuracy of text clustering is further improved.
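A minimal sketch of this clustering-loss computation, reusing the model sketched above, is shown below; the detached (fixed) target distribution and the batch-mean reduction are assumptions.

```python
import torch
import torch.nn.functional as F

def target_distribution(q: torch.Tensor) -> torch.Tensor:
    # Auxiliary probabilities: sharpen q and normalise by the soft cluster frequency f_k.
    weight = q ** 2 / q.sum(dim=0, keepdim=True)
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_loss(model, batch_texts) -> torch.Tensor:
    z = model.embed(batch_texts)             # feature vectors of the first training set
    q = model.cluster_probs(z)               # class probabilities (t-distribution)
    p = target_distribution(q).detach()      # auxiliary probabilities, treated as fixed targets
    # KL(p || q) per training instance, averaged over the batch: the clustering loss.
    return F.kl_div(q.log(), p, reduction="batchmean")
```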
And S3, calculating the contrast loss of the second training set by using the contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model.
In this embodiment of the present invention, the calculating the contrast loss of the second training set by using the contrast learning layer of the original clustering model includes:
randomly selecting training examples in the second training set and augmented examples corresponding to the training examples as a pair of positive samples, and taking all unselected examples as negative samples;
and calculating the contrast loss of the positive sample compared with the negative sample until the training examples in the second training set are all selected as positive samples, and summarizing the contrast loss of all the positive samples to obtain the contrast loss of the second training set.
In detail, the calculating the contrast loss of the positive samples compared to the negative samples until all training examples in the second training set are selected as positive samples, and summarizing the contrast losses of all positive samples to obtain the contrast loss of the second training set includes:
performing feature vector mapping on the positive samples and the negative samples in the second training set by using the feature mapping model to obtain a second vector set;
calculating the contrast loss of the vectorized positive sample in the second vector set compared with the vectorized negative sample by using a preset contrast loss function;
and when the training examples in the second training set are all selected as positive samples, summarizing the contrast losses of all the positive samples to obtain the contrast loss of the second training set.
In an optional embodiment of the present invention, the calculating, by using a preset contrast loss function, the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set includes:
calculating the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set using the following contrast loss function:

$$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_{i'}) / \tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$

where $\ell_i$ is the contrast loss of the positive sample, $(z_i, z_{i'})$ is a pair of vectorized positive samples, $i$ denotes the training instance, $i'$ denotes the corresponding augmented instance, $\tau$ is the temperature parameter, $\mathbb{1}$ is the indicator function, $2M$ is the number of training instances in the second training set, and $z_j$ denotes a vectorized negative sample.
In an alternative embodiment of the present invention, the contrast losses of all positive samples are summarized using the following formula:

$$\mathcal{L}_{contrast} = \frac{1}{2M} \sum_{i=1}^{2M} \ell_i$$

where $\mathcal{L}_{contrast}$ is the contrast loss of the second training set and $2M$ is the number of training instances in the second training set.
In the embodiment of the present invention, the second training set obtained by data augmentation contains 2M instances. For each pair of positive samples, the remaining 2M-2 samples are all regarded as negative samples, and the contrast loss separates the pair of positive samples from the other 2M-2 negative samples, so that augmented instances derived from the same instance are close to each other in the representation space while augmented instances derived from different instances are far from each other, improving the accuracy of text classification.
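The following sketch implements such a contrastive loss over the 2M-instance batch in the NT-Xent style implied by the formula above; the cosine similarity, the default temperature, and the batch-ordering convention are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(h: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """h: (2M, d) projections ordered so that h[i] and h[i + M] form a positive pair."""
    two_m = h.size(0)
    m = two_m // 2
    h = F.normalize(h, dim=1)
    sim = h @ h.t() / temperature                          # pairwise cosine similarities
    mask = torch.eye(two_m, dtype=torch.bool, device=h.device)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-similarity
    # Index of each row's positive partner: i <-> i + M.
    pos = torch.cat([torch.arange(m, two_m), torch.arange(m)]).to(h.device)
    # Cross-entropy against the positive index equals -log softmax at the positive,
    # averaged over all 2M instances (the summarized contrast loss).
    return F.cross_entropy(sim, pos)
```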
In detail, the training of the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model comprises:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
In an optional embodiment of the present invention, the calculating a joint loss value of the clustering loss and the contrast loss based on the pre-constructed balance factor includes:
calculating a joint loss value for the clustering loss and the contrast loss using a joint loss function of:
$$\mathcal{L} = \mathcal{L}_{contrast} + \eta \, \mathcal{L}_{cluster}$$

where $\mathcal{L}$ is the joint loss value, $\eta$ is the balance factor, $\mathcal{L}_{contrast}$ is the contrast loss, and $\mathcal{L}_{cluster}$ is the clustering loss.
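A sketch of one joint training step, reusing the loss functions sketched earlier, might look as follows; the choice to scale the clustering term by the balance factor eta is an assumption.

```python
import torch

def train_step(model, optimizer, first_batch, pair_batch, eta: float = 1.0):
    """One joint update; `pair_batch` is a list of (instance, augmented instance) pairs."""
    # Clustering loss on the first training set.
    l_cluster = clustering_loss(model, first_batch)
    # Contrastive loss on the second (augmented) training set.
    originals, augments = zip(*pair_batch)
    h = model.project(model.embed(list(originals) + list(augments)))
    l_contrast = contrastive_loss(h)
    # Joint loss combining the two losses through the balance factor eta.
    loss = l_contrast + eta * l_cluster
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```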
In an optional embodiment of the present invention, the preset training conditions include reaching a maximum number of training iterations, convergence of the cluster class centers, and the like.
In another optional embodiment of the present invention, after obtaining the standard clustering model, the method further includes:
and selecting a preset number of text documents from the text training set as test cases, clustering the test cases by using the standard clustering model, and calculating the evaluation index result of the clustered test cases.
In the embodiment of the invention, the evaluation indexes may be accuracy, normalized mutual information, or other metrics, and the performance of the standard clustering model is quantified according to the evaluation index results.
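A sketch of this evaluation step is shown below, assuming a labelled test split is available for measurement; clustering accuracy is computed with the Hungarian matching from SciPy and normalized mutual information with scikit-learn.

```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Best one-to-one mapping between cluster ids and labels (Hungarian algorithm).
    k = int(max(y_true.max(), y_pred.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / y_true.size

def evaluate(model, test_texts, test_labels):
    with torch.no_grad():
        q = model.cluster_probs(model.embed(test_texts))
    y_pred = q.argmax(dim=1).cpu().numpy()
    y_true = np.asarray(test_labels)
    return {"accuracy": clustering_accuracy(y_true, y_pred),
            "nmi": normalized_mutual_info_score(y_true, y_pred)}
```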
And S4, classifying the texts to be classified by using the standard clustering model to obtain a classification result.
In the embodiment of the invention, the text to be classified may be, for example, user emotion analysis text or user intention recognition text obtained through speech-to-text conversion. For example, user emotion text is classified by the standard clustering model into the two classification results "happy" and "unhappy", or user intention recognition text is classified by the standard clustering model into different intention classification results.
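A minimal inference sketch with the trained standard clustering model might look as follows; the cluster-to-label mapping shown here is hypothetical and would in practice be assigned after inspecting the clusters.

```python
import torch

# Hypothetical mapping from cluster ids to human-readable labels for a
# two-class emotion example.
CLUSTER_LABELS = {0: "happy", 1: "unhappy"}

def classify(model, texts):
    with torch.no_grad():
        q = model.cluster_probs(model.embed(texts))   # class probabilities
    return [CLUSTER_LABELS[int(k)] for k in q.argmax(dim=1)]

# Usage (assuming a trained `standard_model`):
# classify(standard_model, ["the agent resolved my issue quickly, thank you"])
```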
According to the invention, constructing the first training set and the second training set from the text training set increases the amount of training data and makes the data distribution more uniform, thereby improving the accuracy of model training. Clustering the first training set causes instances from the same text category to be grouped together, and calculating the contrast loss of the second training set with the contrast learning layer pushes data from different instances apart in the representation space, which significantly reduces the overlap of data between categories and improves the accuracy of text classification. Therefore, the unsupervised text classification method provided by the invention can solve the problem of low accuracy of text classification.
As shown in fig. 2, an embodiment of the present application provides a block schematic diagram of an unsupervised text classification apparatus 10, where the unsupervised text classification apparatus 10 includes: the training set constructing module 11, the cluster loss calculating module 12, the joint training module 13, and the text classifying module 14.
The training set constructing module 11 is configured to obtain a text training set, and construct a first training set and a second training set based on the text training set;
the clustering loss calculating module 12 is configured to obtain a pre-constructed original clustering model, perform clustering processing on the first training set by using a clustering layer of the original clustering model, and calculate a clustering loss of the clustered first training set;
the joint training module 13 is configured to calculate a comparison loss of the second training set by using a comparison learning layer of the original clustering model, and train the original clustering model by combining the clustering loss and the comparison loss to obtain a standard clustering model;
the text classification module 14 is configured to classify the text to be classified by using the standard clustering model to obtain a classification result.
In detail, when the modules in the unsupervised text classification device 10 in the embodiment of the present application are used, the same technical means as the unsupervised text classification method described in fig. 1 above are adopted, and the same technical effects can be produced, and are not described again here.
As shown in fig. 3, an electronic device provided in the embodiment of the present application includes a processor 111, a communication interface 112, a memory 113, and a communication bus 114, where the processor 111, the communication interface 112, and the memory 113 complete communication with each other through the communication bus 114;
a memory 113 for storing a computer program;
in an embodiment of the present application, the processor 111, when executing the program stored in the memory 113, is configured to implement the unsupervised text classification method provided in any one of the foregoing method embodiments, including:
acquiring a text training set, and constructing a first training set and a second training set based on the text training set;
acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model;
and classifying the texts to be classified by using the standard clustering model to obtain a classification result.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the unsupervised text classification method as provided in any of the foregoing method embodiments.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An unsupervised text classification method, characterized in that the method comprises:
acquiring a text training set, and constructing a first training set and a second training set based on the text training set;
acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
calculating the contrast loss of the second training set by using a contrast learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model;
classifying the texts to be classified by using the standard clustering model to obtain a classification result;
wherein the constructing a first training set and a second training set based on the text training set comprises:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set;
wherein, the training of the original clustering model by combining the clustering loss and the contrast loss to obtain a standard clustering model comprises:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
2. The unsupervised text classification method according to claim 1, wherein the clustering the first training set using the clustering layer of the original clustering model and calculating the clustering loss of the clustered first training set comprises:
acquiring a preset category set, and calculating the category probability of each training instance in the first training set being divided into a specific category in the category set;
performing auxiliary optimization on the category probability by using the clustering layer to obtain an auxiliary probability;
calculating the class probability and the KL divergence value of the auxiliary probability of each training example in the first training set;
and taking the KL divergence value as a loss value of the training example, and summarizing the loss values of all the training examples to obtain the clustering loss of the first training set.
3. The unsupervised text classification method of claim 2, wherein the calculating a class probability that each training instance in the first training set is classified into a particular class in the set of classes comprises:
performing feature vector mapping on each training instance in the first training set by using a preset feature mapping model to obtain a first vector set;
calculating, using the following t-distribution formula, the class probability that each vectorized training instance in the first vector set is classified into the k-th class in the class set:

$$q_{ik} = \frac{\left(1 + \lVert z_i - \mu_k \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \lVert z_i - \mu_{k'} \rVert^2 / \alpha \right)^{-\frac{\alpha+1}{2}}}$$

where $q_{ik}$ is the class probability, $z_i$ denotes the i-th vectorized training instance in the first vector set, $K$ is the preset number of categories, $\mu_k$ denotes the class center of the k-th category, and $\alpha$ is a fixed parameter.
4. The unsupervised text classification method of claim 1, wherein the calculating the contrast loss of the second training set using the contrast learning layer of the original clustering model comprises:
randomly selecting training examples in the second training set and augmented examples corresponding to the training examples as a pair of positive samples, and taking all unselected examples as negative samples;
and calculating the contrast loss of the positive sample compared with the negative sample until the training examples in the second training set are all selected as positive samples, and summarizing the contrast loss of all the positive samples to obtain the contrast loss of the second training set.
5. The unsupervised method of text classification according to claim 4, wherein the calculating the contrast loss of the positive samples compared to the negative samples until all training instances in the second training set are selected as positive samples, and summarizing the contrast losses of all positive samples to obtain the contrast loss of the second training set comprises:
performing feature vector mapping on the positive samples and the negative samples in the second training set by using the feature mapping model to obtain a second vector set;
calculating the contrast loss of the vectorized positive sample in the second vector set compared with the vectorized negative sample by using a preset contrast loss function;
and when the training examples in the second training set are all selected as positive samples, summarizing the contrast losses of all the positive samples to obtain the contrast loss of the second training set.
6. The unsupervised text classification method as claimed in claim 5, wherein said calculating, by using a preset contrast loss function, the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set comprises:
calculating the contrast loss of the vectorized positive samples compared with the vectorized negative samples in the second vector set using the following contrast loss function:

$$\ell_i = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_{i'}) / \tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{[j \neq i]} \exp\left(\mathrm{sim}(z_i, z_j) / \tau\right)}$$

where $\ell_i$ is the contrast loss of the positive sample, $(z_i, z_{i'})$ is a pair of vectorized positive samples, $i$ denotes the training instance, $i'$ denotes the corresponding augmented instance, $\tau$ is the temperature parameter, $\mathbb{1}$ is the indicator function, $2M$ is the number of training instances in the second training set, and $z_j$ denotes a vectorized negative sample.
7. The unsupervised text classification method according to claim 1, wherein the calculating a joint loss value for the cluster loss and the contrast loss based on a pre-constructed balance factor comprises:
calculating a joint loss value for the clustering loss and the contrast loss using a joint loss function of:
$$\mathcal{L} = \mathcal{L}_{contrast} + \eta \, \mathcal{L}_{cluster}$$

where $\mathcal{L}$ is the joint loss value, $\eta$ is the balance factor, $\mathcal{L}_{contrast}$ is the contrast loss, and $\mathcal{L}_{cluster}$ is the clustering loss.
8. An unsupervised text classification apparatus, characterized in that the apparatus comprises:
the training set constructing module is used for acquiring a text training set and constructing a first training set and a second training set based on the text training set;
the clustering loss calculation module is used for acquiring a pre-constructed original clustering model, clustering the first training set by using a clustering layer of the original clustering model, and calculating the clustering loss of the clustered first training set;
the joint training module is used for calculating the comparison loss of the second training set by using a comparison learning layer of the original clustering model, and training the original clustering model by combining the clustering loss and the comparison loss to obtain a standard clustering model;
the text classification module is used for classifying the texts to be classified by utilizing the standard clustering model to obtain a classification result;
wherein the training set constructing module is specifically configured to:
selecting a preset number of text documents from the text training set as training examples, and summarizing all the selected training examples to obtain the first training set;
randomly selecting a preset number of words from the training examples in the first training set, and randomly inserting or replacing the preset number of words into the text of each training example in the first training set by using a preset mask language model to obtain an augmented example;
summarizing the training examples and the augmentation examples corresponding to the training examples to obtain the second training set;
wherein, the joint training module is specifically configured to:
calculating a joint loss value of the clustering loss and the contrast loss based on a pre-constructed balance factor;
updating parameters of the original clustering model based on the joint loss value, and returning to the step of clustering the first training set by using the clustering layer of the original clustering model when model training does not meet a preset training condition until model training meets the training condition to obtain the standard clustering model.
9. An electronic device is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the unsupervised text classification method of any one of claims 1-7 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the unsupervised text classification method according to any one of claims 1 to 7.
CN202111249214.9A 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium Active CN113704479B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111249214.9A CN113704479B (en) 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111249214.9A CN113704479B (en) 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113704479A (en) 2021-11-26
CN113704479B (en) 2022-02-18

Family

ID=78647032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111249214.9A Active CN113704479B (en) 2021-10-26 2021-10-26 Unsupervised text classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113704479B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522958A (en) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and device
CN112214602A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Text classification method and device based on humor, electronic equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111295669A (en) * 2017-06-16 2020-06-16 马克波尔公司 Image processing system
CN110298415B (en) * 2019-08-20 2019-12-03 视睿(杭州)信息科技有限公司 A kind of training method of semi-supervised learning, system and computer readable storage medium
US11263476B2 (en) * 2020-03-19 2022-03-01 Salesforce.Com, Inc. Unsupervised representation learning with contrastive prototypes
US11386302B2 (en) * 2020-04-13 2022-07-12 Google Llc Systems and methods for contrastive learning of visual representations
CN113378632B (en) * 2021-04-28 2024-04-12 南京大学 Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522958A (en) * 2020-05-28 2020-08-11 泰康保险集团股份有限公司 Text classification method and device
CN112214602A (en) * 2020-10-23 2021-01-12 中国平安人寿保险股份有限公司 Text classification method and device based on humor, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113704479A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
TWI677852B (en) A method and apparatus, electronic equipment, computer readable storage medium for extracting image feature
CN109871446B (en) Refusing method in intention recognition, electronic device and storage medium
TWI752455B (en) Image classification model training method, image processing method, data classification model training method, data processing method, computer device, and storage medium
CN110569500A (en) Text semantic recognition method and device, computer equipment and storage medium
CN108475262A (en) Electronic equipment and method for text-processing
CN111274371B (en) Intelligent man-machine conversation method and equipment based on knowledge graph
CN113220886A (en) Text classification method, text classification model training method and related equipment
JPWO2014136316A1 (en) Information processing apparatus, information processing method, and program
Yu et al. Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
CN110827797B (en) Voice response event classification processing method and device
CN106156120B (en) Method and device for classifying character strings
CN109726291B (en) Loss function optimization method and device of classification model and sample classification method
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN113449084A (en) Relationship extraction method based on graph convolution
CN113254637A (en) Grammar-fused aspect-level text emotion classification method and system
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
JP2020098592A (en) Method, device and storage medium of extracting web page content
CN114282513A (en) Text semantic similarity matching method and system, intelligent terminal and storage medium
CN112446405A (en) User intention guiding method for home appliance customer service and intelligent home appliance
CN109727091A (en) Products Show method, apparatus, medium and server based on dialogue robot
CN110969005A (en) Method and device for determining similarity between entity corpora
CN113704479B (en) Unsupervised text classification method and device, electronic equipment and storage medium
CN111552810A (en) Entity extraction and classification method and device, computer equipment and storage medium
CN116738956A (en) Prompt template generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant