CN113010675A - Method and device for classifying text information based on GAN and storage medium - Google Patents


Info

Publication number
CN113010675A
CN113010675A · CN202110270549.2A
Authority
CN
China
Prior art keywords
text information
model
text
classification
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110270549.2A
Other languages
Chinese (zh)
Inventor
汪剑
李志飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN202110270549.2A priority Critical patent/CN113010675A/en
Publication of CN113010675A publication Critical patent/CN113010675A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text information classification method and apparatus based on a generative adversarial network (GAN), and a computer-readable storage medium. First, a generative adversarial network model is constructed; then the model is trained with both labeled and unlabeled text data to obtain a text information classification model; finally, that classification model is used to classify text information. In this way, a large amount of unlabeled text data can be used to train the text information classification model, so that the model achieves higher precision, the classification results are more accurate, and applications built on text classification perform better.

Description

Method and device for classifying text information based on GAN and storage medium
Technical Field
The present application relates to the field of information processing, and in particular to a method and an apparatus for classifying text information based on a generative adversarial network (GAN), and a computer-readable storage medium.
Background
Traditional text classification methods mainly train a model on labeled training data. In practice, however, labeled training data is very limited, while a large amount of useful data, such as log information, is unlabeled. If unlabeled data could be introduced during model training, model precision could be greatly improved, and the accuracy of text classification with the model would improve accordingly.
Therefore, how to introduce unlabeled data into model training, so that the model yields more accurate classification results when actually applied to text classification, is a technical problem to be solved.
Disclosure of Invention
The applicant creatively provides a text information classification method, apparatus, and storage medium based on a generative adversarial network.
According to a first aspect of the embodiments of the present application, a text information classification method based on a generative adversarial network includes: acquiring text information to be classified; obtaining the classification to which the text information belongs according to the text information and a first text information classification model, wherein the first text information classification model is based on a generative adversarial network and is obtained by training with labeled text data and unlabeled text data; and outputting the classification.
According to an embodiment of the present application, before obtaining the classification to which the text information belongs according to the text information and the first text information classification model, the method further includes: constructing a generative adversarial network model; and training the generative adversarial network model with labeled text data and unlabeled text data to obtain the first text information classification model.
According to an embodiment of the present application, training the generative adversarial network model with labeled and unlabeled text data to obtain the first text information classification model includes: performing vector conversion on the labeled text data to obtain a first vector representation; performing vector conversion on the unlabeled text data to obtain a second vector representation; inputting the first and second vector representations into the discriminative model of the generative adversarial network for data-source discrimination, wherein the first vector representation is discriminated as 1 and the second as 0; inputting the first and second vector representations into the text information classification model of the generative adversarial network for classification, wherein the first vector representation is classified into one of classes 0 to K-1 and the second vector representation into class K, K being a natural number greater than or equal to 2; and performing adversarial training on the generative adversarial network to obtain the first text information classification model.
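The target-labeling scheme in this step (labeled samples keep their own class among 0 to K-1 and receive source label 1; unlabeled samples receive the extra class K and source label 0) can be sketched as follows. The function name, example texts, and data layout are illustrative, not from the patent:

```python
K = 3  # number of real classes (K >= 2)

def make_training_targets(labeled, unlabeled, num_classes=K):
    """Return (text, class_target, source_target) triples for both pools.

    labeled   -- list of (text, class_id) with class_id in 0..num_classes-1
    unlabeled -- list of plain text strings
    """
    examples = []
    for text, cls in labeled:
        examples.append((text, cls, 1))           # own class, source = 1
    for text in unlabeled:
        examples.append((text, num_classes, 0))   # extra class K, source = 0
    return examples

batch = make_training_targets([("error: disk full", 0)], ["uptime 42 days"])
```

The discriminative model trains on the third element of each triple, while the classification model trains on the second.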
According to one embodiment of the present application, performing vector conversion includes: performing vector conversion using a BERT pre-training model.
According to an embodiment of the present application, the method further comprises: performing min-max training on the discriminative model, wherein the training uses a gradient reversal layer (GRL) to realize gradient negative feedback.
According to an embodiment of the present application, the discriminative model uses the following loss function:
E_{x~Pg}[f_w(x)] − E_{x~Pr}[f_w(x)]
where E denotes the mathematical expectation, Pg denotes the distribution of the second vector representation (unlabeled data), Pr the distribution of the first vector representation (labeled data), and f_w the discriminative model.
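A minimal numeric reading of this loss, with the expectations estimated by mini-batch means; the stand-in scoring function `f_w` and the sample batches are illustrative, not from the patent:

```python
import numpy as np

def discriminator_loss(f_w, x_g, x_r):
    """E_{x~Pg}[f_w(x)] - E_{x~Pr}[f_w(x)], estimated over mini-batches.

    x_g -- batch of second vector representations (drawn from Pg)
    x_r -- batch of first vector representations (drawn from Pr)
    """
    return np.mean(f_w(x_g)) - np.mean(f_w(x_r))

f_w = lambda x: x.sum(axis=1)              # toy linear "discriminator"
x_g = np.array([[1.0, 2.0], [3.0, 4.0]])   # scores 3 and 7, mean 5
x_r = np.array([[0.0, 1.0], [1.0, 0.0]])   # scores 1 and 1, mean 1
loss = discriminator_loss(f_w, x_g, x_r)   # 5 - 1 = 4.0
```

Minimizing this quantity in w pushes the discriminator's scores apart for the two sources, which is the behavior the adversarial training then works against.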
According to one embodiment of the present application, the unlabeled text data includes randomly generated gaussian noise data.
According to a second aspect of the embodiments of the present application, there is provided a text information classification apparatus based on a generative adversarial network, the apparatus including: an information acquisition module configured to acquire text information to be classified; a classification determination module configured to obtain the classification to which the text information belongs according to the text information and a first text information classification model, wherein the first text information classification model is based on a generative adversarial network and is trained with labeled and unlabeled text data; and a classification output module configured to output the classification.
According to an embodiment of the present application, the apparatus further comprises: a generative adversarial network model construction module configured to construct a generative adversarial network model; and a first text information classification model training module configured to train the generative adversarial network model with labeled and unlabeled text data to obtain the first text information classification model.
According to a third aspect of the embodiments of the present application, there is provided a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform any one of the above text information classification methods based on a generative adversarial network.
The embodiments of the present application provide a text information classification method and apparatus based on a generative adversarial network, and a computer-readable storage medium. First, a generative adversarial network model is constructed; then the model is trained with labeled and unlabeled text data to obtain a text information classification model; finally, that model is used to classify text information. In this way, a large amount of unlabeled text data, such as log information, can be used to train the text information classification model, so that the model achieves higher precision, the classification results are more accurate, and applications built on text classification (such as spam identification, system error classification, and sentiment analysis) perform better.
It is to be understood that not all of the above advantages need to be achieved by any single embodiment; a specific technical solution may achieve a specific technical effect, and other embodiments of the present application may achieve advantages not mentioned above.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present application will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a schematic flow chart illustrating an implementation of a text information classification method based on a generation countermeasure network according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a basic process of generating a confrontation training of a confrontation network model according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a text information classification device based on a generation countermeasure network according to an embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present application more obvious and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. The particular features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by those skilled in the art as long as they do not contradict one another.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Fig. 1 shows an implementation flow of a text information classification method based on a generative adversarial network in an embodiment of the present application. Referring to fig. 1, an embodiment of the present application provides a text information classification method based on a generative adversarial network, the method including: operation 110, acquiring text information to be classified; operation 120, obtaining the classification to which the text information belongs according to the text information and a first text information classification model, wherein the first text information classification model is based on a generative adversarial network and is trained with labeled and unlabeled text data; operation 130, outputting the classification.
Here, text information refers to information in text format, including: file contents read from text files; text information queried from text and string fields in a database; text information crawled from web page contents; and the like.
Text information classification refers to classifying given text information into one or more of K categories, where K is a natural number greater than or equal to 2. Text classification is often used in applications such as system error classification, spam recognition, and sentiment analysis.
Labeled text data refers to training data formed from text information annotated with its desired classification; unlabeled text data is training data formed from text information for which no desired classification is given.
Currently, common text classification methods are based on deep-learning text classification models such as fastText, TextCNN, and TextRNN. However, these models must be trained with a large amount of labeled text to achieve good model precision and application effect; labeled training data is very limited, while far more ubiquitous data, such as log information, is unlabeled. Manually annotating the unlabeled text data would require a great deal of manpower, material resources, and time.
A generative adversarial network produces more accurate outputs through the mutual game learning of at least two modules in its framework: a generative model and a discriminative model. The generative model learns the characteristics of existing data to generate new data, while the discriminative model judges whether the newly generated data is real. Through the adversarial game between the two models and their respective deep-learning processes, the discriminative model becomes ever better at telling real from fake, and correspondingly the data produced by the generative model becomes ever harder to distinguish from the real thing.
Thus, the inventors creatively conceived an adversarial training method using a generative adversarial network, in which labeled and unlabeled text data are mixed and fed to a generative model and a discriminative model, where the generative model performs classification and the discriminative model discriminates the source of the data. After the adversarial training has continued for some time and reached a certain precision, the generative model can more accurately simulate the true classes of the unlabeled data, thereby forming a text classification model of higher precision.
Therefore, in operation 120, the embodiment classifies text with the text classification model obtained by the above training method, i.e., the first text information classification model. Because the first text information classification model uses a large amount of unlabeled data and has undergone adversarial training against the discriminative model, its precision is higher than that of a model trained only on labeled data without adversarial training, and its classification results are more accurate.
It should be noted that the embodiment shown in fig. 1 is only one basic embodiment of the text information classification method based on a generative adversarial network, and implementers can further refine and expand it.
According to an embodiment of the present application, before obtaining the classification to which the text information belongs according to the text information and the first text information classification model, the method further includes: constructing a generative adversarial network model; and training the generative adversarial network model with labeled and unlabeled text data to obtain the first text information classification model.
When constructing the generative adversarial network model, any existing applicable deep-learning library, such as Keras, or an open-source artificial intelligence framework may be used; the specific process may include: importing packages; defining variables; building the generator and the discriminator; defining the optimizer; and so on.
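The construction steps just listed can be sketched with plain numpy in place of a deep-learning library; every layer size, name, and initialization here is illustrative, not from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, HIDDEN, K = 8, 16, 3   # define variables: vector size, hidden width, classes

# Build the generator/classifier: one hidden layer with K+1 outputs
# (classes 0..K-1 plus the extra class K reserved for unlabeled data).
gen_params = {"W1": rng.normal(size=(DIM, HIDDEN)) * 0.1,
              "W2": rng.normal(size=(HIDDEN, K + 1)) * 0.1}

# Build the discriminator: a linear scorer for the data source.
disc_params = {"w": rng.normal(size=DIM) * 0.1}

def generator(x, p=gen_params):
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]   # ReLU MLP logits

def discriminator(x, p=disc_params):
    return x @ p["w"]                               # scalar source score

def sgd_step(param, grad, lr=0.01):
    return param - lr * grad                        # define the optimizer (plain SGD)

logits = generator(np.zeros((2, DIM)))              # batch of 2 vectors in, K+1 logits out
```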
The embodiments of the present application do not limit the specific means by which an implementer constructs the generative adversarial model; in principle, though, constructing a model that is easy to use and optimizes well can greatly simplify and shorten the training process and let the model converge faster.
Then, the labeled and unlabeled text data can be encoded and converted to serve as the training data set for the constructed generative adversarial network model, and adversarial training can be performed on it.
According to an embodiment of the present application, training the generative adversarial network model with labeled and unlabeled text data to obtain the first text information classification model includes: performing vector conversion on the labeled text data to obtain a first vector representation; performing vector conversion on the unlabeled text data to obtain a second vector representation; inputting the first and second vector representations into the discriminative model of the generative adversarial network for data-source discrimination, wherein the first vector representation is discriminated as 1 and the second as 0; inputting the first and second vector representations into the text information classification model of the generative adversarial network for classification, wherein the first vector representation is classified into one of classes 0 to K-1 and the second vector representation into class K, K being a natural number greater than or equal to 2; and performing adversarial training on the generative adversarial network to obtain the first text information classification model.
Text data cannot be directly input into the generative adversarial network; it must first be converted into a mathematical expression, i.e., a vector representation, a form a computer can recognize and operate on. After conversion, the labeled and unlabeled text data can be input into the pre-built text information classification model and discriminative model of the generative adversarial network for learning.
When training the discriminative model, the expected output for the first vector representation (corresponding to labeled text data) may be marked as 1, and the expected output for the second vector representation (corresponding to unlabeled text data) as 0. The discriminative model can thus fully learn the difference between labeled and unlabeled text data. When the discriminative model can no longer distinguish the data source from the classification results that the text classification model produces for labeled and unlabeled data, this indicates that the text classification model has learned the characteristics of the unlabeled text data and can classify it accurately, yielding the desired first text classification model fit for practical application.
When training the text information classification model, the expected output for the first vector representation (corresponding to labeled text data) may be marked as the annotated class of that text, i.e., the corresponding class among classes 0 to K-1, and the expected output for the second vector representation (corresponding to unlabeled text data) as class K. In the initial state, the classification model thus outputs classes in the way that makes the data source easiest to distinguish, so it is easily defeated by the discriminative model: after learning for a period of time, whenever the discriminative model sees class K, it recognizes that such data comes from unlabeled text data.
Once identified, the text information classification model further learns the characteristics of the labeled text data and tries to produce a new classification result, changing the class of the unlabeled text data to one of classes 0 to K-1, and outputs that result for the discriminative model to judge again. If the discriminative model judges the data source of the classification result to be labeled text data (1), the text information classification model has learned the classes of the unlabeled text data; if the discriminative model still judges the source to be unlabeled text data (0), the model has not yet learned them, and it can be adjusted and optimized according to the loss function to classify the unlabeled text data again. Thus, through multiple iterations, optimizations, and games against the discriminative model, a text information classification model is obtained that has learned the characteristics of the unlabeled text data and can classify it. This process constitutes the adversarial training of the generative adversarial network model.
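The iterate-until-fooled loop above can be illustrated with a deliberately tiny numeric game, not the patent's actual model: one side keeps adjusting its output statistics until the other side can no longer separate them from the reference statistics of the labeled data. All quantities here are illustrative scalars:

```python
real_mean = 2.0     # stand-in for statistics of labeled-data outputs
gen_mean = -3.0     # stand-in for the classifier's initial outputs
lr = 0.5            # step size of each adjustment

for step in range(100):
    # Discriminator's move: it can still separate the sources iff their
    # output statistics differ noticeably.
    still_separable = abs(real_mean - gen_mean) > 1e-3
    if not still_separable:
        break       # discriminator fooled: the game has converged
    # Classifier's move: a gradient-style step toward the real statistics.
    gen_mean += lr * (real_mean - gen_mean)
```

The gap halves each round, so the loop terminates well within the iteration budget; in the real model the "step toward the real statistics" is a loss-driven parameter update rather than a direct move of a scalar.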
According to one embodiment of the present application, performing vector conversion includes: performing vector conversion using a BERT pre-training model.
As a successor to Word2Vec, BERT has greatly advanced accuracy across many directions in the NLP domain, and pre-trained language models have been shown to reach better model precision while learning from less data. Therefore, using a BERT pre-training model for vector conversion yields higher model precision and lays a good data foundation for the subsequent text information classification model.
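A self-contained stand-in for the vector conversion step (BERT itself requires pretrained weights and a library such as Hugging Face `transformers`): this sketch mean-pools random per-token embeddings from a fixed toy vocabulary. The vocabulary, embedding size, and function name are illustrative assumptions; only the interface, text in and a fixed-size vector out, mirrors the pipeline above:

```python
import numpy as np

rng = np.random.default_rng(42)
VOCAB = {w: i for i, w in enumerate(["disk", "full", "error", "uptime", "days"])}
EMBED = rng.normal(size=(len(VOCAB), 8))   # stand-in for BERT hidden states

def text_to_vector(text):
    """Convert text to a fixed-size vector by mean-pooling token embeddings."""
    ids = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    if not ids:
        return np.zeros(EMBED.shape[1])    # no known tokens: zero vector
    return EMBED[ids].mean(axis=0)

v = text_to_vector("error disk full")      # an 8-dimensional representation
```

In practice one would instead obtain the vectors from a pretrained model, e.g. `AutoModel.from_pretrained(...)` in the `transformers` library, and pool its hidden states the same way.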
According to an embodiment of the present application, the method further comprises: performing min-max training on the discriminative model, wherein the training uses a gradient reversal layer to realize gradient negative feedback.
The min-max problem is a common and important mathematical programming problem: finding the minimum of a maximum. It must be solved in the adversarial training of a generative adversarial network, and that is the min-max training referred to in this embodiment.
In this embodiment, the gradient reversal layer realizes gradient negative feedback, which maximizes the source-discrimination error, i.e., drives the text information classification model toward outputs whose data source cannot be distinguished. Therefore, after the discriminative model completes min-max training and undergoes adversarial training against the text information classification model, the precision of the text information classification model can be higher.
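A gradient reversal layer follows two rules: the forward pass is the identity, and the backward pass multiplies the incoming gradient by a negative factor. A minimal numpy sketch of exactly those two rules (the class name and the `lam` scaling factor are illustrative; framework versions, e.g. a custom autograd function, implement the same behavior):

```python
import numpy as np

class GradientReversal:
    """Identity on the forward pass; negated, scaled gradient on the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                          # pass features through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output    # flip the gradient's sign (and scale it)

grl = GradientReversal(lam=1.0)
x = np.array([1.0, -2.0])
```

Placed between the feature extractor and the source discriminator, the flipped gradient pushes the features toward being source-indistinguishable while the discriminator itself still trains to separate them, which is the negative feedback described above.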
According to an embodiment of the present application, the discriminative model uses the following loss function:
E_{x~Pg}[f_w(x)] − E_{x~Pr}[f_w(x)]
where E denotes the mathematical expectation, Pg denotes the distribution of the second vector representation (unlabeled data), Pr the distribution of the first vector representation (labeled data), and f_w the discriminative model.
According to one embodiment of the present application, the unlabeled text data includes randomly generated gaussian noise data.
In this embodiment, randomly generated Gaussian noise data is used for the adversarial training in addition to the unlabeled text data, which can further improve the robustness and fault tolerance of the text information classification model.
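Augmenting the unlabeled pool this way amounts to sampling noise vectors in the same space as the converted text and appending them to the unlabeled training set; a short sketch, with the function name, dimensions, and noise parameters as illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise_samples(unlabeled_vectors, n_noise, dim):
    """Append n_noise standard-Gaussian vectors to the unlabeled pool."""
    noise = rng.normal(loc=0.0, scale=1.0, size=(n_noise, dim))
    return np.vstack([unlabeled_vectors, noise])

unlabeled = np.zeros((10, 8))                              # 10 unlabeled text vectors
augmented = add_gaussian_noise_samples(unlabeled, n_noise=4, dim=8)
```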
Fig. 2 is a schematic diagram illustrating a basic process of generating a confrontation training of a confrontation network model according to an embodiment of the present application.
As shown in fig. 2, the basic process of generating the confrontation training of the confrontation network model mainly includes:
first, the labeled text data 201 and the unlabeled text data 202 (including randomly generated Gaussian noise data) are input into the BERT pre-training model 203 for vector conversion, obtaining the corresponding first vector representation 204 (for the labeled text data 201) and second vector representation 205 (for the unlabeled text data 202);
then, the first vector representation 204 and the second vector representation 205 are taken as input to the discriminative model 207 for data-source discrimination, labeled text being discriminated as 1 and unlabeled text as 0; the discriminative model realizes gradient negative feedback through a gradient reversal layer and completes min-max training;
furthermore, the first vector representation 204 and the second vector representation 205 are taken as input to the text classification model 206; the first vector representation 204 is classified into one of classes 0, 1, ..., K-1 (the class annotated in the labeled text data), and the second vector representation 205 into class K;
subsequently, stochastic gradient descent is applied for adversarial learning until the text information classification model 206 reaches the desired precision. At this point, the first text information classification model, which can be practically applied to classify text information, is obtained.
It should be noted that the embodiment shown in fig. 2 is only an exemplary description of how the text information classification method based on a generative adversarial network trains the generative adversarial network model with labeled and unlabeled text data; it does not limit the implementations or application scenarios of the embodiments of the present application, and implementers may apply any suitable implementation to any suitable application scenario according to their specific needs and conditions.
Further, an embodiment of the present application also provides a text information classification apparatus based on a generative adversarial network. As shown in fig. 3, the apparatus 30 includes: an information acquisition module 301 configured to acquire text information to be classified; a classification determination module 302 configured to obtain the classification to which the text information belongs according to the text information and a first text information classification model, wherein the first text information classification model is based on a generative adversarial network and is trained with labeled and unlabeled text data; and a classification output module 303 configured to output the classification.
According to an embodiment of the present application, the apparatus 30 further comprises: a generative adversarial network model construction module configured to construct a generative adversarial network model; and a first text information classification model training module configured to train the generative adversarial network model with labeled and unlabeled text data to obtain the first text information classification model.
According to an embodiment of the present application, the first text information classification model training module includes: a first vector conversion module configured to perform vector conversion on the labeled text data to obtain a first vector representation; a second vector conversion module configured to perform vector conversion on the unlabeled text data to obtain a second vector representation; a discriminative model learning module configured to input the first and second vector representations into the discriminative model of the generative adversarial network for data-source discrimination, wherein the first vector representation is discriminated as 1 and the second as 0; a text information classification model learning module configured to input the first and second vector representations into the text information classification model of the generative adversarial network for classification, wherein the first vector representation is classified into one of classes 0 to K-1 and the second vector representation into class K, K being a natural number greater than or equal to 2; and an adversarial model training module configured to perform adversarial training on the generative adversarial network to obtain the first text information classification model.
According to an embodiment of the present application, the first and second vector conversion modules are specifically configured to perform vector conversion using a BERT pre-trained model.
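The vector conversion step maps each text to a fixed-size vector. In practice this would mean running the text through a BERT pre-trained model (e.g. taking the [CLS] embedding); the sketch below substitutes a deterministic hash-seeded embedding of the same shape, purely so the surrounding pipeline can be illustrated without the model weights. `DIM` and `text_to_vector` are my own names, not from the patent.

```python
import hashlib

import numpy as np

# Stand-in for BERT vector conversion: a real implementation would take the
# [CLS] embedding from a BERT pre-trained model (dimension 768 for BERT-base).
# Here a deterministic hash-seeded vector of an assumed size DIM stands in.

DIM = 8  # embedding size (assumed; BERT-base would give 768)

def text_to_vector(text: str, dim: int = DIM) -> np.ndarray:
    """Deterministic placeholder for a BERT sentence embedding."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")  # stable across runs
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim)

first_vec = text_to_vector("text with label")    # first vector representation
second_vec = text_to_vector("unlabeled text")    # second vector representation
```

The point of the placeholder is only that the same text always maps to the same vector and different texts map to different vectors, which is all the downstream discriminator and classifier sketches rely on.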
According to an embodiment of the present application, the apparatus 30 further comprises: a discriminant model min-max training module, configured to perform min-max training on the discriminant model, where the training uses a gradient reversal layer (GRL) to realize gradient negative feedback.
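A gradient reversal layer is the identity in the forward pass and multiplies the gradient by a negative factor in the backward pass, so that minimizing the discriminator loss downstream of the layer maximizes it upstream, which realizes the min-max training mentioned above. The minimal sketch below assumes a hand-rolled forward/backward pair and a lambda hyperparameter of my own choosing; a framework implementation would express the same thing as a custom autograd operation.

```python
import numpy as np

# Minimal sketch of a gradient reversal layer (GRL):
# forward pass  : identity
# backward pass : gradient multiplied by -lambda (sign flip)
# lambda_ is an assumed hyperparameter controlling the reversal strength.

class GradientReversal:
    def __init__(self, lambda_: float = 1.0):
        self.lambda_ = lambda_

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # pass features through unchanged

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lambda_ * grad_output  # flip and scale the gradient

grl = GradientReversal(lambda_=0.5)
x = np.array([1.0, -2.0])
y = grl.forward(x)                       # identical to x
g = grl.backward(np.array([0.3, 0.7]))   # reversed gradient
```

Because the sign flip happens only in the backward pass, a single optimizer step simultaneously descends the discriminator's parameters and ascends the feature extractor's parameters, which is the gradient negative feedback the embodiment describes.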
According to a third aspect of the embodiments of the present application, there is provided a computer storage medium comprising a set of computer-executable instructions for performing any one of the above text information classification methods based on a generative adversarial network.
It should be noted here that the above descriptions of the text information classification apparatus based on a generative adversarial network and of the computer storage medium embodiment are similar to the description of the foregoing method embodiments and have similar beneficial effects, and are therefore not repeated. For technical details not disclosed in the descriptions of the apparatus and storage medium embodiments, please refer to the description of the foregoing method embodiments of the present application; for brevity, they are not described again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another device, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be performed by hardware under the control of program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage medium, a read-only memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially implemented or portions thereof that contribute to the prior art may be embodied in the form of a software product stored in a storage medium, and including several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a removable storage medium, a ROM, a magnetic disk, an optical disk, or the like, which can store the program code.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A text information classification method based on a generative adversarial network, the method comprising:
acquiring text information to be classified;
obtaining the classification to which the text information to be classified belongs according to the text information to be classified and a first text information classification model, wherein the first text information classification model is a text information classification model obtained by training a generative adversarial network with labeled text data and unlabeled text data;
and outputting the classification.
2. The method according to claim 1, wherein before the obtaining of the classification to which the text information to be classified belongs according to the text information to be classified and the first text information classification model, the method further comprises:
constructing a generative adversarial network model;
and training the generative adversarial network model with the labeled text data and the unlabeled text data to obtain the first text information classification model.
3. The method of claim 2, wherein training the generative adversarial network model with the labeled text data and the unlabeled text data to obtain the first text information classification model comprises:
performing vector conversion on the text data with the labels to obtain a first vector representation;
carrying out vector conversion on the text data without the label to obtain a second vector representation;
inputting the first vector representation and the second vector representation into a discriminant model of the generative adversarial network model for data source discrimination, wherein the first vector representation is discriminated as 1 and the second vector representation is discriminated as 0;
inputting the first vector representation and the second vector representation into a text information classification model of the generative adversarial network model for classification, wherein the first vector representation is classified into a corresponding one of classes 0 to K-1 and the second vector representation is classified into class K, K being a natural number greater than or equal to 2;
and performing adversarial training on the generative adversarial network to obtain the first text information classification model.
4. The method of claim 3, wherein the performing vector conversion comprises:
performing vector conversion using a BERT pre-trained model.
5. The method of claim 3, further comprising:
and performing min-max training on the discriminant model, wherein the training uses a gradient reversal layer (GRL) to realize gradient negative feedback.
6. The method of claim 5, wherein the discriminant model uses a loss function as follows:
E_{x∼P_g}[f_w(x)] − E_{x∼P_r}[f_w(x)]
where E denotes the mathematical expectation, P_g denotes the distribution of the second vector representation, P_r denotes the distribution of the first vector representation, and f_w denotes the discriminant model.
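The loss in claim 6, E_{x∼P_g}[f_w(x)] − E_{x∼P_r}[f_w(x)], is the Wasserstein-style critic objective: the discriminator's average score on unlabeled representations minus its average score on labeled representations. The numerical sketch below is illustrative only; the linear f_w, its weights, and the sample representations are all assumed, not taken from the patent.

```python
import numpy as np

# Numerical sketch of the discriminant model loss in claim 6:
#   E_{x~Pg}[f_w(x)] - E_{x~Pr}[f_w(x)]
# where Pg samples are the unlabeled (second) representations and Pr samples
# are the labeled (first) representations. A linear f_w is assumed purely
# for illustration.

w = np.array([0.5, -1.0])                 # critic parameters (assumed)

def f_w(X: np.ndarray) -> np.ndarray:
    """Linear critic: one scalar score per row of X."""
    return X @ w

Pg_samples = np.array([[1.0, 0.0], [0.0, 1.0]])  # unlabeled representations
Pr_samples = np.array([[2.0, 0.0], [0.0, 2.0]])  # labeled representations

# Empirical expectations replace the true expectations E[...]
loss = f_w(Pg_samples).mean() - f_w(Pr_samples).mean()
print(loss)  # 0.25
```

Minimizing this quantity in w pushes the critic's scores apart on the two data sources, while the gradient reversal layer of claim 5 makes the upstream feature extractor push them together, giving the min-max game.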
7. The method of any of claims 1 to 6, wherein the unlabeled text data comprises randomly generated Gaussian noise data.
8. An apparatus for classifying text information based on a generative adversarial network, the apparatus comprising:
an information acquisition module, configured to acquire text information to be classified;
a classification determining module, configured to obtain the classification to which the text information to be classified belongs according to the text information to be classified and a first text information classification model, wherein the first text information classification model is a text information classification model obtained by training a generative adversarial network with labeled text data and unlabeled text data;
and a classification output module, configured to output the classification.
9. The apparatus of claim 8, further comprising:
a generative adversarial network model building module, configured to build a generative adversarial network model;
and a first text information classification model training module, configured to train the generative adversarial network model with the labeled text data and the unlabeled text data to obtain the first text information classification model.
10. A computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform the method of any of claims 1 to 7.
CN202110270549.2A 2021-03-12 2021-03-12 Method and device for classifying text information based on GAN and storage medium Pending CN113010675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270549.2A CN113010675A (en) 2021-03-12 2021-03-12 Method and device for classifying text information based on GAN and storage medium


Publications (1)

Publication Number Publication Date
CN113010675A 2021-06-22

Family

ID=76406191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270549.2A Pending CN113010675A (en) 2021-03-12 2021-03-12 Method and device for classifying text information based on GAN and storage medium

Country Status (1)

Country Link
CN (1) CN113010675A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564039A (en) * 2018-04-16 2018-09-21 北京工业大学 A kind of epileptic seizure prediction method generating confrontation network based on semi-supervised deep layer
CN111060318A (en) * 2020-01-09 2020-04-24 山东科技大学 Bearing fault diagnosis method based on deep countermeasure migration network
CN111985243A (en) * 2019-05-23 2020-11-24 中移(苏州)软件技术有限公司 Emotion model training method, emotion analysis device and storage medium

Similar Documents

Publication Publication Date Title
CN112860841B (en) Text emotion analysis method, device, equipment and storage medium
CN110598070B (en) Application type identification method and device, server and storage medium
CN110493262B (en) Classification-improved network attack detection method and system
CN112487149A (en) Text auditing method, model, equipment and storage medium
CN111159417A (en) Method, device and equipment for extracting key information of text content and storage medium
CN113806548A (en) Petition factor extraction method and system based on deep learning model
CN112395858A (en) Multi-knowledge point marking method and system fusing test question data and answer data
CN110334340B (en) Semantic analysis method and device based on rule fusion and readable storage medium
CN111813593B (en) Data processing method, device, server and storage medium
CN110619216A (en) Malicious software detection method and system for adversarial network
US20230359870A1 (en) Digital Information-Theoretic Code From Analog Scanning Technology Using Deep Networks
CN117371049A (en) Machine-generated text detection method and system based on blockchain and generated countermeasure network
CN112560428A (en) Text processing method and device, electronic equipment and storage medium
US12008442B2 (en) Analysing machine-learned classifier models
CN113010675A (en) Method and device for classifying text information based on GAN and storage medium
CN115982037A (en) Software defect prediction method based on abstract syntax tree
CN115936003A (en) Software function point duplicate checking method, device, equipment and medium based on neural network
Abassi et al. Crowd label aggregation under a belief function framework
CN111274374B (en) Data processing method and device, computer storage medium and electronic equipment
CN114595329A (en) Few-sample event extraction system and method for prototype network
CN111159397B (en) Text classification method and device and server
CN110222186A (en) Reduplicated word class question processing method, processing unit, equipment and storage medium
Yadav et al. A General Categorical Framework of Minimal Realization Theory for a Fuzzy Multiset Language
CN117151117B (en) Automatic identification method, device and medium for power grid lightweight unstructured document content
WO2024098282A1 (en) Geometric problem-solving method and apparatus, and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination