CN113918716A - Method and device for constructing a generative adversarial topic model based on spectral norm normalization - Google Patents
Method and device for constructing a generative adversarial topic model based on spectral norm normalization
- Publication number
- CN113918716A (application number CN202111199957.XA)
- Authority
- CN
- China
- Prior art keywords
- topic
- vector
- confrontation
- supervised
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention provides a method and a device for constructing a generative adversarial topic model based on spectral norm normalization, wherein the method comprises the following steps: acquiring a text data set and processing it to obtain the tf-idf vector set of the texts; introducing a spectral norm normalization method into the adversarial training of the generative adversarial topic model to obtain a trained generator, encoder and supervised discriminator with a classification function. After the adversarial training, the generator outputs the topic-word distribution of the text data set and the encoder outputs the document-topic distribution; the topic-word distribution and the document-topic distribution are input into the supervised discriminator, which classifies the texts. The invention constructs a supervised bidirectional generative adversarial network and introduces a spectral norm normalization method so that the topic model can be trained stably; while guaranteeing the adversarial training of the generative adversarial network, the supervised discriminator can effectively extract and exploit the guidance of the label information to improve the quality of the generated topics.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method and a device for constructing a generative adversarial topic model based on spectral norm normalization.
Background
Chinese text is an important carrier of information dissemination on the Internet, and how to efficiently mine valuable content from massive text data is a question well worth studying. Topic models are an important technique for text mining: they can extract latent topics or keywords from large amounts of text data for use in downstream tasks such as sentiment analysis, text classification and question-answering systems. Widely used topic models today include latent Dirichlet allocation, adversarial topic models, bidirectional adversarial topic models, and the like.
Chinese patent publication CN109918510A (publication date 2019-06-21) provides a cross-domain keyword extraction method comprising the following steps: construct a topic-based adversarial neural network; use a topic-based encoder to encode texts from the source domain and the target domain; introduce adversarial learning so that the features learned by the topic-based encoder are independent of the domains, while a bidirectional autoencoder preserves the private features of the target domain; finally, a keyword labeler in the topic-based adversarial neural network combines the output of the topic-based encoder to complete keyword extraction. In the training stage, the parameters of each part of the topic-based adversarial neural network are continuously optimized; in the testing stage, texts from the target domain are input into the trained network to extract keywords.
This method extracts keywords with an adversarial-neural-network topic model, but existing topic models based on adversarial neural networks still suffer from two problems: first, the models are fairly unstable during training; second, they cannot effectively exploit label information.
Disclosure of Invention
To overcome the defects of the prior art, namely the unstable training of topic models and the inability to effectively exploit label information, the invention provides a method and a device for constructing a generative adversarial topic model based on spectral norm normalization.
In order to solve the technical problems, the technical scheme of the invention is as follows:
In a first aspect, the invention provides a method for constructing a generative adversarial topic model based on spectral norm normalization, which comprises the following steps:
S1: acquiring a text data set, preprocessing the text data set, and performing feature vectorization on the texts to obtain a set of text feature vectors, namely a set of tf-idf vectors.
S2: constructing a generative adversarial topic model, wherein the generative adversarial topic model comprises a generator, an encoder and a supervised discriminator.
S3: inputting the tf-idf vectors into the generative adversarial topic model and performing adversarial training on it using a spectral norm normalization method to obtain a trained generative adversarial topic model, wherein the generator outputs the topic-word distribution of the text data set and the encoder outputs the document-topic distribution of the text data set; the document-topic distribution and the topic-word distribution are input into the supervised discriminator with a classification function to classify the texts.
Preferably, the tf-idf vector in step S1 is calculated as follows:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

tf-idf_{i,j} = tf_{i,j} × idf_i

where |D| represents the total number of documents in the text data set, |{j : t_i ∈ d_j}| represents the number of documents containing the word t_i, n_{i,j} represents the number of occurrences of the j-th word in the i-th document, the sum over k runs over the word counts of the document, tf_{i,j} represents the frequency of the j-th word in the i-th document, idf_i represents the logarithm of the ratio of the total number of documents to the number of documents containing the word, and tf-idf_{i,j} represents the resulting term frequency-inverse document frequency.
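As a concrete illustration, the tf-idf computation above can be sketched in a few lines of Python. The toy corpus, the natural-logarithm base and the absence of smoothing are illustrative choices, not prescribed by the patent:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = len(docs)
    # document frequency |{j : t_i in d_j}| for each word t_i
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = sum(counts.values())  # total word count of this document
        vec = {w: (counts[w] / total) * math.log(n_docs / df[w])
               for w in counts}
        vectors.append(vec)
    return vectors

docs = [["topic", "model", "text"], ["text", "data", "text"]]
vecs = tf_idf(docs)
```

Production implementations (for example scikit-learn's `TfidfVectorizer`) usually add smoothing terms and row normalization, so absolute values will differ; the structure of the computation is the same.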
Preferably, the preprocessing of the text data set in step S1 includes word segmentation, stop-word removal, punctuation removal, digit removal, high-frequency word removal and low-frequency word removal.
Preferably, the step S3 specifically includes the following steps:
S3.1: initializing the parameters of the generative adversarial topic model;
S3.2: training the encoder using the tf-idf vectors;
S3.3: training the generator using data sampled from a preset Dirichlet distribution;
S3.4: training the supervised discriminator on the outputs of the encoder and the generator while using a spectral norm normalization method, thereby achieving adversarial training.
Preferably, the step S3.2 specifically includes the following steps:
Preferably, the encoder comprises a linear connection layer, a BatchNorm layer, a LeakyReLU layer, a linear connection layer and a Softmax layer connected in sequence.
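A minimal forward pass matching this layer sequence can be sketched in NumPy. The layer widths, the random weights, and the simplified training-mode BatchNorm (no learned scale or shift) are assumptions for illustration, not the patented architecture's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, K = 50, 32, 10  # vocabulary size, hidden width, topic number (illustrative)

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def batch_norm(z, eps=1e-5):
    # training-mode BatchNorm over the batch axis, without learned scale/shift
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

W1, b1 = rng.normal(0.0, 0.1, (V, H)), np.zeros(H)
W2, b2 = rng.normal(0.0, 0.1, (H, K)), np.zeros(K)

def encoder(x):
    """x: (batch, V) tf-idf rows -> (batch, K) document-topic distributions."""
    h = leaky_relu(batch_norm(x @ W1 + b1))  # Linear -> BatchNorm -> LeakyReLU
    return softmax(h @ W2 + b2)              # Linear -> Softmax

theta = encoder(rng.random((8, V)))
```

The final Softmax guarantees each output row is a valid probability distribution over the K topics, which is what makes the encoder output interpretable as a document-topic distribution.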
Preferably, the step S3.3 specifically includes the following steps:
S3.3.1: sampling a vector obeying the Dirichlet distribution from the Dirichlet prior distribution;
S3.3.2: inputting the Dirichlet-distributed vector into the generator to obtain a generated document vector.
Preferably, the generator comprises a linear connection layer, a BatchNorm layer, a LeakyReLU layer, a linear connection layer and a Softmax layer connected in sequence.
Preferably, the step S3.4 specifically includes the following steps:
s3.4.1: vector the subjectAnd tf-idf vectorAre spliced to obtainWill be provided withAs a true input vector to the supervised discriminator, r denotes that the vector is from a true sample.
S3.4.2: will generate a document vectorAnd topic vectorIs spliced to obtainWill be provided withAs a generating input vector for the supervised discriminator; f denotes that the vector is from the generated samples.
S3.4.3: inputting a true input vector and a generated input vector into a computerSupervising the discriminator to obtain a classification output probabilityAnd supervised discriminator output probability
C represents the number of categories in the text data set, and C belongs to {1,2.. C } and represents the probability that the corresponding sample output by the supervised discriminator is a real sample; if the probability of identifying y-c is the greatest, then the representative supervised discriminator considers the data to be from a true sample.
S3.4.4: the Wasserstein distance is calculated as the total loss function L:
wherein ,is a prior parameter, lambda, of the Dirichlet distribution determined from the dimensions of the subject distribution1 and λ2Is a parameter that sets the importance of training between two distributions for tuning the model, LadvRepresenting the function of the penalty of confrontation, LclsA classification loss function is represented.
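A sketch of a combined loss of this shape, under the common WGAN sign convention for the critic (real samples should receive higher scores) and with the λ1 = 1, λ2 = 5 weights used later in the embodiment, might look like the following; the exact form of L_adv and L_cls in the patent is not fully specified, so this is an assumption:

```python
import numpy as np

def total_loss(d_real, d_fake, class_logits, labels, lam1=1.0, lam2=5.0):
    """Critic-side sketch: Wasserstein adversarial term plus supervised
    classification cross-entropy, combined as L = lam1*L_adv + lam2*L_cls."""
    # adversarial (Wasserstein) term: minimized when real scores exceed fake scores
    l_adv = np.mean(d_fake) - np.mean(d_real)
    # classification term: cross-entropy of the labelled (real) samples
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    l_cls = -np.mean(log_p[np.arange(len(labels)), labels])
    return lam1 * l_adv + lam2 * l_cls

# well-separated critic scores and confident, correct classifications
loss = total_loss(np.array([1.0, 1.0]), np.array([0.0, 0.0]),
                  np.array([[10.0, 0.0], [0.0, 10.0]]), np.array([0, 1]))
```

With the critic correct on both counts, the adversarial term contributes −1 and the classification term is nearly zero, so the total is close to −1; sign conventions flip for the generator/encoder update.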
S3.4.5: and (3) calculating an arithmetic mean of all samples, carrying out back propagation by using an RMSprop method in a gradient descent method to minimize a loss function L, and simultaneously limiting propagation of an overlarge gradient by using a spectral norm normalization method to ensure the stability of training until the model converges.
In a second aspect, the invention further provides a device for constructing a generative adversarial topic model based on spectral norm normalization, which applies the method for constructing a generative adversarial topic model based on spectral norm normalization of any of the above schemes and comprises:
a data acquisition module for acquiring a text data set, preprocessing the text data set, and performing feature vectorization on the texts through a bag-of-words model to obtain a set of text feature vectors, namely a set of tf-idf vectors;
a model building module for building a generative adversarial topic model, wherein the generative adversarial topic model comprises a generator, an encoder and a supervised discriminator;
a training module for inputting the tf-idf vectors into the generative adversarial topic model and performing adversarial training on it using a spectral norm normalization method to obtain a trained generative adversarial topic model, wherein the generator outputs the topic-word distribution of the text data set and the encoder outputs the document-topic distribution of the text data set; the supervised discriminator with a classification function classifies the texts based on the input document-topic distribution and topic-word distribution.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: a supervised bidirectional generative adversarial network is constructed, and a spectral norm normalization method is introduced to alleviate the unstable training and poor convergence of generative adversarial networks, so that the topic model can be trained stably; in addition, while guaranteeing the adversarial training of the generative adversarial network, the supervised discriminator can effectively extract and exploit the guidance of the label information to improve the quality of the generated topics.
Drawings
FIG. 1 is a flow chart of the method for constructing a generative adversarial topic model based on spectral norm normalization.
FIG. 2 is a network structure diagram of the generative adversarial topic model.
FIG. 3 is a network structure diagram of the encoder.
FIG. 4 is a network structure diagram of the generator.
FIG. 5 is a network structure diagram of the supervised discriminator.
FIG. 6 is a schematic diagram of the device for constructing a generative adversarial topic model based on spectral norm normalization.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
the technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
Referring to FIGS. 1 to 4, the present embodiment provides a method for constructing a generative adversarial topic model based on spectral norm normalization, comprising the following steps:
S1: acquire a text data set comprising the AGNews news data set and the DBpedia encyclopedia data set; preprocess the text data set, including word segmentation, stop-word removal, punctuation removal, digit removal, high-frequency word removal and low-frequency word removal; then perform feature vectorization on the texts through a bag-of-words model to obtain a set of text feature vectors, namely a set of tf-idf vectors. The tf-idf vector is calculated as follows:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

tf-idf_{i,j} = tf_{i,j} × idf_i

where |D| represents the total number of documents in the text data set, |{j : t_i ∈ d_j}| represents the number of documents containing the word t_i, n_{i,j} represents the number of occurrences of the j-th word in the i-th document, the sum over k runs over the word counts of the document, tf_{i,j} represents the frequency of the j-th word in the i-th document, idf_i represents the logarithm of the ratio of the total number of documents to the number of documents containing the word, and tf-idf_{i,j} represents the resulting term frequency-inverse document frequency.
S2: constructing a generating confrontation topic model, wherein the generating confrontation topic model comprises a generator, an encoder and a supervised discriminator;
s3: inputting the tf-idf vector into the generated confrontation topic model and performing confrontation training on the generated confrontation topic model by using a spectral norm normalization method to obtain a trained generated confrontation topic model; wherein the generator outputs a topic-word distribution in the text dataset; the encoder outputs a document-topic distribution in the text dataset, specifically comprising the steps of:
s3.1: initializing and generating parameters of a confrontation topic model; the parameters to be initialized comprise a priori parameter alpha of Dirichlet distribution, and preset topic number k, tf-idf vectorVector sampled from Dirichlet distributionn is the step ratio of the training encoder to the generator, the size m in the training process. In this embodiment, let α be 10-4,n=5,m=512,λ1=1,λ2=5。
S3.2: as shown in fig. 3, the encoder is trained using tf-idf vectors; the method specifically comprises the following steps:
S3.2.2: vector tf-idfThe ChochNormal layer, the LeakyReLU layer, the linear connection layer and the Softmax layer of the encoder are introduced to obtain a theme vector
S3.3: training a generator by using preset Dirichlet distribution sampling data; the method specifically comprises the following steps:
s3.3.1: dirichlet distribution-compliant vectors sampled from a Dirichlet distribution prior distribution
S3.3.2: vectors that would obey the Dirichlet distribution, as shown in FIG. 4The linear connection layer, the BatchNormal layer, the LeakyReLU layer, the linear connection layer and the Softmax layer of the generator are subjected to Yu to obtain a generated document vector
S3.4: the output of the encoder and the generator is utilized, and a spectral norm normalization method is used for training the supervised discriminator at the same time, so that the aim of confrontation training is fulfilled; the method specifically comprises the following steps:
s3.4.1: vector the subjectAnd tf-idf vectorIs spliced to obtainWill be provided withFight of steps on the eastern side of the hall where the host stood to welcome the guests is the true input vector to the supervised discriminator, r indicates that the vector is from a true sample.
S3.4.2: will generate a document vectorAnd topic vectorIs spliced to obtainWill be provided withFight of steps on the eastern side of the hall where the host stood to welcome the guests is the input vector generated by the supervised discriminator, f indicates that the vector is from the generated samples.
S3.4.3: inputting the true input vector and the generated input vector into a supervised discriminator to obtain a classification output probabilityAnd supervised discriminator output probability
C represents the number of categories in the text data set, and C belongs to {1,2.. C } and represents the probability that the corresponding sample output by the supervised discriminator is a real sample; if the probability of identifying y-c is the greatest, then the representative supervised discriminator considers the data to be from a true sample.
S3.4.4: the Wasserstein distance is calculated as the total loss function L:
wherein ,is a prior parameter, lambda, of the Dirichlet distribution determined from the dimensions of the subject distribution1 and λ2Is a parameter that sets the importance of training between two distributions for tuning the model, LadvRepresenting the function of the penalty of confrontation, LclsA classification loss function is represented.
S3.4.5: an arithmetic mean is calculated for all samples, a RMSprop method in a gradient descent method is used for back propagation to minimize a loss function L, meanwhile, a spectrum norm normalization method is used, a weight matrix is normalized in the training process, namely divided by the spectrum norm of the weight matrix, and propagation of overlarge gradients is limited to ensure the stability of training until the model converges.
The multiplication of the matrix is a linear mapping, for which it is K-Lipschitz (K-Lipschitz continuous) over the entire domain of definition if it is K-Lipschitz at zero; k is a Lipschitz constant, in the embodiment, K is required to be ensured to be 1, so that the parameter matrix A of the supervised discriminator meets 1-Lipschitz continuity, propagation of overlarge gradient is limited, and the stability of training is ensured by ensuring the stability of the parameter change of the matrix A.
For the parameter matrix A of the supervised discriminator, the principle of spectral norm normalization is as follows:
Let the eigenvectors of the matrix A form a vector basis {v_n} with corresponding eigenvalues {λ_n}; any vector x can then be written as x = Σ_n c_n v_n, so that A·x = Σ_n c_n λ_n v_n.
If λ_1 is the eigenvalue of largest magnitude, then ||A·x|| ≤ |λ_1|·||x||, with equality when x is the corresponding eigenvector; |λ_1| is the spectral norm σ(A) of the matrix A. To make the matrix A satisfy 1-Lipschitz continuity, it therefore suffices to divide all elements of A simultaneously by σ(A).
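In general σ(A) is the largest singular value of A, and practical spectral-normalization layers estimate it with a few steps of power iteration rather than a full decomposition. A sketch, with an illustrative 2×2 matrix:

```python
import numpy as np

def power_iteration_sigma(A, n_iter=100):
    """Estimate sigma(A), the largest singular value of A, by power iteration."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=A.shape[0])
    for _ in range(n_iter):
        v = A.T @ u          # pull back through A
        v /= np.linalg.norm(v)
        u = A @ v            # push forward through A
        u /= np.linalg.norm(u)
    return float(u @ A @ v)  # converged: u^T A v = sigma(A)

A = np.array([[3.0, 0.0], [4.0, 5.0]])
sigma = power_iteration_sigma(A)
```

Implementations such as the spectral normalization of Miyato et al. run only one power-iteration step per training update and carry the u, v vectors across steps, since the weights change little between updates; the many-iteration version here simply makes the estimate exact enough to check.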
S4: and inputting the document-subject distribution and the subject-word distribution into a supervised discriminator with a classification function to classify the text.
In this embodiment, the fully trained generator and encoder obtained after training can accept document input in the form of tf-idf vectors and output the topic distributions corresponding to the documents, i.e., the document-topic distributions. If the identity matrix is input into the generator, the generator outputs a k × v matrix, where k represents the number of topics and v represents the vocabulary size; this matrix is also referred to as the topic-word matrix, and it is typically used to extract the word distribution of each topic and to measure the quality of the topic model through the relevance of the topic words within each topic.
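The identity-matrix trick just described can be sketched with a toy generator; the single softmax layer below stands in for the full Linear/BatchNorm/LeakyReLU/Linear/Softmax stack, and all dimensions and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
K, V = 10, 50  # topic number k and vocabulary size v (illustrative)
W, b = rng.normal(0.0, 0.1, (K, V)), np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def generator(theta):
    """theta: (n, K) topic vectors -> (n, V) word distributions."""
    return softmax(theta @ W + b)

# feeding the identity matrix gives one one-hot topic per row,
# so the output is the k x v topic-word matrix
topic_word = generator(np.eye(K))
top_words = np.argsort(-topic_word, axis=1)[:, :10]  # top-10 word ids per topic
```

Row i of `topic_word` is the word distribution of topic i, and the `top_words` indices are what one would map back through the vocabulary to read off the topic's representative words.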
By adopting the method and the device for constructing a generative adversarial topic model based on spectral norm normalization, a supervised bidirectional generative adversarial network is constructed, and a spectral norm normalization method is introduced to alleviate the unstable training and poor convergence of generative adversarial networks, so that the topic model can be trained stably; in addition, while guaranteeing the adversarial training of the generative adversarial network, the supervised discriminator can effectively extract and exploit the guidance of the label information to improve the quality of the generated topics. Because external label information is introduced, the topic words generated by this embodiment can be applied more effectively to downstream tasks and have greater application value. Compared with the prior art, the method is convenient to train, fast to train, effective and highly extensible, and beyond topic modeling it can be widely applied to fields such as sentiment analysis, text classification, text clustering, question-answering systems and machine translation.
Example 2
Referring to FIG. 6, the present embodiment provides a device for constructing a generative adversarial topic model based on spectral norm normalization, which applies the method for constructing a generative adversarial topic model based on spectral norm normalization provided in embodiment 1 and comprises: a data acquisition module, a model building module, a training module and a test module.
In a specific implementation, the data acquisition module acquires a text data set, preprocesses it, and performs feature vectorization on the texts through a bag-of-words model to obtain a set of text feature vectors, namely a set of tf-idf vectors; the model building module builds a generative adversarial topic model comprising a generator, an encoder and a supervised discriminator; the training module inputs the tf-idf vectors into the generative adversarial topic model and performs adversarial training on it using a spectral norm normalization method to obtain a trained generative adversarial topic model, in which the generator outputs the topic-word distribution of the text data set and the encoder outputs the document-topic distribution; the test module inputs the document-topic distribution and the topic-word distribution into the supervised discriminator with a classification function to classify the texts.
The terms describing positional relationships in the drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.
Claims (10)
1. A method for constructing a generative adversarial topic model based on spectral norm normalization, characterized by comprising the following steps:
S1: acquiring a text data set, preprocessing the text data set, and performing feature vectorization on the texts to obtain a set of text feature vectors, namely a set of tf-idf vectors;
S2: constructing a generative adversarial topic model, wherein the generative adversarial topic model comprises a generator, an encoder and a supervised discriminator;
S3: inputting the tf-idf vectors into the generative adversarial topic model and performing adversarial training on it using a spectral norm normalization method to obtain a trained generative adversarial topic model, wherein the generator outputs the topic-word distribution of the text data set, the encoder outputs the document-topic distribution of the text data set, and the supervised discriminator with a classification function classifies the texts based on the input document-topic distribution and topic-word distribution.
2. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 1, characterized in that the tf-idf vector in step S1 is calculated as follows:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

tf-idf_{i,j} = tf_{i,j} × idf_i

where |D| represents the total number of documents in the text data set, |{j : t_i ∈ d_j}| represents the number of documents containing the word t_i, n_{i,j} represents the number of occurrences of the j-th word in the i-th document, the sum over k runs over the word counts of the document, tf_{i,j} represents the frequency of the j-th word in the i-th document, idf_i represents the logarithm of the ratio of the total number of documents to the number of documents containing the word, and tf-idf_{i,j} represents the resulting term frequency-inverse document frequency.
3. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 1, characterized in that the preprocessing performed on the text data set in step S1 includes word segmentation, stop-word removal, punctuation removal, digit removal, high-frequency word removal and low-frequency word removal.
4. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 1, characterized in that the step S3 specifically comprises the following steps:
S3.1: initializing the parameters of the generative adversarial topic model;
S3.2: training the encoder using the tf-idf vectors;
S3.3: training the generator using data sampled from a preset Dirichlet distribution;
S3.4: training the supervised discriminator on the outputs of the encoder and the generator while using a spectral norm normalization method, thereby achieving adversarial training.
5. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 4, characterized in that the step S3.2 specifically comprises the following steps:
6. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 4, characterized in that the encoder comprises a linear connection layer, a BatchNorm layer, a LeakyReLU layer, a linear connection layer and a Softmax layer connected in sequence.
7. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 4, characterized in that the step S3.3 specifically comprises the following steps:
S3.3.1: sampling a vector obeying the Dirichlet distribution from the Dirichlet prior distribution;
8. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 7, characterized in that the generator comprises a linear connection layer, a BatchNorm layer, a LeakyReLU layer, a linear connection layer and a Softmax layer connected in sequence.
9. The method for constructing a generative adversarial topic model based on spectral norm normalization according to claim 7, characterized in that the step S3.4 specifically comprises the following steps:
S3.4.1: concatenating the topic vector with the tf-idf vector to obtain the real input vector of the supervised discriminator, where r denotes that the vector comes from a real sample;
S3.4.2: concatenating the generated document vector with the sampled topic vector to obtain the generated input vector of the supervised discriminator, where f denotes that the vector comes from a generated sample;
S3.4.3: inputting the real input vector and the generated input vector into the supervised discriminator to obtain a classification output probability and a supervised-discriminator output probability;
where C represents the number of categories in the text data set and c ∈ {1, 2, ..., C}; the supervised-discriminator output probability represents the probability that the corresponding sample is a real sample, and if the probability of the classification output y = c is the largest, the sample is assigned to category c;
s3.4.4: the Wasserstein distance is calculated as the total loss function L:
L=λ1Ladv+λ2Lcls
wherein ,is a prior parameter, lambda, of the Dirichlet distribution determined from the dimensions of the subject distribution1 and λ2Is a parameter that sets the importance of training between two distributions for tuning the model, LadvRepresenting the function of the penalty of confrontation, LclsRepresenting a classification loss function;
s3.4.5: and (3) calculating an arithmetic mean of all samples, carrying out back propagation by using an RMSprop method in a gradient descent method to minimize a loss function L, and simultaneously limiting propagation of an overlarge gradient by using a spectral norm normalization method to ensure the stability of training until the model converges.
10. A device for constructing a generative adversarial topic model based on spectral norm normalization, characterized by comprising:
a data acquisition module, configured to acquire a text data set, preprocess it, and then vectorize the text features through a bag-of-words model to obtain a set of text feature vectors, namely a set of tf-idf vectors;
a model building module, configured to build a generative adversarial model comprising a generator, an encoder and a supervised discriminator;
a training module, configured to input the tf-idf vectors into the generative adversarial topic model and train it adversarially with the spectral norm normalization method to obtain a trained generative adversarial topic model;
a test module, configured to test the generative adversarial topic model trained by the training module, wherein the generator outputs the topic-word distribution of the text data set, the encoder outputs the document-topic distribution of the text data set, and the document-topic distribution and the topic-word distribution are input into the supervised discriminator, which has a classification function, to classify the text.
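The data acquisition module's bag-of-words/tf-idf step can be sketched in pure Python. The toy corpus and the particular weighting variant (raw term count × log(N/df)) are assumptions, since the claims do not fix a tokenization scheme or a tf-idf formula:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into tf-idf rows over a shared vocabulary.
    Weighting: term frequency x log(N / document frequency) -- one common
    variant; the patent does not specify which is used."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    rows = []
    for d in docs:
        tf = Counter(d)
        row = [0.0] * len(vocab)
        for term, count in tf.items():
            row[index[term]] = count * math.log(n / df[term])
        rows.append(row)
    return vocab, rows

# A hypothetical three-document corpus, already tokenized.
docs = [["topic", "model", "text"],
        ["adversarial", "model", "model"],
        ["text", "classification"]]
vocab, x = tfidf_vectors(docs)
```

Note that a term appearing in every document gets idf log(N/N) = 0 under this variant, i.e. it is discarded as uninformative; terms appearing in one document get the highest weight.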
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111199957.XA CN113918716B (en) | 2021-10-14 | 2021-10-14 | Method and device for constructing generated countermeasure topic model based on spectrum norm normalization |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113918716A true CN113918716A (en) | 2022-01-11 |
CN113918716B CN113918716B (en) | 2023-06-02 |
Family
ID=79240433
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111199957.XA Active CN113918716B (en) | 2021-10-14 | 2021-10-14 | Method and device for constructing generated countermeasure topic model based on spectrum norm normalization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113918716B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110084121A (en) * | 2019-03-27 | 2019-08-02 | 南京邮电大学 | Implementation method based on the human face expression migration for composing normalized circulation production confrontation network |
CN110827213A (en) * | 2019-10-11 | 2020-02-21 | 西安工程大学 | Super-resolution image restoration method based on generation type countermeasure network |
CN111553587A (en) * | 2020-04-26 | 2020-08-18 | 中国电力科学研究院有限公司 | New energy scene generation method and system based on confrontation learning model |
CN112597769A (en) * | 2020-12-15 | 2021-04-02 | 中山大学 | Short text topic identification method based on Dirichlet variational self-encoder |
CN112884856A (en) * | 2021-01-25 | 2021-06-01 | 浙江师范大学 | Text image generation method for generating confrontation network based on spectrum normalization hierarchical level |
Non-Patent Citations (2)
Title |
---|
TAKERU MIYATO et al.: "Spectral normalization for generative adversarial networks", pages 1 - 26 *
尹春勇; 章荪: "An end-to-end adversarial variational Bayes method for short-text sentiment classification", no. 09, pages 64 - 70 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116503671A (en) * | 2023-06-25 | 2023-07-28 | 电子科技大学 | Image classification method based on residual network compression of effective rank tensor approximation |
CN116503671B (en) * | 2023-06-25 | 2023-08-29 | 电子科技大学 | Image classification method based on residual network compression of effective rank tensor approximation |
Also Published As
Publication number | Publication date |
---|---|
CN113918716B (en) | 2023-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Rani et al. | Deep learning based sentiment analysis using convolution neural network | |
Kumar et al. | Aspect-based sentiment analysis using deep networks and stochastic optimization | |
CN108537257B (en) | Zero sample image classification method based on discriminant dictionary matrix pair | |
CN112836051B (en) | Online self-learning court electronic file text classification method | |
Huang et al. | Siamese network-based supervised topic modeling | |
CN113392179A (en) | Text labeling method and device, electronic equipment and storage medium | |
Ahmed et al. | Natural language processing and machine learning based cyberbullying detection for Bangla and Romanized Bangla texts | |
Bikku et al. | Deep learning approaches for classifying data: a review | |
Li et al. | Semi-supervised learning for text classification by layer partitioning | |
Qi et al. | Transfer learning of distance metrics by cross-domain metric sampling across heterogeneous spaces | |
CN117494051A (en) | Classification processing method, model training method and related device | |
CN115017879A (en) | Text comparison method, computer device and computer storage medium | |
Ghosh | Sentiment analysis of IMDb movie reviews: A comparative study on performance of hyperparameter-tuned classification algorithms | |
Jing et al. | BERT for aviation text classification | |
CN113918716A (en) | Method and device for constructing generation confrontation topic model based on spectrum norm normalization | |
Meng et al. | Unsupervised word embedding learning by incorporating local and global contexts | |
CN112528653A (en) | Short text entity identification method and system | |
CN110198291B (en) | Webpage backdoor detection method, device, terminal and storage medium | |
Nazarizadeh et al. | Using Group Deep Learning and Data Augmentation in Persian Sentiment Analysis | |
Noor et al. | Optimization of Sentiment Analysis of Program Sembako (BPNT) Based on Twitter | |
Lei et al. | Hierarchical recurrent and convolutional neural network based on attention for Chinese document classification | |
Kurt et al. | Web page classification with deep learning methods | |
Singh et al. | Twitter data in Emotional Analysis-A study | |
Zheng et al. | A dictionary-based convolutional recurrent neural network model for sentiment analysis | |
Nagesh et al. | An exploration of three lightly-supervised representation learning approaches for named entity classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||