CN107491417B

CN107491417B - Document generation method based on specific division under topic model

Info

Publication number: CN107491417B
Application number: CN201710548431.5A
Authority: CN
Inventors: 周凯文; 杨智慧; 马会心; 何震瀛; 荆一楠; 王晓阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2017-07-06
Filing date: 2017-07-06
Publication date: 2021-06-22
Anticipated expiration: 2037-07-06
Also published as: CN107491417A

Abstract

The invention belongs to the technical field of data mining, and particularly relates to a document generation method based on a specially divided topic model. The invention adds the concept of the subset according to a given text database dividing mode, for example, for some text databases, such as news databases, the topic distribution of texts in a certain time segment has certain similarity, especially for the texts of different news channels reporting the same event, the database can be divided into the subsets by utilizing the attribute of the time segment. Thus, the present invention proposes a new topic model (DbLDA) on a text database; in the DbLDA, the specific steps of the generation of each document are: generating a theme matrix; generating a topic distribution for a subset: generating a theme distribution for the articles in the subset; for each word, a topic is selected, and a word is selected. Can be applied to a text database with structured attributes.

Description

Document generation method based on specific division under topic model

Technical Field

The invention belongs to the technical field of data mining, and particularly relates to a document generation method based on a specially divided topic model, which is applied to a text database with structured attributes.

Background

The method is widely applied to the field of data mining nowadays by utilizing a topic model to process and analyze text data, wherein LDA (Latent Dirichlet Allocation) is widely concerned as a simple and easy-to-use topic model. LDA assumes that each text originates from a separate generation process and thus ignores the connections between texts, which may however lead to a degradation of the model effect.

A large amount of text data contains not only structured attributes such as time, place, etc., but also unstructured text content attributes. The text data can be organized into a number of subsets according to these structured attributes, stored in a text database, which forms a particular partition of the text database. Based on the observation that the texts classified into the same subset have some commonality, i.e. there is a link between the texts. For example, in the case of a news database, news data at the same time or the same place may be focused on important events such as the spread of a certain virus or the movement track of a certain typhoon. According to this phenomenon, the entire news database may be divided into some subsets having commonality according to their time or location attributes. In order to analyze the document set which can be subjected to subset division, the invention designs a new theme model based on LDA from the perspective of generating the model, and the aim is to fully utilize the commonality in each subset aiming at the specific division of the text database so as to obtain a theme model with better effect.

Disclosure of Invention

The invention aims to provide a document generation method based on a theme model of specific division, which has better effect.

The invention relates to a new model constructed according to an LDA model. Text generation process in LDA: firstly, a distribution is generated from Dirichlet distribution as the theme distribution of an article, the theme of the article is generated from the theme distribution, and then a word is generated from the corresponding theme, so that the word in the document is obtainedWord of^[3]The distribution of the topics of each article is independent.

The topic model based on specific division, which is proposed by the present invention, is designed according to LDA, and is applied to a text database with structured attributes, such as a news database with time or place tags (i.e. the text database can be divided according to these structured attributes). The invention considers the concept of adding subsets according to a given text database dividing mode, for example, for some text databases, such as news databases, the topic distribution of texts in a certain time segment has certain similarity, especially for the texts of different news channels reporting the same event, the database can be divided into subsets by using the attribute of the time segment. Therefore, the invention provides a new topic model (LDA over Text Database, implicit Dirichlet distribution on the Text Database) on the Text Database, which is recorded as DbLDA.

In the DbLDA, the specific steps of the generation method of each document are as follows:

step (1), generating a theme matrix: phi is a_k～Dir(β)；

Step (2), generating theme distribution for a subset:

and (3) generating theme distribution for the articles in the subset:

step (4) for each word

(a) Selecting one theme: z is a radical of_s,d,n～Mult(π(θ′_s,d))；

(b) Selecting a word: w is a_s,d,n|z_s,d,n～Mult(φ_k)。

Wherein the content of the first and second substances,

is a mapping from a multi-term distribution parameter vector to a natural vector:

c is a constant, so that each multi-term distribution parameter vector corresponds to a natural parameter vector family; pi is the mapping from the natural parameter vector back to the multinomial distribution parameter vector,

the probability distribution of the random variables to each other in the graph model of fig. 1 corresponds to this generation process.

The key to the effectiveness of a topic model is whether it truly reflects the true distribution of the text data, i.e., whether the topic distribution has a corresponding physical meaning. From the perspective of the generation process, step (2) is equivalent to assigning a main topic distribution (or average topic distribution) to each subset, and step (3) adds gaussian noise to the average topic distribution to generate a topic distribution for each text in the subset, which is equivalent to that the topic distribution of each article in the subset surrounds a common main topic distribution. It is hoped that the commonality of the texts in the subsets is expressed by the modeling mode, so that the model parameters (namely, the topic, the multinomial distribution on the vocabulary) deduced by the model can reflect the characteristics of the text data more truly, and better model performance is obtained. Gaussian distributed Σ (gaussian distributed covariance matrix) can be considered as the subject distribution density in the subset.

Compared with LDA, the method adds a probability path when generating the theme distribution, namely, one layer of logic Gaussian distribution is added. In the field of data analysis, logical gaussian distributions are often viewed as "gaussian distributions over simplices". We use this distribution to model commonalities among the subsets.

As shown in fig. 2, the text topic distribution in the same subset on the simplex of the topic distribution of the DbLDA model is more concentrated, and there is similarity in topic distribution when corresponding to a real text database, for example, several news texts describing the same event at the same time.

With new topic models, a key task is to infer the model. The topic model is a class of generative models that generate documents, such as LDA (latent dirichlet distribution), from a designed graph model. The graph model comprises a plurality of hidden variables and known variables, and the main work of analyzing the graph model is to estimate the values of the hidden variables through the known variables, namely to solve the posterior distribution of the hidden variables in the model. However, most topic models have high complexity, and the posterior distribution of hidden variables in the model generally does not have a closed-form expression, so that the posterior distribution of the model needs to be approximately calculated, namely the model inference problem. The method for approximating inference is various, and the invention selects a method of using shrinkage variational Bayes to approximate the model, wherein the shrinkage means that some hidden variables in the model are removed from the posterior distribution by means of marginal integration. The reasons are two reasons: the DbLDA model is complex, not all hidden variables can be contracted, and the distribution of the required samples is more, so that the convergence of the sampling method is possibly too slow, and the convergence is difficult to judge; the calculation amount required by each iteration of DbLDA is large, the convergence is fast by using a variational method, and the program performance can be obviously improved.

For the variational method, the variational method transforms the problem into a maximization problem by the following formula, since the true posterior is not available. The logarithm of probability of a fact (text) is equal to the KL divergence (Kullback-Leibler divergence, also known as relative entropy) plus the true Lower Bound (ELBO), so minimizing the KL divergence is done by maximizing the ELBO.

Based on the idea of systolic variational bayes, we model explicitly the dependencies between hidden variables, but since θ' in DbLDA (the distribution of topics for an article in a subset) is difficult to remove by marginal integration, we only show modeling θ,

(mean topic distribution and topic matrix of subsets, respectively). This approach is referred to as "partial contraction variational bayes". Thus, the variation posterior distribution has the following form:

in which θ' follows a variational Gaussian distribution

z obeys variational multinomial distributions

Thus, ELBO becomes:

wherein the content of the first and second substances,

is the entropy of the variation distribution.

First, about

Maximizing ELBO. Since there is no limit to the two variation distributions, the maximum value is

And (4) obtaining, namely, obtaining when the variational posterior is equal to the real posterior. After simplification, ELBO becomes:

then, ELBO is developed according to a graph model of DbLDA:

due to the non-conjugation of dirichlet distribution and logical gaussian distribution, it is difficult to directly compute in ELBO:

in order to simplify the calculation and meet the purpose of modeling, each dimension of the K-dimensional random variable theta' is assigned to obey independent unary Gaussian distribution, namely, the covariance matrix is defined as a diagonal matrix

Meanwhile, a mapping from the polynomial distribution parameter vector to the natural parameter vector is selected as c 1. By simplification, the above equation can be calculated.

Wherein D is_sIs the amount of text in the subset s.

Referring to the inference method of Gaussian random variables in CTM (Correlated Topic Model), since the variation of the logarithmic normalization factor of theta' is expected to be difficult to calculate, in order to maintain the property of the lower bound of ELBO, the invention uses Taylor expansion to find an upper bound for the logarithmic normalization factor of theta^[1]：

Then, with respect to the variation parameter

Maximizing ELBO, the loop is updated in turn about each variation parameter. Specifically, the method comprises the following steps:

first, about

Maximizing ELBO.

Simplifying the above formula by eliminating terms appearing in the denominator of the numerator in the above formula yields:

however, the computation cost of the variation expectation term in the above formula is too high, so the variation expectation is estimated by adopting a Gaussian approximation method in the original paper of a shrinkage variation Bayesian inference method paper, but only 0-order Taylor expansion is kept as further approximation in the approximation,

to improve the computational performance (zero taylor expansion 1).

Wherein the content of the first and second substances,

namely, it is

The expectation is that.

Using the above-mentioned pair

By approximating the expression (c), we can obtain

The update equation of (1).

Second, ELBO is maximized with respect to ζ.

For derivatives of the zeta-related term in ELBO and making the derivative zero, the zeta update equation can be obtained:

the third step, about

Maximizing ELBO.

ELBO about

Taking the derivative as:

there is no analytical solution to zero this derivative. For this purpose, Newton's method is used to solve

To the maximization of (a).

Finally, as to

Maximize ELBO with the constraint of

As above, this maximization problem also does not have an analytical solution, and is therefore solved using newton's method. ELBO about

Taking the derivative of

And (4) combining the conclusions, and updating the variation parameters in turn in each iteration to obtain a coordinated ascending algorithm of the ELBO.

Drawings

FIG. 1 is a DbLDA graphical model.

FIG. 2 shows a sample of the simplex distribution of DbLDA topics (3 topics, 2 subsets, 1000 articles, a red dot corresponding to the topic distribution of an article, a blue dot for the main topic distribution of each subset, and a pink triangle for the simplex distribution of topics)

Fig. 3 shows DbLDA/CVB0_ LDA (implicit dirichlet distribution inferred using systolic variational bayes)/CGS _ LDA (implicit dirichlet distribution inferred using gibbs sampling) experimental test results (α ═ 1.01, β ═ 0.01, K ═ 50 for DbLDA, given ═ 1.0) for Predictive Perplexity (descriptive prediction capability) predicted on the one month news data (3942 news texts, 16379 words) of the road agency, while for DbLDA, given ═ 1.0)

Fig. 4 shows experimental results of predicted chaos for DbLDA/CVB0_ LDA under different settings (different sizes of text database, 1, 2, 6 months, and different sub-divisions, 15 days news for a sub-set and 30 days news for a sub-set).

Detailed Description

The method comprises the steps of dividing the test of the DbLDA subject model into an LDA comparison experiment and a model experiment under different parameters, wherein the comparison method comprises the steps of calculating the prediction chaos and the running time under different models, and the different parameters are different subset lengths or different text database sizes. The text prediction capabilities of the DbLDA topic model are intended to be tested in various ways.

(1) Experimental Environment and data set introduction

The experimental programs are all run in an Ubuntu 16.04 environment, the experimental machines CPU is i5-3470, the memory 12GB, and the experiments are all performed using codes written in Java (8), and the following DbLDA and LDA experiments are all performed under the conditions that the given parameters α is 1.01, β is 0.01, and K is 50, while the given parameters Σ is 1.0 for DbLDA.

And collecting English news data on the Rough agency as test language material for testing the performance of the DbLDA. The division basis of the news text subsets is time, and the minimum division unit is day. The linguistic data is subjected to word segmentation, punctuation removal, root word extraction and other processing, and all words with one occurrence frequency are removed. Through processing, the news data of the road transparent society in six months (one month to six months, 22723 news texts, 36639 words) are obtained. The test text set is 10% of the training text set.

(2) Comparative experiment with LDA

The measure of the comparison experiment is prediction chaos, a standard often used for measuring the prediction capability of the language model, and the lower prediction chaos represents that the model has better prediction capability.

For the DbLDA, the test set text generation probability calculation mode is as follows (namely, using the variational posterior as the true posterior and using the value of the parameter in the expected estimation model):

the subject model compared to DbLDA is an implicit dirichlet distribution using a systolic variational bayesian inference (CVB) (while approximating the desired variational, again using a zero-order taylor expansion approximation, denoted CVB 0). As the theme distribution of each text in the test text set needs to be obtained, phi is obtained by training in the training text set, and then for each article in the test text set, the theta of each article is obtained by training 50% of the first text. The algorithm for training the test text set is consistent with the previous method except that phi is a fixed value and is fixed as a result obtained in the training text set. And obtaining the theme distribution of the test text set, namely substituting the expression into the last 50% of texts of each test text, and calculating the prediction chaos.

In the comparative experiment, data of one month (april, 3942 news texts, 16379 words) in the data set of the ro-moto was taken as a data set, and for DbLDA, it was set that each subset included 7 days of news texts. In the experiment, each section of program is iterated for 500 times, phi of the training text set is estimated after each iteration, and the prediction chaos is calculated according to the method. It should be noted that the implicit Dirichlet distribution inferred by using the contracted Gibbs sampling is added in the experiment for comparison, so as to illustrate that different model inference methods can cause different model test results.

The results of the experiment are shown in FIG. 3. From experimental results, the prediction capability of the DbLDA on the test text set is stronger than that of the hidden Dirichlet distribution deduced by using the contracted Gibbs sampling, and a better model is not obtained in 500 iterations due to slower convergence of the hidden Dirichlet distribution by using the contracted Gibbs sampling, which also shows the influence of different approximate deduction modes on the performance of the topic model. The convergence speed of the DbLDA is slower than that of the implicit Dirichlet distribution deduced by using the contracted Gibbs sampling, and meanwhile, the time required for reaching the same prediction chaos is longer than that of the implicit Dirichlet distribution by using the contracted Gibbs sampling, because the number of variables needing to be updated by iteration of the DbLDA is more than that of the implicit Dirichlet distribution, and because some variables need to be updated by using a Newton method, 5-10 times of extra iteration is added in each iteration to obtain the calculation amount of an approximate solution.

Table 2 shows the time efficiency comparison of DbLDA and implicit dirichlet distribution inferred using contracted gibbs sampling, recording the time required for both algorithms to reach the same one of the chaos thresholds, the data set tested and the parameter settings as above.

TABLE 2 comparison of iterative time tests (time required to reach the same Perplexity in seconds)

Claims

1. A document generation method based on a specially divided topic model is characterized in that the topic model is hidden Dirichlet distribution on a text database and is marked as DbLDA; in DbLDA, the specific steps for each document generation are as follows:

step (1), generating a theme matrix: phi is a_k～Dir(β)；

Step (2), generating theme distribution for a subset:

and (3) generating theme distribution for the articles in the subset:

step (4) for each word

(a) Selecting one theme: z is a radical of_sdn～Mult(π(θ′_sd))；

(b) Selecting a word: w is a_sdn|z_sdn～Mult(φ_k)；

Wherein the content of the first and second substances,

each multinomial distribution parameter vector corresponds to a natural parameter vector family, and c is a constant; pi is the mapping from the natural parameter vector back to the multinomial distribution parameter vector,

wherein the parameters and symbols used are as follows:

s represents the number of subsets;

α represents a hyper-parameter of Dirichlet priors of the distribution of topics in the subset;

beta represents a hyper-parameter of Dirichlet prior of the distribution of each word frequency in each topic;

∑_srepresenting the distribution density of the topics in the subset s;

θ_srepresenting the mean topic distribution of the subset s;

θ′_sda topic distribution representing the d text in the subset s;

phi denotes a topic matrix;

z_sdna topic representing the nth word of the d text in the subset s;

w_sdnthe nth word representing the mth text in the subset s.

2. The method for generating document under topic model based on specific partition according to claim 1, wherein the topic model is approximated by using a method of contraction variational Bayes; the contraction is to remove some hidden variables in the subject model from the posterior distribution by means of marginal integration;

for the variational bayes method, the probability logarithm of a fact is equal to the KL divergence plus the lower bound of the fact, denoted as ELBO, so minimizing the KL divergence is obtained by maximizing ELBO:

since the topic distribution θ' of an article in a subset of the DbLDA is difficult to remove by marginal integration, only the average topic distribution θ and topic matrix for the modeled subset are displayed

Of (c); this practice is called partial shrinkage variational bayes; thus, the variation posterior distribution has the following form:

in which θ' follows a variational Gaussian distribution

z obeys variational multinomial distributions

Thus, ELBO becomes:

wherein the content of the first and second substances,

is the entropy of the variation distribution.

3. The method for generating documents under the topic model based on specific partitions of claim 2, wherein the specific steps of maximizing ELBO are as follows:

first, about

And

maximizing ELBO;

since there is no limit to the two variation distributions, the maximum value is

Obtaining the variation posterior, namely the variation posterior is equal to the real posterior; after simplification, ELBO becomes:

then, ELBO is developed according to a graph model of DbLDA:

to simplify the calculation, each dimension of θ' is assigned to follow an independent unary Gaussian distribution, i.e., the covariance matrix is defined as a diagonal matrix

Meanwhile, selecting a mapping from the multinomial distribution parameter vector to the natural parameter vector as c being 1; through simplification, the above formula can be calculated;

wherein D is_sK represents the number of topics as the number of texts in the subset s;

an upper bound is found for the logarithmic normalization factor for θ' using Taylor expansion:

then, with respect to the variation parameter

Maximizing ELBO, and sequentially circulating each variation parameter for updating;

representing a variation posterior function;

a variational polynomial distribution parameter representing z;

a desire to express a gaussian distribution of variation of θ';

a covariance representing a variation gaussian distribution of θ';

ζ represents a variation parameter required in calculating the logarithmic normalization factor of θ'.

4. The method of claim 3, wherein the parameters related to variation are selected from the group consisting of

Maximizing ELBO, and sequentially circulating each variation parameter to update, the specific steps are as follows:

first, about

Maximizing ELBO;

because the computation cost of the variation expectation term in the above formula is too high, the variation expectation is estimated by adopting a Gaussian approximation method, and only 0-order Taylor expansion is reserved as further approximation during the approximation so as to improve the computation performance: