CN107491417B - Document generation method based on specific division under topic model - Google Patents

Document generation method based on specific division under topic model

Info

Publication number
CN107491417B
CN107491417B (application CN201710548431.5A)
Authority
CN
China
Prior art keywords
distribution
elbo
topic
variation
subset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710548431.5A
Other languages
Chinese (zh)
Other versions
CN107491417A (en)
Inventor
周凯文
杨智慧
马会心
何震瀛
荆一楠
王晓阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201710548431.5A priority Critical patent/CN107491417B/en
Publication of CN107491417A publication Critical patent/CN107491417A/en
Application granted granted Critical
Publication of CN107491417B publication Critical patent/CN107491417B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data mining, and particularly relates to a document generation method under a topic model based on a specific partition. The invention introduces the concept of a subset according to a given partition of a text database: in some text databases, such as news databases, the topic distributions of the texts within a certain time segment show a certain similarity, especially texts from different news channels reporting the same event, so the database can be divided into subsets using the time-segment attribute. The invention therefore proposes a new topic model (DbLDA) over a text database. In DbLDA, each document is generated by the following specific steps: generate a topic matrix; generate a topic distribution for each subset; generate a topic distribution for each article in the subset; and, for each word, select a topic and then select a word. The method can be applied to text databases with structured attributes.

Description

Document generation method based on specific division under topic model
Technical Field
The invention belongs to the technical field of data mining, and particularly relates to a document generation method under a topic model based on a specific partition, applied to text databases with structured attributes.
Background
Processing and analyzing text data with topic models is now widely applied in the field of data mining, and LDA (Latent Dirichlet Allocation) has received wide attention as a simple, easy-to-use topic model. LDA assumes that each text arises from an independent generation process and thus ignores the connections between texts, which can degrade the model's effectiveness.
A large amount of text data contains not only structured attributes, such as time and place, but also an unstructured text-content attribute. Such text data can be organized into a number of subsets according to these structured attributes and stored in a text database, which forms a specific partition of the text database. This is based on the observation that texts classified into the same subset share some commonality, i.e., there are links between the texts. For example, in a news database, news items from the same time or the same place may focus on the same important events, such as the spread of a certain virus or the track of a certain typhoon. Accordingly, the entire news database can be divided into subsets with commonality according to time or location attributes. To analyze document sets that admit such a subset partition, the invention designs, from the perspective of generative models, a new topic model based on LDA; the aim is to fully exploit the commonality within each subset for a specific partition of the text database and thereby obtain a topic model with better performance.
Disclosure of Invention
The invention aims to provide a document generation method, under a topic model based on a specific partition, that achieves better performance.
The invention is a new model constructed on the basis of the LDA model. The text generation process in LDA is as follows: first, a distribution is drawn from a Dirichlet distribution to serve as the article's topic distribution; for each word position a topic is drawn from this topic distribution, and a word is then drawn from the corresponding topic, yielding the words of the document. The topic distributions of the individual articles are mutually independent.
The topic model based on a specific partition proposed by the invention is designed on the basis of LDA and is applied to text databases with structured attributes, such as a news database with time or place tags (that is, the text database can be partitioned according to these structured attributes). The invention introduces the concept of a subset according to a given partition of the text database: in some text databases, such as news databases, the topic distributions of texts within a certain time segment show a certain similarity (especially texts from different news channels reporting the same event), so the database can be divided into subsets using the time-segment attribute. The invention therefore proposes a new topic model over a text database (LDA over Text Database, i.e., latent Dirichlet allocation over a text database), denoted DbLDA.
In DbLDA, each document is generated by the following specific steps:
Step (1), generate the topic matrix: $\phi_k \sim \mathrm{Dir}(\beta)$;
Step (2), generate a mean topic distribution for each subset: $\theta_s \sim \mathrm{Dir}(\alpha)$;
Step (3), generate a topic distribution for each article in the subset: $\theta'_{s,d} \sim \mathcal{N}\big(f(\theta_s), \Sigma_s\big)$;
Step (4), for each word,
(a) select a topic: $z_{s,d,n} \sim \mathrm{Mult}\big(\pi(\theta'_{s,d})\big)$;
(b) select a word: $w_{s,d,n} \mid z_{s,d,n} \sim \mathrm{Mult}(\phi_{z_{s,d,n}})$.
Here $f$ is a mapping from a multinomial distribution parameter vector to a natural parameter vector,
$$f(\theta)_k = \log\theta_k + c,$$
where $c$ is a constant, so that each multinomial distribution parameter vector corresponds to a family of natural parameter vectors; $\pi$ is the mapping from a natural parameter vector back to a multinomial distribution parameter vector,
$$\pi(\theta')_k = \frac{\exp(\theta'_k)}{\sum_{j=1}^{K}\exp(\theta'_j)}.$$
The joint probability distribution of the random variables in the graphical model of Fig. 1 corresponds to this generation process.
The key to a topic model's effectiveness is whether it truly reflects the real distribution of the text data, i.e., whether the topic distributions have a corresponding physical meaning. From the perspective of the generation process, step (2) assigns a main topic distribution (or average topic distribution) to each subset, and step (3) adds Gaussian noise to this average topic distribution to generate a topic distribution for each text in the subset; that is, the topic distributions of the articles in a subset scatter around a common main topic distribution. The hope is that modeling the commonality of texts within a subset in this way allows the model parameters inferred by the model (namely the topics, i.e., multinomial distributions over the vocabulary) to reflect the characteristics of the text data more faithfully, yielding better model performance. The covariance matrix Σ of the Gaussian distribution can be interpreted as the density of the topic distributions within the subset.
Compared with LDA, the method adds one probabilistic step when generating the topic distribution, namely an extra layer of logistic normal distribution. In the field of data analysis, the logistic normal distribution is often viewed as a "Gaussian distribution over the simplex". We use this distribution to model the commonality within each subset.
As shown in Fig. 2, under the DbLDA model the topic distributions of texts in the same subset are more concentrated on the topic simplex; in a real text database this corresponds to similarity in topic distribution, for example several news texts describing the same event at the same time.
With a new topic model, a key task is inference. Topic models are a class of generative models that generate documents according to a designed graphical model, LDA (latent Dirichlet allocation) being one example. The graphical model contains several hidden variables and observed variables, and the main work in analyzing it is to estimate the values of the hidden variables from the observed ones, i.e., to solve for the posterior distribution of the model's hidden variables. However, most topic models are complex, and the posterior distribution of the hidden variables generally has no closed-form expression, so the posterior must be computed approximately; this is the model inference problem. There are many approximate inference methods; the invention approximates the model using collapsed variational Bayes, where "collapsing" means removing some hidden variables of the model from the posterior distribution by marginal integration. There are two reasons. First, the DbLDA model is complex, not all hidden variables can be collapsed, and many distributions would need to be sampled, so a sampling method might converge too slowly and its convergence would be hard to judge. Second, each DbLDA iteration requires a large amount of computation; a variational method converges quickly and noticeably improves program performance.
For the variational method: since the true posterior is unavailable, the problem is transformed into a maximization problem by the following identity. The log probability of the evidence (the text) equals the KL divergence (Kullback-Leibler divergence, also known as relative entropy) plus the evidence lower bound (ELBO), so minimizing the KL divergence is achieved by maximizing the ELBO:
$$\log p(\mathbf{w}) = \mathrm{KL}\big(q(\mathbf{h})\,\|\,p(\mathbf{h}\mid\mathbf{w})\big) + \mathrm{ELBO}(q).$$
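For completeness, the decomposition can be verified in one line. This is the standard evidence decomposition of variational inference (a reconstruction, with h denoting the hidden variables retained in the posterior), not text quoted from the patent:

```latex
\begin{aligned}
\log p(\mathbf{w})
 &= \mathbb{E}_{q(\mathbf{h})}\!\left[\log\frac{p(\mathbf{w},\mathbf{h})}{q(\mathbf{h})}\right]
  + \mathbb{E}_{q(\mathbf{h})}\!\left[\log\frac{q(\mathbf{h})}{p(\mathbf{h}\mid\mathbf{w})}\right] \\
 &= \underbrace{\mathbb{E}_{q}\!\left[\log p(\mathbf{w},\mathbf{h})\right] + H(q)}_{\mathrm{ELBO}(q)}
  + \underbrace{\mathrm{KL}\big(q(\mathbf{h})\,\|\,p(\mathbf{h}\mid\mathbf{w})\big)}_{\ge 0}.
\end{aligned}
```

Since the left-hand side does not depend on q, maximizing the ELBO over q is the same as minimizing the KL divergence.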
Based on the idea of collapsed variational Bayes, dependencies between hidden variables are modeled explicitly; but since θ' in DbLDA (the topic distribution of an article in a subset) is difficult to remove by marginal integration, only θ and φ (the subsets' mean topic distributions and the topic matrix, respectively) are collapsed. This approach is referred to as "partially collapsed variational Bayes". The variational posterior distribution thus has the following form:
$$q(\theta', \mathbf{z}) = \prod_{s,d} q(\theta'_{s,d}) \prod_{s,d,n} q(z_{s,d,n}),$$
in which θ' obeys a variational Gaussian distribution, $q(\theta'_{s,d}) = \mathcal{N}(\mu_{s,d}, \hat{\Sigma}_{s,d})$, and z obeys variational multinomial distributions, $q(z_{s,d,n}) = \mathrm{Mult}(\gamma_{s,d,n})$.
Thus, the ELBO becomes:
$$\mathrm{ELBO} = \mathbb{E}_q\big[\log p(\mathbf{w}, \mathbf{z}, \theta', \theta, \phi)\big] + H(q),$$
where $H(q) = -\mathbb{E}_q[\log q]$ is the entropy of the variational distribution.
First, the ELBO is maximized with respect to $q(\theta \mid \cdot)$ and $q(\phi \mid \cdot)$. Since no restriction is placed on these two variational distributions, the maximum is attained at $q(\theta, \phi \mid \theta', \mathbf{z}) = p(\theta, \phi \mid \theta', \mathbf{z}, \mathbf{w})$, i.e., when the variational posterior equals the true posterior. After this simplification, the ELBO becomes:
$$\mathrm{ELBO} = \mathbb{E}_q\big[\log p(\mathbf{w}, \mathbf{z}, \theta')\big] + H\big(q(\theta')\,q(\mathbf{z})\big).$$
then, ELBO is developed according to a graph model of DbLDA:
Figure BDA0001343785890000041
Because the Dirichlet distribution and the logistic normal distribution are not conjugate, the following term in the ELBO is difficult to compute directly:
$$\mathbb{E}_q\!\left[\log \int p(\theta'\mid\theta,\Sigma)\,p(\theta\mid\alpha)\,d\theta\right].$$
To simplify the calculation while remaining consistent with the modeling goal, each dimension of the K-dimensional random variable θ' is assumed to obey an independent univariate Gaussian distribution, i.e., the covariance matrix is defined as a diagonal matrix,
$$\Sigma_s = \mathrm{diag}\big(\sigma^2_{s,1}, \dots, \sigma^2_{s,K}\big),$$
and correspondingly the variational covariance is diagonal, $\hat{\Sigma}_{s,d} = \mathrm{diag}\big(\nu^2_{s,d,1}, \dots, \nu^2_{s,d,K}\big)$. Meanwhile, the mapping from the multinomial distribution parameter vector to the natural parameter vector is chosen with c = 1. With these simplifications the term above can be computed in closed form (the closed-form expression appears only as an equation image in the original; $D_s$ there denotes the number of texts in subset s).
Following the inference method for Gaussian random variables in the CTM (Correlated Topic Model), and because the variational expectation of the log normalization factor of θ' is difficult to compute, the invention uses a Taylor expansion to find an upper bound for the log normalization factor of θ' so as to preserve the lower-bound property of the ELBO [1]:
$$\mathbb{E}_q\!\left[\log\sum_{k=1}^{K}\exp(\theta'_{s,d,k})\right] \le \zeta_{s,d}^{-1}\sum_{k=1}^{K}\exp\!\big(\mu_{s,d,k} + \nu^2_{s,d,k}/2\big) - 1 + \log\zeta_{s,d}.$$
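The bound is the standard one from the CTM paper; its one-line justification below is a reconstruction under the variational Gaussian $q(\theta'_{s,d})$, not text from the patent:

```latex
% First-order Taylor expansion of \log at \zeta (log is concave):
\log x \;\le\; \log\zeta + \frac{x-\zeta}{\zeta} \;=\; \zeta^{-1}x - 1 + \log\zeta,
\qquad x,\zeta > 0.
% Apply with x = \sum_k \exp(\theta'_{s,d,k}), take expectations under q,
% and use the log-normal mean of each Gaussian coordinate:
\mathbb{E}_q\!\left[\exp(\theta'_{s,d,k})\right]
  = \exp\!\left(\mu_{s,d,k} + \nu^2_{s,d,k}/2\right).
```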
Then, the ELBO is maximized with respect to the variational parameters $\gamma$, $\zeta$, $\mu$, $\nu^2$, cyclically updating each in turn. Specifically:
In the first step, the ELBO is maximized with respect to $\gamma_{s,d,n}$. Writing out the γ-related terms of the ELBO and cancelling the factors that appear in both numerator and denominator simplifies the expression; however, the variational expectation terms in it are too expensive to compute exactly, so they are estimated with the Gaussian approximation from the original paper on collapsed variational Bayesian inference, keeping only the zeroth-order Taylor expansion as a further approximation to improve computational performance (the CVB0 approximation). The resulting update is
$$\gamma_{s,d,n,k} \;\propto\; \exp(\mu_{s,d,k}) \cdot \frac{\mathbb{E}_q\big[n^{\neg sdn}_{k,\,w_{s,d,n}}\big] + \beta}{\mathbb{E}_q\big[n^{\neg sdn}_{k,\,\cdot}\big] + V\beta},$$
where $n^{\neg sdn}_{k,w}$ is the number of occurrences of word w assigned to topic k in all texts excluding position (s, d, n), $n^{\neg sdn}_{k,\cdot}$ is the corresponding total count for topic k, V is the vocabulary size, and $\mathbb{E}_q[\cdot]$ denotes the expectation under the variational distribution. Using this approximation we obtain the update equation for $\gamma_{s,d,n}$.
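As an illustration, here is a minimal sketch of this CVB0-style sweep in Python/NumPy. The data layout (dictionaries keyed by (s, d)) and the incremental maintenance of the expected counts are implementation choices of this sketch, not prescribed by the patent:

```python
import numpy as np

def cvb0_sweep(gamma, mu, docs, beta, V):
    """One sweep of the zeroth-order (CVB0) update for the variational
    multinomial parameters gamma. docs[(s, d)] is a list of word ids,
    gamma[(s, d)] is an (N_doc x K) array, and mu[(s, d)] is the K-dim
    mean of the variational Gaussian of theta'."""
    K = next(iter(mu.values())).shape[0]
    # Expected topic-word and topic-total counts under q: sums of gamma.
    n_kw = np.zeros((K, V))
    for key, words in docs.items():
        for n, w in enumerate(words):
            n_kw[:, w] += gamma[key][n]
    n_k = n_kw.sum(axis=1)
    for key, words in docs.items():
        for n, w in enumerate(words):
            g_old = gamma[key][n]
            # Counts excluding the current position: the "neg-sdn" counts.
            num = n_kw[:, w] - g_old + beta
            den = n_k - g_old + V * beta
            g_new = np.exp(mu[key]) * num / den
            g_new /= g_new.sum()
            # Keep the expected counts consistent incrementally.
            n_kw[:, w] += g_new - g_old
            n_k += g_new - g_old
            gamma[key][n] = g_new
    return gamma
```

Maintaining the counts incrementally keeps each sweep at O(number of tokens × K), which is the usual reason CVB0-style updates are fast in practice.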
In the second step, the ELBO is maximized with respect to ζ. Taking the derivative of the ζ-related terms in the ELBO and setting the derivative to zero yields the ζ update equation:
$$\zeta_{s,d} = \sum_{k=1}^{K}\exp\!\big(\mu_{s,d,k} + \nu^2_{s,d,k}/2\big).$$
In the third step, the ELBO is maximized with respect to $\mu_{s,d}$. Setting the derivative of the ELBO with respect to $\mu_{s,d}$ to zero admits no analytical solution; for this purpose, Newton's method is used to solve the maximization over $\mu_{s,d}$.
Finally, as to
Figure BDA00013437858900000512
Maximize ELBO with the constraint of
Figure BDA00013437858900000513
As above, this maximization problem also does not have an analytical solution, and is therefore solved using newton's method. ELBO about
Figure BDA00013437858900000514
Taking the derivative of
Figure BDA00013437858900000515
Combining the above conclusions, the variational parameters are updated in turn in each iteration, giving a coordinate-ascent algorithm on the ELBO.
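The two Newton solves can be packaged generically. The sketch below is illustrative only: the patent's exact derivative expressions appear as equation images, so the usage example maximizes a toy concave function standing in for the constrained ν² objective; the bound back-off and the handful of inner iterations mirror the 5 to 10 extra iterations mentioned in the experiments:

```python
def newton_max_1d(grad, hess, x0, iters=10, lower=None):
    """Maximize a one-dimensional function by Newton's method,
    x <- x - f'(x)/f''(x), optionally enforcing a lower bound
    (e.g. nu^2 > 0) by halving the step until it holds."""
    x = x0
    for _ in range(iters):
        step = grad(x) / hess(x)
        x_new = x - step
        while lower is not None and x_new <= lower:
            step /= 2.0
            x_new = x - step
        x = x_new
    return x

# Toy usage: maximize f(v) = log(v) - v/2 over v > 0 (maximum at v = 2).
v_star = newton_max_1d(grad=lambda v: 1.0 / v - 0.5,
                       hess=lambda v: -1.0 / v ** 2,
                       x0=1.0, lower=0.0)
print(v_star)  # ~2.0
```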
Drawings
FIG. 1 is a DbLDA graphical model.
FIG. 2 shows a sample on the topic simplex under DbLDA (3 topics, 2 subsets, 1000 articles; each red dot corresponds to the topic distribution of one article, each blue dot to the main topic distribution of one subset, and the pink triangle is the topic simplex).
FIG. 3 shows the predictive-perplexity test results of DbLDA / CVB0_LDA (latent Dirichlet allocation inferred with collapsed variational Bayes) / CGS_LDA (latent Dirichlet allocation inferred with collapsed Gibbs sampling) on one month of Reuters news data (3942 news texts, 16379 words), with α = 1.01, β = 0.01, and K = 50 for all models, and additionally Σ = 1.0 for DbLDA.
FIG. 4 shows the predictive-perplexity results of DbLDA / CVB0_LDA under different settings: text databases of different sizes (1, 2, and 6 months) and different subset partitions (15 days of news per subset and 30 days of news per subset).
Detailed Description
The testing of the DbLDA topic model is divided into a comparison experiment against LDA and model experiments under different parameters. The comparison metrics are the predictive perplexity and the running time of the different models; the varied parameters are the subset length and the size of the text database. The aim is to test the text prediction ability of the DbLDA topic model in several ways.
(1) Experimental Environment and data set introduction
All experimental programs run in an Ubuntu 16.04 environment; the experimental machine has an i5-3470 CPU and 12 GB of memory, and all experiments use code written in Java 8. The DbLDA and LDA experiments below all use the parameters α = 1.01, β = 0.01, and K = 50; for DbLDA, additionally Σ = 1.0.
English news data from Reuters was collected as the test corpus for evaluating the performance of DbLDA. The news texts are partitioned into subsets by time, with the day as the smallest partition unit. The corpus was tokenized, punctuation was removed, words were stemmed, and all words occurring only once were removed. After processing, six months of Reuters news data (January to June; 22723 news texts, 36639 words) were obtained. The test text set is 10% of the size of the training text set.
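The preprocessing just described can be sketched as follows; the patent does not name its tools, so the regex tokenizer and NLTK's Porter stemmer are illustrative assumptions:

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer  # stemmer choice is an assumption

def preprocess(raw_docs):
    """Tokenize, strip punctuation, stem, and drop words that occur
    only once in the whole corpus, as described above."""
    stem = PorterStemmer().stem
    docs = [[stem(t) for t in re.findall(r"[a-z]+", doc.lower())]
            for doc in raw_docs]
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] > 1] for doc in docs]
```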
(2) Comparative experiment with LDA
The metric of the comparison experiment is predictive perplexity, a standard measure of a language model's predictive ability; lower predictive perplexity indicates better predictive ability:
$$\mathrm{Perplexity} = \exp\!\left(-\frac{\sum_{d}\log p(\mathbf{w}_d)}{\sum_{d} N_d}\right),$$
where $N_d$ is the number of words in test text d.
For DbLDA, the generation probability of a test-set text is computed as follows (that is, the variational posterior is used as the true posterior, and expectations under it estimate the model parameters):
$$p(\mathbf{w}_{s,d}) = \prod_{n=1}^{N_{s,d}} \sum_{k=1}^{K} \pi(\mu_{s,d})_k \,\hat{\phi}_{k,\, w_{s,d,n}},$$
where $\hat{\phi}$ is the topic matrix estimated as its expectation under the variational posterior.
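A small sketch of this evaluation in Python/NumPy; the array names are illustrative, with theta_hat and phi_hat standing for the posterior-expectation estimates just described:

```python
import numpy as np

def predictive_perplexity(test_docs, theta_hat, phi_hat):
    """Predictive perplexity as defined above. test_docs[d] is an array
    of word ids, theta_hat[d] is the estimated K-dim topic distribution
    of test document d, and phi_hat is the K x V estimated topic matrix."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(test_docs):
        word_probs = theta_hat[d] @ phi_hat[:, words]  # p(w_n) per position
        log_lik += np.log(word_probs).sum()
        n_words += len(words)
    return float(np.exp(-log_lik / n_words))
```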
the subject model compared to DbLDA is an implicit dirichlet distribution using a systolic variational bayesian inference (CVB) (while approximating the desired variational, again using a zero-order taylor expansion approximation, denoted CVB 0). As the theme distribution of each text in the test text set needs to be obtained, phi is obtained by training in the training text set, and then for each article in the test text set, the theta of each article is obtained by training 50% of the first text. The algorithm for training the test text set is consistent with the previous method except that phi is a fixed value and is fixed as a result obtained in the training text set. And obtaining the theme distribution of the test text set, namely substituting the expression into the last 50% of texts of each test text, and calculating the prediction chaos.
In the comparison experiment, one month of the Reuters data set (April; 3942 news texts, 16379 words) was used, and for DbLDA each subset was set to contain 7 days of news texts. Each program was run for 500 iterations; after each iteration, φ was estimated from the training text set and the predictive perplexity was computed as described above. Note that latent Dirichlet allocation inferred with collapsed Gibbs sampling was added to the experiment for comparison, to illustrate that different model inference methods lead to different model test results.
The experimental results are shown in FIG. 3. The results show that the predictive ability of DbLDA on the test text set is stronger than that of LDA inferred with collapsed Gibbs sampling, which, converging slowly, does not obtain a better model within 500 iterations; this also illustrates the influence of the approximate inference method on topic-model performance. DbLDA converges more slowly than LDA inferred with collapsed Gibbs sampling, and the time it needs to reach the same predictive perplexity is longer, because DbLDA must update more variables per iteration than LDA, and because some variables are updated with Newton's method, which adds 5 to 10 extra inner iterations per iteration to obtain an approximate solution.
Table 2 shows the time-efficiency comparison between DbLDA and LDA inferred with collapsed Gibbs sampling, recording the time each algorithm needs to reach the same perplexity threshold; the data set and parameter settings are as above.

TABLE 2. Comparison of iteration times (time required to reach the same perplexity, in seconds; the table values appear only as an image in the original).

Claims (4)

1. A document generation method under a topic model based on a specific partition, characterized in that the topic model is a latent Dirichlet allocation over a text database, denoted DbLDA; in DbLDA, each document is generated by the following specific steps:
Step (1), generate the topic matrix: $\phi_k \sim \mathrm{Dir}(\beta)$;
Step (2), generate a topic distribution for each subset: $\theta_s \sim \mathrm{Dir}(\alpha)$;
Step (3), generate a topic distribution for each article in the subset: $\theta'_{s,d} \sim \mathcal{N}\big(f(\theta_s), \Sigma_s\big)$;
Step (4), for each word,
(a) select a topic: $z_{s,d,n} \sim \mathrm{Mult}\big(\pi(\theta'_{s,d})\big)$;
(b) select a word: $w_{s,d,n} \mid z_{s,d,n} \sim \mathrm{Mult}(\phi_{z_{s,d,n}})$;
wherein $f$ is a mapping from a multinomial distribution parameter vector to a natural parameter vector,
$$f(\theta)_k = \log\theta_k + c,$$
where c is a constant, so that each multinomial distribution parameter vector corresponds to a family of natural parameter vectors; $\pi$ is the mapping from a natural parameter vector back to a multinomial distribution parameter vector,
$$\pi(\theta')_k = \frac{\exp(\theta'_k)}{\sum_{j=1}^{K}\exp(\theta'_j)};$$
wherein the parameters and symbols used are as follows:
S denotes the number of subsets;
α denotes the hyper-parameter of the Dirichlet prior of the subsets' topic distributions;
β denotes the hyper-parameter of the Dirichlet prior of the word-frequency distribution of each topic;
$\Sigma_s$ denotes the topic distribution density of subset s;
$\theta_s$ denotes the mean topic distribution of subset s;
$\theta'_{s,d}$ denotes the topic distribution of the d-th text in subset s;
φ denotes the topic matrix;
$z_{s,d,n}$ denotes the topic of the n-th word of the d-th text in subset s;
$w_{s,d,n}$ denotes the n-th word of the d-th text in subset s.
2. The document generation method under a topic model based on a specific partition according to claim 1, characterized in that the topic model is approximated using the method of collapsed variational Bayes; collapsing means removing some hidden variables of the topic model from the posterior distribution by marginal integration;
for the variational Bayes method, the log probability of the evidence equals the KL divergence plus the evidence lower bound, denoted ELBO, so minimizing the KL divergence is achieved by maximizing the ELBO:
$$\log p(\mathbf{w}) = \mathrm{KL}\big(q(\mathbf{h})\,\|\,p(\mathbf{h}\mid\mathbf{w})\big) + \mathrm{ELBO}(q);$$
since the topic distribution θ' of an article in a subset of DbLDA is difficult to remove by marginal integration, only the subsets' mean topic distributions θ and the topic matrix φ are collapsed; this practice is called partially collapsed variational Bayes; thus, the variational posterior distribution has the following form:
$$q(\theta', \mathbf{z}) = \prod_{s,d} q(\theta'_{s,d}) \prod_{s,d,n} q(z_{s,d,n}),$$
in which θ' obeys a variational Gaussian distribution, $q(\theta'_{s,d}) = \mathcal{N}(\mu_{s,d}, \hat{\Sigma}_{s,d})$, and z obeys variational multinomial distributions, $q(z_{s,d,n}) = \mathrm{Mult}(\gamma_{s,d,n})$; thus, the ELBO becomes:
$$\mathrm{ELBO} = \mathbb{E}_q\big[\log p(\mathbf{w}, \mathbf{z}, \theta', \theta, \phi)\big] + H(q),$$
wherein $H(q) = -\mathbb{E}_q[\log q]$ is the entropy of the variational distribution.
3. The document generation method under a topic model based on a specific partition according to claim 2, characterized in that the specific steps of maximizing the ELBO are as follows:
first, the ELBO is maximized with respect to $q(\theta\mid\cdot)$ and $q(\phi\mid\cdot)$;
since no restriction is placed on these two variational distributions, the maximum is attained at $q(\theta,\phi\mid\theta',\mathbf{z}) = p(\theta,\phi\mid\theta',\mathbf{z},\mathbf{w})$, i.e., when the variational posterior equals the true posterior; after simplification, the ELBO becomes:
$$\mathrm{ELBO} = \mathbb{E}_q\big[\log p(\mathbf{w},\mathbf{z},\theta')\big] + H\big(q(\theta')\,q(\mathbf{z})\big);$$
then, the ELBO is expanded according to the graphical model of DbLDA:
$$\mathrm{ELBO} = \mathbb{E}_q\big[\log p(\theta'\mid\alpha,\Sigma)\big] + \mathbb{E}_q\big[\log p(\mathbf{z}\mid\theta')\big] + \mathbb{E}_q\big[\log p(\mathbf{w}\mid\mathbf{z},\beta)\big] + H\big(q(\theta')\,q(\mathbf{z})\big);$$
to simplify the calculation, each dimension of θ' is assumed to obey an independent univariate Gaussian distribution, i.e., the covariance matrix is defined as a diagonal matrix,
$$\Sigma_s = \mathrm{diag}\big(\sigma^2_{s,1},\dots,\sigma^2_{s,K}\big);$$
meanwhile, the mapping from the multinomial distribution parameter vector to the natural parameter vector is chosen with c = 1; with these simplifications the above formula can be computed (the closed-form expression appears only as an equation image in the original), wherein $D_s$ is the number of texts in subset s and K denotes the number of topics;
an upper bound is found for the log normalization factor of θ' using a Taylor expansion:
$$\mathbb{E}_q\!\left[\log\sum_{k=1}^{K}\exp(\theta'_{s,d,k})\right] \le \zeta_{s,d}^{-1}\sum_{k=1}^{K}\exp\!\big(\mu_{s,d,k} + \nu^2_{s,d,k}/2\big) - 1 + \log\zeta_{s,d};$$
then, the ELBO is maximized with respect to the variational parameters $\gamma$, $\zeta$, $\mu$, $\nu^2$, cyclically updating each variational parameter in turn;
q denotes the variational posterior function;
$\gamma_{s,d,n}$ denotes the variational multinomial distribution parameter of z;
$\mu_{s,d}$ denotes the expectation of the variational Gaussian distribution of θ';
$\nu^2_{s,d}$ denotes the covariance of the variational Gaussian distribution of θ';
ζ denotes the variational parameter required in computing the log normalization factor of θ'.
4. The document generation method under a topic model based on a specific partition according to claim 3, characterized in that maximizing the ELBO with respect to the variational parameters $\gamma$, $\zeta$, $\mu$, $\nu^2$, cyclically updating each variational parameter in turn, comprises the following specific steps:
first, the ELBO is maximized with respect to $\gamma_{s,d,n}$;
writing out the γ-related terms of the ELBO and cancelling the factors that appear in both numerator and denominator simplifies the expression; because the variational expectation terms in the resulting formula are too expensive to compute, they are estimated with a Gaussian approximation, keeping only the zeroth-order Taylor expansion as a further approximation to improve computational performance:
$$\gamma_{s,d,n,k} \;\propto\; \exp(\mu_{s,d,k}) \cdot \frac{\mathbb{E}_q\big[n^{\neg sdn}_{k,\,w_{s,d,n}}\big] + \beta}{\mathbb{E}_q\big[n^{\neg sdn}_{k,\,\cdot}\big] + V\beta},$$
wherein $\mathbb{E}_q\big[n^{\neg sdn}_{k,w}\big]$ is the expectation of $n^{\neg sdn}_{k,w}$ under the variational distribution; using this approximate expression, the update equation for $\gamma_{s,d,n}$ is obtained;
second, the ELBO is maximized with respect to ζ;
taking the derivative of the ζ-related terms in the ELBO and setting the derivative to zero yields the ζ update equation:
$$\zeta_{s,d} = \sum_{k=1}^{K}\exp\!\big(\mu_{s,d,k} + \nu^2_{s,d,k}/2\big);$$
third, the ELBO is maximized with respect to $\mu_{s,d}$;
setting the derivative of the ELBO with respect to $\mu_{s,d}$ to zero has no analytical solution; for this purpose, Newton's method is used to solve the maximization over $\mu_{s,d}$;
finally, the ELBO is maximized with respect to $\nu^2_{s,d}$, under the constraint $\nu^2_{s,d,k} > 0$; as above, this maximization problem also has no analytical solution, so it is solved using Newton's method applied to the derivative of the ELBO with respect to $\nu^2_{s,d}$;
integrating the above steps, each iteration updates the variational parameters in turn, and a coordinate-ascent algorithm for the ELBO is obtained;
V denotes the vocabulary size;
$n^{\neg sdn}_{k,w}$ denotes, after removing $w_{s,d,n}$ and $z_{s,d,n}$, the number of occurrences of word w assigned topic k across all texts.
CN201710548431.5A 2017-07-06 2017-07-06 Document generation method based on specific division under topic model Active CN107491417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710548431.5A CN107491417B (en) 2017-07-06 2017-07-06 Document generation method based on specific division under topic model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710548431.5A CN107491417B (en) 2017-07-06 2017-07-06 Document generation method based on specific division under topic model

Publications (2)

Publication Number Publication Date
CN107491417A CN107491417A (en) 2017-12-19
CN107491417B true CN107491417B (en) 2021-06-22

Family

ID=60644370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710548431.5A Active CN107491417B (en) 2017-07-06 2017-07-06 Document generation method based on specific division under topic model

Country Status (1)

Country Link
CN (1) CN107491417B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110331B (en) * 2019-04-30 2021-02-26 清华大学 Text generation method, device, medium and computing equipment
CN110738242B (en) * 2019-09-25 2021-08-10 清华大学 Bayes structure learning method and device of deep neural network


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825648B2 (en) * 2010-04-15 2014-09-02 Microsoft Corporation Mining multilingual topics
US20120278353A1 (en) * 2011-04-28 2012-11-01 International Business Machines Searching with topic maps of a model for canonical model based integration

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587493A (en) * 2009-06-29 2009-11-25 中国科学技术大学 Text classification method
CN102591917A (en) * 2011-12-16 2012-07-18 华为技术有限公司 Data processing method and system and related device
CN105183833A (en) * 2015-08-31 2015-12-23 天津大学 User model based microblogging text recommendation method and recommendation apparatus thereof
CN105740354A (en) * 2016-01-26 2016-07-06 中国人民解放军国防科学技术大学 Adaptive potential Dirichlet model selection method and apparatus

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Text Categorization Based on Topic Model; Shibin Zhou et al.; International Journal of Computational Intelligence Systems; 2009-12-04; Vol. 2, No. 4; pp. 398-409 *
A Fast Randomized Community Mining Algorithm for Large-Scale Networks; Yu Tao et al.; Proceedings of the 26th Chinese Database Conference (Part B); 2009-09-15; pp. 406-412 *
The Development of Topic Models in Natural Language Processing; Xu Ge et al.; Chinese Journal of Computers; 2011-08-31; No. 08; pp. 1423-1436 *

Also Published As

Publication number Publication date
CN107491417A (en) 2017-12-19

Similar Documents

Publication Publication Date Title
Jiang et al. Sentence level topic models for associated topics extraction
JP5250076B2 (en) Structure prediction model learning apparatus, method, program, and recording medium
JP2940501B2 (en) Document classification apparatus and method
JP6902945B2 (en) Text summarization system
CN109471889B (en) Report accelerating method, system, computer equipment and storage medium
Pruteanu-Malinici et al. Hierarchical Bayesian modeling of topics in time-stamped documents
US20160203105A1 (en) Information processing device, information processing method, and information processing program
CN111462751A (en) Method, apparatus, computer device and storage medium for decoding voice data
CN107491417B (en) Document generation method based on specific division under topic model
WO2023088309A1 (en) Method for rewriting narrative text, device, apparatus, and medium
Tatti Ranking episodes using a partition model
Wang et al. A brief tour of Bayesian sampling methods
JP4143234B2 (en) Document classification apparatus, document classification method, and storage medium
US20220114441A1 (en) Apparatus and method for scheduling data augmentation technique
CN110716761A (en) Automatic and self-optimizing determination of execution parameters of software applications on an information processing platform
US7853541B1 (en) Method and apparatus for simmered greedy optimization
Wang et al. Gaussian process-based random search for continuous optimization via simulation
JP7143599B2 (en) Metadata evaluation device, metadata evaluation method, and metadata evaluation program
Culp et al. On adaptive regularization methods in boosting
US20220101187A1 (en) Identifying and quantifying confounding bias based on expert knowledge
CN111339287B (en) Abstract generation method and device
CN114610576A (en) Log generation monitoring method and device
CN110162629B (en) Text classification method based on multi-base model framework
CN113297854A (en) Method, device and equipment for mapping text to knowledge graph entity and storage medium
Bethard et al. Topic model analysis of metaphor frequency for psycholinguistic stimuli

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant