CN109376347A - HSK composition generation method based on a topic model - Google Patents

HSK composition generation method based on a topic model

Info

Publication number
CN109376347A
CN109376347A (application CN201811202083.7A)
Authority
CN
China
Prior art keywords
theme
distribution
text
composition
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811202083.7A
Other languages
Chinese (zh)
Inventor
吕学强
游新冬
董志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201811202083.7A priority Critical patent/CN109376347A/en
Publication of CN109376347A publication Critical patent/CN109376347A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to an HSK composition generation method based on a topic model, comprising: training an LDA model to obtain the distributions relating sentences to texts and words to texts, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text. The automatically generated text is coherent and logically well organized, with few grammatical errors and few wrongly written characters; the method completes the writing task well and meets the needs of practical application.

Description

HSK composition generation method based on a topic model
Technical field
The invention belongs to the technical field of text information processing, and in particular relates to an HSK composition generation method based on a topic model.
Background art
In an era of rapid development of the IT industry and the internet, people have long dreamed of computing over natural language, so that hidden information and knowledge can be mined from large volumes of unstructured text. Artificial intelligence (AI) technology has grown rapidly. In 1997, Deep Blue, developed by IBM, defeated the world chess champion Garry Kasparov; in March 2016, AlphaGo, with its Monte Carlo tree search algorithm, defeated Lee Sedol. These are important milestones in artificial intelligence research.
On the other hand, the combination of AI and big data has brought unprecedented development to natural language processing. Because artificial intelligence systems reason logically over rules, they are suited to procedural work and to tasks involving large volumes of data and demanding timeliness. Big data has supported ground-breaking ideas in many industries, even challenging the traditional framework of writing. With the development of computer technology and artificial intelligence, literary writing, once the product of the highest human intelligence, has entered the era of "computer manufacture", which has also changed ideas about writing, writing behavior, and ways of thinking about writing. Natural language generation is the natural language processing task in which a system generates natural language from a machine representation such as a knowledge base or a logical form. A natural language generation system can be seen as a translator that converts data into a natural language representation. However, because of the inherent expressivity of natural language, the methods for generating the final text differ from those of a compiler.
The Chinese Proficiency Test (HSK) is a standardized international examination of Chinese language ability, set up to test the Chinese proficiency of non-native speakers. It is comparable to the CET-4 and CET-6 examinations and to the TOEFL and IELTS examinations for English. Research on English examinations, and especially on English examination writing, has achieved great success both at home and abroad. By contrast, research on HSK writing is still limited, in particular research on applying existing natural language processing technology to answer such questions automatically. As the HSK spreads worldwide, more and more overseas learners of Chinese take the examination, and domestic research on the HSK continues to grow.
The writing section mainly examines word order, grammar, content, and the logic of language, making it a good research topic for natural language generation. The writing task looks like a relatively difficult challenge, but by analyzing the task and training machine learning models, it can be converted into a trainable text generation task. With the continuing development of big data, natural language processing, and other artificial intelligence technologies, the exploration and practice of automatically generating news reports with algorithms has gradually begun. The continuing practice and development of automatic news writing has repeatedly confirmed that artificial intelligence can help people process and integrate data quickly and conveniently, and it will change the content and channels of news media. However, text automatically generated by the prior art is poor in coherence and logic and contains many grammatical errors and many wrongly written characters; these problems urgently need improvement.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide an HSK composition generation method based on a topic model that avoids the above technical defects.
In order to achieve the above object of the invention, the technical solution provided by the invention is as follows:
An HSK composition generation method based on a topic model, comprising: training an LDA model to obtain the distributions relating sentences to texts and words to texts, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text.
Further, the method includes selecting a training dataset for training the LDA model. Selecting the training dataset includes: selecting the "HSK dynamic composition corpus" as the base corpus; processing the corpus into a standard composition corpus according to the correction annotations in the corpus, i.e., turning the annotated compositions into standardized compositions according to the errors marked in the corpus and the corrections provided; and using these standardized composition samples as the standard corpus for training the LDA model.
Further, the step of training the LDA model includes:
When the LDA algorithm starts, the parameters θ_d and φ_t are given randomly; the process described in the following steps A, B, and C is then iterated until convergence, and the converged result is the output of LDA;
A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:
p_j(w_i | d_s) = p(w_i | t_j) × p(t_j | d_s);
B. Enumerate the topics in the topic set T to obtain all p_j(w_i | d_s), where j takes values from 1 to k; a topic can then be selected for the i-th word w_i in d_s according to these probability values;
C. Computing p(w | d) and reselecting the topic for every word w in the vocabulary set D counts as one iteration.
Further, the step of computing cross entropy includes:
For probability distributions p and q, the cross entropy is
H(p, q) = -Σ_x p(x) log q(x);
where H(p, q) = H(p) + D_KL(p‖q), i.e., the entropy of p plus the Kullback-Leibler divergence from p to q.
Further, the specific steps of the HSK composition generation method based on the topic model are as follows: select a training dataset and train the LDA model to obtain the distribution of the main contents of the text set and the distribution of each sentence over the topics; compute the cross entropy between each candidate sentence and the document and select the sentences with smaller cross entropy; arrange the candidate sentences according to their relative positions in the original candidate texts; output the automatically generated composition.
Further, the method of generating a text with the LDA model includes:
A. Sample the topic distribution θ_i of document i from the Dirichlet distribution α;
B. Sample the topic z_ij of the j-th word of document i from the multinomial topic distribution θ_i;
C. Sample the word distribution φ_{z_ij} of topic z_ij from the Dirichlet distribution β;
D. Sample the word w_ij from the multinomial word distribution φ_{z_ij}.
Further, in the LDA model the joint distribution of all visible and hidden variables is
p(w_i, z_i, θ_i, φ | α, β);
Integrating over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:
p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ;
Assume there are M texts in the corpus, where all the words w and their corresponding topics z are as follows:
w = (w_1, ..., w_M);
z = (z_1, ..., z_M);
w_m denotes the words in the m-th text, and z_m denotes the numbers of the topics corresponding to these words.
Further, in the LDA model, suppose the m-th text is to be generated: first consult the text-topic distribution of the m-th text, then generate the topic number z_{m,n} of its n-th word;
In the word-topic distribution, look up the topic with number z_{m,n} and select a word under that topic, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus can be generated;
In the LDA model, the M documents correspond to M independent Dirichlet-Multinomial conjugate structures, and the k topics correspond to k independent Dirichlet-Multinomial conjugate structures;
where n_m = (n_m^(1), ..., n_m^(k)) and n_m^(k) denotes the number of words corresponding to the k-th topic in the m-th text; according to the Dirichlet-Multinomial conjugate structure, the posterior distribution of θ_m is Dir(θ_m | n_m + α);
The probability of the topics generated over the entire text set is computed as
p(z | α) = Π_{m=1}^{M} Δ(n_m + α) / Δ(α);
Similarly, one obtains
w' = (w^(1), ..., w^(k));
z' = (z^(1), ..., z^(k));
w^(k) denotes the words that are all generated by topic k, and z^(k) the numbers of the topics corresponding to these words; one obtains
n_k = (n_k^(1), ..., n_k^(t));
where n_k^(t) is the number of occurrences of word t among the words generated by topic k;
The probability of the words generated over the entire text set is
p(w | z, β) = Π_{j=1}^{k} Δ(n_j + β) / Δ(β);
Combining the two gives p(w, z | α, β) = p(z | α) · p(w | z, β);
Finally, the Gibbs sampling formula of the LDA model is obtained as:
p(z_i = j | z_¬i, w) ∝ (n_{m,¬i}^(j) + α_j) · (n_{j,¬i}^(t) + β_t) / Σ_{t'=1}^{V} (n_{j,¬i}^(t') + β_{t'}),
where Δ(·) is the Dirichlet normalization constant, ¬i denotes counts excluding the current assignment of word i, and V is the vocabulary size.
Further, the HSK composition generation method based on the topic model comprises the following steps:
(1) select a training dataset and train the LDA model to obtain the distribution of the main contents of the text set and the distribution of each sentence over the topics;
(2) compute the cross entropy between each candidate sentence and the document and select the sentences with smaller cross entropy;
(3) arrange the candidate sentences according to their relative positions in the original candidate texts;
(4) output the automatically generated composition.
In the HSK composition generation method based on a topic model provided by the invention, an LDA topic model is trained to obtain the distributions relating sentences to texts and words to texts; cross entropy is then computed to select the sentences most similar to the topic keywords, and the text is generated. The automatically generated text is coherent and logically well organized, with few grammatical errors and few wrongly written characters; the method completes the writing task well and meets the needs of practical application.
Brief description of the drawings
Fig. 1 is the basic graphical model of LDA;
Fig. 2 is the joint probability graphical model of LDA.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art, based on the embodiments of the present invention and without creative work, shall fall within the protection scope of the present invention.
As shown in Fig. 1, an HSK composition generation method based on a topic model comprises the steps of: selecting training data and training an LDA topic model to obtain the distributions relating sentences to texts and words to texts; computing cross entropy to select the sentences most similar to the topic keywords; and then generating the text.
The first writing task of HSK Level 5 asks the examinee to write a short text using given words. The first question of the writing part of the Level 5 New HSK examination of July 2013 was as follows: using the following words (all must be used, in no particular order), write a short essay of about 80 characters.
application, biographic information, outstanding, characteristic
Writing according to the words given in the question stem is a typical material-based composition task. It is easy to see that the five words given in the stem are by no means unrelated. One can therefore first determine a theme, then develop the writing around that theme while using all the words given in the stem. Analyzing the example above, the theme can be judged to be "job hunting", so the essay can be written around the theme of "job hunting".
The basic steps of writing are therefore:
(1) determine the theme according to the words given in the stem; the writing task is carried out around this theme throughout;
(2) make sentences with the given words; because the stem requires that all the words be used, sentences can be made with the given words, but it must be ensured that the sentences produced are consistent with the theme.
From the above analysis of this writing task it is easy to see that the task focuses on developing the writing around a specific theme while ensuring that all the words listed in the stem appear in the essay.
Therefore, if a machine is to automatically generate a composition and complete this writing task, it must write according to the theme. From this perspective, the invention proposes an automatic writing method based on the LDA topic model to realize the machine's automatic composition task.
For this writing task, the invention proposes an automatic writing method based on a topic model and completes the generation of the text with a sentence-extraction strategy. The sentence-extraction strategy mainly extracts appropriate sentences from candidate texts and then sorts and combines these sentences to generate a complete essay. Sentence extraction based on the topic model selects and extracts sentences from the candidate texts through the topics and keywords.
Therefore, when generating a text, several corresponding words are first generated on the basis of the given keywords. Sentences are then screened and extracted from the candidate texts with these words. Finally, the extracted sentences are sorted by their relative positions in the candidate texts, and a generated text is obtained.
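The extract-then-reorder strategy described above can be sketched as follows. This is a minimal illustration with hypothetical helper names; the real method scores sentences with the topic model rather than the simple keyword-overlap count used here as a stand-in.

```python
# Sketch of the sentence-extraction strategy (hypothetical names; the
# relevance score is stood in for by a keyword-overlap count).
def extract_and_order(candidate_sents, keywords, top_n=3):
    """candidate_sents: list of (position_in_source, sentence) pairs."""
    def score(sentence):
        # Stand-in relevance score: number of keywords the sentence contains.
        return sum(1 for kw in keywords if kw in sentence)
    # Pick the top-scoring sentences ...
    chosen = sorted(candidate_sents, key=lambda ps: score(ps[1]), reverse=True)[:top_n]
    # ... then restore their relative order from the source text.
    chosen.sort(key=lambda ps: ps[0])
    return [s for _, s in chosen]

sents = [(0, "nature gives food"), (1, "irrelevant aside"),
         (2, "science protects arable land"), (3, "mankind needs food and science")]
text = extract_and_order(sents, ["nature", "food", "science", "arable land"], top_n=2)
```

The final sort by source position is what preserves discourse order in the generated essay.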
The steps of selecting the training dataset are:
1) select the "HSK dynamic composition corpus" as the base corpus;
The "HSK dynamic composition corpus" is an NOCFL scientific research project presided over by Professor Cui Xiliang of Beijing Language and Culture University. It is a corpus of essay examination papers written by foreigners whose mother tongue is not Chinese in the advanced HSK essay examination. After years of revision and supplementation, the first release of the corpus collected 11,569 compositions. The raw corpus contains the examinees' composition papers together with very detailed information such as the examinees' composition scores. In addition, the annotated corpus provides very comprehensive correction annotations for the errors in the examinees' compositions. The annotated content mainly covers: (1) character processing: annotations of wrong characters, missing characters, extra characters, etc.; (2) punctuation processing: annotations of wrong punctuation and of missing or extra punctuation; (3) word processing: annotations of wrong words, missing words, collocation errors, etc.; (4) sentence processing: annotations of grammatically wrong sentences, error labels of clauses, blended sentences, etc.; (5) discourse processing: error flags for cohesion between sentences and for semantic expression.
2) process the corpus;
Since the "HSK dynamic composition corpus" is a manually annotated corpus of examinees' compositions, the corpus is first processed into a standard composition corpus according to the correction annotations: following the errors marked in the corpus and the corrections provided, the annotated compositions are turned into standardized compositions. These standardized composition samples are used as the standard corpus for training the LDA model. In addition, 10,000 compositions by primary and middle school students were obtained from the internet to enrich and supplement the corpus.
The steps of training the LDA topic model are:
When the LDA algorithm starts, the parameters θ_d and φ_t are given randomly, where θ_d represents the topic distribution of document d and φ_t represents the word distribution of topic t; the process described in the following steps A, B, and C is then iterated until convergence, and the converged result is the output of LDA:
A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:
p_j(w_i | d_s) = p(w_i | t_j) × p(t_j | d_s);
B. Enumerate the topics in the topic set T to obtain all p_j(w_i | d_s), where j takes values from 1 to k; a topic can then be selected for the i-th word w_i in d_s according to these probability values;
C. Computing p(w | d) and reselecting the topic for every word w in the vocabulary set D counts as one iteration.
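Steps A, B, and C above can be sketched as a single sweep in Python. The conditional distributions p(w|t) and p(t|d) are hard-coded toy values here; in the actual method they come from the current state of the model.

```python
import random

# One sweep of steps A-C: for each word, weight every topic by
# p_j(w_i|d_s) = p(w_i|t_j) * p(t_j|d_s), then resample its topic.
def resample_topics(doc_words, p_w_given_t, p_t_given_d, topics, rng):
    assignments = []
    for w in doc_words:
        # Steps A/B: one weight per topic j (tiny floor avoids zero totals).
        weights = [p_w_given_t[t].get(w, 1e-12) * p_t_given_d[t] for t in topics]
        # Step B: draw a topic for this word in proportion to the weights.
        assignments.append(rng.choices(topics, weights=weights, k=1)[0])
    return assignments  # step C: doing this for every word is one iteration

rng = random.Random(0)
topics = ["farming", "science"]
p_w_given_t = {"farming": {"land": 0.9, "lab": 0.0},
               "science": {"land": 0.1, "lab": 1.0}}
p_t_given_d = {"farming": 0.5, "science": 0.5}
z = resample_topics(["land", "lab"], p_w_given_t, p_t_given_d, topics, rng)
```

Repeating this sweep until the assignments stabilize is the iteration-to-convergence loop the text describes.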
Cross entropy measures the difference between two functions or probability distributions: the greater the difference, the greater the relative entropy; the smaller the difference, the smaller the relative entropy. Cross entropy is therefore used to choose sentences and construct the generated text.
The steps of computing cross entropy are:
For probability distributions p and q, the cross entropy is
H(p, q) = -Σ_x p(x) log q(x);
where H(p, q) = H(p) + D_KL(p‖q), i.e., the entropy of p plus the Kullback-Leibler divergence from p to q.
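A minimal implementation of the cross-entropy formula above, with a small smoothing constant (an assumption, since the patent does not say how zero probabilities are handled):

```python
from math import log

# Cross entropy H(p, q) = -sum_x p(x) log q(x) over two discrete
# distributions given as aligned probability lists.
def cross_entropy(p, q, eps=1e-12):
    return -sum(pi * log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
h_self = cross_entropy(p, p)            # H(p, p) is just the entropy of p
h_other = cross_entropy(p, [0.9, 0.1])  # a mismatched q costs more
```

Since H(p, q) = H(p) + D_KL(p‖q) and the divergence is nonnegative, a sentence whose topic distribution matches the document's minimizes the cross entropy, which is exactly the selection rule used here.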
Latent Dirichlet Allocation (LDA) is a topic model that can give the topics of each document in a document collection in the form of probability distributions. In natural language processing, LDA is a generative statistical model that explains observations through unobserved groups, thereby explaining the similarity of certain parts of the data.
In the LDA model, each document can be viewed as a mixture of various topics, where each document is considered to have its own topic distribution assigned to it by LDA.
LDA is a typical bag-of-words model: it regards a document as a set of words that exist independently of one another, with no sequential relationship. A document covers only a small fraction of the topics, and each topic often uses only a small fraction of the words. Therefore, the document-topic distribution and the topic-word distribution can be obtained through LDA.
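The bag-of-words assumption can be illustrated in a few lines with the standard library: only word counts survive, so any reordering of the document yields the same representation.

```python
from collections import Counter

# A document as a bag of words: counts only, word order discarded.
doc = "the land feeds the people and the people work the land".split()
bag = Counter(doc)

# Any reordering of the words yields the same bag.
shuffled = Counter(reversed(doc))
```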
The Beta distribution is the conjugate prior probability distribution of the binomial distribution: for nonnegative real numbers, the following relationship holds:
Beta(p | α, β) + Count(m_1, m_2) = Beta(p | α + m_1, β + m_2) (2.1);
In the formula, Count(m_1, m_2) is the count of binomial observations. Here, the observed data obey a binomial distribution, and the prior and posterior distributions of the parameter both obey Beta distributions. In this case we speak of Beta-Binomial conjugacy.
The Dirichlet distribution is the conjugate prior probability distribution of the multinomial distribution. Extending the counts from discrete integer sets to continuous real sets gives the general expression:
Dir(p | α) + MultCount(m) = Dir(p | α + m) (2.2);
Similarly, in the formula, MultCount(m) is the count of multinomial observations. Here the observed data obey a multinomial distribution, and the prior and posterior distributions of the parameter are both Dirichlet distributions. In this case we speak of Dirichlet-Multinomial conjugacy.
It can be seen that the Dirichlet distribution is the conjugate prior probability distribution of the multinomial distribution, just as the Beta distribution is the conjugate prior probability distribution of the binomial distribution.
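The conjugacy relation above amounts to adding observed counts to the prior parameters. A tiny numeric check, with illustrative values only (the Beta-Binomial case is simply the two-component version):

```python
# Dirichlet-Multinomial conjugacy: Dir(alpha) + counts(n) -> Dir(alpha + n).
def dirichlet_posterior(alpha, counts):
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    # Mean of Dir(alpha) is alpha normalized to sum to one.
    s = sum(alpha)
    return [a / s for a in alpha]

prior = [1.0, 1.0, 1.0]   # symmetric prior over 3 topics
counts = [8, 1, 1]        # observed topic assignments n_m
post = dirichlet_posterior(prior, counts)
mean = dirichlet_mean(post)   # posterior mean shifts toward topic 0
```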
Therefore, with the LDA generative model, the method of generating a text is as follows:
A. Sample the topic distribution θ_i of document i from the Dirichlet distribution α;
B. Sample the topic z_ij of the j-th word of document i from the multinomial topic distribution θ_i;
C. Sample the word distribution φ_{z_ij} of topic z_ij from the Dirichlet distribution β;
D. Sample the word w_ij from the multinomial word distribution φ_{z_ij}.
The generative process by which the LDA model produces a text containing multiple topics is:
choose the parameter θ ~ p(θ);
for each of the N words w_n:
choose a topic z_n ~ p(z | θ);
choose a word w_n ~ p(w | z);
Here θ is a topic vector, a nonnegative normalized vector whose components give the probability of each topic occurring in the document; p(θ) is the distribution over θ, a Dirichlet distribution; N and w_n are as above; z_n denotes the chosen topic; p(z | θ) denotes the probability distribution of topic z given θ, specifically p(z = i | θ) = θ_i; and p(w | z) denotes the probability distribution of word w given topic z.
The process above first selects a topic vector θ, which gives the probability of each topic in the document; a topic z is then selected from the topic distribution vector θ, and from the topic-word distribution a topic-related word is generated according to the word probability distribution of topic z.
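The sampling steps above can be sketched with the standard library alone, using the fact that a Dirichlet draw is a set of Gamma draws normalized to sum to one. The vocabulary and topic-word distributions below are toy assumptions:

```python
import random

# A Dirichlet(alphas) draw: K independent Gamma(alpha, 1) draws, normalized.
def sample_dirichlet(alphas, rng):
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def generate_doc(n_words, alpha, phi, vocab, rng):
    theta = sample_dirichlet(alpha, rng)   # document's topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # topic ~ Mult(theta)
        w = rng.choices(vocab, weights=phi[z])[0]             # word ~ Mult(phi_z)
        words.append(w)
    return theta, words

rng = random.Random(42)
vocab = ["land", "grain", "lab", "data"]
phi = [[0.5, 0.5, 0.0, 0.0],   # topic 0: farming words
       [0.0, 0.0, 0.5, 0.5]]   # topic 1: science words
theta, doc = generate_doc(5, [2.0, 2.0], phi, vocab, rng)
```

In the patent's setting the model is run the other way around: given the words, the hidden theta and z are inferred by Gibbs sampling.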
The model is shown in Fig. 1. The joint distribution of all visible and hidden variables in the entire model is therefore
p(w_i, z_i, θ_i, φ | α, β) (2.3)
Integrating over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:
p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ (2.4)
Mapping this joint probability distribution onto the graphical model gives Fig. 2.
Assume there are M texts in the corpus, where all the words w and their corresponding topics z are as follows:
w = (w_1, ..., w_M) (2.5)
z = (z_1, ..., z_M) (2.6)
w_m denotes the words in the m-th text, and z_m denotes the numbers of the topics corresponding to these words.
From this we can analyze the process α → θ → z in the figure: suppose the m-th text is to be generated; first consult the text-topic distribution of the m-th text, then generate the topic number z_{m,n} of its n-th word. Since the process α → θ corresponds to a Dirichlet distribution and the process θ → z corresponds to a multinomial distribution, the whole process is a Dirichlet-Multinomial conjugate structure.
In the process β → w in the figure, the topic with number z_{m,n} is looked up in the word-topic distribution and a word under that topic is selected, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus can be generated.
In addition, since the LDA model is a bag-of-words model, the processes α → θ → z and β → w are mutually independent and have no chronological order. Thus, in the LDA model, the M documents correspond to M independent Dirichlet-Multinomial conjugate structures; similarly, the k topics correspond to k independent Dirichlet-Multinomial conjugate structures.
Since
p(z_m | α) = Δ(n_m + α) / Δ(α) (2.7)
where n_m = (n_m^(1), ..., n_m^(k)) and n_m^(k) denotes the number of words corresponding to the k-th topic in the m-th text, it follows from the Dirichlet-Multinomial conjugate structure that the posterior distribution of θ_m is Dir(θ_m | n_m + α).
Because the topic-generation processes of the M documents in the text set are independent of one another, we obtain M mutually independent Dirichlet-Multinomial conjugate structures, and the probability of the topics generated over the entire text set can thus be computed as
p(z | α) = Π_{m=1}^{M} Δ(n_m + α) / Δ(α) (2.8)
Here Δ(·) denotes the Dirichlet normalization constant.
Likewise, we have
w' = (w^(1), ..., w^(k)) (2.9)
z' = (z^(1), ..., z^(k)) (2.10)
w^(k) denotes the words that are all generated by topic k, and z^(k) the numbers of the topics corresponding to these words. Since any two words generated by topic k in the text are mutually independent and exchangeable, the whole process is again a Dirichlet-Multinomial conjugate structure.
Here we have
n_k = (n_k^(1), ..., n_k^(t)) (2.11)
where n_k^(t) is the number of occurrences of word t among the words generated by topic k. Further, the probability of the words generated over the entire text set is
p(w | z, β) = Π_{j=1}^{k} Δ(n_j + β) / Δ(β) (2.12)
Merging the formulas gives
p(w, z | α, β) = p(z | α) · p(w | z, β) (2.13)
Finally, the Gibbs sampling formula of the LDA model is obtained as:
p(z_i = j | z_¬i, w) ∝ (n_{m,¬i}^(j) + α_j) · (n_{j,¬i}^(t) + β_t) / Σ_{t'=1}^{V} (n_{j,¬i}^(t') + β_{t'}) (2.14)
where ¬i denotes that the counts exclude the current assignment of word i, and V is the vocabulary size.
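The sampling formula above can be written directly as code. This is a standard collapsed-Gibbs sketch under the notation used here, not the patent's exact implementation; the counts are illustrative:

```python
# Unnormalized probability of assigning each topic to the current word:
# (n_m^(j) + alpha) * (n_j^(t) + beta) / (n_j + V*beta), with the current
# assignment already excluded from the counts by the caller.
def gibbs_weights(doc_topic, topic_word, topic_total, word, alpha, beta, V):
    K = len(doc_topic)
    return [(doc_topic[j] + alpha) *
            (topic_word[j].get(word, 0) + beta) / (topic_total[j] + V * beta)
            for j in range(K)]

doc_topic = [3, 1]                       # n_m^(j) for the current document
topic_word = [{"land": 4}, {"land": 0}]  # n_j^(t) for the current word
topic_total = [10, 10]                   # total word count per topic
w = gibbs_weights(doc_topic, topic_word, topic_total, "land", 0.1, 0.01, 1000)
# Topic 0 should dominate: frequent in the document and emits "land" often.
```

Normalizing these weights and sampling from them, word by word, is one Gibbs sweep; the document-topic and topic-word distributions are read off the counts after convergence.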
In the present embodiment, composition generation is carried out with the topic keywords "nature, food, mankind, science, arable land"; the steps are as follows:
(1) select a training dataset and train the LDA topic model to obtain the distribution of the main contents of the text set and the distribution of each sentence over the topics;
(2) compute the cross entropy between each candidate sentence and the document and select the sentences with smaller cross entropy;
(3) arrange the candidate sentences according to their relative positions in the original candidate texts;
(4) output the automatically generated composition.
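Under stated assumptions (a trained model is stood in for by hard-coded topic distributions, and all names and sentences are illustrative), steps (1) to (4) can be sketched end to end:

```python
from math import log

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x) with a small smoothing floor.
    return -sum(pi * log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def generate_composition(doc_topics, candidates, top_n=2):
    """candidates: list of (position, sentence, sentence_topic_distribution)."""
    # Step (2): rank candidate sentences by cross entropy against the document.
    ranked = sorted(candidates, key=lambda c: cross_entropy(doc_topics, c[2]))
    # Step (3): restore source order among the selected sentences.
    chosen = sorted(ranked[:top_n], key=lambda c: c[0])
    # Step (4): emit the composition.
    return " ".join(s for _, s, _ in chosen)

doc_topics = [0.7, 0.3]   # step (1): document topic mixture (assumed trained)
candidates = [
    (0, "Nature feeds mankind.",        [0.8, 0.2]),
    (1, "Science is far off topic.",    [0.1, 0.9]),
    (2, "Arable land must be guarded.", [0.6, 0.4]),
]
essay = generate_composition(doc_topics, candidates)
```

The off-topic sentence has the largest cross entropy and is dropped; the two on-topic sentences are kept in their source order.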
The automatically generated composition is evaluated with an automatic scoring system.
Text generation is carried out with the trained LDA model, with the length of the text controlled at about 200 characters. The quality of the generated text is then evaluated; the HSK Level 5 composition scoring criteria are shown in Table 1:
Table 1 HSK composition scoring criteria
The composition generated by the present embodiment is as follows:
From ancient times to the present, clothing, food, and shelter have always been the most important problems that nature leaves to mankind, and also the ones that trouble mankind the most. Nowadays an insufficient supply of food makes all of mankind suffer from hunger; with the development of civilization and the growth of population, people have to borrow the strength of science to solve the problem of food shortages caused by population growth. But this violates the rules of nature: although it can satisfy human wants in the short term, people have come to realize that the constant heavy use of chemical fertilizers and pesticides gradually reduces the arable area. Peasants in many countries of the world produce "green food", for people wish for food that is not polluted. If chemical fertilizers are not used, grain yields will fall greatly and cannot support all of mankind on earth. But if mankind always increases grain yields with pesticides and chemical fertilizers, without taking reasonable measures, the consequences will be more serious. I think an unhealthy family will not be happy. So in the short term mankind must use arable land rationally while waiting for science to develop further. Just as the ancients created farmland, I think mankind will not yield to nature so easily. I firmly believe that the owner of the earth will forever be mankind.
It can be seen that the text includes all the keywords; the theme is clear, the content is relevant and coherent, and the logic is clear, meeting the standard of a top-grade composition. Using the method of the invention, the present embodiment completes the task of writing according to keywords well; the generated text develops around the keywords and suits the theme.
In the HSK composition generation method based on a topic model provided by the invention, an LDA topic model is trained to obtain the distributions relating sentences to texts and words to texts; cross entropy is then computed to select the sentences most similar to the topic keywords, and the text is generated. The automatically generated text is coherent and logically well organized, with few grammatical errors and few wrongly written characters; the method completes the writing task well and meets the needs of practical application.
The above embodiments only express embodiments of the present invention, and their description is relatively specific and detailed, but they cannot therefore be construed as limiting the scope of the patent of the present invention. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these belong to the protection scope of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (9)

  1. An HSK composition generation method based on a topic model, characterized by comprising: training an LDA model to obtain the distributions relating sentences to texts and words to texts, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text.
  2. The composition generation method based on a topic model according to claim 1, characterized in that a training dataset is selected for training the LDA model, and selecting the training dataset comprises: selecting the "HSK dynamic composition corpus" as the base corpus; first processing the corpus into a standard composition corpus according to the correction annotations in the corpus, i.e., turning the annotated compositions into standardized compositions according to the errors marked in the corpus and the corrections provided; and using these standardized composition samples as the standard corpus for training the LDA model.
  3. The composition generation method based on a topic model according to claim 1 or 2, characterized in that the step of training the LDA model comprises:
    When the LDA algorithm starts, the parameters θ_d and φ_t are initialized randomly; the following steps A, B and C are then iterated and learned continuously, and the converged result is the output of LDA;
    A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:
    p_j(w_i | d_s) = p(w_i | t_j) × p(t_j | d_s);
    B. Enumerating the topics in the topic set T yields all p_j(w_i | d_s), where j takes values 1 to k; according to these probability values, a topic can then be selected for the i-th word w_i in d_s;
    C. Computing p(w | d) and reselecting a topic for every word w in the vocabulary set D counts as one iteration.
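Steps A–C above can be sketched as a single update pass. This is a minimal illustration, assuming the topic-word probabilities p(w|t) and document-topic probabilities p(t|d) are already available as plain Python structures:

```python
import random

def reassign_topics(docs, n_topics, p_w_given_t, p_t_given_d, seed=0):
    """One iteration of steps A-C: for each word w_i of document d_s,
    score every topic t_j by p(w_i|t_j) * p(t_j|d_s) (step A), normalize
    over all topics and draw a new topic from that distribution (step B);
    doing this for every word of every document is one iteration (step C).

    p_w_given_t: list of {word: prob} dicts, one per topic.
    p_t_given_d: list of per-document topic probability lists."""
    rng = random.Random(seed)
    assignments = []
    for d, words in enumerate(docs):
        topics = []
        for w in words:
            scores = [p_w_given_t[t].get(w, 1e-12) * p_t_given_d[d][t]
                      for t in range(n_topics)]
            total = sum(scores)
            topics.append(rng.choices(range(n_topics),
                                      weights=[s / total for s in scores])[0])
        assignments.append(topics)
    return assignments
```

In a full trainer, the counts behind p(w|t) and p(t|d) would be re-estimated from these assignments between iterations until convergence.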
  4. The composition generation method based on a topic model according to any one of claims 1 to 3, characterized in that the step of calculating cross entropy comprises:
    The cross entropy is
    H(p, q) = -Σ_x p(x) log q(x);
    wherein p(x) is the topic distribution of the document and q(x) is the topic distribution of the candidate sentence.
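A minimal sketch of the cross-entropy selection, assuming each candidate sentence and the document are represented by topic distributions obtained from the trained LDA model:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x), with p the document's topic
    distribution and q a candidate sentence's; q is clamped by eps so a
    zero probability does not blow up the logarithm."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def most_similar(doc_dist, sentence_dists):
    # The sentence with the lowest cross entropy against the document's
    # topic distribution is the one the method selects.
    return min(range(len(sentence_dists)),
               key=lambda i: cross_entropy(doc_dist, sentence_dists[i]))
```

Lower cross entropy means the sentence's topic mixture is closer to the document's, which is why the method keeps the smaller-valued candidates.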
  5. The composition generation method based on a topic model according to any one of claims 1 to 4, characterized in that the specific steps of the HSK composition generation method based on a topic model are: selecting a training data set to train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence over the topics; calculating the cross entropy between candidate sentences and the document, and selecting the sentences with smaller cross entropy; arranging the candidate sentences according to their relative position parameters in the original candidate texts; and outputting the automatically generated composition.
  6. The composition generation method based on a topic model according to any one of claims 1 to 5, characterized in that the method of generating a text with the LDA model comprises:
    a. sampling from the Dirichlet distribution α to generate the topic distribution θ_i of document i;
    b. sampling from the multinomial topic distribution θ_i to generate the topic z_ij of the j-th word of document i;
    c. sampling from the Dirichlet distribution β to generate the word distribution φ_{z_ij} of topic z_ij;
    d. sampling from the multinomial word distribution φ_{z_ij} to finally generate the word w_ij.
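The four sampling steps of claim 6 can be sketched with numpy. This is an illustrative toy with symmetric α and β vectors; note that step c is performed once per topic here (the standard LDA formulation) rather than once per word:

```python
import numpy as np

def generate_document(alpha, beta, n_words, seed=0):
    """Sample one document following claim 6's generative story:
    theta_i ~ Dir(alpha) (step a); phi_k ~ Dir(beta) per topic k (step c);
    then for each word, z_ij ~ Multinomial(theta_i) (step b) and
    w_ij ~ Multinomial(phi_{z_ij}) (step d)."""
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = len(alpha), len(beta)
    theta = rng.dirichlet(alpha)              # document-topic mixture
    phi = rng.dirichlet(beta, size=n_topics)  # one word distribution per topic
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)
        words.append(int(rng.choice(vocab_size, p=phi[z])))
        topics.append(int(z))
    return words, topics
```

Running `generate_document([1.0]*3, [1.0]*5, 10)` yields ten word ids drawn from a 5-word vocabulary under a 3-topic model.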
  7. The composition generation method based on a topic model according to any one of claims 1 to 6, characterized in that the joint distribution of all visible and hidden variables in the LDA model is
    p(w_i, z_i, θ_i, φ | α, β);
    integrating the formula over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:
    p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ;
    Assume there are M texts in the corpus; all the words w and the topics z corresponding to the words are as follows:
    W = (w_1, ..., w_M);
    Z = (z_1, ..., z_M);
    where w_m denotes the words in the m-th text and z_m denotes the numbers of the topics corresponding to these words.
  8. The composition generation method based on a topic model according to any one of claims 1 to 7, characterized in that, in the LDA model, assume that the m-th text is now to be generated: first consult the text-topic distribution of the m-th text, then generate the topic number z_{m,n} of its n-th word;
    In the word-topic distribution, find the topic numbered z_{m,n} and select a word under that topic, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus is generated;
    In the LDA model, the M documents correspond to M independent Dirichlet-Multinomial conjugate structures, and the K topics correspond to K independent Dirichlet-Multinomial conjugate structures;
    According to the Dirichlet-Multinomial conjugate structure, the posterior distribution of θ_m is Dir(θ_m | n_m + α);
    The probability that the topics of the entire text set are generated is calculated as
    p(z | α) = ∏_{m=1}^{M} Δ(n_m + α) / Δ(α);
    Reordering the corpus words by topic gives
    W′ = (w^{(1)}, ..., w^{(K)});
    Z′ = (z^{(1)}, ..., z^{(K)});
    where w^{(k)} denotes the words generated by topic k and z^{(k)} the topic numbers corresponding to these words;
    This gives
    n_k = (n_k^{(1)}, ..., n_k^{(V)});
    where n_k^{(t)} is the number of times word t is generated by topic k;
    The probability that the words of the entire text set are generated is
    p(w | z, β) = ∏_{k=1}^{K} Δ(n_k + β) / Δ(β);
    Combining the two gives
    p(w, z | α, β) = p(z | α) · p(w | z, β) = ∏_{m=1}^{M} Δ(n_m + α)/Δ(α) · ∏_{k=1}^{K} Δ(n_k + β)/Δ(β);
    Finally, the Gibbs sampling formula of the LDA model is obtained as:
    p(z_i = k | z_{¬i}, w) ∝ (n_{m,¬i}^{(k)} + α_k) / Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k) · (n_{k,¬i}^{(t)} + β_t) / Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t).
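The final sampling formula can be sketched directly from the count matrices. The names below are illustrative: `ndk_d` holds the per-topic token counts of the current document, `nkw_w` the per-topic counts of the current word, and `nk` the total token count per topic, all with the current token already excluded; the document-side normalizer is constant in k, so it cancels under the normalization:

```python
import numpy as np

def gibbs_conditional(ndk_d, nkw_w, nk, alpha, beta, vocab_size):
    """Collapsed Gibbs conditional the derivation above arrives at:
    p(z_i = k | z_-i, w) is proportional to
        (n_{m,-i}^{(k)} + alpha) * (n_{k,-i}^{(t)} + beta) / (n_k + V*beta).
    ndk_d[k]: tokens of topic k in document m; nkw_w[k]: assignments of
    word t to topic k; nk[k]: total tokens of topic k (current token
    excluded everywhere). Symmetric alpha/beta scalars for simplicity."""
    weights = (ndk_d + alpha) * (nkw_w + beta) / (nk + vocab_size * beta)
    return weights / weights.sum()
```

A sampler would draw the new topic of the current token from the returned distribution and then add the token back into the counts.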
  9. The composition generation method based on a topic model according to any one of claims 1 to 8, characterized in that the method comprises the following steps:
    (1) selecting a training data set to train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence over the topics;
    (2) calculating the cross entropy between candidate sentences and the document, and selecting the sentences with smaller cross entropy;
    (3) arranging the candidate sentences according to their relative position parameters in the original candidate texts;
    (4) outputting the automatically generated composition.
CN201811202083.7A 2018-10-16 2018-10-16 A kind of HSK composition generation method based on topic model Pending CN109376347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811202083.7A CN109376347A (en) 2018-10-16 2018-10-16 A kind of HSK composition generation method based on topic model


Publications (1)

Publication Number Publication Date
CN109376347A true CN109376347A (en) 2019-02-22

Family

ID=65400554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811202083.7A Pending CN109376347A (en) 2018-10-16 2018-10-16 A kind of HSK composition generation method based on topic model

Country Status (1)

Country Link
CN (1) CN109376347A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120095952A1 (en) * 2010-10-19 2012-04-19 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
CN107967257A (en) * 2017-11-20 2018-04-27 哈尔滨工业大学 A kind of tandem type composition generation method
CN108090231A (en) * 2018-01-12 2018-05-29 北京理工大学 A kind of topic model optimization method based on comentropy


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, Yanhua et al.: "HSK Composition Generation Based on the LDA Model" (基于LDA模型的HSK作文生成), Data Analysis and Knowledge Discovery (数据分析与知识发现) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182210A (en) * 2020-09-25 2021-01-05 四川华空天行科技有限公司 Language generation model based on composition data feature classifier and writing support method
CN112182210B (en) * 2020-09-25 2023-11-24 四川华空天行科技有限公司 Language generation model based on composition and theory data feature classifier and composition supporting method
CN112667806A (en) * 2020-10-20 2021-04-16 上海金桥信息股份有限公司 Text classification screening method using LDA
CN114330251A (en) * 2022-03-04 2022-04-12 阿里巴巴达摩院(杭州)科技有限公司 Text generation method, model training method, device and storage medium
CN114330251B (en) * 2022-03-04 2022-07-19 阿里巴巴达摩院(杭州)科技有限公司 Text generation method, model training method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190222