CN109376347A - A kind of HSK composition generation method based on topic model - Google Patents
A kind of HSK composition generation method based on topic model
- Publication number
- CN109376347A CN109376347A CN201811202083.7A CN201811202083A CN109376347A CN 109376347 A CN109376347 A CN 109376347A CN 201811202083 A CN201811202083 A CN 201811202083A CN 109376347 A CN109376347 A CN 109376347A
- Authority
- CN
- China
- Prior art keywords
- theme
- distribution
- text
- composition
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to an HSK composition generation method based on a topic model, comprising: training an LDA model to obtain the distributions of sentences and text and of words and text, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text. By training an LDA topic model, obtaining these distributions, and selecting sentences by cross entropy against the topic keywords, the method generates text that is coherent and logically consistent, with few grammatical errors and few wrong characters; it completes the writing task well and meets the needs of practical application.
Description
Technical field
The invention belongs to the field of text information processing, and in particular relates to an HSK composition generation method based on a topic model.
Background technique
In an era of rapid development of the IT industry and the internet, people have long dreamed of enabling computers to process natural language, so that hidden information and knowledge can be discovered and mined from large volumes of unstructured text. Artificial intelligence (AI) technology is growing rapidly. In 1997, Deep Blue, developed by IBM, defeated the world chess champion Garry Kasparov; in March 2016, AlphaGo, with its Monte Carlo tree search algorithm, defeated Lee Sedol. These are important milestones in artificial intelligence research.
On the other hand, the combination of AI and big data has brought unprecedented development to natural language processing. Because AI systems work by rule-based logical reasoning, they are well suited to procedural work and to tasks that involve large volumes of data and demand timeliness. Big data has supported ground-breaking ideas in many industries, even challenging the traditional framework of writing. With the development of computer technology and artificial intelligence, literary writing, once regarded as a product of the highest human intelligence, has entered the era of "computer manufacture", bringing with it changes in writing ideas, writing behavior, and ways of thinking about writing. Natural language generation is the natural language processing task in which a machine representation system, such as a knowledge base or logical form, generates natural language. In this sense, a natural language generation system is like a translator that converts data into a natural language representation. However, owing to the inherent expressivity of natural language, the methods for generating the final language differ from compiler methods.
The Test of Chinese Language Ability for Foreigners (HSK) is an international standardized examination of Chinese proficiency, established to test the Chinese level of speakers whose mother tongue is not Chinese. It is comparable to the CET-4 and CET-6 examinations and to the TOEFL and IELTS examinations for English. Research on English examinations, and especially on English examination writing, has achieved great success both at home and abroad. However, research on HSK writing is still limited, especially research on answering such questions automatically with existing natural language processing technology. As HSK spreads worldwide, more and more overseas learners of Chinese are taking the HSK examination, and domestic research on the HSK examination is also increasing.
The writing section mainly examines word order, grammar, content, and the logic of language, which makes it a good project for research on natural language generation. The writing task looks like a relatively difficult challenge, but by analyzing the task and training machine learning models, it can be converted into a trainable text generation task. With the continuous development of big data technology, natural language processing, and other artificial intelligence technologies, the exploration and practice of automatically generating news reports with algorithms has gradually begun. The continuous practice and development of automated news writing has confirmed that artificial intelligence technology can help people process and integrate data quickly and conveniently, and it will change the content and channels of the news media. However, text automatically generated by the prior art is poor in coherence and logic, with many grammatical errors and many wrong characters; these problems urgently need improvement.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide an HSK composition generation method based on a topic model that avoids the above technical defects.
In order to achieve the above-mentioned object of the invention, technical solution provided by the invention is as follows:
An HSK composition generation method based on a topic model, comprising: training an LDA model to obtain the distributions of sentences and text and of words and text, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text.
Further, a training dataset is selected to train the LDA model. The step of selecting the training dataset includes: selecting the "HSK Dynamic Composition Corpus" as the basic corpus; processing the corpus into a standard composition corpus according to the correction annotations of the compositions in it, that is, turning each annotated composition into a standardized composition according to the errors marked in the corpus and the corrections provided; and using these standardized composition samples as the standard corpus for training the LDA model.
Further, the step of training the LDA model includes:

When the LDA algorithm starts, the parameters θ_d and φ_t are given randomly; the process described in the following steps A, B, and C is then iterated until convergence, and the converged result is the output of LDA:

A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:

p_j(w_i|d_s) = p(w_i|t_j) × p(t_j|d_s);

B. Enumerate the topics in the topic set T to obtain all p_j(w_i|d_s), where j takes values 1 to k; a topic can then be selected for the i-th word w_i in d_s according to these probability values;

C. Computing p(w|d) and reselecting the topic for every word w in the vocabulary set D counts as one iteration.
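The per-word resampling loop of steps A to C can be sketched as follows. This is a minimal illustration of the stated scheme, assuming the current estimates p(w|t) and p(t|d) are available as plain dictionaries and lists; the function and variable names (`reassign_topics`, `topic_word_p`, `doc_topic_p`) are invented for the example and do not come from the patent:

```python
import random

def reassign_topics(docs, topic_word_p, doc_topic_p, k):
    """One iteration of steps A-C: for every word w in every document d,
    compute p_j(w|d) = p(w|t_j) * p(t_j|d) for each topic j, then
    re-draw the word's topic from those scores."""
    assignments = []
    for d, words in enumerate(docs):
        doc_assign = []
        for w in words:
            # Steps A/B: score each topic j for this word
            scores = [topic_word_p[j].get(w, 1e-12) * doc_topic_p[d][j]
                      for j in range(k)]
            total = sum(scores)
            probs = [s / total for s in scores]
            # Draw a topic index according to the normalized probabilities
            doc_assign.append(random.choices(range(k), weights=probs)[0])
        assignments.append(doc_assign)
    return assignments
```

In a full trainer this loop would alternate with re-estimating θ_d and φ_t from the new assignments until the result converges.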
Further, the step of computing the cross entropy includes:

For probability distributions p and q, the cross entropy is

H(p, q) = H(p) + D_KL(p‖q),

where, for discrete distributions,

H(p, q) = −Σ_x p(x) log q(x).
Further, the specific steps of the HSK composition generation method based on the topic model are as follows: select a training dataset to train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence under the topics; compute the cross entropy between each candidate sentence and the document, and select the sentences with smaller cross entropy; arrange the candidate sentences according to their relative positions in the original candidate texts; and output the automatically generated composition.
Further, the method of generating a text with the LDA model includes:

A. sampling from the Dirichlet distribution α to generate the topic distribution θ_i of document i;

B. sampling from the multinomial topic distribution θ_i to generate the topic z_ij of the j-th word of document i;

C. sampling from the Dirichlet distribution β to generate the word distribution φ_{z_ij} of topic z_ij;

D. sampling from the multinomial word distribution φ_{z_ij} to finally generate the word w_ij.
Further, in the LDA model the joint distribution of all visible and hidden variables is

p(w_i, z_i, θ_i, φ | α, β);

integrating over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:

p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ;

assume there are M texts in the corpus; all words w and their corresponding topics z are

w = (w_1, …, w_M);

z = (z_1, …, z_M);

where w_m denotes the words in the m-th text and z_m denotes the numbers of the topics corresponding to those words.
Further, in the LDA model, suppose the m-th text is to be generated: first consult the text–topic distribution of the m-th text, then generate the topic number z_{m,n} of the n-th word;

in the word–topic distribution, look up the topic numbered z_{m,n} and select a word under that topic, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus can be generated;

in the LDA model, the M documents correspond to M independent Dirichlet–Multinomial conjugate structures, and the K topics correspond to K independent Dirichlet–Multinomial conjugate structures;

where n_m = (n_m^(1), …, n_m^(K)) and n_m^(k) denotes the number of words corresponding to the k-th topic in the m-th text; according to the Dirichlet–Multinomial conjugate structure, the posterior distribution of θ_m is Dir(θ_m | n_m + α);

the probability of the topics generated over the entire text set is computed as

p(z | α) = Π_{m=1}^{M} Δ(n_m + α)/Δ(α);

with

w′ = (w^(1), …, w^(K));

z′ = (z^(1), …, z^(K));

where w^(k) denotes the words generated by topic k and z^(k) the numbers of the topics corresponding to those words; with

n_k = (n_k^(1), …, n_k^(V));

where n_k^(t) is the number of occurrences of word t among the words generated by topic k;

the probability of the words generated over the entire text set is

p(w | z, β) = Π_{k=1}^{K} Δ(n_k + β)/Δ(β);

finally, the Gibbs sampling formula of the LDA model is obtained:

p(z_i = k | z_¬i, w) ∝ (n_{m,¬i}^(k) + α_k) · (n_{k,¬i}^(t) + β_t) / Σ_{t=1}^{V} (n_{k,¬i}^(t) + β_t).
Further, the HSK composition generation method based on the topic model comprises the following steps:

(1) select a training dataset to train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence under the topics;

(2) compute the cross entropy between each candidate sentence and the document, and select the sentences with smaller cross entropy;

(3) arrange the candidate sentences according to their relative positions in the original candidate texts;

(4) output the automatically generated composition.
By training an LDA topic model, the HSK composition generation method provided by the present invention obtains the distributions of sentences and text and of words and text, computes cross entropy to select the sentences most similar to the topic keywords, and then generates the text. The automatically generated text is coherent and logical, with few grammatical errors and few wrong characters; the method completes the writing task well and meets the needs of practical application.
Brief description of the drawings
Fig. 1 is LDA basic model figure;
Fig. 2 is LDA joint ensemble figure.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention, without creative work, shall fall within the protection scope of the present invention.
As shown in Fig. 1, an HSK composition generation method based on a topic model comprises the steps of: selecting training data to train an LDA topic model, obtaining the distributions of sentences and text and of words and text, computing cross entropy to select the sentences most similar to the topic keywords, and then generating the text.
The first writing task of HSK Level 5 asks the candidate to write a short text using given words. The first such question of the Level 5 examination of the new HSK in July 2013 was as follows:

Using the following words (all must be used, in no particular order), write a short essay of about 80 characters.

Given words (as translated): application, biographic information, outstanding, characteristic.
Writing according to the words given in the question stem is a typical material-based composition task. It is easy to see that the five words given in the stem are by no means unrelated. Therefore, one can first determine a theme and then expand the writing around that theme, using all the words given in the stem. Analyzing the above example, the theme can be judged to be "job hunting"; the essay can therefore be written around the theme of "job hunting".
The basic steps of writing can thus be given:

(1) Determine the theme according to the words given in the stem. The writing task is carried out around this theme throughout.

(2) Make sentences with the given words. Because the stem requires that all the given words be used, sentences can be made with them, but it must be ensured that the sentences produced are consistent with the theme.

From the above analysis of this writing task, it is easy to see that the task focuses on expanding the writing around a fixed theme while ensuring that all the words listed in the stem appear in the essay.
Therefore, if a machine is to automatically generate a composition and complete this writing task, it must write according to the theme. From this point of view, the present invention proposes an automatic writing method based on the LDA topic model to realize the machine's automatic composition task.
For this writing task, the present invention proposes an automatic writing method based on a topic model and completes the generation of the text with a sentence-extraction strategy. The sentence-extraction strategy mainly extracts appropriate sentences from candidate texts and then sorts and combines them to generate a complete essay. Sentence extraction based on the topic model selects and extracts sentences from the candidate texts through topics and keywords.

Therefore, when generating a text, several corresponding words are first generated on the basis of the given keywords. Then sentences are screened and extracted from the candidate texts with these words. Finally, the extracted sentences are sorted by their relative positions in the candidate texts, finally obtaining a generated text.
The step of selecting the training dataset includes:

1) selecting the "HSK Dynamic Composition Corpus" as the basic corpus;

The "HSK Dynamic Composition Corpus" is a scientific research project of the NOCFL presided over by Professor Cui Xiliang of Beijing Language and Culture University. It is a corpus of answer sheets of the essay examination of the advanced Test of Chinese Language Ability for Foreigners (advanced HSK) taken by foreigners whose mother tongue is not Chinese. After many years of revision and supplementation, the corpus initially collected 11,569 compositions. The raw corpus contains the examinees' composition answer sheets together with very detailed information such as the examinees' composition scores. In addition, the annotated corpus comprehensively marks up the errors in the examinees' compositions. The annotated content mainly comprises: (1) character processing: marking of wrong characters, missing characters, superfluous characters, etc.; (2) punctuation processing: marking of wrong, missing, or superfluous punctuation; (3) word processing: marking of wrong words, missing words, collocation errors, etc.; (4) sentence processing: marking of grammatically wrong sentences, errors and mixing of clause patterns, etc.; (5) discourse processing: marking of errors in the connections between sentences and in semantic expression.
2) processing the corpus;

Since the "HSK Dynamic Composition Corpus" is a hand-annotated corpus of examinees' compositions, the corpus is first processed into a standard composition corpus according to the correction annotations of the compositions, that is, each annotated composition is turned into a standardized composition according to the errors marked in the corpus and the corrections provided. These standardized composition samples are used as the standard corpus for training the LDA model. In addition, 10,000 compositions by primary and secondary school students were obtained from the internet to enrich and supplement the corpus.
The step of training the LDA topic model includes:

When the LDA algorithm starts, the parameters θ_d and φ_t are given randomly, where θ_d represents the topic distribution of document d and φ_t represents the word distribution of topic t; the process described in the following steps A, B, and C is then iterated until convergence, and the converged result is the output of LDA:

A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:

p_j(w_i|d_s) = p(w_i|t_j) × p(t_j|d_s);

B. Enumerate the topics in the topic set T to obtain all p_j(w_i|d_s), where j takes values 1 to k; a topic can then be selected for the i-th word w_i in d_s according to these probability values;

C. Computing p(w|d) and reselecting the topic for every word w in the vocabulary set D counts as one iteration.
Cross entropy is used to measure the difference between two functions or probability distributions: the greater the difference, the greater the relative entropy; the smaller the difference, the smaller the relative entropy. Sentences are therefore chosen by cross entropy to construct the generated text.

The step of computing the cross entropy includes:

For probability distributions p and q, the cross entropy is

H(p, q) = H(p) + D_KL(p‖q),

where, for discrete distributions,

H(p, q) = −Σ_x p(x) log q(x).
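The discrete cross entropy H(p, q) = −Σ_x p(x) log q(x) used above to rank candidate sentences can be sketched directly in code. This is a minimal illustration; the names (`cross_entropy`, `select_sentences`) and the smoothing constant `eps` are assumptions of the example, not details from the patent:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x) over topic indices x.
    p: topic distribution of the document / keyword set,
    q: topic distribution of a candidate sentence.
    eps guards against log(0) when q assigns a topic zero mass."""
    return -sum(px * math.log(max(qx, eps)) for px, qx in zip(p, q))

def select_sentences(doc_dist, sent_dists, n):
    """Smaller cross entropy = closer to the theme; return the indices
    of the n best candidate sentences."""
    ranked = sorted(range(len(sent_dists)),
                    key=lambda i: cross_entropy(doc_dist, sent_dists[i]))
    return ranked[:n]
```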
Latent Dirichlet Allocation (LDA) is a topic model that can give the topics of every document in a document set in the form of probability distributions. In natural language processing, latent Dirichlet allocation is a generative statistical model that explains observed groups through unobservable groups and thereby explains the similarity of certain parts of the data.

In the LDA model, each document can be regarded as a mixture of various topics, and each document is considered to have its own topic distribution, assigned to it by LDA.

LDA is a typical bag-of-words model: it regards a document as a set of words that exist independently of one another, with no sequential relationship. A document covers only a small fraction of the topics, and each topic frequently uses only a small fraction of the words. The distributions of documents over topics and of topics over words can therefore be obtained by LDA.
The Beta distribution is the conjugate prior probability distribution of the binomial distribution: for nonnegative real numbers α and β, the following relationship holds:

Beta(p | α, β) + Count(m_1, m_2) = Beta(p | α + m_1, β + m_2)   (2.1)

In the formula, Count(m_1, m_2) is the count statistic of the binomial distribution B(m_1 + m_2, p). Here, the observed group data obey a binomial distribution, and the prior and posterior distributions of the parameter both obey Beta distributions. In this case we speak of Beta–Binomial conjugacy.
The Dirichlet distribution is the conjugate prior probability distribution of the multinomial distribution. Extending the parameters from discrete integer sets to continuous real sets yields the general expression:

Dir(p | α) + MultCount(m) = Dir(p | α + m)   (2.2)

Similarly, in the formula, MultCount(m) is the count statistic of the multinomial distribution. Likewise, here the observed data obey a multinomial distribution, and the prior and posterior distributions of the parameter are both Dirichlet distributions. In this case we speak of Dirichlet–Multinomial conjugacy.

It can be seen that the Dirichlet distribution is the conjugate prior probability distribution of the multinomial distribution, just as the Beta distribution is the conjugate prior probability distribution of the binomial distribution.
Therefore, with the LDA generative model, the method of generating a text is as follows:

A. sample from the Dirichlet distribution α to generate the topic distribution θ_i of document i;

B. sample from the multinomial topic distribution θ_i to generate the topic z_ij of the j-th word of document i;

C. sample from the Dirichlet distribution β to generate the word distribution φ_{z_ij} of topic z_ij;

D. sample from the multinomial word distribution φ_{z_ij} to finally generate the word w_ij.
Generating a text containing multiple topics with the LDA generative model:

Choose parameter θ ~ p(θ);
For each of the N words w_n:
    Choose a topic z_n ~ p(z | θ);
    Choose a word w_n ~ p(w | z);
Here θ is defined as a topic vector; θ is a nonnegative normalized vector, each of whose components represents the probability of the corresponding topic occurring in the document; p(θ) is the distribution over θ, which is a Dirichlet distribution; N and w_n are as above; z_n denotes the chosen topic; p(z | θ) denotes the probability distribution of topic z given a specific θ, namely p(z = i | θ) = θ_i; and p(w | z) denotes the probability distribution of word w when topic z is chosen.

The method described above first selects a topic vector θ and computes the probability of each topic in the document. A topic z is selected from the topic distribution vector θ, and a topic-related word is then generated from the topic–word distribution according to the word probability distribution of topic z.
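The generative process above can be sketched as a few lines of sampling code. This is a toy illustration, not the patent's implementation: `generate_doc`, the fixed seed, and the tiny vocabulary are all assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_doc(alpha, beta, n_words, vocab):
    """Sample one document from the LDA generative process:
    theta ~ Dir(alpha); each topic's word distribution phi_k ~ Dir(beta);
    then for each word: z ~ Mult(theta), w ~ Mult(phi_z)."""
    k = len(alpha)
    theta = rng.dirichlet(alpha)              # document-topic distribution
    phi = rng.dirichlet(beta, size=k)         # one word distribution per topic
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta)            # topic of this word
        w = rng.choice(len(vocab), p=phi[z])  # word drawn from topic z
        words.append(vocab[w])
    return words
```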
The model is shown in Fig. 1. The joint distribution of all visible and hidden variables in the entire model is therefore

p(w_i, z_i, θ_i, φ | α, β)   (2.3)

Integrating over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:

p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ   (2.4)

Mapping this joint probability distribution onto a graph gives Fig. 2.

Assume there are M texts in the corpus; all words w and their corresponding topics z are

w = (w_1, …, w_M)   (2.5)

z = (z_1, …, z_M)   (2.6)

where w_m denotes the words in the m-th text and z_m denotes the numbers of the topics corresponding to those words.

The process α → θ → z in the figure can thus be analyzed. Suppose the m-th text is to be generated: first consult the text–topic distribution of the m-th text, then generate the topic number z_{m,n} of the n-th word. Since the process α → θ corresponds to a Dirichlet distribution and the process θ → z corresponds to a multinomial distribution, the whole process is a Dirichlet–Multinomial conjugate structure.
For the process β → w in the figure: in the word–topic distribution, look up the topic numbered z_{m,n} and select a word under that topic, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus can be generated.

In addition, since the LDA model is a bag-of-words model, the processes α → θ → z and β → w are mutually independent, with no chronological order. Consequently, in the LDA model the M documents correspond to M independent Dirichlet–Multinomial conjugate structures; similarly, the K topics correspond to K independent Dirichlet–Multinomial conjugate structures.
Since

p(z_m | α) = Δ(n_m + α)/Δ(α)   (2.7)

where n_m = (n_m^(1), …, n_m^(K)) and n_m^(k) denotes the number of words corresponding to the k-th topic in the m-th text, it follows from the Dirichlet–Multinomial conjugate structure that the posterior distribution of θ_m is Dir(θ_m | n_m + α).

Because the topic-generation processes of the M documents in the text set are mutually independent, M mutually independent Dirichlet–Multinomial conjugate structures are obtained, from which the probability of the topics generated over the entire text set can be computed:

p(z | α) = Π_{m=1}^{M} Δ(n_m + α)/Δ(α)   (2.8)
Likewise,

w′ = (w^(1), …, w^(K))   (2.9)

z′ = (z^(1), …, z^(K))   (2.10)

where w^(k) denotes the words that were all generated by topic k, and z^(k) the numbers of the topics corresponding to those words. Since any two words generated by topic k in the texts are mutually independent and exchangeable, the whole process is again a Dirichlet–Multinomial conjugate structure.

Here,

n_k = (n_k^(1), …, n_k^(V))   (2.11)

where n_k^(t) is the number of occurrences of word t among the words generated by topic k. Further, the probability of the words generated over the entire text set is

p(w | z, β) = Π_{k=1}^{K} Δ(n_k + β)/Δ(β)   (2.12)

Merging the formulas gives

p(w, z | α, β) = Π_{m=1}^{M} Δ(n_m + α)/Δ(α) · Π_{k=1}^{K} Δ(n_k + β)/Δ(β)   (2.13)

Finally, the Gibbs sampling formula of the LDA model is obtained:

p(z_i = k | z_¬i, w) ∝ (n_{m,¬i}^(k) + α_k) · (n_{k,¬i}^(t) + β_t) / Σ_{t=1}^{V} (n_{k,¬i}^(t) + β_t)   (2.14)
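The standard collapsed-Gibbs update, p(z = k | rest) ∝ (n_m^(k) + α)(n_k^(t) + β)/(n_k + V·β), can be sketched for one word as follows, assuming symmetric α and β. This is an illustrative sketch with invented names (`gibbs_update`, `doc_topic`, `topic_word`, `topic_total`), not code from the patent:

```python
import random

def gibbs_update(m, t, z_old, doc_topic, topic_word, topic_total,
                 alpha, beta, V, K):
    """One collapsed-Gibbs update for an occurrence of word t in document m.
    Counts exclude the word's current assignment z_old before sampling."""
    doc_topic[m][z_old] -= 1
    topic_word[z_old][t] -= 1
    topic_total[z_old] -= 1
    # Unnormalized conditional for each candidate topic k
    weights = [(doc_topic[m][k] + alpha) *
               (topic_word[k][t] + beta) / (topic_total[k] + V * beta)
               for k in range(K)]
    z_new = random.choices(range(K), weights=weights)[0]
    doc_topic[m][z_new] += 1
    topic_word[z_new][t] += 1
    topic_total[z_new] += 1
    return z_new
```

Sweeping this update over every word occurrence, many times, yields samples from which θ_m and φ_k are estimated.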
In the present embodiment, a composition is generated with the topic keywords "Nature, food, mankind, science, arable land". The steps are as follows:

(1) select a training dataset to train the LDA topic model, obtaining the distribution of the main contents of the text set and the distribution of each sentence under the topics;

(2) compute the cross entropy between each candidate sentence and the document, and select the sentences with smaller cross entropy;

(3) arrange the candidate sentences according to their relative positions in the original candidate texts;

(4) output the automatically generated composition.
The automatically generated composition is evaluated with an automated scoring system.

Text generation is carried out with the trained LDA model, with the length of the text controlled at about 200 characters. The quality of the generated text is then evaluated; the HSK5 composition grading standards are shown in Table 1.

Table 1. HSK composition grading standards
The composition generated in the present embodiment is as follows:

From ancient times to the present, clothing, food, and shelter have always been the most important and most troublesome problems that Nature has left to mankind. Nowadays the insufficient supply of food makes people suffer from hunger; with the development of civilization and the growth of population, people have had to borrow the strength of science to solve the problem of food shortage caused by population growth. But this violates the rules of Nature: although it can satisfy human wants in the short term, people have come to realize that the constant heavy use of chemical fertilizers and pesticides gradually reduces the cultivated area. Peasants in many countries of the world produce "green food", and people hope for food that is not polluted. If chemical fertilizers are not used, grain yields will be greatly reduced and cannot support all of mankind on Earth. But if mankind keeps increasing grain yields with pesticides and chemical fertilizers without taking reasonable measures, the consequences will be more serious. I think an unhealthy family will not be happy. So in the short term mankind must use the arable land rationally while waiting for science to develop further. Just as the ancients created farmland, I think mankind will not yield to Nature so easily. I firmly believe that the owner of the Earth will forever be mankind.
It can be found that the text includes all the keywords; the theme is clear, the content is to the point, the logic is clear, and the content is coherent, meeting the standard of a top-grade composition. Using the method of the present invention, the present embodiment completes the keyword-based writing task well: the generated text expands well around the keywords and suits the theme.

By training an LDA topic model, the HSK composition generation method provided by the present invention obtains the distributions of sentences and text and of words and text, computes cross entropy to select the sentences most similar to the topic keywords, and then generates the text. The automatically generated text is coherent and logical, with few grammatical errors and few wrong characters; the method completes the writing task well and meets the needs of practical application.
The above embodiments only express implementations of the present invention, and their description is relatively specific and detailed, but they cannot therefore be understood as limiting the scope of the patent of the present invention. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the present invention, and these all belong to the protection scope of the present invention. Therefore, the protection scope of the patent of the present invention shall be subject to the appended claims.
Claims (9)
- The generation method 1. a kind of HSK based on topic model writes a composition characterized by comprising training LDA model obtains sentence With the distribution of text, word and text, cross entropy, selection and the most similar sentence of subject key words are calculated, text is then generated.
- 2. the composition generation method according to claim 1 based on topic model, which is characterized in that select training dataset The step of training LDA model, selection training dataset includes: to select " HSK dynamic composition corpus " as basic corpus;It is first First the modification of composition is marked according in corpus, corpus handle as standard composition corpus, i.e., according to being marked out in corpus Mistake and the modification that provides, be the composition of specification by mark composition processing, these standardized into composition sample, as standard speech Material, carries out the training to LDA model.
- 3. the composition generation method according to claim 1 to 2 based on topic model, which is characterized in that training LDA model The step of include:When LDA algorithm starts, randomly given parameters θdAnd φt, then continuous iteration and study following steps A, B, C described by Process, finally obtain the output that convergent result is exactly LDA;A. to a specific document dsIn the i-th vocabulary wi, it is assumed that vocabulary wiCorresponding theme is tj, then:pj(wi|ds)=p (wi|tj)×(tj|ds);B. we can enumerate the theme in theme set T now, obtain all pj(wi|ds), wherein 1~k of j value, then It can be d according to these probability value resultssIn i-th of word wiSelect a theme;C. w all in lexical set D are carried out the calculating of p (w | d) and reselect theme to regard an iteration as.
- 4. The composition generation method based on a topic model according to any one of claims 1 to 3, characterized in that the step of calculating the cross entropy comprises: the cross entropy is H(p, q) = −Σ_x p(x) log q(x), where p is the topic distribution of the document and q is the topic distribution of the candidate sentence.
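The cross entropy between a candidate sentence's topic distribution and the document's topic distribution can be computed as follows (a sketch using the standard definition H(p, q) = −Σ p(x) log q(x); the distributions are assumed to be aligned lists over the same topics, and the example values are illustrative):

```python
import math

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) * log q(x); a smaller value means the
    # candidate distribution q is closer to the reference p
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

doc_topics = [0.7, 0.2, 0.1]   # document topic distribution p
sent_topics = [0.6, 0.3, 0.1]  # candidate sentence topic distribution q
h = cross_entropy(doc_topics, sent_topics)
```

Candidate sentences with smaller cross entropy against the document distribution are the ones retained in the later selection step.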
- 5. The composition generation method based on a topic model according to any one of claims 1 to 4, characterized in that the specific steps of the HSK composition generation method based on a topic model are as follows: select a training dataset and train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence under the topics; calculate the cross entropy between the candidate sentences and the document, and select the sentences with smaller cross entropy; arrange the candidate sentences according to their relative position parameters in the original candidate texts; output the automatically generated composition.
- 6. The composition generation method based on a topic model according to any one of claims 1 to 5, characterized in that the method of generating a text with the LDA model comprises: A. sampling from the Dirichlet distribution α to generate the topic distribution θ_i of document i; B. sampling from the multinomial topic distribution θ_i to generate the topic z_{i,j} of the j-th word of document i; C. sampling from the Dirichlet distribution β to generate the word distribution φ_{z_{i,j}} of topic z_{i,j}; D. sampling from the multinomial word distribution φ_{z_{i,j}} to finally generate the word w_{i,j}.
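The generative process in steps A through D can be sketched with NumPy (toy dimensions; the values of α, β, K, V, and N are illustrative choices, not values from the claim):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 5, 8        # topics, vocabulary size, words in the document
alpha = np.full(K, 0.1)  # Dirichlet prior over topics
beta = np.full(V, 0.01)  # Dirichlet prior over words

# Step C (drawn once per topic): word distribution phi_k ~ Dir(beta)
phi = rng.dirichlet(beta, size=K)

# Step A: topic distribution theta_i ~ Dir(alpha) for document i
theta = rng.dirichlet(alpha)

doc = []
for j in range(N):
    z_ij = rng.choice(K, p=theta)      # Step B: topic of the j-th word
    w_ij = rng.choice(V, p=phi[z_ij])  # Step D: word drawn from phi_{z_ij}
    doc.append(int(w_ij))
```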
- 7. The composition generation method based on a topic model according to any one of claims 1 to 6, characterized in that the joint distribution of all visible variables and hidden variables in the LDA model is p(w_i, z_i, θ_i, φ | α, β); integrating over θ_i and φ and summing over z_i in this formula yields the maximum likelihood estimate of the word distribution: p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ; assuming there are M texts in the corpus, all words w and the topics z corresponding to the words are as follows: w = (w_1, …, w_M); z = (z_1, …, z_M); w_m denotes the words in the m-th text, and z_m denotes the numbers of the topics corresponding to these words.
- 8. The composition generation method based on a topic model according to any one of claims 1 to 7, characterized in that in the LDA model, assuming the m-th text is now to be generated, the text–topic distribution of the m-th text is first consulted, and the topic number z_{m,n} of the n-th word is generated; in the word–topic distribution, the topic numbered z_{m,n} is looked up and a word under that topic is selected, finally obtaining the word w_{m,n}; the n-th word of the m-th document in the corpus can thus be generated; in the LDA model, m documents correspond to m independent Dirichlet–Multinomial conjugate structures, and k topics correspond to k independent Dirichlet–Multinomial conjugate structures; according to the Dirichlet–Multinomial conjugate structure, the posterior distribution of θ_m is obtained as Dir(θ_m | n_m + α); the probability of the topics generating the entire text set is calculated, obtaining w′ = (w^(1), …, w^(k)) and z′ = (z^(1), …, z^(k)), where w^(k) denotes the words all generated by topic k and z^(k) denotes the topic numbers of the corresponding words; n_k = (n_k^(1), …, n_k^(t)) is obtained, where n_k^(t) is the count of word t among the words generated by topic k; the probability of the words generating the entire text set is obtained; finally, the Gibbs sampling formula of the LDA model is obtained as: p(z_i = k | z_¬i, w) ∝ (n_{m,k}^{¬i} + α_k) · (n_{k,t}^{¬i} + β_t) / Σ_{t′} (n_{k,t′}^{¬i} + β_{t′}).
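The collapsed Gibbs sampling update implied by the conjugate structures above can be sketched as follows (a standard LDA formulation, not code from the patent; the count arrays n_mk and n_kt are assumed to already exclude the word currently being resampled):

```python
import numpy as np

def gibbs_topic_probs(n_mk, n_kt, word, alpha, beta):
    """Probability of each topic k for one word position, given all other
    assignments: p(z_i = k | z_-i, w) is proportional to
    (n_mk[k] + alpha) * (n_kt[k, word] + beta) / (sum_t n_kt[k, t] + V*beta).
    n_mk: per-document topic counts (shape K); n_kt: topic-word counts (K x V)."""
    V = n_kt.shape[1]
    p = (n_mk + alpha) * (n_kt[:, word] + beta) / (n_kt.sum(axis=1) + V * beta)
    return p / p.sum()  # normalize so the sampler can draw from it
```

A sampler would draw a new topic from these probabilities for each word in turn, update the counts, and repeat until convergence.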
- 9. The composition generation method based on a topic model according to any one of claims 1 to 8, characterized in that the method comprises the following steps: (1) select a training dataset and train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence under the topics; (2) calculate the cross entropy between the candidate sentences and the document, and select the sentences with smaller cross entropy; (3) arrange the candidate sentences according to their relative position parameters in the original candidate texts; (4) output the automatically generated composition.
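Steps (2) through (4) of the claimed pipeline, ranking candidate sentences by cross entropy and then restoring their original relative order, can be sketched as follows (the sentence topic distributions and position parameters are illustrative; the helper names are hypothetical):

```python
import math

def generate_composition(doc_dist, candidates, n_sentences):
    """candidates: list of (sentence, topic_distribution, relative_position)."""
    def h(q):  # cross entropy of a candidate against the document distribution
        return -sum(p * math.log(qi + 1e-12) for p, qi in zip(doc_dist, q))
    # (2) keep the sentences with the smallest cross entropy
    best = sorted(candidates, key=lambda c: h(c[1]))[:n_sentences]
    # (3) regroup by relative position in the original candidate texts
    best.sort(key=lambda c: c[2])
    # (4) output the generated composition
    return " ".join(s for s, _, _ in best)
```

Step (1), training the LDA model itself, would supply `doc_dist` and the per-sentence topic distributions consumed here.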
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811202083.7A CN109376347A (en) | 2018-10-16 | 2018-10-16 | A kind of HSK composition generation method based on topic model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811202083.7A CN109376347A (en) | 2018-10-16 | 2018-10-16 | A kind of HSK composition generation method based on topic model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109376347A true CN109376347A (en) | 2019-02-22 |
Family
ID=65400554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811202083.7A Pending CN109376347A (en) | 2018-10-16 | 2018-10-16 | A kind of HSK composition generation method based on topic model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109376347A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112182210A (en) * | 2020-09-25 | 2021-01-05 | 四川华空天行科技有限公司 | Language generation model based on composition data feature classifier and writing support method |
CN112667806A (en) * | 2020-10-20 | 2021-04-16 | 上海金桥信息股份有限公司 | Text classification screening method using LDA |
CN114330251A (en) * | 2022-03-04 | 2022-04-12 | 阿里巴巴达摩院(杭州)科技有限公司 | Text generation method, model training method, device and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120095952A1 (en) * | 2010-10-19 | 2012-04-19 | Xerox Corporation | Collapsed gibbs sampler for sparse topic models and discrete matrix factorization |
CN107967257A (en) * | 2017-11-20 | 2018-04-27 | 哈尔滨工业大学 | A kind of tandem type composition generation method |
CN108090231A (en) * | 2018-01-12 | 2018-05-29 | 北京理工大学 | A kind of topic model optimization method based on comentropy |
-
2018
- 2018-10-16 CN CN201811202083.7A patent/CN109376347A/en active Pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120095952A1 (en) * | 2010-10-19 | 2012-04-19 | Xerox Corporation | Collapsed gibbs sampler for sparse topic models and discrete matrix factorization |
CN107967257A (en) * | 2017-11-20 | 2018-04-27 | 哈尔滨工业大学 | A kind of tandem type composition generation method |
CN108090231A (en) * | 2018-01-12 | 2018-05-29 | 北京理工大学 | A kind of topic model optimization method based on comentropy |
Non-Patent Citations (1)
Title |
---|
Xu Yanhua et al.: "HSK Composition Generation Based on the LDA Model", Data Analysis and Knowledge Discovery (《数据分析与知识发现》) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112182210A (en) * | 2020-09-25 | 2021-01-05 | 四川华空天行科技有限公司 | Language generation model based on composition data feature classifier and writing support method |
CN112182210B (en) * | 2020-09-25 | 2023-11-24 | 四川华空天行科技有限公司 | Language generation model based on composition and theory data feature classifier and composition supporting method |
CN112667806A (en) * | 2020-10-20 | 2021-04-16 | 上海金桥信息股份有限公司 | Text classification screening method using LDA |
CN114330251A (en) * | 2022-03-04 | 2022-04-12 | 阿里巴巴达摩院(杭州)科技有限公司 | Text generation method, model training method, device and storage medium |
CN114330251B (en) * | 2022-03-04 | 2022-07-19 | 阿里巴巴达摩院(杭州)科技有限公司 | Text generation method, model training method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jacobs | Text-based intelligent systems: Current research and practice in information extraction and retrieval | |
Brodsky et al. | Characterizing motherese: On the computational structure of child-directed language | |
CN103207856B (en) | A kind of Ontological concept and hierarchical relationship generation method | |
Jordan | A phylogenetic analysis of the evolution of Austronesian sibling terminologies | |
CN101599071A (en) | The extraction method of conversation text topic | |
CN109376347A (en) | A kind of HSK composition generation method based on topic model | |
Jiang et al. | Two-stage entity alignment: combining hybrid knowledge graph embedding with similarity-based relation alignment | |
Song et al. | An exploration-based approach to computationally supported design-by-analogy using D3 | |
Chen et al. | Probing simile knowledge from pre-trained language models | |
CN109215798A (en) | A kind of construction of knowledge base method towards Chinese medicine ancient Chinese prose | |
Vandevoorde | On semantic differences: a multivariate corpus-based study of the semantic field of inchoativity in translated and non-translated Dutch | |
Zou et al. | Research and implementation of intelligent question answering system based on knowledge Graph of traditional Chinese medicine | |
Snyder et al. | Cross-lingual Propagation for Morphological Analysis. | |
CN106897436B (en) | A kind of academic research hot keyword extracting method inferred based on variation | |
Yan et al. | Sentiment Analysis of Short Texts Based on Parallel DenseNet. | |
Liang et al. | Patent trend analysis through text clustering based on k-means algorithm | |
CN115757827A (en) | Knowledge graph creating method and device for patent text, storage medium and equipment | |
Williams et al. | Growing naturally: The DicSci Organic E-Advanced Learner's Dictionary of Verbs in Science | |
Barrs | Using the sketch engine corpus query tool for language teaching | |
Ismail et al. | Rich semantic graph: A new semantic text representation approach for arabic language | |
Aguiar et al. | Towards technological approaches for concept maps mining from text | |
Elwert | Network analysis between distant reading and close reading | |
Haddad | Relevance & assessment: cognitively motivated approach toward assessor-centric query-topic relevance model | |
Sánchez-Zamora et al. | Visualizing tags as a network of relatedness | |
Platonova et al. | Application of tagging services for term analysis on visual plane in financial engineering |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190222 |