CN109376347A - HSK composition generation method based on a topic model - Google Patents

HSK composition generation method based on a topic model

Info

Publication number
CN109376347A
CN109376347A (application CN201811202083.7A)
Authority
CN
China
Prior art keywords
theme
distribution
text
composition
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811202083.7A
Other languages
Chinese (zh)
Inventor
吕学强
游新冬
董志安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
Original Assignee
Beijing Information Science and Technology University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University filed Critical Beijing Information Science and Technology University
Priority to CN201811202083.7A priority Critical patent/CN109376347A/en
Publication of CN109376347A publication Critical patent/CN109376347A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to an HSK composition generation method based on a topic model, comprising: training an LDA model to obtain the distributions relating sentences to texts and words to texts, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text. The automatically generated text is coherent and logically well organized, with few grammatical errors and few wrongly written characters; the method completes the writing task well and meets the needs of practical application.

Description

HSK composition generation method based on a topic model
Technical field
The invention belongs to the technical field of text information processing, and in particular relates to an HSK composition generation method based on a topic model.
Background art
In an era of rapid development of the IT industry and the internet, people have long dreamed of computing over natural language, so that hidden information and knowledge can be mined from large volumes of unstructured text. Artificial intelligence (AI) technology has grown rapidly. In 1997, Deep Blue, developed by IBM, defeated the world chess champion Garry Kasparov; in March 2016, AlphaGo, with its Monte Carlo tree search algorithm, defeated Lee Sedol. These are important milestones in artificial intelligence research.
On the other hand, the combination of AI and big data has brought unprecedented development to natural language processing. Because artificial intelligence systems reason logically over rules, they are suited to procedural work and to tasks involving large volumes of data and demanding timeliness. Big data has supported ground-breaking ideas in many industries, even challenging the traditional framework of writing. With the development of computer technology and artificial intelligence, literary writing, once the product of the highest human intelligence, has entered the era of "computer manufacture", which has also changed ideas about writing, writing behavior, and ways of thinking about writing. Natural language generation is the natural language processing task in which a system generates natural language from a machine representation such as a knowledge base or a logical form. A natural language generation system can be seen as a translator that converts data into a natural language representation. However, because of the inherent expressivity of natural language, the methods for generating the final text differ from those of a compiler.
The Chinese Proficiency Test (HSK) is a standardized international examination of Chinese language ability, set up to test the Chinese proficiency of non-native speakers. It is comparable to the CET-4 and CET-6 examinations and to the TOEFL and IELTS examinations for English. Research on English examinations, and especially on English examination writing, has achieved great success both at home and abroad. By contrast, research on HSK writing is still limited, in particular research on applying existing natural language processing technology to answer such questions automatically. As the HSK spreads worldwide, more and more overseas learners of Chinese take the examination, and domestic research on the HSK continues to grow.
The writing section mainly examines word order, grammar, content, and the logic of language, making it a good research topic for natural language generation. The writing task looks like a relatively difficult challenge, but by analyzing the task and training machine learning models, it can be converted into a trainable text generation task. With the continuing development of big data, natural language processing, and other artificial intelligence technologies, the exploration and practice of automatically generating news reports with algorithms has gradually begun. The continuing practice and development of automatic news writing has repeatedly confirmed that artificial intelligence can help people process and integrate data quickly and conveniently, and it will change the content and channels of news media. However, text automatically generated by the prior art is poor in coherence and logic and contains many grammatical errors and many wrongly written characters; these problems urgently need improvement.
Summary of the invention
In view of the above problems in the prior art, the purpose of the present invention is to provide an HSK composition generation method based on a topic model that avoids the above technical defects.
In order to achieve the above object of the invention, the technical solution provided by the invention is as follows:
An HSK composition generation method based on a topic model, comprising: training an LDA model to obtain the distributions relating sentences to texts and words to texts, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text.
Further, the method includes selecting a training dataset for training the LDA model. Selecting the training dataset includes: selecting the "HSK dynamic composition corpus" as the base corpus; processing the corpus into a standard composition corpus according to the correction annotations in the corpus, i.e., turning the annotated compositions into standardized compositions according to the errors marked in the corpus and the corrections provided; and using these standardized composition samples as the standard corpus for training the LDA model.
Further, the step of training the LDA model includes:
When the LDA algorithm starts, the parameters θ_d and φ_t are given randomly; the process described in the following steps A, B, and C is then iterated until convergence, and the converged result is the output of LDA;
A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:
p_j(w_i | d_s) = p(w_i | t_j) × p(t_j | d_s);
B. Enumerate the topics in the topic set T to obtain all p_j(w_i | d_s), where j takes values from 1 to k; a topic can then be selected for the i-th word w_i in d_s according to these probability values;
C. Computing p(w | d) and reselecting the topic for every word w in the vocabulary set D counts as one iteration.
Further, the step of computing cross entropy includes:
For probability distributions p and q, the cross entropy is
H(p, q) = -Σ_x p(x) log q(x);
where H(p, q) = H(p) + D_KL(p‖q), i.e., the entropy of p plus the Kullback-Leibler divergence from p to q.
Further, the specific steps of the HSK composition generation method based on the topic model are as follows: select a training dataset and train the LDA model to obtain the distribution of the main contents of the text set and the distribution of each sentence over the topics; compute the cross entropy between each candidate sentence and the document and select the sentences with smaller cross entropy; arrange the candidate sentences according to their relative positions in the original candidate texts; output the automatically generated composition.
Further, the method of generating a text with the LDA model includes:
A. Sample the topic distribution θ_i of document i from the Dirichlet distribution α;
B. Sample the topic z_ij of the j-th word of document i from the multinomial topic distribution θ_i;
C. Sample the word distribution φ_{z_ij} of topic z_ij from the Dirichlet distribution β;
D. Sample the word w_ij from the multinomial word distribution φ_{z_ij}.
Further, in the LDA model the joint distribution of all visible and hidden variables is
p(w_i, z_i, θ_i, φ | α, β);
Integrating over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:
p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ;
Assume there are M texts in the corpus, where all the words w and their corresponding topics z are as follows:
w = (w_1, ..., w_M);
z = (z_1, ..., z_M);
w_m denotes the words in the m-th text, and z_m denotes the numbers of the topics corresponding to these words.
Further, in the LDA model, suppose the m-th text is to be generated: first consult the text-topic distribution of the m-th text, then generate the topic number z_{m,n} of its n-th word;
In the word-topic distribution, look up the topic with number z_{m,n} and select a word under that topic, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus can be generated;
In the LDA model, the M documents correspond to M independent Dirichlet-Multinomial conjugate structures, and the k topics correspond to k independent Dirichlet-Multinomial conjugate structures;
where n_m = (n_m^(1), ..., n_m^(k)) and n_m^(k) denotes the number of words corresponding to the k-th topic in the m-th text; according to the Dirichlet-Multinomial conjugate structure, the posterior distribution of θ_m is Dir(θ_m | n_m + α);
The probability of the topics generated over the entire text set is computed as
p(z | α) = Π_{m=1}^{M} Δ(n_m + α) / Δ(α);
Similarly, one obtains
w' = (w^(1), ..., w^(k));
z' = (z^(1), ..., z^(k));
w^(k) denotes the words that are all generated by topic k, and z^(k) the numbers of the topics corresponding to these words; one obtains
n_k = (n_k^(1), ..., n_k^(t));
where n_k^(t) is the number of occurrences of word t among the words generated by topic k;
The probability of the words generated over the entire text set is
p(w | z, β) = Π_{j=1}^{k} Δ(n_j + β) / Δ(β);
Combining the two gives p(w, z | α, β) = p(z | α) · p(w | z, β);
Finally, the Gibbs sampling formula of the LDA model is obtained as:
p(z_i = j | z_¬i, w) ∝ (n_{m,¬i}^(j) + α_j) · (n_{j,¬i}^(t) + β_t) / Σ_{t'=1}^{V} (n_{j,¬i}^(t') + β_{t'}),
where Δ(·) is the Dirichlet normalization constant, ¬i denotes counts excluding the current assignment of word i, and V is the vocabulary size.
Further, the HSK composition generation method based on the topic model comprises the following steps:
(1) select a training dataset and train the LDA model to obtain the distribution of the main contents of the text set and the distribution of each sentence over the topics;
(2) compute the cross entropy between each candidate sentence and the document and select the sentences with smaller cross entropy;
(3) arrange the candidate sentences according to their relative positions in the original candidate texts;
(4) output the automatically generated composition.
In the HSK composition generation method based on a topic model provided by the invention, an LDA topic model is trained to obtain the distributions relating sentences to texts and words to texts; cross entropy is then computed to select the sentences most similar to the topic keywords, and the text is generated. The automatically generated text is coherent and logically well organized, with few grammatical errors and few wrongly written characters; the method completes the writing task well and meets the needs of practical application.
Brief description of the drawings
Fig. 1 is the basic graphical model of LDA;
Fig. 2 is the joint probability graphical model of LDA.
Specific embodiment
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art, based on the embodiments of the present invention and without creative work, shall fall within the protection scope of the present invention.
As shown in Fig. 1, an HSK composition generation method based on a topic model comprises the steps of: selecting training data and training an LDA topic model to obtain the distributions relating sentences to texts and words to texts; computing cross entropy to select the sentences most similar to the topic keywords; and then generating the text.
The first writing task of HSK Level 5 asks the examinee to write a short text using given words. The first question of the writing part of the Level 5 New HSK examination of July 2013 was as follows: using the following words (all must be used, in no particular order), write a short essay of about 80 characters.
application, biographic information, outstanding, characteristic
Writing according to the words given in the question stem is a typical material-based composition task. It is easy to see that the five words given in the stem are by no means unrelated. One can therefore first determine a theme, then develop the writing around that theme while using all the words given in the stem. Analyzing the example above, the theme can be judged to be "job hunting", so the essay can be written around the theme of "job hunting".
The basic steps of writing are therefore:
(1) determine the theme according to the words given in the stem; the writing task is carried out around this theme throughout;
(2) make sentences with the given words; because the stem requires that all the words be used, sentences can be made with the given words, but it must be ensured that the sentences produced are consistent with the theme.
From the above analysis of this writing task it is easy to see that the task focuses on developing the writing around a specific theme while ensuring that all the words listed in the stem appear in the essay.
Therefore, if a machine is to automatically generate a composition and complete this writing task, it must write according to the theme. From this perspective, the invention proposes an automatic writing method based on the LDA topic model to realize the machine's automatic composition task.
For this writing task, the invention proposes an automatic writing method based on a topic model and completes the generation of the text with a sentence-extraction strategy. The sentence-extraction strategy mainly extracts appropriate sentences from candidate texts and then sorts and combines these sentences to generate a complete essay. Sentence extraction based on the topic model selects and extracts sentences from the candidate texts through the topics and keywords.
Therefore, when generating a text, several corresponding words are first generated on the basis of the given keywords. Sentences are then screened and extracted from the candidate texts with these words. Finally, the extracted sentences are sorted by their relative positions in the candidate texts, and a generated text is obtained.
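The extract-then-reorder strategy described above can be sketched as follows. This is a minimal illustration with hypothetical helper names; the real method scores sentences with the topic model rather than the simple keyword-overlap count used here as a stand-in.

```python
# Sketch of the sentence-extraction strategy (hypothetical names; the
# relevance score is stood in for by a keyword-overlap count).
def extract_and_order(candidate_sents, keywords, top_n=3):
    """candidate_sents: list of (position_in_source, sentence) pairs."""
    def score(sentence):
        # Stand-in relevance score: number of keywords the sentence contains.
        return sum(1 for kw in keywords if kw in sentence)
    # Pick the top-scoring sentences ...
    chosen = sorted(candidate_sents, key=lambda ps: score(ps[1]), reverse=True)[:top_n]
    # ... then restore their relative order from the source text.
    chosen.sort(key=lambda ps: ps[0])
    return [s for _, s in chosen]

sents = [(0, "nature gives food"), (1, "irrelevant aside"),
         (2, "science protects arable land"), (3, "mankind needs food and science")]
text = extract_and_order(sents, ["nature", "food", "science", "arable land"], top_n=2)
```

The final sort by source position is what preserves discourse order in the generated essay.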
The steps of selecting the training dataset are:
1) select the "HSK dynamic composition corpus" as the base corpus;
The "HSK dynamic composition corpus" is an NOCFL scientific research project presided over by Professor Cui Xiliang of Beijing Language and Culture University. It is a corpus of essay examination papers written by foreigners whose mother tongue is not Chinese in the advanced HSK essay examination. After years of revision and supplementation, the first release of the corpus collected 11,569 compositions. The raw corpus contains the examinees' composition papers together with very detailed information such as the examinees' composition scores. In addition, the annotated corpus provides very comprehensive correction annotations for the errors in the examinees' compositions. The annotated content mainly covers: (1) character processing: annotations of wrong characters, missing characters, extra characters, etc.; (2) punctuation processing: annotations of wrong punctuation and of missing or extra punctuation; (3) word processing: annotations of wrong words, missing words, collocation errors, etc.; (4) sentence processing: annotations of grammatically wrong sentences, error labels of clauses, blended sentences, etc.; (5) discourse processing: error flags for cohesion between sentences and for semantic expression.
2) process the corpus;
Since the "HSK dynamic composition corpus" is a manually annotated corpus of examinees' compositions, the corpus is first processed into a standard composition corpus according to the correction annotations: following the errors marked in the corpus and the corrections provided, the annotated compositions are turned into standardized compositions. These standardized composition samples are used as the standard corpus for training the LDA model. In addition, 10,000 compositions by primary and middle school students were obtained from the internet to enrich and supplement the corpus.
The steps of training the LDA topic model are:
When the LDA algorithm starts, the parameters θ_d and φ_t are given randomly, where θ_d represents the topic distribution of document d and φ_t represents the word distribution of topic t; the process described in the following steps A, B, and C is then iterated until convergence, and the converged result is the output of LDA:
A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:
p_j(w_i | d_s) = p(w_i | t_j) × p(t_j | d_s);
B. Enumerate the topics in the topic set T to obtain all p_j(w_i | d_s), where j takes values from 1 to k; a topic can then be selected for the i-th word w_i in d_s according to these probability values;
C. Computing p(w | d) and reselecting the topic for every word w in the vocabulary set D counts as one iteration.
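Steps A, B, and C above can be sketched as a single sweep in Python. The conditional distributions p(w|t) and p(t|d) are hard-coded toy values here; in the actual method they come from the current state of the model.

```python
import random

# One sweep of steps A-C: for each word, weight every topic by
# p_j(w_i|d_s) = p(w_i|t_j) * p(t_j|d_s), then resample its topic.
def resample_topics(doc_words, p_w_given_t, p_t_given_d, topics, rng):
    assignments = []
    for w in doc_words:
        # Steps A/B: one weight per topic j (tiny floor avoids zero totals).
        weights = [p_w_given_t[t].get(w, 1e-12) * p_t_given_d[t] for t in topics]
        # Step B: draw a topic for this word in proportion to the weights.
        assignments.append(rng.choices(topics, weights=weights, k=1)[0])
    return assignments  # step C: doing this for every word is one iteration

rng = random.Random(0)
topics = ["farming", "science"]
p_w_given_t = {"farming": {"land": 0.9, "lab": 0.0},
               "science": {"land": 0.1, "lab": 1.0}}
p_t_given_d = {"farming": 0.5, "science": 0.5}
z = resample_topics(["land", "lab"], p_w_given_t, p_t_given_d, topics, rng)
```

Repeating this sweep until the assignments stabilize is the iteration-to-convergence loop the text describes.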
Cross entropy measures the difference between two functions or probability distributions: the greater the difference, the greater the relative entropy; the smaller the difference, the smaller the relative entropy. Cross entropy is therefore used to choose sentences and construct the generated text.
The steps of computing cross entropy are:
For probability distributions p and q, the cross entropy is
H(p, q) = -Σ_x p(x) log q(x);
where H(p, q) = H(p) + D_KL(p‖q), i.e., the entropy of p plus the Kullback-Leibler divergence from p to q.
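A minimal implementation of the cross-entropy formula above, with a small smoothing constant (an assumption, since the patent does not say how zero probabilities are handled):

```python
from math import log

# Cross entropy H(p, q) = -sum_x p(x) log q(x) over two discrete
# distributions given as aligned probability lists.
def cross_entropy(p, q, eps=1e-12):
    return -sum(pi * log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
h_self = cross_entropy(p, p)            # H(p, p) is just the entropy of p
h_other = cross_entropy(p, [0.9, 0.1])  # a mismatched q costs more
```

Since H(p, q) = H(p) + D_KL(p‖q) and the divergence is nonnegative, a sentence whose topic distribution matches the document's minimizes the cross entropy, which is exactly the selection rule used here.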
Latent Dirichlet Allocation (LDA) is a topic model that can give the topics of each document in a document collection in the form of probability distributions. In natural language processing, LDA is a generative statistical model that explains observations through unobserved groups, thereby explaining the similarity of certain parts of the data.
In the LDA model, each document can be viewed as a mixture of various topics, where each document is considered to have its own topic distribution assigned to it by LDA.
LDA is a typical bag-of-words model: it regards a document as a set of words that exist independently of one another, with no sequential relationship. A document covers only a small fraction of the topics, and each topic often uses only a small fraction of the words. Therefore, the document-topic distribution and the topic-word distribution can be obtained through LDA.
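The bag-of-words assumption can be illustrated in a few lines with the standard library: only word counts survive, so any reordering of the document yields the same representation.

```python
from collections import Counter

# A document as a bag of words: counts only, word order discarded.
doc = "the land feeds the people and the people work the land".split()
bag = Counter(doc)

# Any reordering of the words yields the same bag.
shuffled = Counter(reversed(doc))
```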
The Beta distribution is the conjugate prior probability distribution of the binomial distribution: for nonnegative real numbers, the following relationship holds:
Beta(p | α, β) + Count(m_1, m_2) = Beta(p | α + m_1, β + m_2) (2.1);
In the formula, Count(m_1, m_2) is the count of binomial observations. Here, the observed data obey a binomial distribution, and the prior and posterior distributions of the parameter both obey Beta distributions. In this case we speak of Beta-Binomial conjugacy.
The Dirichlet distribution is the conjugate prior probability distribution of the multinomial distribution. Extending the counts from discrete integer sets to continuous real sets gives the general expression:
Dir(p | α) + MultCount(m) = Dir(p | α + m) (2.2);
Similarly, in the formula, MultCount(m) is the count of multinomial observations. Here the observed data obey a multinomial distribution, and the prior and posterior distributions of the parameter are both Dirichlet distributions. In this case we speak of Dirichlet-Multinomial conjugacy.
It can be seen that the Dirichlet distribution is the conjugate prior probability distribution of the multinomial distribution, just as the Beta distribution is the conjugate prior probability distribution of the binomial distribution.
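The conjugacy relation above amounts to adding observed counts to the prior parameters. A tiny numeric check, with illustrative values only (the Beta-Binomial case is simply the two-component version):

```python
# Dirichlet-Multinomial conjugacy: Dir(alpha) + counts(n) -> Dir(alpha + n).
def dirichlet_posterior(alpha, counts):
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    # Mean of Dir(alpha) is alpha normalized to sum to one.
    s = sum(alpha)
    return [a / s for a in alpha]

prior = [1.0, 1.0, 1.0]   # symmetric prior over 3 topics
counts = [8, 1, 1]        # observed topic assignments n_m
post = dirichlet_posterior(prior, counts)
mean = dirichlet_mean(post)   # posterior mean shifts toward topic 0
```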
Therefore, with the LDA generative model, the method of generating a text is as follows:
A. Sample the topic distribution θ_i of document i from the Dirichlet distribution α;
B. Sample the topic z_ij of the j-th word of document i from the multinomial topic distribution θ_i;
C. Sample the word distribution φ_{z_ij} of topic z_ij from the Dirichlet distribution β;
D. Sample the word w_ij from the multinomial word distribution φ_{z_ij}.
The generative process by which the LDA model produces a text containing multiple topics is:
choose the parameter θ ~ p(θ);
for each of the N words w_n:
choose a topic z_n ~ p(z | θ);
choose a word w_n ~ p(w | z);
Here θ is a topic vector, a nonnegative normalized vector whose components give the probability of each topic occurring in the document; p(θ) is the distribution over θ, a Dirichlet distribution; N and w_n are as above; z_n denotes the chosen topic; p(z | θ) denotes the probability distribution of topic z given θ, specifically p(z = i | θ) = θ_i; and p(w | z) denotes the probability distribution of word w given topic z.
The process above first selects a topic vector θ, which gives the probability of each topic in the document; a topic z is then selected from the topic distribution vector θ, and from the topic-word distribution a topic-related word is generated according to the word probability distribution of topic z.
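The sampling steps above can be sketched with the standard library alone, using the fact that a Dirichlet draw is a set of Gamma draws normalized to sum to one. The vocabulary and topic-word distributions below are toy assumptions:

```python
import random

# A Dirichlet(alphas) draw: K independent Gamma(alpha, 1) draws, normalized.
def sample_dirichlet(alphas, rng):
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    s = sum(g)
    return [x / s for x in g]

def generate_doc(n_words, alpha, phi, vocab, rng):
    theta = sample_dirichlet(alpha, rng)   # document's topic distribution
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # topic ~ Mult(theta)
        w = rng.choices(vocab, weights=phi[z])[0]             # word ~ Mult(phi_z)
        words.append(w)
    return theta, words

rng = random.Random(42)
vocab = ["land", "grain", "lab", "data"]
phi = [[0.5, 0.5, 0.0, 0.0],   # topic 0: farming words
       [0.0, 0.0, 0.5, 0.5]]   # topic 1: science words
theta, doc = generate_doc(5, [2.0, 2.0], phi, vocab, rng)
```

In the patent's setting the model is run the other way around: given the words, the hidden theta and z are inferred by Gibbs sampling.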
The model is shown in Fig. 1. The joint distribution of all visible and hidden variables in the entire model is therefore
p(w_i, z_i, θ_i, φ | α, β) (2.3)
Integrating over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:
p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ (2.4)
Mapping this joint probability distribution onto the graphical model gives Fig. 2.
Assume there are M texts in the corpus, where all the words w and their corresponding topics z are as follows:
w = (w_1, ..., w_M) (2.5)
z = (z_1, ..., z_M) (2.6)
w_m denotes the words in the m-th text, and z_m denotes the numbers of the topics corresponding to these words.
From this we can analyze the process α → θ → z in the figure: suppose the m-th text is to be generated; first consult the text-topic distribution of the m-th text, then generate the topic number z_{m,n} of its n-th word. Since the process α → θ corresponds to a Dirichlet distribution and the process θ → z corresponds to a multinomial distribution, the whole process is a Dirichlet-Multinomial conjugate structure.
In the process β → w in the figure, the topic with number z_{m,n} is looked up in the word-topic distribution and a word under that topic is selected, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus can be generated.
In addition, since the LDA model is a bag-of-words model, the processes α → θ → z and β → w are mutually independent and have no chronological order. Thus, in the LDA model, the M documents correspond to M independent Dirichlet-Multinomial conjugate structures; similarly, the k topics correspond to k independent Dirichlet-Multinomial conjugate structures.
Since
p(z_m | α) = Δ(n_m + α) / Δ(α) (2.7)
where n_m = (n_m^(1), ..., n_m^(k)) and n_m^(k) denotes the number of words corresponding to the k-th topic in the m-th text, it follows from the Dirichlet-Multinomial conjugate structure that the posterior distribution of θ_m is Dir(θ_m | n_m + α).
Because the topic-generation processes of the M documents in the text set are independent of one another, we obtain M mutually independent Dirichlet-Multinomial conjugate structures, and the probability of the topics generated over the entire text set can thus be computed as
p(z | α) = Π_{m=1}^{M} Δ(n_m + α) / Δ(α) (2.8)
Here Δ(·) denotes the Dirichlet normalization constant.
Likewise, we have
w' = (w^(1), ..., w^(k)) (2.9)
z' = (z^(1), ..., z^(k)) (2.10)
w^(k) denotes the words that are all generated by topic k, and z^(k) the numbers of the topics corresponding to these words. Since any two words generated by topic k in the text are mutually independent and exchangeable, the whole process is again a Dirichlet-Multinomial conjugate structure.
Here we have
n_k = (n_k^(1), ..., n_k^(t)) (2.11)
where n_k^(t) is the number of occurrences of word t among the words generated by topic k. Further, the probability of the words generated over the entire text set is
p(w | z, β) = Π_{j=1}^{k} Δ(n_j + β) / Δ(β) (2.12)
Merging the formulas gives
p(w, z | α, β) = p(z | α) · p(w | z, β) (2.13)
Finally, the Gibbs sampling formula of the LDA model is obtained as:
p(z_i = j | z_¬i, w) ∝ (n_{m,¬i}^(j) + α_j) · (n_{j,¬i}^(t) + β_t) / Σ_{t'=1}^{V} (n_{j,¬i}^(t') + β_{t'}) (2.14)
where ¬i denotes that the counts exclude the current assignment of word i, and V is the vocabulary size.
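The sampling formula above can be written directly as code. This is a standard collapsed-Gibbs sketch under the notation used here, not the patent's exact implementation; the counts are illustrative:

```python
# Unnormalized probability of assigning each topic to the current word:
# (n_m^(j) + alpha) * (n_j^(t) + beta) / (n_j + V*beta), with the current
# assignment already excluded from the counts by the caller.
def gibbs_weights(doc_topic, topic_word, topic_total, word, alpha, beta, V):
    K = len(doc_topic)
    return [(doc_topic[j] + alpha) *
            (topic_word[j].get(word, 0) + beta) / (topic_total[j] + V * beta)
            for j in range(K)]

doc_topic = [3, 1]                       # n_m^(j) for the current document
topic_word = [{"land": 4}, {"land": 0}]  # n_j^(t) for the current word
topic_total = [10, 10]                   # total word count per topic
w = gibbs_weights(doc_topic, topic_word, topic_total, "land", 0.1, 0.01, 1000)
# Topic 0 should dominate: frequent in the document and emits "land" often.
```

Normalizing these weights and sampling from them, word by word, is one Gibbs sweep; the document-topic and topic-word distributions are read off the counts after convergence.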
In the present embodiment, composition generation is carried out with the topic keywords "nature, food, mankind, science, arable land"; the steps are as follows:
(1) select a training dataset and train the LDA topic model to obtain the distribution of the main contents of the text set and the distribution of each sentence over the topics;
(2) compute the cross entropy between each candidate sentence and the document and select the sentences with smaller cross entropy;
(3) arrange the candidate sentences according to their relative positions in the original candidate texts;
(4) output the automatically generated composition.
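Under stated assumptions (a trained model is stood in for by hard-coded topic distributions, and all names and sentences are illustrative), steps (1) to (4) can be sketched end to end:

```python
from math import log

def cross_entropy(p, q, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x) with a small smoothing floor.
    return -sum(pi * log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def generate_composition(doc_topics, candidates, top_n=2):
    """candidates: list of (position, sentence, sentence_topic_distribution)."""
    # Step (2): rank candidate sentences by cross entropy against the document.
    ranked = sorted(candidates, key=lambda c: cross_entropy(doc_topics, c[2]))
    # Step (3): restore source order among the selected sentences.
    chosen = sorted(ranked[:top_n], key=lambda c: c[0])
    # Step (4): emit the composition.
    return " ".join(s for _, s, _ in chosen)

doc_topics = [0.7, 0.3]   # step (1): document topic mixture (assumed trained)
candidates = [
    (0, "Nature feeds mankind.",        [0.8, 0.2]),
    (1, "Science is far off topic.",    [0.1, 0.9]),
    (2, "Arable land must be guarded.", [0.6, 0.4]),
]
essay = generate_composition(doc_topics, candidates)
```

The off-topic sentence has the largest cross entropy and is dropped; the two on-topic sentences are kept in their source order.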
The automatically generated composition is evaluated with an automatic scoring system.
Text generation is carried out with the trained LDA model, with the length of the text controlled at about 200 characters. The quality of the generated text is then evaluated; the HSK Level 5 composition scoring criteria are shown in Table 1:
Table 1 HSK composition scoring criteria
The composition generated by the present embodiment is as follows:
From ancient times to the present, clothing, food, and shelter have always been the most important problems that nature leaves to mankind, and also the ones that trouble mankind the most. Nowadays an insufficient supply of food makes all of mankind suffer from hunger; with the development of civilization and the growth of population, people have to borrow the strength of science to solve the problem of food shortages caused by population growth. But this violates the rules of nature: although it can satisfy human wants in the short term, people have come to realize that the constant heavy use of chemical fertilizers and pesticides gradually reduces the arable area. Peasants in many countries of the world produce "green food", for people wish for food that is not polluted. If chemical fertilizers are not used, grain yields will fall greatly and cannot support all of mankind on earth. But if mankind always increases grain yields with pesticides and chemical fertilizers, without taking reasonable measures, the consequences will be more serious. I think an unhealthy family will not be happy. So in the short term mankind must use arable land rationally while waiting for science to develop further. Just as the ancients created farmland, I think mankind will not yield to nature so easily. I firmly believe that the owner of the earth will forever be mankind.
It can be seen that the text includes all the keywords; the theme is clear, the content is relevant and coherent, and the logic is clear, meeting the standard of a top-grade composition. Using the method of the invention, the present embodiment completes the task of writing according to keywords well; the generated text develops around the keywords and suits the theme.
In the HSK composition generation method based on a topic model provided by the invention, an LDA topic model is trained to obtain the distributions relating sentences to texts and words to texts; cross entropy is then computed to select the sentences most similar to the topic keywords, and the text is generated. The automatically generated text is coherent and logically well organized, with few grammatical errors and few wrongly written characters; the method completes the writing task well and meets the needs of practical application.
The above embodiments only express embodiments of the present invention, and their description is relatively specific and detailed, but they cannot therefore be construed as limiting the scope of the patent of the present invention. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these belong to the protection scope of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (9)

  1. An HSK composition generation method based on a topic model, characterized by comprising: training an LDA model to obtain the distributions relating sentences to texts and words to texts, computing cross entropy, selecting the sentences most similar to the topic keywords, and then generating the text.
  2. The composition generation method based on a topic model according to claim 1, characterized in that a training dataset is selected for training the LDA model, and selecting the training dataset comprises: selecting the "HSK dynamic composition corpus" as the base corpus; first processing the corpus into a standard composition corpus according to the correction annotations in the corpus, i.e., turning the annotated compositions into standardized compositions according to the errors marked in the corpus and the corrections provided; and using these standardized composition samples as the standard corpus for training the LDA model.
  3. The composition generation method based on a topic model according to claim 1 or 2, characterized in that the step of training the LDA model comprises:
    When the LDA algorithm starts, the parameters θ_d and φ_t are initialized randomly; the following steps A, B and C are then iterated and learned continuously, and the converged result is the output of LDA;
    A. For the i-th word w_i in a specific document d_s, assume that the topic corresponding to w_i is t_j; then:
    p_j(w_i | d_s) = p(w_i | t_j) × p(t_j | d_s);
    B. Enumerating the topics in the topic set T yields all p_j(w_i | d_s), where j takes values 1 to k; according to these probability values, a topic can then be selected for the i-th word w_i in d_s;
    C. Computing p(w | d) and reselecting a topic for every word w in the vocabulary set D counts as one iteration.
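Steps A–C above can be sketched as a single update pass. This is a minimal illustration, assuming the topic-word probabilities p(w|t) and document-topic probabilities p(t|d) are already available as plain Python structures:

```python
import random

def reassign_topics(docs, n_topics, p_w_given_t, p_t_given_d, seed=0):
    """One iteration of steps A-C: for each word w_i of document d_s,
    score every topic t_j by p(w_i|t_j) * p(t_j|d_s) (step A), normalize
    over all topics and draw a new topic from that distribution (step B);
    doing this for every word of every document is one iteration (step C).

    p_w_given_t: list of {word: prob} dicts, one per topic.
    p_t_given_d: list of per-document topic probability lists."""
    rng = random.Random(seed)
    assignments = []
    for d, words in enumerate(docs):
        topics = []
        for w in words:
            scores = [p_w_given_t[t].get(w, 1e-12) * p_t_given_d[d][t]
                      for t in range(n_topics)]
            total = sum(scores)
            topics.append(rng.choices(range(n_topics),
                                      weights=[s / total for s in scores])[0])
        assignments.append(topics)
    return assignments
```

In a full trainer, the counts behind p(w|t) and p(t|d) would be re-estimated from these assignments between iterations until convergence.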
  4. The composition generation method based on a topic model according to any one of claims 1 to 3, characterized in that the step of calculating cross entropy comprises:
    The cross entropy is
    H(p, q) = -Σ_x p(x) log q(x);
    wherein p(x) is the topic distribution of the document and q(x) is the topic distribution of the candidate sentence.
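A minimal sketch of the cross-entropy selection, assuming each candidate sentence and the document are represented by topic distributions obtained from the trained LDA model:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x), with p the document's topic
    distribution and q a candidate sentence's; q is clamped by eps so a
    zero probability does not blow up the logarithm."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def most_similar(doc_dist, sentence_dists):
    # The sentence with the lowest cross entropy against the document's
    # topic distribution is the one the method selects.
    return min(range(len(sentence_dists)),
               key=lambda i: cross_entropy(doc_dist, sentence_dists[i]))
```

Lower cross entropy means the sentence's topic mixture is closer to the document's, which is why the method keeps the smaller-valued candidates.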
  5. The composition generation method based on a topic model according to any one of claims 1 to 4, characterized in that the specific steps of the HSK composition generation method based on a topic model are: selecting a training data set to train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence over the topics; calculating the cross entropy between candidate sentences and the document, and selecting the sentences with smaller cross entropy; arranging the candidate sentences according to their relative position parameters in the original candidate texts; and outputting the automatically generated composition.
  6. The composition generation method based on a topic model according to any one of claims 1 to 5, characterized in that the method of generating a text with the LDA model comprises:
    a. sampling from the Dirichlet distribution α to generate the topic distribution θ_i of document i;
    b. sampling from the multinomial topic distribution θ_i to generate the topic z_ij of the j-th word of document i;
    c. sampling from the Dirichlet distribution β to generate the word distribution φ_{z_ij} of topic z_ij;
    d. sampling from the multinomial word distribution φ_{z_ij} to finally generate the word w_ij.
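The four sampling steps of claim 6 can be sketched with numpy. This is an illustrative toy with symmetric α and β vectors; note that step c is performed once per topic here (the standard LDA formulation) rather than once per word:

```python
import numpy as np

def generate_document(alpha, beta, n_words, seed=0):
    """Sample one document following claim 6's generative story:
    theta_i ~ Dir(alpha) (step a); phi_k ~ Dir(beta) per topic k (step c);
    then for each word, z_ij ~ Multinomial(theta_i) (step b) and
    w_ij ~ Multinomial(phi_{z_ij}) (step d)."""
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = len(alpha), len(beta)
    theta = rng.dirichlet(alpha)              # document-topic mixture
    phi = rng.dirichlet(beta, size=n_topics)  # one word distribution per topic
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)
        words.append(int(rng.choice(vocab_size, p=phi[z])))
        topics.append(int(z))
    return words, topics
```

Running `generate_document([1.0]*3, [1.0]*5, 10)` yields ten word ids drawn from a 5-word vocabulary under a 3-topic model.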
  7. The composition generation method based on a topic model according to any one of claims 1 to 6, characterized in that the joint distribution of all visible and hidden variables in the LDA model is
    p(w_i, z_i, θ_i, φ | α, β);
    integrating the formula over θ_i and φ and summing over z_i gives the maximum likelihood estimate of the word distribution:
    p(w_i | α, β) = ∫∫ Σ_{z_i} p(w_i, z_i, θ_i, φ | α, β) dθ_i dφ;
    Assume there are M texts in the corpus; all the words w and the topics z corresponding to the words are as follows:
    W = (w_1, ..., w_M);
    Z = (z_1, ..., z_M);
    where w_m denotes the words in the m-th text and z_m denotes the numbers of the topics corresponding to these words.
  8. The composition generation method based on a topic model according to any one of claims 1 to 7, characterized in that, in the LDA model, assume that the m-th text is now to be generated: first consult the text-topic distribution of the m-th text, then generate the topic number z_{m,n} of its n-th word;
    In the word-topic distribution, find the topic numbered z_{m,n} and select a word under that topic, finally obtaining the word w_{m,n}; in this way the n-th word of the m-th document in the corpus is generated;
    In the LDA model, the M documents correspond to M independent Dirichlet-Multinomial conjugate structures, and the K topics correspond to K independent Dirichlet-Multinomial conjugate structures;
    According to the Dirichlet-Multinomial conjugate structure, the posterior distribution of θ_m is Dir(θ_m | n_m + α);
    The probability that the topics of the entire text set are generated is calculated as
    p(z | α) = ∏_{m=1}^{M} Δ(n_m + α) / Δ(α);
    Reordering the corpus words by topic gives
    W′ = (w^{(1)}, ..., w^{(K)});
    Z′ = (z^{(1)}, ..., z^{(K)});
    where w^{(k)} denotes the words generated by topic k and z^{(k)} the topic numbers corresponding to these words;
    This gives
    n_k = (n_k^{(1)}, ..., n_k^{(V)});
    where n_k^{(t)} is the number of times word t is generated by topic k;
    The probability that the words of the entire text set are generated is
    p(w | z, β) = ∏_{k=1}^{K} Δ(n_k + β) / Δ(β);
    Combining the two gives
    p(w, z | α, β) = p(z | α) · p(w | z, β) = ∏_{m=1}^{M} Δ(n_m + α)/Δ(α) · ∏_{k=1}^{K} Δ(n_k + β)/Δ(β);
    Finally, the Gibbs sampling formula of the LDA model is obtained as:
    p(z_i = k | z_{¬i}, w) ∝ (n_{m,¬i}^{(k)} + α_k) / Σ_{k=1}^{K} (n_{m,¬i}^{(k)} + α_k) · (n_{k,¬i}^{(t)} + β_t) / Σ_{t=1}^{V} (n_{k,¬i}^{(t)} + β_t).
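The final sampling formula can be sketched directly from the count matrices. The names below are illustrative: `ndk_d` holds the per-topic token counts of the current document, `nkw_w` the per-topic counts of the current word, and `nk` the total token count per topic, all with the current token already excluded; the document-side normalizer is constant in k, so it cancels under the normalization:

```python
import numpy as np

def gibbs_conditional(ndk_d, nkw_w, nk, alpha, beta, vocab_size):
    """Collapsed Gibbs conditional the derivation above arrives at:
    p(z_i = k | z_-i, w) is proportional to
        (n_{m,-i}^{(k)} + alpha) * (n_{k,-i}^{(t)} + beta) / (n_k + V*beta).
    ndk_d[k]: tokens of topic k in document m; nkw_w[k]: assignments of
    word t to topic k; nk[k]: total tokens of topic k (current token
    excluded everywhere). Symmetric alpha/beta scalars for simplicity."""
    weights = (ndk_d + alpha) * (nkw_w + beta) / (nk + vocab_size * beta)
    return weights / weights.sum()
```

A sampler would draw the new topic of the current token from the returned distribution and then add the token back into the counts.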
  9. The composition generation method based on a topic model according to any one of claims 1 to 8, characterized in that the method comprises the following steps:
    (1) selecting a training data set to train the LDA model, obtaining the distribution of the main contents of the text set and the distribution of each sentence over the topics;
    (2) calculating the cross entropy between candidate sentences and the document, and selecting the sentences with smaller cross entropy;
    (3) arranging the candidate sentences according to their relative position parameters in the original candidate texts;
    (4) outputting the automatically generated composition.
CN201811202083.7A 2018-10-16 2018-10-16 A kind of HSK composition generation method based on topic model Pending CN109376347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811202083.7A CN109376347A (en) 2018-10-16 2018-10-16 A kind of HSK composition generation method based on topic model


Publications (1)

Publication Number Publication Date
CN109376347A true CN109376347A (en) 2019-02-22

Family

ID=65400554

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811202083.7A Pending CN109376347A (en) 2018-10-16 2018-10-16 A kind of HSK composition generation method based on topic model

Country Status (1)

Country Link
CN (1) CN109376347A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120095952A1 (en) * 2010-10-19 2012-04-19 Xerox Corporation Collapsed gibbs sampler for sparse topic models and discrete matrix factorization
CN107967257A (en) * 2017-11-20 2018-04-27 哈尔滨工业大学 A kind of tandem type composition generation method
CN108090231A (en) * 2018-01-12 2018-05-29 北京理工大学 A kind of topic model optimization method based on comentropy


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XU, Yanhua et al.: "HSK Composition Generation Based on the LDA Model" (基于LDA模型的HSK作文生成), Data Analysis and Knowledge Discovery (数据分析与知识发现) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112182210A (en) * 2020-09-25 2021-01-05 四川华空天行科技有限公司 Language generation model based on composition data feature classifier and writing support method
CN112182210B (en) * 2020-09-25 2023-11-24 四川华空天行科技有限公司 Language generation model based on composition and theory data feature classifier and composition supporting method
CN112667806A (en) * 2020-10-20 2021-04-16 上海金桥信息股份有限公司 Text classification screening method using LDA
CN114330251A (en) * 2022-03-04 2022-04-12 阿里巴巴达摩院(杭州)科技有限公司 Text generation method, model training method, device and storage medium
CN114330251B (en) * 2022-03-04 2022-07-19 阿里巴巴达摩院(杭州)科技有限公司 Text generation method, model training method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190222