CN110096705A - Unsupervised automatic English sentence simplification algorithm - Google Patents

Unsupervised automatic English sentence simplification algorithm

Info

Publication number
CN110096705A
CN110096705A
Authority
CN
China
Prior art keywords
sentence
word
algorithm
complex
language model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910354246.1A
Other languages
Chinese (zh)
Other versions
CN110096705B (en)
Inventor
强继朋 (Jipeng Qiang)
李云 (Yun Li)
袁运浩 (Yunhao Yuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN201910354246.1A priority Critical patent/CN110096705B/en
Publication of CN110096705A publication Critical patent/CN110096705A/en
Application granted granted Critical
Publication of CN110096705B publication Critical patent/CN110096705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 - Semantic analysis
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an unsupervised automatic English sentence simplification algorithm in the Internet field, carried out as follows: step 1, train vector representations of words; step 2, obtain word frequencies; step 3, obtain a simplified sentence set and a complex sentence set; step 4, fill a phrase table; step 5, train a simplified-sentence language model and a complex-sentence language model; step 6, build a phrase-based sentence simplification model; step 7, iteratively execute a back-translation strategy to train an ever better sentence simplification model. The invention uses no annotated parallel corpus, makes full use of the English Wikipedia corpus, and effectively improves the accuracy of English sentence simplification.

Description

Unsupervised automatic English sentence simplification algorithm
Technical field
The present invention relates to Internet text processing algorithms, and in particular to an unsupervised automatic English sentence simplification algorithm.
Background art
In recent years, text on the Internet has brought a great deal of useful knowledge and information to an ever wider range of users. However, for many people the writing style of online text, such as its vocabulary and syntactic structure, can be difficult to read and understand, especially for people with low literacy, cognitive or language impairments, or limited knowledge of the language of the text. Text containing uncommon words or long, complex sentences is not only hard for people to read and understand, but equally hard for machines to analyze. Automatic text simplification aims to simplify the content of a text as much as possible while retaining its original information, so that it can be read and understood more easily by a wider audience.
Existing text simplification algorithms borrow from machine translation: they learn to simplify sentences from a parallel corpus of complex sentences and their simplified counterparts in the same language. Such text simplification is a supervised learning task whose effectiveness depends heavily on large parallel simplification corpora. However, the English parallel simplification corpora available today are mainly obtained from ordinary English Wikipedia and its children's edition (Simple English Wikipedia), with matching algorithms selecting sentence pairs from the two Wikipedias as parallel pairs. The parallel simplification corpora obtainable at present are not only small, but also contain many sentence pairs that are not simplifications, or that are simply wrong. This is mainly because the children's edition is written independently by non-experts and does not correspond article-by-article to ordinary Wikipedia, making it difficult to design a suitable sentence matching algorithm. Because of these problems with parallel simplification corpora, existing text simplification algorithms do not perform very well.
Summary of the invention
The object of the present invention is to provide an unsupervised automatic English sentence simplification algorithm that requires no parallel simplification corpus and uses only the openly downloadable Wikipedia corpus to simplify English sentences automatically, so that users, especially people with cognitive or language impairments, can read and understand English sentences more easily.
The object of the present invention is achieved as follows: an unsupervised automatic English sentence simplification algorithm, carried out as follows:
Step 1: using the publicly available English Wikipedia corpus D as training corpus, obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec; the word vectors obtained by Word2vec capture the semantic features of words well; the Skip-Gram model is used to learn the Word2vec embeddings; given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t; the log probability of observing the context word set is defined as follows:

log p(W_t|t) = Σ_{w∈W_t} log p(w|t),  where p(w|t) = exp(v'_w · v_t) / Σ_{w'∈V} exp(v'_{w'} · v_t)   (1)

In formula (1), v'_w is the context vector representation of word w and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:

L = Σ_{t∈D} Σ_{w∈W_t} log p(w|t)   (2)

In formula (2), the vector representations of words are learned by maximizing this objective function;
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;
Step 3: using the Wikipedia corpus D, obtain a simplified sentence set S and a complex sentence set C;
Step 4: using the vector representations and the frequencies of words, fill a phrase table PT (Phrase Table) holding the probabilities with which one word is translated into another; in PT, the translation probability p(t_j|t_i) from word t_i to word t_j is computed as follows:

p(t_j|t_i) = exp(cos(v_{t_i}, v_{t_j})) / Σ_{t_k∈V} exp(cos(v_{t_i}, v_{t_k}))   (4)

In formula (4), cos denotes cosine similarity;
Step 5: for the simplified sentence set S and the complex sentence set C, train language models with the KenLM algorithm, obtaining a simplified-sentence language model LM_S and a complex-sentence language model LM_C; LM_S and LM_C remain unchanged during the iterative learning process below;
Step 6: using the phrase table PT, the simplified-sentence language model LM_S and the complex-sentence language model LM_C, build a simplification algorithm P_0^{c→s} from complex sentences to simplified sentences with the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation); given a complex sentence c, the algorithm uses formula (5) to score the sentences s formed by different word combinations, and finally selects the highest-scoring sentence s' as the simplified sentence:

s' = argmax_s p(c|s) p(s)   (5)

In formula (5), the PBMT algorithm decomposes p(c|s) into a product of phrase-table probabilities from PT, and p(s), the probability of sentence s, is obtained from the language model LM_S;
Step 7: starting from the initial PBMT algorithm P_0^{c→s}, iteratively execute the back-translation (Back-translation) strategy to generate a better text simplification algorithm.
As a further refinement of the invention, step 3 specifically comprises:
Step 3.1: score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences from high to low by score;

FRE(s) = 206.835 - 1.015 · tw(s) - 84.6 · ts(s)/tw(s)   (3)

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in sentence s, and ts(s) is the number of syllables in sentence s;
Step 3.2: remove the sentences scoring above 100, remove the sentences scoring below 20, and remove the sentences with intermediate scores; finally, select the remaining high-scoring sentences as the simplified sentence set S and the remaining low-scoring sentences as the complex sentence set C.
As a further refinement of the invention, step 7 specifically comprises:
Step 7.1: first translate the complex sentence set C with the algorithm P_0^{c→s}, obtaining a newly synthesized simplified sentence set S_0; then execute steps 7.2 to 7.5 in a loop, with the iteration number i running from 1 to N;
Step 7.2: using the synthesized parallel corpus (S_{i-1}, C), the simplified-sentence language model LM_S and the complex-sentence language model LM_C, train a new PBMT algorithm P_i^{s→c} from simplified sentences to complex sentences;
Step 7.3: translate the simplified sentence set S with P_i^{s→c}, obtaining a newly synthesized complex sentence set C_i;
Step 7.4: using the synthesized parallel corpus (C_i, S), the simplified-sentence language model LM_S and the complex-sentence language model LM_C, train a new PBMT algorithm P_i^{c→s} from complex sentences to simplified sentences;
Step 7.5: translate the complex sentence set C with P_i^{c→s}, obtaining a newly synthesized simplified sentence set S_i; then return to step 7.2 and repeat, until N iterations have been executed.
Compared with the prior art, the beneficial effects of the present invention are:
1. When filling the phrase table, the present invention combines the word vector representations obtained from the Wikipedia corpus with word frequencies, so that both the semantics of words and their frequency of use are captured, overcoming the need of the traditional phrase-based machine translation (PBMT) algorithm to fill the phrase table from a parallel corpus;
2. Using the Wikipedia corpus as knowledge base, the present invention scores sentences with the Flesch Reading Ease (FRE) algorithm to obtain the simplified sentence set and the complex sentence set, so that the complex-sentence language model and the simplified-sentence language model can be trained more accurately;
3. Using the obtained phrase table, complex-sentence language model and simplified-sentence language model, the present invention builds an initial unsupervised text simplification algorithm based on the PBMT algorithm; this algorithm is not only unsupervised, but also simple, easy to interpret and fast to train;
4. After building the initial simplification algorithm, the present invention uses it to generate parallel corpora and then optimizes the existing text simplification model with the back-translation strategy, correcting possibly erroneous entries in the initial phrase table and further improving the performance of the algorithm.
Specific embodiment
The present invention is further described below with reference to specific embodiments.
An unsupervised automatic English sentence simplification algorithm, carried out as follows:
Step 1: using the publicly available English Wikipedia corpus D, downloadable from "https://dumps.wikimedia.org/enwiki/", as training corpus, obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec; the word vectors obtained by Word2vec capture the semantic features of words well; once the vector representations are available, word similarities can be computed, which helps find the set of highly similar words of each word; in this example, the dimension of each vector is set to 300, and the Skip-Gram model is used to learn the Word2vec embeddings; given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t; the sliding window is set to the 5 words before and the 5 words after t; the log probability of observing the context word set is defined as follows:

log p(W_t|t) = Σ_{w∈W_t} log p(w|t),  where p(w|t) = exp(v'_w · v_t) / Σ_{w'∈V} exp(v'_{w'} · v_t)   (1)

In formula (1), v'_w is the context vector representation of word w and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:

L = Σ_{t∈D} Σ_{w∈W_t} log p(w|t)   (2)

In formula (2), the word vectors are learned by maximizing this objective function using stochastic gradient descent with negative sampling.
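For illustration, step 1 can be sketched in Python with the open-source gensim toolkit; the file names and all settings other than the 300-dimensional vectors and the window of 5 fixed in this example are illustrative assumptions, not part of the invention:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# corpus.txt: one pre-tokenized English Wikipedia sentence per line (assumed layout)
sentences = LineSentence("corpus.txt")

model = Word2Vec(
    sentences,
    vector_size=300,  # dimension of each word vector, as in this example
    window=5,         # 5 words before and 5 words after the center word t
    sg=1,             # Skip-Gram model
    negative=5,       # negative sampling instead of the full softmax of formula (1)
    min_count=5,      # ignore very rare words
    workers=4,
)
model.save("word2vec.model")

v_t = model.wv["simple"]                         # the learned vector v_t of a word t
print(model.wv.most_similar("simple", topn=10))  # its most similar words by cosine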
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D; in the field of text simplification, the complexity of a word can be measured by its frequency; generally speaking, the higher the frequency of a word, the easier it is to understand; word frequency can therefore be used to find the easiest-to-understand word within the set of words highly similar to a word t.
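A corresponding sketch for step 2, under the same assumed corpus layout, counts f(t) in a single pass:

from collections import Counter

# f(t): number of occurrences of each word t in the Wikipedia corpus D
f = Counter()
with open("corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        f.update(line.split())

# among the highly similar words of a word t, the most frequent one is taken
# to be the easiest to understand
print(f["simple"], f["straightforward"])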
Step 3: the Wikipedia corpus D is a very large corpus containing large numbers of both complex and simple sentences; using D, obtain the simplified sentence set S and the complex sentence set C;
Step 3.1: score each sentence s in the Wikipedia corpus D with the FRE (Flesch Reading Ease) algorithm, as in formula (3), and sort the sentences from high to low by score; a higher score means a simpler sentence, and a lower score means a more difficult sentence;

FRE(s) = 206.835 - 1.015 · tw(s) - 84.6 · ts(s)/tw(s)   (3)

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in sentence s, and ts(s) is the number of syllables in sentence s; the FRE algorithm is also commonly used to evaluate the quality of the final output of text simplification models;
Step 3.2: remove the sentences scoring above 100, remove the sentences scoring below 20, and remove the sentences with intermediate scores; the high and low extremes are removed to discard particularly extreme sentences, and the intermediate scores are removed to establish a clear boundary between S and C; finally, select the remaining high-scoring sentences as the simplified sentence set S and the remaining low-scoring sentences as the complex sentence set C; in this example, S and C each contain 10 million sentences.
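Steps 3.1 and 3.2 can be sketched as follows; the syllable heuristic and the inner band thresholds of 40 and 80 are illustrative assumptions, as the example only fixes the outer cut-offs of 20 and 100:

import re

def count_syllables(word):
    # crude vowel-group heuristic; a real system would use a pronunciation lexicon
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fre(sentence):
    words = sentence.split()
    tw = len(words)                               # tw(s): number of words in s
    ts = sum(count_syllables(w) for w in words)   # ts(s): number of syllables in s
    return 206.835 - 1.015 * tw - 84.6 * ts / tw  # formula (3) for a single sentence

simple_set, complex_set = [], []
with open("corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        sentence = line.strip()
        if not sentence:
            continue
        score = fre(sentence)
        if 20 <= score <= 40:      # low-scoring band -> complex sentence set C
            complex_set.append(sentence)
        elif 80 <= score <= 100:   # high-scoring band -> simplified sentence set S
            simple_set.append(sentence)
        # sentences above 100, below 20, or in the middle band are discarded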
Step 4: using the vector representations and the frequencies of words, fill a phrase table PT (Phrase Table) holding the probabilities with which one word is translated into another. In PT, the translation probability p(t_j|t_i) from word t_i to word t_j is computed as follows:

p(t_j|t_i) = exp(cos(v_{t_i}, v_{t_j})) / Σ_{t_k∈V} exp(cos(v_{t_i}, v_{t_k}))   (4)

In formula (4), cos denotes cosine similarity; since learning translation probabilities for all words is infeasible, in this example the 300,000 most frequent words are selected, and probabilities are only computed for the 200 most similar words of each; for proper nouns, only the probability of translating into themselves is computed.
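A minimal sketch of the phrase-table filling of step 4, assuming the softmax-over-cosine form of formula (4) and normalizing, as in this example, only over the 200 nearest neighbours of each word:

import numpy as np

def fill_phrase_table(model, frequent_words, n_neighbors=200):
    # model: the trained gensim Word2Vec model
    # frequent_words: the 300,000 most frequent words found in step 2
    table = {}
    for t_i in frequent_words:
        if t_i not in model.wv:
            continue
        # the 200 most similar words of t_i by cosine similarity
        neighbors = model.wv.most_similar(t_i, topn=n_neighbors)
        sims = np.array([cos for _, cos in neighbors])
        probs = np.exp(sims) / np.exp(sims).sum()  # softmax normalization, as in formula (4)
        table[t_i] = {t_j: float(p) for (t_j, _), p in zip(neighbors, probs)}
    # a proper-noun check (e.g. via POS tagging) would instead set table[t] = {t: 1.0}
    return table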
Step 5: for the simplified sentence set S and the complex sentence set C obtained in step 3, train language models with the KenLM algorithm, obtaining the simplified-sentence language model LM_S and the complex-sentence language model LM_C; LM_S and LM_C remain unchanged during the iterative learning process below; a language model computes the probability of a given word sequence over the corpus; by scoring word sequences, the simplified-sentence and complex-sentence language models help improve the quality of the simplification model when it performs local substitutions and word reordering.
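Step 5 can be realized with the open-source KenLM toolkit; the n-gram order of 5 and the file names are assumptions for illustration, and KenLM's lmplz estimator must be available on the PATH:

import subprocess
import kenlm

# estimate an n-gram model over the simplified sentence set S with lmplz
with open("simple.txt") as inp, open("lm_simple.arpa", "w") as out:
    subprocess.run(["lmplz", "-o", "5"], stdin=inp, stdout=out, check=True)

lm_s = kenlm.Model("lm_simple.arpa")  # the simplified-sentence language model LM_S
# log10 probability of a word sequence, with sentence-boundary symbols added
print(lm_s.score("the cat sat on the mat", bos=True, eos=True))

The complex-sentence language model LM_C is trained in the same way on the complex sentence set C.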
Step 6: using the phrase table PT, the simplified-sentence language model LM_S and the complex-sentence language model LM_C, build a simplification algorithm P_0^{c→s} from complex sentences to simplified sentences with the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation); the PBMT algorithm was first proposed in "Statistical phrase-based translation" (2003) and is used for bilingual machine translation; given a complex sentence c, the algorithm uses formula (5) to score the sentences s formed by different word combinations, and finally selects the highest-scoring sentence s' as the simplified sentence:

s' = argmax_s p(c|s) p(s)   (5)

In formula (5), the PBMT algorithm decomposes p(c|s) into a product of phrase-table probabilities from PT, and p(s), the probability of sentence s, is obtained from the language model LM_S.
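A full PBMT decoder (e.g. Moses) also handles phrase segmentation, distortion and reordering; the deliberately reduced, greedy word-by-word sketch below only illustrates how formula (5) combines a phrase-table probability with the language-model probability p(s), with the phrase-table entry between a complex word and its candidate standing in for p(c|s):

import math

def simplify(complex_sentence, table, lm_s, lm_weight=1.0):
    # table: the phrase table PT from step 4; lm_s: the KenLM model LM_S from step 5
    words = complex_sentence.split()
    output = []
    for i, c in enumerate(words):
        candidates = table.get(c, {c: 1.0})  # keep the word if PT has no entry for it
        best, best_score = c, -math.inf
        for s, p in candidates.items():
            hypothesis = " ".join(output + [s] + words[i + 1:])
            score = math.log10(p) + lm_weight * lm_s.score(hypothesis, bos=True, eos=True)
            if score > best_score:
                best, best_score = s, score
        output.append(best)
    return " ".join(output)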
Step 7: since only non-parallel corpora are available, iteratively executing the back-translation (Back-translation) strategy starting from the initial PBMT algorithm P_0^{c→s} converts a very difficult unsupervised learning problem into a sequence of supervised learning tasks, thereby generating a better text simplification algorithm;
Step 7.1: first translate the complex sentence set C with the algorithm P_0^{c→s}, obtaining a newly synthesized simplified sentence set S_0; then execute steps 7.2 to 7.5 in a loop, with the iteration number i running from 1 to N;
Step 7.2: using the synthesized parallel corpus (S_{i-1}, C), the simplified-sentence language model LM_S and the complex-sentence language model LM_C, train a new PBMT algorithm P_i^{s→c} from simplified sentences to complex sentences;
Step 7.3: translate the simplified sentence set S with P_i^{s→c}, obtaining a newly synthesized complex sentence set C_i;
Step 7.4: using the synthesized parallel corpus (C_i, S), the simplified-sentence language model LM_S and the complex-sentence language model LM_C, train a new PBMT algorithm P_i^{c→s} from complex sentences to simplified sentences;
Step 7.5: translate the complex sentence set C with P_i^{c→s}, obtaining a newly synthesized simplified sentence set S_i; then return to step 7.2 and repeat, until N iterations have been executed; in this example, N is set to 3.
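In outline, the loop of steps 7.1 to 7.5 reduces to the following sketch; train_pbmt and translate are hypothetical placeholder names for the PBMT training and decoding routines (e.g. wrappers around Moses invocations), and passing only the target-side language model to training is likewise an assumption made for brevity:

def back_translation(C, S, lm_s, lm_c, p0_c2s, N=3):
    # p0_c2s: the initial complex-to-simplified PBMT algorithm built in step 6
    S_syn = translate(p0_c2s, C)                       # step 7.1: synthesized set S_0
    p_c2s = p0_c2s
    for i in range(1, N + 1):                          # in this example, N = 3
        p_s2c = train_pbmt(src=S_syn, tgt=C, lm=lm_c)  # step 7.2: simplified -> complex
        C_syn = translate(p_s2c, S)                    # step 7.3: synthesized set C_i
        p_c2s = train_pbmt(src=C_syn, tgt=S, lm=lm_s)  # step 7.4: complex -> simplified
        S_syn = translate(p_c2s, C)                    # step 7.5: synthesized set S_i
    return p_c2s                                       # the final simplification algorithm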
Intuitively, since the input to the PBMT algorithm contains noise, many entries in the phrase table are incorrect; nevertheless, the language model can help correct some of these errors while generating simplified sentences; whenever this happens, the phrase table and the translation algorithm are correspondingly strengthened as the iterations continue; as more entries in the phrase table are repaired, the PBMT algorithm also becomes stronger and stronger.
The present invention is not limited to the above embodiments; on the basis of the technical solution disclosed by the invention, those skilled in the art can, according to the disclosed technical content and without creative labor, make substitutions and variations of some of the technical features, and such substitutions and variations all fall within the protection scope of the present invention.

Claims (3)

1. An unsupervised automatic English sentence simplification algorithm, characterized in that it is carried out as follows:
Step 1: using the publicly available English Wikipedia corpus D as training corpus, obtain the vector representation v_t of each word t with the word embedding algorithm Word2vec; the word vectors obtained by Word2vec capture the semantic features of words well; the Skip-Gram model is used to learn the Word2vec embeddings; given the corpus D and a word t, consider a sliding window centered on t, and let W_t denote the set of words appearing in the context window of t; the log probability of observing the context word set is defined as follows:

log p(W_t|t) = Σ_{w∈W_t} log p(w|t),  where p(w|t) = exp(v'_w · v_t) / Σ_{w'∈V} exp(v'_{w'} · v_t)   (1)

In formula (1), v'_w is the context vector representation of word w and V is the vocabulary of D; the overall objective function of Skip-Gram is then defined as follows:

L = Σ_{t∈D} Σ_{w∈W_t} log p(w|t)   (2)

In formula (2), the vector representations of words are learned by maximizing this objective function;
Step 2: using the Wikipedia corpus D, count the frequency f(t) of each word t, where f(t) is the number of occurrences of t in D;
Step 3: using the Wikipedia corpus D, obtain a simplified sentence set S and a complex sentence set C;
Step 4: using the vector representations and the frequencies of words, fill a phrase table PT (Phrase Table) holding the probabilities with which one word is translated into another; in PT, the translation probability p(t_j|t_i) from word t_i to word t_j is computed as follows:

p(t_j|t_i) = exp(cos(v_{t_i}, v_{t_j})) / Σ_{t_k∈V} exp(cos(v_{t_i}, v_{t_k}))   (4)

In formula (4), cos denotes cosine similarity;
Step 5: for the simplified sentence set S and the complex sentence set C, train language models with the KenLM algorithm, obtaining a simplified-sentence language model LM_S and a complex-sentence language model LM_C; LM_S and LM_C remain unchanged during the iterative learning process below;
Step 6: using the phrase table PT, the simplified-sentence language model LM_S and the complex-sentence language model LM_C, build a simplification algorithm P_0^{c→s} from complex sentences to simplified sentences with the phrase-based machine translation algorithm PBMT (Phrase-based Machine Translation); given a complex sentence c, the algorithm uses formula (5) to score the sentences s formed by different word combinations, and finally selects the highest-scoring sentence s' as the simplified sentence:

s' = argmax_s p(c|s) p(s)   (5)

In formula (5), the PBMT algorithm decomposes p(c|s) into a product of phrase-table probabilities from PT, and p(s), the probability of sentence s, is obtained from the language model LM_S;
Step 7: starting from the initial PBMT algorithm P_0^{c→s}, iteratively execute the back-translation (Back-translation) strategy to generate a better text simplification algorithm.
2. The unsupervised automatic English sentence simplification algorithm according to claim 1, characterized in that step 3 specifically comprises:
Step 3.1: score each sentence s in the Wikipedia corpus D with the Flesch Reading Ease (FRE) algorithm, as in formula (3), and sort the sentences from high to low by score;

FRE(s) = 206.835 - 1.015 · tw(s) - 84.6 · ts(s)/tw(s)   (3)

In formula (3), FRE(s) is the FRE score of sentence s, tw(s) is the number of words in sentence s, and ts(s) is the number of syllables in sentence s;
Step 3.2: remove the sentences scoring above 100, remove the sentences scoring below 20, and remove the sentences with intermediate scores; finally, select the remaining high-scoring sentences as the simplified sentence set S and the remaining low-scoring sentences as the complex sentence set C.
3. The unsupervised automatic English sentence simplification algorithm according to claim 1, characterized in that step 7 specifically comprises:
Step 7.1: first translate the complex sentence set C with the algorithm P_0^{c→s}, obtaining a newly synthesized simplified sentence set S_0; then execute steps 7.2 to 7.5 in a loop, with the iteration number i running from 1 to N;
Step 7.2: using the synthesized parallel corpus (S_{i-1}, C), the simplified-sentence language model LM_S and the complex-sentence language model LM_C, train a new PBMT algorithm P_i^{s→c} from simplified sentences to complex sentences;
Step 7.3: translate the simplified sentence set S with P_i^{s→c}, obtaining a newly synthesized complex sentence set C_i;
Step 7.4: using the synthesized parallel corpus (C_i, S), the simplified-sentence language model LM_S and the complex-sentence language model LM_C, train a new PBMT algorithm P_i^{c→s} from complex sentences to simplified sentences;
Step 7.5: translate the complex sentence set C with P_i^{c→s}, obtaining a newly synthesized simplified sentence set S_i; return to step 7.2 and repeat, until N iterations have been executed.
CN201910354246.1A 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm Active CN110096705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910354246.1A CN110096705B (en) 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910354246.1A CN110096705B (en) 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm

Publications (2)

Publication Number Publication Date
CN110096705A true CN110096705A (en) 2019-08-06
CN110096705B CN110096705B (en) 2023-09-08

Family

ID=67446309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910354246.1A Active CN110096705B (en) 2019-04-29 2019-04-29 Unsupervised English sentence automatic simplification algorithm

Country Status (1)

Country Link
CN (1) CN110096705B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103279478A (en) * 2013-04-19 2013-09-04 国家电网公司 Method for extracting features based on distributed mutual information documents
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN105447206A (en) * 2016-01-05 2016-03-30 深圳市中易科技有限责任公司 New comment object identifying method and system based on word2vec algorithm
CN108334495A (en) * 2018-01-30 2018-07-27 国家计算机网络与信息安全管理中心 Short text similarity calculating method and system
CN109614626A * 2018-12-21 2019-04-12 北京信息科技大学 Automatic keyword extraction method based on gravitational model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAKUMI MARUYAMA et al.: "Sentence simplification with core vocabulary", 2017 International Conference on Asian Language Processing (IALP) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427629A (en) * 2019-08-13 2019-11-08 苏州思必驰信息科技有限公司 Semi-supervised text simplified model training method and system
CN110427629B (en) * 2019-08-13 2024-02-06 思必驰科技股份有限公司 Semi-supervised text simplified model training method and system
CN112612892A (en) * 2020-12-29 2021-04-06 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN112612892B (en) * 2020-12-29 2022-11-01 达而观数据(成都)有限公司 Special field corpus model construction method, computer equipment and storage medium
CN113807098A (en) * 2021-08-26 2021-12-17 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN113807098B (en) * 2021-08-26 2023-01-10 北京百度网讯科技有限公司 Model training method and device, electronic equipment and storage medium
CN117808124A (en) * 2024-02-29 2024-04-02 云南师范大学 Llama 2-based text simplification method
CN117808124B (en) * 2024-02-29 2024-05-03 云南师范大学 Llama 2-based text simplification method

Also Published As

Publication number Publication date
CN110096705B (en) 2023-09-08

Similar Documents

Publication Publication Date Title
CN110096705A (en) A kind of unsupervised english sentence simplifies algorithm automatically
CN110543639B (en) English sentence simplification algorithm based on pre-training transducer language model
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN107273355A (en) A kind of Chinese word vector generation method based on words joint training
McMahon et al. Language classification by numbers
Brodsky et al. Characterizing motherese: On the computational structure of child-directed language
US6188976B1 (en) Apparatus and method for building domain-specific language models
US20070174040A1 (en) Word alignment apparatus, example sentence bilingual dictionary, word alignment method, and program product for word alignment
CN109858042B (en) Translation quality determining method and device
CN111400486B (en) Automatic text abstract generation system and method
CN106156013B (en) A kind of two-part machine translation method that regular collocation type phrase is preferential
Alqudsi et al. A hybrid rules and statistical method for Arabic to English machine translation
CN106649289A (en) Realization method and realization system for simultaneously identifying bilingual terms and word alignment
CN105573994A (en) Statistic machine translation system based on syntax framework
CN103810993B (en) Text phonetic notation method and device
Kondrak Identification of cognates and recurrent sound correspondences in word lists
CN113657122B (en) Mongolian machine translation method of pseudo parallel corpus integrating transfer learning
CN107608959A (en) A kind of English social media short text place name identification method
CN106548787A (en) The evaluating method and evaluating system of optimization new word
CN102156692A (en) Forest-based system combination method for counting machine translation
CN106484670A (en) A kind of Chinese word segmentation error correction method, off-line training device and online treatment device
Torunoglu-Selamet et al. Exploring spelling correction approaches for turkish
CN111767743B (en) Machine intelligent evaluation method and system for translation test questions
JP5295037B2 (en) Learning device using Conditional Random Fields or Global Conditional Log-linearModels, and parameter learning method and program in the learning device
CN109446537B (en) Translation evaluation method and device for machine translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant