CN113705223A - Personalized English text simplification method taking reader as center - Google Patents

Personalized English text simplification method taking reader as center

Info

Publication number
CN113705223A
CN113705223A
Authority
CN
China
Prior art keywords
sentence
word
reader
words
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111025610.3A
Other languages
Chinese (zh)
Inventor
强继朋
张峰
李云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangzhou University
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University filed Critical Yangzhou University
Priority to CN202111025610.3A priority Critical patent/CN113705223A/en
Publication of CN113705223A publication Critical patent/CN113705223A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a personalized English text simplification method taking readers as centers, which comprises the following steps: step 1, setting the simplification level of the method according to the reader's current English level, and acquiring the lexicon corresponding to that level; step 2, performing sentence segmentation on the text read by the reader to obtain a sentence set; and step 3, simplifying each sentence in the sentence set in order from front to back with a sentence and word simplification method, obtaining a simplified sentence set, and returning it to the reader. The invention makes full use of a pre-trained language model and a dictionary, meets different readers' needs for English text simplification, and at the same time improves the accuracy of English text simplification.

Description

Personalized English text simplification method taking reader as center
Technical Field
The invention relates to the field of English text simplification, in particular to a personalized English text simplification method taking readers as centers.
Background
In recent years, with the development of the internet, a large amount of English text has come into public view. For example, many people choose to read professional papers downloaded from English journals directly, rather than first translating them into their native language. If such a text contains many rare and uncommon words, the reader's understanding of its meaning is severely restricted. Research shows that if 90% of the English words in a text are within the reader's cognitive range, the reader can easily understand its meaning even when the text is long and complex.
English text simplification aims to simplify the words or syntax in a text so that a reader can read and understand its meaning while the original information is retained to the greatest extent. For a given input text, a simplification system that replaces complex words with simple ones must satisfy two conditions: 1) the output text should preserve the meaning of the input as much as possible; 2) the output text should minimize the number of complex words (words the reader cannot understand). These two conditions can conflict: to reduce the complexity of the text, the system can always pick the simplest candidate substitute, but doing so cannot guarantee that the original semantics of the text are preserved. Existing text simplification algorithms do not consider the reader's cognitive level; they blindly mark some relatively low-frequency words as complex and replace them with words of similar sense, which achieves a simplification effect but increases the risk that the simplified text diverges in meaning from the original.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a personalized English text simplification method taking a reader as the center.
The purpose of the invention is realized as follows: a personalized English text simplification method taking a reader as a center comprises the following steps:
step 1, according to the reader's current English level, setting the simplification level of the method and acquiring the lexicon R corresponding to that level;
step 2, supposing that the reader is currently reading the document Text, splitting the Text with a sentence segmentation method to obtain a sentence set T = {c1, …, ci, …, cm}; m represents the number of sentences in the set T;
step 3, simplifying each sentence ci (1 ≤ i ≤ m) in T in order from front to back with the sentence and word simplification method, obtaining a simplified sentence set SS = {s1, …, si, …, sm}, and returning SS to the reader.
As a further limitation of the present invention, the step 2 specifically includes the following steps:
step 2.1: defining a set T, wherein the initial value is null;
step 2.2: deleting special symbols and redundant characters in the document Text;
step 2.3: segmenting the document Text on the '.' symbol to obtain an initial sentence set T_init;
step 2.4: sequentially traversing the sentences senta in the set T_init, with the initial value of a being 1.
As a further limitation of the present invention, the step 2.4 specifically includes the following steps:
step 2.4.1: for senta, judging whether senta contains the '?' or '!' symbol; if so, performing the following steps; otherwise adding senta to the set T and executing step 2.4.4;
step 2.4.2: if senta contains the '!' symbol, segmenting senta on '!' to obtain a clause set ta;
step 2.4.3: otherwise, if senta contains the '?' symbol, segmenting senta on '?' and sequentially adding the obtained clauses to the set T;
step 2.4.4: letting a be a + 1 and repeating step 2.4 until all sentences in the set T_init have been traversed.
As a further limitation of the present invention, the step 2.4.2 specifically comprises the following steps:
step 2.4.2.1: traversing the set ta and judging whether each clause contains the '?' symbol; if so, segmenting the clause on '?' and sequentially adding the obtained clauses to the set T; otherwise adding the clause to the set T directly.
As a further limitation of the present invention, the step 3 specifically includes the following steps:
step 3.1: using a word segmentation tool to tokenize the sentence ci, obtaining the corresponding word set with part-of-speech tags ci = {{w1, p1}, …, {wj, pj}, …, {wn, pn}}; wj (1 ≤ j ≤ n) represents the jth word in the sentence, pj is the part-of-speech tag of wj, and n represents the number of words in the sentence ci;
step 3.2: initializing j to 1 and assigning the original sentence ci to the simplified sentence si;
step 3.3: if j is equal to n + 1, returning the simplified sentence si and terminating the iteration; otherwise continuing with step 3.4;
step 3.4: judging whether wj is a stop word; if not, executing step 3.5; otherwise assigning j + 1 to j and executing step 3.3;
step 3.5: judging whether pj belongs to the part-of-speech set {noun (n), verb (v), adjective (adj), adverb (adv)}; if so, executing step 3.6; otherwise assigning j + 1 to j and executing step 3.3;
step 3.6: extracting the stem stemj of wj with a stemming tool and judging whether stemj belongs to the reader's lexicon R; if not, executing step 3.7; otherwise assigning j + 1 to j and executing step 3.3;
step 3.7: obtaining the synonym set Syn of the word wj from a public online synonym dictionary;
step 3.8: adopting a word simplification method based on the pre-trained language representation model Bert to obtain the candidate substitute words CS = {cs1, …, csk, …, csp} for the word wj in the sentence ci; csk (1 ≤ k ≤ p) represents the kth word in CS, and p is the number of candidate substitutes specified by the user;
step 3.9: screening the candidate substitutes CS to determine the final substitute subj, and replacing the original word wj in the simplified sentence si with subj.
As a further limitation of the present invention, the step 3.8 specifically includes the following steps:
step 3.8.1: obtaining an English pre-trained language representation model Bert;
step 3.8.2: using the "[MASK]" symbol peculiar to the Bert model to replace the word wj in the sentence ci; the sentence after substitution is defined as ci';
step 3.8.3: concatenating in order the symbol "[CLS]", the sentence ci, the sentence ci' and the symbol "[SEP]"; the sequence after splicing is defined as Q = {[CLS], ci, ci', [SEP]};
step 3.8.4: obtaining by formula (1) the generation order X = {x1, …, xy, …, xv} of all words in the vocabulary at the "[MASK]" position; xy (1 ≤ y ≤ v) represents the word ranked at the yth position, and v represents the vocabulary size of the Bert model;
X(·|[MASK]) = Bert(Q)    (1)
step 3.8.5: defining the set CS with an initial value of empty, and initializing y to 1;
step 3.8.6: if the number of words in the set CS is equal to p, terminating the iteration; otherwise continuing with step 3.8.7;
step 3.8.7: obtaining the stem of xy with the stemming tool; if it is not equal to the stem stemj of wj, adding xy to the set CS; in either case, assigning y + 1 to y and executing step 3.8.6.
As a further limitation of the present invention, the step 3.9 specifically includes the following steps:
step 3.9.1: initializing k to 1;
step 3.9.2: if k is equal to p + 1, assigning the original word wj to subj and terminating the iteration; otherwise continuing with step 3.9.3;
step 3.9.3: judging whether csk belongs to the synonym set Syn; if so, executing step 3.9.4; otherwise assigning k + 1 to k and executing step 3.9.2;
step 3.9.4: judging whether csk belongs to the reader's lexicon R; if so, assigning csk to subj and terminating the iteration; otherwise assigning k + 1 to k and executing step 3.9.2.
By adopting the above technical scheme, compared with the prior art, the invention has the following beneficial effects: 1. the method identifies the words in the text that fall outside the reader's cognitive range and replaces only those as complex words, thereby avoiding unnecessary simplification, preserving the original semantics of the text to the greatest extent, and taking the reader's reading ability into account.
2. The method uses both the candidate words generated by the pre-trained language model for the complex words and the corresponding synonyms in the dictionary, making the substituted text semantically more accurate.
3. The invention selects the intersection of the candidate word set generated by the pre-trained model and the synonym set in the dictionary, and retrieves from that intersection the words within the reader's cognitive range, thereby achieving the purpose of simplifying the text.
Detailed Description
A personalized English text simplification method taking readers as centers specifically comprises the following steps:
step 1, according to the reader's current English level, setting the simplification level of the method and acquiring the lexicon R corresponding to that level; the English levels in the invention are divided into 4 types: College English Test band 4 and band 6 (CET-4/CET-6) for college students, and Test for English Majors band 4 and band 8 (TEM-4/TEM-8) for English majors; the lexicon R can be obtained from https://github.com/mahavivo/english-wordlist.
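As an illustration, such a word list can be loaded into the lexicon R as a set of stems (a minimal Python sketch; the function name `load_lexicon` and the default identity stemmer are assumptions — in practice the stems would come from the same stemming tool used in step 3.6):

```python
def load_lexicon(path, stem=lambda w: w):
    """Build the reader's lexicon R: read a plain word list
    (one word per line) and store the stem of each entry, so that
    the membership test in step 3.6 compares stems, not surface forms."""
    with open(path, encoding="utf-8") as f:
        return {stem(line.strip().lower()) for line in f if line.strip()}
```

Storing stems rather than surface forms means an inflected word in the text (e.g. "running") still matches the list entry "run".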
Step 2, supposing that the reader is currently reading the document Text, splitting the Text with a sentence segmentation method to obtain a sentence set T = {c1, …, ci, …, cm}; m represents the number of sentences in the set T;
step 2.1: defining a set T, wherein the initial value is null;
step 2.2: deleting special symbols and redundant characters in the document Text, mainly removing redundant characters such as '\n', '\t' and unmatched brackets;
step 2.3: segmenting the document Text on the '.' symbol to obtain an initial sentence set T_init;
step 2.4: sequentially traversing the sentences senta in the set T_init, with the initial value of a being 1;
step 2.4.1: for senta, judging whether senta contains the '?' or '!' symbol; if so, performing the following steps; otherwise adding senta to the set T and executing step 2.4.4;
step 2.4.2: if senta contains the '!' symbol, segmenting senta on '!' to obtain a clause set ta;
step 2.4.2.1: traversing the set ta and judging whether each clause contains the '?' symbol; if so, segmenting the clause on '?' and sequentially adding the obtained clauses to the set T; otherwise adding the clause to the set T directly;
step 2.4.3: otherwise, if senta contains the '?' symbol, segmenting senta on '?' and sequentially adding the obtained clauses to the set T;
step 2.4.4: letting a be a + 1 and repeating step 2.4 until all sentences in the set T_init have been traversed.
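The segmentation in steps 2.1 to 2.4.4 can be sketched in Python as follows (a minimal illustration; the function name and the cleanup regex are assumptions, and the patent's fuller handling of unmatched brackets is omitted):

```python
import re

def split_sentences(text):
    """Split a document into clauses, following steps 2.1-2.4:
    clean the text, split on '.', then further split any piece
    that still contains '!' or '?'."""
    # Step 2.2 (partial): replace newlines and tabs with spaces
    text = re.sub(r"[\n\t]", " ", text)
    # Step 2.3: initial split on '.'
    t_init = [s.strip() for s in text.split(".") if s.strip()]
    result = []
    # Step 2.4: split each piece on '!' and then on '?'
    for sent in t_init:
        for part_a in sent.split("!"):
            for part_b in part_a.split("?"):
                part_b = part_b.strip()
                if part_b:
                    result.append(part_b)
    return result
```

Splitting on the raw characters follows the patent literally; a production system would more likely use a trained sentence tokenizer to avoid breaking on abbreviations like "e.g.".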
Step 3, simplifying each sentence ci (1 ≤ i ≤ m) in T in order from front to back with the sentence and word simplification method, obtaining a simplified sentence set SS = {s1, …, si, …, sm}, and returning SS to the reader;
step 3.1: using a word segmentation tool to tokenize the sentence ci, obtaining the corresponding word set with part-of-speech tags ci = {{w1, p1}, …, {wj, pj}, …, {wn, pn}}; wj (1 ≤ j ≤ n) represents the jth word in the sentence, pj is the part-of-speech tag of wj, and n represents the number of words in the sentence ci;
step 3.2: initializing j to 1 and assigning the original sentence ci to the simplified sentence si;
step 3.3: if j is equal to n + 1, returning the simplified sentence si and terminating the iteration; otherwise continuing with step 3.4;
step 3.4: judging whether wj is a stop word; if not, executing step 3.5; otherwise assigning j + 1 to j and executing step 3.3;
step 3.5: judging whether pj belongs to the part-of-speech set {noun (n), verb (v), adjective (adj), adverb (adv)}; if so, executing step 3.6; otherwise assigning j + 1 to j and executing step 3.3;
step 3.6: extracting the stem stemj of wj with a stemming tool and judging whether stemj belongs to the reader's lexicon R; if not, executing step 3.7; otherwise assigning j + 1 to j and executing step 3.3; the tokenizer, the stop-word list, the stemming tool and the part-of-speech tagger used in steps 3.1 to 3.6 all come from the nltk library of the Python language;
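The filtering in steps 3.4 to 3.6 can be sketched as a pure function over already-tagged tokens (hedged: the patent uses nltk's tokenizer, tagger, stop-word list and PorterStemmer; here `naive_stem` is a toy stand-in so the sketch stays self-contained, and the Penn Treebank tag prefixes NN/VB/JJ/RB are assumed as the tagger's noun/verb/adjective/adverb labels):

```python
def naive_stem(word):
    """Toy stand-in for a real stemmer (illustration only):
    strips a few common English suffixes."""
    for suf in ("ing", "ed", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def words_to_simplify(tagged_tokens, stop_words, reader_stems):
    """Steps 3.4-3.6: keep content words (noun/verb/adjective/adverb)
    that are not stop words and whose stem is outside the reader's
    lexicon R; these are the complex words to be replaced."""
    targets = []
    for word, tag in tagged_tokens:
        w = word.lower()
        if w in stop_words:                          # step 3.4
            continue
        if not tag.startswith(("NN", "VB", "JJ", "RB")):  # step 3.5
            continue
        if naive_stem(w) not in reader_stems:        # step 3.6
            targets.append(word)
    return targets
```

With nltk installed, `tagged_tokens` would come from `nltk.pos_tag(nltk.word_tokenize(sentence))` and `naive_stem` would be replaced by `PorterStemmer().stem`.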
step 3.7: obtaining the synonym set Syn of the word wj from a public online synonym dictionary; the thesaurus used here is Big Huge Thesaurus (https://words.bighugelabs.com/), and the synonym set Syn of the word wj is obtained through its online API, documented at https://words.bighugelabs.com/site/api;
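A sketch of consuming such an API response follows (the JSON shape `{"noun": {"syn": [...]}, ...}` is an assumption based on the service's public documentation, and the actual HTTP call, which requires an API key, is omitted):

```python
# Endpoint pattern (API key required); shown for context only, not called:
#   https://words.bighugelabs.com/api/2/{api_key}/{word}/json

def extract_synonyms(response_json, pos=None):
    """Flatten a thesaurus JSON response into one synonym set Syn;
    optionally keep only one part of speech ('noun', 'verb', ...).
    Antonyms and other relation types are ignored."""
    syns = set()
    for tag, relations in response_json.items():
        if pos is None or tag == pos:
            syns.update(relations.get("syn", []))
    return syns
```

Restricting Syn to the part of speech found in step 3.5 would avoid, e.g., replacing a verb with a noun synonym.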
step 3.8: adopting a word simplification method based on the pre-trained language representation model Bert to obtain the candidate substitute words CS = {cs1, …, csk, …, csp} for the word wj in the sentence ci; csk (1 ≤ k ≤ p) represents the kth word in CS, and p is the number of candidate substitutes specified by the user; the pre-trained language model Bert comes from the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", published in 2018; Bert trains a masked language model (MLM) on a massive text corpus. In the lexical simplification algorithm, the MLM masks the complex word, predicts for every word in the vocabulary the probability that it fills the masked position, and selects the words with high probability values as candidate substitutes;
step 3.8.1: obtaining an English pre-trained language representation model Bert; here a Bert model implemented on the PyTorch framework is selected, downloading "BERT-Large, Uncased (Whole Word Masking)" from https://github.com/***-research/bert;
step 3.8.2: using the "[MASK]" symbol peculiar to the Bert model to replace the word wj in the sentence ci; the sentence after substitution is defined as ci';
Step 3.8.3: concatenating symbols "[ CLS ] in sequence]", sentence ciSentence ci'sum symbol' [ SEP]", the sequence after splicing is defined as Q { [ CLS { [],ci,ci’,[SEP]}; the benefits of using sequence Q are: first, consider the sentence seniWord pairs in need of simplification in China]Predicting the influence of a result, and then improving the probability of words similar to the original words by using the NSP task of the Bert model;
step 3.8.4: obtaining by formula (1) the generation order X = {x1, …, xy, …, xv} of all words in the vocabulary at the "[MASK]" position; xy (1 ≤ y ≤ v) represents the word ranked at the yth position, and v represents the vocabulary size of the Bert model;
X(·|[MASK]) = Bert(Q)    (1)
step 3.8.5: defining the set CS with an initial value of empty, and initializing y to 1;
step 3.8.6: if the number of words in the set CS is equal to p, terminating the iteration; otherwise continuing with step 3.8.7; p here takes the value 100;
step 3.8.7: obtaining the stem of xy with the stemming tool; if it is not equal to the stem stemj of wj, adding xy to the set CS; in either case, assigning y + 1 to y and executing step 3.8.6.
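Steps 3.8.4 to 3.8.7 reduce to the following selection over the vocabulary ranking X (a hedged sketch: the ranking itself would come from the Bert model's masked-word prediction over the sequence Q, which is stubbed out here, and `stem` stands in for the nltk stemming tool):

```python
def select_candidates(ranked_words, original, p, stem):
    """Walk the vocabulary in descending MLM probability (the order X)
    and keep the first p words whose stem differs from the stem of the
    complex word, so that mere inflections of the original are skipped."""
    original_stem = stem(original)
    cs = []
    for word in ranked_words:
        if len(cs) == p:                      # step 3.8.6: stop at |CS| = p
            break
        if stem(word) != original_stem:       # step 3.8.7: drop same-stem words
            cs.append(word)
    return cs
```

With the Hugging Face transformers library, `ranked_words` could be obtained by running `BertForMaskedLM` on Q and sorting the vocabulary by the logits at the [MASK] position.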
Step 3.9: screening the candidate substitutes CS to determine the final substitute subj, and replacing the original word wj in the simplified sentence si with subj.
Step 3.9.1: initializing k to 1;
step 3.9.2: if k is equal to p + 1, assigning the original word wj to subj and terminating the iteration; otherwise continuing with step 3.9.3;
step 3.9.3: judging whether csk belongs to the synonym set Syn; if so, executing step 3.9.4; otherwise assigning k + 1 to k and executing step 3.9.2;
step 3.9.4: judging whether csk belongs to the reader's lexicon R; if so, assigning csk to subj and terminating the iteration; otherwise assigning k + 1 to k and executing step 3.9.2.
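Step 3.9 amounts to choosing the first candidate that is both a dictionary synonym and within the reader's lexicon, falling back to the original word when none qualifies (a minimal sketch; the function name is illustrative):

```python
def choose_substitute(candidates, synonyms, reader_lexicon, original):
    """Scan the ranked candidates CS in order; return the first word that
    is in both the synonym set Syn and the reader's lexicon R, otherwise
    keep the original complex word unchanged (step 3.9.2)."""
    for cs in candidates:
        if cs in synonyms and cs in reader_lexicon:
            return cs
    return original
```

Keeping the original word when no candidate passes both filters is what prevents the method from replacing a complex word with something the reader would not recognize or that distorts the meaning.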
The invention provides a personalized English text simplification method taking readers as centers which, according to the cognitive level of a given class of readers, simplifies only the parts of a text that those readers cannot understand, so that the readers can understand the text content well while the original text information is retained to the greatest extent. It makes full use of the pre-trained language model and the dictionary, meets different readers' needs for English text simplification, and at the same time improves the accuracy of English text simplification.
The present invention is not limited to the above-mentioned embodiments, and based on the technical solutions disclosed in the present invention, those skilled in the art can make some substitutions and modifications to some technical features without creative efforts according to the disclosed technical contents, and these substitutions and modifications are all within the protection scope of the present invention.

Claims (7)

1. A personalized English text simplification method taking a reader as a center is characterized by comprising the following steps:
step 1, according to the reader's current English level, setting the simplification level of the method and acquiring the lexicon R corresponding to that level;
step 2, supposing that the reader is currently reading the document Text, splitting the Text with a sentence segmentation method to obtain a sentence set T = {c1, …, ci, …, cm}; m represents the number of sentences in the set T;
step 3, simplifying each sentence ci (1 ≤ i ≤ m) in T in order from front to back with the sentence and word simplification method, obtaining a simplified sentence set SS = {s1, …, si, …, sm}, and returning SS to the reader.
2. The reader-centric personalized english text simplification method according to claim 1, characterized in that said step 2 specifically comprises the steps of:
step 2.1: defining a set T, wherein the initial value is null;
step 2.2: deleting special symbols and redundant characters in the document Text;
step 2.3: segmenting the document Text on the '.' symbol to obtain an initial sentence set T_init;
step 2.4: sequentially traversing the sentences senta in the set T_init, with the initial value of a being 1.
3. The reader-centric personalized english text simplification method according to claim 2, characterized in that said step 2.4 specifically comprises the steps of:
step 2.4.1: for senta, judging whether senta contains the '?' or '!' symbol; if so, performing the following steps; otherwise adding senta to the set T and executing step 2.4.4;
step 2.4.2: if senta contains the '!' symbol, segmenting senta on '!' to obtain a clause set ta;
step 2.4.3: otherwise, if senta contains the '?' symbol, segmenting senta on '?' and sequentially adding the obtained clauses to the set T;
step 2.4.4: let a be a +1, repeat step 2.4 until all sentences in the set T _ init have been traversed.
4. The reader-centric personalized english text simplification method according to claim 3, characterized in that said step 2.4.2 specifically comprises the steps of:
step 2.4.2.1: traversing the set ta and judging whether each clause contains the '?' symbol; if so, segmenting the clause on '?' and sequentially adding the obtained clauses to the set T; otherwise adding the clause to the set T directly.
5. The reader-centric personalized english text simplification method according to claim 1, characterized in that said step 3 specifically comprises the steps of:
step 3.1: using a word segmentation tool to tokenize the sentence ci, obtaining the corresponding word set with part-of-speech tags ci = {{w1, p1}, …, {wj, pj}, …, {wn, pn}}; wj (1 ≤ j ≤ n) represents the jth word in the sentence, pj is the part-of-speech tag of wj, and n represents the number of words in the sentence ci;
step 3.2: initializing j to 1 and assigning the original sentence ci to the simplified sentence si;
step 3.3: if j is equal to n + 1, returning the simplified sentence si and terminating the iteration; otherwise continuing with step 3.4;
step 3.4: judging whether wj is a stop word; if not, executing step 3.5; otherwise assigning j + 1 to j and executing step 3.3;
step 3.5: judging whether pj belongs to the part-of-speech set {noun (n), verb (v), adjective (adj), adverb (adv)}; if so, executing step 3.6; otherwise assigning j + 1 to j and executing step 3.3;
step 3.6: extracting the stem stemj of wj with a stemming tool and judging whether stemj belongs to the reader's lexicon R; if not, executing step 3.7; otherwise assigning j + 1 to j and executing step 3.3;
step 3.7: obtaining the synonym set Syn of the word wj from a public online synonym dictionary;
step 3.8: adopting a word simplification method based on the pre-trained language representation model Bert to obtain the candidate substitute words CS = {cs1, …, csk, …, csp} for the word wj in the sentence ci; csk (1 ≤ k ≤ p) represents the kth word in CS, and p is the number of candidate substitutes specified by the user;
step 3.9: screening the candidate substitutes CS to determine the final substitute subj, and replacing the original word wj in the simplified sentence si with subj.
6. The reader-centric personalized english text simplification method according to claim 5, characterized in that said step 3.8 specifically comprises the steps of:
step 3.8.1: obtaining an English pre-trained language representation model Bert;
step 3.8.2: using the "[MASK]" symbol peculiar to the Bert model to replace the word wj in the sentence ci; the sentence after substitution is defined as ci';
step 3.8.3: concatenating in order the symbol "[CLS]", the sentence ci, the sentence ci' and the symbol "[SEP]"; the sequence after splicing is defined as Q = {[CLS], ci, ci', [SEP]};
step 3.8.4: obtaining by formula (1) the generation order X = {x1, …, xy, …, xv} of all words in the vocabulary at the "[MASK]" position; xy (1 ≤ y ≤ v) represents the word ranked at the yth position, and v represents the vocabulary size of the Bert model;
X(·|[MASK]) = Bert(Q)    (1)
step 3.8.5: defining the set CS with an initial value of empty, and initializing y to 1;
step 3.8.6: if the number of words in the set CS is equal to p, terminating the iteration; otherwise continuing with step 3.8.7;
step 3.8.7: obtaining the stem of xy with the stemming tool; if it is not equal to the stem stemj of wj, adding xy to the set CS; in either case, assigning y + 1 to y and executing step 3.8.6.
7. The reader-centric personalized english text simplification method according to claim 5, characterized in that said step 3.9 specifically comprises the steps of:
step 3.9.1: initializing k to 1;
step 3.9.2: if k is equal to p + 1, assigning the original word wj to subj and terminating the iteration; otherwise continuing with step 3.9.3;
step 3.9.3: judging whether csk belongs to the synonym set Syn; if so, executing step 3.9.4; otherwise assigning k + 1 to k and executing step 3.9.2;
step 3.9.4: judging whether csk belongs to the reader's lexicon R; if so, assigning csk to subj and terminating the iteration; otherwise assigning k + 1 to k and executing step 3.9.2.
CN202111025610.3A 2021-09-02 2021-09-02 Personalized English text simplification method taking reader as center Withdrawn CN113705223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111025610.3A CN113705223A (en) 2021-09-02 2021-09-02 Personalized English text simplification method taking reader as center

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111025610.3A CN113705223A (en) 2021-09-02 2021-09-02 Personalized English text simplification method taking reader as center

Publications (1)

Publication Number Publication Date
CN113705223A true CN113705223A (en) 2021-11-26

Family

ID=78658928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111025610.3A Withdrawn CN113705223A (en) 2021-09-02 2021-09-02 Personalized English text simplification method taking reader as center

Country Status (1)

Country Link
CN (1) CN113705223A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230289524A1 (en) * 2022-03-09 2023-09-14 Talent Unlimited Online Services Private Limited Artificial intelligence based system and method for smart sentence completion in mobile devices
US12039264B2 (en) * 2022-03-09 2024-07-16 Talent Unlimited Online Services Pr Artificial intelligence based system and method for smart sentence completion in mobile devices

Similar Documents

Publication Publication Date Title
Harrat et al. Machine translation for Arabic dialects (survey)
CN110543639B (en) English sentence simplification algorithm based on pre-training transducer language model
US8239188B2 (en) Example based translation apparatus, translation method, and translation program
WO2022057116A1 (en) Transformer deep learning model-based method for translating multilingual place name root into chinese
KR20180114781A (en) Apparatus and method for converting dialect into standard language
CN107391486A (en) A kind of field new word identification method based on statistical information and sequence labelling
CN111178061B (en) Multi-lingual word segmentation method based on code conversion
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
EP4276677A1 (en) Cross-language data enhancement-based word segmentation method and apparatus
CN110502759B (en) Method for processing Chinese-Yue hybrid network neural machine translation out-of-set words fused into classification dictionary
Tennage et al. Transliteration and byte pair encoding to improve tamil to sinhala neural machine translation
CN113705223A (en) Personalized English text simplification method taking reader as center
Fang et al. Non-Autoregressive Chinese ASR Error Correction with Phonological Training
Chiu et al. Chinese spell checking based on noisy channel model
Mekki et al. COTA 2.0: An automatic corrector of Tunisian Arabic social media texts
Singh et al. Urdu to Punjabi machine translation: An incremental training approach
Lu et al. An automatic spelling correction method for classical mongolian
Roy et al. Bangla-english neural machine translation with bidirectional long short-term memory and back translation
JP5298834B2 (en) Example sentence matching translation apparatus, program, and phrase translation apparatus including the translation apparatus
Sreeram et al. A Novel Approach for Effective Recognition of the Code-Switched Data on Monolingual Language Model.
Seresangtakul et al. Thai-Isarn dialect parallel corpus construction for machine translation
Raza et al. Saraiki Language Word Prediction And Spell Correction Framework
JP3825645B2 (en) Expression conversion method and expression conversion apparatus
Zalmout Morphological Tagging and Disambiguation in Dialectal Arabic Using Deep Learning Architectures
Astuti et al. Code-Mixed Sentiment Analysis using Transformer for Twitter Social Media Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211126