CN115630140A - English reading material difficulty judgment method based on text feature fusion - Google Patents


Info

Publication number
CN115630140A
CN115630140A (application CN202211364247.2A)
Authority
CN
China
Prior art keywords
sentence
bert
english
difficulty
embedding
Prior art date
Legal status
Pending
Application number
CN202211364247.2A
Other languages
Chinese (zh)
Inventor
Gan Jianhou (甘健侯)
Wang Yuchen (王宇辰)
Li Zijie (李子杰)
Zhou Juxiang (周菊香)
Ouyang Zhaoxiang (欧阳昭相)
Chen Ken (陈恳)
Current Assignee
Yunnan Normal University
Original Assignee
Yunnan Normal University
Priority date
Filing date
Publication date
Application filed by Yunnan Normal University
Priority to CN202211364247.2A
Publication of CN115630140A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/253 - Grammatical analysis; Style critique
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for judging the difficulty of English reading materials based on text feature fusion, and belongs to the field of natural language processing. First, for an English reading material data set, the input English text is encoded and the encoded result is fed into a pre-trained language model to obtain a feature vector containing semantic information. Next, the English text is part-of-speech tagged and the resulting part-of-speech sequence is fed into an LSTM to obtain a feature vector containing grammatical information. In addition, factors known to influence the difficulty of English reading materials are counted and converted into features. All feature vectors are concatenated and passed through a fully connected layer, and a sigmoid finally outputs a value between 0 and 1 representing the difficulty. The method can effectively judge the difficulty of English reading materials and better support adaptive learning services in English teaching.

Description

English reading material difficulty judgment method based on text feature fusion
Technical Field
The invention relates to a method for judging difficulty of English reading materials based on text feature fusion, and belongs to the technical field of natural language processing.
Background
English is one of the most widely learned second languages, and reading is an important component of English learning. Accurately judging the difficulty of English reading materials allows learners at different proficiency levels to receive instruction suited to their level, and thus promotes personalized learning.
Research on the difficulty of English reading materials dates back to the early 20th century, and judging this difficulty remains a core problem for researchers at home and abroad. Many researchers have studied the factors that influence the difficulty of English reading materials, summarized a number of influencing factors, and produced several formulas for computing reading difficulty; these formulas have long helped people select suitable English texts. However, as informatization continues to develop, texts have become more complex, while such hand-crafted rules are generally simple and lack generalization ability, so they can no longer achieve good results.
With the continuous development of language models, the BERT (Bidirectional Encoder Representations from Transformers) model, proposed by Google in October 2018, brought the field of natural language processing into a new stage. BERT is a pre-trained language model: unlike traditional approaches that train a single unidirectional language model or shallowly concatenate two unidirectional language models, BERT uses an MLM (masked language model) objective to train bidirectional Transformers, generating deep bidirectional language representations, and it performed well on 11 different natural language processing (NLP) tasks. Many scholars have since applied BERT to other NLP tasks with good results; migrating a trained model to a new task in this way is called transfer learning. Since most tasks are related to some degree, transferring learned parameters to a new model can greatly accelerate training. Fine-tuning is one transfer learning method: some layers of the pre-trained model are frozen while the remaining layers and the fully connected layers are trained, which further shortens the learning time and reduces the training cost of the model.
Disclosure of Invention
The invention aims to provide a method for judging the difficulty of English reading materials based on text feature fusion, in order to improve the accuracy and efficiency of this judgment.
By summarizing linguists' views on the factors that influence the difficulty of English reading materials, and considering the advantages of pre-trained language models in natural language processing tasks, the invention provides a method that fuses multiple text features and judges the difficulty of English reading materials using deep learning.
The technical scheme of the invention is as follows. First, for an English reading material data set, the input English text is encoded and the encoded information is fed into a trained pre-trained language model to obtain a feature vector containing semantic information. Then the input text is part-of-speech tagged and the resulting part-of-speech sequence is fed into an LSTM (Long Short-Term Memory network) to obtain a feature vector containing grammatical information. Statistics on the factors influencing the difficulty of English reading materials are computed and represented as embeddings. Finally, all features are concatenated and passed through a fully connected layer, and a sigmoid layer outputs a value between 0 and 1 representing the difficulty.
The method for judging the English reading difficulty specifically comprises the following steps:
step1: semantic features of the text are extracted using a pre-trained language model.
First, for an English reading material data set (experiments are carried out on the Newsela data set and a self-collected data set), the input English text is encoded and the encoded information is fed into a trained pre-trained language model to obtain a feature vector containing semantic information.
Specifically, the words, sentence positions, and word positions within sentences are first extracted and one-hot encoded, then fed into the pre-trained language model to obtain semantic feature vectors; BERT is selected as the pre-trained model.
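As a minimal illustration of the one-hot encoding mentioned above (the toy vocabulary here is hypothetical; BERT's real WordPiece vocabulary has roughly 30,000 entries):

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Toy vocabulary for illustration only.
vocab = {"[CLS]": 0, "[SEP]": 1, "[PAD]": 2, "the": 3, "cat": 4, "sat": 5}

sentence = ["[CLS]", "the", "cat", "sat", "[SEP]"]
encoded = [one_hot(vocab[w], len(vocab)) for w in sentence]
```

Each token becomes a sparse vector with a single 1; the pre-trained model's embedding layer then maps these indices to dense vectors.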
Step2: and (5) extracting grammatical information features.
Part-of-speech tagging is performed on the text, and the resulting part-of-speech sequence is fed into the LSTM to obtain a feature vector containing grammatical information.
Step3: and (5) extracting the statistical information features.
Statistics on the factors influencing the difficulty of English reading materials are computed and represented as embeddings; all features are then concatenated and fed into a fully connected layer.
Step4: and (4) difficulty prediction.
A sigmoid layer outputs a value between 0 and 1 representing the difficulty.
The Step1 is specifically as follows:
step1.1: suppose that the currently input English text is S t ,S t In which n words are included, S t ={w 1 ,w 2 ,…,w i ,…,w n In which w i Representing the ith word.
The Bert model typically adds [ CLS ] at the beginning of a sentence to represent the beginning of a paragraph and [ SEP ] in the middle of two sentences to separate the sentences.
The converted sentence is S BERT ={[CLS],w 1 ,w 2 ,…,[SEP],…,w n-2 ,w n-1 ,w n ,[SEP]}。
Step1.2: will S BERT Is set to M, if S t If the length of S is less than M, then S is selected BERT Addition of [ PAD]Performing filling, S after filling operation BERT Comprises the following steps:
S BERT ={[CLS],w 1 ,w 2 ,…,[SEP],…,w n-2 ,w n-1 ,w n ,[SEP],…,[PAD]}
if S t Is greater than M, truncating and discarding the subsequent content, truncating the operated S BERT Comprises the following steps:
S BERT ={[CLS],w 1 ,w 2 ,…,[SEP],…,w M-2 ,w M-1 ,w M ,[SEP]}
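Step1.1 and Step1.2 can be sketched as follows; this simplified version treats each word as one token, whereas a real BERT tokenizer also splits words into WordPiece sub-tokens:

```python
def to_bert_tokens(words, max_len):
    """Add [CLS]/[SEP], then pad with [PAD] or truncate to max_len tokens."""
    tokens = ["[CLS]"] + list(words) + ["[SEP]"]
    if len(tokens) > max_len:
        # Truncate, keeping a final [SEP] so the sentence stays well-formed.
        tokens = tokens[:max_len - 1] + ["[SEP]"]
    else:
        tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens
```

For example, `to_bert_tokens(["a", "b"], 6)` pads a short sentence up to 6 tokens, while longer inputs are cut down to the maximum length M.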
Step1.3: Perform embedding encoding on S_BERT, namely:
S_embedding = Embedding(S_BERT)
where S_embedding ∈ R^{M×D_BERT}, and D_BERT denotes the embedding dimension set by the pre-trained language model.
Step1.4: to S BERT The content in (1) is sentence position coded, namely:
S segmentembedding ={E A ,E A ,E A ,E B ,E B ,E B ,E B ,…,E i ,E i }
wherein E is A Denotes the first sentence, E B The second sentence is represented by the first sentence,
Figure BDA0003923624510000033
subsequent sentences analogized in the same way, E i Indicating the ith sentence.
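The sentence position (segment) indices of Step1.4 can be derived from the [SEP] markers; in BERT these indices are then looked up in an embedding table to give the E_A, E_B, … vectors. A minimal sketch:

```python
def segment_ids(tokens):
    """Assign a sentence index to every token: tokens up to and including
    the first [SEP] get index 0, the next sentence gets index 1, and so on.
    Trailing [PAD] tokens keep the last index."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == "[SEP]":
            current += 1
    return ids
```

The resulting integer per token selects which segment embedding (E_A for 0, E_B for 1, …) is added to the token's word embedding.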
Step1.5: to S BERT The content in (1) is subjected to word position coding, namely:
S positionembedding ={E 1 ,E 2 ,E 3 ,…,E i ,…,E n-2 ,E n-1 ,E n ,…,E M }
wherein E is i A position code representing the ith word,
Figure BDA0003923624510000034
step1.6: will S embedding 、S segmrntembedding 、S positionembedding Inputting the obtained result into a pre-training language model (BERT is used by default) to obtain a feature vector O output by the last layer BERT Namely:
Figure BDA0003923624510000035
step1.7: there are various schemes for selecting a sentence vector, such as: 1) Taking X [CLS] As a sentence vector. 2) To O BERT The average pooling was performed, and the results were obtained. 3) To O is BERT Performing maximum pooling, and taking the result. 4) Mixing O with BERT The results of (2) further extract features using CNN. 5) Is prepared from O BERT The result of (2) is input into the LSTM extraction feature. In the task of the invention, X is selected [CLS] As a sentence vector.
The Step2 specifically comprises the following steps:
step2.1: for the input text S t ={w 1 ,w 2 ,w 3 ,…,w n Add [ CLS ] at the beginning of sentence]Indicating the beginning of a sentence, adding [ SEP ] in the middle of two sentences]For separating sentences, the converted sentences are:
S sen ={[CLS],w 1 ,w 2 ,…,[SEP],…,w n-2 ,w n-1 ,w n ,[SEP],…,[PAD]}
Step2.2: Perform part-of-speech tagging on S_sen to obtain:
S_POS = {[SPACE], [PRP], [VBP], [NNP], …, [RB], [JJ], [SPACE], …, [PAD]}
where [SPACE] stands for [CLS] and [SEP], [PRP] for pronouns, [VBP] for verbs, [NNP] for nouns, [RB] for adverbs, and [JJ] for adjectives.
Step2.3: to S POS Carry out embedded representation to obtain E POS Namely:
Figure BDA0003923624510000041
wherein D is POS Representing the embedding dimension of the part-of-speech token.
Step2.4: will E pos Inputting LSTM, and taking the last layer of output result O pos Feature vectors (i.e., grammatical features) as part-of-speech sequences of sentences in which
Figure BDA0003923624510000042
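As an illustration of Step2.4, the following is a minimal numpy sketch of a single-layer LSTM forward pass that returns the last hidden state; the weights here are random stand-ins rather than trained parameters, and a real implementation would use a deep learning framework's LSTM:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(x_seq, W, U, b, hidden):
    """Run one LSTM layer over a sequence of embeddings and return the
    final hidden state, mirroring how O_POS is taken from the
    part-of-speech embeddings.

    x_seq: (T, D_in); W: (4H, D_in); U: (4H, H); b: (4H,). Gate order i, f, g, o."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in x_seq:
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g          # update cell state
        h = o * np.tanh(c)         # update hidden state
    return h

rng = np.random.default_rng(0)
T, D_in, H = 5, 3, 4               # 5 POS embeddings of dimension 3, hidden size 4
x_seq = rng.normal(size=(T, D_in))
W = rng.normal(size=(4 * H, D_in)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h_last = lstm_last_hidden(x_seq, W, U, b, H)
```

`h_last` plays the role of O_POS: a fixed-size vector summarizing the whole part-of-speech sequence in order.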
Step2 mainly computes the grammatical features of the sentence. Grammar and vocabulary are key to distinguishing the difficulty of English texts, so the complexity of the grammar must be considered. The invention takes the part-of-speech sequence of the sentence as input and uses an LSTM to learn the characteristics of the sequence, thereby obtaining a vectorized representation of the grammar that can be fed into the neural network for subsequent computation. Existing methods mainly obtain grammatical information by counting keywords and keyword co-occurrences, which cannot fully express sequence information; the LSTM used here therefore learns grammatical features better.
The Step3 is specifically as follows:
because factors influencing the difficulty degree of the English reading material need to consider sentence length, preposition number, average word length and the like as influencing factors besides semantics and grammar, the factors are counted and encoded and then input into the model. After the information is added, the convergence is faster during model training, and meanwhile, the robustness of the model is further improved.
The method comprises the following specific steps:
step3.1: and (3) counting the sentence length and carrying out embedding operation: for sentence S t ={w 1 ,w 2 ,…,w n H, then the sentence length is embedded as
Figure BDA0003923624510000043
Where L represents the embedding of the vector for sentence length, n represents the number of words, and D represents the embedding dimension.
Step3.2: counting the preposition number and carrying out embedding operation: for sentence S t ={w 1 ,w 2 ,…,w n }, number of prepositions embedded
Figure BDA0003923624510000044
Wherein, P represents the embedding of the vector as the number of prepositions, x represents the specific number, and D represents the embedding dimension.
Step3.3: and (3) counting the average word length and carrying out embedding operation: for sentence S t ={w 1 ,w 2 ,…,w 1 Embedding of preposition number
Figure BDA0003923624510000045
Where a represents the embedding of the vector as the average word length, x represents the specific number, and D represents the embedding dimension.
Step3.4: will be provided with
Figure BDA0003923624510000046
Splicing is performed as statistical information of sentences:
Figure BDA0003923624510000051
wherein the content of the first and second substances,
Figure BDA0003923624510000052
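The raw statistics of Step3.1 to Step3.3 can be computed as below before embedding; the preposition list is a small illustrative subset, since in practice prepositions would be identified from the part-of-speech tags of Step2:

```python
# Illustrative subset of English prepositions (not an exhaustive list).
PREPOSITIONS = {"in", "on", "at", "of", "to", "with", "for", "by", "from"}

def statistical_features(words):
    """Compute the three raw statistics of Step3 for one sentence:
    sentence length, preposition count, and average word length."""
    n = len(words)
    preps = sum(1 for w in words if w.lower() in PREPOSITIONS)
    avg_len = sum(len(w) for w in words) / n if n else 0.0
    return n, preps, avg_len
```

Each of the three numbers is then mapped through its own embedding table to produce L, P, and A before concatenation.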
the Step4 is specifically as follows:
step4.1: combining semantic features X [CLS] Grammatical feature O POS Statistical information characteristic O STA Input full connection after splicingAnd (3) layer connection, inputting the sigmoid layer prediction result and outputting:
Figure BDA0003923624510000053
step4.2: calculating the loss:
Figure BDA0003923624510000054
wherein, y ic Representing the true class of sample i, if c is equal to 1, and if not, 0,p ic Representing the predicted probability that the observation sample i belongs to the class c;
step4.3: adam is used to optimize the loss in order to minimize the loss, and when the loss is minimized, the model achieves the best results.
This step concatenates the three features, feeds them into the neural network, and uses the sigmoid function to limit the output to [0, 1], thereby realizing the difficulty judgment.
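The fusion and prediction of Step4 can be sketched as follows; the weight vector, bias, and toy feature dimensions are placeholders for illustration, and the loss is written in its binary form since the output is a single probability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_difficulty(x_cls, o_pos, o_sta, w, bias):
    """Concatenate the three feature vectors and map them to a difficulty
    score in (0, 1) with one linear layer plus sigmoid (Step4.1).
    A trained model would supply learned values for w and bias."""
    features = np.concatenate([x_cls, o_pos, o_sta])
    return sigmoid(features @ w + bias)

def bce_loss(y_true, y_pred):
    """Binary cross-entropy averaged over a batch (Step4.2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))
```

With all-zero weights the sigmoid outputs exactly 0.5, the maximally uncertain prediction; Adam would then adjust `w` and `bias` to reduce `bce_loss`.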
The invention has the following beneficial effects. When judging the difficulty of English text, it comprehensively considers the semantic information, grammatical information, and statistical information of the text. Compared with traditional methods, it accounts for the importance of the semantic information of the English text, uses an LSTM to learn the grammatical information, and at the same time feeds the traditional statistical information into the neural network. The result is a difficulty judgment model that is more effective and more robust than traditional methods.
Drawings
FIG. 1 is a flow chart of the steps of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: As shown in FIG. 1, in the method for judging the difficulty of English reading materials based on text feature fusion, the input English text from an English reading material data set is first encoded and the encoded information is fed into a trained pre-trained language model to obtain a feature vector containing semantic information; then the English text is part-of-speech tagged and the resulting part-of-speech sequence is fed into an LSTM to obtain a feature vector containing grammatical information; statistics on the factors influencing the difficulty of English reading materials are computed and represented as embeddings; all features are concatenated and fed into a fully connected layer, and finally a sigmoid layer outputs a value between 0 and 1 representing the difficulty.
Assume there is a set A of English reading materials containing N texts, A = {S_1, S_2, S_3, …, S_N}, where S_i denotes the i-th English reading material text in the collection. The method for judging English reading difficulty specifically comprises the following steps:
step1: the pre-training model selects a Bert model, the pre-training language model part is mainly used for learning semantic information of a text, three features are required for inputting the pre-training language model, and are respectively the feature of each word, the sentence position feature and the word position feature, and the three features are extracted.
Step2: and (5) extracting grammatical features.
Step3: and (5) extracting the statistical information features.
Step4: and (4) difficulty prediction.
Step1 to Step4 are carried out exactly as described in the Disclosure of Invention above.
This example selects two English reading material data sets with difficulty labels, CEFR and Newsela, plus the manually constructed data set CEED of the present invention. CEFR and CEED are public graded English reading text data sets, while Newsela is a non-public graded English reading text data set (available on application via the Newsela website). Basic statistics for the three data sets are shown in Table 1, where Num denotes the number of texts in the data set and Class denotes the number of difficulty classes.
TABLE 1 data set essential information
[Table 1 is provided as an image in the original document; its contents are not recoverable here.]
(1) CEFR consists of 1493 English texts labeled with the Common European Framework of Reference (CEFR) levels A1, A2, B1, B2, C1, and C2, with difficulty increasing from A1 to C2. The English texts were taken from free online resources, including the British Council, ESLFast, and the CNN Daily Mail data set. The texts include conversations, descriptions, short stories, newspaper articles, and other material.
(2) CEED was collected from the reading sections of 469 English examinations, including the senior high school entrance examination (difficulty recorded as Z), the college entrance examination (G), CET-4 (S), CET-6 (L), TEM-4 (E), and TEM-8 (B). Difficulty increases from the senior high school entrance examination to TEM-8.
(3) Newsela consists of 10722 English texts, each labeled with a difficulty level from 2 to 12 according to the US K-12 education standard, with difficulty increasing from 2 to 12.
The invention organizes the English texts in the data sets as follows: first, each English text is read paragraph by paragraph; second, the difficulty level corresponding to each paragraph is recorded; then the number of words, the number of prepositions, and the average word length of each paragraph are computed; finally the results are saved as csv files. The organized data sets contain the following numbers of paragraphs: CEFR contains 12096 paragraphs, Newsela contains 227971 paragraphs, and CEED contains 3381 paragraphs.
To better obtain difficulty coefficients in subsequent experiments, corresponding difficulty labels are added to the extracted paragraphs. In the CEFR data set, the labels A1, A2, B1, and B2 are set to 0, and C1 and C2 are set to 1. In the Newsela data set, levels of 6 and above are set to 1 and levels below 6 are set to 0. Because the categories in the CEED data set are similar in kind, the invention divides it into three subsets: the senior high school entrance examination and college entrance examination data form one subset, CEED-EE; the CET-4 and CET-6 data form one subset, CEED-CET; and the TEM-4 and TEM-8 data form one subset, CEED-TEM. Within each subset, the labels of the senior high school entrance examination, CET-4, and TEM-4 are set to 0, and those of the college entrance examination, CET-6, and TEM-8 are set to 1. The numbers of positive and negative samples in each organized data set are shown in Table 2.
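The label binarization scheme described above can be expressed directly; the function names here are illustrative:

```python
def binarize_newsela(level):
    """Map a Newsela difficulty level (2-12) to a binary label,
    following the scheme above: level >= 6 -> 1 (hard), else 0 (easy)."""
    return 1 if level >= 6 else 0

def binarize_cefr(label):
    """CEFR scheme: A1/A2/B1/B2 -> 0, C1/C2 -> 1."""
    return 1 if label in {"C1", "C2"} else 0
```

Applying these mappings to every paragraph yields the positive and negative sample counts reported in Table 2.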
Table 2: number of positive and negative samples
[Table 2 is provided as an image in the original document; its contents are not recoverable here.]
The invention selects pre-trained language models oriented to the fill-mask task, namely BERT, BART, XLNet, RoBERTa, and XLM-RoBERTa, and compares them with CNN, LSTM, and BiLSTM baselines. PyTorch version 1.10 was used with an NVIDIA GeForce RTX 2080 Ti GPU. The pre-trained models were all obtained from Huggingface. The hyper-parameters were chosen as follows: batch size from {16, 32, 64}, learning rate from {1e-3, 1e-4, 1e-5}, and word embedding dimension 768. Different models were tested on the different data sets, with the following results:
table 3: experimental results for different models in CEFR and Newsela
[Table 3 is provided as an image in the original document; its contents are not recoverable here.]
From Table 3, it can be seen that on both data sets the method of the invention (using BERT as the pre-trained language model) obtained the best results on all three metrics, AUC, ACC, and RMSE. On the CEFR data set, the method outperforms the second-best method on all three metrics, improving AUC by 5.81% and ACC by 7.02% and reducing RMSE by 5.14%. On the Newsela data set, the method is also better than the second-best on all three metrics, improving AUC by 1.63% and ACC by 1.04% and reducing RMSE by 1.15%. When the data set is small (the CEFR data set), the pre-trained language model needs only a small amount of data to perform well.
Table 4: results of different pre-trained language models in CEFR and Newsela
Figure BDA0003923624510000102
As shown in Table 4, the invention compares the effects of different pre-trained language models, each of which improves and enhances BERT for different tasks. From the results, the BERT model obtains the best results on the CEFR dataset, outperforming the second-best model on AUC, ACC, and RMSE: it improves AUC by 0.35% and ACC by 0.24%, and reduces RMSE by 0.92%. The XLNet model gives the best results on the Newsela dataset, improving AUC by 0.37% and ACC by 0.58%, and reducing RMSE by 0.60% compared with BERT. Overall, however, the gap between these pre-trained models is small, and all of them outperform both CNN and LSTM.
Table 5: experimental results of different models in CEED
Figure BDA0003923624510000103
Figure BDA0003923624510000111
As can be seen from Table 5, on all three CEED subsets the method of the invention (using BERT as the pre-trained language model) achieves the best results on all three metrics: AUC, ACC, and RMSE. On the CEED-EE subset, the method outperforms the second-best on all three metrics, improving AUC by 8.20% and ACC by 4.71%, and reducing RMSE by 7.05%. On the CEED-CET subset, it again outperforms the second-best on all three metrics, improving AUC by 5.32% and ACC by 3.77%, and reducing RMSE by 1.95%. On the CEED-TEM subset, AUC and ACC improve over the second-best by 9.09% and 12.5% respectively, and RMSE is reduced by 8.51%.
Table 6: results of different pre-trained language models in CEED
Figure BDA0003923624510000112
As shown in Table 6, the invention also compares the effects of different pre-trained language models on the CEED dataset. Overall, RoBERTa achieves the best results on all three CEED subsets. Compared with BERT, on the CEED-EE subset it improves AUC by 5.06% and ACC by 8.49%, and reduces RMSE by 13.1%. On the CEED-CET subset, it improves AUC by 3.99% and ACC by 11.32%, and reduces RMSE by 11.20%, compared with BERT. On the CEED-TEM subset, it improves AUC by 3.17% and ACC by 3.12%, and reduces RMSE by 4.35%, compared with BERT. As a whole, these pre-trained language models are superior to CNN and LSTM on all three metrics.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to those embodiments, and various changes may be made without departing from its spirit and scope.

Claims (5)

1. A method for judging the difficulty of English reading materials based on text feature fusion, characterized by comprising the following steps:
Step1: first, for an English reading material dataset, encoding the input English text and inputting the encoded information into a trained pre-training language model to obtain a feature vector containing semantic information;
Step2: performing part-of-speech tagging on the text and inputting the resulting part-of-speech sequence into an LSTM to obtain a feature vector containing grammatical information;
Step3: extracting statistical information features: counting the factors that influence the difficulty of English reading materials, representing them as embeddings, splicing all the features, and inputting the spliced features into a fully connected layer;
Step4: finally, outputting through a sigmoid layer a value between 0 and 1 representing the difficulty, completing the difficulty judgment.
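The four steps above can be sketched end to end as feature concatenation followed by a fully connected layer and a sigmoid. This pure-Python sketch stands in for the BERT, LSTM, and embedding components; the function names and toy weights are illustrative assumptions, not the filing's implementation:

```python
import math

def sigmoid(x: float) -> float:
    """Squash a real value into (0, 1), the difficulty score range of Step4."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_difficulty(semantic, grammatical, statistical, weights, bias):
    """Fuse three feature vectors and score difficulty in [0, 1].

    semantic/grammatical/statistical stand in for X_[CLS], O_POS, and O_STA;
    the fully connected layer is reduced here to a single dot product.
    """
    fused = semantic + grammatical + statistical  # list concatenation = splicing
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(z)
```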
2. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step1 is specifically:
Step1.1: assume that the currently input English text is S_t, containing n words: S_t = {w_1, w_2, ..., w_i, ..., w_n}, where w_i represents the i-th word;
the converted sentence is S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_{n-2}, w_{n-1}, w_n, [SEP]};
Step1.2: the maximum length of S_BERT is set to M; if the length of S_t is less than M, [PAD] tokens are added to S_BERT for filling, and after the filling operation S_BERT is:
S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_{n-2}, w_{n-1}, w_n, [SEP], ..., [PAD]}
if the length of S_t is greater than M, the subsequent content is truncated and discarded, and after the truncation operation S_BERT is:
S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_{M-2}, w_{M-1}, w_M, [SEP]}
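The padding and truncation of Step1.2 can be sketched as follows. The function name is illustrative, and since the claim's length bookkeeping is loose about whether M counts the special tokens, this sketch simply fixes the total token count at a given maximum:

```python
def to_bert_tokens(words, max_len):
    """Build [CLS] + words + [SEP], then pad with [PAD] or truncate so the
    total token count is exactly max_len (one reading of Step1.2)."""
    tokens = ["[CLS]"] + list(words) + ["[SEP]"]
    if len(tokens) > max_len:
        # discard the excess but keep a closing [SEP]
        tokens = tokens[:max_len - 1] + ["[SEP]"]
    else:
        tokens = tokens + ["[PAD]"] * (max_len - len(tokens))
    return tokens
```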
Step1.3: embedding coding is performed on S_BERT, namely:
Figure FDA0003923624500000011
wherein,
Figure FDA0003923624500000012
D_BERT represents the embedding dimension set by the pre-training language model;
Step1.4: sentence position coding is performed on the content of S_BERT, namely:
S_segment embedding = {E_A, E_A, E_A, E_B, E_B, E_B, E_B, ..., E_i, E_i}
wherein E_A denotes the first sentence and E_B denotes the second sentence,
Figure FDA0003923624500000013
subsequent sentences follow by analogy, and E_i represents the i-th sentence;
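The sentence position (segment) coding of Step1.4 assigns one id per sentence. A sketch that derives segment ids from [SEP] boundaries (an illustrative convention in which each [SEP] is kept with the sentence it closes):

```python
def segment_ids(tokens):
    """Give every token the index of the sentence it belongs to, counting
    sentences from 0 and advancing after each [SEP]; any trailing [PAD]
    tokens keep the last segment id."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == "[SEP]":
            current += 1
    return ids
```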
Step1.5: word position coding is performed on the content of S_BERT, namely:
S_position embedding = {E_1, E_2, E_3, ..., E_i, ..., E_{n-2}, E_{n-1}, E_n, ..., E_M}
wherein E_i represents the position code of the i-th word,
Figure FDA0003923624500000021
Step1.6: S_embedding, S_segment embedding, and S_position embedding are input into the pre-training language model to obtain the feature vector O_BERT output by the last layer, namely:
Figure FDA0003923624500000022
Step1.7: X_[CLS] is selected as the sentence vector.
3. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step2 is specifically:
Step2.1: for the input text S_t = {w_1, w_2, w_3, ..., w_n}, [CLS] is added at the beginning of the sentence to indicate the start of a sentence, and [SEP] is added between two sentences to separate them; the converted sentence is:
S_sen = {[CLS], w_1, w_2, ..., [SEP], ..., w_{n-2}, w_{n-1}, w_n, [SEP], ..., [PAD]}
Step2.2: part-of-speech tagging is performed on S_sen to obtain:
S_POS = {[SPACE], [PRP], [VBP], [NNP], ..., [RB], [JJ], [SPACE], ..., [PAD]}
wherein [SPACE] represents [CLS] and [SEP], [PRP] represents pronouns, [VBP] represents verbs, [NNP] represents proper nouns, [RB] represents adverbs, and [JJ] represents adjectives;
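The S_POS sequence of Step2.2 can be produced by any Penn-Treebank-style tagger; as a stand-in, here is a toy dictionary-based tagger using the tag inventory from the claim. The lexicon and fallback tag are illustrative assumptions, not part of the filing:

```python
# tiny illustrative lexicon; a real system would use a trained POS tagger
TOY_LEXICON = {"i": "PRP", "you": "PRP", "love": "VBP", "read": "VBP",
               "london": "NNP", "very": "RB", "happy": "JJ"}

def tag_sentence(tokens):
    """Map each token to a part-of-speech tag: special tokens become [SPACE],
    [PAD] stays [PAD], and unknown words fall back to NN, mirroring S_POS."""
    tags = []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            tags.append("[SPACE]")
        elif tok == "[PAD]":
            tags.append("[PAD]")
        else:
            tags.append(TOY_LEXICON.get(tok.lower(), "NN"))
    return tags
```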
Step2.3: an embedded representation of S_POS is obtained as E_POS, namely:
Figure FDA0003923624500000023
wherein D_POS represents the embedding dimension of the part-of-speech tokens;
Step2.4: E_POS is input into the LSTM, and the output result of the last layer, O_POS, is taken as the feature vector of the sentence's grammatical information, wherein
Figure FDA0003923624500000024
4. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step3 is specifically:
Step3.1: the sentence length is counted and an embedding operation is carried out: for a sentence S_t = {w_1, w_2, ..., w_n}, the sentence length is embedded as
Figure FDA0003923624500000025
wherein L represents the embedding vector of the sentence length, n represents the number of words, and D represents the embedding dimension;
Step3.2: the number of prepositions is counted and an embedding operation is carried out: for a sentence S_t = {w_1, w_2, ..., w_n}, the preposition count is embedded as
Figure FDA0003923624500000026
wherein P represents the embedding vector of the preposition count, its subscript representing the specific number, and D represents the embedding dimension;
Step3.3: the average word length is counted and an embedding operation is carried out: for a sentence S_t = {w_1, w_2, ..., w_n}, the average word length is embedded as
Figure FDA0003923624500000031
wherein A represents the embedding vector of the average word length, its subscript representing the specific value, and D represents the embedding dimension;
step3.4: will be provided with
Figure FDA0003923624500000032
Splicing is performed as statistical information of sentences:
Figure FDA0003923624500000033
wherein,
Figure FDA0003923624500000034
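The three raw statistics of Step3.1-3.3 are cheap to compute before they are turned into embedding vectors; a sketch (the preposition list is a small illustrative subset, and the function name is assumed):

```python
# illustrative subset of English prepositions; a real list would be longer
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from"}

def sentence_statistics(words):
    """Return (sentence length n, preposition count, average word length),
    the raw values that Step3 embeds as the vectors L, P, and A."""
    n = len(words)
    p = sum(1 for w in words if w.lower() in PREPOSITIONS)
    a = sum(len(w) for w in words) / n if n else 0.0
    return n, p, a
```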
5. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step4 is specifically:
Step4.1: the semantic feature X_[CLS], the grammatical feature O_POS, and the statistical information feature O_STA are spliced, input into the fully connected layer and then into the sigmoid layer, and the prediction result is output:
Figure FDA0003923624500000035
step4.2: calculating the loss:
Figure FDA0003923624500000036
wherein y_ic represents the true class of sample i (equal to 1 if the true class is c, and 0 otherwise), and p_ic represents the predicted probability that sample i belongs to class c;
Step4.3: Adam is used to optimize the loss in order to minimize it; when the loss is minimized, the model achieves its best results.
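The loss of Step4.2 is the standard cross-entropy. A pure-Python sketch of its computation over a batch (the epsilon clamp is an added numerical-safety assumption, not part of the claim):

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """L = -sum_i sum_c y_ic * log(p_ic), averaged over samples.

    y_true: one-hot rows (y_ic is 1 for the true class, else 0);
    p_pred: predicted class probabilities p_ic for each sample i.
    """
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        # clamp probabilities away from zero so log() never overflows
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return total / len(y_true)
```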
CN202211364247.2A 2022-11-02 2022-11-02 English reading material difficulty judgment method based on text feature fusion Pending CN115630140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211364247.2A CN115630140A (en) 2022-11-02 2022-11-02 English reading material difficulty judgment method based on text feature fusion


Publications (1)

Publication Number Publication Date
CN115630140A true CN115630140A (en) 2023-01-20

Family

ID=84909207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211364247.2A Pending CN115630140A (en) 2022-11-02 2022-11-02 English reading material difficulty judgment method based on text feature fusion

Country Status (1)

Country Link
CN (1) CN115630140A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796045A (en) * 2023-08-23 2023-09-22 北京人天书店集团股份有限公司 Multi-dimensional book grading method, system and readable medium
CN116796045B (en) * 2023-08-23 2023-11-10 北京人天书店集团股份有限公司 Multi-dimensional book grading method, system and readable medium

Similar Documents

Publication Publication Date Title
CN108829801B (en) Event trigger word extraction method based on document level attention mechanism
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
WO2022141878A1 (en) End-to-end language model pretraining method and system, and device and storage medium
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN110619043A (en) Automatic text abstract generation method based on dynamic word vector
CN113283236B (en) Entity disambiguation method in complex Chinese text
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN114117041B (en) Attribute-level emotion analysis method based on specific attribute word context modeling
CN114611520A (en) Text abstract generating method
CN115630140A (en) English reading material difficulty judgment method based on text feature fusion
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN113961706A (en) Accurate text representation method based on neural network self-attention mechanism
CN110674293B (en) Text classification method based on semantic migration
CN111985223A (en) Emotion calculation method based on combination of long and short memory networks and emotion dictionaries
CN114880994B (en) Text style conversion method and device from direct white text to irony text
CN114595687B (en) Laos text regularization method based on BiLSTM
CN116049349A (en) Small sample intention recognition method based on multi-level attention and hierarchical category characteristics
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN113343648B (en) Text style conversion method based on potential space editing
CN112528003B (en) Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN114282537A (en) Social text-oriented cascade linear entity relationship extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination