CN115630140A - English reading material difficulty judgment method based on text feature fusion - Google Patents


Info

Publication number
CN115630140A
CN115630140A (application CN202211364247.2A)
Authority
CN
China
Prior art keywords
sentence
bert
english
difficulty
embedding
Prior art date
Legal status
Pending
Application number
CN202211364247.2A
Other languages
Chinese (zh)
Inventor
Gan Jianhou (甘健侯)
Wang Yuchen (王宇辰)
Li Zijie (李子杰)
Zhou Juxiang (周菊香)
Ouyang Zhaoxiang (欧阳昭相)
Chen Ken (陈恳)
Current Assignee
Yunnan Normal University
Original Assignee
Yunnan Normal University
Priority date
Filing date
Publication date
Application filed by Yunnan Normal University
Priority to CN202211364247.2A
Publication of CN115630140A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/334 - Query execution
    • G06F 16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/12 - Use of codes for handling textual entities
    • G06F 40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/253 - Grammatical analysis; Style critique
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for judging the difficulty of English reading materials based on text feature fusion, and belongs to the field of natural language processing. First, for an English reading material data set, the input English text is encoded and the encoded result is fed into a pre-trained language model to obtain a feature vector containing semantic information. Next, the English text is part-of-speech tagged and the resulting part-of-speech sequence is fed into an LSTM to obtain a feature vector containing grammatical information. In addition, factors known to influence the difficulty of English reading materials are counted and converted into features. All feature vectors are concatenated and passed through a fully connected layer, and a sigmoid finally outputs a value between 0 and 1 representing the difficulty. The method can effectively judge the difficulty of English reading materials and better support adaptive learning services in English teaching.

Description

English reading material difficulty judgment method based on text feature fusion
Technical Field
The invention relates to a method for judging difficulty of English reading materials based on text feature fusion, and belongs to the technical field of natural language processing.
Background
English is one of the most widely learned second languages, and reading is an important component of English learning. Accurately judging the difficulty of English reading materials allows learners at different proficiency levels to receive instruction suited to their level, and thus promotes personalized learning.
Research on the difficulty of English reading materials dates back to the early 20th century, and judging this difficulty remains a core problem for researchers at home and abroad. Many researchers have studied the factors that influence the difficulty of English reading materials, summarized a number of influencing factors, and produced several formulas for computing reading difficulty; these formulas have long helped people select suitable English texts. However, as informatization continues to develop, texts have become more complex, while such hand-crafted rules are generally simple and lack generalization ability, so they can no longer achieve good results.
With the continuous development of language models, the BERT (Bidirectional Encoder Representations from Transformers) model, proposed by Google in October 2018, brought the field of natural language processing into a new stage. BERT is a pre-trained language model: unlike traditional approaches that train a single unidirectional language model or shallowly concatenate two unidirectional language models, BERT uses an MLM (masked language model) objective to train bidirectional Transformers, generating deep bidirectional language representations, and it performed well on 11 different natural language processing (NLP) tasks. Many scholars have since applied BERT to other NLP tasks with good results; migrating a trained model to a new task in this way is called transfer learning. Since most tasks are related to some degree, transferring learned parameters to a new model can greatly accelerate training. Fine-tuning is one transfer learning method: some layers of the pre-trained model are frozen while the remaining layers and the fully connected layers are trained, which further shortens the learning time and reduces the training cost of the model.
Disclosure of Invention
The invention aims to provide a method for judging the difficulty of English reading materials based on text feature fusion, in order to improve the accuracy and efficiency of this judgment.
By summarizing linguists' views on the factors that influence the difficulty of English reading materials, and considering the advantages of pre-trained language models in natural language processing tasks, the invention provides a method that fuses multiple text features and judges the difficulty of English reading materials using deep learning.
The technical scheme of the invention is as follows. First, for an English reading material data set, the input English text is encoded and the encoded information is fed into a trained pre-trained language model to obtain a feature vector containing semantic information. Then the input text is part-of-speech tagged and the resulting part-of-speech sequence is fed into an LSTM (Long Short-Term Memory network) to obtain a feature vector containing grammatical information. Statistics on the factors influencing the difficulty of English reading materials are computed and represented as embeddings. Finally, all features are concatenated and passed through a fully connected layer, and a sigmoid layer outputs a value between 0 and 1 representing the difficulty.
The method for judging the English reading difficulty specifically comprises the following steps:
step1: semantic features of the text are extracted using a pre-trained language model.
First, for an English reading material data set (experiments are carried out on the Newsela data set and a self-collected data set), the input English text is encoded and the encoded information is fed into a trained pre-trained language model to obtain a feature vector containing semantic information.
Specifically, the words, sentence positions, and word positions within sentences are first extracted and one-hot encoded, then fed into the pre-trained language model to obtain semantic feature vectors; BERT is selected as the pre-trained model.
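As a minimal illustration of the one-hot encoding mentioned above (the toy vocabulary here is hypothetical; BERT's real WordPiece vocabulary has roughly 30,000 entries):

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Toy vocabulary for illustration only.
vocab = {"[CLS]": 0, "[SEP]": 1, "[PAD]": 2, "the": 3, "cat": 4, "sat": 5}

sentence = ["[CLS]", "the", "cat", "sat", "[SEP]"]
encoded = [one_hot(vocab[w], len(vocab)) for w in sentence]
```

Each token becomes a sparse vector with a single 1; the pre-trained model's embedding layer then maps these indices to dense vectors.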
Step2: and (5) extracting grammatical information features.
Part-of-speech tagging is performed on the text, and the resulting part-of-speech sequence is fed into the LSTM to obtain a feature vector containing grammatical information.
Step3: and (5) extracting the statistical information features.
Statistics on the factors influencing the difficulty of English reading materials are computed and represented as embeddings; all features are then concatenated and fed into a fully connected layer.
Step4: and (4) difficulty prediction.
A sigmoid layer outputs a value between 0 and 1 representing the difficulty.
The Step1 is specifically as follows:
step1.1: suppose that the currently input English text is S t ,S t In which n words are included, S t ={w 1 ,w 2 ,…,w i ,…,w n In which w i Representing the ith word.
The Bert model typically adds [ CLS ] at the beginning of a sentence to represent the beginning of a paragraph and [ SEP ] in the middle of two sentences to separate the sentences.
The converted sentence is S BERT ={[CLS],w 1 ,w 2 ,…,[SEP],…,w n-2 ,w n-1 ,w n ,[SEP]}。
Step1.2: will S BERT Is set to M, if S t If the length of S is less than M, then S is selected BERT Addition of [ PAD]Performing filling, S after filling operation BERT Comprises the following steps:
S BERT ={[CLS],w 1 ,w 2 ,…,[SEP],…,w n-2 ,w n-1 ,w n ,[SEP],…,[PAD]}
if S t Is greater than M, truncating and discarding the subsequent content, truncating the operated S BERT Comprises the following steps:
S BERT ={[CLS],w 1 ,w 2 ,…,[SEP],…,w M-2 ,w M-1 ,w M ,[SEP]}
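Step1.1 and Step1.2 can be sketched as follows; this simplified version treats each word as one token, whereas a real BERT tokenizer also splits words into WordPiece sub-tokens:

```python
def to_bert_tokens(words, max_len):
    """Add [CLS]/[SEP], then pad with [PAD] or truncate to max_len tokens."""
    tokens = ["[CLS]"] + list(words) + ["[SEP]"]
    if len(tokens) > max_len:
        # Truncate, keeping a final [SEP] so the sentence stays well-formed.
        tokens = tokens[:max_len - 1] + ["[SEP]"]
    else:
        tokens += ["[PAD]"] * (max_len - len(tokens))
    return tokens
```

For example, `to_bert_tokens(["a", "b"], 6)` pads a short sentence up to 6 tokens, while longer inputs are cut down to the maximum length M.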
Step1.3: Perform embedding encoding on S_BERT, namely:
S_embedding = Embedding(S_BERT)
where S_embedding ∈ R^{M×D_BERT}, and D_BERT denotes the embedding dimension set by the pre-trained language model.
Step1.4: to S BERT The content in (1) is sentence position coded, namely:
S segmentembedding ={E A ,E A ,E A ,E B ,E B ,E B ,E B ,…,E i ,E i }
wherein E is A Denotes the first sentence, E B The second sentence is represented by the first sentence,
Figure BDA0003923624510000033
subsequent sentences analogized in the same way, E i Indicating the ith sentence.
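The sentence position (segment) indices of Step1.4 can be derived from the [SEP] markers; in BERT these indices are then looked up in an embedding table to give the E_A, E_B, … vectors. A minimal sketch:

```python
def segment_ids(tokens):
    """Assign a sentence index to every token: tokens up to and including
    the first [SEP] get index 0, the next sentence gets index 1, and so on.
    Trailing [PAD] tokens keep the last index."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == "[SEP]":
            current += 1
    return ids
```

The resulting integer per token selects which segment embedding (E_A for 0, E_B for 1, …) is added to the token's word embedding.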
Step1.5: to S BERT The content in (1) is subjected to word position coding, namely:
S positionembedding ={E 1 ,E 2 ,E 3 ,…,E i ,…,E n-2 ,E n-1 ,E n ,…,E M }
wherein E is i A position code representing the ith word,
Figure BDA0003923624510000034
step1.6: will S embedding 、S segmrntembedding 、S positionembedding Inputting the obtained result into a pre-training language model (BERT is used by default) to obtain a feature vector O output by the last layer BERT Namely:
Figure BDA0003923624510000035
step1.7: there are various schemes for selecting a sentence vector, such as: 1) Taking X [CLS] As a sentence vector. 2) To O BERT The average pooling was performed, and the results were obtained. 3) To O is BERT Performing maximum pooling, and taking the result. 4) Mixing O with BERT The results of (2) further extract features using CNN. 5) Is prepared from O BERT The result of (2) is input into the LSTM extraction feature. In the task of the invention, X is selected [CLS] As a sentence vector.
The Step2 specifically comprises the following steps:
step2.1: for the input text S t ={w 1 ,w 2 ,w 3 ,…,w n Add [ CLS ] at the beginning of sentence]Indicating the beginning of a sentence, adding [ SEP ] in the middle of two sentences]For separating sentences, the converted sentences are:
S sen ={[CLS],w 1 ,w 2 ,…,[SEP],…,w n-2 ,w n-1 ,w n ,[SEP],…,[PAD]}
Step2.2: Perform part-of-speech tagging on S_sen to obtain:
S_POS = {[SPACE], [PRP], [VBP], [NNP], …, [RB], [JJ], [SPACE], …, [PAD]}
where [SPACE] stands for [CLS] and [SEP], [PRP] for pronouns, [VBP] for verbs, [NNP] for nouns, [RB] for adverbs, and [JJ] for adjectives.
Step2.3: to S POS Carry out embedded representation to obtain E POS Namely:
Figure BDA0003923624510000041
wherein D is POS Representing the embedding dimension of the part-of-speech token.
Step2.4: will E pos Inputting LSTM, and taking the last layer of output result O pos Feature vectors (i.e., grammatical features) as part-of-speech sequences of sentences in which
Figure BDA0003923624510000042
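As an illustration of Step2.4, the following is a minimal numpy sketch of a single-layer LSTM forward pass that returns the last hidden state; the weights here are random stand-ins rather than trained parameters, and a real implementation would use a deep learning framework's LSTM:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_hidden(x_seq, W, U, b, hidden):
    """Run one LSTM layer over a sequence of embeddings and return the
    final hidden state, mirroring how O_POS is taken from the
    part-of-speech embeddings.

    x_seq: (T, D_in); W: (4H, D_in); U: (4H, H); b: (4H,). Gate order i, f, g, o."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in x_seq:
        z = W @ x + U @ h + b
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g          # update cell state
        h = o * np.tanh(c)         # update hidden state
    return h

rng = np.random.default_rng(0)
T, D_in, H = 5, 3, 4               # 5 POS embeddings of dimension 3, hidden size 4
x_seq = rng.normal(size=(T, D_in))
W = rng.normal(size=(4 * H, D_in)) * 0.1
U = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h_last = lstm_last_hidden(x_seq, W, U, b, H)
```

`h_last` plays the role of O_POS: a fixed-size vector summarizing the whole part-of-speech sequence in order.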
Step2 mainly computes the grammatical features of the sentence. Grammar and vocabulary are key to distinguishing the difficulty of English texts, so the complexity of the grammar must be considered. The invention takes the part-of-speech sequence of the sentence as input and uses an LSTM to learn the characteristics of the sequence, thereby obtaining a vectorized representation of the grammar that can be fed into the neural network for subsequent computation. Existing methods mainly obtain grammatical information by counting keywords and keyword co-occurrences, which cannot fully express sequence information; the LSTM used here therefore learns grammatical features better.
The Step3 is specifically as follows:
because factors influencing the difficulty degree of the English reading material need to consider sentence length, preposition number, average word length and the like as influencing factors besides semantics and grammar, the factors are counted and encoded and then input into the model. After the information is added, the convergence is faster during model training, and meanwhile, the robustness of the model is further improved.
The method comprises the following specific steps:
step3.1: and (3) counting the sentence length and carrying out embedding operation: for sentence S t ={w 1 ,w 2 ,…,w n H, then the sentence length is embedded as
Figure BDA0003923624510000043
Where L represents the embedding of the vector for sentence length, n represents the number of words, and D represents the embedding dimension.
Step3.2: counting the preposition number and carrying out embedding operation: for sentence S t ={w 1 ,w 2 ,…,w n }, number of prepositions embedded
Figure BDA0003923624510000044
Wherein, P represents the embedding of the vector as the number of prepositions, x represents the specific number, and D represents the embedding dimension.
Step3.3: and (3) counting the average word length and carrying out embedding operation: for sentence S t ={w 1 ,w 2 ,…,w 1 Embedding of preposition number
Figure BDA0003923624510000045
Where a represents the embedding of the vector as the average word length, x represents the specific number, and D represents the embedding dimension.
Step3.4: will be provided with
Figure BDA0003923624510000046
Splicing is performed as statistical information of sentences:
Figure BDA0003923624510000051
wherein the content of the first and second substances,
Figure BDA0003923624510000052
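The raw statistics of Step3.1 to Step3.3 can be computed as below before embedding; the preposition list is a small illustrative subset, since in practice prepositions would be identified from the part-of-speech tags of Step2:

```python
# Illustrative subset of English prepositions (not an exhaustive list).
PREPOSITIONS = {"in", "on", "at", "of", "to", "with", "for", "by", "from"}

def statistical_features(words):
    """Compute the three raw statistics of Step3 for one sentence:
    sentence length, preposition count, and average word length."""
    n = len(words)
    preps = sum(1 for w in words if w.lower() in PREPOSITIONS)
    avg_len = sum(len(w) for w in words) / n if n else 0.0
    return n, preps, avg_len
```

Each of the three numbers is then mapped through its own embedding table to produce L, P, and A before concatenation.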
the Step4 is specifically as follows:
step4.1: combining semantic features X [CLS] Grammatical feature O POS Statistical information characteristic O STA Input full connection after splicingAnd (3) layer connection, inputting the sigmoid layer prediction result and outputting:
Figure BDA0003923624510000053
step4.2: calculating the loss:
Figure BDA0003923624510000054
wherein, y ic Representing the true class of sample i, if c is equal to 1, and if not, 0,p ic Representing the predicted probability that the observation sample i belongs to the class c;
step4.3: adam is used to optimize the loss in order to minimize the loss, and when the loss is minimized, the model achieves the best results.
This step concatenates the three features, feeds them into the neural network, and uses the sigmoid function to limit the output to [0, 1], thereby realizing the difficulty judgment.
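The fusion and prediction of Step4 can be sketched as follows; the weight vector, bias, and toy feature dimensions are placeholders for illustration, and the loss is written in its binary form since the output is a single probability:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_difficulty(x_cls, o_pos, o_sta, w, bias):
    """Concatenate the three feature vectors and map them to a difficulty
    score in (0, 1) with one linear layer plus sigmoid (Step4.1).
    A trained model would supply learned values for w and bias."""
    features = np.concatenate([x_cls, o_pos, o_sta])
    return sigmoid(features @ w + bias)

def bce_loss(y_true, y_pred):
    """Binary cross-entropy averaged over a batch (Step4.2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))
```

With all-zero weights the sigmoid outputs exactly 0.5, the maximally uncertain prediction; Adam would then adjust `w` and `bias` to reduce `bce_loss`.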
The invention has the following beneficial effects. When judging the difficulty of English text, it comprehensively considers the semantic information, grammatical information, and statistical information of the text. Compared with traditional methods, it accounts for the importance of the semantic information of the English text, uses an LSTM to learn the grammatical information, and at the same time feeds the traditional statistical information into the neural network. The result is a difficulty judgment model that is more effective and more robust than traditional methods.
Drawings
FIG. 1 is a flow chart of the steps of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: As shown in FIG. 1, in the method for judging the difficulty of English reading materials based on text feature fusion, the input English text from an English reading material data set is first encoded and the encoded information is fed into a trained pre-trained language model to obtain a feature vector containing semantic information; then the English text is part-of-speech tagged and the resulting part-of-speech sequence is fed into an LSTM to obtain a feature vector containing grammatical information; statistics on the factors influencing the difficulty of English reading materials are computed and represented as embeddings; all features are concatenated and fed into a fully connected layer, and finally a sigmoid layer outputs a value between 0 and 1 representing the difficulty.
Assume there is a set A of English reading materials containing N texts, A = {S_1, S_2, S_3, …, S_N}, where S_i denotes the i-th English reading material text in the collection. The method for judging English reading difficulty specifically comprises the following steps:
step1: the pre-training model selects a Bert model, the pre-training language model part is mainly used for learning semantic information of a text, three features are required for inputting the pre-training language model, and are respectively the feature of each word, the sentence position feature and the word position feature, and the three features are extracted.
Step2: and (5) extracting grammatical features.
Step3: and (5) extracting the statistical information features.
Step4: and (4) difficulty prediction.
Step1 to Step4 are carried out exactly as described in the Disclosure of Invention above.
This example selects two English reading material data sets with difficulty labels, CEFR and Newsela, plus the manually constructed data set CEED of the present invention. CEFR and CEED are public graded English reading text data sets, while Newsela is a non-public graded English reading text data set (available on application via the Newsela website). Basic statistics for the three data sets are shown in Table 1, where Num denotes the number of texts in the data set and Class denotes the number of difficulty classes.
TABLE 1 data set essential information
[Table 1 is provided as an image in the original document; its contents are not recoverable here.]
(1) CEFR consists of 1493 English texts labeled with the Common European Framework of Reference (CEFR) levels A1, A2, B1, B2, C1, and C2, with difficulty increasing from A1 to C2. The English texts were taken from free online resources, including the British Council, ESLFast, and the CNN Daily Mail data set. The texts include conversations, descriptions, short stories, newspaper articles, and other material.
(2) CEED was collected from the reading sections of 469 English examinations, including the senior high school entrance examination (difficulty recorded as Z), the college entrance examination (G), CET-4 (S), CET-6 (L), TEM-4 (E), and TEM-8 (B). Difficulty increases from the senior high school entrance examination to TEM-8.
(3) Newsela consists of 10722 English texts, each labeled with a difficulty level from 2 to 12 according to the US K-12 education standard, with difficulty increasing from 2 to 12.
The invention organizes the English texts in the data sets as follows: first, each English text is read paragraph by paragraph; second, the difficulty level corresponding to each paragraph is recorded; then the number of words, the number of prepositions, and the average word length of each paragraph are computed; finally the results are saved as csv files. The organized data sets contain the following numbers of paragraphs: CEFR contains 12096 paragraphs, Newsela contains 227971 paragraphs, and CEED contains 3381 paragraphs.
To better obtain difficulty coefficients in subsequent experiments, corresponding difficulty labels are added to the extracted paragraphs. In the CEFR data set, the labels A1, A2, B1, and B2 are set to 0, and C1 and C2 are set to 1. In the Newsela data set, levels of 6 and above are set to 1 and levels below 6 are set to 0. Because the categories in the CEED data set are similar in kind, the invention divides it into three subsets: the senior high school entrance examination and college entrance examination data form one subset, CEED-EE; the CET-4 and CET-6 data form one subset, CEED-CET; and the TEM-4 and TEM-8 data form one subset, CEED-TEM. Within each subset, the labels of the senior high school entrance examination, CET-4, and TEM-4 are set to 0, and those of the college entrance examination, CET-6, and TEM-8 are set to 1. The numbers of positive and negative samples in each organized data set are shown in Table 2.
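The label binarization scheme described above can be expressed directly; the function names here are illustrative:

```python
def binarize_newsela(level):
    """Map a Newsela difficulty level (2-12) to a binary label,
    following the scheme above: level >= 6 -> 1 (hard), else 0 (easy)."""
    return 1 if level >= 6 else 0

def binarize_cefr(label):
    """CEFR scheme: A1/A2/B1/B2 -> 0, C1/C2 -> 1."""
    return 1 if label in {"C1", "C2"} else 0
```

Applying these mappings to every paragraph yields the positive and negative sample counts reported in Table 2.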
Table 2: number of positive and negative samples
[Table 2 is provided as an image in the original document; its contents are not recoverable here.]
The invention selects pre-trained language models oriented to the fill-mask task, namely BERT, BART, XLNet, RoBERTa, and XLM-RoBERTa, and compares them with CNN, LSTM, and BiLSTM baselines. PyTorch version 1.10 was used with an NVIDIA GeForce RTX 2080 Ti GPU. The pre-trained models were all obtained from Huggingface. The hyper-parameters were chosen as follows: batch size from {16, 32, 64}, learning rate from {1e-3, 1e-4, 1e-5}, and word embedding dimension 768. Different models were tested on the different data sets, with the following results:
table 3: experimental results for different models in CEFR and Newsela
[Table 3 is provided as an image in the original document; its contents are not recoverable here.]
From Table 3, it can be seen that on both data sets the method of the invention (using BERT as the pre-trained language model) obtained the best results on all three metrics, AUC, ACC, and RMSE. On the CEFR data set, the method outperforms the second-best method on all three metrics, improving AUC by 5.81% and ACC by 7.02% and reducing RMSE by 5.14%. On the Newsela data set, the method is also better than the second-best on all three metrics, improving AUC by 1.63% and ACC by 1.04% and reducing RMSE by 1.15%. When the data set is small (the CEFR data set), the pre-trained language model needs only a small amount of data to perform well.
Table 4: results of different pre-trained language models in CEFR and Newsela
Figure BDA0003923624510000102
As shown in Table 4, the invention compares the effects of different pre-trained language models, each of which improves and enhances BERT for different tasks. From the results, the BERT model obtains the best results on the CEFR dataset, outperforming the second-best model on AUC, ACC, and RMSE: it improves AUC by 0.35% and ACC by 0.24%, and reduces RMSE by 0.92%. The XLNet model gives the best results on the Newsela dataset, improving AUC by 0.37% and ACC by 0.58%, and reducing RMSE by 0.60% compared with BERT. Overall, however, the gap between these pre-trained models is small, and all of them outperform both CNN and LSTM.
Table 5: experimental results of different models in CEED
Figure BDA0003923624510000103
Figure BDA0003923624510000111
As can be seen from Table 5, on all three CEED subsets the method of the invention (using BERT as the pre-trained language model) achieves the best results on all three metrics: AUC, ACC, and RMSE. On the CEED-EE subset, the method outperforms the second-best on all three metrics, improving AUC by 8.20% and ACC by 4.71%, and reducing RMSE by 7.05%. On the CEED-CET subset, it again outperforms the second-best on all three metrics, improving AUC by 5.32% and ACC by 3.77%, and reducing RMSE by 1.95%. On the CEED-TEM subset, AUC and ACC improve over the second-best by 9.09% and 12.5% respectively, and RMSE is reduced by 8.51%.
Table 6: results of different pre-trained language models in CEED
Figure BDA0003923624510000112
As shown in Table 6, the invention also compares the effects of different pre-trained language models on the CEED dataset. Overall, RoBERTa achieves the best results on all three CEED subsets. Compared with BERT, on the CEED-EE subset it improves AUC by 5.06% and ACC by 8.49%, and reduces RMSE by 13.1%. On the CEED-CET subset, it improves AUC by 3.99% and ACC by 11.32%, and reduces RMSE by 11.20%, compared with BERT. On the CEED-TEM subset, it improves AUC by 3.17% and ACC by 3.12%, and reduces RMSE by 4.35%, compared with BERT. As a whole, these pre-trained language models are superior to CNN and LSTM on all three metrics.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to those embodiments, and various changes may be made without departing from its spirit and scope.

Claims (5)

1. A method for judging the difficulty of English reading materials based on text feature fusion, characterized by comprising the following steps:
Step1: first, for an English reading material dataset, encoding the input English text and inputting the encoded information into a trained pre-training language model to obtain a feature vector containing semantic information;
Step2: performing part-of-speech tagging on the text and inputting the resulting part-of-speech sequence into an LSTM to obtain a feature vector containing grammatical information;
Step3: extracting statistical information features: counting the factors that influence the difficulty of English reading materials, representing them as embeddings, splicing all the features, and inputting the spliced features into a fully connected layer;
Step4: finally, outputting through a sigmoid layer a value between 0 and 1 representing the difficulty, completing the difficulty judgment.
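The four steps above can be sketched end to end as feature concatenation followed by a fully connected layer and a sigmoid. This pure-Python sketch stands in for the BERT, LSTM, and embedding components; the function names and toy weights are illustrative assumptions, not the filing's implementation:

```python
import math

def sigmoid(x: float) -> float:
    """Squash a real value into (0, 1), the difficulty score range of Step4."""
    return 1.0 / (1.0 + math.exp(-x))

def predict_difficulty(semantic, grammatical, statistical, weights, bias):
    """Fuse three feature vectors and score difficulty in [0, 1].

    semantic/grammatical/statistical stand in for X_[CLS], O_POS, and O_STA;
    the fully connected layer is reduced here to a single dot product.
    """
    fused = semantic + grammatical + statistical  # list concatenation = splicing
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(z)
```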
2. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step1 is specifically:
Step1.1: assume that the currently input English text is S_t, containing n words: S_t = {w_1, w_2, ..., w_i, ..., w_n}, where w_i represents the i-th word;
the converted sentence is S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_{n-2}, w_{n-1}, w_n, [SEP]};
Step1.2: the maximum length of S_BERT is set to M; if the length of S_t is less than M, [PAD] tokens are added to S_BERT for filling, and after the filling operation S_BERT is:
S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_{n-2}, w_{n-1}, w_n, [SEP], ..., [PAD]}
if the length of S_t is greater than M, the subsequent content is truncated and discarded, and after the truncation operation S_BERT is:
S_BERT = {[CLS], w_1, w_2, ..., [SEP], ..., w_{M-2}, w_{M-1}, w_M, [SEP]}
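The padding and truncation of Step1.2 can be sketched as follows. The function name is illustrative, and since the claim's length bookkeeping is loose about whether M counts the special tokens, this sketch simply fixes the total token count at a given maximum:

```python
def to_bert_tokens(words, max_len):
    """Build [CLS] + words + [SEP], then pad with [PAD] or truncate so the
    total token count is exactly max_len (one reading of Step1.2)."""
    tokens = ["[CLS]"] + list(words) + ["[SEP]"]
    if len(tokens) > max_len:
        # discard the excess but keep a closing [SEP]
        tokens = tokens[:max_len - 1] + ["[SEP]"]
    else:
        tokens = tokens + ["[PAD]"] * (max_len - len(tokens))
    return tokens
```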
Step1.3: embedding coding is performed on S_BERT, namely:
Figure FDA0003923624500000011
wherein,
Figure FDA0003923624500000012
D_BERT represents the embedding dimension set by the pre-training language model;
Step1.4: sentence position coding is performed on the content of S_BERT, namely:
S_segment embedding = {E_A, E_A, E_A, E_B, E_B, E_B, E_B, ..., E_i, E_i}
wherein E_A denotes the first sentence and E_B denotes the second sentence,
Figure FDA0003923624500000013
subsequent sentences follow by analogy, and E_i represents the i-th sentence;
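The sentence position (segment) coding of Step1.4 assigns one id per sentence. A sketch that derives segment ids from [SEP] boundaries (an illustrative convention in which each [SEP] is kept with the sentence it closes):

```python
def segment_ids(tokens):
    """Give every token the index of the sentence it belongs to, counting
    sentences from 0 and advancing after each [SEP]; any trailing [PAD]
    tokens keep the last segment id."""
    ids, current = [], 0
    for tok in tokens:
        ids.append(current)
        if tok == "[SEP]":
            current += 1
    return ids
```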
Step1.5: word position coding is performed on the content of S_BERT, namely:
S_position embedding = {E_1, E_2, E_3, ..., E_i, ..., E_{n-2}, E_{n-1}, E_n, ..., E_M}
wherein E_i represents the position code of the i-th word,
Figure FDA0003923624500000021
Step1.6: S_embedding, S_segment embedding, and S_position embedding are input into the pre-training language model to obtain the feature vector O_BERT output by the last layer, namely:
Figure FDA0003923624500000022
Step1.7: X_[CLS] is selected as the sentence vector.
3. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step2 is specifically:
Step2.1: for the input text S_t = {w_1, w_2, w_3, ..., w_n}, [CLS] is added at the beginning of the sentence to indicate the start of a sentence, and [SEP] is added between two sentences to separate them; the converted sentence is:
S_sen = {[CLS], w_1, w_2, ..., [SEP], ..., w_{n-2}, w_{n-1}, w_n, [SEP], ..., [PAD]}
Step2.2: part-of-speech tagging is performed on S_sen to obtain:
S_POS = {[SPACE], [PRP], [VBP], [NNP], ..., [RB], [JJ], [SPACE], ..., [PAD]}
wherein [SPACE] represents [CLS] and [SEP], [PRP] represents pronouns, [VBP] represents verbs, [NNP] represents proper nouns, [RB] represents adverbs, and [JJ] represents adjectives;
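The S_POS sequence of Step2.2 can be produced by any Penn-Treebank-style tagger; as a stand-in, here is a toy dictionary-based tagger using the tag inventory from the claim. The lexicon and fallback tag are illustrative assumptions, not part of the filing:

```python
# tiny illustrative lexicon; a real system would use a trained POS tagger
TOY_LEXICON = {"i": "PRP", "you": "PRP", "love": "VBP", "read": "VBP",
               "london": "NNP", "very": "RB", "happy": "JJ"}

def tag_sentence(tokens):
    """Map each token to a part-of-speech tag: special tokens become [SPACE],
    [PAD] stays [PAD], and unknown words fall back to NN, mirroring S_POS."""
    tags = []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]"):
            tags.append("[SPACE]")
        elif tok == "[PAD]":
            tags.append("[PAD]")
        else:
            tags.append(TOY_LEXICON.get(tok.lower(), "NN"))
    return tags
```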
Step2.3: an embedded representation of S_POS is obtained as E_POS, namely:
Figure FDA0003923624500000023
wherein D_POS represents the embedding dimension of the part-of-speech tokens;
Step2.4: E_POS is input into the LSTM, and the output result of the last layer, O_POS, is taken as the feature vector of the sentence's grammatical information, wherein
Figure FDA0003923624500000024
4. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step3 is specifically:
Step3.1: the sentence length is counted and an embedding operation is carried out: for a sentence S_t = {w_1, w_2, ..., w_n}, the sentence length is embedded as
Figure FDA0003923624500000025
wherein L represents the embedding vector of the sentence length, n represents the number of words, and D represents the embedding dimension;
Step3.2: the number of prepositions is counted and an embedding operation is carried out: for a sentence S_t = {w_1, w_2, ..., w_n}, the preposition count is embedded as
Figure FDA0003923624500000026
wherein P represents the embedding vector of the preposition count, its subscript representing the specific number, and D represents the embedding dimension;
Step3.3: the average word length is counted and an embedding operation is carried out: for a sentence S_t = {w_1, w_2, ..., w_n}, the average word length is embedded as
Figure FDA0003923624500000031
wherein A represents the embedding vector of the average word length, its subscript representing the specific value, and D represents the embedding dimension;
step3.4: will be provided with
Figure FDA0003923624500000032
Splicing is performed as statistical information of sentences:
Figure FDA0003923624500000033
wherein,
Figure FDA0003923624500000034
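The three raw statistics of Step3.1-3.3 are cheap to compute before they are turned into embedding vectors; a sketch (the preposition list is a small illustrative subset, and the function name is assumed):

```python
# illustrative subset of English prepositions; a real list would be longer
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from"}

def sentence_statistics(words):
    """Return (sentence length n, preposition count, average word length),
    the raw values that Step3 embeds as the vectors L, P, and A."""
    n = len(words)
    p = sum(1 for w in words if w.lower() in PREPOSITIONS)
    a = sum(len(w) for w in words) / n if n else 0.0
    return n, p, a
```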
5. The method for judging the difficulty of English reading materials based on text feature fusion as claimed in claim 1, wherein Step4 is specifically:
Step4.1: the semantic feature X_[CLS], the grammatical feature O_POS, and the statistical information feature O_STA are spliced, input into the fully connected layer and then into the sigmoid layer, and the prediction result is output:
Figure FDA0003923624500000035
step4.2: calculating the loss:
Figure FDA0003923624500000036
wherein y_ic represents the true class of sample i (equal to 1 if the true class is c, and 0 otherwise), and p_ic represents the predicted probability that sample i belongs to class c;
Step4.3: Adam is used to optimize the loss in order to minimize it; when the loss is minimized, the model achieves its best results.
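The loss of Step4.2 is the standard cross-entropy. A pure-Python sketch of its computation over a batch (the epsilon clamp is an added numerical-safety assumption, not part of the claim):

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """L = -sum_i sum_c y_ic * log(p_ic), averaged over samples.

    y_true: one-hot rows (y_ic is 1 for the true class, else 0);
    p_pred: predicted class probabilities p_ic for each sample i.
    """
    total = 0.0
    for y_row, p_row in zip(y_true, p_pred):
        # clamp probabilities away from zero so log() never overflows
        total -= sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return total / len(y_true)
```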
CN202211364247.2A 2022-11-02 2022-11-02 English reading material difficulty judgment method based on text feature fusion Pending CN115630140A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211364247.2A CN115630140A (en) 2022-11-02 2022-11-02 English reading material difficulty judgment method based on text feature fusion


Publications (1)

Publication Number Publication Date
CN115630140A true CN115630140A (en) 2023-01-20

Family

ID=84909207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211364247.2A Pending CN115630140A (en) 2022-11-02 2022-11-02 English reading material difficulty judgment method based on text feature fusion

Country Status (1)

Country Link
CN (1) CN115630140A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116796045A (en) * 2023-08-23 2023-09-22 北京人天书店集团股份有限公司 Multi-dimensional book grading method, system and readable medium
CN116796045B (en) * 2023-08-23 2023-11-10 北京人天书店集团股份有限公司 Multi-dimensional book grading method, system and readable medium

Similar Documents

Publication Publication Date Title
CN108829801B (en) Event trigger word extraction method based on document level attention mechanism
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
WO2022141878A1 (en) End-to-end language model pretraining method and system, and device and storage medium
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN110619043A (en) Automatic text abstract generation method based on dynamic word vector
CN113283236B (en) Entity disambiguation method in complex Chinese text
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN114117041B (en) Attribute-level emotion analysis method based on specific attribute word context modeling
CN114611520A (en) Text abstract generating method
CN115630140A (en) English reading material difficulty judgment method based on text feature fusion
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN113961706A (en) Accurate text representation method based on neural network self-attention mechanism
CN110674293B (en) Text classification method based on semantic migration
CN111985223A (en) Emotion calculation method based on combination of long and short memory networks and emotion dictionaries
CN114880994B (en) Text style conversion method and device from direct white text to irony text
CN114595687B (en) Laos text regularization method based on BiLSTM
CN116049349A (en) Small sample intention recognition method based on multi-level attention and hierarchical category characteristics
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN113343648B (en) Text style conversion method based on potential space editing
CN112528003B (en) Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN114282537A (en) Social text-oriented cascade linear entity relationship extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination