CN110851599A - Automatic scoring method and teaching and assisting system for Chinese composition - Google Patents

Automatic scoring method and teaching and assisting system for Chinese composition

Info

Publication number
CN110851599A
CN110851599A (application number CN201911059419.3A)
Authority
CN
China
Prior art keywords
composition
scoring
scored
chinese
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911059419.3A
Other languages
Chinese (zh)
Other versions
CN110851599B (en)
Inventor
夏俐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201911059419.3A priority Critical patent/CN110851599B/en
Publication of CN110851599A publication Critical patent/CN110851599A/en
Application granted granted Critical
Publication of CN110851599B publication Critical patent/CN110851599B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an automatic scoring method and a teaching and assisting system for Chinese compositions. The method comprises the following steps: a composition acquisition step for obtaining the composition to be scored; a shallow feature extraction step for extracting shallow features of the composition to be scored; a deep semantic feature extraction step for extracting deep semantic features of the composition to be scored, the deep semantic features comprising wrongly written character features and grammatical error features; and a scoring step for combining the extracted shallow features and deep semantic features and fitting them with a random forest to obtain the scoring result of the composition to be scored. The method also comprises a pinyin conversion step and a topic extraction step. By combining the shallow features and the deep semantic features of the composition, the method achieves high scoring accuracy, obtains satisfactory evaluation results when trained on small samples, and effectively improves sample utilization. In addition, functions such as wrongly written character recognition and correction, pinyin recognition and conversion, and grammatical error recognition and correction are added, providing multi-dimensional feedback to tutor the user's writing and enhancing the user experience.

Description

Automatic scoring method and teaching and assisting system for Chinese composition
Technical Field
The invention relates to natural language processing technology in the field of artificial intelligence, and in particular to an automatic scoring method for Chinese compositions and a teaching and assisting system.
Background
Brief introduction to automatic composition scoring System
An automatic essay scoring (AES) system is an educational aid based on intelligent algorithms that has emerged with the development of artificial intelligence and deep learning. Compared with manual scoring, an automatic composition scoring system is more objective, timely, efficient, and low-cost, so it has received increasing attention and research, and its development has gradually become a trend. Traditional automatic composition scoring systems mainly model and analyze texts through shallow features and ignore the deep semantic features of the text, whereas deep learning techniques use recurrent neural networks to extract the deep semantic features of the text, making the scoring result more objective.
Challenge of Chinese composition automatic scoring system
In natural language processing, most current research is based on English. Because of the characteristics of the Chinese language, processing Chinese is technically much more complex than processing English, and Chinese processing is relatively underdeveloped in practical applications, with many difficulties and challenges. Existing automatic composition scoring systems mainly handle English compositions, and their results on Chinese compositions are not satisfactory. The invention therefore provides an automatic scoring method and a teaching and assisting system specifically for Chinese compositions.
Traditional automatic composition scoring systems require manually designed text features, which is costly and cannot capture the deep semantics of the text, while deep learning techniques for extracting deep semantic features depend on large corpora; Chinese composition corpora have traditionally been small, so improving the effective utilization of samples is very important. Meanwhile, how to design features on small-scale samples, how to recognize and correct the wrongly written characters, pinyin, and grammatical errors that appear in Chinese compositions, how to combine the extracted features for training, and how to ensure the accuracy of the writing-tutoring feedback are a series of problems that must be solved when designing an automatic scoring system for Chinese compositions.
Prior art implementation
When designing an automatic composition scoring system, non-patent document 1 trains a CNN-LSTM model on an English composition data set. Non-patent document 2 extracts lexical and syntactic features of a composition and trains a multiple linear regression model on the extracted features. Patent document 3 provides a composition scoring method in which two neural networks are designed; feature vectors and word vectors of the composition text are used as inputs to the neural networks, and the composition score is calculated from the outputs of the two networks. Patent document 4 provides a composition scoring method based on an attention mechanism, which adopts a neural-network attention framework with a word-sentence-document three-layer structure and fuses manually extracted features with the document layer to set the attention weights of the document layer. Patent document 5 acquires a large number of compositions on a given topic, analyzes the content of each composition to obtain its writing pattern, trains a time-series model of the compositions, tests the user's composition with the model, and scores it according to its novelty.
Non-patent document 1: Taghipour K, Ng H T. A Neural Approach to Automated Essay Scoring [C]// Conference on Empirical Methods in Natural Language Processing, 2016.
Non-patent document 2: Research on automatic composition scoring for the Chinese-as-a-second-language test [D]. Beijing Language and Culture University, 2006.
Patent document 3: CN108519975A composition scoring method, device and storage medium
Patent document 4: CN107133211A composition scoring method based on attention mechanism
Patent document 5: CN109635087A composition scoring method and family education equipment
Disadvantages of the prior art
The deep learning technique represented by non-patent document 1 depends on large-scale samples during training and cannot achieve a satisfactory training effect on small samples. The machine learning technique represented by non-patent document 2 does not fully extract the deep semantic features of the composition, and the fitting capability of a multiple linear regression model is limited, so the scoring accuracy is low. The methods represented by patent documents 3 and 4 are based on neural networks: patent document 3 predicts text scores by designing multiple neural networks, and patent document 4 adopts an attention mechanism at the output of the neural network to improve scoring accuracy; however, these methods have low sample utilization and cannot obtain satisfactory training results on small samples. Patent document 5 trains a neural network under a specific topic, which results in insufficient generalization capability of the trained network, and it takes only the novelty of a composition as the judging standard without considering other dimensions of the composition, which leads to low scoring accuracy.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for constructing an automatic Chinese composition scoring system, an automatic Chinese composition scoring method, a teaching and assisting system, a computer-readable storage medium, and a computer program product.
In the technical solution of the invention, the shallow features and the deep semantic features of the composition are combined, which improves the scoring accuracy and the sample utilization rate and achieves satisfactory results when training on small samples.
In order to achieve the above object, an embodiment of the first aspect of the present invention provides a method for constructing an automatic scoring system for Chinese compositions, the method comprising the following steps:
a corpus construction step, which is used for constructing a Chinese composition corpus;
a shallow feature extraction step, namely extracting shallow features of the composition based on the corpus;
a deep semantic feature extraction step, wherein deep semantic features of the composition are extracted based on the corpus, and the deep semantic features comprise wrongly written character features and grammatical error features;
and a regression step, which is used for combining the extracted shallow layer characteristics and deep layer semantic characteristics and adopting random forest fitting to obtain the scoring result of the composition.
Further, the extraction of the wrongly written character features specifically comprises: segmenting the composition with a probabilistic word segmentation model; comparing the composition text with the wrongly written character recognition corpus according to the segmentation result to obtain a suspicious word set; comparing the suspicious word set with the wrongly written character correction corpus to obtain a candidate word set; and calculating the semantic perplexity of the candidate word set and taking the word with the lowest perplexity as the wrongly written character correction result.
Further, the extraction of the grammatical error features specifically comprises: training word vectors on the corpus, inputting the word vectors into a Bi-LSTM neural network model, and training to obtain a labeling sequence, i.e., the grammatical error result.
Furthermore, the method also comprises a pinyin conversion step, which is used for identifying the pinyin in the text to be scored and converting the pinyin into corresponding Chinese characters.
Further, the method also comprises a topic extraction step for extracting the topics implicit in the text to be scored.
The embodiment of the second aspect of the invention provides a Chinese composition automatic scoring method, which comprises the following steps:
acquiring a composition to be scored: acquiring a composition picture to be scored, and performing Chinese recognition to obtain a composition text; or directly acquiring the composition text to be evaluated;
shallow layer feature extraction: processing the composition text to be scored to obtain word segmentation results of the composition text; according to the word segmentation result, counting shallow features of the composition to be scored;
deep semantic feature extraction: extracting deep semantic features of the composition to be scored, wherein the deep semantic features comprise wrongly written character features and grammatical error features;
grading: and combining the extracted shallow layer features and deep layer semantic features and adopting random forest fitting to obtain a scoring result of the composition to be scored.
Further, the extraction of the wrongly written character features specifically comprises: processing the composition text to be scored to obtain its word segmentation result; comparing the composition text to be scored with the wrongly written character recognition corpus according to the segmentation result to obtain a suspicious word set; comparing the suspicious word set with the wrongly written character correction corpus to obtain a candidate word set; and calculating the semantic perplexity of the candidate word set and taking the word with the lowest perplexity as the wrongly written character correction result.
Further, the extraction of the grammatical error features specifically comprises: processing the composition text to be scored to obtain its word vectors; and inputting the word vectors into a Bi-LSTM neural network model and training to obtain a labeling sequence, i.e., the grammatical error result.
Furthermore, the method also comprises a pinyin conversion step, which is used for identifying the pinyin in the text to be scored and converting the pinyin into corresponding Chinese characters.
Further, the method also comprises a topic extraction step for extracting the topics implicit in the text to be scored.
Further, the shallow features specifically include the number of sentences, the average sentence length, the full-text word count, the number of metaphor words, the number of pinyin occurrences, and the vocabulary level.
Further, the grammatical error features specifically include four types: redundant words, missing words, wrong word selection, and word-order errors.
The embodiment of the third aspect of the invention provides a Chinese composition automatic scoring system, which comprises the following modules:
the composition to be scored acquisition module: acquiring a composition picture to be scored, and performing Chinese recognition to obtain a composition text; or directly acquiring the composition text to be evaluated;
shallow layer feature extraction module: the system is used for processing the composition texts to be scored to obtain word segmentation results of the composition texts; according to the word segmentation result, counting shallow features of the composition to be scored;
the deep semantic feature extraction module: used for extracting deep semantic features of the composition to be scored, wherein the deep semantic features comprise wrongly written character features and grammatical error features;
a scoring module: and the method is used for combining the extracted shallow layer characteristics and deep layer semantic characteristics and adopting random forest fitting to obtain a scoring result of the composition to be scored.
Further, the extraction of the wrongly written character features specifically comprises: processing the composition text to be scored to obtain its word segmentation result; comparing the composition text to be scored with the wrongly written character recognition corpus according to the segmentation result to obtain a suspicious word set; comparing the suspicious word set with the wrongly written character correction corpus to obtain a candidate word set; and calculating the semantic perplexity of the candidate word set and taking the word with the lowest perplexity as the wrongly written character correction result.
Further, the extraction of the grammatical error features specifically comprises: processing the composition text to be scored to obtain its word vectors; and inputting the word vectors into a Bi-LSTM neural network model and training to obtain a labeling sequence, i.e., the grammatical error result.
Furthermore, the system also comprises a pinyin conversion module which is used for identifying pinyin in the text to be scored and converting the pinyin into corresponding Chinese characters.
Further, the system also comprises a topic extraction module for extracting the topics implicit in the text to be scored.
The embodiment of the fourth aspect of the invention provides a Chinese composition automatic scoring system, which is constructed according to the construction method of the Chinese composition automatic scoring system.
An embodiment of the fifth aspect of the present invention provides an automatic Chinese composition scoring teaching and assisting system, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; or the teaching and assisting system comprises a terminal and a cloud server connected to the terminal and storing a computer program, wherein the computer program, when executed, implements the above automatic Chinese composition scoring method.
An embodiment of the sixth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program, when executed, implementing the above automatic scoring method for Chinese compositions.
An embodiment of the seventh aspect of the present invention provides a computer program product which, when executed, implements the above automatic scoring method for Chinese compositions.
Composition scoring methods that consider only shallow features have low scoring accuracy, while methods that consider only deep semantic features require a large corpus for sample training. By combining the shallow features and the deep semantic features of the composition, the invention improves the scoring accuracy and effectively improves sample utilization, thereby solving a series of problems in the prior art.
Compared with existing Chinese composition scoring software, the automatic composition scoring method and teaching and assisting system of the invention have the following advantages: by combining the shallow features and the deep semantic features of the composition, the technical solution achieves high scoring accuracy, obtains satisfactory evaluation results when trained on small samples, and effectively improves sample utilization; meanwhile, functions such as wrongly written character recognition and correction, pinyin recognition and conversion, and grammatical error recognition and correction are added, providing multi-dimensional writing-tutoring feedback and enhancing the user experience.
Drawings
FIG. 1 is a schematic diagram of the working principle of the automatic Chinese composition scoring method and the teaching and assisting system according to the present invention.
FIG. 2 is a schematic diagram illustrating the principle of shallow feature extraction according to the present invention.
FIG. 3 is a schematic diagram illustrating the principle of extracting the syntax error feature according to the present invention.
FIG. 4 is a schematic diagram of an implementation of the Chinese composition automatic scoring tutoring system of the present invention.
FIG. 5 is one of the UI interfaces of the automatic Chinese composition scoring system constructed by the present invention: an OCR recognition interface schematic.
Fig. 6 is a second UI interface of the automatic scoring system for Chinese compositions constructed in the present invention: a schematic diagram of the score display interface.
FIG. 7 is a diagram illustrating key steps in the method for automatically scoring Chinese compositions according to the present invention.
Figs. 8-10 are diagrams illustrating an embodiment of the automatic scoring method for Chinese compositions according to the present invention, wherein fig. 8 shows an image of a composition to be scored, fig. 9 shows the Chinese character recognition, and fig. 10 shows the result of scoring with the automatic scoring method for Chinese compositions according to the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
FIG. 1 is a schematic diagram of the working principle of the automatic Chinese composition scoring method and the teaching and assisting system according to the present invention. As shown in FIG. 1, the technical solution of the invention combines the composition's shallow features and deep semantic features, improving the scoring accuracy and the sample utilization rate and achieving satisfactory results on small samples. First, a Chinese composition corpus is constructed; shallow features of the composition, which are mainly statistical features, are extracted based on the corpus; deep semantic features of the composition, including the wrongly written character features and grammatical error features, are extracted based on the corpus; finally, the shallow features and deep semantic features are combined and fitted with a random forest to obtain the score of the composition. The scheme achieves high scoring accuracy when trained on small samples and effectively improves sample utilization.
Compared with existing Chinese composition scoring software, the automatic Chinese composition scoring teaching and assisting system provided by the invention adds functions such as wrongly written character correction, pinyin correction, and grammatical error recognition, and provides multi-dimensional writing-tutoring feedback.
Method for constructing automatic Chinese composition scoring system
The construction method of the automatic Chinese composition scoring system of the present invention is described in detail below.
First, a Chinese composition corpus is constructed. 1000 composition pictures are collected, professional scoring teachers are hired to score the compositions, the Chinese characters are recognized with a networked cloud OCR service and proofread manually, and an electronic Chinese composition corpus is built. Grade one to grade six word banks are constructed from the People's Education Press primary-school Chinese textbooks, containing 174, 536, 1132, 1737, 2172, and 2655 words for grades one to six respectively. It should be noted that the Chinese composition corpus can also be constructed from composition pictures collected elsewhere (the source of the compositions is not limited by the invention), or the composition texts can be obtained directly; the word banks can likewise be built from the textbook systems of other publishers or from other sources independent of the textbooks.
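As a rough illustration of how the matching degree between a composition and each grade-level word bank might be computed, the following Python sketch counts, for each grade, the fraction of the composition's tokens covered by that grade's word bank. The file names, the use of the jieba segmenter, and this particular definition of "matching degree" are illustrative assumptions rather than details given in the patent.

```python
# Sketch: per-grade lexicon matching degree (assumed definition: token coverage ratio).
# File names and the use of jieba are illustrative assumptions.
import jieba

def load_word_bank(path):
    """Load one word per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def lexicon_match_degrees(text, bank_paths):
    """Return {grade: fraction of composition tokens found in that grade's word bank}."""
    tokens = [t for t in jieba.cut(text) if t.strip()]
    degrees = {}
    for grade, path in bank_paths.items():
        bank = load_word_bank(path)
        hits = sum(1 for t in tokens if t in bank)
        degrees[grade] = hits / len(tokens) if tokens else 0.0
    return degrees

# Example usage (paths are hypothetical):
# banks = {g: f"wordbank_grade{g}.txt" for g in range(1, 7)}
# print(lexicon_match_degrees("我的妈妈好像一朵花。", banks))
```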
And then shallow feature extraction is carried out. The number of sentences in the composition, the average number of words per sentence, the full-text word count, the number of metaphor words, the number of pinyin occurrences, and the matching degree between the composition and each grade-level lexicon are counted, and the results are taken as the shallow features of the composition. The metaphor feature words are Chinese comparative markers (words meaning "like", "as if", "as though", and so on). When counting the shallow features, a probabilistic word segmentation model is used. As shown in fig. 2, the segmentation tags S, B, M, E denote a single-character word and the beginning, middle, and end of a multi-character word, respectively; each character is represented as a visible state o_t and its segmentation tag as a hidden state s_t, and the best segmentation is the tag combination that maximizes P(o_1, o_2, …, o_n | s_1, s_2, …, s_n). Define λ as the model parameters, a as the state transition probability matrix, b as the observation probability matrix, and δ_t(i) as the maximum probability over all single paths that end in state i at time t:

δ_t(i) = max P(i_t = i, i_{t-1}, …, i_1, o_t, …, o_1 | λ), i = 1, 2, …, N.

Define ψ_t(i) as the state at time t-1 on the maximum-probability path ending in state i at time t:

ψ_t(i) = argmax_{1 ≤ j ≤ N} [ δ_{t-1}(j) · a_{ji} ].

At termination,

P* = max_{1 ≤ i ≤ N} δ_T(i),  i*_T = argmax_{1 ≤ i ≤ N} δ_T(i),

and the optimal path is recovered by backtracking,

i*_t = ψ_{t+1}(i*_{t+1}), t = T-1, T-2, …, 1,

which yields the optimal word segmentation combination.
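The Viterbi recursion just described can be sketched as follows in Python. The transition matrix a, emission matrix b, and initial distribution pi are assumed to have been estimated from the corpus; this minimal sketch only illustrates the decoding step, not the patent's actual implementation.

```python
TAGS = ["S", "B", "M", "E"]  # single-character word, begin, middle, end of a word

def viterbi_segment(chars, pi, a, b):
    """Decode the most likely S/B/M/E tag sequence for a character string.
    pi[s]    : initial log-probability of tag s
    a[s][s2] : log transition probability s -> s2
    b[s][ch] : log emission probability of character ch under tag s
    """
    delta = [{s: pi[s] + b[s].get(chars[0], -1e9) for s in TAGS}]
    psi = [{}]
    for t in range(1, len(chars)):
        delta.append({})
        psi.append({})
        for s in TAGS:
            best_prev, best_score = max(
                ((j, delta[t - 1][j] + a[j][s]) for j in TAGS),
                key=lambda x: x[1],
            )
            delta[t][s] = best_score + b[s].get(chars[t], -1e9)
            psi[t][s] = best_prev
    # Termination: pick the best final tag, then backtrack through psi.
    last = max(TAGS, key=lambda s: delta[-1][s])
    tags = [last]
    for t in range(len(chars) - 1, 0, -1):
        tags.append(psi[t][tags[-1]])
    tags.reverse()
    return tags
```

Words are then recovered by cutting the character sequence after every S and E tag.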
Deep semantic feature extraction is then performed, extracting deep semantic features of the composition such as the wrongly written character features. First, the composition is segmented with the probabilistic word segmentation model, and the segmentation result is compared with the wrongly written character recognition corpus to obtain a suspicious word set. The wrongly written character recognition corpus can include, but is not limited to, a manually defined dictionary, a confusion-set dictionary, and a People's Daily dictionary; in the embodiment the manually defined dictionary contains 177 entries, the confusion-set dictionary 759, and the People's Daily dictionary 584,429. The suspicious word set is then compared with the wrongly written character correction corpus to obtain a candidate word set. The correction corpus includes, but is not limited to, a common-word dictionary, a same-component/same-radical set, and a same-pinyin set; in the embodiment the common-word dictionary contains 3502 entries, the same-pinyin dictionary 3431, and the similar-character dictionary 1664. A perplexity model is trained on the People's Daily corpus: with w_i a word of the text, the perplexity PP of a sentence S = w_1 w_2 … w_N is

PP(S) = P(w_1 w_2 … w_N)^(-1/N) = ( ∏_{i=1}^{N} 1 / P(w_i | w_1, …, w_{i-1}) )^(1/N).

The perplexity of each element of the candidate word set is calculated with this model, and the element with the lowest perplexity is taken as the wrongly written character correction result.
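A minimal sketch of the perplexity-based selection among candidate corrections is given below, assuming a simple bigram language model in place of the trained perplexity model; the bigram_prob scoring function and the best_correction helper are illustrative assumptions, not the patent's exact model.

```python
import math

def sentence_perplexity(words, bigram_prob):
    """PP(S) = P(w1..wN)^(-1/N) under a bigram approximation.
    bigram_prob(prev, w) must return a smoothed probability > 0."""
    log_p = 0.0
    prev = "<s>"
    for w in words:
        log_p += math.log(bigram_prob(prev, w))
        prev = w
    return math.exp(-log_p / len(words))

def best_correction(sentence_words, pos, candidates, bigram_prob):
    """Replace the suspicious word at index `pos` with each candidate and
    keep the candidate giving the lowest perplexity."""
    scored = []
    for cand in candidates:
        trial = sentence_words[:pos] + [cand] + sentence_words[pos + 1:]
        scored.append((sentence_perplexity(trial, bigram_prob), cand))
    return min(scored)[1]
```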
The deep semantic feature extraction also includes grammatical error feature extraction. As shown in FIG. 3, word vectors are first trained on a microblog corpus: with w denoting a word of the composition text, the learning objective is to maximize the likelihood L = Σ log p(w | Context(w)), and the trained word vectors are used as the input of the neural network model. Bi-LSTM is adopted as the neural network model; define c as the cell state, a as the cell output, w as the weights, and σ as the activation function, with sigmoid selected. An LSTM cell operates through three gates. The first is the forget gate, which selectively forgets the previous cell's output and state:

f_t = σ(w_f · [a_{t-1}, w_t] + b_f).

Next, it must be determined what new information is stored in the cell state, in two parts: a sigmoid layer decides the update values and a tanh layer creates a new candidate vector,

u_t = σ(w_u · [a_{t-1}, w_t] + b_u),  c̃_t = tanh(w_c · [a_{t-1}, w_t] + b_c).

When the cell state is updated, part of the old information is discarded and the new information is added, giving the next cell state,

c_t = f_t · c_{t-1} + u_t · c̃_t.

Finally, a sigmoid layer decides which part of the state to output, and the cell state is passed through tanh to obtain the desired output:

o_t = σ(w_o · [a_{t-1}, w_t] + b_o),  a_t = o_t · tanh(c_t).

The output of the Bi-LSTM network is processed by a conditional random field (CRF), which takes the dependencies between adjacent positions into account to produce a high-accuracy labeling sequence; the labeling sequence gives the part of speech and the grammatical-error label of each character. The labels R, M, S, W correspond to four types of grammatical errors: redundant words (R), missing words (M), wrong word selection (S), and word-order errors (W). The grammatical error features may include, but are not limited to, one or more of these four types. The Bi-LSTM is trained with batch size 64, 200 epochs, embedding dimension 100, RNN hidden dimension 200, maximum LSTM sequence length 300, and dropout 0.25 on the data set provided by the CGED (Chinese Grammatical Error Diagnosis) shared task, finally reaching an accuracy of 0.861, and the trained Bi-LSTM model is used to extract grammatical error features from the composition set.
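As an illustration of the kind of Bi-LSTM tagger described above, the following PyTorch sketch uses the stated hyperparameters (embedding dimension 100, hidden dimension 200, dropout 0.25). The CRF decoding layer, the CGED training loop, and the vocabulary handling are omitted or assumed, so this is only a sketch and not the patent's implementation.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Tags each character with one of {O, R, M, S, W} (grammatical-error labels).
    A CRF layer over the emissions, as in the patent, is omitted for brevity."""
    def __init__(self, vocab_size, tagset_size=5,
                 embedding_dim=100, hidden_dim=200, dropout=0.25):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_dim, tagset_size)

    def forward(self, char_ids):                 # (batch, seq_len)
        x = self.embed(char_ids)                 # (batch, seq_len, 100)
        out, _ = self.lstm(x)                    # (batch, seq_len, 400)
        return self.fc(self.dropout(out))        # (batch, seq_len, tagset_size)

# Usage sketch: emissions.argmax(-1) gives a per-character tag sequence;
# a CRF would instead decode the jointly most likely sequence.
model = BiLSTMTagger(vocab_size=6000)
emissions = model(torch.randint(1, 6000, (2, 30)))
tags = emissions.argmax(dim=-1)
```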
And finally, a regression step, namely combining the extracted shallow layer features and deep layer semantic features and adopting random forest fitting to obtain a scoring result of the composition. The random forest firstly resamples the sample data, randomly extracts N samples in the original N training samples in a returning way each time, and constructs a decision tree by taking a plurality of obtained sample sets as training samples. When a decision tree is constructed, m features in the candidate features are randomly extracted to serve as candidate features for decision under the current node, and the best combination is selected from the candidate features. And after a group of decision trees are obtained, voting is carried out on the output of the group of decision trees, and the class with the most votes is used as the decision of the random forest. In the embodiment of the invention, 100 decision trees are selected for training each time, the average error of the scores under the percentage score is 2.78 points, and the consistency evaluation standard quadratic weighted kappa value is 0.759.
The embodiment of the invention can also comprise a pinyin conversion step and a topic extraction step. The pinyin conversion step converts pinyin in the user's text into the corresponding Chinese characters: using the same approach as the probabilistic word segmentation model, the pinyin syllables are treated as visible states and the Chinese characters sharing each pinyin as hidden states, and solving the model gives the best conversion result. The topic extraction step extracts the topics implicit in the user's composition. Assume the article is generated from K topics, the k-th topic being a distribution over words, and construct an LDA (Latent Dirichlet Allocation) model: for any composition d, the topic distribution θ_d follows a Dirichlet distribution, θ_d ~ Dirichlet(α), where α is a K-dimensional Dirichlet hyperparameter; for any topic k, the word distribution β_k follows a Dirichlet distribution, β_k ~ Dirichlet(η). The conditional probability that word w_i belongs to topic k, given all other topic assignments, is

p(z_i = k | z_{-i}, w) ∝ (n_{d,k} + α_k) · (n_{k,w_i} + η_{w_i}) / Σ_v (n_{k,v} + η_v),

where the counts n exclude the current word. Gibbs sampling over this conditional probability yields a topic for each word; K is set to 5 in the embodiment of the present invention. This completes the design of the automatic Chinese composition scoring system.
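A sketch of the topic extraction step with scikit-learn's LatentDirichletAllocation and K = 5 topics, as in the embodiment, is given below; the jieba segmentation and the use of variational inference instead of the Gibbs sampling described above are illustrative substitutions.

```python
import jieba
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def extract_topics(compositions, k=5, top_n=8):
    """Fit LDA on segmented compositions and return the top words of each topic.
    Note: scikit-learn's LDA uses variational inference rather than the Gibbs
    sampling described in the patent; the model and priors are otherwise the same."""
    segmented = [" ".join(jieba.cut(text)) for text in compositions]
    # Keep single-character Chinese tokens, which the default pattern would drop.
    vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
    counts = vectorizer.fit_transform(segmented)
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    topics = []
    for comp in lda.components_:          # one row of word weights per topic
        top = comp.argsort()[::-1][:top_n]
        topics.append([vocab[i] for i in top])
    return topics
```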
The schematic diagram of the automatic Chinese composition scoring teaching and assisting system constructed by the above method for constructing the automatic Chinese composition scoring system is shown in fig. 4, wherein the cloud server and the terminal are both in the prior art, and are not described herein again. The Chinese composition automatic scoring teaching auxiliary system is realized through a computer program, the computer program is stored on a cloud server, the cloud server is connected with a terminal, and after an authorized user downloads the computer program from the cloud server through the terminal, the program is executed on the terminal, so that the automatic scoring of compositions is realized. The UI system interface includes an OCR recognition interface and a score display interface, as shown in fig. 5 and 6, where fig. 5 is a schematic diagram of the OCR recognition interface, and fig. 6 is a schematic diagram of the score display interface. The teaching and assisting system can also be designed to include a memory, a processor, and a computer program stored on the memory and executable on the processor; the computer program is executed to implement automatic scoring of a composition.
Automatic scoring method for Chinese composition
The automatic Chinese composition scoring method of the present invention is described below. As shown in fig. 5, in the OCR recognition interface the user submits a picture of a handwritten composition from the local terminal, clicks the upload-picture button to obtain the OCR recognition result, and clicks the start-correction button to obtain the composition review result, as shown in fig. 6. The composition review result may include, but is not limited to, the composition score, keywords, lexicon matching degree, pinyin conversion result, wrongly written character recognition and correction results, and grammatical error results; the content displayed on the interface may be increased or decreased as required in a specific implementation.
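The OCR step of this workflow might look roughly as follows; the patent relies on a networked cloud OCR service, so the open-source Tesseract engine used in this sketch is only an illustrative stand-in.

```python
# Sketch: recognizing the composition text from an uploaded picture.
# The patent uses a cloud OCR service; Tesseract with Simplified Chinese
# data ("chi_sim") is used here only as an open-source stand-in.
from PIL import Image
import pytesseract

def recognize_composition(image_path):
    """Return the recognized Chinese text of a handwritten composition picture."""
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image, lang="chi_sim")
    # In the described workflow the OCR output is then proofread manually
    # before being passed to the scoring pipeline.
    return text.replace(" ", "").strip()
```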
Specifically, the Chinese composition automatic scoring method comprises the following steps:
acquiring a composition to be scored: acquiring a composition picture to be scored, and performing Chinese recognition to obtain a composition text; or directly acquiring the composition text to be evaluated;
shallow layer feature extraction: processing the composition text to be scored to obtain word segmentation results of the composition text; according to the word segmentation result, counting shallow features of the composition to be scored;
deep semantic feature extraction: extracting deep semantic features of the composition to be scored, wherein the deep semantic features comprise wrongly written character features and grammatical error features;
grading: and combining the extracted shallow layer features and deep layer semantic features and adopting random forest fitting to obtain a scoring result of the composition to be scored.
Fig. 7 illustrates the key steps of the above method. For the shallow feature extraction step, the composition text to be scored is processed with a probabilistic word segmentation model to obtain its segmentation result, and the shallow features of the composition to be scored are counted from that result; the shallow features include, but are not limited to, the number of sentences, the average sentence length, the full-text word count, the number of metaphor words, the number of pinyin occurrences, and the vocabulary level. The probabilistic word segmentation model is the one shown in FIG. 2: the segmentation tags S, B, M, E denote a single-character word and the beginning, middle, and end of a multi-character word respectively; each character is represented as a visible state o_t and its segmentation tag as a hidden state s_t, and the best segmentation is the tag combination that maximizes P(o_1, o_2, …, o_n | s_1, s_2, …, s_n). With λ the model parameters, a the state transition probability matrix, b the observation probability matrix, δ_t(i) = max P(i_t = i, i_{t-1}, …, i_1, o_t, …, o_1 | λ) (i = 1, 2, …, N) the maximum probability over single paths ending in state i at time t, and ψ_t(i) = argmax_{1 ≤ j ≤ N} [ δ_{t-1}(j) · a_{ji} ] the predecessor on that path, the termination step gives P* = max_{1 ≤ i ≤ N} δ_T(i) and i*_T = argmax_{1 ≤ i ≤ N} δ_T(i), and backtracking i*_t = ψ_{t+1}(i*_{t+1}) for t = T-1, …, 1 recovers the optimal word segmentation combination.
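As a concrete illustration of how the shallow statistical features listed above might be counted once the text has been segmented, the following Python sketch computes several of them. The sentence-delimiter set, the metaphor-word list, and the pinyin regular expression are illustrative assumptions, not values specified by the patent.

```python
import re

SENT_END = "。！？"                         # assumed sentence delimiters
METAPHOR_WORDS = {"好像", "仿佛", "如同"}    # assumed metaphor markers
PINYIN_RE = re.compile(r"[a-zA-Z]+")         # pinyin written in Latin letters

def shallow_features(text, tokens):
    """tokens: the word list produced by the segmentation step."""
    sentences = [s for s in re.split("[" + SENT_END + "]", text) if s.strip()]
    return {
        "sentence_count": len(sentences),
        "avg_sentence_len": sum(len(s) for s in sentences) / max(len(sentences), 1),
        "word_count": len(tokens),
        "metaphor_count": sum(1 for t in tokens if t in METAPHOR_WORDS),
        "pinyin_count": len(PINYIN_RE.findall(text)),
    }
```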
For the wrongly written character feature extraction step, the composition text to be scored is processed with the probabilistic word segmentation model to obtain its segmentation result. According to the segmentation result, the composition text to be scored is compared with the wrongly written character recognition corpus, and the unmatched words are collected into a suspicious word set. The wrongly written character recognition corpus can include, but is not limited to, a manually defined dictionary, a confusion-set dictionary, and a People's Daily dictionary; in the embodiment the manually defined dictionary contains 177 entries, the confusion-set dictionary 759, and the People's Daily dictionary 584,429. The suspicious word set is then compared with the wrongly written character correction corpus to obtain a candidate word set, the semantic perplexity of each candidate is calculated, the candidate with the lowest perplexity is taken as the correction result, and the original word is reported as the wrongly written character. The correction corpus may include, but is not limited to, a common-word dictionary, a same-component/same-radical set, and a same-pinyin set; in the embodiment the common-word dictionary contains 3502 entries, the same-pinyin dictionary 3431, and the similar-character dictionary 1664. The semantic perplexity is computed with the trained perplexity model: with w_i a word of the composition to be scored, the perplexity PP of a sentence S = w_1 w_2 … w_N is

PP(S) = P(w_1 w_2 … w_N)^(-1/N) = ( ∏_{i=1}^{N} 1 / P(w_i | w_1, …, w_{i-1}) )^(1/N).

The semantic perplexity of each element of the candidate word set is calculated with this model, and the word with the lowest perplexity is taken as the wrongly written character correction result.
For the grammatical error feature extraction step, the composition text to be scored is processed to obtain its word vectors, which are input into the Bi-LSTM neural network model; training yields a labeling sequence, and the words labeled R, M, S, or W are the grammatical error result. Bi-LSTM is adopted as the neural network model; define c as the cell state, a as the cell output, w as the weights, and σ as the activation function, with sigmoid selected. An LSTM cell operates through three gates. The first is the forget gate, which selectively forgets the previous cell's output and state:

f_t = σ(w_f · [a_{t-1}, w_t] + b_f).

Next, the new information to be stored in the cell state is determined in two parts: a sigmoid layer decides the update values and a tanh layer creates a new candidate vector,

u_t = σ(w_u · [a_{t-1}, w_t] + b_u),  c̃_t = tanh(w_c · [a_{t-1}, w_t] + b_c).

When the cell state is updated, part of the old information is discarded and the new information is added, giving the next cell state,

c_t = f_t · c_{t-1} + u_t · c̃_t.

Finally, a sigmoid layer decides which part of the state to output, and the cell state is passed through tanh to obtain the desired output:

o_t = σ(w_o · [a_{t-1}, w_t] + b_o),  a_t = o_t · tanh(c_t).

The output of the Bi-LSTM network is processed by a conditional random field (CRF), which takes the dependencies between adjacent positions into account to produce a high-accuracy labeling sequence; the labeling sequence gives the part of speech and the grammatical-error label of each character. The labels R, M, S, W correspond to four types of grammatical errors: redundant words (R), missing words (M), wrong word selection (S), and word-order errors (W). The grammatical error features may include, but are not limited to, one or more of these four types.
In the scoring step, i.e. the regression step, the extracted shallow features and deep semantic features (including the wrongly written character and grammatical error features) are combined and a random forest is trained to obtain the final score of the composition to be scored. The random forest first resamples the sample data: each time, N samples are drawn with replacement from the original N training samples, and each resulting sample set is used to train one decision tree. When a decision tree is built, m of the candidate features are randomly selected as the candidate features for the decision at the current node, and the best split is chosen among them. After a group of decision trees is obtained, their outputs are aggregated by voting, and the result with the most votes is taken as the decision of the random forest. In the embodiment of the invention, 100 decision trees are trained each time; the average scoring error on a 100-point scale is 2.78 points, and the quadratic weighted kappa consistency measure is 0.759.
The automatic Chinese composition scoring method can also comprise a pinyin conversion step and a topic extraction step. The pinyin conversion step converts pinyin in the user's text into the corresponding Chinese characters: using the same approach as the probabilistic word segmentation model, the pinyin syllables are treated as visible states and the Chinese characters sharing each pinyin as hidden states, and solving the model gives the best conversion result. The topic extraction step extracts the topics implicit in the user's composition. Assume the article is generated from K topics, the k-th topic being a distribution over words, and construct an LDA (Latent Dirichlet Allocation) model: for any composition d, the topic distribution θ_d ~ Dirichlet(α), where α is a K-dimensional Dirichlet hyperparameter; for any topic k, the word distribution β_k ~ Dirichlet(η). The conditional probability that word w_i belongs to topic k, given all other topic assignments, is

p(z_i = k | z_{-i}, w) ∝ (n_{d,k} + α_k) · (n_{k,w_i} + η_{w_i}) / Σ_v (n_{k,v} + η_v),

where the counts n exclude the current word.
Gibbs sampling is performed on the conditional probability to obtain a topic of each word, and K is set to be 5 in the embodiment of the present invention.
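For the pinyin conversion step described above, the following sketch applies the same Viterbi idea, treating the pinyin syllables as observations and the Chinese characters sharing each syllable as hidden states; the candidate table and the bigram scoring function are illustrative assumptions.

```python
def pinyin_to_hanzi(syllables, candidates, bigram_logp):
    """Viterbi decoding over the character lattice defined by `candidates`.
    candidates[s]         : list of Chinese characters sharing pinyin syllable s
    bigram_logp(prev, ch) : log transition score between consecutive characters
    """
    # Each layer maps a candidate character to (best path score, back-pointer).
    layers = [{ch: (0.0, None) for ch in candidates[syllables[0]]}]
    for syl in syllables[1:]:
        new_layer = {}
        for ch in candidates[syl]:
            best_prev, best = max(
                ((p, layers[-1][p][0] + bigram_logp(p, ch)) for p in layers[-1]),
                key=lambda x: x[1],
            )
            new_layer[ch] = (best, best_prev)
        layers.append(new_layer)
    # Backtrack from the best final character.
    last = max(layers[-1], key=lambda ch: layers[-1][ch][0])
    out = [last]
    for layer in reversed(layers[1:]):
        out.append(layer[out[-1]][1])
    out.reverse()
    return "".join(out)
```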
Figs. 8-10 show an embodiment of composition scoring using the automatic Chinese composition scoring method of the present invention: fig. 8 is a schematic diagram of the acquired picture of the composition to be scored, fig. 9 is a schematic diagram of the Chinese character recognition, and fig. 10 shows the result of scoring with the automatic Chinese composition scoring method of the invention.
The embodiment of the invention also comprises a Chinese composition automatic scoring system, and each module of the system corresponds to each step of the Chinese composition automatic scoring method one by one. The system comprises the following modules:
the composition to be scored acquisition module: acquiring a composition picture to be scored, and performing Chinese recognition to obtain a composition text; or directly acquiring the composition text to be evaluated;
shallow layer feature extraction module: the system is used for processing the composition texts to be scored to obtain word segmentation results of the composition texts; according to the word segmentation result, counting shallow features of the composition to be scored;
the deep semantic feature extraction module: used for extracting deep semantic features of the composition to be scored, wherein the deep semantic features comprise wrongly written character features and grammatical error features;
a scoring module: and the method is used for combining the extracted shallow layer characteristics and deep layer semantic characteristics and adopting random forest fitting to obtain a scoring result of the composition to be scored.
The embodiment of the invention also comprises an automatic Chinese composition scoring teaching and assisting system, which comprises a memory, a processor, and a computer program stored on the memory and executable on the processor; or the teaching and assisting system comprises a terminal and a cloud server connected to the terminal and storing a computer program, wherein the computer program, when executed, implements the automatic Chinese composition scoring method of the invention.
Embodiments of the present invention also include a computer-readable storage medium on which a computer program is stored, the computer program, when executed, implementing the automatic Chinese composition scoring method of the invention.
Embodiments of the present invention also include a computer program product which, when executed, implements the automatic Chinese composition scoring method of the invention.
Composition scoring methods that consider only shallow features have low scoring accuracy, while methods that consider only deep semantic features require a large corpus for sample training. By combining the shallow features and the deep semantic features of the composition, the invention improves the scoring accuracy and effectively improves sample utilization, thereby solving a series of problems in the prior art.
Compared with existing Chinese composition scoring software, the automatic composition scoring method and teaching and assisting system of the invention have the following advantages: by combining the shallow features and the deep semantic features of the composition, the technical solution achieves high scoring accuracy, obtains satisfactory evaluation results when trained on small samples, and effectively improves sample utilization; meanwhile, functions such as wrongly written character recognition and correction, pinyin recognition and conversion, and grammatical error recognition and correction are added, providing multi-dimensional information feedback and enhancing the user experience.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these modifications and substitutions should also be regarded as the protection scope of the present invention.

Claims (10)

1. A method for constructing an automatic Chinese composition scoring system, characterized in that the method comprises the following steps:
a corpus construction step, which is used for constructing a Chinese composition corpus;
a shallow feature extraction step, namely extracting shallow features of the composition based on the corpus;
a deep semantic feature extraction step, wherein deep semantic features of the composition are extracted based on the corpus, and the deep semantic features comprise wrongly written character features and grammatical error features;
and a regression step, which is used for combining the extracted shallow layer characteristics and deep layer semantic characteristics and adopting random forest fitting to obtain the scoring result of the composition.
2. The method for constructing the automatic Chinese composition scoring system according to claim 1, characterized in that the extraction of the wrongly written character features specifically comprises: segmenting the composition with a probabilistic word segmentation model; comparing the composition text with the wrongly written character recognition corpus according to the segmentation result to obtain a suspicious word set; comparing the suspicious word set with the wrongly written character correction corpus to obtain a candidate word set; and calculating the semantic perplexity of the candidate word set and taking the word with the lowest perplexity as the wrongly written character correction result.
3. The method for constructing the automatic Chinese composition scoring system according to claim 1, characterized in that the extraction of the grammatical error features specifically comprises: training word vectors on the corpus, inputting the word vectors into a Bi-LSTM neural network model, and training to obtain a labeling sequence, i.e., the grammatical error result.
4. A Chinese composition automatic scoring method is characterized in that: the method comprises the following steps:
acquiring a composition to be scored: acquiring a composition picture to be scored, and performing Chinese recognition to obtain a composition text; or directly acquiring the composition text to be evaluated;
shallow layer feature extraction: processing the composition text to be scored to obtain word segmentation results of the composition text; according to the word segmentation result, counting shallow features of the composition to be scored;
deep semantic feature extraction: extracting deep semantic features of the composition to be scored, wherein the deep semantic features comprise wrongly written character features and grammatical error features;
grading: and combining the extracted shallow layer features and deep layer semantic features and adopting random forest fitting to obtain a scoring result of the composition to be scored.
5. The method for automatically scoring Chinese compositions as claimed in claim 4, characterized in that the extraction of the wrongly written character features specifically comprises: processing the composition text to be scored to obtain its word segmentation result; comparing the composition text to be scored with the wrongly written character recognition corpus according to the segmentation result to obtain a suspicious word set; comparing the suspicious word set with the wrongly written character correction corpus to obtain a candidate word set; and calculating the semantic perplexity of the candidate word set and taking the word with the lowest perplexity as the wrongly written character correction result.
6. The method for automatically scoring Chinese compositions as claimed in claim 4, characterized in that the extraction of the grammatical error features specifically comprises: processing the composition text to be scored to obtain its word vectors; and inputting the word vectors into a Bi-LSTM neural network model and training to obtain a labeling sequence, i.e., the grammatical error result.
7. The method for automatically scoring Chinese compositions as claimed in claim 4, further comprising a pinyin conversion step for identifying the pinyin in the text to be scored and converting it into the corresponding Chinese characters.
8. The method for automatically scoring Chinese compositions as claimed in claim 4, further comprising a topic extraction step for extracting the topics implicit in the text to be scored.
9. An automatic scoring system for Chinese composition is characterized in that: the system comprises the following modules:
the composition to be scored acquisition module: acquiring a composition picture to be scored, and performing Chinese recognition to obtain a composition text; or directly acquiring the composition text to be evaluated;
shallow layer feature extraction module: the system is used for processing the composition texts to be scored to obtain word segmentation results of the composition texts; according to the word segmentation result, counting shallow features of the composition to be scored;
the deep semantic feature extraction module: used for extracting deep semantic features of the composition to be scored, wherein the deep semantic features comprise wrongly written character features and grammatical error features;
a scoring module: and the method is used for combining the extracted shallow layer characteristics and deep layer semantic characteristics and adopting random forest fitting to obtain a scoring result of the composition to be scored.
10. An automatic Chinese composition scoring teaching and assisting system, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; or comprising a terminal and a cloud server connected to the terminal and storing a computer program; characterized in that the computer program, when executed, implements the automatic scoring method for Chinese compositions according to any one of claims 4 to 8.
CN201911059419.3A 2019-11-01 2019-11-01 Automatic scoring method for Chinese composition and teaching assistance system Active CN110851599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911059419.3A CN110851599B (en) 2019-11-01 2019-11-01 Automatic scoring method for Chinese composition and teaching assistance system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911059419.3A CN110851599B (en) 2019-11-01 2019-11-01 Automatic scoring method for Chinese composition and teaching assistance system

Publications (2)

Publication Number Publication Date
CN110851599A true CN110851599A (en) 2020-02-28
CN110851599B CN110851599B (en) 2023-04-28

Family

ID=69598489

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911059419.3A Active CN110851599B (en) 2019-11-01 2019-11-01 Automatic scoring method for Chinese composition and teaching assistance system

Country Status (1)

Country Link
CN (1) CN110851599B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1442804A (en) * 2002-03-01 2003-09-17 何万贯 Automatic composition comment education system
WO2005045786A1 (en) * 2003-10-27 2005-05-19 Educational Testing Service Automatic essay scoring system
CN105045778A (en) * 2015-06-24 2015-11-11 江苏科技大学 Chinese homonym error auto-proofreading method
CN110069768A (en) * 2018-01-22 2019-07-30 北京博智天下信息技术有限公司 A kind of English argumentative writing automatic scoring method based on the structure of an article
CN108595410A (en) * 2018-03-19 2018-09-28 小船出海教育科技(北京)有限公司 The automatic of hand-written composition corrects method and device
CN109614623A (en) * 2018-12-12 2019-04-12 广东小天才科技有限公司 A kind of composition processing method and system based on syntactic analysis
CN109948152A (en) * 2019-03-06 2019-06-28 北京工商大学 A kind of Chinese text grammer error correcting model method based on LSTM
CN110264792A (en) * 2019-06-17 2019-09-20 上海元趣信息技术有限公司 One kind is for pupil's composition intelligent tutoring system
CN110276077A (en) * 2019-06-25 2019-09-24 上海应用技术大学 The method, device and equipment of Chinese error correction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chen Yile: "Research on Automatic Scoring Technology for Chinese Compositions Based on Regression Analysis", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Social Sciences II *
Chen Shanshan: "Research on Automatic Essay Scoring Models and Methods", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581379B (en) * 2020-04-28 2022-03-25 电子科技大学 Automatic composition scoring method based on the composition's degree of topic adherence
CN111581379A (en) * 2020-04-28 2020-08-25 电子科技大学 Automatic composition scoring method based on the composition's degree of topic adherence
CN112380830A (en) * 2020-06-18 2021-02-19 达而观信息科技(上海)有限公司 Method, system and computer readable storage medium for matching related sentences in different documents
CN112380830B (en) * 2020-06-18 2024-05-17 达观数据有限公司 Matching method, system and computer readable storage medium for related sentences in different documents
CN111832281A (en) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 Composition scoring method and device, computer equipment and computer readable storage medium
WO2021139265A1 (en) * 2020-07-16 2021-07-15 平安科技(深圳)有限公司 Composition scoring method and apparatus, computer device, and computer readable storage medium
CN111914544A (en) * 2020-08-18 2020-11-10 科大讯飞股份有限公司 Metaphor sentence recognition method, device, equipment and storage medium
CN112199946A (en) * 2020-09-15 2021-01-08 北京大米科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN112199946B (en) * 2020-09-15 2024-05-07 北京大米科技有限公司 Data processing method, device, electronic equipment and readable storage medium
CN112183065A (en) * 2020-09-16 2021-01-05 北京思源智通科技有限责任公司 Text evaluation method and device, computer readable storage medium and terminal equipment
CN112287921A (en) * 2020-10-15 2021-01-29 泰州锐比特智能科技有限公司 Composition evaluation system and method based on wrong word identification
CN112364990A (en) * 2020-10-29 2021-02-12 北京语言大学 Method and system for realizing grammar error correction and less sample field adaptation through meta-learning
CN112364990B (en) * 2020-10-29 2021-06-04 北京语言大学 Method and system for realizing grammar error correction and less sample field adaptation through meta-learning
CN112686020A (en) * 2020-12-29 2021-04-20 科大讯飞股份有限公司 Composition scoring method and device, electronic equipment and storage medium
CN112686020B (en) * 2020-12-29 2024-06-04 科大讯飞股份有限公司 Composition scoring method and device, electronic equipment and storage medium
CN114692606A (en) * 2020-12-31 2022-07-01 暗物智能科技(广州)有限公司 English composition analysis scoring system, method and storage medium
CN114519345B (en) * 2022-01-17 2023-11-07 广东南方网络信息科技有限公司 Content checking method and device, mobile terminal and storage medium
CN114519345A (en) * 2022-01-17 2022-05-20 广东南方网络信息科技有限公司 Content proofreading method and device, mobile terminal and storage medium

Also Published As

Publication number Publication date
CN110851599B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN110851599B (en) Automatic scoring method for Chinese composition and teaching assistance system
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN110852087B (en) Chinese error correction method and device, storage medium and electronic device
CN109783657B (en) Multi-step self-attention cross-media retrieval method and system based on limited text space
CN110147436B (en) Education knowledge map and text-based hybrid automatic question-answering method
Dong et al. Automatic features for essay scoring–an empirical study
CN110083710B (en) Word definition generation method based on cyclic neural network and latent variable structure
CN110442841B (en) Resume identification method and device, computer equipment and storage medium
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN110134954B (en) Named entity recognition method based on Attention mechanism
CN110750959A (en) Text information processing method, model training method and related device
CN111475629A (en) Knowledge graph construction method and system for math tutoring question-answering system
CN108717413B (en) Open field question-answering method based on hypothetical semi-supervised learning
CN108345583B (en) Event identification and classification method and device based on multilingual attention mechanism
CN107544958B (en) Term extraction method and device
Jin et al. Combining CNNs and pattern matching for question interpretation in a virtual patient dialogue system
CN110276069A Automatic Chinese Braille error detection method, system and storage medium
CN110222344B (en) Composition element analysis algorithm for composition tutoring of pupils
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN110781681A (en) Translation model-based elementary mathematic application problem automatic solving method and system
CN114528919A (en) Natural language processing method and device and computer equipment
CN110968708A (en) Method and system for labeling education information resource attributes
Ortiz-Zambrano et al. Overview of ALexS 2020: First workshop on lexical analysis at SEPLN
CN115455167A (en) Geographic examination question generation method and device based on knowledge guidance
CN114579706B (en) Automatic subjective question review method based on BERT neural network and multi-task learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant