CN115204151A - Chinese text error correction method, system and readable storage medium - Google Patents

Chinese text error correction method, system and readable storage medium Download PDF

Info

Publication number
CN115204151A
Authority
CN
China
Prior art keywords
target
word
character
error correction
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211118545.3A
Other languages
Chinese (zh)
Inventor
王鹏鸣
郝书乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Jiaotong University
Original Assignee
East China Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Jiaotong University filed Critical East China Jiaotong University
Priority to CN202211118545.3A priority Critical patent/CN115204151A/en
Publication of CN115204151A publication Critical patent/CN115204151A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese text error correction method, system and readable storage medium. The method first divides an original Chinese text into paragraphs to obtain a target paragraph and forms a sentence vector from it. The sentence vector is then input into a Bert model; the strong reading-comprehension capability of the Bert model allows the sentence meaning to be fully understood, which addresses the problems of numerous error types and long-distance dependence. A preset number of top-ranked predicted character codes are obtained and a predicted character set is generated; a candidate character set is then generated in combination with a preset shape-and-sound-similarity dictionary; finally, a character is selected from the intersection of the predicted character set and the candidate character set to replace the target character.

Description

Chinese text error correction method, system and readable storage medium
Technical Field
The invention relates to the technical field of word processing, in particular to a Chinese text error correction method, a Chinese text error correction system and a readable storage medium.
Background
With the wide application of pinyin input methods and speech recognition, more and more Chinese text errors arise from phonetic confusion, while handwriting input methods additionally produce many errors between visually similar characters. Correcting Chinese text is therefore a challenging problem.
The problems currently faced by Chinese text error correction are as follows. (1) Many error types: under the currently popular grammatical-error annotation systems, grammatical errors are divided into more than fifty types, and under some grammar-teaching systems they can even be divided into more than a hundred. With such complicated error types, learning targets are difficult to unify and error correction accuracy is low. (2) Long-distance dependence: taking subject-predicate agreement as an example, the subject needed to judge the agreement of a predicate is sometimes far from that predicate; such long-distance agreement cases are sparse in training corpora, and learning long-distance information is precisely a difficulty for machine learning.
Disclosure of Invention
Therefore, the embodiment of the invention provides a Chinese text error correction method, a Chinese text error correction system and a readable storage medium, so as to improve the error correction accuracy and avoid the problem of long-distance dependence.
The embodiment of the invention provides a Chinese text error correction method, which comprises the following steps:
acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and using an occlusion mark to occlude a target word in the target paragraph to obtain an occluded target paragraph;
coding the occlusion target paragraph to form a sentence vector;
inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating a score value of each predicted character code;
sorting the score values of the predicted character codes according to the order of the score values from high to low, and acquiring a preset number of predicted character codes which are sorted in the front;
decoding the predictive character codes with the preset number and the front sequence to obtain the corresponding predictive characters with the preset number and form a predictive character set;
judging whether the character of the target character exists in the predicted character set or not;
if the character of the target character does not exist in the predicted character set, generating a candidate character set by combining a preset shape-similar-sound dictionary according to the character of the target character;
generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
According to the Chinese text error correction method provided by the embodiment of the invention, the original Chinese text is first divided into paragraphs to obtain a target paragraph, from which a sentence vector is formed. The sentence vector is then input into a Bert model; the strong reading-comprehension capability of the Bert model allows the sentence meaning to be fully understood, which addresses the problems of numerous error types and long-distance dependence. After the top-ranked preset number of predicted character codes are obtained and a predicted character set is generated, a candidate character set is generated in combination with a preset shape-and-sound-similarity dictionary, and finally a character is selected from the intersection of the predicted character set and the candidate character set to replace the target character. Since most errors are caused by similar shapes and similar pronunciations, using this intersection as the basis for error correction can effectively improve the accuracy of error correction.
In addition, the method for correcting the chinese text according to the above embodiment of the present invention may further have the following additional technical features:
further, in the second step, the step of encoding the occlusion target paragraph to form a sentence vector specifically includes:
coding the occlusion target paragraph by using a token function to form a sentence vector, and during coding, carrying out attention labeling on each token sequence;
the sentence vector $E$ is expressed as:

$$E = \mathrm{Encoder}(T), \qquad T = [t_1, t_2, \cdots, t_i, \cdots, t_N]$$

wherein $\mathrm{Encoder}(\cdot)$ represents the encoder operation, $t_i$ represents the token value of the $i$-th word $x_i$, $t_N$ represents the token value of the $N$-th word $x_N$, and $N$ represents the total number of words.
Further, in the third step, in the step of inputting the sentence vector into the Bert model and predicting a plurality of predicted characters at the position of the occlusion mark according to the context of the target paragraph, the preceding and following language features are extracted by a self-attention mechanism, and the corresponding formulas are:

$$Q = W_Q E, \qquad K = W_K E, \qquad V = W_V E$$

wherein $Q$ represents the first feature vector of the Attention weight (the query), $K$ represents the second feature vector of the Attention weight (the key), $V$ represents the vector of input features (the value), $W_Q$ represents the weight of the first feature vector, $W_K$ represents the weight of the second feature vector, $W_V$ represents the weight of the vector of input features, and $E$ represents the sentence vector.
Further, in the third step, $Q$, $K$ and $V$ are linearly transformed using multiple sets of parameters, and the results of all attention heads are combined to obtain the result of the self-attention mechanism. The corresponding formulas are:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

wherein $\mathrm{MultiHead}(Q, K, V)$ represents the multi-head self-attention operation on $Q$, $K$ and $V$, $\mathrm{Concat}$ represents the merging of the multiple attention heads, $W^O$ represents the output weight, $\mathrm{head}_i$ represents the attention head corresponding to the $i$-th group, $W_i^Q$ represents the weight of the first feature vector of the $i$-th group, $W_i^K$ represents the weight of the second feature vector of the $i$-th group, $W_i^V$ represents the weight of the vector of input features of the $i$-th group, and $\mathrm{Attention}$ represents the corresponding attention function.
Further, in the third step, in the step of calculating the score value of each predicted character code, the score value of the predicted character code is calculated using the following formula:

$$s = \mathrm{softmax}\big(h_{[\mathrm{MASK}]} W_e^{\mathsf T}\big)$$

wherein $s$ represents the score value of the predicted character code, $h_{[\mathrm{MASK}]}$ represents the encoder output at the occluded position, $W_e$ represents the output word-embedding matrix, and $\mathsf{T}$ represents the transpose operation.
Further, after the step of determining whether the character of the target word exists in the predicted character set, the method further includes:
if the character of the target character exists in the predicted character set, judging that the target character does not need error correction;
after the step of generating an error correction set according to the predicted character set and the candidate character set, the method further comprises:
determining whether the error correction set is empty;
if the error correction set is empty, judging that the target word does not need error correction;
and if the error correction set is not empty, selecting one word in the error correction set to replace the target word.
Further, the step of generating a candidate character set in combination with a preset shape-and-sound-similarity dictionary according to the characters of the target word specifically includes:
searching a plurality of candidate characters from the preset shape-and-sound-similarity dictionary by using an edit distance algorithm according to the characters of the target word, and generating the candidate character set.
Further, the Bert model includes a first-part loss function and a second-part loss function, and the first-part loss function and the second-part loss function form a total loss function.
The first-part loss function is expressed as:

$$L_1(\theta, \theta_1) = -\sum_{i=1}^{M} \log p(m = m_i \mid \theta, \theta_1), \quad m_i \in [1, 2, \cdots, |V|]$$

wherein $L_1(\theta, \theta_1)$ represents the first-part loss function of the Bert model, $M$ represents the maximum number of words in the set formed by the masked words, $\theta$ represents the parameters of the Encoder part of the Bert model, $\theta_1$ represents the parameters of the output layer connected to the Encoder in the Mask-LM task, $|V|$ represents the size of the dictionary, $m_i$ represents the $i$-th word in the word set, and $p(m = m_i \mid \theta, \theta_1)$ represents the score probability corresponding to the $i$-th word.
The second-part loss function is correspondingly expressed as:

$$L_2(\theta, \theta_2) = -\sum_{j=1}^{N} \log p(n = n_j \mid \theta, \theta_2), \quad n_j \in [\mathrm{IsNext}, \mathrm{NotNext}]$$

wherein $L_2(\theta, \theta_2)$ represents the second-part loss function of the Bert model, $\theta_2$ represents the parameters of the classifier connected after the Encoder in the sentence prediction task, $n_j$ represents the relationship corresponding to the $j$-th sentence pair, $[\mathrm{IsNext}, \mathrm{NotNext}]$ represents the two relationship kinds, $\mathrm{IsNext}$ indicates that two adjacent sentences are related, $\mathrm{NotNext}$ indicates that two adjacent sentences are not related, and $p(n = n_j \mid \theta, \theta_2)$ represents the corresponding score probability.
The total loss function is expressed as:

$$L(\theta, \theta_1, \theta_2) = L_1(\theta, \theta_1) + L_2(\theta, \theta_2)$$

wherein $L(\theta, \theta_1, \theta_2)$ represents the total loss function of the Bert model.
The invention also provides a Chinese text error correction system, wherein the system comprises:
the target shielding module is used for acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding a target word in the target paragraph by using a shielding mark to obtain a shielded target paragraph;
the target coding module is used for coding the shielded target paragraphs to form sentence vectors;
the coding prediction module is used for inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating the score of each predicted character code;
the score sorting module is used for sorting the score values of all the predictive character codes according to the order of the score values from high to low and acquiring the predictive character codes with the preset number which are sorted in the front;
the encoding and decoding module is used for decoding the predicted character codes with the preset number and the front order to obtain the corresponding predicted characters with the preset number and form a predicted character set;
the character judgment module is used for judging whether the character of the target character exists in the predicted character set or not;
the first generation module is used for generating a candidate character set in combination with a preset shape-similarity dictionary according to the characters of the target characters if the characters of the target characters do not exist in the predicted character set;
the second generation module is used for generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
the target replacement module is used for, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
The invention also proposes a readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the Chinese text error correction method as described above.
Drawings
The above and/or additional aspects and advantages of embodiments of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a method for correcting errors in Chinese text according to an embodiment of the present invention;
FIG. 2 is a block diagram of a Chinese text error correction system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Referring to FIG. 1, a Chinese text error correction method according to an embodiment of the present invention includes the following steps S1 to S9:
s1, acquiring an input original Chinese text, carrying out paragraph division on the original Chinese text to obtain a target paragraph, and using an occlusion mark to occlude a target word in the target paragraph to obtain an occluded target paragraph.
Because the Bert model has good performance on extracting feature words of long texts and understanding sentence meanings, the original Chinese text is segmented according to paragraphs and then transmitted into the Bert model, so that the Bert model can fully understand the sentence meanings.
In this embodiment, the occlusion mark is represented by [MASK]. For example, for the sentence "他是一位帝王" ("he is a monarch"), occluding the target character "帝" yields "他是一位[MASK]王", and the character at the [MASK] position is then predicted.
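As an illustrative sketch only, the masking-and-prediction flow described here can be reproduced with the open-source Hugging Face transformers library; the patent names no implementation, so the "bert-base-chinese" checkpoint and the example sentence below are assumptions:

```python
# Hypothetical sketch of steps S1/S3: occlude one character with [MASK]
# and let a masked language model score every vocabulary entry for it.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "他是一位[MASK]王"  # the target character "帝" is occluded
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# locate the [MASK] position and read out its vocabulary scores
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
scores = logits[0, mask_pos.item()]
print(tokenizer.convert_ids_to_tokens(scores.topk(5).indices.tolist()))
```

The top-5 list printed at the end corresponds to the preset number of top-ranked predicted character codes that steps S4 and S5 below decode into the predicted character set.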
In this step, the target paragraph after paragraph division is expressed in the form of a word list:

$$S = [s_1, s_2, \cdots, s_i, \cdots, s_n]$$

wherein $s_i$ represents the $i$-th word in the target paragraph, $s_n$ represents the $n$-th word in the target paragraph, $i \in [1, n]$, and $n$ represents the maximum number of words in the target paragraph.
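As a minimal sketch of this division step, assuming a simple newline-based paragraph split (the patent does not fix the splitting rule), the word-list representation above can be produced as follows:

```python
# Hypothetical paragraph division for step S1: split the raw text into
# paragraphs, then represent each paragraph as a list of characters,
# i.e. S = [s_1, s_2, ..., s_n].
def divide_paragraphs(text: str) -> list[list[str]]:
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    return [list(p) for p in paragraphs]

doc = "他是一位帝王。\n他十分勇敢。"
for S in divide_paragraphs(doc):
    print(S)  # e.g. ['他', '是', '一', '位', '帝', '王', '。']
```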
And S2, coding the occlusion target paragraph to form a sentence vector.
Inputting raw characters directly into the model would make the computation too heavy; instead, each character is numerically encoded according to the dictionary, which greatly reduces the amount of computation when the input is fed into the model.
Specifically, the occlusion target paragraphs are encoded by using a token function to form a sentence vector, and during encoding, attention labeling is performed on each token sequence.
Since the Bert model can only process input sequences of up to 512 tokens, an input shorter than 512 tokens is padded with zeros. In this embodiment, attention labeling is performed on each token sequence during encoding; that is, besides the token sequence, the Bert model also takes an attention_mask sequence. If a token corresponds to a character in the text, its attention_mask is labeled 1, indicating that the token should be attended to during model computation; for the padded zeros, the attention_mask is labeled 0, and these meaningless tokens are not considered in the model computation.
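A short sketch of this padding-and-attention-labeling behaviour, again using the transformers tokenizer as an assumed implementation:

```python
# Pad the encoded paragraph to the 512-token limit; attention_mask is 1
# for real text tokens and 0 for the meaningless padding tokens.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
enc = tokenizer("他是一位[MASK]王", padding="max_length", max_length=512,
                truncation=True)

print(enc["input_ids"][:10])        # token codes of the sentence vector
print(enc["attention_mask"][:10])   # 1 = attend to this token, 0 = padding
print(sum(enc["attention_mask"]))   # number of tokens the model attends to
```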
In this step, the sentence vector $E$ is expressed as:

$$E = \mathrm{Encoder}(T), \qquad T = [t_1, t_2, \cdots, t_i, \cdots, t_N]$$

wherein $\mathrm{Encoder}(\cdot)$ represents the encoder operation, $t_i$ represents the token value of the $i$-th word $x_i$, $t_N$ represents the token value of the $N$-th word $x_N$, and $N$ represents the total number of words.
And S3, inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of the shielding mark according to the context of the target paragraph, and calculating the score of each predicted character code.
In this step, the preceding and following language features are extracted using a self-attention mechanism. Specifically, the calculation formulas are as follows:

$$Q = W_Q E, \qquad K = W_K E, \qquad V = W_V E$$

wherein $Q$ represents the first feature vector of the Attention weight (the query), $K$ represents the second feature vector of the Attention weight (the key), $V$ represents the vector of input features (the value), $W_Q$ represents the weight of the first feature vector, $W_K$ represents the weight of the second feature vector, $W_V$ represents the weight of the vector of input features, and $E$ represents the sentence vector.
For the above $Q$, $K$ and $V$, linear transformations are performed using multiple sets of parameters; the final result of the self-attention mechanism is the combination of the results of all attention heads:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

wherein $\mathrm{MultiHead}(Q, K, V)$ represents the multi-head self-attention operation on $Q$, $K$ and $V$, $\mathrm{Concat}$ represents the merging of the multiple attention heads, $W^O$ represents the output weight, $\mathrm{head}_i$ represents the attention head corresponding to the $i$-th group, $W_i^Q$ represents the weight of the first feature vector of the $i$-th group, $W_i^K$ represents the weight of the second feature vector of the $i$-th group, $W_i^V$ represents the weight of the vector of input features of the $i$-th group, and $\mathrm{Attention}$ represents the corresponding attention function.
Additionally, in the present embodiment, the scale factor $\sqrt{d_k}$ is added to the self-attention mechanism, and the calculation formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf T}}{\sqrt{d_k}}\right) V$$

wherein $\mathrm{softmax}$ represents the normalization function operation and $\mathsf{T}$ represents the transpose operation.
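The following sketch implements the scaled dot-product self-attention formula above for a single head; the dimensions (d_model = 768, d_k = 64) are illustrative assumptions rather than values fixed by the patent:

```python
# One-head scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V.
import torch
import torch.nn.functional as F

def self_attention(E, W_q, W_k, W_v):
    Q, K, V = E @ W_q, E @ W_k, E @ W_v      # Q = W_Q E etc., in matrix form
    d_k = K.size(-1)
    weights = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return weights @ V

torch.manual_seed(0)
seq_len, d_model, d_k = 6, 768, 64
E = torch.randn(seq_len, d_model)             # sentence vector after encoding
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(E, W_q, W_k, W_v).shape)  # torch.Size([6, 64])
```

In the multi-head case, this computation is repeated with h independent weight groups and the head outputs are concatenated, as in the MultiHead formula given earlier.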
Further, in this step, the score value of the predicted character code is calculated using the following formula:

$$s = \mathrm{softmax}\big(h_{[\mathrm{MASK}]} W_e^{\mathsf T}\big)$$

wherein $s$ represents the score value of the predicted character code, $h_{[\mathrm{MASK}]}$ represents the encoder output at the occluded position, and $W_e$ represents the output word-embedding matrix.
Still taking "他是一位[MASK]王" as an example: when predicting the character at the [MASK] position, the prediction may at first be inaccurate. Even though "帝" may be among the predictions, to the Bert model it may appear no more likely than characters such as "地", "天" or "冥". However, if the preceding sentence describes how brave this person was, that long-distance context disambiguates the prediction, and the score value of "帝" calculated by the above formula becomes far higher than the score values of "地", "天", "冥" and other candidates.
And S4, sorting the score values of the predictive character codes according to the order of the score values from high to low, and acquiring the predictive character codes with the preset number, which are sorted in the front, of the predictive character codes.
Preferably, the predetermined number is at least 5.
And S5, decoding the predictive character codes with the preset number, which are ranked at the top, to obtain the corresponding predictive characters with the preset number, and forming a predictive character set.
In the present invention, the loss function of the Bert model consists of two parts: the first part comes from the word-level classification task of Mask-LM, and the other part is a sentence-level classification task. Through the joint learning of these two tasks, the representation learned by Bert carries token-level information and also contains sentence-level semantic information.
Specifically, for the first-part loss function, the masked words form a set, and each masked word corresponds to one of the $|V|$ dictionary entries. The first-part loss function is expressed as:

$$L_1(\theta, \theta_1) = -\sum_{i=1}^{M} \log p(m = m_i \mid \theta, \theta_1), \quad m_i \in [1, 2, \cdots, |V|]$$

wherein $L_1(\theta, \theta_1)$ represents the first-part loss function of the Bert model, $M$ represents the maximum number of words in the set formed by the masked words, $\theta$ represents the parameters of the Encoder part of the Bert model, $\theta_1$ represents the parameters of the output layer connected to the Encoder in the Mask-LM task, $|V|$ represents the size of the dictionary, $m_i$ represents the $i$-th word in the word set, and $p(m = m_i \mid \theta, \theta_1)$ represents the score probability corresponding to the $i$-th word.

Further, the sentence prediction task is a classification problem with the second-part loss function, expressed as:

$$L_2(\theta, \theta_2) = -\sum_{j=1}^{N} \log p(n = n_j \mid \theta, \theta_2), \quad n_j \in [\mathrm{IsNext}, \mathrm{NotNext}]$$

wherein $L_2(\theta, \theta_2)$ represents the second-part loss function of the Bert model, $\theta_2$ represents the parameters of the classifier connected after the Encoder in the sentence prediction task, $n_j$ represents the relationship corresponding to the $j$-th sentence pair, $[\mathrm{IsNext}, \mathrm{NotNext}]$ represents the two relationship kinds, $\mathrm{IsNext}$ indicates that two adjacent sentences are related, $\mathrm{NotNext}$ indicates that two adjacent sentences are not related, and $p(n = n_j \mid \theta, \theta_2)$ represents the corresponding score probability.

Thus, for the Bert model, the total loss function of the joint learning of the two tasks is expressed as:

$$L(\theta, \theta_1, \theta_2) = L_1(\theta, \theta_1) + L_2(\theta, \theta_2)$$

wherein $L(\theta, \theta_1, \theta_2)$ represents the total loss function of the Bert model.
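A hedged sketch of this two-part objective with random tensors standing in for real model outputs; the vocabulary size 21128 matches the common bert-base-chinese checkpoint and is an assumption here:

```python
# Joint loss L = L1 (masked-LM cross-entropy) + L2 (next-sentence prediction).
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch = 21128, 16, 2

mlm_logits = torch.randn(batch, seq_len, vocab_size)   # MLM head output
mlm_labels = torch.randint(0, vocab_size, (batch, seq_len))
mlm_labels[:, 4:] = -100               # only masked positions contribute to L1

nsp_logits = torch.randn(batch, 2)     # IsNext / NotNext classifier output
nsp_labels = torch.tensor([0, 1])      # 0 = IsNext, 1 = NotNext

loss_mlm = F.cross_entropy(mlm_logits.view(-1, vocab_size),
                           mlm_labels.view(-1), ignore_index=-100)
loss_nsp = F.cross_entropy(nsp_logits, nsp_labels)
total_loss = loss_mlm + loss_nsp       # L(theta, theta_1, theta_2) = L1 + L2
print(total_loss.item())
```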
And S6, judging whether the character of the target character exists in the predicted character set.
It can be understood that if the character of the target word exists in the predicted character set, it is determined that the target word does not need error correction, and then the other words except the target word are shielded and the flow of steps S1 to S6 is repeated.
And S7, if the character of the target character does not exist in the predicted character set, generating a candidate character set by combining a preset shape-similar-sound dictionary according to the character of the target character.
Specifically, according to the characters of the target word, a plurality of candidate characters are searched from the preset shape-and-sound-similarity dictionary by using an edit distance algorithm, and the candidate character set is generated.
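One way this lookup might be realized is sketched below; the Levenshtein routine is the standard dynamic-programming algorithm, while the tiny shape/sound dictionary and its pinyin-like keys are invented placeholders for illustration:

```python
# Hypothetical candidate generation for step S7 via edit distance over a
# preset shape-and-sound-similarity dictionary.
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance with a rolling one-row table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# keys are pinyin-like codes; values are characters similar in shape or sound
shape_sound_dict = {"di4": ["帝", "地", "弟"], "tian1": ["天", "田"]}

def candidate_chars(target_code: str, max_dist: int = 1) -> set:
    cands = set()
    for code, chars in shape_sound_dict.items():
        if edit_distance(target_code, code) <= max_dist:
            cands.update(chars)
    return cands

print(candidate_chars("di4"))  # {'帝', '地', '弟'}
```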
And S8, generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set.
S9, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result.
Wherein, the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
As a specific example, after the step of generating an error correction set according to the predicted character set and the candidate character set, the method further includes:
determining whether the error correction set is empty;
if the error correction set is empty, judging that the target word does not need error correction;
and if the error correction set is not empty, selecting one word in the error correction set to replace the target word.
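The selection among surviving candidates might be sketched as follows; the masked pseudo-log-probability used here (higher means more fluent, i.e. lower perplexity) is one reading of the confusion-score criterion above, and the helper names and example sets are assumptions:

```python
# Steps S8/S9: intersect the predicted and candidate sets, then score each
# repaired sentence with BERT by masking every character in turn.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()

def sentence_log_prob(sentence: str) -> float:
    """Sum of log P(w_i | context) over all characters of the sentence."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    total = 0.0
    for pos in range(1, ids.size(1) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        true_id = masked[0, pos].item()
        masked[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        total += torch.log_softmax(logits[0, pos], -1)[true_id].item()
    return total

predictions = {"帝", "天"}                 # predicted character set (example)
candidates = {"帝", "地"}                  # candidate character set (example)
correction_set = predictions & candidates  # the error correction set
template = "他是一位{}王"
if not correction_set:
    best = None                            # empty set: no correction needed
elif len(correction_set) == 1:
    best = next(iter(correction_set))
else:
    best = max(correction_set,
               key=lambda c: sentence_log_prob(template.format(c)))
print(best)
```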
In summary, according to the Chinese text error correction method provided in this embodiment, the original Chinese text is first divided into paragraphs to obtain a target paragraph, from which a sentence vector is formed; the sentence vector is then input into a Bert model, whose strong reading-comprehension capability allows the sentence meaning to be fully understood, thereby addressing the problems of numerous error types and long-distance dependence.
Referring to fig. 2, the present invention further provides a chinese text error correction system, wherein the system includes:
the target shielding module is used for acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding target words in the target paragraph by using shielding marks to obtain a shielded target paragraph;
the target coding module is used for coding the shielding target paragraph to form a sentence vector;
the coding prediction module is used for inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating the score of each predicted character code;
the score sorting module is used for sorting the score values of all the predicted character codes according to the order of the score values from high to low and acquiring the preset number of the predicted character codes which are sorted in the front;
the encoding and decoding module is used for decoding the predicted character codes with the preset number which are ranked in the front to obtain the corresponding predicted characters with the preset number and form a predicted character set;
the character judgment module is used for judging whether the character of the target character exists in the predicted character set or not;
the first generation module is used for generating a candidate character set in combination with a preset shape-similarity dictionary according to the characters of the target characters if the characters of the target characters do not exist in the predicted character set;
the second generation module is used for generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
the target replacement module is used for, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
The invention also proposes a readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the Chinese text error correction method as described above.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following technologies, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims (10)

1. A Chinese text error correction method is characterized by comprising the following steps:
step one, acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding target words in the target paragraph by using a shielding mark to obtain a shielded target paragraph;
step two, encoding the occlusion target paragraph to form a sentence vector;
inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating a score of each predicted character code;
sorting score values of all the predictive character codes according to the order of the score values from high to low, and acquiring a preset number of predictive character codes which are sorted in the front;
decoding the predictive character codes with the preset number and the front sequence to obtain the corresponding predictive characters with the preset number and form a predictive character set;
step six, judging whether the character of the target character exists in the predicted character set or not;
step seven, if the character of the target character does not exist in the predicted character set, generating a candidate character set by combining a preset shape-similar-sound dictionary according to the character of the target character;
step eight, generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
step nine, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
2. The method for correcting errors in chinese text according to claim 1, wherein in the second step, the step of encoding the occlusion target paragraphs to form sentence vectors specifically comprises:
coding the occlusion target paragraph by using a token function to form a sentence vector, and during coding, carrying out attention labeling on each token sequence;
the sentence vector $E$ is expressed as:

$$E = \mathrm{Encoder}(T), \qquad T = [t_1, t_2, \cdots, t_i, \cdots, t_N]$$

wherein $\mathrm{Encoder}(\cdot)$ represents the encoder operation, $t_i$ represents the token value of the $i$-th word $x_i$, $t_N$ represents the token value of the $N$-th word $x_N$, and $N$ represents the total number of words.
3. The Chinese text error correction method according to claim 2, wherein in the third step, in the step of inputting the sentence vector into the Bert model and predicting a plurality of predicted characters at the position of the occlusion mark according to the context of the target paragraph, the preceding and following language features are extracted by a self-attention mechanism, and the corresponding formulas are:

$$Q = W_Q E, \qquad K = W_K E, \qquad V = W_V E$$

wherein $Q$ represents the first feature vector of the Attention weight, $K$ represents the second feature vector of the Attention weight, $V$ represents the vector of input features, $W_Q$ represents the weight of the first feature vector, $W_K$ represents the weight of the second feature vector, $W_V$ represents the weight of the vector of input features, and $E$ represents the sentence vector.
4. The Chinese text error correction method according to claim 3, wherein in the third step, $Q$, $K$ and $V$ are linearly transformed using multiple sets of parameters to obtain the result of the self-attention mechanism, and the corresponding formulas are:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \cdots, \mathrm{head}_h) W^O$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

wherein $\mathrm{MultiHead}(Q, K, V)$ represents the multi-head self-attention operation on $Q$, $K$ and $V$, $\mathrm{Concat}$ represents the merging of the multiple attention heads, $W^O$ represents the output weight, $\mathrm{head}_i$ represents the attention head corresponding to the $i$-th group, $W_i^Q$ represents the weight of the first feature vector of the $i$-th group, $W_i^K$ represents the weight of the second feature vector of the $i$-th group, $W_i^V$ represents the weight of the vector of input features of the $i$-th group, and $\mathrm{Attention}$ represents the corresponding attention function.
5. The Chinese text error correction method according to claim 4, wherein in the third step, in the step of calculating the score value of each predicted character code, the score value of the predicted character code is calculated using the following formula:

$$s = \mathrm{softmax}\big(h_{[\mathrm{MASK}]} W_e^{\mathsf T}\big)$$

wherein $s$ represents the score value of the predicted character code, $h_{[\mathrm{MASK}]}$ represents the encoder output at the occluded position, $W_e$ represents the output word-embedding matrix, and $\mathsf{T}$ represents the transpose operation.
6. The chinese text error correction method of claim 1, wherein after the step of determining whether the character of the target word is present in the predicted character set, the method further comprises:
if the character of the target character exists in the predicted character set, judging that the target character does not need error correction;
after the step of generating an error correction set according to the predicted character set and the candidate character set, the method further comprises:
determining whether the error correction set is empty;
if the error correction set is empty, judging that the target word does not need error correction;
and if the error correction set is not empty, selecting one word in the error correction set to replace the target word.
7. The Chinese text error correction method according to claim 1, wherein the step of generating a candidate character set in combination with a preset shape-and-sound-similarity dictionary according to the characters of the target word specifically comprises:
searching a plurality of candidate characters from the preset shape-and-sound-similarity dictionary by using an edit distance algorithm according to the characters of the target word, and generating the candidate character set.
8. The Chinese text error correction method according to claim 1, wherein the Bert model includes a first-part loss function and a second-part loss function, and the first-part loss function and the second-part loss function form a total loss function;
the first-part loss function is expressed as:

$$L_1(\theta, \theta_1) = -\sum_{i=1}^{M} \log p(m = m_i \mid \theta, \theta_1), \quad m_i \in [1, 2, \cdots, |V|]$$

wherein $L_1(\theta, \theta_1)$ represents the first-part loss function of the Bert model, $M$ represents the maximum number of words in the set formed by the masked words, $\theta$ represents the parameters of the Encoder part of the Bert model, $\theta_1$ represents the parameters of the output layer connected to the Encoder in the Mask-LM task, $|V|$ represents the size of the dictionary, $m_i$ represents the $i$-th word in the word set, and $p(m = m_i \mid \theta, \theta_1)$ represents the score probability corresponding to the $i$-th word;
the second-part loss function is correspondingly expressed as:

$$L_2(\theta, \theta_2) = -\sum_{j=1}^{N} \log p(n = n_j \mid \theta, \theta_2), \quad n_j \in [\mathrm{IsNext}, \mathrm{NotNext}]$$

wherein $L_2(\theta, \theta_2)$ represents the second-part loss function of the Bert model, $\theta_2$ represents the parameters of the classifier connected after the Encoder in the sentence prediction task, $n_j$ represents the relationship corresponding to the $j$-th sentence pair, $[\mathrm{IsNext}, \mathrm{NotNext}]$ represents the two relationship kinds, $\mathrm{IsNext}$ indicates that two adjacent sentences are related, $\mathrm{NotNext}$ indicates that two adjacent sentences are not related, and $p(n = n_j \mid \theta, \theta_2)$ represents the corresponding score probability;
the total loss function is expressed as:

$$L(\theta, \theta_1, \theta_2) = L_1(\theta, \theta_1) + L_2(\theta, \theta_2)$$

wherein $L(\theta, \theta_1, \theta_2)$ represents the total loss function of the Bert model.
9. A chinese text correction system, the system comprising:
the target shielding module is used for acquiring an input original Chinese text, performing paragraph division on the original Chinese text to obtain a target paragraph, and shielding target words in the target paragraph by using shielding marks to obtain a shielded target paragraph;
the target coding module is used for coding the shielding target paragraph to form a sentence vector;
the coding prediction module is used for inputting the sentence vector into a Bert model, predicting a plurality of predicted character codes at the position of an occlusion mark according to the context of the target paragraph, and calculating the score of each predicted character code;
the score sorting module is used for sorting the score values of all the predicted character codes according to the order of the score values from high to low and acquiring the preset number of the predicted character codes which are sorted in the front;
the encoding and decoding module is used for decoding the predicted character codes with the preset number which are ranked in the front to obtain the corresponding predicted characters with the preset number and form a predicted character set;
the character judgment module is used for judging whether the character of the target character exists in the predicted character set or not;
the first generation module is used for generating a candidate character set according to the characters of the target words and in combination with a preset shape-similarity dictionary if the characters of the target words do not exist in the predicted character set;
the second generation module is used for generating an error correction set according to the predicted character set and the candidate character set, wherein the error correction set is an intersection of the predicted character set and the candidate character set;
the target replacement module is used for, if only one word exists in the error correction set, selecting the only word in the error correction set to replace the target word; if a plurality of words exist in the error correction set, replacing the target word with each of them in turn, calculating the confusion score of the sentence after each replacement, and taking the sentence with the highest confusion score as the final error correction result;
wherein the calculation formula of the confusion score is expressed as:

$$PPL(S) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 w_2 \cdots w_{i-1})}}$$

wherein $PPL(S)$ represents the confusion score, $S$ represents the current sentence, $N$ represents the total number of words in the sentence, $i$ represents the sequence number of a word in the sentence, $P(w_1 w_2 \cdots w_N)$ represents the probability of the first $N$ words, and $P(w_i \mid w_1 \cdots w_{i-1})$ represents the probability of the $i$-th word calculated from the preceding $i-1$ words.
10. A readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the chinese text error correction method of any one of claims 1 to 8.
CN202211118545.3A 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium Pending CN115204151A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118545.3A CN115204151A (en) 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211118545.3A CN115204151A (en) 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium

Publications (1)

Publication Number Publication Date
CN115204151A true CN115204151A (en) 2022-10-18

Family

ID=83571789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118545.3A Pending CN115204151A (en) 2022-09-15 2022-09-15 Chinese text error correction method, system and readable storage medium

Country Status (1)

Country Link
CN (1) CN115204151A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101010A (en) * 2020-11-23 2020-12-18 中博信息技术研究院有限公司 Telecom industry OA office automation manuscript auditing method based on BERT
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
US20220028371A1 (en) * 2020-07-27 2022-01-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing speech dialogues
CN114611494A (en) * 2022-03-17 2022-06-10 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN114863429A (en) * 2022-03-16 2022-08-05 来也科技(北京)有限公司 Text error correction method and training method based on RPA and AI and related equipment thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220028371A1 (en) * 2020-07-27 2022-01-27 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for processing speech dialogues
CN112101010A (en) * 2020-11-23 2020-12-18 中博信息技术研究院有限公司 Telecom industry OA office automation manuscript auditing method based on BERT
CN113076739A (en) * 2021-04-09 2021-07-06 厦门快商通科技股份有限公司 Method and system for realizing cross-domain Chinese text error correction
CN114863429A (en) * 2022-03-16 2022-08-05 来也科技(北京)有限公司 Text error correction method and training method based on RPA and AI and related equipment thereof
CN114611494A (en) * 2022-03-17 2022-06-10 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHANGCHANG ZENG,ET AL.: "Analyzing the Effect of Masking Length Distribution of MLM: An Evaluation Framework and Case Study on Chinese MRC Datasets", 《WIRELESS COMMUNICATIONS AND MOBILE COMPUTING》 *
GUANGLI ZHU,ET AL.: "Causality Extraction Model Based on Two-stage GCN", 《RESEARCH SPUARE》 *
刘祥龙 等著: "《飞桨PaddlePaddle深度学习实战》", 31 August 2020, 机械工业出版社 *
华果才让 等: "面向汉藏机器翻译后处理的藏文虚词纠错模型", 《计算机仿真》 *
朱晨光 编著: "《机器阅读理解:算法与实践》", 30 April 2020, 机械工业出版社 *
高扬 著: "《智能摘要与深度学习》", 30 April 2019, 北京理工大学出版社 *

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN110196894B (en) Language model training method and language model prediction method
CN108595410B (en) Automatic correction method and device for handwritten composition
US7917350B2 (en) Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN107273358B (en) End-to-end English chapter structure automatic analysis method based on pipeline mode
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
Mabona et al. Neural generative rhetorical structure parsing
CN110427619B (en) Chinese text automatic proofreading method based on multi-channel fusion and reordering
Jemni et al. Out of vocabulary word detection and recovery in Arabic handwritten text recognition
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN114863429A (en) Text error correction method and training method based on RPA and AI and related equipment thereof
CN114564912B (en) Intelligent document format checking and correcting method and system
Schaback et al. Multi-level feature extraction for spelling correction
CN112214994B (en) Word segmentation method, device and equipment based on multi-level dictionary and readable storage medium
JP5097802B2 (en) Japanese automatic recommendation system and method using romaji conversion
CN116956946B (en) Machine translation text fine granularity error type identification and positioning method
CN114970554B (en) Document checking method based on natural language processing
CN115204151A (en) Chinese text error correction method, system and readable storage medium
CN114925175A (en) Abstract generation method and device based on artificial intelligence, computer equipment and medium
CN115310432A (en) Wrongly written character detection and correction method
CN115203206A (en) Data content searching method and device, computer equipment and readable storage medium
CN114896973A (en) Text processing method and device and electronic equipment
Mohapatra et al. Spell checker for OCR
Shitaoka et al. Dependency structure analysis and sentence boundary detection in spontaneous Japanese

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221018