CN115081430A - Chinese spelling error detection and correction method and device, electronic equipment and storage medium

Info

Publication number: CN115081430A
Application number: CN202210576165.8A
Authority: CN
Inventors: 张家俊; 李鑫; 赵阳; 宗成庆
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2022-05-24
Filing date: 2022-05-24
Publication date: 2022-09-20

Abstract

The invention provides a Chinese spelling error detection and correction method and device, an electronic device and a storage medium, belonging to the technical field of natural language processing. The method comprises the following steps: inputting a Chinese character input sequence into a contrastive learning model to obtain the similar character vector, output by the contrastive learning model, corresponding to each Chinese character in the Chinese character input sequence; detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters; coding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence; and correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and types of the wrong Chinese characters, and the coding vector to obtain an optimal corrected text. By exploiting the character-sound similarity relation and the character-shape similarity relation of each Chinese character, the method detects and corrects wrong Chinese characters in the Chinese character input sequence, improves the accuracy of detecting and correcting complex similar-character errors, and improves the quality of Chinese spelling correction.

Description

Chinese spelling error detection and correction method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method and a device for detecting and correcting errors of Chinese spelling, electronic equipment and a storage medium.

Background

Chinese spelling error correction is an active research problem in the field of natural language processing. It aims to identify misspelled words, disordered words and the like in Chinese text and to revise the text containing errors into correct text.

In the related art, most research on Chinese spelling error correction focuses on fusing prior knowledge such as character pronunciation and character shape. The common practice is to encode the character pronunciation and the character shape with an encoder, and then to detect and correct spelling errors in the text by jointly using the semantic, pronunciation and shape information of the Chinese characters.

However, the related art treats the pronunciation and shape information merely as independent features of a Chinese character, uses such prior knowledge mostly at the error detection end during fusion, and does not consider the similarity relations between different Chinese characters in pronunciation and shape during error detection and correction. As a result, errors are difficult to identify efficiently in scenarios involving complex similar-character errors, and the decoding candidates of a wrong character are difficult to select efficiently during correction.

Disclosure of Invention

The invention provides a Chinese spelling error detection and correction method and device, electronic equipment and a storage medium, which address the defect in the prior art that similar-character errors are difficult to identify and correct efficiently, and which realize the detection and correction of complex similar-character errors.

The invention provides a Chinese spelling error detection and correction method, which comprises the following steps:

inputting a Chinese character input sequence into a contrastive learning model to obtain similar character vectors, output by the contrastive learning model, corresponding to the Chinese characters in the Chinese character input sequence; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vector is used for representing the character-sound similarity relation of each Chinese character; the shape-like vector is used for representing the character-shape similarity relation of each Chinese character; the contrastive learning model is obtained by training on sample Chinese character triplets;

detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters;

coding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence;

and correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vector, the positions and the types of the wrong Chinese characters and the coding vector to obtain an optimal corrected text.

According to the Chinese spelling error detection and correction method provided by the invention, the method for detecting the wrong Chinese character in the Chinese character input sequence based on the similar character vector and obtaining the position and the type of the wrong Chinese character in the Chinese character input sequence comprises the following steps:

splicing the semantic vector, the sound-like vector and the shape-like vector of each Chinese character in the Chinese character input sequence and the full sentence vector corresponding to the Chinese character input sequence to obtain a spliced vector;

calculating the sound-like gating value and the shape-like gating value of each Chinese character based on the splicing vector;

and determining the position and the type of the wrong Chinese character in the Chinese character input sequence based on the sound-like gating value and the shape-like gating value.

According to the Chinese spelling error detection and correction method provided by the invention, the position and the type of the wrong Chinese character in the Chinese character input sequence are determined based on the sound-like gating value and the shape-like gating value, and the method comprises the following steps:

taking the sound-like gating value and the shape-like gating value as the weight values of the sound-like vector and the shape-like vector respectively;

carrying out weighted summation on the semantic vector, the sound-like vector and the shape-like vector to obtain a fusion vector;

and determining the position and the type of the wrong Chinese character in the Chinese character input sequence based on the fusion vector.
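The gating and fusion steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the sigmoid activation, the single fully connected unit per gate, the fixed weight of 1 on the semantic vector, and all names (`gate`, `fuse`, `w_p`, `b_p`, `w_g`, `b_g`) are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(weights, bias, spliced):
    """Scalar gating value: one fully connected unit followed by a sigmoid (assumed form)."""
    return sigmoid(sum(w * x for w, x in zip(weights, spliced)) + bias)

def fuse(e_s, e_p, e_g, e_cls, w_p, b_p, w_g, b_g):
    """Splice the four vectors, compute both gates, and return the fused vector."""
    spliced = e_s + e_p + e_g + e_cls      # list concatenation plays the role of splicing
    g_p = gate(w_p, b_p, spliced)          # sound-like gating value
    g_g = gate(w_g, b_g, spliced)          # shape-like gating value
    # Gates weight the sound-like and shape-like vectors; the semantic vector
    # is added with weight 1 (an assumption of this sketch).
    fused = [s + g_p * p + g_g * g for s, p, g in zip(e_s, e_p, e_g)]
    return g_p, g_g, fused

# Toy two-dimensional vectors and all-zero gate parameters (hypothetical values).
e_s, e_p, e_g, e_cls = [1.0, 2.0], [2.0, 0.0], [0.0, 4.0], [0.0, 0.0]
zero_weights = [0.0] * 8
g_p, g_g, fused = fuse(e_s, e_p, e_g, e_cls, zero_weights, 0.0, zero_weights, 0.0)
```

A downstream classifier (not shown) would map the fused vector of each position to a type such as correct, sound-like error or shape-like error.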

According to the Chinese spelling error detection and correction method provided by the invention, the method for correcting the wrong Chinese character in the Chinese character input sequence based on the similar character vector, the position and the type of the wrong Chinese character in the Chinese character input sequence and the coding vector to obtain the optimal corrected text comprises the following steps:

decoding the coding vector based on the position and the type of the wrong Chinese character, and calculating to obtain the first K semantic candidates of the wrong Chinese character;

determining the sound-like vector and the shape-like vector of the wrong Chinese character based on the position and the type of the wrong Chinese character;

determining the top M similar candidates based on the K semantic candidates, the sound-like vector and the shape-like vector;

determining optimal path parameters based on the vectors respectively corresponding to the M similar candidates, the vector corresponding to the wrong Chinese character and the vector corresponding to the Chinese character at the adjacent position of the wrong Chinese character;

and determining the optimal corrected text based on the optimal path parameters.

According to the Chinese spelling error detection and correction method provided by the invention, determining the top M similar candidates based on the K semantic candidates, the sound-like vector and the shape-like vector comprises the following steps:

calculating the similarity between each of the K semantic candidates and the sound-like vector and the shape-like vector respectively;

and sorting the similarities, and selecting the similar candidates corresponding to the top M similarities.
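The ranking step above can be sketched as follows. The sketch assumes that each semantic candidate is represented by a vector in the same space as the sound-like (or shape-like) vector of the wrong Chinese character and that cosine similarity is the similarity measure; the function name `top_m_similar` and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_m_similar(candidates, reference, m):
    """Rank (label, vector) candidates by cosine similarity to `reference`
    and return the labels of the M most similar ones."""
    scored = [(cosine(vec, reference), label) for label, vec in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:m]]

# K = 3 hypothetical semantic candidates and a reference sound-like vector.
candidates = [("A", [1.0, 0.0]), ("B", [0.0, 1.0]), ("C", [0.7, 0.7])]
reference = [1.0, 0.1]
best = top_m_similar(candidates, reference, m=2)
```

The same function can be called once with the sound-like vector and once with the shape-like vector, depending on the detected error type.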

According to the error detection and correction method for Chinese spelling provided by the invention, the optimal path parameters are determined based on the vectors respectively corresponding to the M similar candidates, the vector corresponding to the wrong Chinese character and the vector corresponding to the Chinese character at the adjacent position of the wrong Chinese character, and the method comprises the following steps:

respectively calculating the dependency relationship values between the vectors respectively corresponding to the M similar candidates and the vectors corresponding to the wrong Chinese characters based on the vectors respectively corresponding to the M similar candidates, the vectors corresponding to the wrong Chinese characters and the vectors corresponding to the Chinese characters at the adjacent positions of the wrong Chinese characters;

and selecting the maximum dependency relationship value as the optimal path parameter based on the dependency relationship value.

The invention also provides a Chinese spelling error detection and correction device, comprising:

the contrastive learning module is used for inputting a Chinese character input sequence into a contrastive learning model to obtain similar character vectors, output by the contrastive learning model, corresponding to the Chinese characters in the Chinese character input sequence; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vector is used for representing the character-sound similarity relation of each Chinese character; the shape-like vector is used for representing the character-shape similarity relation of each Chinese character; the contrastive learning model is obtained by training on sample Chinese character triplets;

the error detection module is used for detecting wrong Chinese characters in a Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters;

the coding module is used for coding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence;

and the error correction module is used for correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and the types of the wrong Chinese characters and the coding vectors to obtain an optimal corrected text.

The present invention also provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the Chinese spelling error detection and correction method as described in any of the above.

The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the Chinese spelling error detection and correction method as described in any of the above.

The present invention also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the Chinese spelling error detection and correction method as described in any of the above.

With the Chinese spelling error detection and correction method, device, electronic equipment and storage medium provided by the invention, the character-sound similarity relation and the character-shape similarity relation of each Chinese character in the Chinese character input sequence are obtained through the contrastive learning model and are then fused into the error detection process of the Chinese character input sequence.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flow chart of the error detection and correction method for Chinese spelling provided by the present invention;

FIG. 2 is a schematic structural diagram of the contrastive learning model provided by the present invention;

FIG. 3 is a second schematic flow chart of the method for error detection and correction in Chinese spelling according to the present invention;

FIG. 4 is a flow chart of the Chinese spelling error detection provided by the present invention;

FIG. 5 is a third schematic flow chart of the method for error detection and correction in Chinese spelling according to the present invention;

FIG. 6 is a flow chart of the Chinese spell correction provided by the present invention;

FIG. 7 is a schematic structural diagram of an error detection and correction apparatus for Chinese spelling provided by the present invention;

FIG. 8 is a schematic structural diagram of an electronic device provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.

The Chinese spelling error detection and correction method according to the present invention is described in detail below with reference to the accompanying drawings.

The invention provides a Chinese spelling error detection and correction method that can be applied to scenarios in which erroneous text is corrected. In the method, a Chinese character input sequence is input into a contrastive learning model to obtain similar character vectors, output by the contrastive learning model, corresponding to the Chinese characters in the Chinese character input sequence; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vector is used for representing the character-sound similarity relation of each Chinese character; the shape-like vector is used for representing the character-shape similarity relation of each Chinese character; the contrastive learning model is obtained by training on sample Chinese character triplets. Wrong Chinese characters in the Chinese character input sequence are detected based on the similar character vectors to obtain the positions and types of the wrong Chinese characters; the Chinese character input sequence is coded to obtain a coding vector corresponding to the Chinese character input sequence; and the wrong Chinese characters in the Chinese character input sequence are corrected based on the similar character vectors, the positions and types of the wrong Chinese characters, and the coding vector to obtain an optimal corrected text. Because the character-sound similarity relation and the character-shape similarity relation are fused into the error detection process of the Chinese character input sequence, and the wrong Chinese characters are corrected based on the error detection result, the accuracy of detecting and correcting complex similar-character errors is improved, and the quality of Chinese spelling error correction is improved.

The method of detecting and correcting errors in Chinese spelling is described below with reference to FIGS. 1-6.

FIG. 1 is a schematic flow chart of the Chinese spelling error detection and correction method provided by the present invention. As shown in FIG. 1, the method includes steps 101-104, wherein:

Step 101, inputting a Chinese character input sequence into a contrastive learning model to obtain a similar character vector, output by the contrastive learning model, corresponding to each Chinese character in the Chinese character input sequence; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vector is used for representing the character-sound similarity relation of each Chinese character; the shape-like vector is used for representing the character-shape similarity relation of each Chinese character; the contrastive learning model is obtained by training on sample Chinese character triplets.

It should be noted that the Chinese spelling error detection and correction method provided by the present invention can be applied to scenarios in which erroneous text is corrected. The executing entity of the method can be a Chinese spelling error detection and correction device, such as an electronic device, or a control module in the Chinese spelling error detection and correction device for executing the Chinese spelling error detection and correction method.

Optionally, the contrastive learning model is obtained by training on sample Chinese character triplets. Before the error detection and correction process is described in detail, the training process of the contrastive learning model is briefly described; it includes the following steps:

Step 1): extracting a similar Chinese character confusion set from the Chinese spelling error correction data set.

Specifically, the Chinese spelling error correction data set includes erroneous texts, the erroneous characters and their positions in each erroneous text, and the corrected characters corresponding to the erroneous characters. For example, assume an erroneous text "what is nothing" whose erroneous characters are located at positions 6 and 7 of the text: the corrected character corresponding to the erroneous character "ji" at position 6 is "ji", the corrected character corresponding to the erroneous character at position 7 is "qu", and the correction result corresponding to the erroneous text is "nothing".

In practice, the erroneous characters and their corresponding corrected characters are extracted from the erroneous text by selecting key-value pairs with a regular expression or a dictionary, and are stored in dictionary form, with each erroneous character as a key and its corrected character as the value. For example, dictionary = { ji: recording: get }.
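The extraction described above can be sketched as follows. The record layout (an erroneous text plus a list of position and corrected-character pairs), the 1-based positions, the function name `build_confusion_dict` and the sample sentences are all assumptions made for illustration; they are not taken from the data set used in the patent.

```python
from collections import defaultdict

def build_confusion_dict(records):
    """Map each erroneous character to the sorted list of corrections seen for it.

    Each record is (erroneous_text, [(position, corrected_char), ...]),
    with positions assumed to be 1-based.
    """
    confusion = defaultdict(set)
    for text, errors in records:
        for position, corrected_char in errors:
            wrong_char = text[position - 1]
            confusion[wrong_char].add(corrected_char)
    return {wrong: sorted(fixes) for wrong, fixes in confusion.items()}

# Hypothetical records: each sentence contains one sound-like error.
records = [
    ("我喜欢吃平果", [(5, "苹")]),
    ("天汽很好", [(2, "气")]),
]
confusion = build_confusion_dict(records)
```

Storing a list of corrections per key, rather than a single value, keeps every correction when the same erroneous character appears with different corrections in different texts.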

Step 2): expanding the extracted similar Chinese character confusion set by taking the keyboard edit distance into account, and constructing a Chinese character similar data set based on the expanded similar Chinese character confusion set.

Specifically, based on the QWERTY keyboard layout, the keyboard distance and the legal combinations of initials and finals within that keyboard distance are taken into account; the erroneous characters and corrected characters in the extracted similar Chinese character confusion set are expanded, and the near-sound characters corresponding to each erroneous character and each corrected character in the confusion set are obtained.

For example, for the pinyin "han", the letter H on the QWERTY keyboard layout is adjacent to G, Y, U, J, N and B. Except for U, the adjacent letters are initials; combining them with the final "an" generates several near-sound syllables that have different initials but the same final. These combinations of initials with the final "an" are regarded as sound confusions of "han", from which the near-sound characters corresponding to the erroneous characters and corrected characters in the similar Chinese character confusion set are obtained.
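The keyboard-based expansion in this example can be sketched as follows. Only the neighbors of "h" listed above are included, and the sets of initials and legal syllables are small illustrative subsets; a working system would need the full QWERTY adjacency map and the full pinyin syllable table. The function name `expand_by_keyboard` is hypothetical.

```python
# Keys adjacent to "h" on a QWERTY layout, as listed in the example.
QWERTY_NEIGHBORS = {"h": ["g", "y", "u", "j", "n", "b"]}

# Illustrative subsets only (assumption): letters treated as initials and
# the legal pinyin syllables ending in "an" that matter for this example.
INITIALS = {"b", "g", "h", "j", "n", "y"}
LEGAL_SYLLABLES = {"ban", "gan", "han", "nan", "yan"}

def expand_by_keyboard(initial, final):
    """Return legal syllables formed by swapping the initial for an adjacent key."""
    confusions = []
    for key in QWERTY_NEIGHBORS.get(initial, []):
        if key not in INITIALS:
            continue                      # "u" is a final, not an initial
        syllable = key + final
        if syllable in LEGAL_SYLLABLES:   # drop illegal combinations such as "jan"
            confusions.append(syllable)
    return confusions

han_confusions = expand_by_keyboard("h", "an")
```

The legality check reflects the requirement that only legal combinations of initials and finals are kept: "j" is adjacent to "h" but "jan" is not a pinyin syllable, so it is filtered out.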

In practice, a Chinese character similar data set is constructed from the erroneous characters, the corrected characters, and the near-sound characters respectively corresponding to them in the expanded similar Chinese character confusion set. The Chinese character similar data set comprises Chinese character triplets, and each Chinese character triplet is a sound-like triplet or a shape-like triplet. For example, in a sound-like triplet made up of an anchor unit "ji-ji", a second unit "ji-ji" and a unit "tear-chai", the second unit "ji-ji" and the unit "tear-chai" are respectively the similar-sound Chinese character unit and the non-similar-sound Chinese character unit of the anchor Chinese character.

In the training process of the contrastive learning model, the Chinese character triplets in the Chinese character similar data set are encoded into vectors, and a supervised contrastive learning model is used to explicitly model the similarity relations within the Chinese character triplets.

Specifically, a pre-trained language model, such as a Bidirectional Encoder Representations from Transformers (BERT) model, is used to encode the Chinese character triplets in the Chinese character similar data set into vectors, giving the vector triplet corresponding to each Chinese character triplet. In this vector triplet, E_a is the anchor Chinese character unit vector, and the other two vectors are the similar Chinese character unit vector and the non-similar Chinese character unit vector, respectively.

The vector triplet is input into the contrastive learning model. In the high-dimensional space, the contrastive learning model pulls closer the cosine distance between the anchor Chinese character unit vector and the similar Chinese character unit vector, and pushes apart the cosine distance between the anchor Chinese character unit vector and the non-similar Chinese character unit vector; that is, it draws the anchor Chinese character unit toward the similar Chinese character unit and pushes it away from the non-similar Chinese character unit.

In practice, the contrastive learning model uses an objective function L_a defined over the following quantities: sim is the cosine similarity between vectors; tau is the temperature coefficient; N is the number of samples in a mini-batch used in training; and, for each batch of training samples, the similar Chinese character unit vectors and the non-similar Chinese character unit vectors corresponding to the anchor Chinese character unit vectors. A gradient descent method is applied to L_a: gradients are computed and back-propagated to optimize the cosine distance between the anchor Chinese character unit vector and the similar Chinese character unit vector corresponding to each Chinese character triplet, and the cosine distance between the anchor Chinese character unit vector and the non-similar Chinese character unit vector.
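One common in-batch contrastive objective that is consistent with the quantities listed above (cosine similarity sim, temperature coefficient tau, mini-batch size N, and the similar and non-similar unit vectors of each anchor) is sketched below. It is offered as an assumption for illustration only: the superscripts + and -, the batch indices i and j, and the averaging over the batch are notation introduced here, and the patent's own formula may differ.

```latex
L_a = -\frac{1}{N}\sum_{i=1}^{N}\log
\frac{\exp\left(\mathrm{sim}(E_{a_i}, E_{a_i}^{+})/\tau\right)}
{\sum_{j=1}^{N}\left[\exp\left(\mathrm{sim}(E_{a_i}, E_{a_j}^{+})/\tau\right)
+\exp\left(\mathrm{sim}(E_{a_i}, E_{a_j}^{-})/\tau\right)\right]}
```

Minimizing this quantity raises the similarity between each anchor and its own similar unit relative to every other unit in the batch, which matches the pull-together and push-apart behavior described for the model.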

For example, when a sound-like triplet made up of the anchor unit "ji-ji", its similar-sound unit "ji-ji" and the unit "tear-chai" is input into the contrastive learning model, the model continuously pulls the anchor unit "ji-ji" and its similar-sound unit "ji-ji" closer while pushing the anchor unit "ji-ji" and the unit "tear-chai" apart, thereby continuously optimizing the Chinese character unit representation of each Chinese character in the sound-like triplet; finally, the character-sound similarity relation of each Chinese character is explicitly merged into the Chinese character representation.

It should be noted that, when the Chinese character triplets are shape-like triplets, the contrastive learning model merges the character-shape similarity relation of each Chinese character into the Chinese character representation.
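The training signal described above can be illustrated numerically with a small sketch. It implements an in-batch contrastive loss over (anchor, similar, non-similar) vector triplets using cosine similarity and a temperature coefficient; this specific form of the loss, the default `tau` value, the function names and the toy vectors are assumptions, since the text only names the quantities involved.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchors, similars, dissimilars, tau=0.1):
    """Mean loss over a mini-batch of N triplets.

    For anchor i, its own similar unit is the positive; the similar and
    non-similar units of every triplet in the batch form the denominator.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        numerator = math.exp(cosine(anchors[i], similars[i]) / tau)
        denominator = 0.0
        for j in range(n):
            denominator += math.exp(cosine(anchors[i], similars[j]) / tau)
            denominator += math.exp(cosine(anchors[i], dissimilars[j]) / tau)
        total += -math.log(numerator / denominator)
    return total / n

# Toy batch of N = 2 triplets (hypothetical two-dimensional vectors).
anchors = [[1.0, 0.0], [0.0, 1.0]]
similars = [[0.9, 0.1], [0.1, 0.9]]
dissimilars = [[-1.0, 0.0], [0.0, -1.0]]

well_separated = contrastive_loss(anchors, similars, dissimilars)
swapped = contrastive_loss(anchors, dissimilars, similars)
```

The loss is lower when each anchor lies close to its similar unit than when the similar and non-similar roles are swapped, which is the direction in which gradient descent moves the representations.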

Optionally, a Chinese character input sequence is input into the trained contrastive learning model to obtain the similar character vector of each Chinese character output by the model. The similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vector is used for representing the character-sound similarity relation of each Chinese character, and the shape-like vector is used for representing the character-shape similarity relation of each Chinese character. The Chinese character input sequence can be a single Chinese character, a word or a sentence, and can take two forms, one exemplified by "Ji-ji" and the other by "Ji-Qi".

For example, when the Chinese character input sequence input into the trained contrastive learning model is "ji-ji", the model outputs the sound-like vector of "ji"; when the Chinese character input sequence is "Ji-Zhi Qi", the model outputs the shape-like vector of "Ji".

Step 102, detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters.

Specifically, the Chinese character input sequence is input into a pre-trained language model to obtain the full sentence vector corresponding to the Chinese character input sequence and the semantic vector of each Chinese character, both output by the pre-trained language model, wherein the semantic vector is used for expressing the semantic relation of each Chinese character. Error detection is then carried out on the Chinese character input sequence according to the sound-like vector and the shape-like vector of each Chinese character obtained from the contrastive learning model, together with the full sentence vector and the semantic vector of each Chinese character output by the pre-trained language model, so as to obtain the positions and types of the wrong Chinese characters in the Chinese character input sequence. The type comprises at least one of the following: correct Chinese character, sound-like Chinese character, and shape-like Chinese character.

Step 103, coding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence.

Specifically, an encoder is used to encode all Chinese characters in the Chinese character input sequence, obtaining a character vector for each Chinese character and thereby the coding vector corresponding to the Chinese character input sequence; the coding vector is the vector sequence formed by the character vectors of the Chinese characters in the Chinese character input sequence.

Step 104, correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and types of the wrong Chinese characters, and the coding vector to obtain an optimal corrected text.

Specifically, the wrong Chinese characters in the Chinese character input sequence are corrected according to the similar character vector (that is, the shape-like vector and the sound-like vector) of each Chinese character obtained from the contrastive learning model, the detected positions and types of the wrong Chinese characters in the Chinese character input sequence, and the coding vector produced by the encoder, so as to obtain the optimal corrected text.

According to the Chinese spelling error detection and correction method provided by the embodiment of the invention, the character-sound similarity relation and the character-shape similarity relation are fused into the error detection process of the Chinese character input sequence, and the wrong Chinese characters in the Chinese character input sequence are corrected based on the error detection result, so that the accuracy of detecting and correcting complex similar-character errors is improved, and the quality of Chinese spelling error correction is improved.

FIG. 2 is a schematic structural diagram of the contrastive learning model provided by the present invention. As shown in FIG. 2, the anchor Chinese character units "flat-ping", "hot-re" and "river-he" are each encoded by an encoder to obtain their corresponding coding vectors. The units that form sound-like triplets with these anchors are likewise encoded by the encoder: the similar-sound unit "apple-ping" and the non-similar-sound unit "eat-chi" for the anchor "flat-ping"; the similar-sound unit "irritate-re" and the non-similar-sound unit "cold-leng" for the anchor "hot-re"; and the similar-sound unit "what-he" and the non-similar-sound unit "per-mei" for the anchor "river-he". This yields the coding vectors respectively corresponding to "apple-ping", "eat-chi", "irritate-re", "cold-leng", "what-he" and "per-mei".

In practice, during training, the contrastive learning model continuously pulls closer the cosine distance between the anchor Chinese character unit "flat-ping" and its similar-sound Chinese character unit "apple-ping", and pushes apart the cosine distance between the anchor "flat-ping" and the non-similar-sound Chinese character units ("eat-chi", "irritate-re", "cold-leng", "what-he" and "per-mei"), so that Chinese character vectors containing the similarity information of the Chinese characters are obtained by training.
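How the push-away set for one anchor is assembled from a batch, as in the example above, can be sketched as follows. The function name `in_batch_negatives` and the tuple layout (anchor, similar unit, non-similar unit) are illustrative assumptions; the unit labels are the ones used in the example.

```python
# One batch of sound-like triplets: (anchor, similar unit, non-similar unit).
triplets = [
    ("flat-ping", "apple-ping", "eat-chi"),
    ("hot-re", "irritate-re", "cold-leng"),
    ("river-he", "what-he", "per-mei"),
]

def in_batch_negatives(batch, index):
    """Units pushed away from the anchor at `index`: its own non-similar unit
    plus the similar and non-similar units of every other triplet."""
    negatives = []
    for j, (_, similar, dissimilar) in enumerate(batch):
        if j != index:
            negatives.append(similar)
        negatives.append(dissimilar)
    return negatives

positive_for_ping = triplets[0][1]
negatives_for_ping = in_batch_negatives(triplets, 0)
```

Reusing the units of the other triplets as negatives means each batch of N triplets supplies 2N - 1 units to push away from every anchor without any extra sampling.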

FIG. 3 is a second schematic flow chart of the Chinese spelling error detection and correction method provided by the present invention. As shown in FIG. 3, the method includes steps 301-306, wherein:

Step 301, inputting a Chinese character input sequence into a contrastive learning model to obtain similar character vectors, output by the contrastive learning model, corresponding to the Chinese characters in the Chinese character input sequence; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vector is used for representing the character-sound similarity relation of each Chinese character; the shape-like vector is used for representing the character-shape similarity relation of each Chinese character; the contrastive learning model is obtained by training on sample Chinese character triplets.

For the description and explanation of step 301, reference may be made to the description and explanation of step 101 above, and the same technical effects can be achieved, and further description is omitted here to avoid repetition.

Step 302, the semantic vector, the sound-like vector and the shape-like vector of each Chinese character in the Chinese character input sequence and the full sentence vector corresponding to the Chinese character input sequence are spliced to obtain a spliced vector.

Specifically, the Chinese character input sequence is input into a pre-training language model, such as a BERT model, to obtain the semantic vector of each Chinese character in the Chinese character input sequence and the full-sentence CLS vector corresponding to the Chinese character input sequence. For example, for the Chinese character input sequence "all things are wrong", taking the wrong character "all things are good" as an example, "all things are good" is input into the pre-training language model to obtain the semantic vector $E_s$ corresponding to the wrong Chinese character "all things" and the full-sentence CLS vector $E_{cls}$.

In practice, the sound-like vector $E_p$ and the shape-like vector $E_g$ corresponding to each Chinese character are obtained by inputting the Chinese character input sequence into the contrast learning model. The semantic vector $E_s$, the sound-like vector $E_p$ and the shape-like vector $E_g$ of each Chinese character in the Chinese character input sequence and the full-sentence CLS vector $E_{cls}$ are then spliced to obtain a spliced vector.

Step 303, calculating the sound-like gating value and the shape-like gating value of each Chinese character based on the spliced vector.

Specifically, the spliced vector is input into a fully connected layer, and the sound-like gating value and the shape-like gating value of each Chinese character are calculated based on a gating mechanism, as shown in formula (2) and formula (3):

$$g_p = \sigma\left(W^p [E_s, E_p, E_g, E_{cls}] + b^p\right) \tag{2}$$

$$g_g = \sigma\left(W^g [E_s, E_p, E_g, E_{cls}] + b^g\right) \tag{3}$$

wherein $W^p$, $W^g$, $b^p$ and $b^g$ are all learnable parameters, $g_p$ is the sound-like gating value, $g_g$ is the shape-like gating value, and $\sigma$ is the activation function (sigmoid).
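Formulas (2) and (3) can be illustrated for a single Chinese character as follows. This is a minimal sketch under stated assumptions: each gate is taken to be a scalar (so each weight matrix reduces to one row), and the function names, dimensions and numbers are hypothetical rather than values from the patent.

```python
import math

def sigmoid(x):
    """Logistic activation used as sigma in formulas (2) and (3)."""
    return 1.0 / (1.0 + math.exp(-x))

def gate_value(weights, bias, e_s, e_p, e_g, e_cls):
    """Compute sigma(W [E_s, E_p, E_g, E_cls] + b) for one character.
    Assumption: W is a single row, so the gate is a scalar in (0, 1)."""
    spliced = e_s + e_p + e_g + e_cls  # list concatenation = splicing
    z = sum(w * x for w, x in zip(weights, spliced)) + bias
    return sigmoid(z)

# Toy 2-dimensional vectors for one character (made-up numbers).
e_s, e_p = [0.2, -0.1], [0.5, 0.3]
e_g, e_cls = [-0.4, 0.1], [0.0, 0.6]

w_p, b_p = [0.1] * 8, 0.0   # stand-ins for the learnable W^p, b^p
w_g, b_g = [0.0] * 8, 0.0   # stand-ins for the learnable W^g, b^g

g_p = gate_value(w_p, b_p, e_s, e_p, e_g, e_cls)  # sound-like gate
g_g = gate_value(w_g, b_g, e_s, e_p, e_g, e_cls)  # shape-like gate
```

In a trained model the weights would be learned jointly with the detector; here all-zero weights simply give a neutral gate of 0.5.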

Step 304, determining the position and the type of the wrong Chinese character in the Chinese character input sequence based on the sound-like gating value and the shape-like gating value.

Specifically, the position and the type of the wrong Chinese character in the Chinese character input sequence can be determined according to the calculated sound-like gating value and the shape-like gating value of each Chinese character.

Step 305, encoding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence.

Step 306, correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and the types of the wrong Chinese characters and the coding vectors to obtain an optimal correction text.

Optionally, for the description and explanation of the steps 305-306, reference may be made to the description and explanation of the steps 103-104, and the same technical effect can be achieved, and in order to avoid repetition, the description is not repeated here.

In the Chinese spelling error detection and correction method provided by the embodiment of the invention, the semantic vector, the sound-like vector and the shape-like vector of each Chinese character in the Chinese character input sequence are spliced with the full-sentence vector corresponding to the Chinese character input sequence; the sound-like gating value and the shape-like gating value of each Chinese character are calculated based on the spliced vector; and the position and the type of the wrong Chinese character in the Chinese character input sequence are determined. The character-sound similarity relation and the character-shape similarity relation are thereby fused into the error detection process of the Chinese character input sequence, and the wrong Chinese characters in the Chinese character input sequence are corrected based on the error detection result, which improves the accuracy of detecting and correcting complex similar-character errors and improves the correction quality of Chinese spelling error correction.

Optionally, the specific implementation process of step 304 includes the following steps:

step 1) using the sound-like gating value and the shape-like gating value as the weights of the sound-like vector and the shape-like vector, respectively;

step 2) performing weighted summation on the semantic vector, the sound-like vector and the shape-like vector to obtain a fusion vector;

step 3) determining the position and the type of the wrong Chinese character in the Chinese character input sequence based on the fusion vector.

Specifically, the calculated sound-like gating value and the calculated shape-like gating value of each Chinese character are respectively used as the weight of the sound-like vector and the shape-like vector of each Chinese character, then the sound-like vector of each Chinese character is multiplied by the sound-like gating value, the shape-like vector of each Chinese character is multiplied by the shape-like gating value, and finally the multiplication result of the sound-like vector of each Chinese character and the sound-like gating value, the multiplication result of the shape-like vector and the shape-like gating value and the semantic vector of each Chinese character are added to obtain a fusion vector.
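The weighted summation just described can be written out directly. The sketch below assumes scalar gating values and equal-dimension vectors; the function name `fuse` and the numbers are illustrative only.

```python
def fuse(e_s, e_p, e_g, g_p, g_g):
    """Fusion vector = E_s + g_p * E_p + g_g * E_g, element by element.
    Assumes the three vectors have the same dimension and the gates
    are scalars."""
    return [s + g_p * p + g_g * g for s, p, g in zip(e_s, e_p, e_g)]

# Toy vectors and gate values (made-up numbers).
fused = fuse(e_s=[1.0, 2.0], e_p=[2.0, 0.0], e_g=[0.0, 4.0],
             g_p=0.5, g_g=0.25)
# fused == [2.0, 3.0]
```

A gate close to 0 effectively switches the corresponding similarity vector off, so the fused representation then stays close to the purely semantic vector.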

In practice, the fusion vector is input into a gated fusion embedding layer to obtain the weighted and fused semantic vector, sound-like vector and shape-like vector corresponding to each Chinese character. These vectors are then input into a self-attention layer, which outputs the relation between the character vector of each Chinese character at the current position and the character vectors at the other positions. The output of the self-attention layer is then input into a classifier, which classifies each Chinese character and outputs a labelled classification result, from which the position and the type of the wrong Chinese character in the Chinese character input sequence are obtained. The type includes at least one of: correct Chinese character, sound-like wrong Chinese character and shape-like wrong Chinese character.

Fig. 4 is a schematic flow chart of the Chinese spelling error detection provided by the present invention. As shown in Fig. 4, the process of detecting the position and the type of the wrong Chinese character in the Chinese character input sequence is explained by taking the input (Input) 'deeply devoting the construction of first-class universities in the world', in which 'devoting' is the wrong Chinese character, as an example:

inputting 'deeply devoting the construction of first-class universities in the world' into the pre-training language model to obtain the semantic vector $E_s$ corresponding to the character 'devoting' and the full-sentence CLS vector $E_{cls}$; inputting the same sequence into the comparison learning model to obtain the sound-like vector $E_p$ and the shape-like vector $E_g$ corresponding to the character 'devoting'; splicing the semantic vector $E_s$, the sound-like vector $E_p$ and the shape-like vector $E_g$ corresponding to the character 'devoting' with the full-sentence CLS vector $E_{cls}$ corresponding to the Chinese character input sequence to obtain a spliced vector; then using the gating mechanism on the spliced vector to calculate the sound-like gating value $g_p$ corresponding to the sound-like vector of 'devoting' and the shape-like gating value $g_g$ corresponding to the shape-like vector of 'devoting'; and using the sound-like gating value $g_p$ as the weight of the sound-like vector and the shape-like gating value $g_g$ as the weight of the shape-like vector.

Further, the semantic vector $E_s$, the sound-like vector $E_p$ and the shape-like vector $E_g$ of the character 'devoting' are weighted and summed to obtain a fusion vector; the fusion vector is input into the gated fusion embedding layer (Gate Fused embedding), the output of the gated fusion embedding layer is input into the self-attention layer (Attention Layer), and the output of the self-attention layer is input into the classifier (Classifier) to obtain the classification result. The classification result is marked with labels (Labels), wherein O represents a correct Chinese character, G represents a shape-like error, and P represents a sound-like error.
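How a per-character O/P/G classification turns into positions and types of wrong Chinese characters can be sketched as below. The logits are made up, positions are 0-indexed by assumption, and the helper names `softmax` and `detect_errors` are hypothetical; the self-attention layer and the trained classifier that would produce the logits are not modelled.

```python
import math

LABELS = ["O", "P", "G"]  # correct, sound-like error, shape-like error

def softmax(logits):
    """Numerically stable softmax over one character's class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def detect_errors(per_char_logits):
    """Return (position, type) for every character not labelled 'O'."""
    errors = []
    for pos, logits in enumerate(per_char_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if LABELS[best] != "O":
            errors.append((pos, LABELS[best]))
    return errors

# Made-up classifier outputs for a four-character sequence.
logits = [[3.0, 0.0, 0.0],
          [0.1, 2.5, 0.2],
          [0.0, 0.0, 4.0],
          [1.0, 0.5, 0.2]]
errors = detect_errors(logits)
```

For the toy logits, the second character is flagged as a sound-like error and the third as a shape-like error, while the other two are treated as correct.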

Fig. 5 is a third schematic flow chart of the Chinese spelling error detection and correction method provided by the present invention. As shown in Fig. 5, the method includes steps 501 to 508, wherein:

step 501, inputting a Chinese character input sequence into a contrast learning model to obtain similar character vectors corresponding to each Chinese character in the Chinese character input sequence output by the contrast learning model; the similar word vectors comprise a sound-like vector and a shape-like vector; the sound-like vectors are used for representing the character sound-like relation of each Chinese character; the shape similarity vector is used for representing the shape similarity relation of each Chinese character; the comparison learning model is obtained based on sample Chinese character triple training.

Step 502, detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters.

Step 503, encoding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence.

For the description and explanation of steps 501 through 503, reference may be made to the description and explanation of steps 101 through 103, and the same technical effects can be achieved.

Step 504, decoding the coding vector based on the position and the type of the wrong Chinese character, and calculating the first K semantic candidates of the wrong Chinese character.

Specifically, according to the position and the type of the wrong Chinese character in the Chinese character input sequence, the coding vector corresponding to the Chinese character input sequence is input into a normalization function (Softmax), and Softmax decodes the coding vector. The character-sound similarity between the coding vector and the sound-like vectors of the Chinese characters in the word list, and the character-shape similarity between the coding vector and the shape-like vectors of the Chinese characters in the word list, are calculated respectively, and the first K sound-like candidates and the first K shape-like candidates are selected as the first K semantic candidates.
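Selecting the first K candidates from a Softmax over the word list can be sketched as follows. The dot-product score, the placeholder character names "c1" to "c4" and the function names are assumptions made for illustration; the patent does not specify the scoring function in this passage.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_candidates(coding_vector, vocab_vectors, k):
    """Score every word-list entry against the coding vector with a dot
    product (an assumed similarity), normalise with Softmax and return
    the k most probable (character, probability) pairs."""
    chars = list(vocab_vectors)
    scores = [sum(a * b for a, b in zip(coding_vector, vocab_vectors[c]))
              for c in chars]
    probs = softmax(scores)
    ranked = sorted(zip(chars, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical word list with placeholder characters and toy vectors.
vocab = {"c1": [2.0, 0.0], "c2": [0.0, 1.0],
         "c3": [1.0, 1.0], "c4": [-1.0, 0.0]}
top2 = top_k_candidates([1.0, 0.0], vocab, k=2)
```

Running the same routine once with sound-like vectors and once with shape-like vectors would give the two candidate lists that the step merges into the K semantic candidates.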

Step 505, determining the sound-like vector and the shape-like vector of the wrong Chinese character based on the position and the type of the wrong Chinese character.

Specifically, the sound-like vectors and the shape-like vectors of all Chinese characters in the Chinese character input sequence are obtained from the comparison learning model, and the sound-like vector and the shape-like vector of the wrong Chinese character are then determined based on the position and the type of the wrong Chinese character.

Step 506, determining the first M similar candidates based on the K semantic candidates, the sound-like vector and the shape-like vector.

Specifically, based on the K semantic candidates of the wrong Chinese character and the sound-like vector and the shape-like vector of the wrong Chinese character, two similarities are calculated: the character-sound similarity between the weighted vector of the coding vector and the sound-like vector of the wrong Chinese character and the sound-like vectors of the K semantic candidates in the word list, and the character-shape similarity between the weighted vector of the coding vector and the shape-like vector of the wrong Chinese character and the shape-like vectors of the K semantic candidates in the word list. The candidates are then ranked by character-sound similarity, character-shape similarity and confidence, and the first M similar candidates are determined.

Step 507, determining optimal path parameters based on the vectors respectively corresponding to the M similar candidates, the vector corresponding to the wrong Chinese character, and the vector corresponding to the Chinese character at the adjacent position of the wrong Chinese character.

Optionally, the vector corresponding to the Chinese character at the adjacent position of the wrong Chinese character may be the vector corresponding to the previous Chinese character of the wrong Chinese character, or may be the vector corresponding to the next Chinese character of the wrong Chinese character; the optimal path parameter represents the maximum dependency relationship value between the vector of the error Chinese character and the vectors corresponding to the M similar candidates respectively.

Step 508, determining an optimal correction text based on the optimal path parameters.

Specifically, according to the optimal path parameters, the similar candidate with the largest dependency relationship value is determined, so that the optimal correction text is determined.

The Chinese spelling error detection and correction method provided by the invention decodes the coding vector corresponding to the Chinese character input sequence according to the position and the type of the wrong Chinese character, and calculates to obtain the first K semantic candidates of the wrong Chinese character; determining the sound-like vector and the shape-like vector of the wrong Chinese character according to the position and the type of the wrong Chinese character; determining the first M similar candidates according to the K semantic candidates of the wrong Chinese characters, the sound-like vectors and the shape-like vectors of the wrong Chinese characters; determining optimal path parameters according to the vectors respectively corresponding to the M similar candidates, the vector corresponding to the wrong Chinese character and the vector corresponding to the Chinese character at the adjacent position of the wrong Chinese character; based on the optimal path parameters, the optimal correction text is determined, the character-sound similar relation and the character-shape similar relation are fused into the error correction process of the Chinese character input sequence, the wrong Chinese characters in the Chinese character input sequence are corrected, the accuracy of detection and correction of the complex Chinese character similar errors is improved, and the correction quality of Chinese spelling error correction is improved.

Optionally, the specific implementation process of step 506 includes the following steps:

step 1) calculating the similarities between the K semantic candidates and the sound-like vector and the shape-like vector, respectively, based on the K semantic candidates, the sound-like vector and the shape-like vector;

step 2) sorting the similarities, and selecting the similar candidates corresponding to the first M similarities.

Specifically, based on the K semantic candidates of the wrong Chinese character and the sound-like vector and the shape-like vector of the wrong Chinese character, the character-sound similarity between the coding vector and the sound-like vector of the wrong Chinese character and the sound-like vectors of the K semantic candidates in the word list, and the character-shape similarity between the coding vector and the shape-like vector of the wrong Chinese character and the shape-like vectors of the K semantic candidates in the word list, are calculated respectively. The candidates are ranked by character-sound similarity, character-shape similarity and confidence, and the first M are selected as the similar candidates, wherein M is less than or equal to K.
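The ranking in step 506 can be illustrated with a plain cosine-similarity filter. This is a simplified sketch: it compares one similarity vector of the wrong character with one vector per candidate and omits the weighting with the coding vector and the confidence ranking, whose exact form is not given in the text. All names and numbers are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_m_similar(wrong_vec, candidate_vecs, m):
    """Rank the semantic candidates by cosine similarity to the wrong
    character's sound-like (or shape-like) vector and keep the first m."""
    scored = [(cand, cosine_similarity(wrong_vec, vec))
              for cand, vec in candidate_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [cand for cand, _ in scored[:m]]

# K = 4 placeholder semantic candidates with toy sound-like vectors.
candidates = {"a": [1.0, 0.1], "b": [0.0, 1.0],
              "c": [1.0, 1.0], "d": [-1.0, 0.0]}
similar = top_m_similar([1.0, 0.0], candidates, m=2)
```

Because the M similar candidates are drawn from the K semantic candidates, the filter can only narrow the list, never extend it.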

Optionally, the specific implementation process of step 507 includes the following steps:

step 1) respectively calculating the dependency relationship values between the vectors respectively corresponding to the M similar candidates and the vector corresponding to the wrong Chinese character based on the vectors respectively corresponding to the M similar candidates, the vector corresponding to the wrong Chinese character and the vector corresponding to the Chinese character at the adjacent position of the wrong Chinese character;

step 2) selecting the maximum dependency relationship value as the optimal path parameter based on the dependency relationship values.

Specifically, according to vectors corresponding to the M similar candidates, vectors corresponding to the wrong Chinese characters and vectors corresponding to Chinese characters at adjacent positions of the wrong Chinese characters, dependency relationship values between the vectors corresponding to the M similar candidates and the vectors corresponding to the wrong Chinese characters are calculated respectively, a maximum dependency relationship value is obtained through maximum likelihood estimation, and the maximum dependency relationship value is selected as an optimal path parameter.
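Selecting the maximum dependency relationship value amounts to an argmax over candidate combinations. The sketch below covers the case of two adjacent wrong characters and uses a dot product as a stand-in for the dependency relationship value; the real value comes from a learned network, so the scoring function, names and numbers here are assumptions.

```python
from itertools import product

def dot(u, v):
    """Stand-in dependency score; the actual scorer is a learned model."""
    return sum(a * b for a, b in zip(u, v))

def best_path(cands_i, cands_next, score=dot):
    """Enumerate every (candidate at position i, candidate at position
    i+1) pair, score it, and return the pair with the maximum value."""
    pairs = product(cands_i.items(), cands_next.items())
    (c_i, v_i), (c_n, v_n) = max(
        pairs, key=lambda pair: score(pair[0][1], pair[1][1]))
    return c_i, c_n, score(v_i, v_n)

# Placeholder similar candidates for two adjacent wrong characters.
cands_i = {"a1": [1.0, 0.0], "a2": [0.0, 1.0]}
cands_next = {"b1": [0.5, 0.5], "b2": [0.0, 2.0]}
choice = best_path(cands_i, cands_next)
```

With M candidates per position the enumeration covers M × M pairs, which is small for values such as M = 3.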

FIG. 6 is a schematic flow chart of the Chinese spelling error correction provided by the present invention, and as shown in FIG. 6, the error correction process of the erroneous Chinese character in the Chinese character input sequence is explained by taking "what is not good" and "what is wrong" as examples:

inputting 'what is not good' into a Detector and an Encoder, wherein the Detector consists of the comparison learning model and a pre-training language model; the comparison learning model outputs the sound-like vector and the shape-like vector of each Chinese character in 'what is not good', and the pre-training language model outputs the full-sentence CLS vector corresponding to 'what is not good' and the semantic vector of each Chinese character, the semantic vectors being used for representing the semantic relation of each Chinese character; based on the sound-like vector and the shape-like vector of each Chinese character output by the comparison learning model, and the full-sentence CLS vector and the semantic vector of each Chinese character output by the pre-training language model, the positions and types of the wrong Chinese characters in 'what is not good' are detected, and the Detector outputs the error detection result for the wrong Chinese characters 'Ji' and 'of': {(6, ji), (7, of)}; the Encoder outputs the coding vector corresponding to 'what is not good'; the coding vector is input into Softmax, and Softmax decodes it; according to the positions and types of the wrong Chinese characters 'Ji' and 'of' output by the Detector, the character-sound similarity between the coding vectors corresponding to 'Ji' and 'of' and the sound-like vectors of the Chinese characters in the word list, and the character-shape similarity between these coding vectors and the shape-like vectors of the Chinese characters in the word list, are calculated respectively, and the first 6 sound-like candidates and shape-like candidates are selected as the first 6 semantic candidates of each wrong Chinese character; that is, the semantic candidates of 'Ji' are 'Ji', 'remember', 'Ji', 'doing', 'count' and 'value', and the semantic candidates of 'of' are 'the's', 'get', 'the's', 'in', 'the's' and 'heart'.

Further, the sound-like vector and the shape-like vector of each Chinese character in the Chinese character input sequence are obtained according to the positions and types of the wrong Chinese characters and the comparison learning model, and the sound-like vectors and the shape-like vectors of the wrong Chinese characters 'Ji' and 'of' are determined; the sound-like vectors and the shape-like vectors of the K semantic candidates of the wrong Chinese characters 'Ji' and 'of', together with the sound-like vectors and the shape-like vectors of the wrong Chinese characters themselves, are input into a Filter; the Filter calculates, respectively, the character-sound similarity between the weighted vector of the coding vector and the sound-like vector of each wrong Chinese character and the sound-like vectors of the K semantic candidates in the word list, and the character-shape similarity between the weighted vector of the coding vector and the shape-like vector of each wrong Chinese character and the shape-like vectors of the K semantic candidates in the word list; the candidates are ranked by character-sound similarity, character-shape similarity and confidence, and the semantic candidates corresponding to the first 3 cosine similarities are selected as the similar candidates, thereby obtaining similar candidates including 'notation' and 'meter' for 'Ji', and 'ground' and 'de' for 'of'.

Further, the vectors $w_{i,m}$ and $w_{i+1,n}$ respectively corresponding to the 3 similar candidates, the vector $h_i$ corresponding to the wrong Chinese character, and the vector $h_{i+1}$ corresponding to the Chinese character at the adjacent position of the wrong Chinese character are input into a Dependency Searcher, that is, they pass in sequence through the Multi-Head Attention layer, the residual connection and layer normalization (Add and LayerNorm) and the Feed Forward Network layer included in the Dependency Searcher; the dependency relationship values between the vectors $w_{i,m}$ and $w_{i+1,n}$ corresponding to the 3 similar candidates and the vector $h_i$ corresponding to the wrong Chinese character are calculated, the calculation being repeated 8 times; the maximum dependency relationship value is obtained through maximum likelihood estimation and is selected as the optimal path parameter; and the optimal correction text is determined as 'no remember what' according to the optimal path parameter.

The following describes the error detection and correction device for Chinese spelling provided by the present invention, and the error detection and correction device for Chinese spelling described below and the error detection and correction method for Chinese spelling described above can be referred to correspondingly.

Fig. 7 is a schematic structural diagram of the Chinese spelling error detection and correction apparatus provided by the present invention. As shown in Fig. 7, the Chinese spelling error detection and correction apparatus 700 includes: a comparison learning module 701, an error detection module 702, an encoding module 703 and an error correction module 704; wherein:

a comparison learning module 701, configured to input a Chinese character input sequence to a comparison learning model, and obtain a similar character vector corresponding to each Chinese character in the Chinese character input sequence output by the comparison learning model; the similar word vectors comprise a sound-like vector and a shape-like vector; the sound-like vectors are used for representing the character sound-like relation of each Chinese character; the shape similarity vector is used for representing the shape similarity relation of each Chinese character; the comparison learning model is obtained based on sample Chinese character triple training;

an error detection module 702, configured to detect an erroneous Chinese character in a Chinese character input sequence based on the similar character vector, and obtain a position and a type of the erroneous Chinese character;

the encoding module 703 is configured to encode the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence;

and the error correction module 704 is configured to correct the erroneous Chinese characters in the Chinese character input sequence based on the similar character vector, the positions and types of the erroneous Chinese characters, and the coding vector, so as to obtain an optimal corrected text.

According to the Chinese spelling error detection and correction device provided by the embodiment of the invention, the word-sound similar relation and the character-shape similar relation are fused in the error detection process of the Chinese character input sequence, and the error detection and correction of the wrong Chinese character in the Chinese character input sequence are realized based on the error detection result of the wrong Chinese character, so that the error detection and correction accuracy of the complicated Chinese character similar error is improved, and the correction quality of the Chinese spelling error correction is improved.

Optionally, the error detection module 702 is specifically configured to:

Optionally, the error correction module 704 is specifically configured to:

Fig. 8 is a schematic diagram of the physical structure of an electronic device. As shown in Fig. 8, the electronic device 800 may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a Chinese spelling error detection and correction method, the method comprising: inputting a Chinese character input sequence into a contrast learning model to obtain a similar character vector corresponding to each Chinese character in the Chinese character input sequence output by the contrast learning model; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vectors are used for representing the character-sound similarity relation of each Chinese character; the shape-like vectors are used for representing the character-shape similarity relation of each Chinese character; the contrast learning model is obtained by training based on sample Chinese character triplets; detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters; encoding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence; and correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and the types of the wrong Chinese characters and the coding vector to obtain an optimal corrected text.

In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the Chinese spelling error detection and correction method, the method comprising: inputting a Chinese character input sequence into a contrast learning model to obtain a similar character vector corresponding to each Chinese character in the Chinese character input sequence output by the contrast learning model; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vectors are used for representing the character-sound similarity relation of each Chinese character; the shape-like vectors are used for representing the character-shape similarity relation of each Chinese character; the contrast learning model is obtained by training based on sample Chinese character triplets; detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters; encoding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence; and correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and the types of the wrong Chinese characters and the coding vector to obtain an optimal corrected text.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the Chinese spelling error detection and correction method provided above, the method comprising: inputting a Chinese character input sequence into a contrast learning model to obtain a similar character vector corresponding to each Chinese character in the Chinese character input sequence output by the contrast learning model; the similar character vectors comprise a sound-like vector and a shape-like vector; the sound-like vectors are used for representing the character-sound similarity relation of each Chinese character; the shape-like vectors are used for representing the character-shape similarity relation of each Chinese character; the contrast learning model is obtained by training based on sample Chinese character triplets; detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters; encoding the Chinese character input sequence to obtain a coding vector corresponding to the Chinese character input sequence; and correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and the types of the wrong Chinese characters and the coding vector to obtain an optimal corrected text.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A Chinese spelling error detection and correction method, characterized by comprising the following steps:

2. The method of claim 1, wherein the detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters in the Chinese character input sequence comprises:

3. The method of claim 2, wherein the determining the positions and types of the wrong Chinese characters in the Chinese character input sequence based on the phonetic-similarity gating value and the shape-similarity gating value comprises:

4. The method of claim 1, wherein the correcting the wrong Chinese characters in the Chinese character input sequence based on the similar character vectors, the positions and types of the wrong Chinese characters in the Chinese character input sequence, and the encoding vector to obtain an optimal corrected text comprises:

determining the top M similar candidates based on the K semantic candidates, the phonetic-similarity vector and the shape-similarity vector;

5. The method of claim 4, wherein the determining the top M similar candidates based on the K semantic candidates, the phonetic-similarity vector and the shape-similarity vector comprises:

6. The method of claim 4, wherein the determining the optimal path parameters based on the vectors corresponding to the M similar candidates, the vector corresponding to the wrong Chinese character, and the vectors corresponding to the Chinese characters at positions adjacent to the wrong Chinese character comprises:

7. A Chinese spelling error detection and correction device, comprising:

the error detection module is used for detecting wrong Chinese characters in the Chinese character input sequence based on the similar character vectors to obtain the positions and types of the wrong Chinese characters;

8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the Chinese spelling error detection and correction method according to any one of claims 1 to 6.

9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the Chinese spelling error detection and correction method according to any one of claims 1 to 6.

10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the Chinese spelling error detection and correction method according to any one of claims 1 to 6.