CN112329446B

CN112329446B - Chinese spelling checking method

Info

Publication number: CN112329446B
Application number: CN201910646536.3A
Authority: CN
Inventors: 段建勇; 王昊; 张梅; 马东超; 王冰; 潘利建; 袁阳
Original assignee: North China University of Technology
Current assignee: North China University of Technology
Priority date: 2019-07-17
Filing date: 2019-07-17
Publication date: 2023-05-23
Anticipated expiration: 2039-07-17
Also published as: CN112329446A

Abstract

The invention discloses a Chinese spelling checking method comprising the following steps: establishing a Chinese spell checking model; setting Chinese spelling error checking as a sequence labeling task; adding dynamic words and pinyin to train the model; inputting characters, words and pinyin respectively into the trained model; and matching the characters, words and pinyin input into the model through the sequence labeling task. The invention effectively combines the three features of characters, words and pinyin, realizes an end-to-end error-checking solution without word segmentation, avoids a cumbersome pipeline, and is more general and more domain-adaptable than traditional error-checking methods.

Description

Chinese spelling checking method

Technical Field

The invention relates to the technical field of automatic text error checking, in particular to a Chinese spelling checking method.

Background

With the development of information processing technology, traditional text work has largely been taken over by computers, and with the development of the internet, electronic books, electronic newspapers, e-mail and the like have become part of daily life. At the same time, text errors have become more and more numerous. Traditional manual checking is inefficient, labor-intensive and slow, and clearly cannot meet the demand for text spell checking. Automatic text checking technology therefore affects the pace and development of the publishing industry, and research on it has important practical significance.

Chinese spell checking differs from English spell checking in two respects. First, English words are separated by natural delimiters such as spaces and commas, whereas Chinese has no obvious boundary between words. Second, most English errors are misspelled words that can be detected by direct dictionary lookup, whereas every individual Chinese character is legal, so a Chinese error can only be identified in context. In addition, the checking methods currently in use rely only on character and word features and do not use pinyin features.

Disclosure of Invention

In order to overcome the problems in the related art, the embodiment of the invention provides a Chinese spelling checking method, which integrates the characteristics of characters, words and pinyin, does not need word segmentation and realizes end-to-end error checking.

The embodiment of the invention provides a Chinese spelling checking method, which comprises the following steps:

establishing a Chinese spell checking model;

setting Chinese spelling error checking as a sequence labeling task;

adding dynamic words and pinyin to train the model;

inputting characters, words and pinyin respectively into the trained model, wherein the characters, words and pinyin are represented by a first embedding, a second embedding and a third embedding respectively, according to the following formulas:

$$x_i^c = e^c(c_i), \qquad x_{b,e}^w = e^w(c_b, c_{b+1}, \ldots, c_e), \qquad x_{b,e}^p = e^p(c_b, c_{b+1}, \ldots, c_e)$$

where $c_i$ denotes the i-th character of the input sentence, $x_i^c$ denotes the vector corresponding to character $c_i$, $x_{b,e}^w$ and $x_{b,e}^p$ denote the word vector and pinyin vector of the substring $c_b, c_{b+1}, \ldots, c_e$ respectively, and $e^c$, $e^w$ and $e^p$ denote the first, second and third embedding lookup tables for characters, words and pinyin respectively;

matching the characters, words and pinyin input into the model through the sequence labeling task, wherein:

for a word, the hidden-layer state output by the preceding node and the vector representation of the currently matched word are taken as input, and the resulting target output is passed to the node with subscript e as part of that node's input;

pinyin is handled in the same way as words: the hidden-layer state of the starting node is taken as input, and the other input is the vector representation of the matched pinyin;

to control the contributions of the character, pinyin and word vector representations, a gating mechanism is used to compute their weights: the output of each word and the output of each pinyin are each multiplied by a calculated weight, and the weights are then normalized so that they sum to one. That is, the weight coefficients of all words and pinyin ending at position i, together with the input weight of character $c_i$, sum to one. Normalization is thereby achieved, which yields the calculation for each feature-fusion node;

the hidden-layer state sequence output by the Chinese spell checking model is $h_1, h_2, \ldots, h_m$; probability analysis and calculation are carried out by a CRF layer, and the label sequence $y = l_1, l_2, \ldots, l_m$ with the maximum probability is output.
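As an illustration of how a CRF layer can select the most probable label sequence, the following is a minimal sketch of Viterbi decoding over the two labels T and F. It assumes that each hidden state has already been projected to a per-label emission score and that the CRF holds a score for each label transition; the names `viterbi`, `emissions` and `transitions` and all numeric values are hypothetical and are not taken from the patent.

```python
LABELS = ["T", "F"]  # T: correct character, F: erroneous character


def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions[i][l] is the (hypothetical) score of label l at position i.
    transitions[(p, l)] is the (hypothetical) score of moving from p to l.
    """
    best = [{l: emissions[0][l] for l in LABELS}]
    back = []
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for l in LABELS:
            # Best previous label for reaching label l at position i.
            prev = max(LABELS, key=lambda p: best[-1][p] + transitions[(p, l)])
            scores[l] = best[-1][prev] + transitions[(prev, l)] + emissions[i][l]
            pointers[l] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(LABELS, key=lambda l: best[-1][l])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores for a four-character sentence (assumed values).
emissions = [
    {"T": 2.0, "F": 0.0},
    {"T": 1.5, "F": 0.2},
    {"T": 0.1, "F": 1.8},
    {"T": 2.0, "F": 0.3},
]
transitions = {("T", "T"): 0.5, ("T", "F"): 0.0, ("F", "T"): 0.0, ("F", "F"): -0.5}
decoded = viterbi(emissions, transitions)
print(decoded)
```

In this toy input the third position has a strong F emission score, so the decoded sequence marks only that character as erroneous.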

Further, the Chinese spell checking model is built based on a neural sequence model.

Further, each character $c_i$ is assigned a label $l_i \in \{T, F\}$, where T and F denote correct and incorrect characters respectively, and a character labeled F is regarded as an erroneous character. A plurality of characters $c_i$ form a sentence, expressed as $s = c_1, c_2, \ldots, c_m$, where $c_i$ denotes the i-th character of sentence $s$ and $m$ denotes the length of the sentence.

Further, for both words and pinyin, substrings of the original sentence are matched against pre-trained vector tables. The sets of entries in the pre-trained vector tables serve as pre-trained dictionaries, denoted $D^w$ and $D^p$ for words and pinyin respectively. Both tables $D^w$ and $D^p$ are pre-trained on a large-scale corpus using word2vec.

The technical solution provided by the embodiments of the invention has the following beneficial effects: the features of characters, words and pinyin are effectively fused; no word segmentation is required; an end-to-end error-checking solution is realized and a cumbersome pipeline is avoided; and the method is more general and more domain-adaptable than traditional error-checking methods.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow chart of a method of checking Chinese spelling in accordance with an embodiment of the invention.

Detailed Description

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and related applications, methods consistent with aspects of the invention as detailed in the accompanying claims.

FIG. 1 is a flow chart of a method for checking Chinese spellings according to an embodiment of the present invention, as shown in FIG. 1, comprising the steps of:

Step 101, establishing a Chinese spell checking model based on a neural sequence model.

Step 102, setting Chinese spelling error checking as a sequence labeling task.

With Chinese spelling error checking set as a sequence labeling task, each character $c_i$ is assigned a label $l_i \in \{T, F\}$, where T and F denote correct and incorrect characters respectively, and a character labeled F is regarded as an erroneous character. A sentence is expressed as $s = c_1, c_2, \ldots, c_m$, where $c_i$ denotes the i-th character of sentence $s$ and $m$ denotes the length of the sentence.
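The labeling scheme can be illustrated with a short sketch. It assumes that training data is available as pairs of an observed sentence and a corrected reference sentence of the same length, which the text above does not state; the function name `label_characters` and the example sentences are hypothetical.

```python
def label_characters(observed, reference):
    """Pair each observed character c_i with a label l_i in {"T", "F"}."""
    if len(observed) != len(reference):
        raise ValueError("both sentences must have the same length m")
    return [(c, "T" if c == r else "F") for c, r in zip(observed, reference)]


# Hypothetical example: the character 平 is written in place of 苹 in 苹果.
observed = "我喜欢吃平果"
reference = "我喜欢吃苹果"
labeled = label_characters(observed, reference)
labels = [label for _, label in labeled]
print(labels)
```

Each character keeps its position in the sentence, so the model predicts one label per character and no word segmentation is needed.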

Step 103, adding dynamic words and pinyin to train the model.

Step 104, inputting characters, words and pinyin respectively into the trained model.

The characters, words and pinyin are represented by a first embedding, a second embedding and a third embedding respectively, according to the following formulas:

$$x_i^c = e^c(c_i), \qquad x_{b,e}^w = e^w(c_b, c_{b+1}, \ldots, c_e), \qquad x_{b,e}^p = e^p(c_b, c_{b+1}, \ldots, c_e)$$

where $c_i$ denotes the i-th character of the input sentence, $x_i^c$ denotes the vector corresponding to character $c_i$, $x_{b,e}^w$ and $x_{b,e}^p$ denote the word vector and pinyin vector of the substring $c_b, c_{b+1}, \ldots, c_e$ respectively, and $e^c$, $e^w$ and $e^p$ denote the first, second and third embedding lookup tables for characters, words and pinyin respectively.

Step 105, matching the characters, words and pinyin input into the model through the sequence labeling task.

For both words and pinyin, substrings of the original sentence are matched against pre-trained vector tables. The sets of entries in the pre-trained vector tables serve as pre-trained dictionaries, denoted $D^w$ and $D^p$ for words and pinyin respectively. Both tables $D^w$ and $D^p$ are pre-trained on a large-scale corpus using word2vec.
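The substring matching can be sketched as an enumeration of every span of the sentence followed by a dictionary lookup. The sketch assumes spans of at least two characters, a maximum span length `max_len`, and a hypothetical character-to-pinyin map; the toy dictionaries stand in for $D^w$ and $D^p$ and contain a single entry each.

```python
def match_substrings(sentence, word_dict, pinyin_dict, char_to_pinyin, max_len=4):
    """Return (b, e, kind, key) for every substring c_b..c_e found in a dictionary."""
    matches = []
    m = len(sentence)
    for b in range(m):
        # e is the inclusive end index; spans cover 2..max_len characters.
        for e in range(b + 1, min(b + max_len, m)):
            sub = sentence[b:e + 1]
            if sub in word_dict:
                matches.append((b, e, "word", sub))
            py = " ".join(char_to_pinyin.get(ch, "?") for ch in sub)
            if py in pinyin_dict:
                matches.append((b, e, "pinyin", py))
    return matches


# Toy dictionaries and pinyin map (assumed values, tones omitted).
D_w = {"苹果"}
D_p = {"ping guo"}
to_pinyin = {"我": "wo", "吃": "chi", "苹": "ping", "平": "ping", "果": "guo"}

correct = match_substrings("我吃苹果", D_w, D_p, to_pinyin)
misspelled = match_substrings("我吃平果", D_w, D_p, to_pinyin)
print(correct)
print(misspelled)
```

In the misspelled sentence the span 平果 matches no word, but its pinyin still matches, which is the kind of signal that word features alone would miss.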

For a word, the hidden-layer state output by the preceding node and the vector representation of the currently matched word are taken as input, and the resulting target output is passed to the node with subscript e as part of that node's input.

Pinyin is handled in the same way as words: the hidden-layer state of the starting node is taken as input, and the other input is the vector representation of the matched pinyin.

The output of each word is multiplied by its calculated weight. The weight coefficients of all words and pinyin ending at position i, together with the input weight of character $c_i$, are normalized so that they sum to one.

The cell differs from a standard LSTM in how its internal states are calculated, while the output form is consistent with a standard LSTM, namely a hidden-layer state and a memory-cell output. In this way, the information carried by words and pinyin is effectively fused into the hidden-layer state and the memory cell and passed to the following node as a reference.
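The weighting can be sketched as follows. This is a simplified illustration, not the patented formula: it assumes one scalar gate score per input and normalizes the scores with a softmax, whereas the actual gates may be vectors computed from the inputs. The function names and all values are hypothetical.

```python
import math


def normalize_weights(gate_scores):
    """Softmax over scalar gate scores so that the weights sum to one."""
    peak = max(gate_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in gate_scores]
    total = sum(exps)
    return [x / total for x in exps]


def fuse(char_candidate, word_cells, pinyin_cells, gate_scores):
    """Weighted sum of the character input and all word/pinyin cells ending at i.

    gate_scores lists one score per word cell, then one per pinyin cell,
    then one for the character input, in that order.
    """
    cells = word_cells + pinyin_cells + [char_candidate]
    if len(cells) != len(gate_scores):
        raise ValueError("one gate score is required per fused input")
    weights = normalize_weights(gate_scores)
    dim = len(char_candidate)
    fused = [sum(w * cell[k] for w, cell in zip(weights, cells)) for k in range(dim)]
    return weights, fused


# Toy inputs: one word cell, one pinyin cell and the character candidate.
weights, fused = fuse(
    char_candidate=[1.0, 0.0],
    word_cells=[[0.0, 1.0]],
    pinyin_cells=[[1.0, 1.0]],
    gate_scores=[0.0, 0.0, 0.0],
)
print(weights, fused)
```

With equal gate scores every input receives the same weight, so the fused state is the plain average; unequal scores shift the fused state toward the word, pinyin or character input.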

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.

It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A Chinese spelling checking method, comprising the following steps:

establishing a Chinese spell checking model;

setting Chinese spelling error checking as a sequence labeling task;

adding dynamic words and pinyin to train the Chinese spelling model;

inputting characters, words and pinyin respectively into the trained Chinese spelling model, wherein the characters, words and pinyin are represented by a first embedding, a second embedding and a third embedding respectively, according to the following formulas:

$$x_i^c = e^c(c_i), \qquad x_{b,e}^w = e^w(c_b, c_{b+1}, \ldots, c_e), \qquad x_{b,e}^p = e^p(c_b, c_{b+1}, \ldots, c_e)$$

where $c_i$ denotes the i-th character of the input sentence, $x_i^c$ denotes the vector corresponding to character $c_i$, and $x_{b,e}^w$ and $x_{b,e}^p$ denote the word vector and pinyin vector of the substring $c_b, c_{b+1}, \ldots, c_e$ respectively;

matching the characters, words and pinyin input into the Chinese spelling model through the sequence labeling task, wherein, for a word, the hidden-layer state output by the preceding node and the vector representation of the currently matched word are taken as input to calculate the target output, and the output of each word is multiplied by its calculated weight, the weights comprising all coefficients ending at position i and the input weight of character $c_i$.

2. The method of claim 1, wherein the Chinese spell checking model is built based on a neural sequence model.

3. The method of claim 1, wherein setting Chinese spelling error checking as a sequence labeling task further comprises assigning each character $c_i$ a label $l_i \in \{T, F\}$, where T and F denote correct and incorrect characters respectively and a character labeled F is regarded as an erroneous character, a plurality of characters $c_i$ forming a sentence expressed as $s = c_1, c_2, \ldots, c_m$, where $c_i$ denotes the i-th character of sentence $s$ and $m$ denotes the length of the sentence.

4. The method of claim 1, wherein matching the characters, words and pinyin input into the model through the sequence labeling task further comprises matching, for both words and pinyin, substrings of the original sentence against pre-trained vector tables, the sets of entries in the pre-trained vector tables serving as pre-trained dictionaries denoted $D^w$ and $D^p$ for words and pinyin respectively, both tables $D^w$ and $D^p$ being pre-trained on a large-scale corpus using word2vec.