CN112329446B - Chinese spelling checking method - Google Patents
- Publication number: CN112329446B (application CN201910646536.3A)
- Authority: CN (China)
- Prior art keywords: pinyin, words, characters, Chinese
- Prior art date: 2019-07-17
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation
- Document Processing Apparatus
Abstract
The invention discloses a Chinese spelling checking method comprising the following steps: establishing a Chinese spell checking model; setting Chinese spelling error checking as a sequence labeling task; adding dynamic words and pinyin to train the model; inputting characters, words and pinyin into the trained model; and matching the characters, words and pinyin input to the model through the sequence labeling task. The invention effectively fuses the three features of characters, words and pinyin and realizes an end-to-end error checking solution that requires no word segmentation, avoiding a cumbersome pipeline; compared with traditional error checking methods it is more general and adapts better across domains.
Description
Technical Field
The invention relates to the technical field of automatic text error checking, in particular to a Chinese spelling checking method.
Background
With the development of information processing technology, traditional text work has largely been taken over by computers, and with the growth of the internet, electronic books, electronic newspapers, electronic mail and the like have become part of daily life. At the same time, text errors have become increasingly common. Traditional manual proofreading suffers from low efficiency, high labor intensity and long turnaround, and clearly cannot meet the demand for text spelling checking. Automatic text checking technology therefore affects the pace of development of the news and publishing industry, and research on automatic text checking has important practical significance.
Chinese spell checking differs from English spell checking. First, English has natural separators between words, such as spaces and punctuation, while Chinese has no explicit boundaries between words. Second, most English errors are word misspellings that can be detected by a direct dictionary lookup, whereas every individual Chinese character is a legal character, so Chinese errors can be recognized only in context. Moreover, existing checkers use only character and word features and do not exploit pinyin features.
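As an illustration of why pinyin features matter (a toy example, not taken from the patent): many Chinese substitution errors replace a character with a homophone, so the pinyin of the erroneous character still matches the intended one even though the character differs. The tiny character-to-pinyin table below is a hand-made sample, not a real lexicon.

```python
# Toy illustration: homophone substitution errors share the same pinyin.
# The character-to-pinyin table is a tiny hand-made sample, not a real lexicon.
PINYIN = {
    "我": "wo",
    "在": "zai",   # "at / in" (correct in the sentence below)
    "再": "zai",   # "again" (a common homophone confusion with 在)
    "家": "jia",
}

def to_pinyin(sentence):
    """Map each character of a sentence to its pinyin syllable."""
    return [PINYIN[ch] for ch in sentence]

correct = "我在家"   # "I am at home"
wrong   = "我再家"   # homophone error: 再 written instead of 在

# The character sequences differ at position 1...
assert [a == b for a, b in zip(correct, wrong)] == [True, False, True]
# ...but the pinyin sequences are identical, which is exactly the signal
# a pinyin-aware checker can exploit.
assert to_pinyin(correct) == to_pinyin(wrong) == ["wo", "zai", "jia"]
```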
Disclosure of Invention
In order to overcome the problems in the related art, the embodiment of the invention provides a Chinese spelling checking method, which integrates the characteristics of characters, words and pinyin, does not need word segmentation and realizes end-to-end error checking.
The embodiment of the invention provides a Chinese spelling checking method, which comprises the following steps:
establishing a Chinese spell checking model;
setting Chinese spelling error check as sequence labeling task;
adding dynamic words and pinyin to train the model;
the method comprises the steps of respectively inputting characters, words and pinyin into the trained model, wherein the characters, the words and the pinyin are represented by a first embedding, a second embedding and a third embedding respectively, computed as: $x_i^c = e^c(c_i)$, $x_{b,e}^w = e^w(w_{b,e})$ and $x_{b,e}^p = e^p(p_{b,e})$, where $c_i$ is the $i$-th character of the input sentence, $x_i^c$ is the vector corresponding to character $c_i$, $x_{b,e}^w$ and $x_{b,e}^p$ respectively denote the word vector and pinyin vector of the substring $c_b, c_{b+1}, \ldots, c_e$, and $e^c$, $e^w$ and $e^p$ are the first, second and third mapping lookup tables for characters, words and pinyin respectively;
matching the characters, words and pinyin input in the model through a sequence labeling task,
for each matched word spanning characters $c_b \ldots c_e$, the hidden state $h_b$ output by the node at its beginning and the vector representation $x_{b,e}^w$ of the currently matched word are taken as input; the target output is the word cell state $c_{b,e}^w$, which is fed as part of the input of the node with subscript $e$. The calculation formula is as follows:
$i_{b,e}^w = \sigma(W^{i} x_{b,e}^w + U^{i} h_b + b^{i})$
$f_{b,e}^w = \sigma(W^{f} x_{b,e}^w + U^{f} h_b + b^{f})$
$\tilde{c}_{b,e}^w = \tanh(W^{c} x_{b,e}^w + U^{c} h_b + b^{c})$
$c_{b,e}^w = f_{b,e}^w \odot c_b + i_{b,e}^w \odot \tilde{c}_{b,e}^w$
The pinyin is handled in the same way as the words: the hidden layer state $h_b$ of the initial node is used as one input, and the other input is the matched pinyin vector representation $x_{b,e}^p$. The calculation formula is as follows:
$i_{b,e}^p = \sigma(W^{i} x_{b,e}^p + U^{i} h_b + b^{i})$
$f_{b,e}^p = \sigma(W^{f} x_{b,e}^p + U^{f} h_b + b^{f})$
$\tilde{c}_{b,e}^p = \tanh(W^{c} x_{b,e}^p + U^{c} h_b + b^{c})$
$c_{b,e}^p = f_{b,e}^p \odot c_b + i_{b,e}^p \odot \tilde{c}_{b,e}^p$
In order to control the vector representations output by characters, pinyin and words, a gating mechanism is adopted to control the weights. A gate $g_{b,i}^w = \sigma(W^{g} x_i^c + U^{g} c_{b,i}^w + b^{g})$ weights the output of each word, and a gate $g_{b,i}^p = \sigma(W^{g} x_i^c + U^{g} c_{b,i}^p + b^{g})$ weights the output of each pinyin. All coefficients ending with $i$, namely $g_{b,i}^w$ and $g_{b,i}^p$, together with the input gate $i_i^c$ of character $c_i$, are then normalized so that the weight coefficients sum to one:
$\alpha_{b,i}^w = \exp(g_{b,i}^w)/Z_i$, $\alpha_{b,i}^p = \exp(g_{b,i}^p)/Z_i$, $\alpha_i^c = \exp(i_i^c)/Z_i$,
where $Z_i = \exp(i_i^c) + \sum_b \exp(g_{b,i}^w) + \sum_b \exp(g_{b,i}^p)$, realizing normalization. The calculation formula of each feature fusion node is therefore:
$c_i = \sum_b \alpha_{b,i}^w \odot c_{b,i}^w + \sum_b \alpha_{b,i}^p \odot c_{b,i}^p + \alpha_i^c \odot \tilde{c}_i$
The hidden layer state sequence output by the Chinese spell checking model is $h_1, h_2, \ldots, h_m$. Probability analysis and calculation are carried out through the CRF layer, and the label sequence $y = l_1, l_2, \ldots, l_m$ with the maximum probability is output. The probability calculation formula is as follows:
$P(y \mid s) = \dfrac{\exp\big(\sum_{i} (W^{l_i} h_i + b^{(l_{i-1}, l_i)})\big)}{\sum_{y'} \exp\big(\sum_{i} (W^{l'_i} h_i + b^{(l'_{i-1}, l'_i)})\big)}$
Further, the Chinese spell checking model is built based on a neural sequence model.
Further, each character $c_i$ is given a label $l_i \in \{T, F\}$, where $T$ and $F$ represent correct and incorrect characters respectively; a character marked $F$ is regarded as a wrong character. A plurality of characters $c_i$ form a sentence, written as $s = c_1, c_2, \ldots, c_m$, where $c_i$ is the $i$-th character of sentence $s$ and $m$ is the length of the sentence.
Further, for both words and pinyin, substrings of the original sentence are matched against pre-trained vector tables. The vocabularies of the pre-trained vectors serve as pre-training dictionaries, denoted $D_w$ and $D_p$ for words and pinyin respectively; both pre-trained vector tables are trained on a large-scale corpus using word2vec.
the technical scheme provided by the embodiment of the invention has the following beneficial effects: the method has the advantages that the characteristics of the characters, the words and the pinyin can be effectively fused, the word segmentation is not needed, an end-to-end error checking solution is realized, a complicated process is avoided, the three characteristics of the characters, the words and the pinyin are fused, the word segmentation is not needed, and the method has universality and field adaptability compared with the traditional error checking method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a method of checking Chinese spelling in accordance with an embodiment of the invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention; rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the appended claims.
FIG. 1 is a flow chart of a Chinese spelling checking method according to an embodiment of the present invention. As shown in FIG. 1, the method comprises the following steps:
and step 101, establishing a Chinese spell checking model based on the nerve sequence.
And step 102, setting Chinese spelling error checking as a sequence labeling task. Each character $c_i$ is given a label $l_i \in \{T, F\}$, where $T$ and $F$ represent correct and incorrect characters respectively; a character marked $F$ is regarded as a wrong character. A sentence is written as $s = c_1, c_2, \ldots, c_m$, where $c_i$ is the $i$-th character of sentence $s$ and $m$ is the length of the sentence.
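A minimal sketch of this labeling scheme (illustrative code, not from the patent): given a sentence and a corrected reference of the same length, each character is tagged T or F by position-wise comparison.

```python
def label_sentence(sentence, reference):
    """Tag each character of `sentence` with T (correct) or F (wrong),
    using a corrected reference sentence of the same length.
    Toy formulation: spelling errors here are character substitutions,
    so the two sequences align character by character."""
    assert len(sentence) == len(reference)
    return ["T" if a == b else "F" for a, b in zip(sentence, reference)]

s = "我再家等你"   # erroneous sentence (再 should be 在)
r = "我在家等你"   # corrected reference
print(label_sentence(s, r))  # ['T', 'F', 'T', 'T', 'T']
```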
And step 103, adding dynamic words and pinyin to train the model.
And step 104, respectively inputting characters, words and pinyin into the trained model.
The characters, words and pinyin are represented by a first embedding, a second embedding and a third embedding respectively, computed as: $x_i^c = e^c(c_i)$, $x_{b,e}^w = e^w(w_{b,e})$ and $x_{b,e}^p = e^p(p_{b,e})$, where $c_i$ is the $i$-th character of the input sentence, $x_i^c$ is the vector corresponding to character $c_i$, $x_{b,e}^w$ and $x_{b,e}^p$ respectively denote the word vector and pinyin vector of the substring $c_b, c_{b+1}, \ldots, c_e$, and $e^c$, $e^w$ and $e^p$ are the first, second and third mapping lookup tables for characters, words and pinyin respectively.
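A sketch of the three lookup tables $e^c$, $e^w$ and $e^p$ (illustrative only: the vector dimension and random initialization are assumptions for the example; the patent uses word2vec-pretrained vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension (an assumption for illustration)

def make_table(vocab):
    """Build a lookup table mapping each symbol to a fixed vector."""
    return {v: rng.standard_normal(DIM) for v in vocab}

e_c = make_table(["我", "再", "在", "家"])          # character table e^c
e_w = make_table(["在家", "我在"])                  # word table e^w (matchable substrings)
e_p = make_table(["wo", "zai", "jia", "zaijia"])   # pinyin table e^p

# x_i^c = e^c(c_i): vector of the i-th character
x_c = e_c["在"]
# x_{b,e}^w = e^w(w_{b,e}): vector of the word spanning characters b..e
x_w = e_w["在家"]
# x_{b,e}^p = e^p(p_{b,e}): vector of the pinyin of that span
x_p = e_p["zaijia"]

assert x_c.shape == x_w.shape == x_p.shape == (DIM,)
```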
And step 105, matching the characters, words and pinyin input to the model through a sequence labeling task.
For both words and pinyin, substrings of the original sentence are matched against pre-trained vector tables. The vocabularies of the pre-trained vectors serve as pre-training dictionaries, denoted $D_w$ and $D_p$ for words and pinyin respectively; both pre-trained vector tables are trained on a large-scale corpus using word2vec.
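The substring matching step can be sketched as follows (a toy set stands in for the word2vec vocabulary $D_w$; the maximum span length is an assumption for illustration):

```python
def match_substrings(sentence, dictionary, max_len=4):
    """Enumerate substrings of the sentence and keep those found in the
    pre-training dictionary, returning (begin, end, word) triples with
    inclusive 0-based indices: the spans (b, e) fed into the model."""
    matches = []
    for b in range(len(sentence)):
        for e in range(b, min(b + max_len, len(sentence))):
            word = sentence[b:e + 1]
            if word in dictionary:
                matches.append((b, e, word))
    return matches

D_w = {"在家", "等你", "我"}  # toy stand-in for the word2vec vocabulary D_w
print(match_substrings("我在家等你", D_w))
# [(0, 0, '我'), (1, 2, '在家'), (3, 4, '等你')]
```

The same routine applies to $D_p$ by first converting each substring to its pinyin.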
For each matched word spanning characters $c_b \ldots c_e$, the hidden state $h_b$ output by the node at its beginning and the vector representation $x_{b,e}^w$ of the currently matched word are taken as input; the target output is the word cell state $c_{b,e}^w$, which is fed as part of the input of the node with subscript $e$. The calculation formula is as follows:
$i_{b,e}^w = \sigma(W^{i} x_{b,e}^w + U^{i} h_b + b^{i})$
$f_{b,e}^w = \sigma(W^{f} x_{b,e}^w + U^{f} h_b + b^{f})$
$\tilde{c}_{b,e}^w = \tanh(W^{c} x_{b,e}^w + U^{c} h_b + b^{c})$
$c_{b,e}^w = f_{b,e}^w \odot c_b + i_{b,e}^w \odot \tilde{c}_{b,e}^w$
The pinyin is handled in the same way as the words: the hidden layer state $h_b$ of the initial node is used as one input, and the other input is the matched pinyin vector representation $x_{b,e}^p$. The calculation formula is as follows:
$i_{b,e}^p = \sigma(W^{i} x_{b,e}^p + U^{i} h_b + b^{i})$
$f_{b,e}^p = \sigma(W^{f} x_{b,e}^p + U^{f} h_b + b^{f})$
$\tilde{c}_{b,e}^p = \tanh(W^{c} x_{b,e}^p + U^{c} h_b + b^{c})$
$c_{b,e}^p = f_{b,e}^p \odot c_b + i_{b,e}^p \odot \tilde{c}_{b,e}^p$
In order to control the vector representations output by characters, pinyin and words, a gating mechanism is adopted to control the weights. A gate $g_{b,i}^w = \sigma(W^{g} x_i^c + U^{g} c_{b,i}^w + b^{g})$ weights the output of each word, and a gate $g_{b,i}^p = \sigma(W^{g} x_i^c + U^{g} c_{b,i}^p + b^{g})$ weights the output of each pinyin. All coefficients ending with $i$, namely $g_{b,i}^w$ and $g_{b,i}^p$, together with the input gate $i_i^c$ of character $c_i$, are then normalized so that the weight coefficients sum to one:
$\alpha_{b,i}^w = \exp(g_{b,i}^w)/Z_i$, $\alpha_{b,i}^p = \exp(g_{b,i}^p)/Z_i$, $\alpha_i^c = \exp(i_i^c)/Z_i$,
where $Z_i = \exp(i_i^c) + \sum_b \exp(g_{b,i}^w) + \sum_b \exp(g_{b,i}^p)$, realizing normalization. The calculation formula of each feature fusion node is therefore:
$c_i = \sum_b \alpha_{b,i}^w \odot c_{b,i}^w + \sum_b \alpha_{b,i}^p \odot c_{b,i}^p + \alpha_i^c \odot \tilde{c}_i$
Although $c_i$ and the gates are computed differently from a standard LSTM, the output form is consistent with a standard LSTM: a hidden layer output $h_i$ and a memory cell output $c_i$. In this way the information carried by words and pinyin is effectively fused into $h_i$ and $c_i$ and passed on to the following nodes.
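A numerical sketch of the gated fusion described above (the exponential normalization over all gate scores ending at position $i$; the concrete scores and dimensions are made-up illustrative values):

```python
import numpy as np

def fuse(cell_char, cells_matched, gate_char, gates_matched):
    """Fuse the character candidate cell with the word/pinyin cells whose
    spans end at the current position, using exponentially normalized gate
    scores so that the weight coefficients sum to one."""
    scores = np.array([gate_char] + gates_matched, dtype=float)
    weights = np.exp(scores) / np.exp(scores).sum()   # normalized: sums to 1
    cells = np.stack([cell_char] + cells_matched)     # shape (1 + k, dim)
    fused = (weights[:, None] * cells).sum(axis=0)    # weighted sum = c_i
    return fused, weights

dim = 4
c_tilde = np.ones(dim)            # character candidate cell, \tilde{c}_i
c_word = np.full(dim, 2.0)        # a matched word cell, c_{b,i}^w
c_pinyin = np.full(dim, 3.0)      # a matched pinyin cell, c_{b,i}^p

c_i, w = fuse(c_tilde, [c_word, c_pinyin], gate_char=0.0, gates_matched=[0.0, 0.0])
assert abs(w.sum() - 1.0) < 1e-9     # weights normalized to one
assert np.allclose(c_i, 2.0)         # equal gates: plain average of 1, 2, 3
```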
The hidden layer state sequence output by the Chinese spell checking model is $h_1, h_2, \ldots, h_m$. Probability analysis and calculation are carried out through the CRF layer, and the label sequence $y = l_1, l_2, \ldots, l_m$ with the maximum probability is output. The probability calculation formula is as follows:
$P(y \mid s) = \dfrac{\exp\big(\sum_{i} (W^{l_i} h_i + b^{(l_{i-1}, l_i)})\big)}{\sum_{y'} \exp\big(\sum_{i} (W^{l'_i} h_i + b^{(l'_{i-1}, l'_i)})\big)}$
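Choosing the label sequence $y$ with maximum probability under a linear-chain CRF can be sketched with a small Viterbi decoder over the two labels T and F (the emission and transition scores below are made-up illustrative numbers, not trained values):

```python
import numpy as np

LABELS = ["T", "F"]

def viterbi(emissions, transitions):
    """Find the max-probability label sequence under a linear-chain CRF.
    emissions: (m, 2) scores of each position for labels T/F;
    transitions: (2, 2) score of moving from label j to label k."""
    m = len(emissions)
    score = emissions[0].copy()
    back = np.zeros((m, 2), dtype=int)
    for i in range(1, m):
        # cand[j, k] = best score ending in label k, coming from label j
        cand = score[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final label
    path = [int(score.argmax())]
    for i in range(m - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return [LABELS[k] for k in reversed(path)]

# Toy scores: position 1 strongly prefers F, the rest prefer T.
emissions = np.array([[2.0, 0.0], [0.0, 3.0], [2.0, 0.0], [2.0, 0.0]])
transitions = np.zeros((2, 2))   # neutral transitions for the sketch
print(viterbi(emissions, transitions))  # ['T', 'F', 'T', 'T']
```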
other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (4)
1. A Chinese spelling checking method, comprising the steps of:
establishing a Chinese spell checking model;
setting Chinese spelling error check as sequence labeling task;
adding dynamic words and pinyin to train the Chinese spelling model;
the method comprises the steps of respectively inputting characters, words and pinyin into the trained Chinese spelling model, wherein the characters, the words and the pinyin are represented by a first embedding, a second embedding and a third embedding respectively, computed as: $x_i^c = e^c(c_i)$, $x_{b,e}^w = e^w(w_{b,e})$ and $x_{b,e}^p = e^p(p_{b,e})$, where $c_i$ is the $i$-th character of the input sentence, $x_i^c$ is the vector corresponding to character $c_i$, $x_{b,e}^w$ and $x_{b,e}^p$ respectively denote the word vector and pinyin vector of the substring $c_b, c_{b+1}, \ldots, c_e$, and $e^c$, $e^w$ and $e^p$ are the first, second and third mapping lookup tables for characters, words and pinyin respectively;
the characters, words and pinyin input to the Chinese spelling model are matched through the sequence labeling task, wherein
for each matched word spanning characters $c_b \ldots c_e$, the hidden state $h_b$ output by the node at its beginning and the vector representation $x_{b,e}^w$ of the currently matched word are taken as input, the target output being the word cell state $c_{b,e}^w$, which is fed as part of the input of the node with subscript $e$, calculated as:
$i_{b,e}^w = \sigma(W^{i} x_{b,e}^w + U^{i} h_b + b^{i})$, $f_{b,e}^w = \sigma(W^{f} x_{b,e}^w + U^{f} h_b + b^{f})$, $\tilde{c}_{b,e}^w = \tanh(W^{c} x_{b,e}^w + U^{c} h_b + b^{c})$, $c_{b,e}^w = f_{b,e}^w \odot c_b + i_{b,e}^w \odot \tilde{c}_{b,e}^w$;
the pinyin is handled in the same way as the words, with the hidden layer state $h_b$ of the initial node as one input and the matched pinyin vector representation $x_{b,e}^p$ as the other, calculated as:
$c_{b,e}^p = f_{b,e}^p \odot c_b + i_{b,e}^p \odot \tilde{c}_{b,e}^p$, with the gates $i_{b,e}^p$, $f_{b,e}^p$ and the candidate $\tilde{c}_{b,e}^p$ computed from $x_{b,e}^p$ and $h_b$ as for the words;
in order to control the vector representations output by characters, pinyin and words, a gating mechanism controls the weights: a gate $g_{b,i}^w = \sigma(W^{g} x_i^c + U^{g} c_{b,i}^w + b^{g})$ weights the output of each word and a gate $g_{b,i}^p = \sigma(W^{g} x_i^c + U^{g} c_{b,i}^p + b^{g})$ weights the output of each pinyin; all coefficients ending with $i$, namely $g_{b,i}^w$ and $g_{b,i}^p$ together with the input gate $i_i^c$ of character $c_i$, are normalized so that the weight coefficients sum to one:
$\alpha_{b,i}^w = \exp(g_{b,i}^w)/Z_i$, $\alpha_{b,i}^p = \exp(g_{b,i}^p)/Z_i$, $\alpha_i^c = \exp(i_i^c)/Z_i$, where $Z_i = \exp(i_i^c) + \sum_b \exp(g_{b,i}^w) + \sum_b \exp(g_{b,i}^p)$,
whereby the calculation formula of each feature fusion node is obtained:
$c_i = \sum_b \alpha_{b,i}^w \odot c_{b,i}^w + \sum_b \alpha_{b,i}^p \odot c_{b,i}^p + \alpha_i^c \odot \tilde{c}_i$;
the hidden layer state sequence output by the Chinese spell checking model is $h_1, h_2, \ldots, h_m$; probability analysis and calculation are carried out through the CRF layer, and the label sequence $y = l_1, l_2, \ldots, l_m$ with the maximum probability is output, the probability calculated as:
$P(y \mid s) = \exp\big(\sum_i (W^{l_i} h_i + b^{(l_{i-1}, l_i)})\big) \big/ \sum_{y'} \exp\big(\sum_i (W^{l'_i} h_i + b^{(l'_{i-1}, l'_i)})\big)$.
2. The method of claim 1, wherein the Chinese spell checking model is built based on a neural sequence model.
3. The method of claim 1, wherein the setting of Chinese spelling error checking as a sequence labeling task further comprises: each character $c_i$ is given a label $l_i \in \{T, F\}$, where $T$ and $F$ represent correct and incorrect characters respectively; a character marked $F$ is regarded as a wrong character; a plurality of characters $c_i$ form a sentence $s = c_1, c_2, \ldots, c_m$, where $c_i$ is the $i$-th character of sentence $s$ and $m$ is the length of the sentence.
4. The method of claim 1, wherein the matching of characters, words and pinyin input to the model through the sequence labeling task further comprises: for both words and pinyin, matching substrings of the original sentence against pre-trained vector tables, the vocabularies of which serve as pre-training dictionaries, denoted $D_w$ and $D_p$ for words and pinyin respectively, both pre-trained on a large-scale corpus using word2vec.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910646536.3A | 2019-07-17 | 2019-07-17 | Chinese spelling checking method
Publications (2)

Publication Number | Publication Date
---|---
CN112329446A | 2021-02-05
CN112329446B | 2023-05-23
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant