CN112329446B - Chinese spelling checking method - Google Patents

Publication number
CN112329446B
Authority
CN
China
Prior art keywords
pinyin
words
word
characters
chinese
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910646536.3A
Other languages
Chinese (zh)
Other versions
CN112329446A (en)
Inventor
段建勇
王昊
张梅
马东超
王冰
潘利建
袁阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China University of Technology
Original Assignee
North China University of Technology
Application filed by North China University of Technology
Priority to CN201910646536.3A
Publication of CN112329446A
Application granted
Publication of CN112329446B

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a Chinese spelling checking method comprising the following steps: establishing a Chinese spell checking model; setting Chinese spelling error checking as a sequence labeling task; adding dynamic words and pinyin to train the model; inputting characters, words and pinyin respectively into the trained model; and matching the characters, words and pinyin input to the model through the sequence labeling task. The invention effectively fuses the three features of characters, words and pinyin and provides an end-to-end error checking solution that requires no word segmentation. This avoids a cumbersome pipeline and makes the method more general and more adaptable across domains than traditional error checking methods.

Description

Chinese spelling checking method
Technical Field
The invention relates to the technical field of automatic text error checking, in particular to a Chinese spelling checking method.
Background
With the development of information processing technology, traditional text work has largely been taken over by computers, and with the development of the internet, electronic books, electronic newspapers, e-mail and the like have become part of daily life. Text errors, however, are increasingly common, and traditional manual proofreading, with its low efficiency, high labor intensity and long turnaround, clearly cannot meet the demand for spelling checks. Automatic text checking technology therefore affects the pace and development of the publishing industry, and research on it has important practical significance.
Chinese spell checking differs from English spell checking in two respects. First, English words are separated by natural delimiters such as spaces and commas, whereas Chinese has no obvious boundary between words. Second, most English errors are misspelled words, which can be detected directly by dictionary lookup; in Chinese every character is individually legal, so an error can be recognized only in context. In addition, the checking methods currently in use rely only on the features of the words and do not use pinyin features.
Disclosure of Invention
In order to overcome the problems in the related art, the embodiment of the invention provides a Chinese spelling checking method, which integrates the characteristics of characters, words and pinyin, does not need word segmentation and realizes end-to-end error checking.
The embodiment of the invention provides a Chinese spelling checking method, which comprises the following steps:
establishing a Chinese spell checking model;
setting Chinese spelling error check as sequence labeling task;
adding dynamic words and pinyin to train the model;
inputting characters, words and pinyin respectively into the trained model, wherein the characters, words and pinyin are represented by a first embedding, a second embedding and a third embedding respectively, according to the following formula:
[Formula image GDA0004177606680000021]
where c_i denotes the i-th character of the input sentence; the three embedded quantities are the vector corresponding to character c_i and, respectively, the word vector and the pinyin vector of the substring c_b, c_{b+1}, ..., c_e; and e_c, e_w and e_p denote the first, second and third embedding lookup tables for characters, words and pinyin respectively;
matching the characters, words and pinyin input to the model through the sequence labeling task, wherein:
for a word, the hidden-layer output of the previous node and the vector representation of the currently matched word are taken as input, and the resulting output is passed to the node with subscript e as part of that node's input, according to the following formula:
[Formula image GDA0004177606680000028]
Pinyin is handled in the same way as words: the hidden-layer state of the start node is used as one input, and the other input is the vector representation of the matched pinyin, according to the following formula:
[Formula image GDA00041776066800000210]
To control how much the vector representations of characters, pinyin and words contribute to the output, a gating mechanism controls their weights, according to the following formula:
[Formula image GDA00041776066800000211]
A weight is computed for the output of each word and for the output of each pinyin, and the weights are then normalized so that they sum to one:
[Formula image GDA0004177606680000031]
That is, all word and pinyin coefficients ending at index i, together with the input weight of character c_i, sum to one. This normalization yields the calculation formula of each feature fusion node:
[Formula image GDA0004177606680000035]
The hidden-layer state sequence output by the Chinese spell checking model is h_1, h_2, ..., h_m. A CRF layer performs probability analysis and calculation on this sequence and outputs the label sequence with the maximum probability, y = l_1, l_2, ..., l_m. The probability calculation formula is as follows:
[Formula image GDA0004177606680000036]
Further, the Chinese spell checking model is built on a neural sequence model.
Further, each character c_i is assigned a label l_i ∈ {T, F}, where T and F denote correct and incorrect characters respectively, and a character labeled F is regarded as an erroneous character. A plurality of characters c_i form a sentence, expressed as s = c_1, c_2, ..., c_m, where c_i denotes the i-th character of the sentence s and m denotes the length of the sentence.
Further, for both words and pinyin, substrings of the original sentence are matched against pre-trained word vector tables. The sets of entries in the pre-trained vector tables serve as pre-training dictionaries, denoted D_w and D_p for words and pinyin respectively, and both tables D_w and D_p are pre-trained on a large-scale corpus using word2vec.
The technical solution provided by the embodiments of the invention has the following beneficial effects: the features of characters, words and pinyin are effectively fused; no word segmentation is needed, so an end-to-end error checking solution is achieved and a cumbersome pipeline is avoided; and the method is more general and more adaptable across domains than traditional error checking methods.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a method of checking Chinese spelling in accordance with an embodiment of the invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and related applications, methods consistent with aspects of the invention as detailed in the accompanying claims.
FIG. 1 is a flow chart of a method for checking Chinese spellings according to an embodiment of the present invention, as shown in FIG. 1, comprising the steps of:
Step 101, establishing a Chinese spell checking model based on a neural sequence model.
Step 102, setting Chinese spelling error check as sequence labeling task.
Chinese spelling error checking is set as a sequence labeling task: each character c_i is assigned a label l_i ∈ {T, F}, where T and F denote correct and incorrect characters respectively, and a character labeled F is regarded as an erroneous character. The sentence is expressed as s = c_1, c_2, ..., c_m, where c_i denotes the i-th character of the sentence s and m denotes the length of the sentence.
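The labeling scheme of step 102 can be illustrated with a minimal sketch. It assumes, purely for illustration, that training labels are derived by comparing an observed sentence with an equal-length corrected sentence; the function name `label_characters` and both example sentences are hypothetical and do not come from the patent.

```python
def label_characters(observed: str, reference: str) -> list[tuple[str, str]]:
    """Pair each observed character c_i with a label l_i in {"T", "F"}."""
    # Assumption: only substitution errors occur, so both strings have length m.
    if len(observed) != len(reference):
        raise ValueError("this sketch assumes equal-length sentences")
    return [(c, "T" if c == r else "F") for c, r in zip(observed, reference)]


observed = "我今天很高心"   # hypothetical input whose last character is wrong
reference = "我今天很高兴"  # hypothetical corrected sentence
labeled = label_characters(observed, reference)
labels = "".join(label for _, label in labeled)
print(labels)  # one label per character of the sentence s
```

At inference time no reference sentence exists; the model itself predicts the T/F label for every character.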
Step 103, adding dynamic words and pinyin to train the model.
Step 104, inputting characters, words and pinyin respectively into the trained model.
The characters, words and pinyin are represented by a first embedding, a second embedding and a third embedding respectively, according to the following formula:
[Formula image GDA0004177606680000051]
where c_i denotes the i-th character of the input sentence; the three embedded quantities are the vector corresponding to character c_i and, respectively, the word vector and the pinyin vector of the substring c_b, c_{b+1}, ..., c_e; and e_c, e_w and e_p denote the first, second and third embedding lookup tables for characters, words and pinyin respectively.
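Because the embedding formula survives only as an image, the following is a hedged reconstruction from the surrounding definitions rather than the patent's own notation. The symbols x^c_i, x^w_{b,e} and x^p_{b,e} and the operator name pinyin(·) are assumptions; only c_i, the substring c_b, ..., c_e and the lookup tables e_c, e_w, e_p appear in the text.

```latex
\begin{aligned}
x^{c}_{i}   &= e_c(c_i), \\
x^{w}_{b,e} &= e_w(c_b c_{b+1} \cdots c_e), \\
x^{p}_{b,e} &= e_p\bigl(\operatorname{pinyin}(c_b c_{b+1} \cdots c_e)\bigr).
\end{aligned}
```

Under this reading, every character receives a vector, while word and pinyin vectors exist only for substrings that match an entry in the corresponding table.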
Step 105, matching the characters, words and pinyin input to the model through the sequence labeling task.
For both words and pinyin, substrings of the original sentence are matched against pre-trained word vector tables. The sets of entries in the pre-trained vector tables serve as pre-training dictionaries, denoted D_w and D_p for words and pinyin respectively, and both tables D_w and D_p are pre-trained on a large-scale corpus using word2vec.
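The substring matching described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the toy dictionaries `D_w` and `D_p`, the character-to-pinyin map `CHAR_TO_PINYIN`, the function `match_spans` and the `max_len` limit are all hypothetical, and toneless, space-joined pinyin strings are assumed as dictionary keys. In practice the dictionaries would be the key sets of the word2vec tables.

```python
# Hypothetical toy resources standing in for the pre-trained tables.
CHAR_TO_PINYIN = {"我": "wo", "很": "hen", "高": "gao", "心": "xin", "兴": "xing"}
D_w = {"高兴", "很高"}                      # word dictionary
D_p = {"gao xing", "hen gao", "gao xin"}   # pinyin dictionary ("gao xin" as in 高薪)


def match_spans(sentence, max_len=4):
    """Return matched (b, e, key) spans for words and pinyin; e is inclusive."""
    word_spans, pinyin_spans = [], []
    n = len(sentence)
    for b in range(n):
        # Consider substrings c_b..c_e of 2 to max_len characters.
        for e in range(b + 1, min(n, b + max_len)):
            sub = sentence[b:e + 1]
            if sub in D_w:
                word_spans.append((b, e, sub))
            py = " ".join(CHAR_TO_PINYIN.get(ch, "?") for ch in sub)
            if py in D_p:
                pinyin_spans.append((b, e, py))
    return word_spans, pinyin_spans


words, pinyins = match_spans("我很高心")
print(words)
print(pinyins)
```

No segmentation step is involved: every matching substring is kept, including overlapping ones, and the erroneous span 高心 still produces a pinyin match even though it is not a dictionary word.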
For a word, the hidden-layer output state of the previous node and the vector representation of the currently matched word are taken as input, and the resulting output is passed to the node with subscript e as part of that node's input, according to the following formula:
[Formula image GDA0004177606680000058]
Pinyin is handled in the same way as words: the hidden-layer state of the start node is used as one input, and the other input is the vector representation of the matched pinyin, according to the following formula:
[Formula image GDA00041776066800000510]
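The word and pinyin nodes described above resemble the word cell of lattice-style LSTM models for Chinese sequence labeling. Assuming the patent follows that formulation (the original formulas survive only as images), one plausible form is sketched below. Every symbol here is assumed notation rather than the patent's own: the gates i and f, the candidate state \tilde{c}, the parameters W^w, W^p, b^w, b^p, and the start-node states h^c_b and c^c_b.

```latex
\begin{aligned}
\begin{bmatrix} i^{w}_{b,e} \\ f^{w}_{b,e} \\ \tilde{c}^{w}_{b,e} \end{bmatrix}
  &= \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}
     \left( W^{w\top} \begin{bmatrix} x^{w}_{b,e} \\ h^{c}_{b} \end{bmatrix} + b^{w} \right),
  &\quad
c^{w}_{b,e} &= f^{w}_{b,e} \odot c^{c}_{b} + i^{w}_{b,e} \odot \tilde{c}^{w}_{b,e}, \\
\begin{bmatrix} i^{p}_{b,e} \\ f^{p}_{b,e} \\ \tilde{c}^{p}_{b,e} \end{bmatrix}
  &= \begin{bmatrix} \sigma \\ \sigma \\ \tanh \end{bmatrix}
     \left( W^{p\top} \begin{bmatrix} x^{p}_{b,e} \\ h^{c}_{b} \end{bmatrix} + b^{p} \right),
  &\quad
c^{p}_{b,e} &= f^{p}_{b,e} \odot c^{c}_{b} + i^{p}_{b,e} \odot \tilde{c}^{p}_{b,e}.
\end{aligned}
```

In this sketch each span starting at b and ending at e produces a cell state that is handed to the character node at position e, which matches the statement that the output becomes part of the input of the node with subscript e.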
To control how much the vector representations of characters, pinyin and words contribute to the output, a gating mechanism controls their weights, according to the following formula:
[Formula image GDA0004177606680000061]
A weight is computed for the output of each word and for the output of each pinyin, and the weights are then normalized so that they sum to one:
[Formula image GDA0004177606680000064]
That is, all word and pinyin coefficients ending at index i, together with the input weight of character c_i, sum to one. This normalization yields the calculation formula of each feature fusion node:
[Formula image GDA0004177606680000068]
The calculation differs from that of a standard LSTM, but the output form is the same: a hidden-layer state and a memory-cell output. In this way the information carried by words and pinyin is effectively fused into the hidden-layer state and the memory cell, which are passed to the next node as a reference.
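The normalization step can be made concrete with a small numerical sketch. The text states only that the character, word and pinyin weights are normalized to sum to one; the exponential (softmax-style) normalization used here is an assumption borrowed from lattice-style LSTMs, scalars stand in for vectors, and the function `fuse_cell` and all input values are hypothetical.

```python
import math


def fuse_cell(char_gate, char_candidate, word_cells, pinyin_cells):
    """Fuse one character candidate with word/pinyin cells ending at position i.

    word_cells and pinyin_cells are lists of (gate, cell_state) scalar pairs.
    Returns the normalized weights and the fused memory-cell value.
    """
    gates = [char_gate] + [g for g, _ in word_cells] + [g for g, _ in pinyin_cells]
    values = [char_candidate] + [c for _, c in word_cells] + [c for _, c in pinyin_cells]
    # Assumed softmax-style normalization: the weights sum to one.
    exps = [math.exp(g) for g in gates]
    total = sum(exps)
    alphas = [x / total for x in exps]
    fused = sum(a * v for a, v in zip(alphas, values))
    return alphas, fused


# One matched word and two matched pinyin spans end at this character.
alphas, fused = fuse_cell(0.2, 0.5, [(0.9, -0.3)], [(0.4, 0.1), (0.1, 0.7)])
print(sum(alphas), fused)
```

Because the weights form a convex combination, the fused cell value always lies between the smallest and largest contributing values; a character with no matched word or pinyin simply receives weight one.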
The hidden-layer state sequence output by the Chinese spell checking model is h_1, h_2, ..., h_m. A CRF layer performs probability analysis and calculation on this sequence and outputs the label sequence with the maximum probability, y = l_1, l_2, ..., l_m. The probability calculation formula is as follows:
[Formula image GDA0004177606680000071]
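Selecting the maximum-probability label sequence from a linear-chain CRF is commonly done with Viterbi decoding. The sketch below assumes that each hidden state h_i has already been turned into per-label emission scores and that the CRF holds one transition score per label pair; the function `viterbi` and all scores are hypothetical values chosen for illustration, not parameters from the patent.

```python
def viterbi(emissions, transitions, labels=("T", "F")):
    """Return the highest-scoring label path and its score.

    emissions: non-empty list of {label: score}, one dict per character.
    transitions: {(previous_label, current_label): score}.
    """
    best = [{l: (emissions[0][l], [l]) for l in labels}]
    for em in emissions[1:]:
        step = {}
        for cur in labels:
            # Best way to reach label `cur` from any previous label.
            score, path = max(
                (best[-1][prev][0] + transitions[(prev, cur)] + em[cur],
                 best[-1][prev][1])
                for prev in labels
            )
            step[cur] = (score, path + [cur])
        best.append(step)
    score, path = max(best[-1].values())
    return path, score


emissions = [{"T": 2.0, "F": 0.1}, {"T": 1.5, "F": 0.3}, {"T": 0.2, "F": 1.8}]
transitions = {("T", "T"): 0.5, ("T", "F"): 0.0, ("F", "T"): 0.0, ("F", "F"): -0.5}
path, score = viterbi(emissions, transitions)
print(path, score)
```

The path with the highest unnormalized score is also the path with the highest CRF probability, since every path shares the same normalizing denominator.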
other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (4)

1. A method for checking Chinese spelling, comprising the steps of:
establishing a Chinese spell checking model;
setting Chinese spelling error check as sequence labeling task;
adding dynamic words and pinyin to train the Chinese spelling model;
inputting characters, words and pinyin respectively into the trained Chinese spelling model, wherein the characters, words and pinyin are represented by a first embedding, a second embedding and a third embedding respectively, according to the following formula:
[Formula image FDA0004177606670000011]
where c_i denotes the i-th character of the input sentence; the three embedded quantities are the vector corresponding to character c_i and, respectively, the word vector and the pinyin vector of the substring c_b, c_{b+1}, ..., c_e; and e_c, e_w and e_p denote the first, second and third embedding lookup tables for characters, words and pinyin respectively;
matching the characters, words and pinyin input to the Chinese spelling model through the sequence labeling task, wherein:
for a word, the hidden-layer output of the previous node and the vector representation of the currently matched word are taken as input, and the resulting output is passed to the node with subscript e as part of that node's input, according to the following formula:
[Formula image FDA0004177606670000018]
pinyin is handled in the same way as words: the hidden-layer state of the start node is used as one input, and the other input is the vector representation of the matched pinyin, according to the following formula:
[Formula image FDA00041776066700000110]
to control how much the vector representations of characters, pinyin and words contribute to the output, a gating mechanism controls their weights, according to the following formula:
[Formula image FDA0004177606670000021]
a weight is computed for the output of each word and for the output of each pinyin, and the weights are then normalized so that they sum to one:
[Formula image FDA0004177606670000024]
that is, all word and pinyin coefficients ending at index i, together with the input weight of character c_i, sum to one, and this normalization yields the calculation formula of each feature fusion node:
[Formula image FDA0004177606670000028]
the hidden-layer state sequence output by the Chinese spell checking model is h_1, h_2, ..., h_m; a CRF layer performs probability analysis and calculation on this sequence and outputs the label sequence with the maximum probability, y = l_1, l_2, ..., l_m, the probability calculation formula being as follows:
[Formula image FDA0004177606670000029]
2. The method of claim 1, wherein the Chinese spell checking model is built on a neural sequence model.
3. The method of claim 1, wherein setting Chinese spelling error checking as a sequence labeling task further comprises: assigning each character c_i a label l_i ∈ {T, F}, where T and F denote correct and incorrect characters respectively, and a character labeled F is regarded as an erroneous character; a plurality of characters c_i form a sentence, expressed as s = c_1, c_2, ..., c_m, where c_i denotes the i-th character of the sentence s and m denotes the length of the sentence.
4. The method of claim 1, wherein matching the characters, words and pinyin input to the model through the sequence labeling task further comprises: for both words and pinyin, matching substrings of the original sentence against pre-trained word vector tables, the sets of entries in the pre-trained vector tables serving as pre-training dictionaries, denoted D_w and D_p for words and pinyin respectively, wherein both tables D_w and D_p are pre-trained on a large-scale corpus using word2vec.
CN201910646536.3A 2019-07-17 2019-07-17 Chinese spelling checking method Active CN112329446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910646536.3A CN112329446B (en) 2019-07-17 2019-07-17 Chinese spelling checking method

Publications (2)

Publication Number Publication Date
CN112329446A CN112329446A (en) 2021-02-05
CN112329446B true CN112329446B (en) 2023-05-23

Family

ID=74319458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910646536.3A Active CN112329446B (en) 2019-07-17 2019-07-17 Chinese spelling checking method

Country Status (1)

Country Link
CN (1) CN112329446B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103324621A (en) * 2012-03-21 2013-09-25 北京百度网讯科技有限公司 Method and device for correcting spelling of Thai texts
CN107608963A (en) * 2017-09-12 2018-01-19 马上消费金融股份有限公司 Chinese error correction method, device and equipment based on mutual information and storage medium
CN108563632A (en) * 2018-03-29 2018-09-21 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting character spelling errors
CN109492202A (en) * 2018-11-12 2019-03-19 浙江大学山东工业技术研究院 A kind of Chinese error correction of coding and decoded model based on phonetic
CN109918489A (en) * 2019-02-28 2019-06-21 上海乐言信息科技有限公司 A kind of knowledge question answering method and system of more strategy fusions

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Peng Jin 等.Integrating Pinyin to Improve Spelling Errors Detection for Chinese Language.2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT).2014,455-458. *
Zijia Han 等.Chinese Spelling Check based on Sequence Labeling.2019 International Conference on Asian Language Processing (IALP).2020,373-378. *
卓利艳.字词级中文文本自动校对的方法研究.中国优秀硕士学位论文全文数据库信息科技辑.2018,I138-1931. *
张松磊.中文拼写检错和纠错算法的优化及实现.中国优秀硕士学位论文全文数据库信息科技辑.2019,I138-1882. *
王冰.基于深度学习的文本校对方法研究.中国优秀硕士学位论文全文数据库信息科技辑.2021,I138-2724. *
陈欢 ; 张奇 ; .基于话题翻译模型的双语文本纠错.计算机应用与软件.2016,第33卷(第3期),284-287. *

Also Published As

Publication number Publication date
CN112329446A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
US9087047B2 (en) Text proofreading apparatus and text proofreading method using post-proofreading sentence with highest degree of similarity
Blitzer et al. Domain adaptation with structural correspondence learning
CN109033080B (en) Medical term standardization method and system based on probability transfer matrix
CN107870901B (en) Method, recording medium, apparatus and system for generating similar text from translation source text
CN109325229B (en) Method for calculating text similarity by utilizing semantic information
US20100094614A1 (en) Machine Learning for Transliteration
CN108363688B (en) Named entity linking method fusing prior information
CN110489554B (en) Attribute-level emotion classification method based on location-aware mutual attention network model
CN103324621A (en) Method and device for correcting spelling of Thai texts
CN110851601A (en) Cross-domain emotion classification system and method based on layered attention mechanism
Xiong et al. HANSpeller: a unified framework for Chinese spelling correction
CN115935959A (en) Method for labeling low-resource glue word sequence
Mon et al. SymSpell4Burmese: symmetric delete Spelling correction algorithm (SymSpell) for burmese spelling checking
Kaur et al. Hybrid approach for spell checker and grammar checker for Punjabi
CN112329446B (en) Chinese spelling checking method
Abu Bakar et al. NUWT: Jawi-specific Buckwalter corpus for Malays word tokenization
CN106339367A (en) Method for automatically correcting Mongolian
Karimi Machine transliteration of proper names between English and Persian
Malecha et al. Maximum entropy part-of-speech tagging in nltk
Li et al. The first international ancient Chinese word segmentation and POS tagging bakeoff: Overview of the EvaHan 2022 evaluation campaign
Wibowo et al. Spelling checker of words in rejang language using the n-gram and euclidean distance methods
Zhang et al. CMMC-BDRC solution to the NLP-TEA-2018 Chinese grammatical error diagnosis task
CN109657207B (en) Formatting processing method and processing device for clauses
Hasan et al. SweetCoat-2D: Two-Dimensional Bangla Spelling Correction and Suggestion Using Levenshtein Edit Distance and String Matching Algorithm
Swaroop et al. Parts of speech tagging for Kannada

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant