CN105045778A - Chinese homonym error auto-proofreading method - Google Patents

Chinese homonym error auto-proofreading method

Info

Publication number
CN105045778A
Authority
CN
China
Prior art keywords
homonym
chinese
word
sentence
mistake
Prior art date
Legal status
Granted
Application number
CN201510354692.4A
Other languages
Chinese (zh)
Other versions
CN105045778B (en)
Inventor
吴健康
严熙
刘亮亮
Current Assignee
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN201510354692.4A
Publication of CN105045778A
Application granted
Publication of CN105045778B
Status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention discloses a Chinese homonym error auto-proofreading method. The method comprises: first, generating a confusion set of Chinese homonyms; training on a large-scale Web corpus to collect statistics for a left-adjacent bigram model, a right-adjacent bigram model, and an adjacent trigram model; obtaining local adjacent N-gram models by using the confusion set of Chinese homonyms and a probability estimation algorithm; using a weighted combination method to calculate the sentence-context support of each word in a sentence and of every homonym in the homonym confusion set corresponding to that word, thereby determining whether a homonym error exists; and marking the homonym error and providing a correction suggestion list, so as to implement automatic proofreading of Chinese homonyms. The Chinese homonym error auto-proofreading method provided by the present invention responds quickly and achieves high efficiency and accuracy, meeting the precision requirements of practical applications.

Description

Chinese homonym error auto-proofreading method
Technical field
The present invention relates to natural language processing in the field of artificial intelligence, and in particular to automatic proofreading of Chinese texts.
Background technology
With the rapid development of information processing technology and the Internet, traditional text work has been almost entirely replaced by computers. Electronic texts such as e-books, electronic newspapers, e-mail, office documents, blogs, and microblogs have become part of people's daily life, but errors in these texts are also increasing, which poses a great challenge to proofreading. Traditional manual proofreading is inefficient, labor-intensive, and slow, and clearly cannot meet the demand for text proofreading.
Automatic text proofreading is one of the main applications of natural language processing and also a difficult problem in natural language understanding. Chinese is entered into computers through input methods. As more and more people use pinyin input methods, which can enter whole words and phrases, more and more homonym errors appear in texts; homonym errors belong to the category of real-word errors. Automatic proofreading of Chinese real-word errors still faces the following problems:
1) The word involved in a real-word error is itself a word in the dictionary, which is the main difficulty of automatic Chinese text proofreading.
2) A real-word error disturbs the syntax and semantics of the whole sentence, so detecting real-word errors requires a great deal of knowledge and resources.
3) Data sparseness is a major obstacle to automatic proofreading of real-word errors.
4) Automatic homonym proofreading comprises automatic error detection and automatic error correction: automatic error detection finds the homonym errors in a sentence, while automatic error correction corrects those errors and provides correction suggestions. At present, many methods handle automatic error detection and automatic error correction as two separate stages.
To address the above problems, the present invention proposes and implements a method for automatic detection and proofreading of Chinese homonym errors.
Summary of the invention
Object of the invention: To overcome the deficiencies of the prior art, the present invention provides a Chinese homonym error auto-proofreading method that integrates automatic error detection and automatic proofreading.
Technical scheme:
To solve the above technical problems, the present invention provides a Chinese homonym error auto-proofreading method, which performs automatic proofreading of Chinese homonym errors based on a homonym confusion set, weighted local adjacent N-gram models, and a combined decision method. The method comprises the following steps:
1) Annotate Chinese words with pinyin and establish a homonym confusion set of Chinese words;
2) Establish local adjacent N-gram models of the left bigram, right bigram, and trigram; based on the homonym confusion set obtained in step 1), perform probability estimation on the local adjacent N-gram models with a probability estimation algorithm, and obtain the local adjacent N-gram models through large-scale corpus training;
3) Based on the local adjacent N-gram models obtained in step 2), use the weighted combination method to calculate the context support in the sentence of each word and of every homonym in its corresponding homonym confusion set, determine whether a homonym error exists, mark the homonym error, and provide a correction suggestion list.
Preferably, step 1) comprises: using a Chinese pinyin table and a Chinese dictionary, generating the homonym confusion set:
CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n};
wherein W_i is a Chinese word and each W_i^j is a homonym of W_i.
Preferably, the sources of the homonym confusion set in step 1) comprise two parts: an automatic identification part and a manual proofreading part;
The automatic identification part comprises the following steps:
Step 11) Read the Chinese dictionary and load the Chinese words in the dictionary into a Chinese word structure;
Step 12) Read the pinyin annotations of the Chinese pinyin table into a pinyin structure;
Step 13) Combining the Chinese words obtained in step 11) with the pinyin obtained in step 12), convert the Chinese words in the dictionary into pinyin and put them into a homonym structure to generate a homonym dictionary structure, i.e., the homonym confusion set;
The manual proofreading part comprises: manually proofreading the homonym confusion set obtained in step 13) and updating the homonym confusion set;
The structure of the homonym confusion set is as follows:
CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n};
wherein W_i is a word and each W_i^j is a homonym of W_i.
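For illustration only, the automatic identification part (steps 11 to 13) could be sketched in Python roughly as follows; the line format of the pinyin table, the file handling, and all function names are assumptions made for the sketch rather than part of the invention, and the manual proofreading part would still edit the resulting confusion sets afterwards.

from collections import defaultdict

def load_pinyin_table(path):
    # Read a character-to-pinyin table; each line is assumed to hold "character pinyin".
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                table[parts[0]] = parts[1]
    return table

def word_to_pinyin(word, pinyin_table):
    # Build a pronunciation key for a word by concatenating its characters' pinyin.
    return " ".join(pinyin_table.get(ch, ch) for ch in word)

def build_confusion_sets(dictionary_words, pinyin_table):
    # Group dictionary words by identical pronunciation; the confusion set CSet(W_i)
    # of a word is the set of other dictionary words with the same pinyin (step 13).
    by_pinyin = defaultdict(set)
    for w in dictionary_words:
        by_pinyin[word_to_pinyin(w, pinyin_table)].add(w)
    return {w: group - {w}
            for group in by_pinyin.values() if len(group) > 1
            for w in group}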
Preferably, step 2) comprises the following steps:
Step 21) Based on a large-scale Web corpus, establish the local adjacent N-gram models of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram: segment the sentences in the corpus into words; for example, segmenting a sentence L gives L = W_1 W_2 ... W_{i-1} W_i W_{i+1} ... W_n. For a word W_i,
The left-adjacent bigram is: LeftBiGram(W_i) = W_{i-1} W_i;
The right-adjacent bigram is: RightBiGram(W_i) = W_i W_{i+1};
The adjacent trigram is: TriGram(W_i) = W_{i-1} W_i W_{i+1};
Step 22) Based on the large-scale Web corpus and the homonym confusion set CSet(W_i), count the left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams of all words in the homonym confusion sets together with their co-occurrence frequencies, wherein CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n} and each W_i^j is a homonym of W_i;
Step 23) Based on the homonym confusion set CSet(W_i), perform probability estimation on the local adjacent N-gram models of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram, thereby generating local adjacent N-gram models that contain probability estimates; wherein
The probability estimate of the left-adjacent bigram is:
P_left(W_i | W_{i-1}) = Count(W_{i-1} W_i) / (Count(W_{i-1} W_i) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k))   (1);
The probability estimate of the right-adjacent bigram is:
P_right(W_i | W_{i+1}) = Count(W_i W_{i+1}) / (Count(W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_i^k W_{i+1}))   (2);
The probability estimate of the adjacent trigram is:
P_tri(W_i | W_{i-1} W_{i+1}) = Count(W_{i-1} W_i W_{i+1}) / (Count(W_{i-1} W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k W_{i+1}))   (3);
Wherein W_i^k ∈ CSet(W_i); Count(W_{i-1} W_i), Count(W_i W_{i+1}), and Count(W_{i-1} W_i W_{i+1}) denote the co-occurrence frequencies of W_{i-1} W_i, W_i W_{i+1}, and W_{i-1} W_i W_{i+1} in the corpus, and Count(W_{i-1} W_i^k), Count(W_i^k W_{i+1}), and Count(W_{i-1} W_i^k W_{i+1}) denote the corresponding co-occurrence frequencies with W_i replaced by its homonym W_i^k.
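As a non-authoritative sketch of step 22) and formulas (1) to (3) (and their confusion-set counterparts (6) to (8) below), the counting and probability estimation might be implemented in Python as follows; it assumes the Web corpus is already word-segmented, and the parameter group stands for the original word W_i together with its confusion set CSet(W_i), which is exactly the set the denominators sum over.

from collections import Counter

def count_local_ngrams(segmented_sentences):
    # Count adjacent bigram and trigram co-occurrence frequencies over a
    # word-segmented corpus (step 22); each sentence is a list of words.
    bigram, trigram = Counter(), Counter()
    for words in segmented_sentences:
        for i in range(len(words) - 1):
            bigram[(words[i], words[i + 1])] += 1
        for i in range(len(words) - 2):
            trigram[(words[i], words[i + 1], words[i + 2])] += 1
    return bigram, trigram

def p_left(candidate, left, bigram, group):
    # Formulas (1)/(6): the numerator is the candidate's left-adjacent bigram count;
    # the denominator sums the counts of the original word and all of its homonyms.
    denom = sum(bigram[(left, g)] for g in group)
    return bigram[(left, candidate)] / denom if denom else 0.0

def p_right(candidate, right, bigram, group):
    # Formulas (2)/(7): right-adjacent bigram probability estimate.
    denom = sum(bigram[(g, right)] for g in group)
    return bigram[(candidate, right)] / denom if denom else 0.0

def p_tri(candidate, left, right, trigram, group):
    # Formulas (3)/(8): adjacent trigram probability estimate.
    denom = sum(trigram[(left, g, right)] for g in group)
    return trigram[(left, candidate, right)] / denom if denom else 0.0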
Preferably, step 3) comprises the following steps:
Step 31) Segment the sentence S to be proofread into words; traverse each word W_i in the segmented sentence S and determine whether it has a homonym confusion set CSet(W_i); process every word that has a homonym confusion set with step 32), until all words in the sentence have been traversed;
Step 32) If W_i has CSet(W_i), then, based on the local adjacent N-gram models obtained in step 2), use the weighted combination method to calculate the context support in the sentence of the word and of every homonym in its corresponding homonym confusion set, determine whether a homonym error exists, mark the homonym error, and provide a correction suggestion list. Specifically:
Step 32-1) Calculate the context support of the word W_i in sentence S with the combined scoring function Score:
Score(W_i) = α_1 * P_left(W_i | W_{i-1}) + α_2 * P_right(W_i | W_{i+1}) + α_3 * P_tri(W_i | W_{i-1} W_{i+1})   (4);
Wherein α_1 + α_2 + α_3 = 1, α_1 > 0, α_2 > 0, α_3 > 0, and α_1, α_2, α_3 denote the weights of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram, respectively;
Step 32-2) Calculate with the combined scoring function Score the context support in sentence S of each homonym W_i^j in the homonym confusion set corresponding to W_i:
Score(W_i^j) = α_1 * P_left(W_i^j | W_{i-1}) + α_2 * P_right(W_i^j | W_{i+1}) + α_3 * P_tri(W_i^j | W_{i-1} W_{i+1})   (5);
Wherein W_i^j ∈ CSet(W_i);
P_left(W_i^j | W_{i-1}) = Count(W_{i-1} W_i^j) / (Count(W_{i-1} W_i) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k))   (6);
P_right(W_i^j | W_{i+1}) = Count(W_i^j W_{i+1}) / (Count(W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_i^k W_{i+1}))   (7);
P_tri(W_i^j | W_{i-1} W_{i+1}) = Count(W_{i-1} W_i^j W_{i+1}) / (Count(W_{i-1} W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k W_{i+1}))   (8);
Step 32-3) Sort W_i and every homonym in CSet(W_i) by their context support Score;
Step 32-4) If Score(W_i) = 0, mark W_i as erroneous and list the homonyms W_i^j in descending Score order as the correction suggestion list; otherwise go to step 32-5);
Step 32-5) If Score(W_i) > 0 and Score(W_i) < β * max_{W_i^j ∈ CSet(W_i)} Score(W_i^j), mark W_i as erroneous and list the corresponding homonyms W_i^j as the correction suggestion list; otherwise mark W_i as a correct word, wherein β is the probability that a word is mistyped as one of its homonyms.
Preferably, in steps 32-1) and 32-2), the weight of the left-adjacent bigram is α_1 = 0.25, the weight of the right-adjacent bigram is α_2 = 0.25, and the weight of the adjacent trigram is α_3 = 0.5.
Preferably, in step 32-5), the probability β that a word is mistyped as one of its homonyms satisfies β ≤ 0.01.
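The decision procedure of steps 32-1) to 32-5), with the preferred weights and the preferred value of β given above, could then be sketched as follows; it reuses p_left, p_right, and p_tri from the previous sketch, and the empty-string padding at sentence boundaries is only an assumption about how boundary words are handled, which the description does not specify.

ALPHA = (0.25, 0.25, 0.5)   # preferred weights for left bigram, right bigram, trigram
BETA = 0.01                 # preferred probability that a word is mistyped as a homonym

def score(candidate, left, right, bigram, trigram, group):
    # Formulas (4)/(5): weighted combination of the three local probability estimates.
    a1, a2, a3 = ALPHA
    return (a1 * p_left(candidate, left, bigram, group)
            + a2 * p_right(candidate, right, bigram, group)
            + a3 * p_tri(candidate, left, right, trigram, group))

def check_word(i, words, csets, bigram, trigram):
    # Steps 32-1) to 32-5): return (is_error, suggestion_list) for the word at position i.
    w = words[i]
    cset = csets.get(w)
    if not cset:
        return False, []                      # step 31): no confusion set, skip the word
    left = words[i - 1] if i > 0 else ""
    right = words[i + 1] if i + 1 < len(words) else ""
    group = {w} | cset
    s_w = score(w, left, right, bigram, trigram, group)
    ranked = sorted(cset, key=lambda h: score(h, left, right, bigram, trigram, group),
                    reverse=True)             # step 32-3): sort homonyms by context support
    if s_w == 0:
        return True, ranked                   # step 32-4): zero context support for W_i
    best = score(ranked[0], left, right, bigram, trigram, group)
    if s_w < BETA * best:
        return True, ranked                   # step 32-5): far below the best homonym
    return False, []                          # otherwise W_i is taken to be correct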
Beneficial effects: The present invention proposes a Chinese homonym error auto-proofreading method. The method uses the homonym confusion set together with a weighted combination of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram to judge the homonyms in a sentence, identify homonym errors, and provide correction suggestions for those errors, integrating automatic error detection and automatic proofreading. Experiments show that the recall of the automatic homonym-error proofreading method provided by the present invention reaches 81.2% and its precision reaches 75.6%; the system responds quickly and offers high precision, validity, and accuracy, meeting the demands of practical applications and having high practicality.
Brief description of the drawings
Fig. 1 is a flow chart of the automatic homonym-error proofreading method.
Embodiment
The present invention is further described below in conjunction with the drawings and embodiments.
The Chinese homonym error auto-proofreading method provided by the present invention performs automatic proofreading of Chinese homonym errors based on a homonym confusion set, weighted local adjacent N-gram models, and a combined decision method. The method comprises the following steps: 1) Establish a homonym confusion set: annotate Chinese words with pinyin and establish the homonym confusion set of Chinese words.
As shown in Fig. 1, the homonym confusion set is generated using a Chinese pinyin table and a Chinese dictionary:
CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n};
wherein W_i is a Chinese word and each W_i^j is a homonym of W_i.
In this embodiment, the sources of the homonym confusion set comprise two parts: an automatic identification part and a manual proofreading part;
The automatic identification part comprises the following steps:
Step 11) Read the Chinese dictionary and load the Chinese words in the dictionary into a Chinese word structure;
Step 12) Read the pinyin annotations of the Chinese pinyin table into a pinyin structure;
Step 13) Combining the Chinese words obtained in step 11) with the pinyin obtained in step 12), convert the Chinese words in the dictionary into pinyin and put them into a homonym structure to generate a homonym dictionary structure, i.e., the homonym confusion set;
The manual proofreading part comprises: manually proofreading the homonym confusion set obtained in step 13) and updating the homonym confusion set.
2) Establish local adjacent N-gram models of the left bigram, right bigram, and trigram; based on the homonym confusion set obtained in step 1), perform probability estimation on the local adjacent N-gram models with a probability estimation algorithm, and obtain the local adjacent N-gram models through large-scale corpus training. Specifically:
Step 21) Based on a large-scale Web corpus, establish the local adjacent N-gram models of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram: segment the sentences in the corpus into words; for example, segmenting a sentence L gives L = W_1 W_2 ... W_{i-1} W_i W_{i+1} ... W_n. For a word W_i,
The left-adjacent bigram is: LeftBiGram(W_i) = W_{i-1} W_i;
The right-adjacent bigram is: RightBiGram(W_i) = W_i W_{i+1};
The adjacent trigram is: TriGram(W_i) = W_{i-1} W_i W_{i+1};
Step 22) Based on the large-scale Web corpus and the homonym confusion set CSet(W_i), count the left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams of all words in the homonym confusion sets together with their co-occurrence frequencies, wherein CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n} and each W_i^j is a homonym of W_i;
Step 23) Based on the homonym confusion set CSet(W_i), perform probability estimation on the local adjacent N-gram models of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram, thereby generating local adjacent N-gram models that contain probability estimates; wherein
The probability estimate of the left-adjacent bigram is:
P_left(W_i | W_{i-1}) = Count(W_{i-1} W_i) / (Count(W_{i-1} W_i) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k))   (1);
The probability estimate of the right-adjacent bigram is:
P_right(W_i | W_{i+1}) = Count(W_i W_{i+1}) / (Count(W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_i^k W_{i+1}))   (2);
The probability estimate of the adjacent trigram is:
P_tri(W_i | W_{i-1} W_{i+1}) = Count(W_{i-1} W_i W_{i+1}) / (Count(W_{i-1} W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k W_{i+1}))   (3);
Wherein W_i^k ∈ CSet(W_i); Count(W_{i-1} W_i), Count(W_i W_{i+1}), and Count(W_{i-1} W_i W_{i+1}) denote the co-occurrence frequencies of W_{i-1} W_i, W_i W_{i+1}, and W_{i-1} W_i W_{i+1} in the corpus, and Count(W_{i-1} W_i^k), Count(W_i^k W_{i+1}), and Count(W_{i-1} W_i^k W_{i+1}) denote the corresponding co-occurrence frequencies with W_i replaced by its homonym W_i^k.
3) Based on the local adjacent N-gram models obtained in step 2), use the weighted combination method to calculate the context support in the sentence of each word and of every homonym in its corresponding homonym confusion set, determine whether a homonym error exists, mark the homonym error, and provide a correction suggestion list. As shown in Fig. 1, this is specifically:
Step 31) Segment the sentence S to be proofread into words; traverse each word W_i in the segmented sentence S and determine whether it has a homonym confusion set CSet(W_i); process every word that has a homonym confusion set with step 32), until all words in the sentence have been traversed;
Step 32) If W_i has CSet(W_i), then, based on the local adjacent N-gram models obtained in step 2), use the weighted combination method to calculate the context support in the sentence of the word and of every homonym in its corresponding homonym confusion set, determine whether a homonym error exists, mark the homonym error, and provide a correction suggestion list. Specifically:
Step 32-1) Calculate the context support of the word W_i in sentence S with the combined scoring function Score:
Score(W_i) = α_1 * P_left(W_i | W_{i-1}) + α_2 * P_right(W_i | W_{i+1}) + α_3 * P_tri(W_i | W_{i-1} W_{i+1})   (4);
Wherein α_1 + α_2 + α_3 = 1, α_1 > 0, α_2 > 0, α_3 > 0, and α_1, α_2, α_3 denote the weights of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram, respectively. In this embodiment, α_1 = α_2 = 0.25 and α_3 = 0.5; these values can of course be adjusted according to actual needs.
Step 32-2) Calculate with the combined scoring function Score the context support in sentence S of each homonym W_i^j in the homonym confusion set corresponding to W_i:
Score(W_i^j) = α_1 * P_left(W_i^j | W_{i-1}) + α_2 * P_right(W_i^j | W_{i+1}) + α_3 * P_tri(W_i^j | W_{i-1} W_{i+1})   (5);
Wherein W_i^j ∈ CSet(W_i);
P_left(W_i^j | W_{i-1}) = Count(W_{i-1} W_i^j) / (Count(W_{i-1} W_i) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k))   (6);
P_right(W_i^j | W_{i+1}) = Count(W_i^j W_{i+1}) / (Count(W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_i^k W_{i+1}))   (7);
P_tri(W_i^j | W_{i-1} W_{i+1}) = Count(W_{i-1} W_i^j W_{i+1}) / (Count(W_{i-1} W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k W_{i+1}))   (8);
Step 32-3) Sort W_i and every homonym in CSet(W_i) by their context support Score;
Step 32-4) If Score(W_i) = 0, mark W_i as erroneous and list the homonyms W_i^j in descending Score order as the correction suggestion list; otherwise go to step 32-5);
Step 32-5) If Score(W_i) > 0 and Score(W_i) < β * max_{W_i^j ∈ CSet(W_i)} Score(W_i^j), mark W_i as erroneous and list the corresponding homonyms W_i^j as the correction suggestion list; otherwise mark W_i as a correct word, wherein β is the probability that a word is mistyped as one of its homonyms, usually β ≤ 0.01; in this embodiment β = 0.01.
Experiment:
After repeated open tests with a test corpus of 10,000 sentences, in which 600 homonym errors were manually introduced, and with the parameters given in this embodiment as the experimental parameters, the experiments show that the recall of the automatic homonym-error proofreading method provided by the present invention reaches 81.2% and its precision reaches 75.6%. This precision exceeds that of the prior art and meets the demands of practical applications, giving the method high validity and accuracy.
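For orientation only, the sketches given earlier could be wired together roughly as follows; the file names, the placeholder sentence, and the reliance on an external word segmenter are illustrative assumptions rather than part of the patented method.

pinyin_table = load_pinyin_table("pinyin_table.txt")
dictionary = [line.strip() for line in open("chinese_dictionary.txt", encoding="utf-8")]
corpus = [line.split() for line in open("web_corpus_segmented.txt", encoding="utf-8")]

csets = build_confusion_sets(dictionary, pinyin_table)   # step 1): homonym confusion sets
bigram, trigram = count_local_ngrams(corpus)             # step 2): local adjacent N-gram counts

sentence = ["这", "是", "一个", "占位", "例句"]            # step 3): output of an external word segmenter
for i in range(len(sentence)):
    is_error, suggestions = check_word(i, sentence, csets, bigram, trigram)
    if is_error:
        print(f"possible homonym error: {sentence[i]} -> suggestions: {suggestions}")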
The above embodiment is only a preferred embodiment of the present invention and does not limit the present invention. Any modification, equivalent replacement, or improvement made by those skilled in the art within the scope of the technical idea of the present invention falls within the protection scope of the present invention.

Claims (7)

1. A Chinese homonym error auto-proofreading method, characterized in that automatic proofreading of Chinese homonym errors is performed based on a homonym confusion set, weighted local adjacent N-gram models, and a combined decision method, the method comprising the following steps:
1) Annotating Chinese words with pinyin and establishing a homonym confusion set of Chinese words;
2) Establishing local adjacent N-gram models of the left bigram, right bigram, and trigram; based on the homonym confusion set obtained in step 1), performing probability estimation on the local adjacent N-gram models with a probability estimation algorithm, and obtaining the local adjacent N-gram models through large-scale corpus training;
3) Based on the local adjacent N-gram models obtained in step 2), using the weighted combination method to calculate the context support in the sentence of each word and of every homonym in its corresponding homonym confusion set, determining whether a homonym error exists, marking the homonym error, and providing a correction suggestion list.
2. The Chinese homonym error auto-proofreading method according to claim 1, characterized in that step 1) comprises: using a Chinese pinyin table and a Chinese dictionary, generating the homonym confusion set:
CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n};
wherein W_i is a Chinese word and each W_i^j is a homonym of W_i.
3. The Chinese homonym error auto-proofreading method according to claim 1, characterized in that the sources of the homonym confusion set in step 1) comprise two parts: an automatic identification part and a manual proofreading part;
wherein the automatic identification part comprises the following steps:
Step 11) Reading the Chinese dictionary and loading the Chinese words in the dictionary into a Chinese word structure;
Step 12) Reading the pinyin annotations of the Chinese pinyin table into a pinyin structure;
Step 13) Combining the Chinese words obtained in step 11) with the pinyin obtained in step 12), converting the Chinese words in the dictionary into pinyin and putting them into a homonym structure to generate a homonym dictionary structure, i.e., the homonym confusion set;
wherein the manual proofreading part comprises: manually proofreading the homonym confusion set obtained in step 13) and updating the homonym confusion set;
the structure of the homonym confusion set being as follows:
CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n};
wherein W_i is a word and each W_i^j is a homonym of W_i.
4. The Chinese homonym error auto-proofreading method according to claim 1, characterized in that step 2) comprises the following steps:
Step 21) Based on a large-scale Web corpus, establishing the local adjacent N-gram models of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram: segmenting the sentences in the corpus into words; for example, segmenting a sentence L gives L = W_1 W_2 ... W_{i-1} W_i W_{i+1} ... W_n; for a word W_i,
the left-adjacent bigram is: LeftBiGram(W_i) = W_{i-1} W_i;
the right-adjacent bigram is: RightBiGram(W_i) = W_i W_{i+1};
the adjacent trigram is: TriGram(W_i) = W_{i-1} W_i W_{i+1};
Step 22) Based on the large-scale Web corpus and the homonym confusion set CSet(W_i), counting the left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams of all words in the homonym confusion sets together with their co-occurrence frequencies, wherein CSet(W_i) = {W_i^1, W_i^2, ..., W_i^n} and each W_i^j is a homonym of W_i;
Step 23) Based on the homonym confusion set CSet(W_i), performing probability estimation on the local adjacent N-gram models of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram, thereby generating local adjacent N-gram models that contain probability estimates; wherein the probability estimate of the left-adjacent bigram is:
P_left(W_i | W_{i-1}) = Count(W_{i-1} W_i) / (Count(W_{i-1} W_i) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k))   (1);
the probability estimate of the right-adjacent bigram is:
P_right(W_i | W_{i+1}) = Count(W_i W_{i+1}) / (Count(W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_i^k W_{i+1}))   (2);
the probability estimate of the adjacent trigram is:
P_tri(W_i | W_{i-1} W_{i+1}) = Count(W_{i-1} W_i W_{i+1}) / (Count(W_{i-1} W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k W_{i+1}))   (3);
wherein Count(W_{i-1} W_i), Count(W_i W_{i+1}), and Count(W_{i-1} W_i W_{i+1}) denote the co-occurrence frequencies of W_{i-1} W_i, W_i W_{i+1}, and W_{i-1} W_i W_{i+1} in the corpus, and Count(W_{i-1} W_i^k), Count(W_i^k W_{i+1}), and Count(W_{i-1} W_i^k W_{i+1}) denote the corresponding co-occurrence frequencies with W_i replaced by its homonym W_i^k.
5. The Chinese homonym error auto-proofreading method according to claim 4, characterized in that step 3) comprises the following steps:
Step 31) Segmenting the sentence S to be proofread into words; traversing each word W_i in the segmented sentence S and determining whether it has a homonym confusion set CSet(W_i); processing every word that has a homonym confusion set with step 32), until all words in the sentence have been traversed;
Step 32) If W_i has CSet(W_i), then, based on the local adjacent N-gram models obtained in step 2), using the weighted combination method to calculate the context support in the sentence of the word and of every homonym in its corresponding homonym confusion set, determining whether a homonym error exists, marking the homonym error, and providing a correction suggestion list, specifically comprising:
Step 32-1) Calculating the context support of the word W_i in sentence S with the combined scoring function Score: Score(W_i) = α_1 * P_left(W_i | W_{i-1}) + α_2 * P_right(W_i | W_{i+1}) + α_3 * P_tri(W_i | W_{i-1} W_{i+1})   (4);
wherein α_1 + α_2 + α_3 = 1, α_1 > 0, α_2 > 0, α_3 > 0, and α_1, α_2, α_3 denote the weights of the left-adjacent bigram, right-adjacent bigram, and adjacent trigram, respectively;
Step 32-2) Calculating with the combined scoring function Score the context support in sentence S of each homonym W_i^j in the homonym confusion set corresponding to W_i:
Score(W_i^j) = α_1 * P_left(W_i^j | W_{i-1}) + α_2 * P_right(W_i^j | W_{i+1}) + α_3 * P_tri(W_i^j | W_{i-1} W_{i+1})   (5);
wherein W_i^j ∈ CSet(W_i);
P_left(W_i^j | W_{i-1}) = Count(W_{i-1} W_i^j) / (Count(W_{i-1} W_i) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k))   (6);
P_right(W_i^j | W_{i+1}) = Count(W_i^j W_{i+1}) / (Count(W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_i^k W_{i+1}))   (7);
P_tri(W_i^j | W_{i-1} W_{i+1}) = Count(W_{i-1} W_i^j W_{i+1}) / (Count(W_{i-1} W_i W_{i+1}) + Σ_{k=1}^{|CSet(W_i)|} Count(W_{i-1} W_i^k W_{i+1}))   (8);
Step 32-3) Sorting W_i and every homonym in CSet(W_i) by their context support Score;
Step 32-4) If Score(W_i) = 0, marking W_i as erroneous and listing the homonyms W_i^j in descending Score order as the correction suggestion list; otherwise going to step 32-5);
Step 32-5) If Score(W_i) > 0 and Score(W_i) < β * max_{W_i^j ∈ CSet(W_i)} Score(W_i^j), marking W_i as erroneous and listing the corresponding homonyms W_i^j as the correction suggestion list; otherwise marking W_i as a correct word, wherein β is the probability that a word is mistyped as one of its homonyms.
6. The Chinese homonym error auto-proofreading method according to claim 5, characterized in that: in steps 32-1) and 32-2), the weight of the left-adjacent bigram is α_1 = 0.25, the weight of the right-adjacent bigram is α_2 = 0.25, and the weight of the adjacent trigram is α_3 = 0.5.
7. The Chinese homonym error auto-proofreading method according to claim 5, characterized in that: in step 32-5), the probability β that a word is mistyped as one of its homonyms satisfies β ≤ 0.01.
CN201510354692.4A 2015-06-24 2015-06-24 Chinese homonym error auto-proofreading method Active CN105045778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510354692.4A CN105045778B (en) 2015-06-24 2015-06-24 Chinese homonym error auto-proofreading method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510354692.4A CN105045778B (en) 2015-06-24 2015-06-24 Chinese homonym error auto-proofreading method

Publications (2)

Publication Number Publication Date
CN105045778A (en) 2015-11-11
CN105045778B CN105045778B (en) 2017-10-17

Family

ID=54452334

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510354692.4A Active CN105045778B (en) 2015-06-24 2015-06-24 Chinese homonym error auto-proofreading method

Country Status (1)

Country Link
CN (1) CN105045778B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364487B2 (en) * 2008-10-21 2013-01-29 Microsoft Corporation Speech recognition system with display information
CN102375807A (en) * 2010-08-27 2012-03-14 汉王科技股份有限公司 Method and device for proofing characters
CN104166462A (en) * 2013-05-17 2014-11-26 北京搜狗科技发展有限公司 Input method and system for characters

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
施恒利 et al., "汉字种子混淆集的构建方法研究" (Research on construction methods of Chinese character seed confusion sets), 《计算机科学》 (Computer Science) *
石敏 et al., "基于决策列表的中文同音词自动识别与校对" (Automatic recognition and proofreading of Chinese homophones based on decision lists), 《电子设计工程》 (Electronic Design Engineering) *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105573979B (en) * 2015-12-10 2018-05-22 江苏科技大学 A kind of wrongly written character word knowledge generation method that collection is obscured based on Chinese character
CN105573979A (en) * 2015-12-10 2016-05-11 江苏科技大学 Chinese character confusion set based wrong word knowledge generation method
CN106528616A (en) * 2016-09-30 2017-03-22 厦门快商通科技股份有限公司 Language error correcting method and system for use in human-computer interaction process
CN106528616B (en) * 2016-09-30 2019-12-17 厦门快商通科技股份有限公司 Language error correction method and system in human-computer interaction process
CN109388252A (en) * 2017-08-14 2019-02-26 北京搜狗科技发展有限公司 A kind of input method and device
CN110083819B (en) * 2018-01-26 2024-02-09 北京京东尚科信息技术有限公司 Spelling error correction method, device, medium and electronic equipment
CN110083819A (en) * 2018-01-26 2019-08-02 北京京东尚科信息技术有限公司 Spell error correction method, device, medium and electronic equipment
CN108563632A (en) * 2018-03-29 2018-09-21 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting character spelling errors
CN108563634A (en) * 2018-03-29 2018-09-21 广州视源电子科技股份有限公司 Method and system for identifying character spelling errors, computer equipment and storage medium
CN108519973A (en) * 2018-03-29 2018-09-11 广州视源电子科技股份有限公司 Character spelling detection method, system, computer equipment and storage medium
CN108491392A (en) * 2018-03-29 2018-09-04 广州视源电子科技股份有限公司 Method, system, computer device and storage medium for correcting character spelling errors
CN108829665B (en) * 2018-05-22 2022-05-31 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108874770A (en) * 2018-05-22 2018-11-23 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108984515A (en) * 2018-05-22 2018-12-11 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108829665A (en) * 2018-05-22 2018-11-16 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108845984A (en) * 2018-05-22 2018-11-20 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108874770B (en) * 2018-05-22 2022-04-22 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108845984B (en) * 2018-05-22 2022-04-22 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN108984515B (en) * 2018-05-22 2022-09-06 广州视源电子科技股份有限公司 Wrongly written character detection method and device, computer readable storage medium and terminal equipment
CN110600011B (en) * 2018-06-12 2022-04-01 ***通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN110600011A (en) * 2018-06-12 2019-12-20 ***通信有限公司研究院 Voice recognition method and device and computer readable storage medium
CN110619119B (en) * 2019-07-23 2022-06-10 平安科技(深圳)有限公司 Intelligent text editing method and device and computer readable storage medium
CN110619119A (en) * 2019-07-23 2019-12-27 平安科技(深圳)有限公司 Intelligent text editing method and device and computer readable storage medium
WO2021051877A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Method for obtaining input text in artificial intelligence interview, and related apparatus
CN110851599A (en) * 2019-11-01 2020-02-28 中山大学 Automatic scoring method and teaching and assisting system for Chinese composition
CN110991166B (en) * 2019-12-03 2021-07-30 中国标准化研究院 Chinese wrongly-written character recognition method and system based on pattern matching
CN110991166A (en) * 2019-12-03 2020-04-10 中国标准化研究院 Chinese wrongly-written character recognition method and system based on pattern matching
WO2021129439A1 (en) * 2019-12-28 2021-07-01 科大讯飞股份有限公司 Voice recognition method and related product
CN111312209A (en) * 2020-02-21 2020-06-19 北京声智科技有限公司 Text-to-speech conversion processing method and device and electronic equipment
WO2021258739A1 (en) * 2020-06-22 2021-12-30 中国标准化研究院 Method for automatically identifying word repetition error
CN111709228A (en) * 2020-06-22 2020-09-25 中国标准化研究院 Automatic recognition method for repeated errors of words
US20220343070A1 (en) * 2020-06-22 2022-10-27 China National Institute Of Standardization Method for automatically identifying word repetition errors
CN111709228B (en) * 2020-06-22 2023-11-21 中国标准化研究院 Automatic identification method for word repetition errors
CN112668328A (en) * 2020-12-25 2021-04-16 广东南方新媒体科技有限公司 Media intelligent proofreading algorithm

Also Published As

Publication number Publication date
CN105045778B (en) 2017-10-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20151111

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Denomination of invention: An automatic correction method for Chinese homonym errors

Granted publication date: 20171017

License type: Common License

Record date: 20201029

EE01 Entry into force of recordation of patent licensing contract
EC01 Cancellation of recordation of patent licensing contract

Assignee: JIANGSU KEDA HUIFENG SCIENCE AND TECHNOLOGY Co.,Ltd.

Assignor: JIANGSU University OF SCIENCE AND TECHNOLOGY

Contract record no.: X2020980007325

Date of cancellation: 20201223

EC01 Cancellation of recordation of patent licensing contract