A Method for Automatic Proofreading of Chinese Homonym Errors
Technical field
The present invention relates to natural language processing within the field of artificial intelligence and computing, and in particular to the automatic proofreading of Chinese texts.
Background
With the rapid development of information processing technology and the Internet, traditional text work has been almost entirely taken over by computers. Electronic texts such as e-books, electronic newspapers, e-mail, office documents, blogs and microblogs have become part of daily life, but errors in these texts are also increasingly common, which poses a major challenge for proofreading. Traditional manual proofreading is inefficient, labor-intensive and slow, and clearly cannot meet the demand for text proofreading.
Automatic text proofreading is one of the main applications of natural language processing and also a difficult problem in natural language understanding. Chinese is entered into a computer through an input method. As more and more people use pinyin input methods, which can enter both words and phrases, homonym errors appear in texts more and more often. Homonym errors belong to the category of real-word errors. Automatic proofreading of Chinese real-word errors faces the following problems:
1) The erroneous word in a real-word error is itself a word in the dictionary; this is the central difficulty of automatic proofreading of Chinese texts.
2) A real-word error can disturb the syntax and semantics of the whole sentence, so detecting it requires a great deal of knowledge and resources.
3) Data sparseness is one of the main obstacles to the automatic proofreading of real-word errors.
4) Automatic proofreading of homonyms comprises automatic error detection and automatic error correction. Error detection finds the homonym errors in a sentence; error correction proofreads those errors and provides correction suggestions. Many current methods treat detection and correction as two separate stages.
In view of these problems, the present invention proposes and implements a method for the automatic detection and automatic proofreading of Chinese homonym errors.
Summary of the invention
Object of the invention: to overcome the deficiencies of the prior art, the invention provides a method for automatic proofreading of Chinese homonym errors that integrates automatic error detection and automatic proofreading.
Technical solution:
To solve the above technical problem, the invention provides a method for automatic proofreading of Chinese homonym errors. The method proofreads Chinese homonym errors using homonym confusion sets together with a weighted combined decision over local adjacency N-gram models, and comprises the following steps:
1) using the pinyin of Chinese characters, build homonym confusion sets for Chinese words;
2) build local adjacency N-gram models consisting of left bigrams, right bigrams and trigrams; based on the homonym confusion sets obtained in step 1), estimate probabilities for the local adjacency N-gram models by a probability estimation method, the models being obtained by training on a large-scale corpus;
3) based on the local adjacency N-gram models obtained in step 2), use the weighted combination method to compute the context support, within the sentence, of each word and of the homonyms in its confusion set, decide whether a homonym error exists, mark any homonym error and provide a list of correction suggestions.
Preferably, step 1) comprises: using a Chinese character pinyin table and a Chinese dictionary, generate the homonym confusion set:
$$CSet(W_i) = \{\, W_j \mid W_j \text{ is a homonym of } W_i \,\}$$
where $W_i$ is a Chinese word and $W_j$ is a homonym of $W_i$.
Preferably, the homonym confusion set in step 1) comes from two parts: an automatic identification part and a manual proofreading part.
The automatic identification part comprises the following steps:
Step 11) read the Chinese dictionary and load its Chinese words into a Chinese word structure;
Step 12) read the pinyin of the Chinese characters in the Chinese character pinyin table into a pinyin structure;
Step 13) combining the Chinese words obtained in step 11) with the pinyin obtained in step 12), convert the Chinese words in the dictionary to pinyin and place them in a homonym structure, generating a homonym dictionary structure, that is, the homonym confusion set.
The manual proofreading part comprises: manually proofreading the homonym confusion set obtained in step 13) and updating it.
The structure of the homonym confusion set is as follows:
$$CSet(W_i) = \{\, W_j \mid W_j \text{ is a homonym of } W_i \,\}$$
where $W_i$ is a word and $W_j$ is a homonym of $W_i$.
Preferably, step 2) comprises the following steps:
Step 21) based on a large-scale Web corpus, build the local adjacency N-gram models of left adjacent bigrams, right adjacent bigrams and adjacent trigrams: segment the sentences in the corpus into words; for example, segmenting a sentence $L$ gives $L = W_1 W_2 \ldots W_{i-1} W_i W_{i+1} \ldots W_n$. For a word $W_i$:
the left adjacent bigram is $LeftBiGram(W_i) = W_{i-1} W_i$;
the right adjacent bigram is $RightBiGram(W_i) = W_i W_{i+1}$;
the adjacent trigram is $TriGram(W_i) = W_{i-1} W_i W_{i+1}$;
Step 22) based on the large-scale Web corpus and the homonym confusion set $CSet(W_i)$, count, for every word in the confusion set, its left adjacent bigrams, right adjacent bigrams and adjacent trigrams together with their co-occurrence frequencies, where each $W_j \in CSet(W_i)$ is a homonym of $W_i$;
Step 23) based on the homonym confusion set $CSet(W_i)$, estimate probabilities for the local adjacency N-gram models of left adjacent bigrams, right adjacent bigrams and adjacent trigrams, thereby generating local adjacency N-gram models that contain the probability estimates, where
the probability estimate of the left adjacent bigram is:
$$P_{left}(W_i \mid W_{i-1}) = \frac{Count(W_{i-1} W_i)}{\sum_{W_k \in CSet(W_i)} Count(W_{i-1} W_k)} \quad (1)$$
the probability estimate of the right adjacent bigram is:
$$P_{right}(W_i \mid W_{i+1}) = \frac{Count(W_i W_{i+1})}{\sum_{W_k \in CSet(W_i)} Count(W_k W_{i+1})} \quad (2)$$
the probability estimate of the adjacent trigram is:
$$P_{tri}(W_i \mid W_{i-1} W_{i+1}) = \frac{Count(W_{i-1} W_i W_{i+1})}{\sum_{W_k \in CSet(W_i)} Count(W_{i-1} W_k W_{i+1})} \quad (3)$$
where $W_k \in CSet(W_i)$; $Count(W_{i-1} W_i)$, $Count(W_i W_{i+1})$ and $Count(W_{i-1} W_i W_{i+1})$ denote the co-occurrence frequencies of $W_{i-1} W_i$, $W_i W_{i+1}$ and $W_{i-1} W_i W_{i+1}$ in the corpus; and $Count(W_{i-1} W_k)$, $Count(W_k W_{i+1})$ and $Count(W_{i-1} W_k W_{i+1})$ denote the co-occurrence frequencies of $W_{i-1} W_k$, $W_k W_{i+1}$ and $W_{i-1} W_k W_{i+1}$ in the corpus.
Preferably, step 3) comprises the following steps:
Step 31) segment the sentence $S$ to which the method is applied, traverse the words $W_i$ of the segmented sentence, and determine for each whether a homonym confusion set $CSet(W_i)$ exists; words that have a confusion set are processed by step 32), until every word in the sentence has been traversed;
Step 32) if $CSet(W_i)$ exists for $W_i$, then, based on the local adjacency N-gram models obtained in step 2), use the weighted combination method to compute the context support, within the sentence, of the word and of the homonyms in its confusion set, decide whether a homonym error exists, mark any homonym error and provide a list of correction suggestions, specifically:
Step 32-1) using the combined scoring function $Score$, compute the context support of the word $W_i$ in the sentence $S$:
$$Score(W_i) = \alpha_1 P_{left}(W_i \mid W_{i-1}) + \alpha_2 P_{right}(W_i \mid W_{i+1}) + \alpha_3 P_{tri}(W_i \mid W_{i-1} W_{i+1}) \quad (4)$$
where $\alpha_1 + \alpha_2 + \alpha_3 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, $\alpha_3 > 0$, and $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the left adjacent bigram, the right adjacent bigram and the adjacent trigram respectively;
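Step 32-2) using the combined scoring function $Score$, compute the context support in the sentence of each homonym $W_j$ in the confusion set of the word $W_i$:
$$Score(W_j) = \alpha_1 P_{left}(W_j \mid W_{i-1}) + \alpha_2 P_{right}(W_j \mid W_{i+1}) + \alpha_3 P_{tri}(W_j \mid W_{i-1} W_{i+1}) \quad (5)$$
where $W_j \in CSet(W_i)$ and the three probabilities are estimated as in step 23), with $W_j$ in place of $W_i$;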
Step 32-3) sort the context support values $Score$ of $W_i$ and of each homonym in $CSet(W_i)$;
Step 32-4) if $Score(W_i) = 0$, mark $W_i$ as an error and list the homonyms $W_j$ from the sorted Score table as the list of correction suggestions; otherwise go to step 32-5);
Step 32-5) if $Score(W_i) > 0$ and the threshold condition on $\beta$ is satisfied, mark $W_i$ as an error and list the corresponding homonyms $W_j$ as the list of correction suggestions; otherwise mark $W_i$ as a correct word, where $\beta$ is the probability that a word is mistakenly written as one of its homonyms.
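The ranking of step 32-3) and the zero-score branch of step 32-4) can be sketched as follows. This is only an illustration under stated assumptions: the function name `proofread_word`, the choice to suggest only homonyms with a positive score, and the caller-supplied predicate `exceeds_threshold` standing in for the $\beta$ test of step 32-5) are all hypothetical and are not taken from the text.

```python
def proofread_word(word, scores, exceeds_threshold=None):
    """Return (is_error, suggestions) for one word.

    `scores` maps the word W_i and each of its homonyms to a Score value.
    `exceeds_threshold(word_score, homonym_score)` is a hypothetical hook
    for the beta test of step 32-5); its exact form is an assumption.
    """
    # Step 32-3: sort all candidates by context support, highest first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    homonyms = [w for w in ranked if w != word]

    # Step 32-4: no context support at all, so the word is marked as an error.
    if scores[word] == 0:
        # Assumption: only homonyms with some support are worth suggesting.
        return True, [w for w in homonyms if scores[w] > 0]

    # Step 32-5: positive support; defer to the caller-supplied beta test.
    if exceeds_threshold is not None:
        hits = [w for w in homonyms if exceeds_threshold(scores[word], scores[w])]
        if hits:
            return True, hits
    return False, []


# Hypothetical scores for a sentence where "权力" was typed instead of "权利".
flagged, suggestions = proofread_word(
    "权力", {"权力": 0.0, "权利": 0.8, "全力": 0.0}
)
```

Keeping the $\beta$ test behind a predicate lets the ranking logic be exercised on its own while the threshold rule is filled in separately.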
Preferably, in steps 32-1) and 32-2), the weight of the left adjacent bigram is $\alpha_1 = 0.25$, the weight of the right adjacent bigram is $\alpha_2 = 0.25$, and the weight of the adjacent trigram is $\alpha_3 = 0.5$.
Preferably, in step 32-5), the probability that a word is mistakenly written as its homonym takes a value $\beta \le 0.01$.
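As a worked illustration of the preferred weights, the following evaluates the scoring function for one word. The three probability values 0.2, 0.1 and 0 are assumed purely for illustration and do not come from any experiment.

```latex
\begin{align*}
Score(W_i) &= \alpha_1 P_{left}(W_i \mid W_{i-1}) + \alpha_2 P_{right}(W_i \mid W_{i+1}) + \alpha_3 P_{tri}(W_i \mid W_{i-1} W_{i+1}) \\
           &= 0.25 \times 0.2 + 0.25 \times 0.1 + 0.5 \times 0 \\
           &= 0.075
\end{align*}
```

Because all three weights are strictly positive and the estimates are non-negative, a score of exactly 0 arises only when all three estimates are 0, which, if the estimates are relative frequencies, means that none of the three local n-grams around the word was observed in the training corpus.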
Beneficial effects: the invention proposes a method for automatic proofreading of Chinese homonym errors. The method uses homonym confusion sets and a weighted combination of left adjacent bigrams, right adjacent bigrams and adjacent trigrams to judge the homonyms in a sentence, identify homonym errors and provide correction suggestions for them, integrating automatic error detection and automatic proofreading. Experiments show that the method reaches a recall of 81.2% and a precision of 75.6% for automatic proofreading of homonym errors. The system responds quickly, its precision, validity and accuracy meet the needs of practical applications, and it is highly practical.
Brief description of the drawings
Fig. 1 is a flowchart of the automatic proofreading of homonym errors.
Embodiment
The invention is further described below with reference to the drawing and an embodiment.
The method for automatic proofreading of Chinese homonym errors provided by the invention proofreads Chinese homonym errors using homonym confusion sets together with a weighted combined decision over local adjacency N-gram models. The method comprises the following steps: 1) build the homonym confusion sets: using the pinyin of Chinese characters, build homonym confusion sets for Chinese words.
As shown in Fig. 1, using a Chinese character pinyin table and a Chinese dictionary, generate the homonym confusion set:
$$CSet(W_i) = \{\, W_j \mid W_j \text{ is a homonym of } W_i \,\}$$
where $W_i$ is a Chinese word and $W_j$ is a homonym of $W_i$.
In this embodiment, the homonym confusion set comes from two parts: an automatic identification part and a manual proofreading part.
The automatic identification part comprises the following steps:
Step 11) read the Chinese dictionary and load its Chinese words into a Chinese word structure;
Step 12) read the pinyin of the Chinese characters in the Chinese character pinyin table into a pinyin structure;
Step 13) combining the Chinese words obtained in step 11) with the pinyin obtained in step 12), convert the Chinese words in the dictionary to pinyin and place them in a homonym structure, generating a homonym dictionary structure, that is, the homonym confusion set.
The manual proofreading part comprises: manually proofreading the homonym confusion set obtained in step 13) and updating it.
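The automatic identification part (steps 11 to 13) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy pinyin table and the toy dictionary are hypothetical, each character is assumed to have a single toneless reading (polyphonic characters are ignored), and each confusion set is assumed to contain the word itself alongside its homonyms.

```python
from collections import defaultdict


def word_to_pinyin(word, char_pinyin):
    """Step 13 (first half): spell a word by looking up each character."""
    return " ".join(char_pinyin[ch] for ch in word)


def build_confusion_sets(dictionary_words, char_pinyin):
    """Group dictionary words by pinyin; words sharing a reading are homonyms."""
    by_pinyin = defaultdict(list)
    for word in dictionary_words:
        by_pinyin[word_to_pinyin(word, char_pinyin)].append(word)

    confusion_sets = {}
    for words in by_pinyin.values():
        if len(words) < 2:
            continue  # a word with a unique reading has no confusion set
        for word in words:
            # Assumption: the set includes the word itself.
            confusion_sets[word] = list(words)
    return confusion_sets


# Hypothetical toy resources standing in for the pinyin table and dictionary.
char_pinyin = {"权": "quan", "全": "quan", "利": "li", "力": "li",
               "苹": "ping", "果": "guo"}
dictionary_words = ["权利", "权力", "全力", "苹果"]
csets = build_confusion_sets(dictionary_words, char_pinyin)
```

The manual proofreading part would then edit the resulting mapping, for example to remove pairs that are never confused in practice.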
2) Build local adjacency N-gram models consisting of left bigrams, right bigrams and trigrams; based on the homonym confusion sets obtained in step 1), estimate probabilities for the local adjacency N-gram models by a probability estimation method, the models being obtained by training on a large-scale corpus. Specifically:
Step 21) based on a large-scale Web corpus, build the local adjacency N-gram models of left adjacent bigrams, right adjacent bigrams and adjacent trigrams: segment the sentences in the corpus into words; for example, segmenting a sentence $L$ gives $L = W_1 W_2 \ldots W_{i-1} W_i W_{i+1} \ldots W_n$. For a word $W_i$:
the left adjacent bigram is $LeftBiGram(W_i) = W_{i-1} W_i$;
the right adjacent bigram is $RightBiGram(W_i) = W_i W_{i+1}$;
the adjacent trigram is $TriGram(W_i) = W_{i-1} W_i W_{i+1}$;
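The three local n-grams defined in step 21) can be extracted from a segmented sentence as in the sketch below. The function name `local_ngrams` and the use of `None` at sentence boundaries are assumptions made for illustration; the text does not say how boundary words are handled.

```python
def local_ngrams(words, i):
    """Return (left bigram, right bigram, trigram) for the word at index i.

    `words` is a sentence that has already been segmented into words.
    A component is None when the required neighbor does not exist.
    """
    has_left = i > 0
    has_right = i + 1 < len(words)
    left = (words[i - 1], words[i]) if has_left else None
    right = (words[i], words[i + 1]) if has_right else None
    tri = (words[i - 1], words[i], words[i + 1]) if has_left and has_right else None
    return left, right, tri


# Hypothetical segmented sentence.
sentence = ["维护", "公民", "权利", "和", "义务"]
left, right, tri = local_ngrams(sentence, 2)
```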
Step 22) based on the large-scale Web corpus and the homonym confusion set $CSet(W_i)$, count, for every word in the confusion set, its left adjacent bigrams, right adjacent bigrams and adjacent trigrams together with their co-occurrence frequencies, where each $W_j \in CSet(W_i)$ is a homonym of $W_i$;
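The counting in step 22) can be sketched with three frequency tables, one per n-gram type. The function name `count_local_ngrams`, the `tracked` parameter (the union of all confusion-set words) and the three-sentence corpus are hypothetical; a real system would stream a large segmented Web corpus.

```python
from collections import Counter


def count_local_ngrams(corpus, tracked):
    """Count local n-grams around every occurrence of a tracked word.

    `corpus` is an iterable of segmented sentences (lists of words);
    `tracked` is the set of words that belong to some confusion set.
    """
    left, right, tri = Counter(), Counter(), Counter()
    for words in corpus:
        for i, word in enumerate(words):
            if word not in tracked:
                continue  # only confusion-set words need statistics
            has_left = i > 0
            has_right = i + 1 < len(words)
            if has_left:
                left[(words[i - 1], word)] += 1
            if has_right:
                right[(word, words[i + 1])] += 1
            if has_left and has_right:
                tri[(words[i - 1], word, words[i + 1])] += 1
    return left, right, tri


# Hypothetical toy corpus.
corpus = [
    ["公民", "权利", "和", "义务"],
    ["国家", "权力", "机关"],
    ["公民", "权利", "和", "自由"],
]
left_counts, right_counts, tri_counts = count_local_ngrams(
    corpus, {"权利", "权力", "全力"}
)
```

Restricting the counts to confusion-set words keeps the tables much smaller than a full bigram or trigram model over the whole vocabulary.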
Step 23) based on the homonym confusion set $CSet(W_i)$, estimate probabilities for the local adjacency N-gram models of left adjacent bigrams, right adjacent bigrams and adjacent trigrams, thereby generating local adjacency N-gram models that contain the probability estimates, where
the probability estimate of the left adjacent bigram is:
$$P_{left}(W_i \mid W_{i-1}) = \frac{Count(W_{i-1} W_i)}{\sum_{W_k \in CSet(W_i)} Count(W_{i-1} W_k)} \quad (1)$$
the probability estimate of the right adjacent bigram is:
$$P_{right}(W_i \mid W_{i+1}) = \frac{Count(W_i W_{i+1})}{\sum_{W_k \in CSet(W_i)} Count(W_k W_{i+1})} \quad (2)$$
the probability estimate of the adjacent trigram is:
$$P_{tri}(W_i \mid W_{i-1} W_{i+1}) = \frac{Count(W_{i-1} W_i W_{i+1})}{\sum_{W_k \in CSet(W_i)} Count(W_{i-1} W_k W_{i+1})} \quad (3)$$
where $W_k \in CSet(W_i)$; $Count(W_{i-1} W_i)$, $Count(W_i W_{i+1})$ and $Count(W_{i-1} W_i W_{i+1})$ denote the co-occurrence frequencies of $W_{i-1} W_i$, $W_i W_{i+1}$ and $W_{i-1} W_i W_{i+1}$ in the corpus; and $Count(W_{i-1} W_k)$, $Count(W_k W_{i+1})$ and $Count(W_{i-1} W_k W_{i+1})$ denote the co-occurrence frequencies of $W_{i-1} W_k$, $W_k W_{i+1}$ and $W_{i-1} W_k W_{i+1}$ in the corpus.
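One plausible reading of the probability estimation in step 23) is a relative frequency normalized over the confusion set, sketched below. This reading is an assumption, as are the function name `relative_frequency`, the convention of returning 0 when the context was never seen with any member of the set, and the toy counts.

```python
def relative_frequency(counts, make_ngram, word, confusion_set):
    """Count of the n-gram containing `word`, normalized over the confusion set.

    `make_ngram(w)` builds the n-gram obtained by placing candidate `w`
    into the fixed context, e.g. lambda w: (previous_word, w).
    """
    total = sum(counts.get(make_ngram(k), 0) for k in confusion_set)
    if total == 0:
        return 0.0  # assumption: an unseen context gives probability 0
    return counts.get(make_ngram(word), 0) / total


# Hypothetical left-bigram counts for the context word "公民".
left_counts = {("公民", "权利"): 8, ("公民", "权力"): 2}
cset = ["权利", "权力", "全力"]  # assumed to contain the word itself

p_left = relative_frequency(left_counts, lambda w: ("公民", w), "权利", cset)
```

The same helper covers the right bigram and the trigram by changing `make_ngram`, for example `lambda w: (w, next_word)` or `lambda w: (previous_word, w, next_word)`.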
3) Based on the local adjacency N-gram models obtained in step 2), use the weighted combination method to compute the context support, within the sentence, of each word and of the homonyms in its confusion set, decide whether a homonym error exists, mark any homonym error and provide a list of correction suggestions. As shown in Fig. 1, specifically:
Step 31) segment the sentence $S$ to which the method is applied, traverse the words $W_i$ of the segmented sentence, and determine for each whether a homonym confusion set $CSet(W_i)$ exists; words that have a confusion set are processed by step 32), until every word in the sentence has been traversed;
Step 32) if $CSet(W_i)$ exists for $W_i$, then, based on the local adjacency N-gram models obtained in step 2), use the weighted combination method to compute the context support, within the sentence, of the word and of the homonyms in its confusion set, decide whether a homonym error exists, mark any homonym error and provide a list of correction suggestions, specifically:
Step 32-1) using the combined scoring function $Score$, compute the context support of the word $W_i$ in the sentence $S$:
$$Score(W_i) = \alpha_1 P_{left}(W_i \mid W_{i-1}) + \alpha_2 P_{right}(W_i \mid W_{i+1}) + \alpha_3 P_{tri}(W_i \mid W_{i-1} W_{i+1}) \quad (4)$$
where $\alpha_1 + \alpha_2 + \alpha_3 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, $\alpha_3 > 0$, and $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the left adjacent bigram, the right adjacent bigram and the adjacent trigram respectively. In this embodiment, $\alpha_1 = \alpha_2 = 0.25$ and $\alpha_3 = 0.5$; these values can of course be adjusted as needed.
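The combined scoring function of formula (4) is a weighted sum and can be sketched directly. The function name `score` and the input probabilities in the usage lines are hypothetical; the default weights are those of this embodiment, and the validation mirrors the stated constraints on the weights.

```python
def score(p_left, p_right, p_tri, alphas=(0.25, 0.25, 0.5)):
    """Weighted context support: a1*P_left + a2*P_right + a3*P_tri."""
    a1, a2, a3 = alphas
    # The weights must be strictly positive and sum to 1.
    if min(alphas) <= 0 or abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("weights must be positive and sum to 1")
    return a1 * p_left + a2 * p_right + a3 * p_tri


# Hypothetical probability estimates for one word in its context.
support = score(0.8, 0.6, 0.9)
```

With the default weights the trigram contributes as much as both bigrams together, so a candidate that fits both neighbors at once is favored over one that fits each neighbor only separately.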
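Step 32-2) using the combined scoring function $Score$, compute the context support in the sentence of each homonym $W_j$ in the confusion set of the word $W_i$:
$$Score(W_j) = \alpha_1 P_{left}(W_j \mid W_{i-1}) + \alpha_2 P_{right}(W_j \mid W_{i+1}) + \alpha_3 P_{tri}(W_j \mid W_{i-1} W_{i+1}) \quad (5)$$
where $W_j \in CSet(W_i)$ and the three probabilities are estimated as in step 23), with $W_j$ in place of $W_i$;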
Step 32-3) sort the context support values $Score$ of $W_i$ and of each homonym in $CSet(W_i)$;
Step 32-4) if $Score(W_i) = 0$, mark $W_i$ as an error and list the homonyms $W_j$ from the sorted Score table as the list of correction suggestions; otherwise go to step 32-5);
Step 32-5) if $Score(W_i) > 0$ and the threshold condition on $\beta$ is satisfied, mark $W_i$ as an error and list the corresponding homonyms $W_j$ as the list of correction suggestions; otherwise mark $W_i$ as a correct word, where $\beta$ is the probability that a word is mistakenly written as one of its homonyms. Usually $\beta \le 0.01$; in this embodiment $\beta = 0.01$.
Experiment:
The method was evaluated in repeated open tests. The experiment used a test corpus of 10,000 sentence lines in which 600 homonym errors had been manually constructed, with the parameters given in the embodiment as the experimental parameters. The experiment shows that the method reaches a recall of 81.2% and a precision of 75.6% for automatic proofreading of homonym errors. This precision exceeds that of the prior art, meets the needs of practical application, and offers high validity and accuracy.
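For reference, recall and precision relate to raw counts as sketched below. Only the 600 constructed errors and the two percentages are reported; the counts 487 (errors correctly flagged) and 644 (words flagged in total) are hypothetical values chosen because they are consistent with the reported rates, not figures from the experiment.

```python
def recall(true_positives, total_errors):
    """Share of the real errors that the system flagged."""
    return true_positives / total_errors


def precision(true_positives, total_flagged):
    """Share of the flagged words that were real errors."""
    return true_positives / total_flagged


TOTAL_ERRORS = 600        # reported: manually constructed homonym errors
TRUE_POSITIVES = 487      # hypothetical, consistent with 81.2% recall
TOTAL_FLAGGED = 644       # hypothetical, consistent with 75.6% precision

recall_pct = round(100 * recall(TRUE_POSITIVES, TOTAL_ERRORS), 1)
precision_pct = round(100 * precision(TRUE_POSITIVES, TOTAL_FLAGGED), 1)
```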
The above embodiment is only a preferred embodiment of the invention and does not limit the invention. Any modification, equivalent replacement or improvement made by those skilled in the art without departing from the technical idea of the invention falls within the scope of protection of the invention.