CN105045778A

CN105045778A - Chinese homonym error auto-proofreading method

Info

Publication number: CN105045778A
Application number: CN201510354692.4A
Authority: CN
Inventors: 吴健康; 严熙; 刘亮亮
Original assignee: Jiangsu University of Science and Technology
Current assignee: Jiangsu University of Science and Technology
Priority date: 2015-06-24
Filing date: 2015-06-24
Publication date: 2015-11-11
Anticipated expiration: 2035-06-24
Also published as: CN105045778B

Abstract

The present invention discloses an automatic proofreading method for Chinese homonym errors. The method comprises: first, generating confusion sets of Chinese homonyms; training on a large-scale Web corpus to collect statistics for left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams; obtaining a local adjacent NGram model from the homonym confusion sets and a probability estimation algorithm; using a weighted combination method to compute the sentence context support of each word in a sentence and of each homonym in that word's confusion set, and thereby determining whether a homonym error exists; and marking the homonym error and providing a list of correction suggestions, so as to implement automatic proofreading of Chinese homonyms. The method responds quickly, is efficient and accurate, and meets the precision requirements of practical applications.

Description

An automatic proofreading method for Chinese homonym errors

Technical field

The present invention relates to natural language processing within the field of artificial intelligence and computing, and in particular to the automatic proofreading of Chinese text.

Background

With the rapid development of information processing technology and the Internet, traditional text work has been almost entirely taken over by computers. Electronic texts such as e-books, electronic newspapers, e-mail, office documents, blogs, and microblogs have become part of daily life, but errors in these texts are also increasingly common, which poses a major challenge for proofreading. Traditional manual proofreading is inefficient, labor-intensive, and slow, and clearly cannot meet the demand for text proofreading.

Automatic text proofreading is one of the main applications of natural language processing and also a hard problem in natural language understanding. Chinese is entered into a computer through an input method. As more and more people use pinyin input methods, which can enter whole words and phrases, more and more homonym errors appear in text. Homonym errors belong to the category of real-word errors. Automatic proofreading of Chinese real-word errors faces the following problems:

1) The word produced by a real-word error is itself a word in the dictionary, which is the main difficulty of automatic proofreading of Chinese text.

2) A real-word error can disrupt the syntax and semantics of the whole sentence, so detecting it requires a great deal of knowledge and resources.

3) Data sparseness is a major obstacle to the automatic proofreading of real-word errors.

4) Automatic homonym proofreading comprises automatic error detection and automatic error correction: detection finds the homonym errors in a sentence, and correction proofreads them and provides correction suggestions. Many current methods treat detection and correction as two separate stages.

To address the above problems, the present invention proposes and implements automatic error detection and automatic proofreading of Chinese homonym errors.

Summary of the invention

Object of the invention: to overcome the deficiencies of the prior art, the present invention provides an automatic proofreading method for Chinese homonym errors that integrates automatic error detection and automatic correction.

Technical solution:

To solve the above technical problem, the present invention provides an automatic proofreading method for Chinese homonym errors. It proofreads Chinese homonym errors automatically using homonym confusion sets and a weighted combination decision over a local adjacent NGram model. The method comprises the following steps:

1) using the pinyin of Chinese characters, build a homonym confusion set for each Chinese word;

2) build a local adjacent NGram model consisting of left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams; based on the homonym confusion sets obtained in step 1), estimate the probabilities of the local adjacent NGram model with a probability estimation method, obtaining the model by training on a large-scale corpus;

3) based on the local adjacent NGram model obtained in step 2), use a weighted combination method to compute the context support in the sentence of each word and of the homonyms in its corresponding confusion set, decide whether a homonym error exists, mark any homonym error, and provide a list of correction suggestions.

Preferably, step 1) comprises: using a Chinese character pinyin table and a Chinese dictionary to generate the homonym confusion set:

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

where $W_i$ is a Chinese word and each $W_i^k$ ($k = 1, \ldots, n$) is a homonym of $W_i$.

Preferably, the homonym confusion sets in step 1) come from two parts: an automatic identification part and a manual proofreading part.

The automatic identification part comprises the following steps:

Step 11) read the Chinese dictionary and load its words into a Chinese word structure;

Step 12) load the pinyin of each character from the Chinese character pinyin table into a pinyin structure;

Step 13) combining the Chinese words obtained in step 11) with the character pinyin obtained in step 12), convert each word in the Chinese dictionary to pinyin and place it into a homonym structure, generating a homonym dictionary structure, that is, the homonym confusion sets.

The manual proofreading part comprises: manually proofreading the homonym confusion sets obtained in step 13) and updating them.

The structure of a homonym confusion set is as follows:

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

where $W_i$ is a word and each $W_i^k$ ($k = 1, \ldots, n$) is a homonym of $W_i$.
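The automatic identification part can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `build_confusion_sets`, the toy pinyin table, and the toy dictionary are all hypothetical, tones are omitted, and each character is assumed to have a single reading (polyphonic characters are not handled).

```python
from collections import defaultdict


def build_confusion_sets(dictionary_words, char_pinyin):
    """Map each word to the other dictionary words that share its pinyin."""
    by_pinyin = defaultdict(list)
    for word in dictionary_words:
        # Assumption: words containing a character that is missing from
        # the pinyin table are skipped.
        if any(ch not in char_pinyin for ch in word):
            continue
        # Convert the word to a pinyin key, character by character.
        key = " ".join(char_pinyin[ch] for ch in word)
        by_pinyin[key].append(word)

    confusion_sets = {}
    for words in by_pinyin.values():
        if len(words) > 1:  # a word without homonyms gets no confusion set
            for word in words:
                # CSet(W_i) holds the homonyms of W_i, excluding W_i itself.
                confusion_sets[word] = [h for h in words if h != word]
    return confusion_sets


# Toy pinyin table and dictionary (illustrative only).
CHAR_PINYIN = {"权": "quan", "全": "quan", "力": "li", "利": "li", "部": "bu"}
WORDS = ["权力", "权利", "全力", "全部"]

confusion_sets = build_confusion_sets(WORDS, CHAR_PINYIN)
print(confusion_sets["权力"])
```

In this sketch the manual proofreading part would amount to editing the returned dictionary by hand, for example removing pairs that are never confused in practice.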

Preferably, step 2) comprises the following steps:

Step 21) based on a large-scale Web corpus, build the local adjacent NGram model of left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams. The sentences in the corpus are segmented into words; for example, segmenting a sentence $L$ gives $L = W_1 W_2 \ldots W_{i-1} W_i W_{i+1} \ldots W_n$. For a word $W_i$:

the left-adjacent bigram is: $\mathrm{LeftBiGram}(W_i) = W_{i-1} W_i$;

the right-adjacent bigram is: $\mathrm{RightBiGram}(W_i) = W_i W_{i+1}$;

the adjacent trigram is: $\mathrm{TriGram}(W_i) = W_{i-1} W_i W_{i+1}$;

Step 22) based on the large-scale Web corpus and the homonym confusion set $\mathrm{CSet}(W_i)$, count, for every word in the confusion set, its left-adjacent bigrams and their co-occurrence frequencies, its right-adjacent bigrams and their co-occurrence frequencies, and its adjacent trigrams and their co-occurrence frequencies, where

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

and each $W_i^k$ is a homonym of $W_i$;
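The counting in steps 21) and 22) might look like the following sketch. It assumes the corpus has already been segmented into word lists and that the confusion sets are given as a dictionary keyed by word; the function name `count_local_ngrams` and the toy data are hypothetical.

```python
from collections import Counter


def count_local_ngrams(segmented_sentences, confusion_sets):
    """Count left bigrams, right bigrams and trigrams around confusable words."""
    left, right, tri = Counter(), Counter(), Counter()
    for words in segmented_sentences:
        for i, w in enumerate(words):
            if w not in confusion_sets:
                continue  # only words that have a confusion set are counted
            has_prev = i > 0
            has_next = i + 1 < len(words)
            if has_prev:
                left[(words[i - 1], w)] += 1             # W_{i-1} W_i
            if has_next:
                right[(w, words[i + 1])] += 1            # W_i W_{i+1}
            if has_prev and has_next:
                tri[(words[i - 1], w, words[i + 1])] += 1  # W_{i-1} W_i W_{i+1}
    return left, right, tri


# Toy confusion sets and a toy segmented corpus (illustrative only).
CONFUSION = {"权力": ["权利"], "权利": ["权力"]}
CORPUS = [
    ["滥用", "权力", "的", "行为"],
    ["公民", "的", "权利", "和", "义务"],
    ["滥用", "权力"],
]

left, right, tri = count_local_ngrams(CORPUS, CONFUSION)
print(left[("滥用", "权力")], right[("权力", "的")], tri[("的", "权利", "和")])
```

Restricting the counts to words that appear in some confusion set keeps the model "local": only the n-grams needed for later scoring are stored.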

Step 23) based on the homonym confusion set $\mathrm{CSet}(W_i)$, estimate the probabilities of the local adjacent NGram model of left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams, thereby generating a local adjacent NGram model that contains the probability estimates, where

the probability estimate of the left-adjacent bigram is:

$$P_{\mathrm{left}}(W_i \mid W_{i-1}) = \frac{\mathrm{Count}(W_{i-1} W_i)}{\mathrm{Count}(W_{i-1} W_i) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k)} \tag{1}$$

the probability estimate of the right-adjacent bigram is:

$$P_{\mathrm{right}}(W_i \mid W_{i+1}) = \frac{\mathrm{Count}(W_i W_{i+1})}{\mathrm{Count}(W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_i^k W_{i+1})} \tag{2}$$

the probability estimate of the adjacent trigram is:

$$P_{\mathrm{tri}}(W_i \mid W_{i-1} W_{i+1}) = \frac{\mathrm{Count}(W_{i-1} W_i W_{i+1})}{\mathrm{Count}(W_{i-1} W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k W_{i+1})} \tag{3}$$

where $W_i^k \in \mathrm{CSet}(W_i)$; $\mathrm{Count}(W_{i-1} W_i)$, $\mathrm{Count}(W_i W_{i+1})$, and $\mathrm{Count}(W_{i-1} W_i W_{i+1})$ are the co-occurrence frequencies in the corpus of $W_{i-1} W_i$, $W_i W_{i+1}$, and $W_{i-1} W_i W_{i+1}$ respectively; and $\mathrm{Count}(W_{i-1} W_i^k)$, $\mathrm{Count}(W_i^k W_{i+1})$, and $\mathrm{Count}(W_{i-1} W_i^k W_{i+1})$ are the co-occurrence frequencies in the corpus of $W_{i-1} W_i^k$, $W_i^k W_{i+1}$, and $W_{i-1} W_i^k W_{i+1}$ respectively.
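The three estimates share one form: the count of the word in a fixed context, divided by that count plus the counts of all its homonyms in the same context. A minimal sketch of that shared form follows; the helper name `local_prob`, the toy counts, and the choice to return 0 when nothing was observed are assumptions, since the text does not say how an empty denominator is handled.

```python
from collections import Counter


def local_prob(candidate, word, homonyms, count_in_context):
    """Relative frequency of `candidate` among `word` and its homonyms.

    `count_in_context(w)` returns the corpus count of w in one fixed context
    (left neighbour, right neighbour, or both neighbours).
    """
    denom = count_in_context(word) + sum(count_in_context(h) for h in homonyms)
    if denom == 0:
        return 0.0  # assumption: unseen contexts give probability 0
    return count_in_context(candidate) / denom


# Toy left-bigram counts (illustrative only).
left_counts = Counter({("滥用", "权力"): 8, ("滥用", "权利"): 2})


def after_lanyong(w):
    """Count of w immediately after the left neighbour 滥用."""
    return left_counts[("滥用", w)]


# Equation (1) with W_{i-1} = 滥用, W_i = 权力, CSet(W_i) = {权利}: 8 / (8 + 2)
p_left = local_prob("权力", "权力", ["权利"], after_lanyong)
print(p_left)
```

Because the denominator only ranges over the word and its confusion set, the estimate compares a word against its homonyms rather than against the whole vocabulary, which may help with the data sparseness noted in the background.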

Preferably, step 3) comprises the following steps:

Step 31) segment the sentence $S$ to which the method is applied, traverse each word $W_i$ of the segmented sentence, and check whether it has a homonym confusion set $\mathrm{CSet}(W_i)$; words that have a confusion set are processed by step 32), until all words in the sentence have been traversed;

Step 32) if $W_i$ has a confusion set $\mathrm{CSet}(W_i)$, then, based on the local adjacent NGram model obtained in step 2), use the weighted combination method to compute the context support in the sentence of the word and of the homonyms in its confusion set, decide whether a homonym error exists, mark any homonym error, and provide a list of correction suggestions. Specifically:

Step 32-1) use the combined scoring function Score to compute the context support of the word $W_i$ of sentence $S$ in the sentence:

$$\mathrm{Score}(W_i) = \alpha_1 P_{\mathrm{left}}(W_i \mid W_{i-1}) + \alpha_2 P_{\mathrm{right}}(W_i \mid W_{i+1}) + \alpha_3 P_{\mathrm{tri}}(W_i \mid W_{i-1} W_{i+1}) \tag{4}$$

where $\alpha_1 + \alpha_2 + \alpha_3 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, $\alpha_3 > 0$, and $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the left-adjacent bigram, the right-adjacent bigram, and the adjacent trigram respectively;

Step 32-2) use the combined scoring function Score to compute the context support in the sentence of each homonym $W_i^j$ in the confusion set of the word $W_i$ of sentence $S$:

$$\mathrm{Score}(W_i^j) = \alpha_1 P_{\mathrm{left}}(W_i^j \mid W_{i-1}) + \alpha_2 P_{\mathrm{right}}(W_i^j \mid W_{i+1}) + \alpha_3 P_{\mathrm{tri}}(W_i^j \mid W_{i-1} W_{i+1}) \tag{5}$$

where:

$$W_i^j \in \mathrm{CSet}(W_i)$$

$$P_{\mathrm{left}}(W_i^j \mid W_{i-1}) = \frac{\mathrm{Count}(W_{i-1} W_i^j)}{\mathrm{Count}(W_{i-1} W_i) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k)} \tag{6}$$

$$P_{\mathrm{right}}(W_i^j \mid W_{i+1}) = \frac{\mathrm{Count}(W_i^j W_{i+1})}{\mathrm{Count}(W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_i^k W_{i+1})} \tag{7}$$

$$P_{\mathrm{tri}}(W_i^j \mid W_{i-1} W_{i+1}) = \frac{\mathrm{Count}(W_{i-1} W_i^j W_{i+1})}{\mathrm{Count}(W_{i-1} W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k W_{i+1})} \tag{8}$$
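As a rough illustration of the combined scoring function, the sketch below evaluates the weighted sum for a word and for one of its homonyms from toy counts. The names `ratio` and `score`, the count tables, and the zero value used for an empty denominator are assumptions made for this example; the default weights are the preferred values 0.25, 0.25, and 0.5.

```python
from collections import Counter


def ratio(candidate_count, word_count, homonym_counts):
    """Shared form of the estimates; 0.0 when nothing was observed (assumption)."""
    denom = word_count + sum(homonym_counts)
    return candidate_count / denom if denom else 0.0


def score(candidate, word, homonyms, prev_w, next_w, left, right, tri,
          weights=(0.25, 0.25, 0.5)):
    """Weighted context support of `candidate` at the position of `word`."""
    a1, a2, a3 = weights
    p_left = ratio(left[(prev_w, candidate)], left[(prev_w, word)],
                   [left[(prev_w, h)] for h in homonyms])
    p_right = ratio(right[(candidate, next_w)], right[(word, next_w)],
                    [right[(h, next_w)] for h in homonyms])
    p_tri = ratio(tri[(prev_w, candidate, next_w)], tri[(prev_w, word, next_w)],
                  [tri[(prev_w, h, next_w)] for h in homonyms])
    return a1 * p_left + a2 * p_right + a3 * p_tri


# Toy counts for the context 滥用 _ 的 (illustrative only).
LEFT = Counter({("滥用", "权力"): 8, ("滥用", "权利"): 2})
RIGHT = Counter({("权力", "的"): 3, ("权利", "的"): 1})
TRI = Counter({("滥用", "权力", "的"): 2})

# The sentence contains 权利; its confusion set is {权力}.
s_word = score("权利", "权利", ["权力"], "滥用", "的", LEFT, RIGHT, TRI)
s_homonym = score("权力", "权利", ["权力"], "滥用", "的", LEFT, RIGHT, TRI)
print(s_word, s_homonym)
```

With these toy counts the word in the sentence scores 0.25 × 0.2 + 0.25 × 0.25 + 0.5 × 0 = 0.1125 and its homonym scores 0.25 × 0.8 + 0.25 × 0.75 + 0.5 × 1 = 0.8875. Since both use the same denominators, the scores of a word and all its homonyms sum to at most 1.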

Step 32-3) sort $W_i$ and every homonym in $\mathrm{CSet}(W_i)$ by context support Score;

Step 32-4) if $\mathrm{Score}(W_i) = 0$, mark $W_i$ as an error and list the homonyms $W_i^j$ from the Score ranking as the correction suggestion list; otherwise go to step 32-5);

Step 32-5) if $\mathrm{Score}(W_i) > 0$ and

$$\mathrm{Score}(W_i) < \beta \cdot \max_{W_i^j \in \mathrm{CSet}(W_i)} \mathrm{Score}(W_i^j),$$

mark $W_i$ as an error and list the corresponding homonyms $W_i^j$ as the correction suggestion list; otherwise mark $W_i$ as a correct word. Here $\beta$ is the probability that a word is mistakenly written as its homonym.
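The decision in steps 32-3) to 32-5) can be sketched as below. The function name `judge_word` is hypothetical, and the exact content of the suggestion list is an assumption: here a zero-score word gets every homonym with a positive score, and otherwise the list holds the homonyms whose scores satisfy the inequality, both in descending order of Score.

```python
def judge_word(word, scores, beta=0.01):
    """Return (is_error, suggestions) for one word.

    `scores` maps the word itself and each homonym in its confusion set
    to a context support Score.
    """
    own = scores[word]
    # Step 32-3: rank the homonyms by Score, highest first.
    ranked = sorted(((h, s) for h, s in scores.items() if h != word),
                    key=lambda pair: pair[1], reverse=True)
    # Step 32-4: no context support at all for the written word.
    if own == 0:
        return True, [h for h, s in ranked if s > 0]
    # Step 32-5: support is positive but far below the best homonym.
    best = ranked[0][1] if ranked else 0.0
    if own < beta * best:
        return True, [h for h, s in ranked if own < beta * s]
    return False, []


print(judge_word("权利", {"权利": 0.0, "权力": 0.8875}))
print(judge_word("权利", {"权利": 0.1125, "权力": 0.8875}))
print(judge_word("权利", {"权利": 0.001, "权力": 0.5, "全力": 0.05}))
```

With $\beta = 0.01$ the rule is conservative: a word with any positive support is flagged only when some homonym scores more than a hundred times higher, so the second call above leaves the word unmarked even though its homonym scores much better.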

Preferably, in steps 32-1) and 32-2), the weight of the left-adjacent bigram is $\alpha_1 = 0.25$, the weight of the right-adjacent bigram is $\alpha_2 = 0.25$, and the weight of the adjacent trigram is $\alpha_3 = 0.5$.

Preferably, in step 32-5), the probability $\beta$ that a word is mistakenly written as its homonym takes a value $\beta \le 0.01$.

Beneficial effects: the present invention proposes an automatic proofreading method for Chinese homonym errors. The method uses homonym confusion sets and a weighted combination of left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams to judge the homonyms in a sentence, identify homonym errors, and provide correction suggestions for them, integrating automatic error detection and automatic correction. Experiments show that the method achieves a recall of 81.2% and a precision of 75.6% for automatic proofreading of homonym errors. The system responds quickly, and its precision, effectiveness, and accuracy meet the needs of practical applications, making it highly practical.

Description of the drawings

Fig. 1 is a flow chart of the automatic proofreading of homonym errors.

Embodiment

The present invention is further described below with reference to the drawing and an embodiment.

The automatic proofreading method for Chinese homonym errors provided by the invention proofreads Chinese homonym errors automatically using homonym confusion sets and a weighted combination decision over a local adjacent NGram model. The method comprises the following steps:

1) Build the homonym confusion sets: using the pinyin of Chinese characters, build a homonym confusion set for each Chinese word.

As shown in Fig. 1, a Chinese character pinyin table and a Chinese dictionary are used to generate the homonym confusion set:

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

where $W_i$ is a Chinese word and each $W_i^k$ ($k = 1, \ldots, n$) is a homonym of $W_i$.

In this embodiment, the homonym confusion sets come from two parts: an automatic identification part and a manual proofreading part.

The automatic identification part follows steps 11) to 13) described above.

The manual proofreading part comprises: manually proofreading the homonym confusion sets obtained in step 13) and updating them.

2) Build a local adjacent NGram model consisting of left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams; based on the homonym confusion sets obtained in step 1), estimate the probabilities of the local adjacent NGram model with a probability estimation method, obtaining the model by training on a large-scale corpus. Specifically:

The left-adjacent bigram is: $\mathrm{LeftBiGram}(W_i) = W_{i-1} W_i$;

the right-adjacent bigram is: $\mathrm{RightBiGram}(W_i) = W_i W_{i+1}$;

the adjacent trigram is: $\mathrm{TriGram}(W_i) = W_{i-1} W_i W_{i+1}$;

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

where each $W_i^k$ is a homonym of $W_i$;

The probability estimate of the left-adjacent bigram is:

$$P_{\mathrm{left}}(W_i \mid W_{i-1}) = \frac{\mathrm{Count}(W_{i-1} W_i)}{\mathrm{Count}(W_{i-1} W_i) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k)} \tag{1}$$

The probability estimate of the right-adjacent bigram is:

$$P_{\mathrm{right}}(W_i \mid W_{i+1}) = \frac{\mathrm{Count}(W_i W_{i+1})}{\mathrm{Count}(W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_i^k W_{i+1})} \tag{2}$$

The probability estimate of the adjacent trigram is:

$$P_{\mathrm{tri}}(W_i \mid W_{i-1} W_{i+1}) = \frac{\mathrm{Count}(W_{i-1} W_i W_{i+1})}{\mathrm{Count}(W_{i-1} W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k W_{i+1})} \tag{3}$$

3) Based on the local adjacent NGram model obtained in step 2), use the weighted combination method to compute the context support in the sentence of each word and of the homonyms in its corresponding confusion set, decide whether a homonym error exists, mark any homonym error, and provide a list of correction suggestions. As shown in Fig. 1, specifically:

The weights of the combined scoring function Score satisfy $\alpha_1 + \alpha_2 + \alpha_3 = 1$, $\alpha_1 > 0$, $\alpha_2 > 0$, $\alpha_3 > 0$, where $\alpha_1$, $\alpha_2$, $\alpha_3$ are the weights of the left-adjacent bigram, the right-adjacent bigram, and the adjacent trigram respectively. In this embodiment $\alpha_1 = \alpha_2 = 0.25$ and $\alpha_3 = 0.5$; these values can of course be adjusted as needed.

Step 32-2) use the combined scoring function Score to compute the context support in the sentence of each homonym $W_i^j$ in the confusion set of the word $W_i$ of sentence $S$:

$$\mathrm{Score}(W_i^j) = \alpha_1 P_{\mathrm{left}}(W_i^j \mid W_{i-1}) + \alpha_2 P_{\mathrm{right}}(W_i^j \mid W_{i+1}) + \alpha_3 P_{\mathrm{tri}}(W_i^j \mid W_{i-1} W_{i+1}) \tag{5}$$

where

$$W_i^j \in \mathrm{CSet}(W_i)$$

$$P_{\mathrm{left}}(W_i^j \mid W_{i-1}) = \frac{\mathrm{Count}(W_{i-1} W_i^j)}{\mathrm{Count}(W_{i-1} W_i) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k)} \tag{6}$$

$$P_{\mathrm{right}}(W_i^j \mid W_{i+1}) = \frac{\mathrm{Count}(W_i^j W_{i+1})}{\mathrm{Count}(W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_i^k W_{i+1})} \tag{7}$$

$$P_{\mathrm{tri}}(W_i^j \mid W_{i-1} W_{i+1}) = \frac{\mathrm{Count}(W_{i-1} W_i^j W_{i+1})}{\mathrm{Count}(W_{i-1} W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k W_{i+1})} \tag{8}$$

Step 32-5) if $\mathrm{Score}(W_i) > 0$ and

$$\mathrm{Score}(W_i) < \beta \cdot \max_{W_i^j \in \mathrm{CSet}(W_i)} \mathrm{Score}(W_i^j),$$

mark $W_i$ as an error and list the corresponding homonyms $W_i^j$ as the correction suggestion list; otherwise mark $W_i$ as a correct word. Here $\beta$ is the probability that a word is mistakenly written as its homonym; usually $\beta \le 0.01$, and $\beta = 0.01$ in this embodiment.

Experiment:

The method was evaluated in repeated open tests. The experiment used a test corpus of 10,000 lines of sentences into which 600 homonym errors were manually inserted, with the parameters given in the embodiment as the experimental parameters. The experiments show that the method achieves a recall of 81.2% and a precision of 75.6% for automatic proofreading of homonym errors. This precision exceeds the prior art, meets the needs of practical application, and indicates high effectiveness and accuracy.
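The text does not define recall and precision. Assuming the standard definitions (recall is the share of actual errors that are flagged, precision is the share of flagged words that are actual errors), they could be computed as in the sketch below. The function name and the counts are hypothetical and are not taken from the experiment.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall from raw counts (assumed definitions)."""
    flagged = true_positives + false_positives        # words marked as errors
    actual = true_positives + false_negatives         # errors present in the text
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall = precision_recall(true_positives=3, false_positives=1,
                                     false_negatives=2)
print(precision, recall)
```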

The above embodiment is only a preferred embodiment of the present invention and does not limit it. Any modification, equivalent replacement, or improvement made by a person skilled in the art without departing from the technical idea of the invention falls within the scope of protection of the invention.

Claims

1. An automatic proofreading method for Chinese homonym errors, characterized in that Chinese homonym errors are proofread automatically based on homonym confusion sets and a weighted combination decision over a local adjacent NGram model, the method comprising the following steps:

2. The automatic proofreading method for Chinese homonym errors according to claim 1, characterized in that step 1) comprises: using a Chinese character pinyin table and a Chinese dictionary to generate the homonym confusion set:

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

where $W_i$ is a Chinese word and each $W_i^k$ ($k = 1, \ldots, n$) is a homonym of $W_i$.

3. The automatic proofreading method for Chinese homonym errors according to claim 1, characterized in that the homonym confusion sets in step 1) come from two parts: an automatic identification part and a manual proofreading part;

wherein the automatic identification part comprises the following steps:

The structure of a homonym confusion set is as follows:

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

where $W_i$ is a word and each $W_i^k$ ($k = 1, \ldots, n$) is a homonym of $W_i$.

4. The automatic proofreading method for Chinese homonym errors according to claim 1, characterized in that step 2) comprises the following steps:

the left-adjacent bigram is: $\mathrm{LeftBiGram}(W_i) = W_{i-1} W_i$;

the right-adjacent bigram is: $\mathrm{RightBiGram}(W_i) = W_i W_{i+1}$;

the adjacent trigram is: $\mathrm{TriGram}(W_i) = W_{i-1} W_i W_{i+1}$;

$$\mathrm{CSet}(W_i) = \{W_i^1, W_i^2, \ldots, W_i^n\}$$

where each $W_i^k$ is a homonym of $W_i$;

Step 23) based on the homonym confusion set $\mathrm{CSet}(W_i)$, estimate the probabilities of the local adjacent NGram model of left-adjacent bigrams, right-adjacent bigrams, and adjacent trigrams, thereby generating a local adjacent NGram model that contains the probability estimates; wherein the probability estimate of the left-adjacent bigram is:

$$P_{\mathrm{left}}(W_i \mid W_{i-1}) = \frac{\mathrm{Count}(W_{i-1} W_i)}{\mathrm{Count}(W_{i-1} W_i) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k)} \tag{1}$$

the probability estimate of the right-adjacent bigram is:

$$P_{\mathrm{right}}(W_i \mid W_{i+1}) = \frac{\mathrm{Count}(W_i W_{i+1})}{\mathrm{Count}(W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_i^k W_{i+1})} \tag{2}$$

the probability estimate of the adjacent trigram is:

$$P_{\mathrm{tri}}(W_i \mid W_{i-1} W_{i+1}) = \frac{\mathrm{Count}(W_{i-1} W_i W_{i+1})}{\mathrm{Count}(W_{i-1} W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k W_{i+1})} \tag{3}$$

where $\mathrm{Count}(W_{i-1} W_i)$, $\mathrm{Count}(W_i W_{i+1})$, and $\mathrm{Count}(W_{i-1} W_i W_{i+1})$ are the co-occurrence frequencies in the corpus of $W_{i-1} W_i$, $W_i W_{i+1}$, and $W_{i-1} W_i W_{i+1}$ respectively, and $\mathrm{Count}(W_{i-1} W_i^k)$, $\mathrm{Count}(W_i^k W_{i+1})$, and $\mathrm{Count}(W_{i-1} W_i^k W_{i+1})$ are the co-occurrence frequencies in the corpus of $W_{i-1} W_i^k$, $W_i^k W_{i+1}$, and $W_{i-1} W_i^k W_{i+1}$ respectively.

5. The automatic proofreading method for Chinese homonym errors according to claim 4, characterized in that step 3) comprises the following steps:

Step 32-1) use the combined scoring function Score to compute the context support of the word $W_i$ of sentence $S$ in the sentence:

$$\mathrm{Score}(W_i) = \alpha_1 P_{\mathrm{left}}(W_i \mid W_{i-1}) + \alpha_2 P_{\mathrm{right}}(W_i \mid W_{i+1}) + \alpha_3 P_{\mathrm{tri}}(W_i \mid W_{i-1} W_{i+1}) \tag{4}$$

$$\mathrm{Score}(W_i^j) = \alpha_1 P_{\mathrm{left}}(W_i^j \mid W_{i-1}) + \alpha_2 P_{\mathrm{right}}(W_i^j \mid W_{i+1}) + \alpha_3 P_{\mathrm{tri}}(W_i^j \mid W_{i-1} W_{i+1}) \tag{5}$$

where $W_i^j \in \mathrm{CSet}(W_i)$;

$$P_{\mathrm{left}}(W_i^j \mid W_{i-1}) = \frac{\mathrm{Count}(W_{i-1} W_i^j)}{\mathrm{Count}(W_{i-1} W_i) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k)} \tag{6}$$

$$P_{\mathrm{right}}(W_i^j \mid W_{i+1}) = \frac{\mathrm{Count}(W_i^j W_{i+1})}{\mathrm{Count}(W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_i^k W_{i+1})} \tag{7}$$

$$P_{\mathrm{tri}}(W_i^j \mid W_{i-1} W_{i+1}) = \frac{\mathrm{Count}(W_{i-1} W_i^j W_{i+1})}{\mathrm{Count}(W_{i-1} W_i W_{i+1}) + \sum_{k=1}^{|\mathrm{CSet}(W_i)|} \mathrm{Count}(W_{i-1} W_i^k W_{i+1})} \tag{8}$$

Step 32-4) if $\mathrm{Score}(W_i) = 0$, mark $W_i$ as an error and list the homonyms $W_i^j$ from the Score ranking as the correction suggestion list; otherwise go to step 32-5);

Step 32-5) if $\mathrm{Score}(W_i) > 0$ and

$$\mathrm{Score}(W_i) < \beta \cdot \max_{W_i^j \in \mathrm{CSet}(W_i)} \mathrm{Score}(W_i^j),$$

mark $W_i$ as an error and list the corresponding homonyms $W_i^j$ as the correction suggestion list; otherwise mark $W_i$ as a correct word, where $\beta$ is the probability that a word is mistakenly written as its homonym.

6. The automatic proofreading method for Chinese homonym errors according to claim 5, characterized in that: in steps 32-1) and 32-2), the weight of the left-adjacent bigram is $\alpha_1 = 0.25$, the weight of the right-adjacent bigram is $\alpha_2 = 0.25$, and the weight of the adjacent trigram is $\alpha_3 = 0.5$.

7. The automatic proofreading method for Chinese homonym errors according to claim 5, characterized in that: in step 32-5), the probability $\beta$ that a word is mistakenly written as its homonym takes a value $\beta \le 0.01$.