CN1116343A

CN1116343A - Automatic correcting method and device for Chinese wrongly written characters

Info

Publication number: CN1116343A
Application number: CN 94109394
Authority: CN
Inventors: 张照煌
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Industrial Technology Research Institute ITRI
Priority date: 1994-08-05
Filing date: 1994-08-05
Publication date: 1996-02-07
Anticipated expiration: 2014-08-05
Also published as: CN1056933C

Abstract

The method compares the selected candidate characters with the original characters in the text to find wrongly written or mispronounced characters and to provide the correct characters. The invention has practical value in Chinese word processing.

Description

Automatic correcting method and device for Chinese wrongly written characters

The present invention relates to a method and device for automatically correcting wrongly written Chinese characters, and in particular to a method and device that combine replacement from a comprehensive approximate character set with language model scoring: characters close in shape, sound, meaning, or input code are substituted to produce candidate character strings, and the highest-scoring candidate string is found so as to obtain the correct characters.

A "wrongly written character" originally meant a Chinese character made erroneous by added, missing, or altered strokes or by a misplaced radical, whereas a "malapropism" meant using one character where another was intended. Some people now use "wrongly written character" to cover both; below they are referred to collectively as "wrongly written or mispronounced characters".

The number of wrongly written or mispronounced characters strongly affects the quality of a document. Traditional manual proofreading, in which a manuscript is checked over several passes, is time-consuming and laborious and still misses errors; newspapers, magazines, and books that have been proofread many times before publication still commonly contain many malapropisms. In recent years the spread of computers has eliminated erroneous characters caused by stroke mistakes in documents entered into a computer, but errors caused by the input process have appeared in their place. There is therefore an urgent need for a computer to detect and correct wrongly written or mispronounced characters automatically.

"Detecting" wrongly written or mispronounced characters means locating them in a document; "correcting" them means finding the correct character corresponding to each one. Known techniques, such as commercial Chinese proofreading systems, only detect and do not correct, whereas the present invention both detects and corrects.

Whether they arise while writing or during input and editing, the wrongly written or mispronounced characters in a computer document fall into the following four classes, or result from two or more of them jointly:

(1) Homophones or near-homophones, whose pronunciation is identical or similar,

Example 1: " Hang ” Trace suspicious (shape)

Example 2: press " step " with regard to class (portion)

(2) Characters similar in shape,

Example 3: tea " Pot " (Pot)

Example 4: defend " mattress " (bacterium)

(3) Characters close in meaning,

Example 5: previously " do not study carefully " (fault)

Example 6: name is " symbol " Real (pair) not

(4) Input operation errors, that is, wrong characters caused by similar input codes, or missing characters, superfluous characters, or transposed adjacent characters produced by editing mistakes,

Example 7: " Shito ” System (Xi; the Cangjie input codes are VIF and HVIF respectively)

Example 8: " clod of earth " " bank " (frustration), Xi is used to " being used to " ()

Based on this analysis, the characters that people commonly confuse because they are close in shape, sound, meaning, or input code are compiled into a comprehensive approximate character set database. This database is used to replace characters in the original document and so produce candidate character strings, which forms the basis of the present invention.

The combined Chinese language model score consists of a base language model score and a "non-original character penalty".

The base language model score can use any known statistical scoring, such as character bigram tables, word bigram tables, character-to-word adjacency tables, part-of-speech bigram or cluster bigram tables, or dictionary-based word length and word frequency scores, expressed as probabilities or point values. The "non-original character penalty" deducts points, in graded or continuous fashion, for each approximate character that differs from the original character.

Using the combined language model score, the highest-scoring candidate string is found and compared with the characters of the original document, so that wrongly written or mispronounced characters are located automatically and the corresponding correct characters are provided. This is of great practical value.

In the prior art, Taiwan patent application No. 81104438, "Method and device for automatic detection of Chinese wrongly written characters", proposes an automatic detection method with two main steps: (1) a pseudo word segmentation step, which consults a dictionary to find the single characters that cannot form a multi-character word with adjacent characters and extracts them; and (2) a judging step, which decides whether each extracted single character is correct according to its frequency and the strength of its connection to the preceding and following characters. This method has two shortcomings: (1) the false alarm rate is too high, with on average only one real wrong character in every 40 characters flagged; and (2) it does not provide the corresponding correct character.

Taiwan patent applications No. 80102492, "Wrong-character correction for improving the Chinese recognition rate", and No. 80107315, "Document recognition correcting device", correct wrong characters among the multiple candidate recognition results produced by a character recognition device, and are unrelated to the present invention.

U.S. Patents Nos. 4,689,768 (1987), 4,783,758 (1988), 4,903,206 (1988), 4,829,472 (1989), and 5,148,367 (1992) concern spelling checking and correction for Western languages such as English. Because the characteristics of those languages differ greatly from Chinese, they too are unrelated to the present invention.

Earlier Chinese proofreading systems related to the present invention all rely on checking, after word segmentation, the frequency of single characters and the strength of their connection to neighbouring characters, and so suffer from an excessive false alarm rate and the failure to provide corresponding correct characters. To overcome these shortcomings, the present invention provides a method and device that automatically detect and correct Chinese wrongly written or mispronounced characters.

The first object of the present invention is to provide a novel method and device for automatically detecting and correcting Chinese wrongly written characters.

The second object of the present invention is to provide the correct character corresponding to each detected wrongly written or mispronounced character, for use in correction.

A further object of the present invention is to reduce the false alarm rate of detection and improve the efficiency of automatic proofreading.

To achieve the above objects, the method of the present invention for automatically detecting and correcting Chinese wrongly written or mispronounced characters enables a computer to detect and correct such characters in a Chinese document automatically, and comprises the following steps:

A comprehensive approximate character set replacement step, in which each character in the document is replaced by the characters in its comprehensive approximate character set, that is, characters close to it in shape, sound, meaning, or input code, and the replacements are combined into a plurality of candidate character strings;

A language model scoring step, in which a statistical language model scores each candidate character string and the highest-scoring candidate string is found; and

A wrongly written or mispronounced character determining step, in which the highest-scoring candidate string is compared character by character with the text of the document, and the characters that differ are marked as wrongly written or mispronounced.

Likewise, the device of the present invention for automatically detecting and correcting Chinese wrongly written or mispronounced characters enables a computer to detect and correct such characters in a Chinese document automatically, and comprises:

A comprehensive approximate character set replacement device, which replaces each character in the document with characters close to it in shape, sound, meaning, or input code, to be combined into a plurality of candidate character strings;

A language model scoring device, which scores each candidate character string and finds the highest-scoring candidate string; and

A wrongly written or mispronounced character judging device, which compares the highest-scoring candidate string character by character with the text of the document and marks the characters that differ as wrongly written or mispronounced.

To show the device and method of the present invention clearly, they are described in detail below with reference to the drawings:

Fig. 1 is a block diagram of an embodiment of the Chinese wrongly written or mispronounced character detecting and correcting device of the present invention.

Fig. 2 is a flow chart of the Chinese wrongly written or mispronounced character detecting and correcting method of the present invention.

Fig. 3 shows part of the comprehensive approximate character set database of the embodiment of the invention.

Fig. 4 is an input example sentence containing four wrongly written or mispronounced characters.

Fig. 5 is the result of applying approximate character set replacement to this example sentence.

Fig. 6 shows the five highest-scoring candidate character strings for this example sentence after language model scoring.

Fig. 7 is the output of the embodiment of the invention after processing this example sentence.

The composition of an embodiment of the Chinese wrongly written or mispronounced character detecting and correcting device of the present invention is shown in Fig. 1.

The device mainly comprises an input device 100, a comprehensive approximate character set replacement device 120, a language model scoring device 140, and a wrongly written or mispronounced character judging device 170, 180.

The input device 100 receives the Chinese document 110 provided by the user, and may include a segmenting device that divides the text of the document into a plurality of processing units according to punctuation marks before replacement.

The comprehensive approximate character set replacement device 120 replaces each character in the document 110 with characters close to it in shape, sound, meaning, or input code, to be combined into a plurality of candidate character strings. It comprises: (a) a comprehensive approximate character set database device, which holds, for each character of the Chinese character set, the original character and one or more characters close to it in shape, sound, meaning, or input code, where the approximate characters of each character may also be divided into a plurality of grades; and (b) a replacement device, which replaces a character with the approximate characters stored in the database device.
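One way to picture the approximate character set database is as a mapping from each character to its approximate characters, each tagged with a similarity type and a grade. The following is only a minimal sketch: the names `Approx`, `CONFUSION_DB`, and `candidates_for`, the placeholder characters, and the grade numbering are all assumptions made for illustration and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approx:
    char: str       # the approximate (confusable) character
    kind: str       # "S" shape, "P" sound, "M" meaning, "I" input code
    grade: int = 1  # hypothetical similarity grade; 1 = most similar


# Toy database with placeholder characters (not the contents of Fig. 3).
CONFUSION_DB = {
    "a": [Approx("b", "S"), Approx("c", "P", grade=2)],
    "d": [Approx("e", "I")],
}


def candidates_for(ch, db=CONFUSION_DB, max_grade=None):
    """Return [original, approx1, ...]; the original always comes first."""
    approx = db.get(ch, [])
    if max_grade is not None:
        approx = [a for a in approx if a.grade <= max_grade]
    return [ch] + [a.char for a in approx]
```

Keeping the original character at index 0 mirrors the convention in the description that the first choice for each position means "use the original character".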

The language model scoring device 140 scores each candidate character string and finds the highest-scoring one. It comprises: (a) a language model statistical database, which records the frequency of occurrence of each linguistic unit and the frequency with which linguistic units occur in succession, and which may also include a Chinese lexicon recording the part of speech of each word; (b) a scoring device, which evaluates the score of a string from the linguistic units it contains and the language model statistical database, and which deducts points for characters that differ from the original document; and (c) a highest-scoring candidate string search device, which determines the candidate string with the highest score; this embodiment searches for it by dynamic programming.
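A statistical database of this kind can be approximated by counting single characters and adjacent character pairs in a training corpus. The sketch below assumes that the linguistic unit is the single character and that the corpus is a list of strings; `build_stats` is a hypothetical helper name, and a real system would likely also hold word-level statistics and a lexicon.

```python
from collections import Counter


def build_stats(corpus):
    """Count character frequencies and adjacent-pair (bigram) frequencies."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)                 # frequency of each unit
        bigrams.update(zip(line, line[1:]))   # frequency of successive units
    return unigrams, bigrams
```

A scoring device could then turn these counts into probabilities or point values, which is the choice the description leaves open.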

The wrongly written or mispronounced character judging device 170, 180 compares the highest-scoring candidate string character by character with the text of the document and marks the characters that differ as wrongly written or mispronounced. It comprises: (a) a comparison device 170, which compares the highest-scoring candidate string with the text of the document character by character; and (b) a marking device 180, which marks the characters found to differ as wrongly written or mispronounced, judges the corresponding character in the highest-scoring candidate string to be the correct character for each one, and outputs the marked result as the marked document 190.

The processing flow of the Chinese wrongly written or mispronounced character detecting and correcting method of the present invention is shown in Fig. 2. The method enables a computer to detect wrongly written or mispronounced characters in a Chinese document automatically and comprises the following steps. Input step 200: a Chinese document 110 is input, and its text may first be divided into a plurality of processing units at punctuation marks such as ",", ".", "?", "!", ";", and ":". Steps 230 to 290 are carried out on the string of each processing unit, and when every processing unit has been processed the method ends (step 220). Comprehensive approximate character set replacement step 230: each character in the document 110 is replaced by the characters of the comprehensive approximate character set 130 that are close to it in shape (S), sound (P), meaning (M), or input code (I), and the replacements are combined into a plurality of candidate character strings; the set consists, for each character, of the original character and one or more such close characters, and the approximate characters of each character may also be divided into a plurality of grades. Language model scoring step 240: a statistical language model 250 scores each candidate character string, deducting points for characters that differ from the original document, and the highest-scoring candidate string is found by Viterbi dynamic programming (260). Wrongly written or mispronounced character determining step 270: the highest-scoring candidate string is compared character by character with the text of the document, the characters that differ are marked as wrongly written or mispronounced (280), and the corresponding characters in the highest-scoring candidate string are judged to be the correct characters; the marked result is then output (290) as a marked document (190).
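The division into processing units described for the input step can be sketched with a regular expression split. This is an illustration only: `split_units` is a hypothetical name, and the punctuation set (fullwidth Chinese marks plus their ASCII counterparts) is an assumption, since the description lists the marks only as examples.

```python
import re

# Fullwidth marks plus ASCII equivalents; the exact set is an assumption.
_PUNCT = "，。？！；：,.?!;:"
_SPLIT = re.compile("[" + re.escape(_PUNCT) + "]")


def split_units(text):
    """Split text at punctuation and drop empty or whitespace-only pieces."""
    return [unit for unit in _SPLIT.split(text) if unit.strip()]
```

Each returned unit would then go through replacement, scoring, and comparison on its own, which keeps the number of candidate strings per unit manageable.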

An example is now given to illustrate how the invention is carried out.

Suppose the "comprehensive approximate character set" is:

One:

People: go into S

Power: Li P Reed P cutter S sword S

Oneself: S S in the sixth of the twelve Earthly Branches second S

Do: sweet P universe P thousand S

Shoot a retrievable arrow: dagger-axe S

Smelting: control S

Study carefully: M is with regard to P for fault

Sharp: Li M clever S cuts the S S that stops and declares S power P

Anxious: loft P disease M

Yarn: be I

Venerate: venerate S Only S whetstone S root S and lick S Paper S Arrived S to S

The region between the heart and the diaphragm: educate the blind S of S

The replacement step takes each string obtained after splitting at punctuation as a processing unit. Let the original sentence be

S = C1, C2, ..., Cn

Replacing each Chinese character from its comprehensive approximate character set produces the candidate character strings:

P(i1, i2, ..., in) = c1(i1), c2(i2), ..., cn(in)

where cj(ij) is the ij-th character among the approximate characters of the j-th character, the original character included, with 1 <= ij <= mj (ij = 1 means the original character is used) and

1 <= j <= n. In total, m1 × m2 × ... × mn candidate character strings are formed.
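The enumeration of all m1 × m2 × ... × mn candidate strings is a Cartesian product over the per-position choices. A minimal sketch, assuming the approximate character set is given as a plain dictionary from a character to a list of its approximate characters (the function name and the toy data are hypothetical):

```python
from itertools import product


def generate_candidates(sentence, confusion):
    """Yield every candidate string; the first one is the original sentence."""
    # Position j offers the original character first, then its approximations.
    options = [[ch] + list(confusion.get(ch, [])) for ch in sentence]
    for combo in product(*options):
        yield "".join(combo)


# Toy usage with placeholder characters: 2 * 1 * 3 = 6 candidates.
toy_confusion = {"a": ["x"], "c": ["y", "z"]}
toy_candidates = list(generate_candidates("abc", toy_confusion))
```

Because the count grows multiplicatively with sentence length, exhaustive enumeration is practical only for short processing units.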

The language model scores each candidate character string, applying the "non-original character penalty" in the process, and the highest-scoring candidate string is found.

The combined Chinese language model score comprises the base language model score and the "non-original character penalty".

The base language model score can use any known statistical scoring, such as character bigram tables, word bigram tables, character-to-word adjacency tables, part-of-speech bigram or cluster bigram tables, or dictionary-based word length and word frequency scores, expressed as probabilities or point values. The "non-original character penalty" deducts points, in graded or continuous fashion, for each approximate character that differs from the original character.

The base language model used in this embodiment is a joint score of the character bigram table and the word frequencies in the dictionary, and the "non-original character penalty" is obtained by weighting the number of non-original approximate characters used in the candidate string P(i1, i2, ..., in):

Penalty(P(i1, i2, ..., in)) = W × (number of positions j with ij ≠ 1)

FinalScore = BaseScore + Penalty
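The two formulas translate directly into code. In the sketch below the weight `w = -2` is an assumption: the patent does not state W, but the example scores later in the description (such as 189 - 8 = 181) are consistent with a deduction of 2 points per substituted character, which under the additive formula corresponds to a negative W.

```python
def penalty(original, candidate, w=-2):
    """W times the number of positions where a non-original character is used."""
    substituted = sum(1 for o, c in zip(original, candidate) if o != c)
    return w * substituted


def final_score(base_score, original, candidate, w=-2):
    """FinalScore = BaseScore + Penalty, with Penalty <= 0 when w is negative."""
    return base_score + penalty(original, candidate, w)
```

With four substitutions and a base score of 189, this gives 189 + (-8) = 181.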

The highest-scoring candidate string can be found either by exhaustive search or by Viterbi-style dynamic programming search.
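A Viterbi-style search avoids enumerating every candidate by keeping, for each possible character at the current position, only the best-scoring path that ends in it. The sketch below assumes an additive score made of character-pair scores plus a per-substitution penalty; `bigram_score` is a hypothetical callable and `w` a hypothetical weight, and the embodiment's word frequency component is left out for brevity.

```python
def viterbi_best(sentence, confusion, bigram_score, w=-2.0):
    """Return (score, best_string) over all substitution combinations."""
    if not sentence:
        return 0.0, ""
    options = [[ch] + list(confusion.get(ch, [])) for ch in sentence]
    # best[c] = (score, string) of the best path ending in character c
    best = {c: (w * (c != sentence[0]), c) for c in options[0]}
    for j in range(1, len(sentence)):
        new_best = {}
        for c in options[j]:
            pen = w * (c != sentence[j])  # deduct only for non-original characters
            new_best[c] = max(
                (s + bigram_score(prev, c) + pen, p + c)
                for prev, (s, p) in best.items()
            )
        best = new_best
    return max(best.values())


# Toy usage with placeholder characters and made-up pair scores.
_toy_scores = {("a", "b"): 5, ("x", "b"): 1, ("b", "c"): 1, ("b", "y"): 10}
toy_best = viterbi_best(
    "abc", {"a": ["x"], "c": ["y"]}, lambda p, c: _toy_scores.get((p, c), 0)
)
```

The work per position is proportional to the product of the option counts of two neighbouring positions, rather than to the product over the whole sentence.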

If the highest-scoring candidate string is P(k1, k2, ..., kn), comparing it with the characters of the original document, S = C1, C2, ..., Cn, automatically locates the wrongly written or mispronounced characters and provides the corresponding correct characters: if cj(kj) is not equal to Cj, then Cj is marked as wrongly written or mispronounced and cj(kj) is taken as the corresponding correct character.
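The comparison amounts to a position-by-position walk over two strings of equal length. A minimal sketch (the function name and the tuple layout are assumptions for illustration):

```python
def find_corrections(original, best):
    """Return (position, wrong_char, correct_char) for every differing position."""
    if len(original) != len(best):
        raise ValueError("candidate must have the same length as the original")
    return [
        (j, o, b)
        for j, (o, b) in enumerate(zip(original, best))
        if o != b
    ]
```

An empty result means the highest-scoring candidate is the original sentence itself, so nothing is marked.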

The output step:

The result for each processing unit is output, including the marked wrongly written or mispronounced characters and the corresponding correct characters.

A specific example sentence is now used to illustrate the operation of the invention:

(1) Input and segmentation step

" message that tea Pot Shito System name is not inconsistent Real is Difference and walking not.”

S = C1 C2 ... C15

(2) approximate word collection replacement step

The message that tea Pot Shito System name is not inconsistent Real is Difference and walking not

Bitter edible plant Pot is Fu Pins of Lv Tibia Seoul

Apply free and unfettered

This gives 2 × 2 × 2 × 2 × 3 × 3 × 2 × 2 = 576 candidate character strings in total.
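The count can be checked as a product of per-position option counts. The sketch assumes that the seven positions of the 15-character sentence not listed in the product each have only the original character available (one option); `math.prod` requires Python 3.8 or later.

```python
from math import prod

# Eight positions have approximate characters; the other seven have one option.
option_counts = [2, 2, 2, 2, 3, 3, 2, 2] + [1] * 7
total_candidates = prod(option_counts)
```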

(3) language model scoring step

The five candidate character strings with the highest combined Chinese language model scores are:

Ranking	Scoring	Candidate character string
1	189 - 8 = 181	Tea Pot Xi System name not secondary Real message not Tibia and walk
2	184 - 6 = 178	tea Pot Xi System name not secondary Real message not Tibia and walk
3	182 - 6 = 176	tea Pot Xi System name be not inconsistent Real disappear Tibia not and walk
4	177 - 4 = 173	message that tea Pot Xi System name is not inconsistent Real not Tibia and
5	181 - 10 = 171	tea Pot Xi System name not secondary Real De Pins cease not Tibia and walk

(4) compare step

Original sentence: the message that tea Pot Shito System name is not inconsistent Real is Difference and walk best result not: tea Pot Xi System name is the message Tibia and walk not Difference and to walk Pot be secondary Tibia of message that XX X X

(5) Output step

tea Pot Shito System name is not inconsistent Real not of secondary Real not

All four wrongly written or mispronounced characters in the original sentence are successfully detected and corrected.

The effect of the present invention is evaluated as follows. Let

A = total number of Chinese characters in the input document

B = number of characters the proofreading method marks as wrongly written or mispronounced

C = number of characters the proofreading method detects and correctly corrects

D = number of marked characters that are truly wrongly written or mispronounced

E = number of truly wrongly written or mispronounced characters in the input document

Then

Marking rate B-rate = B/A

Accuracy rate P-rate = D/B

Recall rate D-rate = D/E

Correction rate C-rate = C/E

Indices of the existing Chinese proofreading system (see CCL Research Journal, 1992.8):

B-rate=5.2% (too high)

P-rate=2.5% (too low)

D-rate=73.8% (acceptable)

C-rate=0% (none)

The results of extensive experiments with the embodiment of the present invention are as follows (B, C, and D are indices of the embodiment; B' and D' are indices of a simulated known Chinese proofreading system):

Test data	A	B	C	D	B'	D'	D and D'
International politics	37,114	13	6	6	2,987	10	4
International economy	87,890	51	17	17	4,721	15	12
Internal politics	121,863	73	34	34	8,362	29	27
International economy	110,079	66	48	48	5,526	47	45
Total	356,946	203	105	105	21,596	101	88

Assuming D' is 73.8% of E, E = 137. The indices of the present invention calculated on this basis are:

Marking rate B-rate=B/A=203/356946=0.056%

Accuracy rate P-rate=D/B=105/203=51.72%

Recall rate D-rate=D/E=105/137=76.64%

Correction rate C-rate=C/E=105/137=76.64%
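The four indices can be reproduced from the totals of the experiment. In the sketch below, `proofreading_rates` is a hypothetical helper, and E is derived as in the text by assuming that D' = 101 is 73.8% of E.

```python
def proofreading_rates(a, b, c, d, e):
    """Return the four indices as fractions (multiply by 100 for percent)."""
    return {
        "B-rate": b / a,  # share of all characters that get marked
        "P-rate": d / b,  # share of marked characters that are truly wrong
        "D-rate": d / e,  # share of truly wrong characters that are detected
        "C-rate": c / e,  # share of truly wrong characters that are corrected
    }


e_estimate = round(101 / 0.738)  # estimate of E from D' and the 73.8% rate
rates = proofreading_rates(a=356946, b=203, c=105, d=105, e=e_estimate)
```

The P-rate, D-rate, and C-rate come out at the figures listed above; the B-rate computes to slightly under 0.057%, which the text gives as 0.056%.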

The B-rate, P-rate, and C-rate of the present invention are all far better than those of the known techniques, and the D-rate is roughly comparable, which shows that the present invention is of great practical value.

Claims

1. A method for automatically detecting and correcting Chinese wrongly written or mispronounced characters, by which a computer automatically detects and corrects such characters in a Chinese document, characterized in that it comprises the following steps:

A wrongly written or mispronounced character determining step, in which the above highest-scoring candidate character string is compared character by character with the text of said document, and the characters that differ are marked as wrongly written or mispronounced.

2. The method of claim 1, characterized in that the comprehensive approximate character set used in said comprehensive approximate character set replacement step consists, for each character, of the original character and one or more characters close to it in shape, sound, meaning, or input code.

3. The method of claim 2, characterized in that the approximate characters of each character in said comprehensive approximate character set are divided into a plurality of grades.

4. The method of claim 1, characterized in that, in said comprehensive approximate character set replacement step, the text of said document is first divided into a plurality of processing units according to punctuation marks before replacement.

5. The method of claim 1, characterized in that said language model scoring step deducts points for characters that differ from the original document.

6. The method of claim 1, characterized in that, when marking a wrongly written or mispronounced character, said determining step judges the corresponding character in said highest-scoring candidate character string to be the correct character for it.

7. A device for automatically detecting and correcting Chinese wrongly written or mispronounced characters, by which a computer automatically detects and corrects such characters in a Chinese document, characterized in that it comprises:

A wrongly written or mispronounced character judging device, which compares the above highest-scoring candidate character string character by character with the text of said document and marks the characters that differ as wrongly written or mispronounced.

8. The device of claim 7, characterized in that said comprehensive approximate character set replacement device comprises a segmenting device, which first divides the text of said document into a plurality of processing units according to punctuation marks before replacement.

9. The device of claim 7, characterized in that said comprehensive approximate character set replacement device comprises:

A comprehensive approximate character set database device, which holds, for each character of the Chinese character set, the original character and one or more characters close to it in shape, sound, meaning, or input code; and

A replacement device, which replaces a character with the approximate characters in the comprehensive approximate character set database device.

10. The device of claim 9, characterized in that the approximate characters of each character in said comprehensive approximate character set database device are divided into a plurality of grades.

11. The device of claim 7, characterized in that said language model scoring device comprises:

A language model statistical database, which records the frequency of occurrence of each linguistic unit and the frequency with which linguistic units occur in succession;

A scoring device, which evaluates the score of a string from the linguistic units it contains and the language model statistical database; and

A highest-scoring candidate string search device, which determines the candidate character string with the highest score.

12. The device of claim 11, characterized in that said scoring device deducts points for characters that differ from the original document.

13. The device of claim 11, characterized in that the language model statistical database of said language model scoring device comprises a Chinese lexicon recording the part of speech of each word.

14. The device of claim 11, characterized in that said language model scoring device searches for the highest-scoring candidate character string by dynamic programming.

15. The device of claim 7, characterized in that said wrongly written or mispronounced character judging device comprises:

A comparison device, which compares said highest-scoring candidate character string character by character with the text of said document; and

A marking device, which marks the characters found to differ as wrongly written or mispronounced.

16. The device of claim 7, characterized in that, when marking a wrongly written or mispronounced character, said judging device judges the corresponding character in said highest-scoring candidate character string to be the correct character for it.