CN103714053A - Japanese verb identification method for machine translation

Info

Publication number: CN103714053A
Application number: CN201310569693.1A
Authority: CN
Inventors: 张孝飞; 胡月卿; 马伟; 金善花; 孟翔; 李彦刚; 王强
Original assignee: Beijing Zhongxian Electronic Technology Development Center
Current assignee: Beijing Zhong Xian Electronic Technology Development Co., Ltd.
Priority date: 2013-11-13
Filing date: 2013-11-13
Publication date: 2014-04-09
Anticipated expiration: 2033-11-13
Also published as: CN103714053B

Abstract

The invention discloses a Japanese verb identification method for machine translation and belongs to the field of natural language processing. The method is rule-based and dictionary-assisted: by analyzing the conjugation rules of Japanese verbs, it identifies the verbs in a text as complete units and recovers their base forms through form restoration. It works with common general-purpose dictionaries and is adaptable and robust. The method improves the accuracy of lexical analysis and the quality of bilingual word correspondence in machine translation, and thereby improves overall translation quality.

Description

A Japanese verb identification method for machine translation

Technical field

The invention belongs to the field of natural language processing and relates to a method for automatically identifying Japanese verbs, specifically a Japanese verb identification method for machine translation that combines rules with a dictionary.

Background art

With increasingly frequent scientific, technological, and cultural exchange between China and Japan, breaking through the barriers to understanding and converting between the two languages has become a key factor. Translating Japanese information into readable, intelligible Chinese in a timely and accurate way has theoretical value and, even more, practical necessity and urgency. In existing statistical machine translation systems, word segmentation must be performed as a preprocessing step before the parallel corpus is used for training, and its quality directly affects translation quality. Because Japanese verbs have a large number of conjugated forms and dictionaries do not list all of them, dictionary-based segmentation of Japanese verbs has never achieved satisfactory results. How to segment and identify verbs correctly, improve word alignment, and thereby raise overall machine translation quality is a problem in urgent need of a solution.

In 2006, Taku Kudo of Japan developed MeCab, an open-source morphological analysis tool. MeCab is dictionary-based: it correctly identifies Japanese verbs that its dictionary lists (base-form entries). When it analyzes a conjugated form that the dictionary does not list, however, it splits the form into two or more words and then tags each word with a part of speech. This approach fails to segment the verb as one complete unit. When used as the segmentation preprocessing step for statistical machine translation, it degrades bilingual word alignment, hampers the estimation of translation model probabilities, and harms translation quality.

In China, the article "A morpheme-based Japanese word segmentation method and its application in OCR systems" (author name rendered as "golden spring" in this translation), published in the Chinese core journal Microcomputer Information, vol. 22, no. 1-3, 2006, proposed a morpheme-based Japanese segmentation method. Its main idea is to split each verb into a morpheme part and a suffix part according to the characteristics and conjugation rules of Japanese verbs, store the two parts in two separate dictionaries, and then identify Japanese verbs. The method was designed for OCR and aims to raise the OCR recognition rate; the recognized text needs no translation or other processing afterwards. Its shortcomings are that it also fails to cut out a conjugated word as a complete unit, and that it has to process two dictionaries separately, so extracting the morpheme information is both time-consuming and labor-intensive.

Summary of the invention

The method searches for candidate verbs mainly according to the positions where verbs occur in Japanese sentences and the markers with which verbs end. Once a candidate verb is found, it is restored to its base form, and the restored form is verified by dictionary lookup. If the restored form has a corresponding entry in the dictionary, restoration has succeeded and the word can be tagged with its part of speech. If the form produced by the restoration rules has no identical entry in the dictionary, the candidate verb is segmented and restored a second time; if it still has no corresponding entry after that, the string is left as it is and is not processed further.

Characteristics of Japanese verbs:

Japanese verbs mainly appear after particles, compound particles, and conjunctions.

The set of characters that can end a Japanese verb is limited.

The conjugated forms of Japanese verbs follow regular patterns.

Based on these characteristics, the present invention proposes a Japanese verb identification method that combines rules with a dictionary. The method comprises the following steps:

Step A: retrieve and mark special words that contain the left-adjacency markers (characters or character strings) and ending markers (characters); marked words take no part in subsequent verb identification.

The special words fall into two classes, special verbs and special non-verbs. A special verb is a Japanese verb whose characters include a left-adjacency marker (a character or character string) used in the candidate verb search; a special non-verb is a non-verb that contains a verb ending marker character.

Step B: after the special words have been retrieved, search for candidate verbs.

Step C: restore each candidate verb found and verify the result by dictionary lookup.

Step D: tag the part of speech of each candidate verb that was restored successfully and has a corresponding entry in the dictionary.

Step B further comprises the following steps:

Step B1: retrieve the left-adjacency marker (a character or character string) for the candidate verb search.

The left-adjacency markers (characters or character strings) for the candidate verb search comprise particles, compound particles, and conjunctions.

The ending markers (characters) for the verb search comprise godan (five-grade) verb ending markers, ichidan (one-grade) verb ending markers, and conjugated-form ending markers.

Step B2: search for a candidate verb ending marker (a character) within the corresponding range after the left-adjacency marker (a character or character string).

Step B3: cut out the span from the character immediately after the left-adjacency marker (a character or character string) through the candidate verb ending marker (a character) as the candidate verb to be restored.

In summary, we take a text string as input, a set made up of the left-adjacency markers (characters or character strings) of verbs, and a set made up of the ending marker characters. For any input text, the ways in which its string can contain a verb take the following form: [formula not reproduced in this text]. Once a left-adjacency marker and an ending marker have been found, the span from the character immediately after the left-adjacency marker through the ending marker is cut out as the candidate verb to be restored.
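As a minimal sketch of steps B1 to B3, the Python function below looks for a left-adjacency marker, scans a bounded window after it from back to front for an ending marker character, and cuts out the span in between. The function name `find_candidates`, its parameters, and the reading of the range as "the lo-th through the hi-th character after the marker" are assumptions made for illustration; the sample string is a shortened fragment, not the full example sentence.

```python
def find_candidates(text, marker, lo, hi, endings):
    """Return candidate verbs that follow `marker` in `text`.

    lo and hi bound the positions (1-based, counted from the first
    character after the marker) where an ending character may occur.
    """
    candidates = []
    start = text.find(marker)
    while start != -1:
        first = start + len(marker)      # index of the first character after the marker
        tail = text[first:first + hi]    # at most `hi` characters after the marker
        # Scan back to front, from the farthest allowed position down to `lo`.
        for pos in range(len(tail), lo - 1, -1):
            if tail[pos - 1] in endings:
                candidates.append(tail[:pos])
                break
        start = text.find(marker, start + 1)
    return candidates


# Hypothetical fragment: marker "に", range 3..13, endings "た" or "だ".
found = find_candidates("順に設けられた", "に", 3, 13, {"た", "だ"})
```

With these inputs `found` holds the single candidate `設けられた`. Scanning from the far end of the window means the longest span that ends in a marker character is preferred, which keeps a conjugated verb in one piece.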

Step C further comprises the following steps:

C1: apply the forward maximum matching algorithm on character strings to the candidate verb found, and retrieve the suffix (P) of the candidate verb to be restored.

C2: restore the retrieved suffix (P) of the candidate verb using its corresponding restoration rule.

C3: compare the restored form with the corresponding entry in the dictionary to verify that the identification is correct.

C4: when the restored form has no corresponding entry in the dictionary, perform secondary segmentation and secondary restoration on the candidate verb. If restoration then succeeds and the restored form is found in the dictionary, the restoration has succeeded; otherwise the candidate is not processed further.

Secondary segmentation and secondary restoration take into account that the candidate verb found may be a combination of two or three words. The candidate is segmented a second time into single words according to the continuative-use rules of Japanese verbs and the continuative marker characters, and each word is then restored by the restoration rules.

In summary, the core algorithm we use to restore candidate verbs is forward maximum matching on character strings: when the matching condition holds [symbols not reproduced in this text], the matched suffix is extracted and restored by its corresponding restoration rule. The restored form is then compared with the corresponding entry in the dictionary to verify that the identification is correct.
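Steps C1 to C3 can be sketched as below. The sketch reads "forward maximum matching" as trying split points from left to right, so that the first suffix found in the rule table is also the longest one. That reading, the names `RULES`, `longest_suffix`, and `restore`, and the one-entry dictionary are assumptions for illustration; the three suffix rules are the ones numbered 1, 129, and 174 in this document.

```python
# Suffix -> base-form ending, taken from restoration rules 1, 129, and 174.
RULES = {"ぼう": "ぶ", "られた": "る", "われる": "う"}

# Hypothetical stand-in for a real dictionary of base forms.
DICTIONARY = {"設ける"}


def longest_suffix(word, rules):
    """Return the longest suffix of `word` that has a rule, or None."""
    for i in range(1, len(word)):  # leave at least one stem character
        if word[i:] in rules:
            return word[i:]
    return None


def restore(word, rules, dictionary):
    """Restore `word` to a base form and verify it by dictionary lookup."""
    suffix = longest_suffix(word, rules)
    if suffix is None:
        return None
    base = word[: len(word) - len(suffix)] + rules[suffix]
    # The restored form counts only if the dictionary lists it.
    return base if base in dictionary else None
```

Calling `restore("設けられた", RULES, DICTIONARY)` returns `設ける`; with an empty dictionary it returns `None`, which is the case that step C4 hands over to secondary segmentation.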

The beneficial effects of the invention are as follows. Previous Japanese verb identification methods all fail to segment a conjugated verb form as one complete word, which hinders bilingual word alignment in statistical machine translation and harms translation quality. The method adopted by the present invention, which is oriented to machine translation and combines rules with a dictionary, segments and identifies as complete units the conjugated forms of Japanese verbs that the dictionary does not list. This improves bilingual word alignment in the segmentation preprocessing step of statistical machine translation and helps raise the quality of statistical machine translation.

Brief description of the drawing

The figure is a flowchart of the core processing of the present invention.

Detailed description of the embodiments

The method of the present invention is further described below with reference to a specific embodiment of Japanese verb identification.

Embodiment

This embodiment identifies all verbs in Japanese patent documents. The conjugated forms covered include the base form, past form, passive form, causative form, perfective form, and others.

As shown in the figure, the Japanese verb identification method of the present invention comprises the following steps:

Special word retrieval and marking

Special words are retrieved and marked using the special word lexicon we have compiled; they take no part in subsequent Japanese verb identification.

The following Japanese sentences are input:

① Recognize Certificate スイッチ Ga そ Entries order と ID(designation) the imperial capable う of The of To I ってそれぞれ system.

② 気温が上がることは、太陽熱が地面を温め、地面が空気を温めるからである。

Result for retrieval is as follows:

① Recognize Certificate スイッチ Ga そ Entries order と ID(designation) To I ってそれぞれ +++ adv system is driven the capable う of The.

② 気温が上がる+++v ことは、太陽熱が地面を温め、地面が空気を温めるからである。

"それぞれ" in sentence ① is a non-verb. Because it contains the verb ending marker character "れ", it would be misidentified as a verb if it were not retrieved in advance. "上がる" in sentence ② is a special verb: its characters include the left-adjacency marker "が" used in the verb search. If it were not retrieved in advance, the subsequent verb search rules would split "上がる" into three parts, "上/が/る", and cause a misidentification. We therefore retrieve and mark special words of this kind in advance so that they take no part in subsequent verb identification.
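A minimal sketch of this pre-marking step is shown below. It splits a string into segments and attaches a tag to every special word, so that later stages can skip the tagged segments. The function name `mark_special`, the segment representation, the longest-word-first matching order, and the two-entry lexicon are assumptions for illustration, not the lexicon used by the inventors.

```python
# Hypothetical special word lexicon: word -> part-of-speech tag.
SPECIAL = {"それぞれ": "adv", "上がる": "v"}


def mark_special(text, special):
    """Split `text` into (segment, tag) pairs; tag is None for plain text."""
    words = sorted(special, key=len, reverse=True)  # try longer words first
    segments, buffer, i = [], "", 0
    while i < len(text):
        for word in words:
            if text.startswith(word, i):
                if buffer:  # flush plain text collected so far
                    segments.append((buffer, None))
                    buffer = ""
                segments.append((word, special[word]))
                i += len(word)
                break
        else:
            buffer += text[i]
            i += 1
    if buffer:
        segments.append((buffer, None))
    return segments
```

For example, `mark_special("気温が上がる", SPECIAL)` yields `[("気温が", None), ("上がる", "v")]`. Only segments whose tag is `None` would then be passed to the candidate verb search, so the "が" inside "上がる" is never treated as a left-adjacency marker.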

Candidate verb search

After special word retrieval and marking, candidate verbs are searched for according to the left-adjacency marker (a character or character string), the search range for the ending marker character, and the ending marker character.

The following Japanese sentence is input:

Cis に Let けられ of Side から under Side To in さら To, こ box-shaped body は, そ

The search result is shown in the table below:

Table 1. Example of the candidate verb search algorithm

Item	Left-adjacency marker	Ending marker character	Candidate verb
Japanese character (string)	に	た	設けられた
Sequence number	16	11	FIRST-CHAR to 11

In the sequence numbers above, 16 denotes the left-adjacency marker numbered 16 ("に" in this embodiment), and 11 denotes the verb ending marker character numbered 11 ("た" in this embodiment). The search first finds the left-adjacency marker numbered 16 and then looks for an ending marker character within the (13, 3) range after it, where it finds the ending marker character numbered 11. The (13, 3) range is the range in which a verb ending marker character may occur: counting from the left-adjacency marker, the search is confined to the 3rd through the 13th character from left to right. The search proceeds from back to front, starting at the 13th character and moving forward to the 3rd. Once the ending marker character numbered 11 ("た") has been found, the span from the first character after the left-adjacency marker (FIRST-CHAR) through character number 11 ("た") is joined together; this is the candidate verb to be restored.

The candidate verb search rules are as follows:

1.を*->FIND(OR,(8,2),"り"|"き"|"ぎ"|"し"|"ち"|"ひ"|"び"|"み")

……

5.において*->FIND(OR,(6,16),"た"|"だ")

……

16.に*->FIND(OR,(3,13),"た"|"だ")

……
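The rule notation above is regular enough to be read by a small parser. The sketch below is an assumption about how such rules could be loaded; the names `RULE_RE` and `parse_search_rule` are hypothetical. It returns the two range numbers exactly as written and does not interpret them, and it accepts an optional rule number followed by a period or a space.

```python
import re

RULE_RE = re.compile(
    r'^\s*(?:(\d+)\s*\.?\s*)?'   # optional rule number, e.g. "16" or "5."
    r'(.+?)\*\s*->\s*'           # left-adjacency marker written before "*"
    r'FIND\(\s*OR\s*,\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*,'  # direction and range
    r'\s*(.+)\)\s*$'             # quoted ending characters separated by "|"
)


def parse_search_rule(line):
    """Parse one candidate verb search rule into a dictionary."""
    match = RULE_RE.match(line)
    if match is None:
        raise ValueError(f"not a search rule: {line!r}")
    number, marker, first, second, endings = match.groups()
    return {
        "number": int(number) if number else None,
        "marker": marker,
        "range": (int(first), int(second)),  # kept in the order written
        "endings": re.findall(r'"([^"]+)"', endings),
    }


rule = parse_search_rule('16 に* ->FIND(OR,(3,13),"た"|"だ")')
```

Here `rule["marker"]` is `に`, `rule["range"]` is `(3, 13)`, and `rule["endings"]` is `["た", "だ"]`. Keeping the rules as data rather than code makes it straightforward to extend the rule set without touching the search routine.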

Candidate verb restoration

The following Japanese sentence is input:

1. "in さら To, こ box-shaped body は, そ under Side To Cis に Let けられ of Side から", in which the candidate verb to be restored has been found to be "設けられた".

Table 2. Example of the restoration algorithm for a candidate verb to be restored

Candidate verb to be restored	Restoration process	Candidate verb after restoration
設けられた	られた (P₁₂₉) → る (I₁₂₉)	設ける

For the candidate verb to be restored, "設けられた", we use the forward maximum matching method described above to find its suffix P₁₂₉, namely "られた", and then restore "られた" to I₁₂₉, namely "る", according to restoration rule 129, to which P₁₂₉ belongs. Rule 129 is "*られた->INFLEX(-,る)". That is, the suffix of the candidate verb to be restored is found first (for "設けられた" the suffix is "られた"), "られた" is then restored to "る", which gives the new form "設ける", and finally the dictionary is consulted to check whether the entry "設ける" exists. The dictionary contains the entry "設ける", so the identification is correct.

The candidate verb restoration rules are as follows:

1.*ぼう->INFLEX(-,ぶ)

……

129.*られた->INFLEX(-,る)

……

174.*われる->INFLEX(-,う)

……
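The restoration rules can be loaded into a lookup table in the same spirit. The sketch below is an assumption for illustration; `INFLEX_RE` and `parse_restoration_rules` are hypothetical names. The first argument of `INFLEX`, written "-" in every rule shown here, is matched literally and not interpreted, and elision lines are skipped.

```python
import re

INFLEX_RE = re.compile(
    r'^\s*(\d+)\s*\.?\s*'        # rule number, optionally followed by "."
    r'\*(.+?)\s*->\s*'           # conjugated suffix written after "*"
    r'INFLEX\(\s*-\s*,\s*(.+?)\s*\)\s*$'  # base-form ending
)


def parse_restoration_rules(lines):
    """Build a table mapping suffix -> (rule number, base-form ending)."""
    table = {}
    for line in lines:
        match = INFLEX_RE.match(line)
        if match:  # lines such as "……" do not match and are skipped
            number, suffix, ending = match.groups()
            table[suffix] = (int(number), ending)
    return table


table = parse_restoration_rules([
    "1*ぼう->INFLEX(-,ぶ)",
    "……",
    "129 *られた->INFLEX(-,る)",
    "174.*われる->INFLEX(-,う)",
])
```

After this call `table["られた"]` is `(129, "る")`, which is the pair the worked example writes as P₁₂₉ and I₁₂₉.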

Above-mentioned example has been described the situation that entry information after reduction is found consistent entry in dictionary, if corresponding entry do not found in the entry after reduction in dictionary, at this moment we can carry out cutting again and reduction processing again to it

Now input Japanese as follows:

RAID は, データ The PVC ット/バイト Unit position, あ Ru いは Block ロック Unit Wei で Complex number scale recording device To disperse て to preserve The Ru +++ V method In, the high め of processing をオーバーラップ The Ru こと To I りパフォーマ Application ス The, high speed を Actual Now てい Ru.

According to the candidate verb search rules above, the candidate verb to be restored is "分散して保存する". Restoring it by the restoration rules above gives the new form "分散して保存". Because this string is a combination of two verbs, however, it cannot be found in the dictionary. For strings of this kind, we segment the candidate according to the secondary segmentation rules for candidate verbs.

The secondary segmentation rules for candidate verbs are as follows:

ん*->FIND(OR,(6,3),"て")

……

ん*->FIND(OR,(6,3),"い"|"き"|"ぎ"|"し"|"じ"|"ち"|"み"|"り"|"れ"|"え"|"じ"|"け"|"げ"|"せ"|"ぜ"|"ね"|"べ"|"め"|"ば")

……

The rules above are applied in priority order from first to last. "ん" stands for all left-adjacency markers of candidate verbs, and "OR" stands for "outside and right", meaning that the continuative marker character is searched for to the right, outside "ん". Following the rule "ん*->FIND(OR,(6,3),"て")", we find the continuative marker "て" of the string "分散して保存する" within the corresponding (6, 3) range and split the string into two words, "分散して" and "保存する". Each word is then restored by the candidate verb restoration rules above, and the restoration is verified by dictionary lookup. If restoration succeeds and the restored form is found in the dictionary, the restoration has succeeded; if the form still has no corresponding entry in the dictionary, the string is left as it is and is not processed.
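The secondary pass can be sketched as follows. This is a simplified illustration under stated assumptions: it ignores the (6, 3) search range and splits at the first occurrence of a marker, the rule "して" → "する" and both dictionary entries are hypothetical, and the names `lookup` and `secondary_pass` do not come from the source.

```python
def lookup(word, rules, dictionary):
    """Return the dictionary form of `word`, or None if none is found."""
    if word in dictionary:
        return word
    for i in range(1, len(word)):  # longest suffix first
        suffix = word[i:]
        if suffix in rules and word[:i] + rules[suffix] in dictionary:
            return word[:i] + rules[suffix]
    return None


def secondary_pass(candidate, markers, rules, dictionary):
    """Split `candidate` after a continuative marker and restore both parts."""
    for marker in markers:  # markers are tried in priority order
        cut = candidate.find(marker)
        if cut == -1:
            continue
        end = cut + len(marker)
        if end >= len(candidate):  # nothing left for a second word
            continue
        parts = [candidate[:end], candidate[end:]]
        bases = [lookup(part, rules, dictionary) for part in parts]
        if all(base is not None for base in bases):
            return list(zip(parts, bases))
    return None  # leave the string as it is


# Hypothetical rule and dictionary entries, for illustration only.
pairs = secondary_pass(
    "分散して保存する", ["て"], {"して": "する"}, {"分散する", "保存する"}
)
```

With these inputs `pairs` is `[("分散して", "分散する"), ("保存する", "保存する")]`. When either part fails the dictionary check the function returns `None`, matching the rule that an unverified string is kept unchanged.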

Part-of-speech tagging

If the restored candidate verb has a corresponding entry in the dictionary, it is tagged with its part of speech according to the restoration rule that was applied to it and the entries in the dictionary.

The part-of-speech tags used by this method are as follows:

Table 3. Part-of-speech tags

Part of speech	Adverb	Adjective	Noun	Verb	Pronoun	Conjunction
Symbol	adv	adj	n	v	pron	col

The conjugated-form tags for verbs used by the present invention are as follows:

Table 4. Conjugated-form tags for Japanese verbs

Form	Symbol	Form	Symbol
Base form	ori	Causative form	cau
Negative form	no	Passive form	pas
Conditional form	if	Perfective form	over
Past form	past	Suspensive form	te
ている form	ing	Continuative form	con
ます form	masu	Potential form	can

In addition, the conjugated forms of Japanese verbs can occur in combinations of the forms in the table above; the composite tags are not enumerated here.

For example: "Cis に Let けられ of Side から under Side To in さら To, こ box-shaped body は, そ."

Its tagging result is as follows:

Cis に Let けられ of Side から under Side To in さら To, こ box-shaped body は, そ +++ V (paspast).
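The tag format in this result can be produced by a small helper. The sketch below assumes that a composite tag is formed by concatenating the individual form symbols, which is inferred from the single example "paspast" (passive plus past), and that no spaces surround "+++"; the function name `annotate` is hypothetical.

```python
def annotate(word, pos, forms=()):
    """Append a part-of-speech tag and optional conjugated-form symbols."""
    tagged = f"{word}+++{pos}"
    if forms:
        # Composite forms are written as concatenated symbols, e.g. "paspast".
        tagged += "(" + "".join(forms) + ")"
    return tagged


# 設けられた is the passive past form of 設ける.
result = annotate("設けられた", "V", ["pas", "past"])
```

Here `result` is `設けられた+++V(paspast)`, and `annotate("それぞれ", "adv")` gives `それぞれ+++adv`, the form used when special words are marked.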

With the above method, even a conjugated form of a Japanese verb that the dictionary does not list can be segmented and identified as one complete verb (verb base form plus conjugation).

Claims

1. A Japanese verb identification method for machine translation, characterized by comprising the following steps:

Step A: retrieving and marking special words that contain the left-adjacency markers and ending markers used in the candidate verb search, the marked words taking no part in subsequent verb identification, wherein a left-adjacency marker is a character or character string and an ending marker is a character;

Step B: retrieving the left-adjacency marker and the candidate verb ending marker, and searching for candidate verbs;

Step C: restoring each candidate verb found and verifying the result by dictionary lookup;

Step D: tagging the part of speech of each candidate verb that, after restoration, has a corresponding entry in the dictionary;

wherein step B further comprises the following steps:

Step B1: retrieving the left-adjacency marker of the candidate verb;

Step B2: searching for the ending marker character of the candidate verb within a specified range after the left-adjacency marker;

Step B3: cutting out the span from the character immediately after the left-adjacency marker through the candidate verb ending marker character as the candidate verb to be restored;

and step C further comprises the following steps:

C1: applying the forward maximum matching algorithm on character strings to the candidate verb found, and retrieving the suffix of the candidate verb to be restored;

C2: restoring the retrieved suffix of the candidate verb using its corresponding restoration rule;

C3: comparing the restored form with the corresponding entry in the dictionary to verify that the identification is correct;

C4: when the restored form has no corresponding entry in the dictionary, segmenting and restoring the candidate verb again; if restoration then succeeds and the restored form is found in the dictionary, the restoration has succeeded; otherwise the candidate is not processed further.

2. The method according to claim 1, wherein the special words in step A comprise special verbs and special non-verbs.

3. The method according to claim 2, wherein a special verb is a Japanese verb that contains a left-adjacency marker used in the verb search, and a special non-verb is a non-verb that contains a verb ending marker character.

4. The method according to claim 1, wherein the left-adjacency marker in step B1 is a particle, particle combination, or conjunction that signals in a Japanese sentence that a verb is about to occur.

5. The method according to claim 1, wherein the ending marker character in step B2 is the last character of the base form and of every conjugated form of Japanese verbs.

6. The method according to claim 1, wherein the specified range in step B2 is the range, summarized from the conjugation rules of Japanese verbs, in which each ending marker is most likely to occur.

7. The method according to claim 1, wherein the suffix of the candidate verb to be restored in step C1 is the conjugated part of the Japanese verb.

8. The method according to claim 1, wherein the secondary segmentation and secondary restoration in step C4 comprise: segmenting the candidate a second time into single words according to the continuative-use rules of Japanese verbs and the continuative marker characters, and then restoring each word by the restoration rules.

9. The method according to claim 1, wherein, if the restored candidate verb has a corresponding entry in the dictionary, it is tagged with its part of speech.

10. The method according to claim 9, wherein the part-of-speech tags for adverb, adjective, noun, verb, pronoun, and conjunction are adv, adj, n, v, pron, and col, respectively.