CN105740218A

CN105740218A - Post-editing processing method for machine translation

Info

Publication number: CN105740218A
Application number: CN201610045883.7A
Authority: CN
Inventors: 姚佳; 刘世林; 吴雨浓; 陈炳章
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2015-12-31
Filing date: 2016-01-22
Publication date: 2016-07-06

Abstract

The invention relates to the field of machine translation, and in particular to a post-editing processing method for machine translation. The method designs a wrong-word correction rule template and a reordering rule template, which are used to extract wrong-word correction rules and reordering rules from machine translations. Errors in a machine translation are corrected in a fixed order: translated-word errors are corrected first and word-order errors are adjusted afterwards, so that the quality of the machine translation is improved systematically by comprehensive means and the translation result improves significantly. The two templates take a comprehensive set of factors into account: in addition to the machine translation and the standard translation, they also draw on information from the source text, which better matches the principle and essence of translation. Because the extracted rules are combined with more reasonable conditions, translated-word errors and word-order errors in machine translation can be corrected more effectively.

Description

A machine translation post-editing processing method

Technical field

The present invention relates to the field of machine translation, and in particular to a machine translation post-editing processing method.

Background technology

The Internet now reaches every part of the world, and people of different nationalities and countries can share and exchange information anytime and anywhere. People want unimpeded access to all the information available on the network. Accurate and efficient automatic translation between multiple languages therefore has great market demand in the current and future international environment. However, at the present level of technology, major technical difficulties must still be overcome before a high-performance, powerful, and highly accurate Internet multilingual translation system can be built. At the existing level of machine translation, high-quality, directly usable machine translation is still unavailable. The usual way to address this problem is to use machine translation as a preprocessing step and then have humans post-edit its output to obtain a usable translation. Obtaining a high-quality result generally places high demands on the professional ability of the post-editors, and expert-level human post-editors are indispensable. Faced with the huge gap in translation demand, however, the workload of human post-editing is very large, and a limited number of experts cannot handle such a volume of work. The high labor and time costs of post-editing limit the development and application of machine translation.

By analyzing user editing patterns and translation error types, researchers have found that many errors in machine translation output recur (for example lexical translation errors, sentence-structure errors, and word-form errors). Handling these recurring errors through human post-editing consumes considerable labor and material cost, and seriously reduces both the efficiency of machine translation and the satisfaction of translation users. Many researchers have therefore tried to build automatic post-editing models that automatically correct translations containing the same or similar errors according to the error types of the machine translation, so as to reduce the workload of human post-editing and improve translation quality. Most existing mainstream methods train an automatic post-editing model based on SMT (statistical machine translation) on a parallel corpus of machine translations and expert post-edited translations. Although research on automatic post-editing based on statistical machine translation has achieved some results, much remains unclear about what actually happens inside an SMT system. With this kind of post-editing technique, one can only know that the method improves the quality of the final translation, but not which specific post-editing operations are effective (some post-editing operations reflect defects of the machine translation system), which makes it hard to analyze the shortcomings of machine translation directly. If the patterns of recurring errors in machine translation could be learned automatically, and the errors matching these patterns corrected automatically, the root causes of machine translation errors could be analyzed, which would help improve machine translation quality at the source.

In addition, machine translation errors generally fall into two classes. The first is translated-word errors. Translated-word errors are among the most basic errors in a translation; according to statistics, translated-word errors (including missing words, redundant words, wrong words, and inconsistent word choices) can account for more than 60% of all machine translation errors. The second is word-order errors. Word-order errors are likewise among the most basic errors; according to statistics, word-order errors (including fronting word-order errors, word-order errors of interrogative (W) phrases within a sentence, word-order errors of be-verb/modal-verb (MD) phrases within a sentence, and word-order errors of adjacent phrases within a sentence) account for a large share of all machine translation errors. Given the large grammatical differences between languages, word-order errors are very likely to occur in machine translation, and they greatly affect how professional the machine translation reads. Because translated-word errors and word-order errors both account for a large share of machine translation errors, correcting only one kind of error yields a limited, local improvement. In the face of large translation demand, a comprehensive way to improve machine translation quality is needed.

Summary of the invention

The object of the present invention is to overcome the above deficiencies of the prior art by providing a machine translation post-editing processing method that first corrects translated-word errors in the machine translation and then adjusts word-order errors, so that the translation quality of the machine translation is significantly improved. To this end, the invention constructs a wrong-word correction rule template and a reordering rule template, and corrects the translation errors of the machine translation by performing wrong-word correction first and word-order adjustment second. The wrong-word correction rule template contains information about the current word and the substitute word; besides information from the corresponding machine translation and standard translation, this includes information from the corresponding source text. The reordering rule template contains information about the first word to be reordered and the second word to be reordered.

To achieve the above object, the invention provides the following technical solution: a machine translation post-editing processing method comprising the following steps:

(1) Construct a wrong-word correction rule template. The template includes a rule condition and a corrective action. The rule condition includes the current word of the machine translation, the N words before and the N words after the current word, and the N words before and the N words after the source word that the current word translates, where N is a positive integer not less than 1. The corrective action is: replace the current word with the substitute word;

Construct a reordering rule template. The template includes the information of a word pair to be reordered; the pair consists of a first word to be reordered and a second word to be reordered. The information of the first word to be reordered includes: the first word to be reordered, the N words before and after it together with their parts of speech, and the N words before and after its corresponding source word. The information of the second word to be reordered includes: the second word to be reordered, the N words before and after it together with their parts of speech, and the N words before and after its corresponding source word, where N is 0 or a positive integer;

(2) Use the wrong-word correction rule template to extract wrong-word correction rules for the machine translation, and correct the translated-word errors of the machine translation according to these rules;

(3) Use the reordering rule template to extract reordering rules for the machine translation, and adjust, according to these rules, the word-order errors in the machine translation corrected in step (2).
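The ordering of steps (2) and (3) can be sketched in Python as follows. This is a minimal illustration, assuming each learned rule is represented as a callable that maps a token list to a token list; the names `post_edit`, `fix_article`, and `move_adverb` are hypothetical and do not appear in the patent.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: a learned rule is a function on token lists.
Rule = Callable[[List[str]], List[str]]


def post_edit(tokens: List[str],
              correction_rules: Sequence[Rule],
              reordering_rules: Sequence[Rule]) -> List[str]:
    """Apply wrong-word correction rules first, then reordering rules."""
    for rule in correction_rules:      # step (2)
        tokens = rule(tokens)
    for rule in reordering_rules:      # step (3) works on the corrected output
        tokens = rule(tokens)
    return tokens


def fix_article(tokens: List[str]) -> List[str]:
    # Toy correction rule: "a" directly before "apple" becomes "an".
    return ["an" if t == "a" and tokens[i + 1:i + 2] == ["apple"] else t
            for i, t in enumerate(tokens)]


def move_adverb(tokens: List[str]) -> List[str]:
    # Toy reordering rule: move a sentence-initial "yesterday" to the end.
    return tokens[1:] + tokens[:1] if tokens[:1] == ["yesterday"] else tokens


result = post_edit(["yesterday", "I", "ate", "a", "apple"],
                   [fix_article], [move_adverb])
```

Keeping the two rule lists separate makes the fixed order explicit: reordering rules only ever see text that has already passed through word correction.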

Specifically, the current word and the substitute word are obtained as follows: the machine translation is compared with the standard translation and the source text; when the context of a word A in the machine translation is found to be identical to the context of a word B in the standard translation, and A ≠ B, word A in the machine translation is taken as the current word and word B in the standard translation as the substitute word.
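The comparison described above can be sketched as follows. This is a simplified illustration that compares only the machine translation with the standard translation and leaves out the source text; the function names, the context width `n`, and the sentence-boundary padding tokens are assumptions made for the example.

```python
from typing import List, Tuple

Context = Tuple[Tuple[str, ...], Tuple[str, ...]]


def context(tokens: List[str], i: int, n: int) -> Context:
    """Return the n tokens before and after position i, padded at sentence edges."""
    padded = ["<s>"] * n + list(tokens) + ["</s>"] * n
    j = i + n
    return tuple(padded[j - n:j]), tuple(padded[j + 1:j + 1 + n])


def find_substitutions(mt: List[str], ref: List[str], n: int = 1) -> List[Tuple[str, str]]:
    """Collect (current word A, substitute word B) pairs with identical context and A != B."""
    pairs = []
    for i, a in enumerate(mt):
        for j, b in enumerate(ref):
            if a != b and context(mt, i, n) == context(ref, j, n):
                pairs.append((a, b))
    return pairs


pairs = find_substitutions("I eat a apple".split(), "I eat an apple".split())
```

In this toy sentence pair only "a" and "an" share the context ("eat", "apple"), so they form the single candidate pair.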

Further, step (2) comprises the following steps:

(2-1) Prepare a training set that includes source texts to be translated and the corresponding standard translations;

(2-2) Input the source texts to be translated into a machine translation system to obtain the corresponding machine translations;

(2-3) Input the training sample set and the machine translations into a learning machine equipped with the wrong-word correction rule template;

(2-4) The learning machine compares the differences among the machine translation, the standard translation, and the source text, extracts first rules for wrong-word correction in the machine translation according to the wrong-word correction rule template, and forms the corresponding first rule set;

(2-5) Use each rule of the first rule set to correct the Dev (development set) machine translation; compare the corrected translation with the Dev standard translation and compute the BLEU gain of each rule; select from the rule set the correction rule with the largest BLEU gain (defined as the first correction rule);

(2-6) Apply the first correction rule to the machine translation to form a first corrected translation; likewise apply the first correction rule to the Dev machine translation;

(2-7) Input the first corrected translation into the learning machine; the learning machine compares the differences among the first corrected translation, the standard translation, and the source text, extracts second rules according to the wrong-word correction rule template, and forms a second rule set;

(2-8) Use each rule of the second rule set to correct the Dev machine translation; compare the corrected translation with the Dev standard translation and compute the BLEU gain of each rule; select from the rule set the correction rule with the largest BLEU gain (defined as the second correction rule);

Iterate in this way until the BLEU gains of all extracted correction rules are less than the set threshold, then stop.

Further, during rule extraction, the correction rule with the largest BLEU gain selected from the rule set in each iteration is extracted and recorded; the recorded rules form a rule sequence in the order in which they were obtained.

Further, the rule sequence is applied to automatically correct wrong translated words in the machine translation, reducing its translated-word errors.
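The iterative selection in steps (2-4) to (2-8) can be sketched as a greedy loop. The sketch below makes several simplifying assumptions that are not in the patent: rules are context-free (wrong word, substitute) pairs, extraction compares equal-length sentences position by position, and a token-match ratio stands in for BLEU. All function names are hypothetical.

```python
from typing import List, Tuple

Sentence = List[str]
Rule = Tuple[str, str]  # toy rule: (wrong word, substitute), no context condition


def learn_rule_sequence(train_mt, train_ref, dev_mt, dev_ref,
                        extract, apply_rule, score, threshold) -> List[Rule]:
    """Greedily pick the rule with the largest dev-set gain until gains fall below threshold."""
    assert threshold > 0, "a positive threshold guarantees termination"
    sequence = []
    while True:
        candidates = extract(train_mt, train_ref)
        base = score(dev_mt, dev_ref)
        gains = {r: score([apply_rule(r, s) for s in dev_mt], dev_ref) - base
                 for r in candidates}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] < threshold:
            break
        sequence.append(best)  # record rules in the order they were selected
        train_mt = [apply_rule(best, s) for s in train_mt]
        dev_mt = [apply_rule(best, s) for s in dev_mt]
    return sequence


def extract(mt_set, ref_set) -> List[Rule]:
    # Toy extraction: every position where MT and reference disagree yields a rule.
    return sorted({(m, r) for mt, ref in zip(mt_set, ref_set)
                   for m, r in zip(mt, ref) if m != r})


def apply_rule(rule: Rule, sent: Sentence) -> Sentence:
    wrong, sub = rule
    return [sub if t == wrong else t for t in sent]


def score(mt_set, ref_set) -> float:
    # Stand-in metric (NOT BLEU): fraction of positions that match the reference.
    total = sum(len(ref) for ref in ref_set)
    hits = sum(m == r for mt, ref in zip(mt_set, ref_set) for m, r in zip(mt, ref))
    return hits / total


train_mt = [["a", "apple"], ["he", "go", "home"]]
train_ref = [["an", "apple"], ["he", "goes", "home"]]
dev_mt = [["a", "apple", "falls"], ["she", "go", "out"], ["a", "orange"]]
dev_ref = [["an", "apple", "falls"], ["she", "goes", "out"], ["an", "orange"]]
sequence = learn_rule_sequence(train_mt, train_ref, dev_mt, dev_ref,
                               extract, apply_rule, score, threshold=0.05)
```

Rules are extracted from the training set but scored on the development set, which mirrors the split described in the steps above; a real implementation would substitute a BLEU scorer for `score`.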

Further, the first word to be reordered and the second word to be reordered are obtained as follows: the machine translation is aligned with the standard translation to establish a mapping between the words of the machine translation and those of the standard translation; when a pair of words whose positions cross is found in this mapping, the pair is regarded as a word pair to be reordered, the earlier word of the pair is defined as the first word to be reordered, and the later word as the second word to be reordered.
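Detecting crossing positions in an alignment can be sketched as follows. The exact-match aligner here is only a toy stand-in for a real alignment tool, and it assumes each word occurs at most once per sentence; `align_exact` and `crossing_pairs` are hypothetical names.

```python
from typing import Dict, List, Tuple

Link = Tuple[int, int]  # (position in machine translation, position in standard translation)


def align_exact(mt: List[str], ref: List[str]) -> List[Link]:
    """Toy aligner: link identical words (assumes each word occurs once)."""
    ref_pos: Dict[str, int] = {w: j for j, w in enumerate(ref)}
    return [(i, ref_pos[w]) for i, w in enumerate(mt) if w in ref_pos]


def crossing_pairs(links: List[Link]) -> List[Tuple[int, int]]:
    """Return MT position pairs (first, second) whose alignment links cross."""
    links = sorted(links)
    pairs = []
    for a in range(len(links)):
        for b in range(a + 1, len(links)):
            (i1, j1), (i2, j2) = links[a], links[b]
            if i1 < i2 and j1 > j2:  # earlier in MT but later in the reference
                pairs.append((i1, i2))
    return pairs


mt = "I yesterday went home".split()
ref = "I went home yesterday".split()
pairs = crossing_pairs(align_exact(mt, ref))
```

Here "yesterday" (position 1) crosses both "went" (position 2) and "home" (position 3), which suggests treating "went home" as a block when forming the pair to be reordered.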

Further, step (3) comprises the following steps:

(3-1) Prepare a training set that includes source texts to be translated and the corresponding standard translations;

(3-2) Input the source texts to be translated into a machine translation system to obtain the corresponding machine translations;

(3-3) Input the training sample set and the machine translations into a learning machine equipped with the reordering rule template;

(3-4) The learning machine compares the differences among the machine translation, the standard translation, and the source text, extracts first rules for word-order adjustment in the machine translation according to the reordering rule template, and forms the corresponding first rule set;

(3-5) Use each rule of the first rule set to adjust the development set machine translation; compare the adjusted development set machine translation with the development set standard translation and compute the BLEU gain before and after adjustment; select from the rule set the reordering rule with the largest BLEU gain, defined as the first reordering rule;

(3-6) Apply the first reordering rule to the machine translation to form a first adjusted translation;

(3-7) Input the first adjusted translation into the learning machine; the learning machine compares the differences among the first adjusted translation, the standard translation, and the source text, extracts second rules according to the reordering rule template, and forms a second rule set;

(3-8) Use each rule of the second rule set to adjust the development set machine translation; compare the adjusted translation with the development set standard translation and compute the BLEU gain before and after adjustment; select from the rule set the reordering rule with the largest BLEU gain, defined as the second reordering rule;

Iterate in this way until the BLEU gain is less than the set threshold, then stop.

Further, during rule extraction, the reordering rule with the largest BLEU gain selected from the rule set in each iteration is extracted; the extracted rules form a rule sequence in the order of extraction.

Further, the rule sequence is applied to adjust the word-order errors in the machine translation corrected in step (2).

Further, the current word may be null, which corresponds to the case where a word is missing from the machine translation.

Further, the substitute word may also be null, which corresponds to the case where the machine translation contains a redundant word; replacing it with null deletes the redundant word and improves the machine translation result.

Further, the current word and/or the substitute word may also be a phrase; by adding, deleting, and replacing phrases, more wrong-word cases in machine translation can be found, so that the machine translation result improves markedly.
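The null and phrase cases can be covered by one replacement operation if the current word and the substitute word are both treated as (possibly empty) token tuples. The sketch below assumes this representation and omits the source-side condition; `apply_edit` and its parameters are hypothetical names.

```python
from typing import List, Tuple

Phrase = Tuple[str, ...]  # an empty tuple plays the role of the null value


def apply_edit(tokens: List[str], prev: Phrase, current: Phrase,
               nxt: Phrase, substitute: Phrase) -> List[str]:
    """Replace `current` with `substitute` at the first place where it sits between prev and nxt."""
    pattern = prev + current + nxt
    out = list(tokens)
    k = len(pattern)
    for i in range(len(out) - k + 1):
        if tuple(out[i:i + k]) == pattern:
            start = i + len(prev)
            out[start:start + len(current)] = substitute
            return out
    return out  # no match: leave the sentence unchanged


# Null substitute: delete a redundant word.
deleted = apply_edit(["he", "is", "is", "tall"], ("he", "is"), ("is",), ("tall",), ())
# Null current word: insert a missing word.
inserted = apply_edit(["he", "tall"], ("he",), (), ("tall",), ("is",))
# Phrase as current word: replace two words with one.
replaced = apply_edit(["a", "lot", "people"], (), ("a", "lot"), ("people",), ("many",))
```

Insertion, deletion, and phrase replacement then differ only in the lengths of the two tuples, so a single rule format covers all three.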

Further, in the reordering rule template, the N words before the second word to be reordered may be null; this corresponds to moving a word located toward the end of the sentence in the machine translation forward to the beginning of the sentence.

Further, in the reordering rule template, the N words after the first word to be reordered may be null; this corresponds to moving a word located toward the front of the sentence in the machine translation back to the end of the sentence.

Further, the first word to be reordered is a word block consisting of no fewer than two words.

Further, the second word to be reordered is a word block consisting of no fewer than two words.

Further, the first word to be reordered and/or the second word to be reordered are each a word block consisting of no fewer than two words.
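When the words to be reordered are blocks, the exchange can be sketched as a swap of two token spans. The half-open `(start, end)` span representation and the function name `swap_spans` are assumptions made for this illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def swap_spans(tokens: List[str], first: Span, second: Span) -> List[str]:
    """Exchange two non-overlapping spans, keeping any tokens between them in place."""
    (a0, a1), (b0, b1) = first, second
    if not (0 <= a0 < a1 <= b0 < b1 <= len(tokens)):
        raise ValueError("spans must be non-empty, ordered, and non-overlapping")
    return tokens[:a0] + tokens[b0:b1] + tokens[a1:b0] + tokens[a0:a1] + tokens[b1:]


# Single word exchanged with a two-word block.
block_swap = swap_spans("I yesterday went home".split(), (1, 2), (2, 4))
# Two single words with a token between them.
gap_swap = swap_spans(["a", "b", "c", "d"], (0, 1), (2, 3))
```

A single word is simply a span of length one, so the same function handles word-word, word-block, and block-block exchanges.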

Compared with the prior art, the beneficial effects of the invention are as follows. The invention provides a machine translation post-editing processing method that uses the designed wrong-word correction rule template and reordering rule template to extract wrong-word correction rules and reordering rules for machine translation, and corrects and adjusts machine translation errors in the order of translated-word correction first and word-order adjustment second. The quality of machine translation is thus improved systematically by comprehensive means, and the machine translation result improves significantly. The two templates take a comprehensive set of factors into account: in addition to the machine translation and the standard translation, they draw on information from the source text. Introducing the original text better matches the principle and essence of translation (translation being the act of converting information in a source language into information in another language on the basis of accuracy and fluency). The extracted rules are correspondingly combined with more reasonable conditions, so translated-word errors and word-order errors in machine translation can be corrected more effectively. Machine translation post-editing implemented with this method is automated, which saves the labor and time spent on human post-editing and removes an obstacle to obtaining high-quality machine translation.

In addition, the method can extract rules from the output of different machine translation systems, so the extracted correction rules and reordering rules are more adaptable and more targeted.

Brief description of the drawings:

Fig. 1 is a schematic diagram of the implementation process of the method of the invention.

Fig. 2 is a schematic diagram of the rule sequence extraction process in the method of the invention.

Fig. 3 is a schematic diagram of the extraction process for the wrong-word correction rules and reordering rules of the invention.

Fig. 4 is a schematic diagram of the wrong-word correction rule template used in Embodiment 1 of the invention.

Fig. 5 is a schematic diagram of the reordering rule template used in Embodiment 1 of the invention.

Fig. 6 is a schematic diagram of the source text, machine translation, and standard translation in Embodiment 1 of the invention.

Fig. 7 is a schematic diagram of the alignment of the source text, machine translation, and standard translation in Embodiment 1 of the invention.

Fig. 8 is a schematic diagram of the source text, machine translation, and standard translation after missing-word correction.

Fig. 9 is a schematic diagram of the alignment of the source text, machine translation, and standard translation in Fig. 8.

Fig. 10 is a schematic diagram of the source text, machine translation, and standard translation after the wrong-word correction rule template has been applied again.

Fig. 11 is a schematic diagram of aligning the source text, machine translation, and standard translation of Fig. 10 and finding the "crossover" phrase pairs to be reordered.

Fig. 12 is a schematic diagram of the source text, machine translation, and standard translation after the machine translation in Fig. 11 has been adjusted with the reordering rules extracted using the reordering rule template.

It should be noted that all drawings of the invention are schematic and do not represent the actual process.

Detailed description of the invention

The invention is described in further detail below with reference to test examples and specific embodiments. This should not, however, be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the invention fall within its scope.

A machine translation post-editing processing method first corrects translated-word errors in the machine translation and then adjusts word-order errors, so that the translation quality of the machine translation is significantly improved. The invention constructs a wrong-word correction rule template and a reordering rule template, and corrects the translation errors of the machine translation by performing wrong-word correction first and word-order adjustment second. The wrong-word correction rule template contains information about the current word and the substitute word; besides information from the corresponding machine translation and standard translation, this includes information from the corresponding source text. The reordering rule template contains information about the first word to be reordered and the second word to be reordered.

A machine translation post-editing processing method comprises the following process, as shown in Fig. 1:

(1) Construct a wrong-word correction rule template. The template includes a rule condition and a corrective action. The rule condition includes the current word of the machine translation, the N words before and the N words after the current word, and the N words before and the N words after the source word that the current word translates, where N is a positive integer not less than 1. The corrective action is: replace the current word with the substitute word.

By comparing the machine translation with the standard translation, the above rule template finds the translated-word errors in the machine translation. Rules for correcting translated-word errors are extracted by combining the current word of the machine translation and its context with the context in the corresponding original text. The factors considered are more comprehensive and reasonable, and better match the essence and principle of translation: translation is the act of converting information in a source language into information in another language on the basis of accuracy and fluency. Correction rules extracted with reference to both the machine translation and the original text fit the context of the original better, and the final translation is more accurate and natural.
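One possible in-memory representation of a rule instantiated from the wrong-word correction template is sketched below. The field names, the `<s>`/`</s>` padding tokens, and the `align` dictionary (machine translation position to source position) are assumptions made for the example, not details given in the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Words = Tuple[str, ...]


def context(tokens: List[str], i: int, n: int) -> Tuple[Words, Words]:
    """Return the n tokens before and after position i, padded at sentence edges."""
    padded = ["<s>"] * n + list(tokens) + ["</s>"] * n
    j = i + n
    return tuple(padded[j - n:j]), tuple(padded[j + 1:j + 1 + n])


@dataclass(frozen=True)
class WordCorrectionRule:
    current: str       # wrong word in the machine translation
    prev_words: Words  # N words before the current word
    next_words: Words  # N words after the current word
    src_prev: Words    # N words before the aligned source word
    src_next: Words    # N words after the aligned source word
    substitute: str    # corrective action: replace current with this word


def apply_rule(rule: WordCorrectionRule, mt: List[str], src: List[str],
               align: Dict[int, int]) -> List[str]:
    """Replace every MT word whose target-side and source-side contexts match the rule."""
    n = len(rule.prev_words)
    out = list(mt)
    for i, word in enumerate(mt):
        if (word == rule.current
                and i in align
                and context(mt, i, n) == (rule.prev_words, rule.next_words)
                and context(src, align[i], n) == (rule.src_prev, rule.src_next)):
            out[i] = rule.substitute
    return out


rule = WordCorrectionRule("a", ("eat",), ("apple",), ("吃",), ("</s>",), "an")
corrected = apply_rule(rule, ["I", "eat", "a", "apple"], ["我", "吃", "苹果"],
                       {0: 0, 1: 1, 2: 2, 3: 2})
```

Because the rule condition includes the source-side window, the same target-side context can lead to different corrections for different source sentences.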

Construct a reordering rule template. The reordering rule template includes a rule condition and an adjustment action. The rule condition includes the information of a word pair to be reordered, which consists of a first word to be reordered and a second word to be reordered. The information of the first word to be reordered includes: the first word (or phrase) to be reordered, the N words before it, the N words after it, the parts of speech of the N words before it, the parts of speech of the N words after it, the N words before its corresponding source word in the source text, and the N words after its corresponding source word in the source text. The information of the second word to be reordered includes: the second word (or phrase) to be reordered, the N words before it, the N words after it, the parts of speech of the N words before it, the parts of speech of the N words after it, the N words before its corresponding source word in the source text, and the N words after its corresponding source word in the source text, where N is 0 or a positive integer (the information on the word pair to be reordered can be extended to further context as needed, which makes rule extraction more flexible and able to handle more complex word-order adjustments). The adjustment action is to exchange the order of the first word to be reordered and the second word to be reordered.

The first and second words to be reordered are obtained as follows: the machine translation is aligned with the standard translation to establish a mapping between the words of the machine translation and those of the standard translation; when a pair of words whose mapped positions cross is found in this mapping, the pair is regarded as a word pair to be reordered (a "crossover" word pair), the earlier word of the pair is defined as the first word to be reordered, and the later word as the second word to be reordered.

With the above rule template, the method of the invention compares the machine translation with the standard translation, finds the word-order errors in the machine translation, and extracts reordering rules. Reordering rules are extracted by considering the information of the first and second words to be reordered (including context words and/or the corresponding parts of speech) together with the context information of the corresponding source text. The factors considered are more comprehensive and reasonable, and better match the essence and principle of translation: translation is the act of converting source text information into information in another language on the basis of accuracy and fluency. Reordering rules extracted with reference to both the machine translation and the source text fit the context of the source text better, and the final translation is more accurate and natural.
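The per-word information listed in the reordering template can be collected into a record like the one sketched below. The class and field names, the part-of-speech tags, and the `align` dictionary are assumptions for illustration; a full rule condition would hold two such records, one for each word of the pair.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Words = Tuple[str, ...]


def window(seq: List[str], i: int, n: int) -> Tuple[Words, Words]:
    """Return the n items before and after position i, padded at sequence edges."""
    padded = ["<s>"] * n + list(seq) + ["</s>"] * n
    j = i + n
    return tuple(padded[j - n:j]), tuple(padded[j + 1:j + 1 + n])


@dataclass(frozen=True)
class ReorderSlot:
    word: str          # the word to be reordered
    prev_words: Words  # N words before it in the machine translation
    next_words: Words  # N words after it
    prev_pos: Words    # parts of speech of the N words before it
    next_pos: Words    # parts of speech of the N words after it
    src_prev: Words    # N words before its aligned source word
    src_next: Words    # N words after its aligned source word


def make_slot(mt: List[str], pos: List[str], src: List[str],
              align: Dict[int, int], i: int, n: int) -> ReorderSlot:
    prev_w, next_w = window(mt, i, n)
    prev_p, next_p = window(pos, i, n)
    src_prev, src_next = window(src, align[i], n)
    return ReorderSlot(mt[i], prev_w, next_w, prev_p, next_p, src_prev, src_next)


mt = ["I", "yesterday", "went", "home"]
pos = ["PRP", "NN", "VBD", "NN"]            # hypothetical tagger output
src = ["我", "昨天", "回", "家"]
align = {0: 0, 1: 1, 2: 2, 3: 3}
slot = make_slot(mt, pos, src, align, i=1, n=1)
bare = make_slot(mt, pos, src, align, i=1, n=0)  # N = 0: no context at all
```

With N = 0 every context field is empty, which matches the template's allowance for N being zero and yields the most general form of a rule.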

(2) Use the wrong-word correction rule template to extract wrong-word correction rules for the machine translation, and correct the translated-word errors of the machine translation according to these rules. By matching against the wrong-word correction rule template, the method repeatedly extracts rules and corrects the corresponding wrongly translated words, so that the machine translation continually approaches the standard translation. This improves machine translation quality and achieves automatic post-editing. Correcting with the rules extracted by the method greatly reduces recurring translated-word errors in machine translation, and saves the labor and time that human post-editing would spend correcting translated-word errors.

Further, step (2) comprises the process shown in Fig. 2 and Fig. 3:

(2-1) preparing training set, described training set includes source document and corresponding standard translation, training sample set Size can choose according to the needs of study, such as 20000, the sample that usual training sample is concentrated This meeting is the most, to ensure the quantity of modification rule and the quality that extract；

(2-2) above-mentioned source document to be translated input machine translation system will obtain the machine translation of correspondence；Machine Translation system is existing machine translation system, and such as Baidu's translation, Google translate, have translation, spirit lattice These translations etc., different machine translation systems can due to the mode that its translation word mistake of each own feature occurs Can have any different, the inventive method can select existing any machine translation system, and versatility is good, is suitable for Wide wealthy；Simultaneously when, after selected a certain machine translation system, the present invention can carry out corresponding wrong word modification rule Extracting, the drawback that there is this translation system carries out intuitively and effectively analyzing, thus have the strongest for Property.

(2-3) described training sample set is input to has wrong word modification rule mould with corresponding machine collection of translations In the learning machine of plate；Described learning machine be load wrong word modification rule template, it is possible to realize machine translation, Alignment between standard translation and original text and set up the functional module of corresponding mapping relations (mapping relations can To be interpreted as vocabulary or the corresponding relation of phrase in original text, machine translation and standard translation), wherein machine is translated Literary composition with mark translation align and mapping relations foundation employing meteor instrument realize, and machine translation and Translation source document then carries out correspondence by the way of dictionary and distance.

(2-4) The learner compares the machine translation with the standard translation and the source text; the comparison is carried out with the METEOR alignment tool or a dictionary-based method. For example, the source text is:

According to the wrong word correction rule template, the first rules for correcting wrong words in the machine translation are extracted, forming the corresponding first rule set. If the training sample set contains, for example, 20,000 source texts, there are 20,000 corresponding standard translations and 20,000 corresponding machine translations; comparing them through the rule template extracts a series of correction rules.
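As a rough illustration of how candidate (current word, substitute) pairs might be collected from one training sample, the sketch below treats a machine-translation word A and a standard-translation word B as a candidate when their neighbouring words match and A differs from B. The function names `context` and `extract_candidates`, the one-word window and the sentence-boundary markers are assumptions made for illustration; the source does not specify them.

```python
from typing import List, Tuple


def context(tokens: List[str], i: int) -> Tuple[str, str]:
    """Return the (left, right) neighbours of tokens[i], with boundary markers at the edges."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return left, right


def extract_candidates(mt: List[str], ref: List[str]) -> List[Tuple[str, str]]:
    """Collect (current word A, substitute B) pairs whose contexts match while A != B."""
    pairs = []
    for i, a in enumerate(mt):
        for j, b in enumerate(ref):
            if a != b and context(mt, i) == context(ref, j):
                pairs.append((a, b))
    return pairs


# Hypothetical training sample: machine translation and standard translation.
mt = "i has a book".split()
ref = "i have a book".split()
candidates = extract_candidates(mt, ref)
```

A real learner would also record the surrounding machine-translation and source-text words as rule conditions; this sketch keeps only the word pair.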

(2-5) Use each rule of the first rule set to correct the Dev machine translation set (the Dev set is the development sample set); compare the corrected translations with the Dev standard translations, calculate the BLEU gain before and after applying each rule, and select from the rule set the correction rule with the largest BLEU gain (defined as the first correction rule). Introducing the development sample set prevents overfitting during correction rule extraction.
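The BLEU gain of a rule can be pictured as the difference between the BLEU score of the Dev translations after and before the rule is applied. The sketch below is a simplified corpus-level BLEU (one reference per sentence, n-grams up to 4, no smoothing); the source does not state which BLEU configuration is used, so these choices and the names `corpus_bleu` and `bleu_gain` are assumptions.

```python
import math
from collections import Counter
from typing import List

Corpus = List[List[str]]


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: Corpus, refs: Corpus, max_n: int = 4) -> float:
    """Simplified corpus-level BLEU: single reference, no smoothing."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hyps, refs):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matched[n - 1] += sum(min(c, r[g]) for g, c in h.items())  # clipped matches
            total[n - 1] += sum(h.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


def bleu_gain(before: Corpus, after: Corpus, refs: Corpus) -> float:
    """BLEU after applying a rule minus BLEU before applying it."""
    return corpus_bleu(after, refs) - corpus_bleu(before, refs)


# Hypothetical Dev sample: the rule inserts the missing word "the".
dev_ref = ["which bus should i take to go to the cinema".split()]
dev_before = ["which bus should i take to go to cinema".split()]
dev_after = ["which bus should i take to go to the cinema".split()]
gain = bleu_gain(dev_before, dev_after, dev_ref)
```

A positive `gain` means the rule moved the Dev translations closer to the standard translations.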

(2-6) Apply the first correction rule to correct the machine translations, forming the first corrected translations. Correcting the machine translations of the training sample set with the rule that has the largest BLEU gain gives the most notable correction effect, which helps improve the efficiency of rule extraction.

(2-8) Use each rule of the second rule set to correct the Dev machine translations; compare the corrected machine translations with the Dev standard translations, calculate the BLEU gain of each rule, and select from the rule set the correction rule with the largest BLEU gain (defined as the second correction rule). To reduce computation and improve efficiency when calculating the BLEU gain at this point, existing rules that are not affected by the new rule reuse the BLEU gain calculated in the previous step.

The above steps are repeated iteratively; calculation stops when the BLEU gain falls below the set threshold or when no new rule can be learned.

During rule extraction, the correction rule with the largest BLEU gain selected from the rule set in each round (the first correction rule, the second correction rule, the third correction rule, ...) is extracted and recorded. Finally, all returned rules form a rule sequence in the order in which they were returned.
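The iteration described above amounts to a greedy loop: score every candidate rule on the Dev set, keep the one with the largest gain, apply it, and repeat until the gain drops below a threshold. The sketch below illustrates only that control flow. It uses a fixed pool of context-free rules and a toy position-match score in place of BLEU, and it omits both the re-extraction of rules after each correction and the reuse of previously calculated gains; all names are hypothetical.

```python
from typing import Callable, List, Tuple

Corpus = List[List[str]]
Rule = Tuple[str, str]  # (current word, substitute); rule conditions omitted


def apply_rule(rule: Rule, corpus: Corpus) -> Corpus:
    """Replace every occurrence of the current word with the substitute."""
    current, substitute = rule
    return [[substitute if w == current else w for w in sent] for sent in corpus]


def overlap_score(hyps: Corpus, refs: Corpus) -> float:
    """Toy stand-in for BLEU: share of reference positions matched exactly."""
    hits = sum(h == r for hyp, ref in zip(hyps, refs) for h, r in zip(hyp, ref))
    return hits / max(1, sum(len(ref) for ref in refs))


def learn_rule_sequence(rules: List[Rule], dev_mt: Corpus, dev_ref: Corpus,
                        score: Callable[[Corpus, Corpus], float] = overlap_score,
                        threshold: float = 1e-9) -> List[Rule]:
    """Greedily pick the rule with the largest gain until the gain is below threshold."""
    sequence: List[Rule] = []
    current = dev_mt
    remaining = list(rules)
    while remaining:
        base = score(current, dev_ref)
        gains = [(score(apply_rule(r, current), dev_ref) - base, r) for r in remaining]
        best_gain, best_rule = max(gains, key=lambda g: g[0])
        if best_gain < threshold:
            break  # no remaining rule improves the Dev score enough
        sequence.append(best_rule)
        current = apply_rule(best_rule, current)
        remaining.remove(best_rule)
    return sequence


dev_mt = ["i has a apple".split(), "she has a apple".split()]
dev_ref = ["i have an apple".split(), "she has an apple".split()]
pool = [("has", "have"), ("a", "an"), ("apple", "pear")]
sequence = learn_rule_sequence(pool, dev_mt, dev_ref)
```

In this toy run the rule `("has", "have")` is rejected because, without context conditions, it repairs one sentence and damages the other; this is the kind of case the context conditions in the rule template are meant to separate.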

Applying this rule sequence to correct translated-word errors in machine translations effectively corrects the wrong words that recur in machine translation, markedly improves translation quality, and reduces the labour and time that human post-editing requires.

Applied in a machine translation system, the method reduces translated-word errors at the source, improving the word-level accuracy and the reliability of machine translation, so that exchange between different languages becomes smoother; this promotes communication between speakers of different languages and further social and economic progress.

Further, the current word may be a null value, which corresponds to the case where the machine translation is missing a word; adding the word makes the machine translation more accurate.

Further, the substitute may also be a null value, which corresponds to the case where the machine translation contains an extra word compared with the human translation; replacing the word with a null value deletes the superfluous word and makes the machine translation more correct.

Further, the current word and/or the substitute may also be a phrase. Adding, deleting and replacing phrases reveals more wrong-word cases in the machine translation, so that the accuracy of the machine translation improves markedly.
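The three cases above (null current word, null substitute, phrase-level replacement) can all be expressed as one span replacement over a token list. The sketch below is an illustration under assumptions: the function `apply_edit` and its explicit `position` argument are hypothetical, and in the method itself the position would follow from the rule conditions rather than being passed in.

```python
from typing import List


def apply_edit(tokens: List[str], current: List[str],
               substitute: List[str], position: int) -> List[str]:
    """Replace `current` at `position` with `substitute`.

    An empty `current` inserts words; an empty `substitute` deletes words;
    multi-word lists handle phrases.
    """
    end = position + len(current)
    if tokens[position:end] != current:
        return list(tokens)  # the rule does not match here; leave the sentence unchanged
    return tokens[:position] + substitute + tokens[end:]


# Null current word: the machine translation is missing a word.
inserted = apply_edit("go to cinema".split(), [], ["the"], 2)
# Null substitute: the machine translation has an extra word.
deleted = apply_edit("he is is tall".split(), ["is"], [], 1)
# Phrase replacement.
replaced = apply_edit("i look forward for it".split(),
                      ["forward", "for"], ["forward", "to"], 2)
```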

Further, step (3) comprises the following implementation steps:

Embodiment 1

Translation between different languages should take the linguistic context and the surrounding content into account. A correction rule template for translated-word errors is therefore constructed, as shown in Fig. 4:

wd:A > B

... wd:C@[-2] & wd:D@[-1] & wd:E@[1] & wd:F@[2] ...

... srcwd:G@[-2] & srcwd:H@[-1] & srcwd:I@[1] & srcwd:J@[2] ...

Here, wd:A > B in the first row is the correction rule, meaning that the current word A is corrected to the substitute B; the second and third rows are the rule conditions. wd:C@[-2] denotes the second word after the current word, wd:D@[-1] the first word after the current word, wd:E@[1] the first word before the current word, and wd:F@[2] the second word before the current word. srcwd:G@[-2] denotes the second word after the substitute B in the source text, srcwd:H@[-1] the first word after the substitute B in the source text, srcwd:I@[1] the first word before the substitute B in the source text, and srcwd:J@[2] the second word before the substitute B in the source text. The rule template reads as follows: when the rule conditions are met, the correction is triggered and the current word A is corrected to the corresponding substitute. The rule template is loaded into the learner; the training set (source texts and corresponding standard translations) and the machine translations corresponding to the training-set source texts are input into the learner, which aligns the machine translation with the source text and the standard translation and finds the correction rules in them according to the rule template.
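To make the template concrete, the sketch below shows one possible in-memory form of such a rule and a check of its conditions against a sentence. It is an illustration under assumptions: the class `CorrectionRule`, the alignment dictionary from machine-translation positions to source positions, and the example sentences are all hypothetical. Offsets are treated as signed positions relative to the current word, with negative offsets pointing to earlier words; flip the sign if the opposite reading is intended.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CorrectionRule:
    current: str                                          # word A
    substitute: str                                       # word B
    wd: Dict[int, str] = field(default_factory=dict)      # offset -> machine-translation word
    srcwd: Dict[int, str] = field(default_factory=dict)   # offset -> source-text word


def word_at(tokens: List[str], index: int) -> Optional[str]:
    """Return tokens[index], or None when the index falls outside the sentence."""
    return tokens[index] if 0 <= index < len(tokens) else None


def matches(rule: CorrectionRule, mt: List[str], src: List[str],
            align: Dict[int, int], i: int) -> bool:
    """Check the current word and every wd/srcwd condition at machine-translation position i."""
    if mt[i] != rule.current:
        return False
    if any(word_at(mt, i + off) != w for off, w in rule.wd.items()):
        return False
    j = align.get(i)  # source position aligned with mt[i]
    if rule.srcwd and j is None:
        return False
    return all(word_at(src, j + off) == w for off, w in rule.srcwd.items())


def apply_rule(rule: CorrectionRule, mt: List[str], src: List[str],
               align: Dict[int, int]) -> List[str]:
    """Replace the current word with the substitute wherever all conditions hold."""
    return [rule.substitute if matches(rule, mt, src, align, i) else w
            for i, w in enumerate(mt)]


rule = CorrectionRule("hit", "play",
                      wd={-1: "like", 1: "basketball"},
                      srcwd={-1: "喜欢", 1: "篮球"})

src = ["我", "喜欢", "打", "篮球"]
mt = ["i", "like", "hit", "basketball"]
fixed = apply_rule(rule, mt, src, {0: 0, 1: 1, 2: 2, 3: 3})

src2 = ["不要", "打", "他"]
mt2 = ["do", "not", "hit", "him"]
unchanged = apply_rule(rule, mt2, src2, {0: 0, 1: 0, 2: 1, 3: 2})
```

The second sentence contains the same current word but fails the context conditions, so it is left untouched.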

As shown in Fig. 5, the word order adjustment rule is as follows:

Here the first four rows give the information for the first word to be reordered, and the last four rows give the information for the second word to be reordered. Words_1:[X] denotes the first word to be reordered, X. wd:B1@[-2] denotes the second word before the first word to be reordered, B1; wd:C1@[-1] denotes the first word before it, C1; wd:D1@[1] denotes the first word after it, D1; and wd:E1@[2] denotes the second word after it, E1. pos_1:F1@[-2] states that the part of speech of the second word before the first word to be reordered is F1; pos_1:H1@[-1] that the part of speech of the first word before it is H1; pos_1:I1@[1] that the part of speech of the first word after it is I1; and pos_1:J1@[2] that the part of speech of the second word after it is J1. srcwd_1:B2@[-2] states that the second word before the source word corresponding to the first word to be reordered is B2; srcwd_1:C2@[-1] that the first word before that source word is C2; srcwd_1:D2@[1] that the first word after that source word is D2; and srcwd_1:E2@[2] that the second word after that source word is E2.

For the second word to be reordered, Words_2:[a] denotes that the second word to be reordered is a. word_2:b@[-2] states that the second word before the second word to be reordered is b; word_2:c@[-1] that the first word before it is c; word_2:d@[1] that the first word after it is d; and word_2:e@[2] that the second word after it is e. pos_2:f@[-2] states that the part of speech of the second word before the second word to be reordered is f; pos_2:h@[-1] that the part of speech of the first word before it is h; pos_2:i@[1] that the part of speech of the first word after it is i; and pos_2:j@[2] that the part of speech of the second word after it is j. srcwd_2:b1@[-2] states that the second word before the source word corresponding to the second word to be reordered is b1; srcwd_2:c1@[-1] that the first word before that source word is c1; srcwd_2:d1@[1] that the first word after that source word is d1; and srcwd_2:e1@[2] that the second word after that source word is e1.

Applying the above rule templates to extract rules, and using the corresponding rules to correct wrong words and adjust word order, yields good corrections of the machine translation, as shown in Fig. 6, Fig. 7, Fig. 8, Fig. 9, Fig. 10, Fig. 11 and Fig. 12. For example, the source text (rendered here in English) is "goes to the cinema and sit which bus?"; the machine translation is "go to cinema which bus should i take?"; and the standard translation is "which bus should i take to go to the cinema?". By aligning the source text, the machine translation and the standard translation and applying the above rule templates, the machine translation is corrected successively to "go to the cinema which bus should i take?", then "to go to the cinema which bus should i take?", and finally "which bus should i take to go to the cinema?". The quality of the machine translation is visibly improved by these corrections and adjustments.
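The three successive corrections in this example can be replayed on a token list as two insertions followed by one reordering. The sketch below is only a replay: the helper names `insert_word` and `swap_spans` and the hard-coded positions are assumptions, standing in for what learned wrong word correction rules and reordering rules would determine, and the reordering is modelled as a swap of two adjacent spans.

```python
from typing import List, Tuple


def insert_word(tokens: List[str], index: int, word: str) -> List[str]:
    """Insert `word` before position `index` (a null current word replaced by a substitute)."""
    return tokens[:index] + [word] + tokens[index:]


def swap_spans(tokens: List[str], first: Tuple[int, int],
               second: Tuple[int, int]) -> List[str]:
    """Swap two non-overlapping spans given as (start, end), with `first` before `second`."""
    (a0, a1), (b0, b1) = first, second
    return tokens[:a0] + tokens[b0:b1] + tokens[a1:b0] + tokens[a0:a1] + tokens[b1:]


mt = "go to cinema which bus should i take ?".split()

step1 = insert_word(mt, 2, "the")            # wrong word correction: add "the"
step2 = insert_word(step1, 0, "to")          # wrong word correction: add "to"
step3 = swap_spans(step2, (0, 5), (5, 10))   # word order adjustment

result = " ".join(step3)
```

The order matters here: the word corrections come first, so the reordering step moves the already corrected phrase "to go to the cinema" as a whole.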

It should be understood that the number of samples in the training set in the embodiments of the invention is much greater than 1, for example 20,000. Extracting rules from a training sample of this size makes the resulting rule sequence more effective at correcting wrong words in the machine translation.

Claims

1. A machine translation post-editing processing method, characterized by comprising the following implementation process:

2. The method of claim 1, characterized in that the current word and the substitute are obtained as follows: the machine translation is compared with the standard translation and the source text; when the context of a word A in the machine translation is found to be identical to the context of a word B in the standard translation, and A ≠ B, the word A in the machine translation is taken as the current word and the word B in the standard translation is taken as the substitute.

3. The method of claim 2, characterized in that step (2) comprises the following implementation steps:

(2-2) inputting the source texts into a machine translation system to obtain the corresponding machine translations;

(2-3) inputting the training sample set and the machine translations into a learner loaded with the wrong word correction rule template;

(2-5) using each rule of the first rule set to correct the Dev machine translations; comparing the corrected translations with the Dev standard translations, calculating the BLEU gain of each rule, and selecting from the rule set the correction rule with the largest BLEU gain, defined as the first correction rule;

(2-8) using each rule of the second rule set to correct the Dev machine translations; comparing the corrected translations with the Dev standard translations, calculating the BLEU gain of each rule, and selecting from the rule set the correction rule with the largest BLEU gain, defined as the second correction rule;

4. The method of claim 3, characterized in that, during rule extraction, the correction rule with the largest BLEU gain selected from the rule set each time is extracted and recorded, and the rules form a rule sequence in the order in which they were returned.

5. The method of claim 4, characterized in that the rule sequence is applied to automatically correct wrong translated words in the machine translation, reducing the translated-word errors in the machine translation.

6. The method of any one of claims 1 to 5, characterized in that the first word to be reordered and the second word to be reordered are obtained as follows: the machine translation is aligned with the standard translation to establish the mapping relations between the words of the machine translation and those of the standard translation; when a pair of words whose positions cross is found in the mapping between the machine translation and the standard translation, that pair is regarded as a pair of words to be reordered, the former word of the pair being defined as the first word to be reordered and the latter word as the second word to be reordered.

7. The method of claim 6, characterized in that step (3) comprises the following implementation steps:

8. The processing method of claim 7, characterized in that, during rule extraction, the reordering rule with the largest BLEU gain selected from the rule set each time is extracted, and the rules form a rule sequence in the order in which they were extracted.

9. The processing method of claim 8, characterized in that the rule sequence is applied to adjust word order errors in the machine translation corrected in step (2).
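The acquisition of word pairs to be reordered recited in claim 6 can be pictured as a search for crossing links in the word alignment. The sketch below is illustrative only: the alignment is assumed to be a dictionary from machine-translation positions to standard-translation positions, and the function name `crossing_pairs` is hypothetical.

```python
from typing import Dict, List, Tuple


def crossing_pairs(alignment: Dict[int, int]) -> List[Tuple[int, int]]:
    """Return machine-translation position pairs (i, j), i < j, whose aligned
    standard-translation positions appear in the opposite order."""
    positions = sorted(alignment)
    pairs = []
    for a, i in enumerate(positions):
        for j in positions[a + 1:]:
            if alignment[i] > alignment[j]:
                pairs.append((i, j))
    return pairs


mt = ["the", "car", "red"]
ref = ["the", "red", "car"]
alignment = {0: 0, 1: 2, 2: 1}  # machine-translation index -> standard-translation index

pairs = crossing_pairs(alignment)
first_word, second_word = mt[pairs[0][0]], mt[pairs[0][1]]
```

In this toy case the former word of the crossing pair, "car", plays the role of the first word to be reordered and the latter, "red", the second.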