CN102023969A - Methods and devices for acquiring weighted language model probability and constructing weighted language model - Google Patents


Info

Publication number: CN102023969A (application CN2009101702922A)
Authority: CN (China)
Prior art keywords: language model, sentence, words, group, weighting
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN2009101702922A
Other languages: Chinese (zh)
Inventors: 刘占一 (Liu Zhanyi), 王海峰 (Wang Haifeng), 吴华 (Wu Hua)
Current and original assignee: Toshiba Corp (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Toshiba Corp; priority to CN2009101702922A; publication of CN102023969A

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a device for obtaining a weighted language model probability for a sentence, a method and a device for constructing a weighted language model, and a corpus-based machine translation method and system. The method for obtaining the weighted language model probability for a sentence comprises: weighting the probability of each word group in the sentence that is involved in the language model probability calculation by the weight of that word group; and obtaining the weighted language model probability for the sentence from the weighted probabilities of those word groups. The weight of each word group is set according to the structure of the sentence and reflects the degree to which the word group influences the fluency of the sentence. Because each word group involved in the language model probability calculation is assigned such a structure-dependent weight, the fluency of the sentence can be reflected more accurately on the basis of the structure of the sentence.

Description

Method and apparatus for obtaining a weighted language model probability and constructing a weighted language model
Technical field
The present invention relates to information processing technology and, more particularly, to a method and apparatus for obtaining a weighted language model probability for a sentence, a method and apparatus for constructing a weighted language model, and a corpus-based machine translation method and system using these methods or apparatuses.
Background technology
Statistical machine translation is one of the main corpus-based machine translation technologies. It usually employs several probability models, among which the language model is one of the most important. A language model computes a probability value for a sentence (or word sequence) that indicates how fluent the sentence is; that is, it computes the probability that the sentence (or word sequence) occurs in the language concerned, i.e. how commonly it is used.
In statistical machine translation, the occurrence probability of a candidate translation computed with the language model (hereinafter the "language model probability") helps in selecting among translations. The higher the language model probability, the more common the translation and the better it conforms to the conventions of the target language, so evaluating the fluency of candidate translations with the language model probability helps to guarantee the quality of the generated translation.
In existing statistical machine translation, the language model is usually trained from a monolingual corpus using a Markov model. According to the Markov model, for a sentence E = {e_1, e_2, …, e_N} containing N words, its language model probability p(E) is obtained by the following formula (1):

p(E) = \prod_{i=1}^{N} p(e_i \mid e_1, e_2, \ldots, e_{i-2}, e_{i-1})    (1)

where p(e_i | e_1, e_2, …, e_{i-2}, e_{i-1}) is the probability of word e_i, i.e. the probability that e_i appears after the i-1 preceding words e_1, e_2, …, e_{i-2}, e_{i-1}.
However, because training data are sparse, when the language model probability of a sentence is computed according to formula (1), a smoothed n-gram model, taking the Markov model as its theoretical basis, is usually used in practice to obtain the language model probability approximately. According to the smoothed n-gram model, for the above sentence E = {e_1, e_2, …, e_N} containing N words, the language model probability p(E) is approximated by the following formula (2):

p(E) \approx \prod_{i=1}^{N} p(e_i \mid e_{i-n+1}, e_{i-n+2}, \ldots, e_{i-2}, e_{i-1})    (2)

where the probability p(e_i | e_{i-n+1}, e_{i-n+2}, …, e_{i-2}, e_{i-1}) of each word e_i no longer depends on all i-1 words preceding e_i, but only on the n-1 words immediately before it. Usually n is taken between 2 and 5. The word group "e_{i-n+1}, e_{i-n+2}, …, e_{i-1}, e_i" is called an n-gram, and the probability p(e_i | e_{i-n+1}, …, e_{i-1}) is also called the probability of the n-gram "e_{i-n+1}, e_{i-n+2}, …, e_{i-1}, e_i".
The process of computing the language model probability according to the smoothed n-gram model is described in detail below with a concrete example.
For example, suppose the sentence to be scored is "this is your seat ." and n = 3. According to formula (2), the language model probability of this sentence is the product of the respective probabilities p(this), p(is|this), p(your|this, is), p(seat|is, your) and p(.|your, seat) of the 5 words "this", "is", "your", "seat" and "." that make up the sentence, that is:

p(this is your seat .) = p(this) × p(is|this) × p(your|this, is) × p(seat|is, your) × p(.|your, seat)

Here, the probability p(is|this) is the probability that "is" appears after "this"; it can be computed in advance from the occurrence frequencies of "is" and "this is" counted in a monolingual corpus. The word group "this, is" is called a 2-gram (or bigram). Likewise, the probability p(your|this, is) is the probability that "your" appears after "this is", which can also be computed in advance from the occurrence frequencies of "this is your" and "this is" counted in the monolingual corpus; the word group "this, is, your" is called a 3-gram (or trigram). The same applies to the probabilities p(this), p(seat|is, your) and p(.|your, seat).
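As a concrete sketch of formula (2) applied to this example, the following Python snippet sums trigram log probabilities over the sentence. The probability values are invented for illustration only; in a real system they would be estimated from a monolingual corpus, as the patent describes.

```python
# Hypothetical trigram log probabilities (the patent gives no corpus counts).
# In practice p(w | h) would be estimated from a monolingual corpus,
# e.g. p(is | this) ~= count("this is") / count("this").
trigram_logp = {
    ("<s>", "<s>", "this"): -1.2,   # stands in for p(this)
    ("<s>", "this", "is"): -0.8,    # p(is | this)
    ("this", "is", "your"): -1.5,   # p(your | this, is)
    ("is", "your", "seat"): -2.0,   # p(seat | is, your)
    ("your", "seat", "."): -0.5,    # p(. | your, seat)
}

def sentence_logp(words, logp_table, n=3):
    """Log form of formula (2): sum of log p(e_i | previous n-1 words)."""
    padded = ["<s>"] * (n - 1) + words
    total = 0.0
    for i in range(n - 1, len(padded)):
        total += logp_table[tuple(padded[i - n + 1 : i + 1])]
    return total

print(round(sentence_logp(["this", "is", "your", "seat", "."],
                          trigram_logp), 6))  # -6.0
```

Working in the log domain turns the product of formula (2) into a sum, which avoids numeric underflow for long sentences.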
As can be seen from formula (2) and the above example, when a language model built on the smoothed n-gram model (hereinafter the "standard language model") computes the language model probability of a sentence, it treats the probabilities of all the n-grams in the sentence without any distinction. However, a translation generated by a statistical machine translation system is usually composed of various constituents such as phrases and words. Different constituents differ in quality, so their importance within the sentence is not all the same either. If the importance of the constituents in the sentence is not distinguished, the computed language model probability may fail to reflect the fluency of the sentence well. This is illustrated with the following concrete example.
Suppose that for the sentence to be translated, "I would like a middle seat.", a phrase-based statistical machine translation system has obtained the following two candidate translations:

T1: (I want) (one medium) (seat) (.)
T2: (I want) (centre) (seat) (.)

where "( )" marks a phrase making up the candidate translation, extracted from the target sentences of bilingual example sentences.
Table 1 below shows the probability of each 3-gram involved in the calculation of the language model probabilities of the candidate translations T1 and T2 (for convenience, the logarithm log p(e_i | e_{i-n+1}, …, e_{i-1}) of each probability value is used), with the differing part of T1 and T2 (medium / centre) replaced by X.

Table 1
[3-gram log-probability table; available only as image B2009101702922D0000031 in the source]

where <s> and </s> are the sentence-start and sentence-end marks that have been added.
Based on the 3-gram probabilities shown in Table 1, the standard language model built on the smoothed n-gram model yields the language model probabilities shown in Table 2 below for the candidate translations T1 and T2.

Table 2
[Language model probabilities of T1 and T2; available only as image B2009101702922D0000041 in the source]

According to the results shown in Table 2, the statistical machine translation system will select candidate translation T1, whose language model probability is higher, as the final translation of the sentence "I would like a middle seat.", while candidate translation T2, which is in fact of better quality, is not selected because its language model probability is lower.
It can thus be seen that when the structure or composition of a sentence is not considered, i.e. when the importance of the constituents in the sentence is not distinguished, the language model probability computed with the above smoothed n-gram model may be higher for a poor translation than for a good one, ultimately causing the statistical machine translation system to select the wrong translation. In other words, because a language model built on the smoothed n-gram model ignores the structure of the sentence, the statistical machine translation system cannot effectively distinguish high-quality translations among the candidate translations.
Summary of the invention
The present invention has been made in view of the above problems in the prior art. Its object is to provide a method and apparatus for obtaining a weighted language model probability for a sentence, a method and apparatus for constructing a weighted language model, and a corpus-based machine translation method and system using these methods or apparatuses, in which, according to the structure of the sentence, a weight is assigned to each word group in the sentence involved in the language model probability calculation and the weighted language model probability of the sentence is obtained therefrom, so that the quality of the sentence is reflected more accurately on the basis of its structure.
According to one aspect of the present invention, a method for obtaining a weighted language model probability for a sentence is provided, comprising: for each word group in the sentence involved in the language model probability calculation, weighting the probability of the word group by the weight of the word group; and obtaining the weighted language model probability for the sentence from the weighted probabilities of those word groups; wherein the weight of each word group is a weight set according to the structure of the sentence to reflect the degree to which the word group influences the fluency of the sentence.
According to another aspect of the present invention, a method for constructing a weighted language model is provided, comprising constructing a weighted language model which, for each word group in a sentence involved in the language model probability calculation, weights the probability of the word group by the weight of the word group, and obtains the weighted language model probability for the sentence from the weighted probabilities of those word groups; wherein the weight of each word group is a weight set according to the structure of the sentence to reflect the degree to which the word group influences the fluency of the sentence.
Preferably, the word groups are divided into two classes, word groups contained within a single phrase and word groups spanning at least two phrases, and the class of word groups spanning at least two phrases is set a relatively higher weight than the class of word groups contained within a single phrase.
Preferably, the word groups are divided into a plurality of classes according to the number of phrases contained in a word group, and among these classes, a class containing more phrases is set a higher weight.
Preferably, the weight of each word group is a weight fine-tuned class by class using a development set, the development set comprising a plurality of source language sentences and reference translations corresponding to these source language sentences.
The fine-tuning is realized by the following steps: for each of the classes, setting, according to the degree to which the class influences the fluency of sentences, a weight initial value and a search region containing the weight initial value; for each of the classes, while keeping the weight values of the other classes unchanged, stepping through the weight values in the search region of the class one by one from its weight initial value at a predetermined step size, and generating translations for the source language sentences based on each weight value; for each of the classes, determining, within its search region, the weight value that obtains for the source language sentences the best translations as compared against the reference translations, as the optimal weight value of the class; and repeating the steps of generating translations for the source language sentences and determining the optimal weight values over all the classes, until the quality of the translations no longer improves.
According to a further aspect of the present invention, a corpus-based machine translation method is provided, comprising: obtaining, with the above method for obtaining a weighted language model probability for a sentence, weighted language model probabilities for each of a plurality of candidate translations generated for a sentence to be translated; and selecting the final translation of the sentence to be translated from these candidate translations with reference to their weighted language model probabilities.
According to a further aspect of the present invention, an apparatus for obtaining a weighted language model probability for a sentence is provided, comprising a language model probability calculating unit which, for each word group in the sentence involved in the language model probability calculation, weights the probability of the word group by the weight of the word group, and obtains the weighted language model probability for the sentence from the weighted probabilities of those word groups; wherein the weight of each word group is a weight set according to the structure of the sentence to reflect the degree to which the word group influences the fluency of the sentence.
Preferably, a weight setting unit in the above apparatus for obtaining a weighted language model probability for a sentence divides the word groups into a plurality of classes according to the number of phrases contained in a word group, and sets a higher weight for a class containing more phrases.
Preferably, the above apparatus for obtaining a weighted language model probability for a sentence further comprises a weight adjusting unit which fine-tunes the weight of each of the classes using a development set, the development set comprising a plurality of source language sentences and reference translations corresponding to these source language sentences.
According to a further aspect of the present invention, an apparatus for constructing a weighted language model is provided, comprising a model constructing unit which constructs a weighted language model that, for each word group in a sentence involved in the language model probability calculation, weights the probability of the word group by the weight of the word group, and obtains the weighted language model probability for the sentence from the weighted probabilities of those word groups; wherein the weight of each word group is a weight set according to the structure of the sentence to reflect the degree to which the word group influences the fluency of the sentence.
Preferably, the weight of each word group is set class by class according to the sentence structure.
Preferably, the word groups are divided into two classes, word groups contained within a single phrase and word groups spanning at least two phrases, and the class of word groups spanning at least two phrases is set a relatively higher weight than the class of word groups contained within a single phrase.
Preferably, the word groups are divided into a plurality of classes according to the number of phrases contained in a word group, and among these classes, a class containing more phrases is set a higher weight.
Preferably, the weighted language model probability is obtained for the sentence according to the following formula:

p_w(E) \approx \prod_{i=1}^{N} p(e_i \mid e_{i-n+1}, e_{i-n+2}, \ldots, e_{i-2}, e_{i-1})^{w(e_{i-n+1}, \ldots, e_i)}

where E = {e_1, e_2, …, e_N} denotes a sentence containing N words, p_w(E) denotes the weighted language model probability of the sentence, p(e_i | e_{i-n+1}, e_{i-n+2}, …, e_{i-2}, e_{i-1}) denotes the probability of the word group (e_{i-n+1}, e_{i-n+2}, …, e_{i-2}, e_{i-1}, e_i), and w(e_{i-n+1}, …, e_i) is the weight of that word group.
According to a further aspect of the present invention, a corpus-based machine translation system is provided, comprising: the above apparatus for obtaining a weighted language model probability for a sentence, or a weighted language model constructed by the above apparatus for constructing a weighted language model; and a translation generating unit which generates a plurality of candidate translations for a sentence to be translated, obtains weighted language model probabilities for these candidate translations respectively using the above apparatus for obtaining a weighted language model probability for a sentence or the above weighted language model, and selects the final translation of the sentence to be translated from these candidate translations with reference to their weighted language model probabilities.
Description of drawings
It is believed that the above features, advantages and objects of the present invention will be better understood from the following description of embodiments of the invention taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for obtaining a weighted language model probability for a sentence according to an embodiment of the invention;
Fig. 2 is a flowchart of the process, in step 110 of Fig. 1, of setting weights class by class for the word groups involved in the language model probability calculation;
Fig. 3 is a flowchart of the process, in step 210 of Fig. 2, of fine-tuning the weight of each class using a development set;
Fig. 4 is a flowchart of a method for constructing a weighted language model according to an embodiment of the invention;
Fig. 5 is a flowchart of a corpus-based machine translation method according to an embodiment of the invention;
Fig. 6 is a block diagram of an apparatus for obtaining a weighted language model probability for a sentence according to an embodiment of the invention;
Fig. 7 is a block diagram of an apparatus for constructing a weighted language model according to an embodiment of the invention; and
Fig. 8 is a block diagram of a corpus-based machine translation system according to an embodiment of the invention.
Embodiment
The present invention proposes the notion of a weighted language model. On the basis of the existing standard language model, it takes the structure of the sentence into account and assigns to each word group in the sentence involved in the language model probability calculation a weight reflecting the importance of that word group to the fluency of the sentence, so that these weights allow the quality of the sentence to be reflected better.
Here, a word group means a group composed of a plurality of words that may occur, in that order, in a sentence.
The probability of a word group is the probability that the last word in the word group appears after all the words before it in the word group; it can also be called the probability of the last word of the word group conditioned on the words before it in the word group.
On the basis of this notion of a weighted language model, preferred embodiments of the present invention are described in detail below in conjunction with the accompanying drawings.
Fig. 1 is a flowchart of a method for obtaining a weighted language model probability for a sentence according to an embodiment of the invention.
As shown in Fig. 1, for a given sentence, the method first determines, in step 105, each word group in it involved in the calculation of the language model probability of the sentence, together with the probability of each such word group.
This step is realized on the basis of a set of word groups and their probabilities counted in advance from a monolingual corpus. That is, in this step, each word group in the given sentence involved in the language model probability calculation and its probability are determined by lookup among the word groups and their probabilities counted in advance from the monolingual corpus.
In one embodiment, corresponding to the smoothed n-gram model, the word groups referred to here are n-grams.
Then, in step 110, according to the structure of the given sentence, a weight reflecting the degree to which a word group influences the fluency of the sentence is set for each word group involved in the calculation of the language model probability of the sentence. Specifically, the more a word group can influence the fluency of the sentence, the higher the weight set for it.
In this step, weights can be set for the word groups in several ways. For example, weights can be set word group by word group. Alternatively, for convenience of weight setting, weights can be set class by class. The method of setting weights class by class is described in detail later in conjunction with Fig. 2.
In step 115, for each word group in the sentence involved in the language model probability calculation, the probability of the word group is weighted by the weight of the word group, whereby the weighted language model probability is obtained for the sentence.
In one embodiment, in this step, on the basis of the smoothed n-gram model, the probability of each word group involved in the calculation of the language model probability of the sentence is weighted by the weight of the word group according to the following formula (3), whereby the weighted language model probability is obtained approximately for the sentence:
p_w(E) \approx \prod_{i=1}^{N} p(e_i \mid e_{i-n+1}, e_{i-n+2}, \ldots, e_{i-2}, e_{i-1})^{w(e_{i-n+1}, \ldots, e_i)}    (3)

where E = {e_1, e_2, …, e_N} denotes a sentence containing N words, p_w(E) denotes the weighted language model probability of the sentence, p(e_i | e_{i-n+1}, e_{i-n+2}, …, e_{i-2}, e_{i-1}) denotes the probability of word e_i, i.e. the probability of the word group (n-gram) (e_{i-n+1}, e_{i-n+2}, …, e_{i-2}, e_{i-1}, e_i), and w(e_{i-n+1}, …, e_i) is the weight of that word group.
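The weighting of formula (3) can be sketched as follows. The probability table and the weight function are illustrative stand-ins, not values from the patent; with every weight equal to 1, the model reduces to the standard smoothed n-gram product of formula (2).

```python
# Illustrative trigram probabilities (invented values).
trigram_prob = {
    ("<s>", "<s>", "this"): 0.30,
    ("<s>", "this", "is"): 0.40,
    ("this", "is", "your"): 0.20,
    ("is", "your", "seat"): 0.10,
    ("your", "seat", "."): 0.60,
}

def weighted_lm_prob(words, prob_table, weight_fn, n=3):
    """Formula (3): p_w(E) ~= product of p(e_i | history) ** w(ngram)."""
    padded = ["<s>"] * (n - 1) + words
    p = 1.0
    for i in range(n - 1, len(padded)):
        ngram = tuple(padded[i - n + 1 : i + 1])
        p *= prob_table[ngram] ** weight_fn(ngram)
    return p

# With all weights set to 1, this reduces to the standard n-gram product.
uniform = weighted_lm_prob(["this", "is", "your", "seat", "."],
                           trigram_prob, lambda ng: 1.0)
plain = 0.30 * 0.40 * 0.20 * 0.10 * 0.60
print(abs(uniform - plain) < 1e-12)  # True
```

A `weight_fn` that looks up the class of each n-gram (as in the C1/C2 scheme described later) turns this into the class-weighted model.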
The above is the overall flow of the method of the present embodiment for obtaining a weighted language model probability for a sentence.
The process, in step 110 of Fig. 1, of setting weights class by class for each word group in the given sentence involved in the language model probability calculation is described in detail below in conjunction with Fig. 2.
As shown in Fig. 2, first, in step 205, the word groups are classified according to the structure of the sentence. Various methods can be adopted in this step to classify the word groups.
For example, in an embodiment based on a phrase-based statistical machine translation system, considering that the sentence of a candidate translation is composed of phrases, the word groups can be divided, according to structure of the sentence related to the phrases, into two classes C1 and C2:

C1: word groups contained within a single phrase;
C2: word groups spanning at least two phrases.

Under this classification, taking again the sentence "this is your seat ." as an example, since it is known from information obtained in advance that this sentence is composed of the three phrases "this", "is your seat" and ".", the word groups in this sentence involved in the language model probability calculation can be classified as shown in Table 3 below.
Table 3

Word group       Class
(this)           C1
(this is)        C2
(this is your)   C2
(is your seat)   C1
(your seat .)    C2
Of course, the above classification method is only an example; any other classification method that facilitates the setting of weights can also be adopted. For example, the word groups can also be classified by the number of phrases each word group spans: a word group contained within one phrase is put into the 1st class, a word group spanning two phrases into the 2nd class, a word group spanning three phrases into the 3rd class, and so on.
Although the classification methods illustrated above were designed with a phrase-based statistical machine translation system in mind, they are also applicable to other types of corpus-based machine translation systems, such as hierarchical-phrase-based statistical machine translation systems and example-based machine translation systems. Of course, for these other types of corpus-based machine translation systems, other classification methods better suited to them can also be adopted.
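The C1/C2 classification of step 205 can be sketched as follows, assuming the phrase segmentation of the candidate translation is known; the helper names are our own. Run on the example sentence, it reproduces the classes of Table 3.

```python
# An n-gram lying inside a single phrase is C1; one spanning two or more
# phrases is C2. The phrase segmentation follows the patent's example.

def classify_ngrams(phrases, n=3):
    """Map each n-gram (including the short leading ones) to C1 or C2."""
    # phrase_id[i] = index of the phrase containing word i
    words, phrase_id = [], []
    for pid, phrase in enumerate(phrases):
        for w in phrase:
            words.append(w)
            phrase_id.append(pid)
    result = []
    for i in range(len(words)):
        start = max(0, i - n + 1)
        ngram = tuple(words[start : i + 1])
        spanned = len(set(phrase_id[start : i + 1]))
        result.append((ngram, "C1" if spanned == 1 else "C2"))
    return result

phrases = [["this"], ["is", "your", "seat"], ["."]]
for ngram, cls in classify_ngrams(phrases):
    print(ngram, cls)   # first line: ('this',) C1
```

Counting how many distinct phrase indices an n-gram touches also generalizes directly to the multi-class scheme (1st class, 2nd class, 3rd class, …) mentioned above.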
Then, in step 210, according to the degree to which each of the divided classes influences the fluency of sentences, a suitable weight is determined for each class.
In the case of the C1/C2 classification described above, the weight of class C1 can be set to w_{C1} and that of class C2 to w_{C2}. Thus, for the sentence "this is your seat ." classified as shown in Table 3, the weighted language model probability of the sentence can be computed according to the following formula:

p_w(this is your seat .) = p(this)^{w_{C1}} \times p(is \mid this)^{w_{C2}} \times p(your \mid this, is)^{w_{C2}} \times p(seat \mid is, your)^{w_{C1}} \times p(. \mid your, seat)^{w_{C2}}
In this step, when a suitable weight is set for each class, a higher weight is set for a class that can influence the fluency of the sentence more. For example, under the classification described in step 205, word groups of classes spanning more phrases influence the fluency of the sentence more, and relatively high weights can therefore be set for them.
For example, under the C1/C2 classification above, the weights w_{C1} and w_{C2} of classes C1 and C2 can be set to 0.7 and 1.3 respectively, indicating that word groups of class C2 are more important than those of class C1 and influence the fluency of the sentence more. The reason is that, for a phrase-based statistical machine translation system, if a word group lies within a phrase, it is a word sequence already known to the translation system and natural in itself, rather than something the translation system formed by concatenating several phrases, so its fluency is guaranteed. If, however, a word group spans several phrases, it was formed by the translation system by concatenating phrases during translation, so its fluency should be checked with emphasis in order to guarantee the fluency of the whole sentence.
Of course, for other types of corpus-based machine translation systems, different weights can be set for the different classes in other suitable ways according to the characteristics of the translation system.
In one embodiment, the weights determined for the classes in this step are predetermined.
In addition, in a further embodiment, the weights determined for the classes in this step are fine-tuned using a development set. Here, a development set comprises a large number of source language sentences prepared in advance together with reference translations corresponding to these source language sentences. The method of fine-tuning the weight of each class using the development set is described in detail below in conjunction with Fig. 3.
Fig. 3 is according to hill-climbing algorithm, utilizes the process flow diagram of the process that the weight of each classification of exploitation set pair accurately adjusts.
Specifically, as shown in Fig. 3, the process first sets, in step 305, an initial weight for each of the above categories, together with a search region containing that initial value. Here, the search region is denoted [ML, MH].
In this step, the initial weight and search region of each category are determined appropriately according to the influence of that category on the fluency of the sentence. It will be appreciated that an accurate initial weight can be set for each category, or only a rough one; it is even possible to set the same average value as the initial weight of every category: w_1 = w_2 = ... = w_m = 1/m, where m is the number of categories.
In step 310, assuming there are m categories, the category index i is initialised to 0, i.e. i = 0.
In step 315, i = i + 1 is set, i.e. category i is selected from the m categories as the current object of weight adjustment.
In step 320, based on the current weight values of the categories, translations are generated for the pre-prepared source-language sentences of the development set, using a corpus-based machine translation system that adopts the weighted language model concept of the present invention. As mentioned above, the development set comprises a large number of pre-prepared source-language sentences together with their corresponding reference translations.
Unlike a prior-art corpus-based machine translation system, the corpus-based machine translation system used in this step adopts the weighted language model concept of the present invention to compute the weighted language model probabilities of candidate translations. More specifically, this corpus-based machine translation system computes the weighted language model probability of a candidate translation according to formula (3).
In other words, the translations generated in this step for the source-language sentences of the development set are selected by the corpus-based machine translation system according to the weighted language model probabilities of the candidate translations, and those probabilities are computed based on the current weight values of the categories.
In step 325, it is determined, based on the reference translations in the development set, whether the quality of the translations generated in step 320 by the corpus-based machine translation system using the current category weights is better than that of the previously generated translations, i.e. the translations generated with the weight values tried earlier in the adjustment process. If so, processing proceeds to step 330; otherwise it goes to step 335.
In this step, translation quality can be compared manually, or an existing automatic translation scoring method or system can be used to determine how the currently generated translations compare with the previously generated ones.
In step 330, the current weight w_i of category i, the current object of weight adjustment, is recorded as the current optimal weight of that category, i.e. w_imax = w_i.
In step 335, the current weight w_i of category i, the object of weight adjustment, is adjusted by a suitable step size step, i.e. w_i = w_i + step, while the weights of the other categories are kept fixed.
In step 340, it is determined whether the adjusted weight w_i is still within the search region [ML, MH] of the category, i.e. whether w_i < MH. If so, processing returns to step 320 to continue adjusting the weight of category i; otherwise, since the optimal weight w_imax for category i has been found, processing proceeds to step 345.
In step 345, it is determined whether weight adjustment has been completed for all categories, i.e. whether i + 1 > m. If so, processing proceeds to step 350; otherwise it returns to step 315 to continue adjusting the weight of the next category.
In step 350, it is determined whether the quality of the translations generated for the above source-language sentences based on the current optimal weights of all categories is better than the translation quality of the previous round of weight adjustment over i = 1 to m. If so, processing returns to step 310 and the next round of weight adjustment over i = 1 to m is carried out; otherwise the process ends.
That is to say, the weight adjustment process of Fig. 3 repeats over multiple rounds from i = 1 to m, until the translation quality no longer improves for all m categories.
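The multi-round search of Fig. 3 can be sketched as a plain hill-climbing loop. This is an illustrative reading of the flowchart, not the patent's code: `score` stands in for the whole translate-and-evaluate step of steps 320-325 (generate translations for the development set with the current weights, then compare them against the reference translations), and the step size and search region [ML, MH] shown are assumptions.

```python
# Hill-climbing over the m class weights, as in Fig. 3: scan one weight at a
# time over its search region while the others stay fixed, keep the best
# value found, and repeat rounds i = 1..m until quality stops improving.

def hill_climb(score, m, step=0.1, low=0.0, high=2.0):
    weights = [1.0 / m] * m          # rough uniform initialisation (step 305)
    best = score(weights)
    improved = True
    while improved:                  # another round only if the last one helped
        improved = False
        for i in range(m):           # steps 315-345: one class at a time
            best_wi = weights[i]
            w = low
            while w <= high:         # scan the search region [ML, MH]
                trial = list(weights)
                trial[i] = w
                s = score(trial)     # stands in for translate + evaluate
                if s > best:         # step 330: record this class's best weight
                    best, best_wi = s, w
                    improved = True
                w += step            # step 335: w_i = w_i + step
            weights[i] = best_wi
    return weights, best
```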
The optimal weight values determined for the categories by the above process of Fig. 3, which fine-tunes the category weights using the development set, are finally set as the weights of the categories. Thus, when appropriate weights are determined for the categories in step 210, the respective weights preset by such a process can be applied directly according to the category of each word group relevant to the calculation of the language model probability.
The above is a detailed description of the method of the present embodiment for obtaining a weighted language model probability for a sentence. In this embodiment, by assigning a weight, according to the structure of the sentence, to each word group in the sentence that is relevant to the calculation of the language model probability, a weighted language model probability that more accurately reflects the quality of the sentence on the basis of its structure can be obtained for the sentence.
Under the same inventive concept, the invention provides a method of constructing a weighted language model, described in detail below in conjunction with the accompanying drawings.
Fig. 4 is a flowchart of the method of constructing a weighted language model according to an embodiment of the invention.
As shown in Fig. 4, the method first, in step 405, counts a plurality of word groups and their probabilities from a monolingual corpus.
In one embodiment, corresponding to a smoothed n-gram model, the word groups referred to here are n-grams.
Those skilled in the art will appreciate that this step can be realised using any method, known now or in the future, for counting word groups and their probabilities in relation to a language model; a detailed description of this step is omitted here.
In step 410, a weighted language model is constructed from the above word groups and their probabilities. For each word group in a sentence that is relevant to the calculation of the language model probability, this weighted language model weights the probability of the word group using its weight, and obtains a weighted language model probability for the sentence from the weighted probabilities of the word groups. Here, the weight of each word group relevant to the calculation of the language model probability is set according to the structure of the sentence, and reflects the influence of that word group on the fluency of the sentence. Specifically, a higher weight is set for a word group that has a greater influence on the fluency of the sentence.
In one embodiment, the weights of the word groups relevant to the calculation of the language model probability are set by class, using the method of Fig. 2 above. In a further embodiment, the weight of each word group has been fine-tuned using the method of Fig. 3 above.
In addition, in one embodiment, a weighted language model is constructed in this step on the basis of a smoothed n-gram model: according to formula (3), the probability of each word group in a sentence relevant to the calculation of the language model probability is weighted using the weight of that word group, so that a weighted language model probability is obtained approximately for the sentence.
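Formula (3) — each n-gram probability raised to the power of its weight before the product is taken — can be sketched as follows. The probability lookup is a plain dict standing in for a real smoothed n-gram model, and the weight function is assumed given (for example, set by class as described above); the product is accumulated in log space for numerical stability.

```python
import math

# Weighted LM probability per formula (3):
#   p_w(E) ≈ prod_i p(e_i | e_{i-n+1}..e_{i-1}) ** w(e_{i-n+1}..e_i)
# Illustrative sketch; `ngram_prob` would come from a smoothed n-gram model.

def weighted_lm_prob(sentence, ngram_prob, ngram_weight, n=3):
    log_p = 0.0
    for i in range(len(sentence) - n + 1):
        gram = tuple(sentence[i:i + n])
        log_p += ngram_weight(gram) * math.log(ngram_prob[gram])
    return math.exp(log_p)
```

With all weights equal to 1 this reduces to the standard n-gram product, which is a quick sanity check on an implementation.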
The above is a detailed description of the method of constructing a weighted language model of the present embodiment. In this embodiment, the constructed weighted language model assigns a weight, according to the structure of the sentence, to each word group in the sentence that is relevant to the calculation of the language model probability, so that a weighted language model probability that more accurately reflects the quality of the sentence on the basis of its structure can be obtained for the sentence.
The weighted language model constructed by the above method can directly replace a standard language model in a corpus-based machine translation system, enabling that system to obtain weighted language model probabilities based on sentence structure for its candidate translations, and then to select high-quality candidate translations as final translations more effectively by referring to the weighted language model probability of each candidate translation.
It should be noted that although the process of Fig. 4 above includes step 405 of counting a plurality of word groups and their probabilities from a monolingual corpus, this step may also be omitted, with the word groups and probabilities already counted in a standard language model being used directly in the subsequent step 410.
Although the above embodiments are described in conjunction with a phrase-based statistical machine translation system, the invention can equally be applied to other types of corpus-based machine translation systems, for example a hierarchical-phrase-based statistical machine translation system or an example-based machine translation system, and similar advantageous effects can be obtained for these other types of corpus-based machine translation systems.
Accordingly, the invention also provides a corpus-based machine translation method applying the method of Fig. 1 above for obtaining a weighted language model probability for a sentence. Fig. 5 is a flowchart of this method.
As shown in Fig. 5, the method first obtains, in step 505, a source-language sentence to be translated.
In step 510, a plurality of candidate translations are generated for the sentence to be translated, using a corpus-based machine translation system according to any translation model known now or in the future.
In step 515, a weighted language model probability is computed for each of the above candidate translations, using the method of Fig. 1 for obtaining a weighted language model probability for a sentence.
In step 520, a final translation of the sentence to be translated is selected from the above candidate translations, with reference to the above weighted language model probabilities.
Those skilled in the art will appreciate that in this step, the final translation can be selected from the candidate translations directly according to their weighted language model probabilities, or the weighted language model probability can be combined with probabilities obtained from other translation models (for example, a phrase translation model or a word translation model) to carry out the selection.
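The selection of step 520 can be sketched as follows, under the assumption of a simple log-linear combination of the weighted language model probability with one other model probability (e.g. a phrase translation model); the candidate tuples and feature weights are illustrative, not from the patent.

```python
import math

# Pick the final translation by combining, in log space, the weighted LM
# probability with a translation-model probability. Setting tm_weight=0
# reduces this to selection by weighted LM probability alone.

def select_translation(candidates, lm_weight=1.0, tm_weight=1.0):
    """candidates: list of (translation, weighted_lm_prob, translation_prob)."""
    def combined(cand):
        _, lm_p, tm_p = cand
        return lm_weight * math.log(lm_p) + tm_weight * math.log(tm_p)
    return max(candidates, key=combined)[0]
```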
The above is a detailed description of the corpus-based machine translation method of the present embodiment. In this embodiment, by obtaining weighted language model probabilities based on sentence structure for the candidate translations, and selecting the final translation with reference to the weighted language model probability of each candidate translation, high-quality translations can be selected more accurately, thereby improving machine translation performance.
Below, a concrete example is used to contrast the quality of the translations a phrase-based machine translation system generates when based on weighted language model probabilities with the quality it achieves when based on standard language model probabilities.
Still taking the earlier sentence to be translated, "I would like a middle seat.", as an example, suppose the phrase-based statistical machine translation system has likewise obtained two candidate translations for it:
Candidate translation T1: (I want) (one medium) (seat) (.)
Candidate translation T2: (I want) (middle) (seat) (.)
Candidate translation T1 is composed of the phrases "I want", "one medium", "seat", and ".". Candidate translation T2 is composed of the phrases "I want", "middle", "seat", and ".".
Based on the above, Table 4 below shows the probability and class of each 3-gram (word group) in candidate translations T1 and T2 that is relevant to the calculation of the language model probability, where the parts in which T1 and T2 differ are replaced by X.
Table 4
[Table 4 is provided as an image in the original publication.]
Based on the probability and class of each 3-gram shown in the table above, Table 5 below shows the standard language model probabilities computed for candidate translations T1 and T2 using a standard language model, and the weighted language model probabilities computed for T1 and T2 using the weighted language model concept of the present invention, with the weights of the C1 and C2 classes set to 0.7 and 1.3 respectively.
Table 5
[Table 5 is provided as an image in the original publication.]
As can be seen from Table 5, with a standard language model that does not consider sentence structure, the language model probability computed for candidate translation T1 is relatively high, so a phrase-based statistical machine translation system would, according to this result, select candidate translation T1 rather than the higher-quality candidate translation T2, causing the phrase-based system to generate a poorer translation.
In contrast, with the weighted language model concept of the present invention, which considers sentence structure, the important parts of the sentence (the C2 class) are assigned a relatively high weight, so the importance of those parts is amplified, and the gap between candidate translations T1 and T2 on those important parts is amplified accordingly. A relatively high language model probability can therefore be obtained for the high-quality candidate translation T2, so that this high-quality translation is finally selected.
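The amplification mechanism can be illustrated numerically. The 3-gram probabilities below are invented (the patent's Table 5 is only available as an image in this publication); the point is only the mechanism: where the two candidates differ on a C2 gram, raising that gram's probability to the power 1.3 widens the margin between them, here enough to reverse the ranking.

```python
# Toy illustration: with unit weights the first candidate wins; with the
# example class weights (C1: 0.7, C2: 1.3) the second candidate wins,
# because the C2 grams, where the candidates differ most, are amplified.

def weighted_product(grams, w_c1=0.7, w_c2=1.3):
    """grams: list of (probability, class_label) pairs for one candidate."""
    p = 1.0
    for prob, cls in grams:
        p *= prob ** (w_c2 if cls == "C2" else w_c1)
    return p
```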
Although the above example concerns a phrase-based statistical machine translation system, similar advantageous effects can be obtained for other types of corpus-based machine translation systems adopting the present invention.
Under the same inventive concept, the invention provides a device for obtaining a weighted language model probability for a sentence, described in detail below in conjunction with the accompanying drawings.
Fig. 6 is a block diagram of the device for obtaining a weighted language model probability for a sentence according to an embodiment of the invention. As shown in Fig. 6, the device 60 of the present embodiment for obtaining a weighted language model probability for a sentence comprises: a word group and probability determining unit 61, a weight setting unit 62, a weight adjustment unit 63, and a language model probability computing unit 64.
Specifically, for a given sentence, the word group and probability determining unit 61 determines each word group therein that is relevant to the calculation of the language model probability, together with its probability.
In one embodiment, the above word groups and their probabilities have been counted in advance from a monolingual corpus and recorded together correspondingly.
The weight setting unit 62 sets, according to the structure of the sentence, a weight for each of the above word groups relevant to the calculation of the language model probability, reflecting the influence of that word group on the fluency of the sentence. Specifically, a higher weight is set for a word group that has a greater influence on the fluency of the sentence.
In one embodiment, the weight setting unit 62 determines the weights of the above word groups by class, according to the structure of the sentence.
In a further embodiment, the weight setting unit 62 divides the above word groups into two classes, word groups contained within a single phrase and word groups spanning at least two phrases, and sets a relatively high weight for the class spanning at least two phrases, compared with the class contained within a single phrase.
In another embodiment, the weight setting unit 62 divides the above word groups into a plurality of classes according to the number of phrases contained in the word group, and sets a higher weight for classes containing a larger number of phrases.
The weight adjustment unit 63 fine-tunes the weights of the above word groups by class, using a development set comprising a plurality of source-language sentences together with their corresponding reference translations.
The language model probability computing unit 64 weights the probability of each of the above word groups using its weight, and obtains a weighted language model probability for the sentence from the weighted probabilities of the word groups.
In one embodiment, the language model probability computing unit 64 weights the probability of each word group relevant to the calculation of the language model probability using its weight according to formula (3), thereby obtaining a weighted language model probability approximately for the sentence.
The above is a detailed description of the device of the present embodiment for obtaining a weighted language model probability for a sentence.
Under the same inventive concept, the invention provides a device for constructing a weighted language model, described in detail below in conjunction with the accompanying drawings.
Fig. 7 is a block diagram of the device for constructing a weighted language model according to an embodiment of the invention. As shown in Fig. 7, the device 70 of the present embodiment for constructing a weighted language model comprises: a word group and probability statistics unit 71, and a model construction unit 72.
Specifically, the word group and probability statistics unit 71 counts a plurality of word groups and their probabilities from a monolingual corpus. Of course, the word group and probability statistics unit 71 may also be omitted, with the word groups and probabilities already counted in a standard language model being used directly.
The model construction unit 72 constructs a weighted language model based on the above word groups and their probabilities. For each word group in a sentence that is relevant to the calculation of the language model probability, this weighted language model weights the probability of the word group using its weight, and obtains a weighted language model probability for the sentence from the weighted probabilities of the word groups. Here, the weight of each word group relevant to the calculation of the language model probability is set according to the structure of the sentence, and reflects the influence of that word group on the fluency of the sentence. Specifically, a higher weight is set for a word group that has a greater influence on the fluency of the sentence.
In one embodiment, the weights of the word groups relevant to the calculation of the language model probability are set by class. Further, the weights of the word groups are fine-tuned using a development set comprising a plurality of source-language sentences together with their corresponding reference translations.
In one embodiment, the model construction unit 72 constructs a weighted language model that, according to formula (3), weights the probability of each word group in the sentence relevant to the calculation of the language model probability using the weight of that word group, thereby obtaining a weighted language model probability approximately for the sentence.
The above is a detailed description of the device of the present embodiment for constructing a weighted language model.
A corpus-based machine translation system of the invention applying the above device for obtaining a weighted language model probability for a sentence, or the above device for constructing a weighted language model, is described below.
Fig. 8 is a block diagram of the corpus-based machine translation system according to an embodiment of the invention. As shown in Fig. 8, the corpus-based machine translation system 80 of the present embodiment comprises the device 60 of Fig. 6 for obtaining a weighted language model probability for a sentence, or a weighted language model constructed using the device 70 of Fig. 7 for constructing a weighted language model, together with a translation generation unit 81.
Specifically, the translation generation unit 81 generates a plurality of candidate translations for a sentence to be translated according to a translation model, obtains a weighted language model probability for each of these candidate translations using the above device 60 or weighted language model, and selects a final translation of the sentence to be translated from the candidate translations with reference to the weighted language model probabilities of these candidate translations.
The above is a detailed description of the corpus-based machine translation system of the present embodiment.
The device 60 of the present embodiment for obtaining a weighted language model probability for a sentence, the device 70 for constructing a weighted language model, and the corpus-based machine translation system 80, together with their components, can be implemented with dedicated circuits or chips, or by a computer (processor) executing a corresponding program.
Although the method and device of the present invention for obtaining a weighted language model probability for a sentence, the method and device for constructing a weighted language model, and the corpus-based machine translation method and system have been described in detail above through some exemplary embodiments, these embodiments are not exhaustive, and those skilled in the art may make various variations and modifications within the spirit and scope of the invention. Therefore, the invention is not limited to these embodiments; its scope is defined solely by the appended claims.

Claims (10)

1. A method of obtaining a weighted language model probability for a sentence, comprising:
for each word group in the sentence that is relevant to the calculation of the language model probability, weighting the probability of the word group using the weight of the word group; and
obtaining a weighted language model probability for the sentence according to the weighted probabilities of the above word groups relevant to the calculation of the language model probability;
wherein the weight of each of the above word groups is set according to the structure of the sentence and reflects the influence of that word group on the fluency of the sentence.
2. A method of constructing a weighted language model, comprising:
constructing a weighted language model which:
for each word group in a sentence that is relevant to the calculation of the language model probability, weights the probability of the word group using the weight of the word group, and
obtains a weighted language model probability for the sentence according to the weighted probabilities of the above word groups relevant to the calculation of the language model probability;
wherein the weight of each of the above word groups is set according to the structure of the sentence and reflects the influence of that word group on the fluency of the sentence.
3. The method according to claim 1 or 2, wherein obtaining the weighted language model probability for the sentence is realised according to the following formula:
p_w(E) \approx \prod_{i=1}^{N} p(e_i \mid e_{i-n+1}, e_{i-n+2}, \ldots, e_{i-2}, e_{i-1})^{w(e_{i-n+1}, \ldots, e_i)}
wherein E = {e_1, e_2, ..., e_N} denotes a sentence comprising N words, p_w(E) denotes the weighted language model probability of the sentence, p(e_i | e_{i-n+1}, e_{i-n+2}, ..., e_{i-2}, e_{i-1}) denotes the probability of the word group (e_{i-n+1}, e_{i-n+2}, ..., e_{i-2}, e_{i-1}, e_i), and w(e_{i-n+1}, ..., e_i) is the weight of that word group.
4. The method according to claim 1 or 2, wherein the weights of the above word groups relevant to the calculation of the language model probability are set by class, according to the structure of the sentence.
5. A corpus-based machine translation method, comprising:
obtaining weighted language model probabilities, respectively, for a plurality of candidate translations generated for a sentence to be translated, using the method of claim 1 for obtaining a weighted language model probability for a sentence; and
selecting a final translation of the sentence to be translated from the plurality of candidate translations, with reference to the weighted language model probabilities of the candidate translations.
6. A device for obtaining a weighted language model probability for a sentence, comprising:
a language model probability computing unit configured to:
for each word group in the sentence that is relevant to the calculation of the language model probability, weight the probability of the word group using the weight of the word group, and
obtain a weighted language model probability for the sentence according to the weighted probabilities of the above word groups relevant to the calculation of the language model probability;
wherein the weight of each of the above word groups is set according to the structure of the sentence and reflects the influence of that word group on the fluency of the sentence.
7. The device according to claim 6, further comprising:
a weight setting unit that sets weights for the above word groups relevant to the calculation of the language model probability;
wherein the weight setting unit sets the weights of the above word groups by class, according to the structure of the sentence.
8. The device according to claim 7, wherein the weight setting unit divides the above word groups into two classes, word groups contained within a single phrase and word groups spanning at least two phrases, and sets a relatively high weight for the class spanning at least two phrases, compared with the class contained within a single phrase.
9. A device for constructing a weighted language model, comprising:
a model construction unit that constructs a weighted language model which:
for each word group in a sentence that is relevant to the calculation of the language model probability, weights the probability of the word group using the weight of the word group, and
obtains a weighted language model probability for the sentence according to the weighted probabilities of the above word groups relevant to the calculation of the language model probability;
wherein the weight of each of the above word groups is set according to the structure of the sentence and reflects the influence of that word group on the fluency of the sentence.
10. A corpus-based machine translation system, comprising:
the device of claim 6 for obtaining a weighted language model probability for a sentence, or a weighted language model constructed using the device of claim 9 for constructing a weighted language model; and
a translation generation unit that generates a plurality of candidate translations for a sentence to be translated, obtains a weighted language model probability for each of these candidate translations using the above device for obtaining a weighted language model probability for a sentence or the above weighted language model, and selects a final translation of the sentence to be translated from the candidate translations with reference to the weighted language model probabilities of these candidate translations.
CN2009101702922A 2009-09-10 2009-09-10 Methods and devices for acquiring weighted language model probability and constructing weighted language model Pending CN102023969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009101702922A CN102023969A (en) 2009-09-10 2009-09-10 Methods and devices for acquiring weighted language model probability and constructing weighted language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2009101702922A CN102023969A (en) 2009-09-10 2009-09-10 Methods and devices for acquiring weighted language model probability and constructing weighted language model

Publications (1)

Publication Number Publication Date
CN102023969A true CN102023969A (en) 2011-04-20

Family

ID=43865279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009101702922A Pending CN102023969A (en) 2009-09-10 2009-09-10 Methods and devices for acquiring weighted language model probability and constructing weighted language model

Country Status (1)

Country Link
CN (1) CN102023969A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1801141A (en) * 2004-06-24 2006-07-12 夏普株式会社 Method and apparatus for translation based on a repository of existing translations
CN101271452A (en) * 2007-03-21 2008-09-24 株式会社东芝 Method and device for generating version and machine translation
CN101452446A (en) * 2007-12-07 2009-06-10 株式会社东芝 Target language word deforming method and device
CN101482860A (en) * 2008-01-09 2009-07-15 中国科学院自动化研究所 Automatic extraction and filtration method for Chinese-English phrase translation pairs

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982024A (en) * 2011-09-02 2013-03-20 北京百度网讯科技有限公司 Identification method and device for searching requirement
CN102982024B (en) * 2011-09-02 2016-03-23 北京百度网讯科技有限公司 A kind of search need recognition methods and device
CN103246642A (en) * 2012-02-06 2013-08-14 卡西欧计算机株式会社 Information processing device and information processing method
CN103246642B (en) * 2012-02-06 2016-12-28 卡西欧计算机株式会社 Information processor and information processing method
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN105068997B (en) * 2015-07-15 2017-12-19 清华大学 The construction method and device of parallel corpora
CN108874790A (en) * 2018-06-29 2018-11-23 中译语通科技股份有限公司 A kind of cleaning parallel corpora method and system based on language model and translation model

Similar Documents

Publication Publication Date Title
US11640515B2 (en) Method and neural network system for human-computer interaction, and user equipment
US20210334665A1 (en) Text-based event detection method and apparatus, computer device, and storage medium
US11531874B2 (en) Regularizing machine learning models
CN109948143B (en) Answer extraction method of community question-answering system
US9336771B2 (en) Speech recognition using non-parametric models
US9396724B2 (en) Method and apparatus for building a language model
CN103065630B (en) User personalized information voice recognition method and user personalized information voice recognition system
CN109637537B (en) Method for automatically acquiring annotated data to optimize user-defined awakening model
CN110210029A (en) Speech text error correction method, system, equipment and medium based on vertical field
CN103187052B (en) A kind of method and device setting up the language model being used for speech recognition
CN110245221A (en) The method and computer equipment of training dialogue state tracking classifier
CN101650942B (en) Prosodic structure forming method based on prosodic phrase
Hastie et al. Automatically predicting dialogue structure using prosodic features
Kedzie et al. A good sample is hard to find: Noise injection sampling and self-training for neural language generation models
JP2005100335A5 (en)
CN105068997B (en) The construction method and device of parallel corpora
CN102439660A (en) Voice-tag method and apparatus based on confidence score
CN102023969A (en) Methods and devices for acquiring weighted language model probability and constructing weighted language model
CN109977292B (en) Search method, search device, computing equipment and computer-readable storage medium
CN107093422A (en) A kind of audio recognition method and speech recognition system
CN103034627A (en) Method and device for calculating sentence similarity and method and device for machine translation
CN114067786A (en) Voice recognition method and device, electronic equipment and storage medium
CN103797535A (en) Reducing false positives in speech recognition systems
CN102023970A (en) Method and device for acquiring language model probability and method and device for constructing language model
Shu et al. Controllable text generation with focused variation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110420