WO2017038996A1 - Word alignment model construction apparatus, machine translation apparatus, word alignment model production method, and recording medium - Google Patents

Word alignment model construction apparatus, machine translation apparatus, word alignment model production method, and recording medium Download PDF

Info

Publication number
WO2017038996A1
WO2017038996A1 PCT/JP2016/075886 JP2016075886W WO2017038996A1 WO 2017038996 A1 WO2017038996 A1 WO 2017038996A1 JP 2016075886 W JP2016075886 W JP 2016075886W WO 2017038996 A1 WO2017038996 A1 WO 2017038996A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
probability information
sentence
language
alignment model
Prior art date
Application number
PCT/JP2016/075886
Other languages
French (fr)
Japanese (ja)
Inventor
将夫 内山
Original Assignee
国立研究開発法人情報通信研究機構
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立研究開発法人情報通信研究機構 filed Critical 国立研究開発法人情報通信研究機構
Publication of WO2017038996A1 publication Critical patent/WO2017038996A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language

Definitions

  • the present invention relates to a word alignment model construction apparatus for constructing a word alignment model.
  • Statistic machine translation creates a translation model ModelB from parallel translation data DataB. And the input sentence is translated using ModelB. At this time, if the input sentence is a sentence in the same field as DataB, the translation result can be expected to be highly accurate. However, when the input sentence is a sentence in a field different from DataB, the translation accuracy decreases.
  • the translation model ModelS is created by using bilingual data DataS in a different field from DataB, and the translation accuracy of sentences in the same field as DataS is improved by using both ModelB and ModelS. Achievable.
  • ModelB can be created from DataB with high accuracy without problems, but it is difficult to create ModelS from DataS with high accuracy.
  • Step 1. Word alignment of DataS Build ModelS from the result of word alignment.
  • a word alignment model AlignB for word alignment is constructed from DataB, and DataS is aligned using AlignB.
  • a word alignment model AlignS is constructed from DataS, and DataS is aligned using AlignS.
  • C Combine DataB and DataS, build a word alignment model AlignBS from them, and align DataS using AlignBS.
  • D Construct a word alignment model AlignB from DataB. Then, AlignBS is constructed from DataS using AlignB as an initial model, and DataS is aligned using AlignBS (see Non-Patent Document 4).
  • AlignB as an initial model means extracting only sufficient statistics for word alignment from AlignB and updating the sufficient statistics using DataS.
  • each of the four conventional techniques (A) to (D) has the following problems.
  • DataS is word-aligned using AlignB, so the word alignment accuracy is low as long as DataB and DataS are different from each other.
  • (B) is effective when DataS is tens of thousands of sentences or more, but the accuracy of AlignS is low when there are about 100 sentences, so the word alignment accuracy is low.
  • (C) when the sizes of DataS and DataB are extremely different, even if a characteristic word pair appears in DataS, it can be countered by the majority of DataB.
  • a word alignment model can be constructed just by processing DataS as in (D), but unlike (D), a word alignment model that does not cancel word alignment unique to DataS is constructed. For the purpose.
  • the word alignment model construction device includes a first language sentence that is a sentence of a first language that is a first language and a second language sentence that is a sentence of a second language that is a second language.
  • a small-scale parallel translation data storage unit that can store small-scale parallel translation data that is a pair and is a small-scale parallel translation data having a number of parallel translation sentences less than the first threshold (N1), and is acquired from the small-scale parallel translation data
  • Correspondence probability relating to the probability that a word pair having a first word that is a word in the first language and a second word that is a word in the second language and the probability that the first word and the second word correspond to each other are word alignment models
  • a small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information as information, and a number equal to or greater than a second threshold (N2, N2> N1) Have a translation of This is an alignment model of words acquired from large-scale parallel
  • the probability information calculation unit that calculates the first correspondence probability information that is paired with one word pair by repeating the loop twice or more, and the first that the probability information calculation unit finally calculates for each word pair
  • a word alignment model construction apparatus comprising a correspondence probability information storage unit that stores correspondence probability information in association with a word pair in a small word alignment model storage unit.
  • the word alignment model construction device provides the probability information calculation unit for each bilingual sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence, relative to the first invention.
  • the first correspondence probability information acquisition means for obtaining the first correspondence probability information of the initial value corresponding to one word pair or the first correspondence probability information calculated in the previous loop for one word pair, and a small scale Second correspondence probability information corresponding to one word pair is acquired from the large-scale word alignment model storage unit for each word pair included in the parallel translation data and for each word pair included in the parallel translation sentence.
  • Second correspondence probability information acquisition means and for each bilingual sentence included in the small-scale parallel translation data and for each word pair included in the bilingual sentence, for each word pair, the first word in the first language sentence First word position indicating position
  • the second word position corresponding to the first word and indicating the position in the second language sentence, the number of first sentence words that is the number of words in the first language sentence, and the second language sentence Bilingual sentence word position information acquisition means for acquiring parallel sentence word position information having the number of second sentence words that is the number of words, and a parallel sentence word position corresponding to the parallel sentence word position information acquired by the parallel sentence word position information acquisition means
  • the bilingual sentence word position probability information acquiring means for acquiring the probability information from the parallel sentence word position probability information storage unit, the first corresponding probability information and the second corresponding probability information acquiring means acquired by the first corresponding probability information acquiring means last time
  • the acquired second correspondence probability information is added at a predetermined ratio, and the result of addition is multiplied by the parallel sentence word position probability information acquired by the parallel sentence word position probability information acquisition
  • Intermediate probability value calculating means For each word pair, using the intermediate probability value calculated by the intermediate probability value calculation means, the first correspondence probability information acquisition means before normalization for acquiring the first correspondence probability information before normalization, and the normal correspondence for each word pair Normalization processing is performed on the first correspondence probability information before normalization acquired by the first correspondence probability information acquisition means before normalization and the first correspondence probability information is acquired, and until the end condition is satisfied , Previous first correspondence probability information acquisition means, second correspondence probability information acquisition means, parallel translation word position information acquisition means, parallel translation word position probability information acquisition means, intermediate probability value calculation means, pre-normalization first correspondence probability information acquisition And a word alignment model construction apparatus comprising control means for repeatedly performing processing of normalization means.
  • the word alignment model construction apparatus provides a parallel translation word position probability information stored in the parallel translation word position probability information storage unit, which is a large-scale parallel translation. It is the word alignment model construction apparatus which is the bilingual sentence word position probability information acquired using data.
  • the word alignment model construction device of the fourth aspect of the present invention indicates the position of the first word in the first language sentence for each word pair included in the large-scale parallel translation data, relative to the first aspect.
  • the bilingual sentence word position probability information acquiring means for acquiring the bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the acquiring means from the bilingual sentence word position probability information storage unit, Stored in the probability information storage
  • Translated sentence word position probability information is word alignment model building device is the translation word position probability information translated sentence word position probability information acquisition means has acquired.
  • the machine translation device has a small word alignment model storage unit and a word alignment model construction device that the word alignment model construction device has, with respect to any one of the first to fourth inventions.
  • a translation unit comprising: a translation unit that acquires a first language sentence from a second language sentence received by a reception unit using bilingual sentence word position probability information for each of one or more parallel translation word position information .
  • the word alignment of the small-scale bilingual corpus can be executed with high accuracy, and as a result, accurate translation results can be obtained.
  • the word alignment model construction apparatus can perform word alignment of a small-scale parallel corpus with high accuracy.
  • Block diagram of word alignment model construction apparatus 1 according to Embodiment 1 The flowchart explaining operation
  • FIG. 1 is a block diagram of the word alignment model construction apparatus 1 in the present embodiment.
  • the word alignment model construction device 1 includes a storage unit 11, a probability information calculation unit 12, and a corresponding probability information storage unit 13.
  • the storage unit 11 includes a small bilingual data storage unit 111, a small word alignment model storage unit 112, a large word alignment model storage unit 113, and a bilingual word position probability information storage unit 114.
  • the probability information calculation unit 12 includes a previous first correspondence probability information acquisition unit 121, a second correspondence probability information acquisition unit 122, a bilingual sentence word position information acquisition unit 123, a bilingual sentence word position probability information acquisition unit 124, and an intermediate probability value calculation unit. 125, first pre-normalization probability information acquisition means 126, normalization means 127, and control means 128.
  • the storage unit 11 can store various types of information.
  • the various types of information include, for example, small-scale parallel translation data, small-scale word alignment model, large-scale word alignment model, parallel sentence word position probability, and the like, which will be described later.
  • the small parallel translation data storage unit 111 can store small parallel translation data.
  • the small-scale parallel translation data has one or more parallel translation sentences.
  • a bilingual sentence is a pair of a first language sentence and a second language sentence.
  • the first language sentence is a sentence in the first language that is the first language.
  • the second language sentence is a sentence of the second language that is the second language.
  • the second language sentence is a result of translating the first language sentence into the second language.
  • Small-scale parallel translation data has a small number of parallel translation sentences.
  • the small-scale parallel translation data usually has a number of parallel translation sentences less than the first threshold (N1).
  • the small-scale parallel translation data has, for example, about 100,000 to 100,000 parallel translations.
  • any language can be used as long as the first language and the second language are different.
  • the first language or the second language is, for example, English, Japanese, Chinese, French, German, Spanish, Korean or the like.
  • the small word alignment model storage unit 112 can store a small word alignment model.
  • the small word alignment model is a word alignment model acquired using small parallel translation data.
  • the small-scale word alignment model has a plurality of word alignment data.
  • the word alignment data includes a word pair and first correspondence probability information.
  • the word pair has a first word that is a word in the first language and a second word that is a word in the second language.
  • the first correspondence probability information is correspondence probability information that is information regarding the probability that the first word corresponds to the second word.
  • the large-scale word alignment model storage unit 113 stores a large-scale word alignment model.
  • the large-scale word alignment model is a word alignment model acquired from large-scale parallel translation data that is large-scale parallel translation data.
  • the word alignment model includes a word pair having a first word and a second word, and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other.
  • Large-scale parallel translation data usually has a number of parallel translations equal to or greater than a second threshold (N2, N2> N1).
  • N2 is one digit or more larger (10 times or more) than N1.
  • N1 and N2 are natural numbers.
  • the parallel translation word position probability information storage unit 114 can store parallel translation word position probability information for each of one or more parallel translation word position information.
  • the bilingual sentence word position information is information acquired from one or more parallel translation sentences, and has a first word position, a second word position, a first sentence word number, and a second sentence word number.
  • the first word position is information indicating the position of the first word in the first language sentence.
  • the second word position is the position of the second word corresponding to the first word and is information indicating the position in the second language sentence.
  • the first sentence word count is the number of words in the first language sentence.
  • the number of second sentence words is the number of words in the second language sentence.
  • the bilingual sentence word position probability information is information regarding the probability of matching with the bilingual sentence word position information. Normally, the bilingual sentence word position probability information is a probability of matching with the bilingual sentence word position information.
  • the bilingual sentence word position probability information storage unit 114 stores two or more pairs of bilingual sentence word position information and bilingual sentence word position probability information acquired from one or more parallel translation sentences included in the large-scale parallel translation data. That is preferred.
  • a pair of the parallel sentence word position information and the parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit 114 uses a large-scale parallel translation data, and a parallel sentence word position information acquisition unit 123 to be described later.
  • the information acquired by the bilingual sentence word position probability information acquisition unit 124 is suitable.
  • the bilingual sentence word position probability information storage unit 114 stores two pairs of bilingual sentence word position information and bilingual sentence word position probability information acquired from one or more parallel translation sentences included in the large-scale bilingual data and the small-scale bilingual data. As described above, it may be stored.
  • a pair of parallel sentence word position information and parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit 114 is a parallel sentence word to be described later using large-scale parallel translation data and small-scale parallel translation data.
  • the information acquired by the position information acquisition unit 123 and the bilingual word position probability information acquisition unit 124 may be used.
  • the probability information calculation unit 12 sets the first correspondence probability information calculated in the initial value or the previous loop for one word pair and the large-scale word for each word pair included in the parallel translation sentence included in the small-scale parallel translation data. Using the second correspondence probability information paired with one word pair of the alignment model and the parallel translation word position probability information corresponding to one word pair in the parallel translation sentence, the loop is repeated two or more times, First correspondence probability information to be paired with one word pair is calculated.
  • the previous first correspondence probability information acquisition unit 121 sets the initial value corresponding to one word pair for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. First correspondence probability information or first correspondence probability information calculated in the previous loop is acquired.
  • the first correspondence probability information acquisition unit 121 last time includes the i-th word e (i) in the first language (e) and the j-th word f (j) in the second language (f).
  • f (j))” calculated in the previous loop is acquired.
  • Such first correspondence probability information is usually stored at least temporarily in the storage unit 11.
  • the second correspondence probability information acquisition unit 122 is configured to provide a second correspondence probability corresponding to one word pair for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. Information is acquired from the large-scale word alignment model storage unit 113.
  • the second correspondence probability information acquisition unit 122 adds the i-th word e (i) in the first language (e) and the j-th word f (j) in the second language (f).
  • f (j))” is acquired from the large-scale word alignment model storage unit 113.
  • the bilingual sentence word position information acquisition means 123 performs the first word in the first language sentence for each bilingual sentence included in the small bilingual data and for each word pair included in the bilingual sentence.
  • the first word position indicating the position, the second word position corresponding to the first word and indicating the position in the second language sentence, and the first sentence word number indicating the number of words in the first language sentence
  • translated sentence word position information having the number of second sentence words that is the number of words of the second language sentence.
  • the bilingual sentence word position information acquisition unit 123 has a first word position indicating the position of the first word in the first language sentence with respect to one word pair in one bilingual sentence.
  • the second word position corresponding to the first word and the second word position indicating the position in the second language sentence (j) the first sentence word number that is the number of words in the first language sentence ( m) and the second sentence word number (n), which is the number of words in the second language sentence, are acquired from one parallel translation sentence and one word pair.
  • (i) (j) is usually information indicating the number of words in a sentence.
  • the bilingual sentence word position information acquisition means 123 is a first word position indicating the position of the first word in the first language sentence for each word pair included in the large-scale parallel translation data, and a second corresponding to the first word.
  • the second word position indicating the position of the word in the second language sentence, the first sentence word number that is the number of words in the first language sentence, and the second sentence word number that is the number of words in the second language sentence.
  • the bilingual sentence word position probability information acquisition unit 124 acquires bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquisition unit 123 from the bilingual sentence word position probability information storage unit 114. .
  • the translated sentence word position probability information acquiring unit 124 converts the translated sentence word position corresponding to the translated sentence word position information (i, j, m, n) acquired by the translated sentence word position information acquiring unit 123.
  • i, m, n)” is searched from the bilingual sentence word position probability information storage unit 114.
  • the bilingual sentence word position probability information acquiring unit 124 calculates bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquiring unit 123 for each word pair included in the large-scale bilingual data. It is preferable to obtain from the translated sentence word position probability information storage unit 114.
  • the intermediate probability value calculation means 125 includes first correspondence probability information acquired by the first correspondence probability information acquisition means 121 last time, second correspondence probability information acquired by the second correspondence probability information acquisition means 122, and bilingual sentence word position probability. An intermediate probability value is calculated using the parallel translation word position probability information acquired by the information acquisition means 124.
  • the intermediate probability value calculation unit 125 includes the first correspondence probability information acquired by the first correspondence probability information acquisition unit 121 last time and the second correspondence probability information acquired by the second correspondence probability information acquisition unit 122. Are added at a predetermined ratio, and the addition result is multiplied by the bilingual sentence word position probability information acquisition unit 124 to calculate an intermediate probability value.
  • the intermediate probability value calculation unit 125 calculates the arithmetic expression “p (i
  • j) ( ⁇ B (e (i)
  • the constant ⁇ is a numerical value satisfying “0 ⁇ ⁇ 1”, and is usually “0.5”.
  • the first correspondence probability information acquisition unit 126 before normalization uses the intermediate probability value calculated by the intermediate probability value calculation unit 125 for each word pair to acquire first correspondence probability information before normalization.
  • the first correspondence probability information acquisition unit 126 before normalization uses the intermediate probability value p (i
  • f ( j)) + p (i
  • “Sum” is information obtained by cumulatively adding intermediate probability values corresponding to word pairs.
  • the normalization means 127 applies the first correspondence probability information “C (e (i)
  • the normalizing means 127 normalizes the first correspondence probability information “C (e (i)
  • the normalizing means 127 normalizes the first correspondence probability information “C (e (i)
  • Equation 1 k is a subscript of an arbitrary word. That is, the denominator of Equation 1 indicates that the sum is taken for all word pairs.
  • the control unit 128 performs the previous first correspondence probability information acquisition unit 121, the second correspondence probability information acquisition unit 122, the bilingual sentence word position information acquisition unit 123, the bilingual sentence word position probability information acquisition unit 124, the intermediate The processes of the probability value calculating unit 125, the first correspondence probability information acquiring unit 126 before normalization, and the normalizing unit 127 are repeatedly performed. That is, the control means 128 until the first condition is satisfied, the previous first correspondence probability information acquisition means 121, the second correspondence probability information acquisition means 122, the bilingual sentence word position information acquisition means 123, and the bilingual sentence word position probability information acquisition means 124.
  • control is performed to loop the processes of the intermediate probability value calculation means 125, the first correspondence probability information acquisition means 126 before normalization, and the normalization means 127.
  • the termination condition is, for example, that a predetermined number of loops has been reached.
  • the predetermined number of loops is, for example, any one of 4 to 6.
  • the correspondence probability information storage unit 13 stores the first correspondence probability information finally calculated by the probability information calculation unit 12 for each word pair in the small word alignment model storage unit 112 in association with the word pair.
  • the first correspondence probability information finally calculated by the probability information calculation unit 12 is the final first correspondence probability information when the processing of the loop is completed assuming that the control unit 128 satisfies the termination condition. .
  • the small bilingual data storage unit 111, the small word alignment model storage unit 112, the large word alignment model storage unit 113, and the bilingual word position probability information storage unit 114 constituting the storage unit 11 are nonvolatile recordings.
  • a medium is preferred, but a volatile recording medium can also be realized.
  • the process of storing information in the storage unit 11 does not matter.
  • information may be stored in the storage unit 11 via a recording medium, information transmitted via a communication line or the like may be stored in the storage unit 11, or Information input via the input device may be stored in the storage unit 11.
  • the value calculation unit 125, the first correspondence probability information acquisition unit 126 before normalization, the normalization unit 127, the control unit 128, and the correspondence probability information storage unit 13 can be usually realized by an MPU, a memory, or the like.
  • the processing procedure of the probability information calculation unit 12 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).
  • the flowchart of FIG. 2 is processing for all two or more word pairs. That is, E-Step in the flowchart of FIG. 2 is processed for all word pairs, and then M-Step is processed for all word pairs.
  • the probability information calculation unit 12 performs an initialization process.
  • the initialization process is a process of substituting initial values for various variables, for example.
  • the initialization process is, for example, a process of substituting “0” into first correspondence probability information “C (e (i)
  • the initialization process is a process of substituting “0” into the first correspondence probability information “ ⁇ S (e (i)
  • Step S202 The probability information calculation unit 12 performs E-step, and acquires first correspondence probability information “C (e (i)
  • Step S203 The normalizing means 127 performs M-step to obtain the first correspondence probability information “ ⁇ S (e (i)
  • M-step is a process of normalizing the first correspondence probability information “C (e (i)
  • the normalizing means 127 normalizes the first correspondence probability information “C (e (i)
  • Step S204 The probability information calculation unit 12 determines whether or not the end condition is met. If the end condition is met, the process ends. If the end condition is not met, the process returns to step S202.
  • the end condition is, for example, that a predetermined number of loops has been reached, as described above.
  • Step S301 The control means 128 assigns 1 to the counter s.
  • Step S302 The control unit 128 determines whether or not the s-th parallel translation sentence exists in the small-scale parallel translation data storage unit 111. If the s-th parallel translation sentence exists, the process goes to step S303, and if the s-th parallel translation sentence does not exist, the process returns to the upper process.
  • Step S303 The control means 128 assigns 1 to the counter i.
  • Step S304 The control means 128 substitutes 0 for the variable sum.
  • Step S305 The control means 128 assigns 0 to the counter j.
  • Step S306 The probability information calculation unit 12 calculates an intermediate probability value “p (i
  • Step S307 The control means 128 adds the intermediate probability value “p (i
  • Step S308 The control means 128 determines whether j matches n. If j matches n, the process goes to step S309, and if j does not match n, the process goes to step S316.
  • Step S309 The control means 128 substitutes 0 for the counter j.
  • the first correspondence probability information acquisition unit 126 before normalization adds a value obtained by dividing the intermediate probability value by sum to the current first correspondence probability information before normalization, and creates a new first correspondence probability before normalization.
  • Information is acquired and stored in the buffer or storage unit 11. That is, the first correspondence probability information obtaining unit 126 before normalization 126 calculates the arithmetic expression “C (e (i)
  • f (j))” before new normalization is calculated by “/ sum” and accumulated in the buffer or storage unit 11.
  • Step S311 The control means 128 determines whether j matches n. If j matches n, the process goes to step S312, and if j does not match n, the process goes to step S315.
  • Step S312 The control means 128 determines whether i matches m. If i matches m, the process goes to step S313. If i does not match m, the process goes to step S314.
  • Step S313 The control means 128 increments the counter s by 1, and returns to step S302.
  • Step S314 The control means 128 increments the counter i by 1, and returns to step S304.
  • Step S315) The control means 128 increments the counter j by 1, and returns to step S310.
  • Step S316 The control means 128 increments the counter j by 1, and returns to step S306.
  • step S306 details of the process of calculating the intermediate probability value “p (i
  • the previous first correspondence probability information acquisition unit 121 includes the i-th word e (i) in the first language (e) and the j-th word f in the second language (f) in the s-th parallel translation.
  • Step S402 The second correspondence probability information acquisition means 122 in the sth parallel translation sentence, the i-th word e (i) of the first language (e) and the j-th word f (2) of the second language (f) Second correspondence probability information “ ⁇ B (e (i)
  • the translated text word position information acquisition unit 123 in the sth translated text includes the i-th word e (i) of the first language (e) and the j-th word f (2) of the second language (f).
  • the bilingual sentence word position information (i, j, m, n) of j) is acquired.
  • the translated sentence word position probability information acquiring unit 124 converts the translated sentence word position probability information “i, j, m, n” corresponding to the translated sentence word position information (i, j, m, n) acquired by the translated sentence word position information acquiring unit 123. “ ⁇ S (j
  • the intermediate probability value calculation means 125 includes the first correspondence probability information “ ⁇ S (e (i)
  • the word alignment model construction device 1 operates like a program shown in FIG.
  • the specific example described here is an extension of the method described in Non-Patent Document 5.
  • the point of the word alignment model construction device 1 is 501 in FIG.
  • Reference numeral 501 denotes processing by the probability information calculation unit 12.
  • 501 is an initial value or first correspondence probability information calculated in the previous loop for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence, Using the second correspondence probability information paired with one word pair included in the word alignment model and the parallel translation word position probability information corresponding to one word pair in the parallel translation sentence, the loop is repeated two or more times. This is a process of calculating first correspondence probability information paired with one word pair.
  • reference numeral 501 denotes a process performed by the intermediate probability value calculating unit 125, which is acquired by the first corresponding probability information acquired by the first corresponding probability information acquiring unit 121 and the second corresponding probability information acquiring unit 122.
  • the second correspondence probability information is added at a predetermined ratio, and the addition result is multiplied by the translated sentence word position probability information acquisition unit 124 to calculate an intermediate probability value. It is processing. That is, the word alignment model construction apparatus 1 uses the probability ⁇ B (e (i)
  • word alignment of a small-scale parallel corpus can be executed with high accuracy.
  • the amount of calculation when estimating ⁇ S is similar to that when using DataS, but estimated with DataB. Probability is available.
  • the constant ⁇ “0 ⁇ ⁇ 1” (501 in FIG. 5) is appropriately set, so that the characteristic in the small-scale data DataS is obtained. This makes it possible to prevent the word alignment from being canceled.
  • the processing in the present embodiment may be realized by software. Then, this software may be distributed by software download or the like. Further, this software may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to other embodiments in this specification.
  • achieves the word alignment model construction apparatus 1 in this Embodiment is the following programs. That is, in this program, the computer-accessible recording medium is a first language sentence that is a first language sentence that is the first language and a second language sentence that is a second language sentence that is the second language.
  • a small-scale bilingual data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translations less than the first threshold (N1), and the small-scale parallel translation data Alignment model of acquired words, a word pair having a first word that is a first language word and a second word that is a second language word, and the first word and the second word correspond to each other
  • a small-scale word alignment model storage unit capable of storing a small-scale word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability of performing, and a second threshold (N2, N2> 1)
  • a word alignment model acquired from large-scale parallel translation data which is large-scale parallel translation data having the above-mentioned number of parallel translation sentences, a word pair having a first word and a second word, and the first word
  • a large-scale word alignment model storage unit storing a large-scale word alignment model having a plurality of word alignment data having second correspondence probability information that
  • the probability information calculation unit converts the word pair into one word pair for each parallel sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence.
  • the first corresponding probability information acquisition means for acquiring the first corresponding probability information of the corresponding initial value or the first corresponding probability information calculated in the previous loop, and for each parallel sentence included in the small-scale parallel translation data
  • Second correspondence probability information acquisition means for acquiring second correspondence probability information corresponding to the one word pair from the large-scale word alignment model storage unit for each word pair included in the sentence;
  • the first word position indicating the position of the first word in the first language sentence with respect to one word pair for each parallel sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence ,
  • a bilingual sentence word position information acquisition unit that acquires bilingual sentence word position information having the number of second sentence
  • Normalization means and until the end condition is satisfied, the previous first correspondence probability information acquisition means, the second correspondence probability information acquisition means, the parallel translation word position information acquisition means, the parallel translation word position probability information acquisition means, A program that causes a computer to function as the intermediate probability value calculating means, the first correspondence probability information acquiring means before normalization, and a control means that repeatedly performs the processing of the normalizing means. It is.
  • the parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit is the parallel sentence word position probability information acquired using the large-scale parallel translation data. It is preferable that the program functions a computer.
  • the computer has a first word position indicating a position of the first word in the first language sentence, and a first word position corresponding to the first word.
  • a second word position that is a position of two words and indicates a position in the second language sentence, a first sentence word number that is the number of words in the first language sentence, and a second sentence word that is the number of words in the second language sentence
  • Bilingual sentence word position information acquisition means for acquiring bilingual sentence word position information having a number, and bilingual sentence word position information acquired by the bilingual sentence word position information acquisition means for each word pair included in the large-scale parallel translation data Is further stored as a translated sentence word position probability information acquisition unit for acquiring the translated sentence word position probability information corresponding to the translated sentence word position probability information storage unit, and is stored in the translated sentence word position probability information storage unit.
  • Translated sentence word position probability information as the a translated sentence word position probability information translated sentence word position probability information acquisition means has acquired, it is preferably a program for
  • Embodiment 2 In this embodiment, a machine translation apparatus using the word alignment model configured in Embodiment 1 will be described.
  • FIG. 6 is a block diagram of the machine translation apparatus 2 in the present embodiment.
  • the machine translation device 2 is a device that translates a second language sentence and obtains a first language sentence, for example.
  • the machine translation device 2 includes a small-scale word alignment model storage unit 112, a parallel translation word position probability information storage unit 114, a reception unit 21, and a translation unit 22.
  • the reception unit 21 receives a second language sentence.
  • reception means reception of information input from an input device such as a keyboard, mouse, or touch panel, reception by a voice microphone, reception of information transmitted via a wired or wireless communication line, an optical disk or a magnetic disk
  • the concept includes reception of information read from a recording medium such as a semiconductor memory.
  • the second language sentence input means may be anything such as a keyboard, mouse or menu screen.
  • the accepting unit 21 can be realized by a device driver for input means such as a keyboard, control software for a menu screen, or the like.
  • the translation unit 22 translates each small word alignment model stored in the small word alignment model storage unit 112 and one or more parallel sentence word position information stored in the parallel sentence word position probability information storage unit 114. Using the sentence word position probability information, the first language sentence is acquired from the second language sentence received by the receiving unit 21. Since the translation unit 22 is a known technique, a detailed description thereof is omitted.
  • the translation unit 22 can usually be realized by an MPU, a memory, or the like.
  • the processing procedure of the translation unit 22 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).
  • e a first language sentence
  • f a second language sentence
  • E (i) is the i-th word of e
  • f (j) is the j-th word of f.
  • f (0) is introduced as the special word NULL. This is useful when the word in e does not correspond to any of the words in f.
  • Non-Patent Document 5 is applied to bilingual data of about 100 sentences as an estimation method of this probability, the parameter of this probability distribution diverges and an effective probability is obtained. This is because estimation is impossible. Further, this probability can be determined only from the number of words i, j, m, and n, and therefore the same probability can be used with high accuracy even if the parallel translation data is different.
  • f) is calculated from “ ⁇ S (j
  • f) is the translation result of the second language sentence f.
  • the word alignment of the small-scale bilingual corpus can be executed with high accuracy, and a highly accurate translation result can be obtained.
  • the software that realizes the information processing apparatus in the present embodiment is the following program. That is, in this program, the computer-accessible recording medium includes a small-scale word alignment model storage unit included in the word alignment model construction device and a bilingual word position probability information storage unit included in the word alignment model construction device.
  • a computer that accepts a second language sentence, a small word alignment model stored in the small word alignment model storage section, and one or more stored in the parallel sentence word position probability information storage section This is a program that functions as a translation unit that acquires a first language sentence from a second language sentence accepted by the accepting unit using bilingual sentence word position probability information for each bilingual sentence word position information.
  • FIG. 7 shows the external appearance of a computer that executes the program described in this specification to realize the word alignment model construction apparatus and the like of the various embodiments described above.
  • the above-described embodiments can be realized by computer hardware and a computer program executed thereon.
  • FIG. 7 is an overview diagram of the computer system 300
  • FIG. 8 is a block diagram of the system 300.
  • the computer system 300 includes a computer 301 including a CD-ROM drive 3012, a keyboard 302, a mouse 303, and a monitor 304.
  • the computer 301 in addition to the CD-ROM drive 3012, the computer 301 includes an MPU 3013, a bus 3014 connected to the MPU 3013 and the CD-ROM drive 3012, a ROM 3015 for storing programs such as a boot-up program, and an MPU 3013. And a RAM 3016 for temporarily storing instructions of the application program and providing a temporary storage space, and a hard disk 3017 for storing the application program, the system program, and data.
  • the computer 301 may further include a network card that provides connection to a LAN.
  • a program that causes the computer system 300 to execute the functions of the word alignment model construction apparatus of the above-described embodiment is stored in the CD-ROM 3101, inserted into the CD-ROM drive 3012, and further transferred to the hard disk 3017. good.
  • the program may be transmitted to the computer 301 via a network (not shown) and stored in the hard disk 3017.
  • the program is loaded into the RAM 3016 at the time of execution.
  • the program may be loaded directly from the CD-ROM 3101 or the network.
  • the program does not necessarily include an operating system (OS), a third-party program, or the like that causes the computer 301 to execute functions such as the word alignment model construction device of the above-described embodiment.
  • the program only needs to include an instruction portion that calls an appropriate function (module) in a controlled manner and obtains a desired result. How the computer system 300 operates is well known and will not be described in detail.
  • processing performed by hardware for example, processing performed by a modem or an interface card in the transmission step (only performed by hardware) Processing is not included.
  • the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.
  • two or more communication means existing in one apparatus may be physically realized by one medium.
  • each process may be realized by centralized processing by a single device, or may be realized by distributed processing by a plurality of devices.
  • the word alignment model construction apparatus has an effect that word alignment of a small-scale parallel corpus can be executed with high accuracy, and is useful as a word alignment model construction apparatus or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

[Problem] It has hitherto been impossible to accurately perform word alignment of a small-size parallel text corpus. [Solution] This word alignment model construction apparatus is provided with: a probability information calculation unit that, for each word pair included in a parallel text sentence included in small-size parallel text data, repeats a loop twice or more for one word pair to calculate first correspondence probability information that forms a pair with the one word pair, by using an initial value or the first correspondence probability information calculated in a previous loop, second correspondence probability information that forms a pair with one word pair included in a large-size word alignment model, and parallel text sentence word position probability information corresponding to one word pair in a parallel text sentence; and a correspondence probability information accumulation unit that accumulates, in association with each word pair, the first correspondence probability information calculated finally by the probability information calculation unit, in a small-size word alignment model storage unit. Thus, word alignment of the small-size parallel text corpus can be accurately performed.

Description

単語アライメントモデル構築装置、機械翻訳装置、単語アライメントモデルの生産方法、および記録媒体Word alignment model construction device, machine translation device, word alignment model production method, and recording medium
 本発明は、単語アライメントモデルを構築する単語アライメントモデル構築装置等に関するものである。 The present invention relates to a word alignment model construction apparatus for constructing a word alignment model.
 統計的機械翻訳(SMT)では、対訳データDataBから、翻訳モデルModelBを作成する。そして、ModelBを利用して、入力文を翻訳する。このとき、入力文がDataBと同様な分野の文である場合には、その翻訳結果は高精度であることが期待できる。けれども、入力文がDataBとは異なる分野の文であるときには、その翻訳精度は低下する。 Statistic machine translation (SMT) creates a translation model ModelB from parallel translation data DataB. And the input sentence is translated using ModelB. At this time, if the input sentence is a sentence in the same field as DataB, the translation result can be expected to be highly accurate. However, when the input sentence is a sentence in a field different from DataB, the translation accuracy decreases.
 この対策として、DataBと異なる分野の対訳データDataSを利用することにより、翻訳モデルModelSを作成し、ModelBとModelSの双方を利用することにより、DataSと同分野の文の翻訳精度の高精度化を達成可能である。 As a countermeasure, the translation model ModelS is created by using bilingual data DataS in a different field from DataB, and the translation accuracy of sentences in the same field as DataS is improved by using both ModelB and ModelS. Achievable.
 このModelBとModelSの双方を利用する方法としては、非特許文献1等に記載されている。なお、この方法をSMTの分野適応と呼ぶ。 The method using both ModelB and ModelS is described in Non-Patent Document 1 and the like. This method is called SMT field adaptation.
 分野適応のときの問題点としては、DataBが大規模(10-1000万文程度)であるのに対して、DataSが小規模(100文程度のこともある)であることがある。こうした場合には、DataBからModelBを作成するのは、問題なく高精度に行えるが、DataSからModelSを作成するのを高精度に行うのは困難である。 The problem when adapting to the field is that DataB is large (about 10 to 10 million sentences), whereas DataS is small (sometimes about 100 sentences). In such a case, ModelB can be created from DataB with high accuracy without problems, but it is difficult to create ModelS from DataS with high accuracy.
 その理由としては、ModelSを作る過程としては、通常、次のステップがとられるからである。
ステップ1.DataSを単語アライメントする
ステップ2.単語アライメントの結果から ModelSを構築する。
The reason is that the following steps are usually taken in the process of creating ModelS.
Step 1. 1. Word alignment of DataS Build ModelS from the result of word alignment.
 上記ステップ1の単語アライメントにおいて、オープンソースツールである GIZA++(非特許文献2参照)や fast_align(非特許文献3参照)が利用されることが多い。しかし、これらのツールは、小規模の対訳データにおける単語アライメントは精度よく実行できない。 単 語 In the word alignment in step 1 above, open source tools such as GIZA ++ (see Non-Patent Document 2) and fast_align (see Non-Patent Document 3) are often used. However, these tools cannot accurately perform word alignment in small-scale parallel translation data.
 上記の問題設定において、小規模対訳データDataSを単語アライメントする従来技術には、以下の(A)~(D)の4つがある。
(A)DataBから単語アライメントのための単語アライメントモデルAlignBを構築し、AlignBを利用してDataSをアライメントする。
(B)DataSから単語アライメントモデルAlignSを構築し、AlignSを利用してDataSをアライメントする。
(C)DataBとDataSを一つにまとめて、そこから単語アライメントモデル AlignBSを構築し、AlignBSを用いて DataSをアライメントする。
(D)DataBから単語アライメントモデル AlignBを構築する。そして、AlignBを初期モデルとして、 DataSから AlignBSを構築し、AlignBSを利用して DataSをアライメントする(非特許文献4参照)。ただし、AlignBを初期モデルとすることは、単語アライメントの十分統計量のみを AlignBから抽出して、その十分統計量を、DataSを利用して更新することをいう。
In the above problem setting, there are the following four (A) to (D) as conventional techniques for word alignment of small-scale parallel translation data DataS.
(A) A word alignment model AlignB for word alignment is constructed from DataB, and DataS is aligned using AlignB.
(B) A word alignment model AlignS is constructed from DataS, and DataS is aligned using AlignS.
(C) Combine DataB and DataS, build a word alignment model AlignBS from them, and align DataS using AlignBS.
(D) Construct a word alignment model AlignB from DataB. Then, AlignBS is constructed from DataS using AlignB as an initial model, and DataS is aligned using AlignBS (see Non-Patent Document 4). However, using AlignB as an initial model means extracting only sufficient statistics for word alignment from AlignB and updating the sufficient statistics using DataS.
 しかしながら、従来技術においては、小規模対訳コーパスの単語アライメントを精度よく実行できなかった。 However, in the prior art, word alignment of a small-scale parallel corpus could not be executed with high accuracy.
 さらに具体的には、(A)~(D)の4つ従来技術は、各々、以下の課題を有する。
(A)は、AlignBを利用して DataSを単語アライメントするのであるから、そもそもDataBとDataSが異なる以上、単語アライメント精度は低い。
(B)は、DataSが数万文以上あるときには効果的であるが、100文程度のときにはAlignSの精度が低くなるので、単語アライメント精度は低い。
(C)について、DataSとDataBの大きさが極端に違うときには、DataSで特徴的な単語対などが出現したとしても、それはDataBの多数派に打ち消されることが考えられる。たとえば、DataBが電気電子関係の多量の対訳文であり、そのため、「potential」は「電位」とアライメントすることが多い場合に、DataSとして情報関係の少量の対訳文があって、「potential」が「潜在的」とアライメントするとしても、それが大量の対訳文により打ち消される可能性がある。また、DataBとDataSを統合してから単語アライメントをするので、たとえ DataSが小さくても、DataBが大規模なため単語アライメントに多大の時間がかかる。
(D)について、出来上がった単語アライメントモデルAlignBSは、(C)のモデルと同等なものであるので上記(C)と同じ問題がある。(D)の利点は、AlignBを初期値として利用することにより、DataSを処理するだけで(C)と同等なものができる点である。
More specifically, each of the four conventional techniques (A) to (D) has the following problems.
In (A), DataS is word-aligned using AlignB, so the word alignment accuracy is low as long as DataB and DataS are different from each other.
(B) is effective when DataS is tens of thousands of sentences or more, but the accuracy of AlignS is low when there are about 100 sentences, so the word alignment accuracy is low.
Regarding (C), when the sizes of DataS and DataB are extremely different, even if a characteristic word pair appears in DataS, it can be countered by the majority of DataB. For example, if DataB is a large amount of bilingual texts related to electric and electronic, and "potential" is often aligned with "potential", there is a small amount of bilingual text related to data as DataS, and "potential" Even if it is aligned with "potential", it can be canceled by a large amount of translations. In addition, since word alignment is performed after integrating DataB and DataS, even if DataS is small, since DataB is large, word alignment takes a lot of time.
Regarding (D), the completed word alignment model AlignBS is equivalent to the model (C), and therefore has the same problem as (C) above. The advantage of (D) is that, by using AlignB as an initial value, the equivalent of (C) can be achieved by simply processing DataS.
 本発明においては、例えば、(D)と同様に DataSを処理するだけで単語アライメントモデルを構築できるが、(D)とは異なり、DataSに特有な単語アライメントも打ち消されない単語アライメントモデルを構築することを目的とする。 In the present invention, for example, a word alignment model can be constructed just by processing DataS as in (D), but unlike (D), a word alignment model that does not cancel word alignment unique to DataS is constructed. For the purpose.
 本第一の発明の単語アライメントモデル構築装置は、第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値(N1)未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、第一単語と第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、第二の閾値(N2,N2>N1)以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、第一単語と第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、1以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部と、小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、大規模単語アライメントモデルが有する一の単語対と対になる第二対応確率情報と、対訳文の中における一の単語対に対応する対訳文単語位置確率情報とを用いて、2回以上ループを繰り返して、一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、単語対ごとに、確率情報算出部が最終的に算出した第一対応確率情報を、単語対に対応付けて、小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部とを具備する単語アライメントモデル構築装置である。 The word alignment model construction device according to the first aspect of the present invention includes a first language sentence that is a sentence of a first language that is a first language and a second language sentence that is a sentence of a second language that is a second language. A small-scale parallel translation data storage unit that can store small-scale parallel translation data that is a pair and is a small-scale parallel translation data having a number of parallel translation sentences less than the first threshold (N1), and is acquired from the small-scale parallel translation data Correspondence probability relating to the probability that a word pair having a first word that is a word in the first language and a second word that is a word in the second language and the probability that the first word and the second word correspond to each other are word alignment models A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information as information, and a number equal to or greater than a second threshold (N2, N2> N1) Have a translation of This is an alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data, and the correspondence between the word pair having the first word and the second word and the probability that the first word corresponds to the second word Information obtained from a large-scale word alignment model storage unit storing a large-scale word alignment model having a plurality of word alignment data having second correspondence probability information that is probability information and one or more parallel translations The first word position indicating the position of the first word in the first language sentence, the second word position indicating the position in the second language sentence that is the position of the second word corresponding to the first word, the first For each translation sentence word position information having the number of first sentence words that is the number of words in the language sentence and the number of second sentence words that is the number of words in the second language sentence, the translation related to the probability of matching the translation sentence word position information The word position probability information storage unit that can store the word position probability information and the word pair possessed by the parallel sentence included in the small-scale parallel data are calculated for one word pair in the initial value or the previous loop. First correspondence probability information, second correspondence probability information paired with one word pair of the large-scale word alignment model, and bilingual word position probability information corresponding to one word pair in the bilingual sentence. Using the probability information calculation unit that calculates the first correspondence probability information that is paired with one word pair by repeating the loop twice or more, and the first that the probability information calculation unit finally calculates for each word pair A word alignment model construction apparatus comprising a correspondence probability information storage unit that stores correspondence probability information in association with a word pair in a small word alignment model storage unit.
 かかる構成により、小規模対訳コーパスの単語アライメントを精度よく実行できる。 With such a configuration, word alignment of a small-scale parallel corpus can be executed with high accuracy.
 また、本第二の発明の単語アライメントモデル構築装置は、第一の発明に対して、確率情報算出部は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する前回第一対応確率情報取得手段と、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する第二対応確率情報を、大規模単語アライメントモデル格納部から取得する第二対応確率情報取得手段と、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段と、前回第一対応確率情報取得手段が取得した第一対応確率情報と第二対応確率情報取得手段が取得した第二対応確率情報とを予め決められた割合で加算し、加算した結果と、対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する中間確率値算出手段と、単語対ごとに、中間確率値算出手段が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する正規化前第一対応確率情報取得手段と、単語対ごとに、正規化前第一対応確率情報取得手段が取得した正規化前の第一対応確率情報に対して、正規化の処理を行い、第一対応確率情報を取得する正規化手段と、終了条件を満たすまで、前回第一対応確率情報取得手段、第二対応確率情報取得手段、対訳文単語位置情報取得手段、対訳文単語位置確率情報取得手段、中間確率値算出手段、正規化前第一対応確率情報取得手段、および正規化手段の処理を繰り返して行わせる制御手段とを具備する単語アライメントモデル構築装置である。 In addition, the word alignment model construction device according to the second aspect of the invention provides the probability information calculation unit for each bilingual sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence, relative to the first invention. The first correspondence probability information acquisition means for obtaining the first correspondence probability information of the initial value corresponding to one word pair or the first correspondence probability information calculated in the previous loop for one word pair, and a small scale Second correspondence probability information corresponding to one word pair is acquired from the large-scale word alignment model storage unit for each word pair included in the parallel translation data and for each word pair included in the parallel translation sentence. Second correspondence probability information acquisition means, and for each bilingual sentence included in the small-scale parallel translation data and for each word pair included in the bilingual sentence, for each word pair, the first word in the first language sentence First word position indicating position The second word position corresponding to the first word and indicating the position in the second language sentence, the number of first sentence words that is the number of words in the first language sentence, and the second language sentence Bilingual sentence word position information acquisition means for acquiring parallel sentence word position information having the number of second sentence words that is the number of words, and a parallel sentence word position corresponding to the parallel sentence word position information acquired by the parallel sentence word position information acquisition means The bilingual sentence word position probability information acquiring means for acquiring the probability information from the parallel sentence word position probability information storage unit, the first corresponding probability information and the second corresponding probability information acquiring means acquired by the first corresponding probability information acquiring means last time The acquired second correspondence probability information is added at a predetermined ratio, and the result of addition is multiplied by the parallel sentence word position probability information acquired by the parallel sentence word position probability information acquisition means to calculate an intermediate probability value. Intermediate probability value calculating means For each word pair, using the intermediate probability value calculated by the intermediate probability value calculation means, the first correspondence probability information acquisition means before normalization for acquiring the first correspondence probability information before normalization, and the normal correspondence for each word pair Normalization processing is performed on the first correspondence probability information before normalization acquired by the first correspondence probability information acquisition means before normalization and the first correspondence probability information is acquired, and until the end condition is satisfied , Previous first correspondence probability information acquisition means, second correspondence probability information acquisition means, parallel translation word position information acquisition means, parallel translation word position probability information acquisition means, intermediate probability value calculation means, pre-normalization first correspondence probability information acquisition And a word alignment model construction apparatus comprising control means for repeatedly performing processing of normalization means.
 かかる構成により、小規模対訳コーパスの単語アライメントを精度よく実行できる。 With such a configuration, word alignment of a small-scale parallel corpus can be executed with high accuracy.
 また、本第三の発明の単語アライメントモデル構築装置は、第一または第二の発明に対して、対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、大規模対訳データを用いて取得された対訳文単語位置確率情報である単語アライメントモデル構築装置である。 The word alignment model construction apparatus according to the third aspect of the present invention provides a parallel translation word position probability information stored in the parallel translation word position probability information storage unit, which is a large-scale parallel translation. It is the word alignment model construction apparatus which is the bilingual sentence word position probability information acquired using data.
 かかる構成により、小規模対訳コーパスの単語アライメントをより精度よく実行できる。 With such a configuration, word alignment of a small-scale parallel corpus can be executed with higher accuracy.
 また、本第四の発明の単語アライメントモデル構築装置は、第一の発明に対して、大規模対訳データが有する各単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、大規模対訳データが有する各単語対に対して、対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段とをさらに具備し、対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報である単語アライメントモデル構築装置である。 The word alignment model construction device of the fourth aspect of the present invention indicates the position of the first word in the first language sentence for each word pair included in the large-scale parallel translation data, relative to the first aspect. A first word position, a second word position corresponding to the first word and indicating a position in the second language sentence, a first sentence word number that is the number of words in the first language sentence, and Bilingual word position information acquisition means for acquiring bilingual word position information having the number of second sentence words that is the number of words in the bilingual sentence, and bilingual word position information for each word pair included in the large-scale parallel data The bilingual sentence word position probability information acquiring means for acquiring the bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the acquiring means from the bilingual sentence word position probability information storage unit, Stored in the probability information storage Translated sentence word position probability information is word alignment model building device is the translation word position probability information translated sentence word position probability information acquisition means has acquired.
 かかる構成により、小規模対訳コーパスの単語アライメントをより精度よく実行できる。 With such a configuration, word alignment of a small-scale parallel corpus can be executed with higher accuracy.
 また、本第五の発明の機械翻訳装置は、第一から第四いずれか1つの発明に対して、単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部と、第二言語文を受け付ける受付部と、小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および対訳文単語位置確率情報格納部に格納されている1以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、受付部が受け付けた第二言語文から第一言語文を取得する翻訳部とを具備する機械翻訳装置である。 The machine translation device according to the fifth aspect of the invention has a small word alignment model storage unit and a word alignment model construction device that the word alignment model construction device has, with respect to any one of the first to fourth inventions. Stored in the bilingual sentence word position probability information storage unit, the second language sentence receiving unit, the small word alignment model stored in the small word alignment model storage unit, and the bilingual word position probability information storage unit A translation unit comprising: a translation unit that acquires a first language sentence from a second language sentence received by a reception unit using bilingual sentence word position probability information for each of one or more parallel translation word position information .
 かかる構成により、小規模対訳コーパスの単語アライメントを精度よく実行できる結果、精度の良い翻訳結果が得られる。 With such a configuration, the word alignment of the small-scale bilingual corpus can be executed with high accuracy, and as a result, accurate translation results can be obtained.
 本発明による単語アライメントモデル構築装置によれば、小規模対訳コーパスの単語アライメントを精度よく実行できる。 The word alignment model construction apparatus according to the present invention can perform word alignment of a small-scale parallel corpus with high accuracy.
実施の形態1における単語アライメントモデル構築装置1のブロック図Block diagram of word alignment model construction apparatus 1 according to Embodiment 1 同単語アライメントモデル構築装置1の動作について説明するフローチャートThe flowchart explaining operation | movement of the word alignment model construction apparatus 1 同E-stepの詳細について説明するフローチャートFlow chart explaining details of the E-step 同中間確率値を算出する処理の詳細について説明するフローチャートThe flowchart explaining the detail of the process which calculates the intermediate probability value 同単語アライメントモデル構築装置1の動作を示す図The figure which shows operation | movement of the word alignment model construction apparatus 1 実施の形態2における機械翻訳装置2のブロック図Block diagram of machine translation apparatus 2 according to Embodiment 2 上記実施の形態におけるコンピュータシステムの概観図Overview of the computer system in the above embodiment 同コンピュータシステムのブロック図Block diagram of the computer system
 以下、単語アライメントモデル構築装置等の実施形態について図面を参照して説明する。なお、実施の形態において同じ符号を付した構成要素は同様の動作を行うので、再度の説明を省略する場合がある。 Hereinafter, embodiments of the word alignment model construction device and the like will be described with reference to the drawings. In addition, since the component which attached | subjected the same code | symbol in embodiment performs the same operation | movement, description may be abbreviate | omitted again.
 (実施の形態1)
 本実施の形態において、2つの言語の単語アライメントモデルを作成する装置であって、大規模データの単語アライメントの確率と小規模データの単語アライメントの確率との両方を用いて、最終的な小規模データの単語アライメントの確率の情報を取得する単語アライメントモデル構築装置について説明する。
(Embodiment 1)
In the present embodiment, an apparatus for creating a word alignment model in two languages, using both the word alignment probability of large-scale data and the word alignment probability of small-scale data, the final small-scale A word alignment model construction apparatus that acquires information on the probability of word alignment of data will be described.
 図1は、本実施の形態における単語アライメントモデル構築装置1のブロック図である。 FIG. 1 is a block diagram of the word alignment model construction apparatus 1 in the present embodiment.
 単語アライメントモデル構築装置1は、格納部11、確率情報算出部12、および対応確率情報蓄積部13を備える。 The word alignment model construction device 1 includes a storage unit 11, a probability information calculation unit 12, and a corresponding probability information storage unit 13.
 格納部11は、小規模対訳データ格納部111、小規模単語アライメントモデル格納部112、大規模単語アライメントモデル格納部113、および対訳文単語位置確率情報格納部114を備える。 The storage unit 11 includes a small bilingual data storage unit 111, a small word alignment model storage unit 112, a large word alignment model storage unit 113, and a bilingual word position probability information storage unit 114.
 確率情報算出部12は、前回第一対応確率情報取得手段121、第二対応確率情報取得手段122、対訳文単語位置情報取得手段123、対訳文単語位置確率情報取得手段124、中間確率値算出手段125、正規化前第一対応確率情報取得手段126、正規化手段127、および制御手段128を備える。 The probability information calculation unit 12 includes a previous first correspondence probability information acquisition unit 121, a second correspondence probability information acquisition unit 122, a bilingual sentence word position information acquisition unit 123, a bilingual sentence word position probability information acquisition unit 124, and an intermediate probability value calculation unit. 125, first pre-normalization probability information acquisition means 126, normalization means 127, and control means 128.
 格納部11は、各種の情報を格納し得る。各種の情報とは、例えば、後述する小規模対訳データ、小規模単語アライメントモデル、大規模単語アライメントモデル、対訳文単語位置確率等である。 The storage unit 11 can store various types of information. The various types of information include, for example, small-scale parallel translation data, small-scale word alignment model, large-scale word alignment model, parallel sentence word position probability, and the like, which will be described later.
 小規模対訳データ格納部111は、小規模対訳データを格納し得る。小規模対訳データは、1以上の対訳文を有する。対訳文は、第一言語文と第二言語文との対である。第一言語文は、第一の言語である第一言語の文である。第二言語文は、第二の言語である第二言語の文である。第二言語文は、第一言語文を第二言語に翻訳した結果である。小規模対訳データは、少ない数の対訳文を有する。小規模対訳データは、通常、第一の閾値(N1)未満の数の対訳文を有する。小規模対訳データは、例えば、10~10万程度の数の対訳文を有する。 The small parallel translation data storage unit 111 can store small parallel translation data. The small-scale parallel translation data has one or more parallel translation sentences. A bilingual sentence is a pair of a first language sentence and a second language sentence. The first language sentence is a sentence in the first language that is the first language. The second language sentence is a sentence of the second language that is the second language. The second language sentence is a result of translating the first language sentence into the second language. Small-scale parallel translation data has a small number of parallel translation sentences. The small-scale parallel translation data usually has a number of parallel translation sentences less than the first threshold (N1). The small-scale parallel translation data has, for example, about 100,000 to 100,000 parallel translations.
 また、第一言語と第二言語とは異なる言語であれば、どの言語でも良い。第一言語、または第二言語は、例えば、英語、日本語、中国語、フランス語、ドイツ語、スペイン語、韓国語等である。 Also, any language can be used as long as the first language and the second language are different. The first language or the second language is, for example, English, Japanese, Chinese, French, German, Spanish, Korean or the like.
 小規模単語アライメントモデル格納部112は、小規模単語アライメントモデルを格納し得る。小規模単語アライメントモデルは、小規模対訳データを用いて取得される単語のアライメントモデルである。小規模単語アライメントモデルは、複数の単語アライメントデータを有する。単語アライメントデータは、単語対と、第一対応確率情報とを有する。単語対は、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する。第一対応確率情報は、第一単語と第二単語とが対応する確率に関する情報である対応確率情報である。 The small word alignment model storage unit 112 can store a small word alignment model. The small word alignment model is a word alignment model acquired using small parallel translation data. The small-scale word alignment model has a plurality of word alignment data. The word alignment data includes a word pair and first correspondence probability information. The word pair has a first word that is a word in the first language and a second word that is a word in the second language. The first correspondence probability information is correspondence probability information that is information regarding the probability that the first word corresponds to the second word.
 大規模単語アライメントモデル格納部113は、大規模単語アライメントモデルを格納している。大規模単語アライメントモデルは、大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルである。単語のアライメントモデルは、第一単語と第二単語とを有する単語対と、第一単語と第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する。大規模対訳データは、通常、第二の閾値(N2,N2>N1)以上の数の対訳文を有する。通常、N2は、N1と比較して、1桁以上大きい(10倍以上である)。なお、N1、N2は自然数である。 The large-scale word alignment model storage unit 113 stores a large-scale word alignment model. The large-scale word alignment model is a word alignment model acquired from large-scale parallel translation data that is large-scale parallel translation data. The word alignment model includes a word pair having a first word and a second word, and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other. Large-scale parallel translation data usually has a number of parallel translations equal to or greater than a second threshold (N2, N2> N1). Usually, N2 is one digit or more larger (10 times or more) than N1. N1 and N2 are natural numbers.
 対訳文単語位置確率情報格納部114は、1以上の対訳文単語位置情報ごとに、対訳文単語位置確率情報を格納し得る。対訳文単語位置情報は、1以上の対訳文から取得された情報であり、第一単語位置、第二単語位置、第一文単語数、および第二文単語数を有する。第一単語位置は、第一言語文の中における第一単語の位置を示す情報である。第二単語位置は、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す情報である。第一文単語数は、第一言語文の単語数である。第二文単語数は、第二言語文の単語数である。対訳文単語位置確率情報は、対訳文単語位置情報に合致する確率に関する情報である。通常、対訳文単語位置確率情報は、対訳文単語位置情報に合致する確率である。 The parallel translation word position probability information storage unit 114 can store parallel translation word position probability information for each of one or more parallel translation word position information. The bilingual sentence word position information is information acquired from one or more parallel translation sentences, and has a first word position, a second word position, a first sentence word number, and a second sentence word number. The first word position is information indicating the position of the first word in the first language sentence. The second word position is the position of the second word corresponding to the first word and is information indicating the position in the second language sentence. The first sentence word count is the number of words in the first language sentence. The number of second sentence words is the number of words in the second language sentence. The bilingual sentence word position probability information is information regarding the probability of matching with the bilingual sentence word position information. Normally, the bilingual sentence word position probability information is a probability of matching with the bilingual sentence word position information.
 対訳文単語位置確率情報格納部114は、大規模対訳データが有する1以上の対訳文から取得された対訳文単語位置情報と対訳文単語位置確率情報との組を、2以上、格納していることは好適である。 The bilingual sentence word position probability information storage unit 114 stores two or more pairs of bilingual sentence word position information and bilingual sentence word position probability information acquired from one or more parallel translation sentences included in the large-scale parallel translation data. That is preferred.
 対訳文単語位置確率情報格納部114に格納されている対訳文単語位置情報と対訳文単語位置確率情報との組は、大規模対訳データを用いて、後述する対訳文単語位置情報取得手段123、対訳文単語位置確率情報取得手段124が取得した情報であることは好適である。 A pair of the parallel sentence word position information and the parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit 114 uses a large-scale parallel translation data, and a parallel sentence word position information acquisition unit 123 to be described later. The information acquired by the bilingual sentence word position probability information acquisition unit 124 is suitable.
 対訳文単語位置確率情報格納部114は、大規模対訳データと小規模対訳データとが有する1以上の対訳文から取得された対訳文単語位置情報と対訳文単語位置確率情報との組を、2以上、格納していても良い。 The bilingual sentence word position probability information storage unit 114 stores two pairs of bilingual sentence word position information and bilingual sentence word position probability information acquired from one or more parallel translation sentences included in the large-scale bilingual data and the small-scale bilingual data. As described above, it may be stored.
 対訳文単語位置確率情報格納部114に格納されている対訳文単語位置情報と対訳文単語位置確率情報との組は、大規模対訳データと小規模対訳データとを用いて、後述する対訳文単語位置情報取得手段123、対訳文単語位置確率情報取得手段124が取得した情報であっても良い。 A pair of parallel sentence word position information and parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit 114 is a parallel sentence word to be described later using large-scale parallel translation data and small-scale parallel translation data. The information acquired by the position information acquisition unit 123 and the bilingual word position probability information acquisition unit 124 may be used.
 確率情報算出部12は、小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、大規模単語アライメントモデルが有する一の単語対と対になる第二対応確率情報と、対訳文の中における一の単語対に対応する対訳文単語位置確率情報とを用いて、2回以上ループを繰り返して、一の単語対と対になる第一対応確率情報を算出する。 The probability information calculation unit 12 sets the first correspondence probability information calculated in the initial value or the previous loop for one word pair and the large-scale word for each word pair included in the parallel translation sentence included in the small-scale parallel translation data. Using the second correspondence probability information paired with one word pair of the alignment model and the parallel translation word position probability information corresponding to one word pair in the parallel translation sentence, the loop is repeated two or more times, First correspondence probability information to be paired with one word pair is calculated.
 前回第一対応確率情報取得手段121は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する。 The previous first correspondence probability information acquisition unit 121 sets the initial value corresponding to one word pair for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. First correspondence probability information or first correspondence probability information calculated in the previous loop is acquired.
 具体的には、例えば、前回第一対応確率情報取得手段121は、第一言語(e)のi番目の単語e(i)と第二言語(f)のj番目の単語f(j)とに対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報「θS(e(i)|f(j))」を取得する。かかる第一対応確率情報は、通常、格納部11に少なくとも一時的に格納されている。 Specifically, for example, the first correspondence probability information acquisition unit 121 last time includes the i-th word e (i) in the first language (e) and the j-th word f (j) in the second language (f). The first correspondence probability information of the initial value corresponding to or the first correspondence probability information “θS (e (i) | f (j))” calculated in the previous loop is acquired. Such first correspondence probability information is usually stored at least temporarily in the storage unit 11.
 第二対応確率情報取得手段122は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する第二対応確率情報を、大規模単語アライメントモデル格納部113から取得する。 The second correspondence probability information acquisition unit 122 is configured to provide a second correspondence probability corresponding to one word pair for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. Information is acquired from the large-scale word alignment model storage unit 113.
 具体的には、例えば、第二対応確率情報取得手段122は、第一言語(e)のi番目の単語e(i)と第二言語(f)のj番目の単語f(j)とに対応する第二対応確率情報「θB(e(i)|f(j))」を大規模単語アライメントモデル格納部113から取得する。 Specifically, for example, the second correspondence probability information acquisition unit 122 adds the i-th word e (i) in the first language (e) and the j-th word f (j) in the second language (f). Corresponding second correspondence probability information “θB (e (i) | f (j))” is acquired from the large-scale word alignment model storage unit 113.
 対訳文単語位置情報取得手段123は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する。 The bilingual sentence word position information acquisition means 123 performs the first word in the first language sentence for each bilingual sentence included in the small bilingual data and for each word pair included in the bilingual sentence. The first word position indicating the position, the second word position corresponding to the first word and indicating the position in the second language sentence, and the first sentence word number indicating the number of words in the first language sentence And translated sentence word position information having the number of second sentence words that is the number of words of the second language sentence.
 具体的には、例えば、対訳文単語位置情報取得手段123は、一の対訳文の中の一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置(i)、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置(j)、第一言語文の単語数である第一文単語数(m)、および第二言語文の単語数である第二文単語数(n)を、一の対訳文と一の単語対とから取得する。なお、(i) (j)は、通常、文の中の何番目の単語であるかを示す情報である。 Specifically, for example, the bilingual sentence word position information acquisition unit 123 has a first word position indicating the position of the first word in the first language sentence with respect to one word pair in one bilingual sentence. (i) the second word position corresponding to the first word and the second word position indicating the position in the second language sentence (j), the first sentence word number that is the number of words in the first language sentence ( m) and the second sentence word number (n), which is the number of words in the second language sentence, are acquired from one parallel translation sentence and one word pair. Note that (i) (j) is usually information indicating the number of words in a sentence.
 対訳文単語位置情報取得手段123は、大規模対訳データが有する各単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する。 The bilingual sentence word position information acquisition means 123 is a first word position indicating the position of the first word in the first language sentence for each word pair included in the large-scale parallel translation data, and a second corresponding to the first word. The second word position indicating the position of the word in the second language sentence, the first sentence word number that is the number of words in the first language sentence, and the second sentence word number that is the number of words in the second language sentence The bilingual sentence word position information which has is acquired.
 対訳文単語位置確率情報取得手段124は、対訳文単語位置情報取得手段123が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部114から取得する。 The bilingual sentence word position probability information acquisition unit 124 acquires bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquisition unit 123 from the bilingual sentence word position probability information storage unit 114. .
 具体的には、例えば、対訳文単語位置確率情報取得手段124は、対訳文単語位置情報取得手段123が取得した対訳文単語位置情報(i,j,m,n)に対応する対訳文単語位置確率情報「δS(j|i,m,n)」を、対訳文単語位置確率情報格納部114から検索する。 Specifically, for example, the translated sentence word position probability information acquiring unit 124 converts the translated sentence word position corresponding to the translated sentence word position information (i, j, m, n) acquired by the translated sentence word position information acquiring unit 123. The probability information “δS (j | i, m, n)” is searched from the bilingual sentence word position probability information storage unit 114.
 対訳文単語位置確率情報取得手段124は、大規模対訳データが有する各単語対に対して、対訳文単語位置情報取得手段123が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部114から取得することは好適である。 The bilingual sentence word position probability information acquiring unit 124 calculates bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquiring unit 123 for each word pair included in the large-scale bilingual data. It is preferable to obtain from the translated sentence word position probability information storage unit 114.
 中間確率値算出手段125は、前回第一対応確率情報取得手段121が取得した第一対応確率情報と、第二対応確率情報取得手段122が取得した第二対応確率情報と、対訳文単語位置確率情報取得手段124が取得した対訳文単語位置確率情報とを用いて、中間確率値を算出する。 The intermediate probability value calculation means 125 includes first correspondence probability information acquired by the first correspondence probability information acquisition means 121 last time, second correspondence probability information acquired by the second correspondence probability information acquisition means 122, and bilingual sentence word position probability. An intermediate probability value is calculated using the parallel translation word position probability information acquired by the information acquisition means 124.
 具体的には、例えば、中間確率値算出手段125は、前回第一対応確率情報取得手段121が取得した第一対応確率情報と第二対応確率情報取得手段122が取得した第二対応確率情報とを予め決められた割合で加算し、加算した結果と、対訳文単語位置確率情報取得手段124が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する。 Specifically, for example, the intermediate probability value calculation unit 125 includes the first correspondence probability information acquired by the first correspondence probability information acquisition unit 121 last time and the second correspondence probability information acquired by the second correspondence probability information acquisition unit 122. Are added at a predetermined ratio, and the addition result is multiplied by the bilingual sentence word position probability information acquisition unit 124 to calculate an intermediate probability value.
 さらに具体的には、例えば、中間確率値算出手段125は、演算式「p(i|j)=(λθB(e(i)|f(j)) +(1-λ)θS(e(i)|f(j)))*δS(j|i,m,n)」により、中間確率値p(i|j)を算出する。ここで、定数λは、「0<λ<1」を満たす数値であり、通常、「0.5」である。 More specifically, for example, the intermediate probability value calculation unit 125 calculates the arithmetic expression “p (i | j) = (λθB (e (i) | f (j)) + (1-λ) θS (e (i ) | f (j))) * ΔS (j | i, m, n) ”, the intermediate probability value p (i | j) is calculated. Here, the constant λ is a numerical value satisfying “0 <λ <1”, and is usually “0.5”.
 正規化前第一対応確率情報取得手段126は、単語対ごとに、中間確率値算出手段125が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する。 The first correspondence probability information acquisition unit 126 before normalization uses the intermediate probability value calculated by the intermediate probability value calculation unit 125 for each word pair to acquire first correspondence probability information before normalization.
 具体的には、例えば、正規化前第一対応確率情報取得手段126は、単語対ごとに、中間確率値p(i|j)を用いて、演算式「C(e(i)|f(j))+= p(i|j)/sum」により、正規化前の第一対応確率情報「C(e(i)|f(j))」を算出する。なお、「sum」は、単語対に対応する中間確率値を累積加算した情報である。 Specifically, for example, the first correspondence probability information acquisition unit 126 before normalization uses the intermediate probability value p (i | j) for each word pair, and uses the arithmetic expression “C (e (i) | f ( j)) + = p (i | j) / sum ”, the first correspondence probability information“ C (e (i) | f (j)) ”before normalization is calculated. “Sum” is information obtained by cumulatively adding intermediate probability values corresponding to word pairs.
 正規化手段127は、単語対ごとに、正規化前第一対応確率情報取得手段126が取得した正規化前の第一対応確率情報「C(e(i)|f(j))」に対して、正規化の処理を行い、第一対応確率情報を取得する。 For each word pair, the normalization means 127 applies the first correspondence probability information “C (e (i) | f (j))” before normalization acquired by the first correspondence probability information acquisition means 126 before normalization. Then, normalization processing is performed to obtain first correspondence probability information.
 具体的には、例えば、正規化手段127は、単語対ごとに、正規化前の第一対応確率情報「C(e(i)|f(j))」を平均場近似で正規化して、第一対応確率情報「θS(e(i)|f(j))」を取得する。なお、平均場近似は公知技術であるので、詳細な説明は省略する。 Specifically, for example, the normalizing means 127 normalizes the first correspondence probability information “C (e (i) | f (j))” before normalization for each word pair by means of mean field approximation, First correspondence probability information “θS (e (i) | f (j))” is acquired. Since the mean field approximation is a known technique, a detailed description is omitted.
 また、例えば、正規化手段127は、以下の数式1により、正規化前の第一対応確率情報「C(e(i)|f(j))」を正規化し、第一対応確率情報「θS(e(i)|f(j))」を取得しても良い。 Further, for example, the normalizing means 127 normalizes the first correspondence probability information “C (e (i) | f (j))” before normalization by the following formula 1, and the first correspondence probability information “θS”. (e (i) | f (j)) ”may be acquired.
Figure JPOXMLDOC01-appb-M000001
Figure JPOXMLDOC01-appb-M000001
 なお、数式1において、kは任意の単語の添え字である。すなわち、数式1の分母は、全ての単語対についての和をとることを示している。 In Equation 1, k is a subscript of an arbitrary word. That is, the denominator of Equation 1 indicates that the sum is taken for all word pairs.
 制御手段128は、終了条件を満たすまで、前回第一対応確率情報取得手段121、第二対応確率情報取得手段122、対訳文単語位置情報取得手段123、対訳文単語位置確率情報取得手段124、中間確率値算出手段125、正規化前第一対応確率情報取得手段126、および正規化手段127の処理を繰り返して行わせる。つまり、制御手段128は、終了条件を満たすまで、前回第一対応確率情報取得手段121、第二対応確率情報取得手段122、対訳文単語位置情報取得手段123、対訳文単語位置確率情報取得手段124、中間確率値算出手段125、正規化前第一対応確率情報取得手段126、および正規化手段127の処理をループさせる制御を行う。ここで、終了条件とは、例えば、予め決められたループ回数になったことである。また、予め決められたループ回数は、例えば、4回から6回のいずれかである。 Until the end condition is satisfied, the control unit 128 performs the previous first correspondence probability information acquisition unit 121, the second correspondence probability information acquisition unit 122, the bilingual sentence word position information acquisition unit 123, the bilingual sentence word position probability information acquisition unit 124, the intermediate The processes of the probability value calculating unit 125, the first correspondence probability information acquiring unit 126 before normalization, and the normalizing unit 127 are repeatedly performed. That is, the control means 128 until the first condition is satisfied, the previous first correspondence probability information acquisition means 121, the second correspondence probability information acquisition means 122, the bilingual sentence word position information acquisition means 123, and the bilingual sentence word position probability information acquisition means 124. Then, control is performed to loop the processes of the intermediate probability value calculation means 125, the first correspondence probability information acquisition means 126 before normalization, and the normalization means 127. Here, the termination condition is, for example, that a predetermined number of loops has been reached. The predetermined number of loops is, for example, any one of 4 to 6.
 対応確率情報蓄積部13は、単語対ごとに、確率情報算出部12が最終的に算出した第一対応確率情報を、単語対に対応付けて、小規模単語アライメントモデル格納部112に蓄積する。なお、確率情報算出部12が最終的に算出した第一対応確率情報とは、制御手段128が終了条件を満たすとして、ループの処理を終了した場合の、最終的な第一対応確率情報である。 The correspondence probability information storage unit 13 stores the first correspondence probability information finally calculated by the probability information calculation unit 12 for each word pair in the small word alignment model storage unit 112 in association with the word pair. Note that the first correspondence probability information finally calculated by the probability information calculation unit 12 is the final first correspondence probability information when the processing of the loop is completed assuming that the control unit 128 satisfies the termination condition. .
 格納部11を構成している小規模対訳データ格納部111、小規模単語アライメントモデル格納部112、大規模単語アライメントモデル格納部113、および対訳文単語位置確率情報格納部114は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。 The small bilingual data storage unit 111, the small word alignment model storage unit 112, the large word alignment model storage unit 113, and the bilingual word position probability information storage unit 114 constituting the storage unit 11 are nonvolatile recordings. A medium is preferred, but a volatile recording medium can also be realized.
 格納部11に情報が記憶される過程は問わない。例えば、記録媒体を介して情報が格納部11で記憶されるようになってもよく、通信回線等を介して送信された情報が格納部11で記憶されるようになってもよく、あるいは、入力デバイスを介して入力された情報が格納部11で記憶されるようになってもよい。 The process of storing information in the storage unit 11 does not matter. For example, information may be stored in the storage unit 11 via a recording medium, information transmitted via a communication line or the like may be stored in the storage unit 11, or Information input via the input device may be stored in the storage unit 11.
 確率情報算出部12を構成している前回第一対応確率情報取得手段121、第二対応確率情報取得手段122、対訳文単語位置情報取得手段123、対訳文単語位置確率情報取得手段124、中間確率値算出手段125、正規化前第一対応確率情報取得手段126、正規化手段127、制御手段128、および対応確率情報蓄積部13は、通常、MPUやメモリ等から実現され得る。確率情報算出部12等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはROM等の記録媒体に記録されている。但し、ハードウェア(専用回路)で実現しても良い。 Previous correspondence first correspondence probability information acquisition means 121, second correspondence probability information acquisition means 122, parallel translation word position information acquisition means 123, parallel translation word position probability information acquisition means 124, intermediate probability constituting the probability information calculation unit 12 The value calculation unit 125, the first correspondence probability information acquisition unit 126 before normalization, the normalization unit 127, the control unit 128, and the correspondence probability information storage unit 13 can be usually realized by an MPU, a memory, or the like. The processing procedure of the probability information calculation unit 12 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).
 次に、単語アライメントモデル構築装置1の動作について、図2のフローチャートを用いて説明する。なお、図2のフローチャートは、2以上の全単語対についての処理である。つまり、図2のフローチャートにおける、E-Stepが全単語対について処理され、次に、M-Stepが全単語対について処理される。 Next, the operation of the word alignment model construction device 1 will be described using the flowchart of FIG. Note that the flowchart of FIG. 2 is processing for all two or more word pairs. That is, E-Step in the flowchart of FIG. 2 is processed for all word pairs, and then M-Step is processed for all word pairs.
 (ステップS201)確率情報算出部12は、初期化処理を行う。初期化処理は、例えば、各種の変数に初期値を代入する処理である。初期化処理は、例えば、正規化前の第一対応確率情報「C(e(i)|f(j))」に「0」を代入する処理である。また、初期化処理は、例えば、第一対応確率情報「θS(e(i)|f(j))」に「0」を代入する処理である。 (Step S201) The probability information calculation unit 12 performs an initialization process. The initialization process is a process of substituting initial values for various variables, for example. The initialization process is, for example, a process of substituting “0” into first correspondence probability information “C (e (i) | f (j))” before normalization. The initialization process is a process of substituting “0” into the first correspondence probability information “θS (e (i) | f (j))”, for example.
 (ステップS202)確率情報算出部12は、E-stepを行い、正規化前の第一対応確率情報「C(e(i)|f(j))」を取得する。そして、確率情報算出部12は、各単語対に対応付けて、正規化前の第一対応確率情報「C(e(i)|f(j))」を図示しないバッファまたは格納部11に一時蓄積する。E-stepの詳細について、図3のフローチャートを用いて後述する。 (Step S202) The probability information calculation unit 12 performs E-step, and acquires first correspondence probability information “C (e (i) | f (j))” before normalization. Then, the probability information calculation unit 12 temporarily stores the first correspondence probability information “C (e (i) | f (j))” before normalization in a buffer or storage unit 11 (not shown) in association with each word pair. accumulate. Details of the E-step will be described later with reference to the flowchart of FIG.
 (ステップS203)正規化手段127は、M-stepを行い、第一対応確率情報「θS(e(i)|f(j))」を取得する。そして、確率情報算出部12は、各単語対に対応付けて、第一対応確率情報「θS(e(i)|f(j))」を図示しないバッファまたは格納部11に一時蓄積する。 (Step S203) The normalizing means 127 performs M-step to obtain the first correspondence probability information “θS (e (i) | f (j))”. Then, the probability information calculation unit 12 temporarily accumulates first correspondence probability information “θS (e (i) | f (j))” in a buffer or storage unit 11 (not shown) in association with each word pair.
 なお、M-stepとは、正規化前の第一対応確率情報「C(e(i)|f(j))」を正規化する処理である。正規化手段127は、例えば、上述した平均場近似を用いて、正規化前の第一対応確率情報「C(e(i)|f(j))」を正規化し、第一対応確率情報「θS(e(i)|f(j))」を取得する。 Note that M-step is a process of normalizing the first correspondence probability information “C (e (i) | f (j))” before normalization. For example, the normalizing means 127 normalizes the first correspondence probability information “C (e (i) | f (j))” before normalization using the above-described mean field approximation, and the first correspondence probability information “ θS (e (i) | f (j)) ”is acquired.
 (ステップS204)確率情報算出部12は、終了条件に合致するか否かを判断する。終了条件に合致する場合は処理を終了し、終了条件に合致しない場合はステップS202に戻る。なお、終了条件とは、上述したように、例えば、予め決められたループ回数になったこと等である。 (Step S204) The probability information calculation unit 12 determines whether or not the end condition is met. If the end condition is met, the process ends. If the end condition is not met, the process returns to step S202. The end condition is, for example, that a predetermined number of loops has been reached, as described above.
 次に、ステップS202のE-stepの詳細について、図3のフローチャートを用いて説明する。 Next, details of the E-step in step S202 will be described using the flowchart of FIG.
 (ステップS301)制御手段128は、カウンタsに1を代入する。 (Step S301) The control means 128 assigns 1 to the counter s.
 (ステップS302)制御手段128は、小規模対訳データ格納部111に、s番目の対訳文が存在するか否かを判断する。s番目の対訳文が存在する場合はステップS303に行き、s番目の対訳文が存在しない場合は上位処理にリターンする。 (Step S302) The control unit 128 determines whether or not the s-th parallel translation sentence exists in the small-scale parallel translation data storage unit 111. If the s-th parallel translation sentence exists, the process goes to step S303, and if the s-th parallel translation sentence does not exist, the process returns to the upper process.
 (ステップS303)制御手段128は、カウンタiに1を代入する。 (Step S303) The control means 128 assigns 1 to the counter i.
 (ステップS304)制御手段128は、変数sumに0を代入する。 (Step S304) The control means 128 substitutes 0 for the variable sum.
 (ステップS305)制御手段128は、カウンタjに0を代入する。 (Step S305) The control means 128 assigns 0 to the counter j.
 (ステップS306)確率情報算出部12は、中間確率値「p(i|j)」を算出する。中間確率値「p(i|j)」を算出する処理の詳細について、図4のフローチャートを用いて後述する。 (Step S306) The probability information calculation unit 12 calculates an intermediate probability value “p (i | j)”. Details of the process of calculating the intermediate probability value “p (i | j)” will be described later with reference to the flowchart of FIG.
 (ステップS307)制御手段128は、変数sumに中間確率値「p(i|j)」を加算する。 (Step S307) The control means 128 adds the intermediate probability value “p (i | j)” to the variable sum.
 (ステップS308)制御手段128は、jがnと一致するか否かを判断する。jがnと一致する場合はステップS309に行き、jがnと一致しない場合はステップS316に行く。 (Step S308) The control means 128 determines whether j matches n. If j matches n, the process goes to step S309, and if j does not match n, the process goes to step S316.
 (ステップS309)制御手段128は、カウンタjに0を代入する。 (Step S309) The control means 128 substitutes 0 for the counter j.
 (ステップS310)正規化前第一対応確率情報取得手段126は、現在の正規化前第一対応確率情報に、中間確率値をsumで除算した値を加算し、新しい正規化前第一対応確率情報を取得し、バッファまたは格納部11に蓄積する。つまり、正規化前第一対応確率情報取得手段126は、演算式「C(e(i)|f(j))← C(e(i)|f(j))+ p(i|j)/sum」により、新しい正規化前の第一対応確率情報「C(e(i)|f(j))」を算出し、バッファまたは格納部11に蓄積する。 (Step S310) The first correspondence probability information acquisition unit 126 before normalization adds a value obtained by dividing the intermediate probability value by sum to the current first correspondence probability information before normalization, and creates a new first correspondence probability before normalization. Information is acquired and stored in the buffer or storage unit 11. That is, the first correspondence probability information obtaining unit 126 before normalization 126 calculates the arithmetic expression “C (e (i) | f (j)) ← C (e (i) | f (j)) + p (i | j) The first correspondence probability information “C (e (i) | f (j))” before new normalization is calculated by “/ sum” and accumulated in the buffer or storage unit 11.
 (ステップS311)制御手段128は、jがnと一致するか否かを判断する。jがnと一致する場合はステップS312に行き、jがnと一致しない場合はステップS315に行く。 (Step S311) The control means 128 determines whether j matches n. If j matches n, the process goes to step S312, and if j does not match n, the process goes to step S315.
 (ステップS312)制御手段128は、iがmと一致するか否かを判断する。iがmと一致する場合はステップS313に行き、iがmと一致しない場合はステップS314に行く。 (Step S312) The control means 128 determines whether i matches m. If i matches m, the process goes to step S313. If i does not match m, the process goes to step S314.
 (ステップS313)制御手段128は、カウンタsを1、インクリメントし、ステップS302に戻る。 (Step S313) The control means 128 increments the counter s by 1, and returns to step S302.
 (ステップS314)制御手段128は、カウンタiを1、インクリメントし、ステップS304に戻る。 (Step S314) The control means 128 increments the counter i by 1, and returns to step S304.
 (ステップS315)制御手段128は、カウンタjを1、インクリメントし、ステップS310に戻る。 (Step S315) The control means 128 increments the counter j by 1, and returns to step S310.
 (ステップS316)制御手段128は、カウンタjを1、インクリメントし、ステップS306に戻る。 (Step S316) The control means 128 increments the counter j by 1, and returns to step S306.
 次に、ステップS306の中間確率値「p(i|j)」を算出する処理の詳細について、図4のフローチャートを用いて説明する。 Next, details of the process of calculating the intermediate probability value “p (i | j)” in step S306 will be described using the flowchart of FIG.
 (ステップS401)前回第一対応確率情報取得手段121は、s番目の対訳文における、第一言語(e)のi番目の単語e(i)と第二言語(f)のj番目の単語f(j)とに対応する初期値の第一対応確率情報「θS(e(i)|f(j))」または前回のループにおいて算出した第一対応確率情報「θS(e(i)|f(j))」を取得する。 (Step S401) The previous first correspondence probability information acquisition unit 121 includes the i-th word e (i) in the first language (e) and the j-th word f in the second language (f) in the s-th parallel translation. The first correspondence probability information “θS (e (i) | f (j))” of the initial value corresponding to (j) and the first correspondence probability information “θS (e (i) | f” calculated in the previous loop (j)) ”.
 (ステップS402)第二対応確率情報取得手段122は、s番目の対訳文における、第一言語(e)のi番目の単語e(i)と第二言語(f)のj番目の単語f(j)とに対応する第二対応確率情報「θB(e(i)|f(j))」を、大規模単語アライメントモデル格納部113から取得する。 (Step S402) The second correspondence probability information acquisition means 122 in the sth parallel translation sentence, the i-th word e (i) of the first language (e) and the j-th word f (2) of the second language (f) Second correspondence probability information “θB (e (i) | f (j))” corresponding to j) is acquired from the large-scale word alignment model storage unit 113.
 (ステップS403)対訳文単語位置情報取得手段123は、s番目の対訳文における、第一言語(e)のi番目の単語e(i)と第二言語(f)のj番目の単語f(j)の対訳文単語位置情報(i,j,m,n)を取得する。 (Step S403) The translated text word position information acquisition unit 123 in the sth translated text includes the i-th word e (i) of the first language (e) and the j-th word f (2) of the second language (f). The bilingual sentence word position information (i, j, m, n) of j) is acquired.
 (ステップS404)対訳文単語位置確率情報取得手段124は、対訳文単語位置情報取得手段123が取得した対訳文単語位置情報(i,j,m,n)に対応する対訳文単語位置確率情報「δS(j|i,m,n)」を、対訳文単語位置確率情報格納部114から取得する。 (Step S404) The translated sentence word position probability information acquiring unit 124 converts the translated sentence word position probability information “i, j, m, n” corresponding to the translated sentence word position information (i, j, m, n) acquired by the translated sentence word position information acquiring unit 123. “δS (j | i, m, n)” is acquired from the bilingual sentence word position probability information storage unit 114.
 (ステップS405)中間確率値算出手段125は、前回第一対応確率情報取得手段121が取得した第一対応確率情報「θS(e(i)|f(j))」と、第二対応確率情報取得手段122が取得した第二対応確率情報「θB(e(i)|f(j))」と、対訳文単語位置確率情報取得手段124が取得した対訳文単語位置確率情報「δS(j|i,m,n)」とを用いて、中間確率値を算出し、上位処理にリターンする。 (Step S405) The intermediate probability value calculation means 125 includes the first correspondence probability information “θS (e (i) | f (j))” acquired by the first correspondence probability information acquisition means 121 last time and the second correspondence probability information. The second correspondence probability information “θB (e (i) | f (j))” acquired by the acquisition unit 122 and the parallel sentence word position probability information “δS (j |) acquired by the parallel sentence word position probability information acquisition unit 124. i, m, n) "is used to calculate the intermediate probability value and the process returns to the upper process.
 以下、本実施の形態における単語アライメントモデル構築装置1の具体的な動作について説明する。 Hereinafter, a specific operation of the word alignment model construction apparatus 1 in the present embodiment will be described.
 単語アライメントモデル構築装置1は、例えば、図5に示すプログラムのように動作する。また、ここで述べる具体例は、非特許文献5で述べられた方法を拡張したものである。 The word alignment model construction device 1 operates like a program shown in FIG. The specific example described here is an extension of the method described in Non-Patent Document 5.
 また、単語アライメントモデル構築装置1のポイントは、図5の501である。501は、確率情報算出部12の処理である。 Also, the point of the word alignment model construction device 1 is 501 in FIG. Reference numeral 501 denotes processing by the probability information calculation unit 12.
 501は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、大規模単語アライメントモデルが有する一の単語対と対になる第二対応確率情報と、対訳文の中における一の単語対に対応する対訳文単語位置確率情報とを用いて、2回以上ループを繰り返して、一の単語対と対になる第一対応確率情報を算出する処理である。 501 is an initial value or first correspondence probability information calculated in the previous loop for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence, Using the second correspondence probability information paired with one word pair included in the word alignment model and the parallel translation word position probability information corresponding to one word pair in the parallel translation sentence, the loop is repeated two or more times. This is a process of calculating first correspondence probability information paired with one word pair.
 さらに具体的には、501は、中間確率値算出手段125が行う処理であり、前回第一対応確率情報取得手段121が取得した第一対応確率情報と第二対応確率情報取得手段122が取得した第二対応確率情報とを予め決められた割合で加算し、加算した結果と、対訳文単語位置確率情報取得手段124が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する処理である。すなわち、単語アライメントモデル構築装置1においては、大規模データDataBから求めた確率θB(e(i)|f(j))を利用して、p(i|j)を求めることにより、大規模データの確率を「θS(e(i)|f(j))」の推定に利用している。 More specifically, reference numeral 501 denotes a process performed by the intermediate probability value calculating unit 125, which is acquired by the first corresponding probability information acquired by the first corresponding probability information acquiring unit 121 and the second corresponding probability information acquiring unit 122. The second correspondence probability information is added at a predetermined ratio, and the addition result is multiplied by the translated sentence word position probability information acquisition unit 124 to calculate an intermediate probability value. It is processing. That is, the word alignment model construction apparatus 1 uses the probability θB (e (i) | f (j)) obtained from the large-scale data DataB to obtain p (i | j), thereby obtaining the large-scale data. Is used to estimate “θS (e (i) | f (j))”.
 以上、本実施の形態によれば、小規模対訳コーパスの単語アライメントを精度よく実行できる。 As described above, according to the present embodiment, word alignment of a small-scale parallel corpus can be executed with high accuracy.
 さらに具体的には、本実施の形態によれば、従来技術(D)と同様に、θSを推定するときの計算量は、DataSを利用するときと同程度でありながら、DataBで推定された確率を利用可能である。また、本実施の形態によれば、従来技術(D)とは異なり、上記定数λ「0<λ<1」(図5の501)を適切に設定することにより、小規模データDataSにおける特徴的な単語アライメントが打ち消されることを防ぐことが可能となる。 More specifically, according to the present embodiment, as in the prior art (D), the amount of calculation when estimating θS is similar to that when using DataS, but estimated with DataB. Probability is available. Further, according to the present embodiment, unlike the conventional technique (D), the constant λ “0 <λ <1” (501 in FIG. 5) is appropriately set, so that the characteristic in the small-scale data DataS is obtained. This makes it possible to prevent the word alignment from being canceled.
 なお、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをCD-ROMなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における単語アライメントモデル構築装置1を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータがアクセス可能な記録媒体は、第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値(N1)未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、第二の閾値(N2,N2>N1)以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、1以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部とを具備し、コンピュータを、前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、2回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、単語対ごとに、前記確率情報算出部が最終的に算出した第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部として機能させるためのプログラムである。 Note that the processing in the present embodiment may be realized by software. Then, this software may be distributed by software download or the like. Further, this software may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to other embodiments in this specification. In addition, the software which implement | achieves the word alignment model construction apparatus 1 in this Embodiment is the following programs. That is, in this program, the computer-accessible recording medium is a first language sentence that is a first language sentence that is the first language and a second language sentence that is a second language sentence that is the second language. A small-scale bilingual data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translations less than the first threshold (N1), and the small-scale parallel translation data Alignment model of acquired words, a word pair having a first word that is a first language word and a second word that is a second language word, and the first word and the second word correspond to each other A small-scale word alignment model storage unit capable of storing a small-scale word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability of performing, and a second threshold (N2, N2> 1) A word alignment model acquired from large-scale parallel translation data, which is large-scale parallel translation data having the above-mentioned number of parallel translation sentences, a word pair having a first word and a second word, and the first word A large-scale word alignment model storage unit storing a large-scale word alignment model having a plurality of word alignment data having second correspondence probability information that is correspondence probability information regarding the probability that the second word corresponds to, Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence , Parallel translation A bilingual sentence word position probability information storage unit capable of storing bilingual sentence word position probability information related to the probability of matching with the word position information, and a computer for each word pair included in the bilingual sentence included in the small-scale bilingual data The first correspondence probability information calculated in the initial value or the previous loop with respect to one word pair, the second correspondence probability information paired with the one word pair included in the large-scale word alignment model, Probability of calculating first correspondence probability information paired with the one word pair by repeating the loop two or more times using the translated sentence word position probability information corresponding to the one word pair in the parallel translation sentence For each word pair, the first correspondence probability information finally calculated by the probability information calculation unit is associated with the word pair and stored in the small word alignment model storage unit. This is a program for functioning as a probability information storage unit.
 また、上記プログラムにおいて、前記確率情報算出部は、前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する前回第一対応確率情報取得手段と、前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する第二対応確率情報を、前記大規模単語アライメントモデル格納部から取得する第二対応確率情報取得手段と、前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、前記対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、前記対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段と、前記前回第一対応確率情報取得手段が取得した第一対応確率情報と前記第二第一対応確率情報取得手段が取得した第二対応確率情報とを予め決められた割合で加算し、当該加算した結果と、前記対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する中間確率値算出手段と、単語対ごとに、前記中間確率値算出手段が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する正規化前第一対応確率情報取得手段と、単語対ごとに、前記正規化前第一対応確率情報取得手段が取得した正規化前の第一対応確率情報に対して、正規化の処理を行い、第一対応確率情報を取得する正規化手段と、終了条件を満たすまで、前記前回第一対応確率情報取得手段、前記第二対応確率情報取得手段、前記対訳文単語位置情報取得手段、前記対訳文単語位置確率情報取得手段、前記中間確率値算出手段、前記正規化前第一対応確率情報取得手段、および前記正規化手段の処理を繰り返して行わせる制御手段とを具備するものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, the probability information calculation unit converts the word pair into one word pair for each parallel sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. The first corresponding probability information acquisition means for acquiring the first corresponding probability information of the corresponding initial value or the first corresponding probability information calculated in the previous loop, and for each parallel sentence included in the small-scale parallel translation data, Second correspondence probability information acquisition means for acquiring second correspondence probability information corresponding to the one word pair from the large-scale word alignment model storage unit for each word pair included in the sentence; The first word position indicating the position of the first word in the first language sentence with respect to one word pair for each parallel sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence , A second word position corresponding to the first word and indicating a position in the second language sentence, a first sentence word number that is the number of words in the first language sentence, and a second language sentence A bilingual sentence word position information acquisition unit that acquires bilingual sentence word position information having the number of second sentence words that is the number of words, and a bilingual sentence word position information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquisition unit The translation sentence word position probability information acquisition means for acquiring the position probability information from the parallel translation word position probability information storage unit, the first correspondence probability information acquired by the previous first correspondence probability information acquisition means, and the second first The second correspondence probability information acquired by the correspondence probability information acquisition means is added at a predetermined ratio, and the result of the addition and the parallel sentence word position probability information acquired by the parallel sentence word position probability information acquisition means Multiply the intermediate probability value First correspondence probability information before normalization that obtains first correspondence probability information before normalization using the intermediate probability value calculated by the intermediate probability value calculation means for each word pair For each word pair, the normalization processing is performed on the first correspondence probability information before normalization acquired by the acquisition means and the first correspondence probability information acquisition means before normalization, and the first correspondence probability information is acquired. Normalization means, and until the end condition is satisfied, the previous first correspondence probability information acquisition means, the second correspondence probability information acquisition means, the parallel translation word position information acquisition means, the parallel translation word position probability information acquisition means, A program that causes a computer to function as the intermediate probability value calculating means, the first correspondence probability information acquiring means before normalization, and a control means that repeatedly performs the processing of the normalizing means. It is.
 また、上記プログラムにおいて、前記対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、前記大規模対訳データを用いて取得された対訳文単語位置確率情報であるものとして、コンピュータを機能させるプログラムであることは好適である。 In the above program, the parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit is the parallel sentence word position probability information acquired using the large-scale parallel translation data. It is preferable that the program functions a computer.
 また、上記プログラムにおいて、コンピュータを、前記大規模対訳データが有する各単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、前記大規模対訳データが有する各単語対に対して、前記対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、前記対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段としてさらに機能させ、前記対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、前記対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報であるものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, for each word pair possessed by the large-scale parallel translation data, the computer has a first word position indicating a position of the first word in the first language sentence, and a first word position corresponding to the first word. A second word position that is a position of two words and indicates a position in the second language sentence, a first sentence word number that is the number of words in the first language sentence, and a second sentence word that is the number of words in the second language sentence Bilingual sentence word position information acquisition means for acquiring bilingual sentence word position information having a number, and bilingual sentence word position information acquired by the bilingual sentence word position information acquisition means for each word pair included in the large-scale parallel translation data Is further stored as a translated sentence word position probability information acquisition unit for acquiring the translated sentence word position probability information corresponding to the translated sentence word position probability information storage unit, and is stored in the translated sentence word position probability information storage unit. Translated sentence word position probability information as the a translated sentence word position probability information translated sentence word position probability information acquisition means has acquired, it is preferably a program for causing a computer to function.
 (実施の形態2)
 本実施の形態において、実施の形態1で構成された単語アライメントモデルを用いた機械翻訳装置について説明する。
(Embodiment 2)
In this embodiment, a machine translation apparatus using the word alignment model configured in Embodiment 1 will be described.
 図6は、本実施の形態における機械翻訳装置2のブロック図である。機械翻訳装置2は、例えば、第二言語文を翻訳し、第一言語文を得る装置である。 FIG. 6 is a block diagram of the machine translation apparatus 2 in the present embodiment. The machine translation device 2 is a device that translates a second language sentence and obtains a first language sentence, for example.
 機械翻訳装置2は、小規模単語アライメントモデル格納部112、対訳文単語位置確率情報格納部114、受付部21、および翻訳部22を備える。 The machine translation device 2 includes a small-scale word alignment model storage unit 112, a parallel translation word position probability information storage unit 114, a reception unit 21, and a translation unit 22.
 受付部21は、第二言語文を受け付ける。ここで、受け付けとは、キーボードやマウス、タッチパネルなどの入力デバイスから入力された情報の受け付け、音声のマイクによる受け付け、有線もしくは無線の通信回線を介して送信された情報の受信、光ディスクや磁気ディスク、半導体メモリなどの記録媒体から読み出された情報の受け付けなどを含む概念である。 The reception unit 21 receives a second language sentence. Here, reception means reception of information input from an input device such as a keyboard, mouse, or touch panel, reception by a voice microphone, reception of information transmitted via a wired or wireless communication line, an optical disk or a magnetic disk The concept includes reception of information read from a recording medium such as a semiconductor memory.
 第二言語文の入力手段は、キーボードやマウスやメニュー画面によるもの等、何でも良い。受付部21は、キーボード等の入力手段のデバイスドライバーや、メニュー画面の制御ソフトウェア等で実現され得る。 The second language sentence input means may be anything such as a keyboard, mouse or menu screen. The accepting unit 21 can be realized by a device driver for input means such as a keyboard, control software for a menu screen, or the like.
 翻訳部22は、小規模単語アライメントモデル格納部112に格納されている小規模単語アライメントモデル、および対訳文単語位置確率情報格納部114に格納されている1以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、受付部21が受け付けた第二言語文から第一言語文を取得する。なお、翻訳部22は公知技術であるので詳細な説明は省略する。 The translation unit 22 translates each small word alignment model stored in the small word alignment model storage unit 112 and one or more parallel sentence word position information stored in the parallel sentence word position probability information storage unit 114. Using the sentence word position probability information, the first language sentence is acquired from the second language sentence received by the receiving unit 21. Since the translation unit 22 is a known technique, a detailed description thereof is omitted.
 翻訳部22は、通常、MPUやメモリ等から実現され得る。翻訳部22の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはROM等の記録媒体に記録されている。但し、ハードウェア(専用回路)で実現しても良い。 The translation unit 22 can usually be realized by an MPU, a memory, or the like. The processing procedure of the translation unit 22 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).
 以下、本実施の形態における機械翻訳装置2の具体的な動作について説明する。 Hereinafter, a specific operation of the machine translation apparatus 2 in the present embodiment will be described.
 対訳データDataXについて、その中の対訳文を<e,f>とし、e = e(1) e(2) … e(m), f= f(1) f(2) … f(n)のように、m単語とn単語からなるとする。ここで、eは第一言語文、fは第二言語文である、とする。また、e(i)はeのi番目の単語であり、f(j)はfのj番目の単語である。なお、特別な単語NULLとして f(0)を導入する。これは、e中の単語がf中の単語のいずれにも対応しない場合に有用である。 For bilingual data DataX, the bilingual sentence in it is <e, f>, and e = e (1) e (2)… e (m), f = f (1) f (2)… f (n) Thus, it is assumed that it consists of m words and n words. Here, e is a first language sentence and f is a second language sentence. E (i) is the i-th word of e, and f (j) is the j-th word of f. Note that f (0) is introduced as the special word NULL. This is useful when the word in e does not correspond to any of the words in f.
 次に、DataXにおけるfを条件とするeの確率を「PX(e|f)」とする。 Next, the probability of e on condition that f in DataX is “PX (e | f)”.
 また、DataXにおいて、eの単語数が m、fの単語数がnのときに、文eのi番目にある単語が、文fのj番目の単語とアライメントされる確率を「δX(j|i,m,n)」とする。 In DataX, when the number of words in e is m and the number of words in f is n, the probability that the i-th word in sentence e is aligned with the j-th word in sentence f is expressed as “δX (j | i, m, n) ".
 また、DataXにおいて、e(i)がf(j)にアライメントされる確率を「θX(e(i)|f(j))」とする。 In DataX, the probability that e (i) is aligned with f (j) is “θX (e (i) | f (j))”.
 このとき、以下の数式2が成り立つ。 At this time, the following formula 2 holds.
Figure JPOXMLDOC01-appb-M000002
Figure JPOXMLDOC01-appb-M000002
 また、これらの確率、δX、θXは、データXから上記非特許文献5の手法を利用して推定可能である。なお、PXについては、δXとθXから一意的に計算可能である。 Further, these probabilities, δX, θX can be estimated from the data X using the method of Non-Patent Document 5 described above. Note that PX can be uniquely calculated from ΔX and θX.
 ここで、まず、δSの推定法を述べる。それは次のものである。 Here, first, the estimation method of δS will be described. It is:
Figure JPOXMLDOC01-appb-M000003
Figure JPOXMLDOC01-appb-M000003
 すなわち、δSとしてはδBと同じ確率を用いる。その理由は、この確率の推定方法として上記非特許文献5で述べられている方法を、100文程度の対訳データに対して適用すると、この確率分布のパラメータが発散してしまい、有効な確率を推定不可能だからである。また、この確率は、単に、i, j, m, nという単語の数のみから決定可能なものであるので、対訳データが異なったとしても、同じ確率が精度よく利用できるからである。 That is, the same probability as δB is used as δS. The reason is that if the method described in Non-Patent Document 5 is applied to bilingual data of about 100 sentences as an estimation method of this probability, the parameter of this probability distribution diverges and an effective probability is obtained. This is because estimation is impossible. Further, this probability can be determined only from the number of words i, j, m, and n, and therefore the same probability can be used with high accuracy even if the parallel translation data is different.
 次に、θSの推定方法は、図5のプログラムである。これは、上記非特許文献5と同様にEM法に基づくものである。 Next, the method of estimating θS is the program shown in FIG. This is based on the EM method as in Non-Patent Document 5.
 そして、PS(e|f)が、「δS(j|i,m,n)」「「θS(e(i)|f(j))」より算出される。 PS (e | f) is calculated from “δS (j | i, m, n)” and “θS (e (i) | f (j))”.
 そして、PS(e|f)の確率値が最も大きい第一言語文eが第二言語文fの翻訳結果である。 The first language sentence e having the largest probability value of PS (e | f) is the translation result of the second language sentence f.
 以上、本実施の形態によれば、小規模対訳コーパスの単語アライメントを精度よく実行できる結果、精度の良い翻訳結果が得られる。 As described above, according to the present embodiment, the word alignment of the small-scale bilingual corpus can be executed with high accuracy, and a highly accurate translation result can be obtained.
 さらに、本実施の形態における情報処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムにおいて、コンピュータがアクセス可能な記録媒体は、単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部とを具備し、コンピュータを、第二言語文を受け付ける受付部と、前記小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および前記対訳文単語位置確率情報格納部に格納されている1以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、前記受付部が受け付けた第二言語文から第一言語文を取得する翻訳部として機能させるプログラムである。 Furthermore, the software that realizes the information processing apparatus in the present embodiment is the following program. That is, in this program, the computer-accessible recording medium includes a small-scale word alignment model storage unit included in the word alignment model construction device and a bilingual word position probability information storage unit included in the word alignment model construction device. A computer that accepts a second language sentence, a small word alignment model stored in the small word alignment model storage section, and one or more stored in the parallel sentence word position probability information storage section This is a program that functions as a translation unit that acquires a first language sentence from a second language sentence accepted by the accepting unit using bilingual sentence word position probability information for each bilingual sentence word position information.
 また、図7は、本明細書で述べたプログラムを実行して、上述した種々の実施の形態の単語アライメントモデル構築装置等を実現するコンピュータの外観を示す。上述の実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムで実現され得る。図7は、このコンピュータシステム300の概観図であり、図8は、システム300のブロック図である。 FIG. 7 shows the external appearance of a computer that executes the program described in this specification to realize the word alignment model construction apparatus and the like of the various embodiments described above. The above-described embodiments can be realized by computer hardware and a computer program executed thereon. FIG. 7 is an overview diagram of the computer system 300, and FIG. 8 is a block diagram of the system 300.
 図7において、コンピュータシステム300は、CD-ROMドライブ3012を含むコンピュータ301と、キーボード302と、マウス303と、モニタ304とを含む。 7, the computer system 300 includes a computer 301 including a CD-ROM drive 3012, a keyboard 302, a mouse 303, and a monitor 304.
 図8において、コンピュータ301は、CD-ROMドライブ3012に加えて、MPU3013と、MPU3013、CD-ROMドライブ3012に接続されたバス3014と、ブートアッププログラム等のプログラムを記憶するためのROM3015と、MPU3013に接続され、アプリケーションプログラムの命令を一時的に記憶するとともに一時記憶空間を提供するためのRAM3016と、アプリケーションプログラム、システムプログラム、及びデータを記憶するためのハードディスク3017とを含む。ここでは、図示しないが、コンピュータ301は、さらに、LANへの接続を提供するネットワークカードを含んでも良い。 In FIG. 8, in addition to the CD-ROM drive 3012, the computer 301 includes an MPU 3013, a bus 3014 connected to the MPU 3013 and the CD-ROM drive 3012, a ROM 3015 for storing programs such as a boot-up program, and an MPU 3013. And a RAM 3016 for temporarily storing instructions of the application program and providing a temporary storage space, and a hard disk 3017 for storing the application program, the system program, and data. Although not shown here, the computer 301 may further include a network card that provides connection to a LAN.
 コンピュータシステム300に、上述した実施の形態の単語アライメントモデル構築装置等の機能を実行させるプログラムは、CD-ROM3101に記憶されて、CD-ROMドライブ3012に挿入され、さらにハードディスク3017に転送されても良い。これに代えて、プログラムは、図示しないネットワークを介してコンピュータ301に送信され、ハードディスク3017に記憶されても良い。プログラムは実行の際にRAM3016にロードされる。プログラムは、CD-ROM3101またはネットワークから直接、ロードされても良い。 A program that causes the computer system 300 to execute the functions of the word alignment model construction apparatus of the above-described embodiment is stored in the CD-ROM 3101, inserted into the CD-ROM drive 3012, and further transferred to the hard disk 3017. good. Alternatively, the program may be transmitted to the computer 301 via a network (not shown) and stored in the hard disk 3017. The program is loaded into the RAM 3016 at the time of execution. The program may be loaded directly from the CD-ROM 3101 or the network.
 プログラムは、コンピュータ301に、上述した実施の形態の単語アライメントモデル構築装置等の機能を実行させるオペレーティングシステム(OS)、またはサードパーティープログラム等は、必ずしも含まなくても良い。プログラムは、制御された態様で適切な機能(モジュール)を呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいれば良い。コンピュータシステム300がどのように動作するかは周知であり、詳細な説明は省略する。 The program does not necessarily include an operating system (OS), a third-party program, or the like that causes the computer 301 to execute functions such as the word alignment model construction device of the above-described embodiment. The program only needs to include an instruction portion that calls an appropriate function (module) in a controlled manner and obtains a desired result. How the computer system 300 operates is well known and will not be described in detail.
 なお、上記プログラムにおいて、情報を送信するステップや、情報を受信するステップなどでは、ハードウェアによって行われる処理、例えば、送信ステップにおけるモデムやインターフェースカードなどで行われる処理(ハードウェアでしか行われない処理)は含まれない。 In the above program, in the step of transmitting information, the step of receiving information, etc., processing performed by hardware, for example, processing performed by a modem or an interface card in the transmission step (only performed by hardware) Processing) is not included.
 また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.
 また、上記各実施の形態において、一の装置に存在する2以上の通信手段は、物理的に一の媒体で実現されても良いことは言うまでもない。 Further, in each of the above embodiments, it goes without saying that two or more communication means existing in one apparatus may be physically realized by one medium.
 また、上記各実施の形態において、各処理は、単一の装置によって集中処理されることによって実現されてもよく、あるいは、複数の装置によって分散処理されることによって実現されてもよい。 In each of the above embodiments, each process may be realized by centralized processing by a single device, or may be realized by distributed processing by a plurality of devices.
 本発明は、以上の実施の形態に限定されることなく、種々の変更が可能であり、それらも本発明の範囲内に包含されるものであることは言うまでもない。 The present invention is not limited to the above embodiment, and various modifications are possible, and it goes without saying that these are also included in the scope of the present invention.
 以上のように、本発明にかかる単語アライメントモデル構築装置は、小規模対訳コーパスの単語アライメントを精度よく実行できるという効果を有し、単語アライメントモデル構築装置等として有用である。 As described above, the word alignment model construction apparatus according to the present invention has an effect that word alignment of a small-scale parallel corpus can be executed with high accuracy, and is useful as a word alignment model construction apparatus or the like.
 1 単語アライメントモデル構築装置
 2 機械翻訳装置
 11 格納部
 12 確率情報算出部
 13 対応確率情報蓄積部
 21 受付部
 22 翻訳部
 111 小規模対訳データ格納部
 112 小規模単語アライメントモデル格納部
 113 大規模単語アライメントモデル格納部
 114 対訳文単語位置確率情報格納部
 121 前回第一対応確率情報取得手段
 122 第二対応確率情報取得手段
 123 対訳文単語位置情報取得手段
 124 対訳文単語位置確率情報取得手段
 125 中間確率値算出手段
 126 正規化前第一対応確率情報取得手段
 127 正規化手段
 128 制御手段
DESCRIPTION OF SYMBOLS 1 Word alignment model construction apparatus 2 Machine translation apparatus 11 Storage part 12 Probability information calculation part 13 Corresponding probability information storage part 21 Reception part 22 Translation part 111 Small parallel translation data storage part 112 Small word alignment model storage part 113 Large scale word alignment Model storage unit 114 Parallel sentence word position probability information storage part 121 Previous first correspondence probability information acquisition means 122 Second correspondence probability information acquisition means 123 Parallel sentence word position information acquisition means 124 Parallel sentence word position probability information acquisition means 125 Intermediate probability value Calculation means 126 First correspondence probability information acquisition means 127 before normalization 127 Normalization means 128 Control means

Claims (6)

  1. 第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値(N1)未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、
    前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、
    第二の閾値(N2,N2>N1)以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、
    1以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部と、
    前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、2回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、
    単語対ごとに、前記確率情報算出部が最終的に算出した第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部とを具備する単語アライメントモデル構築装置。
    It is a pair of a first language sentence that is a sentence of the first language that is the first language and a second language sentence that is a sentence of the second language that is the second language, and is less than the first threshold (N1) A small-scale parallel translation data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translation sentences;
    An alignment model of words acquired from the small-scale parallel translation data, a word pair having a first word that is a word in a first language and a second word that is a word in a second language, the first word, and the A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability that the second word corresponds;
    An alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data having a number of parallel translations equal to or greater than a second threshold (N2, N2> N1), and a first word and a second word A large-scale word alignment model having a plurality of word alignment data having a word pair and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other is stored A large-scale word alignment model storage;
    Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence A bilingual word position probability information storage unit that can store bilingual word position probability information related to the probability of matching the bilingual word position information;
    For each word pair included in the bilingual sentence included in the small-scale parallel translation data, the first correspondence probability information calculated in the initial value or the previous loop for one word pair, and the large-scale word alignment model has Using the second correspondence probability information paired with one word pair and the parallel translation word position probability information corresponding to the one word pair in the parallel translation sentence, the loop is repeated twice or more to A probability information calculation unit for calculating first correspondence probability information paired with the word pair;
    A correspondence probability information storage unit that stores first correspondence probability information finally calculated by the probability information calculation unit for each word pair in the small word alignment model storage unit in association with the word pair; Word alignment model construction device.
  2. 前記確率情報算出部は、
    前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する前回第一対応確率情報取得手段と、
    前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する第二対応確率情報を、前記大規模単語アライメントモデル格納部から取得する第二対応確率情報取得手段と、
    前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、
    前記対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、前記対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段と、
    前記前回第一対応確率情報取得手段が取得した第一対応確率情報と前記第二対応確率情報取得手段が取得した第二対応確率情報とを予め決められた割合で加算し、当該加算した結果と、前記対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する中間確率値算出手段と、
    単語対ごとに、前記中間確率値算出手段が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する正規化前第一対応確率情報取得手段と、
    単語対ごとに、前記正規化前第一対応確率情報取得手段が取得した正規化前の第一対応確率情報に対して、正規化の処理を行い、第一対応確率情報を取得する正規化手段と、
    終了条件を満たすまで、前記前回第一対応確率情報取得手段、前記第二対応確率情報取得手段、前記対訳文単語位置情報取得手段、前記対訳文単語位置確率情報取得手段、前記中間確率値算出手段、前記正規化前第一対応確率情報取得手段、および前記正規化手段の処理を繰り返して行わせる制御手段とを具備する請求項1記載の単語アライメントモデル構築装置。
    The probability information calculation unit includes:
    The first correspondence probability information of the initial value corresponding to the one word pair or the previous correspondence information for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence Previous first correspondence probability information obtaining means for obtaining first correspondence probability information calculated in the loop;
    Second correspondence probability information corresponding to the one word pair for each word pair included in the parallel translation sentence and for each word pair included in the parallel translation sentence, the large-scale word Second correspondence probability information obtaining means for obtaining from the alignment model storage unit;
    A first word position indicating a position of the first word in the first language sentence for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence; The second word position corresponding to the first word and indicating the position in the second language sentence, the first sentence word number that is the number of words in the first language sentence, and the second language sentence Bilingual sentence word position information acquisition means for acquiring bilingual sentence word position information having the number of second sentence words that is the number of words;
    Parallel sentence word position probability information acquisition means for acquiring parallel sentence word position probability information corresponding to the parallel sentence word position information acquired by the parallel sentence word position information acquisition means from the parallel sentence word position probability information storage unit;
    The first correspondence probability information acquired by the previous first correspondence probability information acquisition means and the second correspondence probability information acquired by the second correspondence probability information acquisition means are added at a predetermined ratio, and the addition result An intermediate probability value calculating means for calculating an intermediate probability value by multiplying the parallel sentence word position probability information acquired by the parallel sentence word position probability information acquiring means;
    For each word pair, using the intermediate probability value calculated by the intermediate probability value calculating means, the first correspondence probability information acquisition means before normalization for acquiring the first correspondence probability information before normalization;
    For each word pair, normalization means for obtaining first correspondence probability information by performing normalization processing on the first correspondence probability information before normalization obtained by the first correspondence probability information obtaining means before normalization When,
    Until the end condition is satisfied, the previous first correspondence probability information acquisition means, the second correspondence probability information acquisition means, the parallel translation word position information acquisition means, the parallel translation word position probability information acquisition means, the intermediate probability value calculation means The word alignment model construction apparatus according to claim 1, further comprising: a first correspondence probability information acquisition unit before normalization, and a control unit that repeatedly performs the processing of the normalization unit.
  3. 請求項1または請求項2いずれか一項に記載の単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、
    請求項1から請求項4いずれか一項に記載の単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部と、
    第二言語文を受け付ける受付部と、
    前記小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および前記対訳文単語位置確率情報格納部に格納されている1以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、前記受付部が受け付けた第二言語文から第一言語文を取得する翻訳部とを具備する機械翻訳装置。
    A small-scale word alignment model storage unit included in the word alignment model construction device according to claim 1,
    A bilingual word position probability information storage unit included in the word alignment model construction device according to any one of claims 1 to 4,
    A reception unit for receiving a second language sentence;
    The small word alignment model stored in the small word alignment model storage unit and the bilingual word position probability information for each one or more parallel word position information stored in the parallel word position probability information storage unit A machine translation apparatus comprising: a translation unit that acquires a first language sentence from a second language sentence received by the reception unit.
  4. 記録媒体は、
    第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値(N1)未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、
    前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、
    第二の閾値(N2,N2>N1)以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、
    1以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部とを具備し、
    確率情報算出部、対応確率情報蓄積部により実現される単語アライメントモデルの生産方法であって、
    前記確率情報算出部が、前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、2回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出ステップと、
    前記対応確率情報蓄積部が、単語アライメントモデルの生産方法単語対ごとに、前記確率情報算出ステップで最終的に算出された第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積ステップとを具備する単語アライメントモデルの生産方法。
    The recording medium is
    It is a pair of a first language sentence that is a sentence of the first language that is the first language and a second language sentence that is a sentence of the second language that is the second language, and is less than the first threshold (N1) A small-scale parallel translation data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translation sentences;
    An alignment model of words acquired from the small-scale parallel translation data, a word pair having a first word that is a word in a first language and a second word that is a word in a second language, the first word, and the A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability that the second word corresponds;
    An alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data having a number of parallel translations equal to or greater than a second threshold (N2, N2> N1), and a first word and a second word A large-scale word alignment model having a plurality of word alignment data having a word pair and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other is stored A large-scale word alignment model storage;
    Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence A bilingual sentence word position probability information storage unit capable of storing bilingual sentence word position probability information related to the probability of matching with the bilingual sentence word position information,
    A method for producing a word alignment model realized by a probability information calculation unit and a corresponding probability information storage unit,
    The probability information calculation unit calculates, for each word pair included in the parallel translation sentence included in the small-scale parallel translation data, the first correspondence probability information calculated in the initial value or the previous loop for one word pair; Two or more times using the second correspondence probability information paired with the one word pair of the scale word alignment model and the parallel sentence word position probability information corresponding to the one word pair in the parallel translation sentence Probability information calculation step of calculating first correspondence probability information paired with the one word pair by repeating a loop;
    The correspondence probability information accumulating unit associates the first correspondence probability information finally calculated in the probability information calculation step with the word pair for each production method word pair of the word alignment model. A method for producing a word alignment model, comprising: a corresponding probability information accumulating step for accumulating in an alignment model storage unit.
  5. コンピュータがアクセス可能な記録媒体は、
    第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値(N1)未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、
    前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、
    第二の閾値(N2,N2>N1)以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、
    1以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部とを具備し、
    コンピュータを、
    前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、2回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、
    単語対ごとに、前記確率情報算出部が最終的に算出した第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部として機能させるためのプログラムを記録した記録媒体。
    Computer-accessible recording media
    It is a pair of a first language sentence that is a sentence of the first language that is the first language and a second language sentence that is a sentence of the second language that is the second language, and is less than the first threshold (N1) A small-scale parallel translation data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translation sentences;
    An alignment model of words acquired from the small-scale parallel translation data, a word pair having a first word that is a word in a first language and a second word that is a word in a second language, the first word, and the A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability that the second word corresponds;
    An alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data having a number of parallel translations equal to or greater than a second threshold (N2, N2> N1), and a first word and a second word A large-scale word alignment model having a plurality of word alignment data having a word pair and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other is stored A large-scale word alignment model storage;
    Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence A bilingual sentence word position probability information storage unit capable of storing bilingual sentence word position probability information related to the probability of matching with the bilingual sentence word position information,
    Computer
    For each word pair included in the bilingual sentence included in the small-scale parallel translation data, the first correspondence probability information calculated in the initial value or the previous loop for one word pair, and the large-scale word alignment model has Using the second correspondence probability information paired with one word pair and the parallel translation word position probability information corresponding to the one word pair in the parallel translation sentence, the loop is repeated twice or more to A probability information calculation unit for calculating first correspondence probability information paired with the word pair;
    For each word pair, the first correspondence probability information finally calculated by the probability information calculation unit is made to correspond to the word pair and function as a correspondence probability information accumulation unit that accumulates in the small-scale word alignment model storage unit. A recording medium on which a program is recorded.
  6. コンピュータがアクセス可能な記録媒体は、
    請求項1または請求項2に記載の単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、
    請求項1または請求項2に記載の単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部とを具備し、
    コンピュータを、
    第二言語文を受け付ける受付部と、
    前記小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および前記対訳文単語位置確率情報格納部に格納されている1以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、前記受付部が受け付けた第二言語文から第一言語文を取得する翻訳部として機能させるためのプログラムを記録した記録媒体。
    Computer-accessible recording media
    A small-scale word alignment model storage unit included in the word alignment model construction device according to claim 1 or 2,
    A bilingual word position probability information storage unit included in the word alignment model construction device according to claim 1 or 2;
    Computer
    A reception unit for receiving a second language sentence;
    The small word alignment model stored in the small word alignment model storage unit and the bilingual word position probability information for each one or more parallel word position information stored in the parallel word position probability information storage unit The recording medium which recorded the program for functioning as a translation part which acquires a 1st language sentence from the 2nd language sentence which the said reception part received using.
PCT/JP2016/075886 2015-09-04 2016-09-02 Word alignment model construction apparatus, machine translation apparatus, word alignment model production method, and recording medium WO2017038996A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2015174465A JP6687935B2 (en) 2015-09-04 2015-09-04 Word alignment model construction device, machine translation device, word alignment model production method, machine translation method, and program
JP2015-174465 2015-09-04

Publications (1)

Publication Number Publication Date
WO2017038996A1 true WO2017038996A1 (en) 2017-03-09

Family

ID=58187757

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2016/075886 WO2017038996A1 (en) 2015-09-04 2016-09-02 Word alignment model construction apparatus, machine translation apparatus, word alignment model production method, and recording medium

Country Status (2)

Country Link
JP (1) JP6687935B2 (en)
WO (1) WO2017038996A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526727B (en) * 2017-07-31 2021-01-19 苏州大学 Language generation method based on statistical machine translation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009064051A (en) * 2007-09-04 2009-03-26 National Institute Of Information & Communication Technology Information processor, information processing method and program
JP2010122982A (en) * 2008-11-20 2010-06-03 Nec Corp System, method and program for analyzing language, and system, method and program for translating by machine

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009064051A (en) * 2007-09-04 2009-03-26 National Institute Of Information & Communication Technology Information processor, information processing method and program
JP2010122982A (en) * 2008-11-20 2010-06-03 Nec Corp System, method and program for analyzing language, and system, method and program for translating by machine

Also Published As

Publication number Publication date
JP2017049917A (en) 2017-03-09
JP6687935B2 (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN107729322B (en) Word segmentation method and device and sentence vector generation model establishment method and device
Oda et al. Learning to generate pseudo-code from source code using statistical machine translation
US10832012B2 (en) Method executed in translation system and including generation of translated text and generation of parallel translation data
JP5484317B2 (en) Large-scale language model in machine translation
CN107357772A (en) List filling method, device and computer equipment
WO2019060353A1 (en) System and method for translating chat messages
US9536518B2 (en) Unsupervised training method, training apparatus, and training program for an N-gram language model based upon recognition reliability
US20140095143A1 (en) Transliteration pair matching
CN110795913B (en) Text encoding method, device, storage medium and terminal
CN110678868B (en) Translation support system, translation support apparatus, translation support method, and computer-readable medium
WO2015096529A1 (en) Universal machine translation engine-oriented individualized translation method and device
CN107391495B (en) Sentence alignment method of bilingual parallel corpus
CN102567306B (en) Acquisition method and acquisition system for similarity of vocabularies between different languages
US20160132491A1 (en) Bilingual phrase learning apparatus, statistical machine translation apparatus, bilingual phrase learning method, and storage medium
Denkowski et al. Real time adaptive machine translation for post-editing with cdec and TransCenter
CN112214574A (en) Context aware sentence compression
CN105229625A (en) Obtain the mixing Hash scheme of effective HMM
CN114141384A (en) Method, apparatus and medium for retrieving medical data
WO2017145811A1 (en) Topic assessment apparatus, topic assessment method, and recording medium
WO2017038996A1 (en) Word alignment model construction apparatus, machine translation apparatus, word alignment model production method, and recording medium
JP2012185622A (en) Bilingual phrase learning device, phrase-based statistical machine translation device, bilingual phrase learning method and bilingual phrase production method
Chahuneau et al. pycdec: A Python Interface to cdec.
JP5710551B2 (en) Machine translation result evaluation apparatus, translation parameter optimization apparatus, method, and program
JP5295037B2 (en) Learning device using Conditional Random Fields or Global Conditional Log-linearModels, and parameter learning method and program in the learning device
JP5428199B2 (en) Parallel translation extraction apparatus and parallel translation extraction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16842025

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16842025

Country of ref document: EP

Kind code of ref document: A1