WO2017038996A1

WO2017038996A1 - Word alignment model construction apparatus, machine translation apparatus, word alignment model production method, and recording medium

Info

Publication number: WO2017038996A1
Application number: PCT/JP2016/075886
Authority: WO
Inventors: 将夫内山
Original assignee: 国立研究開発法人情報通信研究機構
Priority date: 2015-09-04
Filing date: 2016-09-02
Publication date: 2017-03-09
Also published as: JP2017049917A; JP6687935B2

Abstract

[Problem] It has hitherto been impossible to accurately perform word alignment of a small-size parallel text corpus. [Solution] This word alignment model construction apparatus is provided with: a probability information calculation unit that, for each word pair included in a parallel text sentence included in small-size parallel text data, repeats a loop twice or more for one word pair to calculate first correspondence probability information that forms a pair with the one word pair, by using an initial value or the first correspondence probability information calculated in a previous loop, second correspondence probability information that forms a pair with one word pair included in a large-size word alignment model, and parallel text sentence word position probability information corresponding to one word pair in a parallel text sentence; and a correspondence probability information accumulation unit that accumulates, in association with each word pair, the first correspondence probability information calculated finally by the probability information calculation unit, in a small-size word alignment model storage unit. Thus, word alignment of the small-size parallel text corpus can be accurately performed.

Description

単語アライメントモデル構築装置、機械翻訳装置、単語アライメントモデルの生産方法、および記録媒体Word alignment model construction device, machine translation device, word alignment model production method, and recording medium

　本発明は、単語アライメントモデルを構築する単語アライメントモデル構築装置等に関するものである。 The present invention relates to a word alignment model construction apparatus for constructing a word alignment model.

　統計的機械翻訳（SMT）では、対訳データDataBから、翻訳モデルModelBを作成する。そして、ModelBを利用して、入力文を翻訳する。このとき、入力文がDataBと同様な分野の文である場合には、その翻訳結果は高精度であることが期待できる。けれども、入力文がDataBとは異なる分野の文であるときには、その翻訳精度は低下する。 Statistic machine translation (SMT) creates a translation model ModelB from parallel translation data DataB. And the input sentence is translated using ModelB. At this time, if the input sentence is a sentence in the same field as DataB, the translation result can be expected to be highly accurate. However, when the input sentence is a sentence in a field different from DataB, the translation accuracy decreases.

　この対策として、DataBと異なる分野の対訳データDataSを利用することにより、翻訳モデルModelSを作成し、ModelBとModelSの双方を利用することにより、DataSと同分野の文の翻訳精度の高精度化を達成可能である。 As a countermeasure, the translation model ModelS is created by using bilingual data DataS in a different field from DataB, and the translation accuracy of sentences in the same field as DataS is improved by using both ModelB and ModelS. Achievable.

　このModelBとModelSの双方を利用する方法としては、非特許文献１等に記載されている。なお、この方法をSMTの分野適応と呼ぶ。 The method using both ModelB and ModelS is described in Non-Patent Document 1 and the like. This method is called SMT field adaptation.

　分野適応のときの問題点としては、DataBが大規模（10-1000万文程度）であるのに対して、DataSが小規模(100文程度のこともある)であることがある。こうした場合には、DataBからModelBを作成するのは、問題なく高精度に行えるが、DataSからModelSを作成するのを高精度に行うのは困難である。 The problem when adapting to the field is that DataB is large (about 10 to 10 million sentences), whereas DataS is small (sometimes about 100 sentences). In such a case, ModelB can be created from DataB with high accuracy without problems, but it is difficult to create ModelS from DataS with high accuracy.

　その理由としては、ModelSを作る過程としては、通常、次のステップがとられるからである。
ステップ１．DataSを単語アライメントする
ステップ２．単語アライメントの結果から ModelSを構築する。 The reason is that the following steps are usually taken in the process of creating ModelS.
Step 1. 1. Word alignment of DataS Build ModelS from the result of word alignment.

　上記ステップ１の単語アライメントにおいて、オープンソースツールである GIZA++（非特許文献２参照)や fast_align（非特許文献３参照）が利用されることが多い。しかし、これらのツールは、小規模の対訳データにおける単語アライメントは精度よく実行できない。単語 In the word alignment in step 1 above, open source tools such as GIZA ++ (see Non-Patent Document 2) and fast_align (see Non-Patent Document 3) are often used. However, these tools cannot accurately perform word alignment in small-scale parallel translation data.

　上記の問題設定において、小規模対訳データDataSを単語アライメントする従来技術には、以下の（Ａ）～（Ｄ）の４つがある。
（Ａ）DataBから単語アライメントのための単語アライメントモデルAlignBを構築し、AlignBを利用してDataSをアライメントする。
（Ｂ）DataSから単語アライメントモデルAlignSを構築し、AlignSを利用してDataSをアライメントする。
（Ｃ）DataBとDataSを一つにまとめて、そこから単語アライメントモデル AlignBSを構築し、AlignBSを用いて DataSをアライメントする。
（Ｄ）DataBから単語アライメントモデル AlignBを構築する。そして、AlignBを初期モデルとして、 DataSから AlignBSを構築し、AlignBSを利用して DataSをアライメントする（非特許文献４参照）。ただし、AlignBを初期モデルとすることは、単語アライメントの十分統計量のみを AlignBから抽出して、その十分統計量を、DataSを利用して更新することをいう。 In the above problem setting, there are the following four (A) to (D) as conventional techniques for word alignment of small-scale parallel translation data DataS.
(A) A word alignment model AlignB for word alignment is constructed from DataB, and DataS is aligned using AlignB.
(B) A word alignment model AlignS is constructed from DataS, and DataS is aligned using AlignS.
(C) Combine DataB and DataS, build a word alignment model AlignBS from them, and align DataS using AlignBS.
(D) Construct a word alignment model AlignB from DataB. Then, AlignBS is constructed from DataS using AlignB as an initial model, and DataS is aligned using AlignBS (see Non-Patent Document 4). However, using AlignB as an initial model means extracting only sufficient statistics for word alignment from AlignB and updating the sufficient statistics using DataS.

　しかしながら、従来技術においては、小規模対訳コーパスの単語アライメントを精度よく実行できなかった。 However, in the prior art, word alignment of a small-scale parallel corpus could not be executed with high accuracy.

　さらに具体的には、（Ａ）～（Ｄ）の４つ従来技術は、各々、以下の課題を有する。
（Ａ）は、AlignBを利用して DataSを単語アライメントするのであるから、そもそもDataBとDataSが異なる以上、単語アライメント精度は低い。
（Ｂ）は、DataSが数万文以上あるときには効果的であるが、100文程度のときにはAlignSの精度が低くなるので、単語アライメント精度は低い。
（Ｃ）について、DataSとDataBの大きさが極端に違うときには、DataSで特徴的な単語対などが出現したとしても、それはDataBの多数派に打ち消されることが考えられる。たとえば、DataBが電気電子関係の多量の対訳文であり、そのため、「potential」は「電位」とアライメントすることが多い場合に、DataSとして情報関係の少量の対訳文があって、「potential」が「潜在的」とアライメントするとしても、それが大量の対訳文により打ち消される可能性がある。また、DataBとDataSを統合してから単語アライメントをするので、たとえ DataSが小さくても、DataBが大規模なため単語アライメントに多大の時間がかかる。
（Ｄ）について、出来上がった単語アライメントモデルAlignBSは、（Ｃ）のモデルと同等なものであるので上記（Ｃ）と同じ問題がある。（Ｄ）の利点は、AlignBを初期値として利用することにより、DataSを処理するだけで（Ｃ）と同等なものができる点である。 More specifically, each of the four conventional techniques (A) to (D) has the following problems.
In (A), DataS is word-aligned using AlignB, so the word alignment accuracy is low as long as DataB and DataS are different from each other.
(B) is effective when DataS is tens of thousands of sentences or more, but the accuracy of AlignS is low when there are about 100 sentences, so the word alignment accuracy is low.
Regarding (C), when the sizes of DataS and DataB are extremely different, even if a characteristic word pair appears in DataS, it can be countered by the majority of DataB. For example, if DataB is a large amount of bilingual texts related to electric and electronic, and "potential" is often aligned with "potential", there is a small amount of bilingual text related to data as DataS, and "potential" Even if it is aligned with "potential", it can be canceled by a large amount of translations. In addition, since word alignment is performed after integrating DataB and DataS, even if DataS is small, since DataB is large, word alignment takes a lot of time.
Regarding (D), the completed word alignment model AlignBS is equivalent to the model (C), and therefore has the same problem as (C) above. The advantage of (D) is that, by using AlignB as an initial value, the equivalent of (C) can be achieved by simply processing DataS.

　本発明においては、例えば、（Ｄ）と同様に DataSを処理するだけで単語アライメントモデルを構築できるが、（Ｄ）とは異なり、DataSに特有な単語アライメントも打ち消されない単語アライメントモデルを構築することを目的とする。 In the present invention, for example, a word alignment model can be constructed just by processing DataS as in (D), but unlike (D), a word alignment model that does not cancel word alignment unique to DataS is constructed. For the purpose.

　本第一の発明の単語アライメントモデル構築装置は、第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値（Ｎ１）未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、第一単語と第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、第二の閾値（Ｎ２，Ｎ２＞Ｎ１）以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、第一単語と第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、１以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部と、小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、大規模単語アライメントモデルが有する一の単語対と対になる第二対応確率情報と、対訳文の中における一の単語対に対応する対訳文単語位置確率情報とを用いて、２回以上ループを繰り返して、一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、単語対ごとに、確率情報算出部が最終的に算出した第一対応確率情報を、単語対に対応付けて、小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部とを具備する単語アライメントモデル構築装置である。 The word alignment model construction device according to the first aspect of the present invention includes a first language sentence that is a sentence of a first language that is a first language and a second language sentence that is a sentence of a second language that is a second language. A small-scale parallel translation data storage unit that can store small-scale parallel translation data that is a pair and is a small-scale parallel translation data having a number of parallel translation sentences less than the first threshold (N1), and is acquired from the small-scale parallel translation data Correspondence probability relating to the probability that a word pair having a first word that is a word in the first language and a second word that is a word in the second language and the probability that the first word and the second word correspond to each other are word alignment models A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information as information, and a number equal to or greater than a second threshold (N2, N2> N1) Have a translation of This is an alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data, and the correspondence between the word pair having the first word and the second word and the probability that the first word corresponds to the second word Information obtained from a large-scale word alignment model storage unit storing a large-scale word alignment model having a plurality of word alignment data having second correspondence probability information that is probability information and one or more parallel translations The first word position indicating the position of the first word in the first language sentence, the second word position indicating the position in the second language sentence that is the position of the second word corresponding to the first word, the first For each translation sentence word position information having the number of first sentence words that is the number of words in the language sentence and the number of second sentence words that is the number of words in the second language sentence, the translation related to the probability of matching the translation sentence word position information The word position probability information storage unit that can store the word position probability information and the word pair possessed by the parallel sentence included in the small-scale parallel data are calculated for one word pair in the initial value or the previous loop. First correspondence probability information, second correspondence probability information paired with one word pair of the large-scale word alignment model, and bilingual word position probability information corresponding to one word pair in the bilingual sentence. Using the probability information calculation unit that calculates the first correspondence probability information that is paired with one word pair by repeating the loop twice or more, and the first that the probability information calculation unit finally calculates for each word pair A word alignment model construction apparatus comprising a correspondence probability information storage unit that stores correspondence probability information in association with a word pair in a small word alignment model storage unit.

　かかる構成により、小規模対訳コーパスの単語アライメントを精度よく実行できる。 With such a configuration, word alignment of a small-scale parallel corpus can be executed with high accuracy.

　また、本第二の発明の単語アライメントモデル構築装置は、第一の発明に対して、確率情報算出部は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する前回第一対応確率情報取得手段と、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する第二対応確率情報を、大規模単語アライメントモデル格納部から取得する第二対応確率情報取得手段と、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段と、前回第一対応確率情報取得手段が取得した第一対応確率情報と第二対応確率情報取得手段が取得した第二対応確率情報とを予め決められた割合で加算し、加算した結果と、対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する中間確率値算出手段と、単語対ごとに、中間確率値算出手段が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する正規化前第一対応確率情報取得手段と、単語対ごとに、正規化前第一対応確率情報取得手段が取得した正規化前の第一対応確率情報に対して、正規化の処理を行い、第一対応確率情報を取得する正規化手段と、終了条件を満たすまで、前回第一対応確率情報取得手段、第二対応確率情報取得手段、対訳文単語位置情報取得手段、対訳文単語位置確率情報取得手段、中間確率値算出手段、正規化前第一対応確率情報取得手段、および正規化手段の処理を繰り返して行わせる制御手段とを具備する単語アライメントモデル構築装置である。 In addition, the word alignment model construction device according to the second aspect of the invention provides the probability information calculation unit for each bilingual sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence, relative to the first invention. The first correspondence probability information acquisition means for obtaining the first correspondence probability information of the initial value corresponding to one word pair or the first correspondence probability information calculated in the previous loop for one word pair, and a small scale Second correspondence probability information corresponding to one word pair is acquired from the large-scale word alignment model storage unit for each word pair included in the parallel translation data and for each word pair included in the parallel translation sentence. Second correspondence probability information acquisition means, and for each bilingual sentence included in the small-scale parallel translation data and for each word pair included in the bilingual sentence, for each word pair, the first word in the first language sentence First word position indicating position The second word position corresponding to the first word and indicating the position in the second language sentence, the number of first sentence words that is the number of words in the first language sentence, and the second language sentence Bilingual sentence word position information acquisition means for acquiring parallel sentence word position information having the number of second sentence words that is the number of words, and a parallel sentence word position corresponding to the parallel sentence word position information acquired by the parallel sentence word position information acquisition means The bilingual sentence word position probability information acquiring means for acquiring the probability information from the parallel sentence word position probability information storage unit, the first corresponding probability information and the second corresponding probability information acquiring means acquired by the first corresponding probability information acquiring means last time The acquired second correspondence probability information is added at a predetermined ratio, and the result of addition is multiplied by the parallel sentence word position probability information acquired by the parallel sentence word position probability information acquisition means to calculate an intermediate probability value. Intermediate probability value calculating means For each word pair, using the intermediate probability value calculated by the intermediate probability value calculation means, the first correspondence probability information acquisition means before normalization for acquiring the first correspondence probability information before normalization, and the normal correspondence for each word pair Normalization processing is performed on the first correspondence probability information before normalization acquired by the first correspondence probability information acquisition means before normalization and the first correspondence probability information is acquired, and until the end condition is satisfied , Previous first correspondence probability information acquisition means, second correspondence probability information acquisition means, parallel translation word position information acquisition means, parallel translation word position probability information acquisition means, intermediate probability value calculation means, pre-normalization first correspondence probability information acquisition And a word alignment model construction apparatus comprising control means for repeatedly performing processing of normalization means.

　また、本第三の発明の単語アライメントモデル構築装置は、第一または第二の発明に対して、対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、大規模対訳データを用いて取得された対訳文単語位置確率情報である単語アライメントモデル構築装置である。 The word alignment model construction apparatus according to the third aspect of the present invention provides a parallel translation word position probability information stored in the parallel translation word position probability information storage unit, which is a large-scale parallel translation. It is the word alignment model construction apparatus which is the bilingual sentence word position probability information acquired using data.

　かかる構成により、小規模対訳コーパスの単語アライメントをより精度よく実行できる。 With such a configuration, word alignment of a small-scale parallel corpus can be executed with higher accuracy.

　また、本第四の発明の単語アライメントモデル構築装置は、第一の発明に対して、大規模対訳データが有する各単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、大規模対訳データが有する各単語対に対して、対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段とをさらに具備し、対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報である単語アライメントモデル構築装置である。 The word alignment model construction device of the fourth aspect of the present invention indicates the position of the first word in the first language sentence for each word pair included in the large-scale parallel translation data, relative to the first aspect. A first word position, a second word position corresponding to the first word and indicating a position in the second language sentence, a first sentence word number that is the number of words in the first language sentence, and Bilingual word position information acquisition means for acquiring bilingual word position information having the number of second sentence words that is the number of words in the bilingual sentence, and bilingual word position information for each word pair included in the large-scale parallel data The bilingual sentence word position probability information acquiring means for acquiring the bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the acquiring means from the bilingual sentence word position probability information storage unit, Stored in the probability information storage Translated sentence word position probability information is word alignment model building device is the translation word position probability information translated sentence word position probability information acquisition means has acquired.

　また、本第五の発明の機械翻訳装置は、第一から第四いずれか１つの発明に対して、単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部と、第二言語文を受け付ける受付部と、小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および対訳文単語位置確率情報格納部に格納されている１以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、受付部が受け付けた第二言語文から第一言語文を取得する翻訳部とを具備する機械翻訳装置である。 The machine translation device according to the fifth aspect of the invention has a small word alignment model storage unit and a word alignment model construction device that the word alignment model construction device has, with respect to any one of the first to fourth inventions. Stored in the bilingual sentence word position probability information storage unit, the second language sentence receiving unit, the small word alignment model stored in the small word alignment model storage unit, and the bilingual word position probability information storage unit A translation unit comprising: a translation unit that acquires a first language sentence from a second language sentence received by a reception unit using bilingual sentence word position probability information for each of one or more parallel translation word position information .

　かかる構成により、小規模対訳コーパスの単語アライメントを精度よく実行できる結果、精度の良い翻訳結果が得られる。 With such a configuration, the word alignment of the small-scale bilingual corpus can be executed with high accuracy, and as a result, accurate translation results can be obtained.

　本発明による単語アライメントモデル構築装置によれば、小規模対訳コーパスの単語アライメントを精度よく実行できる。 The word alignment model construction apparatus according to the present invention can perform word alignment of a small-scale parallel corpus with high accuracy.

実施の形態１における単語アライメントモデル構築装置１のブロック図Block diagram of word alignment model construction apparatus 1 according to Embodiment 1 同単語アライメントモデル構築装置１の動作について説明するフローチャートThe flowchart explaining operation | movement of the word alignment model construction apparatus 1 同Ｅ－ｓｔｅｐの詳細について説明するフローチャートFlow chart explaining details of the E-step 同中間確率値を算出する処理の詳細について説明するフローチャートThe flowchart explaining the detail of the process which calculates the intermediate probability value 同単語アライメントモデル構築装置１の動作を示す図The figure which shows operation | movement of the word alignment model construction apparatus 1 実施の形態２における機械翻訳装置２のブロック図Block diagram of machine translation apparatus 2 according to Embodiment 2 上記実施の形態におけるコンピュータシステムの概観図Overview of the computer system in the above embodiment 同コンピュータシステムのブロック図Block diagram of the computer system

　以下、単語アライメントモデル構築装置等の実施形態について図面を参照して説明する。なお、実施の形態において同じ符号を付した構成要素は同様の動作を行うので、再度の説明を省略する場合がある。 Hereinafter, embodiments of the word alignment model construction device and the like will be described with reference to the drawings. In addition, since the component which attached | subjected the same code | symbol in embodiment performs the same operation | movement, description may be abbreviate | omitted again.

　（実施の形態１）
　本実施の形態において、２つの言語の単語アライメントモデルを作成する装置であって、大規模データの単語アライメントの確率と小規模データの単語アライメントの確率との両方を用いて、最終的な小規模データの単語アライメントの確率の情報を取得する単語アライメントモデル構築装置について説明する。 (Embodiment 1)
In the present embodiment, an apparatus for creating a word alignment model in two languages, using both the word alignment probability of large-scale data and the word alignment probability of small-scale data, the final small-scale A word alignment model construction apparatus that acquires information on the probability of word alignment of data will be described.

　図１は、本実施の形態における単語アライメントモデル構築装置１のブロック図である。 FIG. 1 is a block diagram of the word alignment model construction apparatus 1 in the present embodiment.

　単語アライメントモデル構築装置１は、格納部１１、確率情報算出部１２、および対応確率情報蓄積部１３を備える。 The word alignment model construction device 1 includes a storage unit 11, a probability information calculation unit 12, and a corresponding probability information storage unit 13.

　格納部１１は、小規模対訳データ格納部１１１、小規模単語アライメントモデル格納部１１２、大規模単語アライメントモデル格納部１１３、および対訳文単語位置確率情報格納部１１４を備える。 The storage unit 11 includes a small bilingual data storage unit 111, a small word alignment model storage unit 112, a large word alignment model storage unit 113, and a bilingual word position probability information storage unit 114.

　確率情報算出部１２は、前回第一対応確率情報取得手段１２１、第二対応確率情報取得手段１２２、対訳文単語位置情報取得手段１２３、対訳文単語位置確率情報取得手段１２４、中間確率値算出手段１２５、正規化前第一対応確率情報取得手段１２６、正規化手段１２７、および制御手段１２８を備える。 The probability information calculation unit 12 includes a previous first correspondence probability information acquisition unit 121, a second correspondence probability information acquisition unit 122, a bilingual sentence word position information acquisition unit 123, a bilingual sentence word position probability information acquisition unit 124, and an intermediate probability value calculation unit. 125, first pre-normalization probability information acquisition means 126, normalization means 127, and control means 128.

　格納部１１は、各種の情報を格納し得る。各種の情報とは、例えば、後述する小規模対訳データ、小規模単語アライメントモデル、大規模単語アライメントモデル、対訳文単語位置確率等である。 The storage unit 11 can store various types of information. The various types of information include, for example, small-scale parallel translation data, small-scale word alignment model, large-scale word alignment model, parallel sentence word position probability, and the like, which will be described later.

　小規模対訳データ格納部１１１は、小規模対訳データを格納し得る。小規模対訳データは、１以上の対訳文を有する。対訳文は、第一言語文と第二言語文との対である。第一言語文は、第一の言語である第一言語の文である。第二言語文は、第二の言語である第二言語の文である。第二言語文は、第一言語文を第二言語に翻訳した結果である。小規模対訳データは、少ない数の対訳文を有する。小規模対訳データは、通常、第一の閾値（Ｎ１）未満の数の対訳文を有する。小規模対訳データは、例えば、１０～１０万程度の数の対訳文を有する。 The small parallel translation data storage unit 111 can store small parallel translation data. The small-scale parallel translation data has one or more parallel translation sentences. A bilingual sentence is a pair of a first language sentence and a second language sentence. The first language sentence is a sentence in the first language that is the first language. The second language sentence is a sentence of the second language that is the second language. The second language sentence is a result of translating the first language sentence into the second language. Small-scale parallel translation data has a small number of parallel translation sentences. The small-scale parallel translation data usually has a number of parallel translation sentences less than the first threshold (N1). The small-scale parallel translation data has, for example, about 100,000 to 100,000 parallel translations.

　また、第一言語と第二言語とは異なる言語であれば、どの言語でも良い。第一言語、または第二言語は、例えば、英語、日本語、中国語、フランス語、ドイツ語、スペイン語、韓国語等である。 Also, any language can be used as long as the first language and the second language are different. The first language or the second language is, for example, English, Japanese, Chinese, French, German, Spanish, Korean or the like.

　小規模単語アライメントモデル格納部１１２は、小規模単語アライメントモデルを格納し得る。小規模単語アライメントモデルは、小規模対訳データを用いて取得される単語のアライメントモデルである。小規模単語アライメントモデルは、複数の単語アライメントデータを有する。単語アライメントデータは、単語対と、第一対応確率情報とを有する。単語対は、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する。第一対応確率情報は、第一単語と第二単語とが対応する確率に関する情報である対応確率情報である。 The small word alignment model storage unit 112 can store a small word alignment model. The small word alignment model is a word alignment model acquired using small parallel translation data. The small-scale word alignment model has a plurality of word alignment data. The word alignment data includes a word pair and first correspondence probability information. The word pair has a first word that is a word in the first language and a second word that is a word in the second language. The first correspondence probability information is correspondence probability information that is information regarding the probability that the first word corresponds to the second word.

　大規模単語アライメントモデル格納部１１３は、大規模単語アライメントモデルを格納している。大規模単語アライメントモデルは、大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルである。単語のアライメントモデルは、第一単語と第二単語とを有する単語対と、第一単語と第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する。大規模対訳データは、通常、第二の閾値（Ｎ２，Ｎ２＞Ｎ１）以上の数の対訳文を有する。通常、Ｎ２は、Ｎ１と比較して、１桁以上大きい（１０倍以上である）。なお、Ｎ１、Ｎ２は自然数である。 The large-scale word alignment model storage unit 113 stores a large-scale word alignment model. The large-scale word alignment model is a word alignment model acquired from large-scale parallel translation data that is large-scale parallel translation data. The word alignment model includes a word pair having a first word and a second word, and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other. Large-scale parallel translation data usually has a number of parallel translations equal to or greater than a second threshold (N2, N2> N1). Usually, N2 is one digit or more larger (10 times or more) than N1. N1 and N2 are natural numbers.

　対訳文単語位置確率情報格納部１１４は、１以上の対訳文単語位置情報ごとに、対訳文単語位置確率情報を格納し得る。対訳文単語位置情報は、１以上の対訳文から取得された情報であり、第一単語位置、第二単語位置、第一文単語数、および第二文単語数を有する。第一単語位置は、第一言語文の中における第一単語の位置を示す情報である。第二単語位置は、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す情報である。第一文単語数は、第一言語文の単語数である。第二文単語数は、第二言語文の単語数である。対訳文単語位置確率情報は、対訳文単語位置情報に合致する確率に関する情報である。通常、対訳文単語位置確率情報は、対訳文単語位置情報に合致する確率である。 The parallel translation word position probability information storage unit 114 can store parallel translation word position probability information for each of one or more parallel translation word position information. The bilingual sentence word position information is information acquired from one or more parallel translation sentences, and has a first word position, a second word position, a first sentence word number, and a second sentence word number. The first word position is information indicating the position of the first word in the first language sentence. The second word position is the position of the second word corresponding to the first word and is information indicating the position in the second language sentence. The first sentence word count is the number of words in the first language sentence. The number of second sentence words is the number of words in the second language sentence. The bilingual sentence word position probability information is information regarding the probability of matching with the bilingual sentence word position information. Normally, the bilingual sentence word position probability information is a probability of matching with the bilingual sentence word position information.

　対訳文単語位置確率情報格納部１１４は、大規模対訳データが有する１以上の対訳文から取得された対訳文単語位置情報と対訳文単語位置確率情報との組を、２以上、格納していることは好適である。 The bilingual sentence word position probability information storage unit 114 stores two or more pairs of bilingual sentence word position information and bilingual sentence word position probability information acquired from one or more parallel translation sentences included in the large-scale parallel translation data. That is preferred.

　対訳文単語位置確率情報格納部１１４に格納されている対訳文単語位置情報と対訳文単語位置確率情報との組は、大規模対訳データを用いて、後述する対訳文単語位置情報取得手段１２３、対訳文単語位置確率情報取得手段１２４が取得した情報であることは好適である。 A pair of the parallel sentence word position information and the parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit 114 uses a large-scale parallel translation data, and a parallel sentence word position information acquisition unit 123 to be described later. The information acquired by the bilingual sentence word position probability information acquisition unit 124 is suitable.

　対訳文単語位置確率情報格納部１１４は、大規模対訳データと小規模対訳データとが有する１以上の対訳文から取得された対訳文単語位置情報と対訳文単語位置確率情報との組を、２以上、格納していても良い。 The bilingual sentence word position probability information storage unit 114 stores two pairs of bilingual sentence word position information and bilingual sentence word position probability information acquired from one or more parallel translation sentences included in the large-scale bilingual data and the small-scale bilingual data. As described above, it may be stored.

　対訳文単語位置確率情報格納部１１４に格納されている対訳文単語位置情報と対訳文単語位置確率情報との組は、大規模対訳データと小規模対訳データとを用いて、後述する対訳文単語位置情報取得手段１２３、対訳文単語位置確率情報取得手段１２４が取得した情報であっても良い。 A pair of parallel sentence word position information and parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit 114 is a parallel sentence word to be described later using large-scale parallel translation data and small-scale parallel translation data. The information acquired by the position information acquisition unit 123 and the bilingual word position probability information acquisition unit 124 may be used.

　確率情報算出部１２は、小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、大規模単語アライメントモデルが有する一の単語対と対になる第二対応確率情報と、対訳文の中における一の単語対に対応する対訳文単語位置確率情報とを用いて、２回以上ループを繰り返して、一の単語対と対になる第一対応確率情報を算出する。 The probability information calculation unit 12 sets the first correspondence probability information calculated in the initial value or the previous loop for one word pair and the large-scale word for each word pair included in the parallel translation sentence included in the small-scale parallel translation data. Using the second correspondence probability information paired with one word pair of the alignment model and the parallel translation word position probability information corresponding to one word pair in the parallel translation sentence, the loop is repeated two or more times, First correspondence probability information to be paired with one word pair is calculated.

　前回第一対応確率情報取得手段１２１は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する。 The previous first correspondence probability information acquisition unit 121 sets the initial value corresponding to one word pair for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. First correspondence probability information or first correspondence probability information calculated in the previous loop is acquired.

　具体的には、例えば、前回第一対応確率情報取得手段１２１は、第一言語（e）のi番目の単語e(i)と第二言語（f）のj番目の単語f(j)とに対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報「θS(e(i)|f(j))」を取得する。かかる第一対応確率情報は、通常、格納部１１に少なくとも一時的に格納されている。 Specifically, for example, the first correspondence probability information acquisition unit 121 last time includes the i-th word e (i) in the first language (e) and the j-th word f (j) in the second language (f). The first correspondence probability information of the initial value corresponding to or the first correspondence probability information “θS (e (i) | f (j))” calculated in the previous loop is acquired. Such first correspondence probability information is usually stored at least temporarily in the storage unit 11.

　第二対応確率情報取得手段１２２は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、一の単語対に対応する第二対応確率情報を、大規模単語アライメントモデル格納部１１３から取得する。 The second correspondence probability information acquisition unit 122 is configured to provide a second correspondence probability corresponding to one word pair for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. Information is acquired from the large-scale word alignment model storage unit 113.

　具体的には、例えば、第二対応確率情報取得手段１２２は、第一言語（e）のi番目の単語e(i)と第二言語（f）のj番目の単語f(j)とに対応する第二対応確率情報「θB(e(i)|f(j))」を大規模単語アライメントモデル格納部１１３から取得する。 Specifically, for example, the second correspondence probability information acquisition unit 122 adds the i-th word e (i) in the first language (e) and the j-th word f (j) in the second language (f). Corresponding second correspondence probability information “θB (e (i) | f (j))” is acquired from the large-scale word alignment model storage unit 113.

　対訳文単語位置情報取得手段１２３は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する。 The bilingual sentence word position information acquisition means 123 performs the first word in the first language sentence for each bilingual sentence included in the small bilingual data and for each word pair included in the bilingual sentence. The first word position indicating the position, the second word position corresponding to the first word and indicating the position in the second language sentence, and the first sentence word number indicating the number of words in the first language sentence And translated sentence word position information having the number of second sentence words that is the number of words of the second language sentence.

　具体的には、例えば、対訳文単語位置情報取得手段１２３は、一の対訳文の中の一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置(i)、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置(j)、第一言語文の単語数である第一文単語数(m)、および第二言語文の単語数である第二文単語数(n)を、一の対訳文と一の単語対とから取得する。なお、(i) (j)は、通常、文の中の何番目の単語であるかを示す情報である。 Specifically, for example, the bilingual sentence word position information acquisition unit 123 has a first word position indicating the position of the first word in the first language sentence with respect to one word pair in one bilingual sentence. (i) the second word position corresponding to the first word and the second word position indicating the position in the second language sentence (j), the first sentence word number that is the number of words in the first language sentence ( m) and the second sentence word number (n), which is the number of words in the second language sentence, are acquired from one parallel translation sentence and one word pair. Note that (i) (j) is usually information indicating the number of words in a sentence.

　対訳文単語位置情報取得手段１２３は、大規模対訳データが有する各単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する。 The bilingual sentence word position information acquisition means 123 is a first word position indicating the position of the first word in the first language sentence for each word pair included in the large-scale parallel translation data, and a second corresponding to the first word. The second word position indicating the position of the word in the second language sentence, the first sentence word number that is the number of words in the first language sentence, and the second sentence word number that is the number of words in the second language sentence The bilingual sentence word position information which has is acquired.

　対訳文単語位置確率情報取得手段１２４は、対訳文単語位置情報取得手段１２３が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部１１４から取得する。 The bilingual sentence word position probability information acquisition unit 124 acquires bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquisition unit 123 from the bilingual sentence word position probability information storage unit 114. .

　具体的には、例えば、対訳文単語位置確率情報取得手段１２４は、対訳文単語位置情報取得手段１２３が取得した対訳文単語位置情報（i,j,m,n）に対応する対訳文単語位置確率情報「δS（j|i,m,n）」を、対訳文単語位置確率情報格納部１１４から検索する。 Specifically, for example, the translated sentence word position probability information acquiring unit 124 converts the translated sentence word position corresponding to the translated sentence word position information (i, j, m, n) acquired by the translated sentence word position information acquiring unit 123. The probability information “δS (j | i, m, n)” is searched from the bilingual sentence word position probability information storage unit 114.

　対訳文単語位置確率情報取得手段１２４は、大規模対訳データが有する各単語対に対して、対訳文単語位置情報取得手段１２３が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、対訳文単語位置確率情報格納部１１４から取得することは好適である。 The bilingual sentence word position probability information acquiring unit 124 calculates bilingual sentence word position probability information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquiring unit 123 for each word pair included in the large-scale bilingual data. It is preferable to obtain from the translated sentence word position probability information storage unit 114.

　中間確率値算出手段１２５は、前回第一対応確率情報取得手段１２１が取得した第一対応確率情報と、第二対応確率情報取得手段１２２が取得した第二対応確率情報と、対訳文単語位置確率情報取得手段１２４が取得した対訳文単語位置確率情報とを用いて、中間確率値を算出する。 The intermediate probability value calculation means 125 includes first correspondence probability information acquired by the first correspondence probability information acquisition means 121 last time, second correspondence probability information acquired by the second correspondence probability information acquisition means 122, and bilingual sentence word position probability. An intermediate probability value is calculated using the parallel translation word position probability information acquired by the information acquisition means 124.

　具体的には、例えば、中間確率値算出手段１２５は、前回第一対応確率情報取得手段１２１が取得した第一対応確率情報と第二対応確率情報取得手段１２２が取得した第二対応確率情報とを予め決められた割合で加算し、加算した結果と、対訳文単語位置確率情報取得手段１２４が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する。 Specifically, for example, the intermediate probability value calculation unit 125 includes the first correspondence probability information acquired by the first correspondence probability information acquisition unit 121 last time and the second correspondence probability information acquired by the second correspondence probability information acquisition unit 122. Are added at a predetermined ratio, and the addition result is multiplied by the bilingual sentence word position probability information acquisition unit 124 to calculate an intermediate probability value.

　さらに具体的には、例えば、中間確率値算出手段１２５は、演算式「p(i|j)=(λθB(e(i)|f(j)) +(1-λ)θS(e(i)|f(j)))*δS（j|i,m,n）」により、中間確率値p(i|j)を算出する。ここで、定数λは、「０＜λ＜１」を満たす数値であり、通常、「０．５」である。 More specifically, for example, the intermediate probability value calculation unit 125 calculates the arithmetic expression “p (i | j) = (λθB (e (i) | f (j)) + (1-λ) θS (e (i ) | f (j))) * ΔS (j | i, m, n) ”, the intermediate probability value p (i | j) is calculated. Here, the constant λ is a numerical value satisfying “0 <λ <1”, and is usually “0.5”.

　正規化前第一対応確率情報取得手段１２６は、単語対ごとに、中間確率値算出手段１２５が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する。 The first correspondence probability information acquisition unit 126 before normalization uses the intermediate probability value calculated by the intermediate probability value calculation unit 125 for each word pair to acquire first correspondence probability information before normalization.

　具体的には、例えば、正規化前第一対応確率情報取得手段１２６は、単語対ごとに、中間確率値p(i|j)を用いて、演算式「C(e(i)|f(j))+= p(i|j)/sum」により、正規化前の第一対応確率情報「C(e(i)|f(j))」を算出する。なお、「sum」は、単語対に対応する中間確率値を累積加算した情報である。 Specifically, for example, the first correspondence probability information acquisition unit 126 before normalization uses the intermediate probability value p (i | j) for each word pair, and uses the arithmetic expression “C (e (i) | f ( j)) + = p (i | j) / sum ”, the first correspondence probability information“ C (e (i) | f (j)) ”before normalization is calculated. “Sum” is information obtained by cumulatively adding intermediate probability values corresponding to word pairs.

　正規化手段１２７は、単語対ごとに、正規化前第一対応確率情報取得手段１２６が取得した正規化前の第一対応確率情報「C(e(i)|f(j))」に対して、正規化の処理を行い、第一対応確率情報を取得する。 For each word pair, the normalization means 127 applies the first correspondence probability information “C (e (i) | f (j))” before normalization acquired by the first correspondence probability information acquisition means 126 before normalization. Then, normalization processing is performed to obtain first correspondence probability information.

　具体的には、例えば、正規化手段１２７は、単語対ごとに、正規化前の第一対応確率情報「C(e(i)|f(j))」を平均場近似で正規化して、第一対応確率情報「θS(e(i)|f(j))」を取得する。なお、平均場近似は公知技術であるので、詳細な説明は省略する。 Specifically, for example, the normalizing means 127 normalizes the first correspondence probability information “C (e (i) | f (j))” before normalization for each word pair by means of mean field approximation, First correspondence probability information “θS (e (i) | f (j))” is acquired. Since the mean field approximation is a known technique, a detailed description is omitted.

　また、例えば、正規化手段１２７は、以下の数式１により、正規化前の第一対応確率情報「C(e(i)|f(j))」を正規化し、第一対応確率情報「θS(e(i)|f(j))」を取得しても良い。 Further, for example, the normalizing means 127 normalizes the first correspondence probability information “C (e (i) | f (j))” before normalization by the following formula 1, and the first correspondence probability information “θS”. (e (i) | f (j)) ”may be acquired.

　なお、数式１において、ｋは任意の単語の添え字である。すなわち、数式１の分母は、全ての単語対についての和をとることを示している。 In Equation 1, k is a subscript of an arbitrary word. That is, the denominator of Equation 1 indicates that the sum is taken for all word pairs.

　制御手段１２８は、終了条件を満たすまで、前回第一対応確率情報取得手段１２１、第二対応確率情報取得手段１２２、対訳文単語位置情報取得手段１２３、対訳文単語位置確率情報取得手段１２４、中間確率値算出手段１２５、正規化前第一対応確率情報取得手段１２６、および正規化手段１２７の処理を繰り返して行わせる。つまり、制御手段１２８は、終了条件を満たすまで、前回第一対応確率情報取得手段１２１、第二対応確率情報取得手段１２２、対訳文単語位置情報取得手段１２３、対訳文単語位置確率情報取得手段１２４、中間確率値算出手段１２５、正規化前第一対応確率情報取得手段１２６、および正規化手段１２７の処理をループさせる制御を行う。ここで、終了条件とは、例えば、予め決められたループ回数になったことである。また、予め決められたループ回数は、例えば、４回から６回のいずれかである。 Until the end condition is satisfied, the control unit 128 performs the previous first correspondence probability information acquisition unit 121, the second correspondence probability information acquisition unit 122, the bilingual sentence word position information acquisition unit 123, the bilingual sentence word position probability information acquisition unit 124, the intermediate The processes of the probability value calculating unit 125, the first correspondence probability information acquiring unit 126 before normalization, and the normalizing unit 127 are repeatedly performed. That is, the control means 128 until the first condition is satisfied, the previous first correspondence probability information acquisition means 121, the second correspondence probability information acquisition means 122, the bilingual sentence word position information acquisition means 123, and the bilingual sentence word position probability information acquisition means 124. Then, control is performed to loop the processes of the intermediate probability value calculation means 125, the first correspondence probability information acquisition means 126 before normalization, and the normalization means 127. Here, the termination condition is, for example, that a predetermined number of loops has been reached. The predetermined number of loops is, for example, any one of 4 to 6.

　対応確率情報蓄積部１３は、単語対ごとに、確率情報算出部１２が最終的に算出した第一対応確率情報を、単語対に対応付けて、小規模単語アライメントモデル格納部１１２に蓄積する。なお、確率情報算出部１２が最終的に算出した第一対応確率情報とは、制御手段１２８が終了条件を満たすとして、ループの処理を終了した場合の、最終的な第一対応確率情報である。 The correspondence probability information storage unit 13 stores the first correspondence probability information finally calculated by the probability information calculation unit 12 for each word pair in the small word alignment model storage unit 112 in association with the word pair. Note that the first correspondence probability information finally calculated by the probability information calculation unit 12 is the final first correspondence probability information when the processing of the loop is completed assuming that the control unit 128 satisfies the termination condition. .

　格納部１１を構成している小規模対訳データ格納部１１１、小規模単語アライメントモデル格納部１１２、大規模単語アライメントモデル格納部１１３、および対訳文単語位置確率情報格納部１１４は、不揮発性の記録媒体が好適であるが、揮発性の記録媒体でも実現可能である。 The small bilingual data storage unit 111, the small word alignment model storage unit 112, the large word alignment model storage unit 113, and the bilingual word position probability information storage unit 114 constituting the storage unit 11 are nonvolatile recordings. A medium is preferred, but a volatile recording medium can also be realized.

　格納部１１に情報が記憶される過程は問わない。例えば、記録媒体を介して情報が格納部１１で記憶されるようになってもよく、通信回線等を介して送信された情報が格納部１１で記憶されるようになってもよく、あるいは、入力デバイスを介して入力された情報が格納部１１で記憶されるようになってもよい。 The process of storing information in the storage unit 11 does not matter. For example, information may be stored in the storage unit 11 via a recording medium, information transmitted via a communication line or the like may be stored in the storage unit 11, or Information input via the input device may be stored in the storage unit 11.

　確率情報算出部１２を構成している前回第一対応確率情報取得手段１２１、第二対応確率情報取得手段１２２、対訳文単語位置情報取得手段１２３、対訳文単語位置確率情報取得手段１２４、中間確率値算出手段１２５、正規化前第一対応確率情報取得手段１２６、正規化手段１２７、制御手段１２８、および対応確率情報蓄積部１３は、通常、ＭＰＵやメモリ等から実現され得る。確率情報算出部１２等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 Previous correspondence first correspondence probability information acquisition means 121, second correspondence probability information acquisition means 122, parallel translation word position information acquisition means 123, parallel translation word position probability information acquisition means 124, intermediate probability constituting the probability information calculation unit 12 The value calculation unit 125, the first correspondence probability information acquisition unit 126 before normalization, the normalization unit 127, the control unit 128, and the correspondence probability information storage unit 13 can be usually realized by an MPU, a memory, or the like. The processing procedure of the probability information calculation unit 12 and the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

　次に、単語アライメントモデル構築装置１の動作について、図２のフローチャートを用いて説明する。なお、図２のフローチャートは、２以上の全単語対についての処理である。つまり、図２のフローチャートにおける、E-Stepが全単語対について処理され、次に、M-Stepが全単語対について処理される。 Next, the operation of the word alignment model construction device 1 will be described using the flowchart of FIG. Note that the flowchart of FIG. 2 is processing for all two or more word pairs. That is, E-Step in the flowchart of FIG. 2 is processed for all word pairs, and then M-Step is processed for all word pairs.

　（ステップＳ２０１）確率情報算出部１２は、初期化処理を行う。初期化処理は、例えば、各種の変数に初期値を代入する処理である。初期化処理は、例えば、正規化前の第一対応確率情報「C(e(i)|f(j))」に「０」を代入する処理である。また、初期化処理は、例えば、第一対応確率情報「θS(e(i)|f(j))」に「０」を代入する処理である。 (Step S201) The probability information calculation unit 12 performs an initialization process. The initialization process is a process of substituting initial values for various variables, for example. The initialization process is, for example, a process of substituting “0” into first correspondence probability information “C (e (i) | f (j))” before normalization. The initialization process is a process of substituting “0” into the first correspondence probability information “θS (e (i) | f (j))”, for example.

　（ステップＳ２０２）確率情報算出部１２は、Ｅ－ｓｔｅｐを行い、正規化前の第一対応確率情報「C(e(i)|f(j))」を取得する。そして、確率情報算出部１２は、各単語対に対応付けて、正規化前の第一対応確率情報「C(e(i)|f(j))」を図示しないバッファまたは格納部１１に一時蓄積する。Ｅ－ｓｔｅｐの詳細について、図３のフローチャートを用いて後述する。 (Step S202) The probability information calculation unit 12 performs E-step, and acquires first correspondence probability information “C (e (i) | f (j))” before normalization. Then, the probability information calculation unit 12 temporarily stores the first correspondence probability information “C (e (i) | f (j))” before normalization in a buffer or storage unit 11 (not shown) in association with each word pair. accumulate. Details of the E-step will be described later with reference to the flowchart of FIG.

　（ステップＳ２０３）正規化手段１２７は、Ｍ－ｓｔｅｐを行い、第一対応確率情報「θS(e(i)|f(j))」を取得する。そして、確率情報算出部１２は、各単語対に対応付けて、第一対応確率情報「θS(e(i)|f(j))」を図示しないバッファまたは格納部１１に一時蓄積する。 (Step S203) The normalizing means 127 performs M-step to obtain the first correspondence probability information “θS (e (i) | f (j))”. Then, the probability information calculation unit 12 temporarily accumulates first correspondence probability information “θS (e (i) | f (j))” in a buffer or storage unit 11 (not shown) in association with each word pair.

　なお、Ｍ－ｓｔｅｐとは、正規化前の第一対応確率情報「C(e(i)|f(j))」を正規化する処理である。正規化手段１２７は、例えば、上述した平均場近似を用いて、正規化前の第一対応確率情報「C(e(i)|f(j))」を正規化し、第一対応確率情報「θS(e(i)|f(j))」を取得する。 Note that M-step is a process of normalizing the first correspondence probability information “C (e (i) | f (j))” before normalization. For example, the normalizing means 127 normalizes the first correspondence probability information “C (e (i) | f (j))” before normalization using the above-described mean field approximation, and the first correspondence probability information “ θS (e (i) | f (j)) ”is acquired.

　（ステップＳ２０４）確率情報算出部１２は、終了条件に合致するか否かを判断する。終了条件に合致する場合は処理を終了し、終了条件に合致しない場合はステップＳ２０２に戻る。なお、終了条件とは、上述したように、例えば、予め決められたループ回数になったこと等である。 (Step S204) The probability information calculation unit 12 determines whether or not the end condition is met. If the end condition is met, the process ends. If the end condition is not met, the process returns to step S202. The end condition is, for example, that a predetermined number of loops has been reached, as described above.

　次に、ステップＳ２０２のＥ－ｓｔｅｐの詳細について、図３のフローチャートを用いて説明する。 Next, details of the E-step in step S202 will be described using the flowchart of FIG.

　（ステップＳ３０１）制御手段１２８は、カウンタｓに１を代入する。 (Step S301) The control means 128 assigns 1 to the counter s.

　（ステップＳ３０２）制御手段１２８は、小規模対訳データ格納部１１１に、ｓ番目の対訳文が存在するか否かを判断する。ｓ番目の対訳文が存在する場合はステップＳ３０３に行き、ｓ番目の対訳文が存在しない場合は上位処理にリターンする。 (Step S302) The control unit 128 determines whether or not the s-th parallel translation sentence exists in the small-scale parallel translation data storage unit 111. If the s-th parallel translation sentence exists, the process goes to step S303, and if the s-th parallel translation sentence does not exist, the process returns to the upper process.

　（ステップＳ３０３）制御手段１２８は、カウンタｉに１を代入する。 (Step S303) The control means 128 assigns 1 to the counter i.

　（ステップＳ３０４）制御手段１２８は、変数sumに０を代入する。 (Step S304) The control means 128 substitutes 0 for the variable sum.

　（ステップＳ３０５）制御手段１２８は、カウンタｊに０を代入する。 (Step S305) The control means 128 assigns 0 to the counter j.

　（ステップＳ３０６）確率情報算出部１２は、中間確率値「p(i|j)」を算出する。中間確率値「p(i|j)」を算出する処理の詳細について、図４のフローチャートを用いて後述する。 (Step S306) The probability information calculation unit 12 calculates an intermediate probability value “p (i | j)”. Details of the process of calculating the intermediate probability value “p (i | j)” will be described later with reference to the flowchart of FIG.

　（ステップＳ３０７）制御手段１２８は、変数sumに中間確率値「p(i|j)」を加算する。 (Step S307) The control means 128 adds the intermediate probability value “p (i | j)” to the variable sum.

　（ステップＳ３０８）制御手段１２８は、ｊがｎと一致するか否かを判断する。ｊがｎと一致する場合はステップＳ３０９に行き、ｊがｎと一致しない場合はステップＳ３１６に行く。 (Step S308) The control means 128 determines whether j matches n. If j matches n, the process goes to step S309, and if j does not match n, the process goes to step S316.

　（ステップＳ３０９）制御手段１２８は、カウンタｊに０を代入する。 (Step S309) The control means 128 substitutes 0 for the counter j.

　（ステップＳ３１０）正規化前第一対応確率情報取得手段１２６は、現在の正規化前第一対応確率情報に、中間確率値をsumで除算した値を加算し、新しい正規化前第一対応確率情報を取得し、バッファまたは格納部１１に蓄積する。つまり、正規化前第一対応確率情報取得手段１２６は、演算式「C(e(i)|f(j))← C(e(i)|f(j))+ p(i|j)/sum」により、新しい正規化前の第一対応確率情報「C(e(i)|f(j))」を算出し、バッファまたは格納部１１に蓄積する。 (Step S310) The first correspondence probability information acquisition unit 126 before normalization adds a value obtained by dividing the intermediate probability value by sum to the current first correspondence probability information before normalization, and creates a new first correspondence probability before normalization. Information is acquired and stored in the buffer or storage unit 11. That is, the first correspondence probability information obtaining unit 126 before normalization 126 calculates the arithmetic expression “C (e (i) | f (j)) ← C (e (i) | f (j)) + p (i | j) The first correspondence probability information “C (e (i) | f (j))” before new normalization is calculated by “/ sum” and accumulated in the buffer or storage unit 11.

　（ステップＳ３１１）制御手段１２８は、ｊがｎと一致するか否かを判断する。ｊがｎと一致する場合はステップＳ３１２に行き、ｊがｎと一致しない場合はステップＳ３１５に行く。 (Step S311) The control means 128 determines whether j matches n. If j matches n, the process goes to step S312, and if j does not match n, the process goes to step S315.

　（ステップＳ３１２）制御手段１２８は、ｉがｍと一致するか否かを判断する。ｉがｍと一致する場合はステップＳ３１３に行き、ｉがｍと一致しない場合はステップＳ３１４に行く。 (Step S312) The control means 128 determines whether i matches m. If i matches m, the process goes to step S313. If i does not match m, the process goes to step S314.

　（ステップＳ３１３）制御手段１２８は、カウンタｓを１、インクリメントし、ステップＳ３０２に戻る。 (Step S313) The control means 128 increments the counter s by 1, and returns to step S302.

　（ステップＳ３１４）制御手段１２８は、カウンタｉを１、インクリメントし、ステップＳ３０４に戻る。 (Step S314) The control means 128 increments the counter i by 1, and returns to step S304.

　（ステップＳ３１５）制御手段１２８は、カウンタｊを１、インクリメントし、ステップＳ３１０に戻る。 (Step S315) The control means 128 increments the counter j by 1, and returns to step S310.

　（ステップＳ３１６）制御手段１２８は、カウンタｊを１、インクリメントし、ステップＳ３０６に戻る。 (Step S316) The control means 128 increments the counter j by 1, and returns to step S306.

　次に、ステップＳ３０６の中間確率値「p(i|j)」を算出する処理の詳細について、図４のフローチャートを用いて説明する。 Next, details of the process of calculating the intermediate probability value “p (i | j)” in step S306 will be described using the flowchart of FIG.

　（ステップＳ４０１）前回第一対応確率情報取得手段１２１は、ｓ番目の対訳文における、第一言語（e）のi番目の単語e(i)と第二言語（f）のj番目の単語f(j)とに対応する初期値の第一対応確率情報「θS(e(i)|f(j))」または前回のループにおいて算出した第一対応確率情報「θS(e(i)|f(j))」を取得する。 (Step S401) The previous first correspondence probability information acquisition unit 121 includes the i-th word e (i) in the first language (e) and the j-th word f in the second language (f) in the s-th parallel translation. The first correspondence probability information “θS (e (i) | f (j))” of the initial value corresponding to (j) and the first correspondence probability information “θS (e (i) | f” calculated in the previous loop (j)) ”.

　（ステップＳ４０２）第二対応確率情報取得手段１２２は、ｓ番目の対訳文における、第一言語（e）のi番目の単語e(i)と第二言語（f）のj番目の単語f(j)とに対応する第二対応確率情報「θB(e(i)|f(j))」を、大規模単語アライメントモデル格納部１１３から取得する。 (Step S402) The second correspondence probability information acquisition means 122 in the sth parallel translation sentence, the i-th word e (i) of the first language (e) and the j-th word f (2) of the second language (f) Second correspondence probability information “θB (e (i) | f (j))” corresponding to j) is acquired from the large-scale word alignment model storage unit 113.

　（ステップＳ４０３）対訳文単語位置情報取得手段１２３は、ｓ番目の対訳文における、第一言語（e）のi番目の単語e(i)と第二言語（f）のj番目の単語f(j)の対訳文単語位置情報（i,j,m,n）を取得する。 (Step S403) The translated text word position information acquisition unit 123 in the sth translated text includes the i-th word e (i) of the first language (e) and the j-th word f (2) of the second language (f). The bilingual sentence word position information (i, j, m, n) of j) is acquired.

　（ステップＳ４０４）対訳文単語位置確率情報取得手段１２４は、対訳文単語位置情報取得手段１２３が取得した対訳文単語位置情報（i,j,m,n）に対応する対訳文単語位置確率情報「δS（j|i,m,n）」を、対訳文単語位置確率情報格納部１１４から取得する。 (Step S404) The translated sentence word position probability information acquiring unit 124 converts the translated sentence word position probability information “i, j, m, n” corresponding to the translated sentence word position information (i, j, m, n) acquired by the translated sentence word position information acquiring unit 123. “δS (j | i, m, n)” is acquired from the bilingual sentence word position probability information storage unit 114.

　（ステップＳ４０５）中間確率値算出手段１２５は、前回第一対応確率情報取得手段１２１が取得した第一対応確率情報「θS(e(i)|f(j))」と、第二対応確率情報取得手段１２２が取得した第二対応確率情報「θB(e(i)|f(j))」と、対訳文単語位置確率情報取得手段１２４が取得した対訳文単語位置確率情報「δS（j|i,m,n）」とを用いて、中間確率値を算出し、上位処理にリターンする。 (Step S405) The intermediate probability value calculation means 125 includes the first correspondence probability information “θS (e (i) | f (j))” acquired by the first correspondence probability information acquisition means 121 last time and the second correspondence probability information. The second correspondence probability information “θB (e (i) | f (j))” acquired by the acquisition unit 122 and the parallel sentence word position probability information “δS (j |) acquired by the parallel sentence word position probability information acquisition unit 124. i, m, n) "is used to calculate the intermediate probability value and the process returns to the upper process.

　以下、本実施の形態における単語アライメントモデル構築装置１の具体的な動作について説明する。 Hereinafter, a specific operation of the word alignment model construction apparatus 1 in the present embodiment will be described.

　単語アライメントモデル構築装置１は、例えば、図５に示すプログラムのように動作する。また、ここで述べる具体例は、非特許文献５で述べられた方法を拡張したものである。 The word alignment model construction device 1 operates like a program shown in FIG. The specific example described here is an extension of the method described in Non-Patent Document 5.

　また、単語アライメントモデル構築装置１のポイントは、図５の５０１である。５０１は、確率情報算出部１２の処理である。 Also, the point of the word alignment model construction device 1 is 501 in FIG. Reference numeral 501 denotes processing by the probability information calculation unit 12.

　５０１は、小規模対訳データが有する対訳文ごとに、かつ対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、大規模単語アライメントモデルが有する一の単語対と対になる第二対応確率情報と、対訳文の中における一の単語対に対応する対訳文単語位置確率情報とを用いて、２回以上ループを繰り返して、一の単語対と対になる第一対応確率情報を算出する処理である。 501 is an initial value or first correspondence probability information calculated in the previous loop for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence, Using the second correspondence probability information paired with one word pair included in the word alignment model and the parallel translation word position probability information corresponding to one word pair in the parallel translation sentence, the loop is repeated two or more times. This is a process of calculating first correspondence probability information paired with one word pair.

　さらに具体的には、５０１は、中間確率値算出手段１２５が行う処理であり、前回第一対応確率情報取得手段１２１が取得した第一対応確率情報と第二対応確率情報取得手段１２２が取得した第二対応確率情報とを予め決められた割合で加算し、加算した結果と、対訳文単語位置確率情報取得手段１２４が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する処理である。すなわち、単語アライメントモデル構築装置１においては、大規模データDataBから求めた確率θB(e(i)|f(j))を利用して、p(i|j)を求めることにより、大規模データの確率を「θS(e(i)|f(j))」の推定に利用している。 More specifically, reference numeral 501 denotes a process performed by the intermediate probability value calculating unit 125, which is acquired by the first corresponding probability information acquired by the first corresponding probability information acquiring unit 121 and the second corresponding probability information acquiring unit 122. The second correspondence probability information is added at a predetermined ratio, and the addition result is multiplied by the translated sentence word position probability information acquisition unit 124 to calculate an intermediate probability value. It is processing. That is, the word alignment model construction apparatus 1 uses the probability θB (e (i) | f (j)) obtained from the large-scale data DataB to obtain p (i | j), thereby obtaining the large-scale data. Is used to estimate “θS (e (i) | f (j))”.

　以上、本実施の形態によれば、小規模対訳コーパスの単語アライメントを精度よく実行できる。 As described above, according to the present embodiment, word alignment of a small-scale parallel corpus can be executed with high accuracy.

　さらに具体的には、本実施の形態によれば、従来技術（Ｄ）と同様に、θSを推定するときの計算量は、DataSを利用するときと同程度でありながら、DataBで推定された確率を利用可能である。また、本実施の形態によれば、従来技術（Ｄ）とは異なり、上記定数λ「０＜λ＜１」（図５の５０１）を適切に設定することにより、小規模データDataSにおける特徴的な単語アライメントが打ち消されることを防ぐことが可能となる。 More specifically, according to the present embodiment, as in the prior art (D), the amount of calculation when estimating θS is similar to that when using DataS, but estimated with DataB. Probability is available. Further, according to the present embodiment, unlike the conventional technique (D), the constant λ “0 <λ <1” (501 in FIG. 5) is appropriately set, so that the characteristic in the small-scale data DataS is obtained. This makes it possible to prevent the word alignment from being canceled.

　なお、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをＣＤ－ＲＯＭなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における単語アライメントモデル構築装置１を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータがアクセス可能な記録媒体は、第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値（Ｎ１）未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、第二の閾値（Ｎ２，Ｎ２＞Ｎ１）以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、１以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部とを具備し、コンピュータを、前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、２回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、単語対ごとに、前記確率情報算出部が最終的に算出した第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部として機能させるためのプログラムである。 Note that the processing in the present embodiment may be realized by software. Then, this software may be distributed by software download or the like. Further, this software may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to other embodiments in this specification. In addition, the software which implement | achieves the word alignment model construction apparatus 1 in this Embodiment is the following programs. That is, in this program, the computer-accessible recording medium is a first language sentence that is a first language sentence that is the first language and a second language sentence that is a second language sentence that is the second language. A small-scale bilingual data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translations less than the first threshold (N1), and the small-scale parallel translation data Alignment model of acquired words, a word pair having a first word that is a first language word and a second word that is a second language word, and the first word and the second word correspond to each other A small-scale word alignment model storage unit capable of storing a small-scale word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability of performing, and a second threshold (N2, N2> 1) A word alignment model acquired from large-scale parallel translation data, which is large-scale parallel translation data having the above-mentioned number of parallel translation sentences, a word pair having a first word and a second word, and the first word A large-scale word alignment model storage unit storing a large-scale word alignment model having a plurality of word alignment data having second correspondence probability information that is correspondence probability information regarding the probability that the second word corresponds to, Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence , Parallel translation A bilingual sentence word position probability information storage unit capable of storing bilingual sentence word position probability information related to the probability of matching with the word position information, and a computer for each word pair included in the bilingual sentence included in the small-scale bilingual data The first correspondence probability information calculated in the initial value or the previous loop with respect to one word pair, the second correspondence probability information paired with the one word pair included in the large-scale word alignment model, Probability of calculating first correspondence probability information paired with the one word pair by repeating the loop two or more times using the translated sentence word position probability information corresponding to the one word pair in the parallel translation sentence For each word pair, the first correspondence probability information finally calculated by the probability information calculation unit is associated with the word pair and stored in the small word alignment model storage unit. This is a program for functioning as a probability information storage unit.

　また、上記プログラムにおいて、前記確率情報算出部は、前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する前回第一対応確率情報取得手段と、前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する第二対応確率情報を、前記大規模単語アライメントモデル格納部から取得する第二対応確率情報取得手段と、前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、前記対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、前記対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段と、前記前回第一対応確率情報取得手段が取得した第一対応確率情報と前記第二第一対応確率情報取得手段が取得した第二対応確率情報とを予め決められた割合で加算し、当該加算した結果と、前記対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する中間確率値算出手段と、単語対ごとに、前記中間確率値算出手段が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する正規化前第一対応確率情報取得手段と、単語対ごとに、前記正規化前第一対応確率情報取得手段が取得した正規化前の第一対応確率情報に対して、正規化の処理を行い、第一対応確率情報を取得する正規化手段と、終了条件を満たすまで、前記前回第一対応確率情報取得手段、前記第二対応確率情報取得手段、前記対訳文単語位置情報取得手段、前記対訳文単語位置確率情報取得手段、前記中間確率値算出手段、前記正規化前第一対応確率情報取得手段、および前記正規化手段の処理を繰り返して行わせる制御手段とを具備するものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, the probability information calculation unit converts the word pair into one word pair for each parallel sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence. The first corresponding probability information acquisition means for acquiring the first corresponding probability information of the corresponding initial value or the first corresponding probability information calculated in the previous loop, and for each parallel sentence included in the small-scale parallel translation data, Second correspondence probability information acquisition means for acquiring second correspondence probability information corresponding to the one word pair from the large-scale word alignment model storage unit for each word pair included in the sentence; The first word position indicating the position of the first word in the first language sentence with respect to one word pair for each parallel sentence included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence , A second word position corresponding to the first word and indicating a position in the second language sentence, a first sentence word number that is the number of words in the first language sentence, and a second language sentence A bilingual sentence word position information acquisition unit that acquires bilingual sentence word position information having the number of second sentence words that is the number of words, and a bilingual sentence word position information corresponding to the bilingual sentence word position information acquired by the bilingual sentence word position information acquisition unit The translation sentence word position probability information acquisition means for acquiring the position probability information from the parallel translation word position probability information storage unit, the first correspondence probability information acquired by the previous first correspondence probability information acquisition means, and the second first The second correspondence probability information acquired by the correspondence probability information acquisition means is added at a predetermined ratio, and the result of the addition and the parallel sentence word position probability information acquired by the parallel sentence word position probability information acquisition means Multiply the intermediate probability value First correspondence probability information before normalization that obtains first correspondence probability information before normalization using the intermediate probability value calculated by the intermediate probability value calculation means for each word pair For each word pair, the normalization processing is performed on the first correspondence probability information before normalization acquired by the acquisition means and the first correspondence probability information acquisition means before normalization, and the first correspondence probability information is acquired. Normalization means, and until the end condition is satisfied, the previous first correspondence probability information acquisition means, the second correspondence probability information acquisition means, the parallel translation word position information acquisition means, the parallel translation word position probability information acquisition means, A program that causes a computer to function as the intermediate probability value calculating means, the first correspondence probability information acquiring means before normalization, and a control means that repeatedly performs the processing of the normalizing means. It is.

　また、上記プログラムにおいて、前記対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、前記大規模対訳データを用いて取得された対訳文単語位置確率情報であるものとして、コンピュータを機能させるプログラムであることは好適である。 In the above program, the parallel sentence word position probability information stored in the parallel sentence word position probability information storage unit is the parallel sentence word position probability information acquired using the large-scale parallel translation data. It is preferable that the program functions a computer.

　また、上記プログラムにおいて、コンピュータを、前記大規模対訳データが有する各単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、前記大規模対訳データが有する各単語対に対して、前記対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、前記対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段としてさらに機能させ、前記対訳文単語位置確率情報格納部に格納されている対訳文単語位置確率情報は、前記対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報であるものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, for each word pair possessed by the large-scale parallel translation data, the computer has a first word position indicating a position of the first word in the first language sentence, and a first word position corresponding to the first word. A second word position that is a position of two words and indicates a position in the second language sentence, a first sentence word number that is the number of words in the first language sentence, and a second sentence word that is the number of words in the second language sentence Bilingual sentence word position information acquisition means for acquiring bilingual sentence word position information having a number, and bilingual sentence word position information acquired by the bilingual sentence word position information acquisition means for each word pair included in the large-scale parallel translation data Is further stored as a translated sentence word position probability information acquisition unit for acquiring the translated sentence word position probability information corresponding to the translated sentence word position probability information storage unit, and is stored in the translated sentence word position probability information storage unit. Translated sentence word position probability information as the a translated sentence word position probability information translated sentence word position probability information acquisition means has acquired, it is preferably a program for causing a computer to function.

　（実施の形態２）
　本実施の形態において、実施の形態１で構成された単語アライメントモデルを用いた機械翻訳装置について説明する。 (Embodiment 2)
In this embodiment, a machine translation apparatus using the word alignment model configured in Embodiment 1 will be described.

　図６は、本実施の形態における機械翻訳装置２のブロック図である。機械翻訳装置２は、例えば、第二言語文を翻訳し、第一言語文を得る装置である。 FIG. 6 is a block diagram of the machine translation apparatus 2 in the present embodiment. The machine translation device 2 is a device that translates a second language sentence and obtains a first language sentence, for example.

　機械翻訳装置２は、小規模単語アライメントモデル格納部１１２、対訳文単語位置確率情報格納部１１４、受付部２１、および翻訳部２２を備える。 The machine translation device 2 includes a small-scale word alignment model storage unit 112, a parallel translation word position probability information storage unit 114, a reception unit 21, and a translation unit 22.

　受付部２１は、第二言語文を受け付ける。ここで、受け付けとは、キーボードやマウス、タッチパネルなどの入力デバイスから入力された情報の受け付け、音声のマイクによる受け付け、有線もしくは無線の通信回線を介して送信された情報の受信、光ディスクや磁気ディスク、半導体メモリなどの記録媒体から読み出された情報の受け付けなどを含む概念である。 The reception unit 21 receives a second language sentence. Here, reception means reception of information input from an input device such as a keyboard, mouse, or touch panel, reception by a voice microphone, reception of information transmitted via a wired or wireless communication line, an optical disk or a magnetic disk The concept includes reception of information read from a recording medium such as a semiconductor memory.

　第二言語文の入力手段は、キーボードやマウスやメニュー画面によるもの等、何でも良い。受付部２１は、キーボード等の入力手段のデバイスドライバーや、メニュー画面の制御ソフトウェア等で実現され得る。 The second language sentence input means may be anything such as a keyboard, mouse or menu screen. The accepting unit 21 can be realized by a device driver for input means such as a keyboard, control software for a menu screen, or the like.

　翻訳部２２は、小規模単語アライメントモデル格納部１１２に格納されている小規模単語アライメントモデル、および対訳文単語位置確率情報格納部１１４に格納されている１以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、受付部２１が受け付けた第二言語文から第一言語文を取得する。なお、翻訳部２２は公知技術であるので詳細な説明は省略する。 The translation unit 22 translates each small word alignment model stored in the small word alignment model storage unit 112 and one or more parallel sentence word position information stored in the parallel sentence word position probability information storage unit 114. Using the sentence word position probability information, the first language sentence is acquired from the second language sentence received by the receiving unit 21. Since the translation unit 22 is a known technique, a detailed description thereof is omitted.

　翻訳部２２は、通常、ＭＰＵやメモリ等から実現され得る。翻訳部２２の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The translation unit 22 can usually be realized by an MPU, a memory, or the like. The processing procedure of the translation unit 22 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).

　以下、本実施の形態における機械翻訳装置２の具体的な動作について説明する。 Hereinafter, a specific operation of the machine translation apparatus 2 in the present embodiment will be described.

　対訳データDataXについて、その中の対訳文を<e,f>とし、e = e(1) e(2) … e(m), f= f(1) f(2) … f(n)のように、m単語とn単語からなるとする。ここで、eは第一言語文、fは第二言語文である、とする。また、e(i)はeのi番目の単語であり、f(j)はfのj番目の単語である。なお、特別な単語NULLとして f(0)を導入する。これは、e中の単語がf中の単語のいずれにも対応しない場合に有用である。 For bilingual data DataX, the bilingual sentence in it is <e, f>, and e = e (1) e (2)… e (m), f = f (1) f (2)… f (n) Thus, it is assumed that it consists of m words and n words. Here, e is a first language sentence and f is a second language sentence. E (i) is the i-th word of e, and f (j) is the j-th word of f. Note that f (0) is introduced as the special word NULL. This is useful when the word in e does not correspond to any of the words in f.

　次に、DataXにおけるfを条件とするeの確率を「PX(e|f)」とする。 Next, the probability of e on condition that f in DataX is “PX (e | f)”.

　また、DataXにおいて、eの単語数が m、ｆの単語数がnのときに、文eのi番目にある単語が、文fのｊ番目の単語とアライメントされる確率を「δX（j|i,m,n）」とする。 In DataX, when the number of words in e is m and the number of words in f is n, the probability that the i-th word in sentence e is aligned with the j-th word in sentence f is expressed as “δX (j | i, m, n) ".

　また、DataXにおいて、e(i)がf(j)にアライメントされる確率を「θX(e(i)|f(j))」とする。 In DataX, the probability that e (i) is aligned with f (j) is “θX (e (i) | f (j))”.

　このとき、以下の数式２が成り立つ。 At this time, the following formula 2 holds.

　また、これらの確率、δX、θXは、データXから上記非特許文献５の手法を利用して推定可能である。なお、PXについては、δXとθXから一意的に計算可能である。 Further, these probabilities, δX, θX can be estimated from the data X using the method of Non-Patent Document 5 described above. Note that PX can be uniquely calculated from ΔX and θX.

　ここで、まず、δSの推定法を述べる。それは次のものである。 Here, first, the estimation method of δS will be described. It is:

　すなわち、δSとしてはδBと同じ確率を用いる。その理由は、この確率の推定方法として上記非特許文献５で述べられている方法を、１００文程度の対訳データに対して適用すると、この確率分布のパラメータが発散してしまい、有効な確率を推定不可能だからである。また、この確率は、単に、i, j, m, nという単語の数のみから決定可能なものであるので、対訳データが異なったとしても、同じ確率が精度よく利用できるからである。 That is, the same probability as δB is used as δS. The reason is that if the method described in Non-Patent Document 5 is applied to bilingual data of about 100 sentences as an estimation method of this probability, the parameter of this probability distribution diverges and an effective probability is obtained. This is because estimation is impossible. Further, this probability can be determined only from the number of words i, j, m, and n, and therefore the same probability can be used with high accuracy even if the parallel translation data is different.

　次に、θSの推定方法は、図５のプログラムである。これは、上記非特許文献５と同様にEM法に基づくものである。 Next, the method of estimating θS is the program shown in FIG. This is based on the EM method as in Non-Patent Document 5.

　そして、PS(e|f)の確率値が最も大きい第一言語文ｅが第二言語文ｆの翻訳結果である。 The first language sentence e having the largest probability value of PS (e | f) is the translation result of the second language sentence f.

　以上、本実施の形態によれば、小規模対訳コーパスの単語アライメントを精度よく実行できる結果、精度の良い翻訳結果が得られる。 As described above, according to the present embodiment, the word alignment of the small-scale bilingual corpus can be executed with high accuracy, and a highly accurate translation result can be obtained.

　さらに、本実施の形態における情報処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムにおいて、コンピュータがアクセス可能な記録媒体は、単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部とを具備し、コンピュータを、第二言語文を受け付ける受付部と、前記小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および前記対訳文単語位置確率情報格納部に格納されている１以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、前記受付部が受け付けた第二言語文から第一言語文を取得する翻訳部として機能させるプログラムである。 Furthermore, the software that realizes the information processing apparatus in the present embodiment is the following program. That is, in this program, the computer-accessible recording medium includes a small-scale word alignment model storage unit included in the word alignment model construction device and a bilingual word position probability information storage unit included in the word alignment model construction device. A computer that accepts a second language sentence, a small word alignment model stored in the small word alignment model storage section, and one or more stored in the parallel sentence word position probability information storage section This is a program that functions as a translation unit that acquires a first language sentence from a second language sentence accepted by the accepting unit using bilingual sentence word position probability information for each bilingual sentence word position information.

　また、図７は、本明細書で述べたプログラムを実行して、上述した種々の実施の形態の単語アライメントモデル構築装置等を実現するコンピュータの外観を示す。上述の実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムで実現され得る。図７は、このコンピュータシステム３００の概観図であり、図８は、システム３００のブロック図である。 FIG. 7 shows the external appearance of a computer that executes the program described in this specification to realize the word alignment model construction apparatus and the like of the various embodiments described above. The above-described embodiments can be realized by computer hardware and a computer program executed thereon. FIG. 7 is an overview diagram of the computer system 300, and FIG. 8 is a block diagram of the system 300.

　図７において、コンピュータシステム３００は、ＣＤ－ＲＯＭドライブ３０１２を含むコンピュータ３０１と、キーボード３０２と、マウス３０３と、モニタ３０４とを含む。 7, the computer system 300 includes a computer 301 including a CD-ROM drive 3012, a keyboard 302, a mouse 303, and a monitor 304.

　図８において、コンピュータ３０１は、ＣＤ－ＲＯＭドライブ３０１２に加えて、ＭＰＵ３０１３と、ＭＰＵ３０１３、ＣＤ－ＲＯＭドライブ３０１２に接続されたバス３０１４と、ブートアッププログラム等のプログラムを記憶するためのＲＯＭ３０１５と、ＭＰＵ３０１３に接続され、アプリケーションプログラムの命令を一時的に記憶するとともに一時記憶空間を提供するためのＲＡＭ３０１６と、アプリケーションプログラム、システムプログラム、及びデータを記憶するためのハードディスク３０１７とを含む。ここでは、図示しないが、コンピュータ３０１は、さらに、ＬＡＮへの接続を提供するネットワークカードを含んでも良い。 In FIG. 8, in addition to the CD-ROM drive 3012, the computer 301 includes an MPU 3013, a bus 3014 connected to the MPU 3013 and the CD-ROM drive 3012, a ROM 3015 for storing programs such as a boot-up program, and an MPU 3013. And a RAM 3016 for temporarily storing instructions of the application program and providing a temporary storage space, and a hard disk 3017 for storing the application program, the system program, and data. Although not shown here, the computer 301 may further include a network card that provides connection to a LAN.

　コンピュータシステム３００に、上述した実施の形態の単語アライメントモデル構築装置等の機能を実行させるプログラムは、ＣＤ－ＲＯＭ３１０１に記憶されて、ＣＤ－ＲＯＭドライブ３０１２に挿入され、さらにハードディスク３０１７に転送されても良い。これに代えて、プログラムは、図示しないネットワークを介してコンピュータ３０１に送信され、ハードディスク３０１７に記憶されても良い。プログラムは実行の際にＲＡＭ３０１６にロードされる。プログラムは、ＣＤ－ＲＯＭ３１０１またはネットワークから直接、ロードされても良い。 A program that causes the computer system 300 to execute the functions of the word alignment model construction apparatus of the above-described embodiment is stored in the CD-ROM 3101, inserted into the CD-ROM drive 3012, and further transferred to the hard disk 3017. good. Alternatively, the program may be transmitted to the computer 301 via a network (not shown) and stored in the hard disk 3017. The program is loaded into the RAM 3016 at the time of execution. The program may be loaded directly from the CD-ROM 3101 or the network.

　プログラムは、コンピュータ３０１に、上述した実施の形態の単語アライメントモデル構築装置等の機能を実行させるオペレーティングシステム（ＯＳ）、またはサードパーティープログラム等は、必ずしも含まなくても良い。プログラムは、制御された態様で適切な機能（モジュール）を呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいれば良い。コンピュータシステム３００がどのように動作するかは周知であり、詳細な説明は省略する。 The program does not necessarily include an operating system (OS), a third-party program, or the like that causes the computer 301 to execute functions such as the word alignment model construction device of the above-described embodiment. The program only needs to include an instruction portion that calls an appropriate function (module) in a controlled manner and obtains a desired result. How the computer system 300 operates is well known and will not be described in detail.

　なお、上記プログラムにおいて、情報を送信するステップや、情報を受信するステップなどでは、ハードウェアによって行われる処理、例えば、送信ステップにおけるモデムやインターフェースカードなどで行われる処理（ハードウェアでしか行われない処理）は含まれない。 In the above program, in the step of transmitting information, the step of receiving information, etc., processing performed by hardware, for example, processing performed by a modem or an interface card in the transmission step (only performed by hardware) Processing) is not included.

　また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the computer that executes the program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

　また、上記各実施の形態において、一の装置に存在する２以上の通信手段は、物理的に一の媒体で実現されても良いことは言うまでもない。 Further, in each of the above embodiments, it goes without saying that two or more communication means existing in one apparatus may be physically realized by one medium.

　また、上記各実施の形態において、各処理は、単一の装置によって集中処理されることによって実現されてもよく、あるいは、複数の装置によって分散処理されることによって実現されてもよい。 In each of the above embodiments, each process may be realized by centralized processing by a single device, or may be realized by distributed processing by a plurality of devices.

　本発明は、以上の実施の形態に限定されることなく、種々の変更が可能であり、それらも本発明の範囲内に包含されるものであることは言うまでもない。 The present invention is not limited to the above embodiment, and various modifications are possible, and it goes without saying that these are also included in the scope of the present invention.

　以上のように、本発明にかかる単語アライメントモデル構築装置は、小規模対訳コーパスの単語アライメントを精度よく実行できるという効果を有し、単語アライメントモデル構築装置等として有用である。 As described above, the word alignment model construction apparatus according to the present invention has an effect that word alignment of a small-scale parallel corpus can be executed with high accuracy, and is useful as a word alignment model construction apparatus or the like.

　１　単語アライメントモデル構築装置
　２　機械翻訳装置
　１１　格納部
　１２　確率情報算出部
　１３　対応確率情報蓄積部
　２１　受付部
　２２　翻訳部
　１１１　小規模対訳データ格納部
　１１２　小規模単語アライメントモデル格納部
　１１３　大規模単語アライメントモデル格納部
　１１４　対訳文単語位置確率情報格納部
　１２１　前回第一対応確率情報取得手段
　１２２　第二対応確率情報取得手段
　１２３　対訳文単語位置情報取得手段
　１２４　対訳文単語位置確率情報取得手段
　１２５　中間確率値算出手段
　１２６　正規化前第一対応確率情報取得手段
　１２７　正規化手段
　１２８　制御手段 DESCRIPTION OF SYMBOLS 1 Word alignment model construction apparatus 2 Machine translation apparatus 11 Storage part 12 Probability information calculation part 13 Corresponding probability information storage part 21 Reception part 22 Translation part 111 Small parallel translation data storage part 112 Small word alignment model storage part 113 Large scale word alignment Model storage unit 114 Parallel sentence word position probability information storage part 121 Previous first correspondence probability information acquisition means 122 Second correspondence probability information acquisition means 123 Parallel sentence word position information acquisition means 124 Parallel sentence word position probability information acquisition means 125 Intermediate probability value Calculation means 126 First correspondence probability information acquisition means 127 before normalization 127 Normalization means 128 Control means

Claims

第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値（Ｎ１）未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、
前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、
第二の閾値（Ｎ２，Ｎ２＞Ｎ１）以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、
１以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部と、
前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、２回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、
単語対ごとに、前記確率情報算出部が最終的に算出した第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部とを具備する単語アライメントモデル構築装置。 It is a pair of a first language sentence that is a sentence of the first language that is the first language and a second language sentence that is a sentence of the second language that is the second language, and is less than the first threshold (N1) A small-scale parallel translation data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translation sentences;
An alignment model of words acquired from the small-scale parallel translation data, a word pair having a first word that is a word in a first language and a second word that is a word in a second language, the first word, and the A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability that the second word corresponds;
An alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data having a number of parallel translations equal to or greater than a second threshold (N2, N2> N1), and a first word and a second word A large-scale word alignment model having a plurality of word alignment data having a word pair and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other is stored A large-scale word alignment model storage;
Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence A bilingual word position probability information storage unit that can store bilingual word position probability information related to the probability of matching the bilingual word position information;
For each word pair included in the bilingual sentence included in the small-scale parallel translation data, the first correspondence probability information calculated in the initial value or the previous loop for one word pair, and the large-scale word alignment model has Using the second correspondence probability information paired with one word pair and the parallel translation word position probability information corresponding to the one word pair in the parallel translation sentence, the loop is repeated twice or more to A probability information calculation unit for calculating first correspondence probability information paired with the word pair;
A correspondence probability information storage unit that stores first correspondence probability information finally calculated by the probability information calculation unit for each word pair in the small word alignment model storage unit in association with the word pair; Word alignment model construction device.
前記確率情報算出部は、
前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する初期値の第一対応確率情報または前回のループにおいて算出した第一対応確率情報を取得する前回第一対応確率情報取得手段と、
前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、当該一の単語対に対応する第二対応確率情報を、前記大規模単語アライメントモデル格納部から取得する第二対応確率情報取得手段と、
前記小規模対訳データが有する対訳文ごとに、かつ当該対訳文が有する単語対ごとに、一の単語対に対して、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報を取得する対訳文単語位置情報取得手段と、
前記対訳文単語位置情報取得手段が取得した対訳文単語位置情報に対応する対訳文単語位置確率情報を、前記対訳文単語位置確率情報格納部から取得する対訳文単語位置確率情報取得手段と、
前記前回第一対応確率情報取得手段が取得した第一対応確率情報と前記第二対応確率情報取得手段が取得した第二対応確率情報とを予め決められた割合で加算し、当該加算した結果と、前記対訳文単語位置確率情報取得手段が取得した対訳文単語位置確率情報とを乗算し、中間確率値を算出する中間確率値算出手段と、
単語対ごとに、前記中間確率値算出手段が算出した中間確率値を用いて、正規化前の第一対応確率情報を取得する正規化前第一対応確率情報取得手段と、
単語対ごとに、前記正規化前第一対応確率情報取得手段が取得した正規化前の第一対応確率情報に対して、正規化の処理を行い、第一対応確率情報を取得する正規化手段と、
終了条件を満たすまで、前記前回第一対応確率情報取得手段、前記第二対応確率情報取得手段、前記対訳文単語位置情報取得手段、前記対訳文単語位置確率情報取得手段、前記中間確率値算出手段、前記正規化前第一対応確率情報取得手段、および前記正規化手段の処理を繰り返して行わせる制御手段とを具備する請求項１記載の単語アライメントモデル構築装置。 The probability information calculation unit includes:
The first correspondence probability information of the initial value corresponding to the one word pair or the previous correspondence information for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence Previous first correspondence probability information obtaining means for obtaining first correspondence probability information calculated in the loop;
Second correspondence probability information corresponding to the one word pair for each word pair included in the parallel translation sentence and for each word pair included in the parallel translation sentence, the large-scale word Second correspondence probability information obtaining means for obtaining from the alignment model storage unit;
A first word position indicating a position of the first word in the first language sentence for each word pair included in the small-scale parallel translation data and for each word pair included in the parallel translation sentence; The second word position corresponding to the first word and indicating the position in the second language sentence, the first sentence word number that is the number of words in the first language sentence, and the second language sentence Bilingual sentence word position information acquisition means for acquiring bilingual sentence word position information having the number of second sentence words that is the number of words;
Parallel sentence word position probability information acquisition means for acquiring parallel sentence word position probability information corresponding to the parallel sentence word position information acquired by the parallel sentence word position information acquisition means from the parallel sentence word position probability information storage unit;
The first correspondence probability information acquired by the previous first correspondence probability information acquisition means and the second correspondence probability information acquired by the second correspondence probability information acquisition means are added at a predetermined ratio, and the addition result An intermediate probability value calculating means for calculating an intermediate probability value by multiplying the parallel sentence word position probability information acquired by the parallel sentence word position probability information acquiring means;
For each word pair, using the intermediate probability value calculated by the intermediate probability value calculating means, the first correspondence probability information acquisition means before normalization for acquiring the first correspondence probability information before normalization;
For each word pair, normalization means for obtaining first correspondence probability information by performing normalization processing on the first correspondence probability information before normalization obtained by the first correspondence probability information obtaining means before normalization When,
Until the end condition is satisfied, the previous first correspondence probability information acquisition means, the second correspondence probability information acquisition means, the parallel translation word position information acquisition means, the parallel translation word position probability information acquisition means, the intermediate probability value calculation means The word alignment model construction apparatus according to claim 1, further comprising: a first correspondence probability information acquisition unit before normalization, and a control unit that repeatedly performs the processing of the normalization unit.
請求項１または請求項２いずれか一項に記載の単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、
請求項１から請求項４いずれか一項に記載の単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部と、
第二言語文を受け付ける受付部と、
前記小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および前記対訳文単語位置確率情報格納部に格納されている１以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、前記受付部が受け付けた第二言語文から第一言語文を取得する翻訳部とを具備する機械翻訳装置。 A small-scale word alignment model storage unit included in the word alignment model construction device according to claim 1,
A bilingual word position probability information storage unit included in the word alignment model construction device according to any one of claims 1 to 4,
A reception unit for receiving a second language sentence;
The small word alignment model stored in the small word alignment model storage unit and the bilingual word position probability information for each one or more parallel word position information stored in the parallel word position probability information storage unit A machine translation apparatus comprising: a translation unit that acquires a first language sentence from a second language sentence received by the reception unit.
記録媒体は、
第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値（Ｎ１）未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、
前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、
第二の閾値（Ｎ２，Ｎ２＞Ｎ１）以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、
１以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部とを具備し、
確率情報算出部、対応確率情報蓄積部により実現される単語アライメントモデルの生産方法であって、
前記確率情報算出部が、前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、２回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出ステップと、
前記対応確率情報蓄積部が、単語アライメントモデルの生産方法単語対ごとに、前記確率情報算出ステップで最終的に算出された第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積ステップとを具備する単語アライメントモデルの生産方法。 The recording medium is
It is a pair of a first language sentence that is a sentence of the first language that is the first language and a second language sentence that is a sentence of the second language that is the second language, and is less than the first threshold (N1) A small-scale parallel translation data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translation sentences;
An alignment model of words acquired from the small-scale parallel translation data, a word pair having a first word that is a word in a first language and a second word that is a word in a second language, the first word, and the A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability that the second word corresponds;
An alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data having a number of parallel translations equal to or greater than a second threshold (N2, N2> N1), and a first word and a second word A large-scale word alignment model having a plurality of word alignment data having a word pair and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other is stored A large-scale word alignment model storage;
Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence A bilingual sentence word position probability information storage unit capable of storing bilingual sentence word position probability information related to the probability of matching with the bilingual sentence word position information,
A method for producing a word alignment model realized by a probability information calculation unit and a corresponding probability information storage unit,
The probability information calculation unit calculates, for each word pair included in the parallel translation sentence included in the small-scale parallel translation data, the first correspondence probability information calculated in the initial value or the previous loop for one word pair; Two or more times using the second correspondence probability information paired with the one word pair of the scale word alignment model and the parallel sentence word position probability information corresponding to the one word pair in the parallel translation sentence Probability information calculation step of calculating first correspondence probability information paired with the one word pair by repeating a loop;
The correspondence probability information accumulating unit associates the first correspondence probability information finally calculated in the probability information calculation step with the word pair for each production method word pair of the word alignment model. A method for producing a word alignment model, comprising: a corresponding probability information accumulating step for accumulating in an alignment model storage unit.
コンピュータがアクセス可能な記録媒体は、
第一の言語である第一言語の文である第一言語文と第二の言語である第二言語の文である第二言語文との対であり、第一の閾値（Ｎ１）未満の数の対訳文を有する小規模な対訳データである小規模対訳データを格納し得る小規模対訳データ格納部と、
前記小規模対訳データから取得される単語のアライメントモデルであり、第一言語の単語である第一単語と第二言語の単語である第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第一対応確率情報とを有する複数の単語アライメントデータを有する小規模単語アライメントモデルを格納し得る小規模単語アライメントモデル格納部と、
第二の閾値（Ｎ２，Ｎ２＞Ｎ１）以上の数の対訳文を有する大規模な対訳データである大規模対訳データから取得された単語のアライメントモデルであり、第一単語と第二単語とを有する単語対と、前記第一単語と前記第二単語とが対応する確率に関する対応確率情報である第二対応確率情報とを有する複数の単語アライメントデータを有する大規模単語アライメントモデルを格納している大規模単語アライメントモデル格納部と、
１以上の対訳文から取得された情報であり、第一言語文の中における第一単語の位置を示す第一単語位置、当該第一単語に対応する第二単語の位置であり第二言語文の中における位置を示す第二単語位置、第一言語文の単語数である第一文単語数、および第二言語文の単語数である第二文単語数を有する対訳文単語位置情報ごとに、対訳文単語位置情報に合致する確率に関する対訳文単語位置確率情報を格納し得る対訳文単語位置確率情報格納部とを具備し、
コンピュータを、
前記小規模対訳データに含まれる対訳文が有する単語対ごとに、一の単語対に対して、初期値または前回のループにおいて算出した第一対応確率情報と、前記大規模単語アライメントモデルが有する前記一の単語対と対になる第二対応確率情報と、前記対訳文の中における前記一の単語対に対応する対訳文単語位置確率情報とを用いて、２回以上ループを繰り返して、前記一の単語対と対になる第一対応確率情報を算出する確率情報算出部と、
単語対ごとに、前記確率情報算出部が最終的に算出した第一対応確率情報を、前記単語対に対応付けて、前記小規模単語アライメントモデル格納部に蓄積する対応確率情報蓄積部として機能させるためのプログラムを記録した記録媒体。 Computer-accessible recording media
It is a pair of a first language sentence that is a sentence of the first language that is the first language and a second language sentence that is a sentence of the second language that is the second language, and is less than the first threshold (N1) A small-scale parallel translation data storage unit capable of storing small-scale parallel translation data, which is a small-scale parallel translation data having a number of parallel translation sentences;
An alignment model of words acquired from the small-scale parallel translation data, a word pair having a first word that is a word in a first language and a second word that is a word in a second language, the first word, and the A small word alignment model storage unit capable of storing a small word alignment model having a plurality of word alignment data having first correspondence probability information which is correspondence probability information regarding the probability that the second word corresponds;
An alignment model of words acquired from large-scale parallel translation data, which is large-scale parallel translation data having a number of parallel translations equal to or greater than a second threshold (N2, N2> N1), and a first word and a second word A large-scale word alignment model having a plurality of word alignment data having a word pair and second correspondence probability information that is correspondence probability information regarding a probability that the first word and the second word correspond to each other is stored A large-scale word alignment model storage;
Information obtained from one or more parallel translations, a first word position indicating the position of the first word in the first language sentence, a position of the second word corresponding to the first word, and a second language sentence For each translated sentence word position information having a second word position indicating a position in the word, a first sentence word number that is the number of words in the first language sentence, and a second sentence word number that is the number of words in the second language sentence A bilingual sentence word position probability information storage unit capable of storing bilingual sentence word position probability information related to the probability of matching with the bilingual sentence word position information,
Computer
For each word pair included in the bilingual sentence included in the small-scale parallel translation data, the first correspondence probability information calculated in the initial value or the previous loop for one word pair, and the large-scale word alignment model has Using the second correspondence probability information paired with one word pair and the parallel translation word position probability information corresponding to the one word pair in the parallel translation sentence, the loop is repeated twice or more to A probability information calculation unit for calculating first correspondence probability information paired with the word pair;
For each word pair, the first correspondence probability information finally calculated by the probability information calculation unit is made to correspond to the word pair and function as a correspondence probability information accumulation unit that accumulates in the small-scale word alignment model storage unit. A recording medium on which a program is recorded.
コンピュータがアクセス可能な記録媒体は、
請求項１または請求項２に記載の単語アライメントモデル構築装置が有する小規模単語アライメントモデル格納部と、
請求項１または請求項２に記載の単語アライメントモデル構築装置が有する対訳文単語位置確率情報格納部とを具備し、
コンピュータを、
第二言語文を受け付ける受付部と、
前記小規模単語アライメントモデル格納部に格納されている小規模単語アライメントモデル、および前記対訳文単語位置確率情報格納部に格納されている１以上の対訳文単語位置情報ごとの対訳文単語位置確率情報を用いて、前記受付部が受け付けた第二言語文から第一言語文を取得する翻訳部として機能させるためのプログラムを記録した記録媒体。 Computer-accessible recording media
A small-scale word alignment model storage unit included in the word alignment model construction device according to claim 1 or 2,
A bilingual word position probability information storage unit included in the word alignment model construction device according to claim 1 or 2;
Computer
A reception unit for receiving a second language sentence;
The small word alignment model stored in the small word alignment model storage unit and the bilingual word position probability information for each one or more parallel word position information stored in the parallel word position probability information storage unit The recording medium which recorded the program for functioning as a translation part which acquires a 1st language sentence from the 2nd language sentence which the said reception part received using.