JP5766152B2

JP5766152B2 - Language model generation apparatus, method and program

Info

Publication number: JP5766152B2
Application number: JP2012137187A
Authority: JP
Inventors: 済央野本; 浩和政瀧; 高橋　敏; 敏高橋
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-06-18
Filing date: 2012-06-18
Publication date: 2015-08-19
Anticipated expiration: 2032-06-18
Also published as: JP2014002257A

Description

The present invention relates to a technique for generating a language model from a text corpus.

Probabilistic language models (hereinafter also simply "language models") are currently used in many fields, including speech recognition and machine translation. A language model assigns a probability of occurrence to a word sequence or character sequence. The most common language model is the n-gram model (see Non-Patent Document 1), which assumes that the probability of a word depends only on the immediately preceding (n-1) words. For example, in the sentence 「私はりんごを＿＿＿＿」 ("I ____ an apple", where the blank is the sentence-final verb), the word that fills the blank is likely to be 「食べる」 (eat), 「買う」 (buy), 「かじる」 (bite), or the like. This is inferred from the word sequence 「りんご」 (apple) 「を」 (object particle) that precedes the blank. The occurrence information of the few words immediately preceding a position is thus effective for estimating the probability of the word at that position.

In general, a model that uses only the one immediately preceding word is called a bigram, and a model that uses the two immediately preceding words is called a trigram. In the example above, a bigram considers only 「を」 when predicting the word in the blank, whereas a trigram considers 「りんご」 「を」.

For example, let C(W) denote the frequency of occurrence of a word sequence W. The conditional probability P that 「食べる」 fills the blank in the example above is then calculated as follows in the bigram model and the trigram model, respectively.
bigram: P(食べる | を) = C(を-食べる) / C(を)
trigram: P(食べる | りんご-を) = C(りんご-を-食べる) / C(りんご-を)
In the field of speech recognition, n = 2 (bigram) or n = 3 (trigram) is typically used.
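The relative-frequency estimates above can be sketched in a few lines of Python. This is a minimal illustration, not part of the patent: the function names and the three-sentence toy corpus are assumptions made for the example.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive words across the tokenized sentences."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

def ngram_probability(word, history, sentences):
    """Relative-frequency estimate P(word | history) = C(history-word) / C(history)."""
    history = tuple(history)
    numerator = count_ngrams(sentences, len(history) + 1)[history + (word,)]
    denominator = count_ngrams(sentences, len(history))[history]
    return numerator / denominator if denominator else 0.0

# Hypothetical toy corpus, already split into words.
corpus = [
    ["私", "は", "りんご", "を", "食べる"],
    ["私", "は", "りんご", "を", "買う"],
    ["私", "は", "本", "を", "買う"],
]

bigram = ngram_probability("食べる", ["を"], corpus)             # C(を-食べる) / C(を)
trigram = ngram_probability("食べる", ["りんご", "を"], corpus)  # C(りんご-を-食べる) / C(りんご-を)
print(bigram, trigram)
```

On this toy corpus the trigram estimate is sharper than the bigram estimate because the longer history 「りんご」 「を」 excludes the sentence about 「本」.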

In general, a trigram gives more accurate estimates than a bigram. For example, estimating the word that follows the word sequence 「りんご」 「を」 is easier than estimating the word that follows the single word 「を」. An ideal language model is therefore one in which every conditional probability calculated by the trigram model (hereinafter also "trigram probability") equals the actual distribution of occurrence.

Conditional probabilities calculated by a bigram model (hereinafter also "bigram probabilities") and trigram probabilities are preferably learned from a large training corpus. A corpus is a database of text data generated from natural language. The larger the training corpus, the more n-gram patterns (patterns of word sequences consisting of n words) can be learned, and the more statistically reliable the conditional probabilities calculated by the n-gram model (hereinafter also "n-gram probabilities") become. That is, the accuracy of the language model increases. Conversely, a small training corpus cannot cover enough n-gram patterns, and its n-gram probabilities are statistically unreliable, so the accuracy of the language model is low. Improving the accuracy of a language model therefore requires a large training corpus.

The training corpus should also match the actual task. For example, when a language model is used for speech recognition, the training corpus should have a word frequency distribution equivalent to that of the task to be recognized. Words used in baseball broadcasts, for instance, tend to differ from words used in telephone conversations at a call center. Consequently, when speech recognition is used to create subtitles for a baseball broadcast, a language model trained on transcripts of baseball broadcasts gives higher recognition accuracy than one trained on transcripts of telephone conversations.

Non-Patent Document 1: Kenji Kita (北研二), "Language and Computation 4: Probabilistic Language Models" (言語と計算４ 確率的言語モデル), University of Tokyo Press, 1999, pp. 57-62.

As described above, however, generating an accurate language model requires a large training corpus, and an accurate model cannot be generated when only a small training corpus is available. In particular, a large training corpus often cannot be prepared for a specific task.

Furthermore, when generating a language model for speech recognition, recognition accuracy is higher when text transcribed from speech is used as the training corpus. Creating a large training corpus in this way requires manually transcribing a large amount of speech, which is costly in time and labor. Preparing a large training corpus for each task raises the cost further. If a language model is instead generated from a small training corpus to reduce this cost, its accuracy falls.

An object of the present invention is to provide a technique for generating, from a small text corpus, a language model that is more accurate than one obtained with the conventional technique.

To solve the above problem, according to a first aspect of the present invention, a language model generation apparatus includes: a pseudo-text generation unit that, using original text that has been segmented into morphemes and annotated with dependency relations between phrases (文節, bunsetsu), generates pseudo text by rearranging a plurality of phrases that share the same dependency target; and a language model generation unit that obtains n-gram probabilities from the frequencies of n-gram patterns in the original text and the frequencies of n-gram patterns in the pseudo text, and generates a language model.

To solve the above problem, according to a second aspect of the present invention, a language model generation method includes: a pseudo-text generation step of generating pseudo text, using original text that has been segmented into morphemes and annotated with dependency relations between phrases, by rearranging a plurality of phrases that share the same dependency target; and a language model generation step of obtaining n-gram probabilities from the frequencies of n-gram patterns in the original text and the frequencies of n-gram patterns in the pseudo text, and generating a language model.

According to the present invention, increasing the number of n-gram patterns obtained from a single sentence makes it possible to generate, from a small text corpus, a language model that is more accurate than with the conventional technique.

FIG. 1A is a diagram for explaining dependency relations between phrases, and FIG. 1B is a diagram for explaining a syntax analysis result.
A functional block diagram of the language model generation apparatus according to the first embodiment.
A diagram showing the processing flow of the language model generation apparatus according to the first embodiment.
A diagram for explaining a method of rearranging a plurality of phrases that depend on the same phrase.
A functional block diagram of the language model generation apparatus according to the second embodiment.
A diagram showing the processing flow of the language model generation apparatus according to the second embodiment.
A functional block diagram of the pseudo-text selection unit according to the first determination method of the second embodiment.
A diagram showing the processing flow of the pseudo-text selection unit according to the first determination method of the second embodiment.
A functional block diagram of the pseudo-text selection unit according to the second determination method of the second embodiment.
A diagram showing the processing flow of the pseudo-text selection unit according to the second determination method of the second embodiment.
A functional block diagram of the language model generation apparatus according to the third embodiment.
A diagram showing the processing flow of the language model generation apparatus according to the third embodiment.

＜Points of the first embodiment＞
From the single sentence 「私はあのりんごを今日友達と食べる」 ("I eat that apple with a friend today"; 私/は/あの/りんご/を/今日/友達/と/食べる), the following seven trigram patterns are learned. The form in parentheses is the sentence divided into morphemes.
1. 私-は-あの
2. は-あの-りんご
3. あの-りんご-を
4. りんご-を-今日
5. を-今日-友達
6. 今日-友達-と
7. 友達-と-食べる
The present embodiment aims to increase the number of n-gram patterns (for example, trigram patterns) obtained from a single sentence.
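The seven trigram patterns listed above can be reproduced with a sliding window over the morpheme sequence. The following is a minimal sketch; the helper name `ngrams` is an assumption made for illustration.

```python
def ngrams(morphemes, n=3):
    """Return every run of n consecutive morphemes as a tuple, in sentence order."""
    return [tuple(morphemes[i:i + n]) for i in range(len(morphemes) - n + 1)]

# The example sentence, already segmented into morphemes.
morphemes = "私/は/あの/りんご/を/今日/友達/と/食べる".split("/")

trigram_patterns = ngrams(morphemes, n=3)
for number, pattern in enumerate(trigram_patterns, start=1):
    print(number, "-".join(pattern))
```

A sentence of m morphemes yields m - n + 1 patterns, which is why the nine-morpheme example gives seven trigrams.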

The present embodiment therefore focuses on the word order variation of Japanese. Japanese, especially spoken Japanese, is a language in which word order varies readily. For example, the sentence 「私はあのりんごを今日友達と食べる」 is still correct Japanese when spoken as 「今日あのりんごを友達と私は食べる」 or 「私は今日あのりんごを友達と食べる」. Word order in Japanese is thus difficult to determine uniquely, and a small training corpus can hardly cover the many possible variations. For each text in a training corpus, the embodiment therefore creates texts with altered word order and uses them as training data as well, increasing the number of n-gram patterns learned. The training corpus that exists from the outset is called the original text corpus, and the text data in it are called original texts. A text obtained by altering the word order of an original text is called a pseudo text, and a corpus consisting of pseudo texts is called a pseudo corpus. The original text corpus and the pseudo corpus are used together as the training corpus.

For example, 「私はあのりんごを今日友達と食べる」 remains natural Japanese when expressed in word orders such as the following.
Original text: 私はあのりんごを今日友達と食べる
Pseudo text (1): 今日私は友達とあのりんごを食べる
Pseudo text (2): 私は今日あのりんごを友達と食べる
Pseudo text (3): 私は今日友達とあのりんごを食べる
Pseudo text (4): 私は友達と今日あのりんごを食べる
Pseudo text (5): 私は友達とあのりんごを今日食べる
Pseudo text (6): 今日あのりんごを私は友達と食べる
…
Such rearrangement makes it possible to learn trigram patterns that the original sentence did not contain, such as 今日-あの-りんご, 友達-と-今日, and 今日-私-は. For example, the following seven trigram patterns (1)1 to (1)7 are learned from pseudo text (1), 「今日私は友達とあのりんごを食べる」. Of these, (1)1 to (1)5 and (1)7 are newly acquired through pseudo text (1); (1)6 already occurs in the original text.
(1)1. 今日-私-は
(1)2. 私-は-友達
(1)3. は-友達-と
(1)4. 友達-と-あの
(1)5. と-あの-りんご
(1)6. あの-りんご-を
(1)7. りんご-を-食べる
Generating pseudo texts from the original text by rearrangement in this way makes it possible to extract new n-gram patterns that could not be obtained from the original text alone.
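Which patterns a pseudo text contributes can be checked with a set difference between its trigrams and those of the original text. The sketch below uses pseudo text (1); the helper name `trigram_set` is an assumption made for illustration.

```python
def trigram_set(words):
    """Return the set of trigrams (as 3-tuples) occurring in a morpheme sequence."""
    return set(zip(words, words[1:], words[2:]))

original = "私/は/あの/りんご/を/今日/友達/と/食べる".split("/")
pseudo_1 = "今日/私/は/友達/と/あの/りんご/を/食べる".split("/")

# Trigrams that appear in pseudo text (1) but not in the original text.
new_patterns = trigram_set(pseudo_1) - trigram_set(original)
# Trigrams shared by both texts.
shared_patterns = trigram_set(pseudo_1) & trigram_set(original)

for pattern in sorted(new_patterns):
    print("-".join(pattern))
```

Only あの-りんご-を is shared, so pseudo text (1) adds six trigram patterns to the seven obtained from the original text.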

The present embodiment uses dependency relations to realize word order variation. A dependency relation in Japanese means that two phrases are linked by a semantic connection between a modifier and the element it modifies. A phrase (文節, bunsetsu) is the smallest meaningful unit obtained when a sentence is divided finely. A phrase generally consists of an independent word (自立語), such as a noun or verb, and an attached word (接語); the attached word may be absent or omitted. For example, 「私はあのりんごを今日友達と食べる」 can be divided into phrases as follows.
Original text: 私はあのりんごを今日友達と食べる
Phrases: 私は/あの/りんごを/今日/友達と/食べる
From phrases divided in this way, dependency relations such as those in FIG. 1A can be extracted. In the example of FIG. 1A, a total of five dependency relations are extracted: 私は→食べる, あの→りんごを, りんごを→食べる, 今日→食べる, and 友達と→食べる. Between phrases in a dependency relation, a direct connection holds from the modifier to the modified phrase. Phrases at the same depth in the dependency structure are independent of one another.

The phrases 「私は」, 「（あの）りんごを」, 「今日」, and 「友達と」 all depend on 「食べる」. Rearranging these four phrases does not produce an incorrect Japanese word order. Such rearrangement makes it possible to learn trigram patterns not contained in the original text, such as 今日-あの-りんご, 友達-と-今日, and 今日-私-は. Using dependency relations in this way allows more natural n-gram patterns to be extracted from a single sentence.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted.

＜First embodiment＞
FIG. 2 is a functional block diagram of the language model generation apparatus 100, and FIG. 3 shows its processing flow.

The language model generation apparatus 100 includes a morphological analysis unit 110, a syntax analysis unit 120, a pseudo-text generation unit 130, and a language model generation unit 140.

The language model generation apparatus 100 receives T original texts tex_t from the original text corpus, generates a language model using these original texts tex_t, and outputs it, where t = 1, 2, ..., T. Each unit is described in detail below. In the present embodiment, the original text corpus need only contain the text data of the original texts; part-of-speech information and the like are not necessarily required.

＜Morphological analysis unit 110＞
・Input: original text tex_t
・Output: morphological analysis result mor_t (the original text segmented into morphemes)
・Processing: The unit performs morphological analysis on the original text tex_t (s110), divides it into morphemes, and outputs the morphological analysis result mor_t. A morpheme is the smallest linguistically meaningful unit. A conventional technique is used for morphological analysis. For example, morphological analysis of the original text 「私はあのりんごを今日友達と食べる」 yields a morphological analysis result mor_t in which the words are delimited by "/", as follows.
⇒ 私/は/あの/りんご/を/今日/友達/と/食べる

＜Syntax analysis unit 120＞
・Input: morphological analysis result mor_t (the original text segmented into morphemes)
・Output: syntax analysis result syn_t (the morphological analysis result together with information indicating the dependency relations between phrases)
・Processing: The unit performs syntax analysis on the morphological analysis result mor_t (s120): it divides mor_t into phrases, analyzes the dependency relations among the resulting phrases, and outputs the syntax analysis result syn_t. In the present embodiment, syntax analysis means analyzing the dependency relations between phrases. A conventional technique is used for syntax analysis. For example, syntax analysis of the morphological analysis result 「私/は/あの/りんご/を/今日/友達/と/食べる」 yields a syntax analysis result syn_t such as that in FIG. 1B. For convenience, this specification writes a dependency relation like that of FIG. 1B as 「私/は(6) あの(3) りんご/を(6) 今日(6) 友達/と(6) 食べる」. The number in parentheses is the number of the phrase on which the immediately preceding phrase depends. For example, the first phrase 「私/は」 depends on the sixth phrase 「食べる」.
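The bracketed dependency notation can be turned into a simple data structure: a list of phrases, each a list of morphemes, plus a list of 1-based head numbers. The following is a minimal sketch; the function name and the choice of `None` as the head of the final phrase are assumptions made for illustration, and the parser accepts both ASCII and full-width brackets and slashes.

```python
import re

def parse_dependency_notation(text):
    """Parse 'phrase(head) phrase(head) ... last-phrase' into (phrases, heads).

    phrases: list of phrases, each a list of morphemes.
    heads:   1-based number of the phrase each phrase depends on (None for the last).
    """
    # The capturing group makes re.split alternate: phrase, head, phrase, head, ..., last phrase.
    parts = re.split(r"[(（]\s*(\d+)\s*[)）]", text)
    phrases, heads = [], []
    for i in range(0, len(parts) - 1, 2):
        phrases.append([m for m in re.split(r"[/／]", parts[i].strip()) if m])
        heads.append(int(parts[i + 1]))
    last = parts[-1].strip()
    if last:
        phrases.append([m for m in re.split(r"[/／]", last) if m])
        heads.append(None)  # the final phrase depends on nothing
    return phrases, heads

phrases, heads = parse_dependency_notation("私/は(6) あの(3) りんご/を(6) 今日(6) 友達/と(6) 食べる")
for number, (phrase, head) in enumerate(zip(phrases, heads), start=1):
    print(number, "/".join(phrase), "->", head)
```

Grouping the phrase numbers by head then shows directly which phrases share a dependency target: here phrases 1, 3, 4, and 5 all point to phrase 6.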

＜Pseudo-text generation unit 130＞
・Input: syntax analysis result syn_t (the morphological analysis result together with information indicating the dependency relations between phrases)
・Output: pseudo texts tex_{t,u}
・Processing: Using the syntax analysis result syn_t, the unit rearranges phrases to generate U_t pseudo texts tex_{t,u} (s130), where u = 1, 2, ..., U_t. The rearrangement is performed by reordering a plurality of phrases that depend on the same phrase. For example, on receiving 「私/は(6) あの(3) りんご/を(6) 今日(6) 友達/と(6) 食べる」, the unit rearranges the four phrases whose dependency target is the sixth phrase 「食べる」: the first phrase 「私/は」, the third phrase 「(あの)/りんご/を」, the fourth phrase 「今日」, and the fifth phrase 「友達/と」. The second phrase 「あの」 depends on the third phrase and therefore moves together with it. Pseudo texts tex_{t,u} are generated by arranging these four phrases in every permutation. Since one of the 4! = 24 permutations is the original text tex_t itself, 4! - 1 = 4 × 3 × 2 × 1 - 1 = 23 pseudo texts tex_{t,u} are generated (see FIG. 4). When a syntax analysis result syn_t contains no phrases that share a dependency target, U_t = 0 and no pseudo text tex_{t,u} is generated.
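The rearrangement can be sketched as a recursive enumeration over the dependency tree. This is a minimal sketch under stated assumptions, not the patent's implementation: it assumes every phrase precedes the phrase it depends on and that each dependent moves together with its own dependents, and it permutes siblings under every head rather than a single one (for the example sentence the two coincide). The function names and the 1-based `heads` list with `None` for the final phrase are hypothetical.

```python
from itertools import permutations, product

def linearizations(node, heads):
    """All phrase orders for the subtree rooted at `node` (1-based), head placed last."""
    children = [i for i, head in enumerate(heads, start=1) if head == node]
    if not children:
        return [[node]]
    orders = []
    for child_order in permutations(children):
        # Combine one linearization per child subtree, in the chosen sibling order.
        for combo in product(*(linearizations(child, heads) for child in child_order)):
            orders.append([i for part in combo for i in part] + [node])
    return orders

def generate_pseudo_texts(phrases, heads):
    """Return every reordering of `phrases` except the original order."""
    root = heads.index(None) + 1
    original = list(range(1, len(phrases) + 1))
    return [[phrases[i - 1] for i in order]
            for order in linearizations(root, heads) if order != original]

phrases = ["私は", "あの", "りんごを", "今日", "友達と", "食べる"]
heads = [6, 3, 6, 6, 6, None]
pseudo_texts = generate_pseudo_texts(phrases, heads)
print(len(pseudo_texts))          # 4! - 1 reorderings
print("".join(pseudo_texts[0]))
```

Because 「あの」 is kept attached to 「りんごを」, only four units are permuted, and a sentence in which no two phrases share a head yields an empty list, matching the case U_t = 0.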

＜Language model generation unit 140＞
・Input: original texts tex_t, pseudo texts tex_{t,u}
・Output: language model (n-gram model)
・Processing: The unit obtains n-gram probabilities from the frequencies Count_G of n-gram patterns in the T original texts tex_t and the frequencies Count_S of n-gram patterns in the (U_1 + U_2 + ... + U_T) pseudo texts tex_{t,u}, and generates a language model (s140). When obtaining the n-gram probabilities, the frequencies Count_G obtained from the T original texts tex_t and the frequencies Count_S obtained from the (U_1 + U_2 + ... + U_T) pseudo texts tex_{t,u} may be weighted. For example, the bigram probability obtained by weighted mixing with a weight W is calculated by the following equation.

The trigram probability obtained by weighted mixing with the weight W is calculated by the following equation.

Here, the weight W is a value greater than 0; W = 1 means that the original texts tex_t and the pseudo texts tex_{t,u} are counted with equal weight. Because the word order of an original text tex_t is normally considered more reliable than that of a pseudo text tex_{t,u}, W is preferably set to 1 or less. For example, W is set to the value that yields the language model with the highest recognition accuracy on a development set.
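The equations themselves do not survive in this text. As a hedged sketch only, one weighted mixture consistent with the description (W = 1 counts original and pseudo text equally) adds W times the pseudo-text count to the original-text count in both the numerator and the denominator. This form, the function name, and the toy counts are assumptions, not the patent's confirmed formula.

```python
from collections import Counter

def mixed_ngram_probability(ngram, count_g, count_s, weight):
    """Assumed mixture: (Count_G(ngram) + W * Count_S(ngram)) / (Count_G(history) + W * Count_S(history)).

    count_g, count_s: Counters keyed by word tuples, covering both the n-grams
    and their (n-1)-word histories.
    """
    history = ngram[:-1]
    numerator = count_g[ngram] + weight * count_s[ngram]
    denominator = count_g[history] + weight * count_s[history]
    return numerator / denominator if denominator else 0.0

# Toy counts taken from the example sentence (Count_G) and from pseudo text (1) (Count_S):
# 「を」 occurs once in each; 「を-今日」 only in the original, 「を-食べる」 only in the pseudo text.
count_g = Counter({("を",): 1, ("を", "今日"): 1})
count_s = Counter({("を",): 1, ("を", "食べる"): 1})

p_equal = mixed_ngram_probability(("を", "食べる"), count_g, count_s, weight=1.0)
p_half = mixed_ngram_probability(("を", "食べる"), count_g, count_s, weight=0.5)
print(p_equal, p_half)
```

Under this assumed form, lowering W shifts probability mass toward the continuations seen in the original text, in line with the remark that W is preferably 1 or less. The same function handles trigrams when the Counters hold three-word tuples and their two-word histories.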

＜Effect＞
With this configuration, the number of n-gram patterns obtained from a single sentence (original text tex_t) can be increased, and a more accurate language model can be generated from a small text corpus than with the conventional technique.

＜Modification＞
The language model generation apparatus 100 need not include the morphological analysis unit 110 or the syntax analysis unit 120; it may instead take as input a morphological analysis result mor_t or a syntax analysis result syn_t obtained in advance by, for example, another apparatus.

A well-known technique called smoothing may also be used when generating the n-gram model (see Non-Patent Document 1).

＜Second embodiment＞
Only the differences from the first embodiment are described.

If the syntax analysis in the syntax analysis unit 120 contains an error, the error may cause pseudo texts with incorrect sentence patterns to be generated, which may degrade the performance of the language model generated downstream. The second embodiment therefore adds a processing unit that determines whether the word order of a pseudo text is plausible.

FIG. 5 is a functional block diagram of the language model generation apparatus 200, and FIG. 6 shows its processing flow.

The language model generation apparatus 200 includes a morphological analysis unit 210, the syntax analysis unit 120, the pseudo-text generation unit 130, and the language model generation unit 140, and further includes a pseudo-text selection unit 250.

＜Morphological analysis unit 210＞
・Input: original text tex_t
・Output: morphological analysis result mor'_t (the original text segmented into morphemes and annotated with part-of-speech information)
・Processing: The unit performs morphological analysis on the original text tex_t (s210), divides it into morphemes, assigns a part of speech to each morpheme, and outputs the morphological analysis result mor'_t. A conventional technique is used for morphological analysis. For example, morphological analysis of the original text 「私はあのりんごを今日友達と食べる」 yields a morphological analysis result mor'_t in which the words are delimited by "/" and annotated with parts of speech, as follows.
⇒ 私 (noun: pronoun)/は (adverbial particle)/あの (adnominal)/りんご (noun)/を (case particle: adverbial)/今日 (noun: date-time: adverbial)/友達 (noun)/と (case particle: adverbial)/食べる (verb)

＜Pseudo-text selection unit 250＞
・Input: pseudo texts tex'_{t,u} (these carry part-of-speech information because they are generated from the syntax analysis result syn'_t, which consists of the part-of-speech-annotated morphological analysis result mor'_t and information indicating the dependency relations between phrases), and the morphological analysis result mor'_t (the original text segmented into morphemes and annotated with part-of-speech information)
・Output: selected pseudo texts tex'_{t,y}
・Processing: Using the word sequences of the original texts tex_t, the unit determines whether the word sequence of each pseudo text tex'_{t,u} is correct, selects the pseudo texts tex'_{t,u} judged correct (s250), and outputs them to the language model generation unit 140 for use in training the language model. A pseudo text tex'_{t,u} judged incorrect is not selected and is not used for training. In the present embodiment, part-of-speech order is used for this determination: the part-of-speech order of the original texts tex_t is compared with that of each pseudo text tex'_{t,u}, and pseudo texts with a plausible part-of-speech order are selected. Two methods of determining whether the part-of-speech order of a pseudo text tex'_{t,u} is correct are described below.

（１）第一判定方法
図７及び図８を用いて、第一判定方法について説明する。疑似テキスト選択部２５０は、第一品詞情報取得部２５１、出現品詞列集合記憶部２５３、第二品詞情報取得部２５５及び判定部２５７を含む。まず、第一品詞情報取得部２５１は、形態素解析結果ｍｏｒ’_ｔからオリジナルテキストｔｅｘ_ｔに付加された品詞情報を取り出し（ｓ２５１）、Ｔ個のオリジナルテキストｔｅｘ_ｔの品詞の語順の集合を、出現品詞列集合として、出現品詞列集合記憶部２５３に格納する（ｓ２５３）。次に、第二品詞情報取得部２５５は、疑似テキストｔｅｘ’_ｔ，ｕに付加された品詞情報から、疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順を取り出し（ｓ２５５）、判定部２５７に出力する。判定部２５７は、疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順を受け取り、出現品詞列集合記憶部２５３内の出現品詞列集合に同様の品詞の語順が存在するか否かを判定し（ｓ２５７）、存在する場合には、その品詞の語順は確からしいと判断し、その品詞の語順に対応する疑似テキストｔｅｘ’_ｔ，ｕを選択し（ｓ２５８）、選択疑似テキストｔｅｘ’_ｔ，ｙとして言語モデル生成部１４０に出力する。ただし、ｙ＝１，２，…，Ｙ_ｔであり、Ｙ_ｔはあるオリジナルテキストｔｅｘ_ｔから得られるＵ_ｔ個の疑似テキストｔｅｘ’_ｔ，ｕから選択される選択疑似テキストｔｅｘ’_ｔ，ｙの個数である。存在しない場合には、その疑似テキストｔｅｘ’_ｔ，ｕは本来正しくない文型であると判断し、選択しない。 (1) First determination method
The first determination method is described with reference to FIGS. 7 and 8. The pseudo-text selection unit 250 includes a first part-of-speech information acquisition unit 251, an appearance part-of-speech sequence set storage unit 253, a second part-of-speech information acquisition unit 255, and a determination unit 257. First, the first part-of-speech information acquisition unit 251 extracts, from the morphological analysis result mor'_t, the part-of-speech information added to the original text tex_t (s251), and stores the set of part-of-speech orders of the T original texts tex_t in the appearance part-of-speech sequence set storage unit 253 as the appearance part-of-speech sequence set (s253). Next, the second part-of-speech information acquisition unit 255 extracts the part-of-speech order of the pseudo-text tex'_{t,u} from the part-of-speech information added to it (s255) and outputs it to the determination unit 257. The determination unit 257 receives the part-of-speech order of the pseudo-text tex'_{t,u} and determines whether the same part-of-speech order exists in the appearance part-of-speech sequence set in the storage unit 253 (s257). If it exists, the unit judges the part-of-speech order to be plausible, selects the corresponding pseudo-text tex'_{t,u} (s258), and outputs it to the language model generation unit 140 as a selected pseudo-text tex'_{t,y}. Here y = 1, 2, ..., Y_t, where Y_t is the number of selected pseudo-texts tex'_{t,y} chosen from the U_t pseudo-texts tex'_{t,u} obtained from a given original text tex_t. If it does not exist, the unit judges that the pseudo-text tex'_{t,u} has an inherently incorrect sentence pattern and does not select it.

なお、疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順と、出現品詞列集合記憶部２５３内の出現品詞列集合に含まれる品詞の語順とは、必ずしも全て同じである必要はなく、所定の割合（例えば、９０％）以上、同じである場合に、疑似テキストｔｅｘ’_ｔ，ｕを選択してもよい。言い換えると、疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順と出現品詞列集合に含まれる何れかの品詞の語順とが所定の割合以上一致する場合に、その疑似テキストｔｅｘ’_ｔ，ｕを選択してもよい。どの程度の語順が同じである場合に、疑似テキストｔｅｘ’_ｔ，ｕを選択するかは、認識精度がよくなるように実験的に定める。例えば、疑似テキストの品詞の語順が、１０個の品詞の語順からなるとき、出現品詞列集合から１０個の品詞の語順からなるものを取り出し、比較し、９個または１０個の品詞の語順を一致する場合に、その疑似テキストを選択する。なお、他の方法により一致の割合を求めてもよい。 Note that the part-of-speech order of the pseudo-text tex'_{t,u} need not be entirely identical to a part-of-speech order in the appearance part-of-speech sequence set in the storage unit 253; the pseudo-text tex'_{t,u} may be selected when the two agree at or above a predetermined ratio (for example, 90%). In other words, the pseudo-text tex'_{t,u} may be selected when its part-of-speech order matches any part-of-speech order in the appearance part-of-speech sequence set at or above the predetermined ratio. The degree of agreement required for selection is determined experimentally so that recognition accuracy improves. For example, when the part-of-speech order of a pseudo-text consists of ten parts of speech, the sequences consisting of ten parts of speech are retrieved from the appearance part-of-speech sequence set and compared, and the pseudo-text is selected if nine or ten of the parts of speech match. The matching ratio may also be obtained by other methods.

オリジナルテキストコーパスのコーパスサイズが十分に大きくない場合に、疑似テキストの品詞の語順が出現品詞列集合に同様の品詞の語順が存在する（言い換えると、所定の割合が１００％である）ことを選択の条件にすると、出現品詞列集合に含まれる品詞の語順の種類が少ないため、多くの疑似テキストは選択されない。そうすると、疑似コーパス及び学習コーパスのコーパスサイズが小さくなるため、結果として言語モデルの精度が低くなる可能性がある。そのような場合に、一致の割合を低くすることで、疑似コーパス及び学習コーパスのコーパスサイズを大きくし、結果として言語モデルの精度を向上させることができる。 When the original text corpus is not sufficiently large, requiring that the part-of-speech order of a pseudo-text exist in the appearance part-of-speech sequence set (in other words, setting the predetermined ratio to 100%) causes many pseudo-texts to be rejected, because the set contains few distinct part-of-speech orders. The pseudo corpus and the training corpus then become small, and the accuracy of the language model may drop as a result. In such a case, lowering the matching ratio enlarges the pseudo corpus and the training corpus and can thereby improve the accuracy of the language model.
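
As a rough illustration of the first determination method, the following Python sketch checks the part-of-speech order of a pseudo-text against the observed set. It assumes a position-by-position comparison between sequences of equal length, which is only one possible way to compute the matching ratio. The names `match_ratio` and `is_plausible` and the tag strings are hypothetical and do not come from the source.

```python
from typing import Iterable, Sequence


def match_ratio(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of positions where two equal-length POS sequences agree.

    Assumption: sequences of different length are treated as non-matching.
    """
    if not candidate or len(candidate) != len(reference):
        return 0.0
    agree = sum(c == r for c, r in zip(candidate, reference))
    return agree / len(candidate)


def is_plausible(candidate: Sequence[str],
                 observed: Iterable[Sequence[str]],
                 min_ratio: float = 1.0) -> bool:
    """True if the candidate matches any observed POS order at or above min_ratio."""
    return any(match_ratio(candidate, ref) >= min_ratio for ref in observed)


# Hypothetical POS orders extracted from the original texts.
observed_orders = [
    ("noun", "particle", "noun", "particle", "verb"),
    ("adnominal", "noun", "particle", "verb"),
]

exact = ("adnominal", "noun", "particle", "verb")
near = ("noun", "particle", "noun", "particle", "adjective")  # 4 of 5 positions agree

print(is_plausible(exact, observed_orders))                # identical order exists
print(is_plausible(near, observed_orders))                 # fails at a 100% ratio
print(is_plausible(near, observed_orders, min_ratio=0.8))  # passes at an 80% ratio
```

Lowering `min_ratio` corresponds to relaxing the predetermined ratio when the original corpus is small.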

（２）第二判定方法
図９及び図１０を用いて、第二判定方法について説明する。疑似テキスト選択部２５０は、第一品詞情報取得部２５１、出現品詞列集合記憶部２５３、第二品詞情報取得部２５５及び判定部２５７に加えて、品詞ｎ−ｇｒａｍ確率計算部２５８及び品詞ｎ−ｇｒａｍ確率記憶部２５９をさらに含む。第一品詞情報取得部２５１、出現品詞列集合記憶部２５３、第二品詞情報取得部２５５における処理は第一判定方法と同様である。 (2) Second determination method The second determination method will be described with reference to FIGS. 9 and 10. The pseudo-text selection unit 250 includes a part-of-speech n-gram probability calculation unit 258 and a part-of-speech n- in addition to the first part-of-speech information acquisition unit 251, the appearance part-of-speech sequence storage unit 253, the second part-of-speech information acquisition unit 255, and the determination unit 257. A gram probability storage unit 259 is further included. The processes in the first part-of-speech information acquisition unit 251, the appearance part-of-speech string set storage unit 253, and the second part-of-speech information acquisition unit 255 are the same as those in the first determination method.

品詞ｎ−ｇｒａｍ確率計算部２５８は、出現品詞列集合記憶部２５３内の出現品詞列集合を取り出し、出現品詞列集合内に含まれる品詞ｎ−ｇｒａｍパタンについての品詞ｎ−ｇｒａｍ確率を計算し（ｓ２５８）、品詞ｎ−ｇｒａｍ確率記憶部２５９に格納する（ｓ２５９、ただし図１０では品詞ｎ−ｇｒａｍ確率として品詞ｔｒｉｇｒａｍ確率を用いた場合を例示している）。例えば、出現品詞列集合内における品詞列Ｗの出現頻度をＣ（Ｗ）と表すとすると、品詞ｂｉｇｒａｍ確率、品詞ｔｒｉｇｒａｍ確率はそれぞれ以下のように計算される。ただし、次式において、Ａ，Ｂ，Ｃはそれぞれ品詞を表し、「−」は品詞の繋がりを表し、例えば、Ｂ−Ａは品詞Ｂの後に品詞Ａが出現することを表す。
品詞bigram確率:P(A|B)=C(B-A)/C(B)
品詞trigram確率:P(A|B-C)=C(B-C-A)/C(B-C) The part-of-speech n-gram probability calculation unit 258 extracts the appearance part-of-speech sequence set in the appearance part-of-speech sequence storage unit 253 and calculates the part-of-speech n-gram probability for the part-of-speech n-gram pattern included in the appearance part-of-speech sequence set ( s258) and stored in the part-of-speech n-gram probability storage unit 259 (s259, where FIG. 10 illustrates the case where the part-of-speech trigram probability is used as the part-of-speech n-gram probability). For example, if the appearance frequency of the part-of-speech string W in the appearance part-of-speech string set is expressed as C (W), the part-of-speech bigram probability and the part-of-speech trigram probability are calculated as follows. In the following expression, A, B, and C each represent a part of speech, “−” represents a connection of parts of speech, for example, B-A represents that part of speech A appears after part of speech B.
Part-of-speech bigram probability: P(A|B) = C(B-A) / C(B)
Part-of-speech trigram probability: P(A|B-C) = C(B-C-A) / C(B-C)
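
The count ratios above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the appearance part-of-speech sequence set is represented as a list of tag tuples, C(W) counts every occurrence of the tag string W in that list, and the function names `ngrams` and `pos_ngram_probs` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def ngrams(seq: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a POS sequence, in order of appearance."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def pos_ngram_probs(observed: Iterable[Sequence[str]],
                    n: int = 3) -> Dict[Tuple[str, ...], float]:
    """Estimate P(last tag | preceding n-1 tags) as C(n-gram) / C(history)."""
    if n < 2:
        raise ValueError("n must be at least 2 (bigram or higher)")
    gram_counts: Counter = Counter()
    hist_counts: Counter = Counter()
    for seq in observed:
        gram_counts.update(ngrams(seq, n))      # e.g. C(B-C-A) for n = 3
        hist_counts.update(ngrams(seq, n - 1))  # e.g. C(B-C) for n = 3
    return {g: c / hist_counts[g[:-1]] for g, c in gram_counts.items()}


# Hypothetical POS orders of two original texts.
orders = [("N", "P", "V"), ("N", "P", "N")]

trigram = pos_ngram_probs(orders, n=3)
bigram = pos_ngram_probs(orders, n=2)
print(trigram[("N", "P", "V")])  # C(N-P-V) / C(N-P)
print(bigram[("N", "P")])        # C(N-P) / C(N)
```

No smoothing is applied in this sketch, so a pattern that never occurs in the observed orders simply has no entry in the returned dictionary.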

判定部２５７は、疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順を受け取り、疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順から得られる品詞ｎ−ｇｒａｍパタンに対応する品詞ｎ−ｇｒａｍ確率を品詞ｎ−ｇｒａｍ確率記憶部２５９から取り出す（ｓ２５７ａ）。例えば、疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順として、（連体詞）（名詞：代名詞）（連用助詞）（名詞）（格助詞：連用）（名詞）（格助詞：連用）（名詞：日時：連用）（動詞）を受け取った場合、以下の七つの品詞ｔｒｉｇｒａｍパタンに対応する品詞ｔｒｉｇｒａｍ確率を品詞ｎ−ｇｒａｍ確率記憶部２５９から取り出す。
１．（連体詞）−（名詞：代名詞）−（連用助詞）
２．（名詞：代名詞）−（連用助詞）−（名詞）
３．（連用助詞）−（名詞）−（格助詞：連用）
４．（名詞）−（格助詞：連用）−（名詞）
５．（格助詞：連用）−（名詞）−（格助詞：連用）
６．（名詞）−（格助詞：連用）−（名詞：日時：連用）
７．（格助詞：連用）−（名詞：日時：連用）−（動詞）
取り出した品詞ｎ−ｇｒａｍ確率と事前に定めた閾値と比較し（ｓ２５７ｂ）、閾値以上の場合、その品詞の語順は確からしいと判断し、その品詞の語順に対応する疑似テキストｔｅｘ’_ｔ，ｕを選択し（ｓ２５８）、選択疑似テキストｔｅｘ’_ｔ，ｙとして言語モデル生成部１４０に出力する。閾値未満の場合には、その疑似テキストｔｅｘ’_ｔ，ｕは本来正しくない文型であると判断し、選択しない。 The determination unit 257 receives the part-of-speech order of the pseudo-text tex'_{t,u} and retrieves, from the part-of-speech n-gram probability storage unit 259, the part-of-speech n-gram probabilities corresponding to the part-of-speech n-gram patterns obtained from that order (s257a). For example, when it receives (adnominal) (noun: pronoun) (continuative particle) (noun) (case particle: continuative) (noun) (case particle: continuative) (noun: date/time: continuative) (verb) as the part-of-speech order of the pseudo-text tex'_{t,u}, it retrieves the part-of-speech trigram probabilities corresponding to the following seven part-of-speech trigram patterns from the part-of-speech n-gram probability storage unit 259.
1. (adnominal)-(noun: pronoun)-(continuative particle)
2. (noun: pronoun)-(continuative particle)-(noun)
3. (continuative particle)-(noun)-(case particle: continuative)
4. (noun)-(case particle: continuative)-(noun)
5. (case particle: continuative)-(noun)-(case particle: continuative)
6. (noun)-(case particle: continuative)-(noun: date/time: continuative)
7. (case particle: continuative)-(noun: date/time: continuative)-(verb)
The retrieved part-of-speech n-gram probabilities are compared with a predetermined threshold (s257b). If they are equal to or greater than the threshold, the part-of-speech order is judged plausible, the corresponding pseudo-text tex'_{t,u} is selected (s258), and it is output to the language model generation unit 140 as a selected pseudo-text tex'_{t,y}. If they are below the threshold, the pseudo-text tex'_{t,u} is judged to have an inherently incorrect sentence pattern and is not selected.

閾値と比較する方法としては以下のような方法が考えられる。 The following method can be considered as a method of comparing with the threshold.

（ｉ）取り出した品詞ｎ−ｇｒａｍ確率の平均値を求め、平均値と閾値とを比較する。平均値が閾値以上の場合、その品詞の語順は確からしいと判断する。 (I) The average value of the extracted part-of-speech n-gram probabilities is obtained, and the average value is compared with a threshold value. If the average value is greater than or equal to the threshold, it is determined that the word order of the part of speech is likely.

（ｉｉ）取り出した品詞ｎ−ｇｒａｍ確率のそれぞれと閾値とを比較し、Ｍ_ｔ，ｕ個の品詞ｎ−ｇｒａｍ確率が閾値以上の場合、その品詞の語順は確からしいと判断する。ただし、疑似テキストｔｅｘ’_ｔ，ｕに含まれる品詞ｎ−ｇｒａｍパタンの個数をＮ_ｔ，ｕ個とすると、Ｍ_ｔ，ｕ≦［ＶＮ_ｔ，ｕ］であり、０＜Ｖ≦１とし、［・］は・以下の最大の整数を表す。なお、Ｖは認識精度がよくなるように実験的に定める。 (ii) Each of the retrieved part-of-speech n-gram probabilities is compared with the threshold, and if M_{t,u} of them are equal to or greater than the threshold, the part-of-speech order is judged plausible. Here, letting N_{t,u} be the number of part-of-speech n-gram patterns contained in the pseudo-text tex'_{t,u}, M_{t,u} ≦ [V N_{t,u}] with 0 < V ≦ 1, where [x] denotes the largest integer not exceeding x. V is determined experimentally so that recognition accuracy improves.
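
The two comparison methods (i) and (ii) can be sketched in Python as below. Several points are assumptions made only for illustration: the stored probabilities are a dictionary keyed by tag tuples, a pattern missing from the dictionary is given probability 0.0, and method (ii) is read as requiring at least floor(V * N) of the N probabilities to reach the threshold. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def lookup_probs(pos_order: Sequence[str],
                 probs: Dict[Tuple[str, ...], float],
                 n: int = 3) -> List[float]:
    """Probabilities of the n-gram patterns in pos_order (unseen patterns -> 0.0)."""
    return [probs.get(tuple(pos_order[i:i + n]), 0.0)
            for i in range(len(pos_order) - n + 1)]


def accept_by_mean(p_list: Sequence[float], threshold: float) -> bool:
    """Method (i): accept when the mean probability reaches the threshold."""
    return bool(p_list) and sum(p_list) / len(p_list) >= threshold


def accept_by_count(p_list: Sequence[float], threshold: float, v: float) -> bool:
    """Method (ii): accept when at least floor(v * N) probabilities reach the threshold."""
    required = math.floor(v * len(p_list))
    passed = sum(p >= threshold for p in p_list)
    return passed >= required


# Hypothetical stored trigram probabilities and a candidate POS order.
stored = {("A", "B", "C"): 0.6, ("B", "C", "D"): 0.1}
p_list = lookup_probs(("A", "B", "C", "D", "E"), stored)  # three trigram patterns

print(p_list)
print(accept_by_mean(p_list, threshold=0.2))
print(accept_by_count(p_list, threshold=0.5, v=0.5))
```

With a small V and a short sequence, floor(V * N) can be 0, in which case method (ii) as sketched here accepts every candidate; a real implementation would need to decide how to treat that case.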

＜言語モデル生成部１４０＞
言語モデル生成部１４０は、入力として、疑似テキスト生成部１３０で生成された（Ｕ_１＋Ｕ_２＋…＋Ｕ_Ｔ）個の疑似テキストｔｅｘ’_ｔ，ｕ全てではなく、その中から疑似テキスト選択部２５０で選択された（Ｙ_１＋Ｙ_２＋…＋Ｙ_Ｔ）個の選択疑似テキストｔｅｘ’_ｔ，ｙのみを用いて、言語モデルを生成する（ｓ１４０）。言語モデルを生成方法は第一実施形態と同様である。 <Language model generation unit 140>
The language model generation unit 140 takes as input not all of the (U_1 + U_2 + ... + U_T) pseudo-texts tex'_{t,u} generated by the pseudo-text generation unit 130 but only the (Y_1 + Y_2 + ... + Y_T) selected pseudo-texts tex'_{t,y} chosen from among them by the pseudo-text selection unit 250, and generates a language model (s140). The method for generating the language model is the same as in the first embodiment.

＜効果＞
このような構成により、第一実施形態と同様の効果を得ることができる。さらに、本来正しくない文型の疑似テキストｔｅｘ’_ｔ，ｕを用いて言語モデルを生成することを防ぎ、言語モデルの性能劣化を防止することができる。 <Effect>
With such a configuration, the same effect as that of the first embodiment can be obtained. Furthermore, it is possible to prevent a language model from being generated using pseudo-text tex ′ _{t, u} having an originally incorrect sentence type, and to prevent performance degradation of the language model.

＜第三実施形態＞
第二実施形態と異なる部分についてのみ説明する。 <Third embodiment>
Only parts different from the second embodiment will be described.

第三実施形態では、言語モデル生成部１４０において、オリジナルテキストｔｅｘ_ｔと疑似テキストｔｅｘ_ｔ，ｕの重みＷ（式（１）や式（２）参照）を疑似テキストｔｅｘ_ｔ，ｕ毎に変える。生成される疑似テキストｔｅｘ_ｔ，ｕにおいて、「確からしさ」の観点から、Ｔ個のオリジナルテキストｔｅｘ_ｔと同等の頻度を与えてよさそうな語順や、間違いではないがあまり使われない語順であるといったことも考えられる。そこで第三実施形態では、重みＷを疑似テキストｔｅｘ_ｔ，ｕ毎に算出する処理を加える。 In the third embodiment, the language model generation unit 140 changes the weight W (see formula (1) and formula (2)) between the original text tex _t and the pseudo text tex _{t, u} for each pseudo text tex _{t, u} . In the generated pseudo-text tex _{t, u} , from the viewpoint of “probability”, it is a word order that is likely to give the same frequency as the T original text tex _t , or a word order that is not mistaken but is not often used. It can also be considered. Therefore, in the third embodiment, a process of calculating the weight W for each pseudo text tex _{t, u} is added.

図１１は言語モデル生成装置３００の機能ブロック図を、図１２はその処理フローを示す。 FIG. 11 is a functional block diagram of the language model generation apparatus 300, and FIG. 12 shows its processing flow.

言語モデル生成装置３００は、形態素解析部２１０、構文解析部１２０、疑似テキスト生成部１３０、言語モデル生成部１４０、疑似テキスト選択部２５０を含み、さらに疑似テキスト重み算出部３７０を含む。 The language model generation apparatus 300 includes a morphological analysis unit 210, a syntax analysis unit 120, a pseudo text generation unit 130, a language model generation unit 140, a pseudo text selection unit 250, and further includes a pseudo text weight calculation unit 370.

＜疑似テキスト重み算出部３７０＞
・入力：（品詞情報が付加されている）選択疑似テキストｔｅｘ’_ｔ，ｙ、形態素解析結果（形態素単位に分かち書きされ、品詞情報が付加されたオリジナルテキスト）ｍｏｒ’_ｔ
・出力：選択疑似テキストｔｅｘ’_ｔ，ｙ毎の重みＷ_ｔ，ｙ
・処理内容：Ｔ個のオリジナルテキストｔｅｘ_ｔの品詞の語順と同じ品詞の語順を多く持つ選択疑似テキストｔｅｘ’_ｔ，ｙほど、大きな重みＷ_ｔ，ｙを算出し（ｓ３７０）、選択疑似テキストｔｅｘ’_ｔ，ｙとともに言語モデル生成部１４０に出力する。重みＷ_ｔ，ｙの算出方法としては、例えば以下の方法がある。 <Pseudo Text Weight Calculation Unit 370>
Input: selected pseudo-text tex'_{t,y} (with part-of-speech information added), and the morphological analysis result mor'_t (the original text segmented into morphemes, with part-of-speech information added)
Output: weight W_{t,y} for each selected pseudo-text tex'_{t,y}
Processing: The unit calculates a larger weight W_{t,y} for a selected pseudo-text tex'_{t,y} that shares more part-of-speech orders with the T original texts tex_t (s370), and outputs the weight to the language model generation unit 140 together with the selected pseudo-text tex'_{t,y}. The weight W_{t,y} can be calculated, for example, by the following methods.

第二実施形態で用いた出現品詞列集合及び品詞ｎ−ｇｒａｍ確率を用いて、重みＷ_ｔ，ｙを算出する。ただし、品詞ｎ−ｇｒａｍ確率は、０から１の値をとる。なお、品詞ｎ−ｇｒａｍ確率が大きければ「語順的に確からしい」ことを意味し、品詞ｎ−ｇｒａｍ確率が小さければ「語順的に誤りらしい」ことを意味する。 The weight W_{t,y} is calculated using the appearance part-of-speech sequence set and the part-of-speech n-gram probabilities used in the second embodiment. A part-of-speech n-gram probability takes a value from 0 to 1. A large part-of-speech n-gram probability means that the word order is likely correct, and a small one means that the word order is likely wrong.

疑似テキスト重み算出部３７０は、品詞情報が付加されている選択疑似テキストｔｅｘ’_ｔ，ｙから、品詞の語順を取り出す。以下に、重みＷ_ｔ，ｙを決定する方法を三つ説明する。 The pseudo text weight calculation unit 370 extracts the word order of the part of speech from the selected pseudo text tex ′ _{t, y} to which the part of speech information is added. Hereinafter, three methods for determining the weight W _{t, y} will be described.

（１）第一決定方法
疑似テキスト重み算出部３７０は、出現品詞列集合記憶部２５３内の出現品詞列集合に含まれる何れかの品詞の語順と疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順とが所定の割合（例えば、９５％）以上一致するか否かを判定し、一致する場合には、その疑似テキストｔｅｘ’_ｔ，ｕの品詞の語順は確からしいと判断し、重みＷ_ｔ，ｙの値を大きな値Ａ_１とする。一致しない場合には、その疑似テキストｔｅｘ’_ｔ，ｙは本来正しくない文型であると判断し、重みＷ_ｔ，ｙの値を小さな値Ａ_２とする。 (1) First determination method The pseudo-text weight calculation unit 370 determines whether the part-of-speech order of the pseudo-text tex'_{t,u} matches any part-of-speech order in the appearance part-of-speech sequence set in the storage unit 253 at or above a predetermined ratio (for example, 95%). If it does, the unit judges the part-of-speech order of the pseudo-text tex'_{t,u} to be plausible and sets the weight W_{t,y} to a large value A_1. If it does not, the unit judges that the pseudo-text tex'_{t,y} has an inherently incorrect sentence pattern and sets the weight W_{t,y} to a small value A_2.

以下の第二決定方法及び第三決定方法の場合、疑似テキスト重み算出部３７０は、さらに、疑似テキストｔｅｘ’_ｔ，ｙの品詞の語順から得られる品詞ｎ−ｇｒａｍパタンに対応する品詞ｎ−ｇｒａｍ確率を疑似テキスト選択部２５０内の品詞ｎ−ｇｒａｍ確率記憶部２５９から取り出す。 In the case of the following second determination method and third determination method, the pseudo text weight calculation unit 370 further includes a part of speech n-gram corresponding to the part of speech n-gram pattern obtained from the word order of the part of speech of the pseudo text tex ′ _{t, y.} The probability is extracted from the part-of-speech n-gram probability storage unit 259 in the pseudo text selection unit 250.

（２）第二決定方法
取り出した品詞ｎ−ｇｒａｍ確率と事前に定めた閾値Ｘとを比較し、閾値Ｘ以上の場合、その品詞の語順は確からしいと判断し、重みＷ_ｔ，ｙの値を大きな値Ａ_１とする。閾値Ｘ未満の場合には、その疑似テキストｔｅｘ’_ｔ，ｙは本来正しくない文型であると判断し、重みＷ_ｔ，ｙの値を小さな値Ａ_２とする。ただし、Ａ_１＞Ａ_２である。Ｘ、Ａ_１、Ａ_２は事前に開発セットの認識精度が最大になるように定めておく。例えば、Ｘ、Ａ_１、Ａ_２は、様々な値の組合せを用意して、言語モデルとしての認識精度がよくなるように実験的に定める。なお、Ｘは、０に近づけると全ての品詞の語順が許容されることになるため、品詞の語順による重み付けの意味がなくなる。また、この例では、閾値Ｘ以上、または、閾値Ｘ未満の二つのパタンに分類したが、Ｎ個の閾値Ｘ_ｎを設け（ただし、Ｎは２以上の整数であり、ｎ＝１，２，…，Ｎであり、Ｘ_１＜Ｘ_２＜…＜Ｘ_Ｎ）、（Ｎ＋１）個のパタンに分類しても問題ない。閾値の個数が増えることで、重みＷ_ｔ，ｙの表現能力が向上し、言語モデルの性能が向上すると考えられる。一方で事前に決めるパラメータ数（閾値Ｘ_１，Ｘ_２，…，Ｘ_Ｎや、（Ｎ＋１）個のパタンに対応する（Ｎ＋１）個の値Ａ_１、Ａ_２，…，Ａ_Ｎ＋１）が増えるため計算コストが増大する。 (2) Second determination method
The retrieved part-of-speech n-gram probabilities are compared with a predetermined threshold X. If they are equal to or greater than X, the part-of-speech order is judged plausible and the weight W_{t,y} is set to a large value A_1. If they are below X, the pseudo-text tex'_{t,y} is judged to have an inherently incorrect sentence pattern and the weight W_{t,y} is set to a small value A_2, where A_1 > A_2. X, A_1, and A_2 are set in advance so that recognition accuracy on a development set is maximized; for example, various combinations of values are prepared and the combination is chosen experimentally so that recognition accuracy as a language model improves. Note that as X approaches 0, every part-of-speech order is accepted, so weighting by part-of-speech order loses its meaning. In this example pseudo-texts are classified into two patterns, at or above X and below X, but N thresholds X_n may be provided instead (where N is an integer of 2 or more, n = 1, 2, ..., N, and X_1 < X_2 < ... < X_N) to classify them into (N + 1) patterns. Increasing the number of thresholds is expected to make the weight W_{t,y} more expressive and to improve the performance of the language model. On the other hand, the number of parameters to be set in advance (the thresholds X_1, X_2, ..., X_N and the (N + 1) values A_1, A_2, ..., A_{N+1} corresponding to the (N + 1) patterns) grows, so the computational cost increases.

なお、閾値と比較する方法としては、疑似テキスト選択部２５０と同様の方法を用いることができる。つまり、以下のように比較する。 As a method for comparing with the threshold value, a method similar to that for the pseudo text selecting unit 250 can be used. That is, the comparison is made as follows.

（ｉ）取り出した品詞ｎ−ｇｒａｍ確率の平均値を求め、平均値と閾値Ｘとを比較する。平均値が閾値Ｘ以上の場合、その品詞の語順は確からしいと判断する。 (I) The average value of the extracted part-of-speech n-gram probabilities is obtained, and the average value is compared with the threshold value X. When the average value is equal to or greater than the threshold value X, it is determined that the word order of the part of speech is likely.

（ｉｉ）取り出した品詞ｎ−ｇｒａｍ確率のそれぞれと閾値Ｘとを比較し、Ｍ個の品詞ｎ−ｇｒａｍ確率が閾値Ｘ以上の場合、その品詞の語順は確からしいと判断する。 (Ii) Each of the extracted part-of-speech n-gram probabilities is compared with a threshold X, and if the M part-of-speech n-gram probabilities are equal to or greater than the threshold X, it is determined that the word order of the part of speech is likely.

（３）第三決定方法
そもそも品詞ｎ−ｇｒａｍ確率が大きければ、「語順的に確からしい」ことを意味し、品詞ｎ−ｇｒａｍ確率が小さければ「語順的に誤りらしい」ことを意味するので、取り出した品詞ｎ−ｇｒａｍ確率の平均値を求め、その平均値（または平均値に所定の値を乗じた値）を重みとして利用する。 (3) Third determination method Since a large part-of-speech n-gram probability means that the word order is likely correct and a small one means that it is likely wrong, the average of the retrieved part-of-speech n-gram probabilities is computed, and that average (or the average multiplied by a predetermined value) is used as the weight.
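
The threshold-based and average-based weight assignments can be sketched in Python as follows. This is an illustration under assumptions, not the method as specified: the score compared with the thresholds is taken to be the mean probability, the weight values are listed from the lowest band to the highest (so with a single threshold X the list is [A_2, A_1] in the notation of the text), and the function names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def mean_prob(p_list: Sequence[float]) -> float:
    """Mean of the retrieved POS n-gram probabilities (0.0 for an empty list)."""
    return sum(p_list) / len(p_list) if p_list else 0.0


def weight_by_thresholds(score: float,
                         thresholds: Sequence[float],
                         values: Sequence[float]) -> float:
    """Map a score to one of N+1 weights using N ascending thresholds.

    A score equal to a threshold falls into the higher band ("at or above X").
    """
    if len(values) != len(thresholds) + 1:
        raise ValueError("values must have exactly one more entry than thresholds")
    return values[bisect_right(thresholds, score)]


def weight_by_mean(p_list: Sequence[float], scale: float = 1.0) -> float:
    """Use the mean probability, optionally scaled, directly as the weight."""
    return scale * mean_prob(p_list)


# Hypothetical probabilities retrieved for one selected pseudo-text.
p_list = [0.5, 0.25]

print(weight_by_thresholds(mean_prob(p_list), [0.3], [0.1, 1.0]))            # one threshold
print(weight_by_thresholds(mean_prob(p_list), [0.2, 0.6], [0.1, 0.5, 1.0]))  # two thresholds
print(weight_by_mean(p_list, scale=2.0))
```

The threshold version needs N thresholds and N + 1 weight values to be tuned, whereas the mean-based version needs at most one scale factor.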

＜言語モデル生成部１４０＞
言語モデル生成部１４０は、オリジナルテキストｔｅｘ_ｔ、選択疑似テキストｔｅｘ’_ｔ，ｙ及び重みＷ_ｔ，ｙを受け取り、式（１）または（２）等により、ｎ−ｇｒａｍ確率を計算し、言語モデルを生成する（ｓ１４０）。言語モデルを生成方法は第二実施形態と同様である。ただし、式（１）または（２）等において、選択疑似テキストｔｅｘ’_ｔ，ｙ毎に、重みＷに代えて、重みＷ_ｔ，ｙを用いて計算する。 <Language model generation unit 140>
The language model generation unit 140 receives the original text tex_t, the selected pseudo-texts tex'_{t,y}, and the weights W_{t,y}, calculates n-gram probabilities using equation (1), (2), or the like, and generates a language model (s140). The method for generating the language model is the same as in the second embodiment, except that in equation (1), (2), or the like, the weight W_{t,y} is used in place of the weight W for each selected pseudo-text tex'_{t,y}.

＜効果＞
このような構成により、第二実施形態と同様の効果を得ることができる。さらに、より確からしい語順を持つ選択疑似テキストｔｅｘ’_ｔ，ｙに対して、大きな重みＷ_ｔ，ｙを与え、言語モデルの精度を向上させることができる。 <Effect>
With such a configuration, the same effect as that of the second embodiment can be obtained. Furthermore, it is possible to improve the accuracy of the language model by giving a large weight W _{t, y} to the selected pseudo-text tex ′ _{t, y} having a more certain word order.

＜変形例＞
第二実施形態の言語モデル生成装置２００に疑似テキスト重み算出部３７０を加えた構成となっているが、第一実施形態の言語モデル生成装置１００に加えてもよい。この場合、疑似テキスト重み算出部３７０や言語モデル生成部１４０では、選択疑似テキストｔｅｘ’_ｔ，ｙに代えて、品詞情報が付加されている疑似テキストｔｅｘ’_ｔ，ｕを用いる。よって、第一実施形態の形態素解析部１１０に代えて、第二実施形態の形態素解析部２１０を用い、オリジナルテキストｔｅｘ_ｔを形態素単位に分割し、分割した各形態素に品詞を付与して、形態素解析結果ｍｏｒ’_ｔを出力する。また、この場合、疑似テキスト重み算出部３７０において、品詞ｎ−ｇｒａｍ確率を求め、図示しない記憶部に格納する。 <Modification>
In the above, the pseudo-text weight calculation unit 370 is added to the language model generation apparatus 200 of the second embodiment, but it may instead be added to the language model generation apparatus 100 of the first embodiment. In that case, the pseudo-text weight calculation unit 370 and the language model generation unit 140 use the pseudo-texts tex'_{t,u} with part-of-speech information added, in place of the selected pseudo-texts tex'_{t,y}. Accordingly, the morphological analysis unit 210 of the second embodiment is used in place of the morphological analysis unit 110 of the first embodiment; it divides the original text tex_t into morphemes, assigns a part of speech to each morpheme, and outputs the morphological analysis result mor'_t. Also in that case, the pseudo-text weight calculation unit 370 computes the part-of-speech n-gram probabilities and stores them in a storage unit (not shown).

＜その他の変形例＞
また、本発明は上記の実施形態及び変形例に限定されるものではない。例えば、上述の各種の処理は、記載に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。その他、本発明の趣旨を逸脱しない範囲で適宜変更が可能である。 <Other variations>
Further, the present invention is not limited to the above-described embodiments and modifications. For example, the various processes described above are not only executed in time series according to the description, but may also be executed in parallel or individually as required by the processing capability of the apparatus that executes the processes. In addition, it can change suitably in the range which does not deviate from the meaning of this invention.

＜プログラム及び記録媒体＞
上述した言語モデル生成装置は、コンピュータにより機能させることもできる。この場合はコンピュータに、目的とする装置（各種実施形態で図に示した機能構成をもつ装置）として機能させるためのプログラム、またはその処理手順（各実施形態で示したもの）の各過程をコンピュータに実行させるためのプログラムを、ＣＤ−ＲＯＭ、磁気ディスク、半導体記憶装置などの記録媒体から、あるいは通信回線を介してそのコンピュータ内にダウンロードし、そのプログラムを実行させればよい。 <Program and recording medium>
The language model generation apparatus described above can also be implemented by a computer. In that case, a program for causing the computer to function as the target apparatus (an apparatus having the functional configuration shown in the figures for the embodiments), or a program for causing the computer to execute each step of the processing procedure (shown in each embodiment), is loaded into the computer from a recording medium such as a CD-ROM, a magnetic disk, or a semiconductor storage device, or downloaded via a communication line, and then executed.

１００，２００，３００言語モデル生成装置
１１０，２１０形態素解析部
１２０構文解析部
１３０疑似テキスト生成部
１４０言語モデル生成部
２５０疑似テキスト選択部
２５１第一品詞情報取得部
２５３出現品詞列集合記憶部
２５５第二品詞情報取得部
２５７判定部
２５８確率計算部
２５９確率記憶部
３７０疑似テキスト重み算出部
100, 200, 300 Language model generation apparatus
110, 210 Morphological analysis unit
120 Syntax analysis unit
130 Pseudo-text generation unit
140 Language model generation unit
250 Pseudo-text selection unit
251 First part-of-speech information acquisition unit
253 Appearance part-of-speech sequence set storage unit
255 Second part-of-speech information acquisition unit
257 Determination unit
258 Probability calculation unit
259 Probability storage unit
370 Pseudo-text weight calculation unit

Claims

形態素単位に分かち書きされ、文節の係り受け関係が付加されたオリジナルテキストを用いて、係り受け先が同じである複数の文節を並び替えて、疑似テキストを生成する疑似テキスト生成部と、
前記オリジナルテキストにおけるｎ−ｇｒａｍパタンの出現頻度及び前記疑似テキストにおけるｎ−ｇｒａｍパタンの出現頻度を用いてｎ−ｇｒａｍ確率を求め、言語モデルを生成する言語モデル生成部とを含み、
前記オリジナルテキストには、さらに各形態素に対して品詞情報が付加されているものとし、
前記オリジナルテキストの品詞の語順と前記疑似テキストの品詞の語順とを比較して、確からしい品詞の語順を持つ疑似テキストを選択する疑似テキスト選択部をさらに含み、
前記言語モデル生成部は、前記オリジナルテキストにおけるｎ−ｇｒａｍパタンの出現頻度及び前記疑似テキスト選択部において選択された前記疑似テキストにおけるｎ−ｇｒａｍパタンの出現頻度を用いてｎ−ｇｒａｍ確率を求め、言語モデルを生成する、
言語モデル生成装置。 A pseudo-text generation unit that generates pseudo-text by rearranging a plurality of clauses having the same dependency destination using original text that is divided into morpheme units and added with dependency relationships of clauses;
A language model generation unit that obtains n-gram probabilities using the appearance frequency of n-gram patterns in the original text and the appearance frequency of n-gram patterns in the pseudo-text, and generates a language model,
Part-of-speech information is further added to each morpheme in the original text,
A pseudo-text selection unit that compares the part-of-speech order of the original text with the part-of-speech order of the pseudo-text and selects pseudo-text having a plausible part-of-speech order;
The language model generation unit obtains the n-gram probabilities using the appearance frequency of n-gram patterns in the original text and the appearance frequency of n-gram patterns in the pseudo-text selected by the pseudo-text selection unit, and generates the language model,
Language model generation apparatus.

請求項１記載の言語モデル生成装置であって、
前記疑似テキスト選択部は、
前記オリジナルテキストに付加されている品詞情報を取り出す第一品詞情報取得部と、
前記オリジナルテキストの品詞の語順の集合である出現品詞列集合を記憶する出現品詞列集合記憶部と、
前記疑似テキストに付加された品詞情報から、前記疑似テキストの品詞の語順を取り出す第二品詞情報取得部と、
前記疑似テキストの品詞の語順と前記出現品詞列集合に含まれる何れかの品詞の語順とが所定の割合以上一致する場合に、その疑似テキストを選択する判定部と、を含む、
言語モデル生成装置。 The language model generation device according to claim 1 ,
The pseudo-text selection unit
A first part-of-speech information acquisition unit that extracts part-of-speech information added to the original text;
An appearance part-of-speech sequence storage unit that stores an appearance part-of-speech sequence set that is a set in the word order of the part of speech of the original text;
A second part-of-speech information acquisition unit that extracts a word order of the part-of-speech of the pseudo-text from the part-of-speech information added to the pseudo-text;
A determination unit that selects the pseudo-text when the word order of the part-of-speech of the pseudo-text and the word order of any part-of-speech included in the appearance part-of-speech sequence set match a predetermined ratio or more,
Language model generator.

請求項１記載の言語モデル生成装置であって、
前記疑似テキスト選択部は、
前記オリジナルテキストに付加されている品詞情報を取り出す第一品詞情報取得部と、
前記オリジナルテキストの品詞の語順の集合である出現品詞列集合を記憶する出現品詞列集合記憶部と、
前記疑似テキストに付加された品詞情報から、前記疑似テキストの品詞の語順を取り出す第二品詞情報取得部と、
前記出現品詞列集合に含まれる品詞ｎ−ｇｒａｍパタンについての品詞ｎ−ｇｒａｍ確率を計算する品詞ｎ−ｇｒａｍ確率計算部と、
前記品詞ｎ−ｇｒａｍ確率を記憶する品詞ｎ−ｇｒａｍ確率記憶部と、
前記疑似テキストの品詞の語順から得られる品詞ｎ−ｇｒａｍパタンに対応する品詞ｎ−ｇｒａｍ確率を前記品詞ｎ−ｇｒａｍ確率記憶部から取り出し、取り出した品詞ｎ−ｇｒａｍ確率と事前に定めた閾値とを比較し、閾値以上の場合、その品詞の語順に対応する疑似テキストを選択する判定部と、を含む、
言語モデル生成装置。 The language model generation device according to claim 1 ,
The pseudo-text selection unit
A first part-of-speech information acquisition unit that extracts part-of-speech information added to the original text;
An appearance part-of-speech sequence storage unit that stores an appearance part-of-speech sequence set that is a set in the word order of the part of speech of the original text;
A second part-of-speech information acquisition unit that extracts a word order of the part-of-speech of the pseudo-text from the part-of-speech information added to the pseudo-text;
A part-of-speech n-gram probability calculator for calculating a part-of-speech n-gram probability for a part-of-speech n-gram pattern included in the appearance part-of-speech sequence set;
A part-of-speech n-gram probability storage unit for storing the part-of-speech n-gram probability;
A determination unit that retrieves, from the part-of-speech n-gram probability storage unit, the part-of-speech n-gram probabilities corresponding to the part-of-speech n-gram patterns obtained from the part-of-speech order of the pseudo-text, compares the retrieved probabilities with a predetermined threshold, and selects the pseudo-text corresponding to that part-of-speech order when they are equal to or greater than the threshold,
Language model generator.

請求項１から請求項３の何れかに記載の言語モデル生成装置であって、
前記オリジナルテキストには、さらに各形態素に対して品詞情報が付加されているものとし、
前記オリジナルテキストの品詞の語順と同じ品詞の語順を多く持つ疑似テキストほど、大きな重みを算出する言語モデル重み算出部をさらに含み、
前記言語モデル生成部は、前記オリジナルテキストにおけるｎ−ｇｒａｍパタンの出現頻度及び前記疑似テキストにおけるｎ−ｇｒａｍパタンの出現頻度に対して前記重みにより重み付けを行い、ｎ−ｇｒａｍ確率を求め、言語モデルを生成する、
言語モデル生成装置。 The language model generation device according to any one of claims 1 to 3 ,
Part-of-speech information is further added to each morpheme in the original text,
A language model weight calculation unit that calculates a larger weight for pseudo-text that has more part-of-speech orders identical to those of the original text,
The language model generation unit weights the appearance frequency of n-gram patterns in the original text and the appearance frequency of n-gram patterns in the pseudo-text with the weight, obtains n-gram probabilities, and generates a language model,
Language model generation apparatus.

形態素単位に分かち書きされ、文節の係り受け関係が付加されたオリジナルテキストを用いて、係り受け先が同じである複数の文節を並び替えて、疑似テキストを生成する疑似テキスト生成ステップと、
前記オリジナルテキストにおけるｎ−ｇｒａｍパタンの出現頻度及び前記疑似テキストにおけるｎ−ｇｒａｍパタンの出現頻度を用いてｎ−ｇｒａｍ確率を求め、言語モデルを生成する言語モデル生成ステップとを含み、
前記オリジナルテキストには、さらに各形態素に対して品詞情報が付加されているものとし、
前記オリジナルテキストの品詞の語順と前記疑似テキストの品詞の語順とを比較して、確からしい品詞の語順を持つ疑似テキストを選択する疑似テキスト選択ステップをさらに含み、
前記言語モデル生成ステップにおいて、前記オリジナルテキストにおけるｎ−ｇｒａｍパタンの出現頻度及び前記疑似テキスト選択ステップにおいて選択された前記疑似テキストにおけるｎ−ｇｒａｍパタンの出現頻度を用いてｎ−ｇｒａｍ確率を求め、言語モデルを生成する、
言語モデル生成方法。 A pseudo-text generation step of rearranging a plurality of clauses having the same dependency destination to generate pseudo-text by using original text that is divided into morpheme units and to which a dependency relationship of clauses is added;
A language model generation step of obtaining n-gram probabilities using the appearance frequency of n-gram patterns in the original text and the appearance frequency of n-gram patterns in the pseudo-text, and generating a language model,
Part-of-speech information is further added to each morpheme in the original text,
A pseudo-text selecting step of comparing the word order of the part of speech of the original text with the word order of the part of speech of the pseudo-text to select a pseudo-text having a probable part-of-speech word order;
In the language model generation step, an n-gram probability is obtained using the appearance frequency of the n-gram pattern in the original text and the appearance frequency of the n-gram pattern in the pseudo text selected in the pseudo text selection step. Generate a model,
Language model generation method.

請求項１から請求項４の何れかに記載の言語モデル生成装置としてコンピュータを機能させるためのプログラム。 A program for causing a computer to function as the language model generation apparatus according to any one of claims 1 to 4.