JP6568429B2

JP6568429B2 - Pronunciation sequence expansion device and program thereof

Info

Publication number: JP6568429B2
Application number: JP2015167821A
Authority: JP
Inventors: 麻乃一木; 庄衛佐藤; 彰夫小林
Original assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2015-08-27
Filing date: 2015-08-27
Publication date: 2019-08-28
Anticipated expiration: 2035-08-27
Also published as: JP2017044901A

Description

The present invention relates to a pronunciation sequence expansion device that expands the pronunciation sequences of a pronunciation dictionary with pronunciation sequences of uttered speech, and to a program therefor.

Speech recognition usually relies on a pronunciation dictionary in which each word is associated with its pronunciation sequence (phoneme string). The pronunciations registered in this dictionary are the readings of words as given in general dictionaries.
However, the written reading of a word often differs from the way it is actually pronounced. In broadcast programs, for example, the speech of performers in information programs is often less distinct than the accurate pronunciation (close to that of the pronunciation dictionary) of announcers in news programs.
When the pronunciation actually uttered deviates from the reading registered in the pronunciation dictionary in this way, the uttered pronunciation no longer matches the word. Speech recognition then selects a word whose pronunciation is close to what was uttered, and misrecognition occurs.

In recent years, many studies have addressed such pronunciation variation (deformation).
For example, a method has been disclosed in which, in addition to the standard pronunciations, transcription texts that faithfully transcribe uttered speech are registered in the pronunciation dictionary as examples of pronunciation variation (Non-Patent Document 1).
This method of using transcription texts of uttered speech requires enormous effort: the rules of pronunciation variation must be registered manually, and a large number of examples on which those rules are based must be collected.

The inventors of the present application therefore proposed a method that treats pronunciation variation as variation of phoneme strings and expands the pronunciation sequences of a pronunciation dictionary with the pronunciation sequences (phoneme strings) of uttered speech (see Non-Patent Document 2).
With this method, pronunciation sequences that reflect the pronunciation variation of actual utterances can be added to a pronunciation dictionary without using transcription texts of the uttered speech.

Non-Patent Document 1: Tsutsumi, Kato, Kosaka, Kohda, "Lecture Speech Recognition Using Pronunciation Deformation Dependence Model", IEICE Transactions, vol. J89-D, No. 2, pp. 305-313, 2006
Non-Patent Document 2: Ichiki, Onoe, Oku, Kobayashi, Sato, "Automatic generation of pronunciation sequences based on phoneme translation models learned from large-scale corpora", Acoustical Society of Japan, Proceedings of the Spring Meeting, 1-1-9 (2015)

The method of Non-Patent Document 2 is advantageous in that pronunciation sequences can be expanded without using transcription texts of uttered speech, but it leaves room for further improvement.
Non-Patent Document 2 does not mention how to optimize the weight parameters of the features (translation parameters) used in statistical machine translation. Here, a feature is a piece of information used as a criterion for the various decisions made in machine translation.
In general, statistical machine translation has standard features such as the likelihoods of the language model and the translation model, the phrase penalty, and the word penalty, and the weights of these features are optimized as translation parameters and used for translation.
Specifically, let the weight λ_k of feature k (language model likelihood, etc.) be a translation parameter, let e be a translation candidate, and let f_k(e) be the feature value (language model likelihood value, etc.) of that candidate. Statistical machine translation then outputs, as the translation result ê (e hat), the translation candidate (hypothesis) e that maximizes the weighted sum of the f_k(e), as shown in Equation (1).
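Equation (1) itself is not reproduced in this text. Based only on the definitions given above (weights λ_k, feature values f_k(e), and the maximizing candidate ê), it presumably takes the usual log-linear form sketched below; the exact notation of the original drawing is an assumption.

```latex
\hat{e} = \operatorname*{argmax}_{e} \sum_{k} \lambda_k \, f_k(e) \tag{1}
```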

When translating between ordinary languages, these parameters are tuned so that the evaluation improves, based on an evaluation value that measures, for example, whether the translation result and a reference translation have the same meaning. Here, a reference translation is generally one that an expert has actually produced by translating an evaluation sentence in the source language.
The feature weights of statistical machine translation are usually optimized by applying an optimization algorithm with respect to an evaluation value such as BLEU (BiLingual Evaluation Understudy), the Levenshtein distance, or RIBES (Rank-based Intuitive Bilingual Evaluation Score).
However, when pronunciation sequences are expanded as in the method of Non-Patent Document 2, no reference translation exists to serve as the evaluation standard, and no evaluation value suited to pronunciation sequence expansion is known. The weight parameters (translation parameters) of features such as the language model and the translation model therefore could not be optimized.

The present invention has been made in view of this problem. Its object is to provide a pronunciation sequence expansion device, and a program therefor, that can expand a pronunciation dictionary by adjusting the translation parameters when the pronunciation sequences of the pronunciation dictionary are expanded with the pronunciation sequences (phoneme strings) of uttered speech.

To solve the above problem, the pronunciation sequence expansion device according to the present invention expands the pronunciation sequences of headwords in a pronunciation dictionary by using the pronunciation dictionary, in which each headword is associated with a phoneme string indicating its pronunciation sequence, an acoustic model of context-dependent phonemes, and a learning corpus in which speech is associated with its transcription text. The device comprises phoneme string generation means, context-dependent phoneme pronunciation dictionary generation means, context-dependent phoneme n-gram model generation means, phoneme recognition means, statistical machine translation model generation means, translation means, parameter setting means, expansion means, and pronunciation dictionary selection means.

In this configuration, the pronunciation sequence expansion device uses the phoneme string generation means to generate phoneme strings by performing phoneme alignment (forced alignment) for each headword of the pronunciation dictionary, using the acoustic model of context-dependent phonemes and the pronunciation dictionary.
Here, a context-dependent phoneme is a phoneme that takes into account the phonemes preceding and following a particular phoneme, or the position of the phoneme within a word (word-initial, word-medial, word-final, etc.). Typical examples are the triphone, composed of three phonemes, and the quintphone, composed of five phonemes.
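As an illustration of what a triphone is, the following minimal sketch expands a monophone sequence into triphone labels. The `left-center+right` label notation, the `sil` boundary symbol, and the function name `to_triphones` are assumptions made for this example and are not taken from the patent.

```python
def to_triphones(phonemes, boundary="sil"):
    """Expand a monophone sequence into left-center+right triphone labels.

    The label notation and the boundary symbol are illustrative assumptions.
    """
    labels = []
    for i, center in enumerate(phonemes):
        # Use the boundary symbol where no neighbouring phoneme exists.
        left = phonemes[i - 1] if i > 0 else boundary
        right = phonemes[i + 1] if i + 1 < len(phonemes) else boundary
        labels.append(f"{left}-{center}+{right}")
    return labels


print(to_triphones(["k", "a", "k", "o"]))
# prints ['sil-k+a', 'k-a+k', 'a-k+o', 'k-o+sil']
```

Each label keeps the same center phoneme but distinguishes it by its neighbours, which is what lets a context-dependent model capture variation caused by adjacent sounds.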

The pronunciation sequence expansion device then uses the context-dependent phoneme pronunciation dictionary generation means to generate a phoneme pronunciation dictionary whose entries pair a headword with its pronunciation sequence.
The pronunciation sequence expansion device also uses the context-dependent phoneme n-gram model generation means to statistically model the connection probabilities (n-grams) of context-dependent phonemes.
Because the phoneme sequences are handled in units of context-dependent phonemes in this way, a pronunciation dictionary and a language model that enable speech recognition in units of phonemes are generated.

The pronunciation sequence expansion device then uses the phoneme recognition means to recognize the speech of the learning corpus in units of phonemes with the context-dependent phoneme pronunciation dictionary and the context-dependent phoneme n-gram model. This makes it possible to generate phoneme strings that more accurately represent the pronunciation variation arising from the preceding and following sounds.

The pronunciation sequence expansion device then uses the statistical machine translation model generation means to learn, as parallel data, the standard phoneme strings, which are monophone phoneme strings generated by the phoneme string generation means, and the actual-utterance phoneme strings, which are the phoneme strings recognized by the phoneme recognition means. From these it generates a statistical machine translation model, that is, a probabilistic model for translating an arbitrary phrase of a standard phoneme string into an arbitrary phrase of an actual-utterance phoneme string. The result is a model for translating from standard phoneme strings, which represent accurate pronunciation, into actual-utterance phoneme strings, which include pronunciation variation.

The pronunciation sequence expansion device then uses the translation means, with the translation parameters that have been set, to find by the statistical machine translation model the character string for which the probability that the phoneme string of a headword registered in the pronunciation dictionary is translated into that character string is maximized. By translating the phoneme string of the headword in this way, it generates a phoneme string that adds pronunciation variation to the correct phoneme string registered in the pronunciation dictionary.
The pronunciation sequence expansion device then uses the expansion means to add the phoneme string translated by the translation means to the headword as a phoneme string indicating a new pronunciation sequence, thereby generating an expanded pronunciation dictionary.
At this time, the parameter setting means sets a plurality of translation parameters corresponding to the one or more features used by the translation means, so that the expansion means generates, for each different set of translation parameters, one of a plurality of extended pronunciation dictionary candidates.

The pronunciation sequence expansion device then uses the pronunciation dictionary selection means to select an extended pronunciation dictionary from among the plurality of extended pronunciation dictionary candidates, based on speech that is known learning data and the word string corresponding to that speech.
This selection may choose the extended pronunciation dictionary candidate that maximizes the acoustic likelihood of the word string corresponding to the known speech. Alternatively, it may choose the extended pronunciation dictionary candidate that minimizes the edit distance between the phoneme string obtained by phoneme recognition of the known speech and the maximum-likelihood phoneme string of the corresponding word string obtained with that candidate.
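The edit-distance variant of the selection can be sketched as follows. This is a simplified illustration, not the patented procedure: it assumes the recognized phoneme strings are already aligned to words, and it uses the closest pronunciation in each candidate dictionary as a stand-in for the maximum-likelihood phoneme string. The names `edit_distance` and `select_dictionary`, the word label `w1`, and the toy data are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def select_dictionary(candidates, aligned_data):
    """Return the name of the candidate dictionary with the smallest total distance.

    candidates:   name -> {word: [pronunciation, ...]}, each pronunciation a phoneme list
    aligned_data: [(word, recognized phoneme list), ...]  (word-level alignment is assumed)
    """
    def cost(lexicon):
        # Simplification: the closest pronunciation stands in for the
        # maximum-likelihood phoneme string of the word.
        return sum(min(edit_distance(p, rec) for p in lexicon[word])
                   for word, rec in aligned_data)

    return min(candidates, key=lambda name: cost(candidates[name]))


# Toy data: candidate "B" adds a variant without the geminate "Q".
candidates = {
    "A": {"w1": [["g", "a", "Q", "k", "o"]]},
    "B": {"w1": [["g", "a", "Q", "k", "o"], ["g", "a", "k", "o"]]},
}
data = [("w1", ["g", "a", "k", "o"])]
print(select_dictionary(candidates, data))
```

In this toy case candidate "B" is chosen because its added variant matches the recognized phoneme string exactly, whereas candidate "A" is one deletion away.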

The pronunciation sequence expansion device can be operated by a pronunciation sequence expansion program that causes a computer to function as the phoneme string generation means, the context-dependent phoneme pronunciation dictionary generation means, the context-dependent phoneme n-gram model generation means, the phoneme recognition means, the statistical machine translation model generation means, the translation means, the parameter setting means, the expansion means, and the pronunciation dictionary selection means.

The present invention provides the following advantageous effects.
According to the present invention, the pronunciation sequences of a pronunciation dictionary can be expanded in consideration of the pronunciation variation of actual utterances. Furthermore, for headwords in the pronunciation dictionary in which similar pronunciation variation occurs, the pronunciation sequences can be expanded by statistical means.
In addition, according to the present invention, an appropriate extended pronunciation dictionary can be selected, based on known learning data, from among extended pronunciation dictionary candidates generated with different translation parameters. Setting a larger number of translation parameters makes it possible to optimize the extended pronunciation dictionary.
Consequently, by using a pronunciation dictionary expanded according to the present invention, speech can be recognized accurately even for non-standard pronunciations that were not previously registered in the pronunciation dictionary.

FIG. 1 is a block diagram showing the configuration of the pronunciation sequence expansion device according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of the phrase table generated by the phrase translation model generation means of FIG. 1.
FIG. 3 is a diagram showing an example of the word dictionary used by the phoneme n-gram model generation means of FIG. 1.
FIG. 4 is a diagram showing an example of an expanded pronunciation dictionary (extended pronunciation dictionary).
FIG. 5 is an explanatory diagram illustrating a speech recognition environment that uses the extended pronunciation dictionary.
FIG. 6 is a flowchart showing the operation of the pronunciation sequence expansion device according to the first embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of the pronunciation sequence expansion device according to the second embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the pronunciation sequence expansion device according to the second embodiment of the present invention.

Embodiments of the present invention will be described below with reference to the drawings.
<< First Embodiment >>
[Configuration of Pronunciation Sequence Expansion Device]
First, the configuration of the pronunciation sequence expansion device 1 according to the first embodiment of the present invention will be described with reference to FIG. 1.

The pronunciation sequence expansion device 1 generates an extended pronunciation dictionary 103 by expanding the pronunciations (pronunciation sequences) registered in the pronunciation dictionary 100 with the associated pronunciations (pronunciation sequences) of actual utterances. Using the pronunciation dictionary 100, the acoustic model 101, and the learning corpus 102, the device adds pronunciation sequences that are not registered in the pronunciation dictionary 100, thereby generating the extended pronunciation dictionary 103.

The pronunciation dictionary 100 is the dictionary from which the expansion starts. For each headword (here, a word), which is a predetermined character string, it gives the sequence of consonants and vowels (phoneme string) that indicates the pronunciation sequence.
The pronunciation dictionary 100 may be a conventional pronunciation dictionary in which character strings (words) have been manually associated with their pronunciation sequences. Alternatively, when an extended pronunciation dictionary 103 produced by the pronunciation sequence expansion device 1 is to be expanded further, that extended pronunciation dictionary 103 may be used as the pronunciation dictionary 100.

The acoustic model 101 models, with a hidden Markov model (HMM), the acoustic features (mel-frequency cepstral coefficients, etc.) of each phoneme, learned in advance from a large amount of speech data. The acoustic model 101 in this embodiment is a triphone HMM as used in conventional speech recognition.
The likelihood of the acoustic features in the acoustic model 101 may be computed with either a conventional Gaussian mixture model (GMM) acoustic model or a deep neural network (DNN) acoustic model.

The learning corpus 102 is data in which a large amount of speech data (speech corpus) is associated in advance with the transcription text of that speech data (text corpus). The learning corpus 102 consists, for example, of about 1,000 hours of speech (speech corpus) by announcers, reporters, and others in news programs, information programs, and the like, together with the text transcribed from that speech (text corpus).
The pronunciation dictionary 100, the acoustic model 101, and the learning corpus 102 are each assumed to be stored in storage means (not shown).

The extended pronunciation dictionary 103 is the dictionary obtained when the pronunciation sequence expansion device 1 expands the pronunciation dictionary 100. That is, in the extended pronunciation dictionary 103, the pronunciation sequences (phoneme strings) of the character strings registered in the pronunciation dictionary 100 are supplemented with pronunciation sequences (phoneme strings) that occur in actual utterances.
The configuration of the pronunciation sequence expansion device 1 is described in detail below.

As shown in FIG. 1, the pronunciation sequence expansion device 1 includes phoneme string generation means 10, context-dependent pronunciation dictionary generation means 11, context-dependent phoneme n-gram model generation means 12, phoneme recognition means 13, statistical machine translation model generation means 14, translation means 15, expansion means 16, parameter setting means 17, and pronunciation dictionary selection means 18.

The phoneme string generation means 10 performs forced alignment of the speech (speech corpus) of the learning corpus 102 based on the pronunciation dictionary 100 and the acoustic model 101, thereby segmenting the speech into phoneme strings corresponding to the character strings registered in the pronunciation dictionary 100.
Specifically, the phoneme string generation means 10 extracts, from the speech of the learning corpus 102, acoustic features of the kind used by the acoustic model 101. Then, using the context-dependent HMM of the acoustic model 101, it performs speech recognition with the transcription text of the speech (text corpus) as prior knowledge. In this way it segments the speech according to the character strings (headwords) registered in the pronunciation dictionary 100 (forced alignment) and extracts the context-dependent phoneme string corresponding to each character string.

The phoneme string generation means 10 stores each generated context-dependent phoneme 10t as one word in storage means (not shown). The context-dependent phonemes 10t are used by the context-dependent pronunciation dictionary generation means 11 and the context-dependent phoneme n-gram model generation means 12.
The phoneme string generation means 10 also converts each context-dependent phoneme string into a context-independent phoneme string, which serves as a standard phoneme string 10m. This context-independent phoneme string is, for example, a monophone (single-phoneme) string obtained by extracting the center phoneme of each context-dependent phoneme. The standard phoneme strings 10m are stored in storage means (not shown) and used by the statistical machine translation model generation means 14.
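The conversion from context-dependent phonemes to a monophone string by extracting the center phoneme can be sketched as below. The `left-center+right` triphone notation is an assumption (the patent text does not specify a label format), and the function names are hypothetical.

```python
import re

# Assumed label format: optional "left-" prefix, center phoneme, optional "+right" suffix.
_LABEL = re.compile(r"(?:[^-+]+-)?([^-+]+)(?:\+[^-+]+)?")


def center_phoneme(label):
    """Return the center phoneme of a left-center+right label."""
    match = _LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized context-dependent label: {label!r}")
    return match.group(1)


def to_monophones(labels):
    """Convert a context-dependent phoneme string into a monophone string."""
    return [center_phoneme(label) for label in labels]


print(to_monophones(["sil-k+a", "k-a+k", "a-k+o", "k-o+sil"]))
# prints ['k', 'a', 'k', 'o']
```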

The context-dependent pronunciation dictionary generation means 11 generates a phoneme pronunciation dictionary (context-dependent pronunciation dictionary 11t) whose entries pair a headword with its pronunciation sequence.
The context-dependent pronunciation dictionary generation means 11 stores the generated context-dependent pronunciation dictionary 11t in storage means (not shown). The context-dependent pronunciation dictionary 11t is used by the phoneme recognition means 13.

The context-dependent phoneme n-gram model generation means 12 generates an n-gram model (context-dependent phoneme n-gram model 12t) from the plurality of context-dependent phonemes 10t generated by the phoneme string generation means 10. The context-dependent phoneme n-gram model 12t is a statistical n-gram model built from those context-dependent phonemes 10t.

The context-dependent phoneme n-gram model 12t models, among other things, the frequency of occurrence of the context-dependent phonemes 10t. Because it is generated in the same way as a conventional word-based language model, a detailed description is omitted here.
The context-dependent phoneme n-gram model generation means 12 stores the generated context-dependent phoneme n-gram model 12t in storage means (not shown). The context-dependent phoneme n-gram model 12t is used by the phoneme recognition means 13.
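For readers unfamiliar with n-gram language models, the following minimal sketch estimates bigram probabilities over phoneme tokens by relative frequency. It is only an illustration of the general technique: practical language model toolkits add smoothing and back-off, and the function name `train_bigram`, the sentence markers, and the toy data are assumptions.

```python
from collections import Counter


def train_bigram(sequences):
    """Estimate P(current | previous) by relative frequency (no smoothing)."""
    history = Counter()
    pairs = Counter()
    for seq in sequences:
        tokens = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            history[prev] += 1        # how often prev occurs as a history
            pairs[(prev, cur)] += 1   # how often cur follows prev
    return {pair: count / history[pair[0]] for pair, count in pairs.items()}


# Toy phoneme sequences; each token would be one context-dependent phoneme.
model = train_bigram([["a", "b"], ["a", "c"]])
print(model[("a", "b")])
# prints 0.5
```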

The phoneme recognition means 13 recognizes phonemes from the speech of the learning corpus 102 using the acoustic model 101, the context-dependent pronunciation dictionary 11t, and the context-dependent phoneme n-gram model 12t.
The phoneme recognition means 13 extracts acoustic features from the speech of the learning corpus 102, lists candidate context-dependent phonemes using the acoustic model 101 and the context-dependent pronunciation dictionary 11t, and takes as the recognition result the phoneme string with the maximum connection probability under the context-dependent phoneme n-gram model 12t.
That is, whereas ordinary speech recognition recognizes speech in units of words, the phoneme recognition means 13 recognizes it in units of context-dependent phonemes and generates the phoneme string of the actual utterance.

Because the phoneme recognition means 13 thus takes the dependence on phoneme context into account, it can recognize phonemes accurately.
The phoneme recognition means 13 stores the recognized phoneme strings (actual-utterance phoneme strings 13m) in storage means (not shown). The actual-utterance phoneme strings 13m are used by the statistical machine translation model generation means 14.

The statistical machine translation model generation means 14 generates a translation model (statistical machine translation model) whose source language is the standard phoneme strings 10m, generated using the original pronunciation dictionary 100, and whose target language is the actual-utterance phoneme strings 13m, generated using the context-dependent pronunciation dictionary 11t and the context-dependent phoneme n-gram model 12t.

Here, the statistical machine translation model is formulated by Bayes' theorem, as in Equation (2), as a model that produces as the translation result ê (e hat) the target language e for which the probability that the source language f is translated into e is maximized.
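Equation (2) itself is not reproduced in this text. Given that it is derived by Bayes' theorem and is described in terms of Pr(e | f), the translation model Pr(f | e), and the language model Pr(e), it presumably has the standard noisy-channel form below; the exact layout of the original is an assumption.

```latex
\hat{e} = \operatorname*{argmax}_{e} \Pr(e \mid f)
        = \operatorname*{argmax}_{e} \Pr(f \mid e)\,\Pr(e) \tag{2}
```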

In Equation (2), Pr(e | f) is the conditional probability that the source language f is translated into the target language e. Pr(f | e) is the translation model (phrase translation model) and represents the conditional probability that the target language e is translated into the source language f. Pr(e) is the language model of the target language e and represents the prior probability of e.
Here, the statistical machine translation model generation means 14 includes phrase translation model generation means 141 and phoneme n-gram model generation means 142.

The phrase translation model generation means 141 takes as parallel data the standard phoneme strings 10m, which are the phoneme strings generated by the phoneme string generation means 10, and the actual-utterance phoneme strings 13m, which are the context-dependent phoneme strings generated by the phoneme recognition means 13. From these it generates a translation model (phrase translation model) in which a phrase of a standard phoneme string 10m (source language phrase) is translated into a phrase of an actual-utterance phoneme string 13m (target language phrase). That is, the phrase translation model generation means 141 generates the translation model Pr(f | e) of Equation (2) as the phrase translation model 141m.

Any common technique can be used to generate the translation model from the parallel data. For example, a tool such as Moses, described in P. Koehn et al., “Moses: Open Source Toolkit for Statistical Machine Translation” (Proceedings of the ACL 2007 Demo and Poster Sessions, pages 177-180), can be used.

As the phrase translation model 141m, the phrase translation model generation unit 141 generates, for example, table information (phrase table FT) holding the "probability [%]" that a "source-language phrase" is translated into a "target-language phrase", as shown in FIG. 2.
For example, FIG. 2 shows that the probability that the source-language phrase "kaQkok" is translated into the target-language phrase "kakok" is 60.20 (%).
The "example" column of the phrase table FT in FIG. 2 shows, for reference only, words that contain each phrase as a phoneme string; it is not actually part of the table.
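
A lookup against such a phrase table can be sketched as follows. This is only an illustration: the dictionary layout and the function name `best_target` are hypothetical, and apart from the "kaQkok" to "kakok" entry (60.20 %) taken from the description of FIG. 2, the table contents are invented placeholders.

```python
# Hypothetical in-memory phrase table:
#   source phrase -> list of (target phrase, probability in %).
# Only the kaQkok -> kakok entry (60.20 %) comes from the description of FIG. 2;
# the second entry is an invented placeholder.
PHRASE_TABLE = {
    "kaQkok": [("kakok", 60.20), ("kaQkok", 39.80)],
}


def best_target(source_phrase, table=PHRASE_TABLE):
    """Return the most probable target phrase, or the source itself if unseen."""
    candidates = table.get(source_phrase)
    if not candidates:
        return source_phrase
    return max(candidates, key=lambda pair: pair[1])[0]
```

In an actual decoder the table probability is only one feature among several, so the most probable table entry is not necessarily the final translation.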

The phrase translation model generation unit 141 stores the generated phrase translation model 141m in a storage unit (not shown). The phrase translation model 141m is used by the translation unit 15 as part of the statistical machine translation model 14m.

The phoneme n-gram model generation unit 142 generates a language model (phoneme n-gram model 142m) from the target-language actual-utterance phoneme string 13m, which is the phoneme string generated by the phoneme recognition unit 13. The phoneme n-gram model 142m is a statistical model of n-grams of phonemes (monophones) in the actual-utterance phoneme string 13m. That is, the phoneme n-gram model generation unit 142 generates the language model Pr(e) of equation (2).

Here, the phoneme n-gram model generation unit 142 refers to a word dictionary (not shown) whose words are a predetermined number (for example, 40) of phonemes, takes the actual-utterance phoneme string 13m as training text, and generates the phoneme n-gram model 142m. The words of the word dictionary used by the phoneme n-gram model generation unit 142 are, for example, the phonemes shown in FIG. 3.

Whereas a general language model models statistics such as word occurrence frequencies, the phoneme n-gram model 142m models statistics such as phoneme occurrence frequencies. The phoneme n-gram model generation unit 142 therefore differs from conventional language model generation only in the unit being modeled (phonemes instead of words); the generation technique is the same, so a detailed description is omitted here.
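
As a minimal sketch of treating each phoneme as a word, the following builds an unsmoothed maximum-likelihood bigram model from phoneme sequences. The function name, the sentence-boundary symbols `<s>` and `</s>`, and the choice of n = 2 without smoothing are assumptions made for illustration; the description does not fix the order of the n-gram or the estimation method.

```python
from collections import Counter


def train_phoneme_bigram(sequences):
    """Maximum-likelihood bigram model over phoneme sequences (no smoothing).

    sequences: iterable of phoneme lists, e.g. [["k", "a", "k", "o"], ...]
    Returns a dict mapping (previous, current) -> P(current | previous).
    """
    history_counts, bigram_counts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return {pair: count / history_counts[pair[0]]
            for pair, count in bigram_counts.items()}
```

A practical model would add smoothing so that unseen phoneme pairs do not receive zero probability.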

The phoneme n-gram model generation unit 142 stores the generated phoneme n-gram model 142m in a storage unit (not shown). The phoneme n-gram model 142m is used by the translation unit 15 as part of the statistical machine translation model 14m.

The translation unit 15 uses the translation parameters set by the parameter setting unit 17 to translate the pronunciation sequence (phoneme string) of each headword registered in the original pronunciation dictionary 100, based on the statistical machine translation model 14m generated by the statistical machine translation model generation unit 14.
That is, in accordance with equation (2), for the pronunciation of a headword (corresponding to the phoneme string of the source language f), the translation unit 15 generates, as the translation of the headword (corresponding to the phoneme string of the target language e), the phoneme string that maximizes the joint probability given by Pr(f|e), the phrase translation model 141m, and Pr(e), the phoneme n-gram model 142m.

The translation parameters set by the parameter setting unit 17 are the feature weights used in statistical machine translation and form a group of parameters corresponding to one or more features. That is, the translation unit 15 generates the translation by equation (1), which expresses equation (2) in logarithmic form feature by feature, using the weight λ_k of each feature k.
When translation is performed with Moses, for example, the weighted features are the language model (phoneme n-gram model 142m), the translation model (phrase translation model 141m), the word penalty (a constraint on the length of the output), and so on.
The translation unit 15 generates a translation of the corresponding pronunciation sequence for each headword registered in the original pronunciation dictionary 100 and outputs it to the expansion unit 16.
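
The role of the feature weights can be sketched as a weighted sum of log-domain feature values, with the highest-scoring candidate chosen as the translation. The feature names `tm`, `lm` and `wp`, the candidate strings, and all numeric values below are hypothetical; a real decoder such as Moses searches over candidates rather than scoring a fixed list.

```python
import math


def loglinear_score(features, weights):
    """Weighted sum of log-domain feature values: sum over k of lambda_k * h_k."""
    return sum(weights[name] * value for name, value in features.items())


def pick_translation(candidates, weights):
    """candidates: dict mapping a candidate phoneme string to its feature dict."""
    return max(candidates, key=lambda c: loglinear_score(candidates[c], weights))


# Hypothetical feature values for two candidate pronunciations of one headword.
weights = {"tm": 1.0, "lm": 0.5, "wp": 0.0}
candidates = {
    "kakok": {"tm": math.log(0.6), "lm": math.log(0.2), "wp": -5.0},
    "kaQkok": {"tm": math.log(0.4), "lm": math.log(0.1), "wp": -6.0},
}
best = pick_translation(candidates, weights)
```

Changing the weights changes which candidate wins, which is why the parameter setting unit 17 varies them.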

The expansion unit 16 extends the pronunciation sequences of the headwords registered in the original pronunciation dictionary 100 with the phoneme strings that are the new pronunciations (translations) produced by the translation unit 15.
That is, each time a translation corresponding to a headword registered in the original pronunciation dictionary 100 is input from the translation unit 15, the expansion unit 16 compares the translation (phoneme string) with the pronunciation sequence (phoneme string) registered in the original pronunciation dictionary 100. If the translation and the original pronunciation sequence do not match, the expansion unit 16 adds the translation to that headword as a new pronunciation sequence.
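
The comparison-and-add step can be sketched as below. The dictionary layout (headword mapped to a list of phoneme strings), the function name, and the `translate` callable standing in for the translation unit 15 are assumptions for illustration. The sketch skips a translation that matches any pronunciation already listed, a slight generalization of the comparison with the original pronunciation sequence.

```python
def expand_dictionary(original, translate):
    """Build an expanded dictionary candidate.

    original:  dict mapping headword -> list of phoneme strings
    translate: callable mapping one phoneme string to its translated variant
    """
    expanded = {}
    for headword, pronunciations in original.items():
        variants = list(pronunciations)
        for pron in pronunciations:
            new_pron = translate(pron)
            if new_pron not in variants:  # add only when it differs
                variants.append(new_pron)
        expanded[headword] = variants
    return expanded
```

Because the original entries are copied first, the result contains both the original and the new pronunciation sequences, and the input dictionary is left unmodified.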

Here, the expansion unit 16 generates an extended pronunciation dictionary candidate 16d as a new pronunciation dictionary that holds the pronunciation sequences (the original and the new pronunciation sequences) for the headwords of the original pronunciation dictionary 100. Alternatively, the expansion unit 16 may simply add the new pronunciation sequences to the original pronunciation dictionary 100.
Each time the parameter setting unit 17 sets new translation parameters, the expansion unit 16 generates a new extended pronunciation dictionary candidate 16d and stores it in a storage unit (not shown). The expansion unit 16 also notifies the pronunciation dictionary selection unit 18 that a new extended pronunciation dictionary candidate 16d has been generated.

The parameter setting unit 17 sets the feature translation parameters used by the translation unit 15, changing them in sequence. That is, the parameter setting unit 17 changes the weight λ_k of each feature k in equation (1) as appropriate and outputs the weights to the translation unit 15, causing it to perform translation.
The parameter setting unit 17 sets different translation parameters by a grid search, in which the grid points are values within a predetermined range for the individual parameter of each feature making up the translation parameters. The parameter setting unit 17 does not have to vary the parameters of all features exhaustively; it may vary only predetermined parameters.
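
Enumerating the grid points of such a search can be sketched with the standard library. The feature names and value ranges shown in the test are hypothetical; a feature that is not to be varied is simply given a single value.

```python
from itertools import product


def weight_grid(ranges):
    """Yield one weight setting per grid point.

    ranges: dict mapping feature name -> iterable of candidate weight values
    """
    names = list(ranges)
    for values in product(*(ranges[name] for name in names)):
        yield dict(zip(names, values))
```

The number of settings is the product of the numbers of values per feature, so restricting the search to a few predetermined parameters keeps the number of dictionary candidates manageable.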

Each time it sets translation parameters, the parameter setting unit 17 notifies the expansion unit 16. When all changes to the translation parameters have been completed, the parameter setting unit 17 notifies the pronunciation dictionary selection unit 18.

The pronunciation dictionary selection unit 18 selects one of the plurality of extended pronunciation dictionary candidates 16d, 16d, ... generated in sequence by the expansion unit 16, based on known training data. As the criterion for selecting one extended pronunciation dictionary candidate 16d, the pronunciation dictionary selection unit 18 uses the acoustic likelihood for the known training data.
Here, the pronunciation dictionary selection unit 18 includes an acoustic likelihood calculation unit 181 and a maximum likelihood dictionary selection unit 182.

The acoustic likelihood calculation unit 181 uses the acoustic model 101 and the extended pronunciation dictionary candidate 16d to perform forced alignment (forced word alignment) of the word string (text) corresponding to speech that is known training data, and calculates the acoustic likelihood obtained when the word string is segmented into phoneme strings. The acoustic likelihood from forced alignment can be obtained by a common technique, for example with a tool such as the Kaldi toolkit.
Here, part of the training corpus 102 is used as the known speech and its corresponding word string, but speech different from the training corpus 102, together with a transcription of that speech, may be used instead.
The acoustic likelihood calculation unit 181 outputs the calculated acoustic likelihood to the maximum likelihood dictionary selection unit 182.

The maximum likelihood dictionary selection unit 182 selects the extended pronunciation dictionary candidate 16d with the highest acoustic likelihood calculated by the acoustic likelihood calculation unit 181.
Here, each time the expansion unit 16 generates an extended pronunciation dictionary candidate 16d, the maximum likelihood dictionary selection unit 182 compares the acoustic likelihoods, calculated by the acoustic likelihood calculation unit 181, of the previously retained candidate and the newly generated candidate. It keeps the candidate with the higher acoustic likelihood in the storage unit (not shown) and deletes the candidate with the lower acoustic likelihood from the storage unit (not shown).

When the parameter setting unit 17 notifies it that all changes to the translation parameters have been completed, the maximum likelihood dictionary selection unit 182 takes the extended pronunciation dictionary candidate 16d remaining in the storage unit (not shown) as the extended pronunciation dictionary 103.
The pronunciation dictionary selection unit 18 can thereby select the extended pronunciation dictionary candidate 16d generated with optimized translation parameters.
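
The keep-the-better-candidate procedure can be sketched as a running maximum. The `score` callable stands in for the acoustic likelihood obtained by forced alignment; it and the candidate labels used in the test are hypothetical, and no real alignment is performed here.

```python
def select_best(candidates, score):
    """Return (best candidate, best score), keeping only one candidate at a time.

    candidates: iterable of dictionary candidates, in generation order
    score:      callable returning the selection score of a candidate
    """
    best, best_score = None, float("-inf")
    for cand in candidates:
        s = score(cand)
        if s > best_score:  # the lower-scoring candidate is discarded
            best, best_score = cand, s
    return best, best_score
```

Only one candidate needs to be stored at any time, which matches the description of deleting the candidate with the lower likelihood after each comparison.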

Examples of phoneme strings added to the extended pronunciation dictionary 103 by the pronunciation sequence expansion device 1 are now described with reference to FIG. 4.
As shown in FIG. 4, the extended pronunciation dictionary 103 consists of a "headword", an "original phoneme string", and an "additional phoneme string". The "headword" and the "original phoneme string" are the same as those registered in the original pronunciation dictionary 100, and the "additional phoneme string" is the one added by the pronunciation sequence expansion device 1.

For example, FIG. 4 shows a case in which the original phoneme string "onagawawaN" is registered for the headword "女川湾" (Onagawa Bay) and the pronunciation sequence expansion device 1 has added the additional phoneme string "onagawaN". Similarly, the original phoneme string "shizugawawaN" is registered for the headword "志津川湾" (Shizugawa Bay), and the pronunciation sequence expansion device 1 has added the additional phoneme string "shizugawaN".
In this way, when the phoneme string "gawawa" within a word changes in pronunciation to "gawa" because it is hard to pronounce, the pronunciation sequence expansion device 1 makes it unnecessary to set such a variation rule manually for each case.

FIG. 4 also shows a case in which the original phoneme string "ho:mugurauNdo" is registered for the headword "ホームグラウンド" (home ground) and the pronunciation sequence expansion device 1 has added the additional phoneme string "ho:murauNdo".
In this way, the pronunciation sequence expansion device 1 can also add to the pronunciation dictionary the dropping of the back-of-the-tongue consonant "g", which is hard to pronounce in a long word.

The extended pronunciation dictionary 103 generated by the pronunciation sequence expansion device 1 can be used in a general speech recognition device, for example a large-vocabulary continuous speech recognition device. In that case, as shown in FIG. 5 for example, the large-vocabulary continuous speech recognition device 200 recognizes the input speech using the existing acoustic model 101 and language model 104 in addition to the extended pronunciation dictionary 103 generated by the pronunciation sequence expansion device 1, and outputs the recognition result.

With the configuration described above, the pronunciation sequence expansion device 1 can add pronunciation variations to the pronunciation dictionary using the training corpus 102. The pronunciation sequence expansion device 1 can also generate the extended pronunciation dictionary with optimized feature weight parameters (translation parameters).
The pronunciation sequence expansion device 1 can be realized by a program (pronunciation sequence expansion program) that causes a computer (not shown) to function as each of the units described above.

[Operation of the pronunciation sequence expansion device]
Next, the operation of the pronunciation sequence expansion device 1 according to the first embodiment of the present invention is described with reference to FIG. 6 (see FIG. 1 as appropriate for the configuration).

First, the pronunciation sequence expansion device 1 generates a context-dependent phoneme string and a context-independent phoneme string from the speech (speech corpus) of the training corpus 102.
That is, the pronunciation sequence expansion device 1 uses the phoneme string generation unit 10 to perform forced alignment of the speech (speech corpus) of the training corpus 102 based on the pronunciation dictionary 100 and the acoustic model 101, and generates the context-dependent phoneme string corresponding to the character strings registered in the pronunciation dictionary 100 (step S1).

The phoneme string generation unit 10 then generates a context-independent phoneme string of single phonemes from the context-dependent phoneme string generated in step S1 (step S2).
The phoneme string generated in step S2 is the source-language phoneme string (standard phoneme string 10m) used later in step S6.
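
Step S2 can be sketched as stripping the context from each label. The label format is an assumption: the sketch supposes context-dependent phonemes written as `l-c+r` (left context, center phoneme, right context), a convention common in speech toolkits, which this passage does not itself specify. The function names are likewise hypothetical.

```python
def to_monophone(label):
    """Return the center phoneme of a label of the assumed form 'l-c+r'.

    Labels lacking a left or right context ('c+r', 'l-c', 'c') are handled too.
    """
    center = label.split("-", 1)[-1]  # drop the left context if present
    return center.split("+", 1)[0]    # drop the right context if present


def to_monophone_sequence(labels):
    """Convert a context-dependent phoneme string to a monophone string."""
    return [to_monophone(label) for label in labels]
```

For example, under this assumed notation the sequence `["k+a", "k-a+Q", "a-Q"]` reduces to `["k", "a", "Q"]`.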

Next, the pronunciation sequence expansion device 1 uses the context-dependent pronunciation dictionary generation unit 11 to generate, from the context-dependent phoneme string generated in step S1, a pronunciation dictionary (context-dependent pronunciation dictionary 11t) in which each context-dependent phoneme serves as both a headword and its pronunciation sequence (step S3).

The pronunciation sequence expansion device 1 also uses the context-dependent phoneme n-gram model generation unit 12 to generate, from the context-dependent phoneme string generated in step S1, an n-gram model that treats each context-dependent phoneme as one word (context-dependent phoneme n-gram model 12t) (step S4).

The pronunciation sequence expansion device 1 then uses the phoneme recognition unit 13 to recognize phonemes from the speech (speech corpus) of the training corpus 102, using the context-dependent pronunciation dictionary 11t and the context-dependent phoneme n-gram model 12t generated in steps S3 and S4, respectively (step S5).
The phoneme string generated in step S5 is the target-language phoneme string (actual-utterance phoneme string 13m) used later in step S6.

The pronunciation sequence expansion device 1 then uses the statistical machine translation model generation unit 14 to generate a statistical machine translation model whose source language is the phoneme string generated in step S2 (standard phoneme string 10m) and whose target language is the phoneme string recognized in step S5 (actual-utterance phoneme string 13m) (step S6).

That is, the pronunciation sequence expansion device 1 uses the phrase translation model generation unit 141 of the statistical machine translation model generation unit 14 to generate, with the standard phoneme string 10m and the actual-utterance phoneme string 13m as parallel data, a translation model (phrase translation model 141m) in which a phrase of the standard phoneme string 10m is translated into a phrase of the actual-utterance phoneme string 13m.
The pronunciation sequence expansion device 1 also uses the phoneme n-gram model generation unit 142 of the statistical machine translation model generation unit 14 to generate, from the actual-utterance phoneme string 13m, an n-gram model that treats each phoneme as one word (phoneme n-gram model 142m).
The phrase translation model 141m and the phoneme n-gram model 142m generated in step S6 together constitute the statistical machine translation model 14m of equation (2).

The pronunciation sequence expansion device 1 then uses the parameter setting unit 17 to set the feature parameters (translation parameters) used by the translation unit 15 (step S7).

Next, the pronunciation sequence expansion device 1 uses the translation unit 15 and the expansion unit 16 to generate an extended pronunciation dictionary candidate 16d by extending the original pronunciation dictionary 100 (step S8).
That is, the pronunciation sequence expansion device 1 uses the translation unit 15 to read, one after another, the pronunciation sequences (phoneme strings) of the headwords registered in the original pronunciation dictionary 100 and to translate them based on the statistical machine translation model 14m generated in step S6 and the translation parameters set in step S7.
Then, when the pronunciation sequence registered for a headword differs from the translated pronunciation sequence, the pronunciation sequence expansion device 1 uses the expansion unit 16 to add the translated pronunciation sequence as a new pronunciation sequence (phoneme string) of that headword, thereby generating the extended pronunciation dictionary candidate 16d, which it stores in a storage unit (not shown).

The pronunciation sequence expansion device 1 then uses the acoustic likelihood calculation unit 181 of the pronunciation dictionary selection unit 18 to perform forced alignment of the word string corresponding to known speech data, using the acoustic model 101 and the extended pronunciation dictionary candidate 16d, and calculates the acoustic likelihood obtained when the word string is segmented into phoneme strings (step S9).

The pronunciation sequence expansion device 1 then uses the maximum likelihood dictionary selection unit 182 of the pronunciation dictionary selection unit 18 to compare the acoustic likelihoods of the previously retained extended pronunciation dictionary candidate 16d and the newly generated extended pronunciation dictionary candidate 16d, and keeps the candidate with the higher acoustic likelihood in the storage unit (not shown) (step S10).

After that, the pronunciation sequence expansion device 1 determines whether the changes to the translation parameters within the predetermined range have been completed (step S11).
If the changes to the translation parameters have not been completed (No in step S11), the pronunciation sequence expansion device 1 returns to step S7 and sets new translation parameters.
If the changes to the translation parameters have been completed (Yes in step S11), the pronunciation sequence expansion device 1 uses the pronunciation dictionary selection unit 18 to determine the extended pronunciation dictionary candidate 16d remaining in the storage unit (not shown) as the extended pronunciation dictionary 103 (step S12), and ends the operation.
Through the above operation, the pronunciation sequence expansion device 1 can generate, with optimized parameters, pronunciation sequences (phoneme strings) of uttered speech that exhibit pronunciation variation, and can add them to the pronunciation dictionary to extend it.

≪Second Embodiment≫
[Configuration of the pronunciation sequence expansion device]
Next, the configuration of a pronunciation sequence expansion device 1B according to the second embodiment of the present invention is described with reference to FIG. 7. Like the pronunciation sequence expansion device 1, the pronunciation sequence expansion device 1B extends the pronunciations (pronunciation sequences) registered in the pronunciation dictionary 100 by associating them with pronunciations (pronunciation sequences) from actual utterances, and generates the extended pronunciation dictionary 103.

As shown in FIG. 7, the pronunciation sequence expansion device 1B includes the phoneme string generation unit 10, the context-dependent pronunciation dictionary generation unit 11, the context-dependent phoneme n-gram model generation unit 12, the phoneme recognition unit 13, the statistical machine translation model generation unit 14, the translation unit 15, the expansion unit 16, the parameter setting unit 17, and a pronunciation dictionary selection unit 18B.
The configuration other than the pronunciation dictionary selection unit 18B is the same as that of the pronunciation sequence expansion device 1 described with reference to FIG. 1, so the same reference numerals are used and the description is omitted.

The pronunciation dictionary selection unit 18B selects one of the plurality of extended pronunciation dictionary candidates 16d, 16d, ... generated in sequence by the expansion unit 16, based on known training data. As the criterion for selecting an extended pronunciation dictionary candidate 16d, the pronunciation dictionary selection unit 18B uses the edit distance between the phoneme string obtained by phoneme recognition of the known training data and the phoneme string obtained by forced phoneme alignment of the same training data.
Here, the pronunciation dictionary selection unit 18B includes an edit distance calculation unit 183 and a minimum distance dictionary selection unit 184.

The edit distance calculation unit 183 calculates the edit distance (Levenshtein distance) between two phoneme strings: the phoneme string obtained by phoneme recognition of speech that is known training data, and the maximum-likelihood phoneme string obtained by forced alignment (forced phoneme alignment) of the word string corresponding to that speech, using the acoustic model 101 and the extended pronunciation dictionary candidate 16d.
Here, part of the training corpus 102 is used as the known speech and its corresponding word string, but speech different from the training corpus 102, together with a transcription of that speech, may be used instead.

The edit distance calculation unit 183 performs phoneme recognition on the speech that is known training data and generates a phoneme string. This phoneme recognition can be carried out in the same way as in the phoneme recognition unit 13, using the acoustic model 101, the context-dependent pronunciation dictionary 11t, and the context-dependent phoneme n-gram model 12t. Here, the edit distance calculation unit 183 has the phoneme recognition unit 13 recognize the speech that is known training data and uses the recognition result (the connection between the edit distance calculation unit 183 and the phoneme recognition unit 13 is omitted from FIG. 7). The phoneme string obtained by this phoneme recognition is used as the reference translation against which the edit distance is measured.

In addition, the edit distance calculation unit 183 uses the acoustic model 101 and the extended pronunciation dictionary candidate 16d to perform forced phoneme alignment of the word string (text) corresponding to the speech that is known training data, and generates the maximum-likelihood phoneme string.
Let r be the reference translation and e the phoneme string corresponding to the extended pronunciation dictionary candidate 16d. The Levenshtein distance is then defined, as in equation (3) below, as Lev(r, e), the minimum number of operations needed to convert the reference translation r into the phoneme string e.
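
The equation itself is not reproduced in this text. Based on the surrounding definitions, equation (3) presumably expresses the distance as the total number of edit operations in a minimal conversion; the following is a reconstruction under that assumption, and the exact typeset form of the original may differ:

```latex
\mathrm{Lev}(r, e) = \mathrm{ins}(r, e) + \mathrm{del}(r, e) + \mathrm{sub}(r, e) \tag{3}
```

Here the three counts are understood to be taken over a conversion from r to e that uses the fewest operations in total.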

In equation (3), ins(r, e), del(r, e), and sub(r, e) are the numbers of insertion, deletion, and substitution operations, respectively, applied to the reference translation r in converting it into the phoneme string e.
The edit distance calculation means 183 outputs the calculated edit distance to the minimum distance dictionary selection means 184.
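The edit distance and its insertion, deletion, and substitution counts can be sketched in Python as follows. This is an illustrative sketch only, not the implementation described in the patent: the function names `edit_operations` and `levenshtein` are hypothetical, phoneme strings are assumed to be sequences of phoneme labels, every operation is assumed to cost 1, and when several minimum-cost operation sequences exist the backtrace simply picks one of them.

```python
from typing import List, Sequence, Tuple


def edit_operations(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (ins, del, sub) counts of one minimum-cost way to turn ref into hyp.

    ref plays the role of the reference translation r, hyp the phoneme string e.
    Unit cost per operation is an assumption of this sketch.
    """
    n, m = len(ref), len(hyp)
    # dist[i][j] is the Levenshtein distance between ref[:i] and hyp[:j].
    dist: List[List[int]] = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every symbol of ref[:i]
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every symbol of hyp[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion from ref
                dist[i][j - 1] + 1,         # insertion into ref
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right cell and count each operation type.
    ins = dele = sub = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                sub += cost
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dele += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return ins, dele, sub


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Lev(r, e) as the sum of the three operation counts."""
    return sum(edit_operations(ref, hyp))
```

For example, `levenshtein(list("kitten"), list("sitting"))` gives 3 (two substitutions and one insertion). In the device, `ref` would hold the phoneme recognition result and `hyp` the forced-alignment phoneme string for a given candidate dictionary.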

The minimum distance dictionary selection means 184 selects the extended pronunciation dictionary candidate 16d for which the edit distance calculated by the edit distance calculation means 183 is smallest.
Here, each time the expansion means 16 generates an extended pronunciation dictionary candidate 16d, the minimum distance dictionary selection means 184 compares the edit distance calculated by the edit distance calculation means 183 for the previously generated candidate 16d with that for the newly generated candidate 16d, and deletes the candidate with the larger edit distance from the storage means (not shown).

Then, when the parameter setting means 17 notifies it that all changes to the translation parameters are complete, the minimum distance dictionary selection means 184 takes the extended pronunciation dictionary candidate 16d remaining in the storage means (not shown) as the extended pronunciation dictionary 103.
In this way, the pronunciation dictionary selection means 18B can select the extended pronunciation dictionary candidate 16d generated with optimized translation parameters.
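The selection described above keeps only one candidate in storage at any time. A minimal Python sketch of that running-minimum selection follows. The names `select_min_distance` and `distance_of` are hypothetical, the candidates are treated as opaque objects, and keeping the earlier candidate when two distances are equal is an assumption of this sketch, since the description does not say how ties are handled.

```python
from typing import Callable, Iterable, Optional, TypeVar

D = TypeVar("D")  # stands in for an extended pronunciation dictionary candidate


def select_min_distance(
    candidates: Iterable[D],
    distance_of: Callable[[D], float],
) -> Optional[D]:
    """Keep only the candidate with the smallest edit distance seen so far.

    candidates may be a generator that yields one candidate per translation
    parameter setting, so at most one candidate is retained at a time.
    """
    best: Optional[D] = None
    best_distance: Optional[float] = None
    for candidate in candidates:
        distance = distance_of(candidate)
        # Replace the stored candidate only if the new one is strictly closer
        # (ties keep the earlier candidate; this is an assumption).
        if best_distance is None or distance < best_distance:
            best, best_distance = candidate, distance
    # When the parameter sweep is finished, the survivor is the selected dictionary.
    return best
```

Because the function consumes the candidates one by one, it matches the described behaviour of comparing the stored candidate with each newly generated one rather than holding all candidates until the end.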

By configuring the pronunciation sequence expansion device 1B as described above, the device can, like the pronunciation sequence expansion device 1 (see FIG. 1), add pronunciation variations to the pronunciation dictionary using the learning corpus 102. In addition, the pronunciation sequence expansion device 1B can generate the extended pronunciation dictionary with optimized feature weight parameters (translation parameters).
Note that the pronunciation sequence expansion device 1B can be realized by running a computer (not shown) with a program (pronunciation sequence expansion program) that causes it to function as each of the means described above.

[Operation of the pronunciation sequence expansion device]
Next, the operation of the pronunciation sequence expansion device 1B according to the second embodiment of the present invention will be described with reference to FIG. 8 (see FIG. 7 as appropriate for the configuration).

The operation of the pronunciation sequence expansion device 1B differs from that of the pronunciation sequence expansion device 1 described with reference to FIG. 6 only in how one of the plurality of extended pronunciation dictionary candidates is selected.
That is, the two operations differ only between steps S9 and S10 in FIG. 6 and steps S9B and S10B in FIG. 8, so the other steps are given the same step numbers and their description is omitted.

After the extended pronunciation dictionary candidate 16d is generated in step S8, the pronunciation sequence expansion device 1B uses the edit distance calculation means 183 of the pronunciation dictionary selection means 18B to calculate the edit distance (Levenshtein distance) between the phoneme string obtained by phoneme recognition of the speech that is known learning data and the maximum likelihood phoneme string obtained by forced phoneme alignment of the word string corresponding to that speech, using the acoustic model 101 and the extended pronunciation dictionary candidate 16d generated in step S8 (step S9B).

Then, the pronunciation sequence expansion device 1B uses the minimum distance dictionary selection means 184 of the pronunciation dictionary selection means 18B to compare the edit distances of the previously generated extended pronunciation dictionary candidate 16d and the newly generated extended pronunciation dictionary candidate 16d, and keeps the candidate with the smaller edit distance in the storage means (not shown) (step S10B).
The subsequent operation is the same as that of the pronunciation sequence expansion device 1.
Through the above operation, the pronunciation sequence expansion device 1B can generate pronunciation sequences (phoneme strings) of uttered speech with pronunciation variation using optimized parameters, add them to the pronunciation dictionary, and thereby expand it.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments.
Here, words have been used as examples of headwords in the pronunciation dictionary 100. However, because the pronunciation sequence expansion devices 1 and 1B produce phoneme strings of pronunciation variations as new pronunciation sequences on a phoneme-by-phoneme basis, the target headword need not be a word; it may be any character string (multiple words, a sentence, and so on).
As a result, the pronunciation sequence expansion devices 1 and 1B can handle not only pronunciation variation within a word but also pronunciation variation across words.

1, 1B Pronunciation sequence expansion device
10 Phoneme string generation means
11 Context-dependent pronunciation dictionary generation means
12 Context-dependent phoneme n-gram model generation means
13 Phoneme recognition means
14 Statistical machine translation model generation means
141 Phrase translation model generation means
142 Phoneme n-gram model generation means
15 Translation means
16 Expansion means
17 Parameter changing means
18, 18B Pronunciation dictionary selection means
181 Acoustic likelihood calculation means
182 Maximum likelihood dictionary selection means
183 Edit distance calculation means
184 Minimum distance dictionary selection means
100 Pronunciation dictionary
101 Acoustic model
102 Learning corpus
103 Extended pronunciation dictionary

Claims

A pronunciation sequence expansion device that expands the pronunciation sequences of headwords in a pronunciation dictionary by using the pronunciation dictionary, which associates each headword with a phoneme string indicating its pronunciation sequence, an acoustic model of context-dependent phonemes, and a learning corpus, which associates speech with its transcribed text, the device comprising:
phoneme string generation means for generating, by means of the acoustic model and the pronunciation dictionary, a context-dependent phoneme string, which is a phoneme string of context-dependent phonemes of the speech of the learning corpus, and also generating a phoneme string of single phonemes;
context-dependent pronunciation dictionary generation means for generating a context-dependent pronunciation dictionary in which the context-dependent phonemes serve as headwords and as their pronunciation sequences;
context-dependent phoneme n-gram model generation means for generating a context-dependent phoneme n-gram model from the context-dependent phoneme string;
phoneme recognition means for recognizing the speech of the learning corpus in units of phonemes by means of the context-dependent pronunciation dictionary and the context-dependent phoneme n-gram model;
statistical machine translation model generation means for generating a statistical machine translation model using, as parallel translation data, a standard phoneme string, which is the phoneme string of single phonemes generated by the phoneme string generation means, and an actual-utterance phoneme string, which is the phoneme string recognized by the phoneme recognition means;
translation means for translating, with the statistical machine translation model and using translation parameters that are set, the phoneme string corresponding to each headword registered in the pronunciation dictionary;
parameter setting means for setting the translation parameters;
expansion means for generating a plurality of extended pronunciation dictionary candidates corresponding to the translation parameters by adding, for each different translation parameter, the phoneme string translated by the translation means to the headword as a phoneme string indicating a new pronunciation sequence; and
pronunciation dictionary selection means for selecting an extended pronunciation dictionary from among the plurality of extended pronunciation dictionary candidates on the basis of speech that is known learning data and a word string corresponding to that speech.

The pronunciation sequence expansion device according to claim 1, wherein the pronunciation dictionary selection means comprises:
acoustic likelihood calculation means for calculating, for each of the plurality of extended pronunciation dictionary candidates, the acoustic likelihood obtained when the word string corresponding to the speech that is the known learning data is subjected to forced word alignment and segmented into a phoneme string using the acoustic model and the extended pronunciation dictionary candidate; and
maximum likelihood dictionary selection means for selecting, as the extended pronunciation dictionary, the extended pronunciation dictionary candidate for which the acoustic likelihood calculated by the acoustic likelihood calculation means is largest.

The pronunciation sequence expansion device according to claim 1, wherein the pronunciation dictionary selection means comprises:
edit distance calculation means for calculating the edit distance between a phoneme string obtained by phoneme recognition of the speech that is the known learning data and, for each of the plurality of extended pronunciation dictionary candidates, the maximum likelihood phoneme string obtained by forced phoneme alignment of the word string corresponding to the speech that is the known learning data using the acoustic model and the extended pronunciation dictionary candidate; and
minimum distance dictionary selection means for selecting, as the extended pronunciation dictionary, the extended pronunciation dictionary candidate for which the edit distance calculated by the edit distance calculation means is smallest.

A pronunciation sequence expansion program for causing a computer to function as the pronunciation sequence expansion device according to any one of claims 1 to 3.