JP2004117503A

JP2004117503A - Method, device, and program for generating acoustic model for voice recognition, recording medium, and voice recognition device using the acoustic model

Info

Publication number: JP2004117503A
Application number: JP2002277225A
Authority: JP
Inventors: Shinji Watabe; 渡部　晋治; Atsushi Nakamura; 中村　篤; Yasuhiro Minami; 南　泰浩; Shuko Ueda; 上田　修功
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-09-24
Filing date: 2002-09-24
Publication date: 2004-04-15
Anticipated expiration: 2022-09-24
Also published as: JP3920749B2

Abstract

PROBLEM TO BE SOLVED: To obtain a high-performance acoustic model from a small amount of learning data, and to obtain it by relatively simple calculation.

SOLUTION: A variational Bayesian evaluation function is given by formula 1, where O is the time-series feature vectors of the learning data, Θ = {θ_c : c = 1, ..., C}, θ_c is the model parameters of phoneme category c, Z is an HMM state sequence variable or a Gaussian mixture sequence variable, m (an element of a set M) is a random variable for the degrees of freedom of the model structure, and q(Θ, Z | O, m) is the posterior distribution of the variational Bayes method. With m fixed, q(θ_c | O, m) and q(Z | O, m) are obtained and used to compute the F^m value, and the q(θ_c | O, m) and q(Z | O, m) that maximize the F^m value are found. The model structure is determined by finding the m in the set M that maximizes F^m[q(Θ, Z | O, m)].

COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、確率統計的な音声認識に用いられる音響モデルの作成方法、その装置、そのプログラムおよびその記録媒体と、上記音響モデルを用いた音声認識装置に関する。
【０００２】
【従来の技術】
まず音声認識装置の概略を説明する（詳しくは例えば非特許文献１参照）。音声認識装置は図１に示すように主に、フレームごとに学習音声信号データ（以下単に学習データと書く）を時系列特徴量ベクトルに変換する特徴量ベクトル変換部１１と、モデルパラメータ学習及び適切なモデル構造決定を行う音響モデル作成部１２と、得られた音響モデルを用いて未知入力音声の時系列特徴量ベクトルに対しスコアを算出し、これに発音辞書や言語モデル等に対するスコアを考慮して認識結果を与える認識部１３とからなる。
【０００３】
音響モデルについて説明する。通常音声認識用音響モデルでは各音素を隠れマルコフモデル（ＨＭＭ）で表現する。図２に示すようにＨＭＭは１乃至複数の各状態Ｓ１，Ｓ２，Ｓ３に対して出力分布Ｄ１，Ｄ２，Ｄ３がそれぞれ与えられる。通常この出力分布Ｄ１，Ｄ２，Ｄ３には図２に示すように混合ガウス分布が用いられる。図２において矢印１４，１５は状態遷移を表わし、ＨＭＭ状態数は３、出力分布の混合数も３の場合である。学習データに対してはその何れの部分が何れの音素であるかを示すラベル情報が与えられている。ラベル情報により得られる学習データ中の各音素に対応するデータから、尤度を最大化するようにモデルパラメータ（ガウス分布の平均、分散、混合重み係数、状態遷移確率）を推定することをモデルパラメータの学習と呼ぶ。そのモデルパラメータを推定する最尤法は、データが十分大きいという近似のもとに成り立つ手法であるため、データが小さいとモデルパラメータの推定が不正確であるという問題がある。また、１音素あたりのＨＭＭ状態数や、出力分布の混合数をいくつに設定するかがモデル構造の決定にあたる。他にも音素カテゴリーをどのように設定するかや、ベイズ法の場合、事前分布をどう設定するかといった問題もモデル構造決定にあたる。
【０００４】
ここで従来の音響モデル作成法について図３を参照して説明する。まず、複数の与えられたモデル構造１〜Ｎに対し、仮モデル作成部１２−１〜１２−Ｎにおいて、まず最尤学習によるモデルパラメータの推定をモデルパラメータ推定部１２１で入力された特徴量ベクトル系列について行うことにより、それぞれ仮音響モデルを作成する。モデル構造の決定は、認識率を基準に評価する。上記課程で得られたそれぞれの仮音響モデルを用いて、認識部１２２でそれぞれの認識結果を出力し、さらにそれらを用いて評価部１２３でそれぞれの認識率を出力する。モデル選択部１２４でこれらＮ個の認識率を比較して認識率の高さで仮音響モデルを評価し、最も高い認識率の仮音響モデルを、音響モデル構造と決定する。このように従来法では、認識部１２２と評価部１２３で認識率を得ることによってはじめてモデル構造の評価を行うことができる。このようなモデル評価は、計算時間がかかる、認識用のデータによって結果が変わる、自動化が難しいといった問題を抱えている。
【０００５】
一方、様々なモデル構造を用意してモデルパラメータの学習を行い、それをもとに記述長、ベイズ情報量といった評価関数を計算して、それを用いてモデル構造決定を行う手法もある（例えば非特許文献２、３参照）。つまり図４に示すように、モデル作成部１２−１〜１２−Ｎではそれぞれまず入力された特徴量ベクトル系列について、そのモデル構造に対するモデルパラメータをモデルパラメータ推定部１２１で最尤学習法により推定する。次にこれら各推定したモデルパラメータについて評価関数値（非特許文献２では記述長、非特許文献３ではベイズ情報量）を評価関数計算部１２５でそれぞれ計算し、モデル選択部１２４でこれら評価関数値中の最も高い仮音響モデルを選択して、音響モデル構造と決定する。
【０００６】
このようにこれらの手法は認識部で認識結果を出力し、評価部で認識率を出力するといった操作が必要ではないため、図３に示した手法における上記の問題点を解決することができる。しかしこれらの評価関数はいずれもデータが十分大きいという近似のもとに成り立つ評価関数である。そのため、少量データのときは適切なモデル構造決定が行われないという問題が生じる。また、これらの評価関数はいずれも、音声認識用音響モデルのような潜在変数を含むモデルに対しては正確な評価関数値を与えることができないため、このような観点からも適切なモデル構造決定が行われないという問題が生じる。さらに、モデルパラメータ推定時の評価関数は尤度であり、モデルの評価関数は記述長又はベイズ情報量であり、複数の評価関数がモデル作成において使われることになり、これにより適切な音響モデルが作成されないという問題が生じる。
【０００７】
音響モデルの学習ではないが、一般にベイズ法を用いると、学習対象に対し事前知識を事前分布として導入でき、学習データが少ないときに他の学習法と比べ汎化能力が高い学習ができ、またモデル選択の自動決定も可能であることが知られている（例えば非特許文献４参照）。しかしベイズ法では事後分布の推定が重要であるが、モデルに潜在変数が含まれる場合（潜在モデル）、複雑な期待値操作が必要となるため、これを解析的に扱うことは一般的に難しい。音声認識用音響モデルに利用されるＨＭＭや混合ガウス分布モデルは、潜在モデルであるため、ベイズ法に適用するのは困難であった。
【０００８】
近年、期待値計算に変分近似を利用した、変分ベイズ法による事後分布（変分事後分布）推定に基づく学習法が提案されている（例えば非特許文献４参照）。しかし、変分ベイズ法を用いて音響モデルを作成することは提案されていない。音響モデルの作成にはモデルパラメータの学習だけでなく、コンテキスト依存ＨＭＭにおける状態共有の仕方や総状態数、および状態ごとの出力分布混合数の設定といったモデル構造選択を含んだより複雑な学習法を必要とする。
なおこの発明の実施形態中に１具体例として状態共有ＨＭＭ構造の選択を音素決定木法に基づいて行うが、この手法は例えば非特許文献５に示されている。
【０００９】
【非特許文献１】
古井貞煕著「近代科学社」出版、２００１年Ｐ．１７４−２１０
【非特許文献２】
篠田浩一、渡辺隆夫著「情報量基準を用いた状態クラスタリングによる音響モデルの作成」信学技報、ＳＰ９６−７９、１９９６年、ＰＰ．９−１５
【非特許文献３】
ダブリュ．チョウ（Ｗ．Ｃｈｏｕ），ダブリュ．ライヒル（Ｗ．Ｒｅｉｃｈｌ）著「デイシジョン　ツリ　ステート　タイング　ベースド　オン　ペナライズド　ベイジアン　インフォメーション　クリテリオン（Ｄｅｃｉｓｉｏｎ　Ｔｒｅｅ　Ｓｔａｔｅ　Ｔｙｉｎｇ　Ｂａｓｅｄ　ｏｎ　Ｐｅｎａｌｉｚｅｄ　Ｂａｙｅｓｉａｎ　Ｉｎｆｏｒｍａｔｉｏｎ　Ｃｒｉｔｅｒｉｏｎ）プロシ．アイシイエイエスエスピー（Ｐｒｏｃ．ＩＣＡＳＳＰ）’９９、第１巻、ＰＰ．３４５−３４８（１９９９）．
【非特許文献４】
上田修功著「最小モデル探索のための変分ベイズ学習」人工知能学会論文誌、１６巻、２号、ＰＰ．２９９−３０８、２００１年
【非特許文献５】
ジェイ．ジェイ．オデル（Ｊ．Ｊ．Ｏｄｅｌｌ）著「ザ　ユース　オブ　コンテックスイン　ラージ　ボキアブラリ　スピーチ　リコグナイション（Ｔｈｅ　Ｕｓｅ　ｏｆ　Ｃｏｎｔｅｘｔ　ｉｎ　Ｌａｒｇｅ　Ｖｏｃａｂｕｌａｒｙ　Ｓｐｅｅｃｈ　Ｒｅｃｏｇｎｉｔｉｏｎ）」１９９５年ピーエッチデー　ゼイシス，ケンブリッジ　ユニバーシテイ（ＰｈＤ　ｔｈｅｓｉｓ，Ｃａｍｂｒｉｄｇｅ　Ｕｎｉｖｅｒｓｉｔｙ）
【００１０】
【発明が解決しようとする課題】
この発明の目的は記述長最小化（ＭＤＬ）基準や最尤基準による手法と比べ、少量データでも性能のよいものを作ることができ、しかも計算を効率的に行うことができる音響モデル作成方法、その装置、そのプログラム及びその記録媒体と、その音響モデルを用いる音声認識装置を提供することにある。
【００１１】
【課題を解決するための手段】
この発明によれば、音響モデルのモデル構造及びベイズ的事前分布を複数用意して、学習データから変分ベイズ法により、モデルパラメータ学習とモデル構造評価関数値を算出し、これら算出されたモデル構造評価関数値をもとにモデル構造を決定してそのモデルを音響モデルとする。
【００１２】
【発明の実施の形態】
この発明による音響モデル作成装置の１実施形態の機能構成を図５に示す。学習データは特徴量ベクトル変換部１１で特徴量ベクトル系列に変換される。複数のモデル構造１〜Ｍを用意し、それぞれに対し、評価部２１−１〜２１−Ｍの事後分布推定部２１１で、変分ベイズ法を用いて、変分ベイズ評価関数をもとに特徴量ベクトル系列についてモデルパラメータ学習を行う。つまりラベル情報により得られる学習データ中の各音素に対応するデータ（特徴量ベクトル）から、変分ベイズ評価関数を最大化するようにモデルパラメータの変分事後分布を推定する。
【００１３】
これら各推定された変分事後分布をもとに再び変分ベイズ評価関数値をそれぞれの評価関数計算部２１２で計算し、これらの評価関数１〜Ｍを用いて、モデル選択部２２で音素カテゴリー全体としての評価関数が最も多くなるように各カテゴリーのモデル構造を決定し、これと、その対応するモデルパラメータの変分事後分布とを音素カテゴリーの音響モデルとする。
この発明は変分ベイズ法を用いることによりモデルパラメータ学習と、モデル構造決定に同様の評価関数を用いるため、複数の評価関数を用いる従来手法と比べて最適性がより保証される。
【００１４】
また、初期モデル構造及び事前分布に既存の音響モデルを用いて、適応用学習音響信号に対し上記の音響モデル作成処理を行うことにより、その適応用学習音声信号に適応化された音響モデルを作成することができる。
このようにして得られた音響モデルは、モデルパラメータそのものではなくそれの変分事後分布で構成される。そのため、この音響モデルと音声認識に用いる装置は、この変分事後分布と未知音声入力データからベイズ予測に基づいて音響スコアが計算される。
【００１５】
実施例
次により具体的に、つまりこの発明の実施例を説明するが、この説明に先立ち、ベイズ法を用いて音響モデルを作成しようとすると、大変であることを示し、その後、この発明の実施例を説明する。
図６に示すように入力されたラベル付き学習データに対し、必要に応じて例えば聴覚特性を考慮した波形処理（フィルタ処理）などの前処理を行い（Ｓ１）、ＬＰＣ（線形予測）分析などの相関処理をフレームごとに行い、更に必要に応じて周波数帯域の制限などスペクトル処理を施して、Ｄ次元時系列特徴量ベクトルＯ＝｛Ｏ_ｔ∈Ｒ^Ｄ：ｔ＝１，…，Ｔ｝に変換する（Ｓ２）。ここでＴは全フレーム数を表す。特徴量としてはケプストラム、Δケプストラム、メル周波数ケプストラムなどが用いられる。
【００１６】
この時系列特徴量ベクトルＯに対し、初期モデル構造を設定する（Ｓ３）。
初期モデル構造設定では、まず初めに初期音素カテゴリーを設定する（Ｓ３−１）。初期音素カテゴリーとしては前後の音素環境を考慮した環境依存音素や環境独立音素を用いる。次に１つの音素カテゴリーを複数個のＨＭＭ状態に細分化し、その各状態に出力分布を設定する（Ｓ３−２）。さらに、ベイズ的事前分布をＨＭＭ状態遷移確率及び出力分布に対して設定する（Ｓ３−３）。
この事前分布は、統計的に信頼性の高いパラメータを与える。例えば、カテゴリーを細分化すると、つまり例えば環境依存音素の環境音素（前後の音素）数を多くして、モデル数を多くすると、それに伴い各カテゴリーに割り当てられる学習データ（特徴量ベクトル、以下同様）が不足し、統計的信頼性が低くなる。そのため、複数の環境依存音素に割り当てられる学習データを共有し、例えばトライホン（ｔｒｉｐｈｏｎｅ：連続する３つの音素）の中心の音素が等しい学習データを、その全てのｔｒｉｐｈｏｎｅカテゴリーに対し共通に用い、それによって得られるモデルパラメータを環境依存音素の事前分布として与える。また、各カテゴリーに含まれるＨＭＭ状態数を増加させると、それに伴い各出力分布に割り当てられる学習データが不足し、統計的信頼性が低くなる。そのため、複数のＨＭＭ状態に割り当てられる学習データを共有し、例えば隣接する状態に割り当てる学習データ、その両状態に対し共通に用い、それによって得られるモデルパラメータを事前分布として与える。また、出力分布中の混合数を増加させると、それに伴い各ガウス分布に割り当てられる学習データが不足し、統計的信頼性が低くなる。そのため、複数のガウス分布に割り当てられる学習データを共有し、それによって得られるモデルパラメータを事前分布として与える。
【００１７】
なお、不特定話者用モデルを特定話者用に適応化する話者適応タスクでは、不特定話者のモデルパラメータを事前分布として与える。雑音抑圧や音源分離による前処理によって歪んだ音声信号入力に対する適応タスクでは、歪みのない入力音声信号により作られたモデルパラメータを事前分布として与える。
このように事前分布の多様性を含めたモデルをモデル構造と呼ぶ。ベイズ学習では、このモデル構造の自由度を確率変数ｍ∈Ｍとすることにより、ｍの事後分布を導入することができる。ここでＭはｍの集合を表す。
【００１８】
次に、前記初期モデル構造からモデルパラメータ学習をベイズ学習を用いて行う（Ｓ４）。ベイズ学習で重要なのは確率変数に対する事後分布を求めることであるが、通常これを求めるのは容易ではない。例えば、ある固定されたモデル構造ｍでの、音素カテゴリーｃに関するモデルパラメータθ_ｃに対する事後分布ｐ（θ_ｃ｜Ｏ，ｍ）を求めるとする。非特許文献４に示すベイズ法の手法を参考にすると、ｐ（θ_ｃ｜Ｏ，ｍ）は、出力分布ｐ（Ｏ，Ｚ｜Θ，ｍ）と事前分布ｐ（Θ｜ｍ）から、次のように表現される。
【数１】

ここで、Θ＝｛θ_ｃ：ｃ＝１，…，Ｃ｝、Ｃは音素カテゴリーの数であり、Θ_−ｃはθ_ｃの補集合を表し、Ｚは潜在変数の集合である。モデルパラメータは具体的にはＨＭＭ状態遷移確率や、ＨＭＭ状態中の出力分布を混合ガウス分布で表したときの混合重み係数及びガウス分布における平均、分散である。また、Ｚは具体的には、ＨＭＭ状態系列変数、つまり１音素がどのようにしていくつの状態を通るかのとり得る数や混合ガウス系列変数、つまり、１つの音素が各状態の何番目のガウス分布を通るかのあらゆる組合せの数である。ｐ（Ｏ，Ｚ｜Θ，ｍ）ｐ（Θ，｜ｍ）はモデル構造設定時に具体的な関数形を与えることができる。
式（１）の計算により得られた事後分布を用いて式（２）
【数２】

を計算してベイズ評価関数を計算し（Ｓ５）、更に各ｍについての式（２）の計算結果からそれが最大のものｍを選択する（Ｓ６）。つまり次式を求める。
ｍ＾＝ａｒｇｍａｘ_ｍ ｐ（ｍ｜Ｏ）　　　　　　　　　　　　　　　（３）
このｍ＾のモデル構造を当該音素カテゴリーの音響モデルとする。全ての音素カテゴリーについて同様の処理を行って各音響モデルを求める。
このようにすればベイズ法により音響モデルを作ることができるが、実際には式（１）の計算は多重積分などを含むため、解析的な扱いが困難である。また、モンテカルロシミュレーションにより求める方法もあるが計算時間の問題からそのようなアプローチは非現実的である。
よって、ベイズ法により音響モデルを作ることは現実的でない。
この発明では変分ベイズ法を用いて音響モデルを作成する。その実施例を図７に示す。前処理（Ｓ１）、特徴量ベクトル変換（Ｓ２）、初期モデル構造設定（Ｓ３）は図６のそれと同様である。
【００１９】
次にこの実施例では式（４）で与える変分ベイズ評価関数を基準にして変分法による近似計算でモデルパラメータの事後分布（変分事後分布）を分布推定する（Ｓ４）。
【数３】

ここで＜ｕ（ｙ）＞_ｐ（ｙ）　は分布ｐ（ｙ）に対するｕ（ｙ）の期待値をあらわす。ｑ（Θ，Ｚ｜Ｏ，ｍ）は変分法により近似的に求まる事後分布である。Ｆ^ｍは変分事後分布ｑ（Θ，Ｚ｜Ｏ，ｍ）に対する汎関数である。式（４）は非特許文献４に示す積分ベイズ法の手法を参考とすると得られる。
確率変数の統計的独立性ｑ（Θ，Ｚ｜Ｏ，ｍ）＝ｑ（Ｚ｜Ｏ，ｍ）Π_ｃ＝１ ^Ｃｑ（θ_ｃ｜Ｏ，ｍ）を仮定し、Ｆ^ｍをｑ（θ_ｃ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）に関して変分法を用いて最大化することにより、固定されたｍに対する適切なｑ（θ_ｃ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）を次式で表現することができる。
【数４】

ｑ（θ_ｃ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）は相互に依存しているため、バウム−ウェルチアルゴリズムもしくはビタービアルゴリズムに基づく反復計算を用いて効率的に求めることができる。このようにして、ある固定されたｍに対する変分事後分布ｑ（θ_ｃ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）を変分ベイズ法で、事後分布推定部２１１（図５）において求めることにより、モデルパラメータを学習する。
【００２０】
次にモデル構造決定の指標となる評価関数について考察する。ｍの事前分布を一様と仮定すると、変分事後分布ｑ（ｍ｜Ｏ）とＦ^ｍは次式に示す関係を持つ。
【数５】

により適切なモデル構造ｍ＾を事後確率最大化（ＭＡＰ）の意味で決定することができる。つまり、Ｆ^ｍはある固定されたｍにおけるｑ（Θ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）の最適性を与える評価汎関数であると同時に、モデル構造ｍの最適性を与える評価関数であると言える。従って、Ｆ^ｍを用いることにより、ＨＭＭや混合ガウス分布モデルのような潜在変数を含むモデル学習およびモデル構造の決定を、変分ベイズ評価関数を用いて統一的に議論できる。モデルパラメータ学習で得られた変分事後分布ｑ（Θ，Ｚ｜Ｏ，ｍ）を式（４）に代入して固定されたｍにおけるモデル構造決定関数である変分ベイズ評価関数を評価関数計算部２１２（図５）で計算する（Ｓ５）。
Ｆ^ｍを集合Ｍにおける全てのｍに対して計算することにより、モデル選択部２２（図５）で式（７）に基づき適切なモデル構造を決定する（Ｓ６）。つまりステップＳ５で求めた評価関数値が最も大きいモデル構造と、その事後分布ｑ（θ_ｃ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）を最大化する事後分布とを当該音素カテゴリーｃの音響モデルとする。
全ての音素カテゴリーｃを選択したかを調べ（Ｓ７）、選択していないものがあればその１つを選択してステップＳ３に戻る（Ｓ８）。全ての音素カテゴリーについての音響モデルの決定をすると処理を終了する。
【００２１】
モデル選択部２２におけるモデル構造の決定はモデル構造の変化を木構造を用いて階層的に表現することにより、効率よく適切なモデル構造を探索することができる。以下この実施例において、木構造を用いた環境依存音素の共有に関するモデル構造決定例を示すが、木構造以外であっても、全ての組み合わせを考慮して最もＦ^ｍが大きくなるようにモデル構造を決定する手法や、最も細分化されたモデル構造における各状態やガウス分布をボトムアップ的に併合させ、最もＦ^ｍが大きくなるようにモデル構造を決定する手法を用いてもよい。また同様の議論が、環境依存音素の共有に関するモデル構造決定のみならず、１音素あたりのＨＭＭ状態数及び、ＨＭＭ状態を混合ガウス分布で表したときの混合数をいくつにするかといったモデル構造決定においても有効である。なぜなら、環境依存音素の共有問題は環境独立音素を複数の環境依存音素でクラスタリングする問題とみなすことができ、同様にＨＭＭ状態数、混合数の決定問題もそれぞれ環境依存音素、ＨＭＭ状態におけるクラスタリング問題とみなすことができるため、これらは本質的には同様のクラスタリング問題として扱うことができるからである。そのため、この３つの種類のクラスタリングを同時に行う、もしくは、それぞれ独立に行うことにより、モデル構造を決定していくことができる。
【００２２】
環境依存音素の共有問題について実施例を示す。この手法は例えば非特許文献５に示されている。まず環境独立音素カテゴリーが３つのＨＭＭ状態を含み、その各状態に含まれる出力分布が１混合ガウス分布で表される初期モデルを用意して説明を行う。またこのときの環境依存音素カテゴリーとして当該音素の直前直後の音素を考慮したｔｒｉｐｈｏｎｅカテゴリーを用いる。ある木の節ｎに対応付けられたＨＭＭ状態集合をΩ（ｎ）とする。初めに、ルートノード（ｎ＝０）を用意する。つまりルートノードには、同一の中心音素を持つｔｒｉｐｈｏｎｅ　ＨＭＭ状態の集合を対応付けさせる。このとき、質問群から適切に選ばれた質問Ｑを用いて、図８に示すように集合Ω（ｎ）を質問Ｑの回答（Ｙｅｓ又はＮｏ）に応じてΩ（ｎ_Ｙ ^Ｑ）とΩ（ｎ_Ｎ ^Ｑ）に分割し、それらを新たなノードｎ_Ｙ ^Ｑとｎ_Ｎ ^Ｑに対応付ける適切な質問の選び方は後で述べる。
【００２３】
以下では分枝数を２として話を進めるが、分枝数が２以上であっても同様に話を進めることができる。この分割により新しく得られたノードに含まれる状態集合に対してそれぞれ再び質問による分割を行い、これを繰り返すことによって、図９に示すように、最終的に木構造が構築される。各リーフノードに対応付けられた状態集合を共有することによって、状態共有型ＨＭＭ構造が構築される。用いる質問群は音声学の知見により得られた、前後の音素環境に対する質問群である。質問の具体例を図１０に示す。このとき、各ノードにおける分割前後でｑ（Θ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）を変分ベイズ学習によりそれぞれ求め、それをもとに評価関数Ｆ^ｍの値をそれぞれ算出する。Ｆ^ｍ値の変化が最も大きい質問を採用することによって適切な分割を行うことができる。これを全てのノードに対して行うことにより、Ｆ^ｍ値で最適化された木構造を得ることができる。またＦ^ｍがこれ以上増減しないノードをリーフノードとすることにより木構造におけるリーフノード数を決定することができる。これにより適切なモデル構造を決定することができる。つまり各リーフノードに残った複数のｔｒｉｐｈｏｎｅカテゴリーに対し、そのリーフノードのモデル構造をモデル構造として共通に用いる。質問を用いるのではなくΩ（０）を分割する場合の全ての分割のやり方、つまり分割に対する全組み合わせを考えその各組み合わせについて分割前後のｑ（Θ｜Ｏ，ｍ），ｑ（Ｚ｜Ｏ，ｍ）を変分ベイズ法によりそれぞれ求め、それをもとにＦ^ｍを算出し、Ｆ^ｍの変化が最も大きい分割を採用するようにしてもよい。
【００２４】
ＨＭＭ状態数、混合ガウス数の決定も同様に行うことができる。例えばＨＭＭ状態数についてみれば、各ノードにおいて、共有した学習データ集合を状態数が１のもの、それ以外のものとして分割した時の分割前後におけるＦ^ｍの値を求め、同様に状態数が２，３，…それぞれのものと、その他のものとにそれぞれ分割した時の各分割前後の各Ｆ^ｍを求め、これら分割前後におけるＦ^ｍの変化の最も大きな分割のやり方を採用して、これにより分割された学習データ集合をそれぞれ次のノードの学習データ集合とする。
以上のようにして、Ｆ^ｍを評価関数とした変分ベイズ的アプローチにより、モデルパラメータの学習と適切なモデル構造決定により音響モデルを作成することができる。前記モデル構造決定、ＨＭＭ状態数決定、混合ガウス数決定を例えば各分割条件を同時に与えることにより同時処理で決定してもよい。
【００２５】
更に必要に応じて図７中に破線で示すように、ステップＳ７で全ての音素カテゴリーを選択した場合は、処理を終了とすることなく、得られた音響モデルに基づき、モデルパラメータの学習を変分ベイズ法により行う（Ｓ９）。この場合はステップＳ４において行ったと同様のことを行うがその際に用いるモデル構造は前記モデル構造１〜Ｍではなく、ステップＳ６で得られた音響モデルについて行う。この再モデルパラメータ学習で得られたモデルパラメータの事後分布を、その音響モデルに採用する。図７では各音素カテゴリーごとにそのモデル構造とモデルパラメータの変分事後分布を決定したが、全音素カテゴリーについて図７中のステップＳ３以後を実行し、音素カテゴリー全体としての評価関数値が最大になるように、各音素カテゴリーのモデル構造を決定し、これと、その対応するモデルパラメータの変分事後分布とをその音素カテゴリーの音響モデルとし、全音素カテゴリーの音響モデルを同時に決定してもよい。
【００２６】
このようにモデルパラメータを再学習する場合は、次のようにしてもよい。つまり先にモデル構造の決定（選択）を、音素決定木法に基づいて行う例を示した場合と同様に、各ＨＭＭ状態の出力分布を単一ガウス分布とし、かつ各ＨＭＭ状態への学習データの割り当てを固定とすることにより、変分事後分布の推定に反復計算を省略して、評価関数値を求めることができ、つまり計算時間を大幅に短縮して評価関数値を求め、その後、ステップＳ９におけるモデルパラメータの再学習において、実際に用いるモデルパラメータの変分事後分布を求め、かつＨＭＭの状態あたりの出力分布の混合数を増加し、また学習データのＨＭＭ状態への割り当てを可変にする。
【００２７】
前述したように、特定話者用モデルを作る場合、つまり話者適応タスクでは、既存の不特定話者用音響モデルを初期モデル構造とし、かつそのモデルパラメータをベイズ的事前分布として、図７中のステップＳ４以後の処理を行えばよい。入力音声学習データとしては、その特定話者の音声信号を用いる。また歪みを受けている音声信号に対する認識用音響モデルを作る場合、つまり歪み音声に対する適応タスクでは歪みのない入力音声により作られた既存の音響モデルを初期モデル構造とし、かつそのモデルパラメータをベイズ的事前分布として、図７中のステップＳ４以下の処理を行えばよい。入力音声学習データとしてはその歪みを受けている音声信号を用いる。
【００２８】
次にこの発明により作成された音響モデルを用いる音声認識装置の実施例を、図１１を参照して説明する。図１２にその処理の流れを示す。
未知入力音声信号は特徴量ベクトル変換部３１でフレームごとに特徴量ベクトルｘに変換される（Ｓ１）。この場合の特徴量はモデル格納部３２に格納されている音響モデルを作成する際に用いた特徴量と同一のものとする。モデル格納部３２にはこの発明の方法により作成された音響モデルにあって、各音素カテゴリーごとにそのモデルパラメータθ_ｃ、つまりガウス分布の平均や分散などの変分事後分布ｑ（θ_ｃ｜Ｏ，ｍ）とモデル構造とが格納されている。実際的には、ガウス分布の平均の平均、分散の平均などが変分事後分布ｑ（θ_ｃ｜Ｏ，ｍ）として格納されている。また各音素カテゴリーｃごとに、そのモデルパラメータθ_ｃとモデル構造ｍに対する音声データｘの分布ｐ（ｘ｜θ_ｃ，ｍ）、つまりその分布の平均と分散が格納されている。
【００２９】
フレームごとの特徴量ベクトルｘについて、各音素カテゴリーｃについてモデル格納部３２内のその音響モデルを用いてベイズ予測に基づく音響スコアｓ（ｃ）を、次式によりスコア計算部３３で計算する。
ｓ（ｃ）＝∫ｄθ_ｃｑ（θ_ｃ｜Ｏ，ｍ）ｐ（ｘ｜θ_ｃ，ｍ）　　（８）
この積分を次のように事後確率最大化近似をしてもよい。
【数６】

このようにして計算したフレームごとの各音素カテゴリーｃごとの音響スコアｓ（ｃ）を用いて音素カテゴリー決定部３４において、例えばビダビアルゴリズムにより音素カテゴリー又はその候補を決定し（Ｓ３）、更にこれら音素カテゴリーについて単語認識部３５で、メモリ３６内の発音辞書、言語モデルを組み合わせることにより、単語列の認識結果を出力する（Ｓ４）。
この発明の有効性を実証するために、非特許文献２に示す最尤法と記述長最小化（ＭＤＬ）基準の組み合わせによるパラメータ学習、モデル構造選択法を従来法とし、これと、この発明方法とについて、学習データの変化に伴う単語認識率の推移について実験を行った。実験にあたり図１３に示す音声分析条件と図１４に示す初期ＨＭＭを用意する。事前分布パラメータは、音素決定木のルートノードにおけるｔｒｉｐｈｏｎｅ　ＨＭＭ状態集合に割り当てられた学習データの平均、分散により与える。図１５に学習と評価に用いたデータを示す。学習データに対して、乱数を用いてランダムに文を抜き取ることによりデータ量を変化させた。学習データの変化に伴う発明法と従来法の認識率及び木分割時の分割総数（≒モデル構造）をそれぞれ図１６、図１７に示す。従来法（１）は状態共有型ＨＭＭの構築においてルートノードのサンプル数を元に記述長を求め、ＭＤＬ基準でモデル構造を選択したものである。なおこの実験では出力分布混合数を１に保った。この発明のベイズ的基準と従来法（１）を比較すると、小規模学習データ領域（６０文以下）では発明法の認識結果が従来法（１）と比較して、最大で５０％近く上回っているのが図１９からわかる。これは発明法が、ＭＤＬ基準の適用範囲外であるような小規模学習データ領域に対しても、十分機能することを示している。実際図１６において学習データが少なくなるに従い、発明法は分割数０のモデル構造を選択するが、従来法（１）では分割数が０に近づかない。
【００３０】
一方、ＭＤＬ基準で小規模学習データ領域での上述の問題を回避するために記述長を調節したのが、従来法（２）のグラフである。ここでは記述長を、小規模学習データ領域での分割数がこの発明のベイズ法の場合と一致するように調節した。このようにほぼモデル構造を等しくした場合でも発明法が従来法と比べて１０％ほど上回っているのがわかる。これは、変分事後分布の推定や、ベイズ推測に基づく音響スコア計算における期待値操作により、過学習が緩和されたために生じた差であると考えられる。
次に、学習データを３，０００文に固定し、状態あたりの出力分布混合数を一律に変化させた際のベイズ的評価関数値と、それに伴う認識率の変化を示したグラフを図１８に示す。認識率は混合数の増加に伴って向上するが、１５混合以上になると過学習の効果により逆に劣化していく。このとき評価関数値の変化は認識率の変化とほぼ一致しており、このことから発明法が出力分布混合数の設定に対しても効果的であることがわかる。
【００３１】
図５に示したこの発明による音響モデル作成装置をコンピュータにより機能させることもできる。その場合は、例えば図７に示した方法の各ステップをコンピュータに実行させるための音響モデル作成プログラムを、ＣＤ−ＲＯＭ、磁気ディスクなどの記録媒体又は通信回線を介してコンピュータ内にダウンロードして、そのプログラムをコンピュータに実行させればよい。同様に図１１に示した音声認識装置をコンピュータに機能させてもよい。
【００３２】
【発明の効果】
以上述べたようにこの発明によれば小規模学習データでも高性能な音響モデル構造決定、音響モデルのパラメータ学習を実現することができる。
【図面の簡単な説明】
【図１】音響モデル作成と、音声認識の一般的機能構成を示す図。
【図２】隠れマルコフモデルの例を説明するための図。
【図３】音声認識により評価を行う従来の音響モデル作成装置の機能構成を示す図。
【図４】評価関数により評価を行う従来の音響モデル作成装置の機能構成を示す図。
【図５】この発明による音響モデル作成装置の機能構成例を示す図。
【図６】ベイズ法を用いる音響モデル作成方法の考えられる手法を示す流れ図。
【図７】この発明による音響モデル作成方法の例を示す流れ図。
【図８】木構造を用いるモデル構造決定の際の質問に対するＨＭＭ状態集合の分割を説明するための図。
【図９】モデル構造決定に用いた木構造の例を示す図。
【図１０】ＨＭＭ状態集合の分割に用いる質問の具体例を示す図。
【図１１】この発明による音声認識装置の機能構成例を示す図。
【図１２】この発明による音声認識方法の処理手順の例を示す流れ図。
【図１３】実験に用いた音声分析条件を示す図。
【図１４】実験に用いた初期ＨＭＭを示す図。
【図１５】実験に用いた学習・評価データを示す図。
【図１６】学習データに応じた認識率の実験結果を示す図。
【図１７】学習データに応じた分割数の実験結果を示す図。
【図１８】状態あたりの出力分布混合数を一律に変化させた場合の認識率と評価関数値の実験結果を示す図。
[0001]
[Technical field of the invention]
The present invention relates to a method for creating an acoustic model used in statistical speech recognition, an apparatus, a program, and a recording medium therefor, and a speech recognition apparatus that uses the acoustic model.
[0002]
[Prior art]
First, an outline of a speech recognition device is given (for details, see, for example, Non-Patent Document 1). As shown in FIG. 1, the speech recognition device mainly consists of a feature vector conversion unit 11 that converts learning speech signal data (hereinafter simply called learning data) into time-series feature vectors frame by frame; an acoustic model creation unit 12 that learns the model parameters and determines an appropriate model structure; and a recognition unit 13 that uses the resulting acoustic model to compute a score for the time-series feature vectors of unknown input speech and gives a recognition result by also taking into account scores from a pronunciation dictionary, a language model, and the like.
[0003]
The acoustic model is described next. In an acoustic model for speech recognition, each phoneme is usually represented by a hidden Markov model (HMM). As shown in FIG. 2, an HMM has one or more states S1, S2, S3, each of which is given an output distribution D1, D2, D3. A Gaussian mixture distribution is normally used for these output distributions, as shown in FIG. 2. In FIG. 2, arrows 14 and 15 represent state transitions; the example has three HMM states and three mixture components per output distribution. The learning data is accompanied by label information indicating which part corresponds to which phoneme. Estimating the model parameters (the means, variances, and mixture weights of the Gaussians, and the state transition probabilities) so as to maximize the likelihood of the data corresponding to each phoneme, as identified by the label information, is called model parameter learning. The maximum likelihood method used for this estimation is justified only under the approximation that the amount of data is sufficiently large, so the parameter estimates become inaccurate when data is scarce. Choosing the number of HMM states per phoneme and the number of mixture components per output distribution constitutes determination of the model structure. How to define the phoneme categories and, in the Bayesian case, how to set the prior distributions are also part of model structure determination.
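As a rough illustration of the output distributions described above, the following sketch evaluates the log density of one HMM state's Gaussian mixture at a single feature vector. It assumes diagonal covariances, which the text does not specify, and the function names `log_gaussian_diag` and `log_gmm` are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances):
    """Log density of a Gaussian mixture (assumed diagonal covariances).

    Uses the log-sum-exp trick so that small component densities
    do not underflow.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Illustrative 3-component mixture over 2-dimensional features.
score = log_gmm(
    [0.2, -0.1],
    weights=[0.5, 0.3, 0.2],
    means=[[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]],
    variances=[[1.0, 1.0], [0.5, 0.5], [2.0, 2.0]],
)
```

Maximum likelihood learning adjusts the weights, means, and variances so that the sum of such log densities over the labeled frames is as large as possible.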
[0004]
A conventional acoustic model creation method is described with reference to FIG. 3. For each of a plurality of given model structures 1 to N, the provisional model creation units 12-1 to 12-N create a provisional acoustic model: the model parameter estimation unit 121 estimates the model parameters from the input feature vector sequence by maximum likelihood learning. The model structure is then chosen on the basis of recognition rate. Using each provisional acoustic model, the recognition unit 122 outputs a recognition result, from which the evaluation unit 123 outputs a recognition rate. The model selection unit 124 compares the N recognition rates and adopts the structure of the provisional acoustic model with the highest recognition rate as the acoustic model structure. In this conventional method, the model structure can be evaluated only after the recognition unit 122 and the evaluation unit 123 have produced a recognition rate. Such evaluation takes a long time to compute, gives results that depend on the recognition data, and is difficult to automate.
[0005]
Alternatively, there are methods that prepare various model structures, learn the model parameters, compute from them an evaluation function such as the description length or the Bayesian information criterion, and use that function to determine the model structure (see, for example, Non-Patent Documents 2 and 3). As shown in FIG. 4, each of the model creation units 12-1 to 12-N first estimates, in the model parameter estimation unit 121, the model parameters of its model structure from the input feature vector sequence by maximum likelihood learning. The evaluation function calculation unit 125 then computes an evaluation function value for each set of estimated model parameters (the description length in Non-Patent Document 2, the Bayesian information criterion in Non-Patent Document 3), and the model selection unit 124 selects the provisional acoustic model with the highest evaluation function value and adopts its structure as the acoustic model structure.
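The selection step of such criterion-based methods can be sketched as follows. This uses the textbook form of the Bayesian information criterion (log-likelihood minus half the parameter count times the log of the sample count), which is an assumption here: the exact criteria in Non-Patent Documents 2 and 3 may differ, for example by a penalty weight. The names `bic_score` and `select_structure` and all numbers are illustrative.

```python
import math


def bic_score(log_likelihood, num_params, num_samples):
    """Textbook BIC in 'higher is better' form (assumed, not from the patent)."""
    return log_likelihood - 0.5 * num_params * math.log(num_samples)


def select_structure(candidates, num_samples):
    """Return the name of the candidate with the highest BIC score.

    candidates: list of (name, maximized log-likelihood, parameter count).
    """
    best = max(candidates, key=lambda c: bic_score(c[1], c[2], num_samples))
    return best[0]


# Hypothetical candidates: the richer model fits slightly better
# but pays a larger complexity penalty.
candidates = [("1-mix", -1200.0, 10), ("3-mix", -1190.0, 30)]
chosen = select_structure(candidates, num_samples=500)
```

The penalty term is derived for large sample counts, which is the limitation the following paragraphs discuss for small amounts of learning data.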
[0006]
Because these methods do not require the recognition unit to output recognition results or the evaluation unit to output a recognition rate, they avoid the problems of the method shown in FIG. 3. However, each of these evaluation functions holds only under the approximation that the amount of data is sufficiently large, so an appropriate model structure is not determined when data is scarce. In addition, none of these evaluation functions gives an accurate value for models that contain latent variables, such as acoustic models for speech recognition, which is a further reason an appropriate model structure is not determined. Furthermore, the evaluation function used for estimating the model parameters is the likelihood, whereas the evaluation function used for the model is the description length or the Bayesian information criterion; because different evaluation functions are used within a single model creation process, an appropriate acoustic model may not be obtained.
[0007]
Outside acoustic model learning, it is known that the Bayesian method in general allows prior knowledge about the learning target to be introduced as a prior distribution, generalizes better than other learning methods when learning data is scarce, and also allows model selection to be automated (see, for example, Non-Patent Document 4). In the Bayesian method, estimating the posterior distribution is essential, but when the model contains latent variables (a latent model), complex expectation operations are required, and these are generally difficult to handle analytically. The HMMs and Gaussian mixture models used in acoustic models for speech recognition are latent models, so applying the Bayesian method to them has been difficult.
[0008]
In recent years, a learning method based on estimating the posterior distribution by the variational Bayes method (the variational posterior distribution), which uses a variational approximation for the expectation calculation, has been proposed (see, for example, Non-Patent Document 4). However, creating an acoustic model with the variational Bayes method has not been proposed. Creating an acoustic model requires not only model parameter learning but also a more complex learning procedure that includes model structure selection, such as how states are shared in context-dependent HMMs, the total number of states, and the number of mixture components of the output distribution in each state.
In the embodiment of the present invention, as one specific example, the state-sharing HMM structure is selected on the basis of the phoneme decision tree method, which is described in, for example, Non-Patent Document 5.
[0009]
[Non-Patent Document 1]
Sadaoki Furui, published by Kindai Kagaku-sha, 2001, pp. 174-210
[Non-Patent Document 2]
Koichi Shinoda and Takao Watanabe, "Creation of Acoustic Model by State Clustering Using Information Criterion," IEICE Technical Report, SP96-79, 1996, pp. 9-15
[Non-Patent Document 3]
W. Chou and W. Reichl, "Decision Tree State Tying Based on Penalized Bayesian Information Criterion," Proc. ICASSP '99, Vol. 1, pp. 345-348 (1999)
[Non-Patent Document 4]
Naonori Ueda, "Variational Bayes Learning for Searching Minimal Models," Transactions of the Japanese Society for Artificial Intelligence, Vol. 16, No. 2, pp. 299-308, 2001
[Non-Patent Document 5]
J. J. Odell, "The Use of Context in Large Vocabulary Speech Recognition," PhD thesis, Cambridge University, 1995
[0010]
[Problems to be solved by the invention]
An object of the present invention is to provide an acoustic model creation method that, compared with methods based on the minimum description length (MDL) criterion or the maximum likelihood criterion, can produce a high-performance model even from a small amount of data and can do so with efficient computation, together with an apparatus, a program, and a recording medium therefor, and a speech recognition apparatus that uses the acoustic model.
[0011]
[Means for Solving the Problems]
According to the present invention, a plurality of model structures and Bayesian prior distributions of acoustic models are prepared, and model parameter learning and model structure evaluation function values are calculated from the learning data by a variational Bayes method. A model structure is determined based on the evaluation function value, and the model is used as an acoustic model.
[0012]
[Embodiments of the invention]
FIG. 5 shows the functional configuration of one embodiment of the acoustic model creation apparatus according to the present invention. The learning data is converted into a feature vector sequence by the feature vector conversion unit 11. A plurality of model structures 1 to M are prepared, and for each of them the posterior distribution estimation unit 211 of the corresponding evaluation unit 21-1 to 21-M performs model parameter learning on the feature vector sequence by the variational Bayes method, using the variational Bayesian evaluation function as the criterion. That is, from the data (feature vectors) corresponding to each phoneme in the learning data, as identified by the label information, the variational posterior distribution of the model parameters is estimated so as to maximize the variational Bayesian evaluation function.
[0013]
From each of these estimated variational posterior distributions, the corresponding evaluation function calculation unit 212 then computes the value of the variational Bayesian evaluation function. Using these evaluation function values 1 to M, the model selection unit 22 determines the model structure of each category so that the evaluation function over all phoneme categories is largest, and takes this structure, together with the variational posterior distribution of the corresponding model parameters, as the acoustic model of the phoneme category.
Because the variational Bayes method lets the present invention use the same evaluation function for both model parameter learning and model structure determination, optimality is better ensured than in conventional methods that use several different evaluation functions.
[0014]
In addition, by using an existing acoustic model as the initial model structure and prior distribution and applying the above acoustic model creation processing to a learning speech signal for adaptation, an acoustic model adapted to that signal can be created.
The acoustic model obtained in this way consists not of the model parameters themselves but of their variational posterior distributions. A speech recognition device that uses this acoustic model therefore computes the acoustic score from the variational posterior distributions and the unknown input speech data on the basis of Bayesian prediction.
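To make the difference between scoring with a posterior distribution and scoring with a point estimate concrete, the sketch below uses a deliberately simplified case: a one-dimensional Gaussian observation model with known variance and a Gaussian posterior over its mean. Under that assumption the Bayesian predictive density is again Gaussian, with the posterior variance added to the observation variance. This is an illustration only; the patent's models also carry posteriors over variances, mixture weights, and transition probabilities, and the function names are hypothetical.

```python
import math


def log_normal(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def predictive_log_score(x, post_mean, post_var, obs_var):
    """Log of the integral of N(x; mu, obs_var) * N(mu; post_mean, post_var) over mu.

    For this conjugate Gaussian case the integral has the closed form
    N(x; post_mean, obs_var + post_var).
    """
    return log_normal(x, post_mean, obs_var + post_var)


def plugin_log_score(x, post_mean, obs_var):
    """Score obtained by plugging in the posterior mean as a point estimate."""
    return log_normal(x, post_mean, obs_var)


# With an uncertain mean (large post_var), an observation far from the
# posterior mean is penalized less by the predictive score.
bayes = predictive_log_score(3.0, post_mean=0.0, post_var=1.0, obs_var=1.0)
plugin = plugin_log_score(3.0, post_mean=0.0, obs_var=1.0)
```

When the posterior variance shrinks toward zero, as it does with abundant learning data, the two scores coincide.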
[0015]
Example
An example of the present invention is now described more concretely. Before that, it is first shown that creating an acoustic model with the Bayesian method itself is difficult; the example of the present invention is described afterwards.
As shown in FIG. 6, the input labeled learning data is, where necessary, pre-processed, for example by waveform processing (filtering) that takes auditory characteristics into account (S1). Correlation processing such as LPC (linear prediction) analysis is then performed frame by frame, followed where necessary by spectral processing such as limiting the frequency band, to convert the data into a sequence of D-dimensional time-series feature vectors O = {O_t ∈ R^D : t = 1, ..., T} (S2). Here T is the total number of frames. Cepstrum, Δ cepstrum, mel-frequency cepstrum, and the like are used as features.
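The text names Δ cepstrum as a feature without defining it. One common and very simple definition, assumed here purely for illustration, is the half difference between the following and preceding frames, with the edge frames repeated at the boundaries. The function name `delta_features` is hypothetical.

```python
def delta_features(frames):
    """Frame-to-frame differences of a feature sequence.

    frames: list of equal-length lists, one cepstral vector per frame.
    Returns (next - previous) / 2 per dimension; the first and last
    frames are repeated at the boundaries (an assumed convention).
    """
    total = len(frames)
    deltas = []
    for t in range(total):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, total - 1)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return deltas


# Three frames of a 1-dimensional feature.
deltas = delta_features([[0.0], [1.0], [4.0]])
```

Appending such differences to the static cepstral coefficients increases the dimension D of each feature vector O_t.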
[0016]
An initial model structure is set for this time-series feature vector sequence O (S3).
In the initial model structure setting, first, an initial phoneme category is set (S3-1). As the initial phoneme category, an environment-dependent phoneme or an environment-independent phoneme in consideration of the surrounding phoneme environment is used. Next, one phoneme category is subdivided into a plurality of HMM states, and an output distribution is set for each state (S3-2). Further, a Bayesian prior distribution is set for the HMM state transition probability and the output distribution (S3-3).
This prior distribution supplies statistically reliable parameters. For example, when the categories are subdivided, that is, when the number of environment phonemes (the preceding and following phonemes) considered for the environment-dependent phonemes is increased and the number of models grows, the learning data (feature vectors; the same applies below) assigned to each category becomes insufficient and statistical reliability falls. Therefore, the learning data assigned to several environment-dependent phonemes is shared: for example, the learning data of all triphones (a triphone being a sequence of three consecutive phonemes) that have the same center phoneme is used in common for all of those triphone categories, and the model parameters obtained from it are given as the prior distribution of the environment-dependent phonemes. Likewise, when the number of HMM states in each category is increased, the learning data assigned to each output distribution becomes insufficient and statistical reliability falls. Therefore, the learning data assigned to several HMM states is shared: for example, the learning data assigned to adjacent states is used in common for both states, and the model parameters obtained from it are given as the prior distribution. Similarly, when the number of mixture components in an output distribution is increased, the learning data assigned to each Gaussian becomes insufficient and statistical reliability falls. Therefore, the learning data assigned to several Gaussians is shared, and the model parameters obtained from it are given as the prior distribution.
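The data sharing described above for triphones can be sketched as follows. The sketch pools the frames of every triphone with a given center phoneme and returns their mean and variance as candidate prior parameters. The `left-center+right` label format, the one-dimensional frames, and the function names are assumptions made for illustration.

```python
def center_phoneme(triphone):
    """Center phoneme of a label written as left-center+right (assumed format)."""
    return triphone.split("-")[1].split("+")[0]


def pooled_prior(frames_by_triphone, center):
    """Pool the frames of all triphones sharing `center`.

    frames_by_triphone: dict mapping a triphone label to a list of
    1-D feature values. Returns (mean, variance) of the pooled data.
    """
    pooled = [
        x
        for triphone, frames in frames_by_triphone.items()
        if center_phoneme(triphone) == center
        for x in frames
    ]
    mean = sum(pooled) / len(pooled)
    variance = sum((x - mean) ** 2 for x in pooled) / len(pooled)
    return mean, variance


# Hypothetical data: two triphones centered on "k", one centered on "t".
data = {"a-k+i": [1.0, 3.0], "o-k+a": [5.0], "a-t+i": [100.0]}
prior_mean, prior_var = pooled_prior(data, "k")
```

Each individual triphone here has only one or two frames, but the pooled statistics are computed from three, which is the reliability gain the paragraph describes.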
[0017]
In a speaker adaptation task, in which a speaker-independent model is adapted to a specific speaker, the model parameters of the speaker-independent model are given as the prior distribution. In an adaptation task for speech input that has been distorted by preprocessing such as noise suppression or sound source separation, model parameters created from undistorted input speech are given as the prior distribution.
A model that includes this variety of prior distributions is called a model structure. In Bayesian learning, treating the degrees of freedom of the model structure as a random variable m ∈ M makes it possible to introduce a posterior distribution over m. Here M denotes the set of all m.
[0018]
Next, starting from the initial model structure, model parameter learning is performed by Bayesian learning (S4). What matters in Bayesian learning is obtaining the posterior distribution of the random variables, but this is usually not easy. For example, suppose the posterior distribution p(θ_c | O, m) of the model parameters θ_c for phoneme category c is sought under a fixed model structure m. Following the Bayesian approach described in Non-Patent Document 4, p(θ_c | O, m) is expressed in terms of the output distribution p(O, Z | Θ, m) and the prior distribution p(Θ | m) as follows.
(Equation 1)

Here Θ = {θ_c : c = 1, ..., C}, C is the number of phoneme categories, Θ_{-c} denotes the complement of θ_c (all of Θ except θ_c), and Z is the set of latent variables. Concretely, the model parameters are the HMM state transition probabilities and, when the output distribution of each HMM state is represented by a Gaussian mixture, the mixture weights and the means and variances of the Gaussians. Z is, concretely, the HMM state sequence variable, which ranges over the possible ways a phoneme can pass through its states, and the Gaussian mixture sequence variable, which ranges over all combinations of which Gaussian component in each state the phoneme passes through. Concrete functional forms for p(O, Z | Θ, m) and p(Θ | m) can be given when the model structure is set.
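The image for equation (1) is not reproduced in this text. A form consistent with the definitions given here, offered as an assumption rather than as the patent's exact typesetting, is Bayes' rule with the other categories' parameters and the latent variables marginalized out:

```latex
p(\theta_c \mid O, m)
  = \frac{\displaystyle \int \sum_{Z} p(O, Z \mid \Theta, m)\, p(\Theta \mid m)\, d\Theta_{-c}}
         {\displaystyle \int \sum_{Z} p(O, Z \mid \Theta, m)\, p(\Theta \mid m)\, d\Theta}
```

The numerator integrates over every parameter set except θ_c and sums over the latent variables Z; the denominator is the marginal likelihood p(O | m). The sum over Z and the integrals over Θ are what make this expression hard to evaluate for HMMs and Gaussian mixtures.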
Using the posterior distribution obtained from equation (1), the posterior p(m | O) of the model structure, given by
(Equation 2)

is calculated as the Bayesian evaluation function (S5), and the m that gives the largest value of equation (2) among all m is selected (S6). That is, the following is obtained.
m̂ = argmax_m p(m | O)   (3)
The model structure m̂ is taken as the acoustic model of the phoneme category. The same processing is performed for all phoneme categories to obtain each acoustic model.
In this way an acoustic model can in principle be created by the Bayes method, but in practice the calculation of equation (1) involves multiple integrals and is therefore difficult to handle analytically. The calculation can also be carried out by Monte Carlo simulation, but that approach is impractical because of the computation time.
Creating an acoustic model by the exact Bayes method is therefore not realistic.
In the present invention, an acoustic model is created using the variational Bayes method. An example is shown in FIG. The pre-processing (S1), the feature vector conversion (S2), and the initial model structure setting (S3) are the same as those in FIG.
[0019]
Next, in this embodiment, the posterior distribution of the model parameters (the variational posterior distribution) is estimated by an approximate calculation using the variational method, based on the variational Bayes evaluation function given by equation (4) (S4).
(Equation 3)
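The equation itself is not reproduced in this text. Based on the abstract and on the definitions that follow, the evaluation function presumably takes the standard variational free-energy form below. The second line is the standard identity showing that F^m is a lower bound on the log marginal likelihood, included here as background rather than quoted from the source.

```latex
F^m[q(\Theta, Z \mid O, m)]
  = \left\langle \log \frac{p(O, Z \mid \Theta, m)\, p(\Theta \mid m)}
                           {q(\Theta, Z \mid O, m)} \right\rangle_{q(\Theta, Z \mid O, m)}
  \tag{4}

\log p(O \mid m)
  = F^m[q] + \mathrm{KL}\big( q(\Theta, Z \mid O, m) \,\big\|\, p(\Theta, Z \mid O, m) \big)
  \;\ge\; F^m[q]
```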

Here ⟨u(y)⟩_{p(y)} denotes the expected value of u(y) with respect to the distribution p(y). q(Θ, Z | O, m) is the posterior distribution obtained approximately by the variational method (the variational posterior distribution). F^m is a functional of the variational posterior distribution q(Θ, Z | O, m). Equation (4) is obtained by following the variational Bayes method shown in Non-Patent Document 4.
Assuming the statistical independence q(Θ, Z | O, m) = q(Z | O, m) Π_{c=1}^{C} q(θ_c | O, m) and maximizing F^m with respect to q(θ_c | O, m) and q(Z | O, m) by the variational method, the appropriate q(θ_c | O, m) and q(Z | O, m) are expressed by the following equations.
(Equation 4)
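The equations themselves are not reproduced in this text. Under the factorization stated above, and additionally assuming that the prior factorizes over phoneme categories as p(Θ | m) = Π_c p(θ_c | m), the standard mean-field result has the following form. This is a sketch under those assumptions, not the original notation.

```latex
q(\theta_c \mid O, m) \;\propto\; p(\theta_c \mid m)\,
  \exp\!\Big( \big\langle \log p(O, Z \mid \Theta, m)
  \big\rangle_{q(Z \mid O, m) \prod_{c' \neq c} q(\theta_{c'} \mid O, m)} \Big)

q(Z \mid O, m) \;\propto\;
  \exp\!\Big( \big\langle \log p(O, Z \mid \Theta, m)
  \big\rangle_{\prod_{c=1}^{C} q(\theta_c \mid O, m)} \Big)
```

Each factor depends on expectations taken under the other, which is why the two are solved by alternating updates.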

q(θ_c | O, m) and q(Z | O, m) depend on each other and can be determined efficiently by an iterative calculation based on the Baum-Welch or Viterbi algorithm. In this way, for a fixed m, the posterior distribution estimator 211 (FIG. 5) estimates the variational posterior distributions q(θ_c | O, m) and q(Z | O, m) by the variational Bayes method, thereby learning the model parameters.
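The alternating structure of this estimation can be illustrated with a much simpler model than an HMM. The sketch below is not the patented procedure: it assumes a one-dimensional mixture of two unit-variance Gaussians with equal, known weights and a Gaussian prior on each mean, so no Baum-Welch or Viterbi pass is needed. The function name `vb_two_means`, the prior settings `m0` and `tau0_sq`, and the data are all hypothetical.

```python
import math


def vb_two_means(x, n_iter=20, m0=0.0, tau0_sq=100.0):
    """Mean-field VB for a 2-component, unit-variance, equal-weight 1-D mixture.

    q(mu_k) = N(mean[k], var[k]) and q(z_n = k) = resp[n][k] are updated in turn.
    """
    K = 2
    mean = [min(x), max(x)]  # initial guess for E[mu_k]
    var = [1.0] * K          # initial Var[mu_k]
    resp = []
    for _ in range(n_iter):
        # Update q(Z): responsibilities from the expected log-likelihood,
        # E[log N(x | mu_k, 1)] = const - 0.5 * ((x - mean_k)^2 + var_k).
        resp = []
        for xn in x:
            logits = [-0.5 * ((xn - mean[k]) ** 2 + var[k]) for k in range(K)]
            top = max(logits)
            w = [math.exp(v - top) for v in logits]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # Update q(theta): Gaussian posterior over each mean given soft counts.
        for k in range(K):
            n_k = sum(r[k] for r in resp)
            s_k = sum(r[k] * xn for r, xn in zip(resp, x))
            var[k] = 1.0 / (1.0 / tau0_sq + n_k)
            mean[k] = var[k] * (m0 / tau0_sq + s_k)
    return mean, var, resp


data = [0.1, -0.2, 0.0, 0.3, -0.1, 5.1, 4.9, 5.2, 4.8, 5.0]
post_mean, post_var, resp = vb_two_means(data)
print(post_mean, post_var)
```

The posterior variance of each mean shrinks as its soft count grows, which is the kind of uncertainty information a point estimate from maximum likelihood would discard.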
[0020]
Next, an evaluation function serving as an index for determining the model structure is considered. Assuming that the prior distribution of m is uniform, the variational posterior distribution q(m | O) and F^m have the relationship shown in the following equation.
(Equation 5)

Thus, an appropriate model structure m̂ can be determined in the sense of posterior probability maximization (MAP). That is, F^m is not only the objective for estimating q(Θ | O, m) and q(Z | O, m) but also an evaluation function that gives the optimality of the model structure m. Therefore, by using F^m, the learning of models that include latent variables, such as HMMs and Gaussian mixture models, and the determination of the model structure can be treated in a unified way with a single variational Bayes evaluation function. The variational posterior distribution q(Θ, Z | O, m) obtained by model parameter learning is substituted into equation (4), and the variational Bayes evaluation function, which serves as the model structure determination function for that fixed m, is calculated by the evaluation function calculation unit 212 (FIG. 5) (S5).
F^m is calculated for every m in the set M, and the model selecting unit 22 (FIG. 5) determines an appropriate model structure based on equation (7) (S6). That is, the model structure with the largest evaluation function value obtained in step S5, together with its posterior distributions q(θ_c | O, m) and q(Z | O, m), is taken as the acoustic model of phoneme category c.
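The selection step can be sketched as follows. The sketch assumes that F^m has already been computed for each candidate structure and uses the standard variational Bayes relation q(m | O) ∝ exp(F^m) for a uniform prior over m. The function name `select_model`, the structure labels, and the F^m values are made up for illustration.

```python
import math


def select_model(f_values):
    """Return the structure with the largest F^m and q(m | O) under a uniform prior."""
    best = max(f_values, key=f_values.get)
    top = f_values[best]
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = {m: math.exp(f - top) for m, f in f_values.items()}
    total = sum(weights.values())
    posterior = {m: w / total for m, w in weights.items()}
    return best, posterior


# Hypothetical evaluation function values for three candidate structures.
f_values = {"1 mixture": -1250.4, "2 mixtures": -1243.9, "4 mixtures": -1247.2}
best, posterior = select_model(f_values)
print(best, posterior)
```

Because the exponential is monotonic, picking the largest F^m and picking the largest q(m | O) give the same structure; the normalized posterior is only needed if the relative support for the alternatives is of interest.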
It is checked whether all phoneme categories c have been selected (S7). If any phoneme category has not yet been selected, one of the unselected categories is selected and the process returns to step S3 (S8). When the acoustic models for all the phoneme categories have been determined, the process ends.
[0021]
In the model selecting unit 22, the changes of the model structure are expressed hierarchically using a tree structure, so that an appropriate model structure can be searched for efficiently. In this embodiment, an example of determining the model structure for the sharing of environment-dependent phonemes using a tree structure is described below. Other search strategies may also be used with F^m: for example, the states and Gaussian distributions of the most finely subdivided model structure may be merged from the bottom up, and the model structure determined so that F^m becomes larger. The same argument is effective not only for determining the model structure related to the sharing of environment-dependent phonemes, but also for determining the number of HMM states per phoneme and the number of mixture components when the output distribution of an HMM state is represented by a Gaussian mixture distribution. This is because the sharing of environment-dependent phonemes can be regarded as a problem of clustering a plurality of environment-dependent phonemes within an environment-independent phoneme, and, likewise, determining the number of HMM states and the number of mixture components are clustering problems within an environment-dependent phoneme and within an HMM state, respectively, so that all three can be treated as essentially the same kind of clustering problem. Therefore, the model structure can be determined by performing these three types of clustering simultaneously or independently.
[0022]
An embodiment is described for the environment-dependent phoneme sharing problem. This method is disclosed in Non-Patent Document 5, for example. The description assumes an initial model in which each environment-independent phoneme category has three HMM states and the output distribution of each state is represented by a Gaussian mixture distribution. The environment-dependent phoneme category used here is the triphone category, which takes into account the phonemes immediately before and after the phoneme. Let Ω(n) be the set of HMM states associated with a node n of a tree. First, a root node (n = 0) is prepared. That is, the set of triphone HMM states having the same center phoneme is associated with the root node. Then, using a question Q appropriately selected from the question group, the set Ω(n) is split, as shown in FIG., according to the answer (Yes or No) to question Q into Ω(n_Y^Q) and Ω(n_N^Q), which are associated with the new nodes n_Y^Q and n_N^Q. How to select the appropriate question is described later.
[0023]
In the following, the number of branches is assumed to be 2, but the same discussion applies when the number of branches is larger. The state set in each node newly obtained by this division is divided again by a question, and this is repeated to finally construct a tree structure as shown in FIG. By sharing the state set associated with each leaf node, a state-sharing HMM structure is constructed. The question group used consists of questions about the preceding and following phonemic environments, obtained from knowledge of phonetics. FIG. 10 shows specific examples of the questions. For each candidate question, q(Θ | O, m) and q(Z | O, m) are obtained by variational Bayes learning and the evaluation function F^m is calculated. An appropriate division is made by adopting the question that gives the largest change in F^m. Doing this for all nodes yields a tree structure optimized with respect to F^m. The number of leaf nodes in the tree structure can also be determined by treating a node at which F^m no longer increases, or decreases, as a leaf node. Thus, an appropriate model structure can be determined. That is, for the plurality of triphone categories remaining in each leaf node, the model structure of that leaf node is used in common. Instead of using questions, all possible ways of dividing Ω(0) may be considered: for every combination, q(Θ | O, m) and q(Z | O, m) are obtained by the variational Bayes method, F^m is calculated, and the division with the largest change in F^m is adopted.
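The question-selection step can be sketched in Python as follows. This is a toy, one-dimensional illustration rather than the patented procedure: each node is assumed to be a single Gaussian with a Normal-Gamma prior and a fixed data assignment, so the exact log marginal likelihood stands in for F^m (the two coincide when no latent variables remain). The names `log_evidence` and `best_split`, the hyperparameters, the triphone labels, the questions, and the data are all hypothetical.

```python
import math


def log_evidence(x, mu0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log marginal likelihood of 1-D data: Gaussian likelihood, Normal-Gamma prior."""
    n = len(x)
    if n == 0:
        return 0.0
    mean = sum(x) / n
    ss = sum((v - mean) ** 2 for v in x)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))


def best_split(states, questions, mu0):
    """Pick the question with the largest gain; (None, 0.0) means keep the node as a leaf."""
    parent = log_evidence([v for data in states.values() for v in data], mu0)
    best_q, best_gain = None, 0.0
    for name, yes_set in questions.items():
        yes = [v for s, data in states.items() if s in yes_set for v in data]
        no = [v for s, data in states.items() if s not in yes_set for v in data]
        if not yes or not no:
            continue  # a question that does not divide the set is skipped
        gain = log_evidence(yes, mu0) + log_evidence(no, mu0) - parent
        if gain > best_gain:
            best_q, best_gain = name, gain
    return best_q, best_gain


# Hypothetical triphone states (left-center+right) with 1-D feature values.
states = {
    "a-k+i": [0.0, 0.1, -0.1],
    "a-k+u": [0.05, -0.05],
    "o-k+i": [10.0, 10.1, 9.9],
    "o-k+u": [10.05, 9.95],
}
questions = {
    "left phoneme is a?": {"a-k+i", "a-k+u"},
    "right phoneme is i?": {"a-k+i", "o-k+i"},
}
all_values = [v for data in states.values() for v in data]
mu0 = sum(all_values) / len(all_values)  # prior mean taken from the root node data
print(best_split(states, questions, mu0))
```

In this toy data the left-context question separates the two clusters and is adopted, while the right-context question lowers the evidence and would leave the node as a leaf if it were the only candidate.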
[0024]
The number of HMM states and the number of Gaussian mixture components can be determined in the same manner. For example, for the number of HMM states, at each node the change in F^m is calculated before and after dividing the shared learning data set into one state and the rest, and likewise before and after dividing it into 2, 3, ... states; the division giving the largest change in F^m is adopted as the learning data set of the next node.
As described above, by using the variational Bayesian approach with F^m as the evaluation function, an acoustic model can be created by learning the model parameters and determining an appropriate model structure. The model structure, the number of HMM states, and the number of Gaussian mixture components may also be determined simultaneously, for example by applying the respective division conditions at the same time.
[0025]
Further, as indicated by the broken lines in FIG. 7, when all phoneme categories have been selected in step S7, the process may, if necessary, continue without terminating, and the model parameters are learned again by the variational Bayes method based on the obtained acoustic model (S9). In this case the same operation as in step S4 is performed, but the model structure used is not the model structures 1 to M but the acoustic model obtained in step S6. The posterior distribution of the model parameters obtained by this re-learning is adopted for the acoustic model. In FIG. 7, the model structure and the variational posterior distribution of the model parameters are determined for each phoneme category. Alternatively, after step S3 in FIG. 7 has been executed for all phoneme categories, the model structure of each phoneme category may be determined so that the evaluation function value over all phoneme categories is maximized, and these model structures and the corresponding variational posterior distributions of the model parameters may be taken as the acoustic models, so that the acoustic models of all phoneme categories are determined at the same time.
[0026]
When the model parameters are re-learned in this manner, the following may be done. When the model structure is first determined (selected), the output distribution of each HMM state is set to a single Gaussian distribution and the assignment of learning data to each HMM state is fixed, as in the phoneme decision tree method. The evaluation function value can then be obtained without the iterative calculation for estimating the variational posterior distribution, which greatly reduces the calculation time. Then, in the re-learning of the model parameters in step S9, the variational posterior distribution of the model parameters actually used is obtained with the number of output distributions per HMM state increased and the assignment of learning data to the HMM states made variable.
[0027]
As described above, when creating a model for a specific speaker, that is, in a speaker adaptation task, an existing speaker-independent (unspecified speaker) acoustic model is used as the initial model structure, and the processing from step S4 onward in FIG. may be performed with its model parameters as the Bayesian prior distribution. A speech signal of the specific speaker is used as the input learning speech data. Likewise, when creating an acoustic model for recognizing a distorted speech signal, that is, in an adaptation task for distorted speech, an existing acoustic model created from undistorted input speech is used as the initial model structure, and the processing from step S4 onward in FIG. 7 is performed with its model parameters as the Bayesian prior distribution. A speech signal subjected to the distortion is used as the input learning speech data.
[0028]
Next, an embodiment of a speech recognition apparatus using an acoustic model created according to the present invention will be described with reference to FIG. FIG. 12 shows the flow of the processing.
The unknown input speech signal is input to the feature vector conversion unit 31 and converted, frame by frame, into a feature vector x (S1). The feature used here is the same as the one used when creating the acoustic model stored in the model storage unit 32. The model storage unit 32 stores the acoustic model created by the method of the present invention, that is, the variational posterior distribution q(θ_c | O, m) of the model parameters θ_c and the model structure. In practice, quantities such as the mean of the Gaussian distribution and the mean of the variance are stored as the variational posterior distribution q(θ_c | O, m). For each phoneme category c, the distribution p(x | θ_c, m) of the speech data x given its model parameters θ_c and model structure m, that is, the mean and variance of that distribution, is also stored.
[0029]
For the feature vector x of each frame, the score calculation unit 33 calculates, for each phoneme category c, an acoustic score s(c) based on Bayesian prediction, using the acoustic model in the model storage unit 32, according to the following equation.
s(c) = ∫ dθ_c q(θ_c | O, m) p(x | θ_c, m)   (8)
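The effect of integrating over the posterior in equation (8) can be seen in a minimal sketch. It assumes a one-dimensional Gaussian likelihood with known variance and a Gaussian posterior over the mean, for which the integral has a closed form (a Gaussian whose variance is the sum of the two variances). This conjugate setup and the function names are illustrative assumptions, not the model stored in the model storage unit 32.

```python
import math


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive_score(x, post_mean, post_var, obs_var):
    """Closed form of the integral of N(theta; post_mean, post_var) * N(x; theta, obs_var)."""
    return normal_pdf(x, post_mean, post_var + obs_var)


def plug_in_score(x, post_mean, obs_var):
    """Likelihood evaluated only at the posterior mode of the mean."""
    return normal_pdf(x, post_mean, obs_var)


def numeric_score(x, post_mean, post_var, obs_var, steps=4000, width=10.0):
    """Midpoint-rule evaluation of the same integral, as a check on the closed form."""
    sd = math.sqrt(post_var)
    lo = post_mean - width * sd
    h = 2.0 * width * sd / steps
    total = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * h
        total += normal_pdf(theta, post_mean, post_var) * normal_pdf(x, theta, obs_var)
    return total * h


for x in (0.0, 1.5, 4.0):
    print(x, predictive_score(x, 0.0, 0.5, 1.0), plug_in_score(x, 0.0, 1.0))
```

The predictive score is flatter than the plug-in score: it is lower near the posterior mode and higher in the tails, which reflects the remaining uncertainty about the parameter when little learning data is available.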
This integral may be approximated by posterior probability maximization (MAP) as follows.
(Equation 6)
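The equation itself is not reproduced in this text. A form consistent with the description, in which the integral of equation (8) is replaced by the likelihood at the mode of the variational posterior, would be the following. It is a reconstruction under that reading, not the original notation.

```latex
s(c) \approx p(x \mid \hat{\theta}_c, m), \qquad
\hat{\theta}_c = \operatorname*{arg\,max}_{\theta_c} \; q(\theta_c \mid O, m)
```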

Using the acoustic score s(c) calculated in this way for each phoneme category c in each frame, the phoneme category determination unit 34 determines the phoneme category or its candidates by, for example, the Viterbi algorithm (S3). For the determined phoneme categories, the word recognition unit 35 outputs the recognized word string by combining them with the pronunciation dictionary and the language model in the memory 36 (S4).
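How per-frame scores are turned into a category sequence can be sketched with a basic Viterbi search. The sketch is simplified: it decodes directly over phoneme categories rather than over HMM states, and the log scores and transition weights are made-up values, not outputs of the score calculation unit 33.

```python
import math


def viterbi(frame_log_scores, log_trans):
    """Best category sequence given per-frame log scores and log transition weights."""
    cats = list(frame_log_scores[0])
    best = {c: frame_log_scores[0][c] for c in cats}
    back = []
    for scores in frame_log_scores[1:]:
        new_best, pointers = {}, {}
        for c in cats:
            # Best predecessor for category c at this frame.
            prev = max(cats, key=lambda p: best[p] + log_trans[p][c])
            new_best[c] = best[prev] + log_trans[prev][c] + scores[c]
            pointers[c] = prev
        best = new_best
        back.append(pointers)
    # Trace back from the best final category.
    path = [max(cats, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


stay, switch = math.log(0.8), math.log(0.2)
log_trans = {"a": {"a": stay, "k": switch}, "k": {"a": switch, "k": stay}}
frames = [
    {"a": -1.0, "k": -5.0},
    {"a": -2.0, "k": -2.5},
    {"a": -6.0, "k": -1.0},
]
print(viterbi(frames, log_trans))
```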
To demonstrate the effectiveness of the present invention, experiments were conducted on how the word recognition rate changes with the amount of learning data, taking as the conventional method the parameter learning and model structure selection method that combines the maximum likelihood method with the minimum description length (MDL) criterion, shown in Non-Patent Document 2. For the experiment, the speech analysis conditions shown in FIG. 13 and the initial HMM shown in FIG. 14 were prepared. The prior distribution parameters were given by the mean and variance of the learning data assigned to the triphone HMM state set at the root node of the phoneme decision tree. FIG. 15 shows the data used for learning and evaluation. The amount of data was varied by randomly extracting sentences from the learning data. FIGS. 16 and 17 show, respectively, the recognition rates of the invention method and the conventional method and the total number of tree divisions (≈ model structure) as the amount of learning data changes. In conventional method (1), the description length is obtained from the number of samples at the root node when constructing the state-sharing HMM, and the model structure is selected based on the MDL criterion. In this experiment, the number of mixture components of the output distribution was kept at 1. Comparing the Bayesian criterion of the present invention with conventional method (1), the recognition rate of the invention method in the small learning data region (60 sentences or fewer) exceeds that of conventional method (1) by up to almost 50%. This can be seen from FIG. This indicates that the method of the present invention works well even in the small-scale learning data region, which is outside the applicable range of the MDL criterion. In fact, as the learning data decreases in FIG. 16, the invention method selects a model structure with zero divisions, whereas the number of divisions does not approach 0 with conventional method (1).
[0030]
The graph of conventional method (2), on the other hand, is obtained by adjusting the description length to avoid the above problem of the MDL criterion in the small-scale learning data region. Here, the description length was adjusted so that the number of divisions in the small-scale learning data region matched that of the Bayesian method of the present invention. Even when the model structures are almost the same, the invention method is about 10% higher than the conventional method. This difference is considered to arise because over-learning is alleviated by the estimation of the variational posterior distribution and by the expectation operation in the acoustic score calculation based on Bayesian prediction.
Next, FIG. 18 shows the Bayesian evaluation function values, and the accompanying change in the recognition rate, when the learning data is fixed at 3,000 sentences and the number of output distributions per state is changed uniformly. The recognition rate increases as the number of mixture components increases, but when the number exceeds 15 the recognition rate deteriorates due to over-learning. The change in the evaluation function value almost coincides with the change in the recognition rate, which indicates that the present invention is also effective for setting the number of mixture components of the output distribution.
[0031]
The acoustic model creation device according to the present invention shown in FIG. 5 can be implemented on a computer. In that case, for example, an acoustic model creation program for causing a computer to execute each step of the method shown in FIG. 7 is loaded into the computer from a recording medium such as a CD-ROM or a magnetic disk, or via a communication line, and the computer executes the program. Similarly, the speech recognition device shown in FIG. 11 may be implemented on a computer.
[0032]
[Effect of the invention]
As described above, according to the present invention, high-performance acoustic model structure determination and acoustic model parameter learning can be realized even with small-scale learning data.
[Brief description of the drawings]
FIG. 1 is a diagram showing a general functional configuration of acoustic model creation and speech recognition.
FIG. 2 is a diagram illustrating an example of a hidden Markov model.
FIG. 3 is a diagram showing a functional configuration of a conventional acoustic model creation device that performs evaluation by voice recognition.
FIG. 4 is a diagram showing a functional configuration of a conventional acoustic model creation device that performs evaluation using an evaluation function.
FIG. 5 is a diagram showing an example of a functional configuration of an acoustic model creation device according to the present invention.
FIG. 6 is a flowchart showing a possible method of an acoustic model creation method using the Bayes method.
FIG. 7 is a flowchart showing an example of an acoustic model creation method according to the present invention.
FIG. 8 is a view for explaining division of an HMM state set for a query when determining a model structure using a tree structure.
FIG. 9 is a diagram showing an example of a tree structure used for determining a model structure.
FIG. 10 is a diagram showing a specific example of a question used for dividing the HMM state set.
FIG. 11 is a diagram showing a functional configuration example of a speech recognition device according to the present invention.
FIG. 12 is a flowchart showing an example of a processing procedure of a voice recognition method according to the present invention.
FIG. 13 is a diagram showing speech analysis conditions used in the experiment.
FIG. 14 is a diagram showing an initial HMM used in the experiment.
FIG. 15 is a diagram showing learning / evaluation data used in an experiment.
FIG. 16 is a diagram showing experimental results of a recognition rate according to learning data.
FIG. 17 is a diagram showing experimental results of the number of divisions according to learning data.
FIG. 18 is a diagram showing experimental results of recognition rates and evaluation function values when the number of output distributions per state is uniformly changed.

Claims

A method for creating an acoustic model for speech recognition, comprising the steps of:
converting a learning speech signal into time-series feature vectors;
preparing a plurality of model structures and Bayesian prior distributions of the acoustic model and, for each of them, estimating a variational posterior distribution of the model parameters from the time-series feature vectors so as to maximize a variational Bayes evaluation function;
calculating again, for each of them, a variational Bayes evaluation function value based on the estimated variational posterior distribution; and
determining, based on the plurality of calculated variational Bayes evaluation function values, a plurality of model structures that maximize the variational Bayes evaluation function, and obtaining an acoustic model as the set of each such model structure and the corresponding variational posterior distribution.

The method for creating an acoustic model for speech recognition according to claim 1, further comprising the step of re-estimating, for the determined model structure, the variational posterior distribution of the model parameters from the time-series feature vectors so as to maximize the variational Bayes evaluation function, thereby correcting the variational posterior distribution of the acoustic model.

The method for creating an acoustic model for speech recognition according to claim 1 or 2, wherein an existing acoustic model is used for the initial model structure and the Bayesian prior distribution, an adaptation learning speech signal is used as the learning speech signal, and an acoustic model is created in which the existing acoustic model is adapted to the adaptation learning speech signal.

The method for creating an acoustic model for speech recognition according to any one of claims 1 to 3, wherein the step of estimating the variational posterior distribution of the model parameters is a step of estimating, for each of the prepared model structures, the variational posterior distribution of the model parameters of each phoneme category and the variational posterior distribution of the latent variables of the acoustic model so as to maximize the variational Bayes evaluation function, and
the step of calculating the evaluation function value is a step of calculating the variational Bayes evaluation function value using the variational posterior distribution of the model parameters and the variational posterior distribution of the latent variables.

The method for creating an acoustic model for speech recognition according to claim 2, wherein the acoustic model has a state-sharing hidden Markov model structure, and the estimation of the variational posterior distribution of the model parameters, the estimation of the variational Bayes evaluation function value, and the determination of the model structure are performed with the assignment of each time-series feature vector to each hidden Markov model state fixed and with the output distribution of each state set to a single Gaussian distribution, and
in the step of correcting the variational posterior distribution, the assignment of the time-series feature vectors is made variable and a Gaussian mixture distribution with a plurality of components per state is used.

An apparatus for creating an acoustic model for speech recognition, comprising:
a feature vector conversion unit that receives a learning speech signal and outputs its time-series feature vectors;
a plurality of posterior distribution estimation units in which mutually different model structures and Bayesian prior distributions are respectively set, each of which receives the time-series feature vectors and estimates a variational posterior distribution of the model parameters so as to maximize a variational Bayes evaluation function;
a plurality of evaluation function calculation units that respectively receive the variational posterior distributions of the model parameters from the posterior distribution estimation units and each calculate a variational Bayes evaluation function value; and
a model selecting unit that receives the variational Bayes evaluation function values from the evaluation function calculation units, determines each model structure that maximizes the variational Bayes evaluation function value, and outputs an acoustic model consisting of each such model structure paired with the variational posterior distribution of the corresponding model parameters.

An acoustic model creation program for causing a computer to execute each step of the method for creating an acoustic model for speech recognition according to any one of claims 1 to 5.

A computer-readable recording medium on which the acoustic model creation program according to claim 7 is recorded.

A speech recognition apparatus comprising:
a model storage unit storing an acoustic model created by the method for creating an acoustic model for speech recognition according to any one of claims 1 to 5;
a feature vector conversion unit that obtains a feature vector of an unknown input speech signal for each frame;
a score calculation unit that calculates, for the feature vector, the acoustic score of each category stored in the model storage unit using the variational posterior distribution of its model parameters; and
a category determination unit that determines the category of the unknown input speech signal from the calculated scores.

A speech recognition method comprising:
obtaining the feature of an unknown input speech signal for each frame and converting it into a feature vector;
calculating, for each feature, an acoustic score for each category based on Bayesian prediction, using an acoustic model created by the method for creating an acoustic model for speech recognition according to any one of claims 1 to 5; and
determining the category of the unknown input speech signal from the sequence of per-category acoustic scores obtained for each feature.

A speech recognition program for causing a computer to execute each step of the speech recognition method according to claim 10.

A computer-readable recording medium on which the speech recognition program according to claim 11 is recorded.