JP4180110B2

JP4180110B2 - Language recognition

Info

Publication number: JP4180110B2
Application number: JP52671596A
Authority: JP
Inventors: フランシス・ジェイムズスカヒル、; アリソン・ダイアンサイモンズ、; スティーブン・ジョンホイットテイカー、
Original assignee: British Telecommunications PLC
Current assignee: British Telecommunications PLC
Priority date: 1995-03-07
Filing date: 1996-03-07
Publication date: 2008-11-12
Anticipated expiration: 2016-03-07
Also published as: CA2211636C; NO974097L; DE69615667D1; NO974097D0; AU702903B2; ES2164870T3; KR19980702723A; DE69615667T2; JPH11501410A; US5999902A; CN1178023A; CN1150515C; AU4887696A; EP0813735B1; WO1996027872A1; NZ302748A; CA2211636A1; KR100406604B1; EP0813735A1; MX9706407A

Description

本発明は、入力音声信号が最も明らかに似ている単語（または、より一般的には話声）の語彙の何れか１つを確認するために認識プロセスを実行し、単語の語彙と関係しているアプリオリ確率に関して情報が有効である言語認識器に関する。（訳者注：sppech recognitionは音声認識の訳語をあてることが多いが、ここではvoiceではなく話し言葉speechの訳に言語をあてる。）この状況の１例は、我々の共願の国際特許出願第WO95/02524号明細書に記載された自動電話番号案内システムである。このシステムでは、
（ｉ）ユーザは都市の名前を話し；
（ii）言語認識器は、記憶された都市のデータを参照することによって、話された都市の名前に最もよく整合する幾つかの都市を識別し、“得点（スコア）”または整合の一致度を示す確率を生成し；
（iii）リストは識別された都市に存在する全ての道路名からコンパイルされ；
（iv）ユーザは道路名を話し；
（ｖ）言語認識器はリストに含まれるものの中から幾つかの道路名を識別し、話された道路名に最もよく整合するものだけ得点を与え、
（vi）道路得点は、その道路が位置する都市が得た得点にしたがってそれぞれ重み付けをされ、最も可能性の高い“道路”が最も重み付けされた得点を有すると考えられる。
アプリオリ確率は、前の言語認識プロセスから始まる必要はない；例えば上記の特許出願に記載されている別の番号案内システムでは、呼の発生源を識別する信号を使用して、その領域から照会者が望んでいる可能性の最も高い都市に関する統計的情報にアクセスして、都市名認識プロセスの結果に重み付けをする。
このプロセスは、留保条件のために信頼度が高いという利点を有する。この留保条件とは、例えば、道路名認識段階で第１の選定都市の道路よりも第２の選定都市の道路の方が著しく得点が高くなければ、第２の選定都市からは道路を選定しないというものである。しかしながらこのプロセスの欠点は、道路名認識段階を実行するとき、認識器が道路名の制限された数だけしか生成しないので、この短い道路名リストでは、低得点の都市に位置する道路名しか含むことができない、すなわち高得点の都市内に位置する道路の低得点の道路名は、重み付けプロセスを適用する前に認識器によって既に“プルーニング”されていることである。
米国特許第47838303号明細書は、アプリオリ確率が先に認識された１又は複数のパターンの所定のコンテキストに関係している言語認識装置を記載している。ある単語の後にある別の単語が発生する確率を示す言語得点は、それらの単語を含むシーケンスに対して得られる得点と共同される。
本発明にしたがうと、言語認識方法であって：
未知の話声の部分を基準モデルと繰返し比較して類似性について累積された尺度を生成して、この累積された尺度が、基準話声の複数の許容できるシーケンスを定義する記憶されたデータにより定義された該シーケンスの各々に対して生成されるようにし、この類似性の累積された尺度が、それぞれの許容できるシーケンス内の以前の話声に対応する基準モデルと話声の１又は複数の以前の部分との比較から得られた、以前に生成された尺度からの寄与分を含むものであるが、別の繰返しの比較から、他のシーケンスに対する尺度より、所定のプルーニング基準によって定義された程度まで類似性の指標を小さくするようなシーケンスを除くものであるようにし；
該累積された尺度に許容されたシーケンスの各々に対する重み付け因子に従って重み付けをし、この重み付けが部分的シーケンスに対する尺度もしくは累積された尺度の各計算に対して、この部分的シーケンスで始まる許容できるシーケンスの各々に対する重み付け因子から、話声もしくはこの部分的シーケンスで始まるより短いシーケンスに対して生成された尺度に加えられる重み付け因子を差引いた組合せた値による重み付けであるようにする言語認識方法が提供される。
好ましくは、前記重み付けをした累積された尺度が他のシーケンスに対するよりも、プルーニング基準により定義された程度まで類似性の指標を小さくするようなシーケンスが、別の繰返し比較から排除される。プルーニングは、生成された尺度の数に依存して行われ、さらなる繰返し比較から除外されずに、その数を一定に保つようにする。
本発明の別の態様では、言語認識装置であって：
話声を表す基準モデルに関係するデータと基準話声の許容できるシーケンスを定義するデータとを記憶するための記憶手段と；
未知の話声の部分を基準モデルと繰返し比較して類似性の累積された尺度を、基準話声の複数の許容できるシーケンスを定義する記憶されたデータにより定義されたこのシーケンスの各々に対して生成し、該累積された尺度は前に生成された尺度からの寄与分を含むものであって、それぞれの許容できるシーケンス内の以前の話声に対応する基準モデルと話声の１又は複数の以前の部分との比較から得られた尺度であるように生成するための比較手段と；
該累積された尺度に許容されたシーケンスの各々に対する重み付け因子に従って重み付けをし、この重み付けが部分的シーケンスに対する尺度もしくは累積された尺度の各計算に対してこの部分的シーケンスで始まる許容できるシーケンスの各々に対する重み付け因子から、話声もしくはこの部分的シーケンスで始まるより短いシーケンスに対して生成された尺度に加えられる重み付け因子を差引いた組合せた値による重み付けとするよう重み付けをする手段とから成る言語認識装置が提供される。
さらに別の態様では、本発明は、音に対応する基準モデルと定義する記憶されたデータと認識すべき話声に各シーケンスが対応しており、かつこのモデルの許容できるシーケンスを定義する記憶されたデータとを参照することにより言語認識をする方法であって：
未知の話声の部分を基準モデルと比較して話声の以前の部分と部分的に許容できるシーケンスとの間の類似性を示す尺度を更新し、話声のより長い部分とより長い部分的に許容できるシーケンスとの間の類似性を示す尺度を作るようにし；
これら部分的なシーケンスでその尺度が類似性について定義された度合いよりも小さな尺度となるようなものを識別し；
識別された部分的なシーケンスの１つで始まるシーケンス又は部分的なシーケンスに関する尺度のさらなる生成を抑制し；
て成り、該識別は尺度の閾値との比較により実行され、またこの閾値は生成されかつ抑制されていない尺度の数に依存して繰返し調節されて、その数が定数を維持するようにされていることを特徴とする方法が提供される。
本発明のさらに別の態様では、基準話声の複数の許容できるシーケンスを表す言語認識網の各ノードに重み付け因子を指定する方法であって：
各ノードに対して、そのノードを取込んでいる部分的シーケンスで始まる許容できるシーケンスの各々に対する重み付け因子の値と、その部分的シーケンスで始まる話声もしくはより短いシーケンスに適用される重み付け因子を差引く値の組合せをすることを含む方法が提供される。
重み付け因子はログ変域内で生成することができ、所定の重み付け因子のログは、許容できるシーケンスに対応する網の最終ノードに対して指定され；
各先行するノードに対してログ確率値としてノードまたは後段のノードに指定されたそれらの値の最大値を指定され；
各ノードに対する値から先行するノードに指定された値を減ずる。
前記ノードは基準話声を表すモデルと関係しており、関係するモデルのパラメータは各ノードに指定された重み付け因子を反映して修正することができる。
本発明はとくに、木（トリー）構造を有し、第１のノード以外の少なくとも１つのノードが２以上の枝をもつ認識網に応用可能である。
ここで本発明の幾つかの実施形態を例示的に添付の図面を参照して記載する。
図１は、本発明の１実施形態にしたがう装置のブロック図である。
図２は、隠れマルコフモデル（Hidden Markov Models）のネットワークの概略図を示す。
図３は、図１のトークンメモリの内容を示す。
図４は、図１の装置によって重み付けの配置を示す。
図５は、図１のノードメモリの内容を示す。
図６は、図１の動作を示すフローチャートを示す。
図７は、図１の語彙メモリの内容を示す。
図８は、図４の配置に対する代わりの重み付け手順を示す。
言語認識には基本的に２つの方法がある。並列処理方法は、各話声（例えば、単語）が基準テンプレートまたはモデルを連続的に比較して、もっとも類似するものを１つ以上識別するやり方であり、トリー処理方法は、話声の一部分（例えば、フォニーム）が基準テンプレートまたはモデル（なお“モデル”は一般的な意味で使用している）と比較して、その部分を識別し、さらに次の部分に対して同様の処理を行なうやり方である。
ここでトリー構造を使用する実施形態を記載する。
図１の言語認識器は、言語信号用の入力１を有し、これはデジタル対アナログコンバータ２によってデジタル形式に変換される。次にこのデジタル信号は、例えば１０ｍｓの間多数のパラメータまたは“特徴”をもつ継続するフレームのそれぞれを計算する特徴抽出器３に供給される。通常使用される特徴、例えばＭｅｌ周波数ケプストラム係数（Mel frequency cepstral coefficients）または変形予測係数を選択することができる。
フレーム毎の可能な特徴値の組合せの数は相当に大きく、後の処理を処理可能な程度まで減少するために、ベクトル量子化を応用するのが一般的で、すなわち特徴の組を制限された数ｍの標準の特徴の組合せ（ｖ₁、ｖ₂、…、ｖ_m）の１つに整合させるのが一般的である。これはベクトル量子化器（ＶＱ）４によって行われ、単一の数または“観察”Ｏ_j（ｊ番目フレームに対して）を生成する。次にこれが、分類器５に供給され、通常ここでは観察シーケンス［Ｏ_j］をモデルメモリ６に記憶された１組のモデルに整合させる。各モデルは異なるサブワード、例えばフォニームに対応する。分類器は、プログラムメモリ52、ノードメモリ53、およびトークンメモリ54内の記憶プログラムによって制御される中央プロセッサ51を含んでいる。分類器は、隠れマルコフモデルを使用して分類処理を行なう。ここでその原理を記載する。
概念的に、隠れマルコフモデルは“ブラックボックス”として扱われ、ｎの可能な状態を有し、正規間隔で１つの状態から次の状態へ進むことができるか、またはその代わりに確率のパラメータにしたがって同じ状態に留まることができる。状態ｉから状態ｊへの遷移確率はａ_ijであり、状態ｉに留まる確率はａ_iiである。したがって、次の式のようになる。

言語音の時間的順序が原因で、左−右モデルが一般的に、ａ_ijが

のときのみゼロ以外になるものに対して使用される。特定の状態で、出力が生成され、それは可能な限定された数ｍの出力、例えば、第２の組の確率にしたがってｖ₁、ｖ₂、…、ｖ_mの１つであってもよい。このコンテキストでは、ｖ_kは特定の組の言語の特徴を識別する。状態ｊのときに出力ｖ_kを生成する確率はｂ_jkである。したがって、

第３のパラメータは、特定の状態から始まる確率である。状態ｉから始まる確率はπ_iである。
したがってモデルは、１組のパラメータ、
Ａ＝［ａ_ij］（ｉ＝１…ｎ，ｊ＝１…ｎ）
Ｂ＝［ｂ_jk］（ｉ＝１…ｎ，ｊ＝１…ｎ）
π＝［π_i］（ｉ＝１…ｎ）、
およびこのパラメータに適用されて、出力シーケンスを生成することができる１組の規則から成る。事実、モデルは存在せず、出力シーケンスも生成されない。むしろ、言語認識の問題は、ｖのシーケンス（各ｖは観察された言語の特徴の組を表している）を与えるとき、Ａ、Ｂ、πによって定められるモデルＭがこのシーケンス（観察シーケンス）を生成できる確率Ｐが何であるか”という質問として形成される。
この質問は、それぞれが（例えば）異なるフォニームを表している多数の異なるモデルに対して照会されるとき、最も高い確率を有するモデルによって表されるフォニームが認識されたと考えられる。
観察シーケンスが時間ｔ＝１乃至ｔ＝Ｔに対してＯ₁、Ｏ₂、…、Ｏ_Tであると想定する。この観察で状態_jに到達する確率α_T（j）は、次の反復式によって得られる。

モデルＭによって生成される観察シーケンスＯの確率は、次のとおりである。

これは、全ての可能な状態のシーケンスを考慮した観察シーケンスＯの確率であり；実際には、一定量の計算を減少するために、通常はＶｉｔｅｒｂｉアルゴリズムを呼出し、観察シーケンスを生成するのに最高の確率をもっている状態シーケンスと関係する確率を計算する。この場合式１乃至３が次の式と置換される。

または、ログ変域では、

モデルメモリ６は、相関言語の各フォニームに対するＡ、Ｂ、およびπの値（これらは一緒にモデルＭと呼ばれる）を含む。モデルのパラメータを生成するためのトレーニングプロセスは一般的であるので、さらに説明を加えない。S.J.Coxによる”Hidden Markov Models for Automatic Speech Recognition: Theory and Application”（British Telecom Technology Journal Vol.6, No.2, 1988年2月）を参照されたい。特定の観察シーケンスＯのフォニームは、各モデルＭ₁…Ｍ_Q（なお、Ｑはモデルの数である）に対してＰ_r ^v（Ｏ｜Ｍ_i）を計算することによって認識される。最も高いＰ_r ^vを生成するモデルを有するフォニームは、認識されると考えられる。
もちろん、実際には単語を認識することが必要である。この処理は、多数のノードを有するネットワークまたはトリー構造の形態で視覚化することができる。この構造は、後で分かるように各ノードはメモリの各領域に対応するという意味でのみ存在する。
図２は、“ｙｅｓ”と“ｎｏ”を区別する簡単なネットワークを示している。ここではこれらのフォニームを｛ｙ｝｛ｅｈ｝｛ｓ｝および｛ｎ｝｛ｏｗ｝で示している。
図２でノード10は、最後のノード16と同様にノイズモデル（全体的に１つの状態のモデル）に対応しており、これらは前後の“黙音（silence）”を表している。最後のノードを除く残りのノードは、図示されたフォニームに対応している。例えば、ノード11は“ｙｅｓ”のフォニーム［ｙ］に対応している。
動作において、ノードは図３に示されている次の情報を含むトークンを受取る：
−前のノードから累積された得点；
−前のノードの識別子（ノードメモリ内のアドレス）；
−このトークンを生成した前のノードによって受取られるトークンの識別子（トークンメモリ内のアドレス）；
−トークンはさらに活性／不活性フラグも含み、この使用は以下に記載する。
このようなトークンは全て、将来の参照のためにトークンメモリ54に記憶される。
第１のノードはフレームレートでエンプティトークンを供給される。ノードに到達するトークンは、そのノードへ向うパス上のノードと関係するモデルにこれまでの言語入力が対応する尤度（実際には確率の対数）を示す得点を含んでいる；したがってノード13に到達するトークンは、ここまでの言語が話声｛ｙ｝｛ｅｈ｝に対応する尤度を示す得点を含んでいる。ノードに関連するタスクは、新しい言語入力フレームとそのモデルを比較することである。これは、新しいフレームに関して式７乃至９の計算を行って、到来する得点に付加されて、得点を更新する確率Ｐ_r ^vを得ることによって行われる。新しいトークンはこの得点を含む出力であり、次のノードへ送られる。普通、この得点は、トークンを出力する前にモデルの状態の数（一般的に３）に等しいフレーム数に累積される。その後、トークンはフレーム毎に生成される。ノードが別のトークンを受取る一方で、それが依然として第１のノードを処理しているとき、このノードは別のトークンの得点と第１のノードの最新の得点を比較し（すなわち、最新のログＰ_r ^vと到来するトークンの得点とを加算し）、新しいトークンを無視するか、または別のトークンの得点が２つの得点の高い方であるか低い方であるかにしたがって新しいトークンのために現在の処理を放棄する。
与えられた例では、パスは最後のノードを除いて収束しない。パスが収束可能なとき、多数のパスの伝搬が可能であっても、２つのトークンの同時到着の確率は、普通より低い得点を有するものを無視することによって処理される。
最後のノード16では、収束するパスの最も高い得点のノードを除いた全てを拒絶することができるが、多くの応用では、２つ以上を保持することが望ましい。さらに、最後のノードで望ましい得点になる機会はないと考えられるほど低い得点を保持するトークンの伝搬を終了する準備が行われる。以下でさらにこの“プルーニング”処理を説明する。ネットワークを通るパスを識別して、話声のフォニームを発見することができる。これは、出力トークンから戻って成功したトークンシーケンスをトレースする、“前のトークン”アドレスを使用してトークンメモリ内のトークンを識別することによって認識されると考えられる。
トリー構造に組込まれるＨＭモデルは単一の大きなモデルであると考えられることを記載しておくべきであろう。
ここまでに説明したように、認識器は、一般的な意味で、通常のものである。ここで記載される認識器の別の特徴は、認識トリーヘアプリオリ確率を“伝搬する”目的を有することである。単語“ｃａｔ”、“ｃａｂ”、“ｃｏｂ”、“ｄｏｇ”、および“ｄｅｎ”を区別するための図４に示されたトリーを検討する。前のプロセスの結果として、これらを行うアプリオリ確率は、値０．５、０．２、０．３、０．１、０．１を重み付けすることによって表されると想定する。これは、ノード23,24,26,29,31の得点入力を、別の決定が行われる前にこれらの値によって重み付けをされる必要があることを意味している。しかしながら、重み付けは、次に示すようにトリー内の各ノードに対して行われる。したがってその単語が“ｃａｔ”または“ｃａｂ”または“ｃｏｂ”である確率は、０．５＋０．２＋０．３＝１．０の重み付けをすることによって表され、一方で“ｄｏｇ”または“ｄｅｎ”に対する対応する値は０．１＋０．１＝０．２である。その結果、ノード21に対する得点入力は、１．０に因子によって重み付けされ、ノード27に対する入力は０．２の因子によって重み付けされる。一方で“ｃａｔ”または“ｃａｂ”に関連する値は０．７であり、他方で“ｃｏｂ”に関連する値は０．３であるので、ノード22および25への入力に適切に重み付けをする必要がある。しかしながら、１．０の因子はノード21によってこのブランチに既に適用されているので、ノード22および25における重み付けは以下のように表すことができる。
ノード22における重み付け＝０．７／１．０＝０．７
ノード25における重み付け＝０．３／１．０＝０．３
同様に、ノード23および24は以下のように表すことができる。
ノード23における重み付け＝０．５／（１．０×０．７）＝５／７
ノード24における重み付け＝０．２／（１．０×０．７）＝２／７
また、ノード28および30は以下のように表すことができる。
０．１／０．２＝０．５
もちろん、図４のトリーは、このプロセスを概念的に表しているだけである。実際には、各ノードは、以下の情報を有するノードメモリ（図５参照）内のエントリによって表される。
−使用されるモデルの（モデルメモリ内の）アドレス；
−ネットワーク内の次のノードのアドレス；
−ノードが活性であるか否かを示すフラグ；
−そのノードに関係する重み付けを示すログ値；
−計算の結果に対する一時的ストレージ。
最初の２つの項目の内容は、認識器の語彙を設定するときに決定される。このプロセスは、認識される単語のリストを含む語彙メモリ７（図１参照）を参照することによって実行され、各単語に対して、アドレスのストリングがその単語の言語音（sound）に対応するフォニームモデルのシーケンスを識別する（同じく図７参照）。ノードメモリの内容の生成（以下で説明されるログの重み付け値の内容のセーブ）は一般的である；それは、各単語に対応するノードアドレスのシーケンスの語彙メモリへの挿入を含む。
ＣＰＵ51は、図６のフローチャートに示されているように、プログラムメモリ52内に記憶されるプログラム制御のもとで以下の処理を行う；
第１に、第１のノードへの入力としてエンプティトークンを生成する、すなわちゼロ（すなわち、ログ（1））の得点およびゼロを発生するノードアドレス（これはトークンが第１のノードによって処理されることを意味すると考えられる）および前のフレームの日付を有するトークンメモリ内にエントリを生成する。これらの第１のノードはそこで“活性”であると考えられる。
次に、フレーム期間では以下の段階を実行する：
各活性ノードに対して：
−ＨＭＭプロセスを開始し、かつこのノードによって処理されるトークンが最後のフレーム内で生成されなかったときには、現在のフレーム観察Ｏを使用してＨＭＭプロセスを更新する。プロセスがｎフレームに到達したとき（なおｎはこのノードに関係する特定のＨＭＭの状態の数である）、ノードメモリに記憶されたログのアプリオリ確率値を計算された尤度の値に加算し、その結果を使ってトークンメモリ内に新しいエントリを生成する（それにも関わらず、現在のプロセスは次のフレームに対しても続けることができることに注意すべきである）；
−プロセスを開始せず、このノードによって処理されるトークンが最後のフレーム中に生成されなかったとき、（すなわち、活性フラグがちょうど設定されたとき）現在のフレーム観察を使用して、新しいＨＭＭプロセスを開始する。単一の状態のＨＭＭの場合、その結果を使用して、トークンメモリ内に新しいエントリを生成する；（それにも関わらず現在のプロセスは次のフレームまで続けることができることに注意すべきである）；
−プロセスを開始し、このノードによって処理されるトークンが生成されたとき、到来する得点と内部の得点とを比較し、その結果にしたがって上述のプロセスを継続し、無変化のままか、あるいは第１の状態への入力として到来するスコアを付加する。
−生成された各トークンに対して、
−トークン得点から発生ノードアドレスを得て；
−発生ノードに対するノードメモリエントリから“次のノード”アドレスを得て；
−このような次のノードのそれぞれを次のフレームに対して活性であるとフラグを立てる。
−トークンメモリ内に新しいエントリが生成されるときは、
−関係する得点が、記憶された“全てのトークンに対する最高得点”数を越えているとき、この数字を更新し；
−関係する得点が、記憶された“全てのトークンに対する最高得点”数よりも所定のマージン（例えば、５０）よりも大きい分だけ小さければ、トークンメモリのエントリを除去する（“プルーニング”段階）。この結果ノードが入力および出力の両トークンをもたないとき、それを不活性にする（すなわち、ノードメモリのエントリを除去する）。
−最後のノードでは、
認識が完了したとき、および認識パスのトレースバックを行うことができることに関する決定は、特定の測定を検査する規則および閾値のシステムに基づいて行われる。したがって、各フレームに対して、最後のノードで現れる最良のトークンをトレースバックして、最後のノイズノードで幾つのフレームが費やされたかを検査する。（ネットワーク内の全てのパスは、端部にノイズノードを有すると想定する）。継続期間が閾値よりも長く、パスの得点が別の閾値よりも高いとき、認識を止める（すなわち、完全なパスに対する認識得点が適度に好ましくなり、パスが端部に適度な量のノイズ、一般的に２０フレーム、すなわち０．３２フレームを含むまで、待たなければならない）。これは、言語検出アルゴリズムの終了を最も簡単に記述したものである。実際には、アルゴリズムは現時点までの信号のＳＮＲおよびノイズエネルギーの分散に関する付加的な検査によって拡張することができる。さらに多数のタイムアウトがあって上述の検査が連続して失敗するとき、言語検出の終了が結局はトリガすることが確実となるようにする。
次に、最高の得点トークン、またはＮ_outの最高の得点トークン（ここでＮ_outは所望の数の出力選択である）に対して、
（ａ）トークンから先のノードおよびそこから関連するモデル識別子を検索し；
（ｂ）前のトークンメモリエントリを検索し；
（ｃ）全てのモデルを識別するまで、（ａ）および（ｂ）を反復する。
ここで認識された単語は、関連する得点と一緒に使用できる。
上述は、認識プロセスである：このプロセスを開始する前に、ログのアプリオリ確率をノードメモリに入力することが必要である。前の認識プロセスによって、図７に示されたフォーマットでアプリオリ確率値を生成したと仮定する。ここでは（例として）多数の都市名のそれぞれがそれに割当てられた確率を有するとする。ＣＰＵ52は、ノードアプリオリの確率の値を導き出すための次の設定プロセスを実行する。
第１に、語彙メモリ７を参照することによって、単語をノードシーケンスに変換することが必要であり、その結果、認識トリーを通る各可能なパスに対して、各ノードへ向う途中のログのアプリオリ値の合計が分かる。次に、図４に示されているように、各ノードに対して個々の値を計算することが必要であり、次のようになる：
（ａ）所定の確率値を各単語に対応する最後のノードに割当て；
（ｂ）右側から左側へ進み（図４参照）、各ノードに対して、それにしたがうノードに割当てられたものの合計である確率値を割当て（図４では、第１のノードは割当てられた値の１を有するようにとられている）；
（ｃ）左側から右側へ進み、前のノードに割当てられた値によって各ノードに対する確率値を分割し；
（ｄ）全ての値のログを取る。
実際には、全体的に計算の面倒の少ない技術がログ値と共に行われ、合計ではなく、最大値を取る。したがって、（図８に示されているように）：
（ａ）所定のログの確率値を各単語に対応する最後のノードに割当て；
（ｂ）ノードまたは後段のノードに割当てられた最大値であるログ確率値を各ノードに割当て；
（ｃ）各ノードに対する値から、前のノードに割当てられた値を控除する。
もちろん分岐していないリンクの計算（角括弧で示されている）は行う必要がない。
上述では、第１の基準は、トークンは閾値より下の得点を保持する、すなわち、如何なるときにおいても“最良のパス”の得点数を保持するときにトークンを消去するというものである。事実、ログ確率を使用するので、ログの得点と、最良のログ得点から最良の平均動作を与えるように設定された固定マージン値を引いたものとの間で比較が行われる。
しかしながら、実際には使用するための最適のプルーニングレベルは実際に話された話声に依存する。したがって、変形例では、プルーニングは認識器の現在の計算上の負荷の関数として調節される。例えば、それは活性ノードの数に依存して調節することができる。したがって、
１．幾つかのノードのみが活性であるとき、プルーニング閾値は緩められ、より多くのノードが活性状態を保ち、潜在的に精度を高められる。
２．多くのノードが活性であるとき、プルーニング閾値はきつくされ、計算量を減少する。
これに関して、閾値を調節して、活性ノードの数を一定に保つことは可能である。このときは、各時間フレームにおいて、活性ノードｎ_aの数は所望の目標ｎ_t（例えば、１３００）と比較される。閾値のマージン値Ｍ_Tを段階値Ｍ_s（例えば、２）によって、Ｍ_oの開始値（例えば、１００）から最小値Ｍ_min（例えば、７５）と最大値（例えば、１５０）との間で変化させることができる。各時間フレームごとに以下の段階をとることができる。
（１）ｎ_a＞ｎ_tおよびＭ_T＞Ｍ_minのとき、Ｍ＝Ｍ−Ｍ_s
（２）ｎ_a＜ｎ_tおよびＭ_T＜Ｍ_minのとき、Ｍ＝Ｍ＋Ｍ_s
しかしながら他の基準を適用できるときもあり、例えば活性モデル状態数または（とくに非常に多くの語彙を有する認識器のときは）活性単語数に基づいて決定することができる。
この動的な閾値の調節は、アプリオリの重み付けを行わないシステムで使用することもできる。
上述の認識器は、特定の状態で行われる限定された数の可能な観察を行うように制限されている。しかしながら望むのであれば、観察Ｏに対する値を有する連続する確率密度ｂ_j（Ｏ）によって確率ｂ_jkを置換することができる。周知のように、全体的な連続確率密度は、もっと拘束された形態−通常ガウス分布の連続関数の離散数の重み付けをした合計（または混合）によってうまい具合に近似値を得ることができる。したがって確率密度の関数は、

なお、Ｘは混合における成分（または“モード”）の数であり、ｃ_jxは状態ｊにおけるモードｘの重み付けであり、Ｎ［Ｏ、μ_jx、Ｕ_jx］は、中間ベクトルμ_jxおよび共分散マトリックスＵ_jxで多変量垂直分布からベクトルＯを引出す確率である。
ガウス分布に対して、次の式が成り立つ。

なお、ｄはベクトルの大きさである。これは、Ｕが項目（term)σ_iを有する対角行列であるとき、次のように減少する。

なお、υ_iはＯの要素である。
式１乃至９の認識プロセスは変更されず、ｂの定義のみが変化する。この連続する密度モデルのトレーニングプロセスは知られているので、以下で説明しない。
並列処理方法は、上述で説明したトリー処理方法よりも簡単である。プルーニングを含むこの典型的な処理方法は、モデルを検査するときの実行リストの上から（例えば）６つの“最良”の候補を維持することを含む。例えば、
（ａ）未知の単語を最初の６つのモデルと比較し、これらのモデルのリストを生成し、それぞれに対して類似の得点を記録し；
（ｂ）未知の単語を別のモデルと比較する。得られた得点が、リスト内の他の何れよりも高ければ−すなわち類似度がより高いときは、リスト内の最も低い得点エントリに新しいモデルおよび得点を代入し；
（ｃ）全てのモデルを処理するまで段階（ｂ）を反復する。このプロセスは、上から６つの高得点のモデルのリストを生成する。最良の候補を選択する前に、アプリオリの確率が適用されるとき、６つの得点のそれぞれが相関する重み付け因子によって乗算され、最良の重み付けをされた得点を有する候補が選択される。
提案された方法では、認識プロセス中に重み付けが行われる；すなわち、
（ａ）未知の単語を第１の６つのモデルと比較し、それぞれに対して類似の得点を発生する。各モデルに対する重み付け因子によって得点を乗算する。これらのモデルのリストを生成し、それぞれに対して重み付けされた得点を記録し；
（ｂ）未知の単語を別のモデルと比較する。このモデルに対する重み付け因子によって得点を乗算する。得られた重み付けした得点がリスト内の他のものよりも高いときは、最も低く重み付けをされた得点を有するリストにおいてエントリに対する新しいモデルおよび重み付けされた得点を代入し；
（ｃ）全てのモデルが処理されるまで、段階（ｂ）を繰返す。

The present invention relates to a speech recognizer that performs a recognition process to identify which one of a vocabulary of words (or, more generally, utterances) an input speech signal most clearly resembles, and for which information is available about the a priori probabilities associated with the words of the vocabulary. (Translator's note: “speech recognition” is usually rendered in Japanese as 音声認識, but the Japanese text uses 言語 to translate “speech” in the sense of spoken language rather than voice.) One example of this situation is the automatic telephone directory enquiry system described in our co-pending international patent application No. WO95/02524. In this system,
(i) the user speaks the name of a city;
(ii) the speech recognizer, by reference to stored city data, identifies several cities that best match the spoken city name, and produces a “score” or probability indicating the closeness of each match;
(iii) a list is compiled of all the road names that occur in the identified cities;
(iv) the user speaks the name of a road;
(v) the speech recognizer identifies, from among those in the list, several road names that best match the spoken road name, again with scores;
(vi) the road scores are each weighted according to the score obtained by the city in which the road is located, and the most likely “road” is taken to be the one with the best weighted score.
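Step (vi) can be illustrated with a small sketch. It assumes that the scores are probabilities and that the weighting is a plain multiplication; the description only says the road scores are weighted according to the city scores. The function name, city names, road names and numbers are all hypothetical.

```python
def rank_roads(city_scores, road_scores, road_city):
    """Weight each road score by the score of its city; return roads best-first."""
    weighted = {
        road: score * city_scores[road_city[road]]  # assumed: multiplicative weighting
        for road, score in road_scores.items()
    }
    return sorted(weighted.items(), key=lambda item: item[1], reverse=True)


# Hypothetical recognizer outputs for the two stages.
city_scores = {"Ipswich": 0.6, "Norwich": 0.3}
road_scores = {"High Street": 0.5, "Mill Lane": 0.8}
road_city = {"High Street": "Ipswich", "Mill Lane": "Norwich"}

ranking = rank_roads(city_scores, road_scores, road_city)
```

In this example "Mill Lane" has the better raw road score, but "High Street" ranks first after weighting because it lies in the better-scoring city.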
The a priori probabilities need not originate from a preceding speech recognition process; for example, another directory enquiry system described in the above-mentioned patent application uses a signal identifying the origin of a call to access statistical information about the cities most likely to be wanted by an enquirer from that area, and uses that information to weight the results of the city name recognition process.
This process has the advantage of high reliability, because of a retention condition: for example, a road in the second-choice city is not selected unless it scores markedly better in the road name recognition step than the roads in the first-choice city. A disadvantage of the process, however, is that the recognizer produces only a limited number of road names when performing the road name recognition step. This short list may therefore contain only names of roads located in low-scoring cities; that is, lower-scoring names of roads located in high-scoring cities have already been “pruned” by the recognizer before the weighting process can be applied.
U.S. Pat. No. 4,783,803 describes a speech recognition apparatus in which the a priori probabilities relate to the given context of one or more previously recognized patterns. A language score indicating the probability of one word occurring after another is combined with the score obtained for a sequence containing those words.
According to the present invention there is provided a method of speech recognition comprising:
repetitively comparing portions of an unknown utterance with reference models to generate, for each of a plurality of permissible sequences of reference utterances defined by stored data, an accumulated measure of similarity that includes contributions from previously generated measures obtained by comparing one or more earlier portions of the utterance with the reference models corresponding to earlier utterances in the respective permissible sequence, while excluding from further repetitive comparison any sequence whose accumulated measure indicates, to a degree defined by a predetermined pruning criterion, less similarity than the measures for other sequences; and
weighting the accumulated measures according to weighting factors for each of the permitted sequences, wherein each computation of a measure or accumulated measure for a partial sequence is weighted by a combined value of the weighting factors for each of the permissible sequences that begin with that partial sequence, less any weighting factors already applied to measures generated for an utterance or shorter sequence with which that partial sequence begins.
Preferably, any sequence whose weighted accumulated measure indicates, to a degree defined by the pruning criterion, less similarity than the measures for other sequences is excluded from further repetitive comparison. The pruning may be adjusted depending on the number of measures generated and not excluded from further repetitive comparison, so as to keep that number constant.
In another aspect of the invention there is provided a speech recognition apparatus comprising:
storage means for storing data relating to reference models representing utterances and data defining permissible sequences of reference utterances;
comparison means for repetitively comparing portions of an unknown utterance with the reference models to generate, for each of a plurality of permissible sequences of reference utterances defined by the stored data, an accumulated measure of similarity that includes contributions from previously generated measures obtained by comparing one or more earlier portions of the utterance with the reference models corresponding to earlier utterances in the respective permissible sequence; and
means for weighting the accumulated measures according to weighting factors for each of the permitted sequences, wherein each computation of a measure or accumulated measure for a partial sequence is weighted by a combined value of the weighting factors for each of the permissible sequences that begin with that partial sequence, less any weighting factors already applied to measures generated for an utterance or shorter sequence with which that partial sequence begins.
In yet another aspect, the invention provides a method of speech recognition by reference to stored data defining reference models corresponding to sounds and stored data defining permissible sequences of those models, each sequence corresponding to an utterance to be recognized, the method comprising:
comparing portions of an unknown utterance with the reference models to update measures indicating the similarity between an earlier portion of the utterance and a partial permissible sequence, so as to produce updated measures indicating the similarity between a longer portion of the utterance and longer partial permissible sequences;
identifying those partial sequences whose measures indicate less than a defined degree of similarity; and
suppressing further generation of measures for any sequence or partial sequence that begins with one of the identified partial sequences;
characterized in that the identification is performed by comparing the measures with a threshold value, and the threshold value is repeatedly adjusted depending on the number of measures generated and not suppressed, so as to keep that number constant.
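The threshold adjustment described in this aspect can be sketched as follows. The default values (target of 1300 active nodes, step of 2, margin limits of 75 and 150) are the example figures given elsewhere in the description; the function names are hypothetical, the upper limit is assumed to apply when the margin is being widened, and the result is clamped so that a step never crosses a limit. Scores are log values, so a token is pruned when it falls more than the margin below the best score.

```python
def adjust_margin(margin, n_active, target=1300, step=2, m_min=75, m_max=150):
    """Tighten the pruning margin when too many nodes are active, relax it when too few."""
    if n_active > target and margin > m_min:
        return max(m_min, margin - step)   # assumed clamp at the lower limit
    if n_active < target and margin < m_max:
        return min(m_max, margin + step)   # assumed clamp at the upper limit
    return margin


def prune(log_scores, margin):
    """Keep only the scores lying within `margin` of the best log score."""
    best = max(log_scores)
    return [score for score in log_scores if score >= best - margin]


margin = 100  # example starting value from the description
margin_when_busy = adjust_margin(margin, n_active=2000)
margin_when_idle = adjust_margin(margin, n_active=500)
survivors = prune([-10.0, -50.0, -200.0], margin)
```

Calling `adjust_margin` once per frame nudges the number of surviving tokens toward the target rather than fixing it exactly.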
In a further aspect of the invention there is provided a method of assigning a weighting factor to each node of a speech recognition network representing a plurality of permissible sequences of reference utterances, the method comprising:
combining, for each node, the values of the weighting factors for each of the permissible sequences that begin with a partial sequence incorporating that node, less any weighting factors applied to an utterance or shorter sequence with which that partial sequence begins.
The weighting factors may be generated in the log domain, in which case:
the logs of the given weighting factors are assigned to the final nodes of the network corresponding to the permissible sequences;
each preceding node is assigned a log probability value equal to the maximum of the values assigned to the node or nodes that follow it; and
the value assigned to the preceding node is subtracted from the value for each node.
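The three log-domain steps can be sketched on a small prefix tree. This is an illustration under stated assumptions, not the patented implementation: each node is identified by the spelling prefix that leads to it (letters stand in for phoneme models), the word list and probabilities are the five-word example used later in the description, and the function name is hypothetical.

```python
import math


def log_node_weights(word_log_probs):
    """Assign a log weight to every node (prefix) of the recognition tree."""
    # Steps 1 and 2: each node takes the maximum log value of the words below it.
    best = {}
    for word, log_p in word_log_probs.items():
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            best[prefix] = max(best.get(prefix, -math.inf), log_p)
    # Step 3: subtract the value assigned to the preceding node (0 for the first level).
    return {
        prefix: value - best.get(prefix[:-1], 0.0)
        for prefix, value in best.items()
    }


word_log_probs = {
    word: math.log(p)
    for word, p in {"cat": 0.5, "cab": 0.2, "cob": 0.3, "dog": 0.1, "den": 0.1}.items()
}
log_weights = log_node_weights(word_log_probs)
```

Because the subtraction telescopes, the log weights met along the path to a word add up to that word's own log probability, while the largest share of it is applied as early in the tree as possible.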
The nodes may be associated with models representing the reference utterances, and the parameters of the associated models may be modified to reflect the weighting factor assigned to each node.
The invention is particularly applicable to a recognition network that has a tree structure in which at least one node other than the first has more than one branch.
Several embodiments of the present invention will now be described by way of example with reference to the accompanying drawings.
FIG. 1 is a block diagram of an apparatus according to an embodiment of the present invention.
FIG. 2 shows a schematic diagram of a network of Hidden Markov Models.
FIG. 3 shows the contents of the token memory of FIG. 1.
FIG. 4 illustrates the application of weightings by the apparatus of FIG. 1.
FIG. 5 shows the contents of the node memory of FIG. 1.
FIG. 6 is a flowchart showing the operation of the apparatus of FIG. 1.
FIG. 7 shows the contents of the vocabulary memory of FIG. 1.
FIG. 8 illustrates an alternative weighting procedure to that of FIG. 4.
There are basically two approaches to speech recognition. In parallel processing, each utterance (for example, a word) is compared in turn with reference templates or models to identify the one or more that most resemble it. In tree processing, a part of an utterance (for example, a phoneme) is compared with reference templates or models (“model” is used here in its generic sense) to identify that part, and the next part is then processed in the same way.
An embodiment using a tree structure will now be described.
The speech recognizer of FIG. 1 has an input 1 for speech signals, which are converted to digital form by an analog-to-digital converter 2. The digitized signal is then supplied to a feature extractor 3, which calculates a number of parameters or “features” for each of successive frames of, for example, 10 ms duration. Any commonly used features may be chosen, such as Mel frequency cepstral coefficients or linear prediction coefficients.
The number of possible combinations of feature values per frame is very large, and to reduce the subsequent processing to manageable proportions it is usual to apply vector quantization, that is, to match the set of features to one of a limited number m of standard feature combinations (v_1, v_2, ..., v_m). This is performed by a vector quantizer (VQ) 4, which produces a single number or “observation” O_j (for the j-th frame). The observation is then supplied to a classifier 5, which in general terms matches the observation sequence [O_j] against a set of models stored in a model memory 6. Each model corresponds to a different sub-word unit, for example a phoneme. The classifier comprises a central processor 51 controlled by a stored program in a program memory 52, together with a node memory 53 and a token memory 54. The classifier performs the classification using hidden Markov models, the principles of which are described next.
Conceptually, a hidden Markov model is a “black box” that has n possible states and that, at regular intervals, can either move from one state to the next or remain in the same state, according to a probability parameter. The probability of a transition from state i to state j is a_ij, and the probability of remaining in state i is a_ii. It follows that

$$\sum_{j=1}^{n} a_{ij} = 1$$

Because of the temporal ordering of speech sounds, a left-right model is generally used, for which a_ij is non-zero only when

$$0 \le j - i \le 1$$

In any particular state an output is produced, which may be one of a finite number m of possible outputs v_1, v_2, ..., v_m, according to a second set of probabilities. In the present context, v_k identifies a particular set of speech features. The probability of producing output v_k when in state j is b_jk. Thus

$$\sum_{k=1}^{m} b_{jk} = 1$$

A third parameter is the probability of starting in a particular state; the probability of starting in state i is π_i.
The model therefore consists of a set of parameters
A = [a_ij] (i = 1 ... n, j = 1 ... n)
B = [b_jk] (j = 1 ... n, k = 1 ... m)
π = [π_i] (i = 1 ... n),
together with a set of rules that can be applied to those parameters to produce an output sequence. In fact, the model does not physically exist, nor is any output sequence ever generated. Rather, the speech recognition problem is formulated as the question: “Given a sequence of v's, each representing an observed set of speech features, what is the probability P that the model M defined by A, B and π could produce this sequence (the observation sequence)?”
If this question is asked of a number of different models, each representing (for example) a different phoneme, then the phoneme represented by the model with the highest probability is deemed to have been recognized.
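Suppose that the observation sequence is O_1, O_2, ..., O_T over times t = 1 to t = T. The probability α_T(j) of reaching state j with this observation sequence is given by the recursion

$$\alpha_1(j) = \pi_j\, b_j(O_1) \qquad (1)$$

$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{n} \alpha_t(i)\, a_{ij}\Big]\, b_j(O_{t+1}) \qquad (2)$$

where b_j(O_t) denotes b_jk for the output v_k observed at time t. The probability of the observation sequence O being produced by the model M is

$$\Pr(O \mid M) = \sum_{j=1}^{n} \alpha_T(j) \qquad (3)$$

This is the probability of the observation sequence O taking all possible state sequences into account. In practice, to reduce the amount of computation, it is usual to invoke the Viterbi algorithm and compute the probability associated with the single state sequence that has the highest probability of producing the observation sequence. In this case equations 1 to 3 are replaced by

$$\phi_1(j) = \pi_j\, b_j(O_1) \qquad (4)$$

$$\phi_t(j) = \max_{i}\big[\phi_{t-1}(i)\, a_{ij}\big]\, b_j(O_t) \qquad (5)$$

$$\Pr\nolimits^{v}(O \mid M) = \max_{j}\, \phi_T(j) \qquad (6)$$

or, in the log domain,

$$\log \phi_1(j) = \log \pi_j + \log b_j(O_1) \qquad (7)$$

$$\log \phi_t(j) = \max_{i}\big[\log \phi_{t-1}(i) + \log a_{ij}\big] + \log b_j(O_t) \qquad (8)$$

$$\log \Pr\nolimits^{v}(O \mid M) = \max_{j}\big[\log \phi_T(j)\big] \qquad (9)$$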
The model memory 6 contains the values of A, B and π (together referred to as the model M) for each phoneme of the language concerned. The training process for generating the model parameters is conventional and is not described further; see “Hidden Markov Models for Automatic Speech Recognition: Theory and Application” by S. J. Cox (British Telecom Technology Journal, Vol. 6, No. 2, February 1988). A phoneme in a particular observation sequence O is recognized by computing Pr^v(O | M_i) for each model M_1 ... M_Q (where Q is the number of models); the phoneme whose model yields the highest Pr^v is deemed to have been recognized.
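The phoneme decision described above can be sketched with a log-domain Viterbi score for discrete-output models. This is a minimal illustration, assuming models are given as plain Python lists (π, A, B) and observations as output indices; the function names, the two toy models and their probabilities are hypothetical.

```python
import math


def ln(p):
    """Natural log that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else -math.inf


def log_viterbi(pi, a, b, observations):
    """Log probability of the best state sequence producing `observations`."""
    n = len(pi)
    # Initialization with the first observation.
    phi = [ln(pi[j]) + ln(b[j][observations[0]]) for j in range(n)]
    # Recursion: best predecessor for each state, then the output probability.
    for o in observations[1:]:
        phi = [
            max(phi[i] + ln(a[i][j]) for i in range(n)) + ln(b[j][o])
            for j in range(n)
        ]
    return max(phi)


def recognize(models, observations):
    """Return the name of the model with the highest Viterbi score."""
    return max(models, key=lambda name: log_viterbi(*models[name], observations))


# Two toy left-right models with two states and two possible outputs.
PI = [1.0, 0.0]
A = [[0.5, 0.5], [0.0, 1.0]]
MODELS = {
    "phoneme_x": (PI, A, [[0.9, 0.1], [0.8, 0.2]]),
    "phoneme_y": (PI, A, [[0.1, 0.9], [0.2, 0.8]]),
}
```

Working with logs turns the products into sums, which avoids numerical underflow on long observation sequences and makes it straightforward to add a log weighting to the score later.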
In practice, of course, it is words that must be recognized. The process can be visualized as a network or tree structure with a number of nodes. As will be seen later, this structure exists only in the sense that each node corresponds to a respective area of memory.
FIG. 2 shows a simple network for distinguishing between “yes” and “no”, whose phoneme representations are denoted here {y}{eh}{s} and {n}{ow}.
In FIG. 2, node 10 corresponds to a noise model (generally a single-state model), as does the final node 16; these represent the “silence” before and after the word. The remaining nodes, other than the last, correspond to the phonemes shown; for example, node 11 corresponds to the phoneme [y] of “yes”.
In operation, a node receives a token containing the following information, as shown in FIG. 3:
- the accumulated score from the previous nodes;
- the identifier of the previous node (its address in node memory);
- the identifier of the token (its address in token memory) that was received by the previous node and gave rise to this token;
- an active/inactive flag, the use of which is described below.
All such tokens are stored in token memory 54 for future reference.
The first node is supplied with empty tokens at the frame rate. A token arriving at any node contains a score indicating the likelihood (actually the logarithm of the probability) that the speech input so far corresponds to the models associated with the nodes on the path leading to that node; thus a token arriving at node 13 contains a score indicating the likelihood that the speech so far corresponds to the utterance {y}{eh}. The task associated with a node is to compare a new speech input frame with its model. This is done by performing the calculation of equations 7 to 9 on the new frame to obtain the probability Pr^v, which is added to the incoming score to give an updated score. A new token containing this score is then output and passed to the next node. Normally the score is accumulated over a number of frames equal to the number of states in the model (typically 3) before a token is output; thereafter, a token is produced every frame. If a node receives a further token while it is still processing the first, it compares the score of the further token with the latest score of the first (that is, the latest log Pr^v plus the incoming token score), and either ignores the new token or abandons the current process in favor of the new token, according to whether the further token's score is the lower or the higher of the two.
In the example given, paths do not converge except at the last node. Where paths can converge, the possibility of two tokens arriving simultaneously is usually dealt with by ignoring the one with the lower score, although propagation of multiple paths is possible.
At the final node 16 it would be possible to reject all but the highest-scoring of the converging paths, but for many applications it is preferable to retain two or more. Provision is also made for terminating the propagation of tokens whose scores are so low that they are deemed to have no chance of leading to a good score at the final node; this “pruning” process is described further below. The path through the network can be identified, so as to recover the phonemes of the utterance deemed to have been recognized, by tracing the successful token sequence back from the output token, using the “previous token” addresses to identify the tokens in the token memory.
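The token fields and the traceback through token memory can be sketched as below. The class and function names are hypothetical, token memory is assumed to be a Python list indexed by token address, and the node numbers in the example are illustrative rather than taken from FIG. 2.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    score: float              # accumulated log score
    node: int                 # address of the node that produced this token
    previous: Optional[int]   # address in token memory of the token that led to it
    active: bool = True       # active/inactive flag


def trace_back(token_memory: List[Token], last: int) -> List[int]:
    """Follow the 'previous token' addresses from an output token to the start."""
    path = []
    index: Optional[int] = last
    while index is not None:
        path.append(token_memory[index].node)
        index = token_memory[index].previous
    path.reverse()
    return path


# Hypothetical token memory holding one winning path and one competing token.
token_memory = [
    Token(score=-1.0, node=10, previous=None),
    Token(score=-4.5, node=11, previous=0),
    Token(score=-9.0, node=14, previous=0),
    Token(score=-7.2, node=12, previous=1),
]
path = trace_back(token_memory, last=3)
```

Looking up the model associated with each node on the returned path then gives the phoneme sequence of the recognized utterance.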
It should be noted that the hidden Markov models assembled into the tree structure can be regarded as a single large model.
As described so far, the recognizer is conventional in general terms. The further feature of the recognizer described here has the purpose of “propagating” the a priori probabilities into the recognition tree. Consider the tree shown in FIG. 4, which distinguishes the words “cat”, “cab”, “cob”, “dog”, and “den”. Assume that, as a result of a previous process, the a priori probabilities of these words occurring are represented by the weighting values 0.5, 0.2, 0.3, 0.1, 0.1. This means that the scores input to nodes 23, 24, 26, 29 and 31 need to be weighted by these values before a further decision is made. Instead, however, weighting is applied at every node in the tree, as follows. The probability that the word is “cat” or “cab” or “cob” is expressed by the weighting 0.5 + 0.2 + 0.3 = 1.0, while the corresponding value for “dog” or “den” is 0.1 + 0.1 = 0.2. As a result, the score input to node 21 is weighted by a factor of 1.0 and the input to node 27 is weighted by a factor of 0.2. The value associated with “cat” or “cab” is 0.7 and the value associated with “cob” is 0.3, so the inputs to nodes 22 and 25 need to be weighted accordingly. However, since a factor of 1.0 has already been applied to this branch at node 21, the weights at nodes 22 and 25 are:
Weight at node 22 = 0.7 / 1.0 = 0.7
Weight at node 25 = 0.3 / 1.0 = 0.3
Similarly, the weights at nodes 23 and 24 are:
Weight at node 23 = 0.5 / (1.0 × 0.7) = 5/7
Weight at node 24 = 0.2 / (1.0 × 0.7) = 2/7

The weight at each of nodes 28 and 30 is:
0.1 / 0.2 = 0.5
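The arithmetic above can be checked with a short sketch. It is only an illustration: it assumes that each letter stands in for one phoneme model, and it identifies a node by the prefix that leads to it (so "c" plays the role of node 21, "ca" node 22, "cat" node 23, and so on, a mapping inferred from the text rather than taken from FIG. 4). The function name `node_weights` is hypothetical.

```python
from fractions import Fraction

# Word weightings from the example. Letters stand in for phoneme models,
# which is an assumption made only to keep the sketch short.
WORD_WEIGHTS = {
    "cat": Fraction("0.5"),
    "cab": Fraction("0.2"),
    "cob": Fraction("0.3"),
    "dog": Fraction("0.1"),
    "den": Fraction("0.1"),
}


def node_weights(word_weights):
    """Return {prefix: weight}; each prefix identifies one node of the tree."""
    # Combined value of a node: sum of the weightings of all words that
    # pass through it. The first node is taken to have the value 1.
    combined = {"": Fraction(1)}
    for word, weight in word_weights.items():
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            combined[prefix] = combined.get(prefix, Fraction(0)) + weight
    # Weight applied at a node: its combined value divided by the combined
    # value of the node before it, which has already been applied.
    return {
        prefix: value / combined[prefix[:-1]]
        for prefix, value in combined.items()
        if prefix
    }


weights = node_weights(WORD_WEIGHTS)
for prefix in ("c", "d", "ca", "co", "cat", "cab", "do", "de"):
    print(prefix, weights[prefix])
```

Exact fractions are used so that 5/7 and 2/7 come out as written in the text rather than as rounded decimals.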
Of course, the tree in FIG. 4 only conceptually represents this process. In practice, each node is represented by an entry in a node memory (see FIG. 5) having the following information:
- the address of the model used (in model memory);
- the address of the next node in the network;
- a flag indicating whether the node is active;
- a log value indicating the weighting associated with the node;
- temporary storage for calculation results.
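As a rough picture of such an entry, the five items can be written as a small record. This is a sketch only: the field names, the use of a list for the next-node addresses (a branching node needs more than one), and the model address value are assumptions, not details from the patent.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeEntry:
    """One node memory entry; field names are illustrative only."""
    model_address: int            # address of the model in model memory
    next_nodes: List[int]         # address(es) of the next node(s)
    active: bool = False          # flag: is the node currently active?
    log_weight: float = 0.0       # log of the weighting for this node
    scratch: List[float] = field(default_factory=list)  # temporary results


# Node 22 of the example carries a weighting of 0.7.
# The model address 3 is made up for the illustration.
node_22 = NodeEntry(model_address=3, next_nodes=[23, 24],
                    log_weight=math.log(0.7))
print(node_22)
```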
The contents of the first two items are determined when the recognizer vocabulary is set up. This is done by referring to a vocabulary memory 7 (see FIG. 1) that contains a list of the words to be recognized and, for each word, a string of addresses identifying the sequence of phoneme models corresponding to the sound of that word (see also FIG. 7). Generating the contents of the node memory (except for the log weighting values, which are described below) is conventional; it includes inserting into the vocabulary memory the sequence of node addresses corresponding to each word.
The CPU 51 performs the following processing under program control stored in the program memory 52, as shown in the flowchart of FIG.
First, create an empty token as input to the first node or nodes: that is, create an entry in the token memory with a score of zero (i.e. log(1)), a generating node address of zero (taken to mean that the token is to be processed by the first node), and the date of the previous frame. These first nodes are then considered “active”.
Next, the following steps are performed in each frame period.
For each active node:
- If an HMM process has been started and no token to be processed by this node was generated during the last frame, update the HMM process using the current frame observation O. When the process has reached n frames (where n is the number of states of the particular HMM associated with this node), add the log a priori probability value stored in the node memory to the calculated likelihood value and use the result to create a new entry in the token memory (note that the current process can nevertheless continue for the next frame);
- If no process has been started and a token to be processed by this node was generated during the last frame (i.e. the active flag has just been set), start a new HMM process using the current frame observation O. For a single-state HMM, use the result to create a new entry in the token memory (note that the current process can nevertheless continue for the next frame);
- If a process has been started and a token to be processed by this node has been generated, compare the incoming score with the internal score and, according to the result, continue the process as above, either unchanged or with the incoming score as input to the first state.
For each token generated:
- Obtain the generating node address from the token;
- Obtain the “next node” address or addresses from the node memory entry for the generating node;
- Flag each such next node as active for the next frame.
When a new entry has been created in the token memory:
- If the associated score exceeds the stored “highest score for all tokens” figure, update this figure;
- If the associated score is smaller than the stored “highest score for all tokens” by more than a predetermined margin (e.g. 50), remove the token memory entry (the “pruning” step). If, as a result, a node has neither input tokens nor output, it is deactivated (i.e. the node memory entry is removed).
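The bookkeeping for new token memory entries can be sketched as follows. This is a simplified illustration, not the patent's implementation: tokens are reduced to their log scores, a token that would be pruned is simply never stored (rather than stored and then removed), and the function name `store_token` is hypothetical.

```python
def store_token(token_scores, best_score, score, margin=50.0):
    """Store a new token score unless it falls outside the pruning margin.

    Returns the updated "highest score for all tokens" and a flag saying
    whether the token was kept.
    """
    best_score = max(best_score, score)       # update the stored maximum
    if score < best_score - margin:           # too far below the best: prune
        return best_score, False
    token_scores.append(score)
    return best_score, True


stored = []
best = float("-inf")
for new_score in [-10.0, -30.0, -75.0, -5.0, -58.0]:
    best, kept = store_token(stored, best, new_score)

print(stored, best)
```

With a margin of 50, the scores -75 and -58 are dropped because they lie more than 50 below the best score seen at the time they arrive.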
At the last node:
The decision as to when recognition is complete, so that the recognition paths can be traced back, is based on a system of rules and thresholds against which specific measurements are checked. Thus, for each frame, the best token appearing at the last node is traced back to check how many frames have been spent in the final noise node (it is assumed that all paths in the network have a noise node at the end). When this duration is longer than one threshold and the path score is better than another threshold, recognition stops (that is, recognition essentially waits until the score for a complete path is reasonably good and the path contains a reasonable amount of noise at the end, typically 20 frames, i.e. 0.32 seconds). This is the simplest description of an end-of-speech detection algorithm. In practice, the algorithm can be extended with additional checks on the SNR of the signal to date and on the variance of the noise energy. In addition, a number of timeouts ensure that end-of-speech detection eventually triggers even if the above tests fail continually.
Next, for the highest-scoring token, or for each of the N_out highest-scoring tokens (where N_out is the desired number of output choices):
(A) retrieve the previous node address from the token and, from it, the associated model identifier;
(B) retrieve the previous token memory entry;
(C) repeat (A) and (B) until all models have been identified.
The recognized words, together with their associated scores, are now available.
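Steps (A) to (C) amount to following a chain of “previous token” links. The sketch below assumes a token memory held as a dictionary keyed by token address, a `previous` field that is `None` for the first token on a path, and letters standing in for model identifiers; all of these names and layouts are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Token:
    score: float               # accumulated log score
    node: int                  # address of the node that generated the token
    previous: Optional[int]    # address of the previous token, None at the start


def trace_back(tokens: Dict[int, Token], node_model: Dict[int, str],
               last: int) -> List[str]:
    """Follow the previous-token links from an output token back to the start."""
    models = []
    address: Optional[int] = last
    while address is not None:
        token = tokens[address]
        models.append(node_model[token.node])  # (A) node address, then model
        address = token.previous               # (B) previous token entry
    models.reverse()                           # collected end-to-start
    return models


# Hypothetical token memory for one path through nodes 21, 22 and 23.
node_model = {21: "c", 22: "a", 23: "t"}
tokens = {
    0: Token(score=-3.0, node=21, previous=None),
    1: Token(score=-5.5, node=22, previous=0),
    2: Token(score=-9.0, node=23, previous=1),
}
print(trace_back(tokens, node_model, last=2))
```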
The above is the recognition process. Before this process starts, the log a priori probabilities must be entered into the node memory. Assume that the previous recognition process generated a priori probability values in the format shown in FIG. Here (as an example), assume that each of a number of city names has a probability assigned to it. The CPU 51 executes the following set-up process to derive the a priori probability values for the nodes.
First, the words must be converted into node sequences by referring to the vocabulary memory 7, so that, for each possible path through the recognition tree, the total of the log a priori probability values of the nodes along the way is known. Next, as in FIG. 4, the individual values for each node are calculated as follows:
(A) assign the given probability value to the last node corresponding to each word;
(B) proceeding from right to left (see FIG. 4), assign to each node a probability value that is the sum of those assigned to the nodes following it (in FIG. 4, the first node is taken to have an assigned value of 1);
(C) proceeding from left to right, divide the probability value for each node by the value assigned to the node preceding it;
(D) take the log of all values.
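The reason step (C) divides by the preceding node's value can be made explicit. The notation is introduced here for illustration and is not the patent's: S(n) is the value assigned to node n in step (B), w(n) is the weight left at node n after step (C), n_0 is the first node with S(n_0) = 1, and n_1, ..., n_k are the nodes on the path to a word whose given probability value is P, so that S(n_k) = P. The divisions then cancel along the path:

```latex
w(n_i) = \frac{S(n_i)}{S(n_{i-1})}, \qquad
\prod_{i=1}^{k} w(n_i) = \frac{S(n_k)}{S(n_0)} = P, \qquad
\sum_{i=1}^{k} \log w(n_i) = \log P .
```

For “cat” in the FIG. 4 example this gives 1.0 × 0.7 × 5/7 = 0.5, the value originally given for that word, so a complete path accumulates exactly the word's weighting while partial paths already carry part of it.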
In practice, a computationally less onerous technique is to work with log values throughout and to take the maximum rather than the sum. Therefore (as shown in FIG. 8):
(A) assign the given log probability value to the last node corresponding to each word;
(B) assign to each node a log probability value that is the maximum of those assigned to the node or nodes following it;
(C) subtract from the value for each node the value assigned to the node preceding it.
Of course, no calculation is needed for unbranched links (shown in square brackets).
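A minimal sketch of the maximum-based variant follows, reusing the five-word example. It rests on several assumptions that are not stated in the text: letters stand in for phoneme models, a node is identified by the prefix leading to it, no word is a prefix of another, and the first node is given the value log(1) = 0. The function name `log_node_weights` is hypothetical.

```python
import math

WORD_WEIGHTS = {"cat": 0.5, "cab": 0.2, "cob": 0.3, "dog": 0.1, "den": 0.1}


def log_node_weights(word_weights):
    """Return {prefix: log weight} using the maximum instead of the sum."""
    best = {}
    for word, weight in word_weights.items():
        log_w = math.log(weight)              # (A) log value at the last node
        for end in range(1, len(word) + 1):
            prefix = word[:end]
            # (B) each node takes the maximum over the words that follow it
            best[prefix] = max(best.get(prefix, -math.inf), log_w)
    best[""] = 0.0                            # assumption: first node is log(1)
    # (C) subtract the value assigned to the preceding node
    return {p: v - best[p[:-1]] for p, v in best.items() if p}


log_weights = log_node_weights(WORD_WEIGHTS)
for prefix in ("c", "ca", "cat", "cab", "co", "cob", "d", "do", "dog"):
    print(prefix, round(log_weights[prefix], 4))
```

In this sketch the increments along any complete path still add up to the log of the word's weighting, and every unbranched link receives an increment of exactly zero, which is consistent with the remark that such links need no calculation.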
In the above description, the primary pruning criterion is that a token is erased when its score falls below a threshold defined relative to the “best path” score at that time. In fact, since log probabilities are used, the comparison is made between the log score and the best log score minus a fixed margin value, which is set to give the best average performance.
However, in practice, the optimal pruning level depends on the actual utterance. Thus, in a variant, the pruning is adjusted as a function of the current computational load of the recognizer. For example, it can be adjusted depending on the number of active nodes. Therefore,
1. When only a few nodes are active, the pruning threshold is relaxed and more nodes remain active, potentially increasing accuracy.
2. When many nodes are active, the pruning threshold is tightened, reducing the amount of computation.
In particular, the threshold can be adjusted so as to keep the number of active nodes roughly constant. In each time frame, the number of active nodes n_a is compared with a desired target n_t (for example, 1300). The threshold margin value M_T is allowed to vary from a starting value M_0 (for example, 100), in steps of M_s (for example, 2), between a minimum value M_min (for example, 75) and a maximum value M_max (for example, 150). The following steps are taken in each time frame:
(1) If n_a > n_t and M_T > M_min, then M_T = M_T − M_s.
(2) If n_a < n_t and M_T < M_max, then M_T = M_T + M_s.
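The per-frame adjustment can be sketched as a single function, using the example figures from the text as defaults and taking the upper limit (150 in the example) as the bound tested in step (2). The function and parameter names are illustrative.

```python
def adjust_margin(margin, active_nodes, target=1300, step=2.0,
                  floor=75.0, ceiling=150.0):
    """Return the pruning margin to use for the next frame."""
    if active_nodes > target and margin > floor:
        return margin - step      # too many active nodes: prune harder
    if active_nodes < target and margin < ceiling:
        return margin + step      # few active nodes: relax the pruning
    return margin


margin = 100.0                    # starting value from the example
history = []
for n_active in [1500, 1500, 1200, 1300]:
    margin = adjust_margin(margin, n_active)
    history.append(margin)

print(history)
```

Because the margin moves by one step per frame and stops at the two limits, a burst of activity tightens the pruning gradually rather than abruptly.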
However, other criteria may be applied, for example based on the number of active model states or the number of active words (especially for recognizers with very large vocabularies).
This dynamic threshold adjustment can also be used in systems that do not perform a priori weighting.
The recognizer described above is restricted to a finite number of possible observations in any given state. If desired, however, the probabilities b_jk may be replaced by a continuous probability density b_j(O) that has a value for any observation O. As is well known, a general continuous probability density can be closely approximated by a more constrained form: a discrete weighted sum (or mixture) of continuous Gaussian functions. The probability density function is then

$$b_j(O) = \sum_{x=1}^{X} c_{jx}\, N[O, \mu_{jx}, U_{jx}]$$

where X is the number of components (or “modes”) in the mixture, c_jx is the weight of mode x in state j, and N[O, μ_jx, U_jx] is the probability of drawing the vector O from a multivariate normal distribution with mean vector μ_jx and covariance matrix U_jx.
For a Gaussian distribution, the following equation holds:

$$N[O, \mu, U] = \frac{1}{\sqrt{(2\pi)^{d}\,|U|}} \exp\left[-\frac{1}{2}\,(O-\mu)^{T} U^{-1} (O-\mu)\right]$$

Here, d is the size of the vectors. If U is a diagonal matrix whose diagonal terms are the variances σ_i², this reduces to

$$N[O, \mu, U] = \frac{1}{\sqrt{(2\pi)^{d}\,|U|}} \exp\left[-\frac{1}{2}\sum_{i=1}^{d} \frac{(\upsilon_i - \mu_i)^{2}}{\sigma_i^{2}}\right]$$

where υ_i are the elements of O.
The recognition process of Equations 1-9 is unchanged; only the definition of b changes. The training process for such continuous density models is known and is not described here.
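As an illustration of how such a density could be evaluated, the sketch below computes a mixture of diagonal-covariance Gaussians in plain Python. It assumes that each component is described by a mean vector and a vector of per-element variances (the diagonal of U); the function names are hypothetical, and a practical recognizer would more likely work in the log domain.

```python
import math


def diag_gaussian(obs, mean, variances):
    """Density of obs under a Gaussian with diagonal covariance."""
    d = len(obs)
    det = math.prod(variances)    # |U| for a diagonal matrix
    exponent = -0.5 * sum(
        (v - m) ** 2 / s for v, m, s in zip(obs, mean, variances)
    )
    return math.exp(exponent) / math.sqrt((2 * math.pi) ** d * det)


def mixture_density(obs, mix_weights, means, variances):
    """b_j(O) for one state: weighted sum over the mixture components."""
    return sum(
        c * diag_gaussian(obs, m, s)
        for c, m, s in zip(mix_weights, means, variances)
    )


# Two-component mixture over 2-dimensional observations (made-up numbers).
density = mixture_density(
    obs=[0.0, 0.0],
    mix_weights=[0.5, 0.5],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
)
print(density)
```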
The parallel processing method is simpler than the tree processing method described above. A typical process of this kind involving pruning maintains a running list of the (for example) six “best” candidates as the models are examined. For example,
(A) compare the unknown word with the first six models, generate a list of these models, and record a similarity score for each;
(B) compare the unknown word with another model. If the score obtained is higher (that is, indicates greater similarity) than any in the list, substitute the new model and score for the lowest-scoring entry in the list;
(C) repeat step (B) until all models have been processed.
This process yields a list of the six highest-scoring models. If a priori probabilities are to be applied before the best candidate is selected, each of the six scores is multiplied by the corresponding weighting factor, and the candidate with the best weighted score is selected.
In the proposed method, weighting is performed during the recognition process:
(A) compare the unknown word with the first six models and generate a similarity score for each. Multiply each score by the weighting factor for the corresponding model. Generate a list of these models and record the weighted score for each;
(B) compare the unknown word with another model. Multiply the score by the weighting factor for this model. If the weighted score obtained is higher than any in the list, substitute the new model and weighted score for the entry in the list with the lowest weighted score;
(C) repeat step (B) until all models have been processed.
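The difference between the two procedures can be seen in a small sketch. It is illustrative only: the scores and weighting factors are made up, the list length is shortened from six to two so the effect shows with three models, and the function name `n_best` is hypothetical.

```python
def n_best(scores, n=6, weights=None):
    """Keep a running list of the n best (score, model) pairs.

    If weights is given, each score is multiplied by its model's weighting
    factor before it is compared with the list.
    """
    best = []
    for model, score in scores.items():
        if weights is not None:
            score = score * weights[model]
        if len(best) < n:
            best.append((score, model))
            continue
        worst = min(best)
        if score > worst[0]:
            best[best.index(worst)] = (score, model)
    return sorted(best, reverse=True)


scores = {"a": 0.9, "b": 0.8, "c": 0.7}    # made-up similarity scores
weights = {"a": 0.1, "b": 0.2, "c": 0.9}   # made-up a priori weightings

# Weighting applied only after the list has been pruned to two entries.
late = max((s * weights[m], m) for s, m in n_best(scores, n=2))
# Weighting applied during recognition, before pruning.
early = n_best(scores, n=2, weights=weights)[0]

print(late[1], early[1])
```

In this example model "c" has the best weighted score, but it never enters the unweighted list, so weighting after pruning selects "b" while weighting during recognition selects "c".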

Claims

A language recognition method comprising:
comparing a portion of an unknown speech with reference models to generate a measure of similarity, the unknown speech consisting of earlier portions, said portion and further portions;
repeatedly comparing further portions of the unknown speech with reference models to generate an accumulated measure of similarity for each of a plurality of allowable sequences of reference speech, wherein the allowable sequences of reference speech are defined by stored data defining such sequences, and wherein each generated accumulated measure of similarity includes contributions from previously generated measures of similarity, each obtained from a comparison of one or more earlier portions of the unknown speech with a reference model corresponding to an earlier reference speech in the respective allowable sequence;
weighting the accumulated measures in accordance with weighting factors representing an a priori probability for each of the allowable sequences;
wherein the weighting is performed by weighting each calculation of a measure or accumulated measure for a partial sequence of the allowable sequences of reference speech,
the weighting using a combined value obtained by combining the weighting factors for each allowable sequence that begins with that partial sequence,
and the combined value being modified by any weighting factor applied to a measure generated for a speech or shorter sequence with which that partial sequence begins.

The method of claim 1, wherein any sequence whose weighted accumulated measure indicates, to a degree defined by a pruning criterion, less similarity than the measures for other allowable sequences is excluded from further repeated comparison.

The method of claim 2, wherein the pruning criterion is repeatedly adjusted in dependence on the number of measures generated and not excluded from further repeated comparison, so that this number is kept constant.

A language recognition apparatus comprising:
storage means for storing data relating to reference models representing speech and data defining a plurality of allowable sequences of reference speech;
comparing means for repeatedly comparing portions of an unknown speech, which has earlier portions and further portions, with the reference models to generate an accumulated measure of similarity for each of the plurality of allowable sequences of reference speech, wherein each generated accumulated measure of similarity includes contributions from previously generated measures of similarity, each obtained from a comparison of one or more earlier portions of the unknown speech with a reference model corresponding to an earlier reference speech in the respective allowable sequence; and
weighting means for weighting the accumulated measures in accordance with weighting factors representing an a priori probability for each of the allowable sequences,
wherein the weighting means is operable to weight a measure or accumulated measure for a partial sequence of the allowable sequences of reference speech,
the weighting using a combined value obtained by combining the weighting factors for each allowable sequence that begins with that partial sequence,
and the combined value being modified by any weighting factor applied to a measure generated for a speech or shorter sequence with which that partial sequence begins.

The apparatus of claim 4, further comprising means for excluding from further repeated comparison any sequence whose weighted accumulated measure indicates, to a degree defined by a predetermined pruning criterion, less similarity than the measures for other allowable sequences.

The apparatus of claim 5, wherein the pruning criterion is repeatedly adjusted in dependence on the number of measures generated and not excluded from further repeated comparison, so that this number is kept constant.

A method of assigning a weighting factor to each node of a language recognition network representing a plurality of allowable sequences of reference speech, comprising:
predetermining a weighting factor for each of the allowable sequences of reference speech, each weighting factor representing an a priori probability of the respective allowable sequence occurring; and
for each node in the network, combining the values of the weighting factors for each of the allowable sequences that begin with the partial sequence incorporating that node, the combined value being modified by the weighting factors applied to a speech or shorter partial sequence with which that partial sequence begins.

The method of claim 7, comprising:
assigning the log of the given weighting factor to the last node of the network corresponding to each allowable sequence;
assigning to each preceding node, as its log probability value, the maximum of the values assigned to the node or nodes following it; and
subtracting from the value for each node the value assigned to the node preceding it.

The method according to claim 7 or 8, wherein the recognition network has a tree structure, and at least one node other than the first node has two or more branches.