JP3916792B2

JP3916792B2 - Voice recognition device

Info

Publication number: JP3916792B2
Application number: JP06191499A
Authority: JP
Inventors: 直人加藤
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1999-03-09
Filing date: 1999-03-09
Publication date: 2007-05-23
Anticipated expiration: 2019-03-09
Also published as: JP2000259174A

Description

【０００１】
【発明の属する技術分野】
本発明は、言語モデルを使用して音声認識を行う音声認識装置に関する。
【０００２】
【従来の技術】
言語モデルを使用して音声認識を行う音声認識装置が知られている。言語モデルとしては一般に、文章が記載されているテキストからｎ個の連続する単語について学習したｎ−ｇｒａｍモデルが使用される。従来は、ｎ＝２の場合である２−ｇｒａｍ（ｂｉ−ｇｒａｍとも呼ばれる）やｎ＝３の場合である３−ｇｒａｍ（ｔｒｉ−ｇｒａｍとも呼ばれる）がよく使用されていた（例えば、中川聖一著「確率モデルによる音声認識」電子通信学会、１９９８）。
【０００３】
従来この種の音声認識方法をｔｒｉ−ｇｒａｍを例にして説明する。ｔｒｉ−ｇｒａｍとして次の２つの言語モデルがテキストから学習されているとする。それぞれにはスコア（文法的あるいは意味的な結合度あるいは使用頻度について予め与える得点）が付けられている。この例では、テキスト中の出現頻度に基づいてスコアが計算されている。
【０００４】
（ｔｒｉ−ｇｒａｍの言語モデル例）
放送−技術−研究所スコア７
放送−技術−緊急会議スコア８
次のような音声が発生され、音声認識の対象となったとする。
【０００５】
（音声）
「砧にある放送技術研究所」
ここで音声の意味内容を文字で表現しているが、音声認識装置への入力は音声波形である。音声認識装置では入力の音声波形から音響的な特徴を取り出し、予め、音素（音韻よりも時間的に短い音声の長さの単位）ごとに用意されている音響モデルと比較する。入力の音声波形とその音響特徴がよく似ている音素を組み合わせて単語を認識する。通常、もっともらしさが高い幾つかの単語を認識候補として選択した後、言語モデルを使用して認識結果として出力する単語を決定する。
【０００６】
上記音声の「技術」に継続する「研究所」について、以下の４つの認識候補が得られたとする。
【０００７】
（認識候補の例）
【０００８】
【表１】

【０００９】
この時、現在、認識中の単語を含めた認識候補のｔｒｉ−ｇｒａｍは次のようになる。
【００１０】
（認識候補のｔｒｉ−ｇｒａｍの例）
認識候補１放送−技術−研究所
認識候補２放送−技術−緊急会議
認識候補３放送−技術−緩急
認識候補４放送−技術−県警
上記（ｔｒｉ−ｇｒａｍの言語モデル例）、すなわち、言語モデルとしてスコアが与えられているｔｒｉ−ｇｒａｍを参照すると、２つのｔｒｉ−ｇｒａｍ、すなわち、（認識候補のｔｒｉ−ｇｒａｍ）の４つの認識候補の中の認識候補１と認識候補２についてはｔｒｉ−ｇｒａｍのスコアが与えられており、他の認識候補３，４については該当するものが存在しない。このため、認識候補３および４についてのｔｒｉ−ｇｒａｍが除外される。
【００１１】
認識候補１、２についてのスコアを比較すると、認識候補２のスコアが８、認識候補１のスコアが７なので、スコアの高い認識候補として、認識候補２、すなわち、４つの認識候補（研究所、緊急会議、緩急、県警）の中の「緊急会議」が単語の認識結果として決定される。この認識結果は実際の音声「研究所」と異なるので誤認識となる。
【００１２】
この問題の１つの単純な解決策はｎ−ｇｒａｍモデルの中でｎ≧４のｎを使用するこである。今の例で６−ｇｒａｍを使用する例を説明する。テキストから次の６−ｇｒａｍが学習されているものとする。
【００１３】
（６−ｇｒａｍの言語モデル例）
砧−に−ある−放送−技術−研究所スコア９
音声認識候補として６−ｇｒａｍまで考慮すると、認識候補の６−ｇｒａｍは次のようになる。
【００１４】
（認識候補の−６ｇｒａｍの例）
認識候補１砧−に−ある−放送−技術−研究所
認識候補２砧−に−ある−放送−技術−緊急会議
認識候補３砧−に−ある−放送−技術−緩急
認識候補４砧−に−ある−放送−技術−県警
（６−ｇｒａｍの言語モデル例）を参照すると、この場合には、認識候補１のみがスコアを有するので、単語の認識結果は「研究所」となる。これまでの認識結果と連結すると、最終的には「砧にある放送技術研究所」が得られ、音声の意味内容と音声認識結果とが一致した正解が得られる。
【００１５】
なお、可変ｎ−ｇｒａｍ（政瀧浩和、松永昭一、匂坂芳典「連続音声認識のための可変長連鎖統計言語モデル」電子情報通信学会音声研究会報告，ＳＰ９５−７３，ｐｐ．１−６，１９９５では出現頻度が高い定型表現に対しては、ｎ≧４のｎ−ｇｒａｍを利用する方法を提案している。
【００１６】
【発明が解決しようとする課題】
しかしながら、上述の単純な解決策では、一般にどれくらいのｎを使用すればよいかが問題となる。ｎ＝４，５，６．．．と単純に全てのｎの場合をテキストから学習すると、ｎの値の増加と共に言語モデルのサイズ（個数）は大きくなり、音声認識装置の記憶容量を超えてしまう。
【００１７】
また、出現頻度が１回でも音声認識の際に重要となるｎ≧４のｎ−ｇｒａｍもあるので、出現頻度が高いｎ−ｇｒａｍのみを対象にしている可変ｎ−ｇｒａｍを使うことも困難である。
【００１８】
本発明の目的は、上述の点に鑑みて、言語モデルのデータサイズを増やすことなく、また、頻度によらず、認識精度のよい音声認識装置を提供することにある。
【００１９】
【課題を解決するための手段】
このような目的を達成するために、請求項１の発明は、入力の音声に対して単語毎に複数の音声認識候補を取得し、当該取得した複数の認識候補の中の１つを予め定めた選択基準にしたがって選択し、当該選択された認識候補を音声認識結果とする音声認識装置において、単語およびその単語の学習テキストにおける出現位置を記憶した記憶手段と、前記複数の音声認識候補の単語、及び、それまでに音声認識結果として選択された単語それぞれと同一単語の、前記学習テキストにおける出現位置の値を取得して、対応する音声認識候補の単語又は音声認識結果として選択された単語の単語位置の値とする単語位置検出手段と、前記複数の音声認識候補の単語それぞれについて、当該音声認識候補の単語の単語位置の値と、それまでに音声認識結果として選択された各単語の単語位置の値との連続性を調べ、連続している単語列の長さを計数する連続単語長さ計数手段とを有し、前記連続単語長さ計数手段によって計数された長さを前記選択基準とし、該選択基準に従って、該計数された単語列の長さが最も長い音声認識候補の単語を、音声認識結果として選択することを特徴とする。
【００２０】
請求項２の発明は、請求項１に記載の音声認識装置は言語モデルを使用して音声認識が行われ、前記言語モデルの使用に際して得られるｎ個の連続する単語列の出現頻度を示すスコアと前記計数手段により計数された単語列の長さが大きくなるほどその値が大きくなるスコアとを前記複数の認識候補それぞれについて加算する加算手段をさらに具え、加算されたスコアの値が最も大きい認識候補を単語の音声認識結果として選択することを特徴とする。
【００２１】
請求項３の発明は、請求項１に記載の音声認識装置において、前記記憶手段に登録すべき単語およびその出現位置を入力する入力手段と、当該入力された単語および単語位置を前記記憶手段に登録する登録手段とをさらに具えたことを特徴とする。
【００２２】
請求項４の発明は、請求項３に記載の音声認識装置において、前記入力手段は文が記載された学習テキストを受け付け、当該受け付けた文を単語に分割し、当該単語に分割された学習テキストから単語の出現位置を検出し、入力することを特徴とする。
【００２３】
請求項５の発明は、請求項１に記載の音声認識装置において、前記記憶手段に記憶される出現位置は同一の単語について複数の出現位置が許容されることを特徴とする。
【００２６】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態を詳細に説明する。
【００２７】
本発明を適用した音声認識装置のシステム構成を図１に示す。図１において、ｉ１は言語モデルを学習するためのテキストを入力する端子であり、ｉ２は単語の出現位置を検出し、辞書に登録するためのテキストを入力する端子である。ｉ３は認識対象の音声を入力する端子である。０１は音声認識結果を出力する端子である。
【００２８】
１はｎ−ｇｒａｍモデル装置であり、端子ｉ１から入力されたテキストについて単語列の出現頻度を計数し、各単語列についてスコアを付与する。このようにして学習した単語列およびそのスコアが言語モデルとしてｎ−ｇｒａｍモデル装置１内の記憶装置に保存される。
【００２９】
ｎ−ｇｒａｍモデル装置１は音声認識デコード装置３から単語列４が与えられた場合には、与えられた単語列について、装置内に保存された複数の言語モデルを検索し、合致する言語モデルのスコアを音声認識デコード装置３に返す。
【００３０】
２は単語出現位置辞書装置である。単語出現位置辞書装置２は端子ｉ２から入力されたテキスト中の各単語の出現位置を検出し、単語とその出現位置を単語出現位置辞書装置２内の記憶装置に保存する。
【００３１】
また、音声認識デコード装置３から単語列が与えられた場合には、保存されている単語およびその位置を参照して、与えられた単語に合致する保存の単語のスコアを音声認識デコード装置３に返す。
【００３２】
３は音声認識デコード装置であり、端子ｉ３から入力された音声（信号）を音声認識する。より具体的には、従来と同様、音響モデルを使用して複数の音声認識候補を取得し、ｎ−ｇｒａｍモデル装置１に保存された言語モデルのｎ−ｇｒａｍ（スコア）および単語出現位置辞書装置２に保存されている単語およびその出現位置に基づき複数の音声認識候補の中から音声認識結果として使用する候補を決定する。この処理については後で詳しく説明する。
【００３３】
図１の音声認識装置はたとえば、パーソナルコンピュータなどの汎用コンピュータで実現できる。汎用コンピュータで実行する音声認識プログラムの内容を図２および図３に示す。図２は音声認識処理の具体的な処理内容を示す。図３は言語出現位置辞書に単語およびおよびその出現位置を登録するための処理内容を示す。
【００３４】
図２および図３を参照して本発明に係る処理を説明する。
【００３５】
ステップＳ１〜Ｓ３において汎用コンピュータ（内蔵のＣＰＵ）は言語モデルの学習を行う。登録したいテキスト（いわゆる文書）をキーボードや装着のフロッピーディスクからインターフェース（図１のｉ１，ｉ３に対応）を介して入力し、装置内のメモリに一時記憶しておく。メモリ上のテキストを従来と同様にして単語に分割し、ｎ−ｇｒａｍの値を従来と同様にして出現頻度から計算する。
【００３６】
計算された値（スコア）とその値に対応する単語列が言語モデルとして内蔵のハードディスクに保存される（ステップＳ１）。
【００３７】
他のテキストが入力されると汎用コンピュータは入力されたテキストを単語分割し、ｎ−ｇｒａｍを計算し直す。新たに計算された値で、ハードディスク上の対応の単語列のスコアが更新される。新たに出現した単語列についてはｎ−ｇｒａｍの値と共に、単語列がハードディスクに保存される（ステップＳ２）。
【００３８】
単語分割されたメモリ上のテキストを使用して、分割された単語について単語出現位置を検出する。この処理は図３を使用して説明する。新たに検出された単語についてはハードディスク上の単語出現位置辞書に追加登録される（ステップＳ３）。
【００３９】
ステップＳ４からＳ７で音声認識処理を実行する。
【００４０】
汎用コンピュータに接続のマイクロホンから音声が入力されると、入力された音声（信号）は音響特徴が抽出される。ハードディスクに保存されている音響モデルを参照することにより入力の音響特徴に対してもっともらしさが高い音声認識候補単語が各単語ごとに複数作成される（ステップＳ４）。
【００４１】
ステップＳ５では認識候補に対して、単語出現位置辞書に記載されている出現位置を与える。ステップＳ６では、これまでに音声認識結果として選択された単語列と今、選択しようとしている認識候補とを組み合わせた単語列について出現位置の連続性を調べる。単語出現位置が連続してる単語列の長さに応じてスコアを与える。
【００４２】
ステップＳ７では、ステップＳ６で得られた出現位置についてのスコアと言語モデルのｎ−ｇｒａｍでのスコアとを加算し、加算結果が最も大きい値を持つ認識候補を認識結果（選択する音声認識候補）とする。
【００４３】
ステップＳ８では入力された音声から得られるたとえば、文の単語認識候補全てについて、上述の認識候補確定処理を行ったかの終了判定を行う。終了していない場合には、手順をステップＳ４に戻し、次の位置の単語認識候補の作成および確定処理を実行する。全ての単語について候補選択を終了すると、選択された単語列（例えば、文）をインターフェース（図１の出力端子ｏ１に対応）を介してディスプレイに表示する（ステップＳ８→エンド）。
【００４４】
具体的な処理例を以下に示す。テキストが汎用コンピュータに入力されるとテキストは言語モデルの学習のために単語に分割される。テキストはニュース原稿のような大量のテキストが使用される。この単語に分割されたテキストを使用して、ｎ−ｇｒａｍを計算する。スコアは単語列の出現頻度に基づいて計算される（ステップＳ１）。
【００４５】
この計算によりｔｒｉ−ｇｒａｍ１として以下のスコアが得られたものとする。
【００４６】
（ｔｒｉ−ｇｒａｍ１）
放送−技術−研究所スコア６
放送−技術−緊急会議スコア８
新たな下記のテキストがテキストが入力されるとこのテキストは次のように単語分割される。
【００４７】
（テキスト）
世田谷の砧にある放送技術研究所のグラウンド
（単語分割されたテキスト）
世田谷−の−砧−に−ある−放送−技術−研究所−の−グラウンド
上記ｔｒｉ−ｇｒａｍ１の中の単語列と同じ単語列が単語分割されたテキストの中にあるので、
新たなｔｒｉ−ｇｒａｍ２が次のように再計算される（ステップＳ２）。
【００４８】
（ｔｒｉ−ｇｒａｍ２）
放送−技術−研究所スコア７（６＋１）
放送−技術−緊急会議スコア８
以上がｔｒｉ−ｇｒａｍの学習例であるが、ｎ≧４のｎ−ｇｒａｍについても学習が行われ、入力のテキスト中の単語と同一部分があるものについてはスコアの値が１だけインクリメント（加算）される。このようにして学習された（作成された）ｎ−ｇｒａｍの言語モデルが汎用コンピュータ内のハードディスクに保存される。
【００４９】
次に本発明に係る単語出現位置辞書の登録処理を図３を使用して説明する。テキストが入力されると従来のようにテキストが単語分割される（ステップＳ１１→Ｓ１２）。
【００５０】
次に、変数Ｉの初期値として数値１を与え、分割された単語のＩ番目の単語を取り出す（ステップＳ１３→Ｓ１４）。取り出された単語についてハードディスク上の単語出現位置辞書を検索し、同一の単語がなければ、単語およびそのスコアを登録（記憶）する（ステップＳ１５）。
【００５１】
以下、変数Ｉの値を更新して（ステップＳ１７）、テキストから分割の次の単語を取り出して、出現位置を単語出現位置辞書に登録する（ステップＳ１３〜Ｓ１６→Ｓ１７→Ｓ１３のループ処理）。文末の単語の出現位置の登録を終了すると図３の手順を終了する（ステップＳ１６→エンド）。
【００５２】
このような処理を行うことによって、下記の入力のテキストの各単語の出現位置が単語出現位置辞書に登録することができる（ステップＳ３）。
【００５３】
（単語分割されたテキストと出現位置の関係）
出現世田谷−の−砧−に−ある−放送−技術−研究所−の−グラウンド
位置 01 02 03 04 05 06 07 08 09 10
認識すべき下記の音声が入力されると、
（音声）
砧にある放送技術研究所
（ここでは文字表記を行っているが実際の入力は音声波形である。）
音声認識認識候補が順次に選択され、単語音声「研究所」については以下のような音声認識候補が作成される（ステップＳ４）。
【００５４】
（認識候補の例）
ここで、下記のような文字列の範囲でｔｒｉ−ｇｒａｍを適用すると、
【００５５】
【表２】

【００５６】
認識候補１，２のそれぞれについて、接続の単語の出現位置を単語出現位置辞書から求める（ステップＳ５）と、以下のようになる。
【００５７】
（認識候補の単語出現位置）
認識候補１砧−に−ある−放送−技術−研究所
出現位置１ 03 04 05 06 07 08
認識候補２砧−に−ある−放送−技術−緊急会議
出現位置２ 03 04 05 06 07 NULL
ここで、「NULL」は単語出現位置辞書には単語位置が記載されていないことを示す。
【００５８】
認識候補の単語を基準にして。そこから単語出現位置が連続している単語列の長さに対してスコアが与えられる。認識候補１は03〜08と６単語が連続しているので、６点のスコアが与えられる。一方、認識候補２では基準位置の単語「緊急会議」によって、単語出現位置が連続しなくなるので、出現位置に関するスコアは与えられない（ステップＳ６）。
【００５９】
言語モデルのｔｒｉ−ｇｒａｍのスコアと単語出現位置のスコアを加算すると（認識候補の総スコア）
認識候補１＝７＋６＝１３
認識候補２＝８＋０＝８
したがって、スコアが高い認識候補１（「研究所」）が認識結果として選択される（ステップＳ７）。
【００６０】
上述の実施形態の他に次の形態を実施できる。
【００６１】
１）上記音声認識プログラムを記録する記録媒体は、ＲＯＭ、ＲＡＭ等のＩＣメモリ、ハードディスクなどの固定記憶装置、フロッピーディスク、ＣＤＲＯＭなどの各種の情報記録媒体を使用することができる。
【００６２】
２）単語出現位置辞書に記載する単語位置の個数は１個に限らず複数とすることができる。この場合には、単語出現位置の連続の単語数を計数する際に、複数の単語位置それぞれについて連続の有無を判定し、前の単語と出現位置が連続する単語出現位置を使用する。
【００６３】
３）上述の実施形態では単語出現位置辞書の作成および登録機能を有する音声認識装置を説明したが、単語出現位置辞書は外部の装置で作成しておき、通信あるいは記録媒体を介して音声認識装置に実装するようにしてもよい。この場合には、音声認識装置側では、実装された単語出現位置辞書を使用して音声認識を行う。
【００６４】
４）上述の実施形態では言語モデルから得られるスコアと、単語出現位置辞書から得られるスコアとを加算して、複数の音声認識候補の中の１つを選択した。しかしながら、特定の用途、たとえば、入力される音声の文が限定されているような場合は、単語の出現位置だけを複数の音声認識候補の選択基準として使用することができる。
【００６５】
５）単語出現位置の連続性を調べるにはつぎのようにするとよい。基準となる単語の出現位置を単語出現位置から取得すると、取得した出現位置の値を１だけ減算する。次に基準となる単語の前の位置の単語についても単語出現位置を単語出現位置辞書から取得し、取得した出現位置値と、上記減算により得られる値を比較する。一致判定が得られると、２つの単語は連続していることになる。以下、順次に接続する単語について、単語出現位置から取得した出現位置の値と、連続する場合に予測される出現位置の値を比較する。また、一致判定が得られる回数を計数することで、連続の単語長さを計数することができる。
【００６６】
なお、単語列の先頭から認識候補の選択を行っていくので、選択が行われる毎に、選択された単語の単語出現辞書の出現位置をメモリに一時記憶しておくと、その都度、同一の単語の出現位置を単語出現辞書から取得する必要はない。
【００６７】
【発明の効果】
以上、説明したように、請求項１の発明によれば、複数の認識候補の選択基準の１つとして、その認識候補の出現位置を使用し、他の単語と出現位置に関する連続性を調べることで、実際の入力音声の意味内容により近い認識候補を選択することができる。この音声認識候補の選択に使用される単語およびその出現位置情報は、ｎ−ｇｒａｍの言語モデルの情報量よりも小さくできる。
【００６８】
請求項２の発明では、言語モデルによる出現頻度のスコアと、単語位置の連続長さのスコアを加算することで、単語位置のみのあるいは言語モデルを使用する認識候補の選択よりも音声認識精度を高めることができる。
【００６９】
請求項３の発明によれば、音声認識装置に、単語およびその出現位置を登録する機能が備わるので、新しい単語の音声認識にも対処することができる。
【００７０】
請求項４の発明では、テキストを用意することでテキストから自動的に新しい単語を検出し、単語およびその出現位置を登録することができる。
【００７１】
請求項５の発明では、出現位置が多岐に渡る単語についても、その出現位置を登録しておくことで、このような単語が認識候補となった場合にも対処することができる。
【図面の簡単な説明】
【図１】本発明実施形態のシステム構成を示すブロック図である。
【図２】本発明実施形態の音声認識処理内容を示すフローチャートである。
【図３】本発明実施形態の単語出現位置辞書の登録処理内容を示すフローチャートである。
【符号の説明】
１ｎ−ｇｒａｍモデル装置
２単語出現位置辞書装置
３音声認識デコード装置
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a speech recognition apparatus that performs speech recognition using a language model.
[0002]
[Prior art]
A speech recognition apparatus that performs speech recognition using a language model is known. The language model is generally an n-gram model learned over sequences of n consecutive words in a text of written sentences. Conventionally, the 2-gram (n = 2, also called a bi-gram) and the 3-gram (n = 3, also called a tri-gram) have often been used (see, for example, Seiichi Nakagawa, “Speech Recognition by Stochastic Models,” Institute of Electronics and Communication Engineers, 1998).
[0003]
A conventional speech recognition method of this type is described below using the tri-gram as an example. Assume that the following two tri-grams have been learned from text as the language model. Each is given a score (a value assigned in advance for grammatical or semantic connectivity, or for frequency of use). In this example, the score is calculated from the frequency of appearance in the text.
[0004]
(Example of tri-gram language model)
Broadcast-Technology-Laboratory Score 7
Broadcast-Technology-Emergency Meeting Score 8
Assume that the following speech is uttered and is to be recognized.
[0005]
(Speech)
"Broadcast Technology Laboratory in Kinuta"
The content of the speech is written here in characters, but the input to the speech recognition apparatus is a speech waveform. The speech recognition apparatus extracts acoustic features from the input speech waveform and compares them with acoustic models prepared in advance for each phoneme (a unit of speech shorter in duration than a phonological unit). Words are recognized by combining the phonemes whose acoustic features closely resemble those of the input speech waveform. Usually, several words with high likelihood are selected as recognition candidates, and the word to be output as the recognition result is then determined using the language model.
[0006]
Assume that the following four recognition candidates are obtained for “laboratory”, the word that follows “technology” in the above speech.
[0007]
(Example of recognition candidate)
[0008]
[Table 1]

[0009]
At this time, the tri-grams of the recognition candidates, each including the word currently being recognized, are as follows.
[0010]
(Example of recognition candidate tri-grams)
Recognition candidate 1 Broadcast-Technology-Laboratory
Recognition candidate 2 Broadcast-Technology-Emergency Meeting
Recognition candidate 3 Broadcast-Technology-Slow/Fast
Recognition candidate 4 Broadcast-Technology-Prefectural Police
Referring to the tri-grams that have scores in the language model, that is, the (Example of tri-gram language model) above, only two of the four candidate tri-grams, those of recognition candidates 1 and 2, have tri-gram scores; nothing corresponds to recognition candidates 3 and 4. The tri-grams for recognition candidates 3 and 4 are therefore excluded.
[0011]
Comparing the scores of recognition candidates 1 and 2, recognition candidate 2 has a score of 8 and recognition candidate 1 has a score of 7. Recognition candidate 2, which has the higher score, that is, “Emergency Meeting” among the four recognition candidates (Laboratory, Emergency Meeting, Slow/Fast, Prefectural Police), is therefore determined as the word recognition result. Since this result differs from the actual speech “laboratory”, it is a misrecognition.
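The selection described above can be sketched in a few lines of Python. This is only an illustration of the prior-art behavior, not code from the patent: the function name `select_by_trigram` and the English word tokens are assumptions made for the example, and the scores are the ones given in the tri-gram language model example.

```python
# Tri-gram scores from the example language model (frequency-based scores).
TRIGRAM_SCORES = {
    ("Broadcast", "Technology", "Laboratory"): 7,
    ("Broadcast", "Technology", "Emergency Meeting"): 8,
}


def select_by_trigram(history, candidates, scores):
    """Pick the candidate whose tri-gram (last two history words + candidate)
    has the highest score. Candidates without a tri-gram entry are excluded.
    Returns None when no candidate has a score. (Hypothetical helper.)"""
    context = tuple(history[-2:])
    scored = {
        cand: scores[context + (cand,)]
        for cand in candidates
        if context + (cand,) in scores
    }
    if not scored:
        return None
    return max(scored, key=scored.get)


history = ["Kinuta", "ni", "aru", "Broadcast", "Technology"]
candidates = ["Laboratory", "Emergency Meeting", "Slow/Fast", "Prefectural Police"]
result = select_by_trigram(history, candidates, TRIGRAM_SCORES)
print(result)  # Emergency Meeting (the misrecognition described above)
```

Because only the last two words of the history are consulted, the earlier words "Kinuta ni aru" cannot influence the choice, which is why the higher-frequency but wrong candidate wins.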
[0012]
One simple solution to this problem is to use an n-gram model with n ≧ 4. The present example is described using a 6-gram. Assume that the following 6-gram has been learned from the text.
[0013]
(Example of 6-gram language model)
Kinuta-ni-aru-Broadcast-Technology-Laboratory Score 9
When 6-grams are considered for the speech recognition candidates, the candidate 6-grams are as follows.
[0014]
(Example of recognition candidate 6-grams)
Recognition candidate 1 Kinuta-ni-aru-Broadcast-Technology-Laboratory
Recognition candidate 2 Kinuta-ni-aru-Broadcast-Technology-Emergency Meeting
Recognition candidate 3 Kinuta-ni-aru-Broadcast-Technology-Slow/Fast
Recognition candidate 4 Kinuta-ni-aru-Broadcast-Technology-Prefectural Police
Referring to the (Example of 6-gram language model), only recognition candidate 1 has a score in this case, so the word recognition result is “laboratory”. Concatenating it with the recognition results so far finally yields “Broadcast Technology Laboratory in Kinuta”, a correct answer in which the speech recognition result matches the content of the speech.
[0015]
Note that the variable n-gram (Hirokazu Masataki, Shoichi Matsunaga, Yoshinori Sagisaka, “Variable-length chain statistical language model for continuous speech recognition,” IEICE Speech Study Group report, SP95-73, pp. 1-6, 1995) proposes using n-grams with n ≧ 4 for fixed expressions with a high appearance frequency.
[0016]
[Problems to be solved by the invention]
However, with the simple solution described above, the question is how large an n should generally be used. If all cases n = 4, 5, 6, ... are simply learned from the text, the size (number of entries) of the language model grows as n increases and exceeds the storage capacity of the speech recognition apparatus.
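The growth in model size can be illustrated on a toy scale. The sketch below is an assumption-laden illustration, not a measurement: it counts how many distinct entries would have to be stored if every n-gram order from 3 up to a maximum were learned from a single ten-word sentence, using the hypothetical helper `cumulative_ngram_entries`.

```python
def cumulative_ngram_entries(words, max_n, min_n=3):
    """Return {n: total distinct n-gram entries stored when all orders
    from min_n up to n are learned from `words`}. (Hypothetical helper.)"""
    total = 0
    sizes = {}
    for n in range(min_n, max_n + 1):
        distinct = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        total += len(distinct)
        sizes[n] = total
    return sizes


words = ["Setagaya", "no", "Kinuta", "ni", "aru",
         "Broadcast", "Technology", "Laboratory", "no", "Ground"]
sizes = cumulative_ngram_entries(words, max_n=6)
print(sizes)            # {3: 8, 4: 15, 5: 21, 6: 26}
print(len(set(words)))  # 9 distinct words
```

Even for one sentence, storing every order up to the 6-gram takes 26 entries, each holding several words, whereas a per-word table needs only one entry per distinct word. On a large corpus the gap would be expected to widen, although the exact figures depend on the text.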
[0017]
In addition, some n-grams with n ≧ 4 are important for speech recognition even if they appear only once, so it is also difficult to use the variable n-gram, which targets only n-grams with a high appearance frequency.
[0018]
In view of the above, an object of the present invention is to provide a speech recognition apparatus with good recognition accuracy, without increasing the data size of the language model and regardless of frequency.
[0019]
[Means for Solving the Problems]
To achieve this object, the invention of claim 1 is a speech recognition apparatus that acquires a plurality of speech recognition candidates for each word of input speech, selects one of the acquired recognition candidates according to a predetermined selection criterion, and uses the selected recognition candidate as the speech recognition result, the apparatus comprising: storage means that stores words and the appearance positions of those words in a learning text; word position detection means that acquires, for the words of the plurality of speech recognition candidates and for each word selected so far as a speech recognition result, the value of the appearance position of the identical word in the learning text, and uses it as the word position value of the corresponding candidate word or selected word; and continuous word length counting means that, for each word of the plurality of speech recognition candidates, examines the continuity between the word position value of that candidate word and the word position values of the words selected so far as speech recognition results, and counts the length of the continuous word string; wherein the length counted by the continuous word length counting means is used as the selection criterion, and the speech recognition candidate word with the longest counted word string is selected as the speech recognition result.
[0020]
The invention of claim 2 is the speech recognition apparatus according to claim 1, wherein speech recognition is performed using a language model, the apparatus further comprising adding means that adds, for each of the plurality of recognition candidates, a score indicating the appearance frequency of a string of n consecutive words obtained when the language model is used and a score whose value increases as the length of the word string counted by the counting means increases, and wherein the recognition candidate with the largest summed score is selected as the speech recognition result for the word.
[0021]
According to a third aspect of the present invention, in the speech recognition apparatus according to the first aspect, an input unit that inputs a word to be registered in the storage unit and an appearance position thereof, and the input word and the word position are stored in the storage unit. It further comprises registration means for registering.
[0022]
The invention of claim 4 is the speech recognition apparatus according to claim 3, wherein the input means accepts a learning text in which sentences are written, divides the accepted sentences into words, detects the appearance positions of the words from the word-divided learning text, and inputs them.
[0023]
The invention of claim 5 is the speech recognition apparatus according to claim 1, wherein a plurality of appearance positions are permitted for the same word among the appearance positions stored in the storage means.
[0026]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0027]
FIG. 1 shows the system configuration of a speech recognition apparatus to which the present invention is applied. In FIG. 1, i1 is a terminal for inputting text used to learn the language model, and i2 is a terminal for inputting text from which word appearance positions are detected and registered in a dictionary. i3 is a terminal for inputting the speech to be recognized. o1 is a terminal for outputting the speech recognition result.
[0028]
Reference numeral 1 denotes an n-gram model device, which counts the appearance frequency of word strings for text input from a terminal i1, and gives a score for each word string. The learned word string and its score are stored in the storage device in the n-gram model device 1 as a language model.
[0029]
When the word string 4 is given from the speech recognition decoding device 3, the n-gram model device 1 searches the language models stored in the device for the given word string and returns the score of the matching language model to the speech recognition decoding device 3.
[0030]
Reference numeral 2 denotes a word appearance position dictionary device. The word appearance position dictionary device 2 detects the appearance position of each word in the text input from the terminal i2, and stores the word and its appearance position in a storage device in the word appearance position dictionary device 2.
[0031]
When a word string is given from the speech recognition decoding device 3, the device refers to the stored words and their positions and returns the score of the stored word that matches the given word to the speech recognition decoding device 3.
[0032]
Reference numeral 3 denotes a speech recognition decoding device, which recognizes the speech (signal) input from the terminal i3. More specifically, as in the prior art, it acquires a plurality of speech recognition candidates using an acoustic model, and then determines the candidate to be used as the speech recognition result from among them based on the n-gram (score) of the language model stored in the n-gram model device 1 and on the words and appearance positions stored in the word appearance position dictionary device 2. This process is described in detail later.
[0033]
The speech recognition apparatus in FIG. 1 can be realized by a general-purpose computer such as a personal computer. The contents of the speech recognition program executed on the general-purpose computer are shown in FIGS. 2 and 3. FIG. 2 shows the specific contents of the speech recognition processing. FIG. 3 shows the processing for registering words and their appearance positions in the word appearance position dictionary.
[0034]
The processing according to the present invention is described with reference to FIGS. 2 and 3.
[0035]
In steps S1 to S3, the general-purpose computer (built-in CPU) learns a language model. Text to be registered (so-called document) is input from a keyboard or a mounted floppy disk via an interface (corresponding to i1 and i3 in FIG. 1), and temporarily stored in a memory in the apparatus. The text in the memory is divided into words in the same manner as in the past, and the value of n-gram is calculated from the appearance frequency in the same manner as in the past.
[0036]
The calculated value (score) and the word string corresponding to the calculated value are stored as a language model in the built-in hard disk (step S1).
[0037]
When other text is input, the general-purpose computer divides the input text into words and recalculates n-gram. The score of the corresponding word string on the hard disk is updated with the newly calculated value. For a newly appearing word string, the word string is stored in the hard disk together with the value of n-gram (step S2).
[0038]
Using the word-divided text in memory, the word appearance position is detected for each divided word. This process is described with reference to FIG. 3. Newly detected words are additionally registered in the word appearance position dictionary on the hard disk (step S3).
[0039]
Voice recognition processing is executed in steps S4 to S7.
[0040]
When sound is input from a microphone connected to a general-purpose computer, acoustic features are extracted from the input sound (signal). By referring to the acoustic model stored in the hard disk, a plurality of candidate words for speech recognition that are most likely for the input acoustic features are created for each word (step S4).
[0041]
In step S5, the appearance position described in the word appearance position dictionary is given to the recognition candidate. In step S6, the continuity of appearance positions is examined for a word string that is a combination of a word string that has been selected as a speech recognition result so far and a recognition candidate that is currently being selected. A score is given according to the length of a word string in which word appearance positions are continuous.
[0042]
In step S7, the appearance position score obtained in step S6 and the n-gram score of the language model are added, and the recognition candidate with the largest sum is taken as the recognition result (the speech recognition candidate to be selected).
[0043]
In step S8, for example, it is determined whether or not the above-described recognition candidate determination processing has been performed for all word recognition candidates of sentences obtained from the input speech. If not completed, the procedure returns to step S4, and a word recognition candidate at the next position is created and confirmed. When candidate selection is completed for all words, the selected word string (for example, sentence) is displayed on the display via the interface (corresponding to the output terminal o1 in FIG. 1) (step S8 → END).
[0044]
A specific processing example is shown below. When text is input to a general-purpose computer, the text is divided into words for learning a language model. A large amount of text such as a news manuscript is used as the text. The n-gram is calculated using the text divided into words. The score is calculated based on the appearance frequency of the word string (step S1).
[0045]
It is assumed that the following score is obtained as tri-gram1 by this calculation.
[0046]
(Tri-gram1)
Broadcast-Technology-Laboratory Score 6
Broadcast-Technology-Emergency Meeting Score 8
When new text below is entered, the text is divided into words as follows:
[0047]
(text)
The ground of the Broadcast Technology Laboratory in Kinuta, Setagaya
(Word-divided text)
Setagaya-no-Kinuta-ni-aru-Broadcast-Technology-Laboratory-no-Ground
Because the word-divided text contains a word string identical to one in tri-gram1, a new tri-gram2 is recalculated as follows (step S2).
[0048]
(Tri-gram2)
Broadcast-Technology-Laboratory Score 7 (6 + 1)
Broadcast-Technology-Emergency Meeting Score 8
The above is an example of tri-gram learning. Learning is also performed for n-grams with n ≧ 4, and the score is incremented by 1 for each n-gram that has an identical part in the words of the input text. The n-gram language model learned (created) in this way is stored on the hard disk in the general-purpose computer.
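The update from tri-gram1 to tri-gram2 can be sketched as follows. This is a minimal illustration that assumes the score is simply a raw occurrence count, which is consistent with the "6 + 1" arithmetic in the example but is not stated as the only possible scoring; the names `ngrams` and `update_scores` are hypothetical.

```python
from collections import Counter


def ngrams(words, n):
    """All runs of n consecutive words, in order of appearance."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def update_scores(scores, words, n=3):
    """Step S2: add 1 to the score of every n-gram found in `words`.
    N-grams not seen before are created with a score of 1."""
    scores.update(ngrams(words, n))  # Counter.update adds counts
    return scores


trigram1 = Counter({
    ("Broadcast", "Technology", "Laboratory"): 6,
    ("Broadcast", "Technology", "Emergency Meeting"): 8,
})
text = ["Setagaya", "no", "Kinuta", "ni", "aru",
        "Broadcast", "Technology", "Laboratory", "no", "Ground"]

trigram2 = update_scores(Counter(trigram1), text)
print(trigram2[("Broadcast", "Technology", "Laboratory")])         # 7
print(trigram2[("Broadcast", "Technology", "Emergency Meeting")])  # 8
```

Passing `n=6` to the same function would learn the 6-grams of the text in the same way.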
[0049]
Next, a word appearance position dictionary registration process according to the present invention will be described with reference to FIG. When the text is input, the text is divided into words as in the prior art (steps S11 → S12).
[0050]
Next, the numerical value 1 is given as the initial value of the variable I, and the I-th word of the divided words is extracted (steps S13 → S14). The word appearance position dictionary on the hard disk is searched for the extracted word, and if there is no identical word, the word and its score are registered (stored) (step S15).
[0051]
Thereafter, the value of the variable I is updated (step S17), the next word to be divided is extracted from the text, and the appearance position is registered in the word appearance position dictionary (loop processing of steps S13 to S16 → S17 → S13). When the registration of the appearance position of the word at the end of the sentence is completed, the procedure of FIG. 3 is terminated (step S16 → END).
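The registration loop of FIG. 3 can be sketched as below. This is an illustrative reading of steps S13 to S17, assuming 1-based positions within a single input text and assuming that a word already in the dictionary keeps its first registered position; how positions are numbered across several texts is not specified in the description. The function name `register_positions` is hypothetical.

```python
def register_positions(words, dictionary=None):
    """Register each word of a word-divided text with its appearance position.

    A word that is already in the dictionary is left unchanged, so only the
    first appearance position is kept (assumption for the basic embodiment).
    """
    if dictionary is None:
        dictionary = {}
    i = 1                           # step S13: initialize the variable I
    while i <= len(words):          # step S16: stop after the last word
        word = words[i - 1]         # step S14: take the I-th word
        if word not in dictionary:  # step S15: register only new words
            dictionary[word] = i
        i += 1                      # step S17: update I
    return dictionary


text = ["Setagaya", "no", "Kinuta", "ni", "aru",
        "Broadcast", "Technology", "Laboratory", "no", "Ground"]
position_dict = register_positions(text)
print(position_dict["Kinuta"], position_dict["Laboratory"])  # 3 8
```

Note that "no" appears at positions 2 and 9 in this text; with this sketch only position 2 is stored.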
[0052]
By performing such processing, the appearance position of each word in the following input text can be registered in the word appearance position dictionary (step S3).
[0053]
(Relationship between word-divided text and appearance position)
Word      Setagaya  no  Kinuta  ni  aru  Broadcast  Technology  Laboratory  no  Ground
Position  01        02  03      04  05   06         07          08          09  10
When the following speech to be recognized is input,
(Speech)
Broadcast Technology Laboratory in Kinuta
(the text is shown here in characters, but the actual input is a speech waveform)
speech recognition candidates are selected in sequence, and the following speech recognition candidates are created for the spoken word “laboratory” (step S4).
[0054]
(Example of recognition candidate)
Here, when tri-gram is applied in the range of the following character string,
[0055]
[Table 2]

[0056]
For each of recognition candidates 1 and 2, the appearance positions of the connected words are obtained from the word appearance position dictionary (step S5), as follows.
[0057]
(Word appearance positions of the recognition candidates)
Recognition candidate 1  Kinuta-ni-aru-Broadcast-Technology-Laboratory
Appearance positions 1   03 04 05 06 07 08
Recognition candidate 2  Kinuta-ni-aru-Broadcast-Technology-Emergency Meeting
Appearance positions 2   03 04 05 06 07 NULL
Here, “NULL” indicates that no word position is recorded in the word appearance position dictionary.
[0058]
A score is given for the length of the word string whose word appearance positions are continuous, counted from the recognition candidate word as the reference. Recognition candidate 1 has six consecutive words, 03 to 08, so it is given a score of 6 points. For recognition candidate 2, on the other hand, the word “Emergency Meeting” at the reference position breaks the continuity of the word appearance positions, so no appearance position score is given (step S6).
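Steps S5 and S6 for this example can be sketched as follows. The sketch assumes the score equals the number of consecutive words ending at the candidate, which reproduces the 6 points and 0 points above; a candidate that is in the dictionary but not continuous with the preceding word would score 1 under this assumption, a case the description does not cover. The names `lookup_positions` and `continuity_score` are hypothetical.

```python
# Word appearance position dictionary for the example text (1-based).
# "no" occurs twice in the text; only its first position is kept (assumption).
POSITION_DICT = {
    "Setagaya": 1, "no": 2, "Kinuta": 3, "ni": 4, "aru": 5,
    "Broadcast": 6, "Technology": 7, "Laboratory": 8, "Ground": 10,
}


def lookup_positions(words, dictionary):
    """Step S5: appearance position of each word, or None (NULL) if absent."""
    return [dictionary.get(word) for word in words]


def continuity_score(positions):
    """Step S6: length of the run of consecutive positions that ends at the
    candidate, which is the last element. Returns 0 if the candidate is NULL."""
    if not positions or positions[-1] is None:
        return 0
    length = 1
    expected = positions[-1] - 1  # position the preceding word should have
    for pos in reversed(positions[:-1]):
        if pos != expected:
            break
        length += 1
        expected -= 1
    return length


history = ["Kinuta", "ni", "aru", "Broadcast", "Technology"]
for candidate in ("Laboratory", "Emergency Meeting"):
    positions = lookup_positions(history + [candidate], POSITION_DICT)
    print(candidate, positions, continuity_score(positions))
# Laboratory [3, 4, 5, 6, 7, 8] 6
# Emergency Meeting [3, 4, 5, 6, 7, None] 0
```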
[0059]
Adding the tri-gram score of the language model and the word appearance position score gives the following:
(Total score of recognition candidates)
Recognition candidate 1 = 7 + 6 = 13
Recognition candidate 2 = 8 + 0 = 8
Therefore, recognition candidate 1 (“laboratory”), which has the higher score, is selected as the recognition result (step S7).
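The addition in step S7 can be written out as a short sketch. It takes the two per-candidate scores from the example as given inputs; the function name `select_candidate` is hypothetical, and treating a missing score as 0 is an assumption.

```python
def select_candidate(ngram_scores, position_scores):
    """Step S7: add the n-gram score and the appearance position score for
    each candidate and return (best candidate, totals). A candidate missing
    from either table contributes 0 for that table (assumption)."""
    candidates = sorted(set(ngram_scores) | set(position_scores))
    totals = {
        cand: ngram_scores.get(cand, 0) + position_scores.get(cand, 0)
        for cand in candidates
    }
    best = max(totals, key=totals.get)
    return best, totals


ngram_scores = {"Laboratory": 7, "Emergency Meeting": 8}     # tri-gram2
position_scores = {"Laboratory": 6, "Emergency Meeting": 0}  # step S6
best, totals = select_candidate(ngram_scores, position_scores)
print(best, totals)
# Laboratory {'Emergency Meeting': 8, 'Laboratory': 13}
```

The position score outweighs the one-point tri-gram advantage of "Emergency Meeting", so the correct word is selected.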
[0060]
In addition to the above embodiment, the following embodiment can be implemented.
[0061]
1) As a recording medium for recording the voice recognition program, various information recording media such as an IC memory such as ROM and RAM, a fixed storage device such as a hard disk, a floppy disk, and a CDROM can be used.
[0062]
2) The number of word positions recorded in the word appearance position dictionary for a word is not limited to one and may be plural. In this case, when counting the number of consecutive words, continuity is checked for each of the plural word positions, and the word appearance position that is continuous with the appearance position of the preceding word is used.
[0063]
3) The above embodiment describes a speech recognition apparatus that has functions for creating and registering the word appearance position dictionary. Alternatively, the word appearance position dictionary may be created by an external device and installed in the speech recognition apparatus via communication or a recording medium. In this case, the speech recognition apparatus performs speech recognition using the installed word appearance position dictionary.
[0064]
4) In the above-described embodiment, the score obtained from the language model and the score obtained from the word appearance position dictionary are added to select one of a plurality of speech recognition candidates. However, in a specific application, for example, when the input speech sentence is limited, only the appearance position of the word can be used as a selection criterion for a plurality of speech recognition candidates.
[0065]
5) The continuity of word appearance positions may be checked as follows. The appearance position of the reference word is acquired from the word appearance position dictionary, and 1 is subtracted from the acquired value. Next, the appearance position of the word preceding the reference word is also acquired from the word appearance position dictionary and compared with the value obtained by the subtraction. If they match, the two words are consecutive. The same comparison is then repeated for each successively connected word, between the appearance position acquired from the dictionary and the appearance position expected if the words are consecutive. Counting the number of matches gives the length of the continuous word string.
[0066]
Since recognition candidates are selected from the beginning of the word string, if the appearance position of each selected word is temporarily stored in memory every time a selection is made, the appearance position of the same word need not be acquired from the word appearance dictionary each time.
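Variation 2) above, in which one word may have several appearance positions, can be combined with the backward check of variation 5) roughly as follows. This is a sketch under assumptions: each word maps to a set of 1-based positions, every position of the candidate is tried as the reference, and the longest run found is used. The names `build_multi_position_dict` and `longest_run` are hypothetical.

```python
from collections import defaultdict


def build_multi_position_dict(words):
    """Record every appearance position (1-based) of each word."""
    dictionary = defaultdict(set)
    for position, word in enumerate(words, start=1):
        dictionary[word].add(position)
    return dict(dictionary)


def longest_run(words, dictionary):
    """Length of the longest run of consecutive positions ending at the
    candidate, which is the last word. Each registered position of the
    candidate is tried as the reference; 0 if the candidate is unregistered."""
    if not words:
        return 0
    best = 0
    for reference in dictionary.get(words[-1], ()):
        length = 1
        expected = reference - 1  # position the preceding word should have
        for word in reversed(words[:-1]):
            if expected not in dictionary.get(word, ()):
                break
            length += 1
            expected -= 1
        best = max(best, length)
    return best


text = ["Setagaya", "no", "Kinuta", "ni", "aru",
        "Broadcast", "Technology", "Laboratory", "no", "Ground"]
multi_dict = build_multi_position_dict(text)
print(sorted(multi_dict["no"]))                               # [2, 9]
print(longest_run(["Laboratory", "no", "Ground"], multi_dict))  # 3
print(longest_run(["Setagaya", "no", "Kinuta"], multi_dict))    # 3
```

With a single stored position for "no", only one of the two word strings in the usage example could be recognized as continuous; storing both positions handles both.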
[0067]
[Effect of the invention]
As described above, according to the invention of claim 1, the appearance position of each recognition candidate is used as one of the selection criteria for the plurality of recognition candidates, and its continuity with the appearance positions of other words is examined. A recognition candidate closer to the semantic content of the actual input speech can thus be selected. In addition, the amount of information needed for the words and appearance positions used in selecting a speech recognition candidate can be made smaller than that of an n-gram language model.
[0068]
In the invention of claim 2, the appearance frequency score from the language model and the continuous length score of the word positions are added together, so speech recognition accuracy can be improved compared with selecting a recognition candidate using only the word positions or only the language model.
[0069]
According to the invention of claim 3, since the speech recognition apparatus has a function of registering a word and its appearance position, it is possible to cope with speech recognition of a new word.
[0070]
In the invention of claim 4, by preparing the text, it is possible to automatically detect a new word from the text and register the word and its appearance position.
[0071]
According to the invention of claim 5, a plurality of appearance positions can be registered for a word that appears at various positions, so the case where such a word becomes a recognition candidate can be handled.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a system configuration of an embodiment of the present invention.
FIG. 2 is a flowchart showing the contents of speech recognition processing according to the embodiment of the present invention.
FIG. 3 is a flowchart showing registration processing contents of a word appearance position dictionary according to the embodiment of the present invention.
[Explanation of symbols]
1 n-gram model device
2 word appearance position dictionary device
3 speech recognition decoding device

Claims

A speech recognition apparatus that acquires a plurality of speech recognition candidates for each word of input speech, selects one of the acquired recognition candidates according to a predetermined selection criterion, and takes the selected recognition candidate as the speech recognition result, the apparatus comprising:
storage means storing words and the appearance positions of those words in a learning text;
word position detecting means for acquiring, for each of the plurality of speech recognition candidate words and each of the words selected so far as speech recognition results, the value of the appearance position in the learning text of the identical word, and setting it as the word position value of the corresponding speech recognition candidate word or selected word; and
continuous word length counting means for examining, for each of the plurality of speech recognition candidate words, the continuity between the word position value of that candidate word and the word position values of the words selected so far as speech recognition results, and counting the length of the continuous word string,
wherein the length counted by the continuous word length counting means is used as the selection criterion, and, according to the selection criterion, the speech recognition candidate word for which the counted word string length is longest is selected as the speech recognition result.

The speech recognition apparatus according to claim 1, wherein speech recognition is performed using a language model, the apparatus further comprising adding means for adding, for each of the plurality of recognition candidates, a score indicating the appearance frequency of a string of n consecutive words obtained when the language model is used and a score whose value increases as the length of the word string counted by the counting means increases, wherein the recognition candidate having the largest added score is selected as the speech recognition result for the word.

The speech recognition apparatus according to claim 1, further comprising: input means for inputting a word to be registered in the storage means and its appearance position; and registration means for registering the input word and word position in the storage means.

The speech recognition apparatus according to claim 3, wherein the input means receives a learning text in which sentences are written, divides the received sentences into words, detects the appearance positions of the words from the learning text divided into words, and inputs them.

The speech recognition apparatus according to claim 1, wherein a plurality of appearance positions are permitted for the same word among the appearance positions stored in the storage means.