JP3563018B2

JP3563018B2 - Speech recognition device, speech recognition method, and program recording medium

Info

Publication number: JP3563018B2
Application number: JP2000220576A
Authority: JP
Inventors: 和正本田; 彰鶴田; 浩幸勘座
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2000-07-21
Filing date: 2000-07-21
Publication date: 2004-09-08
Anticipated expiration: 2020-07-21
Also published as: JP2002041078A

Description

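[0001]
[Technical Field of the Invention]
The present invention relates to a speech recognition device and a speech recognition method that are installed in a computer or a portable information terminal and recognize speech uttered by a person, and to a program recording medium on which a speech recognition processing program is recorded.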
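[0002]
[Prior Art]
One recognition method used in speech recognition devices switches the recognition target vocabulary as needed in order to raise recognition accuracy. A conceivable application of a speech recognition device using this method is in equipment that has a display device, such as a personal computer or a Japanese word processor, where an operation guide for the equipment, presented as menus on the display device, is driven by speech recognition.
[0003]
With such an operation guide, the user can learn how to operate the equipment while checking on the screen the operating method and the effect of each operation. When the display device can convey only a small amount of information, for example because its screen is small, the display of operation guides for several equipment operations may be switched automatically as time passes. Using speech for such an operation guide makes it easy for the user to understand and reduces the number of operation buttons, which simplifies operation. If the recognition target vocabulary is switched together with the operation guide display, the recognition target vocabulary can be kept small, so high recognition accuracy can be obtained.
[0004]
In applying a recognition method that switches the recognition target vocabulary in this way, sets of recognition target vocabulary related to each menu to be switched and displayed are stored, one set per menu. The recognition target vocabulary is then switched in synchronization with the switching of the menu display caused by a user operation, the passage of time, or the like. In each menu, recognition is therefore performed on only the minimum necessary vocabulary, which improves recognition accuracy. When the menu display is switched automatically as time passes, the equipment also switches the recognition target vocabulary automatically.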
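[0005]
A speech recognition device capable of switching the recognition target vocabulary is described below. FIG. 4 is a block diagram showing an example of such a device. Here, the speech recognition device 1 itself automatically switches the recognition target vocabulary and the content displayed by the output unit 13 at predetermined time intervals. The speech recognition device 1 comprises an A/D (analog/digital) conversion unit 2, an acoustic analysis unit 3, a recognition unit 4, an acoustic model storage unit 5, a recognition target vocabulary storage/determination unit 6, a current recognition target vocabulary identifier storage unit 7, a timer unit 8, a recognition target vocabulary switching request unit 9, a recognition target vocabulary switching request time storage unit 10, a speech detection unit 11, a speech time storage unit 12, and an output unit 13.
[0006]
Speech input to the speech recognition device 1 by a speaker is sent to the A/D conversion unit 2 and digitized. The digitized speech waveform is analyzed in the acoustic analysis unit 3 by short-time spectral analysis, in which a relatively short time window of 20 msec to 40 msec is applied and the window is shifted every 8 msec to 16 msec. The speech waveform cut out by the time window is converted into a time series of feature vectors in units called frames, each having the time length used at cutout. Each feature vector is an extraction of the features of the speech spectrum at that time. It usually has 10 to 100 dimensions, and LPC (linear predictive analysis) mel-cepstrum coefficients and the like are widely used. The feature vectors obtained in this way are sent to the recognition unit 4 and are also output to the speech detection unit 11, which detects the start of speech input. The speech time storage unit 12 then detects and stores the speech input start time on the basis of the speech input start signal from the speech detection unit 11 and the time signal from the timer unit 8.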
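The windowing described in paragraph [0006] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `frame_signal`, the 25 msec window, the 10 msec shift, and the 8 kHz sampling rate are all assumptions chosen from within the ranges the text gives, and the spectral analysis that turns each frame into a feature vector is not shown.

```python
def frame_signal(samples, sample_rate_hz, window_ms=25.0, shift_ms=10.0):
    """Split a sampled waveform into overlapping frames (lists of samples).

    window_ms and shift_ms are illustrative defaults picked from the
    20-40 msec and 8-16 msec ranges mentioned in the text.
    """
    window_len = int(round(sample_rate_hz * window_ms / 1000.0))
    shift_len = int(round(sample_rate_hz * shift_ms / 1000.0))
    if window_len <= 0 or shift_len <= 0:
        raise ValueError("window and shift must each cover at least one sample")
    frames = []
    start = 0
    # Keep only complete windows; a trailing partial window is dropped.
    while start + window_len <= len(samples):
        frames.append(samples[start:start + window_len])
        start += shift_len
    return frames


# One second of a dummy signal sampled at an assumed 8 kHz.
signal = [0.0] * 8000
frames = frame_signal(signal, 8000)
```

Because the shift is shorter than the window, consecutive frames overlap, which is what lets the analysis follow a slowly changing spectrum.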
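[0007]
The acoustic model storage unit 5 holds an HMM (hidden Markov model) prepared for each recognition unit. Phonemes and words are widely used as recognition units. An HMM is a nondeterministic probabilistic finite automaton having a plurality of states, and is a statistical signal source model that represents a nonstationary signal source as a concatenation of stationary signal sources. Parameters such as output probabilities and transition probabilities are learned in advance from corresponding training speech by, for example, the Baum-Welch algorithm. In the following, the acoustic model storage unit 5 is assumed to store HMMs whose recognition unit is the phoneme.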
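[0008]
The switching of the recognition target vocabulary applies the method disclosed in Japanese Patent Laid-Open Publication No. Hei 6-337695. Suppose that the recognition target vocabulary consists of recognition target vocabulary set A and recognition target vocabulary set B, and that at present the identifier of set A is stored in the current recognition target vocabulary identifier storage unit 7. Suppose also that the output unit 13 is displaying the content corresponding to set A.
[0009]
In this state, when a predetermined time elapses, the timer unit 8 notifies the recognition target vocabulary switching request unit 9 and the output unit 13. The output unit 13 then changes the display content to that corresponding to recognition target vocabulary set B. The switching request unit 9 requests switching, and the request time is stored in the switching request time storage unit 10. The recognition target vocabulary storage/determination unit 6 compares the request time Tc stored in the switching request time storage unit 10 with the speech input start time Ts stored in the speech time storage unit 12. If Ts is later than Tc, the utterance was made after the switch was requested, so the appropriate set is determined to be set B. Otherwise, it is determined to be set A. The content of the current recognition target vocabulary identifier storage unit 7 is then updated with the identifier of the set so determined.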
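The prior-art decision rule of paragraph [0009] reduces to a single comparison of Ts against Tc. The sketch below is only an illustration; the function name `select_vocabulary_set` and the string identifiers "A" and "B" are assumptions, not names from the cited publication.

```python
def select_vocabulary_set(speech_start_ts, switch_request_tc,
                          before_id="A", after_id="B"):
    """Return the identifier of the set the prior-art rule would pick."""
    # Utterance began after the switch request: use the post-switch set.
    if speech_start_ts > switch_request_tc:
        return after_id
    # Otherwise (utterance began at or before the request): keep the old set.
    return before_id


current_set = select_vocabulary_set(speech_start_ts=12.0, switch_request_tc=10.0)
```

The rule is all-or-nothing: once Ts exceeds Tc, words of the pre-switch set are no longer candidates at all.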
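[0010]
When determination of the appropriate recognition target vocabulary set has finished, the recognition unit 4 performs speech recognition as follows, using the feature vectors obtained by the acoustic analysis unit 3, the phoneme sequence of each word in whichever recognition target vocabulary set is output from the recognition target vocabulary storage/determination unit 6 in accordance with the identifier stored in the current recognition target vocabulary identifier storage unit 7, and the HMMs stored in the acoustic model storage unit 5.
[0011]
First, an HMM is obtained for each word in the recognition target vocabulary. Specifically, the phoneme HMMs stored in the acoustic model storage unit 5 are concatenated in accordance with the phoneme sequence of each word in the recognition target vocabulary set.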
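As a rough picture of the concatenation in paragraph [0011], the sketch below reduces each phoneme model to an ordered list of state labels and builds a word model by joining those lists. Everything here is hypothetical: the phoneme inventory, the three-state layout, the example words, and the name `build_word_model`. A real word HMM would also carry transition and output probabilities across the joins.

```python
# Hypothetical phoneme inventory: each phoneme model is reduced to state labels.
PHONEME_STATES = {
    "a": ["a1", "a2", "a3"],
    "k": ["k1", "k2", "k3"],
    "i": ["i1", "i2", "i3"],
}


def build_word_model(phoneme_sequence, phoneme_states=PHONEME_STATES):
    """Concatenate phoneme state lists in the order given by the word."""
    word_states = []
    for phoneme in phoneme_sequence:
        if phoneme not in phoneme_states:
            raise KeyError(f"no phoneme model for {phoneme!r}")
        word_states.extend(phoneme_states[phoneme])
    return word_states


# Hypothetical vocabulary set: word -> phoneme sequence.
vocabulary_set = {"aki": ["a", "k", "i"], "kai": ["k", "a", "i"]}
word_models = {word: build_word_model(seq) for word, seq in vocabulary_set.items()}
```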
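[0012]
Next, for each word HMM, the occurrence probability is obtained using the feature vectors from the acoustic analysis unit 3. In HMM-based speech recognition, speech is represented as the time series of symbols output from an HMM during the state transitions from an initial state to a final state. Therefore, by setting the probability of the initial state to some value and successively multiplying in the output probability and the transition probability at each state transition, the probability that the utterance is generated from that model M (the word HMM) can be obtained. Conversely, when an utterance is observed and is assumed to have been generated from a certain model M, the probability of its generation from M can be computed.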
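[0013]
The recognition algorithm in the recognition unit 4 is now described in detail. The recognition unit 4 takes as input the time series of feature vectors obtained by the acoustic analysis unit 3, obtains the occurrence probability for the HMMs of all words in the recognition target vocabulary supplied by the recognition target vocabulary storage/determination unit 6, and takes the word whose HMM yields the highest occurrence probability as the recognition result. That is, with t (= 1, 2, ..., I) as the frame number, let the input sequence expressed as a time series of feature vectors be
X = x_vec_1, x_vec_2, x_vec_3, ..., x_vec_t, ..., x_vec_I
Each x_vec_t is a multidimensional vector; below, the vector x is written "x_vec". Let S be the set of initial states of model M and F the set of final states. With i and j as state numbers, the j-th state transition sequence is written
Q = q_0^j, q_1^j, q_2^j, ..., q_t^j, ..., q_I^j
where q_t^j denotes the state reached by the transition caused by the input symbol x_vec_t of the t-th frame, with q_0^j ∈ S and q_I^j ∈ F. Further, let π_i be the initial probability of an initial state q_i (with Σ_{q_i ∈ S} π_i = 1), let a_ij be the transition probability from state q_i to state q_j, and let b_ij(x_vec_t) be the output probability that x_vec_t is output on that transition. The occurrence probability (likelihood) P(X|M) of the input sequence is then expressed by
[equation not reproduced in the source text]
This calculation of the occurrence probability (likelihood) P(X|M) is carried out for the HMMs corresponding to all words in the recognition target vocabulary, and the word corresponding to the HMM yielding the highest likelihood P is output to the output unit 13 as the recognition result and displayed.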
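The expression for P(X|M) does not survive in this text. Under the definitions given in paragraph [0013] (initial probabilities π, transition probabilities a, output probabilities b attached to transitions, and transition sequences indexed by j), the standard total-likelihood form would read as below. This is a reconstruction from those definitions and should be treated as an assumption; in particular, the source does not say whether the sum over sequences or the maximum over sequences (the Viterbi approximation) was used.

```latex
P(X \mid M) = \sum_{j} \pi_{q_0^j} \prod_{t=1}^{I}
  a_{q_{t-1}^j\, q_t^j}\; b_{q_{t-1}^j\, q_t^j}(\vec{x}_t),
\qquad q_0^j \in S,\quad q_I^j \in F
```

Each term of the sum follows one admissible state transition sequence: it starts from the initial probability and multiplies in one transition probability and one output probability per frame, which matches the verbal description in paragraph [0012].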
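[0014]
[Problems to Be Solved by the Invention]
However, a speech recognition device that applies the recognition target vocabulary switching operation disclosed in Japanese Patent Laid-Open Publication No. Hei 6-337695 has the following problem. As described above, in that switching operation the recognition target vocabulary set is switched when the speech input start time Ts is later than the switching request time Tc. This method is effective when the switching request is issued by an operation of the speaker, because the utterance is then always made after the switch has been requested.
[0015]
However, in a speech recognition device in which the recognition target vocabulary switches automatically as time passes, like the one shown in FIG. 4, switching takes place with no relation whatever to the speaker's awareness. Consequently, if for some reason the speaker misses the chance to utter a word of the recognition target vocabulary and the vocabulary is then switched automatically, the device must somehow be returned to the pre-switch recognition target vocabulary that the speaker wanted to use. In that case, either the speaker is burdened with some operation, or the speaker is made to wait until the pre-switch vocabulary is automatically set again.
[0016]
An object of the present invention is therefore to provide an easy-to-use speech recognition device and speech recognition method that obtain high recognition accuracy even when the recognition target vocabulary is switched automatically, and a program recording medium on which a speech recognition processing program is recorded.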
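[0017]
[Means for Solving the Problems]
To achieve the above object, the first invention is a speech recognition device having a recognition unit that recognizes input speech, an output unit that outputs information including the recognition result of the recognition unit, a recognition vocabulary storage unit in which the recognition target vocabulary used in recognition is stored, a timer unit, and a recognition target vocabulary switching request unit that requests switching of the recognition target vocabulary on the basis of a time signal from the timer unit, characterized in that: the output unit switches among and outputs a plurality of output contents; the recognition target vocabulary is classified into a plurality of recognition target vocabulary sets, each a collection of recognition target words corresponding to an output content of the output unit, and switching of the recognition target vocabulary is performed in units of these sets; a weight determination unit is provided that determines a weight for each recognition target vocabulary set on the basis of the time signal from the timer unit; and the recognition unit recognizes the input speech using all of the recognition target vocabulary sets and the determined weights.
[0018]
With this configuration, the recognition unit recognizes input speech using all of the recognition target vocabulary sets and the weight for each set determined by the weight determination unit on the basis of the time signal from the timer unit. When the switching request unit requests switching of the recognition target vocabulary on the basis of the time signal from the timer unit, the set currently in use is switched to the set that corresponds to the switched output content of the output unit. Therefore, if the weight for the pre-switch set is lowered, recognition accuracy for the post-switch vocabulary, which corresponds to the output content of the output unit, is raised.
[0019]
Moreover, even if the speaker, unaware that the recognition target vocabulary set has been switched, utters a word of the pre-switch vocabulary, recognition is still performed using the words of the pre-switch set as well, so high recognition accuracy is obtained for those words too.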
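[0020]
In the speech recognition device of the first invention, the weight determination unit is preferably configured to lower the weight for the pre-switch recognition target vocabulary set and raise the weight for the post-switch set in accordance with the time elapsed from the switching request by the switching request unit until the weight is determined.
[0021]
With this configuration, as the time elapsed since the switching request grows longer, recognition accuracy for the pre-switch vocabulary falls while that for the post-switch vocabulary rises. The recognition target vocabulary used for recognition is thus switched gradually.
[0022]
In the speech recognition device of the first invention, the recognition unit is preferably configured to calculate the likelihood of each word in all of the recognition target vocabulary sets, multiply each word's likelihood by the weight for the set to which the word belongs, and take the word with the highest resulting value as the recognition result.
[0023]
With this configuration, by optimally setting the weight for the set in use for recognition and the weight for the set not in use, it is easy both to raise recognition accuracy for the post-switch vocabulary corresponding to the output content of the output unit and to obtain high recognition accuracy even when the speaker utters a word of the pre-switch vocabulary.
[0024]
In the speech recognition device of the first invention, the output unit is preferably configured to switch the output content when the difference between the weight for the set corresponding to the output content being output at the time the switching request was made and the weight for the set corresponding to the output content to be output next becomes less than a predetermined value.
[0025]
With this configuration, the output content of the output unit is switched to the corresponding output content in step with the switching of the recognition target vocabulary set.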
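[0026]
The speech recognition method of the second invention is a method that recognizes input speech using a recognition target vocabulary and outputs the recognition result, automatically switching the recognition target vocabulary on the basis of a time signal from a timer unit, characterized by: switching among and outputting a plurality of output contents on an output unit; switching the recognition target vocabulary in units of a plurality of recognition target vocabulary sets, each a collection of recognition target words corresponding to one of the output contents; determining a weight for each set on the basis of the time signal from the timer unit; and recognizing the input speech using all of the sets and the determined weights.
[0027]
With this configuration, input speech is recognized using all of the recognition target vocabulary sets and the weight for each set determined on the basis of the time signal from the timer unit. When switching of the recognition target vocabulary is requested on the basis of the time signal from the timer unit, the set currently in use is switched to the set that corresponds to the switched output content of the output unit. Therefore, if the weight for the pre-switch set is lowered, recognition accuracy for the post-switch vocabulary, which corresponds to the output content of the output unit, is raised.
[0028]
Moreover, even if the speaker, unaware that the recognition target vocabulary set has been switched, utters a word of the pre-switch vocabulary, recognition is still performed using the words of the pre-switch set as well, so high recognition accuracy is obtained for those words too.
[0029]
The program recording medium of the third invention is characterized in that it has recorded on it a speech recognition processing program that causes a computer to function as the recognition unit, output unit, timer unit, recognition target vocabulary switching request unit, and weight determination unit of claim 1.
[0030]
With this configuration, as in claim 1, if the weight for the pre-switch set is lowered, recognition accuracy for the post-switch vocabulary corresponding to the output content of the output unit is raised. Moreover, high recognition accuracy is obtained even if the speaker, unaware that the set has been switched, utters a word of the pre-switch vocabulary.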
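[0031]
[Embodiments of the Invention]
The invention is described in detail below with reference to the illustrated embodiment. FIG. 1 is a block diagram of the speech recognition device of this embodiment. The speech recognition device 21 comprises a speech input unit 22, an A/D conversion unit 23, an acoustic analysis unit 24, a recognition unit 25, an acoustic model storage unit 26, a first recognition target vocabulary set storage unit 27, a second recognition target vocabulary set storage unit 28, a weight coefficient determination unit 29, a timer unit 30, a recognition target vocabulary switching request unit 31, and an output unit 32.
[0032]
The speech input unit 22 has a speech input device including a microphone, converts input speech into an electrical signal (speech signal), and outputs it to the A/D conversion unit 23. The A/D conversion unit 23 converts the input analog speech signal into a digital signal and outputs the digitized speech signal to the acoustic analysis unit 24. The digitized speech signal is expressed as a time series of amplitude values.
[0033]
The acoustic analysis unit 24 extracts a feature vector for each frame from the digital speech signal supplied by the A/D conversion unit 23 and outputs it to the recognition unit 25. The feature vector x_vec is a 34-dimensional vector with 34 elements in total: the power of the speech signal in the frame, the 1st- to 16th-order LPC cepstrum coefficients, the power of the preceding frame, and the 1st- to 16th-order LPC cepstrum coefficients of the preceding frame. These vectors are arranged over all frames (t = 1, 2, ..., I).
[0034]
The recognition unit 25 uses the acoustic models and the feature vectors extracted by the acoustic analysis unit 24 to calculate, by the method described under Prior Art, the occurrence probability (likelihood) P of each word in recognition target vocabulary set A stored in the first recognition target vocabulary set storage unit 27 and in recognition target vocabulary set B stored in the second recognition target vocabulary set storage unit 28. It then multiplies each word's likelihood P by the weight w determined by the weight coefficient determination unit 29, and outputs to the output unit 32 the word corresponding to the HMM yielding the highest weighted likelihood w·P.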
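The selection rule of paragraph [0034] (score every word of every set by w·P and keep the best) can be sketched as follows. The function name `recognize`, the example words, and the numeric likelihoods and weights are all made up for illustration; the likelihoods would in practice come from the HMM calculation described earlier.

```python
def recognize(likelihoods, set_of_word, set_weights):
    """Return (word, score) maximizing weight * likelihood over all sets.

    likelihoods : word -> likelihood P
    set_of_word : word -> identifier of the vocabulary set it belongs to
    set_weights : set identifier -> current weight w
    """
    best_word, best_score = None, float("-inf")
    for word, p in likelihoods.items():
        score = set_weights[set_of_word[word]] * p
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Hypothetical words and likelihoods.
set_of_word = {"open": "A", "close": "A", "print": "B", "save": "B"}
likelihoods = {"open": 0.30, "close": 0.05, "print": 0.20, "save": 0.10}

# Set A displayed: its words keep full weight and "open" wins.
word_a, _ = recognize(likelihoods, set_of_word, {"A": 1.0, "B": 0.1})
# Set B displayed: the same acoustics now resolve to "print".
word_b, _ = recognize(likelihoods, set_of_word, {"A": 0.1, "B": 1.0})
```

Because no set is ever removed from the search, a word from the down-weighted set can still win when its likelihood is high enough to overcome the weight.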
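[0035]
The acoustic model storage unit 26 stores the acoustic models used when the recognition unit 25 performs speech recognition. The acoustic models are phoneme-unit HMMs trained in advance (initial training) on training speech from unspecified speakers by the Baum-Welch algorithm. Each HMM is stored as an array, with one entry per state, whose elements are the transition probabilities and the output probability distribution of that state. The transition probabilities are stored as an array, with one entry per transition, whose elements are the transition probabilities to each state. The output probability is expressed as a multidimensional Gaussian mixture, that is, a weighted sum of several normal distributions, and is stored as an array, sized by the number of dimensions, whose elements are the mixture weight, mean vector, and variance vector of each normal distribution. The mean vector and the variance vector are each expressed as an array of 34 elements, the same number as the elements of the feature vector extracted per frame by the acoustic analysis unit 24.
[0036]
The timer unit 30 outputs a time signal indicating the time to the recognition target vocabulary switching request unit 31, the weight coefficient determination unit 29, and the output unit 32, thereby notifying them of the time. The switching request unit 31 decides, on the basis of the notified time, whether to request switching of the recognition target vocabulary. If it decides to do so, it issues the switching request to the weight coefficient determination unit 29.
[0037]
Of recognition target vocabulary set A stored in the first recognition target vocabulary set storage unit 27 and set B stored in the second recognition target vocabulary set storage unit 28, the weight coefficient determination unit 29 determines the weight w_2 applied to the words of the set corresponding to the content currently displayed by the output unit 32, and the weight w_1 applied to the words of the set corresponding to the content not displayed. These weights w_1 and w_2 are determined using stored weight functions W_1(t) and W_2(t) each time a predetermined time ΔT elapses, taking as reference the time t_0 from the timer unit 30 at which the switching request unit 31 requested switching. The determined values of w_1 and w_2 are output in sequence to the recognition unit 25.
[0038]
The first recognition target vocabulary set storage unit 27 and the second recognition target vocabulary set storage unit 28 store the words of sets A and B, respectively, as arrays whose elements are character strings giving the written form and the phoneme sequence of each word, sized by the number of characters.
[0039]
The output unit 32 has an image display device including a display, and stores a first display content corresponding to set A and a second display content corresponding to set B. On the basis of the time notified by the timer unit 30, it decides whether to change the currently displayed one of the first and second display contents, and if so, switches the display content on the screen. It also displays the recognition result from the recognition unit 25 on the screen.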
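[0040]
FIG. 2 shows how the weight function W_2(t), for the recognition target vocabulary set corresponding to the display content currently selected by the output unit 32, and the weight function W_1(t), for the set corresponding to the non-selected display content, change over time. The value of W_1(t) starts to increase monotonically at time t_0, when the switching request is output, from a predetermined value "a" that is near 0 and less than 1, and is "1" from time t_2 onward. The value of W_2(t), conversely, starts to decrease monotonically from "1" at time t_0 and is "a" from time t_2 onward. At time t_1, the difference between the weights w_1 and w_2 equals the threshold h. When this difference falls below h, that is, when a time T (> (t_1 − t_0)) has elapsed since the time t_0 at which switching was requested, the output unit 32 switches the display content on the screen.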
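The weight curves of FIG. 2 can be sketched for the linear case, which is the shape the embodiment uses. This is an illustration under stated assumptions: the functions take the time elapsed since t_0 rather than absolute time, `ramp` stands for t_2 − t_0, and the names `make_linear_weights` and `elapsed_at_threshold` as well as the sample values of a, h, and the ramp length are hypothetical.

```python
def make_linear_weights(a, ramp):
    """Return (W1, W2) as functions of the time elapsed since t_0.

    a    : floor value near 0 (0 < a < 1)
    ramp : t_2 - t_0, the duration of the hand-over
    """
    def progress(elapsed):
        # Fraction of the hand-over completed, clamped to [0, 1].
        return min(max(elapsed / ramp, 0.0), 1.0)

    def w1(elapsed):  # set being switched in: rises from a to 1
        return a + (1.0 - a) * progress(elapsed)

    def w2(elapsed):  # set being switched out: falls from 1 to a
        return 1.0 - (1.0 - a) * progress(elapsed)

    return w1, w2


def elapsed_at_threshold(a, ramp, h):
    """Elapsed time t_1 - t_0 at which W2 - W1 equals h (linear case only).

    From W2 - W1 = (1 - a) * (1 - 2 * progress) = h.
    """
    return 0.5 * ramp * (1.0 - h / (1.0 - a))


w1, w2 = make_linear_weights(a=0.1, ramp=10.0)
t1 = elapsed_at_threshold(a=0.1, ramp=10.0, h=0.3)
```

In the linear case W_1 + W_2 stays equal to 1 + a throughout, so at t_1 the two weights are (1 + a − h)/2 and (1 + a + h)/2.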
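[0041]
In other words, the point at which the output unit 32 decides, on the basis of the time notified by the timer unit 30, to change the display content is set to lag by the time T behind the point at which the switching request unit 31 decides, on the basis of the time notified by the timer unit 30, to request switching.
[0042]
Thus, in this embodiment the output unit 32 switches the screen display content automatically, but whether before or after the display content switches, the recognition unit 25 calculates the likelihood P for the vocabulary of both set A and set B. The likelihood P of each word in the set corresponding to the display content currently selected by the output unit 32 is multiplied by a weight w satisfying 1 > w > (1 + a + h)/2 before the display content switches and 1 > w > (1 + a − h)/2 after it switches. The likelihood P of each word in the set corresponding to the non-selected display content is multiplied by a weight w satisfying (1 + a − h)/2 > w > a before the display content switches and (1 + a + h)/2 > w > a after it switches. The final likelihood w·P is calculated in this way and the recognition result is determined.
[0043]
Put differently, whereas the conventional speech recognition device of FIG. 4 switches the recognition target vocabulary by switching the vocabulary itself that is used in calculating the likelihood P, this embodiment leaves the two sets of recognition target vocabulary used in calculating P unswitched and instead gradually changes the value of the weight w applied to P between "1" and the predetermined value "a" near 0.
[0044]
Accordingly, in this embodiment, even after the speaker has for some reason missed the chance to utter a word of the recognition target vocabulary and the vocabulary has been switched automatically, the likelihood w·P is still calculated for the words of the pre-switch vocabulary, so the utterance can be recognized correctly even if the speaker utters a pre-switch word. At the same time, the function of raising recognition accuracy for the vocabulary corresponding to the display content of the output unit 32 is not impaired, just as when the vocabulary itself is switched as in the device of FIG. 4.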
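[0045]
FIG. 3 is a flowchart of the weight determination processing executed by the weight coefficient determination unit 29. The weight determination operation is described below with reference to FIG. 3. Here, W_2(t) is the weight function for the set corresponding to the display content currently selected by the output unit 32, and W_1(t) is the weight function for the set corresponding to the non-selected display content. The weight determination processing starts when the switching request unit 31 requests switching.
[0046]
In step S1, the switching request time t_0 is acquired on the basis of the time signal from the timer unit 30. In step S2, the count j of weight calculations is initialized to "0". In step S3, the count j is incremented. In step S4, it is determined whether the predetermined time ΔT has elapsed since t_0 was acquired or since the weight value w was last calculated. If it has elapsed, processing proceeds to step S5. In step S5, it is determined whether the current time (t_0 + j·ΔT) exceeds time t_2. If it does, the weight determination processing ends; if not, processing proceeds to step S6.
[0047]
In step S6, the function number i of the weight function W_i(t) is initialized to "1". In step S7, "j·ΔT" is substituted for the elapsed time t since the switching request time t_0 in the weight function W_i(t), and the weight value w_i is calculated. In step S8, the function number i is incremented. In step S9, it is determined whether i is greater than "2". If i is "2" or less, processing returns to step S7 and moves on to calculating the weight value w_2. If i is greater than "2", the weights at the current time for both sets A and B are judged to have been calculated, and processing proceeds to step S10. In step S10, the array of weight values w_1, w_2 calculated for the current time is output to the recognition unit 25. Processing then returns to step S3 and moves on to calculating w_1 and w_2 for the next time.
[0048]
Steps S3 to S10 are repeated thereafter, and when it is determined in step S5 that the current time exceeds t_2, the weight determination processing ends. After that, "1" is output every ΔT as the weight value w_2 for the set corresponding to the display content, and the predetermined value "a" is output every ΔT as the weight value w_1 for the set corresponding to the non-selected display content. When the switching request unit 31 next outputs a switching request, the weight determination processing starts again.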
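The loop of steps S2 to S10 can be sketched as a generator that produces one weight array per tick. This is a simulation under assumptions rather than the device's code: it does not actually wait ΔT at step S4, it works in elapsed time (so `ramp` stands for t_2 − t_0), and the name `weight_schedule` and the sample linear functions and values are hypothetical.

```python
def weight_schedule(weight_functions, delta_t, ramp):
    """Yield (elapsed, [w_1, w_2, ...]) for each tick, following S2 to S10.

    weight_functions : list of W_i, each taking the elapsed time since t_0
    delta_t          : the interval between weight updates
    ramp             : t_2 - t_0
    """
    j = 0                                  # S2: reset the calculation count
    while True:
        j += 1                             # S3
        elapsed = j * delta_t              # S4: a real device would wait here
        if elapsed > ramp:                 # S5: past t_2, so stop
            return
        weights = []
        for w_i in weight_functions:       # S6 to S9: loop over function number i
            weights.append(w_i(elapsed))   # S7
        yield elapsed, weights             # S10: hand the array to the recognizer


# Hypothetical linear weight functions with a = 0.2 and t_2 - t_0 = 4.
a, ramp = 0.2, 4.0
rising = lambda t: a + (1 - a) * t / ramp    # W_1: a -> 1
falling = lambda t: 1 - (1 - a) * t / ramp   # W_2: 1 -> a
schedule = list(weight_schedule([rising, falling], delta_t=1.0, ramp=ramp))
```

After the generator finishes, the text has the device keep emitting the constant pair ("a" for the non-displayed set, "1" for the displayed set) until the next switching request.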
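[0049]
As described above, the recognition unit 25 in this embodiment uses the acoustic models stored in the acoustic model storage unit 26 to calculate the likelihood P of the words in set A stored in the first recognition target vocabulary set storage unit 27 and in set B stored in the second recognition target vocabulary set storage unit 28. The switching of the recognition target vocabulary set that accompanies switching of the display content of the output unit 32 is performed not by switching the set itself but by switching the values of the weights w_2 and w_1, applied to the likelihood P of the words in the selected and non-selected sets, between "1" and the predetermined value "a" near 0. Moreover, the values of w_2 and w_1 are not switched in a single step but are changed gradually from "1" to "a", or from "a" to "1", in proportion to the elapsed time "j·ΔT" since the time t_0 at which the switching request unit 31 issued the switching request.
[0050]
Therefore, according to this embodiment, even if the speaker has for some reason missed the chance to utter a word of the recognition target vocabulary and the vocabulary has been switched automatically, the likelihood w·P is also calculated for the words of the pre-switch set, so correct recognition is possible even if the speaker utters a pre-switch word. At the same time, the function of raising recognition accuracy for the vocabulary corresponding to the display content of the output unit 32 is not impaired, just as when the vocabulary itself is switched as in the device of FIG. 4.
[0051]
In the above embodiment, the weight function W_2(t) for the selected set and the weight function W_1(t) for the set corresponding to the non-selected display content are changed linearly from "1" and "a" to "a" and "1", respectively, in proportion to the elapsed time "j·ΔT" since the switching request time t_0. However, in this invention the shape of W_1(t) and W_2(t) is not limited to a straight line. They may be curves that raise the value of W_2(t) and lower the value of W_1(t) up to the display switching time t_1, and lower the value of W_2(t) and raise the value of W_1(t) after t_1.
[0052]
Also, in the above embodiment, the weight coefficient determination unit 29 is configured to determine the weight values w_1 and w_2 each time ΔT elapses, with the switching request time t_0 as reference, and to output them to the recognition unit 25, and the recognition unit 25 is configured to use the input weight values as needed in recognition processing. However, the invention is not limited to this. The recognition unit 25 may instead be configured to issue a weight determination request to the weight coefficient determination unit 29 when performing recognition, and the weight coefficient determination unit 29 may be configured, on receiving the request, to calculate the weights by substituting the time elapsed since the switching request time t_0 into the weight functions W_i(t).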
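The on-demand variant of paragraph [0052] can be sketched as a small object that remembers t_0 and evaluates the weight functions only when asked. The class name `OnDemandWeights`, its method names, and the two clamped sample functions are hypothetical and serve only to illustrate the idea.

```python
class OnDemandWeights:
    """Computes weights only when the recognizer asks for them."""

    def __init__(self, weight_functions):
        # Each function maps the time elapsed since t_0 to a weight.
        self._weight_functions = weight_functions
        self._request_time = None

    def on_switch_request(self, now):
        # Remember t_0, the time of the latest switching request.
        self._request_time = now

    def weights_at(self, now):
        if self._request_time is None:
            raise RuntimeError("no switching request has been recorded yet")
        elapsed = now - self._request_time
        return [w(elapsed) for w in self._weight_functions]


# Hypothetical clamped linear weight functions (a = 0.2, hand-over of 4 time units).
def rising(elapsed):
    return min(1.0, 0.2 + 0.2 * elapsed)


def falling(elapsed):
    return max(0.2, 1.0 - 0.2 * elapsed)


provider = OnDemandWeights([rising, falling])
provider.on_switch_request(100.0)
midway = provider.weights_at(102.0)
settled = provider.weights_at(110.0)
```

Compared with the periodic scheme, this evaluates the functions at the exact moment of recognition instead of at the most recent multiple of ΔT.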
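[0053]
The functions of the recognition unit, output unit, timer unit, recognition target vocabulary switching request unit, and weight determination unit in the above embodiments are realized by a speech recognition processing program recorded on a program recording medium. The program recording medium in the above embodiment is a program medium consisting of ROM (read-only memory). Alternatively, it may be a program medium that is loaded into and read by an external auxiliary storage device. In either case, the program reading means that reads the speech recognition processing program from the program medium may be configured to access the program medium directly and read it, or to download the program to a program storage area (not shown) provided in RAM (random access memory) and read it by accessing that area. The download program for downloading from the program medium to the program storage area in RAM is assumed to be stored in the main device in advance.
[0054]
Here, the program medium is a medium that is configured to be separable from the main body and that carries the program in a fixed manner. It includes tape media such as magnetic tape and cassette tape; disk media such as magnetic disks (floppy disks, hard disks) and optical disks (CD (compact disc)-ROM, MO (magneto-optical) disks, MD (MiniDisc), DVD (digital video disc)); card media such as IC (integrated circuit) cards and optical cards; and semiconductor memory media such as mask ROM, EPROM (ultraviolet-erasable ROM), EEPROM (electrically erasable ROM), and flash ROM.
[0055]
If the speech recognition device in the above embodiments is equipped with a modem and can connect to a communication network including the Internet, the program medium may be a medium that carries the program in a fluid manner, for example by download from the communication network. In that case, the download program for downloading from the communication network is assumed to be stored in the main device in advance, or to be installed from another recording medium.
[0056]
What is recorded on the recording medium is not limited to programs; data can also be recorded.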
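[0057]
[Effects of the Invention]
As is clear from the above, the speech recognition device of the first invention stores in the recognition vocabulary storage unit a plurality of recognition target vocabulary sets corresponding to the output contents of the output unit, determines a weight for each set by the weight determination unit on the basis of the time signal from the timer unit, and recognizes input speech by the recognition unit using all of the sets and the determined weights. Therefore, when the set is switched to the one corresponding to the switched output content of the output unit in response to a switching request from the recognition target vocabulary switching request unit, recognition accuracy for the post-switch vocabulary can be raised by lowering the weight for the pre-switch set.
[0058]
Moreover, even if the speaker, unaware that the recognition target vocabulary set has been switched, utters a word of the pre-switch vocabulary, recognition is performed using the words of the pre-switch set as well, so high recognition accuracy can be obtained for those words too.
[0059]
That is, according to this invention, high recognition accuracy can be obtained even when the recognition target vocabulary is switched automatically. Moreover, the speaker is not burdened with any operation or waiting time, so an easy-to-use speech recognition device can be realized.
[0060]
In the speech recognition device of the first invention, if the weight determination unit is configured to lower the weight for the pre-switch set and raise the weight for the post-switch set in accordance with the time elapsed from the switching request by the switching request unit until the weight is determined, the recognition target vocabulary used for recognition can be switched gradually. High recognition accuracy can therefore also be obtained for the words of the pre-switch set.
[0061]
In the speech recognition device of the first invention, if the recognition unit is configured to calculate the likelihood of each word in all of the recognition target vocabulary sets, multiply each word's likelihood by the weight for the set to which the word belongs, and take the word with the highest resulting value as the recognition result, then by optimally setting the weight for the set in use for recognition and the weight for the set not in use, it is easy both to raise recognition accuracy for the post-switch vocabulary corresponding to the output content of the output unit and to obtain high recognition accuracy even when the speaker utters a word of the pre-switch vocabulary.
[0062]
In the speech recognition device of the first invention, if the output unit is configured to switch the output content when the difference between the weight for the set corresponding to the output content being output at the time the switching request was made and the weight for the set corresponding to the output content to be output next becomes less than a predetermined value, the output content of the output unit can be switched to the corresponding output content in step with the switching of the recognition target vocabulary set.
[0063]
The speech recognition method of the second invention determines weights for a plurality of recognition target vocabulary sets corresponding to the output contents of the output unit on the basis of the time signal from the timer unit, and recognizes input speech using all of the sets and the determined weights. Therefore, when the set is switched, recognition accuracy for the post-switch vocabulary corresponding to the output content of the output unit can be raised by lowering the weight for the pre-switch set.
[0064]
Moreover, even if the speaker, unaware that the recognition target vocabulary set has been switched, utters a word of the pre-switch vocabulary, recognition is performed using the words of the pre-switch set as well, so high recognition accuracy can be obtained for those words too.
[0065]
The program recording medium of the third invention has recorded on it a speech recognition processing program that causes a computer to function as the recognition unit, output unit, timer unit, recognition target vocabulary switching request unit, and weight determination unit of claim 1. Therefore, as in claim 1, recognition accuracy for the post-switch vocabulary corresponding to the output content of the output unit can be raised by lowering the weight for the pre-switch set. Moreover, high recognition accuracy can be obtained even if the speaker, unaware that the set has been switched, utters a word of the pre-switch vocabulary.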
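[Brief Description of the Drawings]
[FIG. 1] A block diagram of the speech recognition device of this invention.
[FIG. 2] A diagram showing the change over time of the weight functions for the selected and non-selected recognition target vocabulary sets.
[FIG. 3] A flowchart of the weight determination processing executed by the weight coefficient determination unit in FIG. 1.
[FIG. 4] A block diagram of a conventional speech recognition device capable of switching the recognition target vocabulary.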
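[Explanation of Reference Signs]
21: speech recognition device,
22: speech input unit,
23: A/D conversion unit,
24: acoustic analysis unit,
25: recognition unit,
26: acoustic model storage unit,
27: first recognition target vocabulary set storage unit,
28: second recognition target vocabulary set storage unit,
29: weight coefficient determination unit,
30: timer unit,
31: recognition target vocabulary switching request unit,
32: output unit.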
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a voice recognition device and a voice recognition method mounted on a computer or a portable information terminal for recognizing voice uttered by a human, and a program recording medium storing a voice recognition processing program.
[0002]
[Prior art]
In a speech recognition device, there is a recognition method in which a recognition target vocabulary is switched as needed in order to increase recognition accuracy. As an application example of the voice recognition device using such a recognition method, in a device having a display device such as a personal computer or a Japanese word processor, an operation guide of the device by menu display using the display device is provided by using voice recognition. It is possible to do.
[0003]
According to the operation guide as described above, the operation can be learned while confirming the operation method and the display of the effect by the operation on the screen. When the amount of information from the display device is small, such as when the screen of the display device is narrow, the display of the operation guide related to the operation of a plurality of devices may be automatically switched over time. If voice is used for such an operation guide, it is easy for the user to understand, and the number of operation buttons can be reduced to simplify the operation. In this case, if the recognition target vocabulary is switched together with the switching of the operation guide display relating to the operation of a plurality of devices, the recognition target vocabulary can be reduced, so that high recognition accuracy can be obtained.
[0004]
In the application of the recognition method for switching the recognition target words, a plurality of sets of the recognition target words related to each menu to be switched and displayed are stored for the number of menus. By switching the vocabulary to be recognized in synchronization with the switching of the menu display due to the operation of the user or the passage of time, the recognition processing can be performed on the minimum vocabulary in each menu. Accuracy can be improved. In this case, when the menu display is automatically switched over time, the device automatically switches the vocabulary to be recognized.
[0005]
Hereinafter, a speech recognition apparatus capable of switching the vocabulary to be recognized will be described. FIG. 4 is a block diagram showing an example of a speech recognition device capable of switching the vocabulary to be recognized. Here, it is assumed that the voice recognition device 1 automatically switches the recognition target vocabulary and the display content by the output unit 13 at predetermined time intervals. The speech recognition apparatus 1 includes an A / D (analog / digital) conversion unit 2, an acoustic analysis unit 3, a recognition unit 4, an acoustic model storage unit 5, a recognition target vocabulary storage / determination unit 6, and a current recognition target vocabulary identifier storage unit 7. , A timer unit 8, a recognition target vocabulary switching request unit 9, a recognition target vocabulary switching request time storage unit 10, a voice detection unit 11, a voice time storage unit 12, and an output unit 13.
[0006]
The voice input to the voice recognition device 1 by the speaker is sent to the A / D converter 2 and digitized. The digitized speech waveform is applied to the acoustic analysis unit 3 by applying a relatively short time window to each section of 20 msec to 40 msec and shifting the time window by 8 msec to 16 msec. It is analyzed by the technique of spectrum analysis. The speech waveform extracted by the time window is converted into a time series of feature vectors in units called frames having a time length at the time of extraction. Here, the feature vector is obtained by extracting a very small amount of the speech spectrum at that time, and usually has 10 to 100 dimensions, and LPC (linear prediction analysis) mel-cepstral coefficients and the like are widely used. The feature vector thus converted is sent to the recognition unit 4 and is also output to the voice detection unit 11 that detects the start of voice input. Then, the voice time storage unit 12 detects and stores the voice input start time based on the voice input start signal from the voice detection unit 11 and the time signal from the timer unit 8.
[0007]
The acoustic model storage unit 5 has an HMM (Hidden Markov Model) prepared for each recognition unit. Here, phonemes and words are widely used as the recognition units. The HMM is a nondeterministic stochastic finite state automaton having a plurality of states, and is a statistical signal source model in which an unsteady signal source is represented by a connection of stationary signal sources. Note that parameters such as an output probability and a transition probability are learned in advance by an algorithm called a Baum-Welch algorithm by giving a corresponding learning voice. Hereinafter, it is assumed that the acoustic model storage unit 5 stores an HMM whose recognition unit is a phoneme.
[0008]
The switching operation of the recognition target vocabulary is performed by a method disclosed in Japanese Patent Application Laid-Open No. Hei 6-337695. The recognition target vocabulary includes a recognition target vocabulary set A and a recognition target vocabulary set B. It is assumed that the recognition target vocabulary set A stores the identifier of the recognition target vocabulary set A at the present time. It is assumed that the output unit 13 displays the display content corresponding to the vocabulary set A to be recognized.
[0009]
In this state, when a predetermined time elapses, the timer unit 8 notifies the recognition target vocabulary switching request unit 9 and the output unit 13. Then, the output unit 13 changes the display content to the display content corresponding to the vocabulary set B to be recognized. Further, switching is requested from the recognition target vocabulary switching request unit 9, and the request time is stored in the recognition target vocabulary switching request time storage unit 10. Then, the recognition target vocabulary storage / determination unit 6 compares the request time Tc stored in the recognition target vocabulary switching request time storage unit 10 with the voice input start time Ts stored in the voice time storage unit 12, and If the voice input start time Ts is later than the request time Tc, since the utterance was made after the request for switching the vocabulary to be recognized was made, the appropriate vocabulary set to be recognized is the vocabulary set B to be recognized. Is determined. Otherwise, it is determined to be the recognition target vocabulary set A. Then, the storage content of the current recognition target vocabulary identifier storage unit 7 is updated with the identifier of the relevant recognition target vocabulary set.
[0010]
When the determination of the appropriate vocabulary set to be recognized is completed, the recognizing unit 4 recognizes the feature vector obtained by the acoustic analysis unit 3 and the identifier stored in the current vocabulary identifier storage unit 7. Speech recognition is performed using the phoneme sequence of each word constituting one of the recognition target vocabulary sets output from the target vocabulary storage / determination unit 6 and the HMM stored in the acoustic model storage unit 5 as follows. I do.
[0011]
That is, first, the HMM of each word included in the recognition target vocabulary is obtained. Specifically, the HMM of each phoneme stored in the acoustic model storage unit 5 is combined in correspondence with the phoneme sequence of each word constituting the vocabulary set to be recognized.
[0012]
Next, for the HMM of each word, the occurrence probability is obtained using the feature vector from the acoustic analysis unit 3. In speech recognition by the HMM, speech is expressed as a time series of symbols output from the HMM during a state transition from an initial state to a final state. Therefore, by setting the probability of the initial state to an arbitrary value and sequentially multiplying the output probability and the transition probability for each state transition, the probability that the utterance is generated from the model M (HMM of the word) can be obtained. . Conversely, assuming that the utterance is generated from a certain model M when the utterance is observed, the probability of occurrence from the model M can be calculated.
[0013]
Hereinafter, the recognition algorithm in the recognition unit 4 will be described in detail. The recognition unit 4 receives the time series of the feature vectors obtained by the acoustic analysis unit 3 as input, and calculates the occurrence probabilities of the HMMs of all the words included in the recognition target vocabulary from the recognition target vocabulary storage / determination unit 6. , The word of the HMM exhibiting the highest occurrence probability is set as the recognition result. That is, with t (= 1, 2,..., I) as the frame number, the input sequence represented by the time series of the feature vector is
X = x_vec1, x_vec2, x_vec3, ..., x_vect, ..., x_vecI
And Note that "x_vec"i" is a multidimensional vector. Hereinafter, the vector x is referred to as “x_vec". Further, a set of initial states of the model M is S, and a set of final states is F. Also, with “i, j” as the state number, the transition sequence of the j-th state is
Q = q_O ^j, Q₁ ^j, Q₂ ^j, ..., q_t ^j, ..., q_I ^j
It expresses. In the above equation, “q_t ^j] Is the input symbol x of the t-th frame._vecThe transition state is represented by t. Where q_O ^j∈S and q_I ^j∈F. Further, the initial probability of the initial state is π_i: Σ_qi _∈ _Sπ_i= 1 and state q_iFrom state q_jTransition probability to a_ijAnd then x_vecThe output probability that i is output is b_ij(X_veci), the occurrence probability (likelihood) P (X | M) of the input sequence is

Is represented by The calculation of the occurrence probability (likelihood) P (X | M) is calculated for the HMMs corresponding to all the words included in the vocabulary to be recognized, and the word corresponding to the HMM exhibiting the highest occurrence probability (likelihood) P is calculated. The recognition result is output to the output unit 13 and displayed.
[0014]
[Problems to be solved by the invention]
However, the speech recognition apparatus to which the recognition target vocabulary switching operation disclosed in the above-mentioned conventional Japanese Patent Laid-Open Publication No. Hei 6-337695 is applied has the following problems. That is, as described above, in the recognition target vocabulary switching operation disclosed in Japanese Patent Laid-Open No. 6-337695, when the speech input start time Ts is later than the recognition target vocabulary switching request time Tc, the recognition target vocabulary is not changed. The set is switched. This method is effective because when a request to switch the recognition target vocabulary is made by the operation of the speaker, the utterance is always performed after the request to switch the recognition target vocabulary is made.
[0015]
However, in the case of a speech recognition device in which the vocabulary to be recognized is automatically switched over time as in the speech recognition device shown in FIG. 4, the switching of the vocabulary to be recognized has nothing to do with the consciousness of the speaker. Done without. Therefore, if for some reason the speaker misses the opportunity to utter the recognition target vocabulary, and the recognition target vocabulary is automatically switched, before the switch that the speaker wanted to utter by any method, It is necessary to return to the setting state of the recognition target vocabulary. In that case, there is a problem that the speaker is burdened with some operation or the speaker is automatically made to wait until the recognition target vocabulary before switching is set.
[0016]
Therefore, an object of the present invention is to provide an easy-to-use speech recognition device and a speech recognition method capable of obtaining high recognition accuracy even when the recognition target vocabulary is automatically switched, and a program recording medium storing a speech recognition processing program. It is in.
[0017]
[Means for Solving the Problems]
In order to achieve the above object, a first aspect of the present invention provides a recognition unit for recognizing an input voice, an output unit for outputting information including a recognition result of the recognition unit, and a recognition target vocabulary used for the recognition. A recognition vocabulary storage unit, a timer unit, and a recognition target vocabulary switching request unit that requests switching of the recognition target vocabulary based on a time signal from the timer unit. The vocabulary to be recognized is classified into a plurality of vocabulary sets to be recognized, each of which is a set of words to be recognized, corresponding to the output of the output unit. The switching is performed in units of the recognition target vocabulary set, and a weight for determining the weight for each recognition target vocabulary set based on the time signal from the timer unit. Comprises a tough, the recognition unit, using each weight being the total recognition target vocabulary set and the determined, is characterized in that is adapted to recognize the input speech.
[0018]
According to the above configuration, the recognition unit recognizes the input speech using the weight for each recognition target vocabulary set determined by the weight determination unit based on the entire recognition target vocabulary set and the time signal from the timer unit. . At this time, if the recognition target vocabulary switching request unit requests the switching of the recognition target vocabulary based on the time signal from the timer unit, the currently used recognition target vocabulary set is switched to the output content of the output unit. Is switched to the vocabulary set to be recognized according to. Therefore, if the weight value for the recognition target vocabulary set before switching is reduced, the recognition accuracy of the recognition target vocabulary after switching corresponding to the output content of the output unit can be improved.
[0019]
Furthermore, even if the speaker does not know that the recognition target vocabulary set has been switched, and utters in the recognition target vocabulary before switching, the recognition is performed using the words of the recognition target vocabulary set before switching. Therefore, high recognition accuracy can be obtained for the words in the vocabulary set to be recognized before the switching.
[0020]
The speech recognition device according to the first aspect of the present invention may further comprise: the weight determining unit determines the weight before the switching according to an elapsed time from when the recognition target vocabulary switching request unit requests the switching of the recognition target vocabulary to when the weight is determined. It is desirable to reduce the weight for the recognition target vocabulary set while increasing the weight for the switched recognition target vocabulary set.
[0021]
According to the above configuration, as the elapsed time since the recognition target vocabulary switching request unit requested the switching increases, the recognition accuracy for the recognition target vocabulary before switching decreases, while the recognition accuracy for the recognition target vocabulary after switching increases. In this way, the recognition target vocabulary used for recognition is switched gradually.
[0022]
In the speech recognition device of the first aspect of the present invention, it is desirable that the recognition unit calculate the likelihood of each word constituting all of the recognition target vocabulary sets, multiply the likelihood of each word by the weight for the recognition target vocabulary set to which the word belongs, and take the word with the highest resulting value as the recognition result.
[0023]
According to the above configuration, by optimally setting the weight for the recognition target vocabulary set used for recognition and the weight for the recognition target vocabulary set not used for recognition, it is easy both to increase the recognition accuracy for the recognition target vocabulary after switching, which corresponds to the output content of the output unit, and to obtain high recognition accuracy even when the speaker utters a word from the recognition target vocabulary before switching.
[0024]
Further, in the speech recognition device of the first aspect of the present invention, it is desirable that the output unit switch the output content when the difference between the weight for the recognition target vocabulary set corresponding to the output content being output at the time the recognition target vocabulary switching request unit made the switching request, and the weight for the recognition target vocabulary set corresponding to the output content to be output next, becomes less than a predetermined value.
[0025]
According to the configuration, in response to the switching of the vocabulary set to be recognized, the output content of the output unit is switched to the corresponding output content.
[0026]
Further, the speech recognition method of the second invention is a method in which, when an input voice is recognized using a recognition target vocabulary and a recognition result is output, the recognition target vocabulary is automatically switched based on a time signal from a timer unit. The method is characterized in that a plurality of output contents are switched and output by an output unit; the recognition target vocabulary is switched in units of a plurality of recognition target vocabulary sets, each of which is a set of recognition target words corresponding to one of the output contents; a weight for each of the recognition target vocabulary sets is determined based on the time signal from the timer unit; and the input speech is recognized using all of the recognition target vocabulary sets and the determined weights.
[0027]
According to the above configuration, the input speech is recognized using all of the recognition target vocabulary sets and the weight for each recognition target vocabulary set, which is determined based on the time signal from the timer unit. At this time, when switching of the recognition target vocabulary is requested based on the time signal from the timer unit, the recognition target vocabulary set currently in use is switched to the recognition target vocabulary set corresponding to the switched output content of the output unit. Therefore, if the weight for the recognition target vocabulary set before switching is reduced, the recognition accuracy for the recognition target vocabulary after switching, which corresponds to the output content of the output unit, can be improved.
[0028]
Furthermore, even if the speaker, unaware that the recognition target vocabulary set has been switched, utters a word from the recognition target vocabulary before switching, recognition is still performed using the words of the recognition target vocabulary set before switching. Therefore, high recognition accuracy can also be obtained for the words in the recognition target vocabulary set before switching.
[0029]
A program recording medium according to a third aspect of the present invention is characterized in that it stores a speech recognition processing program for causing a computer to function as a recognition unit, an output unit, a timer unit, a recognition target vocabulary switching request unit, and a weight determination unit.
[0030]
According to the above configuration, as in the case of claim 1, if the weight for the recognition target vocabulary set before switching is reduced, the recognition accuracy for the recognition target vocabulary after switching, which corresponds to the output content of the output unit, is increased. Furthermore, even if the speaker, unaware that the recognition target vocabulary set has been switched, utters a word from the recognition target vocabulary set before switching, high recognition accuracy can be obtained.
[0031]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments. FIG. 1 is a block diagram of the speech recognition device according to the present embodiment. The speech recognition device 21 comprises a voice input unit 22, an A / D conversion unit 23, an acoustic analysis unit 24, a recognition unit 25, an acoustic model storage unit 26, a first recognition target vocabulary set storage unit 27, a second recognition target vocabulary set storage unit 28, a weight coefficient determination unit 29, a timer unit 30, a recognition target vocabulary switching request unit 31, and an output unit 32.
[0032]
The voice input unit 22 includes a voice input device including a microphone, converts an input voice into an electric signal (voice signal), and outputs the electrical signal to the A / D converter 23. The A / D converter 23 converts the input analog audio signal into a digital signal, and outputs the digitized audio signal to the acoustic analyzer 24. The digitized audio signal is represented by a time series of amplitude values.
[0033]
The acoustic analysis unit 24 extracts a feature vector for each frame from the digital audio signal from the A / D conversion unit 23 and outputs the feature vector to the recognition unit 25. Here, the feature vector is a 34-dimensional vector x_vec made up of a total of 34 elements: the power of the audio signal in the frame, the 1st- to 16th-order LPC cepstrum coefficients, the power of the previous frame, and the 1st- to 16th-order LPC cepstrum coefficients of the previous frame. These vectors are arranged over all frames (t = 1, 2, ..., I).
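As an illustration of how the 34 elements add up (1 + 16 + 1 + 16), the following is a minimal sketch that stacks each frame with its predecessor. The function name `stack_features` and the handling of the first frame, which has no predecessor, are assumptions made for this example and are not specified in the text.

```python
from typing import List, Sequence

CEPSTRUM_ORDER = 16  # 1st- to 16th-order LPC cepstrum coefficients, per the text


def stack_features(powers: Sequence[float],
                   cepstra: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build one 34-dimensional vector per frame from per-frame power and cepstra."""
    if len(powers) != len(cepstra):
        raise ValueError("powers and cepstra must cover the same frames")
    vectors = []
    for t in range(len(powers)):
        if len(cepstra[t]) != CEPSTRUM_ORDER:
            raise ValueError("each frame needs 16 cepstrum coefficients")
        # Assumption: the first frame has no predecessor, so it reuses itself.
        prev = t - 1 if t > 0 else 0
        # Layout: [power, c1..c16, previous power, previous c1..c16]
        vectors.append([powers[t], *cepstra[t], powers[prev], *cepstra[prev]])
    return vectors
```

The power and cepstrum values themselves would come from the LPC analysis, which is outside the scope of this sketch.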
[0034]
Using the acoustic model and the feature vectors extracted by the acoustic analysis unit 24, the recognition unit 25 calculates, by the method described in the related art, the occurrence probability (likelihood) P of each word constituting the recognition target vocabulary set A stored in the first recognition target vocabulary set storage unit 27 and the recognition target vocabulary set B stored in the second recognition target vocabulary set storage unit 28. It then multiplies the likelihood P of each word by the weight w determined by the weight coefficient determination unit 29, and outputs the word corresponding to the HMM exhibiting the highest weighted likelihood w · P to the output unit 32.
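The selection by weighted likelihood w · P can be sketched as follows. This is only an illustration: the function name `recognize`, the dictionary layout, and the example words and likelihood values are hypothetical, and the likelihoods are assumed to be already computed by the HMM-based method.

```python
from typing import Dict, Tuple


def recognize(likelihoods: Dict[str, float],
              set_of_word: Dict[str, str],
              set_weights: Dict[str, float]) -> Tuple[str, float]:
    """Return the word with the highest weighted likelihood w * P and that score."""
    if not likelihoods:
        raise ValueError("no candidate words")
    best_word, best_score = "", float("-inf")
    for word, p in likelihoods.items():
        # Every word in every set is scored; only the weight differs per set.
        score = set_weights[set_of_word[word]] * p
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Hypothetical data: two words in set A, one word in set B.
example_likelihoods = {"print": 0.30, "save": 0.25, "copy": 0.28}
example_sets = {"print": "A", "save": "A", "copy": "B"}
```

With weights `{"A": 1.0, "B": 0.1}` the call selects "print", and with `{"A": 0.1, "B": 1.0}` it selects "copy", even though the likelihoods themselves are unchanged.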
[0035]
The acoustic model storage unit 26 stores an acoustic model used when the recognition unit 25 performs speech recognition. As the acoustic model, an HMM trained in advance (initial training) in units of phonemes by the so-called Baum-Welch algorithm, using training speech from unspecified speakers, is used. The HMM is stored in an array, sized by the number of states, whose elements are the transition probabilities and the output probability distribution of each state. The transition probabilities are stored in an array, sized by the number of transitions, whose elements are the transition probabilities for the respective states. The output probability is represented by a multidimensional mixed normal distribution obtained by weighting and adding a plurality of normal distributions, and is stored in an array of the number of dimensions whose elements are the mixing weight, mean vector, and variance vector of each normal distribution. Here, the mean vector and the variance vector are each represented by an array of "34" elements, the same as the number of elements of the feature vector that the acoustic analysis unit 24 extracts from the audio signal for each frame.
[0036]
The timer unit 30 notifies the recognition target vocabulary switching request unit 31, the weight coefficient determination unit 29, and the output unit 32 of the time by outputting a time signal to them. The recognition target vocabulary switching request unit 31 then determines, based on the notified time, whether to request switching of the recognition target vocabulary. When it decides to make the request, it requests the weight coefficient determination unit 29 to switch the recognition target vocabulary.
[0037]
For the recognition target vocabulary set A stored in the first recognition target vocabulary set storage unit 27 and the recognition target vocabulary set B stored in the second recognition target vocabulary set storage unit 28, the weight coefficient determination unit 29 determines a weight w₂ to be applied to the words constituting the recognition target vocabulary set corresponding to the display content currently displayed by the output unit 32, and a weight w₁ to be applied to the words constituting the recognition target vocabulary set corresponding to the display content not displayed by the output unit 32. These weights w₁ and w₂ are determined using the stored weight functions W₁(t) and W₂(t) each time a predetermined time ΔT elapses, with the time t₀ from the timer unit 30 at which switching was requested by the recognition target vocabulary switching request unit 31 as the reference. The determined weights w₁ and w₂ are then sequentially output to the recognition unit 25.
[0038]
In the first recognition target vocabulary set storage unit 27 and the second recognition target vocabulary set storage unit 28, the words constituting the recognition target vocabulary sets A and B are each stored in an array, with one element per word, whose elements are the character string of the word's notation and its phoneme string.
[0039]
The output unit 32 includes an image display device having a display, and stores first display content corresponding to the recognition target vocabulary set A and second display content corresponding to the recognition target vocabulary set B. Based on the time notified from the timer unit 30, it determines whether to change which of the first and second display contents is currently displayed, and if so, switches the display content of the screen. It also displays the recognition result from the recognition unit 25 on the screen.
[0040]
FIG. 2 shows the change over time of the weight function W₂(t) for the recognition target vocabulary set corresponding to the display content currently selected by the output unit 32 and of the weight function W₁(t) for the recognition target vocabulary set corresponding to the non-selected display content. The value of the weight function W₁(t) starts increasing monotonically, from a predetermined value "a" near 0 that is smaller than 1, at the time t₀ at which the request to switch the recognition target vocabulary is output, and is "1" from time t₂ onward. Conversely, the value of the weight function W₂(t) starts decreasing monotonically from the value "1" at time t₀, and is the predetermined value "a" from time t₂ onward. Here, the difference between the weight w₁ and the weight w₂ at time t₁ is taken as the threshold value h. When the value of the difference falls below the threshold value h, that is, after a time T (> (t₁ − t₀)) has elapsed from the time t₀ at which the switching of the recognition target vocabulary was requested, the display content displayed on the screen is switched.
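The shape described for FIG. 2 can be sketched with linear ramps, which is the form the embodiment uses. The function names and the numeric values in the usage note are hypothetical, and the sketch assumes that "the difference" means the signed quantity w₂ − w₁, so that the display stays switched once the two curves have crossed.

```python
def rising_weight(t: float, t0: float, t2: float, a: float) -> float:
    """W1(t): stays at a until t0, rises linearly, and is 1 from t2 onward."""
    if t <= t0:
        return a
    if t >= t2:
        return 1.0
    return a + (1.0 - a) * (t - t0) / (t2 - t0)


def falling_weight(t: float, t0: float, t2: float, a: float) -> float:
    """W2(t): stays at 1 until t0, falls linearly, and is a from t2 onward."""
    if t <= t0:
        return 1.0
    if t >= t2:
        return a
    return 1.0 - (1.0 - a) * (t - t0) / (t2 - t0)


def should_switch_display(t: float, t0: float, t2: float,
                          a: float, h: float) -> bool:
    """True once w2 - w1 has dropped below the threshold h (assumed signed)."""
    diff = falling_weight(t, t0, t2, a) - rising_weight(t, t0, t2, a)
    return diff < h
```

For example, with t0 = 0, t2 = 10, a = 0.1 and h = 0.3, the difference equals h at about t = 3.33, so the check is false at t = 3 and true at t = 4.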
[0041]
That is, the time at which the output unit 32 determines, based on the time notified from the timer unit 30, to change the display content is set to be later, by the time T, than the time at which the recognition target vocabulary switching request unit 31 determines, based on the time notified from the timer unit 30, to request the switching.
[0042]
As described above, in the present embodiment, the display content of the screen is automatically switched by the output unit 32. However, both before and after the display content is switched, the recognition unit 25 calculates the likelihood P for the vocabulary of both the recognition target vocabulary set A and the recognition target vocabulary set B. The likelihood P of the words constituting the recognition target vocabulary set corresponding to the display content currently selected by the output unit 32 is multiplied by a weight w that satisfies 1 > w > (1 + a + h) / 2 before the display content is switched, and 1 > w > (1 + a − h) / 2 after it is switched. On the other hand, the likelihood P of the words constituting the recognition target vocabulary set corresponding to the non-selected display content is multiplied by a weight w that satisfies (1 + a − h) / 2 > w > a before the display content is switched, and (1 + a + h) / 2 > w > a after it is switched. The recognition result is then determined from the final likelihood w · P calculated in this way.
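The bounds (1 + a ± h) / 2 follow from the weight functions if one assumes the linear form used in this embodiment. The short derivation below is a sketch under that assumption; the expression for t₁ − t₀ is derived here and is not stated in the text.

```latex
\begin{align*}
W_1(t) &= a + (1-a)\,\frac{t-t_0}{t_2-t_0}, \qquad
W_2(t) = 1 - (1-a)\,\frac{t-t_0}{t_2-t_0}, \qquad t_0 \le t \le t_2,\\
W_1(t) + W_2(t) &= 1 + a .\\
\intertext{At time $t_1$ the difference equals the threshold, $W_2(t_1) - W_1(t_1) = h$, so}
W_2(t_1) &= \frac{1+a+h}{2}, \qquad W_1(t_1) = \frac{1+a-h}{2},\\
t_1 - t_0 &= (t_2 - t_0)\,\frac{1-a-h}{2\,(1-a)} .
\end{align*}
```

Before time t₁ the falling weight is therefore above (1 + a + h) / 2 and the rising weight is below (1 + a − h) / 2, and after t₁ the inequalities reverse, which matches the ranges given above for the selected and non-selected sides.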
[0043]
In other words, whereas the switching of the recognition target vocabulary in the conventional speech recognition device shown in FIG. 4 is performed by switching the recognition target vocabulary itself used in calculating the likelihood P, in the present embodiment the two recognition target vocabulary sets used in calculating the likelihood P are not switched. Instead, the switching is performed by gradually changing the value of the weight w applied to the likelihood P between "1" and a predetermined value "a" near 0.
[0044]
Therefore, in the present embodiment, even if the speaker for some reason misses the opportunity to utter a word from the recognition target vocabulary and the recognition target vocabulary is automatically switched, the likelihood w · P is still calculated for the words of the recognition target vocabulary before switching, so an utterance from the recognition target vocabulary before switching can still be recognized correctly. Also, in this case, the function of improving the recognition accuracy for the vocabulary corresponding to the display content of the output unit 32 is not impaired, just as when the recognition target vocabulary itself is switched as in the speech recognition device shown in FIG. 4.
[0045]
FIG. 3 is a flowchart of the weight determination processing operation performed by the weight coefficient determination unit 29. The weight determination operation is described below with reference to FIG. 3. Here, the weight function for the recognition target vocabulary set corresponding to the display content currently selected by the output unit 32 is W₂(t), and the weight function for the recognition target vocabulary set corresponding to the non-selected display content is W₁(t). The weight determination processing operation starts when switching is requested by the recognition target vocabulary switching request unit 31.
[0046]
In step S1, the switching request time t₀ is obtained based on the time signal from the timer unit 30. In step S2, the number of calculations j of the weight value w is initialized to "0". In step S3, the number of calculations j is incremented. In step S4, it is determined whether the predetermined time ΔT has elapsed since the switching request time t₀ was obtained or since the previous calculation of the weight value w. When it has elapsed, the process proceeds to step S5. In step S5, it is determined whether the current time (t₀ + j · ΔT) exceeds the time t₂. If it does, the weight determination processing operation ends; if it does not, the process proceeds to step S6.
[0047]
In step S6, the function number i of the weight function W_i(t) is initialized to "1". In step S7, the elapsed time "j · ΔT" from the switching request time t₀ is substituted for t in the weight function W_i(t) to calculate the weight value w_i. In step S8, the function number i is incremented. In step S9, it is determined whether the value of the function number i is larger than "2". If it is "2" or less, the process returns to step S7 to calculate the weight value w₂. If it is larger than "2", it is determined that the weights at the current time for all of the recognition target vocabulary sets A and B have been calculated, and the process proceeds to step S10. In step S10, the calculated weight values w₁ and w₂ at the current time are output to the recognition unit 25. The process then returns to step S3 and moves on to calculating the weight values w₁ and w₂ at the next time.
[0048]
Thereafter, steps S3 to S10 are repeated, and when it is determined in step S5 that the current time exceeds the time t₂, the weight determination processing operation ends. After that, the weight value w₂ for the recognition target vocabulary set corresponding to the displayed content and the weight value w₁ for the recognition target vocabulary set corresponding to the non-selected display content are each output every predetermined time ΔT. Then, when a switching request is again output from the recognition target vocabulary switching request unit 31, the above-described weight determination processing operation starts.
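The loop of steps S1 to S10 can be sketched as a generator. This is an illustration only: the name `weight_schedule` is hypothetical, the wait in step S4 is replaced by simply advancing the counter j, and the weight functions are assumed to take the elapsed time j · ΔT as their argument. The example ramps `example_w1` and `example_w2` use made-up values a = 0.1 and a ramp duration of 4 time units.

```python
from typing import Callable, Iterator, Sequence, Tuple


def weight_schedule(t0: float, t2: float, delta_t: float,
                    weight_functions: Sequence[Callable[[float], float]]
                    ) -> Iterator[Tuple[float, Tuple[float, ...]]]:
    """Yield (current_time, (w_1, w_2, ...)) every delta_t until t2 is exceeded."""
    j = 0                                  # S2: number of calculations
    while True:
        j += 1                             # S3
        # S4: a real device would wait here until delta_t has elapsed.
        elapsed = j * delta_t
        now = t0 + elapsed
        if now > t2:                       # S5: stop once t2 is exceeded
            return
        # S6 to S9: evaluate every weight function at the elapsed time.
        weights = tuple(f(elapsed) for f in weight_functions)
        yield now, weights                 # S10: hand the weights to the recognizer


# Hypothetical linear ramps over 4 time units with a = 0.1.
def example_w1(elapsed: float) -> float:
    return 0.1 + 0.9 * elapsed / 4.0


def example_w2(elapsed: float) -> float:
    return 1.0 - 0.9 * elapsed / 4.0
```

Calling `list(weight_schedule(0.0, 4.0, 1.0, [example_w1, example_w2]))` produces four outputs, at times 1 through 4, with the last pair approximately (1.0, 0.1).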
[0049]
As described above, the recognition unit 25 according to the present embodiment uses the acoustic model stored in the acoustic model storage unit 26 to calculate the likelihood P of the words constituting the recognition target vocabulary set A stored in the first recognition target vocabulary set storage unit 27 and the recognition target vocabulary set B stored in the second recognition target vocabulary set storage unit 28. At this time, the switching of the recognition target vocabulary set that accompanies the switching of the display content of the output unit 32 is performed not by switching the recognition target vocabulary set itself, but by switching the weights w₂ and w₁, by which the likelihood P of the words constituting the selected and non-selected recognition target vocabulary sets is multiplied, between "1" and a predetermined value "a" near 0. Moreover, the weights w₂ and w₁ are not switched in a single step; they are changed gradually from the value "1" to the value "a", or from the value "a" to the value "1", in proportion to the elapsed time "j · ΔT" from the time t₀ at which the switching request was made by the recognition target vocabulary switching request unit 31.
[0050]
Therefore, according to the present embodiment, even if the speaker for some reason misses the opportunity to utter a word from the recognition target vocabulary and the recognition target vocabulary is automatically switched, the likelihood w · P is still calculated for the words in the recognition target vocabulary set before switching, so an utterance from the recognition target vocabulary before switching can still be recognized correctly. Further, in this case, the function of improving the recognition accuracy for the recognition target vocabulary corresponding to the display content of the output unit 32 is not impaired, just as when the recognition target vocabulary itself is switched as in the speech recognition device shown in FIG. 4.
[0051]
In the above embodiment, the weight function W₂(t) for the recognition target vocabulary set on the selected side and the weight function W₁(t) for the recognition target vocabulary set corresponding to the non-selected display content are switched linearly from the values "1" and "a" to the values "a" and "1", in proportion to the elapsed time "j · ΔT" from the switching request time t₀ of the recognition target vocabulary switching request unit 31. However, in the present invention, the shape of the functions W₁(t) and W₂(t) is not limited to a straight line; it may be a curve. Alternatively, the value of the function W₂(t) may be increased while the value of the function W₁(t) is decreased up to the display content switching time t₁, and the value of the function W₂(t) may be decreased while the value of the function W₁(t) is increased after the display content switching time t₁.
[0052]
Further, in the above embodiment, the weight coefficient determination unit 29 determines the weight values w₁ and w₂ each time the predetermined time ΔT elapses, with the switching request time t₀ from the recognition target vocabulary switching request unit 31 as the reference, and outputs them to the recognition unit 25. The recognition unit 25 performs the recognition process as needed using the input weight values w₁ and w₂. However, the present invention is not limited to this. The recognition unit 25 may be configured to issue a weight determination request to the weight coefficient determination unit 29 when performing recognition, and the weight coefficient determination unit 29, upon receiving the determination request, may calculate the weights by substituting the time elapsed since the switching request time t₀ of the recognition target vocabulary switching request unit 31 into the weight functions W_i(t).
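The on-request variant described above might look like the following sketch, in which the weights are evaluated only when the recognizer asks for them. The class name `OnDemandWeights`, its method names, and the example ramps (a = 0.1, ramp duration of 4 time units, clamped at both ends) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


class OnDemandWeights:
    """Evaluate the weight functions at request time instead of every delta_t."""

    def __init__(self, weight_functions: Sequence[Callable[[float], float]]):
        self._functions = list(weight_functions)
        self._t0: Optional[float] = None

    def notify_switch_request(self, t0: float) -> None:
        """Record the switching request time t0 reported by the timer."""
        self._t0 = t0

    def determine(self, now: float) -> Tuple[float, ...]:
        """Substitute the time elapsed since t0 into each weight function."""
        if self._t0 is None:
            raise RuntimeError("no switching request has been made yet")
        elapsed = now - self._t0
        return tuple(f(elapsed) for f in self._functions)


def clamped_w1(elapsed: float) -> float:
    # Rising ramp from 0.1 to 1.0, held constant outside the ramp.
    return 0.1 + 0.9 * min(max(elapsed, 0.0), 4.0) / 4.0


def clamped_w2(elapsed: float) -> float:
    # Falling ramp from 1.0 to 0.1, held constant outside the ramp.
    return 1.0 - 0.9 * min(max(elapsed, 0.0), 4.0) / 4.0
```

One consequence of this design is that the weights always reflect the exact moment of recognition rather than the most recent ΔT tick, at the cost of a function evaluation per recognition.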
[0053]
The functions of the recognition unit, the output unit, the timer unit, the recognition target vocabulary switching request unit, and the weight determination unit in each of the above embodiments are realized by a speech recognition processing program recorded on a program recording medium. The program recording medium in the above embodiment is a program medium composed of a ROM (read-only memory). Alternatively, it may be a program medium that is mounted in an external auxiliary storage device and read from it. In either case, the program reading means that reads the speech recognition processing program from the program medium may access the program medium directly to read it, or the program may be downloaded to a program storage area (not shown) provided in a RAM (random access memory) and read by accessing that program storage area. A download program for downloading from the program medium to the program storage area of the RAM is assumed to be stored in the main unit in advance.
[0054]
Here, the program medium is a medium that is configured to be separable from the main body and that carries the program in a fixed manner. It includes tape systems such as magnetic tapes and cassette tapes; disk systems comprising magnetic disks such as floppy disks and hard disks and optical disks such as CD (compact disk)-ROMs, MO (magneto-optical) disks, MDs (mini disks), and DVDs (digital video disks); card systems such as IC (integrated circuit) cards and optical cards; and semiconductor memory systems such as mask ROMs, EPROMs (ultraviolet-erasable ROMs), EEPROMs (electrically erasable ROMs), and flash ROMs.
[0055]
Further, if the speech recognition device in each of the above embodiments includes a modem and can be connected to a communication network including the Internet, the program medium may be a medium that carries the program fluidly, for example by downloading it from the communication network. In this case, a download program for downloading from the communication network is assumed to be stored in the main device in advance, or to be installed from another recording medium.
[0056]
It should be noted that what is recorded on the recording medium is not limited to only a program, and data can also be recorded.
[0057]
[Effects of the Invention]
As is clear from the above, the speech recognition device of the first invention stores a plurality of recognition target vocabulary sets corresponding to the output contents of the output unit in the recognition vocabulary storage unit, determines a weight for each of the recognition target vocabulary sets with the weight determination unit based on the time signal from the timer unit, and recognizes the input speech with the recognition unit using all of the recognition target vocabulary sets and the determined weights. Therefore, when the recognition target vocabulary set is switched to the one corresponding to the switched output content of the output unit in response to a switching request from the recognition target vocabulary switching request unit, reducing the weight for the recognition target vocabulary set before switching improves the recognition accuracy for the recognition target vocabulary after switching.
[0058]
Furthermore, even if the speaker utters in the recognition target vocabulary before switching without knowing that the recognition target vocabulary set has been switched, recognition is performed using the words in the recognition target vocabulary set before switching. High recognition accuracy can also be obtained for the words in the recognition target vocabulary set before the switching.
[0059]
That is, according to the present invention, high recognition accuracy can be obtained even when the vocabulary to be recognized is automatically switched. Furthermore, an easy-to-use speech recognition device can be realized without burdening the speaker with any operation or waiting time.
[0060]
In the speech recognition device of the first aspect of the present invention, if the weight determination unit decreases the weight for the recognition target vocabulary set before switching and increases the weight for the recognition target vocabulary set after switching according to the elapsed time from when the recognition target vocabulary switching request unit requests the switching of the recognition target vocabulary until the weight is determined, the recognition target vocabulary used for recognition can be switched gradually. Therefore, high recognition accuracy can also be obtained for the words in the recognition target vocabulary set before switching.
[0061]
In the speech recognition device of the first aspect of the present invention, if the recognition unit calculates the likelihood of each word constituting all of the recognition target vocabulary sets, multiplies the likelihood of each word by the weight for the recognition target vocabulary set to which the word belongs, and takes the word with the highest resulting value as the recognition result, then by optimally setting the weight for the recognition target vocabulary set used for recognition and the weight for the recognition target vocabulary set not used for recognition, it is easy both to increase the recognition accuracy for the recognition target vocabulary after switching, which corresponds to the output content of the output unit, and to obtain high recognition accuracy even when the speaker utters a word from the recognition target vocabulary before switching.
[0062]
Further, in the speech recognition device of the first aspect of the present invention, if the output unit switches the output content when the difference between the weight for the recognition target vocabulary set corresponding to the output content being output at the time the recognition target vocabulary switching request unit made the switching request, and the weight for the recognition target vocabulary set corresponding to the output content to be output next, becomes less than a predetermined value, the output content of the output unit can be switched to the corresponding output content in step with the switching of the recognition target vocabulary set.
[0063]
The speech recognition method according to a second aspect of the present invention determines, based on a time signal from a timer unit, a weight for each of a plurality of recognition target vocabulary sets corresponding to the output contents of the output unit, and recognizes the input speech using the weights thus determined. Therefore, when the recognition target vocabulary set is switched, reducing the weight for the recognition target vocabulary set before switching improves the recognition accuracy for the recognition target vocabulary after switching, which corresponds to the output content of the output unit.
[0064]
Furthermore, even if the speaker utters in the recognition target vocabulary before switching without knowing that the recognition target vocabulary set has been switched, recognition is performed using the words in the recognition target vocabulary set before switching. High recognition accuracy can also be obtained for the words in the recognition target vocabulary set before the switching.
[0065]
A program recording medium according to a third aspect of the present invention stores a speech recognition processing program for causing a computer to function as a recognition unit, an output unit, a timer unit, a recognition target vocabulary switching request unit, and a weight determination unit. Therefore, as in the case of claim 1, if the weight for the recognition target vocabulary set before switching is reduced, the recognition accuracy for the recognition target vocabulary after switching, which corresponds to the output content of the output unit, can be increased. Furthermore, even if the speaker, unaware that the recognition target vocabulary set has been switched, utters a word from the recognition target vocabulary before switching, high recognition accuracy can be obtained.
[Brief description of the drawings]
FIG. 1 is a block diagram of a speech recognition apparatus according to the present invention.
FIG. 2 is a diagram showing a temporal change of a weight function for a selected / non-selected vocabulary set to be recognized.
FIG. 3 is a flowchart of a weight determination processing operation performed by a weight coefficient determination unit in FIG. 1;
FIG. 4 is a block diagram of a conventional speech recognition apparatus capable of switching a vocabulary to be recognized.
[Explanation of symbols]
21 ... speech recognition device,
22 ... voice input unit,
23 ... A / D conversion unit,
24 ... acoustic analysis unit,
25 ... recognition unit,
26 ... acoustic model storage unit,
27 ... first recognition target vocabulary set storage unit,
28 ... second recognition target vocabulary set storage unit,
29 ... weight coefficient determination unit,
30 ... timer unit,
31 ... recognition target vocabulary switching request unit,
32 ... output unit.

Claims

A speech recognition device having a recognition unit for recognizing an input voice, an output unit for outputting information including a recognition result of the recognition unit, a recognition vocabulary storage unit storing a recognition target vocabulary used at the time of the recognition, a timer unit, and a recognition target vocabulary switching request unit that requests switching of the recognition target vocabulary based on a time signal from the timer unit, characterized in that:
the output unit is configured to switch among and output a plurality of output contents;
the recognition target vocabulary is classified into a plurality of recognition target vocabulary sets, each being a set of recognition target words corresponding to an output content of the output unit, and the switching of the recognition target vocabulary is performed in units of the recognition target vocabulary set;
the device comprises a weight determination unit that determines a weight for each of the recognition target vocabulary sets based on the time signal from the timer unit; and
the recognition unit is configured to recognize the input speech using all of the recognition target vocabulary sets and each of the determined weights.

The speech recognition apparatus according to claim 1, wherein
the weight determination unit is configured to lower the weight for the recognition target vocabulary set before switching and to raise the weight for the recognition target vocabulary set after switching, according to the time elapsed from when the recognition target vocabulary switching request unit requests switching of the recognition target vocabulary until the weight is determined.
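
The elapsed-time dependence of the two weights can be illustrated with a minimal Python sketch. The linear ramp, the function name `set_weights`, and the `transition_s` parameter are assumptions made only for illustration; the claim requires only that one weight falls and the other rises with elapsed time, and it does not fix the shape of the curve.

```python
def set_weights(elapsed_s: float, transition_s: float = 2.0) -> tuple[float, float]:
    """Return (weight_before, weight_after) for the pre- and post-switching sets.

    elapsed_s    -- time since the switching request, in seconds
    transition_s -- hypothetical duration over which the weights cross over
    """
    if transition_s <= 0:
        raise ValueError("transition_s must be positive")
    # Clamp progress to [0, 1] so the weights stay constant outside the ramp.
    progress = min(max(elapsed_s / transition_s, 0.0), 1.0)
    # Assumed linear ramp: the old set's weight falls as the new set's rises.
    return 1.0 - progress, progress


if __name__ == "__main__":
    for t in (0.0, 1.0, 5.0):
        print(t, set_weights(t))
```

Under these assumptions the pre-switching set keeps a nonzero weight for a short period after the request, which is what allows a word from the old set to still be recognized.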

The speech recognition apparatus according to claim 1 or claim 2, wherein the recognition unit is configured to calculate the likelihood of each word constituting all of the recognition target vocabulary sets, multiply the likelihood value of each word by the weight for the recognition target vocabulary set to which the word belongs, and take the word with the highest resulting value as the recognition result.
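
The weighted selection described in this claim can be sketched as follows. This is an illustrative Python sketch, not the patented implementation: the function name `recognize`, the dictionary-based data layout, and the example words and set names are all hypothetical, and the likelihoods are assumed to be non-negative scores on a linear (not logarithmic) scale so that multiplying by a weight behaves as intended.

```python
def recognize(likelihoods: dict[str, float],
              word_to_set: dict[str, str],
              set_weights: dict[str, float]) -> str:
    """Return the word whose weighted likelihood is highest.

    likelihoods -- likelihood of each word across all vocabulary sets
    word_to_set -- vocabulary set to which each word belongs
    set_weights -- current weight of each vocabulary set
    """
    if not likelihoods:
        raise ValueError("no candidate words")
    # Multiply each word's likelihood by the weight of its vocabulary set.
    scored = {word: score * set_weights[word_to_set[word]]
              for word, score in likelihoods.items()}
    # The word with the highest weighted value is the recognition result.
    return max(scored, key=scored.get)


if __name__ == "__main__":
    likelihoods = {"print": 0.6, "save": 0.5}
    word_to_set = {"print": "menu_a", "save": "menu_b"}
    print(recognize(likelihoods, word_to_set, {"menu_a": 0.8, "menu_b": 0.2}))
    print(recognize(likelihoods, word_to_set, {"menu_a": 0.2, "menu_b": 0.8}))
```

In the usage example the same acoustic likelihoods yield different results depending on which set currently carries the larger weight.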

The speech recognition apparatus according to claim 2, wherein
the output unit is configured to switch the output content when the difference between the weight value for the recognition target vocabulary set corresponding to the output content being output at the time the recognition target vocabulary switching request was made by the recognition target vocabulary switching request unit, and the weight value for the recognition target vocabulary set corresponding to the output content to be output next, becomes less than a predetermined value.
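
The display-switching condition in this claim can be sketched as a simple threshold test. In this illustrative Python sketch the function name `should_switch_output` and the default threshold are hypothetical, and the difference is assumed to be taken as the current set's weight minus the next set's weight (a signed difference); the claim itself states only that the difference falls below a predetermined value.

```python
def should_switch_output(weight_current: float,
                         weight_next: float,
                         threshold: float = 0.2) -> bool:
    """Decide whether the output unit should switch to the next content.

    weight_current -- weight of the set for the content shown at request time
    weight_next    -- weight of the set for the content to be shown next
    threshold      -- hypothetical "predetermined value" from the claim
    """
    # Assumed signed difference: it shrinks as the current set's weight
    # falls and the next set's weight rises after the switching request.
    return (weight_current - weight_next) < threshold


if __name__ == "__main__":
    print(should_switch_output(1.0, 0.0))    # weights still far apart
    print(should_switch_output(0.55, 0.45))  # weights nearly equal
```

With a signed difference the condition stays true once the weights have crossed, so the output does not switch back while the next set continues to gain weight.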

A speech recognition method in which, when input speech is recognized using a recognition target vocabulary and a recognition result is output, the recognition target vocabulary is switched automatically based on a time signal from a timer unit, the method comprising:
switching among a plurality of output contents and outputting them to an output unit,
switching the recognition target vocabulary in units of a plurality of recognition target vocabulary sets, each being a set of recognition target words corresponding to one of the output contents,
determining a weight for each of the recognition target vocabulary sets based on the time signal from the timer unit, and
recognizing the input speech using all of the recognition target vocabulary sets and the determined weights.

A computer-readable program recording medium on which is recorded a speech recognition processing program that causes a computer to function as the recognition unit, the output unit, the timer unit, the recognition target vocabulary switching request unit, and the weight determination unit of claim 1.