JP2004325635A

JP2004325635A - Apparatus, method, and program for speech processing, and program recording medium

Info

Publication number: JP2004325635A
Application number: JP2003118305A
Authority: JP
Inventors: Kenichi Kumagai; 建一熊谷
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2003-04-23
Filing date: 2003-04-23
Publication date: 2004-11-18
Anticipated expiration: 2023-04-23
Also published as: JP4074543B2

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus, a method, and a program for speech processing, and a program recording medium, that can determine the factor causing misrecognition and inform the user of it. SOLUTION: A segmenting part 14 divides the feature quantities of input speech extracted by a feature extraction part 13 into per-phoneme segments by comparison with a standard model stored in a standard model storage part 18. A factor analysis part 15 obtains, from the feature quantities of each segment, feature quantities relating to a plurality of misrecognition factors, calculates the degree of deviation of each factor's feature quantity from the standard model, and detects the factor with the largest deviation. A message generation part 16 and a message presentation part 17 present that factor to the user in the form of a message. The user is thus informed of the cause of misrecognition in terms of a factor that is easy to understand intuitively, which reduces the user's discomfort. COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、音声認識システム等に利用される音声処理装置，音声処理方法，音声処理プログラムおよびプログラム記録媒体に関する。
【０００２】
【従来の技術】
現在、音声認識システムの認識性能は、書き起こし文を読み上げた朗読音声であれば、不特定話者タスクであっても高い単語認識性能を有している。これは、多数話者データベースの利用が可能であり、殆どの話者の音響特性を学習できるためである。また、Ｍａｘｉｍｕｍａｐｏｓｔｅｒｉｏｒｉ（以下、ＭＡＰと略称する）やＭａｘｉｍｕｍＬｉｋｅｌｉｈｏｏｄＬｉｎｅａｒＲｅｇｒｅｓｓｉｏｎ（以下、ＭＬＬＲと略称する）等の話者適応技術によって、少ない音声サンプルから話者の音響特性を学習することも可能である。
【０００３】
ここで、上記話者の音響特性とは、話者の発声器官の違い等、発声器官の物理特性の違いによって起こる音響特性のことである。例えば、声道長の違い等によって、音声のスペクトルが話者毎に異なる。尚、上述したＭＡＰやＭＬＬＲは、Ｓ．Ｙｏｕｎｇ他著“ＴｈｅＨＴＫＢＯＯＫ”に詳しく述べられている。
【０００４】
しかしながら、自然に且つ自由に発声された音声（以下、自然音声と言う）に対する認識性能は不十分である（篠崎他、音講論、ｐｐ１７−１８、Ｍａｒ．２００２）。自然音声認識が難しい理由は、発話スタイルの要因が大きいといわれている（山本他、信学論ｐｐ２４３８−２４４７、Ｎｏｖ．２０００）。また、自然音声と朗読音とを使ってモデルを学習した場合でも、自然音声の認識率はかなり低下する。この原因は、総ての発話速度に対応したモデルを作成することが難しいことと、自然音声においては特に母音をはっきりと発音しない（なまける）傾向があるためであると考えられる。
【０００５】
前者の原因に対しては、発話速度毎に遷移パスを分離するマルチパス隠れマルコフモデル（以下、ＨＭＭと略称する）（李他、音講文，ｐｐ．８９−９０，Ｍａｒ．２００２）等が提案されている。しかしながら、計算コストに見合った認識精度は得られていない。また、後者の問題に対しては、自然音声を上記ＭＡＰやＭＬＬＲ等の話者適応技術によって音響モデルを学習することが考えられる。しかしながら、そうすると、逆に母音モデルの特徴空間が大きくなってしまい、結果として自然音声の認識率が向上しても、朗読音声の認識精度が悪くなり兼ねない。
【０００６】
ここで、上記発話スタイルとは、上記「話者の音響特性」のような発声器官の物理特性の違いではなく、話者の環境や文化等によって起こる音響特性のことである。例えば、方言，早口，ゆっくりしゃべる，はっきりと発音しない等である。
【０００７】
さらに、あらゆる騒音環境下において高性能な認識性能を保証することはできない。予め収録した騒音を学習音声に重畳した音声をモデル（マッチドモデル）化する方法によって良い認識性能が得られるが、全環境の騒音を収録するのは不可能である。そのために、騒音環境の場合も、上記話者適応の場合と同様に少数の騒音データから上記ＭＡＰやＭＬＬＲ等によって適応処理を行う方法がなされている。しかしながら、その場合であっても上記マッチドモデル化する方法と比較すると認識性能は劣る。また、利用者が手当たり次第に環境適応を行うと、音響モデルがどのようになるか予測がつかないために好ましくない。
【０００８】
利用者にとって、利用者自身の音声の音響特性は如何にもならないが、周りの騒音や発話スタイルに対しては対応が簡単である。例えば、騒音に対しては静かな場所に移動できるし、発話スタイルに対しては標準的な話し方をすればよい。したがって、誤認識の原因が、話者の音響特性によるものか発話スタイルによるものか騒音によるものかを判定して、判定結果を利用者に知らせることができれば、誤認識による不快感を少なくすることができることになる。また、発話スタイルへの適応を行わないことで、認識性能が向上しない無駄な適応処理を回避することができる。同様に、対応していない環境を通知してやることによって、無駄な環境適応処理を回避することができる。
【０００９】
しかしながら、多くの音声認識システムにおいては、利用者に誤認識理由すら通知してはいない。その理由は、誤認識の原因を一般の人が理解できるように説明するのが難しいためである。具体的には、上記ＨＭＭを用いた音声認識システムにおいては、入力音声の音韻性以外の情報を含んだ「Ｍｅｌ−ｆｒｅｑｕｅｎｃｙｃｅｐｓｔｒａｌｃｏｅｆｆｉｃｉｅｎｔｓ（以下、ＭＦＣＣと略称する）」等の特徴ベクトルと標準モデルとの確率統計距離を基準としたマッチングスコアの大小によって認識結果が判定されるので、誤認識の原因を音声学の知見に完全に（１対１の対応で）結び付けることができないからである。
【００１０】
入力音声と標準音声との物理的な距離尺度を基準とした認識システムにおいては、上述したような誤認識理由を教示する装置ではないが、標準的な発話を利用者に学習させる音声認識装置が提案されている（例えば、特許文献１参照）。
【００１１】
その他、上記誤認識理由通知を行うものとしては、以下のような音声認識方法及び装置がある（特許文献２参照）。この音声認識方法及び装置においては、音声が入力されると、音声認識タスクによって入力音声を分析し、予め登録されている音声データと比較して一致するものを検出する。その際に、認識結果が「ＮＧ」である場合には、ＮＧであった旨の表示と理由コードとを表示するようにしている。
【００１２】
また、従来の話者適応可能な音声認識システムにおいては、話者の音響特性と発話スタイルの違いが明確化されていないため、発話スタイルや周辺環境も話者の音響特性と同様に学習してしまうことになる。例えば、話者適応技術を用いて信頼性の高いサブワードだけに話者適応を行う音声認識装置及び自動音声認識装置がある（特許文献３参照）。この音声認識装置及び自動音声認識装置では、認識結果の尤度尺度が閾値以上になる信頼性の高いサブワードにのみモデル適応を行うことによって、適応による認識性能劣化を小さくするようにしている。
【００１３】
【特許文献１】
特開平０１‐２８５９９８号公報
【特許文献２】
特開２０００‐１１２４９７号公報
【特許文献３】
特開２０００‐１８１４８２号公報
【００１４】
【発明が解決しようとする課題】
しかしながら、上記従来の音声認識装置や音声認識方法においては、以下のような問題がある。
【００１５】
すなわち、先ず、上記特許文献１に開示された音声認識装置においては、上記のような発話スタイルと話者の音響特性とを区別することはできないし、周辺環境に適応することもできない。さらに、認識を行う認識モードと、指定単語の発話者による音節特徴パターンを作成して登録する登録モードとを有している。そして、上記登録モードでは、発声単語を指示すると共に、正しく認識されるための発声方法（つまり、誤認識され易い理由）を指示するようになっている。ところが、上記登録モードは認識モードと分離しているため、認識モードにおいて誤認識が発生した場合に誤認識の理由を発話者に通知することができず、任意文の音声入力時において誤り原因を知らせることができないという問題がある。
【００１６】
また、上記特許文献２に開示された音声認識方法及び装置においては、入力音声の認識に失敗した場合にその理由情報を通知するのであるが、その通知内容は精々「比較すべき音声登録データなし」や「入力音量過多」等の程度である。また、誤認識理由を取得する手段や方法が開示されておらず、複数の要因が重なり合って発生する誤認識の理由をどのように取得するのかは不明である。したがって、十分な誤認識理由を利用者に通知することができないという問題がある。
【００１７】
また、上記特許文献３に開示された音声認識装置及び自動音声認識装置においては、信頼尺度が閾値以上になるサブワードにモデル適応を行うのであるが、実際に信頼度の定義や信頼度の閾値を決めるのは非常に難しい。例えば、信頼度の閾値を低くし過ぎると適応による認識性能劣化は防げるのではあるが、適応を行う確率が低くなるために適応効果があまり得られない。したがって、そのようなトレードオフの関係を見極めるのは非常に難しいのである。
【００１８】
さらに、誤認識の原因が音響特性と発話スタイルと周辺環境との何れであるかを、区別することはできない。したがって、尤度尺度が閾値以上であって認識の信頼度が高い場合には、発話スタイルおよび周辺環境にも適応しようとすることになる。ところが、上述したように、発話スタイルは、自然音声を用いて学習した場合であっても認識率は劣化するものであるから同様に認識率の劣化を招き、結果的に無駄な計算をすることになる。また、誤認識した理由や信頼度が低い理由等を利用者に通知する理由取得・通知手段が存在しないために、利用者に不快感を与える可能性もある。
【００１９】
そこで、この発明の目的は、誤認識となる要因を判定して利用者に通知することが可能な音声処理装置，音声処理方法，音声処理プログラムおよびプログラム記録媒体を提供することにある。
【００２０】
【課題を解決するための手段】
上記目的を達成するため、この発明の音声処理装置は、入力された音声の特徴量と標準モデルとの比較を行うに際して、上記入力された音声の特徴量に基づいて複数の誤認識の要因に関する特徴量を求め，各要因毎に上記特徴量の上記標準モデルからのずれの度合いを算出する要因別ずれ算出手段と、上記算出されたずれの度合いが許容範囲を表す閾値内にあるか否かを判定すると共に，上記閾値内にある場合には，上記ずれの度合いを上記許容範囲内にあることを表す所定値に変換するずれ度合変換手段と、上記算出されたずれの度合いと上記変換されたずれの度合いとに基づいて最もずれの度合いが大きい要因を検出する要因検出手段と、上記検出された最もずれの大きい要因を誤認識となる要因として出力する誤認識要因出力手段を備えている。
【００２１】
上記構成によれば、入力音声波形の特徴量に基づいて、例えば人間が直感的に理解し易い誤認識の要因に関する特徴量が求められる。そして、上記特徴量と標準モデルとのずれの度合が最も大きな要因が誤認識となる原因として検出され、ユーザに対して出力される。こうして、利用者に、誤認識となる原因を知らせることによって、結果的に誤認識に至った場合における利用者の不快感が減少される。
【００２２】
また、１実施例の音声処理装置では、上記誤認識要因出力手段は、上記検出された最もずれの大きい要因が複数存在する場合には、誤認識要因を出力せずに、音声の入力を再度行うことを促すメッセージを出力するようになっている。
【００２３】
上記最もずれの大きい要因が複数存在する場合には、突発的な雑音が発生した場合に多い。この実施例によれば、このような場合には、再入力を促すことによって、突発的な雑音に対して頑健に上記要因の分析が行われる。
【００２４】
また、１実施例の音声処理装置では、上記誤認識要因出力手段による上記メッセージの出力に従って音声が再度入力された場合には、上記許容範囲を表す閾値を上記許容範囲が狭くなるように変更する閾値変更手段を備えている。
【００２５】
この実施例によれば、上記許容範囲を表す閾値が上記許容範囲を狭くするように変更されるため、ずれの度合いが強調されることになる。したがって、誤認識の要因分析結果がより得易くなり、何度も利用者に音声入力させる手間が不要になる。
【００２６】
また、１実施例の音声処理装置では、上記誤認識要因出力手段は、上記検出された最もずれの大きい要因が前回の音声入力時と同じ要因である場合は、２番目にずれが大きい要因を上記誤認識となる要因として出力するようになっている。
【００２７】
この実施例によれば、利用者に対して何度も同じ指示を出さないようにして、利用者の不快感が減らされる。
【００２８】
また、１実施例の音声処理装置では、上記標準モデルは確率関数で表されており、上記要因別ずれ算出手段は、上記誤認識の要因に関する特徴量としてパワー，話速，話者性および周辺環境雑音の特徴量を求め、各要因毎に、上記標準モデルを表す確率関数における当該要因の特徴量に基づく確率値を用いて、当該標準モデルとのずれの度合いを算出するようになっている。
【００２９】
この実施例によれば、入力音声波形の特徴量に基づいて、人間が直感的に理解し易い誤認識の要因に関する特徴量が求められる。さらに、上記ずれの度合いを累積確率値によって表すことによって、異なる要因間のずれの度合いを確率値で比較することが可能になる。したがって、ずれの度合いの値に特別な正規化を施すことなく、最もずれの大きな要因を検出することが可能になる。
【００３０】
また、この発明の音声処理方法は、入力された音声の特徴量と標準モデルとの比較を行うに際して、上記入力された音声の特徴量に基づいて複数の誤認識の要因に関する特徴量を求め，各要因毎に上記特徴量の上記標準モデルからのずれの度合いを算出し、上記算出されたずれの度合いが許容範囲を表す閾値内にあるか否かを判定すると共に，上記閾値内にある場合には，上記ずれの度合いを上記許容範囲内にあることを表す所定値に変換し、上記算出されたずれの度合いと上記変換されたずれの度合いとに基づいて最もずれの度合いが大きい要因を検出し、上記検出された最もずれの大きい要因を誤認識となる要因として出力する。
【００３１】
上記構成によれば、利用者に、誤認識となる原因を、例えば人間が直感的に理解し易い要因によって知らせることによって、結果的に誤認識に至った場合における利用者の不快感が減少される。
【００３２】
また、この発明の音声処理プログラムは、コンピュータを、この発明の音声処理装置における要因別ずれ算出手段，ずれ度合変換手段，要因検出手段および誤認識要因出力手段として機能させる。
【００３３】
上記構成によれば、利用者に、誤認識となる原因を、例えば人間が直感的に理解し易い要因によって知らせることによって、結果的に誤認識に至った場合における利用者の不快感が減少される。
【００３４】
また、この発明のプログラム記録媒体は、この発明の音声処理プログラムが記録されている。
【００３５】
上記構成によれば、コンピュータで読み出して実行することによって、利用者に、誤認識となる原因が、例えば人間が直感的に理解し易い要因によって提示される。こうして、結果的に誤認識に至った場合における利用者の不快感が減少される。
【００３６】
【発明の実施の形態】
以下、この発明を図示の実施の形態により詳細に説明する。図１は、本実施の形態の音声処理装置におけるハードウェア構成を示す図である。
【００３７】
図１において、１は数値演算・制御等の処理を行う中央演算処理装置であり、本実施の形態において説明する処理手順に従って演算・処理を行う。２はＲＡＭ（ランダム・アクセス・メモリ）やＲＯＭ（リード・オンリ・メモリ）等で構成される記憶装置であり、中央演算処理装置１によって実行される処理手順（音声処理プログラム）やその処理に必要な一時データが格納される。３はハードディスク等で構成される外部記憶装置であり、音声処理用の標準パターン（テンプレート）や標準モデル等が格納される。４はマイクロホンやキーボード等で構成される入力装置であり、ユーザが発声した音声やキー入力された文字列を入力する。５はディスプレイやスピーカ等で構成される出力装置であり、分析結果あるいはこの分析結果を処理することによって得られた情報を出力する。６はバスであり、中央演算処理装置１〜入力装置５の各種装置を相互に接続する。尚、本音声処理装置のハードウェア構成は、図１に示す構成に加えて、インターネット等の通信ネットワークと接続する通信Ｉ／Ｆを備えていても構わない。
【００３８】
但し、本実施の形態においては、音声処理装置および音声処理プログラムは独立しているが、他の装置の一部として組み込んだり、他のプログラムの一部として組み込むことも可能である。そして、その場合における入力は、上記他の装置やプログラムを介して間接的に行われることになる。
【００３９】
以下、上記ハードウェア構成を踏まえて、本実施の形態において実行される処理について説明する。
【００４０】
図２は、本実施の形態における音声処理装置の機能的構成を示すブロック図である。入力部１１から、利用者の音声とそのラベル（発話内容のテキスト表記）とが入力される。そして、入力された音声は、Ａ／Ｄ変換部１２においてデジタル化される。このとき上記入力されたテキストはそのままである。
【００４１】
デジタル化された信号は、特徴抽出部１３によって、ある時間区間（フレーム）毎にＭＦＣＣベクトルに変換される。尚、上記ＭＦＣＣを求める詳細な方法は、上述した「Ｓ．Ｙｏｕｎｇ他著“ＴｈｅＨＴＫＢＯＯＫ”」を参考されたい。また、ＭＦＣＣは特徴分析方法の１つであって、Ｌｉｎｅａｒｐｒｅｄｉｃｔｉｏｎｆｉｌｔｅｒｃｏｅｆｆｉｃｉｅｎｔｓ（線形予測フィルタ係数）等を用いても同じことである。
【００４２】
尚、上記特徴抽出部１３は、上述したように、本音声処理装置および音声処理プログラムを他の装置やプログラムに容易に組み込むことが可能なように、外部装置から特徴抽出されたパラメータが直接入力されることが可能なようになっている。その場合には、外部装置から入力されるパラメータと後に述べる標準モデルとの特徴分析方法を同じにする必要がある。例えば、上記標準モデルのパターンがＭＦＣＣで表現されている場合には、入力パラメータの特徴量もＭＦＣＣ表現にする必要がある。このとき上記入力されたテキストはそのままである。
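[0043]
The MFCC vector sequence extracted by the feature extraction part 13 is divided by the segmenting part 14 into per-phoneme segments, using the set of standard models stored in the standard model storage part 18. The division is performed as follows.
[0044]
When the standard model is an HMM, let a_ij be the probability of a transition from HMM state i to state j, and let b_j(O_t) be the probability of observing the feature vector O_t of frame t in state j. The log likelihood L_N(T) of reaching the final state N of the HMM at the final frame T is then obtained by the Viterbi algorithm according to the following recursion:

$$L_j(t) = \max_i \left[ L_i(t-1) + \log a_{ij} \right] + \log b_j(O_t)$$

The state numbers of all frames on the path along which L_N(T) was obtained (that is, the path that reaches the final state N at the final frame T) are stored, and the stored state numbers are assigned to the feature vectors (MFCC vectors), thereby dividing the feature vector sequence into phoneme units.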
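The segmentation step can be illustrated with a minimal Python sketch of Viterbi alignment in the log domain. It is not the patent's implementation: the function names `safe_log` and `viterbi_align`, the list-of-lists layout of `log_a` and `log_b`, and the assumption that every path starts in state 0 and ends in the last state are all choices made here for illustration.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0.0 else NEG_INF

def viterbi_align(log_a, log_b):
    """Return (L_N(T), state_path) for a path that starts in state 0
    and ends in the last state.

    log_a[i][j] is log a_ij; log_b[t][j] is log b_j(O_t)."""
    T, N = len(log_b), len(log_a)
    L = [[NEG_INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    L[0][0] = log_b[0][0]  # assumption: every path starts in state 0
    for t in range(1, T):
        for j in range(N):
            # Best predecessor state for (t, j).
            best_i = max(range(N), key=lambda i: L[t - 1][i] + log_a[i][j])
            best = L[t - 1][best_i] + log_a[best_i][j]
            if best > NEG_INF:
                L[t][j] = best + log_b[t][j]
                back[t][j] = best_i
    # Trace back the stored state numbers from the final state at frame T.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return L[T - 1][N - 1], path
```

Once each frame carries a state number, consecutive frames whose states belong to the same phoneme model form one segment; the mapping from state numbers to phonemes depends on how the phoneme HMMs are concatenated and is left out of the sketch.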
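[0045]
If the method described above seems difficult, it may be carried out with reference to the above-mentioned S. Young et al., "The HTK Book".
[0046]
The feature vector sequence divided into per-phoneme segments in this way is given the label in text notation and input to the factor analysis part 15, which examines the factors that cause misrecognition. The message generation part 16 creates the character string of a message to be presented to the user in accordance with the analysis result of the factor analysis part 15. Finally, the message presentation part 17 notifies the user on the basis of the created character string, either by displaying the message on the display of the output device 5 or by converting it into synthesized speech with a built-in text-to-speech means and outputting it from the speaker.
[0047]
However, when the speech processing apparatus and the speech processing program are incorporated as part of another apparatus or another program, the message presentation part 17 returns the created character string to that other apparatus.
[0048]
That is, the A/D conversion part 12, the feature extraction part 13, the segmenting part 14, the factor analysis part 15, the message generation part 16, and a portion of the message presentation part 17 are implemented by the central processing unit 1; the input part 11 is implemented by the input device 4; the remainder of the message presentation part 17 is implemented by the output device 5; and the standard model storage part 18 is implemented by the external storage device 3. In addition to the processing by parts 12 to 17 according to this embodiment, the central processing unit 1 also performs various other operations such as arithmetic and decision processing, timekeeping, and input/output processing.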
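[0049]
The analysis of misrecognition factors by the factor analysis part 15 and the creation of messages by the message generation part 16 are described in detail below. FIGS. 3 and 4 are flowcharts of the factor analysis and message creation processing executed by the factor analysis part 15 and the message generation part 16. Steps S20 and S22 are performed by the message generation part 16; the other steps are performed by the factor analysis part 15.
[0050]
The factor analysis and message creation processing starts when the segmenting part 14 finishes dividing the input into segments. First, in step S1, it is determined whether there is input from the segmenting part 14. If there is, the process proceeds to step S2. In step S2, the labeled feature vectors divided into segments are taken in from the segmenting part 14. In step S3, it is determined whether this is the first input, based on the value of a counter that counts and stores the number of consecutive inputs from the segmenting part 14. If it is the first input, the process proceeds to step S5; otherwise it proceeds to step S4.
[0051]
In step S4, the threshold used later to determine whether the degree of deviation between the feature vector and the standard model is within the allowable range is changed from the standard threshold, which is used for the first input, to a threshold that depends on the number of inputs. The threshold is set so that it decreases stepwise from the standard threshold as the number of inputs increases. The threshold is also used to set the degree of deviation to a predetermined value when the deviation between the feature vector and the standard model is within the allowable range; it is set in advance for each deviation factor and stored in the external storage device 3 or the like. Because the threshold depends on the recognition performance of the speech recognition system, it is determined experimentally in advance from feature vectors obtained from utterances of speakers whose recognition rate is 95% or higher.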
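The stepwise threshold change of step S4 can be sketched as below. The text only states that the threshold decreases stepwise from the standard threshold with the number of inputs; the function name `threshold_for_attempt`, the fixed `step` size, and the `floor` value are hypothetical. The sketch reproduces the schedule only; how a given threshold value maps to the width of the allowable range depends on how the threshold is parameterized and is not modeled here.

```python
def threshold_for_attempt(standard_threshold, attempt, step=0.05, floor=0.0):
    """Threshold for a given input attempt (attempt 0 is the first input).

    The value drops by `step` per repeated input and never goes
    below `floor`. Both parameters are illustrative assumptions."""
    return max(floor, standard_threshold - step * attempt)


def thresholds_for_attempt(standard_thresholds, attempt, step=0.05):
    """Apply the same schedule to a per-factor table of standard thresholds."""
    return {factor: threshold_for_attempt(value, attempt, step)
            for factor, value in standard_thresholds.items()}
```

A per-factor table such as `{"power": 0.3, "rate": 0.3, "speaker": 0.3}` would be passed in together with the counter value from step S3.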
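[0052]
In step S5, the degree of deviation between the segment of the non-speech interval preceding the utterance and a noise model is calculated. The noise model is obtained in advance by training on recorded noise and is stored in the standard model storage part 18. The degree of deviation between the non-speech segment (a feature vector sequence) and the noise model is obtained as the cumulative probability value of the log likelihood of observing the feature vector sequence of the non-speech interval given the noise model.
[0053]
Specifically, let M_n be the noise model and X the noise feature vector sequence. Let L(X|M_n) be the log likelihood of observing the input feature vector sequence X given the noise model M_n, and let T be the number of frames (duration) of the noise feature vector sequence. The cumulative probability value S_n of the log likelihood normalized by the duration, x = L(X|M_n)/T (hereinafter referred to as the normalized log likelihood), is expressed by the following equation:

$$S_n = \int_a^b N_n(x; \mu_n, \sigma_n)\, dx$$

Here, N_n(x; μ_n, σ_n) is a single Gaussian distribution over the random variable x with mean μ_n and variance σ_n, estimated in advance from training data. As for the integration range, when the normalized log likelihood of the input noise is less than μ_n, "a" is the minimum normalized log likelihood in the training data and "b" is the normalized log likelihood of the input noise. When the normalized log likelihood of the input noise is greater than μ_n, "a" is the normalized log likelihood of the input noise and "b" is the maximum normalized log likelihood in the training data. The probability density function is represented by a single Gaussian distribution in order to reduce the amount of computation; a Gaussian mixture distribution or the like may also be used.
[0054]
The smaller the cumulative probability value S_n, the farther the normalized log likelihood of the input noise lies from the mean μ_n of the single Gaussian distribution of the normalized log likelihood x of the training data, which indicates that the features of the input noise deviate greatly from the trained noise model.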
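The cumulative probability value S_n can be computed in closed form from the Gaussian cumulative distribution function. The following Python sketch assumes a single Gaussian, as in the text; the function names and the handling of the case where the input equals the mean (treated like the "greater than" case) are assumptions made here, since the text does not cover that case.

```python
import math

def gaussian_cdf(x, mean, variance):
    """Cumulative distribution function of a Gaussian with the given variance."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))

def cumulative_probability(value, mean, variance, train_min, train_max):
    """Integrate the Gaussian from the training-data extreme on the same
    side of the mean up to the input value, as described for S_n."""
    if value < mean:
        a, b = train_min, value      # input lies below the mean
    else:
        a, b = value, train_max      # input lies at or above the mean (assumption for equality)
    # Clamp at 0 in case the input falls outside the training range.
    return max(0.0, gaussian_cdf(b, mean, variance) - gaussian_cdf(a, mean, variance))
```

With this definition the value is close to 0.5 for an input near the mean and shrinks toward 0 as the input moves into either tail, which matches the statement that a smaller S_n means a larger deviation.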
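[0055]
In step S6, it is determined whether the calculated degree of deviation between the non-speech segment (input noise) and the noise model (the cumulative probability value S_n) is smaller than the threshold set in step S4 or the standard threshold, that is, whether the features of the input noise deviate greatly from the noise model. If they deviate greatly, the maximum-likelihood state path obtained by the Viterbi algorithm cannot be trusted, so the process proceeds to step S20. Otherwise, the process proceeds to step S7.
[0056]
In step S7, the degree of deviation of the power of the input speech from the standard distribution is calculated. As in step S5, this degree of deviation is obtained as the cumulative probability value of the mean power of the feature vectors.
[0057]
Specifically, the power of the feature vectors of the input speech is first averaged for each HMM state. Next, the cumulative probability value S_p of the phoneme power is obtained by approximating it with the median of the cumulative probability values of the individual states i_n, each of which is expressed by the integral in the following equation:

$$S_p \approx \operatorname{median}_{i_n} \int_a^b N_{p,i_n}(p_{i_n}; \mu_{p,i_n}, \sigma_{p,i_n})\, dp_{i_n}$$

Here, N_{p,i_n}(p_{i_n}; μ_{p,i_n}, σ_{p,i_n}) is a single Gaussian distribution with mean μ_{p,i_n} and variance σ_{p,i_n} over the random variable p_{i_n}, the mean power assigned to HMM state i_n, and is estimated in advance from training data. As for the integration range, when the mean power assigned to state i_n in the input speech is less than μ_{p,i_n}, "a" is the minimum power in the training data and "b" is the mean power of the input speech. When the mean power assigned to state i_n in the input speech is greater than μ_{p,i_n}, "a" is the mean power of the input speech and "b" is the maximum power in the training data. The probability density function is represented by a single Gaussian distribution in order to reduce the amount of computation; a Gaussian mixture distribution or the like may also be used.
[0058]
As described above, by regarding the stochastic processes of the states as independent and approximating the cumulative probability value S_p of the phoneme power by the median of the cumulative probability values of the states, there is no need for complicated estimation or integration of the joint probability density function Prob(i_1, i_2, ..., i_n) whose random variables are the powers of the individual states of the phoneme.
[0059]
The larger the cumulative probability value S_p, the closer the input is to the standard speaking style. In addition, the integration range makes it possible to determine whether the power is lower than in the standard speaking style (mean input power < μ_{p,i_n}) or higher (mean input power > μ_{p,i_n}).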
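A minimal Python sketch of the power analysis follows: per-state averaging, a per-state cumulative probability, and the median as the phoneme-level value S_p. The function names, the `state_stats` layout of `(mean, variance, train_min, train_max)` per state, and the majority vote used to summarize the direction of the deviation are assumptions made for illustration; the text only says that the direction can be read from the integration range.

```python
import math
import statistics

def _cdf(x, mean, variance):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * variance)))

def _state_cumulative(value, mean, variance, lo, hi):
    a, b = (lo, value) if value < mean else (value, hi)
    return max(0.0, _cdf(b, mean, variance) - _cdf(a, mean, variance))

def power_score(frame_powers, state_path, state_stats):
    """Return (S_p, direction) for one phoneme segment.

    frame_powers[t] is the power of frame t, state_path[t] its HMM state,
    state_stats[s] = (mean, variance, train_min, train_max) for state s."""
    by_state = {}
    for power, state in zip(frame_powers, state_path):
        by_state.setdefault(state, []).append(power)
    scores, votes = [], 0
    for state, powers in by_state.items():
        avg = sum(powers) / len(powers)          # per-state mean power
        mean, variance, lo, hi = state_stats[state]
        scores.append(_state_cumulative(avg, mean, variance, lo, hi))
        votes += (avg > mean) - (avg < mean)     # +1 above, -1 below, 0 equal
    direction = "above" if votes > 0 else "below" if votes < 0 else "mixed"
    return statistics.median(scores), direction
```

The median keeps one badly matched state from dominating the phoneme-level value, which is in line with treating the states as independent rather than estimating a joint density.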
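[0060]
In step S8, when the calculated degree of deviation between the power of the input speech and the standard distribution (the cumulative probability value S_p) is larger than the threshold set in step S4 or the standard threshold, the value of S_p is converted to the constant "1" and output. As a result, the deviation can be ignored when the power of the input speech differs only slightly from the standard model.
[0061]
In step S9, the degree of deviation of the speaking rate of the input speech from the standard distribution is calculated. As in step S5, this degree of deviation is obtained as the cumulative probability value of the duration.
[0062]
Specifically, the duration T is first calculated from the total number of frames of the feature vectors belonging to the segment of the input phoneme. The duration T is the time taken to utter the phoneme, and its reciprocal represents the speaking rate. Next, the cumulative probability value S_T of the duration is obtained by the following equation:

$$S_T = \int_a^b P(x; \lambda)\, dx$$

Here, P(x; λ) is a Poisson distribution with mean λ over the random variable x, estimated in advance from training data. As for the integration range, when the duration T of the input phoneme is less than λ, "a" is the minimum value in the training data and "b" is T. When T is greater than λ, "a" is T and "b" is the maximum value in the training data.
[0063]
The larger the cumulative probability value S_T, the closer the duration T is to the standard distribution. In addition, the integration range makes it possible to determine whether the speaking rate is faster than in the standard speaking style (duration T < λ) or slower (duration T > λ).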
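The duration analysis can be sketched in Python as follows. Because the Poisson distribution is discrete, the sketch replaces the integral by a sum of probability masses over integer frame counts, which is an interpretation made here rather than something the text states. The function names and the treatment of T equal to λ (grouped with the slow side) are likewise assumptions.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability mass, computed in the log domain for stability."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def duration_score(T, lam, train_min, train_max):
    """Return (S_T, rate) for a phoneme lasting T frames.

    The mass is summed from the training-data extreme on the same side
    of the mean up to T, mirroring the integration range in the text."""
    if T < lam:
        lo, hi, rate = train_min, T, "fast"   # shorter than average: fast speech
    else:
        lo, hi, rate = T, train_max, "slow"   # T == lam is grouped here (assumption)
    score = sum(poisson_pmf(k, lam) for k in range(lo, hi + 1))
    return score, rate
```

For example, with λ = 8 frames and training durations between 2 and 20 frames, a 3-frame phoneme receives a much smaller value than an 8-frame one and is flagged as fast.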
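[0064]
In step S10, when the calculated degree of deviation between the speaking rate of the input speech and the standard distribution (the cumulative probability value S_T) is larger than the threshold set in step S4 or the standard threshold, the value of S_T is converted to the constant "1" and output. As a result, the deviation can be ignored when the speaking rate of the input speech differs only slightly from the standard model.
[0065]
In step S11, the degree of deviation of the acoustic characteristics of the input speaker (speaker characteristics) from the standard distribution is calculated. As in step S5, this degree of deviation is obtained as the cumulative probability value of the log likelihood of observing the input feature vector sequence given the standard model.
[0066]
Specifically, let M_s be the standard model and X the input feature vector sequence. Let L(X|M_s) be the log likelihood of observing X given M_s, and let T be the number of frames (duration) of X. The cumulative probability value S_s of the normalized log likelihood y = L(X|M_s)/T is expressed by the following equation:

$$S_s = \int_a^b N_s(y; \mu_s, \sigma_s)\, dy$$

Here, N_s(y; μ_s, σ_s) is a single Gaussian distribution over the random variable y with mean μ_s and variance σ_s, estimated in advance from training data. As for the integration range, when the normalized log likelihood of the input feature vectors is less than μ_s, "a" is the minimum normalized log likelihood in the training data and "b" is the normalized log likelihood of the input feature vectors. When the normalized log likelihood of the input feature vectors is greater than μ_s, "a" is the normalized log likelihood of the input feature vectors and "b" is the maximum normalized log likelihood in the training data. The probability density function is represented by a single Gaussian distribution in order to reduce the amount of computation; a Gaussian mixture distribution or the like may also be used.
[0067]
The larger the cumulative probability value S_s, the closer the acoustic characteristics of the input speaker are to those of the standard speaker. Unlike the power and speaking rate of the input speech, which belong to speaking style, the integration range carries no meaning here.
[0068]
In step S12, when the calculated degree of deviation between the acoustic characteristics of the input speaker and the standard distribution (the cumulative probability value S_s) is larger than the threshold set in step S4 or the standard threshold, the value of S_s is converted to the constant "1" and output. As a result, the deviation can be ignored when the acoustic characteristics of the input speaker differ only slightly from the standard model.
[0069]
In step S13, the cumulative probability values S_p, S_T, and S_s set in steps S8, S10, and S12 are compared directly, and the factor with the smallest value, which is the one farthest from the standard model, is judged to be the cause of the recognition error. If the cumulative probability values of all factors were converted to 1 in steps S8, S10, and S12, this step is skipped. In step S14, an analysis message for the segment is created on the basis of the result of step S13 and stored in the RAM or the like of the storage device 2 in association with the label name of the segment and the cumulative probability values S_p, S_T, and S_s of the factors. As shown under <Detailed information> in FIG. 5, the analysis message is created by embedding the result of step S13 in a fixed-form keyword template. If there is no result, no analysis message is created.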
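Steps S8, S10, S12, S13, and S14 can be summarized in a short Python sketch: each cumulative probability value above its threshold is replaced by 1, and the factor with the smallest remaining value is reported for the segment. The function names, the dictionary layout, and the wording of the message template are hypothetical; FIG. 5 of the source defines the actual template.

```python
def convert_within_tolerance(score, threshold):
    """Steps S8/S10/S12: a value above the threshold counts as 'no deviation'."""
    return 1.0 if score > threshold else score

def detect_segment_factor(scores, thresholds):
    """Step S13: return (factor, converted_scores).

    `factor` is None when every value was converted to 1, in which case
    no judgment is made for the segment."""
    converted = {name: convert_within_tolerance(value, thresholds[name])
                 for name, value in scores.items()}
    if all(value == 1.0 for value in converted.values()):
        return None, converted
    return min(converted, key=converted.get), converted

def analysis_message(label, factor):
    """Step S14: fill a fixed template (the wording here is illustrative only)."""
    if factor is None:
        return None
    return f"/{label}/: {factor} deviates most from the standard model"
```

Because every factor is expressed as a probability between 0 and 1, the values can be compared directly without any further normalization across factors.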
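[0070]
In step S15, it is determined whether all segments have been input. If so, the process proceeds to step S16; otherwise it returns to step S7 and moves on to the next segment.
[0071]
In step S16, on the basis of the cumulative probability values S_p, S_T, and S_s of all segments stored in the RAM or the like of the storage device 2 in step S14, a score S_i_total for the whole utterance (a joint probability) is obtained for each factor i by the following equation, in which the product is taken over all segments of the utterance:

$$S_{i\_\mathrm{total}} = \prod_{\text{segments}} S_i$$

The factor whose whole-utterance score S_i_total is the smallest is then stored in a buffer.
[0072]
In step S17, it is determined from the whole-utterance scores calculated in step S16 whether all factors have the same score. If they do, the process proceeds to step S21; otherwise it proceeds to step S18. In step S18, it is determined whether the factor obtained in step S16 is the same as the misrecognition factor obtained for the previous input. If it is the same, the process proceeds to step S19; if it is different, the process proceeds to step S20. For the first input, all buffers have been initialized, so the result of this step is false (NO). In step S19, the misrecognition factor for the whole utterance is changed to the factor with the next (second) smallest score. This prevents the same factor from being presented to the user repeatedly.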
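The utterance-level decision of steps S16 to S19 can be sketched in Python as follows. The function name `utterance_decision`, the `"retry"` and `"report"` result labels, and the use of exact equality to detect identical scores are assumptions made for illustration; exact equality is adequate for the case the text describes, in which every value has been converted to the constant 1.

```python
import math

def utterance_decision(segment_scores, previous_factor=None):
    """Return (action, factor, totals) for one utterance.

    segment_scores is a list with one dict per segment, mapping each
    factor name to its converted cumulative probability value."""
    factors = list(segment_scores[0])
    # Step S16: joint probability per factor as a product over segments.
    totals = {f: math.prod(seg[f] for seg in segment_scores) for f in factors}
    # Step S17: identical scores for all factors suggest sudden noise.
    if len(set(totals.values())) == 1:
        return "retry", None, totals
    ranked = sorted(totals, key=totals.get)   # smallest score first
    chosen = ranked[0]
    # Steps S18/S19: avoid repeating the factor reported last time.
    if chosen == previous_factor and len(ranked) > 1:
        chosen = ranked[1]
    return "report", chosen, totals
```

On a `"retry"` result the caller would prompt the user to speak again and increment the input counter; on `"report"` it would present the chosen factor and reset the counter.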
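[0073]
In step S20, the misrecognition factor, that is, the factor with the smallest score, is presented to the user in the form of a message as shown in the upper half of FIG. 5. If necessary, the analysis messages created in step S14 are presented as well, as shown under <Detailed information> in FIG. 5. When the process has branched to this step from step S6, the message states that the cause of the error is noise. The input count is then reset to 0, and the factor analysis and message creation processing for the current input speech ends.
[0074]
In step S21, the scores of all factors are identical, for example because the cumulative probability values of all factors were converted to the constant "1" in steps S8, S10, and S12, which means that no factor deviates noticeably from the standard model. This situation, however, often arises when sudden noise has occurred. In this step, therefore, the misrecognition factor is presumed to be sudden noise. In step S22, a message is presented that asks the user to confirm whether there was sudden noise and prompts the user to input the speech once more. The input count is then incremented, and the process returns to step S1 to wait for the same speech to be input again.
[0075]
In this way, the number of inputs is counted and the standard threshold (that is, the standard range) is narrowed according to that number, which prevents the scores of all factors from becoming identical and prevents the cause of the error from remaining unknown over successive inputs. In other words, some result is always produced for the misrecognition factor, which reduces the user's discomfort.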
【００７６】
上記構成を有して上述のごとく動作する音声処理装置は、例えば音声認識システムに組み込まれることによって、次のように利用される。すなわち、音声認識システムのシステム本体側の特徴抽出部から、入力音声の特徴ベクトル列とそのラベルが特徴抽出部１３に入力される。そして、セグメント分割部１４および要因分析部１５によって上述のようにして誤認識となる要因が分析され、メッセージ作成部１６によって上記誤認識となる要因を提示するためのメッセージが作成される。そうすると、このメッセージが、メッセージ提示部１７によって、システム本体側に返送されるのである。こうすることによって、上記システム本体側では、入力音声の認識に失敗した場合には、本音声処理装置側から返送されてきた当該誤認識音声に関する上記メッセージをシステム本体側の出力装置に表示するのである。さらに、上記誤認識となる要因が発話スタイルおよび周辺雑音である場合には、無駄な適応を避けることも可能になるのである。
【００７７】
こうすることによって、利用者は、誤認識や信頼度低下の原因をより具体的に知ることができ、その原因が発話スタイルに関するものであれば即座に対応することができる。さらに、誤認識や信頼度低下の原因が分らないことに起因する不快感を無くすことができるのである。
【００７８】
上述した本音声処理装置が組み込まれた音声認識システムの機能は、本音声処理プログラムを音声認識装置の音声認識プログラム中に組み込んでも達成することができる。勿論、本音声処理装置を音声認識装置とは独立して用い、音声認識装置の使用者に、本音声処理装置を用いることによって、音声認識時に起るであろう誤認識の要因を予め知らせることもできる。この場合には、音声認識装置の使用者が自分の発話スタイルに標準との差があることを予め知ることによって、後の音声認識を効率良く行うことができることになる。
【００７９】
以上のごとく、上記実施の形態においては、上記セグメント分割部１４によって、入力音声の特徴ベクトル列を標準モデルとの比較によって音素毎のセグメントに分割する。そして、要因分析部１５によって、各セグメント毎の特徴ベクトル列に基づいて複数の要因に関する特徴量を求め、各要因毎に特徴量と標準モデルとのずれの度合いを算出し、その算出されたずれの度合が許容範囲内に在るか否かを入力回数に応じて狭く設定される閾値に基づいて判定する。そして、許容範囲内に在る場合には、そのずれの度合を「１」に変換する。そうした後、上記各要因の判定結果から最もずれの大きい要因を検出する。そして、メッセージ提示部１７によって、上記検出結果に基づいて、最もずれの大きい要因を提示するようにしている。
【００８０】
したがって、音声波形の特徴ベクトルから例えば人間が直感的に理解し易い誤認識の要因を抽出して、最もずれの大きな要因を検出することによって、何が誤認識の原因となり得るかを推定することができる。したがって、利用者に、誤認識となる原因を知らせることができ、利用者の不快感を減らすことができるのである。
【００８１】
その際における上記誤認識の主な原因として、次の４項目
（Ａ）音声パワーの標準モデルからのずれ
（Ｂ）音声話速の標準モデルとのずれ
（Ｃ）話者の音響特性
（Ｄ）周辺雑音
を用いている。そのうちの要因（Ａ），（Ｂ）は上記発話スタイルである。したがって、本実施の形態によれば、誤認識となる原因を、話者の音響特性と発話スタイルと周辺雑音とに区別して利用者に知らせることができる。そのために、利用者は、誤認識となる要因が要因（Ａ），（Ｂ），（Ｄ）である場合には、音声認識時に的確に対応することが可能になる。
【００８２】
また、上記要因のうちの要因（Ａ）〜要因（Ｃ）と要因（Ｄ）との検出方法は少し異なっている。すなわち、利用者の発話区間内に埋もれた雑音の検出は非常に難しい。そのため、図６に示すように、利用者の発声前における無音区間によって周辺雑音の検出を行うのである。周辺雑音は略定常であると考えられ、このような検出方法でも問題はないと考えられる。
【００８３】
但し、利用者の発声区間内に、警笛や駅アナウンス等の突発ノイズが発生した場合には誤認識の要因となる。そして、このような突発ノイズは、要因（Ａ）〜要因（Ｃ）の総てのずれに同様に作用するため、突発ノイズを要因として特定することが困難である。そこで、本実施の形態においては、利用者の発話区間内において検出された要因（Ａ）〜要因（Ｃ）のずれが略同じである場合に、誤認識要因は突発的な雑音であると推定するのである。但し、その場合には、誤認識要因を提示せずに、メッセージ提示部１７によって再入力を促すメッセージを出力するようにしている。そして、音声の再入力があった場合には、上記閾値を更に小さく設定するようにしている。こうすることによって、突発的な雑音に対して頑健に誤り分析を行うことができ、ずれの度合いを強調することによって誤り分析結果が得易くなり、何度も利用者に発声させる手間が不要になるのである。
【００８４】
ところで、上記実施の形態における上記中央演算処理装置１による上記要因別ずれ算出手段，ずれ度合変換手段，要因検出手段および誤認識要因出力手段としての機能は、プログラム記録媒体に記録された音声処理プログラムによって実現される。上記実施の形態におけるプログラム記録媒体は、上記ＲＯＭでなるプログラムメディアである。または、上記外部補助記憶装置に装着されて読み出されるプログラムメディアであってもよい。尚、何れの場合においても、プログラムメディアから音声処理プログラムを読み出すプログラム読み出し手段は、上記プログラムメディアに直接アクセスして読み出す構成を有していてもよいし、上記ＲＡＭに設けられたプログラム記憶エリア（図示せず）にダウンロードし、上記プログラム記憶エリアにアクセスして読み出す構成を有していてもよい。尚、上記プログラムメディアから上記ＲＡＭのプログラム記憶エリアにダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。
【００８５】
ここで、上記プログラムメディアとは、本体側と分離可能に構成され、磁気テープやカセットテープ等のテープ系、フロッピーディスク，ハードディスク等の磁気ディスクやＣＤ（コンパクトディスク）‐ＲＯＭ，ＭＯ（光磁気）ディスク，ＭＤ（ミニディスク），ＤＶＤ（ディジタル多用途ディスク）等の光ディスクのディスク系、ＩＣ（集積回路）カードや光カード等のカード系、マスクＲＯＭ，ＥＰＲＯＭ（紫外線消去型ＲＯＭ），ＥＥＰＲＯＭ（電気的消去型ＲＯＭ），フラッシュＲＯＭ等の半導体メモリ系を含めた、固定的にプログラムを坦持する媒体である。
【００８６】
また、上記実施の形態における音声処理置は、インターネット等の通信ネットワークと通信Ｉ／Ｆを介して接続可能な構成を有している場合には、上記プログラムメディアは、通信ネットワークからのダウンロード等によって流動的にプログラムを坦持する媒体であっても差し支えない。尚、その場合における上記通信ネットワークからダウンロードするためのダウンロードプログラムは、予め本体装置に格納されているものとする。あるいは、別の記録媒体からインストールされるものとする。
【００８７】
尚、上記記録媒体に記録されるものはプログラムのみに限定されるものではなく、データも記録することが可能である。
【００８８】
【発明の効果】
以上より明らかなように、この発明は、入力された音声の特徴量に基づいて複数の誤認識の要因に関する特徴量を求め、各要因毎に上記特徴量の上記標準モデルからのずれの度合いを算出し、最もずれの度合いが大きい要因を検出して誤認識となる要因として出力するので、利用者に、誤認識となる原因を、例えば人間が直感的に理解し易い要因によって知らせることができる。したがって、音声認識の際に誤認識に至った場合に、利用者は何故誤認識となったのかを明確に知ることができる。したがって、利用者が、誤認識となった原因が分らずに不快な気分になることを回避することができるのである。
【００８９】
さらに、上記誤認識の要因に関する特徴量としてパワー，話速，話者性および周辺環境雑音の特徴量を求めるようにすれば、誤認識となる原因を、話者の音響特性と発話スタイルと周辺雑音とに区別して利用者に知らせることができる。したがって、利用者は、誤認識となる要因がパワー，話速および周辺環境雑音である場合には、音声認識時に的確に対応することが可能になる。
【００９０】
また、音声認識装置とは独立した構成となっているため、状況によっては、音声認識装置と組み合せて音声認識システムを構成することによって、音声認識の効率と認識率とを高めることができる。
【図面の簡単な説明】
【図１】この発明の音声処理装置におけるハードウェア構成を示す図である。
【図２】図１に示す音声処理装置の機能的構成を示すブロック図である。
【図３】図２における要因分析部およびメッセージ作成部によって実行される要因分析・メッセージ作成処理動作のフローチャートである。
【図４】図３に続く要因分析・メッセージ作成処理動作のフローチャートである。
【図５】図２におけるメッセージ提示部によって提示されるメッセージの一例を示す図である。
【図６】図２におけるセグメント分割部への入力音声の一例を示す図である。
【符号の説明】
１…中央演算処理装置、
２…記憶装置、
３…外部記憶装置、
４…入力装置、
５…出力装置、
１１…入力部、
１２…Ａ／Ｄ変換部、
１３…特徴抽出部、
１４…セグメント分割部、
１５…要因分析部、
１６…メッセージ作成部、
１７…メッセージ提示部、
１８…標準モデル格納部。
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a voice processing device, a voice processing method, a voice processing program, and a program recording medium used in a voice recognition system and the like.
[0002]
[Prior art]
At present, speech recognition systems achieve high word recognition performance, even in speaker-independent tasks, as long as the input is read speech, that is, speech read aloud from a transcribed text. This is because multi-speaker databases are available, so the acoustic characteristics of most speakers can be learned. It is also possible to learn a speaker's acoustic characteristics from a small number of speech samples by means of speaker adaptation techniques such as maximum a posteriori estimation (hereinafter abbreviated as MAP) and maximum likelihood linear regression (hereinafter abbreviated as MLLR).
[0003]
Here, the acoustic characteristics of a speaker are those that arise from differences in the physical characteristics of the vocal organs between speakers. For example, the speech spectrum differs from speaker to speaker because of differences in vocal tract length and the like. MAP and MLLR are described in detail in S. Young et al., "The HTK Book".
[0004]
However, recognition performance for speech uttered naturally and freely (hereinafter referred to as natural speech) is insufficient (Shinozaki et al., Proceedings of the Acoustical Society of Japan meeting, pp. 17-18, Mar. 2002). The difficulty of recognizing natural speech is said to be due largely to speaking style (Yamamoto et al., IEICE Transactions, pp. 2438-2447, Nov. 2000). Moreover, even when a model is trained on both natural speech and read speech, the recognition rate for natural speech drops considerably. This is thought to be because it is difficult to build models that cover every speaking rate, and because in natural speech vowels in particular tend not to be articulated clearly (they are slurred).
[0005]
To address the former cause, a multi-path hidden Markov model (hereinafter abbreviated as HMM) that separates the transition paths by speaking rate has been proposed, among other approaches (Li et al., Proceedings of the Acoustical Society of Japan meeting, pp. 89-90, Mar. 2002). However, recognition accuracy commensurate with the computational cost has not been obtained. To address the latter problem, one could train the acoustic model on natural speech using a speaker adaptation technique such as MAP or MLLR. Doing so, however, enlarges the feature space of the vowel models, so that even if the recognition rate for natural speech improves, the recognition accuracy for read speech may deteriorate.
[0006]
Here, speaking style refers not to differences in the physical characteristics of the vocal organs, as in the "acoustic characteristics of a speaker" above, but to acoustic characteristics that arise from the speaker's environment, culture, and the like. Examples include dialect, speaking fast, speaking slowly, and not articulating clearly.
[0007]
Furthermore, high recognition performance cannot be guaranteed in every noise environment. Good recognition performance can be obtained by modeling speech on which prerecorded noise has been superimposed (a matched model), but it is impossible to record the noise of every environment. For noise environments, therefore, adaptation is performed from a small amount of noise data using MAP, MLLR, or the like, as in speaker adaptation. Even then, recognition performance is inferior to that of the matched-model method. Moreover, it is undesirable for the user to perform environment adaptation indiscriminately, because the resulting state of the acoustic model cannot be predicted.
[0008]
A user can do nothing about the acoustic characteristics of his or her own voice, but can easily deal with surrounding noise and speaking style. For example, the user can move to a quiet place to avoid noise, and can speak in a standard manner to correct speaking style. Therefore, if it can be determined whether a misrecognition is caused by the speaker's acoustic characteristics, by speaking style, or by noise, and the result is reported to the user, the discomfort caused by misrecognition can be reduced. In addition, by not adapting to speaking style, useless adaptation processing that does not improve recognition performance can be avoided. Similarly, by notifying the user of an unsupported environment, useless environment adaptation can be avoided.
[0009]
However, many speech recognition systems do not even notify the user of the reason for a misrecognition, because it is difficult to explain the cause in terms that ordinary people can understand. Specifically, in a speech recognition system using HMMs, the recognition result is determined by the magnitude of a matching score based on the probabilistic distance between the standard model and feature vectors such as Mel-frequency cepstral coefficients (hereinafter abbreviated as MFCC), which contain information other than the phonemic content of the input speech. The cause of a misrecognition therefore cannot be tied completely (in one-to-one correspondence) to phonetic knowledge.
[0010]
For recognition systems based on a physical distance measure between input speech and standard speech, a speech recognition device has been proposed that, although it does not report the reason for misrecognition as described above, trains the user to produce standard utterances (see, for example, Patent Document 1).
[0011]
In addition, there is the following speech recognition method and apparatus for performing the above-mentioned misrecognition reason notification (see Patent Document 2). In this voice recognition method and apparatus, when a voice is input, the input voice is analyzed by a voice recognition task and compared with voice data registered in advance to detect a match. At this time, if the recognition result is "NG", a display indicating that the recognition is NG and a reason code are displayed.
[0012]
Also, in conventional speech recognition systems capable of speaker adaptation, the distinction between the speaker's acoustic characteristics and speaking style is not made explicit, so speaking style and the surrounding environment end up being learned in the same way as the speaker's acoustic characteristics. For example, there are a speech recognition device and an automatic speech recognition device that use a speaker adaptation technique to adapt only to highly reliable subwords (see Patent Document 3). In these devices, model adaptation is performed only on highly reliable subwords for which the likelihood measure of the recognition result is equal to or greater than a threshold, thereby reducing the degradation in recognition performance caused by adaptation.
[0013]
[Patent Document 1]
JP-A-01-285998
[Patent Document 2]
JP 2000-112497 A
[Patent Document 3]
JP 2000-181482 A
[0014]
[Problems to be solved by the invention]
However, the above-described conventional speech recognition device and speech recognition method have the following problems.
[0015]
That is, first, the speech recognition device disclosed in Patent Document 1 can neither distinguish speaking style from the acoustic characteristics of the speaker nor adapt to the surrounding environment. The device has a recognition mode, in which recognition is performed, and a registration mode, in which syllable feature patterns of designated words uttered by the speaker are created and registered. In the registration mode, the device indicates the word to be uttered together with the way to utter it so that it is recognized correctly (that is, the reason it is prone to misrecognition). However, because the registration mode is separate from the recognition mode, the speaker cannot be notified of the reason when a misrecognition occurs in the recognition mode, and the cause of an error cannot be reported when arbitrary sentences are input by voice.
[0016]
Further, the speech recognition method and apparatus disclosed in Patent Document 2 report reason information when recognition of the input speech fails, but the notification goes no further than messages such as "no registered speech data to compare" or "input volume too high". In addition, no means or method for obtaining the reason for a misrecognition is disclosed, and it is unclear how the reason would be obtained when a misrecognition results from several overlapping factors. Therefore, there is a problem that the user cannot be given an adequate reason for the misrecognition.
[0017]
Further, the speech recognition device and automatic speech recognition device disclosed in Patent Document 3 perform model adaptation on subwords whose confidence measure is equal to or greater than a threshold, but in practice it is very difficult to define the confidence measure and to choose its threshold. For example, if the confidence threshold is set too low, degradation of recognition performance due to adaptation can be prevented, but the probability that adaptation is performed falls, so little benefit is obtained from adaptation. Such a trade-off is very difficult to judge.
[0018]
Furthermore, these devices cannot distinguish whether a misrecognition is caused by acoustic characteristics, speaking style, or the surrounding environment. Consequently, when the likelihood measure is at or above the threshold and recognition confidence is high, they attempt to adapt to speaking style and the surrounding environment as well. As noted above, however, the recognition rate deteriorates even when natural speech is used for training, so adapting to speaking style likewise degrades the recognition rate and results in wasted computation. In addition, because there is no means of obtaining and reporting the reason for a misrecognition or for low confidence, the user may experience discomfort.
[0019]
Therefore, an object of the present invention is to provide a voice processing device, a voice processing method, a voice processing program, and a program recording medium that can determine a cause of erroneous recognition and notify a user of the cause.
[0020]
[Means for Solving the Problems]
In order to achieve the above object, the speech processing apparatus of the present invention comprises: factor-specific deviation calculating means which, when a feature quantity of input speech is compared with a standard model, obtains feature quantities relating to a plurality of misrecognition factors on the basis of the feature quantity of the input speech and calculates, for each factor, the degree of deviation of that feature quantity from the standard model; deviation degree converting means which determines whether the calculated degree of deviation is within a threshold representing an allowable range and, if it is, converts the degree of deviation to a predetermined value indicating that it is within the allowable range; factor detecting means which detects the factor with the largest degree of deviation on the basis of the calculated and converted degrees of deviation; and misrecognition factor output means which outputs the detected factor with the largest deviation as the factor causing misrecognition.
[0021]
According to the above configuration, feature amounts relating to factors of erroneous recognition, for example factors that are easy for a human to understand intuitively, are obtained based on the feature amount of the input speech waveform. The factor whose feature amount deviates most from the standard model is then detected as the cause of erroneous recognition and output to the user. By notifying the user of the cause of the misrecognition in this way, the user's discomfort when misrecognition occurs is reduced.
[0022]
Further, in the voice processing device of one embodiment, when there are a plurality of detected factors having the largest deviation, the erroneous recognition factor output means does not output the erroneous recognition factor and instead outputs a message prompting the user to input the voice again.
[0023]
When there are a plurality of factors having the largest deviation, sudden noise has often occurred. According to this embodiment, prompting re-input in such a case makes the factor analysis robust against sudden noise.
[0024]
Further, the voice processing device of one embodiment includes threshold changing means for changing the threshold value indicating the allowable range so that the allowable range becomes narrower when voice is input again in response to the message output by the erroneous recognition factor output means.
[0025]
According to this embodiment, the threshold value indicating the allowable range is changed so as to narrow the allowable range, so that the degree of deviation is emphasized. Therefore, a result of analyzing the cause of the misrecognition is obtained more easily, and the user is spared the trouble of inputting the voice repeatedly.
[0026]
Further, in the voice processing device of one embodiment, when the detected factor with the largest deviation is the same as the factor detected for the previous speech input, the erroneous recognition factor output means outputs the factor with the second largest deviation as the cause of the erroneous recognition.
[0027]
According to this embodiment, it is possible to reduce the discomfort of the user by not giving the same instruction to the user many times.
[0028]
In the speech processing apparatus of one embodiment, the standard model is represented by a probability function, and the factor-based deviation calculating means obtains feature amounts of power, speech speed, speaker characteristics, and surrounding environmental noise as the feature amounts relating to the factors of erroneous recognition, and, for each factor, calculates the degree of deviation from the standard model using a probability value obtained from the feature amount of the factor in the probability function representing the standard model.
[0029]
According to this embodiment, based on the feature amount of the input speech waveform, a feature amount related to a factor of erroneous recognition that is easy for a human to intuitively understand is obtained. Further, by expressing the degree of the deviation by the cumulative probability value, it is possible to compare the degree of the deviation between different factors by the probability value. Therefore, it is possible to detect the cause of the largest deviation without performing special normalization on the value of the degree of deviation.
[0030]
Further, in the speech processing method of the present invention, when comparing the feature amount of the input speech with the standard model, feature amounts relating to a plurality of factors of erroneous recognition are obtained based on the feature amount of the input speech, and the degree of deviation of the feature amount from the standard model is calculated for each factor. It is then determined whether the calculated degree of deviation is within a threshold representing an allowable range, and if it is within the threshold, the degree of deviation is converted into a predetermined value indicating that the deviation is within the allowable range. Based on the calculated degrees of deviation and the converted degrees of deviation, the factor having the largest deviation is detected, and the detected factor is output as the factor causing erroneous recognition.
[0031]
According to the above configuration, the cause of erroneous recognition is notified to the user in terms of, for example, a factor that is intuitively easy for a human to understand, so that the user's discomfort when erroneous recognition occurs is reduced.
[0032]
Further, the audio processing program of the present invention causes a computer to function as a factor-based deviation calculating unit, a deviation degree converting unit, a factor detecting unit, and a misrecognition factor outputting unit in the audio processing device of the present invention.
[0033]
According to the above configuration, the cause of erroneous recognition is notified to the user in terms of, for example, a factor that is intuitively easy for a human to understand, so that the user's discomfort when erroneous recognition occurs is reduced.
[0034]
Further, the program recording medium of the present invention stores the audio processing program of the present invention.
[0035]
According to the above configuration, when the program is read and executed by a computer, the cause of the misrecognition is presented to the user in terms of, for example, a factor that is easy for a human to understand intuitively. In this way, the user's discomfort when erroneous recognition occurs is reduced.
[0036]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the illustrated embodiments. FIG. 1 is a diagram illustrating a hardware configuration of the audio processing device according to the present embodiment.
[0037]
In FIG. 1, reference numeral 1 denotes a central processing unit that performs processing such as numerical calculation and control, and carries out calculation and processing according to the processing procedure described in the present embodiment. Reference numeral 2 denotes a storage device including a RAM (random access memory), a ROM (read only memory), and the like, which stores the processing procedure (voice processing program) executed by the central processing unit 1 and the temporary data needed for that processing. Reference numeral 3 denotes an external storage device configured by a hard disk or the like, which stores a standard pattern (template) for voice processing, a standard model, and the like. Reference numeral 4 denotes an input device including a microphone, a keyboard, and the like, which inputs a voice uttered by a user or a character string entered by key. Reference numeral 5 denotes an output device including a display, a speaker, and the like, which outputs an analysis result or information obtained by processing the analysis result. Reference numeral 6 denotes a bus, which connects the devices from the central processing unit 1 through the output device 5 to one another. The hardware configuration of the audio processing device may include a communication I/F connected to a communication network such as the Internet, in addition to the configuration illustrated in FIG. 1.
[0038]
Note that in this embodiment the audio processing device and the audio processing program are standalone, but they may be incorporated as a part of another device or as a part of another program. In that case, input is performed indirectly via the other device or program.
[0039]
Hereinafter, processing executed in the present embodiment based on the above hardware configuration will be described.
[0040]
FIG. 2 is a block diagram illustrating a functional configuration of the audio processing device according to the present embodiment. A user's voice and its label (a text description of the utterance content) are input from the input unit 11. The input voice is then digitized by the A/D converter 12. At this time, the input text remains unchanged.
[0041]
The digitized signal is converted by the feature extraction unit 13 into an MFCC vector for each fixed time section (frame). For a detailed method of obtaining the MFCC, see S. Young et al., "The HTK Book", mentioned above. MFCC is only one of the feature analysis methods, and the same applies when linear prediction filter coefficients or the like are used instead.
[0042]
Note that, so that the voice processing device and the voice processing program can easily be incorporated into another device or program as described above, the feature extraction unit 13 can also accept, directly from an external device, parameters from which features have already been extracted. In this case, the same feature analysis method must be used for the parameters input from the external device and for the standard model described later. For example, when the pattern of the standard model is represented by MFCC, the feature amount of the input parameter also needs to be represented by MFCC. At this time, the input text remains unchanged.
[0043]
The MFCC vector sequence extracted by the feature extraction unit 13 is divided by the segment division unit 14 into segments for each phoneme using a set of standard models stored in the standard model storage unit 18. The division into segments for each phoneme is performed as follows.
[0044]
That is, when the standard model is an HMM, let a_ij be the probability of transition from state i to state j of the HMM, and let b_j(O_t) be the probability of observing the feature vector O_t of frame t in state j of the HMM. The log likelihood L_N(T) of reaching the final state N of the HMM at frame T is obtained by the recursion

$$L_j(t) = \max_i \left[ L_i(t-1) + \log a_{ij} \right] + \log b_j(O_t)$$

according to the Viterbi algorithm. The state numbers of all frames on the path that yields L_N(T) (that is, the path that reaches the final state N at the last frame T) are stored, and the feature vector sequence is divided into phoneme units by associating the stored state numbers with the feature vectors (MFCC vectors).
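The alignment described above can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names `viterbi_align` and `split_by_phoneme`, the assumption that the path starts in state 0, and the state-to-phoneme table are all hypothetical, and observation log probabilities are assumed to be precomputed.

```python
import math

def viterbi_align(log_a, log_b):
    """Return (L_N(T), state path) for one utterance.

    log_a[i][j]: log transition probability from state i to state j.
    log_b[t][j]: log probability of observing frame t in state j.
    Assumption: the path starts in state 0 and ends in the last state.
    """
    T, N = len(log_b), len(log_a)
    L = [[-math.inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    L[0][0] = log_b[0][0]
    for t in range(1, T):
        for j in range(N):
            # Best predecessor state for state j at frame t.
            best_i = max(range(N), key=lambda i: L[t - 1][i] + log_a[i][j])
            L[t][j] = L[t - 1][best_i] + log_a[best_i][j] + log_b[t][j]
            back[t][j] = best_i
    # Trace back the stored state numbers from the final state.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return L[T - 1][N - 1], path

def split_by_phoneme(path, state_phoneme):
    """Group consecutive frames by the phoneme that owns their state.

    Returns (phoneme, start_frame, end_frame_exclusive) tuples. Adjacent
    repetitions of the same phoneme would be merged in this simple sketch.
    """
    segments = []
    for t, s in enumerate(path):
        ph = state_phoneme[s]
        if segments and segments[-1][0] == ph:
            segments[-1][2] = t + 1
        else:
            segments.append([ph, t, t + 1])
    return [tuple(seg) for seg in segments]
```

Working in the log domain turns the products of probabilities into sums, which avoids underflow on long utterances.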
[0045]
If the above method is difficult to follow, it may be carried out with reference to S. Young et al., "The HTK Book", mentioned above.
[0046]
The feature vector sequence divided into segments for each phoneme is input to the factor analysis unit 15 together with its text label. The factor analysis unit 15 then examines the factors that cause erroneous recognition. The message creation unit 16 creates the character string of a message to be presented to the user according to the analysis result of the factor analysis unit 15. Finally, the message presenting unit 17 notifies the user, for example by displaying the message on the display of the output device 5 based on the created character string, or by converting the message into synthesized speech with a built-in text-to-speech synthesizing unit and outputting it from the speaker.
[0047]
However, when the voice processing device and the voice processing program are incorporated as a part of another device or another program, the message presenting unit 17 returns the created character string to that device or program.
[0048]
That is, the A/D conversion unit 12, the feature extraction unit 13, the segment division unit 14, the factor analysis unit 15, the message creation unit 16, and a part of the message presentation unit 17 are configured by the central processing unit 1; the input unit 11 is configured by the input device 4; the rest of the message presentation unit 17 is configured by the output device 5; and the standard model storage unit 18 is configured by the external storage device 3. In addition to the processing operations of the units 12 to 17 according to the present embodiment, the central processing unit 1 also performs various other processing operations such as calculation and judgment processing, timekeeping processing, and input/output processing.
[0049]
Hereinafter, the analysis of the erroneous recognition factor by the factor analysis unit 15 and the creation of the message by the message creation unit 16 will be described in detail. FIGS. 3 and 4 are flowcharts of the factor analysis / message creation processing operation executed by the factor analysis unit 15 and the message creation unit 16. Steps S20 and S22 are processes performed by the message creating unit 16, and the other steps are processes performed by the factor analyzing unit 15.
[0050]
When the segment division by the segment division unit 14 is completed, the factor analysis / message creation processing operation starts. Then, first, in step S1, it is determined whether or not there is an input from the segment dividing unit 14. If there is an input, the process proceeds to step S2. In step S2, the feature vector divided and labeled for each segment from the segment dividing unit 14 is fetched. In step S3, it is determined whether or not the input is the first input based on the value of the counter that counts and stores the number of continuous inputs from the segment division unit 14. As a result, if it is the first input, the process proceeds to step S5; otherwise, the process proceeds to step S4.
[0051]
In step S4, the threshold used later to determine whether the degree of separation between the feature vector and the standard model is within an allowable range is changed from the standard threshold used at the first input to a threshold corresponding to the number of inputs. Here, the threshold is set so as to decrease gradually from the standard threshold as the number of inputs increases. The threshold is also used to set the degree of deviation to a predetermined value when the degree of deviation (separation) between the feature vector and the standard model is within the allowable range, and it is set in advance and stored in the external storage device 3 or the like. Since the threshold depends on the recognition performance of the speech recognition system, it is determined experimentally in advance from feature vectors obtained from utterances of speakers whose recognition rate is 95% or more.
[0052]
In step S5, the degree of separation between the segment of the non-speech section before the utterance input and the noise model is calculated. The noise model is obtained by training on previously recorded noise and is stored in the standard model storage unit 18. The degree of separation between the segment (feature vector sequence) of the non-speech section and the noise model is obtained as the cumulative probability value of the log likelihood of observing the feature vector sequence of the non-speech section given the noise model.
[0053]
Specifically, let Mn be the noise model and X the noise feature vector sequence. If L(X|Mn) is the log likelihood of observing the input feature vector sequence X given the noise model Mn, and T is the number of frames (continuation length) of the noise feature vector sequence, then the cumulative probability value Sn of the log likelihood x (= L(X|Mn)/T) normalized by the continuation length T (hereinafter referred to as the normalized log likelihood) is expressed by the following equation.

$$S_n = \int_a^b N_n(x; \mu_n, \sigma_n)\,dx$$

Here, Nn(x; μn, σn) is a single Gaussian distribution having a mean value μn and a variance value σn for the random variable x, and is estimated in advance from learning data. The range of integration in the equation is set as follows. When the normalized log likelihood of the input noise is smaller than μn, "a" is the minimum value of the normalized log likelihood in the learning data and "b" is the normalized log likelihood of the input noise. When the normalized log likelihood of the input noise is greater than μn, "a" is the normalized log likelihood of the input noise and "b" is the maximum value of the normalized log likelihood in the learning data. The probability density function is represented as a single Gaussian distribution only to reduce the amount of calculation; a Gaussian mixture distribution or the like may be used instead.
[0054]
The smaller the cumulative probability value Sn is, the farther the normalized log likelihood of the input noise lies from the mean μn of the single Gaussian distribution of the normalized log likelihood x of the learning data. In other words, a small value indicates that the features of the input noise deviate greatly from the learned noise model.
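The cumulative probability value over the one-sided integration range can be sketched as below. This is a minimal sketch under stated assumptions: the names `gaussian_cdf` and `cumulative_prob` are hypothetical, and `sigma` is taken to be a standard deviation, whereas the text calls σn a variance value, so a square root would be needed if a variance is stored.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of a single Gaussian; sigma is assumed to be a standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cumulative_prob(x, mu, sigma, train_min, train_max):
    """Probability mass between the input value x and the training extreme
    on the same side of the mean, following the integration range [a, b]."""
    if x < mu:
        a, b = train_min, x      # input is below the mean
    else:
        a, b = x, train_max      # input is at or above the mean
    # Clamp at zero in case x falls outside the range seen in training.
    return max(0.0, gaussian_cdf(b, mu, sigma) - gaussian_cdf(a, mu, sigma))
```

With this definition the value is close to 0.5 when the input sits at the mean and shrinks toward 0 as the input moves into either tail, which is what lets a single number express the distance from the model.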
[0055]
In step S6, it is determined whether the calculated degree of separation (accumulated probability value Sn) between the segment (input noise) of the non-voice section and the noise model is smaller than the threshold set in step S4 or the standard threshold. That is, it is determined whether or not the characteristics of the input noise are largely deviated from the noise model. As a result, if there is a large deviation, the maximum likelihood state path obtained by the Viterbi algorithm is not reliable, and the process proceeds to step S20. On the other hand, if there is no deviation, the process proceeds to step S7.
[0056]
In step S7, the degree of deviation of the power of the input voice from the standard distribution is calculated. The degree of separation in this case is obtained as the cumulative probability value of the average value of the power of the feature vector, as in the case of step S5.
[0057]
Specifically, first, the power of the feature vectors of the input speech is averaged for each state of the HMM. Next, the cumulative probability value Sp of the power of the phoneme is obtained by approximating it with the median of the cumulative probability values of the individual states i_n, as expressed by the following equation.

$$S_p = \operatorname*{median}_{n} \int_a^b N_{p_{i_n}}(p_{i_n}; \mu_{p_{i_n}}, \sigma_{p_{i_n}})\,dp_{i_n}$$

Here, Np_in(p_in; μp_in, σp_in) is a single Gaussian distribution having a mean value μp_in and a variance σp_in for the random variable p_in, which is the average of the power assigned to state i_n of the HMM, and is estimated in advance from learning data. The range of integration in the equation is set as follows. When the average of the power assigned to state i_n in the input voice is smaller than μp_in, "a" is the minimum value of the power in the learning data and "b" is the average power of the input voice. When the average of the power assigned to state i_n in the input voice is greater than μp_in, "a" is the average power of the input voice and "b" is the maximum value of the power in the learning data. The probability density function is expressed as a single Gaussian distribution only to reduce the amount of calculation; a Gaussian mixture distribution or the like may be used instead.
[0058]
As described above, the stochastic process is regarded as independent for each state, and the cumulative probability value Sp of the phoneme power is approximated by the median of the cumulative probability values of the individual states. This makes it unnecessary to perform the complicated estimation and integration of the joint probability density function Prob(i1, i2, ..., in) whose random variables are the powers of the individual states of the phoneme.
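The per-state averaging and median approximation can be sketched as follows. The names `tail_prob` and `phoneme_power_score` and the layout of `state_stats` are assumptions made for illustration, and `sigma` is again treated as a standard deviation.

```python
import math
from statistics import median

def tail_prob(x, mu, sigma, lo, hi):
    """One-sided Gaussian mass between x and the training extreme (sketch)."""
    def cdf(v):
        return 0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0))))
    a, b = (lo, x) if x < mu else (x, hi)
    return max(0.0, cdf(b) - cdf(a))

def phoneme_power_score(frame_power, frame_state, state_stats):
    """Approximate Sp for one phoneme segment.

    frame_power: power of each frame in the segment.
    frame_state: HMM state index assigned to each frame by the alignment.
    state_stats: hypothetical table {state: (mu, sigma, train_min, train_max)}.
    """
    by_state = {}
    for p, s in zip(frame_power, frame_state):
        by_state.setdefault(s, []).append(p)
    # Average the power per state, score each state independently,
    # then take the median instead of integrating a joint density.
    scores = [
        tail_prob(sum(vals) / len(vals), *state_stats[s])
        for s, vals in by_state.items()
    ]
    return median(scores)
```

Using the median rather than a product keeps one outlying state from dominating the phoneme-level value.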
[0059]
The larger the cumulative probability value Sp is, the closer to the standard utterance style is. Also, from the range of integration, it is possible to determine whether the power is smaller than the standard utterance style (the average value of the input power <μp_in) or larger (the average value of the input power> μp_in).
[0060]
In step S8, if the calculated degree of separation between the power of the input voice and the standard distribution (cumulative probability value Sp) is larger than the threshold value set in step S4 or the standard threshold value, the cumulative probability value Sp is converted to the constant "1" and output. By this processing, when the difference between the power of the input voice and the standard model is small, the difference can be ignored.
[0061]
In step S9, the degree of separation of the input voice from the standard distribution of the speech speed is calculated. The degree of separation in this case is obtained as a cumulative probability value of the continuation length, as in the case of step S5.
[0062]
Specifically, first, the continuation length T is calculated from the total number of frames of the feature vectors belonging to the input phoneme segment. This continuation length T is the time taken to utter the phoneme, and its reciprocal represents the speech speed. Next, the cumulative probability value ST of the continuation length is obtained by the following equation.

$$S_T = \int_a^b P(x; \lambda)\,dx$$

Here, P(x; λ) is a Poisson distribution having an average value λ for the random variable x, and is estimated in advance from learning data. The range of integration in the equation is set as follows. When the continuation length of the phoneme of the input speech satisfies T < λ, "a" is the minimum value in the learning data and "b" is T. When the continuation length of the phoneme of the input speech satisfies T > λ, "a" is T and "b" is the maximum value in the learning data.
[0063]
The larger the cumulative probability value ST is, the closer the continuation length T is to the standard distribution. Further, from the range of integration, it is possible to determine whether the speech speed is faster (continuous length T <λ) or slower (continuous length T> λ) than the standard utterance style.
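A sketch of the duration score follows. Because the Poisson distribution is discrete, this sketch replaces the integral with a sum over frame counts, which is an assumption on our part; the function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of k frames, computed in the log domain."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def duration_score(T, lam, train_min, train_max):
    """Approximate ST for a phoneme lasting T frames (T is an integer).

    The range follows the text: from the training minimum up to T when
    T < lam (faster than standard), otherwise from T up to the training maximum.
    """
    if T < lam:
        a, b = train_min, T
    else:
        a, b = T, train_max
    return sum(poisson_pmf(k, lam) for k in range(a, b + 1))
```

For example, with a mean of 8 frames, a 3-frame phoneme scores far lower than an 8-frame one, and the branch taken tells whether the speaker was too fast or too slow.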
[0064]
In step S10, if the calculated degree of separation between the speech speed of the input voice and the standard distribution (cumulative probability value ST) is larger than the threshold value set in step S4 or the standard threshold value, the cumulative probability value ST is converted to the constant "1" and output. By this processing, when the difference between the speech speed of the input voice and the standard model is small, the difference can be ignored.
[0065]
In step S11, the degree of departure from the standard distribution of acoustic characteristics (speaker characteristics) of the input speaker is calculated. The degree of separation in this case is obtained as the cumulative probability value of the log likelihood of observing the input feature vector sequence when the standard model is given, as in the case of step S5.
[0066]
Specifically, let Ms be the standard model and X the input feature vector sequence. If L(X|Ms) is the log likelihood of observing the input feature vector sequence X given the standard model Ms, and T is the number of frames (continuation length) of the input feature vector sequence X, then the cumulative probability value Ss of the normalized log likelihood y (= L(X|Ms)/T) normalized by the continuation length T is expressed by the following equation.

$$S_s = \int_a^b N_s(y; \mu_s, \sigma_s)\,dy$$

Here, Ns(y; μs, σs) is a single Gaussian distribution having a mean value μs and a variance value σs for the random variable y, and is estimated in advance from learning data. The range of integration in the equation is set as follows. When the normalized log likelihood of the input feature vector is less than μs, "a" is the minimum value of the normalized log likelihood in the learning data and "b" is the normalized log likelihood of the input feature vector. When the normalized log likelihood of the input feature vector is greater than μs, "a" is the normalized log likelihood of the input feature vector and "b" is the maximum value of the normalized log likelihood in the learning data. The probability density function is represented as a single Gaussian distribution only to reduce the amount of calculation; a Gaussian mixture distribution or the like may be used instead.
[0067]
The larger the cumulative probability value Ss is, the closer the acoustic characteristics of the input speaker are to those of the standard speaker. However, unlike the power and the speech speed of the input voice, which belong to the utterance style, the range of integration carries no particular meaning here.
[0068]
In step S12, if the calculated degree of separation between the acoustic characteristics of the input speaker and the standard distribution (cumulative probability value Ss) is larger than the threshold value set in step S4 or the standard threshold value, the cumulative probability value Ss is converted to the constant "1" and output. By this processing, when the difference between the acoustic characteristics of the input speaker and the standard model is small, the difference can be ignored.
[0069]
In step S13, the cumulative probability values Sp, ST, and Ss set in steps S8, S10, and S12 are compared directly, and the factor having the smallest value, that is, the factor farthest from the standard model, is determined to be the cause of the recognition error. If the cumulative probability values of all the factors have been converted to 1 in steps S8, S10, and S12, the processing of this step is not performed. In step S14, an analysis message for the segment is created based on the determination result of step S13, associated with the label name of the segment and the cumulative probability values Sp, ST, and Ss of the factors, and stored in the RAM of the storage device 2 or the like. The analysis message is created by embedding the determination result of step S13 into fixed keywords as shown in <Detailed Information> in FIG. 5. If there is no determination result, no analysis message is created.
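Steps S8, S10, S12, and S13 for a single segment can be sketched together as below. The function name, the factor names used as dictionary keys, and the use of one shared threshold for all factors are illustrative assumptions.

```python
def detect_segment_factor(scores, threshold):
    """Clamp in-tolerance scores to 1 and pick the most deviating factor.

    scores: hypothetical mapping such as {"power": Sp, "speed": ST, "speaker": Ss}.
    Returns (factor or None, clamped scores). None means every factor was
    within tolerance, so no per-segment determination is made.
    """
    # S8/S10/S12: a score above the threshold counts as "no deviation".
    clamped = {name: (1.0 if s > threshold else s) for name, s in scores.items()}
    if all(v == 1.0 for v in clamped.values()):
        return None, clamped
    # S13: the smallest cumulative probability is the largest deviation.
    worst = min(clamped, key=clamped.get)
    return worst, clamped
```

Because every factor is expressed as a probability value, the comparison needs no further normalization between factors.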
[0070]
In step S15, it is determined whether the input of all segments has been completed. As a result, when the process is completed, the process proceeds to step S16, and otherwise, the process returns to step S7 to shift to the process of the next segment.
[0071]
In step S16, based on the cumulative probability values Sp, ST, and Ss of all segments stored in the RAM or the like of the storage device 2 in step S14, the score Si_total (joint probability) of the entire utterance for each factor i is obtained by the following equation, where S_i^{(k)} denotes the cumulative probability value of factor i for the k-th segment and K is the number of segments.

$$S_{i\_total} = \prod_{k=1}^{K} S_i^{(k)}$$

Then, the factor whose score Si_total for the entire utterance is smallest is stored in the buffer.
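The utterance-level score can be sketched as a product over segments. The function name and the dictionary layout are assumptions, and treating the joint probability as a plain product presumes the segments are independent.

```python
import math

def utterance_scores(segment_scores):
    """Multiply each factor's clamped score over all segments.

    segment_scores: list of per-segment dicts sharing the same factor keys,
    for example [{"power": 1.0, "speed": 0.05, "speaker": 1.0}, ...].
    """
    factors = segment_scores[0].keys()
    return {f: math.prod(seg[f] for seg in segment_scores) for f in factors}
```

Since in-tolerance segments contribute a factor of 1, only the segments that actually deviate lower a factor's total. For long utterances, summing logarithms instead of multiplying would be a safer choice against underflow.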
[0072]
In step S17, it is determined whether or not all the factors have the same score, based on the result of calculating the score of the entire utterance in step S16. If all the factors have the same score, the process proceeds to step S21; otherwise, the process proceeds to step S18. In step S18, it is determined whether or not the factor determined in step S16 is the same as the erroneous recognition factor determined for the previous input. If they are the same, the process proceeds to step S19, and if they are different, the process proceeds to step S20. In the case of the first input, however, all the buffers have been initialized, so the determination result in this step is false (NO). In step S19, the factor of the misrecognition of the entire utterance is changed to the factor with the next (second) smallest score. This prevents the same factor from being presented to the user repeatedly.
[0073]
In step S20, the factor of the misrecognition, that is, the factor having the minimum score is presented to the user in the form of a message as shown in the upper half of FIG. At this time, as shown in <Detailed Information> in FIG. 5, if necessary, the analysis message created in step S14 is also presented. However, if the process branches from step S6 to this step, it is presented that the cause of the error is noise. After that, the number of inputs is initialized to 0, and the factor analysis / message creation processing operation for the current input voice is terminated.
[0074]
In step S21, the cause of the erroneous recognition is estimated. For example, when the cumulative probability values of all the factors have been converted into the constant "1" in steps S8, S10, and S12, the scores of all the factors become the same, which means that no factor deviates particularly from the standard model. Such a situation often arises when sudden noise occurs. Therefore, in this step, the cause of the erroneous recognition is estimated to be sudden noise. In step S22, a message is presented that asks the user to confirm whether sudden noise occurred and prompts the user to input the voice again. After that, the number of inputs is incremented, and the process returns to step S1 to wait for re-input of the same voice.
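The branching of steps S17 to S22 can be sketched as a small decision function. The name `choose_factor` and the convention of returning `None` to signal "suspect sudden noise and ask for re-input" are assumptions for illustration only.

```python
def choose_factor(totals, previous=None):
    """Decide which factor to present for the whole utterance.

    totals: utterance-level score per factor (smaller means larger deviation).
    previous: factor presented for the previous input, or None on first input.
    Returns None when all scores are equal (S21/S22: prompt re-input).
    """
    values = list(totals.values())
    if all(v == values[0] for v in values):       # S17: all scores equal
        return None
    ranked = sorted(totals, key=totals.get)       # smallest score first
    if ranked[0] == previous:                     # S18: same as last time
        return ranked[1]                          # S19: second smallest
    return ranked[0]                              # S20: present this factor
```

A caller would increment its input counter and tighten the threshold when `None` comes back, and reset the counter once a factor has been presented.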
[0075]
In this way, by counting the number of inputs and narrowing the standard threshold (that is, the standard range) according to the number of inputs, the scores of all the factors are prevented from being equal, and the situation in which no error cause can be identified is avoided. In other words, giving the user some result regarding the cause of the misrecognition reduces the user's discomfort.
[0076]
The speech processing device configured and operating as described above is used, for example, by being incorporated in a speech recognition system as follows. The feature vector sequence of the input speech and its label are input to the feature extraction unit 13 from the feature extraction unit on the main body side of the speech recognition system. The factors causing the misrecognition are then analyzed by the segment dividing unit 14 and the factor analyzing unit 15 as described above, and a message presenting those factors is created by the message creating unit 16. This message is returned to the system main body by the message presenting unit 17. When recognition of the input voice fails, the system main body then displays, on its own output device, the message about the misrecognized voice returned from the voice processing device. Further, when the factors causing the erroneous recognition are the utterance style or the surrounding noise, useless adaptation can be avoided.
[0077]
By doing so, the user can more specifically know the cause of the misrecognition and the decrease in reliability, and can immediately respond if the cause is related to the utterance style. Further, it is possible to eliminate the discomfort caused by the fact that the cause of the erroneous recognition or the decrease in the reliability is unknown.
[0078]
The functions of the speech recognition system incorporating the speech processing device described above can also be achieved by incorporating the speech processing program into the speech recognition program of the speech recognition device. Of course, the present speech processing device can also be used independently of the speech recognition device, so that the user of the speech recognition device is informed beforehand of the causes of misrecognition that may occur at the time of speech recognition. In this case, the user of the speech recognition device knows in advance that his or her utterance style differs from the standard, so that later speech recognition can be performed efficiently.
[0079]
As described above, in the embodiment, the segment division unit 14 divides the feature vector sequence of the input speech into segments for each phoneme by comparison with the standard model. The factor analysis unit 15 then obtains feature amounts relating to a plurality of factors based on the feature vector sequence of each segment, calculates for each factor the degree of deviation between the feature amount and the standard model, and determines whether the calculated degree of deviation is within the allowable range based on a threshold that is narrowed according to the number of inputs. If it is within the allowable range, the degree of deviation is converted to "1". After that, the factor with the largest deviation is detected from the determination results of the factors. The message presenting unit 17 then presents the factor having the largest deviation based on the detection result.
[0080]
Therefore, for example, by extracting from the feature vectors of the speech waveform factors of erroneous recognition that are easy for a human to understand intuitively, and detecting the factor with the largest deviation, it is possible to estimate what may cause erroneous recognition. The user can thus be informed of the cause of the misrecognition, and the user's discomfort can be reduced.
[0081]
At this time, the following four main factors are used:
(A) Deviation of speech power from the standard model
(B) Deviation of speech rate from the standard model
(C) Speaker acoustic characteristics
(D) Ambient noise
Factors (A) and (B) relate to utterance style. Therefore, according to the present embodiment, the user can be notified of the cause of erroneous recognition with the speaker's acoustic characteristics, the utterance style, and the ambient noise distinguished from one another. For this reason, when the factor causing erroneous recognition is (A), (B), or (D), the user can take appropriate action for speech recognition.
[0082]
Further, among the above factors, the method of detecting factor (D) differs slightly from that of factors (A) to (C). That is, it is very difficult to detect noise buried in the user's utterance section. Therefore, as shown in FIG. 6, the ambient noise is detected in the silent section before the user utters. Ambient noise is considered to be substantially stationary, so this detection method is considered to pose no problem.
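As an illustration of measuring ambient noise in the silent section preceding the utterance, the following Python sketch averages the power of the frames before the utterance begins. It is a sketch under stated assumptions, not the method of the embodiment: the source does not specify how the noise feature is computed, so the mean frame power, the function name `ambient_noise_power`, and the inputs `frame_powers` and `utterance_start` are all hypothetical.

```python
from typing import Sequence


def ambient_noise_power(frame_powers: Sequence[float], utterance_start: int) -> float:
    """Average the power of the frames that precede the utterance.

    frame_powers    : per-frame power of the whole input (hypothetical input)
    utterance_start : index of the first frame of the utterance (hypothetical)
    """
    silent = frame_powers[:utterance_start]
    if len(silent) == 0:
        raise ValueError("no silent frames before the utterance")
    # Ambient noise is assumed stationary, so a simple mean is used here.
    return sum(silent) / len(silent)


if __name__ == "__main__":
    # Two quiet frames followed by two speech frames (made-up values).
    print(ambient_noise_power([2.0, 4.0, 50.0, 60.0], utterance_start=2))
```

Because the estimate relies on the noise being stationary, it says nothing about noise that appears only during the utterance itself.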
[0083]
However, if a sudden noise such as a car horn or a station announcement occurs during the user's utterance section, it may cause erroneous recognition. Since such sudden noise affects the deviations of all of factors (A) to (C), it is difficult to identify the sudden noise itself as a factor. Therefore, in the present embodiment, when the deviations of factors (A) to (C) detected in the user's utterance section are substantially the same, the misrecognition factor is estimated to be sudden noise. In that case, however, the message presenting unit 17 outputs a message prompting re-input without presenting a misrecognition factor, and the threshold value is set smaller when the speech is re-input. In this way, error analysis can be performed robustly against sudden noise, and because the degree of deviation is emphasized, an analysis result is obtained more readily, so that the user does not need to repeat the utterance many times.
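The re-input behavior described above can be sketched in Python as follows. This is only an illustration under assumptions: deviation scores are taken to be at least 1 with larger values meaning larger deviation, "substantially the same" is approximated as a tie after threshold conversion, and the multiplicative `shrink` factor is a made-up way of narrowing the threshold. The function name `analyze` is hypothetical.

```python
from typing import Dict, Optional, Tuple

NEUTRAL = 1.0  # assumed meaning of the source's "1" (within the allowable range)


def analyze(deviations: Dict[str, float], threshold: float,
            shrink: float = 0.8) -> Tuple[Optional[str], float]:
    """Return (factor to present, threshold to use for the next input).

    When several factors tie for the largest converted deviation, no factor
    is returned (the caller should prompt re-input) and the threshold is
    narrowed so that deviations stand out more on the next attempt.
    """
    converted = {name: (NEUTRAL if value <= threshold else value)
                 for name, value in deviations.items()}
    top = max(converted.values())
    leaders = [name for name, value in converted.items() if value == top]
    if len(leaders) > 1:
        # Tie: treat as possible sudden noise, narrow the allowable range.
        return None, max(NEUTRAL, threshold * shrink)
    return leaders[0], threshold


if __name__ == "__main__":
    scores = {"power": 1.1, "speech_rate": 1.4, "speaker": 1.05}  # made-up values
    factor, threshold = analyze(scores, threshold=1.5)
    print(factor)   # no factor on the first attempt: all scores are in range
    factor, threshold = analyze(scores, threshold)
    print(factor)   # narrowed threshold lets one factor stand out
```

In this example the first call finds no distinguishable factor, so it returns `None` and a narrower threshold. Repeating the analysis with that threshold singles out the speech-rate deviation.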
[0084]
The functions of the central processing unit 1 in the above embodiment as the factor-specific deviation calculating means, the deviation degree converting means, the factor detecting means, and the erroneous recognition factor output means are realized by a speech processing program recorded on a program recording medium. The program recording medium in the above embodiment is a program medium formed of the ROM. Alternatively, it may be a program medium that is mounted on and read by the external auxiliary storage device. In either case, the program reading means for reading the speech processing program from the program medium may be configured to access the program medium directly and read the program, or to download the program into a program storage area (not shown) provided in the RAM and then access that program storage area to read it. It is assumed that a download program for downloading from the program medium to the program storage area of the RAM is stored in the main device in advance.
[0085]
Here, the program medium is a medium that is configured to be separable from the main body and that carries the program in a fixed manner. It includes tape systems such as magnetic tape and cassette tape; disk systems comprising magnetic disks such as floppy disks and hard disks, and optical disks such as CD (compact disc)-ROM, MO (magneto-optical) disks, MD (MiniDisc), and DVD (digital versatile disc); card systems such as IC (integrated circuit) cards and optical cards; and semiconductor memory systems such as mask ROM, EPROM (ultraviolet-erasable ROM), EEPROM (electrically erasable ROM), and flash ROM.
[0086]
In the case where the speech processing device according to the above-described embodiment is configured to be connectable to a communication network such as the Internet via a communication I/F, the program medium may be a medium that carries the program in a fluid manner, for example by downloading it from the communication network. In this case, it is assumed that a download program for downloading from the communication network is stored in the main device in advance, or is installed from another recording medium.
[0087]
It should be noted that what is recorded on the recording medium is not limited to only a program, and data can also be recorded.
[0088]
[Effects of the invention]
As is apparent from the above description, the present invention obtains feature amounts relating to a plurality of factors of erroneous recognition based on the feature amount of the input speech, calculates the degree of deviation of the feature amount from the standard model for each factor, detects the factor with the largest degree of deviation, and outputs it as the factor causing erroneous recognition. The user can therefore be informed of the cause of erroneous recognition in terms of, for example, a factor that is easy for humans to understand intuitively. Consequently, when erroneous recognition occurs during speech recognition, the user can clearly know why it occurred, which prevents the discomfort of not knowing the cause of the misrecognition.
[0089]
In addition, if feature amounts of power, speech rate, speaker characteristics, and ambient environmental noise are obtained as the feature amounts relating to the misrecognition factors, the user can be notified of the cause of the misrecognition with the speaker's acoustic characteristics, the utterance style, and the ambient noise distinguished from one another. Therefore, when the factor causing erroneous recognition is power, speech rate, or ambient environmental noise, the user can take appropriate action for speech recognition.
[0090]
Further, since the speech processing device is configured independently of the speech recognition device, it can be combined with a speech recognition device to form a speech recognition system as the situation requires, thereby increasing speech recognition efficiency and the recognition rate.
[Brief description of the drawings]
FIG. 1 is a diagram showing a hardware configuration of a speech processing device according to the present invention.
FIG. 2 is a block diagram showing a functional configuration of the speech processing device shown in FIG. 1.
FIG. 3 is a flowchart of the factor analysis and message creation processing operation performed by the factor analysis unit and the message creation unit in FIG. 2.
FIG. 4 is a flowchart of the factor analysis and message creation processing operation, continued from FIG. 3.
FIG. 5 is a diagram illustrating an example of a message presented by the message presenting unit in FIG. 2.
FIG. 6 is a diagram illustrating an example of speech input to the segment division unit in FIG. 2.
[Explanation of symbols]
1 ... Central processing unit,
2 ... Storage device,
3 ... External storage device,
4 ... Input device,
5 ... Output device,
11 ... Input unit,
12 ... A/D converter,
13 ... Feature extraction unit,
14 ... Segment division unit,
15 ... Factor analysis unit,
16 ... Message creation unit,
17 ... Message presenting unit,
18 ... Standard model storage unit.

Claims

A speech processing device that compares a feature amount of input speech with a standard model, comprising:
factor-specific deviation calculating means for obtaining feature amounts relating to a plurality of factors of erroneous recognition based on the feature amount of the input speech, and calculating, for each factor, a degree of deviation of the feature amount from the standard model;
deviation degree converting means for determining whether the calculated degree of deviation is within a threshold value representing an allowable range and, if it is within the threshold value, converting the degree of deviation into a predetermined value indicating that it is within the allowable range;
factor detecting means for detecting the factor having the largest degree of deviation, based on the calculated degrees of deviation and the converted degrees of deviation; and
erroneous recognition factor output means for outputting the detected factor having the largest deviation as a factor causing erroneous recognition.

The speech processing device according to claim 1, wherein
the erroneous recognition factor output means, when there are a plurality of detected factors having the largest deviation, outputs a message prompting the speech to be input again, without outputting an erroneous recognition factor.

The speech processing device according to claim 2, further comprising
threshold changing means for changing, when speech is input again in response to the message output by the erroneous recognition factor output means, the threshold value representing the allowable range so that the allowable range becomes narrower.

The speech processing device according to claim 1, further comprising
factor determining means for determining whether the detected factor having the largest deviation is the same factor as at the previous speech input, wherein
the erroneous recognition factor output means, when the detected factor having the largest deviation is the same factor as at the previous speech input, outputs the factor having the second largest deviation as the factor causing erroneous recognition.

The speech processing device according to claim 1, wherein
the standard model is represented by a probability function, and
the factor-specific deviation calculating means obtains feature amounts of power, speech rate, speaker characteristics, and ambient environmental noise as the feature amounts relating to the factors of erroneous recognition and, for each factor, calculates the degree of deviation from the standard model using a probability value, based on the feature amount of that factor, in the probability function representing the standard model.

A speech processing method for comparing a feature amount of input speech with a standard model, comprising:
obtaining feature amounts relating to a plurality of factors of erroneous recognition based on the feature amount of the input speech, and calculating, for each factor, a degree of deviation of the feature amount from the standard model;
determining whether the calculated degree of deviation is within a threshold value representing an allowable range and, if it is within the threshold value, converting the degree of deviation into a predetermined value indicating that it is within the allowable range;
detecting the factor having the largest degree of deviation, based on the calculated degrees of deviation and the converted degrees of deviation; and
outputting the detected factor having the largest deviation as a factor causing erroneous recognition.

A speech processing program that causes a computer to function as
the factor-specific deviation calculating means, the deviation degree converting means, the factor detecting means, and the erroneous recognition factor output means of claim 1.

A computer-readable program recording medium on which the speech processing program according to claim 7 is recorded.