JP2012047820A

JP2012047820A - Voice recognition device, and method and program for recognizing voice

Info

Publication number: JP2012047820A
Application number: JP2010187442A
Authority: JP
Inventors: Atsunori Ogawa; 厚徳小川; Atsushi Nakamura; 篤中村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-08-24
Filing date: 2010-08-24
Publication date: 2012-03-08
Anticipated expiration: 2030-08-24
Also published as: JP5400727B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy with which the correctness and the error cause of a recognition result obtained by a voice recognition device are estimated. SOLUTION: In a voice recognition device, a true-false/error cause estimating part comprises a model parameter recording part, a true-false/error cause conditional probability calculating part, and a true-false/error cause conditional probability marginalizing part. The model parameter recording part records the model parameters required to calculate a conditional probability based on a recognition model that represents the relationship between a speech feature quantity vector and a true-false/error cause label vector. The true-false/error cause conditional probability calculating part takes the speech feature quantity vector of each word as input and uses the model parameters to calculate the conditional probability based on the recognition model for every state that a predetermined true-false/error cause label vector can take. The true-false/error cause conditional probability marginalizing part then calculates the marginalized conditional probability of the true/false label and of each error cause label from those conditional probabilities.

Description

The present invention relates to a speech recognition apparatus that estimates both the reliability of a speech recognition result for an input speech signal, that is, how far the result can be trusted, and the cause of any error, and to a method and a program therefor.

The apparatus disclosed in Patent Document 1 is known as a speech recognition apparatus that estimates the reliability of a speech recognition result (whether it is correct or incorrect, and how likely that is). FIG. 10 shows the functional configuration of this speech recognition apparatus 900, and its operation is briefly described below. The speech recognition apparatus 900 includes a storage unit 4, an utterance division unit 5, a speech recognition unit 6, an acoustic model storage unit 10, a dictionary/language model storage unit 12, an information conversion unit 20, a reliability assignment unit 22, a discriminative model storage unit 29, and an output unit 26.

The storage unit 4 stores the speech signal input to the input terminal 2 as a discretized digital speech signal. The utterance division unit 5 segments the digital speech signal lying between silent intervals that last at least a predetermined duration as one utterance. The speech recognition unit 6 consists of an acoustic analysis unit 8 and a recognition search unit 7. The acoustic analysis unit 8 converts the digital speech signal into a time series of feature vectors. The recognition search unit 7 uses the acoustic model stored in the acoustic model storage unit 10 and the language model stored in the dictionary/language model storage unit 12 to match the word strings registered in the dictionary/language model storage unit 12 against the time series of feature vectors, and outputs the word string with the highest matching likelihood as the recognition result.

Cepstral analysis is commonly used as the speech analysis method in the acoustic analysis unit 8. The features include MFCCs (Mel Frequency Cepstral Coefficients), ΔMFCC, ΔΔMFCC, log power, and Δ log power, which together form a feature vector of roughly 10 to 100 dimensions. The analysis is performed with a frame width of about 30 ms and a frame shift of about 10 ms.
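The framing step described above can be sketched as follows. This is a minimal illustration only: the function name `frame_signal`, the 16 kHz sampling rate, and the choice to drop a trailing partial frame are assumptions, not details taken from the patent.

```python
def frame_signal(signal, sample_rate, frame_ms=30.0, shift_ms=10.0):
    """Split a sequence of samples into overlapping analysis frames.

    frame_ms and shift_ms default to the approximate values named in the
    text (30 ms frame width, 10 ms frame shift). A trailing partial frame
    is dropped, which is an assumption of this sketch.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += shift
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
frames = frame_signal([0.0] * 16000, 16000)
print(len(frames), len(frames[0]))  # 98 480
```

Each frame would then be passed to the cepstral analysis to produce one feature vector, so one second of speech yields on the order of a hundred feature vectors.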

The acoustic model models speech features such as the MFCCs above in suitable categories such as phonemes. Using this acoustic model, the acoustic closeness between the features of each frame of the input speech and the model of each category is calculated as the acoustic likelihood. The mainstream modeling approach today is based on the HMM (Hidden Markov Model), which rests on probability and statistical theory. Language models fall broadly into three forms: word lists, fixed grammars, and N-gram models. A speech recognition apparatus that targets isolated word utterances uses a word list enumerating the words to be recognized (the word list is equivalent to the dictionary stored in the dictionary/language model storage unit 12). An apparatus that targets fixed-form sentence utterances uses a fixed grammar that concatenates words registered in the dictionary/language model storage unit 12 to describe the utterance contents (sentences) the apparatus accepts. An apparatus that targets free continuous speech uses an N-gram model holding the N-gram probabilities of the words registered in the dictionary/language model storage unit 12, from which the ease with which sequences of up to N words connect is calculated as the language likelihood. Speech recognition apparatuses using such acoustic and language models are described in detail in, for example, Non-Patent Documents 1 and 2.

The information conversion unit 20 converts each word of the word string into an utterance feature vector such as the one shown in FIG. 11. In this example, the part-of-speech information of each word in the utterance feature vector is classified into 37 types. For the acoustic likelihood score, the language likelihood score, and the phoneme duration that accompany the part-of-speech information, the mean, variance, maximum, and minimum of each are calculated in this example.
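As a rough illustration of how such a per-word vector might be assembled, the sketch below concatenates a part-of-speech indicator with four summary statistics for each of the three score types. The one-hot encoding of the 37 part-of-speech classes, the use of the population variance, and the element order are all assumptions; the actual layout is defined by FIG. 11, which is not reproduced here.

```python
from statistics import mean, pvariance


def summarize(values):
    """Mean, variance, maximum, and minimum of a list of per-phoneme values."""
    return [mean(values), pvariance(values), max(values), min(values)]


def utterance_feature_vector(pos_index, acoustic, language, durations, n_pos=37):
    """Build a per-word feature vector (hypothetical layout).

    pos_index : index of the word's part-of-speech class (0 .. n_pos - 1)
    acoustic  : acoustic likelihood scores for the word
    language  : language likelihood scores for the word
    durations : phoneme durations for the word
    """
    one_hot = [1.0 if i == pos_index else 0.0 for i in range(n_pos)]
    return one_hot + summarize(acoustic) + summarize(language) + summarize(durations)


# Made-up scores for a single word, for illustration only.
x = utterance_feature_vector(3, [-4.0, -6.0], [-2.0, -2.5], [0.08, 0.12])
print(len(x))  # 49
```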

The reliability assignment unit 22 evaluates the utterance feature vector and assigns a reliability. The reliability is assigned by comparing the utterance feature vector output by the information conversion unit 20 with values, learned in advance and stored in the discriminative model storage unit 29, that associate utterance feature vectors with speech recognition rates. For example, by preparing utterance feature vectors that correspond to speech recognition rates at 10% intervals, a reliability can be assigned in 10% steps, ranging from a result that is 100% reliable to one that cannot be trusted at all. The output unit 26 outputs, for each utterance, the word sequence, the utterance feature vector of each word, and the reliability.

However, outputting the reliability alone is sometimes not enough. For example, when a female speaker talks to a speech recognition apparatus configured to recognize male speakers, the speech often cannot be recognized. FIG. 12 shows this situation: "Nagoya" spoken by a man is recognized, but "Kyoto" spoken by a woman is not. If the user is merely prompted to speak again on the basis of the reliability, the user does not know why, and may keep using the apparatus inappropriately.

To address this, the speech recognition apparatus 950 shown in FIG. 13, which estimates the cause of misrecognition simultaneously with the reliability, has also been devised (Non-Patent Document 4). The speech recognition apparatus 950 includes a speech recognition unit 96 and a correctness/error cause estimation unit 97. The speech recognition unit 96 outputs the word string obtained by recognizing the input speech, together with an utterance feature vector for each word that represents the features of that word by a plurality of parameters. The correctness/error cause estimation unit 97 takes the utterance feature vector of each word as input and estimates the value of that word's correctness/error cause label vector (whether the word is correct and what caused any error) and its likelihood, using a conditional probability based on a discriminative model that represents the relationship between the utterance feature vector and the correctness and error cause of the recognized word.

Patent Document 1: JP 2007-240589 A

Non-Patent Document 1: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, IT Text Speech Recognition System, Ohmsha, pp. 1-51, 2001.
Non-Patent Document 2: Akio Ando, Real-time Speech Recognition, IEICE, pp. 1-58, pp. 125-170, 2003.
Non-Patent Document 3: H. Jiang, "Confidence measures for speech recognition: A survey," Speech Communication, vol. 45, pp. 455-470, 2005.
Non-Patent Document 4: Atsunori Ogawa, Atsushi Nakamura, "Simultaneous estimation of reliability and cause of misrecognition based on maximum entropy model," Acoustical Society of Japan 2009 Spring Meeting, 2-5-17.
Non-Patent Document 5: Atsunori Ogawa, Atsushi Nakamura, "Examination of discriminative models in estimating reliability and cause of error," Acoustical Society of Japan 2010 Spring Meeting, 1-Q-6.

A conventional speech recognition apparatus that attaches a reliability to its recognition result makes it possible to operate on the basis of an estimate of whether the recognition result is right or wrong. That alone, however, does not let the user make full use of the apparatus. Moreover, the speech recognition method that estimates both the reliability and the cause of misrecognition fixes its estimate of correctness and error cause solely from the correctness/error cause label vector that has the highest conditional probability under the discriminative model representing the relationship between the utterance feature vector and the correctness and error cause of the recognized word, so the estimate can be unstable.

The present invention has been made in view of this point. Its object is to provide a speech recognition apparatus, a method therefor, and a program that, when a speech recognition error occurs, stabilize the estimation of correctness and error cause and can present appropriate information to the user on that basis.

The speech recognition apparatus of the present invention includes a speech recognition unit and a correctness/error cause estimation unit. The speech recognition unit outputs the word string obtained by recognizing the input speech, together with an utterance feature vector for each word that represents the features of that word by a plurality of parameters. For each word in the recognized word string, the correctness/error cause estimation unit takes the utterance feature vector of the word as input and estimates whether the word is correct, the cause of any error, and the likelihood of each. The correctness/error cause estimation unit further includes a model parameter recording unit, a correctness/error cause conditional probability calculation unit, and a correctness/error cause conditional probability marginalization unit.

The model parameter recording unit records the model parameters needed to calculate a conditional probability based on a discriminative model that represents the relationship between the utterance feature vector and the correctness/error cause label vector. The correctness/error cause conditional probability calculation unit takes the utterance feature vector of each word as input and uses the model parameters to calculate the conditional probability based on the discriminative model for each state that the preset correctness/error cause label vector can take. The correctness/error cause conditional probability marginalization unit marginalizes these conditional probabilities with respect to the element of interest of the correctness/error cause label vector and calculates the marginalized conditional probabilities of the correctness label and of each error cause label.

In the speech recognition apparatus of the present invention, the correctness/error cause conditional probability marginalization unit marginalizes the conditional probabilities with respect to the element of interest of the correctness/error cause label vector and calculates the marginalized conditional probabilities of the correctness label and of each error cause label. Estimating the correctness label and the error cause labels from these marginalized conditional probabilities stabilizes the correctness decision and the estimation of each error cause and improves their accuracy. In addition, as shown in FIG. 14, the apparatus can present a message suited to the situation in which it is being used, for example: "The speech could not be recognized. The apparatus is currently set to recognize male voices. If a woman is using it, please press the button for female voice recognition."

FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention.
FIG. 2 shows the operation flow of the speech recognition apparatus 100.
FIG. 3 shows an example of the values that the correctness/error cause label vector $\vec{y}$ can take and of the correctness/error cause conditional probability $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$.
FIG. 4 shows a functional configuration example of the speech recognition apparatus 200 of the present invention.
FIG. 5 shows the operation flow of the speech recognition apparatus 200.
FIG. 6 shows a functional configuration example of the speech recognition apparatus 300 of the present invention.
FIG. 7 shows the operation flow of the speech recognition apparatus 300.
FIG. 8 shows an example of a correctness/error cause message.
FIG. 9 shows the ROC curves of the reliability obtained in the evaluation experiment.
FIG. 10 shows the functional configuration of the speech recognition apparatus 900 of Patent Document 1.
FIG. 11 shows an example of the utterance feature vector $\vec{x}$.
FIG. 12 shows an example of a conventional speech recognition situation.
FIG. 13 shows a functional configuration example of the conventional speech recognition apparatus 950.
FIG. 14 shows an example of a speech recognition situation using the speech recognition apparatus of the present invention.

Embodiments of the present invention are described below with reference to the drawings. Identical elements in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows a functional configuration example of the speech recognition apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The speech recognition apparatus 100 includes a speech recognition unit 30 and a correctness/error cause estimation unit 40. The speech recognition apparatus 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The speech recognition unit 30 outputs the word string obtained by recognizing the speech input to the input terminal 2, together with an utterance feature vector $\vec{x}$ for each word that represents the features of that word by a plurality of parameters (step S30). (The notation in the figures, with the arrow above the symbol, is the correct one.) The speech recognition unit 30 comprises the components from the storage unit 4 through the information conversion unit 20 of the speech recognition apparatus 900 described as prior art. The utterance feature vector $\vec{x}$ of each word is likewise a vector such as that shown in FIG. 11, consisting of, for example, acoustic likelihood scores and language likelihood scores.

The correctness/error cause estimation unit 40 further includes a correctness/error cause conditional probability calculation unit 41, a model parameter recording unit 42, and a correctness/error cause conditional probability marginalization unit 44. The correctness/error cause conditional probability calculation unit 41 takes as input the utterance feature vector $\vec{x}$ of each word output by the speech recognition unit 30 and calculates, for each state that the preset correctness/error cause label vector $\vec{y}$ can take, the conditional probability based on a maximum entropy model (MEM), which is one kind of discriminative model. The calculation uses the feature functions $f_k(\vec{x}, \vec{y})$ and their weight parameters $\lambda_k$ recorded in the model parameter recording unit 42, which are the model parameters of the maximum entropy model (step S41). The maximum entropy model is one example of a discriminative model and is used in recent reliability estimation methods.

The correctness/error cause conditional probability marginalization unit 44 calculates the marginalized conditional probabilities of the correctness label and of each error cause label from the conditional probabilities based on the discriminative model (step S44). A marginalized conditional probability is, for example, the sum of the conditional probabilities of all states in which the correctness label is "correct". Details are given later.

Estimating correctness from these marginalized conditional probabilities makes comprehensive use of the conditional probabilities of all correctness/error cause label vectors. This stabilizes the correctness decision and the estimation of each error cause and improves their accuracy.

The invention is now described in more detail using a concrete example of the correctness/error cause label vector $\vec{y}$. The correctness/error cause label vector $\vec{y}$ is a vector whose dimensions are one correctness label $y_0$ and one or more error cause labels $y_i$, $i \geq 1$. The correctness label and the error cause labels $y_i$ are, for example, as shown in Table 1. In the following description, the speech recognition apparatus 100 is assumed to be an isolated word speech recognition apparatus that recognizes Japanese place names spoken by a male voice in a quiet place.

The correctness label $y_0$ is binary information estimated from the utterance feature vector $\vec{x}$ on the basis of the maximum entropy model. $y_0 = 0$ denotes a correct result and $y_0 = 1$ an incorrect one.

The error cause label $y_1$ indicates whether the word is in the vocabulary ($y_1 = 0$) or out of the vocabulary ($y_1 = 1$). The error cause label $y_2$ indicates whether noise is absent ($y_2 = 0$) or present ($y_2 = 1$). The error cause label $y_3$ indicates whether the speaker is male ($y_3 = 0$) or female ($y_3 = 1$).

Besides the four labels shown in Table 1, other possible error cause labels include, for example, whether the volume is appropriate or not and whether the user's age group is within the expected range or not. To keep the explanation simple, the following description is limited to the four labels shown in Table 1.

In this case, the correctness/error cause label vector $\vec{y}$ can nominally take $2^4 = 16$ states. However, a state such as $\vec{y} = (y_0, y_1, y_2, y_3) = (0, 1, 0, 0)$, "out of vocabulary but recognized correctly", cannot occur. Once such impossible combinations are excluded, the label vector $\vec{y}$ can take the 12 states shown in FIG. 3. FIG. 3 shows an example of the states that the correctness/error cause label vector $\vec{y}$ can take and of the correctness/error cause conditional probability $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$. Each row shows one possible state of $\vec{y}$, and the rightmost column shows an example of $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$ calculated for a particular utterance feature vector $\vec{x}$.
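The reduction from 16 to 12 states can be checked with a short enumeration. The sketch below assumes that the excluded states are exactly those combining "correct" ($y_0 = 0$) with "out of vocabulary" ($y_1 = 1$). The text names only (0, 1, 0, 0) explicitly, but this rule removes four states and is consistent with the state numbers 0 to 3 and 8 to 15 used later in the description.

```python
from itertools import product

# Label order taken from the text: (y0, y1, y2, y3) =
# (correct/incorrect, in/out of vocabulary, noise absent/present, male/female).


def valid_label_states():
    """Enumerate label vectors, skipping 'correct but out of vocabulary'."""
    states = []
    for y in product((0, 1), repeat=4):
        y0, y1 = y[0], y[1]
        if y0 == 0 and y1 == 1:  # assumed exclusion rule (see lead-in)
            continue
        states.append(y)
    return states


states = valid_label_states()
print(len(states))  # 12
```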

The correctness/error cause label vectors $\vec{y}$ defined in this way, with the impossible states taken into account, may be built into the correctness/error cause conditional probability calculation unit 41 in advance. Alternatively, as indicated by the broken line in FIG. 1, a correctness/error cause label vector recording unit 43 may be provided to hold them, so that the correctness/error cause conditional probability calculation unit 41 refers to the label vectors $\vec{y}$ recorded there.

In correctness/error cause estimation based on the maximum entropy model, the relationship between, for example, these 12 states of the correctness/error cause label vector $\vec{y}$ and the utterance feature vector $\vec{x}$ is learned in advance from training data. First, $K$ feature functions $f_k(\vec{x}, \vec{y})$, $k = 1, 2, \ldots, K$, representing the relationship between $\vec{x}$ and $\vec{y}$ are prepared (for example, on the order of 100 to 1,000,000 of them). The weight parameter $\lambda_k$ of each feature function $f_k(\vec{x}, \vec{y})$ is then estimated by training, for example with a quasi-Newton method. These feature functions $f_k(\vec{x}, \vec{y})$ and weight parameters $\lambda_k$ are recorded in advance in the model parameter recording unit 42.

The correctness/error cause conditional probability calculation unit 41 takes the utterance feature vector $\vec{x}$ as input, refers to the feature functions $f_k(\vec{x}, \vec{y})$ and weight parameters $\lambda_k$ recorded in the model parameter recording unit 42, and calculates the correctness/error cause conditional probability $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$ given by Equation (1).
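Equation (1) itself is not reproduced in this text. For orientation, the conditional probability of a maximum entropy model is conventionally written in the form below, where the denominator sums over all admissible label vectors $\vec{y}'$. This is the textbook expression; it is assumed, not confirmed, to match the patent's Equation (1).

```latex
P_{\mathrm{ME}}(\vec{y} \mid \vec{x}) =
  \frac{\exp\left( \sum_{k=1}^{K} \lambda_k f_k(\vec{x}, \vec{y}) \right)}
       {\sum_{\vec{y}'} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(\vec{x}, \vec{y}') \right)}
```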

The correctness/error cause conditional probability $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$ is calculated for each of the correctness/error cause label vectors $\vec{y}$, of which there are 12 in this example. Each value is a probability between 0 and 1, and the conditional probabilities of all label vectors $\vec{y}$ (12 in this example) sum to 1.0, that is, $\sum_{\vec{y}} P_{\mathrm{ME}}(\vec{y} \mid \vec{x}) = 1.0$.
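The calculation and its normalization property can be illustrated with a small log-linear model. Everything specific in this sketch is hypothetical: the function name `mem_conditional`, the two toy feature functions, the weights, and the two-label state space are made up for illustration and are not the feature functions or parameters of the patent.

```python
import math


def mem_conditional(x, states, feature_fns, weights):
    """Return {y: P(y | x)} for a maximum entropy (log-linear) model."""
    # Weighted sum of feature functions for each candidate label vector.
    scores = [sum(w * f(x, y) for f, w in zip(feature_fns, weights))
              for y in states]
    top = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)                          # normalization term
    return {y: e / z for y, e in zip(states, exps)}


# Hypothetical toy setup: two binary labels and two made-up feature functions.
toy_states = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy_features = [
    lambda x, y: x[0] if y[0] == 0 else 0.0,   # fires when the first label is 0
    lambda x, y: x[1] if y[1] == 1 else 0.0,   # fires when the second label is 1
]
toy_weights = [1.5, 0.8]

probs = mem_conditional([2.0, 0.5], toy_states, toy_features, toy_weights)
total = sum(probs.values())                    # equals 1.0 up to rounding
best = max(probs, key=probs.get)               # most probable label vector
```

Because the probabilities are obtained by dividing by the sum over all candidate states, they add up to 1.0 regardless of the feature functions and weights chosen.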

In the example shown in FIG. 3, the conditional probability (0.1) of the correctness/error cause label vector $\vec{y} = (0, 0, 1, 0)$, "correct despite noise", is larger than that of any other label vector. The conventional speech recognition apparatus 950 described in the background art fixed its estimate of correctness and error cause solely from this label vector with the highest conditional probability.

The speech recognition apparatus 100 of the present invention is novel in that it improves the accuracy of estimating the correctness/error cause label vector $\vec{y}$ by marginalizing the conditional probability values. Marginalization is a basic operation in probability theory: given the joint probability of several random variables (here $y_0, y_1, y_2, y_3$), one focuses on a single random variable and sums the probabilities over all values of the others. The correctness/error cause conditional probability marginalization unit 44 marginalizes the correctness/error cause conditional probability $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$ with respect to the element of interest of the label vector $\vec{y}$ (step S44).

For example, the marginalized conditional probability $P^{m}_{\mathrm{ME}}(y_0 \mid \vec{x})$ of the correctness label $y_0$ is obtained by Equation (2).
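Equation (2) is not reproduced in this text. Following the definition of marginalization given above, it would take the form below, summing the joint conditional probability over the error cause labels $y_1, y_2, y_3$ for a fixed value of $y_0$. This is a reconstruction from the surrounding description, not a copy of the original equation.

```latex
P^{m}_{\mathrm{ME}}(y_0 \mid \vec{x}) =
  \sum_{y_1} \sum_{y_2} \sum_{y_3} P_{\mathrm{ME}}(y_0, y_1, y_2, y_3 \mid \vec{x})
```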

The calculation of Equation (2) is explained using the example of FIG. 3. The marginalized conditional probability $P^{m}_{\mathrm{ME}}(y_0 = 0 \mid \vec{x})$ of a correct result ($y_0 = 0$) is the sum of the correctness/error cause conditional probabilities $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$ of the four states numbered 0, 1, 2, and 3 among the 12 states of the label vector $\vec{y}$, giving $P^{m}_{\mathrm{ME}}(y_0 = 0 \mid \vec{x}) = 0.3$. The marginalized conditional probability $P^{m}_{\mathrm{ME}}(y_0 = 1 \mid \vec{x})$ of an incorrect result ($y_0 = 1$) is the sum of $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$ over the eight states numbered 8 to 15, giving $P^{m}_{\mathrm{ME}}(y_0 = 1 \mid \vec{x}) = 0.7$.
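The summation can be sketched in a few lines. The joint distribution used here is a hypothetical uniform distribution over the 12 admissible states, chosen only so that the result is easy to verify by hand; it does not reproduce the values of FIG. 3. The exclusion of states with $y_0 = 0$ and $y_1 = 1$ is likewise an assumption consistent with the state numbering in the text.

```python
from itertools import product


def marginalize(joint, index):
    """Sum a joint distribution {label_vector: prob} over all labels except one.

    Returns {0: p, 1: p} for the binary label at position `index`.
    """
    marginal = {0: 0.0, 1: 0.0}
    for y, p in joint.items():
        marginal[y[index]] += p
    return marginal


# 12 admissible states: all (y0, y1, y2, y3) except "correct but out of vocabulary".
states = [y for y in product((0, 1), repeat=4) if not (y[0] == 0 and y[1] == 1)]

# Hypothetical joint distribution (uniform), for illustration only.
joint = {y: 1.0 / len(states) for y in states}

p_y0 = marginalize(joint, 0)   # correctness label: 4 of 12 states have y0 = 0
p_y1 = marginalize(joint, 1)   # vocabulary label: 8 of 12 states have y1 = 0
```

With the uniform toy distribution, the marginal for $y_0 = 0$ is 4/12 and for $y_0 = 1$ is 8/12, mirroring the four-state and eight-state sums described above. The same function applied with index 1, 2, or 3 gives the marginals of the error cause labels.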

Marginalizing the correctness/error cause conditional probability $P_{\mathrm{ME}}(\vec{y} \mid \vec{x})$ in this way makes comprehensive use of the conditional probabilities of all correctness/error cause label vectors, and can be expected to improve the accuracy of estimating correctness and error causes.

The marginalized conditional probabilities $P^{m}_{ME}(y_i \mid \vec{x})$ of the other error-cause labels $y_i$, $i = 1, 2, 3$, are obtained in the same way. In the example of FIG. 3, the marginalized conditional probabilities for in-vocabulary ($y_1 = 0$) and out-of-vocabulary ($y_1 = 1$) are $P^{m}_{ME}(y_1 = 0 \mid \vec{x}) = 0.644$ and $P^{m}_{ME}(y_1 = 1 \mid \vec{x}) = 0.356$. Likewise, the marginalized conditional probability of no noise is $P^{m}_{ME}(y_2 = 0 \mid \vec{x}) = 0.452$ and that of noise is $P^{m}_{ME}(y_2 = 1 \mid \vec{x}) = 0.548$. The marginalized conditional probability of a male speaker is $P^{m}_{ME}(y_3 = 0 \mid \vec{x}) = 0.505$ and that of a female speaker is $P^{m}_{ME}(y_3 = 1 \mid \vec{x}) = 0.495$.
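The marginalization described above can be sketched as follows. The joint distribution here is hypothetical (uniform) and is not the FIG. 3 data, which the text does not list. The state numbering (reading $(y_0, y_1, y_2, y_3)$ as a 4-bit number) and the exclusion of correct-but-out-of-vocabulary vectors are assumptions inferred from the statement that states 0 to 3 are the correct states and 8 to 15 the incorrect ones.

```python
from itertools import product

# Allowed label vectors (y0, y1, y2, y3). Assumption: a correct word (y0 = 0)
# is never out of vocabulary (y1 = 1), which leaves 12 of the 16 combinations.
STATES = [y for y in product((0, 1), repeat=4) if not (y[0] == 0 and y[1] == 1)]


def state_index(y):
    """Assumed numbering: read (y0, y1, y2, y3) as a 4-bit binary number."""
    y0, y1, y2, y3 = y
    return 8 * y0 + 4 * y1 + 2 * y2 + y3


def marginalize(joint, i):
    """Return [P(y_i = 0 | x), P(y_i = 1 | x)] by summing out the other labels."""
    marginal = [0.0, 0.0]
    for y, p in joint.items():
        marginal[y[i]] += p
    return marginal


# Hypothetical joint distribution over the 12 states (uniform, for illustration).
joint = {y: 1.0 / len(STATES) for y in STATES}

for i in range(4):
    print(f"y{i}:", marginalize(joint, i))
```

Each pair returned by `marginalize` sums to 1, as the pairs quoted for FIG. 3 do (0.3 + 0.7, 0.644 + 0.356, and so on).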

Depending on its specification, the speech recognition apparatus 100 may output all of these marginalized conditional probabilities $P^{m}_{ME}(y_i \mid \vec{x})$, or it may output, for example, only $P^{m}_{ME}(y_0 = 0 \mid \vec{x})$, the marginalized conditional probability that the result is correct.

FIG. 4 shows an example functional configuration of the speech recognition apparatus 200 of the present invention, and FIG. 5 shows its operation flow. The speech recognition apparatus 200 differs from the speech recognition apparatus 100 in that its correctness/error-cause estimation unit 50 includes a correctness/error-cause selection unit 51.

The correctness/error-cause selection unit 51 takes as input the marginalized conditional probabilities $P^{m}_{ME}(y_i \mid \vec{x})$ computed by the correctness/error-cause conditional probability marginalization unit 44, and obtains the estimate $\hat{\vec{y}}$ of the correctness/error-cause label vector by Equation (3) (step S51).

Each element $\hat{y}_i$ in the middle term of Equation (3) is, as the right-hand term shows, either 0 or 1: it is the value $j \in \{0, 1\}$ that gives the larger marginalized conditional probability $P^{m}_{ME}(y_i = j \mid \vec{x})$. In the FIG. 3 example above, the larger values are $P^{m}_{ME}(y_0 = 1 \mid \vec{x}) = 0.7$, $P^{m}_{ME}(y_1 = 0 \mid \vec{x}) = 0.644$, $P^{m}_{ME}(y_2 = 1 \mid \vec{x}) = 0.548$, and $P^{m}_{ME}(y_3 = 0 \mid \vec{x}) = 0.505$, so the estimate of the label vector $\hat{\vec{y}} = (\hat{y}_0, \hat{y}_1, \hat{y}_2, \hat{y}_3)$ is (1, 0, 1, 0).
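A minimal sketch of this per-label selection, using the marginalized probabilities quoted for FIG. 3. The function name `select_labels` and the list layout are illustrative choices, not part of the patent.

```python
# Marginalized probabilities [P(y_i = 0 | x), P(y_i = 1 | x)] quoted for FIG. 3.
marginals = [
    [0.300, 0.700],  # y0: correct / incorrect
    [0.644, 0.356],  # y1: in vocabulary / out of vocabulary
    [0.452, 0.548],  # y2: no noise / noise
    [0.505, 0.495],  # y3: male / female
]


def select_labels(marginals):
    """For each label, pick the value j in {0, 1} with the larger marginal."""
    return tuple(max((0, 1), key=lambda j: probs[j]) for probs in marginals)


y_hat = select_labels(marginals)
print(y_hat)  # (1, 0, 1, 0)
```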

Furthermore, the correctness/error-cause selection unit 51 obtains the marginalized conditional probability $P^{m}_{ME}(\hat{\vec{y}} \mid \vec{x})$, which represents the confidence in the estimate $\hat{\vec{y}}$, as the product of the marginalized conditional probabilities $P^{m}_{ME}(y_i \mid \vec{x})$ of the individual labels $y_i$, as shown in Equation (4).

In the example of FIG. 3, the marginalized conditional probability $P^{m}_{ME}(\hat{\vec{y}} \mid \vec{x})$ of the estimate $\hat{\vec{y}} = (1, 0, 1, 0)$ is 0.7 × 0.644 × 0.548 × 0.505 = 0.125. The speech recognition apparatus 200 can thus estimate whether the recognition result is correct, the cause of any error (the estimate $\hat{\vec{y}}$), and the confidence $P^{m}_{ME}(\hat{\vec{y}} \mid \vec{x})$ in that estimate.
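The product can be checked numerically with a short sketch. The variable names are illustrative; the four factors are the marginals selected for $\hat{\vec{y}} = (1, 0, 1, 0)$ in the FIG. 3 example.

```python
import math

# Marginals of the selected label values for y_hat = (1, 0, 1, 0) in FIG. 3.
selected = [0.7, 0.644, 0.548, 0.505]

# Confidence of the whole estimate: product of the per-label marginals.
confidence = math.prod(selected)
print(round(confidence, 3))  # 0.125
```

Note that this product treats the labels as if they were independent, so in general it is not the same number as the joint probability $P_{ME}(\hat{\vec{y}} \mid \vec{x})$ of that state.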

FIG. 6 shows an example functional configuration of the speech recognition apparatus 300 of the present invention, and FIG. 7 shows its operation flow. The speech recognition apparatus 300 adds a correctness/error-cause message generation unit 60 to the functional configuration of the speech recognition apparatus 200.

Because the speech recognition apparatus 200 outputs the estimate $\hat{\vec{y}}$ of the correctness/error-cause label vector, the user can learn how to respond by inspecting that vector. To improve convenience further, the speech recognition apparatus 300 generates a correctness/error-cause message from the label vector $\hat{\vec{y}}$.

The correctness/error-cause message generation unit 60 outputs a message corresponding to the estimate $\hat{\vec{y}}$ of the label vector, which makes it possible to present the user with an easier-to-understand remedy, for example as shown in FIG. 8. In the example of FIG. 3, the estimate is $\hat{\vec{y}} = (1, 0, 1, 0)$, and the message corresponding to this vector is "Recognition failed. If you are using the device in a noisy place, please use it somewhere a little quieter."
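One simple way to realize such a mapping is a lookup table keyed by the estimated label vector. The sketch below is an assumption about how this could be done, not the patent's implementation: only the entry for (1, 0, 1, 0) comes from the text, and the names `MESSAGES` and `message_for` are hypothetical. The full table of FIG. 8 is not reproduced here.

```python
# Lookup table from estimated label vector (y0, y1, y2, y3) to a user message.
# Only this one entry is taken from the description; others would follow FIG. 8.
MESSAGES = {
    (1, 0, 1, 0): (
        "Recognition failed. If you are using the device in a noisy place, "
        "please use it somewhere a little quieter."
    ),
}


def message_for(y_hat):
    """Return the message registered for y_hat, or None if there is none."""
    return MESSAGES.get(tuple(y_hat))


print(message_for((1, 0, 1, 0)))
```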

A message need not be output only when the recognition result is estimated to be incorrect (state of $\hat{\vec{y}}$ = 8 to 15); even when the result is estimated to be correct, a message encouraging correct usage may be presented.

[Evaluation experiment]
An evaluation experiment was conducted to confirm the effect of the present invention. The experimental conditions are outlined here. Training data and evaluation data, each consisting of eight subsets, were prepared from an isolated-word utterance database. Each subset corresponds to one combination of the presence or absence of three error causes: utterance of an out-of-vocabulary word, utterance in a noisy environment, and utterance by a female speaker.

Out-of-vocabulary words, speakers, and noise types do not overlap between the training data and the evaluation data. In-vocabulary words partly overlap between the two, and some speakers contributed utterances to more than one subset.

The training procedure was as follows. A male clean HMM, four GMMs (Gaussian mixture models: male clean, male noisy, female clean, and female noisy), and a 3830-word pronunciation dictionary were prepared. Using these, speech recognition was run on the training data together with the process of acquiring pairs of an utterance feature vector $\vec{x}$ and a correctness/error-cause label vector $\vec{y}$, yielding the recognized words and the reference pairs of $\vec{x}$ and $\vec{y}$. The utterance feature vector $\vec{x}$ was an 18-dimensional vector consisting of frame-averaged HMM/GMM likelihoods, average phoneme duration, posterior probability, and the like.

A total of K = 1080 feature functions were defined for the maximum entropy model. The weight parameters $\lambda_k$ of the maximum entropy model were estimated from these feature functions and the reference pairs.
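For orientation, the sketch below evaluates a conditional maximum entropy model in its standard form, $P(\vec{y} \mid \vec{x}) \propto \exp(\sum_k \lambda_k f_k(\vec{x}, \vec{y}))$, over the 12 allowed label vectors. The feature functions shown (one copy of $\vec{x}$ per label vector), the 2-dimensional input, and the all-zero weights are hypothetical; they are not the 1080 feature functions or the trained weights of the experiment.

```python
import math
from itertools import product

# Allowed label vectors; correct-but-out-of-vocabulary vectors are assumed excluded.
STATES = [y for y in product((0, 1), repeat=4) if not (y[0] == 0 and y[1] == 1)]


def features(x, y):
    """Hypothetical feature functions: x copied into the block belonging to y."""
    vec = [0.0] * (len(x) * len(STATES))
    offset = STATES.index(y) * len(x)
    vec[offset:offset + len(x)] = x
    return vec


def maxent_probs(x, weights):
    """Return {y: P(y | x)} for a log-linear model with the given weights."""
    scores = {
        y: sum(w * f for w, f in zip(weights, features(x, y))) for y in STATES
    }
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())  # normalization term Z(x)
    return {y: e / z for y, e in exps.items()}


x = [0.5, -1.2]                           # toy 2-dimensional feature vector
weights = [0.0] * (len(x) * len(STATES))  # zero weights give a uniform distribution
probs = maxent_probs(x, weights)
print(sum(probs.values()))
```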

The evaluation procedure was as follows. The same male clean HMM, four GMMs, and 3830-word pronunciation dictionary were used again to run speech recognition and utterance feature vector acquisition on the evaluation data, yielding the recognized words and their utterance feature vectors $\vec{x}$. Next, the classification process of the maximum entropy model produced the correctness/error-cause label vectors $\vec{y}$ corresponding to each $\vec{x}$, along with their conditional probabilities. The marginalization of Equation (2) was then applied to these conditional probabilities to obtain the confidence (correctness), each error cause, and their marginalized conditional probabilities.

FIG. 9 shows the ROC (receiver operating characteristic) curves of the confidence obtained in the evaluation experiment. The horizontal axis is the false rejection rate, the rate at which a correct recognition result is wrongly estimated to be incorrect; the vertical axis is the false acceptance rate, the rate at which an incorrect recognition result is wrongly estimated to be correct. The dashed straight line in the figure marks the equal error rate, where the two rates are equal. The present invention is shown by the solid line, and the conventional method without marginalization by the dash-dot line. The closer a curve lies to the lower left of the figure, the higher the accuracy of the confidence (correctness) estimate.
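The two error rates and the equal error rate can be computed from confidence scores as in the following sketch. It assumes a result is accepted when its score (for instance $P^{m}_{ME}(y_0 = 0 \mid \vec{x})$) is at or above a threshold; the scores and labels are toy data, not experimental results, and the function names are illustrative.

```python
def error_rates(scores, is_correct, threshold):
    """Return (false rejection rate, false acceptance rate) at one threshold."""
    accepted = [s >= threshold for s in scores]
    n_correct = sum(is_correct)
    n_wrong = len(is_correct) - n_correct
    # Correct results that were rejected, and incorrect results that were accepted.
    false_reject = sum(c and not a for a, c in zip(accepted, is_correct))
    false_accept = sum(a and not c for a, c in zip(accepted, is_correct))
    return false_reject / n_correct, false_accept / n_wrong


def equal_error_rate(scores, is_correct):
    """Approximate the EER at the threshold where the two rates are closest."""
    candidates = (error_rates(scores, is_correct, t) for t in sorted(set(scores)))
    frr, far = min(candidates, key=lambda r: abs(r[0] - r[1]))
    return (frr + far) / 2


# Toy data: confidence scores and whether each recognition result was correct.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
is_correct = [True, True, False, True, False, False]
print(equal_error_rate(scores, is_correct))
```

Sweeping the threshold and plotting the resulting pairs of rates traces a curve of the kind shown in FIG. 9.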

The present invention reduced the equal error rate by 9.44%. This change in the equal error rate is statistically significant at the 1% significance level. The speech recognition method of the present invention thus has the effect of improving the accuracy of confidence estimation.

The speech recognition apparatus and method of the present invention described above are not limited to the embodiments described, and may be modified as appropriate without departing from the spirit of the invention. For example, the processes described for the above apparatus and method need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capacity of the apparatus that executes them.

In addition, a support vector machine (SVM) or conditional random fields (CRF), for example, may be used as the discriminative model in place of the maximum entropy model. The present invention is also not limited to the speech recognition apparatus for isolated-word utterances described in the embodiments; it can equally be applied to speech recognition apparatuses that recognize utterances following a fixed grammar and to those that recognize free continuous speech.

When the processing means of the above apparatus are implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on the computer implements the processing means of each apparatus on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, a hard disk drive, flexible disk, magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and a flash memory or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

A speech recognition apparatus comprising:
a speech recognition unit that outputs a word sequence obtained by speech recognition of input speech and, for each word constituting the word sequence, an utterance feature vector representing the features of that word by a plurality of parameters; and
a correctness/error-cause estimation unit that takes the utterance feature vector of each word as input and estimates, for each word, whether it is correct or incorrect, an estimate of the error cause, and the confidence in these estimates,
wherein the correctness/error-cause estimation unit comprises:
a model parameter recording unit that records the model parameters needed to calculate conditional probabilities based on a discriminative model representing the relationship between the utterance feature vector and a correctness/error-cause label vector;
a correctness/error-cause conditional probability calculation unit that takes the utterance feature vector of each word as input and calculates, using the model parameters, the conditional probability based on the discriminative model for each predetermined state that the correctness/error-cause label vector can take; and
a correctness/error-cause conditional probability marginalization unit that marginalizes the conditional probabilities with respect to the element of interest of the correctness/error-cause label vector, and calculates the marginalized conditional probabilities of the correctness label and of each error-cause label.

The speech recognition apparatus according to claim 1,
further comprising a correctness/error-cause selection unit that selects the correctness/error-cause label vector for which the marginalized conditional probabilities are largest, and outputs the resulting estimates of correctness and error cause together with a conditional probability expressed as the product of the marginalized conditional probabilities.

The speech recognition apparatus according to claim 1 or 2,
further comprising a correctness/error-cause message generation unit that takes the correctness/error-cause label vector or the error-cause label vector as input and generates a correctness/error-cause message corresponding to that label vector.

A speech recognition method comprising:
a speech recognition process of outputting a word sequence obtained by speech recognition of input speech and, for each word constituting the word sequence, an utterance feature vector representing the features of that word by a plurality of parameters; and
a correctness/error-cause estimation process of taking the utterance feature vector of each word as input and estimating, for each word, whether it is correct or incorrect, an estimate of the error cause, and the confidence in these estimates,
wherein the correctness/error-cause estimation process includes:
a correctness/error-cause conditional probability calculation step of taking the utterance feature vector of each word as input and calculating, for each predetermined state that a correctness/error-cause label vector can take, a conditional probability based on a discriminative model representing the relationship between the utterance feature vector and the correctness/error-cause label vector, using the model parameters recorded in a model parameter recording unit as needed for that calculation; and
a correctness/error-cause conditional probability marginalization step of marginalizing the conditional probabilities with respect to the element of interest of the correctness/error-cause label vector, and calculating the marginalized conditional probabilities of the correctness label and of each error-cause label.

The speech recognition method according to claim 4,
further comprising a correctness/error-cause selection process of selecting the correctness/error-cause label vector for which the marginalized conditional probabilities are largest, and outputting the resulting estimates of correctness and error cause together with a conditional probability expressed as the product of the marginalized conditional probabilities.

The speech recognition method according to claim 4 or 5,
further comprising a correctness/error-cause message generation process of taking the correctness/error-cause label vector or the error-cause label vector as input and generating a correctness/error-cause message corresponding to that label vector.

A program for causing a computer to function as the speech recognition apparatus according to any one of claims 1 to 3.