JP6121187B2

JP6121187B2 - Acoustic model correction parameter estimation apparatus, method and program thereof

Info

Publication number: JP6121187B2
Application number: JP2013025865A
Authority: JP
Inventors: マークデルクロア; 小川　厚徳; ソンジュンハム; 中谷　智広; 中村　篤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-02-13
Filing date: 2013-02-13
Publication date: 2017-04-26
Anticipated expiration: 2033-02-13
Also published as: JP2014153680A

Description

The present invention relates to a technique for correcting cluster classification model parameters and a technique for correcting features, both aimed at improving classification accuracy in pattern recognition, in which features are extracted from input data and used to classify the input data into predefined clusters. For example, it relates to an acoustic model correction parameter estimation apparatus, a feature correction parameter estimation apparatus, and methods and programs therefor, for speech recognition in which features are extracted from input speech and used to convert the input speech into a word string.

Speech recognition apparatuses are deployed in a wide variety of environments. As a result, the characteristics of the speech data used to train the acoustic model often do not match those of the speech actually input. The mismatch stems from ambient noise, speaker variability, and similar factors, and it degrades recognition accuracy. Speech recognition techniques that are robust to ambient noise and speaker variability are therefore needed. Known robust techniques correct either the feature vectors extracted from the input speech or the acoustic model so that the two match more closely.

Non-Patent Document 1 describes a technique that achieves robust speech recognition by correcting the feature vectors extracted from input speech. In this technique, the correction parameters for the feature vectors are learned under the dMMI (differenced Maximum Mutual Information) criterion. In addition, as described in Non-Patent Document 2, there is a technique that, when noise suppression (speech enhancement) is applied to noisy speech to cope with ambient noise, estimates correction parameters for the variance parameters of the acoustic model under the dMMI criterion.

As a technique for correcting the acoustic model, linear regression acoustic model adaptation (Non-Patent Documents 3 and 4), which corrects the acoustic model parameters using linear regression, is known.

Non-Patent Document 1: M. Delcroix, A. Ogawa, S. Watanabe, T. Nakatani, and A. Nakamura, "Discriminative training of feature transforms based on the dMMI discriminative criterion" (in Japanese), Acoustical Society of Japan Spring Meeting, March 2012, pp. 121-122.
Non-Patent Document 2: M. Delcroix, A. Ogawa, S. Watanabe, T. Nakatani, and A. Nakamura, "Unsupervised dynamic variance adaptation based on the dMMI discriminative criterion" (in Japanese), Acoustical Society of Japan Autumn Meeting, September 2012, pp. 131-132.
Non-Patent Document 3: C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models", Computer Speech & Language, 1995, vol. 9, no. 2, pp. 171-185.
Non-Patent Document 4: L. F. Uebel and P. C. Woodland, "Discriminative linear transforms for speaker adaptation", in Proc. ISCA Tutorial and Research Workshop (ITRW) on Adaptation Methods for Speech Recognition, 2001, pp. 61-64.

Learning correction parameters under the dMMI criterion, as in Non-Patent Document 1, requires a large amount of speech data for training (hereinafter also "training speech data") together with the corresponding sequences of correct symbols (hereinafter also "correct symbol sequences"). Preparing this training data, and the correct symbol sequences in particular, is very costly. The technique of Non-Patent Document 2, which dynamically adapts correction parameters for the variance parameters of the acoustic model under the dMMI criterion, requires speech enhancement processing; it is therefore difficult to apply to other sources of mismatch such as speaker variability, and it lacks generality.

In contrast, correction parameter adaptation techniques that estimate linear regression parameters under the maximum likelihood criterion (MLLR: Maximum Likelihood Linear Regression; Non-Patent Document 3) or under the maximum mutual information (MMI) criterion, a type of discriminative criterion (MMI-LR; Non-Patent Document 4), have the advantage that they can be run with a small amount of speech data. They can also be used for unsupervised adaptation, which needs no correct symbols and therefore no manual preparation of them.

For supervised adaptation, MMI-LR is reported to outperform MLLR (Non-Patent Document 4). However, unsupervised adaptation of an acoustic model under the MMI-LR discriminative criterion treats the speech recognition result for the adaptation data as the correct label, so the correct labels (that is, the recognition results regarded as such) often contain errors. A discriminative criterion such as MMI-LR improves recognition performance substantially by directly optimizing the acoustic model parameters with respect to both the correct symbols and the competing recognition hypotheses. When the correct symbols contain errors, the acoustic model parameters cannot be optimized well, and performance may fail to improve or may even degrade.

An object of the present invention is to provide an acoustic model correction parameter estimation technique and a feature correction parameter estimation technique that introduce a mechanism for weakening the adverse effect of errors in the correct symbols. This prevents the accuracy of discriminative acoustic model adaptation from degrading even in unsupervised adaptation, where such errors are frequent, and makes unsupervised acoustic model adaptation under a discriminative criterion feasible.

To solve the above problem, according to a first aspect of the present invention, an acoustic model correction parameter estimation apparatus obtains, from the features of training speech data and a correct symbol sequence for the training speech data, a mean correction parameter for correcting mean vectors, where the acoustic model includes a Gaussian mixture model and the acoustic model parameters include the mean vectors of the Gaussian distributions in the Gaussian mixture model. The apparatus includes: a storage unit that stores an acoustic model and a language model obtained in advance; an acoustic model correction unit that corrects the mean vectors of the stored acoustic model using the mean correction parameter; an error count calculation unit that obtains, at a predetermined granularity, the degree of difference from the correct symbol sequence for each competing candidate symbol sequence obtained by performing speech recognition on the features of the training speech data based on the language model and the acoustic model containing the corrected mean vectors; a correction parameter derivative calculation unit that obtains the derivative of the objective function of a discriminative training criterion with respect to the mean correction parameter, based on the language probability of each competing candidate symbol sequence given by the language model, the acoustic score given by the acoustic model from the features of the training speech data and the competing candidate symbol sequence, and the degree of difference; and a correction parameter update unit that updates the mean correction parameter by changing it according to the derivative.

To solve the above problem, according to another aspect of the present invention, a feature correction parameter estimation apparatus obtains, from the features of training speech data and a correct symbol sequence for the training speech data, a feature correction parameter for correcting the features of speech data for recognition. The apparatus includes: a storage unit that stores a language model and an acoustic model represented by Gaussian distributions, both obtained in advance; a feature correction unit that obtains corrected features by correcting the features of the training speech data with the feature correction parameter of each cluster of the Gaussian distributions representing the acoustic model; an error count calculation unit that obtains, at a predetermined granularity, the degree of difference from the correct symbol sequence for each competing candidate symbol sequence obtained by performing speech recognition on the corrected features; a correction parameter derivative calculation unit that obtains the derivative of the objective function of a discriminative training criterion with respect to the feature correction parameter, based on the language probability of each competing candidate symbol sequence given by the language model, the acoustic score given by the acoustic model from the corrected features and the competing candidate symbol sequence, and the degree of difference; and a correction parameter update unit that updates the feature correction parameter by changing it according to the derivative.

To solve the above problem, according to another aspect of the present invention, an acoustic model correction parameter estimation method obtains, from the features of training speech data and a correct symbol sequence for the training speech data, a mean correction parameter for correcting mean vectors, where the acoustic model includes a Gaussian mixture model and the acoustic model parameters include the mean vectors of the Gaussian distributions in the Gaussian mixture model. An acoustic model and a language model obtained in advance are stored in a storage unit. The method includes: an acoustic model correction step of correcting the mean vectors of the stored acoustic model using the mean correction parameter; an error count calculation step of obtaining, at a predetermined granularity, the degree of difference from the correct symbol sequence for each competing candidate symbol sequence obtained by performing speech recognition on the features of the training speech data based on the language model and the acoustic model containing the corrected mean vectors; a correction parameter derivative calculation step of obtaining the derivative of the objective function of a discriminative training criterion with respect to the mean correction parameter, based on the language probability of each competing candidate symbol sequence given by the language model, the acoustic score given by the acoustic model from the features of the training speech data and the competing candidate symbol sequence, and the degree of difference; and a correction parameter update step of updating the mean correction parameter by changing it according to the derivative.

To solve the above problem, according to another aspect of the present invention, a feature correction parameter estimation method obtains, from the features of training speech data and a correct symbol sequence for the training speech data, a feature correction parameter for correcting the features of speech data for recognition. A language model and an acoustic model represented by Gaussian distributions, both obtained in advance, are stored in a storage unit. The method includes: a feature correction step of obtaining corrected features by correcting the features of the training speech data with the feature correction parameter of each cluster of the Gaussian distributions representing the acoustic model; an error count calculation step of obtaining, at a predetermined granularity, the degree of difference from the correct symbol sequence for each competing candidate symbol sequence obtained by performing speech recognition on the corrected features; a correction parameter derivative calculation step of obtaining the derivative of the objective function of a discriminative training criterion with respect to the feature correction parameter, based on the language probability of each competing candidate symbol sequence given by the language model, the acoustic score given by the acoustic model from the corrected features and the competing candidate symbol sequence, and the degree of difference; and a correction parameter update step of updating the feature correction parameter by changing it according to the derivative.
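The update step shared by the aspects above (changing the correction parameter according to the derivative of the objective function) can be illustrated with a minimal sketch. It assumes plain gradient ascent with a fixed step size, which is only one possible update rule and is not stated in this passage. The function names, the `step_size` and `iterations` parameters, and the toy derivative are hypothetical.

```python
def update_correction_parameter(theta, derivative, step_size=0.1):
    """One gradient-ascent step: move each parameter along its derivative.

    Assumption: a fixed step size; the source only says the parameter is
    changed according to the derivative.
    """
    return [t + step_size * g for t, g in zip(theta, derivative)]


def estimate_correction_parameter(theta, derivative_fn, iterations=100, step_size=0.1):
    """Repeat the update, recomputing the derivative at the current parameter."""
    for _ in range(iterations):
        theta = update_correction_parameter(theta, derivative_fn(theta), step_size)
    return theta


def toy_derivative(theta):
    """Derivative of the toy objective F(theta) = -(theta - 3)^2 (illustration only)."""
    return [-2.0 * (theta[0] - 3.0)]
```

In the apparatus described above, `derivative_fn` would stand for the correction parameter derivative calculation unit, which depends on the language probabilities, acoustic scores, and degrees of difference rather than on a closed-form toy function.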

According to the present invention, weakening the adverse effect of errors in the correct symbols makes it possible to obtain correction parameters for the acoustic model parameters or for the features more appropriately than with the conventional techniques.

FIG. 1 is a diagram showing an example functional configuration of a speech recognition apparatus equipped with linear regression acoustic model adaptation.
FIG. 2 is a diagram showing an example processing flow of the speech recognition apparatus equipped with linear regression acoustic model adaptation.
FIG. 3 is a diagram showing an example functional configuration of an acoustic model correction parameter learning apparatus.
FIG. 4 is a diagram showing an example processing flow of the acoustic model correction parameter learning apparatus.
FIG. 5 is a diagram showing an example configuration of the acoustic model correction parameter learning apparatus according to the first embodiment.
FIG. 6 is a diagram showing an example processing flow of the acoustic model correction parameter learning apparatus according to the first embodiment.
FIG. 7 is a diagram showing an example functional configuration of a speech recognition apparatus that performs speech recognition based on corrected features.
FIG. 8 is a diagram showing an example processing flow of the speech recognition apparatus that performs speech recognition based on corrected features.
FIG. 9 is a diagram showing an example configuration of the feature correction parameter learning apparatus according to the second embodiment.
FIG. 10 is a diagram showing an example processing flow of the feature correction parameter learning apparatus according to the second embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. Symbols such as "^" used in the text should properly be written directly above the immediately preceding character, but because of the limitations of text notation they are written immediately after it. In the equations, these symbols are written in their proper positions. Unless otherwise noted, processing described for an individual element of a vector or matrix is applied to all elements of that vector or matrix.

Before describing the first embodiment, a speech recognition apparatus equipped with an acoustic model adaptation technique is described.

<Speech recognition apparatus 90 equipped with an acoustic model adaptation technique>
FIG. 1 shows an example functional configuration of a speech recognition apparatus 90 equipped with linear regression acoustic model adaptation, and FIG. 2 shows an example of its processing flow. The speech recognition apparatus 90 comprises a feature extraction unit 91, a word string search unit 92, a storage unit 93, and an acoustic model correction unit 94.

(Storage unit 93)
The storage unit 93 stores an acoustic model and a language model in advance. The acoustic model models the acoustic characteristics of speech. In an acoustic model for speech recognition, each phoneme is usually represented by a left-to-right HMM (Hidden Markov Model), and the output probability distribution of each HMM state is represented by a GMM (Gaussian Mixture Model). What is actually stored in the storage unit 93 as the acoustic model is therefore, for each symbol such as a phoneme, the HMM state transition probabilities, the GMM mixture weights, and the mean vector μ_m and covariance matrix Σ_m (hereinafter also "variance parameter") of each Gaussian distribution. Here, M is the total number of Gaussian distributions in the GMM, and m = 1, 2, ..., M is the index of a Gaussian distribution. These are called the acoustic model parameters, and their set is denoted Λ. The language model is composed of a large number of symbol sequences such as phonemes and words, and P(S_j) in the figure is the probability of the competing candidate symbol sequence S_j given by the language model (hereinafter also "language probability"). A competing candidate symbol sequence S_j is a symbol sequence that can be a speech recognition result, and a symbol sequence is a sequence of symbols such as phonemes or words.
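To make the role of the stored GMM parameters concrete, the following minimal sketch evaluates the log output probability of one HMM state for a single feature vector. It assumes diagonal covariance matrices, a common simplification that the text does not state, and all function and argument names are hypothetical.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector o."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )


def gmm_log_likelihood(o, weights, means, variances):
    """Log output probability of one HMM state modeled by a GMM.

    weights[m], means[m], variances[m] play the roles of the mixture weight,
    mean vector mu_m, and (diagonal) variance parameter of Gaussian m.
    """
    logs = [
        math.log(w) + log_gaussian_diag(o, mu, var)
        for w, mu, var in zip(weights, means, variances)
    ]
    peak = max(logs)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(value - peak) for value in logs))
```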

(Feature extraction unit 91)
The feature extraction unit 91 reads the speech data for recognition (s93) and extracts speech features (s95). Examples of features include MFCCs (Mel Frequency Cepstral Coefficients), ΔMFCC, ΔΔMFCC, log power, and Δ log power, which together form a feature vector o of roughly 10 to 100 dimensions. The feature vector sequence O, which is the time series of feature vectors, can be expressed as follows.
\[ O = \{ o_1, o_2, \ldots, o_N \mid o_n \in \mathbb{R}^D \} \tag{1} \]

Here, N is the number of frames, n is an integer from 1 to N, and R is the set of real numbers. That is, O is the data represented by the D-dimensional feature vectors of frames 1 through N. The analysis is performed with, for example, an analysis frame width of about 30 ms and an analysis frame shift of about 10 ms.
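The relation between the frame width, the frame shift, and the number of frames N can be sketched as follows. The sketch only cuts a sampled signal into overlapping analysis frames and does not compute MFCCs or any other feature. The function name and the 16 kHz sampling rate in the usage note are assumptions for illustration.

```python
def frame_signal(samples, sample_rate, frame_ms=30.0, shift_ms=10.0):
    """Split a sampled signal into overlapping analysis frames.

    frame_ms and shift_ms default to the approximate values given in the text
    (about 30 ms width, about 10 ms shift). Trailing samples that do not fill
    a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    starts = range(0, len(samples) - frame_len + 1, shift)
    return [samples[s:s + frame_len] for s in starts]
```

For example, one second of audio at an assumed 16 kHz sampling rate gives frames of 480 samples taken every 160 samples, so N = 98.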

(Acoustic model correction unit 94)
The acoustic model correction unit 94 reads the uncorrected acoustic model containing the acoustic model parameters Λ (stored in the storage unit 93) and the acoustic model correction parameter θ^ learned in advance and stored in the storage unit 93 (s91, s94), corrects the acoustic model using θ^ (s96), and sends the corrected acoustic model parameters Λ^ to the word string search unit 92. In this example, linear regression acoustic model adaptation corrects the mean vectors μ = {μ_1, μ_2, ..., μ_M} included in the acoustic model parameters as in equation (2) below.
\[ \hat{\mu}_m = A \mu_m + b = W [\mu_m' \;\; 1]' \tag{2} \]

Here, μ^_m is the mean vector of the m-th Gaussian distribution in the corrected acoustic model parameters, A is the transformation matrix for the mean vector, b is the bias vector for the mean vector, and ' denotes the transpose of a vector or matrix. Hereinafter, A and b, or W, are also called the mean correction parameters. Equation (2) shows the correction of the mean vector μ_m, but the variance parameter Σ_m can be corrected in the same way. Parameters for correcting the acoustic model parameters are called acoustic model correction parameters; they include the mean correction parameters and the parameters for correcting the variance parameters (hereinafter also "variance correction parameters"). In this example, the acoustic model correction parameter θ consists only of the mean correction parameters A and b, so these are also written as the acoustic model correction parameter θ = (A, b). The mean correction parameters are written W = [A b] when expressed as a matrix formed from the transformation matrix A and the bias vector b, and θ = (A, b) when expressed as a set.
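The linear regression correction of a mean vector, in both the (A, b) form and the W = [A b] form, can be sketched in a few lines. This is an illustration using plain Python lists; the function names are hypothetical, and the W form assumes the mean vector is extended with a trailing 1.

```python
def correct_mean(mu, A, b):
    """Return A * mu + b for one Gaussian mean vector."""
    return [
        sum(a * x for a, x in zip(row, mu)) + bias
        for row, bias in zip(A, b)
    ]


def correct_mean_w(mu, W):
    """Same correction with W = [A b] applied to the extended vector [mu' 1]'."""
    extended = list(mu) + [1.0]
    return [sum(w * x for w, x in zip(row, extended)) for row in W]
```

Both forms give the same corrected mean; the W form is convenient because a single matrix holds every mean correction parameter.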

To allow finer correction, the acoustic model correction parameters are often estimated separately for each cluster of Gaussian distributions in the acoustic model. In that case, the corrected mean vector μ^_m is given by equation (3). The clusters can be formed, for example, by the method of Non-Patent Document 3.
\[ \hat{\mu}_m = A_k \mu_m + b_k = W_k [\mu_m' \;\; 1]' \tag{3} \]

Here, k is the index of a cluster of Gaussian distributions, and A_k and b_k are the mean correction parameters of cluster k. When the mean correction parameters are estimated per cluster, θ_k = (A_k, b_k) and θ = (θ_1, θ_2, ..., θ_K), where K is the total number of clusters and k = 1, 2, ..., K. Likewise, W_k = [A_k b_k] and W = (W_1, W_2, ..., W_K).
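Cluster-wise correction amounts to looking up the transform of the cluster that each Gaussian belongs to before applying it. The sketch below assumes the cluster assignment is already available as a list `cluster_of` (how the clusters are built is outside its scope); that name, `transforms`, and the function name are hypothetical.

```python
def correct_means_by_cluster(means, cluster_of, transforms):
    """Apply the mean correction of cluster k to every Gaussian assigned to k.

    means[m]      : mean vector mu_m of Gaussian m
    cluster_of[m] : cluster index k of Gaussian m (assumed given)
    transforms[k] : pair (A_k, b_k) of mean correction parameters
    """
    corrected = []
    for m, mu in enumerate(means):
        A_k, b_k = transforms[cluster_of[m]]
        corrected.append([
            sum(a * x for a, x in zip(row, mu)) + bias
            for row, bias in zip(A_k, b_k)
        ])
    return corrected
```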

(Word string search unit 92)
The word string search unit 92 generates J competing candidate symbol sequences S_j for the feature vector sequence O based on the corrected acoustic model parameters Λ^ obtained from the acoustic model correction unit 94, and calculates an acoustic score for each S_j, where j = 1, 2, ..., J and J is an integer of 1 or more. The word string search unit 92 also reads the language model from the storage unit 93 in advance (s92) and, based on it, calculates a language score for each S_j. It then combines the acoustic score and the language score, searches the J competing candidate symbol sequences for the one most likely to be the sentence corresponding to the speech data for recognition, that is, the one with the highest combined score (s97), and outputs it as the recognition result (word string) S^ (s98).
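The final selection among the J candidates can be sketched as an argmax over combined scores. The sketch assumes the scores are log-domain values that are combined by a weighted sum; the weight `lm_weight` is a hypothetical parameter not mentioned in the text, and a real decoder searches a much larger hypothesis space than an explicit list.

```python
def best_hypothesis(hypotheses, lm_weight=1.0):
    """Pick the candidate symbol sequence with the highest combined score.

    hypotheses: list of (symbols, acoustic_log_score, language_log_score).
    The combined score is acoustic + lm_weight * language (an assumption).
    """
    best = max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])
    return best[0]
```

For example, with candidates `("a b", -10.0, -2.0)` and `("a c", -9.5, -3.0)`, the first wins once the language score is included, although the second has the better acoustic score.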

<Acoustic model correction parameter learning apparatus 80>
The speech recognition apparatus 90 described above corrects the acoustic model parameters using linear regression. That is, the mean correction parameter θ^ used by the acoustic model correction unit 94 is a linear regression parameter.

Two methods of learning the linear regression parameters are known: estimation under the maximum likelihood criterion (MLLR: Maximum Likelihood Linear Regression; Non-Patent Document 3) and estimation under the maximum mutual information (MMI) criterion, a type of discriminative criterion (MMI-LR; Non-Patent Document 4). Correction parameters estimated under the discriminative criterion (Non-Patent Document 4) often yield higher final speech recognition accuracy than those estimated under the maximum likelihood criterion (Non-Patent Document 3).

The specific processing of the acoustic model correction parameter learning device 80 of Non-Patent Document 4 is described below with reference to FIGS. 3 and 4. FIG. 3 shows an example functional configuration of the acoustic model correction parameter learning device 80, and FIG. 4 shows an example of its processing flow. The acoustic model correction parameter learning device 80 includes a feature amount extraction unit 81, an acoustic model correction parameter calculation unit 83, and a storage unit 93. It takes as input learning data consisting of learning speech data and the correct symbol sequence S_r for that speech data, and outputs the acoustic model correction parameter θ^.

(Feature amount extraction unit 81)
The feature amount extraction unit 81 reads the learning speech data (s83) and extracts the speech feature vector sequence O (s85). The specific feature extraction processing is the same as that of the feature amount extraction unit 91 of the speech recognition device 90 described above.

(Acoustic model correction parameter calculation unit 83)
The acoustic model correction parameter calculation unit 83 reads the acoustic model and the language model from the storage unit 93 (s81, s82), reads the correct symbol sequence S_r (s84), estimates the acoustic model correction parameter θ = (θ_1, θ_2, ..., θ_K) using the speech feature vector sequence O extracted by the feature amount extraction unit 81 (s86), and outputs it (s87). The acoustic model correction parameter θ is estimated so as to maximize the objective function F_θ, as in the following expression, using the adaptation data (the feature vector sequence O of the learning speech data) and the correct symbol sequence S_r corresponding to that feature vector sequence O.

Since Non-Patent Document 4 uses the MMI criterion as the objective function, Expression (5) is used in place of Expression (4).

Here, the MMI objective function can be written as follows.

Here, S_j is a competing candidate symbol sequence obtained by speech recognition of the feature vector sequence O; P(S_r) and P(S_j) are the language probabilities of the correct symbol sequence S_r and of the competing candidate symbol sequence S_j, respectively, obtained from the language model; p_Λ(O|S_r) and p_Λ(O|S_j) are the acoustic scores obtained from the acoustic model (HMM) for the correct symbol sequence S_r and for the competing candidate symbol sequence S_j, respectively; ψ is a scaling parameter for the acoustic score; and η is a scaling parameter for the language probability.
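To make the roles of these quantities concrete, the sketch below evaluates an MMI-style objective in the log domain. It assumes the usual form of the criterion, namely the scaled score of S_r divided by the sum of the scaled scores of the candidates S_j; the expression itself is not reproduced in this text, so this form, the function names, and the choice of which candidates enter the denominator are assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mmi_objective(ref, hyps, psi, eta):
    """Log-domain MMI-style objective (assumed standard form).

    ref:  (log p(O|S_r), log P(S_r)) for the correct symbol sequence.
    hyps: list of (log p(O|S_j), log P(S_j)) for the competing candidates.
    psi:  acoustic scaling parameter; eta: language scaling parameter.
    """
    numerator = psi * ref[0] + eta * ref[1]
    denominator = logsumexp([psi * ac + eta * lm for ac, lm in hyps])
    return numerator - denominator
```

If the candidate list contains only the correct sequence, the objective is 0; every additional candidate with non-negligible score pushes it below 0.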

The acoustic score p_Λ(O|S_j) can be written as the following expression.

Here, T is the length of the feature vector sequence of the adaptation data (the feature vector sequence O of the learning speech data), and t is a frame number or the time corresponding to that frame (hereinafter also called the "frame time"), an integer from 1 to T. That is, the feature vector sequence O is data represented by D-dimensional feature vectors for frames 1 through T (see Expression (1)). {n_{1:T}} is an HMM state sequence (from frame time 1 to T) corresponding to the competing candidate symbol sequence S_j, and Σ_{n_{1:T}} denotes summation over all possible HMM state sequences corresponding to S_j. p(o_t|n_t) is the probability that the feature vector o_t is output from the HMM state n_t at frame time t (the output distribution of an HMM state is generally represented by a GMM), and p(n_t|n_{t-1}) is the transition probability from an HMM state n_{t-1} at frame time t-1 to an HMM state n_t at frame time t.
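The sum over all HMM state sequences does not have to be enumerated explicitly; the forward recursion computes it in time linear in T. The sketch below is one way to do this in the log domain with NumPy. The array layout and the separate initial-state distribution `log_init` (standing in for the transition into the first frame) are assumptions made for the example.

```python
import numpy as np


def log_acoustic_score(log_emis, log_trans, log_init):
    """log p(O|S_j): log of the sum over all state sequences n_{1:T}.

    log_emis:  (T, N) array, log p(o_t | n_t = n).
    log_trans: (N, N) array, log p(n_t = j | n_{t-1} = i) at [i, j].
    log_init:  (N,) array, log probability of the first state (assumed).
    """
    alpha = log_init + log_emis[0]            # forward variable at t = 1
    for t in range(1, log_emis.shape[0]):
        # Sum over the previous state i for every current state j.
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0)
        alpha = alpha + log_emis[t]
    return np.logaddexp.reduce(alpha)         # sum over the final state
```

For small T and N the result can be checked against brute-force enumeration of every state sequence.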

[First embodiment]
[Key points of the first embodiment]
As described above, however, unsupervised adaptation of an acoustic model under the MMI-LR discriminative criterion treats the speech recognition result for the adaptation data as the correct label. Those labels often contain errors, so the model parameters may not be optimized well, and performance may fail to improve or may even degrade.

Therefore, in the first embodiment, in order to estimate the acoustic model correction parameters by discriminative training while taking errors in the correct symbols into account, the differenced MMI (dMMI) criterion described in Reference 1 and Reference 2 is applied as the criterion for estimating the acoustic model correction parameters.
[Reference 1] McDermott, E., Watanabe, S. and Nakamura, A., "Discriminative training based on an integrated view of MPE and MMI in margin and error space", In Proc. ICASSP'10, 2010, pp. 4894-4897
[Reference 2] Japanese Patent Application No. 2009-198362

That is, dMMI-LR was developed. The method of estimating the acoustic model correction parameters with dMMI-LR is described below using equations.

First, the Ψ function is defined as follows.

Here, σ is a margin parameter, and ε_{j,r} is the degree of difference (for example, the number of word errors or phoneme errors) of the competing candidate symbol sequence S_j from the correct symbol sequence S_r. The margin parameter σ controls how much weight each competing candidate symbol sequence S_j receives, according to its degree of difference ε_{j,r}, when the acoustic model correction parameters are estimated. The margin parameter σ can take any value from −∞ to +∞. When σ is negative, competing candidate symbol sequences S_j with a smaller degree of difference ε_{j,r}, that is, with fewer errors, are weighted more heavily. Conversely, when σ is positive, competing candidate symbol sequences S_j with a larger degree of difference ε_{j,r}, that is, with more errors, are weighted more heavily. Using this Ψ function, the objective function F^dMMI_{θ,σ1,σ2} of the dMMI discriminative training criterion (where the subscripts σ1 and σ2 denote σ_1 and σ_2) can be written as the following expression.

The first margin parameter σ_1, in the numerator, takes a negative value (σ_1 < 0). That is, in the numerator, competing candidate symbol sequences S_j with a smaller degree of difference ε_{j,r} are weighted more heavily. The second margin parameter σ_2, in the denominator, takes a positive value (σ_2 > 0). That is, in the denominator, competing candidate symbol sequences S_j with a larger degree of difference ε_{j,r} are weighted more heavily.
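The following sketch shows how such an objective could be evaluated for a short candidate list. The expressions for the Ψ function and for F^dMMI are not reproduced in this text, so the forms used here are assumptions taken from the formulation in Reference 1: Ψ(σ) is taken to be the log of the sum of candidate scores weighted by exp(ψσε_{j,r}), and the objective is the difference of Ψ at σ_1 and σ_2 divided by (σ_2 − σ_1). The function names are hypothetical.

```python
import math


def big_psi(scores, errors, psi, sigma):
    """Assumed Psi function: log sum_j exp(score_j + psi * sigma * eps_j).

    scores: psi * log p(O|S_j) + eta * log P(S_j) for each candidate.
    errors: degree of difference eps_{j,r} for each candidate.
    """
    terms = [s + psi * sigma * e for s, e in zip(scores, errors)]
    m = max(terms)
    return m + math.log(sum(math.exp(x - m) for x in terms))


def dmmi_objective(scores, errors, psi, sigma1, sigma2):
    """Assumed dMMI form: numerator uses sigma1 < 0, denominator sigma2 > 0."""
    numerator = big_psi(scores, errors, psi, sigma1)
    denominator = big_psi(scores, errors, psi, sigma2)
    return (numerator - denominator) / (sigma2 - sigma1)
```

Under these assumptions, a very negative `sigma1` leaves only the error-free candidate in the numerator, while pushing both margins toward 0 turns the objective into the negative of the expected (scaled) error count.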

By adjusting the first margin parameter σ_1 and the second margin parameter σ_2, this dMMI discriminative training criterion approaches either the MPE (Minimum Phone Error) discriminative training criterion (Reference 3) or the BMMI (boosted MMI) discriminative training criterion (Reference 4).
[Reference 3] Povey, D. and Woodland, P.C., "Minimum Phone Error and I-smoothing for improved discriminative training", In Proc. ICASSP, 2002, vol. 1, pp. I-105-I-108
[Reference 4] Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G. and Visweswariah, K., "Boosted MMI for model and feature-space discriminative training", In Proc. ICASSP, 2008, pp. 4057-4060
Here, σ_2 may be set to a small positive value close to 0, for example +0.1 (Reference 5).
[Reference 5] Saon, G. and Povey, D., "Penalty function maximization for large margin HMM training", In Proc. Interspeech, 2008, pp. 920-923
For example, σ_1 is set to a large negative value (−∞ in theory; in an implementation, for example, −50). In that case, as the following expression shows, the objective function F^dMMI_{θ,σ1,σ2} of the dMMI discriminative training criterion approaches the objective function F^BMMI_{θ,σ2} of the BMMI discriminative training criterion.

As is clear from Expression (10), only the second margin parameter σ_2, in the denominator, remains in the BMMI objective function F^BMMI_{θ,σ2}. That is, in acoustic model correction parameter estimation under the BMMI discriminative training criterion, competing candidate symbol sequences S_j with more errors (a larger degree of difference ε_{j,r}) are weighted more heavily (see Reference 6).
[Reference 6] Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G. and Visweswariah, K., "Boosted MMI for model and feature-space discriminative training", In Proc. ICASSP, 2008, pp. 4057-4060

Because the BMMI numerator directly reflects the contribution of the correct symbol sequence alone, it is easily affected by errors in the correct symbols. With dMMI, on the other hand, setting σ_1 to a larger value (for example, −10) makes the numerator a sum of the contributions of the competing candidate symbol sequences S_j obtained by recognition. With the margin term exp(ψσ_1ε_{j,r}) acting as a weight, competing candidate symbol sequences S_j that are close to the correct symbol sequence S_r (that have few errors relative to S_r) are taken into account. Because the numerator thus reflects not only the correct symbol sequence S_r but also the competing candidates close to it, the adverse effect of errors in the correct symbols is reduced. As a result, even when the correct symbol sequence S_r contains errors, the acoustic model correction parameters can be estimated stably and accurately under a discriminative criterion. The value of the margin parameter σ_1 determines how many competing candidate symbol sequences S_j are taken into account in the numerator. The appropriate value of σ_1 depends on factors such as the recognition rate of the task; a value between −10 and −3, for example, is a reasonable choice.
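A small numeric illustration of the margin weight exp(ψσ_1ε_{j,r}) may help here. The sketch prints the weight given to candidates with 0, 1, 2, and 5 errors for three values of σ_1; the acoustic scale `psi=0.1` is a hypothetical value chosen only for the example.

```python
import math


def numerator_weights(errors, psi, sigma1):
    """Margin weight exp(psi * sigma1 * eps) for each candidate's error count."""
    return [math.exp(psi * sigma1 * e) for e in errors]


errors = [0, 1, 2, 5]
for sigma1 in (-50.0, -10.0, -3.0):
    weights = numerator_weights(errors, psi=0.1, sigma1=sigma1)
    print(sigma1, [round(w, 4) for w in weights])
```

An error-free candidate always keeps weight 1, and the closer σ_1 is to 0, the more slowly the weight decays with the error count, so more near-correct candidates contribute to the numerator.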

Here, the set θ of acoustic model correction parameters is estimated so as to maximize the above dMMI objective function F^dMMI_{θ,σ1,σ2}, as in the following expression.

The method of estimating W_k is described here. To find the W_k that maximizes the dMMI objective function F^dMMI_{θ,σ1,σ2}, the objective function is first differentiated with respect to W_k. When the competing candidate symbol sequences S_j are represented in the form of a word (or phoneme) lattice, the gradient can be computed on the lattice, and the derivative of F^dMMI_{θ,σ1,σ2} with respect to W_k is expressed as follows.

Here, q_t denotes a lattice arc at frame time t, n_t denotes the state of the acoustic model (for example, an acoustic model composed of HMMs) at frame time t, and m denotes the index of a Gaussian in state n_t (the output probability distribution of an HMM state is assumed here to be represented by a GMM). γ^dMMI_{qt} (where the subscript qt denotes q_t) is the posterior probability of the arc q_t of the word (or phoneme) lattice; it is computed by running the Forward-Backward algorithm twice on the same lattice, once with the first margin parameter σ_1 and once with the second margin parameter σ_2 (see Reference 7).
[Reference 7] E. McDermott, T.J. Hazen, J.L. Roux, A. Nakamura and S. Katagiri, "Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error", IEEE Trans. ASLP, 2007, vol. 15, no. 1, pp. 203-223
γ_{nt,m}(t), Σ_{nt,m}, and μ_{nt,m} (where the subscript nt denotes n_t) are, respectively, the posterior probability, the covariance matrix, and the mean vector of Gaussian m in state n_t. How to obtain these values is described in detail in, for example, Reference 8.
[Reference 8] V. Valtchev, J.J. Odell, P.C. Woodland, and S.J. Young, "Lattice-based discriminative training for large vocabulary speech recognition", In Proc. ICSLP, 1996, vol. 2, pp. 605-609

When a gradient method such as R-Prop (Reference 9) is used, for example, W_k is updated so that the derivative (∂F^dMMI_{θ,σ1,σ2}/∂W_k) approaches 0. Alternatively, W_k can be updated using an algorithm such as Extended Baum-Welch (Reference 3). Once the above expression is available, the mean correction parameter W_k can thus be estimated easily.
[Reference 9] Riedmiller, M. and Braun, H., "A direct adaptive method for faster backpropagation learning: The RPROP algorithm", In Proc. ICNN'93, 1993, pp. 586-591
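As a rough sketch of such a gradient-based update, the function below performs one Rprop-style ascent step on a parameter array. It is a simplified variant (the update is skipped for one step after a sign change, with no weight backtracking), and the step-size constants are conventional defaults rather than values taken from this document.

```python
import numpy as np


def rprop_ascent_step(w, grad, prev_grad, step,
                      inc=1.2, dec=0.5, step_min=1e-6, step_max=1.0):
    """One Rprop-style update that moves w uphill on the objective.

    Only the sign of the gradient is used; each element keeps its own
    step size, grown while the sign is stable and shrunk when it flips.
    Returns the new parameters, the gradient to remember, and the steps.
    """
    same = grad * prev_grad
    step = np.where(same > 0, np.minimum(step * inc, step_max), step)
    step = np.where(same < 0, np.maximum(step * dec, step_min), step)
    grad = np.where(same < 0, 0.0, grad)      # skip the move after a sign flip
    w = w + np.sign(grad) * step              # ascent: follow the gradient sign
    return w, grad, step
```

In the setting above, `w` would hold the elements of W_k and `grad` the derivative of the objective with respect to them; repeated calls drive the derivative toward 0.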

<Acoustic Model Correction Parameter Learning Device 100>
The configuration of the acoustic model correction parameter learning device 100 of the first embodiment, which is based on the principle above, is described next. FIG. 5 shows the device configuration and FIG. 6 shows the processing flow.

The acoustic model correction parameter learning device 100 includes a feature amount extraction unit 110, an acoustic model correction unit 120, an error count calculation unit 130, a correction parameter differential value calculation unit 140, a correction parameter update unit 150, a convergence determination unit 160, an acoustic model storage unit 170, and a language model storage unit 180.

The acoustic model correction parameter learning device 100 receives learning speech data (hereinafter also called "adaptation speech data"), its correct symbol sequence S_r, and an initial value θ^0 of the acoustic model correction parameters; it updates the acoustic model correction parameters, finds the optimal acoustic model correction parameters, and outputs them. In this embodiment, the acoustic model correction parameters consist only of mean correction parameters and are obtained for each cluster; however, other acoustic model correction parameters may be included, and the parameters need not be obtained per cluster.

(Feature amount extraction unit 110)
The feature amount extraction unit 110 reads the learning speech data (s103), extracts its feature vector sequence O (s105), and outputs it to the acoustic model correction unit 120. Existing techniques can be used for the feature extraction itself; for example, features may be extracted by the same method as the feature amount extraction unit 91 of the speech recognition device 90 described above.

(Acoustic model storage unit 170 and language model storage unit 180)
The acoustic model storage unit 170 and the language model storage unit 180 store an acoustic model and a language model, respectively, that have been obtained in advance. Existing models may be used as the acoustic model and the language model; for example, the acoustic model and language model described for the storage unit 93 can be used.

(Acoustic model correction unit 120)
The acoustic model correction unit 120 reads the uncorrected acoustic model Λ from the acoustic model storage unit 170 (s101), receives either the initial value θ^0 of the acoustic model correction parameters or the updated acoustic model correction parameters θ^{i-1} (where i is an index denoting the iteration count), corrects the mean vectors of the acoustic model using Expression (3) (s106), and outputs the corrected acoustic model Λ^ to the error count calculation unit 130.

Here, θ^0 = {θ_1^0, θ_2^0, ..., θ_K^0} with θ_k^0 = {W_k^0}. Similarly, θ^{i-1} = {θ_1^{i-1}, θ_2^{i-1}, ..., θ_K^{i-1}} with θ_k^{i-1} = {W_k^{i-1}}. For A_k^0 and b_k^0, which make up the initial value W_k^0, the identity matrix and the zero vector (a vector whose elements are all 0), respectively, can be used, for example.
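The effect of this initialization can be seen in a short sketch. Expression (3) is not reproduced in this text, so the code assumes the usual linear-regression form in which a mean vector μ is corrected to A_k μ + b_k with W_k = [A_k b_k]; the function names are hypothetical.

```python
import numpy as np


def init_mean_correction(dim):
    """Initial W_k^0 = [A_k^0  b_k^0] with A = identity and b = zero vector."""
    return np.hstack([np.eye(dim), np.zeros((dim, 1))])


def correct_mean(W, mu):
    """Apply the assumed affine correction mu_hat = A @ mu + b, W = [A b]."""
    A, b = W[:, :-1], W[:, -1]
    return A @ mu + b


mu = np.array([1.0, -2.0, 0.5])
W0 = init_mean_correction(mu.size)
print(correct_mean(W0, mu))   # the initial correction leaves mu unchanged
```

Starting from the identity transform means the first iteration evaluates the uncorrected acoustic model, and later updates move away from it only as far as the gradient indicates.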

(Error count calculation unit 130)
The error count calculation unit 130 reads the language model from the language model storage unit 180 (s102) and, using this language model and the corrected acoustic model Λ^ received from the acoustic model correction unit 120, performs speech recognition on the feature vector sequence O received from the feature amount extraction unit 110 to obtain J competing candidate symbol sequences S_j. The error count calculation unit 130 also reads the input correct symbol sequence S_r (s104), obtains, at a predetermined granularity, the degree of difference ε_{j,r} from the correct symbol sequence S_r for each competing candidate symbol sequence S_j (s107), and outputs it to the correction parameter differential value calculation unit 140. In particular, if the predetermined granularity is the phoneme level or finer, a fine-grained degree of difference can be used within the mutual information maximization framework. For example, the unit counts, at the predetermined granularity (phonemes, words, etc.), the portions where the correct symbol sequence S_r and each obtained competing candidate symbol sequence S_j differ, and uses the count as the degree of difference ε_{j,r}.
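One simple way to count differing portions between two symbol sequences is the edit (Levenshtein) distance, sketched below. The text does not specify the counting method, so using edit distance here is an assumption; the symbols can be words or phonemes depending on the chosen granularity.

```python
def error_count(ref, hyp):
    """Edit distance between a correct and a candidate symbol sequence.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn ref into hyp.
    """
    prev = list(range(len(hyp) + 1))          # distances for an empty ref
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


print(error_count("the cat sat".split(), "the bat sat".split()))
```

Calling the function once per candidate S_j against S_r yields the values ε_{j,r} passed on to the derivative calculation.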

(Correction parameter differential value calculation unit 140)
The correction parameter differential value calculation unit 140 reads the language model from the language model storage unit 180 (s102), reads the input correct symbol sequence S_r (s104), receives the corrected acoustic model Λ^, and, using the competing candidate symbol sequences S_j and the degrees of difference ε_{j,r} received from the error count calculation unit 130, obtains the objective function F^dMMI_{θ,σ1,σ2} given by Expression (9).

The first margin parameter σ_1 is assumed to be tuned manually, taking into account the degree of mismatch between the characteristics of the learning speech data and those of the recognition speech data. The second margin parameter σ_2 is set to a small positive value close to 0, for example +0.1.

The correction parameter differential value calculation unit 140 then differentiates the objective function F^dMMI_{θ,σ1,σ2} with respect to the acoustic model correction parameter W_k = [A_k b_k] (Expression (12), s108).

It outputs the calculated derivative (∂F^dMMI_{θ,σ1,σ2}/∂W_k) to the correction parameter update unit 150.

(Correction parameter update unit 150)
The correction parameter update unit 150 updates the mean correction parameter W_k by changing it according to the derivative (∂F^dMMI_{θ,σ1,σ2}/∂W_k). That is, W_k, and hence A_k and b_k, are updated simultaneously in accordance with Expression (11) so as to maximize the objective function F^dMMI_{θ,σ1,σ2} of Expression (9) (s109).

When a gradient method such as R-Prop (Reference 9) is used, for example, W_k is updated so that the derivative (∂F^dMMI_{θ,σ1,σ2}/∂W_k) approaches 0. Alternatively, W_k can be updated using an algorithm such as Extended Baum-Welch (Reference 3). The updated acoustic model correction parameters θ^ = (W_1, ..., W_k, ..., W_K) are output to the convergence determination unit 160.

(Convergence determination unit 160)
The convergence determination unit 160 receives the acoustic model correction parameters θ^ and determines whether their estimation has converged (s110). If it determines that the estimation has converged, it outputs the acoustic model correction parameters θ^ at convergence as the output of the acoustic model correction parameter estimation device (s111). If it determines that the estimation has not converged, it outputs the acoustic model correction parameters θ^ to the acoustic model correction unit 120 and outputs a control signal so that the processing of the acoustic model correction unit 120, the error count calculation unit 130, the correction parameter differential value calculation unit 140, the correction parameter update unit 150, and the convergence determination unit 160 is repeated. The convergence determination unit 160 judges that the estimation has converged when, for example, (1) the difference between the previously obtained and the newly obtained acoustic model correction parameters is at or below a threshold, or (2) the number of iterations has reached a predetermined number.
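The iteration controlled by the convergence determination unit can be sketched as a plain loop. In this illustration, `gradient_fn` is a hypothetical stand-in for the model correction, recognition, error counting, and differentiation steps (units 120 to 140), `update_fn` stands in for the update unit 150, and the threshold and iteration limit are placeholder values.

```python
import numpy as np


def estimate_correction_parameters(theta0, gradient_fn, update_fn,
                                   tol=1e-4, max_iter=50):
    """Repeat update steps until either convergence condition holds.

    Stops when (1) the largest parameter change is at or below tol, or
    (2) max_iter iterations have been run. Returns (theta, iterations).
    """
    theta = theta0
    for i in range(1, max_iter + 1):
        new_theta = update_fn(theta, gradient_fn(theta))
        if np.max(np.abs(new_theta - theta)) <= tol:   # condition (1)
            return new_theta, i
        theta = new_theta
    return theta, max_iter                             # condition (2)
```

With a toy concave objective such as −(θ − 1)^2 and a fixed-step ascent update, the loop stops well before the iteration limit, whereas an update that never settles runs until the limit is reached.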

<Simulation results>
As an example of the effect, the following table shows experimental results for unsupervised speaker adaptation of the acoustic model in a large-vocabulary continuous speech recognition task. The results indicate that the present invention improves performance over conventional acoustic model adaptation (MLLR, MMI-LR).

<Effect>
With this configuration, the adverse effect of errors in the correct symbols can be reduced, and correction parameters for the acoustic model parameters can be obtained more appropriately than with the prior art (acoustic model adaptation based on MLLR or on the MMI-LR discriminative criterion). Furthermore, performing speech recognition with an acoustic model corrected using the acoustic model correction parameters obtained in this way improves speech recognition accuracy compared with the prior art.

<Modification>
In the first embodiment, the acoustic model correction parameters include only mean correction parameters, but they may also include variance correction parameters that correct the variance parameter Σ_m of each Gaussian in the Gaussian mixture model.

In this case, estimation under the dMMI criterion is possible by correcting the variance parameter Σ_m using Expression (13) or (14) below.

Or

Here, D_k is the variance correction parameter.

The mean correction parameter W_k = [A_k, b_k] and the variance correction parameter D_k can be estimated jointly as follows.

Here, θ^b = (W_1, D_1, ..., W_k, D_k, ..., W_K, D_K). Expression (15) shows joint estimation of the mean and covariance parameters, but it is also possible to estimate the covariance alone.

To find the D_k that maximizes the dMMI objective function F^dMMI_{θb,σ1,σ2} (where the subscript θb denotes θ^b), the objective function F^dMMI_{θb,σ1,σ2} is first differentiated with respect to D_k. When the variance correction is performed by Expression (13), the derivative of F^dMMI_{θb,σ1,σ2} with respect to D_k is expressed as follows.

また、式（１５）は、分散補正パラメータが式（１４）により補正される場合、以下のように表現される。 Further, Expression (15) is expressed as follows when the dispersion correction parameter is corrected by Expression (14).

A constraint may also be imposed so that the transformation matrix A_k for the mean vector and the transformation matrix for the variance parameter (D_k in Equation (13)) are identical. In that case, the mean vector and the variance parameter are corrected as in Equation (18), and the acoustic model correction parameters are estimated by Equation (19).

Here, A_k^c is the transformation matrix applied to both the mean vector and the variance parameter, and b_k^c is the bias vector for the mean vector.

Here, θ^c = (A^c_1, b^c_1, …, A^c_k, b^c_k, …, A^c_K, b^c_K).
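As an illustration of the shared-transform constraint, the following sketch applies one matrix to both the mean vector and the covariance matrix of a single Gaussian. Equation (18) is not reproduced in this text, so the form used here (corrected mean = A μ + b, corrected covariance = A Σ Aᵀ) is an assumption based on the usual constrained linear transform, and all function names are hypothetical.

```python
# Sketch only: assumes Equation (18) has the standard constrained form
#   mean' = A @ mean + b,   cov' = A @ cov @ A^T
# The helper names below are hypothetical and not taken from the patent.

def mat_vec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def mat_mul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    """Return the transpose of matrix a."""
    return [list(col) for col in zip(*a)]

def correct_gaussian(mean, cov, a, b):
    """Apply the shared transform (a, b) to one Gaussian's mean and covariance."""
    new_mean = [m + bias for m, bias in zip(mat_vec(a, mean), b)]
    new_cov = mat_mul(mat_mul(a, cov), transpose(a))
    return new_mean, new_cov
```

With the identity matrix and a zero bias, the Gaussian is returned unchanged, which is a natural starting point for iterative estimation.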

The configuration of the acoustic model correction parameter learning device 100 according to this modification is described below, focusing on the differences from the first embodiment.

(Acoustic model correction unit 120)
The acoustic model correction unit 120 corrects the mean vector by Equation (3) and further corrects the variance parameter Σ_m by Equation (13) or (14) (s106). The acoustic model correction unit 94 of the speech recognition device 90, which incorporates the acoustic model adaptation technique, likewise corrects the variance parameter Σ_m by the corresponding equation (Equation (13) or (14)).

(Error count calculation unit 130)
For each competing candidate symbol sequence obtained by performing speech recognition on the feature amounts of the training speech data, based on the language model and the acoustic model containing the corrected mean vectors and the corrected covariance matrices, the error count calculation unit 130 obtains the degree of difference from the correct symbol sequence at a predetermined granularity (s107).
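The degree of difference between a candidate and the correct symbol sequence can be pictured with a small example. The sketch below uses the Levenshtein (edit) distance over symbols such as words; this particular measure, the choice of granularity and the function name are assumptions for illustration, since the text only requires some degree of difference at a predetermined granularity.

```python
# Sketch only: edit distance is an assumed choice of "degree of difference".
# Symbols can be words, phonemes, or any other unit (the granularity).

def error_count(candidate, reference):
    """Return the minimum number of substitutions, insertions and deletions
    needed to turn the candidate symbol sequence into the reference."""
    # prev[j] holds the distance between the candidate prefix processed
    # so far and the first j reference symbols.
    prev = list(range(len(reference) + 1))
    for i, cand_symbol in enumerate(candidate, start=1):
        curr = [i]
        for j, ref_symbol in enumerate(reference, start=1):
            substitution = prev[j - 1] + (0 if cand_symbol == ref_symbol else 1)
            deletion = prev[j] + 1      # drop a candidate symbol
            insertion = curr[j - 1] + 1  # add a missing reference symbol
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]
```

Calling `error_count("a b c".split(), "a x c".split())` compares the two sequences word by word; passing lists of phonemes instead would change the granularity without changing the function.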

(Correction parameter differential value calculation unit 140)
The correction parameter differential value calculation unit 140 computes the objective function F^dMMI_{θb,σ1,σ2} used on the right-hand side of Equation (15). It then differentiates the objective function F^dMMI_{θb,σ1,σ2} with respect to each of the acoustic model correction parameters W_k and D_k (Equations (12), (16) or (17); s108).

(Correction parameter update unit 150)
The correction parameter update unit 150 updates the mean correction parameter W_k by changing it according to the derivative ∂F^dMMI_{θb,σ1,σ2}/∂W_k, and likewise updates the variance correction parameter D_k by changing it according to the derivative ∂F^dMMI_{θb,σ1,σ2}/∂D_k (s109).

(Convergence determination unit 160)
The convergence determination unit 160 determines whether the update of the mean correction parameter and the variance correction parameter satisfies a predetermined condition (s110). If it does, the unit outputs the updated mean correction parameter and variance correction parameter as the parameters being sought (s111). If it does not, the processing of the acoustic model correction unit 120, the error count calculation unit 130, the correction parameter differential value calculation unit 140 and the correction parameter update unit 150 is repeated.
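The update-and-check cycle described for units 120 to 160 can be sketched as a generic iterative loop. The example below is a minimal illustration that assumes plain gradient ascent with a fixed step size and a convergence test on the size of the parameter change; the text only states that the parameters are changed according to the derivative and that a predetermined condition is checked, so the update rule, the stopping rule and all names here are assumptions.

```python
# Sketch only: plain gradient ascent with a fixed step is an assumed
# update rule; `gradient` stands in for the derivative calculation unit.

def estimate_correction_parameters(gradient, initial, step=0.1,
                                   tolerance=1e-6, max_iterations=1000):
    """Repeat 'differentiate, update, check convergence' until the largest
    parameter change falls below the tolerance."""
    params = list(initial)
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        grad = gradient(params)                       # derivative step
        updated = [p + step * g for p, g in zip(params, grad)]  # update step
        change = max(abs(u - p) for u, p in zip(updated, params))
        params = updated
        if change < tolerance:                        # convergence check
            break
    return params, iteration


# Toy objective F(w) = -(w0 - 2)^2 - (w1 + 1)^2, maximized at (2, -1).
def toy_gradient(w):
    return [-2.0 * (w[0] - 2.0), -2.0 * (w[1] + 1.0)]

estimated, iterations_used = estimate_correction_parameters(toy_gradient, [0.0, 0.0])
```

In the setting described above, the gradient callback would also need to re-run the model correction and error counting with the current parameters before each derivative evaluation.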

With this configuration, the variance correction parameter can be obtained appropriately in addition to the mean correction parameter.

<Other modifications>
The acoustic model correction parameters need not be obtained for each cluster. In that case, the mean vector is corrected by Equation (2), and the variance parameter is corrected by Equation (13a) or Equation (14a).

In this embodiment, the acoustic model correction parameter learning device 100 includes the feature amount extraction unit 110. However, when the feature amounts of the training speech data are supplied as input, the feature amount extraction unit 110 may be omitted.

Expressions obtained in advance by differentiating the objective function F^dMMI_{θb,σ1,σ2} with respect to the acoustic model correction parameters (for example, Equations (12), (16) and (17)) may be stored as calculation formulas in a storage unit (not shown). In this case, the actual derivative values are obtained as follows. The correction parameter differential value calculation unit 140 reads the calculation formulas from the storage unit, reads the language model from the language model storage unit 180, reads the correct symbol sequence S_r, and receives the corrected acoustic model Λ^, the competing candidate symbol sequences S_j and the degrees of difference ε_{j,r}. It substitutes these into the calculation formulas to calculate the derivative values (∂F^dMMI_{θb,σ1,σ2}/∂W_k) and (∂F^dMMI_{θb,σ1,σ2}/∂D_k) (s108), and outputs them to the correction parameter update unit 150.

[Second Embodiment]
It can be shown that the correction of the acoustic model parameters in Equation (3) is equivalent to a correction of the feature amount, as in Equation (20).

A_k^f is the transformation matrix for the feature amount, and b_k^f is the bias vector for the feature amount; W_k^f := [A_k^f b_k^f].
Differentiating the objective function F^dMMI_{θ,σ1,σ2} with respect to the feature amount correction parameter W_k^f gives the following equation.

The first embodiment described a configuration that estimates acoustic model correction parameters on the premise that speaker adaptation is performed by correcting the acoustic model. Using Equation (20), however, the present invention can also be applied to estimating correction parameters for the feature amount (hereinafter also referred to as "feature amount correction parameters").

First, the speech recognition device 70, which performs speech recognition based on corrected feature amounts, is described.

<Speech recognition device 70>
FIG. 7 shows an example of the functional configuration of the speech recognition device 70, and FIG. 8 shows an example of its processing flow. The speech recognition device 70 comprises a feature amount extraction unit 91, a feature amount correction unit 71, a word string search unit 72 and a recording unit 74.

The acoustic model and the language model are recorded in the recording unit 74 in advance. The feature amount correction parameter W^f is also recorded in the recording unit 74 in advance. Let W^f = (W_1^f, W_2^f, …, W_K^f) and W_k^f = {A_k^f b_k^f}. In this embodiment, W^f = θ and W_k = θ_k^f.

The feature amount correction unit 71 reads the feature amount correction parameter W^f (s71). Before the feature vector sequence O extracted by the feature amount extraction unit 91 is sent to the word string search unit 72, the feature amount correction unit 71 corrects it by Equation (20) using the feature amount correction parameter W^f obtained in advance (s72).

The word string search unit 72 reads the acoustic model and the language model (s71, s72). Based on the acoustic model, it first generates J competing candidate symbol sequences S_j for the feature vector sequence O^ corrected by the feature amount correction unit 71 and calculates an acoustic score for each S_j. Next, based on the language model, it calculates a language score for each S_j. It then combines the acoustic score and the language score, searches the J competing candidate symbol sequences S_j for the one most likely to be the sentence corresponding to the recognition speech data, that is, the one with the highest combined score (s75), and outputs that sequence as the recognition result (word string) S^ (s76).
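The final selection step of the word string search can be illustrated with a short sketch. It assumes that both scores are log-domain values combined by a weighted sum, which is a common convention but is not specified in the text; the dictionary keys, the weight parameter and the function name are hypothetical.

```python
# Sketch only: a weighted sum of log-domain scores is an assumed way of
# "combining" the acoustic and language scores.

def select_best_candidate(candidates, language_weight=1.0):
    """Return the candidate whose combined score is highest.

    Each candidate is a dict with hypothetical keys 'words',
    'acoustic_score' and 'language_score'.
    """
    def combined_score(candidate):
        return (candidate["acoustic_score"]
                + language_weight * candidate["language_score"])

    return max(candidates, key=combined_score)


# Two competing candidates for the same utterance (made-up scores).
n_best = [
    {"words": ["recognize", "speech"], "acoustic_score": -10.0, "language_score": -3.0},
    {"words": ["wreck", "a", "nice", "beach"], "acoustic_score": -9.0, "language_score": -5.0},
]
recognition_result = select_best_candidate(n_best)["words"]
```

Setting `language_weight` to 0 in this sketch makes the choice depend on the acoustic score alone, which shows how the language model can overturn an acoustically preferred candidate.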

This correction of the feature vector sequence O is performed to improve the final speech recognition accuracy. The key issue in feature amount correction is therefore how to estimate a feature amount correction parameter W^f that improves the final speech recognition accuracy.

In this embodiment, the feature amount correction parameter W^f is estimated as follows. The description focuses on the differences from the first embodiment.

<Feature amount correction parameter estimation device 200>
FIG. 9 shows the configuration of the feature amount correction parameter estimation device 200, and FIG. 10 shows its processing flow. Only the processing that differs from the first embodiment is described. The feature amount correction parameter estimation device 200 includes a feature amount extraction unit 110, a feature amount correction unit 220, an error count calculation unit 230, a correction parameter differential value calculation unit 240, a correction parameter update unit 250, a convergence determination unit 260, an acoustic model storage unit 170 and a language model storage unit 180.

(Feature amount correction unit 220)
The feature amount correction unit 220 receives the initial value W^f0 of the feature amount correction parameter, or the updated feature amount correction parameter W^{f(i−1)}, together with the uncorrected feature vector sequence O. It corrects the feature vector sequence O based on Equation (20) (s206) and outputs the corrected feature vector sequence O^ to the error count calculation unit 230.

Here, W^f0 = {W_1^f0, W_2^f0, …, W_K^f0} and W_k^f0 = {A_k^f0 b_k^f0}. The initial values A_k^f0 and b_k^f0 may be, for example, the identity matrix and the zero vector (a vector whose elements are all 0), respectively. Similarly, W^{f(i−1)} = {W_1^{f(i−1)}, W_2^{f(i−1)}, …, W_K^{f(i−1)}} and W_k^{f(i−1)} = {A_k^{f(i−1)} b_k^{f(i−1)}}. In this embodiment, the feature amount correction parameter is also written as θ.
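The feature correction and its initial values can be pictured with a small sketch. Equation (20) is not reproduced in this text, so the affine form used here (corrected frame = A^f o + b^f, applied frame by frame) is an assumption based on the definitions of A_k^f and b_k^f as a transformation matrix and a bias vector; how frames are assigned to a cluster k is left out, and the function names are hypothetical.

```python
# Sketch only: assumes Equation (20) is an affine transform of each frame,
#   o_hat = A @ o + b
# for one cluster's parameters (A, b). Names are hypothetical.

def identity_matrix(dim):
    """Initial value for the transformation matrix."""
    return [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

def zero_vector(dim):
    """Initial value for the bias vector."""
    return [0.0] * dim

def correct_features(frames, a, b):
    """Apply the affine correction to every feature vector in the sequence."""
    corrected = []
    for frame in frames:
        corrected.append([
            sum(a_ij * o_j for a_ij, o_j in zip(row, frame)) + b_i
            for row, b_i in zip(a, b)
        ])
    return corrected


features = [[1.0, 2.0], [3.0, 4.0]]
# With the initial values, the features pass through unchanged.
unchanged = correct_features(features, identity_matrix(2), zero_vector(2))
```

Starting from the identity matrix and the zero vector means the first iteration evaluates the uncorrected features, so any later improvement in the objective comes from the estimated correction alone.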

(Error count calculation unit 230)
The error count calculation unit 230 is the same as in the first embodiment, except that it uses the corrected feature vector sequence O^ instead of the uncorrected feature vector sequence O (s102, s104, s207).

(Correction parameter differential value calculation unit 240)
The correction parameter differential value calculation unit 240 uses the corrected feature vector sequence O^ instead of the uncorrected feature vector sequence O, uses the acoustic model Λ instead of the corrected acoustic model Λ^, and, when differentiating the objective function F^dMMI_{θ,σ1,σ2}, uses the feature amount correction parameter W_k^f instead of the acoustic model correction parameter W_k.

Accordingly, the correction parameter differential value calculation unit 240 reads the acoustic model and the language model from the acoustic model storage unit 170 and the language model storage unit 180, respectively (s101, s102), reads the input correct symbol sequence S_r (s104), and uses the competing candidate symbol sequences S_j and the degrees of difference ε_{j,r} received from the error count calculation unit 230 to compute the objective function F^dMMI_{θ,σ1,σ2} expressed by the following equation.

The first margin parameter σ_1 is assumed to be tuned manually, taking into account the degree of mismatch between the characteristics of the training speech data and those of the recognition speech data. The second margin parameter σ_2 is set to a small positive value close to 0, for example +0.1. The correction parameter differential value calculation unit 240 then differentiates the objective function F^dMMI_{θ,σ1,σ2} with respect to the feature amount correction parameter W_k^f = [A_k^f b_k^f] (Equation (21), s208).
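The role of the two margin parameters can be illustrated numerically. The equation for the objective function is not reproduced in this text, so the sketch below assumes the form commonly published for differenced MMI: the log-ratio of two sums over the competing candidates, each weighting a candidate's scaled score by exp(σ·ε), divided by (σ2 − σ1). This form, the scaling factor `psi` and the function names are assumptions, not a transcription of the missing equation.

```python
import math

# Sketch only: assumes
#   F = 1/(s2 - s1) * log( sum_j p_j^psi * exp(s1 * e_j)
#                        / sum_j p_j^psi * exp(s2 * e_j) )
# where p_j is the combined acoustic and language score of candidate j and
# e_j is its degree of difference from the correct symbol sequence.

def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))

def dmmi_objective(log_scores, errors, sigma1, sigma2, psi=1.0):
    """Evaluate the assumed dMMI objective for one utterance."""
    if sigma1 >= sigma2:
        raise ValueError("sigma1 must be smaller than sigma2")
    numerator = log_sum_exp(
        [psi * s + sigma1 * e for s, e in zip(log_scores, errors)])
    denominator = log_sum_exp(
        [psi * s + sigma2 * e for s, e in zip(log_scores, errors)])
    return (numerator - denominator) / (sigma2 - sigma1)


errors = [0, 2]  # candidate 0 is correct, candidate 1 has two errors
favouring_correct = dmmi_objective([0.0, -5.0], errors, sigma1=-1.0, sigma2=0.1)
favouring_wrong = dmmi_objective([-5.0, 0.0], errors, sigma1=-1.0, sigma2=0.1)
```

Under this assumed form, the objective is larger when the model scores the correct candidate higher, and as both margins approach 0 it approaches the negative of the expected error count, so maximizing it reduces recognition errors.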

It outputs the calculated derivative (∂F^dMMI_{θ,σ1,σ2}/∂W_k^f) to the correction parameter update unit 250.

(Correction parameter update unit 250)
The correction parameter update unit 250 receives the derivative (∂F^dMMI_{θ,σ1,σ2}/∂W_k^f) and, according to it, updates W_k^f, that is, A_k^f and b_k^f simultaneously (s209). It outputs the updated feature amount correction parameter W^_k^f to the convergence determination unit 260.

(Convergence determination unit 260)
The convergence determination unit 260 is the same as in the first embodiment, except that it uses the feature amount correction parameter W^_k^f instead of the acoustic model correction parameter (s210, s211).

<Effect>
With this configuration, the adverse effect of errors in the correct symbols can be reduced, and correction parameters for the feature amount can be obtained more appropriately than with the prior art. Furthermore, by correcting the feature amounts of the recognition speech data with the feature amount correction parameters obtained in this way and performing speech recognition based on the corrected feature amounts, speech recognition accuracy can be improved compared with the prior art. Feature amount correction also has the advantage that the acoustic model parameters need not be updated.

<Other modifications>
The present invention is not limited to the embodiments and modifications described above. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device executing them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may be implemented by a computer. In that case, the processing content of the functions that each device should have is described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium or a semiconductor memory.

The program is distributed, for example, by selling, transferring or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program stored in its own storage unit and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. Alternatively, each time a program is transferred from the server computer to the computer, the computer may execute processing according to the received program. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized only through execution instructions and retrieval of results. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has properties that define the computer's processing).

Although each device is described as being configured by executing a predetermined program on a computer, at least part of the processing content may be implemented in hardware.

Claims

An acoustic model correction parameter estimation device, wherein an acoustic model includes a Gaussian mixture model and acoustic model parameters include mean vectors of the Gaussian distributions included in the Gaussian mixture model, the device obtaining, from feature amounts of training speech data and a correct symbol sequence for the training speech data, a mean correction parameter for correcting the mean vectors, the device comprising:
a storage unit that stores the acoustic model and a language model obtained in advance;
an acoustic model correction unit that corrects the mean vectors of the acoustic model stored in the storage unit using the mean correction parameter;
an error count calculation unit that, for each competing candidate symbol sequence obtained by performing speech recognition on the feature amounts of the training speech data based on the language model and the acoustic model including the corrected mean vectors, obtains a degree of difference from the correct symbol sequence at a predetermined granularity;
a correction parameter differential value calculation unit that obtains a differential value of an objective function of a discriminative training criterion with respect to the mean correction parameter, based on the language probability of the competing candidate symbol sequence obtained by the language model, the acoustic score obtained by the acoustic model from the feature amounts of the training speech data and the competing candidate symbol sequence, and the degree of difference; and
a correction parameter update unit that updates the mean correction parameter by changing the mean correction parameter according to the differential value.

The acoustic model correction parameter estimation device according to claim 1, wherein
the acoustic model parameters further include covariance matrices of the Gaussian distributions included in the Gaussian mixture model,
the acoustic model correction unit further corrects the covariance matrices of the Gaussian distributions included in the Gaussian mixture model using a variance correction parameter,
the error count calculation unit obtains, for each competing candidate symbol sequence obtained by performing speech recognition on the feature amounts of the training speech data based on the language model and the acoustic model including the corrected mean vectors and the corrected covariance matrices, the degree of difference from the correct symbol sequence at the predetermined granularity,
the correction parameter differential value calculation unit further obtains a differential value of the objective function of the discriminative training criterion with respect to the variance correction parameter, based on the language probability of the competing candidate symbol sequence obtained by the language model, the acoustic score obtained by the acoustic model from the feature amounts of the training speech data and the competing candidate symbol sequence, and the degree of difference, and
the correction parameter update unit further updates the variance correction parameter by changing the variance correction parameter according to the differential value with respect to the variance correction parameter.

An acoustic model correction parameter estimation method, wherein an acoustic model includes a Gaussian mixture model and acoustic model parameters include mean vectors of the Gaussian distributions included in the Gaussian mixture model, the method obtaining, from feature amounts of training speech data and a correct symbol sequence for the training speech data, a mean correction parameter for correcting the mean vectors, wherein
a storage unit stores the acoustic model and a language model obtained in advance,
the method comprising:
an acoustic model correction step of correcting the mean vectors of the acoustic model stored in the storage unit using the mean correction parameter;
an error count calculation step of obtaining, for each competing candidate symbol sequence obtained by performing speech recognition on the feature amounts of the training speech data based on the language model and the acoustic model including the corrected mean vectors, a degree of difference from the correct symbol sequence at a predetermined granularity;
a correction parameter differential value calculation step of obtaining a differential value of an objective function of a discriminative training criterion with respect to the mean correction parameter, based on the language probability of the competing candidate symbol sequence obtained by the language model, the acoustic score obtained by the acoustic model from the feature amounts of the training speech data and the competing candidate symbol sequence, and the degree of difference; and
a correction parameter update step of updating the mean correction parameter by changing the mean correction parameter according to the differential value.

The acoustic model correction parameter estimation method according to claim 3, wherein
the acoustic model parameters further include covariance matrices of the Gaussian distributions included in the Gaussian mixture model,
in the acoustic model correction step, the covariance matrices of the Gaussian distributions included in the Gaussian mixture model are further corrected using a variance correction parameter,
in the error count calculation step, for each competing candidate symbol sequence obtained by performing speech recognition on the feature amounts of the training speech data based on the language model and the acoustic model including the corrected mean vectors and the corrected covariance matrices, the degree of difference from the correct symbol sequence is obtained at the predetermined granularity,
in the correction parameter differential value calculation step, a differential value of the objective function of the discriminative training criterion with respect to the variance correction parameter is further obtained, based on the language probability of the competing candidate symbol sequence obtained by the language model, the acoustic score obtained by the acoustic model from the feature amounts of the training speech data and the competing candidate symbol sequence, and the degree of difference, and
in the correction parameter update step, the variance correction parameter is further updated by changing the variance correction parameter according to the differential value with respect to the variance correction parameter.

A program for causing a computer to function as the acoustic model correction parameter estimation device according to claim 1 or claim 2.