JP4651496B2 - Speech recognition apparatus and speech recognition method - Google Patents

Speech recognition apparatus and speech recognition method

Info

Publication number
JP4651496B2
JP4651496B2
Authority
JP
Japan
Prior art keywords
plrm
likelihood
acoustic
unit
acoustic model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2005305014A
Other languages
Japanese (ja)
Other versions
JP2007086703A5 (en)
JP2007086703A (en)
Inventor
Tomoko Matsui
Kunio Tanabe
Øystein Birkenes
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inter University Research Institute Corp Research Organization of Information and Systems
Original Assignee
Inter University Research Institute Corp Research Organization of Information and Systems
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inter University Research Institute Corp Research Organization of Information and Systems filed Critical Inter University Research Institute Corp Research Organization of Information and Systems
Priority to JP2005305014A priority Critical patent/JP4651496B2/en
Publication of JP2007086703A publication Critical patent/JP2007086703A/en
Publication of JP2007086703A5 publication Critical patent/JP2007086703A5/ja
Application granted granted Critical
Publication of JP4651496B2 publication Critical patent/JP4651496B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Description

The present invention relates to a speech recognition apparatus and a speech recognition method. More specifically, it relates to a speech recognition apparatus and a speech recognition method that first train a learning machine on training speech data and then, for newly given speech data, compute probabilistic predictions over candidate words and the like, thereby inferring words and other recognition units from speech.

In conventional speech recognition, the utterance content is determined using a maximum a posteriori decision rule. The maximum a posteriori score is computed as the product of the likelihood of an acoustic model, typically a hidden Markov model (HMM), and the probability assigned by a language model such as an N-gram. In these models, the HMM parameters are trained with an error-minimization learning algorithm or the EM (Expectation-Maximization) algorithm (see Non-Patent Documents 1 and 2).
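For reference, the conventional decision rule described in this paragraph is commonly written as follows (a textbook formulation added here for clarity, not a formula quoted from this patent):

```latex
\hat{w} \;=\; \arg\max_{w}\; p(X \mid w)\, P(w)
```

where X is the observed feature vector time series, p(X | w) is the acoustic-model (HMM) likelihood, and P(w) is the language-model (N-gram) probability of the candidate word sequence w.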

On the other hand, Tanabe has proposed the Penalized Logistic Regression Machine (hereinafter referred to in this specification as PLRM), a learning machine with excellent probabilistic predictive and discriminative power, and its performance has been demonstrated in the speech field (see Non-Patent Documents 3 to 8).

Non-Patent Document 1: Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," The Institute of Electronics, Information and Communication Engineers, 1988.
Non-Patent Document 2: B.-H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition," IEEE Trans. Speech and Audio Processing, vol. 5, pp. 257-265, May 1997.
Non-Patent Document 3: K. Tanabe, "Penalized logistic regression machines: New methods for statistical prediction 1," ISM Cooperative Research Report 143, pp. 163-194, March 2001.
Non-Patent Document 4: K. Tanabe, "Penalized logistic regression machines: New methods for statistical prediction 2," Proc. 4th Workshop on Information-Based Induction Sciences (IBIS2001), pp. 71-76, July 2001.
Non-Patent Document 5: K. Tanabe, "Penalized logistic regression machines and Related Linear Numerical Algebra," RIMS Kokyuroku 1320, Research Institute for Mathematical Sciences, Kyoto University, pp. 239-249, 2003.
Non-Patent Document 6: T. Matsui and K. Tanabe, "Speaker Identification with Dual Penalized Logistic Regression Machine," Proceedings of Odyssey, pp. 363-366, 2004.
Non-Patent Document 7: T. Matsui and K. Tanabe, "Probabilistic Speaker Identification with Dual Penalized Logistic Regression Machine," Proceedings of ICSLP, pp. III-1797-1800, 2004.
Non-Patent Document 8: T. Matsui and K. Tanabe, "Speaker Recognition Without Feature Extraction Process," Proceedings of Workshop on Statistical Modeling Approach for Speech Recognition: Beyond HMM, pp. 79-84, 2004.

Regarding speech recognition, when the HMM parameters are estimated with an error-minimization learning algorithm or the EM algorithm, the estimation may converge to a solution that is not necessarily optimal, so further improvement in recognition performance over HMM-based methods can be expected. In general, speech recognition is affected by ambient noise and the like, so recognition with 100% certainty is difficult. Moreover, the likelihood of a conventional acoustic model such as an HMM does not provide a probability measure, so speech recognition performance is limited. Accordingly, there has been a demand for a speech recognition apparatus and a speech recognition method that can output recognition results as probabilities and can reliably improve recognition performance over conventional methods.

An object of the present invention is to provide a speech recognition apparatus and a speech recognition method that can output probabilistic recognition results for recognition units, such as words, representing the utterance content, and that can reliably improve speech recognition performance compared with conventional methods using an acoustic model such as an HMM.

To solve the above problem, the speech recognition apparatus according to claim 1 comprises, as shown for example in FIG. 1: a feature extraction unit 4 that extracts a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal; an acoustic model likelihood calculation unit 8 that computes, based on the feature vector time series of the training speech data extracted by the feature extraction unit 4, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model; a first PLRM acoustic learning unit 6 that takes the likelihood function group obtained by the acoustic model likelihood calculation unit 8 into the regression function φ and learns the coefficient parameter W of the PLRM by maximizing the PLRM penalized likelihood; a second PLRM acoustic learning unit 5 that takes the likelihood function group obtained by the acoustic model likelihood calculation unit 8 into the regression function φ and learns the hyperparameter Λ of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and a speech recognition unit 9 that computes, using the coefficient parameter W optimized by the first PLRM acoustic learning unit 6 and the hyperparameter Λ optimized by the second PLRM acoustic learning unit 5, a probabilistic expression of the possibilities for the recognition units. The acoustic model likelihood calculation unit 8 computes the likelihood function group by taking the hyperparameter Λ into the parameters of the acoustic model.

Here, in addition to a speech signal received by receiving means, the speech signal also includes speech data stored in storage means. The acoustic model refers to a conventional probabilistic model whose likelihood can be computed; a hidden Markov model (HMM) is typical, but the acoustic model is not limited to it. Learning the parameter W means determining W so that the penalized likelihood computed over all training speech data is maximized, and learning the parameter Λ means determining Λ so that the penalized likelihood is maximized. The acoustic model likelihood calculation unit 8 need not always take the hyperparameter Λ into the acoustic model parameters when computing the likelihood function group; it suffices that it is configured to be able to do so.

With this configuration, a recognition result expressed as a probability can be output for recognition units such as words representing the utterance content, and it can also be used as an index of the reliability of the result. Moreover, because the PLRM and the acoustic model are coupled, a speech recognition apparatus can be provided whose recognition performance is reliably improved over conventional methods using an acoustic model such as an HMM.
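To make the data flow concrete, the following minimal sketch shows how per-candidate acoustic-model likelihoods can be collected into the regression vector φ(x;Λ) and mapped to class probabilities with a multinomial logistic (softmax) layer parameterized by W. It is an illustration only; the function names, the softmax mapping, and the toy numbers are assumptions, not taken from the patent.

```python
import numpy as np

def regression_vector(log_likelihoods):
    """Stack the M candidate acoustic-model log-likelihoods into phi(x; Lambda)."""
    return np.asarray(log_likelihoods, dtype=float)

def recognition_probabilities(W, phi):
    """Multinomial-logistic mapping from the regression vector to K class probabilities.

    W   : (K, M) coefficient matrix, one weight vector w_k per recognition unit
    phi : (M,)   regression vector built from acoustic-model likelihoods
    """
    scores = W @ phi                # one score per recognition unit
    scores -= scores.max()          # numerical stabilization
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical usage: 3 word candidates scored by an HMM, K = 3 recognition units.
phi = regression_vector([-120.3, -118.7, -125.9])
W = np.random.randn(3, 3) * 0.01
print(recognition_probabilities(W, phi))  # probabilities sum to 1
```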

The speech recognition method according to claim 2 comprises, as shown for example in FIG. 2: a feature extraction step (S011) of extracting a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal; an acoustic model likelihood calculation step (S002, S005) of computing, based on the feature vector time series of the training speech data extracted in the feature extraction step, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model; a first PLRM acoustic learning step (S003, S006) of taking the likelihood function group obtained in the acoustic model likelihood calculation step into the regression function φ and learning the coefficient parameter W of the PLRM by maximizing the PLRM penalized likelihood; a second PLRM acoustic learning step (S004) of taking the likelihood function group obtained in the acoustic model likelihood calculation step into the regression function φ and learning the hyperparameter Λ of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and a speech recognition step (S008) of computing, using the coefficient parameter W optimized in the first PLRM acoustic learning step and the hyperparameter Λ optimized in the second PLRM acoustic learning step, a probabilistic expression of the possibilities for the recognition units. The acoustic model likelihood calculation step computes the likelihood function group by taking the hyperparameter Λ into the parameters of the acoustic model.

Here, the acoustic model likelihood calculation step, the first PLRM acoustic learning step, and the second PLRM acoustic learning step may be repeated any number of times. In general, recognition performance improves as the number of iterations increases, so it is preferable to iterate until convergence. The acoustic model likelihood calculation step need not always take the hyperparameter Λ into the acoustic model parameters when computing the likelihood function group; it suffices that this is done in one of the steps.

With this configuration, a recognition result expressed as a probability can be output for recognition units such as words representing the utterance content, and it can also be used as an index of the reliability of the result. Moreover, because the PLRM and the acoustic model are coupled, a speech recognition method can be provided whose recognition performance is reliably improved over conventional methods using an acoustic model such as an HMM.

The program according to claim 3 is a program for causing a computer to execute the speech recognition method according to claim 2.

According to the present invention, a recognition result expressed as a probability can be output for recognition units such as words representing the utterance content, and it can also be used as an index of the reliability of the result; in addition, because the PLRM and the acoustic model are coupled, a speech recognition apparatus and a speech recognition method can be provided whose recognition performance is reliably improved over conventional methods using an acoustic model such as an HMM.

Embodiments of the present invention are described below with reference to the drawings. Speech learning with the learning machine PLRM is performed by using, as the input regression functions of the PLRM, the likelihoods of up to the M-th ranked candidates obtained from a speech recognition apparatus based on a hidden Markov model (HMM), a representative conventional acoustic model, and word candidates representing the utterance content of the input speech are judged probabilistically. In addition, the parameters of the acoustic model (such as the HMM), which act as hyperparameters within the PLRM, are adjusted with an optimization method such as steepest descent based on the penalized likelihood criterion that defines the PLRM, so that the PLRM and the acoustic model are coupled and the learning effect is enhanced.

FIG. 1 shows a block diagram of a configuration example of the speech recognition apparatus 100 in an embodiment of the present invention. In FIG. 1, 1 denotes a speech signal, 2 denotes training speech data, and 100 denotes the speech recognition apparatus. The speech recognition apparatus 100 comprises: a receiving unit 3 that receives the speech signal; a feature extraction unit 4 that extracts a feature vector time series, such as a mel-cepstrum coefficient time series, from the received speech signal 1 or from the training speech data 2; an acoustic model database 7 that stores feature vector time series information on the utterance content of recognition units such as words, HMM parameters, and the like; an acoustic model likelihood calculation unit 8 that, referring to the feature vector time series information stored in the acoustic model database 7 for the feature vector time series obtained by the feature extraction unit 4 (in particular, that of the training speech data 2), computes the HMM likelihoods of the recognition unit candidates based on the HMM, or computes the likelihood function group for the values of the parameters W and Λ given by the PLRM learning unit 30; a first PLRM acoustic learning unit 6 that takes the likelihood function group for the recognition unit candidates obtained by the acoustic model likelihood calculation unit 8 into the regression function φ and learns the coefficient parameter W of the PLRM by maximizing the PLRM penalized likelihood; a second PLRM acoustic learning unit 5 that takes this likelihood function group into the regression function φ and learns the hyperparameter Λ of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; a speech recognition unit 9 that, using the coefficient parameter W optimized by the first PLRM acoustic learning unit 6 and the hyperparameter Λ optimized by the second PLRM acoustic learning unit 5, computes a probabilistic expression of the possibilities for the recognition units and judges the utterance content probabilistically; a display unit 10 that displays the calculation results of the speech recognition unit 9 and the like; a storage unit 11 that stores, as appropriate, the calculation results, intermediate data, and the like of the second PLRM acoustic learning unit 5, the first PLRM acoustic learning unit 6, and the speech recognition unit 9; and a control unit 12 that controls the units constituting the speech recognition apparatus 100 so that the apparatus fulfills its functions.

Here, learning the parameter W means determining W so that the penalized likelihood computed over all training speech data is maximized, and learning the parameter Λ means determining Λ so that the penalized likelihood is maximized. Obtaining the likelihood function group means that a plurality of values of the acoustic model parameters are either supplied from outside the likelihood calculation unit or learned and determined internally, and the likelihoods of the plurality of acoustic models are computed for those values.

Of these units, the first PLRM acoustic learning unit 6 and the second PLRM acoustic learning unit 5 constitute the PLRM learning unit 30; learning is repeated while the parameter information W and Λ is exchanged between them, thereby improving recognition performance. The acoustic model database 7 and the acoustic model likelihood calculation unit 8 form part of the acoustic model unit 40, which has an HMM-based speech recognition function and a likelihood calculation function. The PLRM learning unit 30 receives from the acoustic model unit 40 the likelihood function group to be taken into the regression function φ, and supplies the parameter Λ to the acoustic model unit 40. The PLRM learning unit 30 and the speech recognition unit 9 form part of the PLRM unit 20, which executes processing based on the PLRM and has the various functions of the PLRM.

FIG. 2 shows an example of the processing flow of the speech recognition method in this embodiment. The part enclosed in the frame is an example of the processing flow in the PLRM learning unit 30.

The receiving unit 3 receives the speech signal 1 (S010). From the speech signal 1 received by the receiving unit 3 or from the training speech data 2, the feature extraction unit 4 extracts a feature vector time series X such as a mel-cepstrum coefficient time series (S011).
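For reference, a feature vector time series of the kind used here (mel-cepstrum coefficients plus their first- and second-order regression coefficients, as in the experiment described later) could be computed roughly as follows. This is a sketch using the librosa library with assumed parameter values, not the patent's actual front end.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """13 MFCCs + delta + delta-delta -> 39-dimensional feature vector time series.

    25 ms analysis window, 10 ms frame shift (values assumed for illustration).
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), window="hamming",
    )
    delta = librosa.feature.delta(mfcc)             # first-order regression coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order regression coefficients
    return np.vstack([mfcc, delta, delta2]).T       # shape: (num_frames, 39)
```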

Training speech data 2 is used during learning. The acoustic model likelihood calculation unit 8 takes the feature vector time series X as input, performs HMM-based learning with an error-minimization learning algorithm or the EM algorithm, and obtains initial values of the second parameter Λ of the acoustic model (S001). The acoustic model likelihood calculation unit 8 then refers to the feature vector time series information on the utterance content of the recognition units (such as words) stored in the acoustic model database 7 for the feature vector time series X of the training speech data 2, computes the HMM likelihoods for a plurality (M) of recognition unit candidates, and obtains an optimized likelihood function group. Recognition units include, besides words, predicates, phrases, single phones, and so on.
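As an illustration of this step, per-candidate HMM log-likelihoods for an utterance can be obtained roughly as below. This sketch uses the hmmlearn library with a per-word Gaussian-mixture HMM; the structure, parameter values, and all names are assumptions, not the patent's implementation.

```python
import numpy as np
from hmmlearn import hmm

def train_word_hmms(features_per_word, n_states=5, n_mix=5):
    """Fit one Gaussian-mixture HMM per recognition unit (word) via EM (Baum-Welch)."""
    models = {}
    for word, feature_list in features_per_word.items():
        X = np.vstack(feature_list)               # concatenate training utterances
        lengths = [len(f) for f in feature_list]  # per-utterance frame counts
        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag")
        model.fit(X, lengths)
        models[word] = model
    return models

def candidate_log_likelihoods(models, X):
    """Score one utterance X (frames x 39) against every word HMM -> phi(x; Lambda)."""
    return {word: model.score(X) for word, model in models.items()}
```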

In the first PLRM acoustic learning unit 6, the computed acoustic-model likelihoods of the candidates are input to the PLRM as the PLRM regression function φ(x;Λ) = (φ_1(x;Λ), φ_2(x;Λ), ..., φ_M(x;Λ)) (S002). In the PLRM, the acoustic model parameter Λ is treated as a hyperparameter of the PLRM. In the PLRM, the probability of a recognition unit such as a word is expressed as

[Equation 1: probability p_k(θ) of recognition unit k, shown as an image in the original]

and the PLRM penalized likelihood is expressed as

[Equation 2: PLRM penalized likelihood, shown as an image in the original]

(see Non-Patent Documents 3, 4, and 5).

Here, K is the total number of recognition units such as words (1 ≤ M ≤ K), W = (w_1, w_2, ..., w_K) is the coefficient parameter (first parameter) of the PLRM, and each element w_k is an M-dimensional weight vector; θ = (W, Λ). D = {(x_n, y_n)}, n = 1, ..., N, denotes the set of N training speech signals x_n and their recognition-unit labels y_n (such as words). Γ is a matrix that corrects the bias caused by the class distribution of the training data, for example a diagonal matrix whose k-th diagonal element corresponds to the number of training samples with the k-th label. Σ is a positive definite matrix, and δ is a parameter used to strike a balance with the term shown in the original as an equation image. Γ, Σ, and δ are selected appropriately in advance according to the data. The selection of Σ is described later.
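Because the two expressions above survive only as images, the following is a hedged reconstruction of the standard penalized multinomial logistic regression form they correspond to, pieced together from the surrounding definitions and the cited PLRM papers; the exact formula in the patent may differ. The probability of recognition unit k would take the form

```latex
p_k(\theta \mid x) \;=\;
\frac{\exp\big(w_k^{\top}\phi(x;\Lambda)\big)}
     {\sum_{j=1}^{K}\exp\big(w_j^{\top}\phi(x;\Lambda)\big)},
\qquad \theta = (W,\Lambda)
```

and the penalized (log-)likelihood over the training set D would take the form

```latex
\ell_{\delta}(W,\Lambda) \;=\;
\sum_{n=1}^{N}\log p_{y_n}\big(\theta \mid x_n\big)
\;-\; \frac{\delta}{2}\sum_{k=1}^{K} w_k^{\top}\,\Sigma\, w_k
```

The class-balancing matrix Γ described above additionally reweights this objective (for example, by weighting each term of the data sum according to its class); its exact placement, and whether Σ enters directly or through its inverse, should be taken from Non-Patent Documents 3 to 5 rather than from this sketch.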

The first PLRM acoustic learning unit 6 learns and optimizes the coefficient parameter W of the PLRM (S003); that is, W is determined so that the penalized likelihood computed over all training speech data is maximized. In the second PLRM acoustic learning unit 5, the coefficient parameter W of the PLRM is held fixed, and the acoustic model parameter Λ is learned by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent (S004).

The acoustic model unit 40 receives the parameter Λ obtained by the PLRM learning unit 30, and the acoustic model likelihood calculation unit 8 uses it to compute the likelihoods of the acoustic model. The PLRM learning unit 30 receives the likelihoods of the recognition unit candidates computed by the acoustic model unit 40, inputs them to the first PLRM acoustic learning unit 6 as the PLRM regression function (S005), and learns the coefficient parameter W of the PLRM (S006). At this point the speech recognition unit 9 could already use the parameters W and Λ to obtain, via Equation 1, the probabilities p_k(θ) of recognition units such as words. However, it is preferable to repeat S004 to S006 so as to learn W and Λ further and improve the validity of the probabilities p_k(θ). That is, the second PLRM acoustic learning unit 5 fixes the newly obtained coefficient parameter W of the PLRM and optimizes the acoustic model parameter Λ with an optimization method such as steepest descent (S004). Steps S004 to S006 are repeated sufficiently, and when the coefficient parameter W of the PLRM and the acoustic model parameter Λ are judged to have converged, the PLRM acoustic learning is terminated (S007).
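The alternating training loop described in S003 to S007 can be summarized by the following sketch. The optimizer, step size, and convergence test are placeholders chosen for illustration, and `penalized_likelihood`, `optimize_W`, and `grad_Lambda` stand for routines the patent does not spell out.

```python
def train_plrm(W, Lam, data, penalized_likelihood, optimize_W, grad_Lambda,
               step=1e-3, tol=1e-6, max_iters=100):
    """Alternate: (S003/S006) maximize over W with Lambda fixed,
    (S004) one steepest-ascent update of Lambda with W fixed, until convergence (S007)."""
    prev = penalized_likelihood(W, Lam, data)
    for _ in range(max_iters):
        W = optimize_W(Lam, data)                     # coefficient parameters of the PLRM
        Lam = Lam + step * grad_Lambda(W, Lam, data)  # hyperparameters = acoustic model params
        cur = penalized_likelihood(W, Lam, data)
        if abs(cur - prev) < tol:                     # convergence check (S007)
            break
        prev = cur
    return W, Lam
```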

At test time, the speech signal 1 is received by the receiving unit 3 (S010), and the feature extraction unit 4 extracts a feature vector time series such as a mel-cepstrum coefficient time series (S011). The feature vector time series is input to the trained PLRM unit 20, and the utterance is estimated based on the probabilities of the recognition unit candidates, such as words, corresponding to the input speech (S008). The candidate with the maximum probability is displayed as the recognition result on the display unit 10 (S009). The full set of computed probabilities can also be used as an index giving the confidence of the recognition result.
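Continuing the earlier sketch (same assumed helper names), test-time decoding then reduces to picking the candidate with the largest PLRM probability and reporting that probability as a confidence score:

```python
# Hypothetical decoding step reusing recognition_probabilities() from the earlier sketch.
def recognize(W, phi, labels):
    """Return (best label, its probability, full probability vector)."""
    p = recognition_probabilities(W, phi)
    best = int(p.argmax())
    return labels[best], float(p[best]), p
```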

Next, the selection of Σ is described. The simplest choice is Σ = I, where I denotes the identity matrix. Another choice, which takes overfitting to the training speech data into account, is Σ = (1/N)ΦΦ^T (see Non-Patent Documents 3, 4, and 5).
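As a small illustration (names assumed), the two choices of Σ mentioned above can be formed from the matrix Φ whose columns are the regression vectors φ(x_n;Λ) of the N training utterances:

```python
import numpy as np

def sigma_identity(M):
    """Simplest choice: Sigma = I (M x M identity)."""
    return np.eye(M)

def sigma_empirical(Phi):
    """Data-dependent choice: Sigma = (1/N) * Phi @ Phi.T, with Phi of shape (M, N)."""
    N = Phi.shape[1]
    return (Phi @ Phi.T) / N
```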

Finally, a recognition experiment using the above speech recognition apparatus is described.
The experiment was carried out on the E-set of the TI46 database (Table 1). This set contains 1433 training utterances and 2291 test utterances. From each speech signal, a 39-dimensional feature vector consisting of a 13-dimensional mel-cepstrum coefficient time series and its first- and second-order regression coefficients was extracted, using a 25 ms Hamming window and a 10 ms window shift. Each word was modeled by an HMM whose states have 5-Gaussian mixture distributions. Based on preliminary experiments, δ was set to 0.01 for Σ = I and to 5000 for Σ = (1/N)ΦΦ^T.

[Table 1: TI46 E-set data, shown as an image in the original]

Table 2 shows the result of comparing the word recognition rates of the method of the present invention and the conventional method.

[Table 2: word recognition rates of the proposed and conventional methods, shown as an image in the original]

The present invention can also be realized as a program for causing a computer to execute the speech recognition method of the embodiment, and as a computer-readable recording medium on which the program is recorded. The program may be recorded in a ROM built into the computer, may be recorded on a recording medium such as an FD, a CD-ROM, or an internal or external magnetic disk and read into the computer, or may be downloaded to the computer via the Internet.

Although embodiments of the present invention have been described above, the present invention is not limited to the above embodiments, and it is obvious that various modifications can be made to the embodiments without departing from the spirit of the present invention.

For example, in the above embodiment, the second PLRM acoustic learning step (S004), which learns the second parameter Λ of the acoustic model by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent, and the first PLRM acoustic learning step (S006), which learns by optimizing the PLRM parameter W, are repeated, but the number of repetitions is arbitrary. The conventional acoustic model was described using a hidden Markov model (HMM) as an example, but the likelihood of another acoustic model, such as one based on a Bayesian network, may be used. The training speech data is not limited to the E-set used in the experiment.

The present invention can be used for speech recognition.

FIG. 1 is a block diagram of a configuration example of the speech recognition apparatus in an embodiment of the present invention. FIG. 2 is a diagram showing an example of the processing flow of the speech recognition method in an embodiment of the present invention.

Explanation of symbols

1 Speech signal
2 Training speech data
3 Receiving unit
4 Feature extraction unit
5 Second PLRM acoustic learning unit
6 First PLRM acoustic learning unit
7 Acoustic model database
8 Acoustic model likelihood calculation unit
9 Speech recognition unit
10 Display unit
11 Storage unit
12 Control unit
20 PLRM unit
30 PLRM learning unit
40 Acoustic model unit
100 Speech recognition apparatus
W Coefficient parameter
Λ Hyperparameter

Claims (3)

1. A speech recognition apparatus comprising:
a feature extraction unit that extracts a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal;
an acoustic model likelihood calculation unit that computes, based on the feature vector time series of training speech data extracted by the feature extraction unit, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model;
a first PLRM acoustic learning unit that takes the likelihood function group obtained by the acoustic model likelihood calculation unit into a regression function and learns a coefficient parameter of a Penalized Logistic Regression Machine (hereinafter referred to in the claims as PLRM) by maximizing a PLRM penalized likelihood;
a second PLRM acoustic learning unit that takes the likelihood function group obtained by the acoustic model likelihood calculation unit into the regression function and learns a hyperparameter of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and
a speech recognition unit that computes, using the coefficient parameter optimized by the first PLRM acoustic learning unit and the hyperparameter optimized by the second PLRM acoustic learning unit, a probabilistic expression of the possibilities for the recognition units,
wherein the acoustic model likelihood calculation unit computes the likelihood function group by taking the hyperparameter into parameters of the acoustic model.
2. A speech recognition method comprising:
a feature extraction step of extracting a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal;
an acoustic model likelihood calculation step of computing, based on the feature vector time series of training speech data extracted in the feature extraction step, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model;
a first PLRM acoustic learning step of taking the likelihood function group obtained in the acoustic model likelihood calculation step into a regression function and learning a coefficient parameter of a Penalized Logistic Regression Machine (hereinafter referred to in the claims as PLRM) by maximizing a PLRM penalized likelihood;
a second PLRM acoustic learning step of taking the likelihood function group obtained in the acoustic model likelihood calculation step into the regression function and learning a hyperparameter of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and
a speech recognition step of computing, using the coefficient parameter optimized in the first PLRM acoustic learning step and the hyperparameter optimized in the second PLRM acoustic learning step, a probabilistic expression of the possibilities for the recognition units,
wherein the acoustic model likelihood calculation step computes the likelihood function group by taking the hyperparameter into parameters of the acoustic model.
3. A program for causing a computer to execute the speech recognition method according to claim 2.
JP2005305014A 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method Active JP4651496B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2005305014A JP4651496B2 (en) 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2005305014A JP4651496B2 (en) 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method

Publications (3)

Publication Number Publication Date
JP2007086703A JP2007086703A (en) 2007-04-05
JP2007086703A5 JP2007086703A5 (en) 2008-10-16
JP4651496B2 true JP4651496B2 (en) 2011-03-16

Family

ID=37973705

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2005305014A Active JP4651496B2 (en) 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method

Country Status (1)

Country Link
JP (1) JP4651496B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101493552B1 (en) * 2008-05-14 2015-02-13 닛토보 온쿄 엔지니어링 가부시키가이샤 Signal judgment method, signal judgment apparatus, program, and signal judgment system
JP2010044031A (en) * 2008-07-15 2010-02-25 Nittobo Acoustic Engineering Co Ltd Method for identifying aircraft, method for measuring aircraft noise and method for determining signals using the same

Also Published As

Publication number Publication date
JP2007086703A (en) 2007-04-05

Similar Documents

Publication Publication Date Title
JP6686154B2 (en) Utterance recognition method and device
JP4141495B2 (en) Method and apparatus for speech recognition using optimized partial probability mixture sharing
JP5229478B2 (en) Statistical model learning apparatus, statistical model learning method, and program
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
JP6509694B2 (en) Learning device, speech detection device, learning method and program
JP5249967B2 (en) Speech recognition device, weight vector learning device, speech recognition method, weight vector learning method, program
JP6031316B2 (en) Speech recognition apparatus, error correction model learning method, and program
US20220101859A1 (en) Speaker recognition based on signal segments weighted by quality
JP2004226982A (en) Method for speech recognition using hidden track, hidden markov model
JPWO2007105409A1 (en) Standard pattern adaptation device, standard pattern adaptation method, and standard pattern adaptation program
Walker et al. Semi-supervised model training for unbounded conversational speech recognition
JP4796460B2 (en) Speech recognition apparatus and speech recognition program
JP4651496B2 (en) Speech recognition apparatus and speech recognition method
JP3920749B2 (en) Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
JP5006888B2 (en) Acoustic model creation device, acoustic model creation method, acoustic model creation program
JP5288378B2 (en) Acoustic model speaker adaptation apparatus and computer program therefor
JP2007078943A (en) Acoustic score calculating program
JP6546070B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
JP2002251198A (en) Speech recognition system
JP4537970B2 (en) Language model creation device, language model creation method, program thereof, and recording medium thereof
JP2008064849A (en) Sound model creation device, speech recognition device using the same, method, program and recording medium therefore
JP4779239B2 (en) Acoustic model learning apparatus, acoustic model learning method, and program thereof
JP4394972B2 (en) Acoustic model generation method and apparatus for speech recognition, and recording medium recording an acoustic model generation program for speech recognition
JP5104732B2 (en) Extended recognition dictionary learning device, speech recognition system using the same, method and program thereof

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20080901

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20080901

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20101104

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20101124

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20101214

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20131224

Year of fee payment: 3

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250