JP4651496B2 - Speech recognition apparatus and speech recognition method - Google Patents

Speech recognition apparatus and speech recognition method

Info

Publication number
JP4651496B2
JP4651496B2
Authority
JP
Japan
Prior art keywords
plrm
likelihood
acoustic
unit
acoustic model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
JP2005305014A
Other languages
Japanese (ja)
Other versions
JP2007086703A5 (en)
JP2007086703A (en)
Inventor
Tomoko Matsui
Kunio Tanabe
Øystein Birkenes
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inter University Research Institute Corp Research Organization of Information and Systems
Original Assignee
Inter University Research Institute Corp Research Organization of Information and Systems
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inter University Research Institute Corp Research Organization of Information and Systems filed Critical Inter University Research Institute Corp Research Organization of Information and Systems
Priority to JP2005305014A priority Critical patent/JP4651496B2/en
Publication of JP2007086703A publication Critical patent/JP2007086703A/en
Publication of JP2007086703A5 publication Critical patent/JP2007086703A5/ja
Application granted granted Critical
Publication of JP4651496B2 publication Critical patent/JP4651496B2/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Description

The present invention relates to a speech recognition apparatus and a speech recognition method. More specifically, it relates to a speech recognition apparatus and a speech recognition method that first train a learning machine on training speech data and then, for newly given speech data, compute probabilistic predictions over candidate words and the like, thereby inferring words and other recognition units from speech.

In conventional speech recognition, the utterance content is determined using a maximum a posteriori decision rule. The maximum a posteriori score is computed as the product of the likelihood of an acoustic model, typically a hidden Markov model (HMM), and the probability assigned by a language model such as an N-gram. In these models, the HMM parameters are trained with an error-minimization learning algorithm or the EM (Expectation-Maximization) algorithm (see Non-Patent Documents 1 and 2).
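For reference, the conventional decision rule described in this paragraph is commonly written as follows (a textbook formulation added here for clarity, not a formula quoted from this patent):

```latex
\hat{w} \;=\; \arg\max_{w}\; p(X \mid w)\, P(w)
```

where X is the observed feature vector time series, p(X | w) is the acoustic-model (HMM) likelihood, and P(w) is the language-model (N-gram) probability of the candidate word sequence w.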

On the other hand, Tanabe has proposed the Penalized Logistic Regression Machine (hereinafter referred to in this specification as PLRM), a learning machine with excellent probabilistic predictive and discriminative power, and its performance has been demonstrated in the speech field (see Non-Patent Documents 3 to 8).

Non-Patent Document 1: Seiichi Nakagawa, "Speech Recognition by Probabilistic Models," The Institute of Electronics, Information and Communication Engineers, 1988.
Non-Patent Document 2: B.-H. Juang, W. Chou, and C.-H. Lee, "Minimum classification error rate methods for speech recognition," IEEE Trans. Speech and Audio Processing, vol. 5, pp. 257-265, May 1997.
Non-Patent Document 3: K. Tanabe, "Penalized logistic regression machines: New methods for statistical prediction 1," ISM Cooperative Research Report 143, pp. 163-194, March 2001.
Non-Patent Document 4: K. Tanabe, "Penalized logistic regression machines: New methods for statistical prediction 2," Proc. 4th Workshop on Information-Based Induction Sciences (IBIS2001), pp. 71-76, July 2001.
Non-Patent Document 5: K. Tanabe, "Penalized logistic regression machines and Related Linear Numerical Algebra," RIMS Kokyuroku 1320, Research Institute for Mathematical Sciences, Kyoto University, pp. 239-249, 2003.
Non-Patent Document 6: T. Matsui and K. Tanabe, "Speaker Identification with Dual Penalized Logistic Regression Machine," Proceedings of Odyssey, pp. 363-366, 2004.
Non-Patent Document 7: T. Matsui and K. Tanabe, "Probabilistic Speaker Identification with Dual Penalized Logistic Regression Machine," Proceedings of ICSLP, pp. III-1797-1800, 2004.
Non-Patent Document 8: T. Matsui and K. Tanabe, "Speaker Recognition Without Feature Extraction Process," Proceedings of Workshop on Statistical Modeling Approach for Speech Recognition: Beyond HMM, pp. 79-84, 2004.

Regarding speech recognition, when the HMM parameters are estimated with an error-minimization learning algorithm or the EM algorithm, the estimation may converge to a solution that is not necessarily optimal, so further improvement in recognition performance over HMM-based methods can be expected. In general, speech recognition is affected by ambient noise and the like, so recognition with 100% certainty is difficult. Moreover, the likelihood of a conventional acoustic model such as an HMM does not provide a probability measure, so speech recognition performance is limited. Accordingly, there has been a demand for a speech recognition apparatus and a speech recognition method that can output recognition results as probabilities and can reliably improve recognition performance over conventional methods.

An object of the present invention is to provide a speech recognition apparatus and a speech recognition method that can output probabilistic recognition results for recognition units, such as words, representing the utterance content, and that can reliably improve speech recognition performance compared with conventional methods using an acoustic model such as an HMM.

To solve the above problem, the speech recognition apparatus according to claim 1 comprises, as shown for example in FIG. 1: a feature extraction unit 4 that extracts a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal; an acoustic model likelihood calculation unit 8 that computes, based on the feature vector time series of the training speech data extracted by the feature extraction unit 4, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model; a first PLRM acoustic learning unit 6 that takes the likelihood function group obtained by the acoustic model likelihood calculation unit 8 into the regression function φ and learns the coefficient parameter W of the PLRM by maximizing the PLRM penalized likelihood; a second PLRM acoustic learning unit 5 that takes the likelihood function group obtained by the acoustic model likelihood calculation unit 8 into the regression function φ and learns the hyperparameter Λ of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and a speech recognition unit 9 that computes, using the coefficient parameter W optimized by the first PLRM acoustic learning unit 6 and the hyperparameter Λ optimized by the second PLRM acoustic learning unit 5, a probabilistic expression of the possibilities for the recognition units. The acoustic model likelihood calculation unit 8 computes the likelihood function group by taking the hyperparameter Λ into the parameters of the acoustic model.

Here, in addition to a speech signal received by receiving means, the speech signal also includes speech data stored in storage means. The acoustic model refers to a conventional probabilistic model whose likelihood can be computed; a hidden Markov model (HMM) is typical, but the acoustic model is not limited to it. Learning the parameter W means determining W so that the penalized likelihood computed over all training speech data is maximized, and learning the parameter Λ means determining Λ so that the penalized likelihood is maximized. The acoustic model likelihood calculation unit 8 need not always take the hyperparameter Λ into the acoustic model parameters when computing the likelihood function group; it suffices that it is configured to be able to do so.

With this configuration, a recognition result expressed as a probability can be output for recognition units such as words representing the utterance content, and it can also be used as an index of the reliability of the result. Moreover, because the PLRM and the acoustic model are coupled, a speech recognition apparatus can be provided whose recognition performance is reliably improved over conventional methods using an acoustic model such as an HMM.
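To make the data flow concrete, the following minimal sketch shows how per-candidate acoustic-model likelihoods can be collected into the regression vector φ(x;Λ) and mapped to class probabilities with a multinomial logistic (softmax) layer parameterized by W. It is an illustration only; the function names, the softmax mapping, and the toy numbers are assumptions, not taken from the patent.

```python
import numpy as np

def regression_vector(log_likelihoods):
    """Stack the M candidate acoustic-model log-likelihoods into phi(x; Lambda)."""
    return np.asarray(log_likelihoods, dtype=float)

def recognition_probabilities(W, phi):
    """Multinomial-logistic mapping from the regression vector to K class probabilities.

    W   : (K, M) coefficient matrix, one weight vector w_k per recognition unit
    phi : (M,)   regression vector built from acoustic-model likelihoods
    """
    scores = W @ phi                # one score per recognition unit
    scores -= scores.max()          # numerical stabilization
    p = np.exp(scores)
    return p / p.sum()

# Hypothetical usage: 3 word candidates scored by an HMM, K = 3 recognition units.
phi = regression_vector([-120.3, -118.7, -125.9])
W = np.random.randn(3, 3) * 0.01
print(recognition_probabilities(W, phi))  # probabilities sum to 1
```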

The speech recognition method according to claim 2 comprises, as shown for example in FIG. 2: a feature extraction step (S011) of extracting a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal; an acoustic model likelihood calculation step (S002, S005) of computing, based on the feature vector time series of the training speech data extracted in the feature extraction step, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model; a first PLRM acoustic learning step (S003, S006) of taking the likelihood function group obtained in the acoustic model likelihood calculation step into the regression function φ and learning the coefficient parameter W of the PLRM by maximizing the PLRM penalized likelihood; a second PLRM acoustic learning step (S004) of taking the likelihood function group obtained in the acoustic model likelihood calculation step into the regression function φ and learning the hyperparameter Λ of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and a speech recognition step (S008) of computing, using the coefficient parameter W optimized in the first PLRM acoustic learning step and the hyperparameter Λ optimized in the second PLRM acoustic learning step, a probabilistic expression of the possibilities for the recognition units. The acoustic model likelihood calculation step computes the likelihood function group by taking the hyperparameter Λ into the parameters of the acoustic model.

Here, the acoustic model likelihood calculation step, the first PLRM acoustic learning step, and the second PLRM acoustic learning step may be repeated any number of times. In general, recognition performance improves as the number of iterations increases, so it is preferable to iterate until convergence. The acoustic model likelihood calculation step need not always take the hyperparameter Λ into the acoustic model parameters when computing the likelihood function group; it suffices that this is done in one of the steps.

With this configuration, a recognition result expressed as a probability can be output for recognition units such as words representing the utterance content, and it can also be used as an index of the reliability of the result. Moreover, because the PLRM and the acoustic model are coupled, a speech recognition method can be provided whose recognition performance is reliably improved over conventional methods using an acoustic model such as an HMM.

The program according to claim 3 is a program for causing a computer to execute the speech recognition method according to claim 2.

According to the present invention, a recognition result expressed as a probability can be output for recognition units such as words representing the utterance content, and it can also be used as an index of the reliability of the result; in addition, because the PLRM and the acoustic model are coupled, a speech recognition apparatus and a speech recognition method can be provided whose recognition performance is reliably improved over conventional methods using an acoustic model such as an HMM.

Embodiments of the present invention are described below with reference to the drawings. Speech learning with the learning machine PLRM is performed by using, as the input regression functions of the PLRM, the likelihoods of up to the M-th ranked candidates obtained from a speech recognition apparatus based on a hidden Markov model (HMM), a representative conventional acoustic model, and word candidates representing the utterance content of the input speech are judged probabilistically. In addition, the parameters of the acoustic model (such as the HMM), which act as hyperparameters within the PLRM, are adjusted with an optimization method such as steepest descent based on the penalized likelihood criterion that defines the PLRM, so that the PLRM and the acoustic model are coupled and the learning effect is enhanced.

FIG. 1 shows a block diagram of a configuration example of the speech recognition apparatus 100 in an embodiment of the present invention. In FIG. 1, 1 denotes a speech signal, 2 denotes training speech data, and 100 denotes the speech recognition apparatus. The speech recognition apparatus 100 comprises: a receiving unit 3 that receives the speech signal; a feature extraction unit 4 that extracts a feature vector time series, such as a mel-cepstrum coefficient time series, from the received speech signal 1 or from the training speech data 2; an acoustic model database 7 that stores feature vector time series information on the utterance content of recognition units such as words, HMM parameters, and the like; an acoustic model likelihood calculation unit 8 that, referring to the feature vector time series information stored in the acoustic model database 7 for the feature vector time series obtained by the feature extraction unit 4 (in particular, that of the training speech data 2), computes the HMM likelihoods of the recognition unit candidates based on the HMM, or computes the likelihood function group for the values of the parameters W and Λ given by the PLRM learning unit 30; a first PLRM acoustic learning unit 6 that takes the likelihood function group for the recognition unit candidates obtained by the acoustic model likelihood calculation unit 8 into the regression function φ and learns the coefficient parameter W of the PLRM by maximizing the PLRM penalized likelihood; a second PLRM acoustic learning unit 5 that takes this likelihood function group into the regression function φ and learns the hyperparameter Λ of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; a speech recognition unit 9 that, using the coefficient parameter W optimized by the first PLRM acoustic learning unit 6 and the hyperparameter Λ optimized by the second PLRM acoustic learning unit 5, computes a probabilistic expression of the possibilities for the recognition units and judges the utterance content probabilistically; a display unit 10 that displays the calculation results of the speech recognition unit 9 and the like; a storage unit 11 that stores, as appropriate, the calculation results, intermediate data, and the like of the second PLRM acoustic learning unit 5, the first PLRM acoustic learning unit 6, and the speech recognition unit 9; and a control unit 12 that controls the units constituting the speech recognition apparatus 100 so that the apparatus fulfills its functions.

Here, learning the parameter W means determining W so that the penalized likelihood computed over all training speech data is maximized, and learning the parameter Λ means determining Λ so that the penalized likelihood is maximized. Obtaining the likelihood function group means that a plurality of values of the acoustic model parameters are either supplied from outside the likelihood calculation unit or learned and determined internally, and the likelihoods of the plurality of acoustic models are computed for those values.

Of these units, the first PLRM acoustic learning unit 6 and the second PLRM acoustic learning unit 5 constitute the PLRM learning unit 30; learning is repeated while the parameter information W and Λ is exchanged between them, thereby improving recognition performance. The acoustic model database 7 and the acoustic model likelihood calculation unit 8 form part of the acoustic model unit 40, which has an HMM-based speech recognition function and a likelihood calculation function. The PLRM learning unit 30 receives from the acoustic model unit 40 the likelihood function group to be taken into the regression function φ, and supplies the parameter Λ to the acoustic model unit 40. The PLRM learning unit 30 and the speech recognition unit 9 form part of the PLRM unit 20, which executes processing based on the PLRM and has the various functions of the PLRM.

FIG. 2 shows an example of the processing flow of the speech recognition method in this embodiment. The part enclosed in the frame is an example of the processing flow in the PLRM learning unit 30.

The receiving unit 3 receives the speech signal 1 (S010). From the speech signal 1 received by the receiving unit 3 or from the training speech data 2, the feature extraction unit 4 extracts a feature vector time series X such as a mel-cepstrum coefficient time series (S011).
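For reference, a feature vector time series of the kind used here (mel-cepstrum coefficients plus their first- and second-order regression coefficients, as in the experiment described later) could be computed roughly as follows. This is a sketch using the librosa library with assumed parameter values, not the patent's actual front end.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=16000):
    """13 MFCCs + delta + delta-delta -> 39-dimensional feature vector time series.

    25 ms analysis window, 10 ms frame shift (values assumed for illustration).
    """
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), window="hamming",
    )
    delta = librosa.feature.delta(mfcc)             # first-order regression coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)   # second-order regression coefficients
    return np.vstack([mfcc, delta, delta2]).T       # shape: (num_frames, 39)
```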

Training speech data 2 is used during learning. The acoustic model likelihood calculation unit 8 takes the feature vector time series X as input, performs HMM-based learning with an error-minimization learning algorithm or the EM algorithm, and obtains initial values of the second parameter Λ of the acoustic model (S001). The acoustic model likelihood calculation unit 8 then refers to the feature vector time series information on the utterance content of the recognition units (such as words) stored in the acoustic model database 7 for the feature vector time series X of the training speech data 2, computes the HMM likelihoods for a plurality (M) of recognition unit candidates, and obtains an optimized likelihood function group. Recognition units include, besides words, predicates, phrases, single phones, and so on.
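As an illustration of this step, per-candidate HMM log-likelihoods for an utterance can be obtained roughly as below. This sketch uses the hmmlearn library with a per-word Gaussian-mixture HMM; the structure, parameter values, and all names are assumptions, not the patent's implementation.

```python
import numpy as np
from hmmlearn import hmm

def train_word_hmms(features_per_word, n_states=5, n_mix=5):
    """Fit one Gaussian-mixture HMM per recognition unit (word) via EM (Baum-Welch)."""
    models = {}
    for word, feature_list in features_per_word.items():
        X = np.vstack(feature_list)               # concatenate training utterances
        lengths = [len(f) for f in feature_list]  # per-utterance frame counts
        model = hmm.GMMHMM(n_components=n_states, n_mix=n_mix, covariance_type="diag")
        model.fit(X, lengths)
        models[word] = model
    return models

def candidate_log_likelihoods(models, X):
    """Score one utterance X (frames x 39) against every word HMM -> phi(x; Lambda)."""
    return {word: model.score(X) for word, model in models.items()}
```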

In the first PLRM acoustic learning unit 6, the computed acoustic-model likelihoods of the candidates are input to the PLRM as the PLRM regression function φ(x;Λ) = (φ_1(x;Λ), φ_2(x;Λ), ..., φ_M(x;Λ)) (S002). In the PLRM, the acoustic model parameter Λ is treated as a hyperparameter of the PLRM. In the PLRM, the probability of a recognition unit such as a word is expressed as

[Equation 1: probability p_k(θ) of recognition unit k, shown as an image in the original]

and the PLRM penalized likelihood is expressed as

[Equation 2: PLRM penalized likelihood, shown as an image in the original]

(see Non-Patent Documents 3, 4, and 5).

Here, K is the total number of recognition units such as words (1 ≤ M ≤ K), W = (w_1, w_2, ..., w_K) is the coefficient parameter (first parameter) of the PLRM, and each element w_k is an M-dimensional weight vector; θ = (W, Λ). D = {(x_n, y_n)}, n = 1, ..., N, denotes the set of N training speech signals x_n and their recognition-unit labels y_n (such as words). Γ is a matrix that corrects the bias caused by the class distribution of the training data, for example a diagonal matrix whose k-th diagonal element corresponds to the number of training samples with the k-th label. Σ is a positive definite matrix, and δ is a parameter used to strike a balance with the term shown in the original as an equation image. Γ, Σ, and δ are selected appropriately in advance according to the data. The selection of Σ is described later.
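Because the two expressions above survive only as images, the following is a hedged reconstruction of the standard penalized multinomial logistic regression form they correspond to, pieced together from the surrounding definitions and the cited PLRM papers; the exact formula in the patent may differ. The probability of recognition unit k would take the form

```latex
p_k(\theta \mid x) \;=\;
\frac{\exp\big(w_k^{\top}\phi(x;\Lambda)\big)}
     {\sum_{j=1}^{K}\exp\big(w_j^{\top}\phi(x;\Lambda)\big)},
\qquad \theta = (W,\Lambda)
```

and the penalized (log-)likelihood over the training set D would take the form

```latex
\ell_{\delta}(W,\Lambda) \;=\;
\sum_{n=1}^{N}\log p_{y_n}\big(\theta \mid x_n\big)
\;-\; \frac{\delta}{2}\sum_{k=1}^{K} w_k^{\top}\,\Sigma\, w_k
```

The class-balancing matrix Γ described above additionally reweights this objective (for example, by weighting each term of the data sum according to its class); its exact placement, and whether Σ enters directly or through its inverse, should be taken from Non-Patent Documents 3 to 5 rather than from this sketch.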

The first PLRM acoustic learning unit 6 learns and optimizes the coefficient parameter W of the PLRM (S003); that is, W is determined so that the penalized likelihood computed over all training speech data is maximized. In the second PLRM acoustic learning unit 5, the coefficient parameter W of the PLRM is held fixed, and the acoustic model parameter Λ is learned by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent (S004).

The acoustic model unit 40 receives the parameter Λ obtained by the PLRM learning unit 30, and the acoustic model likelihood calculation unit 8 uses it to compute the likelihoods of the acoustic model. The PLRM learning unit 30 receives the likelihoods of the recognition unit candidates computed by the acoustic model unit 40, inputs them to the first PLRM acoustic learning unit 6 as the PLRM regression function (S005), and learns the coefficient parameter W of the PLRM (S006). At this point the speech recognition unit 9 could already use the parameters W and Λ to obtain, via Equation 1, the probabilities p_k(θ) of recognition units such as words. However, it is preferable to repeat S004 to S006 so as to learn W and Λ further and improve the validity of the probabilities p_k(θ). That is, the second PLRM acoustic learning unit 5 fixes the newly obtained coefficient parameter W of the PLRM and optimizes the acoustic model parameter Λ with an optimization method such as steepest descent (S004). Steps S004 to S006 are repeated sufficiently, and when the coefficient parameter W of the PLRM and the acoustic model parameter Λ are judged to have converged, the PLRM acoustic learning is terminated (S007).
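The alternating training loop described in S003 to S007 can be summarized by the following sketch. The optimizer, step size, and convergence test are placeholders chosen for illustration, and `penalized_likelihood`, `optimize_W`, and `grad_Lambda` stand for routines the patent does not spell out.

```python
def train_plrm(W, Lam, data, penalized_likelihood, optimize_W, grad_Lambda,
               step=1e-3, tol=1e-6, max_iters=100):
    """Alternate: (S003/S006) maximize over W with Lambda fixed,
    (S004) one steepest-ascent update of Lambda with W fixed, until convergence (S007)."""
    prev = penalized_likelihood(W, Lam, data)
    for _ in range(max_iters):
        W = optimize_W(Lam, data)                     # coefficient parameters of the PLRM
        Lam = Lam + step * grad_Lambda(W, Lam, data)  # hyperparameters = acoustic model params
        cur = penalized_likelihood(W, Lam, data)
        if abs(cur - prev) < tol:                     # convergence check (S007)
            break
        prev = cur
    return W, Lam
```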

At test time, the speech signal 1 is received by the receiving unit 3 (S010), and the feature extraction unit 4 extracts a feature vector time series such as a mel-cepstrum coefficient time series (S011). The feature vector time series is input to the trained PLRM unit 20, and the utterance is estimated based on the probabilities of the recognition unit candidates, such as words, corresponding to the input speech (S008). The candidate with the maximum probability is displayed as the recognition result on the display unit 10 (S009). The full set of computed probabilities can also be used as an index giving the confidence of the recognition result.
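Continuing the earlier sketch (same assumed helper names), test-time decoding then reduces to picking the candidate with the largest PLRM probability and reporting that probability as a confidence score:

```python
# Hypothetical decoding step reusing recognition_probabilities() from the earlier sketch.
def recognize(W, phi, labels):
    """Return (best label, its probability, full probability vector)."""
    p = recognition_probabilities(W, phi)
    best = int(p.argmax())
    return labels[best], float(p[best]), p
```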

Next, the selection of Σ is described. The simplest choice is Σ = I, where I denotes the identity matrix. Another choice, which takes overfitting to the training speech data into account, is Σ = (1/N)ΦΦ^T (see Non-Patent Documents 3, 4, and 5).
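As a small illustration (names assumed), the two choices of Σ mentioned above can be formed from the matrix Φ whose columns are the regression vectors φ(x_n;Λ) of the N training utterances:

```python
import numpy as np

def sigma_identity(M):
    """Simplest choice: Sigma = I (M x M identity)."""
    return np.eye(M)

def sigma_empirical(Phi):
    """Data-dependent choice: Sigma = (1/N) * Phi @ Phi.T, with Phi of shape (M, N)."""
    N = Phi.shape[1]
    return (Phi @ Phi.T) / N
```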

Finally, a recognition experiment using the above speech recognition apparatus is described.
The experiment was carried out on the E-set of the TI46 database (Table 1). This set contains 1433 training utterances and 2291 test utterances. From each speech signal, a 39-dimensional feature vector consisting of a 13-dimensional mel-cepstrum coefficient time series and its first- and second-order regression coefficients was extracted, using a 25 ms Hamming window and a 10 ms window shift. Each word was modeled by an HMM whose states have 5-Gaussian mixture distributions. Based on preliminary experiments, δ was set to 0.01 for Σ = I and to 5000 for Σ = (1/N)ΦΦ^T.

[Table 1: TI46 E-set data, shown as an image in the original]

Table 2 shows the result of comparing the word recognition rates of the method of the present invention and the conventional method.

[Table 2: word recognition rates of the proposed and conventional methods, shown as an image in the original]

The present invention can also be realized as a program for causing a computer to execute the speech recognition method of the embodiment, and as a computer-readable recording medium on which the program is recorded. The program may be recorded in a ROM built into the computer, may be recorded on a recording medium such as an FD, a CD-ROM, or an internal or external magnetic disk and read into the computer, or may be downloaded to the computer via the Internet.

Although embodiments of the present invention have been described above, the present invention is not limited to the above embodiments, and it is obvious that various modifications can be made to the embodiments without departing from the spirit of the present invention.

For example, in the above embodiment, the second PLRM acoustic learning step (S004), which learns the second parameter Λ of the acoustic model by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent, and the first PLRM acoustic learning step (S006), which learns by optimizing the PLRM parameter W, are repeated, but the number of repetitions is arbitrary. The conventional acoustic model was described using a hidden Markov model (HMM) as an example, but the likelihood of another acoustic model, such as one based on a Bayesian network, may be used. The training speech data is not limited to the E-set used in the experiment.

The present invention can be used for speech recognition.

FIG. 1 is a block diagram of a configuration example of the speech recognition apparatus in an embodiment of the present invention. FIG. 2 is a diagram showing an example of the processing flow of the speech recognition method in an embodiment of the present invention.

Explanation of symbols

1 Speech signal
2 Training speech data
3 Receiving unit
4 Feature extraction unit
5 Second PLRM acoustic learning unit
6 First PLRM acoustic learning unit
7 Acoustic model database
8 Acoustic model likelihood calculation unit
9 Speech recognition unit
10 Display unit
11 Storage unit
12 Control unit
20 PLRM unit
30 PLRM learning unit
40 Acoustic model unit
100 Speech recognition apparatus
W Coefficient parameter
Λ Hyperparameter

Claims (3)

1. A speech recognition apparatus comprising:
a feature extraction unit that extracts a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal;
an acoustic model likelihood calculation unit that computes, based on the feature vector time series of training speech data extracted by the feature extraction unit, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model;
a first PLRM acoustic learning unit that takes the likelihood function group obtained by the acoustic model likelihood calculation unit into a regression function and learns a coefficient parameter of a Penalized Logistic Regression Machine (hereinafter referred to in the claims as PLRM) by maximizing a PLRM penalized likelihood;
a second PLRM acoustic learning unit that takes the likelihood function group obtained by the acoustic model likelihood calculation unit into the regression function and learns a hyperparameter of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and
a speech recognition unit that computes, using the coefficient parameter optimized by the first PLRM acoustic learning unit and the hyperparameter optimized by the second PLRM acoustic learning unit, a probabilistic expression of the possibilities for the recognition units,
wherein the acoustic model likelihood calculation unit computes the likelihood function group by taking the hyperparameter into parameters of the acoustic model.
2. A speech recognition method comprising:
a feature extraction step of extracting a feature vector time series, such as a mel-cepstrum coefficient time series, from a speech signal;
an acoustic model likelihood calculation step of computing, based on the feature vector time series of training speech data extracted in the feature extraction step, a group of likelihood functions for recognition units such as words based on an acoustic model such as a hidden Markov model;
a first PLRM acoustic learning step of taking the likelihood function group obtained in the acoustic model likelihood calculation step into a regression function and learning a coefficient parameter of a Penalized Logistic Regression Machine (hereinafter referred to in the claims as PLRM) by maximizing a PLRM penalized likelihood;
a second PLRM acoustic learning step of taking the likelihood function group obtained in the acoustic model likelihood calculation step into the regression function and learning a hyperparameter of the PLRM by maximizing the PLRM penalized likelihood with an optimization method such as steepest descent; and
a speech recognition step of computing, using the coefficient parameter optimized in the first PLRM acoustic learning step and the hyperparameter optimized in the second PLRM acoustic learning step, a probabilistic expression of the possibilities for the recognition units,
wherein the acoustic model likelihood calculation step computes the likelihood function group by taking the hyperparameter into parameters of the acoustic model.
3. A program for causing a computer to execute the speech recognition method according to claim 2.
JP2005305014A 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method Active JP4651496B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2005305014A JP4651496B2 (en) 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2005305014A JP4651496B2 (en) 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method

Publications (3)

Publication Number Publication Date
JP2007086703A JP2007086703A (en) 2007-04-05
JP2007086703A5 JP2007086703A5 (en) 2008-10-16
JP4651496B2 true JP4651496B2 (en) 2011-03-16

Family

ID=37973705

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2005305014A Active JP4651496B2 (en) 2005-09-19 2005-09-19 Speech recognition apparatus and speech recognition method

Country Status (1)

Country Link
JP (1) JP4651496B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101493552B1 (en) * 2008-05-14 2015-02-13 닛토보 온쿄 엔지니어링 가부시키가이샤 Signal judgment method, signal judgment apparatus, program, and signal judgment system
JP2010044031A (en) * 2008-07-15 2010-02-25 Nittobo Acoustic Engineering Co Ltd Method for identifying aircraft, method for measuring aircraft noise and method for determining signals using the same

Also Published As

Publication number Publication date
JP2007086703A (en) 2007-04-05

Similar Documents

Publication Publication Date Title
JP6686154B2 (en) Utterance recognition method and device
JP4141495B2 (en) Method and apparatus for speech recognition using optimized partial probability mixture sharing
JP5229478B2 (en) Statistical model learning apparatus, statistical model learning method, and program
JP5752060B2 (en) Information processing apparatus, large vocabulary continuous speech recognition method and program
JP6509694B2 (en) Learning device, speech detection device, learning method and program
JP5249967B2 (en) Speech recognition device, weight vector learning device, speech recognition method, weight vector learning method, program
JP6031316B2 (en) Speech recognition apparatus, error correction model learning method, and program
US20220101859A1 (en) Speaker recognition based on signal segments weighted by quality
JP2004226982A (en) Method for speech recognition using hidden track, hidden markov model
JPWO2007105409A1 (en) Standard pattern adaptation device, standard pattern adaptation method, and standard pattern adaptation program
Walker et al. Semi-supervised model training for unbounded conversational speech recognition
JP4796460B2 (en) Speech recognition apparatus and speech recognition program
JP4651496B2 (en) Speech recognition apparatus and speech recognition method
JP3920749B2 (en) Acoustic model creation method for speech recognition, apparatus thereof, program thereof and recording medium thereof, speech recognition apparatus using acoustic model
JP6027754B2 (en) Adaptation device, speech recognition device, and program thereof
JP5006888B2 (en) Acoustic model creation device, acoustic model creation method, acoustic model creation program
JP5288378B2 (en) Acoustic model speaker adaptation apparatus and computer program therefor
JP2007078943A (en) Acoustic score calculating program
JP6546070B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
JP2002251198A (en) Speech recognition system
JP4537970B2 (en) Language model creation device, language model creation method, program thereof, and recording medium thereof
JP2008064849A (en) Sound model creation device, speech recognition device using the same, method, program and recording medium therefore
JP4779239B2 (en) Acoustic model learning apparatus, acoustic model learning method, and program thereof
JP4394972B2 (en) Acoustic model generation method and apparatus for speech recognition, and recording medium recording an acoustic model generation program for speech recognition
JP5104732B2 (en) Extended recognition dictionary learning device, speech recognition system using the same, method and program thereof

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20080901

A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20080901

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20101104

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20101124

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20101214

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20131224

Year of fee payment: 3

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250