JP3898676B2 - Voice recognition device

Info

Publication number: JP3898676B2
Application number: JP2003298598A
Authority: JP
Inventors: ドンライシュ; 哲中村
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2003-08-22
Filing date: 2003-08-22
Publication date: 2007-03-28
Anticipated expiration: 2023-08-22
Also published as: JP2005070292A

Description

The present invention relates to speech recognition technology, and more particularly to a speech recognition apparatus capable of recognizing speech with high accuracy even in noisy environments, a method for learning the weight vector used in that apparatus, and a subband HMM (Hidden Markov Model) learning method.

With advances in computer technology and signal processing technology, automatic speech recognition (ASR) systems are approaching the level needed for automation. Voice input to a computer is a typical example. It is well known, however, that current ASR remains limited compared with human speech recognition ability.

One study has shown that, in human auditory processing, linguistic messages are decoded independently in separate frequency subbands, and the final decoding decision is made by merging the decisions from those subbands. Another study has shown that humans can recognize speech from only limited spectral cues, and that they recognize speech signals by integrating acoustic cues from different frequency regions.

Subband-based speech recognition schemes have been proposed in order to investigate these phenomena.

There are also technical reasons why subband-based speech recognition has been proposed.

The first reason is that corruption of a speech signal is often confined to a particular frequency region. If recognition is based on decisions made in several subbands, recognition accuracy is therefore unlikely to degrade severely as long as the signal in the noise-free frequency regions carries sufficiently reliable information.

The second reason is that, for certain features, estimates obtained from subbands are more robust and more efficient than those obtained from the full band. The autoregressive spectrum is one example.

The third reason is that transitions between stationary segments of speech do not necessarily occur simultaneously in all frequency regions. A conventional full-band HMM assumes that such transitions occur synchronously. The subband approach may relax this constraint.

Two subband speech recognition schemes have been proposed: the parallel subband scheme (PSB) and the concatenated subband scheme (CSB) (Non-Patent Document 1 and Non-Patent Document 2).

In PSB, the full-band power spectrum is divided into several subbands, and each subband is converted into cepstral features. These features are modeled independently of one another and combined at a particular segment level such as the phoneme, word, or sentence. The merging procedure leaves room to manipulate the subband features in order to improve performance. Studies have shown that PSB improves recognition of speech corrupted by noise in a specific band. Other studies, however, have found that performance degrades when background noise spans a wide band.

Studies of PSB implementations have found that the number of subbands, the boundaries between subbands, and the segment level at which merging takes place must all be tuned to the task. The likelihoods obtained for the individual subbands can be combined with either linear or nonlinear weighting. Studies have shown that nonlinear schemes (for example, a multilayer perceptron) outperform linear schemes. In linear schemes, candidate weights include fixed values, subband recognition accuracy, subband SNR (signal-to-noise ratio), and mutual information between subbands. These can be estimated by statistical methods such as minimum classification error (MCE) and maximum mutual information (MMI). Studies also show that performance depends on which features are used.

In CSB, the subband cepstrum vectors are concatenated into a single feature vector for speech recognition. CSB has been shown to outperform PSB when noise is added in a limited band. However, the features obtained from the subbands can be correlated with one another, and studies have shown that in such cases CSB may perform worse than the baseline scheme on noise-free (clean) speech.

Non-Patent Document 1: Hermansky H., Tibrewala S., and Pavel M., "Towards ASR On Partially Corrupted Speech", Proc. IEEE ICSLP, pp. 462-465, 1996.
Non-Patent Document 2: Okawa S., Bocchieri E., and Potamianos A., "Multi-band speech recognition in noisy environments", Proc. ICASSP, pp. 641-644, 1998.

As described above, the subband approach has the potential to recognize speech with accuracy comparable to that of humans, but no subband speech recognition scheme has yet achieved human-like robustness and accuracy in practice.

Accordingly, an object of the present invention is to provide a subband speech recognition apparatus with higher robustness and higher recognition accuracy.

A speech recognition apparatus according to a first aspect of the present invention includes: speech recognition means trained to perform speech recognition on an input feature vector composed of features extracted from a speech signal by a predetermined extraction method; subband feature vector extraction means for extracting, according to the predetermined extraction method, a feature vector for each of a plurality of predetermined subbands of an input speech signal, each having the same dimension as said feature vector; and means for generating a feature vector by multiplying the feature vectors of the plurality of subbands extracted by the subband feature vector extraction means by weights assigned in advance to the respective subbands and adding the products together, and for supplying the generated feature vector to the speech recognition means.

Preferably, the subband feature vector extraction means includes: means for extracting a logarithmic filter band energy (log FBE) vector for each of the plurality of subbands from the speech signal; a plurality of multiplication means, each for multiplying the log FBE vector by the weight corresponding to its subband and for expanding the dimension of that log FBE vector to the dimension of the log FBE vector extracted from the full band by filling the elements other than those corresponding to the subband with a predetermined constant; and a plurality of mel-cepstrum conversion means for converting the outputs of the plurality of multiplication means into cepstrum vectors of the respective subbands.

More preferably, the means for extracting the log FBE vector includes: speech spectrum extraction means for extracting a speech spectrum from the speech signal; mel filter means for receiving the speech spectrum extracted by the speech spectrum extraction means and outputting a mel-frequency FBE vector; logarithm calculation means for taking the logarithm of the mel-frequency FBE vector output by the mel filter means and outputting a log FBE vector; and subband filter means for dividing the log FBE vector output by the logarithm calculation means into the plurality of subbands.

A method according to a second aspect of the present invention is a method for learning the weight vector associated with a subband speech recognition decoder, for use when subband speech recognition is performed with a maximum a posteriori probability decoder trained on a speech corpus. In this subband speech recognition, the speech spectrum of the input speech is divided into a plurality of predetermined subbands, the feature vector obtained for each subband is multiplied by the weight corresponding to that subband, the products are added together to obtain a full-band feature vector, and this vector is supplied to the decoder, whose output is used for recognition. The method includes the steps of: preparing learning speech data to which noise in a limited band has been added in advance; performing subband speech recognition on the learning speech data with the decoder while varying the values of the elements of the weight vector; and determining each element of the weight vector, based on the results of the recognition step, so as to maximize the likelihood obtained for the correct recognition result corresponding to the learning speech data.

Preferably, the decoder includes an HMM (Hidden Markov Model) based decoder.

More preferably, the elements w_i of the weight vector (i = 1 to K, where K is the number of subbands) satisfy the constraint that their sum equals the number of subbands.

A method according to a third aspect of the present invention is a subband HMM learning method for adjusting, through further learning on learning data, a first HMM for speech recognition that has already been trained on a predetermined speech corpus. The method includes the steps of preparing learning speech data and, using the learning speech data, mapping the first HMM to a second HMM that matches the learning data. The mapping step includes the steps of: calculating, from the learning speech data and the first HMM by a predetermined learning method, a weight coefficient vector having a predetermined first number of weights as its elements; dividing the speech spectrum obtained from a mean vector that defines a normal distribution constituting the first HMM into the first number of frequency subbands; calculating a cepstrum vector from the speech spectrum for each frequency subband; obtaining a new mean vector by multiplying the cepstrum vector calculated for each frequency subband by the corresponding element of the weight coefficient vector and adding the products together; and constructing the second HMM from the normal distribution defined by the new mean vector.

Preferably, the step of calculating the weight coefficient vector includes the step of estimating a weight for each frequency subband such that, given the first HMM and the weight coefficients, the likelihood output by the first HMM is maximized for the feature vectors obtained from the learning data by the same computation as in the dividing, calculating, and adding steps.

The step of calculating the weight coefficient vector includes the steps of: classifying each mean vector that defines a normal distribution constituting the first HMM into one of a plurality of classes; and calculating, for each of the plurality of classes, from the learning speech data and the first HMM by a predetermined learning method, a weight coefficient vector having the predetermined first number of weights as its elements. The mapping step further includes the step of determining to which of the plurality of classes a mean vector defining a normal distribution constituting the first HMM belongs. The step of obtaining a new mean vector includes the step of multiplying the cepstrum vector calculated for each frequency subband by the corresponding weight coefficient in the weight coefficient vector for the class to which the mean vector being processed belongs, and adding the products together to obtain the new mean vector.
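The mapping of a single mean vector described for the third aspect can be sketched as follows. This is a minimal illustration rather than the patented procedure itself: it assumes that the mean vector is a (possibly truncated) cepstrum produced by an orthonormal DCT, that the log spectrum is recovered by zero-padding and inverting that DCT, and that the subbands are contiguous and of nearly equal width. The names `dct_ortho`, `idct_ortho`, and `map_mean` are hypothetical.

```python
import math

def dct_ortho(vec):
    """Orthonormal DCT-II of a list of floats."""
    dim = len(vec)
    out = []
    for k in range(dim):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / dim)
        out.append(scale * sum(v * math.cos(math.pi * k * (2 * n + 1) / (2 * dim))
                               for n, v in enumerate(vec)))
    return out

def idct_ortho(ceps):
    """Inverse of dct_ortho (the transpose of the orthonormal DCT-II matrix)."""
    dim = len(ceps)
    return [sum(math.sqrt((1.0 if k == 0 else 2.0) / dim) * c
                * math.cos(math.pi * k * (2 * n + 1) / (2 * dim))
                for k, c in enumerate(ceps))
            for n in range(dim)]

def map_mean(mean_ceps, weights, num_channels):
    """Map one cepstral-domain Gaussian mean to a subband-weighted mean."""
    num_ceps = len(mean_ceps)
    # Assumption: truncated cepstral coefficients are treated as zero.
    padded = list(mean_ceps) + [0.0] * (num_channels - num_ceps)
    log_spec = idct_ortho(padded)  # log spectrum implied by the mean vector
    num_bands = len(weights)
    edges = [round(i * num_channels / num_bands) for i in range(num_bands + 1)]
    new_mean = [0.0] * num_ceps
    for w, lo, hi in zip(weights, edges[:-1], edges[1:]):
        # Keep only this subband's channels; all other channels are zero.
        band = [x if lo <= n < hi else 0.0 for n, x in enumerate(log_spec)]
        band_ceps = dct_ortho(band)
        for k in range(num_ceps):
            new_mean[k] += w * band_ceps[k]
    return new_mean
```

With all weights equal to 1 the function returns the original mean, which is a quick sanity check. For the class-dependent variant, the caller would first look up the weight vector for the class of the mean being processed and pass that vector as `weights`.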

In the following, a subband speech recognition apparatus similar to CSB is described as an embodiment of the present invention. As in CSB, the apparatus of the embodiments below divides the speech spectrum into subbands, obtains a cepstrum for each subband, and combines them into a full-band cepstrum, which is then used for speech recognition.

The embodiments below differ from CSB as follows. The filter band energies (FBE: Filter-Band Energies) are first obtained from the speech signal and then divided into subbands. The discrete frequency outputs of the log FBE of each subband are then multiplied by a weight determined from the reliability of that subband. The resulting feature vector is expanded to the number of elements of the full-band FBE vector, with the elements in the discrete frequency regions of the other subbands set to zero. Applying the discrete cosine transform (DCT) to the log FBE vector thus obtained for each subband yields the cepstrum vector for that subband. These vectors have the same number of dimensions as the full-band FBE vector if no truncation is applied in the DCT, and fewer dimensions if truncation is applied.

The cepstrum vector for the full band is obtained by multiplying the subband cepstrum vectors obtained in this way by the weights and adding them together. If all the weights are 1, the linearity of the DCT means that the resulting cepstrum is the mel-frequency cepstral coefficients (MFCC).
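The linearity argument can be checked numerically with a short sketch. It assumes an unnormalized DCT-II and contiguous subbands of nearly equal width; the function names `dct`, `zero_padded_subbands`, and `weighted_cepstrum` are hypothetical and chosen only for this illustration.

```python
import math

def dct(vec, num_ceps=None):
    """Unnormalized DCT-II; num_ceps < len(vec) truncates the cepstrum."""
    dim = len(vec)
    num_ceps = dim if num_ceps is None else num_ceps
    return [sum(v * math.cos(math.pi * k * (2 * n + 1) / (2 * dim))
                for n, v in enumerate(vec))
            for k in range(num_ceps)]

def zero_padded_subbands(log_fbe, num_bands):
    """Split D channels into contiguous bands, each kept at length D with zeros elsewhere."""
    dim = len(log_fbe)
    edges = [round(i * dim / num_bands) for i in range(num_bands + 1)]
    return [[x if lo <= n < hi else 0.0 for n, x in enumerate(log_fbe)]
            for lo, hi in zip(edges[:-1], edges[1:])]

def weighted_cepstrum(log_fbe, weights, num_ceps=None):
    """Weighted sum of the subband cepstra: sum_i w_i * DCT(f_i)."""
    bands = zero_padded_subbands(log_fbe, len(weights))
    sub_ceps = [dct(band, num_ceps) for band in bands]
    return [sum(w * c[k] for w, c in zip(weights, sub_ceps))
            for k in range(len(sub_ceps[0]))]
```

Calling `weighted_cepstrum(log_fbe, [1.0] * K, num_ceps)` should reproduce `dct(log_fbe, num_ceps)` up to rounding error, which is the all-weights-equal-to-1 case described above.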

In CSB, by contrast, the log FBE vector of each subband is not expanded to the dimension of the full-band vector; instead, the cepstrum vectors obtained for the individual subbands are concatenated to form the full-band cepstrum vector. In other words, CSB processes each subband independently, and this is the major difference between the present embodiment and the CSB scheme in how the cepstrum vector is computed.

In the embodiments of the present invention, the linearity of the DCT allows the weights to be applied to the subband cepstra and incorporated into the HMM framework. An optimization procedure can therefore be used to obtain these weights. Because the number of weights is small, they can be estimated from only a small amount of learning data.

The description of the embodiments below does not go into the details of this optimization learning each time it arises; they are described together at the end of the embodiments. Two optimization schemes are possible: mapping in the feature space and mapping in the model space. The first embodiment described below uses mapping in the feature space, and the second embodiment uses mapping in the model space.

In the text of the following description, symbols such as "^" should properly be placed directly above the character that follows them, but owing to limitations of text notation they are written immediately before that character. In the equations, these symbols appear in their proper positions.

[First Embodiment]
-Outline-
In the first embodiment, the log FBE of each subband is multiplied by a weight calculated from the reliability of that subband. For each subband, the DCT converts this log FBE into a cepstrum with the full-band number of dimensions. Because the DCT is a linear transform, adding the cepstrum vectors of all subbands yields the cepstrum vector for the full band. The weight of each subband's cepstrum is then equal to the weight of that subband's log FBE.

-Configuration-
FIG. 1 shows a block diagram of speech recognition system 10 according to the first embodiment of the present invention. Referring to FIG. 1, speech recognition system 10 includes a learning unit 20 for training the HMM and learning the per-subband weights used during speech recognition, and a subband speech recognition unit 24 for recognizing input speech 22 using the HMM and weights learned by learning unit 20 and outputting recognition output 26.

Learning unit 20 includes: a speech corpus 40 made up of a large amount of speech data; an HMM learning unit 42 for performing ordinary training of HMM 44 using speech corpus 40; a learning speech storage unit 46 for storing a small amount of learning speech for learning the per-subband weights (speech containing noise representative of the environment in which speech recognition unit 24 is actually used); a weight learning unit 48 for learning the values of the per-subband weights w_1 to w_K that maximize the likelihood obtained for the learning speech in subband speech recognition using HMM 44; and a weight storage unit 50 for storing the weights w_1 to w_K learned by weight learning unit 48.

Speech recognition unit 24 includes: a subband feature extraction unit 60 for receiving input speech 22, performing subband feature extraction, and generating the feature vector used for speech recognition by adding the subband feature vectors using the weights w_1 to w_K stored in weight storage unit 50; and a determination unit 62 for supplying the feature vector output by subband feature extraction unit 60 to HMM 44 and outputting, as the recognition result, the phoneme for which the likelihood output from HMM 44 is highest. The sequence of outputs from determination unit 62 forms recognition output 26.

The subband feature extraction performed by subband feature extraction unit 60 is described with reference to FIG. 2, which is a block diagram showing unit 60 in detail. Referring to FIG. 2, subband feature extraction unit 60 includes: a DFT processing unit 80 for applying the discrete Fourier transform (DFT) to the input signal to convert it into a speech spectrum; a group of mel filter banks 82 for receiving the speech spectrum output by DFT processing unit 80 and outputting the mel-frequency FBE; a logarithm calculation unit 84 for taking the logarithm of the mel-frequency FBE output by mel filter bank 82 and outputting the log FBE; and a subband filter 86 for dividing the log FBE output by logarithm calculation unit 84 into a plurality of (for example, K) subbands. In FIG. 2, "DXD" denotes the size of the DCT matrix, and "D" denotes the length of the full-band FBE vector.

Subband filter 86 divides the log FBE vector f = {f_1, f_2, ..., f_D} into K subbands {vector f_1, vector f_2, ..., vector f_K}, where D is the dimension of the vector and K is the number of subbands. Each subband filter has a rectangular unit impulse response: it passes the log FBE at frequencies belonging to its own subband and blocks everything else.

Subband feature extraction unit 60 further includes K multiplier circuits 88A to 88K for multiplying the outputs of subband filter 86 by the weights w_1 to w_K assigned to the respective subbands. The output of each of multiplier circuits 88A to 88K is a D-dimensional log FBE vector f_i = {f^i_1, f^i_2, ..., f^i_D} (i = 1 to K). In each of these vectors, the elements outside the frequency region handled by the corresponding subband filter are zero. The relationship between the original log FBE vector and the subband log FBE vectors is expressed by the following equation.

The log FBE vectors output by multiplier circuits 88A to 88K are expressed by the following equation.

Here ^f_i denotes the output of the i-th of multiplier circuits 88A to 88K, and w_i denotes the weight of the i-th subband.

Subband feature extraction unit 60 further includes: DCT processing units 90A to 90K, connected to receive the outputs of multiplier circuits 88A to 88K respectively, for applying the DCT to the log FBE vectors output by multiplier circuits 88A to 88K and outputting a cepstrum vector for each subband; and an adder 92 for adding the cepstrum vectors output by DCT processing units 90A to 90K and outputting an ordinary MFCC.

The cepstrum vector ^c_i output by each of DCT processing units 90A to 90K is expressed by the following equation.

The cepstrum vector ^c output by adder 92 is expressed by the following equation.

Here, if the weights w_1 to w_K are all 1, the cepstrum vector ^c is the ordinary MFCC, because the DCT is a linear transform. That is, the following equation holds.

Here vector c is the MFCC obtained by ordinary processing.

Because the DCT is a linear transform, weighting the subband log FBE vectors, adding them, and then applying the DCT is equivalent to weighting the subband cepstrum vectors and then adding them.

Here vector c_i is the unweighted cepstrum vector of the i-th subband.

Because the w_i are weight coefficients, the following constraint applies.
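The equations referred to in this passage are not reproduced in the text. Based on the surrounding definitions, they can plausibly be written as follows. The numbering is an assumption inferred from the later reference to equation (6), the symbol C for the DCT matrix is introduced here only for illustration, and the constraint in (7) is taken from the earlier statement that the weights sum to the number of subbands.

```latex
\begin{align}
\mathbf{f} &= \sum_{i=1}^{K} \mathbf{f}_i \tag{1} \\
\hat{\mathbf{f}}_i &= w_i \, \mathbf{f}_i \tag{2} \\
\hat{\mathbf{c}}_i &= \mathbf{C} \, \hat{\mathbf{f}}_i \tag{3} \\
\hat{\mathbf{c}} &= \sum_{i=1}^{K} \hat{\mathbf{c}}_i \tag{4} \\
\mathbf{c} &= \mathbf{C} \, \mathbf{f} = \sum_{i=1}^{K} \mathbf{C} \, \mathbf{f}_i = \sum_{i=1}^{K} \mathbf{c}_i \tag{5} \\
\hat{\mathbf{c}} &= \sum_{i=1}^{K} w_i \, \mathbf{c}_i \tag{6} \\
\sum_{i=1}^{K} w_i &= K \tag{7}
\end{align}
```

Equation (6) follows from (2) to (4) by pulling the scalar w_i out of the linear transform, and (5) is the special case of (6) in which every w_i equals 1.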

The remaining question is how to determine the weights w_i. In this embodiment, the weights are determined through learning performed by weight learning unit 48 using the noisy learning data stored in learning speech storage unit 46. The details are described at the end of the description of the embodiments, together with the determination of the weights in the model space.

-Operation-
Speech recognition system 10 according to the first embodiment operates as follows. Its operation has two phases. The first is the training of HMM 44 and the learning of the weights w_1 to w_K by learning unit 20 shown in FIG. 1; the second is speech recognition by speech recognition unit 24 using HMM 44. These are described in order below.

[Training of HMM 44]
This learning also has two phases. The first is the training of HMM 44 in the ordinary sense, using speech corpus 40 and HMM learning unit 42. The second is the learning of the weights w_1 to w_K by weight learning unit 48, using the trained HMM 44 and learning speech storage unit 46.

HMM 44 is trained in the same way as in ordinary practice.

The basic idea behind the learning performed by weight learning unit 48, on the other hand, is as follows. Given the learning speech stored in learning speech storage unit 46, the feature vectors of the learning speech are computed by equation (6). When these are supplied to HMM 44, a likelihood is obtained for the HMM of each phoneme. The weights are determined so that, among the likelihoods obtained from the individual HMMs for the learning data, the likelihood obtained for the correct answer is maximized. The details are described later. Once the learning of HMM 44 and of the weights w_1 to w_K is complete, speech recognition by speech recognition unit 24 becomes possible.
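One way to picture this likelihood maximization is the following sketch. It is not the procedure of the patent, whose details are given elsewhere. It assumes that every frame of the learning speech has already been aligned to a single diagonal-covariance Gaussian of the correct model, so that the log-likelihood is quadratic in the weights and the maximum under the constraint that the weights sum to K has a closed form via a Lagrange multiplier. The names `solve` and `estimate_weights` are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def estimate_weights(sub_ceps, means, variances):
    """
    sub_ceps[t][i][d]: unweighted cepstrum of subband i at frame t.
    means[t][d], variances[t][d]: diagonal Gaussian assumed aligned to frame t.
    Returns the weights that maximize the Gaussian log-likelihood of
    sum_i w_i * c_i over all frames, subject to sum(w) == K.
    """
    num_bands = len(sub_ceps[0])
    A = [[0.0] * num_bands for _ in range(num_bands)]
    b = [0.0] * num_bands
    for frame, mu, var in zip(sub_ceps, means, variances):
        for i in range(num_bands):
            b[i] += sum(c * m / v for c, m, v in zip(frame[i], mu, var))
            for j in range(num_bands):
                A[i][j] += sum(ci * cj / v
                               for ci, cj, v in zip(frame[i], frame[j], var))
    free = solve(A, b)                   # unconstrained maximizer
    shift = solve(A, [1.0] * num_bands)  # direction used to enforce the constraint
    lam = (sum(free) - num_bands) / sum(shift)  # Lagrange multiplier
    return [f - lam * s for f, s in zip(free, shift)]
```

In a real system the alignment itself depends on the weights, so an estimate of this kind would presumably be iterated with realignment; that outer loop is omitted here.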

図２を参照して、入力音声に対してＤＦＴ処理部８０によりＤＦＴが実行され、音声スペクトルが得られる。この音声スペクトルをメルフィルタバンク８２を通すことにより、ＦＢＥが得られる。対数計算部８４によりこのＦＢＥの対数を算出することで対数ＦＢＥが得られる。この対数ＦＢＥをサブバンドフィルタ８６に与えることで、対数ＦＢＥがサブバンドに分割される。 Referring to FIG. 2, DFT is executed by the DFT processing unit 80 on the input speech, and a speech spectrum is obtained. By passing this voice spectrum through the mel filter bank 82, FBE is obtained. The logarithm FBE is obtained by calculating the logarithm of the FBE by the logarithm calculator 84. By providing this logarithm FBE to the subband filter 86, the logarithm FBE is divided into subbands.

Each of the multiplier circuits 88A to 88K multiplies the logarithmic FBE of the subband of its assigned frequency region by the corresponding one of the weights w_1 to w_K. Each product vector is extended to the dimension of the full-band vector by substituting 0 for the elements of the discrete frequency regions outside the assigned region. The extended vectors are given to the DCT processing units 90A to 90K, respectively. Each of the DCT processing units 90A to 90K performs DCT processing on the weighted subband logarithmic FBE it receives and outputs a subband cepstrum vector. These subband cepstrum vectors are given to the adder 92, which sums them to obtain the final MFCC. Giving this MFCC to the HMM 44 shown in FIG. 1 yields a likelihood from the HMM of each phoneme. Based on these likelihoods, the determination unit 62 determines the phoneme corresponding to the input speech 22 and outputs it as the recognition output 26.
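The weighting, zero extension, DCT and summation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it starts from an already computed logarithmic FBE vector, uses an unnormalised DCT-II, and the names dct, weighted_subband_mfcc and bands are hypothetical.

```python
import math

def dct(vec, n_coef=None):
    """Unnormalised DCT-II of a list of floats (assumed transform type)."""
    n = len(vec)
    n_coef = n if n_coef is None else n_coef
    return [
        sum(v * math.cos(math.pi * q * (j + 0.5) / n) for j, v in enumerate(vec))
        for q in range(n_coef)
    ]

def weighted_subband_mfcc(log_fbe, bands, weights, n_coef=None):
    """Sum of the DCTs of the weighted, zero-extended subband vectors.

    bands is a list of (start, stop) index pairs, one per subband,
    that together partition log_fbe.
    """
    n = len(log_fbe)
    n_coef = n if n_coef is None else n_coef
    total = [0.0] * n_coef
    for (start, stop), w in zip(bands, weights):
        padded = [0.0] * n                 # zeros outside the assigned band
        for j in range(start, stop):
            padded[j] = w * log_fbe[j]     # multiplier circuit for this band
        cep = dct(padded, n_coef)          # subband cepstrum vector
        total = [t + c for t, c in zip(total, cep)]  # adder
    return total

# Hypothetical 6-channel example with K = 3 subbands.
log_fbe = [1.0, 2.0, 0.5, -1.0, 3.0, 0.25]
bands = [(0, 2), (2, 4), (4, 6)]
mfcc = weighted_subband_mfcc(log_fbe, bands, [1.0, 0.5, 1.0])
```

Because the DCT is linear, setting every weight to 1 in this sketch reproduces the ordinary full-band cepstrum.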

[Second Embodiment]
Unlike the first embodiment, the speech recognition system according to the second embodiment performs the weighting described above in the model space. That is, HMM learning weighted for each subband is performed, and speech recognition is performed using the resulting HMMs. FIG. 3 is a block diagram of the speech recognition system 100 according to the second embodiment.

図３を参照して、この音声認識システム１００は、サブバンド方式によるモデル空間での重み付けを行なうことにより、ＨＭＭ学習を行なうための学習部１０２と、学習部１０２による学習の行なわれたＨＭＭ１２４を用いて、入力音声２２に対する音声認識を行ない認識出力１０６を出力するための音声認識部１０４とを含む。 Referring to FIG. 3, this speech recognition system 100 includes a learning unit 102 for performing HMM learning by weighting in a model space by a subband method, and an HMM 124 that has been learned by the learning unit 102. And a speech recognition unit 104 for performing speech recognition on the input speech 22 and outputting a recognition output 106.

The learning unit 102 includes: the speech corpus 40; the HMM learning unit 42 for learning the HMM 44 by a general learning method using the speech corpus 40; the learning speech storage unit 46; a weight learning unit 120 for learning the weights in the model space based on the HMM 44 and the learning speech storage unit 46; and an HMM adjustment unit 122 for generating the HMM 124 by adjusting the HMM 44, based on the weights learned by the weight learning unit 120, so that it models the acoustics in a form weighted for each subband.

重み学習部１２０による重みの学習の詳細については、第１の実施の形態における重みの学習とともに実施の形態の説明の最後に示す。 Details of weight learning by the weight learning unit 120 are shown at the end of the description of the embodiment together with weight learning in the first embodiment.

Referring to FIG. 4, the HMM adjustment unit 122 includes an average vector extraction processing unit 150 for extracting, one at a time and in order, all the average vectors of the normal distributions that define the HMMs constituting the HMM 44, and an IDCT processing unit 152 for performing inverse DCT processing on each average vector extracted by the average vector extraction processing unit 150. When MFCC is adopted as the feature for speech recognition, the average vectors of the Gaussian mixture distributions in the HMM 44 are values in the cepstrum domain. Applying the inverse DCT to an average vector in the IDCT processing unit 152 therefore returns it to a logarithmic FBE, which is a quantity in the spectral domain.

The HMM adjustment unit 122 further includes: a subband filter 154 for dividing the logarithmic FBE output from the IDCT processing unit 152 into K subbands; DCT processing units 156A to 156K for calculating subband cepstra by performing DCT processing on each of the K subbands divided by the subband filter 154; K multiplier circuits 158A to 158K for multiplying the outputs of the DCT processing units 156A to 156K by the weights w_1 to w_K corresponding to the respective subbands; and an adder 160 for calculating the average vector of the HMM by adding together the outputs of the multiplier circuits 158A to 158K.

Referring again to FIG. 3, the speech recognition unit 104 includes a feature extraction unit 130 for extracting features (MFCC) from the full band of the input speech 22, generating a feature vector, and giving it to the HMM 124, and a determination unit 132 for determining the phoneme corresponding to the input speech 22 based on the likelihoods output, in response to the feature vector, from the HMMs provided for the respective phonemes in the HMM 124. The output sequence of the determination unit 132 forms the recognition output 106.

ＨＭＭ調整部１２２によるＨＭＭ１２４の生成はより具体的には以下の手順で行なわれる。 More specifically, the generation of the HMM 124 by the HMM adjustment unit 122 is performed according to the following procedure.

(1) The average vector μ = {μ_0, μ_1, ..., μ_{N-1}} of a Gaussian mixture distribution that defines an HMM in the HMM 44 is taken out of that HMM and converted into a logarithmic FBE vector m = {m_1, m_2, ..., m_N} by inverse DCT processing in the IDCT processing unit 152. Here, N is the dimension of the vector μ.

(2) The logarithmic FBE vector m is divided into subbands {vector m_1, vector m_2, ..., vector m_K}, where K is the number of subbands. The vector of each subband is converted by DCT into a cepstrum vector, giving {vector μ_1, vector μ_2, ..., vector μ_K}.

(3) Each subband cepstrum vector is multiplied by its weighting factor.

(4) All the weighted subband cepstrum vectors are added together, giving the average vector ^μ after learning.

この平均ベクトルを用いてＨＭＭ４４中のＨＭＭの正規分布を変更することにより、調整後のＨＭＭ１２４中のＨＭＭが得られる。ＨＭＭ４４中の全てのＨＭＭを構成する全ての正規分布の平均ベクトルについてこれを繰返すことにより、ＨＭＭ１２４を生成することができる。 The HMM in the adjusted HMM 124 is obtained by changing the normal distribution of the HMM in the HMM 44 using this average vector. The HMM 124 can be generated by repeating this operation for all the normal distribution average vectors constituting all the HMMs in the HMM 44.
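Steps (1) to (4) can be sketched for a single average vector as follows. This is an illustrative sketch under assumptions that the text does not state: the DCT is taken to be an orthonormal DCT-II with as many cepstral coefficients as filter-bank channels (so that the inverse DCT is exact), and each subband vector is zero-extended to the full dimension before the DCT, as in the first embodiment. The function names are hypothetical.

```python
import math

def dct_ortho(x):
    """Orthonormal DCT-II."""
    n = len(x)
    out = []
    for q in range(n):
        s = sum(x[j] * math.cos(math.pi * q * (j + 0.5) / n) for j in range(n))
        scale = math.sqrt(1.0 / n) if q == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct_ortho(c):
    """Inverse of dct_ortho (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for j in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[q] * math.sqrt(2.0 / n) * math.cos(math.pi * q * (j + 0.5) / n)
            for q in range(1, n)
        )
        out.append(s)
    return out

def adjust_mean(mu, bands, weights):
    """Return the adjusted mean for one Gaussian component."""
    n = len(mu)
    m = idct_ortho(mu)                      # (1) cepstrum -> logarithmic FBE
    new_mu = [0.0] * n
    for (start, stop), w in zip(bands, weights):
        padded = [0.0] * n                  # (2) one subband, zero elsewhere
        padded[start:stop] = m[start:stop]
        cep = dct_ortho(padded)             # (2) subband cepstrum vector
        new_mu = [a + w * c for a, c in zip(new_mu, cep)]  # (3) weight, (4) sum
    return new_mu

# Hypothetical 4-dimensional mean with K = 2 subbands.
mu = [1.0, -0.5, 0.25, 2.0]
adjusted = adjust_mean(mu, [(0, 2), (2, 4)], [1.0, 0.8])
```

With all weights equal to 1 the sketch returns the original mean, which is a convenient sanity check when repeating the adjustment over every normal distribution of every HMM.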

-Operation-
The speech recognition system 100 operates roughly as follows. Its operation again has two main phases. The first phase is generation of the HMM 124, and the second phase is speech recognition using the HMM 124.

The first phase is further divided into three steps. The first is learning of the HMM 44 by the HMM learning unit 42. The second is learning of the weights by the weight learning unit 120. The third is obtaining the adjusted HMM 124 from the learned weights and the HMM 44.

第１の工程におけるＨＭＭ学習部４２によるＨＭＭ４４の学習は通常の方法で行なわれる。また、第２の工程における重み学習部１２０による重みの学習については、実施の形態の説明の最後に説明する。 Learning of the HMM 44 by the HMM learning unit 42 in the first step is performed by a normal method. The weight learning by the weight learning unit 120 in the second step will be described at the end of the description of the embodiment.

The HMM adjustment unit 122 operates as follows in the third step. Referring to FIG. 4, the average vector extraction processing unit 150 first extracts the average vector of one of the normal distributions that define the Gaussian mixture distributions constituting an HMM in the HMM 44. The IDCT processing unit 152 applies the inverse DCT to this average vector, returning it to a logarithmic FBE vector. The subband filter 154 divides this logarithmic FBE vector into one vector per subband and gives these to the DCT processing units 156A to 156K, respectively.

The DCT processing units 156A to 156K perform DCT processing on the logarithmic FBE vector of each subband, converting it into a cepstrum vector. The multiplier circuits 158A to 158K multiply the cepstrum vector of each subband by the weight for that subband calculated by the weight learning unit 120 in the second step, and give the results to the adder 160. The adder 160 adds together these weighted cepstrum vectors calculated for the individual subbands. In this way, the average vector of the HMM in the learned HMM 124 that corresponds to the average vector extracted by the average vector extraction processing unit 150 is calculated and incorporated into the HMM 124.

以上の様にして学習部１０２によるＨＭＭ１２４の学習が完了する。 As described above, the learning of the HMM 124 by the learning unit 102 is completed.

第２の局面では、学習後のＨＭＭ１２４を用いて音声認識が行なわれる。図１に示す入力音声２２に対し、特徴抽出部１３０が特徴抽出を行なう。特徴量は、ＨＭＭ４４の学習の際に用いられたのと同様の手法で抽出されたケプストラムベクトルである。特徴抽出部１３０はこのケプストラムベクトルをＨＭＭ１２４中の各ＨＭＭに与える。ＨＭＭ１２４中の各ＨＭＭは、与えられたケプストラムベクトルに対し、入力音声が各ＨＭＭの担当する音素である確率を出力する。判定部１３２が、これら確率に基づいて、入力音声２２の音声認識を行ない認識出力１０６を与える。 In the second aspect, speech recognition is performed using the HMM 124 after learning. A feature extraction unit 130 performs feature extraction on the input speech 22 shown in FIG. The feature amount is a cepstrum vector extracted by a method similar to that used when learning the HMM 44. The feature extraction unit 130 gives this cepstrum vector to each HMM in the HMM 124. Each HMM in the HMM 124 outputs, for a given cepstrum vector, the probability that the input speech is a phoneme assigned to each HMM. The determination unit 132 performs voice recognition of the input voice 22 based on these probabilities and gives a recognition output 106.
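The decision made by the determination unit 132 amounts to choosing the model with the highest score. The sketch below illustrates only that selection step. As a deliberate simplification it scores a single feature vector with one diagonal-covariance Gaussian per phoneme instead of a full HMM, and the names log_gauss, decide and models are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def decide(feature, models):
    """Return the label whose model gives the highest log-likelihood.

    models maps a phoneme label to a (mean, variance) pair.
    """
    return max(models, key=lambda name: log_gauss(feature, *models[name]))

# Hypothetical two-phoneme model set in a 2-dimensional cepstral space.
models = {
    "a": ([0.0, 0.0], [1.0, 1.0]),
    "i": ([3.0, 3.0], [1.0, 1.0]),
}
label = decide([2.5, 2.9], models)
```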

[Learning of the weights]
The first and second embodiments both learn a weight for each subband. In these embodiments, this learning is performed by maximum likelihood estimation in both cases. However, the weights are learned in the feature space in the first embodiment and in the model space in the second embodiment.

Recognition in ASR means that, given a trained model set Φ_X = {Φ_xi} (where Φ_xi is the i-th model) and test data Y = {vector y_1, vector y_2, ..., vector y_T}, the corresponding event sequence W = {W_1, W_2, ..., W_L} is recognized. If there is a mismatch between Φ_X and Y, errors arise in the recognized sequence W. Many ASR systems use a maximum a posteriori (MAP) decoder.
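For reference, the MAP decision rule is usually written as shown below. This is the standard textbook form under the notation of this section, not a reproduction of the patent's own numbered equation; the prior p(W), typically supplied by a language model, is an assumption of this sketch.

```latex
\hat{W} = \arg\max_{W} \, p(W \mid Y, \Phi_X)
        = \arg\max_{W} \, p(Y \mid W, \Phi_X)\, p(W)
```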

This learning is performed with the aim of minimizing the mismatch in order to improve recognition performance. As described above, the learning can be performed in either the feature space or the model space. The details of weight learning are described below for each case.

In the feature space, the training data X = {vector x_1, vector x_2, ..., vector x_T} is mapped to the sequence of observed data Y = {vector y_1, vector y_2, ..., vector y_T}. If this distortion is invertible, Y can be mapped back to X by the inverse transformation F_ν as follows.

これと同じ様にモデル空間でも、関数Ｇηによって学習後のモデルΦ_Xをテストデータと一致するモデルΦ_Yにマッピングすることができる。 Similarly, in the model space, the learned model Φ _X can be mapped to the model Φ _Y that matches the test data by the function Gη.

ただしηはこの関数のパラメータである。 Where η is a parameter of this function.

One way to minimize the mismatch is to estimate the parameter ν or η together with W so as to maximize the joint likelihood. In the feature space, this is expressed by the following equation.

モデル空間では次の式により表される。 In the model space, it is expressed by the following formula.

The joint likelihood can be maximized over ν or η and W by repeating the following two operations: first fix ν or η and vary W to maximize the likelihood, then fix W and vary ν or η to maximize the likelihood.
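The alternating procedure can be sketched generically as follows. The toy problem is purely illustrative and is not taken from the patent: W is a single label instead of a sequence, ν is a scalar bias removed from each observation, and the objective is a negative squared error standing in for the log-likelihood. All names are hypothetical.

```python
def alternate_maximize(objective, best_w, best_nu, w0, nu0, tol=1e-9, max_iter=50):
    """Alternate between maximising over W (nu fixed) and over nu (W fixed)."""
    w, nu = w0, nu0
    history = [objective(w, nu)]
    for _ in range(max_iter):
        w = best_w(nu)     # fix nu, vary W
        nu = best_nu(w)    # fix W, vary nu
        history.append(objective(w, nu))
        if history[-1] - history[-2] < tol:
            break
    return w, nu, history

# Toy setup: two candidate labels with known means, observations shifted by a bias.
means = {"a": 0.0, "b": 5.0}
ys = [5.9, 6.1, 6.0]

def objective(w, nu):
    return -sum((y - nu - means[w]) ** 2 for y in ys)

def best_w(nu):
    return max(means, key=lambda w: objective(w, nu))

def best_nu(w):
    return sum(ys) / len(ys) - means[w]

w, nu, history = alternate_maximize(objective, best_w, best_nu, "a", 0.0)
```

Each half-step can only raise the objective or leave it unchanged, so the recorded history is non-decreasing.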

Ｗを推定するプロセスは、ＡＳＲで通常行なわれている学習の手続と同様である。ここでは、ν又はηの推定について考える。Ｗを固定して考えているので、式（１５）及び（１６）から変数としてのＷを取除いて考えることができる。その結果、式（１５）及び（１６）はそれぞれ次の式（１７）及び（１８）に書き換えることができる。 The process of estimating W is the same as the learning procedure normally performed in ASR. Here, the estimation of ν or η is considered. Since W is considered fixed, it can be considered by removing W as a variable from Equations (15) and (16). As a result, the equations (15) and (16) can be rewritten into the following equations (17) and (18), respectively.

and

Let S = {s_1, s_2, ..., s_T} be the set of all possible state sequences and K = {k_1, k_2, ..., k_T} be the set of all sequences of mixture components. Equations (17) and (18) can then be rewritten as the following Equations (19) and (20), respectively.

-Learning in the feature space-
The EM algorithm is used to estimate ^ν. The EM algorithm is an iterative process with two steps. The first step is called the E-step (expectation step). In the E-step, the following auxiliary function is calculated.

The second step is called the M-step (maximization step). In this step, the value of ^ν that maximizes the auxiliary function is estimated, as shown in the following Equation (22).

もしもＱ（ν，＾ν）≧Ｑ（ν，ν）であれば、ｐ（Ｙ｜＾ν，Φ_X）≧ｐ（Ｙ｜ν，Φ_X）となることが証明されている。従って、Ｅステップ及びＭステップを繰返し適用した場合、尤度は減少しないことが保証されている。そして、この処理を尤度の増加が所定のしきい値よりも小さくなるまで繰返す。 If Q (ν, ^ ν) ≧ Q (ν, ν), it has been proven that p (Y | ^ ν, Φ _X ) ≧ p (Y | ν, Φ _X ). Therefore, it is guaranteed that the likelihood does not decrease when the E step and the M step are repeatedly applied. This process is repeated until the increase in likelihood becomes smaller than a predetermined threshold value.
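The iteration with a likelihood-gain threshold can be sketched on a deliberately simple case. The example below is not the subband weight estimator of this patent: it assumes a one-dimensional Gaussian mixture as the trained model and a pure shift x = y - bias as the inverse transformation (for which the Jacobian determinant is exactly 1), so that the M-step has a closed form. All names are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(ys, bias, mixture):
    """Log-likelihood of the shifted observations under the fixed mixture."""
    return sum(
        math.log(sum(c * normal_pdf(y - bias, m, v) for c, m, v in mixture))
        for y in ys
    )

def em_bias(ys, mixture, bias=0.0, threshold=1e-6, max_iter=100):
    """Estimate the shift by EM; stop when the likelihood gain is below threshold."""
    history = [log_likelihood(ys, bias, mixture)]
    for _ in range(max_iter):
        num = den = 0.0
        for y in ys:
            # E-step: posterior probability of each mixture component
            joint = [c * normal_pdf(y - bias, m, v) for c, m, v in mixture]
            total = sum(joint)
            for g, (c, m, v) in zip(joint, mixture):
                gamma = g / total
                num += gamma * (y - m) / v
                den += gamma / v
        bias = num / den   # M-step: closed-form maximiser of the auxiliary function
        history.append(log_likelihood(ys, bias, mixture))
        if history[-1] - history[-2] < threshold:
            break
    return bias, history

# Hypothetical model (weight, mean, variance) and observations shifted by about 2.
mixture = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
ys = [0.8, 1.2, 2.9, 3.1, 1.0, 3.0]
bias, history = em_bias(ys, mixture)
```

The recorded log-likelihood values never decrease from one iteration to the next, which is the property stated above.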

ここでは、逆変形関数によってＹの各フレームがＸの対応フレームにマップされるものとする。 Here, it is assumed that each frame of Y is mapped to a corresponding frame of X by the inverse deformation function.

The auxiliary function can be written as follows.

Here, p_y(y_t | s_t, k_t, ^ν, Φ_X) is the probability density function of the random variable y_t, and the relationship between the random variable y_t and the random variable X_t is expressed by the following equation.

Here, the denominator on the right side of Equation (25) is the Jacobian, whose (i, j)-th element is expressed as follows.

Here, y_{t,i} is the i-th element of the vector y_t, and the subscript (^ν, j) on the function f in the denominator indicates the j-th element of the function f.

In general, p_X(x_t | s_t, k_t, ^ν, Φ_X) is defined as a Gaussian distribution, so the auxiliary function can be written as follows.

The auxiliary function can also be written as follows.

Since only the terms containing ^ν need to be examined here, the auxiliary function can be written as follows.

To find the maximum of this auxiliary function, it is differentiated with respect to ^ν and the derivative is set to zero to solve for ^ν. That is,

サブバンドの重み学習のアルゴリズムでは、観測されたデータから学習データへの逆変形関数は次の式で表される。 In the subband weight learning algorithm, the inverse transformation function from the observed data to the learning data is expressed by the following equation.

式（６）によれば、この関数は次の様に表すことができる。 According to equation (6), this function can be expressed as:

Here, the vector i is a unit vector. The matrix C is a cepstrum matrix whose column vectors are the subband cepstra; that is, C = {vector c_1, vector c_2, ..., vector c_K}. As described above, K is the number of subbands. The vector w is the weight vector; that is, w = {w_1, w_2, ..., w_K}.

Because f_w is not a direct function, it is difficult to compute the following Jacobian matrix.

計算を簡単にするために、このヤコビアン行列の行列式の値が１であると仮定する。この仮定は、重みの値が全て１に近く、ケプストラムの要素が互いに独立であると仮定した場合に成り立つ。すなわち、 In order to simplify the calculation, it is assumed that the value of the determinant of this Jacobian matrix is 1. This assumption is valid when the weight values are all close to 1 and the cepstrum elements are independent of each other. That is,

すると、式（２９）は次の様に書ける。 Then, equation (29) can be written as follows.

ここで以下の様に定義する。 Here, it is defined as follows.

すると、式（３６）は次の式（３８）の様に書ける。 Then, the equation (36) can be written as the following equation (38).

一方、重みには式（７）に示す制約が課されている。式（７）は次の式（３９）の様に書くことができる。 On the other hand, the constraint shown in Expression (7) is imposed on the weight. Expression (7) can be written as the following Expression (39).

ここで、目的関数を次の様に定義する。 Here, the objective function is defined as follows.

この目的関数の最大値を見つけるために、この式を＾ｗに関し微分した式の値をゼロとおく。 In order to find the maximum value of this objective function, the value of an expression obtained by differentiating this expression with respect to ^ w is set to zero.

この式から次の式が導かれる。 From this equation, the following equation is derived.

この式から、次の解が得られる。 From this equation, the following solution is obtained.

ただし However,

λは式（３９）及び（４３）を組合せることにより得ることができる。 λ can be obtained by combining equations (39) and (43).

-Learning in the model space-
Learning in the feature space assumed that the determinant of the Jacobian matrix is 1. This assumption, however, may introduce inaccuracy into the learning process. Learning in the model space does not require it. The only assumption needed is that the features include the zeroth-order cepstrum.

Here, the EM algorithm is used to estimate ^η. In the E-step, the following auxiliary function is calculated.

In the M-step, ^η is estimated by maximizing the auxiliary function.

学習は、トレーニングデータＸによりトレーニングされたモデル中の平均ベクトルと共分散行列とに対して行なわれるものとする。学習での変換は次の通りである。 Learning is performed on the average vector and the covariance matrix in the model trained by the training data X. The conversion in learning is as follows.

この場合、ランダム変数ｙ_tの確率密度関数は次の通りである。 In this case, the probability density function of the random variable y _t is as follows.

Since only the terms containing ^η are of interest here, the auxiliary function can be written as follows.

To find the maximum of this auxiliary function, it is differentiated with respect to ^η and the derivative is set to zero.

既に述べた様に、モデル空間での変換は以下の式により表される。 As already described, the transformation in the model space is expressed by the following equation.

Here, the matrix U is a cepstrum matrix whose column vectors are the subband cepstrum vectors; that is, U = {μ_1, μ_2, ..., μ_K}. The vector w is the weight vector; that is, w = [w_1, w_2, ..., w_K]^T.

すると、式（５２）は次の様に書くことができる。 Then, equation (52) can be written as:

一方、重みに関しては式（３９）で示される制約が課されている。 On the other hand, with respect to the weight, the constraint expressed by the equation (39) is imposed.

目的関数を次の様に定義する。 The objective function is defined as follows.

この目的関数の最大値を求めるために、これを＾ｗに関し微分してその式の値を０とおく。 In order to obtain the maximum value of this objective function, it is differentiated with respect to ^ w and the value of the equation is set to zero.

この式を解くことにより次を得る。 Solving this equation gives:

その結果、解は次の通りとなる。 As a result, the solution is as follows.

ただし行列Ｘ及びベクトルｙは以下の式で表される。 However, the matrix X and the vector y are expressed by the following equations.

またλは式（３９）と（６０）とを組合せることにより得られ、その結果は式（４６）と同一になる。 Also, λ is obtained by combining equations (39) and (60), and the result is the same as equation (46).
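Since the numbered equations are not reproduced in this text, the following sketch only illustrates the general shape of such a solution. It assumes, as a hypothesis, that the objective is a quadratic form -0.5 w'Xw + y'w in the weight vector w and that the constraint fixes the sum of the weights to a constant; the constant total, the sign convention of λ and all function names are assumptions of this sketch.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x

def constrained_weights(x_mat, y_vec, total):
    """Maximise -0.5 w'Xw + y'w subject to sum(w) == total.

    Setting the gradient of the Lagrangian to zero gives X w = y - lam * 1,
    and lam follows from substituting w into the constraint.
    """
    n = len(y_vec)
    a = solve(x_mat, y_vec)         # X^{-1} y
    b = solve(x_mat, [1.0] * n)     # X^{-1} 1
    lam = (sum(a) - total) / sum(b)
    return [ai - lam * bi for ai, bi in zip(a, b)], lam

# Hypothetical 2-subband example.
X = [[2.0, 0.5], [0.5, 1.0]]
y = [1.0, 2.0]
w, lam = constrained_weights(X, y, total=2.0)
```

Under the same assumptions, dropping the constraint reduces the problem to the plain linear system solve(X, y).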

[Third Embodiment]
In the second embodiment, the weights used to calculate the average vector are learned separately for each normal distribution that is a component of the Gaussian mixture distributions defining the HMMs. If the same weights are used for similar normal distributions, the weights can be learned from less training data and the time required for learning can also be shortened.

In the third embodiment, therefore, the speech recognition system 100 of the second embodiment is improved by classifying the normal distributions into a plurality of classes and using the same weights within each class. FIG. 5 is a block diagram of an HMM adjustment unit 180 that uses weights calculated in advance for each class in this way. The speech recognition system according to this embodiment can be obtained by replacing the HMM adjustment unit 122 of FIG. 3 with the HMM adjustment unit 180 and changing the weight learning in the weight learning unit 120 as described later.

In the system of this embodiment, the normal distributions are classified by their average vectors into several (for example, L) classes. The same classification is used both during learning by the weight learning unit and during adjustment of the HMMs by the HMM adjustment unit 180. In the following description, components identical to those in FIG. 4 are given the same reference numerals. Their functions are also the same, so their detailed description is not repeated. Note that setting the number of classes to L = 1 in this embodiment gives the same result as the second embodiment.

図５を参照して、ＨＭＭ調整部１８０は、平均ベクトル取出処理部１５０と、平均ベクトル取出処理部１５０により取出された平均ベクトルがＬ個のクラスのうちいずれに属するかを判定するためのクラス判定部１９０と、取出された平均ベクトルに対し逆ＤＣＴ処理を行なうＩＤＣＴ処理部１５２、ＩＤＣＴ処理部１５２の出力をサブバンドに分割するサブバンドフィルタ１５４、及びサブバンドフィルタ１５４の出力に対しサブバンドごとのＤＣＴ処理を行なうＤＣＴ処理部１５６Ａ〜１５６Ｋとを含む。 Referring to FIG. 5, HMM adjustment unit 180 has an average vector extraction processing unit 150 and a class for determining to which of the L classes the average vector extracted by average vector extraction processing unit 150 belongs. The determination unit 190, the IDCT processing unit 152 that performs inverse DCT processing on the extracted average vector, the subband filter 154 that divides the output of the IDCT processing unit 152 into subbands, and the subband for the output of the subband filter 154 DCT processing units 156A to 156K that perform DCT processing for each.

The HMM adjustment unit 180 further includes, in place of the multiplier circuits 158A to 158K of FIG. 4, multiplier circuits 194A to 194K for multiplying the cepstrum vectors output from the DCT processing units 156A to 156K by weights that depend on the class determined by the class determination unit 190, and an adder 160 for adding the outputs of the multiplier circuits 194A to 194K. If the class determined by the class determination unit 190 is class L1, the weights by which the multiplier circuits 194A to 194K multiply the cepstrum vectors of the subbands can be written as w_{1,L1}, w_{2,L1}, ..., w_{K,L1}.

この場合の重みの学習は以下の通りに行なわれる。 In this case, weight learning is performed as follows.

The Gaussian components of the HMMs in the HMM 44 are classified into L classes. If the number of mixture components of each model is constant regardless of state (for example, C_0) and the number of states is S_0, then L = C_0 × S_0. A different weighting factor is applied to each of these L classes. In this case, Equation (55) becomes as follows.
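One reading of L = C_0 × S_0 is that a Gaussian component's class is given by its state position and its mixture index, shared across all phoneme models. The sketch below follows that reading, which is an interpretation and not something the text states outright; class_index, weights_for and the example sizes are hypothetical.

```python
def class_index(state, mixture, n_mixtures):
    """Map a (state, mixture component) pair to a class number in 0 .. L-1."""
    return state * n_mixtures + mixture

def weights_for(state, mixture, n_mixtures, class_weights):
    """Return the K subband weights shared by every Gaussian of that class."""
    return class_weights[class_index(state, mixture, n_mixtures)]

# Hypothetical sizes: S_0 = 3 states, C_0 = 4 mixture components, K = 2 subbands.
S0, C0, K = 3, 4, 2
L = S0 * C0
class_weights = [[1.0] * K for _ in range(L)]   # one weight vector per class
class_weights[class_index(1, 2, C0)] = [0.9, 1.1]
selected = weights_for(1, 2, C0, class_weights)
```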

この重み係数には、クラスごとに式（３９）と同じ制約が課されている。 This weighting factor is subject to the same restrictions as in equation (39) for each class.

この場合、目的関数は以下の様になる。 In this case, the objective function is as follows.

This objective function is differentiated with respect to ^w_c, and the derivative is set to zero.

この式を解くことにより次を得る。 Solving this equation gives:

解は次の通りである。 The solution is as follows.

ただし However,

λは式（６４）及び（６８）を組合せることで得られる。 λ is obtained by combining equations (64) and (68).

[Fourth Embodiment]
In model-space learning, the weighting applied to an average vector can be regarded as a kind of transformation of that average vector. The constraint on the weights can therefore be removed. As a result, the solution takes a slightly different form.

式（５６）を＾ｗに関し微分してその値をゼロとおく。 The expression (56) is differentiated with respect to ^ w and the value is set to zero.

この式から次を得る。 From this equation we get

解は次の通りとなる。 The solution is as follows.

ただし行列Ｘ，ベクトルｙはそれぞれ式（６１）及び（６２）に等しい。 However, the matrix X and the vector y are equal to the equations (61) and (62), respectively.

この結果を用い、第３の実施の形態と同様にして本発明の第４の実施の形態の音声認識システムを構築することができる。 Using this result, the speech recognition system according to the fourth embodiment of the present invention can be constructed in the same manner as the third embodiment.

なお、上記した実施の形態の説明では、特徴空間における重み付けとモデル空間とにおける重み付けとを別々のものとして説明した。しかし両者を同時に行なうこともできる。たとえば、雑音が混入しているサブバンドが既知の場合には、その周波数部分には小さな重み（例えばゼロ）を割当てることにより、その成分が音声認識に与える悪影響を抑えることができる。この場合、特徴空間とモデル空間との間の整合をとる必要があるので、両者において重み付けを行なう必要がある。 In the above description of the embodiment, the weighting in the feature space and the weighting in the model space are described separately. However, both can be performed simultaneously. For example, when a subband in which noise is mixed is known, a small weight (for example, zero) is assigned to the frequency portion, thereby suppressing adverse effects of the component on speech recognition. In this case, since it is necessary to achieve matching between the feature space and the model space, it is necessary to perform weighting on both.
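The effect of assigning a zero weight to a subband known to be noisy can be illustrated as follows. This is a toy sketch with made-up numbers: the weights are applied directly to a logarithmic FBE vector before an unnormalised DCT-II, which by linearity is equivalent to summing the weighted zero-extended subband cepstra, and the same mask would be applied to the model side so that both spaces stay matched. All names are hypothetical.

```python
import math

def dct(vec):
    """Unnormalised DCT-II."""
    n = len(vec)
    return [
        sum(v * math.cos(math.pi * q * (j + 0.5) / n) for j, v in enumerate(vec))
        for q in range(n)
    ]

def mask_weights(n_bands, noisy_bands):
    """Weight 0 for subbands known to contain noise, 1 elsewhere."""
    return [0.0 if k in noisy_bands else 1.0 for k in range(n_bands)]

def masked_cepstrum(log_fbe, bands, weights):
    """Cepstrum of the logarithmic FBE after per-subband weighting."""
    weighted = list(log_fbe)
    for (start, stop), w in zip(bands, weights):
        for j in range(start, stop):
            weighted[j] *= w
    return dct(weighted)

bands = [(0, 2), (2, 4), (4, 6)]
clean = [1.0, 2.0, 0.5, -1.0, 3.0, 0.25]
noisy = [1.0, 2.0, 50.5, 39.0, 3.0, 0.25]   # noise confined to the second subband
weights = mask_weights(len(bands), {1})

from_clean = masked_cepstrum(clean, bands, weights)
from_noisy = masked_cepstrum(noisy, bands, weights)
```

In this toy case the two masked cepstra coincide, so noise confined to the masked subband no longer reaches the recognizer, provided the model means are masked in the same way.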

[Experimental results]
Experiments were performed using the Aurora 2 task and the DARPA Resource Management (RM) task. The Aurora 2 corpus was created for research on distributed speech recognition in noisy environments. The data were first pre-filtered according to the frequency characteristics of an ordinary communication channel (G.712 or MIRS), and noise was then added artificially at six signal-to-noise ratios (SNR) ranging from 20 dB to -5 dB in 5 dB steps.

In the experiments, noise was added to a limited band of the speech signals of the RM task to verify the performance of the systems according to the embodiments of the present invention. 49 HMMs were prepared, one per phoneme. Each HMM has three states with four mixture components per state, except for the one-state "sp" model.

The amplitude of the noise signal was set to 1.0E5 (100 dB). The cepstrum was converted to log FBE by inverse DCT, and the log FBE was divided into 13 discrete frequency regions. Assuming that the noise spans three discrete frequency regions, noise was added in four ways: to the first to third, the fourth to sixth, the seventh to ninth, and the tenth to twelfth discrete frequency regions. For comparison, experiments were performed for six methods (FB, CSB, CCSB, WSB, WA, and UWA), using both noise-free test data and test data with added noise. FB performs speech recognition on the full band, without dividing it into sub-bands (prior art). As described above, CSB is also prior art. CCSB (concatenated clean sub-band) is the same as CSB except that the noisy discrete frequency regions are discarded. WSB (weighted sub-band) is the method according to the first embodiment of the present invention. WA (sub-band weighting adaptation) is the method according to the second embodiment of the present invention. UWA (unlimited sub-band weighting adaptation) is the method according to the fourth embodiment of the present invention.
The weighting coefficients of the discrete frequency regions to which noise was added were set to zero, and the weighting coefficients of the other regions were set to 1. Moreover, to keep the feature space and the model space consistent, learning was performed by the method.
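The binary weighting used in this setup can be illustrated with a minimal Python sketch. It is not the patent's implementation: it assumes a 13-dimensional cepstrum, an orthonormal DCT-II convention, and NumPy, and the helper names `dct_matrix`, `binary_weights`, and `mask_cepstrum` are hypothetical. It converts a cepstrum to log FBE by inverse DCT, zeroes the channels assumed to be noisy, and converts the result back to the cepstral domain.

```python
import numpy as np

NUM_CHANNELS = 13  # discrete frequency regions of the log FBE vector (from the text)

# The four noise placements described above, as 1-based inclusive channel ranges.
NOISE_BANDS = [(1, 3), (4, 6), (7, 9), (10, 12)]


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (assumed convention): cepstrum = C @ log_fbe."""
    k = np.arange(n)[:, None]   # cepstral index
    m = np.arange(n)[None, :]   # channel index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)     # scale the first row so that C @ C.T is the identity
    return c


def binary_weights(first, last, n=NUM_CHANNELS):
    """Weight 0 for the noisy channels first..last (1-based, inclusive), 1 elsewhere."""
    w = np.ones(n)
    w[first - 1:last] = 0.0
    return w


def mask_cepstrum(cepstrum, weights):
    """Inverse DCT to log FBE, apply per-channel weights, DCT back to the cepstrum."""
    c = dct_matrix(len(cepstrum))
    log_fbe = c.T @ cepstrum    # inverse DCT; valid because C is orthogonal
    return c @ (weights * log_fbe)


if __name__ == "__main__":
    cepstrum = np.random.default_rng(0).normal(size=NUM_CHANNELS)  # synthetic input
    for first, last in NOISE_BANDS:
        masked = mask_cepstrum(cepstrum, binary_weights(first, last))
        print(first, last, np.round(masked[:3], 3))
```

With all weights equal to 1 the function returns the input cepstrum unchanged, which is a convenient sanity check on the assumed DCT convention.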

The recognition accuracy of each method is shown in Table 1.

As can be seen from Table 1, on the noise-free test data, FB, which uses the full band, outperformed CSB. On the test data with added noise, the performance of FB dropped considerably. The drop was more pronounced when noise was added to the low-frequency regions, presumably because the noise obscured the formants of the speech signal.

The average accuracy of CSB is lower than that of FB. The accuracy of CCSB is slightly lower than that of CSB. This result shows that, even when noise has been added to a sub-band, using the information in that sub-band harms recognition accuracy less than discarding it entirely. On the test data with added noise, WSB is far more accurate than both CSB and FB, and its accuracy approaches that of FB on noise-free data. The same holds for WA and UWA, both of which achieved higher accuracy than FB.

Under similar conditions, experiments were also conducted to examine how recognition accuracy changes with the number of sub-bands into which the signal is divided. The results are shown in FIGS. 6 and 7. FIG. 6 relates to the Aurora 2 task, and FIG. 7 relates to the RM task.

Referring to FIGS. 6 and 7, the experiments were performed while varying the number of sub-bands from 1 to 13. The results are for WA and UWA. FIGS. 6 and 7 show a clear tendency, in both cases, for recognition accuracy to improve as the number of sub-bands increases. In particular, in FIG. 6, good performance is obtained when the number of sub-bands is 7 or more for WA and 3 or more for UWA; in FIG. 7, good results are obtained when the number of sub-bands is 3 or more for WA and 4 or more for UWA.

[Preferable effects of the embodiments]
As described above, according to each embodiment of the present invention, the correlation between sub-bands can be preserved in a sub-band approach. The experimental results also show that, when noise is added to a limited band of the speech signal, these methods achieve higher performance than existing methods. The weight for each sub-band can be estimated by maximum likelihood estimation, and because the number of sub-bands is small, the weights can be estimated even from a small amount of training data.
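The feature construction recited in claim 1 (weight each sub-band's log FBE vector, pad the remaining elements with a constant, convert each padded vector to a mel cepstrum vector, and add the results) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation: the orthonormal DCT-II convention, the near-equal contiguous sub-band split, and the helper names `dct_matrix`, `subband_slices`, and `weighted_subband_feature` are all hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (assumed convention): cepstrum = C @ log_fbe."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c


def subband_slices(n_channels, n_subbands):
    """Contiguous, near-equal sub-bands over the channels (an assumed split)."""
    edges = np.linspace(0, n_channels, n_subbands + 1).round().astype(int)
    return [slice(a, b) for a, b in zip(edges[:-1], edges[1:])]


def weighted_subband_feature(log_fbe, weights, pad_value=0.0):
    """Sum of per-sub-band cepstra, each computed from a weighted, padded log FBE."""
    n = len(log_fbe)
    c = dct_matrix(n)
    feature = np.zeros(n)
    for w, band in zip(weights, subband_slices(n, len(weights))):
        padded = np.full(n, pad_value)     # predetermined constant outside the sub-band
        padded[band] = w * log_fbe[band]   # weighted elements of this sub-band
        feature += c @ padded              # sub-band cepstrum of full dimension, then add
    return feature


if __name__ == "__main__":
    log_fbe = np.random.default_rng(0).normal(size=13)   # synthetic 13-channel log FBE
    weights = np.array([0.0, 1.0, 1.0, 1.0])             # first sub-band treated as noisy
    print(np.round(weighted_subband_feature(log_fbe, weights), 3))
```

Because the DCT is linear, with a padding constant of 0 (as in claim 5) the summed feature equals the DCT of the full-band log FBE vector after per-channel weighting. The result therefore has the same dimension as a full-band mel cepstrum vector and can be passed to a recognizer trained on full-band features.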

The embodiments disclosed herein are merely examples, and the present invention is not limited to the embodiments described above. The scope of the present invention is indicated by each claim in the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a block diagram of a speech recognition system according to a first embodiment of the present invention.
FIG. 2 is a detailed block diagram of the sub-band feature extraction unit 60 shown in FIG. 1.
FIG. 3 is a block diagram of a speech recognition system according to a second embodiment.
FIG. 4 is a detailed block diagram of the HMM adjustment unit 122 shown in FIG. 3.
FIG. 5 is a detailed block diagram of the HMM adjustment unit 180 of the speech recognition system according to the second embodiment.
FIG. 6 is a graph showing the relationship between the number of sub-bands and the recognition accuracy of the WA and UWA methods.
FIG. 7 is a graph showing the relationship between the number of sub-bands and the recognition accuracy of the WA and UWA methods.

Explanation of symbols

10: speech recognition system; 20, 102: learning unit; 22: input speech; 24, 104: speech recognition unit; 26, 106: recognition output; 40: speech corpus; 42: HMM learning unit; 44, 124: HMM; 46: training speech storage unit; 48: weight learning unit; 50: weight storage unit; 60: sub-band feature extraction unit; 62: determination unit; 80: DFT processing unit; 82: mel filter bank; 84: logarithm calculation unit; 86: sub-band filter; 88A to 88K, 158A to 158K, 194A to 194K: multiplication circuits; 90A to 90K, 156A to 156K: DCT processing units; 92: addition unit; 120: weight learning unit; 122, 180: HMM adjustment unit; 130: feature extraction unit; 132: determination unit; 150: mean vector extraction unit; 152: IDCT processing unit; 154: sub-band filter; 190: class determination unit

Claims

A speech recognition device comprising:
speech recognition means trained to perform speech recognition using, as input, a mel cepstrum feature vector obtained from a speech signal;
means for extracting a logarithmic filter band energy (log FBE) vector for each of a plurality of sub-bands from an input speech signal;
a plurality of multiplication means for multiplying each log FBE vector by the weight corresponding to its sub-band and, in each log FBE vector, embedding a predetermined constant in the elements other than the elements corresponding to that sub-band, thereby extending the dimension of the log FBE vector to the dimension of a log FBE vector extracted from the full band;
a plurality of mel cepstrum conversion means for converting the outputs of the plurality of multiplication means into mel cepstrum vectors of the respective sub-bands, each having the same dimension as the mel cepstrum feature vector; and
means for adding together the mel cepstrum vectors of the respective sub-bands output by the plurality of mel cepstrum conversion means, thereby generating a feature vector whose elements are mel cepstrum coefficients, and for supplying the feature vector to the speech recognition means.

The speech recognition device according to claim 1, wherein the means for extracting the log FBE vector includes:
speech spectrum extraction means for extracting a speech spectrum from the speech signal;
mel filter means for receiving the speech spectrum extracted by the speech spectrum extraction means and outputting a mel frequency FBE vector;
logarithm calculation means for performing a logarithm calculation on the mel frequency FBE vector output from the mel filter means and outputting a log FBE vector; and
sub-band filter means for dividing the log FBE vector output from the logarithm calculation means into the plurality of sub-bands.

The speech recognition device according to claim 1 or claim 2, wherein the speech recognition means includes a decoder based on an HMM (Hidden Markov Model) trained to perform speech recognition using, as input, a mel cepstrum feature vector obtained from a speech signal.

The speech recognition device according to any one of claims 1 to 3, wherein the elements w_i (i = 1 to K, where K is the number of sub-bands) of the weight vector consisting of the weights corresponding to the respective sub-bands satisfy the following constraint:

The speech recognition device according to any one of claims 1 to 4, wherein the predetermined constant is 0.