JPH06289891A

JPH06289891A - Speech recognition device

Info

Publication number: JPH06289891A
Application number: JP5077025A
Authority: JP
Inventors: Tadashi Suzuki; 鈴木　　忠; Kunio Nakajima; 邦男中島
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1993-04-02
Filing date: 1993-04-02
Publication date: 1994-10-18
Anticipated expiration: 2015-10-23
Also published as: JP3102195B2

Abstract

PURPOSE: To obtain excellent recognition performance even for input speech with non-stationary superimposed noise, in which the superimposed noise and the S/N vary greatly, by providing a function that estimates the power and spectrum of the noise superimposed on the noise-superimposed input speech signal for every acoustic analysis frame. CONSTITUTION: A similarity calculation means 5 calculates the similarity between the noise-superimposed speech feature vectors synthesized by a feature vector synthesis means 16 and the feature vector, in the noise-superimposed input speech feature vector time series, that was the object of the S/N calculation by an S/N calculation means 15. It outputs to a collating means 6 the similarity to noise-superimposed speech feature vectors to which noise matched to the S/N of each feature vector in the time series has been added. The collating means 6 uses the specified similarity data to collate the noise-superimposed input speech feature vector time series with the speech models of the respective categories, under the constraint of a noise model, so that the similarity is maximized, and outputs the category of the speech model giving the highest similarity as the recognition result.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device having a function of estimating the noise superimposed on a noise-superimposed input speech signal.

[0002]

[Prior Art] In a speech recognition device that uses spectral information of speech, recognition performance deteriorates when there is a large difference between the noise superimposed on the speech signals used to train the standard speech models for collation and the noise superimposed on the speech signal input at recognition time. This is because the deformation of the speech spectrum caused by noise superposition depends strongly on the spectral character of the superimposed noise.

[0003] To avoid such degradation, the standard speech models must be trained in the recognition environment, but this is inconvenient because the speech training has to be redone every time the recognition environment changes. As an alternative, a method that trains the standard speech models in a quiet environment where no noise is superimposed on the speech and, at recognition time, adds the superimposed noise of the recognition environment to the standard speech models is proposed in the paper "Digit speech recognition under noise using multi-templates" (Kitamura and Mizutani, Proceedings of the Acoustical Society of Japan, October 1989, pp. 65-66).

[0004] FIG. 5 is an example of a block diagram of a speech recognition device based on this method. In the figure, 2 is acoustic analysis means that performs acoustic analysis on the noise-superimposed unknown input speech signal input from input terminal 1 and outputs a noise-superimposed input speech feature vector time series; 3 is a speech model memory that stores speech models created from training speech on which no noise is superimposed; and 4 is average noise adding means that adds the feature vector of an average superimposed noise to the standard speech feature vectors in speech model memory 3.

[0005] 5 is similarity calculation means that receives the noise-added feature vectors output by average noise adding means 4 and the noise-superimposed input speech feature vector time series output by acoustic analysis means 2, and calculates the similarity between each feature vector of that time series and the noise-added feature vectors; 6 is collating means that receives the similarity data output by similarity calculation means 5, collates the noise-superimposed input speech feature vector time series with the speech models, and outputs recognition result 7.

[0006] Next, the operation is described, taking discrete (isolated) word recognition by DP matching as an example. The noise-superimposed input speech signal input from input terminal 1 is acoustically analyzed by acoustic analysis means 2 using arbitrary analysis frames (for example, a 10 msec period, a 25.6 msec frame length, and a Hamming window) and converted into a noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} whose feature vectors are autocorrelation coefficients. Here X(i) is the autocorrelation coefficient vector of the i-th frame, and I is the number of frames.
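The framing step of paragraph [0006] can be sketched roughly as follows. This is only an illustration: the function name `autocorr_features`, the 10 kHz sampling rate, and the analysis order of 10 are assumptions that do not appear in the text, which specifies only the frame period, frame length, and window type as examples.

```python
import math

def autocorr_features(signal, sample_rate=10000, period_ms=10.0,
                      frame_ms=25.6, order=10):
    """Return one autocorrelation vector [r(0), ..., r(order)] per frame.

    sample_rate and order are assumed values, not taken from the patent.
    """
    hop = int(round(sample_rate * period_ms / 1000.0))     # frame period in samples
    length = int(round(sample_rate * frame_ms / 1000.0))   # frame length in samples
    # Hamming window of the frame length
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
              for n in range(length)]
    frames = []
    start = 0
    while start + length <= len(signal):
        x = [signal[start + n] * window[n] for n in range(length)]
        # Short-time autocorrelation coefficients of the windowed frame
        r = [sum(x[n] * x[n + m] for n in range(length - m))
             for m in range(order + 1)]
        frames.append(r)
        start += hop
    return frames
```

Each returned vector plays the role of X(i); its first element r(0) is the frame energy, which is why later steps can treat it as the frame power.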

[0007] Speech model memory 3 stores, as the speech model of category k (k = 1, 2, ..., K), the feature vector time series {Sk(j) | j = 1, 2, ..., Jk} of word speech of category k that either has no superimposed noise or has an SN ratio better than that expected for the noise-superimposed input speech signal. Here Sk(j) is the autocorrelation coefficient vector of the j-th frame of the word speech of category k, hereafter called the standard speech feature vector.

[0008] Average noise adding means 4 adds a previously given feature vector Z of the average superimposed noise to the standard speech feature vector Sk(j) of the category-k speech model stored in speech model memory 3, in such a way that a likewise predetermined SN ratio is obtained, and outputs the result as the noise-added standard speech feature vector Yk(j). The addition of Z to Sk(j) is performed as a vector sum.
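A minimal sketch of the noise addition in paragraph [0008] is given below. The text states only that Z is added by vector sum so that a predetermined SN ratio results; the scaling rule used here (a gain g applied to Z, with the zeroth autocorrelation coefficient taken as the power) and the name `add_average_noise` are assumptions made for illustration.

```python
def add_average_noise(s, z, snr_db):
    """Return Y = S + g * Z for autocorrelation vectors S and Z.

    Assumption: the power of each vector is its zeroth coefficient, and the
    gain g is chosen so that s[0] / (g * z[0]) equals the target SN ratio.
    """
    snr = 10.0 ** (snr_db / 10.0)      # target SN ratio as a power ratio
    g = s[0] / (snr * z[0])            # noise gain that yields that ratio
    return [sv + g * zv for sv, zv in zip(s, z)]
```

A plain vector sum is reasonable in the autocorrelation domain because, for uncorrelated speech and noise, the autocorrelation of the sum is on average the sum of the autocorrelations.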

[0009] Similarity calculation means 5 calculates and outputs the similarity Dk(i, j) between each feature vector X(i) of the noise-superimposed input speech feature vector time series output by acoustic analysis means 2 and the noise-added standard speech feature vector Yk(j) output by average noise adding means 4. As the similarity, for example, the reciprocal of the Euclidean distance between the LPC cepstrum coefficient vectors obtained by LPC analysis of X(i) and Yk(j) is used.
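The example similarity of paragraph [0009] might be computed along the following lines. The sketch uses the standard Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion; the function names, the sign convention A(z) = 1 + a1 z^-1 + ... + ap z^-p, and the small constant `eps` that guards against division by zero are assumptions, not details given in the text.

```python
import math

def levinson(r, order):
    """Levinson-Durbin: autocorrelation r[0..order] -> inverse filter a[0..order]."""
    a = [1.0] + [0.0] * order
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / e                       # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        e *= (1.0 - k * k)                 # prediction error power
    return a

def lpc_cepstrum(r, order):
    """Cepstrum c[1..order] of the all-pole model 1/A(z) derived from r."""
    a = levinson(r, order)
    c = [0.0] * (order + 1)
    for n in range(1, order + 1):
        c[n] = -a[n] - sum(m * c[m] * a[n - m] for m in range(1, n)) / n
    return c[1:]

def similarity(x, y, order, eps=1e-12):
    """Reciprocal of the Euclidean distance between LPC cepstra (eps is assumed)."""
    cx, cy = lpc_cepstrum(x, order), lpc_cepstrum(y, order)
    d = math.sqrt(sum((u - v) ** 2 for u, v in zip(cx, cy)))
    return 1.0 / (d + eps)
```

Calling `similarity(X_i, Y_kj, order)` for every pair (i, j) would fill the table Dk(i, j) that the collating step consumes.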

[0010] Collating means 6 performs DP matching using the similarities Dk(i, j) (i = 1, 2, ..., I; j = 1, 2, ..., Jk) output by similarity calculation means 5, and obtains the similarity of the category-k speech model to the noise-superimposed input speech. This is done for all speech models, and the category of the speech model that maximizes the similarity is output as recognition result 7.
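The collation of paragraph [0010] can be illustrated with a small dynamic-programming sketch. The text does not state the path constraints, so the local moves used here (each input frame advances the model index by 0, 1, or 2, which gives every path exactly I terms) are an assumption, as are the names `dp_match` and `recognize`.

```python
def dp_match(sim):
    """Best cumulative similarity of a monotone path through sim[i][j].

    sim[i][j] stands for D(i, j). Assumed constraint: the path starts at the
    first frames of both sequences, ends at the last frames, and the model
    index advances by 0, 1, or 2 per input frame.
    """
    n_in, n_ref = len(sim), len(sim[0])
    neg = float("-inf")
    prev = [neg] * n_ref
    prev[0] = sim[0][0]
    for i in range(1, n_in):
        cur = [neg] * n_ref
        for j in range(n_ref):
            best = prev[j]
            if j >= 1:
                best = max(best, prev[j - 1])
            if j >= 2:
                best = max(best, prev[j - 2])
            if best > neg:
                cur[j] = best + sim[i][j]
        prev = cur
    return prev[n_ref - 1]

def recognize(sims_by_category):
    """Return the category whose model gives the highest DP similarity."""
    return max(sims_by_category, key=lambda k: dp_match(sims_by_category[k]))
```

Because every admissible path has the same number of terms, the scores of models with different lengths Jk stay comparable without extra normalization; a different slope constraint would need its own normalization rule.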

[0011] Through the above processing, the noise-superimposed input speech feature vector time series is collated with speech models composed of standard speech feature vectors whose spectra have been deformed by adding the noise vector, which suppresses the degradation of recognition performance caused by spectral deformation due to noise superposition.

[0012]

[Problems to Be Solved by the Invention] Because the conventional speech recognition device is configured as described above, the noise-superimposed input speech is collated with speech models composed of standard speech feature vectors whose spectra have been deformed by adding an average noise vector at a certain SN ratio. For noise-superimposed input speech of known SN ratio on which noise with little variation is superimposed, the degradation of recognition performance caused by noise superposition could therefore be suppressed.

[0013] Actual environmental noise, however, varies stochastically. Even noise that seems relatively stationary, such as air-conditioner fan noise, is clearly non-stationary and changes from frame to frame when short-time spectral analysis is performed on the analysis frames used in acoustic analysis. Still less can stationarity of the superimposed noise be expected in more general noise environments that contain a variety of noise sources. In addition, the SN ratio of the noise-superimposed input speech changes with the loudness of the utterance and with variation in the distance between the mouth and the microphone.

[0014] Consequently, the conventional speech recognition device has the problem that degradation of recognition performance is unavoidable in non-stationary noise environments where noise differing from the average noise spectrum is superimposed, or when the SN ratio of the input speech fluctuates greatly.

[0015] The present invention was made to solve the above problem. Its object is to obtain a speech recognition device that, by having a function of estimating the power and spectrum of the superimposed noise in the noise-superimposed input speech signal for each acoustic analysis frame, exhibits very good recognition performance both for unknown input speech on which noise non-stationary in power and spectrum is superimposed and for noise-superimposed input speech whose SN ratio fluctuates because of changes in utterance volume or in the distance from the mouth to the microphone.

[0016]

[Means for Solving the Problems] The speech recognition device according to the present invention comprises: acoustic analysis means that performs acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed and outputs a noise-superimposed input speech feature vector time series; a noise model memory that stores a noise model representing the feature vector time series of noise superimposed on speech signals; a speech model memory that stores speech models representing the feature vector time series of standard speech; linear prediction analysis means that performs linear prediction analysis on the standard speech feature vectors stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual power; a maximum likelihood parameter memory that stores the maximum likelihood parameters output by the linear prediction analysis means; a speech residual power memory that stores the standard speech residual power likewise output by the linear prediction analysis means; noise residual calculation means that takes the noise feature vectors in the noise model memory as input and performs a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual power; a noise residual power memory that stores the noise residual power output by the noise residual calculation means; residual power calculation means that performs a product-sum operation between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual power; SN ratio calculation means that obtains the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual power output by the residual power calculation means, the standard speech residual power in the speech residual power memory, and the noise residual power in the noise residual power memory; feature vector synthesis means that, in accordance with the SN ratio output by the SN ratio calculation means, synthesizes the standard speech feature vectors in the speech model memory with the noise feature vectors in the noise model memory to generate noise-superimposed speech feature vectors; similarity calculation means that calculates the similarity between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the noise-superimposed speech feature vectors output by the feature vector synthesis means; and collating means that performs collation using the similarity data output by the similarity calculation means and outputs a recognition result.
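The product-sum operations named in paragraph [0016] can be illustrated as follows. This sketch rests on an interpretation that the passage itself does not spell out: the "maximum likelihood parameters" are taken to be the autocorrelation of the inverse-filter (LPC) coefficients of a standard speech frame, in which case the residual power of any signal passed through that filter is a product-sum with the signal's autocorrelation vector. All function names are hypothetical.

```python
def levinson(r, order):
    """Levinson-Durbin: returns inverse filter a[0..order] and residual power."""
    a = [1.0] + [0.0] * order
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / e
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        e *= (1.0 - k * k)
    return a, e

def ml_parameters(a):
    """Assumed form of the maximum likelihood parameters:
    autocorrelation of the inverse filter coefficients."""
    return [sum(a[k] * a[k + m] for k in range(len(a) - m))
            for m in range(len(a))]

def residual_power(r, ra):
    """Product-sum of an autocorrelation vector r with parameters ra."""
    return r[0] * ra[0] + 2.0 * sum(r[m] * ra[m] for m in range(1, len(ra)))
```

Under this reading, applying `residual_power` to a standard speech vector reproduces the standard speech residual power from the linear prediction analysis, while applying it to a noise vector or to an input frame X(i) gives the noise residual power and the noise-superimposed input speech residual power. How the SN ratio is then derived from these three quantities is not described in this passage, so it is left out of the sketch.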

[0017] The speech recognition device according to the invention of claim 2 comprises: acoustic analysis means that performs acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed and outputs a noise-superimposed input speech feature vector time series; a noise model memory that stores a noise model representing the feature vector time series of noise superimposed on speech signals; a speech model memory that stores speech models representing the feature vector time series of standard speech; linear prediction analysis means that performs linear prediction analysis on the standard speech feature vectors stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual power; a maximum likelihood parameter memory that stores the maximum likelihood parameters output by the linear prediction analysis means; a speech residual power memory that stores the standard speech residual power likewise output by the linear prediction analysis means; noise residual calculation means that takes the noise feature vectors in the noise model memory as input and performs a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual power; a noise residual power memory that stores the noise residual power output by the noise residual calculation means; residual power calculation means that performs a product-sum operation between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual power; SN ratio calculation means that obtains the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual power output by the residual power calculation means, the standard speech residual power in the speech residual power memory, and the noise residual power in the noise residual power memory; feature vector synthesis means that, in accordance with the SN ratio output by the SN ratio calculation means, synthesizes the standard speech feature vectors in the speech model memory with the noise feature vectors in the noise model memory to generate noise-superimposed speech feature vectors; similarity calculation means that calculates the similarity between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the noise-superimposed speech feature vectors output by the feature vector synthesis means; optimal matching path determination means that takes the similarity data output by the similarity calculation means as input and obtains the optimal matching path between the speech models and the noise-superimposed input speech feature vector time series; superimposed noise generation means that generates, for each feature vector in the noise-superimposed input speech feature vector time series output by the acoustic analysis means, superimposed noise feature vectors using the SN ratio output by the SN ratio calculation means and the noise feature vectors in the noise model memory; superimposed noise determination means that obtains an input noise feature vector time series using the matching path data output by the optimal matching path determination means and the superimposed noise feature vectors output by the superimposed noise generation means; power ratio determination means that takes as input the SN ratio output by the SN ratio calculation means, the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, and the matching path data output by the optimal matching path determination means, and obtains a speech power ratio; noise-adapted similarity calculation means that takes as input the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, the input noise feature vector time series output by the superimposed noise determination means, and the speech power ratio output by the power ratio determination means, and calculates a noise-adapted similarity between each feature vector of the noise-superimposed input speech feature vector time series and the standard speech feature vectors in the speech model memory; and collating means that performs collation using the noise-adapted similarity data output by the noise-adapted similarity calculation means and outputs a recognition result.

[0018] The speech recognition device according to the invention of claim 3 comprises: acoustic analysis means that performs acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed and outputs a noise-superimposed input speech feature vector time series; a noise model memory that stores a noise model representing the feature vector time series of noise superimposed on speech signals; a speech model memory that stores speech models representing the feature vector time series of standard speech; linear prediction analysis means that performs linear prediction analysis on the standard speech feature vectors stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual power; a maximum likelihood parameter memory that stores the maximum likelihood parameters output by the linear prediction analysis means; a speech residual power memory that stores the standard speech residual power likewise output by the linear prediction analysis means; noise residual calculation means that takes the noise feature vectors in the noise model memory as input and performs a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual power; a noise residual power memory that stores the noise residual power output by the noise residual calculation means; residual power calculation means that performs a product-sum operation between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual power; SN ratio calculation means that obtains the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual power output by the residual power calculation means, the standard speech residual power in the speech residual power memory, and the noise residual power in the noise residual power memory; feature vector synthesis means that, in accordance with the SN ratio output by the SN ratio calculation means, synthesizes the standard speech feature vectors in the speech model memory with the noise feature vectors in the noise model memory to generate noise-superimposed speech feature vectors; similarity calculation means that calculates the similarity between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the noise-superimposed speech feature vectors output by the feature vector synthesis means; optimal matching path determination means that takes the similarity data output by the similarity calculation means as input and obtains the optimal matching path between the speech models and the noise-superimposed input speech feature vector time series; superimposed noise generation means that generates, for each feature vector in the noise-superimposed input speech feature vector time series output by the acoustic analysis means, superimposed noise feature vectors using the SN ratio output by the SN ratio calculation means and the noise feature vectors in the noise model memory; superimposed noise determination means that obtains an input noise feature vector time series using the matching path data output by the optimal matching path determination means and the superimposed noise feature vectors output by the superimposed noise generation means; noise-removal similarity calculation means that takes as input the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, and the input noise feature vector time series output by the superimposed noise determination means, and calculates a noise-removal similarity between each feature vector of the noise-superimposed input speech feature vector time series and the standard speech feature vectors in the speech model memory; and collating means that performs collation using the noise-adapted similarity data output by the noise-removal similarity calculation means and outputs a recognition result.

[0019] The speech recognition device according to the invention of claim 4 comprises: acoustic analysis means that performs acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed and outputs a noise-superimposed input speech feature vector time series; a noise model memory that stores a noise model representing the feature vector time series of noise superimposed on speech signals; a speech model memory that stores speech models representing the feature vector time series of standard speech; linear prediction analysis means that performs linear prediction analysis on the standard speech feature vectors stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual power; a maximum likelihood parameter memory that stores the maximum likelihood parameters output by the linear prediction analysis means; a speech residual power memory that stores the standard speech residual power likewise output by the linear prediction analysis means; noise residual calculation means that takes the noise feature vectors in the noise model memory as input and performs a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual power; a noise residual power memory that stores the noise residual power output by the noise residual calculation means; residual power calculation means that performs a product-sum operation between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual power; SN ratio calculation means that obtains the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual power output by the residual power calculation means, the standard speech residual power in the speech residual power memory, and the noise residual power in the noise residual power memory; feature vector synthesis means that, in accordance with the SN ratio output by the SN ratio calculation means, synthesizes the standard speech feature vectors in the speech model memory with the noise feature vectors in the noise model memory to generate noise-superimposed speech feature vectors; similarity calculation means that calculates the similarity between each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the noise-superimposed speech feature vectors output by the feature vector synthesis means; optimal matching path determination means that takes the similarity data output by the similarity calculation means as input and obtains the optimal matching path between the speech models and the noise-superimposed input speech feature vector time series; power ratio determination means that takes as input the SN ratio output by the SN ratio calculation means, the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, and the matching path data output by the optimal matching path determination means, and obtains a speech power ratio; superimposed noise generation means that generates, for each feature vector in the noise-superimposed input speech feature vector time series output by the acoustic analysis means, superimposed noise feature vectors using the SN ratio output by the SN ratio calculation means and the noise feature vectors in the noise model memory; additive noise determination means that obtains additive noise feature vectors using the matching path data output by the optimal matching path determination means, the superimposed noise feature vectors output by the superimposed noise generation means, and the speech power ratio output by the power ratio determination means; noise adding means that takes as input the additive noise feature vectors output by the additive noise determination means and the standard speech feature vectors in the speech model memory and obtains noise-added standard speech feature vectors; similarity calculation means that calculates the similarity between the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the noise-added standard speech feature vectors output by the noise adding means; and collating means that performs collation using the similarity data output by the similarity calculation means and outputs a recognition result.

【００２０】[0020]

【作用】[Operation] In the present invention, when the SN ratio calculating means calculates the SN ratio for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, it uses three residual powers: the noise-superimposed input speech residual power output by the residual power calculation means, the standard speech residual power output by the linear prediction analysis means, and the noise residual power output by the noise residual calculation means. To obtain these three residual powers, the linear prediction analysis means and the noise residual calculation means use two kinds of feature vectors: the standard speech feature vectors of the speech models stored in the speech model memory and the noise feature vectors of the noise models stored in the noise model memory. The feature vector synthesizing means combines these two kinds of feature vectors, that is, a standard speech feature vector and a noise feature vector, according to the SN ratio that the SN ratio calculating means obtained from the three residual powers for the feature vector in the noise-superimposed input speech feature vector time series, and outputs the result as a noise-superimposed speech feature vector.

【００２１】[0021] The similarity calculation means of the present invention calculates the similarity between the noise-superimposed speech feature vector synthesized by the feature vector synthesizing means as described above and the feature vector in the noise-superimposed input speech feature vector time series for which the SN ratio calculating means computed the SN ratio. The matching means therefore receives, for each feature vector of the noise-superimposed input speech feature vector time series, the similarity to a noise-superimposed speech feature vector to which noise has been added in accordance with the SN ratio of that feature vector.

【００２２】[0022] Using the similarity data generated in this way, the matching means matches the noise-superimposed input speech feature vector time series against the speech model of each category, under the constraints of the noise model, so that the similarity is maximized, and outputs the category of the speech model that gave the highest similarity as the recognition result.

【００２３】[0023] In another invention, the superposed noise generation means obtains the power of the noise component in a feature vector from the SN ratio output by the SN ratio calculating means and the power of the feature vector in the noise-superimposed input speech feature vector time series for which that SN ratio was calculated. From this value and the noise feature vector in the noise model that corresponds to the noise residual power used by the SN ratio calculating means, it generates a superposed noise feature vector. The superposed noise feature vector generated here is uniquely determined by three feature vectors: the feature vector in the noise-superimposed input speech feature vector time series, the standard speech feature vector of the speech model stored in the speech model memory, and the noise feature vector of the noise model stored in the noise model memory. It therefore corresponds one-to-one with the similarity data output by the similarity calculation means and the SN ratio output by the SN ratio calculating means.

【００２４】[0024] Using the similarity data output by the similarity calculation means, the optimum matching path determining means determines, under the constraints of the noise model, the matching path that maximizes the similarity between the noise-superimposed input speech feature vector time series and the speech model.

【００２５】[0025] Using the superposed noise feature vectors generated by the superposed noise generation means and the matching path output by the optimum matching path determining means, the superposed noise determining means obtains the superposed noise feature vector corresponding to each feature vector of the noise-superimposed input speech feature vector time series and outputs the result as a superposed noise feature vector time series. The power ratio determining means obtains the power ratio between the speech signal contained in the noise-superimposed input speech and the speech model, using the SN ratio output by the SN ratio calculating means, the matching path output by the optimum matching path determining means, the power of each feature vector of the noise-superimposed input speech feature vector time series, and the power of the feature vectors of the speech model stored in the speech model memory.

【００２６】[0026] The noise adaptation similarity calculation means applies power normalization to the standard speech feature vectors of the speech model stored in the speech model memory, using the speech power ratio output by the power ratio determining means, so that the power of the speech model matches that of the speech signal in the noise-superimposed input speech. It then uses the superposed noise feature vector time series output by the superposed noise determining means to obtain the noise-adapted similarity to each feature vector of the noise-superimposed input speech feature vector time series.

【００２７】[0027] In yet another invention, the noise removal similarity calculation means removes noise from each feature vector of the noise-superimposed input speech feature vector time series using the corresponding feature vector of the superposed noise feature vector time series output by the superposed noise determining means, and calculates the similarity to each feature vector of the speech model stored in the speech model memory.
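The text does not say at this point how the noise is removed. One plausible realization, shown only as a sketch, subtracts the estimated noise autocorrelation from the input autocorrelation, which is justified when speech and noise are uncorrelated. The function name `remove_noise` and the `floor` parameter, which keeps the remaining power positive, are assumptions introduced for illustration.

```python
def remove_noise(frame_r, noise_r, floor=1e-6):
    """Subtract an estimated noise autocorrelation vector from the
    autocorrelation vector of a noisy frame.

    Assumption: autocorrelations of uncorrelated signals add, so
    subtraction recovers an estimate of the clean-speech vector.
    The noise is scaled down if needed so that the zero-lag term
    (the power) stays strictly positive.
    """
    if noise_r[0] <= 0.0:
        return list(frame_r)
    scale = min(1.0, (1.0 - floor) * frame_r[0] / noise_r[0])
    return [x - scale * z for x, z in zip(frame_r, noise_r)]
```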

【００２８】[0028] In yet another invention, the additive noise determining means obtains the additive noise feature vector for each standard speech feature vector of the speech model, using the superposed noise feature vectors generated by the superposed noise generation means, the matching path output by the optimum matching path determining means, and the speech power ratio output by the power ratio determining means. The noise adding means adds the additive noise feature vector to the standard speech feature vector of the speech model and outputs a noise-added standard speech feature vector. The similarity calculation means calculates the similarity between the noise-added standard speech feature vector and each feature vector of the noise-superimposed input speech feature vector time series.

【００２９】[0029]

【実施例】[Embodiments]

Embodiment 1. FIG. 1 is a block diagram showing the configuration of an embodiment of the speech recognition apparatus according to the invention of claim 1. In the figure, 2 is an acoustic analysis means that performs acoustic analysis on the noise-superimposed input speech entering at input terminal 1 and outputs a noise-superimposed input speech feature vector time series, and 3 is a speech model memory that stores speech models representing the feature vector time series of standard speech.

【００３０】[0030] 8 is a noise model memory that stores noise models representing the feature vector time series of the noise superimposed on speech. 9 is a linear prediction analysis means that takes the standard speech feature vectors of the speech models stored in the speech model memory 3 as input, performs linear prediction analysis, and writes the maximum likelihood parameters to the maximum likelihood parameter memory 10 and the standard speech residual powers to the speech residual power memory 11. 12 is a noise residual calculation means that obtains the noise residual powers by a product-sum operation between the noise feature vectors of the noise models stored in the noise model memory 8 and the maximum likelihood parameters stored in the maximum likelihood parameter memory 10, and writes them to the noise residual power memory 13.

【００３１】[0031] 14 is a residual power calculation means that, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2, performs a product-sum operation with the maximum likelihood parameters stored in the maximum likelihood parameter memory 10 and obtains the noise-superimposed input speech residual power. 15 is an SN ratio calculating means that obtains the SN ratio of the noise-superimposed input speech from the noise-superimposed input speech residual power output by the residual power calculation means 14, the standard speech residual power stored in the speech residual power memory 11, and the noise residual power stored in the noise residual power memory 13. 16 is a feature vector synthesizing means that, according to the SN ratio output by the SN ratio calculating means 15, combines a standard speech feature vector stored in the speech model memory 3 with a noise feature vector stored in the noise model memory 8 to generate a noise-superimposed speech feature vector.

【００３２】[0032] 5 is a similarity calculation means that calculates, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2, the similarity to the noise-superimposed speech feature vector output by the feature vector synthesizing means 16. 6 is a matching means that performs matching using the similarity data output by the similarity calculation means 5 and outputs the recognition result 7.

【００３３】[0033] The operation is described next, taking as an example discrete (isolated) word recognition in which DP matching is used in the matching means 6. The noise-superimposed input speech signal entering at input terminal 1 is acoustically analyzed by the acoustic analysis means 2 in analysis frames (for example, a frame period of 10 msec, a frame length of 25.6 msec, and a Hamming window) and converted into a noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} whose feature vectors are autocorrelation coefficient vectors. Here X(i) is the autocorrelation coefficient vector of the i-th frame, and I is the number of frames.
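The framing step can be illustrated with a short sketch. The frame period (10 msec), frame length (25.6 msec), and Hamming window come from the text; the sampling rate passed by the caller, the analysis order of 10, and the function name `autocorr_frames` are assumptions made only for this example.

```python
import math


def autocorr_frames(signal, sample_rate, frame_ms=25.6, shift_ms=10.0, order=10):
    """Cut a signal into Hamming-windowed frames and return one
    autocorrelation vector [r(0), ..., r(order)] per frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        x = [signal[start + n] * window[n] for n in range(frame_len)]
        # Short-time autocorrelation at lags 0..order.
        r = [sum(x[n] * x[n + tau] for n in range(frame_len - tau))
             for tau in range(order + 1)]
        frames.append(r)
    return frames
```

At a 10 kHz sampling rate the frame length is 256 samples and the shift is 100 samples, so a 1000-sample signal yields 8 frames.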

【００３４】[0034] The speech model memory 3 stores, as the speech model of category k (k = 1, 2, ..., K), a standard speech feature vector time series {Sk(j) | j = 1, 2, ..., Jk} consisting of the autocorrelation coefficient vectors obtained by applying, to a word speech signal of category k, acoustic analysis equivalent to that performed by the acoustic analysis means 2. The word speech signal must, however, have a higher SN ratio than the noise-superimposed input speech signals that the speech recognition apparatus of this invention is intended to handle.

【００３５】[0035] The noise model memory 8 stores one or more noise models, each representing the autocorrelation coefficient vector time series obtained by applying acoustic analysis equivalent to that of the acoustic analysis means 2 to a noise signal expected to be superimposed on the input speech. Each noise model represents a different type of noise, but a difference in absolute power alone does not distinguish one type of noise from another.

【００３６】[0036] As an example, consider a superposed noise signal whose power varies greatly from one acoustic analysis frame to the next but which, once power-normalized so that only the spectral shape is considered, shows one of N types of noise appearing at random in each acoustic analysis frame. In this case, the feature vectors {Zn | n = 1, 2, ..., N} of the N types of noise, which differ in spectrum, are each stored in the noise model memory 8 as a noise model.

【００３７】[0037] The linear prediction analysis means 9 performs the following processing on every standard speech feature vector Sk(j) of every speech model stored in the speech model memory 3.

【００３８】[0038] 1. Linear prediction parameters are obtained from the autocorrelation coefficient vector that constitutes the standard speech feature vector Sk(j), using, for example, the autocorrelation method.

【００３９】[0039] 2. Next, the maximum likelihood parameters Ak(j) = {akj(m) | m = 0, 1, ..., M}, which are the autocorrelation coefficients of the linear prediction parameters obtained in step 1, are computed and stored in the maximum likelihood parameter memory 10.

【００４０】[0040] 3. A product-sum operation between the normalized autocorrelation coefficient vector of the standard speech feature vector Sk(j) and the maximum likelihood parameters Ak(j) gives the standard speech residual power αkj, which is stored in the speech residual power memory 11. αkj is obtained from equation (1) below.

【００４１】[0041]

【数１】 [Equation 1]
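Steps 1 to 3 can be sketched as below. Equation (1) is not reproduced in this text, so the product-sum is written in the commonly used form a(0)r(0) + 2 Σ a(m)r(m); that form, the use of the Levinson-Durbin recursion for the autocorrelation method, and all function names are assumptions of this sketch rather than the patent's own definitions.

```python
def levinson(r, order):
    """Autocorrelation method: return the prediction-error filter
    [1, a1, ..., a_order] and the prediction error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def max_likelihood_params(a):
    """Step 2: autocorrelation of the filter coefficients."""
    m_max = len(a) - 1
    return [sum(a[n] * a[n + m] for n in range(m_max - m + 1))
            for m in range(m_max + 1)]


def residual_power(ml, r_norm):
    """Step 3: product-sum with a normalized autocorrelation vector
    (assumed two-sided form, hence the factor 2 for lags >= 1)."""
    return ml[0] * r_norm[0] + 2.0 * sum(
        ml[m] * r_norm[m] for m in range(1, len(ml)))


def analyse_model_frame(r, order):
    """Run steps 1 to 3 for one standard speech feature vector r."""
    a, _ = levinson(r, order)
    ml = max_likelihood_params(a)
    r_norm = [v / r[0] for v in r[:order + 1]]
    return ml, residual_power(ml, r_norm)
```

The same `residual_power` function, applied to the normalized autocorrelation of a noise vector or of an input frame, gives the quantities the text calls the noise residual power and the noise-superimposed input speech residual power.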

【００４２】[0042] The noise residual calculation means 12 performs a product-sum operation between the normalized autocorrelation coefficients of each noise feature vector {Zn | n = 1, 2, ..., N} of the noise models stored in the noise model memory 8 and every maximum likelihood parameter Ak(j) stored in the maximum likelihood parameter memory 10, obtains the noise residual powers βkj,n, and writes them to the noise residual power memory 13. βkj,n is obtained from equation (2) below.

【００４３】[0043]

【数２】 [Equation 2]

【００４４】[0044] The residual power calculation means 14 performs a product-sum operation between the normalized autocorrelation coefficient vector of each feature vector X(i) (i = 1, 2, ..., I) of the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2 and every maximum likelihood parameter Ak(j) stored in the maximum likelihood parameter memory 10, and obtains the noise-superimposed input speech residual powers γki,j. γki,j is obtained from equation (3).

【００４５】[0045]

【数３】 [Equation 3]

【００４６】[0046] The SN ratio calculating means 15 obtains the SN ratio Rki,j,n from equation (4), using the noise-superimposed input speech residual power γki,j output by the residual power calculation means 14, the standard speech residual power αkj stored in the speech residual power memory 11, and the noise residual power βkj,n stored in the noise residual power memory 13.

【００４７】[0047]

【数４】 [Equation 4]

【００４８】[0048] Equation (4) is derived as follows. Suppose that a speech signal φ(t) with no superimposed noise (t denotes time) follows an AR process as in equation (5).

【００４９】[0049]

【数５】 [Equation 5]

【００５０】[0050] The linear prediction coefficients ψm are then uniquely determined from the autocorrelation coefficients of the speech signal φ(t) by the autocorrelation method.

【００５１】[0051] The power of the output signal obtained when an arbitrary signal is fed to a filter with transfer characteristic 1/H(z) is given by a product-sum operation between the autocorrelation coefficients Ψτ (τ = 0, 1, ..., M) of the linear prediction coefficients ψm, which are called the maximum likelihood parameters, and the autocorrelation coefficients of the input signal. Taking the speech signal φ(t) above as the input, the power Pφ of the filter output signal, called the residual power, is given by equation (6), where Φτ (τ = 0, 1, ..., M) are the autocorrelation coefficients of the speech signal φ(t).

【００５２】[0052]

【数６】 [Equation 6]

【００５３】[0053] Next, consider a noise-superimposed signal ω(t) in which a noise signal ξ(t) is superimposed on the speech signal φ(t) as in equation (7).

【００５４】[0054]

【数７】 [Equation 7]

【００５５】[0055] If the speech signal φ(t) and the noise signal ξ(t) can be assumed to be uncorrelated, the autocorrelation coefficient Ωτ of the noise-superimposed signal ω(t) is given, as in equation (8), by the sum of the autocorrelation coefficient Φτ of φ(t) and the autocorrelation coefficient Ξτ of ξ(t).

【００５６】[0056]

【数８】 [Equation 8]

【００５７】[0057] The residual power Pω obtained when such a noise-superimposed signal ω(t) is fed to the above filter with transfer characteristic 1/H(z) is given by equation (9).

【００５８】[0058]

【数９】 [Equation 9]

【００５９】[0059] Substituting equation (8) for Ωτ in equation (9) gives equation (10).

【００６０】[0060]

【数１０】 [Equation 10]

【００６１】[0061] Combining equations (9) and (10) gives equation (11).

【００６２】[0062]

【数１１】 [Equation 11]

【００６３】[0063] In equation (11), let Ω'τ, Φ'τ, and Ξ'τ denote the normalized autocorrelation coefficients of Ωτ, Φτ, and Ξτ, respectively. Then equation (12) holds.

【００６４】[0064]

【数１２】 [Equation 12]

【００６５】[0065] Furthermore, setting τ = 0 in equation (8) gives equation (13).

【００６６】[0066]

【数１３】 [Equation 13]

【００６７】[0067] Substituting equation (13) for Ω0 in equation (12) and solving for the SN ratio Φ0/Ξ0 gives equation (14).

【００６８】[0068]

【数１４】 [Equation 14]

【００６９】[0069] That is, if the normalized autocorrelation coefficients Φ'τ of the noise-free speech signal φ(t) and the normalized autocorrelation coefficients Ξ'τ of the superimposed noise are known for a noise-superimposed speech signal ω(t), the SN ratio of ω(t) can be obtained from the normalized autocorrelation coefficients Ω'τ of ω(t) and the maximum likelihood parameters Ψτ of the speech signal φ(t).
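Equations (6) to (14) are not reproduced in this text. The derivation the prose describes can be reconstructed as follows; this is a sketch consistent with the surrounding definitions, not the original typesetting. The shorthand S(·) for the product-sum, the weights w_τ (which absorb any factor applied to lags τ ≥ 1), and the prime notation for normalized coefficients are assumptions.

```latex
\begin{align*}
S(X) &:= \sum_{\tau=0}^{M} w_\tau \, \Psi_\tau \, X_\tau
  && \text{(product-sum; } w_\tau \text{ assumed)} \\
P_\phi &= S(\Phi) && (6) \\
\omega(t) &= \phi(t) + \xi(t) && (7) \\
\Omega_\tau &= \Phi_\tau + \Xi_\tau && (8) \\
P_\omega &= S(\Omega) = S(\Phi) + S(\Xi) && (9)\text{--}(11) \\
\Omega_0 \, S(\Omega') &= \Phi_0 \, S(\Phi') + \Xi_0 \, S(\Xi') && (12) \\
\Omega_0 &= \Phi_0 + \Xi_0 && (13) \\
\frac{\Phi_0}{\Xi_0} &= \frac{S(\Xi') - S(\Omega')}{S(\Omega') - S(\Phi')} && (14)
\end{align*}
```

Identifying S(Φ') with the standard speech residual power α, S(Ξ') with the noise residual power β, and S(Ω') with the noise-superimposed input speech residual power γ, equation (14) reads R = (β − γ) / (γ − α), which is presumably the content of equation (4).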

【００７０】[0070] For a noise-superimposed input speech signal entering the speech recognition apparatus, neither the normalized autocorrelation coefficients of the speech signal before the noise was superimposed nor those of the superimposed noise signal are known. The SN ratio calculating means 15 therefore outputs, for each feature vector of the noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} output by the acoustic analysis means 2, the SN ratios obtained for every combination of a standard speech feature vector of the speech models stored in the speech model memory 3 and a noise feature vector of the noise models stored in the noise model memory 8.
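The exhaustive SN ratio computation can be sketched as follows. Equation (4) is not reproduced in this text; the form R = (β − γ) / (γ − α) used here is the one that follows from the derivation described above, and should be read as a reconstruction. The handling of degenerate cases and the function names are assumptions of this sketch.

```python
def sn_ratio(alpha, beta, gamma, eps=1e-12):
    """SN ratio (speech power / noise power) from the three residual
    powers: alpha (standard speech), beta (noise), gamma (input frame)."""
    den = gamma - alpha
    num = beta - gamma
    if den <= eps:
        return float("inf")   # assumed: frame looks like clean speech
    if num <= 0.0:
        return 0.0            # assumed: frame looks like noise only
    return num / den


def sn_ratios_for_frame(alpha_k, beta_k, gamma_ki):
    """For one speech model k and one input frame i, return R[j][n]
    for every model frame j and every noise type n.

    alpha_k[j], beta_k[j][n] and gamma_ki[j] are the residual powers
    computed with the maximum likelihood parameters of model frame j.
    """
    return [[sn_ratio(alpha_k[j], beta_k[j][n], gamma_ki[j])
             for n in range(len(beta_k[j]))]
            for j in range(len(alpha_k))]
```

As a check, with speech power 3 and noise power 1, α = 0.2 and β = 0.8 give γ = (3·0.2 + 1·0.8)/4 = 0.35, and the function returns a ratio of 3 up to rounding.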

【００７１】[0071] The feature vector synthesizing means 16 takes as input the SN ratios Rki,j,n (k = 1, 2, ..., K; j = 1, 2, ..., Jk; i = 1, 2, ..., I; n = 1, 2, ..., N) output by the SN ratio calculating means 15. It combines the standard speech feature vector Sk(j) of the speech model stored in the speech model memory 3 with the noise feature vector Zn of the noise model stored in the noise model memory 8 so that their power ratio equals the SN ratio Rki,j,n, and outputs the result as the noise-superimposed speech feature vector Yki,j,n.
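A minimal sketch of this synthesis step is given below. It assumes that the feature vectors are autocorrelation vectors whose element 0 is the power, that the SN ratio is a linear power ratio, and that autocorrelations of uncorrelated signals add. The function name `synthesize` and the treatment of an infinite SN ratio are illustrative assumptions.

```python
import math


def synthesize(speech_r, noise_r, snr):
    """Combine a standard speech autocorrelation vector with a noise
    autocorrelation vector so that speech power / noise power == snr."""
    if math.isinf(snr):
        return list(speech_r)          # assumed: no noise to add
    if snr <= 0.0:
        raise ValueError("snr must be positive")
    # Gain that brings the noise power to speech power / snr.
    gain = speech_r[0] / (snr * noise_r[0])
    return [s + gain * z for s, z in zip(speech_r, noise_r)]
```

With speech `[4, 2, 1]`, noise `[2, 0, 0]` and an SN ratio of 4, the noise gain is 0.5 and the result is `[5.0, 2.0, 1.0]`: speech power 4 against noise power 1.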

【００７２】[0072] For each feature vector of the noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} output by the acoustic analysis means 2, the similarity calculation means 5 uses the noise-superimposed speech feature vectors Yki,j,n (k = 1, 2, ..., K; j = 1, 2, ..., Jk; i = 1, 2, ..., I; n = 1, 2, ..., N) output by the feature vector synthesizing means 16 and obtains the similarity D1ki,j,n between X(i) and Yki,j,n (k = 1, 2, ..., K; j = 1, 2, ..., Jk; n = 1, 2, ..., N). One example of such a similarity is the reciprocal of the Euclidean distance between the LPC cepstrum vectors obtained by LPC analysis of the autocorrelation coefficients that serve as the feature vectors.
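The example similarity can be sketched as follows. The Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion are used here as one way to obtain LPC cepstra from autocorrelation vectors; the small constant `eps`, which avoids division by zero when the two cepstra coincide, and the function names are assumptions.

```python
import math


def levinson(r, order):
    """Return the prediction-error filter [1, a1, ..., a_order]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a


def lpc_cepstrum(r, order):
    """Cepstrum c1..c_order of the all-pole model 1/A(z)."""
    a = levinson(r, order)
    c = [0.0] * (order + 1)
    for n in range(1, order + 1):
        c[n] = -a[n] - sum(k * c[k] * a[n - k] for k in range(1, n)) / n
    return c[1:]


def similarity(r_x, r_y, order, eps=1e-9):
    """Reciprocal of the Euclidean distance between LPC cepstra."""
    cx = lpc_cepstrum(r_x, order)
    cy = lpc_cepstrum(r_y, order)
    dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(cx, cy)))
    return 1.0 / (dist + eps)
```

Because the cepstral coefficients from c1 upward do not depend on the overall gain, two autocorrelation vectors that differ only in power receive the maximum similarity under this measure.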

【００７３】[0073] Using the similarity data D1ki,j,n (k = 1, 2, ..., K; j = 1, 2, ..., Jk; i = 1, 2, ..., I; n = 1, 2, ..., N) output by the similarity calculation means 5, the matching means 6 matches the speech model of category k against the noise-superimposed input speech under the constraints of the noise model, and outputs the category of the speech model with the highest similarity as the recognition result 7.

【００７４】[0074] In this case the only constraint imposed by the noise model is that n takes one of the values 1, 2, ..., N at random in each acoustic analysis frame. Matching is therefore performed by DP matching in which the highest similarity among {D1ki,j,n | n = 1, 2, ..., N} is taken as the similarity between the i-th frame of the noise-superimposed input speech feature vector time series and the j-th frame of the speech model of category k.
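The matching step can be sketched as below. The text fixes only two things: the local similarity is the maximum over the noise index n, and DP matching maximizes the accumulated similarity. The step pattern used here (each input frame advances the model frame by 0, 1, or 2), the end-point constraints, and the function names are assumptions chosen so that every path has the same number of terms.

```python
def dp_score(sim):
    """sim[i][j][n] is the similarity of input frame i to model frame j
    under noise type n. Returns the best accumulated similarity over
    monotonic paths from (0, 0) to (I-1, J-1)."""
    num_i, num_j = len(sim), len(sim[0])
    # Noise-model constraint: any n may occur in any frame, so take the max.
    local = [[max(sim[i][j]) for j in range(num_j)] for i in range(num_i)]
    neg = float("-inf")
    prev = [neg] * num_j
    prev[0] = local[0][0]
    for i in range(1, num_i):
        cur = [neg] * num_j
        for j in range(num_j):
            best = max(prev[max(0, j - 2):j + 1])
            if best > neg:
                cur[j] = best + local[i][j]
        prev = cur
    return prev[num_j - 1]


def recognize(models):
    """models maps a category name to its sim[i][j][n] table."""
    return max(models, key=lambda name: dp_score(models[name]))
```

Since each input frame contributes exactly one term, scores of different categories are directly comparable for the same input.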

【００７５】[0075] As a result, along the matching path each feature vector of the noise-superimposed input speech feature vector time series is associated with a standard speech feature vector and a noise vector under the condition that the similarity to the speech model is maximized. This is equivalent to matching the input against the speech model of the correct category after superimposing on that model noise equal to the noise contained in the noise-superimposed input speech, at an equal SN ratio. Correct recognition is therefore possible even for noise-superimposed input speech on which non-stationary noise is superimposed and whose SN ratio varies widely.

【００７６】[0076] The embodiment of the invention of claim 1 has been described above for the case of DP matching, but the matching method is not limited to DP matching; a recognition method based on HMMs, for example, may also be used.

【００７７】[0077] In this case, the speech model memory 3 stores HMMs representing the speech of each category as the speech models, and the speech feature vectors that have an output probability in each state (or each transition) of an HMM serve as the standard speech feature vectors of the description above. Specifically, in a continuous-density HMM the standard speech feature vectors are the one or more mean feature vectors used in the output probability calculation for each state (or each transition); in a discrete HMM, which applies codebook-based vector quantization to the speech feature vectors, they are the feature vectors of the one or more code labels that have an output probability in each state (or each transition). As noted in the description of the embodiment above, combining a standard speech feature vector with a noise feature vector in the feature vector synthesizing means 16 requires the power information of the standard speech feature vector, so an HMM that handles the output probabilities of speech feature vectors including power information is used.

【００７８】[0078] The noise models stored in the noise model memory 8 may likewise be HMMs representing noise. In that case the noise feature vectors are, as for the speech models, the noise feature vectors that have an output probability in each state (or each transition) of the HMM. A single large noise model may also be used by assigning transition probabilities between the noise models.

[0079] The similarity calculation means 5 calculates, for each state (or each transition) of the HMMs, the probability that each feature vector of the noise-superimposed input speech feature vector time series is output, and outputs these probabilities as similarity data. Using the similarity data output by the similarity calculation means 5, the matching means 6 matches the noise-superimposed input speech against the HMM of each category under the constraint of the noise model, and outputs the category of the HMM giving the highest similarity as the recognition result.

[0080] The operation has been described above taking word recognition as an example; however, the embodiment according to claim 1 of the present invention does not limit the recognition target to words, and other units of utterance may be used.

[0081] The similarity calculation means may also use a similarity based on any acoustic parameter obtainable from the autocorrelation coefficients that form the feature vector, for example LSP parameters, LPC mel-cepstrum coefficients, or the vocal tract area function, or a similarity based on any distance measure using parameters likewise obtained from the autocorrelation coefficients, for example the Euclidean distance between LPC mel-cepstrum coefficients, the WLR distance, the WGD distance measure, the group delay spectrum distance, or the Euclidean distance between weighted cepstra. The Chebyshev distance or the like may be used in place of these Euclidean distances.

[0082] In addition, the feature vector obtained by acoustic analysis need not be limited to autocorrelation coefficients alone; a feature vector augmented with other acoustic parameters may be used, and matching may be performed with the resulting similarity.

[0083] Embodiment 2. FIG. 2 is a block diagram showing the configuration of an embodiment of the speech recognition apparatus according to the invention of claim 2. In the figure, 1 is an input terminal, 2 is acoustic analysis means, 3 is a speech model memory, 5 is similarity calculation means, 6 is matching means, 7 is a recognition result, 8 is a noise model memory, 9 is linear prediction analysis means, 10 is a maximum likelihood parameter memory, 11 is a speech residual power memory, 12 is noise residual calculation means, 13 is a noise residual power memory, 14 is residual power calculation means, 15 is SN ratio calculation means, and 16 is feature vector synthesizing means. These are the same as the components bearing the same reference numerals in FIG. 1, so their detailed description is omitted.

[0084] Further, 17 is optimum matching path determining means which takes the similarity data output by the similarity calculation means 5 as input and finds the optimum matching path that maximizes the similarity between the noise-superimposed input speech and a speech model; 18 is superimposed noise generation means which generates superimposed noise feature vectors using the SN ratio output by the SN ratio calculation means, the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2, and the noise feature vectors of the noise model stored in the noise model memory 8; and 19 is superimposed noise determining means which obtains an input noise feature vector time series from the superimposed noise feature vectors output by the superimposed noise generation means 18, in accordance with the matching path data output by the optimum matching path determining means 17.

[0085] 20 is power ratio determining means which obtains the power ratio between the noise-superimposed input speech and a speech model using the standard speech feature vectors of the speech models stored in the speech model memory 3, the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2, the SN ratio output by the SN ratio calculation means 15, and the matching path data output by the optimum matching path determining means 17. 21 is noise-adaptive similarity calculation means which, using the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2, the input noise feature vector time series output by the superimposed noise determining means 19, the speech power ratio output by the power ratio determining means 20, and the standard speech feature vectors of the speech models stored in the speech model memory 3, calculates for each feature vector of the noise-superimposed input speech feature vector time series a noise-adapted similarity to the standard speech feature vectors.

[0086] The operation is described next, first taking as an example discrete word recognition in which DP matching is adopted in the matching means 6 and the optimum matching path determining means 17. The contents of the speech model memory 3 and the noise model memory 8, and the operation from the input of the noise-superimposed input speech signal at the input terminal 1 through the similarity calculation means 5, are the same as in Embodiment 1 above, so their description is omitted.

[0087] For each feature vector X(i) of the noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} output by the acoustic analysis means 2, the superimposed noise generation means 18 obtains the superimposed noise power η^k_{i,j,n} (k = 1, 2, ..., K; j = 1, 2, ..., J_k; n = 1, 2, ..., N) as in equation (15), using the SN ratio R^k_{i,j,n} (k = 1, 2, ..., K; j = 1, 2, ..., J_k; i = 1, 2, ..., I; n = 1, 2, ..., N) output by the SN ratio calculation means 15. Here the power of X(i) is taken to be the zeroth-order element of the autocorrelation coefficient vector, denoted x_0(i).

[0088]

[Equation 15]

[0089] Next, superimposed noise feature vectors U^k_{i,j,n} are generated by keeping the spectral shape of each noise feature vector {Z_n | n = 1, 2, ..., N} of the noise model stored in the noise model memory 8 and matching only its power to the superimposed noise power η^k_{i,j,n}. That is, each element of U^k_{i,j,n} is the corresponding element of the normalized autocorrelation coefficient vector of the noise feature vector Z_n multiplied by the superimposed noise power η^k_{i,j,n}.

[0090] The superimposed noise feature vector U^k_{i,j,n} obtained in this way has, for the feature vector X(i) in the noise-superimposed input speech feature vector time series, the power implied by the SN ratio R^k_{i,j,n} obtained using the standard speech feature vector S_k(j) and the noise feature vector Z_n, and the spectral shape of the noise feature vector Z_n.
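As an illustration of paragraphs [0087] to [0090], the following Python sketch computes a superimposed noise power and rescales a noise autocorrelation vector to it. Equation (15) is not reproduced in this text, so the sketch assumes that the SN ratio is a linear power ratio (speech power divided by noise power) and that the frame power x_0(i) is the sum of speech power and noise power, which gives η = x_0(i) / (1 + R). The function names are hypothetical and are not taken from the patent.

```python
def superimposed_noise_power(x0, snr):
    """Noise power in a frame of total power x0.

    Assumption (not from the patent text): x0 = speech power + noise power,
    and snr = speech power / noise power as a linear ratio, not in dB.
    """
    if snr < 0:
        raise ValueError("linear SN ratio must be non-negative")
    return x0 / (1.0 + snr)


def scale_noise_vector(z, eta):
    """Keep the spectral shape of the noise autocorrelation vector z and
    set its power (zeroth-order element) to eta."""
    if z[0] <= 0:
        raise ValueError("zeroth-order autocorrelation must be positive")
    normalized = [c / z[0] for c in z]  # normalized autocorrelation, first element is 1.0
    return [eta * c for c in normalized]


# Example: a frame of power 10.0 at a linear SN ratio of 4.0.
eta = superimposed_noise_power(10.0, 4.0)
u = scale_noise_vector([4.0, 2.0, -1.0], eta)
```

Dividing by the zeroth-order element before multiplying by η is what keeps the spectral shape of Z_n while changing only its power.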

[0091] Using the similarity data D1^k_{i,j,n} (k = 1, 2, ..., K; j = 1, 2, ..., J_k; i = 1, 2, ..., I; n = 1, 2, ..., N) output by the similarity calculation means 5, the optimum matching path determining means 17 matches the speech model of category k against the noise-superimposed input speech under the constraint of the noise model, and finds the optimum matching path that maximizes the similarity to each speech model. The matching process itself is the same as that in the matching means 6 of Embodiment 1, so its detailed description is omitted.

[0092] Here, the optimum matching path obtained by matching the noise-superimposed input speech against the speech model of category k is expressed as three functions f_k(L), g_k(L), and h_k(L), each taking a unique value for the variable L = 1, 2, ..., L_k. f_k(L) gives i, g_k(L) gives j, and h_k(L) gives n, and they satisfy equation (16) for L = 1, 2, ..., L_k.

[0093]

[Equation 16]

[0094] The superimposed noise determining means 19 takes as input the superimposed noise feature vectors U^k_{i,j,n} (k = 1, 2, ..., K; j = 1, 2, ..., J_k; i = 1, 2, ..., I; n = 1, 2, ..., N) output by the superimposed noise generation means 18 and the matching path data output by the optimum matching path determining means 17. For the superimposed noise feature vectors on the matching path that maximizes the similarity between the noise-superimposed input speech and the speech model of category k, it obtains the average feature vector of those superimposed noise feature vectors that share the same k and i, and takes this as the input noise feature vector V_k(i).

[0095] That is, for a given category k (k = 1, 2, ..., K) of the speech models, with L = 1, 2, ..., L_k, the average feature vector of the superimposed noise feature vectors U^k_{f_k(L),g_k(L),h_k(L)} that share the same f_k(L) is obtained and taken as the input noise feature vector V_k(f_k(L)). This yields the input noise feature vector time series {V_k(i) | i = 1, 2, ..., I} (k = 1, 2, ..., K).
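The averaging described in paragraphs [0094] and [0095] can be sketched as follows for a single category k. The data layout is an assumption made for illustration: the matching path is a list of (i, j, n) tuples, that is (f_k(L), g_k(L), h_k(L)) for L = 1, ..., L_k, and the superimposed noise feature vectors are held in a dictionary keyed by the same tuples. The function name is hypothetical.

```python
from collections import defaultdict


def input_noise_vectors(path, u):
    """Average the superimposed noise vectors on the path per input frame.

    path: list of (i, j, n) tuples along the optimum matching path.
    u:    dict mapping (i, j, n) to a superimposed noise feature vector.
    Returns a dict mapping i to V_k(i), the element-wise mean over all
    path points that share the same input frame index i.
    """
    groups = defaultdict(list)
    for (i, j, n) in path:
        groups[i].append(u[(i, j, n)])
    return {
        i: [sum(column) / len(vectors) for column in zip(*vectors)]
        for i, vectors in groups.items()
    }


# Example: frame 1 is visited twice on the path, frame 2 once.
path = [(1, 1, 1), (1, 2, 1), (2, 2, 2)]
u = {(1, 1, 1): [2.0, 1.0], (1, 2, 1): [4.0, 3.0], (2, 2, 2): [1.0, 0.5]}
v = input_noise_vectors(path, u)
```

A frame that the path visits only once simply keeps its single superimposed noise vector.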

[0096] As described above, the superimposed noise determining means obtains an input noise feature vector for each feature vector of the noise-superimposed input speech feature vector time series, in accordance with the matching path obtained by the method of claim 1 for matching the noise-superimposed input speech against the speech model of category k.

[0097] The power ratio determining means 20 takes as input the SN ratio output by the SN ratio calculation means 15, the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2, the standard speech feature vectors of the speech models stored in the speech model memory 3, and the matching path data output by the optimum matching path determining means 17. For the partial sections of the matching path that maximizes the similarity between the noise-superimposed input speech and the speech model of category k in which the SN ratio exceeds a threshold R_t, it obtains the average power of the corresponding standard speech feature vectors of the speech model. Next, for each feature vector in the noise-superimposed input speech feature vector time series corresponding to the same partial sections of the matching path, it obtains the power of the speech signal in that feature vector by a calculation with the SN ratio on the matching path, and takes the ratio of the average of these values to the average power obtained from the standard speech feature vectors of the speech model as the speech power ratio.

[0098] That is, in the matching path data, for a given category k (k = 1, 2, ..., K) of the speech models, with L = 1, 2, ..., L_k, the speech model power is obtained by averaging the power of the standard speech feature vector S_k(g_k(L)) over those L for which the SN ratio R^k_{f_k(L),g_k(L),h_k(L)} exceeds the threshold R_t. Then, for the same L, the input speech power is obtained by averaging the speech power ζ_k(f_k(L)) given by equation (17) from the noise-superimposed input speech feature vector X(f_k(L)) and the SN ratio R^k_{f_k(L),g_k(L),h_k(L)}.

[0099]

[Equation 17]

[0100] The input speech power divided by the speech model power is output as the speech power ratio ε_k for the matching of the noise-superimposed input speech against the speech model of category k. As described above, the power ratio determining means obtains the speech power ratio between the input speech and the speech model in accordance with the matching path obtained by the method of claim 1 for matching the noise-superimposed input speech against the speech model of category k.
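The computation of the speech power ratio in paragraphs [0097] to [0100] can be sketched as below for one category k. Equation (17) is not reproduced in this text, so the sketch assumes a linear SN ratio R and a frame power x_0 equal to speech power plus noise power, which gives a speech power of x_0 · R / (1 + R). The data layout (dictionaries keyed by path point, frame index, and model frame index) and the function name are hypothetical.

```python
def speech_power_ratio(path, snr, x0, s0, threshold):
    """Ratio of input speech power to speech model power along the path.

    path:      list of (i, j, n) tuples on the optimum matching path.
    snr:       dict mapping (i, j, n) to a linear SN ratio.
    x0:        dict mapping i to the power of input frame X(i).
    s0:        dict mapping j to the power of standard vector S_k(j).
    threshold: SN ratio threshold R_t; only path points above it are used.
    """
    selected = [p for p in path if snr[p] > threshold]
    if not selected:
        raise ValueError("no path point exceeds the SN ratio threshold")
    model_power = sum(s0[j] for (_, j, _) in selected) / len(selected)
    # Assumed form of the speech power: x0 * R / (1 + R).
    input_power = sum(
        x0[i] * snr[(i, j, n)] / (1.0 + snr[(i, j, n)])
        for (i, j, n) in selected
    ) / len(selected)
    return input_power / model_power


# Example: the first path point falls below the threshold and is ignored.
path = [(1, 1, 1), (2, 2, 1), (3, 2, 1)]
snr = {(1, 1, 1): 0.5, (2, 2, 1): 3.0, (3, 2, 1): 1.0}
eps = speech_power_ratio(path, snr, {1: 1.0, 2: 8.0, 3: 4.0}, {1: 5.0, 2: 2.0}, 0.75)
```

Restricting the averages to path points above the threshold keeps frames dominated by noise from distorting the estimate of the speech level.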

[0101] The noise-adaptive similarity calculation means 21 first applies a power correction to the standard speech feature vectors {S_k(j) | j = 1, 2, ..., J_k} of the speech model of category k (k = 1, 2, ..., K) stored in the speech model memory 3, using the speech power ratio ε_k output by the power ratio determining means 20, so that the speech model power of the speech model of category k matches the input speech power of the noise-superimposed input speech, and thereby obtains the power-normalized standard speech feature vectors {T_k(j) | j = 1, 2, ..., J_k}. The power correction is performed by multiplying each element of the autocorrelation coefficient vector S_k(j) of the standard speech feature vector by the speech power ratio ε_k.

[0102] Next, for each feature vector of the noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} output by the acoustic analysis means 2, the noise-adapted similarity D2_k(i, j) to the power-normalized standard speech feature vector T_k(j) is obtained as in equation (18), using the input noise feature vector time series {V_k(i) | i = 1, 2, ..., I} output by the superimposed noise determining means 19.

[0103]

[Equation 18]

[0104] In the equation, d(*, *) is a similarity defined between the two autocorrelation coefficient vectors in the parentheses, for example the reciprocal of the Euclidean distance between the LPC cepstrum vectors obtained by LPC analysis of the respective autocorrelation coefficients. The sum of T_k(j) and V_k(i) in the equation denotes the vector formed by adding the corresponding elements of the two feature vectors.
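The following Python sketch illustrates the noise-adapted similarity of paragraphs [0101] to [0104]. Equation (18) is not reproduced in this text; from the description, the sketch assumes the form D2_k(i, j) = d(X(i), T_k(j) + V_k(i)) with T_k(j) = ε_k · S_k(j). The similarity d is implemented as the reciprocal of the Euclidean distance between LPC cepstra, the example named in the text, using the standard Levinson-Durbin recursion and the usual LPC-to-cepstrum recursion. The cepstral order (equal to the LPC order), the handling of a zero distance, and all function names are choices made for this sketch only.

```python
import math


def levinson_durbin(r):
    """Predictor coefficients a[1..p], with x[t] ~ sum_k a[k] * x[t-k],
    from the autocorrelation coefficients r[0..p]."""
    p = len(r) - 1
    a = [0.0] * (p + 1)
    err = r[0]
    for m in range(1, p + 1):
        if err <= 0.0:
            raise ValueError("autocorrelation vector is not usable for LPC analysis")
        refl = (r[m] - sum(a[k] * r[m - k] for k in range(1, m))) / err
        prev = a[:]
        a[m] = refl
        for k in range(1, m):
            a[k] = prev[k] - refl * prev[m - k]
        err *= 1.0 - refl * refl
    return a[1:]


def lpc_cepstrum(r):
    """LPC cepstrum c[1..p] of the all-pole model fitted to r (gain term omitted)."""
    a = [0.0] + levinson_durbin(r)
    p = len(a) - 1
    c = [0.0] * (p + 1)
    for n in range(1, p + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(1, n))
    return c[1:]


def similarity(r1, r2):
    """d(*, *): reciprocal of the Euclidean distance between LPC cepstra."""
    dist = math.dist(lpc_cepstrum(r1), lpc_cepstrum(r2))
    return math.inf if dist == 0.0 else 1.0 / dist


def noise_adapted_similarity(x, s, v, eps):
    """Assumed form of equation (18): d(X(i), eps * S_k(j) + V_k(i))."""
    t = [eps * c for c in s]                      # power-normalized T_k(j)
    combined = [tc + vc for tc, vc in zip(t, v)]  # element-wise sum T_k(j) + V_k(i)
    return similarity(x, combined)


# Example: the input frame equals the scaled model vector plus white noise.
d_match = noise_adapted_similarity([2.0, 0.5, 0.25], [2.0, 1.0, 0.5], [1.0, 0.0, 0.0], 0.5)
d_other = noise_adapted_similarity([1.0, 0.5, 0.25], [2.0, 1.0, 0.5], [1.0, 0.0, 0.0], 0.5)
```

Because this cepstral distance omits the gain term, ε_k influences the result only through the proportion of speech to noise in the sum T_k(j) + V_k(i), which is why the power correction precedes the addition.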

[0105] The similarity D2_k(i, j) is thus a similarity calculation that adapts the power-normalized standard speech feature vector T_k(j) to noise by means of the input superimposed noise feature vector V_k(i) for the noise-superimposed input speech feature vector X(i), the vector V_k(i) having been obtained by the method of claim 1 for matching the noise-superimposed input speech against the speech model of category k.

[0106] Using the noise-adapted similarity data D2_k(i, j) (i = 1, 2, ..., I; j = 1, 2, ..., J_k; k = 1, 2, ..., K), the matching means 6 matches the noise-superimposed input speech against the speech model of category k and outputs the category of the speech model giving the highest similarity as the recognition result 7. The embodiment according to the invention of claim 2 has been described above taking matching by DP matching as an example; however, as in Embodiment 1, the scheme for matching the noise-superimposed input speech against the speech models in the optimum matching path determining means 17 and the matching means 6 is not limited to DP matching, and an HMM-based recognition method may be used, for example. The similarity calculation means 5 in that case is the same as in Embodiment 1, so its description is omitted.

[0107] The optimum matching path determining means 17 uses the similarity data output by the similarity calculation means 5 to output the Viterbi matching path between the noise-superimposed input speech and the HMM of each category under the constraint of the noise model. In this case, the matching path data for the speech model and the noise model are not functions that specify, for the variable L, a state (or transition) of the HMM; instead they specify the standard speech (or noise) feature vector that maximizes the similarity to the noise-superimposed input speech in each state (or each transition).

[0108] This is because, when discrete HMMs or mixture continuous-density HMMs are used for the speech models and the noise model, each state (or each transition) of an HMM has a plurality of standard speech (or noise) feature vectors. The noise-adaptive similarity calculation means 21 calculates, for each state (or each transition) of the HMMs, the probability that each feature vector of the noise-superimposed input speech feature vector time series is output, and outputs it as similarity data. The matching means 6 uses the similarity data output by the similarity calculation means 5 to match the noise-superimposed input speech against the HMM of each category, and outputs the category of the HMM giving the highest similarity as the recognition result. The matching scheme used here is not limited to Viterbi.

[0109] The operation has been described above taking word recognition as an example; however, as in Embodiment 1, the embodiment according to claim 2 of the present invention does not limit the recognition target to words, and other units of utterance may be used. Also as in Embodiment 1, the similarity calculation means may use a similarity based on any acoustic parameter obtainable from the autocorrelation coefficients that form the feature vector, for example LSP parameters, LPC mel-cepstrum coefficients, or the vocal tract area function, or a similarity based on any distance measure using parameters likewise obtained from the autocorrelation coefficients, for example the Euclidean distance between LPC mel-cepstrum coefficients, the WLR distance, the WGD distance measure, the group delay spectrum distance, or the Euclidean distance between weighted cepstra. The Chebyshev distance or the like may be used in place of these Euclidean distances. In addition, the feature vector obtained by acoustic analysis need not be limited to autocorrelation coefficients alone; a feature vector augmented with other acoustic parameters may be used, and matching may be performed with the resulting similarity.

[0110] Embodiment 3. FIG. 3 is a block diagram showing the configuration of an embodiment of the speech recognition apparatus according to the invention of claim 3. In the figure, 1 is an input terminal, 2 is acoustic analysis means, 3 is a speech model memory, 5 is similarity calculation means, 6 is matching means, 7 is a recognition result, 8 is a noise model memory, 9 is linear prediction analysis means, 10 is a maximum likelihood parameter memory, 11 is a speech residual power memory, 12 is noise residual calculation means, 13 is a noise residual power memory, 14 is residual power calculation means, 15 is SN ratio calculation means, 16 is feature vector synthesizing means, 17 is optimum matching path determining means, 18 is superimposed noise generation means, and 19 is superimposed noise determining means. These are the same as the components bearing the same reference numerals in FIG. 2, so their detailed description is omitted.

[0111] Further, 22 is noise removal similarity calculation means which takes as input the noise-superimposed input speech feature vectors output by the acoustic analysis means 2, the input noise feature vector time series output by the superimposed noise determining means 19, and the standard speech feature vectors of the speech models stored in the speech model memory 3, applies noise removal using the input noise feature vector time series to the noise-superimposed input speech feature vectors, and then obtains their similarity to the standard speech feature vectors.

[0112] The operation is described next, first taking as an example discrete word recognition in which DP matching is adopted in the matching means 6 and the optimum matching path determining means 17. The contents of the speech model memory 3 and the noise model memory 8, and the operation from the input of the noise-superimposed input speech signal at the input terminal 1 through the superimposed noise determining means 19, are the same as in Embodiment 2 above, so their description is omitted.

[0113] The noise removal similarity calculation means 22 applies noise removal using the input noise feature vector time series {V_k(i) | i = 1, 2, ..., I} output by the superimposed noise determining means 19 to each feature vector of the noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} output by the acoustic analysis means 2, and then obtains the similarity to the standard speech feature vector S_k(j) of the speech model stored in the speech model memory 3 as in equation (19).

[0114]

[Equation 19]

[0115] In the equation, d(*, *) is a similarity defined between the two autocorrelation coefficient vectors in the parentheses, for example the reciprocal of the Euclidean distance between the LPC cepstrum vectors obtained by LPC analysis of the respective autocorrelation coefficients. The subtraction of V_k(i) from X(i) in the equation denotes vector subtraction, in which each element of V_k(i) is subtracted from the corresponding element of X(i).

[0116] This is a similarity calculation with a noise removal function for the noise-superimposed input speech, using the input superimposed noise feature vector V_k(i) obtained by the method of claim 1 for matching the noise-superimposed input speech against the speech model of category k. The noise removal similarity calculation means outputs the noise-removed similarity D3_k(i, j) for i = 1, 2, ..., I, j = 1, 2, ..., J_k, and k = 1, 2, ..., K.
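The noise-removed similarity of paragraphs [0113] to [0116] can be sketched in Python as below. Equation (19) is not reproduced in this text; from the description, the sketch assumes the form D3_k(i, j) = d(X(i) - V_k(i), S_k(j)). The similarity d is passed in as a function. The stand-in used in the example compares normalized autocorrelation vectors directly and is not the LPC cepstrum similarity named in the text; it is used here only to keep the example short. All function names are hypothetical.

```python
import math


def noise_removed_similarity(x, v, s, d):
    """Assumed form of equation (19): d(X(i) - V_k(i), S_k(j)).

    x, v, s: autocorrelation coefficient vectors of equal length.
    d:       similarity function taking two autocorrelation vectors.
    """
    cleaned = [xc - vc for xc, vc in zip(x, v)]  # element-wise X(i) - V_k(i)
    if cleaned[0] <= 0.0:
        # Subtraction can leave a non-positive power, which is not a
        # valid autocorrelation vector.
        raise ValueError("noise removal left a non-positive power")
    return d(cleaned, s)


def normalized_autocorr_similarity(r1, r2):
    """Stand-in for d(*, *): reciprocal of the Euclidean distance between
    the two autocorrelation vectors after normalizing each by its power."""
    n1 = [c / r1[0] for c in r1]
    n2 = [c / r2[0] for c in r2]
    dist = math.dist(n1, n2)
    return math.inf if dist == 0.0 else 1.0 / dist


# Example: removing white noise from the input recovers the model's shape.
d3_match = noise_removed_similarity(
    [3.0, 1.0, 0.5], [1.0, 0.0, 0.0], [4.0, 2.0, 1.0], normalized_autocorr_similarity
)
d3_other = noise_removed_similarity(
    [3.0, 1.0, 0.5], [1.0, 0.0, 0.0], [4.0, 0.0, 0.0], normalized_autocorr_similarity
)
```

The explicit check on the zeroth-order element reflects the fact that element-wise subtraction does not guarantee a valid autocorrelation vector.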

[0117] The matching means 6 takes as input the noise-removed similarity D3_k(i, j) output by the noise removal similarity calculation means 22 and, like the matching means 6 in Embodiment 2, matches the noise-superimposed input speech against the speech model of category k and outputs the category of the speech model giving the highest similarity as the recognition result 7.

[0118] The embodiment according to the invention of claim 3 has been described above taking matching by DP matching as an example; however, as in Embodiment 2, the scheme for matching the noise-superimposed input speech against the speech models in the optimum matching path determining means 17 and the matching means 6 is not limited to DP matching, and an HMM-based recognition method may be used, for example. The speech model memory 3, the noise model memory 8, the similarity calculation means 5, and the optimum matching path determining means 17 in that case are the same as in Embodiment 2, so their description is omitted.

[0119] The noise removal similarity calculation means 22 calculates, for each state (or each transition) of the HMMs, the probability that each feature vector of the noise-superimposed input speech feature vector time series, after noise removal using the input noise feature vector time series, is output, and outputs it as similarity data. The matching means 6 uses the similarity data output by the similarity calculation means 5 to match the noise-superimposed input speech against the HMM of each category, and outputs the category of the HMM giving the highest similarity as the recognition result. The matching scheme used here is not limited to Viterbi.

[0120] The operation has been described above taking word recognition as an example; however, as in Embodiment 2, the embodiment according to claim 3 of the present invention does not limit the recognition target to words, and other units of utterance may be used.

[0121] Also as in Embodiment 2, the similarity calculation means may use a similarity based on any acoustic parameter obtainable from the autocorrelation coefficients that form the feature vector, for example LSP parameters, LPC mel-cepstrum coefficients, or the vocal tract area function, or a similarity based on any distance measure using parameters likewise obtained from the autocorrelation coefficients, for example the Euclidean distance between LPC mel-cepstrum coefficients, the WLR distance, the WGD distance measure, the group delay spectrum distance, or the Euclidean distance between weighted cepstra. The Chebyshev distance or the like may be used in place of these Euclidean distances.

【０１２２】加えて、音響分析によるところの特徴ベク
トルを自己相関係数のみに限定することなく、他の音響
パラメータを付与した特徴ベクトルを用い、これによる
類似度により照合を行ってもかまわない。とくに、雑音
除去類似度演算手段２２における自己相関係数上での雑
音除去処理は、雑音除去後の自己相関係数ベクトルが非
現実的な値をとりＬＰＣ分析が行えなくなる場合があ
り、これを避けるため、音響分析にＤＦＴによるスペク
トル分析を加えパワースペクトルを特徴ベクトルに含め
ることで、雑音除去類似度演算手段２２における雑音除
去処理を、パワースペクトル上で行い、雑音除去後のパ
ワースペクトルがマイナスの値をとった周波数について
は０で置き換えた後、このパワースペクトルに対し逆Ｄ
ＦＴ演算を行うことで導出された自己相関係数を特徴ベ
クトルとして用いることができる。また、雑音除去にお
ける問題が回避できる他の特徴ベクトル、例えばフィル
タバンクの出力などを用いてもかまわない。In addition, the feature vector obtained by acoustic analysis need not be limited to autocorrelation coefficients alone; a feature vector augmented with other acoustic parameters may be used, and matching may be performed using the similarity based on it. In particular, the noise removal processing that the noise removal similarity calculation means 22 performs on the autocorrelation coefficients can leave the autocorrelation coefficient vector with unrealistic values after noise removal, making LPC analysis impossible. To avoid this, DFT-based spectrum analysis can be added to the acoustic analysis and the power spectrum included in the feature vector. The noise removal processing in the noise removal similarity calculation means 22 is then performed on the power spectrum; at frequencies where the power spectrum becomes negative after noise removal, the value is replaced with 0, and the autocorrelation coefficients derived by applying an inverse DFT to this power spectrum can be used as the feature vector. Other feature vectors that avoid this noise removal problem, for example filter bank outputs, may also be used.
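
The power-spectrum variant described in this paragraph can be sketched as follows. This is a minimal illustration in Python with NumPy under stated assumptions, not the device's actual implementation: the function name `denoised_autocorrelation`, the zero-padded FFT length, and the premise that a noise power spectrum estimate `noise_power` is already available are all hypothetical.

```python
import numpy as np

def denoised_autocorrelation(frame, noise_power, order):
    """Subtract a noise power spectrum, floor at 0, return autocorrelation lags 0..order.

    noise_power is an assumed, externally supplied estimate with len(frame) + 1 bins.
    """
    frame = np.asarray(frame, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    n_fft = 2 * len(frame)                    # zero-pad so lags are linear, not circular
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2   # power spectrum of the noisy frame
    cleaned = power - noise_power             # noise removal on the power spectrum
    cleaned = np.maximum(cleaned, 0.0)        # negative frequencies bins are replaced with 0
    autocorr = np.fft.irfft(cleaned, n_fft)   # inverse DFT of a power spectrum
    return autocorr[: order + 1]
```

Because the floored spectrum is non-negative everywhere, the lags returned here form a valid (positive semidefinite) autocorrelation sequence, which is the property that subtraction performed directly on the autocorrelation coefficients can lose.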

【０１２３】実施例４．図４は、請求項４の発明に係る
音声認識装置の一実施例の構成を示すブロック図であ
る。図において、１は入力端、２は音響分析手段、３は
音声モデルメモリ、５は類似度演算手段、６は照合手
段、７は認識結果、８は雑音モデルメモリ、９は線形予
測分析手段、１０は最尤パラメータメモリ、１１は音声
残差パワーメモリ、１２、雑音残差演算手段、１３は雑
音残差パワーメモリ、１４は残差パワー演算手段、１５
はＳＮ比演算手段、１６は特徴ベクトル合成手段、１７
は最適照合パス決定手段、１８は重畳雑音生成手段、２
０はパワー比決定手段であり、図２に同一符号を付した
構成要素と同一部分であるため詳細な説明は省略する。Embodiment 4. FIG. 4 is a block diagram showing the configuration of one embodiment of the speech recognition device according to the invention of claim 4. In the figure, 1 is an input end, 2 is acoustic analysis means, 3 is a speech model memory, 5 is similarity calculation means, 6 is matching means, 7 is a recognition result, 8 is a noise model memory, 9 is linear prediction analysis means, 10 is a maximum likelihood parameter memory, 11 is a speech residual power memory, 12 is noise residual calculation means, 13 is a noise residual power memory, 14 is residual power calculation means, 15 is SN ratio calculation means, 16 is feature vector synthesizing means, 17 is optimum matching path determining means, 18 is superimposed noise generating means, and 20 is power ratio determining means. These are identical to the components bearing the same reference numerals in FIG. 2, so a detailed description is omitted.

【０１２４】また、２３は最適照合パス決定手段１７の
出力である照合パスデータに従い重畳雑音生成手段１８
の出力である重畳雑音特徴ベクトルから付加雑音特徴ベ
クトルを求める付加雑音決定手段、２４は付加雑音決定
手段２３の出力である付加雑音特徴ベクトルを用いて音
声モデルメモリ３に記憶されている音声モデルの標準音
声特徴ベクトルに対する雑音付加処理を施し雑音付加標
準音声特徴ベクトルを出力する雑音付加手段、２５は音
響分析手段２の出力であるところの雑音重畳入力音声特
徴ベクトル時系列と雑音付加手段２４の出力であるとこ
ろの雑音付加標準音声特徴ベクトルとの類似度を求める
類似度演算手段である。Further, 23 is additive noise determining means that obtains additive noise feature vectors from the superimposed noise feature vectors output by the superimposed noise generating means 18, in accordance with the matching path data output by the optimum matching path determining means 17. 24 is noise adding means that applies noise addition processing to the standard speech feature vectors of the speech models stored in the speech model memory 3, using the additive noise feature vectors output by the additive noise determining means 23, and outputs noise-added standard speech feature vectors. 25 is similarity calculation means that obtains the similarity between the noise-superimposed input speech feature vector time series output by the acoustic analysis means 2 and the noise-added standard speech feature vectors output by the noise adding means 24.

【０１２５】次に動作について、まずＤＰマッチング法
を照合手段７および最適照合パス決定手段１７に採用し
た離散単語認識の場合を例に説明を行う。音声モデルメ
モリ３及び雑音モデルメモリ８の記憶内容及び、雑音重
畳入力音声信号の入力端１への入力から、パワー比決定
手段２０までの動作は、上記実施例２の場合と同一であ
るので説明を省く。Next, the operation is described, taking as an example discrete word recognition in which the DP matching method is adopted in the matching means 7 and the optimum matching path determining means 17. The contents stored in the speech model memory 3 and the noise model memory 8, and the operation from the input of the noise-superimposed input speech signal at the input end 1 up to the power ratio determining means 20, are the same as in Embodiment 2 above, so their description is omitted.

【０１２６】付加雑音決定手段２３は、前記重畳雑音生
成手段１８の出力であるところの重畳雑音特徴ベクトル
Ｕki,j,n（ｋ＝１,２,…,Ｋ、ｊ＝１,２,…,ＪK、ｉ＝
１,２,…,Ｉ、ｎ＝１,２）と前記最適照合パス決定手段
１７の出力であるところの照合パスデータとパワー比決
定手段２０の出力であるところの音声パワー比εkを入
力とし、雑音重畳入力音声とカテゴリｋの音声モデルと
の類似度を最大にする照合パス上の重畳雑音特徴ベクト
ルについて、ｋおよびｊを同じくする重畳雑音特徴ベク
トルの平均特徴ベクトルを求め、これを音声パワー比ε
kでパワー補正し、付加雑音特徴ベクトルＷk（ｊ）とす
る。The additive noise determining means 23 takes as input the superimposed noise feature vectors Uki,j,n (k = 1, 2, ..., K; j = 1, 2, ..., Jk; i = 1, 2, ..., I; n = 1, 2) output by the superimposed noise generating means 18, the matching path data output by the optimum matching path determining means 17, and the speech power ratio εk output by the power ratio determining means 20. Among the superimposed noise feature vectors on the matching path that maximizes the similarity between the noise-superimposed input speech and the speech model of category k, it obtains the average feature vector of those that share the same k and j, corrects the power of this average by the speech power ratio εk, and takes the result as the additive noise feature vector Wk(j).

【０１２７】すなわち、まず音声モデルのあるカテゴリ
ｋ（ｋ＝１,２,…,Ｋ）についてＬ＝１,２,…,Ｌkとし
た時、ｇｋ（Ｌ）を同じくする重畳雑音特徴ベクトルＵ
kfk(L),gk(L),hk(L)の平均特徴ベクトルを求め、ついで
この平均特徴ベクトルの各次元要素を音声パワー比εk
で割ることで得られた特徴ベクトルを付加雑音特徴ベク
トルＷk（ｇｋ（Ｌ））とする。これにより、付加雑音
特徴ベクトル｛Ｗk（ｊ）｜ｊ＝１,２,…,Ｊk｝（ｋ＝
１,２,…,Ｋ）が得られる。That is, for a given category k (k = 1, 2, ..., K) of the speech model, with L = 1, 2, ..., Lk, the average feature vector is first obtained over the superimposed noise feature vectors Ukfk(L),gk(L),hk(L) that share the same gk(L). Each dimensional element of this average feature vector is then divided by the speech power ratio εk, and the resulting feature vector is taken as the additive noise feature vector Wk(gk(L)). This yields the additive noise feature vectors {Wk(j) | j = 1, 2, ..., Jk} (k = 1, 2, ..., K).
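
The averaging and power correction described above can be sketched as follows for a single category k. This is an illustrative Python sketch, not the patented implementation: the array layout `U[i, j, n]`, the 0-based indices, the representation of the matching path as a list of `(i, j, n)` triples standing for (fk(L), gk(L), hk(L)), and the function name are all assumptions made for the example.

```python
import numpy as np

def additive_noise_vectors(U, path, eps_k, num_model_frames):
    """Return W with W[j] = (mean of U[i, j, n] over path points sharing j) / eps_k.

    U has shape (I, J_k, N, dim); path is a list of (i, j, n) index triples.
    Model frames never visited by the path are left as zero vectors (an assumption).
    """
    dim = U.shape[-1]
    sums = np.zeros((num_model_frames, dim))
    counts = np.zeros(num_model_frames)
    for i, j, n in path:                 # one triple per path point L = 1..L_k
        sums[j] += U[i, j, n]
        counts[j] += 1
    W = np.zeros_like(sums)
    visited = counts > 0
    W[visited] = sums[visited] / counts[visited, None]   # average per model frame j
    return W / eps_k                     # element-wise division by the speech power ratio
```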

【０１２８】以上の動作により本付加雑音決定手段は、
請求項１の発明になる雑音重畳入力音声とカテゴリｋの
音声モデルとの照合手法に基づき、標準音声特徴ベクト
ルに対する付加雑音特徴ベクトルを求める。Through the above operation, the additive noise determining means obtains the additive noise feature vectors for the standard speech feature vectors, based on the method of matching the noise-superimposed input speech against the speech model of category k according to the invention of claim 1.

【０１２９】雑音付加手段２４は、音声モデルメモリ３
に記憶されている音声モデルの標準音声特徴ベクトル
｛Ｓk（ｊ）｜ｊ＝１,２,…,Ｊk｝（ｋ＝１,２,…,Ｋ）
に対し、付加雑音決定手段２３の出力である付加雑音特
徴ベクトル｛Ｗk（ｊ）｜ｊ＝１,２,…,Ｊk｝（ｋ＝１,
２,…,Ｋ）を用いて、（２０）式のように雑音付加標準
音声特徴ベクトル｛Ｙk（ｊ）｜ｊ＝１,２,…,Ｊk｝
（ｋ＝１,２,…,Ｋ）を求める。The noise adding means 24 takes the standard speech feature vectors {Sk(j) | j = 1, 2, ..., Jk} (k = 1, 2, ..., K) of the speech models stored in the speech model memory 3 and, using the additive noise feature vectors {Wk(j) | j = 1, 2, ..., Jk} (k = 1, 2, ..., K) output by the additive noise determining means 23, obtains the noise-added standard speech feature vectors {Yk(j) | j = 1, 2, ..., Jk} (k = 1, 2, ..., K) as in equation (20).

【０１３０】[0130]

【数２０】 [Equation 20]

【０１３１】式中のベクトル和は、特徴ベクトルの各次
元要素毎の和により行う。The vector sum in the equation is taken element by element over the dimensions of the feature vectors.
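
Equation (20) itself is not reproduced in this text. Judging only from the surrounding description (noise addition applied to Sk(j) using Wk(j), with an element-wise vector sum), it is presumably of the following form. This is a reconstruction under that assumption, not the published formula:

```latex
Y_k(j) = S_k(j) + W_k(j), \qquad j = 1, 2, \ldots, J_k, \quad k = 1, 2, \ldots, K
```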

【０１３２】類似度演算手段２５は、音響分析手段２の
出力であるところの雑音重畳入力音声特徴ベクトル時系
列｛Ｘ（ｉ）｜ｉ＝１,２,…,Ｉ｝と前記雑音付加手段
の出力であるところの雑音付加標準音声特徴ベクトル
｛Ｙk（ｊ）｜ｊ＝１,２,…，Ｊk｝（ｋ＝１,２,…,
Ｋ）との類似度Ｄ4k（ｉ，ｊ）を（２１）式に従い求め
る。The similarity calculation means 25 obtains, in accordance with equation (21), the similarity D4k(i, j) between the noise-superimposed input speech feature vector time series {X(i) | i = 1, 2, ..., I} output by the acoustic analysis means 2 and the noise-added standard speech feature vectors {Yk(j) | j = 1, 2, ..., Jk} (k = 1, 2, ..., K) output by the noise adding means.

【０１３３】[0133]

【数２１】 [Equation 21]

【０１３４】式中、ｄ（＊，＊）は括弧内の２つの自己
相関係数ベクトルの間に定義される類似度で、例えばそ
れぞれの自己相関係数をＬＰＣ分析して得られるＬＰＣ
ケプストラムベクトルのユークリッド距離の逆数であ
る。In the equation, d(*, *) is a similarity defined between the two autocorrelation coefficient vectors in the parentheses, for example the reciprocal of the Euclidean distance between the LPC cepstrum vectors obtained by LPC analysis of the respective autocorrelation coefficients.
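
The example similarity named here (reciprocal of the Euclidean distance between LPC cepstra derived from two autocorrelation vectors) can be sketched as follows. Equation (21) is not reproduced in this text; the sketch assumes it applies d to the pair X(i), Yk(j). The function names, the use of the Levinson-Durbin recursion for the LPC analysis, and the small constant `eps` that guards against division by zero are assumptions of this example rather than details taken from the patent.

```python
import numpy as np

def lpc_from_autocorr(r, order):
    """Levinson-Durbin recursion; returns a[0..order] with a[0] = 1, A(z) = 1 + sum a_k z^-k."""
    r = np.asarray(r, dtype=float)
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]                                # prediction error power (assumed nonzero)
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]   # right-hand side is a new array, so this is safe
        a[m] = k
        err *= 1.0 - k * k
    return a

def lpc_cepstrum(a):
    """LPC cepstrum c[1..p] of the all-pole model 1/A(z)."""
    p = len(a) - 1
    c = np.zeros(p + 1)
    for n in range(1, p + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(1, n))
        c[n] = -a[n] - acc / n
    return c[1:]

def similarity(r_x, r_y, order, eps=1e-12):
    """Reciprocal of the Euclidean distance between the two LPC cepstrum vectors."""
    cx = lpc_cepstrum(lpc_from_autocorr(r_x, order))
    cy = lpc_cepstrum(lpc_from_autocorr(r_y, order))
    return 1.0 / (np.linalg.norm(cx - cy) + eps)
```

With this definition a larger value means a closer match, which is consistent with the matching means selecting the category that maximizes the similarity.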

【０１３５】照合手段６は、前記類似度演算手段２５の
出力であるところの類似度データＤ4k（ｉ，ｊ）（ｉ＝
１,２,…,Ｉ、ｊ＝１,２,…,Ｊk、ｋ＝１,２,…,Ｋ）を
入力として、雑音重畳入力音声とカテゴリｋの音声モデ
ルとの照合を行い、類似度を最大にする音声モデルのカ
テゴリを認識結果７として出力する。The matching means 6 takes as input the similarity data D4k(i, j) (i = 1, 2, ..., I; j = 1, 2, ..., Jk; k = 1, 2, ..., K) output by the similarity calculation means 25, matches the noise-superimposed input speech against the speech model of each category k, and outputs the category of the speech model that maximizes the similarity as the recognition result 7.
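
The matching step can be illustrated with a minimal DP matching sketch in Python. The local path constraint used here (steps from (i-1, j), (i, j-1), or (i-1, j-1)), the absence of path-length normalization, and the function names are assumptions of this example; this passage does not specify them.

```python
import numpy as np

def dp_score(D):
    """Maximum cumulative similarity over monotonic paths from (0, 0) to (I-1, J-1)."""
    I, J = D.shape
    G = np.full((I, J), -np.inf)
    G[0, 0] = D[0, 0]
    for i in range(I):
        for j in range(J):
            if i == 0 and j == 0:
                continue
            best = -np.inf
            if i > 0:
                best = max(best, G[i - 1, j])
            if j > 0:
                best = max(best, G[i, j - 1])
            if i > 0 and j > 0:
                best = max(best, G[i - 1, j - 1])
            G[i, j] = best + D[i, j]
    return G[-1, -1]

def recognize(similarities):
    """similarities maps each category to its similarity matrix; returns the best category."""
    return max(similarities, key=lambda k: dp_score(similarities[k]))
```

A practical recognizer would normally normalize the cumulative score by path length so that models of different lengths are comparable; that refinement is omitted here for brevity.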

【０１３６】以上、請求項４の発明に係わる実施例につ
いて、ＤＰマッチングによる照合を行う場合を例に採り
説明を行ったが、他の実施例の場合と同様に、最適照合
パス決定手段１７及び照合手段６における雑音重畳入力
音声と音声モデルとの照合方式はＤＰマッチングに限定
されるものではなく、例えばＨＭＭによる認識手法を用
いてもかまわない。この場合の音声モデルメモリ３、雑
音モデルメモリ８、類似度演算手段５、最適照合パス決
定手段１７については実施例２と同じであるので説明を
省く。The embodiment according to the invention of claim 4 has been described above taking matching by DP matching as an example. As in the other embodiments, however, the method of matching the noise-superimposed input speech against the speech models in the optimum matching path determining means 17 and the matching means 6 is not limited to DP matching; a recognition method based on HMMs, for example, may be used. In that case the speech model memory 3, the noise model memory 8, the similarity calculation means 5, and the optimum matching path determining means 17 are the same as in Embodiment 2, so their description is omitted.

【０１３７】類似度演算手段２５では、ＨＭＭの各状態
（もしくは各遷移）における標準音声特徴ベクトルに対
応する雑音付加標準音声特徴ベクトルを用いて、雑音重
畳入力音声特徴ベクトル時系列の各特徴ベクトルが出力
される確率を演算し、類似度データとして出力する。照
合手段６は、類似度演算手段５の出力であるところの類
似度データを用いて、雑音重畳入力音声と各カテゴリの
ＨＭＭとの照合を行い、類似度が最大になるＨＭＭのカ
テゴリを認識結果として出力する。この時の照合方式
は、ビタビに限定されない。The similarity calculation means 25 uses the noise-added standard speech feature vector corresponding to the standard speech feature vector of each state (or each transition) of the HMM to calculate the probability that each feature vector of the noise-superimposed input speech feature vector time series is output, and outputs the result as similarity data. The matching means 6 uses the similarity data output by the similarity calculation means 5 to match the noise-superimposed input speech against the HMM of each category, and outputs the category of the HMM with the highest similarity as the recognition result. The matching method used here is not limited to the Viterbi algorithm.
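
As a rough illustration of the output probability calculation in the HMM case, the sketch below evaluates a log output probability for one state. The text does not state the form of the output distribution, so the diagonal-covariance Gaussian, the `variance` parameter, and the function name are purely assumptions of this example; only the use of the noise-added standard speech feature vector as the state's reference vector comes from the description.

```python
import numpy as np

def log_output_probability(x, state_mean, additive_noise, variance):
    """Log density of x under a diagonal Gaussian centred on the noise-added vector."""
    x = np.asarray(x, dtype=float)
    variance = np.asarray(variance, dtype=float)
    # Noise-added standard speech feature vector for this state (assumed element-wise sum).
    mean = np.asarray(state_mean, dtype=float) + np.asarray(additive_noise, dtype=float)
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * variance) + (x - mean) ** 2 / variance))
```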

【０１３８】以上、単語認識を例に採りその動作につい
て説明を行ったが、この発明の請求項４に係る実施例
は、他の実施例の場合と同様、認識対象を単語に限定す
るものではなく、音声における他の発声単位を用いても
かまわない。The operation has been described above taking word recognition as an example. As in the other embodiments, however, the embodiment according to claim 4 of this invention does not limit the recognition target to words; other units of utterance in speech may also be used.

【０１３９】また類似度演算手段においても、他の実施
例の場合と同様、特徴ベクトルである自己相関係数から
得られるあらゆる音響パラメータ、例えばＬＳＰパラメ
ータやＬＰＣメルケプストラム係数、声道断面積関数を
用いた類似度や、同じく自己相関係数から得られるパラ
メータを用いたあらゆる距離尺度、例えばＬＰＣメルケ
プストラム係数のユークリッド距離や、ＷＬＲ距離、Ｗ
ＧＤ距離尺度、群遅延スペクトル距離、重み付けケプス
トラムのユークリッド距離、またこれらユークリッド距
離の代わりにチェビシェフ距離などを用いた類似度を採
用してもかまわない。Also in the similarity calculation means, as in the other embodiments, a similarity based on any acoustic parameter obtainable from the autocorrelation coefficients that form the feature vector may be adopted, for example LSP parameters, LPC mel-cepstrum coefficients, or the vocal tract cross-sectional area function. Any distance measure using parameters likewise obtained from the autocorrelation coefficients may also be adopted, for example the Euclidean distance between LPC mel-cepstrum coefficients, the WLR distance, the WGD distance measure, the group delay spectrum distance, the Euclidean distance between weighted cepstra, or a similarity that uses the Chebyshev distance or the like in place of these Euclidean distances.
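
The interchangeability of distance measures mentioned here can be sketched by making the distance a parameter of the similarity function. This Python example covers only the Euclidean and Chebyshev distances; the function names, the reciprocal form of the similarity, and the constant `eps` are assumptions chosen for illustration.

```python
import numpy as np

def euclidean(u, v):
    """Square root of the sum of squared element differences."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))

def chebyshev(u, v):
    """Largest absolute element difference."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.max(np.abs(diff)))

def similarity(u, v, distance=euclidean, eps=1e-12):
    """Reciprocal of the chosen distance; eps avoids division by zero for identical vectors."""
    return 1.0 / (distance(u, v) + eps)
```

Calling `similarity(u, v, distance=chebyshev)` swaps the measure without changing the rest of the matching procedure.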

【０１４０】加えて、音響分析によるところの特徴ベク
トルを自己相関係数のみに限定することなく、他の音響
パラメータを付与した特徴ベクトルを用い、これによる
類似度により照合を行ってもかまわない。In addition, the feature vector obtained by acoustic analysis need not be limited to autocorrelation coefficients alone; a feature vector augmented with other acoustic parameters may be used, and matching may be performed using the similarity based on it.

【０１４１】なお、上記４つの実施例では専用のハード
ウェアにて構成するものを示したが、汎用の計算機は信
号処理プロセッサにおけるソフトウェア処理によって実
現するようにしても良い。Although the four embodiments above are shown as implemented in dedicated hardware, they may instead be realized by software processing on a general-purpose computer or a signal processor.

【０１４２】[0142]

【発明の効果】この発明は、以上説明したように構成さ
れているので、以下に記載されるような効果を奏する。Since the present invention is constructed as described above, it has the following effects.

【０１４３】請求項１の発明においては、雑音重畳入力
音声特徴ベクトル時系列の各特徴ベクトルに対し、音声
モデルメモリに記憶されている音声モデルの標準音声特
徴ベクトルと雑音モデルメモリに記憶されている雑音モ
デルの雑音特徴ベクトルとの全ての組み合わせによるＳ
Ｎ比演算を行い、次いで、このＳＮ比に合わせて標準音
声特徴ベクトルと雑音特徴ベクトルとの合成を行い、得
られた雑音重畳音声特徴ベクトルと該雑音重畳入力音声
特徴ベクトル時系列中の特徴ベクトルとの類似度を求
め、この類似度データを用いて雑音重畳音声と音声モデ
ルとの照合を雑音モデルの制約の下で行っているため、
雑音重畳入力音声における重畳雑音特徴ベクトルとＳＮ
比の推定と、雑音重畳入力音声と音声モデルとの照合が
同時に行われており、重畳雑音およびＳＮ比が大きく変
動するような非定常雑音重畳入力音声に対しても良好な
認識性能が得られる。In the invention of claim 1, for each feature vector of the noise-superimposed input speech feature vector time series, the SN ratio is calculated for every combination of a standard speech feature vector of the speech models stored in the speech model memory and a noise feature vector of the noise model stored in the noise model memory. The standard speech feature vector and the noise feature vector are then synthesized in accordance with this SN ratio, the similarity between the resulting noise-superimposed speech feature vector and the feature vector in the noise-superimposed input speech feature vector time series is obtained, and this similarity data is used to match the noise-superimposed speech against the speech models under the constraint of the noise model. Estimation of the superimposed noise feature vector and the SN ratio of the noise-superimposed input speech is thus carried out simultaneously with the matching of the noise-superimposed input speech against the speech models, so good recognition performance is obtained even for input speech with non-stationary noise whose superimposed noise and SN ratio vary greatly.

【０１４４】また、請求項２の発明においては、請求項
１の発明における雑音重畳入力音声と音声モデルとの照
合手法に基づき得られる照合パスに従い、雑音重畳入力
音声における入力雑音特徴ベクトル時系列を求め、これ
を用いてパワー正規化音声モデルと雑音重畳入力音声と
の雑音適応化類似度演算を行い再照合を行っているの
で、重畳雑音およびＳＮ比が大きく変動するような非定
常雑音重畳入力音声に対しても良好な認識性能が得られ
る。In the invention of claim 2, the input noise feature vector time series of the noise-superimposed input speech is obtained in accordance with the matching path given by the method of claim 1 for matching the noise-superimposed input speech against the speech models. Using this time series, the noise-adapted similarity between the power-normalized speech model and the noise-superimposed input speech is calculated and matching is performed again, so good recognition performance is obtained even for input speech with non-stationary noise whose superimposed noise and SN ratio vary greatly.

【０１４５】また、請求項３の発明においては、請求項
１の発明における雑音重畳入力音声と音声モデルとの照
合手法に基づき得られる照合パスに従い、雑音重畳入力
音声における入力雑音特徴ベクトル時系列を求め、該入
力雑音特徴ベクトル時系列による雑音除去処理を施した
雑音重畳入力音声と音声モデルとの類似度演算を行い再
照合を行っているので、重畳雑音及びＳＮ比が大きく変
動するような非定常雑音重畳入力音声に対しても良好な
認識性能が得られる。In the invention of claim 3, the input noise feature vector time series of the noise-superimposed input speech is obtained in accordance with the matching path given by the method of claim 1 for matching the noise-superimposed input speech against the speech models. The similarity between the speech models and the noise-superimposed input speech after noise removal based on this input noise feature vector time series is then calculated and matching is performed again, so good recognition performance is obtained even for input speech with non-stationary noise whose superimposed noise and SN ratio vary greatly.

【０１４６】また、請求項４の発明においては、請求項
１の発明における雑音重畳入力音声と音声モデルとの照
合手法に基づき得られる照合パスに従い、音声モデルの
標準音声特徴ベクトルに付加する雑音特徴ベクトルを求
め、該雑音特徴ベクトルを付加した音声モデルと雑音重
畳入力音声との類似度演算を行い再照合をしているの
で、重畳雑音およびＳＮ比が大きく変動するような非定
常雑音重畳入力音声に対しても良好な認識性能が得られ
る。In the invention of claim 4, the noise feature vectors to be added to the standard speech feature vectors of the speech models are obtained in accordance with the matching path given by the method of claim 1 for matching the noise-superimposed input speech against the speech models. The similarity between the speech models with these noise feature vectors added and the noise-superimposed input speech is then calculated and matching is performed again, so good recognition performance is obtained even for input speech with non-stationary noise whose superimposed noise and SN ratio vary greatly.

【図面の簡単な説明】[Brief description of drawings]

【図１】この発明の実施例１による音声認識装置を示す
ブロック図である。FIG. 1 is a block diagram showing a voice recognition device according to a first embodiment of the present invention.

【図２】この発明の実施例２による音声認識装置を示す
ブロック図である。FIG. 2 is a block diagram showing a voice recognition device according to a second embodiment of the present invention.

【図３】この発明の実施例３による音声認識装置を示す
ブロック図である。FIG. 3 is a block diagram showing a voice recognition device according to a third embodiment of the present invention.

【図４】この発明の実施例４による音声認識装置を示す
ブロック図である。FIG. 4 is a block diagram showing a voice recognition device according to a fourth embodiment of the present invention.

【図５】従来の音声認識装置を示すブロック図である。FIG. 5 is a block diagram showing a conventional voice recognition device.

【符号の説明】[Explanation of symbols]

1 入力端 (input terminal)
2 音響分析手段 (acoustic analysis means)
3 音声モデルメモリ (speech model memory)
5 類似度演算手段 (similarity calculation means)
6 照合手段 (matching means)
7 認識結果 (recognition result)
8 雑音モデルメモリ (noise model memory)
9 線形予測分析手段 (linear prediction analysis means)
10 最尤パラメータメモリ (maximum likelihood parameter memory)
11 音声残差パワーメモリ (speech residual power memory)
12 雑音残差演算手段 (noise residual calculation means)
13 雑音残差パワーメモリ (noise residual power memory)
14 残差パワー演算手段 (residual power calculation means)
15 ＳＮ比演算手段 (SN ratio calculation means)
16 特徴ベクトル合成手段 (feature vector synthesizing means)
17 最適照合パス決定手段 (optimum matching path determining means)
18 重畳雑音生成手段 (superimposed noise generating means)
19 重畳雑音決定手段 (superimposed noise determining means)
20 パワー比決定手段 (power ratio determining means)
21 雑音適応化類似度演算手段 (noise-adapted similarity calculation means)
22 雑音除去類似度演算手段 (noise removal similarity calculation means)
23 付加雑音決定手段 (additive noise determining means)
24 雑音付加手段 (noise adding means)
25 類似度演算手段 (similarity calculation means)

Claims

【特許請求の範囲】[Claims]

【請求項１】相異なる音声を表現する音声モデルを持
ち、未知入力音声と前記音声モデルとの照合により音声
認識を行う音声認識装置において、雑音が重畳した未知
入力音声信号に対し設定される複数個の分析フレームの
各々について音響分析を行い雑音重畳入力音声特徴ベク
トル時系列を出力する音響分析手段と、音声信号に重畳
する雑音の特徴ベクトル時系列を表現する雑音モデルを
記憶する雑音モデルメモリと、標準音声の特徴ベクトル
時系列を表現する音声モデルを記憶する音声モデルメモ
リと、音声モデルメモリに記憶されている音声モデルの
標準音声特徴ベクトルに対し線形予測分析を行い最尤パ
ラメータと標準音声残差パワーを求める線形予測分析手
段と、線形予測分析手段の出力であるところの最尤パラ
メータを記憶する最尤パラメータメモリと、同じく線形
予測分析手段の出力であるところの標準音声残差パワー
を記憶する音声残差パワーメモリと、雑音モデルメモリ
に記憶されている雑音モデルの雑音特徴ベクトルを入力
として最尤パラメータメモリ上の最尤パラメータとの積
和演算を行い雑音残差パワーを求める雑音残差演算手段
と、雑音残差演算手段の出力であるところの雑音残差パ
ワーを記憶する雑音残差パワーメモリと、音響分析手段
の出力であるところの雑音重畳入力音声特徴ベクトル時
系列の各特徴ベクトルに対し最尤パラメータメモリ上の
最尤パラメータとの積和演算を行い雑音重畳入力音声残
差パワーを求める残差パワー演算手段と、残差パワー演
算手段の出力であるところの雑音重畳入力音声残差パワ
ーと音声残差パワーメモリ上の標準音声残差パワーと雑
音残差パワーメモリ上の雑音残差パワーとを用いて雑音
重畳入力音声のＳＮ比を求めるＳＮ比演算手段と、ＳＮ
比演算手段の出力であるところのＳＮ比に従い音声モデ
ルメモリ上の標準音声特徴ベクトルと雑音モデルメモリ
上の雑音特徴ベクトルの合成を行い雑音重畳音声特徴ベ
クトルを生成する特徴ベクトル合成手段と、音響分析手
段の出力である雑音重畳入力音声特徴ベクトル時系列の
各特徴ベクトルに対し特徴ベクトル合成手段の出力であ
る雑音重畳音声特徴ベクトルとの類似度を演算する類似
度演算手段と、類似度演算手段の出力であるところの類
似度データを用いて照合処理を行い認識結果を出力する
照合手段を備えたことを特徴とする音声認識装置。1. A speech recognition device that has speech models representing mutually different speech and performs speech recognition by matching unknown input speech against the speech models, the device comprising: acoustic analysis means for performing acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed, and outputting a noise-superimposed input speech feature vector time series; a noise model memory for storing a noise model that represents a time series of feature vectors of the noise superimposed on the speech signal; a speech model memory for storing speech models that represent time series of feature vectors of standard speech; linear prediction analysis means for performing linear prediction analysis on the standard speech feature vectors of the speech models stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual powers; a maximum likelihood parameter memory for storing the maximum likelihood parameters output by the linear prediction analysis means; a speech residual power memory for storing the standard speech residual powers likewise output by the linear prediction analysis means; noise residual calculation means for taking as input the noise feature vectors of the noise model stored in the noise model memory and performing a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual powers; a noise residual power memory for storing the noise residual powers output by the noise residual calculation means; residual power calculation means for performing, on each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual powers; SN ratio calculation means for obtaining the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual powers output by the residual power calculation means, the standard speech residual powers in the speech residual power memory, and the noise residual powers in the noise residual power memory; feature vector synthesizing means for synthesizing the standard speech feature vectors in the speech model memory with the noise feature vectors in the noise model memory in accordance with the SN ratio output by the SN ratio calculation means, to generate noise-superimposed speech feature vectors; similarity calculation means for calculating, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the similarity to the noise-superimposed speech feature vectors output by the feature vector synthesizing means; and matching means for performing matching using the similarity data output by the similarity calculation means and outputting a recognition result.

【請求項２】相異なる音声を表現する音声モデルを持
ち、未知入力音声と前記音声モデルとの照合により音声
認識を行う音声認識装置において、雑音が重畳した未知
入力音声信号に対し設定される複数個の分析フレームの
各々について音響分析を行い雑音重畳入力音声特徴ベク
トル時系列を出力する音響分析手段と、音声信号に重畳
する雑音の特徴ベクトル時系列を表現する雑音モデルを
記憶する雑音モデルメモリと、標準音声の特徴ベクトル
時系列を表現する音声モデルを記憶する音声モデルメモ
リと、音声モデルメモリに記憶されている音声モデルの
標準音声特徴ベクトルに対し線形予測分析を行い最尤パ
ラメータと標準音声残差パワーを求める線形予測分析手
段と、線形予測分析手段の出力であるところの最尤パラ
メータを記憶する最尤パラメータメモリと、同じく線形
予測分析手段の出力であるところの標準音声残差パワー
を記憶する音声残差パワーメモリと、雑音モデルメモリ
に記憶されている雑音モデルの雑音特徴ベクトルを入力
として最尤パラメータメモリ上の最尤パラメータとの積
和演算を行い雑音残差パワーを求める雑音残差演算手段
と、雑音残差演算手段の出力であるところの雑音残差パ
ワーを記憶する雑音残差パワーメモリと、音響分析手段
の出力であるところの雑音重畳入力音声特徴ベクトル時
系列の各特徴ベクトルに対し最尤パラメータメモリ上の
最尤パラメータとの積和演算を行い雑音重畳入力音声残
差パワーを求める残差パワー演算手段と、残差パワー演
算手段の出力であるところの雑音重畳入力音声残差パワ
ーと音声残差パワーメモリ上の標準音声残差パワーと雑
音残差パワーメモリ上の雑音残差パワーとを用いて雑音
重畳入力音声のＳＮ比を求めるＳＮ比演算手段と、ＳＮ
比演算手段の出力であるところのＳＮ比に従い音声モデ
ルメモリ上の標準音声特徴ベクトルと雑音モデルメモリ
上の雑音特徴ベクトルの合成を行い雑音重畳音声特徴ベ
クトルを生成する特徴ベクトル合成手段と、音響分析手
段の出力である雑音重畳入力音声特徴ベクトル時系列の
各特徴ベクトルに対し特徴ベクトル合成手段の出力であ
る雑音重畳音声特徴ベクトルとの類似度を演算する類似
度演算手段と、類似度演算手段の出力であるところの類
似度データを入力として音声モデルと雑音重畳入力音声
特徴ベクトル時系列との最適照合パスを求める最適照合
パス決定手段と、音響分析手段の出力である雑音重畳入
力音声特徴ベクトル時系列における各特徴ベクトルに対
しＳＮ比演算手段の出力であるＳＮ比と雑音モデルメモ
リ上の雑音特徴ベクトルとを用いて重畳雑音特徴ベクト
ルを生成する重畳雑音生成手段と、最適照合パス決定手
段の出力であるところの照合パスデータと重畳雑音生成
手段の出力であるところの重畳雑音特徴ベクトルとを用
いて入力雑音特徴ベクトル時系列を求める重畳雑音決定
手段と、ＳＮ比演算手段の出力であるところのＳＮ比と
音響分析手段の出力であるところの雑音重畳入力音声特
徴ベクトル時系列と音声モデルメモリ上の標準音声特徴
ベクトルと最適照合パス決定手段の出力であるところの
照合パスデータとを入力として音声パワー比を求めるパ
ワー比決定手段と、音響分析手段の出力であるところの
雑音重畳入力音声特徴ベクトル時系列と音声モデルメモ
リ上の標準音声特徴ベクトルと重畳雑音決定手段の出力
であるところの入力雑音特徴ベクトル時系列とパワー比
決定手段の出力であるところの音声パワー比とを入力と
して雑音重畳入力音声特徴ベクトル時系列の各特徴ベク
トルと音声モデルメモリ上の標準音声特徴ベクトルとの
雑音適応化類似度を演算する雑音適応化類似度演算手段
と、雑音適応化類似度演算手段の出力であるところの雑
音適応化類似度データを用いて照合を行い認識結果を出
力する照合手段を備えたことを特徴とする音声認識装
置。2. A speech recognition device that has speech models representing mutually different speech and performs speech recognition by matching unknown input speech against the speech models, the device comprising: acoustic analysis means for performing acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed, and outputting a noise-superimposed input speech feature vector time series; a noise model memory for storing a noise model that represents a time series of feature vectors of the noise superimposed on the speech signal; a speech model memory for storing speech models that represent time series of feature vectors of standard speech; linear prediction analysis means for performing linear prediction analysis on the standard speech feature vectors of the speech models stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual powers; a maximum likelihood parameter memory for storing the maximum likelihood parameters output by the linear prediction analysis means; a speech residual power memory for storing the standard speech residual powers likewise output by the linear prediction analysis means; noise residual calculation means for taking as input the noise feature vectors of the noise model stored in the noise model memory and performing a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual powers; a noise residual power memory for storing the noise residual powers output by the noise residual calculation means; residual power calculation means for performing, on each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual powers; SN ratio calculation means for obtaining the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual powers output by the residual power calculation means, the standard speech residual powers in the speech residual power memory, and the noise residual powers in the noise residual power memory; feature vector synthesizing means for synthesizing the standard speech feature vectors in the speech model memory with the noise feature vectors in the noise model memory in accordance with the SN ratio output by the SN ratio calculation means, to generate noise-superimposed speech feature vectors; similarity calculation means for calculating, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the similarity to the noise-superimposed speech feature vectors output by the feature vector synthesizing means; optimum matching path determining means for taking as input the similarity data output by the similarity calculation means and obtaining the optimum matching path between the speech models and the noise-superimposed input speech feature vector time series; superimposed noise generating means for generating, for each feature vector in the noise-superimposed input speech feature vector time series output by the acoustic analysis means, superimposed noise feature vectors using the SN ratio output by the SN ratio calculation means and the noise feature vectors in the noise model memory; superimposed noise determining means for obtaining an input noise feature vector time series using the matching path data output by the optimum matching path determining means and the superimposed noise feature vectors output by the superimposed noise generating means; power ratio determining means for obtaining a speech power ratio, taking as input the SN ratio output by the SN ratio calculation means, the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, and the matching path data output by the optimum matching path determining means; noise-adapted similarity calculation means for calculating the noise-adapted similarity between each feature vector of the noise-superimposed input speech feature vector time series and the standard speech feature vectors in the speech model memory, taking as input the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, the input noise feature vector time series output by the superimposed noise determining means, and the speech power ratio output by the power ratio determining means; and matching means for performing matching using the noise-adapted similarity data output by the noise-adapted similarity calculation means and outputting a recognition result.

【請求項３】相異なる音声を表現する音声モデルを持
ち、未知入力音声と前記音声モデルとの照合により音声
認識を行う音声認識装置において、雑音が重畳した未知
入力音声信号に対し設定される複数個の分析フレームの
各々について音響分析を行い雑音重畳入力音声特徴ベク
トル時系列を出力する音響分析手段と、音声信号に重畳
する雑音の特徴ベクトル時系列を表現する雑音モデルを
記憶する雑音モデルメモリと、標準音声の特徴ベクトル
時系列を表現する音声モデルを記憶する音声モデルメモ
リと、音声モデルメモリに記憶されている音声モデルの
標準音声特徴ベクトルに対し線形予測分析を行い最尤パ
ラメータと標準音声残差パワーを求める線形予測分析手
段と、線形予測分析手段の出力であるところの最尤パラ
メータを記憶する最尤パラメータメモリと、同じく線形
予測分析手段の出力であるところの標準音声残差パワー
を記憶する音声残差パワーメモリと、雑音モデルメモリ
に記憶されている雑音モデルの雑音特徴ベクトルを入力
として最尤パラメータメモリ上の最尤パラメータとの積
和演算を行い雑音残差パワーを求める雑音残差演算手段
と、雑音残差演算手段の出力であるところの雑音残差パ
ワーを記憶する雑音残差パワーメモリと、音響分析手段
の出力であるところの雑音重畳入力音声特徴ベクトル時
系列の各特徴ベクトルに対し最尤パラメータメモリ上の
最尤パラメータとの積和演算を行い雑音重畳入力音声残
差パワーを求める残差パワー演算手段と、残差パワー演
算手段の出力であるところの雑音重畳入力音声残差パワ
ーと音声残差パワーメモリ上の標準音声残差パワーと雑
音残差パワーメモリ上の雑音残差パワーとを用いて雑音
重畳入力音声のＳＮ比を求めるＳＮ比演算手段と、ＳＮ
比演算手段の出力であるところのＳＮ比に従い音声モデ
ルメモリ上の標準音声特徴ベクトルと雑音モデルメモリ
上の雑音特徴ベクトルの合成を行い雑音重畳音声特徴ベ
クトルを生成する特徴ベクトル合成手段と、音響分析手
段の出力である雑音重畳入力音声特徴ベクトル時系列の
各特徴ベクトルに対し特徴ベクトル合成手段の出力であ
る雑音重畳音声特徴ベクトルとの類似度を演算する類似
度演算手段と、類似度演算手段の出力であるところの類
似度データを入力として音声モデルと雑音重畳入力音声
特徴ベクトル時系列との最適照合パスを求める最適照合
パス決定手段と、音響分析手段の出力である雑音重畳入
力音声特徴ベクトル時系列における各特徴ベクトルに対
しＳＮ比演算手段の出力であるＳＮ比と雑音モデルメモ
リ上の雑音特徴ベクトルとを用いて重畳雑音特徴ベクト
ルを生成する重畳雑音生成手段と、最適照合パス決定手
段の出力であるところの照合パスデータと重畳雑音生成
手段の出力であるところの重畳雑音特徴ベクトルとを用
いて入力雑音特徴ベクトル時系列を求める重畳雑音決定
手段と、音響分析手段の出力であるところの雑音重畳入
力音声特徴ベクトル時系列と音声モデルメモリ上の標準
音声特徴ベクトルと重畳雑音決定手段の出力であるとこ
ろの入力雑音特徴ベクトル時系列とを入力として雑音重
畳入力音声特徴ベクトル時系列の各特徴ベクトルと音声
モデルメモリ上の標準音声特徴ベクトルとの雑音除去類
似度を演算する雑音除去類似度演算手段と、雑音除去類
似度演算手段の出力であるところの雑音適応化類似度デ
ータを用いて照合を行い認識結果を出力する照合手段を
備えたことを特徴とする音声認識装置。
3. A speech recognition device that has speech models representing mutually different speech and performs speech recognition by matching an unknown input speech against the speech models, the device comprising:
acoustic analysis means for performing acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed, and outputting a noise-superimposed input speech feature vector time series;
a noise model memory for storing a noise model representing the feature vector time series of the noise superimposed on the speech signal;
a speech model memory for storing speech models representing the feature vector time series of standard speech;
linear prediction analysis means for performing linear prediction analysis on the standard speech feature vectors of the speech models stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual power;
a maximum likelihood parameter memory for storing the maximum likelihood parameters output by the linear prediction analysis means;
a speech residual power memory for storing the standard speech residual power likewise output by the linear prediction analysis means;
noise residual calculation means for taking the noise feature vectors in the noise model memory as input and performing a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual power;
a noise residual power memory for storing the noise residual power output by the noise residual calculation means;
residual power calculation means for performing, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual power;
SN ratio calculation means for obtaining the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual power output by the residual power calculation means, the standard speech residual power in the speech residual power memory, and the noise residual power in the noise residual power memory;
feature vector synthesis means for synthesizing, according to the SN ratio output by the SN ratio calculation means, the standard speech feature vectors in the speech model memory and the noise feature vectors in the noise model memory to generate noise-superimposed speech feature vectors;
similarity calculation means for calculating, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the similarity to the noise-superimposed speech feature vectors output by the feature vector synthesis means;
optimal matching path determination means for obtaining, from the similarity data output by the similarity calculation means, the optimal matching path between the speech model and the noise-superimposed input speech feature vector time series;
superimposed noise generation means for generating, for each feature vector in the noise-superimposed input speech feature vector time series output by the acoustic analysis means, a superimposed noise feature vector using the SN ratio output by the SN ratio calculation means and the noise feature vectors in the noise model memory;
superimposed noise determination means for obtaining an input noise feature vector time series using the matching path data output by the optimal matching path determination means and the superimposed noise feature vectors output by the superimposed noise generation means;
noise removal similarity calculation means for taking as input the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, and the input noise feature vector time series output by the superimposed noise determination means, and calculating the noise removal similarity between each feature vector of the noise-superimposed input speech feature vector time series and the standard speech feature vectors in the speech model memory; and
matching means for performing matching using the noise adaptive similarity data output by the noise removal similarity calculation means and outputting a recognition result.
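The residual power and SN ratio steps recited in this claim can be illustrated with a minimal Python sketch. The claim does not state any formulas, so everything here is an assumption made for illustration: feature vectors are taken to be power-normalized autocorrelation coefficients, the maximum likelihood parameters are taken to be the autocorrelation of the linear prediction inverse-filter coefficients, and speech and noise are taken to be uncorrelated so that their autocorrelations add. The function names (`levinson_durbin`, `filter_autocorr`, `residual_power`, `estimate_snr`) are hypothetical.

```python
import numpy as np

def levinson_durbin(r, order):
    """Linear prediction analysis of autocorrelation r[0..order].
    Returns inverse-filter coefficients a (a[0] == 1) and the residual power."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error.
        k = -np.dot(a[:i], r[i:0:-1]) / err
        a[:i + 1] = a[:i + 1] + k * a[:i + 1][::-1]
        err *= 1.0 - k * k
    return a, err

def filter_autocorr(a):
    """Autocorrelation of the inverse-filter coefficients
    (assumed form of the stored 'maximum likelihood parameters')."""
    n = len(a)
    return np.array([np.dot(a[:n - i], a[i:]) for i in range(n)])

def residual_power(ra, r):
    """Product-sum of filter autocorrelation ra with signal autocorrelation r.
    Equals the power left after passing the signal through the inverse filter."""
    return ra[0] * r[0] + 2.0 * np.dot(ra[1:], r[1:])

def estimate_snr(e_input, e_speech, e_noise):
    """Linear SN ratio (speech power / noise power), ASSUMING the normalized
    input autocorrelation is the power-weighted mix of speech and noise:
    e_input = (snr * e_speech + e_noise) / (snr + 1)."""
    return (e_noise - e_input) / (e_input - e_speech)
```

As a usage example under the same assumptions: with a standard speech vector `[1.0, 0.9, 0.81]`, a white-noise vector `[1.0, 0.0, 0.0]`, and an input formed as `(4 * r_s + r_n) / 5`, the three residual powers are obtained with one `residual_power` call each, and `estimate_snr` recovers a ratio of about 4. A real device would repeat this per analysis frame and per standard speech feature vector.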

【請求項４】相異なる音声を表現する音声モデルを持
ち、未知入力音声と前記音声モデルとの照合により音声
認識を行う音声認識装置において、雑音が重畳した未知
入力音声信号に対し設定される複数個の分析フレームの
各々について音響分析を行い雑音重畳入力音声特徴ベク
トル時系列を出力する音響分析手段と、音声信号に重畳
する雑音の特徴ベクトル時系列を表現する雑音モデルを
記憶する雑音モデルメモリと、標準音声の特徴ベクトル
時系列を表現する音声モデルを記憶する音声モデルメモ
リと、音声モデルメモリに記憶されている音声モデルの
標準音声特徴ベクトルに対し線形予測分析を行い最尤パ
ラメータと標準音声残差パワーを求める線形予測分析手
段と、線形予測分析手段の出力であるところの最尤パラ
メータを記憶する最尤パラメータメモリと、同じく線形
予測分析手段の出力であるところの標準音声残差パワー
を記憶する音声残差パワーメモリと、雑音モデルメモリ
上の雑音特徴ベクトルを入力として最尤パラメータメモ
リ上の最尤パラメータとの積和演算を行い雑音残差パワ
ーを求める雑音残差演算手段と、雑音残差演算手段の出
力であるところの雑音残差パワーを記憶する雑音残差パ
ワーメモリと、音響分析手段の出力であるところの雑音
重畳入力音声特徴ベクトル時系列の各特徴ベクトルに対
し最尤パラメータメモリ上の最尤パラメータとの積和演
算を行い雑音重畳入力音声残差パワーを求める残差パワ
ー演算手段と、残差パワー演算手段の出力であるところ
の雑音重畳入力音声残差パワーと音声残差パワーメモリ
上の標準音声残差パワーと雑音残差パワーメモリ上の雑
音残差パワーとを用いて雑音重畳入力音声のＳＮ比を求
めるＳＮ比演算手段と、ＳＮ比演算手段の出力であると
ころのＳＮ比に従い音声モデルメモリ上の標準音声特徴
ベクトルと雑音モデルメモリ上の雑音特徴ベクトルの合
成を行い雑音重畳音声特徴ベクトルを生成する特徴ベク
トル合成手段と、音響分析手段の出力である雑音重畳入
力音声特徴ベクトル時系列の各特徴ベクトルに対し特徴
ベクトル合成手段の出力である雑音重畳音声特徴ベクト
ルとの類似度を演算する類似度演算手段と、類似度演算
手段の出力であるところの類似度データを入力として音
声モデルと雑音重畳入力音声特徴ベクトル時系列との最
適照合パスを求める最適照合パス決定手段と、ＳＮ比演
算手段の出力であるところのＳＮ比と音響分析手段の出
力であるところの雑音重畳入力音声特徴ベクトル時系列
と音声モデルメモリ上の標準音声特徴ベクトルと最適照
合パス決定手段の出力であるところの照合パスデータと
を入力として音声パワー比を求めるパワー比決定手段
と、音響分析手段の出力である雑音重畳入力音声特徴ベ
クトル時系列における各特徴ベクトルに対しＳＮ比演算
手段の出力であるＳＮ比と雑音モデルメモリ上の雑音特
徴ベクトルとを用いて重畳雑音特徴ベクトルを生成する
重畳雑音生成手段と、最適照合パス決定手段の出力であ
るところの照合パスデータと重畳雑音生成手段の出力で
あるところの重畳雑音特徴ベクトルとパワー比決定手段
の出力であるところの音声パワー比とを用いて付加雑音
特徴ベクトルを求める付加雑音決定手段と、付加雑音決
定手段の出力であるところの付加雑音特徴ベクトルと音
声モデルメモリ上の標準音声特徴ベクトルを入力として
雑音付加標準音声特徴ベクトルを求める雑音付加手段
と、音響分析手段の出力であるところの雑音重畳入力音
声特徴ベクトル時系列と雑音付加手段の出力であるとこ
ろの雑音付加標準音声特徴ベクトルとの類似度を演算す
る類似度演算手段と、類似度演算手段の出力であるとこ
ろの類似度データを用いて照合を行い認識結果を出力す
る照合手段を備えたことを特徴とする音声認識装置。
4. A speech recognition device that has speech models representing mutually different speech and performs speech recognition by matching an unknown input speech against the speech models, the device comprising:
acoustic analysis means for performing acoustic analysis on each of a plurality of analysis frames set for an unknown input speech signal on which noise is superimposed, and outputting a noise-superimposed input speech feature vector time series;
a noise model memory for storing a noise model representing the feature vector time series of the noise superimposed on the speech signal;
a speech model memory for storing speech models representing the feature vector time series of standard speech;
linear prediction analysis means for performing linear prediction analysis on the standard speech feature vectors of the speech models stored in the speech model memory to obtain maximum likelihood parameters and standard speech residual power;
a maximum likelihood parameter memory for storing the maximum likelihood parameters output by the linear prediction analysis means;
a speech residual power memory for storing the standard speech residual power likewise output by the linear prediction analysis means;
noise residual calculation means for taking the noise feature vectors in the noise model memory as input and performing a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise residual power;
a noise residual power memory for storing the noise residual power output by the noise residual calculation means;
residual power calculation means for performing, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, a product-sum operation with the maximum likelihood parameters in the maximum likelihood parameter memory to obtain noise-superimposed input speech residual power;
SN ratio calculation means for obtaining the SN ratio of the noise-superimposed input speech using the noise-superimposed input speech residual power output by the residual power calculation means, the standard speech residual power in the speech residual power memory, and the noise residual power in the noise residual power memory;
feature vector synthesis means for synthesizing, according to the SN ratio output by the SN ratio calculation means, the standard speech feature vectors in the speech model memory and the noise feature vectors in the noise model memory to generate noise-superimposed speech feature vectors;
similarity calculation means for calculating, for each feature vector of the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the similarity to the noise-superimposed speech feature vectors output by the feature vector synthesis means;
optimal matching path determination means for obtaining, from the similarity data output by the similarity calculation means, the optimal matching path between the speech model and the noise-superimposed input speech feature vector time series;
power ratio determination means for obtaining a speech power ratio from the SN ratio output by the SN ratio calculation means, the noise-superimposed input speech feature vector time series output by the acoustic analysis means, the standard speech feature vectors in the speech model memory, and the matching path data output by the optimal matching path determination means;
superimposed noise generation means for generating, for each feature vector in the noise-superimposed input speech feature vector time series output by the acoustic analysis means, a superimposed noise feature vector using the SN ratio output by the SN ratio calculation means and the noise feature vectors in the noise model memory;
additive noise determination means for obtaining an additive noise feature vector using the matching path data output by the optimal matching path determination means, the superimposed noise feature vectors output by the superimposed noise generation means, and the speech power ratio output by the power ratio determination means;
noise addition means for obtaining noise-added standard speech feature vectors from the additive noise feature vector output by the additive noise determination means and the standard speech feature vectors in the speech model memory;
similarity calculation means for calculating the similarity between the noise-superimposed input speech feature vector time series output by the acoustic analysis means and the noise-added standard speech feature vectors output by the noise addition means; and
matching means for performing matching using the similarity data output by the similarity calculation means and outputting a recognition result.
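The feature vector synthesis and optimal matching path steps recited in this claim can be sketched as follows. This is a minimal illustration only; the claim does not specify the mixing rule or the path search. The sketch assumes power-normalized feature vectors that mix in proportion to the linear SN ratio, and it substitutes a plain left-to-right dynamic programming alignment for whatever search the device actually uses. The names `synthesize` and `best_path` are hypothetical, and the power ratio and additive noise determination steps are not modeled.

```python
import numpy as np

def synthesize(r_speech, r_noise, snr):
    """Mix a standard speech vector and a noise vector at a linear SN ratio.
    ASSUMES both vectors are power-normalized and that they add linearly."""
    return (snr * r_speech + r_noise) / (snr + 1.0)

def best_path(sim):
    """sim[t, j] is the similarity of input frame t to model state j.
    Finds the left-to-right path (stay, or advance by one state per frame)
    from state 0 to the last state that maximizes the summed similarity.
    Returns (total_similarity, path) where path[t] is the state at frame t."""
    n_frames, n_states = sim.shape
    score = np.full((n_frames, n_states), -np.inf)
    back = np.zeros((n_frames, n_states), dtype=int)
    score[0, 0] = sim[0, 0]
    for t in range(1, n_frames):
        for j in range(n_states):
            stay = score[t - 1, j]
            move = score[t - 1, j - 1] if j > 0 else -np.inf
            if move > stay:
                score[t, j] = move + sim[t, j]
                back[t, j] = j - 1
            else:
                score[t, j] = stay + sim[t, j]
                back[t, j] = j
    # Trace back from the final state to recover the matching path.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(score[-1, -1]), path[::-1]
```

In a recognizer built along these lines, `sim` would be filled by comparing each input frame with the synthesized vector for each model state, `best_path` would be run once per speech model, and the category of the model with the highest total similarity would be reported.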