JPH10232694A - Device and method for voice recognition

Info

Publication number: JPH10232694A
Application number: JP9034575A
Authority: JP
Inventors: Kenichi Taniguchi; 賢一谷口; Nobuyuki Kono; 信幸香野; Tadamichi Tokuda; 肇道徳田; Hiroo Ikura; 啓雄居倉
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1997-02-19
Filing date: 1997-02-19
Publication date: 1998-09-02

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that uses hidden Markov models and can accurately recognize speech in noise even when the noise changes partway through an utterance. SOLUTION: The device includes a clustering unit 107 that separates non-stationary noise into multiple stationary parts according to power, a noise HMM (hidden Markov model) learning unit 108 that obtains a stationary noise HMM for each of the stationary noises, and a noise HMM synthesis unit 115 that synthesizes a composite noise HMM from these stationary noise HMMs. Applying the NOVO (voice mixed with noise) conversion to the standard speech HMM with this composite noise HMM generates a NOVO-HMM that accounts for the non-stationary noise, even when the environmental noise where the device is used is non-stationary, so that speech is recognized even when the noise changes partway through the utterance.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition apparatus and a speech recognition method using hidden Markov models (HMMs).

[0002]

[Prior Art] One technique widely used for automatic speech recognition by computer is based on the hidden Markov model (hereinafter referred to as HMM). First, the method of speech recognition with HMMs is described.

[0003] An HMM has N states S1, S2, ..., SN. At fixed intervals it moves from state to state with a certain probability (the transition probability), and at each transition it outputs one label (feature data) with a certain probability (the output probability). Treating speech as a time series of labels (feature data), each word is uttered several times during training and an HMM modeling those utterances is created. To recognize an unknown input utterance, the HMM with the highest probability of outputting a label sequence matching that of the input is found, and the word corresponding to that HMM is output as the result. This technique is called maximum likelihood estimation.
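The recognition rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the labels are discrete symbols, outputs are attached to states rather than to transitions (a common simplification), and the function names and the two toy word models are hypothetical, not taken from the patent.

```python
def forward_likelihood(labels, init, trans, emit):
    """Probability that the HMM outputs the label sequence (forward algorithm)."""
    n = len(init)
    # alpha[i]: probability of the labels seen so far, ending in state i
    alpha = [init[i] * emit[i][labels[0]] for i in range(n)]
    for label in labels[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][label]
            for j in range(n)
        ]
    return sum(alpha)


def recognize(labels, word_models):
    """Return the word whose HMM gives the label sequence the highest likelihood."""
    return max(word_models, key=lambda w: forward_likelihood(labels, *word_models[w]))


# Two toy 2-state word models over the label alphabet {0, 1} (hypothetical values).
# Each entry is (initial probabilities, transition matrix, output matrix).
WORD_MODELS = {
    "yes": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]]),
    "no": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]),
}

print(recognize([0, 0, 1, 1], WORD_MODELS))
```

In practice the likelihoods would be accumulated in the log domain to avoid underflow on long label sequences.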

[0004] In more detail, a set of speech samples from the speakers to be recognized and one HMM per word to be recognized are prepared for training. The internal parameters defining each HMM are then adjusted so that the HMM readily outputs the feature data sequences extracted from the speech samples of its word. The forward-backward algorithm is used for this adjustment, so that each HMM ends up with internal parameters matched to its target word.

[0005] When an unknown utterance is input, the ease with which each HMM outputs the feature data sequence extracted from that utterance (the likelihood) is calculated by maximum likelihood estimation, and the word corresponding to the HMM giving the maximum likelihood is taken as the recognition result.

[0006] Thus, if the HMM for each word is trained in advance to obtain the state transition probabilities best suited to that word and the label output probabilities at each state transition, then when the label sequence of an unknown word is input, computing the probability (likelihood) for each HMM shows which word's HMM most readily outputs that label sequence, and recognition follows.

[0007] One technique for recognizing noise-corrupted speech with HMMs is the NOVO-HMM method proposed by Franc Martin in the paper "Recognition of Noisy Speech by Composition of Hi[...]" (IEICE Technical Report SP92-96). In this method, the internal parameters of an HMM created from noise (the "noise HMM") and of the "standard-pattern speech HMM" are combined by a technique the paper calls the NOVO (Voice mixed with noise) conversion. The resulting "HMM of speech with superimposed noise", the NOVO-HMM, is then used to recognize noise-corrupted speech with high accuracy.

[0008] FIG. 9 is a conceptual diagram of the NOVO conversion. Standard speech HMMs are generated by training on sample data of the words to be recognized, a noise HMM is generated by training on noise sample data, and the standard speech HMMs and the noise HMM are combined by the NOVO conversion to obtain a NOVO-HMM for each word to be recognized.

[0009] FIG. 6 shows the outline of the log spectrum represented by a conventional NOVO-HMM, and FIG. 7 shows the outline of the log spectrum represented by an HMM created from noise-corrupted speech input. The two outlines, which should be roughly the same, clearly differ. This difference lowered the recognition rate.

[0010] FIG. 10 is a flowchart of the procedure for calculating the internal parameters of the HMM in the conventional NOVO conversion. In the conventional NOVO conversion, the cepstra that are the internal parameters of the standard speech HMM and the noise HMM are first converted to log spectra by a cosine transform (step 1).

[0011] Next, both are exponentiated to obtain linear spectra (step 2). The two linear spectra are then added to create the linear spectrum of the standard speech with noise superimposed (step 3). The resulting linear spectrum is converted back to a log spectrum by taking the logarithm (step 4), and an inverse cosine transform (step 5) yields the cepstrum of the standard speech with noise superimposed.
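Steps 1 to 5 can be illustrated for a single mean cepstrum vector as follows. This is a minimal sketch, not the patent's implementation: it handles only mean vectors (not variances), it assumes as many spectral bins as cepstral coefficients so that the cosine transform pair is exactly invertible, and the function names and the particular transform normalization are assumptions made here for illustration.

```python
import math


def cepstrum_to_log_spectrum(cep):
    """Cosine transform: cepstrum -> log spectrum (same number of bins)."""
    k_bins = len(cep)
    return [
        cep[0] + 2.0 * sum(cep[n] * math.cos(math.pi * n * (k + 0.5) / k_bins)
                           for n in range(1, k_bins))
        for k in range(k_bins)
    ]


def log_spectrum_to_cepstrum(log_spec):
    """Inverse cosine transform: log spectrum -> cepstrum."""
    k_bins = len(log_spec)
    return [
        sum(log_spec[k] * math.cos(math.pi * n * (k + 0.5) / k_bins)
            for k in range(k_bins)) / k_bins
        for n in range(k_bins)
    ]


def novo_compose_mean(speech_cep, noise_cep):
    """Combine speech and noise mean cepstra following steps 1 to 5."""
    speech_log = cepstrum_to_log_spectrum(speech_cep)            # step 1
    noise_log = cepstrum_to_log_spectrum(noise_cep)
    speech_lin = [math.exp(v) for v in speech_log]               # step 2
    noise_lin = [math.exp(v) for v in noise_log]
    mixed_lin = [s + n for s, n in zip(speech_lin, noise_lin)]   # step 3
    mixed_log = [math.log(v) for v in mixed_lin]                 # step 4
    return log_spectrum_to_cepstrum(mixed_log)                   # step 5
```

A quick sanity check on this sketch: combining a cepstrum with itself doubles the linear spectrum, which adds ln 2 to every log-spectrum bin and therefore changes only the zeroth cepstral coefficient.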

[0012] The addition of the two linear spectra is computed as in (Equation 1) and (Equation 2) below, as described in the "HMM composition" section of Franc Martin's paper.

[0013]

(Equation 1)

[0014]

(Equation 2)

[0015] Here, k(SNR) is given by (Equation 3) below.

[0016]

(Equation 3)

[0017] In the above equations, μ is a mean vector and Σ is a covariance matrix. R_ln, S_ln, and N_ln denote noise-corrupted speech, speech, and noise, respectively. SNR is the signal-to-noise ratio at which the noise is superimposed. S_pow and N_pow are the average powers of the speech and the noise, respectively, used for HMM training.

[0018] The value k(SNR) in (Equation 3) changes with the SN ratio of the noise-corrupted speech; that is, it is a parameter that depends only on the power of the noise and not on its type. For example, if the speech and noise used for training have equal power and the speech to be recognized has noise superimposed at an SN ratio of 0 dB (SNR = 0), then k(SNR) is 1 regardless of the type of noise.

[0019]

[Problems to be Solved by the Invention] The NOVO-HMM recognition technique above gives generally good results on noise-corrupted speech, but only if the noise does not change much during the utterance. If the type of noise changes substantially partway through the utterance, the recognition rate drops sharply.

[0020] For example, in the conventional NOVO conversion shown in FIG. 9, when noise-corrupted speech is recognized, the value of k(SNR) is 1 regardless of the type of noise. Moreover, because the conventional method combines the standard-pattern speech HMM and the noise HMM in the same way whatever the type of noise, the resulting NOVO-HMM cannot represent cases where the noise has a larger influence.

[0021] An object of the present invention is to provide a speech recognition apparatus and a speech recognition method that, in speech recognition under non-stationary noise using hidden Markov models, can recognize with high accuracy speech during which the noise changes.

[0022]

[Means for Solving the Problems] The speech recognition apparatus of the present invention includes a noise separation unit that separates input noise into a plurality of stationary noise components according to its power, a stationary noise learning unit that obtains a stationary noise HMM for each of the stationary noise components obtained by the noise separation unit, and a noise synthesis unit that synthesizes one composite noise HMM from the plurality of stationary noise HMMs. The composite noise HMM and the standard-pattern speech HMM are then subjected to the NOVO conversion, so that speech can be recognized with high accuracy even when the noise changes partway through the utterance.

[0023] In addition, in the process of combining the standard-pattern speech HMM and the noise HMM, a weight is applied using as a coefficient the ratio of the average residual power of the speech to the average residual power of the noise (the average difference between the predicted power and the actual power of the noise), and a NOVO-HMM representing noise-corrupted speech is created (claim 5).

[0024] Alternatively, a provisional NOVO-HMM is created with an arbitrarily chosen weight coefficient and evaluated, and this is repeated with different weight coefficients to find in advance a coefficient that gives a good recognition rate. That coefficient is then used for the weighting when creating the NOVO-HMM representing noise-corrupted speech (claim 6).

[0025]

[Embodiments of the Invention] The invention according to claim 1 comprises: a speech input unit that A/D-converts input speech and noise signals; a feature extraction unit that divides the input signal at predetermined intervals and analyzes it into frequency features for each interval; a speech HMM storage unit that stores the standard patterns of the words to be recognized; a noise separation unit that separates input noise into a plurality of stationary noise components according to its power; a stationary noise learning unit that obtains a stationary noise HMM for each of the stationary noise components obtained by the noise separation unit; a noise synthesis unit that synthesizes one composite noise HMM from the plurality of stationary noise HMMs; a NOVO-HMM calculation unit that synthesizes HMMs of noise-corrupted speech by applying the NOVO conversion to the speech HMMs and the composite noise HMM synthesized by the noise synthesis unit; and a recognition result determination unit that calculates likelihoods from the features of the speech signal to be recognized and the HMMs obtained by the NOVO-HMM calculation unit and determines the most likely word as the recognition result. With this configuration, even if the environmental noise where the apparatus is used is non-stationary, the noise is clustered into groups with similar spectra and thereby separated into a plurality of stationary noise components, a noise HMM is created for each of these components, and the noise HMMs are combined. This yields a NOVO-HMM that accounts for composite noise, making possible a NOVO conversion that accounts for non-stationary noise.

[0026] The invention according to claim 5 is a speech recognition method that recognizes noise-corrupted speech using HMMs, characterized in that, when the standard speech HMM and the noise HMM are combined to create the NOVO-HMM used for recognition, a weight is applied using the ratio of the average residual powers of the speech and the noise as a coefficient. In creating the NOVO-HMM for recognizing noise-corrupted speech, the weight with which the standard-pattern speech HMM and the noise HMM are combined is thus taken into account, so that a NOVO-HMM achieving the highest recognition accuracy can be created regardless of the type of noise.

[0027] A speech recognition apparatus and a speech recognition method according to an embodiment of the present invention are described below with reference to the drawings.

[0028] (Embodiment 1) FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention. Reference numeral 101 denotes a speech input unit that converts the speech signals used to create standard patterns into digital values; 102, a speech feature extraction unit that calculates features for each frame from the standard-pattern speech signal; 103, a speech HMM learning unit that creates, from a plurality of standard-pattern speech features, an HMM serving as the standard pattern for each word (hereinafter, the standard speech HMM); and 104, a speech HMM storage unit that stores the standard speech HMMs.

[0029] Reference numeral 105 denotes a noise input unit that converts the noise signal used to create noise standard patterns into digital values, and 106 denotes a noise feature extraction unit that calculates features for each frame from the noise signal. 107 is a clustering unit, serving as the noise separation unit, that gathers similar analysis data from the noise features and clusters them. 108 is a noise HMM learning unit that creates, from the clustered noise features, a noise HMM serving as the standard pattern for each noise. 115 is a noise HMM synthesis unit that combines the individual noise HMMs into one noise HMM. 109 is a noise HMM storage unit that stores the combined noise HMM.

[0030] Reference numeral 110 denotes a NOVO-HMM calculation unit that combines the standard speech HMM and the noise HMM by the NOVO method to synthesize an HMM of noise-corrupted speech, and 111 denotes a NOVO-HMM storage unit that stores the standard speech HMMs with noise superimposed.

[0031] Reference numeral 112 denotes a signal input unit that converts the speech signal to be recognized into digital values; 113, a signal feature extraction unit that calculates features for each frame from the input signal; and 114, a recognition result determination unit that calculates the output probability of the input word and determines the recognition result.

[0032] FIG. 2 is a circuit block diagram of the speech recognition apparatus according to an embodiment of the present invention. Reference numeral 201 denotes a microphone that converts sound into an electric signal; 202, a central processing unit (CPU); 203, a read-only memory (ROM); 204, a writable memory (RAM); and 205, an output device. The signal input unit 112 and the speech input unit 101 in the block diagram of FIG. 1 are implemented by the microphone 201 and the CPU 202. The feature extraction units, the data storage units, and the recognition result determination unit 114 in FIG. 1 are implemented by the CPU 202 executing a program written in the ROM 203 and accessing the RAM 204.

[0033] The speech input unit 101, the noise input unit 105, and the signal input unit 112 in FIG. 1 are implemented by the microphone 201 and the CPU 202. The speech feature extraction unit 102, the speech HMM learning unit 103, the noise feature extraction unit 106, the clustering unit 107, the noise HMM learning unit 108, the noise HMM synthesis unit 115, the signal feature extraction unit 113, the NOVO-HMM calculation unit 110, and the recognition result determination unit 114 are implemented by the CPU 202 executing the program written in the ROM 203 and accessing the RAM 204. The speech HMM storage unit 104, the noise HMM storage unit 109, and the NOVO-HMM storage unit are implemented in the RAM 204.

[0034] FIG. 3 is a flowchart of the speech recognition method according to this embodiment. First, the speech that will form the standard patterns is input from the speech input unit 101. Speech waveforms from several tens to several hundreds of speakers are collected per standard pattern and used as input (step 1). The speech sections of these waveforms are frequency-analyzed by a method such as LPC (Linear Predictive Coding) cepstrum analysis (step 2). From the resulting speech analysis data, the standard speech HMMs that serve as speaker-independent standard patterns are created with the forward-backward algorithm (step 3).

[0035] Next, the non-stationary environmental noise of the place where the speech recognition apparatus is used is input from the noise input unit 105 (step 4). The noise signal is frequency-analyzed by a method such as LPC cepstrum analysis (step 5). The noise signal is then separated into groups of similar noise according to a power criterion (step 6). Alternatively, it can be separated by the spectral distance between power spectra; for example, the LPC cepstrum distance measure can be used as the separation criterion. Based on the noise analysis data, the noise in each separated cluster is concatenated, and a noise HMM serving as the standard pattern of each noise is created with the forward-backward algorithm (step 7).
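The power-based separation of step 6 could look roughly like the sketch below. The patent does not name a clustering algorithm, so the one-dimensional k-means used here, the non-overlapping framing, and all function names are assumptions chosen for illustration.

```python
def frame_powers(samples, frame_len):
    """Mean power of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def cluster_by_power(powers, n_clusters=2, n_iter=20):
    """1-D k-means over frame powers; returns a cluster index per frame."""
    lo, hi = min(powers), max(powers)
    # Spread the initial centres evenly between the weakest and strongest frame.
    centres = [lo + (hi - lo) * c / max(n_clusters - 1, 1) for c in range(n_clusters)]
    labels = [0] * len(powers)
    for _ in range(n_iter):
        # Assign every frame to the nearest centre, then recompute the centres.
        labels = [min(range(n_clusters), key=lambda c: abs(p - centres[c])) for p in powers]
        for c in range(n_clusters):
            members = [p for p, lab in zip(powers, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels


# Toy signal: a quiet stretch followed by a loud stretch (hypothetical data).
samples = [0.1] * 40 + [1.0] * 40
labels = cluster_by_power(frame_powers(samples, frame_len=10))
print(labels)
```

The frames that share a cluster index would then be concatenated and used to train one stationary noise HMM per cluster, as in step 7.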

[0036] The individual noise HMMs are then combined into one noise HMM by the noise HMM synthesis method described later, and the combined noise HMM is stored in the noise HMM storage unit 109. Each standard speech HMM stored in the speech HMM storage unit 104 is then NOVO-converted with each noise HMM stored in the noise HMM storage unit 109 (step 8).

[0037] When an unknown utterance is input, likelihoods are calculated as follows, and the word corresponding to the NOVO-HMM giving the maximum likelihood is output to the output device 205 as the recognition result. Specifically, when speech is input to the microphone 201 (step 9), the features obtained through the signal feature extraction unit 113 are written to the RAM 204 (step 10). Next, for each NOVO-HMM created per standard speech and stored in the ROM 203, the likelihood of the features in the RAM 204 is calculated (step 11). The word corresponding to the NOVO-HMM giving the maximum likelihood is then output to the output device 205 as the recognition result (step 12).

[0038] Normally, for the standard speech HMMs, speech waveform data for words fixed in advance by the intended application are recorded (step 1), the speech features are extracted (step 2), and the standard speech HMMs are calculated (step 3) ahead of time, and the resulting standard-pattern HMMs are stored in the ROM 203.

[0039] If the speech recognition apparatus is used for a single purpose and the environmental noise where it is used does not change, the noise data of the operating environment can be recorded in advance (step 4), and the extraction of noise features (step 5), clustering (step 6), calculation of the noise HMMs (step 7), synthesis of the noise HMMs (step 13), and combination of the standard speech HMMs with the noise HMM (step 8) can all be carried out ahead of time, with the combined HMMs stored in the ROM 203.

[0040] FIG. 4 is a conceptual diagram of the NOVO conversion used in the speech recognition method according to an embodiment of the present invention, showing how the NOVO-HMM is created. Compared with the conventional method shown in FIG. 14, the form of the noise HMM generated by training on sample data is different.

[0041] Suppose the noise HMM to be NOVO-converted has two states and two types of noise are considered in recognition. The conventional method trains a two-state noise HMM directly from one type of noise and then applies the NOVO conversion to that noise HMM and the word HMM. The present invention instead uses a plurality of noise HMMs that the noise HMM learning unit has trained from the fluctuating noise.

[0042] For example, suppose the noise HMM learning unit has learned two types of noise. A one-state noise HMM is trained from each of the two types of noise, and the states of these noise HMMs are joined by assigning state transition probabilities by hand (about 0.7 for the self-transition probability and about 0.3 for the transition to the other state), producing a two-state noise HMM. The conventional NOVO conversion is then applied to this two-state noise HMM and the standard speech HMM to create the NOVO-HMM.
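The joining of one-state noise HMMs with hand-set transition probabilities can be sketched as below. The dictionary layout and the function name are hypothetical, and sharing the remaining probability equally among the other states when there are more than two is an assumption made here; the text only describes the two-state case with values of about 0.7 and 0.3.

```python
def combine_noise_states(state_models, self_prob=0.7):
    """Join 1-state noise HMMs into one multi-state HMM with hand-set transitions."""
    n = len(state_models)
    if n < 2:
        raise ValueError("need at least two noise states to combine")
    # ASSUMPTION: the probability of leaving a state is shared equally
    # among the other states (0.3 in the two-state case from the text).
    other_prob = (1.0 - self_prob) / (n - 1)
    trans = [
        [self_prob if i == j else other_prob for j in range(n)]
        for i in range(n)
    ]
    return {"states": list(state_models), "trans": trans}


# Two trained one-state noise models, each reduced here to a mean cepstrum
# (hypothetical values).
noise_a = {"mean": [0.2, 0.1]}
noise_b = {"mean": [1.5, -0.3]}
two_state_noise_hmm = combine_noise_states([noise_a, noise_b])
print(two_state_noise_hmm["trans"])
```

Each row of the transition matrix sums to one, so the combined model can switch between the two stationary noise conditions within a single utterance.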

[0043] Thus, according to the present invention, noise-corrupted speech can be recognized with high probability even when the noise fluctuates. In an automobile, for example, the non-stationary noise consists of non-stationary noise made up of a succession of stationary segments of several hundred milliseconds each, plus pulse-like superimposed noise. A noise countermeasure that is effective against non-stationary noise formed by a succession of stationary segments can therefore improve recognition performance in non-stationary noise environments.

[0044] (Embodiment 2) In the conventional NOVO conversion shown in FIG. 14, a noise HMM is created by training on noise sample data, and the standard speech HMMs and the noise HMM are then combined by the NOVO conversion to obtain a NOVO-HMM for each word to be recognized. With that alone, however, when noise-corrupted speech is recognized as shown in FIG. 9, the value of k(SNR) is 1 regardless of the type of noise.

[0045] In reality, however, the average residual power differs with the type of noise even at the same SN ratio, and the influence of the noise differs with its average residual power. For example, noise composed of relatively periodic components has a low average residual power, and in that case its effect on speech recognition is small. Noise composed of components with low periodicity, by contrast, has a high average residual power, and the higher the average residual power, the greater the influence of the noise. FIG. 7 corresponds to this case and differs from the outline in FIG. 6.

[0046] In this embodiment, when the noise HMM synthesis unit 115 synthesizes one noise HMM from the noise HMMs of the individual noises, each noise HMM is weighted according to the ratio of the average residual power of the standard speech to the average residual power of that noise; that is, each noise HMM is multiplied by its own weighting coefficient before being combined.

[0047] FIG. 5 is a flowchart of the procedure for calculating the internal parameters of the HMM in the NOVO conversion used in the speech recognition method according to an embodiment of the present invention. In the NOVO conversion, the cepstra that are the internal parameters of the standard speech HMM and the noise HMM are first converted to log spectra by a cosine transform (step 1).

[0048] Next, both are exponentiated to obtain linear spectra (step 2). The average residual power of the standard speech and the average residual power of the noise are then obtained. These are easy to obtain because the internal parameters of each HMM include a residual power term. The two linear spectra are then weighted according to the ratio of the average residual power of the standard speech to that of the noise and added (step 3).

[0049] In the invention according to claim 5, the ratio between the average residual power of the standard speech and the average residual power of the noise is used as a coefficient, weighting is performed by multiplying by this coefficient, and the two linear spectra are added. In the invention according to claim 6, a process of creating and evaluating a provisional NOVO-HMM with an arbitrarily chosen weight is repeated in advance while the weight is varied, so that a weight value giving a good recognition rate is found. Weighting is then performed with that previously obtained value as the coefficient, and the two linear spectra are added. In step 3, the linear spectra are added by either of these two methods.
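The weight search described for claim 6 could be sketched as a simple sweep over candidate weights. This is only an illustration under stated assumptions: the names `search_weight`, `build_novo_hmm`, and `recognition_rate` are hypothetical, and the two callables stand in for model creation and evaluation, which this section does not specify.

```python
def search_weight(candidates, build_novo_hmm, recognition_rate):
    """Return (best_weight, best_rate) over the candidate weights.

    build_novo_hmm(weight) -> provisional model      (hypothetical callable)
    recognition_rate(model) -> score, higher is better (hypothetical callable)
    """
    best_weight, best_rate = None, float("-inf")
    for weight in candidates:
        # Create a provisional NOVO-HMM with this weight and evaluate it.
        rate = recognition_rate(build_novo_hmm(weight))
        if rate > best_rate:
            best_weight, best_rate = weight, rate
    return best_weight, best_rate
```

The weight returned by such a sweep would then be fixed and reused as the coefficient when the NOVO-HMM actually used for recognition is created.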

[0050] In this way, the linear spectrum of the standard speech with the noise superimposed is created. The created linear spectrum is then logarithmically converted (step 4) and subjected to an inverse cosine (COS) transform (step 5), which yields the cepstrum of the standard speech with the noise superimposed.
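Steps 1 to 5 can be sketched for a single pair of cepstral mean vectors as follows. This is a minimal illustration, not the implementation of the embodiment: the names `dct_matrix` and `novo_combine_means`, the use of a square orthonormal cosine transform, and the choice to apply the weight `m` to the noise spectrum are all assumptions. The actual form of the weighted addition is given by (Equation 4) and (Equation 5), and variances are not handled here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix C of shape (n, n), so that C @ C.T == I."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (j + 0.5) * k / n)
    c[0, :] /= np.sqrt(2.0)
    return c

def novo_combine_means(speech_cep, noise_cep, m):
    """Combine speech and noise cepstral means (assumed weighting: speech + m * noise)."""
    c = dct_matrix(speech_cep.shape[0])
    # Step 1: cepstrum -> logarithmic spectrum (cosine transform).
    speech_log = c.T @ speech_cep
    noise_log = c.T @ noise_cep
    # Step 2: logarithmic spectrum -> linear spectrum (exponentiation).
    speech_lin = np.exp(speech_log)
    noise_lin = np.exp(noise_log)
    # Step 3: weighted addition of the two linear spectra (assumed form).
    combined_lin = speech_lin + m * noise_lin
    # Steps 4 and 5: logarithm, then the inverse of the step-1 transform.
    return c @ np.log(combined_lin)
```

With `m = 0.0` the function returns the speech cepstrum unchanged, which is a convenient sanity check that the forward and inverse transforms are consistent.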

[0051] The calculation formulas (Equation 1) and (Equation 2) for the addition of the two linear spectra, shown in the prior-art section, are modified as in (Equation 4) and (Equation 5) below.

[0052]

(Equation 4)

[0053]

(Equation 5)

[0054] Here, the coefficient m in the above equations is given by (Equation 6) or (Equation 7) below.

[0055]

(Equation 6)

[0056]

(Equation 7)

[0057] In the equations, N_residual-pow and S_residual-pow are the average values of the residual power of the noise and of the speech, respectively, used for training the HMMs. The other symbols have the same meanings as in the explanation of (Equation 1), (Equation 2), and (Equation 3) above.

[0058] FIG. 8 is a schematic diagram of the logarithmic spectrum represented by the NOVO-HMM created by the speech recognition method according to one embodiment of the present invention. It can be seen that it is close to FIG. 7, the schematic diagram of the logarithmic spectrum represented by the HMM obtained from noise-superimposed speech as input. Thus, regardless of the type of noise, this embodiment can represent noise-superimposed speech well, as FIG. 8 shows, and as a result noise-superimposed speech can be recognized with a high probability.

[0059]

[Effects of the Invention] As described above, the present invention provides a noise separation unit that separates input noise into a plurality of stationary noise components according to its power, a stationary noise learning unit that obtains a stationary noise HMM for each of the plurality of stationary noise components obtained by the noise separation unit, and a noise synthesis unit that synthesizes one non-stationary noise HMM from the plurality of stationary noise HMMs. The composite noise HMM obtained by synthesis from the plurality of stationary noise HMMs and the speech HMM of the standard pattern are subjected to NOVO conversion. Therefore, even if the environmental noise at the place where the speech recognition device is used is non-stationary, a NOVO-HMM that takes this non-stationary noise into account is generated, and speech can be recognized with high accuracy even if the noise changes during an utterance.

[0060] Furthermore, when the standard speech HMM and the noise HMM are combined, weighting with the ratio of the average residual powers of the speech and the noise as a coefficient allows speech to be recognized with high accuracy regardless of the type of superimposed noise.

[Brief Description of the Drawings]

FIG. 1 is a configuration block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a circuit block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 3 is a flowchart of a speech recognition method according to an embodiment of the present invention.

FIG. 4 is a conceptual diagram of the NOVO conversion used in the speech recognition method according to an embodiment of the present invention.

FIG. 5 is a flowchart of the calculation procedure of the NOVO conversion used in the speech recognition method according to an embodiment of the present invention.

FIG. 6 is a schematic diagram of the logarithmic spectrum obtained from a conventional NOVO-HMM.

FIG. 7 is a schematic diagram of the logarithmic spectrum obtained from an HMM created with noise-superimposed speech as input.

FIG. 8 is a schematic diagram of the logarithmic spectrum represented by the NOVO-HMM created by the speech recognition method according to an embodiment of the present invention.

FIG. 9 is a conceptual diagram of the conventional NOVO conversion.

FIG. 10 is a flowchart of the procedure for calculating the internal parameters of the HMM in the conventional NOVO conversion.

[Explanation of Reference Signs]

101 speech input unit
102 speech feature extraction unit
103 speech HMM learning unit
104 speech HMM storage unit
105 noise input unit
106 noise feature extraction unit
107 clustering unit
108 noise HMM learning unit
109 noise HMM storage unit
110 NOVO-HMM calculation unit
111 NOVO-HMM storage unit
112 signal input unit
113 signal feature extraction unit
114 recognition result determination unit
115 noise HMM synthesis unit
201 microphone
202 CPU
203 ROM
204 RAM
205 output device

Continued from the front page: (72) Inventor Hiroo Ikura, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka

Claims

[Claims]

[Claim 1] A speech recognition device comprising: a speech input unit that A/D-converts input speech and noise signals; a feature extraction unit that divides the input signal at predetermined intervals and analyzes it into frequency features for each interval; a speech HMM storage unit in which standard patterns of words to be recognized are stored; a noise separation unit that separates input noise into a plurality of stationary noise components according to its power; a stationary noise learning unit that obtains a stationary noise HMM for each of the plurality of stationary noise components obtained by the noise separation unit; a noise synthesis unit that synthesizes one composite noise HMM from the plurality of stationary noise HMMs; a NOVO-HMM calculation unit that synthesizes an HMM of speech with noise superimposed by performing NOVO conversion on the speech HMM and the composite noise HMM synthesized by the noise synthesis unit; and a recognition result determination unit that calculates likelihoods based on the features of a speech signal to be recognized and the HMM obtained by the NOVO-HMM calculation unit, and determines the most likely word as the recognition result.

[Claim 2] The speech recognition device according to claim 1, comprising a noise separation unit that separates the input noise according to the power spectrum of the noise so that each separated part can be approximated as stationary noise.

[Claim 3] A speech recognition device comprising: a speech input unit that A/D-converts input speech; a feature extraction unit that divides the input signal at predetermined intervals and analyzes it into frequency features for each interval; a speech HMM storage unit in which standard patterns of words to be recognized are stored; a noise separation unit that separates input noise into a plurality of stationary noise components according to its power; a stationary noise learning unit that obtains a stationary noise HMM for each of the plurality of stationary noise components obtained by the noise separation unit; a noise synthesis unit that synthesizes one composite noise HMM from the plurality of stationary noise HMMs; a noise HMM storage unit that stores the composite noise HMM obtained by the noise synthesis unit; and a NOVO-HMM calculation unit that synthesizes an HMM of speech with noise superimposed by performing NOVO conversion on the standard speech HMM of the standard pattern and the composite noise HMM synthesized by the noise synthesis unit, wherein: noise of the environment in which the speech recognition device is used is first recorded; the composite noise HMM obtained by performing in advance the processing in the noise separation unit, the processing in the stationary noise learning unit, and the noise synthesis processing in the noise synthesis unit is stored in the noise HMM storage unit; the NOVO-HMM calculation unit performs NOVO conversion based on the standard speech HMM of the standard pattern stored in the speech HMM storage unit and the composite noise HMM stored in the noise HMM storage unit to synthesize an HMM with noise superimposed; and, when a speech signal to be recognized is input, likelihoods are calculated based on the features of that speech signal and the HMM obtained by the NOVO-HMM calculation unit, and the most likely word is determined as the recognition result.

[Claim 4] A speech recognition method for recognizing speech on which noise is superimposed using HMMs, comprising: first recording noise of the environment in which a speech recognition device is used; separating this noise into a plurality of stationary noise components according to its power; obtaining a stationary noise HMM for each of the plurality of stationary noise components thus obtained; synthesizing one composite noise HMM from the plurality of stationary noise HMMs; synthesizing a NOVO-HMM with noise superimposed based on the standard speech HMM of the standard pattern and the composite noise HMM; and, when a speech signal to be recognized is input, calculating likelihoods based on the features of that speech signal and the NOVO-HMM and determining the most likely word as the recognition result.

[Claim 5] A speech recognition method for recognizing speech on which noise is superimposed using HMMs, wherein, when a standard speech HMM and a noise HMM are combined to create the NOVO-HMM used for recognition, weighting is performed with the ratio of the average residual powers of the speech and the noise as a coefficient.

[Claim 6] A speech recognition method for recognizing speech on which noise is superimposed using HMMs, wherein, in order to create the NOVO-HMM used for recognition, a process of creating and evaluating a provisional NOVO-HMM with an arbitrarily chosen weighting coefficient is repeated in advance while the weighting coefficient is varied, so that a weighting coefficient giving a good recognition rate is found, and, when the NOVO-HMM actually used for recognition is created, weighting is performed using the previously obtained coefficient.