JP2002091480A

JP2002091480A - Acoustic model generator and voice recognition device

Info

Publication number: JP2002091480A
Application number: JP2000283516A
Authority: JP
Inventors: Masaki Ida; 政樹伊田; Tomoko Matsui; 知子松井; Satoru Nakamura; 哲中村
Original assignee: ATR ONSEI GENGO TSUSHIN KENKYU; ATR Spoken Language Translation Research Laboratories
Current assignee: ATR ONSEI GENGO TSUSHIN KENKYU; ATR Spoken Language Translation Research Laboratories
Priority date: 2000-09-19
Filing date: 2000-09-19
Publication date: 2002-03-27

Abstract

PROBLEM TO BE SOLVED: To generate an acoustic model that is robust against the intrusion of unknown noise while keeping the amount of required computation small. SOLUTION: A Gaussian mixture model generating section 11 generates a single-state Gaussian mixture model with multiple mixture components, so that output likelihood is maximized, based on the waveform signal data of multiple kinds of environmental noise stored in a database memory 21 for learning. An HMM combining section 13 combines a prescribed noise-free speech HMM with the generated noise Gaussian mixture model by generating an adapted HMM in which, for every combination of their states, the Gaussian mixture distribution of each state is expressed as a sum of linear combinations of Gaussian distributions weighted by prescribed weighting coefficients. Meanwhile, a feature extracting section 3 extracts features from the speech signal of a naturally uttered sentence. A speech recognition section 4 performs speech recognition on the speech signal using the adapted HMM, based on the extracted features, and outputs a speech recognition result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to an acoustic model generation device for a speech recognition device, and to a speech recognition device.

[0002]

[Description of the Related Art] When a speech recognition system is used in a real environment, ambient environmental sounds inevitably degrade recognition performance. An acoustic model that is robust against the intrusion of ambient environmental sounds is therefore required. Because the environmental sound present at recognition time is not itself available when the model is built, such robust acoustic models are generated by predicting the intruding environmental sound in advance and adapting the model to it. However, the intruding environmental sound contains fluctuating components, so predicting it is often difficult.

[0003] Prior-art methods of adapting an acoustic model fall broadly into two categories. The first creates, at system design time, an acoustic model that assumes the environmental sounds expected at recognition time. For example, a noise hidden Markov model for training (hereinafter, a hidden Markov model is referred to as an HMM) is generated from a waveform database of known noises, a noise-free speech HMM is then trained using this noise HMM to produce an adapted HMM, and the adapted HMM is used in a speech recognition device (hereinafter referred to as the first conventional example).

[0004] The second adapts the acoustic model as needed using environmental sound data obtained at recognition time. Because the environmental sound at recognition time cannot itself be used for adaptation, a relatively small amount of environmental sound captured immediately before the speech input is generally used (hereinafter referred to as the second conventional example).

[0005]

[Problems to Be Solved by the Invention] The method of the first conventional example is highly robust against the intrusion of environmental sounds within the assumed range. However, it cannot cope with unknown noise and therefore lacks robustness. Moreover, when the intrusion of a wide variety of environmental sounds is assumed, every combination of speech and environmental sound must be considered, which is not realistic in terms of cost. That is, when the number of known noise types is increased, the amount of computation for the adapted HMM becomes very large.

[0006] In the method of the second conventional example, it is very difficult to predict every environmental sound that may occur during recognition from a small amount of data, so the method cannot cope with the intrusion of unexpected environmental sounds.

[0007] The former method presupposes that all intruding environmental sounds are known, and the latter is constrained by the assumption that the characteristics of the intruding environmental sound do not change. In actual use, environmental sounds generally contain fluctuating components, so these constraints are not always satisfied.

[0008] An object of the present invention is to solve the above problems and to provide an acoustic model generation device that can generate an acoustic model that is robust against the intrusion of unknown noise without increasing the amount of computation for the acoustic model, and a speech recognition device that uses the acoustic model generation device.

[0009]

[Means for Solving the Problems] An acoustic model generation device according to the present invention comprises: storage means for storing waveform signal data of a plurality of types of environmental noise for training; generating means for generating, based on the waveform signal data of the plurality of types of environmental noise for training stored in the storage means, a single-state Gaussian mixture model with a plurality of mixture components so that the output likelihood is maximized; and combining means for combining a predetermined noise-free speech HMM with the noise Gaussian mixture model generated by the generating means, by generating an adapted HMM that includes, for every combination of their states, a Gaussian mixture distribution of each state expressed as a sum of linear combinations of Gaussian distributions weighted by predetermined weighting coefficients.

[0010] A speech recognition device according to the present invention comprises: extracting means for extracting features from the speech signal of a naturally uttered sentence; and speech recognition means for performing speech recognition on the speech signal, based on the extracted features, using the adapted HMM generated by the combining means, and outputting a speech recognition result.

[0011]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0012] FIG. 1 is a block diagram showing the configuration of an HMM model generation device 100 and a speech recognition device 200 according to an embodiment of the present invention. The HMM model generation device 100 of this embodiment comprises a Gaussian mixture model generation unit 11, which generates a noise Gaussian mixture model from a training environmental noise waveform database containing noise waveforms of a plurality of types of environmental sounds, and an HMM combining unit 13, which generates an adapted HMM by training a noise-free speech HMM with a known HMM composition method using the generated noise Gaussian mixture model. Specifically, this embodiment is a method of building a speech model that is robust to environmental variation with a small amount of computation. To improve robustness when an unknown environmental sound intrudes, it is assumed in advance that various environmental sounds will intrude, and environmental adaptation is performed by HMM composition with a plurality of types of environmental sounds supplied as adaptation data. The environmental sounds are trained independently as an HMM, and the influence of the plurality of types of environmental sounds is applied to all speech models by a known HMM composition method (see, for example, Prior Art Document 1: F. Martin et al., "Recognition of Noisy Speech by Composition of Hidden Markov Models", IEICE Technical Report, SP92-96, pp. 9-16, 1992; and Prior Art Document 2: Yasuhiro Minami et al., "Likelihood maximization adaptation method based on HMM composition", IEICE Technical Report, SP95-24, June 1995).

[0013] To solve the above problems of the prior art, the technique used in this embodiment performs environmental adaptation on the assumption that various environmental sounds will intrude, in order to improve robustness when an unknown environmental sound intrudes. Various environmental sounds are trained independently as a noise Gaussian mixture model, and the influence of the plurality of types of environmental sounds is applied to all speech models by HMM composition. This makes it possible to build a speech model that is robust to environmental variation with a small amount of computation.

[0014] In FIG. 1, a noise-free speech waveform database memory 31 stores, for example, a large database of phoneme-labeled speech waveform signals (clean, with no noise) from a plurality of speakers. Based on this database, an HMM generation unit 12 uses the known EM (Expectation-Maximization) algorithm to generate a noise-free speech HMM so that the output likelihood is maximized, and outputs it to a noise-free speech HMM memory 32 for storage. Meanwhile, a training environmental noise waveform database memory 21 stores waveform signal data of a plurality of types of environmental noise for training, taken from, for example, the JEIDA noise database (see, for example, Prior Art Document 3: "JEIDA Noise Database", Japan Electronic Industry Development Association, http://www.jeida.or.jp/committee/humanmed/speech/noisedbj.html). Based on the waveform signal data stored in the database memory 21, the Gaussian mixture model generation unit 11 uses the known EM algorithm to generate a single-state noise Gaussian mixture model with a plurality of mixture components so that the output likelihood is maximized, and outputs it to a noise Gaussian mixture model memory 22 for storage. The HMM combining unit 13 then combines the noise-free speech HMM stored in the speech HMM memory 32 with the noise Gaussian mixture model stored in the model memory 22 using the known HMM composition method, thereby generating an adapted HMM, and outputs it to an adapted HMM memory 23 for storage.
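The EM training of the single-state noise Gaussian mixture model can be illustrated with a minimal sketch. It assumes one-dimensional features and evenly spread initial means, neither of which the text specifies; the function names `gaussian_pdf` and `fit_noise_gmm` are hypothetical, and a real system would work on multi-dimensional cepstral vectors.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_noise_gmm(samples, n_mix, n_iter=50):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances).

    Assumption: initial means are spread evenly over the data range.
    """
    lo, hi = min(samples), max(samples)
    means = [lo + (hi - lo) * (k + 0.5) / n_mix for k in range(n_mix)]
    overall_mean = sum(samples) / len(samples)
    overall_var = sum((x - overall_mean) ** 2 for x in samples) / len(samples)
    variances = [overall_var] * n_mix
    weights = [1.0 / n_mix] * n_mix

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample.
        resp = []
        for x in samples:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(n_mix):
            nk = sum(r[k] for r in resp) + 1e-12
            weights[k] = nk / len(samples)
            means[k] = sum(r[k] * x for r, x in zip(resp, samples)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2
                        for r, x in zip(resp, samples)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid collapse
    return weights, means, variances


# Toy "noise" data drawn from two different sources.
random.seed(0)
toy_noise = ([random.gauss(0.0, 1.0) for _ in range(200)]
             + [random.gauss(10.0, 1.0) for _ in range(200)])
toy_weights, toy_means, toy_vars = fit_noise_gmm(toy_noise, n_mix=2)
```

Each EM iteration does not decrease the likelihood of the training data, which is the sense in which the output likelihood is maximized.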

[0015] The HMM composition method used by the HMM combining unit 13 of this embodiment combines a speech HMM trained in a clean, noise-free environment with a noise Gaussian mixture model that has learned the characteristics of environmental sounds, to create an HMM for speech mixed with environmental sound. In this method, as illustrated in FIG. 2 of Prior Art Document 2, the Gaussian distributions of speech and noise in the cepstral domain are each cosine-transformed into Gaussian distributions of speech and noise in the log-spectral domain, and are then exponentially transformed into log-normal distributions of speech and noise in the linear-spectral domain. The log-normal distributions of speech and noise in the linear-spectral domain are then added together with weighting coefficients to produce the log-normal distribution of noise-superimposed speech in the linear-spectral domain. This distribution is logarithmically transformed into a Gaussian distribution of noise-superimposed speech in the log-spectral domain, and then inverse-cosine-transformed to obtain the Gaussian distribution of noise-superimposed speech in the cepstral domain. This is how output probabilities are combined in the HMM composition method.
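The exponential and logarithmic steps of this chain can be sketched for a single log-spectral bin. This is a simplified illustration, not the full method: the cosine and inverse cosine transforms between the cepstral and log-spectral domains are omitted, the standard log-normal moment formulas are assumed, and the function names and the `gain` parameter (standing in for the weighting coefficient applied to the noise) are hypothetical.

```python
import math


def log_to_linear(mean_log, var_log):
    """Gaussian in the log-spectral domain -> log-normal moments in the linear domain."""
    mean_lin = math.exp(mean_log + 0.5 * var_log)
    var_lin = mean_lin ** 2 * (math.exp(var_log) - 1.0)
    return mean_lin, var_lin


def linear_to_log(mean_lin, var_lin):
    """Inverse of log_to_linear: log-normal moments -> Gaussian in the log domain."""
    var_log = math.log(var_lin / mean_lin ** 2 + 1.0)
    mean_log = math.log(mean_lin) - 0.5 * var_log
    return mean_log, var_log


def combine_bin(speech, noise, gain=1.0):
    """Combine (mean, var) of speech and noise for one log-spectral bin.

    Assumption: speech and noise are independent and additive in the
    linear-spectral domain, with the noise scaled by `gain`.
    """
    s_mean, s_var = log_to_linear(*speech)
    n_mean, n_var = log_to_linear(*noise)
    mean_lin = s_mean + gain * n_mean
    var_lin = s_var + gain ** 2 * n_var
    return linear_to_log(mean_lin, var_lin)


# Example: a speech Gaussian and a noise Gaussian in one bin.
noisy_speech = combine_bin(speech=(1.0, 0.5), noise=(0.2, 0.3))
```

Setting `gain=0.0` returns the speech distribution unchanged, which is a convenient sanity check for the two transforms.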

[0016] The state of the noise Gaussian mixture model represents its output probability distribution as a Gaussian mixture distribution in order to express the diversity of environmental sounds. The output probability distribution of the HMM after composition is then expressed as the sum of the mixture distribution of the speech HMM and the mixture distribution of the noise Gaussian mixture model in the cepstral domain. That is, the Gaussian distributions making up the combined mixture are given by all pairwise combinations of the Gaussian distributions in each state of the speech HMM with those in the state of the noise Gaussian mixture model, and each combined mixture weight is the product of the two original weights. FIG. 2 shows the HMM training process by the HMM composition method in the HMM combining unit 13 of FIG. 1 for the case where the noise Gaussian mixture model has one state with one loop and the speech HMM has three states with one loop.

[0017] As described above, when the speech HMM and the noise Gaussian mixture model are combined and each output distribution is expressed as a Gaussian mixture distribution, the output distribution after composition consists of all combinations of their mixture components. The mean and variance of each component after composition are the sums of those of the original components, and the mixture weight of each component after composition is the product of the original mixture weights. FIG. 2 shows the derivation of the output probability distribution after composition when both the speech HMM and the noise Gaussian mixture model are expressed by two-component output distributions. In FIG. 2, N(·) denotes the mean and variance of each Gaussian distribution. The output probability distribution of the first state of the speech HMM is the weighted sum of Gaussian distributions S₁₁ and S₁₂, and the output probability distribution of the noise Gaussian mixture model is the weighted sum of N₁ and N₂; that is, each is a sum of linear combinations of Gaussian distributions weighted by predetermined weighting coefficients. Let the respective weighting coefficients be w_s11, w_s12, w_n1, and w_n2. The output distribution of the first state of the adapted HMM after composition is then the weighted sum of the four Gaussian distributions S₁₁+N₁, S₁₂+N₁, S₁₁+N₂, and S₁₂+N₂, with weights w_s11·w_n1, w_s12·w_n1, w_s11·w_n2, and w_s12·w_n2 respectively. In the same way, HMM composition is performed for the combination of the state of the noise Gaussian mixture model with the second state of the speech HMM, and for its combination with the third state of the speech HMM.
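The pairwise combination of mixture components for one speech state can be sketched as follows. The sketch assumes each component is already expressed as a (weight, mean, variance) triple in the domain where means and variances add, uses scalar values for brevity, and the function name `combine_state` and the numeric values are hypothetical.

```python
from itertools import product


def combine_state(speech_mix, noise_mix):
    """Combine every speech component with every noise component.

    Each mixture is a list of (weight, mean, variance) triples. The result
    lists the noise index in the outer loop so that the order is
    S11+N1, S12+N1, S11+N2, S12+N2 for two-component inputs.
    """
    return [
        (w_s * w_n, m_s + m_n, v_s + v_n)
        for (w_n, m_n, v_n), (w_s, m_s, v_s) in product(noise_mix, speech_mix)
    ]


# First state of a speech HMM (S11, S12) and a noise mixture (N1, N2).
speech_state_1 = [(0.6, 1.0, 0.10), (0.4, 2.0, 0.20)]
noise_gmm = [(0.7, 0.5, 0.05), (0.3, 1.5, 0.15)]
adapted_state_1 = combine_state(speech_state_1, noise_gmm)
```

Because both input weight sets sum to one, the products also sum to one, so the combined mixture needs no renormalization. The number of components per state grows as the product of the two mixture sizes.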

[0018] Accordingly, the HMM combining unit 13 combines the noise-free speech HMM stored in the speech HMM memory 32 with the noise Gaussian mixture model stored in the model memory 22, using the known HMM composition method, by generating an adapted HMM that includes, for every combination of their states, a Gaussian mixture distribution of each state expressed as a sum of linear combinations of Gaussian distributions weighted by predetermined weighting coefficients, and outputs the adapted HMM to the HMM memory 23 for storage.

[0019] In FIG. 1, the speech recognition device 200 comprises a microphone 1, an A/D converter 2, a feature extraction unit 3, and a speech recognition unit 4. The uttered speech of a naturally uttered sentence is input to the microphone 1 and converted into a speech signal, which the A/D converter 2 then converts into a digital speech data signal at a predetermined sampling frequency. Next, based on the input digital speech data signal, the feature extraction unit 3 performs, for example, LPC analysis to extract a feature vector comprising, for example, 12th-order mel-cepstral coefficients, 12th-order Δ mel-cepstral coefficients, power, and Δ power, and outputs it to the speech recognition unit 4. The speech recognition unit 4 calculates phoneme likelihoods using the adapted HMM stored in the HMM memory 23, calculates word likelihoods using predetermined phoneme-based word HMMs stored in advance in a word HMM memory 5, and performs speech recognition by determining the word, composed of phonemes, whose output likelihood is maximal. It then generates and outputs the character string of the maximum-likelihood word as the speech recognition result.

[0020] In the above embodiment, the Gaussian mixture model generation unit 11, the HMM generation unit 12, the HMM combining unit 13, the feature extraction unit 3, and the speech recognition unit 4 are implemented by an arithmetic and control device such as a digital computer. The word HMM memory 5, the training environmental noise waveform database memory 21, the Gaussian mixture model memory 22, the adapted HMM memory 23, the noise-free speech waveform database memory 31, and the noise-free speech HMM memory 32 are implemented by a storage device such as a hard disk memory.

[0021]

[Examples] The present inventors carried out a word recognition experiment using the HMM model generation device 100 and the speech recognition device 200 of this embodiment, and evaluated the acoustic model, that is, the adapted HMM, by its recognition performance. Table 1 shows the speech data analysis conditions used for training both the speech and noise HMMs and for recognition. The HMnet shown in Table 2 was used as the speech HMM trained in a clean, noise-free environment.

[0022]

[Table 1] Speech data analysis conditions
-----------------------------------------------------------------------
Sampling frequency:  16 kHz
Quantization bits:   16 bit
Feature analysis:    12th-order MFCC + 12th-order ΔMFCC + power + Δ power
Analysis frame:      20 msec
Frame shift:         10 msec
-----------------------------------------------------------------------
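The analysis conditions in Table 1 translate into frame sizes in samples as sketched below. The constant and function names are hypothetical, and dropping a trailing partial frame is an assumption that the table does not address.

```python
SAMPLE_RATE_HZ = 16000
FRAME_MS = 20
SHIFT_MS = 10

FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000    # samples per analysis frame
FRAME_SHIFT = SAMPLE_RATE_HZ * SHIFT_MS // 1000  # samples between frame starts
FEATURE_DIM = 12 + 12 + 1 + 1  # MFCC + delta MFCC + power + delta power


def frame_count(n_samples):
    """Number of full frames in a signal (a trailing partial frame is dropped)."""
    if n_samples < FRAME_LEN:
        return 0
    return (n_samples - FRAME_LEN) // FRAME_SHIFT + 1


def split_frames(samples):
    """Cut a sample sequence into overlapping analysis frames."""
    return [
        samples[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN]
        for i in range(frame_count(len(samples)))
    ]


one_second = list(range(SAMPLE_RATE_HZ))
frames = split_frames(one_second)
```

With a 20 msec frame and a 10 msec shift, consecutive frames overlap by half, and each frame yields one 26-dimensional feature vector.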

[0023]

[Table 2] Configuration of the clean speech HMM
-----------------------------------------------------------------------
26 phonemes
Context-dependent (triphone)
Gender-dependent
Total number of states:                     1403 states
Total number of distributions:              7030 distributions
Number of mixture components per state:     10 or 5
Maximum number of states per speech model:  4
Training data:                              speech database owned by the applicant (400 speakers; 19948 utterances)
-----------------------------------------------------------------------

[0024] The JEIDA noise database mentioned above was used as general environmental noise for training the noise Gaussian mixture model. Of this database, 12 types of noise totaling 4000 sec were used as training environmental sound data for the open-condition noise Gaussian mixture model (the noise environment encountered at recognition is unknown to the model). For comparison, one type of noise not included in these 4000 sec, totaling 400 sec, was used as training environmental sound data for the closed-condition noise Gaussian mixture model (the noise environment encountered at recognition is known). Table 3 shows the breakdown. The closed-condition data was also superimposed on the evaluation speech data to create the noisy evaluation speech data. The amplitude of each environmental sound was adjusted so that the SN ratio relative to the evaluation speech data was 15 dB.
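The amplitude adjustment to a 15 dB SN ratio can be sketched as follows. The sketch assumes the SN ratio is defined on average signal power over the whole utterance, which the text does not state; the function names and the toy signals are hypothetical.

```python
import math


def average_power(signal):
    """Mean squared amplitude of a sample sequence."""
    return sum(v * v for v in signal) / len(signal)


def scale_noise_to_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db."""
    target_noise_power = average_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / average_power(noise))
    return [gain * v for v in noise]


def superimpose(speech, noise, snr_db=15.0):
    """Add amplitude-adjusted noise to speech, sample by sample."""
    scaled = scale_noise_to_snr(speech, noise, snr_db)
    return [s + n for s, n in zip(speech, scaled)]


# Toy signals standing in for evaluation speech and environmental noise.
toy_speech = [math.sin(0.01 * i) for i in range(1000)]
toy_noise = [((i * 7919) % 13 - 6) / 6.0 for i in range(1000)]
scaled_noise = scale_noise_to_snr(toy_speech, toy_noise, snr_db=15.0)
noisy_eval = superimpose(toy_speech, toy_noise, snr_db=15.0)
```

At 15 dB the noise power is about 3.2% of the speech power, since 10^(-1.5) ≈ 0.0316.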

[0025]

[Table 3] Noise database (environmental noise waveform database)
-----------------------------------------------------------------------
Category   Content                   Total time [sec]
-----------------------------------------------------------------------
Open       Inside a moving car       800
           Station                   400
           Public telephone booth    200
           Factory                   400
           Sorting facility          400
           Main road                 200
           Intersection              200
           Crowd                     200
           Train                     400
           Computer room             400
           Air-conditioning noise    600
           Elevator hall             200
-----------------------------------------------------------------------
Closed     Exhibition hall           400
-----------------------------------------------------------------------

[0026] A noise Gaussian mixture model was trained from each set of environmental sound data. The noise Gaussian mixture model has one state with one loop, and the number of mixture components in its output probability distribution was set to 1 to 3. For each number of mixture components, an adapted HMM was generated both for the case where the environment at recognition time is known to the training data of the noise Gaussian mixture model (closed condition) and for the case where it is unknown (open condition). For comparison, two further models were evaluated: a noise-free HMM without environmental adaptation (the clean HMM, that is, the noise-free speech HMM stored in the speech HMM memory 32), and a speech model retrained after superimposing the closed-condition noise Gaussian mixture model training data on the speech data used to train the speech model (hereinafter referred to as the retrained HMM).

[0027] The recognition task was isolated-word recognition of 216 phoneme-balanced words using the phoneme-based word HMMs in a speech recognition device owned by the applicant. The evaluation data consisted of 216 utterances in total from one male speaker. Two evaluation sets were prepared: evaluation data containing no environmental sound (the noise-free test set), and data on which the same environmental sound as the closed-condition noise Gaussian mixture model training data was superimposed (the noisy test set). Table 4 shows the word recognition rates obtained in the experiment.

[0028]

[Table 4] Word recognition rate
---------------------------------------------------------------------------------------
HMM type         Noise Gaussian mixture model   Test set     Word recognition rate [%]
---------------------------------------------------------------------------------------
Noise-free HMM   ---                            Noise-free   94.6
Noise-free HMM   ---                            Noisy        81.8
---------------------------------------------------------------------------------------
Adapted HMM      Closed, 1 mixture              Noisy        85.3
Adapted HMM      Open, 1 mixture                Noisy        84.5
Adapted HMM      Closed, 2 mixtures             Noisy        87.6
Adapted HMM      Open, 2 mixtures               Noisy        84.9
Adapted HMM      Closed, 3 mixtures             Noisy        88.0
Adapted HMM      Open, 3 mixtures               Noisy        84.9
---------------------------------------------------------------------------------------
Retrained HMM    ---                            Noisy        89.0
---------------------------------------------------------------------------------------

[0029] As is clear from Table 4, when the noise-free clean HMM is used, recognition performance drops when environmental sound is mixed into the input speech, but the drop can be reduced by using the environment-adapted HMM. For the same number of mixture components in the noise Gaussian mixture model, the recognition rate when the environmental sound at recognition time is unknown during training of the noise Gaussian mixture model (open condition) is about 1 to 3 percentage points lower than when it is known (closed condition). When the environmental sound at recognition time is known, recognition performance improves as the number of mixture components increases, and with three components it is comparable to that of the retrained acoustic model (the retrained HMM). When the environmental sound at recognition time is unknown, by contrast, little improvement is observed. One possible cause is that too few types of environmental sound were used to train the noise Gaussian mixture model in the unknown case.
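The comparisons drawn from Table 4 can be checked with a few lines of arithmetic. The figures are copied from the table; the variable names are hypothetical.

```python
# Word recognition rates [%] on the noisy test set, from Table 4.
closed = {1: 85.3, 2: 87.6, 3: 88.0}
open_ = {1: 84.5, 2: 84.9, 3: 84.9}
clean_hmm_noisy = 81.8
retrained_hmm = 89.0

# Gap between closed and open conditions for each number of mixtures.
closed_minus_open = {m: round(closed[m] - open_[m], 1) for m in (1, 2, 3)}

# Gain of each adapted HMM over the unadapted clean HMM.
gain_closed = {m: round(closed[m] - clean_hmm_noisy, 1) for m in (1, 2, 3)}
gain_open = {m: round(open_[m] - clean_hmm_noisy, 1) for m in (1, 2, 3)}

# Remaining gap between the best closed-condition model and the retrained HMM.
gap_to_retrained = round(retrained_hmm - closed[3], 1)
```

The closed-condition model with three mixtures comes within 1.0 point of the retrained HMM, while the open-condition models still gain about 3 points over the clean HMM.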

【００３０】[0030]

【発明の効果】[Effects of the Invention] As described in detail above, according to the acoustic model generation device of the present invention, a one-state, multi-mixture Gaussian mixture model is generated from the waveform signal data of a plurality of types of environmental noise for training such that the output likelihood is maximized. A predetermined noise-free speech HMM and the noise Gaussian mixture model generated by the generating means are then combined by generating an adapted HMM that includes, for every combination of their states, a mixture Gaussian distribution for each state expressed as a sum of linear combinations of the Gaussian distributions weighted by predetermined weighting coefficients. The following specific effects are therefore obtained.

(1) Because the Gaussian mixture model is generated from waveform signal data of a plurality of types of environmental noise, the adapted HMM obtained by combining this Gaussian mixture model with the speech HMM is robust against contamination by unknown noise.

(2) Using a multi-mixture model as the noise model makes it possible to build a noise model that is effective for a wide variety of noises, and improves tolerance to temporal variation in the noise.

(3) When the weighting coefficients of the noise model are adapted using field data or the like, only the weighting coefficients of the combined model need to be fitted to the field data. The amount of computation is therefore greatly reduced compared with the conventional example, and environmental adaptation can be performed at high speed even for a large-scale acoustic model.
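The combination over all state pairs and the weight-only adaptation described above can be illustrated with a small sketch. It is not the implementation of this publication: the function names (`compose_state`, `reestimate_weights`), the scalar features, and the example numbers are all hypothetical, and the rule that means and variances simply add is an assumption made here for illustration (independent, additive speech and noise in the feature domain). The text above specifies only that each state is a weighted linear combination of Gaussians.

```python
import math
from itertools import product


def compose_state(speech_mix, noise_mix):
    """Combine one speech-HMM state mixture with the one-state noise GMM.

    Each mixture is a list of (weight, mean, variance) tuples for a scalar
    feature. Every speech/noise component pair yields one composite component
    whose weight is the product of the two weights.
    ASSUMPTION (not from the source): speech and noise are independent and
    additive in this feature domain, so means and variances add.
    """
    return [(ws * wn, ms + mn, vs + vn)
            for (ws, ms, vs), (wn, mn, vn) in product(speech_mix, noise_mix)]


def gauss_pdf(x, mean, var):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def reestimate_weights(mix, frames, iterations=10):
    """EM re-estimation of the mixture weights only.

    Means and variances stay fixed, which is what keeps the adaptation cheap.
    """
    weights = [w for w, _, _ in mix]
    for _ in range(iterations):
        counts = [0.0] * len(mix)
        for x in frames:
            joint = [w * gauss_pdf(x, m, v) for w, (_, m, v) in zip(weights, mix)]
            total = sum(joint)
            if total == 0.0:
                continue  # frame is numerically outside every component
            for k, j in enumerate(joint):
                counts[k] += j / total  # posterior responsibility
        norm = sum(counts)
        if norm == 0.0:
            break
        weights = [c / norm for c in counts]
    return [(w, m, v) for w, (_, m, v) in zip(weights, mix)]


# Hypothetical example: a 2-mixture speech state and a 3-mixture noise GMM.
speech_state = [(0.6, 1.0, 0.5), (0.4, 3.0, 0.8)]
noise_gmm = [(0.5, 0.2, 0.1), (0.3, 1.0, 0.3), (0.2, 2.5, 0.6)]
adapted_state = compose_state(speech_state, noise_gmm)   # 2 x 3 = 6 components
field_frames = [1.1, 1.2, 1.3, 1.25]                      # made-up field data
readapted_state = reestimate_weights(adapted_state, field_frames)
```

In this sketch the composite state has as many components as there are speech/noise component pairs, and fitting to field data touches only one scalar weight per component. That is the reason effect (3) can claim a reduced amount of computation.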

【００３１】[0031]

Further, according to the speech recognition device of the present invention, features are extracted from the speech signal of a spontaneously uttered sentence, and, based on the extracted features, the speech signal is recognized using the combined, adapted HMM and a speech recognition result is output. Therefore, a speech signal contaminated by unknown noise can be recognized at a higher recognition rate than with the conventional example, and a speech recognition device that is robust to speech with superimposed noise can be provided.

【図面の簡単な説明】[Brief description of the drawings]

【図１】FIG. 1 is a block diagram showing the configuration of an HMM model generation device 100 and a speech recognition device 200 according to an embodiment of the present invention.

【図２】FIG. 2 is an explanatory diagram showing the HMM learning process performed by the HMM composition method in the HMM combining unit 13 of FIG. 1.

【符号の説明】[Explanation of symbols]

1: microphone
2: A/D converter
3: feature extraction unit
4: speech recognition unit
5: word HMM memory
11: Gaussian mixture model generation unit
12: HMM generation unit
13: HMM combining unit
21: environmental noise waveform database memory for learning
22: Gaussian mixture model memory
23: adapted HMM memory
31: noise-free speech waveform database memory
32: noise-free speech HMM memory
100: HMM model generation device
200: speech recognition device

Continued from the front page

(72) Inventor: Tomoko Matsui, c/o ATR Spoken Language Translation Research Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Satoru Nakamura, c/o ATR Spoken Language Translation Research Laboratories, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto
F-term (reference): 5D015 GG05 HH05 HH21

Claims

【特許請求の範囲】[Claims]

【請求項１】1. An acoustic model generation device comprising:
storage means for storing waveform signal data of a plurality of types of environmental noise for learning;
generating means for generating a one-state, multi-mixture Gaussian mixture model, based on the waveform signal data of the plurality of types of environmental noise for learning stored in the storage means, such that the output likelihood is maximized; and
combining means for combining a predetermined noise-free speech hidden Markov model with the noise Gaussian mixture model generated by the generating means, by generating an adapted hidden Markov model that includes, for every combination of their states, a mixture Gaussian distribution for each state expressed as a sum of linear combinations of the Gaussian distributions weighted by predetermined weighting coefficients.

【請求項２】2. A speech recognition device comprising:
extraction means for extracting features from the speech signal of a spontaneously uttered sentence; and
speech recognition means for recognizing the speech signal, based on the extracted features, using the adapted hidden Markov model generated by the combining means according to claim 1, and outputting a speech recognition result.