JP2003076393A

JP2003076393A - Method for estimating voice in noisy environment and voice recognition method

Info

Publication number: JP2003076393A
Application number: JP2001264882A
Authority: JP
Inventors: Ikuyo Katsuse; 郁代勝瀬
Original assignee: INST OF SYSTEMS INFORMATION TE; INST OF SYSTEMS INFORMATION TECHNOLOGIES KYUSHU; WAVE COME KK
Current assignee: INST OF SYSTEMS INFORMATION TE; INST OF SYSTEMS INFORMATION TECHNOLOGIES KYUSHU; WAVE COME KK
Priority date: 2001-08-31
Filing date: 2001-08-31
Publication date: 2003-03-14

Abstract

PROBLEM TO BE SOLVED: To provide a speech estimation method and a speech recognition method that operate robustly even on a speech signal captured in a noisy environment or a speech signal into which noise has been mixed on a communication line. SOLUTION: The speech estimation method includes a step of segmenting the input acoustic signal into short-time segments, an acoustic analysis step of performing short-time frequency analysis, an element estimation step of estimating the elements required for speech estimation, and a speech estimation step of estimating the speech from the elements obtained in the element estimation step. Concretely, the input acoustic signal is segmented into short-time segments and analyzed by short-time frequency analysis; the spectral envelopes of speech held in a speech recognition codebook are used as knowledge to generate sound models; the spectral information obtained by the short-time frequency analysis is regarded as a probability density function; the mixture weights are estimated by maximum a posteriori (MAP) estimation; and, at each time, the hypothesis that the element which generated the sound model with the largest weight is present is judged to have the highest likelihood, and that element is output.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech estimation method and a speech recognition method that operate robustly even on a speech signal input in a noisy environment or a speech signal into which noise has been mixed on a communication channel.

[0002]

[Prior Art] Existing speech recognition algorithms are extremely vulnerable to noise contamination; unless some noise countermeasure is taken, even a minor noise such as the sound of an air conditioner degrades recognition performance. Several noise countermeasures have therefore been studied (reference: Riichi Nakagawa, "Acoustic signal processing for robust speech recognition", Journal of the Acoustical Society of Japan 53(11), pp. 864-874, 1997). Conventional techniques applicable to speech recognition in noise fall broadly into two classes: methods based on speech enhancement, and methods that perform recognition without separating noise from speech. Speech enhancement in turn divides roughly into approaches that reduce to the inverse problem of solving the noise-speech mixing process from multiple input signals, and approaches that use the time-frequency characteristics of the noise or the target speech to subtract the noise component or emphasize the speech component.

[0003] Accurately estimating speech information from noisy speech is useful for many technologies, including speech recognition and speech synthesis. Because the underlying problem is considered to be the same in each case, speech recognition is used as the example here.

[0004] The representative approach that does not separate noise from speech is adaptation of HMM (Hidden Markov Model) models by HMM decomposition or HMM composition. Barker et al. proposed a method that estimates the target speech in the time-spectral domain but performs recognition without separation; this method can be said to straddle the two classes (reference: J. Barker, M. Cooke and D. Ellis, "Decoding speech in the presence of other sound sources", 2000).

[0005] (1) Speech enhancement: methods that separate the speech from the noise and then perform speech recognition

[0006] - Methods that mainly exploit sound propagation characteristics
* Independent component analysis (reference: K. Torkkola, "Blind separation for audio signals - Are we there yet?", ICA99 proceedings, pp. 239-244, 1999)
* Microphone array (reference: Yutaka Kaneda, "Application of digital filters in microphone systems: techniques for removing unwanted sound", Journal of the Acoustical Society of Japan, 45, 2, pp. 125-128, 1989)
* Noise canceller (references: B. Widrow et al., "Adaptive Noise Canceling: Principles and Applications", Proc. IEEE, 63, pp. 1692-1716, 1975; Shigeo Tsujii, Adaptive Signal Processing, ISBN4-7856-2011-0, 1995)

[0007] - Methods that mainly exploit spectrally represented information

[0008] * Spectral subtraction (references: T. Ifukube, Welfare Engineering of Sound, ISBN4-339-01103-7, 1997; P. Lockwood and J. Boudy, "Experiments with a nonlinear spectral subtracter (NSS), Hidden Markov models and the projection, for robust speech recognition in cars", Speech Communication, 11, pp. 215-228, 1992)
* Computational Auditory Scene Analysis (reference: M. Cooke, Modeling Auditory Processing and Organization, Cambridge University Press, 1993)

[0009] (2) Methods that perform speech recognition without separating noise from speech
- Model adaptation by the HMM decomposition/composition method (references: M. J. F. Gales and S. J. Young, "Cepstral parameter compensation for HMM recognition in noise", Speech Communication, 12, pp. 231-239; Tetsuya Takiguchi, Satoshi Nakamura and Kiyohiro Shikano, "Model adaptation by the HMM decomposition/composition method in noisy and reverberant environments", IEICE Transactions (D-II), J81-D-II, 10, pp. 2231-2238, 1998; Kazuhiro Miki, Takanobu Nishiura, Satoshi Nakamura and Kiyohiro Shikano, "Speech recognition in noise and reverberation using a microphone array and the HMM decomposition/composition method", IEICE Transactions (D-II), J83-D-II, 11, pp. 2206-2214, 2000)

[0010] Next, the conventional methods are outlined.

[0011] (a) Noise canceller
The basic form of a noise canceller is structured as shown in FIG. 1. It has two input terminals. One receives the mixture of the noise n(t) and the speech s(t). The other must be placed as close as possible to the noise source to be removed and must be affected as little as possible by the signal s(t). The method works quite effectively when the correlation between the noise n(t) and the speech s(t) is weak, but little benefit can be expected when the correlation is strong.
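As a rough illustration of the two-input arrangement, the following minimal sketch uses an LMS adaptive filter, which is one common way to realize a noise canceller. The source does not state which adaptation rule FIG. 1 uses, so the LMS rule, the function name `lms_noise_canceller`, the tap count, the step size and the test tones are all assumptions made for illustration.

```python
import math

def lms_noise_canceller(primary, reference, taps=4, mu=0.01):
    """Return the LMS error signal, which serves as the speech estimate.

    primary:   samples of speech plus noise (first input terminal)
    reference: samples of the noise alone   (second input terminal)
    """
    weights = [0.0] * taps
    output = []
    for t in range(len(primary)):
        # Most recent `taps` reference samples (zero before the signal starts).
        x = [reference[t - k] if t - k >= 0 else 0.0 for k in range(taps)]
        noise_estimate = sum(w * xk for w, xk in zip(weights, x))
        error = primary[t] - noise_estimate  # what remains is the speech estimate
        # LMS update: step the weights along error times input.
        weights = [w + 2.0 * mu * error * xk for w, xk in zip(weights, x)]
        output.append(error)
    return output

# Hypothetical test signals: a slow "speech" tone and a faster noise tone.
n_samples = 4000
speech = [0.5 * math.sin(2 * math.pi * 0.01 * t) for t in range(n_samples)]
noise = [math.sin(2 * math.pi * 0.13 * t) for t in range(n_samples)]
primary = [s + 0.8 * n for s, n in zip(speech, noise)]

cleaned = lms_noise_canceller(primary, noise)

def tail_mse(a, b, start=3000):
    """Mean squared difference over the samples after adaptation has settled."""
    return sum((x - y) ** 2 for x, y in zip(a[start:], b[start:])) / (len(a) - start)

residual_before = tail_mse(primary, speech)
residual_after = tail_mse(cleaned, speech)
```

In this toy case the reference contains no speech at all, which is the favorable situation the paragraph describes; leakage of s(t) into the reference input would cause part of the speech to be cancelled as well.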

[0012] (b) Spectral subtraction
When the noise is stationary and its frequency characteristics are known in advance, the power spectrum of the speech signal can be estimated by subtracting the short-time power spectrum of the noise from the short-time power spectrum of the noisy speech signal. This technique is called spectral subtraction. It is simple and widely used, but it cannot adequately handle non-stationary or unknown noise.
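The subtraction itself can be sketched in a few lines. The sketch below works on one frame of power-spectrum values held in plain lists; clamping negative results to a floor of zero is an assumption, since the text does not say how negative differences are handled, and the function name is hypothetical.

```python
def spectral_subtraction(noisy_power, noise_power, floor=0.0):
    """Subtract a known noise power spectrum from a noisy one, bin by bin."""
    if len(noisy_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    # A power value cannot be negative, so clamp each difference at `floor`.
    return [max(p - n, floor) for p, n in zip(noisy_power, noise_power)]

noisy = [5.0, 9.0, 2.0, 1.0]   # short-time power spectrum of noisy speech
noise = [1.0, 1.0, 1.0, 1.5]   # noise power spectrum assumed known in advance
estimate = spectral_subtraction(noisy, noise)
```

The last bin shows the weakness noted in the text: when the actual noise differs from the assumed noise spectrum, the difference is clamped and the speech estimate in that bin is lost.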

[0013] (c) Independent component analysis
Given the mixed signals x_i(t), independent component analysis uses the statistical properties of the signals s_j to be estimated in order to estimate the spatiotemporal mixing process h_ij.

[Equation 1]

As a technique for separating sound sources in a real environment, this approach still has many difficulties: the statistical properties assumed for the signals s_j do not necessarily match those of the target speech, the number of sound sources must be known, and it is hard to follow the dynamic changes of an actual sound environment.

[0014] (d) Microphone array
A microphone-array system arranges multiple directional microphones and, while estimating the direction of the sound source, extracts only the speech arriving from a particular direction or suppresses the noise arriving from a particular direction. This function is implemented by the microphone array and FIR filters connected after it. The approach can be expected to give some results in a fixed environment such as a conference room, but environments that change dynamically, or in which a microphone array cannot be installed, are the common case, so it is not practical.

[0015] (e) Model adaptation by the HMM composition method
FIG. 2 shows the speech recognition method for noisy and reverberant environments using the HMM composition method, as introduced in the reference (Tetsuya Takiguchi, Satoshi Nakamura and Kiyohiro Shikano, "Model adaptation by the HMM decomposition/composition method in noisy and reverberant environments", IEICE Transactions (D-II), J81-D-II, 10, pp. 2231-2238, 1998). It shows that, to obtain the mixed HMM, the clean-speech HMM, the noise HMM and the acoustic transfer characteristics must be known or estimable. The problem is that the noise model is difficult to estimate, at least when the noise is non-stationary and hard to predict.

[0016]

[Problems to be Solved by the Invention] Among the real environments in which speech recognition is wanted, a relatively stable noise environment such as the interior of a moving automobile is rare. Noise is usually non-stationary, and the number, type, level and frequency of occurrence of the noise sources are often unpredictable. It is also hard to imagine a real environment in which every object making up the space, including the positions of the noise sources and of the speaker, is fixed. Methods that fix the spatial propagation characteristics of the noise and speech in advance, or that estimate the sound source on the assumption that those characteristics are constant, therefore cannot adequately follow changes in the environment. To achieve robust speech recognition in a realistic environment, the first requirement is a method that depends on neither the noise characteristics nor the spatial transfer characteristics.

[0017] An object of the present invention is to provide a speech estimation method and a speech recognition method that operate robustly even on a speech signal input in a noisy environment or a speech signal into which noise has been mixed on a communication channel.

[0018]

[Means for Solving the Problems] To solve the above problem, the method of the present invention for estimating speech in noise includes an acoustic analysis step of cutting the input acoustic signal into short-time segments and performing short-time frequency analysis, an element estimation step of estimating the elements required for speech estimation, and a speech estimation step of estimating the speech using the elements obtained in the element estimation step. More specifically, it is a speech recognition method for noisy environments that includes: an acoustic analysis step of cutting the input acoustic signal into short-time segments and performing short-time frequency analysis; a step of generating sound models using, as knowledge, the spectral envelopes of speech that a speech recognizer holds in its codebook; an estimation step of regarding the spectral information obtained in the acoustic analysis step as a probability density function and estimating the mixture weights by maximum a posteriori estimation; and a recognition step of taking, at each time, the hypothesis that the element which generated the sound model with the largest weight is present to have the highest likelihood, and outputting that element.

[0019]

[Embodiments of the Invention] If prior knowledge is to be required, assumptions made, or anything estimated at all, the object of that knowledge, assumption or estimation should be the "target" one actually wants to know (the spoken word), not the environment-dependent noise or spatial transfer characteristics. Even if the noise characteristics or the spatial transfer characteristics are estimated, they are not the information that is wanted, so they are discarded in the end. Because the present invention makes the target the object of estimation, the noise characteristics need not be known in advance, and they are not estimated during processing either. The noise may be non-stationary and unpredictable. The method does not depend on the number of noise sources, nor on the type of noise (although a high-level noise that is structurally very similar to speech would presumably have an effect). For use in real environments, the method can therefore be said to be more robust than conventional methods. Sound is picked up with a single microphone, and the spatial transfer characteristics from the noise sources or the speech source to the pickup point need not be known, so the method is also robust to changes in the environment.

[0020] First, the input acoustic signal is cut into short-time segments and short-time frequency analysis is performed with high frequency resolution. The analysis is made as detailed as possible so that the fine structure of the spectrum is preserved.

[0021] Next, sound models are generated using, as knowledge, the spectral envelopes of speech that a conventional speech recognizer holds in its codebook. M models are assumed for each fundamental frequency, where M corresponds to the number of codebook entries. Each sound model is created by sampling (discretizing) the spectral envelope corresponding to a codebook centroid at the harmonics of that fundamental frequency. Every frequency in the range that speech can take is assumed as a fundamental frequency. Each sound model is expressed as a probability density function.
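The sampling of an envelope at the harmonics of a candidate fundamental frequency can be sketched as follows. The normalization to unit sum, the function names and the toy envelope are assumptions for illustration; the text only states that the envelope is discretized at the harmonics and that the model is a probability density function.

```python
def harmonic_weights(envelope, f0, n_harmonics):
    """Sample a spectral envelope at f0, 2*f0, ... and normalize to sum to 1."""
    samples = [envelope(f0 * h) for h in range(1, n_harmonics + 1)]
    total = sum(samples)
    if total <= 0.0:
        raise ValueError("envelope has no energy at the harmonics")
    return [s / total for s in samples]

def toy_envelope(freq_hz):
    """Hypothetical codebook envelope with one broad peak near 500 Hz."""
    return 1.0 / (1.0 + ((freq_hz - 500.0) / 200.0) ** 2)

# One sound model: the same envelope sampled for a 125 Hz fundamental.
weights = harmonic_weights(toy_envelope, 125.0, 8)
strongest_harmonic = weights.index(max(weights)) + 1  # 1-based harmonic number
```

Repeating the call for every candidate fundamental frequency and every codebook entry would give the full set of sound models, M per fundamental frequency.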

[0022] The spectral information obtained by the acoustic analysis is regarded as a probability density function, and that function is considered to have been generated from a mixture distribution model of the sound models. The mixture weights are then estimated by maximum a posteriori estimation using the EM (Expectation-Maximization) algorithm. Because the weights are estimated independently for each short-time segment, estimation errors do not propagate in time. This is an important property that contributes to the robustness of the whole process.
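The weight estimation for a single segment can be sketched with a simplified EM loop. This sketch assumes discrete frequency bins, keeps the component distributions fixed, and omits the prior, so it computes the maximum-likelihood weights rather than the MAP estimate described in the text. All names and the toy distributions are hypothetical.

```python
def em_mixture_weights(observed, components, iterations=100):
    """Estimate mixture weights for fixed components from an observed distribution.

    observed:   observed distribution over bins (sums to 1)
    components: one distribution over the same bins per sound model
    """
    k = len(components)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        new_weights = [0.0] * k
        for i, p_obs in enumerate(observed):
            mix = sum(weights[j] * components[j][i] for j in range(k))
            if mix <= 0.0:
                continue
            for j in range(k):
                # E-step: share of bin i explained by component j,
                # accumulated with the observed mass of that bin (M-step).
                new_weights[j] += p_obs * weights[j] * components[j][i] / mix
        total = sum(new_weights)
        weights = [w / total for w in new_weights]
    return weights

model_a = [0.7, 0.2, 0.1]
model_b = [0.1, 0.2, 0.7]
# Toy observation built as 80 % of model A plus 20 % of model B.
observed = [0.8 * a + 0.2 * b for a, b in zip(model_a, model_b)]
estimated = em_mixture_weights(observed, [model_a, model_b])
```

Each call uses only the distribution of one segment, which mirrors the point made above: nothing is carried over from one segment to the next, so an error in one segment cannot propagate.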

[0023] Finally, at each time, the hypothesis that the element which generated the sound model with the largest weight is present is taken to have the highest likelihood, and that element is output. Because the parameters of the target speech are estimated without using any prior knowledge of the noise and without modeling the noise, robust speech recognition can be achieved independently of the noise characteristics.
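The final selection step amounts to an argmax over the weights of each segment. A minimal sketch, with hypothetical element labels and weight values:

```python
def dominant_elements(weights_per_frame, element_labels):
    """For each frame, return the label of the element with the largest weight."""
    result = []
    for weights in weights_per_frame:
        best = max(range(len(weights)), key=lambda k: weights[k])
        result.append(element_labels[best])
    return result

labels = ["a", "i", "u"]            # hypothetical codebook elements
frame_weights = [
    [0.2, 0.7, 0.1],                # frame 1: "i" dominates
    [0.6, 0.3, 0.1],                # frame 2: "a" dominates
    [0.1, 0.1, 0.8],                # frame 3: "u" dominates
]
decoded = dominant_elements(frame_weights, labels)
```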

[0024]

[Examples] Examples of the present invention are described below.

Example 1
A simple application to a small-vocabulary, speaker-dependent speech recognition system is shown.

[0025] The system consists of four main processes, which can be described conceptually as follows.
(1) Acoustic templates (sound models) are created from speech in which the candidate words are uttered.
(2) The observed signal is analyzed by short-time spectral analysis.
(3) Probability distributions corresponding to the harmonic structure of every fundamental frequency that the pitch can take are considered. Multiple probability distributions are assumed for each fundamental frequency. The envelope of each such probability distribution is called a sound model and is generated from the spectral envelope of an acoustic template. The frequency components of the observed signal are then modeled as a mixture of these probability distributions, and the mixture weights are estimated at each time using the EM algorithm.
(4) The sound-model weights estimated at each time are treated as the similarity between the input signal and the acoustic templates, and similarity-based DP matching is performed. The candidate word with the largest "word similarity" is the recognition result.

[0026] Each item is explained theoretically below.

[0027] 1. Acoustic analysis of the observed signal
The observed signal is pre-emphasized (high-frequency emphasis) and a short-time Fourier transform is computed at intervals of T.

[Equation 2]

The window function h(t) is the function obtained by convolving a Gaussian function, which gives optimal time-frequency localization, with a second-order cardinal B-spline (Bartlett window):

[Equation 3]

[Equation 4]

Because a single short-time Fourier transform would give poor frequency resolution or time resolution in some frequency bands, a multirate filter bank such as the one in FIG. 3 is used.
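The first two operations, pre-emphasis and cutting into short-time segments, can be sketched as below. The first-order difference filter, its coefficient and the function names are assumptions; the text does not give the pre-emphasis filter, and the sketch covers neither the Gaussian/B-spline window nor the multirate filter bank.

```python
def pre_emphasize(signal, coeff=0.97):
    """First-order high-frequency emphasis: y[t] = x[t] - coeff * x[t-1]."""
    if not signal:
        return []
    return [signal[0]] + [signal[t] - coeff * signal[t - 1]
                          for t in range(1, len(signal))]

def short_time_segments(signal, length, hop):
    """Cut the signal into overlapping segments of `length` samples every `hop`."""
    return [signal[start:start + length]
            for start in range(0, len(signal) - length + 1, hop)]

emphasized = pre_emphasize([1.0, 1.0, 1.0], coeff=0.5)
segments = short_time_segments(list(range(6)), length=4, hop=2)
```

Here `hop` plays the role of the interval T in samples; each segment would then be windowed and transformed.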

[0028] Next, the instantaneous frequency is obtained from the filter bank output. If the filter bank output is

X(ω,t) = a + jb    (5)

then its instantaneous frequency is given by

[Equation 5]
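Since the equation itself is not reproduced in this text, the sketch below uses the general definition of instantaneous frequency as the time derivative of the phase of a + jb, approximated by the phase advance between two successive outputs. Whether the source's formula adds the filter center frequency depends on its transform convention, which is not shown, so this is an assumption; the function name and test values are hypothetical.

```python
import cmath
import math

def instantaneous_frequency_hz(x_now, x_next, dt):
    """Phase advance between two successive complex outputs, converted to Hz.

    Valid while the true phase advance per step stays below pi in magnitude.
    """
    phase_step = cmath.phase(x_next * x_now.conjugate())
    return phase_step / (2.0 * math.pi * dt)

dt = 1.0e-4      # hypothetical sampling interval of the channel output, seconds
f_true = 100.0   # frequency of a test component, Hz
x0 = cmath.exp(1j * 2.0 * math.pi * f_true * 0.0)
x1 = cmath.exp(1j * 2.0 * math.pi * f_true * dt)
estimate = instantaneous_frequency_hz(x0, x1, dt)
```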

[0029] Next, candidate frequency components are extracted based on the mapping from the center frequency ω of a filter to its instantaneous frequency λ(ω,t). Considering the mapping from the center frequency of the filter bank to the instantaneous frequency of its output, when there is one dominant component, that component lies at a fixed point (equilibrium point) of the mapping, and the instantaneous frequency values in its neighborhood are nearly constant. The candidate frequency components are therefore the fixed points of the mapping from filter center frequency to instantaneous frequency, and can be obtained as follows (reference: Hideki Kawahara, Haruhiro Katayose, R. Patterson and A. de Cheveigne, "On high-accuracy extraction of the fundamental frequency using instantaneous frequency", Acoustical Society of Japan, Hearing Research Meeting material H-98-116, 1998).
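One way to locate such fixed points on a discrete set of channels is to look for places where λ(ω) − ω changes sign from non-negative to negative and interpolate linearly. The exact condition in the source's equation is not reproduced in this text, so the downward-crossing test below is an assumption, as are the function name and the sample values.

```python
def fixed_point_candidates(center_freqs, inst_freqs):
    """Return frequencies where the map center -> instantaneous frequency has a
    fixed point, detected as a downward zero crossing of (inst - center)."""
    diffs = [lam - w for w, lam in zip(center_freqs, inst_freqs)]
    candidates = []
    for i in range(len(diffs) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 >= 0.0 and d1 < 0.0:
            # Linear interpolation of the crossing between channels i and i+1.
            frac = d0 / (d0 - d1)
            candidates.append(center_freqs[i]
                              + frac * (center_freqs[i + 1] - center_freqs[i]))
    return candidates

centers = [90.0, 100.0, 110.0, 120.0]    # hypothetical channel center frequencies
inst = [100.0, 102.0, 102.0, 125.0]      # instantaneous frequency of each channel
found = fixed_point_candidates(centers, inst)
```

In the toy data the channels around 100 Hz all report roughly 102 Hz, which is the "nearly constant" neighborhood the paragraph describes; the upward crossing between 110 Hz and 120 Hz is ignored.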

[0030]

[Equation 6]

Furthermore, since the power of a frequency component is obtained as the value of the power spectrum of the short-time Fourier transform,

ψ(ω,t) = |X(ω,t)|    (8)

[0031] 2. Definition of the probability density function of the observed signal
The probability density function p_ψ^(t)(x) of the observed signal at time t is given by

[Equation 7]
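The defining equation is not reproduced in this text. One natural reading, used here purely as an assumption, is that the power distribution of a segment is scaled so that it sums to one and can be treated as a probability distribution over frequency bins.

```python
def to_pdf(power):
    """Scale non-negative power values so that they sum to 1."""
    total = sum(power)
    if total <= 0.0:
        raise ValueError("segment has no power")
    return [p / total for p in power]

observed_pdf = to_pdf([1.0, 3.0])
```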

[0032] 3. Mixture distribution model of sound models
Assume that there are M sound models for each fundamental frequency. Let the probability density function of the m-th sound model with fundamental frequency F be

[Equation 8]

[Equation 9]

where H is the number of harmonics considered and G is a Gaussian function whose maximum is at F·h.
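A density of this kind, a weighted sum of H Gaussians centered at the harmonics F·h, can be sketched as follows. The equations themselves are not reproduced here, so the exact form, the common standard deviation `sigma` and the function names are assumptions.

```python
import math

def gaussian(x, mean, sigma):
    """Normal density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def sound_model_pdf(x, f0, harmonic_weights, sigma=10.0):
    """Weighted sum of Gaussians centered at f0 * h for h = 1 .. H."""
    return sum(c * gaussian(x, f0 * h, sigma)
               for h, c in enumerate(harmonic_weights, start=1))

weights = [0.5, 0.3, 0.2]   # hypothetical strengths of harmonics 1 to 3
at_harmonic = sound_model_pdf(100.0, 100.0, weights)
between = sound_model_pdf(150.0, 100.0, weights)
# Riemann sum over 0 .. 600 Hz in 1 Hz steps; close to 1 when weights sum to 1.
area = sum(sound_model_pdf(float(x), 100.0, weights) for x in range(0, 601))
```

If the harmonic weights sum to one, the model integrates to one, which is what lets it be treated as a probability density function.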

[0033] Equation (11) models at which frequencies, and with what strength, the harmonic components appear when the fundamental frequency is F. The probability density function p_ψ^(t)(x) of the observed signal is then considered to have been generated from the mixture distribution model p(x|θ^(t)) of the densities p(x|F,m,μ^(t)(F,m)).

[Equation 10]

Here Fl and Fh are the lower and upper limits of the expected fundamental frequency, and ω^(t)(F,m) is the weight of a sound model, which satisfies

[Equation 11]

[0034] Ultimately, if the model parameters θ^(t) can be estimated such that the observed probability density function p_ψ^(t)(x) appears to have been generated from the model p(x|θ^(t)), the weight ω^(t)(F,m) indicates how dominant each sound model is relative to the others.

[0035] 4. Creation of the sound models
Let N be the total number of reference words, and let C^(t)(x|n) be the spectral envelope at time t of the n-th reference word utterance (pre-emphasized). Extraction of the spectral envelope of the reference word utterances and the voiced/unvoiced decision use STRAIGHTV30k16 (reference: H. Kawahara, I. Masuda-Katsuse and A. de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds", Speech Communication 27, pp. 187-207, 1999).

[0036] The parameters μ^(t)(F,n) of the sound model p(x|F,n,μ^(t)(F,n)) for fundamental frequency F are given by

μ^(t)(F,n) = {c^(t)(h|F,n) | h = 1, ..., H}    (14)

[Equation 12]

This construction of the sound model is valid only in voiced segments of speech, however. Unvoiced segments have no harmonic structure, so a sound model that assumes a fundamental frequency cannot be constructed for them. Unvoiced segments also have less power than voiced segments to begin with, which means that under heavy noise the signal-to-noise ratio of unvoiced parts is lower than that of voiced parts and they are harder to estimate. For the present, therefore, no sound model based on the reference template is created for unvoiced parts; instead, a sound model that is as neutral as possible with respect to all sound models created from the reference patterns is substituted. The sound-model parameters for unvoiced parts of the template speech are accordingly set to

[Equation 13]

Strictly speaking, it has not been confirmed whether this sound model is neutral with respect to all sound models; detailed study is needed.

[0037] Let M be the total number of sound models at time t for fundamental frequency F, and N the number of candidate reference words. To allow for time alignment between the observed signal and the reference patterns, multiple (r) sound models are generated from each reference word over the range of the matching window, taking the time of the observed signal as the reference.

M = r·N    (17)

Since the reference words do not all have the same duration, the total number of sound models prepared in advance depends on the time elapsed since the start of the observed signal (it gradually decreases).

[0038] 5. Parameter estimation using the EM algorithm
When the probability density function p_ψ^(t)(x) is observed, the parameters θ^(t) of the model p(x|θ^(t)) are estimated based on the prior distribution p_0(θ^(t)). Maximum a posteriori estimation of θ^(t) based on the prior distribution with the EM algorithm amounts to updating, at each iteration, the old parameter estimate θ'^(t) = {ω'^(t), μ'^(t)} to obtain the new parameter estimate

[Equation 14]

[Equation 15]

[Equation 16]

[Equation 17]

where the quantities in [Equation 17] are the estimates in the case of a non-informative prior, and ω_0^(t)(F,m) and c_0^(t)(h|F,m) are the prior distributions. β_ω^(t) is a parameter that determines how much weight is given to the prior ω_0^(t)(F,m), and β_μ^(t)(F,m) is a parameter that determines how much weight is given to the prior c_0^(t)(h|F,m).

[Equation 18]

[Equation 19]    (21)
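The role of a parameter such as β_ω can be illustrated with a simple blend between the estimate obtained without prior information and the prior itself. The update equations are not reproduced in this text, so the convex-combination form below is an assumption chosen only to show the effect of β, and the function name is hypothetical.

```python
def blend_with_prior(ml_estimate, prior, beta):
    """Pull a maximum-likelihood estimate toward a prior; beta >= 0 sets the pull.

    beta = 0 returns the ML estimate unchanged; a large beta returns
    values close to the prior.
    """
    return [(ml + beta * p) / (1.0 + beta) for ml, p in zip(ml_estimate, prior)]

ml_weights = [0.8, 0.2]      # estimate with a non-informative prior
prior_weights = [0.5, 0.5]   # prior distribution of the weights

no_prior = blend_with_prior(ml_weights, prior_weights, beta=0.0)
equal_trust = blend_with_prior(ml_weights, prior_weights, beta=1.0)
strong_prior = blend_with_prior(ml_weights, prior_weights, beta=1000.0)
```

With this form, two weight vectors that each sum to one blend into a vector that also sums to one.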

[0039] What is wanted here is not the fundamental frequency but how plausible the hypothesis is that the feature vector of a reference word, given as a prior distribution, is present. The weight β_μ^(t)(F,m) of the prior distribution c_0^(t)(h|F,m), which is obtained directly from the feature vector, is therefore made sufficiently large. The similarity of the feature vector C^(t)(m) of a reference word is then defined as

[Equation 20]

[0040] 6. DP matching based on the sound-model weights
The sound models making up the mixture distribution model at time t are feature vectors extracted from each of the N reference time series over the matching window length r. The similarities s^(t)(m) of the sound models are therefore grouped, r at a time, by the word they come from to form M groups, and similarity-based DP matching is performed for each group (FIG. 4). Let the time series of the observed signal be

A = a_1, a_2, ..., a_i, ..., a_I    (23)

and the time series of a reference template be

B = b_1, b_2, ..., b_j, ..., b_J    (24)

Considering the plane spanned by A and B, the correspondence between the time axes of the two series, that is, the time warping function, can be expressed as a sequence L of grid points l = (i,j) on this plane.

L^(n) = l_1, l_2, ..., l_k    (25)
l_k = (i_k, j_k)    (26)

【００４１】[0041] When the estimated proportion in which the feature vector b_j of the standard series is mixed into the feature vector a_i of the observed series is s(c) = s(i, j), the sum of the distances along L is

【数２１】 [Equation 21]

The larger this value, the larger the proportion of the standard series contained in the observed series. Here y_k is a positive weight associated with L. Setting

y_k = (i_k - i_{k-1}) + (j_k - j_{k-1})    Equation (28)

i_0 = j_0 = 0    Equation (29)

gives

【数２２】 [Equation 22]

which simplifies to

【数２３】 [Equation 23]

The n that gives the maximum value of W^(n)(L) is the index of the word output as the recognition result. W^(n)(L) is taken as the word similarity of template n to the input word. The following slope constraint is imposed on L, and the length of the matching window used to prevent extreme warping is r. The shape and weights of the slope constraint are shown in FIG. 5.

【数２４】 [Equation 24]
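The DP matching described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's exact procedure. Equations 21 to 24 and the slope constraint of FIG. 5 are not reproduced in this text, so the sketch uses only the three basic moves (horizontal, vertical, diagonal) and treats the matching window as |i - j| <= r. The weights follow Equation (28): a horizontal or vertical move has y_k = 1 and a diagonal move has y_k = 2, so the weights along any complete path sum to I + J, which is used as the normalizer. The names `dp_similarity` and `recognize` are hypothetical.

```python
def dp_similarity(s, r):
    """Best normalized weighted similarity over warping paths through s (I x J).

    s[i][j] is the estimated proportion of template frame j in observed frame i.
    ASSUMPTIONS: basic step set only (no FIG. 5 slope constraint), window |i-j| <= r.
    Returns -inf if the end point cannot be reached inside the window.
    """
    n_i, n_j = len(s), len(s[0])
    neg = float("-inf")
    g = [[neg] * n_j for _ in range(n_i)]
    g[0][0] = 2.0 * s[0][0]  # first point: y_1 = (1 - 0) + (1 - 0) = 2
    for i in range(n_i):
        for j in range(n_j):
            if (i == 0 and j == 0) or abs(i - j) > r:
                continue
            best = neg
            if i > 0:                      # vertical move, y_k = 1
                best = max(best, g[i - 1][j] + s[i][j])
            if j > 0:                      # horizontal move, y_k = 1
                best = max(best, g[i][j - 1] + s[i][j])
            if i > 0 and j > 0:            # diagonal move, y_k = 2
                best = max(best, g[i - 1][j - 1] + 2.0 * s[i][j])
            g[i][j] = best
    return g[n_i - 1][n_j - 1] / (n_i + n_j)


def recognize(similarity_matrices, r):
    """Return (index of the template with the largest similarity, all scores)."""
    scores = [dp_similarity(s, r) for s in similarity_matrices]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Because the score is a similarity rather than a distance, the recursion takes the maximum over predecessors, and the recognized word is the template index with the largest normalized score.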

【００４２】[0042] Next, the method of determining the recognition interval for speech recognition is described. Speech recognition technology does no more than measure the degree of match between the input and each of several templates and output the best-matching template as the recognition result. A recognition result is therefore output even for a noise-only interval in which no speech is input. To avoid this problem, conventional speech recognition often monitors the power of the input signal and performs recognition only when the power exceeds a threshold. Needless to say, this strategy is effective only when no noise is present. When the noise is largely aperiodic, such as the sound of office automation equipment, speech intervals can be extracted by monitoring changes in the zero-crossing count together with the power. However, this approach lacks robustness when the noise level is high or when the noise contains periodic signals.

【００４３】[0043] In this embodiment, the utterance interval is detected by directly sensing the speaker's laryngeal sound with a contact microphone called a throat microphone. FIG. 6 shows the output waveform of the headset microphone, the output waveform of the throat microphone, and the change in the zero-crossing count when a word is uttered under high noise. As FIG. 6 shows, the zero-crossing count drops sharply in voiced portions, so the utterance interval can be determined from the zero-crossing count. A threshold zero-crossing count is set, and the time at which the count falls below it is taken as the start of the utterance. If, after the count exceeds the threshold, it does not fall below the threshold again for a fixed interval, the time at which it exceeded the threshold is taken as the end of the utterance. With this method, the unvoiced portion is lost for utterances that begin with an unvoiced consonant or end devoiced. However, since reliable estimation is not performed for unvoiced portions at this stage, both the standard speech and the observed speech were taken to start and end at voiced portions. All speech signals used in the following evaluation were segmented by this method.
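The zero-crossing rule described in this paragraph can be sketched as below. The patent gives no frame length, threshold value, or hold interval, so `threshold` and `hold` are left as caller-supplied parameters, and the function names are hypothetical. The sketch also assumes that a count exactly equal to the threshold is treated as not voiced, and that a run reaching the end of the signal counts as a completed hold interval.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def detect_utterance(zc_counts, threshold, hold):
    """Return (start, end) frame indices of the utterance, or None if no start is found.

    start: first frame whose zero-crossing count falls below the threshold.
    end:   first later frame from which the count stays at or above the
           threshold for `hold` consecutive frames (or until the signal ends).
    """
    start = next((i for i, c in enumerate(zc_counts) if c < threshold), None)
    if start is None:
        return None
    for i in range(start + 1, len(zc_counts)):
        window = zc_counts[i:i + hold]
        if all(c >= threshold for c in window):
            return start, i
    return start, len(zc_counts)


# Per-frame counts from a throat-microphone signal (made-up numbers).
counts = [50, 48, 10, 8, 45, 9, 7, 50, 52, 51, 49]
interval = detect_utterance(counts, threshold=30, hold=3)
```

In this example the brief rise at frame 4 is not taken as the end of the utterance because the count falls below the threshold again within the hold interval, which is the behaviour the paragraph describes.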

【００４４】[0044] Although the figure suggests that the utterance interval could also be extracted from changes in power, the throat microphone is a contact microphone and is strongly affected by noise caused by body movement, so this is not practical.

【００４５】[0045] The speech recognition method of the present invention was evaluated in two ways: by constructing a simulated noise environment, and by mixing speech and noise on a computer. Many of the parameters were chosen arbitrarily and are not necessarily optimal.

【００４６】[0046] (1) Evaluation in a simulated noise space

A simulated noise space was set up in an anechoic room. Computer-room noise and factory noise from the Denshikyo (JEIDA) noise database were used as noise sources. The former is stationary noise, whereas the latter is non-stationary. The speaker uttered eight isolated words in the noise in a slightly forceful voice. The speech was recorded simultaneously on the L and R channels of a DAT through a headset microphone (B&K Type 4035) and a throat microphone (Audio Technica AT890) worn by the speaker. Using the calibration signal included in the database, the noise level was adjusted so that the level at the microphone position matched that of the environment in which the noise had been recorded. As a result, the noise level was about 70 dB(C) for both the computer-room noise and the factory noise. The sound pressure level of the speaker's speech was about 78 dB(C), so the SN ratio is about 8 dB. For template creation, speech alone was also recorded with no noise.

【００４７】[0047] The recorded speech consists of the following eight words.
・Fukuoka (福岡)
・Saga (佐賀)
・Kumamoto (熊本)
・Miyazaki (宮崎)
・Nagasaki (長崎)
・Oita (大分)
・Kagoshima (鹿児島)
・Okinawa (沖縄)

【００４８】[0048] FIG. 7 shows the results for speech recorded in the simulated computer-room noise environment, and FIG. 8 shows the results for speech recorded in the simulated factory noise environment. The horizontal axis shows the uttered word, and the vertical axis shows the similarity of each candidate word's template speech to the input speech. The figures show that, for every input utterance, the word similarity of the correct candidate word is higher than that of the other words.

【００４９】[0049] (2) Evaluation by mixing speech and noise on a computer

To evaluate how performance varies with the SN ratio, noise and speech were mixed on a computer at several SN ratios. FIG. 9 shows the results with no background noise. FIG. 10 shows the case of computer-room noise at an SN ratio of 10 dB, FIG. 11 the case of computer-room noise at an SN ratio of 10 dB, FIG. 12 the case of factory noise at an SN ratio of 0 dB, and FIG. 13 the case of factory noise at an SN ratio of 0 dB. With no background noise and at an SN ratio of 10 dB, the word similarity of the correct candidate word is higher than that of the other words for every input utterance. As the noise level rises, the word similarity of the correct candidate falls and its margin over the other words narrows. At an SN ratio of 0 dB, 3 of the 16 inputs were misrecognized. Environments noisy enough for the SN ratio to reach 0 dB are considered rare when speech is picked up with a headset microphone, so these results suggest that the method will work effectively in most real environments.
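Mixing speech and noise at a chosen SN ratio can be sketched as follows. The patent does not say how the SN ratio was computed, so this sketch assumes it is the ratio of mean signal powers over the whole utterance, expressed in decibels. The function names `mean_power` and `mix_at_snr` are hypothetical.

```python
import math


def mean_power(x):
    """Mean squared amplitude of a sample sequence."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) == snr_db, then add it.

    ASSUMPTION: SN ratio is defined on mean power over the whole utterance.
    Returns (mixed signal, scaled noise), both as long as the speech.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / mean_power(noise))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


# Toy signals; real use would pass recorded speech and database noise samples.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5, 0.5]
mixed_10db, scaled_10db = mix_at_snr(speech, noise, 10.0)
```

Scaling the noise rather than the speech keeps the speech level fixed across conditions, so differences in similarity scores between SN ratios reflect the noise level alone.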

【００５０】[0050] The following are possible items for future improvement.

(1) Handling of the sound model for unvoiced portions, and handling of paths and the number of paths in DP matching

At present a neutral model is used uniformly as the sound model for unvoiced portions, but with this approach the score worsens as the proportion of unvoiced portions grows. In DP matching, an unfavorable path may be selected in an unvoiced portion, and even if the following voiced portion is estimated well, the slope constraint may prevent that information from being taken in. Some device may be needed, such as computing the path over voiced portions only. Alternatively, if "straying" in an unvoiced portion does not greatly affect the following voiced portion, some correction may suffice, such as finally multiplying W^(n)(L) by (standard pattern length / voiced portion length). Changing the DP matching weight y_k between voiced and unvoiced portions is also conceivable. A comparison is also needed with the case in which the sound model is built from the spectral envelope of the unvoiced portions. This may or may not improve accuracy.

【００５１】[0051] (2) Sound model for noise

If the noise prominently observed in the usage environment is known in advance, building a sound model for that noise beforehand should improve accuracy.

【００５２】[0052] (3) Use of fundamental-frequency information obtained from the throat microphone output

The search range of F can be narrowed, which improves both accuracy and speed.

【００５３】[0053] (4) Use of intonation information of the standard templates

The continuity of the fundamental frequency and the degree of deviation from the template's variation pattern are given as weights on ω^(t)(F, m). For continuous speech, combination with the Fujisaki model or J-ToBI is required.

【００５４】[0054] (5) Local information and global information

Sound source separation presupposes local separation. This method also localizes information in the time direction, but localizing it in the time-frequency plane might further reduce the information that is lost. Possible approaches include dividing the sound models by frequency or using a representation whose center of gravity lies at the spectral peaks.

【００５５】[0055] Items concerning extensions fall broadly into two directions. One is to achieve large-vocabulary speech recognition under high noise by improving compatibility with conventional speech recognition algorithms, which embody a vast accumulation of knowledge. The other is extension from the speech recognition problem (from speech + noise to text) to the speech extraction problem (from speech + noise to speech). In addition, many of the processes making up this method were originally conceived with human auditory speech processing as a hint. Examining the relationship between the characteristics of this method and speech perception in noise is expected to provide clues for new developments.

【００５６】[0056] (1) Extension from DTW to HMM

HMMs subsume DP matching: whatever DP matching can do, an HMM can also do.

【００５７】[0057] (2) Compatibility with acoustic models described by MFCC

MFCCs are also spectral envelope information, so they are basically compatible. However, ΔMFCC, which is used together with MFCC in continuous speech recognition, cannot be introduced in the current scheme. This requires study, including whether introducing it is necessary at all.

【００５８】[0058] (3) Use of a language model

A language model can directly determine the prior distribution ω_0^(t)(F, m).

【００５９】[0059] (4) Limit on the number of sound models and the possibility of extension to dictation

The current target is small-vocabulary word recognition, but for large vocabularies or dictation it is not feasible to create a sound model for each word. Use of the acoustic models (phoneme models) commonly used in speech recognition engines becomes essential. What matters then is how many sound models this method can accommodate, so the feasible number of sound models needs to be examined. At the same time, it is necessary to examine how far the number of phoneme models can be reduced. Speech recognition engines use several hundred models in the case of triphones. Naturally, the number falls considerably for biphones and monophones. Phonemes that provide few cues under high noise, such as unvoiced portions, could probably be merged. A dedicated VQ codebook could also be created.

【００６０】[0060] (5) From the speech recognition problem to the speech extraction problem

The weight values estimated by the EM algorithm provide information on the fundamental frequency and the spectral shape. This means that, using analysis-synthesis techniques, the speech as it was before the noise was mixed in can be reproduced. The issue is how to obtain a confidence measure indicating how reliable the estimated parameters are. Given such a confidence measure, low-reliability portions could be handled by interpolation or the like, allowing speech of adequate listening quality to be resynthesized. Application to hearing-assistance systems can be expected.

【００６１】[0061] (6) Consistency with Computational Auditory Scene Analysis

Some aspects can already be explained, such as grouping in the frequency direction and the fusion of bottom-up and top-down processing (see the item on use of a language model).

【００６２】[0062] (7) Importance of "pitch" in speech recognition

Whether pitch information is useful for speech recognition has long been debated, and the verdict on its effect has in many cases been negative. The reason may lie in how pitch information was handled in the first place. The method of the present invention actively exploits the presence of pitch, yet does not directly estimate pitch itself. In a sense, it makes effective use of pitch information while neatly sidestepping the difficulty of pitch estimation. This issue connects directly to the question Kawahara has pursued, "Why does hearing insist on the existence of pitch?", and is worth examining from the standpoint of Auditory Scene Analysis.

【００６３】[0063]

【発明の効果】[Effects of the Invention] As described above, the present invention makes it possible to perform speech estimation, such as speech recognition and speech synthesis, that operates robustly even on a speech signal input under noise or a speech signal into which noise has been mixed on a communication channel.

【図面の簡単な説明】[Brief description of drawings]

【図１】 FIG. 1 is a block diagram showing the basic form of a noise canceller.

【図２】 FIG. 2 is an explanatory diagram of an HMM model using the HMM composition method.

【図３】 FIG. 3 is a configuration diagram of a multi-rate filter bank.

【図４】 FIG. 4 is an explanatory diagram showing the relationship between the sound models and DP matching.

【図５】 FIG. 5 is an explanatory diagram of the slope constraint in DP matching.

【図６】 FIG. 6 is a waveform diagram showing the relationship between the speech waveform and the zero-crossing count.

【図７】 FIG. 7 is a graph showing the evaluation in the simulated noise space.

【図８】 FIG. 8 is a graph showing the evaluation in the simulated noise space.

【図９】 FIG. 9 is a graph for the case with no background noise.

【図１０】 FIG. 10 is a graph showing the superposition of noise on a computer.

【図１１】 FIG. 11 is a graph showing the superposition of noise on a computer.

【図１２】 FIG. 12 is a graph showing the superposition of noise on a computer.

【図１３】 FIG. 13 is a graph showing the superposition of noise on a computer.

Claims

【特許請求の範囲】[Claims]

【請求項１】 1. A speech estimation method for a noisy environment, comprising: an acoustic analysis step of segmenting an input acoustic signal into short-time segments and performing short-time frequency analysis; an element estimation step of estimating elements required for speech estimation; and a speech estimation step of estimating speech using the elements obtained in the element estimation step.

【請求項２】 2. A speech recognition method for a noisy environment, comprising: an acoustic analysis step of segmenting an input acoustic signal into short-time segments and performing short-time frequency analysis; an element estimation step of estimating elements required for speech recognition; and a speech recognition step of recognizing speech using the elements obtained in the element estimation step.

【請求項３】 3. A speech recognition method for a noisy environment, comprising: an acoustic analysis step of segmenting an input acoustic signal into short-time segments and performing short-time frequency analysis; a step of generating sound models by using, as knowledge, the spectral envelopes of speech held in a codebook for speech recognition; an estimation step of regarding the spectral information obtained in the acoustic analysis step as a probability density function and estimating mixture weight values by maximum a posteriori estimation; and a recognition step of judging that, at each time, the assumed presence of the element that generated the sound model having the largest weight value has the highest likelihood, and outputting that element.