JP6464005B2

JP6464005B2 - Noise suppression speech recognition apparatus and program thereof

Info

Publication number: JP6464005B2
Application number: JP2015060541A
Authority: JP
Inventors: 彰夫小林; 和穂尾上
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2015-03-24
Filing date: 2015-03-24
Publication date: 2019-02-06
Anticipated expiration: 2035-03-24
Also published as: JP2016180839A

Description

The present invention relates to a noise suppression speech recognition apparatus that performs speech recognition while suppressing noise in input speech, and to a program therefor.

When performing speech recognition, it is preferable, in order to reduce the processing load and improve recognition performance, to detect the sections of the input speech in which a person is speaking (speech sections, or utterance sections) and to perform recognition only on those sections, and not to perform recognition in the remaining non-speech sections (non-utterance sections).
One technique for detecting utterance sections in such input speech is disclosed in Patent Document 1.
In this technique, a hidden Markov model with state transitions between speech and non-speech is defined in advance, and speech sections are detected by comparing the likelihoods calculated from the respective state transition sequences.

Patent Document 1: JP 2007-233148 A

With the technique described in Patent Document 1, speech sections and non-speech sections in the input speech can be distinguished, and the speech in the speech sections can be extracted.
However, when the input speech is the audio of a broadcast program, it may contain, in addition to speech uttered by a person, noise such as music and foreign-language speech that is not a target of speech recognition.
When various kinds of noise are mixed into the input speech in this way, it is difficult for a conventional technique that simply separates speech sections from non-speech sections to detect only the sections spoken by a person.
Moreover, even if the conventional technique does detect a speech section, the accuracy of speech recognition drops when noise is superimposed on that section.

The present invention has been made in view of these problems. Its object is to provide a noise suppression speech recognition apparatus, and a program therefor, that detect, from input speech on which noise is superimposed, the speech sections containing the speech to be recognized (native-language speech), and that perform speech recognition while suppressing the noise superimposed on those sections.

To solve the above problems, the noise suppression speech recognition apparatus according to the present invention is an apparatus that performs speech recognition after applying noise suppression to input speech, and comprises an acoustic feature extraction unit, a statistical model storage unit, a class feature calculation unit, a speech section detection unit, a noise section detection unit, a noise suppression processing selection unit, and a speech recognition unit.

In this configuration, the acoustic feature extraction unit extracts acoustic features from the input speech in units of frames of a predetermined length. The acoustic features are, for example, log mel filter bank outputs or mel-frequency cepstral coefficients.
The class feature calculation unit then uses a statistical model to calculate, for each frame and from the acoustic features extracted by the acoustic feature extraction unit, the posterior probability of each class of speech type (including the native-language speech that is the recognition target) and the posterior probability of each class of noise type. These posterior probabilities serve as the class features of the respective classes. The statistical model is stored in advance in the statistical model storage unit and has been trained in advance on the relationship between acoustic features and each speech type and on the relationship between acoustic features and each noise type. The statistical model can be represented as the parameters of a neural network (the weight matrices connecting the network layers, and the bias terms).

The speech section detection unit then detects the speech sections of native-language speech on the basis of the posterior probabilities of the speech-type classes. These speech sections can be detected from a state transition sequence over the speech-type classes (native-language speech, non-speech, and so on) based on a hidden Markov model.
The noise section detection unit likewise detects a noise section for each noise type (foreign-language speech, music, and so on) on the basis of the posterior probabilities of the noise-type classes. These noise sections can also be detected from a state transition sequence over the noise-type classes based on a hidden Markov model.
The apparatus can therefore detect which sections of the input speech are speech sections and, within those speech sections, can further detect the sections occupied by each type of noise.

The noise suppression processing selection unit then selects a predetermined noise suppression method according to the type of noise in the noise section that corresponds to the speech section.

The noise suppression unit then applies the noise suppression method selected by the noise suppression processing selection unit and generates acoustic features in which the noise component of the speech section has been suppressed.
The apparatus can thus generate acoustic features in which noise is suppressed individually according to the type of noise superimposed on the speech section.

The speech recognition unit then performs speech recognition on the acoustic features generated by the noise suppression unit.
The apparatus can thus detect, from the input speech, the speech sections containing the speech to be recognized (native-language speech) and perform speech recognition with the noise suppressed according to the type of noise superimposed on each speech section.

The present invention provides the following advantageous effects.
According to the present invention, even when noise is superimposed on the input speech, the speech sections containing the speech to be recognized can be detected accurately by using a statistical model that models both speech and noise. In addition, because noise suppression suited to the type of noise superimposed on each detected speech section can be applied, the accuracy of speech recognition can be improved.

A block diagram showing the configuration of the noise suppression speech recognition apparatus according to an embodiment of the present invention.
A schematic diagram showing the structure of the recurrent neural network that is stored in the statistical model storage unit and referenced by the class feature calculation unit.
A transition diagram showing an example of the three state transitions in the speech section detection unit.
An explanatory diagram illustrating how the speech section detection unit obtains the maximum likelihood sequence.
A transition diagram showing an example of the four state transitions in the noise section detection unit.
An explanatory diagram illustrating how the noise section detection unit obtains the maximum likelihood sequence.
An example of the training data used to train the statistical model stored in the statistical model storage unit, in which (a) shows the acoustic features of each frame of the input speech, (b) shows the sound composition of each frame of the input speech, and (c) shows, for each frame, the class features for the speech section detection unit and the class features for the noise section detection unit.
A flowchart showing the operation by which the noise suppression speech recognition apparatus according to the embodiment accumulates acoustic features.
A flowchart showing the operation by which the noise suppression speech recognition apparatus according to the embodiment performs noise suppression for each noise class.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the noise suppression speech recognition apparatus]
First, the configuration of the noise suppression speech recognition apparatus 1 according to an embodiment of the present invention is described with reference to FIG. 1.

The noise suppression speech recognition apparatus 1 detects the speech sections to be recognized in speech that contains noise, such as the audio of a broadcast program (broadcast audio), and performs speech recognition while suppressing the noise superimposed on those sections. Hereinafter, the speech to be recognized is called native-language speech (for example, Japanese spoken by a Japanese speaker). Noise here means any speech or sound other than the speech to be recognized, such as music or foreign-language speech. Many kinds of noise exist besides music and foreign-language speech; these are grouped into a single category called other noise.
Here, the noise suppression speech recognition apparatus 1 comprises a section detection unit 10, a noise suppression unit 20, and a speech recognition unit 30.

The section detection unit 10 detects speech sections in the input speech (broadcast audio), identifies the type of noise superimposed within each speech section, and selects, for each speech frame (one unit of the acoustic features on which speech recognition is performed), which of a plurality of predetermined noise suppression methods is suitable.
Here, the section detection unit 10 comprises an acoustic feature extraction unit 11, a feature normalization unit 12, a frame buffer 13, a statistical model storage unit 14, a class feature calculation unit 15, a speech section detection unit 16, a noise section detection unit 17, and a noise suppression processing selection unit 18.

The acoustic feature extraction unit 11 extracts acoustic features from the input speech. These are the features of the input speech with the noise still superimposed on it. Here, log mel filter bank outputs are used as the acoustic features.

Specifically, the acoustic feature extraction unit 11 first cuts the input speech into frames of a predetermined length (for example, a frame length of 20 to 30 ms and a frame interval of 10 to 20 ms). It then applies a discrete Fourier transform (DFT) to each frame. Next, it passes the amplitude spectrum through a filter bank whose filters are equally spaced on the mel scale and extracts the spectral component of each band. Finally, it takes the logarithm of the amplitude spectrum, which has been compressed to a dimensionality equal to the number of filters (for example, 20 dimensions), to obtain the log amplitude spectrum (feature vector).
The acoustic feature extraction unit 11 outputs this log amplitude spectrum of the predetermined dimensionality to the feature normalization unit 12.

Although log mel filter bank outputs (the log amplitude spectrum of the predetermined dimensionality) are used as the acoustic features here, mel-frequency cepstral coefficients (MFCC) may be used instead.
In that case, the acoustic feature extraction unit 11 applies a discrete cosine transform (DCT) to the log amplitude spectrum obtained as described above and takes the low-order components (for example, 12 dimensions), which correspond to the vocal tract component of the spectrum, as the mel-frequency cepstral coefficients.
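The extraction steps described above (framing, DFT, mel filter bank, logarithm, and the optional DCT) can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the apparatus: the 16 kHz sample rate, the Hamming window, the triangular filter shape, the mel formula, and all function names are assumptions that do not appear in the text. The frame length (25 ms), frame interval (10 ms), 20 filters, and 12 cepstral coefficients are taken from the example ranges given above.

```python
import numpy as np

def hz_to_mel(f):
    # A commonly used mel-scale formula (assumed; the text does not specify one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_filters, bin_freqs.size))
    for i in range(n_filters):
        left, center, right = hz_points[i:i + 3]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def log_mel_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_filters=20):
    """Return one log mel filter bank vector per frame, shape (n_frames, n_filters)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)  # windowing is an assumption
    fbank = mel_filterbank(n_filters, frame_len, sample_rate)
    feats = np.empty((n_frames, n_filters))
    for t in range(n_frames):
        frame = signal[t * hop:t * hop + frame_len] * window
        amplitude = np.abs(np.fft.rfft(frame))        # amplitude spectrum via DFT
        feats[t] = np.log(fbank @ amplitude + 1e-10)  # small offset avoids log(0)
    return feats

def mfcc_from_log_mel(log_mel, n_coeffs=12):
    """Apply a DCT-II to each log mel vector and keep the low-order coefficients."""
    n = log_mel.shape[1]
    k = np.arange(n_coeffs)[:, None]
    idx = np.arange(n)[None, :]
    dct = np.cos(np.pi * k * (2 * idx + 1) / (2 * n))
    return log_mel @ dct.T
```

For one second of 16 kHz audio, `log_mel_features` yields a 20-dimensional vector for each 10 ms step, and `mfcc_from_log_mel` reduces each vector to 12 coefficients.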

The feature normalization unit 12 normalizes the acoustic features extracted by the acoustic feature extraction unit 11.
For example, the feature normalization unit 12 normalizes the feature vector of the predetermined dimensionality extracted by the acoustic feature extraction unit 11 (the log amplitude spectrum or the mel-frequency cepstral coefficients) so that its mean is 0 and its variance is 1. This compresses the dynamic range (width) of the feature vector and also reduces variation (distortion) in the acoustic features caused by, for example, microphone characteristics and differences between speakers.
The feature normalization unit 12 stores the normalized acoustic features in the frame buffer 13 frame by frame.
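As a rough illustration of this normalization, the following sketch scales each feature dimension to zero mean and unit variance. The text does not state over which span of frames the statistics are computed, so computing them over one block of frames is an assumption here, as is the function name.

```python
import numpy as np

def normalize_features(feats, eps=1e-8):
    """Scale each feature dimension to mean 0 and variance 1.

    feats has shape (n_frames, n_dims). Computing the statistics over the
    whole block is an assumption; the text does not specify the span.
    """
    mean = feats.mean(axis=0)
    std = feats.std(axis=0)
    return (feats - mean) / (std + eps)  # eps guards against constant dimensions
```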

The frame buffer 13 stores acoustic features frame by frame and consists of ordinary memory. The frame buffer 13 has a predetermined size that is at least large enough for the speech section detection unit 16 and the noise section detection unit 17 to detect their respective sections (speech sections and noise sections), for example 50 frames.
The feature normalization unit 12 writes acoustic features into the frame buffer 13 one after another, and the class feature calculation unit 15 reads them out in order. How the frame buffer 13 is cleared is explained in the descriptions of the speech section detection unit 16 and the noise section detection unit 17.

The statistical model storage unit 14 stores a statistical model that, given the acoustic features of a frame, models the probability that each state is occupied (the posterior probability of each hidden Markov model (HMM) state). The states are the classes obtained by classifying speech sounds by type and the classes obtained by classifying noise sounds by type.
From a single speech input (acoustic features), this statistical model models the posterior probabilities of two independent classifications: one focused on speech and one focused on noise.
Here, speech is divided into three classes: native-language speech, non-speech (music, foreign-language speech, and other noise), and silence. Noise is divided into four classes: no noise (native-language speech only, or silence), foreign-language speech, music, and other noise.

As the statistical model, the statistical model storage unit 14 stores, for example, the parameters of a recurrent neural network trained in advance (the weight matrices connecting the network layers, and the biases). The training of this statistical model (recurrent neural network) is described later.

The recurrent neural network is now described with reference to FIG. 2. FIG. 2 schematically shows the structure of the recurrent neural network N with which the class feature calculation unit 15, described later, obtains the posterior probabilities of the speech and noise classes from the acoustic features as class features.

The recurrent neural network N receives acoustic features of a predetermined dimensionality at its input layer and, via hidden layer A and hidden layer B, outputs from its two output layers the posterior probabilities (class features) of the speech classes (native-language speech, non-speech, and silence) and of the noise classes (no noise, foreign-language speech, music, and other noise), respectively. Hidden layer A feeds its own previous output back into itself. As a result, the class features reflect the influence of the immediately preceding acoustic features, which improves estimation accuracy.
Returning to FIG. 1, the description of the configuration of the noise suppression speech recognition apparatus 1 continues.

The class feature calculation unit 15 reads the acoustic features of each frame in order from the frame buffer 13 and, using the statistical model stored in the statistical model storage unit 14, calculates the class feature (posterior probability) of each speech class (native-language speech, non-speech, and silence) and each noise class (no noise, foreign-language speech, music, and other noise).
That is, as shown in FIG. 2, the class feature calculation unit 15 performs the computations from the input layer to the output layers in order, using the weight matrices and biases of the recurrent neural network N stored in advance in the statistical model storage unit 14, and thereby calculates the class features.
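The forward computation described above can be illustrated with the following minimal sketch of a network with a recurrent hidden layer A, a hidden layer B, and two output layers. Several details are assumptions that the text does not specify: the tanh activations, the softmax outputs, the layer sizes, and the class and attribute names. The random weights merely stand in for the trained parameters held in the statistical model storage unit 14.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

class TwoHeadRNN:
    """Input -> hidden A (recurrent) -> hidden B -> speech head (3) and noise head (4)."""

    def __init__(self, n_in=20, n_a=32, n_b=32, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.1, size=shape)
        # Random placeholders for the trained weight matrices and biases.
        self.W_in, self.W_rec, self.b_a = init(n_a, n_in), init(n_a, n_a), np.zeros(n_a)
        self.W_b, self.b_b = init(n_b, n_a), np.zeros(n_b)
        self.W_speech, self.b_speech = init(3, n_b), np.zeros(3)
        self.W_noise, self.b_noise = init(4, n_b), np.zeros(4)
        self.h_a = np.zeros(n_a)  # previous output of hidden layer A

    def step(self, x):
        """Process one frame and return (speech posteriors, noise posteriors)."""
        # Hidden layer A receives the current input and its own previous output.
        self.h_a = np.tanh(self.W_in @ x + self.W_rec @ self.h_a + self.b_a)
        h_b = np.tanh(self.W_b @ self.h_a + self.b_b)
        p_speech = softmax(self.W_speech @ h_b + self.b_speech)
        p_noise = softmax(self.W_noise @ h_b + self.b_noise)
        return p_speech, p_noise
```

Calling `step` once per frame, in frame order, gives the three speech-class posteriors and the four noise-class posteriors for that frame.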

Here, for speech, the class feature calculation unit 15 outputs each class (native-language speech, non-speech, silence) together with its class feature (posterior probability) as a tag to the speech section detection unit 16. For noise, it outputs each class (no noise, foreign-language speech, music, other noise) together with its class feature (posterior probability) as a tag to the noise section detection unit 17.

The speech section detection unit 16 detects speech sections on the basis of the class features that the class feature calculation unit 15 calculates for each frame.
Here, the speech section detection unit 16 determines whether each frame belongs to the class “native-language speech”, “non-speech”, or “silence”. A run of one or more consecutive frames determined to be “native-language speech” constitutes a speech section.
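The rule that one or more consecutive frames with the same class form a section can be illustrated with a short sketch. The function name and the label strings are hypothetical, chosen only for this example.

```python
from itertools import groupby

def find_segments(labels, target):
    """Return (start, end) frame index pairs, end exclusive, for runs of target."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == target:
            segments.append((start, start + length))
        start += length
    return segments
```

For example, the label sequence `["silence", "native", "native", "non_speech", "native"]` contains two native-language speech sections: frames 1 to 2, and frame 4.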

That is, the speech section detection unit 16 uses an ergodic HMM with the three states “native-language speech”, “non-speech”, and “silence”, as shown in FIG. 3, and makes probabilistic transitions between the states (classes) on the basis of the class features (posterior probabilities) received from the class feature calculation unit 15.
The speech section detection unit 16 then identifies the class of each frame by finding the maximum likelihood sequence of class transitions for the acoustic features stored in the frame buffer 13. The maximum likelihood sequence is the HMM state sequence with the highest probability.

For example, as shown in FIG. 4, with the acoustic features stored frame by frame in the frame buffer 13 as a feature sequence, the speech section detection unit 16 finds the maximum likelihood sequence with the Viterbi algorithm each time it receives the class features (posterior probabilities of each class) calculated by the class feature calculation unit 15. FIG. 4 shows that, at a certain time t, the maximum likelihood sequence of classes from the head of the frame buffer 13 is “native-language speech”, “native-language speech”, ..., “silence”.
In this way, the speech section detection unit 16 can determine, frame after frame, which class the acoustic features stored in the frame buffer 13 belong to.
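A minimal sketch of finding the maximum likelihood sequence with the Viterbi algorithm is given here for illustration. It assumes that the per-frame posteriors are used directly as emission scores and that the transition probabilities are supplied by the caller; the text does not give the transition values or state how the posteriors are combined with them, so both are assumptions of this sketch.

```python
import numpy as np

def viterbi(posteriors, log_trans, log_init):
    """Return the most likely state index for each frame.

    posteriors: array (T, S) of per-frame class posteriors, used as emission scores.
    log_trans:  array (S, S), log_trans[i, j] = log P(state j | previous state i).
    log_init:   array (S,), log initial state probabilities.
    """
    T, S = posteriors.shape
    log_emit = np.log(posteriors + 1e-12)
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace back from the best final state.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With self-transition probabilities close to 1, a single frame whose posterior briefly favors another class does not split a section, whereas a sustained change does.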

For each frame, the speech section detection unit 16 outputs a tag identifying the determined class to the noise suppression processing selection unit 18 and instructs the noise suppression processing selection unit 18 to read the acoustic features corresponding to that frame from the frame buffer 13.
The speech section detection unit 16 then updates the contents of the frame buffer 13 when it receives a read-complete response from the noise suppression processing selection unit 18.

That is, as shown in FIG. 4, once the section whose classes have been fixed by the maximum likelihood sequence (the decided section) has been identified and its acoustic features have been output to the noise suppression processing selection unit 18, the speech section detection unit 16 moves the acoustic features of the section whose classes are not yet fixed (the undecided section) to the head of the frame buffer 13 and clears the rest of the frame buffer 13.
Thus, where N_buf is the buffer size, N_det is the length of the decided section, and N_not is the length of the undecided section, the frame buffer 13 satisfies the following expression (1).

N_buf = N_det + N_not   (1)

Accordingly, for each set of N_buf frames, the speech section detection unit 16 repeats the process of moving the N_not undecided frames to the head of the buffer after the classes of N_det frames have been decided, and it continues the class determination for as long as acoustic features of the input speech keep arriving in the frame buffer 13.
In FIG. 4, circles representing transition states appear before the frame buffer 13 (to its left in the figure). They indicate that, while the class determination is in progress, the maximum likelihood sequence for the undecided section may be computed so as to include the class feature of the last state of the decided section. This improves the accuracy of the maximum likelihood sequence.
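The buffer handling described above, in which decided frames are handed on and undecided frames move to the head, can be sketched as follows. The class name, its methods, and the use of a Python list are hypothetical; only the 50-frame capacity comes from the example given for the frame buffer 13.

```python
class FrameBuffer:
    """Holds up to `capacity` frames; decided frames are removed from the head."""

    def __init__(self, capacity=50):
        self.capacity = capacity  # N_buf
        self.frames = []

    def push(self, feature):
        if len(self.frames) >= self.capacity:
            raise OverflowError("buffer full; commit decided frames first")
        self.frames.append(feature)

    def commit(self, n_det):
        """Return the first n_det (decided) frames and keep the undecided rest at the head."""
        if not 0 <= n_det <= len(self.frames):
            raise ValueError("n_det must be between 0 and the number of stored frames")
        decided = self.frames[:n_det]
        self.frames = self.frames[n_det:]  # N_not frames now start at index 0
        return decided
```

With a full buffer of 50 frames and 30 decided frames, 20 undecided frames remain at the head, and 30 new frames can be appended before the next decision.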

The noise section detection unit 17 detects noise sections, which distinguish the types of noise, on the basis of the class features that the class feature calculation unit 15 calculates for each frame.
Here, the noise section detection unit 17 determines whether each frame belongs to the class “no noise”, “foreign-language speech”, “music”, or “other noise”. For example, a run of one or more consecutive frames determined to be “music” constitutes a noise section caused by music.

That is, the noise section detection unit 17 uses an ergodic HMM with the four states “no noise”, “foreign-language speech”, “music”, and “other noise”, as shown in FIG. 5, and makes probabilistic transitions between the states (classes) on the basis of the class features (posterior probabilities) received from the class feature calculation unit 15.
The noise section detection unit 17 then identifies the class of each frame by finding the maximum likelihood sequence, that is, the sequence of class transitions with the highest probability for the acoustic features stored in the frame buffer 13.

Apart from the difference in class types, the noise section detection unit 17 finds the maximum likelihood sequence and identifies the class of each frame with the same technique (the Viterbi algorithm) as the speech section detection unit 16, as shown in FIG. 6. FIG. 6 is the same as FIG. 4 except for the class types, so further description is omitted.

Because the frame buffer 13 is shared with the speech section detection unit 16, the noise section detection unit 17 operates in synchronization with it: at the point when the speech section detection unit 16 has fixed its maximum likelihood sequence, the noise section detection unit 17 fixes its own maximum likelihood sequence up to the same point (selecting one if several candidates exist) and outputs, for each frame, a tag identifying the determined class to the noise suppression processing selection unit 18.
As a result, for each frame, the class tag determined by the speech section detection unit 16 (“native-language speech”, “non-speech”, or “silence”) and the class tag determined by the noise section detection unit 17 (“no noise”, “foreign-language speech”, “music”, or “other noise”) are output as a pair to the noise suppression processing selection unit 18.

The noise suppression processing selection means 18 selects, for each frame, one of a plurality of predetermined noise suppression processes, based on the per-frame class detected by the speech section detection means 16 and the per-frame class detected by the noise section detection means 17.
Here, for frames determined to be "native language speech" by the speech section detection means 16, the noise suppression processing selection means 18 switches the noise suppression process according to the noise type (class) determined by the noise section detection means 17.

Specifically, when the noise class of a frame determined to be "native language speech" is "no noise", the noise suppression processing selection means 18 switches the output of input a to output b, the path on which no noise suppression is performed, and outputs the acoustic feature amount stored in the frame buffer 13 to the noise suppression means 20.

When the noise class of a frame determined to be "native language speech" is "foreign language speech", the noise suppression processing selection means 18 switches the output of input a to output c, the path to the means for suppressing foreign language speech (here, the specific noise suppression means 21a), and outputs the acoustic feature amount stored in the frame buffer 13 to the noise suppression means 20.

When the noise class of a frame determined to be "native language speech" is "music", the noise suppression processing selection means 18 switches the output of input a to output d, the path to the means for suppressing music (here, the specific noise suppression means 21b), and outputs the acoustic feature amount stored in the frame buffer 13 to the noise suppression means 20.

When the noise class of a frame determined to be "native language speech" is "other noise", the noise suppression processing selection means 18 switches the output of input a to output e, the path to the means for suppressing other specific noise (here, the specific noise suppression means 21c), and outputs the acoustic feature amount stored in the frame buffer 13 to the noise suppression means 20.

For frames determined to be "non-speech" by the speech section detection means 16, the noise suppression processing selection means 18 switches the output of input a to output f, the path that stops output, regardless of the noise type (class) determined by the noise section detection means 17. This prevents the increase in speech recognition errors that would result if the speech recognition device 2 mistook non-speech for speech.
For frames determined to be "silence" by the speech section detection means 16, the noise suppression processing selection means 18 may either switch the output of input a to output b, so that the acoustic feature amount of the silence is output to the noise suppression means 20, or switch it to output f, so that the acoustic feature amount of the silence is not output.
In this way, the noise suppression processing selection means 18 can select the noise suppression method best suited to the type of noise superimposed on the speech.
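The routing rules for outputs b through f can be summarized as a small lookup. The sketch below is only an illustration of the rules as described; the function name `select_output` and the `pass_silence` flag (which picks between the two permitted treatments of "silence" frames) are assumptions introduced here.

```python
# Output path for "native language speech" frames, keyed by noise class.
NOISE_ROUTES = {
    "no noise": "b",                 # bypass: no suppression
    "foreign language speech": "c",  # specific noise suppression means 21a
    "music": "d",                    # specific noise suppression means 21b
    "other noise": "e",              # specific noise suppression means 21c
}

def select_output(speech_tag, noise_tag, pass_silence=True):
    """Return the output path (b, c, d, e or f) for one frame."""
    if speech_tag == "non-speech":
        return "f"  # output stopped, whatever the noise class is
    if speech_tag == "silence":
        return "b" if pass_silence else "f"  # either choice is allowed
    if speech_tag != "native language speech":
        raise ValueError(f"unknown speech class: {speech_tag!r}")
    return NOISE_ROUTES[noise_tag]
```

For example, `select_output("native language speech", "music")` yields path d, while any "non-speech" frame yields path f.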

Although the noise suppression processing selection means 18 here selects one of the plurality of noise suppression processes by switching input/output paths, it may instead output the acoustic feature amount to the noise suppression means 20 together with an identifier that identifies the noise suppression process.

The noise suppression means 20 suppresses the noise superimposed on the speech using the noise suppression process selected by the noise suppression processing selection means 18. Here, the noise suppression means 20 suppresses noise by correcting the acoustic feature amount with a noise suppression model trained in advance.
The noise suppression means 20 comprises a plurality of specific noise suppression means 21 (21a, 21b, 21c), one for each specific type of noise, and a noise suppression model storage means 22.

The specific noise suppression means 21a calculates, from the acoustic feature amount of each frame and based on a model stored in the noise suppression model storage means 22, an acoustic feature amount in which the features of foreign language speech are suppressed.

The specific noise suppression means 21b calculates, from the acoustic feature amount of each frame and based on a model stored in the noise suppression model storage means 22, an acoustic feature amount in which the features of music are suppressed.

The specific noise suppression means 21c calculates, from the acoustic feature amount of each frame and based on a model stored in the noise suppression model storage means 22, an acoustic feature amount in which the features of other specific noise are suppressed.
Each of the specific noise suppression means 21a, 21b, and 21c uses its own dedicated model stored in the noise suppression model storage means 22.

A neural network that suppresses a specific type of noise can be used for each model stored in the noise suppression model storage means 22. Each neural network is trained in advance on pairs consisting of speech with noise superimposed (the input) and the same speech without noise (the teacher signal).

That is, each of the specific noise suppression means 21 (21a, 21b, 21c) takes the noise-superimposed acoustic feature amount supplied to the input layer and propagates it through to the output layer using the neural network parameters (the weight matrices and biases connecting the network layers) trained in advance for its noise type (foreign language speech, music, or other noise), thereby generating an acoustic feature amount in which the noise is suppressed (a noise-suppressed acoustic feature amount).

The noise suppression model storage means 22 stores models for converting the acoustic feature amount of speech on which specific noise is superimposed into the acoustic feature amount of speech in which the noise is suppressed. Here, the noise suppression model storage means 22 stores neural network parameters (the weight matrices and biases connecting the network layers) for each noise type (foreign language speech, music, and other noise).
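The computation from input layer to output layer can be sketched as a plain feed-forward pass over stored (weight matrix, bias) pairs. The description does not specify layer sizes or activation functions, so the `tanh` hidden activation and the linear output layer below are assumptions, and `forward` is a hypothetical name.

```python
import math

def forward(features, layers):
    """Propagate one frame of acoustic features through a network.

    layers: list of (weights, bias) pairs, one per layer, where
            weights[j][i] connects input i to output j.
    Hidden layers use tanh (an assumption); the last layer is linear so
    that the output can take any real feature value.
    """
    x = list(features)
    for index, (weights, bias) in enumerate(layers):
        z = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        is_last = index == len(layers) - 1
        x = z if is_last else [math.tanh(v) for v in z]
    return x

# One parameter set per noise type, as held by the model storage.
# The values here are placeholders, not trained parameters.
identity = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
models = {
    "foreign language speech": identity,
    "music": identity,
    "other noise": identity,
}
```

Selecting a suppressor then amounts to choosing which entry of `models` is passed to `forward` for a given frame.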

Noise suppression using a neural network is a known technique and is described, for example, in the following reference.
(Reference) Xue Feng, Yaodong Zhang, James Glass, "SPEECH FEATURE DENOISING AND DEREVERBERATION VIA DEEP AUTOENCODERS FOR NOISY REVERBERANT SPEECH RECOGNITION", 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

The noise suppression means 20 outputs the noise-suppressed acoustic feature amounts generated by the specific noise suppression means 21a, 21b, and 21c to the speech recognition means 30.
For frames whose noise class is determined to be "no noise", that is, for acoustic feature amounts received from output b of the noise suppression processing selection means 18, the noise suppression means 20 outputs the acoustic feature amount unchanged to the speech recognition means 30 without passing it through the specific noise suppression means 21.

The speech recognition means 30 performs speech recognition with acoustic feature amounts as input.
Here, the speech recognition means 30 receives from the noise suppression means 20 the acoustic feature amounts of speech in which noise has been suppressed (noise-suppressed acoustic feature amounts).
Any common method may be used by the speech recognition means 30 to perform speech recognition from the acoustic feature amounts. For example, the speech recognition means 30 may estimate a character string from the sequentially input acoustic feature amounts using an acoustic model, in which acoustic feature amounts are modeled by a hidden Markov model (HMM), and a language model, in which the connection relationships between words are modeled.
The speech recognition means 30 outputs the estimated character string to the outside as the recognition result. For example, it may output the character string to a display device (not shown) or record it on a recording medium.
Because the speech recognition means 30 performs speech recognition using acoustic feature amounts in which the noise components of the input speech have been suppressed, recognition accuracy can be improved.

With the configuration described above, the noise-suppressed speech recognition apparatus 1 can accurately estimate the utterance sections (speech sections) of native language speech from speech uttered in a general environment that contains noise. In addition, the noise-suppressed speech recognition apparatus 1 determines the type of noise superimposed on the speech and suppresses it with a noise suppression method suited to that type, so that noise is suppressed effectively and speech recognition accuracy is improved.
The noise-suppressed speech recognition apparatus 1 can be realized by running on a computer a program (a noise-suppressed speech recognition program) that causes the computer to function as each of the means described above.

(Training of the statistical model)
Next, training of the statistical model (recurrent neural network) stored in the statistical model storage means 14 will be described.
The example described here trains, as the recurrent neural network N shown in FIG. 2, a statistical model that takes an acoustic feature amount of a predetermined dimension as input and outputs the class feature amounts (posterior probabilities) of the speech classes (here, native language speech, non-speech, and silence) and the class feature amounts (posterior probabilities) of the noise classes (here, no noise, foreign language speech, music, and other noise).
In this case, the weight matrices and biases connecting the network layers of the recurrent neural network N may be obtained by error backpropagation, an existing algorithm that uses a teacher signal. The training data consist of speech (acoustic feature amounts) on which known noise is superimposed and the posterior probability of each class for each frame of that speech.
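One way to picture the two groups of posterior outputs and the error that backpropagation would minimize is sketched below. The description does not state the output activation or the loss function, so the two separate softmax groups and the cross-entropy loss are assumptions, as are the names `softmax` and `two_head_loss`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def two_head_loss(logits, speech_target, noise_target):
    """Split 7 output scores into 3 speech and 4 noise posteriors.

    Returns both posterior vectors and the summed cross-entropy against
    the per-frame teacher signal (assumed loss, for illustration only).
    """
    speech_post = softmax(logits[:3])
    noise_post = softmax(logits[3:])
    eps = 1e-12  # avoids log(0)
    targets = list(speech_target) + list(noise_target)
    outputs = speech_post + noise_post
    loss = -sum(t * math.log(p + eps) for t, p in zip(targets, outputs))
    return speech_post, noise_post, loss
```

Normalizing the speech and noise groups separately reflects that every frame carries exactly one speech class and one noise class at the same time.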

An example of the training data used as the teacher signal will now be described with reference to FIG. 7 (and FIG. 1 as appropriate).
FIG. 7(a) shows 12 frames of acoustic feature amounts of speech on which known noise is superimposed for a predetermined time. These acoustic feature amounts are extracted from the noise-superimposed speech by the same method as the acoustic feature amount extraction means 11 shown in FIG. 1.
These acoustic feature amounts are the signal supplied to the input layer of the recurrent neural network N.

FIG. 7(b) shows which classes of sound make up each frame in (a). Here, native language speech is present from the 3rd to the 9th frame, music from the 1st to the 10th frame, and foreign language speech from the 11th to the 12th frame.

FIG. 7(c) shows the sound classes and the posterior probabilities of their states (the teacher signal), which are the class feature amounts output to the speech section detection means 16 and the noise section detection means 17, respectively.
For example, as the output for the speech section detection means 16, the 1st to 2nd frames and the 10th to 12th frames contain no native language speech, as shown in FIG. 7(b), so the posterior probability "1.0" is expected for the non-speech state, as shown in FIG. 7(c). The 3rd to 9th frames contain native language speech, so the posterior probability "1.0" is expected for the speech state.

Similarly, as the output for the noise section detection means 17, the 1st to 10th frames contain music, as shown in FIG. 7(b), so the posterior probability "1.0" is expected for the music state, as shown in FIG. 7(c). The 11th to 12th frames contain foreign language speech, so the posterior probability "1.0" is expected for the foreign language speech state.
The recurrent neural network can be constructed by training it on a variety of such known training data as the teacher signal.
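The teacher signal of the 12-frame example in FIG. 7 can be written out as one-hot target vectors. The sketch below only encodes the frame ranges stated in the text; the helper names `one_hot` and `label_frames` are hypothetical.

```python
SPEECH_CLASSES = ["native language speech", "non-speech", "silence"]
NOISE_CLASSES = ["no noise", "foreign language speech", "music", "other noise"]

def one_hot(label, classes):
    """Posterior 1.0 for the labelled class and 0.0 for all others."""
    return [1.0 if c == label else 0.0 for c in classes]

def label_frames(num_frames, spans, default):
    """spans maps a class label to a 1-based, inclusive (first, last) range."""
    labels = [default] * num_frames
    for label, (first, last) in spans.items():
        for frame in range(first, last + 1):
            labels[frame - 1] = label
    return labels

# Frame ranges taken from the FIG. 7 example.
speech_labels = label_frames(12, {"native language speech": (3, 9)},
                             default="non-speech")
noise_labels = label_frames(12, {"music": (1, 10),
                                 "foreign language speech": (11, 12)},
                            default="no noise")

speech_targets = [one_hot(label, SPEECH_CLASSES) for label in speech_labels]
noise_targets = [one_hot(label, NOISE_CLASSES) for label in noise_labels]
```

Each training frame is thus paired with two target vectors, one for the speech classes and one for the noise classes.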

[Operation of the noise-suppressed speech recognition apparatus]
Next, the operation of the noise-suppressed speech recognition apparatus 1 according to the embodiment of the present invention will be described with reference to FIGS. 8 and 9. It is assumed here that the models have been stored in advance in the statistical model storage means 14 and the noise suppression model storage means 22.
The operation of the noise-suppressed speech recognition apparatus 1 is described in two parts: the operation of accumulating acoustic feature amounts in the frame buffer 13, and the operation of performing noise-suppressed speech recognition on the acoustic feature amounts accumulated in the frame buffer 13.

(Acoustic feature amount accumulation operation)
First, the operation of accumulating acoustic feature amounts in the frame buffer 13 of the noise-suppressed speech recognition apparatus 1 will be described with reference to FIG. 8 (see FIG. 1 as appropriate for the configuration).

Using the acoustic feature amount extraction means 11, the noise-suppressed speech recognition apparatus 1 cuts the input speech into predetermined time units (frames) and extracts an acoustic feature amount of a predetermined order for each frame (step S10). Here, the acoustic feature amount extraction means 11 extracts log mel filter bank outputs as the acoustic feature amounts.

Using the feature amount normalization means 12, the noise-suppressed speech recognition apparatus 1 then normalizes the acoustic feature amounts of the predetermined order extracted in step S10 so that their mean is "0" and their variance is "1" (step S11), and accumulates them sequentially in the frame buffer 13 (step S12).
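The normalization in step S11 can be sketched as per-dimension mean and variance normalization. The description does not say over which span of frames the statistics are computed, so computing them over the frames passed in (for example, one utterance) is an assumption, and `normalize` is a hypothetical name.

```python
import math

def normalize(frames):
    """Scale each feature dimension to mean 0 and variance 1.

    frames: list of equal-length feature vectors (one per frame).
    Statistics are taken over the supplied frames (an assumption).
    """
    count = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / count)
            for d in range(dims)]
    # A constant dimension has zero spread; divide by 1.0 to avoid 0/0.
    return [[(f[d] - means[d]) / (stds[d] or 1.0) for d in range(dims)]
            for f in frames]
```

A streaming system would more likely keep running estimates of the mean and variance, since the full utterance is not available while frames are still arriving.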

While speech continues to be input to the acoustic feature amount extraction means 11 (Yes in step S13), the noise-suppressed speech recognition apparatus 1 returns to step S10 and continues accumulating acoustic feature amounts in the frame buffer 13. When the input speech ends (No in step S13), the noise-suppressed speech recognition apparatus 1 ends the accumulation operation.

(Class-specific noise suppression operation)
Next, the operation in which the noise-suppressed speech recognition apparatus 1 suppresses noise by switching the noise suppression method according to the noise class will be described with reference to FIG. 9 (see FIG. 1 as appropriate for the configuration).

The class feature amount calculation means 15 of the noise-suppressed speech recognition apparatus 1 waits until acoustic feature amounts have been accumulated in the frame buffer 13 (No in step S20). Once acoustic feature amounts have been accumulated (Yes in step S20), the noise-suppressed speech recognition apparatus 1 uses the class feature amount calculation means 15 to calculate from them, based on the statistical model stored in the statistical model storage means 14, the posterior probability (class feature amount) of each speech class (native language speech, non-speech, and silence) and of each noise class (no noise, foreign language speech, music, and other noise) (step S21).

The noise-suppressed speech recognition apparatus 1 then uses the speech section detection means 16 and the noise section detection means 17 to obtain, respectively, the maximum likelihood sequence of classes for the acoustic feature amounts from the per-class values calculated in step S21. If the speech section detection means 16 has not yet determined a maximum likelihood sequence (No in step S22), the noise-suppressed speech recognition apparatus 1 returns to step S21.

If the speech section detection means 16 has determined a maximum likelihood sequence (Yes in step S22), the operation proceeds to step S23 and beyond. At this time, the noise section detection means 17, in synchronization with the speech section detection means 16, determines one maximum likelihood sequence of noise sections up to the point at which the speech section detection means 16 determined its maximum likelihood sequence.

Using the noise suppression processing selection means 18, the noise-suppressed speech recognition apparatus 1 then selects, for each frame, the one noise suppression process that matches the noise type from among the plurality of predetermined noise suppression processes, based on the per-frame class detected as the maximum likelihood sequence by the speech section detection means 16 and the per-frame class detected as the maximum likelihood sequence by the noise section detection means 17 (step S23).

As the noise suppression process, the noise-suppressed speech recognition apparatus 1 then uses the specific noise suppression means 21 that performs the process selected in step S23 to calculate, from the per-frame acoustic feature amounts accumulated in the frame buffer 13 and based on the model stored in the noise suppression model storage means 22, acoustic feature amounts in which the noise components are suppressed (step S24).
If the class of a frame detected as the maximum likelihood sequence by the speech section detection means 16 is not speech (native language speech), the acoustic feature amount of that frame is not output from the frame buffer 13 to the specific noise suppression means 21.
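Steps S23 and S24 together can be sketched as a per-frame filter: frames that are not native language speech are dropped, "no noise" frames pass through unchanged, and all other frames go through the suppressor for their noise class. The function name `suppress_frames` and the representation of each suppressor as a Python callable are assumptions for illustration.

```python
def suppress_frames(frames, speech_tags, noise_tags, suppressors):
    """Apply the class-specific suppressor to each speech frame.

    frames:      per-frame acoustic feature vectors from the buffer.
    speech_tags: per-frame speech class from the speech section detector.
    noise_tags:  per-frame noise class from the noise section detector.
    suppressors: dict mapping a noise class to a callable that returns a
                 noise-suppressed feature vector (hypothetical interface).
    """
    output = []
    for features, speech_tag, noise_tag in zip(frames, speech_tags, noise_tags):
        if speech_tag != "native language speech":
            continue  # not passed on to suppression or recognition
        if noise_tag == "no noise":
            output.append(features)  # bypass path, features unchanged
        else:
            output.append(suppressors[noise_tag](features))
    return output
```

The returned list corresponds to the sequence of feature vectors that is handed to the speech recognition stage.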

Using the speech section detection means 16, the noise-suppressed speech recognition apparatus 1 then clears from the frame buffer 13 the acoustic feature amounts whose classes have been determined by the maximum likelihood sequence (step S25).
After that, the noise-suppressed speech recognition apparatus 1 uses the speech recognition means 30 to perform speech recognition on the per-frame acoustic feature amounts calculated sequentially by the plurality of specific noise suppression means 21 (step S26).

If further speech is being input (Yes in step S27), the noise-suppressed speech recognition apparatus 1 returns to step S20 and continues operating. If no speech is being input (No in step S27), the noise-suppressed speech recognition apparatus 1 ends the operation.
Through the above operation, the noise-suppressed speech recognition apparatus 1 detects speech sections in speech on which noise is superimposed, suppresses the noise with a noise suppression method that matches the noise type in each speech section, and can thereby perform speech recognition accurately.

The configuration and operation of the noise-suppressed speech recognition apparatus 1 according to the embodiment of the present invention have been described above, but the present invention is not limited to this embodiment.
For example, although a recurrent neural network has been described as the statistical model stored in the statistical model storage means 14, other statistical models may be used. For example, a general feedforward neural network may be used.

Although three speech classes, "native language speech", "non-speech", and "silence", are defined here for the speech section detection means 16, the "silence" class may be omitted when the speech to be recognized is known to contain no silent state.
Likewise, although four noise classes, "no noise", "foreign language speech", "music", and "other noise", are defined here for the noise section detection means 17, additional noise classes such as "applause" and "laughter" may be defined. If a certain class of noise is known in advance not to occur, that class may be omitted.

Although the configuration here includes the feature amount normalization means 12, it may be omitted in cases such as when only a specific speaker speaks.
Although the speech recognition means 30 is provided inside the noise-suppressed speech recognition apparatus 1 here, it may be separated and provided externally as a speech recognition apparatus.

Description of symbols
1 Noise-suppressed speech recognition apparatus
10 Section detection means
11 Acoustic feature amount extraction means
12 Feature amount normalization means
13 Frame buffer
14 Statistical model storage means
15 Class feature amount calculation means
16 Speech section detection means
17 Noise section detection means
18 Noise suppression processing selection means
20 Noise suppression means
21 Specific noise suppression means
22 Noise suppression model storage means
30 Speech recognition means

Claims

A noise-suppressed speech recognition apparatus that performs speech recognition by applying noise suppression to input speech, comprising:
acoustic feature amount extraction means for extracting acoustic feature amounts from the input speech in units of frames of a predetermined time length;
statistical model storage means for storing a statistical model trained in advance on the relationship between the acoustic feature amounts and each type of speech, including native language speech that is the target of speech recognition, and on the relationship between the acoustic feature amounts and each type of noise;
class feature amount calculation means for calculating, based on the statistical model and from the acoustic feature amounts extracted by the acoustic feature amount extraction means, for each frame, the posterior probability that each class of the speech types appears and the posterior probability that each class of the noise types appears, as the class feature amounts of the respective classes;
speech section detection means for detecting speech sections of the native language speech based on the posterior probabilities that the classes of the speech types appear;
noise section detection means for detecting a noise section for each type of noise based on the posterior probabilities that the classes of the noise types appear;
noise suppression processing selection means for selecting a predetermined noise suppression method according to the type of noise in the noise section corresponding to the speech section;
noise suppression means for generating, by the noise suppression method selected by the noise suppression processing selection means, acoustic feature amounts in which the acoustic feature amounts of the noise in the speech section are suppressed; and
speech recognition means for performing speech recognition using the acoustic feature amounts generated by the noise suppression means.

The noise-suppressed speech recognition apparatus according to claim 1, wherein the statistical model is a neural network that models, from the acoustic feature amounts, the posterior probability that each class of the speech types appears and the posterior probability that each class of the noise types appears.

The noise-suppressed speech recognition apparatus according to claim 1 or claim 2, wherein the speech section detection means detects the speech sections in the state transition sequence of the classes of the speech types based on a hidden Markov model.

The noise-suppressed speech recognition apparatus according to any one of claims 1 to 3, wherein the noise section detection means detects the noise section for each type of noise in the state transition sequence of the classes of the noise types based on a hidden Markov model.

The noise-suppressed speech recognition apparatus according to any one of claims 1 to 4, further comprising feature amount normalization means for normalizing the mean and variance of the acoustic feature amounts extracted by the acoustic feature amount extraction means.

A noise-suppressed speech recognition program for causing a computer to function as the noise-suppressed speech recognition apparatus according to any one of claims 1 to 5.