JP2004053965A

JP2004053965A - Speech recognition device

Info

Publication number: JP2004053965A
Application number: JP2002211841A
Authority: JP
Inventors: Hiroyuki Hoshino; 星野　博之
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2002-07-19
Filing date: 2002-07-19
Publication date: 2004-02-19
Anticipated expiration: 2022-07-19
Also published as: JP4003566B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that exhibits an optimum noise suppression function by adapting to the state of ambient noise.

SOLUTION: A threshold frequency ω_th computation part 40 determines a threshold frequency ω_th at which a spectrum S(ω) including voice is larger than a noise frequency spectrum N(ω) by, for example, 5 dB. A lowpass function part 41 performs, on the frequency axis, the role that an LPF performs on the time axis, and outputs S_low(ω) to a low frequency band NSS processing part 10. A highpass function part 42 performs, on the frequency axis, the role that an HPF performs on the time axis, and outputs S_high(ω) to a high frequency band WF processing part 20. Using a subtraction parameter α updated as needed, an NSS operation part 12 computes an output P_low(ω) by a nonlinear spectrum subtraction method. A WF determination part 21 reads N(ω) and determines a Wiener filter H(ω) from S_high(ω). Next, a WF operation part 22 computes an output P_high(ω) by multiplying S_high(ω) by H(ω).

COPYRIGHT: (C)2004,JPO

Description

[0001]
[Technical field of the invention]
The present invention relates to a speech recognition device that operates effectively in noisy environments where noise is generated continuously.
[0002]
[Prior art]
Needless to say, in a speech recognition device, which analyzes and understands pronunciation, words, and sentences from input speech, it is desirable to remove the noise signal and extract only the speech signal. However, in environments where the noise is continuous but not constant, it is not easy to predict the noise in advance. Examples of such non-white-noise environments include the cockpits and cargo compartments of moving vehicles, ships, and aircraft, and factories and warehouses with noise from working equipment and transport equipment.
[0003]
For speech recognition devices operating in such environments, where the noise is continuous but not constant, one technique for reducing the noise is the spectral subtraction method (S. F. Boll, IEEE Trans. Acoust. Speech Signal Process., Vol. 27, No. 2, April 1979, pp. 113-120). In the linear spectral subtraction method, the input signal is converted into a frequency spectrum and classified into signal sections that include speech and background noise signal sections; the frequency spectrum of the immediately preceding background noise signal section is then subtracted from the frequency spectrum of the signal section that includes speech to obtain the frequency spectrum of the speech signal. Noise suppression can be made more effective by uniformly multiplying the power of the frequency spectrum of the immediately preceding background noise signal section by a factor of 1 to 3 before subtracting it from the frequency spectrum of the signal section that includes speech.
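The linear subtraction described above can be sketched as follows. This is a minimal illustration, assuming power spectra held as plain Python lists; the function name, the parameter `k` (standing for the uniform factor of 1 to 3), and the clamping of negative results to zero are assumptions made here, not details given in this paragraph.

```python
def linear_spectral_subtraction(speech_spec, noise_spec, k=1.0):
    """Subtract k times the noise power from the noisy-speech power, bin by bin."""
    if len(speech_spec) != len(noise_spec):
        raise ValueError("spectra must have the same number of bins")
    # Assumption: clamp at zero so the result stays a non-negative power spectrum.
    return [max(s - k * n, 0.0) for s, n in zip(speech_spec, noise_spec)]
```

With `k` above 1 the noise estimate is over-subtracted, which removes more residual noise at the cost of zeroing weak speech bins.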
[0004]
On the other hand, a method known as nonlinear spectral subtraction, in which the subtraction parameter α is set for each frequency, is also known (P. Lockwood and J. Bondy, Speech Communication, 11 (1992) 215). In this method, the subtraction parameter α(ω) for each frequency is set to the maximum value, at that frequency ω, of the frequency spectrum of sections that do not include speech (or to a value proportional to it). For example, 40 frames are cut out along the time axis, each is converted to the frequency domain, and for each frequency the maximum of the 40 spectrum (power) values is taken. Methods of setting the subtraction parameter α are disclosed in Japanese Patent Application Laid-Open Nos. 9-160594 and 10-177394, as well as in Japanese Patent Application Laid-Open No. 2002-14694 by the applicant.
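As a rough sketch of the per-frequency maximum described above, assuming the noise-only frames (for example, 40 of them) are already available as equal-length lists of power values; the function name is hypothetical.

```python
def max_noise_per_frequency(noise_frames):
    """Return, for each frequency bin, the maximum power over all noise-only frames."""
    if not noise_frames:
        raise ValueError("at least one noise frame is required")
    # zip(*frames) regroups the data so each tuple holds one bin across all frames.
    return [max(bin_values) for bin_values in zip(*noise_frames)]
```

The result can be used directly as α(ω) or scaled by a constant, matching the "proportional" variant mentioned in the text.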
[0005]
A Wiener filter, which applies the filter given by the following equation, is also known. Because the Wiener filter is a linear process, it does not degrade the speech the way the spectral subtraction method does.
(Equation 1)
$$H(\omega) = \left\{ \frac{S(\omega)}{S(\omega) + N(\omega)} \right\}^{\beta}$$
[0006]
In Equation 1, ω is the frequency, S(ω) is the spectrum of the signal with noise superimposed, N(ω) is the signal spectrum (noise) of the speech-free section immediately preceding the section that includes speech, and β is a constant; { }^β denotes { } raised to the power of β. β is, for example, 2.
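As a worked instance of Equation 1, assume purely for illustration that the noisy-signal power in one bin is three times the stored noise power, S(ω) = 3N(ω), and take β = 2 as in the text:

```latex
H(\omega) = \left\{ \frac{3N(\omega)}{3N(\omega) + N(\omega)} \right\}^{2}
          = \left( \frac{3}{4} \right)^{2} = 0.5625
```

A bin with that ratio would be scaled to about 56% of its input value, while a bin where S(ω) is much larger than N(ω) passes nearly unchanged, since H(ω) approaches 1.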
[0007]
Techniques that use a different noise suppression means for each frequency band are also known. In Japanese Patent Application Laid-Open No. 9-34496, the frequency band is divided into three at two boundary frequencies, 240 Hz and 800 Hz; a high-pass filter is used in the low frequency band, weighting according to the S/N ratio in the middle frequency band, and an adaptive filter in the high frequency band. Also known, as in J. Meyer and K. U. Simmer, IEEE ICASSP-97, pp. 1167-1170, is a technique that uses spectral subtraction for the low frequency band and a Wiener filter for the high frequency band, with a boundary frequency of about 1700 Hz.
[0008]
[Problems to be solved by the invention]
Both Japanese Patent Application Laid-Open No. 9-34496 and the technique of J. Meyer et al. fix the boundary frequency. However, neither the publication nor the paper by J. Meyer et al. clearly shows how the boundary frequency should be set or why that setting is optimal. In practice, for the cabin noise of a moving automobile, for example, the noise level varies greatly with driving conditions such as vehicle speed, and the boundary frequency should be set according to such noise conditions.
[0009]
In view of the above, the present invention provides, for techniques that use a plurality of noise suppression means, a technique for making variable the boundary frequency between the frequency bands to which those means are applied.
[0010]
[Means for Solving the Problems]
To solve the above problem, according to the means of claim 1, a speech recognition device having a noise suppression function under noise has a plurality of noise suppression functions and uses them selectively for each frequency band according to a variable boundary frequency. According to the means of claim 2, the variable boundary frequency is set as needed according to the S/N ratio or noise level of the input signal. According to the means of claim 3, there are two noise suppression functions, which are applied to the low frequency side and the high frequency side, respectively, of the variable boundary frequency.
[0011]
According to the means of claim 4, the noise suppression function on the low frequency side is nonlinear spectral subtraction, and the noise suppression function on the high frequency side is a Wiener filter.
[0012]
According to the means of claim 5, the nonlinear spectral subtraction performs its noise suppression function by means of: frequency analysis means for obtaining a frequency spectrum for an arbitrary section; subtraction parameter calculation means for obtaining, for a time section that does not include speech, the spectral envelope of the noise frequency spectrum obtained by the frequency analysis means and setting a subtraction parameter corresponding to the spectral envelope at each frequency; and subtraction means for subtracting, for a time section that includes speech, from the frequency spectrum obtained by the frequency analysis means, the value obtained by multiplying the noise frequency spectrum at each frequency by the subtraction parameter determined for that frequency by the subtraction parameter calculation means.
[0013]
[Action and effect of the invention]
When a plurality of noise suppression functions are used selectively for each frequency band, making the boundary frequency variable allows the noise suppression function best suited to the sound conditions to be used in each band. The boundary frequency is desirably set as needed according to the S/N ratio or noise level of the input signal. The simplest configuration is obtained with two noise suppression functions, applied to the low frequency side and the high frequency side, respectively, of the variable boundary frequency.
[0014]
Nonlinear spectral subtraction suppresses noise well where the S/N ratio is small, that is, where the noise is large, and is therefore suited to the low frequency band. The Wiener filter suppresses noise well where the S/N ratio is large, that is, where the noise is small, and is therefore suited to the high frequency band. Using the applicant's technique from Japanese Patent Application Laid-Open No. 2002-14694 for the nonlinear spectral subtraction allows the device to be made smaller and the computation faster.
[0015]
[Best mode for carrying out the invention]
First, FIG. 1 shows the spectrum of speech containing no noise, together with the speech-free noise spectra in the vehicle cabin under three conditions: stationary with the engine running, traveling at 100 km/h, and traveling at 120 km/h. In most of the region below 5000 Hz, the noise spectrum for the stationary, engine-running condition is at least 20 dB below the speech spectrum. In contrast, the noise spectra at 100 km/h and 120 km/h have portions below 2000 Hz where the noise is comparable to or larger than the speech spectrum. The noise spectrum at 100 km/h falls to 5 dB below the speech spectrum at about 2000 Hz and is at least 5 dB below it at higher frequencies. The noise spectrum at 120 km/h falls to 5 dB below the speech spectrum at about 2500 Hz and is below the speech spectrum at higher frequencies. Accordingly, the speech spectrum and the noise spectrum are compared in bands of, for example, 500 Hz; noise is suppressed with a Wiener filter (WF) in and above the band where the S/N ratio reaches, for example, 5 dB, and with nonlinear spectral subtraction (NSS) below it. In this way, the optimal noise suppression means can be used for each frequency band while the boundary remains variable. As another method, the Wiener filter (WF) can be used at each individual frequency where the S/N ratio is, for example, 5 dB or more, and nonlinear spectral subtraction (NSS) where it is 5 dB or less.
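The per-frequency alternative mentioned at the end of this paragraph can be sketched as below. The sketch assumes power spectra (so the S/N ratio in dB is 10·log10 of the ratio) and resolves the tie at exactly 5 dB in favor of WF; the function name and the handling of empty bins are assumptions.

```python
import math

def choose_method_per_bin(speech_spec, noise_spec, snr_threshold_db=5.0):
    """Label each bin 'WF' when its S/N ratio reaches the threshold, else 'NSS'."""
    labels = []
    for s, n in zip(speech_spec, noise_spec):
        if n <= 0.0:
            snr_db = math.inf            # assumption: no noise means infinite S/N
        elif s <= 0.0:
            snr_db = -math.inf           # assumption: no signal means minus infinity
        else:
            snr_db = 10.0 * math.log10(s / n)  # power ratio expressed in dB
        labels.append("WF" if snr_db >= snr_threshold_db else "NSS")
    return labels
```

Unlike a single boundary frequency, this labelling may alternate between WF and NSS from bin to bin when the S/N ratio fluctuates around the threshold.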
[0016]
FIG. 2 shows the configuration of the speech recognition device 100 that operates as described above. The input signal is converted into a frequency spectrum signal by a fast Fourier transformer (FFT, frequency analysis means) 1. The spectrum signal covers, for example, the range of 0 to 10 kHz. Next, the voice presence/absence determination unit (speech section determination means) 2 determines from the frequency spectrum signal whether the series of input signals contains speech. The determination is based on features such as whether the power of the frequency spectrum in the range of 1000 to 4000 Hz is greater than the power of the frequency spectrum in the other ranges. When the section is judged to be a noise signal section that does not include speech, its frequency spectrum is stored in the noise frequency spectrum storage unit (memory) 3 as the noise frequency spectrum N(ω).
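One possible reading of the band-power feature mentioned above is sketched here. It is only an illustration: the text names the feature as an example and does not specify the comparison, so comparing the summed power inside 1000 to 4000 Hz with the summed power of all other bins, along with the function name and the `bin_hz` parameter, are assumptions.

```python
def looks_like_speech(power_spec, bin_hz, lo_hz=1000.0, hi_hz=4000.0):
    """Return True when the 1000-4000 Hz band holds more power than all other bins."""
    in_band = 0.0
    out_band = 0.0
    for i, p in enumerate(power_spec):
        freq = i * bin_hz  # frequency of bin i, assuming bin 0 sits at 0 Hz
        if lo_hz <= freq <= hi_hz:
            in_band += p
        else:
            out_band += p
    return in_band > out_band
```

Frames for which this returns False would be treated as noise-only and used to refresh N(ω).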
[0017]
This continues until a signal section that includes speech is input, and the noise frequency spectrum N(ω) is updated throughout. When a signal section that includes speech is input, the output of the fast Fourier transformer (frequency analysis means) 1, that is, the spectrum S(ω) judged by the voice presence/absence determination unit 2 to include speech, is output to the threshold frequency (ω_th) calculation unit 40, the low-pass function unit 41, and the high-pass function unit 42.
[0018]
The threshold frequency (ω_th) calculation unit 40 uses the spectrum S(ω) that includes speech and the noise frequency spectrum N(ω) stored in the noise frequency spectrum storage unit (memory) 3 to determine the threshold frequency ω_th at which S(ω) is 5 dB larger than N(ω). It is not strictly necessary for ω_th to divide the spectrum into a region where S(ω) is always 5 dB larger than N(ω) and a region where S(ω) is never 5 dB larger than N(ω). For example, the spectrum may be divided into 500 Hz bands, the lowest-frequency band in which the sum of S(ω) over the band is 5 dB larger than the sum of N(ω) may be selected, and the low-frequency edge of that band may be taken as ω_th. The threshold frequency ω_th is then output to the low-pass function unit 41 and the high-pass function unit 42. These units perform, on the frequency axis, the roles that an LPF and an HPF perform on the time axis. In this embodiment, the low-pass function unit 41 replaces S(ω) with 0 for every ω satisfying ω ≥ ω_th. Conversely, the high-pass function unit 42 replaces S(ω) with 0 for every ω satisfying ω < ω_th. The low-pass function unit 41 thus outputs the spectrum S_low(ω), which is zero for ω ≥ ω_th, to the low frequency band NSS processing unit 10, and the high-pass function unit 42 outputs the spectrum S_high(ω), which is zero for ω < ω_th, to the high frequency band WF processing unit 20.
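The band-sum variant of the threshold search and the two zeroing operations can be sketched as follows. The sketch assumes power spectra on a uniform frequency grid with spacing `bin_hz`, treats "5 dB larger" as "at least 5 dB larger", and returns a bin index rather than a frequency; all names, and the fallback when no band qualifies, are assumptions.

```python
def threshold_bin(s_spec, n_spec, bin_hz, band_hz=500.0, margin_db=5.0):
    """First bin of the lowest band whose summed S exceeds summed N by margin_db."""
    bins_per_band = max(1, int(round(band_hz / bin_hz)))
    ratio = 10.0 ** (margin_db / 10.0)  # 5 dB expressed as a power ratio
    for start in range(0, len(s_spec), bins_per_band):
        s_sum = sum(s_spec[start:start + bins_per_band])
        n_sum = sum(n_spec[start:start + bins_per_band])
        if s_sum > 0.0 and s_sum >= ratio * n_sum:
            return start
    return len(s_spec)  # assumption: no band qualifies, so everything is low band

def split_at(s_spec, k_th):
    """Zero bins >= k_th to form S_low, and bins < k_th to form S_high."""
    s_low = [v if i < k_th else 0.0 for i, v in enumerate(s_spec)]
    s_high = [v if i >= k_th else 0.0 for i, v in enumerate(s_spec)]
    return s_low, s_high
```

Because every bin survives in exactly one of the two outputs, adding S_low and S_high bin by bin restores the original spectrum.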
[0019]
The low frequency band NSS processing unit 10 consists of a subtraction parameter calculation unit 11 and an NSS calculation unit 12, and performs nonlinear spectral subtraction on the low frequency band of the spectrum S(ω). The processing is as follows. The subtraction parameter calculation unit 11 reads the noise frequency spectrum N(ω) from the noise frequency spectrum storage unit (memory) 3 as needed and updates the subtraction parameter α(ω) as follows. First, the logarithmic calculator 111 obtains the logarithm log N(ω) of the noise frequency spectrum N(ω). Next, the fast Fourier transformer (FFT) 112 obtains the cepstrum C. The low quefrency window unit 113 then extracts the low quefrency portion C′ of the cepstrum C. Next, the inverse fast Fourier transformer (IFFT) 114 obtains the envelope l(ω) of log N(ω). Finally, the calculator 115 obtains the subtraction parameter α(ω) from the value of the envelope l(ω).
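The chain from log N(ω) to the envelope l(ω) can be sketched in pure Python as below. A naive discrete Fourier transform stands in for the FFT and IFFT blocks, and the low quefrency window is taken to be symmetric (indices near 0 plus their mirror images) so that the inverse transform is real; the window shape, its width `n_low`, the logarithm floor, and all function names are assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(M^2) discrete Fourier transform (stand-in for the FFT/IFFT blocks)."""
    m = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(m):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / m) for t in range(m))
        out.append(acc / m if inverse else acc)
    return out

def log_spectrum_envelope(noise_spec, n_low=3, floor=1e-12):
    """Smooth log N(w) by keeping only the low-quefrency cepstral coefficients."""
    m = len(noise_spec)
    log_n = [math.log(max(v, floor)) for v in noise_spec]  # logarithmic calculator 111
    cep = dft(log_n)                                        # FFT 112: cepstrum C
    # Low quefrency window 113: keep indices near 0 and their mirror images.
    kept = [c if (q < n_low or q > m - n_low) else 0.0 for q, c in enumerate(cep)]
    return [v.real for v in dft(kept, inverse=True)]        # IFFT 114: envelope l(w)
```

A smaller `n_low` gives a smoother envelope; keeping every coefficient reproduces log N(ω) exactly.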
[0020]
FIG. 3 is a graph showing an example of the relationship between the spectrum envelope of the noise frequency spectrum N (ω) and the subtraction parameter α. In the present embodiment, the subtraction parameter α is set to be 2.6 at maximum and 0.9 at minimum for the noise frequency spectrum envelope. That is, the subtraction parameter α is increased where the value of the noise frequency spectrum envelope is high, and the subtraction parameter α is decreased where the value of the noise frequency spectrum envelope is low. In this way, by setting the subtraction parameter α to be determined from the value of each frequency of the noise spectrum envelope, the frequency-dependent parameter α can be easily determined.
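Since FIG. 3 is not reproduced here, the exact curve relating the envelope to α is unknown; the following sketch simply assumes a linear mapping from the lowest envelope value to 0.9 and from the highest to 2.6. The function name and the treatment of a flat envelope are likewise assumptions.

```python
def alpha_from_envelope(envelope, alpha_min=0.9, alpha_max=2.6):
    """Map envelope values linearly onto [alpha_min, alpha_max] (assumed shape)."""
    lo, hi = min(envelope), max(envelope)
    if hi == lo:
        # Assumption: a flat envelope gives the minimum parameter everywhere.
        return [alpha_min for _ in envelope]
    scale = (alpha_max - alpha_min) / (hi - lo)
    return [alpha_min + (v - lo) * scale for v in envelope]
```

Any monotonically increasing mapping would preserve the stated behavior of larger α where the envelope is high and smaller α where it is low.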
[0021]
In this way, using the subtraction parameter α updated as needed, the NSS calculation unit 12 calculates the output P_low(ω) by the following processing and outputs it to the addition unit 43. When S_low(ω) is 0, P_low(ω) is also output as 0.
(Equation 2)
$$P_{\mathrm{low}}(\omega) = S_{\mathrm{low}}(\omega) - \alpha(\omega)\,N(\omega)$$
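Equation 2 and the zero rule can be sketched per bin as follows. Clamping negative differences to zero is an assumption made here so that the output stays non-negative; the function name is hypothetical.

```python
def nss_subtract(s_low, noise_spec, alpha):
    """P_low = S_low - alpha * N per bin; bins where S_low is 0 stay 0."""
    out = []
    for s, n, a in zip(s_low, noise_spec, alpha):
        if s == 0.0:
            out.append(0.0)                  # bin zeroed by the low-pass function unit
        else:
            out.append(max(s - a * n, 0.0))  # assumption: clamp negative results to 0
    return out
```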
[0022]
On the other hand, the high frequency band WF processing unit 20 consists of a WF determination unit 21 and a WF calculation unit 22, and performs Wiener filter processing on the high frequency band of the spectrum S_high(ω). Wiener filter processing is achieved by multiplying the spectrum S_high(ω) by the filter H(ω) given by the following equation, described above.
(Equation 3)
$$H(\omega) = \left\{ \frac{S_{\mathrm{high}}(\omega)}{S_{\mathrm{high}}(\omega) + N(\omega)} \right\}^{\beta}$$
[0023]
First, the WF determination unit 21 reads the noise frequency spectrum N(ω) from the noise frequency spectrum storage unit (memory) 3 and determines the filter H(ω) from it and the spectrum S_high(ω) by the calculation of Equation 3. Next, the WF calculation unit 22 calculates the output P_high(ω) by multiplying the spectrum S_high(ω) by the filter H(ω). When S_high(ω) is 0, P_high(ω) is also output as 0.
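The two WF steps can be sketched per bin as below, assuming power spectra as lists and β = 2 by default; the function name is hypothetical.

```python
def wiener_filter_high(s_high, noise_spec, beta=2.0):
    """P_high = S_high * H, with H = (S_high / (S_high + N)) ** beta per bin."""
    out = []
    for s, n in zip(s_high, noise_spec):
        if s == 0.0:
            out.append(0.0)            # bin zeroed by the high-pass function unit
        else:
            h = (s / (s + n)) ** beta  # WF determination unit 21 (Equation 3)
            out.append(s * h)          # WF calculation unit 22
    return out
```

Testing for a zero bin first also avoids a division by zero when both S_high(ω) and N(ω) are 0.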
[0024]
Thus, the spectrum S_low(ω) is converted by the low frequency band NSS processing unit 10 into the output P_low(ω), in which noise has been suppressed by nonlinear spectral subtraction, and is output to the addition unit 43. Likewise, the spectrum S_high(ω) is converted by the high frequency band WF processing unit 20 into the output P_high(ω), in which noise has been suppressed by Wiener filter processing, and is output to the addition unit 43. The output P_low(ω) is 0 for every ω satisfying ω ≥ ω_th, where S_low(ω) is 0, and the output P_high(ω) is 0 for every ω satisfying ω < ω_th, where S_high(ω) is 0. Consequently, their sum P(ω) = P_low(ω) + P_high(ω) is a speech signal in which the noise in the spectrum S(ω) of the original signal has been suppressed by nonlinear spectral subtraction in the low frequency band ω < ω_th and by Wiener filter processing in the high frequency band ω ≥ ω_th. Because the boundary frequency between the two bands is variable, the speech recognition device 100 can exhibit an optimal noise suppression function adapted to the noise conditions.
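Because the two branch outputs never overlap, the sum P(ω) is equivalent to choosing one branch per bin. The following self-contained sketch processes one frame that way, assuming a threshold expressed as a bin index `k_th`, a precomputed list `alpha`, and zero-clamping in the NSS branch; these names and the clamp are assumptions.

```python
def suppress_frame(s_spec, n_spec, k_th, alpha, beta=2.0):
    """Apply NSS below bin k_th and the Wiener filter from bin k_th upward."""
    p = []
    for i, (s, n) in enumerate(zip(s_spec, n_spec)):
        if s == 0.0:
            p.append(0.0)                             # empty bins stay empty
        elif i < k_th:
            p.append(max(s - alpha[i] * n, 0.0))      # low band: NSS branch
        else:
            p.append(s * (s / (s + n)) ** beta)       # high band: WF branch
    return p
```

Changing `k_th` from frame to frame is what makes the boundary between the two methods follow the noise conditions.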
[0025]
In the speech recognition device 100, the subtraction parameter can be one estimated so as to sufficiently suppress the noise spectrum in signal sections that include speech. Calculating the subtraction parameter from the spectral envelope in this way yields an appropriate subtraction parameter while keeping the overall configuration small. However, the present invention may also be implemented using a conventional nonlinear spectral subtraction method, which requires more computation, or using a linear spectral subtraction method. The present invention may also be implemented using Kalman filter processing instead of Wiener filter processing. It is also possible to combine three or more noise suppression means.
[0026]
The present invention is particularly useful as the noise suppression means of a speech recognition device in a vehicle cabin. It is especially effective as a speech recognition device that removes vehicle cabin noise in order to recognize the driver's speech in interactive car navigation and interactive driving information guidance. In such cases, the device may be configured so that, for example, a fixed period after the interactive car navigation is switched on is treated as the speech section. In this configuration, a speech section measuring device is used in place of the voice presence/absence determination unit of FIG. 2; the spectrum S(ω) is output for the fixed period after switch-on, which is treated as the speech section, and the spectrum before that point is stored in the memory 3 as the noise frequency spectrum N(ω).
[0027]
In the present application, the frequency spectrum assumes 0 or a positive value.
When obtaining the cepstrum, the cepstrum c_n may instead be obtained from the spectrum a_n as follows, where Σ denotes the sum over k from k = 1 to k = n − 1.
(Equation 4)
$$c_n = a_n - \frac{1}{n} \sum_{k=1}^{n-1} k\, c_k\, a_{n-k}$$
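The recursion of Equation 4 can be sketched as follows, assuming the inputs a_1, a_2, ... are stored in a zero-based Python list (so `a[0]` holds a_1); the function name is hypothetical.

```python
def cepstrum_recursion(a):
    """c_n = a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}, for n = 1..len(a)."""
    c = []
    for n in range(1, len(a) + 1):
        # c[k - 1] holds c_k and a[n - k - 1] holds a_{n-k} in zero-based storage.
        acc = sum(k * c[k - 1] * a[n - k - 1] for k in range(1, n))
        c.append(a[n - 1] - acc / n)
    return c
```

For n = 1 the sum is empty, so c_1 equals a_1 and every later coefficient builds on the earlier ones.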
[Brief description of the drawings]
FIG. 1 is a graph for explaining the technical idea of the present invention.
FIG. 2 is a block diagram showing a configuration of a speech recognition device 100 according to a specific embodiment of the present invention.
FIG. 3 is a graph showing a relationship between a noise frequency spectrum of the present invention and a noise frequency spectrum envelope for determining a subtraction parameter α.
[Explanation of symbols]
100 Speech recognition device
1, 112 Fast Fourier transformer
10 Low frequency band NSS processing unit
11 Subtraction parameter calculation unit
12 NSS calculation unit
111 Logarithmic calculator
113 Low quefrency window unit
114 Inverse fast Fourier transformer
115 Calculator
2 Voice presence/absence determination unit
20 High frequency band WF processing unit
21 WF determination unit
22 WF calculation unit
3 Noise frequency spectrum storage unit (memory)
40 Threshold frequency calculation unit
41 Low-pass function unit
42 High-pass function unit
43 Addition unit
5 Recognition unit

Claims

1. A speech recognition device having a noise suppression function for use under noise, the device having a plurality of noise suppression functions and using the plurality of noise suppression functions selectively for each frequency band according to a variable boundary frequency.

2. The speech recognition device according to claim 1, wherein the variable boundary frequency is set as needed according to the S/N ratio or noise level of an input signal.

3. The speech recognition device according to claim 2, wherein there are two noise suppression functions, which are applied to the low frequency side and the high frequency side, respectively, of the variable boundary frequency that is set.

4. The speech recognition device according to claim 3, wherein the noise suppression function on the low frequency side is nonlinear spectral subtraction and the noise suppression function on the high frequency side is a Wiener filter.

5. The speech recognition device according to claim 4, wherein, in the nonlinear spectral subtraction, the noise suppression function is performed by:
frequency analysis means for obtaining a frequency spectrum for an arbitrary section;
subtraction parameter calculation means for obtaining, for a time section that does not include speech, the spectral envelope of the noise frequency spectrum obtained by the frequency analysis means, and setting a subtraction parameter corresponding to the spectral envelope at each frequency; and
subtraction means for subtracting, for a time section that includes speech, from the frequency spectrum obtained by the frequency analysis means, the value obtained by multiplying the noise frequency spectrum at each frequency by the subtraction parameter determined for that frequency by the subtraction parameter calculation means.