JP2006084974A

JP2006084974A - Sound input device

Info

Publication number: JP2006084974A
Application number: JP2004271584A
Authority: JP
Inventors: Daisuke Saito; 大介斎藤; Mitsunobu Kaminuma; 充伸神沼
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2004-09-17
Filing date: 2004-09-17
Publication date: 2006-03-30
Anticipated expiration: 2024-09-17
Also published as: JP4529611B2

Abstract

PROBLEM TO BE SOLVED: To provide a sound input device capable of removing diffuse noise with a smaller amount of calculation than general frequency-domain ICA.

SOLUTION: The sound input device detects sound in which a target signal and a non-target signal are mixed with microphones 10-1 to 10-n, acquires detection signals in which the target signal and the non-target signal are mixed, and acquires, by iterative learning, a sound signal separation filter that separates at least one target signal from the detection signals. It comprises a band dividing means 103 that divides the acoustic frequency band into a plurality of frequency division bands, and an iteration count determining means 104 that determines the number of learning iterations in each frequency division band so that it is directly proportional to the strength of the target signal contained in the detection signals in that band.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a voice input device.

In recent years, voice input systems in vehicle cabins have been widely used for operating in-vehicle equipment by voice recognition, for hands-free telephony, and the like. One factor that hinders these technologies is the presence, in the cabin, of sound from sources other than the user of the voice input system. As a way of separating the user's voice from sound arriving from other sources, the technique of acquiring sound signals from a plurality of acoustic sensors and separating the desired sound signal using only those acquired signals is called blind source separation (BSS). Independent component analysis (hereinafter, ICA) has been proposed as a method of obtaining, by learning, the separation filter that separates the signals.

Non-Patent Document 1: N. Murata and S. Ikeda, "An on-line algorithm for blind source separation on speech signals", Proceedings of the 1998 International Symposium on Nonlinear Theory and its Applications (NOLTA'98), vol. 3, pp. 923-926, Sep. 1998.
Non-Patent Document 2: "Basics of blind source separation using array signal processing", Technical Report of IEICE, EA2001-7.
Non-Patent Document 3: "What is independent component analysis?", Computer Today, pp. 38-43, 1998.9, No. 87; "Application to fMRI image analysis", Computer Today, pp. 60-67, 2001.1, No. 95.
Non-Patent Document 4: S. Amari, A. Cichocki, and H. H. Yang, "A new learning algorithm for blind signal separation", in D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (eds.), Advances in Neural Information Processing Systems 8, pp. 753-763, MIT Press, Cambridge MA, 1996.

When a sound source separation method based on the above ICA is used in a real environment, reverberation in the observed signal becomes a problem. That is, the signal arriving from a sound source is convolved with reverberation components that arise from the environment in which the source is observed. In this case the mixing process is a convolutive mixture, a complicated problem, so it is difficult to obtain the separation filter directly. Frequency-domain ICA (FDICA) has therefore been proposed, in which the observed signal is transformed into the frequency domain and, for each frequency band (frequency bin), a separation filter is learned so as to maximize the independence of the separated signals. With this technique, the complicated convolutive mixing problem can be expressed as an instantaneous mixing problem at each frequency, and it has been shown that a separation filter can be obtained even in a real, reverberant environment (see Non-Patent Document 1).
However, target signal separation based on the above FDICA has the following problems.

First, because ICA processing is performed for each frequency band, the amount of calculation becomes very large.

Second, because a diffuse signal source is difficult to treat as a single signal source, separation accuracy may fail to improve however many times learning is repeated, or continued learning may move the separation filter from its optimal state to a state of lower separation accuracy.

An object of the present invention is to improve on these points and to provide a voice input device that requires less calculation than general frequency-domain ICA and can extract a target signal from an observed signal in which a target signal and a non-target signal are mixed.

A voice input device that acquires, by iterative learning, a separation filter for separating at least one target signal from a detection signal in which a target signal and a non-target signal are mixed is configured to have band dividing means for dividing the acoustic frequency band into a plurality of frequency division bands, and iteration count determining means for determining the number of learning iterations in each frequency division band so that it is directly proportional to the average strength of the target signal contained in the detection signal in that frequency division band.

By implementing the present invention, the number of learning iterations is reduced in frequency bands where a large learning effect cannot be expected, so that it becomes possible to provide a voice input device that requires less calculation than general frequency-domain ICA and can extract a target signal from an observed signal in which a target signal and a non-target signal are mixed.

The following describes a case in which the learning method for obtaining the separation filter, which characterizes the voice input device according to the present invention, is applied to one example of ICA.

For example, by receiving sound with K microphones (sensors) and, in addition, exploiting the fact that the sound signals arriving from the individual sources are statistically independent of one another, up to K sound sources, that is, as many as the microphones or fewer, can be separated. Initially, ICA-based source separation did not take into account the differences in arrival time of sound from each source, so it was difficult to apply to microphone arrays. In recent years, however, many methods have been proposed that take the time differences into account, observe multiple sound signals with a microphone array, and obtain the inverse of the mixing process in the frequency domain.

In general, when sound signals arriving from L sound sources are linearly mixed and observed at K microphones, the observed sound signal at a given frequency f can be written as follows.

X(f) = A(f) S(f)   (1)
Here, S(f) is the vector of sound signals emitted by the sources, X(f) is the vector of observed signals at the microphone array, which is the sound receiving point, and A(f) is the mixing matrix describing the spatial acoustic system between each source and the receiving point. S(f) and X(f) can be written as follows.

S(f) = [S_1(f), ..., S_L(f)]^T   (2)
X(f) = [X_1(f), ..., X_K(f)]^T   (3)

Here, the superscript ^T denotes the transpose of a vector. If the mixing matrix A(f) is known, the sound signal S(f) emitted by the sources can be computed from the observed signal vector X(f) at the receiving point by calculating the generalized inverse A(f)^- of A(f):
S(f) = A(f)^- X(f)   (5)
where A(f)^- denotes the generalized inverse of the matrix A(f). In general, however, A(f) is unknown, and the sound signal S(f) must be obtained using X(f) alone.
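
As a minimal sketch of equations (1) and (5) at a single frequency, the following mixes two source values with a known 2 x 2 mixing matrix and recovers them again. The numerical values of `A` and `S` are invented for illustration, and for a square, invertible matrix the generalized inverse is simply the ordinary inverse, which is what `inverse_2x2` computes.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def inverse_2x2(m):
    """Ordinary inverse of a 2 x 2 complex matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical mixing matrix A(f) and source spectra S(f) at one frequency f.
A = [[1.0 + 0.0j, 0.5 - 0.2j],
     [0.3 + 0.4j, 1.0 + 0.0j]]
S = [0.8 + 0.1j, -0.2 + 0.6j]

X = matvec(A, S)                   # equation (1): observed signals
S_rec = matvec(inverse_2x2(A), X)  # equation (5): recovery with known A(f)
```

In practice A(f) is not available, which is why the description goes on to estimate a demixing matrix from X(f) alone.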

To solve this problem, it is assumed that the sound signal S(f) is generated stochastically and, further, that the components of S(f) are all mutually independent. Because the observed signal X(f) is a mixture, the components of X(f) are not independently distributed. ICA is therefore used to search for the independent components contained in the observed signal. That is, a matrix W(f) that transforms the observed signal X(f) into independent components (hereinafter, the demixing matrix) is computed, and the demixing matrix W(f) is applied to the observed signal X(f) by matrix multiplication to obtain a signal that approximates the sound signal S(f) emitted by the sources.

Both time-domain and frequency-domain methods have been proposed for obtaining the inverse of the mixing process by ICA. Here, the frequency-domain method is described as an example.

First, the signal observed at each microphone is subjected to short-time frame analysis using an appropriate orthogonal transform. The complex spectral values of one microphone input at a particular frequency bin, taken frame by frame, are then regarded as a time series. Here, a frequency bin denotes an individual complex component of the signal vector obtained by, for example, a short-time discrete Fourier transform. The same operation is performed for the other microphone inputs. The resulting time-frequency signal sequence can be written as
X(f,t) = [X_1(f,t), ..., X_K(f,t)]^T   (6)
Next, signal separation is performed using the demixing matrix W(f). This processing is expressed as follows.

Y(f,t) = [Y_1(f,t), ..., Y_L(f,t)]^T = W(f) X(f,t)   (7)
Here, the demixing matrix W(f) is optimized so that the L output time series Y(f,t) become mutually independent. This processing is performed for every frequency bin. Finally, an inverse orthogonal transform is applied to the separated time series Y(f,t) to reconstruct the time waveforms of the source signals.
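
The short-time frame analysis and the bin-by-bin application of equation (7) can be sketched as follows. This is only an illustration under stated assumptions: a naive, unwindowed DFT stands in for the "appropriate orthogonal transform", the frame length, hop size and test tone are invented, and the demixing matrices are identity placeholders rather than learned filters.

```python
import cmath
import math


def stft(signal, frame_len, hop):
    """Naive short-time DFT (no window); returns spectra indexed [t][f]."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * f * n / frame_len)
                for n in range(frame_len))
            for f in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


def separate(W, X_mics):
    """Apply Y(f,t) = W(f) X(f,t); X_mics is indexed [k][t][f], result [t][f][l]."""
    n_frames, n_bins = len(X_mics[0]), len(X_mics[0][0])
    return [
        [[sum(W[f][l][k] * X_mics[k][t][f] for k in range(len(X_mics)))
          for l in range(len(W[f]))]
         for f in range(n_bins)]
        for t in range(n_frames)
    ]


# Two hypothetical microphone signals: the same tone (bin 2) with different gains.
frame_len, hop = 16, 8
tone = [math.sin(2 * math.pi * 2 * n / frame_len) for n in range(64)]
X_mics = [stft(tone, frame_len, hop),
          stft([0.5 * v for v in tone], frame_len, hop)]

n_bins = frame_len // 2 + 1
W = [[[1, 0], [0, 1]] for _ in range(n_bins)]  # placeholder: identity at every bin
Y = separate(W, X_mics)
peak_bin = max(range(n_bins), key=lambda f: abs(X_mics[0][0][f]))
```

Because a separate W(f) is held and applied for every bin, the cost of learning grows with the number of bins, which is the calculation load the description sets out to reduce.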

As methods for evaluating independence and optimizing the demixing matrix, an unsupervised learning algorithm based on minimizing the Kullback-Leibler divergence and algorithms that decorrelate second-order or higher-order correlations have been proposed (see Non-Patent Document 2 above).

ICA is used not only in sound signal processing but also, for example, to separate and extract individual images from a mixture of several images, to separate signals that arrive with crosstalk in mobile communications and the like, and to separate and extract a signal of interest from measurements made outside the head, with an electroencephalograph, a magnetoencephalograph, fMRI (functional magnetic resonance imaging), or the like, of signals arising at various places inside the brain (see Non-Patent Document 3 above).

In the following, taking the problem of sound source separation with a plurality of microphones as an example, the principle of the present invention is described for the case where frequency-domain ICA is used as the learning algorithm for the source separation filter.

Even with ICA, when there is a frequency band in which the signals are difficult to separate, for example because a diffuse sound source is present, the separation accuracy (measured, for example, by cosine distance) often does not improve in that band even after several tens of learning iterations. If the learning computation is continued in such a band, the separation filter may move from its optimal state to a state of lower separation accuracy. To avoid such learning, the present invention proposes varying the learning speed from band to band.

First, as in equation (6) above, the time-frequency signal sequence obtained by collecting sound at each microphone and applying short-time frame analysis is written X(f,t) = [X_1(f,t), ..., X_K(f,t)]^T. Next, source separation is performed using the demixing matrix optimized by ICA. This processing is expressed by the following equation.

Y(f,t) = [Y_1(f,t), ..., Y_L(f,t)]^T = W(f) X(f,t)   (7) (restated)
Here, Y(f,t) is the separated signal after source separation. Amari et al. have proposed computing the demixing matrix (source separation filter) W_{i+1}(f) learned at the (i+1)-th iteration from the demixing matrix W_i(f) learned at the i-th iteration by equation (8) below (Non-Patent Document 4 above).

W_{i+1}(f) = η [diag(<Φ(Y(f,t)) Y^H(f,t)>) - <Φ(Y(f,t)) Y^H(f,t)>] W_i(f) + W_i(f)   (8)
Here, η is the update coefficient, diag( ) denotes the diagonal matrix formed from the diagonal elements of its argument, < > denotes the average over time, and ^H denotes the Hermitian transpose. For Φ( ), when handling signals that follow a non-Gaussian amplitude distribution, such as speech, approximation by a sigmoid function has been proposed (see Non-Patent Documents 1 and 4 above).
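
One learning step of equation (8) for a single frequency bin might look like the sketch below. Several points are assumptions made for illustration rather than details given in the text: the number of outputs equals the number of microphones (a square W), Φ applies a sigmoid separately to the real and imaginary parts, and the frame data and the value of η are invented.

```python
import math


def phi(y):
    """Assumed nonlinearity: sigmoid applied to the real and imaginary parts."""
    def sig(x):
        return 0.5 * (1.0 + math.tanh(0.5 * x))  # numerically stable sigmoid
    return complex(sig(y.real), sig(y.imag))


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def update_w(W, X_frames, eta):
    """One step of equation (8) at one bin; X_frames is a list of X(f,t) vectors."""
    n, T = len(W), len(X_frames)
    # Y(f,t) = W(f) X(f,t) for every frame t
    Ys = [[sum(W[i][k] * x[k] for k in range(n)) for i in range(n)]
          for x in X_frames]
    # R = < Phi(Y) Y^H >, the time average of the outer product
    R = [[sum(phi(y[i]) * y[j].conjugate() for y in Ys) / T
          for j in range(n)] for i in range(n)]
    # diag(R) - R: zero on the diagonal, -R elsewhere
    D = [[0j if i == j else -R[i][j] for j in range(n)] for i in range(n)]
    DW = matmul(D, W)
    return [[W[i][j] + eta * DW[i][j] for j in range(n)] for i in range(n)]


# Hypothetical observations at one bin (three frames, two microphones).
X_frames = [[0.3 + 0.1j, -0.2 + 0.4j],
            [0.1 - 0.5j, 0.6 + 0.2j],
            [-0.4 + 0.2j, 0.2 - 0.1j]]
W0 = [[1 + 0j, 0j], [0j, 1 + 0j]]
W1 = update_w(W0, X_frames, eta=0.1)
```

Each call costs one pass over all frames of the bin, so the total cost of learning is roughly the per-step cost times the number of iterations summed over bins; that sum is what the per-band iteration counts described next are meant to shrink.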

In the voice input device according to the present invention, two or more different iteration counts are determined when the number of learning iterations is decided, and the number of iterations is reduced in frequency bands where a large learning effect cannot be expected, so that an effect equivalent to general frequency-domain ICA is obtained with less calculation.

The configuration of the present invention is described below by way of embodiments.

(First embodiment)
This embodiment shows the basic configuration of the present invention and is configured as shown in FIG. 1. Its operation is as shown in the flowchart of FIG. 3. Following this processing configuration, a separation filter that separates the target signal from the non-target signal is learned, and the target signal is extracted with the resulting separation filter. When the method is applied in a vehicle cabin, the target signal can be regarded as the speech signal and the non-target signal as the various noises generated by the vehicle.

FIG. 1 is a block diagram showing the basic configuration of the first embodiment. In the figure, 101-1 to 101-n are microphones, that is, a plurality of acoustic sensors that detect sound in which target sound and non-target sound are mixed and output a plurality of detection signals in which a target signal and a non-target signal are mixed. They are realized by ordinary microphones and correspond to the microphones 201-1 to 201-n in FIG. 2. 102 is detection means that detects the detection signals output by the microphones 101-1 to 101-n and converts them into discrete signals; it corresponds to the filter (anti-aliasing filter) 202, the AD conversion unit 203, and the arithmetic unit 204 in FIG. 2. 103 is band dividing means that transforms the detection signals converted into discrete signals (time-domain signal sequences) into the frequency domain and further divides the acoustic frequency band into a plurality of frequency division bands; it corresponds to the arithmetic unit 204 and the storage device 205 in FIG. 2. Any orthogonal transform may be used to decompose the signal into frequencies. The bandwidth used in the frequency division may be fixed or variable.

104 is iteration count determining means that determines, for each frequency division band, the number of learning iterations in the separation filter calculation means 105, based on information on the strengths of the target signal and the non-target signal in each frequency division band, as analyzed, for example, by the analysis means 307 in FIG. 10; it corresponds to the arithmetic unit 204 and the storage device 205 in FIG. 2. More specifically, the difference (ratio) in spectral power at each frequency between the target signal (or a reference signal similar to it) and the non-target signal is derived, and a small number of learning iterations is set for bands in which the target signal energy is small (the S/N is low).

105 is separation filter calculation means that learns the separation filter according to the number of learning iterations determined by the iteration count determining means 104 and, when the learning process ends, writes the resulting filter over the one held by the filter means 106 as the updated separation filter; it corresponds to the arithmetic unit 204 and the storage device 205 in FIG. 2. Frequency-domain ICA (FDICA), for example, is used as the separation filter learning algorithm.

106 is separation filter means that applies the separation filter it holds (the filter that separates the target signal) to the detection signals to separate and extract the target signal contained in them; it corresponds to the arithmetic unit 204 and the storage device 205 in FIG. 2.

FIG. 2 is a diagram showing an example of a system configuration that realizes this embodiment. In the figure, the detection signals output by the microphones 201-1 to 201-n pass through the filter 202 and are input to the AD conversion unit 203; after AD conversion they are input to the arithmetic unit 204, which is connected to the storage device 205, and are processed there. The filter 202 is used to remove noise contained in the detection signals. The arithmetic unit 204 is formed by combining a CPU, MPU, DSP, FPGA, or the like with ordinary operating circuitry, and the storage device 205 is formed of ordinary storage media such as cache memory, main memory, an HDD, CD, MD, DVD, optical disc, or FDD.

This embodiment makes it possible to reduce the amount of calculation in a technique that analyzes the detection signal for each frequency division band and extracts the target signal from it (extracts speech from speech mixed with noise). Whereas the conventional technique learns a prescribed number of times in every frequency band, here the number of learning iterations is determined for each frequency division band based on information about the target signal and non-target signal (for example, noise) components it contains. This reduces the amount of calculation relative to the conventional technique. That is, by using information about the content of the target signal in each band and reducing the computation spent on learning the separation filter in bands that carry little target signal information, the amount of calculation in the separation filter calculation process can be reduced.

The number of learning iterations can be determined for each frequency division band by methods such as the following. In each frequency division band:
(Method 1) the number is determined so as to be directly proportional to the average strength of the target signal contained in the detection signal;
(Method 2) the number is determined so as to be inversely proportional to the average strength of the non-target signal contained in the detection signal;
(Method 3) the number is determined so as to be directly proportional to the average, over the frequency division band, of the ratio of the strength of the target signal to the strength of the non-target signal contained in the detection signal (target signal strength / non-target signal strength);
(Method 4) the numbers are determined so that, for any two frequency division bands, band 1 and band 2, whenever the average of the above ratio in band 1 is greater than or equal to that in band 2, the number of learning iterations in band 1 is never smaller than that in band 2, and so that the numbers of learning iterations differ in at least two of the frequency division bands;
(Method 5) the numbers are set to iteration counts decided individually in advance such that they differ in at least two of the frequency division bands.

Here, "average" means the average over the frequencies within each frequency division band; the same applies below.

In any of the above methods, the computation spent on learning the separation filter is reduced in bands where there is little target signal information compared with the non-target signal (for example, a noise component), so the amount of calculation in computing the separation filter (the filter that separates the target signal) can be reduced.

In (Method 1) to (Method 4), the count obtained by the proportional calculation is converted to an integer (rounded). In (Method 5), a plurality of learning counts may be decided in advance for each frequency division band, and a count suited to the environment may be selected from among them as appropriate. In that case it is desirable that the voice input device include storage means that holds the iteration counts in advance and that the count suited to the environment be read from that storage means. With (Method 5), the iteration count determining means 104 need not be provided in the voice input device.
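
The proportional allocation with rounding can be illustrated with a small sketch in the spirit of (Method 3). The text does not fix the constant of proportionality, so this example assumes that the band with the highest average ratio receives a chosen maximum count; the function name, the scores, the maximum of 100, and the round-half-up rule are all illustrative choices.

```python
def iterations_proportional(band_scores, max_iterations):
    """Iteration count per band, directly proportional to its score, then rounded.

    band_scores: non-negative per-band averages (for example, target strength
    divided by non-target strength). The best band gets max_iterations; this
    normalization is an assumption, not something stated in the description.
    """
    top = max(band_scores)
    if top <= 0:
        raise ValueError("at least one band must have a positive score")
    # int(x + 0.5) rounds halves upward for non-negative x
    return [int(max_iterations * s / top + 0.5) for s in band_scores]


# Hypothetical average ratios for four frequency division bands.
scores = [1.0, 4.0, 8.0, 2.0]
counts = iterations_proportional(scores, max_iterations=100)
```

Because the mapping is non-decreasing in the score and yields different counts for different bands here, the result also satisfies the ordering condition stated for (Method 4).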

FIG. 3 is a flowchart showing the processing flow of this embodiment. In the figure, S101 to S116 denote the steps. The same applies to FIGS. 12, 13, 14, 16, and 17.
In S101, the system is initialized and data are read into memory.
In S102, sound input is detected. When it is detected, the process proceeds to S103.
In S103, band division is performed on the detection signal (a speech signal mixed with noise). The bandwidth of each frequency bin may be fixed or variable. The bandwidth of a frequency bin is denoted step(ω).
In S104, the iteration count Iter(ω) for each band is acquired. The determination method is shown in FIGS. 5 and 6.
In S105, the update flag Renew(ω) is initialized to True for every band ω.
In S106, the update counter is initialized to 0 (Iteration = 0).
In S107, the band index is initialized to 1 (ω = 1).
In S108, if the update flag Renew(ω) = True the process proceeds to S109; if it is False the process proceeds to S110.
In S109, the separation filter is calculated for band ω. Frequency-domain ICA or the like is applied as the calculation method.
In S110, the band index is advanced (ω = ω + step(ω), where step(ω) is the bandwidth of the frequency bin).
In S111, if the band index ω equals the maximum band value Max(ω) the process proceeds to S112; otherwise it returns to S108. Max(ω) is, for example, the Nyquist frequency of the system.
In S112, the update counter is incremented (Iteration = Iteration + 1).
In S113, if the update count Iteration equals the iteration count Iter(ω) the process proceeds to S114; otherwise it proceeds to S115.
In S114, the update flag is set to Renew(ω) = False for band ω. From the next update onward, no update processing is performed for this band.
In S115, if the update flag Renew(ω) = False for all bands ω the process proceeds to S116; otherwise it returns to S107.
In S116, the separation filter W(ω) produced by the update processing is applied to the detection signal to obtain the separated signal.
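
The control flow of steps S105 to S115 can be sketched as a pair of nested loops. This is one reading of the flowchart, not a transcription of it: bands are addressed by index rather than by frequency, a band is switched off once the update count reaches its Iter value, and bands whose Iter is 0 start switched off so that the loop always terminates. The callback `update_band` is a hypothetical stand-in for one separation filter learning step (S109).

```python
def run_schedule(iters, update_band):
    """Run per-band learning; iters[b] is the iteration count Iter for band b."""
    renew = [n > 0 for n in iters]       # S105 (Iter = 0 starts inactive: assumption)
    iteration = 0                        # S106
    updates = [0] * len(iters)
    while any(renew):                    # S115: stop when every flag is False
        for b in range(len(iters)):      # S107, S110, S111: sweep over the bands
            if renew[b]:                 # S108
                update_band(b)           # S109: one learning step for band b
                updates[b] += 1
        iteration += 1                   # S112
        for b in range(len(iters)):      # S113, S114: retire finished bands
            if iteration >= iters[b]:
                renew[b] = False
    return updates


# Hypothetical iteration counts for four bands; the callback just logs calls.
calls = []
done = run_schedule([3, 0, 5, 1], calls.append)
```

The number of learning steps actually executed equals the sum of the per-band counts, compared with the number of bands times the largest count when every band is trained equally.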

As the method by which the iteration count determining means 104 in FIG. 1 (step S104 in FIG. 3) determines the iteration counts, the signal-to-noise ratio (S/N) of the speech signal and the noise component in each band can be used, for example. FIGS. 4 to 6 show the principle and the determination method.

FIG. 4 compares the noise component with the reference signal. It compares the energy, in each frequency band, of a target signal (speech signal) and a non-target signal (vehicle noise signal) acquired in advance. In the figure, (A) shows an example of the power spectra of the target signal (speech signal) and the non-target signal (noise signal); the dotted line is the target signal and the solid line is the non-target signal. (B) shows the signal-to-noise ratio (S/N), here the strength of the target signal divided by the strength of the non-target signal, in dB, obtained from (A). This value is denoted snr(ω). Here, ω denotes the representative frequency of a frequency division band (the frequency itself if the band is 1 Hz wide) or the index of that representative frequency, and snr(ω) is the average of the S/N ratios at the frequencies within the band.
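
As a rough illustration of how snr(ω) could be computed from the two power spectra, the following sketch averages the per-bin S/N in dB over each frequency division band. The function name, the equal-width banding by `bins_per_band`, and the small `floor` that avoids taking the logarithm of zero are assumptions made for this example and are not taken from the specification.

```python
import math
from typing import List, Sequence


def band_snr_db(
    target_power: Sequence[float],
    noise_power: Sequence[float],
    bins_per_band: int,
    floor: float = 1e-12,
) -> List[float]:
    """Return snr(w): the mean per-bin S/N in dB within each band."""
    # Per-bin S/N in dB: 10 * log10(target power / non-target power)
    per_bin = [
        10.0 * math.log10(max(s, floor) / max(n, floor))
        for s, n in zip(target_power, noise_power)
    ]
    snr = []
    for start in range(0, len(per_bin), bins_per_band):
        band = per_bin[start:start + bins_per_band]
        snr.append(sum(band) / len(band))  # average over the band
    return snr
```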

This result shows that snr(ω) is high in some frequency bands and low in others. In the example of FIG. 4, the S/N is low in the band from 0 to 500 Hz (the part enclosed by the vertically long dotted ellipse) and in the band at 4000 Hz and above (the part enclosed by the horizontally long dotted ellipse). In such low-S/N bands, speech information accounts for only a small proportion of the whole signal, so these bands are difficult to separate in sound source separation processing. Because repeating the learning is unlikely to improve the separation performance there, it is considered unnecessary to spend many learning iterations on them. By selecting such bands and reducing their number of learning iterations, a separation filter with comparable separation performance can be derived with a smaller computational load.

FIG. 5 shows a method (part 1) for determining the number of iterations in the first embodiment. As shown in FIG. 5, a predetermined threshold Th is set for snr(ω). For a band (one of the frequency division bands) in which snr(ω) is equal to or greater than Th, Iter_(Hi) is assigned as the update count Iter(ω); for a band in which snr(ω) is smaller than Th, Iter_(Lo) is assigned as Iter(ω). Iter_(Hi) is the larger update-count setting and Iter_(Lo) is the smaller one.
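
The threshold rule of FIG. 5 amounts to a two-level lookup, sketched below. The function and parameter names are chosen for this example only.

```python
from typing import List, Sequence


def iterations_by_threshold(
    snr: Sequence[float], th: float, iter_hi: int, iter_lo: int
) -> List[int]:
    """Assign Iter_(Hi) where snr(w) >= Th and Iter_(Lo) where snr(w) < Th."""
    return [iter_hi if value >= th else iter_lo for value in snr]
```

Setting `iter_lo` to 0 leaves the bands below the threshold unlearned.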

The separation filter calculation means learns the separation filter based on the number of iterations for each frequency division band. The initial value of the separation filter is preferably, for example, a beamformer with a null steered toward a predetermined direction. Starting from this initial value, the separation filter is learned for each frequency. If, for example, Iter_(Lo) is set to 0, the bands where snr(ω) < Th are not learned, and the initial value is carried over unchanged into the final separation filter. In frequency-domain ICA, for example, the learning process may not work properly in bands where the noise is extremely large relative to the speech, or where only very small amounts of both noise and speech are observed, and the resulting separation filter may be unstable. In such cases, configuring those bands not to be learned reduces the possibility that the separation filter becomes unstable. This method falls under (Method 4) above.

FIG. 6 shows a method (part 2) for determining the number of iterations in the first embodiment. It extends the method shown in FIG. 5 and determines the number of iterations for each band in finer gradations from the value of snr(ω). The formula at the bottom of FIG. 6 gives the specific calculation. In this formula, * denotes multiplication, max(snr(ω)) is the maximum value of snr(ω), min(snr(ω)) is the minimum value of snr(ω), Iter_(Max) is the maximum number of iterations, Iter_(Min) is the minimum number of iterations, step is the step size of the number of iterations (with a step of 10, the update counts are 10, 20, 30, and so on), and R(·) denotes rounding to the nearest integer. That is, the maximum value max(snr(ω)) and the minimum value min(snr(ω)) of snr(ω) correspond to the maximum value Iter_(Max) and the minimum value Iter_(Min) of the number of iterations, and within this range the number of iterations is assigned based on the energy ratio snr(ω) (strictly, this snr(ω) is the average value of snr(ω) within each frequency division band). For example, with a minimum of 10 iterations, a maximum of 100, and a step size of 10, an iteration count of 10, 20, 30, ..., 100 is determined according to snr(ω). If snr(ω) is a continuous function of ω, then by the intermediate value theorem there exists, within each frequency division band, a frequency ω_m at which snr(ω_m) equals the average value of snr(ω) over that band. Taking ω in snr(ω) on the right-hand side of the formula to be ω_m therefore fixes the value of ω on the right-hand side. Since ω on the left-hand side denotes the representative value of ω in the frequency division band, taking it to be ω_m as well makes ω the same on both sides.
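
The exact formula appears only in FIG. 6 and is not reproduced in this text, so the following sketch is an assumed reading of it rather than the formula itself: the value for each band is scaled linearly from [min, max] to [Iter_(Min), Iter_(Max)] and then quantized to a multiple of `step` with round-half-up. All names are hypothetical, and the handling of the degenerate case where every band has the same value is a choice made for this example.

```python
import math
from typing import List, Sequence


def round_nearest(x: float) -> int:
    """R(.): round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)


def iterations_by_scaling(
    values: Sequence[float], iter_min: int, iter_max: int, step: int
) -> List[int]:
    """Map per-band values (e.g. snr(w)) onto a stepped iteration count."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread between bands, use the maximum everywhere.
        return [iter_max for _ in values]
    result = []
    for v in values:
        # Linear map: lo -> iter_min, hi -> iter_max
        raw = iter_min + (v - lo) / (hi - lo) * (iter_max - iter_min)
        # Quantize to a multiple of step, then clamp to the allowed range
        n = step * round_nearest(raw / step)
        result.append(min(iter_max, max(iter_min, n)))
    return result
```

For instance, with a minimum of 10, a maximum of 100, and a step of 10, per-band values of -10 dB, 0 dB, and 20 dB map to 10, 40, and 100 iterations under this assumed formula.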

The above method for determining the number of iterations falls under (Method 4) above and encompasses (Method 3).

In the embodiment described above, the number of iterations is derived from the energy ratio using both the target signal and the non-target signal. However, the energy difference may be used instead, or only one of the target signal and the non-target signal may be used, with the number of iterations determined from, for example, the strength of its energy (see (Method 1) to (Method 5) above).

The above description covered a method in which speech and noise acquired in advance in the vehicle cabin are used as the target signal and the non-target signal, Iter(ω) is determined beforehand, and that value is then applied. It is also possible to determine Iter(ω) dynamically using a target signal actually detected by the detection means (in practice, an observed signal close to the target signal, such as speech observed when there is very little noise) or a non-target signal actually detected; the same method of determining the number of iterations can be used in that case. An example that uses an actually detected signal as the non-target signal is described in detail in the third embodiment.

(Second Embodiment)
This embodiment has the same basic configuration as the first embodiment. That is, the block diagram is the same as FIG. 1 and the flowchart is the same as FIG. 3. The difference is that, in the method of determining the number of iterations, an FIR filter adaptively generated from the target signal and the non-target signal (the voice band pass filter described below) is used as the filter for determining the number of iterations. This part is described below.

FIG. 7 shows the basic configuration of the adaptive processing means in this embodiment. The adaptive processing uses a target signal S1 and an input signal X1 (a signal in which a non-target signal is mixed with the target signal), and learns a filter H(ω) so that the error signal e, which is the difference between the target signal S1 and the output signal Y1 obtained by applying H(ω) to X1, becomes small. The adaptive processing yields a filter that suppresses a band more strongly the stronger the noise component of the input signal is in that band. In other words, a band in which the frequency response of the generated filter is large can be regarded as a band in which speech is sufficiently observed, and a band in which the frequency response is small as a band in which the noise is strong and speech cannot be sufficiently observed. Using the frequency response of this filter, it is therefore possible to estimate the bands in which learning does not improve the separation performance, and the number of learning iterations can be reduced as in the first embodiment. In this embodiment, the number of iterations Iter(ω) is determined based on this frequency response. For the adaptive processing of the filter, a target signal S1 (speech signal) and a non-target signal N1 (noise signal) acquired in advance are used. That is, the adaptive processing is performed with S1 as the target signal and X1 = S1 + N1 as the input signal.
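
The adaptive processing of FIG. 7 can be illustrated with a short sketch. The specification does not name the adaptation algorithm, so the normalized LMS (NLMS) update used here is an assumption, as are the function names, the tap count, and the step size `mu`. The second function evaluates the magnitude of the learned FIR filter's frequency response at a normalized frequency `f` (cycles per sample), which corresponds to the frequency response from which resp(ω) is obtained.

```python
import cmath
import math
from typing import List, Sequence


def nlms_fir(
    x: Sequence[float],
    d: Sequence[float],
    n_taps: int = 8,
    mu: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Adapt FIR taps h so that (h applied to x) tracks d, using NLMS (assumed)."""
    h = [0.0] * n_taps
    buf = [0.0] * n_taps  # most recent input samples, newest first
    for xn, dn in zip(x, d):
        buf = [xn] + buf[:-1]
        y = sum(hk * bk for hk, bk in zip(h, buf))  # output Y1
        e = dn - y                                   # error e = S1 - Y1
        norm = eps + sum(b * b for b in buf)         # input energy for normalization
        h = [hk + mu * e * bk / norm for hk, bk in zip(h, buf)]
    return h


def magnitude_response(h: Sequence[float], f: float) -> float:
    """|H(f)| of the FIR filter h at normalized frequency f (cycles/sample)."""
    return abs(sum(hk * cmath.exp(-2j * math.pi * f * k) for k, hk in enumerate(h)))
```

If, for example, S1 is a sinusoid at one frequency and N1 a sinusoid at another, the learned filter passes the first frequency and attenuates the second, which is the behavior the text relies on when it reads the frequency response as an indicator of where speech is observable.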

FIG. 8 shows the frequency response of the filter generated by the adaptive processing means of this embodiment. (A) in the figure shows an example of the power spectra of the target signal and the non-target signal used for the adaptive processing; the dotted line is the target signal and the solid line is the non-target signal. (B) in FIG. 8 shows an example of the filter obtained as a result of the adaptive processing. In the example of FIG. 8, the frequency response is small in the band from 0 to 500 Hz (the part enclosed by the vertically long dotted ellipse) and in the band at 3000 Hz and above (the part enclosed by the horizontally long dotted ellipse). Because this filter preserves bands in which the speech spectrum is large and suppresses bands in which the noise spectrum is large, it is hereinafter called the voice band pass filter. The frequency response of the generated voice band pass filter is denoted resp(ω). Here, ω denotes the representative frequency of a frequency division band (the frequency itself if the band is 1 Hz wide) or the index of that representative frequency, and resp(ω) is the average of the frequency response of the voice band pass filter at the frequencies within the band. The number of iterations is determined using this resp(ω).

FIG. 9 shows an example of a method for determining the number of iterations using resp(ω). In this method, the maximum value max(resp(ω)) and the minimum value min(resp(ω)) of the frequency response are made to correspond to the maximum value Iter(Max) and the minimum value Iter(Min) of the number of iterations, respectively, and between them the number of iterations is determined according to the strength of the frequency response. As the formula at the bottom of the figure shows, the calculation is the same as that of the first embodiment (FIG. 6). In the formula at the bottom of FIG. 9, * denotes multiplication, max(resp(ω)) is the maximum value of resp(ω), min(resp(ω)) is the minimum value of resp(ω), Iter(Max) is the maximum number of iterations, Iter(Min) is the minimum number of iterations, step is the step size of the number of iterations (with a step of 10, the update counts are 10, 20, 30, and so on), and R(·) denotes rounding to the nearest integer. That is, within the range defined by these correspondences, the number of iterations is assigned based on the frequency response resp(ω) of the voice band pass filter. For example, with a minimum of 10 iterations, a maximum of 100, and a step size of 10, an iteration count of 10, 20, 30, ..., 100 is determined according to resp(ω). If resp(ω) is a continuous function of ω, then by the intermediate value theorem there exists, within each frequency division band, a frequency ω_m at which resp(ω_m) equals the average value of resp(ω) over that band. Taking ω in resp(ω) on the right-hand side of the formula to be ω_m therefore fixes the value of ω on the right-hand side. Since ω on the left-hand side denotes the representative value of ω in the frequency division band, taking it to be ω_m as well makes ω the same on both sides.

The above method for determining the number of iterations encompasses a method in which the number of learning iterations in each frequency division band is determined so as to be directly proportional to the average pass rate of the voice band pass filter in that band.

The above method may be generalized as follows. Let band 1 and band 2 be any two of the frequency division bands. The number of learning iterations in each frequency division band may be determined so that, whenever the average value of resp(ω) in band 1 is equal to or greater than the average value of resp(ω) in band 2, the number of learning iterations in band 1 is never smaller than that in band 2, and so that the numbers of learning iterations differ between at least two of the frequency division bands.
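
The generalized condition can be written as a small check on a candidate assignment. The following sketch is illustrative only; the function name and the list-based representation of the per-band averages are assumptions.

```python
from typing import Sequence


def satisfies_generalized_rule(
    resp_avg: Sequence[float], iters: Sequence[int]
) -> bool:
    """Check the ordering condition and that at least two bands differ."""
    n = len(resp_avg)
    for i in range(n):
        for j in range(n):
            # If band i has an average response >= band j,
            # band i must not get fewer iterations than band j.
            if resp_avg[i] >= resp_avg[j] and iters[i] < iters[j]:
                return False
    # The iteration counts must not all be identical.
    return len(set(iters)) >= 2
```

Note that, read literally, the condition also forces two bands with equal average response to receive the same number of iterations.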

The above description covered a method in which a target signal obtained in advance in the vehicle cabin (in practice, an observed signal close to the target signal, such as speech observed when there is very little noise) and a non-target signal are used to determine Iter(ω) beforehand, and that value is then applied. It is also possible to determine Iter(ω) dynamically using a target signal or a non-target signal actually detected by the detection means; the same method of determining the number of iterations can be used in that case. An example that uses an actually detected signal as the non-target signal is described in detail in the third embodiment.

(Third Embodiment)
As a third embodiment, an embodiment is described that uses the basic configuration of the first and second embodiments and is specifically embodied as a voice input method in a vehicle.

FIG. 10 is a block diagram showing the basic configuration of the third embodiment. This configuration consists of three processing paths: sound source separation using the separation filter currently held, update processing that updates the separation filter in response to changes in the sound environment, and processing that records speech when it is detected at a high S/N.

In FIG. 10, 301-1 to 301-n are microphones, that is, a plurality of acoustic sensors that detect sound in which target sound and non-target sound are mixed and output it as a plurality of detection signals in which a target signal and a non-target signal are mixed. They are realized by ordinary microphones and correspond to the microphones 401-1 to 401-n in FIG. 11. 302 is a detection means that detects the detection signals output from the microphones 301-1 to 301-n and converts them into discrete signals. It corresponds to the filter (anti-aliasing filter) 402, the AD conversion unit 403, and the arithmetic unit 404 in FIG. 11.

303 is a switch means corresponding to the information collection unit 406 in FIG. 11, and takes in information from the various switches arranged in the vehicle. It is realized by an air conditioner switch, a window open/close switch, a wiper switch, a PTT (Push to Talk) switch, and the like. When the switch information changes, the change is detected as a change in the sound environment and the separation filter is updated. That is, the processing from the band dividing means onward is started.

304 is a sensor means corresponding to the information collection unit 406 in FIG. 11, and takes in information from the various sensors arranged in the vehicle. It is realized by a speed sensor, an engine speed sensor, a seat occupancy sensor, a cabin camera, and the like. When the information acquired by the sensor means 304 changes, the change is detected as a change in the sound environment and the separation filter is updated. Changes in the information from the switch means 303 and the sensor means 304 are thus treated as changes in the sound environment and trigger the update of the separation filter. For example, a change in the air conditioner level or an increase in vehicle speed is treated as a change in the sound environment.

305 is a determination means corresponding to the arithmetic unit 404 and the storage device 405 in FIG. 11. It monitors and judges changes in the sound environment and the presence or absence of a speech signal. Here, the sound environment means characteristics, such as the spectrum and power spectrum, of the target signal (speech) and the non-target signal (noise) that make up the observed detection signal. The information from the various switch means and sensor means described above, or the spectrum of the detection signal, can be used to judge the sound environment and the presence or absence of a speech signal. The judgment has the following three outcomes.
1: No change in the sound environment (the characteristics of the target signal and the non-target signal). In this case, the separation filter (unmixing filter) currently held continues to be held as it is. If speech is input in this state, the speech is extracted from the observed signal using the held separation filter.
2: The sound environment has changed. In this case, the filter is updated. Band division, analysis, and determination of the number of iterations are performed using the current observed signal, and the separation filter is re-learned and updated based on the determined number of learning iterations.
3: High-S/N speech is detected. When the target signal (speech) is detected in an environment where the non-target signal (noise) is small enough to be ignored relative to the target signal, such as during idling, that signal is recorded as a reference signal and stored in the storage means.
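
The three outcomes can be pictured as a small dispatch function. The sketch below is a hedged illustration: the enum, the function name, and the boolean inputs are hypothetical, and the priority between conditions (speech is checked first, then noise level, then environment change) is an assumption modeled on the order of the flag checks in the flow of FIG. 12, not something stated in this paragraph.

```python
from enum import Enum


class Action(Enum):
    HOLD_FILTER = "hold_filter"            # outcome 1, no speech present
    SEPARATE = "separate"                  # outcome 1, speech present
    UPDATE_FILTER = "update_filter"        # outcome 2
    RECORD_REFERENCE = "record_reference"  # outcome 3


def decide(
    speech_present: bool, environment_changed: bool, noise_negligible: bool
) -> Action:
    """Choose which processing path to run for the current observation."""
    if speech_present:
        # Speech with negligible noise is recorded as a reference signal;
        # otherwise the held separation filter is applied (assumed ordering).
        return Action.RECORD_REFERENCE if noise_negligible else Action.SEPARATE
    # Without speech, re-learn the filter only if the sound environment changed.
    return Action.UPDATE_FILTER if environment_changed else Action.HOLD_FILTER
```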

306 is a band dividing means corresponding to the arithmetic unit 404 and the storage device 405 in FIG. 11. It converts the detection signal (a time-domain signal sequence), which the detection means 302 has converted into a discrete signal, into the frequency domain, and further divides the acoustic frequency band into a plurality of frequency division bands. Any orthogonal transform may be used for this decomposition.

307 is an analysis means corresponding to the arithmetic unit 404 and the storage device 405 in FIG. 11. It analyzes the detection signal (observed signal) for each frequency and compares the strengths of the target signal and of the unwanted, non-target signal contained in it. Here, the target signal is the speech signal stored in advance in the storage means 308, and the non-target signal is the detection signal obtained while no speech is detected. As in the first and second embodiments, the quantities analyzed are, for example, the S/N for each band or the frequency response of the separation filter generated by adaptive processing.

The arithmetic unit 404 is configured by combining a CPU, MPU, DSP, FPGA, or the like with general operating circuits.

The storage device 405 is configured using general storage media such as cache memory, main memory, HDD, CD, MD, DVD, optical disk, and FDD.

308 is a storage means corresponding to the arithmetic unit 404 and the storage device 405 in FIG. 11, and stores the target signal. When high-S/N speech (speech acquired in an extremely low-noise environment) is recorded, it is stored here. As the initial value, human-speech-like noise (HSLN), a signal obtained by superimposing the voices of a plurality of people, is preferable, for example (see Shoji Kajita, Daisuke Kobayashi, Kazuya Takeda, and Fumitada Itakura, "Analysis of speech features included in human-speech-like noise," Journal of the Acoustical Society of Japan, vol. 53, no. 5, pp. 337-345, 1997). When high-S/N speech is recorded, it replaces the signal stored as the initial value or is added to that initial-value signal, for example. The target signal thereby becomes close to the voice of the actual user.

309 is an iteration number determination means corresponding to the arithmetic unit 404 and the storage device 405 in FIG. 11. It determines the number of iterations in the learning process based on the result of the analysis by the analysis means 307 (the S/N between the target signal and the non-target signal, or the frequency response of the separation filter adaptively generated using those signals). The methods shown in the first and second embodiments, for example, are used for this determination.

310 is a filter calculation means corresponding to the arithmetic unit 404 and the storage device 405 in FIG. 11. It learns the separation filter based on the number of learning iterations determined by the iteration number determination means 309. The filter obtained when the learning process finishes is written over the filter held in the filter means 311 as the updated separation filter. Frequency-domain ICA, for example, is used as the algorithm for learning the separation filter.

311 is a filter means corresponding to the arithmetic unit 404 and the storage device 405 in FIG. 11. It applies the separation filter it holds to the detection signal (a speech signal mixed with noise), extracts the target signal (speech signal) contained in the detection signal, and outputs it as the output signal.

FIG. 11 shows an example of a system configuration that realizes this embodiment. Apart from the information collection unit 406, the configuration is the same as that shown in FIG. 2.

In this embodiment, a signal actually detected by the detection means is used as the non-target signal. Because the non-target signal is expected to change with the driving conditions of the vehicle and the like, the number of iterations is determined dynamically to adapt to this change. Because the target signal may also change, for example when the speaker changes, a mechanism is also provided that stores the user's speech as the target signal when it is detected in a high-S/N environment.

Furthermore, in this embodiment, to cut down on the separation filter update processing, which has a high processing load, the update of the separation filter and the sound source separation that uses the separation filter are handled as separate processing paths, and the separation filter update is performed only when a change in the sound environment is detected.

This embodiment adds the following to the configuration of the first embodiment: various switch means arranged in the vehicle; various sensor means; a determination means that judges whether the sound environment has changed based on the detection signal, the switch information, and the sensor information; an analysis means that estimates the information content of the current target signal and non-target signal from the detection signal; a storage means; and a speech segment detection means.

In this embodiment, it is difficult to acquire the target signal and the non-target signal independently at the same time (the point of the present invention being to extract the target signal in a situation where a mixture of the target signal and the non-target signal is observed). The two signals are therefore acquired at different times: the non-target signal is the detection signal obtained when only noise is present, and the target signal is the speech signal of a speaker uttered when the energy of the non-target signal is low. These signals are used to estimate the information content of the current target signal and non-target signal.

The presence or absence of a speech signal can be detected using the result of spectral analysis of the detection signal, switch means such as a PTT (Push to Talk) switch, or sensor means such as a cabin camera. A change in the sound environment corresponds to a change in the frequency characteristics of the target signal caused by a change of speaker, or to a change in the non-target signal depending on the vehicle speed or the air conditioner level. These changes can likewise be detected using the result of spectral analysis of the detection signal, information from the various switch means in the cabin, and information from the sensor means arranged at various places in the vehicle.

本処理の全体の流れを図１２に、音環境を判断する処理について図１３に示し、分離フィルタ更新処理について図１４に表記する。図１３に示す判断処理において、音環境に変化があるか否か、音声信号が低S/Nで検出されたかなどを判断し、この結果に基づき図１２における音源分離処理、音声収録処理、分離フィルタ更新処理の何れかを行う。分離フィルタ更新処理では、個々に取得した目的信号と非目的信号を用いて帯域ごとの更新回数を決定し、これに基づきフィルタ学習を行う。更新回数決定方法やフィルタ更新方法については、第１、第２実施形態に述べた手法を適用する。 FIG. 12 shows the overall flow of this processing, FIG. 13 shows the processing that judges the sound environment, and FIG. 14 shows the separation filter update processing. The determination processing of FIG. 13 judges, among other things, whether the sound environment has changed and whether the voice signal was detected at a low S/N. Based on the result, one of the sound source separation processing, the voice recording processing, or the separation filter update processing of FIG. 12 is performed. The separation filter update processing determines the number of updates for each band from the separately acquired target and non-target signals and performs filter learning accordingly. The methods described in the first and second embodiments are used to determine the number of updates and to update the filter.

本実施形態によれば、音環境が変化したときに、該音環境に適した反復回数を用いて分離フィルタを更新することができるため、演算量を削減し、分離性能の高い分離フィルタを得ることが可能である。 According to this embodiment, when the sound environment changes, the separation filter can be updated with a number of iterations suited to the new sound environment, so the amount of computation is reduced and a separation filter with high separation performance can be obtained.

図１２はシステム全体の動作フローを示している。
S201でシステムの初期化、メモリへの読込作業を行う。
S202で音入力を検知する。検知したらS203へ進む。
S203で目的信号(音声)の有無、音環境変化の有無、非目的信号の有無を判断する(詳細は図１３のフロー図に示す)。
S204で目的信号フラグ(検知信号に音声が含まれているか否かを示すフラグ)がFalseであればS205へ、TrueであればS207へ進む。
S205で音環境フラグがTrueであればS206の分離フィルタ更新処理へ進み、Falseであれば処理を終了する。
S206で分離フィルタの学習、更新作業を行う(詳細は図１４のフロー図に示す)。
S207で非目的信号微小フラグ(検知信号に含まれる不用信号が目的信号に比べて十分小さいか否かを示すフラグ)がTrueであればS208へ、FalseであればS210へ進む。
S208で音声区間を検出する。区間検出には検知信号のエネルギーの変化、スペクトルの変化、及びＰＴＴスイッチの情報や車室内カメラの情報を用いることができる。
S209で検出された音声区間を新規参照信号として記憶手段308へ記憶する。検出される度に、話者や性別ごとに分類して加算平均することが好ましい。初期値としてはHSLNなどを用いることができる。
S210で現在保持する分離フィルタＷを検知信号へ適用して目的信号の抽出を行う。 FIG. 12 shows an operation flow of the entire system.
In S201, the system is initialized and data is loaded into memory.
In S202, sound input is detected. When input is detected, the process proceeds to S203.
In S203, the presence or absence of a target signal (voice), of a change in the sound environment, and of a non-target signal is determined (details are shown in the flowchart of FIG. 13).
In S204, if the target signal flag (a flag indicating whether the detection signal contains voice) is False, the process proceeds to S205; if it is True, the process proceeds to S207.
In S205, if the sound environment flag is True, the process proceeds to the separation filter update process of S206; if it is False, the process ends.
In S206, the separation filter is learned and updated (details are shown in the flowchart of FIG. 14).
In S207, if the non-target signal minute flag (a flag indicating whether the unwanted signal contained in the detection signal is sufficiently small compared with the target signal) is True, the process proceeds to S208; if it is False, the process proceeds to S210.
In S208, a voice section is detected. Section detection can use changes in the energy or spectrum of the detection signal, PTT switch information, and vehicle interior camera information.
In S209, the detected voice section is stored in the storage means 308 as a new reference signal. Each time a section is detected, it is preferably classified by speaker or gender and averaged. HSLN or the like can be used as the initial value.
In S210, the separation filter W currently held is applied to the detection signal to extract the target signal.
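
The branching in S204 to S210 can be summarized as a small dispatch function. The following is a minimal Python sketch under stated assumptions, not the implementation of the patent: the names `Flags`, `Action`, and `choose_action` are hypothetical, and the three flags are taken to have the meanings given for FIG. 13. The text does not say what follows S209, so the sketch simply reports which branch is taken.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    UPDATE_FILTER = "S206: learn and update the separation filter"
    RECORD_REFERENCE = "S208-S209: detect the voice section and store it"
    SEPARATE = "S210: apply the current separation filter W"
    NONE = "end without further processing"


@dataclass
class Flags:
    voice_act: bool = False    # target signal flag
    noise_small: bool = False  # non-target signal minute flag
    env_change: bool = False   # sound environment change flag


def choose_action(flags: Flags) -> Action:
    """Mirror the S204 -> S205 / S207 branching described for FIG. 12."""
    if not flags.voice_act:          # S204: no voice in the detection signal
        # S205: relearn only when the sound environment has changed
        return Action.UPDATE_FILTER if flags.env_change else Action.NONE
    if flags.noise_small:            # S207: voice with little noise
        return Action.RECORD_REFERENCE
    return Action.SEPARATE           # S207 False: voice mixed with noise
```

In this reading, the filter is relearned only while nobody is speaking, clean speech is kept as a reference, and noisy speech is passed through the current filter.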

図１３は判断手段(図１０の305)の判断処理フローを示している。
S301で下記のフラグを初期化する(Falseで初期化)する。各フラグは以下のことを表す。目的信号フラグ：Voice-act 検知信号に音声が含まれているか否か、非目的信号微小フラグ：Noise-small 検知信号に含まれる非目的信号が目的信号に比べて十分小さいか否か、音環境変化フラグ：Env-change 音環境が変化したか否か。
S302で時刻T-mにおける音環境情報を取得する。mは処理間隔時間である。音環境情報は、検知信号パワー、検知信号スペクトル、各種スイッチ情報、各種センサ情報などの何れか、あるいは複数を組み合わせたデータから生成され、これらデータに基づき、複数の音環境に分類されるものとする。分類は、該信号スペクトルや、速度、空調機レベルなどの状態と雑音成分特性を対応させたデータベースを参照するなどして行う。
S303で時刻T(現在)における音環境情報を取得する。
S304で時刻TとT-mにおける音環境情報を比較し、検知信号に目的信号(音声)が含まれるかを判定する。音声が含まれると判断した場合はS305へ、含まれないと判断された場合はS306へ進む。音声信号の有無は、検知信号のパワー変化、スペクトル変化、PTTスイッチの押下情報、車室内カメラの検出結果などを用いて判断できる。
S305で目的信号フラグを立てる（Voice-act=True)。
S306で非目的信号(雑音)のエネルギーが閾値より小さいか否かを判定する。閾値より小さい場合はS307へ、閾値以上の場合はS308へ進む。閾値は例えば目的信号(参照信号として用いる音声信号)と非目的信号(雑音信号)とのS/Nが10ｄB以上となる時の検知信号のエネルギーとする。
S307で非目的信号微小フラグを立てる（Noise-small=True)。
S308で時刻Tの音環境情報と、時刻T-mの音環境情報を比較する。比較結果音環境情報が異なる場合はS309へ進み、同一の場合は処理を終了する。
S309で音環境フラグを立てる（Env-change=True)。 FIG. 13 shows a determination processing flow of the determination means (305 in FIG. 10).
In S301, the following flags are initialized to False. Target signal flag (Voice-act): whether the detection signal contains voice. Non-target signal minute flag (Noise-small): whether the non-target signal contained in the detection signal is sufficiently small compared with the target signal. Sound environment change flag (Env-change): whether the sound environment has changed.
In S302, sound environment information at time T-m is acquired, where m is the processing interval. The sound environment information is generated from one or a combination of the detection signal power, the detection signal spectrum, switch information, sensor information, and the like, and is classified into one of several sound environments based on these data. The classification is performed, for example, by referring to a database that associates the signal spectrum and states such as speed and air conditioner level with noise component characteristics.
In S303, sound environment information at time T (the present) is acquired.
In S304, the sound environment information at times T and T-m is compared to determine whether the detection signal contains the target signal (voice). If voice is judged to be present, the process proceeds to S305; otherwise it proceeds to S306. The presence or absence of a voice signal can be judged from changes in the power or spectrum of the detection signal, PTT switch press information, the detection result of the vehicle interior camera, and the like.
In S305, the target signal flag is set (Voice-act = True).
In S306, it is determined whether the energy of the non-target signal (noise) is smaller than a threshold. If it is smaller, the process proceeds to S307; if it is equal to or greater than the threshold, the process proceeds to S308. The threshold is, for example, the detection signal energy at which the S/N between the target signal (the voice signal used as the reference signal) and the non-target signal (noise signal) is 10 dB or more.
In S307, the non-target signal minute flag is set (Noise-small = True).
In S308, the sound environment information at time T is compared with that at time T-m. If they differ, the process proceeds to S309; if they are the same, the process ends.
In S309, the sound environment flag is set (Env-change = True).
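
The flag logic of FIG. 13 can be sketched as follows. This is an illustrative Python sketch only: `SoundEnv`, `noise_threshold`, and `judge` are hypothetical names, the voice detector is reduced to a boolean input, and the 10 dB threshold is interpreted as the noise energy lying 10 dB below the reference speech energy. The sketch also assumes that the three checks run one after another (S305 continuing to S306, and S307 to S308), which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundEnv:
    """Hypothetical sound-environment label built from switch/sensor data."""
    speed_range: str
    aircon_level: int


def noise_threshold(reference_energy: float, snr_db: float = 10.0) -> float:
    """Noise energy at which the reference speech sits snr_db above the noise."""
    return reference_energy / (10.0 ** (snr_db / 10.0))


def judge(env_prev: SoundEnv, env_now: SoundEnv, voice_detected: bool,
          noise_energy: float, reference_energy: float) -> dict:
    # S301: all flags start as False
    flags = {"voice_act": False, "noise_small": False, "env_change": False}
    if voice_detected:                                     # S304
        flags["voice_act"] = True                          # S305
    if noise_energy < noise_threshold(reference_energy):   # S306
        flags["noise_small"] = True                        # S307
    if env_now != env_prev:                                # S308
        flags["env_change"] = True                         # S309
    return flags
```

For example, `judge(SoundEnv("low", 1), SoundEnv("high", 1), True, 5.0, 100.0)` sets all three flags, because the noise energy 5.0 is below the threshold of 10.0 and the environment label has changed.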

図１４は分離フィルタ更新処理(図１０の帯域分割手段306〜フィルタ計算手段310)の処理フローを示している。
S401で検知信号の帯域分割処理を行う。周波数ビンごとの帯域幅は固定でも可変でも良い。
S402で検知信号と参照信号の分析処理を行う。第１実施形態ではパワーの比較を行う(図４)。第２実施形態では適応フィルタ手法により反復回数決定用フィルタである音声帯域通過フィルタを生成する(図７)。
S403で分析処理の結果に基づき反復回数Iter(ω)を決定する。決定方法は第１実施形態に示した手法(図５、図６）、及び第２実施形態に示した手法(図９）などを用いればよい。
S404で更新フラグRenew(ω)をすべての帯域ωに対してTrueに初期化する。
S405で更新回数フラグを0に初期化する(Iteration=0）。
S406で帯域フラグを1に初期化する(ω＝1）。
S407で更新フラグRenew(ω)＝TrueであればS408へFalseであればS409へ進む。
S408で帯域ωについて、分離フィルタの計算を行う。分離フィルタの計算手法としては、周波数領域ＩＣＡなどを適用する。
S409で帯域フラグを加算する(ω＝ω＋step(ω)（step(ω)：周波数ビンの帯域幅)）。
S410で帯域フラグωの値が帯域の最大値Max(ω)に等しいときはS411へ進み、等しくないときはS407へ戻る。帯域の最大値Max(ω)は例えばシステムのナイキスト周波数とする。
S411で更新回数フラグを加算する(Iteration=Iteration+1）。
S412で更新回数Iterationが反復回数Iter(ω)に等しいときはS413へ、等しくないときはS414へ進む。
S413で帯域ωについて、更新フラグRenew(ω)＝Falseとする。次回の更新よりこの帯域は更新処理が行われない。
S414で全帯域ωについて、更新フラグRenew(ω)＝Falseならば処理終了。そうでない場合はS406へ戻る。 FIG. 14 shows a processing flow of the separation filter update processing (band division means 306 to filter calculation means 310 in FIG. 10).
In S401, the detection signal is divided into bands. The bandwidth of each frequency bin may be fixed or variable.
In S402, the detection signal and the reference signal are analyzed. In the first embodiment, their power is compared (FIG. 4). In the second embodiment, a voice band-pass filter, which serves as the filter for determining the number of iterations, is generated by an adaptive filter technique (FIG. 7).
In S403, the number of iterations Iter(ω) is determined from the result of the analysis. The determination may use the methods of the first embodiment (FIGS. 5 and 6) or the method of the second embodiment (FIG. 9).
In S404, the update flag Renew(ω) is initialized to True for all bands ω.
In S405, the update count flag is initialized to 0 (Iteration = 0).
In S406, the band flag is initialized to 1 (ω = 1).
In S407, if the update flag Renew(ω) is True, the process proceeds to S408; if it is False, the process proceeds to S409.
In S408, the separation filter is calculated for band ω, using frequency-domain ICA or a similar method.
In S409, the band flag is advanced (ω = ω + step(ω), where step(ω) is the bandwidth of the frequency bin).
In S410, if the band flag ω equals the maximum band value Max(ω), the process proceeds to S411; otherwise it returns to S407. Max(ω) is, for example, the Nyquist frequency of the system.
In S411, the update count flag is incremented (Iteration = Iteration + 1).
In S412, if the update count Iteration equals the number of iterations Iter(ω), the process proceeds to S413; otherwise it proceeds to S414.
In S413, the update flag Renew(ω) is set to False for band ω. From the next pass on, this band is no longer updated.
In S414, if the update flag Renew(ω) is False for all bands ω, the process ends; otherwise it returns to S406.
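
The loop of S404 to S414 amounts to sweeping over all bands once per pass and retiring each band when its own iteration budget is used up. The Python sketch below illustrates that idea under stated assumptions: `update_separation_filter` and the `update_band` callback are hypothetical names, bands are plain integer indices rather than frequencies stepped by step(ω), the pass counter is compared with the per-band budget Iter(ω), and bands with a zero budget start inactive. The actual frequency-domain ICA learning step is left to the callback.

```python
from typing import Callable, Dict, List


def update_separation_filter(iters: List[int],
                             update_band: Callable[[int], None]) -> Dict[int, int]:
    """Call update_band(band) iters[band] times, sweeping all bands per pass."""
    renew = [n > 0 for n in iters]            # S404: active-band flags
    iteration = 0                             # S405: pass counter
    done = {band: 0 for band in range(len(iters))}
    while any(renew):                         # S414: stop when every band is retired
        for band in range(len(iters)):        # S406, S409, S410: sweep over bands
            if renew[band]:                   # S407
                update_band(band)             # S408: one learning step for this band
                done[band] += 1
        iteration += 1                        # S411
        for band in range(len(iters)):        # S412, S413: retire finished bands
            if iteration == iters[band]:
                renew[band] = False
    return done
```

With budgets `[3, 1, 0, 2]`, the callback is invoked six times in total, compared with twelve if every band were given the largest budget.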

（第４実施形態）
本実施例は各種スイッチ手段、センサ手段の情報の組み合わせと、該組み合わせが検出されるときの音環境を事前に分析、分類したデータ、及び該音環境における分離フィルタ更新処理時の最適な反復回数Iter(ω)を対応させたデータをそれぞれ記憶する手法である。 (Fourth embodiment)
In this embodiment, two kinds of data are stored in advance: data that associates each combination of information from the various switch means and sensor means with the sound environment, analyzed and classified beforehand, in which that combination is detected; and data that associates each sound environment with the optimum number of iterations Iter(ω) for the separation filter update process in that environment.

その構成を図１５に示す。図において、図１０における同じ名称、機能については説明を省略する。 The configuration is shown in FIG. 15. Components that have the same names and functions as in FIG. 10 are not described again.

図中、505は図１１における演算装置404及び記憶装置に対応する判断手段であり、音環境変化及び音声信号の有無について監視、判断を行う。本実施形態では、スイッチ手段503及びセンサ504の情報と音環境を事前に対応させて記憶手段506にテーブルとして記憶している。したがって、スイッチ、センサ手段の情報から、逐次テーブルを参照することで音環境を取得し、この音環境の変化に基づき、フィルタ更新処理を行う。 In the figure, reference numeral 505 denotes determination means corresponding to the arithmetic device 404 and the storage device in FIG. 11; it monitors and judges changes in the sound environment and the presence or absence of a voice signal. In this embodiment, the information from the switch means 503 and the sensor 504 is associated with sound environments in advance and stored as a table in the storage means 506. The sound environment is therefore obtained by looking up this table each time with the switch and sensor information, and the filter update process is performed based on changes in that sound environment.

506は図１１における記憶装置405に対応する記憶手段１であり、各種スイッチ手段、センサ手段が出力する情報と、その時の音環境を予め分類記憶したテーブルを記憶する。 Reference numeral 506 denotes storage means 1 corresponding to the storage device 405 in FIG. 11, which stores information output by various switch means and sensor means and a table in which the sound environment at that time is classified and stored in advance.

508は図１１における記憶装置405に対応する記憶手段２であり、ここには、記憶手段１に分類記録されている音環境と、該音環境において最適な分離フィルタ学習時の更新回数Iter(ω)が対応して記憶されている。 Reference numeral 508 denotes storage means 2, which corresponds to the storage device 405 in FIG. 11. It stores each sound environment classified and recorded in storage means 1 together with the number of updates Iter(ω) that is optimal for separation filter learning in that sound environment.

図１６は本実施形態の全体動作を示したフローチャートである。
S501でシステムの初期化、メモリへの読込作業を行う。
S502で音入力を検知する。検知したらS503へ進む。
S503で音声信号を検知する。検知したらS504へ、検知しない場合はS502へ進む。
S504でスイッチ手段(図１５の503)、センサ手段(図１５の504)の情報と、記憶手段１(図１５の506)のデータテーブルを参照し、現在の音環境を取得する。
S505で、S504で取得した音環境がそれまで保持していた音環境と異なる場合はS506へ、そうでなければS502へ進む。
S506で分離フィルタ更新処理を行う。詳細は図１７のフローチャートに示す。
S507で検知信号に対して分離フィルタW(ω)を適用して音源分離(音声抽出)を行う。 FIG. 16 is a flowchart showing the overall operation of this embodiment.
In S501, the system is initialized and data is loaded into memory.
In S502, sound input is detected. When input is detected, the process proceeds to S503.
In S503, a voice signal is detected. If one is detected, the process proceeds to S504; if not, it proceeds to S502.
In S504, the current sound environment is obtained by referring to the information from the switch means (503 in FIG. 15) and the sensor means (504 in FIG. 15) and to the data table in storage means 1 (506 in FIG. 15).
In S505, if the sound environment acquired in S504 differs from the sound environment held until then, the process proceeds to S506; otherwise it proceeds to S502.
In S506, the separation filter update process is performed. Details are shown in the flowchart of FIG. 17.
In S507, the separation filter W(ω) is applied to the detection signal to perform sound source separation (voice extraction).

図１７は本実施形態の分離フィルタ更新処理の流れを示したフローチャートである。
S601でシステムの初期化、メモリへの読込作業を行う。
S602で現在の音環境に対応するフィルタ更新処理の反復回数(Iter(ω))を取得する。
S603で検知信号の帯域分割処理を行う。周波数ビンごとの帯域幅は固定でも可変でも良い。
S604で更新フラグRenew(ω)をすべての帯域ωに対してTrueに初期化する。
S605で更新回数フラグを0に初期化する(Iteration=0)。
S606で帯域フラグを1に初期化する(ω＝1)。
S607で更新フラグRenew(ω)＝TrueであればS608へ、FalseであればS609へ進む。
S608で帯域ωについて、分離フィルタの計算を行う。分離フィルタの計算手法としては、周波数領域ＩＣＡなどを適用する。
S609で帯域フラグを加算する(ω＝ω＋step(ω)（step(ω)：各周波数帯の帯域幅))。
S610で帯域フラグωの値が帯域の最大値Max(ω)に等しいときはS611へ進み、等しくないときはS607へ戻る。帯域の最大値Max(ω)は例えばシステムのナイキスト周波数とする。
S611で更新回数フラグを加算する(Iteration=Iteration+1)。
S612で更新回数Iterationが反復回数Iter(ω)に等しいときはS613へ、等しくないときはS614へ進む。
S613で帯域ωについて、更新フラグRenew(ω)＝Falseとする。次回の更新よりこの帯域は更新処理が行われない。
S614で全帯域ωについて、更新フラグRenew(ω)＝FalseならばS615へ進み、そうでない場合はS606へ戻る。
S615で更新処理の結果生成された分離フィルタW(ω)をフィルタ手段へとコピーする。 FIG. 17 is a flowchart showing the flow of the separation filter update process of the present embodiment.
In S601, the system is initialized and data is loaded into memory.
In S602, the number of iterations Iter(ω) of the filter update process that corresponds to the current sound environment is acquired.
In S603, the detection signal is divided into bands. The bandwidth of each frequency bin may be fixed or variable.
In S604, the update flag Renew(ω) is initialized to True for all bands ω.
In S605, the update count flag is initialized to 0 (Iteration = 0).
In S606, the band flag is initialized to 1 (ω = 1).
In S607, if the update flag Renew(ω) is True, the process proceeds to S608; if it is False, the process proceeds to S609.
In S608, the separation filter is calculated for band ω, using frequency-domain ICA or a similar method.
In S609, the band flag is advanced (ω = ω + step(ω), where step(ω) is the bandwidth of each frequency band).
In S610, if the band flag ω equals the maximum band value Max(ω), the process proceeds to S611; otherwise it returns to S607. Max(ω) is, for example, the Nyquist frequency of the system.
In S611, the update count flag is incremented (Iteration = Iteration + 1).
In S612, if the update count Iteration equals the number of iterations Iter(ω), the process proceeds to S613; otherwise it proceeds to S614.
In S613, the update flag Renew(ω) is set to False for band ω. From the next pass on, this band is no longer updated.
In S614, if the update flag Renew(ω) is False for all bands ω, the process proceeds to S615; otherwise it returns to S606.
In S615, the separation filter W(ω) generated by the update process is copied to the filter means.

図１８に、各種センサ情報と、それに対応する音環境のデータテーブルを記憶した例を示す。この例では、速度域、空調機レベル、エンジン回転数から音環境を分類し、対応させて記憶している。 FIG. 18 shows an example in which various sensor information and a sound environment data table corresponding to the sensor information are stored. In this example, the sound environment is classified from the speed range, the air conditioner level, and the engine speed, and stored in correspondence.

図１９に、各音環境と、該音環境下での最適な反復回数を対応させたデータテーブルを記憶した例を示す。更新回数のテーブル作成にあたっては、第１、第２実施形態のように、各音環境における目的信号、非目的信号を用いて、第１、第２実施形態に示すような手法で、予め最適な学習回数を決定しておく。 FIG. 19 shows an example of a stored data table that associates each sound environment with the optimum number of iterations in that environment. To build this table, the optimum number of learning iterations is determined in advance from the target and non-target signals of each sound environment, using the methods described in the first and second embodiments.

判断手段505は前記センサ情報と音環境を対応させたデータテーブルを参照して、現在の音環境を判断し、反復回数決定手段507は、該音環境と反復回数を対応させたデータテーブルを参照して現在の音環境に適する反復回数を決定する。すなわち、判断手段505によって判断され決定された音環境に対応する前記学習の反復回数を記憶手段508から読み出し、該反復回数だけ前記分離フィルタを取得するための学習を反復することができる。 The determination means 505 judges the current sound environment by referring to the data table that associates the sensor information with sound environments, and the iteration count determination means 507 determines the number of iterations suited to the current sound environment by referring to the data table that associates sound environments with iteration counts. In other words, the number of learning iterations corresponding to the sound environment judged by the determination means 505 is read from the storage means 508, and the learning that produces the separation filter is repeated that many times.
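
The two-stage lookup performed by the determination means 505 and the iteration count determination means 507 can be sketched with two dictionaries. This Python sketch is illustrative only: the table contents below are invented placeholders and are not the values of FIG. 18 or FIG. 19, and `ENV_TABLE`, `ITER_TABLE`, and `iterations_if_changed` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# (speed range, air conditioner level, engine speed range): hypothetical key layout
SensorKey = Tuple[str, int, str]

# Placeholder for the sensor-information -> sound-environment table (storage means 1).
ENV_TABLE: Dict[SensorKey, str] = {
    ("low", 0, "low"): "env_A",
    ("low", 2, "low"): "env_B",
    ("high", 0, "high"): "env_C",
    ("high", 2, "high"): "env_D",
}

# Placeholder for the sound-environment -> Iter(w) table (storage means 2),
# with one invented iteration count per frequency band.
ITER_TABLE: Dict[str, List[int]] = {
    "env_A": [40, 100, 100, 40],
    "env_B": [20, 80, 100, 60],
    "env_C": [10, 60, 100, 80],
    "env_D": [5, 40, 100, 80],
}


def iterations_if_changed(prev_env: Optional[str],
                          sensors: SensorKey) -> Tuple[str, Optional[List[int]]]:
    """Return the current environment and, only if it changed, its Iter(w) list."""
    env = ENV_TABLE[sensors]          # look up the environment from sensor data
    if env == prev_env:               # unchanged: no filter update needed
        return env, None
    return env, ITER_TABLE[env]       # changed: fetch the stored iteration counts
```

Because both steps are table lookups, no signal analysis is needed at run time to choose the iteration counts; the cost is that only environments classified in advance can be handled.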

本手法によれば、車室内において検出されるセンサ情報の変化に応じて最適な反復回数が決定され、この反復回数に基づき分離フィルタが更新されるため、分離フィルタの計算量を削減することが可能である。 According to this method, the optimum number of iterations is determined in response to changes in the sensor information detected in the passenger compartment, and the separation filter is updated with that number of iterations, so the amount of computation needed for the separation filter can be reduced.

従来の周波数領域ＩＣＡ手法においては、全ての周波数帯域について、帯域毎にＩＣＡによる分離フィルタの計算が行われる構成となっていたため、計算量が大きくなる。 In the conventional frequency domain ICA method, since the calculation of the separation filter by the ICA is performed for each band for all frequency bands, the amount of calculation increases.

上記第１乃至第４実施形態によって詳細に説明したように、本発明では、この点を改良し、ＩＣＡによる分離フィルタ学習の前に、周波数ごとに入力信号に含まれる目的信号と非目的信号との強さを比較する手段を設け、目的信号が非目的信号より小さいと判断された帯域は、目的信号が非目的信号より大きいと判断された帯域より学習回数を小さくすることによって、一般的な周波数領域ＩＣＡと比較してわずかな計算量で済む処理過程及びシステムを提供する。このように、本発明においては、分離フィルタ学習する周波数帯域を適応的に判断できるため、すべての帯域において分離フィルタ学習行う従来の周波数領域ＩＣＡ手法と比較して少ない計算量で済む。 As described in detail in the first to fourth embodiments, the present invention improves on this point. Before the separation filter is learned by ICA, means are provided for comparing, at each frequency, the strengths of the target signal and the non-target signal contained in the input signal. Bands in which the target signal is judged to be weaker than the non-target signal are given fewer learning iterations than bands in which it is judged to be stronger, which yields a process and a system that need far less computation than general frequency-domain ICA. Because the frequency bands in which the separation filter is learned can be chosen adaptively, less computation is needed than in the conventional frequency-domain ICA method, which learns the separation filter in every band.
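
The saving can be made concrete with a small numeric example. The Python sketch below assumes one simple rule, scaling each band's iteration count in proportion to its target-to-noise power ratio, and uses invented power values; `iterations_from_ratio` is a hypothetical name, and the sketch assumes every noise power is greater than zero. It is meant only to show how the total number of per-band updates shrinks compared with giving every band the maximum count.

```python
from typing import List, Sequence


def iterations_from_ratio(target_power: Sequence[float],
                          noise_power: Sequence[float],
                          max_iter: int) -> List[int]:
    """Scale per-band iteration counts in proportion to target/noise power."""
    ratios = [t / n for t, n in zip(target_power, noise_power)]
    top = max(ratios)                       # the best band gets max_iter
    return [round(max_iter * r / top) for r in ratios]


# Invented per-band powers for four bands (arbitrary units).
target = [1.0, 4.0, 8.0, 2.0]
noise = [4.0, 4.0, 4.0, 4.0]

iters = iterations_from_ratio(target, noise, max_iter=200)
adaptive_total = sum(iters)                 # updates with per-band budgets
uniform_total = 200 * len(iters)            # updates if every band used max_iter
```

Here the per-band budgets come out as 25, 100, 200, and 50, so 375 band updates are performed instead of 800.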

図１：第１実施形態の基本構成を示したブロック図である。 FIG. 1 is a block diagram showing the basic configuration of the first embodiment.
図２：第１実施形態を実現するシステム構成の一例を示す図である。 FIG. 2 is a diagram showing an example of a system configuration that implements the first embodiment.
図３：第１実施形態の処理の流れを示したフローチャートである。 FIG. 3 is a flowchart showing the processing flow of the first embodiment.
図４：雑音成分と参照信号の比較を示した図である。 FIG. 4 is a diagram showing a comparison between the noise component and the reference signal.
図５：第１実施形態の反復回数決定方法（その１）を示した図である。 FIG. 5 is a diagram showing the iteration count determination method (part 1) of the first embodiment.
図６：第１実施形態の反復回数決定方法（その２）を示した図である。 FIG. 6 is a diagram showing the iteration count determination method (part 2) of the first embodiment.
図７：第２実施形態における適応処理手段の基本的な構成を示した図である。 FIG. 7 is a diagram showing the basic configuration of the adaptive processing means in the second embodiment.
図８：第２実施形態の適応処理によって生成した分離フィルタの周波数応答を示した図である。 FIG. 8 is a diagram showing the frequency response of the separation filter generated by the adaptive processing of the second embodiment.
図９：第２実施形態の反復回数決定方法を示した図である。 FIG. 9 is a diagram showing the iteration count determination method of the second embodiment.
図１０：第３実施形態の基本構成を示したブロック図である。 FIG. 10 is a block diagram showing the basic configuration of the third embodiment.
図１１：第３実施形態を実現するシステム構成の一例を示した図である。 FIG. 11 is a diagram showing an example of a system configuration that implements the third embodiment.
図１２：第３実施形態の処理の流れを示したフローチャートである。 FIG. 12 is a flowchart showing the processing flow of the third embodiment.
図１３：第３実施形態の判断手段の処理の流れを示したフローチャートである。 FIG. 13 is a flowchart showing the processing flow of the determination means of the third embodiment.
図１４：第３実施形態の分離フィルタ更新処理の流れを示したフローチャートである。 FIG. 14 is a flowchart showing the flow of the separation filter update process of the third embodiment.
図１５：第４実施形態の基本構成を示したブロック図である。 FIG. 15 is a block diagram showing the basic configuration of the fourth embodiment.
図１６：第４実施形態の処理の流れを示したフローチャートである。 FIG. 16 is a flowchart showing the processing flow of the fourth embodiment.
図１７：第４実施形態の分離フィルタ更新処理の流れを示したフローチャートである。 FIG. 17 is a flowchart showing the flow of the separation filter update process of the fourth embodiment.
図１８：第４実施形態のセンサ情報と音環境を対応させたデータテーブルを示した図である。 FIG. 18 is a diagram showing the data table that associates sensor information with sound environments in the fourth embodiment.
図１９：第４実施形態の音環境と反復回数を対応させたデータテーブルを示した図である。 FIG. 19 is a diagram showing the data table that associates sound environments with iteration counts in the fourth embodiment.

符号の説明 Explanation of symbols

101-1、101-n：マイクロフォン、102：検知手段、103：帯域分割手段、104：反復回数決定手段、105：フィルタ計算手段、106：フィルタ手段、201-1、201-n：マイクロフォン、202：フィルタ、203：ＡＤ変換部、204：演算装置、205：記憶装置、301-1〜301-n：マイクロフォン、302：検知手段、303：スイッチ手段、304：センサ手段、305：判断手段、306：帯域分割手段、307：分析手段、308：記憶手段、309：反復回数決定手段、310：フィルタ計算手段、311：フィルタ手段、401-1〜401-n：マイクロフォン、402：フィルタ、403：ＡＤ変換部、404：演算装置、405：記憶装置、406：情報収集部、501-1〜501-n：マイクロフォン、502：検知手段、503：スイッチ手段、504：センサ手段、505：判断手段、506：記憶手段１、507：反復過程決定手段、508：記憶手段２、509：帯域分割手段、510：フィルタ計算手段、511：フィルタ手段。 101-1, 101-n: Microphone, 102: Detection means, 103: Band division means, 104: Repetition count determination means, 105: Filter calculation means, 106: Filter means, 201-1, 201-n: Microphone, 202 : Filter, 203: AD conversion unit, 204: Arithmetic unit, 205: Storage device, 301-1 to 301-n: Microphone, 302: Detection unit, 303: Switch unit, 304: Sensor unit, 305: Determination unit, 306 : Band division means, 307: Analysis means, 308: Storage means, 309: Iteration number determination means, 310: Filter calculation means, 311: Filter means, 401-1 to 401-n: Microphone, 402: Filter, 403: AD Conversion unit, 404: arithmetic device, 405: storage device, 406: information collection unit, 501-1 to 501-n: microphone, 502: detection unit, 503: switch unit, 504: sensor unit, 505: determination unit, 506 : Storage means 1, 507: iteration process determination means, 508: storage means 2, 509: band dividing means, 510 Filter calculation means, 511: filter means.

Claims

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、
音響周波数帯域を複数の周波数分割帯域に分割する帯域分割手段と、各該周波数分割帯域における前記学習の反復回数を、それぞれの周波数分割帯域において前記検知信号に含まれる前記目的信号の平均強さに正比例するように決定する反復回数決定手段とを有することを特徴とする音声入力装置。
A voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires, by iterative learning, a separation filter for separating at least one target signal from the detection signals, the device comprising: band division means that divides the acoustic frequency band into a plurality of frequency division bands; and iteration count determination means that determines the number of learning iterations in each frequency division band so that it is directly proportional to the average strength of the target signal contained in the detection signals in that band.

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、
音響周波数帯域を複数の周波数分割帯域に分割する帯域分割手段と、各該周波数分割帯域における前記学習の反復回数を、それぞれの周波数分割帯域において前記検知信号に含まれる前記非目的信号の平均強さに反比例するように決定する反復回数決定手段とを有することを特徴とする音声入力装置。
A voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires, by iterative learning, a separation filter for separating at least one target signal from the detection signals, the device comprising: band division means that divides the acoustic frequency band into a plurality of frequency division bands; and iteration count determination means that determines the number of learning iterations in each frequency division band so that it is inversely proportional to the average strength of the non-target signal contained in the detection signals in that band.

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、
音響周波数帯域を複数の周波数分割帯域に分割する帯域分割手段と、各該周波数分割帯域において前記検知信号に含まれる前記非目的信号の強さに対する前記目的信号の強さの比を求め、該周波数分割帯域の任意の２つを帯域１、帯域２としたときに、帯域１における該比の平均値が帯域２における該比の平均値以上であれば必ず帯域１における前記学習の反復回数が帯域２における前記学習の反復回数よりも小とはならず、該周波数分割帯域の２つ以上における前記学習の反復回数は相異なるように、各該周波数分割帯域における前記学習の反復回数を決定する反復回数決定手段とを有することを特徴とする音声入力装置。
A voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires, by iterative learning, a separation filter for separating at least one target signal from the detection signals, the device comprising: band division means that divides the acoustic frequency band into a plurality of frequency division bands; and iteration count determination means that obtains, in each frequency division band, the ratio of the strength of the target signal to the strength of the non-target signal contained in the detection signals, and that determines the number of learning iterations in each frequency division band such that, for any two frequency division bands taken as band 1 and band 2, whenever the average value of the ratio in band 1 is equal to or greater than the average value of the ratio in band 2, the number of learning iterations in band 1 is never smaller than the number of learning iterations in band 2, and such that the numbers of learning iterations differ between at least two of the frequency division bands.

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、
音響周波数帯域を複数の周波数分割帯域に分割する帯域分割手段と、各該周波数分割帯域における前記学習の反復回数を、それぞれの周波数分割帯域において前記検知信号に含まれる前記非目的信号の強さに対する前記目的信号の強さの比の平均値に正比例するように決定する反復回数決定手段とを有することを特徴とする音声入力装置。
A voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires, by iterative learning, a separation filter for separating at least one target signal from the detection signals, the device comprising: band division means that divides the acoustic frequency band into a plurality of frequency division bands; and iteration count determination means that determines the number of learning iterations in each frequency division band so that it is directly proportional to the average value, in that band, of the ratio of the strength of the target signal to the strength of the non-target signal contained in the detection signals.

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、
音響周波数帯域を複数の周波数分割帯域に分割し、各該周波数分割帯域における前記学習を、それぞれの周波数分割帯域において、該周波数分割帯域の２つ以上における前記学習の反復回数は相異なるように予め決定された反復回数だけ反復することを特徴とする音声入力装置。
A voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires, by iterative learning, a separation filter for separating at least one target signal from the detection signals, wherein the acoustic frequency band is divided into a plurality of frequency division bands and the learning in each frequency division band is repeated a number of times determined in advance for that band, the predetermined numbers of learning iterations differing between at least two of the frequency division bands.

前記検知信号に含まれる前記非目的信号の強さに対する前記目的信号の強さの比を分析する分析手段を有し、前記反復回数決定手段が、各前記周波数分割帯域における前記学習の反復回数を、それぞれの周波数分割帯域において該分析手段によって得られた該比の平均値によって決定することを特徴とする請求項３または４に記載の音声入力装置。 The voice input device according to claim 3 or 4, further comprising analysis means that analyzes the ratio of the strength of the target signal to the strength of the non-target signal contained in the detection signals, wherein the iteration count determination means determines the number of learning iterations in each frequency division band from the average value of the ratio obtained by the analysis means in that band.

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、
音響周波数帯域を複数の周波数分割帯域に分割する帯域分割手段と、前記検知信号に含まれる前記非目的信号が大きい帯域ほど強く抑圧するような帯域通過フィルタを生成する適応処理手段と、該周波数分割帯域の任意の２つを帯域１、帯域２としたときに、帯域１における該帯域通過フィルタの通過率の平均値が帯域２における該帯域通過フィルタの通過率の平均値以上であれば必ず帯域１における前記学習の反復回数が帯域２における前記学習の反復回数よりも小とはならず、該周波数分割帯域の２つ以上における前記学習の反復回数は相異なるように、各前記周波数分割帯域における前記学習の反復回数を決定する反復回数決定手段とを有することを特徴とする音声入力装置。
A voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires, by iterative learning, a separation filter for separating at least one target signal from the detection signals, the device comprising: band division means that divides the acoustic frequency band into a plurality of frequency division bands; adaptive processing means that generates a band-pass filter that suppresses a band more strongly the larger the non-target signal contained in the detection signals is in that band; and iteration count determination means that determines the number of learning iterations in each frequency division band such that, for any two frequency division bands taken as band 1 and band 2, whenever the average pass rate of the band-pass filter in band 1 is equal to or greater than its average pass rate in band 2, the number of learning iterations in band 1 is never smaller than the number of learning iterations in band 2, and such that the numbers of learning iterations differ between at least two of the frequency division bands.

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、 In a voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires by iterative learning a separation filter for separating at least one target signal from the detection signals,
音響周波数帯域を複数の周波数分割帯域に分割する帯域分割手段と、前記検知信号に含まれる前記非目的信号が大きい帯域ほど強く抑圧するような音声帯域通過フィルタを生成する適応処理手段と、各該周波数分割帯域における前記学習の反復回数を、それぞれの周波数分割帯域において該音声帯域通過フィルタの通過率の平均値に正比例するように決定する反復回数決定手段とを有することを特徴とする音声入力装置。 the voice input device comprising: band division means for dividing the acoustic frequency band into a plurality of frequency division bands; adaptive processing means for generating a voice band-pass filter that suppresses a band more strongly the larger the non-target signal contained in the detection signal is in that band; and iteration number determination means for determining the number of learning iterations in each frequency division band so that it is directly proportional to the average pass rate of the voice band-pass filter in that frequency division band.
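
The directly proportional rule of this claim can be sketched in a few lines of Python. The per-bin representation of the filter's pass rate, the function names, and the proportionality constant `iterations_at_full_pass` are hypothetical; the claim itself does not specify them.

```python
def band_average_pass_rate(pass_rate, bands):
    """Average the filter's per-bin pass rate over each (start, stop) band."""
    return [sum(pass_rate[a:b]) / (b - a) for a, b in bands]


def proportional_iterations(avg_pass_rate, iterations_at_full_pass=200):
    """Iteration count proportional to the band's average pass rate.

    Rounding to an integer makes the proportionality approximate.
    """
    return [round(iterations_at_full_pass * r) for r in avg_pass_rate]
```

Under this rule a band that the filter suppresses completely (average pass rate 0) receives no learning iterations, which is where the computational saving would come from.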

目的音と非目的音とが混在する音響を音響センサで検知することによって目的信号と非目的信号とが混在する検知信号を取得し、各該検知信号から少なくとも一つの該目的信号を分離する分離フィルタを学習の反復によって取得する音声入力装置において、 In a voice input device that acquires detection signals in which a target signal and a non-target signal are mixed by detecting, with acoustic sensors, sound in which a target sound and a non-target sound are mixed, and that acquires by iterative learning a separation filter for separating at least one target signal from the detection signals,
音響周波数帯域を複数の周波数分割帯域に分割する帯域分割手段と、前記検知信号に含まれる音環境の内容を判断し決定する判断手段と、該音環境に対応する前記学習の反復回数を記憶する記憶手段とを有し、各該周波数分割帯域において、該判断手段によって判断され決定された音環境に対応する前記学習の反復回数を該記憶手段から読み出し、該読み出した反復回数だけ前記分離フィルタを取得するための学習を反復することを特徴とする音声入力装置。 the voice input device comprising: band division means for dividing the acoustic frequency band into a plurality of frequency division bands; judgment means for judging and determining the sound environment contained in the detection signal; and storage means for storing the number of learning iterations corresponding to each sound environment, wherein, for each frequency division band, the number of learning iterations corresponding to the sound environment determined by the judgment means is read from the storage means, and the learning for acquiring the separation filter is repeated that number of times.
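
The table-lookup arrangement of this claim can be sketched as follows in Python. Everything concrete here is a placeholder assumed for illustration: the environment labels, the stored iteration counts, the threshold rule in `judge_environment`, and the `update_step` callback standing in for one learning update of the separation filter in a given band.

```python
# Hypothetical stored table: environment label -> iteration count per band.
ITERATION_TABLE = {
    "idling": [20, 40, 60],
    "highway": [5, 30, 80],
}


def judge_environment(low_band_level, threshold=0.5):
    """Placeholder judgment rule: a loud low band is taken to mean highway driving."""
    return "highway" if low_band_level > threshold else "idling"


def run_learning(environment, update_step, table=ITERATION_TABLE):
    """Read the stored counts and repeat the learning step that many times per band."""
    performed = []
    for band, count in enumerate(table[environment]):
        for _ in range(count):
            update_step(band)  # one learning update for this band
        performed.append(count)
    return performed
```

Because the counts are precomputed and stored, no per-band signal analysis is needed at run time beyond the environment judgment itself.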