JP5374427B2

JP5374427B2 - Sound source separation device, sound source separation method and program therefor, video camera device using the same, and mobile phone device with camera

Info

Publication number: JP5374427B2
Application number: JP2010062535A
Authority: JP
Inventors: Masato Togami (戸上真人)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-03-18
Filing date: 2010-03-18
Publication date: 2013-12-25
Anticipated expiration: 2030-03-18
Also published as: JP2011199474A

Description

The present invention relates to a sound source separation device that extracts only the sound of a source in a specific direction from sound recorded by microphones, and in particular to a sound source separation device suited to equipment, such as a mobile phone, in which sound is diffracted on the housing.

Sound source separation techniques that use multiple microphones to extract only a sound source in a specific direction (the target sound) have been widely studied. In particular, the method based on the minimum variance beamformer (see, for example, Non-Patent Document 1) computes the direction of arrival of the sound for each time and frequency component from the inter-microphone phase difference; if the phase difference lies within a predetermined range, that time-frequency component is judged to be target sound, and otherwise it is judged to be noise. A filter for suppressing noise is then trained on the basis of these decisions. When two microphone elements are used, the minimum variance beamformer method determines the phase-difference range used for the target/noise decision from information on the direction in which the desired target sound lies. In an anechoic environment the measured phase difference deviates little from its theoretical value, so target sound and noise can be discriminated using a phase-difference range derived from the theoretical value.

Another sound source separation technique is a method based on independent component analysis, which uses no information on the source direction and instead exploits the statistical independence of the target sound and the noise (see, for example, Patent Document 1).

Patent Document 1: JP 2006-17961 A

Non-Patent Document 1: Masato Togami, Akio Amano, "Speech recognition technology under noise for the human-symbiotic robot EMIEW," Measurement and Control (計測と制御), Vol. 46, No. 6, June 2007

As described above, the method based on the minimum variance beamformer can discriminate target sound from noise in an anechoic environment. However, when sound is diffracted on the housing, as with a mobile phone, the measured phase difference deviates greatly from the theoretical value, so a decision based on a phase-difference range derived from the theoretical value is undesirable. The method based on independent component analysis relies on the statistical properties of the target sound and the noise, so it requires relatively long observation data, and it cannot determine which of its multiple output signals is the sound the user wants to hear.

An object of the present invention is to provide a sound source separation device and a sound source separation method that can separate sound quickly and with high accuracy even when sound is diffracted on the housing, as with a mobile phone.

To solve the above problem, the sound source separation device of the present invention processes signals input from a microphone array composed of a plurality of microphones and extracts only the sound of a specific sound source. It comprises: a filter adaptation unit that adapts a noise suppression filter using a sound transfer model calculated from microphone arrangement information obtained in advance; an independent component analysis unit that separates sound using the statistical properties of the target sound and the noise, without using the microphone arrangement information; and a separated sound selection unit that selects, from among the output signal of the filter adaptation unit and the output signals of the independent component analysis unit, the signal whose time correlation with the sum of the powers of a plurality of frequency bands of the filter adaptation unit is highest.

The sound source separation device of the present invention may further comprise a correlation analysis unit that, in order to select an output signal from the plurality of separated signals of the independent component analysis unit, computes the time correlation between each separated signal and the sum of the powers of the plurality of frequency bands of the filter adaptation unit.

The sound source separation device of the present invention may further comprise a transfer model replacement unit that rewrites the sound transfer model used by the filter adaptation unit, using the separation filter calculated by the independent component analysis unit.

The sound source separation device of the present invention may comprise, in place of the independent component analysis unit, a clustering unit that applies clustering to the input signal.

The sound source separation device of the present invention may also comprise a minimum power filtering unit, in which case the filter adaptation unit creates separation filters corresponding to noise directions on the basis of a filter model, and the minimum power filtering unit selects, from among the output signals obtained by applying the separation filters to the input signal, the one with the smallest power.

In the sound source separation device of the present invention, the independent component analysis unit may select a filter from among the filter candidates included in the filter model.

The present invention further provides a video camera device or a camera-equipped mobile phone device incorporating the above sound source separation device.

The sound source separation method of the present invention processes signals input from a microphone array composed of a plurality of microphones and extracts only the sound of a specific sound source. It comprises the steps of: adapting a noise suppression filter using a sound transfer model calculated from microphone arrangement information obtained in advance, to obtain an output signal; separating sound using the statistical properties of the target sound and the noise, without using the microphone arrangement information, to obtain output signals; and selecting, from among the output signal obtained with the adapted noise suppression filter and the output signals obtained by statistical separation, the signal whose time correlation with the sum of the powers of a plurality of frequency bands of the noise-suppression-filter output is highest.

The program of the present invention causes a computer to execute a sound source separation method that processes signals input from a microphone array composed of a plurality of microphones and extracts only the sound of a specific sound source. It causes the computer to execute the steps of: adapting a noise suppression filter using a sound transfer model calculated from microphone arrangement information obtained in advance, to obtain an output signal; separating sound using the statistical properties of the target sound and the noise, without using the microphone arrangement information, to obtain output signals; and selecting, from among the output signal obtained with the adapted noise suppression filter and the output signals obtained by statistical separation, the signal whose time correlation with the sum of the powers of a plurality of frequency bands of the noise-suppression-filter output is highest.

According to the present invention, sound can be separated quickly and with high accuracy even in an environment where sound is diffracted irregularly on a housing such as that of a mobile phone.

FIG. 1 is a diagram showing the hardware configuration of the present invention.
FIG. 2 is a diagram showing an example of the positional relationship between the microphones and the mobile phone when the present invention is applied to a mobile phone.
FIG. 3 is a diagram showing the block configuration of the first embodiment of the present invention.
FIG. 4 is a diagram showing the block configuration of the second embodiment of the present invention.
FIG. 5 is a diagram showing the processing flow of sound source separation by independent component analysis in the present invention.
FIG. 6 is a diagram showing the processing flow of the correlation analysis unit of the present invention.
FIG. 7 is a diagram showing the detailed block configuration of the filter adaptation unit of the present invention.
FIG. 8 is a diagram showing the block configuration of the third embodiment of the present invention.
FIG. 9 is a diagram showing the processing flow of the separated sound selection unit of the present invention.
FIG. 10 is a diagram showing the block configuration of the fourth embodiment of the present invention.
FIG. 11 is a diagram showing the block configuration of the fifth embodiment of the present invention.
FIG. 12 is a diagram showing the processing flow of the minimum power filtering unit of the present invention.
FIG. 13 is a diagram showing the processing flow when the minimum power filtering unit of the present invention is used.

Embodiments of the present invention are described below with reference to the drawings. The sound source separation device of the present invention consists of a microphone array composed of a plurality of microphones connected to a computer, an A/D converter that digitizes the sound waveform information recorded by the microphone array, and software that runs on the computer, analyzes the digitized waveform information, and extracts the sound the user wants; it can extract sound from any direction specified by the user. The invention is used, for example, when recording audio with the video camera function of a device that has one, such as a camera-equipped mobile phone.

FIG. 1 shows the hardware configuration of the present invention. It consists of a microphone array 101 composed of at least two microphone elements; an A/D converter 102 that converts the analog sound pressure values input from the microphone array into digital data; a central processing unit 103 that processes the output of the A/D converter; a volatile memory 104 that holds the work area used by the program executed on the central processing unit; and a nonvolatile storage medium 105 that stores the program itself and the information needed during program execution.

FIG. 2 illustrates the sound source separation device of the present invention mounted on a mobile phone. A plurality of microphones (microphone array 202) are installed outside or on the surface of the housing of the mobile phone 201. The outputs of the installed microphones are processed by an acoustic signal processing device provided inside the mobile phone.

FIG. 3 shows the specific software block configuration of the sound source separation processing in the first embodiment of the present invention. In the multichannel A/D converter 102, the waveform capture unit 301 converts the analog sound pressure data picked up by the microphones into digital data. A/D conversion is performed for each microphone in the microphone array. Any sampling rate may be used for the A/D conversion, but it is desirable to vary it according to the application to which the acoustic signal processing device is coupled. For example, when coupled to a voice call device, the rate is set to 8 kHz for calls. When the sound source separation device of the present invention is used as the audio function of a video camera on a mobile phone, a higher sampling rate such as 44.1 kHz may be set.

The signal of each microphone captured by the waveform capture unit 301 is sent to the central processing unit 103 for processing. Let m be the microphone element index and t the sample time; Xm(t) is the digital sound pressure data of the t-th sample of microphone element m. In the central processing unit 103, the delay buffer unit 302 buffers Xm(t). The buffer length is on the order of a few seconds, and each time a few seconds of data have accumulated the data are output as a block. Xm(t) is normally modeled as a mixture of signals from multiple sound sources.

The independent component analysis unit 303 separates the multichannel digital sound pressure data, in which multiple sound sources are mixed, into individual sources on the basis of the statistical independence of the sources. Within the few seconds of data, the processing start point is shifted by a hop of several tens of milliseconds (Lshift) at a time, and a short-time Fourier transform is applied to each segment of several tens of milliseconds (Lframe), yielding the time-frequency domain signal Xm(f, τ). Here τ is the index of the processing unit (frame), incremented by 1 each time the short-time Fourier transform has been applied to every microphone element, and f is the frequency. Source separation is applied to the time-frequency domain signal Xm(f, τ). Before the short-time Fourier transform, processing such as DC component removal and windowing may be applied to the waveform; a Hamming, Hanning, or Blackman window can be used as the window function. The specific processing flow of the separation performed by the independent component analysis unit is described later. Applying independent component analysis yields, for each frequency, a filter Wica(f) used for separation. Using the filter Wica(f) adapted for each frequency, the independent component analysis unit obtains the separated signal y(f, τ) as y(f, τ) = Wica(f) X(f, τ). Here X(f, τ) is the vector defined by (Equation 1), whose elements are the signals of the individual microphones at the same frame and the same frequency.
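The framing and the per-bin separation step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `hamming`, `stft`, and `separate` are hypothetical, the DFT is written naively for readability, and the separation matrix is taken as given because the adaptation of Wica(f) is described elsewhere.

```python
import cmath
import math

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def stft(x, l_frame, l_shift):
    """Return frames[tau][f]: naive DFT of each windowed frame of x."""
    win = hamming(l_frame)
    frames = []
    for start in range(0, len(x) - l_frame + 1, l_shift):
        seg = [x[start + k] * win[k] for k in range(l_frame)]
        spec = [sum(seg[k] * cmath.exp(-2j * math.pi * f * k / l_frame)
                    for k in range(l_frame))
                for f in range(l_frame // 2 + 1)]
        frames.append(spec)
    return frames

def separate(w_ica, x_vec):
    """y(f, tau) = Wica(f) X(f, tau) for one time-frequency bin."""
    return [sum(w * x for w, x in zip(row, x_vec)) for row in w_ica]
```

Here `x_vec` stacks the same bin from every microphone, so `w_ica` is a square matrix with one row per separated output.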

y(f, τ) is the vector whose elements are the separated signals of the individual sound sources at frame τ and frequency f. The output of the independent component analysis unit contains as many separated signals as there are microphones. Normally one of them corresponds to the desired sound the user needs and the others correspond to noise. Independent component analysis itself provides no information on which output is the desired sound and which is noise. In the present invention, a signal likely to be the desired sound is estimated separately by another source separation method, and the output of the independent component analysis unit that correlates most strongly with that signal is output as the desired sound.

For the explanation that follows, the steering vector is defined. The vector Af(θ), whose elements are the phase delays incurred by sound of frequency f from source direction θ before it reaches each microphone, is defined by (Equation 2). Here j denotes the imaginary unit and M the number of microphones. In the present invention, a delay common to all microphones in Af(θ) has no particular significance, so Af(θ) need not be defined in terms of delays from the source position; it may instead be defined in terms of delays relative to a reference microphone. In the configuration of the present invention, the first microphone is taken as the reference and the delay Tm,f(θ) is defined by (Equation 3). tm(θ) is the time taken for sound from source direction θ to reach the m-th microphone.

Assuming that the microphones are arranged on a straight line and that the sound source is sufficiently far away relative to the microphone spacing, Tm,f(θ) can be approximated by (Equation 4). Here dm is the distance between the m-th microphone and the first microphone, and c is the speed of sound, which is set to about 340 m/s, its value at room temperature. For microphone arrays that are not linear, Tm,f(θ) takes a more complicated form, but in any case it can be obtained by simple geometric calculation as long as the geometry of the microphone array is known.
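As an illustration of the far-field, linear-array approximation, the sketch below builds a steering vector relative to the first microphone. Equations 2 to 4 are not reproduced in this text, so the angle convention (θ measured from broadside, giving a path difference of dm·sin θ) and the sign of the exponent are assumptions; the function name `steering_vector` is hypothetical.

```python
import cmath
import math

SPEED_OF_SOUND = 340.0  # m/s, the room-temperature value used in the text

def steering_vector(freq_hz, theta_deg, mic_positions_m):
    """Far-field steering vector of a linear array, first microphone as reference.

    Assumed convention (not from the source): theta is measured from broadside,
    so the extra path length to microphone m is d_m * sin(theta).
    """
    theta = math.radians(theta_deg)
    d_ref = mic_positions_m[0]
    vec = []
    for pos in mic_positions_m:
        delay_s = (pos - d_ref) * math.sin(theta) / SPEED_OF_SOUND
        phase = 2 * math.pi * freq_hz * delay_s
        vec.append(cmath.exp(-1j * phase))  # sign convention is an assumption
    return vec
```

With this convention a source at broadside (θ = 0) gives identical phases on every microphone, and the reference microphone always has the value 1.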

In the present invention, the geometry of the microphone array is assumed to be stored in nonvolatile memory in advance, and the steering vectors are generated from that information.

The transfer model 309 discretizes physical space, for example by dividing the 360 degrees of azimuth into 5-degree steps, and holds the steering vector Af(θ) for each discretized source direction.

FIG. 7 shows the detailed block configuration of the filter adaptation unit 305. The filter adaptation unit 701 is identical to the filter adaptation unit 305. The discrete Fourier transform unit 702 is a block that converts time-domain digital data into frequency-domain digital data. From the steering vectors Af(θ) for each source direction in the transfer model 309 and each input signal X(f, τ), the filter adaptation unit 305 obtains the source direction for each time-frequency component by (Equation 5). Here * is the operator denoting the conjugate transpose of a matrix or vector, Ωall is the set of discretized source directions, and θ(f, τ) is the estimated source direction for each time-frequency component.
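The per-bin direction estimate can be sketched as a search over the discretized directions. Equation 5 is not reproduced in this text; the score used here, |Af(θ)* X(f, τ)|^2, is a common delay-and-sum style match and is an assumption, as is the function name `estimate_direction`.

```python
def estimate_direction(x_vec, steering_by_theta):
    """Return the theta whose steering vector best matches x_vec.

    Score is |A_f(theta)^* X(f, tau)|^2, a delay-and-sum style match; this
    scoring rule is an assumption standing in for (Equation 5).
    """
    def score(theta):
        a = steering_by_theta[theta]
        inner = sum(ak.conjugate() * xk for ak, xk in zip(a, x_vec))
        return abs(inner) ** 2
    return max(steering_by_theta, key=score)
```

`steering_by_theta` plays the role of the transfer model: a mapping from each direction in Ωall to its steering vector at the frequency being processed.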

The source direction estimation unit 703 is the block that calculates the source direction; it operates on each frequency. It then determines whether θ(f, τ) is contained in Ωdesired, a subset of Ωall that specifies the predetermined source directions of the desired sound. For example, to separate and extract sound from near the front in azimuth, Ωdesired is specified as -30 degrees to +30 degrees. If θ(f, τ) is contained in Ωdesired, the target sound adaptation unit 704 updates the steering vector a(f) of the desired direction by (Equation 6). If not, the noise adaptation unit 705 updates the noise covariance matrix R(f) by (Equation 7). Here β and α are parameters that control how much weight is given to past desired sound and past noise, respectively; each is set to a value between 0 and 1, for example a preset value such as 0.99.
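The routing between the target update and the noise update can be sketched as below. Equations 6 and 7 are not reproduced in this text, so the exponential-forgetting forms are assumptions; in particular, normalizing the observation by the first channel before averaging it into a(f) is an illustrative choice, and the names `update_target`, `update_noise_cov`, and `adapt` are hypothetical.

```python
def update_target(a, x_vec, beta=0.99):
    """a(f) <- beta * a(f) + (1 - beta) * X / X_1  (assumed form)."""
    if x_vec[0] == 0:
        return a  # cannot normalize by the reference channel; skip this bin
    rel = [xk / x_vec[0] for xk in x_vec]
    return [beta * ak + (1 - beta) * rk for ak, rk in zip(a, rel)]

def update_noise_cov(r, x_vec, alpha=0.99):
    """R(f) <- alpha * R(f) + (1 - alpha) * X X^*  (assumed form)."""
    m = len(x_vec)
    return [[alpha * r[i][j] + (1 - alpha) * x_vec[i] * x_vec[j].conjugate()
             for j in range(m)] for i in range(m)]

def adapt(theta, desired, a, r, x_vec):
    """Route one time-frequency bin to the target or the noise update."""
    if theta in desired:
        return update_target(a, x_vec), r
    return a, update_noise_cov(r, x_vec)
```

With β and α near 1, as in the 0.99 example above, each new bin changes the running estimates only slightly.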

Next, for each frequency, the filter adaptation unit 706 creates a multichannel noise cancellation filter w(f) from the adapted covariance matrix R(f) and a(f) by (Equation 8). Here â(f) is obtained from a(f) by (Equation 9). The filter adaptation unit 305 outputs the filter w(f) for each frequency and ends its processing.
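For orientation, the standard minimum variance (MVDR) filter for a two-microphone array is sketched below. Equations 8 to 10 are not reproduced in this text, so the closed form w = R^-1 â / (â* R^-1 â), the unit-norm normalization used for â, and the application y = w* X are assumptions based on the usual textbook formulation; the function names are hypothetical.

```python
def inv2(r):
    """Inverse of a 2x2 complex matrix."""
    det = r[0][0] * r[1][1] - r[0][1] * r[1][0]
    return [[r[1][1] / det, -r[0][1] / det],
            [-r[1][0] / det, r[0][0] / det]]

def mvdr_filter(r, a):
    """w = R^-1 a_hat / (a_hat^* R^-1 a_hat), with a_hat = a / ||a|| (assumed)."""
    norm = sum(abs(ak) ** 2 for ak in a) ** 0.5
    a_hat = [ak / norm for ak in a]
    r_inv = inv2(r)
    ra = [sum(r_inv[i][j] * a_hat[j] for j in range(2)) for i in range(2)]
    denom = sum(a_hat[i].conjugate() * ra[i] for i in range(2))
    return [v / denom for v in ra], a_hat

def apply_filter(w, x_vec):
    """y(f, tau) = w(f)^* X(f, tau)."""
    return sum(wk.conjugate() * xk for wk, xk in zip(w, x_vec))
```

A filter of this form passes a signal arriving along â with unit gain while minimizing the output power contributed by the noise covariance R(f).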

The filtering unit 306 removes only the noise signal from the input signal X(f, τ) for each time-frequency component. The noise is removed by (Equation 10) using the filter w(f) output by the filter adaptation unit 305. The unit then outputs the noise-suppressed signal y(f, τ) for each time-frequency component and ends its processing.

The reference signal generation unit 307 computes the reference power spectrum from the noise-suppressed signal ymvbf(f, τ) output by the filtering unit 306, using (Equation 11).

Here Fs is the lowest frequency included in the reference power spectrum. For example, if noise enters the low-frequency components because of power supply noise or similar, Fs is set to about 100 Hz. Fe is the highest frequency included in the reference power spectrum. Most of the energy of human speech lies at frequencies of 8000 Hz or below, and apart from a few phonemes such as voiceless fricatives, speech has no power in components at 10000 Hz or above. With this in mind, it is desirable to set Fe to a frequency of 8000 Hz or below. The reference signal generation unit 307 outputs the reference power spectrum r(τ) for each time frame and ends its processing.
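The band-limited power sum that forms the reference can be sketched as follows. Equation 11 is not reproduced in this text; summing |ymvbf(f, τ)|^2 over the bins between Fs and Fe follows the claim language ("sum of the powers of a plurality of frequency bands"), while the inclusive band edges and the function name `reference_power` are assumptions.

```python
def reference_power(y_mvbf_frame, freqs_hz, f_start=100.0, f_end=8000.0):
    """r(tau): sum of |ymvbf(f, tau)|^2 over bins with f_start <= f <= f_end."""
    return sum(abs(y) ** 2 for y, f in zip(y_mvbf_frame, freqs_hz)
               if f_start <= f <= f_end)
```

`y_mvbf_frame` holds the filtered spectrum of one frame and `freqs_hz` the center frequency of each bin, so the function returns one scalar per frame.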

The correlation analysis unit 304 regards as the desired sound the separated signal y(f, τ) output by the independent component analysis unit whose correlation with the reference power spectrum r(τ) is highest. It first obtains the correlation Cor_i(f) between each separated signal and the reference power spectrum by (Equation 12).

Here Eτ is the operator that takes the expectation over τ, and Pi(f, τ) is the power |yi(f, τ)|^2 of the i-th element of the separated signal y(f, τ) output by the independent component analysis unit. The expectation could ordinarily be replaced by a sample average over an infinitely long signal. The present invention, however, assumes that the signal for each time frame arrives sequentially, so Cor_i(f) is computed at each frame by (Equation 13), giving a correlation coefficient Cor_i(f, τ) for each frame. PRi(f, τ) is obtained by (Equation 14), μpi(f, τ) by (Equation 15), and μr(τ) by (Equation 16). γ is a parameter that controls how quickly past signals are forgotten and is set to a value between 0 and 1. The correlation analysis unit 304 outputs Cor_i(f, τ) and ends its processing.
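A frame-by-frame correlation coefficient with exponential forgetting can be sketched as below. Equations 13 to 16 are not reproduced in this text, so the recursions for the running means and second moments are an assumed, conventional formulation; the class name `RunningCorrelation` is hypothetical.

```python
class RunningCorrelation:
    """Exponentially forgetting correlation between two power sequences."""

    def __init__(self, gamma=0.99):
        self.gamma = gamma
        self.mu_p = self.mu_r = 0.0            # running means
        self.pp = self.rr = self.pr = 0.0      # running second moments

    def update(self, p, r):
        """Feed P_i(f, tau) and r(tau) for one frame; return the correlation."""
        g = self.gamma
        self.mu_p = g * self.mu_p + (1 - g) * p
        self.mu_r = g * self.mu_r + (1 - g) * r
        self.pp = g * self.pp + (1 - g) * p * p
        self.rr = g * self.rr + (1 - g) * r * r
        self.pr = g * self.pr + (1 - g) * p * r
        cov = self.pr - self.mu_p * self.mu_r
        var_p = self.pp - self.mu_p ** 2
        var_r = self.rr - self.mu_r ** 2
        if var_p <= 0.0 or var_r <= 0.0:
            return 0.0  # correlation undefined; report no evidence
        return cov / (var_p * var_r) ** 0.5
```

One instance would be kept per separated output i and per frequency f, and `update` called once per frame.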

The separated sound selection unit 308 first uses (Equation 17) to select the index imax(f, τ) of the independent component analysis output that is closest to the reference signal. It then uses (Equation 18) to calculate the correlation coefficient Cor_mvbf(f, τ) between the power spectrum of the separated sound ymvbf(f, τ) output by the filtering unit 306 and that of the reference signal. Here PRmvbf(f, τ) is calculated by (Equation 19) and μmvbf(f, τ) by (Equation 20).

FIG. 9 shows the details of the processing flow that chooses between the filtered sound and the separated sound. "Extract ICA separated sound" 901 selects imax(f, τ) and Cor_imax(f, τ). "Extract filtered sound" 902 extracts ymvbf(f, τ). "Compare correlation with reference signal" 903 calculates Cor_mvbf(f, τ). In "Does the filtered sound have the higher correlation?" 904, if Cor_imax(f, τ) exceeds Cor_mvbf(f, τ), the flow proceeds to "Select ICA separated sound" 906 and the separated signal yimax(f, τ) of the independent component analysis corresponding to imax is taken as the noise-suppressed signal for that time-frequency component. Otherwise the flow proceeds to "Select filtered sound" 905, and ymvbf(f, τ) is taken as the noise-suppressed signal for that time-frequency component and output.
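The decision in FIG. 9 amounts to a single comparison per time-frequency component, sketched below. The function name `select_output` and the string labels are hypothetical; the comparison itself follows the flow as described.

```python
def select_output(cor_imax, y_imax, cor_mvbf, y_mvbf):
    """Choose the output for one time-frequency bin as in the FIG. 9 flow.

    Returns a (label, value) pair; the ICA output wins only when its
    correlation with the reference strictly exceeds that of the filtered sound.
    """
    if cor_imax > cor_mvbf:
        return "ica", y_imax
    return "mvbf", y_mvbf
```

Ties fall to the filtered sound, matching the "otherwise" branch of the flow.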

FIG. 13 shows an alternative way of choosing between the filtered sound and the independent component analysis sound. "Extract ICA separated sound" 1301 selects yimax(f, τ). "Extract filtered sound" 1302 extracts ymvbf(f, τ). In "Does the filtered sound have less power?" 1304, if ymvbf(f, τ) has less power than yimax(f, τ), the flow proceeds to "Select filtered sound" 1305, selects ymvbf(f, τ), and ends. Otherwise the flow proceeds to "Select ICA separated sound" 1306, selects yimax(f, τ), and ends.

FIG. 6 shows the processing flow from independent component analysis to separated sound selection. The flow of FIG. 6 may be executed independently for each frequency, or "independent component analysis" 601 may be executed for all frequencies at once while "initialization" 602 and the subsequent steps are executed per frequency. Alternatively, only "independent component analysis" 601 may be performed once on, for example, about 3 seconds of audio data, with the subsequent processing performed every frame. "Independent component analysis" 601 applies independent component analysis and obtains, for each frequency, as many separated sounds as there are microphones. "Initialization" 602 clears to zero the variables max, imax, and i used for separated sound selection. "Determination" 603 determines whether i is greater than length, where length is set equal to the number of microphones. In other words, this step checks whether all separated sounds have been examined. If they have, the process ends. Otherwise, the process moves to "correlation calculation" 604, which calculates Cor,i(f, τ). "Determination" 605 then checks whether Cor,i(f, τ) is greater than max. If it is, the process moves to "maximum value update" 606, which sets max = Cor,i(f, τ) and imax = i; i is then incremented by 1 and the process returns to "determination" 603. Otherwise, i is incremented by 1 and the process returns to "determination" 603.
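The selection loop of FIG. 6 (steps 602 to 606) amounts to an arg-max over the per-signal correlation values. The sketch below assumes the correlations Cor,i(f, τ) have already been computed by "correlation calculation" 604 and are passed in as a list; the function name select_separated_sound is hypothetical.

```python
from typing import Sequence


def select_separated_sound(correlations: Sequence[float]) -> int:
    """Return imax, the index of the separated sound with the largest correlation."""
    best, imax = 0.0, 0                      # "initialization" 602: max and imax start at zero
    for i, cor in enumerate(correlations):   # "determination" 603: loop over all separated sounds
        if cor > best:                       # "determination" 605
            best, imax = cor, i              # "maximum value update" 606
    return imax
```

Because max starts at zero, as in the flow described above, a separated sound is only chosen over index 0 when its correlation is strictly positive.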

FIG. 4 shows the software block configuration of the second embodiment of the present invention. The processing of the waveform acquisition unit 401, delay buffer unit 402, correlation analysis unit 404, filter adaptation unit 405, filtering unit 406, reference signal generation unit 407, and separated sound selection unit 408 is the same as that of the waveform acquisition unit 301, delay buffer unit 302, correlation analysis unit 304, filter adaptation unit 305, filtering unit 306, reference signal generation unit 307, and separated sound selection unit 308, respectively. The transfer model 409 also has the same configuration as the transfer model 309. In addition to the processing of the independent component analysis unit 303, the independent component analysis unit 403 outputs Wica(f)^-1_i, which is calculated from the separation filter Wica(f) and corresponds to the steering vector of each separated signal of the independent component analysis. Wica(f)^-1_i is the i-th column of the inverse matrix of the separation filter Wica(f). The transfer model replacement unit 410 obtains the sound source direction θ^i(f) of Wica(f)^-1_i for each separated signal using (Equation 21).
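As an illustration of how the steering-vector estimate is read off the separation filter, the sketch below inverts a 2x2 complex separation matrix (a two-microphone case, which is an assumption made here for brevity) and returns its i-th column. The helper names inverse_2x2 and steering_vector are hypothetical.

```python
def inverse_2x2(w):
    """Invert a 2x2 matrix given as nested lists of (complex) numbers."""
    (a, b), (c, d) = w
    det = a * d - b * c
    if det == 0:
        raise ValueError("separation filter is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def steering_vector(w, i):
    """Return the i-th column of the inverse of the separation filter w."""
    inv = inverse_2x2(w)
    return [row[i] for row in inv]
```

Multiplying the separation filter by the returned column gives the i-th unit vector, which is why this column describes how the i-th separated source reaches the microphones.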

Next, a histogram of θ^i(f) along the frequency axis f is calculated, and the sound source direction at the histogram peak is taken as Θi. The steering vector for this sound source direction is then updated using (Equation 22), where η is the forgetting factor for the past steering vector.
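The histogram step can be sketched as follows. The per-frequency direction estimates are assumed to be given in degrees, and the bin width of 10 degrees is an illustrative value not taken from the specification; the function name histogram_peak_direction is likewise hypothetical. The steering-vector update of (Equation 22) is not reproduced here.

```python
def histogram_peak_direction(theta_by_freq, bin_width_deg=10.0):
    """Return the centre of the most populated direction bin.

    theta_by_freq: one direction estimate (degrees) per frequency.
    bin_width_deg: histogram bin width, an assumed illustrative value.
    """
    counts = {}
    for theta in theta_by_freq:
        b = int(theta // bin_width_deg)      # bin index for this frequency
        counts[b] = counts.get(b, 0) + 1
    peak_bin = max(counts, key=counts.get)   # most frequent bin across frequencies
    return (peak_bin + 0.5) * bin_width_deg  # bin centre is used as the direction
```

Taking the peak across frequencies makes the estimate robust to individual frequencies whose direction estimates are corrupted by noise or diffraction.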

With this configuration, the steering vector can be adjusted automatically even when it takes different values because of diffraction or similar effects.

FIG. 5 shows the detailed processing flow of the independent component analysis unit 403 of the present invention. The independent component analysis unit 403 operates on short-time frequency-domain signals. FIG. 5 shows the processing for a single frequency; the independent component analysis unit applies it to every frequency. "Convergence determination" 501 determines whether the sound source separation filter W has sufficiently converged. Convergence may be declared when the number of filter updates reaches a predetermined count, or when the power of the off-diagonal terms of the nonlinear covariance matrix described later falls to or below a predetermined value relative to the power of the diagonal terms. If the filter is judged to have converged, the process ends and the sound source separation filter W is output.

"Initialization" 502 sets the processing start position to the head of the buffered waveform and clears Rica(f), described later, to zero. "Determination" 503 determines whether the processing start position is at or before the end position of the buffered waveform. If it is, X(f, τ) for each frame and each frequency is filtered to obtain the separated sound yica(f, τ) ("filtering" 504). Because this separated sound was produced by a separation filter that is still adapting, the separation is considered insufficient. "Covariance update" 505 updates Rica(f) using (Equation 23). Here φ(x) is a function corresponding to the derivative of the probability distribution of the sound source and is defined by (Equation 24).

Rica(f) is called the nonlinear covariance matrix; the closer its off-diagonal terms are to zero, the more independent the separated sound sources are. The diagonal terms correspond to the magnitude of each sound source. The ratio of the off-diagonal terms to the diagonal terms is therefore what matters, and the convergence determination of the separation filter may check this ratio. "Variable addition" 507 advances the waveform processing start position by the frame shift Lshift. When the processing start position has reached the end of the buffered waveform, the process moves to "filter update" 506, which updates the separation filter using (Equation 25).

η is a variable that controls the filter update speed. A larger value makes the filter converge faster but increases the risk of divergence; a smaller value makes convergence slower but reduces the risk of divergence.
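The convergence check described for step 501 can be sketched as below. The text offers the update-count criterion and the off-diagonal/diagonal power criterion as alternatives; combining them with a logical OR is an assumption of this sketch, as are the function names and the default values of max_updates and ratio_threshold.

```python
def offdiag_to_diag_power_ratio(r):
    """Ratio of off-diagonal power to diagonal power of a square matrix r."""
    n = len(r)
    diag = sum(abs(r[i][i]) ** 2 for i in range(n))
    off = sum(abs(r[i][j]) ** 2 for i in range(n) for j in range(n) if i != j)
    if diag == 0:
        return float("inf")   # no diagonal power: treat as not converged
    return off / diag


def has_converged(r, n_updates, max_updates=100, ratio_threshold=1e-3):
    """Converged if the update budget is used up or the ratio is small enough.

    max_updates and ratio_threshold are illustrative values only.
    """
    if n_updates >= max_updates:
        return True
    return offdiag_to_diag_power_ratio(r) <= ratio_threshold
```

A ratio near zero means the nonlinear covariance matrix is close to diagonal, that is, the separated outputs are close to mutually independent.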

FIG. 8 shows the software block configuration of the third embodiment of the present invention. The processing of the waveform acquisition unit 801, delay buffer unit 802, correlation analysis unit 804, filter adaptation unit 805, filtering unit 806, reference signal generation unit 807, and separated sound selection unit 808 is the same as that of the waveform acquisition unit 301, delay buffer unit 302, correlation analysis unit 304, filter adaptation unit 305, filtering unit 306, reference signal generation unit 307, and separated sound selection unit 308, respectively. The clustering unit 803 applies K-means clustering to Q(f, τ), which is obtained by transforming the successively received input signal X(f, τ) using (Equation 26), and obtains a matrix C(f) whose row elements are a predetermined number R of centroids.

The separation filter Wr(f) for obtaining the r-th sound source is defined as Wr(f) = C+(f)c, where C+(f) is the Moore-Penrose generalized inverse of C(f) and c is a vector whose r-th element is 1 and whose other elements are all 0. The r-th separated sound is obtained as yr(f, τ) = Wr(f)*X(f, τ). The clustering unit 803 outputs yr(f, τ) and ends its processing.

FIG. 10 shows the software block configuration of the fourth embodiment of the present invention. The processing of the waveform acquisition unit 1001, delay buffer unit 1002, independent component analysis unit 1003, correlation analysis unit 1004, reference signal generation unit 1007, and separated sound selection unit 1008 is the same as that of the waveform acquisition unit 301, delay buffer unit 302, independent component analysis unit 303, correlation analysis unit 304, reference signal generation unit 307, and separated sound selection unit 308, respectively. The filter adaptation unit 1005 calculates the separation filter wj(f, τ) corresponding to noise direction j in the same manner as (Equation 8), using the covariance matrix Rj(f, τ) for each noise direction j held in advance in the filter model 1009 and a predetermined target sound steering vector. Rj(f, τ) is obtained for each noise recorded in advance. The minimum power filtering unit 1006 applies each wj(f, τ) to the input signal, selects for each time-frequency point the output signal with the lowest power, and sets it as ymvbf(f, τ).

FIG. 12 shows the specific processing flow of the minimum power filtering unit. "Initialization" 1201 clears to zero all of the variables i, imin, and min needed for the processing. "Determination" 1202 determines whether i is smaller than length, where length is the number of filters subject to minimum power filtering. If i is not smaller than length, the process ends. If it is smaller, the process proceeds to "filtering" 1203, which calculates the signal ymvbf,i(f, τ) filtered by the i-th filter. "Is the power after filtering less than or equal to min?" 1204 determines whether |ymvbf,i(f, τ)|^2 is smaller than min. If it is, the process proceeds to "power replacement" 1205, which sets min to |ymvbf,i(f, τ)|^2 and imin = i; i is then incremented by 1 and the process returns to "determination" 1202. Otherwise, i is incremented by 1 and the process returns to "determination" 1202.
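The minimum power selection of FIG. 12 can be sketched for one time-frequency point as follows. Two assumptions are made and marked in the code: each filter is applied as the inner product of its conjugated coefficients with the microphone signals, and the running minimum starts at infinity so that the first filter output can be accepted. The function name min_power_output is hypothetical.

```python
def min_power_output(filters, x):
    """Return (imin, y) for the filter whose output has the lowest power.

    filters: list of coefficient vectors w_j, one per noise direction.
    x: microphone signals X(f, tau) at one time-frequency point.
    """
    best_power = float("inf")   # assumption: start at +inf rather than zero
    i_min, y_min = -1, 0j
    for i, w in enumerate(filters):                                     # loop 1202
        # assumption: output is sum of conj(w_k) * x_k           "filtering" 1203
        y = sum(complex(wk).conjugate() * xk for wk, xk in zip(w, x))
        power = abs(y) ** 2
        if power < best_power:                                          # decision 1204
            best_power, i_min, y_min = power, i, y                      # "power replacement" 1205
    return i_min, y_min
```

With an empty filter list the function returns (-1, 0j), signalling that no candidate was available.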

FIG. 11 shows the fifth embodiment of the present invention. The processing of the waveform acquisition unit 1101, delay buffer unit 1102, correlation analysis unit 1104, filter adaptation unit 1105, minimum power filtering unit 1106, reference signal generation unit 1107, and separated sound selection unit 1108 is the same as that of the waveform acquisition unit 1001, delay buffer unit 1002, correlation analysis unit 1004, filter adaptation unit 1005, minimum power filtering unit 1006, reference signal generation unit 1007, and separated sound selection unit 1008, respectively. The independent component analysis unit 1103 selects each of its noise suppression filters from the candidates in the filter model 1109. In other words, it selects, from the filter candidates included in the filter model 1109, the filter that decomposes the input signal into the most independent components.

The embodiments of the present invention have been described above using a mobile phone device, but the present invention can also be applied to devices that extract and use only the sound of a sound source in a specific direction, such as a video camera or a digital camera with a video camera function.

DESCRIPTION OF SYMBOLS
101 ... Microphone array, 102 ... AD converter, 103 ... Central processing unit, 104 ... Volatile memory, 105 ... Storage medium,
201 ... Mobile phone, 202 ... Microphone array,
301 ... Waveform acquisition unit, 302 ... Delay buffer unit, 303 ... Independent component analysis unit, 304 ... Correlation analysis unit, 305 ... Filter adaptation unit, 306 ... Filtering unit, 307 ... Reference signal generation unit, 308 ... Separated sound selection unit, 309 ... Transfer model,
401 ... Waveform acquisition unit, 402 ... Delay buffer unit, 403 ... Independent component analysis unit, 404 ... Correlation analysis unit, 405 ... Filter adaptation unit, 406 ... Filtering unit, 407 ... Reference signal generation unit, 408 ... Separated sound selection unit, 409 ... Transfer model, 410 ... Transfer model replacement unit,
701 ... Filter adaptation unit, 702 ... Discrete Fourier transform unit, 703 ... Sound source direction estimation unit, 704 ... Target sound adaptation unit, 705 ... Noise adaptation unit, 706 ... Filter adaptation unit,
801 ... Waveform acquisition unit, 802 ... Delay buffer unit, 803 ... Clustering unit, 804 ... Correlation analysis unit, 805 ... Filter adaptation unit, 806 ... Filtering unit, 807 ... Reference signal generation unit, 808 ... Separated sound selection unit, 809 ... Transfer model,
1001 ... Waveform acquisition unit, 1002 ... Delay buffer unit, 1003 ... Independent component analysis unit, 1004 ... Correlation analysis unit, 1005 ... Filter adaptation unit, 1006 ... Filtering unit, 1007 ... Reference signal generation unit, 1008 ... Separated sound selection unit, 1009 ... Filter model,
1101 ... Waveform acquisition unit, 1102 ... Delay buffer unit, 1103 ... Independent component analysis unit, 1104 ... Correlation analysis unit, 1105 ... Filter adaptation unit, 1106 ... Filtering unit, 1107 ... Reference signal generation unit, 1108 ... Separated sound selection unit, 1109 ... Filter model.

Claims

A sound source separation device that processes signals input from a microphone array comprising a plurality of microphones and extracts only the sound of a specific sound source, the device comprising:
a filter adaptation unit that adapts a noise suppression filter using a sound transfer model calculated from microphone arrangement information obtained in advance;
an independent component analysis unit that separates sounds using the statistical properties of the target sound and noise, without using the microphone arrangement information; and
a separated sound selection unit that selects, from among the output signal of the filter adaptation unit and the output signals of the independent component analysis unit, the signal having the highest temporal correlation with the sum of the powers of a plurality of frequency bands of the filter adaptation unit.

The sound source separation device according to claim 1, further comprising a correlation analysis unit that, in order to select an output signal from the plurality of separated signals of the independent component analysis unit, takes the temporal correlation between each separated signal and the sum of the powers of the plurality of frequency bands of the filter adaptation unit.

The sound source separation device according to claim 1, further comprising a transfer model replacement unit that rewrites the sound transfer model used by the filter adaptation unit, using the separation filter calculated by the independent component analysis unit.

The sound source separation device according to claim 1, comprising, in place of the independent component analysis unit, a clustering unit that performs clustering on the input signal.

The sound source separation device according to claim 1, further comprising a minimum power filtering unit, wherein
the filter adaptation unit creates separation filters corresponding to noise directions on the basis of a filter model, and
the minimum power filtering unit selects, from among the output signals obtained by applying the separation filters to the input signal, the one with the lowest power.

The sound source separation device according to claim 5, wherein the independent component analysis unit selects a filter from among the filter candidates included in the filter model.

A video camera device incorporating the sound source separation device according to any one of claims 1 to 6.

A camera-equipped mobile phone device incorporating the sound source separation device according to any one of claims 1 to 6.

A sound source separation method that processes signals input from a microphone array comprising a plurality of microphones and extracts only the sound of a specific sound source, the method comprising:
a step of obtaining an output signal by adapting a noise suppression filter using a sound transfer model calculated from microphone arrangement information obtained in advance;
a step of obtaining output signals by separating sounds using the statistical properties of the target sound and noise, without using the microphone arrangement information; and
a step of selecting, from among the output signal obtained by adapting the noise suppression filter using the sound transfer model and the output signals obtained by separating sounds using the statistical properties, the signal having the highest temporal correlation with the sum of the powers of a plurality of frequency bands obtained by adapting the noise suppression filter using the sound transfer model.

A program for causing a computer to execute a sound source separation method that processes signals input from a microphone array comprising a plurality of microphones and extracts only the sound of a specific sound source, the program causing the computer to execute:
a step of obtaining an output signal by adapting a noise suppression filter using a sound transfer model calculated from microphone arrangement information obtained in advance;
a step of obtaining output signals by separating sounds using the statistical properties of the target sound and noise, without using the microphone arrangement information; and
a step of selecting, from among the output signal obtained by adapting the noise suppression filter using the sound transfer model and the output signals obtained by separating sounds using the statistical properties, the signal having the highest temporal correlation with the sum of the powers of a plurality of frequency bands obtained by adapting the noise suppression filter using the sound transfer model.