JP2006072163A

JP2006072163A - Disturbing sound suppressing device

Info

Publication number: JP2006072163A
Application number: JP2004257836A
Authority: JP
Inventors: Masato Togami; 真人戸上; Akio Amano; 明雄天野
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-09-06
Filing date: 2004-09-06
Publication date: 2006-03-16

Abstract

PROBLEM TO BE SOLVED: To precisely extract only a target speech from signals of a plurality of channels in which different sounds are mixed with one another when sound is collected.

SOLUTION: Target sound extraction with higher precision than before is enabled by providing a user interface through which a person inputs the positions and frequency characteristics of a target sound and a disturbing sound to the system, and a sound source separation section that keeps the signal power after target sound extraction at each time from deviating from the signal power of the target sound.

COPYRIGHT: (C) 2006, JPO & NCIPI

Description

The present invention belongs to the field of sound source separation, which restores only a target sound from, for example, a mixture of speech, music, and various noises observed with a plurality of microphone elements.

Conventional techniques for emphasizing only a target sound with a plurality of microphone elements include the delay-and-sum array, which steers a beam toward the target sound, and the blind spot forming beamformer, which steers a blind spot (null) toward the direction of arrival of an interfering sound. However, the delay-and-sum array requires a very large number of microphone elements to separate the target sound accurately. The blind spot forming beamformer separates accurately when the number of interfering sounds is smaller than the number of microphone elements minus one, but it is well known that its accuracy deteriorates when the number of interfering sounds is equal to or greater than the number of microphone elements.
In view of this, a multi-channel spectral subtraction method has been proposed that uses two blind spot forming beamformers to suppress more interfering sounds than a single conventional blind spot forming beamformer can.

For example, there is a complementary multi-channel spectral subtraction method that uses a pair of blind spot forming beamformers in which, for each interfering sound, at least one of the two linear filters forms a blind spot (see Non-Patent Document 1). In this method, the expected value of the signal power after target sound extraction can be made equal to the expected value of the signal power of the target sound.
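As a concrete illustration of such a complementary pair, the following sketch designs two filters for a two-microphone array, each keeping unit gain toward the target while placing a blind spot on a different interferer. The far-field steering model, the 5 cm spacing, the 340 m/s sound speed, the example angles, and the function names `steering`, `response`, and `design_filter` are assumptions made for this example, not details taken from the cited method.

```python
import cmath
import math

def steering(angle_rad, freq_hz, spacing_m=0.05, speed_of_sound=340.0):
    """Far-field steering vector of a 2-microphone array (assumed model)."""
    delay = spacing_m * math.sin(angle_rad) / speed_of_sound
    return (1 + 0j, cmath.exp(-2j * math.pi * freq_hz * delay))

def response(w, a):
    """Filter response w^H a toward the direction with steering vector a."""
    return sum(wi.conjugate() * ai for wi, ai in zip(w, a))

def design_filter(target_rad, null_rad, freq_hz):
    """Solve w^H a_target = 1 and w^H a_null = 0 by Cramer's rule."""
    a_t = steering(target_rad, freq_hz)
    a_n = steering(null_rad, freq_hz)
    det = a_t[0] * a_n[1] - a_t[1] * a_n[0]
    if abs(det) < 1e-12:
        raise ValueError("target and null directions are indistinguishable")
    # u is the element-wise conjugate of the filter w.
    u0, u1 = a_n[1] / det, -a_n[0] / det
    return (u0.conjugate(), u1.conjugate())

# Two interferers, two microphones: g nulls the first, h nulls the second.
freq = 1000.0
target, interferer_1, interferer_2 = 0.0, math.radians(40), math.radians(-60)
g = design_filter(target, interferer_1, freq)
h = design_filter(target, interferer_2, freq)
```

With two microphones, each filter can place only one blind spot, yet every interferer is nulled by at least one filter of the pair, which is the complementarity condition described above.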

There is also a multi-channel spectral subtraction method that uses one beamformer that emphasizes the target sound and suppresses the interfering sounds, and another beamformer that suppresses the target sound and emphasizes the interfering sound with the highest power (see, for example, Patent Document 2).
There is also a technique that processes input image information to obtain the positions of a plurality of persons, lets the user select a specific person position from among them, and extracts only the speech from the selected position (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent Laid-Open No. 10-51889
Patent Document 2: JP 2003-271191 A
Non-Patent Document 1: H. Saruwatari, S. Kajita, K. Takeda, and F. Itakura, "Speech enhancement using nonlinear microphone array based on complementary beamforming," IEICE Trans. Fundamentals, Vol. E82-A, No. 8, pp. 1501-1510, 1999.

In the complementary multi-channel spectral subtraction method, the expected value of the signal power after target sound extraction can be matched to the expected value of the signal power of the target sound. At each individual time, however, the method does not guarantee that the two powers agree, so the signal power after target sound extraction deviates from the signal power of the target sound.
Furthermore, the target sound extraction performance deteriorates if, at the time of collecting the target sound, only the spatial information of the target sound and the interfering sounds is known and their frequency characteristics are not.

In the present invention, among all combinations of linear filter pairs in which, for each band and for each interfering sound, at least one of the linear filters forms a blind spot, the linear filter pair that minimizes the product of the expected values of the output powers of the two linear filters is used. The present invention further provides a user interface through which a person inputs the position information and frequency characteristics of the target sound and the interfering sounds to the system.

According to the configuration of the present invention, the deviation between the signal power after target sound extraction and the signal power of the target sound can be suppressed at each time. In addition, because the position information and frequency characteristics of the interfering sounds and the target sound can be given to the system through the user interface, the positions of the interfering sounds and the target sound are known without having to be estimated, which enables highly accurate target sound extraction.

Embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a basic configuration diagram of the speech processing apparatus of the present invention. An image of the surrounding scene captured by the camera 1 is sent to the camera image capturing unit 2 and displayed on the display device 3. The user looks at the displayed image, finds a sound source shown in it, and designates the position of the sound source using the external input device 4. The user also designates whether the sound source is a target sound or an interfering sound, or designates the type of the sound source. Because the designated position is a position on the screen, the input processing unit 6 converts it into spatial information in the actual environment. The input processing unit 6 also converts the type of the sound source into a frequency characteristic, using the per-type frequency characteristic information stored in the storage unit 7. The input processing unit 6 then stores in the storage unit 7 the spatial information and frequency characteristic of the sound source, together with a flag indicating whether the sound source is a target sound or an interfering sound. This storage processing is performed for every sound source selected by the user. The stored information is sent to the sound source separation unit 8.
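The role of the input processing unit 6 can be sketched as follows. The patent does not say how a screen position is mapped to spatial information or how a frequency characteristic is represented, so the pinhole-camera azimuth conversion, the pass-band representation, the numeric band values, and all names (`SourceRecord`, `SOURCE_TYPE_BANDS`, `pixel_to_azimuth`, `make_record`) are hypothetical choices for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical per-type frequency characteristics (pass band in Hz),
# standing in for the table held in the storage unit 7.
SOURCE_TYPE_BANDS = {
    "adult male voice": (80.0, 4000.0),
    "adult female voice": (150.0, 5000.0),
    "music": (40.0, 8000.0),
}

@dataclass
class SourceRecord:
    azimuth_rad: float                      # spatial information
    is_target: bool                         # target / interfering flag
    band_hz: Optional[Tuple[float, float]]  # frequency characteristic

def pixel_to_azimuth(x_px, image_width_px, horizontal_fov_rad):
    """Map a horizontal pixel position to an azimuth (assumed pinhole camera)."""
    focal_px = (image_width_px / 2) / math.tan(horizontal_fov_rad / 2)
    return math.atan((x_px - image_width_px / 2) / focal_px)

def make_record(x_px, image_width_px, horizontal_fov_rad, is_target, source_type=None):
    """Build the record stored for one user-selected sound source."""
    band = SOURCE_TYPE_BANDS.get(source_type)  # None when no type was given
    azimuth = pixel_to_azimuth(x_px, image_width_px, horizontal_fov_rad)
    return SourceRecord(azimuth, is_target, band)

# One record per source selected by the user.
fov = math.radians(60)
storage = [
    make_record(320, 640, fov, True, "adult male voice"),
    make_record(640, 640, fov, False, "music"),
]
```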

The signals of the microphone array 10 are sent to the band dividing unit 9, subjected to a short-time Fourier transform, and sent to the sound source separation unit 8 in band-divided form for each channel.
The sound source separation unit 8 separates the band-divided signals of each channel using the spatial information and frequency characteristic information of the target sound and the interfering sounds that the input processing unit 6 stored in the storage unit 7, and extracts and outputs the target sound.
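The band division step can be sketched with a plain short-time Fourier transform applied to each channel. The Hann window, the frame length, the hop size, and the function names `stft` and `band_split` are assumptions for this example; the patent only states that a short-time Fourier transform is applied per channel. A direct DFT is used to keep the sketch dependency-free, so it is only practical for short frames.

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Return a list of frames, each a list of complex bins 0..frame_len/2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames

def band_split(channels, frame_len=64, hop=32):
    """Apply the STFT to every microphone channel independently."""
    return [stft(ch, frame_len, hop) for ch in channels]

# Two-channel toy input: the same tone on both microphones.
tone = [math.cos(2 * math.pi * 4 * n / 64) for n in range(256)]
bands = band_split([tone, tone])
```

The result is indexed as `bands[channel][frame][bin]`, which is the per-channel, per-band form that the later filtering steps operate on.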

Next, the sound source separation unit 8 will be described in detail with reference to FIG. 2. The spatial information of the target sound and the interfering sounds, the flag indicating whether each sound source is a target sound or an interfering sound, and the frequency characteristic of each sound source are sent from the storage unit 7 to the linear filter candidate creation unit 8a. The linear filter candidate creation unit 8a treats a sound source flagged as target sound as the target sound and a sound source flagged as interfering sound as an interfering sound, and calculates the number of interfering sounds in each band from the frequency characteristic information of the interfering sounds. A sound source is counted as an interfering sound in bands where its frequency characteristic is equal to or higher than a frequency characteristic lower limit value, and is not counted in the other bands. The lower limit value may be defined in advance as a system constant or may be made freely configurable. Using the per-frequency interfering sound counts, the linear filter candidate creation unit 8a outputs, for each band, a plurality of linear filter pairs that maintain directivity toward the target sound while at least one linear filter of each pair forms a blind spot toward each interfering sound direction.
In bands where the frequency characteristic of the target sound is below the lower limit value, it outputs a linear filter pair whose output is 0. Creating these linear filters presupposes that the directions of the target sound and of each interfering sound are known.
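The per-band counting and candidate generation described above can be sketched as follows. The patent does not specify how candidate pairs are enumerated, so assigning each active interferer to exactly one of the two filters is an assumption of this example, as are the per-band level representation and the names `active_interferers` and `null_assignments`.

```python
from itertools import product

def active_interferers(levels, lower_limit):
    """List, for each band, the interferers whose level reaches the lower limit.

    levels[d][b] is the frequency characteristic of interferer d in band b.
    """
    n_bands = len(levels[0])
    return [
        [d for d, lv in enumerate(levels) if lv[b] >= lower_limit]
        for b in range(n_bands)
    ]

def null_assignments(interferers):
    """Yield (g_nulls, h_nulls) so that every interferer is nulled by g or h."""
    for choice in product("gh", repeat=len(interferers)):
        g_nulls = [d for d, c in zip(interferers, choice) if c == "g"]
        h_nulls = [d for d, c in zip(interferers, choice) if c == "h"]
        yield g_nulls, h_nulls

# Two interferers, two bands; interferer 0 is inactive in band 1.
levels = [[1.0, 0.1], [0.5, 0.5]]
per_band = active_interferers(levels, lower_limit=0.3)
candidates_band0 = list(null_assignments(per_band[0]))
```

Each assignment would then be turned into an actual filter pair by solving for filters with unit gain toward the target and blind spots toward the assigned interferers; that step depends on the array geometry and is omitted here.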

In the configuration of the present application, the information on the target sound direction and the interfering sound directions input by the user is retained, so the directions of the target sound and of each interfering sound are known. For each of the linear filter pairs output by the linear filter candidate creation unit 8a, the linear filter determination unit 8b calculates the product of each linear filter of the pair with the band-divided signal, calculates the power of each of these products, and then calculates the product of those powers. It outputs the linear filter pair for which this product is smallest. Using the linear filter pair output in this way makes the expected value of the squared difference between the signal power after target sound extraction and the signal power of the target sound smaller than in the conventional technique, so the target sound can be extracted with high accuracy. The reason why the present invention can suppress the deviation between the signal power after target sound extraction and the signal power of the target sound at each time is explained below.
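The selection performed by the linear filter determination unit 8b can be sketched for a single band as follows. Estimating each output power by averaging over the available frames is an assumption of this example, and the names `output_power`, `power_product`, and `select_pair` are hypothetical.

```python
def output_power(w, frames):
    """Mean of |w^H x|^2 over the frames of one band."""
    total = 0.0
    for x in frames:
        y = sum(wi.conjugate() * xi for wi, xi in zip(w, x))
        total += abs(y) ** 2
    return total / len(frames)

def power_product(pair, frames):
    """Product of the output powers of the two filters of a pair."""
    g, h = pair
    return output_power(g, frames) * output_power(h, frames)

def select_pair(candidates, frames):
    """Return the candidate pair with the smallest product of output powers."""
    return min(candidates, key=lambda pair: power_product(pair, frames))

# Toy band-divided input: two frames from a two-microphone array.
frames = [(1 + 0j, 1 + 0j), (1 + 0j, -1 + 0j)]
pair_a = ((1 + 0j, 0j), (0j, 1 + 0j))
pair_b = ((0.5 + 0j, 0.5 + 0j), (0.5 + 0j, -0.5 + 0j))
best = select_pair([pair_a, pair_b], frames)
```

In a full system this selection would run once per frequency band over the candidate pairs created for that band.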
Assuming that there are D directional sound sources, the sound propagation system up to the microphone array can be expressed as

[Equation 1]

Here, r_{d,i} is the distance from sound source d to microphone i, and τ_{d,i} is the time taken for the sound emitted from sound source d to reach microphone i.
Let Ω be the set of interfering sounds, d_0 the target sound, S_0(f) the target sound component, and N_d(f) the d-th interfering sound component. The output signals obtained by applying two linear filters g and h, both having directivity toward the target sound, to the input signal can be expressed as

[Equation 2]

In the multi-channel spectral subtraction method, the output signals of these two linear filters g and h are used in

[Equation 3]

to separate and extract only the target sound.
If the target sound and the interfering sounds are uncorrelated, the interfering sounds are uncorrelated with one another, and the target sound power is large compared with the noise power, then four times the expected value of the output signal power of the multi-channel spectral subtraction method becomes

[Equation 4]

Here, if

[Equation 5]

holds, then

[Equation 6]

follows. In other words, in the multi-channel spectral subtraction method, using a linear filter pair g, h in which, for each band and for each interfering sound, at least one linear filter forms a blind spot makes the expected value of the output signal power equal to that of the target signal power. That is, on average the power error is zero.
However, under the sole condition that the linear filter pair g, h forms blind spots in this way, an error expressed by

[Equation 7]

remains at each individual time. The time average of the square of this error is expressed by

[Equation 8]

Therefore, when g and h form a linear filter pair in which at least one linear filter forms a blind spot for each interfering sound and for which the product of the expected values of the output powers of the two linear filters is smallest, the error is zero on average and the error at each time is also suppressed, so the target sound can be separated with high accuracy.
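The equations of this derivation are not present in this text. As a hedged reconstruction, the following sketch writes out the standard complementary-beamforming argument with the symbols defined above. The steering vector a_d(f), the microphone count M, the filter responses G_d and H_d, the use of the conjugate transpose, and the 1/4 normalization are assumptions introduced here and may differ from the patent's own equations.

```latex
% Assumed propagation model from source d to microphones i = 1..M
a_d(f) = \left[ \tfrac{1}{r_{d,1}} e^{-j 2\pi f \tau_{d,1}}, \ldots,
                \tfrac{1}{r_{d,M}} e^{-j 2\pi f \tau_{d,M}} \right]^{T}

% Filter outputs, assuming unit response toward the target d_0
y_g = S_0 + \sum_{d \in \Omega} G_d N_d, \qquad
y_h = S_0 + \sum_{d \in \Omega} H_d N_d, \qquad
G_d = g^{H} a_d, \quad H_d = h^{H} a_d

% Spectral subtraction of the sum and difference signals
|\hat{S}|^2 = \tfrac{1}{4}\left( |y_g + y_h|^2 - |y_g - y_h|^2 \right)
            = \operatorname{Re}\left( y_g y_h^{*} \right)

% Expectation when all sources are mutually uncorrelated
4\,E\left[ |\hat{S}|^2 \right]
  = 4\,E\left[ |S_0|^2 \right]
  + 4 \sum_{d \in \Omega} \operatorname{Re}\left( G_d H_d^{*} \right) E\left[ |N_d|^2 \right]

% If G_d H_d^{*} = 0 for every d in \Omega
% (at least one filter of the pair nulls each interferer)
E\left[ |\hat{S}|^2 \right] = E\left[ |S_0|^2 \right]

% Per-frame error that remains even when the expectations agree
e = \operatorname{Re}\left( y_g y_h^{*} \right) - |S_0|^2
  = \operatorname{Re}\left( S_0^{*} (n_g + n_h) \right)
  + \operatorname{Re}\left( n_g n_h^{*} \right),
\qquad
n_g = \sum_{d \in \Omega} G_d N_d, \quad
n_h = \sum_{d \in \Omega} H_d N_d
```

Under these assumptions the error e has zero mean but is nonzero in individual frames, which is consistent with the text's motivation for additionally choosing the pair with the smallest product of output powers.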

The main signal creation unit 8c, the reference signal creation unit 8d, and the spectral subtraction unit 8e perform the processing of (Equation 3). The main signal creation unit 8c multiplies each linear filter of the linear filter pair output by the linear filter determination unit 8b with the band-divided signal and takes the sum of the products, thereby outputting a signal in which only the target sound is emphasized and the interfering sounds are suppressed. The reference signal creation unit 8d multiplies each linear filter of the same pair with the band-divided signal and takes the difference of the products, thereby outputting a signal in which only the target sound is emphasized and the interfering sounds are suppressed. The spectral subtraction unit 8e outputs a signal whose power is the square root of the difference between the power of the signal output by the main signal creation unit 8c and the power of the signal output by the reference signal creation unit 8d, and whose phase is the phase component of the signal output by the main signal creation unit 8c. Note that the subtraction unit alone may be implemented in a separate device, in which case the configuration of the present application can also be used as a filter pair determination device.
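For one time-frequency bin, the processing of units 8c to 8e can be sketched as follows. Flooring a negative power difference at zero and halving the resulting amplitude (so that identical filter outputs are passed through unchanged) are assumptions of this example; the text above states only that the square root of the power difference is combined with the phase of the main signal. The function name `spectral_subtraction` is hypothetical.

```python
import cmath
import math

def spectral_subtraction(y_g, y_h):
    """Combine the two filter outputs of one bin into the extracted bin.

    y_g and y_h are the complex outputs of filters g and h for this bin.
    """
    main = y_g + y_h        # main signal creation unit 8c: sum
    reference = y_g - y_h   # reference signal creation unit 8d: difference
    power_diff = abs(main) ** 2 - abs(reference) ** 2
    # Assumption: clip negative differences to zero before the square root.
    amplitude = math.sqrt(max(power_diff, 0.0)) / 2
    # The phase is taken from the main signal, as in the description.
    return cmath.rect(amplitude, cmath.phase(main))

# When both filters output the same value, that value is recovered.
same = spectral_subtraction(1 + 1j, 1 + 1j)
# When the outputs cancel in the main signal, nothing is extracted.
cancelled = spectral_subtraction(1 + 0j, -1 + 0j)
```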

Next, the processing flow of the user interface unit 5 will be described with reference to FIG. 3. The user finds sound sources in the image displayed on the display device 3 and designates their positions using the external input device 4. A message box for designating whether the designated sound source is a target sound or an interfering sound is then displayed on the display device 3, and the user makes the selection in that message box. A message box for designating the type of the designated sound source is also displayed on the display device 3. The type of sound source may distinguish human from natural sounds, for example "adult male voice", "adult female voice", "child voice", "music", "wind sound", or "water sound", or may identify an individual. Highly accurate target sound extraction is possible as long as at least one of these two items, the target/interfering designation and the type, is input together with the position information.

Of course, accuracy can be improved further if both pieces of information are input. To exchange input with the user interface unit 5, the user uses three windows displayed on the display device 3. FIG. 4 shows these windows. 3a is a camera image display unit in which the user designates a sound source, 3b is a target sound / interfering sound setting screen for designating whether the designated sound source is a target sound or an interfering sound, and 3c is a sound source type designation screen for designating the type of the designated sound source. Because each sound source is flagged as a target sound or an interfering sound through this interface, the system can know the direction of the target sound even when an interfering sound has more power than the target sound. Simulation results for the present application are shown in FIG. 5.

FIG. 5 shows the distortion of the target sound for each method, measured by the log-spectral distance, which corresponds to perceptual distortion. The labels male 1, male 2, female 1, and female 2 indicate the gender and number of the target speaker. Three conventional methods are shown: the delay-and-sum array, MVBF, and the method with the complementarity constraint only. The proposed method is the method used in the present invention. In terms of log-spectral distance, the method used in the present invention gives the smallest distortion for all speakers, which shows that it is highly effective.
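The patent does not give the formula used for the log-spectral distance in FIG. 5. A common single-frame definition, shown here as an assumption, is the root mean square of the per-bin difference in decibels between a reference power spectrum and an estimated one; the function name and the `floor` guard against taking the logarithm of zero are hypothetical.

```python
import math

def log_spectral_distance(ref_power, est_power, floor=1e-12):
    """RMS over bins of the dB difference between two power spectra."""
    diffs = [
        10 * math.log10(max(r, floor) / max(e, floor))
        for r, e in zip(ref_power, est_power)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Identical spectra give zero distance; a uniform 10x power gap gives 10 dB.
identical = log_spectral_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
ten_db = log_spectral_distance([10.0, 10.0], [1.0, 1.0])
```

For a whole utterance, the per-frame values would typically be averaged over frames.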
Although the above embodiment describes a device configuration, the present application may also be implemented as a program that is read into and executed by a computer.

FIG. 1 is a diagram showing an embodiment of the basic configuration of the present invention.
FIG. 2 is a block diagram showing details of an embodiment of the sound source separation unit of the present invention.
FIG. 3 is a flowchart showing the flow of an embodiment of the processing in the user interface unit of the present invention.
FIG. 4 is a diagram showing an example of the display device used in the present invention.
FIG. 5 is a diagram explaining the effect of the invention by simulation.

Explanation of symbols

1 ... camera, 2 ... camera image capturing unit, 3 ... display device, 3a ... camera image display unit, 3b ... target sound / interfering sound setting screen, 3c ... sound source type designation screen, 4 ... external input device, 5 ... user interface unit, 6 ... input processing unit, 7 ... storage unit, 8 ... sound source separation unit, 8a ... linear filter candidate creation unit, 8b ... linear filter determination unit, 8c ... main signal creation unit, 8d ... reference signal creation unit, 8e ... spectral subtraction unit, 9 ... band dividing unit, 10 ... microphone array.

Claims

A speech processing apparatus comprising:
a microphone array unit having microphone elements of at least two channels;
a band dividing unit that divides the signal output by the microphone array unit into a plurality of frequency bands for each channel; and
a sound source separation unit that creates a plurality of pairs of linear filters for each frequency band, calculates the product of the powers of the output signals obtained by applying each pair of linear filters to the output of the band dividing unit, and outputs, for each band, the one linear filter pair for which the value of the product is smallest.

The speech processing apparatus according to claim 1, wherein the sound source separation unit further comprises a spectral subtraction unit that extracts a target sound from the signal output by the microphone array unit, using the output linear filter pair and the signal output by the band dividing unit.

The speech processing apparatus according to claim 1 or 2, further comprising:
a display unit that displays image information;
a user interface unit that receives input of the position of a sound source present in the image displayed on the display unit and of at least one of whether the sound source is a target sound or an interfering sound and the type of the sound source; and
an input processing unit that converts the on-screen position of the sound source selected through the user interface unit into spatial information in the actual environment, converts the selected type of the sound source into a frequency characteristic of the sound source, and stores in a storage unit the spatial information of the sound source, a flag indicating whether the sound source is a target sound or an interfering sound, and the frequency characteristic of the sound source,
wherein the sound source separation unit outputs the one linear filter pair for each band using the stored information.

A speech processing apparatus comprising:
a display unit that displays image information;
an input unit that inputs speech;
a user interface unit that receives, on the image displayed on the display unit, input of the position of a sound source and of at least one of whether the sound source is a target sound or an interfering sound and the type of the sound source; and
an input processing unit that converts the on-screen position of the sound source selected through the user interface unit into spatial information in the actual environment,
wherein the input processing unit, upon receiving input of whether the sound source is a target sound or an interfering sound, stores a corresponding identification flag, and, upon receiving input of the type of the target sound, converts the selected type of sound source into a frequency characteristic of the sound source and stores it, and
wherein a target sound is extracted from the input speech using the stored information.