JP2015171030A

JP2015171030A - Sound collection device and program

Info

Publication number: JP2015171030A
Application number: JP2014045405A
Authority: JP
Inventors: 一浩片桐; Kazuhiro Katagiri
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2014-03-07
Filing date: 2014-03-07
Publication date: 2015-09-28
Anticipated expiration: 2034-03-07
Also published as: JP5648760B1

Abstract

PROBLEM TO BE SOLVED: To provide a sound collection device capable of extracting all sound sources at equal levels while maintaining sound quality, even when a plurality of sound sources are present within a target area. SOLUTION: If the power of a target area sound source differs between the beamformer outputs of the microphone arrays, each target area sound source component can be estimated by changing the value of a target area sound correction coefficient. The average power of each target area sound source included in the data after sound collection processing for the target area is calculated, and the power of each target area sound source is adjusted to the same level as a target power (e.g., the power of the target area sound source with the maximum power).

Description

The present invention relates to a sound collection device and a program, and can be applied, for example, where only the sound in a specific area is to be emphasized and the sound in other areas is to be suppressed.

A beamformer using a microphone array is a known technique for emphasizing sound arriving from a specific direction (voice or other acoustic signals; hereinafter, voice and acoustic signals may be collectively referred to as sound) and suppressing other sounds. A beamformer forms directivity and blind spots by using the time differences between the signals reaching the individual microphones (see Non-Patent Document 1 and Non-Patent Document 2).
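The idea of using arrival-time differences to form directivity can be illustrated with a delay-and-sum beamformer. This is only a minimal sketch: the text does not say which beamformer type is used, and the function name `delay_and_sum`, the integer sample delays, and the test signals are assumptions made for illustration.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its steering delay (in samples) and average."""
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    length = len(channels[0])
    output = [0.0] * length
    for signal, delay in zip(channels, delays):
        for n in range(length):
            src = n + delay  # advance the channel that hears the source later
            if 0 <= src < length:
                output[n] += signal[src]
    return [value / len(channels) for value in output]


# A unit pulse reaches microphone 0 at sample 2 and microphone 1 at sample 4.
mic0 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
steered = delay_and_sum([mic0, mic1], [0, 2])    # delays match the source direction
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # no steering applied
```

When the delays match the direction of the source, the pulses add coherently; without steering, they stay spread out at half amplitude.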

However, simply pointing the directivity of a beamformer at the area from which sound is to be collected (hereinafter referred to as the target area) has a drawback: if noise sources exist around the target area, the beamformer picks up not only the sound sources inside the target area (hereinafter referred to as target area sound) but also the sound sources outside it (hereinafter referred to as non-target area sound).

To address this problem, the present inventor has already proposed a method (hereinafter referred to as the first conventional method) in which a plurality of microphone arrays are used and their directivities are pointed at the target area from different directions so that they intersect there, thereby collecting the target area sound (specification and drawings of Japanese Patent Application No. 2012-217315). The first conventional method extracts the target area sound by processing the beamformer outputs of the microphone arrays together. The first conventional method is briefly described below.

FIG. 5(A) illustrates the directivities of two microphone arrays MA1 and MA2 pointed at the target area TAR. In this state, the beamformer output of each of the microphone arrays MA1 and MA2 contains not only the target area sound from sound sources in the target area TAR but also non-target area sound from sound sources in non-target areas lying in the same directivity direction. However, because the target area TAR lies within the directivity of both microphone arrays MA1 and MA2, the target area sound component is contained in each beamformer output in the same proportion and distribution, as shown in FIG. 5(B). The non-target area sound components, by contrast, differ from one beamformer output to another. From this characteristic, a component contained in common in all beamformer outputs can be presumed to be a component of the target area sound, and the first conventional method was built on this basis.

FIG. 6 is a block diagram showing, together with the arithmetic expressions, the schematic configuration of a sound collection device according to the first conventional method.

A beamformer output X_ma1(t) in the direction of the target area TAR is obtained by the first directivity forming unit 11 from the captured signals x_11(t) to x_1M(t) of the microphones constituting the microphone array MA1. Similarly, a beamformer output X_ma2(t) in the direction of the target area TAR is obtained by the second directivity forming unit 12 from the captured signals x_21(t) to x_2M(t) of the microphones constituting the microphone array MA2.

Spectrally subtracting the other beamformer output X_ma2 from one beamformer output X_ma1 cancels the target area sound component that the two outputs have in common. Because the non-target area sound components in the two beamformer outputs do not overlap, the non-target area sound component N_ma1 contained in the minuend-side beamformer output is extracted. Equation (1) is a calculation formula that broadly follows this idea.

The target area sound component Y_ma1 is then extracted by spectrally subtracting the non-target area sound component N_ma1 from the minuend-side beamformer output X_ma1 that contains it. Equation (2) is a calculation formula that broadly follows this idea. In equation (2), γ_ma1 is a constant-valued coefficient (a scalar) that determines how strongly the non-target area sound component is removed.

In equations (1) and (2), the beamformer outputs X_ma1 and X_ma2, the non-target area sound component N_ma1, and the target area sound component Y_ma1 are each written as vectors whose elements are the amplitude spectrum values at each frequency.
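Equations (1) and (2) can be written in a form consistent with the description above and with the numerical example given later for FIG. 10. The following is a reconstruction under those assumptions, not the original notation: α_ma1 is the power correction coefficient applied to the sub-array output and τ is the arrival-time difference, both of which are introduced in the paragraphs that follow.

```latex
\begin{align}
N_{ma1} &= X_{ma1}(t) - \alpha_{ma1}\, X_{ma2}(t-\tau) \tag{1} \\
Y_{ma1} &= X_{ma1}(t) - \gamma_{ma1}\, N_{ma1} \tag{2}
\end{align}
```

With the FIG. 10 values (source A at 6 in X_ma1 and 3 in X_ma2) and α_ma1 = 0.67, the first line leaves 6 − 0.67 × 3 ≈ 4.0 of source A in N_ma1, which matches the figure quoted in the text.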

For the target area sound Y_ma1 to be extracted by the two-stage spectral subtraction described above, the beamformer outputs X_ma1 and X_ma2 that undergo spectral subtraction must contain the target area sound at the same timing and with the same power. The propagation delay difference correction unit 13 in FIG. 6 aligns the timing of the beamformer outputs X_ma1 and X_ma2 used in the calculation of equation (1), and the power difference correction unit 14 equalizes the power of the target area sound in the beamformer outputs X_ma1 and X_ma2. This allows the non-target area sound extraction spectrum subtraction unit 15 to execute the calculation of equation (1) and the target area sound extraction spectrum subtraction unit 16 to execute the calculation of equation (2).
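The two subtraction stages can be sketched as follows, assuming the beamformer outputs are already time-aligned amplitude spectra held as plain lists. The function names, the clamping of negative results to a floor, and the sample spectra are illustrative assumptions rather than details taken from the patent.

```python
def spectral_subtract(minuend, subtrahend, weight=1.0, floor=0.0):
    """Per-frequency amplitude subtraction; results below the floor are clamped to it."""
    return [max(a - weight * b, floor) for a, b in zip(minuend, subtrahend)]


def extract_target_area_sound(x_main, x_sub, alpha, gamma=1.0):
    """Two-stage extraction on time-aligned amplitude spectra of the main and sub arrays."""
    # Stage 1: remove what both beamformer outputs share, leaving non-target sound.
    non_target = spectral_subtract(x_main, x_sub, weight=alpha)
    # Stage 2: remove that non-target estimate from the main beamformer output.
    target = spectral_subtract(x_main, non_target, weight=gamma)
    return non_target, target


# Bins 0-1 hold target area sound common to both arrays; bin 2 is heard only
# in the main array's beam, bin 3 only in the sub array's beam.
x_ma1 = [4.0, 2.0, 3.0, 0.0]
x_ma2 = [4.0, 2.0, 0.0, 5.0]
n_ma1, y_ma1 = extract_target_area_sound(x_ma1, x_ma2, alpha=1.0)
```

In this toy case the first stage isolates the main-side noise in bin 2, and the second stage removes it while leaving the common bins untouched.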

If the positions of the microphone arrays MA1 and MA2 and of the target area TAR (for example, the center of the area) are known, the time difference τ with which the target area sound reaches the microphone arrays MA1 and MA2 can be corrected by calculating the propagation delay in advance. Position information alone, however, is not enough to correct the power of the target area sound between the beamformer outputs X_ma1 and X_ma2. This is because the target area sound component is unknown and, in addition, human speech is directional, so the power changes every time a speaker in the target area TAR changes orientation.
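A minimal sketch of precomputing the time difference from known positions is given below. The speed of sound, the two-dimensional coordinates, and the sign convention (positive when the sound reaches MA1 later than MA2, so that the MA2 output is the one delayed) are assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed value for air at about 20 degrees C


def arrival_time_difference(array1_pos, array2_pos, target_center):
    """Seconds by which sound from the target centre reaches array 1 later than array 2."""
    d1 = math.dist(array1_pos, target_center)
    d2 = math.dist(array2_pos, target_center)
    return (d1 - d2) / SPEED_OF_SOUND_M_PER_S


def delay_in_samples(tau_seconds, sample_rate_hz):
    """Nearest whole number of samples corresponding to the time difference."""
    return round(tau_seconds * sample_rate_hz)


# Hypothetical layout in metres: MA1 is twice as far from the target centre as MA2.
tau = arrival_time_difference((6.86, 0.0), (0.0, 3.43), (0.0, 0.0))
samples = delay_in_samples(tau, 16000)
```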

The first conventional method therefore calculates the power correction coefficient (a scalar) α_ma1 for the target area sound from the ratio of the amplitude spectra of the beamformer outputs X_ma1 and X_ma2. The calculation is as follows.

According to equation (3), the ratio of the amplitude spectra of the time-aligned beamformer outputs X_ma1(t) and X_ma2(t−τ) is obtained for each frequency, and the mode α_ma1 of the ratios is calculated. In equation (3), mode(A(k)) denotes the most frequently occurring value (the mode) among the values of the function A(k), which vary with the variable k. The parameter k denotes frequency, and M and N are the lower and upper frequency limits of the sound collection band, respectively. X_ma1k(t) is the amplitude spectrum of the beamformer output X_ma1(t) at frequency k, and X_ma2k(t−τ) is the amplitude spectrum of the beamformer output X_ma2(t−τ) at frequency k. As described above, the target area sound component is contained in both beamformer outputs X_ma1(t) and X_ma2(t−τ) in the same proportion and distribution, so the ratio is the same at every frequency occupied by the target area sound. The non-target area sound components, in contrast, are distributed differently in X_ma1(t) and X_ma2(t−τ), so their ratios are scattered. Consequently, once the ratio has been obtained at every frequency, the mode of the ratios directly gives the coefficient α_ma1(t) that equalizes the power of the target area sound across the beamformer outputs.
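The mode-of-ratios calculation described for equation (3) can be sketched as below. Since real-valued ratios rarely repeat exactly, the sketch counts them in histogram bins; the bin width of 0.1 and the function name are assumptions, and the input lists are taken to be already restricted to the band from M to N.

```python
from collections import Counter


def power_correction_coefficient(x_main, x_sub, bin_width=0.1):
    """Most frequent per-frequency amplitude ratio x_main[k] / x_sub[k]."""
    ratios = [a / b for a, b in zip(x_main, x_sub) if b > 0.0]
    # Count the ratios in histogram bins and pick the fullest bin.
    counts = Counter(round(r / bin_width) for r in ratios)
    most_common_bin, _ = counts.most_common(1)[0]
    return most_common_bin * bin_width


# Three bins carry target area sound at a ratio of 0.5; two bins carry noise.
x_ma1 = [2.0, 4.0, 1.0, 3.0, 6.0]
x_ma2 = [4.0, 8.0, 2.0, 1.0, 4.0]
alpha_ma1 = power_correction_coefficient(x_ma1, x_ma2)
```

The scattered noise ratios (3.0 and 1.5 here) each fall in their own bin, so the shared target ratio dominates the count.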

FIG. 7 is an explanatory diagram showing, as histograms, the per-frequency amplitude spectrum ratios between the beamformer outputs. FIG. 7(A) shows the case where the microphone arrays MA1 and MA2 are placed at equal distances from the target area TAR. Because the distances from the target area TAR are the same, the powers of the incoming target area sound are almost equal, and the mode of the ratio is close to 1. FIG. 7(B) shows the case where the microphone array MA2 is closer to the target area TAR than the microphone array MA1. Because the target area sound has greater power at the microphone array MA2, which is closer to the target area TAR, the mode of the ratio is less than 1. After the calculated power correction coefficient has been used to equalize the power of the target area sound contained in the beamformer outputs X_ma1(t) and X_ma2(t−τ), the target area sound can be extracted by the technique described above.

Equations (1) to (3) above describe processing with the microphone array MA1 as the main array and the microphone array MA2 as the sub array, but the target area sound can be collected in the same way with the roles of MA1 and MA2 reversed.

Thus, with the first conventional method, only the target area sound can be collected even if non-target area sound sources exist around the target area TAR.

However, when a plurality of sound sources are present in the target area TAR, the power of each sound source picked up by the microphone arrays MA1 and MA2 may vary. For example, as shown in FIG. 8, directional sound sources SA and SB may be present in the target area TAR such that both face in a direction 90 degrees from the microphone array MA1 (one 90 degrees clockwise, the other 90 degrees counterclockwise), while with respect to the microphone array MA2 the sound source SA faces away and the sound source SB faces toward it. In this case, if the microphone arrays MA1 and MA2 are at equal distances from the target area TAR, the power of the sound source SA picked up by MA1 is greater than that picked up by MA2. Conversely, the power of the sound source SB is greater at MA2 than at MA1. When the ratio between the beamformer outputs is then calculated, the ratios for the sound sources SA and SB differ, and the histogram of the ratios is not unimodal but, as shown in FIG. 9, multimodal with a plurality of local maxima (hereinafter referred to as peak values). Because the first conventional method uses the mode of the ratio as the power correction coefficient, the power correction is insufficient for some sound sources, and part of the target area sound component may be suppressed when the target area sound is extracted.

The applicant has therefore proposed a method (hereinafter referred to as the second conventional method) in which, after the frequency of occurrence of the ratios between the beamformer outputs has been obtained, only the peak values whose frequency is at or above a certain threshold are detected, and the largest of those peak values is used as the power correction coefficient (specification and drawings of Japanese Patent Application No. 2013-151893).
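The peak selection of the second conventional method can be sketched as follows. The bin width, the local-maximum test against neighbouring bins, and the `min_count` threshold value are assumptions for illustration; the text only states that peaks with at least a certain frequency are detected and the largest is used.

```python
from collections import Counter


def ratio_histogram(x_main, x_sub, bin_width=0.1):
    """Count per-frequency amplitude ratios in bins of the given width."""
    return Counter(round((a / b) / bin_width) for a, b in zip(x_main, x_sub) if b > 0.0)


def peak_ratios(histogram, min_count, bin_width=0.1):
    """Ratios whose bin is a local maximum and occurs at least min_count times."""
    peaks = []
    for bin_index, count in histogram.items():
        is_local_max = (count >= histogram.get(bin_index - 1, 0)
                        and count >= histogram.get(bin_index + 1, 0))
        if count >= min_count and is_local_max:
            peaks.append(bin_index * bin_width)
    return sorted(peaks)


# Three bins at ratio 6/3, two bins at ratio 6/9, and one stray noise bin.
x_ma1 = [6.0, 6.0, 6.0, 6.0, 6.0, 1.0]
x_ma2 = [3.0, 3.0, 3.0, 9.0, 9.0, 4.0]
peaks = peak_ratios(ratio_histogram(x_ma1, x_ma2), min_count=2)
alpha_ma1 = max(peaks)  # the largest qualifying peak becomes the coefficient
```

The stray bin occurs only once and is rejected by the threshold, leaving two peaks from which the larger is chosen.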

FIG. 10 is a table showing an example of the relationship between the power correction coefficient for each target area sound source and the power of the extracted noise when a plurality of sound sources are present in the target area. Following the conditions under which the first conventional method runs into difficulty, suppose for example that the powers of the target area sound sources A and B contained in the beamformer output X_ma1 of the microphone array MA1 are both 6, while in the beamformer output X_ma2 of the microphone array MA2 the power of sound source A is 3 and that of sound source B is 9. First, consider area sound collection with the microphone array MA1 as main and the microphone array MA2 as sub. When the amplitude spectrum ratio is obtained for each frequency and its frequency of occurrence is plotted as a histogram, peak values appear at two places, 0.67 and 2.

If the mode is used as the power correction coefficient, as in the first conventional method, it cannot be predicted which peak will be selected, because the frequencies of the peak values 0.67 and 2 vary with the situation. Suppose the peak value 0.67 is selected as the power correction coefficient α_ma1. The non-target area sound N_ma1 extracted by equation (1) then contains sound source A with a power of 4.0. In other words, because the power correction for sound source A is insufficient, the component of sound source A remains after the spectral subtraction. If the target area sound is extracted according to equation (2) in this state, the component of sound source A is subtracted and thus suppressed. Conversely, suppose the peak value 2.0 is selected as the power correction coefficient α_ma1. The power of sound source B contained in the non-target area sound N_ma1 extracted according to equation (1) is then −12. In spectral subtraction, however, the result never falls below 0: a component that becomes negative is replaced with 0 or set to a value close to 0 by flooring. The extracted non-target area sound N_ma1 therefore contains neither sound source A nor sound source B, and in the subsequent processing the target area sound is extracted without either source being suppressed. Similarly, with the microphone array MA2 as main and the microphone array MA1 as sub, sound source B is suppressed if the correction coefficient α_ma2 is set to 0.5, whereas with α_ma2 set to 1.5 neither sound source A nor B is suppressed and both can be extracted as target area sound.
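The arithmetic of this example can be checked with a short script. It assumes that the first subtraction has the form main − α × sub, which is what the quoted values 4.0 and −12 imply; the function and variable names are illustrative.

```python
def residual_in_non_target(main_power, sub_power, alpha):
    """Raw and floored remainder of one source after the first subtraction."""
    raw = main_power - alpha * sub_power
    return raw, max(raw, 0.0)


ma1 = {"A": 6.0, "B": 6.0}  # powers in the MA1 beamformer output
ma2 = {"A": 3.0, "B": 9.0}  # powers in the MA2 beamformer output

# MA1 as main: the candidate coefficients are the ratios 6/9 and 6/3.
for alpha in (6.0 / 9.0, 6.0 / 3.0):
    for name in ("A", "B"):
        raw, floored = residual_in_non_target(ma1[name], ma2[name], alpha)
        print(f"MA1 main, alpha={alpha:.2f}, source {name}: raw={raw:.1f}, floored={floored:.1f}")

# MA2 as main: the candidate coefficients are the ratios 3/6 and 9/6.
for alpha in (3.0 / 6.0, 9.0 / 6.0):
    for name in ("A", "B"):
        raw, floored = residual_in_non_target(ma2[name], ma1[name], alpha)
        print(f"MA2 main, alpha={alpha:.2f}, source {name}: raw={raw:.1f}, floored={floored:.1f}")
```

In both configurations only the larger coefficient leaves no positive remainder of either source in the non-target estimate, which is why the second conventional method selects the largest peak.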

Non-Patent Document 1: Futoshi Asano (浅野太), "Array Signal Processing of Sound: Localization, Tracking and Separation of Sound Sources" (音のアレイ信号処理 −音源の定位・追跡と分離), The Acoustical Society of Japan, Corona Publishing, February 25, 2011.
Non-Patent Document 2: Takashi Yazu, Makoto Morito, Kei Yamada, and Tetsuji Ogawa (矢頭隆、森戸誠、山田圭、小川哲司), "Sound source separation technology using a square microphone array (Special feature: Efforts toward practical use of speech recognition technology)", Information Processing Society of Japan, Joho Shori (情報処理) 51(11), pp. 1410-1416, 2010.

With the second conventional method, all target area sound sources can be extracted without suppression even when a plurality of sound sources are present in the target area. This means that, after area sound collection, the (volume) level of each target area sound source remains at the level it originally had in the beamformer output. Consequently, even with the sound source arrangement shown in FIG. 8, if area sound collection by the second conventional method is performed with the microphone array MA2 as main, sound source A, which faces away from the microphone array MA2, is output at a lower level than sound source B, and the conversation may be difficult to follow. The second conventional method also proposes adjusting the sound source levels by setting the target area sound power correction coefficient to a value between the minimum and maximum peak values of the ratio frequency.

However, because this method uses spectral subtraction to adjust the level, the sound quality of a sound source may deteriorate after adjustment. Moreover, although the level of one sound source can be reduced by a certain amount, it cannot be increased, so the situations in which all target area sound sources can be brought to the same level are limited.

There is therefore a need for a sound collection device and a program that can bring all sound sources to the same level while maintaining sound quality when a plurality of sound sources are present in the target area.

The sound collection device of the first aspect of the present invention comprises: (1) a plurality of microphone arrays; (2) directivity forming means for forming directivity toward a target area by applying a beamformer to the input from each of the microphone arrays; (3) target area sound power correction coefficient calculating means for calculating, for each frequency, the ratio of the amplitude spectra between the beamformer outputs of the microphone arrays, calculating the frequency of occurrence of the ratios, detecting the peak values of that frequency, and taking the peak values having at least a certain frequency as correction coefficients; (4) target area sound extracting means for correcting the beamformer outputs of the microphone arrays using the maximum of the correction coefficients calculated by the target area sound power correction coefficient calculating means, extracting the non-target area sound present in the direction of the target area by spectral subtraction between the corrected outputs, and extracting the target area sound by spectrally subtracting the extracted non-target area sound from the beamformer output of each microphone array; (5) target area sound component estimating means for correcting the beamformer outputs and performing spectral subtraction with the correction coefficients calculated by the target area sound power correction coefficient calculating means taken in descending order, thereby extracting target area mixed components each containing a non-target area sound component and one or more target area sound components, and estimating each target area sound component on the basis of the differences between target area mixed components containing different target area sound components; and (6) target area sound level adjusting means for obtaining power information on each target area sound component estimated by the target area sound component estimating means and adjusting the target area sound output for the plurality of sound sources, output from the target area sound extracting means, so that the power levels become uniform.

The sound collection program of the second aspect of the present invention causes a computer mounted in a sound collection device having a plurality of microphone arrays to function as: (1) directivity forming means for forming directivity toward a target area by applying a beamformer to the input from each of the microphone arrays; (2) target area sound power correction coefficient calculating means for calculating, for each frequency, the ratio of the amplitude spectra between the beamformer outputs of the microphone arrays, calculating the frequency of occurrence of the ratios, detecting the peak values of that frequency, and taking the peak values having at least a certain frequency as correction coefficients; (3) target area sound extracting means for correcting the beamformer outputs of the microphone arrays using the maximum of the correction coefficients calculated by the target area sound power correction coefficient calculating means, extracting the non-target area sound present in the direction of the target area by spectral subtraction between the corrected outputs, and extracting the target area sound by spectrally subtracting the extracted non-target area sound from the beamformer output of each microphone array; (4) target area sound component estimating means for correcting the beamformer outputs and performing spectral subtraction with the correction coefficients calculated by the target area sound power correction coefficient calculating means taken in descending order, thereby extracting target area mixed components each containing a non-target area sound component and one or more target area sound components, and estimating each target area sound component on the basis of the differences between target area mixed components containing different target area sound components; and (5) target area sound level adjusting means for obtaining power information on each target area sound component estimated by the target area sound component estimating means and adjusting the target area sound output for the plurality of sound sources, output from the target area sound extracting means, so that the power levels become uniform.

According to the present invention, when a plurality of sound sources are present in the target area, all sound sources can be brought to the same level while maintaining sound quality.

FIG. 1 is a block diagram showing the configuration of the sound collection device according to the first embodiment.
FIG. 2 is a flowchart showing the procedure for calculating the target area sound power correction coefficient and for estimating the target area sound components in the sound collection device according to the first embodiment.
FIG. 3 is an explanatory diagram showing how the power of each target area sound source changes with the power correction coefficient value when area sound collection processing is performed with the microphone array MA2 as main.
FIG. 4 is an explanatory diagram showing how the power of each target area sound source changes with the power correction coefficient value when area sound collection processing is performed with the microphone array MA1 as main.
FIG. 5 is an explanatory diagram showing two microphone arrays with their directivities pointed at the target area from different locations, together with the spectra in that state.
FIG. 6 is a block diagram showing the configuration of a sound collection device according to the first conventional method.
FIG. 7 is an explanatory diagram showing, as histograms, the per-frequency amplitude spectrum ratios between the beamformer outputs of the microphone arrays.
FIG. 8 is an explanatory diagram showing an example arrangement of two microphone arrays for which the first conventional method has a problem.
FIG. 9 is an explanatory diagram showing, as a histogram, the per-frequency amplitude spectrum ratios between the beamformer outputs when a plurality of sound sources are present in the target area.
FIG. 10 is a table showing an example of the relationship between the power correction coefficient for each target area sound source and the power of the extracted non-target area sound when two sound sources are present in the target area.

(A) First Embodiment
A first embodiment of the sound collection device and program according to the present invention is described below with reference to the drawings.

(A-1) Technical Concept of the First Embodiment
Before the configuration and operation of the sound collection device of the first embodiment are described, the technical concept it applies is explained.

In the first embodiment, the component of each target area sound source is estimated, the average power of each target area sound source contained in the data after area sound collection processing is obtained, and the power of the other target area sound sources is corrected so that they reach the same level as the target area sound source with the greatest power.
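The level adjustment outlined here can be sketched as follows. The sketch assumes that each frequency bin of the collected output has already been attributed to exactly one source and that the correction is applied as a per-source amplitude gain; neither detail is stated in this passage, and the names `equalize_levels` and `source_bins` are hypothetical.

```python
def mean_power(amplitudes):
    """Average of squared amplitudes over the bins attributed to one source."""
    return sum(a * a for a in amplitudes) / len(amplitudes)


def equalize_levels(output, source_bins):
    """Scale each source's bins so its mean power matches the strongest source."""
    powers = {name: mean_power([output[k] for k in bins])
              for name, bins in source_bins.items()}
    target = max(powers.values())
    adjusted = list(output)
    for name, bins in source_bins.items():
        if powers[name] == 0.0:
            continue  # nothing to amplify
        gain = (target / powers[name]) ** 0.5  # amplitude gain for a power ratio
        for k in bins:
            adjusted[k] = output[k] * gain
    return adjusted


# Source A occupies bins 0-1 at low level; source B occupies bins 2-3.
collected = [1.0, 2.0, 4.0, 8.0]
adjusted = equalize_levels(collected, {"A": [0, 1], "B": [2, 3]})
```

Because the weaker source is amplified by a gain rather than the stronger one being reduced by subtraction, the spectral shape of each source is left unchanged.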

Normally, the data after area sound collection still contains a mixture of a plurality of target area sound sources, and it is difficult to estimate which frequency belongs to which target area sound source. However, when the power of a target area sound source differs between the beamformer outputs of the microphone arrays, each target area sound source component can be estimated by changing the value of the target area sound correction coefficient. FIG. 3 shows how the power of each sound source changes as the value of the target area sound correction coefficient is varied during area sound collection processing with the microphone array MA2 as main, for the case where the target area sound components in the beamformer outputs have the values shown in FIG. 10. The correction coefficient value at which the power of a sound source stops changing is the correction coefficient value corresponding to that sound source (a peak value of the ratio frequency). When, as in the second conventional method, area sound collection processing is performed with the value P2, at which the power of the target area sound source B stops changing, as the correction coefficient, the powers of both target area sound sources A and B contained in the beamformer outputs are equalized, so the output of equation (1) consists only of the non-target area sound component N_0. In contrast, when the value P1, at which the power of the target area sound source A stops changing, is used as the correction coefficient, the power of the target area sound source A contained in each beamformer output is equal, so the output of equation (1) does not contain the component S_A of the target area sound source A. For the target area sound source B, however, the power correction is insufficient, and its component remains in the output of equation (1). In other words, when the correction coefficient value is P1, the output N_S1 (= N_ma1) of equation (1) contains the non-target area sound component N_0 and the component S_B of the target area sound source B. If area sound collection processing is performed in this state, the component S_B of the target area sound source B is removed by spectral subtraction, which degrades the sound quality.

In the first embodiment, the target area sound source components are estimated using the output N_SB, in which the non-target area sound and the target area sound source B are mixed and which would ordinarily be unsuitable for area sound collection processing. Because the non-target area sound component is contained in both N_0 and N_SB, performing spectral subtraction according to equation (4) extracts only the component S_B of the target area sound source B. Here, α_0 is the strength coefficient of the spectral subtraction; in a normal sound collection environment, 1.0 can be applied, for example.

Thereafter, performing spectral subtraction again according to equation (5) extracts the component S_A of the other target area sound source A. Here, α_1 is the strength coefficient of the spectral subtraction and is calculated according to equation (6). In equation (6), as in equation (3), the power ratio between the beamformer output X_ma2 and the component S_B of the target area sound source B is obtained, the ratios are compiled into a histogram, and the mode of the histogram is calculated.
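The two-stage subtraction described for equations (4) to (6) can be illustrated with the following minimal sketch. Equations (4) to (6) themselves are not reproduced in this text, so the sketch only follows the prose. The function names, the per-bin power lists, the flooring at zero, the histogram bin width, and the unit weight on N_0 in the second stage are all assumptions made for illustration.

```python
import math
from collections import Counter


def spectral_subtract(x, y, weight=1.0):
    """Per-bin spectral subtraction x - weight * y, floored at zero (assumed floor)."""
    return [max(a - weight * b, 0.0) for a, b in zip(x, y)]


def mode_of_ratios(numerator, denominator, bin_width=0.25, eps=1e-12):
    """Histogram the per-bin ratios and return the centre of the fullest bin."""
    counts = Counter()
    for a, b in zip(numerator, denominator):
        if b > eps:  # skip bins where the denominator is (almost) zero
            counts[math.floor((a / b) / bin_width)] += 1
    if not counts:
        return 1.0  # assumed fallback when no ratio can be formed
    fullest = max(counts, key=counts.get)
    return (fullest + 0.5) * bin_width


def estimate_two_sources(x_main, n_sb, n_0, alpha_0=1.0):
    """Estimate S_B, then S_A, from per-bin power spectra (hypothetical helper)."""
    s_b = spectral_subtract(n_sb, n_0, alpha_0)         # in the spirit of eq. (4)
    alpha_1 = mode_of_ratios(x_main, s_b)               # in the spirit of eq. (6)
    residual = spectral_subtract(x_main, s_b, alpha_1)  # remove the scaled S_B
    s_a = spectral_subtract(residual, n_0)              # in the spirit of eq. (5)
    return s_b, s_a, alpha_1
```

Taking the mode of the ratio histogram, rather than the mean, keeps α_1 from being pulled away by the bins in which the other sound source dominates.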

The components of the target area sound sources may be output as produced by the spectral subtraction, or they may be output after the following operations.

For example, using a technique such as a binary mask, the power of the component S_A of the target area sound source A and the power of the component S_B of the target area sound source B may be compared at each frequency, and the smaller component set to 0. After the components S_A and S_B are estimated, the ratio of their average powers P_SA and P_SB at the correction coefficient value P2, which corresponds to the point at which the power of the target area sound source B stops changing, is obtained as shown in equation (7). Multiplying only the component S_A of the target area sound source A by this ratio β makes the power of the component S_A equal to that of the component S_B.

β = P_SB / P_SA ... (7)

However, when the value of the ratio β is close to 1.0 (within a set threshold range), the level correction may be omitted. This allows the device to operate without problems even when, conversely to the case described above, area sound collection processing is performed with the microphone array MA1 as the main array (that is, even in a situation where there is almost no volume difference between the target area sound sources A and B, as shown in FIG. 4). FIG. 4 is an explanatory diagram showing the change in the power of each target area sound source with respect to the power correction coefficient value when area sound collection processing is performed with the microphone array MA1 as the main array, and corresponds to FIG. 3, in which the microphone array MA2 is the main array.
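The level correction of equation (7), including the option of skipping it when β is close to 1.0, might be sketched as follows. The function names and the tolerance value of 0.1 are hypothetical, and the components are assumed to be lists of per-bin power values, so that scaling by β equalizes the average power directly.

```python
def mean_power(component):
    """Average of the per-bin power values of one sound source component."""
    return sum(component) / len(component)


def equalize_level(s_a, s_b, tolerance=0.1):
    """Scale S_A by beta = P_SB / P_SA unless beta is within 1.0 +/- tolerance.

    Assumes S_A has non-zero average power; the tolerance is an illustrative
    stand-in for the threshold range mentioned in the text.
    """
    beta = mean_power(s_b) / mean_power(s_a)  # eq. (7)
    if abs(beta - 1.0) <= tolerance:
        return list(s_a), beta                # levels already similar: no correction
    return [beta * value for value in s_a], beta
```

If the components were held as amplitude spectra instead of power spectra, the multiplier would have to be the square root of β to equalize power.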

(A-2) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the configuration of the sound collection device according to the first embodiment. The portion shown in FIG. 1 other than the microphone arrays may be constructed in hardware by connecting various circuits, or may be constructed so that a general-purpose device or unit having a CPU, ROM, RAM, and the like realizes the corresponding functions by executing a predetermined program. Whichever construction method is adopted, the device can be represented functionally by FIG. 1.

In FIG. 1, the sound collection device 20 according to the first embodiment includes a plurality of microphone arrays MA1 and MA2 (FIG. 1 shows the case of two), a data input unit 21, a directivity forming unit 22, a propagation delay difference correction unit 23, a spatial coordinate data holding unit 24, a target area sound power correction coefficient calculation unit 25, a target area sound extraction unit 26, a target area sound component estimation unit 27, and a target area sound level adjustment unit 28.

The first microphone array MA1 is placed, in the space where the target area (hereinafter denoted TAR; see FIG. 5) exists, at a location from which it can be directed at the target area TAR. As shown in FIG. 6, the first microphone array MA1 consists of M (M ≥ 2) microphones a_11, a_12, ..., a_1M, and each of these microphones picks up (captures) sound and inputs the acoustic signals x_11, x_12, ..., x_1M to the main body of the sound collection device 20.

The second microphone array MA2 is placed at a location different from that of the first microphone array MA1, from which it can also be directed at the target area TAR, and has the same configuration as the first microphone array MA1. The acoustic signals x_21, x_22, ..., x_2M are input from the microphones a_21, a_22, ..., a_2M constituting the second microphone array MA2.

In FIG. 1, the first and second microphone arrays MA1 and MA2 are drawn as if arranged side by side on a straight line, but this is only for convenience of illustration. In a practical arrangement, the angle between the direction in which the first microphone array MA1 (the plane in which its microphones are arranged) faces the target area TAR and the direction in which the second microphone array MA2 faces the target area TAR is preferably of a certain size (for example, 45 degrees or more) (see FIG. 5 described above).

The M microphones constituting the first or second microphone array MA1, MA2 may be arranged in any way that allows a beamformer to be executed, for example in a single horizontal row, a single vertical row, a cross shape, or a lattice shape.

The data input unit 21 converts the acoustic signals picked up by the microphone arrays MA1 and MA2 from analog signals to digital signals (data). The data input unit is omitted from FIG. 6 described above.

The directivity forming unit 22 applies a beamformer to the outputs (digital signals) from the microphone arrays MA1 and MA2 to form a directional beam toward the target area, and obtains the beamformer outputs X_ma1(t) and X_ma2(t) for the microphone arrays MA1 and MA2. Various techniques can be used as the beamformer method, such as the additive delay-and-sum method or the subtractive spectral subtraction method. The strength of the directivity may also be changed according to the extent of the target area TAR. The directivity forming unit 22 corresponds to the first and second directivity forming units 11 and 12 in FIG. 6 described above.

The spatial coordinate data holding unit 24 holds position information on (the center of) the target area TAR and position information on each of the microphone arrays MA1 and MA2.

The propagation delay difference correction unit 23 calculates the difference in propagation delay time caused by the difference in distance between the target area TAR and each of the microphone arrays MA1 and MA2, and corrects at least one of the beamformer outputs X_ma1(t) and X_ma2(t) so as to absorb that difference. A specific example procedure is as follows, described so that it also applies when there are three or more microphone arrays. First, the position of the target area TAR and the position of each microphone array are acquired from the spatial coordinate data holding unit 24, and the difference in the arrival time (propagation delay time) of the target area sound at each microphone array is calculated. Then, taking as the reference the timing at which the target area sound reaches the microphone array located farthest from the target area TAR, a delay is added to the beamformer outputs of all microphone arrays other than the reference array so that the target area sound reaches all microphone arrays simultaneously.
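The delay alignment procedure above can be sketched as follows, under several assumptions that are not part of the description: positions are Cartesian coordinates in meters, the speed of sound is fixed at 343 m/s, delays are rounded to whole samples, and the delay is applied to a time-domain sequence. All function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def alignment_delays(target_pos, array_positions, sample_rate):
    """Delay (in samples) to add to each array so all line up with the farthest one."""
    distances = [math.dist(target_pos, pos) for pos in array_positions]
    farthest = max(distances)  # the farthest array is the timing reference
    return [round((farthest - d) / SPEED_OF_SOUND * sample_rate) for d in distances]


def delay_signal(signal, delay):
    """Delay a sequence by whole samples, zero-padding the start and keeping its length."""
    delay = min(delay, len(signal))
    return [0.0] * delay + list(signal[:len(signal) - delay])
```

With this reference choice, every delay is non-negative, so no output ever has to be advanced in time.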

The propagation delay difference correction unit 23 and the spatial coordinate data holding unit 24 together correspond to the propagation delay difference correction unit 14 in FIG. 6 described above.

When the target area TAR is never changed and the distances between the target area TAR and the microphone arrays MA1 and MA2 are equal, the propagation delay difference correction unit 23 and the spatial coordinate data holding unit 24 can be omitted.

The target area sound power correction coefficient calculation unit 25 calculates correction coefficients for equalizing the power of the target area sound in the beamformer outputs X_ma1 and X_ma2. Its details are given in the description of the operation.

The target area sound extraction unit 26 selects the maximum of the correction coefficients calculated by the target area sound power correction coefficient calculation unit 25, corrects each beamformer output, and then performs spectral subtraction to extract the non-target area sound present in the direction of the target area. It then extracts the target area sound by spectrally subtracting the extracted non-target area sound from the output of each beamformer.

The target area sound component estimation unit 27 corrects each beamformer output using the correction coefficient values calculated by the target area sound power correction coefficient calculation unit 25, in descending order of value, and performs spectral subtraction to estimate the non-target area sound component and each target area sound component. Its details are given in the description of the operation.

The target area sound level adjustment unit 28 obtains the ratio between the average power of the target area sound source with the highest average power in the output of the target area sound extraction unit 26 and the average power of each of the other sound sources. When the condition described below is satisfied, it multiplies each ratio by the corresponding target area sound source component to adjust the power level. The condition under which the target area sound level adjustment unit 28 performs level adjustment is that the ratio falls outside the range of 1.0 ± an arbitrary constant.

(A-3) Operation of the First Embodiment
Next, the operation of the sound collection device 20 according to the first embodiment, which has the configuration described above, will be described.

The sound emitted by all sound sources located in the target area TAR is captured by the microphones a_11, a_12, ..., a_1M, a_21, a_22, ..., a_2M of all the microphone arrays MA1 and MA2 that handle the target area TAR. These microphones also capture sound from sound sources located in areas other than the target area TAR.

The acoustic signals (analog signals) x_11, x_12, ..., x_1M captured by all the microphones a_11, a_12, ..., a_1M of the first microphone array MA1 are converted into digital signals by the data input unit 21 and supplied to the directivity forming unit 22. Similarly, the acoustic signals (analog signals) x_21, x_22, ..., x_2M captured by all the microphones a_21, a_22, ..., a_2M of the second microphone array MA2 are converted into digital signals by the data input unit 21 and supplied to the directivity forming unit 22.

The directivity forming unit 22 applies beamformer processing, with the direction of the target area TAR as the directivity direction, to all the digitized acoustic signals from the first microphone array MA1, and the resulting main beamformer output X_ma1(t) is supplied to the propagation delay difference correction unit 23. Likewise, the directivity forming unit 22 applies beamformer processing, with the direction of the target area TAR as the directivity direction, to all the digitized acoustic signals from the second microphone array MA2, and the resulting sub beamformer output X_ma2(t) is supplied to the propagation delay difference correction unit 23.

In the propagation delay difference correction unit 23, the difference between the propagation delay time from the target area TAR to the first microphone array MA1 and the propagation delay time from the target area TAR to the second microphone array MA2, which arises from the difference in distance between the target area TAR and each of the microphone arrays MA1 and MA2, is calculated based on the data held in the spatial coordinate data holding unit 24. The time axis of at least one of the beamformer outputs X_ma1(t) and X_ma2(t) is then corrected so as to absorb that time difference.

The beamformer outputs (frequency domain signals) X_ma1(t) and X_ma2(t−τ), whose time axes have been aligned as described above, are supplied to the target area sound power correction coefficient calculation unit 25. The beamformer output to which the delay is applied is not necessarily that of the sub microphone array, but in the description above the beamformer output of the sub microphone array is written as X_ma2(t−τ).

The target area sound power correction coefficient calculation unit 25 calculates, from the time-aligned beamformer outputs X_ma1(t) and X_ma2(t−τ), correction coefficients for equalizing the power of the target area sound in these beamformer outputs.

FIG. 2 is a flowchart showing the processing in the target area sound power correction coefficient calculation unit 25 and the target area sound component estimation unit 27. Viewed differently, FIG. 2 can also be regarded as a block diagram showing the detailed configuration of these two units.

First, the target area sound power correction coefficient calculation unit 25 obtains the ratio of the amplitude spectra of the beamformer outputs X_ma1(t) and X_ma2(t−τ) at each frequency (step S1).

Next, the target area sound power correction coefficient calculation unit 25 calculates the frequency of occurrence of the amplitude spectrum ratios obtained (step S2) and detects the peak values of that frequency (step S3). For example, a peak value is detected by taking as the peak value the amplitude spectrum ratio at a data point where the first derivative is 0 (an extremum) and the second derivative is negative (a local maximum). Alternatively, when the frequency has increased several times in succession (here, increasing the amplitude spectrum ratio by one unit is counted as one time) and then decreased several times in succession, the turning point may be judged to be a peak value.

Having detected the peak values, the target area sound power correction coefficient calculation unit 25 takes each peak value whose frequency is equal to or higher than a preset threshold as a power correction coefficient (step S4). If the number of sound sources is known in advance, for example from analysis of image information obtained by imaging the target area TAR, peak values may instead be selected in descending order of frequency, as many as there are sound sources, and used as the power correction coefficients.
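Steps S1 to S4 can be illustrated by the following sketch, which is one possible reading rather than the claimed procedure. It assumes that the ratio is taken as main over sub, that the ratios are grouped into bins of a fixed width, and that a peak is a bin whose count exceeds its left neighbor and is not below its right neighbor. The function and parameter names are hypothetical.

```python
import math
from collections import Counter


def ratio_histogram(amp_main, amp_sub, bin_width=0.1, eps=1e-12):
    """Steps S1-S2: per-frequency amplitude ratios and their frequency of occurrence."""
    counts = Counter()
    for a, b in zip(amp_main, amp_sub):
        if b > eps:  # ignore bins where the sub output is (almost) silent
            counts[math.floor((a / b) / bin_width)] += 1
    return counts


def power_correction_coefficients(counts, bin_width=0.1, min_count=2, num_sources=None):
    """Steps S3-S4: local maxima of the histogram that pass the count threshold."""
    peaks = []
    for k, c in counts.items():
        is_peak = c > counts.get(k - 1, 0) and c >= counts.get(k + 1, 0)
        if is_peak and c >= min_count:
            peaks.append((c, (k + 0.5) * bin_width))  # (count, bin centre)
    peaks.sort(reverse=True)                           # most frequent first
    if num_sources is not None:                        # source count known in advance
        peaks = peaks[:num_sources]
    return sorted(value for _, value in peaks)
```

The returned list is sorted in ascending order, so its last element is the maximum coefficient used for extracting the non-target area sound.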

The power correction coefficients determined in this way are supplied to the target area sound extraction unit 26 and the target area sound component estimation unit 27.

In the target area sound extraction unit 26, the maximum of the power correction coefficients calculated by the target area sound power correction coefficient calculation unit 25 is selected and multiplied by the beamformer output of the sub microphone array, and the result is spectrally subtracted from the beamformer output of the main microphone array to extract the non-target area sound component. The target area sound extraction unit 26 then extracts the target area sound by spectrally subtracting the extracted non-target area sound from each beamformer output.
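The extraction step can be sketched as follows. This is a simplified illustration that works on per-bin magnitude lists, floors each subtraction at zero, and subtracts the non-target component from the main beamformer output only. The function names and the subtraction weight gamma are assumptions.

```python
def spectral_subtract(x, y, weight=1.0):
    """Per-bin spectral subtraction x - weight * y, floored at zero (assumed floor)."""
    return [max(a - weight * b, 0.0) for a, b in zip(x, y)]


def extract_target_area_sound(x_main, x_sub, coefficients, gamma=1.0):
    """Extract the non-target component N_0, then the target area sound."""
    c_max = max(coefficients)                    # largest power correction coefficient
    n_0 = spectral_subtract(x_main, x_sub, c_max)  # what remains is non-target sound
    target = spectral_subtract(x_main, n_0, gamma)
    return target, n_0
```

Choosing the largest coefficient means that every target area sound source is at least fully cancelled in the first subtraction, which is why the remainder can be treated as non-target area sound.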

In the target area sound component estimation unit 27, each beamformer output is corrected using the power correction coefficient values calculated by the target area sound power correction coefficient calculation unit 25, in descending order of value, and spectral subtraction is then performed to estimate the non-target area sound component and each target area sound component.

The right side of FIG. 2 shows the processing by which the target area sound component estimation unit 27 estimates the target area sound components. When there are two target area sound sources, the maximum power correction coefficient is first multiplied by the beamformer output of the sub microphone array, and the result is spectrally subtracted from the beamformer output of the main microphone array to extract the non-target area sound component N_0. This extraction of N_0 is the same as part of the processing in the target area sound extraction unit 26, so that result can be reused.

Furthermore, the second-largest power correction coefficient is multiplied by the beamformer output of the sub microphone array, and the result is spectrally subtracted from the beamformer output of the main microphone array to extract the target area sound component N_S1 (= N_SB), which includes the non-target area sound component N_0 (S5).

Next, according to equation (4), the non-target area sound component N_0 is spectrally subtracted from the extracted component N_S1, which includes N_0, to estimate one target area sound source component S_1 (= S_B) (S6). In this estimation, a binary mask may be applied after the spectral subtraction so that each frequency is assigned to a single target area sound source component.
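The binary mask mentioned here can be sketched as follows: at each frequency, the component with the larger power is kept and the other is set to 0. The function name is hypothetical, and assigning ties to source A is an arbitrary choice made for this illustration.

```python
def binary_mask(s_a, s_b):
    """Assign each frequency bin to whichever source has the larger power there."""
    masked_a, masked_b = [], []
    for a, b in zip(s_a, s_b):
        if a >= b:               # ties go to source A (arbitrary assumption)
            masked_a.append(a)
            masked_b.append(0.0)
        else:
            masked_a.append(0.0)
            masked_b.append(b)
    return masked_a, masked_b
```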

Then, according to equation (5), the target area sound source component S_1 (= S_B) and the non-target area sound component N_0 are each spectrally subtracted from the beamformer output of the main microphone array to estimate the other target area sound component S_2 (= S_A) (S7). The coefficient α_1 by which S_1 (= S_B) is multiplied in this spectral subtraction is, for example, as shown in equation (6), the mode of the histogram of the per-frequency ratios between the beamformer output of the main microphone array and S_1 (= S_B).

In the target area sound level adjustment unit 28, as shown in equation (7), the ratio between the average power of the target area sound source with the highest average power in the output of the target area sound extraction unit 26 and the average power of each of the other target area sound sources is obtained. Each ratio is multiplied by the corresponding target area sound source component so that the power levels of the target area sound source components are equalized. Level correction is not performed for a target area sound source whose ratio lies within the range of 1.0 ± an arbitrary constant.

(A-4) Effects of the First Embodiment
According to the first embodiment, the component of each target area sound source is estimated, the average power of each target area sound source contained in the data after area sound collection processing is obtained, and the power of the other target area sound sources is corrected so that it reaches the same level as the target area sound source with the highest power. As a result, even when a plurality of sound sources exist in the target area, the levels of all the sound sources can be made equal while the sound quality is maintained.

(B) Other Embodiments
Various modified embodiments have been mentioned in the description of the above embodiment, and the following modified embodiments can be given as further examples.

In the first embodiment, the ratio β used to equalize the power of the plurality of target area sound source components is calculated according to equation (7), but other calculation methods may be applied. For example, the ratio may be calculated at each point in time, and the average of those ratios applied as the ratio β used to equalize the power.
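The averaging variant can be sketched as follows, assuming that each point in time corresponds to one frame of per-bin power values and that frames in which source A has essentially no power are skipped. The function name and the frame layout are illustrative assumptions.

```python
def averaged_level_ratio(frames_a, frames_b, eps=1e-12):
    """Average the per-frame ratios P_SB / P_SA to obtain a single beta."""
    ratios = []
    for frame_a, frame_b in zip(frames_a, frames_b):
        p_a = sum(frame_a) / len(frame_a)
        p_b = sum(frame_b) / len(frame_b)
        if p_a > eps:                 # skip frames where source A is silent
            ratios.append(p_b / p_a)
    if not ratios:
        raise ValueError("no frame with usable power for source A")
    return sum(ratios) / len(ratios)
```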

Also, in the first embodiment, the average power of the other target area sound source components is matched to the average power of the target area sound source component with the highest average power. Instead, the level of each target area sound source component, including the one with the highest average power, may be adjusted so that the level (for example, the average power) of every target area sound source component reaches a predetermined level.

Furthermore, in the first embodiment, the target area sound is extracted with the beamformer output of the first microphone array as the main output and the beamformer output of the second microphone array as the sub output, but the target area sound may be extracted by other methods that use the beamformer outputs of the first and second microphone arrays. For example, one of the following two may be selected as the target area sound to be output: the target area sound extracted with the first microphone array as main and the second as sub, and the target area sound extracted with the second microphone array as main and the first as sub. The selection is made, for example, based on the volume level or power of the extracted target area sounds. Alternatively, the average or the sum of these two extracted target area sounds may be used as the target area sound to be output.
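The selection and combination options can be sketched as follows. The text does not state which of the two extractions is preferred when selecting, so choosing the one with the larger total power is an assumption of this sketch, as are the function name and the mode strings.

```python
def combine_extractions(target_1, target_2, mode="select"):
    """Combine two target area sounds extracted with the main/sub roles swapped."""
    if mode == "select":
        # Assumption: keep the extraction with the larger total power.
        return list(target_1) if sum(target_1) >= sum(target_2) else list(target_2)
    if mode == "average":
        return [(a + b) / 2.0 for a, b in zip(target_1, target_2)]
    if mode == "sum":
        return [a + b for a, b in zip(target_1, target_2)]
    raise ValueError(f"unknown mode: {mode}")
```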

The first embodiment describes estimating each target area sound source component in the case of two microphone arrays, but each target area sound source component can be estimated in the same way when there are three or more target area sound sources. For example, when there are three target area sound sources A, B, and C, three peaks P1, P2, and P3 are detected as correction coefficients. Assuming the values increase in this order, so that P3 is the largest, the beamformer output is first corrected with the correction coefficient value P3 and spectral subtraction is performed to extract the non-target area sound source component N_0. Similarly, the correction coefficient values P2 and P1 are applied and spectral subtraction is performed to extract, respectively, a mixed component N_S1 containing the non-target area sound component N_0 and the component S_C of target area sound source C, and a mixed component N_S2 containing the non-target area sound component N_0, the component S_C of target area sound source C, and the component S_B of target area sound source B. Next, the component S_C of target area sound source C is extracted by spectral subtraction similar to equation (4), and the component S_B of target area sound source B is extracted by a calculation according to equation (4) that also uses the extracted component S_C. Finally, the component S_A of target area sound source A is extracted by a calculation according to equation (5). As described above, even when there are three or more target area sound sources, each target area sound source component can be estimated based on the concept (technical idea) described in the first embodiment.
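The successive estimation for three sources can be sketched as below. Equations (4) and (5) are not reproduced in this passage, so the sketch rests on stated assumptions: amplitude spectra are treated as additive per frequency bin, each subtraction is floored at zero, and the last step obtains S_A by subtracting the estimated S_B and S_C from an extracted target area sound `z` that is assumed to contain all three sources. The function names and the list-based spectrum representation are hypothetical.

```python
def spectral_subtract(a, b):
    """Per-bin subtraction a - b, floored at zero (flooring is an assumption)."""
    return [max(x - y, 0.0) for x, y in zip(a, b)]


def estimate_three_sources(z, n_0, n_s1, n_s2):
    """Estimate S_A, S_B, S_C from the mixed components (illustrative only).

    z:    extracted target area sound, assumed to contain S_A + S_B + S_C.
    n_0:  non-target area component obtained with the largest coefficient P3.
    n_s1: mixed component obtained with P2, containing N_0 and S_C.
    n_s2: mixed component obtained with P1, containing N_0, S_C and S_B.
    """
    # Analogue of equation (4): remove N_0 from N_S1 to leave S_C.
    s_c = spectral_subtract(n_s1, n_0)
    # Analogue of equation (4) reusing S_C: remove N_0 and S_C from N_S2.
    s_b = spectral_subtract(spectral_subtract(n_s2, n_0), s_c)
    # Assumed analogue of equation (5): remove S_B and S_C from z.
    s_a = spectral_subtract(spectral_subtract(z, s_b), s_c)
    return s_a, s_b, s_c
```

With more than three sources, the same pattern would repeat: each additional mixed component isolates one more source once the previously estimated components are subtracted.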

In the first embodiment, the units are arranged as shown in FIG. 1, but the order of the units may be reversed as long as this does not depart from the features of the present invention. For example, the propagation delay difference correction unit 23 may be provided upstream of the directivity forming unit 22.

In the first embodiment, an amplitude spectrum ratio is treated as a peak value only if its frequency of occurrence is equal to or greater than a threshold, but another condition may be introduced instead of, or in addition to, this one. For example, the condition for treating an amplitude spectrum ratio as a peak value may be that its frequency of occurrence is at least a predetermined fraction of the maximum frequency of occurrence.
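The two peak conditions can be illustrated with a short sketch. It assumes the amplitude spectrum ratios have already been binned into a histogram; the function name, the parameter names `min_count` and `min_fraction`, and the use of a strict local maximum as the definition of a peak are assumptions made for illustration.

```python
def select_peaks(bin_centers, counts, min_count=None, min_fraction=None):
    """Return the ratio values (bin centers) accepted as peaks.

    bin_centers:  amplitude spectrum ratio represented by each histogram bin.
    counts:       frequency of occurrence in each bin.
    min_count:    absolute threshold on the frequency of occurrence.
    min_fraction: threshold as a fraction of the maximum frequency.
    Either condition may be used alone, or both together.
    """
    max_count = max(counts)
    peaks = []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else -1
        right = counts[i + 1] if i < len(counts) - 1 else -1
        # A peak is taken here to be a strict local maximum of the histogram.
        if not (c > left and c > right):
            continue
        if min_count is not None and c < min_count:
            continue
        if min_fraction is not None and c < min_fraction * max_count:
            continue
        peaks.append(bin_centers[i])
    return peaks
```

An absolute threshold depends on how many frequency bins fall into the histogram, whereas a fraction of the maximum scales with the data, which is one reason the relative condition may be preferable when the number of analyzed bins varies.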

In the first embodiment, the target area sound level adjustment unit always performs level adjustment, but the operator may be allowed to select whether or not level adjustment is performed.

In the first embodiment, each target area sound is output after level adjustment, but information on the average power before level adjustment may also be output together with it.
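The two variations on level adjustment (an operator-selectable switch and output of the pre-adjustment average power) can be sketched together. This is a simplified illustration under assumptions: each target area sound source is represented as a separate list of samples, the target power is the largest average power among the sources, and the names `adjust_levels` and `enabled` are hypothetical.

```python
import math


def average_power(samples):
    """Mean of squared sample values."""
    return sum(x * x for x in samples) / len(samples)


def adjust_levels(sources, enabled=True):
    """Equalize source power levels and report the original powers.

    sources: list of per-source sample lists (an assumed representation).
    enabled: operator switch; when False the signals pass through unchanged.
    Returns (adjusted_sources, powers_before_adjustment).
    """
    powers = [average_power(s) for s in sources]
    if not enabled:
        return [list(s) for s in sources], powers
    target = max(powers)  # match every source to the strongest one
    adjusted = []
    for samples, power in zip(sources, powers):
        # Amplitude gain is the square root of the power ratio;
        # silent sources are left untouched to avoid division by zero.
        gain = math.sqrt(target / power) if power > 0 else 1.0
        adjusted.append([gain * x for x in samples])
    return adjusted, powers
```

Returning the pre-adjustment powers alongside the adjusted signals lets a downstream consumer recover the original relative levels if needed.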

In the first embodiment, the acoustic signals captured by the microphone arrays are processed in real time, but the captured acoustic signals may instead be stored on a storage medium and later read out and processed to obtain an enhanced signal of the target area sound. When a storage medium is used in this way, the location where the microphone arrays are installed may be remote from the location where the target area sound is extracted. Likewise, in real-time processing, the location where the microphone arrays are installed may be remote from the location where the target area sound is extracted, with the signals supplied to the remote location by communication. Configurations that use such a storage medium or communication are also included in the concept of the "sound collection device" of the present invention.

In each of the above embodiments, every microphone array has the same number of microphones, but the number of microphones may differ between microphone arrays.

MA1, MA2 ... microphone array, 20 ... sound collection device, 21 ... data input unit, 22 ... directivity forming unit, 23 ... propagation delay difference correction unit, 24 ... spatial coordinate data holding unit, 25 ... target area sound power correction coefficient calculation unit, 26 ... target area sound extraction unit, 27 ... target area sound component estimation unit, 28 ... target area sound level adjustment unit.

Claims

A sound collection device comprising:
a plurality of microphone arrays;
directivity forming means for forming directivity toward a target area direction by applying a beamformer to the input from each of the microphone arrays;
target area sound power correction coefficient calculating means for calculating, for each frequency, the ratio of the amplitude spectra between the beamformer outputs of the microphone arrays, calculating the frequency of occurrence of the ratios, detecting peak values of the frequency of occurrence, and obtaining, as correction coefficients, the peak values whose frequency of occurrence is equal to or greater than a certain value;
target area sound extraction means for correcting the beamformer output of each of the microphone arrays using the maximum value of the correction coefficients calculated by the target area sound power correction coefficient calculating means, extracting the non-target area sound present in the target area direction by spectral subtraction between the corrected outputs, and extracting the target area sound by spectrally subtracting the extracted non-target area sound from the beamformer output of each microphone array;
target area sound component estimating means for correcting the beamformer outputs and performing spectral subtraction using the correction coefficients calculated by the target area sound power correction coefficient calculating means in descending order, extracting target area mixed components each containing the non-target area sound component and one or more target area sound components, and estimating each target area sound component based on the differences between target area mixed components containing different target area sound components; and
target area sound level adjusting means for obtaining power information of each target area sound component estimated by the target area sound component estimating means, and adjusting the target area sound outputs relating to a plurality of sound sources output from the target area sound extraction means so that their power levels become equal.

The sound collection device according to claim 1, wherein the target area sound level adjusting means adjusts the power levels of the other target area sound outputs to match the power level of the target area sound output whose power information is the largest among the plurality of target area sound outputs output from the target area sound extraction means.

The sound collection device according to claim 1 or 2, further comprising propagation delay difference correcting means for performing correction processing that absorbs the difference in the propagation delay times required for sound from a sound source in the target area to reach each of the microphone arrays.

A sound collection program causing a computer mounted on a sound collection device having a plurality of microphone arrays to function as:
directivity forming means for forming directivity toward a target area direction by applying a beamformer to the input from each of the microphone arrays;
target area sound power correction coefficient calculating means for calculating, for each frequency, the ratio of the amplitude spectra between the beamformer outputs of the microphone arrays, calculating the frequency of occurrence of the ratios, detecting peak values of the frequency of occurrence, and obtaining, as correction coefficients, the peak values whose frequency of occurrence is equal to or greater than a certain value;
target area sound extraction means for correcting the beamformer output of each of the microphone arrays using the maximum value of the correction coefficients calculated by the target area sound power correction coefficient calculating means, extracting the non-target area sound present in the target area direction by spectral subtraction between the corrected outputs, and extracting the target area sound by spectrally subtracting the extracted non-target area sound from the beamformer output of each microphone array;
target area sound component estimating means for correcting the beamformer outputs and performing spectral subtraction using the correction coefficients calculated by the target area sound power correction coefficient calculating means in descending order, extracting target area mixed components each containing the non-target area sound component and one or more target area sound components, and estimating each target area sound component based on the differences between target area mixed components containing different target area sound components; and
target area sound level adjusting means for obtaining power information of each target area sound component estimated by the target area sound component estimating means, and adjusting the target area sound outputs relating to a plurality of sound sources output from the target area sound extraction means so that their power levels become equal.