JP2011205324A

JP2011205324A - Voice processor, voice processing method, and program

Info

Publication number: JP2011205324A
Application number: JP2010069732A
Authority: JP
Inventors: Osamu Shimada; 修嶋田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-03-25
Filing date: 2010-03-25
Publication date: 2011-10-13

Abstract

PROBLEM TO BE SOLVED: To provide a voice processor and a voice processing method capable of easily outputting voice signals in which the sound volumes of a plurality of voices mixed in the voice signals are corrected.
SOLUTION: The voice processor includes a voice signal outputting section 110 which outputs an arbitrary frequency band component of the voice signals obtained by a plurality of nondirectional microphones arranged apart from one another, an arrival direction discriminating section 120 which discriminates the arrival direction of a voice based on the phase difference between frequency band components of the voice signals, a sound volume correction amount deriving section 130 which derives correction amounts of sound volume for the voice signals according to the arrival direction of the voice, and a sound volume correction executing section 140 which corrects the sound volumes of the voice signals by using the derived correction amounts.

Description

The present invention relates to a voice processing device, a voice processing method, and a program.

Conventionally, when audio is reproduced from an audio signal on which voices from a plurality of speakers are superimposed, the voice of a particular speaker may be quiet and difficult to hear.
To address this problem, it is known to separate such an audio signal into the voice of each speaker using a technique such as independent component analysis, and then to correct the volume of each separated voice.

However, separating the voice of each speaker with a technique such as independent component analysis requires sophisticated, complex, and extensive computation. Mounting such processing on a device with voice processing functions, such as a general-purpose terminal device or an audio conferencing device, therefore increases both cost and power consumption.

To address this problem, a technique is known that appropriately corrects the volume of an audio signal acquired by a sound collecting device that uses a plurality of microphones to acquire audio from a specific position (Patent Document 1).
In the technique described in Patent Document 1, a single sound collecting device is built from three microphones, two directional and one omnidirectional. The device identifies the direction from which each sound is collected, then compares and adjusts the level of the collected sound for each sound collection direction.

Patent Document 1: JP 2009-17343 A

However, the technique of Patent Document 1 requires two directional microphones and one omnidirectional microphone, and because the placement of these microphones is subject to complicated constraints, it is difficult to implement easily.
Moreover, because the technique of Patent Document 1 uses a total of three microphones (two directional and one omnidirectional) for a single sound collecting device, it needs many microphones, which raises cost, and sufficient area must be reserved inside the apparatus to mount them.
Furthermore, because the sound collection directions that the device can identify are limited to predetermined directions, the direction of a sound source sometimes cannot be identified, for example when several sound sources are close to the sound collecting device, so fine-grained control is not possible.

Therefore, to solve the problems described above, an object of the present invention is to provide a voice processing device and a voice processing method that pick up an audio signal in which a plurality of voices are mixed using a plurality of omnidirectional microphones, determine the direction of arrival of each voice, and easily output an audio signal in which the volumes of the mixed voices have been corrected.

To achieve the above object, the present invention comprises: an audio signal output unit that outputs an arbitrary frequency band component of the audio signals respectively acquired by a plurality of omnidirectional microphones arranged apart from each other; an arrival direction determination unit that determines the direction of arrival of the voice collected by the microphones based on the phase difference between the frequency band components of the audio signals output from the audio signal output unit; a volume correction amount derivation unit that derives a volume correction amount for the audio signals according to the direction of arrival determined by the arrival direction determination unit; and a volume correction execution unit that corrects the volume of the audio signals using the correction amount derived by the volume correction amount derivation unit.

According to the present invention, the direction of arrival of the voice collected by the omnidirectional microphones of the audio signal output unit is determined from the phase difference between predetermined frequency band components of the collected voice, and a correction amount for the collected voice is derived according to that direction of arrival. The volume of the voice collected by the microphones can thus be corrected according to its direction of arrival.

Therefore, for an audio signal on which voices from a plurality of directions of arrival are superimposed, the volumes of the voices from the different directions can be corrected so that they are equal, and each voice superimposed on the audio signal can be reproduced so that it is easy to hear.
In addition, because the correction is performed per direction of arrival without separating the voices superimposed on the audio signal, the volume of each superimposed voice can be corrected easily and with a small amount of computation.

FIG. 1 is a block diagram showing the configuration of the voice processing device according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the voice processing device according to the first embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of the voice processing device according to the second embodiment of the present invention.
FIG. 4 is a diagram conceptually illustrating the positional relationship between the voice processing device of the present invention and the plurality of voices it acquires.
FIG. 5 is a diagram conceptually showing the relationship between the position of a sound source and the audio signals picked up by the omnidirectional microphones.
FIG. 6 is a diagram conceptually showing the relationship between the sound collection regions and the volume of the acquired audio signal.
FIG. 7 is a diagram conceptually showing the relationship between the sound collection regions and the correction amount for the acquired audio signal.
FIG. 8 is a flowchart showing the operation of the voice processing device according to the second embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the voice processing device according to the third embodiment of the present invention.
FIG. 10 is a diagram conceptually illustrating how boundary position information is determined in the voice processing device according to the third embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[First Embodiment]
The voice processing device according to the first embodiment of the present invention uses a plurality of microphones to acquire an audio signal on which several voices arriving from different directions are superimposed. It then corrects the volume of the acquired audio signal according to the direction of arrival of each voice, which is determined from the phase difference between arbitrary frequency band components of the audio signals acquired by these microphones.

As shown in FIG. 1, the voice processing device 10 according to the present embodiment includes an audio signal output unit 110, an arrival direction determination unit 120, a volume correction amount derivation unit 130, and a volume correction execution unit 140.
The audio signal output unit 110 outputs an arbitrary frequency band component of the audio signals respectively acquired by a plurality of omnidirectional microphones arranged apart from each other.
The arrival direction determination unit 120 determines the direction of arrival of the voice collected by the microphones of the audio signal output unit 110, based on the phase difference between the frequency band components of the audio signals output from the audio signal output unit 110.

The volume correction amount derivation unit 130 derives a volume correction amount for the audio signals acquired by the microphones of the audio signal output unit 110, according to the direction of arrival of the collected voice as determined by the arrival direction determination unit 120.
The volume correction execution unit 140 corrects the volume of the audio signals acquired by the microphones of the audio signal output unit 110 using the correction amount derived by the volume correction amount derivation unit 130.

Each component of the voice processing device 10 according to the present embodiment is realized by installing a computer program (software) on a computer that includes a CPU, a memory, and an interface. The functions of the voice processing device 10 described above are realized through cooperation between the hardware resources of the computer and the computer program.

Next, the operation of the voice processing device 10 according to the present embodiment will be described with reference to FIG. 2.
As shown in FIG. 2, the voice processing device 10 causes the audio signal output unit 110 to output an arbitrary frequency band component of each audio signal acquired by the plurality of microphones of the audio signal output unit 110 (S101).

When the audio signal output unit 110 outputs the frequency band components of the audio signals acquired by the plurality of microphones, the arrival direction determination unit 120 determines the direction of arrival of the voice collected by the microphones based on the phase difference between the frequency band components output from the audio signal output unit 110 (S102).

When the arrival direction determination unit 120 has determined the direction of arrival of the voice collected by the microphones, the volume correction amount derivation unit 130 derives a volume correction amount for the audio signals acquired by the microphones according to that direction of arrival (S103).
When the volume correction amount for the audio signals has been derived, the volume correction execution unit 140 corrects the volume of the audio signals using the derived correction amount (S104).
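The four steps S101 to S104 can be sketched as a single per-frame function. This is a minimal illustration, not the implementation described in the patent: it assumes frame-by-frame processing of two microphone signals, represents each set of frequency band components as a dictionary from bin index to complex value, and passes the three analysis stages in as callables. The names `process_frame`, `analyze`, `classify`, and `derive_gains` are hypothetical.

```python
from typing import Callable, Dict, Sequence

Spectrum = Dict[int, complex]   # bin index -> complex component (hypothetical representation)
Regions = Dict[int, int]        # bin index -> sound collection region number

def process_frame(
    frame_a: Sequence[float],
    frame_b: Sequence[float],
    analyze: Callable[[Sequence[float]], Spectrum],                  # S101: audio signal output unit
    classify: Callable[[Spectrum, Spectrum], Regions],               # S102: arrival direction determination unit
    derive_gains: Callable[[Spectrum, Regions], Dict[int, float]],   # S103: correction amount derivation unit
) -> Spectrum:
    """Run steps S101-S104 on one frame from two microphones; return corrected bins."""
    spec_a, spec_b = analyze(frame_a), analyze(frame_b)   # S101
    regions = classify(spec_a, spec_b)                    # S102
    gains = derive_gains(spec_a, regions)                 # S103
    # S104: the volume correction execution unit scales each bin by its region's gain;
    # bins with no region, or whose region has no gain, pass through unchanged.
    return {k: x * gains.get(regions.get(k, 0), 1.0) for k, x in spec_a.items()}
```

Keeping the stages as separate callables mirrors the block structure of FIG. 1, so each unit can be replaced independently.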

As described above, the voice processing device according to the present embodiment determines the direction of arrival of the voice collected by the microphones of the audio signal output unit from the phase difference between predetermined frequency band components of that voice, and derives a correction amount for the collected voice according to its direction of arrival. The volume of the collected voice can therefore be corrected according to its direction of arrival.
Consequently, for an audio signal on which voices from a plurality of directions of arrival are superimposed, the volumes of the voices from the different directions can be corrected so that they are equal, and each voice superimposed on the audio signal can be reproduced so that it is easy to hear.

[Second Embodiment]
FIG. 3 is a block diagram showing the configuration of the voice processing device according to the second embodiment of the present invention. The voice processing device according to the present embodiment corrects the volume of an audio signal on which a plurality of voices are superimposed according to the direction of arrival of each voice. In particular, it uses the audio signals acquired by two omnidirectional microphones placed apart from each other to determine the directions of arrival of the voices superimposed on the audio signal, and corrects the volume of the audio signal accordingly.
Components of the voice processing device according to the present embodiment that have the same configuration and function as components of the voice processing device 10 described in the first embodiment are given the same reference numerals, and their detailed description is omitted.

As shown in FIG. 3, the voice processing device 20 according to the present embodiment includes an audio signal output unit 210, an arrival direction determination unit 220, a volume correction amount derivation unit 230, and a volume correction execution unit 240.
The audio signal output unit 210 includes a sound collection unit 211, which has two omnidirectional microphones 211-a and 211-b, and a frequency analysis unit 212, which converts the audio signals acquired by the sound collection unit 211 into arbitrary frequency band components.

In the sound collection unit 211 of the audio signal output unit 210, the omnidirectional microphones 211-a and 211-b are placed apart from each other, and the unit outputs the audio signal acquired by each of them.
The frequency analysis unit 212 generates and outputs predetermined frequency band components of the audio signals that were acquired by the omnidirectional microphones 211-a and 211-b and output from the sound collection unit 211. The frequency bands of these components can be set in advance; for example, they may be the components of the audio signal in each of several frequency bands preset within the range of the human voice.
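One way the frequency analysis unit 212 could produce per-band components is a discrete Fourier transform of each frame that keeps only the bins inside a voice band. The sketch below assumes a 16 kHz sampling rate and a 300 to 3400 Hz band (both hypothetical choices; the text only says the bands are preset within the human voice range) and uses a direct DFT from the standard library for clarity rather than speed.

```python
import cmath
import math
from typing import Dict, Sequence

def voice_band_components(
    frame: Sequence[float],
    fs: float = 16000.0,    # assumed sampling rate (Hz)
    f_lo: float = 300.0,    # assumed lower edge of the voice band (Hz)
    f_hi: float = 3400.0,   # assumed upper edge of the voice band (Hz)
) -> Dict[int, complex]:
    """Return {bin index: complex DFT coefficient} for bins whose frequency lies in [f_lo, f_hi]."""
    n = len(frame)
    out: Dict[int, complex] = {}
    for k in range(n // 2 + 1):      # non-negative frequencies only
        f_k = k * fs / n             # centre frequency of bin k
        if f_lo <= f_k <= f_hi:
            # Direct DFT sum for bin k: X[k] = sum_t x[t] * exp(-2*pi*i*k*t/n)
            out[k] = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
    return out
```

With a 256-sample frame, this keeps bins 5 to 54 (312.5 Hz to 3375 Hz). A practical implementation would use an FFT with windowing and overlapping frames; the bin selection would be unchanged.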

The arrival direction determination unit 220 determines the direction of arrival of the audio signals that were acquired by the omnidirectional microphones 211-a and 211-b and output from the frequency analysis unit 212. It does so by assigning them to sound collection regions, which are identified from the phase difference between predetermined frequency band components of these audio signals.

<Determining the direction of arrival of the audio signal>
The function by which the arrival direction determination unit 220 determines the direction of arrival of the audio signal is described in detail here.
FIG. 4 conceptually shows an example of the positional relationship between a terminal A equipped with the voice processing device according to the present embodiment and the voices (sound sources 1 to 3) picked up by the two omnidirectional microphones 211-a and 211-b of terminal A. Terminal A uses the omnidirectional microphones 211-a and 211-b to pick up voices arriving from different positions from different sound sources (sound sources 1 to 3), and outputs an audio signal containing all the voices picked up by these microphones, that is, the voices of sound sources 1 to 3.

As shown in FIG. 4, when voices arriving from different positions are picked up, a pickup time difference that depends on the position of each sound source arises between the audio signal containing sound sources 1 to 3 picked up by omnidirectional microphone 211-a of terminal A and the audio signal containing sound sources 1 to 3 picked up by omnidirectional microphone 211-b. FIG. 5 is a conceptual diagram of the relationship between the positions of the sound sources and the audio signals picked up by the two omnidirectional microphones.

For example, as shown in FIG. 5, there is a pickup time difference a between the waveform of the voice from sound source 1 picked up by omnidirectional microphone 211-a and the waveform of the same voice picked up by omnidirectional microphone 211-b, and a pickup time difference c between the corresponding two waveforms of the voice from sound source 3.
These pickup time differences arise from the positional relationship between the omnidirectional microphones 211-a and 211-b and each sound source. The voice from sound source 1 reaches omnidirectional microphone 211-a earlier by time a than it reaches omnidirectional microphone 211-b, because 211-a is closer to sound source 1. The voice from sound source 3 reaches 211-a later by time c, because 211-a is farther from sound source 3 than 211-b is.

On the other hand, the voice from sound source 2 is picked up by both microphones at the same time, because sound source 2 is equidistant from omnidirectional microphones 211-a and 211-b, so no pickup time difference arises.
Thus the timing at which the two omnidirectional microphones pick up a voice differs according to the position of the sound source, and this timing offset appears as a phase difference between the audio signals acquired by the two omnidirectional microphones, from which the position of the sound source can be identified.
As described with reference to FIG. 5, the arrival direction determination unit 220 determines the direction of arrival of the voice collected by the sound collection unit 211 by extracting, from the audio signals acquired by omnidirectional microphones 211-a and 211-b, the phase difference between each pair of corresponding frequency components output from the frequency analysis unit 212.
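The link between the pickup time difference and the phase difference can be made concrete under a far-field (plane-wave) assumption. With microphone spacing d, speed of sound c, and arrival angle θ measured from the broadside direction, the delay is τ = d·sinθ / c and the phase difference at frequency f is Δφ = 2πfτ. The helpers below are a hypothetical sketch of that relation and its inverse; the far-field model, the spacing, and the speed of sound are assumptions, not values from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C

def phase_difference(theta_deg: float, freq_hz: float, spacing_m: float) -> float:
    """Phase lead (rad) of mic 211-a over mic 211-b for a far-field source (assumed model).

    theta_deg is measured from broadside: 0 means equidistant (like sound source 2);
    positive angles lean toward mic 211-a (like sound source 1).
    """
    delay = spacing_m * math.sin(math.radians(theta_deg)) / SPEED_OF_SOUND
    return 2.0 * math.pi * freq_hz * delay

def arrival_angle(dphi_rad: float, freq_hz: float, spacing_m: float) -> float:
    """Invert phase_difference(); the result is unambiguous only while |dphi| < pi."""
    s = dphi_rad * SPEED_OF_SOUND / (2.0 * math.pi * freq_hz * spacing_m)
    return math.degrees(math.asin(max(-1.0, min(1.0, s))))  # clamp guards against rounding
```

With a 5 cm spacing, a source 30 degrees toward microphone 211-a gives a phase lead of about 0.458 rad at 1 kHz. The inverse is unambiguous only for frequencies below c/(2d), which is about 3.4 kHz for this spacing.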

Specifically, for example, the arrival direction determination unit 220 stores in advance phase difference information associated with each sound collection region, dividing the sound collection area into three regions. For the audio signals acquired by omnidirectional microphones 211-a and 211-b, the arrival direction determination unit 220 compares the phase difference extracted between the frequency components output from the frequency analysis unit 212 with the stored phase difference information for each sound collection region, and determines the sound collection region of each frequency component from its extracted phase difference.
That is, the arrival direction determination unit 220 extracts the phase difference for every frequency component of the audio signals acquired by the sound collection unit 211 and compares it with the stored phase difference information for each sound collection region, thereby determining the direction of arrival of the voice collected by the sound collection unit 211.
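The region lookup can be sketched as follows. Instead of a per-bin table of stored phase differences, this hypothetical version stores a single boundary delay `edge_delay_s`. It converts each bin's phase difference to a delay before comparing, which is equivalent to a phase threshold that grows linearly with frequency. That simplification, and the mapping of regions 1 to 3 onto sound sources 1 to 3 in FIG. 4, are assumptions.

```python
import cmath
import math
from typing import Dict

def classify_bins(
    spec_a: Dict[int, complex],   # bins from mic 211-a
    spec_b: Dict[int, complex],   # same bins from mic 211-b
    n_fft: int,
    fs: float,
    edge_delay_s: float,          # hypothetical stored boundary between regions (seconds)
) -> Dict[int, int]:
    """Assign each bin to sound collection region 1, 2 or 3 from the inter-mic phase difference."""
    regions: Dict[int, int] = {}
    for k, a in spec_a.items():
        if k == 0:
            continue  # at DC the phase difference carries no delay information
        dphi = cmath.phase(a * spec_b[k].conjugate())     # phase(a) - phase(b), wrapped to (-pi, pi]
        delay = dphi / (2.0 * math.pi * k * fs / n_fft)   # seconds by which mic 211-a leads 211-b
        if delay > edge_delay_s:
            regions[k] = 1   # closer to mic 211-a (like sound source 1)
        elif delay < -edge_delay_s:
            regions[k] = 3   # closer to mic 211-b (like sound source 3)
        else:
            regions[k] = 2   # near broadside (like sound source 2)
    return regions
```

Because a voice occupies many bins, the same speaker is usually assigned to one region across most of its dominant bins, which is what allows per-region processing without separating the voices.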

The volume correction amount derivation unit 230 derives a volume correction coefficient that sets a volume correction amount for each direction of arrival of the audio signal acquired by the sound collection unit 211.
The volume correction amount derivation unit 230 includes a volume estimation unit 231 and a correction coefficient derivation unit 232. The volume estimation unit 231 estimates the volume level of the audio signal for each sound collection region from the volume levels of the frequency components whose sound collection regions have been determined by the arrival direction determination unit 220. The correction coefficient derivation unit 232 uses the per-region volume levels estimated by the volume estimation unit 231 to derive a volume correction coefficient that gives, for each sound collection region, the correction amount for the volume of the audio signal acquired by the sound collection unit 211.

Two functions of the volume correction amount derivation unit 230 are described in detail here: how the volume estimation unit 231 estimates the volume level of the audio signal for each sound collection region, and how the correction coefficient derivation unit 232 derives the volume correction coefficient that gives the correction amount for each sound collection region.

<Estimating the volume of the audio signal>
The volume estimation unit 231 calculates a volume level for each frequency component, output from the frequency analysis unit 212, of the audio signal acquired by the sound collection unit 211. For example, the volume estimation unit 231 can derive the volume level of a frequency component from the spectral energy of that component, or it may derive the level from the amplitude value (voltage value) of the component.

The volume estimation unit 231 calculates the volume level of each frequency component of the audio signal output from the frequency analysis unit 212 and associates each calculated level with the direction of arrival of that frequency component, thereby deriving the relationship between direction of arrival and volume for each frequency component. That is, the volume estimation unit 231 derives, for each sound collection region determined by the arrival direction determination unit 220, the relationship between phase difference and volume level for the frequency components of the audio signal.
Based on the derived relationship between the sound collection regions and the volume levels of the frequency components, the volume estimation unit 231 takes the maximum volume level within each sound collection region as the estimated volume of the audio signal in that region.

Specifically, for example, the solid line in FIG. 6 shows an example of the relationship between the sound collection regions and the volume level of each frequency component of the audio signal. Among the frequency components in the three sound collection regions (regions 1 to 3), the volume estimation unit 231 takes the maximum volume level of the components in region 1 as the volume of the audio signal in region 1, and likewise estimates the volumes of regions 2 and 3 as the maximum volume levels of the components in those regions.
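The per-region estimate described above (the loudest frequency component in each region) might look like this. The sketch assumes the level is the spectral energy |X|² expressed in dB and that the spectrum of one microphone is used; the function name and the 1e-12 floor are illustrative choices, not details from the patent.

```python
import math
from typing import Dict

def region_levels_db(spec: Dict[int, complex], regions: Dict[int, int]) -> Dict[int, float]:
    """Estimate each region's volume as the loudest bin level (dB) assigned to that region."""
    levels: Dict[int, float] = {}
    for k, region in regions.items():
        energy = abs(spec[k]) ** 2                    # spectral energy of bin k
        level = 10.0 * math.log10(energy + 1e-12)     # small floor avoids log10(0)
        levels[region] = max(levels.get(region, -math.inf), level)
    return levels
```

Taking the maximum rather than the sum makes the estimate follow the strongest voice component in a region instead of its total energy, which matches the "maximum volume level" rule in the text.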

<Deriving the volume correction coefficient>
The correction coefficient derivation unit 232 uses the per-region volume of the audio signal estimated by the volume estimation unit 231 to determine whether a desired audio signal is present in each sound collection region. It then calculates correction amounts that equalize the volumes of the audio signals in the regions where a desired audio signal is present, and derives a volume correction coefficient that specifies this correction amount for each sound collection region.

Specifically, as illustrated in FIG. 6, the correction coefficient deriving unit 232 sets in advance a predetermined threshold for the per-area volume estimated by the volume estimation unit 231. An audio signal whose volume exceeds this threshold can be judged to be a desired audio signal acquired by the sound collection unit 211. The threshold may be preset as a fixed value, or it may be set according to the ambient noise level by estimating the volume of ambient noise contained in the sound collected by the sound collection unit 211.

In the example of FIG. 6, among the per-area volumes estimated by the volume estimation unit 231, the volumes of areas 1 and 3 exceed the threshold and the volume of area 2 is below it. The correction coefficient deriving unit 232 can therefore determine that a desired audio signal is present in the signals from areas 1 and 3 and absent from the signal from area 2.
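A sketch of the presence test follows. The comparison itself is taken from the text. The noise-based threshold rule and its margin of 2.0 are hypothetical, because the description only says that the threshold may be fixed or may depend on the ambient noise.

```python
def noise_based_threshold(noise_level, margin=2.0):
    """Hypothetical rule: put the threshold a fixed factor above the ambient noise."""
    return noise_level * margin


def detect_desired_areas(area_volumes, threshold):
    """Flag the areas whose estimated volume exceeds the threshold."""
    return [v > threshold for v in area_volumes]


threshold = noise_based_threshold(0.1)  # about 0.2 with the assumed margin
print(detect_desired_areas([0.9, 0.1, 0.4], threshold))
# [True, False, True]
```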

Having determined whether a desired audio signal is present in each sound collection area, the correction coefficient deriving unit 232 calculates a volume correction amount for each area. Among the areas containing a desired audio signal, the amounts bring every area's volume to that of the loudest area.
For example, in FIG. 6, desired audio signals are present in areas 1 and 3. The correction coefficient deriving unit 232 therefore calculates a correction amount that raises the volume of the area 3 signal to that of the louder area 1 signal. The area 2 signal is not corrected because its volume is below the threshold.

That is, the correction coefficient deriving unit 232 applies no correction to the audio signals of areas 1 and 2. It derives a volume correction coefficient whose correction amount makes the volume of the area 3 signal equal to that of the area 1 signal. For example, the correction amount of an area that is not corrected can be set to 1, and the correction amount $s$ of an area that is corrected can be set to $s = (\text{maximum volume among the areas}) / (\text{volume of the area being corrected})$. In the example of FIG. 6, the correction amount for the area 3 signal is $s = V_1 / V_3$, where $V_1$ and $V_3$ are the estimated volumes of areas 1 and 3.
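The rule $s = V_{\max} / V$ can be written out directly. This is a minimal sketch with a hypothetical function name. It assumes a non-negative threshold, so that every area flagged as desired has a strictly positive volume and the division is safe.

```python
def correction_amounts(area_volumes, desired):
    """Per-area gain: Vmax / V for areas with a desired signal, 1 otherwise.

    Vmax is the largest volume among the areas that contain a desired signal,
    so the loudest such area automatically gets a gain of 1.
    """
    active = [v for v, d in zip(area_volumes, desired) if d]
    if not active:
        return [1.0] * len(area_volumes)
    v_max = max(active)
    # Assuming a non-negative threshold, every desired area has v > 0 here.
    return [v_max / v if d else 1.0 for v, d in zip(area_volumes, desired)]


print(correction_amounts([0.8, 0.1, 0.2], [True, False, True]))
# [1.0, 1.0, 4.0]
```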

The correction coefficient deriving unit 232 thus derives a volume correction coefficient that sets the correction amount to 1 for the audio signals of areas 1 and 2 shown in FIG. 6 and to $s$ for the audio signal of area 3.
The volume correction coefficient derived by the correction coefficient deriving unit 232 may also limit the corrected volume so that it does not become excessive. For example, the coefficient may be adjusted so that the corrected volume does not exceed a preset limit. Alternatively, the derived coefficient may be set to $s\alpha$ ($0 < \alpha < 1$), with $\alpha$ chosen as appropriate to adjust the corrected volume.
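Both limiting options can be sketched in one helper. The description presents them as alternatives. Allowing both at once, applying $\alpha$ before the ceiling, and the parameter names are all assumptions made for this illustration.

```python
def limit_correction(s, volume, max_volume=None, alpha=None):
    """Optionally restrain a correction amount s for an area of the given volume.

    alpha      -- scale factor with 0 < alpha < 1, giving s * alpha
    max_volume -- hypothetical ceiling on the corrected volume s * volume
    """
    if alpha is not None:
        if not 0.0 < alpha < 1.0:
            raise ValueError("alpha must satisfy 0 < alpha < 1")
        s *= alpha
    if max_volume is not None and s * volume > max_volume:
        s = max_volume / volume  # clip so the corrected volume equals the ceiling
    return s


print(limit_correction(4.0, 0.2, alpha=0.5))  # 2.0
print(limit_correction(4.0, 0.2, max_volume=0.5))
```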

FIG. 7 shows an example of the volume correction coefficient derived by the correction coefficient deriving unit 232. The wavy line in FIG. 7 shows a coefficient that sets the correction amount to 1 for areas 1 and 2 of FIG. 6 and to $s$ for area 3. If this coefficient were applied as-is to the volume correction of each area, the correction would be discontinuous for audio signal components lying on the boundary between areas 2 and 3, and this may produce abnormal noise. The correction coefficient deriving unit 232 therefore derives a volume correction coefficient that does not cause discontinuous volume correction.

For example, as shown by the solid line in FIG. 7, the discontinuity in the correction amount at the boundary between areas 2 and 3 can be removed by linear interpolation, so that each area's correction amount is attained at the phase difference at the center of that area. Nonlinear interpolation such as quadratic interpolation may be used instead of linear interpolation, and the correction amounts may also be smoothed using past correction amounts.
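The center-anchored linear interpolation can be sketched as follows. Pinning each gain at its area's center follows the text. Holding the gain constant beyond the outermost centers, and describing areas by an edge list, are assumptions of this sketch. The function name is hypothetical.

```python
from bisect import bisect_right


def interpolated_coefficient(edges, gains, phase_diff):
    """Correction coefficient at a given phase difference, without jumps at area edges.

    edges -- sorted phase differences [lo, b1, ..., hi] delimiting the areas,
             so len(edges) == len(gains) + 1
    gains -- correction amount of each area (e.g. [1, 1, s])
    Each area's gain is pinned at the center of its interval and linearly
    interpolated between neighboring centers; beyond the outermost centers the
    nearest area's gain is held constant.
    """
    centers = [(a + b) / 2 for a, b in zip(edges, edges[1:])]
    if phase_diff <= centers[0]:
        return gains[0]
    if phase_diff >= centers[-1]:
        return gains[-1]
    k = bisect_right(centers, phase_diff) - 1  # centers[k] <= phase_diff < centers[k+1]
    t = (phase_diff - centers[k]) / (centers[k + 1] - centers[k])
    return gains[k] + t * (gains[k + 1] - gains[k])


edges, gains = [-3.0, -1.0, 1.0, 3.0], [1.0, 1.0, 4.0]
print([interpolated_coefficient(edges, gains, d) for d in (-2.5, 0.0, 1.0, 2.5)])
# [1.0, 1.0, 2.5, 4.0]
```

At the boundary between the second and third areas (phase difference 1.0), the coefficient is 2.5, which lies halfway between the two areas' gains.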

The volume correction execution unit 240 corrects the volume of the audio signal acquired by the sound collection unit 211. It does so by applying the volume correction coefficient derived by the volume correction amount deriving unit 230 to the audio signal output from the audio signal output unit 210.
Specifically, the volume of each frequency component is corrected using the frequency components of the audio signal output by the frequency analysis unit 212 and the volume correction coefficient derived by the correction coefficient deriving unit 232. For example, let $Y_i(f, t)$ be a frequency component $X_i(f, t)$ whose volume has been corrected with the volume correction coefficient $C_i(f, t)$. Then $Y_i(f, t) = X_i(f, t) \cdot C_i(f, t)$, where $f$ is the frequency index and $t$ is the time index.
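The multiplication $Y_i(f, t) = X_i(f, t) \cdot C_i(f, t)$ is sketched below for one channel. The nested-list layout indexed as [time][frequency] and the function name are assumptions. Because the coefficients are real and non-negative, each bin's magnitude is rescaled and its phase is left unchanged.

```python
def apply_correction(X, C):
    """Return Y with Y[t][f] = X[t][f] * C[t][f].

    X -- complex STFT bins of one channel, indexed [time][frequency]
    C -- real, non-negative correction coefficients with the same shape
    A real non-negative gain rescales each bin's magnitude and keeps its phase.
    """
    return [[x * c for x, c in zip(x_row, c_row)] for x_row, c_row in zip(X, C)]


Y = apply_correction([[1 + 1j, 2j]], [[2.0, 0.5]])
print(Y)  # [[(2+2j), 1j]]
```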

Each component of the speech processing apparatus 20 according to the present embodiment is realized by installing a computer program (software) on a computer comprising a CPU, memory, and interfaces. The various functions of the speech processing apparatus 20 described above are realized by the hardware resources of the computer operating in cooperation with the computer program.

Next, the speech processing operation of the speech processing apparatus 20 according to the present embodiment will be described with reference to the flowchart shown in FIG. 8.
As shown in FIG. 8, the speech processing apparatus 20 according to the present embodiment collects sound with the two omnidirectional microphones 211-a and 211-b mounted in the sound collection unit 211 (S201).

The frequency analysis unit 212 divides the audio signals acquired by the two omnidirectional microphones 211-a and 211-b into frequency components for preset frequency bands and outputs them (S202).
When the frequency analysis unit 212 has output the frequency components of the audio signals acquired by the two microphones, the arrival direction determination unit 220 extracts the phase difference for each predetermined frequency band. This is the phase difference between the frequency component of the signal acquired by omnidirectional microphone 211-a and that of the signal acquired by omnidirectional microphone 211-b (S203).
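Step S203 can be sketched as below. A naive DFT stands in for whatever filter bank or FFT the frequency analysis unit 212 actually uses. The sign convention, microphone a minus microphone b, is an assumption, and the function names are hypothetical. `cmath.phase` returns a value in [-pi, pi], so the difference comes out already wrapped.

```python
import cmath
import math


def dft_half(frame):
    """Naive DFT of a real frame, bins 0..N/2 (a stand-in for an FFT)."""
    n = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n // 2 + 1)]


def phase_differences(spec_a, spec_b):
    """Per-bin phase of microphone a minus microphone b, wrapped to [-pi, pi]."""
    return [cmath.phase(a * b.conjugate()) for a, b in zip(spec_a, spec_b)]


# Both microphones hear the same tone (bin 4 of a 32-point frame); mic b lags by 0.5 rad.
N, K, LAG = 32, 4, 0.5
mic_a = [math.cos(2 * math.pi * K * m / N) for m in range(N)]
mic_b = [math.cos(2 * math.pi * K * m / N - LAG) for m in range(N)]
diffs = phase_differences(dft_half(mic_a), dft_half(mic_b))
print(round(diffs[K], 6))  # 0.5
```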

After extracting the phase difference for each predetermined frequency band, the arrival direction determination unit 220 determines the arrival direction of each frequency component of the audio signal acquired by the sound collection unit 211, in terms of the preset sound collection areas. The determination is based on the extracted phase differences and a prestored relationship between phase difference and sound collection area (S204).
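The description does not say how the prestored relationship is built. One possible construction uses far-field, free-field geometry, sketched below. The phase difference at frequency f corresponds to a delay, and the delay corresponds to an arrival angle. Because the same angle gives a phase difference proportional to f, the mapping depends on frequency. The speed of sound, the 5 cm spacing, the angle boundaries, and the function names are all hypothetical. The spacing must stay below half a wavelength to avoid spatial aliasing.

```python
import math
from bisect import bisect_right

SPEED_OF_SOUND = 343.0  # m/s at roughly room temperature (assumed)


def arrival_angle(phase_diff, freq_hz, mic_spacing_m):
    """Far-field arrival angle (radians from broadside) implied by a phase difference."""
    delay = phase_diff / (2 * math.pi * freq_hz)  # seconds
    sin_theta = SPEED_OF_SOUND * delay / mic_spacing_m
    return math.asin(max(-1.0, min(1.0, sin_theta)))  # clamp noisy values


def area_of(angle, angle_boundaries):
    """0-based sound collection area index for an arrival angle."""
    return bisect_right(angle_boundaries, angle)


# A 1 kHz component arriving from 30 degrees at a 5 cm microphone pair.
d, f = 0.05, 1000.0
pd = 2 * math.pi * f * d * math.sin(math.radians(30)) / SPEED_OF_SOUND
angle = arrival_angle(pd, f, d)
print(round(math.degrees(angle), 6), area_of(angle, [math.radians(-20), math.radians(20)]))
# 30.0 2
```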

Once the arrival direction of each frequency component of the audio signal acquired by the sound collection unit 211 has been determined per sound collection area, the volume correction amount deriving unit 230 estimates the volume of the audio signal in each area. It then derives a volume correction coefficient specifying the correction amount for the volume of the audio signal in each area (S205).

When the volume correction amount deriving unit 230 has derived the volume correction coefficient, the volume correction execution unit 240 corrects the volume of the audio signal acquired by the sound collection unit 211 for each sound collection area. It uses the frequency components output from the frequency analysis unit 212 and the volume correction coefficient output from the volume correction amount deriving unit 230 (S206).
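Steps S203 to S206 can be strung together for a single frame as follows. This is a simplified sketch, not the patented processing chain. It assumes that the spectra from S201 and S202 are already available and that the areas use fixed phase-difference boundaries, with no frequency dependence. It omits the interpolation and limiting steps. It also corrects only the channel from microphone a, because the description does not say which channel is output. The function name and parameters are hypothetical.

```python
import cmath
from bisect import bisect_right


def process_frame(spec_a, spec_b, boundaries, threshold):
    """Steps S203-S206 for one frame; returns the corrected spectrum of channel a.

    spec_a, spec_b -- complex spectra of the two microphones (output of S202)
    boundaries     -- sorted phase differences separating the areas (assumed fixed)
    threshold      -- non-negative volume above which an area holds a desired signal
    """
    # S203: inter-microphone phase difference of every bin
    diffs = [cmath.phase(a * b.conjugate()) for a, b in zip(spec_a, spec_b)]
    # S204: assign each bin to a sound collection area
    areas = [bisect_right(boundaries, d) for d in diffs]
    # S205: per-area volume (loudest bin), desired-signal test, per-area gain
    volumes = [0.0] * (len(boundaries) + 1)
    for x, area in zip(spec_a, areas):
        volumes[area] = max(volumes[area], abs(x))
    desired = [v > threshold for v in volumes]
    v_max = max((v for v, d in zip(volumes, desired) if d), default=None)
    gains = [v_max / v if d else 1.0 for v, d in zip(volumes, desired)]
    # S206: Y = X * C, bin by bin
    return [x * gains[area] for x, area in zip(spec_a, areas)]


a = [1.0 + 0j, 0.25 + 0j, 0.01 + 0j]
b = [cmath.exp(1j), 0.25 * cmath.exp(-1j), 0.01 + 0j]  # bins at -1, +1 and 0 rad
out = process_frame(a, b, boundaries=[-0.5, 0.5], threshold=0.1)
print([round(abs(y), 6) for y in out])  # [1.0, 1.0, 0.01]
```

In this example the quiet area-3 bin (magnitude 0.25) is raised to the level of the area-1 bin. The area-2 bin lies below the threshold and is left unchanged.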

As described above, the speech processing apparatus 20 according to the present embodiment works on sound collected by two omnidirectional microphones installed apart from each other. It determines the arrival direction of each frequency component from the phase difference between the frequency components divided into predetermined frequency bands. It then derives a volume correction amount for each arrival direction so that the volumes of the audio signals from all arrival directions are equalized. In this way it can correct the volumes of multiple voices superimposed in one signal so that they become equal.
Consequently, in an audio signal containing several superimposed voices, the volume of a particular voice that is hard to hear can be corrected, and the signal can be reproduced so that it is easy to hear.

Furthermore, the speech processing apparatus according to the present embodiment does not separate a signal containing several superimposed voices, acquired by omnidirectional microphones, into individual voices. Instead, it determines the arrival direction by deriving the phase difference between frequency components divided into predetermined frequency bands, and it corrects the volume for each arrival direction. It can therefore correct the volumes of multiple voices easily and with less computation than processing that first separates each voice with a technique such as independent component analysis and then corrects the volume of each voice.
Accordingly, the speech processing apparatus according to the present embodiment can be incorporated, at low cost and with low power consumption, into devices with audio processing functions such as general-purpose terminals and audio conferencing equipment.

[Third Embodiment]
FIG. 9 is a block diagram showing the configuration of a speech processing apparatus according to the third embodiment of the present invention. This apparatus adds a function to those of the speech processing apparatus 20 described in the second embodiment. The added function sets the sound collection areas of the audio signal as appropriate according to the volume of each frequency component of the acquired audio signal.
In the speech processing apparatus 30 according to this embodiment, elements having the same configuration and function as those of the speech processing apparatus 20 of the second embodiment are given the same reference numerals, and their detailed description is omitted.

As shown in FIG. 9, the speech processing apparatus 30 according to the present embodiment comprises four units. The audio signal output unit 210 outputs the frequency components of the sound collected by two omnidirectional microphones 211-a and 211-b installed apart from each other. The arrival direction determination unit 320 determines the arrival direction of each frequency component output from the audio signal output unit 210. The volume correction amount deriving unit 230 derives a correction amount for the volume of the audio signal acquired by the audio signal output unit 210 according to its arrival direction. The volume correction execution unit 240 corrects the volume by applying the correction amount derived by the volume correction amount deriving unit 230 to the audio output by the audio signal output unit 210.

Among these components, the arrival direction determination unit 320 further includes a sound collection area specifying unit 321. This unit specifies the sound collection areas according to the volume level of the frequency band components of the audio signal output from the audio signal output unit 210, as a function of the phase difference between those components.

The sound collection area specifying function of the sound collection area specifying unit 321 will now be described in detail.
The frequency analysis unit 212 outputs, for each predetermined frequency band, the frequency components of the audio signal acquired by the sound collection unit 211. From these components, the sound collection area specifying unit 321 extracts the phase difference between the frequency component of the signal acquired by omnidirectional microphone 211-a and that of the signal acquired by omnidirectional microphone 211-b. It then derives, over the frequency components of the audio signal, the relationship between the extracted phase difference and the volume level (the spectrum or voltage value of the frequency component).

FIG. 10 conceptually illustrates the relationship between phase difference and volume level for the frequency components of the audio signal, as derived by the sound collection area specifying unit 321.
As shown in FIG. 10, the sound collection area specifying unit 321 stores the extracted phase difference and volume level in association with each other for each frequency component of the audio signal acquired by the sound collection unit 211 (marked × in FIG. 10). From these pairs it derives the relationship between phase difference and volume level for the frequency components of the audio signal.

The sound collection area specifying unit 321 specifies the sound collection areas based on the derived relationship between phase difference and volume level for the frequency components of the audio signal.
Specifically, starting from the phase difference and volume level of each frequency component of the audio signal acquired by the sound collection unit 211, the unit derives by interpolation a curve relating phase difference to volume level, as shown by the dash-dot line in FIG. 10. Spline interpolation may be used as the interpolation method.

The sound collection area specifying unit 321 detects the peaks and valleys of the interpolated curve relating phase difference to volume level, and specifies the phase differences at the valleys as the boundary positions of the sound collection areas. For example, as shown in FIG. 10, phase difference D1 is specified as the boundary between sound collection areas 1 and 2, and phase difference D2 as the boundary between areas 2 and 3.
In the example of FIG. 10, valleys are detected after interpolation and specified as boundary points of the sound collection areas. To reduce computation, however, the valleys may instead be detected from neighboring values without interpolation and specified as the boundary points.
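The interpolation-free variant can be sketched as follows. It is one possible reading of "detecting valleys from neighboring values", not the patented procedure. The coarse phase-difference grid, the cell count of 12, the peak rule, and the choice of the middle of the valley floor as the boundary are all assumptions of this illustration. The spline-based version is not shown.

```python
import math


def area_boundaries(phase_diffs, levels, n_cells=12, lo=-math.pi, hi=math.pi):
    """Boundary phase differences taken at the valleys between level peaks.

    The scattered (phase difference, level) points are first reduced to a coarse
    profile: the phase-difference axis is split into n_cells cells and each cell
    keeps its loudest point. Peaks of the profile are treated as sources, and the
    middle of the lowest stretch between two neighboring peaks as a boundary.
    """
    width = (hi - lo) / n_cells
    profile = [0.0] * n_cells
    for d, v in zip(phase_diffs, levels):
        i = min(n_cells - 1, max(0, int((d - lo) / width)))
        profile[i] = max(profile[i], v)
    # A peak is a non-empty cell higher than its left neighbor and not lower than
    # its right neighbor, so a flat top yields a single peak.
    peaks = [i for i in range(n_cells) if profile[i] > 0
             and (i == 0 or profile[i] > profile[i - 1])
             and (i == n_cells - 1 or profile[i] >= profile[i + 1])]
    boundaries = []
    for p, q in zip(peaks, peaks[1:]):
        # Two peaks are never adjacent cells, so the interior p+1..q-1 is non-empty.
        low = min(profile[p + 1:q])
        lowest = [i for i in range(p + 1, q) if profile[i] == low]
        middle = (lowest[0] + lowest[-1]) / 2  # center of the valley floor
        boundaries.append(lo + (middle + 0.5) * width)
    return boundaries


b = area_boundaries([-1.6, -1.5, -1.4, 0.9, 1.0, 1.1], [0.5, 1.0, 0.6, 0.4, 0.8, 0.3])
print([round(x, 4) for x in b])  # [-0.2618], i.e. about -pi/12
```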

Once the sound collection areas have been specified by the sound collection area specifying unit 321, the volume correction amount deriving unit 230 derives, for each sound collection area, the correction amount for the volume of the audio signal acquired via the audio signal output unit 210. The volume correction execution unit 240 then corrects the volume of the audio signal.

As described above, the speech processing apparatus according to the present embodiment specifies the sound collection areas according to the acquired audio signal and corrects the volume for each of these areas. The arrival directions of the sound sources can therefore be identified adaptively from the arrival directions of the collected voices.
Because volume correction can also be applied to moving sound sources, high-quality output audio can be produced.

The present invention can be used in telephone terminals and TV conference systems that carry voice calls, and in recording devices with a voice recording function such as IC recorders.

DESCRIPTION OF SYMBOLS: 10, 20, 30 ... speech processing apparatus; 110, 210 ... audio signal output unit; 120, 220, 320 ... arrival direction determination unit; 321 ... sound collection area specifying unit; 130, 230 ... volume correction amount deriving unit; 231 ... volume estimation unit; 232 ... correction coefficient deriving unit; 140, 240 ... volume correction execution unit.

Claims

A speech processing apparatus comprising:
an audio signal output unit that outputs an arbitrary frequency band component of each of the audio signals acquired by a plurality of omnidirectional microphones arranged apart from one another;
an arrival direction determination unit that determines the arrival direction of the sound collected by the microphones based on the phase difference between the frequency band components of the audio signals output from the audio signal output unit;
a volume correction amount deriving unit that derives a correction amount for the volume of the audio signals according to the arrival direction determined by the arrival direction determination unit; and
a volume correction execution unit that corrects the volume of the audio signals using the correction amount derived by the volume correction amount deriving unit.

The speech processing apparatus according to claim 1, wherein
the arrival direction determination unit determines the arrival direction of the sound collected by the microphones for each sound collection area specified according to the phase difference between the frequency band components, and
the volume correction amount deriving unit derives the correction amount for the volume of the audio signals for each sound collection area, based on the volume levels of the frequency band components whose sound collection areas have been determined by the arrival direction determination unit.

The speech processing apparatus according to claim 2, wherein the arrival direction determination unit specifies the sound collection areas according to the volume level of the frequency band components at each phase difference between arbitrary frequency band components output from the audio signal output unit.

A speech processing method comprising:
an audio signal output step of outputting an arbitrary frequency band component of each of the audio signals acquired by a plurality of omnidirectional microphones arranged apart from one another;
an arrival direction determination step of determining the arrival direction of the sound collected by the microphones based on the phase difference between the frequency band components of the audio signals output in the audio signal output step;
a volume correction amount deriving step of deriving a correction amount for the volume of the audio signals according to the arrival direction determined in the arrival direction determination step; and
a volume correction execution step of correcting the volume of the audio signals using the correction amount derived in the volume correction amount deriving step.

The speech processing method according to claim 4, wherein
the arrival direction determination step determines the arrival direction of the sound collected by the microphones for each sound collection area specified according to the phase difference between the frequency band components, and
the volume correction amount deriving step derives the correction amount for the volume of the audio signals for each sound collection area, based on the volume levels of the frequency band components whose sound collection areas have been determined in the arrival direction determination step.

A speech processing program for causing a computer to execute the speech processing method according to claim 4 or 5.