JP2016042613A

JP2016042613A - Target speech section detector, target speech section detection method, target speech section detection program, audio signal processing device and server

Info

Publication number: JP2016042613A
Application number: JP2014164948A
Authority: JP
Inventors: 克之高橋; Katsuyuki Takahashi
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2014-08-13
Filing date: 2014-08-13
Publication date: 2016-03-31

Abstract

PROBLEM TO BE SOLVED: To provide a target speech section detector, a target speech section detection method, a target speech section detection program, an audio signal processing device, and a server that improve the detection performance for target speech sections by calculating an average coherence with reduced influence of noise, even in a high-noise environment.

SOLUTION: A target speech section detector includes: a coherence coefficient calculation unit 14 that, based on an input sound signal, calculates for each frequency a coherence coefficient between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction; an average coherence calculation unit 15 that determines, for each frequency, how strongly a noise signal component contained in the input sound signal affects that frequency, based on the per-frequency coherence coefficients, and calculates the average coherence using the coherence coefficients in frequency bands where the influence of the noise signal component is small; and a target speech section determination unit 16 that determines, based on the average coherence, whether the current section of the input sound signal belongs to a target speech section.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a target speech section detection device, a target speech section detection method, a target speech section detection program, an audio signal processing device, and a server. It can be applied, for example, to acoustic signal processing in communication devices and servers that handle speech, such as telephones and video conferencing systems.

Mobile terminals (for example, smartphones and mobile phones) and in-vehicle devices are increasingly equipped with speech recognition functions that recognize input speech, voice call functions, and the like, and audio signal processing such as speech recognition is being used in ever harsher noise environments. For an audio signal processing function to maintain its performance in such environments, it is preferable to extract the speech uttered by the user separately from noise. Extracting the speech accurately requires a technique that distinguishes and detects the sections in which the speaker is talking (target speech sections) and the sections in which the speaker is silent and only background noise is present (background noise sections).

Methods for distinguishing target speech sections from background noise sections include detection based on the level difference between the speech signal and the noise signal, and a method that uses coherence, as described in Patent Document 1.

The technique described in Patent Document 1 forms two directivities that have blind spots to the left and right of the microphones, calculates for each frequency band a coherence coefficient corresponding to the correlation between the two resulting signals, and detects target speech sections based on the magnitude of the average coherence, which is the average of the coherence coefficients over all frequency bands. Because the magnitude of the average coherence is a feature directly linked to the arrival direction of the target speech, the technique of Patent Document 1 can be regarded as detecting target speech sections based on the arrival direction of the target speech. Unlike detection based on the signal level difference, it can therefore detect target speech sections even when the target speech is buried in loud noise and the target speech level is hard to distinguish from the noise level.

Patent Document 1: JP 2013-061421 A

However, as noted above, users have in recent years come to use mobile terminals, in-vehicle devices, and similar equipment in increasingly harsh noise environments. When loud noise drives the S/N ratio toward 0 or even below it, the target speech is affected by the noise and its characteristics fade, so that even the method described in Patent Document 1 can suffer degraded detection performance for target speech sections.

For example, when the S/N ratio becomes negative, as inside an automobile traveling at high speed, some of the coherence coefficients calculated for each frequency band are affected by the noise and the characteristics of the target speech fade. The average coherence, obtained by averaging the coherence coefficients over all frequencies, is then also affected by the noise indirectly. The difference in characteristics between target speech sections and noise sections shrinks, and the detection performance for target speech sections degrades.

There is therefore a need for a target speech section detection device, a target speech section detection method, a target speech section detection program, an audio signal processing device, and a server that can detect target speech sections accurately even in a high-noise environment.

The present invention has been made to solve the above problem and adopts the following configurations.

A target speech section detection device according to a first aspect of the present invention comprises: (1) coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed from an input sound signal; (2) average coherence calculation means for determining, for each frequency, how strongly a noise signal component contained in the input sound signal affects that frequency, based on the per-frequency coherence coefficients calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in frequency bands where the influence of the noise signal component is small; and (3) target speech section determination means for determining, based on the average coherence calculated by the average coherence calculation means, whether the current section of the input sound signal belongs to a target speech section.

In a target speech section detection method according to a second aspect of the present invention: (1) coherence coefficient calculation means calculates, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed from an input sound signal; (2) average coherence calculation means determines, for each frequency, how strongly a noise signal component contained in the input sound signal affects that frequency, based on the per-frequency coherence coefficients calculated by the coherence coefficient calculation means, and calculates an average coherence using the coherence coefficients in frequency bands where the influence of the noise signal component is small; and (3) target speech section determination means determines, based on the average coherence calculated by the average coherence calculation means, whether the current section of the input sound signal belongs to a target speech section.

A target speech section detection program according to a third aspect of the present invention causes a computer to function as: (1) coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed from an input sound signal; (2) average coherence calculation means for determining, for each frequency, how strongly a noise signal component contained in the input sound signal affects that frequency, based on the per-frequency coherence coefficients calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in frequency bands where the influence of the noise signal component is small; and (3) target speech section determination means for determining, based on the average coherence calculated by the average coherence calculation means, whether the current section of the input sound signal belongs to a target speech section.

An audio signal processing device according to a fourth aspect of the present invention performs predetermined audio signal processing based on an input sound signal of ambient sound captured by at least two microphones, and comprises: (1) coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed from the input sound signal; (2) average coherence calculation means for determining, for each frequency, how strongly a noise signal component contained in the input sound signal affects that frequency, based on the per-frequency coherence coefficients calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in frequency bands where the influence of the noise signal component is small; and (3) target speech section determination means for determining, based on the average coherence calculated by the average coherence calculation means, whether the current section of the input sound signal belongs to a target speech section.

A server according to a fifth aspect of the present invention performs predetermined audio signal processing based on an input sound signal of ambient sound captured by at least two microphones, and comprises: (1) coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed from the input sound signal; (2) average coherence calculation means for determining, for each frequency, how strongly a noise signal component contained in the input sound signal affects that frequency, based on the per-frequency coherence coefficients calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in frequency bands where the influence of the noise signal component is small; and (3) target speech section determination means for determining, based on the average coherence calculated by the average coherence calculation means, whether the current section of the input sound signal belongs to a target speech section.

According to the present invention, the average coherence can be calculated with reduced influence of noise even in a high-noise environment, which improves the detection performance for target speech sections.

A block diagram showing the configuration of the target speech section detection device according to the embodiment.
An explanatory diagram outlining the general characteristics of the target speech and the noise signal in a high-noise environment.
A block diagram showing the configuration of the average coherence calculation unit according to the embodiment.
A block diagram showing the configuration of the target speech section determination unit according to the embodiment.
A flowchart showing an example of the average coherence calculation process in the average coherence calculation unit 15 according to the embodiment.

(A) Main Embodiment
Embodiments of the target speech section detection device, target speech section detection method, target speech section detection program, audio signal processing device, and server according to the present invention are described in detail below with reference to the drawings.

(A-1) Configuration of the Embodiment
The target speech section detection device according to this embodiment has a pair of microphones that are either built in or externally attached. It can be applied widely to equipment that has a built-in or external pair of microphones and performs audio signal processing on the sound they pick up, such as smartphones, tablet terminals, video conferencing equipment, and in-vehicle devices.

The "audio signal processing device" recited in the claims has an audio signal processing function that uses an input sound signal of ambient sound captured by at least two microphones. Examples include mobile terminals (a concept covering smartphones, tablet terminals, mobile phones, and the like), notebook personal computers, personal computers, game terminals, video conferencing equipment, and in-vehicle devices.

The following description uses, as an example, a target speech section detection device according to this embodiment that is equipped with a built-in pair of microphones.

FIG. 1 is a block diagram showing the configuration of the target speech section detection device 1 according to this embodiment.

The target speech section detection device 1 according to this embodiment may be built by connecting various hardware components. Alternatively, some of its components (for example, everything other than the speaker, the microphones, the analog/digital (A/D) converter, and the digital/analog (D/A) converter) may be implemented as functions realized by a program execution configuration such as a CPU, ROM, and RAM. Whichever construction is used, the detailed functional configuration of the target speech section detection device 1 is as shown in FIG. 1. When a program is used, it may be written into the memory of the target speech section detection device 1 when the device is shipped, or it may be installed by download. As an example of the latter, the program may be prepared as a smartphone application that users who need it download over the Internet and install.

In FIG. 1, the target speech section detection device 1 according to this embodiment has a microphone m_1, a microphone m_2, an FFT (fast Fourier transform) unit 11, a first directivity forming unit 12, a second directivity forming unit 13, a coherence coefficient calculation unit 14, an average coherence calculation unit 15, and a target speech section determination unit 16.

The microphones m_1 and m_2 each capture ambient sound and convert it into an electrical (analog) signal. They preferably have a directivity that mainly captures sound arriving from the front. The microphones m_1 and m_2 are connected to the FFT unit 11 through an A/D converter (not shown). The input sound signals captured by the microphones m_1 and m_2 are converted by the A/D converter into the digital signals s1(n) and s2(n), respectively, and supplied to the FFT unit 11. The microphones m_1 and m_2 may, for example, be provided in the housing of the equipment in which the target speech section detection device 1 is installed, or they may be external microphones connected to that equipment.

The FFT unit 11 converts the digital signals s1(n) and s2(n) of the input sound captured by the microphones m_1 and m_2 from the time domain to the frequency domain, producing the frequency-domain signals X1(f, K) and X2(f, K). Here "n" is a parameter representing time, "f" is a parameter representing frequency, and "K" is a parameter representing the frame number of the analysis frame. For example, the FFT unit 11 treats a predetermined number N of samples of the input signal s1(n) as one analysis frame and applies a fast Fourier transform to each analysis frame, converting the input signal s1(n) into the frequency-domain signal X1(f, K). In the following, "K" is sometimes omitted where the order of the frames is not relevant.
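The framing and transform step can be sketched as follows. This is a minimal illustration, not the device's implementation: it substitutes a naive O(N^2) DFT for the FFT so that only the Python standard library is needed, and it assumes non-overlapping frames with no window function, neither of which is stated in the text. The function names `dft` and `to_frequency_domain` are hypothetical.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one analysis frame (stand-in for an FFT)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]

def to_frequency_domain(s, frame_len):
    """Split s(n) into non-overlapping frames of frame_len samples (an assumption)
    and return X[K][f], the spectrum of frame K at frequency bin f."""
    n_frames = len(s) // frame_len
    return [dft(s[k * frame_len:(k + 1) * frame_len]) for k in range(n_frames)]

# Example input: a cosine with exactly 2 cycles per 8-sample frame, 2 frames long.
s1 = [math.cos(2 * math.pi * 2 * n / 8) for n in range(16)]
X1 = to_frequency_domain(s1, frame_len=8)
# All of the energy of each frame lands in bins 2 and 6; the other bins are ~0.
```

A practical implementation would normally use an FFT library routine and overlapping, windowed frames; those choices do not change the roles of f and K described above.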

The first directivity forming unit 12 and the second directivity forming unit 13 each apply delay-and-subtract processing to the two frequency-domain signals from the FFT unit 11 to form a directivity that has a blind spot in a predetermined direction. They supply the resulting signals B1(f) and B2(f), each of which has a blind spot in its predetermined direction, to the coherence coefficient calculation unit 14.

The first directivity forming unit 12 calculates, in accordance with Equation (1), a signal B1(f) that has strong directivity toward, for example, the right of the front direction, based on the two frequency-domain signals X1(f, K) and X2(f, K) from the FFT unit 11.

The second directivity forming unit 13 calculates, in accordance with Equation (2), a signal B2(f) that has strong directivity toward, for example, the left of the front direction, based on the two frequency-domain signals X1(f, K) and X2(f, K) from the FFT unit 11. The signals B1(f) and B2(f) are complex-valued.
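Equations (1) and (2) are not reproduced in this text, so the sketch below shows only a generic delay-and-subtract beamformer of the kind the description names. The specific form (subtracting a copy of one channel delayed by `tau` samples from the other), the parameter names `tau` and `n_fft`, and the mapping of each null to a microphone side are assumptions made for illustration.

```python
import cmath
import math

def form_directivities(X1, X2, tau, n_fft):
    """Delay-and-subtract on one frame (assumed form, not the patent's equations).

    X1, X2: complex spectra indexed by frequency bin f.
    tau:    assumed inter-microphone delay in samples for an endfire source.
    Returns (B1, B2): B1 cancels a source that reaches microphone 1 first,
    B2 cancels a source that reaches microphone 2 first.
    """
    B1, B2 = [], []
    for f, (x1, x2) in enumerate(zip(X1, X2)):
        delay = cmath.exp(-2j * math.pi * f * tau / n_fft)  # phase shift of a tau-sample delay
        B1.append(x2 - x1 * delay)
        B2.append(x1 - x2 * delay)
    return B1, B2

# Example: a source on the microphone-1 side, so microphone 2 hears it 1 sample later.
n_fft, tau = 8, 1
X1 = [1 + 0j] * n_fft
X2 = [x * cmath.exp(-2j * math.pi * f * tau / n_fft) for f, x in enumerate(X1)]
B1, B2 = form_directivities(X1, X2, tau, n_fft)
# B1 is ~0 in every bin (the source sits in its blind spot); B2 is not.
```

For a source directly in front, X1 and X2 are equal, so B1 and B2 come out identical; this is why the correlation between the two signals carries information about the arrival direction.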

The coherence coefficient calculation unit 14 uses the signals B1(f, K) and B2(f, K) obtained by the first directivity forming unit 12 and the second directivity forming unit 13 to calculate the coherence coefficient cor(f, K) for each frequency in accordance with Equation (3). In Equation (3), B2(f)* denotes the complex conjugate of B2(f). The coherence coefficient calculation unit 14 supplies the resulting coherence coefficients cor(f, K) to the average coherence calculation unit 15.
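Equation (3) is likewise missing from this text. One common normalized cross-correlation that uses the complex conjugate of B2(f) is sketched below purely as an illustration; the normalization by the mean power of the two signals, the use of the real part, and the small constant `eps` are all assumptions, and `coherence_coefficients` is a hypothetical name.

```python
def coherence_coefficients(B1, B2, eps=1e-12):
    """Per-bin coherence coefficient (assumed form):
    Re(B1 * conj(B2)) / (0.5 * (|B1|^2 + |B2|^2))."""
    cor = []
    for b1, b2 in zip(B1, B2):
        cross = (b1 * b2.conjugate()).real           # correlation of the two beams
        power = 0.5 * (abs(b1) ** 2 + abs(b2) ** 2)   # mean power, used to normalize
        cor.append(cross / (power + eps))             # eps guards against division by zero
    return cor

# Bin 0: identical beams (frontal source)   -> coefficient close to 1.
# Bin 1: one beam is zero (source in a null) -> coefficient 0.
cor = coherence_coefficients([1 + 1j, 2 + 0j], [1 + 1j, 0j])
```

With this form the coefficient is bounded by 1 in magnitude, which makes values from different frequency bins comparable before they are averaged.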

In this embodiment, the coherence coefficient calculation unit 14 does not itself calculate the coherence AVE_COR using Equation (4). The formula is nevertheless given as Equation (4) because the later description refers to this coherence. The coherence AVE_COR in Equation (4) is the average of the coherence coefficients cor(f) over all frequencies f1 to fm.
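Read literally, an average over all frequencies f1 to fm is a plain arithmetic mean. Under that reading, and with the index i introduced here only for illustration, Equation (4) would take the following form:

```latex
\mathrm{AVE\_COR} \;=\; \frac{1}{m} \sum_{i=1}^{m} \mathrm{cor}(f_i)
```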

The average coherence calculation unit 15 determines the magnitude of the noise influence for each frequency based on the coherence coefficients cor(f, K) obtained by the coherence coefficient calculation unit 14, and calculates the average coherence AVE_COR(K) using only the coherence coefficients of the frequency bands in which the noise influence is small.

The average coherence calculation unit 15 is now described. In a high-noise environment, such as the interior of a moving automobile, the target speech is buried in noise. FIG. 2 is an explanatory diagram outlining the general characteristics of the target speech and the noise signal in such an environment. In FIG. 2, the horizontal axis indicates frequency and the vertical axis indicates signal power. As FIG. 2 shows, the power of the noise is concentrated in the low band and is small in the high band, so the amount of noise signal component differs from one frequency band to another. As a result, there are bands in which the noise signal strongly affects the speech signal and bands in which it affects it only slightly.

The per-frequency coherence coefficients have the following characteristics: (a) in frequency bands where the influence of the noise signal component is large, the characteristics of the target speech fade, so the value of the coherence coefficient shows no large variation whether or not target speech is present; and (b) in frequency bands where the influence of the noise signal component is small, the characteristics of the target speech remain, so the coherence coefficient changes sharply in sections where target speech is present.

In this embodiment, therefore, the average coherence calculation unit 15 determines for each frequency whether the influence of the noise signal component is large, based on these characteristics of the per-frequency coherence coefficients. The average coherence calculation unit 15 discards the coherence coefficients of frequency bands in which the influence of the noise signal component is large, so that they do not contribute to the coherence calculation, and uses only the coherence coefficients of frequency bands in which the noise influence is small. This allows the coherence to be calculated with reduced influence of the noise signal component even in a high-noise environment, improving the detection performance for target speech sections.

FIG. 3 is a block diagram showing the configuration of the average coherence calculation unit 15 according to this embodiment. In FIG. 3, the average coherence calculation unit 15 has a long-term average value calculation unit 151, a noise influence degree determination unit 152, an addition unit 153, a counter unit 154, an average coherence calculation unit 155, and a per-frequency long-term average value storage unit 156.

The long-term average value calculation unit 151 uses the coherence coefficient cor(f, K) of each frequency obtained by the coherence coefficient calculation unit 14 to calculate, for each frequency, the long-term average value long_cor(f, K) of the coherence coefficient.
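The update rule for the long-term average is not given in this part of the text. A recursive exponential moving average is one simple choice that needs only the previous long-term value per frequency, and it is sketched here as an assumption; the smoothing constant `alpha` and the function name are hypothetical.

```python
def update_long_term_average(prev_long_cor, cor, alpha=0.1):
    """Per-frequency exponential moving average (assumed update rule):
    long_cor(f, K) = alpha * cor(f, K) + (1 - alpha) * long_cor(f, K - 1)."""
    return [alpha * c + (1.0 - alpha) * p for p, c in zip(prev_long_cor, cor)]

# Two frequency bins: the first jumps from 0.0 to 1.0, the second stays at 0.5.
long_cor = update_long_term_average([0.0, 0.5], [1.0, 0.5], alpha=0.1)
# The first bin moves only a tenth of the way toward the new value;
# the second bin is unchanged.
```

A small `alpha` makes the long-term value track slow trends while ignoring frame-to-frame jumps, which is what lets it serve as a baseline for the instantaneous coefficient.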

The noise influence degree determination unit 152 compares the ratio between the long-term average value long_cor(f, K) of the coherence coefficient of each frequency, obtained by the long-term average value calculation unit 151, and the coherence coefficient cor(f, K) with a predetermined threshold Θ to determine the degree of noise influence for each frequency. This embodiment illustrates the case in which the noise influence degree determination unit 152 uses the ratio between long_cor(f, K) and cor(f, K), but the determination is not limited to a ratio. The difference between long_cor(f, K) and cor(f, K) may be calculated instead and compared with a threshold.

The following explains why this determination method can estimate the magnitude of the noise influence. As described above, in a high-noise environment, in frequency bands where the influence of the noise signal component is large, the target speech signal component is buried in the noise signal component, the characteristics of the target speech signal fade, and the value of the coherence coefficient shows no large variation. In frequency bands where the influence of the noise signal component is small, by contrast, the characteristics of the target speech signal remain, so the coherence coefficient changes sharply under the influence of the target speech signal component.

For each frequency, the noise influence degree determination unit 152 therefore compares the ratio or difference between the long-term average value long_cor(f, K) of the coherence coefficient and the coherence coefficient cor(f, K) with the predetermined threshold Θ. When the ratio or difference is equal to or greater than Θ, it determines that the contribution of the signal component originating from the target speech is large and the influence of the noise signal component is small. When the ratio or difference is less than Θ, it determines that the contribution of the signal component originating from the target speech is small and the influence of the noise signal component is large.
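The threshold test can be sketched as below. The text does not say which quantity is the numerator of the ratio; this sketch assumes cor(f, K) divided by long_cor(f, K), so that a sharp rise of the instantaneous coefficient above its long-term baseline yields a large ratio. It also assumes a positive long-term average. The names `is_noise_influence_small`, `theta`, and `eps` are hypothetical.

```python
def is_noise_influence_small(cor, long_cor, theta, eps=1e-12):
    """Return True when the instantaneous coefficient departs from its long-term
    average by at least theta (ratio form; direction of the ratio is assumed)."""
    ratio = cor / max(long_cor, eps)  # eps avoids division by zero
    return ratio >= theta

# A bin whose coefficient jumps to 3x its baseline is kept;
# a bin that barely moves is treated as noise-dominated.
kept = is_noise_influence_small(cor=0.9, long_cor=0.3, theta=2.0)
rejected = is_noise_influence_small(cor=0.31, long_cor=0.3, theta=2.0)
```

The difference form mentioned in the text would replace the division with a subtraction and use a threshold on the same scale as the coefficients.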

The addition unit 153 adds only the coherence coefficients of the frequencies for which the noise influence degree determination unit 152 has determined that the influence of the noise signal component is small. Because the sum of the coherence coefficients at frequencies with small noise influence is obtained for each frame, the addition unit 153 initializes the sum for each frame.

The counter unit 154 counts the number of coherence coefficients added by the addition unit 153. That is, the counter unit 154 increments the counter value every time the addition unit 153 adds a coherence coefficient. Because the number of added coherence coefficients is counted for each frame, the counter unit 154 initializes the counter value for each frame.

The average coherence calculation unit 155 calculates the average coherence AVE_COR(K) by dividing the sum of the coherence coefficients obtained by the addition unit 153 by the counter value counted by the counter unit 154. The average coherence AVE_COR(K) obtained by the average coherence calculation unit 155 is given to the target speech section determination unit 16 as the output of the average coherence calculation unit 15.
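The cooperation of the addition unit 153, the counter unit 154 and the average coherence calculation unit 155 might be modelled as one small accumulator that is reset per frame. The class and method names below are hypothetical, and returning 0.0 when no coefficient was added is an assumption; the text does not say what happens when the counter is zero.

```python
class SelectiveAverager:
    """Accumulates selected coherence coefficients for one frame."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Per-frame initialization of the sum (unit 153) and counter (unit 154).
        self.total = 0.0
        self.count = 0

    def add(self, cor):
        # Called only for frequencies judged to have small noise influence.
        self.total += cor
        self.count += 1

    def average(self):
        # Division performed by unit 155.
        if self.count == 0:
            return 0.0  # assumption: no selected frequency yields 0.0
        return self.total / self.count
```

A caller would invoke `reset()` at the start of each frame, `add()` once per selected frequency, and `average()` after the last frequency.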

The per-frequency long-term average value storage unit 156 stores, for each frequency, the past long-term average value of the coherence coefficient, which the long-term average value calculation unit 151 uses when calculating the long-term average value of the coherence coefficient for each frequency.

The target speech section determination unit 16 determines the target speech section based on the average coherence AVE_COR(K) obtained by the average coherence calculation unit 15.

FIG. 4 is a block diagram showing the configuration of the target speech section determination unit 16 according to this embodiment. In FIG. 4, the target speech section determination unit 16 includes an average coherence acquisition unit 161, a threshold comparison determination unit 162, and a determination result output unit 163.

The average coherence acquisition unit 161 acquires the average coherence AVE_COR(K) obtained by the average coherence calculation unit 15.

The threshold comparison determination unit 162 compares the average coherence AVE_COR(K) acquired by the average coherence acquisition unit 161 with a target speech section determination threshold. When AVE_COR(K) is larger than the threshold, the unit determines that the frame is a target speech section; otherwise, it determines that the frame is a background noise section.

When the threshold comparison determination unit 162 determines a target speech section, the determination result output unit 163 assigns "1" to the variable res that stores the determination result and outputs it to the subsequent component. When a background noise section is determined, the unit assigns "0" to res and outputs it to the subsequent component.
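The frame-level decision made by units 161 to 163 could be sketched as below. The function and parameter names are hypothetical; the sketch uses the strict "larger than" comparison stated for the threshold comparison determination unit 162, and the threshold value itself is not given in the text.

```python
def decide_section(ave_cor, section_threshold):
    """Return res: 1 for a target speech section, 0 for background noise."""
    # Unit 162: strict comparison against the section determination threshold.
    is_target_speech = ave_cor > section_threshold
    # Unit 163: encode the result in the variable res.
    res = 1 if is_target_speech else 0
    return res
```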

(A-2) Operation of the Embodiment
Next, the processing operation of the target speech section detection method in the target speech section detection device 1 according to the embodiment will be described in detail with reference to the drawings.

The input sound signals (analog signals) captured by the pair of microphones m_1 and m_2 are converted into digital signals by an A/D converter (not shown), and the digital signals s1(n) and s2(n) are given to the FFT unit 11.

In the FFT unit 11, the digital signals s1(n) and s2(n) are each converted from the time domain to the frequency domain, and the frequency-domain signals X1(f, K) and X2(f, K) are given to the first directivity forming unit 12 and the second directivity forming unit 13.

The first directivity forming unit 12 and the second directivity forming unit 13 generate signals B1(f, K) and B2(f, K), each having a blind spot in a predetermined direction, and these signals are given to the coherence coefficient calculation unit 14.

The coherence coefficient calculation unit 14 calculates the coherence coefficient cor(f, K) according to Equation (3), based on the signal B1(f, K) from the first directivity forming unit 12 and the signal B2(f, K) from the second directivity forming unit 13. The obtained coherence coefficient cor(f, K) is given to the average coherence calculation unit 15.

The average coherence calculation unit 15 determines the strength of the noise influence for each frequency based on the coherence coefficient cor(f, K) of each frequency, and calculates the average coherence AVE_COR(K) using only the coherence coefficients of the bands where the noise influence is small.

FIG. 5 is a flowchart showing an operation example of the average coherence calculation process in the average coherence calculation unit 15 according to the embodiment.

In S101, the average coherence AVE_COR(K) and the counter value (COUNT), which indicates the number of coherence coefficients at frequencies where the noise influence is small, are initialized.

Next, in order to determine the magnitude of the noise influence for every frequency, the processing of S102 to S106 is looped over the frequencies. In S102, the loop starts (START) from a predetermined frequency bin f; when the processing for that frequency bin ends, the value of the frequency bin f is incremented (written "f++" in FIG. 5), and the processing is repeated until END.

In S103, the long-term average value long_cor(f, K) of the coherence coefficient at the frequency is calculated. Equation (5) can be used to calculate long_cor(f, K).

Equation (5) is a relational expression that calculates the long-term average value long_cor(f, K) by taking a weighted average of the past long-term average value long_cor(f, K-1) of the coherence coefficient at the frequency and the current coherence coefficient cor(f, K).
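The image of Equation (5) is not reproduced in this text. A form consistent with the surrounding description, assuming the weight α is applied to the current coefficient and 1 − α to the past long-term average, would be:

```latex
\mathrm{long\_cor}(f, K) = \alpha \cdot \mathrm{cor}(f, K) + (1 - \alpha) \cdot \mathrm{long\_cor}(f, K-1) \qquad (5)
```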

Here, α is a value representing the weights given to the long-term average value long_cor(f, K-1) and the current coherence coefficient cor(f, K), and can take any value satisfying 0 < α < 1. For example, when α is close to 0, the calculated long-term average value long_cor(f, K) is influenced more strongly by the past long-term average value long_cor(f, K-1). When α is close to 1, the calculated long_cor(f, K) is influenced more strongly by the coherence coefficient cor(f, K) of the current frame. Note that α may be a fixed value or a variable value. Furthermore, α may be the same for every frequency or may differ from frequency to frequency.
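The effect of α can be tried out with a short sketch. It assumes the weighted average takes the usual exponentially weighted form, with α on the current coefficient and 1 − α on the past average; this is inferred from the description of α above, since the equation itself is not reproduced here. The function name is hypothetical.

```python
def update_long_term_average(long_prev, cor, alpha):
    """Weighted average of the past long-term value and the current coefficient.

    long_prev: long_cor(f, K-1)
    cor:       cor(f, K)
    alpha:     weight with 0 < alpha < 1 (assumed to weight the current frame)
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    return alpha * cor + (1.0 - alpha) * long_prev
```

With long_prev = 0.0 and cor = 1.0, a small α such as 0.1 keeps the result near the past value (about 0.1), while α = 0.9 moves it close to the current coefficient (about 0.9).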

The past long-term average value long_cor(f, K-1) of the coherence coefficient in Equation (5) may be calculated using coherence coefficients over an arbitrary number of frames. This number of frames may differ from frequency to frequency.

This embodiment illustrates the case where the long-term average value long_cor(f, K) of the coherence coefficient is calculated using Equation (5), but any other calculation method may be used. For example, an arithmetic mean may be used. In that case, setting α = 0.5 in Equation (5), for example, gives the past long-term average value long_cor(f, K-1) and the current coherence coefficient cor(f, K) the same degree of influence when calculating the long-term average value long_cor(f, K) of the current frame.

In S104, in order to detect a sudden change in the value of the coherence coefficient, the ratio between the long-term average value long_cor(f, K) calculated in S103 and the coherence coefficient cor(f, K) of the current frame is taken and compared with the threshold Θ, as in Equation (6). If the ratio is equal to or greater than Θ, the influence of the target speech is determined to be large, and the process proceeds to S105. If the ratio is less than Θ, the influence of noise is determined to be large and the influence of the target speech small, and the process proceeds to S106.

long_cor(f, K) / cor(f, K) ≥ Θ … (6)
In Equation (6), the threshold Θ can be set to any value; for example, it may be a fixed value or a variable value. Furthermore, the threshold Θ may be the same for every frequency or may differ from frequency to frequency.

In S105, when the ratio is equal to or greater than the threshold Θ and the influence of the target speech is determined to be large (that is, the influence of noise is determined to be small), the coherence coefficient cor(f, K) of the frequency band is added to the intermediate variable AVE_COR(K) that stores the average coherence, and the counter value that counts the number of coherence coefficients is incremented (written "COUNT++" in FIG. 5).

In S105, the coherence coefficient cor(f, K) of a frequency band for which the ratio is equal to or greater than the threshold Θ, and for which the influence of the target speech is therefore determined to be large, is added to the average coherence AVE_COR(K). The coherence coefficient of a frequency band for which the ratio is less than Θ, and for which the influence of the target speech is therefore determined to be small, is not added and does not contribute to AVE_COR(K). The processing of S102 to S106 is looped until it has been completed for all frequencies.

In S107, the sum accumulated in AVE_COR(K) is divided by the counter value (COUNT) to obtain the average coherence AVE_COR(K). The obtained average coherence AVE_COR(K) is given to the target speech section determination unit 16.
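Steps S101 to S107 for a single frame can be put together in one short routine. This is a sketch under stated assumptions, not the flowchart of FIG. 5 itself: the function name and list-based interface are hypothetical, the long-term update uses an assumed exponentially weighted form for Equation (5), S106 is taken to be the end of the frequency loop, and the guards against a zero coefficient and a zero COUNT are additions the text does not describe.

```python
def average_coherence_for_frame(cor, long_prev, alpha, theta):
    """Run one pass of S101-S107 for frame K.

    cor:       list of cor(f, K), one value per frequency bin
    long_prev: list of long_cor(f, K-1), same length as cor
    Returns (ave_cor, long_now), where long_now holds long_cor(f, K).
    """
    ave_cor = 0.0          # S101: initialize the accumulator
    count = 0              # S101: initialize COUNT
    long_now = []
    for f in range(len(cor)):                      # S102: loop over bins
        # S103: long-term average (assumed weighted-average form)
        long_f = alpha * cor[f] + (1.0 - alpha) * long_prev[f]
        long_now.append(long_f)
        # S104: ratio test; a zero coefficient is skipped (assumption)
        if cor[f] != 0.0 and long_f / cor[f] >= theta:
            ave_cor += cor[f]                      # S105: accumulate
            count += 1                             # S105: COUNT++
        # S106: loop end (assumed), continue with the next bin
    if count > 0:                                  # S107: divide by COUNT
        ave_cor /= count
    return ave_cor, long_now
```

Calling it with cor = [0.2, 0.5, 0.4], long_prev = [0.9, 0.5, 1.0], α = 0.5 and Θ = 1.5 keeps the first and third bins (ratios 2.75 and 1.75) and drops the second (ratio 1.0), giving an average of about 0.3. The returned `long_now` would be stored for the next frame, in the role of the storage unit 156.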

In S108, the analysis frame index K is incremented (written "K++" in FIG. 5), and the processing is repeated for the next frame.

The target speech section determination unit 16 compares the average coherence AVE_COR(K) calculated by the average coherence calculation unit 15 with a predetermined threshold. If AVE_COR(K) is equal to or greater than the threshold, the frame is determined to be a target speech section; if AVE_COR(K) is less than the threshold, the frame is determined to be a background noise section. The target speech section determination unit 16 then assigns "1" to the variable res that stores the determination result for a target speech section, or "0" for a background noise section, and the determination result is given to the subsequent component.

(A-3) Effects of the First Embodiment
As described above, according to the first embodiment, even in a large-noise environment, frequency bands where the influence of the noise signal component is small can be selected, and the average coherence can be calculated with only the coherence coefficients in those frequency bands contributing. This improves the detection performance of the target speech section under large noise.

(B) Other Embodiments
Various modified embodiments have already been mentioned in the embodiment described above, and the present invention can also be applied to the following modified embodiments.

(B-1) In the embodiment described above, applying the present invention to a communication device such as a video conference system or a mobile phone improves the detection performance of the target speech section, so improvements in call sound quality and in speech recognition can be expected.

The embodiment described above illustrated a large-noise environment inside a moving vehicle such as an automobile or a train. However, the large-noise environment is intended to mean an environment in which the power of the noise signal component has a strong influence in the low range and tends to decrease as the frequency increases. It is therefore not limited to the inside of a vehicle: it may be a place where automobiles, trains, or the like run right next to a device user who is outdoors, and the same effect as in the embodiment described above can be obtained at an airfield, under a guardrail, and in similar places.

(B-2) The embodiment described above illustrated the case where the average coherence calculation unit determines the strength of the influence of the noise signal component based on the coherence coefficient for each frequency, but the determination may instead be made using modGI, a modified version of the gradient index (GI).

(B-3) The embodiment described above showed the audio signal processing device executing all of the processing by itself, but the detection processing of the target speech section and the like may be delegated to an external server. For example, when the audio signal processing device is a smartphone or the like, the system may be configured as a so-called cloud system in which the input sound signal acquired by the audio signal processing device is transmitted to an external server and the external server performs the target speech section detection processing. The "server" in the claims includes a server constituting such a cloud system.

(B-4) The embodiment described above showed a device and a program that immediately process the input sound signals captured by a pair of microphones, but the present invention can also be applied when the signals captured by the pair of microphones are recorded on a recording medium and then played back.

(B-5) The embodiment described above illustrated the case where the audio signal processing device has two microphones as a pair, but the audio signal processing device may have three or more microphones. Even in that case, the present invention can be applied by forming, based on the input sound signals captured by the microphones, a plurality of directivity signals each having a blind spot in a predetermined direction.

Description of reference signs: 1 ... target speech section detection device, m_1 and m_2 ... microphones, 11 ... FFT (fast Fourier transform) unit, 12 ... first directivity forming unit, 13 ... second directivity forming unit, 14 ... coherence coefficient calculation unit, 15 ... average coherence calculation unit, 16 ... target speech section determination unit.

Claims

A target speech section detection device comprising:
coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed based on an input sound signal;
average coherence calculation means for determining, for each frequency, the strength of the influence of a noise signal component contained in the input sound signal based on the coherence coefficient for each frequency calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in a frequency band where the influence of the noise signal component is small; and
target speech section determination means for determining whether or not the section of the input sound signal belongs to a target speech section based on the average coherence calculated by the average coherence calculation means.

The target speech section detection device according to claim 1, wherein the average coherence calculation means includes:
a long-term average value calculation unit that calculates a long-term average value of the coherence coefficient for each frequency;
a noise influence degree determination unit that, for each frequency, compares the ratio or difference between the long-term average value and the coherence coefficient with a predetermined threshold, determines that the frequency band is one where the influence of the noise signal component is weak when the ratio or the difference is equal to or greater than the predetermined threshold, and otherwise determines that the frequency band is one where the influence of the noise signal component is strong; and
an average coherence calculation unit that calculates the average coherence by dividing the sum of the coherence coefficients in the frequency bands determined by the noise influence degree determination unit to have a small influence of the noise signal component by a value indicating the number of coherence coefficients added.

The target speech section detection device according to claim 1 or 2, further comprising:
frequency analysis means for converting the input sound signal from the time domain to the frequency domain;
first directivity forming means for applying delay-subtraction processing to the frequency-domain signal obtained by the frequency analysis means to form the first directivity signal and give it to the coherence coefficient calculation means; and
second directivity forming means for applying delay-subtraction processing to the frequency-domain signal obtained by the frequency analysis means to form the second directivity signal and give it to the coherence coefficient calculation means.

A target speech section detection method, wherein:
coherence coefficient calculation means calculates, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed based on an input sound signal;
average coherence calculation means determines, for each frequency, the strength of the influence of a noise signal component contained in the input sound signal based on the coherence coefficient for each frequency calculated by the coherence coefficient calculation means, and calculates an average coherence using the coherence coefficients in a frequency band where the influence of the noise signal component is small; and
target speech determination means determines whether or not the section of the input sound signal belongs to a target speech section based on the average coherence calculated by the average coherence calculation means.

A target speech section detection program causing a computer to function as:
coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed based on an input sound signal;
average coherence calculation means for determining, for each frequency, the strength of the influence of a noise signal component contained in the input sound signal based on the coherence coefficient for each frequency calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in a frequency band where the influence of the noise signal component is small; and
target speech section determination means for determining whether or not the section of the input sound signal belongs to a target speech section based on the average coherence calculated by the average coherence calculation means.

An audio signal processing device that performs predetermined audio signal processing based on input sound signals of ambient sound captured by at least two microphones, the device comprising:
coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed based on the input sound signal;
average coherence calculation means for determining, for each frequency, the strength of the influence of a noise signal component contained in the input sound signal based on the coherence coefficient for each frequency calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in a frequency band where the influence of the noise signal component is small; and
target speech section determination means for determining whether or not the section of the input sound signal belongs to a target speech section based on the average coherence calculated by the average coherence calculation means.

A server that performs predetermined audio signal processing based on input sound signals of ambient sound captured by at least two microphones, the server comprising:
coherence coefficient calculation means for calculating, for each frequency, a coherence coefficient reflecting the correlation between a first directivity signal having a blind spot in a first predetermined direction and a second directivity signal having a blind spot in a second predetermined direction, each formed based on the input sound signal;
average coherence calculation means for determining, for each frequency, the strength of the influence of a noise signal component contained in the input sound signal based on the coherence coefficient for each frequency calculated by the coherence coefficient calculation means, and for calculating an average coherence using the coherence coefficients in a frequency band where the influence of the noise signal component is small; and
target speech section determination means for determining whether or not the section of the input sound signal belongs to a target speech section based on the average coherence calculated by the average coherence calculation means.