JP2007065204A

JP2007065204A - Reverberation removing apparatus, reverberation removing method, reverberation removing program, and recording medium thereof

Info

Publication number: JP2007065204A
Application number: JP2005250053A
Authority: JP
Inventors: Keisuke Kinoshita; 慶介木下; Tomohiro Nakatani; 智広中谷; Masato Miyoshi; 正人三好
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-08-30
Filing date: 2005-08-30
Publication date: 2007-03-15

Abstract

PROBLEM TO BE SOLVED: To remove reverberation with high accuracy, using only a small amount of data, even in an environment with a long reverberation time.
SOLUTION: A rear reverberation power estimating part 10c extracts a minimum value or a pseudo minimum value of the power (gain) of a sound signal as a power estimation value of the rear reverberation sound component, and a rear reverberation removing part 10d uses this power estimation value to calculate estimation values of the direct sound component and the initial reflected sound component of the sound signal. Then, a reverberation removing filter calculation part 10g calculates an estimated inverse filter using the estimation values of the direct sound component and the initial reflected sound component as reference signals, and a reverberation removing filter multiplication part 10h multiplies the estimated inverse filter by the sound signal power and outputs the result.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a dereverberation technique for extracting, from an acoustic signal accompanied by reverberation, an acoustic signal from which the reverberation has been removed.

When an acoustic signal is observed in a reverberant environment, it is observed as a signal in which reverberation is superimposed on the original signal. When the acoustic signal is speech, the superimposed reverberation component degrades the clarity of the speech. This makes it difficult to extract the properties of the original speech signal, and the recognition rate of an automatic speech recognition (hereinafter, speech recognition) system drops significantly. In such a case, removing the superimposed reverberation by dereverberation processing restores the original sound quality and clarity and improves the speech recognition rate.
A conventional example of the dereverberation method (see Patent Document 1) will be described with reference to FIG. 13. In the conventional example, speech with added reverberation is first observed using a single microphone. Next, the harmonic structure (voiced sound portion), which is a time-frequency characteristic of the observed sound, is extracted by the harmonic structure extraction unit 501. The dereverberation filter calculation unit 502 then calculates an inverse filter using the harmonic structure as the reference signal for inverse filter estimation, and the dereverberation filter multiplication unit 503 performs the dereverberation processing.
Patent Document 1: JP 2004-109742 A

However, with the conventional dereverberation method, it is difficult to perform dereverberation with high accuracy using only a small amount of data in a reverberant environment with a long reverberation time.
That is, the inverse filter is estimated by learning from a "reference signal" extracted from the real environment. The closer this "reference signal" is to the original clean acoustic signal, the more accurately the inverse filter can be estimated. In the conventional method, however, the "reference signal" is estimated, and the inverse filter derived from it, under the rather strained assumption that the acoustic signal consists only of voiced sounds. For this reason, the inverse filter estimated by the conventional method contains a considerable amount of error. As a result, the only way to achieve accurate dereverberation with the conventional method is to estimate inverse filters from a large amount of acoustic signal data and average them to improve the estimation accuracy. However, when dereverberation is performed, for example, as preprocessing for speech recognition, the processing must be able to cope with reverberant environments having a long reverberation time. When a long reverberation of about 500 msec is to be removed by the conventional method, for example, dereverberation accurate enough to yield a suitable speech recognition rate cannot be achieved unless a large amount of acoustic signal data, about one hour, is used. Yet a real environment in which the transfer function between the sound source and the microphone stays constant for about one hour does not usually exist. In other words, with the conventional method it is difficult to perform dereverberation with high accuracy while adapting to the transfer function of a real environment that changes from moment to moment.

The present invention has been made in view of these points, and an object of the present invention is to provide a technique that enables dereverberation with high accuracy, using only a small amount of data, even in a reverberant environment with a long reverberation time.

In order to solve the above problem, the present invention extracts the minimum value or a pseudo minimum value of the power (gain) of an acoustic signal as the power estimation value of the rear reverberation sound component of that acoustic signal, and uses this power estimation value to calculate estimation values of the direct sound component and the initial reflected sound component of the acoustic signal.
Here, reverberation decays exponentially after the direct sound. For this reason, the rear reverberation component has smaller power than the direct sound component and the initial reflected sound component. A clean acoustic signal is a sparse signal, that is, one that is not dense in time, and the rear reverberation component is superimposed on the parts that were originally sparse. Therefore, the rear reverberation component in a sparse part is observed as the minimum value of the power when a time interval of a certain length is observed. Furthermore, when the impulse response is long, the rear reverberation component persists in a steady manner for a relatively long time. It is therefore reasonable to take the power of the rear reverberation component estimated from a sparse part as the power of the rear reverberation component in the non-sparse parts as well. That is, it is reasonable to use the minimum value of the power of the acoustic signal as the power estimation value of its rear reverberation sound component. Moreover, a value need not be a minimum in the strict sense: a comparable "pseudo minimum value" may equally be used as the power estimation value of the rear reverberation sound component. A specific example of the "pseudo minimum value" will be described later.
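The minimum-based estimate described above can be sketched as follows. This is only an illustration under stated assumptions, not the method as claimed: the function name `estimate_late_reverb_power`, the centred sliding window, and the sample power values are all introduced here for illustration, and the pseudo minimum value is not covered.

```python
def estimate_late_reverb_power(power, window):
    """For each frame, return the minimum power found in a window centred on it.

    `power` is a sequence of per-frame power values for one frequency band.
    The centred window is an assumption made for this sketch.
    """
    half = window // 2
    n = len(power)
    return [min(power[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


# Toy power envelope: loud frames (speech) separated by quiet frames
# in which only the decaying reverberation tail remains.
power = [9.0, 8.0, 1.2, 1.0, 7.5, 6.0, 1.1, 0.9, 5.0]
late = estimate_late_reverb_power(power, window=5)
print(late)
```

The window has to be long enough to reach at least one sparse frame; otherwise the minimum still contains direct sound energy.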

Further, the rear reverberation component is uncorrelated with the direct sound component and the initial reflected sound component. It is therefore easy to calculate, for example by the spectral subtraction method, estimation values of the direct sound component and the initial reflected sound component, that is, of the acoustic signal with the rear reverberation sound component removed, using the power estimation value of the rear reverberation sound component. Here, the main factor in reverberation that degrades speech recognition is the rear reverberation component. Accordingly, when the estimation values of the direct sound component and the initial reflected sound component are obtained by removing the rear reverberation from the acoustic signal, and an inverse filter is estimated using them as the "reference signal", that inverse filter can be said to remove reverberation to the extent that a suitable speech recognition rate is obtained. That is, the present invention can estimate, from a small amount of acoustic signal data, an inverse filter that removes reverberation accurately enough to yield a suitable speech recognition rate, without estimating inverse filters for a large amount of acoustic signal data and averaging them.
In the present invention, it is also possible to use the direct sound component and the initial reflected sound component of the acoustic signal, estimated using the power estimation value of the rear reverberation sound component, directly as the dereverberation result without using an inverse filter.
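As a rough illustration of the subtraction step, the sketch below removes an estimated rear reverberation power from the observed power in each frame. The function name `spectral_subtract` and the flooring of negative results at `floor` are assumptions made here (flooring is a common safeguard in spectral subtraction, not something the text specifies).

```python
def spectral_subtract(observed_power, late_power, floor=0.0):
    """Subtract the estimated rear reverberation power frame by frame.

    Results below `floor` are clipped, since a power value cannot be negative.
    """
    return [max(p - r, floor) for p, r in zip(observed_power, late_power)]


observed = [9.0, 1.0, 0.5]   # toy observed power per frame
late = [1.0, 1.0, 1.0]       # toy rear reverberation power estimate
print(spectral_subtract(observed, late))
```

The clipped frames correspond to the over-subtraction error that the description later denotes by er(n).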

In the present invention, the power estimation value of the rear reverberation sound component of the acoustic signal is preferably extracted using the acoustic signal in the frequency domain. This enables dereverberation processing for each frequency, so that good dereverberation performance is obtained.
Also preferably, in the present invention, the power estimation value of the rear reverberation sound component is extracted using the frequency-domain acoustic signal, and the estimated inverse filter is calculated using the frequency-domain estimation values of the direct sound component and the initial reflected sound component as reference signals. In that case, the frame length used when extracting the power estimation value of the rear reverberation sound component is made shorter than the frame length used when calculating the estimated inverse filter.

Here, the present invention extracts the minimum value or pseudo minimum value of the power of the acoustic signal as the power estimation value of its rear reverberation sound component. For this processing, it is desirable to use a short frame length when converting the acoustic signal into the frequency domain, because the time variation of the acoustic signal can then be captured in detail. On the other hand, when the inverse filter is estimated in the present invention, the estimation values of the direct sound component and the initial reflected sound component are used as reference signals. For this processing, a longer frame length of the reference signal is desirable, because a short frame cannot fully capture the impulse response. That is, the frame length used when estimating the inverse filter is desirably longer than the impulse response length.
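The two frame-length requirements can be made concrete with a small helper. Everything below is an assumption for illustration: the function name `frame_lengths`, the 8 kHz sampling rate, the 32 ms analysis frame, and the choice of rounding the filter frame up to a power of two (convenient for an FFT). Only the 500 ms reverberation length echoes a figure mentioned in the text.

```python
def frame_lengths(sample_rate_hz, power_frame_ms, impulse_response_ms):
    """Return (power_frame, filter_frame) in samples.

    power_frame: short frame used to track the power of the signal over time.
    filter_frame: smallest power of two strictly longer than the impulse response.
    """
    power_frame = int(sample_rate_hz * power_frame_ms / 1000)
    ir_len = int(sample_rate_hz * impulse_response_ms / 1000)
    filter_frame = 1
    while filter_frame <= ir_len:
        filter_frame *= 2
    return power_frame, filter_frame


power_frame, filter_frame = frame_lengths(8000, 32, 500)
print(power_frame, filter_frame)
```

With these assumed values the power-estimation frame is far shorter than the filter-estimation frame, which is the relation the paragraph above asks for.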

In the present invention, the power estimation value of the rear reverberation sound component is also preferably extracted using a smoothed acoustic signal. This makes it possible to extract the power estimation value appropriately even for an acoustic signal that fluctuates strongly.
Also preferably, an estimated inverse filter is calculated for each frame, and the average of these estimated inverse filters over the frames is used as the inverse filter for dereverberation. This allows a more accurate inverse filter to be estimated.
Further preferably, the dereverberated signal obtained by multiplying the acoustic signal by the estimated inverse filter is treated as a new acoustic signal, and the dereverberation processing of the present invention is executed again. This improves the accuracy with which the rear reverberation component is removed.

According to the present invention, dereverberation can be performed with high accuracy, using only a small amount of data, even in a reverberant environment with a long reverberation time.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[First Embodiment]
First, the principle of this embodiment will be described.
<Principle>
Dereverberation used as preprocessing for speech recognition must satisfy the following conditions.
A-1) It must be effective even in a reverberant environment with a long reverberation time (about 0.5 seconds or longer).
A-2) It must be effective in removing rear reverberation, which is said to be the main cause of degraded speech recognition performance (see, for example, B. W. Gillespie and L. E. Atlas, "Acoustic diversity for improved speech recognition in reverberant environments," Proc. of International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 557-600, 2002).
This embodiment therefore proposes a dereverberation method that satisfies these two conditions. The characteristics of the rear reverberation component are described first, followed by the principle used to remove the rear reverberation.

[Characteristics of rear reverberation in reverberant speech]
Owing to constraints imposed by the speech production mechanism, a speech signal can be regarded as a quasi-stationary signal that is highly correlated within a short time interval and only weakly correlated over longer intervals. That is, when clean speech is s(n), the autocorrelation R_ss(u) of speech given by equation (1) can be assumed to satisfy equation (2).
$$R_{ss}(u) = \varepsilon[\,s(n)\,s(n-u)\,] \tag{1}$$
$$|R_{ss}(u)| \approx 0 \quad \text{for } u \ge T \tag{2}$$
Here, ε[·] denotes the expected value. For a speech signal, T ranges from about 30 ms to 100 ms depending on the characteristics of each phoneme. u and n denote discrete time.
As shown in equation (3), the reverberant signal is expressed as the convolution of clean speech s(n) with the impulse response h(n) = [h(0,n) ... h(M-1,n)]^t, where α^t denotes the transpose of a vector α.
$$x(n) = s(n)^{t}\,h(n) \tag{3}$$
Therefore, if the impulse response h(u) has energy for u ≥ T, the observed signal (speech with added reverberation) x(n) can be divided into three components. The first is the direct sound, expressed as h(0)s(n); the second is the signal component correlated with the direct sound; and the third is the signal component uncorrelated with the direct sound. h(0) is called the direct sound, h(u) (0 < u < T-1) the initial reflected sound, and h(u) (u ≥ T) the rear reverberation.
From this point of view, the following characteristic of rear reverberation can be assumed:
B-1) The rear reverberation component is uncorrelated with the direct sound component and the initial reflected sound component.
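The three-way split of the observed signal can be checked numerically. The sketch below is illustrative only: the helper names `convolve` and `split_impulse_response` and the toy values are assumptions, and the early part is taken here as the taps 0 < u < T so that every tap of h belongs to exactly one part.

```python
def convolve(s, h):
    """Full linear convolution of two sequences."""
    out = [0.0] * (len(s) + len(h) - 1)
    for i, s_val in enumerate(s):
        for u, h_val in enumerate(h):
            out[i + u] += s_val * h_val
    return out


def split_impulse_response(h, T):
    """Split h into direct (u = 0), early (0 < u < T) and late (u >= T) taps.

    Each part keeps the full length of h, with zeros outside its own range.
    Requires 1 <= T <= len(h).
    """
    M = len(h)
    direct = [h[0]] + [0.0] * (M - 1)
    early = [0.0] + list(h[1:T]) + [0.0] * (M - T)
    late = [0.0] * T + list(h[T:])
    return direct, early, late


s = [1.0, 0.5, 0.0, 0.25]        # toy clean signal
h = [1.0, 0.5, 0.25, 0.125]      # toy impulse response
direct, early, late = split_impulse_response(h, T=2)
x = convolve(s, h)               # observed (reverberant) signal
parts = [convolve(s, part) for part in (direct, early, late)]
recombined = [d + e + l for d, e, l in zip(*parts)]
```

Because convolution is linear, the three filtered components add up to the observed signal, which is what allows the rear reverberation part to be treated as a separate additive term.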

[Estimation of reference signal for inverse filter calculation (estimation and removal of the rear reverberation component by the spectral subtraction method)]
Calculating the inverse filter requires a reference signal (the method for calculating the inverse filter is described later). In this embodiment, the spectral subtraction method is used to estimate the reference signal. Spectral subtraction can be used when the signal to be estimated and the signal to be removed are uncorrelated with each other (see, for example, S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. on Acoustics, Speech and Signal Processing, 27(2), pp. 113-120, 1979). As stated in B-1) above, the rear reverberation component can be assumed to be uncorrelated with the direct sound component and the initial reflected sound component. Therefore, if the power of the rear reverberation component can be estimated at each time, the direct sound component and the initial reflected sound component can be estimated by the spectral subtraction method. To estimate the power of the rear reverberation component at each time, processing is performed that focuses on time-domain features of the acoustic signal that were not used in the conventional method. This makes effective dereverberation possible with far less learning data than the conventional method requires. To exploit these time-domain features, two properties are used: the characteristics of the rear reverberation component and the sparsity of speech, both described below.

Characteristics of the rear reverberation component:
B-2) Reverberation decays exponentially after the direct sound. The rear reverberation component therefore has less energy than the direct sound and the initial reflected sound.
B-3) When the impulse response is long, the rear reverberation component persists in a steady manner for a relatively long time.
Sparsity of speech:
B-4) Viewed as a frequency band signal, a clean speech signal does not carry energy at all times; it is a signal that is not dense in time (a sparse signal). In addition, speech does not continue without interruption but contains a pause every few seconds, and in that sense too it is a sparse signal.

When a rear reverberation component is added to a signal that has this sparsity, the sparse parts of the signal are filled with the rear reverberation component, and the speech becomes a non-sparse signal.
However, taking the sparsity and the characteristics of rear reverberation into account, the power of the rear reverberation can be estimated as follows.
First, by characteristic B-2), the rear reverberation component has less energy than the direct sound component and the initial reflected sound component. It can therefore be assumed that, within a given time interval, the rear reverberation component superimposed on an originally sparse part is observed as very small energy compared with the direct sound and initial reflected sound components. In other words, because the rear reverberation component is superimposed on parts that were originally sparse, it can be observed as the minimum energy when a time interval of a certain length is observed. Next, by characteristic B-3), the power of the rear reverberation component observed in a sparse part can be regarded as almost the same as the power of the rear reverberation at neighbouring times. The rear reverberation component estimated from a sparse part can therefore also be applied to the non-sparse parts. In this way, the power of the rear reverberation component can be calculated for all time intervals.
When the function of the spectral subtraction method is denoted NR[·], the processed speech can be expressed by the following equation.
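Equation (4) itself is not reproduced in this text. A form consistent with its frequency-domain counterpart, equation (5), would be the following, where the symbols d(n) and r_E(n) for the time-domain direct sound and initial reflected sound components are assumptions introduced here, and the actual notation of the original may differ:

$$\hat{s}(n) = \mathrm{NR}[x(n)] = d(n) + r_E(n) + er(n) \tag{4}$$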

Note that the spectral subtraction function NR[·] computes [power of the observed signal] − [power of the rear reverberation component]. The term er(n) in equation (4) represents either the component that was subtracted in excess from [power of the observed signal] by the spectral subtraction, or the [power of the rear reverberation component] that remains in the result of the spectral subtraction. By characteristic B-2), the rear reverberation component removed by the spectral subtraction method has less energy than the direct sound component and the initial reflected sound component, so er(n) is not expected to become very large either.

[Calculation of inverse filter]
The signal obtained by the spectral subtraction method above is used as the reference signal for the inverse filter. To simplify the derivation of the inverse filter, each signal is represented in the frequency domain from here on. Let H be the transfer function, D its direct sound component, R_E its initial reflected sound component, S the clean speech, and ER the term resulting from er(n) in equation (4). Then equation (4) can be expressed in the frequency domain as follows.
$$\hat{S}(\tau,f) = D(f)\,S(\tau,f) + R_E(f)\,S(\tau,f) + ER(\tau,f) \tag{5}$$
Here, τ denotes the frame number. The frame length (window length) of the frames indexed by τ is desirably longer than the impulse response. That is, when dealing with an impulse response having a long reverberation time, this frame length is desirably longer than the frame length used for, for example, removing the rear reverberation component. Note that α^{^} denotes $\hat{\alpha}$.
In this embodiment, the inverse filter is calculated as follows, using $\hat{S}$ of equation (5) as the reference signal of the inverse filter. First, a first-order approximation of the inverse filter is calculated for each frame according to the following equation.

Here, H denotes the impulse response, and H(f)S(τ,f) represents the observed speech with added reverberation. Since this first-order approximation contains the term ER described above, the approximations of the individual frames are preferably averaged so that ER is averaged out.
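Equations (6) and (7) are not reproduced in this text. A reconstruction consistent with equations (5) and (8) is sketched below; the symbol W(τ,f) for the per-frame filter, the frame count N, and the use of a plain arithmetic mean are assumptions introduced here.

$$W(\tau,f) = \frac{\hat{S}(\tau,f)}{H(f)\,S(\tau,f)} = \frac{D(f)+R_E(f)}{H(f)} + \frac{ER(\tau,f)}{H(f)\,S(\tau,f)} \tag{6}$$

$$W(f) = \frac{1}{N}\sum_{\tau=1}^{N} W(\tau,f) \approx \frac{D(f)+R_E(f)}{H(f)} \tag{7}$$

Under this reading, the second term of (6) is the frame-dependent error, and the averaging in (7) is what suppresses it.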

The resulting inverse filter is a linear filter that removes rear reverberation.
[Inverse filtering]
When the calculated inverse filter W(f) is applied to the observed signal H(f)S(τ,f) as shown below, the processed speech Y is converted into speech consisting of the direct sound component and the initial reflected sound component.
$$Y(\tau,f) = W(f)\,\{H(f)\,S(\tau,f)\} \approx \{D(f)+R_E(f)\}\,S(\tau,f) \tag{8}$$
<Details of this embodiment>
Next, details of this embodiment will be described.

[Hardware configuration]
FIG. 2 is a block diagram illustrating the hardware configuration of the dereverberation apparatus 10 according to the first embodiment.
As illustrated in FIG. 2, the dereverberation apparatus 10 of this example includes a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.
The CPU 11 in this example includes a control unit 11a, a calculation unit 11b, and a register 11c, and executes various calculation processes according to the programs read into the register 11c. The input unit 12 is an input interface through which data is input, a keyboard, a mouse, or the like, and the output unit 13 is an output interface through which data is output, or the like. The auxiliary storage device 14 is, for example, a hard disk, an MO (Magneto-Optical disc), or a semiconductor memory, and has a program area 14a, which stores the program for causing a computer to function as the dereverberation apparatus 10, and a data area 14b, which stores various data. The RAM 16 is an SRAM (Static Random Access Memory), a DRAM (Dynamic Random Access Memory), or the like, and has a program area 16a, which stores the above program, and a data area 16b, which stores various data. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can communicate with one another.
Specific examples of such hardware include a personal computer, a server apparatus, and a workstation.

[Program structure]
Next, the configuration of the programs stored in the program areas 14a and 16a will be described. The program of this embodiment includes a rear reverberation power estimation program for extracting the minimum value or pseudo minimum value of the power of the acoustic signal as the power estimation value of its rear reverberation sound component; a rear reverberation removal program for calculating estimation values of the direct sound component and the initial reflected sound component of the acoustic signal using that power estimation value; a dereverberation filter calculation program for calculating an estimated inverse filter using the estimation values of the direct sound component and the initial reflected sound component as reference signals; and a dereverberation filter multiplication program for multiplying the estimated inverse filter by the power of the acoustic signal and outputting the result.
The programs described above may be written as a single program sequence, or at least some of them may be stored in a library as separate modules. Each program may realize its function on its own, or may realize it by additionally reading another library (not described here). That is, at least a part of each program described above corresponds to a program for causing a computer to execute the functions of the dereverberation apparatus 10.

［ハードウェアとプログラムとの協働］
ＣＰＵ１１（図２）は、読み込まれたＯＳ（Operating System）プログラムに従い、補助記憶装置１４のプログラム領域１４ａに格納されている上述のプログラムをＲＡＭ１６のプログラム領域１６ａに書き込む。同様にＣＰＵ１１は、補助記憶装置１４のデータ領域１４ｂに格納されている各種データを、ＲＡＭ１６のデータ領域１６ｂに書き込む。そして、このプログラムやデータが書き込まれたＲＡＭ１６上のアドレスがＣＰＵ１１のレジスタ１１ｃに格納される。ＣＰＵ１１の制御部１１ｂは、レジスタ１１ｃに格納されたこれらのアドレスを順次読み出し、読み出したアドレスが示すＲＡＭ１６上の領域からプログラムやデータを読み出し、そのプログラムが示す演算を演算部１１ａに順次実行させ、その演算結果をレジスタ１１ｃに格納していく。 [Cooperation between hardware and programs]
The CPU 11 (FIG. 2) writes the above-mentioned program stored in the program area 14a of the auxiliary storage device 14 into the program area 16a of the RAM 16 in accordance with the loaded OS (Operating System) program. Similarly, the CPU 11 writes the various data stored in the data area 14b of the auxiliary storage device 14 into the data area 16b of the RAM 16. The addresses on the RAM 16 at which the program and data are written are stored in the register 11c of the CPU 11. The control unit 11b of the CPU 11 sequentially reads these addresses from the register 11c, reads the program and data from the areas on the RAM 16 indicated by the addresses, causes the calculation unit 11a to sequentially execute the calculations indicated by the program, and stores the calculation results in the register 11c.

図１は、このようにＣＰＵ１１に上述のプログラムが読み込まれて実行されることにより構成される残響除去装置１０の機能構成を例示したブロック図である。なお、図１における矢印はデータの流れを示すが、制御部１０ｊに出入りするデータの流れに対応する矢印は省略してある。
図１に例示するように、本形態の残響除去装置１０は、記憶部１０ａ、周波数領域変換部１０ｂ，１０ｆ、後部残響パワー推定部１０ｃ、後部残響除去部１０ｄ、時間領域変換部１０ｅ，１０ｉ、残響除去フィルタ計算部１０ｇ、残響除去フィルタ乗算部１０ｈ、制御部１０ｊ及び一時メモリ１０ｋを有している。また、後部残響パワー推定部１０ｃは、平滑化部１０ｃａ、ベクトル生成部１０ｃｂ及び最小値抽出部１０ｃｄを有している。さらに、残響除去フィルタ計算部１０ｇは、フレーム逆フィルタ計算部１０ｇａ及び平均値算出部１０ｇｂを有している。 FIG. 1 is a block diagram illustrating the functional configuration of the dereverberation apparatus 10 that is formed when the above-described program is read into and executed by the CPU 11. The arrows in FIG. 1 show the flow of data, but the arrows corresponding to data entering and leaving the control unit 10j are omitted.
As illustrated in FIG. 1, the dereverberation apparatus 10 of the present embodiment includes a storage unit 10a, frequency domain conversion units 10b and 10f, a rear reverberation power estimation unit 10c, a rear reverberation removing unit 10d, time domain conversion units 10e and 10i, a dereverberation filter calculation unit 10g, a dereverberation filter multiplication unit 10h, a control unit 10j, and a temporary memory 10k. The rear reverberation power estimation unit 10c includes a smoothing unit 10ca, a vector generation unit 10cb, and a minimum value extraction unit 10cd. Further, the dereverberation filter calculation unit 10g includes a frame inverse filter calculation unit 10ga and an average value calculation unit 10gb.

ここで、記憶部１０ａ及び一時メモリ１０ｋは、補助記憶装置１４、ＲＡＭ１６、レジスタ１１ｃ、その他のバッファメモリやキャッシュメモリ等の何れか、あるいはこれらを併用した記憶領域に相当する。また、周波数領域変換部１０ｂ，１０ｆ、後部残響パワー推定部１０ｃ、後部残響除去部１０ｄ、時間領域変換部１０ｅ，１０ｉ、残響除去フィルタ計算部１０ｇ、残響除去フィルタ乗算部１０ｈ及び制御部１０ｊは、ＣＰＵ１１に上記のプログラムを実行させることにより構成されるものである。
また、本形態の残響除去装置１０は、制御部１０ｊの制御のもと各処理を実行する。また、特に示さない限り、演算過程の各データは、逐一、一時メモリ１０ｋに格納・読み出され、各演算処理が進められる。 Here, the storage unit 10a and the temporary memory 10k correspond to any one of the auxiliary storage device 14, the RAM 16, the register 11c, and other buffer or cache memories, or to a storage area that combines them. The frequency domain conversion units 10b and 10f, the rear reverberation power estimation unit 10c, the rear reverberation removing unit 10d, the time domain conversion units 10e and 10i, the dereverberation filter calculation unit 10g, the dereverberation filter multiplication unit 10h, and the control unit 10j are formed by causing the CPU 11 to execute the above program.
In addition, the dereverberation apparatus 10 according to the present embodiment executes each process under the control of the control unit 10j. Unless otherwise indicated, each piece of data in the calculation process is stored and read from the temporary memory 10k one by one, and each calculation process proceeds.

＜残響除去処理＞
次に、本形態の残響除去処理について説明する。
図３は、第１の実施の形態の残響除去処理を説明するためのフローチャートである。以下、この図と図１とを用いて、本形態の残響除去処理を説明していく。
［後部残響パワー推定処理］
図示していないシングルマイクロホンで観測された時間領域の音響信号x(n)は、逐一、記憶部１０ａに格納されていく。このように記憶部１０ａに格納された時間領域の音響信号x(n)は、周波数領域変換部１０ｂに順次読み込まれる。周波数領域変換部１０ｂは、読み込んだ時間領域の音響信号x(n)を、フレーム長Ｌ１の短時間フーリエ変換によって周波数領域の音響信号X(N,f)=|X(N,f)|e^j∠X(N,f)に変換し、そのパワー（マグニチュードスペクトラム）|X(N,f)|と位相e^j∠X(N,f)とを記憶部１０ａに格納する（ステップＳ１）。なお、Ｎはフレーム番号であり、ｆは周波数である。また、eは自然対数を表し、ｊは虚数単位を表し、∠X(N,f)は複素平面での角度（ラジアン）を表す。以下の処理では、X(N,f)の位相e^j∠X(N,f)は操作せず、パワー|X(N,f)|のみを操作する。また、フレーム長Ｌ１は、例えば、30msec程度の比較的短い長さとする。音響信号の時間領域での特徴をできるだけ正確に捉えるためである。
次に、後部残響パワー推定部１０ｃが、記憶部１０ａから周波数領域の音響信号のパワー|X(N,f)|を順次読み出し、それらのある時間区間での最小値を、当該音響信号の後部残響音成分のパワー推定値|RL(N,f)|として抽出し、記憶部１０ａに格納する（ステップＳ２）。以下にこの処理の詳細を示す。 <Reverberation removal processing>
Next, the dereverberation process of this embodiment will be described.
FIG. 3 is a flowchart for explaining the dereverberation process according to the first embodiment. Hereinafter, the dereverberation process of this embodiment will be described with reference to FIG. 3 and FIG. 1.
[Rear reverberation power estimation process]
The time-domain acoustic signal $x(n)$ observed with a single microphone (not shown) is stored in the storage unit 10a as it arrives. The time-domain acoustic signal $x(n)$ stored in the storage unit 10a in this manner is sequentially read into the frequency domain conversion unit 10b. The frequency domain conversion unit 10b converts the read signal $x(n)$ into a frequency-domain acoustic signal $X(N,f) = |X(N,f)|\,e^{j\angle X(N,f)}$ by a short-time Fourier transform with frame length L1, and stores its power (magnitude spectrum) $|X(N,f)|$ and phase $e^{j\angle X(N,f)}$ in the storage unit 10a (step S1). Here, N is the frame number and f is the frequency; e is the base of the natural logarithm, j is the imaginary unit, and $\angle X(N,f)$ is the angle (in radians) in the complex plane. In the following processing, the phase $e^{j\angle X(N,f)}$ of $X(N,f)$ is left untouched and only the power $|X(N,f)|$ is manipulated. The frame length L1 is relatively short, for example about 30 msec, so that the time-domain characteristics of the acoustic signal are captured as accurately as possible.
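As a rough illustration of the magnitude and phase split in step S1, the following Python sketch applies a naive DFT to a single frame and then recombines the two parts. The frame values and the function names `dft` and `split_magnitude_phase` are hypothetical; windowing and frame overlap, which a practical short-time Fourier transform would normally use, are omitted here.

```python
import cmath
import math

def dft(frame):
    """Naive discrete Fourier transform of one frame (O(L^2), illustration only)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]

def split_magnitude_phase(spectrum):
    """Return the magnitudes |X| and the unit phase factors e^{j*angle(X)} per bin."""
    magnitudes = [abs(x) for x in spectrum]
    phases = [cmath.exp(1j * cmath.phase(x)) for x in spectrum]
    return magnitudes, phases

# Hypothetical 8-sample frame: a sinusoid that falls exactly on bin 2.
frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
X = dft(frame)
mag, phase = split_magnitude_phase(X)

# Multiplying magnitude by the stored phase factor recovers the spectrum.
rebuilt = [m * p for m, p in zip(mag, phase)]
```

Because later steps modify only the magnitude, the stored phase factor is all that is needed to return to a complex spectrum before the inverse transform.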
Next, the rear reverberation power estimation unit 10c sequentially reads out the power $|X(N,f)|$ of the frequency-domain acoustic signal from the storage unit 10a, extracts its minimum value over a certain time interval as the power estimation value $|R_L(N,f)|$ of the rear reverberation component of the acoustic signal, and stores it in the storage unit 10a (step S2). Details of this processing are shown below.

［ステップＳ２の詳細］
図４（ａ）は、図３のステップＳ２の詳細を説明するためのフローチャートである。
まず、後部残響パワー推定部１０ｃの平滑化部１０ｃａが、記憶部１０ａから周波数領域の音響信号のパワー|X(N,f)|を順次読み出し、それらを移動平均などで平滑化した値|X'(N,f)|を記憶部１０ａに格納する（ステップＳ１１）。これは、平滑化した音響信号を用いて、音響信号の後部残響音成分のパワー推定値|RL(N,f)|を抽出するためである。すなわち、音声信号は、変動の激しい信号であり、周波数領域の音響信号のパワー|X(N,f)|は隣り合うフレームでも大きく異なる値を持つことが少なくない。その場合に最小値を求めていくと、音声そのものを後部残響成分として推定してしまうこともある。そこで、音声の特徴をある程度平滑化し、大域的な特徴を表すように変換した後、その最小値を音響信号の後部残響音成分のパワー推定値|RL(N,f)|として抽出する。また、これにより、周辺雑音にも頑健な残響除去を実現できる。なお、スペクトルを平滑化し、最小値を求める手法は、雑音除去手法として文献「R. Martin, "Spectral subtraction based on minimum statistics," Proc. of European Association for Signal Processing, pp. 1182-1185, 1994.」で提唱されている。 [Details of Step S2]
FIG. 4A is a flowchart for explaining details of step S2 in FIG.
First, the smoothing unit 10ca of the rear reverberation power estimation unit 10c sequentially reads out the power $|X(N,f)|$ of the frequency-domain acoustic signal from the storage unit 10a, smooths it with a moving average or the like, and stores the smoothed value $|X'(N,f)|$ in the storage unit 10a (step S11). The smoothed signal is then used to extract the power estimation value $|R_L(N,f)|$ of the rear reverberation component of the acoustic signal. The reason is that a speech signal fluctuates strongly, and the power $|X(N,f)|$ often differs greatly even between adjacent frames. If the minimum were taken directly in that case, the speech itself could be estimated as the rear reverberation component. Therefore, the speech features are first smoothed to some extent so that they represent global characteristics, and the minimum of the smoothed values is extracted as the power estimation value $|R_L(N,f)|$. This also makes the dereverberation robust against ambient noise. The technique of smoothing the spectrum and taking its minimum was proposed as a noise reduction method in R. Martin, "Spectral subtraction based on minimum statistics," Proc. of European Association for Signal Processing, pp. 1182-1185, 1994.
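The smoothing and minimum extraction described above can be sketched for a single frequency bin as follows. This is a minimal Python sketch under stated assumptions: the causal moving average, the sliding search window, the parameter names `width` and `span`, and the sample values are all hypothetical choices, since the text does not fix the smoothing method or the length of the time interval.

```python
def moving_average(values, width):
    """Causal moving average over at most `width` most recent frames."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - width + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def sliding_minimum(values, span):
    """Minimum over the `span` most recent frames, evaluated at every frame."""
    return [min(values[max(0, i - span + 1): i + 1]) for i in range(len(values))]

# Hypothetical power sequence |X(N,f)| of one frequency bin over 8 frames.
power = [1.0, 9.0, 1.0, 8.0, 2.0, 2.0, 7.0, 1.0]

smoothed = moving_average(power, width=2)          # step S11 (assumed form)
late_reverb = sliding_minimum(smoothed, span=3)    # estimate of |R_L(N,f)|
```

Taking the minimum of `smoothed` rather than of `power` keeps isolated low frames of the raw speech from dominating the estimate, which is the motivation given in the text.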

［後部残響除去処理］
次に、後部残響除去部１０ｄが、記憶部１０ａから後部残響音成分のパワー推定値|RL(N,f)|と周波数領域の音響信号のパワー|X(N,f)|とを読み込み、それらを用い、音響信号X(N,f)の直接音成分と初期反射音成分との（パワー）推定値|S^{^}(N,f)|を算出し、記憶部１０ａに格納する（ステップＳ３）。本形態の例では、後部残響除去部１０ｄは、スペクトル減算法によって|S^{^}(N,f)|を算出する。この場合、|S^{^}(N,f)|は、次式のように算出される。
|S^{^}(N,f)|=|X(N,f)|‐|RL(N,f)| …(11)
次に、時間領域変換部１０ｅが、記憶部１０ａから、音響信号X(N,f)の直接音成分と初期反射音成分との（パワー）推定値|S^(N,f)|と、周波数領域の音響信号X(N,f)の位相e^j∠X(N,f)とを読み込む。そして、時間領域変換部１０ｅは、これらを掛け合わせた|S^(N,f)| e^j∠X(N,f)を、逆フーリエ変換によって時間領域の推定値s^{^}(n)に変換し、それを記憶部１０ａに格納する（ステップＳ４）。 [Rear dereverberation processing]
Next, the rear reverberation removing unit 10d reads the power estimation value $|R_L(N,f)|$ of the rear reverberation component and the power $|X(N,f)|$ of the frequency-domain acoustic signal from the storage unit 10a. Using these, it calculates the (power) estimated value $|\hat{S}(N,f)|$ of the direct sound component and the initial reflected sound component of the acoustic signal $X(N,f)$ and stores it in the storage unit 10a (step S3). In the example of the present embodiment, the rear reverberation removing unit 10d calculates $|\hat{S}(N,f)|$ by spectral subtraction, as follows.
$|\hat{S}(N,f)| = |X(N,f)| - |R_L(N,f)|$ …(11)
Next, the time domain conversion unit 10e reads, from the storage unit 10a, the (power) estimated value $|\hat{S}(N,f)|$ of the direct sound component and the initial reflected sound component of the acoustic signal $X(N,f)$, together with the phase $e^{j\angle X(N,f)}$ of the frequency-domain acoustic signal $X(N,f)$. The time domain conversion unit 10e then converts their product $|\hat{S}(N,f)|\,e^{j\angle X(N,f)}$ into a time-domain estimated value $\hat{s}(n)$ by inverse Fourier transform and stores it in the storage unit 10a (step S4).
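Equation (11) amounts to a per-bin subtraction of magnitudes. The Python sketch below shows it for one frame; the clamp at `floor` is an added assumption (the text does not say how negative differences are handled), and the sample values are hypothetical.

```python
def spectral_subtraction(x_mag, r_mag, floor=0.0):
    """Per-bin |S_hat| = |X| - |R_L|, clamped at `floor` (the clamp is an assumption)."""
    return [max(x - r, floor) for x, r in zip(x_mag, r_mag)]

# Hypothetical magnitudes for three frequency bins of one frame.
x_mag = [5.0, 3.0, 0.5]   # observed |X(N,f)|
r_mag = [1.0, 1.0, 1.0]   # estimated rear reverberation |R_L(N,f)|

s_hat = spectral_subtraction(x_mag, r_mag)
```

In the third bin the estimate exceeds the observation, so the clamp keeps the resulting magnitude non-negative.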

［残響除去フィルタ計算処理］
次に、周波数領域変換部１０ｆが、記憶部１０ａから、時間領域の音響信号x(n)と上記の推定値s^{^}(n)とを読み込み、これらをフレーム長Ｌ２の短時間フーリエ変換によって、
・周波数領域の音響信号 X(τ,f)=|X(τ,f)|e^j∠X(τ,f)
・周波数領域の推定値 S^{^}(τ,f) =|S^{^}(τ,f)|e^{j∠ S^(τ,f)}
に変換し、それぞれのパワー|X(τ,f)|，|S^{^}(τ,f)|と位相e^j∠X(τ,f)，e^{j∠ S^(τ,f)}とを記憶部１０ａに格納する（ステップＳ５）。なお、τはフレーム番号を示す。また、フレーム長Ｌ２は、ステップＳ１で用いたフレーム長Ｌ１よりも長いことが好ましく、取り扱う音響信号のインパルス応答長よりも長いことがより望ましい。例えば、フレーム長Ｌ２は、長いインパルス応答にも十分に対処できるように３２７６８（２の１５乗）タップ程度であることが望ましい。
次に、残響除去フィルタ計算部１０ｇが、記憶部１０ａから推定値S^{^}(τ,f) のパワー |S^{^}(τ,f)|を読み出し、それを参照信号として用い、推定逆フィルタW(f)を算出して記憶部１０ａに格納する（ステップＳ６）。以下にこの処理の詳細を示す。 [Dereverberation filter calculation processing]
Next, the frequency domain conversion unit 10f reads the time-domain acoustic signal $x(n)$ and the estimated value $\hat{s}(n)$ from the storage unit 10a and converts them, by a short-time Fourier transform with frame length L2, into
・Frequency-domain acoustic signal: $X(\tau,f) = |X(\tau,f)|\,e^{j\angle X(\tau,f)}$
・Frequency-domain estimated value: $\hat{S}(\tau,f) = |\hat{S}(\tau,f)|\,e^{j\angle \hat{S}(\tau,f)}$
and stores the respective powers $|X(\tau,f)|$, $|\hat{S}(\tau,f)|$ and phases $e^{j\angle X(\tau,f)}$, $e^{j\angle \hat{S}(\tau,f)}$ in the storage unit 10a (step S5). Here, τ is the frame number. The frame length L2 is preferably longer than the frame length L1 used in step S1, and more preferably longer than the impulse response length of the acoustic signal being handled. For example, L2 is desirably about 32768 (2 to the 15th power) taps so that long impulse responses can be handled adequately.
Next, the dereverberation filter calculation unit 10g reads the power $|\hat{S}(\tau,f)|$ of the estimated value $\hat{S}(\tau,f)$ from the storage unit 10a, uses it as a reference signal to calculate the estimated inverse filter $W(f)$, and stores it in the storage unit 10a (step S6). Details of this processing are shown below.

［ステップＳ６の詳細］
図４（ｂ）は、図３のステップＳ６の詳細を説明するためのフローチャートである。
まず、残響除去フィルタ計算部１０ｇのフレーム逆フィルタ計算部１０ｇａが、記憶部１０ａからステップＳ５で生成されたパワー|X(τ,f)|，|S^{^}(τ,f)|を読み込み、 [Details of Step S6]
FIG. 4B is a flowchart for explaining details of step S6 in FIG.
First, the frame inverse filter calculation unit 10ga of the dereverberation filter calculation unit 10g reads the power | X (τ, f) |, | S ^{^} (τ, f) | generated from the storage unit 10a in step S5,

の演算によって、逆フィルタの第１次近似である推定逆フィルタW(τ,f)をフレーム毎に算出し、これらを記憶部１０ａに格納する（ステップＳ１５）。
次に、平均算出部１０ｇｂが、記憶部１０ａからステップＳ１５で算出された各フレームの推定逆フィルタW(τ,f)を読み込み、これらを各フレーム間で平均した推定逆フィルタW(f)を算出し、記憶部１０ａに出力して格納する（ステップＳ１６）。この平均化により、前述したＥR成分（式（７））を低減させ、推定逆フィルタの精度を向上させることができる（［ステップＳ６の詳細］の説明終わり）。

Through this operation, the estimated inverse filter $W(\tau,f)$, which is a first-order approximation of the inverse filter, is calculated for each frame and stored in the storage unit 10a (step S15).
Next, the average value calculation unit 10gb reads the estimated inverse filters $W(\tau,f)$ of the individual frames calculated in step S15 from the storage unit 10a, averages them over the frames to obtain the estimated inverse filter $W(f)$, and outputs it to the storage unit 10a for storage (step S16). This averaging reduces the above-described ER component (equation (7)) and improves the accuracy of the estimated inverse filter (end of description of [Details of Step S6]).
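Steps S15 and S16 can be sketched as a per-frame filter followed by an average over frames. The per-frame formula is not reproduced in this text, so the ratio $|\hat{S}(\tau,f)| / |X(\tau,f)|$ used below is an assumption, chosen only because it is consistent with equation (13), in which the filter multiplies $|X(\tau,f)|$. The function names, the guard `eps`, and the sample values are likewise hypothetical.

```python
def frame_inverse_filter(s_hat_mag, x_mag, eps=1e-12):
    """Per-bin ratio |S_hat| / |X| for one frame (assumed form of step S15)."""
    return [s / max(x, eps) for s, x in zip(s_hat_mag, x_mag)]

def average_over_frames(filters):
    """Average the per-frame filters W(tau, f) over tau to get W(f) (step S16)."""
    n_frames = len(filters)
    return [sum(column) / n_frames for column in zip(*filters)]

# Hypothetical magnitudes: 2 frames, 2 frequency bins each.
x_frames = [[4.0, 2.0], [8.0, 4.0]]   # |X(tau, f)|
s_frames = [[2.0, 1.0], [2.0, 3.0]]   # |S_hat(tau, f)|

w_frames = [frame_inverse_filter(s, x) for s, x in zip(s_frames, x_frames)]
w = average_over_frames(w_frames)
```

Averaging many frames is what lets frame-specific fluctuations cancel, which matches the stated purpose of step S16.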

［残響除去フィルタ乗算処理］
次に、残響除去フィルタ乗算部１０ｈが、記憶部１０ａから推定逆フィルタW(f)と、周波数領域の音響信号のパワー|X(τ,f)|とを読み込み、次式のように、この推定逆フィルタW(f)に音響信号のパワー|X(τ,f)|を乗じ、その演算結果である周波数領域の残響除去信号（のパワー）|Y(τ,f)|を算出し、記憶部１０ａに格納する（ステップＳ７）。
|Y(τ,f)|=W(f) |X(τ,f)| …(13)
そして、時間領域変換部１０ｉが、記憶部１０ａから残響除去信号のパワー|Y(τ,f)|と位相e^j∠X(τ,f)とを読み込み、これらの積Y(τ,f)=|Y(τ,f)| e^j∠X(τ,f)を算出し、その演算結果Y(τ,f)を、逆フーリエ変換によって時間領域の残響除去信号y(t)に変換し、出力する（ステップＳ８）。 [Dereverberation filter multiplication]
Next, the dereverberation filter multiplication unit 10h reads the estimated inverse filter $W(f)$ and the power $|X(\tau,f)|$ of the frequency-domain acoustic signal from the storage unit 10a, multiplies the estimated inverse filter $W(f)$ by the power $|X(\tau,f)|$ as in the following equation to obtain the (power of the) frequency-domain dereverberation signal $|Y(\tau,f)|$, and stores it in the storage unit 10a (step S7).
$|Y(\tau,f)| = W(f)\,|X(\tau,f)|$ …(13)
Then, the time domain conversion unit 10i reads the power $|Y(\tau,f)|$ of the dereverberation signal and the phase $e^{j\angle X(\tau,f)}$ from the storage unit 10a, calculates their product $Y(\tau,f) = |Y(\tau,f)|\,e^{j\angle X(\tau,f)}$, converts the result $Y(\tau,f)$ into the time-domain dereverberation signal y(t) by inverse Fourier transform, and outputs it (step S8).
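Equation (13) and the recombination with the observed phase in step S8 can be sketched for one frame as follows. The filter values, the spectrum, and the function names are hypothetical; the inverse Fourier transform back to y(t) is left out.

```python
import cmath

def apply_filter(w, x_mag):
    """Equation (13): |Y(tau,f)| = W(f) * |X(tau,f)| for each bin."""
    return [wf * xf for wf, xf in zip(w, x_mag)]

def attach_phase(mag, phase):
    """Y(tau,f) = |Y(tau,f)| * e^{j*angle(X(tau,f))}, reusing the observed phase."""
    return [m * p for m, p in zip(mag, phase)]

# Hypothetical filter and observed spectrum for two frequency bins.
w = [0.5, 0.25]
X = [complex(0.0, 4.0), complex(-8.0, 0.0)]

x_mag = [abs(x) for x in X]
x_phase = [cmath.exp(1j * cmath.phase(x)) for x in X]

y_mag = apply_filter(w, x_mag)
Y = attach_phase(y_mag, x_phase)
```

Each output bin keeps the direction of the observed bin in the complex plane and only its length changes, which reflects the statement that the phase is not manipulated.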

［繰り返し処理］
以上の処理によっても少ないデータで高い精度の残響除去が可能である。しかし、本形態では、さらに残響除去フィルタの推定精度を上げるため、図３及び図４に示した処理を複数回繰り返す。その場合、１回目のループの際は、周波数領域変換部１０ｂヘの入力が観測信号であるのに対し、2回目以降は、前のループで求まった時間領域変換部１０ｉからの出力信号が周波数領域変換部１０ｂヘの入力信号となる。これにより、後部残響の消し残りを低減させることができる。 [Repetition processing]
The above processing alone already allows highly accurate dereverberation with a small amount of data. In this embodiment, however, the processing shown in FIGS. 3 and 4 is repeated a plurality of times in order to further improve the estimation accuracy of the dereverberation filter. In the first loop, the input to the frequency domain conversion unit 10b is the observed signal, whereas from the second loop onward, the output signal of the time domain conversion unit 10i obtained in the previous loop becomes the input to the frequency domain conversion unit 10b. This reduces the rear reverberation that is left unremoved.

図５は、このような繰り返し処理を説明するためのフローチャートである。
まず、制御部１０ｊが、変数ｋに１を代入し、その変数ｋを一時メモリ１０ｋに格納する（ステップＳ２１）。次に、制御部１０ｊの制御のもと、x(n)に対して前述したステップＳ１からＳ８までの残響除去処理を実行し、残響除去信号y(t)を記憶部１０ａに格納する（ステップＳ２２）。その後、制御部１０ｊは、一時メモリ１０ｋに格納された変数ｋを読み込み、その変数ｋがｋ_maxであるか否かを判断する（ステップＳ２３）。ここで、ｋ_maxとは、繰り返し処理の回数を示す整数である。ここで、制御部１０ｊがｋ=ｋ_maxでないと判断した場合、制御部１０ｊは、残響除去信号y(t)をx(n)に代入し、k+1を新たな変数kの値として一時メモリ１０ｋに格納し、処理をステップＳ２２に戻す。一方、制御部１０ｊがｋ=ｋ_maxであると判断した場合、制御部１０ｊは、その時点の残響除去信号y(t)を出力する（ステップＳ２５）。 FIG. 5 is a flowchart for explaining such repetition processing.
First, the control unit 10j assigns 1 to a variable k and stores k in the temporary memory 10k (step S21). Next, under the control of the control unit 10j, the dereverberation process of steps S1 to S8 described above is performed on x(n), and the dereverberation signal y(t) is stored in the storage unit 10a (step S22). The control unit 10j then reads k from the temporary memory 10k and determines whether k equals k_max (step S23), where k_max is an integer giving the number of repetitions. If the control unit 10j determines that k = k_max does not hold, it substitutes the dereverberation signal y(t) for x(n), stores k+1 in the temporary memory 10k as the new value of k, and returns the process to step S22. If, on the other hand, it determines that k = k_max, it outputs the dereverberation signal y(t) at that point (step S25).
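The control flow of FIG. 5 can be sketched as a short loop. In the Python sketch below, `dereverberate` stands in for the whole of steps S1 to S8 and `halve` is a purely hypothetical placeholder for it, used only so that the loop can be run; the sketch assumes k_max is at least 1.

```python
def iterate_dereverberation(x, dereverberate, k_max):
    """Run the dereverberation pass k_max times, feeding each output back in."""
    k = 1                          # step S21
    while True:
        y = dereverberate(x)       # step S22: one pass of steps S1 to S8
        if k == k_max:             # step S23
            return y               # step S25
        x = y                      # the output becomes the next input
        k += 1

def halve(signal):
    """Hypothetical stand-in for one dereverberation pass."""
    return [0.5 * v for v in signal]

result = iterate_dereverberation([8.0, 4.0], halve, k_max=3)
```

With `k_max=3` the stand-in is applied three times, so each sample ends up at one eighth of its input value.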

＜シミュレーション結果＞
次に、本形態の残響除去処理のシミュレーション結果を示す。
まず、連続発話データセットから女声と男声の発話（それぞれ約５０秒分）を取り出し、予め測定しておいた残響時間１．０，０．５，０．２，０．１秒のインパルス応答との畳み込み残響環境をシミュレートした。また、このシミュレーションでは、上記の［繰り返し処理］を４回繰り返した（ｋ_max =4）。
図６は、本形態による処理前のインパスル応答（図６（ａ））と、本形態による残響除去処理後のインパルス応答（図６（ｂ））とを比較したグラフである。ここで、図６の横軸は時間（秒）を示し、縦軸はインパルス応答の強度を示す。０．１秒近辺に引かれた破線Ａよりも後の残響成分に注目すると分かるように、本形態による処理前では、後部残響成分が十分除去されていないのに対し、本形態による処理後は、後部残響成分がほぼ完全に除去されている。 <Simulation results>
Next, a simulation result of the dereverberation process of this embodiment is shown.
First, utterances by a female and a male speaker (about 50 seconds each) were taken from a continuous speech data set and convolved with previously measured impulse responses having reverberation times of 1.0, 0.5, 0.2, and 0.1 seconds to simulate reverberant environments. In this simulation, the above [Repetition processing] was repeated four times (k_max = 4).
FIG. 6 is a graph comparing the impulse response before the processing according to the present embodiment (FIG. 6A) and the impulse response after the dereverberation processing according to the present embodiment (FIG. 6B). Here, the horizontal axis in FIG. 6 represents time (seconds), and the vertical axis represents the impulse response intensity. As can be seen by paying attention to the reverberation component after the broken line A drawn in the vicinity of 0.1 second, the rear reverberation component is not sufficiently removed before the processing according to the present embodiment, but after the processing according to the present embodiment. The rear reverberation component is almost completely removed.

次に、本形態の効果を音声認識結果によって評価した。音響モデルは、文献「K. Kinoshita, et al., "Improving automatic speech recognition performance and speech intelligibility with harmonicity based dereverberation," Proc. of International Conference on Spoken Language Processing (ICSLP), 2004.」で提案されているマルチコンディションモデルと、Cepstral Mean Normalization (CMN)（B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of Acoustical Society of America, 55(6), pp. 1304-1312, 1974.）を用いた。CMNは残響除去処理後にも残っているインパルス応答の前半部分の抑圧に効果的に働く。 Next, the effect of this embodiment was evaluated by speech recognition results. As the acoustic model, the multi-condition model proposed in K. Kinoshita, et al., "Improving automatic speech recognition performance and speech intelligibility with harmonicity based dereverberation," Proc. of International Conference on Spoken Language Processing (ICSLP), 2004, was used together with Cepstral Mean Normalization (CMN) (B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," Journal of Acoustical Society of America, 55(6), pp. 1304-1312, 1974). CMN works effectively to suppress the first half of the impulse response that remains after the dereverberation process.

図７は、この音声認識結果を示した図である。ここで、図７（ａ）は、女声の発話の音声認識結果を示し、図７（ｂ）は、男声の発話の音声認識結果を示している。また、横軸は残響時間（秒）であり、縦軸は単語正解精度（％）である。さらに、これらのグラフの粗い鎖線は、クリーン音声を認識した場合の単語正解精度を示しており、システム上の単語正解精度の上限（限界性能）を示している。また、実線は、本発明の第１の実施の形態を適用した場合の単語正解精度を示し、細かい鎖線は、残響除去処理を行わない場合の単語正解精度を示している。これらの図に示すように、処理なしの場合は非常に低い認識性能であるのに対し、本形態の残響除去を適用すると、限界性能に非常に近い値まで音声認識率が改善した。 FIG. 7 is a diagram showing the speech recognition results. FIG. 7A shows the results for the female speaker's utterances, and FIG. 7B shows the results for the male speaker's utterances. The horizontal axis represents reverberation time (seconds), and the vertical axis represents word accuracy (%). The coarse chain line in these graphs indicates the word accuracy obtained when clean speech is recognized, that is, the upper limit (limit performance) of word accuracy for the system. The solid line indicates the word accuracy when the first embodiment of the present invention is applied, and the fine chain line indicates the word accuracy when no dereverberation process is performed. As these figures show, recognition performance is very low without processing, whereas applying the dereverberation of this embodiment improves the speech recognition rate to a value very close to the limit performance.

次に、残響除去処理を行わない場合と、本発明の第１の実施の形態によって残響除去処理を行った場合と、従来の残響除去法（特許文献１）によって残響除去を行った場合との音声認識結果の比較を示す。図８は、これらの音声認識結果を示した図である。なお、
この図の横軸は残響時間（秒）であり、縦軸は単語正解精度（％）である。さらに、これらのグラフの二点鎖線は、システム上の単語正解精度の上限（限界性能）を示し、細い実線は、残響除去処理を行わない場合の単語正解精度を示している。また、太い実線は、１５秒の学習データを用いて本発明の第１の実施の形態を適用した場合の単語正解精度を示し、細かい鎖線は、６０分の学習データを用いて従来の残響除去法（特許文献１）を適用した場合の単語正解精度を示している。この図に示すとおり、本形態では１５秒という少ない学習データしか用いていないにも関わらず、６０分もの学習データを用いた従来法（特許文献１）と同程度の音声認識率を達成している。これは、本形態では、従来に比べて格段に少ないデータ量で精度の高い残響除去が可能であることを示している。 Next, a case where dereverberation processing is not performed, a case where dereverberation processing is performed according to the first embodiment of the present invention, and a case where dereverberation is performed by the conventional dereverberation method (Patent Document 1). A comparison of speech recognition results is shown. FIG. 8 is a diagram showing these speech recognition results. In addition,
In this figure, the horizontal axis represents reverberation time (seconds), and the vertical axis represents word accuracy (%). The two-dot chain line in the graph indicates the upper limit (limit performance) of word accuracy for the system, and the thin solid line indicates the word accuracy when no dereverberation process is performed. The thick solid line indicates the word accuracy when the first embodiment of the present invention is applied using 15 seconds of learning data, and the fine chain line indicates the word accuracy when the conventional dereverberation method (Patent Document 1) is applied using 60 minutes of learning data. As the figure shows, although this embodiment uses only 15 seconds of learning data, it achieves a speech recognition rate comparable to that of the conventional method (Patent Document 1) using 60 minutes of learning data. This indicates that this embodiment enables highly accurate dereverberation with a far smaller amount of data than before.

〔第２の実施の形態〕
次に、本発明における第２の実施の形態について説明する。
本形態は第１の実施の形態の変形例である。第１の実施の形態では、周波数領域で推定逆フィルタを音響信号に乗じて残響除去信号を算出していた。しかし、第２の実施の形態では、時間領域で推定逆フィルタを音響信号に乗じて残響除去信号を算出する。以下では、第１の実施の形態との相違点のみを説明し、第１の実施の形態と共通する事項については説明を省略する。 [Second Embodiment]
Next, a second embodiment of the present invention will be described.
This embodiment is a modification of the first embodiment. In the first embodiment, the dereverberation signal is calculated by multiplying the acoustic signal by the estimated inverse filter in the frequency domain. However, in the second embodiment, the dereverberation signal is calculated by multiplying the acoustic signal by the estimated inverse filter in the time domain. In the following, only differences from the first embodiment will be described, and description of matters common to the first embodiment will be omitted.

＜構成＞
図９は、コンピュータに所定のプログラムが読み込まれることにより構成される残響除去装置１００の機能構成を例示したブロック図である。なお、図９において第１の実施の形態と共通する部分には、図１と同じ符号を付した。
図９に示すように、本形態の残響除去装置１００の第１の実施の形態との構成上の相違点は、残響除去フィルタ乗算部１０ｈの代わりに残響除去フィルタ乗算部１１０ｈが設けられ、時間領域変換部１０ｉの代わりに時間領域変換部１１０ｉが設けられる点である。 <Configuration>
FIG. 9 is a block diagram illustrating a functional configuration of the dereverberation apparatus 100 configured by reading a predetermined program into a computer. In FIG. 9, the same reference numerals as those in FIG. 1 are assigned to portions common to the first embodiment.
As shown in FIG. 9, the dereverberation apparatus 100 of the present embodiment differs in configuration from the first embodiment in that a dereverberation filter multiplication unit 110h is provided instead of the dereverberation filter multiplication unit 10h, and a time domain conversion unit 110i is provided instead of the time domain conversion unit 10i.

＜処理＞
図１０は、第２の実施の形態の残響除去処理を説明するためのフローチャートである。以下、この図と図９とを用いて、本形態の残響除去処理を説明していく。
ステップＳ３１からＳ３６の処理は、第１の実施の形態のステップＳ１からＳ６（図３）と同じであるため説明を省略する。
ステップＳ３６の処理の後、時間領域変換部１１０ｉは、記憶部１０ａから、記憶部１０ａから推定逆フィルタW(f)を読み込み、これを逆フーリエ変換によって時間領域の推定逆フィルタw(f)に変換して記憶部１０ａに格納する（ステップＳ３７）。
次に、残響除去フィルタ乗算部１１０ｈが、記憶部１０ａから、時間領域の推定逆フィルタw(f)及び音響信号x(n)を読み込み、推定逆フィルタw(f)に音響信号x(n)を乗じた時間領域の残響除去信号y(n)を算出し、出力する（ステップＳ３８）。
このような構成でも第１の実施の形態と同様な効果を得ることができる。 <Processing>
FIG. 10 is a flowchart for explaining the dereverberation process according to the second embodiment. Hereinafter, the dereverberation process of this embodiment will be described with reference to FIG. 10 and FIG. 9.
Since the processing of steps S31 to S36 is the same as that of steps S1 to S6 (FIG. 3) of the first embodiment, description thereof is omitted.
After the process of step S36, the time domain conversion unit 110i reads the estimated inverse filter W(f) from the storage unit 10a, converts it into the time-domain estimated inverse filter w(f) by inverse Fourier transform, and stores it in the storage unit 10a (step S37).
Next, the dereverberation filter multiplication unit 110h reads the time-domain estimated inverse filter w(f) and the acoustic signal x(n) from the storage unit 10a, multiplies the estimated inverse filter w(f) by the acoustic signal x(n) to calculate the time-domain dereverberation signal y(n), and outputs it (step S38).
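Multiplying a spectrum by a filter in the frequency domain corresponds to convolving the signal with the filter's impulse response in the time domain. Assuming that step S38 is realized as such a convolution (the text itself says only "multiply"), a minimal Python sketch looks like this; the filter taps and signal values are hypothetical.

```python
def convolve(w, x):
    """Direct linear convolution of filter taps w with signal x."""
    y = [0.0] * (len(w) + len(x) - 1)
    for i, wi in enumerate(w):
        for j, xj in enumerate(x):
            y[i + j] += wi * xj
    return y

# Hypothetical example: x is a unit pulse followed by a decaying echo,
# and w_time is a two-tap filter that cancels that decay.
w_time = [1.0, -0.5]
x = [1.0, 0.5, 0.25]

y = convolve(w_time, x)
```

In this toy case the echo samples cancel and only the leading pulse plus a small truncation term at the end remains.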
Even with such a configuration, the same effects as those of the first embodiment can be obtained.

〔第３の実施の形態〕
次に、本発明における第３の実施の形態について説明する。
本形態も第１の実施の形態の変形例である。第１の実施の形態のように逆フィルタを用いることなく、後部残響除去部１０ｄでの演算結果をそのまま残響除去信号とする点が相違点である。以下では、第１の実施の形態との相違点のみを説明し、第１の実施の形態と共通する事項については説明を省略する。 [Third Embodiment]
Next, a third embodiment of the present invention will be described.
This embodiment is also a modification of the first embodiment. The difference is that the calculation result in the rear dereverberation unit 10d is directly used as the dereverberation signal without using an inverse filter as in the first embodiment. In the following, only differences from the first embodiment will be described, and description of matters common to the first embodiment will be omitted.

＜構成＞
図１１は、コンピュータに所定のプログラムが読み込まれることにより構成される残響除去装置２００の機能構成を例示したブロック図である。なお、図１１において第１の実施の形態と共通する部分には、図１と同じ符号を付した。
図１１に示すように、本形態の残響除去装置２００の第１の実施の形態との構成上の相違点は、周波数領域変換部１０ｆ、残響除去フィルタ計算部１０ｇ、残響除去フィルタ上残部及び時間領域変換部１０ｉが存在しない点である。 <Configuration>
FIG. 11 is a block diagram illustrating a functional configuration of a dereverberation apparatus 200 configured by reading a predetermined program into a computer. In FIG. 11, the same reference numerals as those in FIG. 1 are given to portions common to the first embodiment.
As shown in FIG. 11, the dereverberation apparatus 200 of the present embodiment differs in configuration from the first embodiment in that the frequency domain conversion unit 10f, the dereverberation filter calculation unit 10g, the dereverberation filter multiplication unit 10h, and the time domain conversion unit 10i are absent.

＜処理＞
図１２は、第３の実施の形態の残響除去処理を説明するためのフローチャートである。以下、この図と図１１とを用いて、本形態の残響除去処理を説明していく。
ステップＳ４１からＳ４４の処理は、第１の実施の形態のステップＳ１からＳ４（図３）とほぼ同じである。相違点は、ステップＳ４４において時間領域変換部１０ｅが算出した直接音成分と初期反射音成分との時間領域の推定値s^{^}(n)を残響除去結果として出力する点である。
このような構成としても、少ない学習データのみを用い、ある程度の精度の残響除去が可能である。 <Processing>
FIG. 12 is a flowchart for explaining the dereverberation process according to the third embodiment. Hereinafter, the dereverberation process of this embodiment will be described with reference to FIG. 12 and FIG. 11.
The processing of steps S41 to S44 is substantially the same as steps S1 to S4 (FIG. 3) of the first embodiment. The difference is that the time-domain estimated value $\hat{s}(n)$ of the direct sound component and the initial reflected sound component, calculated by the time domain conversion unit 10e in step S44, is output as the dereverberation result.
Even with such a configuration, reverberation can be removed with a reasonable degree of accuracy using only a small amount of learning data.

〔変形例等〕
なお、本発明は上述の各実施の形態に限定されるものではない。例えば、上述の各実施の形態では、後部残響パワー推定部１０ｃが、音響信号のパワーの最小値を、当該音響信号の後部残響音成分のパワー推定値として抽出していた。しかし、後部残響パワー推定部１０ｃが、音響信号のパワーの「擬似最小値」を音響信号の後部残響音成分のパワー推定値として抽出する構成としてもよい。ここで、擬似最小値とは、最小値に準ずる値を意味する。例えば、後部残響パワー推定部１０ｃが、ある閾値以下の音響信号のパワーを選択し、それを後部残響音成分のパワー推定値としてもよい。また、所定数の音響信号のパワーを比較した時点で最小と判断された音響信号のパワーを、後部残響音成分のパワー推定値としてもよい。さらに、後部残響パワー推定部１０ｃが、所定の時間範囲内において、音響信号のパワーを小さいものから順に並び替え、先頭から所定の順位内にある何れかを、後部残響音成分のパワー推定値としてもよい。その他、音響信号のパワーが小さいものを、音響信号の後部残響音成分のパワー推定値として抽出する、という本発明の趣旨に合致した他の方法を用いて、後部残響音成分のパワー推定値を抽出してもよい。 [Modifications, etc.]
The present invention is not limited to the embodiments described above. For example, in each of the above-described embodiments, the rear reverberation power estimation unit 10c extracts the minimum value of the power of the acoustic signal as the power estimation value of the rear reverberation component of the acoustic signal. However, the rear reverberation power estimation unit 10c may instead extract a "pseudo minimum value" of the power of the acoustic signal as the power estimation value of the rear reverberation component. Here, the pseudo minimum value means a value that is close to the minimum value. For example, the rear reverberation power estimation unit 10c may select a power of the acoustic signal that is equal to or lower than a certain threshold and use it as the power estimation value of the rear reverberation component. Alternatively, the power judged to be the smallest at the point when a predetermined number of acoustic signal powers have been compared may be used as the power estimation value of the rear reverberation component. Further, the rear reverberation power estimation unit 10c may sort the powers of the acoustic signal within a predetermined time range in ascending order and use any one of those within a predetermined rank from the top as the power estimation value of the rear reverberation component. The power estimation value of the rear reverberation component may also be extracted by any other method consistent with the gist of the present invention, namely, extracting a small power of the acoustic signal as the power estimation value of its rear reverberation component.
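One of the pseudo minimum variants above, sorting the powers in a time range and taking an entry within a given rank, can be sketched as follows. The function name, the zero-based `rank` parameter, and the sample window are hypothetical.

```python
def pseudo_minimum(values, rank):
    """Return the entry at zero-based `rank` after sorting in ascending order."""
    return sorted(values)[rank]

# Hypothetical powers of one frequency bin within a predetermined time range.
window = [3.0, 0.2, 5.0, 0.9, 1.5]

true_minimum = pseudo_minimum(window, rank=0)     # the ordinary minimum
second_smallest = pseudo_minimum(window, rank=1)  # one possible pseudo minimum
```

Choosing a rank above zero makes the estimate less sensitive to a single unusually small frame, at the cost of a slightly higher estimate.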

また、上記の実施の形態では、後部残響パワー推定部１０ｃが、平滑化した音響信号のパワーを用いて後部残響音成分のパワー推定値を抽出していたが、この平滑化を行わない構成としてもよい。
さらに、後部残響パワー推定部１０ｃの処理を時間領域で実行してもよい。
また、上記の実施の形態では、残響除去フィルタ計算部１０ｇが、フレーム毎に推定した逆フィルタを平均したものを最終的な逆フィルタとして推定していた。しかし、何れかのフレームで算出した逆フィルタを、そのまま最終的な逆フィルタの推定としてもよい。 Further, in the above embodiments, the rear reverberation power estimation unit 10c extracts the power estimation value of the rear reverberation component using the power of the smoothed acoustic signal, but a configuration that omits this smoothing may also be used.
Further, the processing of the rear reverberation power estimation unit 10c may be executed in the time domain.
In the above embodiment, the dereverberation filter calculation unit 10g estimates the average of the inverse filters estimated for each frame as the final inverse filter. However, the inverse filter calculated in any frame may be used as the final inverse filter estimation as it is.
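The choice between averaging the per-frame inverse filters and using a single frame can be sketched as below. The per-frame estimate used here, the ratio of the reference spectrum to the observed spectrum with a small regularizer `eps`, is an assumption made for illustration; this passage does not give the formula, and all names are hypothetical.

```python
import numpy as np

def per_frame_inverse_filters(observed, reference, eps=1e-12):
    """Per-frame inverse filter estimates, shape (frames, bins).

    Assumed form: reference / observed, computed in a numerically safe way.
    `observed` and `reference` are complex spectra of shape (frames, bins).
    """
    observed = np.asarray(observed, dtype=complex)
    reference = np.asarray(reference, dtype=complex)
    return reference * np.conj(observed) / (np.abs(observed) ** 2 + eps)

def final_inverse_filter(filters, frame=None):
    """Average over frames by default, or take one frame's filter as is."""
    filters = np.asarray(filters)
    if frame is None:
        return filters.mean(axis=0)
    return filters[frame]
```

Averaging tends to reduce the variance of the estimate across frames, while the single-frame option avoids storing and combining all per-frame filters.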

Furthermore, the first embodiment showed an example in which accuracy is improved by repeating the dereverberation process, but a configuration without this repetition may also be employed.
In the above embodiments, the conversion from the time domain to the frequency domain is performed by the Fourier transform. However, this conversion may instead be performed by a wavelet transform, a filter bank, or the like.
In addition, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the apparatus that executes them or as otherwise required. Furthermore, the functions of the dereverberation apparatus may be distributed across multiple locations.
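As a rough illustration of how the time-to-frequency conversion mentioned above could be kept interchangeable, the following sketch computes per-frame power through a pluggable transform. The function names, the Hann window, and the framing scheme are assumptions for this example and are not taken from the embodiments.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, shape (frames, frame_len)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def fourier_power(frames):
    """Power spectrum of each frame via a windowed real FFT."""
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1)) ** 2

def subband_power(x, frame_len, hop, transform=fourier_power):
    """Per-frame power using any transform that maps frames to power values."""
    return transform(frame_signal(x, frame_len, hop))
```

A filter bank or wavelet analysis could be supplied through the `transform` argument, provided it returns one row of power values per frame.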

Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.
When the above-described configuration is realized by a computer, the processing contents of the functions that each apparatus should have are described by a program. The processing functions are then realized on the computer by executing this program on the computer.
The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable) / RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.
As another mode of executing the program, the computer may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. Alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and acquisition of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer. However, at least part of these processing contents may be realized in hardware.

The present invention can be used as an elemental technology in a variety of acoustic signal processing systems and improves the performance of the system as a whole. Examples of acoustic systems in which dereverberation of spoken speech signals can contribute to better performance as an elemental technology include the following. Sound recorded in a real environment always contains reverberation (reflected sound), and the systems listed below are examples assumed to be used under such conditions.
1. A speech recognition system for reverberant environments.
2. A music information processing system that removes reverberation from music sung by a person, played on an instrument, or played through a loudspeaker, in order to search for songs or transcribe them.
3. A machine control interface that passes commands to a machine in response to sounds made by a person, and a dialogue device between machines and humans.
4. A hearing aid that improves ease of listening by removing reverberation in a reverberant environment.
5. A communication system, such as a TV conference system, that improves speech intelligibility through dereverberation.

FIG. 1 is a block diagram illustrating the functional configuration of the dereverberation apparatus according to the first embodiment.
FIG. 2 is a block diagram illustrating the hardware configuration of the dereverberation apparatus according to the first embodiment.
FIG. 3 is a flowchart for explaining the dereverberation process of the first embodiment.
FIG. 4(a) is a flowchart for explaining the details of step S2 in FIG. 3. FIG. 4(b) is a flowchart for explaining the details of step S6 in FIG. 3.
FIG. 5 is a flowchart for explaining the iterative process of the first embodiment.
FIG. 6(a) is a graph showing the impulse response before processing according to the first embodiment. FIG. 6(b) is a graph showing the impulse response after the dereverberation process of this embodiment.
FIG. 7 shows speech recognition results after the dereverberation process of the first embodiment. FIG. 7(a) is a graph showing the speech recognition results for female speech. FIG. 7(b) is a graph showing the speech recognition results for male speech.
FIG. 8 compares speech recognition results for three cases: no dereverberation, dereverberation according to the first embodiment of the present invention, and dereverberation by a conventional dereverberation method (Patent Document 1).
FIG. 9 is a block diagram illustrating the functional configuration of the dereverberation apparatus according to the second embodiment.
FIG. 10 is a flowchart for explaining the dereverberation process of the second embodiment.
FIG. 11 is a block diagram illustrating the functional configuration of the dereverberation apparatus according to the third embodiment.
FIG. 12 is a flowchart for explaining the dereverberation process of the third embodiment.
FIG. 13 is a block diagram for explaining a conventional example of a dereverberation method (see Patent Document 1).

Explanation of symbols

10, 100, 200: Dereverberation apparatus

Claims

A dereverberation apparatus that removes reverberation from an acoustic signal, comprising:
a rear reverberation power estimation unit that extracts the minimum value or a pseudo-minimum value of the power of the acoustic signal as a power estimate of a rear reverberation sound component of the acoustic signal; and
a rear dereverberation unit that calculates estimates of a direct sound component and an initial reflected sound component of the acoustic signal using the power estimate of the rear reverberation sound component.

The dereverberation apparatus according to claim 1, further comprising:
a dereverberation filter calculation unit that calculates an estimated inverse filter using the estimates of the direct sound component and the initial reflected sound component as reference signals; and
a dereverberation filter multiplication unit that multiplies the estimated inverse filter by the power of the acoustic signal and outputs the result of the operation.

The dereverberation apparatus according to claim 1 or 2, wherein
the rear reverberation power estimation unit extracts the power estimate of the rear reverberation sound component of the acoustic signal using a frequency-domain acoustic signal.

The dereverberation apparatus according to claim 1 or 2, wherein
the rear reverberation power estimation unit extracts the power estimate of the rear reverberation sound component of the acoustic signal using the smoothed acoustic signal.

The dereverberation apparatus according to claim 1 or 2, wherein
the rear dereverberation unit calculates the estimates of the direct sound component and the initial reflected sound component of the acoustic signal from the power of the acoustic signal and the power estimate of the rear reverberation sound component by spectral subtraction.

The dereverberation apparatus according to claim 2, wherein
the rear reverberation power estimation unit extracts the power estimate of the rear reverberation sound component of the acoustic signal using a frequency-domain acoustic signal,
the dereverberation filter calculation unit calculates the estimated inverse filter using frequency-domain estimates of the direct sound component and the initial reflected sound component as reference signals, and
the frame length of the signal handled by the rear reverberation power estimation unit is shorter than the frame length of the signal handled by the dereverberation filter calculation unit.

The dereverberation apparatus according to claim 6, wherein
the frame length of the signal handled by the dereverberation filter calculation unit is longer than the impulse response length of the acoustic signal.

The dereverberation apparatus according to claim 2, wherein
the dereverberation filter calculation unit calculates an estimated inverse filter for each frame and outputs the average of the estimated inverse filters over the frames, and
the dereverberation filter multiplication unit multiplies the estimated inverse filter averaged over the frames by the acoustic signal and outputs the result of the operation.

The dereverberation apparatus according to claim 2, further comprising
a control unit that takes the output value of the dereverberation filter multiplication unit as the acoustic signal and causes the rear reverberation power estimation unit, the rear dereverberation unit, the dereverberation filter calculation unit, and the dereverberation filter multiplication unit to each execute their processing.

A dereverberation method for removing reverberation from an acoustic signal, comprising:
a step in which a rear reverberation power estimation unit extracts the minimum value or a pseudo-minimum value of the power of the acoustic signal as a power estimate of a rear reverberation sound component of the acoustic signal; and
a step in which a rear dereverberation unit calculates estimates of a direct sound component and an initial reflected sound component of the acoustic signal using the power estimate of the rear reverberation sound component.

The dereverberation method according to claim 10, further comprising:
a step in which a dereverberation filter calculation unit calculates an estimated inverse filter using the estimates of the direct sound component and the initial reflected sound component as reference signals; and
a step in which a dereverberation filter multiplication unit multiplies the estimated inverse filter by the power of the acoustic signal and outputs the result of the operation.

A dereverberation program for causing a computer to function as the dereverberation apparatus according to any one of claims 1 to 9.

A computer-readable recording medium storing the program according to claim 12.