WO2006114100A1 - Estimation of signal from noisy observations - Google Patents

Estimation of signal from noisy observations

Info

Publication number
WO2006114100A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
noise
speech
estimation
model
Application number
PCT/DK2006/000220
Other languages
French (fr)
Inventor
Søren Vang ANDERSEN
Chunjian Li
Original Assignee
Aalborg Universitet
Application filed by Aalborg Universitet
Publication of WO2006114100A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques

Definitions

  • the invention relates to the field of signal processing, more specifically to processing aiming at estimating a signal from noisy observations, e.g. with the aim of noise reduction or of enhancing speech contained in a noisy signal.
  • the invention provides a method and a device, e.g. a headset, adapted to perform the method.
  • Noise reducing methods, i.e. methods aiming at processing a noisy signal with the purpose of suppressing the noise, are important parts of e.g. modern hearing aids, headsets, mobile phones and the like.
  • noise reduction techniques are used for speech enhancement, e.g. to improve speech intelligibility of speech contained in noise.
  • MMSE: minimum mean squared error (as in frequency domain block MMSE methods)
  • the invention provides a method for estimation of a signal from noisy observations, the method including employing a statistical model of signal dependencies in frequency domain.
  • the method of the first aspect is suited for noise reduction in noisy speech signals or for speech enhancement.
  • preferred embodiments provide a performance to complexity trade-off that makes them suited for resource-limited applications such as hearing aids by tuning a number of spectral components to be included in the estimate of each component.
  • the method includes the step of providing the statistical model by decomposing the noisy observations into a source model and a filter model, and preferably the source model includes performing a multi-pulse coding.
  • the signal dependencies in frequency domain include inter-frequency dependency, most preferably inter-frequency correlation, phase structure, and/or non- stationarity of the signal.
  • the method includes performing a linear minimum mean squared error estimation, preferably including performing a joint linear minimum mean squared error estimation of signal power spectra and phase spectra.
  • the invention provides a noise reduction method including performing the method according to the first aspect, and providing a noise suppressed signal based on an output therefrom.
  • the noise reduction method of the second aspect has the same advantages as mentioned for the first aspect, and it is understood that the preferred embodiments described for the first aspect apply for the second aspect as well.
  • the method is suited for a number of purposes where it is desired to perform a reduction of noise of a noisy signal. In general, the method is suited to reduce noise by processing a noisy signal, i.e. an information signal corrupted by noise, and returning a noise suppressed signal.
  • the signal may in general represent any type of data, e.g. audio data, image data, control signal data, data representing measured values etc. or any combination thereof. Due to the computational efficiency, the method is suited for on-line applications where limited signal processing power is available.
  • the invention provides a speech enhancement method including performing the noise reduction method according to the second aspect on speech present in noisy observations so as to enhance speech by suppressing noise.
  • the speech enhancement method of the third aspect has the same advantages as mentioned for the first and second aspects, and the preferred embodiments mentioned for the first aspect therefore also apply.
  • the speech enhancement method is suited for applications where a noisy audio signal containing speech is corrupted by noise.
  • the noise may be caused by electrical noise interfering with an electrical audio signal, or the noise may be acoustic noise such as introduced at the recording of the speech, e.g. a person speaking in a telephone at a place with traffic noise etc.
  • the speech enhancement method can then be used to increase speech intelligibility by enhancing the speech in relation to the noise.
  • the invention provides a device including a processor adapted to perform the method of any one of the first, second or third aspects.
  • the device may be: a mobile phone, a radio communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, or a monitoring system.
  • the device may be: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, or a headphone with a built-in microphone (so as to allow sound from the environments to reach the listener).
  • the invention provides a computer executable program code adapted to perform the method according to any one of the first, second or third aspects.
  • the program code may be present on a program carrier, e.g. a memory card, a disk etc. or in a RAM or ROM memory of a device.
  • Fig. 1 shows graphs illustrating multi-pulse linear prediction coding
  • Fig. 2 shows a block diagram of a preferred device. While the invention is susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
  • LMMSE: linear MMSE
  • the algorithm aims at a joint LMMSE estimation of signal power spectra and phase spectra, as well as exploitation of correlation between spectral components.
  • the major cause of this inter-frequency correlation is shown to be the prominent temporal power localization in the excitation of voiced speech.
  • LMMSE estimators in time domain and frequency domain are first formulated. To obtain the joint estimator, the spectral signal covariance matrix is modeled as a full matrix instead of the diagonal matrix used in the Wiener filter derived under the quasi-stationarity assumption.
  • the signal covariance matrix is decomposed into a source model (excitation filter matrix) and a filter model (synthesis filter matrix).
  • the synthesis filter matrix is built from estimates of coefficients based on an all-pole model, and the excitation matrix is built from estimates of the instantaneous power of the excitation sequence.
  • a decision-directed power spectral subtraction method and a modified multi-pulse linear predictive coding (MPLPC) method are used in these estimations, respectively.
  • the spectral domain formulation of the LMMSE estimator reveals important insight into inter-frequency correlations. This is exploited to significantly reduce computational complexity of the estimator, thus making the method suited for resource-limited applications such as hearing aids and the like. Performance-to-complexity trade-off can be conveniently adjusted by tuning the number of spectral components to be included in the estimate of each component.
  • the proposed algorithm is able to reduce more noise than a number of other approaches selected from the state of the art.
  • the proposed algorithm improves segmental signal to noise ratio of a noisy signal by 13 dB for a white noise case with an input of a 0 dB signal to noise ratio.
  • the frequency domain MMSE approach includes the non-causal IIR Wiener filter [20], the MMSE Short-Time Spectral Amplitude (MMSE-STSA) estimator [10], the MMSE Log-Spectral Amplitude (MMSE-LSA) estimator [11], the Constrained Iterative Wiener Filtering (CIWF) [15], and the MMSE estimator using non-Gaussian priors [21].
  • MMSE-STSA: MMSE Short-Time Spectral Amplitude estimator
  • MMSE-LSA: MMSE Log-Spectral Amplitude estimator
  • CIWF: Constrained Iterative Wiener Filtering
  • the quasi-stationarity assumption requires short time processing.
  • the assumption of uncorrelated spectral components can be warranted by assuming the signal to be infinitely long and wide-sense stationary [8] [14].
  • This infinite data length assumption is in principle violated when using the short-time processing, although the effect of this violation may be minor (and is not the major issue this paper addresses).
  • the wide-sense stationarity assumption within a short frame does not well model the prominent temporal power localization in the excitation source of voiced speech due to the impulse train structure. This temporal power localization within a short frame can be modeled as a non-stationarity of the signal that is not resolved by the short-time processing.
  • the TDC estimator is shown to be an LMMSE estimator with adjustable input noise level.
  • the known signal subspace based methods still assume stationarity within a short frame. This can be seen as follows.
  • in TDC and SDC, the noisy signal covariance matrices are estimated by time averaging of the outer product of the signal vector, which requires stationarity within the interval of averaging.
  • the TSVD method applies singular value decomposition to the signal matrix instead. This can be shown to be equivalent to the eigen decomposition of the time averaged outer product of signal vectors.
  • the known signal subspace methods implicitly avoid the infinite data length assumption, so that the inter-frequency correlation caused by the finite length effect is accommodated.
  • the more important cause of inter-frequency correlation, i.e., the non-stationarity within a frame, is not modeled.
  • the MMSE approaches attenuate more in the spectral valleys than the spectral subtraction methods do. Perceptually, this is beneficial for high pitch voiced speech, which has sparsely located spectral peaks that are not able to mask the spectral valley sufficiently.
  • the signal subspace methods in [12] are designed to shape the residual noise power spectrum for a better spectral masking, where the masking threshold is found experimentally. Auditory masking techniques have received increasing attention in recent research on speech enhancement [2,26,29]. While the majority of these works focus on spectral domain masking, the work in [24] shows the importance of the temporal masking property in connection with the excitation source of voiced speech. It is shown that noise between the excitation impulses is more perceivable than noise close to the impulses, and this is especially so for low pitch speech, for which the excitation impulses are located sparsely in time.
  • Both the frequency domain signal covariance matrix and the filtering matrix are estimated as complex-valued full matrices, which means that the information about inter-frequency correlation is not lost and the amplitude and phase spectra are estimated jointly.
  • a time domain or frequency domain LMMSE estimator is built upon the estimated signal covariance matrix. In the estimation of the signal covariance matrix, this matrix is decomposed into a synthesis filter matrix and an excitation matrix.
  • the synthesis filter matrix is estimated by a smoothed power spectral subtraction method followed by an autocorrelation Linear Predictive Coding (LPC) method.
  • LPC: Linear Predictive Coding
  • the excitation matrix is a diagonal matrix with the instantaneous power of the LPC residual as its diagonal elements.
  • the proposed LMMSE estimator results in a lower spectral distortion to the enhanced speech signal while having higher noise reduction capability.
  • the algorithm applies more attenuation in the valleys between pitch impulses in the time domain, while small attenuation is applied around the pitch impulses. This arrangement exploits the temporal masking effect, and results in a better preservation of abrupt rises of the waveform amplitude while maintaining a large amount of noise reduction.
  • Section 0.2 the notations and assumptions used in the derivation of LMMSE estimators are outlined.
  • Section 0.3 the non-stationary modeling of the signal covariance matrices is described.
  • the algorithm is summarized in Section 0.4.
  • Section 0.5 the computational complexity of the algorithm is reduced by identifying an interval of significant correlation and by simplifying the modified MPLPC procedure.
  • Experimental settings, objective, and subjective results are given in Section 0.6.
  • Section 0.7 discusses the obtained results.
  • y(n, k), s(n, k), v(n, k) denote the n'th sample of noisy observation, speech, and additive noise (uncorrelated with the speech signal) of the k'th frame, respectively.
  • y = [y(1, k), y(2, k), …, y(N, k)]^T is the noisy signal vector of the k'th frame, where N is the number of samples per frame.
  • C_s and C_v are the covariance matrices of the signal and the noise, respectively.
  • Y = Θ + V, (4) where again boldface letters represent vectors and the frame indices are omitted.
  • the excitation source models the glottal pulse train
  • the filter models the resonance property of the vocal tract.
  • the vocal tract can be viewed as a slowly varying part of the system. Typically in a duration of 20 to 30 ms it changes very little. The vocal folds vibrate at a faster rate producing periodic glottal flow pulses. Typically there can be 2 to 8 glottal pulses in 20 ms.
  • in speech coding it is common practice to model this pulse train by a long-term correlation pattern parameterized by a long-term predictor [4] [3] [5].
  • this model fails to describe the linear relationship between the phases of the harmonics.
  • the long term predictor alone does not model the temporal localization of power in the excitation source. Instead, we apply a time envelope that captures the localization and concentration of pitch pulse energy in the time domain. This, in turn, introduces an element of non-stationarity to our signal model because the excitation sequence is now modeled as a random sequence with time varying variance, i.e., the glottal pulses are modeled with higher variance and the rest of the excitation sequence is modeled with lower variance.
  • This modeling of non-stationarity within a short frame implies a temporal resolution much finer than that of the quasi-stationarity based algorithms. The latter have a temporal resolution equal to the frame length. Thus we term the former the high temporal resolution model. It is worth noting that some unvoiced phonemes, such as plosives, have very fast changing waveform envelopes, which also could be modeled as non-stationarity within the analysis frame.
  • the signal covariance matrix is usually estimated by averaging the outer product of the signal vector over time. As an example this is done in the signal subspace approach [12]. This method assumes ergodicity of the autocorrelation function within the averaging interval.
  • h(n) is the impulse response of the LPC synthesis filter.
  • Section 0.3.2 addresses the estimation of H. Note that (8) does not take into account the zero-input response of the filter in the previous frame. Either the zero-input response can be subtracted prior to the estimation of each frame, or a windowed overlap-add procedure can be applied to eliminate this effect.
  • This temporal envelope is an important part of the new MMSE estimator because it provides the information of uneven temporal power distribution. In the following two subsections, we will describe the estimation of the spectral envelope and the temporal envelope respectively.
  • the synthesis filter has a spectrum that is the envelope of the signal spectrum.
  • the noisy signal power spectrum |Y(k)|² is obtained by applying the DFT to the k'th observation vector y(k) and squaring the amplitudes.
  • the Decision Directed estimate of the signal power spectrum, |Θ̂(k)|², is a weighted sum of two parts: the power spectrum of the estimated signal of the previous frame, |Θ̂(k-1)|², and the power-spectrum-subtraction estimate of the current frame's power spectrum.
  • MPLPC was first introduced by Atal and Remde [4] to determine the impulse positions and amplitudes of the excitation in the context of analysis-by-synthesis linear predictive coding.
  • the principle is to represent the LPC residual with a few impulses in which the locations and amplitudes (gains) of the impulses are chosen such that the difference between the target signal and the synthesized signal is minimized.
  • the target signal will be the noisy signal and the synthesis filter must be estimated from the noisy signal.
  • the synthesis filter is treated as known.
  • For the residual of voiced speech there is usually one dominating impulse in each pitch period. We first determine one impulse per pitch period, then model the rest of the residual as a noise floor with constant variance. In MPLPC the impulses are found sequentially [17].
  • the first impulse location and amplitude are found by minimizing the distance between the synthesized signal and the target signal. The effect of this impulse is subtracted from the target signal and the same procedure is applied to find the next impulse. Because this way of finding impulses does not take into account the interaction between the impulses, re-optimization of the impulse amplitudes is necessary every time a new impulse is found (a simplified sketch of this sequential search is given after this list).
  • the number of pitch impulses p in a frame is determined in the following way. p is first assigned an initial value equal to the largest number of pitch periods possible in a frame. Then p impulses are determined using the above mentioned method. Only the impulses with an amplitude larger than a threshold are selected as pitch impulses.
  • the threshold is set to 0.5 times the largest impulse amplitude in this frame. Having determined the impulses, a white noise sequence representing the noise floor of the excitation sequence is added into the gain optimization procedure together with all the impulses.
  • a codebook of 1024 white Gaussian noise sequences was used.
  • the white noise sequence that yields the smallest synthesis error to the target signal is chosen to be the estimate of the noise floor.
  • This procedure is in fact a multi-stage coder with p impulse stages and one Gaussian codebook stage, with a joint re-optimization of gains. Detailed treatment of this optimization problem can be found in [22]. After the optimization, we use a flat envelope equal to the square of the optimized noise floor gain.
  • the MPLPC procedure can be interpreted as a non-linear Least Square fitting to the noisy signal, with the impulse positions and amplitudes as the model parameters.
  • the covariance matrix is used in the time LMMSE estimator (2) or in the spectral LMMSE estimator (5) after being transformed by (6).
  • the noise covariance matrix can be estimated using speech absent frames.
  • we assume the noise to be stationary.
  • if the noise is white, the covariance matrix C_v is diagonal with the noise variance as its diagonal elements.
  • for colored noise, the noise covariance matrix is no longer diagonal and it can be estimated using the time averaged outer product of the noise vector.
  • in the frequency domain, C_v is a diagonal matrix with the power spectral density of the noise as its diagonal elements. This is due to the assumed stationarity of the noise. In the special case where the noise is white, the diagonal elements all equal the variance of the noise.
  • voiced speech refers to phonemes that require excitation from vocal fold vibration
  • unvoiced speech consists of the rest of the phonemes.
  • voiced speech usually has most of its power concentrated in the low frequency band
  • unvoiced speech has a relatively flat spectrum within 0 to 4 kHz. Every frame is low pass filtered and then the filtered signal power is compared with the original signal power. If the power loss is more than a threshold, the frame is marked as an unvoiced frame; otherwise it is marked as voiced. Note however, that even for the unvoiced frames, the spectral covariance matrix is non-diagonal because the signal covariance matrix C_s, built in this way, is not Toeplitz.
  • we refer to the proposed approach as the Time-Frequency-Envelope MMSE (TFE-MMSE) estimator.
  • the TFE-MMSE estimators require inversion of a full covariance matrix C_s or C_Θ. This high computational load prohibits the algorithm from real time application in hearing aids. Noticing that both covariance matrices are symmetric and positive definite, Cholesky factorization can be applied to the covariance matrices, and the inversion can be done by inverting the Cholesky triangle (see the Cholesky sketch after this list).
  • a careful implementation requires N³/3 operations for the Cholesky factorization [13] and the algorithm complexity is O(N³).
  • Another computation intensive part of the algorithm is the modified MPLPC method. In this section we propose simplifications to these two parts.
  • the first half of the signal spectrum can be estimated segment by segment.
  • the second half of the spectrum is simply a flipped and conjugated version of the first half.
  • Other segmentation schemes are applicable, such as overlapping segments. It is also possible to use
  • the optimization of the impulse amplitudes and the gain of the noise floor brings a heavy computational load. It can be simplified by fixing the impulse shape and the noise floor level. In the simplified version, the MPLPC method is only used for searching for the locations of the p dominating impulses.
  • a predetermined pulse shape is put at each location.
  • An envelope of the noise floor is also predetermined.
  • the pulse shape is chosen to be wider than an impulse in order to gain robustness against estimation error of the impulse locations. This is helpful as long as noise is present.
  • the pulse shape used in our experiment is a raised cosine waveform with a period of 18 samples and the ratio between the pulse
  • TFE-MMSE estimator Objective performance of the TFE-MMSE estimator is first evaluated and compared with the Wiener filter [20], the MMSE-LSA estimator [11], and the signal subspace method TDC estimator [12].
  • for the TFE-MMSE estimator, both the complete algorithm and the simplified algorithms are evaluated.
  • the sampling frequency is 8 kHz, and the frame length is 128 samples with 50% overlap.
  • for the Wiener filter, we use the same Decision Directed method as in the MMSE-LSA and the TFE-MMSE estimator to estimate the PSD of the signal.
  • An important parameter for the Decision Directed method is the smoothing factor α. The larger α is, the more noise is removed but the more distortion is imposed on the signal, because of the heavier smoothing of the spectrum.
  • the white Gaussian noise is computer generated, and the pink noise is generated by filtering white noise with a filter having a 3 dB per octave spectral power descent.
  • the car noise is recorded inside a car driving at constant speed. Its spectrum is more low-pass than that of the pink noise.
  • the quality measures used include the SNR, the segmental SNR, and the Log-Spectral Distortion (LSD).
  • the SNR is defined as the ratio of the total signal power to the total noise power in the sentence.
  • the segmental SNR (segSNR) is defined as the average ratio of signal power to noise power per frame (see the segSNR sketch after this list).
  • segSNR: the average ratio of signal power to noise power per frame
  • frames with power lower than ε are not used in the calculation.
  • ε was set to 40 dB lower than the average power of the utterance.
  • the segSNR is commonly considered to be more correlated to perceived quality than the SNR measure.
  • the LSD is defined as in [27].
  • TFE-MMSE1 is the complete algorithm
  • the TDC also suffers from a musical residual noise.
  • To suppress its residual noise level to as low as that of the TFE-MMSE with α = 0.999, the TDC requires a μ larger than 8. This causes a sharp degradation of the SNR and LSD, and results in very muffled sounds.
  • To verify the perceived quality of the TFE-MMSE3 subjectively, preference tests between the TFE-MMSE3 and the WF, and between the TFE-MMSE3 and the MMSE-LSA, are conducted.
  • the WF and the MMSE-LSA use their best value of the smoothing factor (α = 0.98).
  • the test is confined to white Gaussian noise and a limited range of SNRs. Three sentences by male speakers and three by female speakers at each SNR level are used in the test. Eight inexperienced listeners are asked to vote for their preferred method based on the amount of noise reduction and speech distortion. The utterances are presented to the listeners through high quality headphones.
  • the clean utterance is first played as a reference, and the enhanced utterances are played once, or more if the listener finds this necessary.
  • Tables 1 and 2 show that: 1) at 10 dB and 15 dB the listeners clearly prefer the TFE-MMSE over the two reference methods, while at 5 dB the preference for the TFE-MMSE is unclear; 2) the TFE-MMSE method has a more significant impact on the processing of male speech than on the processing of female speech.
  • the speech enhanced by TFE-MMSE3 has barely audible background noise, and the speech sounds less muffled than with the reference methods. There is one artifact heard on rare occasions that we believe is caused by remaining musical tones.
  • TAB 1 - Preference test between WF and TFE-MMSE3 with additive white Gaussian noise.
  • TAB 2 - Preference test between MMSE-LSA and TFE-MMSE3 with additive white Gaussian noise.
  • the TFE-MMSE3 estimator has the best performance in all the three objective measures (SNR, segSNR, and LSD).
  • the TFE-MMSE3 is the second in SNR, the best in LSD, and among the best in segSNR.
  • the TFE-MMSE3 estimator allows a high degree of suppression to the noise while maintaining low distortion to the signal.
  • the speech enhanced by the TFE-MMSE3 has a very clean background and a certain speech dependent residual noise. When the SNR is high (10 dB and above), this speech dependent noise is very well masked by the speech, and the resulting speech sounds clean and clear.
  • male voiced speech and female voiced speech are found to have different masking properties in the auditory system.
  • the auditory system is sensitive to high frequency noise in the valleys between the pitch pulse peaks in the time domain.
  • the auditory system is sensitive to low frequency noise in the valleys between the harmonics in the spectral domain. While the time domain valleys for the male speech are attenuated effectively, the spectral valleys for the female speech are not attenuated enough; a comb filter could help to remove the roughness in the female voiced speech.
  • the estimate of the residual power envelope contains information about the uneven distribution of signal power in time.
  • the TFE-MMSE estimator represents the sudden rises of amplitude better than the Wiener filter.
  • phase noise is audible when the source SNR is very low.
  • a threshold of phase perception is found. This phase-noise tolerance threshold corresponds to an
  • the TFE-MMSE estimator has the ability to enhance phase spectra because of its ability to estimate the temporal localization of residual power. It is the linearity in the phase of the harmonics in the residual that makes the power concentrate at periodic time instances, thus producing pitch pulses. Estimating the temporal envelope of the residual power enhances the linearity of the harmonic phases.
  • Fig. 2 illustrates a block diagram of a preferred device embodiment.
  • the illustrated device may be such as a mobile phone, a headset or a part thereof.
  • the device is adapted to receive a noisy signal, e.g. an electrical analog or digital signal representing an audio signal containing speech and unintended noise.
  • the device includes a digital signal processor DSP that performs a signal processing on the noisy signal.
  • the signal estimation method is performed, preferably including a block based linear MMSE method employing a statistical model of correlation between spectral components such as described in the foregoing.
  • the signal estimation method serves as input to a noise reduction method as will also be understood from the foregoing.
  • the output of the noise reduction method is a signal where the speech is enhanced in relation to the noise. This signal with enhanced speech is applied to a loudspeaker, preferably via an amplifier, so as to present an acoustic representation of the speech enhanced signal to a listener.
  • the device in Fig. 2 may be a hearing aid, a headset or a mobile phone or the like.
  • the DSP may either be built into the headset, or the DSP may be positioned remote from the headset, e.g. built into other equipment such as amplifier equipment.
  • the noisy signal can originate from a remote audio source or from a microphone built into the hearing aid.
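
The following Python sketches illustrate three of the building blocks referenced in the bullets above. They are minimal illustrations under stated assumptions, not the patent's reference implementation; all function names and parameter defaults are illustrative choices. First, the sequential impulse search of the (modified) MPLPC procedure; the joint re-optimization of all gains and the Gaussian noise floor codebook stage are omitted for brevity:

```python
import numpy as np

def find_pitch_impulses(target, h, p, thresh=0.5):
    """Greedily place p impulses so that the synthesized signal approximates
    the target frame; keep those above 0.5 times the largest amplitude."""
    N = len(target)
    h = np.concatenate([np.asarray(h, dtype=float), np.zeros(max(0, N - len(h)))])[:N]
    resid = np.asarray(target, dtype=float).copy()
    pulses = []
    for _ in range(p):
        best_n, best_g, best_score = 0, 0.0, -np.inf
        for n in range(N):
            c = np.zeros(N)
            c[n:] = h[:N - n]              # response of a unit impulse at n
            cc = c @ c
            if cc == 0.0:
                continue
            g = (c @ resid) / cc           # least-squares optimal amplitude
            score = g * g * cc             # error reduction achieved at n
            if score > best_score:
                best_n, best_g, best_score = n, g, score
        c = np.zeros(N)
        c[best_n:] = h[:N - best_n]
        resid -= best_g * c                # remove this impulse's contribution
        pulses.append((best_n, best_g))
    a_max = max(abs(g) for _, g in pulses)
    return [(n, g) for n, g in pulses if abs(g) >= thresh * a_max]
```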
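Second, the Cholesky-based LMMSE solve: since C_s + C_v is symmetric positive definite, the estimate can be computed by factorization and triangular solves, at about N³/3 operations for the factorization, without ever forming an explicit inverse. A sketch using SciPy:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def lmmse_cholesky(y, C_s, C_v):
    """LMMSE estimate C_s (C_s + C_v)^{-1} y via Cholesky factorization."""
    c, low = cho_factor(C_s + C_v)         # C_s + C_v = L L^T
    return C_s @ cho_solve((c, low), y)
```

Restricting each spectral component's estimate to its interval of significant correlation, as proposed above, shrinks the effective matrix size and hence the cost further.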
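Third, the segmental SNR measure; the exact handling of the frame power threshold is an assumption based on the ε rule stated above:

```python
import numpy as np

def seg_snr(clean, enhanced, frame_len=128, floor_db=-40.0):
    """Average per-frame SNR in dB, excluding frames whose clean power is
    more than 40 dB below the average power of the utterance."""
    eps_power = np.mean(clean ** 2) * 10.0 ** (floor_db / 10.0)
    snrs = []
    for i in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[i:i + frame_len]
        e = enhanced[i:i + frame_len] - s      # residual noise in this frame
        ps, pe = np.mean(s ** 2), np.sum(e ** 2)
        if ps > eps_power and pe > 0:
            snrs.append(10.0 * np.log10(np.sum(s ** 2) / pe))
    if not snrs:
        return float("nan")
    return float(np.mean(snrs))
```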

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a method for estimation of a signal from noisy observations, the method including employing a statistical model of signal dependencies in frequency domain. Preferably the signal dependencies in frequency domain include inter-frequency correlation. Preferably the method provides the statistical model by decomposing the noisy observations into a source model and a filter model, and preferably the source model includes performing a multi-pulse coding. Preferred embodiments include performing a joint linear minimum mean squared error estimation of signal power spectra and phase spectra. The invention also provides a noise reduction method and a speech enhancement method utilizing the signal estimation method. In addition, the invention provides a device including a processor adapted to perform the defined methods. The device may be such as a hearing aid, a headset, a mobile phone or the like. The methods provide a low computation complexity and are thus suited for e.g. miniature equipment with limited signal processing power, since computation complexity can be adjusted by tuning the number of spectral components to be included in the estimate of each component.

Description

ESTIMATION OF SIGNAL FROM NOISY OBSERVATIONS
Field of the invention
The invention relates to the field of signal processing, more specifically to processing aiming at estimating a signal from noisy observations, e.g. with the aim of noise reduction or of enhancing speech contained in a noisy signal. The invention provides a method and a device, e.g. a headset, adapted to perform the method.
Background of the invention
Noise reducing methods, i.e. methods aiming at processing a noisy signal with the purpose of suppressing the noise, are important parts of e.g. modern hearing aids, headsets, mobile phones and the like. In such devices noise reduction techniques are used for speech enhancement, e.g. to improve speech intelligibility of speech contained in noise.
However, for applications within hearing aids and other miniature devices with limited signal processing power, a low complexity of the noise reduction algorithm is required to obtain a given amount of noise reduction that can be performed in real time.
In prior art a large number of single channel noise reduction methods exist. Earlier, spectral subtraction based methods were often used. However, in recent years two classes of methods have attracted attention since they provide superior performance compared to spectral subtraction methods: 1) frequency domain block minimum mean squared error (MMSE) based methods, and 2) signal subspace based methods. The MMSE based methods all rely on an assumption of quasi-stationarity and an assumption of uncorrelated spectral components in the signal, i.e. short time processing is required. Signal subspace based methods also assume stationarity within a short frame.
For example, for noise reduction of noisy signals containing speech, the short time processing is a disadvantage since voiced speech cannot be properly modelled, and thus existing methods do not offer optimal signal estimation for speech signals.
Summary of the invention
Thus, it may be seen as an object of the present invention to overcome the mentioned problem with prior art signal estimation methods, which are not suited for modelling voiced parts of speech due to the requirement of short time processing.
In a first aspect, the invention provides a method for estimation of a signal from noisy observations, the method including employing a statistical model of signal dependencies in frequency domain.
Exploiting signal dependencies in the frequency domain, especially correlation between spectral components, provides a method suited for voiced speech. This is caused by the prominent temporal power localization in the excitation of voiced speech. Thus the method of the first aspect is suited for noise reduction in noisy speech signals or for speech enhancement. In addition, preferred embodiments provide a performance to complexity trade-off that makes them suited for resource-limited applications such as hearing aids by tuning the number of spectral components to be included in the estimate of each component.
It is to be understood that by "in frequency domain" is also included implementations of the method where the statistical model of signal dependencies is employed in time domain, since it is well-known that there is a duality between time and frequency domain and equivalent operations can be performed in either of these domains by proper transformation between the domains.
In preferred embodiments, the method includes the step of providing the statistical model by decomposing the noisy observations into a source model and a filter model, and preferably the source model includes performing a multi-pulse coding. This is advantageous for signal processing implementation purposes, since this operation results in a non-Toeplitz temporal signal covariance or a non-diagonal spectral signal covariance matrix in contrast to prior art quasi-stationary algorithms that result in a Toeplitz matrix.
Preferably the signal dependencies in frequency domain include inter-frequency dependency, most preferably inter-frequency correlation, phase structure, and/or non- stationarity of the signal.
Preferably the method includes performing a linear minimum mean squared error estimation, preferably including performing a joint linear minimum mean squared error estimation of signal power spectra and phase spectra.
In a second aspect, the invention provides a noise reduction method including performing the method according to the first aspect, and providing a noise suppressed signal based on an output therefrom.
Thus, the noise reduction method of the second aspect has the same advantages as mentioned for the first aspect, and it is understood that the preferred embodiments described for the first aspect apply for the second aspect as well.
The method is suited for a number of purposes where it is desired to perform a reduction of noise of a noisy signal. In general, the method is suited to reduce noise by processing a noisy signal, i.e. an information signal corrupted by noise, and returning a noise suppressed signal. The signal may in general represent any type of data, e.g. audio data, image data, control signal data, data representing measured values etc. or any combination thereof. Due to the computational efficiency, the method is suited for on-line applications where limited signal processing power is available.
In a third aspect, the invention provides a speech enhancement method including performing the noise reduction method according to the second aspect on speech present in noisy observations so as to enhance speech by suppressing noise.
Thus, being based on the first and second aspects, the speech enhancement method of the third aspect has the same advantages as mentioned for the first and second aspects, and the preferred embodiments mentioned for the first aspect therefore also apply.
The speech enhancement method is suited for application where a noisy audio signal containing speech is corrupted by noise. The noise may be caused by electrical noise interfering with an electrical audio signal, or the noise may be acoustic noise such as introduced at the recording of the speech, e.g. a person speaking in a telephone at a place with traffic noise etc. The speech enhancement method can then be used to increase speech intelligibility by enhancing the speech in relation to the noise.
In a fourth aspect the invention provides a device including a processor adapted to perform the method of any one of the first, second or third aspects. Thus, the advantages and embodiments mentioned for the first, second and third aspects therefore apply for the fourth aspect as well. Due to the computational efficiency of the proposed methods, the signal processing power of the processor is relaxed.
Especially, the device may be: a mobile phone, a radio communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, or a monitoring system.
Alternatively, the device may be: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, or a headphone with a built-in microphone (so as to allow sound from the environments to reach the listener).
In a fifth aspect, the invention provides a computer executable program code adapted to perform the method according to any one of the first, second or third aspects. Thus, the same advantages as mentioned for these aspects therefore apply.
The program code may be present on a program carrier, e.g. a memory card, a disk etc. or in a RAM or ROM memory of a device.
Brief description of the drawings
In the following the invention is described in more details with reference to the accompanying figures of which
Fig. 1 shows graphs illustrating multi-pulse linear prediction coding, and
Fig. 2 shows a block diagram of a preferred device. While the invention is susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and will be described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular forms disclosed. Rather, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.
Description of preferred embodiments
In the following, specific embodiments of the invention are illustrated in detail, referring to a preferred noise reduction/parametric speech enhancement algorithm based on block based linear MMSE (LMMSE) noise reduction with a high temporal resolution modelling of speech excitation. The algorithm aims at a joint LMMSE estimation of signal power spectra and phase spectra, as well as exploitation of correlation between spectral components. The major cause of this inter-frequency correlation is shown to be the prominent temporal power localization in the excitation of voiced speech. LMMSE estimators in time domain and frequency domain are first formulated. To obtain the joint estimator, the spectral signal covariance matrix is modeled as a full matrix instead of the diagonal matrix used in the Wiener filter derived under the quasi-stationarity assumption. To accomplish this, the signal covariance matrix is decomposed into a source model (excitation filter matrix) and a filter model (synthesis filter matrix). The synthesis filter matrix is built from estimates of coefficients based on an all-pole model, and the excitation matrix is built from estimates of the instantaneous power of the excitation sequence. A decision-directed power spectral subtraction method and a modified multi-pulse linear predictive coding (MPLPC) method are used in these estimations, respectively.
The spectral domain formulation of the LMMSE estimator reveals important insight into inter-frequency correlations. This is exploited to significantly reduce computational complexity of the estimator, thus making the method suited for resource-limited applications such as hearing aids and the like. Performance-to-complexity trade-off can be conveniently adjusted by tuning the number of spectral components to be included in the estimate of each component. Experiments show that the proposed algorithm is able to reduce more noise than a number of other approaches selected from the state of the art. The proposed algorithm improves segmental signal to noise ratio of a noisy signal by 13 dB for a white noise case with an input of a 0 dB signal to noise ratio.
A description of the specific embodiments now follows, referring also to Fig. 1 illustrating multi-pulse coding. Noise reduction is becoming an important function in hearing aids thanks to the application of powerful DSP hardware and the progress of noise reduction algorithm design. Noise reduction algorithms with a high performance-to-complexity ratio have been the subject of extensive research for many years. Among many different approaches, two classes of single-channel speech enhancement methods have attracted significant attention in recent years because of their better performance compared to the classic spectral subtraction methods (a comprehensive study of spectral subtraction methods can be found in [6]). These two classes are the frequency domain block based Minimum Mean Squared Error (MMSE) approach and the signal subspace approach. The frequency domain MMSE approach includes the non-causal IIR Wiener filter [20], the MMSE Short-Time Spectral Amplitude (MMSE-STSA) estimator [10], the MMSE Log-Spectral Amplitude (MMSE-LSA) estimator [11], the Constrained Iterative Wiener Filtering (CIWF) [15], and the MMSE estimator using non-Gaussian priors [21]. These MMSE algorithms all rely on an assumption of quasi-stationarity and an assumption of uncorrelated spectral components in the signal. The quasi-stationarity assumption requires short time processing. At the same time, the assumption of uncorrelated spectral components can be warranted by assuming the signal to be infinitely long and wide-sense stationary [8] [14]. This infinite data length assumption is in principle violated when using short-time processing, although the effect of this violation may be minor (and is not the major issue this paper addresses). More importantly, the wide-sense stationarity assumption within a short frame does not well model the prominent temporal power localization in the excitation source of voiced speech due to its impulse train structure. This temporal power localization within a short frame can be modeled as a non-stationarity of the signal that is not resolved by short-time processing. In [18], we show how voiced speech is advantageously modeled as non-stationary even within a short frame, and that this model implies significant inter-frequency correlations. As a consequence of the stationarity and long frame assumptions, the MMSE approaches model the frequency domain signal covariance matrix as a diagonal matrix. Another class of speech enhancement methods, the signal subspace approach, implicitly exploits part of the inter-frequency correlation by allowing the frequency domain signal covariance matrix to be non-diagonal. This class includes the Time Domain Constraint (TDC) linear estimator and Spectral Domain Constraint (SDC) linear estimator [12], and the Truncated Singular Value Decomposition (TSVD) estimator [9]. In [12], the TDC estimator is shown to be an LMMSE estimator with adjustable input noise level. When the TDC filtering matrix is transformed to the frequency domain, it is in general non-diagonal. Nevertheless, the known signal subspace based methods still assume stationarity within a short frame. This can be seen as follows. In TDC and SDC the noisy signal covariance matrices are estimated by time averaging of the outer product of the signal vector, which requires stationarity within the interval of averaging. The TSVD method applies singular value decomposition to the signal matrix instead. This can be shown to be equivalent to the eigen decomposition of the time averaged outer product of signal vectors. Compared to the mentioned frequency domain MMSE approaches, the known signal subspace methods implicitly avoid the infinite data length assumption, so that the inter-frequency correlation caused by the finite length effect is accommodated. However, the more important cause of inter-frequency correlation, i.e., the non-stationarity within a frame, is not modeled.
In terms of exploiting the masking property of the human auditory system, the above mentioned frequency domain MMSE algorithms and signal subspace based algorithms can be seen as spectral masking methods without explicit modeling of masking thresholds. To see this, observe that the MMSE approaches shape the residual noise (the remaining background noise) power spectrum to one more similar to the speech power spectrum, thereby facilitating a certain degree of masking of the noise. In general, the MMSE approaches attenuate more in the spectral valleys than the spectral subtraction methods do. Perceptually, this is beneficial for high pitch voiced speech, which has sparsely located spectral peaks that are not able to mask the spectral valleys sufficiently. The signal subspace methods in [12] are designed to shape the residual noise power spectrum for a better spectral masking, where the masking threshold is found experimentally. Auditory masking techniques have received increasing attention in recent research on speech enhancement [2,26,29]. While the majority of these works focus on spectral domain masking, the work in [24] shows the importance of the temporal masking property in connection with the excitation source of voiced speech. It is shown that noise between the excitation impulses is more perceivable than noise close to the impulses, and this is especially so for low pitch speech, for which the excitation impulses are located sparsely in time. This temporal masking property is not employed by current frequency domain MMSE estimators and the signal subspace approaches.
In this paper, we develop an LMMSE estimator with a high temporal resolution modeling of the excitation of voiced speech, aiming to model a certain non-stationarity of the speech within a short frame that is not modeled by quasi-stationarity based algorithms. The excitation of voiced speech exhibits prominent temporal power localization, which appears as an impulse train superimposed with a low level noise floor. We model this temporal power localization as a non-stationarity. This non-stationarity causes significant inter-frequency correlation. Our LMMSE estimator therefore avoids the assumption of uncorrelated spectral components, and is able to exploit the inter-frequency correlation. Both the frequency domain signal covariance matrix and the filtering matrix are estimated as complex-valued full matrices, which means that the information about inter-frequency correlation is not lost and the amplitude and phase spectra are estimated jointly. Specifically, we make use of the linear prediction based source-filter model to estimate the signal covariance matrix, upon which a time domain or frequency domain LMMSE estimator is built. In the estimation of the signal covariance matrix, this matrix is decomposed into a synthesis filter matrix and an excitation matrix. The synthesis filter matrix is estimated by a smoothed power spectral subtraction method followed by an autocorrelation Linear Predictive Coding (LPC) method. The excitation matrix is a diagonal matrix with the instantaneous power of the LPC residual as its diagonal elements. The instantaneous power of the LPC residual is estimated by a modified Multi-Pulse Linear Predictive Coding (MPLPC) method. Having estimated the signal covariance matrix, we use it in a vector LMMSE estimator. We show that by doing the LMMSE estimation in the frequency domain instead of in the time domain, the computational complexity can be reduced significantly, due to the fact that the signal is less correlated in the frequency domain than in the time domain. Compared to several quasi-stationarity based estimators, the proposed LMMSE estimator results in a lower spectral distortion to the enhanced speech signal while having higher noise reduction capability. The algorithm applies more attenuation in the valleys between pitch impulses in the time domain, while small attenuation is applied around the pitch impulses. This arrangement exploits the temporal masking effect, and results in a better preservation of abrupt rises of the waveform amplitude while maintaining a large amount of noise reduction.
The rest of this paper is organized as follows. In Section 0.2, the notations and assumptions used in the derivation of LMMSE estimators are outlined. In Section 0.3, the non-stationary modeling of the signal covariance matrices is described. The algorithm is summarized in Section 0.4. In Section 0.5, the computational complexity of the algorithm is reduced by identifying an interval of significant correlation and by simplifying the modified MPLPC procedure. Experimental settings and objective and subjective results are given in Section 0.6. Finally, Section 0.7 discusses the obtained results.
I S" In this section, notations and statistic assumptions for the derivation of LMMSE estimators in time and frequency domain are outlined. Time domain LMMSE estimator
Let y(n, k), s(n, k), v(n, k) denote the n'th sample of noisy observation, speech, and additive noise (uncorrelated with the speech signal) of the k'th frame, respectively. Then y(n, k) = s(n, k) + v(n, k). Alternatively, in vector form we have

y = s + v, (1)

where boldface letters represent vectors and the frame indices are omitted to allow a compact notation. For example, y = [y(1, k), y(2, k), …, y(N, k)]^T is the noisy signal vector of the k'th frame, where N is the number of samples per frame.
To obtain linear MMSE estimators, we assume zero mean Gaussian PDFs for the noise and the speech processes. Under this statistical model the LMMSE estimate of the signal is the conditional mean [16]

ŝ = E[s|y] = C_s (C_s + C_v)^{-1} y, (2)

where C_s and C_v are the covariance matrices of the signal and the noise, respectively. The covariance matrix is defined as C_s = E[ss^H], where (·)^H denotes Hermitian transposition and E[·] denotes the ensemble average operator.
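For illustration, the following is a minimal NumPy sketch of the estimator in (2); the function name and the explicit covariance arguments are illustrative choices, not part of the patent:

```python
import numpy as np

def lmmse_time(y, C_s, C_v):
    """Vector LMMSE estimate s_hat = C_s (C_s + C_v)^{-1} y, eq. (2).

    y   : (N,) noisy frame
    C_s : (N, N) signal covariance matrix
    C_v : (N, N) noise covariance matrix
    """
    # Solve (C_s + C_v) x = y rather than forming the inverse explicitly.
    x = np.linalg.solve(C_s + C_v, y)
    return C_s @ x
```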
Frequency domain LMMSE estimator and Wiener filter
In the frequency domain the goal is to estimate the complex DFT coefficients given a set of DFT coefficients of the noisy observation. Let Y(m, k), Θ(m, k), and V(m, k) denote the m'th DFT coefficient of the k'th frame of the noisy observation, the signal, and the noise, respectively. Due to the linearity of the DFT operator, we have

Y(m, k) = Θ(m, k) + V(m, k). (3)

In vector form we have

Y = Θ + V, (4)

where again boldface letters represent vectors and the frame indices are omitted. As an example, the noisy spectrum vector of the k'th frame is arranged as Y = [Y(1, k), Y(2, k), …, Y(N, k)]^T, where the number of frequency bins is equal to the number of samples per frame N. We again use the linear model. Y, Θ, and V are assumed to be zero-mean complex Gaussian random variables, and Θ and V are assumed to be uncorrelated with each other.
The LMMSE estimate is the conditional mean

Θ̂ = E[Θ|Y] = C_Θ (C_Θ + C_V)^{-1} Y, (5)

where C_Θ and C_V are the covariance matrices of the DFT coefficients of the signal and the noise, respectively. By applying the inverse DFT to each side, (5) can easily be shown to be identical to (2). The relation between the two signal covariance matrices in time and frequency domain is
C_Θ = F C_s F^{-1}, (6)

where F is the Fourier matrix. If the frame were infinitely long and the signal stationary, C_s would be an infinitely large Toeplitz matrix. The infinite Fourier matrix is known to be the eigenvector matrix of any infinite Toeplitz matrix [14]. Thus, C_Θ becomes diagonal and the LMMSE estimator (5) reduces to the non-causal IIR Wiener filter with the transfer function

H_WF(ω) = P_ss(ω) / (P_ss(ω) + P_vv(ω)), (7)

where P_ss(ω) and P_vv(ω) denote the power spectral density (PSD) of the signal and the noise, respectively. In the sequel we refer to (7) as the Wiener filter or WF.
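A minimal sketch of the spectral estimator (5) and its diagonal special case (7) follows; the function names are illustrative. When C_Θ and C_V are diagonal, the matrix solve collapses to an independent scalar gain per frequency bin, which is exactly the Wiener filter (7):

```python
import numpy as np

def lmmse_freq(Y, C_theta, C_V):
    """Spectral LMMSE estimate Theta_hat = C_Theta (C_Theta + C_V)^{-1} Y, eq. (5)."""
    return C_theta @ np.linalg.solve(C_theta + C_V, Y)

def wiener_gain(P_ss, P_vv):
    """Per-bin Wiener gains, eq. (7): the special case of (5) in which the
    covariance matrices are diagonal, so each bin is scaled independently."""
    return P_ss / (P_ss + P_vv)
```

With a full C_theta, the estimate of each DFT coefficient draws on all correlated bins; with diagonal covariances the same estimate reduces to wiener_gain(P_ss, P_vv) * Y.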
High temporal resolution modeling for the signal covariance matrix estimation
For both time and frequency domain LMMSE estimators described in Section 0.2, the estimation of the signal covariance matrix C_s is crucial. In this work, we assume the noise to be stationary. For the signal, however, we propose the use of a high temporal resolution model to capture the non-stationarity caused by the excitation power variation. This can be explained by examining the voice production mechanism. In the well known source-filter model for voiced speech, the excitation source models the glottal pulse train, and the filter models the resonance property of the vocal tract. The vocal tract can be viewed as a slowly varying part of the system. Typically in a duration of 20 to 30 ms it changes very little. The vocal folds vibrate at a faster rate, producing periodic glottal flow pulses. Typically there can be 2 to 8 glottal pulses in 20 ms. In speech coding, it is common practice to model this pulse train by a long-term correlation pattern parameterized by a long-term predictor [4] [3] [5]. However, this model fails to describe the linear relationship between the phases of the harmonics. That is, the long term predictor alone does not model the temporal localization of power in the excitation source. Instead, we apply a time envelope that captures the localization and concentration of pitch pulse energy in the time domain. This, in turn, introduces an element of non-stationarity to our signal model because the excitation sequence is now modeled as a random sequence with time varying variance, i.e., the glottal pulses are modeled with higher variance and the rest of the excitation sequence is modeled with lower variance. This modeling of non-stationarity within a short frame implies a temporal resolution much finer than that of the quasi-stationarity based algorithms. The latter have a temporal resolution equal to the frame length. Thus we term the former the high temporal resolution model. It is worth noting that some unvoiced phonemes, such as plosives, have very fast changing waveform envelopes, which also could be modeled as non-stationarity within the analysis frame. In this paper, however, we focus on the non-stationary modeling of voiced speech.
Modeling the signal covariance matrix
The signal covariance matrix is usually estimated by averaging the outer product of the signal vector over time. As an example, this is done in the signal subspace approach [12]. This method assumes ergodicity of the autocorrelation function within the averaging interval.
Here we propose the following method of estimating C_s, with the ability to model a certain element of non-stationarity within a short frame. The following discussion is only appropriate for voiced speech. Let r denote the excitation source vector, and H denote the synthesis filtering matrix corresponding to the vocal tract filter,

H = [ h(0)     0        0       ...  0
      h(1)     h(0)     0       ...  0
      h(2)     h(1)     h(0)    ...  0
      ...
      h(N-1)   h(N-2)   ...          h(0) ],

where h(n) is the impulse response of the LPC synthesis filter.
We then have
s = Hr, (8) and therefore
C_s = E[ss^H] = H C_r H^H, (9)

where C_r is the covariance matrix of the model residual vector r. In (9) we treat H as a deterministic quantity. This simplification is common practice also when the LPC filter model is used to parameterize the power spectral density in classic Wiener filtering [19][15]. Section 0.3.2 addresses the estimation of H. Note that (8) does not take into account the zero-input response of the filter from the previous frame. Either the zero-input response can be subtracted prior to the estimation of each frame, or a windowed overlap-add procedure can be applied to eliminate this effect.
We now model r as a sequence of independent zero-mean random variables. The covariance matrix C_r is therefore diagonal, with the variance of each element of r as its diagonal elements. For voiced speech, except for the pitch impulses, the rest of the residual is of very low amplitude and can be modeled as constant-variance random variables. Therefore, the diagonal of C_r takes the shape of a constant floor with a few periodically located impulses. We term this the temporal envelope of the instantaneous residual power. This temporal envelope is an important part of the new MMSE estimator because it provides the information about the uneven temporal power distribution. In the following two subsections, we describe the estimation of the spectral envelope and the temporal envelope, respectively.
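To make the construction concrete, the following sketch builds C_s = H C_r H^H from an assumed LPC polynomial and an assumed temporal envelope (constant floor plus periodic impulses); the filter order, pitch period, and variance levels are illustrative only.

```python
# Sketch: build the signal covariance C_s = H C_r H^H of (9) from an LPC
# impulse response and a temporal envelope of the instantaneous residual power.
import numpy as np
from scipy.linalg import toeplitz
from scipy.signal import lfilter

N = 128                                  # frame length (assumed)
lpc = np.array([1.0, -0.9])              # illustrative LPC polynomial A(z)

# Impulse response of the synthesis filter 1/A(z), truncated to the frame length.
h = lfilter([1.0], lpc, np.eye(1, N)[0])

# Lower-triangular Toeplitz synthesis matrix H of (8).
H = toeplitz(h, np.zeros(N))

# Temporal envelope of the residual power: a low constant floor with
# periodically located pitch impulses (period and levels are assumptions).
envelope = np.full(N, 0.01)              # noise-floor variance
envelope[::64] = 1.0                     # squared pitch-impulse amplitudes

C_r = np.diag(envelope)                  # diagonal residual covariance
C_s = H @ C_r @ H.T                      # eq. (9); H is real here, so H^H = H^T
```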
Estimating the spectral envelope
In the context of LPC analysis, the synthesis filter has a spectrum that is the envelope of the signal spectrum. Thus, our goal in this subsection is to estimate the spectral envelope of the signal. We first use the Decision-Directed method [10] to estimate the signal power spectrum and then use the autocorrelation method to find the spectral envelope.
The noisy signal power spectrum of the k'th frame, |Y(k)|², is obtained by applying the DFT to the k'th observation vector y(k) and squaring the amplitudes. The Decision-Directed estimate of the signal power spectrum of the k'th frame, |Θ̂(k)|², is a weighted sum of two parts: the power spectrum of the estimated signal of the previous frame, |Θ̂(k−1)|², and the power-spectrum-subtraction estimate of the current frame's power spectrum:

|Θ̂(k)|² = a |Θ̂(k−1)|² + (1 − a) max(|Y(k)|² − E[|V(k)|²], 0), (10)

where a is a smoothing factor, a ∈ [0, 1], and E[|V(k)|²] is the estimated noise power spectral density. The purpose of such a recursive scheme is to improve the estimate of the power spectrum subtraction method by smoothing out the random fluctuations in the noise power spectrum, thus reducing the "musical noise" artifact [7]. Other iterative schemes with similar time or spectral constraints are applicable in this context; for a comprehensive study of constrained iterative filtering techniques, readers are referred to [15]. We now take the square root of the estimated power spectrum and combine it with the noisy phase to reconstruct the so-called intermediate estimate, which has the noise-reduced amplitude spectrum but noisy phase. An autocorrelation-method LPC analysis is then applied to this intermediate estimate to obtain the synthesis filter coefficients.
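A minimal sketch of this step, under assumed parameter values, combines the Decision-Directed recursion (10) with an autocorrelation-method LPC analysis of the intermediate estimate:

```python
# Sketch: Decision-Directed PSD estimate (10), then LPC analysis of the
# intermediate estimate (noise-reduced amplitude, noisy phase). Parameter
# values are illustrative assumptions.
import numpy as np

def decision_directed(Y, prev_signal_psd, noise_psd, a=0.98):
    """Eq. (10): smooth the power-spectrum-subtraction estimate."""
    subtraction = np.maximum(np.abs(Y) ** 2 - noise_psd, 0.0)
    return a * prev_signal_psd + (1.0 - a) * subtraction

def lpc_autocorrelation(x, order=10):
    """Levinson-Durbin solution of the autocorrelation normal equations."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = np.zeros(order + 1); a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err      # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]
        err *= (1.0 - k * k)
    return a                                      # A(z) coefficients

# Illustrative use on one frame of a noisy signal y (all values assumed):
rng = np.random.default_rng(1)
y = rng.standard_normal(128)
Y = np.fft.fft(y)
signal_psd = decision_directed(Y, prev_signal_psd=np.abs(Y) ** 2,
                               noise_psd=np.full(128, 0.5))
intermediate = np.fft.ifft(np.sqrt(signal_psd) * np.exp(1j * np.angle(Y))).real
lpc = lpc_autocorrelation(intermediate, order=10)
```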
Estimating the temporal envelope
We propose to use a modified MPLPC method to robustly estimate the temporal envelope of the residual power. MPLPC was first introduced by Atal and Remde [4] to optimally determine the impulse positions and amplitudes of the excitation in the context of analysis-by-synthesis linear predictive coding. The principle is to represent the LPC residual with a few impulses whose locations and amplitudes (gains) are chosen such that the difference between the target signal and the synthesized signal is minimized. In the noise reduction scenario, the target signal is the noisy signal, and the synthesis filter must be estimated from the noisy signal. Here, the synthesis filter is treated as known. For the residual of voiced speech, there is usually one dominating impulse in each pitch period. We first determine one impulse per pitch period, then model the rest of the residual as a noise floor with constant variance. In MPLPC the impulses are found sequentially [17]. The first impulse location and amplitude are found by minimizing the distance between the synthesized signal and the target signal. The effect of this impulse is subtracted from the target signal, and the same procedure is applied to find the next impulse. Because this way of finding impulses does not take into account the interaction between the impulses, re-optimization of the impulse amplitudes is necessary every time a new impulse is found. The number of pitch impulses p in a frame is determined in the following way: p is first assigned an initial value equal to the largest number of pitch periods possible in a frame; then p impulses are determined using the above-mentioned method, and only the impulses with an amplitude larger than a threshold are selected as pitch impulses. In our experiment, the threshold is set to 0.5 times the largest impulse amplitude in the frame. Having determined the impulses, a white noise sequence representing the noise floor of the excitation sequence is added into the gain optimization procedure together with all the impulses. We use a codebook of 1024 white Gaussian noise sequences in the optimization. The white noise sequence that yields the smallest synthesis error to the target signal is chosen as the estimate of the noise floor. This procedure is in fact a multi-stage coder with p impulse stages and one Gaussian codebook stage, with a joint re-optimization of gains. A detailed treatment of this optimization problem can be found in [22]. After the optimization, we use a flat envelope equal to the square of the gain of the selected noise sequence to model the variance of the noise floor. Finally, the temporal envelope of the instantaneous residual power is composed of the noise-floor variance and the squared impulses. When applied to noisy signals, the MPLPC procedure can be interpreted as a non-linear least-squares fit to the noisy signal, with the impulse positions and amplitudes as the model parameters.
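The sequential impulse search can be sketched as follows; for brevity, the joint re-optimization of gains after each new impulse and the noise-floor codebook stage are omitted, so this is only an outline of the search step, not the full procedure described above.

```python
# Sketch of the sequential multi-pulse search: each impulse is placed where it
# best reduces the error between the target and the signal synthesized through H.
import numpy as np

def sequential_multipulse(target, H, p):
    """Find p impulse positions/amplitudes for the excitation, one at a time."""
    residual = target.copy()
    positions, amplitudes = [], []
    # Energies of the synthesis-matrix columns (shifted impulse responses).
    col_energy = np.sum(H ** 2, axis=0)
    for _ in range(p):
        # Optimal gain for an impulse at n is <residual, H[:, n]> / ||H[:, n]||^2;
        # the achievable error reduction is corr^2 / column energy.
        corr = H.T @ residual
        gains = corr / col_energy
        reduction = corr ** 2 / col_energy
        n = int(np.argmax(reduction))
        positions.append(n)
        amplitudes.append(gains[n])
        residual -= gains[n] * H[:, n]           # subtract this impulse's effect
    return positions, amplitudes
```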
The algorithm
Having obtained the estimate of the temporal envelope of the instantaneous residual power and the estimate of the synthesis filter matrix, we are able to build the signal covariance matrix in (9). The covariance matrix is used in the time LMMSE estimator (2), or in the spectral LMMSE estimator (5) after being transformed by (6).
The noise covariance matrix can be estimated using speech-absent frames. Here, we assume the noise to be stationary. For the time domain LMMSE estimator (2), if the noise is white, the covariance matrix C_V is diagonal with the noise variance as its diagonal elements. In the case of colored noise, the noise covariance matrix is no longer diagonal, and it can be estimated using the time-averaged outer product of the noise vector. For the spectral domain LMMSE estimator (5), C_V is a diagonal matrix with the power spectral density of the noise as its diagonal elements. This is due to the assumed stationarity of the noise¹. In the special case where the noise is white, the diagonal elements all equal the variance of the noise.
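As an illustration, assuming frames known to be speech-absent are available, the two noise covariance estimates described above could be computed as:

```python
# Sketch: noise covariance estimates from speech-absent frames.
import numpy as np

def noise_covariances(noise_frames):
    """noise_frames: array of shape (num_frames, N) from speech-absent segments."""
    num_frames, N = noise_frames.shape
    # Time domain: C_V = time-averaged outer product of the noise vector.
    C_v_time = noise_frames.T @ noise_frames / num_frames
    # Frequency domain: diagonal matrix of the averaged noise periodogram (PSD).
    noise_psd = np.mean(np.abs(np.fft.fft(noise_frames, axis=1)) ** 2, axis=0)
    C_v_freq = np.diag(noise_psd)
    return C_v_time, C_v_freq
```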
We model the instantaneous power of the residual of unvoiced speech with a flat envelope. Here, voiced speech refers to phonemes that require excitation from the vocal fold vibration, and unvoiced speech consists of the rest of the phonemes. We use a simple voiced/unvoiced detector that utilizes the fact that voiced speech usually has most of its power concentrated in the low frequency band, while unvoiced speech has a relatively flat spectrum within 0 to 4 kHz. Every frame is low-pass filtered, and the filtered signal power is then compared with the original signal power. If the power loss is more than a threshold, the frame is marked as unvoiced, and vice versa. Note, however, that even for the unvoiced frames the spectral covariance matrix is non-diagonal, because the signal covariance matrix C_s, built in this way, is not Toeplitz. Hereafter, we refer to the proposed approach as the Time-Frequency-Envelope MMSE estimator (TFE-MMSE), due to its utilization of envelopes in both the time and frequency domains. The algorithm is summarized in Algorithm 1.
Reducing computational complexity
The TFE-MMSE estimators require the inversion of a full covariance matrix C_s or C_Θ. This high computational load prohibits the algorithm from real-time application in hearing aids. Noticing that both covariance matrices are symmetric and positive definite, Cholesky factorization can be applied to the covariance matrices, and the inversion can be done by inverting the Cholesky triangle. A careful implementation requires N³/3 operations for the Cholesky factorization [13], and the algorithm complexity is O(N³). Another computation-intensive part of the algorithm is the modified MPLPC method. In this section we propose simplifications to these two parts.
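For the filtering part, the Cholesky-based solution can be sketched as follows; the matrix sizes and values are illustrative.

```python
# Sketch: apply (C_s + C_V)^{-1} via a Cholesky factorization instead of an
# explicit inverse. Sizes and values are illustrative assumptions.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

N = 128
rng = np.random.default_rng(2)
M = rng.standard_normal((N, N))
C_s = M @ M.T / N + 1e-3 * np.eye(N)     # symmetric positive definite
C_v = 0.5 * np.eye(N)
y = rng.standard_normal(N)

# Factor once (about N^3/3 operations), then solve two triangular systems.
c, low = cho_factor(C_s + C_v)
s_hat = C_s @ cho_solve((c, low), y)     # time domain LMMSE estimate (2)
```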
Further reduction of the complexity of the filtering requires understanding of the inter-frequency correlation.
¹ In modeling the spectral covariance matrix of the noise we have ignored the inter-frequency correlations caused by the finite-length window effect. With typical window lengths, e.g. 15 to 30 ms, the inter-frequency correlations caused by the window effect are less significant than those caused by the non-stationarity of the signal. This can be easily seen by examining a plot of the spectral covariance matrix.
Algorithm 1 TFE-MMSE estimator
1: Take the k'th frame.
2: Estimate the noise PSD from the latest speech-absent frame.
3: Calculate the power spectrum of the noisy signal.
4: Do power spectrum subtraction estimation of the signal PSD, and refine the estimate using Decision-Directed smoothing (eq. (10)).
5: Reconstruct the signal by combining the amplitude spectrum estimated in step 4 with the noisy phase.
6: Do LPC analysis on the reconstructed signal. Obtain the synthesis filter coefficients, and form the synthesis matrix H.
7: IF the frame is voiced: estimate the envelope of the instantaneous residual power using the modified MPLPC method.
8: IF the frame is unvoiced: use a constant envelope for the instantaneous residual power.
9: ENDIF
10: Calculate the residual covariance matrix C_r.
11: Form the signal covariance matrix C_s = H C_r H^H (eq. (9)).
12: IF time domain LMMSE: s = C_s(C_s + C_V)^{-1} y (eq. (2)).
13: IF frequency domain LMMSE: transform C_s to the frequency domain, C_Θ = F C_s F^{-1}; filter the noisy spectrum, Θ̂ = C_Θ(C_Θ + C_V)^{-1} Y (eq. (5)); obtain the signal estimate by inverse DFT.
14: ENDIF
15: Calculate the power spectrum of the filtered signal; it serves as |Θ̂(k−1)|² in the PSD estimation of the next frame.
16: k = k + 1 and go to 1.

In the time domain, the signal samples are clearly correlated with each other over a very long span. However, in the frequency domain, the correlation span is much smaller. This can be seen from the magnitude plots of the two covariance matrices (see Fig. 1). For the spectral covariance matrix, the significant values concentrate around the diagonal. This fact indicates that a small number of diagonals captures most of the inter-frequency correlation. The simplified procedure is as follows. Half of the spectrum vector Θ is divided into small segments of l frequency bins each. The sub-vector starting at the j'th frequency is denoted Θ_sub,j, where j ∈ {1, l, 2l, …, N/2} and
l ≪ N. The noisy signal spectrum and the noise spectrum can be segmented in the same way, giving Y_sub,j and V_sub,j. The LMMSE estimate of Θ_sub,j needs only a block of the covariance matrix, which means that the estimate of a frequency component benefits from its correlations with l neighboring frequency components instead of with all components. This can be written as
Θ̂_sub,j = C_Θ,sub,j (C_Θ,sub,j + C_V,sub,j)^{-1} Y_sub,j. (11)
The first half of the signal spectrum can be estimated segment by segment. The second half of the spectrum is simply a flipped and conjugated version of the first half. The segment length is chosen to be l = 8, which in our experience does not degrade performance noticeably when compared with the use of the full matrix. Other segmentation schemes are applicable, such as overlapping segments. It is also possible to use a number of surrounding frequency components to estimate a single component at a time. We use the non-overlapping segmentation because it is computationally less expensive while maintaining good performance for small l. When the signal frame length is 128 samples and the block length is l = 8, this simplified method requires only a small fraction of the original complexity for the filtering part of the algorithm, at the extra expense of FFT operations on the covariance matrix. When l is set to values larger than 24, very little improvement in performance is observed. When l is set to values smaller than 8, the quality of the enhanced speech degrades noticeably. By tuning the parameter l, an effective trade-off between enhanced speech quality and computational complexity can be adjusted conveniently. In the MPLPC part of the algorithm, the optimization of the impulse amplitudes and the gain of the noise floor brings a heavy computational load. It can be simplified by fixing the impulse shape and the noise-floor level. In the simplified version, the MPLPC method is only used to search for the locations of the p dominating impulses. Once the
locations are found, a predetermined pulse shape is placed at each location. An envelope of the noise floor is also predetermined. The pulse shape is chosen to be wider than an impulse in order to gain robustness against estimation errors of the impulse locations. This is helpful as long as noise is present. The pulse shape used in our experiment is a raised-cosine waveform with a period of 18 samples, and the ratio between the pulse peak and the noise-floor amplitude is experimentally determined to be 6.6. Finally, the estimated residual power must be normalized. Although the pulse shape and the relative level of the noise floor are fixed for all frames, experiments show that the TFE-MMSE estimator is not sensitive to this change. The performance of both the simplified procedure and the optimum procedure is evaluated in Section 0.6. Fig. 1 shows the estimated envelopes of the residual obtained in the two ways.
Results
The objective performance of the TFE-MMSE estimator is first evaluated and compared with the Wiener filter [20], the MMSE-LSA estimator [11], and the signal subspace TDC estimator [12]. For the TFE-MMSE estimator, both the complete algorithm and the simplified algorithms are evaluated. For all estimators the sampling frequency is 8 kHz, and the frame length is 128 samples with 50% overlap. In the Wiener filter we use the same Decision-Directed method as in the MMSE-LSA and the TFE-MMSE estimators to estimate the PSD of the signal. An important parameter for the Decision-Directed method is the smoothing factor a. The larger a is, the more noise is removed and the more distortion is imposed on the signal, because of the heavier smoothing of the spectrum. In the MMSE-LSA estimator with the aforesaid parameter setting, we found experimentally a = 0.98 to be the best trade-off between noise reduction and signal distortion. We use the same a for the WF and the TFE-MMSE estimator as for the MMSE-LSA estimator. For the TDC, the parameter μ (μ ≥ 1) controls the degree of over-suppression of the noise power [12]. The larger μ is, the more the noise is attenuated, but the larger the distortion of the speech. We choose μ = 3 in the experiments by balancing noise reduction and signal distortion.
All estimators are run on 32 sentences from different speakers (16 male and 16 female) from the TIMIT database [1], with added white Gaussian noise, pink noise, and car noise at SNRs ranging from 0 dB to 20 dB. The white Gaussian noise is computer generated, and the pink noise is generated by filtering white noise with a filter having a 3 dB per octave spectral power descent. The car noise is recorded inside a car traveling at constant speed; its spectrum is more low-pass than the pink noise. The quality measures used include the SNR, the segmental SNR, and the Log-Spectral Distortion (LSD). The SNR is defined as the ratio of the total signal power to the total noise power in the sentence. The segmental SNR (segSNR) is defined as the average ratio of signal power to noise power per frame. To prevent the segSNR measure, which is computed in dB, from being dominated by a few extremely low values, it is common practice to apply a lower power threshold ε to the signals: any frame with an average power lower than ε is not used in the calculation. We set ε to 40 dB below the average power of the utterance. The segSNR is commonly considered to be more correlated with perceived quality than the SNR measure. The LSD is defined as [27]:
LSD = (1/K) Σ_k sqrt( (1/N) Σ_n [ 10 log₁₀ max(|Θ(n,k)|², ε) − 10 log₁₀ max(|Θ̂(n,k)|², ε) ]² ),

where ε is the power floor defined above, set 40 dB below the average power of the utterance, which prevents near-zero spectral values from dominating the measure. TFE-MMSE1 is the complete algorithm, and TFE-MMSE2 is the one with the simplified MPLPC and the reduced covariance matrix (l = 8). It is observed that TFE-MMSE2, although a simplification of TFE-MMSE1, performs better than TFE-MMSE1. This can be explained as follows: 1) its wider pulse shape is more robust to estimation errors of the impulse positions, and 2) the wider pulse shape models to some extent the power concentration around the impulse peaks, which is overlooked by the spiky impulses. For this reason, in the following evaluations we investigate only the simplified algorithm.
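For reference, the three objective measures can be sketched as follows; the flooring conventions for the segSNR and the LSD are assumptions consistent with the description above.

```python
# Sketch of the objective quality measures: global SNR, thresholded segmental
# SNR, and a floored log-spectral distortion. Flooring details are assumptions.
import numpy as np

def snr_db(clean, noise):
    return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def seg_snr_db(clean, enhanced, frame=128):
    # Frames 40 dB below the utterance's average frame power are excluded.
    num = len(clean) // frame
    powers = [np.mean(clean[i*frame:(i+1)*frame] ** 2) for i in range(num)]
    floor = np.mean(powers) * 10 ** (-40 / 10)
    vals = []
    for i in range(num):
        s = clean[i*frame:(i+1)*frame]
        e = enhanced[i*frame:(i+1)*frame]
        if np.mean(s ** 2) > floor:
            vals.append(10 * np.log10(np.sum(s ** 2) / np.sum((s - e) ** 2)))
    return np.mean(vals)

def lsd_db(clean, enhanced, frame=128):
    vals = []
    for i in range(len(clean) // frame):
        S = np.abs(np.fft.rfft(clean[i*frame:(i+1)*frame])) ** 2
        E = np.abs(np.fft.rfft(enhanced[i*frame:(i+1)*frame])) ** 2
        eps = np.mean(S) * 10 ** (-40 / 10)      # assumed power floor
        d = 10 * np.log10(np.maximum(S, eps) / np.maximum(E, eps))
        vals.append(np.sqrt(np.mean(d ** 2)))
    return np.mean(vals)
```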
Informal listening tests reveal that, although the speech enhanced by the TFE-MMSE algorithm has a significantly clearer sound (less muffled than with the reference algorithms), the remaining background noise contains musical tones. A solution to the musical noise problem is to set a higher value of the smoothing factor a. Using a larger a sacrifices the SNR and LSD slightly at high input SNRs, but improves the SNR and LSD at low input SNRs, and generally improves the segSNR significantly. The musical tones are also well suppressed. By setting a = 0.999, the residual noise is greatly reduced, while the speech still sounds less muffled than with the reference methods. The reference methods cannot use a smoothing factor as high as the TFE-MMSE: experiments show that at a = 0.999 the MMSE-LSA and the WF result in extremely muffled sounds. The TDC also suffers from a musical residual noise. To suppress its residual noise level to as low as that of the TFE-MMSE with a = 0.999, the TDC requires a μ larger than 8. This causes a sharp degradation of the SNR and LSD, and results in very muffled sounds. The TFE-MMSE2 estimator with a large smoothing factor (a = 0.999) is hereafter termed TFE-MMSE3, and its objective measures are also shown in the figures. To verify the perceived quality of the TFE-MMSE3 subjectively, preference tests between the TFE-MMSE3 and the WF, and between the TFE-MMSE3 and the MMSE-LSA, are conducted. The WF and the MMSE-LSA use their best value of the smoothing factor (a = 0.98). The test is confined to white Gaussian noise and a limited range of SNRs. Three sentences by male speakers and three by female speakers at each SNR level are used in the test. Eight inexperienced listeners are asked to vote for their preferred method based on the amount of noise reduction and speech distortion. The utterances are presented to the listeners through high-quality headphones. The clean utterance is first played as a reference, and the enhanced utterances are played once, or more if the listener finds this necessary. The results in Tables 1 and 2 show that: 1) at 10 dB and 15 dB the listeners clearly prefer the TFE-MMSE over the two reference methods, while at 5 dB the preference for the TFE-MMSE is unclear; 2) the TFE-MMSE method has a more significant impact on the processing of male speech than on the processing of female speech. At 10 dB and above, the speech enhanced by TFE-MMSE3 has barely audible background noise, and the speech sounds less muffled than with the reference methods. There is one artifact, heard on rare occasions, that we believe is caused by remaining musical tones; it is of very low power and occurs sometimes during speech presence. The two reference methods have higher residual background noise and suffer from muffling and reverberance effects. When the SNR is lower than 10 dB, a certain speech-dependent noise occurs during speech presence in the TFE-MMSE3-processed speech. The lower the SNR, the more audible this artifact. Comparing the male and female speech processed by the TFE-MMSE3, the female speech sounds a bit rough.
The algorithms are also evaluated for pink noise and car noise cases.
In these results the TDC algorithm is not included, because that algorithm is derived under the white Gaussian noise assumption. Informal listening tests show that the perceptual quality in the pink noise case for all three algorithms is very similar to the white noise case, and that in the car noise case all tested methods have very similar perceptual quality.
TAB. 1 - Preference test between WF and TFE-MMSE3 with additive white Gaussian noise.
TAB. 2 - Preference test between MMSE-LSA and TFE-MMSE3 with additive white Gaussian noise.
Discussion
The results show that for male speech the TFE-MMSE3 estimator has the best performance in all three objective measures (SNR, segSNR, and LSD). For female speech, the TFE-MMSE3 is second in SNR, best in LSD, and among the best in segSNR. The TFE-MMSE3 estimator allows a high degree of noise suppression while maintaining low distortion of the signal. The speech enhanced by the TFE-MMSE3 has a very clean background and a certain speech-dependent residual noise. When the SNR is high (10 dB and above), this speech-dependent noise is very well masked by the speech, and the resulting speech sounds clean and clear. As spectrograms indicate, the clearer sound is due to a better preserved signal spectrum and a more suppressed background noise. At SNRs lower than 5 dB, although the background still sounds clean, the speech-dependent noise becomes audible and is perceived as a distortion of the speech. The listeners' preference starts shifting from the TFE-MMSE3 towards the MMSE-LSA, which has a more uniform residual noise, although its noise level is higher. The conclusion here is that at high SNR it is preferable to remove the background noise completely, using the TFE-MMSE estimator, without major distortion of the speech; this could be especially helpful in relieving listening fatigue for the hearing aid user. At low SNR, however, it is preferable to use a noise reduction strategy that produces a uniform background noise, such as the MMSE-LSA algorithm.
The fact that female speech enhanced by the TFE-MMSE estimator sounds a little rougher than the male speech is consistent with the observation in [24], where male voiced speech and female voiced speech are found to have different masking properties in the auditory system. For male speech, the auditory system is sensitive to high frequency noise in the valleys between the pitch pulse peaks in the time domain. For female speech, the auditory system is sensitive to low frequency noise in the valleys between the harmonics in the spectral domain. While the time domain valleys for the male speech are cleaned by the TFE-MMSE estimator, the spectral valleys for the female speech are not attenuated enough; a comb filter could help to remove the roughness in the female voiced speech.
In the TFE-MMSE estimator, we apply a high temporal resolution non-stationary model to explain the pitch impulses in the LPC residual of voiced speech. This enables the capture of abrupt changes in sample amplitude that are not captured by an AR linear stochastic model. In fact, the estimate of the residual power envelope contains information about the uneven distribution of signal power in time. It can be observed that, by better modeling the temporal power distribution, the TFE-MMSE estimator represents the sudden rises of amplitude better than the Wiener filter.
Noise in the phase spectrum is reduced by the TFE-MMSE estimator. Although human ears are less sensitive to phase than to power, it has been found in recent work [28][23][25] that phase noise is audible when the source SNR is very low. In [28] a threshold of phase perception is found. This phase-noise tolerance threshold corresponds to an SNR threshold of about 6 dB, which means that for spectral components with local SNR smaller than 6 dB it is necessary to reduce phase noise. The TFE-MMSE estimator has the ability to enhance phase spectra because of its ability to estimate the temporal localization of residual power. It is the linearity in the phase of the harmonics in the residual that makes the power concentrate at periodic time instants, thus producing pitch pulses. Estimating the temporal envelope of the residual power enhances the linearity of the phase spectrum of the residual and therefore reduces phase noise in the signal.
References
[1] DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. CD-ROM, 1990.
[2] K. Arehart, J. Hansen, S. Gallant, and L. Kalstein. Evaluation of an auditory masked threshold noise suppression algorithm in normal-hearing and hearing-impaired listeners. Speech Communication, 40, no. 4:575-592, September 2003.
[3] B. Atal. Predictive Coding of Speech at Low Bit Rates. IEEE Trans. on Comm., pages 600-614, April 1982.
[4] B. Atal and J. Remde. A new model of LPC excitation for producing natural sounding speech at low bit rates. Proc. of ICASSP 1982, 7:614-617, May 1982.
[5] B. S. Atal and M. R. Schroeder. Adaptive predictive coding of speech signals. Bell Syst. Techn. J., 49:1973-1986, 1970.
[6] S. F. Boll. Suppression of Acoustic Noise in Speech Using Spectral Subtraction. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, no. 2:113-120, April 1979.
[7] O. Cappé. Elimination of the Musical Noise Phenomenon with the Ephraim and Malah Noise Suppressor. IEEE Trans. Speech and Audio Processing, 2:345-349, April 1994.
[8] W. B. Davenport and W. L. Root. An Introduction to the Theory of Random Signals and Noise. New York: McGraw-Hill, 1958.
[9] M. Dendrinos, S. Bakamidis, and G. Carayannis. Speech Enhancement from Noise: A Regenerative Approach. Speech Communication, 10:45-57, February 1991.
[10] Y. Ephraim and D. Malah. Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-32:1109-1121, December 1984.
[11] Y. Ephraim and D. Malah. Speech Enhancement Using a Minimum Mean-Square Error Log-Spectral Amplitude Estimator. IEEE Trans. on Acoustics, Speech, and Signal Processing, ASSP-33:443-445, April 1985.
[12] Y. Ephraim and H. L. Van Trees. A Signal Subspace Approach for Speech Enhancement. IEEE Trans. Speech and Audio Processing, 3:251-266, July 1995.
[13] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1996.
[14] R. M. Gray. Toeplitz and circulant matrices: A review, 2002.
[15] J. H. L. Hansen and M. A. Clements. Constrained Iterative Speech Enhancement with Application to Automatic Speech Recognition. IEEE Trans. Signal Processing, 39, no. 4:795-805, April 1991.
[16] S. M. Kay. Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall PTR, 1993.
[17] A. M. Kondoz. Digital Speech: Coding for Low Bit Rate Communication Systems. John Wiley & Sons, 1999.
[18] C. Li and S. V. Andersen. Inter-frequency Dependency in MMSE Speech Enhancement. Proceedings of the 6th Nordic Signal Processing Symposium, June 2004.
[19] J. S. Lim and A. V. Oppenheim. All-pole Modeling of Degraded Speech. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26:197-209, June 1978.
[20] J. S. Lim and A. V. Oppenheim. Enhancement and Bandwidth Compression of Noisy Speech. Proceedings of the IEEE, 67:1586-1604, December 1979.
[21] R. Martin. Speech Enhancement Using MMSE Short Time Spectral Estimation with Gamma Distributed Speech Priors. Proc. of ICASSP 2002, 1:253-256, May 2002.
[22] N. Moreau and P. Dymarski. Selection of excitation vectors for the CELP coders. IEEE Trans. on Speech and Audio Processing, 2(1):29-41, January 1994.
[23] H. Pobloth and W. B. Kleijn. On Phase Perception in Speech. Proc. of ICASSP 1999, 1:29-32, March 1999.
[24] J. Skoglund and W. B. Kleijn. On Time-Frequency Masking in Voiced Speech. IEEE Trans. Speech and Audio Processing, 8, no. 4:361-369, July 2000.
[25] J. Skoglund, W. B. Kleijn, and P. Hedelin. Audibility of Pitch-Synchronously Modulated Noise. Proc. IEEE Workshop on Speech Coding for Telecommunications, pages 51-52, September 1997.
[26] D. Tsoukalas, J. Mourjopoulos, and G. Kokkinakis. Speech enhancement based on audible noise suppression. IEEE Trans. on Speech and Audio Processing, 5(6):497-514, November 1997.
[27] J.-M. Valin, J. Rouat, and F. Michaud. Microphone array post-filter for separation of simultaneous non-stationary sources. ICASSP 2004, pages I-221-224, 2004.
[28] P. Vary. Noise Suppression by Spectral Magnitude Estimation - Mechanism and Theoretical Limits. Signal Processing, 8:387-400, May 1985.
[29] N. Virag. Single channel speech enhancement based on masking properties of the human auditory system. IEEE Trans. on Speech and Audio Processing, 7, no. 2:126-137, 1999.
Fig. 2 illustrates a block diagram of a preferred device embodiment. The illustrated device may be, for example, a mobile phone, a headset, or a part thereof. The device is adapted to receive a noisy signal, e.g. an electrical analog or digital signal representing an audio signal containing speech and unintended noise. The device includes a digital signal processor (DSP) that performs signal processing on the noisy signal. First, the signal estimation method is performed, preferably including a block-based linear MMSE method employing a statistical model of correlation between spectral components, such as described in the foregoing. The signal estimation output serves as input to a noise reduction method, as will also be understood from the foregoing. The output of the noise reduction method is a signal in which the speech is enhanced relative to the noise. This speech-enhanced signal is applied to a loudspeaker, preferably via an amplifier, so as to present an acoustic representation of the speech-enhanced signal to a listener.
As mentioned, the device in Fig. 2 may be a hearing aid, a headset, a mobile phone, or the like. In the case of a headset, the DSP may either be built into the headset, or it may be positioned remote from the headset, e.g. built into other equipment such as amplifier equipment. In the case of a hearing aid, the noisy signal can originate from a remote audio source or from a microphone built into the hearing aid.
Even though the described embodiments are concerned with audio signals, it is appreciated that the principles of the methods described can be used for a large variety of applications, for audio signals as well as for other types of noisy signals.
It is to be understood that reference signs in the claims should not be construed as limiting with respect to the scope of the claims.

Claims

1. A method for estimation of a signal from noisy observations, the method including employing a statistical model of signal dependencies in frequency domain.
2. Method according to claim 1, including the step of providing the statistical model by decomposing the noisy observations into a source model and a filter model.
3. Method according to claim 2, wherein the source model includes performing a multi-pulse coding (MPLPC).
4. Method according to any of the preceding claims, wherein the signal dependencies in frequency domain include inter-frequency dependency.
5. Method according to claim 4, wherein the inter-frequency dependency includes inter-frequency correlation.
6. Method according to any of the preceding claims, wherein the method includes performing a linear minimum mean squared error estimation.
7. Method according to claim 6, wherein the method includes performing a joint linear minimum mean squared error estimation of signal power spectra and phase spectra.
8. A noise reduction method including performing the method according to any of the preceding claims and providing a noise suppressed signal based on an output therefrom.
9. Speech enhancement method including performing the noise reduction method according to claim 8 on speech present in noisy observations so as to enhance speech by suppressing noise.
10. Device including a processor adapted to perform the method according to any of the preceding claims.
11. Device according to claim 10, the device being selected from the group consisting of: a mobile phone, a radio communication device, an internet telephony system, sound recording equipment, sound processing equipment, sound editing equipment, broadcasting sound equipment, and a monitoring system.
12. Device according to claim 10, the device being selected from the group consisting of: a hearing aid, a headset, an assistive listening device, an electronic hearing protector, and a headphone with a built-in microphone.
13. Computer executable program code adapted to perform the method according to any of claims 1-9.
PCT/DK2006/000220 2005-04-26 2006-04-26 Estimation of signal from noisy observations WO2006114100A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DKPA200500605 2005-04-26
DKPA200500605 2005-04-26

Publications (1)

Publication Number Publication Date
WO2006114100A1 true WO2006114100A1 (en) 2006-11-02

Family

ID=36441310

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/DK2006/000220 WO2006114100A1 (en) 2005-04-26 2006-04-26 Estimation of signal from noisy observations

Country Status (1)

Country Link
WO (1) WO2006114100A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009043066A1 (en) * 2007-10-02 2009-04-09 Akg Acoustics Gmbh Method and device for low-latency auditory model-based single-channel speech enhancement
CN114373475A (en) * 2021-12-28 2022-04-19 陕西科技大学 Voice noise reduction method and device based on microphone array and storage medium
CN117194901A (en) * 2023-11-07 2023-12-08 上海伯镭智能科技有限公司 Unmanned vehicle working state monitoring method and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5007094A (en) * 1989-04-07 1991-04-09 Gte Products Corporation Multipulse excited pole-zero filtering approach for noise reduction
WO2000057402A1 (en) * 1999-03-19 2000-09-28 Siemens Aktiengesellschaft Method and device for estimating signal parameters by means of kalman filtration in order to free a language signal from noise

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5007094A (en) * 1989-04-07 1991-04-09 Gte Products Corporation Multipulse excited pole-zero filtering approach for noise reduction
WO2000057402A1 (en) * 1999-03-19 2000-09-28 Siemens Aktiengesellschaft Method and device for estimating signal parameters by means of kalman filtration in order to free a language signal from noise

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUNJIAN LI ET AL: "Integrating Kalman filtering and multi-pulse coding for speech enhancement with a non-stationary model of the speech signal", SIGNALS, SYSTEMS AND COMPUTERS, 2004. CONFERENCE RECORD OF THE THIRTY-EIGHTH ASILOMAR CONFERENCE ON PACIFIC GROVE, CA, USA NOV. 7-10, 2004, PISCATAWAY, NJ, USA,IEEE, 7 November 2004 (2004-11-07), pages 2300 - 2304, XP010781136, ISBN: 0-7803-8622-1 *
CHUNJIAN LI ET AL: "Inter-frequency dependency in mmse speech enhancement", SIGNAL PROCESSING SYMPOSIUM, 2004. NORSIG 2004. PROCEEDINGS OF THE 6TH NORDIC ESPOO, FINLAND 9-11 JUNE 2004, PISCATAWAY, NJ, USA,IEEE, 9 June 2004 (2004-06-09), pages 200 - 203, XP010732574, ISBN: 951-22-7065-X *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009043066A1 (en) * 2007-10-02 2009-04-09 Akg Acoustics Gmbh Method and device for low-latency auditory model-based single-channel speech enhancement
GB2465910A (en) * 2007-10-02 2010-06-09 Akg Acoustics Gmbh Method and device for low-latency auditory model-based single-channel speech enhancement
DE112007003674T5 (en) 2007-10-02 2010-08-12 Akg Acoustics Gmbh Method and apparatus for single-channel speech enhancement based on a latency-reduced auditory model
GB2465910B (en) * 2007-10-02 2012-02-15 Akg Acoustics Gmbh Method and device for low-latency auditory model-based single-channel speech enhancement
CN114373475A (en) * 2021-12-28 2022-04-19 陕西科技大学 Voice noise reduction method and device based on microphone array and storage medium
CN117194901A (en) * 2023-11-07 2023-12-08 上海伯镭智能科技有限公司 Unmanned vehicle working state monitoring method and system
CN117194901B (en) * 2023-11-07 2024-02-02 上海伯镭智能科技有限公司 Unmanned vehicle working state monitoring method and system

Similar Documents

Publication Publication Date Title
Hermansky et al. RASTA processing of speech
Wu et al. A reverberation-time-aware approach to speech dereverberation based on deep neural networks
Chen et al. New insights into the noise reduction Wiener filter
Huang et al. A multi-frame approach to the frequency-domain single-channel noise reduction problem
US8560320B2 (en) Speech enhancement employing a perceptual model
US7313518B2 (en) Noise reduction method and device using two pass filtering
Nakatani et al. Harmonicity-based blind dereverberation for single-channel speech signals
Bahoura et al. Wavelet speech enhancement based on time–scale adaptation
Habets Speech dereverberation using statistical reverberation models
US20090163168A1 (en) Efficient initialization of iterative parameter estimation
Xiao et al. Normalization of the speech modulation spectra for robust speech recognition
Chen et al. Fundamentals of noise reduction
Verteletskaya et al. Noise reduction based on modified spectral subtraction method
Hansen et al. Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system
WO2009043066A1 (en) Method and device for low-latency auditory model-based single-channel speech enhancement
EP1995722B1 (en) Method for processing an acoustic input signal to provide an output signal with reduced noise
Itoh et al. Environmental noise reduction based on speech/non-speech identification for hearing aids
Taşmaz et al. Speech enhancement based on undecimated wavelet packet-perceptual filterbanks and MMSE–STSA estimation in various noise environments
WO2006114100A1 (en) Estimation of signal from noisy observations
Li et al. A block-based linear MMSE noise reduction with a high temporal resolution modeling of the speech excitation
Sunnydayal et al. A survey on statistical based single channel speech enhancement techniques
JPH096391A (en) Signal estimating device
Prasad et al. Two microphone technique to improve the speech intelligibility under noisy environment
Lu et al. Temporal contrast normalization and edge-preserved smoothing of temporal modulation structures of speech for robust speech recognition
Upadhyay et al. A perceptually motivated stationary wavelet packet filterbank using improved spectral over-subtraction for enhancement of speech in various noise environments

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

NENP Non-entry into the national phase

Ref country code: RU

WWW Wipo information: withdrawn in national office

Country of ref document: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06722912

Country of ref document: EP

Kind code of ref document: A1