EP0615226B1

EP0615226B1 - Method for noise reduction in disturbed voice channels

Info

Publication number: EP0615226B1
Application number: EP94102963A
Authority: EP
Inventors: Klaus Dr. Ing. Linhard
Original assignee: DaimlerChrysler AG
Current assignee: Mercedes Benz Group AG
Priority date: 1993-03-11
Filing date: 1994-02-28
Publication date: 1999-05-06
Anticipated expiration: 2014-02-28
Also published as: EP0615226A2; DE4307688A1; EP0615226A3; DE59408194D1

Description

The invention relates to a method according to the preamble of claim 1.

Such a method is used in automatic speech recognition or in hands-free systems to improve speech quality, e.g. in offices or in motor vehicles.

Disturbed speech can be captured better if it is recorded with two or more channels. Speech and interference should be present in every channel. The multi-channel signals are conditioned by digital signal processing.

In multi-channel systems, the propagation delay difference of the useful signal between the individual channels must be determined first. This later makes it possible to merge the individual channels into one channel with the correct phase.

Systems with 2 channels are of particular interest, because they already allow a spatial sound field to be resolved by direction while the computational effort remains tolerable.

If the direction from which the sound event of interest arrives is known, an acoustic directional lobe is steered toward this event.

The noise reduction is first carried out in each individual channel. Because the noise reduction does not work without error, distortions and artificial insertions (e.g. "musical tones") can arise. Merging the individually processed channels averages these errors and thereby reduces them.

The sum signal is then post-processed using the cross-correlation of the signals in the individual channels. This assumes that interference or reverberation is less correlated between the channels than the useful signal.

A method for merging 2 disturbed speech channels is known from the publication "Multimicrophone signal-processing technique to remove room reverberation from speech signals" by Allen, Berkley and Blauert (J. Acoust. Soc. Am., Vol. 62, No. 4, October 1977) and from "Noise Suppression Signal Processing Using 2-Point Received Signals" by Kaneda and Tohyame (Electronics and Communication in Japan, Vol. 67-A, No. 12, 1984). The first method is intended for the dereverberation of speech signals; it does not use true phase compensation of the useful signal, and dereverberation with noise reduction is carried out only in a post-processing stage. The second method uses a simple linear phase compensation of the channels, but here too the noise reduction takes place only in the post-processing stage.

The object of the invention is therefore to specify a method for noise reduction in which the noise reduction is carried out in several stages and a clear improvement in speech quality is achieved.

This object is achieved by the features specified in the characterizing part of claim 1. Advantageous refinements and/or developments can be found in the subclaims.

With the method according to the invention, the spatial and temporal properties of the useful signal and of the interference are exploited systematically:

1.) Spatial properties of the sound fields:

a) Attenuation of point sources of interference
Digital directional filters at the input of the channels, together with the phase estimation, steer an acoustic directional lobe toward the speaker. The phase estimation uses the method described in the unpublished German patent application P 42 43 831. It is robust against interference and requires only little computation. The directional filters are fixed. It is assumed that the speaker is relatively close to the microphones (distance < 1 m) and moves only within a limited area. Non-stationary and stationary point sources of interference are attenuated by this spatial evaluation.

b) Attenuation of diffuse sources of interference
In post-processing, the diffuse interference and reverberation components are attenuated with the aid of the cross-correlation.

2.) Temporal signal properties:
Spectral subtraction estimates the interference during speech pauses and performs a magnitude subtraction in the spectral domain. This attenuates the temporally stationary interference components.

3.) Averaging of the channels (addition):
Because the recording channels are spatially separated (microphones at a certain distance), the errors of the spectral subtraction (distortion and "musical tones") occur in the individual channels partly at random in time. Averaging the channels reduces these errors.

The invention is explained in more detail below using exemplary embodiments and with reference to schematic drawings.

FIG. 1: shows a block diagram of the entire method.
FIG. 2: shows a comparison of the averaged output powers Z of different methods with the power of the original noise signal (example: microphone spacing 12 cm, vehicle at 140 km/h). It shows the increasing noise reduction when processing is carried out with one channel, with two channels, and with two channels with post-processing.

The microphone signals x and y are transformed into the frequency domain (FFT, fast Fourier transform). The segments overlap by half and are weighted with a Hamming window. Each segment is N samples long and is extended by a further N zeros. The transform length is chosen, for example, as 2N = 512. This yields the transformed segments $X_l(i)$ and $Y_l(i)$. The output signal z is obtained after inverse transformation, taking the overlap of the segments into account. l denotes the block index of the segments and i the discrete frequency (i = 0, 1, 2, ..., 2N-1). The sampling rate of the signals x and y is, for example, 12 kHz.
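The segmentation described above can be sketched in Python with NumPy. This is only an illustrative sketch: the function name `analysis_frames` and the choice to drop an incomplete trailing segment are assumptions, not part of the patent text.

```python
import numpy as np

def analysis_frames(x, N=256):
    """Return the spectra X_l(i) of half-overlapping, Hamming-weighted segments.

    Each segment is N samples long and is zero-padded to 2N before the FFT.
    The input must contain at least N samples.
    """
    hop = N // 2                            # half overlap
    window = np.hamming(N)
    n_frames = 1 + (len(x) - N) // hop      # incomplete tail is dropped (assumption)
    frames = np.stack([x[l * hop:l * hop + N] * window for l in range(n_frames)])
    # n=2*N appends N zeros to every segment before transforming
    return np.fft.fft(frames, n=2 * N, axis=1)
```

With N = 256 this gives the transform length 2N = 512 mentioned in the text; row l of the result corresponds to block index l and column i to the discrete frequency i.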

In the frequency domain, the long-term mean of the magnitude spectrum is subtracted (spectral subtraction H_SPS). The short-term mean K and the long-term mean L are used to compute a first adaptive smoothing constant β. The interference spectrum S_nn(i) is estimated with β. This adaptive smoothing constant replaces the otherwise customary speech pause detector. l denotes the block index, i the discrete frequency. For example, β_0 = 0.03 is used as the smoothing constant β_0.

$$\beta_l = g_l\,\beta_0 \quad\text{with}\quad g_l = \frac{2\,L_{l-1}}{L_{l-1} + K_l}$$

$$L_l = (1 - \beta_l)\,L_{l-1} + \beta_l\,K_l$$

$$S_{nn,l}(i) = (1 - \beta_l)\,S_{nn,l-1}(i) + \beta_l\,|X_l(i)|^2$$
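As a rough illustration, one block update of this recursion can be sketched in Python with NumPy. The text does not spell out how the short-term mean K is computed; taking it as the mean power of the current block is an assumption made here, as are the function and variable names.

```python
import numpy as np

def update_noise_estimate(X, S_nn_prev, L_prev, beta0=0.03):
    """One block update of the interference spectrum S_nn.

    X         : complex spectrum X_l(i) of the current block
    S_nn_prev : interference spectrum S_nn,l-1(i) of the previous block
    L_prev    : long-term mean L_(l-1), must be positive
    """
    power = np.abs(X) ** 2
    K = power.mean()                        # assumption: short-term mean = mean block power
    g = 2.0 * L_prev / (L_prev + K)         # g becomes small for loud (speech) blocks
    beta = g * beta0                        # first adaptive smoothing constant
    L = (1.0 - beta) * L_prev + beta * K    # long-term mean
    S_nn = (1.0 - beta) * S_nn_prev + beta * power
    return S_nn, L, g
```

Because β shrinks when the block power is far above the long-term mean, the estimate is barely updated during speech, which is how the adaptive constant can stand in for a speech pause detector.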

The interference spectrum is normalized and subtracted, giving the processed spectrum $\hat{X}_l(i)$:

$$|\hat{X}_l(i)| = |X_l(i)| - \frac{S_{nn,l}(i)}{|X_l(i)|}$$

$$\hat{X}_l(i) = \left(1 - \frac{S_{nn,l}(i)}{|X_l(i)|^2}\right) X_l(i)$$

A modified form of the processed spectrum $\hat{X}_l(i)$ is obtained with:

$$\hat{X}_l(i) = \begin{cases} \left(1 - a\,\dfrac{S_{nn,l}(i)}{S_{xx,l}(i)}\right) X_l(i), & \text{for } \left(1 - a\,\dfrac{S_{nn,l}(i)}{S_{xx,l}(i)}\right) |X_l(i)| > f_0\,S_{nn,l}(i) \\[2ex] f_0\,S_{nn,l}(i), & \text{otherwise} \end{cases}$$

The power density S_xx,l of a channel is given by:

$$S_{xx,l}(i) = (1 - \alpha_l)\,S_{xx,l-1}(i) + \alpha_l\,|X_l(i)|^2$$

$$\alpha_l = \begin{cases} 2 - g_l, & \text{for } 0.5 < 2 - g_l < 2.0 \\ 0.5, & \text{for } 0.5 > 2 - g_l \\ 2, & \text{for } 2 < 2 - g_l \end{cases}$$

f_0 is referred to as the "spectral floor". Part of the background noise is admitted in order to produce a natural auditory impression and to mask part of the "musical tones". a is an overestimation factor for the noise and serves to further reduce the residual noise. For these values, for example, f_0 = 0.2 and a = 1.5 can be chosen.
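The effect of the overestimation factor and the spectral floor can be illustrated with a short Python/NumPy sketch. It keeps the subtracted value wherever it exceeds the floor f_0 · S_nn and substitutes the floor elsewhere. The function name is hypothetical, the overestimation factor is named `a` as in the modified subtraction formula, and S_xx is simply passed in rather than smoothed here.

```python
import numpy as np

def subtract_with_floor(X, S_nn, S_xx, a=1.5, f0=0.2):
    """Modified spectral subtraction with overestimation factor a and floor f0.

    X    : complex spectrum of the current block
    S_nn : estimated interference spectrum (same shape as X)
    S_xx : smoothed power density of the channel, assumed nonzero
    """
    gain = 1.0 - a * S_nn / S_xx            # may become negative when noise dominates
    floor = f0 * S_nn                       # admitted residual background noise
    # keep the subtracted spectrum only where it stays above the floor
    return np.where(gain * np.abs(X) > floor, gain * X, floor)
```

For a bin where the signal clearly dominates, the gain stays close to 1 and the spectrum is almost unchanged; for a bin that consists mostly of noise, the output is clamped to the floor instead of going negative.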

In contrast to the known forms of spectral subtraction, a second adaptive smoothing with α is used to reduce a further part of the "musical tones": the power density S_xx is smoothed only slightly during speech and strongly during pauses.

The corresponding equations apply to the second channel Y.

The method specified in patent application P 42 43 831, which has not been prior-published, is used to calculate the linear phase shift between the useful components in the channels. This method fits seamlessly into the noise reduction method according to the invention. The phase shift is estimated at a selected number of maxima of the cross-power density, and the phase correction is achieved by multiplication in the frequency domain with the all-pass function H_ALLP:

$$X_l(i) := X_l(i)\,H_{ALLP,l}$$

$$X_l(i) := X_l(i)\,\bigl(\cos(i\,\varphi_l) + j\,\sin(i\,\varphi_l)\bigr)$$

Here $\varphi_l$ stands for the estimated phase shift per frequency step in block l.
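A minimal Python/NumPy sketch of the all-pass multiplication follows. The parameter `phi` is a hypothetical name for the estimated phase shift per frequency step; how it is estimated (the method of P 42 43 831) is not shown here.

```python
import numpy as np

def apply_linear_phase(X, phi):
    """Multiply the spectrum X(i) by the all-pass cos(i*phi) + j*sin(i*phi)."""
    i = np.arange(len(X))                   # discrete frequency index
    return X * (np.cos(i * phi) + 1j * np.sin(i * phi))
```

As a plausibility check, a channel that is circularly delayed by d samples within a block of length M is realigned by `phi = 2 * np.pi * d / M`, since the delay multiplies bin i by exp(-j·2π·i·d/M).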

With more than two channels, the phase correction is carried out for each further channel. The first channel serves as the reference.

The directional filters for the channels are calculated by a "beamforming method". Various noise scenarios can be considered here, and different directional filters H_R result depending on the noise situation. One set of these filters is selected; however, if the system state is known during later operation, it is possible to switch to a specific set, or the filters can be adapted continuously. The beamforming method used is, for example, the gradient method according to Frost ("An Algorithm for Linearly Constrained Adaptive Array Processing", Proc. IEEE, Vol. 60, No. 8, 1972) or according to Sondhi and Elko ("Adaptive Optimization of Microphone Arrays under a Nonlinear Constraint", Int. Conf. on ASSP, Tokyo, 1986, pp. 981-984).

For the directional filtering, the multiplication in the frequency domain is:

$$X_l(i) := X_l(i)\,H_R(i)$$

Together with the directional filters, the addition of the channels yields the overall directional characteristic and the output signal:

$$Z_l(i) = X_l(i) + Y_l(i)$$

In addition, adding the channels averages, and thereby reduces, the statistical errors of the spectral subtraction.

The cross-power density of the two channels is then calculated using a smoothing constant (e.g. γ = 0.3):

$$S_{xy,l}(i) = (1 - \gamma)\,S_{xy,l-1}(i) + \gamma\,X_l(i)\,Y_l^*(i)$$

The cross-power density S_xy is normalized by the sum of the power densities S_xx, S_yy of the individual channels. This yields a modified coherence function:

$$H_{KKF,l}(i) = \begin{cases} \dfrac{S_{xy,l}(i)}{S_{xx,l}(i) + S_{yy,l}(i)}, & \text{for } \dfrac{S_{xy,l}(i)}{S_{xx,l}(i) + S_{yy,l}(i)} > 0.3 \\[2ex] 0.3, & \text{otherwise} \end{cases}$$

with

$$S_{xx,l}(i) = (1 - \gamma)\,S_{xx,l-1}(i) + \gamma\,X_l(i)\,X_l^*(i)$$

$$S_{yy,l}(i) = (1 - \gamma)\,S_{yy,l-1}(i) + \gamma\,Y_l(i)\,Y_l^*(i)$$

For the output signal Z:

$$Z_l(i) := Z_l(i)\,H_{KKF,l}(i)$$
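The channel addition and the coherence-based post-processing can be sketched together in Python with NumPy. This is a sketch under stated assumptions: the function name is hypothetical, the magnitude of S_xy is used for the ratio (the text does not say how the complex cross-power density is compared with 0.3), and a small constant guards the division.

```python
import numpy as np

def coherence_postfilter(X, Y, S_xy, S_xx, S_yy, gamma=0.3, h_min=0.3):
    """Add two phase-aligned channel spectra and weight the sum with H_KKF.

    S_xy, S_xx, S_yy are the smoothed densities of the previous block.
    Returns the weighted sum Z and the updated densities.
    """
    S_xy = (1 - gamma) * S_xy + gamma * X * np.conj(Y)
    S_xx = (1 - gamma) * S_xx + gamma * np.abs(X) ** 2
    S_yy = (1 - gamma) * S_yy + gamma * np.abs(Y) ** 2
    # assumption: use |S_xy|; 1e-12 avoids division by zero in silent bins
    ratio = np.abs(S_xy) / (S_xx + S_yy + 1e-12)
    H = np.maximum(ratio, h_min)            # lower limit 0.3 as in the text
    Z = (X + Y) * H
    return Z, (S_xy, S_xx, S_yy)
```

With this normalization, two identical channels give a ratio of 0.5, so the weighted sum equals a single channel, while bins with little cross-power are attenuated down to the lower limit.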

If directional filters according to the method of Sondhi and Elko are used, an inverse filter is required to correct the frequency response. This filter boosts the lower frequencies, because the frequency response of the directional filters (for the desired direction, i.e. the direction of the speaker) attenuates these frequencies. This filter H_INV can be approximated in a simple manner from the calculated frequency response:

$$Z_l(i) := Z_l(i)\,H_{INV,l}(i)$$

If the adaptation is carried out according to Frost's method, no inverse filter is required, because the frequency response in the direction of the speaker has the constant value 1.

The method according to the invention is not restricted to systems with two channels, but is also applicable to multi-channel systems (3 or more channels).

Claims

Method of noise reduction of at least two disturbed speech channels, wherein the disturbed speech channels are led together into one output channel, characterised thereby

that by means of digital directional filters (H_R1, H_R2) and a linear phase estimation for the individual channels a swingable acoustic directional lobe is produced (H_ALLP), which follows the speaker movement and thereby the spatial disturbance sources are attenuated,

that the disturbance is estimated in the individual speech pauses and the disturbance sources which are stationary in terms of time are attenuated by spectral subtraction (H_SPS1, H_SPS2),

that subsequently the individual speech channels are added and thereby the statistical disturbances of the spectral subtraction are averaged, and

that the sum signal is reprocessed by a modified coherence function and thereby the diffuse disturbance and reverberation components are attenuated.
Method according to claim 1, characterised thereby

that the spectral subtraction is carried out with two adaptive smoothing constants α, β,

that the disturbance spectrum S_nn is estimated with the first adaptive smoothing constant β, and

that the power density S_xx of the individual channels is smoothed strongly in the speech pauses and lightly during speech with the second adaptive smoothing constant α.
Method according to claim 1, characterised thereby that the linear phase displacement of at least two signals is ascertained by way of a defined number of maxima of the cross-power density in the frequency range.
Method according to claim 1, characterised thereby that the phase correction, the directional filtering and a possibly necessary inverse filtering is carried out in the frequency range.