WO1991003042A1

WO1991003042A1 - A method and an apparatus for classification of a mixed speech and noise signal

Info

Publication number: WO1991003042A1
Application number: PCT/DK1990/000214
Authority: WO
Inventors: Claus Elberling; Michael Ekelid; Carl Ludvigsen
Original assignee: Otwidan Aps Forenede Danske Høreapparat Fabrikker
Priority date: 1989-08-18
Filing date: 1990-08-17
Publication date: 1991-03-07
Also published as: DK406189A; DK406189D0

Abstract

For classification of a mixed speech and noise signal (101), the signal is divided into separate, frequency limited subsignals (103), each of which contains at least two harmonic frequencies of the speech signal. The envelopes (105) of the subsignals (103) are formed, as well as a measure (107) of synchronism between the envelopes (105). The synchronism measure (107) is compared with a threshold value for classification of the mixed signal as being significantly or insignificantly affected by the speech signal. The classification takes place with an unprecedented frequency and can therefore form the basis for a considerably more precise estimate of the noise signal than before, in particular when the noise has a speech-like nature.

Description

A method and an apparatus for classification of a mixed speech and noise signal

The invention concerns a method and an apparatus for classification of a mixed speech and noise signal as being significantly or insignificantly affected by the speech signal.

The time intervals in which the mixed signal is insignificantly affected by the speech signal may be used for forming a running estimate of the noise signal with known methods, so that the noise can be suppressed on the basis of this estimate.

The invention may be used in electroacoustic systems for transmission and signal processing of speech signals (e.g. mobile telephones, speech recognition systems and hearing aids), where noise suppressing and/or speech enhancing methods are used in an endeavour to eliminate or reduce the degradation of speech quality, speech recognition and speech perception caused by background noise.

Electroacoustic systems for transmission and signal processing of speech signals exist in numerous types and for many different purposes. The expansive development in the field of digital electronics, including in particular digital signal processors, has made it possible to employ a number of methods that were not practically useful before for removing or suppressing, in real time, the background noise, which occurs either acoustically simultaneously with the speech signal (e.g. in a helicopter cockpit where machine and rotor noise affects the acoustic communication from the pilot) or as an equivalent electric signal in the transmission system itself. Such methods are known from the literature and are called noise suppression or speech enhancement methods. Adaptive filtering and spectral subtraction may be mentioned among these methods; see e.g. (1) and (7). The aim of improving the signal/noise ratio (the ratio of speech signal magnitude to noise magnitude) is to counteract the degradation, caused by the noise, of the reception and the intelligibility of the transmitted speech signal. Several of the known methods are based on a running estimate of the statistical characteristics of the background noise, e.g. intensity and frequency content. With a speech or pause detector, time segments with and without speech signal, respectively, are identified, and in the segments exclusively containing background noise (speech pauses) the characteristics of the noise may be estimated by suitable signal analysis. Assuming a certain stationarity of the background noise, this estimate may be used for adjusting the noise suppression or speech enhancement method until the next time the noise can be estimated.

Several methods are described in the literature for distinguishing between voiced speech, unvoiced speech, and pauses, both without and with background noise. See e.g. ( ), (5) and (8). Reference (9) includes, among other things, a survey of the most important methods which have been used for classification of speech, in particular in connection with speech recognition systems.

Two of the known principles in particular should be mentioned: the energy histogram and valley detector principles. In a noise suppression method (3), use of the valley detector method is reported for pointing out the time intervals in which a mixed speech and noise signal consists exclusively of background noise (i.e. corresponding to pauses in the speech signal). In the invention described there, the method is incorporated in a type of feedback loop by acting on the individual frequency bands of the output signal, with the purpose of increasing the field of use of the speech/noise detector.

However, none of the known speech and pause detectors are particularly robust when the speech signal is subjected to e.g. considerable reverberation, or when the background noise is added in a poor signal/noise ratio (less than 0 dB) or has a speech-like nature, i.e. resembles the speech signal from one or more speakers. In these cases the detection will be less certain with known methods. It has been attempted to reduce this problem by using a priori knowledge about the speech and noise signals. It has thus been utilized in (1) and (2) that the amplitude fluctuations in speech and noise are different in certain cases. When, however, the noise is speech-like, this difference will be marginal.

So far, no speech detector has been developed which can operate reliably both with a poor signal/noise ratio and with speech-like noise. The object of the present invention is therefore to provide a method and an apparatus where this problem is solved.

This object is achieved by the method stated in claim 1 and the apparatus stated in claim 8, involving detection of the time segments in a mixed speech and noise signal which are dominated by the speech signal. This is to be understood in combination with the well-known fact, described below, that a speech signal includes a plurality of time segments where the speech signal contributes only insignificantly to the mixed signal. Such segments are not just speech pauses (between words and sentences, breathing), but in particular also very short intervals, typically within a word, where the speech signal assumes a value such that it contributes only insignificantly to the mixed signal. These segments are detected, and on this basis it is possible to update parameters for the background noise. This is done with unprecedented frequency and can therefore form the basis for a considerably more precise estimate of the background noise.
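As a rough illustration of how such a classification could feed a running noise estimate, the following sketch updates a noise power estimate only in segments that are not dominated by speech. The exponential averaging, the function name `update_noise_estimate`, the per-segment power values and the smoothing factor `beta` are assumptions made for this example; the text itself only refers to updating background noise parameters.

```python
def update_noise_estimate(segment_powers, speech_dominated, beta=0.2):
    """Running noise power estimate, frozen while speech dominates.

    segment_powers   -- power of the mixed signal in each time segment
    speech_dominated -- 1 where the segment is dominated by speech, else 0
    beta             -- assumed smoothing factor of the exponential average
    """
    estimate = None
    history = []
    for power, is_speech in zip(segment_powers, speech_dominated):
        if not is_speech:
            if estimate is None:
                estimate = power          # first noise-only segment seen
            else:
                estimate += beta * (power - estimate)
        history.append(estimate)          # None until noise has been observed
    return history


# Hypothetical segment powers: low values are noise, high values contain speech.
powers = [1.0, 9.0, 1.0, 9.0, 2.0]
flags = [0, 1, 0, 1, 0]
noise_track = update_noise_estimate(powers, flags, beta=0.5)
```

The more often segments are flagged as not speech-dominated, the more often the estimate is refreshed, which is the benefit claimed for the frequent classification.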

In a speech signal the energy can assume relatively great values in short time intervals, corresponding to some of the voiced sounds (e.g. the open vowels) as well as some of the consonants (the fricatives and the plosives). Therefore, the signal/noise ratio will be relatively great in time segments containing these speech sounds, and these segments are thus particularly useful for detecting the presence of speech in background noise. The reason why the energy is great in the mentioned speech sounds is the following:

1) A vowel may be described as a (quasi)periodic time signal which in terms of frequency consists of a fundamental frequency and its harmonics, whereby the speech energy occurs simultaneously over a larger frequency range.

2) A fricative and/or a plosive may be described as a short, noise-like time signal where the energy occurs simultaneously in a wide frequency range.

In the preferred embodiment of the invention the frequency range of the speech signal is suitably divided into a plurality of frequency bands, and it thus applies that for each of the two types of speech sounds the energy occurs with a certain simultaneity between the frequency bands. Further, it is special to the vowels that, since the difference between two consecutive harmonic frequencies is always equal to the fundamental frequency of the speech signal, the envelope of a frequency restricted subsignal containing two or more consecutive harmonic frequencies will always be periodic and substantially synchronous with the fundamental frequency: the envelope represents a beat signal with a frequency equal to the difference between the two harmonics, which is precisely the fundamental frequency. Since it is the same frequency, viz. the fundamental frequency of the speech signal, which for all the subsignals causes the beat signal detected by envelope detection, the envelopes of the subsignals will be substantially synchronous or correlated with each other.
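The beat argument can be written out for the simplest case. As an illustrative assumption (not stated in the text), take a subsignal consisting of just two consecutive harmonics of equal amplitude, with fundamental frequency $f_0$ and harmonic numbers $k$ and $k+1$. The sum-to-product identity then gives the carrier and the envelope $e(t)$:

```latex
\cos(2\pi k f_0 t) + \cos\bigl(2\pi (k+1) f_0 t\bigr)
  = 2\,\cos(\pi f_0 t)\,\cos\bigl(2\pi (k + \tfrac{1}{2}) f_0 t\bigr),
\qquad
e(t) = 2\,\lvert \cos(\pi f_0 t) \rvert .
```

The envelope $e(t)$ has period $1/f_0$ and does not depend on $k$, so under this assumption every subsignal that holds two consecutive harmonics shows the same envelope rhythm regardless of its frequency band.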

In order that this envelope, which is periodic with the fundamental frequency, can always be produced, it is necessary that each subsignal has a frequency band width which always comprises at least two harmonic frequencies. This is obtained with a band width of at least twice the fundamental frequency. If the fundamental frequency is e.g. 220 Hz, the band width must be at least 440 Hz.
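The band width requirement can be checked numerically. The sketch below lists the harmonics of a 220 Hz fundamental that fall inside a band; the helper name `harmonics_in_band` and the band edges used are hypothetical values chosen only for illustration.

```python
import math


def harmonics_in_band(f0, f_low, f_high):
    """Return the harmonic frequencies k*f0 (k >= 1) lying in [f_low, f_high]."""
    k_min = max(1, math.ceil(f_low / f0))
    k_max = math.floor(f_high / f0)
    return [k * f0 for k in range(k_min, k_max + 1)]


f0 = 220.0
wide = harmonics_in_band(f0, 1000.0, 1440.0)    # band width 440 Hz = 2 * f0
narrow = harmonics_in_band(f0, 1000.0, 1220.0)  # band width 220 Hz = f0
print(wide)    # [1100.0, 1320.0]
print(narrow)  # [1100.0]
```

A band of width 2 * f0 holds at least two harmonics wherever it is placed, whereas the narrower band in this example holds only one, so no beat at the fundamental frequency can form in it.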

It is well-known from the literature, see e.g. (3), to examine a mixed speech and noise signal by division into time intervals and by splitting into a number of subsignals by means of a filter bank consisting of bandpass filters. However, in contrast to the previously described methods, this is done in a particular manner in the present invention, since the invention realizes a filter bank consisting of bandpass filters with a band width which is specifically dependent upon general characteristics of the speech signal, as well as a detector utilizing the correlation between the envelopes of the subsignals. Moreover, and still in contrast to the previously described methods, the aim of the present invention is not to point out the time intervals in the mixed speech and noise signal which consist of noise only (i.e. corresponding to pauses in the speech signal), but to point out the intervals which are dominated by the speech signal.

The invention will be explained more fully by the following description of a preferred embodiment with reference to the drawing, in which

fig. 1 is a block diagram schematically showing an apparatus according to the invention,

fig. 2 shows an example of an input signal consisting of a portion of a speech signal without noise, and how this signal is processed in the apparatus in fig. 1,

fig. 2A shows the input signal,

fig. 2B shows the frequency limited subsignals originating from filtering of the input signal,

fig. 2C shows the envelope signals corresponding to the subsignals in fig. 2B,

fig. 2D shows the synchronism signal from the synchronism detector as well as a threshold value with which it is compared, and

fig. 2E shows the final classification signal from the threshold detector.

In fig. 1 an electric input signal 101 consisting of a speech signal mixed with a noise signal (traffic noise, cafeteria noise, speech from other persons or the like) is passed to a filter bank 102 consisting of a plurality of optionally overlapping bandpass filters with increasing center frequency and covering in combination the entire frequency range of the speech signal or part thereof. Each bandpass filter has a band width greater than twice the greatest expected value of the fundamental frequency of the speech signal, so that a subsignal 103 comprising at least two consecutive harmonic frequencies of the fundamental frequency can pass through each bandpass filter.
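One possible layout of the pass bands of the filter bank 102 can be sketched as follows. The contiguous, non-overlapping arrangement, the function name `band_edges`, the extra 100 Hz of band width and the numeric values (a greatest expected fundamental of 250 Hz and a 300 Hz to 3300 Hz range) are assumptions for this example only; the text also allows overlapping bands and other ranges.

```python
def band_edges(f0_max, f_start, f_stop, extra_hz=100.0):
    """Contiguous pass bands (f_low, f_high), each wider than 2 * f0_max.

    f0_max   -- greatest expected fundamental frequency in Hz
    extra_hz -- assumed margin added on top of the minimum width 2 * f0_max
    """
    width = 2.0 * f0_max + extra_hz
    bands = []
    low = f_start
    while low + width <= f_stop:
        bands.append((low, low + width))
        low += width
    return bands


bands = band_edges(f0_max=250.0, f_start=300.0, f_stop=3300.0)
for f_low, f_high in bands:
    print(f"{f_low:.0f} Hz to {f_high:.0f} Hz")
```

Each of the five resulting bands is 600 Hz wide, so for any fundamental up to 250 Hz it passes at least two consecutive harmonics.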

The subsignals are passed to their respective envelope detectors 104, which form the time envelopes 105 of the subsignals 103, e.g. by means of rectification, squaring or analytic signals, as well as optional subsequent low-pass filtering. This signal processing, which following bandpass filtering of the input signal generates and utilizes the envelopes of the bandpass filtered subsignals, is known in other connections from the acoustic/audiological field, see e.g. (6).
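A minimal sketch of one of the envelope detection options named above, rectification followed by low-pass filtering, is given below. The moving-average low-pass filter, the function name `envelope`, the sampling rate and the test subsignal (the 5th and 6th harmonics of an assumed 200 Hz fundamental) are illustrative choices, not details taken from the text.

```python
import math


def envelope(x, half_width=4):
    """Rectify x and smooth it with a centered moving average (a crude low-pass)."""
    rectified = [abs(v) for v in x]
    env = []
    for n in range(len(x)):
        window = rectified[max(0, n - half_width): n + half_width + 1]
        env.append(sum(window) / len(window))
    return env


fs = 8000    # assumed sampling rate in Hz
f0 = 200.0   # assumed fundamental frequency in Hz
# Subsignal holding two consecutive harmonics: 1000 Hz and 1200 Hz.
x = [math.cos(2 * math.pi * 5 * f0 * n / fs) + math.cos(2 * math.pi * 6 * f0 * n / fs)
     for n in range(400)]
env = envelope(x)
period = int(fs / f0)  # 40 samples, one period of the fundamental
```

In this example the envelope repeats every `period` samples and is several times larger at the beat maxima (e.g. sample 40) than at the beat minima (e.g. sample 60), so it follows the fundamental frequency rather than the individual harmonics.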

The envelope signals are passed to a synchronism detector 106, which produces a measure of synchronism between the envelope signals 105 for a time segment of the signals. Then, the time course of the computed synchronism has the shape of a staircase curve and is called the synchronism signal 107.

The principle of the synchronism detector 106 may e.g. be based on correlation, an artificial neural network or another computing method applied to all or a subset of the envelope signals 105. For example, a correlation can be computed by first computing the product sum of the signal values for any pair of signals, i.e. the envelope signals from two adjacent bandpass filters, and then summing all the computed product sums.
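The product-sum correlation described above can be sketched per time segment as follows. The function names, the fixed non-overlapping segment length and the toy envelope values are assumptions for illustration; the text does not specify any normalisation or mean removal, so none is applied here.

```python
def product_sum(a, b):
    """Sum of sample-by-sample products of two equally long signals."""
    return sum(u * v for u, v in zip(a, b))


def synchronism_signal(envelopes, segment_len):
    """One synchronism value per time segment (a staircase when plotted).

    envelopes   -- list of envelope signals, ordered by band center frequency
    segment_len -- assumed number of samples per time segment
    """
    n = len(envelopes[0])
    values = []
    for start in range(0, n - segment_len + 1, segment_len):
        seg = [e[start:start + segment_len] for e in envelopes]
        # Add up the product sums of all pairs of adjacent bands.
        values.append(sum(product_sum(a, b) for a, b in zip(seg, seg[1:])))
    return values


# Toy envelopes for three bands: in step in the first segment,
# the middle band out of step in the second segment.
in_step = [0, 1, 0, 1]
out_of_step = [1, 0, 1, 0]
envelopes = [in_step + in_step, in_step + out_of_step, in_step + in_step]
sync = synchronism_signal(envelopes, segment_len=4)
print(sync)  # [4, 0]
```

The measure is large when the envelopes of adjacent bands rise and fall together and drops when they do not.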

Finally, the synchronism signal 107 is passed to a threshold detector 108 where the synchronism signal 107 is compared with a threshold value. If the synchronism signal 107 is greater than the threshold value, the time segment in question is classified as being dominated by speech, and the classification signal 109 is set to the value binary 1. If not, the classification signal 109 is set to the value binary 0.
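The comparison carried out by the threshold detector 108 amounts to very little code. In this sketch the function name `classify`, the fixed threshold and the synchronism values are hypothetical.

```python
def classify(sync_signal, threshold):
    """Binary 1 where the synchronism exceeds the threshold, else binary 0."""
    return [1 if s > threshold else 0 for s in sync_signal]


# Hypothetical synchronism values for four consecutive time segments.
classification = classify([4, 0, 2.5, 2], threshold=2)
print(classification)  # [1, 0, 1, 0]
```

Note that a segment whose synchronism merely equals the threshold is classified as 0, since the text requires the synchronism signal to be greater than the threshold value.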

The overall function of the synchronism detector 106 and the threshold detector 108 may also be implemented by means of either a trained, a self-organizing or other artificial neural network using the envelope signals 105 as input signals and forming the desired classification signal 109 as output signal for classification of the mixed signal.

Presence of a noise signal affects the classification more or less depending upon the characteristics of the noise signal. If the noise signal is stochastic, speech-like noise, the speech detection will by and large not be affected even with a very small signal/noise ratio. If, on the other hand, the noise signal is a signal with an inherent modulation like that of a speech signal, or if it is a real speech signal from one or more persons, the interplay between the actual signal/noise ratio and the construction of the threshold detector 108 will be of decisive importance. When e.g. the threshold detector 108 is arranged such that the threshold value 210, with a given time constant, adaptively adjusts itself to a given fraction of the size of the synchronism signal 107, then advantageously only the dominating speech signal will be detected. Removal of the lowest frequency components of the synchronism signal provides the additional advantage that a continuous noise signal consisting of harmonic frequency components (e.g. acoustic noise from a rotating machine) will not erroneously be classified as being a speech signal.

Fig. 2 shows an example of how a given input signal 201 is processed in the apparatus in fig. 1. To illustrate the fundamental principle of the invention, the input signal 201 is shown in fig. 2A as a short speech signal without noise, consisting first of a (voiced) vowel and then of an unvoiced fricative. Fig. 2B shows the frequency limited subsignals 203 formed in the filter bank 102. Fig. 2C illustrates the envelope signals 205 formed by the envelope detectors 104 from the subsignals 203 in fig. 2B. At the vowel, the envelope signals 205 in several frequency bands are shown to be correlated with each other and modulated with a frequency corresponding to the fundamental frequency. At the fricative, the envelope signals 205 show that short-term energy is present simultaneously in several frequency bands. Fig. 2D shows the synchronism signal 207 computed by the synchronism detector 106 as well as the threshold value 210 with which it is compared. Finally, fig. 2E shows the obtained classification signal 209.
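The adaptive arrangement of the threshold detector 108 mentioned above, in which the threshold value 210 follows a given fraction of the size of the synchronism signal with a given time constant, might be sketched as below. The first-order exponential smoothing, the function name `adaptive_classify` and the parameter values `fraction` and `alpha` are assumptions made for this example.

```python
def adaptive_classify(sync_signal, fraction=0.5, alpha=0.1):
    """Classify segments against a threshold that tracks the synchronism level.

    fraction -- assumed fraction of the smoothed synchronism used as threshold
    alpha    -- assumed smoothing factor; smaller means a longer time constant
    """
    if not sync_signal:
        return []
    level = sync_signal[0]             # smoothed size of the synchronism signal
    classification = []
    for s in sync_signal:
        threshold = fraction * level   # plays the role of threshold value 210
        classification.append(1 if s > threshold else 0)
        level += alpha * (s - level)   # slow adaptation to the current size
    return classification


# Hypothetical synchronism values: a dominant talker (10) alternating
# with a weaker, competing talker (3).
flags = adaptive_classify([10, 3, 10, 3])
print(flags)  # [1, 0, 1, 0]
```

Because the threshold settles at a fraction of the largest recurring synchronism values, segments driven by the weaker talker stay below it in this example and only the dominating speech is flagged.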

An apparatus according to the invention may be implemented either in analog or digital hardware or in software or in combinations thereof.

References:

(1) US Patent No. 4 025 721

(2) US Patent No. 4 185 168

(3) US Patent No. 4 630 304

(4) Cox B.V. and Timothy L.M.K. 1980. Nonparametric Rank-Order Statistics Applied to Robust Voiced-Unvoiced-Silence Classification. IEEE Trans. ASSP 28, 5, 550-561.

(5) Gordos G. 1983. Speech Detection in Severe Noise. Proc. 11 ICA 91-94.

(6) Houtgast T. and Steeneken H.J.M. 1973. The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28, 66-73.

(7) Lim J.S. 1986. Speech Enhancement. Proc. ICASSP 3135-3142.

(8) McAulay R.J. and Malpass M.L. 1980. Speech Enhancement Using Soft-Decision Noise Suppression Filter. IEEE Trans. ASSP 28, 2, 137-145.

(9) Savoji M.H. 1989. A robust algorithm for accurate endpointing of speech signals. Speech Comm. 8, 45-60.

Claims

Patent Claims

1. A method of classifying, in a selected time interval, a mixed speech and noise signal (101, 201) as being significantly or insignificantly affected by the speech signal, where the mixed signal is divided into a plurality of separate, frequency limited subsignals (103, 203), characterized in that

- each subsignal (103, 203) comprises at least two harmonic frequencies for a fundamental frequency of the speech signal,

- the time envelope (105, 205) is generated for the subsignals (103, 203),

- a measure (107, 207) of synchronism between these envelopes (105, 205) is generated, and

- this measure (107, 207) is compared with a threshold value (210).

2. A method according to claim 1, characterized in that the mixed signal is divided into a plurality of time intervals in which the signal is classified successively.

3. A method according to claim 1, characterized in that the selected time interval is a running time window.

4. A method according to claims 1-3, characterized in that all envelopes are used for generating the measure (107, 207) of synchronism between the envelopes (105, 205).

5. A method according to claims 1-3, characterized in that one or more subsets of the envelopes (105, 205) are used for generating the measure (107, 207) of synchronism between the envelopes (105, 205).

6. A method according to claims 1-5, characterized in that the generation of the measure (107, 207) of synchronism between the envelopes (105, 205) is based on a correlation computation.

7. A method according to claims 1-5, characterized in that the envelopes (105, 205) are passed as input signals to an artificial neural network which classifies the signal.

8. An apparatus for classification of a mixed speech and noise signal (101, 201), comprising filter means each of which permits passage of a subsignal (103, 203), characterized in that

- each subsignal (103, 203) contains at least two harmonic frequencies for a fundamental frequency of the speech signal, and that the apparatus moreover comprises

- means (104) for generating the time envelopes (105, 205) of the subsignals,

- means (106) for generating a measure (107, 207) of synchronism between these envelopes, as well as

- means (108) for comparing the synchronism signal (107, 207) with a given threshold value (210).