GB2278984A - Speech presence detector - Google Patents
Speech presence detector
- Publication number
- GB2278984A (application GB9312049A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- speech
- output
- neural network
- detection system
- algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
A speech detection system for detecting the presence of speech in an audio input signal 1 derived from a predetermined environment such as an HF communication system. Algorithms are used at 4 to extract salient features from a digital signal obtained by analogue to digital conversion at 3 of a received audio signal 1. The algorithm outputs are applied to a neural network 5 which is trained by back propagation to distinguish between speech and other signals. The output of neural network 5 controls a switch or relay 6 to pass an audio signal for the duration that speech is detected. Delay 8 ensures that the initial part of the speech is not lost. Filter 10 removes high frequency components introduced by DAC 9. A post-processing stage may follow neural network 5 for effective control of switch 6. The algorithm processor 4 may receive a digital audio signal directly as well as via an FFT stage. <IMAGE>
Description
SPEECH DETECTION SYSTEM
The present invention relates to a speech detection system for detecting the presence of speech in an audio input signal. The system is applicable in particular to speech detection in environments where there are high levels of background noise.
Squelch devices are known which are designed to output an audio signal from a radio receiver only when speech is present on a received signal. Squelch devices have been used successfully in many applications, but it has not proved possible in the past to provide a reliable squelch device for use with HF receivers. HF receivers suffer more than radios operating at any other frequency from a variety of unwanted signals which are sufficiently speech-like to cause a conventional squelch device to fail. The signal to noise level of detected signals in HF receivers can also be very low. Conventional squelch devices when applied to HF signals accordingly suffer from false alarms (incorrectly classifying noise signals as speech) and missed calls (not detecting speech when it is present). The level of false alarms and missed calls is sufficiently large to prevent conventional squelch systems from being successfully deployed with HF receivers.
Given that squelch devices of acceptable reliability are not available for use with HF receivers, it has been necessary for operators to listen out for possible incoming calls, and as a result the operators are constantly subjected to high levels of noise and interference. This is so tiring that the operators can only work reliably for short periods of time. Thus there has been a clear need for many years for an HF squelch device of acceptable reliability.
The unwanted signals which occur on HF radio communication systems consist of both naturally occurring noise and man-made interference. Naturally occurring noise includes background hiss and static crashes caused by atmospheric discharges. Interference signals can be of various forms, including tones (single, multiple, continuous and intermittent), sweep signals (rapidly swept carriers), warbles (amplitude modulated carriers), teleprinter signals, morse code, fax signals, modulation (splatter) from adjacent channels, electrical pulse noise and jamming signals.
Thus signals received on an HF receiver can be a complex amalgam of noise, interference and speech signals that can be of very similar form. It is this that has defeated earlier attempts to provide an HF squelch device with acceptable reliability.
It is an object of the present invention to provide a speech detection system which is capable of obviating or mitigating the problems outlined above with regard to HF communication systems.
According to the present invention there is provided a speech detection system for detecting the presence of speech in an audio input signal derived from a predetermined environment, comprising means for converting the audio signal to a digital signal, means for analysing characteristics of the digital signal using each of a plurality of algorithms selected to discriminate between speech and other forms of signal, the analysing means producing a plurality of algorithm outputs, and a neural network having a plurality of inputs to each of which a respective algorithm output is applied, the neural network having been trained using training signals derived from the predetermined environment to provide an output indicative of the presence of speech in the audio input signal.
The predetermined environment may be an HF communications system, and the neural network may be trained by back propagation.
Preferably, the audio signal is applied to the converting means through a low pass filter.
The neural network may provide an output to control a switching device to deliver the audio input signal to an output terminal.
Alternatively, a delay buffer may be connected to the output of the audio signal converting means, the output of the delay buffer being connected to the switching device by a digital to analogue converter, and the delay buffer storing the output of the audio signal converting means for a period sufficient to enable the detection of speech by the neural network.
Preferably a fast fourier transform circuit is connected to the output of the audio signal converting means, the outputs of the audio signal converting means and the fast fourier transform circuit being applied to the analysing means.
The neural network may be a three layer device, and provide three outputs indicative respectively of the detection of speech, tones and sweep signals.
Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram of the basic components of a first embodiment of the present invention;
Fig. 2 illustrates a second embodiment of the present invention incorporating a delay buffer; and
Fig. 3 illustrates further details of embodiments of the present invention.
Referring to Fig. 1, this illustrates the present invention in its most basic form. An input terminal 1 has applied to it an audio output from an HF receiver (not shown). That audio signal is applied through a low pass filter 2 to an analogue to digital converter 3. The filter 2 is provided to prevent higher frequencies reaching the ADC and causing aliasing problems. The digital output of the ADC is applied to an algorithm processor 4 which analyses the digital output in accordance with a series of different algorithms selected to discriminate between various characteristics of the signals which can be expected to be output from an HF receiver. The processor 4 provides algorithm outputs in parallel to a neural network 5.
The neural network 5 is trained by back propagation to provide an output representative of the presence or absence of speech to a switching device in the form of a relay 6. When the presence of speech is detected, the output of the neural network closes the relay 6 and delivers the audio signal to an output terminal 7. The audio signal appearing on output 7 could then be monitored by a human operator. As the human operator would only have to monitor the HF signal once speech had been received, operator fatigue problems are avoided.
An experimental circuit in accordance with Fig. 1 has shown that the device can detect speech within 300ms, so only the first syllable of detected speech will be clipped and therefore lost from the audio output signal appearing at terminal 7. This initial loss of signal could, however, be avoided by adopting a modification of the circuit of Fig. 1 as illustrated in Fig. 2. The same reference numerals are used in Figs. 1 and 2 for the same components.
Referring to Fig. 2, the switching device 6 is in the form of a logical switch implemented in software. The output of the ADC 3 is applied to a delay circuit 8, the digital output of which is applied to the switching device 6. The output of the switching device 6 is converted by a digital to analogue converter 9. The delay buffer 8 stores the digitised signal on its input for a period which can be variable but typically will be 500ms. When speech is detected, the delayed signal is switched through to and re-constructed by the DAC 9, this occurring before the beginning of the speech content in the input signal has passed through the buffer 8. Accordingly the signal appearing at output terminal 7 includes all of the detected speech.
A reconstruction filter 10 is preferably provided between the switching device 6 and the terminal 7 to remove high frequency components introduced by the sampling processes employed in the DAC.
The performance of the speech detection devices described with reference to Figs. 1 and 2 has been found to be very much better than any previous squelch devices designed for use with HF receivers. It must be appreciated that the process of speech detection is quite different from that of speech recognition. In the case of speech recognition, individual words must be recognised. Accordingly speech recognition systems are limited to the number of words and the number of speakers that can be recognised. The speech detection processes as described in this specification in contrast are capable of generalising over all words and all speakers, both male and female, and of different languages, even in a very noisy environment.
Further details of the speech detection processing device are described with reference to Fig. 3. Again the same reference numerals are used where appropriate in Fig. 3 as in Fig. 1.
Referring to Fig. 3, the illustrated circuit comprises a fast Fourier transform processor 11 which converts the digitised signal produced by the ADC 3 into the Fourier domain. A sigma delta ADC was used which over-sampled the input by 128 times at a low resolution and then digitally filtered it to produce a high resolution digital output at 7.8kHz. The FFT operated on each consecutive set of 128 samples to generate 64 output bins in the range 0 to 3.9kHz, which were used by the Fourier domain algorithms. The algorithm processor incorporates algorithms which operate in real time on both the output of the FFT 11 and the output of the ADC 3. Thus some algorithms work on the digitised time domain data (time domain algorithms) and the rest work on the output from the FFT (Fourier domain algorithms). The processor used was a Texas Instruments TMS320C30 floating point digital signal processor running at 33MHz.
A post-processing stage 12 is connected between the output of the neural network 5 and the relay 6. The post processing stage 12 makes the final decision as to whether or not the audio signal applied to terminal 1 should be switched through to the terminal 7. The post processing stage operates to ensure that the squelch device is opened (switch 6 is closed) only when speech has been detected for a predetermined period at a predetermined level. This avoids triggering of "speech detected" by sudden noise spikes. The post-processing stage also incorporates a latch circuit to hold the relay 6 closed for a predetermined period after speech is no longer detected. This is desirable to prevent the squelch device switching in and out during speech, for example when there are short breaks between words as the speaker takes time for thought or breathing.
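The behaviour described for the post-processing stage can be sketched as a gate with an attack count and a hang-over latch. The threshold, attack and hang values below are illustrative assumptions; the patent states none of them:

```python
def postprocess(confidences, threshold=50, attack=3, hang=10):
    """Sketch of the post-processing logic: open the gate only after
    the speech confidence has exceeded `threshold` for `attack`
    consecutive frames, and hold it open for `hang` frames after
    speech stops. All three parameter values are assumptions."""
    above, hold, gate = 0, 0, []
    for c in confidences:
        if c > threshold:
            above += 1
            if above >= attack:
                hold = hang          # (re)arm the hang-over latch
        else:
            above = 0
        gate.append(hold > 0)        # True = squelch open, relay 6 closed
        hold = max(0, hold - 1)
    return gate
```

A single high-confidence spike never opens the gate, while short gaps between words leave it held open by the latch.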
In the embodiment of the invention illustrated in Fig. 3, the algorithm processor 4 produces 47 algorithm outputs, 43 from the output of the FFT 11 and 4 from the output of the ADC 3. The algorithms used were developed over a prolonged period, much of which was spent visualising the audio data to enable useful algorithms to be generated, and then testing the generated algorithms in isolation. These algorithm outputs were then used to train a neural network, and the results indicated which signals the system was poor at correctly classifying. The system was then improved by writing appropriate additional algorithms, modifying existing algorithms and collecting additional training data. The result of changing certain parameters was not always as expected, since the neural network was combining many values in a non-linear system. The process was an iterative one of producing algorithms and then adjusting those algorithms to generate the required results, the neural network being used merely to combine the strengths of all the algorithms which were produced.
The 47 algorithm outputs produced in the system illustrated in Fig. 3 included the outputs of the following three time domain algorithms, each of which processes the raw sampled data from the ADC 3.
1. Standard deviation of energy density. This algorithm produced a single output. The energy density is the absolute value of the time domain samples. This algorithm considered the last 256 time domain samples, using a Hamming window. The standard deviation of these 256 values was the algorithm output.
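A sketch of this algorithm in Python follows. Whether the Hamming window is applied before or after taking absolute values is not stated in the text; applying it to the absolute values is an assumption:

```python
import math

def energy_density_std(samples):
    """Algorithm 1 sketch: standard deviation of the energy density
    (absolute sample values) over the last 256 samples, weighted by a
    Hamming window. Windowing the absolute values is an assumption."""
    n = len(samples)
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1))
              for i in range(n)]
    density = [abs(s) * w for s, w in zip(samples, window)]
    mean = sum(density) / n
    return math.sqrt(sum((d - mean) ** 2 for d in density) / n)
```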
2. Variation in mean energy density. This algorithm produced one output. The mean energy density was calculated for two sets of 2048 time domain samples, the first ending with the current sample and the second offset 1024 samples earlier. The algorithm output is the difference between these two mean energy density measurements, divided by the first mean energy density.
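A minimal sketch, reading "the second being 1024 earlier samples" as a 2048-sample window offset by 1024 samples (that interpretation is an assumption):

```python
def mean_energy_variation(samples):
    """Algorithm 2 sketch: compare the mean energy density (mean |x|)
    of the latest 2048 samples with that of a 2048-sample window
    starting 1024 samples earlier, normalised by the first mean.
    Needs at least 3072 samples; the first mean is assumed non-zero."""
    cur = samples[-2048:]
    prev = samples[-3072:-1024]
    m1 = sum(abs(s) for s in cur) / 2048.0
    m2 = sum(abs(s) for s in prev) / 2048.0
    return (m1 - m2) / m1
```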
3. Mean and standard deviation zero crossing rate. This algorithm produced two outputs, both of which used the zero crossing rate. The algorithm considered a window of 256 time domain samples.
Each point was assigned an instantaneous zero crossing rate of either 0 or 1 depending on whether the time domain waveform passed through zero at that point. The first algorithm output was the mean of these instantaneous zero crossing rates and the second algorithm output was the standard deviation of the instantaneous zero crossing rates.
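The two zero crossing outputs can be sketched as:

```python
import math

def zero_crossing_stats(samples):
    """Algorithm 3 sketch: each point gets an instantaneous zero
    crossing rate of 1 if the waveform changes sign there, else 0;
    the outputs are the mean and standard deviation of these values
    over the window."""
    z = [1.0 if a * b < 0 else 0.0 for a, b in zip(samples, samples[1:])]
    mean = sum(z) / len(z)
    std = math.sqrt(sum((v - mean) ** 2 for v in z) / len(z))
    return mean, std
```

Speech tends to give a moderate mean with a high standard deviation, whereas a steady tone gives a near-constant rate.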
Ten algorithms producing forty three outputs operated on the output of the FFT 11, all processing the values in 64 bins resulting from the FFT. These algorithms were as follows:
1. Peak frequencies and change in peak frequencies. This algorithm provided three outputs. The first two outputs were the first and second peak frequencies. The position of the peak bin in the FFT gave the first peak frequency. The second peak frequency was the next highest FFT bin position, ignoring all those bins associated with the first peak. The third output of this algorithm was the change in the first peak frequency between consecutive FFTs.
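The first two outputs can be sketched as below. The patent does not define which bins are "associated with" the first peak, so a fixed exclusion zone of two bins either side is an assumption:

```python
def peak_positions(bins, exclusion=2):
    """Algorithm 1 sketch (first two outputs): the first peak is the
    index of the largest bin; the second peak is the largest bin
    outside an assumed +/-`exclusion` zone around the first."""
    first = max(range(len(bins)), key=lambda i: bins[i])
    rest = [i for i in range(len(bins)) if abs(i - first) > exclusion]
    second = max(rest, key=lambda i: bins[i])
    return first, second
```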
2. Peak frequency consistency. This algorithm had two outputs.
The first peak frequency was found for each FFT and stored in a buffer of the results for the last 20 FFTs. It was stored as the FFT bin number at which the peak was found. The number of times each of the 64 bins occurred was summed over the 20 FFTs. The highest sum of the 64 bins was the first algorithm output. The second output of this algorithm also worked from the buffer of first peak frequencies for the last 20 FFTs. It calculated the change in position of the peak frequency between consecutive FFTs for the last 20 FFTs. These values were then summed to give the second algorithm output.
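Both consistency outputs can be sketched from the peak-position buffer; summing absolute (rather than signed) changes for the second output is an assumption:

```python
from collections import Counter

def peak_consistency(peak_history):
    """Algorithm 2 sketch: over the buffer of first-peak bin numbers
    for the last 20 FFTs, output 1 is the highest occurrence count of
    any bin and output 2 is the summed change in peak position between
    consecutive FFTs (absolute changes assumed)."""
    highest_count = Counter(peak_history).most_common(1)[0][1]
    summed_change = sum(abs(q - p)
                        for p, q in zip(peak_history, peak_history[1:]))
    return highest_count, summed_change
```

A steady tone gives a high count and near-zero summed change; speech gives the opposite.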
3. Ratio of three highest peaks in the FFT. This algorithm had two outputs. The first was the ratio of the power at the first peak frequency to that at the second peak frequency. The second output was the ratio between the second and third peak frequency powers.
The first and second peak frequencies were those described in the above algorithm "peak frequencies and change in peak frequencies".
The third peak frequency was the peak FFT bin position, ignoring any bins associated with the first and second peak frequencies.
4. Change in spectral content. This algorithm had one output.
It compared the current FFT to the last FFT and summed the difference between each bin. For each of the 64 bins in the FFT, the difference between the current and the last FFT was calculated. These differences were summed for the 64 bins and the total was divided by the total of the 64 bins of the current FFT for normalisation purposes.
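This measure can be sketched as follows; absolute per-bin differences are assumed, since signed differences would largely cancel:

```python
def spectral_change(current, previous):
    """Algorithm 4 sketch: sum the per-bin differences between the
    current and previous 64-bin FFT, normalised by the total of the
    current FFT (absolute differences assumed)."""
    diff = sum(abs(c - p) for c, p in zip(current, previous))
    return diff / sum(current)
```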
5. First peak frequency linearity measurements. This algorithm had five outputs. They made different measures on the linearity of the first peak frequency over 20 FFTs. A buffer of the first peak frequency for the last 20 FFTs was generated and used as the basis for the five outputs. The first algorithm output was the standard deviation of the first peak frequency over the last 20 FFTs. The second and third outputs were the values of the linear regression gradient (m) and the intercept (c), respectively, for the equation y = mx+c. The x axis was considered to be the position in the buffer of peak frequencies and the y axis was the peak frequency value. The values of m and c were calculated using least squares linear regression. The fourth output was the product moment correlation coefficient applied to the buffer of peak frequency values. The fifth output was the mean squared deviation of the peak frequency values from the straight line defined by the linear regression.
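The five linearity outputs follow from a standard least-squares fit over the buffer, which can be sketched as:

```python
import math

def peak_linearity(freqs):
    """Algorithm 5 sketch: fit y = m*x + c by least squares to the
    buffer of first peak frequencies (x = buffer position) and return
    the five outputs described in the text."""
    n = len(freqs)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(freqs) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, freqs))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in freqs)
    std = math.sqrt(syy / n)                            # output 1: std
    m = sxy / sxx                                       # output 2: gradient m
    c = my - m * mx                                     # output 3: intercept c
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0  # output 4: correlation
    msd = sum((y - (m * x + c)) ** 2
              for x, y in zip(xs, freqs)) / n           # output 5: mean sq. dev.
    return std, m, c, r, msd
```

A swept carrier moves its peak linearly in time, so it yields a correlation near 1 and a near-zero mean squared deviation.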
6. Change in first two peak frequencies, ordered. This algorithm provided two outputs. The first and second peak frequencies were located as for previously described algorithms, so the first peak frequency corresponded to the bin with the highest power. These two frequencies were then ordered so that the first peak frequency (ordered) was at the lowest frequency. A buffer of the first two peak frequencies (ordered) for the last 20 FFTs was created.
The difference between the first peak frequency (ordered) between consecutive FFTs was calculated. These values were summed to become the first algorithm output. The same calculation was performed for the second peak frequency (ordered) to give the second output.
7. FFT bin amplitude ratio. This algorithm provided four outputs. The 64 FFT bins were ordered in terms of their power content, bin number one being most powerful. The first output was the sum of bins 1 to 4 divided by the sum of bins 5 to 8. The second output was the sum of bins 1 to 8 divided by the sum of bins 9 to 16.
The third output was the sum of bins 1 to 16 divided by the sum of bins 17 to 32. The fourth output was the sum of bins 9 to 16 divided by the sum of bins 25 to 40.
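The four ratios can be sketched directly from the power-ranked bins:

```python
def amplitude_ratios(bins):
    """Algorithm 7 sketch: rank the 64 bins by power (rank 1 = most
    powerful) and form the four sum ratios described in the text."""
    ranked = sorted(bins, reverse=True)

    def s(a, b):
        # sum of the bins ranked a..b inclusive (1-indexed)
        return sum(ranked[a - 1:b])

    return (s(1, 4) / s(5, 8),
            s(1, 8) / s(9, 16),
            s(1, 16) / s(17, 32),
            s(9, 16) / s(25, 40))
```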
8. Change in signal amplitude. This algorithm produced five outputs that were all related to a measure of the amplitude of the signal. The first output was the normalised signal amplitude and the other four were measures of the change in amplitude over time. The current signal amplitude was saved in a buffer for the last 20 FFTs.
The current amplitude was normalised by dividing by the mean amplitude for the last 20 FFTs and converted to dB. This was scaled so that a constant amplitude gave an output of 50. This normalised amplitude was the first output of the algorithm. The difference between the current amplitude and the amplitude of the last FFT was calculated by taking the difference between the two values and dividing by the sum of the two. The last 20 values of this measure were stored in a second buffer. The second algorithm output was the mean of these 20 values. The third output was the standard deviation of the 20 amplitude values stored in the first buffer. From the first algorithm output, the normalised amplitude, a third buffer of digital amplitude was created. If the normalised amplitude was less than 25 then the digital amplitude was zero; if the normalised amplitude was greater than 50 then the digital amplitude was set to one; otherwise the digital amplitude was set to the value for the previous FFT. From the digital amplitude buffer, the number of transitions from 0 to 1, or 1 to 0, was summed for the last 20 FFTs. This value provided the fourth algorithm output. From the digital amplitude buffer, the percentage of the time that the digital amplitude was zero was calculated for the last 20 FFTs. This value was the fifth algorithm output.
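The hysteresis thresholding and the fourth and fifth outputs can be sketched as below; the initial digital-amplitude state is assumed to be 0, which the patent does not specify:

```python
def digital_amplitude_outputs(norm_amps):
    """Algorithm 8 sketch (outputs 4 and 5): threshold the normalised
    amplitude with hysteresis (0 below 25, 1 above 50, otherwise hold
    the previous value; initial state assumed 0), then count 0/1
    transitions and the percentage of zeros over the buffer."""
    state, states = 0, []
    for a in norm_amps:
        if a < 25:
            state = 0
        elif a > 50:
            state = 1
        states.append(state)
    transitions = sum(1 for p, q in zip(states, states[1:]) if p != q)
    pct_zero = 100.0 * states.count(0) / len(states)
    return transitions, pct_zero
```

Many transitions with a substantial percentage of zeros is characteristic of the syllabic amplitude modulation of speech.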
9. Tone detection algorithms. This algorithm provided four outputs. Each worked by finding the peak value in the FFT and measuring the number of times that it occurred in the same position in the FFT, i.e. how tone-like the signal is. The bin in the FFT with the largest power was located and became the current peak position. The peak position was considered to be "consistent" if it varied by one bin or less. The first algorithm output was increased by 5 if two consecutive FFT peaks were consistent, else it was reduced by 30. The next output was designed to detect intermittent tones. If 7 consecutive FFT peaks were consistent then this was recognised as a tone. If any future FFT peaks were consistent then the output was increased by 4, else it was decreased by 7. The next output was designed to detect two tones. If 3 consecutive FFT peaks were consistent then this was recognised as a tone. The last two tones were remembered. If the current peak was consistent with either remembered tone, then the output was increased by 3, else it was decreased by 5. The fourth output was a sensitive tone detector. If the current FFT peak was consistent with the last FFT peak then the output was increased by 10, else it was decreased by 6.
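The first of the four tone scores can be sketched as a running counter. Clamping the score at zero is an assumption; the patent does not state the score's range:

```python
def tone_score(peak_bins, up=5, down=30):
    """Algorithm 9 sketch (first output): raise a running score by
    `up` when consecutive FFT peak positions are consistent (within
    one bin) and lower it by `down` otherwise. Clamping at zero is
    an assumption."""
    score = 0
    for prev, cur in zip(peak_bins, peak_bins[1:]):
        if abs(cur - prev) <= 1:
            score += up
        else:
            score = max(0, score - down)
    return score
```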
10. FFT compression. This algorithm provided 15 outputs which were a compressed version of the 64 bin FFT. Considering the 64 bins, bin one being the lowest frequency, they were compressed to 15 values by taking the mean of the following groups of bins: (3,4) (5,6) (7,8) (9,10) (11,12) (13,14) (15,16) (17-19) (20-22) (23-25) (26-30) (31-35) (36-40) (41-48) (49-64). The outputs were then converted to a logarithmic scale in relation to the highest output, which was set to 100.
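The grouping can be sketched as below. The exact logarithmic scaling is an assumption (the text says only "logarithmic", with the highest output set to 100); a 10*log10 relative scale is used here:

```python
import math

# Bin groups (1-indexed) taken from the text; bins 1 and 2 are unused.
GROUPS = [(3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14), (15, 16),
          (17, 19), (20, 22), (23, 25), (26, 30), (31, 35), (36, 40),
          (41, 48), (49, 64)]

def compress_fft(bins):
    """Algorithm 10 sketch: average the 64 bins into 15 groups, then
    place each group on a log scale relative to the largest group,
    which maps to 100. The 10*log10 scaling is an assumption."""
    means = [sum(bins[a - 1:b]) / (b - a + 1) for a, b in GROUPS]
    peak = max(means)
    return [100 + 10 * math.log10(m / peak) for m in means]
```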
The above time domain algorithms are based on known speech detection algorithms designed for use in non-noisy environments. The frequency domain algorithms were all developed during the course of the development of the device described with reference to Fig. 3.
Some of the frequency domain algorithms are intended to measure parameters directly indicative of speech, while others measure features contained in particular forms of interference. There are also algorithms that measure parameters which do not, on their own, give much confidence of the presence of speech but provide useful information when combined with the other algorithm outputs.
Each algorithm on its own would have a limited probability of success at detecting speech and rejecting noise and interference.
However, the combination of the strengths of all the algorithms yields a system which can be usefully employed as an HF squelch device.
This combination is performed by the neural network 5.
Each algorithm generates an output value which can be filtered and multiplied by a variable factor. The value is then restricted to the range 0 to 100 before being applied as an input to the neural network 5. The multiplication factor is used to ensure that the full range of the neural network input is used. It could also be used to bias the network towards the sorts of signals an algorithm was most efficient at discriminating.
Thus there are 47 outputs from the algorithm processor 4 and correspondingly 47 inputs to the neural network 5. The neural network has three outputs, one each to indicate the detection of speech, tones (including intermittent and multiple tones) and sweepers (signals sweeping either up or down the audio frequency range). The neural network was created using the commercially available package "NeuralWorks". The "BackProp Builder" facility was used to create a back propagation network with 47 inputs, 14 hidden nodes and 3 outputs. The following options were selected: Momentum = 0.4, Learning Coefficient Ratio = 0.5, 1st Transition Point = 10,000, Learning Rule = Normalised-Cumulative Delta-Rule, Transfer Function = Hyperbolic Tangent. A bias node was connected to all the hidden and output nodes and the network was trained for 80,000 events. All other options were left at their default values. The system described with reference to Fig. 3 was shown in tests to reliably detect speech signals from HF receivers in 300ms at signal to noise levels of 6dB.
Filtering of each algorithm output was found to be useful because "continuous" speech is punctuated by gaps, for example of up to 200ms, which separate speech "spurts". This is a characteristic of normal speech and presents a problem: if the neural network is to be trained to detect speech only during speech spurts, then it would be necessary to label each speech spurt accurately in time. It is far less time consuming just to label each block of continuous speech. One method which may be used to overcome this problem is to pass each algorithm output through an infinite impulse response (IIR) filter to smooth the algorithm output so that it effectively averages across speech spurts and gaps. Such a filter also proved useful in removing spikes from algorithm outputs which tended to hamper successful neural network training.
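A minimal sketch of such smoothing is a one-pole IIR low-pass filter. The patent specifies neither the filter order nor its coefficients, so the single-pole form and the coefficient below are assumptions:

```python
def smooth(values, alpha=0.05):
    """One-pole IIR low-pass y[n] = (1 - alpha)*y[n-1] + alpha*x[n],
    applied to an algorithm-output sequence. The filter form and
    alpha are assumptions."""
    y, out = 0.0, []
    for x in values:
        y = (1 - alpha) * y + alpha * x
        out.append(y)
    return out
```

With a small alpha the output rides across short gaps between speech spurts and suppresses isolated spikes.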
The neural network 5 was used to optimally amalgamate the outputs of all the algorithms to give a confidence level of the presence of speech. A neural network was used since it could cope with large numbers of algorithm outputs at a time and is extremely quick to train. The training can be considered as the process of assigning each algorithm output (i.e. neural network input) an appropriate contribution to the overall decision on the presence of speech.
A back-propagation neural network can function in two modes, that is training and evaluation. When used to evaluate, the neural network takes all the algorithm outputs as inputs and calculates an output probability of the presence of speech. During training, many training cases are presented to the network from which it learns.
Each training case consists of all the input values to the network along with the known output. The neural network first evaluates using all the inputs to get an output, compares this output with the known output from the training case, and then adjusts itself to reduce the error between the two outputs. This adjustment is achieved by adjusting the internal weight values of the network. Hence, to train a back propagation neural network, a number of training cases is required. Once trained, however, the "knowledge" gained by the neural network is permanently stored in the weights.
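The evaluate/compare/adjust cycle can be illustrated with a single sigmoid unit in place of the full 47-14-3 network. This is an illustration of the principle only, not the NeuralWorks implementation; the learning rate is an assumption:

```python
import math

def train_step(w, b, inputs, target, lr=0.5):
    """One evaluate/compare/adjust cycle for a single sigmoid unit
    (illustration only; the real system used a 47-14-3 back
    propagation network)."""
    z = sum(wi * x for wi, x in zip(w, inputs)) + b
    act = 1.0 / (1.0 + math.exp(-z))          # evaluate
    err = target - act                        # compare with known output
    grad = err * act * (1.0 - act)            # sigmoid derivative term
    w = [wi + lr * grad * x for wi, x in zip(w, inputs)]  # adjust weights
    b = b + lr * grad
    return w, b, err
```

Repeating the step on a training case drives the error between the evaluated and known outputs towards zero.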
Training cases were generated by collecting audio signals containing a variety of speech, noise and interference signals and labelling each section as either speech, tone or sweeper. If none of these signals was present then all three outputs were trained to give the value zero. If one of the signals was present then the neural network was trained to give an output of 100 on the corresponding output while giving 0 on the other two outputs. Once the neural network was trained, the neural network configuration and the weights were downloaded to an operational system. That operational system then ran the algorithms and the neural network in evaluation mode, and the post-processing stage controlled the receiver audio output.
Thus the described system provides a speech detection device which functions reliably in very noisy environments, for example HF communication environments. Although the novel feature extraction algorithms described above have in practice proved highly effective, it will be appreciated that alternative feature extraction algorithms could be devised to provide an effective performance. The high efficiency of the described system results from the combination of the variety of algorithm outputs in a neural network.
Claims (10)
1. A speech detection system for detecting the presence of speech in an audio input signal derived from a predetermined environment, comprising means for converting the audio signal to a digital signal, means for analysing characteristics of the digital signal using each of a plurality of algorithms selected to discriminate between speech and other forms of signal, the analysing means producing a plurality of algorithm outputs, and a neural network having a plurality of inputs to each of which a respective algorithm output is applied, the neural network having been trained using training signals derived from the predetermined environment to provide an output indicative of the presence of speech in the audio input signal.
2. A speech detection system according to claim 1, wherein the neural network has been trained by back propagation.
3. A speech detection system according to claim 1 or 2, wherein the predetermined environment is an HF communications system.
4. A speech detection system according to claim 1, 2 or 3, wherein the audio signal is applied to the converting means through a low pass filter.
5. A speech detection system according to any preceding claim, wherein the neural network provides an output to control a switching device to deliver the audio input signal to an output terminal.
6. A speech detection system according to claim 5, wherein a delay buffer is connected to the output of the audio signal converting means, and the output of the delay buffer is connected to the switching device by a digital to analogue converter, the delay buffer storing the output of the audio signal converting means for a period sufficient to enable the detection of speech by the neural network.
7. A speech detection system according to any preceding claim, comprising a fast fourier transform circuit connected to the output of the audio signal converting means, the outputs of the audio signal converting means and the fast fourier transform circuit being applied to the analysing means.
8. A speech detection system according to any preceding claim, wherein the neural network is a three layer device.
9. A speech detection system according to claim 8, wherein the
neural network provides three outputs respectively of the detection
of speech, tones and sweep signals.
10. A speech detection system substantially as hereinbefore
described with reference to Fig. 1, Fig. 2 or Fig. 3 of the accompanying
drawings.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB9312049A GB2278984A (en) | 1993-06-11 | 1993-06-11 | Speech presence detector |
Publications (2)
Publication Number | Publication Date |
---|---|
GB9312049D0 GB9312049D0 (en) | 1993-07-28 |
GB2278984A true GB2278984A (en) | 1994-12-14 |
Family
ID=10736989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB9312049A Withdrawn GB2278984A (en) | 1993-06-11 | 1993-06-11 | Speech presence detector |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2278984A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4560840A (en) * | 1983-03-01 | 1985-12-24 | International Standard Electric Corporation | Digital handsfree telephone |
EP0143161A1 (en) * | 1983-07-08 | 1985-06-05 | International Standard Electric Corporation | Apparatus for automatic speech activity detection |
US4959865A (en) * | 1987-12-21 | 1990-09-25 | The Dsp Group, Inc. | A method for indicating the presence of speech in an audio signal |
EP0435458A1 (en) * | 1989-11-28 | 1991-07-03 | Nec Corporation | Speech/voiceband data discriminator |
WO1991011802A1 (en) * | 1990-01-31 | 1991-08-08 | United States Department Of Energy | Time series association learning |
Non-Patent Citations (1)
Title |
---|
JAPIO abstract of JP010277899A * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1998048407A2 (en) * | 1997-04-18 | 1998-10-29 | Nokia Networks Oy | Speech detection in a telecommunication system |
WO1998048407A3 (en) * | 1997-04-18 | 1999-02-11 | Nokia Telecommunications Oy | Speech detection in a telecommunication system |
AU736133B2 (en) * | 1997-04-18 | 2001-07-26 | Nokia Networks Oy | Speech detection in a telecommunication system |
GB2379148A (en) * | 2001-08-21 | 2003-02-26 | Mitel Knowledge Corp | Voice activity detection |
EP2228910A2 (en) * | 2009-03-13 | 2010-09-15 | EADS Deutschland GmbH | Method for differentiation between noise and useful signals |
EP2228910A3 (en) * | 2009-03-13 | 2011-05-18 | EADS Deutschland GmbH | Method for differentiation between noise and useful signals |
Also Published As
Publication number | Publication date |
---|---|
GB9312049D0 (en) | 1993-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7340375B1 (en) | Method and apparatus for noise floor estimation | |
US4088838A (en) | Voice response system | |
EP0573760B1 (en) | Method for identifying speech and call-progression signals | |
US4843612A (en) | Method for jam-resistant communication transmission | |
US4597107A (en) | Modulation detector and classifier | |
US20070202829A1 (en) | Receiver for Narrowband Interference Cancellation | |
US4719649A (en) | Autoregressive peek-through comjammer and method | |
EP0266962A2 (en) | Voiceband signal classification | |
US10924308B1 (en) | Smart receiver with compressive sensing and machine learning | |
US7515089B2 (en) | Signal analysis | |
Nuttall | Detection performance of power‐law processors for random signals of unknown location, structure, extent, and strength | |
JPH0642658B2 (en) | Digital communication device | |
CA1307342C (en) | Voiceband signal classification | |
CN112003803B (en) | Detection and reception equipment for VHF and UHF band aviation radio station signals | |
US4634966A (en) | Binary processing of sound spectrum | |
GB2278984A (en) | Speech presence detector | |
Lourens | Detection and logging advertisements using its sound | |
EP0343370B1 (en) | Edge effect reduction by smoothing in digital receivers | |
Shahid et al. | Cognitive ale for hf radios | |
CA2298833C (en) | Method and apparatus for noise floor estimation | |
JP3727765B2 (en) | Receiver | |
US5375259A (en) | Analog signal noise reduction process and apparatus | |
US5930308A (en) | Method and apparatus for detecting signaling tones in wide-band digitized cellular-telephone signals | |
EP1520383B1 (en) | Radio frequency data capture | |
Lopez-Risueno et al. | Two digital receivers based on time-frequency analysis for signal interception |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |