CA2286770A1 - Speech detection in a telecommunication system - Google Patents


Info

Publication number
CA2286770A1
Authority
CA
Canada
Prior art keywords
signal
speech
neural network
noise
peak value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002286770A
Other languages
French (fr)
Inventor
Samu Kaajas
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Oyj
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of CA2286770A1 publication Critical patent/CA2286770A1/en
Abandoned legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/03 Techniques characterised by the type of extracted parameters
    • G10L25/06 The extracted parameters being correlation coefficients
    • G10L25/12 The extracted parameters being prediction coefficients
    • G10L25/27 Techniques characterised by the analysis technique
    • G10L25/30 Analysis using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Monitoring And Testing Of Exchanges (AREA)
  • Complex Calculations (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

A method for speech detection in a telecommunication system comprising a signal source and a signal processor for processing a signal to be observed.
In the method, desired identification numbers are determined (107-110) from said signal, the identification numbers are fed into a neural network, output values are calculated (111) for speech and noise neurons on the basis of the identification numbers in the neural network, and a decision is made (104) on whether the signal is speech or noise.

Description

SPEECH DETECTION IN A TELECOMMUNICATION SYSTEM
BACKGROUND OF THE INVENTION
The invention relates to a method for speech detection in a telecommunication system comprising a signal source producing a signal and a signal processor including a neural network for processing said signal, in which method said neural network is trained to distinguish between a speech signal and a noise signal using speech and noise samples.
An important problem in various speech-processing telecommunication systems is speech detection, i.e. classifying a signal as speech or non-speech. Non-speech is generally either silence or background noise, but it may also be an information signal, such as a DTMF (dual tone multi frequency) signal transferred on a telephone channel. The capacity of the classifier is often decisive for the operation of the system.
The classification of a signal as speech or non-speech is, for example, included in the prehandling part of every speech recognition system. In speech recognition it is essential to find the precise starting point of speech sounds, words or sentences to ensure reliable recognition. In speech recording the amount of data to be processed can be significantly reduced by recording only speech-containing data. The load and power consumption of a mobile station or another telecommunication apparatus or system can be reduced if only information corresponding to real speech is coded and transmitted, not unnecessary background noise.
A speech signal is composed of consecutive speech sounds. Pressure pulses deriving from the vocal cords pass through the oral cavity and/or the nasal cavity, forming a speech sound. The shape of the cavities affects the pronunciation result, as does which of the cavities is open, thus distinguishing different speech sounds from one another. The speech sounds can be grouped in various ways, a basic division being into vowels and consonants.
Here background noise refers to all sound signals that are not speech and that will be processed by a speech detector.
In different publications background noise is often modelled as white noise. This accurately models noise occurring in analog mobile or radio telephones, but for modelling background noise occurring in various real-life situations the model is too simple.
Background noise can roughly be divided into two categories. Stationary noise is fairly even, continuous noise that can be caused by, for example, a ventilator, a copying machine, a restaurant environment or traffic. A common feature is the continuity of the noise: observed in the time domain, the noise occurs continuously throughout the entire observation period.
The other type of background noise is dynamic background noise. It is composed of random, heavy noise peaks. For example, all bangs and slams can be classified as dynamic noise. In the time domain noise is seen as fairly short noise sequences having a strong amplitude.
Speech can be distinguished from background noise by determining values for various properties describing the signal. Examples of such properties are the signal energy or spectrum.
In the following, some prior art speech detection methods and systems are described.
US patent 5,611,019 describes a speech detection arrangement. The arrangement comprises a reference model, a parameter extractor for extracting parameters from an input signal and decision-making means for deciding whether the signal is speech or non-speech. The presented solution also includes the idea of training the arrangement to detect and distinguish speech from non-speech.
US patent 5,598,466 describes a method for detecting speech activity in a semi-duplex audio telecommunication system. In this method an average peak value representing the envelope of an audio signal is determined.
US patent 5,596,680 shows a method and an apparatus for detecting speech activity in an input signal. In this method a starting point is detected using a power/zero crossing measure. For detecting the end point of the speech sound, the cepstrum of the signal is determined. When the starting and end points of the speech sound have been determined, vector quantization is used to classify the speech sound as either speech or noise.
US patent 5,572,623 shows a method for speech detection from a signal containing noise. In this method a frame containing sound is detected, noise-containing frames preceding said frame are searched for, and an autoregressive model of the noise and a spectrum of the average noise are formed. Then the frames preceding the sound are whitened using the spectrum, and from these the real starting point of speech is searched for.
US patent 5,276,765 shows a voice activity detector (VAD) to be used in an LPC coder in a mobile system. The solution operates in connection with the LPC coder but does not use LPC coefficients when deciding whether the signal to be observed is speech or noise. Instead, the solution uses autocorrelation coefficients of an input signal, weighted and combined, for classifying the signal parts as speech or, correspondingly, noise.
A speech detector used in present mobile communication systems is shown in the ETSI GSM standard GSM 06.32, v.4.1.0, July 1995, Voice Activity Detection, ETSI.
A method and an arrangement for speech detection are shown in the publication Electronic Design, March 22, 1990, Newman W. C., Detecting Speech with an Adaptive Neural Network, pp. 79-89.
Since speech is composed of very different speech sounds, and background noise can also be very different, it is difficult to find common properties for both background noise groups, i.e. for stationary and dynamic noise.
On this account it is also difficult to make a classification decision.
A problem with prior art solutions is that they generally classify heavy background noise as speech. Classifying dynamic noise in particular, for example bangs, by the simple methods described often produces incorrect results.
A problem particularly in semi-duplex trunking mobile communication systems is that although the mobile stations comprise a push-to-talk (PTT) switch for indicating the start and the end of a speech item, a subscriber talking from another telecommunication system to said semi-duplex mobile communication system cannot inform the mobile communication system when he/she will use his/her speech item. In this case speech detection is required at least for all speech signals arriving from outside said semi-duplex mobile communication system.
BRIEF DESCRIPTION OF THE INVENTION
An object of the invention is to provide a method and an apparatus for speech detection that operate in real time, take their input signal from a telephone channel and are able to distinguish speech from noise, particularly from dynamic noise, as reliably as possible.
The invention relates to a method for speech detection in a telecommunication system, characterized by comprising the following steps:
determining from said signal identification numbers comprising at least the following identification numbers:
- LPC coefficients of said signal,
- a peak value lag of an autocorrelation function of said signal,
- an autocorrelation function peak value of said signal divided by said signal energy, and
- a number of 0-level exceedings in said signal during a determined observation period,
feeding said identification numbers as input vectors into said neural network previously trained to distinguish between speech and noise signals using speech and noise samples,
calculating output values for speech and noise neurons on the basis of the identification numbers included in said input vectors in said neural network,
deciding whether said signal is speech or noise on the basis of said output values.
The main idea of the invention is to employ the neural network in speech detection.
In accordance with the invention the properties determined from the signal to be observed, particularly the LPC coefficients, the autocorrelation function properties and the 0-level exceedings of the signal, are given as input to the neural network making the classification decision. The neural network used in this inventive method is trained to distinguish speech from background noise.
An advantage of the invention is that its neural network can also be trained to distinguish heavy background noise from speech.
Another advantage of the invention is that the neural network is simple and easy to implement, also as a real-time solution, thus being applicable to speech transfer in almost real time, for example in a mobile communication system.
BRIEF DESCRIPTION OF THE DRAWINGS
In the following the invention will be described in greater detail with reference to the accompanying drawings, in which
Figure 1 is a flow chart describing a method of the invention, and
Figure 2 is a block diagram showing a neural network used in the implementation of the invention.

DETAILED DESCRIPTION OF THE INVENTION
In accordance with the invention a neural network is used in the decision-making only when the signal average exceeds a particular threshold value. Thus the performance of the entire speech detection method and apparatus is improved at low noise levels. Autocorrelation is calculated and 0-level exceedings are counted from a 60 ms time window. The LPC coefficients are obtained directly from a TETRA speech codec. Hangover time refers to the time after the last speech classification during which the signal is still classified as speech although the window to be observed no longer includes speech. As a result the weak consonants in the middle and at the end of a sentence are detected.
Autocorrelation

An autocorrelation function of the signal x(n) is determined:

\phi(k) = \sum_{m=-\infty}^{\infty} x(m) x(m+k)    (1)

A local (short-time) autocorrelation function is determined as follows:

R_n(k) = \sum_{m=0}^{N-1-k} x(m) x(m+k) w(n) w(n+k)    (2)

In equation (2), w(n) is a finite-length window function. The parameters are again the window length N and the shape of the window function.
Autocorrelation indicates how the signal correlates with itself. Periodicities can be detected in the signal by means of autocorrelation. Since voiced speech sounds comprise basic (fundamental) frequency sequences, the autocorrelation function should have peaks at intervals corresponding to the basic frequency. The existence of these peaks as well as their location and amplitude are properties by which speech can be distinguished from background noise using the autocorrelation function. Though a telephone channel attenuates frequencies below 300 Hz, the autocorrelation function peaks more probably correspond to the first resonance, or formant, frequency than to the basic frequency.
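To make this concrete, the following minimal NumPy sketch computes the short-time autocorrelation of one analysis frame and inspects its peak. The Hamming window and the synthetic 200 Hz test tone are illustrative choices, not taken from the patent.

```python
import numpy as np

def short_time_autocorrelation(frame, window=None):
    """Short-time autocorrelation R(k), k = 0..N-1, of one windowed
    frame, in the spirit of equation (2)."""
    if window is None:
        window = np.hamming(len(frame))    # window shape is a free parameter
    xw = frame * window
    r = np.correlate(xw, xw, mode="full")  # all lags, negative and positive
    return r[len(xw) - 1:]                 # keep non-negative lags only

# A voiced-like 200 Hz tone sampled at 8 kHz peaks near lag 40 (8000 / 200);
# the frame length 480 matches the 60 ms window used in the patent.
fs, n = 8000, 480
tone = np.sin(2 * np.pi * 200 * np.arange(n) / fs)
r = short_time_autocorrelation(tone)
print("peak lag:", np.argmax(r[9:]) + 9)   # search starts at lag 9, as in the implementation below
```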
Linear prediction and LPC coefficients

A linear predictor is a system that predicts a sample value as a linear combination of previous samples. The output of the system is thus:

\hat{s}(n) = \sum_{k=1}^{p} a_k s(n-k)    (3)

Here p is the order of the predictor and the coefficients a_k are the weighting coefficients, or so-called LPC coefficients, of the previous samples. These coefficients can be solved by minimizing the sum E of the squares of the differences between the real samples and the predicted samples over a finite interval n in accordance with equation (4):

E = \sum_{n=0}^{N-1} \left( s(n) - \sum_{k=1}^{p} a_k s(n-k) \right)^2    (4)

In practice the coefficients are generally solved using an autocorrelation method where the matrix equation is solved using the Levinson-Durbin recursion.
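As an illustration of that recursion, here is a minimal floating-point sketch. Sign conventions for the predictor coefficients vary between texts (equation (7) below uses 1 + \sum a_i z^{-i}), so this is a generic version rather than the codec's exact routine.

```python
import numpy as np

def levinson_durbin(r, p):
    """Solve the LPC normal equations for predictor order p from the
    autocorrelation values r[0..p] using the Levinson-Durbin recursion."""
    a = np.zeros(p + 1)                    # a[1..i] holds the current predictor
    e = r[0]                               # prediction error energy
    for i in range(1, p + 1):
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e   # reflection coefficient
        a_prev = a.copy()
        a[i] = k
        a[1:i] = a_prev[1:i] - k * a_prev[i - 1:0:-1]
        e *= 1.0 - k * k                   # updated error energy
    return a[1:], e

# Example: order-10 LPC from one 480-sample frame, as in the TETRA codec.
x = np.random.randn(480)
r = np.correlate(x, x, mode="full")[479:]
lpc, err = levinson_durbin(r, p=10)
```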
Autocorrelation is the best of the methods studied. Car noise and white noise were distinguished from speech fairly reliably. Cepstrum gives results similar to autocorrelation. LPC was strong on the theatre interval noise sample representing general noise. A bang sample was the most difficult noise sample.
On the basis of what has been presented above, autocorrelation and LPC coefficients are worth taking as the detection methods of the speech detection method and apparatus of the invention.
Neural networks

Neural networks are nonlinear calculation networks which are tuned using training data. The neural network is loosely based on brain function, but is nevertheless inadequate for modelling the real function of the brain. The neural network can be implemented as a neurocomputer, referring to a widely parallel adaptive network consisting of simple calculation elements. Such a network functions like a real biological neural network.
The neural network consists of several calculation elements, or neurons, and the connections between them. A neuron is given one or more input values and based on these values the neuron produces a result that can be forwarded to one or more neurons. The neuron calculates the result as follows:
a = F(w \cdot p + b)    (5)

where a is the result, p is the input vector, w is the weighting coefficient vector, b is a bias and F is the transfer function of the neuron. A dot product is thus calculated between the weighting coefficients and the input, the bias is added thereto, and the result is given as an argument to the transfer function, producing the final result.
The neurons can be combined into neuron layers. A neuron layer comprises one or more neurons. The neurons of a layer all obtain the same input vector and have the same transfer function. The result vector of a neuron layer can be given to another layer as input, in which case a multilayer neural network is obtained.
The neural network has to be trained to solve a given problem. During the training phase the weighting coefficients and the bias values of the network neurons are tuned according to a training algorithm. In the invention some real-life recordings are used to represent background noise. These recordings are chosen so as to represent real situations particularly well and extensively.
After training, the neural network is used by giving the network an input vector for which the network calculates a corresponding output. A properly trained neural network always calculates a similar result for an input vector group possessing certain properties.

The neural network is well applicable to solving the classification problem. In this invention the neural network decides into which category the given input vector belongs, i.e. whether the signal sample to be observed is speech or noise.

A suitably trained neural network succeeds in classifying the signal as speech or background noise. The input brought into the network can be either a preprocessed parameter set or the signal as such. An example of the latter is found in the publication presented above, Electronic Design, March 22, 1990, Newman W.C., Detecting Speech with an Adaptive Neural Network, pp. 79-89, presenting a neural network which is given a 25 ms signal as input, i.e. 250 samples at a 10 kHz sampling frequency.

In the method and apparatus of the invention the LPC coefficients that are calculated from the signal and that describe the signal are given as input to the neural network.

The inventive method and apparatus are designed to be implemented in a digital signal processor. This is why the training data is also assembled using a signal processor. The signal processor calculates the LPC coefficients and the autocorrelation, counts the number of zero exceedings and stores the values in memory.

Eight neurons are selected for the hidden layer of the neural network of the invention. An increase in the number of neurons does not necessarily improve the function of the network. A hyperbolic tangent sigmoid transfer function is chosen as the transfer function of the hidden layer:

F(n) = tanh(n)    (6)

For the output layer of the neural network two neurons are chosen, which are trained in such a way that one reacts to speech and the other to noise. The transfer function of the output layer is linear.
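The following floating-point sketch shows the forward pass of the 8-8-2 network just described. The random weights are placeholders standing in for the trained coefficients, and the fixed-point scaling described later in the text is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 8)), rng.normal(size=8)  # hidden layer, 8 neurons
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)  # output layer: [speech, noise]

def forward(p):
    """p: input vector (LPC1, LPC3, LPC4, LPC5, LPC10, Acor1, Acor2, ZeroR)."""
    h = np.tanh(W1 @ p + b1)   # hidden layer, tanh sigmoid transfer, eq. (6)
    return W2 @ h + b2         # linear output layer, eq. (5) per neuron

speech_out, noise_out = forward(np.zeros(8))
```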
According to the method of the invention, the signal is classified as speech when the output of the neuron tuned for speech is higher than the threshold value and the output of the noise neuron is smaller than that of the speech neuron.
When the number of speech vectors in the training data is increased, the performance of the network improves, since the threshold value can be increased without increasing speech clipping. The erroneous interpretations of noise as speech are thus reduced. By changing the threshold value a compromise can be reached between speech clipping and erroneous speech detection.
The neural network performance at low noise levels is poor. Therefore an observation of the signal amplitude average is added to the method and apparatus of the invention. The neural network is not used in the classification until the signal average exceeds a predetermined threshold value. A quieter signal is classified directly as silence.
In the example case the software implementing the method and apparatus of the invention runs in real time in a digital signal processor. The Texas Instruments TMS320C50 signal processor is used as the processor, being able to receive and transmit audio data sampled at an 8 kHz sampling frequency. The software is implemented in such a way that speech travels as such from the input to the output of the signal processor I/O port only when the algorithm classifies the signal as speech. Otherwise silence is transmitted. When the method and apparatus of the invention are implemented in a telecommunication system the practical solution can naturally be carried out in another way, for example in such a way that white noise or another sound, like an audible tone, can be heard from the speech detector instead of noise.
The method and apparatus of the invention are used for speech detection. The function of the method and apparatus of the invention can be significantly improved by adding a so-called hangover time to them. When the method or the apparatus performs a first non-speech classification after speech detection, the hangover time is started, during which a final classification decision is made, i.e. a decision on whether the signal is speech or noise. It is thus easier to include the weak consonants in the middle of a sentence. A possible hangover time added to the method and apparatuses of the invention and to the algorithms implementing them is 500 ms.

In telecommunication systems it is possible to locate the speech detector in the exchange or transmission parts of the telecommunication system. Naturally the speech detector can also be located in a telecommunication terminal. In mobile communication systems the speech detector of the invention can be located, for example, in a system exchange, a base station controller, a base station or a mobile station, or in several of the above mentioned network elements.

The invention is here described with reference to the mobile communication system in particular. A mobile communication system typically has an exchange connected to other telecommunication and telephone networks using an interface unit. The mobile communication exchange is connected to a fixed telephone network using the interface unit of the fixed network. The speech detector of the invention can be located in this interface unit of the fixed network in particular. Naturally the speech detector can also be located elsewhere in the mobile communication system or in its exchange.
The classification made by the speech detector of the mobile communication system on whether the signal arriving from the telephone line or the speech channel of the fixed telephone network is speech or non-speech is transferred through a central processing unit (CPU) of the interface unit of the fixed telephone network to a call control computer (CCC) of the exchange. The CCC distributes the speech items between the radio telephone subscriber and the fixed network subscriber on the basis of this information.

The speech detection method and apparatus of the invention can be carried out, for example, in the Texas Instruments TMS320C50 (40 MIPS) digital signal processor, whose main task is to convert or code speech arriving from the telephone line and correspondingly decode speech going to the fixed telephone line in the TETRA (Trans European Trunked Radio) mobile communication system.

In the TETRA mobile communication system the speech coding algorithm is based on the ACELP (Algebraic Code-Excited Linear Predictive) coding model. In this model a speech sample frame is synthesized by filtering a suitable excitation sequence with two time-dependent filters. The first filter is a long-term prediction filter by which the periodicities of the speech signal are modelled. The second filter is a short-term prediction filter by which the envelope of the speech spectrum is modelled. The short-term filtering is performed as follows:
H(z) = \frac{1}{A(z)} = \frac{1}{1 + \sum_{i=1}^{p} a_i z^{-i}}    (7)

where a_i are the linear prediction coefficients and p is the order of the predictor, which is 10 in the TETRA codec. The linear prediction analysis is performed at 30 ms intervals using a 32 ms asymmetrical window. The window consists of two Hamming windows of different lengths, the first one comprising 216 samples of the frame to be processed and 40 samples of the next frame.
The LPC coefficients are solved employing an autocorrelation method using the Levinson-Durbin recursion.
The LPC coefficients calculated by the TETRA codec can, in accordance with the invention, be delivered directly as the speech detector input.
When implementing the TETRA codec in practice, speech is always coded two frames at a time, i.e. twice in succession at 60 ms intervals. The LPC coefficients used by the speech detector are calculated on the basis of the first frame in accordance with the asymmetrical window described above. The other speech detector inputs can be determined from a buffered 60 ms window. If the signal were gathered into the buffer over a period longer than 60 ms and the buffered data were given, delayed, to the speech codec, the window length used by the speech detector could be extended.
The combined execution time of the encoder and decoder used in the TETRA mobile communication system has been measured as 27 ms. Since the coder, and at the same time the speech detector, is called at 60 ms intervals, 33 ms at the most remains at the speech detector's disposal. The execution time of the speech detector implemented in a manner according to the invention has been measured as 18 ms, the system thus meeting the real-time requirements.
The speech detector is implemented as a specific program block as a part of the entire digital signal processor program code. The implementation is performed using the assembly language of the TMS320C50 processor.
A subprogram makes all the arrangements for encoding in the TETRA mobile communication system. The subprogram copies the 480 samples to be coded into the buffer for the speech detector. The speech detector is called from the subprogram after the first telecommunication frame (240 samples) is coded. The encoder has then calculated the LPC coefficients corresponding to the first frame needed by the speech detector.

The method of the invention can be carried out as software run in the digital signal processor. The speech detector is in the same processor as the speech codec of the mobile communication system.

Figure 1 shows a flow chart describing the method of the invention.

According to the first embodiment of the invention, the first operation the speech detector performs is to calculate the average of the samples in step 101. The function of the entire speech detection system improves significantly when only a signal whose average exceeds a certain level is given to the neural network for classification. Although the neural network inputs are in principle independent of the signal level, in practice the input values determined from the same speech sample at a different signal level deviate from one another.

This is caused by the inaccuracy of the AD/DA converter biasing used in the arrangement, the effects of the inaccuracy becoming apparent at low signal levels.

The average is calculated in a loop where the sum of the absolute values of all 480 samples fed into the buffer is calculated. This is conveniently done with 32-bit precision, utilizing the 32-bit accumulator buffer of the C50 processor presented above.

The sum obtained is divided by 512 by shifting it to the right. Since the buffer contains 480 samples, this produces 480/512 = 0.9375 times the true average, which is sufficiently close to the real average. In the following step of the method the result of the average calculation is compared (102) to a fixed threshold value, and if the average is smaller (103A) than the threshold value, a jump (103) is performed directly to the decision logic part (104) and the signal is classified as non-speech.
The threshold value is determined in such a manner that noise arriving from a normal speech channel is removed, and the threshold value is set as high as possible without cutting off the speech signal. Then the hangover time can be increased in the next step (105) of the method. If the average instead exceeds (106) the threshold value, the process continues with the neural network part.
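Expressed as a sketch, with the threshold left as a tuning parameter since the patent does not give its value:

```python
def average_gate(samples, threshold):
    """Steps 101-103: sum of the absolute values of the 480 buffered
    integer samples, divided by 512 with a 9-bit right shift (giving
    0.9375 times the true average). Returns True when the frame should
    go on to the neural network part."""
    total = sum(abs(int(v)) for v in samples)
    average = total >> 9              # divide by 512 instead of 480
    return average > threshold        # False: classify directly as non-speech
```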

Alternatively, in the second embodiment of the invention, said average calculation and result comparison are replaced with a prehandling part whose function is to remove the DC offset from the signal, i.e. to center the signal as closely around the 0-level as possible. This is carried out by filtering the signal with a high-pass filter whose cut-off frequency is very low. The filtering algorithm in the time domain is the following:

s'(n) = s(n) - s(n-1) + \alpha \, s'(n-1)    (8)

where s(n) is the original signal and s'(n) is the filtered signal, with \alpha determining the cut-off frequency.
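A direct transcription of equation (8) into Python; the default alpha value is illustrative only, chosen to give a very low cut-off frequency.

```python
def remove_dc(signal, alpha=0.99):
    """High-pass prefilter of equation (8):
    s'(n) = s(n) - s(n-1) + alpha * s'(n-1)."""
    out, prev_in, prev_out = [], 0.0, 0.0
    for s in signal:
        y = s - prev_in + alpha * prev_out
        out.append(y)
        prev_in, prev_out = s, y
    return out
```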
After both embodiments the process continues by calculating (107) the autocorrelation function of the signal to be observed. Before calculating the autocorrelation, the LPC coefficients of the signal are determined (106A). In the implementation of the invention the values of the LPC coefficients, for example coefficients 1, 3, 4, 5 and 10 calculated by the speech encoder of the TETRA mobile communication system, are copied into the input buffer of the neural network. The encoder stores the coefficients LPC1, LPC3, LPC4, LPC5 and LPC10 calculated on the basis of the first frame in memory, from where they are transferred to be used by the neural network.
In the method of the invention autocorrelation is calculated (107) using two nested loops. In the inner loop the actual sum of products is calculated, and the outer loop sets the indexings in place. The outer loop sets two C50 processor auxiliary registers indicating the beginning of the sample buffer and a position preceding the beginning of the buffer by the lag to be calculated, corresponding to the value k in equation (1). The lag is the value of the autocorrelation function argument, i.e. k, described in equations (1) and (2). In addition, the number of repetitions of both the outer and inner loops is stored in one auxiliary register. The inner loop is repeated 480 times when the first lag is being calculated, 479 times when the second lag is being calculated, 478 times for the third, etc. The outer loop stores the result calculated by the inner loop in the autocorrelation buffer. The inner loop includes three commands. The first one performs the multiplication, the second adds the result into the sum and loads the multiplication register for the following multiplication. The third command is a NOP (no operation), by which the loop is padded to the minimum length of three commands required by the C50. The multiplication result is shifted 6 bits to the right, by which 128 product sum operations can be performed without the risk of overflow. There is a small risk of overflow when the lags in the beginning are calculated, but the on-average low signal levels and the use of saturation arithmetic further reduce the risk. Thus, two more bits of calculation precision are obtained compared to a scaling that is totally without risk.
After calculating (107) the autocorrelation, the highest peak of the autocorrelation function is searched for (108), i.e. the second highest buffer value after the value corresponding to lag 0. In practice some values immediately succeeding lag 0 are also high; therefore the search for a peak value is not started until lag 9. The entire autocorrelation buffer is gone through in the loop and the memory address of the highest value is stored. When the start address of the buffer is subtracted from said address, the lag corresponding to the peak value is obtained, and this information Acor1 is copied into the neural network input buffer.
As a fast calculation operation connected with the autocorrelation, the peak value is divided (109) by the energy, i.e. by the lag 0 value. This is performed using the conditional subtraction command of the C50 processor. The result obtained, Acor2, is transferred to the neural network input buffer.
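In high-level form, steps 108 and 109 reduce to the following sketch, with the DSP address arithmetic replaced by array indexing:

```python
import numpy as np

def autocorrelation_features(r):
    """Steps 108-109: Acor1 is the lag of the highest autocorrelation
    peak, searched from lag 9 onwards because the first lags are still
    dominated by the lag-0 maximum; Acor2 is that peak value divided
    by the signal energy r[0]."""
    acor1 = int(np.argmax(r[9:])) + 9
    acor2 = r[acor1] / r[0]
    return acor1, acor2
```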
Hereafter the 0-level exceedings in the signal are counted in the method of the invention. When counting (110) the 0-level exceedings, the signs of the previous and the following sample are stored in an accumulator and an accumulator buffer. These are compared by summing the signs and by comparing with a bit mask. If the signs differ, the counter in the auxiliary register is increased by one. The entire sample buffer is thus gone through in the loop. The result is transferred to the neural network input buffer.
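The same counting loop, as a sketch without the accumulator and bit-mask mechanics of the C50:

```python
def count_zero_crossings(samples):
    """Step 110: count sign changes between consecutive samples in the
    60 ms observation buffer."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev >= 0) != (cur >= 0):  # signs differ: one 0-level exceeding
            count += 1
    return count
```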
Figure 2 shows a block diagram of the neural network to be used in the implementation of the invention. The figure shows a vector 201 fed to the network, the vector 201 comprising the five LPC coefficients calculated above (LPC1, LPC3, LPC4, LPC5, LPC10), the two autocorrelation values Acor1 and Acor2 mentioned above and the number ZeroR of 0-level exceedings in the signal. The figure further shows a hidden layer 202 comprising eight neurons 203, 204, 205, 206, 207, 208, 209, 210 and an output layer 211 with two neurons, a speech neuron 212 and a noise neuron 213. Finally the block diagram shows a decision logic 214 of the invention interpreting the values of the output layer 211.
The neural network of Figure 2, in the method shown in Figure 1, calculates (111) on the basis of the input values the output values for the speech 212 and noise 213 neurons, on the basis of which the decision logic 214 makes (104) a classification decision on the signal to be processed.
The operating principle of the neural network is the following. Only a dot product according to equation (5), which is easy to implement with the C50 processor MAC command (multiply and accumulate), is calculated in all neurons. Fixed-point calculation then presents a problem. The weighting coefficients obtained during the training phase of the neural network vary over a wide range; typically the weighting coefficients of a neuron can be between 0.05 and 90. The coefficients have to be scaled neuron-specifically to maintain the best possible calculation precision. The intermediate results of the hidden layer 202 can also vary over a wide range.
A hidden layer 202 neuron first calculates the product sum between the weighting coefficients and the input values of the neuron. The coefficients are scaled between [-0.25, 0.25] by dividing by a suitable power of two that is higher than the highest coefficient. An additional division by four is performed to reduce the risk of overflow of the MAC operation. A bias is added to the sum in the MAC loop; the bias is multiplied by one and added to the sum as the last component. The sum obtained is then shifted to the left so that the result corresponds to the result calculated with unscaled weighting coefficients. The upper accumulator then comprises the fractional part of the result and the lower accumulator the integral part. Sign bits are set and trash bits are removed using bit masks. The 32-bit end result is stored in the hidden layer 202 buffer.
All eight neurons 203-210 of the hidden layer 202 are implemented in the manner described above. The weighting coefficient scaling causes neuron-specific differences affecting the amount of shifting and the bit masks.
The results in the hidden layer buffer should after this be given as arguments to the tangent sigmoid transfer function presented above in equation (6). In practice the only sensible way to perform the transfer function is to tabulate function values. Here a compromise has to be reached between the memory space needed and the table accuracy. In the implementation, 640 values corresponding to the function values in the argument range [0, 2.5] were chosen to be tabulated. Negative values are obtained as complements of the tabulated values. If the argument is higher than 2.5, the function value is already so close to one that it is approximated by one. In the implementation an index for the table search is formed from a 32-bit result including the integral part and the fraction. If the index exceeds the end address of the table, the search result is set to one. If the starting value was negative, the search is performed with the corresponding positive value and the searched value is negated. The table search is performed in the loop for all neuron 203-210 results of the hidden layer 202. The results are stored in the hidden layer buffer. Since the results are between [-1, 1], they can now be represented as 16 bits and are stored in every second memory address in the hidden layer 202 buffer.
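A floating-point model of this tabulated transfer function; the index mapping is a simplification of the DSP's address arithmetic.

```python
import numpy as np

TABLE_SIZE, TABLE_MAX = 640, 2.5
TANH_TABLE = np.tanh(np.linspace(0.0, TABLE_MAX, TABLE_SIZE))

def tanh_lookup(x):
    """Tabulated tanh: 640 entries over [0, 2.5], odd symmetry for
    negative arguments, saturation to 1 beyond the table."""
    idx = int(abs(x) / TABLE_MAX * (TABLE_SIZE - 1))
    value = 1.0 if idx >= TABLE_SIZE else TANH_TABLE[idx]
    return -value if x < 0 else value
```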

The implementation of the speech 212 and noise 213 neurons of the output layer 211 is similar to that of the hidden layer neurons, but somewhat simpler. Since the output values are determined in the network training data as either zero or one, the output value varies in practice between [-0.1, 1.1]. Values higher than one can be saturated to one, and thus the results of the output neurons 212, 213 can be represented as 16 bits. This 16-bit result is ready in the upper accumulator of the processor, and thus the setting of sign bits using bit masks can be avoided. The end results are stored in the output buffer. Since the transfer function of the output layer 211 is linear, it is not necessary to perform an actual transfer function calculation.

The decision logic 214 reads the output value of the speech neuron 212. If the value is smaller than the threshold value, the frame to be processed is classified as non-speech. If the value is higher, the speech neuron 212 output is compared with the noise neuron 213 output. If the noise neuron 213 value is higher, the signal is classified as non-speech, otherwise as speech. If the program execution arrives (103) at the decision logic block 214 directly from the average calculation block 101, the signal is directly classified as non-speech.
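The same decision logic as a sketch; the threshold is the unspecified tuning parameter discussed earlier.

```python
def classify(speech_out, noise_out, threshold, gated_quiet=False):
    """Decision logic 214. gated_quiet marks a frame that arrived here
    directly from the average calculation (jump 103)."""
    if gated_quiet or speech_out < threshold:
        return "non-speech"
    return "non-speech" if noise_out > speech_out else "speech"
```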

A hangover time increase block 105 shown in Figure 1 is the last part of the speech detector. A signal classified as speech initializes a hang_over variable adjusting the hangover time to one, and a speech_flag variable serving as the speech detector output is set to the value 0x1111 corresponding to speech. If a frame is classified as non-speech, the value of the hang_over variable is checked. If the value is zero or the maximum value, the hang_over variable is set to zero and the value 0x0000 corresponding to non-speech is set for the speech_flag variable. Otherwise the hang_over variable is increased by one and the speech_flag is set to the value 0x1111.

When the speech detector is started, the speech_flag and hang_over variables are set to zero. The hangover time can be changed via the maximum value of the hang_over variable. The value 9 is used in the implementation, in which case the hangover time after the end of speech is 480 ms (8 * 60 ms). The hangover time can thus be changed in 60 ms steps. After an increase in hangover time the speech detector subprogram exits.
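The hangover behaviour can be modelled with the following small state machine, using the flag values and maximum count given above.

```python
SPEECH, NON_SPEECH = 0x1111, 0x0000
HANG_OVER_MAX = 9                      # gives 8 * 60 ms = 480 ms of hangover

class HangoverFilter:
    """Block 105: keep reporting speech for a while after the last
    frame that was classified as speech."""
    def __init__(self):
        self.hang_over = 0
        self.speech_flag = NON_SPEECH

    def update(self, frame_is_speech):
        if frame_is_speech:
            self.hang_over = 1
            self.speech_flag = SPEECH
        elif self.hang_over == 0 or self.hang_over >= HANG_OVER_MAX:
            self.hang_over = 0
            self.speech_flag = NON_SPEECH
        else:
            self.hang_over += 1        # still within the hangover time
            self.speech_flag = SPEECH
        return self.speech_flag
```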

The speech detector needs approximately 1 kiloword of memory for variables and buffers, of which the sample buffer and the autocorrelation buffer take up the greatest part. Data tables are also needed. They include the neural network weighting coefficient table and the tangent sigmoid function memory table. They are located in program memory and their combined size is 730 words. The total memory need of the speech detector is thus approximately 1.7 kilowords.
On the basis of simulation results the best speech detector is a solution based on a neural network using autocorrelation function properties and LPC coefficients as input. The performance of the speech detector is ultimately determined by the neural network training. The best results are naturally achieved with particularly extensive training material. The neural network is able to make very reliable classification decisions if the background noise is of the type used in training.
It is clear that the parameters of the method and apparatus, for example the number of neurons and the length of the signal window, can be changed and thus affect the function of the speech detector.
It is obvious to those skilled in the art that as technology progresses the basic idea of the invention can be implemented in various ways.
The invention and its embodiments are thus not restricted to the examples described above but can vary within the scope of the claims.

Claims (6)

1. A method for speech detection in a telecommunication system comprising a signal source producing a signal, a signal processor including a neural network for processing said signal, in which method said neural network is trained to distinguish between a speech signal and a noise signal using speech and noise samples, characterized by the method comprising the following steps:
determining (107-110) from said signal identification numbers comprising at least the following identification numbers:
- LPC coefficients (LPC1, LPC3, LPC4, LPC5, LPC10) of said signal,
- a peak value lag (Acor1) of an autocorrelation function of said signal,
- an autocorrelation function peak value of said signal divided by said signal energy (Acor2), and
- a number (ZeroR) of 0-level exceedings in said signal during a determined observation period,
feeding said identification numbers as input vectors into said neural network (Figure 2) previously trained to distinguish between speech and noise signals using speech and noise samples,
calculating (111) output values for speech (212) and noise (213) neurons on the basis of the identification numbers included in said input vectors in said neural network,
deciding whether said signal is speech or noise on the basis of said output values.
2. A method as claimed in claim 1, characterized by said neural network input layer (201) and hidden layer (202) comprising eight neurons (203-210).
3. A method as claimed in claim 1, characterized by said neural network output layer comprising one speech neuron (212) and one noise neuron (213).
4. A method as claimed in claim 1, characterized by a decision logic (214) located after the neural network making said decision (104) on whether said signal is speech or noise.
5. A method as claimed in claim 1, characterized by comprising the following steps:
a) calculating (101) an amplitude average of the signal to be processed,
b) comparing (102) said average to a predetermined threshold value,
c) classifying (104) said signal as non-speech on the basis of said comparison if said average is smaller (103A) than said threshold value,
d) proceeding to process said signal on the basis of said comparison if said average is higher (105) than said threshold value, whereby
e) determining (106A) the predetermined LPC coefficients of said signal,
f) feeding LPC coefficient values into a neural network input buffer,
g) calculating (107) the autocorrelation function of said signal,
h) searching for (108) the highest peak value of the autocorrelation function,
i) subtracting a buffer starting address from said peak address, whereby a lag corresponding to the peak value is obtained,
j) feeding the lag corresponding to said peak value into said neural network input buffer,
k) dividing said peak value by said signal energy and obtaining a quotient,
l) feeding said quotient into said neural network input buffer,
m) counting (110) a number of 0-level exceedings in said signal during a determined observation period,
n) feeding said number into said neural network input buffer,
o) performing (111) said calculation in said neural network, and
p) making said decision (104) on whether said signal is speech or noise.
6. A method as claimed in claim 1, characterized by comprising the following steps:
a) filtering the signal to be processed by a high-pass filter to remove a DC offset of the signal,
b) determining (106A) the predetermined LPC coefficients of said signal,
c) feeding LPC coefficient values into a neural network input buffer,
d) calculating (107) the autocorrelation function of said signal,
e) searching for (108) a highest peak value of the autocorrelation function,
f) subtracting a buffer starting address from said peak address, whereby a lag corresponding to the peak value is obtained,
g) feeding a lag corresponding to said peak value to said neural network input buffer,
h) dividing said peak value by said signal energy, obtaining a quotient,
i) feeding said quotient into said neural network input buffer,
j) counting (110) a number of 0-level exceedings in said signal during a determined observation period,
k) feeding said number into said neural network input buffer,
l) performing (111) said calculation in said neural network, and
m) making said decision (104) on whether said signal is speech or noise.
CA002286770A 1997-04-18 1998-04-17 Speech detection in a telecommunication system Abandoned CA2286770A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
FI971679A FI971679A (en) 1997-04-18 1997-04-18 Detection of speech in a telecommunication system
FI971679 1997-04-18
PCT/FI1998/000345 WO1998048407A2 (en) 1997-04-18 1998-04-17 Speech detection in a telecommunication system

Publications (1)

Publication Number Publication Date
CA2286770A1 true CA2286770A1 (en) 1998-10-29

Family

ID=8548676

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002286770A Abandoned CA2286770A1 (en) 1997-04-18 1998-04-17 Speech detection in a telecommunication system

Country Status (6)

Country Link
EP (1) EP0976124A2 (en)
AU (1) AU736133B2 (en)
CA (1) CA2286770A1 (en)
FI (1) FI971679A (en)
NZ (1) NZ500272A (en)
WO (1) WO1998048407A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10381020B2 (en) 2017-06-16 2019-08-13 Apple Inc. Speech model-based neural network-assisted signal enhancement

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5276765A (en) * 1988-03-11 1994-01-04 British Telecommunications Public Limited Company Voice activity detection
JP2776848B2 (en) * 1988-12-14 1998-07-16 株式会社日立製作所 Denoising method, neural network learning method used for it
JPH03111898A (en) * 1989-09-26 1991-05-13 Sekisui Chem Co Ltd Voice detection system
JP2643593B2 (en) * 1989-11-28 1997-08-20 日本電気株式会社 Voice / modem signal identification circuit
IT1270438B (en) * 1993-06-10 1997-05-05 Sip PROCEDURE AND DEVICE FOR THE DETERMINATION OF THE FUNDAMENTAL TONE PERIOD AND THE CLASSIFICATION OF THE VOICE SIGNAL IN NUMERICAL CODERS OF THE VOICE
GB2278984A (en) * 1993-06-11 1994-12-14 Redifon Technology Limited Speech presence detector

Also Published As

Publication number Publication date
AU736133B2 (en) 2001-07-26
FI971679A (en) 1998-10-19
NZ500272A (en) 2001-03-30
EP0976124A2 (en) 2000-02-02
FI971679A0 (en) 1997-04-18
WO1998048407A2 (en) 1998-10-29
AU7045398A (en) 1998-11-13
WO1998048407A3 (en) 1999-02-11

Similar Documents

Publication Publication Date Title
Chu Speech coding algorithms: foundation and evolution of standardized coders
US5579435A (en) Discriminating between stationary and non-stationary signals
Bradbury Linear predictive coding
CA2122575C (en) Speaker independent isolated word recognition system using neural networks
CN108597496A (en) Voice generation method and device based on generation type countermeasure network
JPH0816187A (en) Speech recognition method in speech analysis
CN106409310A (en) Audio signal classification method and device
CN102089803A (en) Method and discriminator for classifying different segments of a signal
Latorre et al. Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?
CN111696580A (en) Voice detection method and device, electronic equipment and storage medium
CN113889090A (en) Multi-language recognition model construction and training method based on multi-task learning
CN101256772B (en) Method and device for determining attribution class of non-noise audio signal
Wu et al. Fully vector-quantized neural network-based code-excited nonlinear predictive speech coding
US5579432A (en) Discriminating between stationary and non-stationary signals
Rajesh Kumar et al. Optimization-enabled deep convolutional network for the generation of normal speech from non-audible murmur based on multi-kernel-based features
Bäckström et al. Voice activity detection
AU736133B2 (en) Speech detection in a telecommunication system
Chetouani et al. Neural predictive coding for speech discriminant feature extraction: The DFE-NPC.
CN116230018A (en) Synthetic voice quality evaluation method for voice synthesis system
Schwartz et al. Does the human auditory system include large scale spectral integration?
Wang et al. Phonetic segmentation for low rate speech coding
JP3183072B2 (en) Audio coding device
Hamandouche Speech Detection for noisy audio files
Chetouani et al. Discriminative training for neural predictive coding applied to speech features extraction
JPH04115299A (en) Method and device for voiced/voiceless sound decision making

Legal Events

Date Code Title Description
FZDE Discontinued