GB2456297A - Impulsive shock detection and removal - Google Patents

Impulsive shock detection and removal

Info

Publication number
GB2456297A
GB2456297A
Authority
GB
United Kingdom
Prior art keywords
noise
audio signal
impulsive
shock
signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0723915A
Other versions
GB0723915D0 (en)
Inventor
Amir Nooralahiyan
Hamid Sepehr
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to GB0723915A priority Critical patent/GB2456297A/en
Publication of GB0723915D0 publication Critical patent/GB0723915D0/en
Publication of GB2456297A publication Critical patent/GB2456297A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01H MEASUREMENT OF MECHANICAL VIBRATIONS OR ULTRASONIC, SONIC OR INFRASONIC WAVES
    • G01H3/00 Measuring characteristics of vibrations by using a detector in a fluid
    • G01H3/04 Frequency
    • G01H3/08 Analysing frequencies present in complex vibrations, e.g. comparing harmonics present
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01H MEASUREMENT OF MECHANICAL VIBRATIONS OR ULTRASONIC, SONIC OR INFRASONIC WAVES
    • G01H17/00 Measuring mechanical vibrations or ultrasonic, sonic or infrasonic waves, not provided for in the preceding groups
    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03G CONTROL OF AMPLIFICATION
    • H03G11/00 Limiting amplitude; Limiting rate of change of amplitude; Clipping in general
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03G CONTROL OF AMPLIFICATION
    • H03G9/00 Combinations of two or more types of control, e.g. gain control and tone control
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03G CONTROL OF AMPLIFICATION
    • H03G9/00 Combinations of two or more types of control, e.g. gain control and tone control
    • H03G9/005 Combinations of two or more types of control, e.g. gain control and tone control of digital or coded signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R25/00 Deaf-aid sets, i.e. electro-acoustic or electro-mechanical hearing aids; Electric tinnitus maskers providing an auditory perception
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03G CONTROL OF AMPLIFICATION
    • H03G3/00 Gain control in amplifiers or frequency changers
    • H03G3/20 Automatic control
    • H03G3/30 Automatic control in amplifiers having semiconductor devices
    • H03G3/34 Muting amplifier when no signal is present or when only weak signals are present, or caused by the presence of noise signals, e.g. squelch systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2201/00 Details of transducers, loudspeakers or microphones covered by H04R1/00 but not provided for in any of its subgroups
    • H04R2201/10 Details of earpieces, attachments therefor, earphones or monophonic headphones covered by H04R1/10 but not provided for in any of its subgroups
    • H04R2201/107 Monophonic and stereophonic headphones with microphone for two-way hands free communication

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Neurosurgery (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

An audio-enhancing system receives an input audio signal 1. An Impulsive Shock Detector 50 detects the presence of impulsive shocks in the input audio signal by performing feature recognition on the audio signal data using an artificial neural network (ANN) which has been configured to identify impulsive shocks. Detected impulsive shocks are removed from the audio signal data. The audio signal data can be a time-domain representation of the input audio signal with the feature recognition being performed on the time-domain data. Each input vector to the artificial neural network is a set of features from a relatively short frame of audio signal data, such as a frame having a duration of less than 20ms. The artificial neural network can be a Multi-Layer Perceptron (MLP) neural network.

Description

IMPULSIVE SHOCK DETECTION AND REMOVAL
FIELD OF THE INVENTION
This invention relates to audio signal processing methods and systems for speech enhancement in telecommunication equipment and devices.
BACKGROUND TO THE INVENTION
Noise is a problem in many telecommunication networks, particularly when wireless communication devices, such as mobile or cordless phones and wireless headsets, are used in a noisy environment with a high level of ambient noise. The ambient noise combines with a user's speech at a first device, the combined signal is transported to a second device, and the combined noise and speech signal is reproduced to a user. Additionally, CODECs in many digital devices further reduce the quality and intelligibility of speech.
The cumulative combination of CODECs and ambient noise can significantly affect the quality of conversation in telecommunication networks, causing strain and fatigue to a user.
A further problem in telecommunications networks is the effect of acoustic shocks.
Acoustic shock is a term used to describe unexpected disturbances in a signal. Acoustic shocks can take the form of high-level tones, such as signalling tones (e.g. DTMF or fax tones), shrieks, clicks or pops caused by equipment being plugged or unplugged, or unexpected changes in volume level, such as a user shouting. Acoustic shock is described in the articles: "An Ultra-Low Power, Miniature System for Detecting and Limiting Acoustic Shock in Headsets", Cornu et al., ICASSP 2003, Proceedings of International Embedded Solutions, September 2004, Santa Clara, US; and "Subband-Based Acoustic Shock Limiting Algorithm On A Low-Resource DSP System", Choy et al., EUROSPEECH 2003. Arrangements for detecting acoustic shock are also described in WO 2006/073609 A1 and US 2005/0058274 A1. WO 2006/073609 A1 detects delta acoustic incidents that exceed a predetermined acoustic startle boundary by the use of averaging filters having different response times (5 ms, 50 ms, 5 s), and by comparing outputs of the different filters. US 2005/0058274 A1 detects shrieks in an audio signal by using a bin differencer to determine a difference between each pair of adjacent frequency sub-band bins in the magnitude spectrum. Neither of these approaches is able to accurately detect a wide range of acoustic shocks, and both are particularly ineffective at dealing with impulsive acoustic shocks.
The present invention seeks to address at least one of the problems identified above.
SUMMARY OF THE INVENTION
A first aspect of the invention provides a method of reducing impulsive acoustic shocks in an audio signal comprising: receiving audio signal data representing an input audio signal; detecting an impulsive shock in the input audio signal by performing feature recognition on the audio signal data using an artificial neural network which has been configured to identify impulsive shocks; and, removing a detected impulsive shock from the audio signal data.
The use of feature recognition on audio data, implemented by an artificial neural network, has been found to provide a particularly effective way of detecting a wide range of impulsive shocks. It has an advantage of permitting fast, real-time processing of audio data and has a further advantage of requiring only a modest amount of processing resources, which makes it particularly attractive for implementation in low-power applications.
The audio signal data can be a time-domain representation of the input audio signal with the feature recognition being performed on the time-domain data, or alternatively the method can comprise transforming the time-domain data to the frequency domain and performing the feature recognition on the frequency-domain data.
Advantageously, each input vector to the artificial neural network is a set of features from a relatively short frame of audio signal data, such as a frame having a duration of less than 20 ms, which further helps to minimise the amount of processing.
A Multi-Layer Perceptron (MLP) neural network has been found to offer advantageous results, although the invention is not limited to the use of a MLP.
A detected shock can be removed by at least one of: repeating a portion of the audio signal before the impulsive shock; interpolating signal values on each side of the portion of the signal containing the detected impulsive shock; replacing the portion of the signal containing the detected impulsive shock by a comfort noise; or replacing the portion of the signal containing the detected impulsive shock by pure silence.
Advantageously, the method further comprises generating a first output for applying to a noise-reducing stage when an impulsive shock is detected. The first output instructs the noise-reducing stage not to change calculated gain values. The presence of impulsive shocks has been found to disturb gain-setting calculations of a noise-reducing stage. Freezing calculated gain values during the presence of an impulsive shock has an advantage of ensuring a good continued quality of noise reduction after the occurrence of an impulsive shock.
Advantageously, the method further comprises generating a second output for applying to a noise-reducing stage when an impulsive shock is detected. The second output instructs the noise-reducing stage not to process a portion of audio signal data in which the impulsive shock has been detected. This has an advantage of saving processing resources which would otherwise be unnecessarily wasted when the portion of the signal containing the impulsive shock will be discarded.
Any of the above aspects of the invention can be applied to an audio signal received from another communications link, such as a speech signal from a remote device, or to an audio signal generated locally at a device, before sending over a communications link.
The integrated voice enhancement algorithms and software solution can be used in a number of semiconductor chipsets, as well as a number of wired or wireless communication devices and systems. Examples of such chipsets and devices/systems are: semiconductor chipsets for Bluetooth devices, wireless telephony (e.g. WiFi or other IP or packet-based protocols, cellular, cordless, Private Mobile Radio (PMR)), wireless voice communication devices and equipment; mono and stereo Bluetooth headsets; IEEE 802.11 WiFi handsets for VoIP telephony; IP-based video telephony products; mobile telephony handsets; DECT phones; Personal Mobile Radio (PMR) or Push-to-Talk; wired voice communication devices and equipment; business telephony; call centre equipment; IP telephony devices; ISDN phones; soft switches, PBX or PABX; and main telephony switches.
Another aspect of the invention provides apparatus for implementing any of the steps of the method. The apparatus comprises modules for performing any of the described steps of the method.
The functionality described here can be implemented in software, hardware or a combination of these. The invention can be implemented by means of hardware comprising several distinct elements, by means of a dedicated processing apparatus (e.g. an Application Specific Integrated Circuit) or by a general-purpose processing device which is configured to implement the functionality, such as a Digital Signal Processor (DSP), configurable array, personal computer (PC) or any other form of processing apparatus.
Accordingly, another aspect of the invention provides software for implementing any of the described steps. The software may be stored on an electronic memory device, hard disk, optical disk or any other form of machine-readable storage medium. The software may be delivered as a computer program product on a machine-readable carrier or it may be downloaded to a device via a network connection. The invention is particularly suitable for embedded DSP implementation, and implementations which have ultra-low power constraints.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will be described, by way of example only, with reference to the accompanying drawings in which: Figure 1 shows an overall system in accordance with an embodiment of the invention which can detect, and remove, noise and acoustic shocks in an audio signal and which can monitor exposure to audio signal energy; Figure 2 shows the system of Figure 1 in more detail; Figure 3 shows the noise and shock detection and noise and shock removal stages of Figures 1 and 2 in more detail; Figure 4 shows the noise classification module forming part of Figure 3; Figure 5 shows a Time-Delay Neural Network (TDNN) forming part of the noise classifier; Figure 6 shows a gain shaping function for an example noise class (factory noise); Figure 7 shows functional units of the acoustic shock module; Figure 8 shows an example of a click signal in the time-domain, and the corresponding rise in energy of the signal; Figure 9 shows an example of an artificial neural network for use in identifying impulsive shocks in an audio signal; Figure 10 shows the thresholding function of neurons in the hidden and output layers of the artificial neural network of Figure 9; Figure 11 shows the hearing exposure monitoring (HEM) module in more detail; Figure 12 shows a telephony band mask used in the HEM of Figure 11; Figure 13 shows the rise of accumulated energy over a period of time, and measurement of the average slope of accumulated energy; Figure 14 shows minimum and maximum audible threshold levels for audio signal power; Figure 15 shows a hysteresis function used in the HEM module.
DESCRIPTION OF PREFERRED EMBODIMENTS
Figure 1 schematically shows an overall system for removing noise and acoustic/impulsive shocks from an audio signal, and for monitoring a user's exposure to audio signals. The system can form part of a communication device, such as a wireless or wireline terminal, or a sub-system of such a device. Although individual functional modules are described and illustrated, it should be understood that any module can be divided into multiple sub-modules, or combined with other modules in a manner that will be well understood by a skilled person. An incoming audio signal, in digital form, is transformed into the frequency domain and analysed by a Noise Analysis module. An output of the Noise Analysis module is fed to a Noise and Shock Reduction (NSR) module to optimise the attenuation of the unwanted ambient noise and acoustic shock signals with minimal effect on speech quality. In parallel with the above, the time-domain representation of the audio input signal is also fed to an Impulsive Shock Detector (ISD) module to identify short duration transient and impulsive shocks or clicks mixed with the incoming speech signal. Identified shocks are removed with minimal effect on speech quality by an Impulsive Shock Removal (ISR) module. The combined output of the above modules is then fed to the Hearing Exposure Monitoring (HEM) module to determine the amount of (daily) hearing exposure through the host device, such as a headset or handset.
Figure 2 shows the overall system in more detail. The digital input audio signal is received at an input 1. The signal is buffered 20 to form a buffer of digital data. A frequency domain analysis is performed on the buffered time-domain data. This is conveniently achieved by a Fast Fourier Transform (FFT). As is well known, an FFT generates data about a plurality of frequency sub-bands, distributed across a frequency range. The number of sub-bands is selected to give an acceptable compromise between accuracy and processing burden. An embodiment of this invention uses 256 sub-bands distributed across the frequency range 0-4000Hz, although it will be appreciated that any suitable number of sub-bands can be used. Frequency sub-band data is made available to multiple modules of the system: noise estimator module 41, acoustic shock detection module 43 and noise and shock reduction module 44. These modules only use the magnitude data 3 derived by the FFT. Phase data is delivered to a re-synthesis module 22.
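The sub-band analysis and re-synthesis path described above can be illustrated with a short sketch. The 8 kHz sample rate, the windowing and the 512-point FFT size below are assumptions chosen only to match the 256 sub-bands across 0-4000Hz quoted in the text; the patent does not fix these values.

    import numpy as np

    FS = 8000     # assumed sample rate giving a 0-4000Hz analysis range
    N_FFT = 512   # assumed FFT size, giving 256 sub-bands below the Nyquist frequency

    def analyse_frame(frame):
        """Transform one buffered time-domain frame into sub-band magnitude and phase.

        The magnitude data feeds modules 41, 43 and 44; the phase is retained for
        the re-synthesis module 22.
        """
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n=N_FFT)
        return np.abs(spectrum), np.angle(spectrum)

    def synthesise_frame(magnitude, phase):
        """Re-synthesise a time-domain frame from (possibly gain-modified) sub-band data."""
        return np.fft.irfft(magnitude * np.exp(1j * phase), n=N_FFT)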
The sub-band data 3, representing the time-frequency distribution of the signal, is analysed by Noise Estimator 41 to estimate the level of noise present in the speech signal. Acoustic Shock Detector 43 detects the presence of acoustic shock signals in the frequency sub-band data. The output of noise estimator 41 is fed to the noise classifier 42 to classify the noise types present in the signal. The outputs of modules 41, 42, 43 are fed to the noise and shock reduction module 44 to calculate gain settings which will attenuate the detected background noise and acoustic shock signals with minimal effect on the quality of speech.
In parallel with the operations just described, the time-domain representation of the audio input signal is also fed to an impulsive shock detector (ISD) module 50 so that the same portion of the audio signal is simultaneously analysed in the time and frequency domains. Identified shocks are eliminated by Impulsive Shock Removal module 51.
The output from item 44, representing a noise-reduced audio signal in the frequency domain, is fed to the hearing exposure monitor 30. Module 30 can apply an auditory masking filter, such as an A-weighting filter, to the data 6 such that the data represents the loudness of the audio signal, as it will be perceived by a human, rather than the actual sound intensity of the signal. Hearing Exposure Monitor 30 can operate in an active mode or in a passive mode. In the active mode, HEM 30 calculates the accumulated sub-band energies and shorter-term signal energy, and applies a variable gain to each sub-band such that a user is not exposed to more than a recommended limit of sound energy over a period of time, while also not experiencing an unnecessary attenuation of the audio signal.
In the passive mode HEM 30 does not variably attenuate the audio signal data 6. Instead, HEM applies a telephony mask to the data 6. If the pre-defined daily hearing exposure level is reached then an alert is raised to the user of the host device either through an audible announcement or through visual indication. HEM 30 also takes into account any changes in the signal energy calculation due to any shock removal operation by module 50.
The presence of a time-domain impulsive shock is confirmed using a flag 10. The output 7 of HEM 30 is fed into a sub-band synthesis module 22 to construct a time-domain representation of the noise and shock reduced audio signal. Impulsive shock removal module 51 is operable to conceal a portion of the audio signal in which an impulsive shock has been identified as being present (by module 50), or simply passes the time-domain signal 8 directly to create the time-domain audio signal for output 9.
Each of the modules will now be described in greater detail.
Noise and Shock Reduction, modules 41-44
Figure 3 shows this stage in greater detail. In this implementation the input samples comprise the desired or wanted signal (i.e. speech) and an undesired or unwanted signal (i.e. noise or shock), which are mixed together and presented to the system as if they come from a single source, such as a single microphone of a headset or wireless device.
This is also known as in-line noise.
The Noise and Shock Reduction (NSR) module 44 cancels the stationary and/or non-stationary noises mixed with the signal to enhance voice quality, improve intelligibility of speech signal and reduce the listening effort required by a user.
Additionally, it limits and attenuates the acoustic shock (such as shouts and shrieks) mixed with the speech signal to increase human hearing protection when using telecommunication devices.
In this implementation, the noise classifier 42 classifies the noise mixed with the signal into various noise types, which include a stationary noise type and different types of non-stationary noise types (such as: road/traffic/engine noise; factory/café/babble noise; wind noise; etc.) and indicates the relative amounts of the different noise types mixed with the signal. In a preferred embodiment, the noise classifier 42 is a pre-trained artificial neural network (ANN) which receives, as an input, a pre-processed multi-dimensional input vector from the noise estimator 41. The output of the noise classifier is also a multi-dimensional vector representing broad noise categories as well as indicating the amount of noise associated with a particular noise type present in the signal. These will be discussed in more detail later.
The output vector from the noise classifier 42 is fed to the NSR module 44. NSR 44 receives the vector from the noise classifier 42 and an input from the noise estimator 41 to determine the required sub-band gains and filter shaping. A two-stage noise reduction algorithm is used. A first stage 441 applies sub-band Wiener filtering and a second stage performs spectral subtraction 443. The second stage 443 can be selectively by-passed by switch 444. By-passing is controlled by an output of the Noise Classifier 42. If the pattern of ambient noise is classified (by NC 42) as being associated with stationary noise, the switch is activated to cause spectral subtraction 443 to be performed on the signal, and to route the output to module 445. However, if the NC 42 indicates that the noise is non-stationary, the output of element 442 is applied directly to module 445. In one, non-limiting embodiment, switch 444 is activated once the proportion of stationary noise exceeds a threshold level, such as 75% of the total noise. It will be appreciated that the threshold level can be set to any desired value that provides a desired result.
If a signal (a flag) 10 from the impulsive shock removal module 51 is set as active, indicating the presence of impulsive shock in the incoming signal, the gain values of the gain & shaper module 441 remain unchanged and the previous gain values are frozen until item 10 becomes disabled. It has been found that an impulsive shock can disturb the gain settings of stage 441. By fixing the gain settings, there is an advantage that impulsive shocks are not allowed to cause a significant disturbance to the gain settings.
Noise Classifier
Non-linear pattern recognition techniques are used to identify and categorise the noise present in the signal into "stationary" or "non-stationary" noise types, and to provide an indication of the nature and the contribution of the different non-stationary noise types present in the signal. Stationary noise is a term used to describe noise with slow change in statistical and spectral characteristics over time, such as engine noise and white noise. In contrast, non-stationary noise cannot be well-defined by statistical methods. This approach enables the noise reduction module 44 to maximise the noise attenuation while avoiding adverse effects, such as the piping or metallic speech effect associated with classical noise cancellation techniques. The noise reduction module 44 uses the classification output vector to constantly update and optimise its parameter settings to maintain the desired speech quality.
In one embodiment, the Noise Classifier 42 identifies four classes of noise: stationary noise, and three different classes of non-stationary noise. These are:
1. Stationary noise - examples include: engine noise from a vehicle, road/tyre interaction noise, computer fan noise, airplane engine noise, helicopter propeller noise, white noise, and coloured noise.
2. Non-stationary class 1 - examples include: construction noise, factory noise, city traffic/street noise, etc., whereby there is little or no correlation with the spectral characteristics of speech.
3. Non-stationary class 2 - examples include: babble noise, chatter noise, café noise, cocktail noise, etc., whereby there is a degree of correlation with the spectral characteristics of speech.
4. Non-stationary class 3 - wind noise, whereby the non-stationary nature of the noise has a varying dominant frequency.
It will be appreciated that the Noise Classifier can be arranged to identify a larger, or smaller, number of noise classes. Figure 4 schematically shows the relationship between the Noise Classifier 42 and neighbouring modules 41, 44. The Noise Classifier 42 uses semi-dynamic non-linear pattern recognition paradigms such as neurocomputing. In one preferred embodiment, a Time-Delay Neural Network (TDNN) is trained to identify and categorise the four noise classes outlined above. The input vector to the TDNN classifier is a set of normalised progressive values from the noise estimator 41 using an overlapping windowing function. For an example where FFT 21 provides data for 256 sub-bands, the symmetrical nature of the sub-band data only requires 128 of the sub-bands to be considered. An input vector to the TDNN comprises 128 normalised values received from the noise estimator 41, each normalised value corresponding to one of the 128 sub-bands that need to be considered. The set of 128 normalised values is updated every 16 ms. The parameters in each overlapping window constitute a subset of the input vectors to the TDNN network. This allows intra-layer propagation of input parameters in each layer to develop an internal representation of the spectral profile. The output vector from the TDNN indicates the presence and contribution of different noise in the incoming signal. The normalised output vector is fed to the NSR module 44.
The preferred type of neural network for this task is known as a Time Delay Neural Network (TDNN). The TDNN converts the temporal sequences of input data into a static pattern by treating a finite sequence of time as another dimension in the problem. As shown in Figure 5, the input sequence is fed into a tapped delay line of finite extent, which in turn is fed into a static network. In producing the input vector to the TDNN, an overlapping windowing function is used to progressively extract the spectral features from the noise estimator. The structure of TDNN also incorporates a sliding window that propagates through rows in each layer.
The output of the TDNN is a function of the temporal sequences in input, with no feedback. The output thresholding function is a sigmoid function with an appropriate gradient tending towards a stable continuous function with interpolation capacity for best-fit pattern association. Thus, the output vector provides indication of the contributions from each noise type present in the signal.
In one embodiment, a single TDNN network with four outputs is trained to identify the four different noise classes indicated above. The combination of outputs collectively represents the amount of each noise type present in the signal. The output from the TDNN network is fed to the NSR module 44 to modify the parameters of the filter, resulting in optimum noise reduction with minimum adverse effect on speech quality and intelligibility. One implementation of the TDNN that has been found to provide good results is: an input matrix layer having the dimensions 128 rows x 3 columns, where the three columns are selected out of 5 feature vector columns each comprising 128 elements; a second layer having a 126x3 matrix of neurons; and an output layer having 4 neurons, each output neuron corresponding to one of the four classes of noise.
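The evaluation of such a classifier might look like the following sketch. The layer sizes follow the figures quoted above, but the exact connectivity of the 126x3 hidden layer is not fully specified, so a simplified fully connected arrangement is used; the weights shown are random placeholders and the normalisation of the outputs so that they sum to 1 reflects the fuzzy-classifier behaviour described below.

    import numpy as np

    N_BANDS, N_DELAYS, N_CLASSES = 128, 3, 4   # per the description above

    def sigmoid(x, k=1.0):
        return 1.0 / (1.0 + np.exp(-k * x))

    def tdnn_classify(feature_history, w_hidden, w_out):
        """Evaluate a simplified TDNN on the last N_DELAYS noise-estimator vectors.

        feature_history: (N_DELAYS, N_BANDS) normalised sub-band noise estimates,
                         updated every 16 ms (most recent last).
        Returns a fuzzy class vector (stationary, class 1, class 2, class 3).
        """
        x = feature_history.reshape(-1)        # treat the delay line as a static pattern
        hidden = sigmoid(w_hidden @ x)         # hidden-layer activations
        out = sigmoid(w_out @ hidden)          # one output per noise class
        return out / out.sum()                 # fuzzy memberships summing to 1

    # Illustrative use with random (untrained) weights:
    rng = np.random.default_rng(0)
    w_hidden = rng.normal(scale=0.1, size=(126 * 3, N_BANDS * N_DELAYS))
    w_out = rng.normal(scale=0.1, size=(N_CLASSES, 126 * 3))
    delta = tdnn_classify(rng.random((N_DELAYS, N_BANDS)), w_hidden, w_out)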
The noise classifier is a form of fuzzy classifier. In operation, each of the outputs is in the range 0-1 and the sum of the four individual outputs of the TDNN =1. It can be seen that the outputs of the classifier do not simply give a black and white (i.e. noise is present/noise is not present) decision, but provide an indication of the relative contribution of each of the noise classes. This is of considerable use in helping to tailor parameters of the subsequent noise reduction stages.
It is important to note that, while the training phase for neural network can be computationally intensive and time consuming, the evaluation phase has extremely low computational overhead. The training of the neural network comprises applying a range of signals to the inputs of the network, the signals representing different noise classes that it is desired to identify to the network, and instructing the network as to which output should be activated for each input signal. Additionally, the input signals can comprise the same noise classes at different levels, and combinations of noise classes at different levels, and the network is trained with the results that are expected for each input. After the training phase, the network is tested by applying signals that the network has not previously been trained with, and monitoring the results to test that the network is providing the correct results. The training is preferably performed offline as part of the development of the product. Once a network has been trained and tested satisfactorily, the set of weight values that provide that behaviour can be transferred to the network in the Noise Classifier 42.
Noise and Shock Reduction module 44
The first stage of processing on the input sub-band signal is the gain calculator and shaper module 441. Module 441 calculates two sets of gain vectors. The first set of gains is calculated according to Wiener filtering and is denoted G_m,1, G_m,2, ..., G_m,N, where N represents the number of sub-bands in the FFT sub-band analysis 21. These gains are momentary gain values that will be smoothed to dampen any sudden change in the gain vectors.
Assume that the noise-corrupted input signal is defined as y[n] = x[n] + d[n], where x[n] is the clean speech and d[n] is the additive noise. The momentary sub-band Wiener filters are calculated according to Equation 1, where P_XX(ω_k) and P_DD(ω_k) are the power spectra of the clean and noise signals, respectively, in the kth sub-band. Equation 2 shows how these power spectra values are calculated. X(ω_k) and D(ω_k) represent the signal and noise in the kth sub-band.
G_m,k = P_XX(ω_k) / (P_XX(ω_k) + P_DD(ω_k)),   with G_m,k set to G_min if G_m,k < G_min   (1)

P_XX(ω_k) = E{ |X(ω_k)|^2 },   P_DD(ω_k) = E{ |D(ω_k)|^2 }   (2)

G_min in Equation 1 controls the level of noise attenuation achieved in the noise reduction algorithm at the cost of speech distortion. G_min is inversely proportional to the amount of speech distortion, i.e. an increase in noise attenuation (smaller G_min) could mean higher speech distortion. In order to minimise the speech distortion and maximise the noise attenuation level, different values of G_min are used. The optimum value for G_min is selected based on the type of noise, i.e. stationary or non-stationary. Such a decision is controlled by the noise classifier 42. The value of G_min would be smaller for the stationary noises, because the noise estimator is more accurate in calculating the power of noise, whereas it would have the highest value for noise classes, such as babble and café, which are highly non-stationary. Having n = 1..M noise classes, G_min is calculated in accordance with Equation 3, whereby G_min,n is the stored value of G_min for the nth noise class and δ_n is the output of noise classifier 42 representing the similarity/correlation of the noise signal with the nth noise class.
G_min = Σ_{n=1..M} δ_n · G_min,n   (3)

The calculated gains from Equation 1 (G_m,1, G_m,2, ..., G_m,N) are then smoothed over time.
The process of smoothing removes sudden changes of gain caused by potential errors in the noise estimator 41 or the presence of a bursty noise signal. Instantaneous changes of gain could create a sudden deep reduction in signal levels, which could result in a loss of subjective naturalness in the speech signal. Equation 4 shows an example algorithm for smoothing the sub-band Wiener gains:

G_k = λ·G_m,k + (1 - λ)·G_k(previous),   0 ≤ λ ≤ 1   (4)

In Equation 4, as the value of λ approaches zero, the variation of the final gains is smaller, which is more appropriate for the removal of stationary noises (as the statistical characteristics of the noise stay unchanged over time). Conversely, as the value of λ increases, the gains change more quickly to follow the estimated level of noise provided by noise estimator 41, allowing effective noise removal. The value of λ is controlled by the output of noise classifier 42. For the more stationary noise classes, the value of λ is closer to zero, and for the faster-varying noise classes like babble and café, the value of λ is closer to 1. Having n = 1..M noise classes, λ is calculated in accordance with Equation 5:

λ = Σ_{n=1..M} δ_n · λ_n   (5)

where λ_n is the stored value of λ for the nth noise class and δ_n is the output of the noise classifier representing the similarity/correlation of the noise signal with the nth noise class.
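A minimal sketch of Equations 1 to 5 on a per-frame basis is given below. The per-class G_min and λ constants are illustrative placeholders, not values taken from the patent.

    import numpy as np

    # Assumed per-class constants (stationary, class 1, class 2, class 3); illustrative only.
    G_MIN_PER_CLASS = np.array([0.05, 0.10, 0.20, 0.15])
    LAMBDA_PER_CLASS = np.array([0.05, 0.40, 0.80, 0.60])

    def wiener_gains(p_xx, p_dd, delta, g_prev):
        """Momentary Wiener gains (Eqs 1-2), class-mixed floor (Eq 3) and smoothing (Eqs 4-5).

        p_xx, p_dd: per-sub-band clean-speech and noise power estimates (numpy arrays).
        delta:      fuzzy noise-class vector from the classifier (sums to 1).
        g_prev:     previous smoothed gain vector.
        """
        g_min = float(delta @ G_MIN_PER_CLASS)       # Equation 3
        lam = float(delta @ LAMBDA_PER_CLASS)        # Equation 5
        g_m = p_xx / (p_xx + p_dd + 1e-12)           # Equation 1 (momentary gains)
        g_m = np.maximum(g_m, g_min)                 # floor the gains at G_min
        return lam * g_m + (1.0 - lam) * g_prev      # Equation 4 (smoothing over time)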
At the next stage of the Gain Calculator & Shaper 441, the calculated gains are shaped by a shaping function. These functions are selected from a table of functions depending on the class of noise detected by the noise classifier 42. The stored shaping functions are derived from extensive tests of the performance of the noise estimator 41 for different classes of noise and the frequency characteristics of ambient noise. Figure 6 shows a shaping function used for factory noise, showing a gain value for each of the 256 frequency sub-bands in the range 0-4000Hz.
The final gain value of the Gain Calculator & Shaper 441 is calculated according to Equation 6:

G_k = G_k · Σ_{i=1..M} δ_i · F_i,k   (6)

where F_i,k represents the gain of the shaping function at the kth frequency band for the ith noise class, and δ_i is the output of the fuzzy classifier that represents the similarity/correlation of the ambient noise with the ith noise class (i can be between 1 and M, where M is the total number of noise classes).
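Equation 6 can be sketched as a class-weighted combination of stored shaping curves; the shaping_table argument stands in for the pre-computed per-class functions such as the factory-noise curve of Figure 6.

    import numpy as np

    def shape_gains(g, delta, shaping_table):
        """Apply the class-weighted shaping functions of Equation 6.

        g:             smoothed Wiener gain vector (one value per sub-band).
        delta:         fuzzy class vector from the noise classifier (length M).
        shaping_table: array of shape (M, n_bands); row i holds F_i,k for noise class i.
        """
        f_combined = delta @ shaping_table   # weighted sum of the per-class shaping curves
        return g * f_combined                # Equation 6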
After these gain values have been calculated by the Wiener filter, the acoustic shock detector 43 feeds module 441 with information to calculate the gain values for removing any acoustic shock signal that has been detected.
Acoustic Shock Detector
Figure 7 schematically shows the main functional blocks of the acoustic shock detector 43.
Advantageously, the detection of acoustic shock and the subsequent removal of any detected shock does not operate on any frequencies below substantially 700 Hz. It has been found that frequencies in the range of 1 kHz to 4 kHz are more likely to cause a startle response than lower frequencies, as these higher frequencies are close to the resonant frequency of the ear canal. In addition, all of the common in-band telephony signalling tones are located above around 700Hz; these include: Dual Tone Multi-Frequency (DTMF), Multi-Frequency-R1 (MF-R1), MF-R2, and fax/modem tones.
In the frequency domain, only the sub-bands representing frequency content of the signal above 700 Hz are processed and used as the criteria for detection of acoustic shock signals.
Acoustic Shock Detector 43 monitors the rate of change of energy in each of the sub-bands above 700Hz, which is the first derivative of the signal energy in the different bands. If the rate of change in any of the sub-bands is greater than a certain threshold value for a predefined period of time, this is indicative of an acoustic shock being present in the signal. An additional factor in shock detection is the distribution of energy over different sub-bands.
If the centroid of buffered energy is highly concentrated around certain frequency sub-bands, the incoming signal is detected as a shock. Equation 7 shows the gain calculation based on the rate of change of energy and Equation 8 shows the gain calculation based on the energy distribution across sub-bands:

G_k = f(D_k)   if (E_k > T_E and E_total > T_total) for T_h1 ms   (7)
G_k = 1        else

G_k = g(E_total - E_k) · G_k   if (R_k > T_R and E_total > T_total) for T_h2 ms   (8)
G_k = 1                        else

where D_k is the rate of change of energy in sub-band k; E_k is the energy of the signal in sub-band k; and R_k = E_k / E_total. The process of calculating a gain for a sub-band begins with G_k set to an initial value of G_k = 1. Equation 7 is then applied to derive an updated value of G_k, and then Equation 8 is applied to the value of G_k resulting from Equation 7. The parameter E_total represents the total energy of the incoming signal. T_E and T_R are threshold values chosen to have limited effect on a pure speech signal, and T_h1 and T_h2 represent the hangover times used for the rate of energy change (D_k) and the energy distribution parameter (R_k). All of the sub-band gains G_1, G_2, ..., G_N are initialised with the value 1.
To meet the requirement of Equation 7, the energy of the processed sub-band should be more than a threshold for a specific amount of time (hang-over time), and the total energy of the signal must also be greater than a threshold value for the duration of the hang-over time.
After the calculation of these gain values in Equations 7 and 8, module 441 calculates the final gain vector G_1, G_2, ..., G_N by multiplying the set of Wiener gain vectors by the shock gain vectors to obtain the final gain vector values, i.e.:

G_k = G_k,Wiener · G_k,shock   (9)

These gain values are applied to the sub-band data at multiplier 442.
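The structure of Equations 7 to 9 might be realised as in the sketch below. Because the attenuation functions f(·) and g(·) and the threshold values are only outlined in the text, the attenuation formula, thresholds and frame-based hangover counts used here are placeholders; only the gating and combination logic follows the description.

    import numpy as np

    T_E, T_R, T_TOTAL = 0.5, 0.6, 1e-3   # illustrative thresholds, not values from the patent
    HANG_1, HANG_2 = 3, 3                # hangover times expressed in frames (assumed)

    class ShockGainState:
        def __init__(self, n_bands):
            self.prev_energy = np.zeros(n_bands)
            self.count_d = np.zeros(n_bands, dtype=int)   # frames the Eq 7 condition has held
            self.count_r = np.zeros(n_bands, dtype=int)   # frames the Eq 8 condition has held

    def shock_gains(energy, state):
        """Per-sub-band shock gains following the structure of Equations 7 and 8."""
        g = np.ones(energy.shape)                   # all gains initialised to 1
        e_total = energy.sum()
        d = energy - state.prev_energy              # D_k: rate of change of energy
        r = energy / (e_total + 1e-12)              # R_k: energy distribution

        cond7 = (energy > T_E) & (e_total > T_TOTAL)
        state.count_d = np.where(cond7, state.count_d + 1, 0)
        held7 = state.count_d >= HANG_1
        g[held7] = 1.0 / (1.0 + np.maximum(d[held7], 0.0))   # placeholder attenuation f(D_k)

        cond8 = (r > T_R) & (e_total > T_TOTAL)
        state.count_r = np.where(cond8, state.count_r + 1, 0)
        g[state.count_r >= HANG_2] *= 0.1           # placeholder attenuation g(E_total - E_k)

        state.prev_energy = energy.copy()
        return g

    def final_gains(g_wiener, g_shock):
        """Equation 9: combine the Wiener and shock gains before multiplier 442."""
        return g_wiener * g_shock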
Returning to Figure 3, the next stage of noise and shock reduction is a vector subtractor stage 444 that is controlled by the output of the noise classifier. If the fuzzy value of the stationary class of the classifier output (δ_stationary) is greater than 0.75, the vector subtractor (item 444) will be active, otherwise it will be disabled. As described above, this ensures that vector subtraction is only performed on audio signals that are deemed to predominantly comprise stationary noise. The vector subtractor 444 will remove the residual noise left from the output of the Wiener filter. The vector subtractor calculates a subtracted vector based on the output of the noise estimator. The vector subtraction method tries to extract the estimated clean speech as in Equation 10:

|X(ω_k)| = |Y(ω_k)| - β·D(ω_k)   (10)

where Y(ω_k) is the output of the kth sub-band for the input signal, D(ω_k) is the estimation of the noise power, and β represents the weighting applied to the estimated noise. As the magnitude spectrum of the enhanced signal |X(ω_k)| calculated in this way can be negative, a simple rule is used to handle that case.
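A sketch of the vector subtraction of Equation 10, gated on the stationary-class membership exceeding 0.75 as described above, is given below; the over-subtraction weighting β and the floor applied when the subtraction would go negative are assumptions.

    import numpy as np

    def vector_subtract(y_mag, noise_est, delta_stationary, beta=1.0, floor=0.01):
        """Residual-noise removal per Equation 10, applied only for predominantly stationary noise."""
        if delta_stationary <= 0.75:              # by-pass for non-stationary noise (switch 444)
            return y_mag
        x_hat = y_mag - beta * noise_est          # Equation 10
        return np.maximum(x_hat, floor * y_mag)   # simple rule for negative magnitudes (assumed)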
A final stage of the Noise and Shock Reducer 44 is a shock limiter 445. Module 445 uses the energy calculation of the incoming signal, made by acoustic shock detector 43, and limits the level of the incoming signal in each sub-band. Table 1 shows an example table of settings for this stage, for frequencies in the range >700Hz. For each frequency range in the table, an entry is given of a maximum signal level. Limiter 445 ensures that any shocks that have erroneously remained undetected by the previous stages are removed from the audio data. As an example, consider that a fax tone in the 1900-2200Hz band has caused the energy in this band to have a value of -10dB. Table 1 indicates the maximum level should be -18dB. Accordingly, limiter 445 will limit the signal in this frequency range to -18dB. Our extensive tests show that limiting the energy of the signal as shown in Table 1 substantially limits the acoustic shock signal with minimal adverse effect on the speech signal.
Frequency (Hz)    Maximum Level (dB sine)
700-1100          -13
1100-1200         -14
1200-1350         -14
1350-1550         -15
1550-1700         -16
1700-1900         -17
1900-2200         -18
2200-2400         -19
2400-2700         -20
2700-3100         -21
3100-3400         -22
3400-3900         -20
>3900             -18
Table 1
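The limiter of Table 1 might be applied as follows; representing the table entries as full-scale sine dB and converting them to linear magnitude limits is an assumption.

    import numpy as np

    # (low Hz, high Hz, maximum level in dB) taken from Table 1; the last row covers >3900 Hz.
    LIMIT_TABLE = [(700, 1100, -13), (1100, 1200, -14), (1200, 1350, -14), (1350, 1550, -15),
                   (1550, 1700, -16), (1700, 1900, -17), (1900, 2200, -18), (2200, 2400, -19),
                   (2400, 2700, -20), (2700, 3100, -21), (3100, 3400, -22), (3400, 3900, -20),
                   (3900, 4001, -18)]

    def limit_shocks(magnitude, band_freqs):
        """Clamp each sub-band magnitude to the Table 1 maximum for its frequency range."""
        out = magnitude.copy()
        for lo, hi, max_db in LIMIT_TABLE:
            mask = (band_freqs >= lo) & (band_freqs < hi)
            max_lin = 10.0 ** (max_db / 20.0)    # dB re full-scale sine to linear (assumed)
            out[mask] = np.minimum(out[mask], max_lin)
        return out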
Impulsive Shock Detector
Impulsive acoustic shocks (also known simply as impulsive shocks) are short duration (typically <16 milliseconds), high frequency noise interference, often in the form of a spike or a click, which is unexpectedly added to the speech at some point between the source and destination of the audio signal and arrives at the user's headset or handset.
Impulsive shocks or sudden interferences are characterised as having a burst of energy with fast rise in the peak magnitude and energy, resembling a bang, pop or clang.
Impulsive shocks are a phenomenon that may not happen very often, but when they do occur they can result in symptoms such as pain in the inner ear, hypersensitivity to sounds (hyperacusis) or even tinnitus. Figure 8 shows a typical rise time of an impulsive shock (a click signal).
As shown, the energy of the signal rises by 40dB in less than 5 milliseconds and quickly falls back to -15 dB in less than 25 milliseconds.
In accordance with an embodiment of the invention, the detection and the subsequent removal of any impulsive shocks are implemented on a frame-by-frame basis without imposing extra delays in the end-to-end processing time of the rest of the system.
A neural network is trained to detect the presence of an impulsive shock or transient interference in a 16 millisecond frame, which is subsequently processed to remove the impulsive shock without adversely affecting the quality of speech. In one implementation of this invention a pre-trained Multi-Layer Perceptron (MLP) neural network is used to detect the presence of impulsive shocks in any given 16 millisecond frame. In this implementation a fully connected single-hidden-layer MLP (Figure 9) is chosen, where each neuron in the input layer is connected to all the neurons in the hidden layer, and each neuron in the hidden layer is connected to the output node, with no intra-layer connections in the input or the hidden layer. An MLP with one hidden layer is able to represent arbitrary functions which contain a continuous mapping from one finite space to another, and a two-hidden-layer MLP is rarely required; therefore in this implementation the MLP has one hidden layer. The function of the neurons in the input layer is purely distributive with no computational overhead, while the neurons in the hidden layer and the output layer perform the computation of Equation 11.
In one implementation, an input vector to the MLP comprises 128 normalised values in the time domain; however the input vector to the MLP can be any transformation of the time-domain frame, including a Fourier transformation, Cepstrum or LPC (Linear Predictive Coding) coefficients of the time-domain frame of the audio signal. In one implementation, the 128 normalised values of the input vector correspond to a 16 millisecond time window, i.e. each of the 128 values represents a signal amplitude at a discrete point in time over the 16 ms period. The output of the MLP is a single node incorporating a non-linear thresholding function (sigmoid function) with a gradient that approaches a hard-limiting (Heaviside) function for clear-cut detection and decision making in the pattern space. This can be achieved by increasing the gradient of the output (sigmoid) thresholding function as shown below. As k → ∞ the thresholding function tends towards the Heaviside function. The output of the MLP is a non-linear function of the weighted sum of all the inputs (Equation 11).
o_j = f(s_j),   where s_j = Σ_i w_j,i·x_i and f(s_j) = 1 / (1 + e^(-k·s_j)),   0 ≤ f(s_j) ≤ 1   (11)

where k is a positive constant that controls the spread of the function. Figure 10 shows the non-linear relationship of the sigmoid function of each neuron for different values of k (k = 0.5 and k = 5). Good results have been achieved with k = 0.5 for neurons in the hidden layer and k ≥ 5 for the output node.
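The forward pass of Equation 11 for the single-hidden-layer MLP can be sketched as below. The 128-sample input frame, the 17 hidden neurons and the k values are the figures given in this description, while the weight matrices are untrained placeholders.

    import numpy as np

    def sigmoid(s, k):
        """Equation 11 thresholding function; a larger k approaches a Heaviside step."""
        return 1.0 / (1.0 + np.exp(-k * s))

    def detect_impulsive_shock(frame, w_hidden, w_out, threshold=0.5):
        """MLP impulsive-shock detector over one 16 ms frame of 128 normalised samples.

        Returns True when the single output node tends towards 1 (shock present).
        """
        hidden = sigmoid(w_hidden @ frame, k=0.5)   # 17 hidden neurons
        output = sigmoid(w_out @ hidden, k=5.0)     # near-hard-limiting output node
        return bool(output[0] >= threshold)

    # Illustrative use with random (untrained) weights:
    rng = np.random.default_rng(1)
    w_hidden = rng.normal(size=(17, 128))
    w_out = rng.normal(size=(1, 17))
    is_shock = detect_impulsive_shock(rng.random(128), w_hidden, w_out)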
The sigmoid function acts as an automatic gain control, the steep slope provides high gain for small signals, and thus the net can accept large inputs and still remain sensitive to small changes. It is also continuously differentiable with the first derivative as a simple function of the output (Equation 12), an important computational consideration in the back-propagation training algorithm.
f'(s_j) = k·e^(-k·s_j) / (1 + e^(-k·s_j))^2 = k·f(s_j)·(1 - f(s_j)) = k·o_j·(1 - o_j)   (12)

The back-propagation learning algorithm for the MLP ensures that the weights are adapted to reduce the error each time. The solution to detecting impulsive shocks is to train a network to obtain a set of weights in the weight space that corresponds to two distinct regions in the pattern space. The points in one region represent patterns from class A (frames containing impulsive shocks), identified by the hard-limiting output value tending towards '1'; and those in the other region belong to class B (frames without impulsive shocks), identified by the hard-limiting output value tending towards '0'. Once the network is successfully trained, when a shock is detected in the frame the output of the MLP tends towards '1', otherwise it tends towards '0'.
Given the size of the input vector, and the output node of the MLP, the structure of the hidden layer is chosen such that the overall size of the network provides a good match between the structure of the underlying problem and the capacity of the network to solve the problem. The size of the hidden layer is chosen to be large enough to form a good model of the problem, and at the same time, small enough to provide a good generalization to the actual test data.
A number of rule-of-thumb methods were used to determine the optimum number of neurons in the hidden layer, including trial-and-error methods of "pruning" or "growing" the number of nodes in the hidden layer of the MLP. In one implementation, 32 neurons were initially selected and, through the process of pruning during training of the network, the optimum number was found to be 17 neurons, giving the best results with good generalisation.
The solution to identifying impulsive shocks in the specified time frame is to train the MLP to obtain a set of weights in the weight space that corresponds to a distinct region in the pattern space. The points in this region represent patterns from the class of input vectors containing impulsive shocks. As with the TDNN described earlier, the neural network is trained by: (i) applying input vectors that represent known examples of impulsive shock signals that it is intended the network should recognise, and (ii) applying input vectors that represent normal signals that do not contain an impulsive shock signal. For each of (i) and (ii), the network is trained by indicating the expected output for each of the input vectors: in case (i) the desired output is '1' and in case (ii) the desired output is '0'. The training phase is followed by a testing phase in which input vectors, representing signals that the network has not previously been trained with, are applied to the input of the network and the results provided by the network are compared with the desired results. After training, a set of about 100 test vectors previously unseen by the network was used to test the network. About 25% of the test vectors contained impulsive shocks. When a shock is present in the input vector (input frame) the MLP output tends towards '1', otherwise it tends towards '0'. The training and testing is performed offline as part of the development of the product. Once a network has been trained and tested satisfactorily, the set of weight values that provide that behaviour can be transferred to the network in module 50. It is important to note that, while the training phase for a neural network can be computationally intensive and time consuming, the evaluation phase has relatively low computational overhead.
Impulse Shock Removal (ISR)
The time domain impulse shock removal module 51 receives an input from the impulse shock detector 50 indicating the presence or absence of an impulsive shock. There are several ways in which the ISR can respond to the presence of an impulsive shock. In one implementation, if the status is true ('1'), the ISR replaces the corresponding (16 ms) frame with a previous frame. Further alternatives for concealing the portion of the signal containing the detected impulsive shock are to replace the portion of the signal containing the detected impulsive shock with a comfort noise or pure silence. These can be summarised as frame concealment approaches. In another implementation, the ISR replaces the corresponding frame with a combination (interpolation) of the two adjacent frames, i.e. the frame before and the frame after the frame in which the shock is present. The ISR module 51 informs the NSR module 44, via signal 10, to stop processing the frame that will be dropped or concealed by the ISR. This has an advantage of saving unnecessary processing operations, which can conserve battery life of the host device.
An impulsive shock is such that, in the time-domain, it will not exceed a maximum of 2 consecutive frames of 16 milliseconds. In other words, the maximum positive detection by the ISD 50 is limited to two consecutive detections. If the output of the ISD indicates 3 or more consecutive detections, then the shock cannot be classed as an impulsive shock with short duration. On receiving the third consecutive detection, the ISR 51 halts the frame concealment operation described above and waits until the next output value of '0' is received from the ISD 50. At the same time the signals 10 from ISR 51 to NSR 44 will also stop upon receiving the third consecutive detection from ISD 50.
This feature has the benefit of reliable detection of short duration impulsive shocks mixed with the voice signal and can completely remove the impulsive shock in real-time without adversely affecting speech quality.
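The concealment behaviour of modules 50 and 51 might be sketched as below; the concealment modes follow the alternatives listed above, while the comfort-noise level and the simple averaging used for interpolation are assumptions.

    import numpy as np

    def conceal_frame(prev_frame, next_frame=None, mode="repeat", noise_level=1e-3):
        """Replace a 16 ms frame flagged by the ISD (module 50) with a concealment frame."""
        if mode == "repeat":                           # repeat the previous clean frame
            return prev_frame.copy()
        if mode == "interpolate" and next_frame is not None:
            return 0.5 * (prev_frame + next_frame)     # blend the adjacent frames
        if mode == "comfort_noise":
            return noise_level * np.random.randn(len(prev_frame))
        return np.zeros_like(prev_frame)               # pure silence

    def should_conceal(consecutive_detections):
        """Impulsive shocks span at most two consecutive 16 ms frames; a third consecutive
        detection means the event is not a short-duration impulsive shock, so concealment
        (and signal 10 to the NSR) is halted."""
        return 0 < consecutive_detections <= 2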
Hearing Exposure Monitoring
It is now well established that using headsets at high volume levels can damage human hearing. The risk of hearing damage increases substantially when listening at high volume levels for extended periods of time. The Hearing Exposure Monitoring (HEM) function 30 provides digital sound level monitoring technology to accurately measure the accumulative sound pressure levels (SPL) and, if required, will either modify the input signal or alert the user that an exposure limit has been reached, to maximise human hearing protection. The HEM constantly calculates the overall sound output through the headset speakers and modifies the signal or gives a warning (an alert) when the total exposure through the headset reaches an acceptable or recommended level. Preferably, the HEM monitors an accumulated energy level over a 24hr period, although the recommended settings and the time period are fully adjustable and configurable and can be set to any pre-defined value.
There are a number of recommendations set by internationally recognised and well-established institutes and standards bodies around the world, some of which are listed below:
a. European Legislation for Noise, Directive 2003/10/EC - EC
b. International Telecom Union (ITU) - P360 - World Wide Recommendations
c. Health & Safety Executive (HSE) - UK
d. Institution of Occupational Safety and Health (IOSH) - UK
e. Royal National Institute for Deaf People (RNID) - UK
f. Australian Communications and Media Authority (ACMA) - Australia
g. Australian Communications Industry Forum (ACIF) - Australia
h. The National Institute for Occupational Safety and Health (NIOSH) - USA
i. The National Occupational Research Agenda (NORA) - USA
Figure 11 shows an embodiment of the HEM module 30. An audio signal is represented by a set of frequency sub-band data 6. In the integrated solution of Figure 2, the input 6 to the HEM 30 is a signal which has been processed to remove noise and shocks. The frequency sub-band data 6 is filtered with a telephony band pass filter 31 to remove any out-of-telephony-band noise. An example frequency response of the telephony band pass filter 31 is shown in Figure 12. The output of the filter 31 is applied to another filter 32 to generate the perception of loudness of the incoming signal.
Filter 32 can be an A-weighting filter, or any other convenient type of filter which achieves the purpose of indicating the perception of loudness. Next, two separate measurements are made of the energy in the audio signal. Firstly, module 33 calculates an accumulative total of energy over a period of time, such as an 8hr or 24hr period.
Secondly, module 34 calculates an average audio signal power over a much shorter time scale, typically of the order of seconds, e.g. a 2 second period. For the avoidance of confusion, the term "long term average power" is used because the timescale is long term compared to the fluctuations in the signal, but it will be appreciated that the time scale of the measurement made by module 34 is very much shorter than that of module 33. Modules 33, 34 perform their respective calculations on the sub-band data output from filter 32. An output signal 10 from the Impulse Shock Removal module 51 is applied to module 33 to indicate when an impulsive shock has been detected in the present buffer and that the shock will be concealed as explained in the previous section. In this case, module 33 replaces the energy of the buffer containing the impulsive shock with the energy of previous buffers, based on the concealment technique previously described, or by interpolating between measurements of adjacent frames. The measurements of modules 33, 34 are fed into the sub-band gain calculator module 35.
The sub-band gain calculator module 35 constantly monitors the accumulative input energy; it will then carry out one of the following actions based on whether HEM is configured to be in an active mode or a passive mode.
Active mode
In active mode, module 35 calculates a gain for each sub-band. The gain value is based on the rate of accumulated energy versus time (i.e. how fast the accumulated energy is rising) and a shorter-term indication of the recent level of the audio signal.
Figure 13 shows a graph of accumulated energy against time. The slope (gradient) of the accumulated energy versus time relationship is calculated. Figure 13 shows the measurement of the gradient at time T3 as the change in energy (ΔE) over the change in time (ΔT). In broad terms, the gain of the sub-bands should be inversely proportional in some way to the calculated gradient, so that a steeply rising gradient causes a reduced gain (higher level of attenuation) and a shallow gradient permits an increased gain (lower level of attenuation). By controlling gain from the outset (at accumulated time 0), the HEM is able to ensure that a user can work a full shift without exceeding the exposure guideline limit. The calculation of accumulated energy preferably totals energy across all sub-bands and calculates the gradient of the total accumulated energy. A more computationally-intensive alternative calculates a gradient per sub-band.
Module 35 is also responsive to the long term average signal power which, as described above, is the average level of the signal over a recent time period, such as the last 2 seconds. The sub-band gain is based on the gradient and the long-term average.
This has an advantage that during quiet periods of speech, which a user would have difficulty in hearing, the signal is attenuated to a small degree, or not attenuated at all, and during loud passages of speech the signal is attenuated to a higher degree. It is advantageous that gain values are not varied too quickly. Hysteresis thresholds define the audible long term average range for different bands of a speech signal. Figure 14 shows the long term average power of a speech signal over a period of time and indicates the hysteresis thresholds, which represent the minimum and maximum values for the hysteresis function of the kth sub-band. Figure 15 shows a hysteresis function. As the power increases and surpasses the minimum threshold (P_TH_Min), the function output is triggered and becomes active. The function remains active while the power exceeds the maximum threshold (P_TH_Max). If the power falls below the maximum threshold (P_TH_Max), the hysteresis output is disabled and stays disabled; it remains disabled if the power falls below the minimum threshold as well, but if the power comes back above the minimum threshold it becomes active again. Preferably, the long term average is calculated for each sub-band although, in an alternative embodiment, the long term average is calculated across all sub-bands.
G_k = f(V_k, P_LTA_k, T_0)   if the hysteresis output for sub-band k is active
G_k = 1                      otherwise                                     (13)
The gain calculator 35 calculates a gain for each frequency sub-band. The calculated gain at a particular point in time is inversely proportional to the gradient of accumulated energy in that band (V_k), the long-term average of audio signal power in that band (P_LTA_k) and the elapsed time since monitoring began (T_0). The gain value is calculated according to the described method if the hysteresis function output (Figure 15) for that specific band is active; otherwise the gain for that frequency band is unity.
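Equation (13) leaves the exact functional form of f open. The sketch below simply makes the gain fall as the gradient, the band's long-term average power and the elapsed monitoring time grow, which captures the inverse relationship described above; the particular formula, reference values and floor are assumptions for illustration only.

def subband_gain(v_k, p_lta_k, t_elapsed, t_limit, hysteresis_active,
                 v_ref=1.0, p_ref=1.0, min_gain=0.05):
    """Sketch of Equation (13): unity gain when the band's hysteresis gate is
    inactive, otherwise a gain that decreases with the accumulated-energy
    gradient, the long-term average power and the elapsed time."""
    if not hysteresis_active:
        return 1.0
    time_factor = 1.0 + t_elapsed / t_limit      # grows as the shift progresses
    gain = 1.0 / (1.0 + (v_k / v_ref) * (p_lta_k / p_ref) * time_factor)
    return max(min_gain, min(1.0, gain))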
Calculating gain values on the basis of the long-term average (LTA) maximises the possibility of listening to an audible voice signal over the longest period of time, while safely staying within a pre-defined amount of accumulated energy (e.g. representing a user's daily allowance). Calculating the gain values based on T_0 ensures that the signal is attenuated to a greater extent as the accumulated exposure nears the recommended exposure limit. This has the advantage of further reducing the possibility that a user will reach the exposure limit, and also allows a more sophisticated tailoring of the gain profile over time compared to the prior art.
Operating with frequency sub-bands has another significant advantage. There are two ways to achieve a particular level of exposure: 1. uniformly attenuate across all sub-bands by an amount X dB; 2. non-uniformly attenuate across the sub-bands, such that some sub-bands are attenuated by Y dB and some sub-bands are attenuated by Z dB, where Y > Z and where Z < X. Stated another way, certain frequency bands are sacrificed in favour of other frequency bands. Clearly, it is possible to use a more complicated attenuation profile with a wider range of attenuation settings. The non-uniform attenuation makes use of the fact that different frequency bands contribute to different parts of speech. As a broad rule, the lower frequency bands carry the more important information that is required for a general understanding of speech, while the upper frequency bands carry detail which can aid clarity of understanding. Non-uniform attenuation allows certain frequency bands to be attenuated more than others, to provide a user with the more useful spectral information at a good listening level, and at a higher listening level than would be provided by the blanket attenuation across all bands of option (1).
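One simple way to realise option (2) is to weight the lower, speech-critical bands more heavily when distributing a target output level across the bands. The sketch below works in linear power gains (convertible to dB); the 2:1 weighting, the band split and the helper name are illustrative assumptions.

import numpy as np

def nonuniform_gains(band_powers, target_total, n_low_bands, low_weight=2.0):
    """Distribute attenuation non-uniformly so lower bands stay closer to their
    input level while the total output power meets a target exposure level."""
    p = np.asarray(band_powers, dtype=float)
    w = np.ones_like(p)
    w[:n_low_bands] = low_weight              # protect the speech-critical bands
    scale = target_total / np.sum(p * w)      # overall level chosen to hit the target
    gains = np.minimum(w * scale, 1.0)        # never amplify; total may land below target
    return gains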
The non-uniform attenuation can be achieved by modifying the P_TH_Min and P_TH_Max values for each of the different sub-bands. The overall calculated gain vector for each sub-band is finally smoothed over time (Equation 14) to eliminate any undesirable amplitude fluctuation:
G_SM,k(n) = λ · G_SM,k(n-1) + (1 - λ) · G_k(n),   0 ≤ λ ≤ 1    (14)
where G_SM,k is the smoothed gain, G_k is the momentary calculated gain and λ is the smoothing factor.
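A direct transcription of Equation (14) into code is straightforward; the value of the smoothing factor below is an illustrative choice.

def smooth_gains(prev_smoothed, momentary, smoothing=0.9):
    """Equation (14): first-order recursive smoothing of the per-band gain
    vector to avoid audible jumps. smoothing (lambda) lies between 0 and 1."""
    return [smoothing * g_prev + (1.0 - smoothing) * g_now
            for g_prev, g_now in zip(prev_smoothed, momentary)]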
Although the weighting applied to the audio signal in the active mode is intended to prevent a user from reaching a recommended exposure limit, a warning can be issued to a user when a recommended exposure limit is reached. Various options for the warning are described below.
Passive Mode
In this mode there is no active restriction on the output of the transducer. The calculation of accumulated energy provided by module 33 is compared with a stored threshold value representing a recommended maximum exposure level. When the threshold value is exceeded, a warning 11 is provided to the user. The warning 11 could be in the form of an audible message, such as a tone, a tone sequence, or a short message such as "daily hearing exposure reached". In addition, or as an alternative, the exposure warning 11 can be provided to a user in visual form, such as a flashing light or LED (Light Emitting Diode), as a form of text or graphical display on the host device, or as a vibratory warning. The warning can be repeated at a pre-defined interval once the exposure limit is exceeded to regularly remind the user.
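A passive-mode sketch, assuming the accumulated energy from module 33 is polled periodically, is given below; the reminder interval and the notification callback are illustrative assumptions.

class ExposureWarning:
    """Passive mode: compare accumulated energy against a stored exposure limit
    and raise a repeating warning once it is exceeded."""

    def __init__(self, exposure_limit, remind_every_s=600, notify=print):
        self.exposure_limit = exposure_limit
        self.remind_every_s = remind_every_s
        self.notify = notify
        self.last_warning_time = None

    def check(self, accumulated_energy, now_s):
        if accumulated_energy < self.exposure_limit:
            return
        first = self.last_warning_time is None
        if first or (now_s - self.last_warning_time) >= self.remind_every_s:
            self.notify("daily hearing exposure reached")
            self.last_warning_time = now_s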
The time calculation of module 33 begins when the headset is first powered on and can automatically reset itself when 24 hours (or another required time interval) has elapsed following headset power-on.
In the system described here, the HEM module 30 is integrated into an overall system for noise and shock reduction, and operates on frequency sub-band data provided by other modules of the system. The HEM module 30 can also be implemented as a standalone module. In this variant, the HEM requires an FFT stage to derive the frequency sub-band data that the modules within the HEM can operate upon.
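For the standalone variant, the required FFT front-end amounts to grouping FFT bin powers into sub-bands for each frame. A minimal sketch follows; the band count, window and equal-width band edges are illustrative assumptions.

import numpy as np

def subband_powers(frame, n_bands=8):
    """Transform one frame of time-domain samples to per-sub-band powers."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spectrum) ** 2
    edges = np.linspace(0, len(power), n_bands + 1, dtype=int)
    return np.array([power[lo:hi].sum() for lo, hi in zip(edges[:-1], edges[1:])])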
In the embodiment described above the telephony band-pass filter 31 is shown as part of the Hearing Exposure Monitoring module 30. Alternatively, the telephony band-pass filter 31 can be incorporated within the Noise & Shock Reduction module 44, or as a preliminary stage immediately after FFT sub-band analysis module 21.
The Impulsive Shock Detector module 50 is described as operating on time-domain data. However, in an alternative embodiment, the time-domain audio signal data can be transformed to the frequency domain and the Impulsive Shock Detector module 50 can operate on the frequency domain data.
The invention is not limited to the embodiments described herein, which may be modified or varied without departing from the scope of the invention.

Claims (13)

  1. A method of reducing impulsive acoustic shocks in an audio signal comprising: receiving audio signal data representing an input audio signal; detecting an impulsive shock in the input audio signal by performing feature recognition on the audio signal data using an artificial neural network which has been configured to identify impulsive shocks; and, removing a detected impulsive shock from the audio signal data.
  2. A method according to claim 1 wherein the audio signal data is a time-domain representation of the input audio signal and the feature recognition is performed on the time-domain data.
  3. A method according to claim 1 wherein the audio signal data is a time-domain representation of the input audio signal and the method further comprises transforming the time-domain data to the frequency domain and performing the feature recognition on the frequency-domain data.
  4. A method according to any one of the preceding claims, wherein each input vector to the artificial neural network is a set of features from a frame of audio signal data.
  5. A method according to claim 4 wherein the frame of audio data has a duration of less than 20ms.
  6. A method according to any one of the preceding claims wherein the artificial neural network is a Multi-Layer Perceptron (MLP) network.
  7. A method according to claim 6 wherein the Multi-Layer Perceptron (MLP) network has a single hidden layer of neurons.
  8. A method according to any one of the preceding claims, further comprising removing a detected shock by one of: repeating a portion of the audio signal before the impulsive shock; interpolating signal values each side of the portion of the signal containing the detected impulsive shock; replacing the portion of the signal containing the detected impulsive shock by a comfort noise; replacing the portion of the signal containing the detected impulsive shock by pure silence.
  9. A method according to any one of the preceding claims further comprising generating an output for applying to a noise-reducing stage when an impulsive shock is detected, the output instructing the noise-reducing stage not to change calculated gain values.
  10. A method according to any one of the preceding claims further comprising generating an output for applying to a noise-reducing stage when an impulsive shock is detected, the output instructing the noise-reducing stage not to process a portion of audio signal data in which the impulsive shock has been detected.
  11. Software for performing the method according to any one of the preceding claims.
  12. Apparatus comprising a processor or computer configured to perform the method according to any one of claims 1 to 10.
  13. A communication device incorporating the apparatus of claim 12.
GB0723915A 2007-12-07 2007-12-07 Impulsive shock detection and removal Withdrawn GB2456297A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0723915A GB2456297A (en) 2007-12-07 2007-12-07 Impulsive shock detection and removal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0723915A GB2456297A (en) 2007-12-07 2007-12-07 Impulsive shock detection and removal

Publications (2)

Publication Number Publication Date
GB0723915D0 GB0723915D0 (en) 2008-01-16
GB2456297A true GB2456297A (en) 2009-07-15

Family

ID=38983117

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0723915A Withdrawn GB2456297A (en) 2007-12-07 2007-12-07 Impulsive shock detection and removal

Country Status (1)

Country Link
GB (1) GB2456297A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000275096A (en) * 1999-03-25 2000-10-06 Sumitomo Electric Ind Ltd Sound-source-kind discrimination apparatus
US20050058274A1 (en) * 2003-09-11 2005-03-17 Clarity Technologies, Inc. Acoustic shock prevention
WO2005051039A1 (en) * 2003-11-24 2005-06-02 Widex A/S Hearing aid and a method of noise reduction
WO2006073609A1 (en) * 2004-12-30 2006-07-13 Plantronics, Inc. Sound pressure level limiter with anti-startle

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9628897B2 (en) 2013-10-28 2017-04-18 3M Innovative Properties Company Adaptive frequency response, adaptive automatic level control and handling radio communications for a hearing protector
EP2963817A1 (en) * 2014-07-02 2016-01-06 GN Netcom A/S Method and apparatus for attenuating undesired content in an audio signal
CN105282656A (en) * 2014-07-02 2016-01-27 Gn奈康有限公司 Method for attenuating undesired content in an audio signal and signal conditioning apparatus
CN105282656B (en) * 2014-07-02 2019-10-01 Gn奈康有限公司 Method and signal conditioning apparatus for attenuating undesired content in an audio signal
US9779753B2 (en) 2014-07-02 2017-10-03 Gn Netcom A/S Method and apparatus for attenuating undesired content in an audio signal
EP3534625A1 (en) * 2015-12-23 2019-09-04 GN Hearing A/S A hearing device with suppression of sound impulses
US10362413B2 (en) 2015-12-23 2019-07-23 Gn Hearing A/S Hearing device with suppression of sound impulses
US9930455B2 (en) 2015-12-23 2018-03-27 Gn Hearing A/S Hearing device with suppression of sound impulses
EP3185587A1 (en) * 2015-12-23 2017-06-28 GN Resound A/S Hearing device with suppression of sound impulses
US11350224B2 (en) 2015-12-23 2022-05-31 Gn Hearing A/S Hearing device with suppression of sound impulses
EP3603043B1 (en) 2017-03-21 2022-09-14 Sycurio Limited Telephone signal processing
CN110160623A (en) * 2018-04-24 2019-08-23 北京机电工程研究所 Hanging and measurement combined system
CN109817241A (en) * 2019-02-18 2019-05-28 腾讯音乐娱乐科技(深圳)有限公司 Audio-frequency processing method, device and storage medium
WO2021242570A1 (en) * 2020-05-29 2021-12-02 Starkey Laboratories, Inc. Hearing device with multiple neural networks for sound enhancement

Also Published As

Publication number Publication date
GB0723915D0 (en) 2008-01-16

Similar Documents

Publication Publication Date Title
KR101461141B1 (en) System and method for adaptively controlling a noise suppressor
US20120263317A1 (en) Systems, methods, apparatus, and computer readable media for equalization
GB2456296A (en) Audio enhancement and hearing protection by producing a noise reduced signal
US20100142714A1 (en) Method and system for acoustic shock protection
US9343073B1 (en) Robust noise suppression system in adverse echo conditions
CA2404024A1 (en) Spectrally interdependent gain adjustment techniques
GB2456297A (en) Impulsive shock detection and removal
US20130156208A1 (en) Hearing aid and method of detecting vibration
JP2012524917A (en) System, method, apparatus and computer readable medium for automatic control of active noise cancellation
CN114175152A (en) System and method for enhancing degraded audio signals
JP2003500936A (en) Improving near-end audio signals in echo suppression systems
US10204637B2 (en) Noise reduction methodology for wearable devices employing multitude of sensors
EP1279163A1 (en) Speech presence measurement detection techniques
Premananda et al. Speech enhancement algorithm to reduce the effect of background noise in mobile phones
EP4115413A1 (en) Voice optimization in noisy environments
CA2401672A1 (en) Perceptual spectral weighting of frequency bands for adaptive noise cancellation
US11694708B2 (en) Audio device and method of audio processing with improved talker discrimination
EP4258263A1 (en) Apparatus and method for noise suppression
Premananda et al. Selective frequency enhancement of speech signal for intelligibility improvement in presence of near-end noise
Premananda et al. Speech enhancement to overcome the effect of near-end noise in mobile phones using psychoacoustics
US20240221769A1 (en) Voice optimization in noisy environments
Premananda et al. Speech enhancement using temporal masking in presence of near-end noise
Choy et al. Subband-based acoustic shock limiting algorithm on a low-resource DSP system.
Pandey et al. Offending frequency suppression with a reset algorithm to improve feedback cancellation in digital hearing aids
Schuldt et al. A combined implementation of echo suppression, noise reduction and comfort noise in a speaker phone application

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)