GB2580655A - Reducing a noise level of an audio signal of a hearing system


Info

Publication number
GB2580655A
Authority
GB
United Kingdom
Prior art keywords
audio signal
phonemes
probable
ambient audio
phoneme
Prior art date
Legal status
Withdrawn
Application number
GB1900803.6A
Other versions
GB201900803D0 (en)
Inventor
Roeck Hans-Ueli
Current Assignee
Sonova Holding AG
Original Assignee
Sonova AG
Priority date
Filing date
Publication date
Application filed by Sonova AG filed Critical Sonova AG
Priority to GB1900803.6A
Publication of GB201900803D0
Publication of GB2580655A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain


Abstract

Noise reduction for audio signals in a hearing device is achieved by detecting an n-gram of phonemes (e.g. "iz" at time slots T-2 and T-1) in an ambient audio signal 14 (dotted line), predicting a probable phoneme and its expected audio signal 38 (solid line) for a subsequent time slot T (in this case a silent pause after "is"), determining the deviation 40 between the observed and expected audio properties (e.g. spectra or frequency-band components) in that time slot (here a large deviation, possibly from a cough or environmental noise after "is"), and attenuating the signal in the time slot based on the deviation (i.e. "forcing" the predicted silent pause). Probable sequences may be constructed for multiple time slots, phonemes may be discarded, and the language model may be user-selected.

Description

Reducing a noise level of an audio signal of a hearing system
FIELD OF THE INVENTION
The invention relates to a computer-implemented method, a computer program and a computer-readable medium for reducing a noise level of an audio signal of a hearing system. Furthermore, the invention relates to a hearing system with a hearing device.
BACKGROUND OF THE INVENTION
Hearing devices, as part of a hearing system, are generally small and complex devices. Hearing devices may include a processor, microphone, speaker, memory, housing, and other electronic and mechanical components. Examples of hearing devices are Behind-The-Ear (BTE), Receiver-In-Canal (RIC), In-The-Ear (ITE), Completely-In-Canal (CIC), and Invisible-In-The-Canal (IIC) devices. A user may prefer one of these hearing devices over another based on hearing loss, aesthetic preferences, lifestyle needs, and budget.
A challenge for hearing-impaired users is often insufficient quality of the voice signal caused by background or ambient noise. Particularly in areas where a number of people are talking, or where strong background noise from traffic, construction sites, music, machines, or similar sources is present, speech intelligibility may be significantly decreased. Different solutions are known for cancelling or attenuating the noise in an audio signal to achieve better detectability of a target signal. One of the challenges is to distinguish between the target signal and the underlying noise signal. Noise is often present as a background signal which mostly has a lower amplitude than the target signal; such background noise is also known as the noise floor underlying a target signal.
A method to determine an SNR of a signal is to measure the long-term lowest signal amplitude in a particular frequency band and assign that value to the noise floor amplitude. This means that any signal amplitude higher than that noise floor amplitude may be associated with the target signal. However, such a model assumes a rather constant background noise level. Especially in cases where background noise is present with higher or alternating amplitudes, such known solutions may not work sufficiently well to provide effective noise cancellation.
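As a minimal sketch of this long-term-minimum approach (in Python; the function names, window length and dB conversion are illustrative choices, not taken from the patent):

```python
import numpy as np

def estimate_noise_floor(band_amplitudes, window=100):
    """Track the long-term minimum amplitude in one frequency band.

    band_amplitudes: per-frame amplitudes of a single band. The noise floor
    at frame t is the minimum over a sliding window of past frames.
    """
    x = np.asarray(band_amplitudes, dtype=float)
    floor = np.empty_like(x)
    for t in range(len(x)):
        floor[t] = x[max(0, t - window + 1):t + 1].min()
    return floor

def snr_estimate_db(band_amplitudes, floor, eps=1e-12):
    # Anything above the tracked floor is attributed to the target signal.
    x = np.asarray(band_amplitudes, dtype=float)
    return 20.0 * np.log10((x + eps) / (floor + eps))
```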
Given an SNR estimate, a Wiener filter may be used to apply a specific rule that determines an attenuation value for the audio signal in a particular frequency band and a particular time slot. In particular, such filters apply a certain attenuation in conditions where bad or low signal-to-noise ratios are present. In turn, when the signal quality increases, the SNR increases accordingly and less or no attenuation is applied to the audio signal. Given the imperfect SNR estimation described above, such filters may provide only limited performance.
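For illustration, one classic form of such a rule is the Wiener gain G = SNR/(SNR + 1) with the SNR as a linear power ratio; a short sketch:

```python
import numpy as np

def wiener_gain(snr_db):
    """Classic Wiener gain G = SNR / (SNR + 1), with SNR as a linear power ratio."""
    snr_lin = 10.0 ** (snr_db / 10.0)
    return snr_lin / (snr_lin + 1.0)

# Low SNR -> strong attenuation; high SNR -> gain approaches 1 (0 dB).
for snr_db in (-5, 0, 5, 15):
    print(f"SNR {snr_db:>3} dB -> gain {20 * np.log10(wiener_gain(snr_db)):6.1f} dB")
```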
US 7 020 581 B2 describes an analysis system in which speech recognition is performed by a computer algorithm. US 7 363 221 B2 discloses a method for estimating a clean speech signal for use in a pattern recognition system. US 8 615 393 B2 describes a speech recognition system providing an iterative training method for a noise reduction system.
DESCRIPTION OF THE INVENTION
It is an objective of the invention to provide a more effective noise cancellation with increased performance.
This objective is achieved by the subject-matter of the independent claims. Further exemplary embodiments are evident from the dependent claims and the following description.
The subject-matter of the invention may be based on the following considerations: when noise suppression is to be improved, one of the challenges is the reliable differentiation between background noise (or noise floor) and the wanted target signal. The prior-art solutions work sufficiently well only if well-detectable differences between the amplitudes of the noise signal and the wanted target signal are present. However, in typical application scenarios of a hearing system, background noise may have similar or even higher amplitude levels than the target signal. Known solutions may therefore provide only insufficient performance.
One challenge may be seen in the fact that a noise cancellation system has no information about relevant target-signal statistics with which to reliably identify the target signal. The underlying idea is to apply a reliable statistical model in order to be able to predict properties of the target signal in an upcoming time slot. In particular, irrelevant speech portions contained in the background noise could also be removed or at least attenuated.
Therefore, in a first aspect of the invention, a method for reducing a noise level of an audio signal of a hearing system is proposed.
The hearing system may comprise one or more devices helping a hearing-impaired user to better recognize and understand speech or any other audio information. The hearing system may comprise one or more hearing devices to be worn at or in a user's ear. The one or more hearing devices may be one or more hearing aids adapted for compensating a hearing loss of the user.
The hearing system and/or a hearing device further may comprise a signal processing unit configured to apply an attenuation to an audio signal. A signal processing unit may be described as analogue or digital means for modifying an audio signal, in particular changing a signal strength or amplitude and/or frequency properties.
The signal processing unit may also be configured to attenuate (or at least not overly amplify) an audio signal depending on certain frequency bands. The attenuation aims to suppress unwanted audio signals by lowering their amplitude and signal energy relative to wanted audio signals. Such attenuation may be frequency-selective, meaning that only certain frequencies or a certain frequency band is attenuated. Furthermore, the level of attenuation may also be adapted over time, allowing an adaptation to different background noise levels and audio signal characteristics.
In a first step of the method, an ambient audio signal is acquired. Such an ambient audio signal may be any external audio signal captured from the surrounding area, or taken from an audio input source such as connected external devices (smartphones, TV sets), external services, storages, databases, microphones and/or sound generators. Put differently, the ambient audio signal may be understood as the relevant audio signal which has to be recognized and whose speech content has to be understood by the user. The ambient audio signal may comprise human speech in a particular language. A language may be described as a typical spoken language such as English, Spanish or German, possibly also considering certain accents within a language. A language may therefore be understood as a consistent set of words with specific pronunciations for human communication.
In a next step, an n-gram of one or more phonemes of the ambient audio signal is detected. A phoneme may be described as a phonetic unit of sound of a human voice which constitutes spoken words in a particular language. The set of phonemes may be specific to every language and may also be specific to subtypes or accents of a language. The number of phonemes and/or n-grams in any language may be limited and may be stored in a database. Also, the spectral and/or temporal properties of a phoneme or an n-gram may be stored in a database.
The term "n-gram" may stand for a plurality of consecutive phonemes, or even a single phoneme. Phonemes may be seen as existing not as stand-alone objects but mostly in combination. Forming words from phonemes may have to consider the transfer effects from one phoneme to the next. A phoneme may also be noise or silence. Examples of n-grams are a biphone (two concatenated phonemes) and a triphone (three concatenated phonemes).
In a following step, a probable subsequent phoneme is determined based on the detected n-gram of the ambient audio signal. In other words, a next possible or likely phoneme is predicted based on the previously detected phonemes. For instance, given a noisy speech audio signal and a library/database of known n-grams, for instance triphones, it may be estimated which phoneme is most probably going to be vocalized in the next time slot or a number of next time slots. As the number of possible combinations of phonemes and n-grams is limited, only a limited number of phonemes are likely to appear in an upcoming time slot. The detection of the n-gram in the ambient audio signal and the determination of the probable subsequent phoneme may be based on a language model database of the language. This may be, for instance, a database or library with stored words, phonemes, n-grams and/or combinations thereof.
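A toy sketch of such a prediction by lookup; the trigram entries and probabilities below are invented for illustration, loosely following the "is" → pause example used later in Fig. 8:

```python
# Toy trigram model: a two-phoneme context maps to candidate next phonemes
# with probabilities. A real language model database would be trained on a
# large corpus; these entries are invented for illustration only.
TRIGRAMS = {
    ("I", "z"): {"<pause>": 0.55, "t": 0.20, "b": 0.15, "D": 0.10},
    ("s", "k"): {"aI": 0.60, "u": 0.25, "r": 0.15},
}

def predict_next(context):
    """Return candidate next phonemes, ordered from most to least probable."""
    candidates = TRIGRAMS.get(tuple(context), {})
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

print(predict_next(["I", "z"]))  # most probable: a pause after "is"
```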
In a next step, expected probable audio signal properties for a time slot after the current n-gram of phonemes are deduced based on the determined probable subsequent phoneme. In other words, once a certain probable phoneme is chosen, its frequency spectrum, signal amplitude and/or temporal properties may be retrieved or calculated, as this phoneme is usually pronounced in the same or a similar way by most persons. This means that physical audio signal information in various aspects is available and/or may be stored for further processing. These expected audio signal properties may be provided only for a certain time slot within the duration of a phoneme, meaning that the time slot may be located in time within the timeframe defined by the probable subsequent phoneme.
In a next step, the ambient audio signal in that time slot is analyzed, generating associated observed ambient audio signal properties. In other words, for the same aspects as the expected audio signal properties (gain, amplitude, frequency, etc.), the corresponding properties are generated for the ambient audio signal in that same time slot.
Further, related to the time slot, a deviation and/or difference of the observed ambient audio signal properties, acquired in the previous step, from the expected audio signal properties is determined. In other words, information may be provided on the degree to which the two sets of signal properties differ from each other. This may be represented in the form of a dataset, values, physical control factors such as an electric signal, or similar means. Deviation may also mean differences in frequency-specific signal strength or amplitude.
If the observed audio signal properties determined from the actually measured ambient audio signal significantly differ from the expected audio signal properties, it is very likely that the ambient audio signal does not belong to the target signal. As an example, if no spectral energy is expected for the next phoneme but a significant ambient audio signal is detected, then a high probability may be assumed that the signal does not belong to the target speech signal.
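As a sketch, the per-band deviation could be computed as a simple level difference; the numbers below mimic the cough-during-an-expected-pause case:

```python
import numpy as np

def band_deviation_db(observed_db, expected_db):
    """Per-band deviation between observed and expected levels (dB).

    Large positive values mean energy appears where the predicted phoneme
    expects little -- such a band likely does not carry the target signal.
    """
    return np.asarray(observed_db) - np.asarray(expected_db)

# Example: near-silence is expected (a pause), but a cough shows up.
expected = np.array([-60.0, -55.0, -58.0])  # predicted levels per band
observed = np.array([-20.0, -25.0, -40.0])  # measured ambient levels
print(band_deviation_db(observed, expected))  # -> [40. 30. 18.]
```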
An attenuation of the signal processing unit may be adjusted based on the determined deviation for the relevant time slot. This may mean that the degree of attenuation, for instance measurable in dB, additionally depends on an underlying language model, rather than the audio signal being processed purely with prior-art SNR determination methods.
According to an embodiment of the invention, adjusting the attenuation comprises increasing the attenuation of the ambient audio signal when the determined deviation of the two audio signals increases. Put differently, the more the ambient audio signal properties differ from the expected audio signal properties, the more likely it is that the ambient audio signal does not belong to the target speech signal. Consequently, the ambient audio signal may be attenuated more strongly, as it is likely associated with background noise.
According to an embodiment of the invention, the determined subsequent probable phoneme is an n-gram of n sequential phonemes. This may mean that not only one single next phoneme is determined or predicted but a sequence of several phonemes. The several phonemes may, for example, constitute one or more n-grams like biphones or triphones. For instance, if a certain sequence of n-grams or phonemes has been detected and recognized in the previous time slots, there may be a certain probability that a particular sequence of phonemes or n-grams is likely to follow in the subsequent time slots. A language may contain typical sentences or combinations of words and n-grams, which are very frequently used and which are therefore more likely to appear in next time slots.
According to an embodiment of the invention, determining a probable subsequent phoneme comprises a generation of a plurality of probable phonemes and/or n-grams with associated probabilities. In other words, for a next or subsequent time slot, a plurality of possible probable next phonemes is determined. An associated probability may be assigned to each of the identified phonemes. This may take the form of a list or table with a number of probable phoneme candidates, including the probability values for their appearance.
According to an example, the degree or strength of the attenuation may be based on the probability of the phoneme expected in the ambient audio signal. This may mean that an audio signal whose acoustical properties differ from a most probable phoneme with an associated probability of 90% is suppressed by applying a higher attenuation, for example 80% of full attenuation. In contrast, when the most probable phoneme has only a 30% probability, only e.g. 10% of full attenuation may be applied, only partly suppressing the ambient audio signal in the related time slot.
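One possible mapping consistent with these two operating points (the linear interpolation between them is an assumption; the patent only gives the two points):

```python
def attenuation_fraction(p_expected, p_lo=0.3, f_lo=0.1, p_hi=0.9, f_hi=0.8):
    """Fraction of full attenuation applied to a mismatching signal.

    Chosen to reproduce the two operating points from the text: a 90%-probable
    expected phoneme -> 80% of full attenuation, a 30%-probable one -> 10%.
    """
    slope = (f_hi - f_lo) / (p_hi - p_lo)
    return min(1.0, max(0.0, f_lo + slope * (p_expected - p_lo)))

print(attenuation_fraction(0.9))  # ≈ 0.8
print(attenuation_fraction(0.3))  # ≈ 0.1
```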
According to an embodiment of the invention, when the associated audio signal properties are deduced, the corresponding probabilities of the audio signal properties are calculated. This calculation may be based on the determined probabilities of the probable phonemes.
According to an embodiment of the invention, the audio signal represents a limited frequency band out of a full frequency spectrum of the audio signal. Analysis and comparison of expected and observed signal properties may be performed in sub-bands of the full frequency spectrum, and the attenuation may be applied to the associated frequency band. In other words, the full or a part of the spectrum of the ambient audio signal may be divided into a predefined number of frequency slots. The determination of the expected audio signal properties, the observed ambient audio signal properties and the corresponding attenuation may be performed in frequency bands.
The frequency slots may have the same width or may alternatively be divided into slots of different widths. According to an example, the slot width may be defined narrower in lower frequency ranges than in higher frequency ranges. The advantage may be that the accuracy of phoneme or n-gram recognition is improved.
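A sketch of one way to obtain narrower slots at low frequencies, using logarithmic spacing; the band count and frequency range are illustrative:

```python
import numpy as np

def band_edges(f_min=100.0, f_max=8000.0, n_bands=16):
    """Logarithmically spaced band edges: narrow slots at low frequencies,
    wider slots at high frequencies. One option; equal widths also work."""
    return np.geomspace(f_min, f_max, n_bands + 1)

widths = np.diff(band_edges())
print(f"lowest band ~{widths[0]:.0f} Hz wide, highest ~{widths[-1]:.0f} Hz wide")
```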
The before-mentioned steps of determining phonemes, n-grams, probabilities, probable next phonemes and so on may be applied simultaneously to several or even all frequency slots at the same time.
According to an embodiment of the invention, detecting a new phoneme of the ambient audio signal comprises detecting a sequence of n consecutive phonemes, and the determination of a probable subsequent phoneme is based on this sequence of phonemes. In other words, if an improved prediction of a subsequent phoneme is desired, not only one earlier phoneme may be detected and recognized, but a sequence or plurality of phonemes over a number of elapsed time slots. Given a language model, the number of probable upcoming phonemes may be lowered, as the context is better defined by a higher number of known previous phonemes or n-grams in the past time slots.
According to an embodiment of the invention, determining a probable subsequent phoneme comprises the step of discarding the first phoneme of the n consecutive phonemes. The determining of the probable subsequent phoneme is then based on the remaining (n-1) detected phonemes of the ambient audio signal. For example, assume a triphone containing three phonemes has been detected. The first phoneme of the triphone is discarded and the remaining biphone is used to predict what the upcoming triphone may look like. As two of the three phonemes of the upcoming triphone are then already known, the number of choices in a triphone or n-gram library of a particular language is in many cases reduced to only a few.
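A sketch of this sliding-context step, with phoneme symbols written informally:

```python
def slide_context(triphone):
    """Discard the first phoneme of a detected triphone; the remaining
    biphone is the known prefix of the next triphone to be predicted."""
    return triphone[1:]

# Triphone ("s", "k", "aI") was just detected (cf. "sky"). Only triphones
# starting with ("k", "aI") remain as candidates, so the search space in
# the n-gram library shrinks drastically.
print(slide_context(("s", "k", "aI")))  # ('k', 'aI')
```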
According to an embodiment of the invention, expected audio signal properties are generated for a plurality of subsequent time slots. An advantage may be seen in the possibility of performing several recognition or detection cycles, iteratively improving the prediction quality for the next phoneme or n-gram. For example, if the possible next upcoming phonemes are known, including their associated probabilities, the related audio signal properties may be derived for a number of upcoming time slots. Knowing such an n-gram and its associated audio signal properties, the prediction quality of the audio signal properties of the ambient audio signal in the time slot may be adjusted and become more accurate. In an example, an expected audio signal may also be zero, representing no signal or silence.
According to an embodiment of the invention, the language model database is selectable by the user. This selection may be done using a user interface of the hearing system, for example of the hearing device. According to another example, the language model may be selected based on a language recognition analyzing n-grams in the ambient audio signal. In other words, an automatic language and/or accent detection may be applied, wherein, for instance, a histogram or another statistical model of the detected n-grams is used as a basis to detect a certain language.
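A crude sketch of such histogram-based language recognition; the per-language n-gram profiles are invented for illustration, and a real system would use a proper statistical model rather than this simple overlap score:

```python
from collections import Counter

# Invented per-language n-gram frequency profiles, for illustration only.
PROFILES = {
    "en": Counter({("I", "z"): 25, ("D", "@"): 30, ("s", "k"): 10}),
    "de": Counter({("I", "C"): 22, ("d", "i"): 28, ("S", "t"): 15}),
}

def detect_language(observed_ngrams):
    """Pick the language whose profile overlaps most with the observed
    n-gram histogram."""
    hist = Counter(observed_ngrams)
    def overlap(lang):
        return sum(min(c, PROFILES[lang][g]) for g, c in hist.items())
    return max(PROFILES, key=overlap)

print(detect_language([("I", "z"), ("I", "z"), ("s", "k")]))  # -> en
```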
According to an embodiment of the invention, the detection of the n-grams of phonemes in the ambient audio signal is based on one of the Hidden Markov model, Bayesian estimation model, neural networks (especially convolutional neural networks and/or recurrent neural networks), Mel-frequency cepstral coefficients, Bark spectra or feature vector methods. According to an example, the above-mentioned detection or recognition methods may be combined for better recognition quality. As an example, a neural network may be fed with Mel-frequency cepstral coefficients and output either the next expected phoneme with an associated probability/certainty, and/or the signal properties in the different frequency bands of the next phoneme, and/or a band-specific attenuation directly.
Further aspects of the invention relate to a computer program for operating a hearing system which, when being executed by a processor, is adapted to carry out the steps of the method as described above and in the following, as well as to a computer-readable medium in which such a computer program is stored.
For example, the computer program may be executed in a processor or a signal processing unit of the hearing system or a hearing device, which hearing device may, for example, be worn by the person behind the ear.
The computer-readable medium may be a memory of this hearing device or hearing system. The computer program also may be executed by a processor or signal processing unit of a connected device and the computer-readable medium may be a memory of a connected device. It also may be that steps of the method are performed by the hearing device and/or the hearing system, and other steps of the method are performed by other devices of the hearing system or other components of the described hearing system.
According to an example, the execution of the steps of the method, or a subset of steps, may also be executed by a cloud-based processing system which interacts with the hearing system.
In general, a computer-readable medium may be a hard disk, a USB (Universal Serial Bus) storage device, a RAM (Random Access Memory), a ROM (Read Only Memory), an EPROM (Erasable Programmable Read Only Memory) or a FLASH memory. A computer-readable medium may also be a data communication network, e.g. the Internet, which allows downloading a program code. The computer-readable medium may be a non-transitory or transitory medium.
It has to be understood that features of the method as described in the above and in the following may be features of the computer program, the computer-readable medium and the hearing system as described in the above and in the following, and vice versa.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
Below, embodiments of the present invention are described in more detail with reference to the attached drawings.
Fig. 1 shows a hearing system with different audio sources according to an embodiment of the invention.
Fig. 2 shows a diagram of a typical audio signal amplitude with a target signal and a noise floor over a time axis.
Fig. 3 shows a function of a given signal-to-noise ratio and a corresponding attenuation.
Fig. 4 shows a phonemic chart of English phonemes.
Fig. 5 shows an example of n-grams constituting an English sentence.
Fig. 6 shows a method for reducing a noise level of an audio signal of a hearing system according to an embodiment of the invention.
Fig. 7 shows an attenuation function over signal-to-noise ratio, probability and gain.
Fig. 8 shows a diagram of amplitudes of an expected audio signal and an ambient audio signal over time.
The reference symbols used in the drawings, and their meanings, are listed in summary form in the list of reference signs. In principle, identical parts are provided with the same reference symbols in the figures.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS
Fig. 1 shows a simplified example of a hearing system 10 for a hearing-impaired user. A hearing system 10 may be configured to improve the hearing capabilities of the hearing-impaired user by modifying or altering an audio signal so that it is better recognized by the user.
The hearing system 10 may comprise one or more hearing devices 11, which may be worn behind and/or in the user's ear. Additionally, the hearing system 10 may comprise a smartphone 18 or another mobile device with large processing capabilities.
The hearing device 11 may comprise a microphone 12, a selection and mixing unit 16, a processing unit 22 and an output unit 28, such as a loudspeaker.
The microphone 12 may capture physical ambient audio and noise and may generate a corresponding ambient audio signal 14. Other audio sources may be possible, too.
Separately or simultaneously, different other audio signals may be captured and provided as the ambient audio signal 14 for further processing. The selection and mixing unit 16 may be included to provide means for selecting a desired audio source and/or mixing different audio signals. An example of an ambient audio signal source is a telephone or smartphone 18. Such signals may, for instance, be linked into the hearing system 10 using different wireless protocols. In another example, a TV set 20 may serve as an audio source providing an audio signal to be mixed or forwarded in the hearing system 10. In general, all audio sources from which human speech may be detected or extracted may be considered suitable ambient audio signal sources.
The selection and mixing unit 16 provides a consolidated ambient audio signal 14 to the signal processing unit 22. This signal processing unit 22 is designed and configured to attenuate the incoming ambient audio signal 14. This may be implemented using analogue electronic components and/or may also be implemented by using microprocessors or microcontrollers with peripheral components.
The processing unit 22 may apply different levels of attenuation to different frequency bands. In other words, the processing unit 22 may be configured to define frequency bands or frequency sub-bands, which may all be treated separately and independently in terms of an applied signal attenuation. The frequency bands may be defined with an equal bandwidth or frequency span, but may also have different frequency ranges and frequency spans allowing a higher resolution of specific attenuation in critical frequency ranges, which are essential for intelligibility and recognition of speech.
The hearing system 10 further comprises a language model database 24. This language model database 24 may contain detailed information about one or more languages. This may include libraries of phonemes 36, n-grams, words, grammar information, the frequency of occurrence of certain words in that particular language, and other language-specific information. The language model database 24 may be linked to the signal processing unit 22. This may mean that the signal processing unit 22 may request specific information and the language model database 24 returns language-specific data and information for processing by the signal processing unit 22. The language model database 24 may be stored in the hearing device 11, in the smartphone 18 and/or on a remote server, which may be linked to the hearing system 10 via the Internet.
After processing the ambient audio signal 14, in particular attenuating that signal, the signal processing unit 22 provides an output audio signal 26 to the audio output unit 28. In the example shown in Fig. 1, the audio output unit 28 is presented as a loudspeaker symbol. However, the output unit 28 symbolically stands for all means capable of providing a physical audio signal to a user, where the user is able to physiologically capture and recognize this output audio signal 26. Besides sound and tone output devices, cochlear implants and similar devices are also possible.
Fig. 2 shows a diagram of the amplitude of an ambient audio signal 14 over a time axis t. The signal amplitude P may be understood as signal strength or power level in watts, milliwatts, dB SPL or the like. Additionally, a frequency axis is indicated as a third dimension.
For better clarity, the graph is shown for one specific frequency only. Every ambient audio signal captured in a typical environment of a user includes a certain portion of noise. Such noise may have very different causes, for example traffic, construction, the surrounding nature, background conversations, but also noise generated by electronic processing and electronic or electrical components. In many cases, such background noise 32 has a low signal amplitude level. Known methods try to distinguish between the noise signal 32 and the target signal 30 by determining the lowest amplitude of the audio signal 14. It is then assumed that signal portions with amplitudes equal to or lower than that minimum amplitude are very likely noise 32.
Such algorithms of noise detection assume that the noise floor 32 has a more or less constant amplitude over time.
Now, for the determination of a signal-to-noise ratio (SNR), the difference between the amplitude of the noise floor 32 and the amplitude of the audio signal 14 is taken and the corresponding signal-to-noise ratio is calculated. In general, a high signal-to-noise ratio means that the target signal 30 is significantly stronger than the underlying noise signal 32, leading to better detectability and a lower error rate when recognizing and capturing the audio signal 14. The noise floor 32 may be determined for every frequency or frequency band, as indicated by the frequency axis in Fig. 2.
Referring to Fig. 3, a diagram of the attenuation of an ambient audio signal 14 over an SNR axis is shown. SNR stands for the signal-to-noise ratio of the ambient audio signal 14 and is measured in dB. Low SNR values mean that the signal quality is rather low. In the example of Fig. 3, if the SNR is lower than 2 dB, a maximum attenuation of -9 dB is applied. With an increasing SNR value, indicating a better or improving signal quality, the attenuation level decreases linearly down to 0 dB at an SNR of 11 dB. No attenuation (0 dB) is applied for SNR values higher than 11 dB.
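This curve can be read as a piecewise-linear rule; a sketch using the breakpoints just stated:

```python
def snr_attenuation_db(snr_db, max_atten=-9.0, snr_lo=2.0, snr_hi=11.0):
    """Piecewise-linear rule read off Fig. 3: full attenuation (-9 dB) below
    2 dB SNR, no attenuation above 11 dB, and a linear ramp in between."""
    if snr_db <= snr_lo:
        return max_atten
    if snr_db >= snr_hi:
        return 0.0
    return max_atten * (snr_hi - snr_db) / (snr_hi - snr_lo)

for snr in (0.0, 2.0, 6.5, 11.0, 20.0):
    print(f"SNR {snr:4.1f} dB -> attenuation {snr_attenuation_db(snr):5.1f} dB")
```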
Fig. 4 shows a known chart of phonemes 36 of the English language. The number of phonemes 36 in any language is limited. Such phonemes 36 may be stored in a language model database 24. Phonemes 36 may be used for text-to-speech conversion, where text information is translated into synthetic spoken language. Phonemes 36 particularly capture the pronunciation aspects of a language. For every phoneme 36, a corresponding physical audio signal may be derived. In other words, specific audio signal properties, or specific ranges of physical audio signal properties, are typical for each phoneme 36. Within the set of phonemes 36, voiced and unvoiced phonemes 36 may be differentiated.
Now referring to Fig. 5, a practical application of phonemes 36 in n-grams is shown. An n-gram may be seen as a sequence of n consecutive phonemes 36. For instance, two phonemes 36 form a biphone, three phonemes 36 form a triphone, and so on. In turn, a combination of different phonemes 36 and n-grams may form a sentence 34. Here, the English sentence "The sky is blue" is taken as an example. The English words may also be described using phonemes 36, indicated as single symbols in Fig. 5. The phonemes 36 "s", "k" and "a" form a triphone as an example of an n-gram. At the same time, the phonemes 36 "k", "a" and "I" form another n-gram or triphone. Taking different starting points, the following phonemes 36 may form different n-grams.
For example, if the first three phonemes 36 (a triphone) at time slot t are detected, the first phoneme of the three phonemes (here the "s") is discarded. In order to determine the following or upcoming triphone, two of the three phonemes 36 are then already known. Thus, the number of possible probable triphones or phonemes 36 is limited: the appearance of certain phonemes 36 is very unlikely, whereas other phonemes 36 are more likely to appear in an upcoming time slot.
In Fig. 6, a flow diagram of a computer-implemented method 100 for reducing a noise level of an audio signal of a hearing system 10 is shown.
The method comprises the step of acquiring 110 an ambient audio signal 14, wherein the ambient audio signal 14 comprises human speech in a particular language. This acquisition of the ambient audio signal 14 may be based on different audio sources, for instance external devices such as a TV set 20 or a telephone 18. Alternatively or additionally, a microphone 12 may be used to capture audio signals from a surrounding area or a speaking person nearby.
In a next step 120, an n-gram of one or more phonemes 36 of the ambient audio signal 14 is detected. An n-gram may comprise one or more phonemes 36, which may also include a zero-signal amplitude, silence or noise. The detection may, for example, be performed by a signal processing unit 22 and may be based on one of the Hidden Markov model, Bayesian estimation model, neural networks, Mel-frequency cepstral coefficients, or feature vector method. According to an example, the methods mentioned above are used in combination with each other to improve recognition quality and performance. The detection is based on the ambient audio signal 14 relating to past time slots T-1, T-2, T-3 and so on. In other words, the detection may be seen as an ex-post analysis of already received, and possibly recorded or stored, ambient audio signals 14.
In a following step 130, a probable subsequent phoneme 36 is determined based on the detected n-gram of the ambient audio signal 14. This determination uses spectral and temporal information of phonemes from a language model database 24 of the language of the human speech and compares and/or correlates it with the observed spectral and temporal properties of the audio signal, producing an ordered list of the most probable observed phonemes.
The language model may also contain probabilities of appearance, libraries of words, n-grams, phonemes 36 and other language-specific information. The language-specific information from the language model database 24 is used to compare or correlate the most probable observed n-gram with the possible ones according to the selected language.
In step 140, expected audio signal properties for a time slot T are deduced based on the determined probable subsequent phoneme 36. For example, if a most probable phoneme 36 has been selected in a previous step, the associated physical, electrical or other audio signal properties may be retrieved. Such data may be stored in the language model database 24. In one example, the audio signal properties of the expected audio signal 38 may represent only a certain time portion out of the full duration of a phoneme 36.
In a step 150, the ambient audio signal 14 may be analyzed in the time slot T, generating associated observed ambient audio signal properties. According to an example, this may be executed simultaneously with one of the previous steps 120, 130 and/or 140. As a result of step 150, physical parameters or data describing the ambient audio signal 14 are available and may optionally be stored temporarily or permanently.
For example, having both the ordered list of possibly observed n-grams and the list of possible n-grams for the selected language, comparing or correlating the two allows deducing, in step 160, whether the most probable observed n-gram belongs to the language model (if a high correlation exists, a high probability P close to 1 is assigned) or rather not (if the most probable observed n-grams are not part of the language model, or are rarely used there, a probability P close to 0 is assigned). This comparison may be performed on observed and possible n-grams, but also on the underlying spectral and temporal audio signal properties. For example, a plurality of discrete signal amplitudes of both the ambient audio signal 14 and the expected audio signal 38 are compared and a difference is calculated. This may, according to an example, be repeated for a number of subsequent points in time. As a result, a degree or value of a deviation 40 between the two audio signals is determined. For example, if the expected audio signal 38 of the possible n-grams according to the language model has a very low signal strength in a particular frequency band, but the ambient audio signal 14 is detected with a strong signal in that frequency band, a high deviation 40 is determined and the associated audio signal, respectively phoneme, is considered less likely to belong to the target signal despite its high signal amplitude, i.e. a probability P close to 0 is assigned.
Referring to a following step 170, having such information about the degree of deviation 40 between the expected audio signal 38 and the actually captured ambient audio signal 14, the attenuation of the ambient audio signal 14 for this time slot T is adjusted. In other words, the degree of attenuation depends on the degree of deviation 40 of the observed ambient audio signal 14 properties from the most probable expected audio signal 38 properties. Bearing in mind that the expected audio signal 38 properties are based on a model of the language used in the current conversation, a high deviation 40 may indicate that the ambient audio signal 14 likely belongs to the noise, or is at least not relevant, and does not belong to the conversation of interest. It is therefore desirable to lower the signal amplitude and attenuate the ambient audio signal 14 for that relevant time slot T. As a result, irrelevant portions of the ambient audio signal 14 may be at least partially removed, improving the quality of the audio signal for better intelligibility and a better recognition rate by the user.
In Fig. 7, an example attenuation function is shown as a three-dimensional diagram over gain/attenuation in dB, a probability value and a signal-to-noise ratio. Depending on both the probability of appearance of a certain phoneme 36 and a given signal quality represented by the signal-to-noise ratio, a specific attenuation value may be retrieved. According to one example, an attenuation function may be:

Attenuation = Min(0; Max(Max_Atten; 0.8*(SNR - SNR_thres)) - 8*(1 - P)),

wherein Max_Atten represents a maximal attenuation (for example -9 dB in Fig. 3), P represents the probability that the observed spectral and temporal signal properties of a phoneme belong to a probable one according to the language model, SNR represents the actually measured signal-to-noise ratio, and SNR_thres represents a threshold SNR value below which attenuation is applied (for example, 11 dB in Fig. 3).
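A sketch of this rule; note that the grouping of the parentheses is an assumption, since the formula as printed appears garbled:

```python
def attenuation_db(snr_db, p, max_atten=-9.0, snr_thres=11.0):
    """Attenuation = min(0, max(Max_Atten, 0.8*(SNR - SNR_thres)) - 8*(1 - P)).

    P = 1: the SNR-driven rule alone applies; P -> 0: up to 8 dB of extra
    attenuation for improbable phonemes, clamped so gain never exceeds 0 dB.
    """
    return min(0.0, max(max_atten, 0.8 * (snr_db - snr_thres)) - 8.0 * (1.0 - p))

print(attenuation_db(15.0, 1.0))  # 0.0  -> probable phoneme at good SNR passes
print(attenuation_db(15.0, 0.0))  # -4.8 -> improbable phoneme is attenuated
```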
Fig. 8 shows a diagram of an expected audio signal 38 (continuous line) and an ambient audio signal 14 (dotted line). In the time slots T-2 and T-1, an expected audio signal 38 has been determined in previous cycles. The two phonemes "I" (in time slot T-2) and "z" (in time slot T-1) serve as a basis for the determination of the probable, expected phoneme of the ambient audio signal 14 and its associated spectral and temporal properties. Taking these two phonemes (or the biphone) as a basis for predicting a probable next upcoming phoneme in time slot T, the expected audio signal 38 properties for that time slot T may be derived. Concretely, in this example, the expected audio signal 38 for the relevant frequency band has a very low amplitude or signal strength. The underlying logic is that, according to the underlying language model database 24, a pause is very likely to follow the sequence of the phonemes "I" and "z". However, the actually captured ambient audio signal 14 in the time slot T shows a comparably high signal amplitude, which leads to a high deviation 40 between the expected audio signal 38 and the ambient audio signal 14. This deviation 40 may now be used to apply an attenuation to the ambient audio signal 14 to lower its amplitude for the relevant time slot T. The attenuation may be applied with suitable fading time constants, fading in and out to limit or avoid audible artefacts of the fading process itself. To this end, the actual audio signal is delayed by a proper amount, e.g. one time slot T as a typical phoneme duration.
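A sketch of such artefact-limited gain application with a one-pole fade; the time constant and the per-slot framing are assumptions, not values from the patent:

```python
import numpy as np

def apply_gain_with_fade(frame, target_atten_db, prev_gain, tau_slots=3.0):
    """Apply a per-slot gain smoothed by a one-pole fade, so that switching
    the attenuation on and off does not itself create audible artefacts.

    frame: samples of the (delayed) time slot T; prev_gain: linear gain
    carried over from the previous slot; tau_slots: fade time constant.
    """
    target_gain = 10.0 ** (target_atten_db / 20.0)
    alpha = 1.0 - np.exp(-1.0 / tau_slots)  # one-pole smoothing coefficient
    gain = prev_gain + alpha * (target_gain - prev_gain)
    return frame * gain, gain

# Because the signal is delayed by about one slot (a typical phoneme
# duration), the gain for slot T is known before slot T is played out.
out, g = apply_gain_with_fade(np.ones(160), target_atten_db=-9.0, prev_gain=1.0)
print(round(g, 3))
```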
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive; the invention is not limited to the disclosed embodiments. Other variations to the disclosed embodiments may be understood and effected by those skilled in the art and practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or controller or other unit may fulfil the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims should not be construed as limiting the scope.
LIST OF REFERENCE SYMBOLS
10 hearing system
11 hearing device
12 microphone
14 ambient audio signal
16 selection and mixing unit
18 telephone
20 TV set
22 signal processing unit
24 language model database
26 output audio signal
28 audio output unit
30 target signal
32 noise floor / noise signal
34 sentence
36 phonemes
38 expected audio signal
40 deviation
100 method
110 acquiring
120 detecting an n-gram
130 determining a probable next phoneme
140 deducing
150 analyzing
160 determining deviation
170 adjusting attenuation

Claims (14)

  1. A method (100) for reducing a noise level of an audio signal of a hearing system (10), the hearing system (10) comprising a hearing device (11) to be worn at or in a user's ear; the method comprising the steps of:
     - acquiring (110) an ambient audio signal (14), wherein the ambient audio signal (14) comprises human speech in a language;
     - detecting (120) an n-gram of one or more phonemes (36) in the ambient audio signal (14);
     - determining (130) a subsequent probable phoneme (36) based on the detected n-gram; wherein the detection of the n-gram in the ambient audio signal (14) and the determination of the probable subsequent phoneme (36) are based on a language model database (24) of the language;
     - deducing (140) expected probable audio signal (38) properties for a time slot after the n-gram of phonemes based on the determined probable subsequent phoneme (36);
     - analyzing (150) the ambient audio signal (14) in the time slot, generating associated observed ambient audio signal (14) properties;
     - determining (160), related to the time slot, a deviation (40) of the observed ambient audio signal (14) properties from the expected probable audio signal (38) properties;
     - adjusting (170), based on the deviation (40), the attenuation of the ambient audio signal (14) for the time slot.
  2. The method (100) of claim 1, wherein adjusting the attenuation comprises increasing the attenuation of the ambient audio signal (14) when the determined deviation (40) of the two audio signals (14, 38) increases.
  3. The method (100) according to any of the previous claims, wherein the determined probable subsequent phoneme (36) is part of an n-gram of n sequential phonemes (36).
  4. The method (100) of any of the previous claims, wherein determining (130) a probable subsequent phoneme (36) comprises a generation of a plurality of probable phonemes (36) and/or n-grams with associated probabilities.
  5. The method (100) according to claim 4, wherein, based on the determined probabilities of the probable phonemes, deducing (140) the expected audio signal (38) properties comprises calculating the corresponding probabilities of the expected audio signal (38) properties.
  6. The method (100) according to any of the previous claims, wherein the ambient audio signal (14) represents a frequency band out of a full frequency spectrum of the ambient audio signal (14) and the attenuation of the ambient audio signal (14) for the time slot is performed in the associated frequency band.
  7. The method (100) according to any of the previous claims, wherein detecting (120) a phoneme (36) of the ambient audio signal (14) comprises detecting a sequence of n consecutive phonemes (36) and the determination of a probable subsequent phoneme (36) is based on this sequence of phonemes.
  8. The method (100) according to claim 7, wherein determining a probable subsequent phoneme (36) comprises the step of:
     - discarding the first phoneme (36) of the n consecutive phonemes (36);
     wherein the determining (130) of the subsequent probable phoneme (36) is based on the remaining (n-1) detected phonemes (36) of the ambient audio signal (14).
  9. The method (100) according to any of the previous claims, wherein the expected audio signal (38) properties are generated for a plurality of subsequent time slots.
  10. The method (100) according to any of the previous claims, wherein the language model database (24) is selectable by the user.
  11. The method (100) according to any of the previous claims, wherein the detection (120) of the n-grams of phonemes (36) in the ambient audio signal (14) is based on one of the Hidden Markov model, Bayesian estimation model, neural networks, Mel-frequency cepstral coefficients, or feature vector method.
  12. A computer program for operating a hearing system which, when being executed by at least one processor, is adapted to carry out the steps of the method (100) of one of the previous claims.
  13. A computer-readable medium, in which a computer program according to claim 12 is stored.
  14. A hearing system (10) comprising at least one hearing device (11), wherein the hearing system (10) is adapted for performing the method (100) of one of claims 1 to 11.
GB1900803.6A 2019-01-21 2019-01-21 Reducing a noise level of an audio signal of a hearing system Withdrawn GB2580655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1900803.6A GB2580655A (en) 2019-01-21 2019-01-21 Reducing a noise level of an audio signal of a hearing system


Publications (2)

Publication Number Publication Date
GB201900803D0 GB201900803D0 (en) 2019-03-13
GB2580655A true GB2580655A (en) 2020-07-29

Family

ID=65655893

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1900803.6A Withdrawn GB2580655A (en) 2019-01-21 2019-01-21 Reducing a noise level of an audio signal of a hearing system

Country Status (1)

Country Link
GB (1) GB2580655A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164387A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6058365A (en) * 1990-11-16 2000-05-02 Atr Interpreting Telephony Research Laboratories Speech processing using an expanded left to right parser
US6260011B1 (en) * 2000-03-20 2001-07-10 Microsoft Corporation Methods and apparatus for automatically synchronizing electronic audio files with electronic text files
US20080077403A1 (en) * 2006-09-22 2008-03-27 Fujitsu Limited Speech recognition method, speech recognition apparatus and computer program
US20180053500A1 (en) * 2016-08-22 2018-02-22 Google Inc. Multi-accent speech recognition





Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)