CN116072137A - Compensating for denoising artifacts - Google Patents

Compensating for denoising artifacts

Info

Publication number
CN116072137A
Authority
CN
China
Prior art keywords
audio
signal
noise
level parameter
noise suppression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211360381.5A
Other languages
Chinese (zh)
Inventor
M. O. Heikkinen
M. T. Vilermo
A. J. Lehtiniemi
A. J. Eronen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of CN116072137A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K: SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K 11/00: Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K 11/16: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30: Control circuits for electronic adaptation of the sound field
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008: Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272: Voice signal separating
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 2400/00: Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S 2400/00: Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/13: Aspects of volume control, not necessarily automatic, in stereophonic sound systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Control Of Amplification And Gain Control (AREA)
  • Tone Control, Compression And Expansion, Limiting Amplitude (AREA)

Abstract

An apparatus comprising means configured to: obtaining at least two audio signals; determining an audio object portion and an ambient audio portion relative to the at least two audio signals; determining a level parameter based on the ambient audio portion; applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and generating a noise suppressed audio object portion based on the applied noise suppression.

Description

Compensating for denoising artifacts
Technical Field
The present application relates to an apparatus and method for compensating for denoising artifacts, and more particularly to compensating for denoising artifacts when removing noise sources such as wind noise, background noise, motor noise, and operating noise.
Background
The audio objects may be produced by a spatial audio capture process in which an audio scene is captured by microphones and the captured audio signals are analyzed to determine a spatial audio signal comprising a plurality (1 to N) of audio objects, where N is, for example, 5. Each of these objects has a separate audio signal and metadata describing its (spatial) characteristics. The metadata may be a parameterized representation of the characteristics of the audio object and may include parameters such as the direction (e.g., azimuth and elevation) of the audio object. Other examples include the distance, spatial extent, and gain of the object.
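As an illustration, such an object's parameterized metadata might be represented as follows. This is a minimal sketch: the field names and default values are illustrative and are not taken from the application text.

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One separated audio object: its own signal plus parametric metadata.

    Field names are illustrative, not prescribed by the application.
    """
    signal: list              # mono PCM samples for this object
    azimuth_deg: float        # horizontal direction of arrival
    elevation_deg: float      # vertical direction of arrival
    distance_m: float = 1.0   # distance from the capture device
    spatial_extent_deg: float = 0.0  # 0 = point source
    gain: float = 1.0         # object gain
```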
It is known to improve audio capture quality by applying noise suppression techniques. Thus, for example, there are known noise suppression techniques for suppressing noise sources such as wind noise, background noise (e.g., ventilation noise, traffic), motor noise (e.g., camera autofocus motor), and operating noise. Different techniques are generally required in the suppression of these noise sources.
The audio objects may be used as input formats for codecs such as Immersive Voice and Audio Services (IVAS) codecs.
Disclosure of Invention
According to a first aspect, there is provided an apparatus comprising means configured to: obtaining at least two audio signals; determining an audio object portion and an ambient audio portion relative to the at least two audio signals; determining a level parameter based on the ambient audio portion; applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and generating a noise suppressed audio object portion based on the applied noise suppression.
The means may be further configured to: combining the noise-suppressed audio object portion and the ambient audio portion to generate an output audio signal; and outputting and/or storing the output audio signal.
The means may be configured to: separate the at least two audio signals into the determined respective audio object portion and the ambient audio portion, and the means may further be configured to: generate an audio object portion audio signal based on the level parameter at a previous time.
The means configured to generate the audio object portion audio signal based on the level parameter of the previous time may be configured to: determining an object separation direction parameter; determining a focus configuration based on the object separation direction parameter and the level parameter of the previous time; and applying the focus configuration to the at least two audio signals to generate the audio object portion audio signal.
The means configured to determine the focus configuration based on the object separation direction parameter and the level parameter of the previous time may be configured to: generating a first focus filter having a first spatial width based on the level parameter at the previous time being equal to or greater than a first value; and generating a second focus filter having a second spatial width based on the level parameter at the previous time being less than the first value, wherein the second spatial width is less than the first spatial width and the second focus filter is more spatially selective than the first focus filter.
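The threshold behaviour described above can be sketched as follows. This is a minimal illustration: the threshold and width values are hypothetical, and a real focus filter would be more than a single width in degrees.

```python
def focus_width_deg(prev_level_param, first_value,
                    wide_deg=90.0, narrow_deg=30.0):
    """Select a focus-filter spatial width from the previous frame's level
    parameter: at or above the first value, use the first (wider) filter;
    below it, use the second, narrower and more spatially selective filter.

    Threshold and widths are illustrative values, not from the application.
    """
    return wide_deg if prev_level_param >= first_value else narrow_deg
```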
The means configured to apply the focus configuration to the at least two audio signals to generate an audio object part audio signal may be configured to: an ambient audio portion is generated by removing an audio object portion audio signal from at least two audio signals.
The means configured to apply noise suppression to the audio object portion (wherein the noise suppression may be configured to be controlled based on the determined level parameter) may be configured to: generating a first signal-to-noise ratio relative to a first time period based on the audio object portion and the ambient audio portion of the at least two audio signals; generating a second signal-to-noise ratio based on the audio object portion and the ambient audio portion of the at least two audio signals relative to a second time period, wherein the first time period is shorter than the second time period; combining the first signal-to-noise ratio and the second signal-to-noise ratio to generate a combined signal-to-noise ratio; multiplying the combined signal-to-noise ratio by a factor based on the level parameter to generate a noise suppression filter parameter; and applying a noise suppression filter having the noise suppression filter parameters to the audio object portion.
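A rough sketch of this control path is given below. Note the assumptions: energy-ratio SNR estimates, a geometric-mean combination of the two SNRs, and a Wiener-style gain mapping are all choices made for the example; the application does not fix these details.

```python
import numpy as np

def noise_suppression_gain(obj, amb, short_win, long_win, level_factor):
    """Combine a short-term and a long-term SNR, scale by a factor derived
    from the level parameter, and map to a suppression-filter gain.

    The SNR estimator, combination rule, and gain mapping are assumptions.
    """
    def snr(n):  # energy ratio over the last n samples
        o = np.mean(np.asarray(obj[-n:], dtype=float) ** 2)
        a = np.mean(np.asarray(amb[-n:], dtype=float) ** 2) + 1e-12
        return o / a

    combined = np.sqrt(snr(short_win) * snr(long_win))  # combined SNR
    scaled = combined * level_factor                    # level-parameter control
    return scaled / (1.0 + scaled)                      # Wiener-like gain in [0, 1)
```

A larger `level_factor` yields a gain closer to 1, i.e., gentler suppression of the object signal.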
The means configured to determine the level parameter based on the ambient audio portion may be configured to: a level difference between the audio object portion and the ambient audio portion is determined.
The means configured to determine a level difference between the audio object portion and the ambient audio portion may be configured to: the level difference is further determined based on the noise suppressed audio object portions.
The means configured to determine the level parameter based on the ambient audio portion may be configured to: a level difference between the noise suppressed audio object portion and the ambient audio portion is determined.
The means configured to determine the level parameter based on the ambient audio portion may be configured to: a level parameter is determined based on the absolute level of the ambient audio portion.
The means configured to determine the level parameter based on the ambient audio portion may be configured to: for a defined or selected frequency band, a level difference is determined.
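The per-band level difference can be computed, for example, as an energy ratio in decibels. This is an illustrative helper; framing and band selection are assumed to have happened upstream.

```python
import numpy as np

def level_difference_db(obj_band, amb_band):
    """Level difference between the audio object portion and the ambient
    portion for one frame of one frequency band, in dB.

    The energy-ratio formulation is an assumption for illustration.
    """
    e_obj = np.mean(np.asarray(obj_band, dtype=float) ** 2) + 1e-12
    e_amb = np.mean(np.asarray(amb_band, dtype=float) ** 2) + 1e-12
    return 10.0 * np.log10(e_obj / e_amb)
```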
The means configured to apply noise suppression to the audio object portion (wherein the noise suppression may be configured to be controlled based on the determined level parameter) may be configured to: noise suppression is applied to a defined or selected frequency band.
According to a second aspect, there is provided a method for an apparatus, comprising: obtaining at least two audio signals; determining an audio object portion and an ambient audio portion relative to the at least two audio signals; determining a level parameter based on the ambient audio portion; applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and generating a noise suppressed audio object portion based on the applied noise suppression.
The method may further comprise: combining the noise-suppressed audio object portion and the ambient audio portion to generate an output audio signal; and outputting and/or storing the output audio signal.
The method may further comprise: separating the at least two audio signals into the determined respective audio object portions and the ambient audio portion, wherein generating the noise-suppressed audio object portions based on the applied noise suppression comprises: an audio object portion audio signal is generated based on the level parameter at the previous time.
Generating the audio object portion audio signal based on the level parameter of the previous time may include: determining an object separation direction parameter; determining a focus configuration based on the object separation direction parameter and a level parameter of a previous time; and applying the focus configuration to at least two audio signals to generate an audio object part audio signal.
Determining the focus configuration based on the object separation direction parameter and the level parameter of the previous time may include: generating a first focus filter having a first spatial width based on the level parameter at the previous time being equal to or greater than a first value; and generating a second focus filter having a second spatial width based on the level parameter at the previous time being less than the first value, wherein the second spatial width is less than the first spatial width and the second focus filter is more spatially selective than the first focus filter.
Applying the focus configuration to the at least two audio signals to generate an audio object portion audio signal may comprise: an ambient audio portion is generated by removing an audio object portion audio signal from at least two audio signals.
Applying noise suppression to the audio object portion (wherein the noise suppression may be configured to be controlled based on the determined level parameter) may include: generating a first signal-to-noise ratio relative to a first time period based on the audio object portion and the ambient audio portion of the at least two audio signals; generating a second signal-to-noise ratio based on the audio object portion and the ambient audio portion of the at least two audio signals relative to a second time period, wherein the first time period is shorter than the second time period; combining the first signal-to-noise ratio and the second signal-to-noise ratio to generate a combined signal-to-noise ratio; multiplying the combined signal-to-noise ratio by a factor based on the level parameter to generate a noise suppression filter parameter; and applying a noise suppression filter having the noise suppression filter parameters to the audio object portion.
Determining the level parameter based on the ambient audio portion may include: determining a level difference between the audio object portion and the ambient audio portion.
Determining the level difference between the audio object portion and the environmental audio portion may include: the level difference is further determined based on the noise suppressed audio object portions.
Determining the level parameter based on the ambient audio portion may include: a level difference between the noise suppressed audio object portion and the ambient audio portion is determined.
Determining the level parameter based on the ambient audio portion may include: a level parameter is determined based on the absolute level of the ambient audio portion.
Determining the level parameter based on the ambient audio portion may include: for a defined or selected frequency band, a level difference is determined.
Applying noise suppression to the audio object portion (wherein the noise suppression may be configured to be controlled based on the determined level parameter) may include: noise suppression is applied to a defined or selected frequency band.
According to a third aspect, there is provided an apparatus comprising at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to: obtaining at least two audio signals; determining an audio object portion and an ambient audio portion relative to the at least two audio signals; determining a level parameter based on the ambient audio portion; applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and generating a noise suppressed audio object portion based on the applied noise suppression.
The apparatus may be further caused to: combining the noise-suppressed audio object portion and the ambient audio portion to generate an output audio signal; and outputting and/or storing the output audio signal.
The apparatus may be caused to: separate the at least two audio signals into the determined respective audio object portion and the ambient audio portion, and the apparatus may further be caused to: generate an audio object portion audio signal based on the level parameter at a previous time.
The apparatus being caused to generate the audio object portion audio signal based on the level parameter of the previous time may be caused to: determining an object separation direction parameter; determining a focus configuration based on the object separation direction parameter and a level parameter of a previous time; and applying the focus configuration to at least two audio signals to generate an audio object part audio signal.
The apparatus that is caused to determine the focus configuration based on the object separation direction parameter and the level parameter of the previous time may be caused to: generating a first focus filter having a first spatial width based on the level parameter at the previous time being equal to or greater than a first value; and generating a second focus filter having a second spatial width based on the level parameter at the previous time being less than the first value, wherein the second spatial width is less than the first spatial width and the second focus filter is more spatially selective than the first focus filter.
The apparatus caused to apply the focus configuration to at least two audio signals to generate an audio object part audio signal may be caused to: an ambient audio portion is generated by removing an audio object portion audio signal from at least two audio signals.
The apparatus caused to apply noise suppression to the audio object portion (wherein the noise suppression may be configured to be controlled based on the determined level parameter) may be caused to: generating a first signal-to-noise ratio relative to a first time period based on the audio object portion and the ambient audio portion of the at least two audio signals; generating a second signal-to-noise ratio based on the audio object portion and the ambient audio portion of the at least two audio signals relative to a second time period, wherein the first time period is shorter than the second time period; combining the first signal-to-noise ratio and the second signal-to-noise ratio to generate a combined signal-to-noise ratio; multiplying the combined signal-to-noise ratio by a factor based on the level parameter to generate a noise suppression filter parameter; and applying a noise suppression filter having the noise suppression filter parameters to the audio object portion.
The apparatus being caused to determine the level parameter based on the ambient audio portion may be caused to: a level difference between the audio object portion and the ambient audio portion is determined.
The apparatus being caused to determine a level difference between the audio object portion and the ambient audio portion may be caused to: the level difference is further determined based on the noise suppressed audio object portions.
The apparatus being caused to determine the level parameter based on the ambient audio portion may be caused to: a level difference between the noise suppressed audio object portion and the ambient audio portion is determined.
The apparatus being caused to determine the level parameter based on the ambient audio portion may be caused to: a level parameter is determined based on the absolute level of the ambient audio portion.
The apparatus being caused to determine the level parameter based on the ambient audio portion may be caused to: for a defined or selected frequency band, a level difference is determined.
The apparatus caused to apply noise suppression to the audio object portion (wherein the noise suppression may be configured to be controlled based on the determined level parameter) may be caused to: noise suppression is applied to a defined or selected frequency band.
According to a fourth aspect, there is provided an apparatus comprising: means for obtaining at least two audio signals; means for determining an audio object portion and an ambient audio portion relative to the at least two audio signals; means for determining a level parameter based on the ambient audio portion; means for applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and means for generating a noise suppressed audio object portion based on the applied noise suppression.
According to a fifth aspect, there is provided a computer program [or a computer readable medium comprising program instructions] comprising instructions for causing an apparatus to at least: obtaining at least two audio signals; determining an audio object portion and an ambient audio portion relative to the at least two audio signals; determining a level parameter based on the ambient audio portion; applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and generating a noise suppressed audio object portion based on the applied noise suppression.
According to a sixth aspect, there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus to at least: obtaining at least two audio signals; determining an audio object portion and an ambient audio portion relative to the at least two audio signals; determining a level parameter based on the ambient audio portion; applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and generating a noise suppressed audio object portion based on the applied noise suppression.
According to a seventh aspect, there is provided an apparatus comprising: an obtaining circuit configured to obtain at least two audio signals; a determining circuit configured to determine an audio object portion and an ambient audio portion with respect to the at least two audio signals; a determining circuit configured to determine a level parameter based on the ambient audio portion; an application circuit configured to apply noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and a generating circuit configured to generate a noise-suppressed audio object portion based on the applied noise suppression.
According to an eighth aspect, there is provided a computer readable medium comprising program instructions for causing an apparatus to at least: obtaining at least two audio signals; determining an audio object portion and an ambient audio portion relative to the at least two audio signals; determining a level parameter based on the ambient audio portion; applying noise suppression to the audio object portion, wherein the noise suppression is configured to be controlled based on the determined level parameter; and generating a noise suppressed audio object portion based on the applied noise suppression.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform a method as described above.
An electronic device may comprise an apparatus as described above.
A chipset may comprise an apparatus as described above.
Embodiments of the present application aim to address the problems associated with the state of the art.
Drawings
For a better understanding of the present application, reference will now be made, by way of example, to the accompanying drawings, in which:
FIG. 1 schematically illustrates an apparatus suitable for implementing some embodiments;
FIG. 2 illustrates a flow chart of operation of the apparatus shown in FIG. 1, according to some embodiments;
FIG. 3 schematically illustrates an example loudness measurer as shown in FIG. 1, according to some embodiments;
FIG. 4 illustrates a flowchart of the operation of the example loudness measurer illustrated in FIG. 3, according to some embodiments;
FIG. 5 schematically illustrates an example noise suppressor as shown in FIG. 1, in accordance with some embodiments;
FIG. 6 illustrates a flowchart of the operation of the example noise suppressor shown in FIG. 5, according to some embodiments;
FIG. 7 schematically illustrates an example object separator as shown in FIG. 1, in accordance with some embodiments;
FIG. 8 illustrates a flowchart of the operation of the example object separator shown in FIG. 7, in accordance with some embodiments;
FIGS. 9 and 10 schematically illustrate other example apparatus suitable for implementing some embodiments;
FIGS. 11 and 12 schematically illustrate example systems suitable for implementing the apparatus of the embodiments (including the apparatus as shown in the previous figures); and
FIG. 13 schematically shows an example apparatus suitable for implementing the illustrated devices.
Detailed Description
The concepts as discussed in further detail herein with respect to the following embodiments relate to the capturing of audio scenes.
As described above, methods for audio capture, and in particular spatial audio capture, involve analyzing and processing microphone audio signals to determine audio signals and spatial parameters associated with an object.
Thus, the audio signal from the microphone may be processed in order to separate the audio objects, and further noise suppression may be applied.
However, it is not possible to set the tuning parameters for object separation and noise suppression in a spatial audio capture system such that the result is optimal for every input signal. The required tuning parameters vary depending on the nature of the input content. Thus, tuning parameters may be selected to provide an output of "average" quality, or for the worst-case performance of the algorithm.
Furthermore, manual tuning also involves a trade-off in the manner in which tuning parameters affect the output quality of object separation and noise reduction.
Beamforming of microphone audio signals, which is commonly used in object separation, may amplify certain types of noise present in the input microphone audio signal. The choice of beamforming parameters may be regarded as a compromise between separation efficiency and amplified noise. In some embodiments, beamforming may be considered an example of focusing. Accordingly, the focusing element is configured to amplify the object sound relative to the ambient sound using any available method (e.g., beamforming, spatial filtering, machine learning methods, etc.). In the examples below, beamformers and beamforming are described, however, any suitable (spatial) focusing means may be used.
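As a toy example of one such focusing method, a delay-and-sum beamformer advances each channel by a steering delay and averages the aligned channels. This is a hypothetical sketch: integer sample delays are assumed, and the circular wrap-around of `np.roll` is tolerated for brevity.

```python
import numpy as np

def delay_and_sum(mics, delays):
    """Toy delay-and-sum beamformer: steer by integer sample delays, then
    average the aligned channels to amplify the object sound relative to
    the ambient sound.

    Simplified for illustration; real beamformers use fractional delays
    and weights, and avoid circular wrap-around.
    """
    mics = [np.asarray(m, dtype=float) for m in mics]
    n = min(len(m) for m in mics)
    out = np.zeros(n)
    for m, d in zip(mics, delays):
        out += np.roll(m, -d)[:n]  # advance this channel by d samples (circular)
    return out / len(mics)
```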
For example, if the audio scene includes a person speaking and ambient sound, where the ambient sound level is moderate, the "good" tuning for the beamformer coefficients may be to produce the narrowest possible beamforming sector including the person speaking and a large attenuation outside that sector.
In another case, if the audio scene includes a person speaking and the ambient noise is caused by wind, then the "good" tuning for the beamforming coefficients is to produce a wider beamforming sector and less attenuation outside that sector, as it will amplify wind noise to a lesser/lower degree.
If the noise reduction control is set too high, applying noise reduction will typically introduce signal artifacts. The noise-reduction trade-off is between the amount of noise removed from the input signal and the amount of artifacts added to the output signal.
In a system that separates object audio and ambient sound, playback will mix both. The output quality is determined by the final mix. This means that the object audio is not heard alone, but mixed with the ambient sound. The tuning tradeoff should take into account the perceived quality of the object audio signal when combined with the ambient audio signal. Since there are many possible variations of the combination of the ambient audio signal and the object audio signal, it is virtually impossible to generate or determine a generic "preset" tuning that takes into account all combinations.
Embodiments as described in further detail below relate to the control of noise suppression and object separation in spatial audio capture, wherein an adaptive control mechanism is provided that produces perceptually improved audio signals by adjusting noise suppression and object separation parameters based on the spectral characteristics of the object audio and the ambient sound. Furthermore, these embodiments attempt to avoid the compromises and artifacts created by conventional manual tuning of object separation and noise reduction. For example, these embodiments attempt to reduce audible (object separation/noise reduction) processing artifacts, and to avoid overly conservative control settings that do not provide the best possible (object separation/noise reduction) performance for the input content.
Accordingly, embodiments as described herein relate to an apparatus and method for a capture process of spatial audio in which two or more microphones in a spatial audio capture device are used to capture spatial audio signals that may be reproduced to a user, thereby enabling them to experience audio signals having at least some of the spatial characteristics present at the location of the spatial audio capture device during audio capture.
In these embodiments, an apparatus and method for improving the quality of spatial audio capture when the spatial audio capture includes audio object separation and noise suppression steps is presented.
In some embodiments this is achieved by:
obtaining at least two audio signals;
determining (and separating) at least one audio object (or direct) signal and a residual (or ambient) signal from the at least two audio signals;
applying noise suppression to the audio object signal to obtain at least one noise suppressed object signal;
determining a level difference based on the at least one audio object signal and the residual signal;
determining, based on the at least one audio object signal, a first amount of quality degradation resulting from separating the audio object signal from the residual signal;
using at least one spatial characteristic of the sound to determine the quality degradation;
determining a second amount of quality degradation caused by noise suppression based on the at least one noise suppressed audio object signal; and
adjusting a first parameter of the separation process or a second parameter of the noise suppression process based on at least one of: the level difference, the first quality degradation amount, the second quality degradation amount, or the spatial characteristics of the object and/or ambient signal.
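The level-difference and adjustment steps above can be illustrated with a minimal pure-Python sketch (the dB helper, the 0.1 step size, and the [0, 1] parameter range are illustrative assumptions, not values from the embodiments):

```python
import math

def level_db(sig):
    """Mean-square level of a signal frame in dB (illustrative helper)."""
    energy = sum(s * s for s in sig) / max(len(sig), 1)
    return 10.0 * math.log10(energy + 1e-12)

def control_step(object_sig, residual_sig, params):
    """One adjustment cycle: derive the object/residual level difference
    and nudge the separation and suppression strengths for the next frame."""
    diff = level_db(residual_sig) - level_db(object_sig)
    step = 0.1 if diff > 0 else -0.1   # louder ambience -> more aggressive
    for key in ("separation", "suppression"):
        params[key] = min(1.0, max(0.0, params[key] + step))
    return params
```

Run per frame, this realizes the feedback idea: when the residual (ambient) part dominates, processing is allowed to become more aggressive, since the ambience will mask artifacts.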
In the implemented embodiments, object separation and noise suppression artifacts are designed to be masked by ambient noise and thus inaudible. Furthermore, it is an object of the implemented embodiments to improve perceived object separation and noise suppression quality. Furthermore, these embodiments adapt the object separation and noise suppression as the audio scene changes over time. Furthermore, in some embodiments, the implementation should require lower power consumption, as the computational load adapts to the input signal. In other words, if there is no audible gain, there is no unnecessary processing.
In the following examples, the sound source portion and the remaining portion of the captured microphone audio signal are discussed. The sound source part (also referred to by interchangeable terms such as audio object, sound object, or audio source) may also be called the direct audio signal part and refers to a signal arriving directly from a sound source, while the remaining or ambient part (the terms are used interchangeably) refers to echoes and background noise present in the environment.
Fig. 1, for example, illustrates an apparatus suitable for implementing some embodiments.
The apparatus in this example shows a microphone input 101 configured to obtain or receive a plurality of microphone input audio signals (from physically separate microphones or otherwise). Any suitable number and/or arrangement of microphones may be present. For example, in some embodiments, there may be a spherical array with a sufficient number of microphones (e.g., 30 or more), or a VR camera with microphones mounted on its surface. The microphone audio signals 108 may be passed to the object separator 103 and the ambient capturer 105. In some embodiments, the microphone audio signals are processed before being passed to the object separator 103 and the ambient capturer 105. For example, a suitable time-frequency transformer may be used to convert the microphone audio signals to the time-frequency domain.
In some embodiments, the apparatus includes an object separator 103. The object separator 103 is configured to obtain the plurality of microphone audio signals and to generate audio signals related to audio objects. Examples of audio signals related to audio objects are, for example, audio signals related to persons speaking or singing, musical instruments, or other audio-generating objects, such as animals or inanimate objects. Any suitable object separation process may be used in these embodiments. In practice, the audio signal output from the object separator may also contain other audio energy due to limitations in microphone position and number of microphones. In some embodiments, the object separator 103 is configured to generate multiple sets of audio signals, each set being associated with a different identified object. In some embodiments, the object separator 103 is configured to output the object audio signal 104 to the noise suppressor 109 and the loudness measurer 107.
In addition, the apparatus includes an ambient capturer 105. The ambient capturer is configured to obtain the microphone audio signals and generate an ambient sound audio signal 106. Any suitable ambience determination process may be used in these embodiments. In practice (in a similar manner as described above), the ambient audio signal output from the ambient capturer 105 may also contain audio energy associated with the object due to the limitations of microphone location and number of microphones. The ambient sound audio signal 106 may be output by the ambient capturer 105 to the loudness measurer 107 and the audio signal output (or combiner) 111.
In some embodiments, the audio object separator 103 and/or the ambient capturer 105 may use different microphones and/or signal processing techniques (such as beamforming) to accomplish their tasks. The object audio signal may also be separated using known AI/ML (artificial intelligence/machine learning) methods.
AI/ML separation methods are known to produce artifacts. Controlling the AI/ML method may include selecting between different AI/ML methods, in particular methods trained with different audio samples. For example, an AI/ML method used only for separating speech and trained with speech+noise samples only may be used; alternatively, an AI/ML method trained with speech+music+noise samples may be used. When music and noise are present in the background, the speech+noise-only trained method will typically cause more artifacts for speech objects than the speech+music+noise trained method, whereas the former achieves better separation for speech objects when only noise is present in the background.
In some embodiments, the loudness measurer 107 is configured to obtain the outputs of the object separator 103 and the ambient capturer 105 and to compare the levels/loudness of the audio signals. In some embodiments, the comparison is divided into frequency bands related to human hearing. In some embodiments, a loudness model is used that combines spectral and temporal characteristics to model human hearing and determine which portions of the audio signals are audible.
The loudness measurer 107 is configured to output a control signal to the noise suppressor 109 (and the audio object separator 103). For example, in some embodiments, the loudness measurer 107 is configured to determine whether the loudness measure of the ambient capturer output is large enough to mask critical portions of the object-separated signal, and in that case to control the audio object separator 103 and the noise suppressor 109 to apply more aggressive/thorough processing in the object separation and noise suppression operations, as artifacts caused by the more aggressive processing are likely to be masked by the ambient sound. Similarly, in some embodiments, when the ambient sound level is determined to be low, the loudness measurer 107 is configured to control the audio object separator 103 and the noise suppressor 109 to make the object separation and noise suppression operations more conservative.
The noise suppressor 109 is configured to receive the output of the audio object separator 103 and a control signal from the loudness measurer 107. Further, the noise suppressor 109 is configured to apply a noise suppressing operation to the audio object audio signal based on the control signal from the loudness measurer 107. The output of the noise suppressor 109 may in turn be passed to an audio signal output 111.
The audio signal output 111 is configured to receive the outputs of the noise suppressor 109 and the ambient capturer 105 and output an audio signal. In some embodiments, the audio signal output 111 is configured to output a bitstream comprising the noise-suppressed audio object audio signal and the ambient audio signal.
With respect to fig. 2, a flow chart of an example operation of the apparatus as shown in fig. 1 is shown.
Thus, for example, as shown in step 201 of fig. 2, a microphone input is shown obtained.
Further, as shown in step 205 of fig. 2, an ambient sound audio signal is determined/captured.
Furthermore, in fig. 2 it is shown by step 203 that the object audio signal is separated from the microphone audio signal.
Further, as shown in step 207 of fig. 2, loudness is measured and control signals are determined based on the measured loudness; these control signals are fed back to control the separation of the object audio signals.
Furthermore, it is shown in fig. 2 by step 209 that noise from the audio object audio signal is suppressed based on these control signals.
Further, as shown in step 211 in fig. 2, the processed audio signal (both the noise suppressed audio object audio signal and the ambient audio signal) may be output.
With respect to fig. 3, an example loudness measurer 107 is shown in further detail.
The loudness measurer 107 is configured to obtain or receive the object separator audio signal 104 at a first input and the ambient capturer audio signal 106 at a second input. Further, the loudness measurer 107 includes a first-input-signal band divider 301 configured to select or divide or otherwise determine frequency bands from the object separator audio signal. In some embodiments, this divider (and any of the dividers described herein) is configured to divide the audio signal into any suitable frequency band arrangement. For example, in some embodiments, the divider may generate critical bands, third-octave bands, or Bark bands. In addition, the loudness measurer 107 includes a second-input-signal band divider 303 configured to select or divide or otherwise determine frequency bands from the ambient audio signal. In some embodiments, the band dividers are implemented using a suitable filter bank.
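As one concrete example of such a band arrangement, third-octave band edges can be computed from the standard center frequencies at 1000·2^(n/3) Hz (a sketch; the default frequency range is an assumption):

```python
import math

def third_octave_bands(f_min=50.0, f_max=16000.0):
    """Return (center, lower_edge, upper_edge) tuples in Hz for the
    third-octave bands whose centers lie within [f_min, f_max]."""
    bands = []
    n = math.ceil(3 * math.log2(f_min / 1000.0))  # first center >= f_min
    while True:
        fc = 1000.0 * 2 ** (n / 3)
        if fc > f_max:
            break
        # band edges sit a sixth of an octave either side of the center
        bands.append((fc, fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)))
        n += 1
    return bands
```

Each band's energy can then be accumulated from FFT bins falling between the lower and upper edges.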
These bands may be passed to a band-wise analyzer 305. In some embodiments, the band-wise analyzer 305 includes a (first, audio object) band energy measurer 307 configured to determine or calculate, for each band, the audio signal energy associated with the audio object audio signal.
Further, in some embodiments, the band-wise analyzer 305 comprises a (second, ambient) band energy measurer 309 configured to determine or calculate, for each band, the audio signal energy associated with the ambient audio signal.
Furthermore, in some embodiments, the band-wise analyzer 305 includes a loudness difference analyzer 311. The loudness difference analyzer 311 is configured to analyze the difference in energy levels (for corresponding frequency bands) between the audio object audio signal and the ambient audio signal. The difference in band energy is related to the extent to which one signal masks the other. The result of the comparison may in turn be used to generate control parameters or signals 312 that may be passed to the object separator to control object separation or to the noise suppressor to control noise suppression.
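A minimal sketch of such a per-band comparison, mapping each band's ambient-over-object level difference to a control value in [0, 1] (the ±10 dB ramp is an illustrative assumption, not a tuned value):

```python
import math

def band_control_gains(object_energy, ambient_energy):
    """Per-band control signal: the larger the ambient-over-object level
    difference in a band, the closer to 1 (stronger processing allowed,
    since the ambience will mask processing artifacts)."""
    gains = []
    for eo, ea in zip(object_energy, ambient_energy):
        diff_db = 10.0 * math.log10((ea + 1e-12) / (eo + 1e-12))
        # ramp from 0 at -10 dB to 1 at +10 dB difference
        gains.append(min(1.0, max(0.0, (diff_db + 10.0) / 20.0)))
    return gains
```

The resulting vector plays the role of control signal 312: near 0 in bands where the object dominates (be conservative), near 1 where the ambience dominates (process aggressively).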
With respect to fig. 4, a flow chart of an example of a loudness measurer shown in fig. 3 is shown.
Thus, as shown in step 401 in fig. 4, an operation of obtaining a first (audio object audio signal) input signal is shown.
The first input signal is in turn divided into frequency bands, as shown in step 403 in fig. 4.
Further, as shown in step 405 of fig. 4, the energy of the frequency band of the first input signal may be determined.
Further, as shown in step 402 in fig. 4, an operation of obtaining a second (environmental audio signal) input signal is shown.
The second input signal is in turn divided into frequency bands, as shown in step 404 in fig. 4.
Further, as shown in step 406 of fig. 4, the energy of the frequency band of the second input signal may be determined.
Further, as shown in step 407 in fig. 4, a loudness difference between the first input signal and the second input signal is determined on a per-frequency-band basis.
Further, as shown in step 409 in fig. 4, a control signal is generated and output based on the loudness difference.
With respect to fig. 5, an example noise suppressor 109 according to some embodiments is shown in further detail. In this example, the noise suppressor 109 includes an input signal to a band divider 501 configured to obtain a first (audio object audio signal) input audio signal and divide the audio signal into bands. These frequency bands may in turn be passed to a per-band processor 503.
In some embodiments, per-band processor 503 includes a band energy determiner/calculator 505 configured to receive a band portion of the audio object audio signal and determine energy (on a per-band basis). In some embodiments, the band energy has been previously determined (e.g., in a loudness estimator) and this value is used. These band energy values may be passed to a fast signal-to-noise ratio (SNR) Infinite Impulse Response (IIR) estimator 507 and a slow signal-to-noise ratio (SNR) Infinite Impulse Response (IIR) estimator 509.
The fast signal-to-noise ratio (SNR) Infinite Impulse Response (IIR) estimator 507 and the slow SNR IIR estimator 509 operate in parallel, tracking the signal energy and producing estimates of the signal-to-noise ratio.
Further, the per-band processor 503 includes a signal-to-noise ratio (SNR) estimate combiner 511 configured to receive the outputs of the fast SNR IIR estimator 507 and the slow SNR IIR estimator 509 and combine them (with weights) to generate a combined SNR estimate, which is passed to multiplier 513.
Multiplier 513 receives the combined SNR estimate and further receives a control signal from the loudness measurer; its output adjusts the gain of the band equalizer applied to the corresponding frequency band of the audio object audio signal. In other words, the equalizer is configured to apply a negative gain equal to the estimated amount of noise.
Since the noise estimate is not entirely accurate, blindly applying a negative gain to the band equalizer 515 that matches the estimate may result in artifacts in the output signal. Applying less gain on the equalizer suppresses less noise, but produces fewer processing artifacts in the output signal.
In these embodiments, the output of multiplier 513 (using information from the loudness measurer) controls how much of the noise estimate is applied when adjusting the equalizer. As the loudness of the ambient sound on the current frequency band becomes larger relative to the object audio on the same frequency band, the multiplier controlling the equalizer gain also becomes larger. The rationale is that stronger noise suppression can be used, as the ambient noise will mask the artifacts created by the noise suppression.
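A sketch of the fast/slow tracking and the scaled equalizer gain (the one-pole smoother form, the 0.5 combination weight, and the dB mapping are assumptions for illustration, not values from the embodiments):

```python
def iir_track(prev, x, alpha):
    """One-pole IIR smoother; the fast and slow trackers differ only in
    the coefficient alpha (a larger alpha reacts faster to changes)."""
    return prev + alpha * (x - prev)

def combined_snr(fast_snr, slow_snr, w_fast=0.5):
    """Weighted combination of the fast and slow SNR estimates."""
    return w_fast * fast_snr + (1.0 - w_fast) * slow_snr

def band_eq_gain_db(noise_estimate_db, control):
    """Negative gain up to the estimated noise level; the loudness-measurer
    control in [0, 1] scales how much of the estimate is actually applied
    (larger when the ambience will mask suppression artifacts)."""
    return -control * max(noise_estimate_db, 0.0)
```

With `control = 0` no suppression is applied at all, matching the "no audible gain, no unnecessary processing" principle stated earlier.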
With respect to fig. 6, a flow chart illustrating the operation of the example noise suppressor shown in fig. 5 is shown.
Thus, as shown in step 601 in fig. 6, an operation of obtaining a first (audio object audio signal) input signal is shown.
The first input signal is in turn divided into frequency bands, as shown in step 603 in fig. 6.
Further, as shown in step 605 of fig. 6, the energy of the frequency band of the first input signal may be determined.
Further, the fast SNR estimation as shown in step 607 in fig. 6 and the slow SNR estimation as shown in step 608 in fig. 6 operate in parallel.
Further, as shown in step 609 of fig. 6, the fast SNR estimate and the slow SNR estimate are combined.
Further, as shown in step 602 in fig. 6, the operation of obtaining a control signal from a loudness measurer is illustrated.
The combined SNR estimate is in turn multiplied by the control signal from the loudness measurer, as shown in step 611 in fig. 6.
Further, as shown in step 613 of fig. 6, the modified combined SNR estimate may be used to control the band equalizer gain to subtract or reject the noise energy of the band.
With respect to fig. 7, an example object separator 103 is shown, according to some embodiments. In this example, it is also shown how the loudness measurement controls object separation.
In this example, object separation is achieved by a beamformer 701. The beamformer 701 is configured to apply beamforming operations to the selected input microphone. The result is an audio signal that includes the object audio and can be output by audio object output 730.
Further, as shown in fig. 7, the ambient sound may be created (the ambient capturer function) by subtracting the object audio from the unprocessed input signal. The remaining residual signal is the ambient sound, which may be output via the ambient output 740.
In some embodiments, the control signal 118 from the loudness measurer is configured to be passed to the object separation direction configurator 705 and the beamformer configurator 707.
The object separation direction configurator 705 may also be configured to receive the beamforming direction from an external control. This may, for example, be set by a user or detected automatically. The beamforming coefficients for the selected direction are selected from a database 709 of pre-computed beamforming coefficients. Database 709 may be configured to contain coefficients and metadata (such as the direction and width of the main lobe per band) as well as characteristics of the beam patterns (such as the suppression gain per band for directions other than the main lobe).
In some embodiments, the beamformer configurator 707 is configured to first select all configurations applicable to the currently set object separation direction. Further, for each frequency band, the control data is used to compare the ratio of the loudness of the ambient sound to that of the object audio with the suppression value of the beamforming coefficients on that frequency band.
Further, if the ambient sound loudness in the current frequency band is much greater than the object audio loudness, the relevance of the beamforming suppression on that frequency band is low, as the ambient sound will mask the object sound regardless.
Furthermore, if the ambient sound loudness in the current frequency band approaches the loudness of the object audio, the relevance of the beamforming suppression on that frequency band is high, as the object audio can be recovered with efficient beamforming.
Furthermore, if the ambient sound loudness in the current frequency band is small compared to the loudness of the object audio, the relevance of the beamforming suppression on that frequency band is low, because the object audio will mask the ambient sound anyway.
Based on the above comparisons, a score for each set of beamforming coefficients may be determined or calculated by assigning weights to the results of the comparison for each frequency band. The weighted results are summed to form the final score. Further, the beamforming coefficients with the highest score may be selected and applied at the beamformer 701.
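The scoring described above might look like the following sketch (the ±10 dB relevance window, the data layout, and the field names are illustrative assumptions):

```python
def config_score(ambient_db, object_db, suppression_db, weights):
    """Score one set of beamforming coefficients: per band, suppression is
    most relevant when ambient and object loudness are comparable."""
    score = 0.0
    for a, o, s, w in zip(ambient_db, object_db, suppression_db, weights):
        relevance = 1.0 if abs(a - o) <= 10.0 else 0.0  # comparable levels
        score += w * relevance * s
    return score

def pick_config(configs, ambient_db, object_db, weights):
    """Select the coefficient set with the highest score."""
    return max(configs,
               key=lambda c: config_score(ambient_db, object_db,
                                          c["suppression_db"], weights))
```

For example, a configuration with strong suppression only in a band where the object audio already dominates scores lower than one whose suppression sits in a band where the two signals are comparable.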
In some embodiments, the beamformer configurator is configured to track a window of recent scores and to favor a configuration that has recently been selected. This avoids switching too frequently between beamforming coefficients.
With respect to fig. 8, a flow chart of the operation of the object separator as shown in fig. 7 is shown.
Thus, as shown in step 801 in fig. 8, an operation of obtaining an input signal (of a microphone audio signal) is shown.
Further, as shown in step 802 of fig. 8, the control signal is obtained from the loudness measurer along with the direction selection control signal.
Further, as shown in step 803 in fig. 8, an object separation direction may be set.
Further, as shown in step 805 in fig. 8, a beamformer configuration may be determined.
Further, as shown in step 807 of fig. 8, the selected beamformer configuration may be applied to the input audio signal.
Further, as shown in step 809 of fig. 8, the remaining audio signals may be determined.
Further, as shown in step 811 in fig. 8, an audio object audio signal and an ambient sound audio signal may be output.
With respect to fig. 9, another configuration of an apparatus suitable for implementing some embodiments is shown. The apparatus shown in fig. 9 differs from the apparatus shown in fig. 1 in that the object separation and noise suppression are tightly coupled and function as one logical module 903. The output of the combined object separator and noise suppressor 903 is passed to the loudness measurer 907 and compared to the output of the ambient capturer 105, and control signals are generated based on the comparison. This differs from the operation shown in the previous embodiments in that the effect of the noise suppression is also taken into account in the loudness measurement and comparison.
Fig. 10 shows another configuration in which the outputs of the object separator 103 and the noise suppressor 1009 are separately passed to the loudness measurer 1007. In these embodiments, unlike the previous configurations, the individual contributions of the object separator and the noise suppressor can be distinguished, and the control may be more fine-grained. The method is more measurement-based and less heuristic, as the contribution of the object separator or the noise suppressor can be measured.
In some embodiments, auditory scene analysis may include determining audio energy distribution in different directions. This may be done using known methods such as beamforming or audio parameter analysis, etc. Auditory scene analysis may compare the object direction to audio energy in the object direction and determine masking of object separation artifacts based thereon.
Figs. 11 and 12 illustrate end-to-end implementations of the embodiments. With respect to fig. 11, a capture device 1101 and a playback device 1111 are shown communicating via a transmission/storage channel 1105.
The capture device 1101 is configured as described above and is configured to transmit an audio stream 1109 of audio objects and ambient sounds. In addition, metadata about the object direction and the ambient sound directional energy distribution is also transmitted. The playback device 1111 is further configured to send back data regarding the listener's orientation 1107.
Listener orientation 1107 will affect sound scene rendering. Masking of the audio object by ambient noise will change with changes in orientation, which can affect the control process of adjusting the object separation and noise suppression parameters.
The capture device 1101 includes a controller 1103 configured to generate the object separation control, the noise suppression control, and the bit rate control. In other words, the controller 1103 is configured to tune the object separation and noise suppression parameters in accordance with the listener orientation data received from the playback device.
In addition to object separation and noise suppression, the capture device and controller are configured to adjust encoding parameters, such as bit rate, according to measured or estimated level differences of the audio object and the ambient sound. For example, if the ambient sound is loud and masks most of the object audio, the bit rate may be set lower. Low bit rates will introduce coding artifacts, but these will be masked by the ambient sound.
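For example, a threshold mapping from the ambient-over-object level difference to a codec bit rate could look like the following sketch (the thresholds and rates are illustrative assumptions, not values from the embodiments):

```python
def select_bitrate_kbps(object_db, ambient_db):
    """Lower the codec bit rate when loud ambience will mask coding
    artifacts in the object audio."""
    diff = ambient_db - object_db
    if diff > 12.0:
        return 32    # ambience dominates: artifacts masked, save bits
    if diff > 0.0:
        return 64    # ambience comparable: moderate rate
    return 128       # object clearly audible: spend more bits on it
```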
In the above-described embodiments, the object separation, noise suppression, and encoding parameters may all be controlled based on the level difference between the audio object and the remaining (or ambient) portion of the captured audio signal. In some embodiments, control is instead based on the absolute level of the remaining or ambient portion rather than on a determined level difference. In such an embodiment, if the ambient or remaining portion is loud (i.e., has a high level relative to a defined threshold), this indicates that the ambient sound is likely to mask a large portion of the object audio, and control is thus determined in a manner similar to that described above (where the object audio may be masked).
With respect to the example shown in fig. 12, a capture device 1101 and a playback device 1111 are shown in communication via the transmission/storage channel 1105.
In this example, the playback device 1111 includes a controller 1203 configured to generate the object separation control, the noise suppression control, and the bit rate control. In other words, the controller 1203 is configured to tune the object separation and noise suppression parameters according to the listener orientation (head tracking) data.
Thus, the capture device transmits audio objects, ambient sounds, and metadata related thereto over the network. The playback device receives the audio and metadata, renders the object audio using the head tracking data, and thereby measures a loudness difference between the object audio and the ambient sound to determine masking. The loudness differences may be estimated using the transmitted audio metadata including spatial parameters of the sound.
The loudness differences may be estimated in the direction of the object sound, because ambient sounds mask artifacts in the object sound better when they arrive from the same direction as the object sound than when they arrive from a different direction. The result is used to control the parameters of the noise suppression performed at playback. As before, if the ambient sound dominates, more noise suppression artifacts are allowed on the object audio signal.
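This direction dependence can be illustrated with a small sketch in which the effective masking level of an ambient source is reduced as its angular separation from the object grows (the 0.1 dB-per-degree penalty and the 30 dB cap are illustrative assumptions, not values from the embodiments):

```python
def directional_masking_db(ambient_db, angle_diff_deg):
    """Effective masking level of an ambient source on the object sound,
    reduced as the angular separation between them grows."""
    penalty = min(30.0, 0.1 * abs(angle_diff_deg))
    return ambient_db - penalty
```

An ambient source co-located with the object masks at its full level; one far off to the side contributes a correspondingly weaker masking level to the loudness-difference estimate.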
With respect to fig. 13, an example electronic device is shown that may be used as a computer, encoder processor, decoder processor, or any of the functional blocks described herein. The device may be any suitable electronic device or apparatus. For example, in some embodiments, device 1600 is a mobile device, user device, tablet computer, audio player, or the like.
In some embodiments, device 1600 includes at least one processor or central processing unit 1607. The processor 1607 may be configured to execute various program code, such as the methods described herein.
In some embodiments, device 1600 includes memory 1611. In some embodiments, the at least one processor 1607 is coupled to the memory 1611. The memory 1611 may be any suitable storage component. In some embodiments, the memory 1611 includes a program code portion for storing program code that can be executed on the processor 1607. Furthermore, in some embodiments, the memory 1611 may also include a stored-data portion for storing data (e.g., data that has been processed or is to be processed according to the embodiments described herein). The implemented program code stored in the program code portion and the data stored in the stored-data portion may be retrieved by the processor 1607 via the memory-processor coupling when needed.
In some embodiments, device 1600 includes user interface 1605. In some embodiments, the user interface 1605 may be coupled to the processor 1607. In some embodiments, the processor 1607 may control the operation of the user interface 1605 and receive inputs from the user interface 1605. In some embodiments, user interface 1605 may enable a user to input commands to device 1600, for example, via a keyboard. In some embodiments, user interface 1605 may enable a user to obtain information from device 1600. For example, user interface 1605 may include a display configured to display information from device 1600 to a user. In some embodiments, user interface 1605 may include a touch screen or touch interface that enables both information to be entered into device 1600 and information to be displayed to a user of device 1600.
In some embodiments, device 1600 includes input/output ports 1609. In some embodiments, the input/output port 1609 comprises a transceiver. In such embodiments, the transceiver may be coupled to the processor 1607 and configured to enable communication with other apparatuses or electronic devices, for example, via a wireless communication network. In some embodiments, the transceiver or any suitable transceiver or transmitter and/or receiver component may be configured to communicate with other electronic devices or apparatuses via a wired or wireless coupling.
The transceiver may communicate with other devices via any suitable known communication protocol. For example, in some embodiments, the transceiver may use a suitable Universal Mobile Telecommunications System (UMTS) protocol, a Wireless Local Area Network (WLAN) protocol such as IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication path (IrDA).
The transceiver input/output port 1609 may be configured to transmit/receive audio signals, bitstreams, and in some embodiments perform the operations and methods described above by executing appropriate code using the processor 1607.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well known that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of a mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Further in this regard, it should be noted that any blocks of logic flows as in the figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on physical media such as memory chips or memory blocks implemented within a processor, magnetic media, and optical media.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory, and removable memory. The data processor may be of any type suitable to the local technical environment and may include, by way of non-limiting example, one or more of a general purpose computer, a special purpose computer, a microprocessor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a gate level circuit, and a processor based on a multi-core processor architecture.
Embodiments of the invention may be practiced in various components such as integrated circuit modules. The design of integrated circuits is generally a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, use sophisticated design rules and libraries of pre-stored design modules to automatically route conductors and locate components on a semiconductor chip. Once the design of the semiconductor circuit is completed, the resulting design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transferred to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of exemplary embodiments of this invention. Various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. Nevertheless, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims (20)

1. An apparatus, comprising: at least one processor and at least one memory including computer program code, the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to:
obtaining at least two audio signals;
determining an audio object portion and an ambient audio portion relative to the at least two audio signals;
determining a level parameter based on the ambient audio portion;
applying noise suppression to the audio object portion, wherein the noise suppression is controlled based on the determined level parameter; and
generating a noise suppressed audio object portion based on the applied noise suppression.
2. The apparatus of claim 1, further caused to:
combining the noise suppressed audio object portion and the ambient audio portion to generate an output audio signal; and
outputting and/or storing the output audio signal.
3. The apparatus of claim 1, wherein the apparatus is further caused to: separate the at least two audio signals into the determined respective audio object portion and ambient audio portion, wherein the apparatus caused to generate the noise suppressed audio object portion is caused to: generate an audio object portion audio signal based on the level parameter at the previous time.
4. The apparatus of claim 3, wherein the apparatus caused to generate the audio object portion audio signal based on the level parameter at the previous time is caused to:
determining an object separation direction parameter;
determining a focus configuration based on the object separation direction parameter and the level parameter at the previous time; and
applying the focus configuration to the at least two audio signals to generate the audio object portion audio signal.
5. The apparatus of claim 4, wherein the apparatus caused to determine the focus configuration is caused to:
generating a first focus filter having a first spatial width based on the level parameter at the previous time being equal to or greater than a first value; and
generating a second focus filter having a second spatial width based on the level parameter at the previous time being less than the first value, wherein the second spatial width is less than the first spatial width and the second focus filter is more spatially selective than the first focus filter.
6. The apparatus of claim 4, wherein the apparatus caused to apply the focus configuration to the at least two audio signals is further caused to: generate the ambient audio portion by removing the audio object portion audio signal from the at least two audio signals.
7. The apparatus of claim 1, wherein the apparatus caused to apply the noise suppression to the audio object portion is caused to:
generating a first signal-to-noise ratio relative to a first time period based on the audio object portions and the ambient audio portions of the at least two audio signals;
generating a second signal-to-noise ratio relative to a second time period based on the audio object portions and the ambient audio portions of the at least two audio signals, wherein the first time period is shorter than the second time period;
combining the first signal-to-noise ratio and the second signal-to-noise ratio to generate a combined signal-to-noise ratio;
multiplying the combined signal-to-noise ratio by a factor based on the level parameter to generate a noise suppression filter parameter; and
applying a noise suppression filter having the noise suppression filter parameters to the audio object portion.
8. The apparatus of claim 1, wherein the apparatus caused to determine the level parameter is caused to: determine a level difference between the audio object portion and the ambient audio portion.
9. The apparatus of claim 8, wherein the level difference is determined further based on the noise suppressed audio object portion.
10. The apparatus of claim 1, wherein the apparatus caused to determine the level parameter is caused to: determine a level difference between the noise suppressed audio object portion and the ambient audio portion.
11. The apparatus of claim 1, wherein the level parameter is determined based on an absolute level of the ambient audio portion.
12. The apparatus of claim 1, wherein the level parameter is determined for a defined or selected frequency band.
13. The apparatus of claim 12, wherein the apparatus caused to apply the noise suppression to the audio object portion based on the determined level parameter is caused to: apply the noise suppression to the defined or selected frequency band.
14. A method, comprising:
obtaining at least two audio signals;
determining an audio object portion and an ambient audio portion relative to the at least two audio signals;
determining a level parameter based on the ambient audio portion;
applying noise suppression to the audio object portion, wherein the noise suppression is controlled based on the determined level parameter; and
generating a noise suppressed audio object portion based on the applied noise suppression.
15. The method of claim 14, wherein the method further comprises:
combining the noise suppressed audio object portion and the ambient audio portion to generate an output audio signal; and
outputting and/or storing the output audio signal.
16. The method of claim 14, further comprising:
separating the at least two audio signals into the determined respective audio object portion and ambient audio portion, wherein generating the noise suppressed audio object portion based on the applied noise suppression comprises: generating an audio object portion audio signal based on the level parameter at the previous time.
17. The method of claim 16, wherein generating the audio object portion audio signal based on the level parameter at the previous time comprises:
determining an object separation direction parameter;
determining a focus configuration based on the object separation direction parameter and the level parameter at the previous time; and
applying the focus configuration to the at least two audio signals to generate the audio object portion audio signal.
18. The method of claim 17, wherein determining the focus configuration based on the object separation direction parameter and the level parameter at the previous time comprises:
generating a first focus filter having a first spatial width based on the level parameter at the previous time being equal to or greater than a first value; and
generating a second focus filter having a second spatial width based on the level parameter at the previous time being less than the first value, wherein the second spatial width is less than the first spatial width and the second focus filter is more spatially selective than the first focus filter.
19. The method of claim 17, wherein applying the focus configuration to the at least two audio signals comprises: generating the ambient audio portion by removing the audio object portion audio signal from the at least two audio signals.
20. The method of claim 14, wherein applying the noise suppression to the audio object portion comprises:
generating a first signal-to-noise ratio relative to a first time period based on the audio object portions and the ambient audio portions of the at least two audio signals;
generating a second signal-to-noise ratio relative to a second time period based on the audio object portions and the ambient audio portions of the at least two audio signals, wherein the first time period is shorter than the second time period;
combining the first signal-to-noise ratio and the second signal-to-noise ratio to generate a combined signal-to-noise ratio;
multiplying the combined signal-to-noise ratio by a factor based on the level parameter to generate a noise suppression filter parameter; and
applying a noise suppression filter having the noise suppression filter parameters to the audio object portion.
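As an illustration only, and not the claimed implementation, the suppression steps recited in claims 7 and 20 can be sketched in Python. The frame lengths, the equal combination weight, the use of the level parameter as a direct multiplier on the combined signal-to-noise ratio, and the Wiener-like gain below are all assumptions introduced for this example:

```python
import numpy as np

def frame_snr(obj, amb, frame_len):
    """Mean object-to-ambient power ratio per frame of frame_len samples."""
    n = min(len(obj), len(amb)) // frame_len
    obj_p = np.square(obj[:n * frame_len]).reshape(n, frame_len).mean(axis=1)
    amb_p = np.square(amb[:n * frame_len]).reshape(n, frame_len).mean(axis=1)
    return obj_p / np.maximum(amb_p, 1e-12)

def suppress(obj, amb, level_param, short_len=256, long_len=2048, weight=0.5):
    # First SNR over a short time period, second over a longer one (claim 7 / 20).
    snr_short = frame_snr(obj, amb, short_len)
    # Repeat each long-term estimate so it aligns with the short frame grid.
    snr_long = np.repeat(frame_snr(obj, amb, long_len), long_len // short_len)
    m = min(len(snr_short), len(snr_long))
    # Combine the two SNR estimates ...
    snr = weight * snr_short[:m] + (1.0 - weight) * snr_long[:m]
    # ... and multiply by a factor based on the level parameter (illustrative).
    snr = snr * level_param
    # Wiener-like noise-suppression filter gain derived from the scaled SNR.
    gain = snr / (1.0 + snr)
    frames = obj[:m * short_len].reshape(m, short_len) * gain[:, None]
    return frames.ravel()
```

The long-term SNR stabilizes the gain against short-term fluctuations, while a smaller level parameter scales the combined SNR down and thereby suppresses the object signal more strongly.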
CN202211360381.5A 2021-11-03 2022-11-02 Compensating for denoising artifacts Pending CN116072137A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2115772.2A GB2612587A (en) 2021-11-03 2021-11-03 Compensating noise removal artifacts
GB2115772.2 2021-11-03

Publications (1)

Publication Number Publication Date
CN116072137A true CN116072137A (en) 2023-05-05

Family

ID=78828387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211360381.5A Pending CN116072137A (en) 2021-11-03 2022-11-02 Compensating for denoising artifacts

Country Status (4)

Country Link
US (1) US20230138240A1 (en)
EP (1) EP4178230A1 (en)
CN (1) CN116072137A (en)
GB (1) GB2612587A (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9368097B2 (en) * 2011-11-02 2016-06-14 Mitsubishi Electric Corporation Noise suppression device
CN104054126B (en) * 2012-01-19 2017-03-29 皇家飞利浦有限公司 Space audio is rendered and is encoded
GB2584838A (en) * 2019-06-11 2020-12-23 Nokia Technologies Oy Sound field related rendering
US11295754B2 (en) * 2019-07-30 2022-04-05 Apple Inc. Audio bandwidth reduction
WO2021119102A1 (en) * 2019-12-09 2021-06-17 Dolby Laboratories Licensing Corporation Adjusting audio and non-audio features based on noise metrics and speech intelligibility metrics

Also Published As

Publication number Publication date
EP4178230A1 (en) 2023-05-10
US20230138240A1 (en) 2023-05-04
GB202115772D0 (en) 2021-12-15
GB2612587A (en) 2023-05-10

Similar Documents

Publication Publication Date Title
US9881635B2 (en) Method and system for scaling ducking of speech-relevant channels in multi-channel audio
KR101238731B1 (en) Method and apparatus for maintaining speech audibility in multi-channel audio with minimal impact on surround experience
US9558755B1 (en) Noise suppression assisted automatic speech recognition
US9460729B2 (en) Layered approach to spatial audio coding
EP3189521B1 (en) Method and apparatus for enhancing sound sources
JP5706513B2 (en) Spatial audio processor and method for providing spatial parameters based on an acoustic input signal
CN110890101B (en) Method and apparatus for decoding based on speech enhancement metadata
CN113597776B (en) Wind noise reduction in parametric audio
CN112565981B (en) Howling suppression method, howling suppression device, hearing aid, and storage medium
GB2585086A (en) Pre-processing for automatic speech recognition
CN113287166A (en) Audio capture arrangement
EP2779161B1 (en) Spectral and spatial modification of noise captured during teleconferencing
EP4161105A1 (en) Spatial audio filtering within spatial audio capture
US20230138240A1 (en) Compensating Noise Removal Artifacts
Ngo Digital signal processing algorithms for noise reduction, dynamic range compression, and feedback cancellation in hearing aids
CN111163411B (en) Method for reducing influence of interference sound and sound playing device
US20230104933A1 (en) Spatial Audio Capture
EP3029671A1 (en) Method and apparatus for enhancing sound sources
CN110121890B (en) Method and apparatus for processing audio signal and computer readable medium
WO2023172609A1 (en) Method and audio processing system for wind noise suppression
JP5348179B2 (en) Sound processing apparatus and parameter setting method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination