CN113270107B

CN113270107B - Method and device for acquiring loudness of noise in audio signal and electronic equipment

Info

Publication number: CN113270107B
Application number: CN202110395202.0A
Authority: CN
Inventors: 吴晨晨
Original assignee: Vivo Mobile Communication Co Ltd
Current assignee: Vivo Mobile Communication Co Ltd
Priority date: 2021-04-13
Filing date: 2021-04-13
Publication date: 2024-02-06
Anticipated expiration: 2041-04-13
Also published as: WO2022218252A1; CN113270107A

Abstract

The application discloses a method and a device for acquiring the loudness of noise in an audio signal, and an electronic device, and belongs to the technical field of audio signal processing. The method for acquiring the loudness of noise in the audio signal comprises the following steps: acquiring N sub-band power spectrums of N audio frames in a target audio signal; obtaining a noise power spectrum estimation corresponding to each of the N audio frames according to M target power spectrums in each of the N sub-band power spectrums; performing smooth update processing on the noise power spectrum estimation; and performing compensation correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal; wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

Description

Method and device for acquiring loudness of noise in audio signal and electronic equipment

Technical Field

The application belongs to the technical field of audio signal processing, and particularly relates to a method and a device for acquiring noise loudness in an audio signal and electronic equipment.

Background

With the development of technology, electronic devices capable of implementing a call function are widely used. A voice enhancement algorithm can remove most of the interfering audio (namely, noise) in a call, so voice enhancement is very important for improving the call quality of an electronic device.

Noise estimation is one of the vital links in speech enhancement. Noise estimation refers to estimating the loudness of noise (i.e., the power spectrum of noise) in an audio signal generated and transmitted during a voice call. Accurate noise estimation of an audio signal is a precondition for ensuring a speech enhancement effect.

A disadvantage of the related art is that the degree of deviation between the estimated value and the true value of the loudness of noise in an audio signal is large.

Disclosure of Invention

The embodiments of the application aim to provide a method and a device for acquiring the loudness of noise in an audio signal, and an electronic device, which can solve the technical problem that the noise estimate deviates substantially from the true value when the minimum tracking method is used for noise estimation.

In a first aspect, an embodiment of the present application provides a method for acquiring a loudness of noise in an audio signal, including: acquiring N sub-band power spectrums of N audio frames in a target audio signal; obtaining a noise power spectrum estimation corresponding to each of the N audio frames according to M target power spectrums in each of the N sub-band power spectrums; performing smooth update processing on the noise power spectrum estimation; and performing compensation correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal; wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

In a second aspect, an embodiment of the present application provides an apparatus for acquiring a loudness of noise in an audio signal, including: an acquisition module, configured to acquire N sub-band power spectrums of N audio frames in a target audio signal; an estimation module, configured to obtain a noise power spectrum estimation corresponding to each of the N audio frames according to M target power spectrums in each of the N sub-band power spectrums acquired by the acquisition module; an updating module, configured to perform smooth update processing on the noise power spectrum estimation obtained by the estimation module; and a correction module, configured to perform compensation correction processing on the noise power spectrum estimation processed by the updating module to obtain the noise loudness of the target audio signal; wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

In a third aspect, embodiments of the present application provide an electronic device comprising a processor, a memory and a program or instructions stored on the memory and executable on the processor, the program or instructions implementing the steps of the method as in the first aspect when executed by the processor.

In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method as in the first aspect.

In a fifth aspect, embodiments of the present application provide a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute programs or instructions to implement a method as in the first aspect.

In the embodiments of the present application, after N sub-band power spectrums of N audio frames in a target audio signal are acquired, a noise power spectrum estimation corresponding to each of the N audio frames may be obtained according to M target power spectrums in each of the N sub-band power spectrums. Smooth update processing is then performed on the noise power spectrum estimation, and compensation correction processing is performed on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal. Here N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2. In other words, the acquiring method collects statistics over two or more target power spectrums and derives the noise power spectrum estimation from them. The method provided by the embodiments of the application can therefore effectively reduce the deviation between the noise estimate (namely, the estimated noise loudness) and the true noise value, and thereby improve the voice enhancement effect and the voice call quality of the electronic device.

Drawings

FIG. 1 is a first flowchart of the steps of a method for acquiring the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 2 is a second flowchart of the steps of a method for acquiring the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 3 is a third flowchart of the steps of a method for acquiring the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 4 is a fourth flowchart of the steps of a method for acquiring the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 5 is a first schematic diagram of the composition of the apparatus for acquiring the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 6 is a second schematic diagram of the composition of the apparatus for acquiring the loudness of noise in an audio signal according to an embodiment of the present application;

FIG. 7 is a first schematic diagram of the composition of the electronic device according to an embodiment of the present application;

FIG. 8 is a second schematic diagram of the composition of the electronic device according to an embodiment of the present application.

Detailed Description

Technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application are within the scope of the protection of the present application.

The terms "first", "second" and the like in the description and in the claims are used to distinguish between similar objects, and not necessarily to describe a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present application can be implemented in sequences other than those illustrated or described herein. Objects distinguished by "first", "second", etc. are generally of one type, and the number of such objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.

The method, the device and the electronic equipment for acquiring the noise loudness in the audio signal provided by the embodiment of the application are described in detail through specific embodiments and application scenes thereof with reference to the accompanying drawings.

With the development of technology, electronic devices such as mobile phones, personal computers, and smart watches, which can realize a call function, have been widely used. In order to improve the call quality, it is necessary to remove or filter the interference audio (i.e., noise) in the call by using a voice enhancement algorithm.

To achieve speech enhancement, the loudness of noise in a call needs to be estimated (i.e., noise estimation). The noise estimation method in the related art may include the following methods:

The first method implements noise estimation through a time-recursive averaging algorithm. This method cannot track changes in the environmental background noise during speech segments, so its noise estimate is not updated in time.

The second method implements noise estimation based on a histogram. Its statistics are computed within a fixed window length and must be recomputed on all frequency bands, so the method has the disadvantage of a large amount of computation.

The third method implements noise estimation by minimum tracking (minimum-tracking algorithms), which obtains a rough estimate of the noise level by tracking the minimum power spectrum band (i.e., the minimum of the spectral power) of each audio frame. This method has low delay and a reasonable amount of computation, and is a relatively ideal noise estimation method.

However, the minimum tracking method has a disadvantage in that the deviation of the noise estimation value from the noise true value is relatively large. Therefore, how to reduce the deviation degree when the noise estimation is performed by using the minimum tracking method is a technical problem to be solved by those skilled in the art.

In order to solve the foregoing technical problem, an embodiment of the present application provides a method for acquiring noise loudness in an audio signal. The execution body of the method may be an acquisition device. The acquisition device may be an electronic device, or may be a functional module and/or a functional entity in the electronic device that is capable of implementing the acquiring method, which may be determined according to actual use requirements. In order to describe the acquiring method more clearly, the following method embodiments take the case where the acquisition device is an electronic device, that is, where the execution body of the acquiring method is an electronic device, as an example.

The electronic device in the embodiment of the present application may be a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or another type of electronic device, which is not limited in this application.

The method for obtaining the loudness of noise in the audio signal provided in the embodiment of the present application will be described in detail below by taking various embodiments as examples.

As shown in FIG. 1, an embodiment of the present application provides a method for acquiring a noise loudness in an audio signal, where the method includes the following steps S101 to S104:

S101, the acquisition device acquires N sub-band power spectrums of N audio frames in a target audio signal.

Optionally, N is an integer greater than or equal to 1.

Optionally, the voice call audio signal may be acquired through the main microphone (Mic) of the electronic device, or may be acquired jointly by the main microphone and the auxiliary microphone of the electronic device. This may be determined according to actual use requirements, and is not limited in the embodiments of the present application.

Alternatively, the target audio signal may be a voice frequency signal transmitted through an electrical signal, and may also be an audio signal transmitted based on an internet protocol.

It will be appreciated that the target audio signal is an audio signal in a voice call audio signal, and the target audio signal may include a voice audio signal or a noise audio signal.

Alternatively, the target audio signal may be a part of or all of the voice call audio signals. The method can be specifically determined according to actual use requirements, and the embodiment of the application is not limited.

It will be appreciated that the noise audio signal may include noise audio signals generated during a voice call due to noisy surrounding environment, and may also include noise audio signals generated during transmission of the target audio signal due to less than ideal communication transmission quality.

Alternatively, the noise audio signal may be a stationary noise audio signal. The stationary noise audio signal may be a noise audio signal having a repetition frequency greater than a preset frequency (e.g., 10 Hz) or a noise audio signal having a sound level fluctuation not greater than a preset decibel (e.g., 3 dB) during measurement.

It is understood that the N audio frames may be N consecutive audio frames or N non-consecutive audio frames.

Alternatively, the N audio frames may be part of or all of the audio frames in the target audio signal. The method can be specifically determined according to actual use requirements, and the embodiment of the application is not limited.

It will be appreciated that each of the N audio frames may correspond to a respective one of the subband power spectrums, and that the N audio frames may correspond to N subband power spectrums.

Alternatively, each of the N subband power spectrums may be a logarithmic value (i.e., log value) of subband spectrum power, respectively.

It will be appreciated that each of the N sub-band power spectrums may be used to characterize how the power of the audio signal is distributed over frequency within the corresponding sub-band spectrum.
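By way of non-limiting illustration, the logarithmic sub-band power spectrum of a single audio frame could be computed roughly as follows. This is only a sketch: the source does not specify the transform, the sub-band layout, or the logarithm base, so the naive DFT, the equal-width sub-bands, the base-10 logarithm, the `floor` guard, and the function name `subband_log_power` are all assumptions made for the example.

```python
import cmath
import math


def subband_log_power(frame, num_subbands, floor=1e-12):
    """Return the log power of each sub-band for one audio frame.

    Assumptions (not from the source): a naive DFT, equal-width sub-bands
    over the non-negative-frequency bins, and base-10 logarithms.
    """
    n = len(frame)
    num_bins = n // 2 + 1  # bins from DC up to Nyquist
    band_size = num_bins // num_subbands
    if band_size == 0:
        raise ValueError("more sub-bands than frequency bins")

    # Power of each DFT bin: |X[k]|^2
    bin_power = []
    for k in range(num_bins):
        x_k = sum(
            frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)
        )
        bin_power.append(abs(x_k) ** 2)

    # Average the bin powers inside each sub-band, then take the logarithm.
    log_power = []
    for b in range(num_subbands):
        start = b * band_size
        # The last sub-band absorbs any leftover bins.
        end = num_bins if b == num_subbands - 1 else start + band_size
        mean_power = sum(bin_power[start:end]) / (end - start)
        # The floor avoids log10(0) for silent sub-bands.
        log_power.append(math.log10(max(mean_power, floor)))
    return log_power
```

Applying this function to each of the N audio frames would yield N sub-band power spectrums, one per frame.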

S102, the acquisition device obtains noise power spectrum estimation corresponding to each audio frame in the N audio frames according to M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums.

Optionally, M is an integer greater than or equal to 2.

It can be understood that the M target power spectrums are power spectrums for obtaining noise power spectrum estimates corresponding to respective audio frames of the N audio frames.

Alternatively, the specific number of the M target power spectrums may be determined according to the number of N audio frames, and may also be determined according to the power values of the respective sub-band power spectrums in the N sub-band power spectrums. The method can be specifically determined according to actual use requirements, and the embodiment of the application is not limited.

Optionally, the noise power spectrum estimation may be derived from the M target power spectrums based on the principles and algorithms of minimum tracking. Specifically, minimum tracking rests on the assumption that, even during voice activity, the power of noisy speech in a single frequency band typically decays to the power level of the noise. Accordingly, the electronic device in the embodiment of the present application may obtain, according to M target power spectrums in each of the N sub-band power spectrums, the noise power spectrum estimation corresponding to each of the N audio frames.

Optionally, the M target power spectrums are power spectrums falling within a preset power spectrum interval. The M target power spectrums may be all power spectrums or partial power spectrums falling within a preset power spectrum interval.

It can be appreciated that, based on the basic principle of minimum tracking, the preset power spectrum interval is a power spectrum interval in which the power value in each sub-band power spectrum is relatively smaller or relatively smallest.

It is understood that the preset power spectrum interval includes at least two power spectrums.

Alternatively, the preset power spectrum interval may be a set of power spectrums of the first P% or the first Q in the order from low to high in each subband power spectrum.

Alternatively, the preset power spectrum interval may be a set of power spectrums of the last R% or the last S in the order from high to low in each subband power spectrum.

It will be appreciated that P and R are each positive numbers and Q and S are each positive integers.

For example, the preset power spectrum interval may be the set of the first 10% or the first 25% of the power spectrums in each sub-band power spectrum in the order from low to high, or may be the set of the first 50 or the first 80 power spectrums in each sub-band power spectrum in the order from low to high. This may be determined according to actual use requirements, and is not limited in the embodiments of the present application.

For example, the preset power spectrum interval may be a set of power spectrums of the last 5% or 8% of the power spectrums of each subband in the order from high to low, or may be a set of power spectrums of the last 30 or 20 of the power spectrums of each subband in the order from high to low. The method can be specifically determined according to actual use requirements, and the embodiment of the application is not limited.
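As a minimal sketch of how the M target power spectrums might be picked from one sub-band power spectrum, the following example sorts the power values from low to high and keeps either the first P% or the first Q of them. The function name `select_target_powers` and its parameters are hypothetical, and rounding the percentage up with `ceil` is an assumption. Taking the last R% or last S values in high-to-low order selects the same kind of low-valued set, so only the low-to-high form is shown.

```python
import math


def select_target_powers(subband_powers, percent=None, count=None):
    """Return the M lowest power values of one sub-band power spectrum.

    Exactly one of `percent` (first P%) or `count` (first Q) must be given.
    M is forced to be at least 2, matching the requirement M >= 2.
    """
    if (percent is None) == (count is None):
        raise ValueError("specify exactly one of percent or count")
    if len(subband_powers) < 2:
        raise ValueError("need at least two power values")

    ordered = sorted(subband_powers)  # low to high
    if percent is not None:
        # Rounding up is an assumption; the source does not say how to round.
        count = math.ceil(len(ordered) * percent / 100.0)
    m = max(2, min(count, len(ordered)))
    return ordered[:m]
```

For instance, with ten power values and `percent=25`, the three lowest values would be kept under these assumptions.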

Further optionally, as shown in FIG. 2, S102 includes S1021 to S1022 described below:

S1021, the acquisition device assigns a weight value to each of the M target power spectrums.

Alternatively, the respective target power spectrums may or may not be given equal weight values. The method can be specifically determined according to actual use requirements, and the embodiment of the application is not limited.

S1022, the acquisition device obtains the noise power spectrum estimation according to the M target power spectrums and the weight values.

It can be understood that the electronic device obtains the noise power spectrum estimation through a weighted average calculation formula according to the M target power spectrums and the weight values.

The electronic device assigns weight values to each of the M target power spectrums, respectively, for the purpose of introducing a weight allocation mechanism in the process of obtaining the noise power spectrum estimate. On the basis, the electronic equipment obtains noise power spectrum estimation according to M target power spectrums and weight values. Therefore, through reasonable distribution of the weight occupied by each target power spectrum in the M target power spectrums, the electronic equipment can further reduce the deviation degree of noise power spectrum estimation relative to the true value of the noise power spectrum.
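The weighted average of S1021 to S1022 can be written as $\hat{P}_{noise} = \sum_{i=1}^{M} w_i P_i / \sum_{i=1}^{M} w_i$, where $P_i$ are the M target power spectrums and $w_i$ their weight values. A minimal sketch follows; the function name `weighted_noise_estimate` is hypothetical, and defaulting to equal weights when none are supplied is an assumption made for the example.

```python
def weighted_noise_estimate(target_powers, weights=None):
    """Weighted average of the M target power values (M >= 2)."""
    if len(target_powers) < 2:
        raise ValueError("M must be at least 2")
    if weights is None:
        # Equal weights reduce the formula to a plain mean.
        weights = [1.0] * len(target_powers)
    if len(weights) != len(target_powers):
        raise ValueError("one weight per target power is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, target_powers)) / total
```

Giving larger weights to the smaller power values pulls the estimate toward the minimum, while equal weights give the mean of the selected interval; which allocation is reasonable depends on actual use requirements.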

S103, the acquisition device carries out smooth updating processing on the noise power spectrum estimation.

Optionally, the electronic device performs a smoothing update process on the noise power spectrum estimate using a smoothing factor.

S104, the acquisition device performs compensation correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal.

Optionally, the electronic device performs a compensation correction process on the noise power spectrum estimate using a compensation correction factor.

It will be appreciated that the purpose of the smoothing update process and the compensation correction process is to smooth and compensate the noise power spectrum estimate so as to make the noise power spectrum estimate more closely match the noise power spectrum true value. The values of the smoothing factor and the compensation correction factor can be specifically determined according to actual use requirements, and the embodiment of the application is not limited.
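The source does not give formulas for the smoothing factor or the compensation correction factor. Purely as an illustration, the sketch below assumes a first-order recursive smoothing, $\hat{N}_t = \alpha \hat{N}_{t-1} + (1-\alpha) \tilde{N}_t$, followed by a multiplicative compensation in the linear power domain. The function names, `alpha`, and `bias` are hypothetical. If the estimates are kept as logarithmic values, the compensation would become an additive offset instead.

```python
def smooth_update(previous, current, alpha):
    """First-order recursive smoothing of a per-sub-band noise estimate.

    alpha close to 1 keeps the previous estimate (slow update);
    alpha close to 0 follows the current raw estimate (fast update).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(previous) != len(current):
        raise ValueError("estimates must have the same number of sub-bands")
    return [alpha * p + (1.0 - alpha) * c for p, c in zip(previous, current)]


def compensate(estimate, bias):
    """Scale a linear-domain noise estimate by a compensation factor."""
    return [bias * value for value in estimate]
```

A call such as `compensate(smooth_update(prev, cur, 0.9), 1.5)` would then correspond to S103 followed by S104 under these assumptions.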

In this way, the electronic device obtains the noise power spectrum estimation corresponding to each of the N audio frames according to M target power spectrums in each of the N sub-band power spectrums, and then obtains the noise loudness of the target audio signal from the noise power spectrum estimation through smooth update processing and compensation correction processing. It should be noted that the noise power spectrum estimation method (i.e., the noise loudness acquiring method) in the related art produces a rough estimate of the noise level in a voice call by tracking the minimum power spectrum band of each audio frame. However, even when a compensation factor is added, the method in the related art still suffers from a large deviation between the noise estimate and the true value. To address this drawback, the acquiring method provided in the embodiments of the present application collects statistics over two or more target power spectrums and derives the noise power spectrum estimation from them. Therefore, compared with the related art, in which noise estimation is performed by tracking the minimum power spectrum band of each audio frame, the method provided by the embodiments of the present application can effectively reduce the deviation between the noise estimate (i.e., the estimated noise loudness) and the true noise value, and thus improve the call quality of the electronic device.

Optionally, the target audio signal comprises long observation windows and short observation windows that alternate with each other.

Illustratively, one of the long observation windows is adjacent to two of the short observation windows. That is, the observation windows of the target audio signal alternate between long and short.

It will be appreciated that the length of a long observation window is greater than the length of a short observation window. The specific lengths of the long observation window and the short observation window may be determined according to actual use requirements, and are not limited in the embodiments of the present application.
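One simple way to picture the alternating observation windows is to partition the frame indices of the target audio signal as in the sketch below. The function name `alternating_windows`, measuring window lengths in frames, and starting with a long window are assumptions made for illustration only.

```python
def alternating_windows(num_frames, long_len, short_len):
    """Split frame indices 0..num_frames-1 into alternating long/short windows.

    Returns a list of (kind, start, end) tuples with `end` exclusive.
    Starting with a long window is an assumption, not taken from the source.
    """
    if short_len <= 0 or long_len <= short_len:
        raise ValueError("require long_len > short_len > 0")
    windows = []
    start = 0
    is_long = True
    while start < num_frames:
        length = long_len if is_long else short_len
        end = min(start + length, num_frames)  # last window may be truncated
        windows.append(("long" if is_long else "short", start, end))
        start = end
        is_long = not is_long
    return windows
```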

Illustratively, as shown in FIG. 3, before S103, the acquiring method provided in the embodiment of the present application may further include the following step S105:

S105, the acquisition device determines the jump level of the noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window.

Optionally, the effective value of each audio frame may be the root mean square (RMS) of the spectral values of that audio frame. In other words, the effective value of each audio frame may be the square root of the average of the squares of the spectral values of that audio frame.

Optionally, the signal-to-noise ratio (SNR) of each audio frame may be the ratio between the power of the speech signal and the power of the noise signal in that audio frame. The signal-to-noise ratio of each audio frame may be expressed in decibels. The higher the signal-to-noise ratio of an audio frame, the less noise, relative to speech, that audio frame contains.

The effective value and signal-to-noise ratio of each audio frame can characterize the noise condition in that audio frame, whereby the change condition of the noise in the target audio signal (i.e., the jump level of the noise in the target audio signal) can be known from the effective value and signal-to-noise ratio of the respective audio frame.
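The two per-frame quantities described above follow directly from their definitions: the effective value is $\sqrt{\frac{1}{K}\sum_{k=1}^{K} x_k^2}$ over the K spectral values of the frame, and the signal-to-noise ratio in decibels is $10\log_{10}(P_{speech}/P_{noise})$. A minimal sketch, with hypothetical function names:

```python
import math


def effective_value(spectral_values):
    """Root mean square of the spectral values of one audio frame."""
    if not spectral_values:
        raise ValueError("frame has no spectral values")
    mean_square = sum(v * v for v in spectral_values) / len(spectral_values)
    return math.sqrt(mean_square)


def snr_db(speech_power, noise_power):
    """Signal-to-noise ratio of one audio frame, expressed in decibels."""
    if speech_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(speech_power / noise_power)
```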

It will be appreciated that the level of the jump in the noise in the target audio signal may be used to measure the change in loudness of the noise.

Optionally, the jump level of the noise in the target audio signal may include one level or may include multiple levels.

Optionally, the jump levels may include a first-level jump and a second-level jump, and may also include a third-level jump and a fourth-level jump.

Optionally, the lower the jump level, the greater the degree of variation in the noise; that is, a first-level jump indicates a larger change than a second-level jump.

It is understood that the value of the jump parameter of the first-level jump is larger than the value of the jump parameter of the second-level jump.

It should be noted that the value of the jump parameter may be represented by a dB change value or a dB change rate of the audio signal. For example, when the dB change value of the audio signal for the first-level jump is greater than that for the second-level jump, or the dB change rate of the audio signal for the first-level jump is greater than that for the second-level jump, the jump parameter value of the first-level jump is greater than that of the second-level jump. It is understood that the jump parameter values may be characterized by values, levels, or percentages. It will be appreciated that the audio signal here is the audio signal of an audio frame in which no speech is present; that is, it is an audio signal capable of characterizing the noise environment in which the user of the electronic device is located.

Illustratively, where the jump level is a first-level jump, it may be understood that the user of the electronic device has entered a markedly noisier or markedly quieter environment.

In the case of a first-level jump, it can be understood that the call quality of the user of the electronic device is significantly improved or significantly reduced.

It can be understood that the electronic device in the embodiment of the present application may accurately determine the jump level of the noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window. Furthermore, the electronic device in the embodiment of the present application may determine or select, according to the jump level, a smooth update processing manner adapted to the jump level, so as to achieve the purpose of further reducing the deviation between the noise estimation and the noise true value.
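The source does not state the rule by which S105 maps effective values and signal-to-noise ratios to a jump level. The sketch below is therefore only one hypothetical reading: frames whose signal-to-noise ratio is low are treated as noise-dominated, the dB change of their mean effective value relative to a reference level serves as the jump parameter, and two thresholds separate a first-level jump from a second-level jump. All constants, the `reference_rms` input, and the return value 0 for "no jump" are assumptions.

```python
import math

FIRST_LEVEL_DB = 6.0   # hypothetical threshold for a first-level jump
SECOND_LEVEL_DB = 2.0  # hypothetical threshold for a second-level jump
SPEECH_SNR_DB = 5.0    # hypothetical: frames at or above this SNR count as speech


def jump_level(frames, reference_rms):
    """Classify the noise jump in a short observation window.

    frames: list of (effective_value, snr_db) pairs, one per audio frame.
    reference_rms: effective value of the noise before the window (assumed input).
    Returns 1 (first-level jump), 2 (second-level jump) or 0 (no jump).
    """
    # Keep only frames without speech, since they characterize the noise.
    noise_rms = [r for r, snr in frames if snr < SPEECH_SNR_DB and r > 0.0]
    if not noise_rms or reference_rms <= 0.0:
        return 0
    mean_rms = sum(noise_rms) / len(noise_rms)
    # The absolute dB change covers both noisier and quieter environments.
    change_db = abs(20.0 * math.log10(mean_rms / reference_rms))
    if change_db >= FIRST_LEVEL_DB:
        return 1
    if change_db >= SECOND_LEVEL_DB:
        return 2
    return 0
```

Consistent with the description, the first-level jump in this sketch corresponds to the larger jump parameter value.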

Optionally, as shown in FIG. 3, in the case where the jump level of the noise in the target audio signal is determined through S105, S103 includes S1031 to S1032 described below:

S1031, the acquisition device performs first smooth update processing on the noise power spectrum estimation in the case where the jump level is the first-level jump.

It is understood that the first smooth update processing may be the smooth update processing corresponding to the first-level jump.

Optionally, in the case where the jump level is the first-level jump, it can be understood that the noise in the target audio signal has become significantly stronger or weaker.

Further alternatively, S1031 described above includes S1031a to S1031b described below:

S1031a, the acquisition device obtains a first smoothing factor according to the audio frame information of the audio frames in the short observation window.

Optionally, in an embodiment of the present application, the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

It will be appreciated that the signal-to-noise ratio information described above characterizes the ratio between the power of the speech signal and the power of the noise signal of the audio frame in the short observation window of the target audio signal.

It will be appreciated that the above-described speech presence probability information may be information characterizing the probability of the presence of speech in an audio frame in a short observation window of the target audio signal.

Optionally, in the embodiment of the present application, the speech presence probability information may be obtained by means of neural-network-based voice activity detection (Neural Network Voice Activity Detection, NNVAD).

S1031b, the acquisition device performs first smoothing update processing on the noise power spectrum estimation by adopting a first smoothing factor.

The acquisition device of the embodiment of the present application performs the first smooth update processing through S1031a to S1031b for the following reasons. When noise loudness is estimated by means of minimum tracking, the noise estimate is updated in both speech segments and non-speech segments. However, the noise power spectrum estimation is easily affected by the high signal-to-noise ratio in speech segments, which forces the estimate upward. In other words, because the signal-to-noise ratio of the audio signal in a speech segment is high, the noise estimate for that segment deviates considerably from, and is higher than, the true noise value. Therefore, the electronic device in the embodiment of the present application uses S1031a to S1031b to obtain the first smoothing factor by combining the speech presence probability information and the signal-to-noise ratio information of the audio signal, and performs the first smooth update processing with the obtained first smoothing factor. In this way, the electronic device provided by the embodiment of the application can significantly reduce the deviation between the noise estimate and the true noise value.
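The source does not disclose how the first smoothing factor is computed from the signal-to-noise ratio information and the speech presence probability information. The following sketch shows one hypothetical mapping that is merely consistent with the stated rationale: the more likely speech is present, or the higher the signal-to-noise ratio, the closer the factor is to 1, so that the noise estimate is barely updated and is not lifted by speech. The function name, `alpha_min`, and `snr_high_db` are assumptions, as is the recursive form in which the factor weights the previous estimate.

```python
def first_smoothing_factor(speech_prob, snr_db, alpha_min=0.7, snr_high_db=20.0):
    """Hypothetical first smoothing factor in [alpha_min, 1].

    speech_prob: speech presence probability in [0, 1] (e.g. from a VAD).
    snr_db: signal-to-noise ratio of the frame(s), in decibels.
    The factor is assumed to weight the previous noise estimate:
        new = alpha * previous + (1 - alpha) * current
    """
    # Map the SNR linearly onto [0, 1]; snr_high_db and above count as fully "speech-like".
    snr_weight = min(max(snr_db / snr_high_db, 0.0), 1.0)
    prob = min(max(speech_prob, 0.0), 1.0)
    # Hold the estimate as strongly as the stronger of the two cues suggests.
    hold = max(prob, snr_weight)
    return alpha_min + (1.0 - alpha_min) * hold
```

With this mapping, a noise-only frame at low signal-to-noise ratio yields the fastest update (`alpha_min`), while a frame that almost certainly contains speech leaves the estimate nearly unchanged.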

S1032, the acquisition device performs second smooth update processing on the noise power spectrum estimation in the case where the jump level is the second-level jump.

It is understood that the second smooth update processing may be the smooth update processing corresponding to the second-level jump.

Optionally, in the embodiment of the present application, in the case where the jump level is the second-level jump, it may be understood that the noise in the target audio signal has become stronger or weaker to a smaller or less obvious extent.

Further alternatively, S1032 includes S1032a to S1032c described below:

S1032a, the acquisition device obtains an initial smoothing factor by fitting, according to the noise power spectrum estimation corresponding to the audio frames in a first long observation window and the noise power spectrum estimation corresponding to the audio frames in a first short observation window.

It will be appreciated that the first long observation window and the first short observation window are two adjacent observation windows, one long and one short.

S1032b, the acquisition device performs superposition fitting on the initial smoothing factor using the audio frame information of the audio frames in the first long observation window to obtain a second smoothing factor.

Optionally, the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

It will be appreciated that the signal-to-noise ratio described above characterizes the ratio between the power of the speech signal and the power of the noise signal of an audio frame in a short observation window of the target audio signal.

It will be appreciated that the above-described speech presence probability information may be information characterizing the probability that speech is present in an audio frame in a short observation window of the target audio signal.

Optionally, the speech presence probability information may be obtained by neural-network-based voice activity detection (Neural Network Voice Activity Detection, NNVAD).

S1032c, the acquisition device adopts a second smoothing factor to carry out second smoothing update processing on the noise power spectrum estimation.

The acquisition device of the embodiment of the present application performs the second smooth update processing through S1032a to S1032c for the following reasons. When the jump level is a second-level jump, the electronic device may consider that its user has entered a noise field environment that has changed only slightly (i.e., the change in the noise field environment surrounding the user is relatively insignificant). In this case, the electronic device may first count the M target power spectrums (i.e., the power spectrums within the small value interval) of each audio frame within the long observation window, so as to obtain a noise power spectrum estimate within the long observation window. Then, an initial smoothing factor is calculated by fitting, based on the noise power spectrum estimate of the current observation window and the noise power spectrum estimate of the previous observation window (i.e., the first long observation window and the first short observation window). Finally, the effective value and the signal-to-noise ratio of each audio frame in the long observation window are counted, the signal-to-noise ratio of the signal in the long observation window is calculated, and, in combination with the speech presence probability information of the audio signal segment in the long observation window, a second smoothing factor is obtained by superposition fitting and used to update the current noise power spectrum estimate. Thus, when the user of the electronic device enters a slightly changed noise field environment, the electronic device of the embodiment of the application can determine or select a smoothing factor suited to that environment for the update processing, so as to reduce the deviation between the noise estimate and the true noise value.
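S1032a to S1032c can be illustrated with a minimal Python sketch. The application does not disclose the fitting formulas, so everything here is an assumption made for illustration: the function names, the linear mapping from the dB gap to the initial factor, the 20 dB reference signal-to-noise ratio, and the first-order recursive update.

```python
def initial_smoothing_factor(long_est_db, short_est_db, max_gap_db=10.0):
    """S1032a (hypothetical fit): the closer the two window estimates,
    the heavier the smoothing."""
    gap = min(abs(long_est_db - short_est_db), max_gap_db)
    # A gap of 0 dB maps to 0.95; a gap of max_gap_db or more maps to 0.5.
    return 0.95 - 0.45 * gap / max_gap_db


def second_smoothing_factor(alpha0, snr_db, speech_prob, snr_ref_db=20.0):
    """S1032b (hypothetical overlay): push the factor toward 1 when speech
    is likely or the signal-to-noise ratio is high."""
    snr_term = min(max(snr_db / snr_ref_db, 0.0), 1.0)
    hold = max(speech_prob, snr_term)  # 0 = noise only, 1 = clearly speech
    return alpha0 + (1.0 - alpha0) * hold


def smooth_update(prev_est_db, new_est_db, alpha):
    """S1032c (assumed form): first-order recursive smoothing in dB."""
    return alpha * prev_est_db + (1.0 - alpha) * new_est_db
```

With these assumptions, a factor close to 1 keeps the previous noise estimate almost unchanged during speech, while a smaller factor lets the estimate follow the new window in noise-only segments.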

As a further example, as shown in fig. 4, the method of acquiring the loudness of noise in an audio signal may be implemented by the following S201 to S220:

S201, the acquisition device calculates a subband power spectrum of the signal acquired by a main microphone.

Wherein the subband power spectrum is a logarithmic power spectrum.

S202, the obtaining device calculates the effective value, the signal-to-noise ratio, the small value interval and the weight value distribution condition of the current audio frame to obtain noise power spectrum estimation.

The small value interval is the small value interval of N sub-band power spectrums. Namely: the small value interval is M target power spectrums falling into a preset power spectrum interval.

S203, the obtaining device updates the effective value, the signal-to-noise ratio, the noise power spectrum estimation and the voice frame identifier in the long observation window.

S204, the obtaining device updates the effective value, the signal-to-noise ratio, the noise power spectrum estimation and the voice frame identifier in the short observation window.

Wherein the voice frame identifiers in S203 and S204 are used to indicate voice existence probability information.

S205, the acquisition device votes on each state variable in the short observation window, and judges the audio signal attribute of the current observation window and the jump level of the noise.

The jump level of the noise may be understood as the magnitude of an abrupt change in the background noise. A first-level jump indicates that the noise becomes much larger or smaller, for example by 10 dB or more. A second-level jump indicates that the noise becomes slightly larger or smaller, for example by less than 10 dB.

If the audio signal attribute of the current observation window is changed, recording the audio signal attribute of the current observation window, and outputting an identifier of the steady state noise state change of the audio signal segment.

S206, the acquisition device judges whether the hopping level is a first-level hopping.

If the determination result is yes, step S207 is executed, and if the determination result is no, step S215 is executed.

Wherein the jump level is determined based on the noise state identifier.

S207, the acquisition device judges whether the background noise is increased.

If the determination result is yes, step S208 is executed, and if the determination result is no, step S210 is executed.

S208, the obtaining device counts the effective value and the signal to noise ratio in the short observation window, and decides the selected small value interval range and the distribution weight coefficient according to the distribution rule, so as to obtain the noise power spectrum estimation of the short observation window section signal.

The small value interval range may be understood as the range of the M target power spectrums.

S209, the acquisition device counts the speech presence probability information and the signal-to-noise ratio information of each audio frame in the short observation window, and adaptively selects a smoothing factor for the steady-state noise power spectrum estimation according to the audio signal characteristics of the current observation window.

S210, the acquisition device judges whether the background noise reduction amplitude is smaller than 20 dB.

If the determination result is yes, step S211 is executed, and if the determination result is no, step S214 is executed.

S211, the obtaining device counts the effective value and the signal to noise ratio in the short observation window, and decides the range of the selected frame small value interval and the distribution weight coefficient according to the distribution rule, so as to obtain the noise power spectrum estimation of the short observation window section signal.

S212, the acquisition device counts the speech presence probability information and the signal-to-noise ratio information of each audio frame in the short observation window, and adaptively selects a smoothing factor for the steady-state noise power spectrum estimation according to the audio signal characteristics of the current observation window.

S213, the acquisition device updates the current window noise level by combining the noise power spectrum estimated value of the previous observation window.

S214, the obtaining device counts the effective value and the signal-to-noise ratio of each audio frame in the short observation window, selects small value intervals of partial audio frames, and distributes weight coefficients to obtain noise power spectrum estimation in the short observation window.

S215, the acquisition device counts the small value interval of each audio frame in the long observation window, and noise power spectrum estimation in the long observation window is obtained.

S216, the obtaining device fits and calculates a smoothing factor according to the noise power spectrum estimation of the current observation window and the noise power spectrum estimation of the previous observation window.

S217, the acquisition device counts the effective value and the signal-to-noise ratio of each audio frame in the long observation window to obtain the signal-to-noise ratio of the signal segment in the observation window.

S218, the acquisition device judges the attribute of the speech signal segment by combining the speech presence probability information of the signal segment in the long observation window with the signal-to-noise ratio information of the signal segment, and superimposes the result on the fitted smoothing factor to generate an aggregate smoothing window coefficient.

S219, the obtaining device updates the noise level of the current observation window by combining the noise power spectrum estimated value of the previous observation window.

S220, the acquisition device outputs the current noise power spectrum estimated value through the compensation correction function.

It will be appreciated that if the noise of the background environment undergoes a first-level jump, the noise level is significantly higher or lower than the previous noise level. When the user enters an environment in which the background noise suddenly becomes stronger, the distribution of the effective values and signal-to-noise ratios is counted based on the state variables of the audio signal in the short observation window, the small value intervals of some of the audio frames are selected, and an initial noise power spectrum estimate of the signal segment in the short window is obtained. A smoothing factor for the noise power spectrum estimation is then adaptively selected according to the signal-to-noise ratio distribution in the short observation window and the speech probability of each audio frame, and the noise power spectrum estimate of the current signal segment is updated again. When the user enters an environment in which the background noise suddenly becomes weaker (for example, the background noise drops by less than 20 dB), the same estimation method as above is adopted. When the noise power in the short observation window drops by a larger amount (for example, by 20 dB or more), it is assumed that the user of the electronic device has entered a quieter environment; new small value intervals of some of the audio frames are selected according to the distribution of the effective values and signal-to-noise ratios in the short observation window, and weight coefficients are then assigned to obtain the noise power spectrum estimate in the short window.

It will be appreciated that if the background noise does not trigger the first-level jump identifier, the jump level may be considered a second-level jump (i.e., a relatively slight jump). When a noise field environment with a second-level jump is entered, the electronic device first counts the small value intervals of each audio frame in the long observation window to obtain the noise power spectrum estimate in the long observation window. A smoothing factor is then calculated by fitting according to the noise power spectrum estimate of the current observation window and that of the previous observation window. Next, the effective value and the signal-to-noise ratio of each audio frame in the long observation window are counted, the signal-to-noise ratio of the audio signal in the long observation window is calculated, and, in combination with the speech presence probability information of the audio signal segment in the long observation window, the fitted smoothing factor is superimposed to generate an aggregate smoothing window coefficient. Finally, the current noise power spectrum estimate is updated.
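The two jump levels can be sketched as a simple threshold test. This is only an illustration under stated assumptions: the application determines the jump level from the effective values and signal-to-noise ratios of the frames in the short observation window, whereas the hypothetical function `classify_jump` below compares two noise levels in dB directly and uses the 10 dB figure given as an example for S205.

```python
def classify_jump(prev_noise_db, curr_noise_db, first_level_threshold_db=10.0):
    """Classify a change in background noise level.

    Returns (level, direction, magnitude_db). The threshold of 10 dB is the
    example value from the description, not a fixed requirement.
    """
    delta = curr_noise_db - prev_noise_db
    magnitude = abs(delta)
    level = "first" if magnitude >= first_level_threshold_db else "second"
    if delta > 0:
        direction = "up"
    elif delta < 0:
        direction = "down"
    else:
        direction = "none"
    return level, direction, magnitude
```

For example, a rise from -60 dB to -45 dB would be reported as a first-level upward jump, while a drop from -60 dB to -63 dB would be reported as a second-level downward jump.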

Through the above steps S201 to S220, the electronic device may reduce the deviation between the noise power spectrum estimation and the noise power spectrum true value, so as to accurately estimate the loudness of noise in the audio signal, and thereby improve the communication quality of the electronic device.
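The branching among S206, S207 and S210 can be summarized in a short sketch. The function `select_branch` and its parameters are hypothetical; it only returns the step labels that the description associates with each decision outcome, and it stops before the output step S220.

```python
def select_branch(jump_level, noise_increased=False, reduction_db=0.0,
                  large_drop_db=20.0):
    """Return the branch-specific steps of fig. 4 for the given conditions.

    jump_level      -- "first" or "second" (result of S205)
    noise_increased -- answer to S207
    reduction_db    -- how far the background noise dropped, in dB (S210)
    """
    if jump_level == "first":                    # S206: yes
        if noise_increased:                      # S207: yes
            return ["S208", "S209"]
        if reduction_db < large_drop_db:         # S210: yes
            return ["S211", "S212", "S213"]
        return ["S214"]                          # S210: no
    return ["S215", "S216", "S217", "S218", "S219"]  # S206: no
```

Keeping the decision logic separate from the estimators makes it easy to check that every combination of jump level and noise direction reaches exactly one branch.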

The embodiment of the application further provides an apparatus 200 for acquiring the noise loudness in the audio signal, where the apparatus 200 may be an apparatus, or may be a component, an integrated circuit, or a chip in a terminal. The acquiring device 200 may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a cell phone, tablet computer, notebook computer, palm computer, vehicle-mounted electronic device, wearable device, ultra-mobile personal computer (ultra-mobile personal computer, UMPC), netbook or personal digital assistant (personal digital assistant, PDA), etc., and the non-mobile electronic device may be a server, network attached storage (Network Attached Storage, NAS), personal computer (personal computer, PC), television (TV), teller machine or self-service machine, etc., and the embodiments of the present application are not limited in particular.

The acquiring device 200 in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of the present application.

The acquiring apparatus 200 provided in the embodiment of the present application can implement each process implemented by the method embodiments of fig. 1 to fig. 4, and in order to avoid repetition, a description is omitted here.

As shown in fig. 5, the acquisition apparatus 200 includes: the acquiring module 210 is configured to acquire N subband power spectrums of N audio frames in the target audio signal. The estimation module 220 is configured to obtain a noise power spectrum estimation corresponding to each of the N audio frames according to M target power spectrums in each of the N subband power spectrums obtained by the obtaining module 210. The updating module 230 is configured to perform a smooth updating process on the noise power spectrum estimation obtained by the estimating module 220. The correction module 240 is configured to compensate and correct the noise power spectrum estimation processed by the update module 230, so as to obtain the noise loudness of the target audio signal. Wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.
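The relationship between the four modules can be sketched as a chain of stages. The class name `NoiseLoudnessPipeline` and the toy stage functions in the usage lines are hypothetical placeholders; the application does not specify the internal form of any module, in particular not the compensation correction function.

```python
class NoiseLoudnessPipeline:
    """Chains acquisition, estimation, smoothing update and correction.

    Each stage is an injected callable, so the sketch makes no claim about
    how the individual modules are implemented.
    """

    def __init__(self, acquire, estimate, update, correct):
        self.acquire = acquire      # frame -> subband power spectrum
        self.estimate = estimate    # spectrum -> per-frame noise estimate
        self.update = update        # list of estimates -> smoothed estimate
        self.correct = correct      # smoothed estimate -> noise loudness

    def run(self, audio_frames):
        spectra = [self.acquire(frame) for frame in audio_frames]
        estimates = [self.estimate(spectrum) for spectrum in spectra]
        smoothed = self.update(estimates)
        return self.correct(smoothed)


# Toy stages for illustration only (values in dB).
pipeline = NoiseLoudnessPipeline(
    acquire=lambda frame: sorted(frame),
    estimate=lambda spectrum: sum(spectrum[:2]) / 2.0,
    update=lambda estimates: sum(estimates) / len(estimates),
    correct=lambda value: value + 3.0,
)
loudness = pipeline.run([[-40.0, -60.0, -58.0], [-50.0, -62.0, -30.0]])
```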

Optionally, in the embodiment of the present application, the M target power spectrums are power spectrums that fall within a preset power spectrum interval. The M target power spectrums may be all power spectrums or partial power spectrums falling within a preset power spectrum interval. The preset power spectrum interval is a set of power spectrums of the first P% or the first Q of the power spectrums of each sub-band in the order from low to high, or a set of power spectrums of the last R% or the last S of the power spectrums of each sub-band in the order from high to low, wherein P and R are positive numbers respectively, and Q and S are positive integers respectively.
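One possible reading of the preset power spectrum interval is sketched below. The function `select_target_powers` is hypothetical; it sorts the sub-band powers of one frame from low to high and keeps either the first P% or the first Q of them. Rounding P% up and enforcing at least two values (because M is at least 2) are assumptions of this sketch.

```python
import math


def select_target_powers(subband_powers, percent=None, count=None):
    """Pick the lowest-valued sub-band powers of one audio frame.

    Exactly one of `percent` (P, in percent) or `count` (Q) must be given.
    """
    if (percent is None) == (count is None):
        raise ValueError("give exactly one of percent or count")
    ordered = sorted(subband_powers)  # low to high
    if percent is not None:
        count = math.ceil(len(ordered) * percent / 100.0)
    m = max(2, min(count, len(ordered)))  # M >= 2, capped at the band count
    return ordered[:m]
```

Taking the last R% or the last S values of a high-to-low ordering selects the same low-power end of the spectrum, so a single ascending sort covers both formulations.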

Optionally, in the embodiment of the present application, the target audio signal includes long observation windows and short observation windows that alternate with each other. As shown in fig. 6, the acquiring apparatus 200 further includes a determining module 250. The determining module 250 is configured to determine the jump level of noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window, before the updating module 230 performs the smoothing update processing on the noise power spectrum estimate obtained by the estimating module 220. The update module 230 is specifically configured to: in the case that the jump level is a first-level jump, perform the first smoothing update processing on the noise power spectrum estimate obtained by the estimation module 220; and in the case that the jump level is a second-level jump, perform the second smoothing update processing on the noise power spectrum estimate obtained by the estimation module 220. The jump parameter value of the first-level jump is larger than the jump parameter value of the second-level jump.

Optionally, in the embodiment of the present application, the update module 230 is specifically configured to: obtain a first smoothing factor according to the audio frame information of the audio frames in the short observation window; and perform the first smoothing update processing on the noise power spectrum estimate obtained by the estimation module 220 using the first smoothing factor. The audio frame information includes signal-to-noise ratio information and speech presence probability information.
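A minimal sketch of how a first smoothing factor might be derived from the frames of a short observation window follows. The application does not give the mapping, so the function `first_smoothing_factor`, its bounds of 0.6 and 0.98, and the 20 dB reference signal-to-noise ratio are all assumptions chosen for illustration.

```python
def first_smoothing_factor(snr_db_frames, speech_prob_frames,
                           alpha_min=0.6, alpha_max=0.98, snr_ref_db=20.0):
    """Map per-frame SNR and speech presence probability to one factor.

    A factor near alpha_max means the previous noise estimate is mostly
    kept, which limits the upward bias caused by speech frames.
    """
    if not snr_db_frames or len(snr_db_frames) != len(speech_prob_frames):
        raise ValueError("need one SNR and one probability per frame")
    n = len(snr_db_frames)
    mean_snr = sum(snr_db_frames) / n
    mean_prob = sum(speech_prob_frames) / n
    snr_term = min(max(mean_snr / snr_ref_db, 0.0), 1.0)
    speech_like = max(mean_prob, snr_term)  # 0 = noise only, 1 = speech
    return alpha_min + (alpha_max - alpha_min) * speech_like
```

Under these assumptions, a window dominated by speech yields a factor near 0.98 and the noise estimate changes slowly, whereas a noise-only window yields 0.6 and the estimate adapts quickly.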

Optionally, in the embodiment of the present application, the update module 230 is specifically configured to: obtain an initial smoothing factor by fitting according to the noise power spectrum estimate corresponding to the audio frames in the first long observation window and the noise power spectrum estimate corresponding to the audio frames in the first short observation window; perform superposition fitting on the initial smoothing factor using the audio frame information of the audio frames in the first long observation window to obtain a second smoothing factor; and perform the second smoothing update processing on the noise power spectrum estimate obtained by the estimation module 220 using the second smoothing factor. The audio frame information includes signal-to-noise ratio information and speech presence probability information, and the first long observation window and the first short observation window are adjacent observation windows.

Optionally, in the embodiment of the present application, the estimation module 220 is specifically configured to: assign a weight value to each of the M target power spectrums; and obtain the noise power spectrum estimate according to the M target power spectrums and the weight values.
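The weighting step can be sketched as a weighted mean over the M target power spectrums of one frame. The function `weighted_noise_estimate` and its default linearly decreasing weights are hypothetical; the application only states that each target power spectrum receives a weight value. Averaging in dB is also an assumption, made because the subband power spectrum is described as logarithmic.

```python
def weighted_noise_estimate(target_powers_db, weights=None):
    """Combine M target power spectra (dB) into one noise estimate.

    Default weights (hypothetical) decrease linearly, so lower powers,
    which are sorted first, contribute more.
    """
    ordered = sorted(target_powers_db)
    m = len(ordered)
    if m < 2:
        raise ValueError("M must be at least 2")
    if weights is None:
        weights = [m - i for i in range(m)]
    if len(weights) != m:
        raise ValueError("need one weight per target power spectrum")
    total = float(sum(weights))
    return sum(w * p for w, p in zip(weights, ordered)) / total
```

Because at least two values are combined, the result lies above the single minimum that a minimum-tracking estimator would report for the same frame.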

The acquiring device 200 provided in the embodiment of the present application performs statistics on two or more target power spectrums, and thus obtains noise power spectrum estimation according to the two or more target power spectrums. The acquisition apparatus 200 provided by the embodiments of the present application can effectively reduce the degree of deviation between the noise estimate (i.e., the loudness of the noise obtained by the estimation) and the true value of the noise, compared to the related art in which the noise estimate is made by tracking the minimum power spectral band per audio frame.

As shown in fig. 7, the embodiment of the present application further provides an electronic device 100, which includes a processor 110, a memory 109, and a program or an instruction stored in the memory 109 and capable of running on the processor 110, where the program or the instruction implements each process of the obtaining method according to any embodiment of the present application when executed by the processor 110, and the same technical effects can be achieved, so that repetition is avoided and no further description is given here.

The electronic device in the embodiment of the application includes the mobile electronic device and the non-mobile electronic device.

Fig. 8 is a schematic hardware structure of an electronic device 100 implementing an embodiment of the present application.

The electronic device 100 includes, but is not limited to: radio frequency unit 101, network module 102, audio output unit 103, input unit 104, sensor 105, display unit 106, user input unit 107, interface unit 108, memory 109, and processor 110.

The audio output unit 103 serves as an acquisition module for acquiring N subband power spectrums of N audio frames in the target audio signal. The processor 110 is configured to serve as an estimation module, an update module, and a correction module, and is configured to obtain noise power spectrum estimates corresponding to each of the N audio frames according to M target power spectrums in each of the N subband power spectrums obtained by the obtaining module, and to perform smooth update processing on the noise power spectrum estimates obtained by the estimation module, and to perform compensation correction processing on the noise power spectrum estimates processed by the update module, so as to obtain noise loudness of the target audio signal. Wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2.

Optionally, in the embodiment of the present application, the target audio signal includes long observation windows and short observation windows that alternate with each other, and the processor 110 is further configured to serve as a determining module that, before the updating module performs the smooth updating process on the noise power spectrum estimate obtained by the estimating module, determines the jump level of noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window. The updating module is specifically configured to: in the case that the jump level is a first-level jump, perform the first smoothing update processing on the noise power spectrum estimate obtained by the estimation module; and in the case that the jump level is a second-level jump, perform the second smoothing update processing on the noise power spectrum estimate obtained by the estimation module. The jump parameter value of the first-level jump is larger than the jump parameter value of the second-level jump.

Optionally, in the embodiment of the present application, the processor 110 serves as the update module and is specifically configured to: obtain a first smoothing factor according to the audio frame information of the audio frames in the short observation window; and perform the first smoothing update processing on the noise power spectrum estimate obtained by the estimation module using the first smoothing factor. The audio frame information includes signal-to-noise ratio information and speech presence probability information.

Optionally, in an embodiment of the present application, the processor 110 is specifically configured to: obtain an initial smoothing factor by fitting according to the noise power spectrum estimate corresponding to the audio frames in the first long observation window and the noise power spectrum estimate corresponding to the audio frames in the first short observation window; perform superposition fitting on the initial smoothing factor using the audio frame information of the audio frames in the first long observation window to obtain a second smoothing factor; and perform the second smoothing update processing on the noise power spectrum estimate obtained by the estimation module using the second smoothing factor. The audio frame information includes signal-to-noise ratio information and speech presence probability information, and the first long observation window and the first short observation window are adjacent observation windows.

Optionally, in an embodiment of the present application, the processor 110 serves as the estimation module and is specifically configured to: assign a weight value to each of the M target power spectrums; and obtain the noise power spectrum estimate according to the M target power spectrums and the weight values.

The electronic device 100 provided in the embodiment of the present application performs statistics on two or more target power spectrums, and thus obtains noise power spectrum estimation according to the two or more target power spectrums. Compared to the related art that performs noise estimation by tracking the minimum power spectral band per audio frame, the electronic device 100 provided by the embodiments of the present application can effectively reduce the degree of deviation between the noise estimation (i.e., the loudness of the noise obtained by the estimation) and the noise true value.

Those skilled in the art will appreciate that the electronic device 100 may further include a power source (e.g., a battery) for powering the various components, and that the power source may be logically coupled to the processor 110 via a power management system to perform functions such as managing charging, discharging, and power consumption via the power management system. The electronic device structure shown in fig. 8 does not constitute a limitation of the electronic device, and the electronic device may include more or less components than shown, or may combine certain components, or may be arranged in different components, which are not described in detail herein.

It should be appreciated that in embodiments of the present application, the input unit 104 may include a graphics processor (Graphics Processing Unit, GPU) 1041 and a microphone 1042, the graphics processor 1041 processing image data of still pictures or video obtained by an image capturing device (e.g., a camera) in a video capturing mode or an image capturing mode. The display unit 106 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 107 includes a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts of a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and so forth, which are not described in detail herein. Memory 109 may be used to store software programs as well as various data including, but not limited to, application programs and an operating system. The processor 110 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.

The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored, where the program or the instruction implements each process of the above-described embodiment of the obtaining method when executed by a processor, and the same technical effects can be achieved, so that repetition is avoided, and no further description is given here.

The processor is a processor in the electronic device in the above embodiment. Readable storage media include computer readable storage media such as Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic or optical disks, and the like.

The embodiment of the application further provides a chip. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run programs or instructions to implement the processes of the above method embodiment and achieve the same technical effects. To avoid repetition, the description is not repeated here.

It should be understood that the chip referred to in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solutions of the present application may be embodied essentially or in a part contributing to the prior art in the form of a computer software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the embodiments of the present application.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims

1. A method for obtaining the loudness of noise in an audio signal, comprising:

acquiring N sub-band power spectrums of N audio frames in a target audio signal;

obtaining noise power spectrum estimation corresponding to each audio frame in the N audio frames according to M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums;

performing smooth updating processing on the noise power spectrum estimation;

performing compensation correction processing on the processed noise power spectrum estimation to obtain the noise loudness of the target audio signal;

wherein N is an integer greater than or equal to 1, and M is an integer greater than or equal to 2;

the target audio signal comprises a long observation window and a short observation window which are mutually alternated;

the smoothing update processing for the noise power spectrum estimation includes:

under the condition that the jump level of noise in the target audio signal is a second jump, fitting according to the noise power spectrum estimation corresponding to the audio frame in the first long observation window and the noise power spectrum estimation corresponding to the audio frame in the first short observation window to obtain an initial smoothing factor;

performing superposition fitting on the initial smoothing factor through the audio frame information of the audio frames in the first long observation window to obtain a second smoothing factor;

performing the second smoothing update processing on the noise power spectrum estimation by adopting the second smoothing factor;

wherein the first long observation window and the first short observation window are adjacent observation windows.

2. The method according to claim 1, wherein the M target power spectrums are power spectrums falling within a preset power spectrum interval; the preset power spectrum interval is a set of top P% or top Q power spectrums arranged in a low-to-high order in each sub-band power spectrum, or the preset power spectrum interval is a set of bottom R% or bottom S power spectrums arranged in a high-to-low order in each sub-band power spectrum, wherein P and R are positive numbers respectively, and Q and S are positive integers respectively.

3. The acquisition method according to claim 1, characterized in that before the smoothing update processing of the noise power spectrum estimation, the acquisition method further comprises:

according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window, determining the jump level of noise in the target audio signal;

the performing smooth updating processing on the noise power spectrum estimation further comprises:

under the condition that the jump level is a first jump, carrying out first smooth updating processing on the noise power spectrum estimation;

wherein the jump parameter value of the first jump is larger than that of the second jump.
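As a rough illustration of the jump-level decision in claim 3, the sketch below classifies one short observation window from per-frame effective (RMS) values and signal-to-noise ratios. The claims do not give a decision rule, so everything specific here is an assumption made for illustration: the function name `classify_jump`, the `reference_rms` input (a previously tracked noise level), the dB thresholds, and the idea of using only low-SNR frames.

```python
import math


def classify_jump(frame_rms, frame_snr_db, reference_rms,
                  snr_max_db=5.0, first_jump_db=12.0, second_jump_db=6.0):
    """Classify the noise jump in one short window as 'first', 'second' or 'none'.

    All thresholds are hypothetical. The only property taken from the
    claims is that the first jump is larger than the second jump.
    """
    if reference_rms <= 0:
        raise ValueError("reference_rms must be positive")
    # Keep frames that look noise-dominated (low SNR) and are not silent.
    noise_rms = [r for r, s in zip(frame_rms, frame_snr_db)
                 if s <= snr_max_db and r > 0]
    if not noise_rms:
        return "none"
    mean_rms = sum(noise_rms) / len(noise_rms)
    # Level change relative to the tracked noise level, in dB.
    change_db = abs(20.0 * math.log10(mean_rms / reference_rms))
    if change_db >= first_jump_db:
        return "first"
    if change_db >= second_jump_db:
        return "second"
    return "none"
```

With these assumed thresholds, a window whose low-SNR frames sit about 20 dB above the reference level is classified as a first jump, and one about 9.5 dB above it as a second jump.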

4. The method according to claim 3, wherein said performing a first smoothing update process on said noise power spectrum estimate comprises:

obtaining a first smoothing factor according to the audio frame information of the audio frames in the short observation window;

performing the first smoothing update processing on the noise power spectrum estimation by adopting the first smoothing factor;

wherein the audio frame information includes: signal-to-noise ratio information and speech presence probability information.
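The first smoothing update of claim 4 can be sketched as a recursive (exponential) average whose factor depends on the signal-to-noise ratio and the speech presence probability. The mapping from those two quantities to the factor is not stated in the claims, so the formula below is an assumption. The names `first_smoothing_factor` and `smooth_update`, and the parameters `base` and `snr_ref_db`, are hypothetical.

```python
def first_smoothing_factor(snr_db, speech_prob, base=0.5, snr_ref_db=20.0):
    """Map SNR and speech presence probability to a factor in [base, 1].

    Assumed rule: the more likely the frame contains speech (high SNR or
    high speech probability), the closer the factor is to 1, so the
    previous noise estimate is kept almost unchanged.
    """
    snr_term = min(max(snr_db / snr_ref_db, 0.0), 1.0)
    speech_term = min(max(speech_prob, 0.0), 1.0)
    hold = max(snr_term, speech_term)
    return base + (1.0 - base) * hold


def smooth_update(previous, current, factor):
    """Recursive average per sub-band: factor * previous + (1 - factor) * current."""
    if len(previous) != len(current):
        raise ValueError("estimates must have the same number of sub-bands")
    return [factor * p + (1.0 - factor) * c for p, c in zip(previous, current)]


# A noise-only frame (low SNR, low speech probability) updates quickly.
alpha = first_smoothing_factor(snr_db=0.0, speech_prob=0.0)
updated = smooth_update([4.0, 8.0], [2.0, 4.0], alpha)
```

A factor near 1 freezes the estimate during speech, while a factor near `base` lets the estimate follow the current frame when only noise is present.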

5. The acquisition method according to claim 1, wherein the audio frame information includes: signal-to-noise ratio information and speech presence probability information.

6. The method according to any one of claims 1 to 5, wherein the obtaining, according to M target power spectrums in each of the N subband power spectrums, a noise power spectrum estimate corresponding to each of the N audio frames includes:

respectively assigning weight values to each target power spectrum in the M target power spectrums;

and obtaining the noise power spectrum estimation according to the M target power spectrums and the weight value.
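The weighting step of claim 6 can be illustrated as a weighted average of the M target power spectrums. The claims do not state how the weights are chosen or whether they are normalised, so dividing by the sum of the weights is an assumption of this sketch. The name `weighted_noise_estimate` is hypothetical.

```python
def weighted_noise_estimate(target_powers, weights=None):
    """Combine M target power values into one noise power estimate.

    With no weights given, every target value counts equally. Normalising
    by the weight sum is an assumption, not taken from the claims.
    """
    if len(target_powers) < 2:
        raise ValueError("claim 1 requires M >= 2 target power spectrums")
    if weights is None:
        weights = [1.0] * len(target_powers)
    if len(weights) != len(target_powers):
        raise ValueError("one weight per target power spectrum is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(target_powers, weights)) / total


equal = weighted_noise_estimate([1.0, 3.0])
biased_low = weighted_noise_estimate([1.0, 3.0], [3.0, 1.0])
```

Giving larger weights to the lowest values biases the estimate toward the quietest observations; equal weights reduce it to a plain mean.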

7. An apparatus for obtaining loudness of noise in an audio signal, comprising:

the acquisition module is used for acquiring N sub-band power spectrums of N audio frames in the target audio signal;

the estimation module is used for obtaining noise power spectrum estimation corresponding to each audio frame in the N audio frames according to M target power spectrums in each sub-band power spectrum in the N sub-band power spectrums acquired by the acquisition module;

the updating module is used for carrying out smooth updating processing on the noise power spectrum estimation obtained by the estimation module;

the correction module is used for carrying out compensation correction processing on the noise power spectrum estimation processed by the updating module to obtain the noise loudness of the target audio signal;

the updating module is specifically configured to:

under the condition that the jump level of noise in the target audio signal is a second jump, fitting according to the noise power spectrum estimation corresponding to the audio frame in the first long observation window and the noise power spectrum estimation corresponding to the audio frame in the first short observation window to obtain an initial smoothing factor;

performing superposition fitting on the initial smoothing factor through the audio frame information of the audio frames in the first long observation window to obtain a second smoothing factor;

and performing the second smoothing update processing on the noise power spectrum estimation obtained by the estimation module by adopting the second smoothing factor;

8. The acquisition device according to claim 7, wherein the M target power spectrums are power spectrums falling within a preset power spectrum interval; the preset power spectrum interval is a set of top P% or top Q power spectrums arranged in a low-to-high order in each sub-band power spectrum, or the preset power spectrum interval is a set of bottom R% or bottom S power spectrums arranged in a high-to-low order in each sub-band power spectrum, wherein P and R are positive numbers respectively, and Q and S are positive integers respectively.

9. The acquisition device of claim 7, wherein the acquisition device further comprises a decision module;

the decision module is used for determining the jump level of the noise in the target audio signal according to the effective value and the signal-to-noise ratio of each audio frame in the short observation window before the updating module performs smooth updating processing on the noise power spectrum estimation obtained by the estimation module;

the updating module is specifically configured to:

under the condition that the jump level is a first jump, carrying out first smooth updating processing on the noise power spectrum estimation obtained by the estimation module;

10. The acquisition device according to claim 9, wherein the updating module is specifically configured to:

carrying out the first smoothing update processing on the noise power spectrum estimation obtained by the estimation module by adopting the first smoothing factor;

11. The acquisition apparatus of claim 7, wherein the audio frame information comprises: signal-to-noise ratio information and speech presence probability information.

12. The acquisition device according to any one of claims 7 to 11, characterized in that the estimation module is specifically configured to:

13. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the acquisition method of any one of claims 1 to 6.

14. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the acquisition method according to any one of claims 1 to 6.