CN114360563A

CN114360563A - Voice noise reduction method, device, equipment and storage medium

Info

Publication number: CN114360563A
Application number: CN202111660050.9A
Authority: CN
Inventors: 王倩; 沈洋; 来杏杏
Original assignee: Beijing Wutong Chelian Technology Co Ltd
Current assignee: Beijing Wutong Chelian Technology Co Ltd
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-04-15

Abstract

The embodiments of the application disclose a voice noise reduction method, apparatus, device and storage medium, belonging to the technical field of multimedia data processing. The method comprises: acquiring a first voice signal, wherein the first voice signal comprises a plurality of frequency point signals; determining a smoothing factor in a recursive average algorithm based on the signal power of each of the plurality of frequency point signals; based on the smoothing factor, performing noise estimation on the first voice signal through the recursive average algorithm to obtain a noise estimation value of the first voice signal, wherein the noise estimation value indicates the power of the noise signal in the first voice signal; and performing noise reduction processing on the first voice signal based on the noise estimation value. Because the smoothing factor is determined adaptively from the characteristics of each voice signal, the accuracy of noise estimation is improved, and a clean voice signal can be obtained when noise reduction is performed on the first voice signal based on the noise estimation value.

Description

Voice noise reduction method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the technical field of multimedia data processing, in particular to a voice noise reduction method, a voice noise reduction device, voice noise reduction equipment and a storage medium.

Background

Voice is the most important and most common medium of everyday communication, and with the wide adoption of mobile internet terminals such as mobile phones, people can communicate by voice more conveniently through these terminals. However, a terminal is usually exposed to the external environment when it acquires voice, so it also picks up noise signals from that environment, and the acquired voice becomes unclear because of the noise interference. A voice noise reduction method is therefore needed that performs noise reduction on the voice acquired by a terminal, so that the processed voice signal is free of noise interference.

Disclosure of Invention

The embodiment of the application provides a voice noise reduction method, a voice noise reduction device, equipment and a storage medium, and can solve the problem of unclear voice in the related art. The technical scheme is as follows:

in one aspect, a method for reducing noise in speech is provided, the method comprising:

acquiring a first voice signal, wherein the first voice signal comprises a plurality of frequency point signals;

determining a smoothing factor in a recursive average algorithm based on the signal power of each frequency point signal in the plurality of frequency point signals;

based on the smoothing factor, performing noise estimation on the first voice signal through the recursive average algorithm to obtain a noise estimation value of the first voice signal, wherein the noise estimation value indicates the power of a noise signal in the first voice signal;

and performing noise reduction processing on the first voice signal based on the noise estimation value.

Optionally, the determining a smoothing factor in a recursive average algorithm based on the signal power of each of the plurality of frequency point signals includes:

determining an activation function value of each frequency point signal based on the signal power of each frequency point signal, wherein the activation function value is positioned in a target value interval;

and determining the smoothing factor based on the activation function value of each frequency point signal in the plurality of frequency point signals.

Optionally, the determining the smoothing factor based on the activation function value of each of the plurality of frequency point signals includes:

determining the smoothing factor through a first formula based on the activation function value of each frequency point signal in the plurality of frequency point signals, wherein the first formula is as follows:

[formula not reproduced in the source]

where the first formula yields the smoothing factor, α_d is a fixed value, and the formula's other input is the minimum of the activation function values of the frequency point signals.

Optionally, the performing, based on the smoothing factor, noise estimation on the first speech signal through the recursive average algorithm to obtain a noise estimation value of the first speech signal includes:

smoothing the first voice signal based on the smoothing factor to obtain a voice existence probability of the first voice signal, wherein the voice existence probability indicates the probability of the existence of the effective voice signal in the first voice signal;

determining an initial noise estimate for the first speech signal based on the speech presence probability;

when the voice existence probability indicates that an effective voice signal exists in the first voice signal, performing deviation compensation on the initial noise estimation value according to a first compensation factor to obtain the noise estimation value;

when the voice existence probability indicates that no effective voice signal exists in the first voice signal, performing deviation compensation on the initial noise estimation value according to a second compensation factor to obtain the noise estimation value;

wherein the first compensation factor is less than the second compensation factor.
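The deviation compensation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not say how a compensation factor is applied, so multiplying the initial estimate by the factor is an assumption, and the function name, the boolean `speech_present` flag and the example factor values are hypothetical.

```python
def compensate_noise_estimate(initial_estimate, speech_present,
                              first_factor, second_factor):
    """Scale the initial per-bin noise estimate by one of two factors.

    Assumption: compensation is a multiplication. The text only requires
    that the first factor (speech present) is less than the second factor
    (no speech present).
    """
    if not first_factor < second_factor:
        raise ValueError("first_factor must be less than second_factor")
    factor = first_factor if speech_present else second_factor
    return [factor * value for value in initial_estimate]


# Hypothetical factor values, chosen only for illustration.
with_speech = compensate_noise_estimate([2.0, 4.0], True, 1.0, 1.5)
without_speech = compensate_noise_estimate([2.0, 4.0], False, 1.0, 1.5)
```

Using the smaller factor when speech is present keeps the compensated noise estimate lower in frames that contain an effective voice signal.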

Optionally, the smoothing the first speech signal based on the smoothing factor to obtain the speech existence probability of the first speech signal includes:

performing first smoothing processing on the first voice signal to obtain a smoothed power spectrum of the first voice signal;

searching the smoothed power spectrum using a target number of search windows to determine the minimum value in the smoothed power spectrum, wherein the target number of search windows is smaller than a reference value;

determining a speech presence probability of the first speech signal based on a minimum value in the smoothed power spectrum.

Optionally, before determining a smoothing factor in a recursive average algorithm based on the signal power of each of the plurality of frequency point signals, the method further includes:

acquiring a second voice signal, wherein the second voice signal is the previous frame of the first voice signal;

determining an average power variation between the first speech signal and the second speech signal;

and if the average power variation exceeds a variation threshold, executing the operation of determining a smoothing factor in the recursive average algorithm based on the signal power of each frequency point signal in the plurality of frequency point signals.

Optionally, the method further comprises:

and if the average power variation does not exceed the variation threshold, performing noise estimation on the first voice signal through the recursive average algorithm based on a reference smoothing factor, and determining a noise estimation value of the first voice signal.

In another aspect, an apparatus for speech noise reduction is provided, the apparatus comprising:

a first obtaining module, configured to acquire a first voice signal, where the first voice signal includes a plurality of frequency point signals;

a first determining module, configured to determine a smoothing factor in a recursive average algorithm based on a signal power of each of the multiple frequency point signals;

a noise estimation module, configured to perform noise estimation on the first speech signal through the recursive average algorithm based on the smoothing factor to obtain a noise estimation value of the first speech signal, where the noise estimation value indicates power of a noise signal in the first speech signal;

and a processing module, configured to perform noise reduction processing on the first voice signal based on the noise estimation value.

Optionally, the first determining module includes:

a first determining submodule, configured to determine an activation function value of each frequency point signal based on the signal power of each frequency point signal, where the activation function value lies in a target value interval;

and a second determining submodule, configured to determine the smoothing factor based on the activation function value of each frequency point signal in the plurality of frequency point signals.

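Optionally, the second determining sub-module is configured to:

determine the smoothing factor through the first formula based on the activation function value of each frequency point signal in the plurality of frequency point signals, where, as in the method above, α_d is a fixed value and the formula also takes the minimum of the activation function values of the frequency point signals.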

Optionally, the noise estimation module includes:

the processing submodule is used for smoothing the first voice signal based on the smoothing factor to obtain the voice existence probability of the first voice signal, and the voice existence probability indicates the probability of the existence of the effective voice signal in the first voice signal;

a third determining sub-module for determining an initial noise estimate for the first speech signal based on the speech presence probability;

a deviation compensation submodule, configured to: when the voice existence probability indicates that an effective voice signal exists in the first voice signal, perform deviation compensation on the initial noise estimation value according to a first compensation factor to obtain the noise estimation value; and when the voice existence probability indicates that no effective voice signal exists in the first voice signal, perform deviation compensation on the initial noise estimation value according to a second compensation factor to obtain the noise estimation value; where the first compensation factor is less than the second compensation factor.

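Optionally, the processing sub-module is configured to:

perform first smoothing processing on the first voice signal to obtain a smoothed power spectrum of the first voice signal;

search the smoothed power spectrum using a target number of search windows to determine the minimum value in the smoothed power spectrum, where the target number of search windows is smaller than a reference value;

and determine the voice existence probability of the first voice signal based on the minimum value in the smoothed power spectrum.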

Optionally, the apparatus further comprises:

a second obtaining module, configured to obtain a second voice signal, where the second voice signal is a previous frame voice signal of the first voice signal;

a second determining module for determining an average power variation between the first speech signal and the second speech signal;

and a third determining module, configured to, if the average power variation exceeds a variation threshold, perform an operation of determining a smoothing factor in a recursive average algorithm based on the signal power of each of the multiple frequency point signals.

Optionally, the apparatus further comprises:

and a fourth determining module, configured to perform noise estimation on the first speech signal through the recursive average algorithm based on a reference smoothing factor if the average power variation does not exceed the variation threshold, and determine a noise estimation value of the first speech signal.

In another aspect, a computer device is provided, which includes a memory for storing a computer program and a processor for executing the computer program stored in the memory to implement the steps of the voice noise reduction method.

In another aspect, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned speech noise reduction method.

In another aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the steps of the speech noise reduction method described above.

The technical scheme provided by the embodiment of the application can at least bring the following beneficial effects:

according to the embodiment of the application, a first voice signal is obtained, and a smoothing factor in a recursive average algorithm is determined based on the signal power of each frequency point signal in a plurality of frequency point signals of the first voice signal. And carrying out noise estimation on the first voice signal through the recursive average algorithm to obtain a noise estimation value of the first voice signal, and carrying out noise reduction processing on the first voice signal based on the noise estimation value. According to the embodiment of the application, the smoothing factor is determined based on the signal power of each frequency point signal of the first voice signal, and the signal power distribution of each frequency point signal in different voice signals is basically different, so that the smoothing factor can be determined in a self-adaptive manner according to the self characteristics of different voice signals, the accuracy of noise estimation is improved, and finally, when the noise reduction processing is performed on the first voice signal based on the noise estimation value, a pure voice signal can be obtained.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a speech noise reduction method provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a noisy speech signal provided by an embodiment of the present application;

Fig. 3 is a schematic diagram of a noise signal in a noisy speech signal according to an embodiment of the present application;

Fig. 4 is a schematic diagram of effective signals in a noisy speech signal according to an embodiment of the present application;

Fig. 5 is a flowchart of a speech noise reduction method according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a speech noise reduction apparatus according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a terminal according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.

Before explaining the speech noise reduction method provided by the embodiment of the present application in detail, an application scenario provided by the embodiment of the present application is introduced.

Today, the internet is widely popularized, terminals such as mobile phones enable people's daily life to be more convenient and faster, and the terminals can collect user voices to achieve voice communication functions or other voice functions.

For example, two users at different geographical locations may communicate by voice through respective terminals. The user A sends out voice, the terminal of the user A collects the voice and sends the voice to the terminal of the user B communicating with the user A, and the user B listens to the voice and can communicate with the user A in the same way. However, in the process of voice communication between two communication parties through the terminal, noise signals in the external environment are also collected by the terminal, which results in low quality of voice listened by the two communication parties.

As another example, a user may perform other voice operations with the terminal, such as navigation and recording. Typically the user operates the corresponding software on the terminal and speaks, and the terminal collects the user's voice and performs the corresponding operation through that software. However, while the terminal collects the user's voice it also collects noise from the external environment, so the software cannot correctly recognize the voice and therefore cannot execute the operation it indicates. For example, when navigation software is used in a noisy environment, it may fail to recognize the user's voice command because the ambient noise is too loud.

Therefore, a speech noise reduction method is needed, which can perform noise reduction processing on speech acquired by a terminal, so that a finally processed speech signal has no noise interference.

In the prior art, noise is usually processed by a speech noise reduction method based on deep learning, which trains a speech noise reduction model so that the trained speech noise reduction model can perform noise reduction processing on a speech signal. However, this method requires a large amount of speech and noise data, and the performance of the speech noise reduction model trained by this method is determined by various factors, such as the size of the training set, the type of model used, the training process, etc., where the variation of any one factor affects the final result. Moreover, the speech noise reduction model trained by the speech noise reduction method based on deep learning is often large, the calculation complexity is high, and more resources are consumed, so that the real-time performance of speech noise reduction is difficult to ensure.

Based on the above problems, the embodiment of the present application provides a voice noise reduction method, which can ensure a smaller amount of calculation, further reduce system delay, meet the requirement of real-time performance, and achieve a better noise elimination effect.

The following explains the speech noise reduction method provided in the embodiments of the present application in detail.

Fig. 1 is a flowchart of a speech noise reduction method according to an embodiment of the present application. Referring to fig. 1, the method includes the following steps.

Step 101: acquiring a first voice signal, wherein the first voice signal comprises a plurality of frequency point signals.

The first voice signal is one frame of a voice signal in the frequency domain, obtained by Fourier transform, so that subsequent processing, including obtaining the noise estimation value of the first voice signal, is carried out in the frequency domain.

The first voice signal may be obtained as follows: the terminal collects a segment of a voice signal, divides it into frames, and then performs a Fourier transform on the framed voice signal to obtain the voice signal in the frequency domain.

The voice signal uttered by a user is substantially unchanged over a short period; that is, it has short-term stationarity. For convenience, the voice signal uttered by the user is referred to as the effective signal, and the voice signal collected by the terminal includes the effective signal and a noise signal. As shown in Fig. 2, the short-term stationary segments are the signal uttered by the user, that is, the effective signal, and the small-amplitude transient signal between two effective signals is the noise signal. Because of this short-term stationarity, the voice signal can be divided into short segments (frames) for processing. Overlapping framing is generally used: adjacent frames share part of their samples, so that the transition between frames is smooth and continuity between frames is maintained.

After the framed voice signals are obtained, a Fourier transform is performed on each frame to obtain the frequency-domain voice signal of that frame.
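The framing and transform steps can be sketched as follows, as a minimal illustration under stated assumptions. The function names, the frame length and the hop length are hypothetical, no analysis window is applied, and a naive discrete Fourier transform stands in for whatever transform implementation a real system would use.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop_len):
    """Split samples into overlapping frames.

    Consecutive frames share frame_len - hop_len samples.
    """
    if not 0 < hop_len <= frame_len:
        raise ValueError("hop_len must be in (0, frame_len]")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def dft(frame):
    """Naive discrete Fourier transform of one frame (O(n^2), illustration only)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


# A test tone with exactly two cycles per 8-sample frame.
samples = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
frames = split_into_frames(samples, frame_len=8, hop_len=4)  # 50% overlap
spectra = [dft(frame) for frame in frames]
```

Each element of `spectra` is the frequency-domain representation of one frame; any one of them can play the role of the first voice signal.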

The first speech signal is any frame signal in the speech signal in the frequency domain, and each frame in the speech signal is processed according to the method in the embodiment of the application, so that the speech signal after noise reduction can be obtained. The embodiment of the present application takes the first speech signal as an example for explanation.

In addition, when the voice signal in the frequency domain is obtained, the power spectrum of the voice signal can be obtained, and then the average power of each frequency point signal in the voice signal can be obtained.

The power spectrum of the speech signal may be obtained as follows: performing a Fourier transform on the framed first speech signal yields its amplitude spectrum and phase spectrum, and the sequence formed by the squares of the amplitude spectrum values is the power spectrum of the first speech signal.
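As a small sketch of this step, the power spectrum can be computed from the complex frequency-domain values of a frame by squaring their magnitudes. The function name and the sample values are hypothetical.

```python
def power_spectrum(spectrum):
    """Return the squared magnitude of each complex frequency-domain value."""
    return [abs(value) ** 2 for value in spectrum]


spectrum = [3 + 4j, 0j, 1 - 1j]
powers = power_spectrum(spectrum)
```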

Obtaining the power spectrum of the first voice signal yields the signal power of each frequency point signal in it. The voice signal uttered by a user usually contains several signals at the same frequency, so a single frequency point may have several signal power values. These values can be summed and divided by their count; the result is the average power of the corresponding frequency point signal in the first voice signal.

Performing this operation on each frequency point signal in the first voice signal gives the signal power of each frequency point signal, where the signal power refers to the average power of the corresponding frequency point signal.
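The averaging described above can be sketched as follows. The dictionary layout (frequency point index mapped to a list of observed power values) and the function name are assumptions made for illustration.

```python
def average_power_per_bin(power_values_by_bin):
    """Map each frequency point to the mean of the power values observed for it."""
    return {
        bin_index: sum(values) / len(values)
        for bin_index, values in power_values_by_bin.items()
        if values  # skip frequency points with no observations
    }


observed = {0: [4.0, 6.0], 1: [1.0], 2: [2.0, 2.0, 5.0]}
average_power = average_power_per_bin(observed)
```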

Based on the above method for determining the average power of the first speech signal, the embodiment of the present application may further obtain a second speech signal, where the second speech signal is a speech signal of a previous frame of the first speech signal, determine an amount of average power change between the first speech signal and the second speech signal, and if the amount of average power change exceeds a change threshold, perform the following operation of step 102.

The first speech signal and the second speech signal are the current frame and the previous frame, respectively. Determining the average power variation between them means calculating the difference between the average power of the first speech signal and the average power of the second speech signal.

Since the first voice signal and the second voice signal each include a plurality of frequency point signals, and each frequency point signal has its own average power, the average power of each of the two voice signals must be determined first, and the difference between the two is then calculated.

As an example, the mean of the average powers of all frequency point signals in the first voice signal may be taken as the average power of the first voice signal, and the mean of the average powers of all frequency point signals in the second voice signal may be taken as the average power of the second voice signal.

As another example, when determining the average power of the first voice signal and the average power of the second voice signal, the maximum value of the average power of the frequency point signals in the first voice signal and the second voice signal may be used as the average power of the first voice signal and the average power of the second voice signal, respectively. Or respectively taking the minimum value of the average power of the frequency point signals in the first voice signal and the second voice signal as the average power of the first voice signal and the second voice signal.

Of course, the manner of determining the average power of the first speech signal and the second speech signal may be other manners, which is not limited in this embodiment of the application.
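The three ways of reducing per-frequency-point powers to one frame-level value, and the resulting variation, can be sketched as follows. The function names are hypothetical, and taking the absolute value of the difference is an assumption; the text only speaks of a difference.

```python
def frame_average_power(bin_powers, mode="mean"):
    """Reduce per-frequency-point average powers to one value for the frame."""
    if not bin_powers:
        raise ValueError("bin_powers must not be empty")
    if mode == "mean":
        return sum(bin_powers) / len(bin_powers)
    if mode == "max":
        return max(bin_powers)
    if mode == "min":
        return min(bin_powers)
    raise ValueError(f"unknown mode: {mode}")


def average_power_change(current_bin_powers, previous_bin_powers, mode="mean"):
    """Absolute difference between the frame-level powers of two frames."""
    return abs(frame_average_power(current_bin_powers, mode)
               - frame_average_power(previous_bin_powers, mode))


current = [1.0, 3.0, 8.0]   # first voice signal (current frame)
previous = [1.0, 2.0, 3.0]  # second voice signal (previous frame)
change_mean = average_power_change(current, previous, "mean")
change_max = average_power_change(current, previous, "max")
change_min = average_power_change(current, previous, "min")
```

The same frames can give very different variations depending on the reduction chosen, which is worth keeping in mind when a variation threshold is set.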

After the average power variation between the first speech signal and the second speech signal is determined, if it exceeds the variation threshold, the first speech signal is considered to contain an abrupt noise change. In that case the first speech signal is processed according to step 102 below: a smoothing factor in the recursive average algorithm is determined, and the noise estimation value of the first speech signal is then obtained. The variation threshold may be set in advance, which is not limited in the embodiment of the present application.

If the average power variation between the first speech signal and the second speech signal does not exceed the variation threshold, that is, the first speech signal is considered to have no noise sudden change, then the noise estimation value of the first speech signal may be obtained through step 103 based on a reference smoothing factor, where the reference smoothing factor is 0.85. The implementation of step 103 is described in detail below.

In addition, after the first speech signal is acquired, the smoothing factor in the recursive averaging algorithm may be determined directly for the first speech signal according to the method in step 102, regardless of whether a noise mutation occurs. The implementation of step 102 is described in detail later.
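The choice between the reference smoothing factor and an adaptively determined one can be sketched as below. The function and parameter names are hypothetical, and the adaptive computation is passed in as a callable so that this sketch does not presume how step 102 is implemented.

```python
REFERENCE_SMOOTHING_FACTOR = 0.85  # reference value stated in the text


def select_smoothing_factor(power_change, change_threshold,
                            compute_adaptive_factor):
    """Pick the smoothing factor for the current frame.

    compute_adaptive_factor is a caller-supplied callable standing in for
    step 102; it is only invoked when the variation exceeds the threshold.
    """
    if power_change > change_threshold:
        return compute_adaptive_factor()
    return REFERENCE_SMOOTHING_FACTOR


abrupt = select_smoothing_factor(2.0, 1.0, lambda: 0.9)
steady = select_smoothing_factor(0.5, 1.0, lambda: 0.9)
at_threshold = select_smoothing_factor(1.0, 1.0, lambda: 0.9)
```

For the alternative in which step 102 is always applied, the same effect is obtained by calling `compute_adaptive_factor()` unconditionally.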

Step 102: determining a smoothing factor in the recursive average algorithm based on the signal power of each frequency point signal in the plurality of frequency point signals.

When a fixed smoothing factor is too large, the effective signal in subsequent voice signals is affected; when it is too small, the noise estimation value is inaccurate. Step 102 therefore treats the smoothing factor as a variable parameter rather than a fixed value, which improves the accuracy of noise estimation when the noise estimation value of the first speech signal is subsequently determined.

In some embodiments, the implementation of step 102 may be divided into the following two steps:

Step one: determining an activation function value of each frequency point signal based on the signal power of each frequency point signal, wherein the activation function value lies in a target value interval.

In step one, the activation function value of each frequency point signal may be determined as follows: an initial activation function value of each frequency point signal is determined based on its signal power, and the activation function value of each frequency point signal is then determined based on its initial activation function value.

The activation function used to determine the initial activation function value of each frequency point signal may be a sigmoid function, whose mathematical expression is:

P_s(l) = 1 / (1 + e^{-x(l)})

where l denotes the frequency point, x(l) is the signal power of the corresponding frequency point signal, and P_s(l) is the initial activation function value of the corresponding frequency point signal. The sigmoid function yields an initial activation function value for each frequency point signal, and this value lies in the range 0 to 1.

Of course, the activation function used for determining the initial activation function value of each frequency point signal may also be other types of functions, such as a hyperbolic tangent Tanh function, and the like, which is not limited in this embodiment of the application.
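A minimal sketch of computing initial activation function values with the standard logistic sigmoid follows. The function names are hypothetical, and the branch on the sign of the input is only there to avoid floating-point overflow for large magnitudes.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid, 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def initial_activation_values(bin_powers):
    """Apply the sigmoid to the signal power of each frequency point."""
    return [sigmoid(power) for power in bin_powers]


initial_values = initial_activation_values([0.0, 0.5, 3.0, 1000.0])
```

Because signal power is non-negative, the values produced this way are never below 0.5.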

After the initial activation function value of each frequency point signal is obtained, the activation function value of each frequency point signal can be determined through the following formula:

[formula not reproduced in the source]

where the result is the activation function value of each frequency point signal, P_s(l) is the initial activation function value of each frequency point signal, Kp = -2, a = 14, and C = 11. The activation function value lies in the range 0 to 1; that is, the target value interval is 0 to 1.

And after the activation function value of each frequency point signal is obtained, the operation of the following step two can be carried out.

Step two: determining a smoothing factor based on the activation function value of each frequency point signal in the plurality of frequency point signals.

Wherein, the implementation process of the second step can be as follows: determining a smoothing factor based on an activation function value of each frequency point signal in a plurality of frequency point signals through a first formula, wherein the first formula is as follows:

[formula not reproduced in the source]

where the result is the smoothing factor, α_d is a fixed value, and the formula's other input is the minimum of the activation function values of the frequency point signals.

Here α_d = 0.85. After the minimum of the activation function values of the frequency point signals in the first voice signal is determined, the smaller of the value given by the first formula and 0.95 is determined as the smoothing factor.

Of course, when the smoothing factor is determined in step 102, the minimum of the activation function values of the frequency point signals may be obtained first, and the smoothing factor is then determined directly according to the formula [not reproduced in the source].
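The selection logic around the first formula can be sketched as follows. Since the first formula itself is not available, it is passed in as a caller-supplied callable; the `placeholder_formula` used in the example is purely hypothetical and is not the formula of this application. Only the values 0.85 and 0.95 and the use of the minimum activation function value come from the text.

```python
ALPHA_D = 0.85               # fixed value stated in the text
SMOOTHING_FACTOR_CAP = 0.95  # upper bound stated in the text


def adaptive_smoothing_factor(activation_values, first_formula):
    """Evaluate first_formula on the minimum activation value, capped at 0.95.

    first_formula(alpha_d, min_activation) is supplied by the caller because
    the actual first formula is not reproduced here.
    """
    min_activation = min(activation_values)
    candidate = first_formula(ALPHA_D, min_activation)
    return min(candidate, SMOOTHING_FACTOR_CAP)


def placeholder_formula(alpha_d, min_activation):
    """Hypothetical stand-in, NOT the first formula of this application."""
    return alpha_d + (1.0 - alpha_d) * min_activation


uncapped = adaptive_smoothing_factor([0.2, 0.9], placeholder_formula)
capped = adaptive_smoothing_factor([1.0, 1.0], placeholder_formula)
```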

Step 103: based on the smoothing factor, performing noise estimation on the first voice signal through a recursive average algorithm to obtain a noise estimation value of the first voice signal, wherein the noise estimation value indicates the power of the noise signal in the first voice signal.

When performing noise estimation on the first voice signal, the embodiment of the present application builds on IMCRA (Improved Minima Controlled Recursive Averaging) and reduces its delay under abrupt noise changes, so that the algorithm converges quickly when an abrupt noise change occurs and tracks the noise better.
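To show the role a smoothing factor plays, the following is a generic first-order recursive average applied per frequency point. It is a textbook form given as an assumption, not the exact update used in this application or in IMCRA, and the function name is hypothetical.

```python
def recursive_average(previous_estimate, current_power, smoothing_factor):
    """Blend the previous per-bin estimate with the current per-bin power.

    A factor close to 1 keeps more of the previous estimate (slow tracking);
    a smaller factor follows the current frame more closely (fast tracking).
    """
    if not 0.0 <= smoothing_factor <= 1.0:
        raise ValueError("smoothing_factor must be in [0, 1]")
    return [
        smoothing_factor * prev + (1.0 - smoothing_factor) * cur
        for prev, cur in zip(previous_estimate, current_power)
    ]


slow = recursive_average([1.0, 1.0], [3.0, 5.0], 0.95)
fast = recursive_average([1.0, 1.0], [3.0, 5.0], 0.5)
```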

In some embodiments, the implementation of step 103 may be divided into the following steps:

Step 1: smoothing the first voice signal based on the smoothing factor to obtain the voice existence probability of the first voice signal, wherein the voice existence probability indicates the probability that an effective voice signal exists in the first voice signal.

Step 1 may be implemented as follows: first smoothing processing is performed on the first voice signal to obtain a smoothed power spectrum of the first voice signal. The smoothed power spectrum is searched using a target number of search windows to determine the minimum value in the smoothed power spectrum, where the target number of search windows is less than a reference value of 8. The voice existence probability of the first voice signal is then determined based on the minimum value in the smoothed power spectrum.

In addition, the minimum value in the smoothed power spectrum can be determined based on both the target number of search windows and the window length of the target search window. The target number of search windows is 2 to 3. The window length of the target search window depends on whether an abrupt noise change was detected in the first voice signal in step 101: when no abrupt noise change has occurred, the window length is the first window length, and when an abrupt noise change has occurred, the window length is the second window length.

A shorter window is used when the first voice signal contains an abrupt noise change, so in the embodiment of the application the first window length is greater than the second window length. For example, the second window length may be set to 3 to 5.

It should be noted that, the implementation process of performing the smoothing processing on the first speech signal in step 1 is different from that of the IMCRA algorithm, and when the smooth power spectrum is searched in step 1, the number of the target search windows and the window length of the target search windows are both significantly reduced compared to the number of the search windows and the window length in the IMCRA algorithm, so that the time delay when the minimum value of the smooth power spectrum is determined can be effectively reduced, and the real-time performance of noise estimation is further improved.
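The window selection and minimum search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the names `pick_window_length` and `track_minimum` are hypothetical, the first window length of 8 is an assumed value (the text only says it is longer than the second), window lengths are treated as frame counts, and the search is simplified to a running minimum over the most recent frames.

```python
def pick_window_length(noise_mutation, first_len=8, second_len=4):
    """Return the shorter window when a noise mutation was detected.

    first_len=8 is an assumed value; second_len=4 lies in the 3-5 range
    given in the text.
    """
    return second_len if noise_mutation else first_len


def track_minimum(smoothed_power, num_windows, window_len):
    """Running minimum of one frequency point's smoothed power.

    The search covers the most recent num_windows * window_len frames.
    """
    span = num_windows * window_len
    minima = []
    for idx in range(len(smoothed_power)):
        start = max(0, idx - span + 1)
        minima.append(min(smoothed_power[start:idx + 1]))
    return minima


# Hypothetical smoothed power values for a single frequency point.
# A window length of 2 is used only to keep the example short.
power = [5.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0, 2.0]
minima = track_minimum(power, num_windows=2, window_len=2)
```

With a short search span, an old low value (3.0 here) drops out of the window after a few frames, so the tracked minimum rises with the new noise level sooner than it would with a long span.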

After obtaining the minimum value in the smoothed power spectrum, the IMCRA algorithm may be referred to for determining the speech existence probability of the first speech signal based on the minimum value in the smoothed power spectrum, which is not limited in the embodiment of the present application.

After the speech existence probability is obtained, the following step 2 can be performed.

Step 2: an initial noise estimate value for the first speech signal is determined based on the speech presence probability.

The implementation process of determining the initial noise estimation value of the first speech signal in step 2 may refer to an IMCRA algorithm, which is not limited in the embodiment of the present application.

Due to the non-stationarity of the noise environment, after the initial noise estimate is obtained, it is usually necessary to apply deviation compensation to it. In this case, different degrees of deviation compensation can be applied to different frequency points in step 3, based on the speech existence probability calculated in the previous steps.

Step 3: when the voice existence probability indicates that an effective voice signal exists in the first voice signal, deviation compensation is performed on the initial noise estimation value according to the first compensation factor to obtain the noise estimation value. When the voice existence probability indicates that no effective voice signal exists in the first voice signal, deviation compensation is performed on the initial noise estimation value according to the second compensation factor to obtain the noise estimation value. The first compensation factor is less than the second compensation factor.

The compensation factor in the embodiment of the present application is a dynamically varying parameter. When speech exists, the compensation factor should be small, so that noise is not over-estimated and the speech is not distorted. When no speech exists at a frequency point, that is, the current frequency point contains only noise, the compensation factor should be larger so as to reduce residual noise.

Thus, when the speech presence probability indicates that a valid speech signal is present in the first speech signal, the first compensation factor used at that time should be small. When the speech presence probability indicates that no valid speech signal is present in the first speech signal, the second compensation factor used at that time should be larger.

As an example, the first compensation factor and the second compensation factor are determined by the following formula:

η(k,l)＝ω·[σ+(1-P(k,l))]

where P(k, l) is the speech existence probability and η(k, l) is the compensation factor.

When speech exists, the speech existence probability P(k, l) is 0, ω is 1.47 and σ is 0, and thus the first compensation factor at this time is 1.47.

When speech is not present, the speech existence probability P(k, l) is 1, ω is 1.47 and σ is 5, and thus the second compensation factor at this time is 1.47 × 5 = 7.35.
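The two worked cases can be checked directly against the formula. The sketch below simply evaluates η(k, l) = ω·[σ + (1 − P(k, l))] with the values the text assigns to each case; the function name `compensation_factor` is a hypothetical label.

```python
def compensation_factor(p, omega=1.47, sigma=0.0):
    """eta(k, l) = omega * [sigma + (1 - P(k, l))], as in the formula above."""
    return omega * (sigma + (1.0 - p))


# The two cases worked in the text, using the values the text assigns.
first_factor = compensation_factor(p=0.0, sigma=0.0)   # speech-present case
second_factor = compensation_factor(p=1.0, sigma=5.0)  # speech-absent case
```

Both results agree with the figures stated above (1.47 and 7.35), and the first factor is smaller than the second, as step 3 requires.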

After the compensation factor η(k, l) is obtained, the noise estimate can be adjusted dynamically through a formula that relates the initial noise estimate to the compensated noise estimate.

Of course, when deviation compensation is performed on the initial noise estimation value, the noise estimate can also be determined directly through a single formula, where ω is 1.47.

Fig. 3 is a schematic diagram of noise signals of the first speech signal after noise estimation, and as shown in fig. 3, the noise signals are all transient signals.

Step 104: and performing noise reduction processing on the first voice signal based on the noise estimation value.

After the noise estimate of the first speech signal is obtained, a log-MMSE (log minimum mean square error) gain function G_mmse(k, l) may be used to perform noise reduction processing, namely speech enhancement processing, on the first speech signal to obtain a noise-reduced speech spectrum.
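The application names the log-MMSE gain but does not print its expression here. For orientation only, the sketch below evaluates the textbook log-spectral-amplitude gain, G = ξ/(1+ξ)·exp(½·E1(v)) with v = ξγ/(1+ξ). That this is the exact form used in the embodiment is an assumption, as are the function name `log_mmse_gain` and the SNR values; how ξ and γ are derived from the noise estimate is not specified in this section.

```python
import numpy as np
from scipy.special import exp1  # exponential integral E1


def log_mmse_gain(xi, gamma):
    """Textbook log-MMSE (log-spectral amplitude) gain per frequency point.

    xi: a priori SNR, gamma: a posteriori SNR (both linear, not dB).
    """
    xi = np.asarray(xi, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))


# Hypothetical SNRs: one speech-dominated point, one noise-dominated point.
gain = log_mmse_gain(xi=[100.0, 0.1], gamma=[101.0, 1.1])
```

The speech-dominated point receives a gain close to 1 and the noise-dominated point is attenuated much more strongly, which is the behaviour the noise reduction step relies on.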

After the voice frequency spectrum after noise reduction is obtained, each frame of voice signal in the voice frequency spectrum after noise reduction is subjected to inverse Fourier transform and inverse framing processing to obtain a continuous effective voice signal on a time domain.
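The framing, transform, and inverse framing steps can be sketched with NumPy as follows. The periodic Hann window, the 50% overlap, and the helper names are assumptions chosen so that overlap-add reconstructs the signal; the application does not state which window or hop it uses.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames weighted by a periodic Hann window."""
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )


def apply_gain(frames, gain):
    """Transform each frame, scale its spectrum, and transform back."""
    spectrum = np.fft.rfft(frames, axis=1)
    return np.fft.irfft(spectrum * gain, n=frames.shape[1], axis=1)


def overlap_add(frames, hop):
    """Inverse framing: sum the frames back at their original offsets."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out


# Hypothetical test tone; a gain of 1.0 means no noise reduction is applied.
x = np.sin(0.3 * np.arange(64))
frames = frame_signal(x, frame_len=16, hop=8)
y = overlap_add(apply_gain(frames, 1.0), hop=8)
```

With unit gain, the samples covered by two frames are reconstructed exactly, so any difference seen in practice comes from the gain function rather than from the framing itself.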

Fig. 4 shows the effective speech signal obtained after the noise signal is removed from the first speech signal. Compared with the noisy speech signal in fig. 2, the noise signal has been removed from the speech signal in fig. 4.

The following takes fig. 5 as an example to further explain the above steps 101-104.

Fig. 5 is a flowchart of a speech noise reduction method according to an embodiment of the present application. As shown in fig. 5, the noisy speech signal is first divided into frames, and each frame is transformed into the frequency domain through Fourier transform, so as to obtain a plurality of frequency point signals of each frame of speech signal in the frequency domain. For any frame of speech signal, once the frequency-domain signal is obtained, its power spectrum can be computed, and from it the average power of the frequency point signals in that frame. The variation in average power between the current frame and the previous frame is then compared with a variation threshold to determine whether a noise mutation occurs in the current frame.
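The mutation check in this flow can be sketched as follows. Taking the "variation" to be the absolute difference between the two average powers is an assumption, as are the function names and the threshold value.

```python
def average_power(spectrum):
    """Mean of |Y(k)|^2 over the frequency points of one frame."""
    return sum(abs(value) ** 2 for value in spectrum) / len(spectrum)


def has_noise_mutation(prev_spectrum, cur_spectrum, variation_threshold):
    """Flag a mutation when the average power changes by more than the threshold.

    Using the absolute difference as the variation is an assumption.
    """
    variation = abs(average_power(cur_spectrum) - average_power(prev_spectrum))
    return variation > variation_threshold


# Hypothetical spectra with four frequency points each.
quiet = [1 + 0j, 1j, -1 + 0j, -1j]   # average power 1.0
loud = [3 + 0j, 3j, -3 + 0j, -3j]    # average power 9.0
mutation = has_noise_mutation(quiet, loud, variation_threshold=5.0)
steady = has_noise_mutation(quiet, quiet, variation_threshold=5.0)
```

The result of this check selects between the adaptively determined smoothing factor and the fixed one in the steps that follow.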

If a noise mutation occurs in the current frame, a smoothing factor is determined based on the signal power of the frequency point signals of the voice signal, and the voice signal is then smoothed with it, which improves the accuracy of noise estimation and yields an initial noise estimate. If no noise mutation occurs in the current frame, the voice signal is smoothed based on a fixed smoothing factor value, so as to reduce the delay of noise estimation.

After the initial noise estimation value is obtained based on the smoothing factor, deviation compensation is performed on it using the voice existence probability generated during smoothing, which corrects the noise estimation value of the voice signal. The voice signal is then processed with a log-MMSE gain function based on the noise estimation value to obtain a noise-reduced voice spectrum, and inverse Fourier transform and inverse framing are applied to that spectrum to obtain the final noise-reduced voice signal.

The embodiment of the application uses a transient noise detection mechanism to detect the noise and then modifies the way the noise is estimated, thereby tracking transient noise better. In the embodiment of the present application, whether a noise mutation occurs in the first voice signal is determined by comparing the variation in average power between the first voice signal and the second voice signal with a variation threshold. When a noise mutation occurs in the first voice signal, a smoothing factor in a recursive averaging algorithm is determined, and noise estimation is performed on the first voice signal through the recursive averaging algorithm based on that smoothing factor. Once the initial noise estimation value of the first voice signal is obtained, it is adjusted adaptively to prevent over-estimation or under-estimation of the noise, which improves the accuracy of noise estimation. The voice noise reduction method provided by the embodiment of the application has low computational complexity and low resource consumption, reduces the delay of the algorithm, and preserves its real-time performance.

Fig. 6 is a schematic structural diagram of a speech noise reduction apparatus provided in an embodiment of the present application, where the speech noise reduction apparatus may be implemented by software, hardware, or a combination of the two as part or all of a computer device, and the computer device may be the terminal shown in fig. 7 or the server shown in fig. 8. Referring to fig. 6, the apparatus includes: a first acquisition module 601, a first determination module 602, a noise estimation module 603 and a processing module 604.

A first obtaining module 601, configured to obtain a first voice signal, where the first voice signal includes multiple frequency point signals;

a first determining module 602, configured to determine a smoothing factor in a recursive average algorithm based on a signal power of each frequency point signal in a plurality of frequency point signals;

a noise estimation module 603, configured to perform noise estimation on the first speech signal through a recursive average algorithm based on the smoothing factor to obtain a noise estimation value of the first speech signal, where the noise estimation value indicates power of a noise signal in the first speech signal;

a processing module 604, configured to perform noise reduction processing on the first speech signal based on the noise estimation value.

Optionally, the first determining module 602 includes:

Optionally, the second determining sub-module is configured to:

determining a smoothing factor based on an activation function value of each frequency point signal in a plurality of frequency point signals through a first formula, wherein the first formula is as follows:

where the first formula yields the smoothing factor, and α_d is a fixed value.

Optionally, the noise estimation module 603 includes:

the processing submodule is used for carrying out smoothing processing on the first voice signal based on the smoothing factor to obtain the voice existence probability of the first voice signal, and the voice existence probability indicates the probability of the existence of the effective voice signal in the first voice signal;

a third determining submodule for determining an initial noise estimate value of the first speech signal based on the speech presence probability;

the deviation compensation submodule is used for performing deviation compensation on the initial noise estimation value according to the first compensation factor when the voice existence probability indicates that the effective voice signal exists in the first voice signal, so as to obtain a noise estimation value; when the voice existence probability indicates that no effective voice signal exists in the first voice signal, performing deviation compensation on the initial noise estimation value according to a second compensation factor to obtain a noise estimation value;

Optionally, a processing submodule, configured to:

carrying out first smoothing processing on the first voice signal to obtain a smoothing power spectrum of the first voice signal;

based on the minimum value in the smoothed power spectrum, a speech presence probability of the first speech signal is determined.

Optionally, the apparatus further comprises:

the second acquisition module is used for acquiring a second voice signal, wherein the second voice signal is a previous frame voice signal of the first voice signal;

and the third determining module is used for executing the operation of determining the smoothing factor in the recursive average algorithm based on the signal power of each frequency point signal in the plurality of frequency point signals if the average power variation exceeds the variation threshold.

Optionally, the apparatus further comprises:

and the fourth determination module is used for performing noise estimation on the first voice signal through a recursive average algorithm based on the reference smoothing factor and determining a noise estimation value of the first voice signal if the average power variation does not exceed the variation threshold.

It should be noted that: in the voice noise reduction device provided in the above embodiment, when noise reduction is performed on a voice signal, only the division of the above functional modules is illustrated, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the voice noise reduction device and the voice noise reduction method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.

Fig. 7 is a block diagram of a terminal 700 according to an embodiment of the present disclosure. In general, terminal 700 includes: a processor 701 and a memory 702.

The processor 701 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 701 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 701 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 701 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 701 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 702 may include one or more computer-readable storage media, which may be non-transitory. Memory 702 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 702 is used to store at least one instruction for execution by processor 701 to implement the speech noise reduction methods provided by method embodiments herein.

In some embodiments, the terminal 700 may further optionally include: a peripheral interface 703 and at least one peripheral. The processor 701, the memory 702, and the peripheral interface 703 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 703 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 704, touch screen display 705, camera 706, audio circuitry 707, positioning components 708, and power source 709.

Those skilled in the art will appreciate that the configuration shown in fig. 7 is not intended to be limiting of terminal 700 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.

Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. The server 800 includes a Central Processing Unit (CPU)801, a system memory 804 including a Random Access Memory (RAM)802 and a Read Only Memory (ROM)803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806, which facilitates transfer of information between devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.

The basic input/output system 806 includes a display 808 for displaying information and an input device 809 such as a mouse, keyboard, etc. for user input of information. Wherein a display 808 and an input device 809 are connected to the central processing unit 801 through an input output controller 810 connected to the system bus 805. The basic input/output system 806 may also include an input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, input-output controller 810 also provides output to a display screen, a printer, or other type of output device.

The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.

Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 804 and mass storage 807 described above may be collectively referred to as memory.

According to various embodiments of the present application, the server 800 may also be operated through a remote computer connected to it over a network such as the Internet. That is, the server 800 may be connected to the network 812 through the network interface unit 811 coupled to the system bus 805, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 811.

The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.

In some embodiments, a computer-readable storage medium is also provided, in which a computer program is stored, which when executed by a processor implements the steps of the speech noise reduction method in the above embodiments. For example, the computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It is noted that the computer-readable storage medium referred to in the embodiments of the present application may be a non-volatile storage medium, in other words, a non-transitory storage medium.

It should be understood that all or part of the steps for implementing the above embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented in software, the steps may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer-readable storage medium described above.

That is, in some embodiments, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the speech noise reduction method described above.

It is to be understood that reference herein to "at least one" means one or more and "a plurality" means two or more. In the description of the embodiments of the present application, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association between associated objects and means that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, to describe the technical solutions of the embodiments of the present application clearly, terms such as "first" and "second" are used to distinguish identical or similar items having substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first," "second," etc. do not denote any order, quantity, or importance.

The above-mentioned embodiments are provided not to limit the present application, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A method for speech noise reduction, the method comprising:

2. The method of claim 1, wherein said determining a smoothing factor in a recursive average algorithm based on the signal power of each of the plurality of frequency bin signals comprises:

3. The method of claim 2 wherein said determining said smoothing factor based on the activation function value for each of said plurality of frequency bin signals comprises:

wherein the first formula yields the smoothing factor, and the α_d is a fixed value.

4. The method of claim 1, wherein said noise estimating the first speech signal by the recursive averaging algorithm based on the smoothing factor to obtain a noise estimate for the first speech signal comprises:

5. The method of claim 4, wherein smoothing the first speech signal based on the smoothing factor to obtain the speech presence probability of the first speech signal comprises:

6. The method according to any of claims 1-5, wherein before determining the smoothing factor in the recursive averaging algorithm based on the signal power of each of the plurality of frequency bin signals, the method further comprises:

7. The method of claim 6, wherein the method further comprises:

8. An apparatus for speech noise reduction, the apparatus comprising:

9. A computer device, characterized in that the computer device comprises a memory for storing a computer program and a processor for executing the computer program stored in the memory to implement the steps of the method according to any of the claims 1-7.

10. A computer-readable storage medium, characterized in that a computer program is stored in the storage medium, which computer program, when being executed by a processor, carries out the steps of the method of one of the claims 1 to 7.