CN103021420B - Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation - Google Patents


Info

Publication number: CN103021420B
Application number: CN201210513075.0A
Authority: CN (China)
Other languages: Chinese (zh)
Other versions: CN103021420A
Inventors: 刘文举 (Liu Wenju), 李超 (Li Chao)
Assignee: Institute of Automation, Chinese Academy of Sciences
Legal status: Active (granted)

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a speech enhancement method using multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation. The method mainly includes: truncating the signal acquired by a microphone and performing a fast Fourier transform (FFT); performing a small-shift maximum search on the amplitude spectrum with a phase adjustment algorithm to obtain an adjusted amplitude spectrum of the noisy speech; estimating the amplitude spectrum of the noise; dividing the full band into several sub-bands and computing the signal-to-noise ratio of each sub-band; performing amplitude spectral subtraction with an over-subtraction rule on each sub-band; performing amplitude compensation on the speech spectrum after subtraction; and recovering the time-domain waveform of the signal by an inverse fast Fourier transform and overlap-add of the frames.

Description

Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation
Technical Field
The invention relates to the field of speech signal processing, and in particular to a speech enhancement method using multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation.
Background
Spectral subtraction is one of the most widely used speech enhancement algorithms; it was originally proposed for noise cancellation (reference 1: S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., 27(2), 113-120, 1979). It rests on a basic premise: for additive noise, the noise spectrum can be subtracted from the Discrete Fourier Transform (DFT) spectrum of the noisy speech to obtain an estimate of the speech spectrum. The noise spectrum can be estimated and updated during silence segments. The enhanced speech time-domain waveform is then obtained by applying the Inverse Discrete Fourier Transform (IDFT) to the estimated speech spectrum. Spectral subtraction needs only a DFT and an IDFT, so its computational complexity is low and its implementation is simple.
Spectral subtraction is broadly divided into first-order spectral subtraction (i.e., magnitude spectral subtraction) and second-order spectral subtraction (i.e., power spectral subtraction). Whatever its form, great care must be taken in the design to avoid introducing speech distortion. If the subtracted portion is greater than the noise, speech information is lost; conversely, if it is less than the noise, excessive residual noise remains. Researchers have since proposed many improved algorithms to attenuate (or even eliminate) the speech distortion introduced by spectral subtraction.
The simplicity of spectral subtraction comes with several drawbacks, the most prominent of which is musical noise. Because of noise-estimation errors and spectral fluctuations, the amplitude of the noisy signal in some frequency bins falls below the estimated noise amplitude, so the estimated speech spectrum after subtraction takes negative values. The simplest remedy is to set these negative values to zero so that the spectral magnitudes across the full band are non-negative. However, this nonlinear operation on the negative values produces many isolated peaks in the spectrum. These isolated peaks are strongly random in both the time and frequency domains; although not large in magnitude, their effect is severe. In the time domain they sound like brief pure tones whose pitch (frequency) varies randomly from frame to frame, creating a new type of noise commonly called musical noise. In many cases, musical noise is more objectionable than the original noise. Another important cause of musical noise is the large variance of the estimated spectra of the noisy speech and the noise, together with the large differences in the subtraction rule across frequency bins.
Musical noise is difficult to overcome in conventional spectral subtraction because the method relies on the rule that the cross terms between the speech spectrum and the noise spectrum can be ignored. This rule agrees with long-term statistics: since speech and noise are mutually independent random processes, the expected value of their cross terms is zero. In practical speech enhancement implementations, however, each frame is only about 20-30 ms long, and over such a short interval the rule rarely holds (reference 2: N. Evans, J. Mason, W. Liu, and B. Fauve, "An assessment on the fundamental limitations of spectral subtraction," Proc. IEEE Internat. Conf. on Acoustics, Speech, Signal Processing (ICASSP), 2006). The spectral subtraction formula is therefore only an approximation, not an exact identity. Researchers have made many efforts to study the effects of the cross terms, but those studies mainly target the performance of Automatic Speech Recognition (ASR), not speech quality.
Disclosure of Invention
Technical problem to be solved
The fundamental source of musical noise in spectral subtraction is cross-term error. The invention aims to overcome the adverse effect of this error and provides an amplitude spectral subtraction method under zero cross-term error, so as to eliminate musical noise thoroughly.
(II) technical scheme
The invention solves the above technical problem with a speech enhancement method using multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation, comprising the following steps:
step a: collect a noisy speech signal y(n) and obtain its amplitude spectrum α_y(ω), where n denotes the discrete time index and ω the discrete frequency bin;
step b: perform a small-shift maximum search on the noisy-speech amplitude spectrum with a phase adjustment algorithm to obtain the adjusted noisy-speech amplitude spectrum α̂_y(ω), corresponding to a phase difference of 0 between the clean speech signal and the additive noise signal;
step c: update the additive-noise amplitude spectrum α̂_v(ω) with a noise estimation algorithm;
step d: apply amplitude spectral subtraction to α̂_y(ω) using the over-subtraction rule coefficients and the additive-noise amplitude spectrum α̂_v(ω) to obtain the clean-speech amplitude spectrum α̂_x(ω);
step e: compensate α̂_x(ω) with a second-order amplitude compensation factor and a preset first-order amplitude compensation factor to obtain the enhanced clean-speech amplitude spectrum α̃_x(ω), and from it the enhanced clean speech signal x̃(n).
(III) advantageous effects
For the cross terms between the speech spectrum and the noise spectrum in conventional spectral subtraction, the unreasonable assumption of neglecting the cross-term error is abandoned. The invention derives a multi-sub-band spectral subtraction method based on phase adjustment and amplitude compensation from a spatial-geometry principle, making full use of the differences between speech and noise signals. Experiments show that the method overcomes the influence of the cross terms well, effectively eliminates the isolated peaks that appear randomly in the spectrum, and thus effectively suppresses musical noise.
Drawings
FIG. 1 is a flow chart of a speech enhancement method of the present invention based on phase adjustment and amplitude compensation for multi-subband spectral subtraction;
FIG. 2 is a flow chart of a phase adjustment algorithm in the speech enhancement method of the present invention;
FIG. 3 is a flow chart of dividing a plurality of sub-bands and calculating corresponding SNRs in the speech enhancement method of the present invention;
FIG. 4 is a flow chart of sub-band spectral subtraction in the speech enhancement method of the present invention;
FIG. 5 is a flow chart of an amplitude compensation algorithm in the speech enhancement method of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
As shown in FIG. 1, the present invention discloses a speech enhancement method based on phase adjustment and amplitude compensation for multi-subband spectral subtraction, which comprises the following steps:
step 101: noisy speech signal y (n) is collected by a microphone. The following is a detailed description of step 101.
Assume that the noisy speech signal y(n) collected by the microphone is the sum of x(n) and v(n), i.e.
y(n)=g*s(n)+v(n)=x(n)+v(n) (1)
Where g is the impulse response from source s (n) to the microphone, x (n) is the clean speech picked up by the microphone, and v (n) is additive noise.
Step 102: truncate the noisy speech and perform a Fast Fourier Transform (FFT) to obtain the amplitude and phase spectra of the noisy speech signal. Step 102 is described in detail below.
In this example a 32 ms Hanning window truncates the signal, and an FFT yields the frequency-domain representation
Y(ω)=X(ω)+V(ω) (2)
Wherein, X represents the pure voice frequency spectrum, Y represents the frequency spectrum of the voice with noise, V represents the frequency spectrum of the additive noise, and omega represents the discrete frequency point.
Writing equation (2) in polar form,
$$\alpha_y(\omega)e^{j\theta_y(\omega)}=\alpha_x(\omega)e^{j\theta_x(\omega)}+\alpha_v(\omega)e^{j\theta_v(\omega)} \qquad (3)$$

where α_y(ω), α_x(ω) and α_v(ω) are the amplitude spectra of the noisy speech, the clean speech and the additive noise, and θ_y(ω), θ_x(ω) and θ_v(ω) are the corresponding phase spectra. The noisy-speech amplitude spectrum α_y(ω) can be obtained by averaging several successive frames, the additive-noise amplitude spectrum α_v(ω) can be estimated with a noise estimation algorithm, and the clean-speech amplitude spectrum α_x(ω) is the quantity to be estimated.
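The windowing-and-FFT step can be sketched in Python with NumPy. The 256-point frame (32 ms at 8 kHz) matches the parameters used later in the text; the function name `frame_fft` is illustrative, not part of the patent:

```python
import numpy as np

def frame_fft(y, fs=8000, frame_ms=32, n_fft=256):
    """Truncate a noisy signal with a Hanning window and take the FFT of one
    frame, returning its amplitude and phase spectra (the polar form of eq. (3))."""
    L = int(fs * frame_ms / 1000)          # frame length: 256 samples at 8 kHz
    w = np.hanning(L)                      # Hanning analysis window
    frame = y[:L] * w
    Y = np.fft.fft(frame, n_fft)           # frequency-domain representation Y(omega)
    alpha_y = np.abs(Y)                    # amplitude spectrum alpha_y(omega)
    theta_y = np.angle(Y)                  # phase spectrum theta_y(omega)
    return alpha_y, theta_y
```

By construction the polar form is exact: `alpha_y * exp(j*theta_y)` reproduces `Y(ω)`.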
Step 103: carrying out microspur maximum value search on the amplitude spectrum by utilizing a phase adjustment algorithm to obtain a noisy speech amplitude spectrum when the phase difference between a pure speech signal x (n) and an additive noise signal v (n) is 0Namely, on the omega frequency point, the maximum value of the amplitude of the voice signal with noise is searched in a plurality of continuous moments as the amplitude spectrum of the voice with noise after phase adjustment. The detailed flowchart of step 103 is shown in fig. 2, and the specific steps are as follows:
step 201: the displacement point of the noise-containing speech signal is initialized to m 1.
Step 202: forward shifting the noisy speech signal by m sampling points to obtain a shifted noisy speech signal:
ym=[y(n-m),y(n-m-1),...,y(n-m-L+1)]T (4)
wherein [ ·]TIs a transposition operator, vector ymEach element in (a) represents a sample value of the noisy speech signal at time point i, i ═ n-m, n-m-1.
Step 203: apply the FFT to the shifted noisy speech signal to obtain its spectrum Y_m(ω), i.e.

Y_m(ω) = FFT(y_m)

where FFT(·) is the fast Fourier transform operator.
Step 204: take the absolute value of the spectrum of the shifted noisy speech signal to obtain its amplitude spectrum |Y_m(ω)|.
Step 205: judging the displacement point M, if M > MmaxIf the judgment result is no, go to step 206; otherwise, go to step 207. Wherein,
wherein,is an operator rounded up, fsIs the sampling rate and Ω is the length of the FFT.
Step 206: the displacement point m is incremented by 1 and step 202 is performed.
Step 207: performing a budget of taking the maximum value once, namely on the omega frequency point, and performing Y pair within the range of 0 < M < M (omega)m(omega) taking the maximum value as the amplitude spectrum of the noise-carrying voice signal after phase adjustment
<math> <mrow> <msub> <mover> <mi>&alpha;</mi> <mo>^</mo> </mover> <mi>y</mi> </msub> <mrow> <mo>(</mo> <mi>&omega;</mi> <mo>)</mo> </mrow> <mo>=</mo> <munder> <mi>max</mi> <mrow> <mn>0</mn> <mo>&le;</mo> <mi>m</mi> <mo>&le;</mo> <mi>M</mi> <mrow> <mo>(</mo> <mi>&omega;</mi> <mo>)</mo> </mrow> </mrow> </munder> <mo>|</mo> <msub> <mi>Y</mi> <mi>m</mi> </msub> <mrow> <mo>(</mo> <mi>&omega;</mi> <mo>)</mo> </mrow> <mo>|</mo> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>6</mn> <mo>)</mo> </mrow> </mrow> </math>
Wherein,
wherein,is an integer operator up and Ω denotes the length of the fast fourier transform.
Through step 103, the phase angle θ between the clean speech signal x (n) and the additive noise signal v (n) is finally obtainedxvAmplitude spectrum of noisy speech signal when 0
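Steps 201-207 can be sketched as follows. Since the per-bin bound M(ω) of equation (5) is not reproduced in the text above, a single fixed `max_shift` is used here as a simplifying assumption; the function name is also illustrative:

```python
import numpy as np

def phase_adjusted_spectrum(y, n, L=256, n_fft=256, max_shift=8):
    """Shift the analysis frame by m = 0..max_shift samples, FFT each shifted
    frame, and keep the per-bin maximum magnitude as the phase-adjusted
    amplitude spectrum (eq. (6)).  ASSUMPTION: a fixed shift bound replaces
    the patent's frequency-dependent M(omega) of eq. (5)."""
    best = np.zeros(n_fft)
    for m in range(max_shift + 1):
        frame = y[n - m - L + 1 : n - m + 1]   # the L samples ending at y(n-m), eq. (4)
        Ym = np.fft.fft(frame, n_fft)          # step 203: spectrum of the shifted frame
        best = np.maximum(best, np.abs(Ym))    # steps 204/207: per-bin maximum
    return best
```

The sample ordering inside the frame does not affect the magnitude spectrum of a real signal, so the vector is kept in natural time order rather than the reversed order of equation (4).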
Step 104: update the estimate of the additive-noise power spectrum. Step 104 is described in detail below.
First, compute the Signal-to-Noise Ratio (SNR) of the entire band:

$$\mathrm{SNR}_k=10\log_{10}\left(\sum_\omega\hat{\alpha}_y^2(k,\omega)\Big/\sum_\omega\hat{\sigma}_v^2(k-1,\omega)\right)$$

where log_10 is the base-10 logarithm, Σ is the summation over the frequency bins, k is the frame index, ω the discrete frequency bin, α̂_y²(k,ω) the estimate of the noisy-speech power spectrum of the current frame k, and σ̂_v²(k-1,ω) the estimate of the additive-noise power spectrum of the previous frame k-1.
Then, following a Voice Activity Detection (VAD) scheme with the voiced-segment lower SNR threshold SNR_th, update the estimate of the additive-noise power spectrum as:

$$\hat{\sigma}_v^2(k,\omega)=\begin{cases}\hat{\sigma}_v^2(k-1,\omega) & \text{if }\mathrm{SNR}_k>\mathrm{SNR}_{th}\\ 0.98\,\hat{\sigma}_v^2(k-1,\omega)+0.02\,\hat{\alpha}_y^2(k,\omega) & \text{else}\end{cases} \qquad (7)$$

where σ̂_v²(k-1,ω) is the estimate of the additive-noise power spectrum of the previous frame k-1, α̂_y²(k,ω) is the estimate of the noisy-speech power spectrum of the current frame k, and k is the frame index.
The additive-noise amplitude spectrum α̂_v(k,ω) is then obtained as the square root of the additive-noise power spectrum estimate.
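A minimal sketch of the update rule of step 104 (equation (7)). The text does not fix the value of SNR_th, so the 3 dB default below is an assumption borrowed from the SNR_0 threshold used later; the function name is illustrative:

```python
import numpy as np

def update_noise_psd(sigma_v2_prev, alpha_y2, snr_th=3.0):
    """Recursively update the additive-noise power spectrum (eq. (7)).
    ASSUMPTION: snr_th = 3 dB stands in for the unspecified SNR_th."""
    # full-band SNR of the current frame against the previous noise estimate
    snr_k = 10.0 * np.log10(alpha_y2.sum() / sigma_v2_prev.sum())
    if snr_k > snr_th:
        # speech present: freeze the noise estimate
        sigma_v2 = sigma_v2_prev.copy()
    else:
        # speech absent: smooth the estimate toward the current frame
        sigma_v2 = 0.98 * sigma_v2_prev + 0.02 * alpha_y2
    alpha_v = np.sqrt(sigma_v2)   # additive-noise amplitude spectrum
    return sigma_v2, alpha_v
```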
Step 105: divide the full frequency band into a plurality of sub-bands and compute the signal-to-noise ratio in each sub-band. The detailed flowchart of step 105 is shown in fig. 3; the specific steps are as follows:
Step 301: initialization; in this example the number of sub-bands is R = 8 and the sampling rate is f_s = 8000 Hz.
Step 302: compute the sub-band bandwidth; in this example a uniformly distributed, non-overlapping partition is used, i.e., every sub-band has the same bandwidth f_d = 500 Hz.
Step 303: compute the boundary bins of each sub-band as follows.
Start bin:

$$b_r=\frac{(r-1)f_d}{f_s}\cdot\Omega+1 \qquad (8)$$

Cut-off bin:

$$e_r=\frac{r f_d}{f_s}\cdot\Omega \qquad (9)$$

where f_d is the sub-band bandwidth, f_s the sampling rate, Ω the FFT length, and r the index of the r-th sub-band.
Step 304: compute the SNR of the r-th sub-band, SNR_r, r = 1, 2, ..., R:

$$\mathrm{SNR}_r=10\log_{10}\left(\sum_{\omega=b_r}^{e_r}\hat{\alpha}_y^2(k,\omega)\Big/\sum_{\omega=b_r}^{e_r}\hat{\sigma}_v^2(k,\omega)\right) \qquad (10)$$

where log_10 is the base-10 logarithm, k the frame index, ω the discrete frequency bin, α̂_y²(k,ω) the estimate of the noisy-speech power spectrum of the current frame k, and σ̂_v²(k,ω) the estimate of the additive-noise power spectrum of the current frame k.
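Steps 301-304 (equations (8)-(10)) can be sketched as follows; the function names are illustrative:

```python
import numpy as np

def subband_edges(R=8, fd=500, fs=8000, n_fft=256):
    """Start/cut-off FFT bins of R uniform, non-overlapping sub-bands
    (eqs. (8) and (9)); with fd = 500 Hz each sub-band spans 16 bins."""
    r = np.arange(1, R + 1)
    b = (r - 1) * fd * n_fft // fs + 1    # b_r = (r-1) f_d / f_s * Omega + 1
    e = r * fd * n_fft // fs              # e_r = r f_d / f_s * Omega
    return b, e

def subband_snr(alpha_y2, sigma_v2, b, e):
    """Per-sub-band SNR in dB (eq. (10)), summing power over bins b_r..e_r."""
    return np.array([10.0 * np.log10(alpha_y2[bi:ei + 1].sum() /
                                     sigma_v2[bi:ei + 1].sum())
                     for bi, ei in zip(b, e)])
```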
Step 106: apply spectral subtraction with the over-subtraction rule to the noisy-speech amplitude spectrum α̂_y(k,ω) of the current frame k to obtain the clean-speech amplitude spectrum α̂_x(k,ω) of the current frame k. The detailed flowchart of step 106 is shown in fig. 4; the specific steps are as follows:
Step 401: initialization: set r = 1, i.e., the algorithm starts from the first sub-band.
Step 402: a subtraction gain factor is calculated as follows:
<math> <mrow> <msub> <mi>&delta;</mi> <mi>r</mi> </msub> <mo>=</mo> <mfenced open='{' close=''> <mtable> <mtr> <mtd> <mn>1</mn> </mtd> <mtd> <mi>for</mi> </mtd> <mtd> <msub> <mi>e</mi> <mi>r</mi> </msub> <mo>&le;</mo> <mi>&Omega;</mi> <mo>&CenterDot;</mo> <mn>1000</mn> <mo>/</mo> <msub> <mi>f</mi> <mi>s</mi> </msub> </mtd> </mtr> <mtr> <mtd> <mn>1.5</mn> </mtd> <mtd> <mi>for</mi> </mtd> <mtd> <mi>&Omega;</mi> <mo>&CenterDot;</mo> <mn>1000</mn> <mo>/</mo> <msub> <mi>f</mi> <mi>s</mi> </msub> <mo>&lt;</mo> <msub> <mi>e</mi> <mi>r</mi> </msub> <mo>&le;</mo> <mi>&Omega;</mi> <mo>/</mo> <mn>2</mn> <mo>-</mo> <mi>&Omega;</mi> <mo>&CenterDot;</mo> <mn>3000</mn> <mo>/</mo> <msub> <mi>f</mi> <mi>s</mi> </msub> </mtd> </mtr> <mtr> <mtd> <mn>2</mn> </mtd> <mtd> <mi>for</mi> </mtd> <mtd> <msub> <mi>e</mi> <mi>r</mi> </msub> <mo>></mo> <mi>&Omega;</mi> <mo>/</mo> <mn>2</mn> <mo>-</mo> <mi>&Omega;</mi> <mo>&CenterDot;</mo> <mn>3000</mn> <mo>/</mo> <msub> <mi>f</mi> <mi>s</mi> </msub> </mtd> </mtr> </mtable> </mfenced> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>11</mn> <mo>)</mo> </mrow> </mrow> </math>
wherein, brAnd erRespectively, a start frequency point and a cut-off frequency point of the r-th sub-band.
Step 403: the over-subtraction rule coefficients are calculated as follows:
step 404: the regular spectral subtraction is performed on the r-th subband as follows:
wherein,is the additive noise amplitude spectrum of the current frame k,is the noisy speech amplitude spectrum for the current frame k.
Step 405: check whether r < R. If so, go to step 406; otherwise go to step 407.
Step 406: set r = r + 1 and return to step 402.
Step 407: apply half-wave rectification, which sets any negative amplitude produced by equation (13) to a small floor and at the same time suppresses musical noise caused by inaccurate noise estimates:

$$\hat{\alpha}_x(k,\omega)=\begin{cases}\hat{\alpha}_x(k,\omega) & \text{if }\hat{\alpha}_x(k,\omega)>\phi\,\hat{\alpha}_y(k,\omega)\\ \phi\cdot\hat{\alpha}_y(k,\omega) & \text{else}\end{cases} \qquad (14)$$

where φ is the maximum attenuation threshold, set to 0.02.
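Steps 401-407 can be sketched in Python. The band gain δ_r follows equation (11) and the floor φ = 0.02 follows equation (14); since the over-subtraction formula (12) did not survive in the text above, `oversub_coeff` below uses a Berouti-style rule as a stand-in ASSUMPTION, not the patent's exact coefficient:

```python
import numpy as np

def band_gain_delta(e_r, fs=8000, n_fft=256):
    """Eq. (11): band-dependent subtraction gain delta_r, keyed on the
    sub-band's cut-off bin e_r.  Note that at fs = 8 kHz the two thresholds
    coincide (both equal bin 32), so the middle region is empty."""
    lo = n_fft * 1000 // fs                  # bin of 1 kHz
    hi = n_fft // 2 - n_fft * 3000 // fs     # Omega/2 minus the bin of 3 kHz
    if e_r <= lo:
        return 1.0
    elif e_r <= hi:
        return 1.5
    return 2.0

def oversub_coeff(snr_r):
    """Stand-in for eq. (12): a Berouti-style over-subtraction coefficient
    falling linearly from 4.75 at low SNR to 1 above 20 dB (ASSUMED rule)."""
    if snr_r < -5.0:
        return 4.75
    if snr_r > 20.0:
        return 1.0
    return 4.0 - 3.0 * snr_r / 20.0

def subtract_and_floor(alpha_y, alpha_v, gamma, delta, phi=0.02):
    """Eqs. (13)/(14): over-subtraction in one sub-band, then half-wave
    rectification against the spectral floor phi * alpha_y."""
    alpha_x = alpha_y - gamma * delta * alpha_v
    return np.maximum(alpha_x, phi * alpha_y)
```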
Step 107: compensate the clean-speech amplitude spectrum estimate α̂_x with a second-order amplitude compensation factor μ_{2,r} and a preset first-order amplitude compensation factor μ_{1,r} to obtain the enhanced speech spectrum α̃_x. The detailed flowchart of step 107 is shown in fig. 5; the specific steps are as follows:
Step 501: initialization: set r = 1, i.e., the algorithm starts from the first sub-band.
Step 502: compute the second-order compensation factor as follows:

$$\mu_{2,r}=\begin{cases}0 & \text{if }\mathrm{SNR}_r<\mathrm{SNR}_0\\ \dfrac{\mathrm{SNR}_r-\mathrm{SNR}_0}{4}\cdot e^{-\frac{(\mathrm{SNR}_r-\mathrm{SNR}_0)^2}{8}} & \text{else}\end{cases} \qquad (15)$$

where SNR_0 is the VAD threshold indicating whether voice activity is present, set to 3 dB.
Step 503: apply the amplitude spectrum compensation:

$$\tilde{\alpha}_x(\omega)=\hat{\alpha}_x(\omega)+\mu_{1,r}\cdot\hat{\alpha}_y(\omega)+\mu_{2,r}\cdot\hat{\alpha}_y^2(\omega) \qquad (16)$$

where the first-order compensation factor μ_{1,r} = 0.05 serves mainly to further suppress musical noise.
Step 504: check whether r < R. If so, go to step 505; otherwise go to step 506.
Step 505: set r = r + 1 and return to step 502.
Step 506: the algorithm ends.
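The compensation of step 107 can be sketched as follows. The values μ_{1,r} = 0.05 and SNR_0 = 3 dB are those stated in the text; the function names are illustrative:

```python
import numpy as np

def second_order_factor(snr_r, snr0=3.0):
    """Eq. (15): second-order amplitude-compensation factor mu_{2,r};
    zero below the 3 dB VAD threshold SNR_0, then a scaled Gaussian bump."""
    if snr_r < snr0:
        return 0.0
    d = snr_r - snr0
    return (d / 4.0) * np.exp(-d * d / 8.0)

def compensate(alpha_x, alpha_y, mu2, mu1=0.05):
    """Eq. (16): add back a small first- and second-order share of the noisy
    amplitude spectrum to mask residual musical noise."""
    return alpha_x + mu1 * alpha_y + mu2 * alpha_y ** 2
```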
Step 108: apply the Inverse Fast Fourier Transform (IFFT) to the enhanced clean-speech amplitude spectrum α̃_x(ω), and superimpose the resulting short-time time-domain clean-speech signals with a 75% overlap ratio to obtain the enhanced clean speech signal x̃(n).
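A minimal overlap-add reconstruction for step 108, assuming the enhanced frame spectra are stored row-wise; the constant window-sum normalization required after overlap-add is omitted for brevity, and the function name is illustrative:

```python
import numpy as np

def overlap_add(frames, hop):
    """Inverse-FFT each enhanced frame spectrum and overlap-add the
    time-domain pieces; a hop of L/4 gives the 75% overlap used in the text."""
    L = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + L)
    for i, F in enumerate(frames):
        x = np.real(np.fft.ifft(F))        # short-time clean-speech estimate
        out[i * hop : i * hop + L] += x    # accumulate at the frame offset
    return out
```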
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech enhancement method using multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation, comprising the steps of:
step a: collecting a noisy speech signal y(n) and obtaining its amplitude spectrum α_y(ω), where n denotes the discrete time index and ω the discrete frequency bin;
step b: performing a small-shift maximum search on the noisy-speech amplitude spectrum with a phase adjustment algorithm to obtain the adjusted noisy-speech amplitude spectrum α̂_y(ω), corresponding to a phase difference of 0 between the clean speech signal and the additive noise signal;
step c: updating the additive-noise amplitude spectrum α̂_v(ω) with a noise estimation algorithm;
step d: applying amplitude spectral subtraction to α̂_y(ω) using the over-subtraction rule coefficients and the additive-noise amplitude spectrum α̂_v(ω) to obtain the clean-speech amplitude spectrum α̂_x(ω);
step e: compensating α̂_x(ω) with a second-order amplitude compensation factor and a preset first-order amplitude compensation factor to obtain the enhanced clean-speech amplitude spectrum α̃_x(ω), and from it the enhanced clean speech signal x̃(n).
2. The method according to claim 1, wherein the small-shift maximum search on the noisy-speech amplitude spectrum in step b comprises:
at frequency bin ω, searching the maximum of the noisy-speech amplitude over M(ω) consecutive shifts as the phase-adjusted amplitude spectrum, i.e., finding the noisy-speech amplitude spectrum for a phase difference of 0 between the clean speech signal and the additive noise signal:

$$\hat{\alpha}_y(\omega)=\max_{0\le m\le M(\omega)}|Y_m(\omega)|$$

where Y_m(ω) is the fast Fourier transform of the speech signal shifted by m sampling points, and M(ω) takes a different value at each frequency bin, per the formula defined with the round-up operator ⌈·⌉, the fast Fourier transform length Ω, and the discrete frequency bin ω.
3. The method of claim 1, wherein the step c updates an additive noise magnitude spectrumFurther comprising:
step c 1: calculating the signal-to-noise ratio SNR of the full frequency band:
<math> <mrow> <msub> <mi>SNR</mi> <mi>k</mi> </msub> <mo>=</mo> <mn>10</mn> <msub> <mi>log</mi> <mn>10</mn> </msub> <mrow> <mo>(</mo> <mover> <mi>&Sigma;</mi> <mi>&omega;</mi> </mover> <msubsup> <mover> <mi>&alpha;</mi> <mo>^</mo> </mover> <mi>y</mi> <mn>2</mn> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>,</mo> <mi>&omega;</mi> <mo>)</mo> </mrow> <mo>/</mo> <mover> <mi>&Sigma;</mi> <mi>&omega;</mi> </mover> <msubsup> <mover> <mi>&sigma;</mi> <mo>^</mo> </mover> <mi>v</mi> <mn>2</mn> </msubsup> <mrow> <mo>(</mo> <mi>k</mi> <mo>-</mo> <mn>1</mn> <mo>,</mo> <mi>&omega;</mi> <mo>)</mo> </mrow> <mo>)</mo> </mrow> </mrow> </math>
wherein $\log_{10}$ denotes the base-10 logarithm, $\sum_\omega$ denotes summation over the discrete frequency bins, k is the frame index, ω is the discrete frequency bin, $\hat{\alpha}_y^2(k,\omega)$ is the estimate of the noisy speech power spectrum of the current frame k, and $\hat{\sigma}_v^2(k-1,\omega)$ is the estimate of the additive noise power spectrum of the previous frame k−1;
step c2: using a voice activity detection (VAD) decision with the voiced-segment lower threshold $\mathrm{SNR}_{th}$, update the estimate of the additive noise power spectrum:
$$\hat{\sigma}_v^2(k,\omega) = \begin{cases} \hat{\sigma}_v^2(k-1,\omega), & \text{if } \mathrm{SNR}_k > \mathrm{SNR}_{th} \\ 0.98 \cdot \hat{\sigma}_v^2(k-1,\omega) + 0.02 \cdot \hat{\alpha}_y^2(k,\omega), & \text{otherwise} \end{cases}$$
wherein $\hat{\sigma}_v^2(k-1,\omega)$ represents the estimate of the additive noise power spectrum of the previous frame k−1, $\hat{\alpha}_y^2(k,\omega)$ represents the estimate of the noisy speech power spectrum of the current frame k, and k is the frame index;
step c3: obtaining the additive noise amplitude spectrum $\hat{\sigma}_v(k,\omega)$ as the square root of the estimated additive noise power spectrum $\hat{\sigma}_v^2(k,\omega)$.
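Steps c1–c3 amount to an SNR-gated recursive average. A minimal Python sketch with illustrative parameter names (`snr_th` standing in for $\mathrm{SNR}_{th}$; the 0.98/0.02 smoothing weights are taken from the update formula above):

```python
import numpy as np

def update_noise_psd(noise_psd_prev, noisy_psd, snr_th=3.0):
    """Sketch of steps c1-c3 (names illustrative): compare the full-band SNR
    with a voiced-segment threshold; keep the old noise estimate in speech
    frames, otherwise smooth it toward the current noisy power spectrum."""
    snr_k = 10.0 * np.log10(noisy_psd.sum() / noise_psd_prev.sum())   # step c1
    if snr_k > snr_th:                                                # step c2
        noise_psd = noise_psd_prev.copy()
    else:
        noise_psd = 0.98 * noise_psd_prev + 0.02 * noisy_psd
    return noise_psd, np.sqrt(noise_psd)   # step c3: magnitude from power

# usage: a low-SNR (noise-only) frame triggers the recursive update
psd, mag = update_noise_psd(np.full(8, 1.0), np.full(8, 1.5))
```

The slow 0.98/0.02 smoothing keeps the estimate stable against brief SNR dips while still tracking slowly varying noise.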
4. The method according to claim 1, wherein step d is preceded by dividing the full frequency band into a plurality of sub-bands and calculating the signal-to-noise ratio in each sub-band, comprising the steps of:
step 1: dividing the full frequency band into a plurality of sub-bands and calculating the sub-band bandwidth $f_d$:
$$f_d = f_s / (2R)$$
wherein $f_s$ is the sampling rate and R is the number of sub-bands;
step 2: calculating the starting frequency bin $b_r$ and the cut-off frequency bin $e_r$ of each sub-band:
$$\begin{cases} b_r = \dfrac{(r-1)\, f_d}{f_s} \cdot \Omega + 1 \\[4pt] e_r = \dfrac{r f_d}{f_s} \cdot \Omega \end{cases}$$
wherein r = 1, 2, ..., R, and Ω represents the length of the fast Fourier transform;
step 3: calculating the signal-to-noise ratio $\mathrm{SNR}_r$ in the r-th sub-band, r = 1, 2, ..., R:
$$\mathrm{SNR}_r = 10 \log_{10}\!\left( \sum_{\omega=b_r}^{e_r} \hat{\alpha}_y^2(k,\omega) \,\Big/ \sum_{\omega=b_r}^{e_r} \hat{\sigma}_v^2(k,\omega) \right)$$
wherein $\log_{10}$ denotes the base-10 logarithm, k is the frame index, ω is the discrete frequency bin, $\hat{\alpha}_y^2(k,\omega)$ is the estimate of the noisy speech power spectrum of the current frame k, and $\hat{\sigma}_v^2(k,\omega)$ is the estimate of the additive noise power spectrum of the current frame k.
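Steps 1–3 of claim 4 can be sketched directly in Python; the 1-based bin indexing follows the claim, while the usage values (R = 4, 256-point FFT, 8 kHz sampling) are illustrative only.

```python
import numpy as np

def subband_edges(fs, R, fft_len):
    """Steps 1-2 (illustrative): uniform sub-bands of width fd = fs/(2R),
    mapped to 1-based FFT bin indices b_r and e_r as in the claim."""
    fd = fs / (2 * R)
    b = [int((r - 1) * fd / fs * fft_len) + 1 for r in range(1, R + 1)]
    e = [int(r * fd / fs * fft_len) for r in range(1, R + 1)]
    return b, e

def subband_snr(noisy_psd, noise_psd, b, e):
    """Step 3: per-band SNR in dB over bins b_r..e_r (converted to 0-based)."""
    return [10 * np.log10(noisy_psd[bb - 1:ee].sum() / noise_psd[bb - 1:ee].sum())
            for bb, ee in zip(b, e)]

# usage: 4 sub-bands of a 256-point FFT at fs = 8 kHz
b, e = subband_edges(fs=8000, R=4, fft_len=256)
snrs = subband_snr(np.ones(128), np.ones(128), b, e)  # equal spectra -> 0 dB
```

With these values each sub-band covers 32 bins of the lower half-spectrum, so $e_R = \Omega/2$ lands exactly on the Nyquist bin.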
5. The method of claim 4, wherein the over-subtraction rule coefficient in step d is calculated as follows:
wherein the resulting coefficients are the over-subtraction rule coefficients of the respective sub-bands.
6. The method of claim 5, wherein the spectral subtraction of the amplitude spectrum in step d is performed by:
performing spectral subtraction on the r-th sub-band to obtain the clean speech amplitude spectrum of the current frame k:
wherein $\hat{\sigma}_v(k,\omega)$ is the additive noise amplitude spectrum of the current frame k, $\hat{\alpha}_y(k,\omega)$ is the noisy speech amplitude spectrum of the current frame k, r = 1, 2, ..., R, and $\delta_r$ is the subtraction gain factor, calculated as follows:
$$\delta_r = \begin{cases} 1, & e_r \le \Omega \cdot 1000 / f_s \\ 1.5, & \Omega \cdot 1000 / f_s < e_r \le \Omega/2 - \Omega \cdot 3000 / f_s \\ 2, & e_r > \Omega/2 - \Omega \cdot 3000 / f_s \end{cases}$$
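The piecewise gain $\delta_r$ of claim 6 is a simple frequency-dependent lookup; a sketch follows, with illustrative usage values Ω = 256 and $f_s$ = 16 kHz (giving band boundaries at bins 16 and 80).

```python
def subtraction_gain(e_r, fft_len, fs):
    """Sketch of the delta_r table: the subtraction gain grows with the band's
    cut-off bin e_r (1 below ~1 kHz, 1.5 in the mid range, 2 near Nyquist)."""
    lo = fft_len * 1000 / fs
    hi = fft_len / 2 - fft_len * 3000 / fs
    if e_r <= lo:
        return 1.0
    if e_r <= hi:
        return 1.5
    return 2.0

# usage with illustrative values fft_len = 256, fs = 16 kHz (lo = 16, hi = 80)
gains = [subtraction_gain(e_r, 256, 16000) for e_r in (10, 50, 100)]
```

Larger gains at high frequencies subtract noise more aggressively where speech energy is typically weaker.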
7. The method of claim 4, wherein the second-order amplitude compensation factor in step e is calculated as follows:
calculating the second-order amplitude compensation factor $\mu_{2,r}$ on each sub-band:
$$\mu_{2,r} = \begin{cases} 0, & \text{if } \mathrm{SNR}_r < \mathrm{SNR}_0 \\ \dfrac{\mathrm{SNR}_r - \mathrm{SNR}_0}{4} \cdot e^{-\frac{(\mathrm{SNR}_r - \mathrm{SNR}_0)^2}{8}}, & \text{otherwise} \end{cases}$$
wherein $\mathrm{SNR}_0$ is the threshold for determining whether voice activity is present.
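The factor in claim 7 can be sketched as a small Python function; the default threshold value here is illustrative, not from the patent.

```python
import math

def second_order_factor(snr_r, snr0=0.0):
    """Sketch of the claim-7 factor: zero below the activity threshold SNR_0,
    otherwise a Gaussian-shaped bump in (SNR_r - SNR_0) that fades again at
    very high SNR, where little compensation is needed."""
    d = snr_r - snr0
    if d < 0:
        return 0.0
    return (d / 4.0) * math.exp(-d * d / 8.0)

# usage: the factor is zero below threshold and peaks at SNR_r - SNR_0 = 2
mu_low = second_order_factor(-3.0)
mu_peak = second_order_factor(2.0)
```

Setting the derivative of $(d/4)\,e^{-d^2/8}$ to zero confirms the maximum at $d = 2$ dB above the threshold.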
8. The method of claim 7, wherein the compensation of the clean speech amplitude spectrum in step e is performed by:
performing amplitude compensation on the r-th sub-band to obtain the enhanced clean speech amplitude spectrum $\tilde{\alpha}_x(\omega)$:
$$\tilde{\alpha}_x(\omega) = \hat{\alpha}_x(\omega) + \mu_{1,r} \cdot \hat{\alpha}_y(\omega) + \mu_{2,r} \cdot \hat{\alpha}_y^2(\omega)$$
wherein r = 1, 2, ..., R; $\mu_{1,r}$ is a predetermined first-order compensation factor, and $\mu_{2,r}$ is the second-order amplitude compensation factor.
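The compensation of claim 8 is a one-line combination of the subtracted spectrum with first- and second-order terms in the noisy magnitude; the sketch below treats the factors as given scalars (the μ values in the usage are illustrative).

```python
def compensate(alpha_x, alpha_y, mu1, mu2):
    """Sketch of the claim-8 compensation: add to the subtracted spectrum a
    first-order term in the noisy magnitude (predetermined mu1) and a
    second-order term weighted by the SNR-dependent mu2."""
    return alpha_x + mu1 * alpha_y + mu2 * alpha_y ** 2

# usage with illustrative scalar magnitudes and factors
out = compensate(alpha_x=1.0, alpha_y=2.0, mu1=0.1, mu2=0.05)
```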
9. The method as claimed in claim 6, wherein said step d further performs half-wave rectification on the clean speech amplitude spectrum $\hat{\alpha}_x(k,\omega)$ of the current frame k obtained by said spectral subtraction:
$$\hat{\alpha}_x(k,\omega) = \begin{cases} \hat{\alpha}_x(k,\omega), & \text{if } \hat{\alpha}_x(k,\omega) > \phi\, \hat{\alpha}_y(k,\omega) \\ \phi \cdot \hat{\alpha}_y(k,\omega), & \text{otherwise} \end{cases}$$
where φ is the maximum attenuation threshold.
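The flooring of claim 9 can be sketched as an element-wise clamp; the value of φ in the usage is illustrative, not from the patent.

```python
import numpy as np

def spectral_floor(alpha_x, alpha_y, phi=0.02):
    """Sketch of the claim-9 flooring (phi illustrative): bins driven below
    phi times the noisy magnitude by subtraction are clamped to that floor,
    which limits musical noise from over-subtraction."""
    floor = phi * alpha_y
    return np.where(alpha_x > floor, alpha_x, floor)

# usage: a negative post-subtraction bin is clamped to 2% of the noisy magnitude
floored = spectral_floor(np.array([0.5, -0.1]), np.array([1.0, 1.0]))
```

Leaving a small noise floor rather than zeroing bins is what distinguishes this rule from plain half-wave rectification.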
10. The method of claim 1, wherein step e performs an inverse fast Fourier transform on the enhanced clean speech amplitude spectrum $\tilde{\alpha}_x(\omega)$ to obtain the time-domain short-time clean speech signal, and overlap-adds the short-time signals at a 75% overlap rate to obtain the enhanced clean speech signal.
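The resynthesis in claim 10 is standard overlap-add; a minimal sketch follows, where a hop of one quarter of the frame length gives the claimed 75% overlap (windowing and the inverse FFT itself are omitted for brevity).

```python
import numpy as np

def overlap_add(frames, hop):
    """Sketch of the claim-10 resynthesis: short-time frames are summed at a
    hop of one quarter of the frame length, i.e. 75% overlap."""
    n, flen = len(frames), len(frames[0])
    out = np.zeros((n - 1) * hop + flen)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + flen] += f
    return out

# usage: four 8-sample frames at hop = 2 (75% overlap)
sig = overlap_add([np.ones(8)] * 4, hop=2)
```

In a full implementation the frames would be the inverse FFTs of $\tilde{\alpha}_x(\omega)$ combined with a phase spectrum, with an analysis/synthesis window satisfying the constant-overlap-add condition.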
CN201210513075.0A 2012-12-04 2012-12-04 Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation Active CN103021420B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210513075.0A CN103021420B (en) 2012-12-04 2012-12-04 Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation

Publications (2)

Publication Number Publication Date
CN103021420A CN103021420A (en) 2013-04-03
CN103021420B true CN103021420B (en) 2015-02-25

Family

ID=47969950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210513075.0A Active CN103021420B (en) 2012-12-04 2012-12-04 Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation

Country Status (1)

Country Link
CN (1) CN103021420B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106128480A (en) * 2016-06-21 2016-11-16 安徽师范大学 A kind of method that noisy speech is carried out voice activity detection

Families Citing this family (18)

Publication number Priority date Publication date Assignee Title
CN103745729B (en) * 2013-12-16 2017-01-04 深圳百科信息技术有限公司 A kind of audio frequency denoising method and system
US9240819B1 (en) * 2014-10-02 2016-01-19 Bose Corporation Self-tuning transfer function for adaptive filtering
CN106328151B (en) * 2015-06-30 2020-01-31 芋头科技(杭州)有限公司 ring noise eliminating system and application method thereof
CN108022595A (en) * 2016-10-28 2018-05-11 电信科学技术研究院 A kind of voice signal noise-reduction method and user terminal
CN107437417B (en) * 2017-08-02 2020-02-14 中国科学院自动化研究所 Voice data enhancement method and device based on recurrent neural network voice recognition
WO2019024008A1 (en) * 2017-08-02 2019-02-07 中国科学院自动化研究所 Voice data enhancing method and device in voice recognition based on recurrent neural network
CN107833579B (en) * 2017-10-30 2021-06-11 广州酷狗计算机科技有限公司 Noise elimination method, device and computer readable storage medium
CN108430074A (en) * 2018-01-24 2018-08-21 深圳市科虹通信有限公司 A kind of measurement method and its system of the interference of LTE system subband
CN108735213B (en) * 2018-05-29 2020-06-16 太原理工大学 Voice enhancement method and system based on phase compensation
CN108831500B (en) * 2018-05-29 2023-04-28 平安科技(深圳)有限公司 Speech enhancement method, device, computer equipment and storage medium
CN109643554B (en) * 2018-11-28 2023-07-21 深圳市汇顶科技股份有限公司 Adaptive voice enhancement method and electronic equipment
CN109319351A (en) * 2018-11-28 2019-02-12 广州市煌子辉贸易有限公司 A kind of intelligent garbage bin with sound identifying function
CN110797041B (en) * 2019-10-21 2023-05-12 珠海市杰理科技股份有限公司 Speech noise reduction processing method and device, computer equipment and storage medium
CN111508514A (en) * 2020-04-10 2020-08-07 江苏科技大学 Single-channel speech enhancement algorithm based on compensation phase spectrum
CN112416120B (en) * 2020-10-13 2023-08-25 深圳供电局有限公司 Intelligent multimedia interaction system based on wearable equipment
CN112951262B (en) * 2021-02-24 2023-03-10 北京小米松果电子有限公司 Audio recording method and device, electronic equipment and storage medium
CN113851151A (en) * 2021-10-26 2021-12-28 北京融讯科创技术有限公司 Masking threshold estimation method, device, electronic equipment and storage medium
CN115602191A (zh) * 2022-12-12 2023-01-13 Hangzhou Zhaohua Electronics Co., Ltd. (CN) Noise elimination method of transformer voiceprint detection system

Citations (2)

Publication number Priority date Publication date Assignee Title
US6564184B1 (en) * 1999-09-07 2003-05-13 Telefonaktiebolaget Lm Ericsson (Publ) Digital filter design method and apparatus
CN101320566A (en) * 2008-06-30 2008-12-10 中国人民解放军第四军医大学 Non-air conduction speech reinforcement method based on multi-band spectrum subtraction


Non-Patent Citations (2)

Title
Multi-band spectral subtraction speech enhancement algorithm based on the auditory masking effect; Mou Haiwei et al.; Journal of Daqing Petroleum Institute; 2009-10-31; Vol. 33, No. 5; pp. 103-106, 126 *
Improved spectral subtraction speech enhancement algorithm based on non-stationary noise estimation; Sun Jinsong et al.; Computer Engineering and Applications; 2010-12-31; Vol. 46, No. 5; pp. 120-122 *


Similar Documents

Publication Publication Date Title
CN103021420B (en) Speech enhancement method of multi-sub-band spectral subtraction based on phase adjustment and amplitude compensation
CN106340292B (en) A kind of sound enhancement method based on continuing noise estimation
EP2031583B1 (en) Fast estimation of spectral noise power density for speech signal enhancement
CA2732723C (en) Apparatus and method for processing an audio signal for speech enhancement using a feature extraction
JP4958303B2 (en) Noise suppression method and apparatus
JP5528538B2 (en) Noise suppressor
CN101976566B (en) Voice enhancement method and device using same
CN111128213B (en) Noise suppression method and system for processing in different frequency bands
US7957964B2 (en) Apparatus and methods for noise suppression in sound signals
KR101737824B1 (en) Method and Apparatus for removing a noise signal from input signal in a noisy environment
JP2004502977A (en) Subband exponential smoothing noise cancellation system
CN110808059A (en) Speech noise reduction method based on spectral subtraction and wavelet transform
Islam et al. Speech enhancement based on a modified spectral subtraction method
US7917359B2 (en) Noise suppressor for removing irregular noise
US11622208B2 (en) Apparatus and method for own voice suppression
US10297272B2 (en) Signal processor
Bahadur et al. Performance measurement of a hybrid speech enhancement technique
Rao et al. Speech enhancement using sub-band cross-correlation compensated Wiener filter combined with harmonic regeneration
JP3849679B2 (en) Noise removal method, noise removal apparatus, and program
Kauppinen et al. Improved noise reduction in audio signals using spectral resolution enhancement with time-domain signal extrapolation
Sunnydayal et al. Speech enhancement using sub-band wiener filter with pitch synchronous analysis
Zhang et al. Fundamental frequency estimation combining air-conducted speech with bone-conducted speech in noisy environment
CN109346106B (en) Cepstrum domain pitch period estimation method based on sub-band signal-to-noise ratio weighting
Yang et al. Environment-Aware Reconfigurable Noise Suppression
Shimamura et al. Noise estimation with an inverse comb filter in non-stationary noise environments

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant