Disclosure of Invention
The invention aims to solve two problems: existing speech enhancement methods suppress non-stationary noise poorly, and deep-learning-based speech enhancement methods cannot meet real-time requirements because of their high computational complexity. The proposed single-channel speech enhancement method based on an attention-gated recurrent neural network effectively suppresses noise, including non-stationary noise, while keeping the computational complexity low enough that the method can be used for real-time single-channel speech enhancement; the method is ingenious, novel in concept and has good application prospects.
To achieve the above purpose, the invention adopts the following technical scheme:
a single-channel speech enhancement method based on an attention-gated recurrent neural network comprises the following steps,
step (A), performing framing and windowing on the noisy single-channel speech, and extracting 38-dimensional signal features, including Bark-frequency cepstral coefficients and their derivative parameters, the discrete cosine transform of the pitch correlation coefficients, the pitch period and a spectral non-stationarity measure;
step (B), constructing a deep recurrent neural network for single-channel speech enhancement;
step (C), constructing a training data set using a clean speech library and a noise library;
step (D), training the deep recurrent neural network constructed in step (B) using the 38-dimensional signal features, the 18-dimensional ideal band gains and the 1-dimensional signal activity flag of the training data;
step (E), inputting the extracted features of the noisy speech into the trained deep recurrent neural network, outputting the band gain estimates of the noisy speech, and smoothing and interpolating them to obtain the interpolation gain;
step (F), applying the interpolation gain to the noisy single-channel speech to obtain the enhanced speech spectrum.
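By way of illustration, the framing and windowing of step (A) might be sketched as follows; the frame length, hop size and Hann window used here are illustrative assumptions, since the disclosure does not fix these values:

```python
import numpy as np

def frame_and_window(x, frame_len=480, hop=240):
    """Split a 1-D signal into overlapping, windowed frames.

    frame_len/hop (20 ms / 10 ms at a 24 kHz sampling rate) and the
    Hann window are assumptions, not specified by the disclosure.
    """
    window = np.hanning(frame_len)          # assumed analysis window
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = x[i * hop : i * hop + frame_len] * window
    return frames

x = np.random.randn(24000)                  # 1 s of signal at an assumed 24 kHz
frames = frame_and_window(x)
print(frames.shape)                         # (99, 480)
```

Each row of `frames` is then the unit from which the 38-dimensional feature vector of step (A) would be computed.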
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, the 38-dimensional signal features extracted in step (A) specifically comprise 18 Bark-frequency cepstral coefficients, the first-order and second-order time derivatives of the first 6 Bark-frequency cepstral coefficients, the discrete cosine transform of the first 6 inter-band pitch correlation coefficients, 1 pitch period coefficient and 1 spectral non-stationarity measure.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, the deep recurrent neural network constructed in step (B) comprises six layers: the first layer is a Dense layer with tanh activation and 24 units; the second to fifth layers are attention-gated LSTM layers with tanh activation and 24, 48, 48 and 96 units, respectively; the sixth layer is a Dense layer with sigmoid activation and 18 units. The output of the second layer additionally passes through a one-layer Dense layer to obtain the 1-dimensional signal activity flag.
In the foregoing single-channel speech enhancement method based on the attention-gated recurrent neural network, the forward propagation process of the deep recurrent neural network is as shown in formula (1) to formula (5):
a_t = σ[V_a tanh(W_a c_{t-1})] (1)
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (2)
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (3)
c_t = a_t ⊙ c_{t-1} + (1 − a_t) ⊙ c̃_t (4)
h_t = o_t ⊙ tanh(c_t) (5)
wherein t is the frame index; a, o, c and h are the attention gate, the output gate, the cell state vector and the hidden vector, respectively; c̃ is the cell candidate state vector, of the same dimension as the others; x is the input vector; V_a and W_a are the parameter matrices for computing the attention gate; W_o and b_o are the weight matrix and bias vector for computing the output gate; W_c and b_c are the weight matrix and bias vector for computing the candidate state vector; σ is the sigmoid function; ⊙ denotes element-wise multiplication.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (C) constructs a training data set using a clean speech library and a noise library; specifically, each sample is passed through a biquad filter to vary the amplitude response of the mixed signal, where the biquad filter H(z) has the form shown in formula (6):
H(z) = (1 + r_1 z^{-1} + r_2 z^{-2}) / (1 + r_3 z^{-1} + r_4 z^{-2}) (6)
wherein r_1, ..., r_4 are random values uniformly distributed in the range [-3/8, 3/8].
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (D) of training the deep recurrent neural network constructed in step (B) comprises the following steps:
(D1) calculating the band gain g_b of band b, as shown in formula (7):
g_b = sqrt(E_s(b) / E_x(b)) (7)
wherein E_s(b) and E_x(b) are the energies of the clean speech and the noisy speech in band b, respectively, and the value of g_b lies in [0, 1];
(D2) taking the extracted 38-dimensional signal features as input of the deep recurrent neural network;
(D3) taking the 18-dimensional ideal frequency band gain and the 1-dimensional signal activity flag as the training target of the recurrent neural network, the loss function L is shown as formula (8):
L = L_g + α·L_vad (8)
wherein L_g is the loss function corresponding to the band gain estimates, L_vad is the loss function corresponding to the VAD estimate, and α is a weighting factor. The loss function L_g corresponding to the band gain estimates is shown in formula (9), wherein ĝ_b is the band gain estimate and L_bin is a cross-entropy loss function;
(D4) during training, after each batch, all parameters are clipped element-wise to the range [-0.5, 0.5].
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (E) smooths and interpolates the band gain estimates output by the network to obtain the interpolation gain; the specific process is as follows:
the smoothed band gain g̃_b is computed as shown in formula (10), wherein the smoothed gain of the previous frame and the attenuation factor λ are used; the interpolation gain r(k) of each frequency bin k is then given by formula (11):
r(k) = Σ_b w_b(k) g̃_b (11)
wherein w_b(k) is the amplitude of band b at frequency bin k.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (F) applies the interpolation gain to the noisy single-channel speech to obtain the enhanced speech spectrum, as shown in formula (12), wherein α_b is the filter coefficient, P(k) is the spectrum of the pitch-delayed signal x(n-T), and X(k) is the spectrum of the noisy single-channel speech.
The invention has the following beneficial effects: the single-channel speech enhancement method based on the attention-gated recurrent neural network uses attention within the conventional LSTM model to focus the units on the information in the current input context that is useful for the output, thereby improving the learning capability of the network. The band gains are estimated from the noisy features by a deep recurrent neural network without any statistical assumptions, and the generalization capability of the network can be improved by including diverse noise conditions in the training set. In addition, the recurrent neural network only needs to output 18 band gain estimates and 1 VAD estimate, each between 0 and 1, which greatly reduces the computational complexity. The single-channel speech enhancement method can effectively suppress noise, including non-stationary noise, avoids the musical noise commonly introduced by noise suppression through its band-based processing, and keeps the computational complexity low enough that the method can be used for real-time single-channel speech enhancement; the method is ingenious, novel in concept and has good application prospects.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the single-channel speech enhancement method based on the attention-gated recurrent neural network of the present invention comprises the following steps.
Step (A), performing framing and windowing on the noisy single-channel speech, and extracting 38-dimensional signal features, including Bark-frequency cepstral coefficients and their derivative parameters, the discrete cosine transform of the pitch correlation coefficients, the pitch period and a spectral non-stationarity measure, specifically comprising 18 Bark-frequency cepstral coefficients, the first-order and second-order time derivatives of the first 6 Bark-frequency cepstral coefficients, the discrete cosine transform of the first 6 inter-band pitch correlation coefficients, 1 pitch period coefficient and 1 spectral non-stationarity measure;
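The Bark-frequency cepstral coefficients above can be sketched as a discrete cosine transform of the log Bark-band energies; the DCT convention and the explicit basis below are assumptions, since the disclosure does not give the exact formula:

```python
import numpy as np

def bark_cepstrum(band_energies, n_coeffs=18, eps=1e-9):
    """BFCCs as a DCT-II of the log Bark-band energies.

    The DCT-II convention is an assumption; the disclosure only
    specifies that 18 coefficients are produced.
    """
    log_e = np.log10(band_energies + eps)
    n = len(log_e)
    k = np.arange(n)
    # DCT-II basis, computed explicitly to keep the sketch dependency-free
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * k + 1)) / (2 * n))
    return basis @ log_e

energies = np.abs(np.random.default_rng(2).standard_normal(18)) + 0.1
bfcc = bark_cepstrum(energies)
print(bfcc.shape)   # (18,)
```

The first- and second-order time derivatives of the first 6 coefficients would then be obtained by differencing `bfcc` across consecutive frames.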
Step (B), constructing a deep recurrent neural network for single-channel speech enhancement, where the deep recurrent neural network comprises six layers, as shown in FIG. 2: the first layer is a Dense layer with tanh activation and 24 units; the second to fifth layers are attention-gated LSTM layers with tanh activation and 24, 48, 48 and 96 units, respectively; the sixth layer is a Dense layer with sigmoid activation and 18 units. The output of the second layer additionally passes through a one-layer Dense layer (sigmoid activation, 1 unit) to obtain the 1-dimensional signal activity flag. The forward propagation process of the deep recurrent neural network is shown in formulas (1) to (5):
a_t = σ[V_a tanh(W_a c_{t-1})] (1)
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (2)
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (3)
c_t = a_t ⊙ c_{t-1} + (1 − a_t) ⊙ c̃_t (4)
h_t = o_t ⊙ tanh(c_t) (5)
wherein t is the frame index; a, o, c and h are the attention gate, the output gate, the cell state vector and the hidden vector, respectively; c̃ is the cell candidate state vector, of the same dimension as the others; x is the input vector; V_a and W_a are the parameter matrices for computing the attention gate; W_o and b_o are the weight matrix and bias vector for computing the output gate; W_c and b_c are the weight matrix and bias vector for computing the candidate state vector; σ is the sigmoid function; ⊙ denotes element-wise multiplication;
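One step of the attention-gated cell can be sketched in NumPy. Formulas (1) and (2) follow the text directly; the candidate, state-update and output equations are assumed here to take the usual coupled-gate LSTM form, with the attention gate standing in for the input and forget gates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_lstm_step(x_t, h_prev, c_prev, params):
    """One step of the attention-gated LSTM cell.

    (1)-(2) are as stated in the text; (3)-(5) are assumed standard
    coupled-gate forms, since those formulas are not reproduced.
    """
    Va, Wa, Wo, bo, Wc, bc = (params[k] for k in ("Va", "Wa", "Wo", "bo", "Wc", "bc"))
    hx = np.concatenate([h_prev, x_t])
    a_t = sigmoid(Va @ np.tanh(Wa @ c_prev))       # (1) attention gate
    o_t = sigmoid(Wo @ hx + bo)                    # (2) output gate
    c_tilde = np.tanh(Wc @ hx + bc)                # (3) candidate state (assumed)
    c_t = a_t * c_prev + (1.0 - a_t) * c_tilde     # (4) state update (assumed)
    h_t = o_t * np.tanh(c_t)                       # (5) hidden output (assumed)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_units = 24, 24                             # sizes of the first attention-gated layer
params = {
    "Va": rng.standard_normal((n_units, n_units)) * 0.1,
    "Wa": rng.standard_normal((n_units, n_units)) * 0.1,
    "Wo": rng.standard_normal((n_units, n_units + n_in)) * 0.1,
    "bo": np.zeros(n_units),
    "Wc": rng.standard_normal((n_units, n_units + n_in)) * 0.1,
    "bc": np.zeros(n_units),
}
h, c = np.zeros(n_units), np.zeros(n_units)
h, c = attention_lstm_step(rng.standard_normal(n_in), h, c, params)
print(h.shape, c.shape)                            # (24,) (24,)
```

Because the attention gate is computed only from the previous cell state, it needs fewer parameters than the separate input and forget gates of a conventional LSTM, which is consistent with the low-complexity aim of the method.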
Step (C), constructing a training data set using a clean speech library and a noise library; specifically, each sample is passed through a biquad filter to vary the amplitude response of the mixed signal, where the biquad filter H(z) has the form shown in formula (6):
H(z) = (1 + r_1 z^{-1} + r_2 z^{-2}) / (1 + r_3 z^{-1} + r_4 z^{-2}) (6)
wherein r_1, ..., r_4 are random values uniformly distributed in the range [-3/8, 3/8], and z is the z-transform variable;
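A minimal sketch of this augmentation step, assuming the standard second-order direct form with r_1, r_2 as numerator coefficients and r_3, r_4 as denominator coefficients (the disclosure states only that the four values are uniform in [-3/8, 3/8]):

```python
import numpy as np

def random_biquad(x, rng):
    """Filter a signal with a randomly drawn biquad.

    The mapping of r1..r4 onto numerator/denominator coefficients is
    an assumption; with |r3|, |r4| <= 3/8 the filter is always stable.
    """
    r1, r2, r3, r4 = rng.uniform(-3 / 8, 3 / 8, size=4)
    b = [1.0, r1, r2]      # numerator: 1 + r1 z^-1 + r2 z^-2 (assumed)
    a = [1.0, r3, r4]      # denominator: 1 + r3 z^-1 + r4 z^-2 (assumed)
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = (b[0] * x[n]
                + (b[1] * x[n - 1] if n >= 1 else 0.0)
                + (b[2] * x[n - 2] if n >= 2 else 0.0)
                - (a[1] * y[n - 1] if n >= 1 else 0.0)
                - (a[2] * y[n - 2] if n >= 2 else 0.0))
    return y

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
y = random_biquad(x, rng)
print(y.shape)   # (1000,)
```

Drawing a fresh filter per sample varies the spectral tilt of the training mixtures, which helps the network generalize across recording conditions.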
and (D) training the deep circulation neural network constructed in the step (B) by utilizing the 38-dimensional signal characteristics, the 18-dimensional ideal frequency band gains (namely the gains of 18 Bark frequency bands) and the 1-dimensional signal activity mark of the training data, and comprising the following steps.
(D1) calculating the band gain g_b of band b, as shown in formula (7):
g_b = sqrt(E_s(b) / E_x(b)) (7)
wherein E_s(b) and E_x(b) are the energies of the clean speech and the noisy speech in band b, respectively, and the value of g_b lies in [0, 1];
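A sketch of the ideal band gain computation; the square-root-of-energy-ratio form is assumed from the stated properties (a per-band energy ratio whose value lies in [0, 1]):

```python
import numpy as np

def ideal_band_gains(E_s, E_x, eps=1e-12):
    """Per-band ideal gain g_b = sqrt(E_s(b) / E_x(b)), clipped to [0, 1].

    E_s, E_x: per-band energies of clean and noisy speech (18 Bark
    bands in the disclosure). The sqrt form is an assumption.
    """
    g = np.sqrt(E_s / (E_x + eps))
    return np.clip(g, 0.0, 1.0)

E_s = np.array([0.5, 1.0, 0.0, 2.0])   # clean energy per band (toy values)
E_x = np.array([1.0, 1.0, 1.0, 2.0])   # noisy energy per band
print(ideal_band_gains(E_s, E_x))
```

The clip keeps the target inside [0, 1] even when, on real data, correlated noise makes the clean energy slightly exceed the noisy energy in a band.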
(D2) taking the extracted 38-dimensional signal features as input of the deep recurrent neural network;
(D3) taking the 18-dimensional ideal frequency band gain and the 1-dimensional signal activity flag as the training target of the recurrent neural network, the loss function L is shown as formula (8):
L = L_g + α·L_vad (8)
wherein L_g is the loss function corresponding to the band gain estimates, L_vad is the loss function corresponding to the VAD estimate, and α is a weighting factor. The loss function L_g corresponding to the band gain estimates is shown in formula (9), wherein ĝ_b is the band gain estimate and L_bin is a cross-entropy loss function;
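Since formula (9) is not reproduced in the text, the combined loss of formula (8) can only be sketched under assumptions: here L_g is taken as a mean-squared error over the 18 band gains and L_vad as the binary cross-entropy L_bin mentioned in the text, with an illustrative α:

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Binary cross-entropy, standing in for the L_bin of the text."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(g_true, g_pred, vad_true, vad_pred, alpha=0.5):
    """L = L_g + alpha * L_vad, formula (8).

    L_g as MSE on the band gains is an assumption (formula (9) is
    not reproduced); alpha = 0.5 is illustrative only.
    """
    L_g = np.mean((g_true - g_pred) ** 2)
    L_vad = bce(vad_true, vad_pred)
    return L_g + alpha * L_vad

g_true = np.full(18, 0.8)
g_pred = np.full(18, 0.7)
print(round(total_loss(g_true, g_pred, np.array([1.0]), np.array([0.9])), 4))
```

In practice the VAD term acts as an auxiliary supervision signal on the second layer's output, which is known to stabilize training of the shared lower layers.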
(D4) during training, after each batch, all parameters are clipped element-wise to the range [-0.5, 0.5].
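The per-batch parameter clipping of (D4) amounts to an element-wise clamp over every weight tensor; a minimal sketch (the dictionary-of-arrays parameter layout is an assumption):

```python
import numpy as np

def clip_parameters(params, bound=0.5):
    """Clip every parameter tensor to [-bound, bound] after each batch,
    as in step (D4). Keeping weights small this way also eases later
    fixed-point deployment (an assumed, not stated, motivation)."""
    return {name: np.clip(w, -bound, bound) for name, w in params.items()}

params = {"W": np.array([[-0.9, 0.2], [0.6, -0.1]])}
clipped = clip_parameters(params)
print(clipped["W"])   # [[-0.5  0.2] [ 0.5 -0.1]]
```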
Step (E), inputting the extracted noisy-speech features into the trained deep recurrent neural network, outputting the band gain estimates of the noisy speech, and smoothing and interpolating them to obtain the interpolation gain; the specific process is as follows.
The smoothed band gain g̃_b is computed as shown in formula (10), wherein the smoothed gain of the previous frame and the attenuation factor λ are used; the interpolation gain r(k) of each frequency bin k is then given by formula (11):
r(k) = Σ_b w_b(k) g̃_b (11)
wherein w_b(k) is the amplitude of band b at frequency bin k.
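Step (E) can be sketched as follows. Because formula (10) is not reproduced, a common decay-limited smoothing rule is assumed; for formula (11), triangular weights w_b(k) between neighbouring band centres are assumed, which reduces the weighted sum to piecewise-linear interpolation:

```python
import numpy as np

def smooth_gains(g_hat, g_smooth_prev, lam=0.6):
    """Assumed form of formula (10): the gain may rise freely but
    decays no faster than the attenuation factor lam allows."""
    return np.maximum(lam * g_smooth_prev, g_hat)

def interpolate_gains(g_smooth, band_centres, n_bins):
    """Formula (11): r(k) = sum_b w_b(k) * g~_b, with triangular
    weights between band centres (assumed), i.e. linear interpolation."""
    bins = np.arange(n_bins, dtype=float)
    return np.interp(bins, np.asarray(band_centres, dtype=float), g_smooth)

g_hat = np.array([0.2, 0.9, 0.5])          # raw network outputs (toy, 3 bands)
g_s = smooth_gains(g_hat, np.ones(3))      # previous smoothed gains all 1.0
r = interpolate_gains(g_s, band_centres=[0, 4, 8], n_bins=9)
print(g_s, r.shape)
```

Interpolating the 18 coarse Bark-band gains to per-bin gains in this way is what avoids the abrupt per-bin gain changes that cause musical noise.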
Step (F), applying the interpolation gain to the noisy single-channel speech to obtain the enhanced speech spectrum, as shown in formula (12), wherein α_b is the filter coefficient, P(k) is the spectrum of the pitch-delayed signal x(n-T), and X(k) is the spectrum of the noisy single-channel speech.
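Formula (12) itself is not reproduced in the text; based on the symbols it defines, one plausible sketch combines the noisy spectrum with the pitch-delayed spectrum and then applies the interpolated gain. This combination is an assumption, not the exact claimed formula:

```python
import numpy as np

def enhance_spectrum(X, P, r, alpha):
    """Apply pitch-based filtering, then the interpolated gain.

    X: spectrum of the noisy frame; P: spectrum of the pitch-delayed
    signal x(n - T); r: per-bin interpolation gain; alpha: per-bin
    filter coefficient (expanded from the per-band alpha_b). The form
    r * (X + alpha * P) is an assumed reading of formula (12).
    """
    return r * (X + alpha * P)

n_bins = 8
X = np.ones(n_bins, dtype=complex)          # toy noisy spectrum
P = 0.5 * np.ones(n_bins, dtype=complex)    # toy pitch-delayed spectrum
r = np.full(n_bins, 0.8)
S_hat = enhance_spectrum(X, P, r, alpha=0.2)
print(S_hat[0])   # (0.88+0j)
```

Adding a scaled pitch-delayed component reinforces the harmonic structure of voiced speech before the gain suppresses the inter-harmonic noise.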
Under different noisy speech conditions, the enhancement performance of the algorithm is shown in Table 1. The clean speech used to construct the test set comes from the read passages on the companion disc of the standard Putonghua proficiency test; 4 passages were selected and cut into 15 s segments, yielding 45 clean samples, which were mixed with recorded noise at 6 signal-to-noise ratios (-5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB) to generate 270 noisy samples. The metrics used to measure enhancement performance include PESQ (Perceptual Evaluation of Speech Quality), fwsegSNR (frequency-weighted segmental SNR) and STOI (Short-Time Objective Intelligibility). The results in Table 1 show that the single-channel speech enhancement method of the present invention significantly improves PESQ and fwsegSNR at all signal-to-noise ratios and also improves STOI to some extent; on average, PESQ, fwsegSNR and STOI are improved by 0.51, 2.29 dB and 0.018, respectively, demonstrating strong speech enhancement performance.
Table 1 enhancement performance test results for the single channel speech enhancement method of the present invention
In summary, the single-channel speech enhancement method based on the attention-gated recurrent neural network of the present invention uses attention within the conventional LSTM model to focus the units on the information in the current input context that is useful for the output, thereby improving the learning capability of the network. The band gains are estimated from the noisy features by a deep recurrent neural network without any statistical assumptions, and the generalization capability of the network can be improved by including diverse noise conditions in the training set. In addition, the recurrent neural network only needs to output 18 band gain estimates and 1 VAD estimate, each between 0 and 1, which greatly reduces the computational complexity. The single-channel speech enhancement method can effectively suppress noise, including non-stationary noise, avoids the musical noise commonly introduced by noise suppression through its band-based processing, and keeps the computational complexity low enough that the method can be used for real-time single-channel speech enhancement; the method is ingenious, novel in concept and has good application prospects.
The foregoing illustrates and describes the principles, general features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.