Disclosure of Invention
The invention aims to solve two problems: existing speech enhancement methods suppress non-stationary noise poorly, and deep-learning-based speech enhancement methods cannot meet real-time requirements because of their high computational complexity. The proposed single-channel speech enhancement method based on an attention-gated recurrent neural network effectively suppresses noise, including non-stationary noise, while keeping the computational complexity low enough that the method can be used for real-time single-channel speech enhancement; the method is ingenious, novel in concept and has good application prospects.
To achieve the above purpose, the invention adopts the following technical scheme:
a single-channel speech enhancement method based on an attention-gated recurrent neural network comprises the following steps,
step (A), performing framing and windowing on the noisy single-channel speech, and extracting 38-dimensional signal features, including Bark-frequency cepstral coefficients and their derivative parameters, the discrete cosine transform of the pitch correlation coefficients, the pitch period and a spectral non-stationarity measure;
step (B), constructing a deep recurrent neural network for single-channel speech enhancement;
step (C), constructing a training data set using a clean speech library and a noise library;
step (D), training the deep recurrent neural network constructed in step (B) using the 38-dimensional signal features, the 18-dimensional ideal band gains and the 1-dimensional signal activity flag of the training data;
step (E), inputting the extracted features of the noisy speech into the trained deep recurrent neural network, outputting the band gain estimates of the noisy speech, and smoothing and interpolating them to obtain the interpolation gain;
step (F), applying the interpolation gain to the noisy single-channel speech to obtain the enhanced speech spectrum.
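By way of illustration, the framing and windowing of step (A) might be sketched as follows; the frame length, hop size and Hann window used here are illustrative assumptions, since the disclosure does not fix these values:

```python
import numpy as np

def frame_and_window(x, frame_len=480, hop=240):
    """Split a 1-D signal into overlapping, windowed frames.

    frame_len/hop (20 ms / 10 ms at a 24 kHz sampling rate) and the
    Hann window are assumptions, not specified by the disclosure.
    """
    window = np.hanning(frame_len)          # assumed analysis window
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        frames[i] = x[i * hop : i * hop + frame_len] * window
    return frames

x = np.random.randn(24000)                  # 1 s of signal at an assumed 24 kHz
frames = frame_and_window(x)
print(frames.shape)                         # (99, 480)
```

Each row of `frames` is then the unit from which the 38-dimensional feature vector of step (A) would be computed.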
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, the 38-dimensional signal features extracted in step (A) specifically comprise 18 Bark-frequency cepstral coefficients, the first-order and second-order time derivatives of the first 6 Bark-frequency cepstral coefficients, the discrete cosine transform of the first 6 inter-band pitch correlation coefficients, 1 pitch period coefficient and 1 spectral non-stationarity measure.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, the deep recurrent neural network constructed in step (B) comprises six layers: the first layer is a Dense layer with tanh activation and 24 units; the second to fifth layers are attention-gated LSTM layers with tanh activation and 24, 48, 48 and 96 units, respectively; the sixth layer is a Dense layer with sigmoid activation and 18 units. The output of the second layer additionally passes through a one-layer Dense layer to obtain the 1-dimensional signal activity flag.
In the foregoing single-channel speech enhancement method based on the attention-gated recurrent neural network, the forward propagation process of the deep recurrent neural network is as shown in formula (1) to formula (5):
a_t = σ[V_a tanh(W_a c_{t-1})] (1)
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (2)
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (3)
c_t = a_t ⊙ c_{t-1} + (1 − a_t) ⊙ c̃_t (4)
h_t = o_t ⊙ tanh(c_t) (5)
wherein t is the frame index; a, o, c and h are the attention gate, the output gate, the cell state vector and the hidden vector, respectively; c̃ is the cell candidate state vector, of the same dimension as the others; x is the input vector; V_a and W_a are the parameter matrices for computing the attention gate; W_o and b_o are the weight matrix and bias vector for computing the output gate; W_c and b_c are the weight matrix and bias vector for computing the candidate state vector; σ is the sigmoid function; ⊙ denotes element-wise multiplication.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (C) constructs a training data set using a clean speech library and a noise library; specifically, each sample is passed through a biquad filter to vary the amplitude response of the mixed signal, where the biquad filter H(z) has the form shown in formula (6):
H(z) = (1 + r_1 z^{-1} + r_2 z^{-2}) / (1 + r_3 z^{-1} + r_4 z^{-2}) (6)
wherein r_1, ..., r_4 are random values uniformly distributed in the range [-3/8, 3/8].
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (D) of training the deep recurrent neural network constructed in step (B) comprises the following steps:
(D1) calculating the band gain g_b of band b, as shown in formula (7):
g_b = sqrt(E_s(b) / E_x(b)) (7)
wherein E_s(b) and E_x(b) are the energies of the clean speech and the noisy speech in band b, respectively, and the value of g_b lies in [0, 1];
(D2) taking the extracted 38-dimensional signal features as input of the deep recurrent neural network;
(D3) taking the 18-dimensional ideal frequency band gain and the 1-dimensional signal activity flag as the training target of the recurrent neural network, the loss function L is shown as formula (8):
L = L_g + α·L_vad (8)
wherein L_g is the loss function corresponding to the band gain estimates, L_vad is the loss function corresponding to the VAD estimate, and α is a weighting factor. The loss function L_g corresponding to the band gain estimates is shown in formula (9), wherein ĝ_b is the band gain estimate and L_bin is a cross-entropy loss function;
(D4) during training, after each batch, all parameters are clipped element-wise to the range [-0.5, 0.5].
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (E) smooths and interpolates the band gain estimates output by the network to obtain the interpolation gain; the specific process is as follows:
the smoothed band gain g̃_b is computed as shown in formula (10), wherein the smoothed gain of the previous frame and the attenuation factor λ are used; the interpolation gain r(k) of each frequency bin k is then given by formula (11):
r(k) = Σ_b w_b(k) g̃_b (11)
wherein w_b(k) is the amplitude of band b at frequency bin k.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (F) applies the interpolation gain to the noisy single-channel speech to obtain the enhanced speech spectrum, as shown in formula (12), wherein α_b is the filter coefficient, P(k) is the spectrum of the pitch-delayed signal x(n-T), and X(k) is the spectrum of the noisy single-channel speech.
The invention has the following beneficial effects: the single-channel speech enhancement method based on the attention-gated recurrent neural network uses attention within the conventional LSTM model to focus the units on the information in the current input context that is useful for the output, thereby improving the learning capability of the network. The band gains are estimated from the noisy features by a deep recurrent neural network without any statistical assumptions, and the generalization capability of the network can be improved by including diverse noise conditions in the training set. In addition, the recurrent neural network only needs to output 18 band gain estimates and 1 VAD estimate, each between 0 and 1, which greatly reduces the computational complexity. The single-channel speech enhancement method can effectively suppress noise, including non-stationary noise, avoids the musical noise commonly introduced by noise suppression through its band-based processing, and keeps the computational complexity low enough that the method can be used for real-time single-channel speech enhancement; the method is ingenious, novel in concept and has good application prospects.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the single-channel speech enhancement method based on the attention-gated recurrent neural network of the present invention comprises the following steps.
Step (A), performing framing and windowing on the noisy single-channel speech, and extracting 38-dimensional signal features, including Bark-frequency cepstral coefficients and their derivative parameters, the discrete cosine transform of the pitch correlation coefficients, the pitch period and a spectral non-stationarity measure, specifically comprising 18 Bark-frequency cepstral coefficients, the first-order and second-order time derivatives of the first 6 Bark-frequency cepstral coefficients, the discrete cosine transform of the first 6 inter-band pitch correlation coefficients, 1 pitch period coefficient and 1 spectral non-stationarity measure;
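The Bark-frequency cepstral coefficients above can be sketched as a discrete cosine transform of the log Bark-band energies; the DCT convention and the explicit basis below are assumptions, since the disclosure does not give the exact formula:

```python
import numpy as np

def bark_cepstrum(band_energies, n_coeffs=18, eps=1e-9):
    """BFCCs as a DCT-II of the log Bark-band energies.

    The DCT-II convention is an assumption; the disclosure only
    specifies that 18 coefficients are produced.
    """
    log_e = np.log10(band_energies + eps)
    n = len(log_e)
    k = np.arange(n)
    # DCT-II basis, computed explicitly to keep the sketch dependency-free
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * k + 1)) / (2 * n))
    return basis @ log_e

energies = np.abs(np.random.default_rng(2).standard_normal(18)) + 0.1
bfcc = bark_cepstrum(energies)
print(bfcc.shape)   # (18,)
```

The first- and second-order time derivatives of the first 6 coefficients would then be obtained by differencing `bfcc` across consecutive frames.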
Step (B), constructing a deep recurrent neural network for single-channel speech enhancement, where the deep recurrent neural network comprises six layers, as shown in FIG. 2: the first layer is a Dense layer with tanh activation and 24 units; the second to fifth layers are attention-gated LSTM layers with tanh activation and 24, 48, 48 and 96 units, respectively; the sixth layer is a Dense layer with sigmoid activation and 18 units. The output of the second layer additionally passes through a one-layer Dense layer (sigmoid activation, 1 unit) to obtain the 1-dimensional signal activity flag. The forward propagation process of the deep recurrent neural network is shown in formulas (1) to (5):
a_t = σ[V_a tanh(W_a c_{t-1})] (1)
o_t = σ(W_o·[h_{t-1}, x_t] + b_o) (2)
c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) (3)
c_t = a_t ⊙ c_{t-1} + (1 − a_t) ⊙ c̃_t (4)
h_t = o_t ⊙ tanh(c_t) (5)
wherein t is the frame index; a, o, c and h are the attention gate, the output gate, the cell state vector and the hidden vector, respectively; c̃ is the cell candidate state vector, of the same dimension as the others; x is the input vector; V_a and W_a are the parameter matrices for computing the attention gate; W_o and b_o are the weight matrix and bias vector for computing the output gate; W_c and b_c are the weight matrix and bias vector for computing the candidate state vector; σ is the sigmoid function; ⊙ denotes element-wise multiplication;
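One step of the attention-gated cell can be sketched in NumPy. Formulas (1) and (2) follow the text directly; the candidate, state-update and output equations are assumed here to take the usual coupled-gate LSTM form, with the attention gate standing in for the input and forget gates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_lstm_step(x_t, h_prev, c_prev, params):
    """One step of the attention-gated LSTM cell.

    (1)-(2) are as stated in the text; (3)-(5) are assumed standard
    coupled-gate forms, since those formulas are not reproduced.
    """
    Va, Wa, Wo, bo, Wc, bc = (params[k] for k in ("Va", "Wa", "Wo", "bo", "Wc", "bc"))
    hx = np.concatenate([h_prev, x_t])
    a_t = sigmoid(Va @ np.tanh(Wa @ c_prev))       # (1) attention gate
    o_t = sigmoid(Wo @ hx + bo)                    # (2) output gate
    c_tilde = np.tanh(Wc @ hx + bc)                # (3) candidate state (assumed)
    c_t = a_t * c_prev + (1.0 - a_t) * c_tilde     # (4) state update (assumed)
    h_t = o_t * np.tanh(c_t)                       # (5) hidden output (assumed)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_units = 24, 24                             # sizes of the first attention-gated layer
params = {
    "Va": rng.standard_normal((n_units, n_units)) * 0.1,
    "Wa": rng.standard_normal((n_units, n_units)) * 0.1,
    "Wo": rng.standard_normal((n_units, n_units + n_in)) * 0.1,
    "bo": np.zeros(n_units),
    "Wc": rng.standard_normal((n_units, n_units + n_in)) * 0.1,
    "bc": np.zeros(n_units),
}
h, c = np.zeros(n_units), np.zeros(n_units)
h, c = attention_lstm_step(rng.standard_normal(n_in), h, c, params)
print(h.shape, c.shape)                            # (24,) (24,)
```

Because the attention gate is computed only from the previous cell state, it needs fewer parameters than the separate input and forget gates of a conventional LSTM, which is consistent with the low-complexity aim of the method.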
Step (C), constructing a training data set using a clean speech library and a noise library; specifically, each sample is passed through a biquad filter to vary the amplitude response of the mixed signal, where the biquad filter H(z) has the form shown in formula (6):
H(z) = (1 + r_1 z^{-1} + r_2 z^{-2}) / (1 + r_3 z^{-1} + r_4 z^{-2}) (6)
wherein r_1, ..., r_4 are random values uniformly distributed in the range [-3/8, 3/8], and z is the z-transform variable;
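A minimal sketch of this augmentation step, assuming the standard second-order direct form with r_1, r_2 as numerator coefficients and r_3, r_4 as denominator coefficients (the disclosure states only that the four values are uniform in [-3/8, 3/8]):

```python
import numpy as np

def random_biquad(x, rng):
    """Filter a signal with a randomly drawn biquad.

    The mapping of r1..r4 onto numerator/denominator coefficients is
    an assumption; with |r3|, |r4| <= 3/8 the filter is always stable.
    """
    r1, r2, r3, r4 = rng.uniform(-3 / 8, 3 / 8, size=4)
    b = [1.0, r1, r2]      # numerator: 1 + r1 z^-1 + r2 z^-2 (assumed)
    a = [1.0, r3, r4]      # denominator: 1 + r3 z^-1 + r4 z^-2 (assumed)
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = (b[0] * x[n]
                + (b[1] * x[n - 1] if n >= 1 else 0.0)
                + (b[2] * x[n - 2] if n >= 2 else 0.0)
                - (a[1] * y[n - 1] if n >= 1 else 0.0)
                - (a[2] * y[n - 2] if n >= 2 else 0.0))
    return y

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)
y = random_biquad(x, rng)
print(y.shape)   # (1000,)
```

Drawing a fresh filter per sample varies the spectral tilt of the training mixtures, which helps the network generalize across recording conditions.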
and (D) training the deep circulation neural network constructed in the step (B) by utilizing the 38-dimensional signal characteristics, the 18-dimensional ideal frequency band gains (namely the gains of 18 Bark frequency bands) and the 1-dimensional signal activity mark of the training data, and comprising the following steps.
(D1) calculating the band gain g_b of band b, as shown in formula (7):
g_b = sqrt(E_s(b) / E_x(b)) (7)
wherein E_s(b) and E_x(b) are the energies of the clean speech and the noisy speech in band b, respectively, and the value of g_b lies in [0, 1];
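A sketch of the ideal band gain computation; the square-root-of-energy-ratio form is assumed from the stated properties (a per-band energy ratio whose value lies in [0, 1]):

```python
import numpy as np

def ideal_band_gains(E_s, E_x, eps=1e-12):
    """Per-band ideal gain g_b = sqrt(E_s(b) / E_x(b)), clipped to [0, 1].

    E_s, E_x: per-band energies of clean and noisy speech (18 Bark
    bands in the disclosure). The sqrt form is an assumption.
    """
    g = np.sqrt(E_s / (E_x + eps))
    return np.clip(g, 0.0, 1.0)

E_s = np.array([0.5, 1.0, 0.0, 2.0])   # clean energy per band (toy values)
E_x = np.array([1.0, 1.0, 1.0, 2.0])   # noisy energy per band
print(ideal_band_gains(E_s, E_x))
```

The clip keeps the target inside [0, 1] even when, on real data, correlated noise makes the clean energy slightly exceed the noisy energy in a band.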
(D2) taking the extracted 38-dimensional signal features as input of the deep recurrent neural network;
(D3) taking the 18-dimensional ideal frequency band gain and the 1-dimensional signal activity flag as the training target of the recurrent neural network, the loss function L is shown as formula (8):
L = L_g + α·L_vad (8)
wherein L_g is the loss function corresponding to the band gain estimates, L_vad is the loss function corresponding to the VAD estimate, and α is a weighting factor. The loss function L_g corresponding to the band gain estimates is shown in formula (9), wherein ĝ_b is the band gain estimate and L_bin is a cross-entropy loss function;
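Since formula (9) is not reproduced in the text, the combined loss of formula (8) can only be sketched under assumptions: here L_g is taken as a mean-squared error over the 18 band gains and L_vad as the binary cross-entropy L_bin mentioned in the text, with an illustrative α:

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Binary cross-entropy, standing in for the L_bin of the text."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(g_true, g_pred, vad_true, vad_pred, alpha=0.5):
    """L = L_g + alpha * L_vad, formula (8).

    L_g as MSE on the band gains is an assumption (formula (9) is
    not reproduced); alpha = 0.5 is illustrative only.
    """
    L_g = np.mean((g_true - g_pred) ** 2)
    L_vad = bce(vad_true, vad_pred)
    return L_g + alpha * L_vad

g_true = np.full(18, 0.8)
g_pred = np.full(18, 0.7)
print(round(total_loss(g_true, g_pred, np.array([1.0]), np.array([0.9])), 4))
```

In practice the VAD term acts as an auxiliary supervision signal on the second layer's output, which is known to stabilize training of the shared lower layers.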
(D4) during training, after each batch, all parameters are clipped element-wise to the range [-0.5, 0.5].
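The per-batch parameter clipping of (D4) amounts to an element-wise clamp over every weight tensor; a minimal sketch (the dictionary-of-arrays parameter layout is an assumption):

```python
import numpy as np

def clip_parameters(params, bound=0.5):
    """Clip every parameter tensor to [-bound, bound] after each batch,
    as in step (D4). Keeping weights small this way also eases later
    fixed-point deployment (an assumed, not stated, motivation)."""
    return {name: np.clip(w, -bound, bound) for name, w in params.items()}

params = {"W": np.array([[-0.9, 0.2], [0.6, -0.1]])}
clipped = clip_parameters(params)
print(clipped["W"])   # [[-0.5  0.2] [ 0.5 -0.1]]
```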
Step (E), inputting the extracted noisy-speech features into the trained deep recurrent neural network, outputting the band gain estimates of the noisy speech, and smoothing and interpolating them to obtain the interpolation gain; the specific process is as follows.
The smoothed band gain g̃_b is computed as shown in formula (10), wherein the smoothed gain of the previous frame and the attenuation factor λ are used; the interpolation gain r(k) of each frequency bin k is then given by formula (11):
r(k) = Σ_b w_b(k) g̃_b (11)
wherein w_b(k) is the amplitude of band b at frequency bin k.
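Step (E) can be sketched as follows. Because formula (10) is not reproduced, a common decay-limited smoothing rule is assumed; for formula (11), triangular weights w_b(k) between neighbouring band centres are assumed, which reduces the weighted sum to piecewise-linear interpolation:

```python
import numpy as np

def smooth_gains(g_hat, g_smooth_prev, lam=0.6):
    """Assumed form of formula (10): the gain may rise freely but
    decays no faster than the attenuation factor lam allows."""
    return np.maximum(lam * g_smooth_prev, g_hat)

def interpolate_gains(g_smooth, band_centres, n_bins):
    """Formula (11): r(k) = sum_b w_b(k) * g~_b, with triangular
    weights between band centres (assumed), i.e. linear interpolation."""
    bins = np.arange(n_bins, dtype=float)
    return np.interp(bins, np.asarray(band_centres, dtype=float), g_smooth)

g_hat = np.array([0.2, 0.9, 0.5])          # raw network outputs (toy, 3 bands)
g_s = smooth_gains(g_hat, np.ones(3))      # previous smoothed gains all 1.0
r = interpolate_gains(g_s, band_centres=[0, 4, 8], n_bins=9)
print(g_s, r.shape)
```

Interpolating the 18 coarse Bark-band gains to per-bin gains in this way is what avoids the abrupt per-bin gain changes that cause musical noise.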
Step (F), applying the interpolation gain to the noisy single-channel speech to obtain the enhanced speech spectrum, as shown in formula (12), wherein α_b is the filter coefficient, P(k) is the spectrum of the pitch-delayed signal x(n-T), and X(k) is the spectrum of the noisy single-channel speech.
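Formula (12) itself is not reproduced in the text; based on the symbols it defines, one plausible sketch combines the noisy spectrum with the pitch-delayed spectrum and then applies the interpolated gain. This combination is an assumption, not the exact claimed formula:

```python
import numpy as np

def enhance_spectrum(X, P, r, alpha):
    """Apply pitch-based filtering, then the interpolated gain.

    X: spectrum of the noisy frame; P: spectrum of the pitch-delayed
    signal x(n - T); r: per-bin interpolation gain; alpha: per-bin
    filter coefficient (expanded from the per-band alpha_b). The form
    r * (X + alpha * P) is an assumed reading of formula (12).
    """
    return r * (X + alpha * P)

n_bins = 8
X = np.ones(n_bins, dtype=complex)          # toy noisy spectrum
P = 0.5 * np.ones(n_bins, dtype=complex)    # toy pitch-delayed spectrum
r = np.full(n_bins, 0.8)
S_hat = enhance_spectrum(X, P, r, alpha=0.2)
print(S_hat[0])   # (0.88+0j)
```

Adding a scaled pitch-delayed component reinforces the harmonic structure of voiced speech before the gain suppresses the inter-harmonic noise.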
Under different noisy speech conditions, the enhancement performance of the algorithm is shown in Table 1. The clean speech used to construct the test set comes from the read passages on the companion disc of the standard Putonghua proficiency test; 4 passages were selected and cut into 15 s segments, yielding 45 clean samples, which were mixed with recorded noise at 6 signal-to-noise ratios (-5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB) to generate 270 noisy samples. The metrics used to measure enhancement performance include PESQ (Perceptual Evaluation of Speech Quality), fwsegSNR (frequency-weighted segmental SNR) and STOI (Short-Time Objective Intelligibility). The results in Table 1 show that the single-channel speech enhancement method of the present invention significantly improves PESQ and fwsegSNR at all signal-to-noise ratios and also improves STOI to some extent; on average, PESQ, fwsegSNR and STOI are improved by 0.51, 2.29 dB and 0.018, respectively, demonstrating strong speech enhancement performance.
Table 1 enhancement performance test results for the single channel speech enhancement method of the present invention
In summary, the single-channel speech enhancement method based on the attention-gated recurrent neural network of the present invention uses attention within the conventional LSTM model to focus the units on the information in the current input context that is useful for the output, thereby improving the learning capability of the network. The band gains are estimated from the noisy features by a deep recurrent neural network without any statistical assumptions, and the generalization capability of the network can be improved by including diverse noise conditions in the training set. In addition, the recurrent neural network only needs to output 18 band gain estimates and 1 VAD estimate, each between 0 and 1, which greatly reduces the computational complexity. The single-channel speech enhancement method can effectively suppress noise, including non-stationary noise, avoids the musical noise commonly introduced by noise suppression through its band-based processing, and keeps the computational complexity low enough that the method can be used for real-time single-channel speech enhancement; the method is ingenious, novel in concept and has good application prospects.
The foregoing illustrates and describes the principles, general features and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.