CN110085249B - Single-channel speech enhancement method of recurrent neural network based on attention gating - Google Patents

Single-channel speech enhancement method of recurrent neural network based on attention gating

Info

Publication number
CN110085249B
CN110085249B (application CN201910385797.4A)
Authority
CN
China
Prior art keywords
neural network
noise
speech enhancement
attention
channel speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910385797.4A
Other languages
Chinese (zh)
Other versions
CN110085249A (en)
Inventor
梁瑞宇
孔凡留
谢跃
王青云
程佳鸣
孙世若
赵力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Innotrik Technology Co ltd
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology filed Critical Nanjing Institute of Technology
Priority to CN201910385797.4A priority Critical patent/CN110085249B/en
Publication of CN110085249A publication Critical patent/CN110085249A/en
Application granted granted Critical
Publication of CN110085249B publication Critical patent/CN110085249B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 - Processing in the frequency domain
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Noise Elimination (AREA)

Abstract

The invention discloses a single-channel speech enhancement method based on an attention-gated recurrent neural network, which comprises: framing and windowing the single-channel noisy speech and extracting 38-dimensional signal features; constructing a deep recurrent neural network for single-channel speech enhancement; constructing a training data set from a clean speech library and a noise library; training the constructed deep recurrent neural network; feeding the extracted noisy-speech features into the trained deep recurrent neural network, which outputs band gain estimates for the noisy speech that are then smoothed and interpolated to obtain an interpolation gain; and applying the interpolation gain to the single-channel noisy speech to obtain the enhanced speech spectrum. The method effectively suppresses noise, including non-stationary noise, while keeping the computational complexity low enough for real-time single-channel speech enhancement; the method is ingenious, the concept is novel, and the application prospect is good.

Description

Single-channel speech enhancement method of recurrent neural network based on attention gating
Technical Field
The invention relates to the technical field of speech enhancement, and in particular to a single-channel speech enhancement method based on an attention-gated recurrent neural network.
Background
Speech enhancement, as a branch of speech signal processing, has important applications in speech communication, hearing-aid devices, and the front ends of Automatic Speech Recognition (ASR) systems. Speech enhancement is generally divided into single-channel and multi-channel speech enhancement. Single-channel speech enhancement is relatively more difficult because it lacks the spatial information provided by a microphone array.
Earlier unsupervised single-channel speech enhancement algorithms, such as spectral subtraction, Wiener filtering, and Minimum Mean Square Error (MMSE)-based amplitude-spectrum or log-spectrum estimation, cannot effectively suppress non-stationary noise because they assume the noise is stationary. Subsequently, supervised single-channel speech enhancement algorithms based on Hidden Markov Models (HMMs), Non-negative Matrix Factorization (NMF) and deep learning were proposed, and the application of deep learning brought breakthrough progress to the field of speech enhancement. Thanks to its strong fitting ability, a neural network can learn a representation of the clean target speech from the features of the noisy speech without any assumption of noise stationarity.
However, currently proposed speech enhancement methods generally suppress non-stationary noise poorly, and deep-learning-based speech enhancement methods often cannot be applied to real-time speech enhancement because of their high computational complexity.
Disclosure of Invention
The invention aims to solve the problems that existing speech enhancement methods suppress non-stationary noise poorly and that deep-learning-based speech enhancement methods cannot meet real-time requirements because of their high computational complexity. The proposed single-channel speech enhancement method based on an attention-gated recurrent neural network effectively suppresses noise, including non-stationary noise, while keeping the computational complexity low enough for real-time single-channel speech enhancement; the method is ingenious, the concept is novel, and the application prospect is good.
To achieve this purpose, the invention adopts the following technical scheme.
A single-channel speech enhancement method based on an attention-gated recurrent neural network comprises the following steps:
step (A), framing and windowing the single-channel noisy speech and extracting 38-dimensional signal features, including Bark-frequency cepstral coefficients and their derivative parameters, the discrete cosine transform of the pitch correlation coefficients, the pitch period, and a spectral non-stationarity measure;
step (B), constructing a deep recurrent neural network for single-channel speech enhancement;
step (C), constructing a training data set from a clean speech library and a noise library;
step (D), training the deep recurrent neural network constructed in step (B) with the 38-dimensional signal features, the 18-dimensional ideal band gains and the 1-dimensional signal activity flag of the training data;
step (E), feeding the extracted noisy-speech features into the trained deep recurrent neural network, outputting band gain estimates for the noisy speech, and smoothing and interpolating them to obtain an interpolation gain;
and step (F), applying the interpolation gain to the single-channel noisy speech to obtain the enhanced speech spectrum.
In the single-channel speech enhancement method based on the attention-gated recurrent neural network, the 38-dimensional signal features extracted in step (A) specifically comprise 18 Bark-frequency cepstral coefficients, the first-order and second-order time derivatives of the first 6 Bark-frequency cepstral coefficients, the discrete cosine transform of the first 6 inter-band pitch correlation coefficients, 1 pitch period coefficient and 1 spectral non-stationarity measure.
In the single-channel speech enhancement method based on the attention-gated recurrent neural network, step (B) constructs a deep recurrent neural network for single-channel speech enhancement that comprises six layers: the first layer is a Dense layer with tanh activation and 24 units; the second to fifth layers are attention-gated LSTM layers with tanh activation and 24, 48, 48 and 96 units, respectively; the sixth layer is a Dense layer with sigmoid activation and 18 units. The output of the second layer of the network is additionally passed through a Dense layer to obtain the 1-dimensional signal activity flag.
In the foregoing single-channel speech enhancement method based on the attention-gated recurrent neural network, the forward propagation of the deep recurrent neural network is given by formulas (1) to (5):
a_t = σ[V_a · tanh(W_a · c_{t-1})]    (1)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (2)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (3)
c_t = a_t ⊙ c_{t-1} + (1 − a_t) ⊙ c̃_t    (4)
h_t = o_t ⊙ tanh(c_t)    (5)
where t is the frame index; a_t, o_t, c_t and h_t are the attention gate, output gate, cell state vector and hidden vector, respectively; c̃_t is the cell candidate state vector, of the same dimension; x_t is the input vector; V_a and W_a are parameter matrices of the attention gate; W_o and b_o are the weight matrix and bias vector of the output gate; W_c and b_c are the weight matrix and bias vector of the candidate state vector; σ is the sigmoid function; and ⊙ denotes element-wise multiplication.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (C) constructs a training data set from a clean speech library and a noise library; specifically, each sample is passed through a biquad filter to alter the amplitude of the mixed signal, where the biquad filter H(z) has the form shown in formula (6):
H(z) = (1 + r_1·z^{-1} + r_2·z^{-2}) / (1 + r_3·z^{-1} + r_4·z^{-2})    (6)
where r_1, ..., r_4 are random values uniformly distributed in the range [-3/8, 3/8].
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (D), training the deep recurrent neural network constructed in step (B), comprises the following steps.
(D1) calculating the band gain g_b of band b as shown in formula (7),
g_b = √( E_s(b) / E_x(b) )    (7)
where E_s(b) and E_x(b) are the energies of the clean speech and the noisy speech in band b, respectively, and the value of g_b lies in [0, 1];
(D2) taking the extracted 38-dimensional signal features as input of the deep recurrent neural network;
(D3) taking the 18-dimensional ideal band gains and the 1-dimensional signal activity flag as the training targets of the recurrent neural network, with the loss function L shown in formula (8):
L = L_g + α·L_vad    (8)
where L_g is the loss function corresponding to the band gain estimates, L_vad is the loss function corresponding to the VAD estimate, and α is a weighting coefficient; the band-gain loss L_g of formula (9) is computed from the band gain estimates ĝ_b using a cross-entropy loss function L_bin;
(D4) during training, after every batch all parameters are clipped to the range [-0.5, 0.5].
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (E) smooths and interpolates the band gain estimates output by the network to obtain the interpolation gain, as follows.
The smoothed band gain g̃_b is shown in formula (10):
g̃_b = max(λ·g̃_b^{prev}, ĝ_b)    (10)
where g̃_b^{prev} is the smoothed gain of the previous frame, ĝ_b is the band gain estimate output by the network, and λ is the attenuation factor. The interpolation gain r(k) of each frequency bin k is shown in formula (11):
r(k) = Σ_b w_b(k)·g̃_b    (11)
where w_b(k) is the amplitude of band b at frequency bin k.
In the aforementioned single-channel speech enhancement method based on the attention-gated recurrent neural network, step (F) applies the interpolation gain to the single-channel noisy speech to obtain the enhanced speech spectrum Ŝ(k), as shown in formula (12),
Ŝ(k) = r(k)·[X(k) + α_b·P(k)]    (12)
where α_b are the filter coefficients, P(k) is the spectrum of the pitch-delayed signal x(n − T), and X(k) is the spectrum of the noisy single-channel speech.
The invention has the following beneficial effects. The single-channel speech enhancement method based on the attention-gated recurrent neural network uses an attention gate in the conventional LSTM model to focus the cell on the information in the current input context that is useful for the output, which improves the learning capability of the network. The band gains are estimated from the noisy features by a deep recurrent neural network without any assumptions, and the generalization capability of the network can be improved by including a variety of noise conditions in the training set. In addition, the recurrent neural network only needs to output 18 band gain estimates between 0 and 1 and 1 VAD estimate, which greatly reduces the computational complexity. The single-channel speech enhancement method effectively suppresses noise, including non-stationary noise, avoids the musical noise that commonly arises in noise suppression by operating on frequency bands, and keeps the computational complexity low enough for real-time single-channel speech enhancement; the method is ingenious, the concept is novel, and the application prospect is good.
Drawings
FIG. 1 is a flow chart of the single-channel speech enhancement method based on the attention-gated recurrent neural network of the present invention;
FIG. 2 is a structural diagram of the deep recurrent neural network of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the single-channel speech enhancement method based on the attention-gated recurrent neural network of the present invention comprises the following steps.
Step (A): frame and window the single-channel noisy speech and extract 38-dimensional signal features, including Bark-frequency cepstral coefficients and their derivative parameters, the discrete cosine transform of the pitch correlation coefficients, the pitch period, and a spectral non-stationarity measure; specifically, these are 18 Bark-frequency cepstral coefficients, the first-order and second-order time derivatives of the first 6 Bark-frequency cepstral coefficients, the discrete cosine transform of the first 6 inter-band pitch correlation coefficients, 1 pitch period coefficient and 1 spectral non-stationarity measure (a sketch of this step is given below).
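A minimal sketch of the framing, windowing and Bark-cepstrum part of step (A) could look as follows; the frame length, hop size, window shape and Bark band edges are illustrative assumptions not fixed by this description, and the derivative features, pitch-correlation DCT, pitch period and non-stationarity measure would be computed analogously to fill the remaining dimensions of the 38-dimensional vector.

```python
import numpy as np
from scipy.fftpack import dct

FRAME_LEN = 480   # assumed: 20 ms frames at 24 kHz
FRAME_HOP = 240   # assumed: 50% overlap
NB_BANDS = 18

def frame_and_window(x):
    """Split a 1-D signal into overlapping frames and apply a Hann window."""
    win = np.hanning(FRAME_LEN)
    n_frames = 1 + (len(x) - FRAME_LEN) // FRAME_HOP
    frames = np.stack([x[i * FRAME_HOP: i * FRAME_HOP + FRAME_LEN] for i in range(n_frames)])
    return frames * win

def bark_band_energies(spectrum, band_edges):
    """Sum |X(k)|^2 over each Bark-scale band; band_edges holds NB_BANDS + 1 FFT-bin indices."""
    psd = np.abs(spectrum) ** 2
    return np.array([psd[band_edges[b]:band_edges[b + 1]].sum() + 1e-8 for b in range(NB_BANDS)])

def bfcc(windowed_frame, band_edges):
    """The 18 Bark-frequency cepstral coefficients of one windowed frame."""
    spec = np.fft.rfft(windowed_frame)
    return dct(np.log(bark_band_energies(spec, band_edges)), type=2, norm='ortho')
```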
Step (B): construct a deep recurrent neural network for single-channel speech enhancement. As shown in FIG. 2, the deep recurrent neural network comprises six layers: the first layer is a Dense layer with tanh activation and 24 units; the second to fifth layers are attention-gated LSTM layers with tanh activation and 24, 48, 48 and 96 units, respectively; the sixth layer is a Dense layer with sigmoid activation and 18 units; the output of the second layer is additionally passed through a Dense layer (sigmoid activation, 1 unit) to obtain the 1-dimensional signal activity flag. The forward propagation of the deep recurrent neural network is given by formulas (1) to (5):
a_t = σ[V_a · tanh(W_a · c_{t-1})]    (1)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (2)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (3)
c_t = a_t ⊙ c_{t-1} + (1 − a_t) ⊙ c̃_t    (4)
h_t = o_t ⊙ tanh(c_t)    (5)
where t is the frame index; a_t, o_t, c_t and h_t are the attention gate, output gate, cell state vector and hidden vector, respectively; c̃_t is the cell candidate state vector, of the same dimension; x_t is the input vector; V_a and W_a are parameter matrices of the attention gate; W_o and b_o are the weight matrix and bias vector of the output gate; W_c and b_c are the weight matrix and bias vector of the candidate state vector; σ is the sigmoid function; and ⊙ denotes element-wise multiplication.
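As a compact illustration of the cell defined by formulas (1) to (5) and of the six-layer network above, a PyTorch sketch is given below. The class names are illustrative, formula (4) is used in the reconstructed form shown above, and details such as weight initialization, batching of variable-length utterances and the choice of framework are assumptions rather than part of the patent.

```python
import torch
import torch.nn as nn

class AttentionGatedLSTMCell(nn.Module):
    """One time step of the attention-gated LSTM, following formulas (1)-(5)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)   # inside tanh, formula (1)
        self.V_a = nn.Linear(hidden_size, hidden_size, bias=False)   # outside tanh, formula (1)
        self.W_o = nn.Linear(input_size + hidden_size, hidden_size)  # output gate, formula (2)
        self.W_c = nn.Linear(input_size + hidden_size, hidden_size)  # candidate state, formula (3)

    def forward(self, x_t, h_prev, c_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)
        a_t = torch.sigmoid(self.V_a(torch.tanh(self.W_a(c_prev))))  # (1) attention gate
        o_t = torch.sigmoid(self.W_o(hx))                            # (2) output gate
        c_hat = torch.tanh(self.W_c(hx))                             # (3) candidate state
        c_t = a_t * c_prev + (1.0 - a_t) * c_hat                     # (4) as reconstructed above
        h_t = o_t * torch.tanh(c_t)                                  # (5) hidden state
        return h_t, c_t

class AttentionGatedEnhancementNet(nn.Module):
    """Dense(24, tanh) -> 4 attention-gated LSTM layers (24, 48, 48, 96) -> Dense(18, sigmoid),
    with a Dense(1, sigmoid) VAD branch taken from the output of the second layer."""
    def __init__(self, n_features=38, n_bands=18):
        super().__init__()
        self.inp = nn.Linear(n_features, 24)
        in_sizes, self.hid_sizes = [24, 24, 48, 48], [24, 48, 48, 96]
        self.cells = nn.ModuleList(AttentionGatedLSTMCell(i, h)
                                   for i, h in zip(in_sizes, self.hid_sizes))
        self.gain_out = nn.Linear(96, n_bands)
        self.vad_out = nn.Linear(24, 1)

    def forward(self, x):                                    # x: (batch, time, 38)
        B, T, _ = x.shape
        h = [x.new_zeros(B, s) for s in self.hid_sizes]
        c = [x.new_zeros(B, s) for s in self.hid_sizes]
        gains, vad = [], []
        for t in range(T):
            z = torch.tanh(self.inp(x[:, t]))                # layer 1: Dense, tanh
            for i, cell in enumerate(self.cells):            # layers 2-5: attention-gated LSTM
                h[i], c[i] = cell(z, h[i], c[i])
                z = h[i]
            gains.append(torch.sigmoid(self.gain_out(h[-1])))  # layer 6: 18 band gains
            vad.append(torch.sigmoid(self.vad_out(h[0])))      # VAD branch off layer 2
        return torch.stack(gains, dim=1), torch.stack(vad, dim=1)
```

Feeding a (batch, time, 38) feature tensor through this sketch yields, per frame, the 18 band gain estimates and the 1-dimensional VAD output used in the following steps.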
Step (C): construct a training data set from a clean speech library and a noise library; specifically, each sample is passed through a biquad filter to alter the amplitude of the mixed signal, where the biquad filter H(z) has the form shown in formula (6):
H(z) = (1 + r_1·z^{-1} + r_2·z^{-2}) / (1 + r_3·z^{-1} + r_4·z^{-2})    (6)
where r_1, ..., r_4 are random values uniformly distributed in the range [-3/8, 3/8] and z is the z-transform variable.
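A minimal sketch of this augmentation step, using formula (6) in the reconstructed form above, could look as follows; the function name and the way clean speech and noise are combined are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def random_biquad(x, rng):
    """Filter a sample with H(z) = (1 + r1*z^-1 + r2*z^-2) / (1 + r3*z^-1 + r4*z^-2),
    where r1..r4 are drawn uniformly from [-3/8, 3/8] as in formula (6)."""
    r1, r2, r3, r4 = rng.uniform(-3/8, 3/8, size=4)
    return lfilter([1.0, r1, r2], [1.0, r3, r4], x)

rng = np.random.default_rng(0)
# One way a noisy training sample could be assembled from the clean and noise libraries:
# noisy = random_biquad(clean + noise, rng)   # clean, noise: 1-D arrays of equal length
```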
Step (D): train the deep recurrent neural network constructed in step (B) with the 38-dimensional signal features, the 18-dimensional ideal band gains (i.e., the gains of the 18 Bark bands) and the 1-dimensional signal activity flag of the training data, comprising the following steps.
(D1) calculating the band gain g_b of band b as shown in formula (7),
g_b = √( E_s(b) / E_x(b) )    (7)
where E_s(b) and E_x(b) are the energies of the clean speech and the noisy speech in band b, respectively, and the value of g_b lies in [0, 1];
(D2) taking the extracted 38-dimensional signal features as input of the deep recurrent neural network;
(D3) taking the 18-dimensional ideal band gains and the 1-dimensional signal activity flag as the training targets of the recurrent neural network, with the loss function L shown in formula (8):
L = L_g + α·L_vad    (8)
where L_g is the loss function corresponding to the band gain estimates, L_vad is the loss function corresponding to the VAD estimate, and α is a weighting coefficient; the band-gain loss L_g of formula (9) is computed from the band gain estimates ĝ_b using a cross-entropy loss function L_bin;
(D4) during training, after every batch all parameters are clipped to the range [-0.5, 0.5].
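A sketch of one training step under these settings is shown below. Because formula (9) is reproduced only as an image in the source, the band-gain loss is stood in for by a per-band binary cross-entropy L_bin between ideal and estimated gains, and the weighting α = 0.5 and the optimizer choice are assumptions.

```python
import torch
import torch.nn.functional as F

ALPHA = 0.5   # assumed weighting coefficient alpha of formula (8); not specified in the text

def loss_fn(gain_pred, gain_target, vad_pred, vad_target):
    """L = L_g + alpha * L_vad, with L_g stood in by cross-entropy over the 18 band gains."""
    l_g = F.binary_cross_entropy(gain_pred, gain_target)
    l_vad = F.binary_cross_entropy(vad_pred, vad_target)
    return l_g + ALPHA * l_vad

def train_batch(model, optimizer, feats, gain_target, vad_target):
    """One batch of training; e.g. optimizer = torch.optim.Adam(model.parameters(), 1e-3)."""
    optimizer.zero_grad()
    gain_pred, vad_pred = model(feats)
    loss = loss_fn(gain_pred, gain_target, vad_pred, vad_target)
    loss.backward()
    optimizer.step()
    with torch.no_grad():                      # step (D4): clip all parameters after every batch
        for p in model.parameters():
            p.clamp_(-0.5, 0.5)
    return loss.item()
```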
And (E) inputting the extracted noisy speech features into the trained deep-circulation neural network, outputting a band gain estimation value of the noisy speech, and smoothing and interpolating to obtain an interpolation gain, wherein the specific process is as follows.
The smoothed band gain g̃_b is shown in formula (10):
g̃_b = max(λ·g̃_b^{prev}, ĝ_b)    (10)
where g̃_b^{prev} is the smoothed gain of the previous frame, ĝ_b is the band gain estimate output by the network, and λ is the attenuation factor. The interpolation gain r(k) of each frequency bin k is shown in formula (11):
r(k) = Σ_b w_b(k)·g̃_b    (11)
where w_b(k) is the amplitude of band b at frequency bin k.
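Per frame, formulas (10) and (11) in the reconstructed form above could be implemented as sketched below; the value of the attenuation factor λ and the layout of the band-amplitude matrix w_b(k) are illustrative assumptions.

```python
import numpy as np

LAMBDA = 0.6   # assumed attenuation factor; the text does not give its value

def smooth_and_interpolate(gain_hat, prev_smoothed, band_weights):
    """gain_hat: (18,) band gains from the network; prev_smoothed: (18,) smoothed gains of the
    previous frame; band_weights: (18, n_bins) band amplitudes w_b(k) at each frequency bin."""
    smoothed = np.maximum(gain_hat, LAMBDA * prev_smoothed)   # formula (10)
    r = band_weights.T @ smoothed                             # formula (11): r(k) = sum_b w_b(k)*g_b
    return smoothed, r
```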
Step (F): apply the interpolation gain to the single-channel noisy speech to obtain the enhanced speech spectrum Ŝ(k), as shown in formula (12),
Ŝ(k) = r(k)·[X(k) + α_b·P(k)]    (12)
where α_b are the filter coefficients, P(k) is the spectrum of the pitch-delayed signal x(n − T), and X(k) is the spectrum of the noisy single-channel speech.
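A per-frame sketch of step (F), using formula (12) in the reconstructed form above, is given below; the bin-to-band mapping and the closing inverse transform are illustrative assumptions.

```python
import numpy as np

def enhance_frame(X, P, r, alpha_b, band_of_bin):
    """X, P: complex spectra of the noisy frame and of the pitch-delayed signal x(n - T);
    r: (n_bins,) interpolated gain; alpha_b: (18,) pitch filter coefficients;
    band_of_bin: (n_bins,) integer map from frequency bin to band index."""
    S_hat = r * (X + alpha_b[band_of_bin] * P)   # formula (12)
    return S_hat

# The enhanced time-domain frame would then be recovered with np.fft.irfft(S_hat)
# followed by windowed overlap-add.
```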
The enhancement performance of the algorithm under different noisy speech conditions is shown in Table 1. The clean speech used to build the test set comes from the read-passage set on the companion disc of the standard Chinese (Putonghua) proficiency test; 4 passages were selected and cut into 15-second segments, yielding 45 clean samples, which were mixed with recorded noise at 6 signal-to-noise ratios (-5 dB, 0 dB, 5 dB, 10 dB, 15 dB and 20 dB) to generate 270 noisy samples. The metrics used to measure enhancement performance are PESQ (Perceptual Evaluation of Speech Quality), fwsegSNR (frequency-weighted segmental SNR) and STOI (Short-Time Objective Intelligibility). The results in Table 1 show that the single-channel speech enhancement method of the invention significantly improves PESQ and fwsegSNR at all signal-to-noise ratios and also improves STOI to some extent; on average, PESQ, fwsegSNR and STOI improve by 0.51, 2.29 dB and 0.018, respectively, demonstrating strong speech enhancement performance.
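For reference, PESQ and STOI for a clean/enhanced pair could be computed with the third-party pesq and pystoi packages as sketched below; these tools and file names are illustrative and not part of the patent, and fwsegSNR would require a separate implementation.

```python
import soundfile as sf
from pesq import pesq     # PESQ (ITU-T P.862) from the `pesq` package
from pystoi import stoi   # STOI from the `pystoi` package

clean, fs = sf.read("clean.wav")        # reference clean speech (file names are placeholders)
enhanced, _ = sf.read("enhanced.wav")   # output of the enhancement method

print("PESQ:", pesq(fs, clean, enhanced, "wb"))         # wide-band mode; requires fs = 16 kHz
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```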
Table 1 enhancement performance test results for the single channel speech enhancement method of the present invention
In summary, the single-channel speech enhancement method based on the attention-gated recurrent neural network of the present invention uses an attention gate in the conventional LSTM model to focus the cell on the information in the current input context that is useful for the output, which improves the learning capability of the network. The band gains are estimated from the noisy features by a deep recurrent neural network without any assumptions, and the generalization capability of the network can be improved by including a variety of noise conditions in the training set. In addition, the recurrent neural network only needs to output 18 band gain estimates between 0 and 1 and 1 VAD estimate, which greatly reduces the computational complexity. The method effectively suppresses noise, including non-stationary noise, avoids the musical noise that commonly arises in noise suppression by operating on frequency bands, and keeps the computational complexity low enough for real-time single-channel speech enhancement; the method is ingenious, the concept is novel, and the application prospect is good.
The foregoing illustrates and describes the principles, main features and advantages of the present invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above, which merely illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A single-channel speech enhancement method based on an attention-gated recurrent neural network, characterized by comprising the following steps:
step (A), framing and windowing the single-channel noisy speech and extracting 38-dimensional signal features, including Bark-frequency cepstral coefficients and their derivative parameters, the discrete cosine transform of the pitch correlation coefficients, the pitch period, and a spectral non-stationarity measure;
step (B), constructing a deep recurrent neural network for single-channel speech enhancement;
step (C), constructing a training data set from a clean speech library and a noise library;
step (D), training the deep recurrent neural network constructed in step (B) with the 38-dimensional signal features, the 18-dimensional ideal band gains and the 1-dimensional signal activity flag of the training data;
step (E), feeding the extracted noisy-speech features into the trained deep recurrent neural network, outputting band gain estimates for the noisy speech, and smoothing and interpolating them to obtain an interpolation gain;
and step (F), applying the interpolation gain to the single-channel noisy speech to obtain the enhanced speech spectrum.
2. The single-channel speech enhancement method based on an attention-gated recurrent neural network according to claim 1, wherein the 38-dimensional signal features extracted in step (A) specifically comprise 18 Bark-frequency cepstral coefficients, the first-order and second-order time derivatives of the first 6 Bark-frequency cepstral coefficients, the discrete cosine transform of the first 6 inter-band pitch correlation coefficients, 1 pitch period coefficient and 1 spectral non-stationarity measure.
3. The single-channel speech enhancement method based on an attention-gated recurrent neural network according to claim 1, wherein the deep recurrent neural network constructed in step (B) for single-channel speech enhancement comprises six layers: the first layer is a Dense layer with tanh activation and 24 units; the second to fifth layers are attention-gated LSTM layers with tanh activation and 24, 48, 48 and 96 units, respectively; the sixth layer is a Dense layer with sigmoid activation and 18 units; and the output of the second layer of the network is passed through a further Dense layer to obtain the 1-dimensional signal activity flag.
4. The single-channel speech enhancement method based on an attention-gated recurrent neural network according to claim 3, wherein the forward propagation of the deep recurrent neural network is given by formulas (1) to (5):
a_t = σ[V_a · tanh(W_a · c_{t-1})]    (1)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)    (2)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)    (3)
c_t = a_t ⊙ c_{t-1} + (1 − a_t) ⊙ c̃_t    (4)
h_t = o_t ⊙ tanh(c_t)    (5)
where t is the frame index; a_t, o_t, c_t and h_t are the attention gate, output gate, cell state vector and hidden vector, respectively; c̃_t is the cell candidate state vector, of the same dimension; x_t is the input vector; V_a and W_a are parameter matrices of the attention gate; W_o and b_o are the weight matrix and bias vector of the output gate; W_c and b_c are the weight matrix and bias vector of the candidate state vector; σ is the sigmoid function; and ⊙ denotes element-wise multiplication.
5. The single-channel speech enhancement method based on an attention-gated recurrent neural network according to claim 1, wherein step (C) constructs a training data set from a clean speech library and a noise library; specifically, each sample is passed through a biquad filter to alter the amplitude of the mixed signal, where the biquad filter H(z) has the form shown in formula (6):
H(z) = (1 + r_1·z^{-1} + r_2·z^{-2}) / (1 + r_3·z^{-1} + r_4·z^{-2})    (6)
where r_1, ..., r_4 are random values uniformly distributed in the range [-3/8, 3/8].
6. The single-channel speech enhancement method based on an attention-gated recurrent neural network according to claim 1, wherein step (D), training the deep recurrent neural network constructed in step (B), comprises the following steps:
(D1) calculating the band gain g_b of band b as shown in formula (7),
g_b = √( E_s(b) / E_x(b) )    (7)
where E_s(b) and E_x(b) are the energies of the clean speech and the noisy speech in band b, respectively, and the value of g_b lies in [0, 1];
(D2) taking the extracted 38-dimensional signal features as input of the deep recurrent neural network;
(D3) taking the 18-dimensional ideal band gains and the 1-dimensional signal activity flag as the training targets of the recurrent neural network, with the loss function L shown in formula (8):
L = L_g + α·L_vad    (8)
where L_g is the loss function corresponding to the band gain estimates, L_vad is the loss function corresponding to the VAD estimate, and α is a weighting coefficient; the band-gain loss L_g of formula (9) is computed from the band gain estimates ĝ_b using a cross-entropy loss function L_bin;
(D4) during training, after every batch all parameters are clipped to the range [-0.5, 0.5].
7. The single-channel speech enhancement method based on an attention-gated recurrent neural network according to claim 6, wherein step (E) smooths and interpolates the band gain estimates output by the network to obtain the interpolation gain, as follows:
the smoothed band gain g̃_b is shown in formula (10):
g̃_b = max(λ·g̃_b^{prev}, ĝ_b)    (10)
where g̃_b^{prev} is the smoothed gain of the previous frame, ĝ_b is the band gain estimate output by the network, and λ is the attenuation factor; the interpolation gain r(k) of each frequency bin k is shown in formula (11):
r(k) = Σ_b w_b(k)·g̃_b    (11)
where w_b(k) is the amplitude of band b at frequency bin k.
8. The single-channel speech enhancement method based on an attention-gated recurrent neural network according to claim 7, wherein step (F) applies the interpolation gain to the single-channel noisy speech to obtain the enhanced speech spectrum Ŝ(k), as shown in formula (12),
Ŝ(k) = r(k)·[X(k) + α_b·P(k)]    (12)
where α_b are the filter coefficients, P(k) is the spectrum of the pitch-delayed signal x(n − T), and X(k) is the spectrum of the noisy single-channel speech.
CN201910385797.4A 2019-05-09 2019-05-09 Single-channel speech enhancement method of recurrent neural network based on attention gating Active CN110085249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910385797.4A CN110085249B (en) 2019-05-09 2019-05-09 Single-channel speech enhancement method of recurrent neural network based on attention gating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910385797.4A CN110085249B (en) 2019-05-09 2019-05-09 Single-channel speech enhancement method of recurrent neural network based on attention gating

Publications (2)

Publication Number Publication Date
CN110085249A CN110085249A (en) 2019-08-02
CN110085249B true CN110085249B (en) 2021-03-16

Family

ID=67419464

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910385797.4A Active CN110085249B (en) 2019-05-09 2019-05-09 Single-channel speech enhancement method of recurrent neural network based on attention gating

Country Status (1)

Country Link
CN (1) CN110085249B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11996108B2 (en) 2019-08-01 2024-05-28 Dolby Laboratories Licensing Corporation System and method for enhancement of a degraded audio signal
CN110491407B (en) * 2019-08-15 2021-09-21 广州方硅信息技术有限公司 Voice noise reduction method and device, electronic equipment and storage medium
CN110473567B (en) * 2019-09-06 2021-09-14 上海又为智能科技有限公司 Audio processing method and device based on deep neural network and storage medium
CN110675891B (en) * 2019-09-25 2020-09-18 电子科技大学 Voice separation method and module based on multilayer attention mechanism
CN110739003B (en) * 2019-10-23 2022-10-28 北京计算机技术及应用研究所 Voice enhancement method based on multi-head self-attention mechanism
CN110867192A (en) * 2019-10-23 2020-03-06 北京计算机技术及应用研究所 Speech enhancement method based on gated cyclic coding and decoding network
CN111341351B (en) * 2020-02-25 2023-05-23 厦门亿联网络技术股份有限公司 Voice activity detection method, device and storage medium based on self-attention mechanism
CN111429938B (en) * 2020-03-06 2022-09-13 江苏大学 Single-channel voice separation method and device and electronic equipment
CN111508519B (en) * 2020-04-03 2022-04-26 北京达佳互联信息技术有限公司 Method and device for enhancing voice of audio signal
CN111581892B (en) * 2020-05-29 2024-02-13 重庆大学 Bearing residual life prediction method based on GDAU neural network
CN111429932A (en) * 2020-06-10 2020-07-17 浙江远传信息技术股份有限公司 Voice noise reduction method, device, equipment and medium
WO2022026948A1 (en) 2020-07-31 2022-02-03 Dolby Laboratories Licensing Corporation Noise reduction using machine learning
CN111916060B (en) * 2020-08-12 2022-03-01 四川长虹电器股份有限公司 Deep learning voice endpoint detection method and system based on spectral subtraction
CN113516992A (en) * 2020-08-21 2021-10-19 腾讯科技(深圳)有限公司 Audio processing method and device, intelligent equipment and storage medium
CN111986660A (en) * 2020-08-26 2020-11-24 深圳信息职业技术学院 Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling
CN112349277B (en) * 2020-09-28 2023-07-04 紫光展锐(重庆)科技有限公司 Feature domain voice enhancement method combined with AI model and related product
CN112509593B (en) * 2020-11-17 2024-03-08 北京清微智能科技有限公司 Speech enhancement network model, single-channel speech enhancement method and system
CN115831155A (en) * 2021-09-16 2023-03-21 腾讯科技(深圳)有限公司 Audio signal processing method and device, electronic equipment and storage medium
CN114023352B (en) * 2021-11-12 2022-12-16 华南理工大学 Voice enhancement method and device based on energy spectrum depth modulation
CN113823309B (en) * 2021-11-22 2022-02-08 成都启英泰伦科技有限公司 Noise reduction model construction and noise reduction processing method
CN114664310B (en) * 2022-03-01 2023-03-31 浙江大学 Silent attack classification promotion method based on attention enhancement filtering

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180261225A1 (en) * 2017-03-13 2018-09-13 Mitsubishi Electric Research Laboratories, Inc. System and Method for Multichannel End-to-End Speech Recognition
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104700828B (en) * 2015-03-19 2018-01-12 清华大学 The construction method of depth shot and long term memory Recognition with Recurrent Neural Network acoustic model based on selective attention principle
US10319374B2 (en) * 2015-11-25 2019-06-11 Baidu USA, LLC Deployed end-to-end speech recognition
US10268671B2 (en) * 2015-12-31 2019-04-23 Google Llc Generating parse trees of text segments using neural networks
US9799327B1 (en) * 2016-02-26 2017-10-24 Google Inc. Speech recognition with attention-based recurrent neural networks
KR102033411B1 (en) * 2016-08-12 2019-10-17 한국전자통신연구원 Apparatus and Method for Recognizing speech By Using Attention-based Context-Dependent Acoustic Model
US10929674B2 (en) * 2016-08-29 2021-02-23 Nec Corporation Dual stage attention based recurrent neural network for time series prediction
KR20180080446A (en) * 2017-01-04 2018-07-12 삼성전자주식회사 Voice recognizing method and voice recognizing appratus
CN107766894B (en) * 2017-11-03 2021-01-22 吉林大学 Remote sensing image natural language generation method based on attention mechanism and deep learning
CN108304587B (en) * 2018-03-07 2020-10-27 中国科学技术大学 Community question-answering platform answer sorting method
CN108648748B (en) * 2018-03-30 2021-07-13 沈阳工业大学 Acoustic event detection method under hospital noise environment
CN108682418B (en) * 2018-06-26 2022-03-04 北京理工大学 Speech recognition method based on pre-training and bidirectional LSTM
CN109065067B (en) * 2018-08-16 2022-12-06 福建星网智慧科技有限公司 Conference terminal voice noise reduction method based on neural network model
CN108986834B (en) * 2018-08-22 2023-04-07 中国人民解放军陆军工程大学 Bone conduction voice blind enhancement method based on codec framework and recurrent neural network

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180261225A1 (en) * 2017-03-13 2018-09-13 Mitsubishi Electric Research Laboratories, Inc. System and Method for Multichannel End-to-End Speech Recognition
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying

Also Published As

Publication number Publication date
CN110085249A (en) 2019-08-02

Similar Documents

Publication Publication Date Title
CN110085249B (en) Single-channel speech enhancement method of recurrent neural network based on attention gating
CN108172231B (en) Dereverberation method and system based on Kalman filtering
CN108447495B (en) Deep learning voice enhancement method based on comprehensive feature set
CN112735456B (en) Speech enhancement method based on DNN-CLSTM network
CN110148420A (en) A kind of audio recognition method suitable under noise circumstance
CN105679330B (en) Based on the digital deaf-aid noise-reduction method for improving subband signal-to-noise ratio (SNR) estimation
CN109961799A (en) A kind of hearing aid multicenter voice enhancing algorithm based on Iterative Wiener Filtering
CN113744749B (en) Speech enhancement method and system based on psychoacoustic domain weighting loss function
CN110808057A (en) Voice enhancement method for generating confrontation network based on constraint naive
CN112885375A (en) Global signal-to-noise ratio estimation method based on auditory filter bank and convolutional neural network
CN116013344A (en) Speech enhancement method under multiple noise environments
CN110111802B (en) Kalman filtering-based adaptive dereverberation method
CN115171712A (en) Speech enhancement method suitable for transient noise suppression
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN116052706B (en) Low-complexity voice enhancement method based on neural network
CN117219102A (en) Low-complexity voice enhancement method based on auditory perception
CN115312073A (en) Low-complexity residual echo suppression method combining signal processing and deep neural network
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
CN114566179A (en) Time delay controllable voice noise reduction method
CN114882898A (en) Multi-channel speech signal enhancement method and apparatus, computer device and storage medium
Li et al. An overview of speech dereverberation
CN113870884B (en) Single-microphone noise suppression method and device
Srinivasarao An efficient recurrent Rats function network (Rrfn) based speech enhancement through noise reduction
Boyko et al. Using recurrent neural network to noise absorption from audio files.
CN112687285B (en) Echo cancellation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240523

Address after: 518000 1302-1307, building 13, qinchengda paradise, zone 22, lingzhiyuan community, Xin'an street, Bao'an District, Shenzhen, Guangdong Province

Patentee after: SHENZHEN INNOTRIK TECHNOLOGY Co.,Ltd.

Country or region after: China

Address before: 211167 No.1, Hongjing Avenue, Jiangning Science Park, Jiangning District, Nanjing City, Jiangsu Province

Patentee before: NANJING INSTITUTE OF TECHNOLOGY

Country or region before: China