CN111986660A - Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling - Google Patents
- Publication number: CN111986660A (application number CN202010872886.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (creation of reference templates; training of speech recognition systems)
- G10L21/0216—Noise filtering characterised by the method used for estimating noise (speech enhancement, e.g. noise reduction or echo cancellation)
- G10L21/0232—Processing in the frequency domain
- G10L25/18—Extracted parameters being spectral information of each sub-band
- G10L25/21—Extracted parameters being power information
- G10L25/24—Extracted parameters being the cepstrum
- G10L25/30—Analysis technique using neural networks
- G10L25/45—Characterised by the type of analysis window
Abstract
The invention provides a single-channel speech enhancement method, system and storage medium for neural network sub-band modeling, wherein the single-channel speech enhancement method comprises the following steps. Step 1: collect a noisy speech signal and send it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module. Step 2: receive the noisy speech signal of step 1 with the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module, perform feature extraction on it with these two modules, and finally send the extracted features to a frequency band feature division module. Step 3: receive the features extracted in step 2 with the frequency band feature division module, which then assigns the extracted features to sub-bands. The beneficial effects of the invention are: the invention performs independent neural network modeling on each sub-band of the speech signal, which reduces the task difficulty of the neural network and reduces the parameters of the model.
Description
Technical Field
The invention relates to the field of data processing, and in particular to a single-channel speech enhancement method, system and storage medium for neural network sub-band modeling.
Background
At present, voice electronic products on the market, such as communication products and human-computer interaction products, are affected by various noise interferences. Background noise not only degrades the quality of communication between people but also poses great challenges to human-computer interaction. For example, for voice-interaction electronic devices such as smart speakers, smart televisions, and vehicle-mounted devices, speech recognition is an indispensable technology, and recognition accuracy in a quiet environment can fully meet users' requirements. When background noise is present, however, the recognition accuracy of the machine is greatly reduced. It is therefore necessary to apply speech enhancement techniques to denoise the speech signal, reduce the influence of interfering noise, and improve speech quality, so that a machine can achieve high recognition accuracy even in a complex acoustic environment. In addition, for voice products with strict requirements on noise reduction and latency, such as hearing aids, walkie-talkies, and in-ear monitors, the speech enhancement algorithm must not only ensure a good noise reduction effect but also offer low computational cost and low latency.
Disclosure of Invention
The invention provides a single-channel speech enhancement method for neural network sub-band modeling, which comprises the following steps:
step 1: collecting a noisy speech signal and sending it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module;
step 2: receiving the noisy speech signal of step 1 with the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module, performing feature extraction on the noisy speech signal with these two modules, and finally sending the extracted features to a frequency band feature division module;
step 3: receiving the features extracted in step 2 with the frequency band feature division module, using it to assign sub-band features to the extracted features, inputting the features on each sub-band to the corresponding neural network mapping module to estimate the a priori signal-to-noise ratio, and finally combining the estimated a priori signal-to-noise ratios on all sub-bands and sending them to a full-band Wiener filtering module;
step 4: receiving and processing the estimated a priori signal-to-noise ratios on all sub-bands from step 3 with the full-band Wiener filtering module to obtain the enhanced speech signal.
As a further improvement of the present invention, in step 2, the feature extraction performed by the log power spectrum extraction module on the noisy speech signal comprises the following steps:
the first step is as follows: preprocessing a voice signal x (n) acquired by a microphone by framing and windowing;
the second step is as follows: performing fast Fourier transform to obtain a frequency spectrum of the signal, and obtaining a power spectrum S2(k) of a frequency domain;
the third step: carrying out natural logarithm operation;
the fourth step: the power spectrum is compressed in the logarithmic domain, and the extracted logarithmic power spectrum characteristic Y is obtainedlog(k) As shown in the following formula (1):
Ylog(k)=ln(S2(k)),k=1,2,...,N (1)
in the single-channel speech enhancement method, a sampling rate of 16kHz is adopted, the frame length of each frame is 16ms, the frame shift is 8ms, and N is 129.
As a further improvement of the present invention, in step 2, the feature extraction performed by the Bark cepstral coefficient extraction module on the noisy speech signal comprises the following steps:
step S1: preprocessing the input voice signal x (n) by framing and windowing;
step S2: performing fast Fourier transform to transform the data from a time domain to a frequency domain;
step S3: calculate the frequency-domain power spectrum S²(k);
step S4: pass the calculated frequency-domain power spectrum S²(k) through the Bark filterbank and compute the filtered energy spectrum, as in formula (2):
E(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B (2)
where b is the index of the Bark energy spectrum band, B is the number of Bark filters (here 24), each filter corresponds to one Bark-domain band, and H_b(k) is the transfer function of the b-th Bark-frequency filter, whose expression is given in equation (3);
step S5: take the logarithm of the Bark energy spectrum of each frame and apply a discrete cosine transform (DCT), as in formula (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_{b=1..B} ln(E(b))·cos(πn(2b−1)/(2B)) (4)
where Y_bark(n) is the extracted BFCC feature, n is the band index of the feature, and the feature dimension matches the number of Bark filters, here 24.
As a further improvement of the present invention, in step 3, the frequency band feature division module sequentially performs the following steps:
sub-band division: dividing the frequency domain range of 0-8000Hz into 8 sub-bands, and respectively giving indexes of features on different sub-bands according to the different numbers of LPS features and BFCC features corresponding to each sub-band;
characteristic splicing step: and splicing the LPS and BFCC characteristics on each sub-band, and respectively sending the spliced LPS and BFCC characteristics to respective neural network mapping modules for estimation of prior signal-to-noise ratio.
As a further improvement of the present invention, in step 3, the neural network mapping module includes 5 neural layers, where the first and last layers are feedforward neural network layers, the middle three layers are GRU neural layers, and weighted summation is performed in the feedforward neural network layers in a fully connected manner, and nonlinear activation is performed, as shown in the following formula (5):
h=g(W·X+b) (5)
where W and b are the weights and biases of the neurons, h is the output of the feedforward neural network layer, X is its input, and g(·) denotes the nonlinear activation operation; feedforward layer 1 uses the ReLU activation function, while feedforward layer 2 outputs the a priori signal-to-noise ratio estimate, so no activation operation is applied and only a linear weighted summation is performed.
As a further improvement of the present invention, the memory update mechanism in the neural network mapping module GRU layer is specifically as follows:
The GRU unit combines the feature x_t of the current frame with the retained output h_(t−1) of the previous frame and, through the processing of the update gate and the reset gate, generates the output h_t of the current frame; this process is iterated frame by frame. The calculation formulas of each gate and of the output are as follows:
r_t = σ(W_r·[h_(t−1), x_t]) (6)
z_t = σ(W_z·[h_(t−1), x_t]) (7)
h̃_t = tanh(W_h·[r_t ⊙ h_(t−1), x_t]) (8)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t (9)
where σ(·) and tanh(·) denote the Sigmoid activation function and the hyperbolic tangent activation function respectively, r_t is the output of the reset gate for the current frame, and z_t is the output of the update gate for the current frame.
As a further improvement of the present invention, in step 3, the a priori signal-to-noise ratio values estimated by the neural network mapping modules on the sub-bands are combined to obtain a 129-dimensional output.
As a further improvement of the present invention, in step 4, the full-band wiener filtering module further includes the following steps:
step Y1: calculate the gain function used for filtering, expressed as formula (10):
G(k) = ξ̂(k) / (1 + ξ̂(k)) (10)
where ξ̂(k) is the a priori signal-to-noise ratio value output by the neural network mapping module;
step Y2: filter the input noisy speech with the estimated gain function and finally perform an inverse Fourier transform to obtain the noise-reduced speech signal ŝ(n), as follows:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N (11)
ŝ(n) = IFFT(Ŝ(k)) (12)
Formula (11) is the frequency-domain filtering process of the Wiener filter, where S(k) is the spectrum of the input noisy speech signal, N is the number of frequency points per frame (here 129), and Ŝ(k) is the enhanced speech signal spectrum; the inverse Fourier transform of formula (12) yields the final time-domain signal output ŝ(n).
The invention also discloses a single-channel speech enhancement system for neural network sub-band modeling, which comprises: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the single-channel speech enhancement method of the present invention when invoked by the processor.
The invention also discloses a computer readable storage medium storing a computer program configured to, when invoked by a processor, implement the steps of the single channel speech enhancement method of the invention.
The invention has the beneficial effects that: 1. the single-channel speech enhancement method performs independent neural network modeling on each sub-band of the speech signal, which reduces the task difficulty of the neural network, reduces the model parameters, and achieves lower algorithm complexity; 2. the method uses a neural network model to estimate the a priori signal-to-noise ratio of the signal and combines it with a traditional filtering method for noise reduction, which effectively improves the generalization ability of the neural-network noise reduction algorithm; 3. because an independent neural network model is trained for each sub-band, the mapping precision is higher and a better speech noise reduction effect can be achieved.
Drawings
FIG. 1 is a functional block diagram of a single-channel speech enhancement method of the present invention;
FIG. 2 is a block diagram of the log power feature extraction principle of the single-channel speech enhancement method of the present invention;
FIG. 3 is a block diagram of the BFCC feature extraction principle of the single-channel speech enhancement method of the present invention;
FIG. 4 is a block diagram of the sub-bands of the neural network mapping module of the single channel speech enhancement method of the present invention;
FIG. 5 is a schematic block diagram of memory update in GRU layer of the single channel speech enhancement method of the present invention.
Detailed Description
As shown in fig. 1, the present invention discloses a single-channel speech enhancement method for neural network sub-band modeling, which uses a neural network model to estimate the a priori signal-to-noise ratio of the target speech from Log Power Spectrum (LPS) and Bark cepstral coefficient (BFCC) features, and combines this estimate with Wiener filtering, thereby achieving a good compromise between noise reduction effect and computational complexity. The single-channel speech enhancement method comprises the following steps:
step 1: a single microphone collects a noisy speech signal and sends it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module;
step 2: receiving the noisy speech signal of step 1 with the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module, performing feature extraction on the noisy speech signal with these two modules, and finally sending the extracted features to a frequency band feature division module;
step 3: receiving the features extracted in step 2 with the frequency band feature division module, using it to assign the two groups of extracted features to sub-bands, inputting the features on each sub-band into the corresponding neural network mapping module to estimate the a priori signal-to-noise ratio, and finally combining the estimated a priori signal-to-noise ratios on all sub-bands and sending them to a full-band Wiener filtering module;
step 4: receiving and processing the estimated a priori signal-to-noise ratios on all sub-bands from step 3 with the full-band Wiener filtering module to obtain the enhanced speech signal.
In the single-channel speech enhancement method, 4800 sentences from the Aishell Chinese data set [1] (24 male and 24 female speakers, each contributing 100 sentences) are selected as the clean speech data of the training set; the clean speech is then randomly mixed with 100 different noise types selected from the Freeside website [2], with the mixing signal-to-noise ratio drawn from a uniform distribution over the interval [−5, 20] dB, yielding about 100 hours of noisy training data in total. BFCC features and log power spectrum features are then extracted for each sub-band, the corresponding ideal a priori signal-to-noise ratio values are constructed, and each neural network is trained with the back-propagation algorithm; 10% of all training data is set aside as a validation set, and the model is saved when the loss on the training and validation sets is minimal, giving the neural network mapping models for the different sub-bands. The above is the processing flow of the whole single-channel speech enhancement method and the training process of the neural network models; each key module is described in detail next.
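The random mixing just described can be sketched as follows; this is a minimal illustration in which the `mix_at_snr` helper, the white-noise surrogate signals, and the RNG seed are assumptions for demonstration, not details of the patent:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12          # epsilon guards division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                 # 1 s of surrogate "speech" at 16 kHz
noise = rng.standard_normal(16000)                 # surrogate noise segment
snr_db = rng.uniform(-5.0, 20.0)                   # uniform SNR as in the training setup
noisy = mix_at_snr(clean, noise, snr_db)
```

Applying the same helper with many speech/noise pairs and fresh SNR draws would reproduce the kind of training corpus the paragraph describes.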
As shown in fig. 2, in step 2, the log power spectrum feature extraction module is configured to extract a frequency domain log power feature of the speech signal, and the performing of feature extraction on the noisy speech signal by the log power spectrum feature extraction module further includes the following steps:
The first step: pre-process the speech signal x(n) acquired by the microphone by framing and windowing;
The second step: perform a Fast Fourier Transform (FFT) to obtain the spectrum of the signal and, from it, the frequency-domain power spectrum S²(k);
The third step: apply the natural logarithm operation;
The fourth step: compress the power spectrum in the logarithmic domain to obtain the extracted log power spectrum feature Y_log(k), as shown in the following formula (1):
Y_log(k) = ln(S²(k)), k = 1, 2, ..., N (1)
In the single-channel speech enhancement method, a sampling rate of 16 kHz is adopted, the frame length is 16 ms, and the frame shift is 8 ms, so that N = 129.
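The four extraction steps above can be sketched as a short routine; the Hann window and the exact framing arithmetic are assumptions (the patent fixes only 16 kHz sampling, 16 ms frames, 8 ms shift, and N = 129):

```python
import numpy as np

def lps_features(x, frame_len=256, hop=128):
    """Frame, window, FFT, and log-compress: Y_log(k) = ln(S^2(k)), 129 bins per frame."""
    win = np.hanning(frame_len)                  # window type assumed, not specified
    n_frames = 1 + (len(x) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = x[i * hop: i * hop + frame_len] * win
        spec = np.fft.rfft(frame)                # 129 complex bins for a 256-point frame
        power = np.abs(spec) ** 2                # frequency-domain power spectrum S^2(k)
        feats.append(np.log(power + 1e-12))      # epsilon guards log(0)
    return np.array(feats)

x = np.random.default_rng(1).standard_normal(16000)  # 1 s surrogate signal at 16 kHz
Y_log = lps_features(x)                              # shape: (frames, 129)
```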
As shown in fig. 3, in step 2, the Bark cepstral coefficient feature extraction module performs feature extraction in the frequency domain on the Bark scale, thereby simulating the masking effect of the human auditory system on sound and exploiting the fact that the human ear resolves the low frequencies of a sound signal more finely than the high frequencies, so as to extract a spectral feature that is very close to human subjective perception. The Bark cepstral coefficient extraction module performs the following steps:
step S1: preprocessing the input voice signal x (n) by framing and windowing;
step S2: performing fast Fourier transform to transform the data from a time domain to a frequency domain;
step S3: calculate the frequency-domain power spectrum S²(k);
step S4: pass the calculated frequency-domain power spectrum S²(k) through the Bark filterbank and compute the filtered energy spectrum, as in formula (2):
E(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B (2)
where b is the index of the Bark energy spectrum band, B is the number of Bark filters (here 24), each filter corresponds to one Bark-domain band, and H_b(k) is the transfer function of the b-th Bark-frequency filter, whose expression is given in equation (3);
step S5: take the logarithm of the Bark energy spectrum of each frame and apply a discrete cosine transform (DCT), as in formula (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_{b=1..B} ln(E(b))·cos(πn(2b−1)/(2B)) (4)
where Y_bark(n) is the extracted BFCC feature, n is the band index of the feature, and the feature dimension matches the number of Bark filters, here 24.
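A sketch of the Bark-domain feature chain follows. Since equations (2)-(4) appear only as images in the original, the triangular filter shape, the particular Bark-scale approximation, and the DCT-II variant below are standard choices assumed here for illustration:

```python
import numpy as np

def hz_to_bark(f):
    """A common Bark-scale approximation (assumed; the patent's formula is not shown)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_filterbank(n_filters=24, n_fft=256, sr=16000):
    """Triangular filters spaced uniformly on the Bark scale: one sketch of H_b(k)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)   # 129 bin centre frequencies
    bark = hz_to_bark(freqs)
    edges = np.linspace(bark[0], bark[-1], n_filters + 2)
    fb = np.zeros((n_filters, len(freqs)))
    for b in range(n_filters):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        up = (bark - lo) / (mid - lo)                    # rising slope of the triangle
        down = (hi - bark) / (hi - mid)                  # falling slope
        fb[b] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

def dct2(v):
    """Plain DCT-II, as used to decorrelate the log Bark energies."""
    B = len(v)
    n = np.arange(B)
    return np.array([np.sum(v * np.cos(np.pi * k * (2 * n + 1) / (2 * B)))
                     for k in range(B)])

frame = np.random.default_rng(2).standard_normal(256) * np.hanning(256)
power_spec = np.abs(np.fft.rfft(frame)) ** 2             # S^2(k), 129 bins
fb = bark_filterbank()
energies = fb @ power_spec                               # E(b): 24 Bark-band energies
Y_bark = dct2(np.log(energies + 1e-12))                  # 24-dim BFCC feature vector
```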
In step 3, the frequency band feature division module divides the extracted Bark cepstral coefficient features and log power spectrum features of each frame of the signal into sub-bands, where each sub-band only contains the BFCC and LPS features within its frequency range, as shown in table 1.
TABLE 1 feature assignment for frequency domain subbands
The frequency band characteristic division module further comprises the following steps of:
sub-band division: the frequency domain range of 0-8000Hz is divided into 8 sub-bands and the sub-bands of low frequencies are divided more finely considering that most of the speech signal is concentrated in the low frequency range. In addition, indexes of features on different sub-bands are respectively given according to different numbers of LPS features and BFCC features corresponding to each sub-band, and the indexes are shown in table 1;
characteristic splicing step: and splicing the LPS and BFCC characteristics on each sub-band, and respectively sending the spliced LPS and BFCC characteristics to respective neural network mapping modules for estimation of prior signal-to-noise ratio.
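The division and splicing steps can be sketched as follows. The patent's Table 1 index assignment is not reproduced in the text, so the band boundaries below are illustrative placeholders that merely follow the stated pattern (8 sub-bands over 129 LPS bins and 24 BFCC bands, divided more finely at low frequency):

```python
import numpy as np

# Hypothetical index boundaries (Table 1 of the patent is not reproduced here):
# finer sub-bands at low frequency, coarser at high frequency.
LPS_EDGES = [0, 8, 16, 24, 33, 49, 65, 97, 129]    # partitions the 129 LPS bins
BFCC_EDGES = [0, 3, 6, 9, 12, 15, 18, 21, 24]      # partitions the 24 BFCC bands

def split_subband_features(lps_frame, bfcc_frame):
    """Concatenate each sub-band's LPS and BFCC slices into one network input vector."""
    feats = []
    for b in range(8):
        lps_part = lps_frame[LPS_EDGES[b]:LPS_EDGES[b + 1]]
        bfcc_part = bfcc_frame[BFCC_EDGES[b]:BFCC_EDGES[b + 1]]
        feats.append(np.concatenate([lps_part, bfcc_part]))
    return feats

rng = np.random.default_rng(3)
sub_feats = split_subband_features(rng.standard_normal(129), rng.standard_normal(24))
```

Each element of `sub_feats` would be fed to the neural network mapping module of the corresponding sub-band.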
In step 3, the neural network mapping module models each sub-band feature, and customizes a dedicated noise reduction model applied to different sub-bands. In consideration of the time sequence correlation characteristic of the voice signal, a model with the capability of mapping the prior signal-to-noise ratio is constructed in the neural network mapping module on the basis of a Gated Recurrent Unit (GRU).
As shown in fig. 4, after allocation by the sub-band division module, the features of each sub-band are input into the designed neural network structure for estimation of the a priori signal-to-noise ratio. The neural network mapping module comprises 5 neural layers: the first and last layers are feedforward neural network layers, and the middle three layers are GRU layers. In the feedforward neural network layers, weighted summation is performed in a fully connected manner, followed by nonlinear activation, as shown in formula (5):
h=g(W·X+b) (5)
where W and b are the weights and biases of the neurons, h is the output of the feedforward neural network layer, X is its input, and g(·) denotes the nonlinear activation operation; feedforward layer 1 uses the ReLU activation function, while feedforward layer 2 outputs the a priori signal-to-noise ratio estimate, so no activation operation is applied and only a linear weighted summation is performed.
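A minimal sketch of the two feedforward layers of formula (5), with ReLU activation in layer 1 and a purely linear layer 2; the layer sizes are illustrative, not those of Table 2:

```python
import numpy as np

def ff_layer(X, W, b, activation="relu"):
    """Fully connected layer h = g(W·X + b); activation=None gives the linear output layer."""
    h = W @ X + b
    if activation == "relu":
        h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(6)
x = rng.standard_normal(11)                          # e.g. an 11-dim sub-band feature vector
W1, b1 = rng.standard_normal((32, 11)), np.zeros(32)
W2, b2 = rng.standard_normal((8, 32)), np.zeros(8)
hidden = ff_layer(x, W1, b1)                         # layer 1: ReLU activation
snr_est = ff_layer(hidden, W2, b2, activation=None)  # layer 2: linear a priori SNR estimate
```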
As shown in fig. 5, the memory update mechanism in the neural network mapping module GRU layer is specifically as follows:
The GRU unit combines the feature x_t of the current frame with the retained output h_(t−1) of the previous frame and, through the processing of the update gate and the reset gate, generates the output h_t of the current frame; this process is iterated frame by frame. The calculation formulas of each gate and of the output are as follows:
r_t = σ(W_r·[h_(t−1), x_t]) (6)
z_t = σ(W_z·[h_(t−1), x_t]) (7)
h̃_t = tanh(W_h·[r_t ⊙ h_(t−1), x_t]) (8)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t (9)
where σ(·) and tanh(·) denote the Sigmoid activation function and the hyperbolic tangent activation function respectively, r_t is the output of the reset gate for the current frame, and z_t is the output of the update gate for the current frame.
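The gate arithmetic can be sketched as a single GRU step; the candidate-state and blending rules for h_t follow the standard GRU formulation, assumed here because the corresponding equations are rendered as images in the original, and the weight shapes and random initialization are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    """One GRU update: reset gate, update gate, candidate state, then blend."""
    xh = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ xh)                                       # reset gate r_t
    z = sigmoid(Wz @ xh)                                       # update gate z_t
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # new hidden state h_t

rng = np.random.default_rng(4)
d_in, d_h = 16, 32
Wr, Wz, Wh = (rng.standard_normal((d_h, d_h + d_in)) * 0.1 for _ in range(3))
h = np.zeros(d_h)
for _ in range(5):                                             # iterate over 5 frames
    h = gru_step(rng.standard_normal(d_in), h, Wr, Wz, Wh)
```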
In addition, although the neural network structure is the same for every sub-band, the number of features and the task difficulty differ between sub-bands, so the number of neurons in the neural network model corresponding to each sub-band also differs, as shown in table 2 below.
TABLE 2 neuron configuration for different sub-band neural network modules
In step 3, the a priori signal-to-noise ratios estimated by the neural network mapping modules on the sub-bands are combined to obtain a 129-dimensional output.
In step 4, the full-band wiener filtering module further performs the following steps:
step Y1: calculate the gain function used for filtering, expressed as formula (10):
G(k) = ξ̂(k) / (1 + ξ̂(k)) (10)
where ξ̂(k) is the a priori signal-to-noise ratio value output by the neural network mapping module;
step Y2: filter the input noisy speech with the estimated gain function and finally perform an inverse Fourier transform to obtain the noise-reduced speech signal ŝ(n), as follows:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N (11)
ŝ(n) = IFFT(Ŝ(k)) (12)
Formula (11) is the frequency-domain filtering process of the Wiener filter, where S(k) is the spectrum of the input noisy speech signal, N is the number of frequency points per frame (here 129), and Ŝ(k) is the enhanced speech signal spectrum; the inverse Fourier transform of formula (12) yields the final time-domain signal output ŝ(n).
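The full-band Wiener filtering module can be sketched per frame as below, assuming the textbook Wiener gain G(k) = ξ̂(k)/(1 + ξ̂(k)) for formula (10); the constant SNR vector is only a stand-in for the network's per-bin output:

```python
import numpy as np

def wiener_gain(xi):
    """Gain from the estimated a priori SNR: G(k) = xi / (1 + xi)."""
    return xi / (1.0 + xi)

def enhance_frame(noisy_frame, xi, n_fft=256):
    """Apply the gain in the frequency domain and invert: s_hat = IFFT(G(k)·S(k))."""
    S = np.fft.rfft(noisy_frame, n_fft)        # 129-bin spectrum of the noisy frame
    S_hat = wiener_gain(xi) * S                # per-bin Wiener filtering in frequency
    return np.fft.irfft(S_hat, n_fft)          # back to the time domain

rng = np.random.default_rng(5)
frame = rng.standard_normal(256)               # one 16 ms frame at 16 kHz
xi = np.full(129, 10.0)                        # high a priori SNR -> gain close to 1
out = enhance_frame(frame, xi)
```

With a constant gain across all bins, the output is simply the input frame scaled by that gain, which makes the linearity of the filter easy to check.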
The invention has the beneficial effects that: 1. the single-channel speech enhancement method performs independent neural network modeling on each sub-band of the speech signal, which reduces the task difficulty of the neural network, reduces the model parameters, and achieves lower algorithm complexity; 2. the method uses a neural network model to estimate the a priori signal-to-noise ratio of the signal and combines it with a traditional filtering method for noise reduction, which effectively improves the generalization ability of the neural-network noise reduction algorithm; 3. because an independent neural network model is trained for each sub-band, the mapping precision is higher and a better speech noise reduction effect can be achieved.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (10)
1. A single-channel speech enhancement method for neural network sub-band modeling is characterized by comprising the following steps:
step 1: collecting a noisy speech signal and sending it to a logarithmic power spectrum (LPS) extraction module and a Bark-frequency cepstral coefficient (BFCC) extraction module;
step 2: the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module receive the noisy speech signal from step 1, perform feature extraction on it, and send the extracted features to a band feature division module;
step 3: the band feature division module receives the features extracted in step 2 and assigns them to sub-bands; the features of each sub-band are input to the corresponding neural network mapping module to estimate the prior signal-to-noise ratio, and the estimated prior signal-to-noise ratios of all sub-bands are then combined and sent to a full-band Wiener filtering module;
step 4: the full-band Wiener filtering module receives and processes the prior signal-to-noise ratios estimated over all sub-bands in step 3 to obtain the enhanced speech signal.
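The four claimed steps can be sketched end to end in Python. This is a hypothetical skeleton for illustration only: the per-band neural network is stubbed with a constant placeholder, the BFCC branch is omitted, and all function names are the author's of this note:

```python
import numpy as np

N_BANDS = 8  # sub-bands over 0-8000 Hz

def extract_lps(frame, n_fft=256):
    """Step 2 (LPS branch): log power spectrum of one windowed frame."""
    return np.log(np.abs(np.fft.rfft(frame, n_fft)) ** 2 + 1e-12)

def band_network(band_feats):
    """Step 3 stub: the per-band neural network would map sub-band features
    to a prior-SNR estimate; a constant placeholder stands in for it here."""
    return np.full(band_feats.shape, 1.0)

def enhance(frame, n_fft=256):
    lps = extract_lps(frame, n_fft)                          # step 2
    bands = np.array_split(lps, N_BANDS)                     # step 3: split
    snr = np.concatenate([band_network(b) for b in bands])   # merge: 129 dims
    gain = snr / (1.0 + snr)                                 # step 4: Wiener gain
    return np.fft.irfft(gain * np.fft.rfft(frame, n_fft), n_fft)

y = enhance(np.random.randn(256))
```

The point of the skeleton is the data flow: features are split per sub-band, each sub-band is mapped to a prior SNR independently, and only the final Wiener gain operates on the full 129-bin band.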
2. The single-channel speech enhancement method of claim 1, wherein in step 2, the log-power-spectrum feature extraction module performs feature extraction on the noisy speech signal through the following steps:
the first step: preprocessing the speech signal x(n) collected by the microphone by framing and windowing;
the second step: performing a fast Fourier transform to obtain the spectrum of the signal and the frequency-domain power spectrum S²(k);
the third step: applying the natural logarithm operation;
the fourth step: the power spectrum is compressed in the logarithmic domain, and the extracted log-power-spectrum feature Y_log(k) is obtained, as shown in equation (1):
Y_log(k) = ln(S²(k)), k = 1, 2, ..., N (1)
In the single-channel speech enhancement method, the sampling rate is 16 kHz, the frame length is 16 ms, the frame shift is 8 ms, and N is 129.
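The LPS extraction of claim 2 can be sketched as follows, using the claimed 16 kHz / 16 ms / 8 ms parameters (256-sample frames, 128-sample hop). The Hann window is an assumption, as the claim does not name the window type:

```python
import numpy as np

def lps_features(x, frame_len=256, hop=128, n_fft=256):
    """Frame, window, FFT, and log-compress the power spectrum (eq. (1)).
    At 16 kHz: 16 ms frames (256 samples) with an 8 ms shift (128 samples)."""
    window = np.hanning(frame_len)               # window choice is an assumption
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    spec = np.fft.rfft(frames, n_fft)            # N = 129 frequency bins
    return np.log(np.abs(spec) ** 2 + 1e-12)     # Y_log(k) = ln(S^2(k))

feats = lps_features(np.random.randn(16000))     # one second at 16 kHz
```

One second of audio yields 124 frames of 129-dimensional LPS features under these settings.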
3. The single-channel speech enhancement method of claim 1, wherein in step 2, the Bark cepstral coefficient extraction module performs feature extraction on the noisy speech signal through the following steps:
step S1: preprocessing the input speech signal x(n) by framing and windowing;
step S2: performing a fast Fourier transform to transform the data from the time domain to the frequency domain;
step S3: calculating the frequency-domain power spectrum S²(k);
step S4: passing the computed frequency-domain power spectrum S²(k) through the Bark filter bank and calculating the filtered energy spectrum, as in equation (2):
E(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B (2)
where b is the index of the Bark energy spectrum and B is the number of Bark filters (here 24); each filter corresponds to one Bark-domain band, and the transfer function H_b(k) of the Bark-frequency filter is given by equation (3);
step S5: taking the logarithm of the Bark energy spectrum of each frame and applying the discrete cosine transform (DCT), as in equation (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_b ln(E(b))·cos(πn(2b−1)/(2B)), n = 1, 2, ..., 24 (4)
where Y_bark(n) is the extracted BFCC feature and n is the band index of the feature; the feature dimension matches the number of Bark filters, i.e. 24.
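Steps S4-S5 can be sketched as follows. The claim's Bark filter transfer function of equation (3) is not reproduced here; a uniformly spaced rectangular pooling stands in for the 24 Bark filters, so this is an illustration of the log-then-DCT pipeline only:

```python
import numpy as np

def bfcc_features(power_spec, n_filters=24):
    """Sketch of steps S4-S5: pool the power spectrum with a stand-in
    filter bank, take logs of the band energies, then apply a DCT-II.
    The true Bark filter shapes of equation (3) are not modeled here."""
    bands = np.array_split(power_spec, n_filters)          # stand-in filter bank
    log_energy = np.log(np.array([b.sum() for b in bands]) + 1e-12)
    n = np.arange(n_filters)
    b = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(n, 2 * b + 1) / (2 * n_filters))  # DCT-II basis
    return dct @ log_energy                                # 24-dim Y_bark(n)

power_spec = np.abs(np.fft.rfft(np.random.randn(256), 256)) ** 2
bfcc = bfcc_features(power_spec)
```

A production implementation would replace `np.array_split` with weights computed from the Bark scale, but the 24-dimensional output shape and the log/DCT structure match the claim.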
4. The single-channel speech enhancement method of claim 1, wherein in step 3, the band feature division module sequentially performs the following steps:
a sub-band division step: dividing the 0-8000 Hz frequency range into 8 sub-bands and assigning feature indexes to each sub-band according to the number of LPS and BFCC features that correspond to it;
a feature splicing step: splicing the LPS and BFCC features of each sub-band and sending each spliced vector to its own neural network mapping module for prior signal-to-noise ratio estimation.
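The division and splicing steps can be sketched as follows. The patent says the number of LPS and BFCC features differs per sub-band but does not give the exact index assignment, so the even split used here is an assumption:

```python
import numpy as np

N_BANDS = 8  # sub-bands covering 0-8000 Hz

def split_and_splice(lps, bfcc):
    """Assign the 129 LPS bins and 24 BFCC coefficients to 8 sub-bands and
    concatenate per band. The even index assignment is an assumption; the
    patent only states that the per-band feature counts differ."""
    lps_parts = np.array_split(lps, N_BANDS)    # 129 bins over 8 bands
    bfcc_parts = np.array_split(bfcc, N_BANDS)  # 24 coefficients over 8 bands
    return [np.concatenate([l, b]) for l, b in zip(lps_parts, bfcc_parts)]

feats = split_and_splice(np.random.randn(129), np.random.randn(24))
```

Each of the 8 spliced vectors would then feed its own neural network mapping module; together they carry all 129 + 24 = 153 input features.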
5. The single-channel speech enhancement method of claim 4, wherein in step 3, each neural network mapping module comprises 5 neural layers, of which the first and last are feedforward layers and the middle three are GRU layers; a feedforward layer performs a fully connected weighted summation followed by a nonlinear activation, as in equation (5):
h = g(W·X + b) (5)
where W and b are the weights and bias of the neurons, X is the input of the feedforward layer, h is its output, and g(·) denotes the nonlinear activation operation. Feedforward layer 1 uses the ReLU activation function; feedforward layer 2 must output the prior signal-to-noise ratio estimate, so no activation is applied and only the linear weighted summation is performed.
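Equation (5) and the two activation choices can be sketched in a few lines (the function name and the tiny example weights are illustrative only):

```python
import numpy as np

def ff_layer(x, W, b, activation="relu"):
    """Equation (5): h = g(W·X + b). The output layer produces the prior-SNR
    estimate directly, so it is called with activation=None (linear)."""
    h = W @ x + b
    if activation == "relu":
        h = np.maximum(h, 0.0)                   # g(.) = ReLU
    return h

W = np.array([[1.0, -1.0], [0.5, 0.5]])
b = np.zeros(2)
x = np.array([2.0, 3.0])
hidden = ff_layer(x, W, b)                       # ReLU layer (layer 1)
out = ff_layer(x, W, b, activation=None)         # linear output layer (layer 2)
```

The only difference between the two layers is whether the ReLU clamp is applied after the weighted sum, which is exactly the distinction the claim draws.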
6. The single-channel speech enhancement method of claim 5, wherein the memory update mechanism in the GRU layers of the neural network mapping module is as follows:
the GRU unit combines the feature x_t of the current frame with the retained output h_{t-1} of the previous frame and, through the processing of the update gate and the reset gate, produces the output h_t of the current frame; this is repeated and iterated frame by frame. The gates and the output are computed as:
r_t = σ(W_r·[h_{t-1}, x_t]) (6)
z_t = σ(W_z·[h_{t-1}, x_t]) (7)
h̃_t = tanh(W_h·[r_t ∗ h_{t-1}, x_t]) (8)
h_t = (1 − z_t) ∗ h_{t-1} + z_t ∗ h̃_t (9)
where σ(·) and tanh(·) denote the Sigmoid and hyperbolic tangent activation functions respectively, r_t is the output of the reset gate and z_t is the output of the update gate for the current frame.
7. The single-channel speech enhancement method of claim 1, wherein in step 3, the prior signal-to-noise ratio values on each sub-band estimated by the neural network mapping modules are combined to obtain a 129-dimensional output.
8. The single-channel speech enhancement method of claim 7, wherein in step 4, the full-band Wiener filtering module performs the following steps:
step Y1: calculating the gain function used for filtering, expressed as equation (10):
G(k) = ξ̂(k)/(1 + ξ̂(k)) (10)
where ξ̂(k) is the prior signal-to-noise ratio value output by the neural network mapping modules;
step Y2: filtering the input noisy speech with the estimated gain function and finally performing an inverse Fourier transform to obtain the noise-reduced speech signal ŝ(n), as follows:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N (11)
ŝ(n) = IFFT(Ŝ(k)) (12)
Equation (11) is the frequency-domain filtering process of Wiener filtering, where S(k) is the spectrum of the input noisy speech signal and N is the number of frequency points per frame (here 129); Ŝ(k) is the enhanced speech signal spectrum, and the inverse Fourier transform of equation (12) yields the final time-domain signal output ŝ(n).
9. A single channel speech enhancement system with neural network sub-band modeling, comprising: memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the single channel speech enhancement method of any of claims 1-8 when invoked by the processor.
10. A computer-readable storage medium characterized by: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the single channel speech enhancement method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010872886.4A CN111986660A (en) | 2020-08-26 | 2020-08-26 | Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111986660A true CN111986660A (en) | 2020-11-24 |
Family
ID=73440930
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010872886.4A Pending CN111986660A (en) | 2020-08-26 | 2020-08-26 | Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111986660A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050240401A1 (en) * | 2004-04-23 | 2005-10-27 | Acoustic Technologies, Inc. | Noise suppression based on Bark band weiner filtering and modified doblinger noise estimate |
CN102124518A (en) * | 2008-08-05 | 2011-07-13 | 弗朗霍夫应用科学研究促进协会 | Apparatus and method for processing an audio signal for speech enhancement using a feature extraction |
CN107680610A (en) * | 2017-09-27 | 2018-02-09 | 安徽硕威智能科技有限公司 | A kind of speech-enhancement system and method |
CN110085249A (en) * | 2019-05-09 | 2019-08-02 | 南京工程学院 | The single-channel voice Enhancement Method of Recognition with Recurrent Neural Network based on attention gate |
CN110120225A (en) * | 2019-04-01 | 2019-08-13 | 西安电子科技大学 | A kind of audio defeat system and method for the structure based on GRU network |
CN110310656A (en) * | 2019-05-27 | 2019-10-08 | 重庆高开清芯科技产业发展有限公司 | A kind of sound enhancement method |
WO2020107269A1 (en) * | 2018-11-28 | 2020-06-04 | 深圳市汇顶科技股份有限公司 | Self-adaptive speech enhancement method, and electronic device |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516988A (en) * | 2020-12-30 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Audio processing method and device, intelligent equipment and storage medium |
CN113516988B (en) * | 2020-12-30 | 2024-02-23 | 腾讯科技(深圳)有限公司 | Audio processing method and device, intelligent equipment and storage medium |
CN113571075A (en) * | 2021-01-28 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Audio processing method and device, electronic equipment and storage medium |
CN113077806A (en) * | 2021-03-23 | 2021-07-06 | 杭州朗和科技有限公司 | Audio processing method and device, model training method and device, medium and equipment |
CN113077806B (en) * | 2021-03-23 | 2023-10-13 | 杭州网易智企科技有限公司 | Audio processing method and device, model training method and device, medium and equipment |
CN113096679A (en) * | 2021-04-02 | 2021-07-09 | 北京字节跳动网络技术有限公司 | Audio data processing method and device |
CN116403594A (en) * | 2023-06-08 | 2023-07-07 | 澳克多普有限公司 | Speech enhancement method and device based on noise update factor |
CN116403594B (en) * | 2023-06-08 | 2023-08-18 | 澳克多普有限公司 | Speech enhancement method and device based on noise update factor |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |