CN111986660A - Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling


Info

Publication number
CN111986660A
CN111986660A (application CN202010872886.4A)
Authority
CN
China
Prior art keywords
neural network
sub-band
module
signal
Prior art date: 2020-08-26
Legal status: Pending
Application number
CN202010872886.4A
Other languages
Chinese (zh)
Inventor
刘明
孙冲武
周彦兵
赵学华
李欣
Current Assignee
Shenzhen Institute of Information Technology
Original Assignee
Shenzhen Institute of Information Technology
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2020-11-24
Application filed by Shenzhen Institute of Information Technology filed Critical Shenzhen Institute of Information Technology
Priority to CN202010872886.4A
Publication of CN111986660A
Status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/063 Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Noise filtering with processing in the frequency domain
    • G10L25/18 Extracted parameters being spectral information of each sub-band
    • G10L25/21 Extracted parameters being power information
    • G10L25/24 Extracted parameters being the cepstrum
    • G10L25/30 Analysis technique using neural networks
    • G10L25/45 Characterised by the type of analysis window

Abstract

The invention provides a single-channel speech enhancement method, system and storage medium for neural network sub-band modeling. The single-channel speech enhancement method comprises the following steps. Step 1: collect a noisy speech signal and send it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module. Step 2: the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module receive the noisy speech signal from step 1, perform feature extraction on it, and send the extracted features to a frequency band feature division module. Step 3: the frequency band feature division module receives the features extracted in step 2 and assigns them to sub-bands. The invention has the beneficial effects that each sub-band of the speech signal is modeled with an independent neural network, which reduces the task difficulty of the neural network and reduces the number of model parameters.

Description

Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling
Technical Field
The invention relates to the field of data processing, and in particular to a single-channel speech enhancement method, system and storage medium for neural network sub-band modeling.
Background
At present, voice electronic products on the market, such as communication and human-computer interaction products, are affected by various noise interferences. Background noise not only degrades the quality of person-to-person communication, but also poses great challenges for human-computer interaction. For voice-interactive electronic devices such as smart speakers, smart televisions, and vehicle-mounted devices, speech recognition is an indispensable technology, and its accuracy in a quiet environment can fully meet users' needs. When background noise is present, however, the recognition accuracy of the machine drops considerably. It is therefore necessary to apply speech enhancement techniques to denoise the speech signal, reduce the influence of interfering noise, and improve speech quality, so that a machine can achieve high recognition accuracy even in a complex acoustic environment. In addition, for voice products with strict requirements on noise reduction and latency, such as hearing aids, walkie-talkies, and in-ear monitors, the speech enhancement algorithm must not only deliver a good noise reduction effect but also run with low computational cost and low latency.
Disclosure of Invention
The invention provides a single-channel speech enhancement method for neural network sub-band modeling, comprising the following steps:
Step 1: collect a noisy speech signal and send it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module;
Step 2: the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module receive the noisy speech signal from step 1, perform feature extraction on it, and send the extracted features to a frequency band feature division module;
Step 3: the frequency band feature division module receives the features extracted in step 2 and assigns them to sub-bands; the features on each sub-band are input to a corresponding neural network mapping module to estimate the prior signal-to-noise ratio, and the estimated prior signal-to-noise ratios of all sub-bands are finally combined and sent to a full-band Wiener filtering module;
Step 4: the full-band Wiener filtering module receives and processes the prior signal-to-noise ratios estimated on all sub-bands in step 3 to obtain the enhanced speech signal.
As a further improvement of the present invention, in step 2, the feature extraction performed on the noisy speech signal by the logarithmic power spectrum extraction module further comprises the following steps:
First step: preprocess the speech signal x(n) acquired by the microphone by framing and windowing;
Second step: perform a fast Fourier transform to obtain the spectrum of the signal, yielding the frequency-domain power spectrum S²(k);
Third step: apply the natural logarithm;
Fourth step: compress the power spectrum in the logarithmic domain to obtain the extracted logarithmic power spectrum feature Y_log(k), as shown in formula (1):
Y_log(k) = ln(S²(k)), k = 1, 2, ..., N    (1)
In the single-channel speech enhancement method, a sampling rate of 16 kHz is used, each frame is 16 ms long with a frame shift of 8 ms, and N = 129.
As a further improvement of the present invention, in step 2, the feature extraction performed on the noisy speech signal by the Bark cepstral coefficient extraction module further comprises the following steps:
Step S1: preprocess the input speech signal x(n) by framing and windowing;
Step S2: perform a fast Fourier transform to transform the data from the time domain to the frequency domain;
Step S3: calculate the frequency-domain power spectrum S²(k);
Step S4: pass the computed frequency-domain power spectrum S²(k) through the Bark filters and calculate the filtered energy spectrum, as in formula (2):
S_bark(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B    (2)
where b is the index of the Bark energy spectrum and B is the number of Bark filters, here taken as 24; each filter corresponds to one Bark-domain band, and H_b(k) is the transfer function of the b-th Bark frequency filter, given by formula (3) (rendered as an image in the original document and not reproduced here);
Step S5: take the logarithm of the Bark energy spectrum of each frame and apply a Discrete Cosine Transform (DCT), as in formula (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_{b=1}^{B} ln(S_bark(b))·cos(πn(b - 0.5)/B), n = 1, 2, ..., B    (4)
where Y_bark(n) are the extracted BFCC features and n is the feature band index; the feature dimension matches the number of Bark filters, i.e. 24 dimensions.
As a further improvement of the present invention, in step 3, the frequency band feature division module further performs the following steps in sequence:
Sub-band division step: divide the 0-8000 Hz frequency range into 8 sub-bands, and assign indexes to the features on the different sub-bands according to the numbers of LPS and BFCC features corresponding to each sub-band;
Feature splicing step: splice the LPS and BFCC features on each sub-band and send each spliced feature vector to its own neural network mapping module for prior signal-to-noise ratio estimation.
As a further improvement of the present invention, in step 3, the neural network mapping module comprises 5 neural layers: the first and last layers are feedforward layers and the middle three layers are GRU layers. Each feedforward layer performs a fully connected weighted summation followed by nonlinear activation, as shown in formula (5):
h = g(W·X + b)    (5)
where W and b are the weights and bias of the neurons, h is the output of the feedforward layer, X is its input, and g(·) denotes the nonlinear activation. Feedforward layer 1 uses the ReLU activation function; feedforward layer 2 produces the prior signal-to-noise ratio estimate, so no activation is applied and only a linear weighted summation is performed.
As a further improvement of the present invention, the memory update mechanism in the GRU layers of the neural network mapping module is as follows:
the GRU unit combines the input feature x_t of the current frame with the retained output h_{t-1} of the previous frame and, through the processing of the update gate and the reset gate, produces the output h_t of the current frame; this process is iterated frame by frame. The computation of each gate and of the output is:
r_t = σ(W_r·[h_{t-1}, x_t])    (6)
z_t = σ(W_z·[h_{t-1}, x_t])    (7)
h̃_t = tanh(W_h·[r_t ⊙ h_{t-1}, x_t])    (8)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t    (9)
where σ(·) and tanh(·) denote the Sigmoid and hyperbolic tangent activation functions, r_t is the output of the reset gate for the current frame, z_t is the output of the update gate, h̃_t is the candidate state, and ⊙ denotes element-wise multiplication.
As a further improvement of the present invention, in step 3, the prior signal-to-noise ratio values estimated on the sub-bands by the neural network mapping modules are combined to obtain a 129-dimensional output.
As a further improvement of the present invention, in step 4, the full-band Wiener filtering module further performs the following steps:
Step Y1: calculate the gain function used for filtering, expressed as formula (10):
G(k) = ξ̂(k) / (1 + ξ̂(k))    (10)
where ξ̂(k) is the prior signal-to-noise ratio value output by the neural network mapping modules;
Step Y2: filter the input noisy speech with the estimated gain function, and finally apply an inverse Fourier transform to obtain the denoised speech signal ŝ(n), according to the formulas:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N    (11)
ŝ(n) = IFFT(Ŝ(k))    (12)
Formula (11) is the frequency-domain filtering process of Wiener filtering, where S(k) is the spectrum of the input noisy speech signal, N is the number of frequency points per frame (here 129), and Ŝ(k) is the enhanced speech spectrum; the inverse Fourier transform of formula (12) yields the final time-domain signal output ŝ(n).
The invention also discloses a single-channel speech enhancement system for neural network sub-band modeling, which comprises: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the single-channel speech enhancement method of the present invention when invoked by the processor.
The invention also discloses a computer readable storage medium storing a computer program configured to, when invoked by a processor, implement the steps of the single channel speech enhancement method of the invention.
The invention has the beneficial effects that: 1. the single-channel speech enhancement method models each sub-band of the speech signal with an independent neural network, which reduces the task difficulty of the neural network, reduces the number of model parameters, and achieves lower algorithm complexity; 2. the method uses a neural network model to estimate the prior signal-to-noise ratio of the signal and combines it with traditional filtering for noise reduction, which effectively improves the generalization ability of the neural-network noise reduction algorithm; 3. because a separate neural network model is trained for each sub-band, the mapping accuracy is higher and a better speech noise reduction effect can be achieved.
Drawings
FIG. 1 is a functional block diagram of a single-channel speech enhancement method of the present invention;
FIG. 2 is a block diagram of the log power feature extraction principle of the single-channel speech enhancement method of the present invention;
FIG. 3 is a block diagram of the BFCC feature extraction principle of the single-channel speech enhancement method of the present invention;
FIG. 4 is a block diagram of the sub-bands of the neural network mapping module of the single channel speech enhancement method of the present invention;
FIG. 5 is a schematic block diagram of memory update in GRU layer of the single channel speech enhancement method of the present invention.
Detailed Description
As shown in fig. 1, the present invention discloses a single-channel speech enhancement method for neural network sub-band modeling, which uses Log Power Spectrum (LPS) and Bark-Frequency Cepstral Coefficient (BFCC) features to estimate the prior signal-to-noise ratio of the target speech with a neural network model, and combines this with Wiener filtering, thereby achieving a good trade-off between noise reduction effect and computational complexity. The single-channel speech enhancement method comprises the following steps:
Step 1: a single microphone collects a noisy speech signal and sends it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module;
Step 2: the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module receive the noisy speech signal from step 1, perform feature extraction on it, and send the extracted features to a frequency band feature division module;
Step 3: the frequency band feature division module receives the features extracted in step 2 and assigns the two groups of extracted features to sub-bands; the features on each sub-band are input to a corresponding neural network mapping module to estimate the prior signal-to-noise ratio, and the estimated prior signal-to-noise ratios of all sub-bands are finally combined and sent to a full-band Wiener filtering module;
Step 4: the full-band Wiener filtering module receives and processes the prior signal-to-noise ratios estimated on all sub-bands in step 3 to obtain the enhanced speech signal.
In the single-channel speech enhancement method, 4800 utterances from the Aishell Chinese data set [1] (24 male and 24 female speakers, 100 utterances each) are selected as the clean speech data of the training set. The clean speech is then randomly mixed with 100 different noise types selected from the Freeside website [2], with mixing signal-to-noise ratios drawn from a uniform distribution over the interval [-5, 20] dB, yielding about 100 hours of noisy training data in total. BFCC and logarithmic power spectrum features are then extracted for each sub-band, the corresponding ideal prior signal-to-noise ratio targets are constructed, and each neural network is trained with the back-propagation algorithm; 10% of all training data is split off as a validation set, and the model is saved when the loss on the training and validation sets is lowest, giving the neural network mapping models for the different sub-bands. The above is the processing flow of the whole single-channel speech enhancement method and the training process of the neural network models; each key module is described in detail next.
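For illustration, the data-preparation step above can be sketched as follows. This is a minimal Python sketch, not the authors' code: the stand-in signals, the looping of short noise clips, and the scaling convention are assumptions; only the uniform SNR draw from [-5, 20] dB follows the text.

    import numpy as np

    def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Mix clean speech with noise at a given SNR in dB."""
        # Loop the noise so it covers the whole clean utterance, then trim.
        if len(noise) < len(clean):
            noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
        noise = noise[: len(clean)]
        # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
        return clean + scale * noise

    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)   # stand-in for 1 s of clean speech at 16 kHz
    noise = rng.standard_normal(8000)    # stand-in for a noise clip
    noisy = mix_at_snr(clean, noise, rng.uniform(-5.0, 20.0))  # SNR ~ U[-5, 20] dB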
As shown in fig. 2, in step 2, the logarithmic power spectrum feature extraction module is configured to extract the frequency-domain logarithmic power features of the speech signal, and performs the following steps on the noisy speech signal:
First step: preprocess the speech signal x(n) acquired by the microphone by framing and windowing;
Second step: perform a Fast Fourier Transform (FFT) to obtain the spectrum of the signal, yielding the frequency-domain power spectrum S²(k);
Third step: apply the natural logarithm;
Fourth step: compress the power spectrum in the logarithmic domain to obtain the extracted logarithmic power spectrum feature Y_log(k), as shown in formula (1):
Y_log(k) = ln(S²(k)), k = 1, 2, ..., N    (1)
In the single-channel speech enhancement method, a sampling rate of 16 kHz is used, each frame is 16 ms long with a frame shift of 8 ms, so N = 129.
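For illustration, the four steps can be written as the following minimal Python sketch; the Hann window and the framing details are assumptions, while the 16 kHz rate, 16 ms frame, 8 ms shift, and N = 129 follow the text.

    import numpy as np

    def log_power_spectrum(x: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
        """Framing, windowing, FFT, and log power; returns (num_frames, 129)."""
        window = np.hanning(frame_len)               # window choice is an assumption
        num_frames = 1 + (len(x) - frame_len) // hop
        feats = np.empty((num_frames, frame_len // 2 + 1))
        for t in range(num_frames):
            frame = x[t * hop : t * hop + frame_len] * window
            power = np.abs(np.fft.rfft(frame)) ** 2  # S^2(k), 256-point FFT -> 129 bins
            feats[t] = np.log(power + 1e-12)         # Y_log(k) = ln(S^2(k)), formula (1)
        return feats

    x = np.random.default_rng(0).standard_normal(16000)  # 1 s of audio at 16 kHz
    Y_log = log_power_spectrum(x)                        # shape (124, 129)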
As shown in fig. 3, in step 2, the Bark cepstral coefficient feature extraction module extracts features in the frequency domain on the Bark scale, simulating the masking effect of the human auditory system on sound and exploiting the fact that the human ear resolves low frequencies more finely than high frequencies, so as to extract a spectral feature that is very close to human subjective perception. The Bark cepstral coefficient extraction module performs the following steps:
Step S1: preprocess the input speech signal x(n) by framing and windowing;
Step S2: perform a fast Fourier transform to transform the data from the time domain to the frequency domain;
Step S3: calculate the frequency-domain power spectrum S²(k);
Step S4: pass the computed frequency-domain power spectrum S²(k) through the Bark filters and calculate the filtered energy spectrum, as in formula (2):
S_bark(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B    (2)
where b is the index of the Bark energy spectrum and B is the number of Bark filters, here taken as 24; each filter corresponds to one Bark-domain band, and H_b(k) is the transfer function of the b-th Bark frequency filter, given by formula (3) (rendered as an image in the original document and not reproduced here);
Step S5: take the logarithm of the Bark energy spectrum of each frame and apply a Discrete Cosine Transform (DCT), as in formula (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_{b=1}^{B} ln(S_bark(b))·cos(πn(b - 0.5)/B), n = 1, 2, ..., B    (4)
where Y_bark(n) are the extracted BFCC features and n is the feature band index; the feature dimension matches the number of Bark filters, i.e. 24 dimensions.
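A rough Python sketch of steps S1-S5 (applied to precomputed power spectra) follows. Since formula (3) is not reproduced, the sketch substitutes a generic triangular filterbank spaced on a Zwicker-style Bark scale; the filter shape and the Hz-to-Bark mapping are therefore assumptions, and only the overall pipeline (power spectrum, Bark energies, logarithm, DCT) follows the text.

    import numpy as np

    def hz_to_bark(f):
        """Zwicker-style Hz-to-Bark mapping (an assumption, not the patent's)."""
        return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

    def bark_filterbank(num_filters: int = 24, n_bins: int = 129, sr: int = 16000):
        """Triangular filters spaced uniformly on the Bark scale."""
        bark = hz_to_bark(np.linspace(0, sr / 2, n_bins))
        edges = np.linspace(bark[0], bark[-1], num_filters + 2)
        fb = np.zeros((num_filters, n_bins))
        for b in range(num_filters):
            lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
            fb[b] = np.clip(np.minimum((bark - lo) / (mid - lo),
                                       (hi - bark) / (hi - mid)), 0.0, None)
        return fb

    def bfcc(power_spec: np.ndarray, fb: np.ndarray) -> np.ndarray:
        """Steps S4-S5: Bark energies (formula (2)), log, DCT (formula (4))."""
        log_e = np.log(power_spec @ fb.T + 1e-12)   # filtered energy spectrum, logged
        B = fb.shape[0]
        basis = np.cos(np.pi * np.outer(np.arange(B), np.arange(B) + 0.5) / B)
        return log_e @ basis.T                      # 24-dimensional BFCC per frame

    ps = np.abs(np.fft.rfft(np.random.default_rng(0).standard_normal((10, 256)), axis=1)) ** 2
    feats = bfcc(ps, bark_filterbank())             # shape (10, 24)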
In step 3, the frequency band feature division module divides the extracted Bark cepstral coefficient features and logarithmic power spectrum features of each frame into sub-bands, where each sub-band only contains the BFCC and LPS features within its frequency range, as shown in Table 1.
TABLE 1 Feature assignment for frequency-domain sub-bands
(The table was rendered as an image in the original document; the per-band feature index assignments are not recoverable here.)
The frequency band feature division module performs the following steps in sequence:
Sub-band division step: the 0-8000 Hz frequency range is divided into 8 sub-bands, with the low-frequency sub-bands divided more finely since most of the speech energy is concentrated in the low-frequency range. In addition, indexes are assigned to the features on the different sub-bands according to the numbers of LPS and BFCC features corresponding to each sub-band, as shown in Table 1;
Feature splicing step: the LPS and BFCC features on each sub-band are spliced and sent to their respective neural network mapping modules for prior signal-to-noise ratio estimation (see the sketch after this list).
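The sketch below illustrates the two steps. The band edges and the assignment of BFCC dimensions to frequency ranges are hypothetical, since Table 1's exact index assignments are not recoverable; only the overall scheme (8 bands, finer at low frequencies, per-band splicing of LPS and BFCC features) follows the text.

    import numpy as np

    # Hypothetical sub-band edges in Hz, finer at low frequencies.
    BAND_EDGES_HZ = [0, 500, 1000, 1500, 2000, 3000, 4000, 6000, 8000]

    def split_features(lps: np.ndarray, bfcc_feats: np.ndarray,
                       sr: int = 16000, n_bark: int = 24) -> list:
        """Assign LPS bins and BFCC dimensions to 8 sub-bands and splice per band."""
        bin_freqs = np.linspace(0, sr / 2, lps.shape[1])
        # Treat each BFCC dimension as band-local, as Table 1 does (an assumption).
        bark_centres = np.linspace(0, sr / 2, n_bark)
        bands = []
        for lo, hi in zip(BAND_EDGES_HZ[:-1], BAND_EDGES_HZ[1:]):
            upper = hi + 1 if hi == BAND_EDGES_HZ[-1] else hi  # keep the 8000 Hz bin
            lps_idx = np.where((bin_freqs >= lo) & (bin_freqs < upper))[0]
            bfcc_idx = np.where((bark_centres >= lo) & (bark_centres < upper))[0]
            # Feature splicing: concatenate the band's LPS and BFCC features.
            bands.append(np.concatenate([lps[:, lps_idx], bfcc_feats[:, bfcc_idx]], axis=1))
        return bands

    lps = np.random.default_rng(0).standard_normal((100, 129))
    bf = np.random.default_rng(1).standard_normal((100, 24))
    band_feats = split_features(lps, bf)   # 8 arrays, one (frames, d_b) per sub-band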
In step 3, the neural network mapping module models the features of each sub-band, customizing a dedicated noise reduction model for each sub-band. Considering the temporal correlation of speech signals, the neural network mapping module builds its prior signal-to-noise ratio mapping model on Gated Recurrent Units (GRU).
As shown in fig. 4, after assignment by the sub-band division module, the features of each sub-band are input into the designed neural network structure to estimate the prior signal-to-noise ratio ξ̂(k). The neural network mapping module comprises 5 neural layers: the first and last layers are feedforward layers and the middle three layers are GRU layers. Each feedforward layer performs a fully connected weighted summation followed by nonlinear activation, as shown in formula (5):
h = g(W·X + b)    (5)
where W and b are the weights and bias of the neurons, h is the output of the feedforward layer, X is its input, and g(·) denotes the nonlinear activation. Feedforward layer 1 uses the ReLU activation function; feedforward layer 2 produces the prior signal-to-noise ratio estimate, so no activation is applied and only a linear weighted summation is performed.
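A minimal PyTorch sketch of this five-layer structure follows; the framework choice is ours, and the layer widths are placeholders since Table 2's per-band neuron counts are not recoverable.

    import torch
    import torch.nn as nn

    class SubbandSNRMapper(nn.Module):
        """Feedforward (ReLU) -> 3 GRU layers -> feedforward (linear), per the text."""
        def __init__(self, in_dim: int, hidden: int, out_dim: int):
            super().__init__()
            self.ff1 = nn.Linear(in_dim, hidden)   # feedforward layer 1, formula (5)
            self.act = nn.ReLU()
            self.gru = nn.GRU(hidden, hidden, num_layers=3, batch_first=True)
            self.ff2 = nn.Linear(hidden, out_dim)  # feedforward layer 2, no activation

        def forward(self, x):                      # x: (batch, frames, in_dim)
            h = self.act(self.ff1(x))
            h, _ = self.gru(h)
            return self.ff2(h)                     # per-frame prior SNR estimates

    net = SubbandSNRMapper(in_dim=20, hidden=64, out_dim=16)  # illustrative sizes
    snr = net(torch.randn(1, 100, 20))             # shape (1, 100, 16)

One such model would be instantiated per sub-band, with input and output sizes matching that band's feature count and frequency bins.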
As shown in fig. 5, the memory update mechanism in the GRU layers of the neural network mapping module is as follows:
the GRU unit combines the input feature x_t of the current frame with the retained output h_{t-1} of the previous frame and, through the processing of the update gate and the reset gate, produces the output h_t of the current frame; this process is iterated frame by frame. The computation of each gate and of the output is:
r_t = σ(W_r·[h_{t-1}, x_t])    (6)
z_t = σ(W_z·[h_{t-1}, x_t])    (7)
h̃_t = tanh(W_h·[r_t ⊙ h_{t-1}, x_t])    (8)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t    (9)
where σ(·) and tanh(·) denote the Sigmoid and hyperbolic tangent activation functions, r_t is the output of the reset gate for the current frame, z_t is the output of the update gate, h̃_t is the candidate state, and ⊙ denotes element-wise multiplication.
In addition, since the number of features differs between sub-bands, and the task difficulty of each sub-band also differs, the number of neurons in the neural network model of each sub-band is different even though the network structure is the same, as shown in Table 2.
TABLE 2 Neuron configuration of the different sub-band neural network modules
(The table was rendered as an image in the original document; the per-band neuron counts are not recoverable here.)
In step 3, the prior signal-to-noise ratios estimated on the sub-bands by the neural network mapping modules are combined to obtain a 129-dimensional output.
In step 4, the full-band Wiener filtering module further performs the following steps:
Step Y1: calculate the gain function used for filtering, expressed as formula (10):
G(k) = ξ̂(k) / (1 + ξ̂(k))    (10)
where ξ̂(k) is the prior signal-to-noise ratio value output by the neural network mapping modules;
Step Y2: filter the input noisy speech with the estimated gain function, and finally apply an inverse Fourier transform to obtain the denoised speech signal ŝ(n), according to the formulas:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N    (11)
ŝ(n) = IFFT(Ŝ(k))    (12)
Formula (11) is the frequency-domain filtering process of Wiener filtering, where S(k) is the spectrum of the input noisy speech signal, N is the number of frequency points per frame (here 129), and Ŝ(k) is the enhanced speech spectrum; the inverse Fourier transform of formula (12) yields the final time-domain signal output ŝ(n).
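Steps Y1 and Y2 amount to a per-frame gain followed by an inverse transform, as in the sketch below; the random inputs merely stand in for a real noisy frame and the networks' prior SNR estimates, and a complete system would apply windowed overlap-add across frames (8 ms hop) to reconstruct the waveform.

    import numpy as np

    def wiener_enhance(noisy_spec: np.ndarray, xi_hat: np.ndarray) -> np.ndarray:
        """Formulas (10)-(12): Wiener gain, frequency-domain filtering, inverse FFT."""
        gain = xi_hat / (1.0 + xi_hat)             # G(k), formula (10)
        enhanced_spec = gain * noisy_spec          # S_hat(k) = G(k)*S(k), formula (11)
        return np.fft.irfft(enhanced_spec, n=256)  # s_hat(n), formula (12)

    rng = np.random.default_rng(0)
    noisy_spec = np.fft.rfft(rng.standard_normal(256))   # 129-bin spectrum of one frame
    xi_hat = np.abs(rng.standard_normal(129))            # stand-in for the prior SNR
    enhanced_frame = wiener_enhance(noisy_spec, xi_hat)  # one enhanced time-domain frame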
The invention has the beneficial effects that: 1. the single-channel speech enhancement method models each sub-band of the speech signal with an independent neural network, which reduces the task difficulty of the neural network, reduces the number of model parameters, and achieves lower algorithm complexity; 2. the method uses a neural network model to estimate the prior signal-to-noise ratio of the signal and combines it with traditional filtering for noise reduction, which effectively improves the generalization ability of the neural-network noise reduction algorithm; 3. because a separate neural network model is trained for each sub-band, the mapping accuracy is higher and a better speech noise reduction effect can be achieved.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments, and the concrete implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the concept of the invention, and all of these shall be considered to fall within the protection scope of the invention.

Claims (10)

1. A single-channel speech enhancement method for neural network sub-band modeling, characterized by comprising the following steps:
Step 1: collecting a noisy speech signal and sending it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module;
Step 2: receiving, by the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module, the noisy speech signal of step 1, performing feature extraction on it, and sending the extracted features to a frequency band feature division module;
Step 3: receiving, by the frequency band feature division module, the features extracted in step 2 and assigning them to sub-bands; inputting the features on each sub-band to a corresponding neural network mapping module to estimate the prior signal-to-noise ratio; and finally combining the estimated prior signal-to-noise ratios of all sub-bands and sending them to a full-band Wiener filtering module;
Step 4: receiving and processing, by the full-band Wiener filtering module, the prior signal-to-noise ratios estimated on all sub-bands in step 3 to obtain an enhanced speech signal.
2. The single-channel speech enhancement method of claim 1, wherein in step 2 the logarithmic power spectrum feature extraction module performs feature extraction on the noisy speech signal through the following steps:
First step: preprocessing the speech signal x(n) acquired by the microphone by framing and windowing;
Second step: performing a fast Fourier transform to obtain the spectrum of the signal, yielding the frequency-domain power spectrum S²(k);
Third step: applying the natural logarithm;
Fourth step: compressing the power spectrum in the logarithmic domain to obtain the extracted logarithmic power spectrum feature Y_log(k), as shown in formula (1):
Y_log(k) = ln(S²(k)), k = 1, 2, ..., N    (1)
wherein a sampling rate of 16 kHz is used, each frame is 16 ms long with a frame shift of 8 ms, and N = 129.
3. The single-channel speech enhancement method of claim 1, wherein in step 2 the Bark cepstral coefficient extraction module performs feature extraction on the noisy speech signal through the following steps:
Step S1: preprocessing the input speech signal x(n) by framing and windowing;
Step S2: performing a fast Fourier transform to transform the data from the time domain to the frequency domain;
Step S3: calculating the frequency-domain power spectrum S²(k);
Step S4: passing the computed frequency-domain power spectrum S²(k) through the Bark filters and calculating the filtered energy spectrum, as in formula (2):
S_bark(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B    (2)
where b is the index of the Bark energy spectrum and B is the number of Bark filters, here taken as 24; each filter corresponds to one Bark-domain band, and H_b(k) is the transfer function of the b-th Bark frequency filter, given by formula (3) (rendered as an image in the original document and not reproduced here);
Step S5: taking the logarithm of the Bark energy spectrum of each frame and applying a Discrete Cosine Transform (DCT), as in formula (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_{b=1}^{B} ln(S_bark(b))·cos(πn(b - 0.5)/B), n = 1, 2, ..., B    (4)
where Y_bark(n) are the extracted BFCC features and n is the feature band index; the feature dimension matches the number of Bark filters, i.e. 24 dimensions.
4. The single-channel speech enhancement method of claim 1, wherein in step 3 the frequency band feature division module performs the following steps in sequence:
Sub-band division step: dividing the 0-8000 Hz frequency range into 8 sub-bands, and assigning indexes to the features on the different sub-bands according to the numbers of LPS and BFCC features corresponding to each sub-band;
Feature splicing step: splicing the LPS and BFCC features on each sub-band and sending each spliced feature vector to its own neural network mapping module for prior signal-to-noise ratio estimation.
5. The single-channel speech enhancement method of claim 4, wherein in step 3 the neural network mapping module comprises 5 neural layers: the first and last layers are feedforward layers and the middle three layers are GRU layers; each feedforward layer performs a fully connected weighted summation followed by nonlinear activation, as shown in formula (5):
h = g(W·X + b)    (5)
where W and b are the weights and bias of the neurons, h is the output of the feedforward layer, X is its input, and g(·) denotes the nonlinear activation; feedforward layer 1 uses the ReLU activation function, while feedforward layer 2 produces the prior signal-to-noise ratio estimate, so no activation is applied and only a linear weighted summation is performed.
6. The single-channel speech enhancement method of claim 5, wherein the memory update mechanism in the GRU layers of the neural network mapping module is as follows:
the GRU unit combines the input feature x_t of the current frame with the retained output h_{t-1} of the previous frame and, through the processing of the update gate and the reset gate, produces the output h_t of the current frame, iterating frame by frame; the computation of each gate and of the output is:
r_t = σ(W_r·[h_{t-1}, x_t])    (6)
z_t = σ(W_z·[h_{t-1}, x_t])    (7)
h̃_t = tanh(W_h·[r_t ⊙ h_{t-1}, x_t])    (8)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t    (9)
where σ(·) and tanh(·) denote the Sigmoid and hyperbolic tangent activation functions, r_t is the output of the reset gate for the current frame, z_t is the output of the update gate, h̃_t is the candidate state, and ⊙ denotes element-wise multiplication.
7. The single-channel speech enhancement method of claim 1, wherein in step 3, the a priori SNR values on each subband estimated by the neural network mapping module are combined to obtain a 129-dimensional output.
8. The single-channel speech enhancement method of claim 7, wherein in step 4 the full-band Wiener filtering module performs the following steps:
Step Y1: calculating the gain function used for filtering, expressed as formula (10):
G(k) = ξ̂(k) / (1 + ξ̂(k))    (10)
where ξ̂(k) is the prior signal-to-noise ratio value output by the neural network mapping modules;
Step Y2: filtering the input noisy speech with the estimated gain function, and finally applying an inverse Fourier transform to obtain the denoised speech signal ŝ(n), according to the formulas:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N    (11)
ŝ(n) = IFFT(Ŝ(k))    (12)
Formula (11) is the frequency-domain filtering process of Wiener filtering, where S(k) is the spectrum of the input noisy speech signal, N is the number of frequency points per frame (here 129), and Ŝ(k) is the enhanced speech spectrum; the inverse Fourier transform of formula (12) yields the final time-domain signal output ŝ(n).
9. A single-channel speech enhancement system for neural network sub-band modeling, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the single-channel speech enhancement method of any of claims 1-8 when invoked by the processor.
10. A computer-readable storage medium characterized by: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the single channel speech enhancement method of any of claims 1-8.
CN202010872886.4A, filed 2020-08-26: Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling (Pending, CN111986660A)

Priority Applications (1)

Application Number: CN202010872886.4A; Priority/Filing Date: 2020-08-26; Title: Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling

Publications (1)

Publication Number: CN111986660A; Publication Date: 2020-11-24

Family ID: 73440930

Family Applications (1)

Application Number: CN202010872886.4A; Title: Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling; Country: CN (China)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050240401A1 (en) * 2004-04-23 2005-10-27 Acoustic Technologies, Inc. Noise suppression based on Bark band weiner filtering and modified doblinger noise estimate
CN102124518A (en) * 2008-08-05 2011-07-13 弗朗霍夫应用科学研究促进协会 Apparatus and method for processing an audio signal for speech enhancement using a feature extraction
CN107680610A (en) * 2017-09-27 2018-02-09 安徽硕威智能科技有限公司 A kind of speech-enhancement system and method
WO2020107269A1 (en) * 2018-11-28 2020-06-04 深圳市汇顶科技股份有限公司 Self-adaptive speech enhancement method, and electronic device
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
CN110085249A (en) * 2019-05-09 2019-08-02 南京工程学院 The single-channel voice Enhancement Method of Recognition with Recurrent Neural Network based on attention gate
CN110310656A (en) * 2019-05-27 2019-10-08 重庆高开清芯科技产业发展有限公司 A kind of sound enhancement method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113516988A (en) * 2020-12-30 2021-10-19 腾讯科技(深圳)有限公司 Audio processing method and device, intelligent equipment and storage medium
CN113516988B (en) * 2020-12-30 2024-02-23 腾讯科技(深圳)有限公司 Audio processing method and device, intelligent equipment and storage medium
CN113571075A (en) * 2021-01-28 2021-10-29 腾讯科技(深圳)有限公司 Audio processing method and device, electronic equipment and storage medium
CN113077806A (en) * 2021-03-23 2021-07-06 杭州朗和科技有限公司 Audio processing method and device, model training method and device, medium and equipment
CN113077806B (en) * 2021-03-23 2023-10-13 杭州网易智企科技有限公司 Audio processing method and device, model training method and device, medium and equipment
CN113096679A (en) * 2021-04-02 2021-07-09 北京字节跳动网络技术有限公司 Audio data processing method and device
CN116403594A (en) * 2023-06-08 2023-07-07 澳克多普有限公司 Speech enhancement method and device based on noise update factor
CN116403594B (en) * 2023-06-08 2023-08-18 澳克多普有限公司 Speech enhancement method and device based on noise update factor


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination