CN111986660A - Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling - Google Patents
- Publication number: CN111986660A (application number CN202010872886.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (creation of reference templates; training of speech recognition systems)
- G10L21/0216—Noise filtering characterised by the method used for estimating noise (speech enhancement, e.g. noise reduction or echo cancellation)
- G10L21/0232—Processing in the frequency domain
- G10L25/18—Extracted parameters being spectral information of each sub-band
- G10L25/21—Extracted parameters being power information
- G10L25/24—Extracted parameters being the cepstrum
- G10L25/30—Analysis technique using neural networks
- G10L25/45—Characterised by the type of analysis window
Abstract
The invention provides a single-channel speech enhancement method, system and storage medium for neural network sub-band modeling, wherein the single-channel speech enhancement method comprises the following steps. Step 1: collect a noisy speech signal and send it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module. Step 2: receive the noisy speech signal of step 1 with the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module, perform feature extraction on it with these two modules, and finally send the extracted features to a frequency band feature division module. Step 3: receive the features extracted in step 2 with the frequency band feature division module, which then assigns the extracted features to sub-bands. The beneficial effects of the invention are: the invention performs independent neural network modeling on each sub-band of the speech signal, which reduces the task difficulty of the neural network and reduces the parameters of the model.
Description
Technical Field
The invention relates to the field of data processing, and in particular to a single-channel speech enhancement method, system and storage medium for neural network sub-band modeling.
Background
At present, voice electronic products on the market, such as communication products and human-computer interaction products, are affected by various noise interferences. Background noise not only degrades the quality of communication between people but also poses great challenges to human-computer interaction. For example, for voice-interaction electronic devices such as smart speakers, smart televisions, and vehicle-mounted devices, speech recognition is an indispensable technology, and recognition accuracy in a quiet environment can fully meet users' requirements. When background noise is present, however, the recognition accuracy of the machine is greatly reduced. It is therefore necessary to apply speech enhancement techniques to denoise the speech signal, reduce the influence of interfering noise, and improve speech quality, so that a machine can achieve high recognition accuracy even in a complex acoustic environment. In addition, for voice products with strict requirements on noise reduction and latency, such as hearing aids, walkie-talkies, and in-ear monitors, the speech enhancement algorithm must not only ensure a good noise reduction effect but also offer low computational cost and low latency.
Disclosure of Invention
The invention provides a single-channel speech enhancement method for neural network sub-band modeling, which comprises the following steps:
step 1: collecting a noisy speech signal and sending it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module;
step 2: receiving the noisy speech signal of step 1 with the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module, performing feature extraction on the noisy speech signal with these two modules, and finally sending the extracted features to a frequency band feature division module;
step 3: receiving the features extracted in step 2 with the frequency band feature division module, using it to assign sub-band features to the extracted features, inputting the features on each sub-band to the corresponding neural network mapping module to estimate the a priori signal-to-noise ratio, and finally combining the estimated a priori signal-to-noise ratios on all sub-bands and sending them to a full-band Wiener filtering module;
step 4: receiving and processing the estimated a priori signal-to-noise ratios on all sub-bands from step 3 with the full-band Wiener filtering module to obtain the enhanced speech signal.
As a further improvement of the present invention, in step 2, the feature extraction performed by the log power spectrum extraction module on the noisy speech signal comprises the following steps:
the first step is as follows: preprocessing a voice signal x (n) acquired by a microphone by framing and windowing;
the second step is as follows: performing fast Fourier transform to obtain a frequency spectrum of the signal, and obtaining a power spectrum S2(k) of a frequency domain;
the third step: carrying out natural logarithm operation;
the fourth step: the power spectrum is compressed in the logarithmic domain, and the extracted logarithmic power spectrum characteristic Y is obtainedlog(k) As shown in the following formula (1):
Ylog(k)=ln(S2(k)),k=1,2,...,N (1)
in the single-channel speech enhancement method, a sampling rate of 16kHz is adopted, the frame length of each frame is 16ms, the frame shift is 8ms, and N is 129.
As a further improvement of the present invention, in step 2, the feature extraction performed by the Bark cepstral coefficient extraction module on the noisy speech signal comprises the following steps:
step S1: preprocessing the input voice signal x (n) by framing and windowing;
step S2: performing fast Fourier transform to transform the data from a time domain to a frequency domain;
step S3: calculate the frequency-domain power spectrum S²(k);
step S4: pass the calculated frequency-domain power spectrum S²(k) through the Bark filterbank and compute the filtered energy spectrum, as in formula (2):
E(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B (2)
where b is the index of the Bark energy spectrum band, B is the number of Bark filters (here 24), each filter corresponds to one Bark-domain band, and H_b(k) is the transfer function of the b-th Bark-frequency filter, whose expression is given in equation (3);
step S5: take the logarithm of the Bark energy spectrum of each frame and apply a discrete cosine transform (DCT), as in formula (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_{b=1..B} ln(E(b))·cos(πn(2b−1)/(2B)) (4)
where Y_bark(n) is the extracted BFCC feature, n is the band index of the feature, and the feature dimension matches the number of Bark filters, here 24.
As a further improvement of the present invention, in step 3, the frequency band feature division module sequentially performs the following steps:
sub-band division: dividing the frequency domain range of 0-8000Hz into 8 sub-bands, and respectively giving indexes of features on different sub-bands according to the different numbers of LPS features and BFCC features corresponding to each sub-band;
characteristic splicing step: and splicing the LPS and BFCC characteristics on each sub-band, and respectively sending the spliced LPS and BFCC characteristics to respective neural network mapping modules for estimation of prior signal-to-noise ratio.
As a further improvement of the present invention, in step 3, the neural network mapping module includes 5 neural layers, where the first and last layers are feedforward neural network layers, the middle three layers are GRU neural layers, and weighted summation is performed in the feedforward neural network layers in a fully connected manner, and nonlinear activation is performed, as shown in the following formula (5):
h=g(W·X+b) (5)
where W and b are the weights and biases of the neurons, h is the output of the feedforward neural network layer, X is its input, and g(·) denotes the nonlinear activation operation; feedforward layer 1 uses the ReLU activation function, while feedforward layer 2 outputs the a priori signal-to-noise ratio estimate, so no activation operation is applied and only a linear weighted summation is performed.
As a further improvement of the present invention, the memory update mechanism in the neural network mapping module GRU layer is specifically as follows:
The GRU unit combines the feature x_t of the current frame with the retained output h_(t−1) of the previous frame and, through the processing of the update gate and the reset gate, generates the output h_t of the current frame; this process is iterated frame by frame. The calculation formulas of each gate and of the output are as follows:
r_t = σ(W_r·[h_(t−1), x_t]) (6)
z_t = σ(W_z·[h_(t−1), x_t]) (7)
h̃_t = tanh(W_h·[r_t ⊙ h_(t−1), x_t]) (8)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t (9)
where σ(·) and tanh(·) denote the Sigmoid activation function and the hyperbolic tangent activation function respectively, r_t is the output of the reset gate for the current frame, and z_t is the output of the update gate for the current frame.
As a further improvement of the present invention, in step 3, the a priori signal-to-noise ratio values estimated by the neural network mapping modules on the sub-bands are combined to obtain a 129-dimensional output.
As a further improvement of the present invention, in step 4, the full-band wiener filtering module further includes the following steps:
step Y1: calculate the gain function used for filtering, expressed as formula (10):
G(k) = ξ̂(k) / (1 + ξ̂(k)) (10)
where ξ̂(k) is the a priori signal-to-noise ratio value output by the neural network mapping module;
step Y2: filter the input noisy speech with the estimated gain function and finally perform an inverse Fourier transform to obtain the noise-reduced speech signal ŝ(n), as follows:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N (11)
ŝ(n) = IFFT(Ŝ(k)) (12)
Formula (11) is the frequency-domain filtering process of the Wiener filter, where S(k) is the spectrum of the input noisy speech signal, N is the number of frequency points per frame (here 129), and Ŝ(k) is the enhanced speech signal spectrum; the inverse Fourier transform of formula (12) yields the final time-domain signal output ŝ(n).
The invention also discloses a single-channel speech enhancement system for neural network sub-band modeling, which comprises: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the single-channel speech enhancement method of the present invention when invoked by the processor.
The invention also discloses a computer readable storage medium storing a computer program configured to, when invoked by a processor, implement the steps of the single channel speech enhancement method of the invention.
The invention has the beneficial effects that: 1. the single-channel speech enhancement method performs independent neural network modeling on each sub-band of the speech signal, which reduces the task difficulty of the neural network, reduces the model parameters, and achieves lower algorithm complexity; 2. the method uses a neural network model to estimate the a priori signal-to-noise ratio of the signal and combines it with a traditional filtering method for noise reduction, which effectively improves the generalization ability of the neural-network noise reduction algorithm; 3. because an independent neural network model is trained for each sub-band, the mapping precision is higher and a better speech noise reduction effect can be achieved.
Drawings
FIG. 1 is a functional block diagram of a single-channel speech enhancement method of the present invention;
FIG. 2 is a block diagram of the log power feature extraction principle of the single-channel speech enhancement method of the present invention;
FIG. 3 is a block diagram of the BFCC feature extraction principle of the single-channel speech enhancement method of the present invention;
FIG. 4 is a block diagram of the sub-bands of the neural network mapping module of the single channel speech enhancement method of the present invention;
FIG. 5 is a schematic block diagram of memory update in GRU layer of the single channel speech enhancement method of the present invention.
Detailed Description
As shown in fig. 1, the present invention discloses a single-channel speech enhancement method for neural network sub-band modeling, which uses a neural network model to estimate the a priori signal-to-noise ratio of the target speech from Log Power Spectrum (LPS) and Bark cepstral coefficient (BFCC) features, and combines this estimate with Wiener filtering, thereby achieving a good compromise between noise reduction effect and computational complexity. The single-channel speech enhancement method comprises the following steps:
step 1: a single microphone collects a noisy speech signal and sends it to a logarithmic power spectrum extraction module and a Bark cepstral coefficient extraction module;
step 2: receiving the noisy speech signal of step 1 with the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module, performing feature extraction on the noisy speech signal with these two modules, and finally sending the extracted features to a frequency band feature division module;
step 3: receiving the features extracted in step 2 with the frequency band feature division module, using it to assign the two groups of extracted features to sub-bands, inputting the features on each sub-band into the corresponding neural network mapping module to estimate the a priori signal-to-noise ratio, and finally combining the estimated a priori signal-to-noise ratios on all sub-bands and sending them to a full-band Wiener filtering module;
step 4: receiving and processing the estimated a priori signal-to-noise ratios on all sub-bands from step 3 with the full-band Wiener filtering module to obtain the enhanced speech signal.
In the single-channel speech enhancement method, 4800 sentences from the Aishell Chinese data set [1] (24 male and 24 female speakers, each contributing 100 sentences) are selected as the clean speech data of the training set; the clean speech is then randomly mixed with 100 different noise types selected from the Freeside website [2], with the mixing signal-to-noise ratio drawn from a uniform distribution over the interval [−5, 20] dB, yielding about 100 hours of noisy training data in total. BFCC features and log power spectrum features are then extracted for each sub-band, the corresponding ideal a priori signal-to-noise ratio values are constructed, and each neural network is trained with the back-propagation algorithm; 10% of all training data is set aside as a validation set, and the model is saved when the loss on the training and validation sets is minimal, giving the neural network mapping models for the different sub-bands. The above is the processing flow of the whole single-channel speech enhancement method and the training process of the neural network models; each key module is described in detail next.
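The random mixing just described can be sketched as follows; this is a minimal illustration in which the `mix_at_snr` helper, the white-noise surrogate signals, and the RNG seed are assumptions for demonstration, not details of the patent:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12          # epsilon guards division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                 # 1 s of surrogate "speech" at 16 kHz
noise = rng.standard_normal(16000)                 # surrogate noise segment
snr_db = rng.uniform(-5.0, 20.0)                   # uniform SNR as in the training setup
noisy = mix_at_snr(clean, noise, snr_db)
```

Applying the same helper with many speech/noise pairs and fresh SNR draws would reproduce the kind of training corpus the paragraph describes.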
As shown in fig. 2, in step 2, the log power spectrum feature extraction module is configured to extract a frequency domain log power feature of the speech signal, and the performing of feature extraction on the noisy speech signal by the log power spectrum feature extraction module further includes the following steps:
The first step: pre-process the speech signal x(n) acquired by the microphone by framing and windowing;
The second step: perform a Fast Fourier Transform (FFT) to obtain the spectrum of the signal and, from it, the frequency-domain power spectrum S²(k);
The third step: apply the natural logarithm operation;
The fourth step: compress the power spectrum in the logarithmic domain to obtain the extracted log power spectrum feature Y_log(k), as shown in the following formula (1):
Y_log(k) = ln(S²(k)), k = 1, 2, ..., N (1)
In the single-channel speech enhancement method, a sampling rate of 16 kHz is adopted, the frame length is 16 ms, and the frame shift is 8 ms, so that N = 129.
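The four extraction steps above can be sketched as a short routine; the Hann window and the exact framing arithmetic are assumptions (the patent fixes only 16 kHz sampling, 16 ms frames, 8 ms shift, and N = 129):

```python
import numpy as np

def lps_features(x, frame_len=256, hop=128):
    """Frame, window, FFT, and log-compress: Y_log(k) = ln(S^2(k)), 129 bins per frame."""
    win = np.hanning(frame_len)                  # window type assumed, not specified
    n_frames = 1 + (len(x) - frame_len) // hop
    feats = []
    for i in range(n_frames):
        frame = x[i * hop: i * hop + frame_len] * win
        spec = np.fft.rfft(frame)                # 129 complex bins for a 256-point frame
        power = np.abs(spec) ** 2                # frequency-domain power spectrum S^2(k)
        feats.append(np.log(power + 1e-12))      # epsilon guards log(0)
    return np.array(feats)

x = np.random.default_rng(1).standard_normal(16000)  # 1 s surrogate signal at 16 kHz
Y_log = lps_features(x)                              # shape: (frames, 129)
```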
As shown in fig. 3, in step 2, the Bark cepstral coefficient feature extraction module performs feature extraction in the frequency domain on the Bark scale, thereby simulating the masking effect of the human auditory system on sound and exploiting the fact that the human ear resolves the low frequencies of a sound signal more finely than the high frequencies, so as to extract a spectral feature that is very close to human subjective perception. The Bark cepstral coefficient extraction module performs the following steps:
step S1: preprocessing the input voice signal x (n) by framing and windowing;
step S2: performing fast Fourier transform to transform the data from a time domain to a frequency domain;
step S3: calculate the frequency-domain power spectrum S²(k);
step S4: pass the calculated frequency-domain power spectrum S²(k) through the Bark filterbank and compute the filtered energy spectrum, as in formula (2):
E(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B (2)
where b is the index of the Bark energy spectrum band, B is the number of Bark filters (here 24), each filter corresponds to one Bark-domain band, and H_b(k) is the transfer function of the b-th Bark-frequency filter, whose expression is given in equation (3);
step S5: take the logarithm of the Bark energy spectrum of each frame and apply a discrete cosine transform (DCT), as in formula (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_{b=1..B} ln(E(b))·cos(πn(2b−1)/(2B)) (4)
where Y_bark(n) is the extracted BFCC feature, n is the band index of the feature, and the feature dimension matches the number of Bark filters, here 24.
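A sketch of the Bark-domain feature chain follows. Since equations (2)-(4) appear only as images in the original, the triangular filter shape, the particular Bark-scale approximation, and the DCT-II variant below are standard choices assumed here for illustration:

```python
import numpy as np

def hz_to_bark(f):
    """A common Bark-scale approximation (assumed; the patent's formula is not shown)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_filterbank(n_filters=24, n_fft=256, sr=16000):
    """Triangular filters spaced uniformly on the Bark scale: one sketch of H_b(k)."""
    freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)   # 129 bin centre frequencies
    bark = hz_to_bark(freqs)
    edges = np.linspace(bark[0], bark[-1], n_filters + 2)
    fb = np.zeros((n_filters, len(freqs)))
    for b in range(n_filters):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        up = (bark - lo) / (mid - lo)                    # rising slope of the triangle
        down = (hi - bark) / (hi - mid)                  # falling slope
        fb[b] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

def dct2(v):
    """Plain DCT-II, as used to decorrelate the log Bark energies."""
    B = len(v)
    n = np.arange(B)
    return np.array([np.sum(v * np.cos(np.pi * k * (2 * n + 1) / (2 * B)))
                     for k in range(B)])

frame = np.random.default_rng(2).standard_normal(256) * np.hanning(256)
power_spec = np.abs(np.fft.rfft(frame)) ** 2             # S^2(k), 129 bins
fb = bark_filterbank()
energies = fb @ power_spec                               # E(b): 24 Bark-band energies
Y_bark = dct2(np.log(energies + 1e-12))                  # 24-dim BFCC feature vector
```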
In step 3, the frequency band feature division module divides the extracted Bark cepstral coefficient features and log power spectrum features of each frame of the signal into sub-bands, where each sub-band only contains the BFCC and LPS features within its frequency range, as shown in table 1.
TABLE 1 feature assignment for frequency domain subbands
The frequency band characteristic division module further comprises the following steps of:
sub-band division: the frequency domain range of 0-8000Hz is divided into 8 sub-bands and the sub-bands of low frequencies are divided more finely considering that most of the speech signal is concentrated in the low frequency range. In addition, indexes of features on different sub-bands are respectively given according to different numbers of LPS features and BFCC features corresponding to each sub-band, and the indexes are shown in table 1;
characteristic splicing step: and splicing the LPS and BFCC characteristics on each sub-band, and respectively sending the spliced LPS and BFCC characteristics to respective neural network mapping modules for estimation of prior signal-to-noise ratio.
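The division and splicing steps can be sketched as follows. The patent's Table 1 index assignment is not reproduced in the text, so the band boundaries below are illustrative placeholders that merely follow the stated pattern (8 sub-bands over 129 LPS bins and 24 BFCC bands, divided more finely at low frequency):

```python
import numpy as np

# Hypothetical index boundaries (Table 1 of the patent is not reproduced here):
# finer sub-bands at low frequency, coarser at high frequency.
LPS_EDGES = [0, 8, 16, 24, 33, 49, 65, 97, 129]    # partitions the 129 LPS bins
BFCC_EDGES = [0, 3, 6, 9, 12, 15, 18, 21, 24]      # partitions the 24 BFCC bands

def split_subband_features(lps_frame, bfcc_frame):
    """Concatenate each sub-band's LPS and BFCC slices into one network input vector."""
    feats = []
    for b in range(8):
        lps_part = lps_frame[LPS_EDGES[b]:LPS_EDGES[b + 1]]
        bfcc_part = bfcc_frame[BFCC_EDGES[b]:BFCC_EDGES[b + 1]]
        feats.append(np.concatenate([lps_part, bfcc_part]))
    return feats

rng = np.random.default_rng(3)
sub_feats = split_subband_features(rng.standard_normal(129), rng.standard_normal(24))
```

Each element of `sub_feats` would be fed to the neural network mapping module of the corresponding sub-band.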
In step 3, the neural network mapping module models each sub-band feature, and customizes a dedicated noise reduction model applied to different sub-bands. In consideration of the time sequence correlation characteristic of the voice signal, a model with the capability of mapping the prior signal-to-noise ratio is constructed in the neural network mapping module on the basis of a Gated Recurrent Unit (GRU).
As shown in fig. 4, after allocation by the sub-band division module, the features of each sub-band are input into the designed neural network structure for estimation of the a priori signal-to-noise ratio. The neural network mapping module comprises 5 neural layers: the first and last layers are feedforward neural network layers, and the middle three layers are GRU layers. In the feedforward neural network layers, weighted summation is performed in a fully connected manner, followed by nonlinear activation, as shown in formula (5):
h=g(W·X+b) (5)
where W and b are the weights and biases of the neurons, h is the output of the feedforward neural network layer, X is its input, and g(·) denotes the nonlinear activation operation; feedforward layer 1 uses the ReLU activation function, while feedforward layer 2 outputs the a priori signal-to-noise ratio estimate, so no activation operation is applied and only a linear weighted summation is performed.
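A minimal sketch of the two feedforward layers of formula (5), with ReLU activation in layer 1 and a purely linear layer 2; the layer sizes are illustrative, not those of Table 2:

```python
import numpy as np

def ff_layer(X, W, b, activation="relu"):
    """Fully connected layer h = g(W·X + b); activation=None gives the linear output layer."""
    h = W @ X + b
    if activation == "relu":
        h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(6)
x = rng.standard_normal(11)                          # e.g. an 11-dim sub-band feature vector
W1, b1 = rng.standard_normal((32, 11)), np.zeros(32)
W2, b2 = rng.standard_normal((8, 32)), np.zeros(8)
hidden = ff_layer(x, W1, b1)                         # layer 1: ReLU activation
snr_est = ff_layer(hidden, W2, b2, activation=None)  # layer 2: linear a priori SNR estimate
```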
As shown in fig. 5, the memory update mechanism in the neural network mapping module GRU layer is specifically as follows:
The GRU unit combines the feature x_t of the current frame with the retained output h_(t−1) of the previous frame and, through the processing of the update gate and the reset gate, generates the output h_t of the current frame; this process is iterated frame by frame. The calculation formulas of each gate and of the output are as follows:
r_t = σ(W_r·[h_(t−1), x_t]) (6)
z_t = σ(W_z·[h_(t−1), x_t]) (7)
h̃_t = tanh(W_h·[r_t ⊙ h_(t−1), x_t]) (8)
h_t = (1 − z_t) ⊙ h_(t−1) + z_t ⊙ h̃_t (9)
where σ(·) and tanh(·) denote the Sigmoid activation function and the hyperbolic tangent activation function respectively, r_t is the output of the reset gate for the current frame, and z_t is the output of the update gate for the current frame.
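The gate arithmetic can be sketched as a single GRU step; the candidate-state and blending rules for h_t follow the standard GRU formulation, assumed here because the corresponding equations are rendered as images in the original, and the weight shapes and random initialization are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    """One GRU update: reset gate, update gate, candidate state, then blend."""
    xh = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ xh)                                       # reset gate r_t
    z = sigmoid(Wz @ xh)                                       # update gate z_t
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # new hidden state h_t

rng = np.random.default_rng(4)
d_in, d_h = 16, 32
Wr, Wz, Wh = (rng.standard_normal((d_h, d_h + d_in)) * 0.1 for _ in range(3))
h = np.zeros(d_h)
for _ in range(5):                                             # iterate over 5 frames
    h = gru_step(rng.standard_normal(d_in), h, Wr, Wz, Wh)
```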
In addition, although the neural network structure is the same for every sub-band, the number of features and the task difficulty differ between sub-bands, so the number of neurons in the neural network model corresponding to each sub-band also differs, as shown in table 2 below.
TABLE 2 neuron configuration for different sub-band neural network modules
In step 3, the a priori signal-to-noise ratios estimated by the neural network mapping modules on the sub-bands are combined to obtain a 129-dimensional output.
In step 4, the full-band wiener filtering module further performs the following steps:
step Y1: calculate the gain function used for filtering, expressed as formula (10):
G(k) = ξ̂(k) / (1 + ξ̂(k)) (10)
where ξ̂(k) is the a priori signal-to-noise ratio value output by the neural network mapping module;
step Y2: filter the input noisy speech with the estimated gain function and finally perform an inverse Fourier transform to obtain the noise-reduced speech signal ŝ(n), as follows:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N (11)
ŝ(n) = IFFT(Ŝ(k)) (12)
Formula (11) is the frequency-domain filtering process of the Wiener filter, where S(k) is the spectrum of the input noisy speech signal, N is the number of frequency points per frame (here 129), and Ŝ(k) is the enhanced speech signal spectrum; the inverse Fourier transform of formula (12) yields the final time-domain signal output ŝ(n).
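The full-band Wiener filtering module can be sketched per frame as below, assuming the textbook Wiener gain G(k) = ξ̂(k)/(1 + ξ̂(k)) for formula (10); the constant SNR vector is only a stand-in for the network's per-bin output:

```python
import numpy as np

def wiener_gain(xi):
    """Gain from the estimated a priori SNR: G(k) = xi / (1 + xi)."""
    return xi / (1.0 + xi)

def enhance_frame(noisy_frame, xi, n_fft=256):
    """Apply the gain in the frequency domain and invert: s_hat = IFFT(G(k)·S(k))."""
    S = np.fft.rfft(noisy_frame, n_fft)        # 129-bin spectrum of the noisy frame
    S_hat = wiener_gain(xi) * S                # per-bin Wiener filtering in frequency
    return np.fft.irfft(S_hat, n_fft)          # back to the time domain

rng = np.random.default_rng(5)
frame = rng.standard_normal(256)               # one 16 ms frame at 16 kHz
xi = np.full(129, 10.0)                        # high a priori SNR -> gain close to 1
out = enhance_frame(frame, xi)
```

With a constant gain across all bins, the output is simply the input frame scaled by that gain, which makes the linearity of the filter easy to check.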
The invention has the beneficial effects that: 1. the single-channel speech enhancement method performs independent neural network modeling on each sub-band of the speech signal, which reduces the task difficulty of the neural network, reduces the model parameters, and achieves lower algorithm complexity; 2. the method uses a neural network model to estimate the a priori signal-to-noise ratio of the signal and combines it with a traditional filtering method for noise reduction, which effectively improves the generalization ability of the neural-network noise reduction algorithm; 3. because an independent neural network model is trained for each sub-band, the mapping precision is higher and a better speech noise reduction effect can be achieved.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (10)
1. A single-channel speech enhancement method for neural network sub-band modeling is characterized by comprising the following steps:
step 1: collecting a noisy speech signal and sending it to a logarithmic power spectrum (LPS) extraction module and a Bark-frequency cepstral coefficient (BFCC) extraction module;
step 2: the logarithmic power spectrum extraction module and the Bark cepstral coefficient extraction module receive the noisy speech signal from step 1, perform feature extraction on it, and send the extracted features to a band feature division module;
step 3: the band feature division module receives the features extracted in step 2 and assigns them to sub-bands; the features of each sub-band are input to the corresponding neural network mapping module to estimate the prior signal-to-noise ratio, and the estimated prior signal-to-noise ratios of all sub-bands are then combined and sent to a full-band Wiener filtering module;
step 4: the full-band Wiener filtering module receives and processes the prior signal-to-noise ratios estimated over all sub-bands in step 3 to obtain the enhanced speech signal.
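The four claimed steps can be sketched end to end in Python. This is a hypothetical skeleton for illustration only: the per-band neural network is stubbed with a constant placeholder, the BFCC branch is omitted, and all function names are the author's of this note:

```python
import numpy as np

N_BANDS = 8  # sub-bands over 0-8000 Hz

def extract_lps(frame, n_fft=256):
    """Step 2 (LPS branch): log power spectrum of one windowed frame."""
    return np.log(np.abs(np.fft.rfft(frame, n_fft)) ** 2 + 1e-12)

def band_network(band_feats):
    """Step 3 stub: the per-band neural network would map sub-band features
    to a prior-SNR estimate; a constant placeholder stands in for it here."""
    return np.full(band_feats.shape, 1.0)

def enhance(frame, n_fft=256):
    lps = extract_lps(frame, n_fft)                          # step 2
    bands = np.array_split(lps, N_BANDS)                     # step 3: split
    snr = np.concatenate([band_network(b) for b in bands])   # merge: 129 dims
    gain = snr / (1.0 + snr)                                 # step 4: Wiener gain
    return np.fft.irfft(gain * np.fft.rfft(frame, n_fft), n_fft)

y = enhance(np.random.randn(256))
```

The point of the skeleton is the data flow: features are split per sub-band, each sub-band is mapped to a prior SNR independently, and only the final Wiener gain operates on the full 129-bin band.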
2. The single-channel speech enhancement method of claim 1, wherein in step 2, the log-power-spectrum feature extraction module performs feature extraction on the noisy speech signal through the following steps:
the first step: preprocessing the speech signal x(n) collected by the microphone by framing and windowing;
the second step: performing a fast Fourier transform to obtain the spectrum of the signal and the frequency-domain power spectrum S²(k);
the third step: applying the natural logarithm operation;
the fourth step: the power spectrum is compressed in the logarithmic domain, and the extracted log-power-spectrum feature Y_log(k) is obtained, as shown in equation (1):
Y_log(k) = ln(S²(k)), k = 1, 2, ..., N (1)
In the single-channel speech enhancement method, the sampling rate is 16 kHz, the frame length is 16 ms, the frame shift is 8 ms, and N is 129.
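The LPS extraction of claim 2 can be sketched as follows, using the claimed 16 kHz / 16 ms / 8 ms parameters (256-sample frames, 128-sample hop). The Hann window is an assumption, as the claim does not name the window type:

```python
import numpy as np

def lps_features(x, frame_len=256, hop=128, n_fft=256):
    """Frame, window, FFT, and log-compress the power spectrum (eq. (1)).
    At 16 kHz: 16 ms frames (256 samples) with an 8 ms shift (128 samples)."""
    window = np.hanning(frame_len)               # window choice is an assumption
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    spec = np.fft.rfft(frames, n_fft)            # N = 129 frequency bins
    return np.log(np.abs(spec) ** 2 + 1e-12)     # Y_log(k) = ln(S^2(k))

feats = lps_features(np.random.randn(16000))     # one second at 16 kHz
```

One second of audio yields 124 frames of 129-dimensional LPS features under these settings.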
3. The single-channel speech enhancement method of claim 1, wherein in step 2, the Bark cepstral coefficient extraction module performs feature extraction on the noisy speech signal through the following steps:
step S1: preprocessing the input speech signal x(n) by framing and windowing;
step S2: performing a fast Fourier transform to transform the data from the time domain to the frequency domain;
step S3: calculating the frequency-domain power spectrum S²(k);
step S4: passing the computed frequency-domain power spectrum S²(k) through the Bark filter bank and calculating the filtered energy spectrum, as in equation (2):
E(b) = Σ_k H_b(k)·S²(k), b = 1, 2, ..., B (2)
where b is the index of the Bark energy spectrum and B is the number of Bark filters (here 24); each filter corresponds to one Bark-domain band, and the transfer function H_b(k) of the Bark-frequency filter is given by equation (3);
step S5: taking the logarithm of the Bark energy spectrum of each frame and applying the discrete cosine transform (DCT), as in equation (4), to obtain the Bark cepstral coefficient features:
Y_bark(n) = Σ_b ln(E(b))·cos(πn(2b−1)/(2B)), n = 1, 2, ..., 24 (4)
where Y_bark(n) is the extracted BFCC feature and n is the band index of the feature; the feature dimension matches the number of Bark filters, i.e. 24.
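Steps S4-S5 can be sketched as follows. The claim's Bark filter transfer function of equation (3) is not reproduced here; a uniformly spaced rectangular pooling stands in for the 24 Bark filters, so this is an illustration of the log-then-DCT pipeline only:

```python
import numpy as np

def bfcc_features(power_spec, n_filters=24):
    """Sketch of steps S4-S5: pool the power spectrum with a stand-in
    filter bank, take logs of the band energies, then apply a DCT-II.
    The true Bark filter shapes of equation (3) are not modeled here."""
    bands = np.array_split(power_spec, n_filters)          # stand-in filter bank
    log_energy = np.log(np.array([b.sum() for b in bands]) + 1e-12)
    n = np.arange(n_filters)
    b = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(n, 2 * b + 1) / (2 * n_filters))  # DCT-II basis
    return dct @ log_energy                                # 24-dim Y_bark(n)

power_spec = np.abs(np.fft.rfft(np.random.randn(256), 256)) ** 2
bfcc = bfcc_features(power_spec)
```

A production implementation would replace `np.array_split` with weights computed from the Bark scale, but the 24-dimensional output shape and the log/DCT structure match the claim.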
4. The single-channel speech enhancement method of claim 1, wherein in step 3, the band feature division module sequentially performs the following steps:
a sub-band division step: dividing the 0-8000 Hz frequency range into 8 sub-bands and assigning feature indexes to each sub-band according to the number of LPS and BFCC features that correspond to it;
a feature splicing step: splicing the LPS and BFCC features of each sub-band and sending each spliced vector to its own neural network mapping module for prior signal-to-noise ratio estimation.
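The division and splicing steps can be sketched as follows. The patent says the number of LPS and BFCC features differs per sub-band but does not give the exact index assignment, so the even split used here is an assumption:

```python
import numpy as np

N_BANDS = 8  # sub-bands covering 0-8000 Hz

def split_and_splice(lps, bfcc):
    """Assign the 129 LPS bins and 24 BFCC coefficients to 8 sub-bands and
    concatenate per band. The even index assignment is an assumption; the
    patent only states that the per-band feature counts differ."""
    lps_parts = np.array_split(lps, N_BANDS)    # 129 bins over 8 bands
    bfcc_parts = np.array_split(bfcc, N_BANDS)  # 24 coefficients over 8 bands
    return [np.concatenate([l, b]) for l, b in zip(lps_parts, bfcc_parts)]

feats = split_and_splice(np.random.randn(129), np.random.randn(24))
```

Each of the 8 spliced vectors would then feed its own neural network mapping module; together they carry all 129 + 24 = 153 input features.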
5. The single-channel speech enhancement method of claim 4, wherein in step 3, each neural network mapping module comprises 5 neural layers, of which the first and last are feedforward layers and the middle three are GRU layers; a feedforward layer performs a fully connected weighted summation followed by a nonlinear activation, as in equation (5):
h = g(W·X + b) (5)
where W and b are the weights and bias of the neurons, X is the input of the feedforward layer, h is its output, and g(·) denotes the nonlinear activation operation. Feedforward layer 1 uses the ReLU activation function; feedforward layer 2 must output the prior signal-to-noise ratio estimate, so no activation is applied and only the linear weighted summation is performed.
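Equation (5) and the two activation choices can be sketched in a few lines (the function name and the tiny example weights are illustrative only):

```python
import numpy as np

def ff_layer(x, W, b, activation="relu"):
    """Equation (5): h = g(W·X + b). The output layer produces the prior-SNR
    estimate directly, so it is called with activation=None (linear)."""
    h = W @ x + b
    if activation == "relu":
        h = np.maximum(h, 0.0)                   # g(.) = ReLU
    return h

W = np.array([[1.0, -1.0], [0.5, 0.5]])
b = np.zeros(2)
x = np.array([2.0, 3.0])
hidden = ff_layer(x, W, b)                       # ReLU layer (layer 1)
out = ff_layer(x, W, b, activation=None)         # linear output layer (layer 2)
```

The only difference between the two layers is whether the ReLU clamp is applied after the weighted sum, which is exactly the distinction the claim draws.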
6. The single-channel speech enhancement method of claim 5, wherein the memory update mechanism in the GRU layers of the neural network mapping module is as follows:
the GRU unit combines the feature x_t of the current frame with the retained output h_{t-1} of the previous frame and, through the processing of the update gate and the reset gate, produces the output h_t of the current frame; this is repeated and iterated frame by frame. The gates and the output are computed as:
r_t = σ(W_r·[h_{t-1}, x_t]) (6)
z_t = σ(W_z·[h_{t-1}, x_t]) (7)
h̃_t = tanh(W_h·[r_t ∗ h_{t-1}, x_t]) (8)
h_t = (1 − z_t) ∗ h_{t-1} + z_t ∗ h̃_t (9)
where σ(·) and tanh(·) denote the Sigmoid and hyperbolic tangent activation functions respectively, r_t is the output of the reset gate and z_t is the output of the update gate for the current frame.
7. The single-channel speech enhancement method of claim 1, wherein in step 3, the prior signal-to-noise ratio values on each sub-band estimated by the neural network mapping modules are combined to obtain a 129-dimensional output.
8. The single-channel speech enhancement method of claim 7, wherein in step 4, the full-band Wiener filtering module performs the following steps:
step Y1: calculating the gain function used for filtering, expressed as equation (10):
G(k) = ξ̂(k)/(1 + ξ̂(k)) (10)
where ξ̂(k) is the prior signal-to-noise ratio value output by the neural network mapping modules;
step Y2: filtering the input noisy speech with the estimated gain function and finally performing an inverse Fourier transform to obtain the noise-reduced speech signal ŝ(n), as follows:
Ŝ(k) = G(k)·S(k), k = 1, 2, ..., N (11)
ŝ(n) = IFFT(Ŝ(k)) (12)
Equation (11) is the frequency-domain filtering process of Wiener filtering, where S(k) is the spectrum of the input noisy speech signal and N is the number of frequency points per frame (here 129); Ŝ(k) is the enhanced speech signal spectrum, and the inverse Fourier transform of equation (12) yields the final time-domain signal output ŝ(n).
9. A single channel speech enhancement system with neural network sub-band modeling, comprising: memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the single channel speech enhancement method of any of claims 1-8 when invoked by the processor.
10. A computer-readable storage medium characterized by: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the single channel speech enhancement method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010872886.4A CN111986660A (en) | 2020-08-26 | 2020-08-26 | Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111986660A true CN111986660A (en) | 2020-11-24 |
Family
ID=73440930
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010872886.4A Pending CN111986660A (en) | 2020-08-26 | 2020-08-26 | Single-channel speech enhancement method, system and storage medium for neural network sub-band modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111986660A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050240401A1 (en) * | 2004-04-23 | 2005-10-27 | Acoustic Technologies, Inc. | Noise suppression based on Bark band weiner filtering and modified doblinger noise estimate |
CN102124518A (en) * | 2008-08-05 | 2011-07-13 | 弗朗霍夫应用科学研究促进协会 | Apparatus and method for processing an audio signal for speech enhancement using a feature extraction |
CN107680610A (en) * | 2017-09-27 | 2018-02-09 | 安徽硕威智能科技有限公司 | A kind of speech-enhancement system and method |
CN110085249A (en) * | 2019-05-09 | 2019-08-02 | 南京工程学院 | The single-channel voice Enhancement Method of Recognition with Recurrent Neural Network based on attention gate |
CN110120225A (en) * | 2019-04-01 | 2019-08-13 | 西安电子科技大学 | A kind of audio defeat system and method for the structure based on GRU network |
CN110310656A (en) * | 2019-05-27 | 2019-10-08 | 重庆高开清芯科技产业发展有限公司 | A kind of sound enhancement method |
WO2020107269A1 (en) * | 2018-11-28 | 2020-06-04 | 深圳市汇顶科技股份有限公司 | Self-adaptive speech enhancement method, and electronic device |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113516988A (en) * | 2020-12-30 | 2021-10-19 | 腾讯科技(深圳)有限公司 | Audio processing method and device, intelligent equipment and storage medium |
CN113516988B (en) * | 2020-12-30 | 2024-02-23 | 腾讯科技(深圳)有限公司 | Audio processing method and device, intelligent equipment and storage medium |
CN113571075A (en) * | 2021-01-28 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Audio processing method and device, electronic equipment and storage medium |
CN113077806A (en) * | 2021-03-23 | 2021-07-06 | 杭州朗和科技有限公司 | Audio processing method and device, model training method and device, medium and equipment |
CN113077806B (en) * | 2021-03-23 | 2023-10-13 | 杭州网易智企科技有限公司 | Audio processing method and device, model training method and device, medium and equipment |
CN113096679A (en) * | 2021-04-02 | 2021-07-09 | 北京字节跳动网络技术有限公司 | Audio data processing method and device |
CN116403594A (en) * | 2023-06-08 | 2023-07-07 | 澳克多普有限公司 | Speech enhancement method and device based on noise update factor |
CN116403594B (en) * | 2023-06-08 | 2023-08-18 | 澳克多普有限公司 | Speech enhancement method and device based on noise update factor |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |