CN112927709B - Voice enhancement method based on time-frequency domain joint loss function - Google Patents


Info

Publication number
CN112927709B
CN112927709B (application number CN202110155444.2A)
Authority
CN
China
Prior art keywords
voice
frequency domain
data set
amplitude spectrum
clean
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110155444.2A
Other languages
Chinese (zh)
Other versions
CN112927709A (en)
Inventor
高戈
王霄
陈怡
杨玉红
曾邦
尹文兵
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN202110155444.2A
Publication of CN112927709A
Application granted
Publication of CN112927709B
Legal status: Expired - Fee Related

Classifications

    • G10L 21/0232 — Speech enhancement; noise filtering; processing in the frequency domain
    • G10L 21/0224 — Speech enhancement; noise filtering; processing in the time domain
    • G10L 25/30 — Speech or voice analysis techniques using neural networks
    • G06N 3/045 — Neural networks; architecture; combinations of networks
    • G06N 3/08 — Neural networks; learning methods


Abstract

The invention provides a speech enhancement method based on a time-frequency domain joint loss function. A clean speech data set and a noise data set from an open-source corpus are mixed into a noisy speech data set, which is converted by preprocessing into magnitude spectra, phase spectra, and waveform data to construct a training set. A CNN network model is built and trained with the noisy speech magnitude spectrum as input and the clean speech magnitude spectrum as the label. The magnitude spectrum estimate output by the model is combined with the phase spectrum of the noisy speech and reconstructed into a waveform by the inverse short-time Fourier transform, giving the time-domain waveform of the estimated speech. The frequency-domain loss is computed from the clean speech magnitude spectrum and the magnitude spectrum estimate; the time-domain loss is computed from the clean speech waveform and the estimated speech waveform; and a time-frequency joint loss constructed from the two guides the weight optimization of the CNN network model. The invention reduces the mismatch between the estimated magnitude spectrum and the phase spectrum and improves the speech enhancement effect.

Description

Voice enhancement method based on time-frequency domain joint loss function
Technical Field
The invention relates to the field of speech enhancement, and in particular to a speech enhancement method based on a time-frequency domain joint loss function.
Background
Voice communication is the most convenient mode of information interaction between people and machines. However, in virtually any environment, voice communication is disturbed to some degree by ambient noise. Speech enhancement technology is an effective way to counter the influence of noise in the voice interaction process. The purpose of speech enhancement is to extract the clean speech signal from the background noise as far as possible, suppress the environmental noise, and improve speech quality and speech intelligibility.
In recent years, with the boom in artificial intelligence technology, speech enhancement technology has also developed rapidly, and a variety of speech enhancement techniques have emerged. These schemes fall mainly into two groups: conventional speech enhancement schemes and deep-learning-based speech enhancement schemes.
Conventional speech enhancement schemes mainly include spectral subtraction, statistical-model-based enhancement algorithms, and subspace enhancement algorithms. Spectral subtraction assumes that the noise is additive, subtracts an estimate of the noise spectrum from the spectrum of the noisy speech, and thereby obtains an estimate of the clean speech. The Wiener filtering algorithm and the minimum mean square error (MMSE) algorithm are representative statistical-model-based enhancement algorithms; compared with spectral subtraction, the residual noise left by Wiener filtering resembles white noise, which is more comfortable to the ear. The MMSE algorithm exploits the important perceptual role of the short-time spectral amplitude of the speech signal and enhances the noisy speech with a minimum mean square error short-time spectral amplitude estimator. The subspace enhancement algorithm is derived mainly from linear algebra: in Euclidean space, the clean signal is assumed to be confined to a signal subspace, so the speech enhancement task can be accomplished by decomposing the vector space of the noisy signal into two subspaces.
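As an illustration of the spectral-subtraction principle described above, the following minimal sketch (not part of the patent; the noise spectrum is assumed to come from a few leading noise-only frames, and the file names are hypothetical) subtracts an estimated noise magnitude spectrum from the noisy magnitude spectrum and resynthesizes with the noisy phase:

```python
# Illustrative spectral-subtraction sketch (not the patented method).
# Assumes the first few frames of the recording contain noise only; file names are hypothetical.
import numpy as np
import librosa
import soundfile as sf

def spectral_subtraction(noisy, n_fft=256, hop=64, noise_frames=10):
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop, window="hamming")
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)   # noise spectrum estimate
    clean_mag = np.maximum(mag - noise_mag, 0.0)                    # subtract and floor at zero
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop, window="hamming")

noisy, sr = librosa.load("noisy_example.wav", sr=16000)
sf.write("enhanced_specsub.wav", spectral_subtraction(noisy), sr)
```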
Conventional speech enhancement algorithms mostly assume that the speech signal is stationary. In real life, however, this assumption rarely holds. Deep-learning-based speech enhancement algorithms can effectively address this problem through their strong nonlinear fitting capability. According to the training target, they can be divided into two categories: mask-based enhancement networks and mapping-based enhancement networks. A mask-based enhancement network uses an ideal ratio mask, a phase mask, or a similar target as the training target of the neural network. A mapping-based enhancement network uses the fitting capability of the neural network to map the log spectrum or power spectrum of the noisy speech directly to that of the clean speech. According to the network model used, deep-learning-based speech enhancement networks can be classified into DNN, CNN, RNN, and GAN enhancement networks.
Feature processing of the spectrogram is the key to a deep-learning-based speech enhancement network. Therefore, CNN networks are better suited to speech enhancement tasks than other network models.
In the process of implementing the invention, the inventors found that the prior-art methods have at least the following technical problems:
A speech enhancement network usually uses the noisy speech magnitude spectrum to generate the magnitude spectrum of the estimated speech and uses the noisy speech phase spectrum for waveform reconstruction, which can cause a mismatch between the magnitude spectrum and the phase spectrum. A speech enhancement algorithm is usually used as the front-end module of speech recognition, so its performance greatly affects that of the recognition system. The mismatch between the magnitude spectrum and the phase spectrum may damage feature information of the speech signal, such as Mel-frequency cepstral coefficients (MFCC), thereby reducing the accuracy of speech recognition.
Therefore, the prior-art methods are deficient when used as the speech enhancement front end of a speech recognition system, and solving the phase mismatch problem in speech enhancement is of important practical significance.
Disclosure of Invention
Speech enhancement models based on deep neural networks generally transform the noisy speech signal into the frequency domain by the short-time Fourier transform (STFT). They typically use the mean square error (MSE) between the frequency-domain magnitude spectrum of the clean speech and that of the estimate as the loss function to learn the mapping between magnitude spectra. The traditional approach reuses the phase spectrum of the noisy speech to generate the estimated signal, which causes a mismatch between the magnitude spectrum and the phase spectrum and degrades the enhancement effect. The invention aims to reduce this mismatch between the magnitude spectrum and the phase spectrum of the enhanced speech by using a joint loss function over the time domain and the frequency domain.
In view of the defects of the traditional approach, the invention provides a speech enhancement model based on a time-frequency domain joint loss function. The invention replaces the single frequency-domain loss function of the traditional scheme with a time-frequency domain joint loss function. By introducing a time-domain loss, the waveform error between the estimated signal and the clean speech signal in the time domain is also penalized, which reduces the impact of the magnitude-phase mismatch present in the traditional scheme. By reducing the loss and corruption of information in the enhanced speech, the performance of the speech enhancement algorithm as a speech recognition front-end module is improved. The purpose of the invention is achieved through the following technical solution:
The invention provides a speech enhancement method based on a time-frequency domain joint loss function, characterized by comprising the following steps:
Step 1: mix the clean speech data set and the noise data from an open-source data set into a noisy speech data set; frame and overlap each clean utterance in the clean speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum, constructing the clean speech frequency-domain magnitude spectrum data set; sample and frame each clean utterance with a Hamming window to obtain its waveform data, forming the clean speech time-domain waveform data set; frame and overlap each noisy utterance in the noisy speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum and frequency-domain phase spectrum, forming the noisy speech frequency-domain magnitude spectrum data set and the noisy speech frequency-domain phase spectrum data set; and construct the network training set data from the clean speech frequency-domain magnitude spectrum data set, the clean speech time-domain waveform data set, the noisy speech frequency-domain magnitude spectrum data set, and the noisy speech frequency-domain phase spectrum data set;
Step 2: construct a CNN network model, taking the noisy speech frequency-domain magnitude spectra in the network training set data as the model input data set and the clean speech frequency-domain magnitude spectra as the training target set; each time the network takes a noisy speech frequency-domain magnitude spectrum, the corresponding clean speech frequency-domain magnitude spectrum is used as the label, and the CNN network model predicts the clean speech frequency-domain magnitude spectrum from the noisy one to obtain a frequency-domain magnitude spectrum estimate; combine the frequency-domain magnitude spectrum estimate with the frequency-domain phase spectrum of the noisy speech and perform waveform reconstruction by the inverse short-time Fourier transform to obtain the enhanced speech; sample, frame, overlap, and apply a Hamming window to the enhanced speech to obtain the estimated speech time-domain waveform data;
calculate the frequency-domain loss from the clean speech frequency-domain magnitude spectrum and the frequency-domain magnitude spectrum estimate; calculate the time-domain loss from the clean speech time-domain waveform data and the estimated speech time-domain waveform data; and construct the time-frequency domain joint loss from the frequency-domain loss and the time-domain loss;
Step 3: using Adam as the optimizer, update the weight matrices of the convolutional layers according to the time-frequency domain joint loss and proceed to the next iteration until training is complete, obtaining the optimized network weight parameters and thus the optimized CNN network model.
Preferably, the clean speech data set in step 1 is:
{C_i , i ∈ [1, K]}
where C_i is the i-th clean utterance in the clean speech data set and K is the number of clean utterances;
the noise data set in step 1 is:
{N_i , i ∈ [1, K′]}
where K′ is the number of noise recordings in the noise data set;
the synthesized noisy speech data set in step 1 is:
{R_i , i ∈ [1, K]}
where R_i is the i-th synthesized noisy utterance in the noisy speech data set and K is the number of noisy utterances;
the clean speech frequency-domain magnitude spectrum data set in step 1 is:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
the clean speech time-domain waveform data set in step 1 is:
{y_i , i ∈ [1, K]}, y_i = {y_{i,j} , j ∈ [1, T_i]}
where y_i denotes the waveform data of the i-th clean utterance, y_{i,j} is the value of the j-th sample of the i-th clean utterance, and T_i is the total number of samples of the i-th clean utterance;
the noisy speech frequency-domain magnitude spectrum data set in step 1 is:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
the noisy speech frequency-domain phase spectrum data set in step 1 is:
{P_i , i ∈ [1, K]}
where P_i denotes the frequency-domain phase spectrum of the i-th noisy utterance and K is the number of noisy speech phase spectra in the data set;
the network training set data constructed in step 1 is:
{M_i , P_i , Y_i , y_i , i ∈ [1, K]}
where K is the total number of utterances in the noisy speech data set;
preferably, the pre-trained CNN network model in step 2 is formed by cascading an encoder and a decoder;
the encoder is formed by sequentially cascading a coding convolution layer, a coding batch normalization layer, a maximum pooling layer and a Leaky ReLu activation layer, wherein the coding convolution layer relates to weight updating;
the decoder is formed by sequentially cascading a decoding convolution layer, a decoding batch normalization layer and an upper sampling layer, wherein the decoding convolution layer relates to weight updating;
in step 2, the frequency-domain magnitude spectra of the noisy speech in the training set data are:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
in step 2, the frequency-domain magnitude spectra of the clean speech in the training set data are:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
in step 2, the frequency-domain magnitude spectrum estimates output by the model are:
{Ŷ_i , i ∈ [1, K]}, Ŷ_i = {Ŷ_{i,j} , j ∈ [1, N_i]}
where Ŷ_i denotes the frequency-domain magnitude spectrum estimate output by the network when the i-th noisy utterance is input, Ŷ_{i,j} is the j-th frame of the i-th magnitude spectrum estimate, and N_i is the total number of frames of the i-th magnitude spectrum estimate;
in step 2, the estimated speech time-domain waveform data are:
{ŷ_i , i ∈ [1, K]}, ŷ_i = {ŷ_{i,j} , j ∈ [1, T_i]}
where ŷ_i denotes the time-domain waveform data of the i-th estimated utterance, ŷ_{i,j} is the amplitude of the j-th sample of the i-th estimated utterance, and T_i is the total number of samples of the i-th estimated utterance;
the frequency-domain loss in step 2 is the mean square error of the frequency-domain magnitude spectra:
L_{i,F} = (1 / N_i) · Σ_{j=1}^{N_i} (Y_{i,j} − Ŷ_{i,j})²
where L_{i,F} is the frequency-domain loss when the i-th noisy utterance is the input;
the time-domain loss in step 2 is the mean square error of the time-domain waveform amplitudes:
L_{i,W} = (1 / T_i) · Σ_{j=1}^{T_i} (y_{i,j} − ŷ_{i,j})²
where L_{i,W} is the time-domain loss when the i-th noisy utterance is the input;
the time-frequency domain joint loss in step 2 is:
L_{i,total} = L_{i,F} + α · L_{i,W}
where L_{i,total} is the time-frequency domain joint loss when the i-th noisy utterance is input, and α is a hyperparameter weighting coefficient.
Compared with the prior art, the invention has the following advantages and beneficial effects: the mismatch between the estimated magnitude spectrum and the phase spectrum is reduced, and the speech enhancement effect is improved to a certain extent.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a structural schematic diagram of the CNN speech enhancement network of the present invention.
Detailed Description
This embodiment implements training and testing based on the AISHELL speech corpus and the MUSAN noise corpus.
As shown in FIG. 1, this embodiment performs speech enhancement training based on a convolutional neural network (CNN) model, and compares the speech enhancement performance with existing algorithms by replacing the loss function with the time-frequency domain joint loss function.
The first embodiment of the present invention is a speech enhancement method based on a time-frequency domain joint loss function, and specifically includes the following steps:
Step 1: mix the clean speech data set and the noise data from an open-source data set into a noisy speech data set; frame and overlap each clean utterance in the clean speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum, constructing the clean speech frequency-domain magnitude spectrum data set; sample and frame each clean utterance with a Hamming window to obtain its waveform data, forming the clean speech time-domain waveform data set; frame and overlap each noisy utterance in the noisy speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum and frequency-domain phase spectrum, forming the noisy speech frequency-domain magnitude spectrum data set and the noisy speech frequency-domain phase spectrum data set; and construct the network training set data from the clean speech frequency-domain magnitude spectrum data set, the clean speech time-domain waveform data set, the noisy speech frequency-domain magnitude spectrum data set, and the noisy speech frequency-domain phase spectrum data set;
The clean speech data set in step 1 is:
{C_i , i ∈ [1, K]}
where C_i is the i-th clean utterance in the clean speech data set and K = 1024 is the number of clean utterances;
the noise data set in step 1 is:
{N_i , i ∈ [1, K′]}
where K′ is the number of noise recordings in the noise data set;
the synthesized noisy speech data set in step 1 is:
{R_i , i ∈ [1, K]}
where R_i is the i-th synthesized noisy utterance in the noisy speech data set and K = 1024 is the number of noisy utterances;
in step 1, the short-time Fourier transform parameters are 16 kHz sampling, a window length of 256, and 75% window overlap; the number of sampling points and the number of frames are calculated as:
T = t × 16000
N = ⌊(T − 256) / 64⌋ + 1
where t is the duration of the speech in seconds, T is the total number of sampling points of the speech, and N is the number of frames of the frequency-domain magnitude spectrum of the speech.
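Under these parameters, the step-1 preprocessing can be sketched as follows (a minimal sketch assuming librosa for the STFT; the file names are hypothetical and iteration over the whole data set is omitted):

```python
# Sketch of the step-1 preprocessing under the stated parameters:
# 16 kHz sampling, window length 256, 75% overlap (hop = 256 * 0.25 = 64), Hamming window.
import numpy as np
import librosa

N_FFT, HOP = 256, 64

def preprocess(path):
    wav, _ = librosa.load(path, sr=16000)                # time-domain waveform (about t * 16000 samples)
    spec = librosa.stft(wav, n_fft=N_FFT, hop_length=HOP, window="hamming")
    return np.abs(spec), np.angle(spec), wav             # magnitude spectrum, phase spectrum, waveform

# One training example: noisy magnitude/phase plus the corresponding clean magnitude and waveform.
M_i, P_i, _ = preprocess("noisy_0001.wav")               # hypothetical file names
Y_i, _, y_i = preprocess("clean_0001.wav")
```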
The clean speech frequency-domain magnitude spectrum data set in step 1 is:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
the clean speech time-domain waveform data set in step 1 is:
{y_i , i ∈ [1, K]}, y_i = {y_{i,j} , j ∈ [1, T_i]}
where y_i denotes the waveform data of the i-th clean utterance, y_{i,j} is the value of the j-th sample of the i-th clean utterance, and T_i is the total number of samples of the i-th clean utterance at 16 kHz sampling;
the noisy speech frequency-domain magnitude spectrum data set in step 1 is:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
the noisy speech frequency-domain phase spectrum data set in step 1 is:
{P_i , i ∈ [1, K]}
where P_i denotes the frequency-domain phase spectrum of the i-th noisy utterance and K is the number of noisy speech phase spectra in the data set;
the network training set data constructed in step 1 is:
{M_i , P_i , Y_i , y_i , i ∈ [1, K]}
where K is the total number of utterances in the noisy speech data set;
step 2, a CNN network model is constructed, a frequency domain amplitude spectrum of the voice with noise in the network training set data is used as a model input data set, a frequency domain amplitude spectrum of the clean voice is used as a training target set, the network acquires the frequency domain amplitude spectrum of the voice with noise each time, the frequency domain amplitude spectrum of the clean voice corresponding to the frequency domain amplitude spectrum is used as a label, and the CNN network model predicts the frequency domain amplitude spectrum of the clean voice according to the frequency domain amplitude spectrum of the voice with noise to obtain a frequency domain amplitude spectrum estimated value; combining the frequency domain amplitude spectrum estimation value with a frequency domain phase spectrum of the voice with noise, and further performing waveform reconstruction by an inverse short-time Fourier transform method to obtain enhanced voice; sampling, framing and overlapping the enhanced voice and adding a Hamming window to obtain waveform data on an estimated voice time domain;
Calculating the loss of the frequency domain through the frequency domain amplitude spectrum of the clean voice and the frequency domain amplitude spectrum estimation value; calculating time domain loss through time domain waveform data of the clean voice and waveform data on an estimated voice time domain; constructing time-frequency domain combined loss according to the frequency domain loss and the time domain loss;
The CNN network model is formed by cascading an encoder and a decoder;
the encoder is formed by sequentially cascading an encoding convolutional layer, an encoding batch normalization layer, a max pooling layer, and a Leaky ReLU activation layer, where the encoding convolutional layer involves weight updating;
the decoder is formed by sequentially cascading a decoding convolutional layer, a decoding batch normalization layer, and an upsampling layer, where the decoding convolutional layer involves weight updating;
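A minimal PyTorch sketch of such an encoder-decoder is given below; only the layer types follow the description above, while the number of blocks, channel widths, and kernel sizes are assumptions:

```python
# Minimal PyTorch sketch of the encoder-decoder CNN described above.
# Only the layer types follow the text; block count, channel widths and kernel sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def enc_block(c_in, c_out):
    # encoding convolution -> batch normalization -> max pooling -> Leaky ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.MaxPool2d(2),
        nn.LeakyReLU(0.2),
    )

def dec_block(c_in, c_out):
    # decoding convolution -> batch normalization -> upsampling
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.Upsample(scale_factor=2),
    )

class CNNEnhancer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(enc_block(1, 16), enc_block(16, 32))
        self.decoder = nn.Sequential(dec_block(32, 16), dec_block(16, 1))

    def forward(self, x):                                 # x: (batch, 1, freq_bins, frames)
        f, t = x.shape[-2:]
        x = F.pad(x, (0, (-t) % 4, 0, (-f) % 4))          # pad so the two 2x poolings divide evenly
        y = self.decoder(self.encoder(x))
        return y[..., :f, :t]                             # crop back to the input size
```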
In step 2, the frequency-domain magnitude spectra of the noisy speech in the training set data are:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
in step 2, the frequency-domain magnitude spectra of the clean speech in the training set data are:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
in step 2, the frequency-domain magnitude spectrum estimates output by the model are:
{Ŷ_i , i ∈ [1, K]}, Ŷ_i = {Ŷ_{i,j} , j ∈ [1, N_i]}
where Ŷ_i denotes the frequency-domain magnitude spectrum estimate output by the network when the i-th noisy utterance is input, Ŷ_{i,j} is the j-th frame of the i-th magnitude spectrum estimate, and N_i is the total number of frames of the i-th magnitude spectrum estimate;
in step 2, the estimated speech time-domain waveform data are:
{ŷ_i , i ∈ [1, K]}, ŷ_i = {ŷ_{i,j} , j ∈ [1, T_i]}
where ŷ_i denotes the time-domain waveform data of the i-th estimated utterance, ŷ_{i,j} is the amplitude of the j-th sample of the i-th estimated utterance, and T_i is the total number of samples of the i-th estimated utterance;
the frequency-domain loss in step 2 is the mean square error of the frequency-domain magnitude spectra:
L_{i,F} = (1 / N_i) · Σ_{j=1}^{N_i} (Y_{i,j} − Ŷ_{i,j})²
where L_{i,F} is the frequency-domain loss when the i-th noisy utterance is the input;
the time-domain loss in step 2 is the mean square error of the time-domain waveform amplitudes:
L_{i,W} = (1 / T_i) · Σ_{j=1}^{T_i} (y_{i,j} − ŷ_{i,j})²
where L_{i,W} is the time-domain loss when the i-th noisy utterance is the input;
the time-frequency domain joint loss in step 2 is:
L_{i,total} = L_{i,F} + α · L_{i,W}
where L_{i,total} is the time-frequency domain joint loss when the i-th noisy utterance is input, and α = 0.15 is the hyperparameter weighting coefficient.
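The joint loss above can be sketched in PyTorch as follows (a sketch assuming the same STFT parameters as before, with torch.istft performing the waveform reconstruction and α = 0.15):

```python
# Sketch of the time-frequency joint loss L_total = L_F + alpha * L_W with alpha = 0.15.
# Assumes magnitude/phase tensors from a 256-point one-sided STFT with hop 64 (129 frequency bins).
import torch
import torch.nn.functional as F

def joint_tf_loss(mag_est, mag_clean, phase_noisy, wav_clean, alpha=0.15):
    # frequency-domain loss: MSE between the clean and estimated magnitude spectra
    loss_f = F.mse_loss(mag_est, mag_clean)
    # waveform reconstruction: estimated magnitude combined with the noisy phase, then ISTFT
    spec_est = torch.polar(mag_est, phase_noisy)
    wav_est = torch.istft(spec_est, n_fft=256, hop_length=64,
                          window=torch.hamming_window(256, device=mag_est.device),
                          length=wav_clean.shape[-1])
    # time-domain loss: MSE between the clean and estimated waveforms
    loss_w = F.mse_loss(wav_est, wav_clean)
    return loss_f + alpha * loss_w
```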
Step 3: using Adam as the optimizer, update the weight matrices of the convolutional layers according to the time-frequency domain joint loss and proceed to the next iteration until training is complete, obtaining the optimized network weight parameters and thus the optimized CNN network model.
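The step-3 optimization can then be sketched as follows, reusing the CNNEnhancer and joint_tf_loss sketches above; the learning rate, epoch count, and the data loader are assumptions:

```python
# Sketch of the step-3 optimization with Adam, reusing CNNEnhancer and joint_tf_loss from above.
# The learning rate, epoch count, and the train_loader yielding
# (noisy_mag, noisy_phase, clean_mag, clean_wav) batches are assumptions.
import torch

model = CNNEnhancer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    for noisy_mag, noisy_phase, clean_mag, clean_wav in train_loader:
        mag_est = model(noisy_mag.unsqueeze(1)).squeeze(1)           # add/remove the channel dimension
        loss = joint_tf_loss(mag_est, clean_mag, noisy_phase, clean_wav)
        optimizer.zero_grad()
        loss.backward()                                              # back-propagate the joint loss
        optimizer.step()                                             # update convolutional-layer weights
```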
The second embodiment of the present invention is a specific test procedure, comprising the following steps (a code sketch of steps 1-3 follows step 4):
Step 1: collect the noisy speech and convert it into a magnitude spectrum and a phase spectrum.
Step 2: feed the magnitude spectrum into the enhancement model to obtain the enhanced speech magnitude spectrum.
Step 3: perform the inverse short-time Fourier transform (ISTFT) with the phase spectrum of the noisy speech and the magnitude spectrum of the estimation result to obtain the enhanced speech waveform, and write it to a wav file.
Step 4: perform speech recognition on the generated speech files and compare the recognition accuracy achieved with different enhancement models.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and shall fall within the protection scope of the present invention.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or the scope defined by the appended claims.

Claims (1)

1. A speech enhancement method based on a time-frequency domain joint loss function is characterized by comprising the following steps:
Step 1: mix the clean speech data set and the noise data from an open-source data set into a noisy speech data set; frame and overlap each clean utterance in the clean speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum, constructing the clean speech frequency-domain magnitude spectrum data set; sample and frame each clean utterance with a Hamming window to obtain its waveform data, forming the clean speech time-domain waveform data set; frame and overlap each noisy utterance in the noisy speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum and frequency-domain phase spectrum, forming the noisy speech frequency-domain magnitude spectrum data set and the noisy speech frequency-domain phase spectrum data set; and construct the network training set data from the clean speech frequency-domain magnitude spectrum data set, the clean speech time-domain waveform data set, the noisy speech frequency-domain magnitude spectrum data set, and the noisy speech frequency-domain phase spectrum data set;
Step 2: construct a CNN network model, taking the noisy speech frequency-domain magnitude spectra in the network training set data as the model input data set and the clean speech frequency-domain magnitude spectra as the training target set; each time the network takes a noisy speech frequency-domain magnitude spectrum, the corresponding clean speech frequency-domain magnitude spectrum is used as the label, and the CNN network model predicts the clean speech frequency-domain magnitude spectrum from the noisy one to obtain a frequency-domain magnitude spectrum estimate; combine the frequency-domain magnitude spectrum estimate with the frequency-domain phase spectrum of the noisy speech and perform waveform reconstruction by the inverse short-time Fourier transform to obtain the enhanced speech; sample, frame, overlap, and apply a Hamming window to the enhanced speech to obtain the estimated speech time-domain waveform data;
calculate the frequency-domain loss from the clean speech frequency-domain magnitude spectrum and the frequency-domain magnitude spectrum estimate; calculate the time-domain loss from the clean speech time-domain waveform data and the estimated speech time-domain waveform data; and construct the time-frequency domain joint loss from the frequency-domain loss and the time-domain loss;
Step 3: using Adam as the optimizer, update the weight matrices of the convolutional layers according to the time-frequency domain joint loss and proceed to the next iteration until training is complete, obtaining the optimized network weight parameters and thus the optimized CNN network model.
CN202110155444.2A 2021-02-04 2021-02-04 Voice enhancement method based on time-frequency domain joint loss function Expired - Fee Related CN112927709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110155444.2A CN112927709B (en) 2021-02-04 2021-02-04 Voice enhancement method based on time-frequency domain joint loss function


Publications (2)

Publication Number Publication Date
CN112927709A CN112927709A (en) 2021-06-08
CN112927709B true CN112927709B (en) 2022-06-14

Family

ID=76170408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110155444.2A Expired - Fee Related CN112927709B (en) 2021-02-04 2021-02-04 Voice enhancement method based on time-frequency domain joint loss function

Country Status (1)

Country Link
CN (1) CN112927709B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436640B (en) * 2021-06-28 2022-11-25 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113707164A (en) * 2021-09-02 2021-11-26 哈尔滨理工大学 Voice enhancement method for improving multi-resolution residual error U-shaped network
CN115240648B (en) * 2022-07-18 2023-04-07 四川大学 Controller voice enhancement method and device facing voice recognition


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013127364A1 (en) * 2012-03-01 2013-09-06 华为技术有限公司 Voice frequency signal processing method and device
CN110503967A (en) * 2018-05-17 2019-11-26 ***通信有限公司研究院 A kind of sound enhancement method, device, medium and equipment
CN109360581A (en) * 2018-10-12 2019-02-19 平安科技(深圳)有限公司 Sound enhancement method, readable storage medium storing program for executing and terminal device neural network based
CN111081268A (en) * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN111696568A (en) * 2020-06-16 2020-09-22 中国科学技术大学 Semi-supervised transient noise suppression method
CN112185405A (en) * 2020-09-10 2021-01-05 中国科学技术大学 Bone conduction speech enhancement method based on differential operation and joint dictionary learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Meng Xiangcai et al., "An improved monaural speech enhancement algorithm based on phase influence," Audio Engineering, 2016, No. 11. *
Zeng Liang, "Matlab simulation of a wavelet-transform-based speech enhancement algorithm," Software Guide, 2013, No. 10. *
Qin Huanding et al., "Research on an improved speech enhancement algorithm based on the minimum mean square error magnitude spectrum," Electronic Technology, 2016, No. 07. *
Hu Kekai et al., "A new speech enhancement algorithm based on improved spectral subtraction," Popular Science & Technology, 2008, No. 09. *

Also Published As

Publication number Publication date
CN112927709A (en) 2021-06-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220614