CN112927709B - Voice enhancement method based on time-frequency domain joint loss function - Google Patents


Info

Publication number
CN112927709B
CN112927709B (application number CN202110155444.2A)
Authority
CN
China
Prior art keywords
voice
frequency domain
data set
amplitude spectrum
clean
Prior art date
Legal status
Expired - Fee Related
Application number
CN202110155444.2A
Other languages
Chinese (zh)
Other versions
CN112927709A (en)
Inventor
高戈
王霄
陈怡
杨玉红
曾邦
尹文兵
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University (WHU)
Priority to CN202110155444.2A
Publication of CN112927709A
Application granted
Publication of CN112927709B
Legal status: Expired - Fee Related

Classifications

    • G10L 21/0232 — Speech enhancement; noise filtering; processing in the frequency domain
    • G10L 21/0224 — Speech enhancement; noise filtering; processing in the time domain
    • G10L 25/30 — Speech or voice analysis techniques using neural networks
    • G06N 3/045 — Neural networks; architecture; combinations of networks
    • G06N 3/08 — Neural networks; learning methods


Abstract

The invention provides a speech enhancement method based on a time-frequency domain joint loss function. A clean speech data set and a noise data set from an open-source corpus are mixed into a noisy speech data set, which is converted by preprocessing into magnitude spectra, phase spectra, and waveform data to construct a training set. A CNN network model is built and trained with the noisy speech magnitude spectrum as input and the clean speech magnitude spectrum as the label. The magnitude spectrum estimate output by the model is combined with the phase spectrum of the noisy speech and reconstructed into a waveform by the inverse short-time Fourier transform, giving the time-domain waveform of the estimated speech. The frequency-domain loss is computed from the clean speech magnitude spectrum and the magnitude spectrum estimate; the time-domain loss is computed from the clean speech waveform and the estimated speech waveform; and a time-frequency joint loss constructed from the two guides the weight optimization of the CNN network model. The invention reduces the mismatch between the estimated magnitude spectrum and the phase spectrum and improves the speech enhancement effect.

Description

Voice enhancement method based on time-frequency domain joint loss function
Technical Field
The invention relates to the field of speech enhancement, and in particular to a speech enhancement method based on a time-frequency domain joint loss function.
Background
Voice communication is the most convenient mode of information interaction between people and machines. However, in virtually any environment, voice communication is disturbed to some degree by ambient noise. Speech enhancement technology is an effective way to counter the influence of noise in the voice interaction process. The purpose of speech enhancement is to extract the clean speech signal from the background noise as far as possible, suppress the environmental noise, and improve speech quality and speech intelligibility.
In recent years, with the boom in artificial intelligence technology, speech enhancement technology has also developed rapidly, and a variety of speech enhancement techniques have emerged. These schemes fall mainly into two groups: conventional speech enhancement schemes and deep-learning-based speech enhancement schemes.
Conventional speech enhancement schemes mainly include spectral subtraction, statistical-model-based enhancement algorithms, and subspace enhancement algorithms. Spectral subtraction assumes that the noise is additive, subtracts an estimate of the noise spectrum from the spectrum of the noisy speech, and thereby obtains an estimate of the clean speech. The Wiener filtering algorithm and the minimum mean square error (MMSE) algorithm are representative statistical-model-based enhancement algorithms; compared with spectral subtraction, the residual noise left by Wiener filtering resembles white noise, which is more comfortable to the ear. The MMSE algorithm exploits the important perceptual role of the short-time spectral amplitude of the speech signal and enhances the noisy speech with a minimum mean square error short-time spectral amplitude estimator. The subspace enhancement algorithm is derived mainly from linear algebra: in Euclidean space, the clean signal is assumed to be confined to a signal subspace, so the speech enhancement task can be accomplished by decomposing the vector space of the noisy signal into two subspaces.
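As an illustration of the spectral-subtraction principle described above, the following minimal sketch (not part of the patent; the noise spectrum is assumed to come from a few leading noise-only frames, and the file names are hypothetical) subtracts an estimated noise magnitude spectrum from the noisy magnitude spectrum and resynthesizes with the noisy phase:

```python
# Illustrative spectral-subtraction sketch (not the patented method).
# Assumes the first few frames of the recording contain noise only; file names are hypothetical.
import numpy as np
import librosa
import soundfile as sf

def spectral_subtraction(noisy, n_fft=256, hop=64, noise_frames=10):
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop, window="hamming")
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)   # noise spectrum estimate
    clean_mag = np.maximum(mag - noise_mag, 0.0)                    # subtract and floor at zero
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop, window="hamming")

noisy, sr = librosa.load("noisy_example.wav", sr=16000)
sf.write("enhanced_specsub.wav", spectral_subtraction(noisy), sr)
```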
Conventional speech enhancement algorithms mostly assume that the speech signal is stationary. In real life, however, this assumption rarely holds. Deep-learning-based speech enhancement algorithms can effectively address this problem through their strong nonlinear fitting capability. According to the training target, they can be divided into two categories: mask-based enhancement networks and mapping-based enhancement networks. A mask-based enhancement network uses an ideal ratio mask, a phase mask, or a similar target as the training target of the neural network. A mapping-based enhancement network uses the fitting capability of the neural network to map the log spectrum or power spectrum of the noisy speech directly to that of the clean speech. According to the network model used, deep-learning-based speech enhancement networks can be classified into DNN, CNN, RNN, and GAN enhancement networks.
Feature processing of the spectrogram is the key to a deep-learning-based speech enhancement network. Therefore, CNN networks are better suited to speech enhancement tasks than other network models.
In the process of implementing the invention, the inventors found that the prior-art methods have at least the following technical problems:
A speech enhancement network usually uses the noisy speech magnitude spectrum to generate the magnitude spectrum of the estimated speech and uses the noisy speech phase spectrum for waveform reconstruction, which can cause a mismatch between the magnitude spectrum and the phase spectrum. A speech enhancement algorithm is usually used as the front-end module of speech recognition, so its performance greatly affects that of the recognition system. The mismatch between the magnitude spectrum and the phase spectrum may damage feature information of the speech signal, such as Mel-frequency cepstral coefficients (MFCC), thereby reducing the accuracy of speech recognition.
Therefore, the prior-art methods are deficient when used as the speech enhancement front end of a speech recognition system, and solving the phase mismatch problem in speech enhancement is of important practical significance.
Disclosure of Invention
Speech enhancement models based on deep neural networks generally transform the noisy speech signal into the frequency domain by the short-time Fourier transform (STFT). They typically use the mean square error (MSE) between the frequency-domain magnitude spectrum of the clean speech and that of the estimate as the loss function to learn the mapping between magnitude spectra. The traditional approach reuses the phase spectrum of the noisy speech to generate the estimated signal, which causes a mismatch between the magnitude spectrum and the phase spectrum and degrades the enhancement effect. The invention aims to reduce this mismatch between the magnitude spectrum and the phase spectrum of the enhanced speech by using a joint loss function over the time domain and the frequency domain.
In view of the defects of the traditional approach, the invention provides a speech enhancement model based on a time-frequency domain joint loss function. The invention replaces the single frequency-domain loss function of the traditional scheme with a time-frequency domain joint loss function. By introducing a time-domain loss, the waveform error between the estimated signal and the clean speech signal in the time domain is also penalized, which reduces the impact of the magnitude-phase mismatch present in the traditional scheme. By reducing the loss and corruption of information in the enhanced speech, the performance of the speech enhancement algorithm as a speech recognition front-end module is improved. The purpose of the invention is achieved through the following technical solution:
The invention provides a speech enhancement method based on a time-frequency domain joint loss function, characterized by comprising the following steps:
Step 1: mix the clean speech data set and the noise data from an open-source data set into a noisy speech data set; frame and overlap each clean utterance in the clean speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum, constructing the clean speech frequency-domain magnitude spectrum data set; sample and frame each clean utterance with a Hamming window to obtain its waveform data, forming the clean speech time-domain waveform data set; frame and overlap each noisy utterance in the noisy speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum and frequency-domain phase spectrum, forming the noisy speech frequency-domain magnitude spectrum data set and the noisy speech frequency-domain phase spectrum data set; and construct the network training set data from the clean speech frequency-domain magnitude spectrum data set, the clean speech time-domain waveform data set, the noisy speech frequency-domain magnitude spectrum data set, and the noisy speech frequency-domain phase spectrum data set;
Step 2: construct a CNN network model, taking the noisy speech frequency-domain magnitude spectra in the network training set data as the model input data set and the clean speech frequency-domain magnitude spectra as the training target set; each time the network takes a noisy speech frequency-domain magnitude spectrum, the corresponding clean speech frequency-domain magnitude spectrum is used as the label, and the CNN network model predicts the clean speech frequency-domain magnitude spectrum from the noisy one to obtain a frequency-domain magnitude spectrum estimate; combine the frequency-domain magnitude spectrum estimate with the frequency-domain phase spectrum of the noisy speech and perform waveform reconstruction by the inverse short-time Fourier transform to obtain the enhanced speech; sample, frame, overlap, and apply a Hamming window to the enhanced speech to obtain the estimated speech time-domain waveform data;
calculate the frequency-domain loss from the clean speech frequency-domain magnitude spectrum and the frequency-domain magnitude spectrum estimate; calculate the time-domain loss from the clean speech time-domain waveform data and the estimated speech time-domain waveform data; and construct the time-frequency domain joint loss from the frequency-domain loss and the time-domain loss;
Step 3: using Adam as the optimizer, update the weight matrices of the convolutional layers according to the time-frequency domain joint loss and proceed to the next iteration until training is complete, obtaining the optimized network weight parameters and thus the optimized CNN network model.
Preferably, the clean speech data set in step 1 is:
{C_i , i ∈ [1, K]}
where C_i is the i-th clean utterance in the clean speech data set and K is the number of clean utterances;
the noise data set in step 1 is:
{N_i , i ∈ [1, K′]}
where K′ is the number of noise recordings in the noise data set;
the synthesized noisy speech data set in step 1 is:
{R_i , i ∈ [1, K]}
where R_i is the i-th synthesized noisy utterance in the noisy speech data set and K is the number of noisy utterances;
the clean speech frequency-domain magnitude spectrum data set in step 1 is:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
the clean speech time-domain waveform data set in step 1 is:
{y_i , i ∈ [1, K]}, y_i = {y_{i,j} , j ∈ [1, T_i]}
where y_i denotes the waveform data of the i-th clean utterance, y_{i,j} is the value of the j-th sample of the i-th clean utterance, and T_i is the total number of samples of the i-th clean utterance;
the noisy speech frequency-domain magnitude spectrum data set in step 1 is:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
the noisy speech frequency-domain phase spectrum data set in step 1 is:
{P_i , i ∈ [1, K]}
where P_i denotes the frequency-domain phase spectrum of the i-th noisy utterance and K is the number of noisy speech phase spectra in the data set;
the network training set data constructed in step 1 is:
{M_i , P_i , Y_i , y_i , i ∈ [1, K]}
where K is the total number of utterances in the noisy speech data set;
preferably, the pre-trained CNN network model in step 2 is formed by cascading an encoder and a decoder;
the encoder is formed by sequentially cascading a coding convolution layer, a coding batch normalization layer, a maximum pooling layer and a Leaky ReLu activation layer, wherein the coding convolution layer relates to weight updating;
the decoder is formed by sequentially cascading a decoding convolution layer, a decoding batch normalization layer and an upper sampling layer, wherein the decoding convolution layer relates to weight updating;
in step 2, the frequency-domain magnitude spectra of the noisy speech in the training set data are:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
in step 2, the frequency-domain magnitude spectra of the clean speech in the training set data are:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
in step 2, the frequency-domain magnitude spectrum estimates output by the model are:
{Ŷ_i , i ∈ [1, K]}, Ŷ_i = {Ŷ_{i,j} , j ∈ [1, N_i]}
where Ŷ_i denotes the frequency-domain magnitude spectrum estimate output by the network when the i-th noisy utterance is input, Ŷ_{i,j} is the j-th frame of the i-th magnitude spectrum estimate, and N_i is the total number of frames of the i-th magnitude spectrum estimate;
in step 2, the estimated speech time-domain waveform data are:
{ŷ_i , i ∈ [1, K]}, ŷ_i = {ŷ_{i,j} , j ∈ [1, T_i]}
where ŷ_i denotes the time-domain waveform data of the i-th estimated utterance, ŷ_{i,j} is the amplitude of the j-th sample of the i-th estimated utterance, and T_i is the total number of samples of the i-th estimated utterance;
the frequency-domain loss in step 2 is the mean square error of the frequency-domain magnitude spectra:
L_{i,F} = (1 / N_i) · Σ_{j=1}^{N_i} (Y_{i,j} − Ŷ_{i,j})²
where L_{i,F} is the frequency-domain loss when the i-th noisy utterance is the input;
the time-domain loss in step 2 is the mean square error of the time-domain waveform amplitudes:
L_{i,W} = (1 / T_i) · Σ_{j=1}^{T_i} (y_{i,j} − ŷ_{i,j})²
where L_{i,W} is the time-domain loss when the i-th noisy utterance is the input;
the time-frequency domain joint loss in step 2 is:
L_{i,total} = L_{i,F} + α · L_{i,W}
where L_{i,total} is the time-frequency domain joint loss when the i-th noisy utterance is input, and α is a hyperparameter weighting coefficient.
Compared with the prior art, the invention has the following advantages and beneficial effects: the mismatch between the estimated magnitude spectrum and the phase spectrum is reduced, and the speech enhancement effect is improved to a certain extent.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a structural schematic diagram of the CNN speech enhancement network of the present invention.
Detailed Description
This embodiment implements training and testing based on the AISHELL speech corpus and the MUSAN noise corpus.
As shown in FIG. 1, this embodiment performs speech enhancement training based on a convolutional neural network (CNN) model, and compares the speech enhancement performance with existing algorithms by replacing the loss function with the time-frequency domain joint loss function.
The first embodiment of the present invention is a speech enhancement method based on a time-frequency domain joint loss function, and specifically includes the following steps:
Step 1: mix the clean speech data set and the noise data from an open-source data set into a noisy speech data set; frame and overlap each clean utterance in the clean speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum, constructing the clean speech frequency-domain magnitude spectrum data set; sample and frame each clean utterance with a Hamming window to obtain its waveform data, forming the clean speech time-domain waveform data set; frame and overlap each noisy utterance in the noisy speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum and frequency-domain phase spectrum, forming the noisy speech frequency-domain magnitude spectrum data set and the noisy speech frequency-domain phase spectrum data set; and construct the network training set data from the clean speech frequency-domain magnitude spectrum data set, the clean speech time-domain waveform data set, the noisy speech frequency-domain magnitude spectrum data set, and the noisy speech frequency-domain phase spectrum data set;
The clean speech data set in step 1 is:
{C_i , i ∈ [1, K]}
where C_i is the i-th clean utterance in the clean speech data set and K = 1024 is the number of clean utterances;
the noise data set in step 1 is:
{N_i , i ∈ [1, K′]}
where K′ is the number of noise recordings in the noise data set;
the synthesized noisy speech data set in step 1 is:
{R_i , i ∈ [1, K]}
where R_i is the i-th synthesized noisy utterance in the noisy speech data set and K = 1024 is the number of noisy utterances;
in step 1, the short-time Fourier transform parameters are 16 kHz sampling, a window length of 256, and 75% window overlap; the number of sampling points and the number of frames are calculated as:
T = t × 16000
N = ⌊(T − 256) / 64⌋ + 1
where t is the duration of the speech in seconds, T is the total number of sampling points of the speech, and N is the number of frames of the frequency-domain magnitude spectrum of the speech.
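Under these parameters, the step-1 preprocessing can be sketched as follows (a minimal sketch assuming librosa for the STFT; the file names are hypothetical and iteration over the whole data set is omitted):

```python
# Sketch of the step-1 preprocessing under the stated parameters:
# 16 kHz sampling, window length 256, 75% overlap (hop = 256 * 0.25 = 64), Hamming window.
import numpy as np
import librosa

N_FFT, HOP = 256, 64

def preprocess(path):
    wav, _ = librosa.load(path, sr=16000)                # time-domain waveform (about t * 16000 samples)
    spec = librosa.stft(wav, n_fft=N_FFT, hop_length=HOP, window="hamming")
    return np.abs(spec), np.angle(spec), wav             # magnitude spectrum, phase spectrum, waveform

# One training example: noisy magnitude/phase plus the corresponding clean magnitude and waveform.
M_i, P_i, _ = preprocess("noisy_0001.wav")               # hypothetical file names
Y_i, _, y_i = preprocess("clean_0001.wav")
```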
The clean speech frequency-domain magnitude spectrum data set in step 1 is:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
the clean speech time-domain waveform data set in step 1 is:
{y_i , i ∈ [1, K]}, y_i = {y_{i,j} , j ∈ [1, T_i]}
where y_i denotes the waveform data of the i-th clean utterance, y_{i,j} is the value of the j-th sample of the i-th clean utterance, and T_i is the total number of samples of the i-th clean utterance at 16 kHz sampling;
the noisy speech frequency-domain magnitude spectrum data set in step 1 is:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
the noisy speech frequency-domain phase spectrum data set in step 1 is:
{P_i , i ∈ [1, K]}
where P_i denotes the frequency-domain phase spectrum of the i-th noisy utterance and K is the number of noisy speech phase spectra in the data set;
the network training set data constructed in step 1 is:
{M_i , P_i , Y_i , y_i , i ∈ [1, K]}
where K is the total number of utterances in the noisy speech data set;
step 2, a CNN network model is constructed, a frequency domain amplitude spectrum of the voice with noise in the network training set data is used as a model input data set, a frequency domain amplitude spectrum of the clean voice is used as a training target set, the network acquires the frequency domain amplitude spectrum of the voice with noise each time, the frequency domain amplitude spectrum of the clean voice corresponding to the frequency domain amplitude spectrum is used as a label, and the CNN network model predicts the frequency domain amplitude spectrum of the clean voice according to the frequency domain amplitude spectrum of the voice with noise to obtain a frequency domain amplitude spectrum estimated value; combining the frequency domain amplitude spectrum estimation value with a frequency domain phase spectrum of the voice with noise, and further performing waveform reconstruction by an inverse short-time Fourier transform method to obtain enhanced voice; sampling, framing and overlapping the enhanced voice and adding a Hamming window to obtain waveform data on an estimated voice time domain;
Calculating the loss of the frequency domain through the frequency domain amplitude spectrum of the clean voice and the frequency domain amplitude spectrum estimation value; calculating time domain loss through time domain waveform data of the clean voice and waveform data on an estimated voice time domain; constructing time-frequency domain combined loss according to the frequency domain loss and the time domain loss;
The CNN network model is formed by cascading an encoder and a decoder;
the encoder is formed by sequentially cascading an encoding convolutional layer, an encoding batch normalization layer, a max pooling layer, and a Leaky ReLU activation layer, where the encoding convolutional layer involves weight updating;
the decoder is formed by sequentially cascading a decoding convolutional layer, a decoding batch normalization layer, and an upsampling layer, where the decoding convolutional layer involves weight updating;
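A minimal PyTorch sketch of such an encoder-decoder is given below; only the layer types follow the description above, while the number of blocks, channel widths, and kernel sizes are assumptions:

```python
# Minimal PyTorch sketch of the encoder-decoder CNN described above.
# Only the layer types follow the text; block count, channel widths and kernel sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def enc_block(c_in, c_out):
    # encoding convolution -> batch normalization -> max pooling -> Leaky ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.MaxPool2d(2),
        nn.LeakyReLU(0.2),
    )

def dec_block(c_in, c_out):
    # decoding convolution -> batch normalization -> upsampling
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.Upsample(scale_factor=2),
    )

class CNNEnhancer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(enc_block(1, 16), enc_block(16, 32))
        self.decoder = nn.Sequential(dec_block(32, 16), dec_block(16, 1))

    def forward(self, x):                                 # x: (batch, 1, freq_bins, frames)
        f, t = x.shape[-2:]
        x = F.pad(x, (0, (-t) % 4, 0, (-f) % 4))          # pad so the two 2x poolings divide evenly
        y = self.decoder(self.encoder(x))
        return y[..., :f, :t]                             # crop back to the input size
```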
In step 2, the frequency-domain magnitude spectra of the noisy speech in the training set data are:
{M_i , i ∈ [1, K]}, M_i = {M_{i,j} , j ∈ [1, N_i]}
where M_i denotes the frequency-domain magnitude spectrum of the i-th noisy utterance, M_{i,j} is the j-th frame of magnitude data of the i-th noisy utterance, and N_i is the total number of frames of the i-th noisy utterance;
in step 2, the frequency-domain magnitude spectra of the clean speech in the training set data are:
{Y_i , i ∈ [1, K]}, Y_i = {Y_{i,j} , j ∈ [1, N_i]}
where Y_i denotes the frequency-domain magnitude spectrum of the i-th clean utterance, Y_{i,j} is the j-th frame of magnitude data of the i-th clean utterance, and N_i is the total number of frames of the i-th clean utterance;
in step 2, the frequency-domain magnitude spectrum estimates output by the model are:
{Ŷ_i , i ∈ [1, K]}, Ŷ_i = {Ŷ_{i,j} , j ∈ [1, N_i]}
where Ŷ_i denotes the frequency-domain magnitude spectrum estimate output by the network when the i-th noisy utterance is input, Ŷ_{i,j} is the j-th frame of the i-th magnitude spectrum estimate, and N_i is the total number of frames of the i-th magnitude spectrum estimate;
in step 2, the estimated speech time-domain waveform data are:
{ŷ_i , i ∈ [1, K]}, ŷ_i = {ŷ_{i,j} , j ∈ [1, T_i]}
where ŷ_i denotes the time-domain waveform data of the i-th estimated utterance, ŷ_{i,j} is the amplitude of the j-th sample of the i-th estimated utterance, and T_i is the total number of samples of the i-th estimated utterance;
the frequency-domain loss in step 2 is the mean square error of the frequency-domain magnitude spectra:
L_{i,F} = (1 / N_i) · Σ_{j=1}^{N_i} (Y_{i,j} − Ŷ_{i,j})²
where L_{i,F} is the frequency-domain loss when the i-th noisy utterance is the input;
the time-domain loss in step 2 is the mean square error of the time-domain waveform amplitudes:
L_{i,W} = (1 / T_i) · Σ_{j=1}^{T_i} (y_{i,j} − ŷ_{i,j})²
where L_{i,W} is the time-domain loss when the i-th noisy utterance is the input;
the time-frequency domain joint loss in step 2 is:
L_{i,total} = L_{i,F} + α · L_{i,W}
where L_{i,total} is the time-frequency domain joint loss when the i-th noisy utterance is input, and α = 0.15 is the hyperparameter weighting coefficient.
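The joint loss above can be sketched in PyTorch as follows (a sketch assuming the same STFT parameters as before, with torch.istft performing the waveform reconstruction and α = 0.15):

```python
# Sketch of the time-frequency joint loss L_total = L_F + alpha * L_W with alpha = 0.15.
# Assumes magnitude/phase tensors from a 256-point one-sided STFT with hop 64 (129 frequency bins).
import torch
import torch.nn.functional as F

def joint_tf_loss(mag_est, mag_clean, phase_noisy, wav_clean, alpha=0.15):
    # frequency-domain loss: MSE between the clean and estimated magnitude spectra
    loss_f = F.mse_loss(mag_est, mag_clean)
    # waveform reconstruction: estimated magnitude combined with the noisy phase, then ISTFT
    spec_est = torch.polar(mag_est, phase_noisy)
    wav_est = torch.istft(spec_est, n_fft=256, hop_length=64,
                          window=torch.hamming_window(256, device=mag_est.device),
                          length=wav_clean.shape[-1])
    # time-domain loss: MSE between the clean and estimated waveforms
    loss_w = F.mse_loss(wav_est, wav_clean)
    return loss_f + alpha * loss_w
```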
Step 3: using Adam as the optimizer, update the weight matrices of the convolutional layers according to the time-frequency domain joint loss and proceed to the next iteration until training is complete, obtaining the optimized network weight parameters and thus the optimized CNN network model.
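The step-3 optimization can then be sketched as follows, reusing the CNNEnhancer and joint_tf_loss sketches above; the learning rate, epoch count, and the data loader are assumptions:

```python
# Sketch of the step-3 optimization with Adam, reusing CNNEnhancer and joint_tf_loss from above.
# The learning rate, epoch count, and the train_loader yielding
# (noisy_mag, noisy_phase, clean_mag, clean_wav) batches are assumptions.
import torch

model = CNNEnhancer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(50):
    for noisy_mag, noisy_phase, clean_mag, clean_wav in train_loader:
        mag_est = model(noisy_mag.unsqueeze(1)).squeeze(1)           # add/remove the channel dimension
        loss = joint_tf_loss(mag_est, clean_mag, noisy_phase, clean_wav)
        optimizer.zero_grad()
        loss.backward()                                              # back-propagate the joint loss
        optimizer.step()                                             # update convolutional-layer weights
```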
The second embodiment of the present invention is a specific test procedure, comprising the following steps (a code sketch of steps 1-3 follows step 4):
Step 1: collect the noisy speech and convert it into a magnitude spectrum and a phase spectrum.
Step 2: feed the magnitude spectrum into the enhancement model to obtain the enhanced speech magnitude spectrum.
Step 3: perform the inverse short-time Fourier transform (ISTFT) with the phase spectrum of the noisy speech and the magnitude spectrum of the estimation result to obtain the enhanced speech waveform, and write it to a wav file.
Step 4: perform speech recognition on the generated speech files and compare the recognition accuracy achieved with different enhancement models.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; any change, modification, substitution, combination, or simplification that does not depart from the spirit and principle of the present invention shall be regarded as an equivalent replacement and shall fall within the protection scope of the present invention.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications or additions to the described embodiments, or substitute them in similar ways, without departing from the spirit of the invention or the scope defined by the appended claims.

Claims (1)

1. A speech enhancement method based on a time-frequency domain joint loss function is characterized by comprising the following steps:
Step 1: mix the clean speech data set and the noise data from an open-source data set into a noisy speech data set; frame and overlap each clean utterance in the clean speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum, constructing the clean speech frequency-domain magnitude spectrum data set; sample and frame each clean utterance with a Hamming window to obtain its waveform data, forming the clean speech time-domain waveform data set; frame and overlap each noisy utterance in the noisy speech data set by the short-time Fourier transform to obtain its frequency-domain magnitude spectrum and frequency-domain phase spectrum, forming the noisy speech frequency-domain magnitude spectrum data set and the noisy speech frequency-domain phase spectrum data set; and construct the network training set data from the clean speech frequency-domain magnitude spectrum data set, the clean speech time-domain waveform data set, the noisy speech frequency-domain magnitude spectrum data set, and the noisy speech frequency-domain phase spectrum data set;
Step 2: construct a CNN network model, taking the noisy speech frequency-domain magnitude spectra in the network training set data as the model input data set and the clean speech frequency-domain magnitude spectra as the training target set; each time the network takes a noisy speech frequency-domain magnitude spectrum, the corresponding clean speech frequency-domain magnitude spectrum is used as the label, and the CNN network model predicts the clean speech frequency-domain magnitude spectrum from the noisy one to obtain a frequency-domain magnitude spectrum estimate; combine the frequency-domain magnitude spectrum estimate with the frequency-domain phase spectrum of the noisy speech and perform waveform reconstruction by the inverse short-time Fourier transform to obtain the enhanced speech; sample, frame, overlap, and apply a Hamming window to the enhanced speech to obtain the estimated speech time-domain waveform data;
calculate the frequency-domain loss from the clean speech frequency-domain magnitude spectrum and the frequency-domain magnitude spectrum estimate; calculate the time-domain loss from the clean speech time-domain waveform data and the estimated speech time-domain waveform data; and construct the time-frequency domain joint loss from the frequency-domain loss and the time-domain loss;
Step 3: using Adam as the optimizer, update the weight matrices of the convolutional layers according to the time-frequency domain joint loss and proceed to the next iteration until training is complete, obtaining the optimized network weight parameters and thus the optimized CNN network model.
CN202110155444.2A 2021-02-04 2021-02-04 Voice enhancement method based on time-frequency domain joint loss function Expired - Fee Related CN112927709B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110155444.2A CN112927709B (en) 2021-02-04 2021-02-04 Voice enhancement method based on time-frequency domain joint loss function


Publications (2)

Publication Number Publication Date
CN112927709A CN112927709A (en) 2021-06-08
CN112927709B true CN112927709B (en) 2022-06-14

Family

ID=76170408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110155444.2A Expired - Fee Related CN112927709B (en) 2021-02-04 2021-02-04 Voice enhancement method based on time-frequency domain joint loss function

Country Status (1)

Country Link
CN (1) CN112927709B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436640B (en) * 2021-06-28 2022-11-25 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113707164A (en) * 2021-09-02 2021-11-26 哈尔滨理工大学 Voice enhancement method for improving multi-resolution residual error U-shaped network
CN115240648B (en) * 2022-07-18 2023-04-07 四川大学 Controller voice enhancement method and device facing voice recognition


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013127364A1 (en) * 2012-03-01 2013-09-06 华为技术有限公司 Voice frequency signal processing method and device
CN110503967A (en) * 2018-05-17 2019-11-26 ***通信有限公司研究院 A kind of sound enhancement method, device, medium and equipment
CN109360581A (en) * 2018-10-12 2019-02-19 平安科技(深圳)有限公司 Sound enhancement method, readable storage medium storing program for executing and terminal device neural network based
CN111081268A (en) * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN111696568A (en) * 2020-06-16 2020-09-22 中国科学技术大学 Semi-supervised transient noise suppression method
CN112185405A (en) * 2020-09-10 2021-01-05 中国科学技术大学 Bone conduction speech enhancement method based on differential operation and joint dictionary learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Meng Xiangcai et al., "An improved monaural speech enhancement algorithm based on phase influence," Audio Engineering, 2016, No. 11. *
Zeng Liang, "Matlab simulation of a wavelet-transform-based speech enhancement algorithm," Software Guide, 2013, No. 10. *
Qin Huanding et al., "Research on an improved speech enhancement algorithm based on the minimum mean square error magnitude spectrum," Electronic Technology, 2016, No. 07. *
Hu Kekai et al., "A new speech enhancement algorithm based on improved spectral subtraction," Popular Science & Technology, 2008, No. 09. *

Also Published As

Publication number Publication date
CN112927709A (en) 2021-06-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220614