CN112017682A - Single-channel voice simultaneous noise reduction and reverberation removal system - Google Patents
Single-channel voice simultaneous noise reduction and reverberation removal system
- Publication number: CN112017682A (application CN202010985378.7A)
- Authority: CN (China)
- Prior art keywords: voice, module, noise reduction, speech, dereverberation
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/0208 — Speech enhancement; noise filtering
- G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech
- G10L25/30 — Speech or voice analysis techniques using neural networks
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a single-channel speech simultaneous noise-reduction and dereverberation system, comprising: a speech noise-reduction module, which trains a deep embedded feature extractor with a deep clustering algorithm, extracts deep embedded features from the mixed speech signal, and maps the input mixed speech into a noise-free embedding space, so that the deep embedded features contain no noise and strongly discriminate between reverberation and direct sound; a speech dereverberation module, connected to the speech noise-reduction module, which removes the reverberant components from the deep embedded features and estimates the clean target direct sound, thereby achieving both noise reduction and dereverberation; and a joint training module, connected to the speech noise-reduction module and the speech dereverberation module respectively, which jointly optimizes the two modules to improve the quality and intelligibility of the enhanced speech.
Description
Technical Field
The invention relates to the technical field of signal processing, and in particular to a system for simultaneous noise reduction and dereverberation of single-channel speech.
Background
Speech is one of the primary means by which humans exchange information, and noise reduction and dereverberation have long occupied an important place in speech signal processing. In real environments a speech signal often contains both reverberation and noise, which seriously degrade speech quality and intelligibility and substantially hurt the performance of speech recognition and voiceprint recognition systems. Speech dereverberation and noise reduction are therefore important, and many methods have been proposed over the years. The Weighted Prediction Error (WPE) algorithm addresses dereverberation at the signal level via delayed linear prediction: WPE first estimates a frequency-dependent linear prediction filter over a number of past frames, and the filtered signal is then subtracted from the original reverberant signal in the subband domain to obtain the enhanced signal. However, when noise and reverberation are present simultaneously, the performance of WPE degrades severely, which limits its applicability.
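As a rough illustration of the delayed-linear-prediction idea behind WPE, the following sketch processes a single STFT subband. It is a simplified, single-channel variant written for clarity: the function name, the tap/delay counts, and the fixed-point iteration scheme are illustrative assumptions, not the full WPE algorithm or the patented method.

```python
import numpy as np

def wpe_subband(y, taps=10, delay=3, iters=3, eps=1e-8):
    """Simplified single-subband WPE (delayed linear prediction).

    y: complex STFT coefficients of one frequency bin over T frames.
    Returns dereverberated coefficients for that bin.
    """
    T = len(y)
    x = y.copy()
    for _ in range(iters):
        # Time-varying variance estimate of the desired (dereverberated) signal
        lam = np.maximum(np.abs(x) ** 2, eps)
        # Delayed history matrix: frame t is predicted from frames t-delay, ..., t-delay-taps+1
        Y = np.zeros((T, taps), dtype=complex)
        for k in range(taps):
            d = delay + k
            Y[d:, k] = y[:T - d]
        # Variance-weighted least squares for the prediction filter g
        W = Y / lam[:, None]
        R = W.conj().T @ Y
        p = W.conj().T @ y
        g = np.linalg.solve(R + eps * np.eye(taps), p)
        # Subtract the predicted late-reverberant part
        x = y - Y @ g
    return x
```

In the real algorithm this estimation runs per frequency bin over the whole subband-domain signal; when strong noise is present the variance estimate `lam` is corrupted, which is the failure mode the passage above describes.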
In recent years, with the development of computer technology, deep-learning-based speech dereverberation has advanced rapidly and attracted growing attention. Such methods train a dereverberation model that learns a mapping from the feature parameters of the mixed speech to the feature parameters of the target clean speech signal, so that for any input mixture the trained model can output the target clean speech. However, these methods use only the magnitude spectrum as the feature, which lacks discriminability and limits dereverberation performance; when the speech contains both noise and reverberation, the quality of the enhanced speech cannot be guaranteed.
Disclosure of Invention
To overcome the shortcomings of the prior art and keep the enhanced speech at high quality even when the input contains both noise and reverberation, the invention adopts the following technical scheme:
A single-channel speech simultaneous noise reduction and dereverberation system, comprising: a speech noise-reduction module, which trains a deep embedded feature extractor with a deep clustering algorithm, extracts deep embedded features from the mixed speech signal, and maps the input mixed speech into a noise-free embedding space, so that the deep embedded features contain no noise and strongly discriminate between reverberation and direct sound; a speech dereverberation module, connected to the speech noise-reduction module, which removes the reverberant components from the deep embedded features and estimates the clean target direct sound, thereby achieving both noise reduction and dereverberation; and a joint training module, connected to the speech noise-reduction module and the speech dereverberation module respectively, which jointly optimizes the two modules to improve the quality and intelligibility of the enhanced speech.
The speech noise-reduction module applies a short-time Fourier transform to the input mixed speech signal, models the signal after transforming it from the time domain to the frequency domain, extracts deep embedded features with a deep clustering algorithm, and maps the input mixed speech into a noise-free embedding space; the deep embedded features are trained with a deep neural network. The training loss objective function of the speech noise-reduction module is:

J_DC = ||V V^T - B B^T||_F^2

where V ∈ R^(TF×D) is the matrix of deep embedded features, R denotes the real numbers, TF is the number of time-frequency bins after the Fourier transform, B ∈ R^(TF×2) encodes the correspondence of each time-frequency bin to direct sound or reverberation, and ||·||_F^2 denotes the squared Frobenius norm; minimizing this loss achieves the goal of speech noise reduction.
The speech dereverberation module is implemented with a deep neural network whose input is the deep embedded features and whose output is the estimated target floating-point masking value, approximating the ideal mask:

M(t, f) = |X(t, f)| / |Y(t, f)|

where M̂(t, f) denotes the estimated target floating-point masking value. The training loss objective function of the speech dereverberation module is:

J = Σ_(t,f) ( M̂(t, f)·|Y(t, f)| − |X(t, f)| )²

where |Y(t, f)| is the magnitude spectrum of the mixed speech and |X(t, f)| is the magnitude spectrum of the target clean direct sound; the input mixture magnitude spectrum |Y(t, f)| is multiplied point-by-point with the estimated mask M̂(t, f) to obtain the estimated magnitude spectrum of the target clean direct sound, and the mean square error between the estimated and true magnitude spectra is computed.
The joint training module jointly optimizes the speech noise-reduction module and the speech dereverberation module: the objective functions of the two modules are linearly combined with a certain weight to form the final objective function, so that joint optimization of the two modules improves the performance of the speech enhancement system.
The overall training objective function is:

J_total = λ·J_DC + (1 − λ)·J

where λ is the weight balancing the speech noise-reduction module against the speech dereverberation module; finally, the whole noise-reduction and dereverberation system is optimized through joint training.
The invention has the advantages and beneficial effects that:
the voice noise reduction module carries out noise reduction through feature extraction, and the extracted features distinguish reverberation from direct sound, so that the distinguishing performance of a voice reverberation-free system on the reverberation and the direct sound is improved; the voice dereverberation module estimates a target clean direct sound through training a neural network, so that the voice dereverberation performance is improved; the combined training module jointly optimizes the voice noise reduction module and the voice dereverberation module, and ensures the performance of voice enhancement while obtaining the depth embedded feature with distinctiveness, so that the enhanced voice can be clearer and understandable, and the tone quality is better.
Drawings
Fig. 1 is a schematic block diagram of the present invention.
Fig. 2 is a schematic structural diagram of a speech noise reduction module according to the present invention.
Fig. 3 is a schematic diagram of the structure of the speech dereverberation module in the present invention.
Fig. 4 is a schematic structural diagram of the joint training module in the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples are intended to illustrate and explain the invention, not to limit it.
As shown in fig. 1, a simultaneous noise reduction and dereverberation system for single-channel speech includes: a speech noise-reduction module, which trains a deep embedded feature extractor with a deep clustering algorithm, extracts deep embedded features from the mixed speech signal, and maps the input speech into a noise-free embedding space, so that the deep embedded features contain no noise and strongly discriminate between reverberation and direct sound; a speech dereverberation module, connected to the speech noise-reduction module, which exploits this discriminability to remove the reverberant components from the deep embedded features and estimate the clean target direct sound, thereby achieving both noise reduction and dereverberation; and a joint training module, connected to the speech noise-reduction module and the speech dereverberation module respectively, which jointly optimizes the two modules and improves the quality and intelligibility of the enhanced speech.
As shown in fig. 2, the speech noise-reduction module applies a short-time Fourier transform to the input mixed speech signal, transforming the time-domain signal into the frequency domain, and then models it. The module extracts deep embedded features with a deep clustering algorithm: the input speech containing noise and reverberation is mapped into a noise-free embedding space, i.e. into deep embedded features that contain only reverberation. The deep embedded features are obtained by deep neural network training, and the training loss objective function of the speech noise-reduction module is:

J_DC = ||V V^T - B B^T||_F^2

where V ∈ R^(TF×D) is the matrix of deep embedded features, R denotes the real numbers, TF is the number of time-frequency bins after the Fourier transform, B ∈ R^(TF×2) encodes the correspondence of each time-frequency bin to direct sound or reverberation, and ||·||_F^2 denotes the squared Frobenius norm. For example, if at time-frequency bin (t, f) the direct-sound energy exceeds the reverberation energy, then B_(tf,1) = 1 and B_(tf,2) = 0; otherwise B_(tf,1) = 0 and B_(tf,2) = 1. Minimizing this loss maps the input mixed speech into an embedding space that contains only reverberation and no noise, achieving the goal of speech noise reduction.
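The deep clustering loss above can be evaluated without materializing the TF×TF affinity matrices, using the low-rank identity ||VV^T − BB^T||_F^2 = ||V^T V||_F^2 − 2||V^T B||_F^2 + ||B^T B||_F^2. A numpy sketch (shapes and names are illustrative; the patent does not specify the embedding dimension D):

```python
import numpy as np

def deep_clustering_loss(V, B):
    """J_DC = ||V V^T - B B^T||_F^2 via the low-rank identity.

    V: (TF, D) embedding vector for every time-frequency bin.
    B: (TF, 2) one-hot assignment (direct sound vs. reverberation).
    """
    VtV = V.T @ V          # (D, D)
    VtB = V.T @ B          # (D, 2)
    BtB = B.T @ B          # (2, 2)
    # Frobenius-norm expansion of ||V V^T - B B^T||_F^2
    return np.sum(VtV ** 2) - 2.0 * np.sum(VtB ** 2) + np.sum(BtB ** 2)
```

The identity keeps the cost at O(TF·D²) instead of O(TF²·D), which is what makes the loss trainable on full utterances.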
As shown in fig. 3, the speech dereverberation module trains a speech dereverberation model. The module is implemented with a deep neural network whose input is the deep embedded features and whose output is the estimated target floating-point masking value, approximating the ideal mask:

M(t, f) = |X(t, f)| / |Y(t, f)|

where M̂(t, f) denotes the estimated target floating-point masking value. The training loss objective function of the speech dereverberation module is:

J = Σ_(t,f) ( M̂(t, f)·|Y(t, f)| − |X(t, f)| )²

where |Y(t, f)| is the magnitude spectrum of the mixed speech and |X(t, f)| is the magnitude spectrum of the target clean direct sound; the input mixture magnitude spectrum |Y(t, f)| is multiplied point-by-point with the estimated mask M̂(t, f) to obtain the estimated magnitude spectrum of the target clean direct sound, and the mean square error between the estimated and true magnitude spectra is computed.
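A minimal numpy sketch of the masking step and mean-square-error loss described above. Function names are illustrative, and the ratio-mask target |X|/|Y| is a standard choice assumed here:

```python
import numpy as np

def target_mask(Y_mag, X_mag, eps=1e-8):
    # Floating-point (ratio) mask target: M(t,f) = |X(t,f)| / |Y(t,f)|
    return X_mag / (Y_mag + eps)

def mask_mse_loss(mask_hat, Y_mag, X_mag):
    # Estimated direct-sound magnitude: point-by-point product M^(t,f) * |Y(t,f)|
    X_hat = mask_hat * Y_mag
    # Mean square error against the true direct-sound magnitude |X(t,f)|
    return np.mean((X_hat - X_mag) ** 2)
```

A perfect mask estimate drives the loss to (numerically) zero, while any deviation is penalized quadratically in the magnitude domain.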
As shown in fig. 4, the joint training module jointly optimizes the speech noise-reduction module and the speech dereverberation module: the objective functions of the two modules are linearly combined with a certain weight to form the final objective function, so that joint optimization of the modules improves the performance of the speech enhancement system.
The overall training objective function is:

J_total = λ·J_DC + (1 − λ)·J

where λ is the weight balancing the speech noise-reduction module against the speech dereverberation module; finally, the whole speech noise-reduction and dereverberation system is optimized through joint training.
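The overall objective is simply a convex blend of the two losses; a trivial sketch (the default λ value is illustrative, as the patent leaves it unspecified):

```python
def joint_loss(j_dc, j_mask, lam=0.1):
    """Overall training objective J_total = lam * J_DC + (1 - lam) * J."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * j_dc + (1.0 - lam) * j_mask
```

At λ = 1 only the deep-clustering term trains (noise reduction alone); at λ = 0 only the masking term trains (dereverberation alone); intermediate values trade the two off.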
After training is finished, a mixed speech signal is passed through the speech noise-reduction module and the speech dereverberation module in sequence to obtain the target clean direct-sound signal.
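The inference flow just described — STFT, embedding extraction, mask estimation, point-by-point masking combined with the mixture phase, inverse STFT — might be wired together as below. `extractor` and `masker` stand in for the two trained modules and are hypothetical callables, not the patent's actual networks:

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(mix, extractor, masker, fs=16000, nfft=512):
    """Sketch of the inference pipeline: enhance a single-channel mixture.

    extractor: maps the magnitude spectrogram to deep embedded features.
    masker:    maps those features to a floating-point mask of the same shape.
    """
    _, _, Y = stft(mix, fs=fs, nperseg=nfft)            # (F, T) complex spectrum
    V = extractor(np.abs(Y))                            # deep embedded features
    M = masker(V)                                       # estimated mask
    X_hat = M * np.abs(Y) * np.exp(1j * np.angle(Y))    # masked magnitude, mixture phase
    _, x_hat = istft(X_hat, fs=fs, nperseg=nfft)        # back to the time domain
    return x_hat
```

With an identity extractor and an all-ones mask the pipeline reduces to an STFT round trip, which is a useful sanity check before plugging in trained models.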
The above examples are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the described technical solutions may still be modified, or some or all of their technical features equivalently replaced, without departing in essence from the scope of the technical solutions of the embodiments of the invention.
Claims (4)
1. A single-channel speech simultaneous noise reduction and dereverberation system, characterized by comprising: a speech noise-reduction module, a speech dereverberation module, and a joint training module, wherein the speech noise-reduction module trains a deep embedded feature extractor with a deep clustering algorithm, extracts deep embedded features from the mixed speech signal, and maps the input mixed speech into a noise-free embedding space; the speech dereverberation module is connected to the speech noise-reduction module, removes the reverberant components from the deep embedded features, and estimates the clean target direct sound; and the joint training module is connected to the speech noise-reduction module and the speech dereverberation module respectively and jointly optimizes the two modules.
2. The system of claim 1, wherein the speech noise-reduction module performs a short-time Fourier transform on the input mixed speech signal, models the signal after transforming it from the time domain to the frequency domain, extracts deep embedded features with a deep clustering algorithm, and maps the input mixed speech into a noise-free embedding space, the deep embedded features being obtained by deep neural network training, and the training loss objective function of the speech noise-reduction module being:

J_DC = ||V V^T - B B^T||_F^2

where V is the matrix of deep embedded features, B encodes the correspondence of each time-frequency bin to direct sound or reverberation, and ||·||_F^2 denotes the squared Frobenius norm.
3. The system of claim 1, wherein the speech dereverberation module is implemented with a deep neural network whose input is the deep embedded features and whose output is the estimated target floating-point masking value M̂(t, f), the training loss objective function of the speech dereverberation module being:

J = Σ_(t,f) ( M̂(t, f)·|Y(t, f)| − |X(t, f)| )²

where |Y(t, f)| is the magnitude spectrum of the mixed speech and |X(t, f)| is the magnitude spectrum of the target clean direct sound; the input mixture magnitude spectrum |Y(t, f)| is multiplied point-by-point with the estimated mask M̂(t, f) to obtain the estimated magnitude spectrum of the target clean direct sound, and the mean square error between the estimated and true magnitude spectra is computed.
4. The system of claim 1, wherein the joint training module jointly optimizes the speech noise-reduction module and the speech dereverberation module by linearly combining their objective functions with a certain weight as the final objective function:

J_total = λ·J_DC + (1 − λ)·J

where λ is the weight balancing the speech noise-reduction module against the speech dereverberation module.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202010985378.7A (CN112017682B) | 2020-09-18 | 2020-09-18 | Single-channel voice simultaneous noise reduction and reverberation removal system |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN112017682A | 2020-12-01 |
| CN112017682B | 2023-05-23 |
Family
- ID: 73522656

Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | Status |
|---|---|---|---|---|
| CN202010985378.7A (CN112017682B) | Single-channel voice simultaneous noise reduction and reverberation removal system | 2020-09-18 | 2020-09-18 | Active |

Country Status (1)
| Country | Publication |
|---|---|
| CN | CN112017682B |
Cited By (6)
| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN112837697A | 2021-02-20 | 2021-05-25 | Echo suppression method and device |
| CN112992170A | 2021-01-29 | 2021-06-18 | Model training method and device, storage medium and electronic device |
| CN113257265A | 2021-05-10 | 2021-08-13 | Voice signal dereverberation method and device and electronic equipment |
| CN113724723A | 2021-09-02 | 2021-11-30 | Reverberation and noise suppression method, device, electronic equipment and storage medium |
| CN114220448A | 2021-12-16 | 2022-03-22 | Voice signal generation method and device, computer equipment and storage medium |
| CN115424628A | 2022-07-20 | 2022-12-02 | Voice processing method and electronic equipment |
Citations (10)
| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| US20140270216A1 | 2013-03-13 | 2014-09-18 | Single-channel, binaural and multi-channel dereverberation |
| US20150071461A1 | 2013-03-15 | 2015-03-12 | Single-channel suppression of interfering sources |
| CN108538305A | 2018-04-20 | 2018-09-14 | Speech recognition method, device, equipment and computer-readable storage medium |
| US20190043491A1 | 2018-05-18 | 2019-02-07 | Neural network based time-frequency mask estimation and beamforming for speech pre-processing |
| CN109817209A | 2019-01-16 | 2019-05-28 | Intelligent speech interactive system based on a two-microphone array |
| CN109949821A | 2019-03-15 | 2019-06-28 | Method for far-field speech dereverberation using a CNN U-NET structure |
| CN110503972A | 2019-08-26 | 2019-11-26 | Speech enhancement method, system, computer equipment and storage medium |
| CN110544482A | 2019-09-09 | 2019-12-06 | Single-channel voice separation system |
| CN111372041A | 2019-11-01 | 2020-07-03 | Monitoring equipment and monitoring system |
| US20200219524A1 | 2017-09-21 | 2020-07-09 | Signal processor and method for providing a processed audio signal reducing noise and reverberation |
Patent Citations (12)
| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| US20140270216A1 | 2013-03-13 | 2014-09-18 | Single-channel, binaural and multi-channel dereverberation |
| US20180047378A1 | 2013-03-13 | 2018-02-15 | Single-channel, binaural and multi-channel dereverberation |
| US20150071461A1 | 2013-03-15 | 2015-03-12 | Single-channel suppression of interfering sources |
| US20200219524A1 | 2017-09-21 | 2020-07-09 | Signal processor and method for providing a processed audio signal reducing noise and reverberation |
| CN111512367A | 2017-09-21 | 2020-08-07 | Signal processor and method providing processed noise-reduced and reverberation-reduced audio signals |
| CN108538305A | 2018-04-20 | 2018-09-14 | Speech recognition method, device, equipment and computer-readable storage medium |
| US20190043491A1 | 2018-05-18 | 2019-02-07 | Neural network based time-frequency mask estimation and beamforming for speech pre-processing |
| CN109817209A | 2019-01-16 | 2019-05-28 | Intelligent speech interactive system based on a two-microphone array |
| CN109949821A | 2019-03-15 | 2019-06-28 | Method for far-field speech dereverberation using a CNN U-NET structure |
| CN110503972A | 2019-08-26 | 2019-11-26 | Speech enhancement method, system, computer equipment and storage medium |
| CN110544482A | 2019-09-09 | 2019-12-06 | Single-channel voice separation system |
| CN111372041A | 2019-11-01 | 2020-07-03 | Monitoring equipment and monitoring system |
Non-Patent Citations (3)
- Matthias Wölfel: "Enhanced Speech Features by Single-Channel Joint Compensation of Noise and Reverberation"
- Cao Meng (曹猛): "Reverberant speech separation based on computational auditory scene analysis and deep neural networks" (Wanfang)
- Yang Lei (杨磊), ed.: Introduction to Digital Media Technology (《数字媒体技术概论》), China Railway Press, September 2017
Cited By (9)
| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN112992170A | 2021-01-29 | 2021-06-18 | Model training method and device, storage medium and electronic device |
| CN112992170B | 2021-01-29 | 2022-10-28 | Model training method and device, storage medium and electronic device |
| CN112837697A | 2021-02-20 | 2021-05-25 | Echo suppression method and device |
| CN112837697B | 2021-02-20 | 2024-05-14 | Echo suppression method and device |
| CN113257265A | 2021-05-10 | 2021-08-13 | Voice signal dereverberation method and device and electronic equipment |
| CN113724723A | 2021-09-02 | 2021-11-30 | Reverberation and noise suppression method, device, electronic equipment and storage medium |
| CN113724723B | 2021-09-02 | 2024-06-11 | Reverberation and noise suppression method and device, electronic equipment and storage medium |
| CN114220448A | 2021-12-16 | 2022-03-22 | Voice signal generation method and device, computer equipment and storage medium |
| CN115424628A | 2022-07-20 | 2022-12-02 | Voice processing method and electronic equipment |
Also Published As
| Publication number | Publication date |
|---|---|
| CN112017682B | 2023-05-23 |
Legal Events
| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |