CN110767244A - Speech enhancement method - Google Patents

Speech enhancement method

Info

Publication number
CN110767244A
CN110767244A
Authority
CN
China
Prior art keywords
voice
speech
noise
neural network
enhanced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810827229.0A
Other languages
Chinese (zh)
Other versions
CN110767244B (en)
Inventor
杜俊 (Jun Du)
高天 (Tian Gao)
屠彦辉 (Yanhui Tu)
王立众 (Lizhong Wang)
杨磊 (Lei Yang)
徐学淼 (Xuemiao Xu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Beijing Samsung Telecommunications Technology Research Co Ltd
Original Assignee
University of Science and Technology of China USTC
Beijing Samsung Telecommunications Technology Research Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, Beijing Samsung Telecommunications Technology Research Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN201810827229.0A priority Critical patent/CN110767244B/en
Publication of CN110767244A publication Critical patent/CN110767244A/en
Application granted granted Critical
Publication of CN110767244B publication Critical patent/CN110767244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The invention discloses a speech enhancement method, which comprises the following steps: extracting the acoustic features of each speech frame; training a progressive dual-output neural network model with samples of clean speech and noisy speech, estimating the ideal soft mask of each speech frame with the trained model, and performing enhancement processing on the acoustic features; if the method is applied to human listening, the waveform is reconstructed from the enhanced acoustic features to obtain a waveform suitable for subjective listening; if the method is applied to a speech recognition system, the estimated ideal soft mask is applied to the acoustic features of the input speech to obtain the masked acoustic features, and the waveform is then reconstructed to obtain the enhanced speech. The scheme of the invention can meet the noise-reduction requirement of the human ear and improve the recognition accuracy of noisy speech.

Description

Speech enhancement method
Technical Field
The invention relates to the technical field of speech processing, in particular to a speech enhancement method.
Background
Speech recognition is the process of making a machine understand what a person says, i.e., converting the lexical content of human speech into input that a computer can recognize. Over roughly the last two decades, and especially in recent years with the introduction of deep learning, speech recognition technology has achieved significant success and has begun to move from the laboratory to the market. At present, voice input, voice retrieval, voice translation and the like based on speech recognition technology are widely applied. It is well known that in noisy environments the performance of automatic speech recognition degrades greatly if no measures are taken, mainly because the distribution of noisy speech differs from the distribution on which the acoustic model was trained. To improve recognition accuracy in noisy environments, speech enhancement algorithms are typically used as a pre-processing step for speech recognition, matching the acoustic model's distribution by transforming the noisy speech back toward a clean state as far as possible.
Speech enhancement is an important branch of the speech signal processing field, and noise has been a concern since research on speech signals began, because speech in real-life environments is always accompanied by noise. Speech signal processing is generally interested only in the content of the speech, the speaker, the language, and so on, so noise, as an interfering item, generally needs to be removed in advance; but given that the processes generating speech and noise are nonlinear and complex, denoising is difficult. Over the past few decades, unsupervised speech enhancement methods have been proposed that first estimate the spectral information of the noise and then subtract the estimated noise spectrum from the noisy speech spectrum to obtain a prediction of the clean speech spectrum. The randomness and abruptness of noise, however, make its tracking and estimation difficult. Moreover, because the interaction between noise and speech is complex, conventional speech enhancement methods require assumptions of independence between the signals and Gaussian assumptions on the feature distributions. These assumptions cause conventional methods to leave considerable residual noise, even musical noise. Second, the details of the speech are also largely corrupted, which shows up mainly when enhancing low signal-to-noise-ratio speech. Furthermore, highly non-stationary noise has always been a sore spot for traditional speech enhancement: because such noise is sudden, it is chronically underestimated and hard to remove from the noisy speech, yet various non-stationary noises are exactly the events likely to occur in a real acoustic environment. Finally, conventional speech enhancement methods tend to introduce nonlinear distortions that can be destructive to back-end speech recognition.
Spectral subtraction is one of the most classical speech enhancement algorithms, used initially for speech enhancement and later gradually adopted in speech recognition. The method is divided into two parts: noise updating and noise elimination. In noise updating, non-speech segments must first be detected, and a method combining the current frame with long-term historical information is then used to estimate the noise spectrum. In noise elimination, an estimate of the clean speech spectrum is obtained by subtracting the estimated noise spectrum from the noisy speech spectrum. Because of the over-subtraction problem of spectral subtraction, an over-subtraction factor related to the signal-to-noise ratio needs to be set in each frequency band. The nonlinear operation of spectral subtraction produces residual musical noise in the enhanced signal, and specific countermeasures are adopted in practical applications. In addition, a reliable speech/non-speech detection module is also crucial.
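To make the two-part scheme above concrete, the following is a minimal numpy sketch of the noise-elimination step, assuming a noise power estimate is already available from detected non-speech segments; the function name and the over-subtraction/floor defaults are illustrative choices, not values taken from this patent.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=4.0, beta=0.01):
    """Subtract an estimated noise power spectrum from the noisy power
    spectrum, with an over-subtraction factor and a spectral floor.

    noisy_power : (frames, bins) power spectrum of the noisy speech
    noise_power : (bins,) noise power estimated on non-speech segments
    alpha       : over-subtraction factor (typically larger at low SNR)
    beta        : floor factor limiting residual musical noise
    """
    clean_est = noisy_power - alpha * noise_power[None, :]
    floor = beta * noise_power[None, :]
    # Never let the estimate drop below the floor: flooring is one of the
    # specific countermeasures used against residual musical noise.
    return np.maximum(clean_est, floor)
```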
One method related to spectral subtraction is Wiener filtering. In this method, a linear filter is designed to minimize the mean square error between the clean speech and the filtered noisy speech signal. The transfer function obtained under this criterion is the ratio between the power spectrum of the clean signal and that of the noisy signal, where the clean power spectrum is approximated by the difference between the noisy spectrum and the estimated noise spectrum. Wiener filtering is therefore closely related to spectral subtraction, but its computational complexity is higher, and it likewise fails on non-stationary noise. Subsequently, a soft-decision enhancement method based on the speech presence probability was proposed (McAulay and Malpass, 1980), which can reduce speech distortion as much as possible. A revolutionary speech enhancement method is the minimum mean square error (MMSE) speech magnitude spectrum estimation proposed by Ephraim and Malah in 1984; an MMSE estimator in the log-spectral domain was proposed afterwards, considering that the human ear's perception of sound intensity is nonlinear. Among these methods, the one most commonly used in the prior art is the Minima Controlled Recursive Averaging (MCRA) noise estimation method proposed by Israel Cohen. Compared with earlier noise estimation methods, MCRA has smaller estimation error and tracks non-stationary noise faster. It can be considered the best method among traditional single-channel speech enhancement algorithms to date; however, it improves the recognition accuracy of a speech recognition system only under stationary noise and high signal-to-noise ratios. Under heavy background noise the improvement is limited and the recognition rate may even drop. The main reason is that the robustness of current recognition systems is already strong, so if the enhancement algorithm damages the target speech, it brings a net negative effect.
Discussed above are conventional single-channel unsupervised speech enhancement methods. Addressing some of the problems of unsupervised algorithms, methods based on supervised training models have been used in speech enhancement tasks since the end of the 1980s, such as speech enhancement based on non-negative matrix factorization. Non-negative Matrix Factorization (NMF) decomposes a matrix V into the product of a matrix W and a matrix H, under the constraint that V, W, and H contain no negative elements. Since the magnitude spectrum of audio is inherently non-negative, NMF is commonly applied in music signal separation: music signals have a more fixed pattern than speech or noise signals, so it is more convenient to train on music signals and obtain a set of matrix bases. Applied to the field of signal separation, the main aim of NMF is to separate two signals from a mixture by using their different signal bases. The signal bases can be obtained by training as follows. First, an objective function is defined, which may use the Euclidean distance or the Kullback-Leibler divergence:
$$D_{EU}(V\,\|\,WH)=\sum_{i,j}\big(V_{ij}-(WH)_{ij}\big)^2$$

$$D_{KL}(V\,\|\,WH)=\sum_{i,j}\Big(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}}-V_{ij}+(WH)_{ij}\Big)$$
Speech noise reduction based on non-negative matrix factorization is mainly divided into two steps: training and noise reduction. In the training phase, the feature bases $W_{speech}$ and $W_{noise}$ of clean speech and noise are first obtained respectively. They are both of size $n_f \times n_b$, where $n_f$ is the dimension of the feature basis and $n_b$ is the number of basis vectors. The target formulas are as follows:
$$\min\ \mathrm{KLD}(V_{speech}\,\|\,W_{speech}H_{speech})$$

$$\min\ \mathrm{KLD}(V_{noise}\,\|\,W_{noise}H_{noise})$$
In the above formulas, $H_{speech}$ and $H_{noise}$ are the coefficient (activation) matrices that generate the actual signal from the corresponding bases for each frame; their dimensions are the number of bases multiplied by the total number of frames. In the enhancement stage, the speech and noise feature bases obtained in the training stage, namely $W_{speech}$ and $W_{noise}$, are spliced into a total feature basis $W_{all}$ that can be used in decoding, as follows:
$$W_{all}=[\,W_{speech}\ \ W_{noise}\,]$$
Then, given a noisy speech signal, $H_{all}$ is solved by a standard gradient descent algorithm, and finally the noise signal and the speech signal can be separated. The main problem with NMF-based speech enhancement methods is that the models of noise and clean speech are trained separately, and independence between noise and clean speech is assumed; this effectively limits the upper bound on the method's performance. Meanwhile, deep-neural-network-based methods have already fully surpassed NMF methods on related speech separation tasks.
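To illustrate the decoding step above, here is a small numpy sketch of NMF-based enhancement under the stated assumptions ($W_{speech}$ and $W_{noise}$ already trained on magnitude spectra). Note that it solves for $H_{all}$ with the standard multiplicative-update rule for the KL objective rather than plain gradient descent, and the Wiener-style re-weighting at the end is one common way to recompose the speech signal; both are illustrative choices.

```python
import numpy as np

def nmf_kl_activations(V, W, n_iter=100, eps=1e-10):
    """Fix the basis W and solve for activations H minimizing KL(V || WH)
    with the classic multiplicative updates."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return H

def nmf_enhance(V_noisy, W_speech, W_noise, eps=1e-10):
    """Separate speech from a noisy magnitude spectrogram using the
    concatenated basis W_all = [W_speech  W_noise]."""
    W_all = np.hstack([W_speech, W_noise])
    H_all = nmf_kl_activations(V_noisy, W_all)
    k = W_speech.shape[1]
    V_sp = W_speech @ H_all[:k]
    V_ns = W_noise @ H_all[k:]
    # Soft mask: fraction of each T-F unit attributed to speech.
    return (V_sp / (V_sp + V_ns + eps)) * V_noisy
```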
Supervised speech enhancement methods, especially those based on deep learning, have also developed vigorously in recent years. With the successful application of deep neural networks in speech recognition, a neural network can also be designed as a fine-grained noise reduction filter. Trained on big data, the network can fully learn the complex nonlinear relationship between noisy and clean speech. Moreover, since the training is performed offline, the network can, like a person, remember certain noise patterns, so some non-stationary noises can be suppressed well. However, if the training data do not match the test data, for example when the noise types differ or the speakers differ greatly, system performance degrades substantially.
Disclosure of Invention
The invention aims to provide a speech enhancement method which can meet the noise-reduction requirement of the human ear and improve the recognition accuracy of noisy speech.
The purpose of the invention is realized by the following technical scheme:
a method of speech enhancement comprising:
extracting the acoustic features of each speech frame;
training a progressive dual-output neural network model with samples of clean speech and noisy speech, estimating the ideal soft mask of each speech frame with the trained model, and performing enhancement processing on the acoustic features;
if the method is applied to human listening, the waveform is reconstructed from the enhanced acoustic features to obtain a waveform suitable for subjective listening; if the method is applied to a speech recognition system, the estimated ideal soft mask is applied to the acoustic features of the input speech to obtain the masked acoustic features, and the waveform is then reconstructed to obtain the enhanced speech.
According to the technical scheme provided by the invention, speech enhancement is performed with a progressive dual-output neural network model, which can output deeply denoised speech to meet the noise-reduction requirement of the human ear while also outputting partially denoised speech at a certain signal-to-noise ratio to match the data-driven back-end recognition model. In manual listening tests and objective measurements, the deeply denoised speech shows clear improvements in subjective audibility and in all indices; combined with the speech recognition model, the speech partially denoised by the neural network effectively improves recognition accuracy compared with recognition without noise reduction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a speech enhancement method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a progressive dual-output neural network model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Speech enhancement methods restore noisy speech to as clean a state as possible through transformation. Generally speaking, speech enhancement is mainly concerned with whether the enhanced signal sounds good to humans, whereas a recognition system cares more about whether its error rate decreases. These two targets are linked but not fully consistent, for the simple reason that a human may be immune to certain distortions in the speech signal while the recognition system is sensitive to them. It often happens that enhanced speech shows a good subjective improvement yet brings no improvement in speech recognition accuracy. Another big challenge faced by deep-learning-based speech enhancement is generalization, a difficulty no deep-learning model can avoid; specifically, unknown noise, unknown speaking styles, and extremely low signal-to-noise ratios all make speech enhancement hard.
To address this, the embodiment of the present invention provides a speech enhancement method. Through multi-target joint training, a progressive dual-output neural network model gains better generalization and simultaneously outputs deeply denoised speech and speech at a certain signal-to-noise ratio, so it can meet the human ear's noise-reduction requirement for noisy speech while also improving the accuracy of the recognition system. As shown in fig. 1, the flow of the method mainly includes the following parts:
1. Extracting the acoustic features of each speech frame.
1) The input speech signal is framed to obtain a sequence of speech frames.
In this step, the input speech (noisy speech) is framed with a Hamming window to obtain each frame of data. Illustratively, a common Hamming window length is 32 milliseconds with a window shift of 16 milliseconds, so adjacent frames overlap by 16 milliseconds.
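A minimal numpy sketch of this framing step, using the sampling rate and window parameters of the example above (the function name is illustrative):

```python
import numpy as np

def frame_signal(x, fs=16000, win_ms=32, shift_ms=16):
    """Cut a waveform into overlapping frames and apply a Hamming window:
    32 ms window, 16 ms shift, so adjacent frames overlap by 16 ms."""
    win_len = fs * win_ms // 1000   # 512 samples at 16 kHz
    shift = fs * shift_ms // 1000   # 256 samples
    window = np.hamming(win_len)
    n_frames = 1 + max(0, (len(x) - win_len) // shift)
    return np.stack([x[t * shift : t * shift + win_len] * window
                     for t in range(n_frames)])
```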
2) The acoustic features are log power spectrum features. When extracting the log power spectrum features of each speech frame, the frequency-domain signal is obtained through Fourier transform and taking the modulus:

$$Y'(d)=\Big|\sum_{l=0}^{L-1}y(l)\,h(l)\,e^{-j2\pi ld/L}\Big|$$

In the above formula, y(l) is the time-domain frame signal, d is the frequency dimension, h(l) is the window function, and L is the number of points of the discrete Fourier transform. If L is increased, i.e., more information points are sampled, the input features will contain more information, which also benefits the learning of the subsequent neural network.
The log power spectrum feature is defined as:

$$Y(d)=\log|Y'(d)|^2,\quad d=0,1,\ldots,D-1$$
as will be appreciated by those skilled in the art, since the STFT transform is symmetric in the frequency domain, only the first half of the points are taken, i.e., D ═ L/2+1, while the second half of the points D ═ D., L-1, Y (D) are obtained by the symmetry criterion, Y (D) ═ Y (L-D). That is, y (d) involved in the subsequent calculation process also considers only the first half of the points, and all the points are considered in the final speech restoration, as described later
Figure BDA0001742771700000061
Is shown in formula, in the calculationSince there is a symmetry, the value corresponding to the second half is also a known quantity.
Illustratively, for a waveform file sampled at 16 kHz, the log power spectrum feature has 257 dimensions (32 ms × 16 kHz = 512 samples per frame, and D = 512/2 + 1 = 257).
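Continuing the sketch, the log power spectrum features and the phase (needed later for waveform recovery) can be computed per frame as follows; np.fft.rfft returns exactly the D = L/2 + 1 non-redundant bins discussed above. The small floor added inside the log is an assumption to avoid log(0) on silent frames.

```python
def log_power_spectrum(frames, L=512):
    """Per-frame log power spectrum, keeping the first D = L//2 + 1 bins;
    the remaining bins follow from conjugate symmetry."""
    spec = np.fft.rfft(frames, n=L, axis=1)    # shape (T, 257) at L = 512
    lps = np.log(np.abs(spec) ** 2 + 1e-12)
    phase = np.angle(spec)                     # saved for reconstruction
    return lps, phase
```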
3) Consecutive frames are spliced: the data obtained after splicing a certain number of frames is used as one sample, and the label of the central frame is used as the label of the sample it belongs to.
Typically 7-frame, 11-frame, or 15-frame splicing can be adopted, and when a spliced sample is used as a training sample, the label of the central frame serves as the label of the sample. Typically, 7-frame splicing is used as input for networks with 1024 or 2048 hidden nodes, while 11- or 15-frame splicing may be used for networks with 3072, 4096, or even more hidden nodes. In this embodiment, 7-frame splicing is adopted as the input for a network whose hidden layers have 1024 nodes.
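A sketch of the 7-frame splicing (context of 3 frames on each side); edge frames are handled here by repeating the boundary frame, which is an assumption rather than something specified in the patent:

```python
def splice_frames(lps, context=3):
    """Concatenate each frame with its +-context neighbours, giving
    7-frame splices (7 x 257 = 1799-dim inputs) when context = 3."""
    T = lps.shape[0]
    padded = np.pad(lps, ((context, context), (0, 0)), mode='edge')
    return np.stack([padded[t : t + 2 * context + 1].ravel()
                     for t in range(T)])
```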
2. The progressive dual-output neural network model is trained with clean speech and noisy speech samples; the trained model is used to estimate the ideal soft mask of each speech frame and to perform enhancement processing.
The progressive dual-output neural network model is shown in fig. 2. The learning target of each layer raises the signal-to-noise ratio step by step; for example, the SNR of target1 is 5 dB and the SNR of target2 is 10 dB. The final target is learned along this progressively increasing SNR schedule, and the finally trained model can predict the ideal soft mask $\widehat{IRM}(t,d)$ of each time-frequency point; combining the predicted mask, it can also enhance the acoustic features, i.e., predict the log power spectrum features (LPS) of clean speech.
The method of directly estimating the log power spectrum is called direct mapping. Direct mapping can remove a large amount of noise and greatly improves both the objective indices and the subjective audibility of the speech. However, in application scenarios that interface with a back-end recognition model, if that model is not retrained, there is a large mismatch between the enhanced speech and the recognition model, and it is difficult to improve recognition accuracy. Therefore, when interfacing with back-end recognition, a time-frequency masking method is adopted. This method removes only part of the noise, raising the signal-to-noise ratio to a certain degree, so that the match with the back-end recognition model is preserved as much as possible.
The time-frequency masking speech enhancement method is described below. An ideal binary time-frequency mask (IBM) is constructed from the speech and noise signals before mixing. Each IBM unit is speech-dominated if its signal-to-noise ratio is greater than a preset local threshold (LC), and noise-dominated otherwise. The IBM can be defined quantitatively as:

$$IBM(t,d)=\begin{cases}1, & SNR(t,d)>LC\\ 0, & \text{otherwise}\end{cases}$$
although IBM-based estimation of speech enhancement based on classification can lead to better intelligibility, speech quality is severely lost, speaker information, etc. is also lost. And depends very much on the correctness of the IBM classification, and once the IBM is judged to be wrong, the information loss is very serious. Aiming at the defects of IBM, the embodiment of the invention adopts the target of ideal soft masking to train a deep neural network, which can ensure the characteristics of speech intelligibility and speech quality. IRM is now also the mainstream in masking based methods. The definition of IRM is:
Figure BDA0001742771700000072
in the above formula, X2(t, d) and N2(t, d) are log power spectral features of clean and noisy speech, respectively, and SNR (t, d) represents the signal-to-noise ratio β is a tunable parameter, typically set to 0.5, which becomes the wiener gain of the square root.
The progressive dual-output neural network model is trained with $X^2(t,d)$ and $N^2(t,d)$, and by learning the mapping relationship between them the IRM can be estimated accurately. At test time, suppose the deep neural network already gives an accurate estimate of the IRM on a certain T-F unit, i.e., the ideal soft mask $\widehat{IRM}(t,d)$; then the log power spectrum feature of the clean speech (i.e., the enhanced log power spectrum feature) can be predicted through the formula:

$$\hat{X}^2(t,d)=\widehat{IRM}(t,d)\cdot Y^2(t,d)$$
In the above formula, $Y^2(t,d)$ is the power spectrum of the noisy speech. Specifically, in terms of log power spectrum features, the prediction of the clean-speech log power spectrum feature can be written as:

$$\hat{X}(t,d)=\log\big(\widehat{IRM}(t,d)\cdot Y^2(t,d)\big)=\log\widehat{IRM}(t,d)+Y(d)$$

wherein $\log(Y^2(t,d))=Y(d)$.
It will be appreciated by those skilled in the art that wherever t appears in the above parameters it indicates time, e.g., $X^2(t,d)$ and $N^2(t,d)$; when the entire time axis is considered, t is omitted, as in $Y(d)$.
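In code, applying an estimated mask to the noisy log power spectrum is a single log-domain addition, following the formula above (a sketch; the epsilon guard is an assumption):

```python
def enhance_lps(noisy_lps, irm_est, eps=1e-12):
    """Predicted clean LPS: X_hat(t, d) = log IRM_hat(t, d) + Y(t, d)."""
    return np.log(irm_est + eps) + noisy_lps
```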
In an embodiment of the invention, the back-propagation algorithm is based on the minimum mean square error between the clean-speech log power spectrum and the enhanced log power spectrum. The log power spectrum is used because it better matches the human auditory system: the ear's perception of sound intensity is nonlinear, and the greater the intensity, the stronger the compression. A stochastic gradient descent algorithm in mini-batch mode is used to improve the convergence rate of the progressive dual-output neural network model's learning, expressed as:

$$E=\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{d=1}^{D}\Big[\big(\hat{x}^n_k(d;W_l,b_l)-x^n_k(d)\big)^2+\big(\widehat{IRM}^n(d;W_l,b_l)-IRM^n(d)\big)^2\Big]$$

In the above formula, E is the mean square error of the progressive dual-output neural network model's learning; $\hat{x}^n_k(d)$ and $x^n_k(d)$ correspondingly denote the enhanced log power spectrum feature of the k-th (k = 1, ..., K) progressive learning target at the n-th frame and d-th frequency dimension (i.e., the output of the model) and the target log power spectrum feature; $\widehat{IRM}^n(d)$ and $IRM^n(d)$ correspondingly denote the estimated ideal soft mask and the target ideal soft mask; N represents the size of the mini-batch, i.e., the number of samples; D is the total dimension of the log power spectrum feature vector; and $(W_l,b_l)$ denote the weight and bias parameters to be learned at the l-th layer.
As will be appreciated by those skilled in the art, the specific values of the progressive learning targets can be set by the user according to the actual situation.
The number of hidden layers is denoted by L, and L + 1 then denotes the output layer. In addition, it should be noted that the input features of the progressive dual-output neural network model are all Gaussian-normalized: the mean over the whole training data is normalized to 0 and the variance to 1. Both the noisy and the clean speech are normalized with the global mean and global variance of the noisy training data. One advantage of this processing is that the input data and output data of the model undergo the same transformation, making the neural network easier to learn. After the input and output data are prepared, a learning rate λ can be used to begin updating the weight and bias parameters of the network.
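The following PyTorch sketch illustrates one plausible reading of the progressive dual-output model and its multi-target MMSE objective: each stacked block learns an LPS target at a progressively higher SNR, and the top of the stack additionally emits an IRM estimate. The layer sizes, the number of targets, and the sigmoid on the mask head are assumptions for illustration, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveDualOutputNet(nn.Module):
    def __init__(self, in_dim=7 * 257, hid=1024, out_dim=257, n_targets=2):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.lps_heads = nn.ModuleList()   # one LPS target per stage
        d = in_dim
        for _ in range(n_targets):         # e.g. a +5 dB then a +10 dB target
            self.blocks.append(nn.Sequential(nn.Linear(d, hid), nn.ReLU()))
            self.lps_heads.append(nn.Linear(hid, out_dim))
            d = hid
        self.irm_head = nn.Sequential(nn.Linear(hid, out_dim), nn.Sigmoid())

    def forward(self, x):
        lps_outs, h = [], x
        for block, head in zip(self.blocks, self.lps_heads):
            h = block(h)
            lps_outs.append(head(h))       # progressively denoised LPS
        return lps_outs, self.irm_head(h)  # dual output: LPS stages + IRM

def multi_target_loss(lps_outs, lps_targets, irm_est, irm_target):
    """Sum of MMSE terms over the K progressive LPS targets plus the mask."""
    loss = sum(F.mse_loss(o, t) for o, t in zip(lps_outs, lps_targets))
    return loss + F.mse_loss(irm_est, irm_target)
```

Training would then follow the mini-batch SGD scheme above, e.g. `torch.optim.SGD(model.parameters(), lr=lam)` with Gaussian-normalized inputs and targets.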
3. Neural network decoding and speech recovery
1) If the method is applied to human listening, the waveform is reconstructed from the enhanced acoustic features (log power spectrum features) to obtain a waveform suitable for subjective listening.
First, the complex spectrum is calculated:

$$\hat{X}'(d)=e^{\hat{X}(d)/2}\cdot e^{j\angle Y(d)}$$

In the above formula, $\hat{X}(d)$, defined in the real domain, is the enhanced log power spectrum feature (considering the whole time axis, t is omitted), and $\hat{X}'(d)$ is also the enhanced log power spectrum feature but defined in the complex domain; $\angle Y(d)$ refers to the phase information taken from the input speech, because the human ear is not sensitive to small changes in phase.
Then, inverse discrete Fourier transform reconstruction is carried out to obtain the enhanced time-domain speech $\hat{x}(l)$:

$$\hat{x}(l)=\frac{1}{L}\sum_{d=0}^{L-1}\hat{X}'(d)\,e^{j2\pi ld/L},\quad l=0,1,\ldots,L-1$$
Finally, the waveform of the whole sentence is synthesized by a classical overlap-add algorithm.
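A compact numpy sketch of this recovery step, matching the 512-point DFT and 256-sample shift used earlier; np.fft.irfft supplies the symmetric second half of the spectrum automatically, and the simple overlap-add below omits window-power normalization for brevity (an assumption of this sketch):

```python
def reconstruct_waveform(enh_lps, noisy_phase, L=512, shift=256):
    """Amplitude exp(X_hat / 2) combined with the noisy phase, inverse
    DFT per frame, then overlap-add across frames."""
    mag = np.exp(enh_lps / 2.0)               # |X| recovered from log|X|^2
    spec = mag * np.exp(1j * noisy_phase)     # complex half-spectrum
    frames = np.fft.irfft(spec, n=L, axis=1)
    out = np.zeros(shift * (len(frames) - 1) + L)
    for t, f in enumerate(frames):
        out[t * shift : t * shift + L] += f   # overlap-add
    return out
```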
2) If the method is applied to a speech recognition system, the estimated ideal soft mask is applied to the acoustic features of the input speech (i.e., the log power spectrum features of the noisy speech) to obtain the masked acoustic features (i.e., the enhanced acoustic features), and the waveform is then reconstructed to obtain the enhanced speech.
The masked acoustic features are exactly the enhanced log power spectrum features, and the waveform reconstruction that yields the enhanced speech can follow the processing described above for the human-listening case.
Since the biggest problem in machine learning is the mismatch between training data and test data, the network's estimates are biased. In practical speech enhancement applications in particular, when the test speech differs greatly from the training speech, the enhanced log power spectrum features directly output by the network can damage the clean speech and thereby hurt recognition accuracy. The log power spectrum features enhanced through the ideal mask calculation reduce the damage to the clean speech while retaining more noise. A speech recognition system is strongly robust to noise but sensitive to speech damage, so the log power spectrum features enhanced through the ideal mask calculation are better suited to the recognition system.
According to the scheme of the embodiment of the invention, speech enhancement is performed with a progressive dual-output neural network model, which can output deeply denoised speech to meet the noise-reduction requirement of the human ear while also outputting partially denoised speech at a certain signal-to-noise ratio to match the data-driven back-end recognition model. In manual listening tests and objective measurements, the deeply denoised speech shows clear improvements in subjective audibility and in all indices; combined with the speech recognition model, the speech partially denoised by the neural network effectively improves recognition accuracy compared with recognition without noise reduction.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A method of speech enhancement, comprising:
extracting the acoustic features of each speech frame;
training a progressive dual-output neural network model with samples of clean speech and noisy speech, estimating the ideal soft mask of each speech frame with the trained model, and performing enhancement processing on the acoustic features;
if the method is applied to human listening, reconstructing the waveform from the enhanced acoustic features to obtain a waveform suitable for subjective listening; if the method is applied to a speech recognition system, applying the estimated ideal soft mask to the acoustic features of the input speech to obtain the masked acoustic features, and then reconstructing the waveform to obtain the enhanced speech.
2. The method of claim 1, wherein the extracting the acoustic features of each speech frame comprises:
performing framing processing on the input speech signal to obtain a sequence of speech frames;
the acoustic features adopt log power spectrum features, and when the log power spectrum features of each speech frame are extracted, the frequency-domain signal is obtained through Fourier transform and taking the modulus:

$$Y'(d)=\Big|\sum_{l=0}^{L-1}y(l)\,h(l)\,e^{-j2\pi ld/L}\Big|$$

in the above formula, d is the frequency dimension, h(l) is the window function, and L is the number of points of the discrete Fourier transform;
the log power spectrum feature is defined as:

$$Y(d)=\log|Y'(d)|^2,\quad d=0,1,\ldots,D-1$$

in the above formula, D = L/2 + 1.
3. A speech enhancement method according to claim 2, characterized in that the method further comprises: splicing consecutive frames before the extracted acoustic features are used as the input of the progressive dual-output neural network model, wherein the data obtained after splicing a certain number of frames is used as one sample, and the label of the central frame is used as the label of the sample it belongs to.
4. The speech enhancement method according to claim 1, wherein the progressive dual-output neural network model learns the final target with a progressively increasing signal-to-noise ratio, and the finally trained model can predict the ideal soft mask at each time-frequency point and can also perform enhancement processing on the acoustic features, that is, predict the log power spectrum features of clean speech.
5. A speech enhancement method according to claim 1 or 4, characterized in that the formula for predicting the clean log power spectrum features is:

$$\hat{X}(t,d)=\log\big(\widehat{IRM}(t,d)\cdot Y^2(t,d)\big)$$

wherein $\hat{X}(t,d)$ represents the predicted log power spectrum feature of the clean speech, $\widehat{IRM}(t,d)$ represents the ideal soft mask, $\log(Y^2(t,d))=Y(d)$, Y(d) is the extracted log power spectrum feature, d is the frequency dimension, and t is time.
6. The speech enhancement method according to claim 1, wherein the convergence rate of the progressive dual-output neural network model's learning is improved by a stochastic gradient descent algorithm in mini-batch mode, expressed as:

$$E=\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{d=1}^{D}\Big[\big(\hat{x}^n_k(d;W_l,b_l)-x^n_k(d)\big)^2+\big(\widehat{IRM}^n(d;W_l,b_l)-IRM^n(d)\big)^2\Big]$$

in the above formula, E is the mean square error of the progressive dual-output neural network model's learning; $\hat{x}^n_k(d)$ and $x^n_k(d)$ correspondingly denote the enhanced log power spectrum feature of the k-th (k = 1, ..., K) progressive learning target at the n-th frame and d-th frequency dimension and the target log power spectrum feature; $\widehat{IRM}^n(d)$ and $IRM^n(d)$ correspondingly denote the estimated ideal soft mask and the target ideal soft mask; N represents the size of the mini-batch, i.e., the number of samples; D is the total dimension of the log power spectrum feature vector; $(W_l,b_l)$ denote the weight and bias parameters to be learned at the l-th layer.
7. The speech enhancement method of claim 1, wherein reconstructing the waveform from the enhanced acoustic features, if applied to human listening, to obtain a waveform suitable for subjective listening comprises:

first, calculating

$$\hat{X}'(d)=e^{\hat{X}(d)/2}\cdot e^{j\angle Y(d)}$$

in the above formula, $\hat{X}(d)$, defined in the real domain, represents the enhanced log power spectrum feature, and $\hat{X}'(d)$ is also the enhanced log power spectrum feature but defined in the complex domain; $\angle Y(d)$ refers to the phase information obtained from the input speech;

then performing inverse discrete Fourier transform reconstruction to obtain the enhanced time-domain speech $\hat{x}(l)$:

$$\hat{x}(l)=\frac{1}{L}\sum_{d=0}^{L-1}\hat{X}'(d)\,e^{j2\pi ld/L}$$

wherein L is the number of discrete Fourier transform points used when extracting the acoustic features of each speech frame;

finally, the waveform of the whole sentence is synthesized by an overlap-add algorithm.
CN201810827229.0A 2018-07-25 2018-07-25 Speech enhancement method Active CN110767244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810827229.0A CN110767244B (en) 2018-07-25 2018-07-25 Speech enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810827229.0A CN110767244B (en) 2018-07-25 2018-07-25 Speech enhancement method

Publications (2)

Publication Number Publication Date
CN110767244A true CN110767244A (en) 2020-02-07
CN110767244B CN110767244B (en) 2024-03-29

Family

ID=69328031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810827229.0A Active CN110767244B (en) 2018-07-25 2018-07-25 Speech enhancement method

Country Status (1)

Country Link
CN (1) CN110767244B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN111775151A (en) * 2020-06-28 2020-10-16 河南工业职业技术学院 Intelligent control system of robot
CN112289337A (en) * 2020-11-03 2021-01-29 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement
CN113160839A (en) * 2021-04-16 2021-07-23 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN113436640A (en) * 2021-06-28 2021-09-24 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113823291A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to power operation
CN114999519A (en) * 2022-07-18 2022-09-02 中邮消费金融有限公司 Voice real-time noise reduction method and system based on double transformation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101512573A (en) * 2006-08-28 2009-08-19 国际商业机器公司 Collaborative, event driven system management
CN103531204A (en) * 2013-10-11 2014-01-22 深港产学研基地 Voice enhancing method
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101512573A (en) * 2006-08-28 2009-08-19 国际商业机器公司 Collaborative, event driven system management
CN103531204A (en) * 2013-10-11 2014-01-22 深港产学研基地 Voice enhancing method
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
CN107077860A (en) * 2014-10-21 2017-08-18 三菱电机株式会社 Method for will there is audio signal of making an uproar to be converted to enhancing audio signal
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEN Shixue; SUN Lei; DU Jun: "Application of progressive learning speech enhancement method in speech recognition" (渐进学习语音增强方法在语音识别中的应用), Journal of Chinese Computer Systems (小型微型计算机系统), no. 01, 15 January 2018 (2018-01-15) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113763976B (en) * 2020-06-05 2023-12-22 北京有竹居网络技术有限公司 Noise reduction method and device for audio signal, readable medium and electronic equipment
CN111775151A (en) * 2020-06-28 2020-10-16 河南工业职业技术学院 Intelligent control system of robot
CN112289337A (en) * 2020-11-03 2021-01-29 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement
CN112289337B (en) * 2020-11-03 2023-09-01 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement
CN113160839A (en) * 2021-04-16 2021-07-23 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN113436640A (en) * 2021-06-28 2021-09-24 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113436640B (en) * 2021-06-28 2022-11-25 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment
CN113823291A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to power operation
CN114999519A (en) * 2022-07-18 2022-09-02 中邮消费金融有限公司 Voice real-time noise reduction method and system based on double transformation

Also Published As

Publication number Publication date
CN110767244B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN110767244B (en) Speech enhancement method
Hendriks et al. DFT-domain based single-microphone noise reduction for speech enhancement
Xu et al. An experimental study on speech enhancement based on deep neural networks
Narayanan et al. Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training
Ghanbari et al. A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets
JP5666444B2 (en) Apparatus and method for processing an audio signal for speech enhancement using feature extraction
Azarang et al. A review of multi-objective deep learning speech denoising methods
Hansen et al. Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
Lee et al. A joint learning algorithm for complex-valued tf masks in deep learning-based single-channel speech enhancement systems
Kim et al. End-to-end multi-task denoising for joint SDR and PESQ optimization
Sadjadi et al. Blind spectral weighting for robust speaker identification under reverberation mismatch
Saleem et al. Supervised speech enhancement based on deep neural network
Swami et al. Speech enhancement by noise driven adaptation of perceptual scales and thresholds of continuous wavelet transform coefficients
Coto-Jimenez et al. Hybrid speech enhancement with wiener filters and deep lstm denoising autoencoders
Naik et al. A literature survey on single channel speech enhancement techniques
Gupta et al. Speech enhancement using MMSE estimation and spectral subtraction methods
Liu et al. Using Shifted Real Spectrum Mask as Training Target for Supervised Speech Separation.
Shome et al. Reference free speech quality estimation for diverse data condition
Roy et al. Deep residual network-based augmented Kalman filter for speech enhancement
Sadasivan et al. Speech enhancement using a risk estimation approach
Hepsiba et al. Computational intelligence for speech enhancement using deep neural network
Pop et al. Speech enhancement for forensic purposes
CN111383652B (en) Single-channel voice enhancement method based on double-layer dictionary learning
Liu et al. Multiresolution cochleagram speech enhancement algorithm using improved deep neural networks with skip connections

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant