CN110767244A - Speech enhancement method - Google Patents
Speech enhancement method
- Publication number
- CN110767244A (application CN201810827229.0A)
- Authority
- CN
- China
- Prior art keywords
- voice
- speech
- noise
- neural network
- enhanced
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques characterised by the analysis technique using neural networks
Abstract
The invention discloses a speech enhancement method comprising the following steps: extracting acoustic features from each speech frame; training a progressive dual-output neural network model with samples of clean and noisy speech, estimating the ideal soft mask of each speech frame with the trained progressive dual-output neural network model, and enhancing the acoustic features; if the method is applied to human listening, reconstructing the waveform from the enhanced acoustic features to obtain speech that can be listened to subjectively; if the method is applied to a speech recognition system, applying the estimated ideal soft mask to the acoustic features of the input speech to obtain the masked acoustic features, and then reconstructing the waveform to obtain the enhanced speech. The scheme of the invention meets the noise reduction requirements of human listeners while improving the recognition accuracy of noisy speech.
Description
Technical Field
The invention relates to the field of speech processing, and in particular to a speech enhancement method.
Background
Speech recognition is the process of making a machine understand what a person says, i.e., converting the vocabulary content of human speech into input a computer can recognize. The introduction of deep learning over the past twenty years, and especially in recent years, has brought significant success to speech recognition technology, which has begun to move from the laboratory to the market. Voice input, voice search, speech translation and other applications based on speech recognition are now widely deployed. It is well known that in noisy environments the performance of automatic speech recognition degrades greatly if no measures are taken, mainly because the distribution of noisy speech differs from the distribution on which the acoustic model was trained. To improve recognition accuracy in noisy environments, speech enhancement algorithms are typically used as a pre-processing step for speech recognition, matching the acoustic model's distribution by transforming the noisy speech back toward a clean state as far as possible.
Speech enhancement is an important branch of speech signal processing. Noise has been a concern since the beginning of research on speech signals, because in real environments speech is always accompanied by noise. Speech signal processing generally cares only about the content, speaker, language and so on of the speech, so noise, as an interfering item, usually needs to be removed in advance; but since the processes generating speech and noise are nonlinear and complex, denoising is difficult. Over the past few decades, unsupervised speech enhancement methods were proposed that first estimate the spectral information of the noise and then subtract the estimated noise spectrum from the noisy speech spectrum to obtain a prediction of the clean speech spectrum. However, the randomness and suddenness of noise make its tracking and estimation difficult. Meanwhile, because the interaction between noise and speech is complex, conventional speech enhancement methods require assumptions of independence between the signals and Gaussian assumptions on the feature distributions. These assumptions cause conventional methods to leave much residual noise, even musical noise. Second, the details of the speech are also largely corrupted, which shows up mainly when enhancing low signal-to-noise-ratio speech. Moreover, extremely non-stationary noise has always been a weak point of conventional speech enhancement: because non-stationary noise is sudden, it is chronically underestimated and hard to remove from the noisy speech, yet various non-stationary noises are exactly the events likely to occur in a real acoustic environment. Finally, conventional speech enhancement methods tend to introduce nonlinear distortions that are destructive to the back-end speech recognition.
Spectral subtraction is one of the most classical speech enhancement algorithms, used initially for listening enhancement and later gradually applied to speech recognition. The method divides into two parts: noise updating and noise cancellation. In noise updating, a non-speech segment must first be detected, and the noise spectrum is then estimated by combining the current frame with long-term historical information. In noise cancellation, an estimate of the clean speech spectrum is obtained by subtracting the estimated noise spectrum from the noisy speech spectrum. Because of the over-subtraction problem of spectral subtraction, an over-subtraction factor related to the signal-to-noise ratio needs to be set in each frequency band. The nonlinear operation of spectral subtraction produces residual musical noise in the enhanced signal, and specific countermeasures are adopted in practical applications. In addition, a reliable speech/non-speech detection module is crucial.
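As an illustrative sketch (not code from the patent), basic power spectral subtraction with an over-subtraction factor and a spectral floor might look like the following; the parameter names `alpha` and `beta` are illustrative choices:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    """Power spectral subtraction for one frame.

    noisy_mag : magnitude spectrum of the noisy frame, shape (D,)
    noise_mag : estimated noise magnitude spectrum, shape (D,)
    alpha     : over-subtraction factor (larger removes more noise)
    beta      : spectral floor to limit residual musical noise
    """
    noisy_pow = noisy_mag ** 2
    noise_pow = noise_mag ** 2
    clean_pow = noisy_pow - alpha * noise_pow
    # Flooring: never let the estimate drop below beta * noisy power.
    clean_pow = np.maximum(clean_pow, beta * noisy_pow)
    return np.sqrt(clean_pow)
```

The flooring step is the standard remedy for the over-subtraction problem mentioned above: without it, negative power estimates would have to be clipped to zero, which is a major source of musical noise.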
A method related to spectral subtraction is Wiener filtering. Here, a linear filter is designed for the noisy speech so as to minimize the mean square error between the clean speech and the filtered signal. The transfer function obtained under this criterion is the ratio between the power spectrum of the clean signal and that of the noisy signal, where the clean power spectrum is approximated by the difference between the noisy spectrum and the estimated noise spectrum. Wiener filtering is therefore closely related to spectral subtraction, but its computational complexity is higher, and it likewise fails on non-stationary noise. Subsequently, a soft-decision enhancement method based on the speech presence probability was proposed (McAulay and Malpass, 1980), which reduces speech distortion as much as possible. A revolutionary speech enhancement method is the minimum-mean-square-error estimation of the speech magnitude spectrum proposed by Ephraim and Malah in 1984; a minimum-mean-square-error estimator in the log-spectral domain was then proposed, considering that the human ear's perception of sound intensity is nonlinear. Among these methods, the one most commonly used in the prior art is the Minima-Controlled Recursive Averaging (MCRA) noise estimation method proposed by Israel Cohen. Compared with earlier noise estimation methods, MCRA has smaller estimation error and tracks non-stationary noise faster, and it can be considered the best method among traditional single-channel speech enhancement algorithms so far. However, this method improves the recognition accuracy of a speech recognition system only under stationary noise and high signal-to-noise ratios.
Under heavy background noise the improvement is limited and the recognition rate may even drop. The main reason is that current recognition systems are already quite robust by themselves, so if the enhancement algorithm damages the target speech, it brings a negative effect.
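The Wiener gain described above can be sketched as follows, with the clean power approximated by the difference between the noisy power and the estimated noise power (a minimal illustration under those assumptions, not the patent's implementation):

```python
import numpy as np

def wiener_gain(noisy_pow, noise_pow, eps=1e-12):
    """Per-bin Wiener gain G = S / (S + N).

    The clean-speech power S is approximated by (noisy - noise),
    floored at zero; eps guards against division by zero.
    """
    clean_pow = np.maximum(noisy_pow - noise_pow, 0.0)
    return clean_pow / (clean_pow + noise_pow + eps)
```

Multiplying the noisy spectrum by this gain attenuates noise-dominated bins toward zero while passing speech-dominated bins almost unchanged.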
Discussed above are the traditional single-channel unsupervised speech enhancement methods. Addressing some of the problems of unsupervised algorithms, methods based on supervised training models have been used in speech enhancement tasks since the end of the 1980s, such as speech enhancement based on non-negative matrix factorization. Non-negative matrix factorization (NMF) decomposes a matrix V into the product of a matrix W and a matrix H, with the characteristic constraint that V, W and H have no negative elements. Since the speech spectrum of audio is inherently non-negative, NMF is usually applied in music signal separation: music signals have a more fixed pattern than speech or noise signals, so it is more convenient to train on music signals and obtain basis matrices. Applied to the field of signal separation, the main aim of NMF is to separate two signals from a mixed signal by using their different signal bases. The signal bases can be obtained by training as follows. First, an objective function is defined, which may use the Euclidean distance or the Kullback-Leibler divergence; the latter is

KLD(V || WH) = Σ_ij ( V_ij · log( V_ij / (WH)_ij ) − V_ij + (WH)_ij )
Speech noise reduction based on non-negative matrix factorization divides into two steps: training and noise reduction. In the training phase, the feature bases of clean speech and noise, W_speech and W_noise, are obtained separately. Each has size n_f × n_b, where n_f is the dimension of a feature basis vector and n_b is the number of basis vectors. The target formulas are as follows:
min KLD(V_speech || W_speech H_speech)
min KLD(V_noise || W_noise H_noise)
In the above formulas, H_speech and H_noise are the coefficient (activation) matrices that combine the corresponding bases to reconstruct each frame of the actual signal; their size is the number of bases multiplied by the total number of frames. In the enhancement stage, the speech feature bases and the noise feature bases obtained in the training stage, namely W_speech and W_noise, are concatenated into a total feature basis W_all that can be used in decoding, as follows:
W_all = [W_speech  W_noise]
Then, given the noisy speech signal, H_all is solved by a standard gradient descent algorithm, after which the noise signal and the speech signal can be separated. The main problem with NMF-based speech enhancement is that the models of noise and clean speech are trained separately, under an assumed independence between noise and clean speech; this inherently limits the upper performance bound of the method. Meanwhile, deep-neural-network-based methods have fully surpassed the performance of NMF in speech separation tasks.
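The training/enhancement split described above can be sketched with the standard multiplicative-update rule for KL-NMF; `solve_activations` and `nmf_separate` are hypothetical helper names, not code from the patent:

```python
import numpy as np

def solve_activations(V, W, n_iter=100, eps=1e-9):
    """Fix the basis W and solve for activations H minimizing
    KLD(V || WH) via the standard multiplicative update."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    return H

def nmf_separate(V_noisy, W_speech, W_noise, eps=1e-9):
    """Separate a noisy spectrogram using pre-trained speech/noise bases."""
    W_all = np.concatenate([W_speech, W_noise], axis=1)  # W_all = [W_speech W_noise]
    H_all = solve_activations(V_noisy, W_all)
    nb = W_speech.shape[1]
    V_s = W_speech @ H_all[:nb]      # speech reconstruction
    V_n = W_noise @ H_all[nb:]       # noise reconstruction
    mask = V_s / (V_s + V_n + eps)   # Wiener-style soft mask
    return mask * V_noisy
```

Note the filtering at the end: rather than taking W_speech H_speech directly, the reconstructions are turned into a soft mask on the noisy spectrogram, which is the common practice for NMF-based separation.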
Supervised speech enhancement methods, especially those based on deep learning, have also developed vigorously in recent years. With the successful application of deep neural networks in speech recognition, a neural network can likewise be designed as a fine-grained noise reduction filter. Trained on big data, the neural network can fully learn the complex nonlinear relationship between noisy and clean speech. Moreover, because the training is learned offline, the network can, like a person, remember certain noise patterns, so some non-stationary noises can be well suppressed. However, if the training data do not match the test data, for example when the noise type differs or the speakers differ greatly, system performance degrades substantially.
Disclosure of Invention
The invention aims to provide a speech enhancement method that meets the noise reduction requirements of human listeners while improving the recognition accuracy of noisy speech.
The purpose of the invention is realized by the following technical scheme:
a method of speech enhancement comprising:
extracting acoustic features from each speech frame;
training a progressive dual-output neural network model with samples of clean and noisy speech, estimating the ideal soft mask of each speech frame with the trained progressive dual-output neural network model, and enhancing the acoustic features;
if the method is applied to human listening, reconstructing the waveform from the enhanced acoustic features to obtain speech that can be listened to subjectively; if the method is applied to a speech recognition system, applying the estimated ideal soft mask to the acoustic features of the input speech to obtain the masked acoustic features, and then reconstructing the waveform to obtain the enhanced speech.
According to the technical scheme provided by the invention, speech enhancement is carried out with a progressive dual-output neural network model: it outputs deeply denoised speech to meet the noise reduction requirements of human listeners, and also outputs partially denoised speech with a certain signal-to-noise ratio to match the data-driven back-end recognition model. Manual listening tests and objective measurements show that the deeply denoised speech is clearly improved in subjective listenability and in all indices; combined with the speech recognition model, the speech partially denoised by the neural network effectively improves recognition accuracy compared with recognition without noise reduction.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a speech enhancement method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a progressive dual-output neural network model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort shall fall within the protection scope of the present invention.
Speech enhancement methods restore noisy speech toward a clean state by transformation. Generally speaking, speech enhancement mainly considers whether the enhanced signal sounds good to a human listener, while a recognition system cares more about whether its error rate is reduced. The two targets are linked but not completely consistent, for the simple reason that a human may be insensitive to some distortions in the speech signal to which the recognition system is sensitive. It often happens that the enhanced speech improves subjective listening considerably yet brings no improvement in recognition accuracy. Another big challenge for deep-learning-based speech enhancement is generalization, a problem no deep-learning model can avoid; concretely, speech enhancement remains difficult under unknown noise, unknown speaking styles and extremely low signal-to-noise ratios.
In view of this, the embodiment of the present invention provides a speech enhancement method: through multi-target joint training, the progressive dual-output neural network model gains better generalization and simultaneously outputs deeply denoised speech and speech with a certain signal-to-noise ratio, so that both the noise reduction requirement of human listeners on noisy speech and the accuracy of the recognition system are served. As shown in fig. 1, the method mainly includes the following parts:
1. Extract the acoustic features of each speech frame.
1) The input speech signal is divided into frames to obtain a sequence of speech frames.
In this step, the input (noisy) speech is windowed with a Hamming window to obtain each frame of data. Illustratively, the Hamming window length may be 32 milliseconds with a window shift of 16 milliseconds, giving a 16-millisecond overlap between frames.
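The framing step above can be sketched as follows (an illustrative helper, assuming 16 kHz audio and the 32 ms / 16 ms parameters from the text):

```python
import numpy as np

def frame_signal(x, fs=16000, win_ms=32, shift_ms=16):
    """Split a waveform into overlapping Hamming-windowed frames.

    With fs=16000, win_ms=32 and shift_ms=16 this gives 512-sample
    frames with a 256-sample shift (16 ms overlap)."""
    win_len = int(fs * win_ms / 1000)   # 512 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)   # 256 samples
    n_frames = 1 + (len(x) - win_len) // shift
    window = np.hamming(win_len)
    return np.stack([x[i * shift : i * shift + win_len] * window
                     for i in range(n_frames)])
```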
2) The acoustic features are log power spectrum features. To extract the log power spectrum feature of each speech frame, the frequency-domain signal is obtained by Fourier transform and taking the modulus:

|Y'(d)| = | Σ_{l=0}^{L−1} y(l) · h(l) · e^{−j2πld/L} |,  d = 0, 1, ..., L−1

In the above formula, d is the frequency index, h(l) is the window function, and L is the number of points of the discrete Fourier transform. If L is increased, i.e. more information points are sampled, the input features contain more information, which also benefits the learning of the subsequent neural network.
The log power spectral signature is defined as:
Y(d) = log |Y'(d)|²,  d = 0, 1, ..., D−1
as will be appreciated by those skilled in the art, since the STFT transform is symmetric in the frequency domain, only the first half of the points are taken, i.e., D ═ L/2+1, while the second half of the points D ═ D., L-1, Y (D) are obtained by the symmetry criterion, Y (D) ═ Y (L-D). That is, y (d) involved in the subsequent calculation process also considers only the first half of the points, and all the points are considered in the final speech restoration, as described laterIs shown in formula, in the calculationSince there is a symmetry, the value corresponding to the second half is also a known quantity.
Illustratively, for a waveform file sampled at 16 kHz, the log power spectrum feature has 257 dimensions (32 ms × 16 kHz = 512 samples per frame, and D = 512/2 + 1 = 257).
3) Consecutive frames are spliced: the data obtained by concatenating a certain number of frames serves as one sample, and the label of the sample's central frame is used as the label of that sample.
Typically 7-frame, 11-frame or 15-frame splicing can be adopted; when the spliced sample is used as a training sample, the label of the central frame is used as the label of the sample. For networks with 1024 or 2048 hidden nodes, 7-frame splicing is typically used as input; for networks with 3072 or 4096 hidden nodes or even larger, 11-frame or 15-frame splicing may be used. In the present work, 7-frame splicing is adopted as input for a network whose hidden layers have 1024 nodes.
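The feature extraction and splicing steps above can be sketched as follows; `context=3` gives the 7-frame splicing used in the text, and the edge-padding strategy is an assumption not specified in the source:

```python
import numpy as np

def log_power_spectrum(frame, L=512):
    """Log power spectrum of one windowed frame; keep only the first
    D = L/2 + 1 bins, since the spectrum of a real signal is symmetric."""
    spec = np.fft.rfft(frame, n=L)          # rfft returns exactly L//2+1 bins
    return np.log(np.abs(spec) ** 2 + 1e-12)

def splice_frames(feats, context=3):
    """Concatenate each frame with `context` frames on each side
    (context=3 -> 7-frame splicing); edges are padded by repetition."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    T = feats.shape[0]
    return np.stack([padded[t : t + 2 * context + 1].reshape(-1)
                     for t in range(T)])
```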
2. Train the progressive dual-output neural network model with clean and noisy speech samples, estimate the ideal soft mask of each speech frame with the trained progressive dual-output neural network model, and perform the enhancement.
The progressive dual-output neural network model is shown in fig. 2. The learning target of each successive layer increases in signal-to-noise ratio; for example, the SNR of target1 is 5 dB and the SNR of target2 is 10 dB. The final target is learned through this progression of increasing SNR. The trained progressive dual-output neural network model can predict the ideal soft mask of each time-frequency point, and can also, combined with the predicted mask, predict the log power spectrum features (LPS) of clean speech, thereby enhancing the acoustic features.
Directly estimating the log power spectrum is called direct mapping. Direct mapping removes a large amount of noise and greatly improves both the objective indices and the subjective listenability of the speech. However, in the application scenario of docking with a back-end recognition model, if the recognition model is not retrained, the enhanced speech mismatches the recognition model considerably and it is difficult to improve recognition accuracy. Therefore, when docking with a back-end recognizer, a time-frequency masking method is adopted: it removes only part of the noise to raise the signal-to-noise ratio to a certain degree, thereby preserving the match with the back-end recognition model as much as possible.
The following describes the time-frequency masking speech enhancement method. The ideal binary mask (IBM) is a time-frequency mask constructed from the speech and noise signals before mixing. A time-frequency unit is speech-dominated if its signal-to-noise ratio exceeds a preset local criterion (LC), and noise-dominated otherwise. The IBM can be defined quantitatively as:

IBM(t, d) = 1 if SNR(t, d) > LC, and 0 otherwise
although IBM-based estimation of speech enhancement based on classification can lead to better intelligibility, speech quality is severely lost, speaker information, etc. is also lost. And depends very much on the correctness of the IBM classification, and once the IBM is judged to be wrong, the information loss is very serious. Aiming at the defects of IBM, the embodiment of the invention adopts the target of ideal soft masking to train a deep neural network, which can ensure the characteristics of speech intelligibility and speech quality. IRM is now also the mainstream in masking based methods. The definition of IRM is:
in the above formula, X2(t, d) and N2(t, d) are log power spectral features of clean and noisy speech, respectively, and SNR (t, d) represents the signal-to-noise ratio β is a tunable parameter, typically set to 0.5, which becomes the wiener gain of the square root.
The progressive dual-output neural network model is trained with X²(t, d) and N²(t, d), and the IRM is estimated accurately by learning the mapping relationship between them. At test time, suppose the deep neural network already gives an accurate estimate of the IRM on a given T-F unit, i.e. the ideal soft mask ÎRM(t, d).
The log power spectrum feature of clean speech (i.e. the enhanced log power spectrum feature) can then be predicted; the formula is:

X̂²(t, d) = ÎRM(t, d) · Y²(t, d)
In the above formula, Y²(t, d) is the power spectrum of the noisy speech. Specifically, in terms of log power spectrum features, the prediction of the clean-speech log power spectrum can be written as:

X̂(t, d) = log ÎRM(t, d) + Y(t, d)

where log( Y²(t, d) ) = Y(d).
It will be appreciated by those skilled in the art that the appearance of t in the above parameters indicates time, e.g. X²(t, d), N²(t, d); when the entire time axis is considered, t is omitted, as in Y(d).
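The IRM definition and its application in the log domain can be sketched as follows (the helper names are illustrative, and the small ε terms guard against division by zero and log of zero):

```python
import numpy as np

def ideal_ratio_mask(clean_pow, noise_pow, beta=0.5):
    """IRM(t,d) = (X^2 / (X^2 + N^2))^beta; beta=0.5 gives the
    square root of the Wiener gain mentioned in the text."""
    return (clean_pow / (clean_pow + noise_pow + 1e-12)) ** beta

def apply_mask_log_domain(irm, noisy_lps):
    """Enhanced log power spectrum: log(IRM) + Y, i.e. the mask is
    applied multiplicatively to the power spectrum."""
    return np.log(irm + 1e-12) + noisy_lps
```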
In an embodiment of the invention, the back-propagation algorithm is based on the minimum mean square error between the clean-speech log power spectrum and the enhanced log power spectrum. The log power spectrum is used because it better matches the human auditory system: the ear's perception of sound intensity is nonlinear, and the greater the intensity, the stronger the compression. A stochastic gradient descent algorithm in mini-batch mode is used to improve the convergence rate of the progressive dual-output neural network model's learning; the error can be expressed as:

E = (1/N) Σ_{n=1}^{N} [ Σ_{k=1}^{K} Σ_{d=1}^{D} ( X̂_k(n, d) − X_k(n, d) )² + Σ_{d=1}^{D} ( ÎRM(n, d) − IRM(n, d) )² ]

In the above formula, E is the mean square error of the progressive dual-output neural network model's learning; X̂_k(n, d) and X_k(n, d) are, for the k-th of the K progressive learning targets, the enhanced log power spectrum feature and the target log power spectrum feature at the n-th frame and d-th frequency dimension, X̂_k being an output of the progressive dual-output neural network model; ÎRM(n, d) and IRM(n, d) are the estimated and target ideal soft masks; N is the mini-batch size, i.e. the number of samples; D is the total dimension of the log power spectrum feature vector; and (W_l, b_l) are the weight and bias parameters to be learned at layer l.
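The mini-batch error described above — K progressive LPS targets plus the mask output — might be computed as follows (the equal weighting of the LPS and IRM terms is an assumption not specified in the text):

```python
import numpy as np

def progressive_loss(pred_lps, target_lps, pred_irm, target_irm):
    """Mini-batch MSE over the K progressive targets and the IRM output.

    pred_lps, target_lps : (K, N, D) stacked progressive LPS targets
    pred_irm, target_irm : (N, D) ideal-ratio-mask output
    Returns the error averaged over the N samples in the batch.
    """
    N = pred_lps.shape[1]
    lps_term = np.sum((pred_lps - target_lps) ** 2)
    irm_term = np.sum((pred_irm - target_irm) ** 2)
    return (lps_term + irm_term) / N
```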
As will be appreciated by those skilled in the art, the specific value of this parameter can be set by the user according to the actual situation.
L denotes the number of hidden layers, so L + 1 denotes the output layer. In addition, the input features of the progressive dual-output neural network model are all Gaussian-normalized: over the whole training data, the mean is normalized to 0 and the variance to 1. Both noisy and clean speech are normalized with the global mean and global variance of the noisy training data. One advantage of this processing is that the input data and output data of the progressive dual-output neural network model undergo the same transformation, which makes the network easier to learn. Once the input and output data are prepared, a learning rate λ is used to begin updating the weight and bias parameters of the network.
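The Gaussian normalization above can be sketched as follows; the global statistics are computed once on the noisy training data and then reused for both noisy inputs and clean targets, as the text describes:

```python
import numpy as np

def gaussian_normalize(feats, mean=None, std=None):
    """Normalize features to zero mean / unit variance per dimension.

    Call once without mean/std on the noisy training data to compute
    the global statistics, then pass those statistics in when
    normalizing clean targets or test data."""
    if mean is None:
        mean = feats.mean(axis=0)
        std = feats.std(axis=0) + 1e-12
    return (feats - mean) / std, mean, std
```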
3. Neural network decoding and speech recovery
1) If the method is applied to human listening, the waveform is reconstructed from the enhanced acoustic features (log power spectrum features) to obtain speech that can be listened to subjectively. The enhanced complex spectrum is constructed as:

X̂'(d) = exp( X̂(d) / 2 ) · e^{j∠Y(d)}

In the above formula, X̂(d), defined in the real number domain, is the enhanced log power spectrum feature (considering the whole time axis, t is omitted); X̂'(d), also an enhanced log-power-spectrum-derived quantity, is defined in the complex number domain; and ∠Y(d) is the phase information taken from the input speech, because the human ear is not sensitive to small changes in phase.
Then the enhanced time-domain speech is obtained by inverse discrete Fourier transform reconstruction.
Finally, the waveform of the whole sentence is synthesized by a classical overlap-add algorithm.
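The reconstruction steps above (magnitude from the enhanced log power spectrum, phase from the noisy input, per-frame inverse DFT, then overlap-add) can be sketched as follows. This is an illustrative minimal version; the frame length, hop size, and function names are assumptions, and no synthesis window is applied:

```python
import numpy as np

def reconstruct(enhanced_lps, noisy_phase, frame_len, hop):
    """Rebuild a waveform from enhanced log-power-spectrum features.

    enhanced_lps: (T, D) real-domain LPS, with D = frame_len // 2 + 1.
    noisy_phase:  (T, D) phase angles taken from the noisy input speech.
    """
    # Magnitude from the LPS: |Y| = exp(LPS / 2), since LPS = log|Y|^2.
    mag = np.exp(enhanced_lps / 2.0)
    spec = mag * np.exp(1j * noisy_phase)             # complex-domain spectrum
    frames = np.fft.irfft(spec, n=frame_len, axis=1)  # inverse DFT per frame
    # Classical overlap-add synthesis of the whole utterance.
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for t, fr in enumerate(frames):
        out[t * hop : t * hop + frame_len] += fr
    return out
```

With unmodified LPS and the original phase, this pipeline is an exact inverse of the analysis, which is a useful sanity check before plugging in network outputs.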
2) If the method is applied to a speech recognition system, the estimated ideal soft mask is applied to the acoustic features of the input speech (i.e. the log power spectrum features of the noisy speech) to obtain the masked acoustic features (i.e. the enhanced acoustic features), and the waveform is then reconstructed to obtain the enhanced speech.
For the masked acoustic features (that is, the enhanced log power spectrum features) and for the way the waveform is reconstructed to obtain the enhanced speech, refer to the processing described above for the human-ear application.
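The patent does not spell out the exact masking operation in the feature domain, but one common way to apply a soft mask with values in [0, 1] to log power spectrum features follows from masking the magnitude spectrum, |X̂| = m·|Y|, which in the LPS domain becomes an additive term 2·log(m). A hedged sketch (the floor constant is my addition, to avoid log(0)):

```python
import numpy as np

def apply_soft_mask(noisy_lps, mask, floor=1e-8):
    """Apply an estimated ideal soft mask (values in [0, 1]) to noisy
    log-power-spectrum features.

    Masking the magnitude spectrum, |X| = m * |Y|, corresponds in the
    LPS domain to adding 2 * log(m) to the noisy LPS.
    """
    return noisy_lps + 2.0 * np.log(np.maximum(mask, floor))
```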
Because the biggest problem in machine learning is the mismatch between training data and test data, the network's estimates are biased. In practical speech enhancement in particular, when the test speech differs greatly from the training speech, the enhanced log power spectrum features output directly by the network can damage the clean speech and thereby degrade recognition accuracy. The enhanced log power spectrum features computed through the ideal mask damage the clean speech less, at the cost of retaining more noise. A speech recognition system is robust to residual noise but sensitive to speech distortion, so the mask-enhanced log power spectrum features are better suited to the recognition system.
According to the scheme of this embodiment of the invention, speech enhancement is performed with a progressive dual-output neural network model, which can output deeply denoised speech that meets the noise-reduction needs of human ears, while also outputting partially denoised speech at a controlled signal-to-noise ratio that matches the data-driven recognition model at the back end. In manual listening tests and objective measurements, the deeply denoised speech shows clear improvements in subjective audibility and in all indices; combined with the speech recognition model, the partially denoised speech effectively improves recognition accuracy compared with recognition without noise-reduction processing.
Through the above description of the embodiments, it will be clear to those skilled in the art that the above embodiments can be implemented by software, or by software plus a necessary general-purpose hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB disk, or a removable hard disk) and which includes several instructions enabling a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (7)
1. A method of speech enhancement, comprising:
extracting the acoustic features of each speech frame;
training a progressive dual-output neural network model using samples of clean speech and noisy speech, estimating the ideal soft mask of each speech frame using the trained progressive dual-output neural network model, and performing enhancement processing on the acoustic features;
if the method is applied to human ears, reconstructing the waveform using the enhanced acoustic features to obtain a waveform suitable for subjective listening; if the method is applied to a speech recognition system, applying the estimated ideal soft mask to the acoustic features of the input speech to obtain the masked acoustic features, and then reconstructing the waveform to obtain the enhanced speech.
2. The method of claim 1, wherein the extracting the acoustic features of each speech frame comprises:
performing framing processing on an input voice signal to obtain a voice frame sequence;
the acoustic features are log power spectrum features, and when extracting the log power spectrum feature of each speech frame, the frequency-domain signal is first obtained by Fourier transform and taking the modulus:
Y'(d) = Σ_{l=0}^{L-1} y(l)h(l)e^{-j2πdl/L}, d = 0, 1, ..., L-1;
in the above formula, d is the frequency dimension, h(l) is the window function, and L is the number of points of the discrete Fourier transform;
the log power spectral signature is defined as:
Y(d) = log|Y'(d)|², d = 0, 1, ..., D-1;
in the above formula, D = L/2 + 1.
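The feature extraction of claim 2 (framing, windowed L-point DFT, log of the squared modulus) can be sketched as follows. The frame length, hop size, choice of Hamming window, and the small constant added before the logarithm are illustrative assumptions, not part of the claim:

```python
import numpy as np

def extract_lps(signal, frame_len=512, hop=256):
    """Frame the input speech and compute log-power-spectrum features.

    Returns an array of shape (T, D) with D = frame_len // 2 + 1,
    i.e. Y(d) = log|Y'(d)|^2 for d = 0..D-1.
    """
    window = np.hamming(frame_len)               # h(l), the analysis window
    n_frames = 1 + (len(signal) - frame_len) // hop
    feats = []
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len] * window
        spec = np.fft.rfft(frame, n=frame_len)   # L-point DFT, D = L/2 + 1 bins
        feats.append(np.log(np.abs(spec) ** 2 + 1e-12))  # avoid log(0)
    return np.array(feats)
```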
3. The speech enhancement method according to claim 2, further comprising: splicing consecutive frames before using the extracted acoustic features as input to the progressive dual-output neural network model, wherein the data obtained by splicing a certain number of frames serves as one sample, and the label of the sample's center frame serves as the label of that sample.
4. The speech enhancement method according to claim 1, wherein the progressive dual-output neural network model learns the final target in a manner in which the signal-to-noise ratio gradually increases, so that the finally trained progressive dual-output neural network model can both predict the ideal soft mask at each time-frequency point and perform enhancement processing on the acoustic features, that is, predict the log power spectrum features of clean speech.
5. A speech enhancement method according to claim 1 or 4, characterized in that the formula for predicting the clean log power spectral features is:
6. The speech enhancement method according to claim 1, wherein the convergence rate of the progressive dual-output neural network model's learning is improved by a stochastic gradient descent algorithm in mini-batch mode, expressed as:
in the above formula, E is the mean squared error of the progressive dual-output neural network model's learning; ŷ_k(n,d) and y_k(n,d), k = 1…K, denote respectively the enhanced log power spectrum feature of the k-th progressive learning target at the n-th frame and d-th frequency dimension and the corresponding target log power spectrum feature; m̂(n,d) and m(n,d) denote the estimated ideal soft mask and the target ideal soft mask; N is the mini-batch size, i.e. the number of samples; D is the total dimension of the log power spectrum feature vector; and (W_l, b_l) are the weight and bias parameters to be learned at the l-th layer.
7. The speech enhancement method of claim 1, wherein, when applied to human ears, reconstructing the waveform using the enhanced acoustic features to obtain a waveform suitable for subjective listening comprises:
In the above formula, the first and second carbon atoms are,is defined on a real number domain, represents the characteristic of the enhanced log power spectrum,is also enhanced∠ Y (d) refers to phase information obtained from input speech;
then performing inverse discrete Fourier transform reconstruction to obtain the enhanced time-domain speech;
wherein L is the number of points of the discrete Fourier transform used when extracting the acoustic features of each speech frame;
finally, the waveform of the whole sentence is synthesized by an overlap-add algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810827229.0A CN110767244B (en) | 2018-07-25 | 2018-07-25 | Speech enhancement method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810827229.0A CN110767244B (en) | 2018-07-25 | 2018-07-25 | Speech enhancement method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110767244A true CN110767244A (en) | 2020-02-07 |
CN110767244B CN110767244B (en) | 2024-03-29 |
Family
ID=69328031
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810827229.0A Active CN110767244B (en) | 2018-07-25 | 2018-07-25 | Speech enhancement method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110767244B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047502A (en) * | 2019-04-18 | 2019-07-23 | 广州九四智能科技有限公司 | The recognition methods of hierarchical voice de-noising and system under noise circumstance |
CN111775151A (en) * | 2020-06-28 | 2020-10-16 | 河南工业职业技术学院 | Intelligent control system of robot |
CN112289337A (en) * | 2020-11-03 | 2021-01-29 | 北京声加科技有限公司 | Method and device for filtering residual noise after machine learning voice enhancement |
CN113160839A (en) * | 2021-04-16 | 2021-07-23 | 电子科技大学 | Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning |
CN113436640A (en) * | 2021-06-28 | 2021-09-24 | 歌尔科技有限公司 | Audio noise reduction method, device and system and computer readable storage medium |
CN113611318A (en) * | 2021-06-29 | 2021-11-05 | 华为技术有限公司 | Audio data enhancement method and related equipment |
CN113763976A (en) * | 2020-06-05 | 2021-12-07 | 北京有竹居网络技术有限公司 | Method and device for reducing noise of audio signal, readable medium and electronic equipment |
CN113823291A (en) * | 2021-09-07 | 2021-12-21 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to power operation |
CN114999519A (en) * | 2022-07-18 | 2022-09-02 | 中邮消费金融有限公司 | Voice real-time noise reduction method and system based on double transformation |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101512573A (en) * | 2006-08-28 | 2009-08-19 | 国际商业机器公司 | Collaborative, event driven system management |
CN103531204A (en) * | 2013-10-11 | 2014-01-22 | 深港产学研基地 | Voice enhancing method |
US20160111107A1 (en) * | 2014-10-21 | 2016-04-21 | Mitsubishi Electric Research Laboratories, Inc. | Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System |
US20170061978A1 (en) * | 2014-11-07 | 2017-03-02 | Shannon Campbell | Real-time method for implementing deep neural network based speech separation |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
- 2018-07-25 CN CN201810827229.0A patent/CN110767244B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101512573A (en) * | 2006-08-28 | 2009-08-19 | 国际商业机器公司 | Collaborative, event driven system management |
CN103531204A (en) * | 2013-10-11 | 2014-01-22 | 深港产学研基地 | Voice enhancing method |
US20160111107A1 (en) * | 2014-10-21 | 2016-04-21 | Mitsubishi Electric Research Laboratories, Inc. | Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System |
CN107077860A (en) * | 2014-10-21 | 2017-08-18 | 三菱电机株式会社 | Method for will there is audio signal of making an uproar to be converted to enhancing audio signal |
US20170061978A1 (en) * | 2014-11-07 | 2017-03-02 | Shannon Campbell | Real-time method for implementing deep neural network based speech separation |
CN107452389A (en) * | 2017-07-20 | 2017-12-08 | 大象声科(深圳)科技有限公司 | A kind of general monophonic real-time noise-reducing method |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
Non-Patent Citations (1)
Title |
---|
Wen Shixue; Sun Lei; Du Jun: "Application of the progressive learning speech enhancement method in speech recognition" (in Chinese), 小型微型计算机***, no. 01, 15 January 2018 (2018-01-15) *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047502A (en) * | 2019-04-18 | 2019-07-23 | 广州九四智能科技有限公司 | The recognition methods of hierarchical voice de-noising and system under noise circumstance |
CN113763976A (en) * | 2020-06-05 | 2021-12-07 | 北京有竹居网络技术有限公司 | Method and device for reducing noise of audio signal, readable medium and electronic equipment |
CN113763976B (en) * | 2020-06-05 | 2023-12-22 | 北京有竹居网络技术有限公司 | Noise reduction method and device for audio signal, readable medium and electronic equipment |
CN111775151A (en) * | 2020-06-28 | 2020-10-16 | 河南工业职业技术学院 | Intelligent control system of robot |
CN112289337A (en) * | 2020-11-03 | 2021-01-29 | 北京声加科技有限公司 | Method and device for filtering residual noise after machine learning voice enhancement |
CN112289337B (en) * | 2020-11-03 | 2023-09-01 | 北京声加科技有限公司 | Method and device for filtering residual noise after machine learning voice enhancement |
CN113160839A (en) * | 2021-04-16 | 2021-07-23 | 电子科技大学 | Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning |
CN113436640A (en) * | 2021-06-28 | 2021-09-24 | 歌尔科技有限公司 | Audio noise reduction method, device and system and computer readable storage medium |
CN113436640B (en) * | 2021-06-28 | 2022-11-25 | 歌尔科技有限公司 | Audio noise reduction method, device and system and computer readable storage medium |
CN113611318A (en) * | 2021-06-29 | 2021-11-05 | 华为技术有限公司 | Audio data enhancement method and related equipment |
CN113823291A (en) * | 2021-09-07 | 2021-12-21 | 广西电网有限责任公司贺州供电局 | Voiceprint recognition method and system applied to power operation |
CN114999519A (en) * | 2022-07-18 | 2022-09-02 | 中邮消费金融有限公司 | Voice real-time noise reduction method and system based on double transformation |
Also Published As
Publication number | Publication date |
---|---|
CN110767244B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110767244B (en) | Speech enhancement method | |
Hendriks et al. | DFT-domain based single-microphone noise reduction for speech enhancement | |
Xu et al. | An experimental study on speech enhancement based on deep neural networks | |
Narayanan et al. | Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training | |
Ghanbari et al. | A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets | |
JP5666444B2 (en) | Apparatus and method for processing an audio signal for speech enhancement using feature extraction | |
Azarang et al. | A review of multi-objective deep learning speech denoising methods | |
Hansen et al. | Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system | |
Tu et al. | A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition | |
Lee et al. | A joint learning algorithm for complex-valued tf masks in deep learning-based single-channel speech enhancement systems | |
Kim et al. | End-to-end multi-task denoising for joint SDR and PESQ optimization | |
Sadjadi et al. | Blind spectral weighting for robust speaker identification under reverberation mismatch | |
Saleem et al. | Supervised speech enhancement based on deep neural network | |
Swami et al. | Speech enhancement by noise driven adaptation of perceptual scales and thresholds of continuous wavelet transform coefficients | |
Coto-Jimenez et al. | Hybrid speech enhancement with wiener filters and deep lstm denoising autoencoders | |
Naik et al. | A literature survey on single channel speech enhancement techniques | |
Gupta et al. | Speech enhancement using MMSE estimation and spectral subtraction methods | |
Liu et al. | Using Shifted Real Spectrum Mask as Training Target for Supervised Speech Separation. | |
Shome et al. | Reference free speech quality estimation for diverse data condition | |
Roy et al. | Deep residual network-based augmented Kalman filter for speech enhancement | |
Sadasivan et al. | Speech enhancement using a risk estimation approach | |
Hepsiba et al. | Computational intelligence for speech enhancement using deep neural network | |
Pop et al. | Speech enhancement for forensic purposes | |
CN111383652B (en) | Single-channel voice enhancement method based on double-layer dictionary learning | |
Liu et al. | Multiresolution cochleagram speech enhancement algorithm using improved deep neural networks with skip connections |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||