CN110767244A - Speech enhancement method - Google Patents

Speech enhancement method

Info

Publication number
CN110767244A
CN110767244A
Authority
CN
China
Prior art keywords
voice
speech
noise
neural network
enhanced
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810827229.0A
Other languages
Chinese (zh)
Other versions
CN110767244B (en)
Inventor
杜俊 (Jun Du)
高天 (Tian Gao)
屠彦辉 (Yanhui Tu)
王立众 (Lizhong Wang)
杨磊 (Lei Yang)
徐学淼 (Xuemiao Xu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Beijing Samsung Telecommunications Technology Research Co Ltd
Original Assignee
University of Science and Technology of China USTC
Beijing Samsung Telecommunications Technology Research Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC, Beijing Samsung Telecommunications Technology Research Co Ltd filed Critical University of Science and Technology of China USTC
Priority to CN201810827229.0A priority Critical patent/CN110767244B/en
Publication of CN110767244A publication Critical patent/CN110767244A/en
Application granted granted Critical
Publication of CN110767244B publication Critical patent/CN110767244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The invention discloses a speech enhancement method, which comprises the following steps: extracting the acoustic features of each speech frame; training a progressive dual-output neural network model with samples of clean speech and noisy speech, estimating the ideal soft mask of each speech frame with the trained model, and performing enhancement processing on the acoustic features; if the method is applied to human listening, the waveform is reconstructed from the enhanced acoustic features to obtain a waveform suitable for subjective listening; if the method is applied to a speech recognition system, the estimated ideal soft mask is applied to the acoustic features of the input speech to obtain the masked acoustic features, and the waveform is then reconstructed to obtain the enhanced speech. The scheme of the invention can meet the noise-reduction requirement of the human ear and improve the recognition accuracy of noisy speech.

Description

Speech enhancement method
Technical Field
The invention relates to the technical field of speech processing, in particular to a speech enhancement method.
Background
Speech recognition is the process of making a machine understand what a person says, i.e., converting the lexical content of human speech into input that a computer can recognize. Over roughly the last two decades, and especially in recent years with the introduction of deep learning, speech recognition technology has achieved significant success and has begun to move from the laboratory to the market. At present, voice input, voice retrieval, voice translation and the like based on speech recognition technology are widely applied. It is well known that in noisy environments the performance of automatic speech recognition degrades greatly if no measures are taken, mainly because the distribution of noisy speech differs from the distribution on which the acoustic model was trained. To improve recognition accuracy in noisy environments, speech enhancement algorithms are typically used as a pre-processing step for speech recognition, matching the acoustic model's distribution by transforming the noisy speech back toward a clean state as far as possible.
Speech enhancement is an important branch of the speech signal processing field, and noise has been a concern since research on speech signals began, because speech in real-life environments is always accompanied by noise. Speech signal processing is generally interested only in the content of the speech, the speaker, the language, and so on, so noise, as an interfering item, generally needs to be removed in advance; but given that the processes generating speech and noise are nonlinear and complex, denoising is difficult. Over the past few decades, unsupervised speech enhancement methods have been proposed that first estimate the spectral information of the noise and then subtract the estimated noise spectrum from the noisy speech spectrum to obtain a prediction of the clean speech spectrum. The randomness and abruptness of noise, however, make its tracking and estimation difficult. Moreover, because the interaction between noise and speech is complex, conventional speech enhancement methods require assumptions of independence between the signals and Gaussian assumptions on the feature distributions. These assumptions cause conventional methods to leave considerable residual noise, even musical noise. Second, the details of the speech are also largely corrupted, which shows up mainly when enhancing low signal-to-noise-ratio speech. Furthermore, highly non-stationary noise has always been a sore spot for traditional speech enhancement: because such noise is sudden, it is chronically underestimated and hard to remove from the noisy speech, yet various non-stationary noises are exactly the events likely to occur in a real acoustic environment. Finally, conventional speech enhancement methods tend to introduce nonlinear distortions that can be destructive to back-end speech recognition.
Spectral subtraction is one of the most classical speech enhancement algorithms, used initially for speech enhancement and later gradually adopted in speech recognition. The method is divided into two parts: noise updating and noise elimination. In noise updating, non-speech segments must first be detected, and a method combining the current frame with long-term historical information is then used to estimate the noise spectrum. In noise elimination, an estimate of the clean speech spectrum is obtained by subtracting the estimated noise spectrum from the noisy speech spectrum. Because of the over-subtraction problem of spectral subtraction, an over-subtraction factor related to the signal-to-noise ratio needs to be set in each frequency band. The nonlinear operation of spectral subtraction produces residual musical noise in the enhanced signal, and specific countermeasures are adopted in practical applications. In addition, a reliable speech/non-speech detection module is also crucial.
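To make the two-part scheme above concrete, the following is a minimal numpy sketch of the noise-elimination step, assuming a noise power estimate is already available from detected non-speech segments; the function name and the over-subtraction/floor defaults are illustrative choices, not values taken from this patent.

```python
import numpy as np

def spectral_subtraction(noisy_power, noise_power, alpha=4.0, beta=0.01):
    """Subtract an estimated noise power spectrum from the noisy power
    spectrum, with an over-subtraction factor and a spectral floor.

    noisy_power : (frames, bins) power spectrum of the noisy speech
    noise_power : (bins,) noise power estimated on non-speech segments
    alpha       : over-subtraction factor (typically larger at low SNR)
    beta        : floor factor limiting residual musical noise
    """
    clean_est = noisy_power - alpha * noise_power[None, :]
    floor = beta * noise_power[None, :]
    # Never let the estimate drop below the floor: flooring is one of the
    # specific countermeasures used against residual musical noise.
    return np.maximum(clean_est, floor)
```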
One method related to spectral subtraction is Wiener filtering. In this method, a linear filter is designed to minimize the mean square error between the clean speech and the filtered noisy speech signal. The transfer function obtained under this criterion is the ratio between the power spectrum of the clean signal and that of the noisy signal, where the clean power spectrum is approximated by the difference between the noisy spectrum and the estimated noise spectrum. Wiener filtering is therefore closely related to spectral subtraction, but its computational complexity is higher, and it likewise fails on non-stationary noise. Subsequently, a soft-decision enhancement method based on the speech presence probability was proposed (McAulay and Malpass, 1980), which can reduce speech distortion as much as possible. A revolutionary speech enhancement method is the minimum mean square error (MMSE) speech magnitude spectrum estimation proposed by Ephraim and Malah in 1984; an MMSE estimator in the log-spectral domain was proposed afterwards, considering that the human ear's perception of sound intensity is nonlinear. Among these methods, the one most commonly used in the prior art is the Minima Controlled Recursive Averaging (MCRA) noise estimation method proposed by Israel Cohen. Compared with earlier noise estimation methods, MCRA has smaller estimation error and tracks non-stationary noise faster. It can be considered the best method among traditional single-channel speech enhancement algorithms to date; however, it improves the recognition accuracy of a speech recognition system only under stationary noise and high signal-to-noise ratios. Under heavy background noise the improvement is limited and the recognition rate may even drop. The main reason is that the robustness of current recognition systems is already strong, so if the enhancement algorithm damages the target speech, it brings a net negative effect.
Discussed above are conventional single-channel unsupervised speech enhancement methods. Addressing some of the problems of unsupervised algorithms, methods based on supervised training models have been used in speech enhancement tasks since the end of the 1980s, such as speech enhancement based on non-negative matrix factorization. Non-negative Matrix Factorization (NMF) decomposes a matrix V into the product of a matrix W and a matrix H, under the constraint that V, W, and H contain no negative elements. Since the magnitude spectrum of audio is inherently non-negative, NMF is commonly applied in music signal separation: music signals have a more fixed pattern than speech or noise signals, so it is more convenient to train on music signals and obtain a set of matrix bases. Applied to the field of signal separation, the main aim of NMF is to separate two signals from a mixture by using their different signal bases. The signal bases can be obtained by training as follows. First, an objective function is defined, which may use the Euclidean distance or the Kullback-Leibler divergence:
$$D_{EU}(V\,\|\,WH)=\sum_{i,j}\big(V_{ij}-(WH)_{ij}\big)^2$$

$$D_{KL}(V\,\|\,WH)=\sum_{i,j}\Big(V_{ij}\log\frac{V_{ij}}{(WH)_{ij}}-V_{ij}+(WH)_{ij}\Big)$$
Speech noise reduction based on non-negative matrix factorization is mainly divided into two steps: training and noise reduction. In the training phase, the feature bases $W_{speech}$ and $W_{noise}$ of clean speech and noise are first obtained respectively. They are both of size $n_f \times n_b$, where $n_f$ is the dimension of the feature basis and $n_b$ is the number of basis vectors. The target formulas are as follows:
$$\min\ \mathrm{KLD}(V_{speech}\,\|\,W_{speech}H_{speech})$$

$$\min\ \mathrm{KLD}(V_{noise}\,\|\,W_{noise}H_{noise})$$
In the above formulas, $H_{speech}$ and $H_{noise}$ are the coefficient (activation) matrices that generate the actual signal from the corresponding bases for each frame; their dimensions are the number of bases multiplied by the total number of frames. In the enhancement stage, the speech and noise feature bases obtained in the training stage, namely $W_{speech}$ and $W_{noise}$, are spliced into a total feature basis $W_{all}$ that can be used in decoding, as follows:
$$W_{all}=[\,W_{speech}\ \ W_{noise}\,]$$
Then, given a noisy speech signal, $H_{all}$ is solved by a standard gradient descent algorithm, and finally the noise signal and the speech signal can be separated. The main problem with NMF-based speech enhancement methods is that the models of noise and clean speech are trained separately, and independence between noise and clean speech is assumed; this effectively limits the upper bound on the method's performance. Meanwhile, deep-neural-network-based methods have already fully surpassed NMF methods on related speech separation tasks.
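To illustrate the decoding step above, here is a small numpy sketch of NMF-based enhancement under the stated assumptions ($W_{speech}$ and $W_{noise}$ already trained on magnitude spectra). Note that it solves for $H_{all}$ with the standard multiplicative-update rule for the KL objective rather than plain gradient descent, and the Wiener-style re-weighting at the end is one common way to recompose the speech signal; both are illustrative choices.

```python
import numpy as np

def nmf_kl_activations(V, W, n_iter=100, eps=1e-10):
    """Fix the basis W and solve for activations H minimizing KL(V || WH)
    with the classic multiplicative updates."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    return H

def nmf_enhance(V_noisy, W_speech, W_noise, eps=1e-10):
    """Separate speech from a noisy magnitude spectrogram using the
    concatenated basis W_all = [W_speech  W_noise]."""
    W_all = np.hstack([W_speech, W_noise])
    H_all = nmf_kl_activations(V_noisy, W_all)
    k = W_speech.shape[1]
    V_sp = W_speech @ H_all[:k]
    V_ns = W_noise @ H_all[k:]
    # Soft mask: fraction of each T-F unit attributed to speech.
    return (V_sp / (V_sp + V_ns + eps)) * V_noisy
```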
Supervised speech enhancement methods, especially those based on deep learning, have also developed vigorously in recent years. With the successful application of deep neural networks in speech recognition, a neural network can also be designed as a fine-grained noise reduction filter. Trained on big data, the network can fully learn the complex nonlinear relationship between noisy and clean speech. Moreover, since the training is performed offline, the network can, like a person, remember certain noise patterns, so some non-stationary noises can be suppressed well. However, if the training data do not match the test data, for example when the noise types differ or the speakers differ greatly, system performance degrades substantially.
Disclosure of Invention
The invention aims to provide a speech enhancement method which can meet the noise-reduction requirement of the human ear and improve the recognition accuracy of noisy speech.
The purpose of the invention is realized by the following technical scheme:
a method of speech enhancement comprising:
extracting the acoustic features of each speech frame;
training a progressive dual-output neural network model with samples of clean speech and noisy speech, estimating the ideal soft mask of each speech frame with the trained model, and performing enhancement processing on the acoustic features;
if the method is applied to human listening, the waveform is reconstructed from the enhanced acoustic features to obtain a waveform suitable for subjective listening; if the method is applied to a speech recognition system, the estimated ideal soft mask is applied to the acoustic features of the input speech to obtain the masked acoustic features, and the waveform is then reconstructed to obtain the enhanced speech.
According to the technical scheme provided by the invention, speech enhancement is performed with a progressive dual-output neural network model, which can output deeply denoised speech to meet the noise-reduction requirement of the human ear while also outputting partially denoised speech at a certain signal-to-noise ratio to match the data-driven back-end recognition model. In manual listening tests and objective measurements, the deeply denoised speech shows clear improvements in subjective audibility and in all indices; combined with the speech recognition model, the speech partially denoised by the neural network effectively improves recognition accuracy compared with recognition without noise reduction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a speech enhancement method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a progressive dual-output neural network model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Speech enhancement methods restore noisy speech to as clean a state as possible through transformation. Generally speaking, speech enhancement is mainly concerned with whether the enhanced signal sounds good to humans, whereas a recognition system cares more about whether its error rate decreases. These two targets are linked but not fully consistent, for the simple reason that a human may be immune to certain distortions in the speech signal while the recognition system is sensitive to them. It often happens that enhanced speech shows a good subjective improvement yet brings no improvement in speech recognition accuracy. Another big challenge faced by deep-learning-based speech enhancement is generalization, a difficulty no deep-learning model can avoid; specifically, unknown noise, unknown speaking styles, and extremely low signal-to-noise ratios all make speech enhancement hard.
To address this, the embodiment of the present invention provides a speech enhancement method. Through multi-target joint training, a progressive dual-output neural network model gains better generalization and simultaneously outputs deeply denoised speech and speech at a certain signal-to-noise ratio, so it can meet the human ear's noise-reduction requirement for noisy speech while also improving the accuracy of the recognition system. As shown in fig. 1, the flow of the method mainly includes the following parts:
1. Extracting the acoustic features of each speech frame.
1) The input speech signal is framed to obtain a sequence of speech frames.
In this step, the input speech (noisy speech) is framed with a Hamming window to obtain each frame of data. Illustratively, a common Hamming window length is 32 milliseconds with a window shift of 16 milliseconds, so adjacent frames overlap by 16 milliseconds.
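A minimal numpy sketch of this framing step, using the sampling rate and window parameters of the example above (the function name is illustrative):

```python
import numpy as np

def frame_signal(x, fs=16000, win_ms=32, shift_ms=16):
    """Cut a waveform into overlapping frames and apply a Hamming window:
    32 ms window, 16 ms shift, so adjacent frames overlap by 16 ms."""
    win_len = fs * win_ms // 1000   # 512 samples at 16 kHz
    shift = fs * shift_ms // 1000   # 256 samples
    window = np.hamming(win_len)
    n_frames = 1 + max(0, (len(x) - win_len) // shift)
    return np.stack([x[t * shift : t * shift + win_len] * window
                     for t in range(n_frames)])
```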
2) The acoustic features are log power spectrum features. When extracting the log power spectrum features of each speech frame, the frequency-domain signal is obtained through Fourier transform and taking the modulus:

$$Y'(d)=\Big|\sum_{l=0}^{L-1}y(l)\,h(l)\,e^{-j2\pi ld/L}\Big|$$

In the above formula, y(l) is the time-domain frame signal, d is the frequency dimension, h(l) is the window function, and L is the number of points of the discrete Fourier transform. If L is increased, i.e., more information points are sampled, the input features will contain more information, which also benefits the learning of the subsequent neural network.
The log power spectrum feature is defined as:

$$Y(d)=\log|Y'(d)|^2,\quad d=0,1,\ldots,D-1$$
as will be appreciated by those skilled in the art, since the STFT transform is symmetric in the frequency domain, only the first half of the points are taken, i.e., D ═ L/2+1, while the second half of the points D ═ D., L-1, Y (D) are obtained by the symmetry criterion, Y (D) ═ Y (L-D). That is, y (d) involved in the subsequent calculation process also considers only the first half of the points, and all the points are considered in the final speech restoration, as described later
Figure BDA0001742771700000061
Is shown in formula, in the calculationSince there is a symmetry, the value corresponding to the second half is also a known quantity.
Illustratively, for a waveform file sampled at 16 kHz, the log power spectrum feature has 257 dimensions (32 ms × 16 kHz = 512 samples per frame, and D = 512/2 + 1 = 257).
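Continuing the sketch, the log power spectrum features and the phase (needed later for waveform recovery) can be computed per frame as follows; np.fft.rfft returns exactly the D = L/2 + 1 non-redundant bins discussed above. The small floor added inside the log is an assumption to avoid log(0) on silent frames.

```python
def log_power_spectrum(frames, L=512):
    """Per-frame log power spectrum, keeping the first D = L//2 + 1 bins;
    the remaining bins follow from conjugate symmetry."""
    spec = np.fft.rfft(frames, n=L, axis=1)    # shape (T, 257) at L = 512
    lps = np.log(np.abs(spec) ** 2 + 1e-12)
    phase = np.angle(spec)                     # saved for reconstruction
    return lps, phase
```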
3) Consecutive frames are spliced: the data obtained after splicing a certain number of frames is used as one sample, and the label of the central frame is used as the label of the sample it belongs to.
Typically 7-frame, 11-frame, or 15-frame splicing can be adopted, and when a spliced sample is used as a training sample, the label of the central frame serves as the label of the sample. Typically, 7-frame splicing is used as input for networks with 1024 or 2048 hidden nodes, while 11- or 15-frame splicing may be used for networks with 3072, 4096, or even more hidden nodes. In this embodiment, 7-frame splicing is adopted as the input for a network whose hidden layers have 1024 nodes.
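A sketch of the 7-frame splicing (context of 3 frames on each side); edge frames are handled here by repeating the boundary frame, which is an assumption rather than something specified in the patent:

```python
def splice_frames(lps, context=3):
    """Concatenate each frame with its +-context neighbours, giving
    7-frame splices (7 x 257 = 1799-dim inputs) when context = 3."""
    T = lps.shape[0]
    padded = np.pad(lps, ((context, context), (0, 0)), mode='edge')
    return np.stack([padded[t : t + 2 * context + 1].ravel()
                     for t in range(T)])
```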
2. The progressive dual-output neural network model is trained with clean speech and noisy speech samples; the trained model is used to estimate the ideal soft mask of each speech frame and to perform enhancement processing.
The progressive dual-output neural network model is shown in fig. 2. The learning target of each layer raises the signal-to-noise ratio step by step; for example, the SNR of target1 is 5 dB and the SNR of target2 is 10 dB. The final target is learned along this progressively increasing SNR schedule, and the finally trained model can predict the ideal soft mask $\widehat{IRM}(t,d)$ of each time-frequency point; combining the predicted mask, it can also enhance the acoustic features, i.e., predict the log power spectrum features (LPS) of clean speech.
The method of directly estimating the log power spectrum is called direct mapping. Direct mapping can remove a large amount of noise and greatly improves both the objective indices and the subjective audibility of the speech. However, in application scenarios that interface with a back-end recognition model, if that model is not retrained, there is a large mismatch between the enhanced speech and the recognition model, and it is difficult to improve recognition accuracy. Therefore, when interfacing with back-end recognition, a time-frequency masking method is adopted. This method removes only part of the noise, raising the signal-to-noise ratio to a certain degree, so that the match with the back-end recognition model is preserved as much as possible.
The time-frequency masking speech enhancement method is described below. An ideal binary time-frequency mask (IBM) is constructed from the speech and noise signals before mixing. Each IBM unit is speech-dominated if its signal-to-noise ratio is greater than a preset local threshold (LC), and noise-dominated otherwise. The IBM can be defined quantitatively as:

$$IBM(t,d)=\begin{cases}1, & SNR(t,d)>LC\\ 0, & \text{otherwise}\end{cases}$$
although IBM-based estimation of speech enhancement based on classification can lead to better intelligibility, speech quality is severely lost, speaker information, etc. is also lost. And depends very much on the correctness of the IBM classification, and once the IBM is judged to be wrong, the information loss is very serious. Aiming at the defects of IBM, the embodiment of the invention adopts the target of ideal soft masking to train a deep neural network, which can ensure the characteristics of speech intelligibility and speech quality. IRM is now also the mainstream in masking based methods. The definition of IRM is:
Figure BDA0001742771700000072
in the above formula, X2(t, d) and N2(t, d) are log power spectral features of clean and noisy speech, respectively, and SNR (t, d) represents the signal-to-noise ratio β is a tunable parameter, typically set to 0.5, which becomes the wiener gain of the square root.
The progressive dual-output neural network model is trained with $X^2(t,d)$ and $N^2(t,d)$, and by learning the mapping relationship between them the IRM can be estimated accurately. At test time, suppose the deep neural network already gives an accurate estimate of the IRM on a certain T-F unit, i.e., the ideal soft mask $\widehat{IRM}(t,d)$; then the log power spectrum feature of the clean speech (i.e., the enhanced log power spectrum feature) can be predicted through the formula:

$$\hat{X}^2(t,d)=\widehat{IRM}(t,d)\cdot Y^2(t,d)$$
In the above formula, $Y^2(t,d)$ is the power spectrum of the noisy speech. Specifically, in terms of log power spectrum features, the prediction of the clean-speech log power spectrum feature can be written as:

$$\hat{X}(t,d)=\log\big(\widehat{IRM}(t,d)\cdot Y^2(t,d)\big)=\log\widehat{IRM}(t,d)+Y(d)$$

wherein $\log(Y^2(t,d))=Y(d)$.
It will be appreciated by those skilled in the art that wherever t appears in the above parameters it indicates time, e.g., $X^2(t,d)$ and $N^2(t,d)$; when the entire time axis is considered, t is omitted, as in $Y(d)$.
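In code, applying an estimated mask to the noisy log power spectrum is a single log-domain addition, following the formula above (a sketch; the epsilon guard is an assumption):

```python
def enhance_lps(noisy_lps, irm_est, eps=1e-12):
    """Predicted clean LPS: X_hat(t, d) = log IRM_hat(t, d) + Y(t, d)."""
    return np.log(irm_est + eps) + noisy_lps
```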
In an embodiment of the invention, the back-propagation algorithm is based on the minimum mean square error between the clean-speech log power spectrum and the enhanced log power spectrum. The log power spectrum is used because it better matches the human auditory system: the ear's perception of sound intensity is nonlinear, and the greater the intensity, the stronger the compression. A stochastic gradient descent algorithm in mini-batch mode is used to improve the convergence rate of the progressive dual-output neural network model's learning, expressed as:

$$E=\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{d=1}^{D}\Big[\big(\hat{x}^n_k(d;W_l,b_l)-x^n_k(d)\big)^2+\big(\widehat{IRM}^n(d;W_l,b_l)-IRM^n(d)\big)^2\Big]$$

In the above formula, E is the mean square error of the progressive dual-output neural network model's learning; $\hat{x}^n_k(d)$ and $x^n_k(d)$ correspondingly denote the enhanced log power spectrum feature of the k-th (k = 1, ..., K) progressive learning target at the n-th frame and d-th frequency dimension (i.e., the output of the model) and the target log power spectrum feature; $\widehat{IRM}^n(d)$ and $IRM^n(d)$ correspondingly denote the estimated ideal soft mask and the target ideal soft mask; N represents the size of the mini-batch, i.e., the number of samples; D is the total dimension of the log power spectrum feature vector; and $(W_l,b_l)$ denote the weight and bias parameters to be learned at the l-th layer.
As will be appreciated by those skilled in the art, the specific values of the progressive learning targets can be set by the user according to the actual situation.
The number of hidden layers is denoted by L, and L + 1 then denotes the output layer. In addition, it should be noted that the input features of the progressive dual-output neural network model are all Gaussian-normalized: the mean over the whole training data is normalized to 0 and the variance to 1. Both the noisy and the clean speech are normalized with the global mean and global variance of the noisy training data. One advantage of this processing is that the input data and output data of the model undergo the same transformation, making the neural network easier to learn. After the input and output data are prepared, a learning rate λ can be used to begin updating the weight and bias parameters of the network.
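The following PyTorch sketch illustrates one plausible reading of the progressive dual-output model and its multi-target MMSE objective: each stacked block learns an LPS target at a progressively higher SNR, and the top of the stack additionally emits an IRM estimate. The layer sizes, the number of targets, and the sigmoid on the mask head are assumptions for illustration, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveDualOutputNet(nn.Module):
    def __init__(self, in_dim=7 * 257, hid=1024, out_dim=257, n_targets=2):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.lps_heads = nn.ModuleList()   # one LPS target per stage
        d = in_dim
        for _ in range(n_targets):         # e.g. a +5 dB then a +10 dB target
            self.blocks.append(nn.Sequential(nn.Linear(d, hid), nn.ReLU()))
            self.lps_heads.append(nn.Linear(hid, out_dim))
            d = hid
        self.irm_head = nn.Sequential(nn.Linear(hid, out_dim), nn.Sigmoid())

    def forward(self, x):
        lps_outs, h = [], x
        for block, head in zip(self.blocks, self.lps_heads):
            h = block(h)
            lps_outs.append(head(h))       # progressively denoised LPS
        return lps_outs, self.irm_head(h)  # dual output: LPS stages + IRM

def multi_target_loss(lps_outs, lps_targets, irm_est, irm_target):
    """Sum of MMSE terms over the K progressive LPS targets plus the mask."""
    loss = sum(F.mse_loss(o, t) for o, t in zip(lps_outs, lps_targets))
    return loss + F.mse_loss(irm_est, irm_target)
```

Training would then follow the mini-batch SGD scheme above, e.g. `torch.optim.SGD(model.parameters(), lr=lam)` with Gaussian-normalized inputs and targets.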
3. Neural network decoding and speech recovery
1) If the method is applied to human listening, the waveform is reconstructed from the enhanced acoustic features (log power spectrum features) to obtain a waveform suitable for subjective listening.
First, the complex spectrum is calculated:

$$\hat{X}'(d)=e^{\hat{X}(d)/2}\cdot e^{j\angle Y(d)}$$

In the above formula, $\hat{X}(d)$, defined in the real domain, is the enhanced log power spectrum feature (considering the whole time axis, t is omitted), and $\hat{X}'(d)$ is also the enhanced log power spectrum feature but defined in the complex domain; $\angle Y(d)$ refers to the phase information taken from the input speech, because the human ear is not sensitive to small changes in phase.
Then, inverse discrete Fourier transform reconstruction is carried out to obtain the enhanced time-domain speech $\hat{x}(l)$:

$$\hat{x}(l)=\frac{1}{L}\sum_{d=0}^{L-1}\hat{X}'(d)\,e^{j2\pi ld/L},\quad l=0,1,\ldots,L-1$$
Finally, the waveform of the whole sentence is synthesized by a classical overlap-add algorithm.
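A compact numpy sketch of this recovery step, matching the 512-point DFT and 256-sample shift used earlier; np.fft.irfft supplies the symmetric second half of the spectrum automatically, and the simple overlap-add below omits window-power normalization for brevity (an assumption of this sketch):

```python
def reconstruct_waveform(enh_lps, noisy_phase, L=512, shift=256):
    """Amplitude exp(X_hat / 2) combined with the noisy phase, inverse
    DFT per frame, then overlap-add across frames."""
    mag = np.exp(enh_lps / 2.0)               # |X| recovered from log|X|^2
    spec = mag * np.exp(1j * noisy_phase)     # complex half-spectrum
    frames = np.fft.irfft(spec, n=L, axis=1)
    out = np.zeros(shift * (len(frames) - 1) + L)
    for t, f in enumerate(frames):
        out[t * shift : t * shift + L] += f   # overlap-add
    return out
```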
2) If the method is applied to a speech recognition system, the estimated ideal soft mask is applied to the acoustic features of the input speech (i.e., the log power spectrum features of the noisy speech) to obtain the masked acoustic features (i.e., the enhanced acoustic features), and the waveform is then reconstructed to obtain the enhanced speech.
The masked acoustic features are exactly the enhanced log power spectrum features, and the waveform reconstruction that yields the enhanced speech can follow the processing described above for the human-listening case.
Since the biggest problem in machine learning is the mismatch between training data and test data, the network's estimates are biased. In practical speech enhancement applications in particular, when the test speech differs greatly from the training speech, the enhanced log power spectrum features directly output by the network can damage the clean speech and thereby hurt recognition accuracy. The log power spectrum features enhanced through the ideal mask calculation reduce the damage to the clean speech while retaining more noise. A speech recognition system is strongly robust to noise but sensitive to speech damage, so the log power spectrum features enhanced through the ideal mask calculation are better suited to the recognition system.
According to the scheme of the embodiment of the invention, speech enhancement is performed with a progressive dual-output neural network model, which can output deeply denoised speech to meet the noise-reduction requirement of the human ear while also outputting partially denoised speech at a certain signal-to-noise ratio to match the data-driven back-end recognition model. In manual listening tests and objective measurements, the deeply denoised speech shows clear improvements in subjective audibility and in all indices; combined with the speech recognition model, the speech partially denoised by the neural network effectively improves recognition accuracy compared with recognition without noise reduction.
Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A method of speech enhancement, comprising:
extracting the acoustic features of each speech frame;
training a progressive dual-output neural network model with samples of clean speech and noisy speech, estimating the ideal soft mask of each speech frame with the trained model, and performing enhancement processing on the acoustic features;
if the method is applied to human listening, reconstructing the waveform from the enhanced acoustic features to obtain a waveform suitable for subjective listening; if the method is applied to a speech recognition system, applying the estimated ideal soft mask to the acoustic features of the input speech to obtain the masked acoustic features, and then reconstructing the waveform to obtain the enhanced speech.
2. The method of claim 1, wherein the extracting the acoustic features of each speech frame comprises:
performing framing processing on the input speech signal to obtain a sequence of speech frames;
the acoustic features adopt log power spectrum features, and when the log power spectrum features of each speech frame are extracted, the frequency-domain signal is obtained through Fourier transform and taking the modulus:

$$Y'(d)=\Big|\sum_{l=0}^{L-1}y(l)\,h(l)\,e^{-j2\pi ld/L}\Big|$$

in the above formula, d is the frequency dimension, h(l) is the window function, and L is the number of points of the discrete Fourier transform;
the log power spectrum feature is defined as:

$$Y(d)=\log|Y'(d)|^2,\quad d=0,1,\ldots,D-1$$

in the above formula, D = L/2 + 1.
3. A speech enhancement method according to claim 2, characterized in that the method further comprises: splicing consecutive frames before the extracted acoustic features are used as the input of the progressive dual-output neural network model, wherein the data obtained after splicing a certain number of frames is used as one sample, and the label of the central frame is used as the label of the sample it belongs to.
4. The speech enhancement method according to claim 1, wherein the progressive dual-output neural network model learns the final target with a progressively increasing signal-to-noise ratio, and the finally trained model can predict the ideal soft mask at each time-frequency point and can also perform enhancement processing on the acoustic features, that is, predict the log power spectrum features of clean speech.
5. A speech enhancement method according to claim 1 or 4, characterized in that the formula for predicting the clean log power spectrum features is:

$$\hat{X}(t,d)=\log\big(\widehat{IRM}(t,d)\cdot Y^2(t,d)\big)$$

wherein $\hat{X}(t,d)$ represents the predicted log power spectrum feature of the clean speech, $\widehat{IRM}(t,d)$ represents the ideal soft mask, $\log(Y^2(t,d))=Y(d)$, Y(d) is the extracted log power spectrum feature, d is the frequency dimension, and t is time.
6. The speech enhancement method according to claim 1, wherein the convergence rate of the progressive dual-output neural network model's learning is improved by a stochastic gradient descent algorithm in mini-batch mode, expressed as:

$$E=\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\sum_{d=1}^{D}\Big[\big(\hat{x}^n_k(d;W_l,b_l)-x^n_k(d)\big)^2+\big(\widehat{IRM}^n(d;W_l,b_l)-IRM^n(d)\big)^2\Big]$$

in the above formula, E is the mean square error of the progressive dual-output neural network model's learning; $\hat{x}^n_k(d)$ and $x^n_k(d)$ correspondingly denote the enhanced log power spectrum feature of the k-th (k = 1, ..., K) progressive learning target at the n-th frame and d-th frequency dimension and the target log power spectrum feature; $\widehat{IRM}^n(d)$ and $IRM^n(d)$ correspondingly denote the estimated ideal soft mask and the target ideal soft mask; N represents the size of the mini-batch, i.e., the number of samples; D is the total dimension of the log power spectrum feature vector; $(W_l,b_l)$ denote the weight and bias parameters to be learned at the l-th layer.
7. The speech enhancement method of claim 1, wherein reconstructing the waveform from the enhanced acoustic features, if applied to human listening, to obtain a waveform suitable for subjective listening comprises:

first, calculating

$$\hat{X}'(d)=e^{\hat{X}(d)/2}\cdot e^{j\angle Y(d)}$$

in the above formula, $\hat{X}(d)$, defined in the real domain, represents the enhanced log power spectrum feature, and $\hat{X}'(d)$ is also the enhanced log power spectrum feature but defined in the complex domain; $\angle Y(d)$ refers to the phase information obtained from the input speech;

then performing inverse discrete Fourier transform reconstruction to obtain the enhanced time-domain speech $\hat{x}(l)$:

$$\hat{x}(l)=\frac{1}{L}\sum_{d=0}^{L-1}\hat{X}'(d)\,e^{j2\pi ld/L}$$

wherein L is the number of discrete Fourier transform points used when extracting the acoustic features of each speech frame;

finally, the waveform of the whole sentence is synthesized by an overlap-add algorithm.
CN201810827229.0A 2018-07-25 2018-07-25 Speech enhancement method Active CN110767244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810827229.0A CN110767244B (en) 2018-07-25 2018-07-25 Speech enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810827229.0A CN110767244B (en) 2018-07-25 2018-07-25 Speech enhancement method

Publications (2)

Publication Number Publication Date
CN110767244A true CN110767244A (en) 2020-02-07
CN110767244B CN110767244B (en) 2024-03-29

Family

ID=69328031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810827229.0A Active CN110767244B (en) 2018-07-25 2018-07-25 Speech enhancement method

Country Status (1)

Country Link
CN (1) CN110767244B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN111775151A (en) * 2020-06-28 2020-10-16 河南工业职业技术学院 Intelligent control system of robot
CN112289337A (en) * 2020-11-03 2021-01-29 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement
CN113160839A (en) * 2021-04-16 2021-07-23 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN113436640A (en) * 2021-06-28 2021-09-24 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113823291A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to power operation
CN114999519A (en) * 2022-07-18 2022-09-02 中邮消费金融有限公司 Voice real-time noise reduction method and system based on double transformation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101512573A (en) * 2006-08-28 2009-08-19 国际商业机器公司 Collaborative, event driven system management
CN103531204A (en) * 2013-10-11 2014-01-22 深港产学研基地 Voice enhancing method
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101512573A (en) * 2006-08-28 2009-08-19 国际商业机器公司 Collaborative, event driven system management
CN103531204A (en) * 2013-10-11 2014-01-22 深港产学研基地 Voice enhancing method
US20160111107A1 (en) * 2014-10-21 2016-04-21 Mitsubishi Electric Research Laboratories, Inc. Method for Enhancing Noisy Speech using Features from an Automatic Speech Recognition System
CN107077860A (en) * 2014-10-21 2017-08-18 三菱电机株式会社 Method for will there is audio signal of making an uproar to be converted to enhancing audio signal
US20170061978A1 (en) * 2014-11-07 2017-03-02 Shannon Campbell Real-time method for implementing deep neural network based speech separation
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
CN107845389A (en) * 2017-12-21 2018-03-27 北京工业大学 A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEN Shixue; SUN Lei; DU Jun: "Application of progressive learning speech enhancement method in speech recognition" (渐进学习语音增强方法在语音识别中的应用), Journal of Chinese Computer Systems (小型微型计算机系统), no. 01, 15 January 2018 (2018-01-15) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047502A (en) * 2019-04-18 2019-07-23 广州九四智能科技有限公司 The recognition methods of hierarchical voice de-noising and system under noise circumstance
CN113763976A (en) * 2020-06-05 2021-12-07 北京有竹居网络技术有限公司 Method and device for reducing noise of audio signal, readable medium and electronic equipment
CN113763976B (en) * 2020-06-05 2023-12-22 北京有竹居网络技术有限公司 Noise reduction method and device for audio signal, readable medium and electronic equipment
CN111775151A (en) * 2020-06-28 2020-10-16 河南工业职业技术学院 Intelligent control system of robot
CN112289337A (en) * 2020-11-03 2021-01-29 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement
CN112289337B (en) * 2020-11-03 2023-09-01 北京声加科技有限公司 Method and device for filtering residual noise after machine learning voice enhancement
CN113160839A (en) * 2021-04-16 2021-07-23 电子科技大学 Single-channel speech enhancement method based on adaptive attention mechanism and progressive learning
CN113436640A (en) * 2021-06-28 2021-09-24 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113436640B (en) * 2021-06-28 2022-11-25 歌尔科技有限公司 Audio noise reduction method, device and system and computer readable storage medium
CN113611318A (en) * 2021-06-29 2021-11-05 华为技术有限公司 Audio data enhancement method and related equipment
CN113823291A (en) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 Voiceprint recognition method and system applied to power operation
CN114999519A (en) * 2022-07-18 2022-09-02 中邮消费金融有限公司 Voice real-time noise reduction method and system based on double transformation

Also Published As

Publication number Publication date
CN110767244B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN110767244B (en) Speech enhancement method
Hendriks et al. DFT-domain based single-microphone noise reduction for speech enhancement
Xu et al. An experimental study on speech enhancement based on deep neural networks
Narayanan et al. Improving robustness of deep neural network acoustic models via speech separation and joint adaptive training
Ghanbari et al. A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets
JP5666444B2 (en) Apparatus and method for processing an audio signal for speech enhancement using feature extraction
Azarang et al. A review of multi-objective deep learning speech denoising methods
Hansen et al. Speech enhancement based on generalized minimum mean square error estimators and masking properties of the auditory system
Tu et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition
Lee et al. A joint learning algorithm for complex-valued tf masks in deep learning-based single-channel speech enhancement systems
Kim et al. End-to-end multi-task denoising for joint SDR and PESQ optimization
Sadjadi et al. Blind spectral weighting for robust speaker identification under reverberation mismatch
Saleem et al. Supervised speech enhancement based on deep neural network
Swami et al. Speech enhancement by noise driven adaptation of perceptual scales and thresholds of continuous wavelet transform coefficients
Coto-Jimenez et al. Hybrid speech enhancement with wiener filters and deep lstm denoising autoencoders
Naik et al. A literature survey on single channel speech enhancement techniques
Gupta et al. Speech enhancement using MMSE estimation and spectral subtraction methods
Liu et al. Using Shifted Real Spectrum Mask as Training Target for Supervised Speech Separation.
Shome et al. Reference free speech quality estimation for diverse data condition
Roy et al. Deep residual network-based augmented Kalman filter for speech enhancement
Sadasivan et al. Speech enhancement using a risk estimation approach
Hepsiba et al. Computational intelligence for speech enhancement using deep neural network
Pop et al. Speech enhancement for forensic purposes
CN111383652B (en) Single-channel voice enhancement method based on double-layer dictionary learning
Liu et al. Multiresolution cochleagram speech enhancement algorithm using improved deep neural networks with skip connections

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant