CN113299308A - Voice enhancement method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113299308A
Authority
CN
China
Prior art keywords
bandwidth
gain
signal
noise
speech
Prior art date
Legal status: Pending
Application number
CN202010987302.8A
Other languages
Chinese (zh)
Inventor
宋琦
洪传荣
陈思宇
唐磊
王立波
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN202010987302.8A priority Critical patent/CN113299308A/en
Publication of CN113299308A publication Critical patent/CN113299308A/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L21/0232: Processing in the frequency domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The embodiments of the disclosure relate to a speech enhancement method and apparatus, an electronic device, and a storage medium. The speech enhancement method includes: acquiring a noisy speech signal; dividing the full frequency band of the noisy speech signal into a first bandwidth and a second bandwidth; performing noise reduction on the noisy speech signal corresponding to the first bandwidth to obtain a first gain corresponding to the first bandwidth; predicting a second gain corresponding to the second bandwidth based on the first gain; and determining, based on the first gain and the second gain, an enhanced speech signal of the noisy speech signal over the full frequency band. The embodiments divide the full frequency band of the noisy speech signal into a low band (the first bandwidth) and a mid-high band (the second bandwidth), perform noise reduction only on the low band, and perform gain prediction for the mid-high band.

Description

Voice enhancement method and device, electronic equipment and storage medium
Technical Field
The embodiments of the disclosure relate to the technical field of speech processing, and in particular to a speech enhancement method and apparatus, an electronic device, and a non-transitory computer-readable storage medium.
Background
With the growth of the e-commerce live-streaming industry and the popularization of mobile devices, live-streaming scenarios have diversified beyond traditional studios: hosts can now broadcast from open, noisy environments such as outdoor venues, shopping malls, and markets.
Picture and sound are the two major factors affecting the live-viewing experience, and the diversification of live scenarios poses new challenges for processing the host's real-time audio. For example, when the scene's background sounds mix with the host's voice, the host becomes hard to hear clearly.
In the prior art, speech enhancement schemes are used to extract the host's voice from the sound mixture. However, current schemes suppress non-stationary burst noise poorly, leaving residual noise in the enhanced speech that degrades subjective listening quality and can even impair the intelligibility of the transmitted speech. In addition, current schemes rely on many assumptions when deriving an analytic solution and therefore adapt poorly to complex, changing real-world scenes. The above description of how these problems were discovered is provided only to aid understanding of the technical solutions of the present disclosure and is not an admission that it constitutes prior art.
Disclosure of Invention
To solve at least one problem of the prior art, at least one embodiment of the present disclosure provides a speech enhancement method and apparatus, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a speech enhancement method, where the method includes:
acquiring a noisy speech signal;
dividing the full frequency band of the noisy speech signal into a first bandwidth and a second bandwidth;
performing noise reduction on the noisy speech signal corresponding to the first bandwidth to obtain a first gain corresponding to the first bandwidth;
predicting a second gain corresponding to the second bandwidth based on the first gain;
determining an enhanced speech signal of the noisy speech signal over the full frequency band based on the first gain and the second gain.
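The five claimed steps can be sketched end to end in code. This is only an illustrative sketch: the 16 kHz cutoff, the spectral-floor "denoiser", and the mean-based gain prediction are stand-ins for the learned components described later, not the patented algorithm.

```python
import numpy as np

def enhance(noisy, sr=48000, cutoff=16000):
    """Illustrative sketch of the five claimed steps (not the patented algorithm)."""
    # Steps 1-2: acquire the noisy signal and split its full band into a
    # low band (first bandwidth) and a mid-high band (second bandwidth).
    spec = np.fft.rfft(noisy)
    freqs = np.fft.rfftfreq(len(noisy), d=1.0 / sr)
    low = freqs <= cutoff
    # Step 3: "denoise" the low band to get the first gain; a toy spectral
    # floor stands in here for the learned noise-reduction model.
    mag = np.abs(spec)
    noise_floor = np.median(mag[low])
    first_gain = np.clip(1.0 - noise_floor / np.maximum(mag[low], 1e-12), 0.1, 1.0)
    # Step 4: predict the second (mid-high band) gain from the first gain;
    # the mean is a placeholder for the patent's prediction step.
    second_gain = np.full(np.count_nonzero(~low), first_gain.mean())
    # Step 5: apply both gains and synthesize the full-band enhanced signal.
    gain = np.concatenate([first_gain, second_gain])
    return np.fft.irfft(spec * gain, n=len(noisy))

x = np.random.randn(480)   # one 10 ms frame at a 48 kHz sampling rate
y = enhance(x)
```

Because every per-bin gain lies in [0.1, 1.0], the output frame never carries more energy than the input frame.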
In a second aspect, an embodiment of the present disclosure further provides a speech enhancement apparatus, where the apparatus includes:
an acquiring unit, configured to acquire a noisy speech signal and divide the full frequency band of the noisy speech signal into a first bandwidth and a second bandwidth;
a noise reduction unit, configured to perform noise reduction on the noisy speech signal corresponding to the first bandwidth to obtain a first gain corresponding to the first bandwidth;
a prediction unit, configured to predict a second gain corresponding to the second bandwidth based on the first gain;
a determining unit, configured to determine, based on the first gain and the second gain, an enhanced speech signal of the noisy speech signal over the full frequency band.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor and a memory; the processor is adapted to perform the steps of the speech enhancement method according to the first aspect by calling a program or instructions stored in the memory.
In a fourth aspect, the disclosed embodiments also propose a non-transitory computer-readable storage medium storing a program or instructions for causing a computer to perform the steps of the speech enhancement method according to the first aspect.
It can be seen that, in at least one embodiment of the present disclosure, the full frequency band of the noisy speech signal is divided into a low band (the first bandwidth) and a mid-high band (the second bandwidth); noise reduction is performed only on the low band, while gain prediction is performed for the mid-high band, so that full-band speech enhancement is achieved without denoising the full band. Compared with existing schemes that denoise the full band, the disclosure reduces the amount of data to be processed and the processing complexity, improves processing efficiency and speed, is suitable for deployment on mobile devices, and enables streaming full-band speech enhancement.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present disclosure, and those skilled in the art can derive other drawings from them.
FIG. 1 is a diagram of an exemplary application scenario for speech enhancement provided by an embodiment of the present disclosure;
FIG. 2 is an exemplary block diagram of a speech enhancement apparatus provided by an embodiment of the present disclosure;
fig. 3 is an exemplary block diagram of an electronic device provided by an embodiment of the present disclosure;
FIG. 4 is an exemplary flow chart of a method for speech enhancement provided by an embodiment of the present disclosure;
FIG. 5 is an exemplary flow chart of another speech enhancement method provided by embodiments of the present disclosure;
FIG. 6 is an exemplary flow chart for determining a first gain provided by embodiments of the present disclosure;
FIG. 7 is an exemplary architecture diagram of a time convolutional network provided by an embodiment of the present disclosure;
FIG. 8 is an exemplary flowchart of determining an enhanced speech signal of a first bandwidth according to an embodiment of the present disclosure;
FIG. 9 is an exemplary waveform diagram and corresponding spectrogram for noisy speech;
fig. 10 is a waveform diagram and a corresponding spectrogram of the noisy speech shown in fig. 9 after the OMLSA processing;
fig. 11 is a waveform diagram and a corresponding spectrogram obtained after the noisy speech shown in fig. 9 is processed by the speech enhancement method provided by the embodiment of the present disclosure.
Detailed Description
In order that the above objects, features, and advantages of the present disclosure may be more clearly understood, the disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present disclosure; the specific embodiments described herein are merely illustrative and are not intended to be limiting. All other embodiments derived by one of ordinary skill in the art from the described embodiments fall within the scope of the disclosure.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
To facilitate understanding of aspects of the embodiments of the present disclosure, terms related to the embodiments of the present disclosure are explained as follows:
Speech enhancement: in real environments, speech may be disturbed by background noise, such as noise from trains, cars, factories, and streets. Such interference can seriously degrade the performance of a speech processing system, so enhancing the noise-corrupted speech in advance improves the system's robustness. The main purpose of speech enhancement is to suppress noise and improve the overall perceptual quality and intelligibility of noisy speech.
Full band: the bandwidth of a speech signal determines the richness of its frequency components; the more frequency components, the higher the sound quality of the speech signal and the closer it is to the original analog sound. The definition of "full band" varies slightly across scenarios; for real-time interactive speech, the full band is generally taken to mean a 48 kHz sampling rate.
Streaming processing: refers to a speech processing system that can output the processed data stream with a fixed or low delay.
TCN (Temporal Convolutional Network) structure: an MLP (Multi-Layer Perceptron) model lacks the ability to capture long-term information, while network structures such as RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) can capture long-term temporal dependencies but suffer from drawbacks such as high latency and high training complexity. To address these drawbacks, the TCN structure has emerged in various sequence modeling tasks as a replacement for RNN-style structures. A TCN extends a traditional CNN (Convolutional Neural Network) with causal convolution and dilated convolution to obtain an exponentially growing receptive field, and incorporates dense connections (DenseNet) and residual connections (ResNet) so that the network can be made deep and capture effective long-term history information.
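To make "causal" and "dilated" concrete (the function name, kernel, and sizes below are illustrative, not from the patent): a causal dilated 1-D convolution with dilation d computes each output from the current sample and past samples spaced d apart, so stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially, to 1 + (k - 1)(2^L - 1) for L layers of kernel size k.

```python
import numpy as np

def causal_dilated_conv1d(x, kernel, dilation=1):
    """Causal dilated 1-D convolution: output[t] uses only x[t], x[t-d], x[t-2d], ..."""
    k = len(kernel)
    pad = (k - 1) * dilation              # left-pad so no future samples are used
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(kernel[i] * xp[t + pad - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.arange(8, dtype=float)
# kernel [1, 1] with dilation 2 computes output[t] = x[t] + x[t-2]
y = causal_dilated_conv1d(x, kernel=np.array([1.0, 1.0]), dilation=2)
```

Here the early outputs see zeros in place of "future" or pre-start samples, which is exactly the property that lets such a network run in a streaming, fixed-delay fashion.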
Single channel speech enhancement
OMLSA (Optimally Modified Log-Spectral Amplitude estimation) is a single-channel speech enhancement scheme. OMLSA relies on several assumptions and has difficulty adapting to complex, changing real-world scenes.
1. Assuming that the noise is additive noise
The relationship between noise and speech is complex; two relationships are generally considered to exist, additive and convolutive. Speech enhancement mainly targets the suppression of additive noise, in which case the noisy speech time-domain signal y(t) can be regarded as the sum of the clean speech time-domain signal x(t) and the noise time-domain signal n(t). Applying the Short-Time Fourier Transform (STFT) to both sides yields the frequency-domain form, which also satisfies the additive relation under the additive-noise assumption.
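The additive relation survives the transform because the Fourier transform is linear. A quick numerical check (illustrative, using a plain FFT of one frame in place of a full STFT):

```python
import numpy as np

t = np.arange(256) / 16000.0
x = np.sin(2 * np.pi * 440 * t)        # "clean speech" x(t)
n = 0.1 * np.random.randn(256)         # additive noise n(t)
y = x + n                              # noisy speech y(t) = x(t) + n(t)

# Linearity of the (short-time) Fourier transform: Y(k) = X(k) + N(k)
Y, X, N = np.fft.rfft(y), np.fft.rfft(x), np.fft.rfft(n)
assert np.allclose(Y, X + N)
```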
2. Assuming that the speech and noise are independent of each other and that the distribution of noisy speech, clean speech and noise signals satisfies a Gaussian distribution
Meanwhile, assuming that speech and noise are mutually independent, taking the expectation of both sides, and further assuming that the noisy speech, clean speech, and noise signals all follow Gaussian distributions, the variance relationship between speech and noise can be obtained.
3. Binary hypothesis model
VAD (Voice Activity Detection) classifies each frame into one of two states (speech frame or non-speech frame), yielding a binary hypothesis model from which the noise variance can be iteratively updated. Generally, the noise variance is updated only in the H0 (non-speech) state of the hypothesis test, to keep the speech as undistorted as possible.
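A minimal sketch of this binary-hypothesis noise update (the VAD decision and smoothing constant here are illustrative): the noise variance estimate is recursively smoothed only on frames labeled non-speech (H0) and frozen on speech frames (H1).

```python
def update_noise_psd(noise_psd, frame_psd, is_speech, alpha=0.9):
    """Recursive noise-variance update under the binary hypothesis model.

    H0 (no speech): smooth the estimate toward the observed frame PSD.
    H1 (speech):    keep the previous estimate, avoiding speech distortion.
    """
    if is_speech:                                        # H1: freeze
        return noise_psd
    return alpha * noise_psd + (1 - alpha) * frame_psd   # H0: update

psd = 1.0
psd = update_noise_psd(psd, 3.0, is_speech=False)   # H0 frame: estimate moves
psd = update_noise_psd(psd, 50.0, is_speech=True)   # H1 frame: estimate frozen
```

Freezing under H1 is what prevents loud speech frames from inflating the noise estimate and eroding the speech.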
4. It is assumed that the speech amplitude of the current frame is only correlated with the noisy speech signal of the current frame and is uncorrelated with other frames
The entire derivation is cast in a probabilistic framework: first the probability density function of the noisy speech is obtained, and then, conditioned on speech being present, an estimate of the speech magnitude spectrum is derived.
Because estimating the speech magnitude spectrum directly ignores the human ear's compressive response to sound, a log-spectral estimate is used instead. However, both the magnitude-spectrum and log-spectrum estimates ignore the estimation of clean speech in non-speech frames; at low signal-to-noise ratios in particular, the VAD decision is error-prone and can damage the speech. Therefore, a speech presence probability is further introduced to reduce speech damage as much as possible, together with a minimum gain, which limits speech distortion in non-speech frames.
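The minimum-gain idea fits in one line (the -25 dB floor below is a common but illustrative choice, not a value taken from the patent): the applied spectral gain never drops below a fixed floor, so non-speech frames keep a faint, natural-sounding residual instead of being gated to distorted silence.

```python
import numpy as np

G_MIN = 10 ** (-25 / 20)   # -25 dB gain floor (illustrative value)

def floored_gain(gain):
    """Clamp a spectral gain from below to limit speech distortion."""
    return np.maximum(gain, G_MIN)

g = floored_gain(np.array([0.0, 0.01, 0.5, 1.0]))
```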
As can be seen, OMLSA has the following problems:
(1) Non-stationary burst noise cannot be effectively suppressed; the noise Power Spectral Density (PSD) estimate has a large tracking delay and requires bias compensation.
(2) Residual noise exists, which affects subjective auditory perception.
(3) It must assume distribution models for the speech and noise signals, and these assumptions may not hold in actual scenes.
Therefore, to address at least one of the problems of OMLSA, the embodiments of the present disclosure provide a speech enhancement method, apparatus, electronic device, and non-transitory computer-readable storage medium. Full-band speech is divided into low-frequency speech and mid-high-frequency speech. The time-frequency gain of the low-frequency speech is predicted by denoising it, yielding a frequency-domain magnitude spectrum that is combined with the low-frequency phase spectrum to synthesize the enhanced low-frequency time-domain data. The time-domain gain of the mid-high-frequency speech is then predicted from the low-frequency time-frequency gain and applied to the mid-high-frequency speech. Finally, the enhanced low-frequency and mid-high-frequency time-domain data are synthesized into the enhanced full-band speech signal, realizing full-band speech enhancement. The embodiments make no distributional assumptions about the speech and noise signals to be processed and can therefore adapt to complex, changing real-world scenes.
Fig. 1 is a diagram of an exemplary application scenario of speech enhancement according to an embodiment of the present disclosure. As shown in fig. 1, a noisy speech signal is input to a speech enhancement device 10, the noisy speech signal is noise-reduced by the speech enhancement device 10, and an enhanced speech signal of the noisy speech signal in the full frequency band is output.
In fig. 1, the speech enhancement device 10 includes, but is not limited to: a low-frequency speech noise reduction unit 11 and a mid-high frequency speech prediction unit 12.
The low-frequency speech denoising unit 11 is configured to extract low-frequency speech from the noisy speech signal, denoise it to obtain the time-frequency gain of the low-frequency speech, derive the frequency-domain magnitude spectrum of the low-frequency speech from that gain, and synthesize the enhanced low-frequency time-domain data in combination with the low-frequency phase spectrum.
The mid-high-frequency speech prediction unit 12 is configured to predict the time-domain gain of the mid-high-frequency speech from the time-frequency gain produced by the low-frequency speech denoising unit 11 and to apply that gain to the mid-high-frequency speech, for example by multiplying the time-domain gain with the mid-high-frequency speech to obtain the enhanced mid-high-frequency time-domain data.
In some embodiments, the mid-high-frequency speech prediction unit 12 may estimate the a priori and a posteriori signal-to-noise ratios of the high-frequency components from the time-frequency gain of the low-frequency speech, derive a speech presence probability from those ratios, and obtain the time-domain gain of the mid-high-frequency speech from the smoothed a posteriori signal-to-noise ratio and the speech presence probability.
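The quantities named above fit together roughly as follows; this is a simplified sketch with illustrative formulas, and the actual mapping from the low-band gain to the high-band SNRs is the patent's prediction step and is not reproduced here. The a posteriori SNR is observed power over noise power, a maximum-likelihood a priori SNR can be derived from it, and a presence probability then blends a Wiener-style gain with a minimum gain.

```python
import numpy as np

def band_gain(noisy_power, noise_power, g_min=0.1):
    """Toy SNR -> presence probability -> gain chain (illustrative formulas)."""
    post_snr = noisy_power / np.maximum(noise_power, 1e-12)   # a posteriori SNR
    prio_snr = np.maximum(post_snr - 1.0, 0.0)                # ML a priori SNR estimate
    p_speech = prio_snr / (1.0 + prio_snr)                    # speech presence probability
    wiener = prio_snr / (1.0 + prio_snr)                      # Wiener-style gain
    # OMLSA-style blend: gain = G_H1^p * G_min^(1-p), floored at g_min
    return np.maximum(wiener ** p_speech * g_min ** (1 - p_speech), g_min)

g = band_gain(np.array([0.5, 2.0, 20.0]), np.array([1.0, 1.0, 1.0]))
```

With a flat unit noise floor, the gain rises monotonically from the floor toward 1 as the observed power grows.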
In some embodiments, the speech enhancement device 10 may further comprise an enhanced-speech-signal output unit, configured to synthesize the enhanced low-frequency time-domain data produced by the low-frequency speech denoising unit 11 and the enhanced mid-high-frequency time-domain data produced by the mid-high-frequency speech prediction unit 12 into an enhanced speech signal over the full band, thereby enhancing full-band speech.
In some embodiments, the division of each unit in the speech enhancement apparatus 10 is only one logical functional division, and there may be another division manner in actual implementation, for example, the low frequency speech noise reduction unit 11 or the middle and high frequency speech prediction unit 12 may be divided into a plurality of sub-units. It will be understood that the various units or sub-units may be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application.
Fig. 2 is an exemplary block diagram of a speech enhancement apparatus 20 provided in the embodiment of the present disclosure. In some embodiments, speech enhancement device 20 may be implemented as speech enhancement device 10 in fig. 1 or as part of speech enhancement device 10 for enabling enhancement of full-band speech.
In FIG. 2, the speech enhancement device 20 may include, but is not limited to: an acquisition unit 21, a noise reduction unit 22, a prediction unit 23, and a determination unit 24.
Acquisition unit 21
The acquiring unit 21 is configured to acquire the noisy speech signal and divide its full frequency band into a first bandwidth and a second bandwidth, which together constitute the full band of the noisy speech signal. The upper limit (maximum frequency) of the first bandwidth is less than or equal to the lower limit (minimum frequency) of the second bandwidth. In some embodiments, the first bandwidth is the low band of the noisy speech signal, e.g., 0 to 16 kHz, and the second bandwidth is the mid-high band, e.g., 16 kHz to 48 kHz.
In some embodiments, the obtaining unit 21 may determine the first bandwidth based on the speech processing capability information of the mobile device, and then divide the full frequency band of the noisy speech signal into the first bandwidth and the second bandwidth accordingly.
In some embodiments, the speech processing capability information of the mobile device may be determined in various ways; for example, the device's processing speed may be tested with speech test data, and, provided streaming processing is still satisfied, the maximum frequency the device can process is taken as its speech processing capability information.
For example: if, while satisfying streaming processing, the maximum frequency the mobile device can process is 24 kHz, the first bandwidth can be determined to be 0 to 24 kHz; accordingly, the second bandwidth is 24 kHz to 48 kHz.
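The capability-driven split described above amounts to choosing a cutoff no higher than what the device can process in real time. A trivial sketch (the capability value is the example's 24 kHz; the function name is illustrative):

```python
def split_bands(full_band_hz=48000, device_max_hz=24000):
    """Pick the first/second bandwidth from the device's real-time capability."""
    cutoff = min(device_max_hz, full_band_hz)
    first = (0, cutoff)                  # low band: denoised directly
    second = (cutoff, full_band_hz)      # mid-high band: gain is predicted
    return first, second

first, second = split_bands()
```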
In some embodiments, the obtaining unit 21 may determine the magnitude and phase spectra of the first bandwidth and of the second bandwidth. In some embodiments, the noisy speech signal obtained by the obtaining unit 21 is a full-band signal, i.e., 0 to 48 kHz, so the magnitude and phase spectra of the full band can be obtained; after the full band is divided into the first bandwidth and the second bandwidth, their respective magnitude and phase spectra can be determined from the full-band spectra. Obtaining magnitude and phase spectra is a mature technique in the field and is not described in detail here.
In some embodiments, the obtaining unit 21 may perform low-frequency sampling and a time-frequency transform on the acquired noisy speech signal to obtain the magnitude and phase spectra of the first bandwidth. For example, the obtaining unit 21 samples one frame of the speech signal at 16 kHz; since one frame lasts 10 ms, sampling yields 160 sample points. The obtaining unit 21 then applies a time-frequency transform, such as a 512-point FFT (Fast Fourier Transform), to the 160 sample points to obtain the magnitude and phase spectra of the first bandwidth. Given the symmetry of the 512-point FFT, the input to the noise reduction unit 22 can be represented by 512 ÷ 2 + 1 = 257 points.
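The frame arithmetic above checks out directly: a 10 ms frame at a 16 kHz sampling rate holds 16000 × 0.01 = 160 samples, and a 512-point real FFT yields 512 / 2 + 1 = 257 non-redundant bins.

```python
import numpy as np

sr, frame_ms, n_fft = 16000, 10, 512
frame = np.zeros(int(sr * frame_ms / 1000))     # 160 samples per frame
spectrum = np.fft.rfft(frame, n=n_fft)          # zero-padded 512-point real FFT

magnitude, phase = np.abs(spectrum), np.angle(spectrum)   # 257 bins feed the denoiser
```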
Noise reduction unit 22
The noise reduction unit 22 is configured to perform noise reduction on the noisy speech signal corresponding to the first bandwidth to obtain a first gain corresponding to the first bandwidth. In some embodiments, the noise reduction unit 22 denoises the magnitude spectrum of the first bandwidth to obtain the first gain. The first gain is the gain of the magnitude spectrum of the low-frequency noisy speech signal and is a time-frequency-domain gain (time-frequency gain for short). For example, the noise reduction unit 22 may denoise the 257-point data output by the obtaining unit 21.
In some embodiments, the noise reduction unit 22 may determine a first signal energy ratio corresponding to the first bandwidth based on the magnitude spectrum of the first bandwidth. The first signal energy ratio is, for example, any one of, but not limited to: a speech-to-noise ratio, a speech-to-noise power ratio, a speech-to-noise amplitude ratio, a noise-to-speech power ratio, a noise-to-speech amplitude ratio, and the like, where the speech may be clean speech (i.e., speech without noise) or noisy speech.
In some embodiments, the noise reduction unit 22 may determine a spectrogram of a first bandwidth based on the magnitude spectrum of the first bandwidth, and further determine a first signal energy ratio corresponding to the first bandwidth based on the spectrogram of the first bandwidth.
In some embodiments, the denoising unit 22 may perform a first feature extraction on the magnitude spectrum (or spectrogram) of the first bandwidth, where the first feature extraction reduces the dimensionality of the magnitude spectrum, turning odd-dimensional data into even-dimensional data to facilitate the subsequent second feature extraction. The first feature extraction thus performs no substantive feature extraction; it can be understood as feature pre-extraction or shallow feature extraction, in preparation for the substantive extraction that follows. For example, the noise reduction unit 22 may reduce the 257-point data (which may be understood as 257-dimensional data) output by the obtaining unit 21 to 256-point (256-dimensional) data.
In some embodiments, after the first feature extraction, the denoising unit 22 may perform a second feature extraction on the resulting features (e.g., the even-dimensional data); the second feature extraction extracts features from time-sequential input. Compared with the first extraction, the second extraction is the substantive one and can also be understood as abstract feature extraction. After the second extraction, the denoising unit 22 may output the first signal energy ratio corresponding to the first bandwidth based on the extracted features.
In some embodiments, the noise reduction unit 22 may denoise the magnitude spectrum (or spectrogram) of the first bandwidth through a Temporal Convolutional Network (TCN) to obtain the a priori signal energy ratio corresponding to the first bandwidth. The TCN can perform the first feature extraction on the first-bandwidth magnitude spectrum (or spectrogram), realizing the dimensionality reduction; it can also perform the second feature extraction on the first-extraction features, realizing feature extraction over time-sequential input; and it can output the a priori signal energy ratio corresponding to the first bandwidth based on the second-extraction features.
In some embodiments, the TCN comprises a first fully connected layer, a plurality of serially connected dilated causal convolution layers, and a second fully connected layer. The first fully connected layer performs the first feature extraction on the first-bandwidth magnitude spectrum (or spectrogram), reducing its dimensionality, e.g., from 257 dimensions to 256. The serially connected dilated causal convolution layers perform the second feature extraction on the first-extraction features; the dilated causal convolutions enlarge the overall receptive field of the TCN. The second fully connected layer outputs the first signal energy ratio corresponding to the first bandwidth based on the second-extraction features, performing a dimensionality increase opposite to the first fully connected layer, e.g., from 256 dimensions back to 257.
It should be noted that, in the above embodiment, in view of the time-sequential nature of speech signals and resource-constrained application scenarios, a conventional RNN-type structure is not used to capture long-term context information; instead, a plurality of serially-connected dilated causal convolution layers perform the second feature extraction on the features obtained by the first extraction, enlarging the overall receptive field of the TCN network. In addition, when the dilated causal convolution layers infer at the current time step, no information from future time steps is used. Meanwhile, the normalization (norm) scheme is improved from the existing LayerNorm to FrameNorm. Specifically, the existing LayerNorm normalizes all neurons of each layer, that is, normalizes the inputs at each depth, an effect most pronounced for RNN-type structures; in this embodiment, FrameNorm depends only on the results of the previous and next frames, so the normalization range is reduced compared with the prior art and the computation is more efficient.
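The embodiment does not give an exact formula for FrameNorm; the following sketch assumes that FrameNorm normalizes each frame with that frame's own statistics, in contrast to LayerNorm statistics computed over all neurons of the layer output, to illustrate why the normalization range shrinks and why a frame's output is unaffected by distant future frames:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over all neurons of the layer output (every frame, every feature)
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def frame_norm(x, eps=1e-5):
    # Normalize each frame with that frame's own statistics, so a frame's
    # output never depends on frames far in the future (streaming-friendly)
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

T, F = 6, 256
x = np.random.default_rng(0).normal(size=(T, F))

y = frame_norm(x)
x2 = x.copy()
x2[-1] += 10.0                       # perturb only the last (future) frame
y2 = frame_norm(x2)
assert np.allclose(y[:-1], y2[:-1])  # earlier frames are unchanged
# ...whereas layer_norm mixes statistics across all frames
assert not np.allclose(layer_norm(x)[:-1], layer_norm(x2)[:-1])
```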
In some embodiments, the noise reduction unit 22 may perform gain processing on the first signal energy ratio to obtain a first gain corresponding to the first bandwidth. The gain processing may include, but is not limited to, smoothing and gain compensation: smoothing improves the perceived quality of the speech, while gain compensation preserves the fidelity of the speech.
In some embodiments, the noise reduction unit 22 may determine an initial gain based on the first signal energy ratio and then smooth the initial gain; the initial gain may be determined with existing mature techniques, which are not described again here. In some embodiments, the smoothing includes, but is not limited to, removing abnormal discontinuity points in the initial gain, yielding more continuous, natural-sounding enhanced speech.
In some embodiments, the noise reduction unit 22 may perform gain compensation on the smoothed result to obtain the first gain corresponding to the first bandwidth. In some embodiments, the gain compensation is compensation for speech: it includes, but is not limited to, analyzing the statistical distribution characteristics of the smoothed result, identifying speech, and compensating it accordingly, thereby obtaining the first gain corresponding to the first bandwidth. In some embodiments, the specific manner of gain compensation may likewise follow existing mature techniques, which are not described again here.
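A minimal sketch of the smoothing and compensation described above; the smoothing constant and gain floor are hypothetical stand-ins for the mature techniques referenced, not the embodiment's actual method:

```python
import numpy as np

def smooth_gain(gain, alpha=0.6, floor=0.05):
    """Recursive smoothing of per-frame gains, with a gain floor as a crude
    stand-in for speech gain compensation (alpha and floor are assumed)."""
    out = np.empty_like(gain)
    prev = gain[0]
    for t, g in enumerate(gain):
        prev = alpha * prev + (1.0 - alpha) * g   # suppress abrupt jumps
        out[t] = max(prev, floor)                 # never fully zero out speech
    return out

raw = np.array([0.9, 0.9, 0.0, 0.9, 0.9])         # abnormal dip at frame 2
sm = smooth_gain(raw)
assert sm[2] > 0.3                                # the dip is attenuated, not kept
assert np.all(sm >= 0.05)                         # floor limits speech distortion
```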
In some embodiments, the functional form of the first gain obtained by the noise reduction unit 22 is determined by the assumed statistical models of clean speech and noise and by the optimization criterion. Taking these together, an MMSE-STSA gain function is employed, based on the minimum mean square error (MMSE) criterion and Gaussian distribution assumptions for clean speech and noise.
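For reference, the MMSE-STSA gain function of Ephraim and Malah can be sketched as follows, where `xi` denotes the a priori SNR and `gamma` the a posteriori SNR (variable names chosen for illustration); at high SNR it approaches the Wiener gain xi / (1 + xi):

```python
import numpy as np

def bessel_i1(x):
    """Modified Bessel function I1 via its power series
    (NumPy provides i0 but not i1)."""
    term = x / 2.0
    total = term
    for k in range(1, 60):
        term *= (x / 2.0) ** 2 / (k * (k + 1))
        total += term
    return total

def mmse_stsa_gain(xi, gamma):
    """Ephraim-Malah MMSE-STSA gain for a priori SNR xi and a posteriori
    SNR gamma, under Gaussian assumptions for clean speech and noise."""
    v = xi * gamma / (1.0 + xi)
    return float(np.sqrt(np.pi) / 2.0 * np.sqrt(v) / gamma
                 * np.exp(-v / 2.0)
                 * ((1.0 + v) * np.i0(v / 2.0) + v * bessel_i1(v / 2.0)))

# At high SNR the gain approaches the Wiener gain xi / (1 + xi)
assert 0.85 < mmse_stsa_gain(10.0, 10.0) < 1.0
assert abs(mmse_stsa_gain(100.0, 100.0) - 100.0 / 101.0) < 0.02
```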
In some embodiments, noise reduction unit 22 may adjust the magnitude of the first gain. For example, after the gain processing of the first signal energy ratio yields the first gain corresponding to the first bandwidth, the first gain is multiplied by a noise reduction weight to adjust its magnitude, where the noise reduction weight may be predetermined based on user information.
For example, the noise reduction weight for a paying user is higher and that for a non-paying user is lower, giving different users different noise reduction effects. As another example, the noise reduction weight for a paying merchant (e.g., a live-streaming merchant) is higher and that for a non-paying merchant is lower, giving different live-streaming merchants different noise reduction effects.
Prediction unit 23
The prediction unit 23 is configured to predict a second gain corresponding to the second bandwidth based on the first gain corresponding to the first bandwidth obtained by the noise reduction unit 22. The second gain is the time-domain gain of the medium/high-frequency noisy speech signal, also called the time-domain noise reduction gain. In some embodiments, the first bandwidth is the low-frequency band of the noisy speech signal, e.g., 0 to 16 kHz, and the second bandwidth is the medium/high-frequency band, e.g., 16 kHz to 48 kHz.
In some embodiments, the prediction unit 23 may determine the speech presence probability corresponding to the second bandwidth based on the first gain corresponding to the first bandwidth. In some embodiments, the prediction unit 23 may estimate the a priori SNR and the a posteriori SNR of the high-frequency components based on the first gain corresponding to the first bandwidth or the first signal energy ratio corresponding to the first bandwidth, and then obtain the speech presence probability corresponding to the second bandwidth from these a priori and a posteriori SNRs.
In some embodiments, the prediction unit 23 may determine the second gain corresponding to the second bandwidth based on the first gain corresponding to the first bandwidth and the speech presence probability corresponding to the second bandwidth. In some embodiments, the prediction unit 23 obtains the second gain corresponding to the second bandwidth based on the smoothed a posteriori SNR of the high-frequency components and the speech presence probability corresponding to the second bandwidth.
Determination unit 24
The determining unit 24 is configured to determine an enhanced speech signal of the noisy speech signal over the full frequency band based on the first gain and the second gain. In some embodiments, the determining unit 24 obtains this full-band enhanced speech signal from the first gain corresponding to the first bandwidth, the second gain corresponding to the second bandwidth, the magnitude and phase spectra of the first bandwidth, and the magnitude and phase spectra of the second bandwidth.
In some embodiments, the determining unit 24 may obtain the enhanced speech signal of the first bandwidth based on the first gain corresponding to the first bandwidth and the magnitude and phase spectra of the first bandwidth. For example, the determining unit 24 may apply the first gain to the spectrogram of the first bandwidth, e.g., by multiplication, to obtain a gain-adjusted spectrogram, and then synthesize the enhanced speech signal of the first bandwidth from the gain-adjusted spectrogram and the phase spectrum of the first bandwidth.
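A minimal NumPy sketch of this gain application and synthesis for one low-band frame, using the 16 kHz / 512-point FFT figures of the embodiments; the uniform per-bin gain of 0.8 is hypothetical, since a real first gain varies per bin:

```python
import numpy as np

# One 10 ms low-band frame sampled at 16 kHz: 160 samples, analyzed with a
# 512-point FFT (zero-padded), giving 257 magnitude/phase bins.
fs, n_fft = 16000, 512
t = np.arange(160) / fs
frame = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(1).normal(size=160)
spec = np.fft.rfft(frame, n_fft)
mag, phase = np.abs(spec), np.angle(spec)

# Apply a (hypothetical, uniform) first gain to the magnitude spectrum...
gain = np.full(n_fft // 2 + 1, 0.8)
enhanced_spec = gain * mag * np.exp(1j * phase)

# ...then synthesize the enhanced low-band frame from magnitude and phase
enhanced = np.fft.irfft(enhanced_spec, n_fft)[:160]
assert np.allclose(enhanced, 0.8 * frame)   # a uniform gain simply scales the frame
```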
In some embodiments, the determining unit 24 may obtain the enhanced speech signal of the second bandwidth based on the second gain corresponding to the second bandwidth and the magnitude and phase spectra of the second bandwidth. For example, the determining unit 24 may apply the second gain to the time-domain signal of the second bandwidth (i.e., the medium/high-frequency noisy speech signal) to obtain the enhanced speech signal of the second bandwidth; the time-domain signal of the second bandwidth can be obtained from the magnitude and phase spectra of the second bandwidth.
In some embodiments, the determining unit 24 may obtain the enhanced speech signal of the noisy speech signal over the full frequency band based on the enhanced speech signal of the first bandwidth and the enhanced speech signal of the second bandwidth.
It can be seen that, in at least one embodiment of the present disclosure, by dividing the full band of the noisy speech signal into a low-frequency band (first bandwidth) and a medium/high-frequency band (second bandwidth), performing noise reduction processing only on the low-frequency band, and performing gain prediction for the medium/high-frequency band, full-band speech enhancement is achieved without noise reduction processing over the full band. Compared with existing schemes that perform noise reduction over the full band, the present disclosure reduces the amount of data and the complexity of processing, improving processing efficiency and speed; it is therefore well suited to deployment on mobile devices and realizes streaming full-band speech enhancement. The mobile device may be a mobile electronic device such as a smartphone, a tablet computer, or a smart speaker. In some embodiments, devices that play and/or capture audio in scenarios such as video conferences and telephone conferences may also employ aspects of the embodiments of the present disclosure.
It can be seen that, in at least one embodiment of the present disclosure, the TCN network is designed following a data-driven supervised learning approach, with no distributional assumptions imposed on the speech and noise signals to be processed. Meanwhile, a noise data set dominated by real-scene noise is constructed so that the TCN network "artificially" memorizes certain sudden noises, shifting the noise reduction strategy from relying on generalization capability alone toward learned suppression of such noises, so that the TCN network can cope with unseen sudden noises, thereby addressing problems (1) and (3) of the OMLSA algorithm.
It can also be seen that, in at least one embodiment of the present disclosure, a TCN network (using dilated causal convolutions) capable of capturing key long-term context information improves the characterization capability of the network, reducing training error and significantly reducing residual noise.
Therefore, at least one embodiment of the present disclosure provides a streaming full-band speech enhancement method, based on a TCN network structure, that is applicable to mobile devices. It combines low power consumption with streaming processing and is suitable for mobile deployment of services exposed to significant noise, such as cloud conferencing, live streaming, live co-streaming (mic-linking), and smart speakers.
In some embodiments, the division of the speech enhancement apparatus 20 into units is only a logical functional division; other divisions are possible in an actual implementation. For example, at least two of the obtaining unit 21, the noise reduction unit 22, the prediction unit 23, and the determining unit 24 may be implemented as one unit, or any of these units may be divided into a plurality of sub-units. It will be understood that the units or sub-units may be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application.
Fig. 3 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure. As shown in Fig. 3, the electronic device includes at least one processor 31, at least one memory 32, and at least one communication interface 33. The components of the electronic device are coupled together by a bus system 34, and the communication interface 33 is used for information transmission with external devices. Understandably, the bus system 34 enables connection and communication between these components; in addition to a data bus, it includes a power bus, a control bus, and a status signal bus. For clarity of illustration, the various buses are all labeled as the bus system 34 in Fig. 3.
It will be appreciated that the memory 32 in this embodiment may be volatile memory, nonvolatile memory, or a combination of both.
In some embodiments, memory 32 stores the following elements, executable units, or data structures, or a subset or extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing basic tasks and processing hardware-based tasks. The application programs include various applications, such as a media player (MediaPlayer) and a browser (Browser), for implementing application tasks. A program implementing the speech enhancement method provided by the embodiments of the present disclosure may be included in an application program.
In the embodiments of the present disclosure, the processor 31 is configured to call a program or instructions stored in the memory 32, specifically a program or instructions stored in an application program, to execute the steps of the embodiments of the speech enhancement method provided by the present disclosure.
The speech enhancement method provided by the embodiments of the present disclosure may be applied to, or implemented by, the processor 31. The processor 31 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated hardware logic circuits in the processor 31 or by instructions in the form of software. The processor 31 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor.
The steps of the speech enhancement method provided by the embodiments of the present disclosure may be executed directly by a hardware decoding processor, or by a combination of hardware and software units in a decoding processor. The software units may reside in storage media well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 32; the processor 31 reads the information in the memory 32 and completes the steps of the method in combination with its hardware.
Fig. 4 is an exemplary flowchart of a speech enhancement method according to an embodiment of the present disclosure. The execution subject of the method is an electronic device; in some embodiments, this is a mobile device such as a smartphone, a tablet computer, or a smart speaker. In some embodiments, devices that play and/or capture audio in scenarios such as video conferences and telephone conferences may also employ aspects of the embodiments of the present disclosure. For convenience of description, the following embodiments describe the flow of the speech enhancement method with the electronic device as the execution subject.
As shown in Fig. 4, in step 401, the electronic device acquires a noisy speech signal and divides its full band into a first bandwidth and a second bandwidth, which together constitute the full band of the noisy speech signal. The upper limit (maximum frequency) of the first bandwidth is less than or equal to the lower limit (minimum frequency) of the second bandwidth.
In some embodiments, the first bandwidth is the low-frequency band of the noisy speech signal, e.g., 0 to 16 kHz, and the second bandwidth is the medium/high-frequency band, e.g., 16 kHz to 48 kHz.
In some embodiments, the electronic device may determine the first bandwidth based on its voice processing capability information, and then divide the full band of the noisy speech signal into the first bandwidth and the second bandwidth accordingly.
In some embodiments, the voice processing capability information of the electronic device may be determined in various ways; for example, the processing speed of the electronic device may be tested with voice test data, and the maximum frequency the electronic device can process while still satisfying streaming processing is taken as its voice processing capability information.
By way of example: if, while satisfying streaming processing, the maximum frequency the electronic device can process is 24 kHz, the first bandwidth may be determined to be 0 to 24 kHz and, accordingly, the second bandwidth to be 24 kHz to 48 kHz.
In some embodiments, the electronic device can determine the magnitude and phase spectra of the first bandwidth and the magnitude and phase spectra of the second bandwidth. In some embodiments, the noisy speech signal obtained by the electronic device is a full-band signal, i.e., it spans 0 to 48 kHz, so the magnitude and phase spectra of the full band can be obtained; after the full band is divided into the first bandwidth and the second bandwidth, the magnitude and phase spectra of each bandwidth can be determined from those of the full band. The acquisition of magnitude and phase spectra is a mature technique in the field and is not described in detail here.
In some embodiments, the electronic device may perform low-frequency sampling and time-frequency transformation on the acquired noisy speech signal to obtain the magnitude spectrum and phase spectrum of the first bandwidth. For example, the electronic device samples one frame of the speech signal at 16 kHz; since the frame lasts 10 ms, 160 sample points are obtained. The electronic device performs a time-frequency transform on these 160 sample points, such as a 512-point FFT (Fast Fourier Transform), to obtain the magnitude spectrum and phase spectrum of the first bandwidth. Given the symmetry of the 512-point FFT, the electronic device need only perform noise reduction processing on 512 ÷ 2 + 1 = 257 points.
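The 257-point figure follows from the symmetry of the real-valued FFT, which keeps only the non-redundant half of the spectrum; a minimal NumPy check:

```python
import numpy as np

fs = 16000                                          # low-frequency sampling rate
frame = np.random.default_rng(0).normal(size=160)   # one 10 ms frame (160 samples)
spec = np.fft.rfft(frame, n=512)                    # 512-point real FFT, zero-padded

# Only the non-redundant half of the symmetric spectrum is kept:
assert spec.shape == (512 // 2 + 1,)                # 257 complex bins
mag, phase = np.abs(spec), np.angle(spec)           # magnitude and phase spectra
```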
In step 402, the electronic device performs noise reduction processing on the noisy speech signal corresponding to the first bandwidth to obtain a first gain corresponding to the first bandwidth. In some embodiments, the electronic device may perform the noise reduction on the magnitude spectrum of the first bandwidth. The first gain is the gain of the magnitude spectrum of the low-frequency noisy speech signal, i.e., a time-frequency-domain gain (time-frequency gain for short). For example, the electronic device may perform the noise reduction processing on the 257 points of data.
In some embodiments, the electronic device may determine the first signal energy ratio corresponding to the first bandwidth based on the magnitude spectrum of the first bandwidth, and then perform gain processing on the first signal energy ratio to obtain the first gain corresponding to the first bandwidth. The gain processing may include, but is not limited to, smoothing and gain compensation: smoothing improves the perceived quality of the speech, while gain compensation preserves the fidelity of the speech.
In some embodiments, the electronic device may determine a spectrogram of the first bandwidth based on the magnitude spectrum of the first bandwidth, and further determine a first signal energy ratio corresponding to the first bandwidth based on the spectrogram of the first bandwidth.
In some embodiments, the first signal energy ratio is, for example, but not limited to, any one of: the speech-to-noise ratio, speech-to-noise power ratio, speech-to-noise amplitude ratio, noise-to-speech power ratio, noise-to-speech amplitude ratio, and the like, where the speech may be clean speech (i.e., speech without noise) or noisy speech.
In some embodiments, the electronic device may perform a first feature extraction on the magnitude spectrum of the first bandwidth (or the spectrogram of the first bandwidth), which reduces its dimensionality; perform a second feature extraction on the features obtained by the first extraction, realizing feature extraction over the time-sequential input; and output the first signal energy ratio corresponding to the first bandwidth based on the second extracted features.
The first feature extraction can reduce odd-dimensional data to even-dimensional data, which facilitates the subsequent second feature extraction on even-dimensional data. The first feature extraction thus performs no substantive feature extraction; it can be understood as feature pre-extraction or shallow feature extraction, in preparation for the subsequent substantive extraction. By way of example: the electronic device may reduce the 257 points of data obtained in step 401 (understood as 257-dimensional data) to 256 points of data (understood as 256-dimensional data).
Compared with the first feature extraction, the second feature extraction is a substantive extraction of features; it can therefore also be understood as abstract feature extraction.
In some embodiments, the electronic device may perform noise reduction processing on the magnitude spectrum of the first bandwidth (or the spectrogram of the first bandwidth) through a Temporal Convolutional Network (TCN) to obtain an a priori signal energy ratio corresponding to the first bandwidth. The TCN network performs the first feature extraction on the magnitude spectrum (or spectrogram) of the first bandwidth, reducing its dimensionality. The TCN network then performs the second feature extraction on the features obtained by the first extraction, realizing feature extraction over the time-sequential input. Finally, the TCN network outputs the a priori signal energy ratio corresponding to the first bandwidth based on the second extracted features.
In some embodiments, the TCN network includes a first fully-connected layer, a plurality of serially-connected dilated causal convolution layers, and a second fully-connected layer. The first fully-connected layer performs the first feature extraction on the magnitude spectrum (or spectrogram) of the first bandwidth, thereby reducing its dimensionality, for example reducing 257-dimensional data to 256-dimensional data. The serially-connected dilated causal convolution layers perform the second feature extraction on the features obtained by the first extraction; the dilated causal convolutions enlarge the overall receptive field of the TCN network. The second fully-connected layer outputs the first signal energy ratio corresponding to the first bandwidth based on the second extracted features, performing dimensionality-raising processing opposite to the function of the first fully-connected layer, for example raising 256-dimensional data back to 257-dimensional data.
It should be noted that, in the above embodiment, in view of the time-sequential nature of speech signals and resource-constrained application scenarios, a conventional RNN-type structure is not used to capture long-term context information; instead, a plurality of serially-connected dilated causal convolution layers perform the second feature extraction on the features obtained by the first extraction, enlarging the overall receptive field of the TCN network. In addition, when the dilated causal convolution layers infer at the current time step, no information from future time steps is used. Meanwhile, the normalization (norm) scheme is improved from the existing LayerNorm to FrameNorm. Specifically, the existing LayerNorm normalizes all neurons of each layer, that is, normalizes the inputs at each depth, an effect most pronounced for RNN-type structures; in this embodiment, FrameNorm depends only on the results of the previous and next frames, so the normalization range is reduced compared with the prior art and the computation is more efficient.
In some embodiments, the electronic device may determine an initial gain based on the first signal energy ratio, then smooth the initial gain, and finally perform gain compensation on the smoothed result to obtain the first gain corresponding to the first bandwidth.
In some embodiments, the smoothing includes, but is not limited to, removing abnormal discontinuity points in the initial gain, yielding more continuous, natural-sounding enhanced speech. In some embodiments, the gain compensation is compensation for speech: it includes, but is not limited to, analyzing the statistical distribution characteristics of the smoothed result, identifying speech, and compensating it accordingly, thereby obtaining the first gain corresponding to the first bandwidth.
In some embodiments, the functional form of the first gain obtained by the electronic device is determined by the assumed statistical models of clean speech and noise and by the optimization criterion. Taking these together, an MMSE-STSA gain function is employed, based on the minimum mean square error (MMSE) criterion and Gaussian distribution assumptions for clean speech and noise.
In some embodiments, the electronic device may adjust the magnitude of the first gain. For example, after the gain processing of the first signal energy ratio yields the first gain corresponding to the first bandwidth, the first gain is multiplied by a noise reduction weight to adjust its magnitude, where the noise reduction weight may be predetermined based on user information.
For example, the noise reduction weight for a paying user is higher and that for a non-paying user is lower, giving different users different noise reduction effects. As another example, the noise reduction weight for a paying merchant (e.g., a live-streaming merchant) is higher and that for a non-paying merchant is lower, giving different live-streaming merchants different noise reduction effects.
In step 403, the electronic device predicts a second gain corresponding to the second bandwidth based on the first gain corresponding to the first bandwidth. The second gain is the time-domain gain of the medium/high-frequency noisy speech signal, also called the time-domain noise reduction gain.
In some embodiments, the electronic device determines the speech presence probability corresponding to the second bandwidth based on the first gain corresponding to the first bandwidth, and then determines the second gain corresponding to the second bandwidth based on the first gain and this speech presence probability.
In some embodiments, the electronic device may estimate the a priori SNR and the a posteriori SNR of the high-frequency components based on the first gain corresponding to the first bandwidth or the first signal energy ratio corresponding to the first bandwidth, and then obtain the speech presence probability corresponding to the second bandwidth from these a priori and a posteriori SNRs.
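The embodiment does not give formulas for this estimation; a common sketch under the same Gaussian assumptions uses the decision-directed a priori SNR estimate and a likelihood-ratio speech presence probability (the parameters `alpha` and `q` below are assumptions, not values from this disclosure):

```python
import numpy as np

def decision_directed_xi(gain_prev, gamma_prev, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate from the previous frame's gain
    and a posteriori SNR and the current a posteriori SNR (alpha assumed)."""
    return alpha * gain_prev ** 2 * gamma_prev + (1 - alpha) * max(gamma - 1.0, 0.0)

def speech_presence_prob(xi, gamma, q=0.5):
    """Speech presence probability under Gaussian speech/noise models:
    likelihood ratio exp(v) / (1 + xi) with v = xi * gamma / (1 + xi),
    combined with a prior speech-absence probability q (0.5 assumed)."""
    v = xi * gamma / (1.0 + xi)
    lam = (1.0 - q) / q * np.exp(v) / (1.0 + xi)
    return lam / (1.0 + lam)

# High observed SNR: speech almost surely present; low SNR: unlikely
xi = decision_directed_xi(gain_prev=0.9, gamma_prev=20.0, gamma=25.0)
assert speech_presence_prob(xi, 25.0) > 0.9
assert speech_presence_prob(0.1, 0.5) < 0.5
```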
In some embodiments, the electronic device may obtain the second gain corresponding to the second bandwidth based on the smoothed a posteriori SNR of the high-frequency components and the speech presence probability corresponding to the second bandwidth.
In step 404, the electronic device determines an enhanced speech signal of the noisy speech signal over the full frequency band based on the first gain and the second gain. In some embodiments, the electronic device may obtain this full-band enhanced speech signal from the first gain corresponding to the first bandwidth, the second gain corresponding to the second bandwidth, the magnitude and phase spectra of the first bandwidth, and the magnitude and phase spectra of the second bandwidth.
In some embodiments, the electronic device may obtain the enhanced speech signal of the first bandwidth based on the first gain corresponding to the first bandwidth and the magnitude and phase spectra of the first bandwidth; obtain the enhanced speech signal of the second bandwidth based on the second gain corresponding to the second bandwidth and the magnitude and phase spectra of the second bandwidth; and then obtain the enhanced speech signal of the noisy speech signal over the full frequency band from the enhanced speech signals of the first and second bandwidths.
It can be seen that, in at least one embodiment of the present disclosure, by dividing the full band of the noisy speech signal into a low-frequency band (first bandwidth) and a medium/high-frequency band (second bandwidth), performing noise reduction processing only on the low-frequency band, and performing gain prediction for the medium/high-frequency band, full-band speech enhancement is achieved without noise reduction processing over the full band. Compared with existing schemes that perform noise reduction over the full band, the present disclosure reduces the amount of data and the complexity of processing, improving processing efficiency and speed; it is therefore well suited to deployment on mobile devices and realizes streaming full-band speech enhancement.
It can be seen that at least one embodiment of the present disclosure, combining a TCN network structure with a data-driven supervised learning approach, provides a streaming full-band speech enhancement method suitable for mobile devices; it features low power consumption and streaming processing and is suitable for mobile deployment of services such as cloud conferencing, live streaming, and live co-streaming (mic-linking).
It is noted that, for simplicity of description, the foregoing method embodiments are described as a series of acts or combination of acts, but those skilled in the art will appreciate that the disclosed embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the disclosed embodiments. In addition, those skilled in the art can appreciate that the embodiments described in the specification all belong to alternative embodiments.
Embodiments of the present disclosure also provide a non-transitory computer-readable storage medium, where the non-transitory computer-readable storage medium stores a program or an instruction, and the program or the instruction causes a computer to execute steps of the embodiments of the speech enhancement method, which are not described herein again to avoid repeated descriptions.
Fig. 5 is an exemplary flowchart of a speech enhancement method provided by an embodiment of the present disclosure. The method is performed by an electronic device; for brevity, the executing entity is not repeated in the description of each step below.
As shown in fig. 5, in step 501, the noisy speech signal is subjected to time-domain sampling and frequency-domain transformation to obtain the magnitude and phase spectra of a first bandwidth and the magnitude and phase spectra of a second bandwidth, where the first bandwidth and the second bandwidth together constitute the full frequency band of the noisy speech signal. In some embodiments, the first bandwidth is the low frequency band of the noisy speech signal, e.g., 0 to 16 kHz, and the second bandwidth is the medium-high frequency band of the noisy speech signal, e.g., 16 kHz to 48 kHz.
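As a rough illustration of step 501, the following sketch performs the time-domain windowing, frequency-domain transform, and low/high band split on one frame. The 48 kHz sampling rate, 1024-sample Hann-windowed frame, and the 16 kHz cut-off are illustrative assumptions, not values mandated by the disclosure.

```python
import numpy as np

def split_bands(frame, sr=48000, cutoff_hz=16000):
    """Transform one time-domain frame and split its spectrum into the
    first bandwidth (low band) and second bandwidth (mid-high band)."""
    windowed = frame * np.hanning(len(frame))          # time-domain windowing
    spec = np.fft.rfft(windowed)                       # frequency-domain transform
    mag, phase = np.abs(spec), np.angle(spec)
    cut = int(cutoff_hz / (sr / 2) * (len(spec) - 1))  # bin index of the cut-off
    return (mag[:cut], phase[:cut]), (mag[cut:], phase[cut:])

(low_mag, low_ph), (high_mag, high_ph) = split_bands(np.random.randn(1024))
```

The two band spectra together cover the full band: their bin counts sum to the full rfft length.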
In step 502, a spectrogram of the first bandwidth is determined from the magnitude spectrum of the first bandwidth, and noise reduction processing on this spectrogram yields a first gain corresponding to the first bandwidth. The first gain is a gain applied to the magnitude spectrum of the low-frequency noisy speech signal, i.e., a time-frequency domain gain (time-frequency gain for short).
In step 503, a second gain corresponding to the second bandwidth is predicted based on the first gain corresponding to the first bandwidth. The second gain is the time-domain gain of the medium-high-frequency noisy speech signal, also called the time-domain noise reduction gain.
In step 504, an enhanced speech signal of the noisy speech signal in the full band is synthesized based on the first gain corresponding to the first bandwidth, the second gain corresponding to the second bandwidth, the magnitude spectrum and the phase spectrum of the first bandwidth, and the magnitude spectrum and the phase spectrum of the second bandwidth.
In some embodiments, an enhanced speech signal of the first bandwidth is obtained based on the first gain and the magnitude and phase spectra of the first bandwidth; an enhanced speech signal of the second bandwidth is obtained based on the second gain and the magnitude and phase spectra of the second bandwidth; and the enhanced speech signal of the noisy speech signal in the full frequency band is synthesized from these two band-wise enhanced speech signals.
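The per-band synthesis described above can be sketched as follows. Following steps 503-504, the first gain is applied per bin and the second gain as a single scalar; overlap-add and windowing details are omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def synthesize_full_band(low_mag, low_ph, high_mag, high_ph, g_low, g_high):
    """Apply the first (per-bin, time-frequency) gain to the low band and the
    second (scalar, time-domain) gain to the mid-high band, then reconstruct
    one enhanced time-domain frame."""
    enh_low = g_low * low_mag * np.exp(1j * low_ph)      # enhanced low-band spectrum
    enh_high = g_high * high_mag * np.exp(1j * high_ph)  # enhanced high-band spectrum
    full_spec = np.concatenate([enh_low, enh_high])      # full-band spectrum
    return np.fft.irfft(full_spec)                       # enhanced speech frame

frame = synthesize_full_band(np.ones(341), np.zeros(341),
                             np.ones(172), np.zeros(172),
                             np.full(341, 0.8), 0.5)
```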
Fig. 6 is an exemplary flowchart for determining the first gain according to an embodiment of the disclosure, and the flowchart is applied to step 502 for determining the first gain in fig. 5, that is, the flowchart shown in fig. 6 is an implementation manner of step 502.
As shown in fig. 6, the magnitude spectrum of the first bandwidth is input into a time convolutional network (TCN), which performs noise reduction processing and outputs a first signal energy ratio corresponding to the first bandwidth.
In some embodiments, based on the magnitude spectrum of the first bandwidth, a spectrogram of the first bandwidth may be determined, and the spectrogram of the first bandwidth may be input into a Time Convolution Network (TCN), and the TCN may perform noise reduction processing to output a first signal energy ratio corresponding to the first bandwidth.
In some embodiments, the TCN network performs first feature extraction on the magnitude spectrum (or the spectrogram) of the first bandwidth, reducing its dimensionality. The TCN network then performs second feature extraction on the features obtained by the first extraction, thereby extracting features from the time-sequential input. Finally, the TCN network outputs the a priori signal energy ratio corresponding to the first bandwidth based on the second-extracted features.
In step 601, the first signal energy ratio is subjected to gain processing to obtain the first gain corresponding to the first bandwidth. The gain processing may include, but is not limited to, smoothing and gain compensation: smoothing improves the perceived quality of the speech, and gain compensation ensures speech fidelity.
In some embodiments, an initial gain is determined based on the first signal energy ratio and smoothed; gain compensation is then applied to the smoothed result to obtain the first gain corresponding to the first bandwidth.
In some embodiments, the smoothing includes, but is not limited to, removing abnormal discontinuities in the initial gain to achieve a more continuous-sounding enhanced speech. In some embodiments, the gain compensation targets the speech component and includes, but is not limited to, performing a statistical distribution analysis on the smoothed result, recognizing speech, and compensating it accordingly, thereby obtaining the first gain corresponding to the first bandwidth.
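A minimal sketch of this gain post-processing follows: a Wiener-style initial gain derived from the a priori SNR, recursive smoothing across frames to suppress discontinuities, and a simple gain floor standing in for the speech-compensation step. The Wiener form, smoothing constant, and floor value are assumptions for illustration, not the disclosed implementation.

```python
import numpy as np

def postprocess_gain(prior_snr, alpha=0.7, floor=0.1, prev_gain=None):
    """Gain post-processing sketch: initial gain, smoothing, compensation floor.
    All constants are illustrative assumptions."""
    init_gain = prior_snr / (1.0 + prior_snr)        # Wiener-style initial gain
    if prev_gain is not None:                        # recursive smoothing across frames
        init_gain = alpha * prev_gain + (1 - alpha) * init_gain
    return np.maximum(init_gain, floor)              # floor stands in for compensation

g = postprocess_gain(np.array([1.0, 3.0]))
```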
Fig. 7 is an exemplary architecture diagram of a Time Convolutional Network (TCN) provided by an embodiment of the present disclosure. In some embodiments, the time convolution network shown in FIG. 7 may be implemented as, or as part of, the time convolution network shown in FIG. 6.
As shown in fig. 7, the time convolutional network comprises a first fully-connected layer, a plurality of serially connected dilated causal convolutional layers, and a second fully-connected layer. The first fully-connected layer performs first feature extraction on the magnitude spectrum (or spectrogram) of the first bandwidth, reducing its dimensionality, e.g., from 257 dimensions to 256 dimensions. The serially connected dilated causal convolutional layers perform second feature extraction on the first-extracted features; the dilation enlarges the overall receptive field of the TCN network. The second fully-connected layer outputs the first signal energy ratio corresponding to the first bandwidth based on the second-extracted features, performing the dimensionality-raising operation inverse to the first fully-connected layer, e.g., from 256 dimensions back to 257 dimensions.
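The key building block, a dilated causal convolution, can be sketched in a few lines. This is a generic single-channel illustration of the operation, not the disclosed network's exact layer.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """1-D causal dilated convolution: the output at time t depends only on
    x[t], x[t-dilation], x[t-2*dilation], ... (no future samples), with w[0]
    weighting the current sample."""
    K = len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left padding enforces causality
    return np.array([sum(w[k] * xp[t + pad - k * dilation] for k in range(K))
                     for t in range(len(x))])

# An impulse input reveals the taps: with dilation 2, the impulse echoes at t and t+2.
y = causal_dilated_conv(np.array([1.0, 0, 0, 0, 0]), np.array([1.0, 1.0]), 2)
```

Stacking such layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth, which is why the serially connected dilated causal layers enlarge the TCN's overall receptive field.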
Fig. 8 is an exemplary flowchart for determining an enhanced speech signal of a first bandwidth according to an embodiment of the disclosure. The first bandwidth is a low frequency band of the noisy speech signal, e.g. 0 to 16 kHz.
As shown in fig. 8, after the magnitude spectrum and the phase spectrum of the noisy speech signal 801 in the first bandwidth are determined, a spectrogram 802 of the noisy speech signal 801 in the first bandwidth is obtained from the magnitude spectrum.
The spectrogram 802 of the first bandwidth is taken as the input of a residual convolutional network (ResNetConv network) 803. The ResNetConv network 803 may be implemented with a TCN structure comprising a first fully-connected layer (FC) 8031, a plurality of serially connected dilated causal convolutional layers (Conv blocks) 8032, and a second fully-connected layer (FC) 8033.
The output of the ResNetConv network 803 is a first signal energy ratio corresponding to the first bandwidth, in this embodiment, the first signal energy ratio is a prior signal energy ratio, for example, a prior signal-to-noise ratio.
The a priori signal energy ratio output by the ResNetConv network 803 undergoes gain post-processing (Gain postprocessing) 804, which may include, but is not limited to, smoothing and gain compensation. Smoothing improves the perceived quality of the speech; gain compensation ensures speech fidelity.
The output of the gain post-processing 804 is the first gain corresponding to the first bandwidth. Since the first gain is the gain of the magnitude spectrum of the low-frequency noisy speech signal, i.e., a time-frequency domain gain (time-frequency gain for short), it can be applied to the spectrogram 802 of the first bandwidth, for example by multiplication, to obtain the gain-applied spectrogram 805.
The gain-applied spectrogram 805 and the phase spectrum of the first bandwidth are passed to speech synthesis (Audio synthesis) 806 to obtain the enhanced speech signal 807 of the first bandwidth.
Based on the embodiment shown in fig. 8, the process of predicting the second gain corresponding to the second bandwidth is described in the following steps (1) to (6):
(1) The first gain is divided into segments, by default 4 segments, each of length deltaBweHB.
(2) The first signal energy ratio corresponding to the first bandwidth is the a priori signal-to-noise ratio snrLocPrior; correspondingly, the a posteriori signal-to-noise ratio is denoted snrLocPost.
snrLocPost(i) = snrLocPrior(i) + 1.0  (i = 0, 1, …, binNum)
where i denotes the frequency-bin index, ranging from 0 to binNum; binNum is the maximum frequency-bin index, typically 257.
logLrtTimeAvg(i, t) = logLrtTimeAvg(i, t−1) + LRT_TAVG × (Λ(i, t) − logLrtTimeAvg(i, t−1))
Wherein logLrtTimeAvg(i, t) denotes the speech-state likelihood factor of the i-th frequency bin in the t-th frame, Λ(i, t) is the instantaneous log likelihood ratio computed from snrLocPrior(i) and snrLocPost(i), and LRT_TAVG is a time-domain smoothing factor. Here, LRT_TAVG may be 0.5.
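The recursive smoothing of the speech-state likelihood factor, together with the posterior SNR of step (2), can be sketched as follows. The instantaneous likelihood term `lrt` below is a generic log-likelihood stand-in, an assumption made because the exact per-bin expression appears only as a formula image in the original.

```python
import numpy as np

LRT_TAVG = 0.5  # time-domain smoothing factor from the text

def update_likelihood(snr_prior, log_lrt_prev):
    """Posterior SNR plus recursively smoothed speech-state likelihood factor,
    averaged over frequency bins (logLrtTimeAvgKsum)."""
    snr_post = snr_prior + 1.0            # snrLocPost(i) = snrLocPrior(i) + 1.0
    lrt = np.log(1.0 + snr_prior)         # assumed instantaneous likelihood term
    log_lrt = log_lrt_prev + LRT_TAVG * (lrt - log_lrt_prev)
    ksum = log_lrt.mean()                 # average over all frequency bins
    return snr_post, log_lrt, ksum

sp, ll, ks = update_likelihood(np.zeros(4), np.zeros(4))
```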
logLrtTimeAvgKsum = (1 / binNum) × Σ_{i=0}^{binNum} logLrtTimeAvg(i, t)
Wherein logLrtTimeAvgKsum is the speech-state likelihood factor averaged over all frequency bins.
(3) widthPrior is a parameter of the sigmoid (tanh) mapping in step (4) below, used to widen the value range for non-speech segments. The default value of widthPrior is widthPrior0; if logLrtTimeAvgKsum < threshPrior0, widthPrior is set to widthPrior1. widthPrior0 may be 4, and widthPrior1 = 2 × widthPrior0.
(4) Performing tanh mapping
indPrior=weightIndPrior0×(0.5×(tanh(widthPrior×(logLrtTimeAvgKsum-threshPrior0))+1.0))
priorSpeechProb(t)=priorSpeechProb(t-1)+priorUpdate×(indPrior-priorSpeechProb(t-1))
weightIndPrior0 may be 1, threshPrior0 may be 0.5, and priorUpdate may be 0.1; priorSpeechProb(t) is the prior speech presence probability at time t, and indPrior is the speech indicator obtained from the tanh mapping.
(5) priorSpeechProb(t) is limited to the range (0.01, 1.0), and the average over a fixed frequency range is computed, finally yielding the speech presence probability avgProbSpeechHB corresponding to the second bandwidth.
priorSpeechProb(t) ← min(max(priorSpeechProb(t), 0.01), 1.0)
avgProbSpeechHB = (1 / deltaBweHB) × Σ_{i = binNum − deltaBweHB}^{binNum − 1} speechProb(i, t)
wherein speechProb(i, t) denotes the per-frequency-bin speech presence probability in the fixed frequency range.
(6) Denote by gain(i) the first gain corresponding to the first bandwidth.
avgFilterGainHB = (1 / deltaBweHB) × Σ_{i = binNum − deltaBweHB}^{binNum − 1} gain(i)
gainModHB=0.5×(1.0+tanh(gainMapParHB×(2.0×avgProbSpeechHB-1.0)))
gainTimeDomainHB=0.5×gainModHB+0.5×avgFilterGainHB
Wherein avgFilterGainHB is the average of the first gain over a specific frequency segment, and gainTimeDomainHB is the second gain corresponding to the second bandwidth. gainMapParHB may be 1. If gainTimeDomainHB is greater than or equal to 0.5, the gainModHB coefficient may be adjusted to 0.25 and the avgFilterGainHB coefficient to 0.75, limiting the amplitude.
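Steps (5) and (6) above combine into a short routine. The segment length deltaBweHB and the per-bin speech-probability input are illustrative assumptions; only the tanh mapping, the 0.5/0.5 combination, and the 0.25/0.75 amplitude limit follow the formulas given.

```python
import numpy as np

def highband_gain(speech_prob, low_gain, delta_bwe_hb=64, gain_map_par_hb=1.0):
    """Second (high-band, time-domain) gain from low-band results, per steps (5)-(6)."""
    p = np.clip(speech_prob, 0.01, 1.0)           # limit to (0.01, 1.0)
    avg_prob = p[-delta_bwe_hb:].mean()           # avgProbSpeechHB over top segment
    avg_gain = low_gain[-delta_bwe_hb:].mean()    # avgFilterGainHB over top segment
    gain_mod = 0.5 * (1.0 + np.tanh(gain_map_par_hb * (2.0 * avg_prob - 1.0)))
    g = 0.5 * gain_mod + 0.5 * avg_gain           # gainTimeDomainHB
    if g >= 0.5:                                  # limit the amplitude
        g = 0.25 * gain_mod + 0.75 * avg_gain
    return g

g = highband_gain(np.ones(128), np.ones(128))
```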
Technical effects
Based on the above description of the embodiments, the technical effects of the embodiments of the present disclosure are explained:
1. objective index:
Compared with the OMLSA algorithm, the scheme of the disclosed embodiments improves speech quality (PESQ) by about 11.8% and speech intelligibility (STOI) by about 17.0%. Meanwhile, its power consumption on the mobile terminal meets the requirements of real-time streaming processing, so that most mobile terminal devices can be covered.
Table 1 shows the effect comparison of the scheme of the disclosed embodiment and the OMLSA algorithm
[Table 1: PESQ and STOI comparison between the disclosed scheme and the OMLSA algorithm; provided as an image in the original publication.]
2. Subjective auditory sensation:
FIG. 9 is an exemplary waveform diagram and corresponding spectrogram for noisy speech in a real environment; fig. 10 is a waveform diagram and a corresponding spectrogram of the noisy speech shown in fig. 9 after the OMLSA processing; fig. 11 is a waveform diagram and a corresponding spectrogram obtained after the noisy speech shown in fig. 9 is processed by the speech enhancement method provided by the embodiment of the present disclosure.
It can be seen that the scheme of the embodiment of the present disclosure is significantly better than the OMLSA algorithm in terms of the suppression capability of non-stationary burst noise.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than others, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments.
Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.
Although the embodiments of the present disclosure have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the present disclosure, and such modifications and variations fall within the scope defined by the appended claims.

Claims (14)

1. A method of speech enhancement, the method comprising:
acquiring a voice signal with noise;
dividing the full frequency band of the voice signal with noise into a first frequency band and a second frequency band;
carrying out noise reduction processing on the voice signal with noise corresponding to the first frequency width to obtain a first gain corresponding to the first frequency width;
predicting a second gain corresponding to the second bandwidth based on the first gain;
determining an enhanced speech signal of the noisy speech signal in the full frequency band based on the first gain and the second gain.
2. The method of claim 1, wherein dividing the full band of the noisy speech signal into a first bandwidth and a second bandwidth comprises:
determining a first bandwidth based on the voice processing capability information of the mobile terminal equipment;
based on the first frequency width, dividing the full frequency width of the voice signal with noise into a first frequency width and a second frequency width.
3. The method of claim 1,
the noise reduction processing on the noisy speech signal corresponding to the first bandwidth to obtain the first gain corresponding to the first bandwidth includes:
determining a magnitude spectrum and a phase spectrum of a first bandwidth;
performing noise reduction processing on the amplitude spectrum of the first bandwidth to obtain a first gain corresponding to the first bandwidth;
said determining, based on the first gain and the second gain, an enhanced speech signal for the noisy speech signal in the full frequency band comprises:
determining a magnitude spectrum and a phase spectrum of a second frequency width;
and obtaining the enhanced voice signal of the voice signal with noise under the full frequency band based on the first gain, the second gain, the amplitude spectrum and the phase spectrum of the first frequency band and the amplitude spectrum and the phase spectrum of the second frequency band.
4. The method according to claim 3, wherein the denoising the amplitude spectrum of the first bandwidth to obtain the first gain corresponding to the first bandwidth comprises:
determining a first signal energy ratio corresponding to the first bandwidth based on the amplitude spectrum of the first bandwidth;
and performing gain processing on the first signal energy ratio to obtain a first gain corresponding to the first bandwidth.
5. The method of claim 4, wherein the determining the first signal energy ratio for the first bandwidth based on the magnitude spectrum of the first bandwidth comprises:
performing first feature extraction on the magnitude spectrum of the first bandwidth, wherein the first extraction is used for performing dimensionality reduction processing on the magnitude spectrum of the first bandwidth;
performing second feature extraction on the features obtained by the first extraction, wherein the second feature extraction is used for performing feature extraction on the input with time sequence;
and outputting a first signal energy ratio corresponding to the first bandwidth based on the second extracted features.
6. The method of claim 5, wherein the determining the first signal energy ratio for the first bandwidth based on the magnitude spectrum of the first bandwidth comprises:
and carrying out noise reduction processing on the amplitude spectrum of the first bandwidth through a time convolution network to obtain a prior signal energy ratio corresponding to the first bandwidth.
7. The method of claim 5, wherein the time convolutional network comprises a first fully-connected layer, a plurality of serially connected dilated causal convolutional layers, and a second fully-connected layer;
the first fully-connected layer performs the first feature extraction on the magnitude spectrum of the first bandwidth; the plurality of serially connected dilated causal convolutional layers perform the second feature extraction on the features obtained by the first extraction; and the second fully-connected layer outputs the first signal energy ratio corresponding to the first bandwidth based on the features obtained by the second extraction.
8. The method of claim 4, wherein the first signal energy ratio comprises any one of:
signal to noise ratio, speech to noise power ratio, speech to noise amplitude ratio, noise to speech power ratio, noise to speech amplitude ratio.
9. The method of claim 4, wherein the performing gain processing on the first signal energy ratio to obtain a first gain corresponding to the first bandwidth comprises:
determining an initial gain based on the first signal energy ratio;
smoothing the initial gain;
and performing gain compensation on the result obtained by the smoothing processing to obtain a first gain corresponding to the first bandwidth.
10. The method of claim 1, wherein predicting a second gain corresponding to the second bandwidth based on the first gain comprises:
determining the voice existence probability corresponding to the second bandwidth based on the first gain;
and determining a second gain corresponding to the second bandwidth based on the first gain and the voice existence probability.
11. The method of claim 3, wherein said deriving the enhanced speech signal of the noisy speech signal in the full band based on the first gain, the second gain, the first bandwidth of the magnitude spectrum and the phase spectrum, and the second bandwidth of the magnitude spectrum and the phase spectrum comprises:
obtaining an enhanced voice signal of the first frequency width based on the first gain and the amplitude spectrum and the phase spectrum of the first frequency width;
obtaining an enhanced voice signal of the second frequency width based on the second gain and the amplitude spectrum and the phase spectrum of the second frequency width;
and obtaining the enhanced voice signal of the voice signal with noise under the full frequency band based on the enhanced voice signal with the first frequency width and the enhanced voice signal with the second frequency width.
12. A speech enhancement apparatus, characterized in that the apparatus comprises:
the acquiring unit is used for acquiring a voice signal with noise; dividing the full frequency band of the voice signal with noise into a first frequency band and a second frequency band;
the noise reduction unit is used for carrying out noise reduction processing on the voice signal with noise corresponding to the first frequency width to obtain a first gain corresponding to the first frequency width;
a prediction unit for predicting a second gain corresponding to the second bandwidth based on the first gain;
a determining unit, configured to determine, based on the first gain and the second gain, an enhanced speech signal of the noisy speech signal in the full frequency band.
13. An electronic device, comprising: a processor and a memory;
the processor is adapted to perform the steps of the method of any one of claims 1 to 11 by calling a program or instructions stored in the memory.
14. A non-transitory computer-readable storage medium storing a program or instructions for causing a computer to perform the steps of the method according to any one of claims 1 to 11.
CN202010987302.8A 2020-09-18 2020-09-18 Voice enhancement method and device, electronic equipment and storage medium Pending CN113299308A (en)

Publications (1)

Publication Number Publication Date
CN113299308A true CN113299308A (en) 2021-08-24


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903352A (en) * 2021-09-28 2022-01-07 阿里云计算有限公司 Single-channel speech enhancement method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090104559A (en) * 2008-03-31 2009-10-06 (주)트란소노 Procedure for processing noisy speech signals, and apparatus and program therefor
CN108735213A (en) * 2018-05-29 2018-11-02 太原理工大学 A kind of sound enhancement method and system based on phase compensation
CN109308904A (en) * 2018-10-22 2019-02-05 上海声瀚信息科技有限公司 A kind of array voice enhancement algorithm
US20190259381A1 (en) * 2018-02-14 2019-08-22 Cirrus Logic International Semiconductor Ltd. Noise reduction system and method for audio device with multiple microphones
CN110164467A (en) * 2018-12-18 2019-08-23 腾讯科技(深圳)有限公司 The method and apparatus of voice de-noising calculate equipment and computer readable storage medium
WO2020107269A1 (en) * 2018-11-28 2020-06-04 深圳市汇顶科技股份有限公司 Self-adaptive speech enhancement method, and electronic device
CN111429932A (en) * 2020-06-10 2020-07-17 浙江远传信息技术股份有限公司 Voice noise reduction method, device, equipment and medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG, Yihai: "Research on an E-commerce Speech Denoising Method Based on an Improved Spectral Subtraction", Informatization Research, no. 02, 20 April 2020 (2020-04-20), pages 25-29 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination