WO2022161277A1 - Speech enhancement method, model training method, and related equipment - Google Patents

Speech enhancement method, model training method, and related equipment

Info

Publication number
WO2022161277A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
amplitude spectrum
neural network
network module
module
Prior art date
Application number
PCT/CN2022/073197
Other languages
English (en)
French (fr)
Inventor
雪巍
蔡玉玉
吴俊仪
全刚
张超
杨帆
丁国宏
何晓冬
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司, 北京京东世纪贸易有限公司
Publication of WO2022161277A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates to the field of Internet technologies, and in particular, to a speech enhancement method, a model training method, and related equipment.
  • Speech recognition technology has been applied in various scenarios such as smart hardware and smartphone customer service. Because the accuracy of its recognition results is closely tied to work efficiency and user interaction experience, expectations for speech recognition quality keep rising.
  • Because the application scenarios of speech recognition are tied to the daily and work needs of users, the input speech signal cannot be guaranteed to be pure, noise-free speech. When speech with background noise is recognized, the noise degrades recognition quality and leads to inaccurate results, which reduces user efficiency in human-computer interaction and audio-to-text transcription. Speech enhancement technology that addresses audio noise interference in complex noise environments has therefore become a key part of speech recognition technology.
  • The purpose of speech enhancement technology is to process speech containing noise and output the processed pure speech audio. Its main approaches fall into two categories: linear filtering methods based on signal processing, such as Wiener filtering, Kalman filtering, and filters based on minimum mean square error; and methods based on machine learning, such as those based on recurrent neural networks, convolutional neural networks, convolutional-recurrent neural networks, and the UNET network.
  • The linear filtering method based on signal processing first presets statistical models of speech and noise, solves for the optimal filter under a given optimization criterion, and applies that filter to the noisy audio to enhance the speech.
  • The method based on machine learning uses a large amount of training data and a chosen network structure to learn, under a supervised learning framework, a nonlinear function from noisy speech to pure speech, thereby achieving speech enhancement.
  • Although linear filter-based methods do not require large-scale training data, they usually design their optimization functions from expert knowledge, and their model assumptions about speech or noise are too idealized (for example, assuming that the noise is stationary). As a result, performance drops significantly in practical scenarios, especially under non-stationary noise.
  • The speech enhancement method based on machine learning trains a neural network on a large corpus to obtain the mapping from noisy speech features to pure speech, which can significantly improve performance under complex non-stationary noise. However, its performance is clearly limited by the variability of the noise in the training corpus: when the training corpus is limited, overfitting often occurs, resulting in poor generalization to out-of-set noise.
  • The main reason for this problem is that the method based on machine learning relies too heavily on existing neural network model structures and does not introduce traditional expert knowledge from signal processing, so it is difficult to design a regularization method that is consistent with optimal speech signal processing and thereby improve network performance.
  • The present disclosure provides a speech enhancement method, a model training method, and related equipment. By optimizing the speech enhancement method, good enhancement performance can be maintained under both stationary noise and complex non-stationary noise, while the generalization performance of speech enhancement is improved.
  • One aspect of the present disclosure provides a method for training a speech enhancement model, where the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
  • the speech enhancement model training method includes:
  • acquiring the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in a speech training set, where the speech pair includes an associated pure speech signal and noisy speech signal;
  • obtaining a first feature set and a second feature set according to the noisy speech amplitude spectrum;
  • inputting the first feature set into the speech prediction neural network module, which is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error;
  • inputting the second feature set into the noise estimation neural network module, which is used to output estimated noise energy;
  • inputting the first quasi-estimated pure speech amplitude spectrum and prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, which is used to output an estimated pure speech amplitude spectrum;
  • calculating a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and training the speech enhancement model according to the model loss.
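  • the acquiring of the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set includes:
  • performing a time-domain to frequency-domain transformation step on the pure speech signal of the speech pair to obtain the pure speech amplitude spectrum, and on the noisy speech signal of the speech pair to obtain the noisy speech amplitude spectrum.
  • the time-domain to frequency-domain transformation step includes:
  • dividing the to-be-processed speech signal into frames; performing a Fourier transform on each frame to obtain its frame Fourier spectrum; splicing the frame Fourier spectra along the time axis to obtain the Fourier spectrum of the to-be-processed speech signal; and generating the amplitude spectrum of the to-be-processed speech signal based on the amplitude of each frequency point of that Fourier spectrum.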
  • the speech prediction neural network module is a time series neural network model, and the first feature set is a noise amplitude spectrum sequence of a plurality of consecutive frames. The first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module is a first quasi-estimated pure speech amplitude spectrum sequence having the same dimension as the noise amplitude spectrum sequence, and the prediction error output by the speech prediction neural network module is a prediction error sequence having the same dimension as the noise amplitude spectrum sequence.
  • the noise estimation neural network module is a multi-layer fully connected network
  • the second feature set includes the noisy speech amplitude spectrum of the current frame and of the neighborhood window of the current frame.
  • the linear filtering module includes a Wiener filtering module, a Kalman gain calculation module, and a linear combination module,
  • the Wiener filtering module is configured to output the Wiener filtering solution of the pure speech amplitude spectrum according to the estimated noise energy output by the noise estimation neural network module and the second feature set, as the second quasi-estimated pure speech amplitude spectrum;
  • the Kalman gain calculation module is configured to output the optimal Kalman gain G according to the prediction error output by the speech prediction neural network module and the estimated noise energy output by the noise estimation neural network module;
  • the linear combination module is configured to calculate the linear combination result of the first quasi-estimated pure speech amplitude spectrum and the second quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module according to the optimal Kalman gain G , as the estimated pure speech amplitude spectrum.
  • calculating, according to the optimal Kalman gain G, the linear combination result of the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module and the second quasi-estimated pure speech amplitude spectrum, as the estimated pure speech amplitude spectrum, includes:
  • determining, according to the optimal Kalman gain G, a first weight of the first quasi-estimated pure speech amplitude spectrum and a second weight of the second quasi-estimated pure speech amplitude spectrum; and calculating a weighted sum of the first quasi-estimated pure speech amplitude spectrum and the second quasi-estimated pure speech amplitude spectrum according to the first weight and the second weight, as the estimated pure speech amplitude spectrum.
  • the calculating a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and training the speech enhancement model according to the model loss includes:
  • the parameters of speech prediction neural network module and noise estimation neural network module are optimized by back propagation algorithm.
  • a speech enhancement method comprising:
  • obtaining the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum of a to-be-enhanced speech signal;
  • obtaining a first feature set and a second feature set according to the to-be-enhanced speech amplitude spectrum;
  • inputting the first feature set and the second feature set into a trained speech enhancement model, where the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The first feature set serves as the input of the speech prediction neural network module, which is used to output the first quasi-estimated pure speech amplitude spectrum and the prediction error. The second feature set serves as the input of the noise estimation neural network module, which is used to output estimated noise energy. The first quasi-estimated pure speech amplitude spectrum and prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which is used to output the estimated pure speech amplitude spectrum of the to-be-enhanced speech signal;
  • restoring the enhanced speech signal of the to-be-enhanced speech signal according to the estimated pure speech amplitude spectrum and the to-be-enhanced speech phase spectrum.
  • the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
  • the speech enhancement model training device includes:
  • the first acquisition module is configured to acquire the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set, and the speech pair includes the associated pure speech signal and the noisy speech signal;
  • a second obtaining module configured to obtain a first feature set and a second feature set according to the noisy speech amplitude spectrum
  • a first input module configured to input the first feature set into the speech prediction neural network module, and the speech prediction neural network module is configured to output a first quasi-estimated pure speech amplitude spectrum and a prediction error;
  • a second input module configured to input the second feature set into the noise estimation neural network module, the noise estimation neural network module for outputting estimated noise energy
  • an output module configured to input the first quasi-estimated pure speech amplitude spectrum and prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, where the linear filtering module is used to output the estimated pure speech amplitude spectrum;
  • a training module configured to calculate a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and to train the speech enhancement model according to the model loss.
  • Yet another aspect of the present disclosure provides an electronic device, comprising: a processor; and a memory in which executable instructions are stored; wherein, when the executable instructions are executed by the processor, the speech enhancement model training method and/or the speech enhancement method described in any of the foregoing embodiments is implemented.
  • Yet another aspect of the present disclosure provides a computer-readable storage medium for storing a program, wherein the program, when executed, implements the speech enhancement model training method and/or the speech enhancement method described in any of the foregoing embodiments.
  • The present disclosure makes the speech enhancement model include a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and performs speech signal enhancement through this model, thereby combining the signal-processing-based linear filtering method with the machine-learning-based speech enhancement method. The machine-learning-based method improves the speech enhancement performance of the signal-processing-based linear filtering method under complex non-stationary noise, and the signal-processing-based linear filtering method improves the generalization performance of the machine-learning-based method, so that speech enhancement is optimized.
  • FIG. 1 shows a flowchart of a method for training a speech enhancement model in an embodiment of the present disclosure
  • FIG. 2 shows a schematic structural diagram of a speech enhancement model in an embodiment of the present disclosure
  • FIG. 3 shows a flowchart of a speech enhancement method in an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of a module of an apparatus for training a speech enhancement model in an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of modules of a speech enhancement apparatus in an embodiment of the present disclosure
  • FIG. 6 shows a schematic structural diagram of an electronic device in an embodiment of the present disclosure.
  • FIG. 7 shows a schematic structural diagram of a computer-readable storage medium in an embodiment of the present disclosure.
  • FIG. 1 shows the main steps of the speech enhancement model training method in the embodiment.
  • the speech enhancement model provided by the present disclosure includes a speech prediction neural network module, a noise estimation neural network module and a linear filtering module.
  • the speech enhancement model training method includes:
  • Step S110: Acquire the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set, where the speech pair includes the associated pure speech signal and noisy speech signal.
  • Step S120: Obtain a first feature set and a second feature set according to the noisy speech amplitude spectrum.
  • Step S130: Input the first feature set into the speech prediction neural network module, which is used to output the first quasi-estimated pure speech amplitude spectrum and the prediction error.
  • Step S140: Input the second feature set into the noise estimation neural network module, which is used to output estimated noise energy.
  • Step S150: Input the first quasi-estimated pure speech amplitude spectrum and prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, which is used to output the estimated pure speech amplitude spectrum.
  • Step S160: Calculate a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and train the speech enhancement model according to the model loss.
  • In the method of the above embodiment, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, so that the speech signal is enhanced through the speech enhancement model. This combines the signal-processing-based linear filtering method with the machine-learning-based speech enhancement method: the machine-learning-based method improves the performance of the signal-processing-based linear filtering method under complex non-stationary noise, and the signal-processing-based linear filtering method improves the generalization performance of the machine-learning-based method, so that speech enhancement is optimized.
  • the speech training set may include multiple speech pairs. Each speech pair includes an associated pure speech signal and noisy speech signal, and the noisy speech signal is obtained by adding noise to the pure speech signal at a certain signal-to-noise ratio.
  • the added signal-to-noise ratio can be set as required.
  • the range of the added signal-to-noise ratio can be set to -10 to 30 dB, which is not a limitation of the present disclosure.
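  • As an illustration of how such a speech pair could be constructed, the following is a minimal sketch, not the disclosed implementation. It scales a noise signal so that the mixture has a requested signal-to-noise ratio. The function names `mix_at_snr` and `measured_snr_db`, the 16 kHz sampling rate, the sine wave standing in for pure speech, and the Gaussian noise are all assumptions made for the example; the noise is assumed to be at least as long as the pure signal.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` after scaling it to the requested SNR (in dB)."""
    clean = np.asarray(clean, dtype=np.float64)
    # Assumes the noise is at least as long as the clean signal.
    noise = np.asarray(noise, dtype=np.float64)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power implied by the requested SNR: SNR = 10 * log10(Ps / Pn).
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(target_noise_power / noise_power)
    return clean + scale * noise

def measured_snr_db(clean, noisy):
    """SNR of a mixture, given the clean signal it was built from."""
    residual = noisy - clean
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(residual ** 2))

# Hypothetical data: one second of a 440 Hz tone at an assumed 16 kHz rate.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```

  • Drawing `snr_db` at random from the -10 to 30 dB range mentioned above for each pair would be one way to vary the training conditions.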
  • obtaining the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set in step S110 can be achieved by the following steps, including: performing a time domain to frequency domain transformation step on the pure speech signal of the speech pair to obtain a pure speech amplitude spectrum; and performing a time domain to frequency domain transformation step on the noisy speech signal of the speech pair to obtain a noisy speech amplitude spectrum.
  • the transformation step from the time domain to the frequency domain is implemented in the following manner:
  • the to-be-processed speech signal x(t) is divided into frames.
  • t is the sampling point sequence number of the speech signal to be processed.
  • the time length of each frame may be 8 milliseconds to 32 milliseconds, and adjacent frames may overlap by 50% to 75%.
  • the length and degree of overlap of each frame can be set as required, and the present disclosure is not limited thereto.
  • adjacent frames are kept overlapping during frame division in order to exploit temporal correlation and to facilitate windowing for the Fourier transform in the subsequent steps.
  • Fourier transform is performed on each frame of the speech signal to be processed to obtain a frame Fourier spectrum of each frame.
  • short-time Fourier transform of 64-512 frequency bins may be performed on each frame.
  • the number of frequency points can be set as required, and the present disclosure is not limited thereto.
  • the frame Fourier spectra of the frames of the speech signal to be processed are spliced along the time axis to obtain the Fourier spectrum X(t, f) of the speech signal to be processed, where X(t, f) is a two-dimensional short-time Fourier spectrum in the complex domain, t is the frame number, and f is the frequency number.
  • the amplitude spectrum |X(t, f)| of the speech signal to be processed is generated from the amplitude of each frequency point of X(t, f).
  • the pure speech signal and the noisy speech signal can each be transformed from the time domain to the frequency domain in this way, thereby obtaining the pure speech amplitude spectrum and the noisy speech amplitude spectrum.
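  • The framing, Fourier transform, splicing, and amplitude steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the disclosed implementation: the function name `amplitude_and_phase`, the Hann window, the assumed 16 kHz sampling rate, the 512-sample frame (32 ms), and the 256-sample hop (50% overlap) are choices made for the example.

```python
import numpy as np

def amplitude_and_phase(x, frame_len=512, hop=256):
    """Frame a signal, transform each frame, and return |X(t, f)| and its phase.

    Both returned arrays have shape (n_frames, frame_len // 2 + 1),
    i.e. rows are frames t and columns are frequency points f.
    """
    x = np.asarray(x, dtype=np.float64)
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)  # assumed window; the source does not name one
    # Divide into overlapping frames and apply the window to each frame.
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # Per-frame Fourier spectra, stacked along the time axis: X(t, f).
    spectrum = np.fft.rfft(frames, axis=1)
    return np.abs(spectrum), np.angle(spectrum)

# Hypothetical input: a 1 kHz tone, which falls on bin 32 at 16 kHz with 512 points.
signal = np.sin(2 * np.pi * 1000 * np.arange(16000) / 16000)
amplitude, phase = amplitude_and_phase(signal)
```

  • Applying the same function to the pure and the noisy signal of a speech pair yields the two amplitude spectra; the phase is returned as well because the enhancement method later needs it for restoration.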
  • FIG. 2 shows a schematic structural diagram of a speech enhancement model in an embodiment of the present disclosure.
  • the speech enhancement model 200 includes a speech prediction neural network module 210, a noise estimation neural network module 220, and a linear filtering module 230.
  • the speech prediction neural network module 210 may be a time series neural network model.
  • the time series neural network model can be, for example, a multi-layer long-short-term memory recurrent neural network.
  • the number of nodes in each layer can be 256-1024 nodes, and the number of nodes in each layer is the same.
  • the time series neural network model provided by the present disclosure is not limited by this.
  • the first feature set may be a sequence of noise amplitude spectra of multiple consecutive frames.
  • the amplitude spectrum of each frame is an F × 1 column vector y(t), where F is the total number of frequency bands. Therefore, the noise amplitude spectrum sequence is Y_A[k] = [y(L·k), y(L·k+1), ..., y(L·k+L-1)], where k is the sequence number and L is the sequence length.
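  • The grouping of frames into sequences can be sketched as below, where sequence k collects frames L·k through L·k+L-1. This is an assumed illustration: the function name `split_into_sequences` is hypothetical, frames are stored as rows (the transpose of the column-vector layout in the text), and trailing frames that do not fill a whole sequence are simply dropped, which the source does not specify.

```python
import numpy as np

def split_into_sequences(amplitude, seq_len):
    """Split a (T, F) amplitude spectrum into (K, L, F) consecutive sequences.

    Sequence k holds frames L*k, L*k+1, ..., L*k+L-1. Leftover frames at the
    end are dropped (an assumption made for this sketch).
    """
    n_seq = amplitude.shape[0] // seq_len
    n_bands = amplitude.shape[1]
    return amplitude[: n_seq * seq_len].reshape(n_seq, seq_len, n_bands)

# Hypothetical spectrum: T = 7 frames, F = 3 frequency bands.
toy_amplitude = np.arange(21, dtype=np.float64).reshape(7, 3)
sequences = split_into_sequences(toy_amplitude, seq_len=2)
```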
  • the speech prediction neural network module (time series neural network model) 210 has two outputs: a first quasi-estimated pure speech amplitude spectrum and a prediction error.
  • the time series neural network model predicts the first quasi-estimated pure speech amplitude spectrum sequence and the corresponding prediction error sequence according to the input feature sequence in a sequence-to-sequence manner.
  • the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module is a first quasi-estimated pure speech amplitude spectrum sequence with the same dimension as the noise amplitude spectrum sequence Y_A[k], and the prediction error output by the speech prediction neural network module is a prediction error sequence with the same dimension as the noise amplitude spectrum sequence Y_A[k].
  • the first quasi-estimated pure speech amplitude spectrum sequence of the first quasi-estimated pure speech amplitude spectrum can be recorded as
  • the noise estimation neural network module 220 may be a multi-layer fully connected network.
  • the number of nodes in each layer of the multi-layer fully connected network may be 256-1024 nodes, and the number of nodes in each layer is the same, which is not limited in the present disclosure.
  • the second feature set includes the noisy speech amplitude spectrum of the current frame (frame t) and of the neighborhood window of the current frame, [t-N, t-N+1, ..., t, ..., t+N-1, t+N].
  • the output of the noise estimation neural network module 220 is the noise energy vector of the current frame, of dimension F × 1, whose f-th element represents the estimated noise energy of the f-th frequency band.
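  • A possible way to assemble the second feature set is to stack, for each frame t, the amplitude spectra of frames t-N through t+N into one input vector for the fully connected network. The sketch below is an assumption-laden illustration: the function name `context_window_features` is hypothetical, and repeating the first and last frame at the signal boundaries is one of several reasonable choices that the source does not address.

```python
import numpy as np

def context_window_features(amplitude, n_context):
    """Map a (T, F) amplitude spectrum to (T, (2N+1)*F) windowed features.

    Row t concatenates frames t-N, ..., t, ..., t+N. Frames outside the
    signal are filled by repeating the edge frame (assumed padding).
    """
    padded = np.pad(amplitude, ((n_context, n_context), (0, 0)), mode="edge")
    n_frames = amplitude.shape[0]
    width = 2 * n_context + 1
    return np.stack([padded[t : t + width].reshape(-1) for t in range(n_frames)])

# Hypothetical spectrum: T = 5 frames, F = 2 frequency bands, N = 1.
toy_spectrum = np.arange(10, dtype=np.float64).reshape(5, 2)
features = context_window_features(toy_spectrum, n_context=1)
```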
  • the linear filtering module 230 includes a Wiener filtering module 231, a Kalman gain calculation module 232, and a linear combination module 233.
  • the Wiener filter module 231 is configured to output the Wiener filter solution of the pure speech amplitude spectrum as the second quasi-estimated pure speech amplitude spectrum according to the estimated noise energy output by the noise estimation neural network module 220 and the second feature set.
  • the total speech energy of time-frequency points can be calculated according to the noisy speech amplitude spectrum of the second feature set
  • the Wiener filtering module 231 can obtain the Wiener filtering solution of the pure speech amplitude spectrum according to the following formula based on the minimum mean square error criterion:
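  • The formula referred to here is not reproduced in this text. As a hedged stand-in, the sketch below uses the textbook spectral Wiener gain, in which the total energy of a time-frequency point is the squared noisy amplitude and the speech energy is that total minus the estimated noise energy. Whether this matches the disclosed formula term for term is an assumption; the function name `wiener_amplitude`, the flooring at zero, and the small constant `eps` are choices made for the example.

```python
import numpy as np

def wiener_amplitude(noisy_amp, noise_energy, eps=1e-12):
    """Textbook Wiener estimate of the pure speech amplitude (assumed form).

    gain = max(|Y|^2 - noise_energy, 0) / |Y|^2, applied to |Y|.
    """
    total_energy = noisy_amp ** 2
    # Floor at zero so an over-estimated noise energy cannot yield a negative gain.
    speech_energy = np.maximum(total_energy - noise_energy, 0.0)
    gain = speech_energy / (total_energy + eps)
    return gain * noisy_amp

# Hypothetical values for two time-frequency points.
second_estimate = wiener_amplitude(np.array([2.0, 1.0]), np.array([1.0, 4.0]))
```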
  • the Kalman gain calculation module 232 is configured to output the optimal Kalman gain according to the prediction error output by the speech prediction neural network module 210 and the estimated noise energy output by the noise estimation neural network module 220 . Specifically, the Kalman gain calculation module 232 can determine the optimal Kalman gain according to the following formula based on the traditional Kalman filter theory:
  • where G is the optimal Kalman gain, and the other two quantities in the formula are the variance of the prediction error and the estimated noise energy at the time-frequency point of the f-th frequency band of the t-th frame.
  • the linear combination module 233 is configured to calculate, according to the optimal Kalman gain, the linear combination result of the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module and the second quasi-estimated pure speech amplitude spectrum, as the estimated pure speech amplitude spectrum. Specifically, the linear combination module 233 can calculate the estimated pure speech amplitude spectrum according to the following formula:
  • is the estimated pure speech amplitude spectrum
  • G is the optimal Kalman gain
  • G is the second weight of the second quasi-estimated pure speech amplitude spectrum
  • (1-G) is the first weight of the first quasi-estimated pure speech amplitude spectrum
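  • The weighting described above can be sketched in a few lines. The combination itself follows the text: weight (1-G) on the first quasi-estimate and weight G on the second. The gain formula is an assumption, since the original formula is not reproduced here: it is the standard scalar Kalman gain, prediction-error variance divided by the sum of that variance and the noise energy. The function names `kalman_gain` and `combine_estimates` and the constant `eps` are hypothetical.

```python
import numpy as np

def kalman_gain(pred_err_var, noise_energy, eps=1e-12):
    """Scalar Kalman gain per time-frequency point (assumed textbook form)."""
    return pred_err_var / (pred_err_var + noise_energy + eps)

def combine_estimates(first_est, second_est, gain):
    """Weighted sum: (1 - G) * first quasi-estimate + G * second quasi-estimate."""
    return (1.0 - gain) * first_est + gain * second_est

# Hypothetical values for three time-frequency points.
pred_err_var = np.array([1.0, 0.0, 3.0])   # from the speech prediction network
noise_energy = np.array([1.0, 2.0, 0.0])   # from the noise estimation network
first_est = np.array([2.0, 2.0, 2.0])      # network prediction
second_est = np.array([4.0, 4.0, 4.0])     # Wiener filtering solution

gain = kalman_gain(pred_err_var, noise_energy)
estimated = combine_estimates(first_est, second_est, gain)
```

  • Under this assumed gain, a confident network prediction (small prediction-error variance) pushes G toward 0 and the result toward the first quasi-estimate, while low estimated noise pushes G toward 1 and the result toward the Wiener solution.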
  • step S160 in FIG. 1, which calculates a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum (for example, through a loss function) and trains the speech enhancement model according to the model loss, can be implemented through the following step: using a back propagation algorithm to optimize the parameters of the speech prediction neural network module and the noise estimation neural network module.
  • the speech enhancement model can adaptively learn the parameters in the speech prediction neural network module and the noise estimation neural network module based on the back propagation algorithm of the neural network under the criterion of minimum mean square error.
  • Embodiments of the present disclosure further provide a speech enhancement method for enhancing speech signals based on the trained speech enhancement model.
  • FIG. 3 shows a flowchart of a speech enhancement method in an embodiment of the present disclosure, as shown in FIG. 3 , including:
  • Step S310: Obtain the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum of the to-be-enhanced speech signal.
  • step S310 may perform short-time Fourier transform on the to-be-enhanced speech signal to obtain the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum.
  • Step S320: Obtain a first feature set and a second feature set according to the to-be-enhanced speech amplitude spectrum.
  • the steps of obtaining the first feature set and the second feature set based on the to-be-enhanced speech amplitude spectrum may be the same as the method of obtaining the first feature set and the second feature set based on the noisy speech amplitude spectrum, which will not be repeated here.
  • Step S330: Input the first feature set and the second feature set into a trained speech enhancement model. The speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The first feature set serves as the input of the speech prediction neural network module, which outputs the first quasi-estimated pure speech amplitude spectrum and the prediction error. The second feature set serves as the input of the noise estimation neural network module, which outputs the estimated noise energy. The first quasi-estimated pure speech amplitude spectrum and prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which outputs the estimated pure speech amplitude spectrum of the speech signal to be enhanced.
  • the structure of the speech enhancement model, the specific implementation of the speech prediction neural network module, the noise estimation neural network module, and the linear filtering module can be implemented with reference to FIG. 2 and the related description of FIG. 2 .
  • the speech enhancement model can be obtained by training with the training method shown in FIG. 1 .
  • Step S340: Perform restoration according to the estimated pure speech amplitude spectrum and the to-be-enhanced speech phase spectrum to obtain an enhanced speech signal of the to-be-enhanced speech signal.
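  • One way to carry out this restoration is to recombine the estimated amplitude with the original phase into a complex spectrum, invert each frame, and overlap-add the frames. The sketch below is an assumed illustration, not the disclosed procedure: the function names `restore_frames` and `overlap_add` are hypothetical, and window compensation after overlap-add is deliberately left out.

```python
import numpy as np

def restore_frames(est_amp, phase, frame_len):
    """Rebuild time-domain frames from amplitude and phase, both (T, F)."""
    spectrum = est_amp * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=frame_len, axis=1)

def overlap_add(frames, hop):
    """Sum frames of shape (T, frame_len) back into one signal at the given hop."""
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out

# Hypothetical check: with an unmodified amplitude, the frames are recovered.
rng = np.random.default_rng(0)
toy_frames = rng.standard_normal((3, 8))
toy_spec = np.fft.rfft(toy_frames, axis=1)
restored = restore_frames(np.abs(toy_spec), np.angle(toy_spec), frame_len=8)
signal_out = overlap_add(restored, hop=4)
```

  • In actual use, `est_amp` would be the estimated pure speech amplitude spectrum from the model rather than the unmodified amplitude used in this check.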
  • In the speech enhancement method of this embodiment, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, so that the speech signal is enhanced through the speech enhancement model. This combines the signal-processing-based linear filtering method with the machine-learning-based speech enhancement method: the machine-learning-based method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine-learning-based method, enabling optimization of speech enhancement.
  • Embodiments of the present disclosure further provide a speech enhancement model training apparatus, which can be used to implement the speech enhancement model training method described in any of the foregoing embodiments.
  • the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module and a linear filtering module.
  • FIG. 4 shows a schematic diagram of modules of a speech enhancement model training apparatus in an embodiment of the present disclosure.
  • the speech enhancement model training apparatus 410 in this embodiment includes a first acquisition module 411, a second acquisition module 412, a first input module 413, a second input module 414, an output module 415, and a training module 416.
  • the first acquisition module 411 is configured to acquire the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set, where the speech pair includes the associated pure speech signal and the noisy speech signal.
  • the second obtaining module 412 is configured to obtain a first feature set and a second feature set according to the noisy speech amplitude spectrum.
  • the first input module 413 is configured to input the first feature set into the speech prediction neural network module, and the speech prediction neural network module is configured to output a first quasi-estimated pure speech amplitude spectrum and a prediction error.
  • the second input module 414 is configured to input the second feature set into the noise estimation neural network module for outputting estimated noise energy.
  • the output module 415 is configured to input the first quasi-estimated pure speech amplitude spectrum and prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module into the linear filtering module, and the linear filtering module Used to output estimated pure speech amplitude spectrum.
  • the training module 416 is configured to calculate a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and to train the speech enhancement model according to the model loss. For the specific principles of each module, reference may be made to the above-mentioned embodiments of any speech enhancement model training method, and the description will not be repeated here.
  • An embodiment of the present disclosure further provides a speech enhancement apparatus, which can be used to implement the speech enhancement method described in any of the foregoing embodiments.
  • FIG. 5 shows the main modules of the speech enhancement apparatus in the embodiment.
  • the speech enhancement apparatus 420 in this embodiment includes a third acquisition module 421 , a fourth acquisition module 422 , an enhancement module 423 and a restoration module 424 .
  • the third acquisition module 421 is configured to acquire the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum of the to-be-enhanced speech signal.
  • The fourth acquisition module 422 is configured to obtain a first feature set and a second feature set according to the to-be-enhanced speech amplitude spectrum.
  • The enhancement module 423 is configured to input the first feature set and the second feature set into a trained speech enhancement model that includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The first feature set is the input of the speech prediction neural network module, which outputs the first quasi-estimated pure speech amplitude spectrum and the prediction error. The second feature set is the input of the noise estimation neural network module, which outputs the estimated noise energy. The first quasi-estimated pure speech amplitude spectrum, the prediction error, and the estimated noise energy are input to the linear filtering module, which outputs the estimated pure speech amplitude spectrum of the to-be-enhanced speech signal.
  • the restoration module 424 is configured to perform restoration according to the estimated pure speech amplitude spectrum and the to-be-enhanced speech phase spectrum to obtain an enhanced speech signal of the to-be-enhanced speech signal.
  • In the speech enhancement model training apparatus and the speech enhancement apparatus of this embodiment, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model.
  • This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
  • FIG. 4 and FIG. 5 only schematically illustrate the speech enhancement model training apparatus and the speech enhancement apparatus provided by the present disclosure. Without departing from the concept of the present disclosure, the splitting, merging, and adding of modules all fall within the protection scope of the present disclosure.
  • the speech enhancement model training apparatus and speech enhancement apparatus provided by the present disclosure may be implemented by software, hardware, firmware, plug-ins, and any combination thereof, and the present disclosure is not limited thereto.
  • Embodiments of the present disclosure further provide an electronic device, including a processor and a memory, where executable instructions are stored in the memory, and when the executable instructions are executed by the processor, the speech enhancement model training method and/or the speech enhancement method described in any of the foregoing embodiments is implemented.
  • In the electronic device of the present disclosure, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
  • FIG. 6 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure. It should be understood that FIG. 6 only schematically shows various modules, and these modules may be virtual software modules or actual hardware modules. Splitting and addition of other modules are within the scope of the present disclosure.
  • electronic device 500 takes the form of a general-purpose computing device.
  • Components of the electronic device 500 include, but are not limited to, at least one processing unit 510, at least one storage unit 520, a bus 530 connecting different platform components (including the storage unit 520 and the processing unit 510), a display unit 540, and the like.
  • the storage unit stores program codes, and the program codes can be executed by the processing unit 510, so that the processing unit 510 executes the steps of the speech enhancement model training method and/or the speech enhancement method described in any of the foregoing embodiments.
  • the processing unit 510 may perform the steps shown in FIGS. 1 and 3 .
  • The storage unit 520 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
  • The storage unit 520 may also include a program/utility 5204 having one or more program modules 5205. Such program modules 5205 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 500 may also communicate with one or more external devices 600, which may be one or more of a keyboard, a pointing device, a Bluetooth device, and the like. These external devices 600 enable the user to interact with the electronic device 500 .
  • The electronic device 500 can also communicate with one or more other computing devices, such as routers and modems. Such communication may occur through the input/output (I/O) interface 550.
  • the electronic device 500 may communicate with one or more networks (eg, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 560 .
  • Network adapter 560 may communicate with other modules of electronic device 500 through bus 530 .
  • Although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms.
  • Embodiments of the present disclosure further provide a computer-readable storage medium for storing a program, and when the program is executed, the speech enhancement model training method and/or speech enhancement method described in any of the foregoing embodiments is implemented.
  • In some possible implementations, various aspects of the present disclosure can also be implemented in the form of a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the speech enhancement model training method and/or the speech enhancement method described in any of the foregoing embodiments.
  • With the computer-readable storage medium of the present disclosure, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
  • FIG. 7 is a schematic structural diagram of a computer-readable storage medium of the present disclosure.
  • A program product 700 for implementing the above method according to an embodiment of the present disclosure is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and it can run on a terminal device such as a personal computer.
  • the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of readable storage media include, but are not limited to, electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable Read-only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • A readable signal medium can also be any readable medium other than a readable storage medium that can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a readable storage medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • Where a remote computing device is involved, the remote computing device may be connected to the user computing device over any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device, for example via the Internet through an Internet service provider.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Filters That Use Time-Delay Elements (AREA)

Abstract

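A speech enhancement method, a model training method, and related devices. The speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The model training method includes: acquiring the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in a training set (S110); obtaining a first feature set and a second feature set according to the noisy speech amplitude spectrum (S120); inputting the first feature set into the speech prediction neural network module to output a first quasi-estimated pure speech amplitude spectrum and a prediction error (S130); inputting the second feature set into the noise estimation neural network module to output an estimated noise energy (S140); inputting the first quasi-estimated pure speech amplitude spectrum, the prediction error, and the estimated noise energy into the linear filtering module, which outputs an estimated pure speech amplitude spectrum (S150); and calculating a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum to train the speech enhancement model (S160). The model enables optimized speech enhancement.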

Description

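Speech enhancement method, model training method, and related devices
Cross-reference to related applications
The present disclosure claims priority to Chinese patent application No. 202110129897.8, filed on January 29, 2021 and entitled "Speech enhancement method, model training method, and related devices", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of Internet technologies, and in particular to a speech enhancement method, a model training method, and related devices.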
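Background
With the rapid development of speech recognition technology, speech recognition has been applied in many scenarios such as smart hardware and intelligent telephone customer service. Because the accuracy of its results is closely tied to work efficiency and the user interaction experience, expectations for recognition quality keep rising. Since the application scenarios of speech recognition are mostly tied to users' daily life and work, there is no guarantee that the input speech signal is pure and noise-free. When speech with a noisy background is recognized, the noise degrades the quality of the speech signal, which makes the recognition result inaccurate and lowers the user's efficiency in human-computer interaction and audio-to-text transcription. Speech enhancement technology, which addresses audio noise interference in complex noise environments, has therefore become a key part of speech recognition technology.
The purpose of speech enhancement is to process speech that contains noise and to output the processed pure speech audio. Its main approaches fall into two categories: linear filtering methods based on signal processing, such as Wiener filtering, Kalman filtering, and filters based on the minimum mean square error; and methods based on machine learning, such as methods based on recurrent neural networks, convolutional neural networks, convolutional-recurrent neural networks, and UNET networks.
Linear filtering methods based on signal processing first assume statistical models of speech and noise, solve for the optimal filter under a given optimization criterion, and apply that filter to the noisy audio to enhance the speech. Methods based on machine learning use a large amount of training data and a chosen network structure to train, within a supervised learning framework, a nonlinear function from noisy speech to pure speech, thereby achieving speech enhancement.
Although methods based on linear filters do not require large-scale training data, they usually design the optimization function from expert knowledge. Under some conditions the model assumptions for speech or noise are too idealized, for example the assumption that the noise is stationary, so performance drops markedly in real scenarios, especially under non-stationary noise. Methods based on machine learning [...] performance under non-stationary noise. However, their performance is clearly limited by the variability of the noise in the training corpus; when the training corpus is limited, overfitting often occurs, which leads to poor generalization to out-of-set noise. The main reason is that machine learning methods rely too heavily on existing neural network structures and do not incorporate traditional expert knowledge from signal processing, which makes it difficult to improve network performance by designing regularization methods consistent with optimal speech signal processing.
How to optimize speech enhancement so that it maintains good performance under both stationary noise and complex non-stationary noise, while also improving generalization, is therefore a technical problem that those skilled in the art urgently need to solve.
It should be noted that the information disclosed in the Background section above is only intended to aid understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.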
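Summary
In view of this, the present disclosure provides a speech enhancement method, a model training method, and related devices, which optimize speech enhancement so that it maintains good performance under both stationary noise and complex non-stationary noise while also improving generalization.
One aspect of the present disclosure provides a speech enhancement model training method, where the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
the speech enhancement model training method including:
acquiring the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in a speech training set, where the speech pair includes an associated pure speech signal and noisy speech signal;
obtaining a first feature set and a second feature set according to the noisy speech amplitude spectrum;
inputting the first feature set into the speech prediction neural network module, where the speech prediction neural network module is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error;
inputting the second feature set into the noise estimation neural network module, where the noise estimation neural network module is used to output an estimated noise energy;
inputting the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, where the linear filtering module is used to output an estimated pure speech amplitude spectrum;
calculating a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and training the speech enhancement model according to the model loss.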
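In some embodiments of the present disclosure, acquiring the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set includes:
performing a time-domain to frequency-domain transform step on the pure speech signal of the speech pair;
performing the time-domain to frequency-domain transform step on the noisy speech signal of the speech pair,
where the time-domain to frequency-domain transform step includes:
dividing the speech signal to be processed into frames;
performing a Fourier transform on each frame of the speech signal to be processed to obtain the frame Fourier spectrum of each frame;
concatenating the frame Fourier spectra of the frames along the time axis to obtain the Fourier spectrum of the speech signal to be processed;
generating the amplitude spectrum of the speech signal to be processed from the amplitude at each frequency bin of its Fourier spectrum.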
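In some embodiments of the present disclosure, the speech prediction neural network module is a time-series neural network model, and the first feature set is a noise amplitude spectrum sequence of multiple consecutive frames. The first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module is a first quasi-estimated pure speech amplitude spectrum sequence with the same dimensions as the noise amplitude spectrum sequence, and the prediction error output by the speech prediction neural network module is a prediction error sequence with the same dimensions as the noise amplitude spectrum sequence.
In some embodiments of the present disclosure, the noise estimation neural network module is a multi-layer fully connected network, and the second feature set includes the noisy speech amplitude spectra of the current frame and of the neighborhood window of the current frame.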
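In some embodiments of the present disclosure, the linear filtering module includes a Wiener filtering module, a Kalman gain calculation module, and a linear combination module,
the Wiener filtering module is used to output, according to the estimated noise energy output by the noise estimation neural network module and the second feature set, the Wiener filtering solution of the pure speech amplitude spectrum as a second quasi-estimated pure speech amplitude spectrum;
the Kalman gain calculation module is used to output an optimal Kalman gain G according to the prediction error output by the speech prediction neural network module and the estimated noise energy output by the noise estimation neural network module;
the linear combination module is used to calculate, according to the optimal Kalman gain G, a linear combination of the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module and the second quasi-estimated pure speech amplitude spectrum, as the estimated pure speech amplitude spectrum.
In some embodiments of the present disclosure, calculating, according to the optimal Kalman gain G, the linear combination of the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module and the second quasi-estimated pure speech amplitude spectrum as the estimated pure speech amplitude spectrum includes:
using (1-G) as the first weight, applied to the first quasi-estimated pure speech amplitude spectrum;
using the optimal Kalman gain G as the second weight, applied to the second quasi-estimated pure speech amplitude spectrum;
calculating the weighted sum of the first quasi-estimated pure speech amplitude spectrum and the second quasi-estimated pure speech amplitude spectrum according to the first weight and the second weight, as the estimated pure speech amplitude spectrum.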
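In some embodiments of the present disclosure, calculating the model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and training the speech enhancement model according to the model loss, includes:
optimizing the parameters of the speech prediction neural network module and the noise estimation neural network module with the back-propagation algorithm.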
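According to another aspect of the present disclosure, a speech enhancement method is also provided, including:
acquiring the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum of a to-be-enhanced speech signal;
obtaining a first feature set and a second feature set according to the to-be-enhanced speech amplitude spectrum;
inputting the first feature set and the second feature set into a trained speech enhancement model, where the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module; the first feature set is the input of the speech prediction neural network module, which is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error; the second feature set is the input of the noise estimation neural network module, which is used to output an estimated noise energy; and the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which is used to output the estimated pure speech amplitude spectrum of the to-be-enhanced speech signal;
performing restoration according to the estimated pure speech amplitude spectrum and the to-be-enhanced speech phase spectrum to obtain the enhanced speech signal of the to-be-enhanced speech signal.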
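According to another aspect of the present disclosure, a speech enhancement model training apparatus is also provided, where the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
the speech enhancement model training apparatus including:
a first acquisition module, configured to acquire the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in a speech training set, where the speech pair includes an associated pure speech signal and noisy speech signal;
a second acquisition module, configured to obtain a first feature set and a second feature set according to the noisy speech amplitude spectrum;
a first input module, configured to input the first feature set into the speech prediction neural network module, where the speech prediction neural network module is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error;
a second input module, configured to input the second feature set into the noise estimation neural network module, where the noise estimation neural network module is used to output an estimated noise energy;
an output module, configured to input the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, where the linear filtering module is used to output an estimated pure speech amplitude spectrum;
a training module, configured to calculate a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and to train the speech enhancement model according to the model loss.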
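Yet another aspect of the present disclosure provides an electronic device, including: a processor; and a memory storing executable instructions; where, when the executable instructions are executed by the processor, the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments is implemented.
Yet another aspect of the present disclosure provides a computer-readable storage medium for storing a program, where, when the program is executed, the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments is implemented.
Compared with the prior art, the beneficial effects of the present disclosure include at least the following:
In the present disclosure, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.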
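Brief description of the drawings
The accompanying drawings here are incorporated into and form part of this specification, illustrate embodiments consistent with the present disclosure, and together with the specification serve to explain the principles of the present disclosure. Obviously, the drawings described below are only some embodiments of the present disclosure; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of the speech enhancement model training method in an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of the speech enhancement model in an embodiment of the present disclosure;
FIG. 3 is a flowchart of the speech enhancement method in an embodiment of the present disclosure;
FIG. 4 is a schematic module diagram of the speech enhancement model training apparatus in an embodiment of the present disclosure;
FIG. 5 is a schematic module diagram of the speech enhancement apparatus in an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of the electronic device in an embodiment of the present disclosure; and
FIG. 7 is a schematic structural diagram of the computer-readable storage medium in an embodiment of the present disclosure.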
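Detailed description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure is thorough and complete, and fully conveys the concepts of the example embodiments to those skilled in the art.
In addition, the drawings are only schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, so their repeated description is omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The step numbers in the following embodiments are only used to denote different operations and do not strictly limit the order in which the steps are executed. Terms such as "first" and "second" used in the description do not denote any order, quantity, or importance; they are only used to distinguish different components. It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features of different embodiments may be combined with one another.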
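FIG. 1 shows the main steps of the speech enhancement model training method in an embodiment. Referring to FIG. 1, the speech enhancement model provided by the present disclosure includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The speech enhancement model training method includes:
Step S110: acquire the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set, where the speech pair includes an associated pure speech signal and noisy speech signal;
Step S120: obtain a first feature set and a second feature set according to the noisy speech amplitude spectrum;
Step S130: input the first feature set into the speech prediction neural network module, which is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error;
Step S140: input the second feature set into the noise estimation neural network module, which is used to output an estimated noise energy;
Step S150: input the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, which is used to output an estimated pure speech amplitude spectrum; and
Step S160: calculate a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and train the speech enhancement model according to the model loss.
In the method of the above embodiment, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
The speech enhancement model training method is described in detail below with reference to FIG. 2 and specific examples.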
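Specifically, the speech training set in step S110 may include multiple speech pairs. Each speech pair includes an associated pure speech signal and noisy speech signal, and the noisy speech signal is obtained by adding noise to the pure speech signal at a given signal-to-noise ratio. The signal-to-noise ratio can be set as needed; for example, in some specific embodiments its range may be set to -10 to 30 dB. The present disclosure is not limited to this.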
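Building one noisy/pure pair at a chosen signal-to-noise ratio can be sketched as follows. This is a minimal illustration rather than the procedure of the disclosure: the function name `mix_at_snr` is hypothetical, and the sketch assumes the noise has non-zero power and that the SNR is measured over the whole utterance.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it.

    Assumption: both inputs are equal-length lists of samples and the
    noise power is non-zero.
    """
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    # SNR(dB) = 10 * log10(p_clean / p_target)  =>  solve for p_target.
    p_target = p_clean / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(p_target / p_noise)
    return [s + scale * n for s, n in zip(clean, noise)]
```

Drawing `snr_db` at random from the -10 to 30 dB range for every pair is one way to cover the range mentioned above.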
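Specifically, step S110, acquiring the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set, can be implemented through the following steps: performing the time-domain to frequency-domain transform step on the pure speech signal of the speech pair to obtain the pure speech amplitude spectrum; and performing the time-domain to frequency-domain transform step on the noisy speech signal of the speech pair to obtain the noisy speech amplitude spectrum.
Specifically, with the pure speech signal or the noisy speech signal taken as the speech signal to be processed, the time-domain to frequency-domain transform step is implemented as follows:
First, the speech signal to be processed, x(t), is divided into frames, where t is the sample index of the speech signal to be processed. In some specific implementations, each frame may be 8 to 32 milliseconds long, with 50% to 75% overlap between adjacent frames. The frame length and the degree of overlap can be set as needed, and the present disclosure is not limited to this. Keeping some overlap between frames exploits temporal correlation and makes it easier to apply a window for the Fourier transform in the next step. Second, a Fourier transform is performed on each frame of the speech signal to be processed to obtain the frame Fourier spectrum of each frame. In some specific implementations, a short-time Fourier transform with 64 to 512 frequency bins may be applied to each frame. The number of frequency bins can be set as needed, and the present disclosure is not limited to this. Third, the frame Fourier spectra of the frames are concatenated along the time axis to obtain the Fourier spectrum X(t,f) of the speech signal to be processed, where X(t,f) is a two-dimensional short-time Fourier spectrum in the complex domain, t is the frame index, and f is the frequency index. Finally, the amplitude spectrum |X(t,f)| of the speech signal to be processed is generated from the amplitude at each frequency bin of the Fourier spectrum X(t,f).
Based on the above steps, the time-domain to frequency-domain transform can be applied to the pure speech signal and to the noisy speech signal separately, yielding the pure speech amplitude spectrum |S(t,f)| and the noisy speech amplitude spectrum |Y(t,f)|.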
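The framing, per-frame Fourier transform, and amplitude extraction can be sketched in plain Python as follows. The sketch assumes a periodic Hann window, 50% overlap, and a naive DFT for readability; the function names and the tiny frame length are illustrative and are not taken from the disclosure.

```python
import cmath
import math


def frame_signal(x, frame_len, hop):
    """Split samples into overlapping frames (a trailing partial frame is dropped)."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]


def dft_one_sided(frame):
    """Naive DFT of one Hann-windowed frame; returns bins 0..N/2."""
    n = len(frame)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]
    xs = [v * w for v, w in zip(frame, window)]
    return [sum(xs[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
            for k in range(n // 2 + 1)]


def stft(x, frame_len=8, hop=4):
    """Stack the per-frame spectra along the time axis: spec[t][f] = X(t, f)."""
    return [dft_one_sided(f) for f in frame_signal(x, frame_len, hop)]


def amplitude_and_phase(spec):
    """Return |X(t, f)| and the phase of X(t, f) as two nested lists."""
    amp = [[abs(c) for c in row] for row in spec]
    phase = [[cmath.phase(c) for c in row] for row in spec]
    return amp, phase
```

At a 16 kHz sampling rate, a 32 ms frame would correspond to `frame_len=512` and `hop=256`; the small defaults here only keep the example fast.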
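FIG. 2 is a schematic structural diagram of the speech enhancement model in an embodiment of the present disclosure. The speech enhancement model 200 includes a speech prediction neural network module 210, a noise estimation neural network module 220, and a linear filtering module 230.
The speech prediction neural network module 210 may be a time-series neural network model, for example a multi-layer long short-term memory (LSTM) recurrent neural network. In this embodiment, each layer of the multi-layer LSTM recurrent neural network may have 256 to 1024 nodes, with the same number of nodes in every layer. The time-series neural network model provided by the present disclosure is not limited to this.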
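Because the speech prediction neural network module 210 is a time-series neural network model, the first feature set may be a noise amplitude spectrum sequence of multiple consecutive frames. Specifically, the t-th frame of the noise amplitude spectrum sequence can be defined as y(t) = [|Y(t,1)|, |Y(t,2)|, …, |Y(t,F)|]^T, where F is the total number of frequency bands and |Y(t,f)| is the noisy speech amplitude spectrum. The noise amplitude spectrum sequence is then Y_A[k] = [y(L×k), y(L×k+1), …, y(L×k+L-1)], where k is the sequence number and L is the sequence length.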
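Grouping frames into the sequences Y_A[k] can be sketched as below. The helper name `build_sequences` is hypothetical, and dropping leftover frames that do not fill a complete sequence is an assumption of this sketch.

```python
def build_sequences(amp, seq_len):
    """Group frames y(t) into non-overlapping sequences Y_A[k] of length L.

    amp[t] is the amplitude vector y(t) of frame t; sequence k covers
    frames L*k .. L*k + L - 1. Leftover frames at the end are dropped.
    """
    n_seq = len(amp) // seq_len
    return [amp[seq_len * k: seq_len * k + seq_len] for k in range(n_seq)]
```

Each returned sequence has shape L × F, matching the input expected by a sequence-to-sequence model.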
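The speech prediction neural network module (time-series neural network model) 210 has two outputs: the first quasi-estimated pure speech amplitude spectrum and the prediction error. Working sequence-to-sequence, the time-series neural network model predicts, from the input feature sequence, the first quasi-estimated pure speech amplitude spectrum sequence and the corresponding prediction error sequence. The first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module is a first quasi-estimated pure speech amplitude spectrum sequence with the same dimensions as the noise amplitude spectrum sequence Y_A[k], and the prediction error output by the speech prediction neural network module is a prediction error sequence with the same dimensions as Y_A[k]. Each first quasi-estimated pure speech amplitude spectrum in the sequence can be denoted |S_NN(t,f)|; the values of the prediction error sequence are the variances of the prediction error: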
Figure PCTCN2022073197-appb-000001
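The noise estimation neural network module 220 may be a multi-layer fully connected network. In this embodiment, each layer of the multi-layer fully connected network may have 256 to 1024 nodes, with the same number of nodes in every layer; the present disclosure is not limited to this. In this embodiment, the second feature set includes the noisy speech amplitude spectra of the current frame (frame t) and of its neighborhood window [t-N, t-N+1, …, t, …, t+N-1, t+N]^T, that is, Y_B(t) = [y(t-N), y(t-N+1), …, y(t+N-1), y(t+N)], where N is the width of the neighborhood window. The output of the noise estimation neural network module 220 is the noise energy vector of the current frame, with dimension F×1. The f-th element of the noise energy vector represents the estimated noise energy of the f-th frequency band, denoted as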
Figure PCTCN2022073197-appb-000002
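Assembling the neighborhood-window input Y_B(t) can be sketched as follows. The source does not say how frames near the start or end of the signal are handled, so the edge clamping below is an assumption, as is the function name `context_window`.

```python
def context_window(amp, t, n):
    """Concatenate frames t-n .. t+n into one flat feature vector.

    Assumption: indices outside the signal are clamped to the first or
    last frame, so every frame gets a vector of length (2n + 1) * F.
    """
    last = len(amp) - 1
    features = []
    for i in range(t - n, t + n + 1):
        features.extend(amp[min(max(i, 0), last)])
    return features
```

The flat vector is a natural input for a fully connected network, which then outputs one noise energy value per frequency band.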
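The linear filtering module 230 includes a Wiener filtering module 231, a Kalman gain calculation module 232, and a linear combination module 233.
The Wiener filtering module 231 is used to output, according to the estimated noise energy output by the noise estimation neural network module 220 and the second feature set, the Wiener filtering solution of the pure speech amplitude spectrum as the second quasi-estimated pure speech amplitude spectrum.
Specifically, the total speech energy of each time-frequency point can be calculated from the noisy speech amplitude spectrum of the second feature set:
Figure PCTCN2022073197-appb-000003
Based on the minimum mean square error criterion, the Wiener filtering module 231 can obtain the Wiener filtering solution of the pure speech amplitude spectrum according to the following formula:
Figure PCTCN2022073197-appb-000004
where |S_Wiener(t,f)| is the Wiener filtering solution of the pure speech amplitude spectrum,
Figure PCTCN2022073197-appb-000005
is the total speech energy of the time-frequency point at frame t and frequency band f,
Figure PCTCN2022073197-appb-000006
is the noise energy of the time-frequency point at frame t and frequency band f, and |Y(t,f)| is the noisy speech amplitude spectrum.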
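The exact formula is carried by an equation image that is not reproduced in this text, so the following is only a sketch under a stated assumption: it uses the textbook Wiener gain (total energy minus noise energy, divided by total energy) applied to the noisy amplitude. The function name, the argument names, and the `floor` parameter are illustrative.

```python
def wiener_estimate(noisy_amp, total_energy, noise_energy, floor=0.0):
    """Apply a per-band Wiener gain to the noisy amplitudes of one frame.

    Assumed form (standard Wiener gain, not confirmed by the source):
        gain = (total_energy - noise_energy) / total_energy
    The gain is floored so that an over-estimated noise energy cannot
    produce a negative amplitude.
    """
    estimate = []
    for y, e_total, e_noise in zip(noisy_amp, total_energy, noise_energy):
        gain = max((e_total - e_noise) / e_total, floor) if e_total > 0 else floor
        estimate.append(gain * y)
    return estimate
```

In the model described here, `noise_energy` would come from the noise estimation network, which is what lets the loss gradient reach that network through the filter.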
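The Kalman gain calculation module 232 is used to output the optimal Kalman gain according to the prediction error output by the speech prediction neural network module 210 and the estimated noise energy output by the noise estimation neural network module 220. Specifically, based on conventional Kalman filtering theory, the Kalman gain calculation module 232 can determine the optimal Kalman gain according to the following formula:
Figure PCTCN2022073197-appb-000007
where G is the optimal Kalman gain,
Figure PCTCN2022073197-appb-000008
is the variance of the prediction error of the time-frequency point at frame t and frequency band f, and
Figure PCTCN2022073197-appb-000009
is the noise energy of the time-frequency point at frame t and frequency band f.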
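The gain formula is also carried by an equation image, so the sketch below assumes the scalar Kalman gain from standard filtering theory: the prediction error variance divided by the sum of the prediction error variance and the noise energy. The function name and the `eps` guard are illustrative additions.

```python
def kalman_gain(pred_err_var, noise_energy, eps=1e-12):
    """Per-band scalar Kalman gain for one frame.

    Assumed form (standard scalar Kalman gain, not confirmed by the source):
        G = var_pred / (var_pred + noise_energy)
    `eps` avoids division by zero when both terms are zero.
    """
    return [p / (p + r + eps) for p, r in zip(pred_err_var, noise_energy)]
```

Under this form G stays between 0 and 1: a large prediction error variance pushes G toward 1, and a large noise energy pushes it toward 0.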
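The linear combination module 233 is used to calculate, according to the optimal Kalman gain, a linear combination of the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module and the second quasi-estimated pure speech amplitude spectrum, as the estimated pure speech amplitude spectrum. Specifically, the linear combination module 233 can calculate the estimated pure speech amplitude spectrum according to the following formula:
|S_o(t,f)| = G × |S_Wiener(t,f)| + (1 - G) × |S_NN(t,f)|
where |S_o(t,f)| is the estimated pure speech amplitude spectrum; G is the optimal Kalman gain and serves as the second weight, applied to the second quasi-estimated pure speech amplitude spectrum |S_Wiener(t,f)|; and (1-G) is the first weight, applied to the first quasi-estimated pure speech amplitude spectrum |S_NN(t,f)|.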
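The weighted sum above maps directly to code. The sketch below applies it per frequency band for one frame; the function and argument names are illustrative.

```python
def combine(gain, s_wiener, s_nn):
    """Compute G * |S_Wiener| + (1 - G) * |S_NN| for each frequency band."""
    return [g * w + (1.0 - g) * n for g, w, n in zip(gain, s_wiener, s_nn)]
```

With G = 0 the output is the network prediction alone, and with G = 1 it is the Wiener solution alone, so the gain acts as a per-band blend between the two estimates.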
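In some embodiments of the present disclosure, step S160 in FIG. 1, calculating the model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum (for example, through a loss function) and training the speech enhancement model according to the model loss, can be implemented as follows: the parameters of the speech prediction neural network module and the noise estimation neural network module are optimized with the back-propagation algorithm. Specifically, under the minimum mean square error criterion, the speech enhancement model can adaptively learn the parameters of the speech prediction neural network module and the noise estimation neural network module through neural network back-propagation.
The above only schematically illustrates several implementations of the present disclosure. The present disclosure is not limited to them, and the implementations may be used alone or in combination.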
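The minimum mean square error criterion can be sketched as a loss over all time-frequency points. This shows only the loss value; in practice an automatic differentiation framework would back-propagate it through the linear filtering module into both networks. The function name is illustrative.

```python
def mse_loss(est_amp, clean_amp):
    """Mean squared error between estimated and pure amplitude spectra.

    Both arguments are nested lists indexed as [frame][frequency band].
    """
    total = 0.0
    count = 0
    for est_row, clean_row in zip(est_amp, clean_amp):
        for est, clean in zip(est_row, clean_row):
            total += (est - clean) ** 2
            count += 1
    return total / count
```

Because the linear filtering module contains no trainable parameters of its own, only the two networks are updated, which matches the description above.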
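Embodiments of the present disclosure further provide a speech enhancement method for enhancing a speech signal with the trained speech enhancement model. FIG. 3 is a flowchart of the speech enhancement method in an embodiment of the present disclosure. As shown in FIG. 3, the method includes:
Step S310: acquire the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum of the to-be-enhanced speech signal.
Specifically, step S310 may apply a short-time Fourier transform to the to-be-enhanced speech signal to obtain the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum.
Step S320: obtain a first feature set and a second feature set according to the to-be-enhanced speech amplitude spectrum.
Specifically, the first feature set and the second feature set can be obtained from the to-be-enhanced speech amplitude spectrum in the same way as they are obtained from the noisy speech amplitude spectrum, which is not repeated here.
Step S330: input the first feature set and the second feature set into the trained speech enhancement model, where the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The first feature set is the input of the speech prediction neural network module, which is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error. The second feature set is the input of the noise estimation neural network module, which is used to output an estimated noise energy. The first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which is used to output the estimated pure speech amplitude spectrum of the to-be-enhanced speech signal.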
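Specifically, for the structure of the speech enhancement model and the specific implementation of the speech prediction neural network module, the noise estimation neural network module, and the linear filtering module, refer to FIG. 2 and its related description. The speech enhancement model can be obtained with the training method shown in FIG. 1.
Step S340: perform restoration according to the estimated pure speech amplitude spectrum and the to-be-enhanced speech phase spectrum to obtain the enhanced speech signal of the to-be-enhanced speech signal.
Thus, in the speech enhancement method of this embodiment, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.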
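The source does not spell out the restoration procedure, so the following is a sketch under assumptions: the estimated amplitude is paired with the phase of the noisy input, each frame is inverted with a naive inverse DFT, and the frames are overlap-added. It further assumes one-sided spectra and an analysis window whose overlapped copies sum to one (for example a periodic Hann window at 50% overlap); other windows would need an extra normalization step. All names are illustrative.

```python
import cmath
import math


def inverse_frame(half_spec, frame_len):
    """Inverse DFT of a one-sided spectrum back to a real frame."""
    # Rebuild the mirrored (conjugate) bins of a real-valued signal.
    mirrored = [half_spec[k].conjugate()
                for k in range(frame_len - len(half_spec), 0, -1)]
    full = list(half_spec) + mirrored
    return [sum(full[k] * cmath.exp(2j * math.pi * k * i / frame_len)
                for k in range(frame_len)).real / frame_len
            for i in range(frame_len)]


def reconstruct(amp, phase, frame_len, hop):
    """Combine amplitude and phase per frame, invert, and overlap-add."""
    out = [0.0] * (hop * (len(amp) - 1) + frame_len)
    for t, (amp_row, phase_row) in enumerate(zip(amp, phase)):
        spec = [a * cmath.exp(1j * p) for a, p in zip(amp_row, phase_row)]
        for i, v in enumerate(inverse_frame(spec, frame_len)):
            out[t * hop + i] += v
    return out
```

Reusing the noisy phase is a common simplification when a model estimates only the amplitude spectrum, as is the case here.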
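Embodiments of the present disclosure further provide a speech enhancement model training apparatus, which can be used to implement the speech enhancement model training method described in any of the above embodiments. The speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. FIG. 4 is a schematic module diagram of the speech enhancement model training apparatus in an embodiment of the present disclosure. As shown in FIG. 4, the speech enhancement model training apparatus 410 in this embodiment includes a first acquisition module 411, a second acquisition module 412, a first input module 413, a second input module 414, an output module 415, and a training module 416. The first acquisition module 411 is configured to acquire the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set, where the speech pair includes an associated pure speech signal and noisy speech signal. The second acquisition module 412 is configured to obtain a first feature set and a second feature set according to the noisy speech amplitude spectrum. The first input module 413 is configured to input the first feature set into the speech prediction neural network module, which is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error. The second input module 414 is configured to input the second feature set into the noise estimation neural network module, which is used to output an estimated noise energy. The output module 415 is configured to input the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, which is used to output an estimated pure speech amplitude spectrum. The training module 416 is configured to calculate a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and to train the speech enhancement model according to the model loss. For the specific principles of each module, refer to any of the above speech enhancement model training method embodiments; they are not repeated here.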
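Embodiments of the present disclosure further provide a speech enhancement apparatus, which can be used to implement the speech enhancement method described in any of the above embodiments.
FIG. 5 shows the main modules of the speech enhancement apparatus in an embodiment. Referring to FIG. 5, the speech enhancement apparatus 420 in this embodiment includes a third acquisition module 421, a fourth acquisition module 422, an enhancement module 423, and a restoration module 424. The third acquisition module 421 is configured to acquire the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum of the to-be-enhanced speech signal. The fourth acquisition module 422 is configured to obtain a first feature set and a second feature set according to the to-be-enhanced speech amplitude spectrum. The enhancement module 423 is configured to input the first feature set and the second feature set into the trained speech enhancement model, which includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module. The first feature set is the input of the speech prediction neural network module, which is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error. The second feature set is the input of the noise estimation neural network module, which is used to output an estimated noise energy. The first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, together with the estimated noise energy output by the noise estimation neural network module, are input into the linear filtering module, which is used to output the estimated pure speech amplitude spectrum of the to-be-enhanced speech signal. The restoration module 424 is configured to perform restoration according to the estimated pure speech amplitude spectrum and the to-be-enhanced speech phase spectrum to obtain the enhanced speech signal of the to-be-enhanced speech signal. For the specific principles of each module, refer to any of the above speech enhancement method embodiments; they are not repeated here.
In the speech enhancement model training apparatus and the speech enhancement apparatus of this embodiment, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
FIG. 4 and FIG. 5 only schematically illustrate the speech enhancement model training apparatus and the speech enhancement apparatus provided by the present disclosure. Without departing from the concept of the present disclosure, the splitting, merging, and adding of modules all fall within the protection scope of the present disclosure. The speech enhancement model training apparatus and the speech enhancement apparatus provided by the present disclosure may be implemented by software, hardware, firmware, plug-ins, or any combination of them, and the present disclosure is not limited to this.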
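Embodiments of the present disclosure further provide an electronic device, including a processor and a memory. Executable instructions are stored in the memory, and when the executable instructions are executed by the processor, the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments is implemented.
In the electronic device of the present disclosure, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
FIG. 6 is a schematic structural diagram of the electronic device in an embodiment of the present disclosure. It should be understood that FIG. 6 only schematically shows the modules, which may be virtual software modules or actual hardware modules. The merging and splitting of these modules and the addition of other modules all fall within the protection scope of the present disclosure.
As shown in FIG. 6, the electronic device 500 takes the form of a general-purpose computing device. The components of the electronic device 500 include, but are not limited to: at least one processing unit 510, at least one storage unit 520, a bus 530 connecting different platform components (including the storage unit 520 and the processing unit 510), a display unit 540, and so on.
The storage unit stores program code that can be executed by the processing unit 510, so that the processing unit 510 performs the steps of the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments. For example, the processing unit 510 may perform the steps shown in FIG. 1 and FIG. 3.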
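The storage unit 520 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM) 5201 and/or a cache storage unit 5202, and may further include a read-only storage unit (ROM) 5203.
The storage unit 520 may also include a program/utility 5204 having one or more program modules 5205. Such program modules 5205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 500 may also communicate with one or more external devices 600, which may be one or more of a keyboard, a pointing device, a Bluetooth device, and the like. These external devices 600 enable the user to interact with the electronic device 500. The electronic device 500 can also communicate with one or more other computing devices, such as routers and modems. Such communication may take place through the input/output (I/O) interface 550. In addition, the electronic device 500 may communicate with one or more networks (for example a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 560. The network adapter 560 may communicate with the other modules of the electronic device 500 through the bus 530. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms.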
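Embodiments of the present disclosure further provide a computer-readable storage medium for storing a program. When the program is executed, the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments is implemented. In some possible implementations, various aspects of the present disclosure can also be implemented in the form of a program product that includes program code. When the program product runs on a terminal device, the program code causes the terminal device to perform the speech enhancement model training method and/or the speech enhancement method described in any of the above embodiments.
With the computer-readable storage medium of the present disclosure, the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, and the speech signal is enhanced through this model. This combines the linear filtering method based on signal processing with the speech enhancement method based on machine learning: the machine learning method improves the speech enhancement performance of the linear filtering method under complex non-stationary noise, and the linear filtering method improves the generalization performance of the machine learning method, thereby optimizing speech enhancement.
FIG. 7 is a schematic structural diagram of the computer-readable storage medium of the present disclosure. Referring to FIG. 7, a program product 700 for implementing the above method according to an embodiment of the present disclosure is described. It may take the form of a portable compact disc read-only memory (CD-ROM) that includes program code, and it can run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by or in conjunction with an instruction execution system, apparatus, or device.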
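The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of readable storage media include, but are not limited to: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code contained on a readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.
Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user computing device over any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computing device, for example via the Internet through an Internet service provider.
The above is a further detailed description of the present disclosure in connection with specific preferred embodiments, and the specific implementation of the present disclosure should not be regarded as limited to these descriptions. For a person of ordinary skill in the technical field of the present disclosure, several simple deductions or substitutions can be made without departing from the concept of the present disclosure, and all of them should be regarded as falling within the protection scope of the present disclosure.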

Claims (11)

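  1. A speech enhancement model training method, wherein the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
    the speech enhancement model training method including:
    acquiring the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in a speech training set, the speech pair including an associated pure speech signal and noisy speech signal;
    obtaining a first feature set and a second feature set according to the noisy speech amplitude spectrum;
    inputting the first feature set into the speech prediction neural network module, the speech prediction neural network module being used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error;
    inputting the second feature set into the noise estimation neural network module, the noise estimation neural network module being used to output an estimated noise energy;
    inputting the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, the linear filtering module being used to output an estimated pure speech amplitude spectrum;
    calculating a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and training the speech enhancement model according to the model loss.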
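  2. The speech enhancement model training method according to claim 1, wherein acquiring the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in the speech training set includes:
    performing a time-domain to frequency-domain transform step on the pure speech signal of the speech pair;
    performing the time-domain to frequency-domain transform step on the noisy speech signal of the speech pair,
    the time-domain to frequency-domain transform step including:
    dividing the speech signal to be processed into frames;
    performing a Fourier transform on each frame of the speech signal to be processed to obtain the frame Fourier spectrum of each frame;
    concatenating the frame Fourier spectra of the frames of the speech signal to be processed along the time axis to obtain the Fourier spectrum of the speech signal to be processed;
    generating the amplitude spectrum of the speech signal to be processed from the amplitude at each frequency bin of the Fourier spectrum of the speech signal to be processed.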
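  3. The speech enhancement model training method according to claim 1, wherein the speech prediction neural network module is a time-series neural network model, the first feature set is a noise amplitude spectrum sequence of multiple consecutive frames, the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module is a first quasi-estimated pure speech amplitude spectrum sequence with the same dimensions as the noise amplitude spectrum sequence, and the prediction error output by the speech prediction neural network module is a prediction error sequence with the same dimensions as the noise amplitude spectrum sequence.
  4. The speech enhancement model training method according to claim 1, wherein the noise estimation neural network module is a multi-layer fully connected network model, and the second feature set includes the noisy speech amplitude spectra of the current frame and of the neighborhood window of the current frame.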
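  5. The speech enhancement model training method according to claim 1, wherein the linear filtering module includes a Wiener filtering module, a Kalman gain calculation module, and a linear combination module,
    the Wiener filtering module being used to output, according to the estimated noise energy output by the noise estimation neural network module and the second feature set, the Wiener filtering solution of the pure speech amplitude spectrum as a second quasi-estimated pure speech amplitude spectrum;
    the Kalman gain calculation module being used to output an optimal Kalman gain G according to the prediction error output by the speech prediction neural network module and the estimated noise energy output by the noise estimation neural network module;
    the linear combination module being used to calculate, according to the optimal Kalman gain G, a linear combination of the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module and the second quasi-estimated pure speech amplitude spectrum, as the estimated pure speech amplitude spectrum.
  6. The speech enhancement model training method according to claim 5, wherein calculating, according to the optimal Kalman gain G, the linear combination of the first quasi-estimated pure speech amplitude spectrum output by the speech prediction neural network module and the second quasi-estimated pure speech amplitude spectrum as the estimated pure speech amplitude spectrum includes:
    using (1-G) as the first weight, applied to the first quasi-estimated pure speech amplitude spectrum;
    using the optimal Kalman gain G as the second weight, applied to the second quasi-estimated pure speech amplitude spectrum;
    calculating the weighted sum of the first quasi-estimated pure speech amplitude spectrum and the second quasi-estimated pure speech amplitude spectrum according to the first weight and the second weight, as the estimated pure speech amplitude spectrum.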
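  7. The speech enhancement model training method according to any one of claims 1 to 6, wherein calculating the model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and training the speech enhancement model according to the model loss, includes:
    optimizing the parameters of the speech prediction neural network module and the noise estimation neural network module with the back-propagation algorithm.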
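  8. A speech enhancement method, including:
    acquiring the to-be-enhanced speech amplitude spectrum and the to-be-enhanced speech phase spectrum of a to-be-enhanced speech signal;
    obtaining a first feature set and a second feature set according to the to-be-enhanced speech amplitude spectrum;
    inputting the first feature set and the second feature set into a trained speech enhancement model, the speech enhancement model including a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module, wherein the first feature set is the input of the speech prediction neural network module, the speech prediction neural network module is used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error, the second feature set is the input of the noise estimation neural network module, the noise estimation neural network module is used to output an estimated noise energy, the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module and the estimated noise energy output by the noise estimation neural network module are input into the linear filtering module, and the linear filtering module is used to output the estimated pure speech amplitude spectrum of the to-be-enhanced speech signal;
    performing restoration according to the estimated pure speech amplitude spectrum and the to-be-enhanced speech phase spectrum to obtain the enhanced speech signal of the to-be-enhanced speech signal.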
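  9. A speech enhancement model training apparatus, wherein the speech enhancement model includes a speech prediction neural network module, a noise estimation neural network module, and a linear filtering module,
    the speech enhancement model training apparatus including:
    a first acquisition module, configured to acquire the noisy speech amplitude spectrum and the pure speech amplitude spectrum of each speech pair in a speech training set, the speech pair including an associated pure speech signal and noisy speech signal;
    a second acquisition module, configured to obtain a first feature set and a second feature set according to the noisy speech amplitude spectrum;
    a first input module, configured to input the first feature set into the speech prediction neural network module, the speech prediction neural network module being used to output a first quasi-estimated pure speech amplitude spectrum and a prediction error;
    a second input module, configured to input the second feature set into the noise estimation neural network module, the noise estimation neural network module being used to output an estimated noise energy;
    an output module, configured to input the first quasi-estimated pure speech amplitude spectrum and the prediction error output by the speech prediction neural network module, and the estimated noise energy output by the noise estimation neural network module, into the linear filtering module, the linear filtering module being used to output an estimated pure speech amplitude spectrum;
    a training module, configured to calculate a model loss according to the pure speech amplitude spectrum and the estimated pure speech amplitude spectrum, and to train the speech enhancement model according to the model loss.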
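  10. An electronic device, including:
    a processor;
    a memory, in which executable instructions are stored;
    wherein, when the executable instructions are executed by the processor, the following is implemented:
    the speech enhancement model training method according to any one of claims 1-7; and/or
    the speech enhancement method according to claim 8.
  11. A computer-readable storage medium for storing a program, wherein, when the program is executed, the following is implemented:
    the speech enhancement model training method according to any one of claims 1-7; and/or
    the speech enhancement method according to claim 8.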
PCT/CN2022/073197 2021-01-29 2022-01-21 Speech enhancement method, model training method, and related devices WO2022161277A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110129897.8 2021-01-29
CN202110129897.8A CN113808602A (zh) 2021-01-29 2021-01-29 Speech enhancement method, model training method, and related devices

Publications (1)

Publication Number Publication Date
WO2022161277A1 true WO2022161277A1 (zh) 2022-08-04

Family

ID=78892819

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/073197 WO2022161277A1 (zh) 2021-01-29 2022-01-21 Speech enhancement method, model training method, and related device

Country Status (2)

Country Link
CN (1) CN113808602A (zh)
WO (1) WO2022161277A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
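CN116052706A (zh) * 2023-03-30 2023-05-02 苏州清听声学科技有限公司 Low-complexity speech enhancement method based on a neural network
CN117789744A (zh) * 2024-02-26 2024-03-29 青岛海尔科技有限公司 Speech noise reduction method and apparatus based on model fusion, and storage medium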

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
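CN113808602A (zh) * 2021-01-29 2021-12-17 北京沃东天骏信息技术有限公司 Speech enhancement method, model training method, and related device
CN114267368A (zh) * 2021-12-22 2022-04-01 北京百度网讯科技有限公司 Training method for an audio noise reduction model, audio noise reduction method, and apparatus
GB2620747A (en) * 2022-07-19 2024-01-24 Samsung Electronics Co Ltd Method and apparatus for speech enhancement
CN116843514B (zh) * 2023-08-29 2023-11-21 北京城建置业有限公司 Integrated property management system and method based on data recognition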

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
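CN109767781A (zh) * 2019-03-06 2019-05-17 哈尔滨工业大学(深圳) Speech separation method, system, and storage medium based on a super-Gaussian prior speech model and deep learning
CN110211598A (zh) * 2019-05-17 2019-09-06 北京华控创为南京信息技术有限公司 Intelligent speech noise reduction communication method and apparatus
CN113808602A (zh) * 2021-01-29 2021-12-17 北京沃东天骏信息技术有限公司 Speech enhancement method, model training method, and related device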

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639502B1 (en) * 2009-02-16 2014-01-28 Arrowhead Center, Inc. Speaker model-based speech enhancement system
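CN103208291A (zh) * 2013-03-08 2013-07-17 华南理工大学 Speech enhancement method and apparatus usable in strong-noise environments
CN105489226A (zh) * 2015-11-23 2016-04-13 湖北工业大学 Wiener-filtering speech enhancement method using multi-window spectrum estimation for sound pickups
CN109767782B (zh) * 2018-12-28 2020-04-14 中国科学院声学研究所 Speech enhancement method for improving the generalization performance of a DNN model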

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XUE WEI; QUAN GANG; ZHANG CHAO; DING GUOHONG; HE XIAODONG; ZHOU BOWEN: "Neural Kalman Filtering for Speech Enhancement", ICASSP 2021 - 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 6 June 2021 (2021-06-06), pages 7108 - 7112, XP033955006, DOI: 10.1109/ICASSP39728.2021.9413499 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
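CN116052706A (zh) * 2023-03-30 2023-05-02 苏州清听声学科技有限公司 Low-complexity speech enhancement method based on a neural network
CN117789744A (zh) * 2024-02-26 2024-03-29 青岛海尔科技有限公司 Speech noise reduction method and apparatus based on model fusion, and storage medium
CN117789744B (zh) * 2024-02-26 2024-05-24 青岛海尔科技有限公司 Speech noise reduction method and apparatus based on model fusion, and storage medium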

Also Published As

Publication number Publication date
CN113808602A (zh) 2021-12-17

Similar Documents

Publication Publication Date Title
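WO2022161277A1 (zh) Speech enhancement method, model training method, and related device
WO2021043015A1 (zh) Speech recognition method and apparatus, and neural network training method and apparatus
Li et al. Glance and gaze: A collaborative learning framework for single-channel speech enhancement
CN108615535B (zh) Speech enhancement method and apparatus, intelligent speech device, and computer device
WO2021179424A1 (zh) Speech enhancement method combined with an AI model, system, electronic device, and medium
CN111968658B (zh) Speech signal enhancement method and apparatus, electronic device, and storage medium
Venkataramani et al. Adaptive front-ends for end-to-end source separation
US9520141B2 (en) Keyboard typing detection and suppression
WO2022178942A1 (zh) Emotion recognition method and apparatus, computer device, and storage medium
CN111508518B (zh) Single-channel speech enhancement method based on joint dictionary learning and sparse representation
WO2022183806A1 (zh) Neural-network-based speech enhancement method and apparatus, and electronic device
CN112767959B (zh) Speech enhancement method, apparatus, device, and medium
CN112489668B (zh) Dereverberation method and apparatus, electronic device, and storage medium
US20240046947A1 (en) Speech signal enhancement method and apparatus, and electronic device
WO2022213825A1 (zh) Neural-network-based end-to-end speech enhancement method and apparatus
CN113345460A (zh) Audio signal processing method, apparatus, device, and storage medium
CN116403594B (zh) Speech enhancement method and apparatus based on a noise update factor
US20230186943A1 (en) Voice activity detection method and apparatus, and storage medium
CN115662461A (zh) Noise reduction model training method, apparatus, and device
CN114171043B (zh) Echo determination method, apparatus, device, and storage medium
Ullah et al. Semi-supervised transient noise suppression using OMLSA and SNMF algorithms
CN112491449A (zh) Acoustic echo cancellation method and apparatus, electronic device, and storage medium
Wu et al. Time-Domain Mapping with Convolution Networks for End-to-End Monaural Speech Separation
Li et al. An improved speech enhancement algorithm based on combination of OMLSA and IMCRA
CN113744754B (zh) Speech signal enhancement processing method and apparatus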

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22745152; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 22745152; Country of ref document: EP; Kind code of ref document: A1)