CN116137153A - Training method of voice noise reduction model and voice enhancement method - Google Patents

Training method of voice noise reduction model and voice enhancement method

Info

Publication number
CN116137153A
CN116137153A
Authority
CN
China
Prior art keywords
fourier spectrum
noise reduction
voice
module
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111353720.2A
Other languages
Chinese (zh)
Inventor
张鹏远
党风
颜永红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS filed Critical Institute of Acoustics CAS
Priority to CN202111353720.2A priority Critical patent/CN116137153A/en
Publication of CN116137153A publication Critical patent/CN116137153A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

The application provides a training method for a speech noise reduction model and a speech enhancement method. The speech noise reduction model includes a first enhancement module, which denoises an input spectrum and outputs a spectrum, and a second enhancement module, which denoises an input spectrum and outputs a complex mask. The processing order of the first and second enhancement modules is determined according to the signal-to-noise ratio of the channel. When the signal-to-noise ratio of the channel is below a preset value, the first enhancement module is applied first to recover speech harmonics, and the second enhancement module is then applied to strengthen the noise reduction performance.

Description

Training method of voice noise reduction model and voice enhancement method
Technical Field
The present disclosure relates to the field of speech enhancement technologies, and in particular, to a training method of a speech noise reduction model and a speech enhancement method.
Background
In speech applications, the captured speech typically contains interference from environmental noise sources. In automatic speech recognition, telecommunication systems, and hearing assistance devices, for example, noise in the speech degrades the practical performance of the application. The purpose of speech enhancement is to extract the useful speech information from noisy speech, thereby improving speech quality and intelligibility.
Enhancing monaural speech is a challenging task because a single channel carries no spatial information. In particular, at low signal-to-noise ratios, existing speech enhancement methods deliver poor noise reduction performance on monaural speech.
Disclosure of Invention
The application provides a training method for a speech noise reduction model with multi-stage noise reduction, and a corresponding speech enhancement method. For monaural speech, a two-stage processing chain of spectral processing and masking processing is designed to improve speech enhancement performance in the monaural setting.
In a first aspect, the present application provides a method for training a speech noise reduction model.
The method comprises the following steps: acquiring a speech training set corresponding to a channel, the speech training set comprising a plurality of noisy speech samples and a plurality of clean speech samples in one-to-one correspondence; and determining, using the speech training set, a speech noise reduction model corresponding to the channel. The speech noise reduction model includes: an analysis filter, a first enhancement module, a second enhancement module, and a synthesis filter module.

Determining the speech noise reduction model corresponding to the channel using the speech training set includes the following.

The analysis filter converts an input noisy speech sample into a first Fourier spectrum. When the signal-to-noise ratio of the channel is below a preset value, the first enhancement module outputs a second Fourier spectrum based on the first Fourier spectrum; the first and second Fourier spectra are spliced into a third Fourier spectrum; the second enhancement module outputs a first complex mask based on the third Fourier spectrum; and the first complex mask is converted into a fourth Fourier spectrum. When the signal-to-noise ratio of the channel is not below the preset value, the second enhancement module outputs a second complex mask based on the first Fourier spectrum; the second complex mask is converted into a fifth Fourier spectrum; the first and fifth Fourier spectra are spliced into a sixth Fourier spectrum; and the first enhancement module outputs a seventh Fourier spectrum based on the sixth Fourier spectrum. The synthesis filter module converts the fourth or seventh Fourier spectrum into clean speech, and the speech noise reduction model is updated according to the clean speech and the clean speech sample corresponding to the input noisy speech sample.
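The SNR-dependent ordering of the two stages can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the two enhancement modules, the mask-to-spectrum conversion, and the SNR threshold are passed in as stand-ins, and spectra are (2, F, T) arrays of stacked real and imaginary parts.

```python
import numpy as np

def forward(X1, snr_db, threshold, spec_module, mask_module, mask_to_spec):
    """Order the two enhancement stages by the channel's SNR.

    X1           : (2, F, T) noisy Fourier spectrum (real/imag stacked)
    spec_module  : first enhancement module, spectrum -> denoised spectrum
    mask_module  : second enhancement module, spectrum -> complex mask
    mask_to_spec : converts a complex mask into a Fourier spectrum
    """
    if snr_db < threshold:
        # low SNR: spectral mapping first (recovers harmonics), masking second
        X2 = spec_module(X1)                      # second Fourier spectrum
        X3 = np.concatenate([X1, X2], axis=0)     # third spectrum (channel splice)
        M1 = mask_module(X3)                      # first complex mask
        return mask_to_spec(M1)                   # fourth Fourier spectrum
    # high SNR: masking first, spectral mapping second
    M2 = mask_module(X1)                          # second complex mask
    X5 = mask_to_spec(M2)                         # fifth Fourier spectrum
    X6 = np.concatenate([X1, X5], axis=0)         # sixth spectrum (channel splice)
    return spec_module(X6)                        # seventh Fourier spectrum
```

With identity-like stand-ins for the modules, the low-SNR branch runs spectral mapping before masking and the high-SNR branch the reverse, matching the order of the first through seventh Fourier spectra described above.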
In this scheme, two noise reduction stages are arranged in the speech noise reduction model according to the channel's signal-to-noise ratio: a module that outputs a spectrum and a module that outputs a mask. Combining the advantages of spectral mapping and masking methods strengthens the noise reduction effect. In particular, at lower signal-to-noise ratios, processing first with the spectrum-output module recovers speech components, and processing next with the mask-output module strengthens the noise reduction performance, improving speech enhancement quality.
In one possible implementation, the first enhancement module includes a plurality of first sub-modules and the second enhancement module includes a plurality of second sub-modules, their numbers being determined according to the signal-to-noise ratio of the channel. Each first sub-module denoises an input Fourier spectrum and outputs a denoised Fourier spectrum; each second sub-module denoises an input Fourier spectrum and outputs a denoised complex mask.
When the signal-to-noise ratio of the channel is below the preset value, the Fourier spectrum input to the n-th first sub-module is determined from the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (n-i)-th first sub-module, and the Fourier spectrum input to the m-th second sub-module is determined from the Fourier spectrum output by the analysis filter, the Fourier spectra output by all first sub-modules, and the Fourier spectrum corresponding to the complex mask output by the (m-j)-th second sub-module.
When the signal-to-noise ratio of the channel is not below the preset value, the Fourier spectrum input to the m-th second sub-module is determined from the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (m-j)-th second sub-module, and the Fourier spectrum input to the n-th first sub-module is determined from the Fourier spectrum output by the analysis filter, the Fourier spectra corresponding to the complex masks output by all second sub-modules, and the Fourier spectrum output by the (n-i)-th first sub-module. Here N is the number of first sub-modules, with n ∈ [1, N] and i ∈ [1, n-1]; M is the number of second sub-modules, with m ∈ [1, M] and j ∈ [1, m-1].
In one possible embodiment, the first sub-module and the second sub-module each comprise: an encoder, a time-domain recurrent neural network, a frequency-domain recurrent neural network, and a decoder. The encoder encodes the input Fourier spectrum into a first high-dimensional Fourier spectrum. The time-domain recurrent neural network determines a second high-dimensional Fourier spectrum from the feature data of each sub-band in the first high-dimensional Fourier spectrum. The frequency-domain recurrent neural network determines a third high-dimensional Fourier spectrum from the feature data of each time point in the second high-dimensional Fourier spectrum. The decoder in the first sub-module decodes the third high-dimensional Fourier spectrum and outputs a denoised Fourier spectrum; the decoder in the second sub-module decodes the third high-dimensional Fourier spectrum and outputs a denoised complex mask.
In a possible implementation, determining the speech noise reduction model corresponding to the channel using the speech training set further includes: determining a time-domain loss value between the clean speech and the clean speech sample according to a time-domain loss function; determining a frequency-domain loss value between the fourth Fourier spectrum and an eighth Fourier spectrum according to a frequency-domain loss function, or determining a third loss value between the seventh Fourier spectrum and the eighth Fourier spectrum according to the frequency-domain loss function, the eighth Fourier spectrum being determined from the clean speech sample; and updating the speech noise reduction model according to the time-domain loss value and the frequency-domain loss value, or according to the time-domain loss value and the third loss value.
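The joint objective can be sketched as below. The patent names a time-domain and a frequency-domain loss function but not their functional forms; mean squared error and an equal weighting `alpha` are placeholders of my choosing.

```python
import numpy as np

def time_domain_loss(s_hat, s):
    """Placeholder time-domain loss: MSE between waveforms (forms unspecified in the patent)."""
    return np.mean((s_hat - s) ** 2)

def frequency_domain_loss(X_hat, X_ref):
    """Placeholder frequency-domain loss: MSE between (2, F, T) real/imag spectra."""
    return np.mean((X_hat - X_ref) ** 2)

def total_loss(s_hat, s, X_hat, X_ref, alpha=0.5):
    """Weighted sum of the two losses; the weighting alpha is an assumption."""
    return alpha * time_domain_loss(s_hat, s) + (1.0 - alpha) * frequency_domain_loss(X_hat, X_ref)
```

Here `X_ref` plays the role of the eighth Fourier spectrum computed from the clean speech sample, and `X_hat` is the fourth (or seventh) Fourier spectrum, depending on which branch was taken.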
In a second aspect, the present application provides a method of speech enhancement.
The method comprises the following steps: acquiring a speech noise reduction model corresponding to a channel; and denoising, using the speech noise reduction model, a target speech transmitted over the channel to obtain clean speech.
In a third aspect, the present application provides a training apparatus for a speech noise reduction model.
The training apparatus comprises: an acquisition module, configured to acquire a speech training set corresponding to a channel, the speech training set comprising a plurality of noisy speech samples and a plurality of clean speech samples in one-to-one correspondence; and a training module, configured to determine, using the speech training set, a speech noise reduction model corresponding to the channel. The speech noise reduction model includes: an analysis filter, a first enhancement module, a second enhancement module, and a synthesis filter module.
The training module is specifically configured to: convert an input noisy speech sample into a first Fourier spectrum using the analysis filter; when the signal-to-noise ratio of the channel is below a preset value, cause the first enhancement module to output a second Fourier spectrum based on the first Fourier spectrum, splice the first and second Fourier spectra into a third Fourier spectrum, cause the second enhancement module to output a first complex mask based on the third Fourier spectrum, convert the first complex mask into a fourth Fourier spectrum, and convert the fourth Fourier spectrum into clean speech with the synthesis filter module; when the signal-to-noise ratio of the channel is not below the preset value, cause the second enhancement module to output a second complex mask based on the first Fourier spectrum, convert the second complex mask into a fifth Fourier spectrum, splice the first and fifth Fourier spectra into a sixth Fourier spectrum, cause the first enhancement module to output a seventh Fourier spectrum based on the sixth Fourier spectrum, and convert the seventh Fourier spectrum into clean speech with the synthesis filter module; and update the speech noise reduction model according to the clean speech and the clean speech sample corresponding to the input noisy speech sample.
In one possible implementation, the first enhancement module includes a plurality of first sub-modules and the second enhancement module includes a plurality of second sub-modules, their numbers being determined according to the signal-to-noise ratio of the channel. Each first sub-module denoises an input Fourier spectrum and outputs a denoised Fourier spectrum; each second sub-module denoises an input Fourier spectrum and outputs a denoised complex mask.
When the signal-to-noise ratio of the channel is below the preset value, the Fourier spectrum input to the n-th first sub-module is determined from the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (n-i)-th first sub-module, and the Fourier spectrum input to the m-th second sub-module is determined from the Fourier spectrum output by the analysis filter, the Fourier spectra output by all first sub-modules, and the Fourier spectrum corresponding to the complex mask output by the (m-j)-th second sub-module.
When the signal-to-noise ratio of the channel is not below the preset value, the Fourier spectrum input to the m-th second sub-module is determined from the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (m-j)-th second sub-module, and the Fourier spectrum input to the n-th first sub-module is determined from the Fourier spectrum output by the analysis filter, the Fourier spectra corresponding to the complex masks output by all second sub-modules, and the Fourier spectrum output by the (n-i)-th first sub-module.
Here N is the number of first sub-modules, with n ∈ [1, N] and i ∈ [1, n-1]; M is the number of second sub-modules, with m ∈ [1, M] and j ∈ [1, m-1].
In one possible embodiment, the first sub-module and the second sub-module each comprise: an encoder, a time-domain recurrent neural network, a frequency-domain recurrent neural network, and a decoder. The encoder encodes the input Fourier spectrum into a first high-dimensional Fourier spectrum. The time-domain recurrent neural network determines a second high-dimensional Fourier spectrum from the feature data of each sub-band in the first high-dimensional Fourier spectrum. The frequency-domain recurrent neural network determines a third high-dimensional Fourier spectrum from the feature data of each time point in the second high-dimensional Fourier spectrum. The decoder in the first sub-module decodes the third high-dimensional Fourier spectrum and outputs a denoised Fourier spectrum;
the decoder in the second sub-module decodes the third high-dimensional Fourier spectrum and outputs a denoised complex mask.
In one possible implementation, the training module is further configured to:
determining a time-domain loss value between the clean speech and the clean speech sample according to a time-domain loss function;
determining a frequency-domain loss value between the fourth Fourier spectrum and an eighth Fourier spectrum according to a frequency-domain loss function, or determining a third loss value between the seventh Fourier spectrum and the eighth Fourier spectrum according to the frequency-domain loss function, the eighth Fourier spectrum being determined from the clean speech sample; and updating the speech noise reduction model according to the time-domain loss value and the frequency-domain loss value, or according to the time-domain loss value and the third loss value.
In a fourth aspect, the present application provides a speech enhancement apparatus.
The speech enhancement apparatus includes: an acquisition module, configured to acquire a speech noise reduction model corresponding to a channel; and a processing module, configured to denoise, using the speech noise reduction model, a target speech transmitted over the channel to obtain clean speech.
In a fifth aspect, the present application provides a computing device. The computing device includes: a processor and a memory, the processor being configured to execute a computer program stored in the memory to perform the training method of the first aspect and alternative embodiments thereof, or to perform the speech enhancement method of the second aspect.
In a sixth aspect, the present application provides a computer-readable storage medium. The computer readable storage medium comprises instructions which, when run on a computer, cause the computer to perform the training method of the first aspect and alternative embodiments thereof, or to perform the speech enhancement method of the second aspect.
In a seventh aspect, the present application provides a computer program product. The computer program product comprises a program code which, when run by a computer, causes the computer to perform the training method of the first aspect and alternative embodiments thereof or to perform the speech enhancement method of the second aspect.
Any of the apparatuses, computer storage media, or computer program products provided above is used to perform a corresponding method provided above; the beneficial effects it can achieve are therefore those of the corresponding method and are not repeated here.
Drawings
FIG. 1 is a schematic diagram of a multi-stage speech enhancement model according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a noise reduction module in a multi-stage speech enhancement model according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a submodule in a multi-stage speech enhancement model according to an embodiment of the present application;
FIG. 4 is a flowchart of a training method for a speech enhancement model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a training apparatus for a speech enhancement model according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for speech enhancement according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a voice enhancement device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments of the present application are described below with reference to the accompanying drawings.
In the description of embodiments of the present application, words such as "exemplary", "such as", or "for example" are used as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary", "such as", or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, such words are intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, the term "and/or" merely describes an association between objects, indicating that three relationships may exist; for example, "A and/or B" may indicate: A alone, B alone, or both A and B. In addition, unless otherwise indicated, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating an indicated technical feature. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In speech application scenarios, the speech enhancement methods used are typically spectral subtraction, Wiener filtering, and statistics-based methods. Owing to the strong modeling capability of deep neural networks, speech enhancement methods based on deep neural networks have also emerged.
In the time-frequency domain, deep-neural-network-based speech enhancement methods fall into two main categories: mask-based and mapping-based. The former, for example an ideal binary mask or an ideal ratio mask, analyzes the energy distribution relationship between the clean and noise components. The latter uses a logarithmic power spectrum or a magnitude spectrum as the mapping target.
In practical applications, when a deep-neural-network-based speech enhancement method is deployed on terminal devices (electronic devices such as mobile phones and earphones), the resulting noise reduction effect differs with the computing capability of the device.
Fig. 1 is a schematic structural diagram of a speech enhancement model with multi-stage noise reduction according to an embodiment of the present application.
As shown in fig. 1, the speech enhancement model 100 includes: an analysis filter 101, a noise reduction module 102, and a synthesis filter 103.
The analysis filter 101 is configured to process the input noisy speech x and convert it into a Fourier spectrum X. Specifically, the analysis filter 101 employs a short-time Fourier transform with a frame length of 512 and a frame shift of 128. The noisy speech is expressed as x ∈ R^(1×L) and the output is the Fourier spectrum X ∈ R^(2×F×T), where L is the number of speech sampling points, F is the number of Fourier frequency points (256), and T is the number of frames.
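A minimal sketch of this analysis filter's bookkeeping follows. The Hann window is an assumption (the patent only fixes frame length 512 and shift 128), and since a 512-point real FFT yields 257 bins while the text counts F = 256, the last bin is dropped here as an assumption about the patent's bookkeeping.

```python
import numpy as np

def analysis_filter(x, frame_len=512, frame_shift=128, n_freq=256):
    """STFT with the patent's frame length 512 and shift 128.

    Returns a (2, F, T) array of stacked real and imaginary parts.
    Window choice (Hann) and dropping the Nyquist bin are assumptions.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    # frame the waveform: (frame_len, T)
    frames = np.stack(
        [x[t * frame_shift : t * frame_shift + frame_len] * window
         for t in range(n_frames)],
        axis=1,
    )
    spec = np.fft.rfft(frames, axis=0)[:n_freq]      # (F, T), Nyquist bin dropped
    return np.stack([spec.real, spec.imag], axis=0)  # (2, F, T)
```

For a waveform of L = 896 samples, this framing yields T = 1 + (896 - 512) / 128 = 4 frames.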
The noise reduction module 102 is configured to denoise the Fourier spectrum output by the analysis filter 101 and to output a Fourier spectrum or a complex mask. The noise reduction module 102 may be configured to output a Fourier spectrum or a complex mask according to the signal-to-noise ratio of the channel transmitting the noisy speech. For example, when the signal-to-noise ratio of the channel is below a preset value, the noise reduction module 102 is configured to output a Fourier spectrum; when it is not below the preset value, the noise reduction module 102 is configured to output a complex mask, so as to avoid losing speech information.
In one example, the noise reduction module 102 may include a first enhancement module 1021 that outputs a Fourier spectrum and a second enhancement module 1022 that outputs a complex mask. The order in which the first enhancement module 1021 and the second enhancement module 1022 process data can be determined according to the signal-to-noise ratio of the channel.
Specifically, when the signal-to-noise ratio of the channel is below the preset value, the first enhancement module 1021 processes the Fourier spectrum output by the analysis filter 101 and outputs an enhanced Fourier spectrum. The second enhancement module 1022 then performs noise reduction based on the Fourier spectrum output by the analysis filter 101 and the Fourier spectrum output by the first enhancement module 1021, and outputs a complex mask.
Conversely, when the signal-to-noise ratio of the channel is not below the preset value, the second enhancement module 1022 performs noise reduction based on the Fourier spectrum output by the analysis filter 101 and outputs a complex mask. The first enhancement module 1021 then performs noise reduction based on the Fourier spectrum output by the analysis filter 101 and the Fourier spectrum corresponding to the complex mask output by the second enhancement module 1022, and outputs an enhanced Fourier spectrum. Processing the Fourier spectrum with the second enhancement module and outputting a complex mask avoids losing information in the spectrum, improving the quality of the noise reduction.
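The text does not spell out how a complex mask is turned back into a Fourier spectrum; the standard reading is complex ratio masking, i.e. element-wise complex multiplication of the mask with the spectrum it was estimated for. A sketch under that assumption:

```python
import numpy as np

def apply_complex_mask(M, X):
    """Complex ratio masking on (2, F, T) real/imag stacks:
    (Mr + j*Mi) * (Xr + j*Xi), applied element-wise."""
    Mr, Mi = M[0], M[1]
    Xr, Xi = X[0], X[1]
    real = Mr * Xr - Mi * Xi
    imag = Mr * Xi + Mi * Xr
    return np.stack([real, imag], axis=0)
```

An all-ones real mask (Mr = 1, Mi = 0) passes the spectrum through unchanged, while a purely imaginary mask rotates each time-frequency bin by 90 degrees.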
In one example, the first enhancement module 1021 and the second enhancement module 1022 may each be configured with multiple sub-modules. As shown in fig. 2 for the case where the signal-to-noise ratio is below the preset value, the first enhancement module 1021 may include N first sub-modules arranged in sequence, and the second enhancement module 1022 may include M second sub-modules arranged in sequence.
Each first sub-module processes its input Fourier spectrum in sequence, where the Fourier spectrum input to each first sub-module is obtained by splicing the Fourier spectrum output by the analysis filter 101, the Fourier spectra output by the first sub-modules before it, and/or the Fourier spectra corresponding to the complex masks output by the second sub-modules before it.
Likewise, the Fourier spectrum input to each second sub-module is obtained by splicing the Fourier spectrum output by the analysis filter 101, the Fourier spectra corresponding to the complex masks output by the second sub-modules before it, and/or the Fourier spectra output by the first sub-modules before it.
Splicing here means concatenating multiple Fourier spectra along the channel dimension. For example, as shown in fig. 2 (the structure of the noise reduction module 102 when the signal-to-noise ratio is below the preset value), the Fourier spectrum input to the 2nd first sub-module may be obtained by splicing, along the channel dimension, the Fourier spectrum output by the analysis filter 101 and the Fourier spectrum output by the 1st first sub-module. As another example, the Fourier spectrum input to the 2nd second sub-module may be obtained by splicing, along the channel dimension, the Fourier spectrum output by the analysis filter 101, the Fourier spectra output by the 1st through N-th first sub-modules, and the Fourier spectrum corresponding to the complex mask output by the 1st second sub-module. The splicing process and the conversion of complex masks into spectra are not shown in fig. 2.
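The channel-dimension splice feeding the m-th second sub-module (in the low-SNR configuration) can be sketched as follows; the treatment of m = 1, which has no preceding second sub-module, is my reading of fig. 2.

```python
import numpy as np

def second_submodule_input(X, first_outputs, prev_mask_spec):
    """Splice the input to the m-th second sub-module along the channel axis.

    X              : (2, F, T) spectrum from the analysis filter
    first_outputs  : spectra output by all N first sub-modules
    prev_mask_spec : spectrum corresponding to the (m-1)-th second
                     sub-module's complex mask, or None when m = 1
    """
    parts = [X] + list(first_outputs)
    if prev_mask_spec is not None:
        parts.append(prev_mask_spec)
    return np.concatenate(parts, axis=0)  # stack along the channel dimension
```

With N = 2 first sub-modules, the 2nd second sub-module thus receives a (2 + 2*2 + 2) = 8-channel input, while the 1st second sub-module receives 6 channels.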
Specifically, the numbers of first sub-modules and second sub-modules may be determined according to the computing capability of the terminal device. For example, a correspondence between the processor clock frequency of a terminal device and the number of sub-modules may be preset, and the numbers of sub-modules of the first enhancement module 1021 and the second enhancement module 1022 determined from this correspondence.
Specifically, the first sub-module and the second sub-module may adopt the same structure. As shown in fig. 3, the first sub-module and the second sub-module may include: an encoder 301, a time domain recurrent neural network 302, a frequency domain recurrent neural network 303, and a decoder 304.
The encoder 301 is configured to encode the input Fourier spectrum to obtain a first high-dimensional Fourier spectrum. Specifically, the encoder 301 may be built from a convolutional neural network comprising three complex convolutional layers; the number of channels of each convolutional layer may be designed as 64, the convolution kernel size as (5, 2), and the strides of the three layers as (2, 1), (2, 1), and (1, 1). The Fourier spectrum output by the encoder 301 may be denoted H ∈ R^(C×F'×T), where C denotes the number of channels and F' denotes the number of Fourier frequency points after downsampling, F' = 256/4 = 64.
Modeling the time-domain recurrent neural network 302 along a time axis of the first high-dimensional fourier spectrum, the time-domain recurrent neural network 302 configured to determine a second high-dimensional fourier spectrum from the characteristic data of each subband in the first high-dimensional fourier spectrum, the characteristics of each subbandThe symptom data may represent H 1,f ∈R^(C×T),f∈[1,F']。
The frequency-domain recurrent neural network 303 models along the frequency axis of the second high-dimensional Fourier spectrum, and is configured to determine a third high-dimensional Fourier spectrum from the characteristic data of each time point in the second high-dimensional Fourier spectrum, where the characteristic data of each time point may be denoted as H_{1,t} ∈ R^(C×F'), t ∈ [1, T].
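The axis handling of the two recurrent networks can be sketched in numpy. The recurrent cell is replaced by a running mean purely to illustrate which axis each network treats as the sequence (time axis per sub-band, then frequency axis per time point); the patent does not specify the cell type.

```python
import numpy as np

C, Fp, T = 64, 64, 100          # channels, downsampled frequency bins F', frames
H = np.random.randn(C, Fp, T)   # first high-dimensional Fourier spectrum

def toy_rnn(seq):
    """Stand-in for a recurrent layer: running mean over the sequence axis (axis 0)."""
    return np.cumsum(seq, axis=0) / np.arange(1, seq.shape[0] + 1)[:, None]

# Time-domain RNN 302: one sequence of length T per sub-band f, features of size C.
H2 = np.stack([toy_rnn(H[:, f, :].T).T for f in range(Fp)], axis=1)
assert H2.shape == (C, Fp, T)

# Frequency-domain RNN 303: one sequence of length F' per time point t, features of size C.
H3 = np.stack([toy_rnn(H2[:, :, t].T).T for t in range(T)], axis=2)
assert H3.shape == (C, Fp, T)
```

Both passes preserve the (C, F', T) shape, so the decoder can consume the third high-dimensional spectrum directly.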
The decoder 304 in the first sub-module is configured to decode the third high-dimensional Fourier spectrum and output the noise-reduced Fourier spectrum. The decoder 304 in the second sub-module is configured to decode the third high-dimensional Fourier spectrum and output the noise-reduced complex mask. Specifically, the decoder 304 in each sub-module employs a convolutional neural network and includes three complex deconvolution layers. The third high-dimensional Fourier spectrum input to the decoder 304 may be denoted as H' ∈ R^(C×F'×T), the output Fourier spectrum as X ∈ R^(2×F×T), and the output complex mask as M ∈ R^(2×F×T).
The synthesis filter 103 is configured to convert the Fourier spectrum output by the noise reduction module 102, or the Fourier spectrum corresponding to the complex mask, into clean speech corresponding to the noisy speech. The synthesis filter 103 uses a short-time inverse Fourier transform with a frame length of 512 and a frame shift of 128. The input of the synthesis filter 103 is the enhanced Fourier spectrum X ∈ R^(2×F×T), and the output is the enhanced speech x̂ ∈ R^L, where L is the number of speech sampling points, F is the number of Fourier frequency points (256), and T is the number of frames.
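The analysis/synthesis filter pair (frame length 512, frame shift 128) can be sketched as a windowed short-time Fourier transform with an overlap-add inverse. The Hann window and normalization are assumptions for illustration; note that a 512-point one-sided FFT yields 257 bins, so the patent's F = 256 presumably drops one bin in its implementation.

```python
import numpy as np

frame, hop = 512, 128       # frame length and frame shift from the text
w = np.hanning(frame)       # assumed analysis/synthesis window

def stft(x):
    """Analysis filter sketch: windowed frames -> one-sided Fourier spectra."""
    n = 1 + (len(x) - frame) // hop
    return np.stack([np.fft.rfft(w * x[i * hop:i * hop + frame]) for i in range(n)])

def istft(X, length):
    """Synthesis filter sketch: inverse FFT per frame, then windowed overlap-add."""
    y = np.zeros(length)
    norm = np.zeros(length)
    for i, spec in enumerate(X):
        y[i * hop:i * hop + frame] += w * np.fft.irfft(spec, frame)
        norm[i * hop:i * hop + frame] += w * w
    return y / np.maximum(norm, 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
X = stft(x)
x_rec = istft(X, len(x))
# Reconstruction is exact wherever the overlapped window energy is non-zero,
# i.e. everywhere except the very edges of the signal.
assert np.allclose(x[frame:-frame], x_rec[frame:-frame], atol=1e-8)
```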
Fig. 4 illustrates a training method of a speech enhancement model according to an embodiment of the present application. This method is used to train the speech enhancement model 100 shown in fig. 1. As shown in fig. 4, the method includes the following steps S401 to S402.
In step S401, a speech training set corresponding to a vocal tract is acquired.
The voice training set comprises a plurality of noisy voice samples and a plurality of clean voice samples, wherein the noisy voice samples and the clean voice samples are in one-to-one correspondence.
In step S402, a speech enhancement model corresponding to the vocal tract is determined using the speech training set. The specific structure of the speech enhancement model is shown in fig. 1 and will not be described herein. The order in which the sub-modules of the speech enhancement model process the spectrum may be determined according to the signal-to-noise ratio of the vocal tract, and the number of each type of sub-module in the speech enhancement model may be determined according to the computing capability of the device to which the speech enhancement model is applied.
Specifically, the step S402 of determining the speech enhancement model corresponding to the vocal tract by using the speech training set specifically includes:
s4021, inputting the noisy speech into a speech enhancement model to obtain clean speech output by the speech enhancement model.
The analysis filter converts the input noisy speech samples into a first Fourier spectrum;
when the signal-to-noise ratio of the sound channel is smaller than a preset value, the first enhancement module outputs a second Fourier spectrum based on the first Fourier spectrum, the first Fourier spectrum and the second Fourier spectrum are spliced into a third Fourier spectrum, and the second enhancement module outputs a first complex mask based on the third Fourier spectrum, and the first complex mask is converted into a fourth Fourier spectrum;
When the signal-to-noise ratio of the sound channel is not smaller than the preset value, the second enhancement module outputs a second complex mask based on the first Fourier spectrum, converts the second complex mask into a fifth Fourier spectrum, splices the first Fourier spectrum and the fifth Fourier spectrum into a sixth Fourier spectrum, and the first enhancement module outputs a seventh Fourier spectrum based on the sixth Fourier spectrum;
the synthesis filter module converts the fourth fourier spectrum or the seventh fourier spectrum into clean speech;
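The SNR-dependent processing order of S4021 can be sketched as control flow with stub modules. The stubs (identity-style enhancement, all-ones mask) and the (channels, F, T) array layout are placeholders for the trained networks, purely to show the two orderings and the splicing points.

```python
import numpy as np

# Stub stand-ins for the trained enhancement networks.
def first_enhancement(spec):      # outputs a noise-reduced Fourier spectrum
    return spec[:2]               # keep 2 output channels regardless of input channels

def second_enhancement(spec):     # outputs a complex mask
    return np.ones_like(spec[:2])

def mask_to_spectrum(mask, ref):  # convert a complex mask into a spectrum
    return mask * ref

def forward(first_spec, snr, snr_threshold=0.0):
    if snr < snr_threshold:
        # Low SNR: first enhancement module runs first.
        second = first_enhancement(first_spec)                 # second Fourier spectrum
        third = np.concatenate([first_spec, second], axis=0)   # spliced third spectrum
        mask = second_enhancement(third)                       # first complex mask
        return mask_to_spectrum(mask, first_spec)              # fourth Fourier spectrum
    else:
        # High SNR: second enhancement module runs first.
        mask = second_enhancement(first_spec)                  # second complex mask
        fifth = mask_to_spectrum(mask, first_spec)             # fifth Fourier spectrum
        sixth = np.concatenate([first_spec, fifth], axis=0)    # spliced sixth spectrum
        return first_enhancement(sixth)                        # seventh Fourier spectrum

spec = np.random.randn(2, 256, 10)
assert forward(spec, snr=-5.0).shape == (2, 256, 10)
assert forward(spec, snr=10.0).shape == (2, 256, 10)
```

Either branch ends with a (2, F, T) spectrum that the synthesis filter module would then convert into clean speech.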
S4022, determining a time-domain loss value between the clean speech and the clean speech sample according to a time-domain loss function; when the signal-to-noise ratio of the sound channel is smaller than the preset value, determining a frequency-domain loss value between the fourth Fourier spectrum and the eighth Fourier spectrum according to a frequency-domain loss function; and when the signal-to-noise ratio of the sound channel is not smaller than the preset value, determining a frequency-domain loss value between the seventh Fourier spectrum and the eighth Fourier spectrum according to the frequency-domain loss function, wherein the eighth Fourier spectrum is determined according to the clean speech sample.
S4023, updating the voice noise reduction model according to the time-domain loss value and the frequency-domain loss value. Updating the voice noise reduction model includes updating the network parameters of each first sub-module and each second sub-module in the voice noise reduction model.
Specifically, the following formula may be used to determine the first loss value L from the time domain loss value and the frequency domain loss value, and update the speech noise reduction model using a gradient descent method based on the first loss value L.
L=L_audio+L_spectral
where L_audio is the time-domain loss between the clean speech x̂ and the clean speech sample y, and L_spectral is the frequency-domain loss; |X̂_r| and |X̂_i| are respectively the real part and the imaginary part of the spectrum corresponding to the clean speech, and |Y_r| and |Y_i| are respectively the real part and the imaginary part of the spectrum corresponding to the clean speech sample.
Fig. 5 is a training device for a speech noise reduction model according to an embodiment of the present application.
As shown in fig. 5, the training apparatus 500 includes:
an obtaining module 501, configured to obtain a speech training set corresponding to a vocal tract; the voice training set comprises a plurality of noisy voice samples and a plurality of clean voice samples, wherein the noisy voice samples and the clean voice samples are in one-to-one correspondence;
the training module 502 is configured to determine a speech noise reduction model corresponding to the vocal tract by using the speech training set; the speech noise reduction model includes: an analysis filter, a first enhancement module, a second enhancement module, and a synthesis filter module.
The training module 502 is specifically configured to:
converting the input noisy speech samples into a first fourier spectrum using the analysis filter;
When the signal-to-noise ratio of the sound channel is smaller than a preset value, the first enhancement module outputs a second Fourier spectrum based on the first Fourier spectrum, the first Fourier spectrum and the second Fourier spectrum are spliced into a third Fourier spectrum, the second enhancement module outputs a first complex mask based on the third Fourier spectrum, the first complex mask is converted into a fourth Fourier spectrum, and the synthesis filter module converts the fourth Fourier spectrum into clean voice;
when the signal-to-noise ratio of the sound channel is not smaller than the preset value, the second enhancement module outputs a second complex mask based on the first Fourier spectrum, converts the second complex mask into a fifth Fourier spectrum, splices the first Fourier spectrum and the fifth Fourier spectrum into a sixth Fourier spectrum, the first enhancement module outputs a seventh Fourier spectrum based on the sixth Fourier spectrum, and the synthesis filter module converts the seventh Fourier spectrum into clean voice;
and updating the voice noise reduction model according to the clean voice and the clean voice sample corresponding to the input noisy voice sample.
Fig. 6 is a schematic diagram of a voice enhancement method applied to a terminal device according to an embodiment of the present application.
As shown in fig. 6, the method includes the following steps S601 to S602.
In step S601, a speech noise reduction model corresponding to a vocal tract is acquired. The voice noise reduction model corresponding to the vocal tract can be obtained by training the method described in fig. 4, which is not described herein.
In step S602, noise reduction processing is performed on the target voice transmitted by the vocal tract by using the voice noise reduction model, so as to obtain clean voice.
Fig. 7 is a schematic structural diagram of a voice enhancement device according to an embodiment of the present application.
As shown in fig. 7, the voice enhancement apparatus 700 includes:
an obtaining module 701, configured to obtain a voice noise reduction model corresponding to a vocal tract;
and the processing module 702 is configured to perform noise reduction processing on the target voice transmitted by the sound channel by using the voice noise reduction model, so as to obtain clean voice.
Fig. 8 is a schematic diagram of a hardware architecture of a computing device 800 according to an embodiment of the present application.
The computing device 800 may be the model training device described above or the terminal device described above. With reference to fig. 8, the computing device 800 includes a processor 801, a memory 802, a communication interface 803, and a bus 804, the processor 801, the memory 802, and the communication interface 803 being connected to each other by the bus 804. The processor 801, the memory 802, and the communication interface 803 may also be connected by other connection means than the bus 804.
The memory 802 may be various types of storage media, such as random access memory (random access memory, RAM), read-only memory (ROM), nonvolatile RAM (NVRAM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (electrically erasable PROM, EEPROM), flash memory, optical memory, hard disk, and the like.
Where the processor 801 may be a general-purpose processor, the general-purpose processor may be a processor that performs certain steps and/or operations by reading and executing content stored in a memory (e.g., memory 802). For example, the general purpose processor may be a central processing unit (central processing unit, CPU). The processor 801 may include at least one circuit to perform all or part of the steps of the methods provided by the embodiments shown in fig. 4 or fig. 6.
Among other things, communication interface 803 includes input/output (I/O) interfaces, physical interfaces, logical interfaces, and the like for enabling interconnection of devices within computing device 800, as well as interfaces for enabling interconnection of computing device 800 with other devices (e.g., other computing devices or terminal devices). The physical interface may be an ethernet interface, a fiber optic interface, an ATM interface, etc.
Where bus 804 may be any type of communication bus, such as a system bus, that interconnects processor 801, memory 802, and communication interface 803.
The above devices may be provided on separate chips, or may be provided at least partially or entirely on the same chip. Whether the individual devices are independently disposed on different chips or integrally disposed on one or more chips is often dependent on the needs of the product design. The embodiment of the application does not limit the specific implementation form of the device.
The computing device 800 shown in fig. 8 is merely exemplary, and in implementation, the computing device 800 may include other components, which are not listed here.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application. It should be understood that, in the embodiment of the present application, the sequence number of each process does not mean the sequence of execution, and the execution sequence of each process should be determined by the function and the internal logic of each process, and should not be limited in any way to the implementation process of the embodiment of the present application.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not intended to limit the scope of the invention.

Claims (10)

1. A method of training a speech noise reduction model, the method comprising:
acquiring a voice training set corresponding to a sound channel; the voice training set comprises a plurality of noisy voice samples and a plurality of clean voice samples, wherein the noisy voice samples and the clean voice samples are in one-to-one correspondence;
Determining a voice noise reduction model corresponding to the sound channel by utilizing the voice training set; the speech noise reduction model includes: an analysis filter, a first enhancement module, a second enhancement module, and a synthesis filter module;
wherein the determining, by using the speech training set, a speech noise reduction model corresponding to the vocal tract includes:
the analysis filter converts the input noisy speech samples into a first fourier spectrum;
when the signal-to-noise ratio of the sound channel is smaller than a preset value, the first enhancement module outputs a second Fourier spectrum based on the first Fourier spectrum, the first Fourier spectrum and the second Fourier spectrum are spliced into a third Fourier spectrum, the second enhancement module outputs a first complex mask based on the third Fourier spectrum, the first complex mask is converted into a fourth Fourier spectrum, and the synthesis filter module converts the fourth Fourier spectrum into clean voice;
when the signal-to-noise ratio of the sound channel is not smaller than the preset value, the second enhancement module outputs a second complex mask based on the first Fourier spectrum, converts the second complex mask into a fifth Fourier spectrum, splices the first Fourier spectrum and the fifth Fourier spectrum into a sixth Fourier spectrum, the first enhancement module outputs a seventh Fourier spectrum based on the sixth Fourier spectrum, and the synthesis filter module converts the seventh Fourier spectrum into clean voice;
And updating the voice noise reduction model according to the clean voice and the clean voice sample corresponding to the input noisy voice sample.
2. The method of claim 1, wherein the first enhancement module comprises a plurality of first sub-modules and the second enhancement module comprises a plurality of second sub-modules;
the number of the first sub-modules and the number of the second sub-modules are determined according to the signal to noise ratio of the sound channel;
the first submodule is used for carrying out noise reduction processing on the input Fourier spectrum and outputting the Fourier spectrum after noise reduction; the second submodule is used for carrying out noise reduction processing on the input Fourier spectrum and outputting complex masking after noise reduction;
when the signal-to-noise ratio of the sound channel is smaller than a preset value, the Fourier spectrum input to the nth first sub-module is determined according to the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (n-i)th first sub-module, and the Fourier spectrum input to the mth second sub-module is determined according to the Fourier spectrum output by the analysis filter, the Fourier spectrum output by each first sub-module, and the Fourier spectrum corresponding to the complex mask output by the (m-j)th second sub-module;
when the signal-to-noise ratio of the sound channel is not smaller than the preset value, the Fourier spectrum input to the mth second sub-module is determined according to the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (m-j)th second sub-module, and the Fourier spectrum input to the nth first sub-module is determined according to the Fourier spectrum output by the analysis filter, the Fourier spectrum corresponding to the complex mask output by each second sub-module, and the Fourier spectrum output by the (n-i)th first sub-module;
wherein N is the number of the first sub-modules, M is the number of the second sub-modules, n ∈ [1, N], m ∈ [1, M], i ∈ [1, n), and j ∈ [1, m).
3. The method of claim 2, wherein the first sub-module and the second sub-module each comprise: an encoder, a time domain recurrent neural network, a frequency domain recurrent neural network, and a decoder;
the encoder is used for encoding the input Fourier spectrum to obtain a first high-dimensional Fourier spectrum;
the time domain circulating neural network is used for determining a second high-dimensional Fourier spectrum according to the characteristic data of each sub-band in the first high-dimensional Fourier spectrum;
the frequency domain recurrent neural network is used for determining a third high-dimensional Fourier spectrum according to the characteristic data of each time point in the second high-dimensional Fourier spectrum;
The decoder in the first sub-module is used for decoding the third high-dimensional Fourier spectrum and outputting the Fourier spectrum after noise reduction;
and the decoder in the second sub-module is used for decoding the third high-dimensional Fourier spectrum and outputting the complex mask after noise reduction.
4. The method of claim 1, wherein the determining a speech noise reduction model for the vocal tract using the speech training set further comprises:
determining a time domain loss value between the clean speech and the clean speech sample according to a time domain loss function;
determining a frequency domain loss value between the fourth fourier spectrum and the eighth fourier spectrum according to a frequency domain loss function, or determining a third loss value between the seventh fourier spectrum and the eighth fourier spectrum according to the frequency domain loss function, the eighth fourier spectrum being determined from clean speech samples;
updating the speech noise reduction model according to the time domain loss value and the frequency domain loss value, or updating the speech noise reduction model according to the time domain loss value and the third loss value.
5. A method of speech enhancement, the method comprising:
Acquiring a voice noise reduction model corresponding to a sound channel;
and carrying out noise reduction processing on the target voice transmitted by the sound channel by utilizing the voice noise reduction model to obtain clean voice.
6. A training device for a speech noise reduction model, the training device comprising:
the acquisition module is used for acquiring a voice training set corresponding to the sound channel; the voice training set comprises a plurality of noisy voice samples and a plurality of clean voice samples, wherein the noisy voice samples and the clean voice samples are in one-to-one correspondence;
the training module is used for determining a voice noise reduction model corresponding to the sound channel by utilizing the voice training set; the speech noise reduction model includes: an analysis filter, a first enhancement module, a second enhancement module, and a synthesis filter module;
the training module is specifically configured to:
converting the input noisy speech samples into a first fourier spectrum using the analysis filter;
when the signal-to-noise ratio of the sound channel is smaller than a preset value, the first enhancement module outputs a second Fourier spectrum based on the first Fourier spectrum, the first Fourier spectrum and the second Fourier spectrum are spliced into a third Fourier spectrum, the second enhancement module outputs a first complex mask based on the third Fourier spectrum, the first complex mask is converted into a fourth Fourier spectrum, and the synthesis filter module converts the fourth Fourier spectrum into clean voice;
When the signal-to-noise ratio of the sound channel is not smaller than the preset value, the second enhancement module outputs a second complex mask based on the first Fourier spectrum, converts the second complex mask into a fifth Fourier spectrum, splices the first Fourier spectrum and the fifth Fourier spectrum into a sixth Fourier spectrum, the first enhancement module outputs a seventh Fourier spectrum based on the sixth Fourier spectrum, and the synthesis filter module converts the seventh Fourier spectrum into clean voice;
and updating the voice noise reduction model according to the clean voice and the clean voice sample corresponding to the input noisy voice sample.
7. The training device of claim 6, wherein the first enhancement module comprises a plurality of first sub-modules and the second enhancement module comprises a plurality of second sub-modules;
the number of the first sub-modules and the number of the second sub-modules are determined according to the signal to noise ratio of the sound channel;
the first submodule is used for carrying out noise reduction processing on the input Fourier spectrum and outputting the Fourier spectrum after noise reduction;
the second submodule is used for carrying out noise reduction processing on the input Fourier spectrum and outputting complex masking after noise reduction;
when the signal-to-noise ratio of the sound channel is smaller than a preset value, the Fourier spectrum input to the nth first sub-module is determined according to the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (n-i)th first sub-module, and the Fourier spectrum input to the mth second sub-module is determined according to the Fourier spectrum output by the analysis filter, the Fourier spectrum output by each first sub-module, and the Fourier spectrum corresponding to the complex mask output by the (m-j)th second sub-module;
when the signal-to-noise ratio of the sound channel is not smaller than the preset value, the Fourier spectrum input to the mth second sub-module is determined according to the Fourier spectrum output by the analysis filter and the Fourier spectrum output by the (m-j)th second sub-module, and the Fourier spectrum input to the nth first sub-module is determined according to the Fourier spectrum output by the analysis filter, the Fourier spectrum corresponding to the complex mask output by each second sub-module, and the Fourier spectrum output by the (n-i)th first sub-module;
wherein N is the number of the first sub-modules, M is the number of the second sub-modules, n ∈ [1, N], m ∈ [1, M], i ∈ [1, n), and j ∈ [1, m).
8. The training device of claim 7, wherein the first sub-module and the second sub-module each comprise: an encoder, a time domain recurrent neural network, a frequency domain recurrent neural network, and a decoder;
The encoder is used for encoding the input Fourier spectrum to obtain a first high-dimensional Fourier spectrum;
the time domain circulating neural network is used for determining a second high-dimensional Fourier spectrum according to the characteristic data of each sub-band in the first high-dimensional Fourier spectrum;
the frequency domain recurrent neural network is used for determining a third high-dimensional Fourier spectrum according to the characteristic data of each time point in the second high-dimensional Fourier spectrum;
the decoder in the first sub-module is used for decoding the third high-dimensional Fourier spectrum and outputting the Fourier spectrum after noise reduction;
and the decoder in the second sub-module is used for decoding the third high-dimensional Fourier spectrum and outputting the complex mask after noise reduction.
9. The training device of claim 6, wherein the training module is further configured to:
determining a time domain loss value between the clean speech and the clean speech sample according to a time domain loss function;
determining a frequency domain loss value between the fourth fourier spectrum and the eighth fourier spectrum according to a frequency domain loss function, or determining a third loss value between the seventh fourier spectrum and the eighth fourier spectrum according to the frequency domain loss function, the eighth fourier spectrum being determined from clean speech samples;
Updating the speech noise reduction model according to the time domain loss value and the frequency domain loss value, or updating the speech noise reduction model according to the time domain loss value and the third loss value.
10. A speech enhancement apparatus, the speech enhancement apparatus comprising:
the acquisition module is used for acquiring a voice noise reduction model corresponding to the sound channel;
and the processing module is used for carrying out noise reduction processing on the target voice transmitted by the sound channel by utilizing the voice noise reduction model to obtain clean voice.
CN202111353720.2A 2021-11-16 2021-11-16 Training method of voice noise reduction model and voice enhancement method Pending CN116137153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111353720.2A CN116137153A (en) 2021-11-16 2021-11-16 Training method of voice noise reduction model and voice enhancement method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111353720.2A CN116137153A (en) 2021-11-16 2021-11-16 Training method of voice noise reduction model and voice enhancement method

Publications (1)

Publication Number Publication Date
CN116137153A true CN116137153A (en) 2023-05-19

Family

ID=86332928

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111353720.2A Pending CN116137153A (en) 2021-11-16 2021-11-16 Training method of voice noise reduction model and voice enhancement method

Country Status (1)

Country Link
CN (1) CN116137153A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133303A (en) * 2023-10-26 2023-11-28 荣耀终端有限公司 Voice noise reduction method, electronic equipment and medium
CN117133303B (en) * 2023-10-26 2024-03-29 荣耀终端有限公司 Voice noise reduction method, electronic equipment and medium

Similar Documents

Publication Publication Date Title
Lin et al. Speech enhancement using multi-stage self-attentive temporal convolutional networks
US7676374B2 (en) Low complexity subband-domain filtering in the case of cascaded filter banks
US20130024191A1 (en) Audio communication device, method for outputting an audio signal, and communication system
Schröter et al. Deepfilternet2: Towards real-time speech enhancement on embedded devices for full-band audio
CN116030792B (en) Method, apparatus, electronic device and readable medium for converting voice tone
CN112259116A (en) Method and device for reducing noise of audio data, electronic equipment and storage medium
CN113470688B (en) Voice data separation method, device, equipment and storage medium
CN116994564B (en) Voice data processing method and processing device
Ueda et al. Environment-dependent denoising autoencoder for distant-talking speech recognition
CN115602165A (en) Digital staff intelligent system based on financial system
CN114898762A (en) Real-time voice noise reduction method and device based on target person and electronic equipment
CN116137153A (en) Training method of voice noise reduction model and voice enhancement method
Jiang et al. An Improved Unsupervised Single‐Channel Speech Separation Algorithm for Processing Speech Sensor Signals
CN108172214A (en) A kind of small echo speech recognition features parameter extracting method based on Mel domains
CN117133307A (en) Low-power consumption mono voice noise reduction method, computer device and computer readable storage medium
CN112687284B (en) Reverberation suppression method and device for reverberation voice
Tan et al. Multichannel noise reduction using dilated multichannel U-net and pre-trained single-channel network
US20240135954A1 (en) Learning method for integrated noise echo cancellation system using multi-channel based cross-tower network
CN113763976B (en) Noise reduction method and device for audio signal, readable medium and electronic equipment
CN113345465B (en) Voice separation method, device, equipment and computer readable storage medium
CN114783455A (en) Method, apparatus, electronic device and computer readable medium for voice noise reduction
CN113066483B (en) Sparse continuous constraint-based method for generating countermeasure network voice enhancement
Li et al. Dynamic attention based generative adversarial network with phase post-processing for speech enhancement
CN113012711A (en) Voice processing method, device and equipment
Zhang et al. URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination