CN114974281A - Training method and device of voice noise reduction model, storage medium and electronic device - Google Patents

Training method and device of voice noise reduction model, storage medium and electronic device

Info

Publication number
CN114974281A
CN114974281A
Authority
CN
China
Prior art keywords
voice
data
spectrum
noise reduction
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210567936.7A
Other languages
Chinese (zh)
Inventor
关海欣
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202210567936.7A priority Critical patent/CN114974281A/en
Publication of CN114974281A publication Critical patent/CN114974281A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • G10L21/0216Noise filtering characterised by the method used for estimating noise
    • G10L21/0232Processing in the frequency domain

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention discloses a training method and device for a speech noise reduction model, a storage medium, and an electronic device. In the training method, a spectral loss function (a form of signal envelope loss) is added to a signal-based loss function, so that the total loss function preserves speech information better while reducing noise and improves audibility. This addresses the prior-art problem that speech noise reduction models retain too little speech information during noise reduction, causing large speech damage.

Description

Training method and device of voice noise reduction model, storage medium and electronic device
Technical Field
The invention relates to the field of speech noise reduction model training, and in particular to a training method and device for a speech noise reduction model, a storage medium, and an electronic device.
Background
Single-channel deep-learning noise reduction has made remarkable progress, and its performance is clearly superior to that of traditional signal-processing methods. U-Net-based models are currently one of the mainstream technical solutions.
However, although these methods achieve excellent noise-reduction performance, they are often accompanied by speech damage. This is especially evident when the model has few parameters, and it degrades the audibility of the speech.
The loss function is a key factor in the performance of a deep-learning model. Many loss functions are used in current speech noise reduction, but their biggest problem is that they operate on the signal itself and do not fully consider the characteristics of speech.
Speech can essentially be decomposed into an envelope and harmonics. The harmonics can be regarded as a carrier signal, while the envelope carries the more important information such as semantics and vocal-tract shape. For example, even when a person's vocal cords do not vibrate, the other party can still understand the content when communicating in a whisper. Fundamentally, then, semantic information is preserved as long as the envelope information is not lost, and in auditory perception the envelope can be described by a MEL spectrum or a Bark spectrum.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
Embodiments of the invention provide a training method and device for a speech noise reduction model, a storage medium, and an electronic device, to at least solve the prior-art problem that a speech noise reduction model retains too little speech information while reducing noise, causing large speech damage.
According to an aspect of the embodiments of the present invention, a method for training a speech noise reduction model is provided, including: acquiring speech training sample data, where the speech training sample data carries noise data; performing feature extraction on the speech training sample data to obtain speech feature data; inputting the speech feature data into a preset speech noise reduction model and outputting predicted speech data, where the predicted speech data does not include noise data; and, when both a signal loss function and a spectral loss function in the preset speech noise reduction model satisfy preset conditions, ending the training of the preset speech noise reduction model to obtain a target speech noise reduction model, where the signal loss function is formed from the predicted speech data and clean speech data, the spectral loss function is formed from the predicted spectral data of the predicted speech data and the spectral data of the clean speech data, and the clean speech data is training sample data without noise data.
Optionally, forming the spectral loss function from the predicted spectral data of the predicted speech data and the spectral data of the clean speech data includes: converting the speech feature data from the frequency domain to a spectral-energy domain through a spectrum conversion matrix to obtain a first spectral-energy feature; converting the speech feature data of the clean speech from the frequency domain to the spectral-energy domain through the spectrum conversion matrix to obtain a second spectral-energy feature; and forming the spectral loss function based on the first spectral-energy feature and the second spectral-energy feature.
Optionally, constructing the spectral loss function based on the first spectral-energy feature and the second spectral-energy feature includes calculating the spectral loss function by the following formula:

$$L_{\text{spec}} = \frac{1}{T}\sum_{t,b}\left|M_s(t,b)^{1/3} - M_{s2}(t,b)^{1/3}\right|$$

$$M_s(t,b) = |F_s| \cdot M_S, \qquad M_{s2}(t,b) = |F_{s2}| \cdot M_S$$

where T is the number of frames in a speech segment, Ms(t,b) denotes the first spectral-energy feature, Ms2(t,b) the second spectral-energy feature, |·| the absolute value (magnitude), Fs the noisy-speech spectrum, Fs2 the clean-speech spectrum, and MS the preset matrix.
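The claimed spectral loss can be sketched in a few lines of NumPy. This is an illustrative reading only, assuming spectrograms of shape (T, F) and a precomputed F × B spectrum-conversion matrix; the function and variable names are not from the patent:

```python
import numpy as np

def spectral_loss(Fs, Fs2, MS):
    """Envelope-domain spectral loss between two spectrograms.

    Fs, Fs2 : (T, F) spectrograms (T frames, F frequency bins)
    MS      : (F, B) spectrum-conversion (filterbank) matrix
    """
    T = Fs.shape[0]
    Ms = np.abs(Fs) @ MS     # (T, B) first spectral-energy feature
    Ms2 = np.abs(Fs2) @ MS   # (T, B) second spectral-energy feature
    # cube-root compression before the absolute difference, averaged over frames
    return np.sum(np.abs(np.cbrt(Ms) - np.cbrt(Ms2))) / T
```

The loss is zero when the two spectra coincide and grows as their envelopes diverge.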
Optionally, performing feature extraction on the speech training sample data to obtain the speech feature data includes performing feature extraction through one of the following models: a convolutional neural network, a recurrent neural network, or a fully connected neural network.
Optionally, before acquiring the speech training sample data, the method further includes: mixing noise-free target speech data with multiple types of noise at different signal-to-noise ratios to obtain the speech training sample data.
According to an aspect of the embodiments of the present invention, a speech noise reduction method is provided, including: acquiring target speech information; and inputting the target speech information into a target noise reduction speech model and outputting noise-reduced speech information, where the target noise reduction speech model includes a speech noise reduction model obtained by any of the training methods above.
According to an aspect of the embodiments of the present invention, a training apparatus for a speech noise reduction model is provided, including: an acquisition unit configured to acquire speech training sample data, where the speech training sample data carries noise data; a feature extraction unit configured to perform feature extraction on the speech training sample data to obtain speech feature data; a prediction unit configured to input the speech feature data into a preset speech noise reduction model and output predicted speech data, where the predicted speech data does not include noise data; and a training unit configured to end the training of the preset speech noise reduction model to obtain a target speech noise reduction model when both a signal loss function and a spectral loss function in the preset speech noise reduction model satisfy preset conditions, where the signal loss function is formed from the predicted speech data and clean speech data, the spectral loss function is formed from the predicted spectral data of the predicted speech data and the spectral data of the clean speech data, and the clean speech data is training sample data without noise data.
Optionally, the training unit includes: a first conversion module configured to convert the speech feature data from the frequency domain to a spectral-energy domain through a spectrum conversion matrix to obtain a first spectral-energy feature; a second conversion module configured to convert the speech feature data of the clean speech from the frequency domain to the spectral-energy domain through the spectrum conversion matrix to obtain a second spectral-energy feature; and a construction module configured to construct the spectral loss function based on the first spectral-energy feature and the second spectral-energy feature.
Optionally, the construction module is further configured to calculate the spectral loss function by the following formula:

$$L_{\text{spec}} = \frac{1}{T}\sum_{t,b}\left|M_s(t,b)^{1/3} - M_{s2}(t,b)^{1/3}\right|$$

$$M_s(t,b) = |F_s| \cdot M_S, \qquad M_{s2}(t,b) = |F_{s2}| \cdot M_S$$

where T is the number of frames in a speech segment, Ms(t,b) denotes the first spectral-energy feature, Ms2(t,b) the second spectral-energy feature, |·| the absolute value (magnitude), Fs the noisy-speech spectrum, Fs2 the clean-speech spectrum, and MS the preset matrix.
Optionally, the feature extraction unit includes a feature extraction module configured to perform feature extraction on the speech training sample data through one of the following models to obtain the speech feature data: a convolutional neural network, a recurrent neural network, or a fully connected neural network.
Optionally, the apparatus further includes a mixing unit configured to mix noise-free target speech data with multiple types of noise at different signal-to-noise ratios before the speech training sample data is acquired, to obtain the speech training sample data.
According to another aspect of embodiments of the present application, a computer-readable storage medium is provided, the storage medium storing a computer program configured to execute the above training method for a speech noise reduction model when run.
According to another aspect of embodiments of the present application, an electronic device is provided, including a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the above training method for a speech noise reduction model.
According to another aspect of embodiments of the present application, a computer-readable storage medium is provided, the storage medium storing a computer program configured to execute the above speech noise reduction method when run.
According to another aspect of embodiments of the present application, an electronic device is provided, including a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the above speech noise reduction method.
In the embodiments of the invention, speech training sample data carrying noise data is acquired; feature extraction is performed on the speech training sample data to obtain speech feature data; the speech feature data is input into a preset speech noise reduction model, which outputs predicted speech data that does not include noise data; and when both the signal loss function and the spectral loss function in the preset speech noise reduction model satisfy preset conditions, training ends and a target speech noise reduction model is obtained, where the signal loss function is formed from the predicted speech data and clean speech data, the spectral loss function is formed from the predicted spectral data of the predicted speech data and the spectral data of the clean speech data, and the clean speech data is training sample data without noise data. In this embodiment, a spectral loss function (a form of signal envelope loss) is added to the signal-based loss function, so that the total loss function preserves speech information better while reducing noise and improves audibility, thereby at least solving the prior-art problem that a speech noise reduction model retains too little speech information while reducing noise, causing large speech damage.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a mobile terminal on which an alternative training method for a speech noise reduction model according to an embodiment of the present invention runs;
FIG. 2 is a flow chart of an alternative method of training a speech noise reduction model according to an embodiment of the present invention;
FIG. 3 is a structural block diagram of an alternative asymmetric encoder-decoder speech noise reduction model according to an embodiment of the present invention;
FIG. 4 is a diagram of an alternative training apparatus for a speech noise reduction model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the training method of the speech noise reduction model provided by the embodiment of the application can be executed in a mobile terminal, a computer terminal or a similar operation device. Taking an example of the method running on a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of the training method of a speech noise reduction model according to the embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the training method of the speech noise reduction model in the embodiment of the present invention, and the processor 102 executes the computer program stored in the memory 104 to execute various functional applications and data processing, i.e., to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Fig. 2 is a flowchart of a training method of a speech noise reduction model according to an embodiment of the present invention, and as shown in fig. 2, the flow of the training method of the speech noise reduction model includes the following steps:
step S202, voice training sample data is obtained, wherein the voice training sample data carries noise data.
And step S204, performing feature extraction on the voice training sample data to obtain voice feature data.
Step S206, inputting the voice characteristic data into a preset voice noise reduction model, and outputting predicted voice data, wherein the predicted voice data does not include noise data.
Step S208: when both the signal loss function and the spectral loss function in the preset speech noise reduction model satisfy preset conditions, end the training of the preset speech noise reduction model to obtain the target speech noise reduction model, where the signal loss function is formed from the predicted speech data and clean speech data, the spectral loss function is formed from the predicted spectral data of the predicted speech data and the spectral data of the clean speech data, and the clean speech data is training sample data without noise data.
In this embodiment, the trained target speech noise reduction model can denoise speech information in different scenarios, for example denoising user speech data in a dialog scenario or during a voice call. The denoised speech data can be used for speech recognition, improving recognition accuracy, and can also be applied to voice communication to improve call quality.
The speech training samples are obtained by mixing noise-free clean speech data with noise data at different signal-to-noise ratios. Mixing methods include, but are not limited to, mixing clean speech data with different types of noise data at the same signal-to-noise ratio, and mixing clean speech data with each type of noise data at different signal-to-noise ratios.
With the embodiment provided by the application, speech training sample data carrying noise data is acquired; feature extraction is performed on the speech training sample data to obtain speech feature data; the speech feature data is input into the preset speech noise reduction model, which outputs predicted speech data without noise data; and when both the signal loss function and the spectral loss function in the preset speech noise reduction model satisfy preset conditions, training ends and the target speech noise reduction model is obtained, where the signal loss function is formed from the predicted speech data and clean speech data, the spectral loss function is formed from the predicted spectral data of the predicted speech data and the spectral data of the clean speech data, and the clean speech data is training sample data without noise data. In this embodiment, a spectral loss function (a form of signal envelope loss) is added to the signal-based loss function, so that the total loss function preserves speech information better while reducing noise and improves audibility, thereby at least solving the prior-art problem that a speech noise reduction model retains too little speech information while reducing noise, causing large speech damage.
Optionally, constructing the spectral loss function from the predicted spectral data of the predicted speech data and the spectral data of the clean speech data may include: converting the speech feature data from the frequency domain to a spectral-energy domain through a spectrum conversion matrix to obtain a first spectral-energy feature; converting the speech feature data of the clean speech from the frequency domain to the spectral-energy domain through the spectrum conversion matrix to obtain a second spectral-energy feature; and constructing the spectral loss function based on the first spectral-energy feature and the second spectral-energy feature.
Optionally, constructing the spectral loss function based on the first spectral-energy feature and the second spectral-energy feature may include calculating the spectral loss function by the following formula:

$$L_{\text{spec}} = \frac{1}{T}\sum_{t,b}\left|M_s(t,b)^{1/3} - M_{s2}(t,b)^{1/3}\right|$$

$$M_s(t,b) = |F_s| \cdot M_S, \qquad M_{s2}(t,b) = |F_{s2}| \cdot M_S$$

where T is the number of frames in a speech segment, Ms(t,b) denotes the first spectral-energy feature, Ms2(t,b) the second spectral-energy feature, |·| the absolute value (magnitude), Fs the noisy-speech spectrum, Fs2 the clean-speech spectrum, and MS the preset matrix.
In this embodiment, the preset matrix MS is a transform matrix derived from human auditory perception, such as a mel filter bank (a Bark filter bank is also possible). Ms is the mel spectrum obtained by applying the transform matrix to the noisy spectrum, and Ms2 is the mel spectrum of the clean speech.
It should be noted that in this embodiment Fs denotes the noisy-speech spectrum and Fs2 the clean-speech spectrum. The spectrum may be a transform-domain spectrum such as an FFT or DCT spectrum; the DCT spectrum is taken as an example here. Since the spectrum is computed every frame, Fs is in fact Fs(t, f), where f denotes a frequency bin. MS is a matrix of size F × B, where F is the total number of frequency bins (for 16 kHz data, F may be chosen as 512) and B is the number of envelope-spectrum bands (the "FilterBank" size), which may be chosen as 80. B trades off noise reduction against speech damage, with an experimental range of 26 to 80. Through this transform matrix, the energy spectrum |Fs| is converted into the envelope spectrum, producing Ms(t, b).
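As a concrete illustration of such an F × B conversion matrix, a triangular mel filter bank can be constructed as below. The patent does not specify the exact construction, so this is only one plausible sketch; the 512 bins and 80 bands follow the example values given above:

```python
import numpy as np

def mel_filterbank(n_bins=512, n_bands=80, sr=16000):
    """Triangular mel filter bank of shape (F, B) = (n_bins, n_bands)."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # band edges equally spaced on the mel scale, mapped back to FFT bins
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_bands + 2)
    bin_edges = np.floor((n_bins - 1) * mel_to_hz(mel_edges) / (sr / 2.0)).astype(int)
    fb = np.zeros((n_bins, n_bands))
    for b in range(n_bands):
        left, center, right = bin_edges[b], bin_edges[b + 1], bin_edges[b + 2]
        for k in range(left, center):      # rising slope of triangle b
            fb[k, b] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope of triangle b
            fb[k, b] = (right - k) / (right - center)
    return fb
```

Multiplying a (T, F) magnitude spectrogram by this matrix yields the (T, B) envelope spectrum Ms(t, b).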
Optionally, performing feature extraction on the speech training sample data to obtain the speech feature data may include performing feature extraction through one of the following models: a convolutional neural network, a recurrent neural network, or a fully connected neural network.
Optionally, before acquiring the speech training sample data, the method may further include mixing noise-free target speech data with multiple types of noise at different signal-to-noise ratios to obtain the speech training sample data.
According to an aspect of the embodiments of the present invention, a speech noise reduction method is provided, including: acquiring target speech information; and inputting the target speech information into a target noise reduction speech model and outputting the noise-reduced speech information, where the target noise reduction speech model includes a speech noise reduction model obtained by any of the training methods above.
As an alternative embodiment, the present application further provides a speech noise reduction model training method based on the loss function.
In this embodiment, FIG. 3 shows a schematic structural diagram of a loss-function-based speech noise reduction model, where the network combines a convolutional neural network, a long short-term memory (LSTM) network, and a fully connected network.
Based on the above model design, the model training procedure is as follows; models with different but similar architectures are trained in the same way.
The whole process comprises a training stage and an application stage, wherein the training stage comprises three steps:
Step 1: data generation. Original clean speech data (corresponding to the original speech data) is mixed with multiple types of noise at different signal-to-noise ratios, and the mixed speech (corresponding to the speech training sample data) serves as the training input data x; the original clean speech s also serves as reference input data, used when computing the model loss.
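Step 1 can be sketched as a minimal NumPy routine (the function name and the exact scaling convention are assumptions, not from the patent):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` and add it to `clean` so that the mixture has the
    requested signal-to-noise ratio in dB, returning the training input x."""
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12   # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Training pairs (x, s) are then produced by sweeping `snr_db` over a range of values and `noise` over several noise types.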
Step 2: feature extraction. Each speech segment in the training data is framed and windowed, and each frame is transformed with the Discrete Cosine Transform (DCT) to convert time-domain features into frequency-domain features Fx(t, f); the same is done for the clean speech to obtain Fs(t, f), used when computing the model loss.
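Step 2 might look like the following. This is illustrative only: the frame length, hop, and window choice are assumptions, and the DCT-II is written out directly with NumPy rather than taken from a library:

```python
import numpy as np

def extract_features(x, frame_len=512, hop=256):
    """Frame, window, and DCT-transform a waveform into features Fx(t, f)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hanning(frame_len)
    # DCT-II basis: basis[k, n] = cos(pi/N * (n + 0.5) * k)
    n = np.arange(frame_len)
    basis = np.cos(np.pi / frame_len * (n[None, :] + 0.5) * n[:, None])
    frames = np.stack([x[t * hop : t * hop + frame_len] * window
                       for t in range(n_frames)])
    return frames @ basis.T   # shape (T, F): T frames, F frequency bins
```

The same routine applied to the clean speech s yields Fs(t, f).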
Step 3: network training. The extracted features are input into the network model for training. Using the signal-approximation method, an implicit mask Mask(t, f) is estimated and multiplied with the noisy-speech features Fx(t, f) to estimate the clean-signal features Fs2(t, f). Fs2(t, f) is then transformed by the inverse DCT (iDCT) and overlap-added to obtain the enhanced time-domain speech $\hat{s}$ (where t denotes the frame index and f the frequency bin).
The enhanced speech $\hat{s}$ and the target speech s are then used to compute the error through a loss function. The loss function is the scale-invariant SNR (SI-SNR), defined by the following formula:

$$s_{\text{target}} = \frac{\langle \hat{s}, s\rangle\, s}{\lVert s\rVert^2}, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}}, \qquad \text{SI-SNR} = 10\log_{10}\frac{\lVert s_{\text{target}}\rVert^2}{\lVert e_{\text{noise}}\rVert^2}$$

where s and $\hat{s}$ denote the clean speech and the estimated speech respectively, $\langle\cdot,\cdot\rangle$ denotes the vector dot product, and $\lVert\cdot\rVert$ is the Euclidean norm.
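The SI-SNR definition above translates directly into NumPy; a small sketch (names illustrative):

```python
import numpy as np

def si_snr(s_hat, s, eps=1e-8):
    """Scale-invariant SNR in dB between estimated speech and clean speech."""
    s_target = np.dot(s_hat, s) * s / (np.dot(s, s) + eps)
    e_noise = s_hat - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))
```

Because the projection onto s removes any global gain on the estimate, rescaling $\hat{s}$ leaves the value unchanged, which is what makes the measure scale-invariant.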
Other loss functions, such as SNR, MSE, or MAE, may also be used. The DCT/iDCT transform may likewise be replaced with the FFT/iFFT or with learnable transform features.
Fs(t, f) and Fs2(t, f) are converted from the frequency domain to the MEL domain through a MEL-spectrum transform matrix Mel(f, b), yielding Ms(t, b) and Ms2(t, b), and the mel-spectrum loss is computed as follows:

$$M_s(t,b) = |F_s|\cdot Mel, \qquad M_{s2}(t,b) = |F_{s2}|\cdot Mel$$

where $|\cdot|$ denotes the absolute value, and the mel loss function is

$$\text{Mel-loss} = \frac{1}{T}\sum_{t,b}\left|M_s(t,b)^{1/3} - M_{s2}(t,b)^{1/3}\right|$$
The total loss is all-loss = α · SI-SNR + β · Mel-loss, where α and β are set parameters adjusting the balance between the amount of noise reduction and speech damage; in practice α may be taken as 1 and β as 10. Training iterates continuously, and when the loss keeps decreasing until convergence, the model is saved.
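Putting the two terms together gives a self-contained sketch of the total loss. Note that the SI-SNR term is negated here so that minimising the total loss increases SI-SNR; this sign convention is an assumption, since the text writes the sum without one:

```python
import numpy as np

def total_loss(s_hat, s, Ms, Ms2, alpha=1.0, beta=10.0, eps=1e-8):
    """all-loss = alpha * (-SI-SNR) + beta * Mel-loss (sign assumed).

    s_hat, s : time-domain estimated and clean speech
    Ms, Ms2  : (T, B) mel spectra of the estimated and clean speech
    """
    s_target = np.dot(s_hat, s) * s / (np.dot(s, s) + eps)
    e_noise = s_hat - s_target
    si_snr = 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                             / (np.dot(e_noise, e_noise) + eps))
    mel_loss = np.sum(np.abs(np.cbrt(Ms) - np.cbrt(Ms2))) / Ms.shape[0]
    return -alpha * si_snr + beta * mel_loss
```

A good estimate (high SI-SNR, matching mel envelope) yields a lower total loss than a poor one, which is what gradient descent then exploits.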
Notes on the model components:
CNN: convolutional neural network; seven CNN layers are used in practice;
RNN: recurrent neural network; two LSTM layers are used in practice;
DNN: fully connected neural network; one DNN layer is used in practice;
In application, layers may be pruned, added, or removed from the model as required.
In the application stage, the trained model is used for inference in actual use. Unknown noisy speech data is framed, windowed, and feature-extracted; a mask is obtained through the trained model and multiplied with the extracted features; and the predicted clean speech is obtained through the inverse feature transform and overlap-add.
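A sketch of this inference step, with the network replaced by an already-predicted mask. An orthonormal DCT is assumed so that the inverse transform is simply the transpose of the forward matrix, and windowing compensation is omitted for brevity:

```python
import numpy as np

def apply_mask_and_reconstruct(Fx, mask, hop=256):
    """Multiply the mask with noisy features Fx(t, f), inverse-transform each
    frame, and overlap-add the frames into a time-domain waveform."""
    T, F = Fx.shape
    n = np.arange(F)
    # orthonormal DCT-II matrix: C @ C.T == I, so the inverse DCT is C.T
    C = np.sqrt(2.0 / F) * np.cos(np.pi / F * (n[None, :] + 0.5) * n[:, None])
    C[0] /= np.sqrt(2.0)
    frames = (mask * Fx) @ C          # inverse DCT of each masked frame
    out = np.zeros((T - 1) * hop + F)
    for t in range(T):                # overlap-add
        out[t * hop : t * hop + F] += frames[t]
    return out
```

In deployment, `mask` would come from the trained network given the extracted features of the noisy input.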
In this embodiment, by adding a MEL-spectrum term (a form of signal envelope) to the signal-based loss function, the loss function preserves speech information better while reducing noise, improving audibility.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a training apparatus for a speech noise reduction model is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a block diagram of a structure of a training apparatus of a speech noise reduction model according to an embodiment of the present invention, and as shown in fig. 4, the training apparatus of the speech noise reduction model includes:
the obtaining unit 41 is configured to obtain voice training sample data, where the voice training sample data carries noise data.
And the feature extraction unit 43 is configured to perform feature extraction on the voice training sample data to obtain voice feature data.
And a prediction unit 45, configured to input the speech feature data into a preset speech noise reduction model, and output predicted speech data, where the predicted speech data does not include noise data.
And the training unit 47 is configured to end the training of the preset speech noise reduction model to obtain the target speech noise reduction model when both the signal loss function and the spectral loss function in the preset speech noise reduction model meet preset conditions, where the predicted speech data and the clean speech data form the signal loss function, the predicted spectral data of the predicted speech data and the spectral data of the clean speech data form the spectral loss function, and the clean speech data is training sample data without noise data.
According to the embodiment provided by the application, the obtaining unit 41 obtains the voice training sample data, wherein the voice training sample data carries noise data; the feature extraction unit 43 performs feature extraction on the voice training sample data to obtain voice feature data; the prediction unit 45 inputs the voice feature data into a preset voice noise reduction model and outputs predicted voice data, wherein the predicted voice data does not include noise data; the training unit 47 ends the training of the preset speech noise reduction model to obtain a target speech noise reduction model when both the signal loss function and the spectral loss function in the preset speech noise reduction model meet preset conditions, wherein the predicted speech data and the clean speech data form the signal loss function, the predicted spectral data of the predicted speech data and the spectral data of the clean speech data form the spectral loss function, and the clean speech data is training sample data without noise data. In this embodiment, a spectral loss function (a form of signal envelope) is added to a signal-based loss function, so that the total loss function better retains voice information while reducing noise and the audibility improves, thereby at least solving the technical problem in the prior art that a voice noise reduction model retains too little voice information during noise reduction, causing severe voice damage.
Optionally, the training unit may include: the first conversion module is used for converting the voice characteristic data from a frequency domain to a frequency spectrum energy domain through a spectrum conversion matrix to obtain a first frequency spectrum energy characteristic; the second conversion module is used for converting the voice characteristic data of the clean voice from a frequency domain to a frequency spectrum energy domain through the spectrum conversion matrix to obtain a second frequency spectrum energy characteristic; a construction module for constructing a spectral loss function based on the first spectral energy characteristic and the second spectral energy characteristic.
Optionally, the building module is further configured to perform the following operations: the spectral loss function is calculated by the following formula:
Mel-loss = (1/T) · Σ_t Σ_b (Ms(t,b) − Ms2(t,b))²,
Ms(t,b) = |Fs| · MS, Ms2(t,b) = |Fs2| · MS,
wherein T represents the number of frames of a piece of speech, Ms(t,b) represents the first spectral energy feature, Ms2(t,b) represents the second spectral energy feature, |·| represents the magnitude, Fs represents the noisy speech spectrum, Fs2 represents the clean speech spectrum, and MS represents the preset transformation matrix.
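Under the definitions above, the spectral loss can be sketched in NumPy as follows. The squared-error distance and the averaging over T frames are reconstructions from the variable definitions (the patent's formula image is not reproduced in the text), and `M_S` stands in for the preset transformation matrix, e.g. a Mel filterbank.

```python
import numpy as np

def mel_loss(F_s, F_s2, M_S):
    """Spectral loss: map both magnitude spectra into the spectral-energy
    (Mel) domain with the preset matrix M_S, then average the per-bin
    squared difference over the T frames (distance form is an assumption)."""
    Ms = np.abs(F_s) @ M_S    # (T, bins) @ (bins, mels) -> Ms(t, b)
    Ms2 = np.abs(F_s2) @ M_S  # second spectral energy feature Ms2(t, b)
    T = F_s.shape[0]          # number of frames in the speech segment
    return np.sum((Ms - Ms2) ** 2) / T
```

Identical spectra give zero loss, and any mismatch in the envelope domain increases it, which is what drives the model to preserve speech structure.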
Optionally, the feature extraction unit may include: a feature extraction module, configured to perform feature extraction on the voice training sample data through one of the following models to obtain the voice feature data: a convolutional neural network, a recurrent neural network, or a fully-connected neural network.
Optionally, the apparatus may further include: and the mixing unit is used for mixing the target voice data without noise and various types of noise with different signal to noise ratios before the voice training sample data is obtained, so as to obtain the voice training sample data.
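The mixing unit's operation, combining noise-free target speech with noise at a chosen signal-to-noise ratio to build training samples, can be sketched as below; the function name and the per-clip power normalization are illustrative assumptions rather than details from the patent.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale a noise clip so that, when added to the clean speech, the
    mixture has the target SNR in dB, then return the noisy training sample."""
    noise = noise[:len(clean)]                       # match lengths
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12        # avoid divide-by-zero
    # solve clean_power / (scale^2 * noise_power) = 10^(snr_db / 10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Calling this over several noise types and a range of SNR values yields the varied training set the embodiment describes.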
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s1, acquiring voice training sample data, wherein the voice training sample data carries noise data;
s2, performing feature extraction on the voice training sample data to obtain voice feature data;
s3, inputting the voice characteristic data into a preset voice noise reduction model, and outputting predicted voice data, wherein the predicted voice data does not include noise data;
and S4, under the condition that the signal loss function and the spectrum loss function in the preset voice noise reduction model both meet preset conditions, ending the training of the preset voice noise reduction model to obtain a target voice noise reduction model, wherein the predicted voice data and the clean voice data form the signal loss function, the predicted spectrum data of the predicted voice data and the spectrum data of the clean voice data form the spectrum loss function, and the clean voice data is training sample data without noise data.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing a computer program, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s1, acquiring voice training sample data, wherein the voice training sample data carries noise data;
s2, performing feature extraction on the voice training sample data to obtain voice feature data;
s3, inputting the voice characteristic data into a preset voice noise reduction model, and outputting predicted voice data, wherein the predicted voice data does not include noise data;
and S4, under the condition that the signal loss function and the spectrum loss function in the preset voice noise reduction model both meet preset conditions, ending the training of the preset voice noise reduction model to obtain a target voice noise reduction model, wherein the predicted voice data and the clean voice data form the signal loss function, the predicted spectrum data of the predicted voice data and the spectrum data of the clean voice data form the spectrum loss function, and the clean voice data is training sample data without noise data.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated as individual integrated circuit modules, or multiple of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for training a speech noise reduction model is characterized by comprising the following steps:
acquiring voice training sample data, wherein the voice training sample data carries noise data;
performing feature extraction on the voice training sample data to obtain voice feature data;
inputting the voice characteristic data into a preset voice noise reduction model, and outputting predicted voice data, wherein the predicted voice data does not comprise noise data;
and under the condition that both a signal loss function and a spectrum loss function in the preset voice noise reduction model meet preset conditions, finishing training of the preset voice noise reduction model to obtain a target voice noise reduction model, wherein the signal loss function is formed by the predicted voice data and clean voice data, the spectrum loss function is formed by the predicted spectrum data of the predicted voice data and the spectrum data of the clean voice data, and the clean voice data are training sample data without noise data.
2. The method of claim 1, wherein the predicted spectral data of the predicted speech data and the spectral data of the clean speech data form the spectral loss function, comprising:
converting the voice characteristic data from a frequency domain to a frequency spectrum energy domain through a spectrum conversion matrix to obtain a first frequency spectrum energy characteristic;
converting the voice characteristic data of the clean voice from a frequency domain to a frequency spectrum energy domain through a spectrum conversion matrix to obtain a second frequency spectrum energy characteristic;
forming the spectral loss function based on the first spectral energy characteristic and the second spectral energy characteristic.
3. The method of claim 2, wherein said constructing the spectral loss function based on the first spectral energy characteristic and the second spectral energy characteristic comprises:
the spectral loss function is calculated by the following formula:
Mel-loss = (1/T) · Σ_t Σ_b (Ms(t,b) − Ms2(t,b))²,
Ms(t,b) = |Fs| · MS, Ms2(t,b) = |Fs2| · MS,
wherein T represents the number of frames of a piece of speech, Ms(t,b) represents the first spectral energy feature, Ms2(t,b) represents the second spectral energy feature, |·| represents the magnitude, Fs represents the noisy speech spectrum, Fs2 represents the clean speech spectrum, and MS represents the preset transformation matrix.
4. The method according to claim 1, wherein said performing feature extraction on the voice training sample data to obtain voice feature data comprises:
and performing feature extraction on the voice training sample data through one of the following models to obtain the voice feature data: a convolutional neural network, a recurrent neural network, or a fully-connected neural network.
5. The method of claim 1, wherein prior to said obtaining voice training sample data, the method further comprises:
and mixing the target voice data without noise and various types of noise with different signal-to-noise ratios to obtain the voice training sample data.
6. A method for speech noise reduction modeling, comprising:
acquiring target voice information;
inputting the target voice information into a target noise reduction voice model, and outputting noise-reduced voice information, wherein the target noise reduction voice model is obtained by training according to the method of any one of claims 1 to 4.
7. A training device for a speech noise reduction model is characterized by comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice training sample data, and the voice training sample data carries noise data;
the feature extraction unit is used for extracting features of the voice training sample data to obtain voice feature data;
the prediction unit is used for inputting the voice characteristic data into a preset voice noise reduction model and outputting predicted voice data, wherein the predicted voice data does not comprise noise data;
and the training unit is used for finishing the training of the preset voice noise reduction model to obtain a target voice noise reduction model under the condition that both a signal loss function and a spectrum loss function in the preset voice noise reduction model meet preset conditions, wherein the signal loss function is formed by the predicted voice data and clean voice data, the spectrum loss function is formed by the predicted spectrum data of the predicted voice data and the spectrum data of the clean voice data, and the clean voice data is training sample data without noise data.
8. The apparatus of claim 7, wherein the training unit comprises:
the first conversion module is used for converting the voice characteristic data from a frequency domain to a frequency spectrum energy domain through a spectrum conversion matrix to obtain a first frequency spectrum energy characteristic;
the second conversion module is used for converting the voice characteristic data of the clean voice from a frequency domain to a frequency spectrum energy domain through a spectrum conversion matrix to obtain a second frequency spectrum energy characteristic;
a construction module for constructing the spectral loss function based on the first spectral energy characteristic and the second spectral energy characteristic.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 5 or 6 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 5 or 6.
CN202210567936.7A 2022-05-24 2022-05-24 Training method and device of voice noise reduction model, storage medium and electronic device Pending CN114974281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210567936.7A CN114974281A (en) 2022-05-24 2022-05-24 Training method and device of voice noise reduction model, storage medium and electronic device


Publications (1)

Publication Number Publication Date
CN114974281A true CN114974281A (en) 2022-08-30

Family

ID=82984881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210567936.7A Pending CN114974281A (en) 2022-05-24 2022-05-24 Training method and device of voice noise reduction model, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN114974281A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024051676A1 (en) * 2022-09-08 2024-03-14 维沃移动通信有限公司 Model training method and apparatus, electronic device, and medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination