CN112447183A - Training method and device for audio processing model, audio denoising method and device, and electronic equipment - Google Patents


Info

Publication number
CN112447183A
CN112447183A (application CN202011278852.9A)
Authority
CN
China
Prior art keywords
audio
model
trained
denoised
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011278852.9A
Other languages
Chinese (zh)
Inventor
郑羲光
张旭
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202011278852.9A priority Critical patent/CN112447183A/en
Publication of CN112447183A publication Critical patent/CN112447183A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L 21/0224 Processing in the time domain
    • G10L 21/0232 Processing in the frequency domain
    • G10L 21/0264 Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure relates to a training method for an audio processing model, where the audio processing model includes a feature dimension-reduction model to be trained, an audio denoising model to be trained, and a feature dimension-increasing model to be trained. The method includes: acquiring a noisy audio signal and a corresponding clean audio signal; inputting the noisy audio signal into the feature dimension-reduction model to be trained to obtain dimension-reduced audio features; inputting the dimension-reduced audio features into the audio denoising model to be trained to obtain denoised audio features; inputting the denoised audio features into the feature dimension-increasing model to be trained to obtain a denoised audio signal; and adjusting model parameters of the audio processing model to be trained based on a difference between the clean audio signal and the denoised audio signal until the model loss value is lower than a preset threshold. By adopting the method and the device, the denoising accuracy of audio can be improved.

Description

Training method and device for audio processing model, audio denoising method and device, and electronic equipment
Technical Field
The present disclosure relates to the field of audio processing, and in particular, to a method and an apparatus for training an audio processing model, and an audio denoising method and apparatus, an electronic device, and a storage medium.
Background
During playback of an audio file, noise in the audio tends to sound harsh to the listener. At the same time, the noise present in the audio often interferes with the information the audio is meant to convey. This makes audio denoising an important task in the field of audio signal processing.
The related art often uses a neural network to denoise the audio signal. When the neural network is trained, its loss is often optimized directly on the low-dimensional features output by the network; however, these low-dimensional features cannot be fairly compared with the final output of the audio denoising process, so the neural network obtained by such training denoises audio signals with low accuracy.
The prior art therefore suffers from low accuracy in the audio denoising process.
Disclosure of Invention
The present disclosure provides a training method for an audio processing model, an audio denoising method, corresponding apparatuses, an electronic device, and a storage medium, so as to at least solve the problem of low audio denoising accuracy in the related art. The technical scheme of the present disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a training method for an audio processing model, where the audio processing model includes a feature dimension-reduction model to be trained, an audio denoising model to be trained, and a feature dimension-increasing model to be trained; the method includes:
acquiring training sample data; the training sample data comprises a noisy audio signal and a corresponding clean audio signal;
inputting the noisy audio signal into the feature dimension-reduction model to be trained to obtain dimension-reduced audio features;
inputting the dimension-reduced audio features into the audio denoising model to be trained to obtain denoised audio features;
inputting the denoised audio features into the feature dimension-increasing model to be trained to obtain a denoised audio signal;
obtaining a model loss value of the audio processing model based on a difference between the clean audio signal and the denoised audio signal;
adjusting the model parameters of the audio processing model to be trained according to the model loss value until the model loss value is lower than a preset threshold value, and taking the audio processing model with the adjusted model parameters as the trained audio processing model; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension increasing model.
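The pipeline of the first aspect can be sketched as a composition of the three submodels. The following NumPy sketch is purely illustrative: the layer sizes, the random initialization, and the `tanh` placeholder for the denoising model are assumptions, not details from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, LOW_DIM = 257, 64      # illustrative: spectral bins -> reduced features

# Feature dimension-reduction model: a single (untrained) fully-connected layer.
W_down = rng.standard_normal((LOW_DIM, IN_DIM)) * 0.01
# Audio denoising model: a placeholder mapping in the low-dimensional space.
W_denoise = np.eye(LOW_DIM)
# Feature dimension-increasing model: a single (untrained) fully-connected layer.
W_up = rng.standard_normal((IN_DIM, LOW_DIM)) * 0.01

def audio_processing_model(noisy_spectrum):
    reduced = W_down @ noisy_spectrum              # dimension-reduced audio features
    denoised_feat = np.tanh(W_denoise @ reduced)   # denoised audio features
    return W_up @ denoised_feat                    # denoised audio signal

noisy = rng.standard_normal(IN_DIM)
denoised = audio_processing_model(noisy)
# The output has the same feature dimension as the clean reference signal,
# so a loss between the two can be computed fairly.
assert denoised.shape == (IN_DIM,)
```

The structural point is that the denoising model operates in the reduced space, while the output is restored to the input's feature dimension before any loss is computed.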
In one possible implementation manner, the inputting the noisy audio signal into the feature dimension reduction model to obtain a dimension-reduced audio feature includes:
performing time-frequency conversion processing on the noisy audio signal to obtain a time-frequency-converted noisy audio signal;
and inputting the time-frequency-converted noisy audio signal into the feature dimension-reduction model to be trained to obtain the dimension-reduced audio features.
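The time-frequency conversion step is typically realized with a short-time Fourier transform. A minimal sketch, assuming a Hann window and illustrative frame and hop sizes (the disclosure does not fix these parameters):

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Frame the signal, apply a Hann window, and take a real FFT per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)    # shape: (n_frames, frame_len // 2 + 1)

noisy = np.random.default_rng(1).standard_normal(4096)
spec = stft(noisy)
assert spec.shape == (15, 257)            # 15 frames, 257 frequency bins
```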
In a possible implementation manner, the inputting the noisy audio signal after the time-frequency conversion processing to the feature dimension reduction model to be trained to obtain the audio feature after the dimension reduction includes:
performing fully-connected processing on the time-frequency-converted noisy audio signal through the feature dimension-reduction model to be trained to obtain a first fully-connected processing result, where the feature dimension of the first fully-connected processing result is smaller than that of the time-frequency-converted noisy audio signal;
and determining the first fully-connected processing result as the dimension-reduced audio features.
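The fully-connected dimension-reduction step can be sketched as a single affine layer whose output dimension is smaller than its input dimension; the sizes and random weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
IN_DIM, LOW_DIM = 257, 64                # illustrative feature dimensions

W = rng.standard_normal((LOW_DIM, IN_DIM)) * 0.01
b = np.zeros(LOW_DIM)

def fc_reduce(x):
    return W @ x + b                     # first fully-connected processing result

x = rng.standard_normal(IN_DIM)
reduced = fc_reduce(x)
# The result's feature dimension is smaller than the input's, as required.
assert reduced.shape[0] < x.shape[0]
```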
In one possible implementation manner, the inputting the denoised audio features into the feature-up-dimensional model to be trained to obtain a denoised audio signal includes:
performing fully-connected processing on the denoised audio features through the feature dimension-increasing model to be trained to obtain a second fully-connected processing result, where the feature dimension of the second fully-connected processing result is larger than that of the denoised audio features;
and determining the second fully-connected processing result as the denoised audio signal.
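The fully-connected dimension-increasing step can likewise be sketched as a single affine layer, this time with an output dimension larger than its input dimension (sizes and random weights are again illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
LOW_DIM, OUT_DIM = 64, 257               # illustrative feature dimensions

W = rng.standard_normal((OUT_DIM, LOW_DIM)) * 0.01
b = np.zeros(OUT_DIM)

def fc_raise(feat):
    return W @ feat + b                  # second fully-connected processing result

denoised_feat = rng.standard_normal(LOW_DIM)
signal = fc_raise(denoised_feat)
# The result's feature dimension is larger than the denoised features', as required.
assert signal.shape[0] > denoised_feat.shape[0]
```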
In one possible implementation, the obtaining a model loss value of the audio processing model based on a difference between the clean audio signal and the denoised audio signal includes:
performing time-frequency conversion processing on the clean audio signal corresponding to the noisy audio signal to obtain a time-frequency-converted clean audio signal;
and determining the model loss value according to the difference between the time-frequency-converted clean audio signal and the denoised audio signal.
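Because the clean reference is converted to the same time-frequency representation first, the loss compares arrays of identical shape. The sketch below uses mean squared error as the difference measure; the disclosure does not fix the loss function at this point, so MSE is an assumption:

```python
import numpy as np

def model_loss(clean_tf, denoised_tf):
    """Difference between the time-frequency clean signal and the model output."""
    assert clean_tf.shape == denoised_tf.shape   # dimensions made consistent
    return float(np.mean(np.abs(clean_tf - denoised_tf) ** 2))

rng = np.random.default_rng(4)
clean = rng.standard_normal((15, 257))
denoised = clean + 0.1 * rng.standard_normal((15, 257))
loss = model_loss(clean, denoised)
assert loss >= 0.0                        # a perfect model would reach zero loss
```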
According to a second aspect of the embodiments of the present disclosure, there is provided an audio denoising method, including:
acquiring a trained audio processing model; the trained audio processing model is obtained by training according to the training method of the audio processing model of the first aspect or any one of its possible implementations; the trained audio processing model comprises a trained feature dimension-reduction model, a trained audio denoising model, and a trained feature dimension-increasing model;
inputting the audio signal to be denoised into the trained feature dimension-reduction model to obtain dimension-reduced audio features;
inputting the dimension-reduced audio features into the trained audio denoising model to obtain denoised audio features;
inputting the denoised audio features into the trained feature dimension-increasing model to obtain a denoised audio signal corresponding to the audio signal to be denoised.
In one possible implementation manner, the inputting the audio signal to be denoised into the trained feature dimension reduction model to obtain a dimension reduction audio feature includes:
performing time-frequency conversion processing on the audio signal to be denoised to obtain the audio signal to be denoised after the time-frequency conversion processing;
and inputting the audio signal to be denoised after the time-frequency conversion processing into the trained feature dimension reduction model to obtain the dimension reduction audio feature.
In one possible implementation manner, the inputting the denoising audio feature into the trained feature dimension-increasing model to obtain a denoised audio signal corresponding to the audio signal to be denoised includes:
performing dimension-increasing processing on the denoised audio features through the trained feature dimension-increasing model to generate a denoised time-frequency-domain audio signal corresponding to the audio signal to be denoised;
performing inverse time-frequency conversion processing on the denoised time-frequency-domain audio signal to obtain an inverse-time-frequency-converted audio signal;
and determining the inverse-time-frequency-converted audio signal as the denoised audio signal.
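The inverse time-frequency conversion can be sketched as a per-frame inverse real FFT followed by overlap-add; the frame and hop sizes are illustrative and must match the forward transform:

```python
import numpy as np

def istft(spec, frame_len=512, hop=256):
    """Inverse real FFT per frame, then overlap-add back into a waveform."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    n_frames = frames.shape[0]
    signal = np.zeros((n_frames - 1) * hop + frame_len)
    for i, frame in enumerate(frames):
        signal[i * hop : i * hop + frame_len] += frame
    return signal

# Toy input: 4 frames of constant value; overlapping regions sum to 2.0.
spec = np.fft.rfft(np.ones((4, 512)), axis=1)
wave = istft(spec)
assert wave.shape == (1280,)     # (4 - 1) * 256 + 512 samples
```

In practice the forward window and hop must satisfy a perfect-reconstruction condition (e.g. a Hann window at 50% overlap) so that overlap-add recovers the waveform without amplitude modulation.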
According to a third aspect of the embodiments of the present disclosure, there is provided a training apparatus for an audio processing model, where the audio processing model includes a feature dimension-reduction model to be trained, an audio denoising model to be trained, and a feature dimension-increasing model to be trained, the apparatus including:
a data acquisition unit configured to perform acquisition of training sample data; the training sample data comprises a noisy audio signal and a corresponding clean audio signal;
the first dimension reduction unit is configured to input the noisy audio signal into the feature dimension reduction model to be trained to obtain a dimension-reduced audio feature;
a first denoising unit configured to input the dimensionality reduced audio features into the audio denoising model to be trained to obtain denoised audio features;
a first dimension-increasing unit configured to input the denoised audio features into the feature dimension-increasing model to be trained to obtain a denoised audio signal;
a loss value obtaining unit configured to perform obtaining a model loss value of the audio processing model based on a difference between the clean audio signal and the denoised audio signal;
the parameter adjusting unit is configured to adjust the model parameters of the audio processing model to be trained according to the model loss value until the model loss value is lower than a preset threshold value, and the audio processing model after model parameter adjustment is used as a trained audio processing model; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension increasing model.
In a possible implementation manner, the first dimension-reduction unit is specifically configured to perform time-frequency conversion processing on the noisy audio signal to obtain a time-frequency-converted noisy audio signal; and to input the time-frequency-converted noisy audio signal into the feature dimension-reduction model to be trained to obtain the dimension-reduced audio features.
In a possible implementation manner, the first dimension-reduction unit is specifically configured to perform fully-connected processing on the time-frequency-converted noisy audio signal through the feature dimension-reduction model to be trained to obtain a first fully-connected processing result, where the feature dimension of the first fully-connected processing result is smaller than that of the time-frequency-converted noisy audio signal; and to determine the first fully-connected processing result as the dimension-reduced audio features.
In a possible implementation manner, the first dimension-increasing unit is specifically configured to perform fully-connected processing on the denoised audio features through the feature dimension-increasing model to be trained to obtain a second fully-connected processing result, where the feature dimension of the second fully-connected processing result is larger than that of the denoised audio features; and to determine the second fully-connected processing result as the denoised audio signal.
In a possible implementation manner, the loss value obtaining unit is specifically configured to perform time-frequency conversion processing on the clean audio signal corresponding to the noisy audio signal to obtain a time-frequency-converted clean audio signal; and to determine the model loss value according to the difference between the time-frequency-converted clean audio signal and the denoised audio signal.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an audio denoising apparatus, including:
a model obtaining unit configured to obtain a trained audio processing model; the trained audio processing model is obtained by training according to the training method of the audio processing model described above; the trained audio processing model comprises a trained feature dimension-reduction model, a trained audio denoising model, and a trained feature dimension-increasing model;
the second dimension reduction unit is configured to input the audio signal to be denoised into the trained feature dimension reduction model to obtain a dimension reduction audio feature;
a second denoising unit configured to perform input of the dimension-reduced audio features into the trained audio denoising model to obtain denoised audio features;
and the second dimension-increasing unit is configured to input the denoising audio features into the trained feature dimension-increasing model to obtain a denoised audio signal corresponding to the audio signal to be denoised.
In a possible implementation manner, the second dimension reduction unit is specifically configured to perform time-frequency conversion processing on the audio signal to be denoised, so as to obtain the audio signal to be denoised after the time-frequency conversion processing; and inputting the audio signal to be denoised after the time-frequency conversion processing into the trained feature dimension reduction model to obtain the dimension reduction audio feature.
In a possible implementation manner, the second dimension-increasing unit is specifically configured to perform dimension-increasing processing on the denoised audio features through the trained feature dimension-increasing model to generate a denoised time-frequency-domain audio signal corresponding to the audio signal to be denoised; to perform inverse time-frequency conversion processing on the denoised time-frequency-domain audio signal to obtain an inverse-time-frequency-converted audio signal; and to determine the inverse-time-frequency-converted audio signal as the denoised audio signal.
According to a fifth aspect of the embodiments of the present disclosure, there is provided an electronic device including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the training method of the audio processing model according to the first aspect or any one of its possible implementations, or the audio denoising method according to the second aspect or any one of its possible implementations.
According to a sixth aspect of the embodiments of the present disclosure, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the training method of the audio processing model according to the first aspect or any one of its possible implementations, or the audio denoising method according to the second aspect or any one of its possible implementations.
According to a seventh aspect of the embodiments of the present disclosure, there is provided a computer program product comprising a computer program stored in a readable storage medium; at least one processor of a device reads and executes the computer program, so that the device performs the training method of the audio processing model according to the first aspect or any one of its possible implementations, or the audio denoising method according to the second aspect or any one of its possible implementations.
The technical scheme provided by the embodiments of the present disclosure brings at least the following beneficial effects. Training sample data comprising a noisy audio signal and a corresponding clean audio signal is obtained, and the noisy audio signal is input into the feature dimension-reduction model to be trained, which performs dimension-reduction processing on it to obtain dimension-reduced audio features. The dimension-reduced audio features are input into the audio denoising model to be trained to obtain denoised audio features. The denoised audio features are input into the feature dimension-increasing model to be trained, which performs dimension-increasing processing on them to generate a denoised audio signal. A model loss value of the audio processing model is obtained according to the difference between the denoised audio signal and the clean audio signal corresponding to the noisy audio signal. The model parameters of the audio processing model to be trained are then adjusted according to the model loss value until the model loss value is lower than a preset threshold, and the resulting model is taken as the trained audio processing model. In this way, the dimension-reduction processing of the noisy audio signal greatly reduces the number of model parameters to be learned, which speeds up the convergence of the audio processing model. Meanwhile, the dimension-increasing processing of the denoised audio features makes the denoised audio signal consistent in feature dimension with the clean audio signal corresponding to the noisy audio signal, which ensures that the computed model loss value accurately reflects the actual performance of the audio processing model; the model parameters can therefore be accurately adjusted, and the trained audio processing model can accurately denoise the audio to be denoised.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a diagram illustrating an application environment for a method of training an audio processing model, according to an exemplary embodiment.
FIG. 2 is a flow diagram illustrating a method of training an audio processing model according to an exemplary embodiment.
FIG. 3 is a flow diagram illustrating another method of training an audio processing model in accordance with an exemplary embodiment.
FIG. 4 is a flow diagram illustrating the training of an audio processing model according to an exemplary embodiment.
FIG. 5 is a flow diagram illustrating a method of training an audio processing model according to an exemplary embodiment.
FIG. 6 is a flow diagram illustrating another method of training an audio processing model in accordance with an exemplary embodiment.
FIG. 7 is a process flow diagram illustrating a method of audio denoising in accordance with an exemplary embodiment.
FIG. 8 is a block diagram illustrating an apparatus for training an audio processing model in accordance with an exemplary embodiment.
Fig. 9 is a block diagram illustrating an audio denoising apparatus according to an exemplary embodiment.
Fig. 10 is an internal block diagram of an electronic device shown in accordance with an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure.
The training method of the audio processing model provided by the present disclosure, where the audio processing model includes a feature dimension-reduction model to be trained, an audio denoising model to be trained, and a feature dimension-increasing model to be trained, may be applied in the application environment shown in fig. 1. Referring to fig. 1, the application environment includes a server 110, which may be implemented as an independent server or as a server cluster composed of a plurality of servers; in fig. 1, the server 110 is illustrated as an independent server. Referring to fig. 1, the server 110 obtains training sample data, where the training sample data comprises a noisy audio signal and a corresponding clean audio signal. The server 110 then inputs the noisy audio signal into the feature dimension-reduction model to be trained to obtain dimension-reduced audio features; inputs the dimension-reduced audio features into the audio denoising model to be trained to obtain denoised audio features; and inputs the denoised audio features into the feature dimension-increasing model to be trained to obtain a denoised audio signal. The server 110 then obtains a model loss value of the audio processing model based on the difference between the clean audio signal and the denoised audio signal. Finally, the server 110 adjusts the model parameters of the audio processing model to be trained according to the model loss value until the model loss value is lower than a preset threshold, and takes the audio processing model with the adjusted model parameters as the trained audio processing model. The trained audio processing model comprises a trained feature dimension-reduction model, a trained audio denoising model, and a trained feature dimension-increasing model.
It should be noted that the trained audio processing model of the present disclosure may also be applied to a terminal, for example, the terminal processes a noisy audio signal by using the trained audio processing model to obtain a denoised audio signal.
Fig. 2 is a flowchart illustrating a method for training an audio processing model according to an exemplary embodiment. As shown in fig. 2, the method is used in the server 110 of fig. 1 and includes the following steps.
In step S210, training sample data is acquired.
Wherein the training sample data comprises a noisy audio signal and a corresponding clean audio signal.
The audio processing model may refer to a neural network model capable of denoising an audio signal, such as a deep learning model; in a practical scenario, the audio processing model may also be a network model combining a CNN (Convolutional Neural Network) and an LSTM (Long Short-Term Memory network).
In practical application, the audio processing model comprises a feature dimension reduction model to be trained, an audio denoising model to be trained and a feature dimension increasing model to be trained.
The noisy audio signal may refer to an audio signal carrying noise.
In a specific implementation, when training the audio processing model to be trained, the server can obtain an audio sample set. The audio sample set includes a noisy audio signal and a label corresponding to the noisy audio signal; the label may be the clean audio signal corresponding to the noisy audio signal.
In step S220, the noisy audio signal is input into the feature dimension reduction model to be trained, so as to obtain the audio feature after dimension reduction.
In a specific implementation, after obtaining the noisy audio signal, the server may perform dimension-reduction processing on it to obtain the dimension-reduced audio features. Specifically, the server may obtain the feature dimension-reduction model to be trained, input the noisy audio signal into it, perform dimension-reduction processing on the noisy audio signal through the model, and take the model's output as the dimension-reduced audio features.
In step S230, the audio features after dimension reduction are input into an audio denoising model to be trained, so as to obtain denoised audio features.
In a specific implementation, after the server obtains the dimension-reduced audio features corresponding to the noisy audio signal, it inputs them into the audio denoising model to be trained, and the model denoises the dimension-reduced audio features to obtain the denoised audio features.
In step S240, the denoised audio features are input into a feature dimension-increasing model to be trained, so as to obtain a denoised audio signal.
In a specific implementation, after the server obtains the denoised audio feature, the server performs dimension-increasing processing on it to generate the denoised audio signal. Specifically, the server may obtain a feature dimension-increasing model to be trained, input the denoised audio feature into the feature dimension-increasing model to be trained, perform dimension-increasing processing on the denoised audio feature through the feature dimension-increasing model to be trained, and use its output as the denoised audio signal.
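The forward pass through the three sub-models (steps S220 to S240) can be sketched as a chain of mappings. The layer sizes (257-dimensional spectrum frames reduced to 64 features) and the use of single random fully-connected layers as stand-ins for the untrained sub-models are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(in_dim, out_dim):
    """One fully-connected layer with small random weights (untrained stand-in)."""
    W = rng.standard_normal((out_dim, in_dim)) * 0.01
    b = np.zeros(out_dim)
    return lambda x: W @ x + b

K = 257          # assumed number of frequency bins per spectrum frame
REDUCED = 64     # assumed reduced feature dimension

dim_reduce = fc(K, REDUCED)        # feature dimension reduction model (S220)
denoise    = fc(REDUCED, REDUCED)  # audio denoising model (S230)
dim_raise  = fc(REDUCED, K)        # feature dimension-increasing model (S240)

frame = rng.standard_normal(K)               # one noisy spectrum frame
out = dim_raise(denoise(dim_reduce(frame)))  # chained forward pass
print(out.shape)  # (257,) -- same dimension as the clean target spectrum
```

Because the dimension-increasing stage restores the original feature dimension, the output can be compared directly against the clean spectrum when the loss is computed in step S250.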
In step S250, a model loss value of the audio processing model is obtained based on a difference between the clean audio signal and the denoised audio signal.
The clean audio signal may refer to the noise-free component of the noisy audio signal, that is, the signal without the noise it carries. In practical applications, the noisy audio signal can be generated from the clean audio signal.
The model loss value is used to measure the error between the denoised output of the audio processing model and the target audio; the smaller the model loss value, the closer the audio processing model is to the training target.
In a specific implementation, the server may calculate a total loss value from the clean audio signal corresponding to the noisy audio signal and the denoised audio signal in combination with a cross-entropy loss function, and use this total loss value as the model loss value of the audio processing model to be trained.
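The disclosure names a cross-entropy loss; purely as an illustrative assumption, the sketch below measures the same clean-versus-denoised difference with a mean-squared error over all frames and frequency bins, a common alternative for this regression-style target:

```python
import numpy as np

def model_loss(clean_spec, denoised_spec):
    """Total loss over all frames and bins; smaller means the denoised
    output is closer to the clean target."""
    return float(np.mean((clean_spec - denoised_spec) ** 2))

clean_spec = np.ones((100, 257))     # N frames x K bins, stand-in values
denoised_spec = clean_spec + 0.1     # denoised output, off by 0.1 everywhere
loss = model_loss(clean_spec, denoised_spec)
print(round(loss, 4))  # 0.01
```

Whatever the concrete loss function, it collapses the per-bin differences into the single scalar that drives the parameter update in step S260.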
In step S260, the model parameters of the audio processing model to be trained are adjusted according to the model loss value, and when the model loss value is lower than the preset threshold, the audio processing model with the adjusted model parameters is used as the trained audio processing model.
The trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension increasing model.
When the model loss value of the audio processing model is lower than the preset threshold, the model parameters of the audio processing model are considered to have converged.
In a specific implementation, after the server obtains the model loss value, the server can determine whether the model loss value is lower than the preset threshold; when it is, the model parameters of the audio processing model have converged, and the server takes the audio processing model to be trained as the trained audio processing model.
When the model loss value is greater than or equal to the preset threshold, the server determines a parameter-update gradient of the audio processing model from the model loss value, updates the model parameters of the audio processing model through back-propagation based on that gradient, takes the updated model as the audio processing model to be trained, and repeats steps S210 to S260 to keep updating the model parameters until the model loss value falls below the preset threshold.
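The threshold-controlled update loop described above can be sketched with a stand-in model. A single weight matrix and plain gradient descent replace the full network and its back-propagation; the learning rate, threshold, and data sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in "network": one weight matrix mapping noisy features to denoised ones.
X = rng.standard_normal((200, 8))   # noisy inputs
W_true = rng.standard_normal((8, 8))
Y = X @ W_true                      # clean targets

W = np.zeros((8, 8))                # model parameters to be trained
threshold, lr = 1e-4, 0.05
for step in range(10_000):
    pred = X @ W                            # forward pass (denoised estimate)
    loss = np.mean((pred - Y) ** 2)         # model loss value
    if loss < threshold:                    # below threshold: converged, stop
        break
    grad = 2 * X.T @ (pred - Y) / len(X)    # parameter-update gradient
    W -= lr * grad                          # backward parameter update
print(loss < threshold)  # True
```

The loop mirrors the patent's control flow: compute the loss, stop if it is below the preset threshold, otherwise update the parameters along the gradient and repeat.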
Specifically, in the process of adjusting the model parameters of the audio processing model to be trained, the server can also obtain the model loss value of the feature dimension reduction model to be trained according to the difference between the clean audio signal corresponding to the noisy audio signal and the denoised audio signal; the server then adjusts the model parameters of the feature dimension reduction model according to this loss value until it falls below the preset threshold, at which point the feature dimension reduction model to be trained is taken as the trained feature dimension reduction model.
Meanwhile, the server can also obtain the model loss value of the feature dimension-increasing model to be trained according to the difference between the clean audio signal corresponding to the noisy audio signal and the denoised audio signal; the server then adjusts the model parameters of the feature dimension-increasing model according to this loss value until it falls below the preset threshold, at which point the feature dimension-increasing model to be trained is taken as the trained feature dimension-increasing model.
In the above training method of the audio processing model, training sample data including a noisy audio signal and the corresponding clean audio signal is obtained, and the noisy audio signal is input into the feature dimension reduction model to be trained for dimension reduction processing, yielding the audio feature after dimension reduction; the audio feature after dimension reduction is input into the audio denoising model to be trained to obtain the denoised audio feature; the denoised audio feature is input into the feature dimension-increasing model to be trained for dimension-increasing processing, generating the denoised audio signal; the model loss value of the audio processing model is obtained according to the difference between the clean audio signal corresponding to the noisy audio signal and the denoised audio signal; and the model parameters of the audio processing model to be trained are adjusted according to the model loss value until the loss value falls below the preset threshold, at which point the audio processing model to be trained is taken as the trained audio processing model. In this way, the dimension reduction of the noisy audio signal greatly reduces the number of model parameters to be learned and speeds up the convergence of the audio processing model; meanwhile, the dimension-increasing processing of the denoised audio feature makes the denoised audio signal consistent in feature dimension with the clean audio signal corresponding to the noisy audio signal, ensuring that the computed model loss value accurately reflects the actual performance of the audio processing model, so that the model parameters are adjusted accurately and the trained audio processing model can accurately denoise the audio to be denoised.
In addition, because both the feature dimension-increasing network and the feature dimension-reducing network take part in the optimization of the whole network, the audio processing model can, for different audio denoising problems, learn by itself the dimension-increasing and dimension-reducing networks best suited to the current denoising problem.
In an exemplary embodiment, inputting the noisy audio signal into the feature dimension reduction model to be trained to obtain the audio feature after dimension reduction includes: performing time-frequency conversion processing on the noisy audio signal to obtain the time-frequency-converted noisy audio signal; and inputting the time-frequency-converted noisy audio signal into the feature dimension reduction model to be trained to obtain the audio feature after dimension reduction.
Performing dimension reduction processing on the time-frequency-converted noisy audio signal to obtain the audio feature after dimension reduction includes: performing full-connection processing on the time-frequency-converted noisy audio signal through the feature dimension reduction model to be trained to obtain a first full-connection processing result; and determining the first full-connection processing result as the audio feature after dimension reduction.
The feature dimension of the first full-connection processing result is smaller than that of the time-frequency-converted noisy audio signal.
In a specific implementation, in the process of performing dimension reduction on the noisy audio signal to obtain the audio feature after dimension reduction, the server may first perform time-frequency conversion processing on the noisy audio signal. Specifically, the server may convert the noisy audio signal to the time-frequency domain through a short-time Fourier transform, so as to obtain the time-frequency-converted noisy audio signal.
For example, a known noisy audio signal m may be represented in the time domain as m(t), where t denotes time and 0 < t ≤ T. After a short-time Fourier transform, m(t) can be expressed in the time-frequency domain as M(n, k) = STFT(m(t)), where n is the frame index, 0 < n ≤ N, with N the total number of frames, and k is the frequency-bin index, 0 < k ≤ K, with K the total number of frequency bins.
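The short-time Fourier transform in this example can be reproduced with SciPy; the 16 kHz sampling rate, the sine-tone stand-in signal, and the 512-sample window are assumptions. With a 512-sample window, K = 257 one-sided frequency bins are obtained:

```python
import numpy as np
from scipy.signal import stft

fs = 16000                          # assumed sampling rate
t = np.arange(fs) / fs              # T = 1 s of signal m(t)
m = np.sin(2 * np.pi * 440 * t)     # stand-in noisy signal

# M(n, k) = STFT(m(t)): one complex value per frame n and frequency bin k
freqs, frame_times, M = stft(m, fs=fs, nperseg=512)
print(M.shape[0])  # 257 frequency bins (K = nperseg // 2 + 1)
```

Each column of `M` is one spectrum frame; these K-dimensional frames are what the feature dimension reduction model consumes.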
After the server obtains the time-frequency-converted noisy audio signal, the server can input it into the feature dimension reduction model to be trained; the server then performs full-connection processing on the time-frequency-converted noisy audio signal through the full-connection layer in the feature dimension reduction model to be trained, obtaining a first full-connection processing result whose feature dimension is smaller than that of the time-frequency-converted noisy audio signal, which is used as the audio feature after dimension reduction.
According to the technical scheme of this embodiment, the noisy audio signal is time-frequency converted, the time-frequency-converted noisy audio signal is input into the feature dimension reduction model to be trained, and full-connection processing is performed on it through that model to obtain the first full-connection processing result as the audio feature after dimension reduction; the feature dimension of the first full-connection processing result is smaller than that of the time-frequency-converted noisy audio signal. The noisy audio signal can therefore be efficiently reduced in dimension, greatly reducing the number of model parameters to be learned and speeding up the convergence of the audio denoising model.
In an exemplary embodiment, obtaining the model loss value of the audio processing model based on the difference between the clean audio signal and the denoised audio signal includes: performing time-frequency conversion processing on the clean audio signal corresponding to the noisy audio signal to obtain the time-frequency-converted clean audio signal; and determining the model loss value according to the difference between the time-frequency-converted clean audio signal and the denoised audio signal.
In a specific implementation, the server obtains the model loss value of the audio denoising model to be trained according to the difference between the clean audio signal corresponding to the noisy audio signal and the denoised audio signal as follows: the server performs time-frequency conversion processing on the clean audio signal corresponding to the noisy audio signal. Specifically, the server may convert the clean audio signal into the time-frequency domain through a short-time Fourier transform, so as to obtain the time-frequency-converted clean audio signal.
For example, a known clean audio signal s can be represented in the time domain as s(t), where t denotes time and 0 < t ≤ T. After a short-time Fourier transform, s(t) can be expressed in the time-frequency domain as S(n, k) = STFT(s(t)), where n is the frame index, 0 < n ≤ N, with N the total number of frames, and k is the frequency-bin index, 0 < k ≤ K, with K the total number of frequency bins.
Finally, the server determines the model loss value according to the difference between the time-frequency-converted clean audio signal and the denoised audio signal.
According to the technical scheme of this embodiment, the clean audio signal corresponding to the noisy audio signal is time-frequency converted, and the loss value is determined according to the difference between the time-frequency-converted clean audio signal and the denoised audio signal. The loss finally computed by the network is therefore evaluated in the time-frequency domain after the dimension increase, so that the final output can be compared on equal terms, the computed model loss value accurately reflects the actual performance of the audio processing model, the model parameters are adjusted accurately, and the trained audio processing model can accurately denoise the audio to be denoised.
In an exemplary embodiment, inputting the denoised audio features into a feature dimension-increasing model to be trained to obtain a denoised audio signal, including: carrying out full-connection processing on the denoised audio features through a feature dimension-increasing model to be trained to obtain a second full-connection processing result; the characteristic dimension of the second full-connection processing result is larger than the characteristic dimension of the denoised audio characteristic; and determining the second full-connection processing result as a denoised audio signal.
In a specific implementation, the process in which the server performs dimension-increasing processing on the denoised audio feature to generate the denoised audio signal specifically includes: the server obtains a feature dimension-increasing model to be trained and inputs the denoised audio feature into it; full-connection processing is performed on the denoised audio feature through the full-connection layer in the feature dimension-increasing model to be trained, obtaining a second full-connection processing result whose feature dimension is larger than that of the denoised audio feature; and finally, the server determines the second full-connection processing result as the denoised audio signal.
According to this technical scheme, the dimension-increasing processing of the denoised audio feature makes the feature dimension of the denoised audio signal consistent with that of the clean audio signal corresponding to the noisy audio signal, so that the computed model loss value accurately reflects the actual performance of the audio processing model, the model parameters are adjusted accurately, and the trained audio processing model can accurately denoise the audio to be denoised.
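The dimension-increasing full-connection step can be sketched as a single matrix multiplication whose output dimension exceeds its input dimension. The sizes (64 denoised features raised back to 257 bins) and the random, untrained weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

REDUCED, K = 64, 257                              # assumed feature sizes
W_up = rng.standard_normal((K, REDUCED)) * 0.01   # untrained stand-in weights
b_up = np.zeros(K)

denoised_feat = rng.standard_normal(REDUCED)
# Full-connection processing: output dimension (257) exceeds input dimension (64).
second_fc_result = W_up @ denoised_feat + b_up
print(second_fc_result.shape)  # (257,)
```

The output matches the feature dimension of the clean spectrum, which is what allows the loss to compare the two directly.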
Fig. 3 is a flowchart illustrating another audio processing model training method according to an exemplary embodiment. As shown in fig. 3, the method is used in the server shown in fig. 1 and includes the following steps. In step S310, the noisy audio signal is time-frequency converted to obtain the time-frequency-converted noisy audio signal. In step S320, the time-frequency-converted noisy audio signal is input into the feature dimension reduction model to be trained. In step S330, full-connection processing is performed on the time-frequency-converted noisy audio signal through the feature dimension reduction model to be trained to obtain a first full-connection processing result as the audio feature after dimension reduction; the feature dimension of the first full-connection processing result is smaller than that of the time-frequency-converted noisy audio signal. In step S340, the audio feature after dimension reduction is input into the audio denoising model to be trained to obtain the denoised audio feature. In step S350, the denoised audio feature is input into the feature dimension-increasing model to be trained. In step S360, full-connection processing is performed on the denoised audio feature through the feature dimension-increasing model to be trained to obtain a second full-connection processing result as the denoised audio signal; the feature dimension of the second full-connection processing result is larger than that of the denoised audio feature. In step S370, the model loss value of the audio processing model is obtained according to the difference between the clean audio signal corresponding to the noisy audio signal and the denoised audio signal.
In step S380, the model parameters of the audio processing model to be trained are adjusted according to the model loss value, and when the model loss value is lower than a preset threshold value, the audio processing model with the adjusted model parameters is used as the trained audio processing model. It should be noted that, for the specific limitations of the above steps, reference may be made to the above specific limitations of the training method for the audio processing model, and details are not described here again.
To facilitate understanding by those skilled in the art, fig. 4 is a flowchart of the training process of an audio processing model. As shown in fig. 4, the first deep neural network includes an audio dimension reduction model to be trained, an audio denoising model to be trained, and an audio dimension-increasing model to be trained. In practical application, the server acquires a noisy audio signal and performs time-frequency conversion processing on it to obtain the time-frequency-converted noisy audio signal; dimension reduction is then performed on the time-frequency-converted noisy audio signal through the feature dimension reduction model to be trained to obtain the audio feature after dimension reduction. Next, the audio feature after dimension reduction is input into the audio denoising model to be trained to obtain the denoised audio feature; dimension-increasing processing is then performed on the denoised audio feature to generate the denoised audio signal. The clean audio signal corresponding to the noisy audio signal is time-frequency converted to obtain the time-frequency-converted clean audio signal, and the model loss values corresponding to the audio dimension reduction model to be trained, the audio denoising model to be trained, and the audio dimension-increasing model to be trained are determined according to the difference between the time-frequency-converted clean audio signal and the denoised audio signal.
And the server optimizes the model parameters corresponding to each model based on the model loss values corresponding to the audio dimension reduction model to be trained, the audio denoising model to be trained and the audio dimension increasing model to be trained respectively until the model loss values corresponding to the audio dimension reduction model to be trained, the audio denoising model to be trained and the audio dimension increasing model to be trained are lower than a preset threshold value, so that the trained audio dimension reduction model, the trained audio denoising model and the trained audio dimension increasing model are obtained.
Fig. 5 is a flowchart illustrating an audio denoising method according to an exemplary embodiment, which may be applied to a terminal or a server, and the following description mainly takes the server as an example; as shown in fig. 5, the audio denoising method is used in the server shown in fig. 1, and includes the following steps:
in step S510, a trained audio processing model is obtained; the trained audio processing model is obtained by training according to the training method of the audio processing model.
The trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension increasing model.
The audio signal to be denoised may refer to an audio signal from which noise needs to be removed.
In specific implementation, when a user needs to remove noise in an audio signal to be denoised, the user can upload the audio signal to be denoised to a server through a user terminal, and then the server can obtain the audio signal to be denoised. Meanwhile, the server can obtain parameter information of the trained audio processing model, and then model building is carried out based on the parameter information to obtain the trained audio processing model.
In step S520, the audio signal to be denoised is input into the trained feature dimension reduction model to obtain a dimension reduction audio feature.
In a specific implementation, after the server obtains the audio signal to be denoised, the server may perform a dimension reduction process on the audio signal to be denoised to obtain a dimension reduction audio characteristic. Specifically, the server may obtain a trained feature dimension reduction model, input the audio signal to be denoised to the trained feature dimension reduction model, perform dimension reduction processing on the audio signal to be denoised through the trained feature dimension reduction model, and take an output result of the trained feature dimension reduction model as a dimension reduction audio feature.
In step S530, the dimension reduction audio features are input into the trained audio denoising model to obtain denoised audio features.
The trained audio denoising model is obtained by training according to the training method of the audio processing model.
In a specific implementation, after the server acquires the dimension reduction audio features corresponding to the audio signal to be denoised, the server inputs them into the trained audio denoising model, which denoises the features to obtain the denoised audio features.
In step S540, the denoised audio features are input into the trained feature dimension-increasing model, so as to obtain a denoised audio signal corresponding to the audio signal to be denoised.
In a specific implementation, after the server obtains the denoised audio features, the server performs dimension-increasing processing on them to generate the denoised audio signal. Specifically, the server can obtain the trained feature dimension-increasing model, input the denoised audio features into it, perform dimension-increasing processing through the trained feature dimension-increasing model, and determine the denoised audio signal corresponding to the audio signal to be denoised according to the output of the trained feature dimension-increasing model.
In the above audio denoising method, training sample data including a noisy audio signal and the corresponding clean audio signal is obtained, and the noisy audio signal is input into the feature dimension reduction model to be trained for dimension reduction processing, yielding the audio feature after dimension reduction; the audio feature after dimension reduction is input into the audio denoising model to be trained to obtain the denoised audio feature; the denoised audio feature is input into the feature dimension-increasing model to be trained for dimension-increasing processing, generating the denoised audio signal; the model loss value of the audio processing model is obtained according to the difference between the clean audio signal corresponding to the noisy audio signal and the denoised audio signal; and the model parameters of the audio processing model to be trained are adjusted according to the model loss value until the loss value falls below the preset threshold, at which point the audio processing model to be trained is taken as the trained audio processing model. In this way, the dimension reduction of the noisy audio signal greatly reduces the number of model parameters to be learned and speeds up the convergence of the audio processing model; meanwhile, the dimension-increasing processing of the denoised audio feature makes the denoised audio signal consistent in feature dimension with the clean audio signal corresponding to the noisy audio signal, ensuring that the computed model loss value accurately reflects the actual performance of the audio processing model, so that the model parameters are adjusted accurately and the trained audio processing model can accurately denoise the audio to be denoised.
In an exemplary embodiment, inputting an audio signal to be denoised into a trained feature dimension reduction model to obtain a dimension reduction audio feature, including: carrying out time-frequency conversion processing on the audio signal to be denoised to obtain the audio signal to be denoised after the time-frequency conversion processing; and inputting the audio signal to be denoised after the time-frequency conversion processing into a trained feature dimension reduction model to obtain the dimension reduction audio feature.
In a specific implementation, the server performs dimension reduction processing on the audio signal to be denoised to obtain the dimension reduction audio features as follows: the server performs time-frequency conversion processing on the audio signal to be denoised. Specifically, the server may convert the audio signal to be denoised into the time-frequency domain through a short-time Fourier transform, so as to obtain the time-frequency-converted audio signal to be denoised.
Then, after the server obtains the audio signal to be denoised after the time-frequency conversion processing, the server can input the audio signal to be denoised after the time-frequency conversion processing to the trained feature dimension reduction model; and then, the server performs full-connection processing on the audio signal to be denoised after the time-frequency conversion processing through a full-connection layer in the trained feature dimension reduction model, so as to obtain a third full-connection processing result of which the feature dimension is smaller than that of the audio signal to be denoised after the time-frequency conversion processing, and the third full-connection processing result is used as a dimension reduction audio feature.
According to the technical scheme of this embodiment, the audio signal to be denoised is time-frequency converted to obtain the time-frequency-converted audio signal to be denoised, and dimension reduction processing is performed on it, so that the audio features in the audio signal to be denoised can be accurately reduced in dimension to obtain the dimension reduction audio features.
In an exemplary embodiment, inputting the denoised audio features into the trained feature dimension-increasing model to obtain the denoised audio signal corresponding to the audio signal to be denoised includes: performing dimension-increasing processing on the denoised audio features through the trained feature dimension-increasing model to generate a denoised time-frequency domain audio signal corresponding to the audio signal to be denoised; performing inverse time-frequency conversion processing on the denoised time-frequency domain audio signal to obtain the inversely converted audio signal; and determining the inversely converted audio signal as the denoised audio signal.
The denoised time-frequency domain audio signal may be a denoised audio signal in a time-frequency domain.
In a specific implementation, the server performs dimension-increasing processing on the denoised audio features to generate the denoised audio signal corresponding to the audio signal to be denoised as follows: the server obtains the trained feature dimension-increasing model and inputs the denoised audio features into it; full-connection processing is performed on the denoised audio features through the full-connection layer in the trained feature dimension-increasing model, obtaining a fourth full-connection processing result whose feature dimension is larger than that of the denoised audio features; finally, the server determines the fourth full-connection processing result to be the denoised time-frequency domain audio signal.
Finally, the server performs inverse time-frequency conversion processing on the denoised time-frequency domain audio signal to obtain the inversely converted audio signal as the denoised audio signal. Specifically, the server may convert the denoised time-frequency domain audio signal back to the time domain through an inverse short-time Fourier transform.
According to the technical scheme of this embodiment, the denoised audio features are subjected to dimension-increasing processing to generate the denoised time-frequency domain audio signal corresponding to the audio signal to be denoised, and inverse time-frequency conversion processing is performed on that signal to obtain the inversely converted audio signal as the denoised audio signal. The denoised audio features can thus be accurately raised in dimension to generate the denoised audio signal corresponding to the audio signal to be denoised.
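The inverse short-time Fourier transform used here can be sketched with SciPy; for the default Hann window and hop size, which satisfy the COLA constraint, the forward-inverse round trip reconstructs the time-domain signal almost exactly. The sampling rate and test tone are assumptions:

```python
import numpy as np
from scipy.signal import istft, stft

fs = 16000
x = np.sin(2 * np.pi * 220 * np.arange(fs) / fs)  # stand-in time-domain signal

_, _, Z = stft(x, fs=fs, nperseg=512)    # forward time-frequency conversion
_, x_rec = istft(Z, fs=fs, nperseg=512)  # inverse time-frequency conversion

err = np.max(np.abs(x_rec[:len(x)] - x))
print(err < 1e-10)  # True -- the round trip recovers the signal
```

In the denoising pipeline the inverse transform is applied to the denoised time-frequency signal rather than to `Z` itself, but the conversion back to the time domain is the same operation.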
Fig. 6 is a flowchart illustrating another audio denoising method according to an exemplary embodiment. As shown in fig. 6, the audio denoising method is used in the server shown in fig. 1 and includes the following steps. In step S610, a trained audio processing model is obtained; the trained audio processing model is obtained by training according to the above training method of the audio processing model, and comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension-increasing model. In step S620, time-frequency conversion processing is performed on the audio signal to be denoised to obtain the audio signal to be denoised after the time-frequency conversion processing. In step S630, the audio signal to be denoised after the time-frequency conversion processing is input into the trained feature dimension reduction model to obtain dimension-reduced audio features. In step S640, the dimension-reduced audio features are input into the trained audio denoising model to obtain denoised audio features. In step S650, dimension-increasing processing is performed on the denoised audio features through the trained feature dimension-increasing model to generate a denoised time-frequency domain audio signal corresponding to the audio signal to be denoised. In step S660, inverse time-frequency conversion processing is performed on the denoised time-frequency domain audio signal, and the resulting audio signal is taken as the denoised audio signal. It should be noted that, for specific limitations of the above steps, reference may be made to the specific limitations of the audio denoising method above, and details are not described herein again.
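Steps S620 to S660 can be summarized as a single pipeline. In this sketch the three trained models are stand-in callables (the embodiment leaves their internals open), and reusing the noisy signal's phase when returning to the time-frequency domain is a common convention that the embodiment does not spell out.

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(x, fs, reduce_fn, denoise_fn, raise_fn, nperseg=512):
    """Sketch of steps S620-S660: STFT -> reduce -> denoise -> raise -> iSTFT."""
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)         # S620: time-frequency conversion
    feats = reduce_fn(np.abs(Z))                      # S630: dimension-reduced features
    den = denoise_fn(feats)                           # S640: denoised features
    Z_hat = raise_fn(den) * np.exp(1j * np.angle(Z))  # S650: raise dim, reuse noisy phase
    _, y = istft(Z_hat, fs=fs, nperseg=nperseg)       # S660: inverse conversion
    return y[:len(x)]

# Identity stand-ins: with no modification, the pipeline reconstructs its input.
ident = lambda a: a
x = np.random.default_rng(1).standard_normal(16000)
y = denoise(x, 16000, ident, ident, ident)
print(np.max(np.abs(y - x)) < 1e-6)
```

With identity stand-ins the spectrogram passes through unchanged, so the inverse STFT recovers the original waveform; a real feature dimension reduction model, audio denoising model, and feature dimension-increasing model would replace the three callables.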
To facilitate understanding by those skilled in the art, fig. 7 is a process flow diagram of an audio denoising method. As shown in fig. 7, the second deep neural network comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension-increasing model. In practical application, the server acquires an audio signal to be denoised, and then performs time-frequency conversion processing on it to obtain the audio signal to be denoised after the time-frequency conversion processing. Next, the server performs dimension reduction processing on the audio signal to be denoised after the time-frequency conversion processing through the trained feature dimension reduction model to obtain dimension-reduced audio features, and inputs the dimension-reduced audio features into the trained audio denoising model to obtain denoised audio features. Then, the server performs dimension-increasing processing on the denoised audio features through the trained feature dimension-increasing model to generate a denoised time-frequency domain audio signal corresponding to the audio signal to be denoised. Finally, the server performs inverse time-frequency conversion processing on the denoised time-frequency domain audio signal, and the resulting audio signal is taken as the denoised audio signal.
It should be understood that, although the steps in the flowcharts of figs. 2, 3, 5 and 6 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, there is no strict restriction on the order in which these steps are performed, and they may be performed in other orders. Moreover, at least some of the steps in figs. 2, 3, 5 and 6 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; these sub-steps or stages are not necessarily performed in sequence, but may be performed in turns or alternately with other steps or with at least some of the sub-steps or stages of other steps.
FIG. 8 is a block diagram of an apparatus for training an audio processing model, according to an example embodiment. Referring to fig. 8, the apparatus includes:
a data acquisition unit 810 configured to perform acquisition of training sample data; the training sample data comprises a noisy audio signal and a corresponding clean audio signal;
a first dimension reduction unit 820 configured to perform input of the noisy audio signal into the feature dimension reduction model to be trained, so as to obtain a dimension-reduced audio feature;
a first denoising unit 830, configured to perform input of the reduced-dimension audio features into the audio denoising model to be trained, so as to obtain denoised audio features;
a first dimension-raising unit 840 configured to perform input of the denoised audio features into the feature dimension-raising model to be trained, to obtain a denoised audio signal;
a loss value obtaining unit 850 configured to perform obtaining a model loss value of the audio processing model based on a difference between the clean audio signal and the denoised audio signal;
a parameter adjusting unit 860 configured to perform adjusting the model parameters of the audio processing model to be trained according to the model loss value until the model loss value is lower than a preset threshold, and taking the audio processing model after model parameter adjustment as a trained audio processing model; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension increasing model.
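The stopping rule used by the parameter adjusting unit 860 (adjust model parameters until the model loss value falls below a preset threshold) can be sketched with a toy gradient-descent loop. The single linear stand-in model, the learning rate, and the threshold are all hypothetical; the embodiment does not fix the optimizer or the threshold value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the audio processing model to be trained: one linear map
# that should learn the identity (mapping "noisy" inputs to "clean" targets).
W = rng.standard_normal((8, 8)) * 0.1
X = rng.standard_normal((32, 8))  # hypothetical batch of noisy features
Y = X.copy()                      # corresponding clean targets

THRESHOLD = 1e-3                  # the "preset threshold" on the model loss value
LR = 0.1
loss = float("inf")
steps = 0
while loss >= THRESHOLD and steps < 20000:
    pred = X @ W
    diff = pred - Y
    loss = float(np.mean(diff ** 2))         # model loss value (MSE as one choice)
    W -= LR * (2.0 / len(X)) * X.T @ diff    # adjust the model parameters
    steps += 1

print(loss < THRESHOLD)  # training stops once the loss falls below the threshold
```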
In a possible implementation manner, the first dimension reduction unit 820 is specifically configured to perform time-frequency conversion processing on the noisy audio signal to obtain the noisy audio signal after the time-frequency conversion processing; and input the noisy audio signal after the time-frequency conversion processing into the feature dimension reduction model to be trained to obtain the dimension-reduced audio features.
In a possible implementation manner, the first dimension reduction unit 820 is specifically configured to perform full-connection processing on the noisy audio signal after the time-frequency conversion processing through the feature dimension reduction model to be trained to obtain a first full-connection processing result, the feature dimension of the first full-connection processing result being smaller than that of the noisy audio signal after the time-frequency conversion processing; and determine the first full-connection processing result as the dimension-reduced audio features.
In a possible implementation manner, the first dimension-raising unit 840 is specifically configured to perform full-connection processing on the denoised audio features through the feature dimension-raising model to be trained to obtain a second full-connection processing result, the feature dimension of the second full-connection processing result being larger than that of the denoised audio features; and determine the second full-connection processing result as the denoised audio signal.
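The dimension relationships stated for the first and second full-connection processing results can be checked with a minimal end-to-end sketch. The layer sizes (257 bins down to a 64-dimensional latent and back) and the tanh activations are illustrative assumptions, not part of the described structure.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, LATENT, OUT_DIM = 257, 64, 257  # hypothetical feature sizes

W_down = rng.standard_normal((IN_DIM, LATENT)) * 0.01   # feature dimension reduction model
W_mid  = rng.standard_normal((LATENT, LATENT)) * 0.01   # audio denoising model (stand-in)
W_up   = rng.standard_normal((LATENT, OUT_DIM)) * 0.01  # feature dimension-raising model

def forward(x):
    reduced = np.tanh(x @ W_down)   # first full-connection result: smaller dimension
    denoised = np.tanh(reduced @ W_mid)
    raised = denoised @ W_up        # second full-connection result: larger dimension
    return reduced, denoised, raised

x = rng.standard_normal((10, IN_DIM))  # 10 frames of time-frequency input
reduced, denoised, raised = forward(x)
print(reduced.shape[1] < x.shape[1], raised.shape[1] > denoised.shape[1])
```

The two printed comparisons mirror the two claimed dimension constraints: the reduced features are narrower than the input, and the raised output is wider than the denoised features.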
In a possible implementation manner, the loss value obtaining unit 850 is specifically configured to perform time-frequency conversion processing on the clean audio signal corresponding to the noisy audio signal to obtain the clean audio signal after the time-frequency conversion processing; and determine the model loss value according to the difference between the clean audio signal after the time-frequency conversion processing and the denoised audio signal.
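The loss computation described for the loss value obtaining unit 850 can be sketched as follows. Mean squared error on magnitude spectrograms is one plausible choice of difference measure; the embodiment does not commit to a particular loss, and the "denoised" output below is a synthetic stand-in.

```python
import numpy as np
from scipy.signal import stft

fs = 16000
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)  # stand-in clean signal

# Time-frequency conversion of the clean signal (magnitude spectrogram).
_, _, Z_clean = stft(clean, fs=fs, nperseg=512)
target = np.abs(Z_clean)

# Stand-in for the model's denoised time-frequency output (target plus small noise).
denoised = target + 0.01 * rng.standard_normal(target.shape)

loss = float(np.mean((target - denoised) ** 2))  # model loss value
print(loss < 1e-3)  # small, since the stand-in output is close to the target
```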
FIG. 9 is a block diagram illustrating an audio denoising apparatus according to an exemplary embodiment. Referring to fig. 9, the apparatus includes:
a model obtaining unit 910 configured to perform obtaining a trained audio processing model; the trained audio processing model is obtained by training according to the above training method of the audio processing model; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension-increasing model;
a second dimension reduction unit 920, configured to input the audio signal to be denoised into the trained feature dimension reduction model to obtain a dimension reduction audio feature;
a second denoising unit 930 configured to perform input of the dimension-reduced audio features into the trained audio denoising model, resulting in denoised audio features;
and a second dimension-raising unit 940, configured to input the denoising audio features into the trained feature dimension-raising model, so as to obtain a denoised audio signal corresponding to the audio signal to be denoised.
In one embodiment, the second dimension reduction unit 920 is specifically configured to perform time-frequency conversion processing on the audio signal to be denoised to obtain the audio signal to be denoised after the time-frequency conversion processing; and input the audio signal to be denoised after the time-frequency conversion processing into the trained feature dimension reduction model to obtain the dimension-reduced audio features.
In one embodiment, the second dimension-raising unit 940 is specifically configured to perform dimension-increasing processing on the denoised audio features through the trained feature dimension-raising model to generate a denoised time-frequency domain audio signal corresponding to the audio signal to be denoised; perform inverse time-frequency conversion processing on the denoised time-frequency domain audio signal to obtain the audio signal after the inverse time-frequency conversion processing; and determine the audio signal after the inverse time-frequency conversion processing as the denoised audio signal.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 10 is a block diagram illustrating an apparatus 1000 for performing a training method of an audio processing model, or an audio denoising method, according to an example embodiment. For example, the device 1000 may be a server. Referring to fig. 10, device 1000 includes a processing component 1020 that further includes one or more processors and memory resources, represented by memory 1022, for storing instructions, such as application programs, that are executable by processing component 1020. The application programs stored in memory 1022 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1020 is configured to execute instructions to perform a training method of the audio processing model described above, or an audio denoising method.
Device 1000 can also include a power component 1024 configured to perform power management for device 1000, a wired or wireless network interface 1026 configured to connect device 1000 to a network, and an input/output (I/O) interface 1028. Device 1000 may operate based on an operating system stored in memory 1022, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a storage medium comprising instructions, such as memory 1022 comprising instructions, executable by a processor of device 1000 to perform the above-described method is also provided. The storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A training method of an audio processing model, wherein the audio processing model comprises a feature dimension reduction model to be trained, an audio denoising model to be trained and a feature dimension raising model to be trained, and the method comprises the following steps:
acquiring training sample data; the training sample data comprises a noisy audio signal and a corresponding clean audio signal;
inputting the noisy audio signal into the feature dimension reduction model to be trained to obtain the audio features after dimension reduction;
inputting the audio features subjected to the dimensionality reduction into the audio denoising model to be trained to obtain denoised audio features;
inputting the denoised audio features into the feature dimension-raising model to be trained to obtain a denoised audio signal;
obtaining a model loss value of the audio processing model based on a difference between the clean audio signal and the denoised audio signal;
adjusting the model parameters of the audio processing model to be trained according to the model loss value until the model loss value is lower than a preset threshold value, and taking the audio processing model with the adjusted model parameters as the trained audio processing model; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension increasing model.
2. The method for training an audio processing model according to claim 1, wherein the inputting the noisy audio signal into the feature dimension reduction model to be trained to obtain the audio features after dimension reduction comprises:
performing time-frequency conversion processing on the noisy audio signal to obtain the noisy audio signal after the time-frequency conversion processing;
and inputting the noisy audio signal after the time-frequency conversion processing into the feature dimension reduction model to be trained to obtain the audio features after dimension reduction.
3. The method for training an audio processing model according to claim 2, wherein the inputting the noisy audio signal after the time-frequency conversion processing into the feature dimension reduction model to be trained to obtain the audio features after dimension reduction comprises:
performing full-connection processing on the noisy audio signal after the time-frequency conversion processing through the feature dimension reduction model to be trained to obtain a first full-connection processing result, wherein the feature dimension of the first full-connection processing result is smaller than that of the noisy audio signal after the time-frequency conversion processing;
and determining the first full-connection processing result as the audio features after dimension reduction.
4. The method for training an audio processing model according to claim 1, wherein the inputting the denoised audio features into the feature dimension-raising model to be trained to obtain a denoised audio signal comprises:
performing full-connection processing on the denoised audio features through the feature dimension-raising model to be trained to obtain a second full-connection processing result, wherein the feature dimension of the second full-connection processing result is larger than that of the denoised audio features;
and determining the second full-connection processing result as the denoised audio signal.
5. The method for training an audio processing model according to any one of claims 1 to 4, wherein the obtaining a model loss value of the audio processing model based on a difference between the clean audio signal and the denoised audio signal comprises:
performing time-frequency conversion processing on the clean audio signal corresponding to the noisy audio signal to obtain the clean audio signal after the time-frequency conversion processing;
and determining the model loss value according to the difference between the clean audio signal after the time-frequency conversion processing and the denoised audio signal.
6. A method for denoising audio, the method comprising:
acquiring a trained audio processing model; the trained audio processing model is obtained by training according to the training method of the audio processing model as claimed in any one of claims 1 to 5; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension-increasing model;
inputting the audio signal to be denoised into the trained feature dimension reduction model to obtain dimension reduction audio features;
inputting the dimension reduction audio features into the trained audio denoising model to obtain denoising audio features;
inputting the denoising audio features into the trained feature dimension-increasing model to obtain a denoised audio signal corresponding to the audio signal to be denoised.
7. A training apparatus for an audio processing model, wherein the audio processing model comprises a feature dimension reduction model to be trained, an audio denoising model to be trained and a feature dimension-raising model to be trained, the apparatus comprising:
a data acquisition unit configured to perform acquisition of training sample data; the training sample data comprises a noisy audio signal and a corresponding clean audio signal;
the first dimension reduction unit is configured to input the noisy audio signal into the feature dimension reduction model to be trained to obtain a dimension-reduced audio feature;
a first denoising unit configured to input the dimensionality reduced audio features into the audio denoising model to be trained to obtain denoised audio features;
a first dimension-raising unit configured to perform input of the denoised audio features into the feature dimension-raising model to be trained, so as to obtain a denoised audio signal;
a loss value obtaining unit configured to perform obtaining a model loss value of the audio processing model based on a difference between the clean audio signal and the denoised audio signal;
the parameter adjusting unit is configured to adjust the model parameters of the audio processing model to be trained according to the model loss value until the model loss value is lower than a preset threshold value, and the audio processing model after model parameter adjustment is used as a trained audio processing model; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension increasing model.
8. An audio denoising apparatus, comprising:
a model obtaining unit configured to perform obtaining a trained audio processing model; the trained audio processing model is obtained by training according to the training method of the audio processing model as claimed in any one of claims 1 to 5; the trained audio processing model comprises a trained feature dimension reduction model, a trained audio denoising model and a trained feature dimension-increasing model;
the second dimension reduction unit is configured to input the audio signal to be denoised into the trained feature dimension reduction model to obtain a dimension reduction audio feature;
a second denoising unit configured to perform input of the dimension-reduced audio features into the trained audio denoising model to obtain denoised audio features;
and the second dimension-increasing unit is configured to input the denoising audio features into the trained feature dimension-increasing model to obtain a denoised audio signal corresponding to the audio signal to be denoised.
9. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of training an audio processing model according to any one of claims 1 to 5 or the method of denoising audio according to claim 6.
10. A storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a method of training an audio processing model according to any one of claims 1 to 5, or a method of audio denoising according to claim 6.
CN202011278852.9A 2020-11-16 2020-11-16 Training method and device for audio processing model, audio denoising method and device, and electronic equipment Pending CN112447183A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011278852.9A CN112447183A (en) 2020-11-16 2020-11-16 Training method and device for audio processing model, audio denoising method and device, and electronic equipment

Publications (1)

Publication Number Publication Date
CN112447183A true CN112447183A (en) 2021-03-05

Family

ID=74737126




Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817239A (en) * 2018-12-24 2019-05-28 龙马智芯(珠海横琴)科技有限公司 The noise-reduction method and device of voice
CN111081268A (en) * 2019-12-18 2020-04-28 浙江大学 Phase-correlated shared deep convolutional neural network speech enhancement method
CN111179961A (en) * 2020-01-02 2020-05-19 腾讯科技(深圳)有限公司 Audio signal processing method, audio signal processing device, electronic equipment and storage medium
CN111223493A (en) * 2020-01-08 2020-06-02 北京声加科技有限公司 Voice signal noise reduction processing method, microphone and electronic equipment
CN111354367A (en) * 2018-12-24 2020-06-30 ***通信有限公司研究院 Voice processing method and device and computer storage medium
CN111429931A (en) * 2020-03-26 2020-07-17 云知声智能科技股份有限公司 Noise reduction model compression method and device based on data enhancement
WO2020177371A1 (en) * 2019-03-06 2020-09-10 哈尔滨工业大学(深圳) Environment adaptive neural network noise reduction method and system for digital hearing aids, and storage medium
CN111755013A (en) * 2020-07-07 2020-10-09 苏州思必驰信息科技有限公司 Denoising automatic encoder training method and speaker recognition system
CN111863008A (en) * 2020-07-07 2020-10-30 北京达佳互联信息技术有限公司 Audio noise reduction method and device and storage medium
CN111883164A (en) * 2020-06-22 2020-11-03 北京达佳互联信息技术有限公司 Model training method and device, electronic equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758669A (en) * 2022-06-13 2022-07-15 深圳比特微电子科技有限公司 Audio processing model training method and device, audio processing method and device and electronic equipment
CN114758669B (en) * 2022-06-13 2022-09-02 深圳比特微电子科技有限公司 Audio processing model training method and device, audio processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
Li et al. Two heads are better than one: A two-stage complex spectral mapping approach for monaural speech enhancement
JP7177167B2 (en) Mixed speech identification method, apparatus and computer program
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
US10891944B2 (en) Adaptive and compensatory speech recognition methods and devices
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
CN110223680B (en) Voice processing method, voice recognition device, voice recognition system and electronic equipment
CN108615535B (en) Voice enhancement method and device, intelligent voice equipment and computer equipment
US11514925B2 (en) Using a predictive model to automatically enhance audio having various audio quality issues
CN108922543B (en) Model base establishing method, voice recognition method, device, equipment and medium
CN112820315A (en) Audio signal processing method, audio signal processing device, computer equipment and storage medium
CN111415653B (en) Method and device for recognizing speech
CN113345460B (en) Audio signal processing method, device, equipment and storage medium
CN111785288A (en) Voice enhancement method, device, equipment and storage medium
JP2019095551A (en) Generation device, generation method, and generation program
CN112447183A (en) Training method and device for audio processing model, audio denoising method and device, and electronic equipment
EP2774147B1 (en) Audio signal noise attenuation
CN112992190B (en) Audio signal processing method and device, electronic equipment and storage medium
CN112151055B (en) Audio processing method and device
CN109741761B (en) Sound processing method and device
CN111341333A (en) Noise detection method, noise detection device, medium, and electronic apparatus
US20230386492A1 (en) System and method for suppressing noise from audio signal
CN107919136B (en) Digital voice sampling frequency estimation method based on Gaussian mixture model
CN113393857B (en) Method, equipment and medium for eliminating human voice of music signal
CN113314101B (en) Voice processing method and device, electronic equipment and storage medium
You et al. A speech enhancement method based on multi-task Bayesian compressive sensing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination