CN118215961A - Control of speech retention in speech enhancement

Control of speech retention in speech enhancement

Info

Publication number
CN118215961A
Authority
CN
China
Prior art keywords
denoising
audio signal
mask
machine learning
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280074139.3A
Other languages
Chinese (zh)
Inventor
孙俊岱 (Jundai Sun)
芦烈 (Lie Lu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dolby Laboratories Licensing Corp
Original Assignee
Dolby Laboratories Licensing Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corp filed Critical Dolby Laboratories Licensing Corp
Priority claimed from PCT/US2022/049193 (published as WO2023086311A1)
Publication of CN118215961A

Landscapes

  • Soundproofing, Sound Blocking, And Sound Damping (AREA)

Abstract

A method for performing denoising of an audio signal is provided. In some embodiments, the method involves determining an aggressiveness control parameter value that adjusts the degree of speech retention to be applied. In some embodiments, the method involves obtaining a training set of training samples having noisy audio signals and target denoising masks. In some implementations, the method involves training a machine learning model, wherein the trained machine learning model is operable to take as input a noisy test audio signal and generate a corresponding denoised test audio signal, and wherein the aggressiveness control parameter value is used to: 1) generate a frequency domain representation of the noisy audio signals included in the training set; 2) modify the target denoising masks; 3) determine an architecture of the machine learning model; or 4) determine a loss during training of the machine learning model.

Description

Control of speech retention in speech enhancement
Cross Reference to Related Applications
The present application claims priority from PCT Application No. PCT/CN2021/129573, filed November 9, 2021, U.S. Provisional Application No. 63/364,661, filed May 13, 2022, and U.S. Provisional Application No. 63/289,846, filed December 15, 2021, all of which are incorporated herein by reference in their entirety.
Technical Field
The present disclosure relates to systems, methods, and media for controlling speech retention in speech enhancement.
Background
For example, denoising techniques may be applied to a noisy audio signal to generate a denoised or clean audio signal. However, performing denoising can be difficult, particularly across various types of audio content, such as audio content that includes music, dialogue or conversations between multiple speakers, a mix of music and speech, and so on.
Symbols and terms
Throughout this disclosure, including in the claims, the terms "speaker (speaker)", "loudspeaker (loudspeaker)" and "audio reproduction transducer" are synonymously used to denote any sound producing transducer (or set of transducers). A typical set of headphones includes two speakers. The speakers may be implemented to include multiple transducers (e.g., woofers and tweeters) that may be driven by a single common speaker feed or multiple speaker feeds. In some examples, the speaker feed(s) may undergo different processing in different circuit branches coupled to different transducers.
Throughout this disclosure, including in the claims, the expression "performing an operation on" a signal or data (e.g., filtering, scaling, transforming, or applying gain to a signal or data) is used in a broad sense to mean performing an operation directly on a signal or data or on a processed version of a signal or data (e.g., a version of a signal that has undergone preliminary filtering or preprocessing prior to performing an operation thereon).
Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem implementing a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, where the subsystem generates M inputs and the other X-M inputs are received from external sources) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to mean a system or device that is programmable or otherwise configurable (e.g., in software or firmware) to perform operations on data (e.g., audio or video or other image data). Examples of processors include field programmable gate arrays (or other configurable integrated circuits or chip sets), digital signal processors programmed and/or otherwise configured to perform pipelined processing of audio or other sound data, programmable general purpose processors or computers, and programmable microprocessor chips or chip sets.
Disclosure of Invention
At least some aspects of the present disclosure may be implemented via a method. Some methods may involve determining, by a control system, an aggressive control parameter value that adjusts a degree of speech preservation to be applied when denoising an audio signal. Some methods may involve obtaining, by the control system, a training set of training samples having a noisy audio signal and a target denoising mask. Some methods may involve training, by the control system, a machine learning model by: a) Generating a frequency domain representation of the noisy audio signal corresponding to the training samples; b) Providing a frequency domain representation of the noisy audio signal to the machine learning model; c) Generating a predictive denoising mask based on an output of the machine learning model; d) Determining a loss of error representing the predicted denoising mask relative to the target denoising mask corresponding to the training sample; e) Updating weights associated with the machine learning model; and f) repeating a) through e) until a stopping criterion is reached. In some methods, a trained machine learning model may be used to take as input a noisy test audio signal and generate a corresponding denoised test audio signal, and wherein the aggressive control parameter values are used for at least one of: 1) Generating a frequency domain representation of the noisy audio signal included in the training set; 2) Modifying the target denoising mask included in the training set; 3) Determining an architecture of the machine learning model prior to training the machine learning model; or 4) determining the loss.
In some examples, generating the frequency domain representation of the noisy audio signal includes: generating a spectrum of the noisy audio signal; and generating a frequency domain representation of the noisy audio signal by grouping bins of a frequency spectrum of the noisy audio signal into a plurality of frequency bands, wherein a number of frequency bands is determined based on the aggressive control parameter values.
In some examples, modifying the target denoising mask included in the training set includes applying a power function to the target denoising masks, and wherein an exponent of the power function is determined based on the aggressiveness control parameter value.
In some examples, the machine learning model includes a Convolutional Neural Network (CNN), and wherein determining the architecture of the machine learning model includes determining a filter size of a convolutional block of the CNN based on the aggressive control parameter values.
In some examples, the machine learning model includes a U-Net, and wherein determining an architecture of the machine learning model includes determining a depth of the U-Net based on the aggressive control parameter values.
In some examples, determining the loss includes applying a penalty weight to an error of the predicted denoising mask relative to the target denoising mask, and wherein the penalty weight is determined based at least in part on the aggressive control parameter value. In some examples, the penalty weights are based at least in part on whether the corresponding noisy audio signal associated with the training samples includes speech.
Some methods involve determining, by a control system, an aggressive control parameter value that adjusts the degree of speech preservation to be applied when denoising an audio signal. Some methods involve providing, by the control system, a frequency domain representation of a noisy audio signal to a trained model to generate a denoising mask. Some methods involve modifying, by the control system, the denoising mask based at least in part on the aggressive control parameter value. Some methods involve applying, by the control system, the modified denoising mask to the frequency domain representation of the noisy audio signal to obtain a denoised spectrum. Some methods involve generating, by the control system, a time-domain representation of the denoised spectrum to generate a denoised audio signal.
In some examples, modifying the denoising mask includes applying a compression function to the denoising mask, wherein parameters associated with the compression function are determined based on the aggressive control parameter values. In some examples, the compression function includes a power function, wherein an exponent of the power function is determined based on the aggressive control parameter value. In some examples, the compression function comprises an exponential function, and wherein a parameter of the exponential function is determined based on the aggressive control parameter value.
In some examples, modifying the denoising mask includes performing smoothing on a denoising mask for a frame of the noisy audio signal based on a denoising mask generated for a previous frame of the noisy audio signal. In some examples, performing the smoothing includes multiplying the denoising mask for a frame of the noisy audio signal with a weighted version of the denoising mask generated for a previous frame of the noisy audio signal, wherein weights for generating the weighted version are determined based on the aggressive control parameter values. In some examples, the denoising mask for frames of the noisy audio signal includes a time axis and a frequency axis, and wherein smoothing is performed with respect to the time axis. In some examples, the denoising mask for frames of the noisy audio signal includes a time axis and a frequency axis, and wherein smoothing is performed with respect to the frequency axis.
In some examples, the aggressive control parameter value is determined based on whether the current frame of the noisy audio signal comprises speech.
In some examples, some methods further involve causing the generated denoised audio signal to be presented via one or more loudspeakers or headphones.
Some or all of the operations, functions, and/or methods described herein may be performed by one or more devices in accordance with instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to Random Access Memory (RAM) devices, Read-Only Memory (ROM) devices, and the like. Thus, some innovative aspects of the subject matter described in this disclosure can be implemented via one or more non-transitory media having software stored thereon.
At least some aspects of the present disclosure may be implemented via an apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some embodiments, the apparatus is or includes an audio processing system having an interface system and a control system. The control system may include one or more general purpose single- or multi-chip processors, Digital Signal Processors (DSPs), Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or a combination thereof.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.
Drawings
Fig. 1 illustrates a block diagram of an example system for performing denoising of an audio signal, according to some embodiments.
Fig. 2 illustrates a block diagram of an example system for performing denoising of an audio signal, according to some embodiments.
FIG. 3 illustrates an example convolutional neural network that may be used in accordance with some embodiments.
Fig. 4 illustrates an example U-Net architecture that can be used in accordance with some embodiments.
FIG. 5 is a flowchart of an example process for training a model to perform denoising, according to some embodiments.
FIG. 6 is a flowchart of an example process for controlling the degree of speech retention in post-processing, according to some embodiments.
Fig. 7 shows a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the disclosure.
Like reference numbers and designations in the various drawings indicate like elements.
Detailed Description
Denoising of noisy audio signals may be performed using any number of denoising techniques. However, generating a denoised or clean audio signal from an input noisy signal may require a tradeoff between noise reduction and speech preservation. In particular, more aggressive approaches that prioritize noise reduction may result in reduced speech retention, while more conservative approaches that prioritize speech retention may result in excessive noise remaining in the generated denoised audio signal. This tradeoff can be particularly difficult to manage when a single denoising technique is applied to multiple types of audio content. For example, applying the same denoising technique to audio content that includes dialog and audio content that does not include dialog may result in insufficient speech retention in the dialog content and/or increased noise in the non-dialog content, both of which may be disadvantageous.
Techniques, methods, systems, and media for controlling aggressiveness or tradeoff between speech preservation and noise reduction in the application of noise reduction techniques are disclosed herein. In some embodiments, aggressiveness of the denoising technique may be controlled by an aggressiveness control parameter value. For example, the aggressive control parameter values may indicate a desired balance between speech preservation and noise reduction. In some implementations, the aggressive control parameter value may be set based on the type of audio content associated with the input noisy audio signal, such as whether the input noisy audio signal includes dialog, music, or the like.
In some embodiments, aggressive control parameter values may be utilized during training of a machine learning model used to generate the denoised audio signal. For example, in some implementations, aggressive control parameter values may be used to modify the training samples used by the machine learning model during training, and/or may be used during training of the machine learning model through the loss function. In some embodiments, aggressive control parameter values may be used to determine or select the structure of the machine learning model.
In some embodiments, aggressive control parameter values may be applied to the output of the algorithm used to generate the denoised audio signal. The use of aggressive control parameter values on the algorithm output is generally referred to herein as "post-processing". For example, in some embodiments, aggressive control parameter values may be applied to the output of a trained machine learning model used to generate a denoised audio signal.
Fig. 1 generally illustrates a system for generating a denoised audio signal using a machine learning model. FIG. 2 generally depicts various ways in which the aggressiveness control parameter value may be used, whether during training of a machine learning model or in post-processing. Fig. 3 and 4 illustrate example architectures of machine learning models that may be used in accordance with some embodiments. FIG. 5 depicts an example flow chart of a process for utilizing aggressive control parameter values during training of a machine learning model, and FIG. 6 depicts an example flow chart of a process for utilizing aggressive control parameter values in post-processing.
In some implementations, the input audio signal may be enhanced using a trained machine learning model. In some implementations, the input audio signal may be transformed to the frequency domain by extracting frequency domain features. In some implementations, a perceptual transform based on human cochlear processing may be applied to the frequency domain representation to obtain banded features. Examples of perceptual transforms that may be applied to the frequency domain representation include gammatone filters, equivalent rectangular bandwidth filters, Mel scale based transforms, and the like. In some implementations, the frequency domain representation may be provided as an input to a trained machine learning model that generates as an output a predictive denoising mask. The predictive denoising mask may be a frequency domain representation of a mask that, when applied to the frequency domain representation of the input audio signal, generates a spectrum of the denoised audio signal. In some implementations, an inverse transform of the perceptual transform may be applied to the predictive denoising mask to generate a modified predictive denoising mask. The frequency domain representation of the enhanced audio signal may then be generated by multiplying the frequency domain representation of the input audio signal with the modified predictive denoising mask. The enhanced audio signal may then be generated by transforming the frequency domain representation of the enhanced audio signal to the time domain.
In other words, a trained machine learning model for enhancing an audio signal may be trained to generate, for a given frequency domain input audio signal, a predictive denoising mask which, when applied to the frequency domain input audio signal, generates a frequency domain representation of the corresponding denoised audio signal. In some implementations, the predictive denoising mask may be applied to the frequency domain representation of the input audio signal by multiplying the frequency domain representation of the input audio signal with the predictive denoising mask. Alternatively, in some embodiments, the logarithm of the frequency domain representation of the input audio signal may be taken. In such an embodiment, the frequency domain representation of the denoised audio signal may be obtained by adding the logarithm of the predictive denoising mask to the logarithm of the frequency domain representation of the input audio signal. In some embodiments, instead of adding the logarithm of the predictive denoising mask to the logarithm of the frequency domain representation, the log-domain representation may be transformed back into the linear domain, and the denoised signal may be obtained by multiplying the linear predictive denoising mask with the linear frequency domain representation of the original noisy signal.
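By way of illustration only, the following Python sketch (using NumPy; the variable names and shapes are assumptions, not part of this disclosure) shows the two equivalent ways of applying a predictive denoising mask described above: multiplication in the linear domain, or addition in the log domain.

```python
import numpy as np

def apply_mask_linear(noisy_spec: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Apply a (linear-domain) denoising mask by element-wise multiplication."""
    return noisy_spec * mask

def apply_mask_log(log_noisy_spec: np.ndarray, log_mask: np.ndarray) -> np.ndarray:
    """Equivalent application in the log domain: addition of log-magnitudes."""
    return log_noisy_spec + log_mask

# The two paths agree up to numerical precision:
rng = np.random.default_rng(0)
spec = rng.uniform(0.1, 1.0, size=(10, 257))   # |X(t, f)|, magnitudes of a noisy signal
mask = rng.uniform(0.1, 1.0, size=(10, 257))   # predicted denoising mask
assert np.allclose(np.log(apply_mask_linear(spec, mask)),
                   apply_mask_log(np.log(spec), np.log(mask)))
```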
It should be noted that in some implementations, training the machine learning model may include determining weights associated with one or more nodes and/or connections between nodes of the machine learning model. In some implementations, the machine learning model can be trained on a first device (e.g., server, desktop computer, laptop computer, etc.). Once trained, the weights associated with the trained machine learning model may then be provided (e.g., transmitted) to a second device (e.g., server, desktop computer, laptop computer, media device, smart television, mobile device, wearable computer, etc.) for use by the second device in denoising the audio signal.
Fig. 1 illustrates an example system for denoising an audio signal. It should be noted that although fig. 1 depicts denoising an audio signal, the systems and techniques described in connection with fig. 1 may also be applied to other types of enhancement, such as dereverberation, noise suppression, a combination of denoising and dereverberation, and the like. In other words, in some embodiments, instead of generating a predicted denoising mask and a predicted denoised audio signal, a predicted enhancement mask may be generated, and the predicted enhancement mask may be used to generate a predicted enhanced audio signal, wherein the predicted enhanced audio signal is a denoised and/or dereverberated version of a distorted input audio signal.
Fig. 1 illustrates an example of a system 100 for denoising an audio signal according to some embodiments. In some examples, the system 100 may be implemented by a control system, such as the control system 710 described herein with reference to fig. 7. As illustrated, the denoising component 106 takes the input audio signal 102 as an input and generates the denoised audio signal 104 as an output. In some implementations, the denoising component 106 includes a feature extractor 108. Feature extractor 108 may generate a frequency domain representation of input audio signal 102, which may be considered the input signal spectrum. The input signal spectrum may then be provided to the trained machine learning model 110. The trained machine learning model 110 may generate as output a predictive denoising mask. The predictive denoising mask may be provided to a denoised signal spectrum generator 112. The denoised signal spectrum generator 112 may apply the predictive denoising mask to the input signal spectrum to generate a denoised signal spectrum (e.g., a frequency domain representation of the denoised audio signal). The denoised signal spectrum may then be provided to a time domain transformation component 114. The time domain transformation component 114 may generate the denoised audio signal 104.
As shown in fig. 1 and described above in connection with the figure, a trained machine learning model may be used to generate a denoised audio signal from an input noisy audio signal. In some embodiments, it may be desirable to control the degree of speech retention in the denoised audio signal. For example, more aggressive denoising techniques may produce a greater degree of noise reduction while having poorer performance in terms of speech preservation, and vice versa. In some embodiments, the aggressiveness of a denoising technique for generating a corresponding denoised audio signal from an input noisy audio signal may be controlled by an aggressiveness control parameter. In some implementations, the aggressiveness control parameter may be used to control the degree of speech retention during training of the machine learning model. For example, the aggressiveness control parameter may be utilized in generating a training set to be used by the machine learning model. As a more specific example, the aggressiveness control parameter may be utilized to modify the frequency domain representation of the noisy audio signals included in the training set. As another specific example, the aggressiveness control parameter may be utilized to modify a target denoising mask used during training of the machine learning model. As another example, in some embodiments, the aggressiveness control parameter may be utilized to build the architecture of the machine learning model. As yet another example, in some embodiments, the aggressiveness control parameter may be utilized to determine the loss used by the machine learning model to iteratively determine weight parameters during the training process. Additionally or alternatively, in some implementations the aggressiveness control parameter may be used to alter the denoised audio signal generated using the trained machine learning model. The use of the aggressiveness control parameter on the output generated using a trained machine learning model is often referred to as "post-processing". It should be noted that in some embodiments, the aggressiveness control parameter may be used in a variety of ways and/or in a variety of phases, which may include use during machine learning model training and/or in post-processing. Fig. 2 illustrates a system that depicts one of many possible ways in which the aggressiveness control parameter may be used to control speech preservation in generating a denoised audio signal. FIG. 5 depicts a flowchart of an example process for using the aggressiveness control parameter during training of a machine learning model. FIG. 6 depicts a flowchart of an example process for using the aggressiveness control parameter in post-processing.
As illustrated in fig. 2, system 200 includes a training set creation component 202. In some examples, one or more components of system 200 may be implemented by a control system, such as control system 710 described herein with reference to fig. 7. The training set creation component 202 can generate a training set that can be utilized by a machine learning model to denoise an audio signal. In some implementations, the training set component 202 can be implemented, for example, on a device that generates and/or stores the training set 208. In some implementations, each training sample may include a noisy audio signal and a corresponding target denoising mask to be generated by the machine learning model. The target denoising mask may be obtained from the target denoising mask database 206. In some implementations, the aggressive control parameters may be used to modify the target denoising mask, as described below in connection with fig. 5. In some implementations, the training set component 202 can generate a noisy audio signal utilized in the training samples. For example, the training set component 202 may apply noise (e.g., randomly selected noise signals from a set of candidate noise signals, randomly generated noise, etc.) to the clean audio signals stored in the clean audio signal database 204. Continuing with the example, in some implementations, a target denoising mask may be determined based on the clean audio signal and noise used to generate the noisy audio signal.
The training set 208 may then be used to train a machine learning model 210a. In some implementations, the machine learning model 210a can be or include a Convolutional Neural Network (CNN), a U-Net, or any other suitable type of architecture. Example architectures are shown in fig. 3 and 4 and are described below in connection with these figures. The machine learning model 210a may include a prediction component 212a and a loss determination component 214. Prediction component 212a may generate a predictive denoising mask for noisy audio signals obtained from training set 208. Example techniques for generating the predictive denoising mask are described in more detail above in connection with fig. 1, and in connection with fig. 5 below. The loss determination component 214 may determine a loss associated with the predictive denoising mask. For example, the loss may indicate a difference between the predicted denoising mask and the target denoising mask associated with a particular training sample. The loss may be used to update the weights associated with prediction component 212a. It should be noted that the aggressiveness control parameter may be used by prediction component 212a (e.g., to generate a predicted denoising mask) and/or by loss determination component 214 (e.g., to determine the loss used to update the weights of machine learning model 210a), as described in more detail below in connection with fig. 5.
After training, the trained machine learning model 210b may utilize the trained prediction component 212b (e.g., corresponding to the final determined weights) to generate the denoised audio signal. For example, the trained machine learning model 210b may take as input the noisy audio signal 214 and may generate as output the denoising mask 216. The denoising mask 216 may then be applied to the frequency domain representation of the input noisy audio signal 214 to generate a denoised audio signal. It should be noted that the trained machine learning model 210b may have the same architecture as the machine learning model 210a. Additionally, it should be noted that in some implementations, the aggressiveness control parameter may be utilized to adjust the speech preservation in the denoising mask 216 generated by the trained machine learning model 210b. The application of the aggressiveness control parameter to the generated denoising mask is generally referred to herein as applying the aggressiveness control parameter in post-processing, and is further described in connection with fig. 6.
In some implementations, the machine learning model used to generate the denoised audio signal can be a CNN. In some implementations, the architecture of the CNN may be built using the aggressiveness control parameter. For example, in some embodiments, the kernel size of a convolutional layer of the CNN may be k, where the convolutional layer implements a filter of size (k, k). Continuing with this example, a larger filter size (e.g., a larger k value) may correspond to a more conservative result, or greater speech preservation, relative to a smaller k value. In other words, in some embodiments, the aggressiveness control parameter may be used to select kernel sizes to be used in one or more convolutional layers of the CNN to be trained. It should be noted that in some embodiments, the CNN-based model may include multiple convolution paths, each utilizing a different filter size. In such an embodiment, the aggressiveness control parameter may be used to set weights associated with each convolution path. For example, in instances where the aggressiveness control parameter indicates higher aggressiveness (e.g., more noise reduction and less speech preservation), the aggressiveness control parameter may be used to weight the convolution paths associated with smaller filter sizes more heavily and the convolution paths associated with larger filter sizes less heavily. Conversely, in instances where the aggressiveness control parameter indicates higher conservativeness (e.g., less noise reduction and more speech preservation), the aggressiveness control parameter may be used to weight the convolution paths associated with larger filter sizes more heavily and the convolution paths associated with smaller filter sizes less heavily.
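As an illustration of the path-weighting idea described above, the following sketch maps a hypothetical aggressiveness value in [0, 1] to weights over parallel convolution paths with different kernel sizes; the particular mapping is an assumption chosen only to show the intended behavior (small kernels favored when aggressive, large kernels favored when conservative).

```python
import numpy as np

def path_weights(aggressiveness: float, kernel_sizes=(3, 5, 7)) -> np.ndarray:
    """Illustrative mapping (an assumption, not the disclosure's formula): a high
    aggressiveness value up-weights paths with small kernels (more noise reduction),
    a low value up-weights paths with large kernels (more speech preservation)."""
    k = np.asarray(kernel_sizes, dtype=float)
    # Scores decrease with kernel size when aggressive, increase when conservative.
    scores = (1.0 - aggressiveness) * k + aggressiveness * (k.max() - k + k.min())
    return scores / scores.sum()

print(path_weights(0.9))  # aggressive: the 3x3 path is weighted most heavily
print(path_weights(0.1))  # conservative: the 7x7 path is weighted most heavily
```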
Fig. 3 illustrates an example CNN including multiple convolution paths, according to some embodiments. As illustrated, the input 301 is provided to a plurality of convolution paths. In some embodiments, each convolution path may include L convolution layers, where L is a natural number greater than or equal to 1. For example, a first convolution path includes layers 304a, 306a, and 308a, a second convolution path includes layers 304b, 306b, and 308b, and a third convolution path includes layers 304c, 306c, and 308c. Continuing with the example, the l-th layer of the L layers may have N_l filters, where l = 1, …, L. Examples of L include 3, 4, 5, 10, etc. In some embodiments, for each parallel convolution path, the number of filters of the l-th layer, N_l, may be given by N_l = l * N_0, where N_0 is a predetermined constant greater than or equal to 1.
In some embodiments, the filter size of the filter may be the same (e.g., uniform) in each parallel convolution path. For example, a3×3 filter size may be used in each layer L in the parallel convolution paths (e.g., 304a, 306a, and 308 a). By using filters of the same size in each parallel convolution path, a mix of different scale features can be avoided. In this way, the CNN learns the feature extraction of the same scale in each path, which greatly increases the convergence speed of the CNN. In an embodiment, the filter size of the filter may be different between different convolution paths. For example, the filter size of the first convolution path including 304a, 306a, and 308a is 3×3. Continuing with the example, the filter size of the second convolution path including 304b, 306b, and 308b is 5×5. Continuing the example still further, the filter size of the third convolution paths including 304c, 306c, and 308c is 7 x 7. It should be noted that filter sizes other than those depicted in fig. 3 may be used. In some embodiments, the filter size may depend on the harmonic length for feature extraction.
In some embodiments, for a given convolution path, the input to each of the L convolution layers may be zero-padded before performing the convolution operation in each layer. In this way, the same data shape can be maintained from input to output.
In some embodiments, for a given convolution path, a nonlinear operation may be performed in each of the L convolution layers. The nonlinear operation may include one or more of the following: a parameter correction linear unit (PRelu), a correction linear unit (Relu), a leakage correction linear unit (LeakyRelu), an exponential linear unit (Elu) and/or a scaling exponential linear unit (Selu). In some embodiments, nonlinear operation may be used as an activation function in each of the L convolutional layers.
In some embodiments, for a given parallel convolution path, the filters of at least one layer of the parallel convolution path may be expanded 2D convolution filters. The use of an expansion filter enables the correlation of harmonic features in different receptive fields to be extracted. The dilation enables far receptive fields to be reached by skipping a series of time-frequency (TF) bins. In some embodiments, the expansion operation of the filters of at least one layer of the parallel convolution paths may be performed only on the frequency axis. For example, in the context of the present disclosure, an expansion of (1, 2) may indicate that there is no expansion along the time axis (expansion factor 1), while every other bin along the frequency axis is skipped (expansion factor 2). In general, (1, d) expansion may indicate that (d-1) bins are skipped along the frequency axis between the bins for feature extraction by the corresponding filters.
In some embodiments, for a given convolution path, the filters of two or more layers of the parallel convolution path may be expanded 2D convolution filters, wherein the expansion factor of the expanded 2D convolution filters increases exponentially with increasing layer index l. In this way, a receptive field that grows exponentially with depth can be achieved. As illustrated by the example of fig. 3, in an embodiment, for a given parallel convolution path, the expansion may be (1, 1) in the first of the L convolution layers, (1, 2) in the second of the L convolution layers, (1, 2^(l-1)) in the l-th of the L convolution layers, and (1, 2^(L-1)) in the last of the L convolution layers, where (c, d) indicates an expansion factor c along the time axis and an expansion factor d along the frequency axis.
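A minimal PyTorch sketch of one such parallel convolution path is shown below, assuming L layers with l*N_0 filters in the l-th layer and frequency-axis expansion (dilation) factors (1, 1), (1, 2), …, (1, 2^(L-1)); the channel counts, kernel size, and activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_conv_path(num_layers: int = 3, kernel_size: int = 3, n0: int = 8) -> nn.Sequential:
    """One parallel convolution path: L layers, the l-th layer has l*N_0 filters, and
    the frequency-axis dilation doubles per layer: (1, 1), (1, 2), ..., (1, 2**(L-1))."""
    layers, in_ch = [], 1
    for l in range(1, num_layers + 1):
        out_ch = l * n0
        dilation = (1, 2 ** (l - 1))
        # "Same" padding for odd kernel sizes so the (time, frequency) shape is preserved.
        padding = (kernel_size // 2, dilation[1] * (kernel_size // 2))
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size, dilation=dilation, padding=padding),
                   nn.PReLU()]
        in_ch = out_ch
    return nn.Sequential(*layers)

x = torch.randn(1, 1, 100, 64)                 # (batch, channel, frames, bands)
print(make_conv_path(kernel_size=3)(x).shape)  # torch.Size([1, 24, 100, 64])
```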
An aggregated multi-scale CNN may be trained. Training the aggregated multi-scale CNN may involve the following steps: (i) calculating frame FFT coefficients of the original noisy speech and the target speech; (ii) determining the magnitudes of the noisy speech and the target speech by ignoring the phase; (iii) determining a target output mask by comparing the magnitudes of the noisy speech and the target speech; (iv) limiting the target mask to a range based on a statistical histogram; (v) using multi-frame frequency magnitudes of the noisy speech as input; and (vi) using the corresponding target mask of step (iii) as output.
It should be noted that in step (iii), the target output mask may be determined using the following equation:
target mask(t, f) = ‖Y(t, f)‖ / ‖X(t, f)‖
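A short sketch of how such a target mask could be computed from the magnitudes of steps (i)-(iv) is given below; the clipping range stands in for the histogram-based limits and is an assumption made for illustration.

```python
import numpy as np

def target_mask(clean_mag: np.ndarray, noisy_mag: np.ndarray,
                lo: float = 0.0, hi: float = 1.0) -> np.ndarray:
    """Target mask = |Y(t, f)| / |X(t, f)|, limited to a range (steps (iii)-(iv) above).
    The clipping range [lo, hi] is a stand-in for the histogram-based limits."""
    mask = clean_mag / np.maximum(noisy_mag, 1e-8)  # avoid division by zero
    return np.clip(mask, lo, hi)
```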
In some embodiments, each parallel convolution path of the aggregated multi-scale CNN outputs features extracted from the time-frequency transform of multiple frames of the original noisy speech signal input 301. The outputs from each parallel convolution path are then aggregated in an aggregation block 302 to obtain an aggregated output. In some embodiments, as shown in FIG. 3, weights 310a, 310b, and 310c may be applied to each parallel convolution path. Weights 310a, 310b, and 310c may be determined based at least in part on the aggressiveness control parameter value, e.g., to set or modify the weights associated with the different filter sizes of the parallel convolution paths.
In some implementations, the machine learning model used to generate the denoising mask can be a CNN with a U-Net architecture. Such a U-Net can have M encoding layers and M corresponding decoding layers. Feature information from a particular coding layer m may be passed to a corresponding mth decoding layer via a skipped connection, allowing the decoding layer to utilize not only feature information from a previously decoded layer, but also feature information from a corresponding coding layer passed via a skipped connection. As used herein, skipping a connection refers to passing feature information from one layer of the network to a layer other than the subsequent next layer. The M value indicating the number of coding layers and corresponding decoding layers represents the depth of the U-Net. In some embodiments, the depth of the U-Net can be determined based on the aggressiveness control parameter. In particular, in some embodiments, deeper U-Net or correspondingly larger M values can be used in a machine learning model that produces a more aggressive denoising mask relative to shallower U-Net with smaller M values. In other words, U-Net with larger M values can produce more aggressive denoising masks that more effectively reduce noise at the expense of speech preservation, while U-Net with smaller M values can produce more conservative denoising masks that more effectively preserve speech at the expense of noise reduction.
Fig. 4 illustrates an example of a U-Net architecture 400 that can be implemented in association with a machine learning model, in accordance with some embodiments. The U-Net 400 includes a set of encoding layers 402 and a corresponding set of decoding layers 404. The input may pass continuously through the encoding layers of the set of encoding layers 402, wherein the characteristic information generated from an encoding layer is passed to a subsequent encoding layer. For example, input may be provided to the encoding layer 402 a. Continuing with this example, the output of encoding layer 402a may be provided to encoding layer 402b, which is then provided to encoding layer 402c. The final encoding layer generates potential features 408 that are then passed to the first decoding layer of the set of decoding layers 404. The output of each decoding layer is then passed to the subsequent decoding layer, as indicated by the arrow in fig. 4, such that the topmost decoding layer generates the final output. For example, information may pass from the decoding layer 404c to the decoding layer 404b and then to the decoding layer 404a that generates the final output. As illustrated, each encoding layer also passes the feature information to the decoder layer of the corresponding level of the U-Net via a skipped connection. For example, as illustrated in fig. 4, the feature information generated by the encoding layer 402a is passed to the decoding layer 404a via the skip connection 406. Note that three encoding layers and corresponding three decoding layers are illustrated in fig. 4 to depict a U-Net of depth 3. According to some embodiments, increasing the depth of the U-Net (e.g., to 4 layers, 5 layers, 8 layers, etc.) can increase the aggressiveness of denoising techniques that utilize denoising masks generated by the U-Net. Conversely, reducing the depth of the U-Net (e.g., to layer 2) can increase the voice reserve of the denoising technique that utilizes the denoising mask generated by the U-Net.
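The following sketch shows a minimal mask-estimating U-Net with a configurable depth, illustrating how a single depth parameter could be exposed to the aggressiveness control; the channel counts, pooling scheme, and output nonlinearity are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class MaskUNet(nn.Module):
    """Minimal U-Net sketch with a configurable depth M (number of encoder/decoder
    levels). A deeper network could be chosen for more aggressive denoising and a
    shallower one for more speech preservation; channel counts are illustrative."""
    def __init__(self, depth: int = 3, base_ch: int = 8):
        super().__init__()
        ch = [base_ch * 2 ** i for i in range(depth)]
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1 if i == 0 else ch[i - 1], ch[i], 3, padding=1), nn.ReLU())
            for i in range(depth)])
        self.decoders = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2 * ch[i], ch[max(i - 1, 0)], 3, padding=1), nn.ReLU())
            for i in range(depth)])
        self.pool, self.up = nn.MaxPool2d(2), nn.Upsample(scale_factor=2)
        self.head = nn.Conv2d(ch[0], 1, 1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:                       # encoding path
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        for i in reversed(range(len(self.decoders))):   # decoding path with skip connections
            x = torch.cat([self.up(x), skips[i]], dim=1)
            x = self.decoders[i](x)
        return torch.sigmoid(self.head(x))              # denoising mask in [0, 1]

mask = MaskUNet(depth=3)(torch.randn(1, 1, 64, 64))     # input dims divisible by 2**depth
print(mask.shape)                                       # torch.Size([1, 1, 64, 64])
```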
As described above in connection with fig. 2, the aggressiveness control parameter may be used to adjust the balance between speech preservation and noise reduction when training a machine learning model that generates the denoising mask used to generate the denoised signal. The aggressiveness control parameter may be used in different ways or in a combination of ways. For example, the aggressiveness control parameter may be used to: generate the frequency domain representation of the noisy audio signal that is provided to the machine learning model during training; modify the target denoising mask that the machine learning model is to predict for a given input during training; determine the architecture of the machine learning model; and/or determine the loss used to update the weights of the machine learning model during training.
FIG. 5 illustrates a flow chart of an example process 500 for training a machine learning model that generates a denoising mask that may be used to generate a denoised audio signal. In some embodiments, blocks of process 500 may be performed by a control system. An example of such a control system is shown in fig. 7 and described below in connection with that figure. In some implementations, the blocks of process 500 may be performed in a different order than shown in fig. 5. In some embodiments, two or more blocks of process 500 may be performed substantially in parallel. In some embodiments, one or more blocks of process 500 may be omitted.
Process 500 may begin at 502 with determining an aggressiveness control parameter value that adjusts the degree of speech preservation to be used in denoising noisy audio signals. In some implementations, the aggressiveness control parameter value may be determined based on the type of audio content to be processed using the machine learning model. For example, in instances where the machine learning model is to generate denoising masks to be applied to audio content that includes conversational content (e.g., having multiple speakers) and the like, the aggressiveness control parameter may be set to a relatively low (e.g., conservative) value that prioritizes speech preservation over noise reduction. Conversely, in instances where the machine learning model is to generate denoising masks to be applied to audio content that includes a single speaker or other non-dialog-heavy content, the aggressiveness control parameter may be set to a relatively large value that prioritizes noise reduction over speech preservation.
At 504, process 500 may obtain a training set of training samples, each training sample having a noisy audio signal and a target denoising mask. In some embodiments, the noisy audio signal included in the training set may be generated by applying the noise signal to the clean audio signal. In some embodiments, the noise signal may be randomly selected from a set of candidate noise signals and mixed with the clean audio signal, e.g., to achieve a randomly selected signal-to-noise ratio (SNR). In some implementations, the noise signal may be random noise generated for mixing with the clean audio signal.
At 506, process 500 may optionally generate a frequency domain representation of the noisy audio signal based on the aggressiveness control parameter value for training samples of the training set. As described above in connection with fig. 1, a frequency domain representation of a noisy audio signal may be generated by determining a spectrum of the noisy audio signal (denoted herein as Spec(T×N)) having N frequency bins, where T is the number of frames of the audio signal. The spectrum may then be "banded", or modified by grouping the frequency bins of the spectrum into various frequency bands (which may be referred to herein simply as "bands"). In some implementations, the frequency bands may be determined based on a representation of the cochlear processing of the human ear. In an example where the spectrum is grouped into B bands and where W represents a banding matrix, which may be determined based on a gammatone filter bank, equivalent rectangular bandwidths, mel filters, etc., the banded spectrum may be determined by:
BandedSpectrum = Spec(T×N) * W(N×B)
In some implementations, the B value, or the number of bands into which the frequency bins of the spectrum are grouped, may be determined based on the aggressiveness control parameter value. For example, a smaller B value, or a smaller number of frequency bands, may result in: increased speech retention for audio signals that include dialog segments; aggressive noise reduction in non-dialog segments; and increased residual noise in dialog segments. In other words, a smaller B value may result in increased speech retention in dialog segments at the cost of increased residual noise in dialog segments and noise reduction in non-dialog segments. Conversely, a larger B value, or a larger number of bands, may result in: more aggressive noise reduction in dialog segments at the expense of speech retention; and increased residual noise in non-dialog segments.
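For illustration, the sketch below bands a T×N spectrum into B bands with an N×B matrix W; a rectangular grouping matrix stands in for a perceptual (gammatone/ERB/mel) matrix, and the choice of B from the aggressiveness value is left to the caller.

```python
import numpy as np

def banding_matrix(n_bins: int, n_bands: int) -> np.ndarray:
    """Toy N x B banding matrix that averages contiguous bins into B bands.
    A perceptual matrix (gammatone / ERB / mel) would be used in practice; this
    rectangular grouping is only a stand-in to show the shapes involved."""
    edges = np.linspace(0, n_bins, n_bands + 1, dtype=int)
    W = np.zeros((n_bins, n_bands))
    for b in range(n_bands):
        W[edges[b]:edges[b + 1], b] = 1.0 / max(edges[b + 1] - edges[b], 1)
    return W

T, N = 100, 257                       # frames x frequency bins (Spec(T x N))
B = 32                                # number of bands, chosen from the aggressiveness value
spec = np.abs(np.random.randn(T, N))
banded = spec @ banding_matrix(N, B)  # BandedSpectrum = Spec(T x N) * W(N x B)
print(banded.shape)                   # (100, 32)
```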
At 508, process 500 may optionally modify the target denoising mask based on the aggressive control parameter value. It should be noted that in some embodiments, block 508 may be omitted and process 500 may proceed to block 510.
The target denoising mask is generally denoted herein as MSM(t, f), where t corresponds to a time component and f corresponds to a frequency component. In some implementations, the denoising mask may be determined by:
MSM(t, f) = ‖Y(t, f)‖ / ‖X(t, f)‖
In the equation given above, Y and X represent the spectra of the clean audio signal and the noisy audio signal, respectively. For example, Y may be the spectrum of a clean audio signal and X may be the spectrum of the corresponding noisy audio signal. In other words, given a denoising mask, a clean audio spectrum may be obtained by multiplying the denoising mask with the spectrum of the noisy audio signal.
Note that, as described above in connection with block 504, each training sample may include a target denoising mask that is to be predicted by the machine learning model for the corresponding noisy audio signal. In some implementations, the target denoising mask for a particular training sample may be modified based on the aggressiveness control parameter value. For example, the target denoising mask may be modified by applying a power to the target denoising mask, where the power is denoted by α. An example of modifying the target denoising mask by applying a power α is given by:
MSM_modified(t, f) = MSM(t, f)^α
The power α may be in the range of 0 to 1 to produce a more conservative result that prioritizes speech preservation. In some embodiments, the power α may be greater than 1 to produce a more aggressive result that prioritizes noise reduction. Exemplary values of α include 0.2, 0.5, 0.8, 1, 1.2, 1.5, 2, 2.5, 3, etc. In some implementations, α can be determined based on the aggressiveness control parameter value. For example, in response to an aggressiveness control parameter value indicating that speech preservation is prioritized at the expense of noise reduction, α may be set to a relatively small value, and vice versa.
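A short sketch of the power-based modification of the target mask is given below; the example values of α simply mirror the behavior described above (α < 1 more conservative, α > 1 more aggressive).

```python
import numpy as np

def modify_target_mask(mask: np.ndarray, alpha: float) -> np.ndarray:
    """Raise the target denoising mask to the power alpha. With masks in [0, 1],
    alpha < 1 pushes values toward 1 (more conservative, more speech preserved)
    and alpha > 1 pushes values toward 0 (more aggressive noise reduction)."""
    return np.power(mask, alpha)

m = np.array([0.1, 0.5, 0.9])
print(modify_target_mask(m, 0.5))  # [0.316..., 0.707..., 0.948...] -> larger mask values
print(modify_target_mask(m, 2.0))  # [0.01, 0.25, 0.81]             -> smaller mask values
```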
At 510, process 500 may provide a frequency domain representation of the noisy audio signal to a machine learning model whose architecture optionally depends on the aggressiveness control parameter value. As described above in connection with fig. 1, a frequency domain representation of the noisy audio signal, which may be a banded spectrum of the noisy audio signal as described above in connection with block 506, is provided as an input to the machine learning model. As described above in connection with fig. 3 and 4, the architecture of the machine learning model may have been determined or selected based on the aggressiveness control parameter value. For example, in instances where the machine learning model includes a CNN, the filter size used in the convolutional layers may be determined based on the aggressiveness control parameter value. As a more specific example, as shown in fig. 3 and described above in connection with that figure, a larger filter size may cause the machine learning model to produce more conservative results that prioritize speech preservation over noise reduction. Conversely, a smaller filter size may cause the machine learning model to produce more aggressive results that prioritize noise reduction over speech preservation. As another example, in instances where the machine learning model includes a U-Net, the depth of the U-Net may be determined or selected based on the aggressiveness control parameter value, as described above in connection with fig. 4. As a more specific example, the depth of the U-Net may be relatively large to produce more aggressive results that prioritize noise reduction over speech preservation. Conversely, a relatively shallow U-Net can be utilized to produce more conservative results that prioritize speech preservation over noise reduction.
At 512, process 500 may generate a predictive denoising mask using the machine learning model. For example, as described above in connection with fig. 1, the predictive denoising mask may be an output of the machine learning model when the frequency domain representation of the noisy audio signal is provided as an input to the machine learning model.
At 514, process 500 may determine a loss representing an error of the predicted denoising mask relative to the target denoising mask of the training sample, wherein the loss is determined using a loss function that optionally depends on the aggressiveness control parameter value. For example, in some embodiments, the aggressiveness control parameter value may be used to set a penalty factor for use in the loss function, where the penalty factor indicates whether the loss function penalizes excessive noise suppression more or less severely. In one example, the loss function may be expressed as:
Loss = mean(P * |y_pred(i, j) − y_true(i, j)|^γ)
In the equation given above, γ represents a power factor, y_true represents the target denoising mask of the training sample, y_pred represents the predicted denoising mask generated by the machine learning model at block 512, i represents a frame index, j represents a frequency band index, and P represents a penalty weight matrix. In some embodiments, P has the same dimensions as y_pred and y_true.
In some embodiments, P may be determined by:
P(i, j) = a if y_pred(i, j) < y_true(i, j), and P(i, j) = b otherwise
Given the above equation, in the instance of a > b, the penalty weight applied in the penalty function may be greater in instances where the predicted denoising mask is smaller than the target denoising mask, which indicates excessive noise suppression at the expense of speech retention.
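The sketch below implements this penalty-weighted loss as described, with illustrative values of a, b, and γ; in practice these constants would be derived from the aggressiveness control parameter value.

```python
import numpy as np

def penalty_weighted_loss(y_pred: np.ndarray, y_true: np.ndarray,
                          a: float = 2.0, b: float = 1.0, gamma: float = 2.0) -> float:
    """Loss = mean(P * |y_pred - y_true|^gamma), with P = a where the mask is
    under-predicted (y_pred < y_true, i.e. noise over-suppression) and P = b
    elsewhere. The values of a, b, and gamma are illustrative; a > b penalizes
    over-suppression more heavily."""
    P = np.where(y_pred < y_true, a, b)
    return float(np.mean(P * np.abs(y_pred - y_true) ** gamma))
```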
In some embodiments, the loss function may be determined by:
Loss = |y_pred − y_true| * (sign(y_true − y_pred) * α + β)
In the equation given above, the values of α and β may be two parameters used as penalty weights for punishing excessive or insufficient noise suppression. The values of α and β may be set based on the aggressive control parameter values. Exemplary values for α and β include 0.3, 0.5, 0.7, 1, 1.2, and the like.
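A corresponding sketch of the sign-based loss is shown below; averaging over all time-frequency points is an assumption made for illustration, and the values of α and β are illustrative.

```python
import numpy as np

def sign_weighted_loss(y_pred: np.ndarray, y_true: np.ndarray,
                       alpha: float = 0.3, beta: float = 0.7) -> float:
    """Loss = mean(|y_pred - y_true| * (sign(y_true - y_pred) * alpha + beta)).
    When y_pred < y_true (over-suppression) the weight is beta + alpha, otherwise
    beta - alpha; alpha and beta would be set from the aggressiveness value."""
    return float(np.mean(np.abs(y_pred - y_true) *
                         (np.sign(y_true - y_pred) * alpha + beta)))
```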
Note that in the loss function example given above, the same penalty weight parameters are used regardless of the type of audio content included in the training samples. For example, the same penalty weight parameters are used for dialog segments and non-dialog segments. In some implementations, conversational and non-conversational segments may be considered differently when applying the loss function. It should be noted that in some embodiments, any suitable technique may be used to identify the dialog segments and non-dialog segments, such as by identifying metadata or other indicia that specifies whether a particular frame or segment of the audio signal corresponds to a dialog segment or a non-dialog segment, and so forth. This may allow noise oversuppression of dialog segments at the expense of speech retention to be more heavily penalized than non-dialog segments. In some embodiments, the loss function may include two components, one setting a first penalty weight applied to the dialog segment and the other setting a second penalty weight applied to the non-dialog segment. The two components of the loss function may be gated by a gating threshold g. An example of such a loss function is given by:
Loss = mean(g * P1 * |y_pred(i, j) − y_true(i, j)|^γ + (1 − g) * P2 * |y_pred(i, j) − y_true(i, j)|^γ)
In the equation given above, the gating control may be given by:
g = 1 when the frame corresponds to a dialog segment, and g = 0 when the frame corresponds to a non-dialog segment
In the loss functions given above, P1 and P2 may represent two penalty weight matrices applied to the dialog segments and the non-dialog segments, respectively, based on the gating control. In one example, P1 can be given by:
P1(i, j) = a if y_pred(i, j) < y_true(i, j), and P1(i, j) = b otherwise
As described above, a and b are constants that can be determined based on the aggressiveness control parameter value to control, for dialog segments, the penalty for excessive noise suppression relative to the penalty for insufficient noise suppression.
In one example, P2 can be given by:
P2(i, j) = c if y_pred(i, j) < y_true(i, j), and P2(i, j) = d otherwise
Similar to what is described above in connection with P1, c and d represent constants that can be determined based on the aggressiveness control parameter value to control, for non-dialog segments, the penalty for excessive noise suppression relative to the penalty for insufficient noise suppression.
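The gated loss described above can be sketched as follows; the per-frame binary gate, the penalty constants, and γ are illustrative assumptions.

```python
import numpy as np

def gated_loss(y_pred, y_true, is_dialog, a=2.0, b=1.0, c=1.2, d=1.0, gamma=2.0):
    """Gated loss: penalty weights (a, b) for dialog frames and (c, d) for non-dialog
    frames, selected per frame by the gate g. The constants are illustrative and would
    be derived from the aggressiveness control parameter value."""
    g = np.asarray(is_dialog, dtype=float)[:, None]  # (frames, 1) gate, broadcast over bands
    err = np.abs(y_pred - y_true) ** gamma
    P1 = np.where(y_pred < y_true, a, b)             # dialog penalty matrix
    P2 = np.where(y_pred < y_true, c, d)             # non-dialog penalty matrix
    return float(np.mean(g * P1 * err + (1.0 - g) * P2 * err))

y_true = np.random.rand(10, 32)                      # (frames, bands) target mask
y_pred = np.clip(y_true + 0.1 * np.random.randn(10, 32), 0, 1)
print(gated_loss(y_pred, y_true, is_dialog=np.random.rand(10) > 0.5))
```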
At 516, process 500 may update the weights of the machine learning model based on the loss. For example, the process 500 may update weights associated with one or more layers of the machine learning model based on the loss. The weights may be updated using any suitable technique, such as gradient descent, batch gradient descent, etc. Note that in some implementations, the process 500 may update the weights in batches, rather than updating the weights for each training sample.
At 518, process 500 may determine whether training of the machine learning model has been completed. For example, the process 500 may determine whether all training samples have been processed, whether more than a predetermined number of training periods have been completed, and/or whether the change in the weight of the machine learning model in successive training iterations is less than a predetermined change threshold.
If at 518, process 500 determines that the training of the machine learning model has not been completed ("no" at block 518), process 500 may loop back to block 506 and may continue training the machine learning model, for example, with another training sample in the training set. In some implementations, the process 500 may loop through blocks 506-518 until the process 500 determines that training is complete.
Conversely, if at 518, process 500 determines that the training of the machine learning model has been completed ("yes" at 518), process 500 may continue to block 520 and may optionally utilize the trained machine learning model. For example, in some embodiments, process 500 may store weights representing the trained machine learning model as parameters. Continuing with this example, process 500 may, at inference time, apply a frequency domain representation of a test noisy audio signal to the trained machine learning model to generate a denoising mask that may be used to generate a denoised audio signal, as shown in fig. 1 and 2 and described above in connection with those figures. In some embodiments, weights associated with the trained machine learning model may be provided to an end user device, which may then utilize these weights when denoising noisy audio signals at inference time.
In some implementations, an aggressiveness control parameter may be applied to a denoising mask generated, for example, by a machine learning model. For example, the aggressiveness control parameter may be applied to the denoising mask to generate a modified denoising mask, where the aggressiveness control parameter adjusts the degree of speech preservation achieved when the modified denoising mask is used to generate the denoised audio signal. The denoising mask may be modified in different ways based on the aggressiveness control parameter. For example, in some implementations, the denoising mask may be modified by applying a power-law compressor function to the denoising mask, where the power value of the power-law compressor is determined based at least in part on the aggressiveness control parameter. As another example, in some implementations, the denoising mask may be modified by applying a Gaussian compressor function to the denoising mask, where the variance of the Gaussian compressor is determined based at least in part on the aggressiveness control parameter. Note that, as will be described in more detail below, the Gaussian compressor may additionally or alternatively be referred to as an exponential function. As yet another example, in some implementations, the denoising mask may be modified by smoothing the denoising mask.
Fig. 6 is a flow chart of an example process 600 for modifying a denoising mask based on aggressive control parameters. In some embodiments, blocks of process 600 may be performed by a control system. An example of such a control system is shown in fig. 7 and described below in connection with this figure. In some implementations, the blocks of process 600 may be performed in a different order than shown in fig. 6. In some embodiments, two or more blocks of process 600 may be performed substantially in parallel. In some embodiments, one or more blocks of process 600 may be omitted.
Process 600 may begin at 602 with determining an aggressiveness control parameter value that adjusts the degree of speech preservation to be applied when denoising a noisy audio signal. As described above, in some implementations, the aggressiveness control parameter value may be determined based on the type of audio content to be processed using the machine learning model. For example, where denoising is applied to audio content that includes conversational content (e.g., multiple speakers), the aggressiveness control parameter may be set to a relatively low value that prioritizes speech retention over noise reduction. Conversely, where denoising is applied to audio content that includes a single speaker or other non-dialog-heavy content, the aggressiveness control parameter may be set to a relatively large value that prioritizes noise reduction over speech retention. It should be noted that in some implementations, process 600 may determine whether a particular segment of the noisy audio signal to be denoised includes conversational or non-conversational content, for example based on metadata or a flag stored in conjunction with the noisy audio signal that indicates which portions or segments of the noisy audio signal include conversational content. It should also be noted that some noisy audio signals, such as movie soundtracks, may include both dialog segments and non-dialog segments. In this case, process 600 may set different aggressiveness control parameter values for different segments or portions of the noisy audio signal based on, for example, whether a particular segment or portion includes dialog.
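As a minimal illustration of this per-segment selection, the sketch below maps a hypothetical is_dialog metadata flag to an aggressiveness value; the flag name and the numeric values are assumptions for illustration, not part of this disclosure.

```python
def select_aggressiveness(segments, dialog_value=0.3, non_dialog_value=0.8):
    """Choose an aggressiveness value per segment (illustrative only).

    segments: iterable of dicts carrying a hypothetical 'is_dialog' flag.
    Lower values favor speech preservation; higher values favor noise
    reduction. The numeric values here are assumed examples.
    """
    return [dialog_value if seg.get("is_dialog") else non_dialog_value
            for seg in segments]
```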
At 604, process 600 may obtain a denoising mask, wherein the denoising mask is generated using a frequency domain representation of the noisy audio signal. For example, as described above in connection with figs. 1, 2, and 5, the frequency domain representation of the noisy audio signal may include a spectrum of the noisy audio signal. In some embodiments, the frequency domain representation may include a spectrum of the noisy audio signal that is modified, for example, by grouping the frequency bins of the spectrum into bands based on a perceptual transform that represents perceptual characteristics associated with the human cochlea.
In some embodiments, the denoising mask may be obtained by providing the frequency domain representation of the noisy audio signal to a machine learning model that has been trained to generate a denoising mask as an output. The machine learning model can have any suitable architecture, e.g., a CNN, a U-Net, a recurrent neural network (RNN), etc. In some embodiments, an aggressiveness control parameter, which may be the same as or different from the aggressiveness control parameter obtained at block 602, may already have been used during training of the machine learning model or to select the architecture of the machine learning model, as described above in connection with figs. 2-5. However, it should be appreciated that in some embodiments, a machine learning model may be used for which no aggressiveness control parameter was previously used in training or constructing the model. The denoising mask is generally referred to herein as MSM(t, f).
At 606, process 600 may modify the denoising mask by performing at least one of: 1) Applying a power-law compressor to the denoising mask; 2) Applying a Gaussian compressor to the denoising mask; and/or 3) Smoothing the denoising mask.
In some implementations, a power-law compressor may be applied to generate a modified denoising mask (generally referred to herein as MSMmod(t, f)) by:
MSMmod(t,f) = MSM(t,f)^α
In the equation given above, α is the power value applied to the denoising mask obtained at block 604. The value of α may be determined based on the aggressiveness control parameter value. For example, in response to determining, based on the aggressiveness control parameter value, that denoising is to be more conservative (e.g., speech preservation is prioritized over noise reduction), the value of α may be selected to be between 0 and 1. Example values of α that prioritize speech preservation over noise reduction include 0.1, 0.2, 0.6, 0.8, etc. Conversely, in response to determining, based on the aggressiveness control parameter value, that denoising is to be more aggressive (e.g., noise reduction is prioritized over speech preservation), the value of α may be selected to be greater than 1. Example values of α that prioritize noise reduction over speech preservation include 1.05, 1.1, 1.2, 1.3, 1.8, etc.
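A minimal sketch of the power-law compressor follows; the mapping from the aggressiveness control parameter value to α is left to the caller, since that mapping is an implementation choice.

```python
import numpy as np

def power_law_compress(msm, alpha):
    """Apply the power-law compressor MSM_mod(t, f) = MSM(t, f) ** alpha.

    alpha < 1 pushes mask values toward 1 (more speech preservation);
    alpha > 1 pushes them toward 0 (more aggressive noise reduction).
    """
    return np.clip(msm, 0.0, 1.0) ** alpha
```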
In some implementations, a Gaussian compressor may be applied to the denoising mask to generate a modified denoising mask (generally referred to herein as MSMmod). The variance parameter var of the Gaussian compressor may be an adjustable parameter that may be determined based at least in part on the aggressiveness control parameter value. Applying a Gaussian compressor to the denoising mask may cause the modified denoising mask to have an s-shape, where the value of the modified denoising mask is greater than about 0.5 for high signal-to-noise ratio portions of the audio signal and less than about 0.5 for low signal-to-noise ratio portions of the audio signal. The value of var shifts the function to the left or right accordingly, changing the signal-to-noise ratio at which the value of the modified denoising mask crosses 0.5. Note that the s-shaped function may essentially be an exponential function truncated at a lower limit and an upper limit. It should also be noted that, in some embodiments, by setting the modified denoising mask to the minimum of the original denoising mask and the mask obtained after applying the Gaussian compressor, the original denoising mask values may be retained where they are smaller while still utilizing the shifted s-shape of the modified denoising mask.
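The following sketch shows one assumed form of such an s-shaped (truncated exponential) compressor; the expression exp(-(1 - MSM)^2 / var) and the truncation are illustrative assumptions only, since the exact equation is not reproduced above.

```python
import numpy as np

def gaussian_compress(msm, var, keep_min=True):
    """One assumed form of the Gaussian (s-shaped) compressor.

    Maps mask values through exp(-(1 - MSM)^2 / var), truncated to [0, 1]:
    near 1 for high-SNR (large mask) values and near 0 for low-SNR values.
    var would be derived from the aggressiveness control value. keep_min
    optionally keeps the original mask wherever it is smaller, as noted above.
    """
    compressed = np.clip(np.exp(-((1.0 - msm) ** 2) / var), 0.0, 1.0)
    return np.minimum(msm, compressed) if keep_min else compressed
```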
In some implementations, smoothing may be performed on the denoising mask to generate a modified denoising mask. In some embodiments, smoothing may be performed by combining the mask values associated with the current frame with the mask values associated with a previous frame. Smoothing may be performed using any suitable filtering technique, such as mean filtering, median filtering, adaptive filtering, and the like. In some embodiments, larger filter sizes produce more conservative results in the denoised audio signal. Accordingly, the filter size used for filtering/smoothing may be determined by the aggressiveness control parameter value; in particular, a larger filter size may be used in response to an aggressiveness control parameter value indicating a preference for more conservative results, i.e., for speech preservation over noise reduction. It should be noted that smoothing can only make the denoised audio signal more conservative (prioritizing speech preservation over noise reduction) relative to the original denoising mask obtained at block 604; the aggressiveness control parameter value is then used to vary the degree of speech retention in the denoised audio signal.
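As one illustration of filter-based smoothing, the sketch below applies a mean filter along the time axis with a filter size derived from the aggressiveness value; the particular mapping from aggressiveness to filter size is an assumption for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def mean_smooth_mask(mask, aggressiveness):
    """Mean-filter smoothing of a denoising mask along the time axis.

    The mapping from the aggressiveness value (in [0, 1]) to a filter size
    is assumed: lower aggressiveness -> larger filter -> more conservative,
    more speech-preserving masks. mask has shape (time, frequency).
    """
    filter_size = max(1, int(round((1.0 - aggressiveness) * 9)) + 1)
    return uniform_filter1d(mask, size=filter_size, axis=0, mode="nearest")
```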
It should be noted that smoothing/filtering may be performed with respect to the time axis or with respect to the frequency axis. In one example, smoothing/filtering may be performed on the time axis by:
MSMmod(t,f)=max(Mask(t,f),β*Mask(t-1,f))
In the equation given above, β is a parameter that may be determined based at least in part on the aggressiveness control parameter value to change the degree of speech retention in the denoised audio signal, where a larger value of β corresponds to increased speech retention, i.e., a more conservative result. In some embodiments, β may be in the range of 0 to 1, inclusive. Example values of β include 0, 0.2, 0.5, 0.7, 0.8, 1, etc.
In another example, smoothing/filtering may be performed on the frequency axis by:
MSMmod(t,f)=max(Mask(t,f),β*Mask(t,f-1))
Similar to the above, β is a parameter that may be determined based at least in part on the aggressiveness control parameter value to change the degree of speech retention in the denoised audio signal, where a larger value of β corresponds to increased speech retention, i.e., a more conservative result. In some embodiments, β may be in the range of 0 to 1. Example values of β include 0, 0.2, 0.5, 0.7, 0.8, 0.99, etc.
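The two max-based smoothing rules above can be implemented directly; the following sketch covers both the time-axis and frequency-axis variants.

```python
import numpy as np

def smooth_mask(mask, beta, axis="time"):
    """Max-based smoothing of a denoising mask along time or frequency.

    Implements MSM_mod(t, f) = max(Mask(t, f), beta * Mask(t-1, f)) for the
    time axis, or max(Mask(t, f), beta * Mask(t, f-1)) for the frequency
    axis. Larger beta (in [0, 1]) yields a more conservative, more
    speech-preserving result. mask has shape (time, frequency).
    """
    shifted = np.zeros_like(mask)
    if axis == "time":
        shifted[1:, :] = mask[:-1, :]   # Mask(t-1, f)
    else:
        shifted[:, 1:] = mask[:, :-1]   # Mask(t, f-1)
    return np.maximum(mask, beta * shifted)
```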
It should be noted that the denoising mask may be modified by combining the techniques described above. For example, in some embodiments, the denoising mask may be modified by applying a compressor function (e.g., a power-law compressor or a Gaussian compressor) and then performing smoothing/filtering.
At 608, process 600 may apply the modified denoising mask to the frequency domain representation of the noisy audio signal to obtain a denoised spectrum. Given the modified denoising mask represented by MSMmod(t, f) and the frequency domain representation of the noisy audio signal represented by X(t, f), the denoised spectrum, denoted Y(t, f), can be determined by:
Y(t,f)=X(t,f)*MSMmod(t,f)
In other words, in some implementations, the denoised spectrum may be obtained by multiplying the frequency domain representation of the noisy audio signal with the modified denoising mask.
At 610, process 600 may generate a time domain representation of the denoised spectrum to generate a denoised audio signal. For example, as described above in connection with fig. 1, process 600 may apply an inverse frequency transform to the denoised spectrum to generate the denoised audio signal. In some implementations, process 600 may first map the frequency bands back to frequency bins (i.e., undo the banding) before applying the inverse frequency transform.
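Putting blocks 604-610 together, the following sketch applies a (possibly modified) denoising mask to an STFT of the noisy signal and converts the result back to the time domain. The use of SciPy's STFT and the specific parameters (sample rate, window length) are assumptions for illustration, and the banding/unbanding of frequency bins is omitted for brevity.

```python
import numpy as np
from scipy.signal import stft, istft

def denoise(noisy, mask_fn, fs=16000, nperseg=512):
    """End-to-end sketch of blocks 604-610 under assumed STFT parameters.

    mask_fn is any callable producing a (possibly modified) denoising mask
    with the same shape as the magnitude spectrogram, e.g. the output of a
    trained model followed by the compressors or smoothing shown above.
    """
    _, _, X = stft(noisy, fs=fs, nperseg=nperseg)     # complex spectrogram X
    msm_mod = mask_fn(np.abs(X))                      # modified denoising mask
    Y = X * msm_mod                                   # Y(t,f) = X(t,f) * MSM_mod(t,f)
    _, denoised = istft(Y, fs=fs, nperseg=nperseg)    # back to the time domain
    return denoised
```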
Fig. 7 is a block diagram illustrating an example of components of an apparatus capable of implementing various aspects of the disclosure. As with the other figures provided herein, the types and numbers of elements shown in fig. 7 are provided by way of example only. Other embodiments may include more, fewer, and/or different types and numbers of elements. According to some examples, the apparatus 700 may be configured to perform at least some of the methods disclosed herein. In some implementations, the apparatus 700 may be or may include a television, one or more components of an audio system, a mobile device (such as a cellular telephone), a laptop computer, a tablet device, a smart speaker, or another type of device.
According to some alternative embodiments, the apparatus 700 may be or may include a server. In some such examples, the apparatus 700 may be or may include an encoder. Thus, in some cases, the apparatus 700 may be a device configured for use within an audio environment, such as a home audio environment, while in other cases the apparatus 700 may be a device configured for use in a "cloud", e.g., a server.
In this example, the apparatus 700 includes an interface system 705 and a control system 710. In some implementations, the interface system 705 can be configured to communicate with one or more other devices in an audio environment. In some examples, the audio environment may be a home audio environment. In other examples, the audio environment may be another type of environment, such as an office environment, an automobile environment, a train environment, a street or sidewalk environment, a park environment, and so forth. In some implementations, the interface system 705 can be configured to exchange control information and associated data with audio devices of an audio environment. In some examples, the control information and associated data may relate to one or more software applications being executed by the apparatus 700.
In some implementations, the interface system 705 can be configured to receive a content stream or to provide a content stream. The content stream may include audio data. The audio data may include, but may not be limited to, audio signals. In some examples, the audio data may include spatial data such as channel data and/or spatial metadata. In some examples, the content stream may include video data and audio data corresponding to the video data.
The interface system 705 may include one or more network interfaces and/or one or more external device interfaces, such as one or more Universal Serial Bus (USB) interfaces. According to some embodiments, the interface system 705 may include one or more wireless interfaces. The interface system 705 may include one or more devices for implementing a user interface, such as one or more microphones, one or more speakers, a display system, a touch sensor system, and/or a gesture sensor system. In some examples, the interface system 705 may include one or more interfaces between a control system 710 and a memory system, such as the optional memory system 715 shown in fig. 7. However, in some cases, control system 710 may include a memory system. In some implementations, the interface system 705 can be configured to receive input from one or more microphones in an environment.
For example, control system 710 may include a general purpose single or multi-chip processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.
In some implementations, control system 710 may reside in more than one device. For example, in some implementations, a portion of control system 710 may reside in a device within one of the environments depicted herein, and another portion of control system 710 may reside in a device outside of the environment, such as a server, mobile device (e.g., smart phone or tablet), etc. In other examples, a portion of control system 710 may reside in a device within an environment, and another portion of control system 710 may reside in one or more other devices of the environment. For example, a portion of control system 710 may reside in a device, such as a server, that implements a cloud-based service, and another portion of control system 710 may reside in another device, such as another server, a memory device, etc., that implements a cloud-based service. In some examples, the interface system 705 may also reside in more than one device.
In some implementations, control system 710 may be configured to at least partially perform the methods disclosed herein. According to some examples, control system 710 may be configured to implement methods that utilize aggressive control parameters in training a machine learning model, aggressive control parameters in post-processing, and so forth.
Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on one or more non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to Random Access Memory (RAM) devices, read Only Memory (ROM) devices, and the like. One or more non-transitory media may be located, for example, in the optional memory system 715 and/or the control system 710 shown in fig. 7. Accordingly, various innovative aspects of the subject matter described in the present disclosure can be implemented in one or more non-transitory media having software stored thereon. For example, the software may include instructions for utilizing aggressive control parameters in training a machine learning model, utilizing aggressive control parameters in post-processing, and the like. For example, the software may be executed by one or more components of a control system, such as control system 710 of FIG. 7.
In some examples, the apparatus 700 may include an optional microphone system 720 shown in fig. 7. Optional microphone system 720 may include one or more microphones. In some implementations, one or more microphones may be part of or associated with another device, such as a speaker of a speaker system, a smart audio device, or the like. In some examples, the apparatus 700 may not include the microphone system 720. However, in some such embodiments, the apparatus 700 may still be configured to receive microphone data for one or more microphones in an audio environment via the interface system 705. In some such implementations, a cloud-based implementation of the apparatus 700 may be configured to receive microphone data, or noise indicia corresponding at least in part to microphone data, from one or more microphones in an audio environment via the interface system 705.
According to some embodiments, the apparatus 700 may comprise an optional loudspeaker system 725 shown in fig. 7. The optional loudspeaker system 725 may include one or more loudspeakers, which may also be referred to herein as "speakers" or, more generally, as "audio reproduction transducers". In some examples (e.g., cloud-based implementations), the apparatus 700 may not include the loudspeaker system 725. In some embodiments, the apparatus 700 may comprise headphones. Headphones may be connected or coupled to the apparatus 700 via a headphone jack or via a wireless connection (e.g., Bluetooth).
Aspects of the present disclosure include a system or device configured (e.g., programmed) to perform one or more examples of the disclosed methods, and a tangible computer-readable medium (e.g., disk) storing code for implementing one or more examples of the disclosed methods or steps thereof. For example, some disclosed systems may be or include a programmable general purpose processor, digital signal processor, or microprocessor programmed with software or firmware and/or otherwise configured to perform any of a variety of operations on data, including embodiments of the disclosed methods or steps thereof. Such a general purpose processor may be or include a computer system including an input device, memory, and a processing subsystem programmed (and/or otherwise configured) to perform one or more examples of the disclosed methods (or steps thereof) in response to data asserted thereto.
Some embodiments may be implemented as a configurable (e.g., programmable) Digital Signal Processor (DSP) that is configured (e.g., programmed and otherwise configured) to perform the required processing on the audio signal(s), including the execution of one or more examples of the disclosed methods. Alternatively, embodiments of the disclosed systems (or elements thereof) may be implemented as a general-purpose processor (e.g., a Personal Computer (PC) or other computer system or microprocessor, which may include an input device and memory) programmed and/or otherwise configured with software or firmware to perform any of a variety of operations, including one or more examples of the disclosed methods. Alternatively, elements of some embodiments of the inventive system are implemented as a general-purpose processor or DSP configured (e.g., programmed) to perform one or more examples of the disclosed methods, and the system also includes other elements (e.g., one or more microphones and/or one or more loudspeakers). A general purpose processor configured to perform one or more examples of the disclosed methods may be coupled to an input device (e.g., a mouse and/or keyboard), memory, and a display device.
Another aspect of the disclosure is a computer-readable medium (e.g., a disk or other tangible storage medium) storing code (e.g., an encoder executable to perform one or more examples of the disclosed methods or steps thereof) for performing one or more examples of the disclosed methods or steps thereof.
While specific embodiments of, and applications for, the present disclosure have been described herein, it will be apparent to those of ordinary skill in the art that many more modifications than mentioned herein are possible without departing from the scope of the disclosure described and claimed herein. It is to be understood that while certain forms of the disclosure have been illustrated and described, the disclosure is not to be limited to the specific embodiments described and illustrated or to the specific methods described.

Claims (19)

1. A method of performing denoising of an audio signal, comprising:
Determining, by the control system, an aggressive control parameter value that adjusts a degree of speech preservation to be applied when denoising the audio signal;
Obtaining, by the control system, a training set of training samples, the training samples in the training set having a noisy audio signal and a target denoising mask; and
Training a machine learning model by the control system by:
(a) Generating a frequency domain representation of the noisy audio signal corresponding to the training samples,
(B) Providing the frequency domain representation of the noisy audio signal to the machine learning model,
(C) Generating a predictive denoising mask based on an output of the machine learning model,
(D) Determining a loss of error representing the predictive denoising mask relative to the target denoising mask corresponding to the training sample,
(E) Updating weights associated with the machine learning model, and
(F) Repeating (a) to (e) until a stopping criterion is reached,
Wherein the trained machine learning model is operable to take as input a noisy test audio signal and to generate a corresponding denoised test audio signal, and wherein the aggressive control parameter values are used for at least one of: 1) Generating a frequency domain representation of the noisy audio signal included in the training set; 2) Modifying the target denoising mask included in the training set; 3) Determining an architecture of the machine learning model prior to training the machine learning model; or 4) determining the loss.
2. The method of claim 1, wherein generating the frequency domain representation of the noisy audio signal comprises:
Generating a spectrum of the noisy audio signal; and
A frequency domain representation of the noisy audio signal is generated by grouping bins of a frequency spectrum of the noisy audio signal into a plurality of frequency bands, wherein a number of frequency bands is determined based on the aggressive control parameter values.
3. The method of any of claims 1 or 2, wherein modifying the target denoising mask included in the training set comprises applying a power function to a target denoising mask of the target denoising masks, and wherein an exponent of the power function is determined based on the aggressiveness control parameter value.
4. The method of any of claims 1-3, wherein the machine learning model comprises a Convolutional Neural Network (CNN), and wherein determining an architecture of the machine learning model comprises determining a filter size of a convolutional block of the CNN based on the aggressive control parameter values.
5. The method of any of claims 1-3, wherein the machine learning model comprises a U-Net, and wherein determining an architecture of the machine learning model comprises determining a depth of the U-Net based on the aggressive control parameter values.
6. The method of any of claims 1-5, wherein determining the penalty comprises applying a penalty weight to an error of the predictive denoising mask relative to the target denoising mask, and wherein the penalty weight is determined based at least in part on the aggressive control parameter value.
7. The method of claim 6, wherein the penalty weight is based at least in part on whether a corresponding noisy audio signal associated with the training sample comprises speech.
8. A method of performing denoising of an audio signal, the method comprising:
Determining, by the control system, an aggressive control parameter value that adjusts a degree of speech preservation to be applied when denoising the audio signal;
providing, by the control system, a frequency domain representation of the noisy audio signal to the trained model to generate a denoising mask;
modifying, by the control system, the denoising mask based at least in part on the aggressive control parameter value;
Applying, by the control system, the modified denoising mask to a frequency domain representation of the noisy audio signal to obtain a denoising spectrum; and
A time domain representation of the de-noised spectrum is generated by the control system to generate a de-noised audio signal.
9. The method of claim 8, wherein modifying the denoising mask comprises applying a compression function to the denoising mask, wherein parameters associated with the compression function are determined based on the aggressive control parameter values.
10. The method of claim 9, wherein the compression function comprises a power function, and wherein an exponent of the power function is determined based on the aggressiveness control parameter value.
11. The method of claim 9, wherein the compression function comprises an exponential function, and wherein a parameter of the exponential function is determined based on the aggressive control parameter value.
12. The method of any of claims 8 to 11, wherein modifying the denoising mask comprises performing smoothing on a denoising mask for a frame of the noisy audio signal based on a denoising mask generated for a previous frame of the noisy audio signal.
13. The method of claim 12, wherein performing the smoothing comprises multiplying the denoising mask for a frame of the noisy audio signal with a weighted version of the denoising mask generated for a previous frame of the noisy audio signal, wherein weights for generating the weighted version are determined based on the aggressiveness control parameter value.
14. The method of any of claims 12 or 13, wherein the denoising mask for frames of the noisy audio signal comprises a time axis and a frequency axis, and wherein smoothing is performed with respect to the time axis.
15. The method of any of claims 12 or 13, wherein the denoising mask for frames of the noisy audio signal comprises a time axis and a frequency axis, and wherein smoothing is performed with respect to the frequency axis.
16. The method of any of claims 8 to 15, wherein the aggressive control parameter value is determined based on whether a current frame of the noisy audio signal comprises speech.
17. The method of any of claims 8 to 16, further comprising causing the generated de-noised audio signal to be presented via one or more loudspeakers or headphones.
18. An apparatus configured to implement the method of any one of claims 1 to 17.
19. One or more non-transitory media having software stored thereon, the software comprising instructions for controlling one or more devices to perform the method of any of claims 1-17.
CN202280074139.3A 2021-11-09 2022-11-08 Control of speech retention in speech enhancement Pending CN118215961A (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN2021129573 2021-11-09
CNPCT/CN2021/129573 2021-11-09
US63/289,846 2021-12-15
US202263364661P 2022-05-13 2022-05-13
US63/364,661 2022-05-13
PCT/US2022/049193 WO2023086311A1 (en) 2021-11-09 2022-11-08 Control of speech preservation in speech enhancement

Publications (1)

Publication Number Publication Date
CN118215961A true CN118215961A (en) 2024-06-18

Family

ID=91450988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280074139.3A Pending CN118215961A (en) 2021-11-09 2022-11-08 Control of speech retention in speech enhancement

Country Status (1)

Country Link
CN (1) CN118215961A (en)

Similar Documents

Publication Publication Date Title
CN112105902B (en) Perceptually-based loss functions for audio encoding and decoding based on machine learning
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
WO2020232180A1 (en) Method and apparatus for speech source separation based on a convolutional neural network
US10755728B1 (en) Multichannel noise cancellation using frequency domain spectrum masking
JP6987075B2 (en) Audio source separation
CN104637491A (en) Externally estimated SNR based modifiers for internal MMSE calculations
JP6764923B2 (en) Speech processing methods, devices, devices and storage media
EP4189677B1 (en) Noise reduction using machine learning
US20240177726A1 (en) Speech enhancement
WO2023001128A1 (en) Audio data processing method, apparatus and device
CN114373473A (en) Simultaneous noise reduction and dereverberation through low-delay deep learning
CN104637493A (en) Speech probability presence modifier improving log-mmse based noise suppression performance
WO2023086311A1 (en) Control of speech preservation in speech enhancement
CN104637490A (en) Accurate forward SNR estimation based on MMSE speech probability presence
WO2015027168A1 (en) Method and system for speech intellibility enhancement in noisy environments
CN108899041B (en) Voice signal noise adding method, device and storage medium
CN118215961A (en) Control of speech retention in speech enhancement
CN116312570A (en) Voice noise reduction method, device, equipment and medium based on voiceprint recognition
WO2023287782A1 (en) Data augmentation for speech enhancement
WO2023118644A1 (en) Apparatus, methods and computer programs for providing spatial audio
CN110648681B (en) Speech enhancement method, device, electronic equipment and computer readable storage medium
CN114360572A (en) Voice denoising method and device, electronic equipment and storage medium
US20230402050A1 (en) Speech Enhancement
US20230343312A1 (en) Music Enhancement Systems
EP4258263A1 (en) Apparatus and method for noise suppression

Legal Events

Date Code Title Description
PB01 Publication