WO2019094562A1 - Neural network based blind source separation - Google Patents

Neural network based blind source separation


Publication number
WO2019094562A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio signals
neural network
function
convolutional neural
parameters
Application number
PCT/US2018/059785
Other languages
French (fr)
Inventor
Longfei YAN
Willem Bastiaan Kleijn
Original Assignee
Google LLC
Application filed by Google LLC


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • BSS Blind Source Separation
  • In audio BSS, a signal can be considered as a function of time and a source is a physical object creating a signal. If only one signal mixture is recorded, this can be categorized as a single-channel audio signal. On the other hand, multi-channel audio signals involve multiple signal mixtures.
  • Source signals can be mixed in different manners. Nonlinear mixing can be more prevalent in image processing. For audio BSS techniques, linear mixing generally is an accurate description of the physical mixing process. There are two categories of linear mixtures. The first category considers the delayed (convolutive) contribution of source signals, which is common in telecommunications or in reverberant environments. The second category can assume that the signals are instantaneously mixed at any given time. Short-time Fourier transforms can be used to convert convolutive mixing to a close approximation of instantaneous mixing.
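  • As a minimal sketch of the instantaneous case, assuming two synthetic sources and a hypothetical mixing matrix `A` (neither is taken from this disclosure), each observed channel is a weighted sum of the sources at the same time instant:

```python
import numpy as np

# Hypothetical example: two synthetic sources, 1 second at 16 kHz.
rng = np.random.default_rng(0)
n_samples = 16000
t = np.arange(n_samples) / 16000.0
sources = np.vstack([
    np.sign(np.sin(2 * np.pi * 5 * t)),      # square wave (non-Gaussian)
    rng.uniform(-1.0, 1.0, size=n_samples),  # uniform noise (non-Gaussian)
])

# Assumed mixing matrix: row i holds the weight of each source in channel i.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])

# Instantaneous linear mixing: X[:, n] depends only on sources[:, n].
X = A @ sources

# With a known, invertible A the sources are recovered exactly. A BSS
# technique has to estimate such a separating matrix without knowing A.
recovered = np.linalg.inv(A) @ X
```

  • The inverse of `A` is used here only to show what a separating matrix does; in a blind setting it is not available and has to be estimated from `X` alone.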
  • Example implementations can identify independent audio source signals included in a mixed audio signal by mapping the signals in the mixed audio signal to a set of audio signals and minimizing correlation between pairs of signals in the set of audio signals.
  • a device in a general aspect includes a sound acquisition manager configured to receive a mixed audio signal including a first plurality of audio signals, an independent component analysis manager configured by a set of parameters to generate a second plurality of audio signals based on the first plurality of audio signals, and to minimize, by adjusting said configuring parameters, a correlation between pairs of signals of the converted second plurality of audio signals, and a memory configured to store the second plurality of audio signals as multi-channel audio data.
  • a method and a non-transitory computer-readable storage medium having stored thereon computer executable program code which, when executed on a computer system, causes the computer system to perform steps.
  • the steps include receiving a mixed audio signal including a first plurality of audio signals, and storing the identified independent audio source signals as multi-channel audio data, determining a set of parameters configured to generate a second plurality of audio signals based on the first plurality of audio signals, converting the second plurality of audio signals using a nonlinear function, minimizing a correlation between pairs of signals of the converted second plurality of audio signals and storing the second plurality of audio signals as multi-channel audio data.
  • Implementations can include one or more of the following features.
  • the plurality of audio signals can be treated as temporally independent, the plurality of audio signals can be identically distributed, and the plurality of audio signals can be non-Gaussian.
  • the minimizing of the correlation includes measuring correlation using a contrast function.
  • the nonlinear function is at least one sigmoid function.
  • the minimizing of the correlation includes revising the set of parameters.
  • the independent component analysis manager is further configured to apply a pre-whitening function to the first plurality of audio signals and the pre-whitening transforms the first plurality of audio signals into a data set having unit covariance.
  • the set of parameters are elements of a separating matrix.
  • the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically.
  • the set of parameters can be selected to reduce the correlation of pairs of converted second plurality audio signals.
  • the set of parameters can be determined using backpropagation and stochastic gradient descent.
  • the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically, an output of the convolutional neural network is used as an input to an activation function, and the activation function is configured to allow the convolutional neural network to implement a non-linear process.
  • the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically, a loss function for the convolutional neural network is a contrast function (e.g., a correlation between the second plurality of signal components), the separating matrix is an orthogonal matrix including the set of parameters as weights connecting input neurons and output neurons of the convolutional neural network.
  • a method and a non-transitory computer-readable storage medium having stored thereon computer executable program code which, when executed on a computer system, causes the computer system to perform steps.
  • the steps include receiving a mixed audio signal including a first plurality of audio signals, using a convolutional neural network to numerically estimate a separating matrix for the first plurality of audio signals, applying at least one activation function including a sigmoid function to an output of the convolutional neural network and minimizing a correlation between pairs of signals generated using the at least one activation function.
  • FIG. 1 illustrates a block diagram of an example system according to at least one example embodiment.
  • FIG. 2 illustrates a block diagram of a method for identifying independent audio source signals according to at least one example embodiment.
  • FIG. 3 illustrates pseudo code for a correlation independent component analysis (CICA) algorithm according to at least one example embodiment.
  • CICA correlation independent component analysis
  • FIG. 4 illustrates a block diagram of a neural network architecture according to at least one example embodiment.
  • FIG. 5 illustrates pseudo code for an independence classifier algorithm according to at least one example embodiment.
  • FIG. 6 illustrates pseudo code for generating a mixing matrix according to at least one example embodiment.
  • FIG. 7 illustrates a flow diagram for an architecture for an independence classifier according to at least one example embodiment.
  • FIG. 8 illustrates a flow diagram for an architecture for generating a mixing matrix according to at least one example embodiment.
  • FIG. 9 shows an example of a computer device and a mobile computer device according to at least one example embodiment.
  • A neural network based approach that can implement complex mappings to perform a multi-channel audio BSS technique, and that can be applied in a real-time environment, is described herein.
  • Multi-channel audio can include more information than single-channel audio because independent signals (within multi-channel audio) can be distinguishable from one another.
  • Implementations determine BSS using a neural network based ICA technique.
  • techniques can use a deep neural network (DNN) model to distinguish independent signals and/or a DNN model to classify signal mixtures according to their mixing matrices.
  • DNN deep neural network
  • FIG. 1 is a block diagram that illustrates an example system 100 in which the improved techniques described herein can be implemented.
  • the system 100 can include a sound acquiring computer 120 that is configured to acquire a mixed audio signal (e.g., a sound field detected using at least one microphone) and separate signal sources (e.g., sources of sound waves) within the mixed audio signal.
  • the sound acquiring computer 120 can include at least one interface 122, at least one processing unit 124, and memory 126.
  • the at least one interface 122 can include, for example, an audio adaptor configured to convert analog audio signals received from at least one detector into an electronic form for use by the sound acquiring computer 120.
  • the at least one detector can be a device configured to capture sound and communicate a signal representing the captured sound.
  • the at least one detector can be a microphone, a piezoelectric device, a fiber optic sensor, a microphone chip or silicon microphone, and/or the like.
  • the at least one interface 122 can include, for example, Ethernet adaptors, Token Ring adaptors, etc., for converting electronic and/or optical signals received from a network into an electronic form for use by the sound acquiring computer 120.
  • the at least one processing unit 124 can include one or more processing chips and/or assemblies.
  • the memory 126 can include both volatile memory (e.g., RAM) and non-volatile memory, such as one or more ROMs, disk drives, solid state drives, and the like.
  • the memory 126 can include a non-transitory computer readable storage medium.
  • the at least one processing unit 124 and the memory 126 together form control circuitry, which is configured and arranged to carry out various methods and functions as described herein.
  • one or more of the components of the sound acquiring computer 120 can include processors (e.g., at least one processing unit 124) configured to process instructions stored in the memory 126. Examples of such instructions include a sound acquisition manager 130, an independent component analysis (ICA) manager 140, a non-negative matrix factorization (NMF) manager 142, an autoencoder manager 144, a machine learning module 150, and a mixing matrix module 160.
  • the memory 126 can be configured to store various data, which is described with respect to the respective managers that use such data.
  • the sound acquisition manager 130 can be configured to acquire a mixed audio signal on which a BSS technique can be applied, the output of which can be stored as multi-channel audio data 132.
  • the mixed audio signal can correspond to audio data captured by at least one microphone.
  • a plurality of human speakers can speak for a period of time.
  • the captured audio data can be communicated to the sound acquiring computer 120 and processed by the at least one interface 122.
  • Processing can include, for example, analog to digital conversion at a desired sampling rate.
  • capturing the multi-channel audio data 132 can include detecting and communicating three English speakers for seven seconds and converting the audio signal using a sampling rate of 16 kHz.
  • the multi-channel audio data 132 can include data identifying independent audio source signals included in the acquired mixed audio signal.
  • the independent audio source signals can be compressed and stored and/or communicated to another device.
  • the independent audio source signals can be stored and/or communicated with a corresponding video file.
  • the sound acquiring computer 120 can be an element of a hearing aid device. Therefore, the independent audio source signals can be filtered based on a user's desire to listen to a signal source and disregard other signal sources.
  • the sound acquiring computer 120 can be an element of an augmented reality device. Therefore, the independent audio source signals can be separated and communicated to a speaker to generate, for example, a stereo listening effect.
  • the ICA manager 140 can be configured to implement an ICA technique.
  • the ICA manager 140 can be configured to implement a correlation ICA (CICA) technique.
  • CICA correlation ICA
  • An assumption regarding the source signals can be used in BSS techniques.
  • the assumption is that the source signals can be considered as temporally independent (e.g., a first person's recorded voice is not dependent on an initiation in time or a duration of time when compared to a second person's recorded voice) and identically distributed (IID) but non-Gaussian.
  • IID identically distributed
  • Independent indicates that the source of data is initiated as independent events.
  • Non-Gaussian indicates that the data distribution is not Gaussian.
  • for example, in a mixed audio signal where the sources are two people speaking (e.g., in a room), a first person's recorded voice is independent in time, and independent as a source, when compared to a second person's recorded voice.
  • the data for each source (e.g., the first person and the second person) associated with the mixed audio signal has a non-Gaussian distribution that doesn't fluctuate and is taken from the same probability distribution.
  • the assumption leads to using ICA techniques for BSS that are based on or use higher order statistics (HOS), such as FastICA, Infomax ICA and Kernel ICA (each described below in more detail).
  • HOS higher order statistics
  • the independence assumption for audio sources can be reasonable given long enough excerpt duration in the time domain. When the excerpt duration is below, for example, 20ms, the independence assumption does not hold. But when the excerpt duration is over, for example, 100ms, the independence assumption is valid. This means ICA techniques may not separate signal mixtures immediately but with an acceptable delay of 100ms.
  • In audio applications the sources are typically independent. Therefore, example implementations can use ICA techniques for audio BSS. There are several advantages of using ICA; the following list is not exhaustive. First, the working mechanism of ICA can be defined using signal processing and statistics. Second, in many ICA algorithms, the uniqueness of the solution can be guaranteed due to global convergence. Third, ICA can be implemented on non-negative signals and mixed-sign signals. Further, in multi-channel BSS, ICA can benefit from having more sensors (e.g., microphones) than source signals.
  • a contrast function can be an optimization criterion the global optima of which correspond to a separation of all sources.
  • the contrast function should be able to consistently estimate how separate the outputs are.
  • Contrast functions based on HOS can be used in temporally IID and non-Gaussian distributions.
  • a correlation measurement can be used as a contrast function in ICA to measure output independence.
  • This technique can be called a correlation ICA (CICA) technique (e.g., as implemented in the ICA manager 140).
  • FIG. 2 illustrates a block diagram of a method for identifying independent audio source signals according to at least one example embodiment.
  • a mixed audio signal including a plurality of audio signals (e.g., a first plurality of audio signals) is received.
  • for example, the mixed audio signal is received by the sound acquisition manager 130 via the at least one interface 122.
  • In step S210, a separating matrix, including a set of parameters, is determined (and/or selected) for the mixed audio signal, and in step S215 another plurality of audio signals (or second plurality of audio signals) is generated using the separating matrix.
  • Searching for a solution may be difficult (e.g., time consuming, processor intensive, memory intensive, and the like).
  • one or more predetermined functions may be used and this may include the identity function, the latter resulting in a linear correlation of the two random variables $Y_1$ and $Y_2$.
  • In step S220, the another plurality of audio signals (or second plurality of audio signals) is converted using a nonlinear function (or a plurality of nonlinear functions).
  • at least one sigmoid function can be applied to the another plurality of audio signals because the sigmoid projection can be a sufficient nonlinear function for measuring correlation in BSS.
  • Other functions that can be used to minimize the correlation can include (but are not limited to) a logistic function (which is one kind of sigmoid function), a Heaviside function, a Tanh function, an ArcTan function and/or the like.
  • the function configured to minimize the nonlinear correlation e.g., the sigmoid function
  • at least one constraint e.g., limit nonlinear correlation to zero (0)
  • In step S225, the set of parameters is revised to minimize a correlation between pairs of signals in the converted plurality of audio signals.
  • Minimizing correlation can include repeating steps S210-S220 with a revised set of parameters.
  • Minimizing correlation can include backpropagation of a stochastic gradient obtained for a batch or epoch of data to obtain a revised set of parameters.
  • Minimizing correlation can include determining a threshold number of signal pairs have a correlation equal to zero (0).
  • In step S230, in response to minimizing the correlation, independent audio source signals included in the mixed audio signal are identified as the another plurality of audio signals.
  • Pre-whitening can help with linear correlation minimization.
  • Pre-whitening can transform the received audio signals $X \in \mathbb{R}^{M \times N}$ into another set of signals which has unit covariance. M is the dimensionality of the observed signals, which is equal to the number of sources, and N is the observed data length in each dimension.
  • mean subtraction for X can be performed before pre-whitening.
  • $C_s = I$ due to the indeterminacy of scale in BSS techniques.
  • $C_x = AA^T$.
  • $A = UDQ^T$, where U and Q are orthonormal matrices and D is a diagonal matrix.
  • $C_x = UD^2U^T$.
  • Pre-whitening can help convert the CICA technique into determining an optimally conditioned orthonormal separating matrix. This can reduce the possible choices for a separating matrix. If pre-whitening is coupled with the nonlinear correlation properly, this combination can effectively and efficiently perform audio BSS tasks by finding the correct or best separating matrix.
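  • A minimal pre-whitening sketch, assuming NumPy, synthetic uniform sources and a hypothetical mixing matrix `A`. It uses the eigendecomposition of the sample covariance of the mean-subtracted mixtures to build a whitening matrix; the function name `prewhiten` is introduced here for illustration only:

```python
import numpy as np

def prewhiten(X):
    """Return whitened signals Z and the whitening matrix W for X (M x N)."""
    # Mean subtraction per channel, performed before whitening.
    Xc = X - X.mean(axis=1, keepdims=True)
    # Sample covariance (M x M) and its eigendecomposition U diag(d) U^T.
    C_x = Xc @ Xc.T / Xc.shape[1]
    eigvals, U = np.linalg.eigh(C_x)
    # W = diag(d)^(-1/2) U^T maps the covariance to the identity.
    W = np.diag(1.0 / np.sqrt(eigvals)) @ U.T
    return W @ Xc, W

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(3, 4000))   # assumed synthetic sources
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.6, 0.1, 1.0]])              # hypothetical mixing matrix
Z, W = prewhiten(A @ S)
cov_Z = Z @ Z.T / Z.shape[1]                 # close to the 3 x 3 identity
```

  • After this step only a rotation (an orthonormal matrix) remains to be found, which is the reduction in search space described above.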
  • the combination of pre-whitening and CICA can determine independence in two stages.
  • pre-whitening minimizes the linear correlation to zero.
  • a second operator is selected to minimize one or more nonlinear correlations.
  • the second operator can be a separating matrix B.
  • the separating matrix B can be selected to minimize a nonlinear correlation that is a linear correlation after a nonlinear function (e.g., a sigmoid nonlinear function) subject to a constraint that keeps the linear correlation to zero.
  • the constraint can be that the separating matrix B should be orthogonal.
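  • The reason an orthogonal B keeps the linear correlation at zero can be written in one line. Here Z denotes the pre-whitened signals with covariance $C_z = I$ and $Y = BZ$ denotes the outputs; both symbols are introduced for illustration and are not taken from the disclosure:

```latex
C_y = E\!\left[YY^T\right] = B\,E\!\left[ZZ^T\right]B^T = B\,I\,B^T = BB^T = I
\quad \text{whenever } B \text{ is orthogonal.}
```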
  • a convolutional neural network can be used to estimate the separating matrix B numerically.
  • the separating matrix can form a filter in the convolutional neural network and the nonlinear function is the nonlinear activation function of the network.
  • a neural network can perform BSS tasks with backpropagation through stochastic gradient descent.
  • the orthogonal matrix to be determined can consist of the weights connecting input and output neurons. Pseudo code for the CICA algorithm is shown in FIG. 3.
  • the CICA algorithm can be implemented using a convolutional neural network consisting of input and output neurons.
  • Example architectures can use any number of input and output neurons so long as input and output neurons are equal in size. In an example implementation, three input and three output neurons can be used.
  • FIG. 4 illustrates an architecture of the CICA neural network according to an example implementation.
  • the CICA neural network architecture 400 includes three input neurons 405 where X1, X2, and X3 represent the three channels of observed signal mixtures.
  • the CICA neural network architecture 400 includes three output neurons 410 where Y1, Y2, and Y3 represent the demixed signals in three channels.
  • the CICA neural network architecture 400 includes three function neurons 415 where f represents an activation function (e.g., a sigmoid function).
  • the input neurons 405 and the output neurons 410 are fully connected.
  • the activation function is configured to define the output behavior of the corresponding output neuron 410.
  • the output of the activation function can be used as the output of the neural network and/or used as the input for another neuron.
  • the activation function can introduce non-linear properties to the neural network, allowing the neural network to learn from complicated, non-linear mappings between inputs and response variables.
  • the activation function allows the CICA neural network architecture 400 to implement a non-linear process.
  • the CICA neural network architecture 400 is shown as including a plurality of tiers 420.
  • Each tier 420 includes the neural network shown in tier 1.
  • the activation function f can be the same function or a different function in each tier 420.
  • the output of the function neurons 415 of tier 1 can be the input to input neuron 405 of tier 2, which can repeat until tier n, in which the output of the function neurons 415 (in tier n) is the output of the neural network.
  • the audio signals can be input to the input neuron 405 of each tier 1 to n and the output of the function neuron 415 of each tier 1-n is the output of the neural network.
  • the equivalent output of each tier can be summed together as the output of the CICA neural network architecture 400.
  • the contrast function consists of four components: a correlation measurement C, a weights orthogonality measurement O, a regularization term Q for balancing energy in different demixed channels and a regularization term V to ensure the minimum variance of the demixed signals is guaranteed.
  • a correlation measurement C is the sum of pairwise correlation between output signals.
  • each column of X and Y can define a signal channel (e.g., in the multi-channel audio signal).
  • the weights orthogonality measurement O can be calculated as
  • $B_1$, $B_2$ and $B_3$ can denote the three columns of B.
  • the energy balancing term Q can be calculated by can help
  • the minimum variance of the observed signal channel can be set to v.
  • the variances of three demixed signal channels be and .
  • the variance regularization term V can be calculated where relu(·) yields a
  • V can help cause variances of the demixed signal channels to be above the minimum variance.
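  • The formulas for the contrast function components are not reproduced in the text above, so the following is only a sketch under stated assumptions: C is taken as the sum of absolute pairwise correlations between sigmoid-transformed output channels, O as the squared Frobenius distance of $B^T B$ from the identity, and V as relu(v − variance) summed over channels. The energy balancing term Q is omitted because its form cannot be recovered here. All function names are hypothetical.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def correlation_term(Y):
    """Assumed form of C: sum of absolute pairwise correlations of f(Y).

    Y has shape (N, M); each column is one demixed channel.
    """
    R = np.corrcoef(sigmoid(Y), rowvar=False)
    upper = np.triu_indices(R.shape[0], k=1)
    return np.abs(R[upper]).sum()

def orthogonality_term(B):
    """Assumed form of O: squared Frobenius distance of B^T B from I."""
    return np.sum((B.T @ B - np.eye(B.shape[1])) ** 2)

def variance_term(Y, v):
    """Assumed form of V: relu(v - variance) summed over channels."""
    return np.maximum(v - Y.var(axis=0), 0.0).sum()

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(4000, 3))   # independent channels (columns)
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.6, 0.1, 1.0]])              # hypothetical mixing matrix
X = S @ A.T                                  # mixed channels (columns)

c_sources = correlation_term(S)   # small: channels are independent
c_mixed = correlation_term(X)     # larger: channels share sources
```

  • In a training loop these terms would be combined into a single loss and minimized with respect to B; the weighting between them is a design choice that is not specified here.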
  • the batch size used in neural network training should contribute to one quarter of a second of speech. For example, if the sampling rate used in dataset is 16kHz, the batch size should be 4000.
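  • The batch size arithmetic can be checked with a short helper (the function name is hypothetical):

```python
def batch_size_for(sample_rate_hz, seconds=0.25):
    """Number of samples covering `seconds` of audio at `sample_rate_hz`."""
    return int(round(sample_rate_hz * seconds))

batch = batch_size_for(16000)   # 16000 samples/s * 0.25 s
```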
  • the weights and bias in the neural network are initialized from a standard normal distribution. The sum of each column of the weight matrix can be constrained to be 1.
  • the first choice can be to fix the learning rate at a constant value (e.g., 0.0001). This way the neural network can converge, but it may be slow.
  • the second choice is to use a dynamic learning rate.
  • the starting learning rate can be 0.001.
  • the learning rate can be updated by multiplying it by 0.1 every 14000 steps.
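  • A sketch of this dynamic schedule, assuming a staircase decay (the rate stays constant between updates); the function and parameter names are illustrative:

```python
def learning_rate(step, base_lr=0.001, decay=0.1, decay_steps=14000):
    """Step decay: multiply the rate by `decay` once every `decay_steps` steps."""
    return base_lr * decay ** (step // decay_steps)

# The rate for a few training steps around the decay boundaries.
schedule = [learning_rate(s) for s in (0, 13999, 14000, 28000)]
```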
  • the number of epochs for training can be set to, for example, 5000. But the training can stop early if there is no loss improvement observed. This early stop condition may be checked, for example, every 300 epochs.
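  • One possible reading of the early stop condition, assuming "no loss improvement" means that the best loss in the most recent 300 epochs is no better than the best loss seen before them. This interpretation and the function name are assumptions:

```python
def should_stop(loss_history, check_every=300):
    """Return True if the best loss did not improve over the last window.

    `loss_history` holds one loss value per completed epoch. The check only
    runs at multiples of `check_every`, once two full windows are available.
    """
    n = len(loss_history)
    if n < 2 * check_every or n % check_every != 0:
        return False
    previous_best = min(loss_history[:-check_every])
    recent_best = min(loss_history[-check_every:])
    return recent_best >= previous_best

improving = [1.0 / (epoch + 1) for epoch in range(600)]
flat = [1.0] * 600
```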
  • the ICA manager 140 can be configured to implement a Fast ICA technique.
  • Fast ICA is a linear transformation method for non-Gaussian data in order to separate statistically independent components from the observed mixture.
  • the assumptions are that the source signals are non-Gaussian and statistically independent. Both assumptions are typically satisfied with multi-channel audio.
  • the non-Gaussianity assumption is important in fast ICA because Gaussian variables may not be informative enough.
  • the independent components can still be estimated.
  • Fast ICA can take advantage of the property that the sum of independent variables is more Gaussian than any single variable. By doing the reverse process of making the sum (e.g., signal mixture) less Gaussian than before, source separation can be achieved. Therefore, by using a contrast function to measure the Gaussianity, Fast ICA amounts to an optimization problem that maximizes the non-Gaussianity of the estimated original source signal.
  • the optimization can be performed by finding a direction that maximizes non-Gaussianity and projecting the data onto that direction.
  • the next projected direction should be found on the orthogonal space of the previous direction to avoid repetition. This is called projection pursuit.
  • the process is repeated until the number of projected directions is equal to the number of estimated independent source signals.
  • Fast ICA can be a fixed-point iteration algorithm, exploiting that the demixing vector has zero derivative at a local optimum.
  • the iteration updating rules are dependent on the choice of the contrast function.
  • a kurtosis function and a negentropy function can be selected as the contrast functions.
  • Slowly growing functions can give better performance than kurtosis, as they have smaller asymptotic variance and are more robust to sample outliers.
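  • The property that a sum of independent variables is more Gaussian than its parts can be illustrated with kurtosis as the contrast function. This sketch assumes NumPy and two synthetic uniform sources; it only measures Gaussianity and is not a Fast ICA implementation:

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis; approximately 0 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)
s1 = rng.uniform(-1.0, 1.0, size=100_000)   # assumed non-Gaussian sources
s2 = rng.uniform(-1.0, 1.0, size=100_000)
mixture = (s1 + s2) / np.sqrt(2.0)

k_source = excess_kurtosis(s1)        # near -1.2 for a uniform distribution
k_mixture = excess_kurtosis(mixture)  # near -0.6, i.e. closer to Gaussian
```

  • A separation method that searches for projections with the largest absolute kurtosis therefore moves away from mixtures and towards individual sources.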
  • the ICA manager 140 can be configured to implement an Infomax ICA technique.
  • a neural network can have no hidden layer(s), only input and output layers exist.
  • the Infomax ICA takes advantage of the nonlinear transfer function of a neural network to maximize the mutual information between inputs and outputs. By doing so, statistically independent components can be separated from the signal mixture without assuming any knowledge of the input distributions.
  • Let X and Y be the input and output of a neural network. The mutual information between X and Y is $I(X;Y) = H(Y) - H(Y|X)$, where $H(Y)$ is the differential entropy of the output.
  • because a neural network can be regarded as a deterministic function, $H(Y|X)$ diverges to negative infinity, which can be avoided by taking the derivative of $I(X;Y)$ with respect to the weight parameter w. Maximizing the mutual information between input and output is then equivalent to maximizing the entropy of the output.
  • the entropy of the output can thus be expressed as
  • the neural network weight parameter w can be updated to maximize
  • maximizing the joint entropy of the outputs is equivalent to minimizing the mutual information between output variables, given a constraint on the signal power. When the mutual information $I(Y_1;Y_2)$ is zero, $Y_1$ and $Y_2$ become statistically independent.
  • the Infomax ICA approach comes up with a way to deal with mutual information without getting into the pitfalls of the distribution probabilities of data. Instead, the Jacobian between inputs and outputs can be differentiated with regard to the weight parameter to derive the desired independent components. This is used in a neural network when the Jacobian only accounts for one layer of transformation.
  • the ICA manager 140 can be configured to implement a Kernel ICA technique.
  • the Kernel ICA uses a contrast function based on canonical correlation in Reproducing Kernel Hilbert Space (RKHS). If two signals are independent, their correlation is zero. The converse may not be true in Euclidean space. But in RKHS, the converse can also be true. This reduces the independence problem to the zero-correlation problem, which is statistically simpler. Kernel ICA utilizes this property of RKHS and minimizes the maximal correlation between projected signals to extract independent components.
  • RKHS Reproducing Kernel Hilbert Space
  • CCA Canonical Correlation Analysis
  • Equation 4 An alternative form for Equation 4 is
  • Equations 5 and 6 indicate the eigenvalues appear in pairs:
  • Let $\mathcal{F}$ be an RKHS based on Gaussian kernels.
  • Let the projections in CCA be projections from $\mathbb{R}$ to $\mathcal{F}$. Then define the $\mathcal{F}$-correlation as the maximal correlation between the random variables and
  • Equation 7 can be written as
  • w is the maximal possible correlation between one-dimensional linear projections of and This is equivalent to finding the first canonical correlation between and , which suggests computing an ICA contrast function based on the computation of a canonical correlation in function space.
  • Equation 11 can be approximated using equations 12, 13 and 14 as:
  • the NMF manager 142 can be configured to implement an NMF technique.
  • NMF can be based on parts-based object perception. It is a way to factorize a matrix with non-negative constraints. A matrix is factorized into two components, the entries of which are all non-negative. In the frequency domain, this constraint is compatible with the intuition that the spectrograms of audio components can be added together to form an audio mixture spectrogram, since all entries in a spectrogram are non-negative.
  • the standard NMF setting can be suitable for single-channel BSS.
  • the matrix factorization becomes the decomposition of data spectrogram into a sum of low rank spectrograms. Each elementary spectrogram can represent a pattern in time. Simply stacking up spectrograms to form a larger matrix for decomposition may not optimally exploit the redundancy information between multiple channels.
  • NMF can be extended into a multichannel BSS technique to implement convolutional mixtures in an optimal manner.
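  • The disclosure does not specify an NMF algorithm. As one common choice, the sketch below uses the standard multiplicative update rules for the Frobenius-norm objective on a hypothetical non-negative matrix standing in for a magnitude spectrogram; the function name `nmf` and all sizes are assumptions:

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize non-negative V (F x T) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(V.shape[0], rank))
    H = rng.uniform(0.1, 1.0, size=(rank, V.shape[1]))
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Hypothetical magnitude "spectrogram" built from two non-negative patterns.
rng = np.random.default_rng(1)
V = rng.uniform(size=(20, 2)) @ rng.uniform(size=(2, 50))
W, H, errors = nmf(V, rank=2)
```

  • Each column of `W` plays the role of a spectral pattern and each row of `H` its activation over time, so `W @ H` is a sum of low rank spectrograms as described above.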
  • the autoencoder manager 144 can be configured to implement an autoencoder and/or non-negative autoencoder (NAE) technique.
  • NAE is an extension of normal autoencoders. Autoencoders are unsupervised neural networks aiming at minimizing the deviation between inputs and outputs. Multi-layer autoencoders can be realized by stacking up compositional autoencoders vertically. In a NAE model, we train the autoencoders with the constraint that none of its weights can be negative. This is equivalent to performing NMF on input signals, where the encoder and decoder weight matrices play the roles of the two non-negative factorized components. NAE is built on the basis of NMF, therefore NAE can inherit the computational complexity of NMF.
  • the CICA technique described in this disclosure can utilize portions and/or elements of Fast ICA, Infomax ICA and Kernel ICA.
  • the CICA technique can utilize the orthogonal projection utilized in FastICA, the function configured to minimize the nonlinear correlation (e.g., sigmoid function) employed in Infomax ICA and the correlation measurement involved in Kernel ICA.
  • the disclosed CICA technique can outperform the known techniques used for demixing determined linear instantaneously mixed signals.
  • the disclosed CICA technique can generate an improved signal separation at a faster speed.
  • the improvement includes the use of an SOS based contrast function in temporally IID non-Gaussian signals.
  • the disclosed CICA technique can use a contrast function that maximizes independence by minimizing linear and nonlinear correlations.
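As a hedged illustration of a correlation-based contrast, the sketch below sums the squared linear correlation and the squared correlation after a sigmoid nonlinearity. The exact contrast function of the disclosure is not reproduced here; the function `contrast`, the mixing matrix `A`, and the sample size are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corr(a, b):
    """Pearson correlation coefficient between two 1-D signals."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrast(y1, y2):
    """Hypothetical contrast: squared linear plus squared nonlinear correlation."""
    linear = corr(y1, y2)
    nonlinear = corr(sigmoid(y1), sigmoid(y2))
    return linear ** 2 + nonlinear ** 2

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 20000))          # independent, IID, non-Gaussian sources
A = np.array([[1.0, 0.8], [0.6, 1.0]])    # illustrative mixing matrix
x = A @ s                                 # instantaneous mixtures
c_sources = contrast(s[0], s[1])          # close to zero for independent signals
c_mixed = contrast(x[0], x[1])            # clearly larger for the mixtures
```

A separating matrix would be adjusted to drive this value toward the level seen for the independent sources.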
  • the CICA technique can be extended to signal mixtures with memory. This extension can be done by transforming signal mixtures to the time-frequency domain. The convolution then becomes multiplication in the time-frequency domain so that the convolutional mixtures can behave very similar to instantaneous mixtures.
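The claim that convolution becomes multiplication can be checked numerically. The sketch below is a simplification that uses a single DFT frame and circular convolution, for which the identity is exact; with short-time Fourier transforms the relation holds only approximately. The filter `h` is a made-up example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
s = rng.laplace(size=n)            # one frame of a source signal
h = np.zeros(n)
h[:4] = [1.0, 0.5, 0.25, 0.125]    # short, illustrative mixing filter

# Circular convolution computed directly in the time domain.
x_time = np.array([sum(h[k] * s[(t - k) % n] for k in range(n)) for t in range(n)])

# The same result as a bin-by-bin product in the frequency domain.
x_freq = np.fft.ifft(np.fft.fft(h) * np.fft.fft(s)).real
```

Per frequency bin the mixture is therefore a scaled copy of the source, which is the form an instantaneous mixing model expects.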
  • the machine learning module 150 can use a generative adversarial network (GAN), a training technique for DNN applications. A GAN can therefore be used in the context of audio BSS.
  • a GAN consists of a generator that produces fake data and a discriminator that estimates the probability that its input comes from the real data rather than from the generator.
  • the generator and discriminator can be trained simultaneously in a minimax two-player game so that they compete and ideally reach an equilibrium (e.g., a Nash equilibrium).
  • at equilibrium, the generator can recover the training data distribution and the discriminator will output 0.5 as the probability everywhere.
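The value of 0.5 follows from the standard GAN objective. The notation below (value function V, generator distribution p_g, noise prior p_z) is the usual textbook notation and is not taken from this document.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]

% For a fixed generator, the optimal discriminator is
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

% so when the generator recovers the data distribution,
p_g = p_{\text{data}} \;\Longrightarrow\; D^*(x) = \tfrac{1}{2} \quad \text{for all } x
```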
  • the machine learning module 150 can be used to train the aforementioned CNN.
  • the CNN is trained prior to use.
  • the CNN is trained while in use and/or as the CNN is used.
  • a hearing aid including the sound acquiring computer 120 can be trained with live data during use.
  • an audio/video capture system including the sound acquiring computer 120 can be trained with previously recorded data prior to use.
  • Example implementations can implement a cost function with an ICA.
  • the cost function can utilize a Jacobian of the transformation between input and output.
  • the Jacobian can account for every operation that transforms input to output. In a deep learning setting, this means the Jacobian has to involve the operations in every layer of the DNN, which makes the cost function less than optimal for a deep neural network.
  • the cost function can be replaced with the discriminator output.
  • the implementation can include using an independence classifier as the discriminator.
  • the disclosed discriminator can be an independence classifier using a convolutional neural network to distinguish synthesized Laplacian sequences.
  • The Hilbert-Schmidt Independence Criterion (HSIC) uses the sum of the squared singular values of the cross-covariance in the RKHS as an independence criterion. The computational time of this technique depends on n, the size of the observed data. If a DNN based independence classifier can be realized, it would be much faster than HSIC because the neural network training can take place offline.
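For comparison with a learned classifier, a kernel independence measure can be sketched as follows. This uses the commonly cited biased empirical HSIC estimate, trace(KHLH)/(n-1)^2, with Gaussian kernels; the kernel choice, the bandwidth `sigma`, and the toy data are assumptions, not details from the text.

```python
import numpy as np

def gaussian_gram(v, sigma=1.0):
    """n x n Gaussian-kernel Gram matrix of a 1-D sample."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    K = gaussian_gram(x, sigma)
    L = gaussian_gram(y, sigma)
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.laplace(size=300)
y_indep = rng.laplace(size=300)                  # independent of x
y_dep = x ** 2 + 0.1 * rng.laplace(size=300)     # dependent on x, yet weakly correlated
h_indep = hsic(x, y_indep)
h_dep = hsic(x, y_dep)
```

The estimate builds n-by-n Gram matrices for every pair of signals it tests, which is the per-query cost a pre-trained classifier would avoid.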
  • the zero correlation can be used to indicate independence.
  • the neural network can learn the correlation formula as the output.
  • a convolutional kernel is a learnable filter which computes the dot product between itself and the input volume it covers. As a filter is shared by all the convolutional outputs (e.g., receptive fields), this helps reduce the number of parameters in a neural network. It has been shown that with the identity skip connection, a deep neural network can have better accuracy, faster convergence and less overfitting. This is due to the effect created by a clean information path of identity mapping. Therefore, in an implementation, identity skip connections can be used in the discriminator.
  • the architecture of the independence classifier includes six layers 715-1 to 715-6 (other than the input layer 705 and output layer 710). This architecture is shown in FIG. 7.
  • the first four layers 715-1 to 715-4 are the convolutional layers to build relationships between input data. Except for the first layer 715-1, they can be expressed as:
  • $x_l$ and $x_{l+1}$ are the input and output of the l-th layer
  • $x_{l-2}$ is the output of the (l-2)th layer
  • $x_0$ denotes the output of the input layer, which is the input itself.
  • $*$ is the convolution operation.
  • $W_{l1}$ and $W_{l2}$ are the kernel weights.
  • $x_{l-1}$ can be used as the identity mapping
  • one activation function can be a tanh function
  • the other activation function can be a sigmoid function.
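The layer expression itself does not appear in the text, so the following is only a sketch under an assumed gated form: a tanh branch multiplied element-wise by a sigmoid gate, plus an identity skip from an earlier layer. The function `gated_layer`, the 1-D single-channel convolution, and the kernel length of 5 are hypothetical simplifications.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_same(x, w):
    """1-D convolution that keeps the input length."""
    return np.convolve(x, w, mode="same")

def gated_layer(x_l, w_1, w_2, skip=None):
    """Assumed gated convolution: tanh branch times sigmoid gate, plus optional skip."""
    out = np.tanh(conv_same(x_l, w_1)) * sigmoid(conv_same(x_l, w_2))
    if skip is not None:        # identity mapping from an earlier layer
        out = out + skip
    return out

rng = np.random.default_rng(0)
x0 = rng.laplace(size=256)                            # network input
w = [rng.normal(scale=0.1, size=(2, 5)) for _ in range(2)]
x1 = gated_layer(x0, w[0][0], w[0][1])                # first layer: no identity mapping
x2 = gated_layer(x1, w[1][0], w[1][1], skip=x0)       # later layer: identity skip added
```

With all-zero kernels the gated branch vanishes and the layer passes the skip input through unchanged, which is the clean information path the identity mapping provides.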
  • Batch normalization can also be implemented in the independence classifier. This is to overcome the issue that the input distribution for each layer changes constantly with the parameters of the previous layers (internal covariate shift). The impact of internal covariate shift can be reduced by slowing down the training with a lower learning rate and by initializing parameters more carefully. Batch normalization can address this issue more directly, without slowing down the training or carefully tuning the initialization parameters, by normalizing layer inputs to make their distributions consistent.
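A minimal sketch of the normalization step, assuming per-feature statistics computed over the batch axis and scalar scale and shift parameters (`gamma`, `beta`); running averages for inference are omitted.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch axis, then scale and shift."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=5.0, size=(1000, 32))   # shifted, scaled layer inputs
normed = batch_norm(batch)                                # zero mean, unit variance per feature
```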
  • for the first layer 715-1, the only difference from the other convolutional layers is that there is no identity mapping.
  • the fifth layer 715-5 is a max-pooling layer to extract relationships and reduce the dimensionality.
  • the sixth layer 715-6 is a fully connected layer to learn a function of the extracted relationships. The dropout technique is applied to this fully connected layer to reduce the overfitting.
  • the last layer 710 is the output layer for classification.
  • the activation functions for the fifth 715-5 and sixth layer 715-6 are both functions configured to minimize the nonlinear correlation between audio signals (e.g., sigmoid functions).
  • Label 1 can indicate dependence and label 0 can indicate independence.
  • a cost function that has a minimal value when the output is the same as the correct label can be used.
  • One possible function is a negative log function.
  • y can be the final output, and $\epsilon$ can be 1e-8 to avoid numerical issues.
  • $\text{cost} = -\log(1 - y + 2yl - l + \epsilon)$, where $l$ is the label of the signal inputs.
  • a threshold can be set equal to 0.5, as y should be between 0 and 1. If y is above 0.5, the classified label is dependence, and if y is less than or equal to 0.5, the classified label is independence.
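The cost and the decision rule can be written out as a short sketch. The function names `cost` and `classify` are illustrative. For label 1 the argument of the logarithm reduces to y, and for label 0 it reduces to 1 - y, so the expression matches binary cross-entropy.

```python
import numpy as np

EPS = 1e-8  # keeps the logarithm finite when its argument would be zero

def cost(y, label):
    """-log(1 - y + 2*y*label - label + eps), with label 1 = dependence, 0 = independence."""
    return -np.log(1.0 - y + 2.0 * y * label - label + EPS)

def classify(y, threshold=0.5):
    """Return 1 (dependence) above the threshold, 0 (independence) otherwise."""
    return int(y > threshold)
```

For example, `cost(0.9, 1)` is small because the output agrees with the label, while `cost(0.9, 0)` is large.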
  • the kernel used by the convolution operation can be set to dimension [100, 2], such that the kernel has height 100 and width 2. There can be 32 kernels used for each convolution.
  • the fifth layer 715-5 down-samples the data by a factor of 2.
  • the input of the sixth layer 715-6 is reshaped to [-1, 32], where -1 indicates that this dimension is assigned dynamically according to the data size.
  • the weight matrix in the sixth layer 715-6 has the dimension [32, 64].
  • the rate of dropout can be 0.5.
  • the output layer 710 summarizes all inputs to one single output, with a weight matrix of dimension [64, 1].
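The shapes of the final layers can be traced with a small sketch. It assumes random weights, no bias terms, and inverted-dropout rescaling, none of which are specified in the text; the `pooled` array is a made-up stand-in for the max-pooling output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
pooled = rng.normal(size=(10, 4, 8))          # hypothetical pooled features, 320 values
flat = pooled.reshape(-1, 32)                 # -1 is inferred from the data size: (10, 32)
W6 = rng.normal(scale=0.1, size=(32, 64))     # sixth-layer weight matrix [32, 64]
W_out = rng.normal(scale=0.1, size=(64, 1))   # output-layer weight matrix [64, 1]

hidden = sigmoid(flat @ W6)                   # fully connected layer
keep = rng.random(hidden.shape) >= 0.5        # dropout mask with rate 0.5 (training only)
hidden = hidden * keep / 0.5                  # assumed inverted-dropout rescaling
y = sigmoid(hidden @ W_out)                   # one output in (0, 1) per row
```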
  • the learning rate can be 0.00001.
  • the batch size used for training can be 1000.
  • the number of training epochs can be 500.
  • the neural network training can have an early stop if the loss is less than 0.5.
  • Pseudo code for the independence classifier algorithm is shown in FIG. 5.
  • the mixing matrix module 160 can be configured to generate a mixing matrix in BSS.
  • techniques can include estimating the demixing matrices.
  • a demixing matrix is the inverse of the mixing matrix. Determining the inverse of a matrix is a straightforward linear algebraic process if the matrix is full-rank. Therefore, determined BSS can include estimating the mixing matrices.
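For the determined case, the relation between the mixing and demixing matrices can be shown directly. The matrix `A` and the Laplacian sources below are illustrative values only.

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=(2, 1000))           # two independent sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])    # illustrative full-rank mixing matrix
x = A @ s                                 # observed mixtures

W = np.linalg.inv(A)                      # demixing matrix is the inverse of A
s_hat = W @ x                             # recovered sources
```

Once the mixing matrix is estimated accurately, recovering the sources is a single matrix product.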
  • a DNN can be used for mixing matrix estimation.
  • a supervised learning technique can be used for matrix estimation.
  • Supervised learning can be a learning algorithm used to comprehend the data with guided support.
  • unsupervised learning can make sense of data without any extra guidance.
  • the guided support can be providing the mixing matrices as known information in the training process.
  • the mixing matrices are only unknown in the testing scenarios. Accordingly, the DNN model can learn the underlying influence of mixing matrices on a variety of given data points during training so that it could predict mixing matrices during use and/or testing.
  • At least one two by two mixing matrix F generated from standard normal distribution can be selected and a Frobenius bubble can be applied to the selected matrix.
  • a Frobenius bubble sets a limit to how close other mixing matrices can be positioned around the selected mixing matrix.
  • the distance between F and another mixing matrix M is calculated by the sum of squared error between their corresponding elements. If the distance is smaller than the limit, M is determined to be within the Frobenius bubble of F, which is too close.
  • mixing matrices outside the Frobenius bubble are used. This makes sure that other mixing matrices are sufficiently far away from F so that the classification task would not be influenced.
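The selection rule can be sketched as a rejection-sampling loop. The helper names, the `limit` value, and the number of matrices are hypothetical; only the distance definition (sum of squared element-wise differences) comes from the text.

```python
import numpy as np

def squared_distance(F, M):
    """Sum of squared differences between corresponding elements."""
    return float(np.sum((F - M) ** 2))

def sample_outside_bubble(F, limit, count, seed=0):
    """Draw standard-normal matrices, keeping only those outside the bubble of F."""
    rng = np.random.default_rng(seed)
    kept = []
    while len(kept) < count:
        M = rng.standard_normal(F.shape)
        if squared_distance(F, M) >= limit:   # a distance below the limit is too close
            kept.append(M)
    return kept

rng = np.random.default_rng(1)
F = rng.standard_normal((2, 2))               # selected two-by-two mixing matrix
others = sample_outside_bubble(F, limit=1.0, count=5)
```

Lowering `limit` admits matrices closer to `F`, which makes the classification task harder.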
  • In response to determining that the classification accuracy is good enough for a distance limit (e.g., accuracy is above or below a threshold), the limit can be reduced to make the task more challenging.
  • the goal is that the classification accuracy is reasonably good with an arbitrarily small limit.
  • the limit naturally becomes small when the dimensionality of the mixing matrix becomes larger.
  • the data can be pre-processed with mean subtraction and pre-whitening.
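A minimal sketch of this pre-processing, assuming the covariance is full rank and using symmetric (eigendecomposition-based) whitening, which is one common choice among several. The function name `prewhiten` is illustrative.

```python
import numpy as np

def prewhiten(x):
    """Subtract the mean, then transform to unit covariance (x: channels x samples)."""
    centered = x - x.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / centered.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Symmetric whitening matrix: cov^(-1/2), assuming all eigenvalues are positive.
    whitener = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return whitener @ centered

rng = np.random.default_rng(0)
mixed = np.array([[1.0, 0.8], [0.6, 1.0]]) @ rng.laplace(size=(2, 5000))
z = prewhiten(mixed)
cov_z = z @ z.T / z.shape[1]              # identity matrix up to rounding
```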
  • the mixing matrix module 160 can include eleven layers 815-1 to 815-11 (other than the input layer 805 and output layer 810). This architecture is shown in FIG. 8.
  • the first eight layers 815-1 to 815-8 are convolutional layers with gated activation as described above with regard to FIG. 7.
  • the convolution operation can be configured to filter signals. Filtering signals can aid with searching mixing matrix features when combined with the gated activation.
  • batch normalization in each layer can be used.
  • the cost function is, for example, the sum of squared error between the predicted mixing matrix and the real mixing matrix.
  • the cost function is the softmax cross-entropy function.
  • the softmax function can transform the output values into normalized probabilities that sum to one, and the cross-entropy function can measure the distance between the predicted probability distribution and the real probability distribution.
  • the special advantage of the softmax cross-entropy loss function over other loss functions such as sum of squared error is that it converges faster with reasonable training samples.
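The softmax cross-entropy loss can be sketched with its standard textbook definitions, which may differ in notation from the formulas of the disclosure. The logits and one-hot targets below are made-up values, with each class standing for one candidate mixing matrix.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets, eps=1e-12):
    """Mean cross-entropy between predicted rows and one-hot target rows."""
    return float(-np.mean(np.sum(targets * np.log(probs + eps), axis=1)))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])    # network outputs for two examples
targets = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])    # true classes, one-hot encoded
probs = softmax(logits)
loss = cross_entropy(probs, targets)
```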
  • the softmax cross-entropy function can be defined using
  • cross-entropy loss can be defined as:
  • the batch size used in neural network training can be 1000.
  • the learning rate can be 0.00001.
  • the convolutional kernel is of size [100, 2] .
  • 4 kernels can be used.
  • the number of training epochs for three variants can be 700, 500, 200 respectively.
  • the neural network training can stop early if the loss is less than 0.01. Pseudo code for generating the mixing matrix is shown in FIG. 6.
  • the different variants of the mixing matrix classifier can classify from a small set of mixing matrices if they are sufficiently far away from each other. Additional parameter searching may reduce the Frobenius distance between mixing matrices. The Frobenius distance may also be reduced by using a larger block of data or using a larger mixing matrix.
  • the CICA technique can be utilized in many applications such as hearing aid kits, augmented reality headsets, multichannel audio encoding and the like.
  • the CICA technique may be able to separate signals in a more efficient and robust manner.
  • the independence classifier may change the techniques used to measure independence. A successful independence classifier can be fast and accurate offline.
  • the mixing matrix classifier with an arbitrarily small Frobenius radius can give quick response to BSS applications.
  • the mixing matrix classifier may be able to detect a change of emitting location of source signals in minimal time.
  • FIG. 9 shows an example of a computer device 900 and a mobile computer device 950, which may be used with the techniques described here.
  • Computing device 900 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers.
  • Computing device 950 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, and other similar computing devices.
  • the components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
  • Computing device 900 includes a processor 902, memory 904, a storage device 906, a high-speed interface 908 connecting to memory 904 and high-speed expansion ports 910, and a low speed interface 912 connecting to low speed bus 914 and storage device 906.
  • Each of the components 902, 904, 906, 908, 910, and 912, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 902 can process instructions for execution within the computing device 900, including instructions stored in the memory 904 or on the storage device 906 to display graphical information for a GUI on an external input/output device, such as display 916 coupled to high speed interface 908.
  • multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory.
  • multiple computing devices 900 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
  • the memory 904 stores information within the computing device 900.
  • the memory 904 is a volatile memory unit or units.
  • the memory 904 is a non-volatile memory unit or units.
  • the memory 904 may also be another form of computer-readable medium, such as a magnetic or optical disk.
  • the storage device 906 is capable of providing mass storage for the computing device 900.
  • the storage device 906 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations.
  • a computer program product can be tangibly embodied in an information carrier.
  • the computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 904, the storage device 906, or memory on processor 902.
  • the high speed controller 908 manages bandwidth-intensive operations for the computing device 900, while the low speed controller 912 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only.
  • the high-speed controller 908 is coupled to memory 904, display 916 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 910, which may accept various expansion cards (not shown).
  • low-speed controller 912 is coupled to storage device 906 and low-speed expansion port 914.
  • the low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
  • the computing device 900 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 920, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 924. In addition, it may be implemented in a personal computer such as a laptop computer 922. Alternatively, components from computing device 900 may be combined with other components in a mobile device (not shown), such as device 950. Each of such devices may contain one or more of computing device 900, 950, and an entire system may be made up of multiple computing devices 900, 950 communicating with each other.
  • Computing device 950 includes a processor 952, memory 964, an input/output device such as a display 954, a communication interface 966, and a transceiver 968, among other components.
  • the device 950 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage.
  • Each of the components 950, 952, 964, 954, 966, and 968 are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
  • the processor 952 can execute instructions within the computing device 950, including instructions stored in the memory 964.
  • the processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors.
  • the processor may provide, for example, for coordination of the other components of the device 950, such as control of user interfaces, applications run by device 950, and wireless communication by device 950.
  • Processor 952 may communicate with a user through control interface 958 and display interface 956 coupled to a display 954.
  • the display 954 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology.
  • the display interface 956 may comprise appropriate circuitry for driving the display 954 to present graphical and other information to a user.
  • the control interface 958 may receive commands from a user and convert them for submission to the processor 952.
  • an external interface 962 may be provided in communication with processor 952, to enable near area communication of device 950 with other devices.
  • External interface 962 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
  • the memory 964 stores information within the computing device 950.
  • the memory 964 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units.
  • Expansion memory 974 may also be provided and connected to device 950 through expansion interface 972, which may include, for example, a SIMM (Single In Line Memory Module) card interface.
  • expansion memory 974 may provide extra storage space for device 950, or may also store applications or other information for device 950.
  • expansion memory 974 may include instructions to carry out or supplement the processes described above, and may include secure information also.
  • expansion memory 974 may be provided as a security module for device 950, and may be programmed with instructions that permit secure use of device 950.
  • secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
  • the memory may include, for example, flash memory and/or NVRAM memory, as discussed below.
  • a computer program product is tangibly embodied in an information carrier.
  • the computer program product contains instructions that, when executed, perform one or more methods, such as those described above.
  • the information carrier is a computer- or machine-readable medium, such as the memory 964, expansion memory 974, or memory on processor 952, that may be received, for example, over transceiver 968 or external interface 962.
  • Device 950 may communicate wirelessly through communication interface 966.
  • Communication interface 966 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 968. In addition, short-range communication may occur, such as using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 970 may provide additional navigation- and location-related wireless data to device 950, which may be used as appropriate by applications running on device 950.
  • Device 950 may also communicate audibly using audio codec 960, which may receive spoken information from a user and convert it to usable digital information. Audio codec 960 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 950. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 950.
  • the computing device 950 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 980. It may also be implemented as part of a smart phone 982, personal digital assistant, or other similar mobile device.
  • Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
  • Various implementations of the systems and techniques described here can be realized as and/or generally be referred to herein as a circuit, a module, a block, or a system that can combine software and hardware aspects.
  • a module may include the functions/acts/computer program instructions executing on a processor (e.g., a processor formed on a silicon substrate, a GaAs substrate, and the like) or some other programmable data processing apparatus.
  • Methods discussed above may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof.
  • the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium.
  • a processor(s) may perform the necessary tasks.
  • references to acts and symbolic representations of operations that may be implemented as program modules or functional processes include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types and may be described and/or implemented using existing hardware at existing structural elements.
  • Such existing hardware may include one or more Central Processing Units (CPUs), digital signal processors (DSPs), application-specific-integrated-circuits, field programmable gate arrays (FPGAs), computers, or the like.
  • the software implemented aspects of the example embodiments are typically encoded on some form of non-transitory program storage medium or implemented over some type of transmission medium.
  • the program storage medium may be magnetic (e.g., a floppy disk or a hard drive) or optical (e.g., a compact disk read only memory, or CD ROM), and may be read only or random access.
  • the transmission medium may be twisted wire pairs, coaxial cable, optical fiber, or some other suitable transmission medium known to the art. The example embodiments are not limited by these aspects of any given implementation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A device includes a sound acquisition manager configured to receive a mixed audio signal including a first plurality of audio signals, an independent component analysis manager configured to determine a set of parameters configured to generate a second plurality of audio signals based on the first plurality of audio signals, and to minimize a correlation between pairs of signals of the converted second plurality of audio signals, and a memory configured to store the second plurality of audio signals as multi-channel audio data.

Description

NEURAL NETWORK BASED BLIND SOURCE SEPARATION
RELATED APPLICATION
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/583,141, filed on November 8, 2017, entitled "NEURAL NETWORK BASED BLIND SOURCE SEPARATION", the contents of which are incorporated in their entirety herein by reference.
BACKGROUND
[0002] Blind Source Separation (BSS) has been applied in audio signal processing. BSS is capable of extracting the most probable original independent audio source signals when the only available information is the observed audio signal mixture. Because the prior information is limited, estimating the original audio signals is a challenging task. In audio BSS, a signal can be considered as a function of time and a source is a physical object creating a signal. If only one signal mixture is recorded, this can be categorized as a single-channel audio signal. On the other hand, multi-channel audio signals involve multiple signal mixtures.
[0003] Source signals can be mixed in different manners. Nonlinear mixing can be more prevalent in image processing. For audio BSS techniques, linear mixing generally is an accurate description of the physical mixing process. There are two categories of linear mixtures. The first category considers the delayed (convolutive) contribution of source signals, which is common in telecommunications or in reverberant environment. The second category can assume that the signals are instantaneously mixed at any given time. Short-time Fourier transforms can be used to convert convolutive mixing to a close approximation of instantaneous mixing.
[0004] The inter-dependencies between different channels and the intra-dependencies within channels are often complex. Thus it is difficult to separate the multi-channel signal mixture into its underlying source signals by using Independent Component Analysis (ICA) and Non-negative Matrix Factorization (NMF) techniques. As a result, a fast and robust multi-channel audio BSS technique that can be applied effectively in a real-time environment has not been described or implemented.
SUMMARY
[0005] Example implementations can identify independent audio source signals included in a mixed audio signal by mapping the signals in the mixed audio signal to a set of audio signals and minimizing correlation between pairs of signals in the set of audio signals.
[0006] In a general aspect a device includes a sound acquisition manager configured to receive a mixed audio signal including a first plurality of audio signals, an independent component analysis manager configured by a set of parameters to generate a second plurality of audio signals based on the first plurality of audio signals, and to minimize, by adjusting said configuring parameters, a correlation between pairs of signals of the converted second plurality of audio signals, and a memory configured to store the second plurality of audio signals as multi-channel audio data.
[0007] In another general aspect a method and a non-transitory computer-readable storage medium having stored thereon computer executable program code which, when executed on a computer system, causes the computer system to perform steps. The steps include receiving a mixed audio signal including a first plurality of audio signals, and storing the identified independent audio source signals as multi-channel audio data, determining a set of parameters configured to generate a second plurality of audio signals based on the first plurality of audio signals, converting the second plurality of audio signals using a nonlinear function, minimizing a correlation between pairs of signals of the converted second plurality of audio signals and storing the second plurality of audio signals as multi-channel audio data.
[0008] Implementations can include one or more of the following features. For example, the plurality of audio signals can be treated as temporally independent, the plurality of audio signals can be identically distributed, and the plurality of audio signals can be non-Gaussian. The minimizing of the correlation includes measuring correlation using a contrast function. The nonlinear function is at least one sigmoid function. The minimizing of the correlation includes revising the set of parameters. The independent component analysis manager is further configured to apply a pre-whitening function to the first plurality of audio signals and the pre-whitening transforms the first plurality of audio signals into a data set having unit covariance. The set of parameters are elements of a separating matrix. The determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically. The set of parameters can be selected to reduce the correlation of pairs of converted second plurality audio signals. The set of parameters can be determined using backpropagation and stochastic gradient descent.
[0009] For example, the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically, an output of the convolutional neural network is used as an input to an activation function, and the activation function is configured to allow the convolutional neural network to implement a non-linear process. The determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically, a loss function for the convolutional neural network is a contrast function (e.g., a correlation between the second plurality of signal components), the separating matrix is an orthogonal matrix including the set of parameters as weights connecting input neurons and output neurons of the convolutional neural network.
[0010] In yet another general aspect a method and a non-transitory computer-readable storage medium having stored thereon computer executable program code which, when executed on a computer system, causes the computer system to perform steps. The steps include receiving a mixed audio signal including a first plurality of audio signals, using a convolutional neural network to numerically estimate a separating matrix for the first plurality of audio signals, applying at least one activation function including a sigmoid function to an output of the convolutional neural network and minimizing a correlation between pairs of signals generated using the at least one activation function.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] Example embodiments will become more fully understood from the detailed description given herein below and the accompanying drawings, wherein like elements are represented by like reference numerals, which are given by way of illustration only and thus are not limiting of the example embodiments and wherein:
[0012] FIG. 1 illustrates a block diagram of an example system according to at least one example embodiment.
[0013] FIG. 2 illustrates a block diagram of a method for identifying independent audio source signals according to at least one example embodiment.
[0014] FIG. 3 illustrates pseudo code for a correlation independent component analysis (CICA) algorithm according to at least one example embodiment.
[0015] FIG. 4 illustrates a block diagram of a neural network architecture according to at least one example embodiment.
[0016] FIG. 5 illustrates pseudo code for an independence classifier algorithm according to at least one example embodiment.
[0017] FIG. 6 illustrates pseudo code for generating a mixing matrix according to at least one example embodiment.
[0018] FIG. 7 illustrates a flow diagram for an architecture for an independence classifier according to at least one example embodiment.
[0019] FIG. 8 illustrates a flow diagram for an architecture for generating a mixing matrix according to at least one example embodiment.
[0020] FIG. 9 shows an example of a computer device and a mobile computer device according to at least one example embodiment.
[0021] It should be noted that these Figures are intended to illustrate the general characteristics of methods, structure and/or materials utilized in certain example embodiments and to supplement the written description provided below. These drawings are not, however, to scale and may not precisely reflect the precise structural or performance characteristics of any given embodiment, and should not be interpreted as defining or limiting the range of values or properties encompassed by example embodiments. For example, the relative thicknesses and positioning of molecules, layers, regions and/or structural elements may be reduced or exaggerated for clarity. The use of similar or identical reference numbers in the various drawings is intended to indicate the presence of a similar or identical element or feature.
DETAILED DESCRIPTION
[0022] A neural network based approach that can implement complex mappings to perform a multi-channel audio BSS technique, and that can be applied in a real-time environment, is described herein. Multi-channel audio can include more information than single-channel audio because independent signals (within multi-channel audio) can be distinguishable from one another. Implementations perform BSS using a neural network based ICA technique. In addition, techniques can use a deep neural network (DNN) model to distinguish independent signals and/or a DNN model to classify signal mixtures according to their mixing matrices.
[0023] FIG. 1 is a block diagram that illustrates an example system 100 in which the improved techniques described herein can be implemented. The system 100 can include a sound acquiring computer 120 that is configured to acquire a mixed audio signal (e.g., a sound field detected using at least one microphone) and separate signal sources (e.g., sources of sound waves) within the mixed audio signal. The sound acquiring computer 120 can include at least one interface 122, at least one processing unit 124, and memory 126.
[0024] The at least one interface 122 can include, for example, an audio adaptor configured to convert analog audio signals received from at least one detector into an electronic form for use by the sound acquiring computer 120. The at least one detector can be a device configured to capture sound and communicate a signal representing the captured sound. For example, the at least one detector can be a microphone, a piezoelectric device, a fiber optic sensor, a microphone chip or silicon microphone, and/or the like. The at least one interface 122 can include, for example, Ethernet adaptors, Token Ring adaptors, etc., for converting electronic and/or optical signals received from a network into an electronic form for use by the sound acquiring computer 120.
[0025] The at least one processing unit 124 can include one or more processing chips and/or assemblies. The memory 126 can include both volatile memory (e.g., RAM) and non-volatile memory, such as one or more ROMs, disk drives, solid state drives, and the like. The memory 126 can include a non-transitory computer readable storage medium. The at least one processing unit 124 and the memory 126 together form control circuitry, which is configured and arranged to carry out various methods and functions as described herein.
[0026] In some implementations, one or more of the components of the sound acquiring computer 120 can include processors (e.g., at least one processing unit 124) configured to process instructions stored in the memory 126. Examples of such instructions include a sound acquisition manager 130, an independent component analysis (ICA) manager 140, a non-negative matrix factorization (NMF) manager 142, an autoencoder manager 144, a machine learning module 150, and a mixing matrix module 160. In addition, the memory 126 can be configured to store various data, which is described with respect to the respective managers that use such data.
[0027] The sound acquisition manager 130 can be configured to acquire a mixed audio signal on which a BSS technique can be applied, the output of which can be stored as multi-channel audio data 132. In some implementations, the mixed audio signal can correspond to audio data captured by at least one microphone. For example, a plurality of human speakers can speak for a period of time. The captured audio data can be communicated to the sound acquiring computer 120 and processed by the at least one interface 122. Processing can include, for example, analog to digital conversion at a desired sampling rate. For example, capturing the multi-channel audio data 132 can include detecting and communicating the speech of three English speakers for seven seconds and converting the audio signal using a sampling rate of 16 kHz.
[0028] The multi-channel audio data 132 can include data identifying independent audio source signals included in the acquired mixed audio signal. The independent audio source signals can be compressed and stored and/or communicated to another device. For example, the independent audio source signals can be stored and/or communicated with a corresponding video file. The sound acquiring computer 120 can be an element of a hearing aid device. Therefore, the independent audio source signals can be filtered based on a user's desire to listen to a signal source and disregard other signal sources. The sound acquiring computer 120 can be an element of an augmented reality device. Therefore, the independent audio source signals can be separated and communicated to a speaker to generate, for example, a stereo listening effect.
[0029] The ICA manager 140 can be configured to implement an ICA technique. For example, the ICA manager 140 can be configured to implement a correlation ICA (CICA) technique. There can be many estimates for the target sources. This is because by acquiring the multi-channel audio signal only, there could be a myriad of combinations of mixing matrices and source signals that can produce the same multi-channel audio signal. Different diversity or mixture (e.g., audio signal mixture) assumptions may be used in determining models for BSS techniques.
[0030] An assumption regarding the source signals can be used in BSS techniques. The assumption is that the source signals can be considered as temporally independent (e.g., a first person's recorded voice is not dependent on an initiation in time or a duration of time when compared to a second person's recorded voice) and identically distributed (IID) but non-Gaussian. In statistical analysis, identically distributed indicates that a data distribution doesn't fluctuate and the data is taken from the same probability distribution. Independent indicates that the source of data is initiated as independent events. Non-Gaussian indicates that the data distribution is not Gaussian. Therefore, using the assumption, for data associated with a mixed audio signal where the sources are two people speaking (e.g., in a room), a first person's recorded voice is independent in time and source independent when compared to a second person's recorded voice. Further, the data for each source (e.g., the first person and the second person) associated with the mixed audio signal has a non-Gaussian distribution that doesn't fluctuate and is taken from the same probability distribution.
[0031] The assumption leads to using ICA techniques for BSS that are based on or use higher order statistics (HOS), such as FastICA, Infomax ICA and Kernel ICA (each described below in more detail). The independence assumption for audio sources can be reasonable given long enough excerpt duration in the time domain. When the excerpt duration is below, for example, 20ms, the independence assumption does not hold. But when the excerpt duration is over, for example, 100ms, the independence assumption is valid. This means ICA techniques may not separate signal mixtures immediately but with an acceptable delay of 100ms.
[0032] In audio applications the sources are typically independent. Therefore, example implementations can use ICA techniques for audio BSS. There are several, non-exhaustive, advantages of using ICA. First, the working mechanism of ICA can be defined using signal processing and statistics. Second, in many ICA algorithms, the uniqueness of the solution can be guaranteed due to global convergence. Third, ICA can be implemented on non-negative signals and mixed-sign signals. Further, in multi-channel BSS, ICA can benefit from having more sensors (e.g., microphones) than source signals.
[0033] A contrast function can be an optimization criterion the global optima of which correspond to a separation of all sources. The contrast function should be able to consistently estimate how separate the outputs are. Contrast functions based on HOS can be used in temporally IID and non-Gaussian distributions.
[0034] In an example implementation, a correlation measurement can be used as a contrast function in ICA to measure output independence. This technique can be called a correlation ICA (CICA) technique (e.g., as implemented in the ICA manager 140). FIG. 2 illustrates a block diagram of a method for identifying independent audio source signals according to at least one example embodiment. As shown in FIG. 2, in step S205 a mixed audio signal including a plurality of audio signals (e.g., a first plurality of audio signals) is received. For example, the mixed audio signal is received by the sound acquisition manager 130 via the at least one interface 122.
[0035] In step S210 a separating matrix, including a set of parameters, is determined (and/or selected) for the mixed audio signal and in step S215 another plurality of audio signals (or second plurality of audio signals) is generated using the separating matrix.
[0036] Using the CICA technique, if two random variables Y1 and Y2 are independent, their correlation equals zero (0). Further, if the correlation between f(Y1) and f(Y2) is zero for any mapping function f(·) from ℝ to ℝ, Y1 and Y2 can be determined to be mutually independent.
[0037] Searching for a solution (e.g., finding each independent audio source) in the entire function space of f(·) may be difficult (e.g., time consuming, processor intensive, memory intensive, and the like). According to an example implementation, one or more predetermined functions may be used and this may include the identity function, the latter resulting in a linear correlation of the two random variables Y1 and Y2.
[0038] In step S220 the another plurality of audio signals (or second plurality of audio signals) is converted using a nonlinear function (or a plurality of nonlinear functions). For example, at least one sigmoid function can be applied to the another plurality of audio signals because the sigmoid projection can be a sufficient nonlinear function for measuring correlation in BSS. Other functions that can be used to minimize the correlation can include (but are not limited to) a logistic function (which a sigmoid function is a subset of), a Heaviside function, a Tanh function, an ArcTan function and/or the like. The function configured to minimize the nonlinear correlation (e.g., the sigmoid function) alone may not replace the function space f(·). Therefore, in an implementation at least one constraint (e.g., limit nonlinear correlation to zero (0)) can be implemented in the CICA technique.
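The nonlinear correlation measurement of steps S220 and S225 can be illustrated with a minimal sketch. It assumes a logistic sigmoid as the nonlinear function and the sum of absolute pairwise Pearson correlations as the measurement; the function names and the synthetic Laplace-distributed test signals are hypothetical and are not taken from this disclosure.

```python
import numpy as np

def sigmoid(y):
    """Logistic sigmoid applied element-wise (assumed nonlinear function)."""
    return 1.0 / (1.0 + np.exp(-y))

def pairwise_abs_correlation(signals, f=sigmoid):
    """Sum of |corr(f(Y_i), f(Y_j))| over all channel pairs i < j.

    signals: array of shape (channels, samples).
    """
    converted = f(signals)
    corr = np.corrcoef(converted)                 # (channels, channels)
    upper = np.triu_indices(corr.shape[0], k=1)   # each pair counted once
    return float(np.sum(np.abs(corr[upper])))

# Hypothetical data: two independent non-Gaussian sources and a mixture of them.
rng = np.random.default_rng(0)
independent = rng.laplace(size=(2, 20000))
mixed = np.array([[1.0, 0.6], [0.4, 1.0]]) @ independent

score_independent = pairwise_abs_correlation(independent)
score_mixed = pairwise_abs_correlation(mixed)
```

Under these assumptions the score of the independent signals is close to zero while the score of the mixed signals is clearly larger, which is the property a correlation based contrast function relies on.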
[0039] Then, in step S225 the set of parameters is revised to minimize a correlation between pairs of signals in the converted plurality of audio signals. Minimizing correlation can include repeating steps S210-S220 with a revised set of parameters. Minimizing correlation can include backpropagation of a stochastic gradient obtained for a batch or epoch of data to obtain a revised set of parameters. Minimizing correlation can include determining a threshold number of signal pairs have a correlation equal to zero (0). In step S230 in response to minimizing the correlation, independent audio source signals included in the mixed audio signal are identified as the another plurality of audio signals.
[0040] In an example implementation, a second-order statistics (SOS) technique used in BSS called pre-whitening can help with linear correlation minimization. Pre-whitening can transform the received audio signals $X \in \mathbb{R}^{M \times N}$ into another set of signals which has unit covariance. M is the dimensionality of the observed signals, which is equal to the number of sources, and N is the observed data length in each dimension. To simplify the pre-whitening, mean subtraction for X can be performed before pre-whitening. Pre-whitening can include multiplying X with a whitening matrix W so that if $\tilde{X} = WX$, then $\tilde{X}\tilde{X}^T = I$.
[0041] In an example implementation, the whitening matrix is selected as follows. Let $X = AS$, where X represents the observed signals, A represents the mixing matrix and S represents the source audio signals. The covariance matrix $C_X = XX^T$ and the covariance matrix $C_S = SS^T$. Therefore, $C_X = A C_S A^T$. Without loss of generality, assume $C_S = I$ due to the indetermination of scale in the BSS technique.
[0042] Accordingly, $C_X = AA^T$. Using singular value decomposition, $A = UDQ^T$, where U and Q are orthonormal matrices and D is a diagonal matrix. Substituting the decomposition of A into $C_X$ gives $C_X = UD^2U^T$. The whitening matrix can be defined as $W = D^{-1}U^T$. Accordingly, $\tilde{X}\tilde{X}^T = I$ and the separating matrix $B = A^{-1}UD$ is also orthonormal.
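The pre-whitening described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the covariance is estimated as the sample average over N (the text writes the covariance without that normalization), it obtains U and D from an eigendecomposition of the estimated covariance, and the function name `prewhiten` and the example mixing matrix are hypothetical.

```python
import numpy as np

def prewhiten(x):
    """Whiten observed signals x of shape (M, N).

    Returns the whitened signals and the whitening matrix W = D^{-1} U^T,
    where the estimated covariance is C_X = U D^2 U^T.
    """
    x_centered = x - x.mean(axis=1, keepdims=True)   # mean subtraction
    n = x_centered.shape[1]
    c_x = x_centered @ x_centered.T / n              # sample covariance (assumed 1/N)
    eigvals, u = np.linalg.eigh(c_x)                 # eigvals are the entries of D^2
    w = np.diag(1.0 / np.sqrt(eigvals)) @ u.T        # W = D^{-1} U^T
    return w @ x_centered, w

# Hypothetical example: three Laplace sources mixed by a fixed matrix.
rng = np.random.default_rng(0)
s = rng.laplace(size=(3, 5000))
a = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.6, 1.0]])
x_white, w = prewhiten(a @ s)
cov_white = x_white @ x_white.T / x_white.shape[1]
```

Here `cov_white` is the identity matrix up to floating point error, so any remaining separating matrix only needs to be searched among orthonormal matrices.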
[0043] Pre-whitening can help convert the CICA technique into determining an optimally conditioned orthonormal separating matrix. This can reduce the possible choices for a separating matrix. If pre-whitening is coupled with the nonlinear correlation properly, this combination can effectively and efficiently perform audio BSS tasks by finding the correct or best separating matrix.
[0044] The combination of pre-whitening and CICA can determine independence in two stages. In the first stage, pre-whitening minimizes the linear correlation to zero. In the second stage, a second operator is selected to minimize one or more nonlinear correlations. The second operator can be a separating matrix B. The separating matrix B can be selected to minimize a nonlinear correlation that is a linear correlation after a nonlinear function (e.g., a sigmoid nonlinear function) subject to a constraint that keeps the linear correlation to zero. The constraint can be that the separating matrix B should be orthogonal.
[0045] A convolutional neural network can be used to estimate the separating matrix B numerically. The separating matrix can form a filter in the convolutional neural network and the nonlinear function is the nonlinear activation function of the network. By using a contrast function as the loss function, a neural network can perform BSS tasks with backpropagation through stochastic gradient descent. The orthogonal matrix to be determined can consist of the weights connecting input and output neurons. Pseudo code for the CICA algorithm is shown in FIG. 3.
[0046] The CICA algorithm can be implemented using a convolutional neural network consisting of input and output neurons. Example architectures can use any number of input and output neurons so long as input and output neurons are equal in size. In an example implementation, three input and three output neurons can be used.
[0047] FIG. 4 illustrates an architecture of the CICA neural network according to an example implementation. As shown in FIG. 4, the CICA neural network architecture 400 includes three input neurons 405 where X1, X2, and X3 represent the three channels of observed signal mixtures. The CICA neural network architecture 400 includes three output neurons 410 where Y1, Y2, and Y3 represent the demixed signals in three channels. Further, the CICA neural network architecture 400 includes three function neurons 415 where f represents an activation function (e.g., a sigmoid function). The input neurons 405 and the output neurons 410 are fully connected.
[0048] The activation function is configured to define the output behavior of the corresponding output neuron 410. The output of the activation function can be used as the output of the neural network and/or used as the input for another neuron. The activation function can introduce non-linear properties to the neural network allowing the neural network to learn from complicated, non-linear mappings between inputs and response variables. The activation function allows the CICA neural network architecture 400 to implement a non-linear process.
[0049] The CICA neural network architecture 400 is shown as including a plurality of tiers 420. Each tier 420 includes the neural network shown in tier 1. However, the activation function f can be the same function or a different function in each tier 420. In an example implementation, the output of the function neurons 415 of tier 1 can be the input to the input neurons 405 of tier 2, which can repeat until tier n, in which the output of the function neurons 415 (in tier n) is the output of the neural network. In another example implementation, the audio signals can be input to the input neurons 405 of each tier 1 to n and the output of the function neurons 415 of each tier 1 to n is the output of the neural network. In this implementation, the equivalent output of each tier can be summed together as the output of the CICA neural network architecture 400.
[0050] The contrast function consists of four components: a correlation measurement C, a weights orthogonality measurement O, a regularization term Q for balancing energy in different demixed channels and a regularization term V to ensure that a minimum variance of the demixed signals is maintained. This can be written as an equation for the neural network loss: [Figure imgf000012_0010]. The correlation measurement C is the sum of pairwise correlation between output signals (the correlation formula [Figure imgf000012_0001] is described below).
[0051] B can denote a weight matrix connecting inputs X ([Figure imgf000012_0002] and [Figure imgf000012_0003]) and outputs Y (e.g., Y1, Y2, and Y3). This relationship can define that Y = XB. In this implementation, each column of X and Y can define a signal channel (e.g., in the multi-channel audio signal). The weights orthogonality measurement O can be calculated as [Figure imgf000012_0004] [Figure imgf000012_0005], where sum(·) returns the sum of all the elements in the functional argument. This encourages the inner product between columns of B to be zero.
[0052] B1, B2 and B3 can denote the three columns of B. The energy balancing term Q can be calculated by [Figure imgf000012_0006], which can help cause the absolute sum of each column of the weight matrix to be close to 1.
[0053] The minimum variance of the observed signal channel can be set to v. The variances of the three demixed signal channels can be [Figure imgf000012_0008] and [Figure imgf000012_0009]. The variance regularization term V can be calculated as [Figure imgf000012_0007], where relu(·) yields a nonzero value only if the functional argument is positive. The term V can help cause variances of the demixed signal channels to be above the minimum variance.
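The four components of the contrast function can be sketched in code. The exact formulas are not reproduced in this text, so every formula below is an assumption chosen only to match the verbal descriptions: C as the sum of absolute pairwise correlations after a sigmoid, O as the sum of absolute off-diagonal entries of BᵀB, Q as the deviation of each column's absolute sum from 1, V as relu(v minus channel variance), and an unweighted sum as the loss. The name `cica_loss_terms` is hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cica_loss_terms(x, b, v_min):
    """Assumed forms of the four contrast function terms.

    x: observed signals, shape (N, M), one channel per column.
    b: weight matrix, shape (M, M), so that Y = X B.
    v_min: minimum variance v for the demixed channels.
    """
    y = x @ b
    # C: sum of absolute pairwise correlations between activated outputs.
    corr = np.corrcoef(sigmoid(y), rowvar=False)
    upper = np.triu_indices(b.shape[1], k=1)
    c = np.sum(np.abs(corr[upper]))
    # O: inner products between distinct columns of B should be zero.
    gram = b.T @ b
    o = np.sum(np.abs(gram - np.diag(np.diag(gram))))
    # Q: absolute sum of each column of B should be close to 1.
    q = np.sum(np.abs(np.sum(np.abs(b), axis=0) - 1.0))
    # V: penalize demixed channels whose variance falls below v_min.
    v = np.sum(np.maximum(v_min - np.var(y, axis=0), 0.0))
    return {"C": c, "O": o, "Q": q, "V": v, "loss": c + o + q + v}

# Hypothetical check: independent inputs and an identity weight matrix.
rng = np.random.default_rng(0)
x_demo = rng.laplace(size=(4000, 3))
terms = cica_loss_terms(x_demo, np.eye(3), v_min=0.5)
```

With independent inputs and an identity weight matrix, the three regularization terms vanish and only a small residual correlation term remains, which is the behavior the verbal descriptions suggest for a well separated output.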
[0054] According to an example implementation, the batch size used in neural network training should correspond to one quarter of a second of speech. For example, if the sampling rate used in the dataset is 16 kHz, the batch size should be 4000. The weights and biases in the neural network are initialized from a standard normal distribution. The sum of each column of the weight matrix can be constrained to be 1.
[0055] There can be two choices for the learning rate. The first choice can be to fix the learning rate to a constant value (e.g., 0.0001). This way the neural network can converge, but it may be slow. The second choice is to use a dynamic learning rate. The starting learning rate can be 0.001. Then the learning rate can be updated by multiplying it by 0.1 every 14000 steps. Using this implementation, the neural network can converge much faster (as compared to the fixed rate). However, using the updated learning rate approach, the neural network may suffer more oscillation. The number of epochs for training can be set to, for example, 5000. But the training can stop early if there is no loss improvement observed. This early stop condition may be checked, for example, every 300 epochs.
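The batch size and the dynamic learning rate described above amount to simple arithmetic, sketched below. The function names are hypothetical; the default values are the example values given in the text.

```python
def batch_size_for(sampling_rate_hz, seconds=0.25):
    """Number of samples covering the given duration of speech."""
    return int(round(sampling_rate_hz * seconds))

def stepped_learning_rate(step, initial=0.001, factor=0.1, interval=14000):
    """Learning rate multiplied by `factor` once every `interval` steps."""
    return initial * factor ** (step // interval)

batch_size = batch_size_for(16000)        # one quarter of a second at 16 kHz
rate_start = stepped_learning_rate(0)
rate_after_first_decay = stepped_learning_rate(14000)
```

At 16 kHz the batch size evaluates to 4000 samples, and the learning rate stays at 0.001 for the first 14000 steps before dropping by a factor of ten.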
[0056] The ICA manager 140 can be configured to implement an ICA technique. For example, the ICA manager 140 can be configured to implement a Fast ICA technique. Fast ICA is a linear transformation method for non-Gaussian data in order to separate statistically independent components from the observed mixture. There are two assumptions about the source signals. The assumptions are that the source signals are non-Gaussian and statistically independent. Both assumptions are typically satisfied with multi-channel audio.
[0057] The non-Gaussianity assumption is important in Fast ICA because Gaussian variables may not be informative enough. In the joint density distribution of independent Gaussian variables, no information can be provided for the mixing matrix. The only exception is that if only one source signal is normally distributed, the independent components can still be estimated. Fast ICA can take advantage of the property that the sum of independent variables is more Gaussian than any single one of the variables. By doing the reverse process of making the sum (e.g., signal mixture) less Gaussian than before, source separation can be achieved. Therefore, by using a contrast function to measure the Gaussianity, Fast ICA amounts to an optimization problem that maximizes the non-Gaussianity of the estimated original source signal.
[0058] The optimization can be performed by finding a direction that maximizes non-Gaussianity and projecting the data onto that direction. The next projected direction should be found in the orthogonal space of the previous direction to avoid repetition. This is called projection pursuit. The process is repeated until the number of projected directions is equal to the number of estimated independent source signals.
[0059] In an implementation, Fast ICA can be a fixed-point iteration algorithm, exploiting that the demixing vector has zero derivative at a local optimum. The iteration updating rules are dependent on the choice of the contrast function. A kurtosis function and a negentropy function can be selected as the contrast functions. Slowly growing functions can give better performance than kurtosis, as they have smaller asymptotic variance and are more robust to sample outliers. An example of such a function is G(u) = log(cosh(au)), where a > 1 is a constant.
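The fixed-point iteration and the projection pursuit (deflation) described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the commonly published one-unit fixed-point update w ← E[z g(wᵀz)] − E[g′(wᵀz)] w with g the derivative of G(u) = log(cosh(au)), operates on whitened data, and uses a = 1.5 as an arbitrary example value. The function names and the synthetic sources are hypothetical.

```python
import numpy as np

def whiten(x):
    """Center x (M, N) and transform it to unit sample covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    eigvals, u = np.linalg.eigh(x @ x.T / x.shape[1])
    return (np.diag(eigvals ** -0.5) @ u.T) @ x

def fastica_deflation(z, n_components, a=1.5, max_iter=200, tol=1e-8, seed=0):
    """Estimate orthonormal demixing vectors one at a time.

    z: whitened data of shape (M, N). Returns an array (n_components, M).
    """
    rng = np.random.default_rng(seed)
    m = z.shape[0]
    w_all = np.zeros((n_components, m))
    for k in range(n_components):
        w = rng.standard_normal(m)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            u = w @ z
            g = a * np.tanh(a * u)                        # G'(u)
            g_prime = a * a * (1.0 - np.tanh(a * u) ** 2)  # G''(u)
            w_new = (z * g).mean(axis=1) - g_prime.mean() * w
            # Stay in the space orthogonal to previously found directions.
            w_new -= w_all[:k].T @ (w_all[:k] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol
            w = w_new
            if converged:
                break
        w_all[k] = w
    return w_all

# Hypothetical example: one Laplace and one uniform source, linearly mixed.
rng = np.random.default_rng(1)
sources = np.vstack([rng.laplace(size=20000), rng.uniform(-1, 1, size=20000)])
mixtures = np.array([[1.0, 0.5], [0.3, 1.0]]) @ sources
z_white = whiten(mixtures)
w_est = fastica_deflation(z_white, n_components=2)
estimated_sources = w_est @ z_white
```

Because each new direction is orthogonalized against the earlier ones and then normalized, the rows of `w_est` form an orthonormal set by construction.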
[0060] The ICA manager 140 can be configured to implement an ICA technique. For example, the ICA manager 140 can be configured to implement an Infomax ICA technique. In this technique, the neural network can have no hidden layers, so that only input and output layers exist. The Infomax ICA takes advantage of the nonlinear transfer function of a neural network to maximize the mutual information between inputs and outputs. By doing so, statistically independent components can be separated from the signal mixture without assuming any knowledge of the input distributions.
[0061] X and Y can be the input and output of a neural network. The mutual information between X and Y is $I(Y, X) = H(Y) - H(Y \mid X)$, where $H(Y)$ is the differential entropy of the output and $H(Y \mid X)$ is the conditional differential entropy of the output given the input. In this context, $H(Y \mid X)$ is the differential entropy of the output component that is independent of the input. A neural network can be regarded as a deterministic function, so $H(Y \mid X)$ diverges to negative infinity, which can be avoided by taking the derivative of $I(Y, X)$ with respect to the weight parameter w. $H(Y \mid X)$ disappears as it is not influenced by w. This gives the equation $\frac{\partial}{\partial w} I(Y, X) = \frac{\partial}{\partial w} H(Y)$, suggesting that maximizing mutual information between input and output is equivalent to maximizing the entropy of the output.
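[0062] In the case of one input and one output, the probability density function of the output is $f_Y(y) = f_X(x) / \left|\partial y / \partial x\right|$. The entropy of the output can thus be expressed as
$$H(Y) = -E[\ln f_Y(y)] = E\left[\ln\left|\frac{\partial y}{\partial x}\right|\right] - E[\ln f_X(x)].$$
To maximize H(Y) by changing w, determine the stochastic gradient ascent updating rule:
$$\Delta w \propto \frac{\partial H(Y)}{\partial w} = \frac{\partial}{\partial w}\ln\left|\frac{\partial y}{\partial x}\right| = \left(\frac{\partial y}{\partial x}\right)^{-1}\frac{\partial}{\partial w}\left(\frac{\partial y}{\partial x}\right).$$
In the case of multiple inputs and outputs, change $\left|\partial y / \partial x\right|$ to the Jacobian |J| and everything else follows. In this way, the neural network weight parameter w can be updated to maximize $H(Y)$.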
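The Jacobian based update can be made concrete for one common case. The sketch below assumes a single-layer network y = sigmoid(Wx) with a logistic transfer function and no bias term, for which the gradient of ln|J| with respect to W has the closed form (Wᵀ)⁻¹ + (1 − 2y)xᵀ known from the Infomax literature. The choice of the logistic function, the omission of the bias and all names are assumptions for illustration.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def log_abs_jacobian(w, x):
    """ln|J| for y = sigmoid(W x): ln|det W| + sum_i ln(y_i (1 - y_i))."""
    y = sigmoid(w @ x)
    return np.log(abs(np.linalg.det(w))) + np.sum(np.log(y * (1.0 - y)))

def infomax_gradient(w, x):
    """Closed-form gradient of ln|J| with respect to W for one sample x."""
    y = sigmoid(w @ x)
    return np.linalg.inv(w.T) + np.outer(1.0 - 2.0 * y, x)

def infomax_step(w, x, learning_rate=0.01):
    """One stochastic gradient ascent step on the output entropy."""
    return w + learning_rate * infomax_gradient(w, x)

# Hypothetical weights and input sample.
w_demo = np.array([[1.0, 0.2], [-0.3, 0.8]])
x_demo = np.array([0.5, -1.0])
w_next = infomax_step(w_demo, x_demo)
```

Since the input density does not depend on W, ascending ln|J| sample by sample is a stochastic ascent on the output entropy; a small enough step therefore increases `log_abs_jacobian` for the sample used.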
[0063] To show that the maximization of $H(Y)$ can lead to the extraction of independent components of X, consider a system with outputs $Y_1$ and $Y_2$. The joint entropy of the outputs can be shown as: $H(Y_1, Y_2) = H(Y_1) + H(Y_2) - I(Y_1, Y_2)$. This equation suggests that maximizing the joint entropy of the outputs is equivalent to minimizing the mutual information between output variables, given a constraint on the signal power. When $I(Y_1, Y_2)$ is zero, $Y_1$ and $Y_2$ become statistically independent.
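The relation between joint entropy and mutual information can be checked numerically. The sketch below uses discrete distributions as a stand-in for the differential entropies in the text, which is an assumption made only to keep the example exact; the function names and probability tables are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """I(Y1, Y2) as the KL divergence between the joint and the product of marginals."""
    joint = np.asarray(joint, dtype=float)
    product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

independent_joint = np.outer([0.3, 0.7], [0.6, 0.4])   # Y1 and Y2 independent
dependent_joint = np.array([[0.4, 0.1], [0.1, 0.4]])   # Y1 and Y2 dependent

def joint_entropy_gap(joint):
    """H(Y1) + H(Y2) - H(Y1, Y2), which should equal I(Y1, Y2)."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)
```

For the independent table the mutual information is zero and the joint entropy equals the sum of the marginal entropies; for the dependent table the joint entropy is smaller by exactly the mutual information.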
[0064] The Infomax ICA approach comes up with a way to deal with mutual information without getting into the pitfalls of the distribution probabilities of data. Instead, the Jacobian between inputs and outputs can be differentiated with regard to the weight parameter to derive the desired independent components. This is used in a neural network when the Jacobian only accounts for one layer of transformation.
[0065] The ICA manager 140 can be configured to implement an ICA technique. For example, the ICA manager 140 can be configured to implement a Kernel ICA technique. The Kernel ICA uses a contrast function based on canonical correlation in Reproducing Kernel Hilbert Space (RKHS). If two signals are independent, their correlation is zero. The converse may not be true in Euclidean space. But in RKHS, the converse can also be true. This reduces the independence problem to the zero-correlation problem, which is statistically simpler. Kernel ICA utilizes this property of RKHS and minimizes the maximal correlation between projected signals to extract independent components.
[0066] Canonical Correlation Analysis (CCA) maximizes the correlation between projections of distributions of two or more random vectors. Consider two random vectors $y_1 \in \mathbb{R}^{p}$ and [Figure imgf000015_0003]. Their first canonical correlation is:
[Figure imgf000015_0002]
[Figure imgf000016_0001]
where $C_{ij} = \operatorname{cov}(y_i, y_j)$. Taking derivatives with respect to $\zeta_1$ and $\zeta_2$ results in:
[Figure imgf000016_0002]
Normalizing by letting [Figure imgf000016_0006] and [Figure imgf000016_0007], the CCA can be transformed to the generalized eigenvalue problem:
[Figure imgf000016_0003]
where $\rho$ is short for [Figure imgf000016_0012]. An alternative form for Equation 4 is
[Figure imgf000016_0004]
which can be written as
[Figure imgf000016_0005]
[0067] Equations 5 and 6 indicate the eigenvalues appear in pairs: [Figure imgf000016_0008] [Figure imgf000016_0009]. Next determine the maximal generalized eigenvalue: [Figure imgf000016_0010]. This is equivalent to finding the minimal generalized eigenvalue: [Figure imgf000016_0011], which is bounded between zero and one, making it easier for computation.
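The link between the first canonical correlation and a minimal generalized eigenvalue can be sketched numerically. The sketch assumes the standard CCA formulation in which the full covariance matrix C and its block-diagonal part D satisfy Cξ = λDξ with eigenvalues of the form 1 ± ρ, so that the minimal eigenvalue is 1 − ρ for the largest ρ. This form is an assumption, since the equations are not reproduced in this text, and the function name is hypothetical.

```python
import numpy as np

def first_canonical_correlation(y1, y2):
    """First canonical correlation of y1 (p1, N) and y2 (p2, N).

    Solves C xi = lambda D xi, where C is the joint covariance and D keeps
    only the within-vector blocks, and returns 1 - lambda_min.
    """
    p1 = y1.shape[0]
    c = np.cov(np.vstack([y1, y2]))
    d = np.zeros_like(c)
    d[:p1, :p1] = c[:p1, :p1]
    d[p1:, p1:] = c[p1:, p1:]
    # Reduce to a symmetric problem with the Cholesky factor of D.
    l_inv = np.linalg.inv(np.linalg.cholesky(d))
    lambda_min = np.linalg.eigvalsh(l_inv @ c @ l_inv.T)[0]
    return 1.0 - lambda_min, lambda_min

# Hypothetical one-dimensional example, where the result must equal |corr|.
rng = np.random.default_rng(0)
y1_demo = rng.standard_normal((1, 5000))
y2_demo = 0.5 * y1_demo + rng.standard_normal((1, 5000))
rho_demo, lambda_min_demo = first_canonical_correlation(y1_demo, y2_demo)
pearson_demo = abs(np.corrcoef(y1_demo[0], y2_demo[0])[0, 1])
```

For one-dimensional inputs the first canonical correlation reduces to the absolute Pearson correlation, which gives a simple consistency check, and the minimal eigenvalue stays between zero and one as stated above.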
[0068] Let $\mathcal{F}$ be an RKHS based on Gaussian kernels. Let the projections in CCA be projections from $\mathbb{R}$ to $\mathcal{F}$ [Figure imgf000016_0015]. Then define [Figure imgf000016_0016] the $\mathcal{F}$-correlation as the maximal correlation between the random variables [Figure imgf000016_0014] and [Figure imgf000016_0013]:
[Figure imgf000017_0001]
[0069] In RKHS, $\rho_{\mathcal{F}} = 0$ would indicate that $y_1$ and $y_2$ are independent. This leads to the contrast function
Figure imgf000017_0003
[0070] By the Reproducing Property of RKHS:
[Figure imgf000017_0002]
where $\Phi(x)$ is the feature map and $K(\cdot, x)$ is a function in $\mathcal{F}$ for each x. Let f be [Figure imgf000017_0009], then [Figure imgf000017_0010]
[Figure imgf000017_0004]
Equation 7 can be written as
[Figure imgf000017_0005]
where [Figure imgf000017_0006] is the maximal possible correlation between one-dimensional linear projections of [Figure imgf000017_0007] and [Figure imgf000017_0008]. This is equivalent to finding the first canonical correlation between [Figure imgf000018_0007] and [Figure imgf000018_0008], which suggests computing an ICA contrast function based on the computation of a canonical correlation in function space.
[0071] To calculate the empirical [Figure imgf000018_0009], first calculate the empirical covariances and variances in RKHS:
[Figure imgf000018_0001]
where $K_1$ [Figure imgf000018_0002] and $K_2$ are the Gram matrices associated with the data sets [Figure imgf000018_0006]. A Gram matrix K has entries [Figure imgf000018_0005]. Similarly for empirical variance:
[Figure imgf000018_0003]
[0072] Equation 11 can be approximated using equations 12, 13 and 14 as:
Figure imgf000018_0004
[0073] Using equation 15, the canonical correlation can be calculated between signals in RKHS, and a zero-valued $\rho_{\mathcal{F}}$ implies independence. This Kernel ICA approach is robust to nearly Gaussian sources, outliers and nonlinear relationships.
[0074] The NMF manager 142 can be configured to implement an NMF technique. NMF can be based on parts-based object perception. It is a way to factorize a matrix with non-negative constraints. A matrix is factorized into two components, the entries of which are all non-negative. In the frequency domain, this constraint is compatible with the intuition that the spectrograms of audio components can be added together to form an audio mixture spectrogram, since all entries in a spectrogram are non-negative.
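A non-negative factorization of this kind can be sketched with multiplicative updates. The text does not name a specific NMF algorithm, so the use of the widely published multiplicative update rules for the Frobenius norm objective is an assumption, as are the function name and the synthetic non-negative matrix standing in for a spectrogram.

```python
import numpy as np

def nmf(v, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix v (F, T) as w @ h with w, h >= 0.

    Returns w (F, rank), h (rank, T) and the reconstruction error per iteration.
    """
    rng = np.random.default_rng(seed)
    w = rng.random((v.shape[0], rank)) + eps
    h = rng.random((rank, v.shape[1])) + eps
    errors = [np.linalg.norm(v - w @ h)]
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
        errors.append(np.linalg.norm(v - w @ h))
    return w, h, errors

# Hypothetical "spectrogram": a sum of two non-negative rank-one patterns.
rng = np.random.default_rng(1)
v_demo = rng.random((20, 2)) @ rng.random((2, 30))
w_nmf, h_nmf, nmf_errors = nmf(v_demo, rank=2)
```

Each column of `w_nmf` can be read as a spectral pattern and each row of `h_nmf` as its activation over time, which mirrors the additive spectrogram intuition described above.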
[0075] The standard NMF setting can be suitable for single-channel BSS. The matrix factorization becomes the decomposition of data spectrogram into a sum of low rank spectrograms. Each elementary spectrogram can represent a pattern in time. Simply stacking up spectrograms to form a larger matrix for decomposition may not optimally exploit the redundancy information between multiple channels. NMF can be extended into a multichannel BSS technique to implement convolutional mixtures in an optimal manner.
[0076] The autoencoder manager 144 can be configured to implement an autoencoder and/or non-negative autoencoder (NAE) technique. NAE is an extension of normal autoencoders. Autoencoders are unsupervised neural networks that aim to minimize the deviation between inputs and outputs. Multi-layer autoencoders can be realized by stacking up compositional autoencoders vertically. In a NAE model, the autoencoders are trained with the constraint that none of their weights can be negative. This is equivalent to performing NMF on input signals, where the encoder and decoder weight matrices play the roles of the two non-negative factorized components. NAE is built on the basis of NMF; therefore, NAE can inherit the computational complexity of NMF.
[0077] The CICA technique described in this disclosure can utilize portions and/or elements of FastICA, Infomax ICA and Kernel ICA. For example, the CICA technique can utilize the orthogonal projection utilized in FastICA, the function configured to minimize the nonlinear correlation (e.g., sigmoid function) employed in Infomax ICA and the correlation measurement involved in Kernel ICA.
[0078] The disclosed CICA technique can outperform the known techniques used for demixing determined linear instantaneously mixed signals. The disclosed CICA technique can generate an improved signal separation at a faster speed. The improvement includes the use of an SOS based contrast function in temporally IID non-Gaussian signals. Moreover, the disclosed CICA technique can use a contrast function that maximizes independence by minimizing linear and nonlinear correlations.
[0079] As the BSS methods for instantaneous mixtures are generally stepping stones for convolutional mixtures, the CICA technique can be extended to signal mixtures with memory. This extension can be done by transforming signal mixtures to the time-frequency domain. The convolution then becomes multiplication in the time-frequency domain so that the convolutional mixtures can behave very similarly to instantaneous mixtures.
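The statement that convolution becomes multiplication after the transform can be checked numerically. The sketch below is illustrative only and assumes a single frame with circular convolution, for which the identity is exact; with a short-time Fourier transform over windowed frames the relationship typically holds only approximately. The signal and the impulse response are made-up values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
source = rng.laplace(size=n)

# Hypothetical short impulse response, zero-padded to the frame length.
impulse = np.zeros(n)
impulse[:4] = [1.0, 0.5, 0.25, 0.125]

# Circular convolution computed directly in the time domain.
time_domain = np.array(
    [sum(impulse[k] * source[(t - k) % n] for k in range(n)) for t in range(n)]
)

# The same result via a per-frequency multiplication.
freq_domain = np.fft.ifft(np.fft.fft(impulse) * np.fft.fft(source)).real

print(np.allclose(time_domain, freq_domain))
```

In each frequency bin the mixture is the source coefficient scaled by a single complex gain, which is the instantaneous-mixture form.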
[0080] The machine learning module 150 can use a generative adversarial network (GAN), a training technique for DNN applications. Therefore, a GAN can be used in the context of audio BSS. A GAN consists of a generator producing fake data and a discriminator estimating the probability that its inputs are from the real data rather than from the generator. The generator and discriminator can be trained simultaneously in a min/max two-player game so that they compete and ideally reach an equilibrium (e.g., a Nash equilibrium). In an implementation, the generator can recover the training data distribution and the discriminator will output 0.5 as the probability everywhere.
[0081] The machine learning module 150 can be used to train the aforementioned CNN. In an example implementation, the CNN is trained prior to use. In another example implementation the CNN is trained while in use and/or as the CNN is used. For example, a hearing aid including the sound acquiring computer 120 can be trained with live data during use. For example, an audio/video capture system including the sound acquiring computer 120 can be trained with previously recorded data prior to use.
[0082] Example implementations can implement a cost function with an ICA. The cost function can utilize a Jacobian of the transformation between input and output. The Jacobian can account for every operation that transforms input to output. In a deep learning setting, this means that the Jacobian has to involve the operations in every layer of the DNN, which makes the cost function less than optimal in a deep neural network. In an example implementation, the cost function can be replaced with the discriminator output. The implementation can include using an independence classifier as the discriminator. In other words, the disclosed discriminator can be an independence classifier using a convolutional neural network to distinguish synthesized Laplacian sequences.
[0083] The Hilbert-Schmidt Independence Criterion (HSIC) uses the sum of the squared singular values of the cross-covariance in the RKHS as an independence criterion. This technique can require a computational time of $O(n^2)$, where n indicates the size of the observed data. If a DNN based independence classifier can be realized, it would be much faster than HSIC because the neural network training can take place offline.
[0084] As discussed above, zero correlation can be used to indicate independence. Considering the neural network as a mapping from the input signals to the RKHS, the neural network can learn the correlation formula
$$\operatorname{corr}(y_1, y_2) = \frac{\operatorname{cov}(y_1, y_2)}{\sqrt{\operatorname{var}(y_1)}\,\sqrt{\operatorname{var}(y_2)}}$$
as the output, where $y_1$ and $y_2$ are the estimated signals. Mean subtraction and pre-whitening can be used in the pre-processing stage of the data inputs. Consequently, both $\operatorname{var}(y_1)$ and $\operatorname{var}(y_2)$ can be set to 1. The correlation formula now is
$$\operatorname{corr}(y_1, y_2) = \operatorname{cov}(y_1, y_2) = E[y_1 y_2].$$
This indicates that the neural network can learn the covariance between the data inputs. As a neural network can approximate any function, it is possible to learn the covariance operation. In the context of ICA, assume the observed signals $(x_1, x_2)$ are instantaneously mixed from the original source signals $s$ by the mixing vectors $a_1, a_2$ (the rows of the mixing matrix), so that $x_i = a_i^T s$. For zero-mean, unit-variance, uncorrelated sources, their correlation can be expressed as:
$$E[x_1 x_2] = E[a_1^T s\, s^T a_2] = a_1^T E[s s^T]\, a_2 = a_1^T a_2.$$
[0085] This indicates that the correlation between observed signals is an inner product between mixing vectors. Similarly, the correlation between estimated signals can be regarded as the inner product between mixing vectors in RKHS. Therefore, a Convolutional Neural Network (CNN) can be used as an architecture for the discriminator. This is because the audio signals have a temporal relationship. Each data point is related to its predecessors and successors to some extent. CNN can deal with this relationship by using a convolutional kernel to convolve an area of data points.
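The inner-product relationship between the mixing vectors can be illustrated numerically. The following sketch assumes zero-mean, unit-variance, independent Laplacian sources and an arbitrary example mixing matrix `A`; both are illustrative assumptions and not values from this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

# Laplace variance is 2 * scale**2, so this scale gives unit variance.
sources = rng.laplace(scale=1.0 / np.sqrt(2.0), size=(2, n_samples))

# Hypothetical mixing matrix; its rows are the mixing vectors a1 and a2.
A = np.array([[1.0, 0.5],
              [-0.3, 0.8]])
observed = A @ sources

empirical = float(np.mean(observed[0] * observed[1]))
predicted = float(A[0] @ A[1])  # inner product of the two mixing vectors
print(empirical, predicted)
```

With enough samples the empirical value approaches the inner product of the two rows of `A`.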
[0086] A convolutional kernel is a learnable filter which computes the dot product between itself and the input volume it covers. As a filter is shared by all the convolutional outputs (e.g., receptive fields), this helps reduce the number of parameters in a neural network. It has been shown that with the identity skip connection, a deep neural network can have better accuracy, faster convergence and less overfitting. This is due to the effect created by a clean information path of identity mapping. Therefore, in an implementation, identity skip connections can be used in the discriminator.
[0087] The architecture of the independence classifier includes six layers 715-1 to 715-6 (other than the input layer 705 and output layer 710). This architecture is shown in FIG. 7. The first four layers 715-1 to 715-4 are the convolutional layers to build relationships between input data. Except for the first layer 715-1, they can be expressed as:
Figure imgf000022_0002
where $x_l$ and $x_{l+1}$ are the input and output of the l-th layer, and $x_{l-2}$ is the output of the (l-2)-th layer. $x_0$ denotes the output of the input layer, which is the input itself. $*$ is the convolutional operation. $W_{l1}$ and $W_{l2}$ are the kernel weights. $x_{l-1}$ can be used as the identity mapping, $f_2$ can be a tanh function and $f_1$ can be a sigmoid function.
[0088] As shown in FIG. 7, there are two activations in each convolutional layer, one through $f_2$ and the other through $f_1$. A Hadamard product between the outputs of the two activations can be calculated as the final output of the convolutional layer. This gated activation implementation can enhance the convolutional layer for multiplicative relationships, which improves independence measurements.
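The gated activation can be sketched for a one-dimensional signal as follows. This is a simplified illustration under several assumptions: a single channel, one kernel per branch, "same" padding via `np.convolve`, and an identity skip that is added after the Hadamard product. How the skip is combined and which earlier output feeds it are assumptions here. The function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_conv_layer(x, w_tanh, w_sigmoid, skip=None):
    """Two convolutions, two activations, then an element-wise product."""
    tanh_branch = np.tanh(np.convolve(x, w_tanh, mode="same"))
    gate_branch = sigmoid(np.convolve(x, w_sigmoid, mode="same"))
    out = tanh_branch * gate_branch  # Hadamard product of the activations
    if skip is not None:
        out = out + skip  # assumed placement of the identity mapping
    return out


rng = np.random.default_rng(0)
x0 = rng.standard_normal(256)
weights = [(0.1 * rng.standard_normal(5), 0.1 * rng.standard_normal(5))
           for _ in range(2)]

x1 = gated_conv_layer(x0, *weights[0])           # first layer: no skip
x2 = gated_conv_layer(x1, *weights[1], skip=x1)  # later layer: with skip
print(x1.shape, x2.shape)
```

Because the gate lies strictly between 0 and 1 and the tanh branch strictly between -1 and 1, the product without the skip is bounded in magnitude by 1.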
[0089] Batch normalization can also be implemented in the independence classifier. This is to overcome the issue that the input distribution for each layer is constantly changing with the parameters of previous layers. The impact of internal covariate shift can be minimized by slowing down the training with lower learning rate and by initializing parameters more carefully. Batch normalization can address this issue more straightforwardly without slowing down the training speed or carefully tuning the initialization parameters. Batch normalization can normalize layer inputs to make their distributions consistent.
[0090] The only difference between the first layer 715-1 and the other convolutional layers is that the first layer has no identity mapping. The fifth layer 715-5 is a max-pooling layer to extract relationships and reduce the dimensionality. The sixth layer 715-6 is a fully connected layer to learn a function of the extracted relationships. The dropout technique is applied to this fully connected layer to reduce overfitting. The last layer 710 is the output layer for classification. The activation functions for the fifth 715-5 and sixth 715-6 layers are both functions configured to minimize the nonlinear correlation between audio signals (e.g., sigmoid functions).
[0091] There are two classification labels, 0 and 1. Label 1 can indicate dependence and label 0 can indicate independence. A cost function that has a minimal value when the output is the same as the correct label can be used. One possible function is a negative log function. Let y be the final output and let ε be 1e-8 to avoid numerical issues. When the label is 1, the cost function can be cost = -log(y + ε), which is minimized as the output approaches 1. When the label is 0, the cost function can be cost = -log(1 - y + ε), which is minimized as the output approaches 0. Combining the two cost functions results in cost = -log(1 - y + 2*y*l - l + ε), where l is the label of the signal inputs. To classify independence and dependence from y, a threshold can be set equal to 0.5, as y should be between 0 and 1. If y is above 0.5, the classified label is dependence, and if y is less than or equal to 0.5, the classified label is independence.
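The combined cost and the thresholding rule can be written out directly. The sketch below is a plain NumPy illustration; the function names `independence_cost` and `classify` are hypothetical, and the small constant follows the value 1e-8 mentioned above.

```python
import numpy as np

EPS = 1e-8  # small constant to avoid log(0)


def independence_cost(y, label, eps=EPS):
    """Combined negative-log cost for a classifier output y in [0, 1]."""
    return -np.log(1.0 - y + 2.0 * y * label - label + eps)


def classify(y, threshold=0.5):
    """Return 1 (dependence) above the threshold, else 0 (independence)."""
    return 1 if y > threshold else 0


y = 0.9
cost_if_dependent = independence_cost(y, label=1)    # equals -log(y + eps)
cost_if_independent = independence_cost(y, label=0)  # equals -log(1 - y + eps)
print(cost_if_dependent, cost_if_independent, classify(y))
```

Substituting the label 1 or 0 into the combined expression recovers the two separate cost functions, so an output of 0.9 is cheap under label 1 and expensive under label 0.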
[0092] In an example implementation, the kernel used by the convolution operation can be set to dimension [100, 2], such that the kernel has height 100 and width 2. There can be 32 kernels used for each convolution. The fifth layer 715-5 down-samples the data by a factor of 2. The input of the sixth layer 715-6 is reshaped to [-1, 32], where -1 indicates that this dimension is assigned dynamically according to the data size. The weight matrix in the sixth layer 715-6 has the dimension [32, 64]. The rate of dropout can be 0.5. The output layer 710 summarizes all inputs to one single output, with a weight matrix of dimension [64, 1]. The learning rate can be 0.00001. The batch size used for training can be 1000. The number of training epochs can be 500. The neural network training can stop early if the loss is less than 0.5. Pseudo code for the independence classifier algorithm is shown in FIG. 5.
[0093] The mixing matrix module 160 can be configured to generate a mixing matrix in BSS. In the context of linear BSS, techniques can include estimating the demixing matrices. A demixing matrix is the inverse of the mixing matrix. Determining the inverse of a matrix is a straightforward linear algebraic process if the matrix is full-rank. Therefore, determined BSS can include estimating the mixing matrices.
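The relationship between the mixing and demixing matrices can be illustrated with a small determined example. This sketch assumes a known, full-rank two by two mixing matrix and synthetic Laplacian sources; in an actual BSS setting the mixing matrix is unknown and has to be estimated first.

```python
import numpy as np

rng = np.random.default_rng(0)
sources = rng.laplace(size=(2, 1000))

# Hypothetical full-rank mixing matrix for a determined two-source case.
mixing = np.array([[1.0, 1.0],
                   [1.0, -1.0]])
observed = mixing @ sources

# A full-rank matrix has an inverse, which serves as the demixing matrix.
assert np.linalg.matrix_rank(mixing) == 2
demixing = np.linalg.inv(mixing)
recovered = demixing @ observed

print(np.allclose(recovered, sources))
```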
[0094] A DNN can be used for mixing matrix estimation. In an example implementation, a supervised learning technique can be used for matrix estimation. Supervised learning can be a learning algorithm used to comprehend the data with guided support. In contrast, unsupervised learning can make sense of data without any extra guidance. The guided support can be providing the mixing matrices as known information in the training process. In other words, the mixing matrices are only unknown in the testing scenarios. Accordingly, the DNN model can learn the underlying influence of mixing matrices on a variety of given data points during training so that it can predict mixing matrices during use and/or testing.
[0095] There can be many possible mixing matrices even if each matrix entry is bounded within a certain interval and the matrix size is fixed. This means that there are infinitely many data labels for the DNN training. It can be difficult for a DNN model to learn infinitely many labels.
[0096] Therefore, implementations can start with a few representative labels for two by two mixing matrices. Let all matrix entries be either one or negative one. There are 16 such matrices. Because a mixing matrix must be invertible, implementations can include screening out the matrices that have no inverse. This can leave eight matrices. Moreover, no two matrices should have identical rows at the same row position in order to comply with the determined BSS condition. If two rows are identical, they provide identical information. This reduces the information available to distinguish between matrices. As a result, four matrices are used for labels. The dataset can be derived by generating signal mixtures from these four matrices.
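The counting argument above can be reproduced by enumeration. The sketch below confirms the 16 candidates and the eight invertible matrices. The final step is only one possible reading of the row rule: it greedily keeps a matrix if it shares no row, at the same row position, with a matrix already kept. The resulting four matrices are therefore an illustration and are not necessarily the four matrices used in this disclosure.

```python
import itertools

import numpy as np

# All 2x2 matrices whose entries are +1 or -1.
candidates = [np.array(entries).reshape(2, 2)
              for entries in itertools.product([1, -1], repeat=4)]


def determinant(m):
    """Exact integer determinant of a 2x2 integer matrix."""
    return int(m[0, 0] * m[1, 1] - m[0, 1] * m[1, 0])


invertible = [m for m in candidates if determinant(m) != 0]

# Assumed reading of the row rule: no shared row at the same position.
selected = []
for m in invertible:
    clash = any(np.array_equal(m[row], kept[row])
                for kept in selected for row in range(2))
    if not clash:
        selected.append(m)

print(len(candidates), len(invertible), len(selected))
```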
[0097] In addition, at least one two by two mixing matrix F generated from standard normal distribution can be selected and a Frobenius bubble can be applied to the selected matrix. A Frobenius bubble sets a limit to how close other mixing matrices can be positioned around the selected mixing matrix. The distance between F and another mixing matrix M is calculated by the sum of squared error between their corresponding elements. If the distance is smaller than the limit, M is determined to be within the Frobenius bubble of F, which is too close. For the training purpose, mixing matrices outside the Frobenius bubble are used. This makes sure that other mixing matrices are sufficiently far away from F so that the classification task would not be influenced.
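The Frobenius bubble test can be sketched as a simple rejection step. This is a minimal illustration in which the distance is the sum of squared element-wise differences, as described above. The function names, the limit value and the use of rejection sampling are assumptions made for the example.

```python
import numpy as np


def frobenius_distance(f, m):
    """Sum of squared differences between corresponding elements."""
    return float(np.sum((f - m) ** 2))


def sample_outside_bubble(f, limit, count, rng):
    """Draw standard-normal matrices, keeping those outside the bubble of f."""
    kept = []
    while len(kept) < count:
        m = rng.standard_normal(f.shape)
        if frobenius_distance(f, m) >= limit:  # smaller than limit: too close
            kept.append(m)
    return kept


rng = np.random.default_rng(0)
F = rng.standard_normal((2, 2))  # the selected mixing matrix
LIMIT = 1.0                      # hypothetical distance limit
others = sample_outside_bubble(F, LIMIT, count=5, rng=rng)
print([round(frobenius_distance(F, m), 3) for m in others])
```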
[0098] In response to determining the classification accuracy is good enough for a distance limit (e.g., accuracy is above or below a threshold), the limit can be reduced to make the task more challenging. The goal is that the classification accuracy is reasonably good with an arbitrarily small limit. The limit naturally becomes small when the dimensionality of the mixing matrix becomes larger. To make processing of the training simpler, the data can be pre-processed with mean subtraction and pre-whitening.
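The pre-processing step mentioned above can be sketched as follows. The example assumes whitening by an eigendecomposition of the sample covariance, which is one common choice; this disclosure does not state which whitening transform is used. The function name `preprocess` and the example mixing values are hypothetical.

```python
import numpy as np


def preprocess(x):
    """Mean subtraction followed by pre-whitening of channel-major data x."""
    centered = x - x.mean(axis=1, keepdims=True)
    cov = centered @ centered.T / centered.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Symmetric whitening matrix cov^(-1/2); assumes cov is full rank.
    whitening = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return whitening @ centered


rng = np.random.default_rng(0)
sources = rng.laplace(size=(2, 5000))
mixed = np.array([[1.0, 0.6], [0.4, 1.0]]) @ sources + 3.0  # offset and mixing
white = preprocess(mixed)
print(white.mean(axis=1), white @ white.T / white.shape[1])
```

After this step each channel has zero mean and unit variance, and the channels are uncorrelated.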
[0099] We implemented three variants of Mixing Matrix Classifiers. Strictly speaking, the first variant should not be called a classifier. It uses signals mixed by random matrices for training and tries to learn how to extract the mixing matrix from the observed signals. The second variant classifies signals mixed by one of the four fixed matrices. The four matrices are:
Figure imgf000025_0001
The third variant classifies whether the signal mixtures are mixed by a particular matrix or not. All the other mixing matrices are constrained to be outside the Frobenius bubble of the particular matrix with a radius r. These three variants can share substantially the same neural network architecture and parameters. Small differences between the variants are discussed below.
[00100] The architecture of the mixing matrix classifier of the mixing matrix module 160 can include eleven layers 815-1 to 815-11 (other than the input layer 805 and output layer 810). This architecture is shown in FIG. 8. The first eight layers 815-1 to 815-8 are convolutional layers with gated activation as described above with regard to FIG. 7. The convolution operation can be configured to filter signals. Filtering signals can aid with searching mixing matrix features when combined with the gated activation.
[00101] In some implementations, batch normalization in each layer can be used. There is no max-pooling layer in this classifier. Instead, three fully connected layers (e.g., layers 9, 10 and 11) can be used to extract useful features. For the first two variants, the number of neurons in the three fully connected layers can be 500, 250 and 4. For the third variant, the difference is that it has only 2 neurons in its last fully connected layer.
[00102] In the first variant implementation, the cost function is, for example, the sum of squared error between the predicted mixing matrix and the real mixing matrix. In the second and the third variant implementations, the cost function is the softmax cross-entropy function. There is a natural pairing between the softmax function and the cross-entropy function. This is because the softmax function can transform the output values into normalized probabilities that sum to one, and the cross-entropy function can measure the distance between the predicted probability distribution and the real probability distribution. The special advantage of the softmax cross-entropy loss function over other loss functions, such as the sum of squared error, is that it converges faster with reasonable training samples. Hence, we use the softmax cross-entropy function to evaluate the loss in the classification tasks in which the outputs can be interpreted as probabilities.
[00103] The softmax cross-entropy function can be defined as follows. Let $p_i$ represent the real probabilities and $z_i$ represent the classification outputs of the mixing matrix classifier. The softmax function can be written as:
$$q_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Subsequently, the cross-entropy loss can be defined as:
$$L = -\sum_i p_i \log q_i$$
The batch size used in neural network training can be 1000. The learning rate can be 0.00001. The convolutional kernel is of size [100, 2]. For each convolution, 4 kernels can be used. The number of training epochs for the three variants can be 700, 500 and 200, respectively. The neural network training can stop early if the loss is less than 0.01. Pseudo code for generating the mixing matrix is shown in FIG. 6.
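The pairing of the softmax function with the cross-entropy loss can be sketched in a few lines. This is an illustrative NumPy version with a four-class output, matching the four fixed matrices of the second variant. The function names, the example output values and the small constant inside the logarithm are assumptions.

```python
import numpy as np


def softmax(z):
    """Map raw classifier outputs to probabilities that sum to one."""
    shifted = z - np.max(z)  # subtracting the max improves numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between real probabilities p and predicted q."""
    return float(-np.sum(p * np.log(q + eps)))


outputs = np.array([2.0, 0.5, -1.0, 0.1])  # hypothetical classifier outputs
predicted = softmax(outputs)

true_label = np.array([1.0, 0.0, 0.0, 0.0])   # matches the largest output
wrong_label = np.array([0.0, 0.0, 1.0, 0.0])  # matches the smallest output
loss_correct = cross_entropy(true_label, predicted)
loss_wrong = cross_entropy(wrong_label, predicted)
print(predicted, loss_correct, loss_wrong)
```

The loss is small when the predicted distribution puts most of its mass on the labeled class and grows as that mass shrinks.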
[00104] The different variants of the mixing matrix classifier can classify from a small set of mixing matrices if they are sufficiently far away from each other. Additional parameter searching may reduce the Frobenius distance between mixing matrices. The Frobenius distance may also be reduced by using a larger block of data or using a larger mixing matrix.
[00105] The CICA technique can be utilized in many applications such as hearing aid kits, augmented reality headsets, multichannel audio encoding and the like. The CICA technique may be able to separate signals in a more efficient and robust manner. The independence classifier may change the techniques used to measure independence. A successful independence classifier can be fast and accurate offline. The mixing matrix classifier with an arbitrarily small Frobenius radius can give a quick response to BSS applications. The mixing matrix classifier may be able to detect a change of emitting location of source signals in minimal time.
[00106] FIG. 9 shows an example of a computer device 900 and a mobile computer device 950, which may be used with the techniques described here. Computing device 900 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Computing device 950 is intended to represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, and other similar computing devices. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.
[00107] Computing device 900 includes a processor 902, memory 904, a storage device 906, a high-speed interface 908 connecting to memory 904 and high-speed expansion ports 910, and a low speed interface 912 connecting to low speed bus 914 and storage device 906. Each of the components 902, 904, 906, 908, 910, and 912, are interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 902 can process instructions for execution within the computing device 900, including instructions stored in the memory 904 or on the storage device 906 to display graphical information for a GUI on an external input/output device, such as display 916 coupled to high speed interface 908. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 900 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
[00108] The memory 904 stores information within the computing device 900. In one implementation, the memory 904 is a volatile memory unit or units. In another implementation, the memory 904 is a non-volatile memory unit or units. The memory 904 may also be another form of computer-readable medium, such as a magnetic or optical disk.
[00109] The storage device 906 is capable of providing mass storage for the computing device 900. In one implementation, the storage device 906 may be or contain a computer-readable medium, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. A computer program product can be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 904, the storage device 906, or memory on processor 902.
[00110] The high speed controller 908 manages bandwidth-intensive operations for the computing device 900, while the low speed controller 912 manages lower bandwidth-intensive operations. Such allocation of functions is exemplary only. In one implementation, the high-speed controller 908 is coupled to memory 904, display 916 (e.g., through a graphics processor or accelerator), and to high-speed expansion ports 910, which may accept various expansion cards (not shown). In the implementation, low-speed controller 912 is coupled to storage device 906 and low-speed expansion port 914. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet) may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.
[00111] The computing device 900 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 920, or multiple times in a group of such servers. It may also be implemented as part of a rack server system 924. In addition, it may be implemented in a personal computer such as a laptop computer 922. Alternatively, components from computing device 900 may be combined with other components in a mobile device (not shown), such as device 950. Each of such devices may contain one or more of computing device 900, 950, and an entire system may be made up of multiple computing devices 900, 950 communicating with each other.
[00112] Computing device 950 includes a processor 952, memory 964, an input/output device such as a display 954, a communication interface 966, and a transceiver 968, among other components. The device 950 may also be provided with a storage device, such as a microdrive or other device, to provide additional storage. Each of the components 950, 952, 964, 954, 966, and 968, are interconnected using various buses, and several of the components may be mounted on a common motherboard or in other manners as appropriate.
[00113] The processor 952 can execute instructions within the computing device 950, including instructions stored in the memory 964. The processor may be implemented as a chipset of chips that include separate and multiple analog and digital processors. The processor may provide, for example, for coordination of the other components of the device 950, such as control of user interfaces, applications run by device 950, and wireless communication by device 950.
[00114] Processor 952 may communicate with a user through control interface 958 and display interface 956 coupled to a display 954. The display 954 may be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or an OLED (Organic Light Emitting Diode) display, or other appropriate display technology. The display interface 956 may comprise appropriate circuitry for driving the display 954 to present graphical and other information to a user. The control interface 958 may receive commands from a user and convert them for submission to the processor 952. In addition, an external interface 962 may be provided in communication with processor 952, to enable near area communication of device 950 with other devices. External interface 962 may provide, for example, for wired communication in some implementations, or for wireless communication in other implementations, and multiple interfaces may also be used.
[00115] The memory 964 stores information within the computing device 950. The memory 964 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units. Expansion memory 974 may also be provided and connected to device 950 through expansion interface 972, which may include, for example, a SIMM (Single In Line Memory Module) card interface. Such expansion memory 974 may provide extra storage space for device 950, or may also store applications or other information for device 950. Specifically, expansion memory 974 may include instructions to carry out or supplement the processes described above, and may include secure information also. Thus, for example, expansion memory 974 may be provided as a security module for device 950, and may be programmed with instructions that permit secure use of device 950. In addition, secure applications may be provided via the SIMM cards, along with additional information, such as placing identifying information on the SIMM card in a non-hackable manner.
[00116] The memory may include, for example, flash memory and/or NVRAM memory, as discussed below. In one implementation, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 964, expansion memory 974, or memory on processor 952, that may be received, for example, over transceiver 968 or external interface 962.
[00117] Device 950 may communicate wirelessly through communication interface 966, which may include digital signal processing circuitry where necessary. Communication interface 966 may provide for communications under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS, among others. Such communication may occur, for example, through radio-frequency transceiver 968. In addition, short-range communication may occur, such as using a Bluetooth, Wi-Fi, or other such transceiver (not shown). In addition, GPS (Global Positioning System) receiver module 970 may provide additional navigation- and location-related wireless data to device 950, which may be used as appropriate by applications running on device 950.
[00118] Device 950 may also communicate audibly using audio codec 960, which may receive spoken information from a user and convert it to usable digital information. Audio codec 960 may likewise generate audible sound for a user, such as through a speaker, e.g., in a handset of device 950. Such sound may include sound from voice telephone calls, may include recorded sound (e.g., voice messages, music files, etc.) and may also include sound generated by applications operating on device 950.
[00119] The computing device 950 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a cellular telephone 980. It may also be implemented as part of a smart phone 982, personal digital assistant, or other similar mobile device.
[00120] While example embodiments may include various modifications and alternative forms, embodiments thereof are shown above by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit example embodiments to the particular forms disclosed, but on the contrary, example embodiments are to cover all modifications, equivalents, and alternatives falling within the scope of the claims. Like numbers refer to like elements throughout the description of the figures.
[00121] Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Various implementations of the systems and techniques described here can be realized as and/or generally be referred to herein as a circuit, a module, a block, or a system that can combine software and hardware aspects. For example, a module may include the functions/acts/computer program instructions executing on a processor (e.g., a processor formed on a silicon substrate, a GaAs substrate, and the like) or some other programmable data processing apparatus.
[00122] Some of the above example embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe the operations as sequential processes, many of the operations may be performed in parallel, concurrently or simultaneously. In addition, the order of operations may be re-arranged. The processes may be terminated when their operations are completed, but may also have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, subprograms, etc.
[00123] Methods discussed above, some of which are illustrated by the flow charts, may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine or computer readable medium such as a storage medium. A processor(s) may perform the necessary tasks.
[00124] Specific structural and functional details disclosed herein are merely representative for purposes of describing example embodiments. Example embodiments may, however, be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.
[00125] It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of example embodiments. As used herein, the term and/or includes any and all combinations of one or more of the associated listed items.
[00126] It will be understood that when an element is referred to as being connected or coupled to another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being directly connected or directly coupled to another element, there are no intervening elements present. Other words used to describe the relationship between elements should be interpreted in a like fashion (e.g., between versus directly between, adjacent versus directly adjacent, etc.).
[00127] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments. As used herein, the singular forms a, an and the are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms comprises, comprising, includes and/or including, when used herein, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
[00128] It should also be noted that in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed concurrently or may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
[00129] Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which example embodiments belong. It will be further understood that terms, e.g., those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
[00130] Portions of the above example embodiments and corresponding detailed description are presented in terms of software, or algorithms and symbolic representations of operation on data bits within a computer memory. These descriptions and representations are the ones by which those of ordinary skill in the art effectively convey the substance of their work to others of ordinary skill in the art. An algorithm, as the term is used here, and as it is used generally, is conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of optical, electrical, or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
[00131] In the above illustrative embodiments, reference is made to acts and symbolic representations of operations (e.g., in the form of flowcharts) that may be implemented as program modules or functional processes. These include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types, and that may be described and/or implemented using existing hardware at existing structural elements. Such existing hardware may include one or more central processing units (CPUs), digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), computers, or the like.
[00132] It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise, or as is apparent from the discussion, terms such as processing or computing or calculating or determining or displaying or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical, electronic quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
[00133] Note also that the software implemented aspects of the example embodiments are typically encoded on some form of non-transitory program storage medium or implemented over some type of transmission medium. The program storage medium may be magnetic (e.g., a floppy disk or a hard drive) or optical (e.g., a compact disk read only memory, or CD ROM), and may be read only or random access. Similarly, the transmission medium may be twisted wire pairs, coaxial cable, optical fiber, or some other suitable transmission medium known to the art. The example embodiments are not limited by these aspects of any given implementation.
[00134] Lastly, it should also be noted that whilst the accompanying claims set out particular combinations of features described herein, the scope of the present disclosure is not limited to the particular combinations hereafter claimed, but instead extends to encompass any combination of features or embodiments herein disclosed irrespective of whether or not that particular combination has been specifically enumerated in the accompanying claims at this time.

Claims

WHAT IS CLAIMED IS:
1. A device comprising:
a sound acquisition manager configured to receive a mixed audio signal including a first plurality of audio signals;
an independent component analysis manager configured to:
determine a set of parameters configured to generate a second plurality of audio signals based on the first plurality of audio signals,
convert the second plurality of audio signals using a nonlinear function, and
minimize a correlation between pairs of signals of the converted second plurality of audio signals; and
a memory configured to store the second plurality of audio signals as multi-channel audio data.
2. The device of claim 1, wherein
the plurality of audio signals are temporally independent,
the plurality of audio signals are identically distributed, and
the plurality of audio signals are non-Gaussian.
3. The device of claim 1, wherein the set of parameters is selected to reduce the correlation of pairs of converted second plurality audio signals.
4. The device of claim 3, wherein the set of parameters is determined using backpropagation and stochastic gradient descent.
5. The device of claim 1, wherein the converting of the plurality of audio signals is based on at least one sigmoid function.
6. The device of claim 1, wherein the minimizing of the correlation includes revising the set of parameters.
7. The device of claim 1, wherein
the independent component analysis manager is further configured to transform the first plurality of audio signals to minimize a linear correlation between the first plurality of audio signals in a pre-whitening stage, and
the pre-whitening transforms the first plurality of audio signals into a data set having unit covariance.
8. The device of claim 1, wherein the set of parameters are elements of a separating matrix.
9. The device of claim 1, wherein the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically.
10. The device of claim 1, wherein
the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically,
an output of the convolutional neural network is used as an input to an activation function, and
the activation function is configured to allow the convolutional neural network to implement a non-linear process.
11. The device of claim 1, wherein
the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically,
a loss function for the convolutional neural network is a contrast function,
the separating matrix is an orthogonal matrix including the set of parameters as weights connecting input neurons and output neurons of the convolutional neural network.
12. A method comprising:
receiving a mixed audio signal including a first plurality of audio signals;
determining a set of parameters configured to generate a second plurality of audio signals based on the first plurality of audio signals;
converting the second plurality of audio signals using a nonlinear function; and
minimizing a correlation between pairs of signals of the converted second plurality of audio signals; and
storing the second plurality of audio signals as multi-channel audio data.
13. The method of claim 12, wherein the nonlinear function is at least one sigmoid function.
14. The method of claim 12, wherein the minimizing of the correlation includes revising the set of parameters.
15. The method of claim 12, further comprising:
transforming the first plurality of audio signals to minimize a linear correlation between the first plurality of audio signals in a pre-whitening stage, wherein the pre-whitening transforms the first plurality of audio signals into a data set having unit covariance.
16. The method of claim 12, wherein the set of parameters are elements of a separating matrix.
17. The method of claim 12, wherein the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically.
18. The method of claim 12, wherein
the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically,
an output of the convolutional neural network is used as an input to an activation function, and
the activation function is configured to allow the convolutional neural network to implement a non-linear process.
19. The method of claim 12, wherein
the determining of the set of parameters includes using a convolutional neural network to estimate a separating matrix numerically,
a loss function for the convolutional neural network is a contrast function,
the separating matrix is an orthogonal matrix including the set of parameters as weights connecting input neurons and output neurons of the convolutional neural network.
20. A non-transitory computer readable storage medium including instructions that when executed by at least one processor, cause the at least one processor to:
receive a mixed audio signal including a first plurality of audio signals;
use a convolutional neural network to numerically estimate a separating matrix for the first plurality of audio signals;
apply at least one activation function including a sigmoid function to an output of the convolutional neural network; and
minimize a correlation between pairs of signals generated using the at least one activation function.
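For illustration only, the following Python sketch walks through three of the operations recited in the claims for a two-channel case: pre-whitening to unit covariance (claims 7 and 15), applying an orthogonal separating matrix (claims 11 and 19), and measuring the correlation between sigmoid-converted signal pairs (claims 1, 5, and 20). The function names (`whiten`, `separate`, `nonlinear_correlation`), the parameterization of the separating matrix as a single rotation angle, and the use of a squared Pearson correlation as the quantity to be minimized are assumptions made for this sketch and are not taken from the disclosure. The convolutional neural network and the backpropagation/stochastic gradient descent update of claims 4 and 9 are deliberately omitted.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(a, b):
    """Sample covariance (divides by N) of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def whiten(x1, x2):
    """Pre-whiten two channels to zero mean and unit covariance.

    Assumes the channels are not perfectly collinear, so both
    principal-axis variances are strictly positive.
    """
    m1, m2 = mean(x1), mean(x2)
    c1 = [x - m1 for x in x1]
    c2 = [x - m2 for x in x2]
    a, b, d = covariance(c1, c1), covariance(c1, c2), covariance(c2, c2)
    # Rotation angle that diagonalizes the 2x2 covariance matrix.
    phi = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(phi), math.sin(phi)
    r = math.hypot(0.5 * (a - d), b)
    lam1 = 0.5 * (a + d) + r  # variance along the first principal axis
    lam2 = 0.5 * (a + d) - r  # variance along the second principal axis
    z1 = [(c * p + s * q) / math.sqrt(lam1) for p, q in zip(c1, c2)]
    z2 = [(-s * p + c * q) / math.sqrt(lam2) for p, q in zip(c1, c2)]
    return z1, z2


def separate(z1, z2, theta):
    """Apply an orthogonal 2x2 separating matrix (rotation by theta)."""
    c, s = math.cos(theta), math.sin(theta)
    y1 = [c * p + s * q for p, q in zip(z1, z2)]
    y2 = [-s * p + c * q for p, q in zip(z1, z2)]
    return y1, y2


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def nonlinear_correlation(y1, y2):
    """Squared Pearson correlation between sigmoid-converted signals."""
    g1 = [sigmoid(v) for v in y1]
    g2 = [sigmoid(v) for v in y2]
    return covariance(g1, g2) ** 2 / (covariance(g1, g1) * covariance(g2, g2))


# Usage with a small, made-up two-channel mixture.
mixed_1 = [1.0, 2.0, 4.0, 7.0, 3.0]
mixed_2 = [2.0, 1.0, 5.0, 6.0, 0.0]
white_1, white_2 = whiten(mixed_1, mixed_2)
out_1, out_2 = separate(white_1, white_2, theta=0.3)
loss = nonlinear_correlation(out_1, out_2)
```

In a complete system along the lines of the claims, the angle `theta` (or, more generally, the elements of the separating matrix) would be revised iteratively to reduce `loss`. Because whitening already removes linear correlation, any rotation of the whitened channels keeps the covariance at unity, so only the nonlinear conversion makes the measure sensitive to the choice of separating matrix.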
PCT/US2018/059785 2017-11-08 2018-11-08 Neural network based blind source separation WO2019094562A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762583141P 2017-11-08 2017-11-08
US62/583,141 2017-11-08

Publications (1)

Publication Number Publication Date
WO2019094562A1 true WO2019094562A1 (en) 2019-05-16

Family

ID=64564993

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2018/059785 WO2019094562A1 (en) 2017-11-08 2018-11-08 Neural network based blind source separation

Country Status (1)

Country Link
WO (1) WO2019094562A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5706402A (en) * 1994-11-29 1998-01-06 The Salk Institute For Biological Studies Blind signal processing system employing information maximization to recover unknown signals through unsupervised minimization of output redundancy

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
BUREL G ED - ROY ASIM ET AL: "BLIND SEPARATION OF SOURCES: A NONLINEAR NEURAL ALGORITHM", NEURAL NETWORKS, ELSEVIER SCIENCE PUBLISHERS, BARKING, GB, vol. 5, no. 6, 1 November 1992 (1992-11-01), pages 937 - 947, XP000334414, ISSN: 0893-6080, DOI: 10.1016/S0893-6080(05)80090-5 *
CELIK A ET AL: "Mixed-signal real-time adaptive blind source separation", PROCEEDINGS / 2004 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS : MAY 23 - 26, 2004, SHERATON VANCOUVER WALL CENTRE HOTEL, VANCOUVER, BRITISH COLUMBIA, CANADA, IEEE OPERATIONS CENTER, PISCATAWAY, NJ, vol. 5, 23 May 2004 (2004-05-23), pages 760 - 763, XP010720375, ISBN: 978-0-7803-8251-0 *
IREN VALOVA ET AL: "Blind Source Separation with Neural Networks: Demixing Sources From Mixtures with Different Parameters", 25TH DIGITAL AVIONICS SYSTEMS CONFERENCE, 2006 IEEE/AIAA, IEEE, PI, 1 October 2006 (2006-10-01), pages 1 - 11, XP031048501, ISBN: 978-1-4244-0377-6 *
JIANG DU ET AL: "BSS: a new approach for watermark attack", MULTIMEDIA SOFTWARE ENGINEERING, 2002. PROCEEDINGS. FOURTH INTERNATIONAL SYMPOSIUM ON DEC. 11-13, 2002, PISCATAWAY, NJ, USA, IEEE, 11 December 2002 (2002-12-11), pages 182 - 187, XP010632749, ISBN: 978-0-7695-1857-2 *
JUN-AN YANG ET AL: "Research of nonlinear blind source separation algorithm based on quantum evolutionary neural network", MACHINE LEARNING AND CYBERNETICS, 2003 INTERNATIONAL CONFERENCE ON NOV. 2-5, 2003, PISCATAWAY, NJ, USA,IEEE, vol. 2, 2 November 2003 (2003-11-02), pages 835 - 840, XP010678271, ISBN: 978-0-7803-7865-0, DOI: 10.1109/ICMLC.2003.1259594 *
MOSAYEBI RAZIYEH ET AL: "Subband blind source separation for convolutive mixture of speech signals based on dynamic modeling", IEEE INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY, IEEE, 12 December 2013 (2013-12-12), pages 299 - 304, XP032583911, DOI: 10.1109/ISSPIT.2013.6781897 *
SHUN-TIAN LOU ET AL: "Fuzzy-based learning rate determination for blind source separation", IEEE TRANSACTIONS ON FUZZY SYSTEMS, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 11, no. 3, 1 June 2003 (2003-06-01), pages 375 - 383, XP011097542, ISSN: 1063-6706, DOI: 10.1109/TFUZZ.2003.812697 *
YANG H H ET AL: "Information backpropagation for blind separation of sources in nonlinear mixture", NEURAL NETWORKS,1997., INTERNATIONAL CONFERENCE ON HOUSTON, TX, USA 9-12 JUNE 1997, NEW YORK, NY, USA,IEEE, US, vol. 4, 9 June 1997 (1997-06-09), pages 2141 - 2146, XP010238971, ISBN: 978-0-7803-4122-7, DOI: 10.1109/ICNN.1997.614237 *
YONGMAN LIN ET AL: "A RBF Neural Network Algorithm for Blind Source Separation of Linear Mixing Signals", INTELLIGENT SYSTEMS DESIGN AND APPLICATIONS, 2006. ISDA '06. SIXTH INTERNATIONAL CONFERENCE ON, IEEE, PI, 1 October 2006 (2006-10-01), pages 231 - 235, XP031023070, ISBN: 978-0-7695-2528-0 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457703A (en) * 2019-08-12 2019-11-15 广东工业大学 A kind of file classification method, device and equipment based on improvement convolutional neural networks
CN110457703B (en) * 2019-08-12 2022-12-30 广东工业大学 Text classification method, device and equipment based on improved convolutional neural network
CN110807524A (en) * 2019-11-13 2020-02-18 大连民族大学 Single-channel signal blind source separation amplitude correction method
CN110807524B (en) * 2019-11-13 2023-11-21 大连民族大学 Single-channel signal blind source separation amplitude correction method
CN113419342A (en) * 2021-07-01 2021-09-21 重庆邮电大学 Free illumination optical design method based on deep learning
CN113674756A (en) * 2021-10-22 2021-11-19 青岛科技大学 Frequency domain blind source separation method based on short-time Fourier transform and BP neural network
CN114495974A (en) * 2022-02-18 2022-05-13 腾讯科技(深圳)有限公司 Audio signal processing method
CN114495974B (en) * 2022-02-18 2024-02-23 腾讯科技(深圳)有限公司 Audio signal processing method
CN115409073A (en) * 2022-10-31 2022-11-29 之江实验室 I/Q signal identification-oriented semi-supervised width learning method and device
CN115409073B (en) * 2022-10-31 2023-03-24 之江实验室 I/Q signal identification-oriented semi-supervised width learning method and device
CN117202077A (en) * 2023-11-03 2023-12-08 恩平市海天电子科技有限公司 Microphone intelligent correction method
CN117202077B (en) * 2023-11-03 2024-03-01 恩平市海天电子科技有限公司 Microphone intelligent correction method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18811997

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18811997

Country of ref document: EP

Kind code of ref document: A1