WO2016100460A1 - Systems and methods for source localization and separation - Google Patents

Systems and methods for source localization and separation Download PDF

Info

Publication number
WO2016100460A1
WO2016100460A1 PCT/US2015/066012 US2015066012W WO2016100460A1 WO 2016100460 A1 WO2016100460 A1 WO 2016100460A1 US 2015066012 W US2015066012 W US 2015066012W WO 2016100460 A1 WO2016100460 A1 WO 2016100460A1
Authority
WO
WIPO (PCT)
Prior art keywords
tensor
acoustic
doa
source
matrix
Prior art date
Application number
PCT/US2015/066012
Other languages
French (fr)
Inventor
Johannes TRAA
Noah Daniel STEIN
David Wingate
Original Assignee
Analog Devices, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Analog Devices, Inc. filed Critical Analog Devices, Inc.
Publication of WO2016100460A1 publication Critical patent/WO2016100460A1/en

Links

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S3/00Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received
    • G01S3/80Direction-finders for determining the direction from which infrasonic, sonic, ultrasonic, or electromagnetic waves, or particle emission, not having a directional significance, are being received using ultrasonic, sonic or infrasonic waves
    • G01S3/8006Multi-channel systems specially adapted for direction-finding, i.e. having a single aerial system capable of giving simultaneous indications of the directions of different signals

Definitions

  • the present invention relates to the field of signal processing, and in particular to source localization and/or separation.
  • an acoustic sensor acquires an acoustic signal that has contributions from a plurality of different acoustic sources, where, as used herein, the term “contribution of an acoustic source” refers to at least a portion of an acoustic signal generated by a particular acoustic source, typically the portion being a portion of a particular frequency or a range of frequencies, at a particular time or range of times.
  • an acoustic source is e.g. a person speaking, there will be multiple contributions, i.e. there will be acoustic signals of different frequencies at different times generated by such a “source.”
  • source separation In a process generally referred to as “source separation,” various digital signal processing techniques are used to recover the original component signals attributable to different sources from a combined signal acquired by the acoustic sensor (i.e. from the acquired signal that has a combination of contributions from different sources). A process of performing source separation without any prior information about
  • Source separation can often be improved by processing acoustic signals acquired by multiple acoustic sensors, arranged e.g. in a sensor array, e.g. a microphone array. In such scenarios, each acoustic sensor acquires a corresponding signal that includes
  • source localization refers to a process of
  • DOA Directional of Arrival
  • Sound source localization and separation is used in many applications, including, for example, signal enhancement and noise cancellation for phones or hearing aids, speech recognition, home automation, and voice user interface in the car or home.
  • various source separation techniques use DOA in order to recover signals attributable to one or more of the individual sources.
  • source localization typically precedes, or may be considered a part of, source separation.
  • many well ⁇ known source separation approaches use beamforming, i.e. signal processing techniques used to control the directionality of the reception of a signal, by employing arrays of acoustic sensors that aim to improve directional gain of the sensor array(s) by increasing the gain in the direction of a source of interest (e.g. a speaker) and decreasing the gain in the direction of interferences and noise.
  • Beamforming techniques use information about the DOA of the source, and, therefore, are preceded by
  • SRP Steered Response Power
  • each beamformer in a family of beamformers focuses on a specific direction.
  • SRP localization can be used with a
  • GCC Generalized Cross ⁇ Correlation
  • PHAT Phase Transform
  • a different known method for finding DOAs uses eigenanalysis of the data correlation matrix.
  • a Multiple Signal Classification (MUSIC) algorithm uses this method to identify signal and noise subspaces and form a MUSIC pseudospectrum that contains peaks at the source DOAs.
  • the MUSIC pseudospectrum plots direction on the x ⁇ axis and likelihood of that direction as being the source of a sound on the y ⁇ axis, and thus is a function over the space of directions which indicates where sources are likely to be.
  • Another known method includes modeling observed data vectors as zero ⁇ means Gaussian random variables and using an EM algorithm to learn the sources’ covariance parameters.
  • the sources can be separated using multichannel Wiener filtering.
  • multichannel Wiener filtering can be used separate source signals from background noise.
  • multichannel Wiener filtering can be used to separate speech signals from each other.
  • the output of the multichannel Wiener filter includes multiple sources and includes a correlation matrix that describes how the channels are correlated. The multichannel Wiener filter reconstructs source vectors directly.
  • a more effective and efficient method for localizing and separating signals involves interpreting the SRP function as a probability distribution and maximizing it as a function of the source DOAs.
  • a mixture of single ⁇ source SRPs MoSRP
  • MultSRP an SRP that explicitly models the presence of multiple sources.
  • Some advantages of the second method include simultaneous localization of each of the multiple sources and explicit modeling of interference between sources.
  • Time ⁇ Frequency (TF) masking is used to isolate TF bins, described in greater detail below, that correspond to directional signals of interest, thereby merging the localization, separation and Wiener post ⁇ filtering steps into one unified approach.
  • an improved type of Wiener filter may be used for estimating a weight for each of multiple TF bins for each of multiple sources.
  • the weight estimates for each time ⁇ frequency bin can be used to determine which bins contain source energy and which bins do not contain source energy. Bins which do not contain source energy may still contain energy, for example, noise.
  • a Wiener filter coefficient is estimated, where the Wiener filter coefficient corresponds to the probability that any of the directional sources are present.
  • a method for identifying a first direction of arrival of sound waves (i.e. acoustic signals) from a first acoustic source and a second direction of arrival of sound waves from a second acoustic source.
  • the methods includes receiving, at a microphone array, acoustic signals including a combination of the sound waves from the first and second acoustic sources, converting the received acoustic signals from a time domain to a time ⁇ frequency domain, processing the converted acoustic signals to determine an estimated first angle representing the first direction of arrival and an estimated second angle representing the second direction of arrival, and updating the estimated first and second angles.
  • the processing includes localizing, separating and Wiener post ⁇ filtering the converted acoustic signals using time ⁇ frequency weighting and outputting a time ⁇ frequency weighted signal for estimating the first and
  • converting the received acoustic signals from a time domain to a time ⁇ frequency domain includes using a short time Fourier transform.
  • the method includes combining the time ⁇ frequency weighted signal with the converted acoustic signals to generate a correlation matrix.
  • updating the estimated first and second angles comprises utilizing the correlation matrix and the estimated first and second angles and outputting updated estimated first and second angles.
  • processing the converted acoustic signals to determine the estimated first and second angles includes decomposing the converted acoustic signals to identify signals from each of the first and second acoustic sources by accounting for interference between the first and second acoustic sources in forming the acoustic signals.
  • processing the converted acoustic signals and updating the first and second estimated angles includes iteratively decomposing the converted acoustic signals to simultaneously determine the first and second directions of arrival.
  • processing the converted acoustic signals includes processing using steered response power localization.
  • the method further includes using an inverse STFT to convert the processed converted acoustic signals back into the time domain and separating the sound waves from the first acoustic source from the sound waves from the second acoustic source.
  • aspects of the present disclosure may be embodied in various manners – e.g. as a method, a system, a computer program product, or a computer ⁇ readable storage medium. Accordingly, aspects of the present disclosure may take the form of an entirely hardware
  • aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s), preferably non ⁇ transitory, having computer readable program code embodied, e.g., stored, thereon.
  • a computer program may, for example, be downloaded (updated) to the existing devices and systems (e.g. to the existing radar or sonar receivers or/and their controllers, etc.) or be stored upon manufacturing of these devices and systems.
  • FIGURE 1 is a diagram illustrating an audio processor receiving signals from multiple sources, according to some embodiments of the disclosure
  • FIGURE 2 is a diagram illustrating a method for identifying a first direction of arrival of sound waves from a first acoustic source and a second direction of arrival of sound waves from a second acoustic source, according to some embodiments of the disclosure;
  • FIGURE 3 is one diagram illustrating a method for separating and localizing signals, according to some embodiments of the disclosure.
  • FIGURE 4 is a diagram illustrating two data vectors from two sources and the combination of the two data vectors, according to some embodiments of the disclosure
  • FIGURE 5A is a diagram illustrating single ⁇ source likelihood over DOAs, according to some embodiments of the disclosure.
  • Figure 5B is a diagram illustrating a multi ⁇ source SRP likelihood for a data mixture of two sources over a joint space of all DOA pairs, according to some
  • FIG. 6 is another diagram illustrating a MultSRP method for separating and localizing signals, according to some embodiments of the disclosure.
  • FIGURE 1 is a diagram 100 illustrating an audio processor 102 receiving signals from first 104a, second 104b, and third 104n sources, according to some embodiments of the disclosure.
  • the audio processor 102 includes a microphone array 106, a direction finding module 108, a source separating module 110, and an audio processing module 112.
  • the microphone array 106 receives (i.e. acquires) a combined sound, referred to in the following as “ambient sound,” including the signals from the first 104a, second 104b and third 104n sources.
  • ambient sound includes signals from more than three sources, and there may be any number of sources present.
  • the microphone array 106 may include one or more acoustic sensors, arranged e.g. in a sensor array, each sensor of the array configured to acquire an ambient sound (i.e., each acoustic sensor acquires a corresponding signal).
  • the sensors may be provided relatively close to one another, e.g. less than 2 centimeters (cm) apart, preferably less than 1 cm apart.
  • the sensors may be arranged separated by distances that are much smaller, on the order of e.g. 1 millimeter (mm) or about 300 times smaller than typical sound wavelength, where beamforming techniques, used e.g. for
  • the sensors may be provided at larger distances with respect to one another.
  • the plurality of signals acquired by an array of acoustic sensors may consider the plurality of signals acquired by an array of acoustic sensors as a single signal, possibly by combining the individual acquired signals into a single signal as is appropriate for a particular implementation. Therefore, in the following, when an “acquired signal” is discussed in a singular form, then, unless otherwise specified, it is to be understood that the signal may comprise several acquired signals acquired by different sensors of the microphone array 106.
  • a characteristic could e.g. be a quantity indicative of a magnitude of the acquired signal.
  • a characteristic is “spectral” in that it is computed for a particular frequency or a range of frequencies.
  • a characteristic is “time ⁇ dependent” in that it may have different values at different times.
  • such characteristics may be a Short Time Fourier Transform (STFT), computed as follows.
  • STFT Short Time Fourier Transform
  • An acquired signal is functionally divided into overlapping blocks, referred to herein as “frames.”
  • frames may be of a duration of 64 milliseconds (ms) and be overlapping by e.g. 48 ms.
  • the portion of the acquired signal within a frame is then multiplied with a window function (i.e. a window function is applied to the frames) to smooth the edges.
  • window function also known as tapering or apodization function refers to a mathematical function that has values equal to or close to zero outside of a particular interval.
  • the window functions used are non ⁇ negative smooth "bell ⁇ shaped" curves, though rectangle, triangle, and other functions can be used.
  • a function that is constant inside the interval and zero elsewhere is called a “rectangular window,” referring to the shape of its graphical representation.
  • a transformation function such as e.g. Fast Fourier Transform (FFT)
  • the frequency decomposition of all of the frames may be arranged in a matrix where frames and frequency are indexed (in the following, frames are described to be indexed by “t” and frequencies are described to be indexed by “f”).
  • Each element of such an array, indexed by (f, t) comprises a complex value resulting from the application of the transformation function and is referred to herein as a "time ⁇ frequency (TF) bin” or simply “bin.”
  • TF time ⁇ frequency bin
  • bin may be viewed as indicative of the fact that such a matrix may be considered as comprising a plurality of bins into which the signal’s energy is distributed.
  • Time ⁇ frequency bins come into play in BSS algorithms in that separation of a particular acoustic signal of interest (i.e. an acoustic signal generated by a particular source of interest) from the total signal acquired by an acoustic sensor may be achieved by identifying which bins correspond to the signal of interest, i.e. when and at which frequencies the signal of interest is active. Once such bins are identified, the total acquired signal may be masked by zeroing out the undesired time ⁇ frequency bins. Such an approach would be called a “hard mask.” Applying a so ⁇ called “soft mask” is also possible, the soft mask scaling the magnitude of each bin by some amount. Then an inverse transformation function (e.g.
  • inverse STFT may be applied to obtain the desired separated signal of interest in the time domain.
  • masking in the frequency domain corresponds to applying a time ⁇ varying frequency ⁇ selective filter in the time domain.
  • the desired separated signal of interest may then be selectively processed for various purposes.
  • each source 104a, 104b, 104n has a distinct location, and the signal from each source 104a, 104b, 104n arrives at the microphone array 106 at an angle relative to its source location. Based on this angle, for each signal, the audio processor 102 estimates a direction ⁇ of ⁇ arrival (DOA).
  • DOA direction ⁇ of ⁇ arrival
  • each source 104a, 104b, 104n has a DOA 114a, 114b, 114n.
  • the first source 104a has a first DOA 114a
  • the second source 104b has a second DOA 114b
  • the third source 104n has a third DOA 114n.
  • the microphone array 106 is coupled to the direction finding module 108, and the signals received at the microphone array 106 are transmitted to the direction
  • the direction finding module 108 estimates the DOAs 114a, 114b, and 114n associated with source signals 104a, 104b, and 104n, as described in greater detail below.
  • the direction finding module 108 is coupled to a separation masking module 110, where the signals corresponding to the various sources 104a, 104b, 104n are separated from each other and from background noise which may be present.
  • the direction finding module 108 and the separation masking module 110 are each also coupled to a further audio processing module 112, where further processing of the acoustic signals occurs.
  • the further audio processing may depend on the application, and may include, for example, enhancing one or more speech signals, and filtering out constant noise or repetitive sounds.
  • MVDR Minimum Variance Distortionless Response
  • LCMV Linearly ⁇ Constrained Minimum ⁇ Variance
  • MUSIC Multiple Signal Classification
  • Delay ⁇ and ⁇ Sum (DS) beamforming involves adding a time delay to the signal recorded from each microphone that cancels out the delay caused by the extra travel time that it took for the signal to reach the microphone (as opposed to microphones that were closer to the signal source). Summing the resulting in ⁇ phase signals enhances the signal.
  • This beamforming method can be used to estimate DOA by testing various time delays, since the delay that correlates with the correct DOA will amplify the signal, while incorrect time delays destructively interfere with the signal.
  • the DS beamforming method focuses on the time domain to estimate DOA, and it is inaccurate in noisy environments.
  • Delay ⁇ and ⁇ Sum (DS) beamforming involves fractional delays in the frequency domain.
  • the received signals are processed by measuring the fractional delays in the signals, weighting each channel by a complex coefficient, and adding up the results.
  • DS beamforming is used in processing received signals in the single source model described below.
  • MVDR beamforming is similar to DS beamforming, but takes into account statistical noise correlations between the channels.
  • a Fourier transform can be used to transform the time domain signal into the time ⁇ frequency plane by converting time delays between sensors into phase shifts.
  • MVDR beamforming provides good noise suppression by minimizing the output power of the array while not distorting signals from the primary DOA, but it has a power defined by a matrix inversion, and is therefore computationally intensive.
  • MVDR and DS beamformers are generalized to the Multi ⁇ source case via a multiply ⁇ constrained optimization problem, and the solution is the Linearly ⁇ Constrained Minimum ⁇ Variance (LCMV) beamformer.
  • the weight vector can be used to determine how to weight the channels in the time ⁇ frequency plane to preserve energy from desired directions and suppress energy from other directions:
  • the beamformers discussed above can be used to estimate the coefficients when the source DOA(s) ⁇ are already known. Thus, in systems using the LCMV, MVDR and DS beamforming methods described above, first the source DOA(s) are determined and then beamforming is performed.
  • the source DOA(s) may be determined using, for example, Steered Response Power Localization as described below.
  • the MUSIC beamformer is a subspace method based on an eigenanalysis of the covariance matrix.
  • the MUSIC beamformer requires an eigendecomposition. Additionally, MUSIC is based on the assumption that the subspace that the signals lie in is
  • the MUSIC beamformer decomposes a covariance matrix representing the signal and noise of the received signal.
  • Steered Response Power (SRP) Localization is used to estimate source DOA(s) ⁇ .
  • SRP localization is used to estimate DOA’s by discretizing the direction space.
  • source DOAs ⁇ estimated by SRP Localization can be input in the LCMV beamforming equation (2) above.
  • SRP localization identifies DOAs ⁇ by searching for peaks in the output power of a single ⁇ source beamformer.
  • a more accurate and effective approach is to scan all DOA sets ⁇ using an LCMV beamformer and locate the peak output power.
  • this is computationally inefficient and too time ⁇ consuming for real ⁇ time feedback, since discretizing the DOA search space into D look directions results in D K ⁇ ‘s to be scanned (where K is the number of sources present).
  • the multi ⁇ source SRP function is modeled as a continuous likelihood function parametrized by ⁇ and the likelihood function is maximized to identify source DOAs.
  • FIGURE 2 is a diagram illustrating a method 200 for identifying a first direction of arrival of sound waves from a first acoustic source and a second direction of arrival of sound waves from a second acoustic source.
  • the method includes, at step 202, receiving, at a microphone array, acoustic signals including the sound waves from the first and second acoustic sources.
  • the received acoustic signals now represented by electrical signals generated by the microphone array, are converted from a time domain to a time ⁇ frequency domain.
  • the converted acoustic signals are processed to determine an estimated first angle representing the first direction of arrival and an estimated second angle representing the second direction of arrival. Processing includes localizing, separating and Wiener post ⁇ filtering the converted acoustic signals using time ⁇ frequency weighting and outputting a time ⁇ frequency
  • the estimated first and second angles are updated.
  • the likelihood of the first and second angles is determined integrally as a single unit from the mixed signals received at the microphone array, rather than maximizing the likelihood of each of the first and second angles separately.
  • FIGURE 3 is a diagram illustrating a method 300 for separating and localizing signals, according to some embodiments of the disclosure.
  • the method 300 is an iterative approach in which a probabilistic SRP model is combined with time ⁇ frequency masking to perform blind source separation and localization in the presence of non ⁇ stationary interference.
  • the method has an iterative loop 306 including a Time ⁇ Frequency (TF) weighting step 308, a correlation matrices step 310, and a direction of arrival (DOA) update step 312.
  • TF Time ⁇ Frequency
  • DOA direction of arrival
  • the method 300 begins with receiving input acoustic signals x 302 acquired by different microphone elements of the microphone array 106.
  • each acquired signal 302 may, and typically will, include contributions from multiple sources 104a ⁇ 104n and a goal of source separation is to distinguish these individual contributions on a per ⁇ source basis.
  • the acquired input acoustic signals 302 are processed through an STFT 304 to transform the signals from the time domain to the time ⁇ frequency plane.
  • the output X from the STFT 304 is input to the TF weighting step 308 and to the correlation matrices step 310.
  • the TF weighting step 308 uses TF masking to isolate TF bins that correspond to selected directional signals.
  • some directional signals are identified as being directional signals of interest, and the corresponding TF bins are isolated. Identifying the directional signal or signals of interest may include separating identified signals, and selecting one (or more) of the separated signals.
  • the selected signal corresponds to a speech signal, and it may be the speech of a particular speaker.
  • the selected directional signals are identified based on peaks in output power.
  • the TF weighting step 308 receives the output signals from the STFT step 304 as well as a DOA set ⁇ (DOA matrix) from the DOA update step 312, and
  • the output ⁇ from the TF weighting step 308 is input into the correlation matrices step 310.
  • the correlation matrices step 310 combines the TF weighted input and data output from the STFT 304.
  • the correlation matrices step 310 uses the inputs to derive correlation matrices as described in greater detail below with respect to equations 15 and 16, and outputs an updated correlation matrix R to the DOA update step 312.
  • the DOA update step revises the set of DOA’s ⁇ based on the input correlation matrix R, and outputs the updated DOA’s ⁇ to the TF weighting step 308.
  • an output set of DOA’s ⁇ indicating the localization results is output from the DOA update step 312 to a final separation step 314.
  • the separation step 314 also receives the STFT processed data x as input.
  • the set of DOA’s ⁇ is used to separate out the signals in the data x and generates an STFT matrix for each source.
  • the STFT matrices are processed with an inverse STFT at step 316, which transforms each one into a time domain signal.
  • the time domain signals 318 output from step 316 are localized, separated and post ⁇ filtered output signals.
  • the method 300 is performed using the following equations.
  • a first method for maximizing the SRP as a function of the source DOA’s uses an SRP that explicitly models the presence of multiple sources.
  • Identifying the DOAs involves maximizing a likelihood function:
  • x is the STFT coefficients of the data from the microphone array
  • ⁇ 1 and ⁇ 2 are estimated DOA angles.
  • a Gaussian likelihood for the observed data vectors x ft is:
  • f the variance of the background noise at frequency f
  • I the identity matrix
  • a f is the steering matrix including the observed mixing vectors a f as elements
  • s ft is a vector of complex source coefficients for a time ⁇ frequency bin with one component for each source.
  • the expectation E[s ft ] can be approximated with a least squares estimate:
  • FIGURE 4 is a diagram 400 illustrating a first 402 and second 404 data vectors from first and second sources and the combination 406 of the two data vectors
  • the diagram 400 illustrates the additivity of the first 402 and second 404 data vectors. As illustrated in Figures 5A and 5B, due to interference between the first 402 and second 404 data vectors, a spurious peak in the single source likelihood is present between the true DOAs. If a single ⁇ source likelihood is calculated for the superposition of the first 402 and second 404 data vectors, the single source likelihood will indicate the likelihood of a single source at the combination data vector 406 source. This is illustrated in FIGURE 5A, which shows is a diagram 500 illustrating single ⁇ source likelihood over DOAs, according to some embodiments of the disclosure. In particular, the diagram 500 shows the single source likelihood, with a peak indicating a DOA around 1.3 ⁇ 1.4 radians. Thus, the single source likelihood equation estimates a single source positioned between the first and second sources, rather than the two separate sources.
  • Figure 5B is a diagram 550 illustrating a multi ⁇ source SRP likelihood for a data mixture of two sources over a joint space of all DOA pairs, according to some embodiments of the disclosure.
  • the data shown in Figure 5B is derived using equation (10) above, which estimated the first and second sources as having a DOAs at 0.56 radians and at 2.26 radians on the unit circle.
  • a second method for maximizing the SRP as a function of the source DOA’s uses a mixture of single ⁇ source SRP’s.
  • is the step size
  • is a function that normalizes the gradient, which appears in parentheses after the ⁇ .
  • the gradient indicates which direction corresponds with an improvement in DOA estimates.
  • the step size ⁇ is how far to move in the indicated direction.
  • the maximum likelihood can be estimated for both the one ⁇ source model and the multiple source model.
  • the gradient can be
  • m is the matrix of microphone positions (a matrix in which the columns are the positions of the
  • the single source model can be used when multiple sources are present by modeling the presence of the other sources at each time t with hidden variables z ft that capture which source is active at any selected time.
  • EM Expectation ⁇ Maximization
  • [0076] are source ⁇ specific correlation matrices, and are defined in terms of the posterior probabilities of the z ft ’s:
  • equations 13 ⁇ 16 show one way to use the single source method of equation 12 for multiple sources.
  • equation 17 can be used for localization of multiple sources.
  • the E step soft TF weights are determined, and in the M step, each source’s DOA is optimized.
  • the EM method alternates between estimating localization (DOA) parameters and estimating separation (TF mask) parameters.
  • the gradient in the multiple source case is:
  • This multiple source case takes cross ⁇ talk into account while avoiding the complexity of the EM algorithm.
  • the wei hts can be a roximated as usin e uations 21 and 22 : (21)
  • interleaving the Wiener masking with DOA optimization improves localization accuracy in the presence of ambient noise.
  • Equation 15 can be estimated by multiplying the posteriors with the Wiener filter weights.
  • the sources can be separated by applying TF masks with weights. In various examples, this may be done in one or more of step 308 and step 314 of the method 300.
  • equation can be used: (23)
  • the source coefficients are recovered with LCMV beamforming.
  • the variance is related to the hardness of the mask, such that as the variance moves to zero, the mask becomes binary.
  • the masks can be applied to corresponding components of and followed with a Wiener masking step to suppress non ⁇ speech interference and reduce the presence of masking artifacts.
  • FIGURE 6 is a diagram illustrating a method 600 for separating and localizing signals, according to some embodiments of the disclosure.
  • the method 600 may be considered as a summary, or an alternative representation, of the method 300 described above. Therefore, in the interests of brevity, some steps illustrated in method 600 refer to steps illustrated in method 300 in order to not repeat all of the details of their descriptions.
  • stage 610 that may be referred to as a preprocessing stage
  • stage 620 that may be referred to as an optimization stage
  • stage 630 that may be referred to as a source separation stage.
  • the preprocessing stage 610 may include steps 612, 614, 616, and 618.
  • acoustic signals are captured by the microphone array 106, as described above with reference to 302.
  • step 614 STFT is applied to the captured signals x m . in order to convert the captured signals into the TF domain resulting in complex ⁇ values matrices
  • step 616 correlation matrices are initialized by estimating correlation matrices for each frequency as:
  • step 618 the DOA arameter matrix
  • k being an integer between 1 and n for the acoustic sources 104 illustrated in FIGURE 1
  • step 618 may be carried out in different manners, including e.g. SRP localization described above.
  • the initialized DOA matrix ⁇ 0 is provided to the optimization stage 620.
  • the optimization stage 620 may include steps 622, 624, 626, and 628 which may be iteratively repeated for a number of iterations I max , in order to improve the estimate of the DOA matrix ⁇ (i.e. in order to improve DOA estimates for the different acoustic sources 104).
  • the number of iterations I max may be determined by various stopping conditions. For example, in some embodiments, the maximum number of iterations may be pre ⁇ defined, while, in other embodiments, iterations may be performed until a certain condition is met, such as e.g. a
  • step 622 for each frequency, a steering matrix A f , described above with reference to equation (5) and subsequent equations, is computed as:
  • l f is the frequency in Hertz at the f th frequency band
  • c is the speed of sound
  • a projector matrix may then be computed as shown above with the equation (8).
  • Steering matrices A and projection matrices B may then be, optionally, provided to step 624.
  • step 624 if Wiener masking described above is used, new correlation matrices are re ⁇ estimated as described above with reference to equations (19) ⁇ (20).
  • equations (20) and (19) for re ⁇ estimating the new correlation matrices may be re ⁇ written as equations (31) and (32) below:
  • a DOA gradient matrix may be computed as
  • Equation (33) is an exemplary explicit equation for the gradient given in equation (17) above.
  • the gradient matrix G is provided to step 628 where the DOA matrix ⁇ is adjusted as described with reference to equation (11) above.
  • the DOA matrix is adjusted as
  • the columns of the DOA matrix may be normalized as:
  • step 628 describes the gradient procedure given an appropriate gradient as given in equations (11) and (12) above.
  • Step 624 may be performed as a part of 308 and 310 described above, while step 628 corresponds to 312 described above.
  • Updated DOA matrix ⁇ is then provided to the source separation, as illustrated in FIGURE 6 with ⁇ provided to the separation stage 630 and as illustrated in FIGURE 3 with an arrow from 306 to the final separation step 314.
  • the source separation stage 630 may include steps 632 and 634. Following the iterative procedure described above, any number of methods may be used to enhance/separate the directional signals, all of which methods are within the scope of the present disclosure.
  • each source 104 may be isolated by estimating TF masks and applying them to the STFT X.
  • the sources can be separated by applying TF masks with weights, which could be done in one or more of step 308 and step 314 of the method 300, using equation (23) provided above using estimates of the source coefficients provided by K LCMV beamformers, each designated to isolate a single source while blocking out, or at least substantially suppressing, the others. In one embodiment, this may be implemented as:
  • the variance controls the hardness of the mask such that as , the mask becomes binary, assigning each TF bin entirely to a single source.
  • these masks are applied to any single captured signal (i.e. to any signal captured by one of the acoustic sensors of the microphone array 106) and inverted to the time ⁇ domain using inverse STFT, as described above with reference to 316.
  • the method 600 is presented for the case of an SRP that explicitly models the presence of multiple sources, i.e.method 600 is a MultSRP method.
  • a method for the mixture of single ⁇ source SRPs would include steps analogous to those illustrated in FIGURE 6 with the main difference residing in the gradients of the two methods, in particular in how the correlation information is used (i.e. the difference between MultSRP and MoSRP is in re ⁇ computing the correlation matrices as is done in step 624 described above).
  • step 624 would involve including posterior probability weights in re ⁇ computing the correlation matrices as in equation (15). Gradients for the MoSRP method are given in equation (12).
  • third rank tensors are represented with capital letters (e.g. X), while individual elements of a tensor are denoted with X ijk , where “ijk” represents indices corresponding to those most appropriate for the tensor.
  • Sub ⁇ matrices of the third rank tensors i.e. second rank tensors, also referred to as matrices
  • X ::k which indicates that, in this example, only the third index of the third rank tensor X is specified.
  • sub ⁇ vectors i.e.
  • first rank tensors derived from the corresponding rank tensors are similarly denoted as, for example, X :jk , indicting that e.g. only the second and third index of the third rank tensor X is specified.
  • Source localization refers to determining a DOA of an acoustic signal generated by an acoustic source k of K acoustic sources 104 ⁇ 1 through 104 ⁇ K, the DOA indicating a DOA of the acoustic signal at a microphone array 106 comprising M microphones.
  • K and M could be an integer equal to or greater than 2.
  • M is typically an integer on the order of 5, but, of course, in various implementations the
  • K is typically an integer in the range [2,4]. Since in a typical deployment scenario it is often not possible to know for sure how many acoustic sources are present, value of K (i.e. the number of acoustic sources being modeled) is estimated/selected based on various considerations that a person of ordinary skill in the art would readily recognize, such as e.g. likely number of acoustic sources, an estimate based on a source ⁇ counting algorithm, or prior knowledge.
  • a source localization method may include steps of: a) determining a time ⁇ frequency (TF) tensor (X) of FxTxM dimensions, where F is an integer indicating the number of frequency components f and T is an integer indicating the number of time frames t (each of F, T, and M being an integer equal to or greater than 2, where F may be on the order of 500 and T may be on the order of 100), the TF tensor comprising a TF representation, e.g.
  • each element Xftm of the tensor X f being an integer from a set ⁇ 1, ... ,F ⁇ , t being an integer from a set ⁇ 1, .., T ⁇ , and m being an integer from a set ⁇ 1, ..., M ⁇ , is configured to comprise a complex value indicative of measured magnitude and phase of a portion of a digitized stream x corresponding to a frequency component f at a time frame t for a microphone m;
  • a DOA tensor ( ⁇ ), the DOA tensor being of dimensions 3xK (i.e. it is a second order tensor, or a matrix) and comprising estimated DOA information for each of the K acoustic sources, where each element ⁇ ik of the DOA tensor (i being an integer from a set ⁇ 1, 2, 3 ⁇ , k being an integer from a set ⁇ 1, .., K ⁇ ) is configured to comprise a real value indicative of orientation of a particular acoustic source k with respect to the microphone array (in a 3 ⁇ dimensional space around the microphone array 106) in dimension i (the columns ⁇ :k of ⁇ are vectors of length 1);
  • each element R m1m2f of the correlation tensor (m1 and m2 each being integers from a set ⁇ 1, ... M ⁇ and f being an integer from a set ⁇ 1, ..., F ⁇ ) is configured to comprise a complex value indicative of estimated correlation between a portion of the digitized stream x as acquired by microphone m1 (m1 being an integer from a set ⁇ 1, ... M ⁇ ) and a portion of the digitized stream x as acquired by microphone m2 (m2 being an integer from a set ⁇ 1, ... M ⁇ ) for a particular frequency component f (f being an integer from a set ⁇ 1, ..., F ⁇ );
  • localizable sources i.e. sources for which it is possible to determine orientation with respect to the microphone array; in other words ⁇ directional sources; in other words – sources that may be approximated as point sources for which it is possible to identify their location; e.g. ambient noise coming from all different directions would not
  • Each element B m1m2f of the projector tensor (m1 and m2 both being integers from a set ⁇ 1, ..., M ⁇ and f being an integer from a set ⁇ 1, ..., F ⁇ ) is configured to comprise a complex value indicative of a set (subspace) of data vectors X ft: that correspond to signals originating from the estimated orientations in ⁇ at frequency component f (the product B ::f * X ft: results in a vector that approximates the directional components in the signal at time t and frequency f);
  • each element G ik of the DOA gradient tensor i being an integer from a set ⁇ 1, 2, 3 ⁇ , k being an integer from a set ⁇ 1, .., K ⁇
  • G ik of the DOA gradient tensor i being an integer from a set ⁇ 1, 2, 3 ⁇ , k being an integer from a set ⁇ 1, .., K ⁇
  • each element G ik of the DOA gradient tensor i being an integer from a set ⁇ 1, 2, 3 ⁇ , k being an integer from a set ⁇ 1, .., K ⁇
  • each element G ik of the DOA gradient tensor i being an integer from a set ⁇ 1, 2, 3 ⁇ , k being an integer from a set ⁇ 1, .., K ⁇
  • orientation estimates of an acoustic source k i.e. an estimated change in the DOA matrix ⁇ that is necessary to improve the source orientation estimates
  • determining the DOA of an acoustic source k based on a column ⁇ :k of the DOA tensor i.e. a DOA vector for any source k is then obtained from the column ⁇ :k of the DOA matrix.
  • the source localization method summarized above could further include steps e’) and e’’) to be iterated together with steps d) ⁇ g), steps e’) and e’’) being as follows:
  • each element W ftk of the weight tensor is configured to comprise a real value between 0 and 1 indicative of the degree to which acoustic source k is active in the (f,t) th bins of the TF tensor X (i.e. indicating a percentage of energy in the (f,t) th bin for each of M microphones that is attributable to the acoustic signal generated by the acoustic source k), and
  • the one or more predefined criteria may include a predefined threshold value indicating improvement, e.g. percentage improvement, of a likelihood value indicating how well the estimated orientations in ⁇ explain the observed data given the assumed data model (see equation (9)).
  • a predefined threshold value indicating improvement, e.g. percentage improvement, of a likelihood value indicating how well the estimated orientations in ⁇ explain the observed data given the assumed data model (see equation (9)).
  • Example 1 provides a method for determining a direction of arrival (DOA) of an acoustic signal generated by an acoustic source k of K acoustic sources, the DOA indicating a DOA of the acoustic signal at a microphone array including M microphones, each of K and M being an integer equal to or greater than 2, the method including: a) determining a time ⁇ frequency (TF) tensor of FxTxM dimensions, where F is an integer indicating a number of frequency components f and T is an integer indicating a number of time frames t, the TF tensor including a TF representation of each of M digitized signal streams x, each digitized stream corresponding to a combined acoustic signal captured by one of M microphones of the microphone array; b) initializing a DOA matrix of dimensions 3xK, the DOA matrix including estimated DOA information for each of the K acoustic sources; c) based on values of the TF ten
  • Example 2 provides the method according to Example 1, where each element X ftm of the TF tensor is configured to include a complex value indicative of measured magnitude and phase of a portion of a digitized stream x corresponding to a frequency component f at a time frame t for a microphone m.
  • Example 3 provides the method according to Examples 1 or 2, where each element ⁇ ik of the DOA matrix is configured to include a real value indicative of orientation of the acoustic source k with respect to the microphone array in dimension i.
  • Example 4 provides the method according to any one of the preceding Examples, where each element R m1m2f of the correlation tensor is configured to include a complex value indicative of correlation between a portion of the digitized stream x as acquired by microphone m1 and a portion of the digitized stream x as acquired by microphone m2 for a particular frequency component f.
  • Example 5 provides the method according to any one of the preceding Examples, where each element A mkf of the steering tensor is configured to include a complex value indicative of a magnitude and a phase response of a microphone m to an acoustic source k at a frequency component f.
  • Example 6 provides the method according to any one of the preceding Examples, where each element B m1m2f of the projector tensor is configured to include a complex value indicative of a set of data vectors X ft: that correspond to localizable signals with steering matrix A ::f at a frequency component f.
  • Example 7 provides the method according to any one of the preceding Examples, where each element G ik of the DOA gradient matrix is configured to include a real value indicative of an estimated change in the DOA tensor for improving orientation estimate of the acoustic source k.
  • Example 8 provides the method according to any one of the preceding Examples, further including: e’) based on values of the projector tensor and values of the TF tensor, computing a TF weight tensor of dimensions FxTxK, where each element W ftk of the TF weight tensor is configured to include a real value between 0 and 1 indicative of
  • Example 9 provides the method according to Example 8, where computing the TF weight tensor includes using a Wiener mask.
  • Example 10 provides the method according to Example 8, where computing the TF weight tensor includes using a Wiener mask and defining source ⁇ specific correlation matrices in terms of posterior probabilities using a Wiener mask.
  • Example 11 provides the method according to any one of the preceding Examples, where the iterations are performed until one or more predefined criteria are met.
  • Example 12 provides a method for identifying a first direction of arrival of sound waves from a first acoustic source and a second direction of arrival of sound waves from a second acoustic source, the method including: receiving, at a microphone array, acoustic signals including the sound waves from the first and second acoustic sources; converting the received acoustic signals from a time domain to a time ⁇ frequency domain; processing the converted acoustic signals to determine an estimated first angle representing the first direction of arrival and an estimated second angle representing the second direction of arrival; and updating the estimated first and second angles; where processing includes localizing, separating and Wiener post ⁇ filtering the converted acoustic signals using time ⁇ frequency weighting and outputting a time ⁇ frequency weighted signal for estimating the first and second angles.
  • Example 13 provides the method according to Example 12, further including combining the time ⁇ frequency weighted signal with the converted acoustic signals to generate a correlation matrix.
  • Example 14 provides the method according to Example 13, where updating the estimated first and second angles includes utilizing the correlation matrix and the estimated first and second angles and outputting updated estimated first and second angles.
  • Example 15 provides the method according to Example 12, where converting the received acoustic signals from a time domain to a time ⁇ frequency domain includes using a short time Fourier transform.
  • Example 16 provides the method according to Example 12, where processing the converted acoustic signals to determine the estimated first and second angles includes decomposing the converted acoustic signals to identify signals from each of the first and second acoustic sources by accounting for interference between the first and second acoustic sources in forming the acoustic signals.
  • Example 17 provides the method according to Example 12, where processing the converted acoustic signals and updating the first and second estimated angles includes iteratively decomposing the converted acoustic signals to simultaneously determine the first and second directions of arrival.
  • Example 18 provides the method according to Example 12, where processing the converted acoustic signals includes processing using steered response power localization.
  • Example 19 provides the method according to Example 12, further including using an inverse STFT to convert the processed converted acoustic signals back into the time domain and separating the sound waves from the first acoustic source from the sound waves from the second acoustic source.
  • Example 20 provides a system comprising means for implementing the method according to any one of the preceding Examples.
  • Example 21 provides a data structure for assisting implementation of the method according to any one of the preceding Examples.
  • Example 22 provides a system for determining a DOA of an acoustic signal generated by an acoustic source k of K acoustic sources, the DOA indicating a DOA of the acoustic signal at a microphone array comprising M microphones, each of K and M being an integer equal to or greater than 2, the system including at least one memory element configured to store computer executable instructions, and at least one processor coupled to the at least one memory element and configured, when executing the instructions, to carry out the method according to any one of Examples 1 ⁇ 11.
  • Example 23 provides one or more non ⁇ transitory tangible media encoding logic that include instructions for execution that, when executed by a processor, are operable to perform operations for determining a DOA of an acoustic signal generated by an acoustic source k of K acoustic sources, the DOA indicating a DOA of the acoustic signal at a microphone array comprising M microphones, each of K and M being an integer equal to or greater than 2, the operations comprising operations of the method according to any one of Examples 1 ⁇ 11.
  • Example 24 provides a system for identifying a first direction of arrival of sound waves from a first acoustic source and a second direction of arrival of sound waves from a second acoustic source, the system including at least one memory element configured to store computer executable instructions, and at least one processor coupled to the at least one memory element and configured, when executing the instructions, to carry out the method according to any one of Examples 12 ⁇ 19.
  • Example 25 provides one or more non ⁇ transitory tangible media encoding logic that include instructions for execution that, when executed by a processor, are operable to perform operations for identifying a first direction of arrival of sound waves from a first acoustic source and a second direction of arrival of sound waves from a second acoustic source, the operations comprising operations of the method according to any one of Examples 12 ⁇ 19.
  • any number of electrical circuits used to implement the systems and methods of the FIGURES may be implemented on a board of an associated electronic device.
  • the board can be a general circuit board that can hold various components of the internal electronic system of the electronic device and,
  • the board can provide the electrical connections by which the other components of the system can communicate electrically.
  • Any suitable processors (inclusive of digital signal processors, microprocessors, supporting chipsets, etc.), computer ⁇ readable non ⁇ transitory memory elements, etc. can be suitably coupled to the board based on particular configuration needs, processing demands, computer designs, etc.
  • Other components such as external storage, additional sensors, controllers for audio/video display, and peripheral devices may be attached to the board as plug ⁇ in cards, via cables, or integrated into the board itself.
  • the functionalities described herein may be implemented in emulation form as software or firmware running within one or more configurable (e.g., programmable) elements arranged in a structure that supports these functions.
  • the software or firmware providing the emulation may be provided on non ⁇ transitory computer ⁇ readable storage medium comprising instructions to allow a processor to carry out those functionalities.
  • the systems and methods of the FIGURES may be implemented as stand ⁇ alone modules (e.g., a device with associated components and circuitry configured to perform a specific application or function) or implemented as plug ⁇ in modules into application specific hardware of electronic devices.
  • SOC system on chip
  • An SOC represents an IC that integrates components of a computer or other electronic system into a single chip. It may contain digital, analog, mixed ⁇ signal, and often radio frequency functions: all of which may be provided on a single chip substrate.
  • MCM multi ⁇ chip ⁇ module
  • the identification, localization and separation functionalities may be implemented in one or more silicon cores in
  • ASICs Application Specific Integrated Circuits
  • FPGAs Field Programmable Gate Arrays
  • the features discussed herein can be applicable to medical systems, scientific instrumentation, wireless and wired communications, radar, industrial process control, audio and video equipment, current sensing, instrumentation (which can be highly precise), and other digital ⁇ processing ⁇ based systems.
  • certain embodiments discussed above can be provisioned in digital signal processing technologies for medical imaging, patient monitoring, medical instrumentation, and home healthcare. This could include pulmonary monitors, accelerometers, heart rate monitors, pacemakers, etc. Other applications can involve automotive technologies for safety systems (e.g., stability control systems, driver assistance systems, braking systems, infotainment and interior applications of any kind). Furthermore, powertrain systems (for example, in hybrid and electric vehicles) can use high ⁇ precision data conversion products in battery monitoring, control systems, reporting controls, maintenance activities, etc.
  • the teachings of the present disclosure can be applicable in the industrial markets that include process control systems that help drive productivity, energy efficiency, and reliability.
  • the teachings of the signal processing circuits discussed above can be used for image processing, auto focus, and image stabilization (e.g., for digital still cameras, camcorders, etc.).
  • Other consumer applications can include audio and video processors for home theater systems, DVD recorders, and high ⁇ definition televisions.
  • Yet other consumer applications can involve advanced touch screen controllers (e.g., for any type of portable media device).
  • such technologies could readily part of smartphones, tablets, security systems, PCs, gaming technologies, virtual reality, simulation training, etc.
  • references to various features e.g., elements, structures, modules, components, steps, operations, characteristics, etc.
  • references to various features e.g., elements, structures, modules, components, steps, operations, characteristics, etc.
  • a system that can include any suitable circuitry, dividers, capacitors, resistors, inductors, ADCs, DFFs, logic gates, software, hardware, links, etc.) that can be part of any type of computer, which can further include
  • the system can include means for clocking data from the digital core onto a first data output of a macro using a first clock, the first clock being a macro clock; means for clocking the data from the first data output of the macro into the physical interface using a second clock, the second clock being a physical interface clock; means for clocking a first reset signal from the digital core onto a reset output of the macro using the macro clock, the first reset signal output used as a second reset signal; means for sampling the second reset signal using a third clock, which provides a clock rate greater than the rate of the second clock, to generate a sampled reset signal; and means for resetting the second clock to a predetermined state in the physical interface in response to a transition of the sampled reset signal.
  • the ‘means for’ in these instances can include (but is not limited to) using any suitable component discussed herein, along with any suitable software, circuitry, hub, computer code, logic, algorithms, hardware, controller, interface, link, bus, communication pathway, etc.
  • the system includes memory that further comprises machine ⁇ readable instructions that when executed cause the system to perform any of the activities discussed above.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method for identifying the direction of arrival of sound waves from first and second acoustic sources is disclosed. The method includes receiving, at a microphone array, acoustic signals including the sound waves from the first and second acoustic sources, converting the received acoustic signals from a time domain to a time‐frequency domain, processing the converted acoustic signals to determine an estimated first angle representing the first direction of arrival and an estimated second angle representing the second direction of arrival, and updating the estimated first and second angles, where processing includes localizing, separating and Wiener post‐filtering the converted acoustic signals using time‐frequency weighting.

Description

 
SYSTEMS AND METHODS FOR SOURCE LOCALIZATION AND SEPARATION 
 
CROSS‐REFERENCE TO RELATED APPLICATIONS 
[0001] This application claims the benefit of and priority from U.S. Provisional  Patent Application Serial No. 62/093,903 filed 18 December 2014 entitled “SYSTEMS AND  METHODS FOR SOURCE LOCALIZATION AND SEPARATION,” which is incorporated herein  by reference in its entirety. 
 
TECHNICAL FIELD OF THE DISCLOSURE 
[0002] The present invention relates to the field of signal processing, and in  particular to source localization and/or separation. 
 
BACKGROUND 
[0003] Use of spoken input for user devices, including smartphones, automobiles,  etc., can be challenging due to the fact that, typically, an acoustic environment in which a  desired signal from a speaker is acquired also contains undesired signals from other  acoustic sources.  In such an environment, an acoustic sensor acquires an acoustic signal  that has contributions from a plurality of different acoustic sources, where, as used  herein, the term “contribution of an acoustic source” refers to at least a portion of an  acoustic signal generated by a particular acoustic source, typically the portion being a  portion of a particular frequency or a range of frequencies, at a particular time or range  of times.  When an acoustic source is e.g. a person speaking, there will be multiple  contributions, i.e. there will be acoustic signals of different frequencies at different times  generated by such a “source.”   
[0004] In a process generally referred to as “source separation,” various digital  signal processing techniques are used to recover the original component signals  attributable to different sources from a combined signal acquired by the acoustic sensor  (i.e. from the acquired signal that has a combination of contributions from different  sources).  A process of performing source separation without any prior information about 
   
the acoustic signals is often referred to as “blind source separation” (BSS).  Source  separation can often be improved by processing acoustic signals acquired by multiple  acoustic sensors, arranged e.g. in a sensor array, e.g. a microphone array.  In such  scenarios, each acoustic sensor acquires a corresponding signal that includes 
contributions from multiple sources and comparison of the signals acquired by different  acoustic sensors provides an insight into individual contributions of the different sources.   
[0005] In general, the term “source localization” refers to a process of 
determining spatial position of a particular source within a given environment.  Various  digital signal processing techniques usually use the term “Direction of Arrival” (DOA) to  describe a parameter that indicates direction from which the signal generated by a  particular source arrived, thus localizing the source within the environment.   
[0006] Sound source localization and separation is used in many applications,  including, for example, signal enhancement and noise cancellation for phones or hearing  aids, speech recognition, home automation, and voice user interface in the car or home. 
[0007] Typically, various source separation techniques use DOA in order to  recover signals attributable to one or more of the individual sources.  Thus, source  localization typically precedes, or may be considered a part of, source separation.  For  example, many well‐known source separation approaches use beamforming, i.e. signal  processing techniques used to control the directionality of the reception of a signal, by  employing arrays of acoustic sensors that aim to improve directional gain of the sensor  array(s) by increasing the gain in the direction of a source of interest (e.g. a speaker) and  decreasing the gain in the direction of interferences and noise.  Beamforming techniques  use information about the DOA of the source, and, therefore, are preceded by 
localization step where location of the source in the environment is determined or  estimated.    
[0008] One known approach for finding the DOAs is Steered Response Power  (SRP) localization, which searches for peaks in the output power of a family of 
beamformers as a function of the DOA.  In one example, each beamformer in a family of  beamformers focuses on a specific direction.  SRP localization can be used with a 
Maximum‐Likelihood (ML) formulation.  Another known approach computes the     
Generalized Cross‐Correlation (GCC) function, which can be used with a spectral  weighting function such as Phase Transform (PHAT) to enhance the localizer.   
[0009] A different known method for finding DOAs uses eigenanalysis of the data  correlation matrix.  For example, a Multiple Signal Classification (MUSIC) algorithm uses  this method to identify signal and noise subspaces and form a MUSIC pseudospectrum  that contains peaks at the source DOAs.  The MUSIC pseudospectrum plots direction on  the x‐axis and likelihood of that direction as being the source of a sound on the y‐axis,  and thus is a function over the space of directions which indicates where sources are  likely to be.   
[0010] Another known method includes modeling observed data vectors as zero‐ means Gaussian random variables and using an EM algorithm to learn the sources’  covariance parameters.  The sources can be separated using multichannel Wiener  filtering.  According to some implementations, multichannel Wiener filtering can be used  separate source signals from background noise.  In some implementations, multichannel  Wiener filtering can be used to separate speech signals from each other.  In one example,  in a multichannel case, in which there are multiple channels and multiple source signals,  the output of the multichannel Wiener filter includes multiple sources and includes a  correlation matrix that describes how the channels are correlated.  The multichannel  Wiener filter reconstructs source vectors directly.   
[0011] The methods discussed above are sequential methods: first the DOA is  estimated and the source is localized, and then the signal is separated from other signals  and from background noise.  One approach for simultaneously localizing and separating  various sounds sources uses Bayesian analysis.  However, Bayesian analysis uses prior  information about the sources, which may not always be available.  For example,  Bayesian analysis requires prior information about the magnitudes of the sources.  
[0012] As the foregoing illustrates, improvements with respect to source  localization and separation are desired. 
 
   
   
 
OVERVIEW   
[0013] A more effective and efficient method for localizing and separating signals  is provided, and involves interpreting the SRP function as a probability distribution and  maximizing it as a function of the source DOAs.  In one method, a mixture of single‐ source SRPs (MoSRP) is used.  In a second method, an SRP that explicitly models the  presence of multiple sources is provided (MultSRP).  Some advantages of the second  method include simultaneous localization of each of the multiple sources and explicit  modeling of interference between sources.  Time‐Frequency (TF) masking is used to  isolate TF bins, described in greater detail below, that correspond to directional signals of  interest, thereby merging the localization, separation and Wiener post‐filtering steps into  one unified approach. 
[0014] According to some embodiments, an improved type of Wiener filter may  be used for estimating a weight for each of multiple TF bins for each of multiple sources.   The weight estimates for each time‐frequency bin can be used to determine which bins  contain source energy and which bins do not contain source energy.  Bins which do not  contain source energy may still contain energy, for example, noise.  For each time  frequency bin, a Wiener filter coefficient is estimated, where the Wiener filter coefficient  corresponds to the probability that any of the directional sources are present. 
[0015] According to one aspect, a method is provided for identifying a first  direction of arrival of sound waves (i.e. acoustic signals) from a first acoustic source and a  second direction of arrival of sound waves from a second acoustic source.  The methods  includes receiving, at a microphone array, acoustic signals including a combination of the  sound waves from the first and second acoustic sources, converting the received acoustic  signals from a time domain to a time‐frequency domain, processing the converted  acoustic signals to determine an estimated first angle representing the first direction of  arrival and an estimated second angle representing the second direction of arrival, and  updating the estimated first and second angles.  The processing includes localizing,  separating and Wiener post‐filtering the converted acoustic signals using time‐frequency  weighting and outputting a time‐frequency weighted signal for estimating the first and 
   
second angles.  In one example, converting the received acoustic signals from a time  domain to a time‐frequency domain includes using a short time Fourier transform.   
[0016] According to some implementations, the method includes combining the  time‐frequency weighted signal with the converted acoustic signals to generate a  correlation matrix.  In some implementations, updating the estimated first and second  angles comprises utilizing the correlation matrix and the estimated first and second  angles and outputting updated estimated first and second angles. 
[0017] According to some implementations, processing the converted acoustic  signals to determine the estimated first and second angles includes decomposing the  converted acoustic signals to identify signals from each of the first and second acoustic  sources by accounting for interference between the first and second acoustic sources in  forming the acoustic signals.  In some implementations, processing the converted  acoustic signals and updating the first and second estimated angles includes iteratively  decomposing the converted acoustic signals to simultaneously determine the first and  second directions of arrival.  In one example, processing the converted acoustic signals  includes processing using steered response power localization. 
[0018] According to some implementations, the method further includes using an  inverse STFT to convert the processed converted acoustic signals back into the time  domain and separating the sound waves from the first acoustic source from the sound  waves from the second acoustic source. 
[0019] As will be appreciated by one skilled in the art, aspects of the present  disclosure may be embodied in various manners – e.g. as a method, a system, a  computer program product, or a computer‐readable storage medium.  Accordingly,  aspects of the present disclosure may take the form of an entirely hardware 
embodiment, an entirely software embodiment (including firmware, resident software,  micro‐code, etc.) or an embodiment combining software and hardware aspects that may  all generally be referred to herein as a "circuit," "module" or "system."  Functions  described in this disclosure may be implemented as an algorithm executed by one or  more processing units, e.g. one or more microprocessors, of one or more computers.  In  various embodiments, different steps and portions of the steps of each of the methods     
described herein may be performed by different processing units.  Furthermore, aspects  of the present disclosure may take the form of a computer program product embodied in  one or more computer readable medium(s), preferably non‐transitory, having computer  readable program code embodied, e.g., stored, thereon.  In various embodiments, such a  computer program may, for example, be downloaded (updated) to the existing devices  and systems (e.g. to the existing radar or sonar receivers or/and their controllers, etc.) or  be stored upon manufacturing of these devices and systems. 
[0020] Other features and advantages of the disclosure are apparent from the  following description, and from the claims. 
 
BRIEF DESCRIPTION OF THE DRAWING 
[0021] To provide a more complete understanding of the present disclosure and  features and advantages thereof, reference is made to the following description, taken in  conjunction with the accompanying figures, wherein like reference numerals represent  like parts, in which:     
[0022] FIGURE 1 is a diagram illustrating an audio processor receiving signals from  multiple sources, according to some embodiments of the disclosure; 
[0023] FIGURE 2 is a diagram illustrating a method for identifying a first direction  of arrival of sound waves from a first acoustic source and a second direction of arrival of  sound waves from a second acoustic source, according to some embodiments of the  disclosure; 
[0024] FIGURE 3 is one diagram illustrating a method for separating and localizing  signals, according to some embodiments of the disclosure; 
[0025] FIGURE 4 is a diagram illustrating two data vectors from two sources and  the combination of the two data vectors, according to some embodiments of the  disclosure;  
[0026] FIGURE 5A is a diagram illustrating single‐source likelihood over DOAs,  according to some embodiments of the disclosure;  
   
[0027] Figure 5B is a diagram illustrating a multi‐source SRP likelihood for a data  mixture of two sources over a joint space of all DOA pairs, according to some 
embodiments of the disclosure; and 
[0028] Figure 6 is another diagram illustrating a MultSRP method for separating  and localizing signals, according to some embodiments of the disclosure. 
 
DESCRIPTION OF EXAMPLE EMBODIMENTS OF THE DISCLOSURE 
[0029] FIGURE 1 is a diagram 100 illustrating an audio processor 102 receiving  signals from first 104a, second 104b, and third 104n sources, according to some  embodiments of the disclosure.  The audio processor 102 includes a microphone array  106, a direction finding module 108, a source separating module 110, and an audio  processing module 112.   
[0030] The microphone array 106 receives (i.e. acquires) a combined sound,  referred to in the following as “ambient sound,” including the signals from the first 104a,  second 104b and third 104n sources.  In other examples, the microphone array 106  receives ambient sound includes signals from more than three sources, and there may be  any number of sources present.   
[0031] The microphone array 106 may include one or more acoustic sensors,  arranged e.g. in a sensor array, each sensor of the array configured to acquire an ambient  sound (i.e., each acoustic sensor acquires a corresponding signal).  In some embodiments  where a plurality of acoustic sensors are employed, the sensors may be provided  relatively close to one another, e.g. less than 2 centimeters (cm) apart, preferably less  than 1 cm apart.  In an embodiment, the sensors may be arranged separated by distances  that are much smaller, on the order of e.g. 1 millimeter (mm) or about 300 times smaller  than typical sound wavelength, where beamforming techniques, used e.g. for 
determining DOA of an acoustic signal, do not apply.  In other embodiments, the sensors  may be provided at larger distances with respect to one another.   
[0032] While some embodiments where a plurality of acoustic sensors are  employed make a distinction between the signals acquired by different sensors (e.g. for  the purpose of determining DOA by e.g. comparing the phases of the different signals),     
other embodiments may consider the plurality of signals acquired by an array of acoustic  sensors as a single signal, possibly by combining the individual acquired signals into a  single signal as is appropriate for a particular implementation.  Therefore, in the  following, when an “acquired signal” is discussed in a singular form, then, unless  otherwise specified, it is to be understood that the signal may comprise several acquired  signals acquired by different sensors of the microphone array 106. 
[0033] Different source localization and separation techniques presented herein  are based on computing time‐dependent spectral characteristics X of the signal acquired  by the microphone array 106.  A characteristic could e.g. be a quantity indicative of a  magnitude of the acquired signal.  A characteristic is “spectral” in that it is computed for  a particular frequency or a range of frequencies.  A characteristic is “time‐dependent” in  that it may have different values at different times.   
[0034] In an embodiment, such characteristics may be a Short Time Fourier  Transform (STFT), computed as follows.  An acquired signal is functionally divided into  overlapping blocks, referred to herein as “frames.”  For example, frames may be of a  duration of 64 milliseconds (ms) and be overlapping by e.g. 48 ms.  The portion of the  acquired signal within a frame is then multiplied with a window function (i.e. a window  function is applied to the frames) to smooth the edges.  As is known in signal processing,  and in particular in spectral analysis, the term “window function” (also known as tapering  or apodization function) refers to a mathematical function that has values equal to or  close to zero outside of a particular interval.  The values outside the interval do not have  to be identically zero, as long as the product of the window multiplied by its argument is  square integrable, and, more specifically, that the function goes sufficiently rapidly  toward zero.  In typical applications, the window functions used are non‐negative smooth  "bell‐shaped" curves, though rectangle, triangle, and other functions can be used.  For  instance, a function that is constant inside the interval and zero elsewhere is called a  “rectangular window,” referring to the shape of its graphical representation.  Next, a  transformation function, such as e.g. Fast Fourier Transform (FFT), is applied 
transforming the waveform multiplied by the window function from a time domain to a  frequency domain.  As a result, a frequency decomposition of a portion of the acquired     
signal within each frame is obtained.  The frequency decomposition of all of the frames  may be arranged in a matrix where frames and frequency are indexed (in the following,  frames are described to be indexed by “t” and frequencies are described to be indexed  by “f”).  Each element of such an array, indexed by (f, t) comprises a complex value  resulting from the application of the transformation function and is referred to herein as  a "time‐frequency (TF) bin” or simply “bin.”  The term “bin” may be viewed as indicative  of the fact that such a matrix may be considered as comprising a plurality of bins into  which the signal’s energy is distributed. 
[0035] Time‐frequency bins come into play in BSS algorithms in that separation of  a particular acoustic signal of interest (i.e. an acoustic signal generated by a particular  source of interest) from the total signal acquired by an acoustic sensor may be achieved  by identifying which bins correspond to the signal of interest, i.e. when and at which  frequencies the signal of interest is active.  Once such bins are identified, the total  acquired signal may be masked by zeroing out the undesired time‐frequency bins.  Such  an approach would be called a “hard mask.”  Applying a so‐called “soft mask” is also  possible, the soft mask scaling the magnitude of each bin by some amount.  Then an  inverse transformation function (e.g. inverse STFT) may be applied to obtain the desired  separated signal of interest in the time domain.  Thus, masking in the frequency domain  (i.e. in the domain of the transformation function) corresponds to applying a time‐ varying frequency‐selective filter in the time domain.  The desired separated signal of  interest may then be selectively processed for various purposes. 
[0036] In one example, each source 104a, 104b, 104n has a distinct location, and  the signal from each source 104a, 104b, 104n arrives at the microphone array 106 at an  angle relative to its source location.  Based on this angle, for each signal, the audio  processor 102 estimates a direction‐of‐arrival (DOA).  Thus, at the audio processor 102,  each source 104a, 104b, 104n has a DOA 114a, 114b, 114n.  The first source 104a has a  first DOA 114a, the second source 104b has a second DOA 114b, and the third source  104n has a third DOA 114n. 
[0037] The microphone array 106 is coupled to the direction finding module 108,  and the signals received at the microphone array 106 are transmitted to the direction     
finding module 108.  The direction finding module 108 estimates the DOAs 114a, 114b,  and 114n associated with source signals 104a, 104b, and 104n, as described in greater  detail below.  The direction finding module 108 is coupled to a separation masking  module 110, where the signals corresponding to the various sources 104a, 104b, 104n  are separated from each other and from background noise which may be present.  The  direction finding module 108 and the separation masking module 110 are each also  coupled to a further audio processing module 112, where further processing of the  acoustic signals occurs.  The further audio processing may depend on the application, and  may include, for example, enhancing one or more speech signals, and filtering out  constant noise or repetitive sounds. 
[0038] In traditional array processing, linear filtering algorithms are used for  enhancing directional signals.  In particular, beamforming is used to constructively add  signals received at microphones in the array and suppress noise.  There are several  different methods for beamforming including Delay‐and‐Sum (DS) beamforming, 
Minimum Variance Distortionless Response (MVDR) beamforming, Linearly‐Constrained  Minimum‐Variance (LCMV) beamforming, and Multiple Signal Classification (MUSIC)  beamforming. 
[0039] In one example, Delay‐and‐Sum (DS) beamforming involves adding a time  delay to the signal recorded from each microphone that cancels out the delay caused by  the extra travel time that it took for the signal to reach the microphone (as opposed to  microphones that were closer to the signal source).  Summing the resulting in‐phase  signals enhances the signal.  This beamforming method can be used to estimate DOA by  testing various time delays, since the delay that correlates with the correct DOA will  amplify the signal, while incorrect time delays destructively interfere with the signal.  The  DS beamforming method focuses on the time domain to estimate DOA, and it is  inaccurate in noisy environments. 
[0040] In another example, Delay‐and‐Sum (DS) beamforming involves fractional  delays in the frequency domain.  Generally, when a small microphone is array is used, the  received signals are processed by measuring the fractional delays in the signals,  weighting each channel by a complex coefficient, and adding up the results.  According to     
one implementation, DS beamforming is used in processing received signals in the single  source model described below. 
[0041] MVDR beamforming is similar to DS beamforming, but takes into account  statistical noise correlations between the channels.   
[0042] A Fourier transform can be used to transform the time domain signal into  the time‐frequency plane by converting time delays between sensors into phase shifts.   MVDR beamforming provides good noise suppression by minimizing the output power of  the array while not distorting signals from the primary DOA, but it has a power defined  by a matrix inversion, and is therefore computationally intensive.  The MVDR 
beamformer solution is: 
Figure imgf000012_0001
        (1)   
[0043] MVDR and DS beamformers are generalized to the Multi‐source case via a  multiply‐constrained optimization problem, and the solution is the Linearly‐Constrained  Minimum‐Variance (LCMV) beamformer.  In particular, in the LCMV beamformer, the  weight vector can be used to determine how to weight the channels in the time‐ frequency plane to preserve energy from desired directions and suppress energy from  other directions:    
 
Figure imgf000012_0002
   
[0044] The beamformers discussed above can be used to estimate the  coefficients when the source DOA(s)  Φ are already known.  Thus, in systems using the  LCMV, MVDR and DS beamforming methods described above, first the source DOA(s) are  determined and then beamforming is performed.  The source DOA(s) may be determined  using, for example, Steered Response Power Localization as described below. 
[0045] The MUSIC beamformer is a subspace method based on an eigenanalysis  of the covariance matrix.  The MUSIC beamformer requires an eigendecomposition.   Additionally, MUSIC is based on the assumption that the subspace that the signals lie in is 
   
orthogonal to the space in which the noise lies.  In one example, the MUSIC beamformer  decomposes a covariance matrix representing the signal and noise of the received signal. 
[0046] Steered Response Power (SRP) Localization is used to estimate source  DOA(s)  Φ.  In some examples, SRP localization is used to estimate DOA’s by discretizing  the direction space.  In particular, source DOAs  Φ estimated by SRP Localization can be  input in the LCMV beamforming equation (2) above.  SRP localization identifies DOAs  Φ  by searching for peaks in the output power of a single‐source beamformer.   
[0047] When multiple sources are present, there may be multiple peaks in the  SRP.  However, in compact microphone arrays or closely spaced microphone arrays, close  spacing of the microphone elements makes the steering vectors hard to distinguish, and  thus low frequency peaks are poorly localized.  Additionally, if the source coefficients are  simultaneously large in magnitude, the SRP function is distorted by cross‐terms.  
[0048] A more accurate and effective approach is to scan all DOA sets  Φ using an  LCMV beamformer and locate the peak output power.  However, this is computationally  inefficient and too time‐consuming for real‐time feedback, since discretizing the DOA  search space into D look directions results in DK  Φ‘s to be scanned (where K is the  number of sources present).  Instead, according to one implementation, the multi‐source  SRP function is modeled as a continuous likelihood function parametrized by  Φ and the  likelihood function is maximized to identify source DOAs.  
[0049] FIGURE 2 is a diagram illustrating a method 200 for identifying a first  direction of arrival of sound waves from a first acoustic source and a second direction of  arrival of sound waves from a second acoustic source.  The method includes, at step 202,  receiving, at a microphone array, acoustic signals including the sound waves from the  first and second acoustic sources.  At step 204, the received acoustic signals, now  represented by electrical signals generated by the microphone array, are converted from  a time domain to a time‐frequency domain.  At step 206, the converted acoustic signals  are processed to determine an estimated first angle representing the first direction of  arrival and an estimated second angle representing the second direction of arrival.   Processing includes localizing, separating and Wiener post‐filtering the converted  acoustic signals using time‐frequency weighting and outputting a time‐frequency     
weighted signal for estimating the first and second angles.  At step 208, the estimated  first and second angles are updated.  According to one feature, the likelihood of the first  and second angles is determined integrally as a single unit from the mixed signals  received at the microphone array, rather than maximizing the likelihood of each of the  first and second angles separately. 
[0050] FIGURE 3 is a diagram illustrating a method 300 for separating and  localizing signals, according to some embodiments of the disclosure.  As shown in Figure  3, the method 300 is an iterative approach in which a probabilistic SRP model is  combined with time‐frequency masking to perform blind source separation and  localization in the presence of non‐stationary interference.  In particular, the method has  an iterative loop 306 including a Time‐Frequency (TF) weighting step 308, a correlation  matrices step 310, and a direction of arrival (DOA) update step 312. 
[0051] The method 300 begins with receiving input acoustic signals x 302  acquired by different microphone elements of the microphone array 106.  As described  above, each acquired signal 302 may, and typically will, include contributions from  multiple sources 104a‐104n and a goal of source separation is to distinguish these  individual contributions on a per‐source basis.   
[0052] The acquired input acoustic signals 302 are processed through an STFT 304  to transform the signals from the time domain to the time‐frequency plane.   
[0053] The output X from the STFT 304 is input to the TF weighting step 308 and  to the correlation matrices step 310.  The TF weighting step 308 uses TF masking to  isolate TF bins that correspond to selected directional signals.  In particular, some  directional signals are identified as being directional signals of interest, and the  corresponding TF bins are isolated.  Identifying the directional signal or signals of interest  may include separating identified signals, and selecting one (or more) of the separated  signals.  In one example, the selected signal corresponds to a speech signal, and it may be  the speech of a particular speaker.    
[0054] In one example, the selected directional signals are identified based on  peaks in output power.  The TF weighting step 308 receives the output signals from the  STFT step 304 as well as a DOA set  Θ (DOA matrix) from the DOA update step 312, and     
uses these inputs to perform TF weighting as described in greater details in Equations 3‐ 17.  Thus, the localization, separation, and Wiener post‐filtering steps are merged into  the TF weighting step 308. 
[0055] The output  ^ from the TF weighting step 308 is input into the correlation  matrices step 310.  The correlation matrices step 310 combines the TF weighted input  and data output from the STFT 304. The correlation matrices step 310 uses the inputs to  derive correlation matrices as described in greater detail below with respect to equations  15 and 16, and outputs an updated correlation matrix R to the DOA update step 312.  The  DOA update step revises the set of DOA’s  Θ based on the input correlation matrix R, and  outputs the updated DOA’s  Θ to the TF weighting step 308. 
[0056] Following the iterative loop 306 of the method 300 shown in Figure 3, an  output set of DOA’s  Θ indicating the localization results is output from the DOA update  step 312 to a final separation step 314.  The separation step 314 also receives the STFT  processed data x as input.  At the separation step 314, the set of DOA’s  Θ is used to  separate out the signals in the data x and generates an STFT matrix for each source.  The  STFT matrices are processed with an inverse STFT at step 316, which transforms each one  into a time domain signal.  The time domain signals 318 output from step 316 are  localized, separated and post‐filtered output signals.  
[0057] According to various implementations, the method 300 is performed using  the following equations.   
[0058] According to some implementations a first method for maximizing the SRP  as a function of the source DOA’s uses an SRP that explicitly models the presence of  multiple sources. 
[0059] Identifying the DOAs involves maximizing a likelihood function:  
Figure imgf000015_0001
[0060] where x is the STFT coefficients of the data from the microphone array,  and  θ1 and  θ2 are estimated DOA angles.  A Gaussian likelihood for the observed data  vectors xft is: 
Figure imgf000015_0002
   
[0061] where the mean  μft encodes the expected value of xft, and σ2
f represents  the variance of the background noise at frequency f, and I is the identity matrix, and: 
Figure imgf000016_0003
[0062] for a hypothesized DOA set  Θ, where Af is the steering matrix including  the observed mixing vectors af as elements, and sft is a vector of complex source  coefficients for a time‐frequency bin with one component for each source.  The  expectation E[sft] can be approximated with a least squares estimate:  
Figure imgf000016_0004
[0063] which is the output of a LCMV beamformer with Rf  σI, where H is a  Hermitian transpose.  In some implementations, there can be regularization within the  brackets in equation (6) to make sure that the matrix inverse is well‐conditioned.  And  therefore: 
Figure imgf000016_0005
[0064] where: 
Figure imgf000016_0006
[0065] Thus, the likelihood of a particular DOA set  Θ is: 
Figure imgf000016_0001
 
[0066] where, in the log domain above, the proportionality sign  ^ means equality  up to an additive constant (rather than up to a multiplicative constant). This can be  aggregated over time t and ex anded: 
Figure imgf000016_0002
[0067] Using the above equations, the DOAs of signals from multiple sources can  be efficiently determined more accurately than by previous methods. 
[0068] FIGURE 4 is a diagram 400 illustrating a first 402 and second 404 data  vectors from first and second sources and the combination 406 of the two data vectors 
   
402 and 404, according to some embodiments of the disclosure.  The diagram 400  illustrates the additivity of the first 402 and second 404 data vectors.  As illustrated in  Figures 5A and 5B, due to interference between the first 402 and second 404 data  vectors, a spurious peak in the single source likelihood is present between the true DOAs.  If a single‐source likelihood is calculated for the superposition of the first 402 and second  404 data vectors, the single source likelihood will indicate the likelihood of a single  source at the combination data vector 406 source. This is illustrated in FIGURE 5A, which  shows is a diagram 500 illustrating single‐source likelihood over DOAs, according to some  embodiments of the disclosure.  In particular, the diagram 500 shows the single source  likelihood, with a peak indicating a DOA around 1.3‐1.4 radians.  Thus, the single source  likelihood equation estimates a single source positioned between the first and second  sources, rather than the two separate sources. 
[0069] Figure 5B is a diagram 550 illustrating a multi‐source SRP likelihood for a  data mixture of two sources over a joint space of all DOA pairs, according to some  embodiments of the disclosure.  The data shown in Figure 5B is derived using equation  (10) above, which estimated the first and second sources as having a DOAs at 0.56  radians and at 2.26 radians on the unit circle. 
[0070] According to some implementations a second method for maximizing the  SRP as a function of the source DOA’s uses a mixture of single‐source SRP’s. 
[0071] Maximum likelihood estimation of the source DOAs can be estimated  using a gradient ascent on the SRP likelihood shown above in equation 10: 
Figure imgf000017_0001
 
[0072] where  η is the step size, and  Ω is a function that normalizes the gradient,  which appears in parentheses after the  Ω.  The gradient indicates which direction  corresponds with an improvement in DOA estimates.  The step size  η is how far to move  in the indicated direction.   The maximum likelihood can be estimated for both the one‐ source model and the multiple source model.  For the one‐source model, the gradient  can be 
   
Figure imgf000018_0001
[0073] where  ^ ^denotes element‐wise multiplication, and m is the matrix of  microphone positions (a matrix in which the columns are the positions of the 
microphones).  The single source model can be used when multiple sources are present  by modeling the presence of the other sources at each time t with hidden variables zft  that capture which source is active at any selected time.  In one example, an Expectation‐ Maximization (EM) algorithm is used to iterate between estimating zft ‘s and DOAs: 
Figure imgf000018_0002
  (13) 
[0074] the lower bound of the EM algorithm is: 
Figure imgf000018_0003
[0075] where    
Figure imgf000018_0004
[0076] are source‐specific correlation matrices, and are defined in terms of the  posterior probabilities of the zft’s: 
Figure imgf000018_0005
 
[0077] The equations 13‐16 show one way to use the single source method of  equation 12 for multiple sources.  According to other implementations, equation 17 can  be used for localization of multiple sources.  According to one feature, in the E step, soft  TF weights are determined, and in the M step, each source’s DOA is optimized.  Thus, the  EM method alternates between estimating localization (DOA) parameters and estimating  separation (TF mask) parameters.  
   
[0078] According to one implementation, the gradient in the multiple source case  is: 
 
Figure imgf000019_0001
 
[0079] This multiple source case takes cross‐talk into account while avoiding the  complexity of the EM algorithm.   
[0080] The localization accuracy in the presence of ambient noise can be  improved using Wiener filtering.  This may be done in step 308 of the method 300 shown  in Figure 3.   In the presence of non‐directional interference: 
Figure imgf000019_0005
[0081] where bft = Af( Φ) sft and cft = nft + eft.  According to one example, the  MMSE‐optimal weighting to recover bft is given by the Wiener mask: 
 
Figure imgf000019_0002
 
[0082] Thus, a robust estimate of the correlation matrices is: 
 
Figure imgf000019_0003
[0083] The wei hts can be a roximated as usin  e uations  21  and  22 :    (21)   
Figure imgf000019_0004
[0084] According to one feature, interleaving the Wiener masking with DOA  optimization improves localization accuracy in the presence of ambient noise.  In some  implementations, for a mixture of one source models, the correlation matrices shown in 
   
Equation 15 can be estimated by multiplying the posteriors with the Wiener filter  weights.  
[0085] According to one implementation, the sources can be separated by  applying TF masks with weights.  In various examples, this may be done in one or more of  step 308 and step 314 of the method 300.  For example, the following equation can be  used:        (23) 
[0086] where the source coefficients       are recovered with LCMV beamforming.   The variance is related to the hardness of the mask, such that as the variance moves to  zero, the mask becomes binary.  The masks can be applied to corresponding components  of      and followed with a Wiener masking step to suppress non‐speech interference and  reduce the presence of masking artifacts. 
[0087] FIGURE 6 is a diagram illustrating a method 600 for separating and  localizing signals, according to some embodiments of the disclosure.  The method 600  may be considered as a summary, or an alternative representation, of the method 300  described above.  Therefore, in the interests of brevity, some steps illustrated in method  600 refer to steps illustrated in method 300 in order to not repeat all of the details of  their descriptions.   
[0088] The method 600 may be considered as including three main stages: stage  610 that may be referred to as a preprocessing stage, stage 620 that may be referred to  as an optimization stage, and stage 630 that may be referred to as a source separation  stage. 
[0089] As shown in FIGURE 6, the preprocessing stage 610 may include steps 612,  614, 616, and 618.  In step 612, acoustic signals are captured by the microphone array  106, as described above with reference to 302.  The captured signals 612 may be  considered as multiple discrete‐time signals  , where m is an  integer indicating a particular acoustic sensor of the microphone array 106 comprising M  acoustic sensors (i.e. m= 1; … ;M). 
   
[0090] In step 614, STFT is applied to the captured signals xm. in order to convert  the captured signals into the TF domain resulting in complex‐values matrices  
Figure imgf000021_0003
[0091] The magnitude  ortion of these matrices ma  be removed to give 
Figure imgf000021_0001
[0092] In step 616, correlation matrices are initialized by estimating correlation  matrices for each frequency as: 
Figure imgf000021_0004
[0093] In step 618, the DOA  arameter matrix  
Figure imgf000021_0002
is initialized with  Θ0 where  
Figure imgf000021_0005
is the unit vector describing the orientation of the kth acoustic source (k being an  integer between 1 and n for the acoustic sources 104 illustrated in FIGURE 1)  relative to the microphone array 106. 
[0094] The initialization of step 618 may be carried out in different manners,  including e.g. SRP localization described above. 
[0095] As shown in FIGURE 6, the initialized DOA matrix  Θ0 is provided to the  optimization stage 620.  As shown in FIGURE 6, the optimization stage 620 may include  steps 622, 624, 626, and 628 which may be iteratively repeated for a number of  iterations Imax, in order to improve the estimate of the DOA matrix  Θ (i.e. in order to  improve DOA estimates for the different acoustic sources 104).  The number of iterations  Imax may be determined by various stopping conditions.  For example, in some  embodiments, the maximum number of iterations may be pre‐defined, while, in other  embodiments, iterations may be performed until a certain condition is met, such as e.g. a 
   
pre‐specified threshold in the percentage improvement of the likelihood value given by  equation (9). 
[0096] In step 622, for each frequency, a steering matrix Af, described above with  reference to equation (5) and subsequent equations, is computed as: 
Figure imgf000022_0001
where lf is the frequency in Hertz at the fth frequency band, c  is the speed of sound, and 
Figure imgf000022_0002
is a matrix of microphone locations. 
[0097] For each frequency, a projector matrix may then be computed as shown  above with the equation (8). 
[0098] Steering matrices A and projection matrices B may then be, optionally,  provided to step 624.  In step 624, if Wiener masking described above is used, new  correlation matrices are re‐estimated as described above with reference to equations  (19)‐(20).  In an embodiments, equations (20) and (19) for re‐estimating the new  correlation matrices may be re‐written as equations (31) and (32) below: 
       
Figure imgf000022_0003
[0099] In step 626, a DOA gradient matrix may be computed as 
 
Figure imgf000022_0004
[00100] Equation (33) is an exemplary explicit equation for the gradient given in  equation (17) above. 
[00101] The columns of the gradient matrix given by the equation (33) are  normalized as: 
Figure imgf000022_0005
   
[00102] The gradient matrix G is provided to step 628 where the DOA matrix  Θ is  adjusted as described with reference to equation (11) above.  In particular, the DOA  matrix is adjusted as 
Figure imgf000023_0002
where the step size at the ith iteration is 
     
Figure imgf000023_0003
[00103] The columns of the DOA matrix may be normalized as: 
Figure imgf000023_0001
 
[00104] While equation (33) provides an explicit equation for the gradient given in  equation (17) above, step 628 describes the gradient procedure given an appropriate  gradient as given in equations (11) and (12) above. 
[00105] Step 624 may be performed as a part of 308 and 310 described above,  while step 628 corresponds to 312 described above. 
[00106] Updated DOA matrix  Θ is then provided to the source separation, as  illustrated in FIGURE 6 with  Θ provided to the separation stage 630 and as illustrated in  FIGURE 3 with an arrow from 306 to the final separation step 314. 
[00107] As shown in FIGURE 6, the source separation stage 630 may include steps  632 and 634.  Following the iterative procedure described above, any number of  methods may be used to enhance/separate the directional signals, all of which methods  are within the scope of the present disclosure.  In one embodiment, in step 632, each  source 104 may be isolated by estimating TF masks and applying them to the STFT X.  As  previously described herein, according to one implementation, the sources can be  separated by applying TF masks with weights, which could be done in one or more of  step 308 and step 314 of the method 300, using equation (23) provided above using  estimates        of the source coefficients provided by K LCMV beamformers, each  designated to isolate a single source while blocking out, or at least substantially  suppressing, the others.  In one embodiment, this may be implemented as: 
Figure imgf000023_0004
   
[00108] The variance  controls the hardness of the mask such that as  , the mask becomes binary, assigning each TF bin entirely to a single source.  [00109] In step 634, these masks are applied to any single captured signal (i.e. to  any signal captured by one of the acoustic sensors of the microphone array 106) and  inverted to the time‐domain using inverse STFT, as described above with reference to  316. 
[00110] The method 600 is presented for the case of an SRP that explicitly models  the presence of multiple sources, i.e.method 600 is a MultSRP method.  A method for the  mixture of single‐source SRPs (MoSRP) would include steps analogous to those illustrated  in FIGURE 6 with the main difference residing  in the gradients of the two methods, in  particular in how the correlation information is used (i.e. the difference between  MultSRP and MoSRP is in re‐computing the correlation matrices as is done in step 624  described above).  For MoSRP, step 624 would involve including posterior probability  weights in re‐computing the correlation matrices as in equation (15). Gradients for the  MoSRP method are given in equation (12).  
[00111] The methods for source localization and separation described above may  be summarized as follows.  In the following summary, third rank tensors are represented  with capital letters (e.g. X), while individual elements of a tensor are denoted with Xijk,  where “ijk” represents indices corresponding to those most appropriate for the tensor.   Sub‐matrices of the third rank tensors (i.e. second rank tensors, also referred to as  matrices) are denoted, for example, as X::k, which indicates that, in this example, only the  third index of the third rank tensor X is specified.  For sub‐matrices, sub‐vectors (i.e. first  rank tensors derived from the corresponding rank tensors, also referred to as vectors)  are similarly denoted as, for example, X:jk, indicting that e.g. only the second and third  index of the third rank tensor X is specified. 
[00112] Source localization refers to determining a DOA of an acoustic signal  generated by an acoustic source k of K acoustic sources 104‐1 through 104‐K, the DOA  indicating a DOA of the acoustic signal at a microphone array 106 comprising M  microphones.  Each of K and M could be an integer equal to or greater than 2.  M is  typically an integer on the order of 5, but, of course, in various implementations the     
value of integer M may be different.  K is typically an integer in the range [2,4].  Since in a  typical deployment scenario it is often not possible to know for sure how many acoustic  sources are present, value of K (i.e. the number of acoustic sources being modeled) is  estimated/selected based on various considerations that a person of ordinary skill in the  art would readily recognize, such as e.g. likely number of acoustic sources, an estimate  based on a source‐counting algorithm, or prior knowledge. 
[00113] In an embodiment, a source localization method may include steps of:  a) determining a time‐frequency (TF) tensor (X) of FxTxM dimensions, where F is an  integer indicating the number of frequency components f and T is an integer indicating  the number of time frames t (each of F, T, and M being an integer equal to or greater  than 2, where F may be on the order of 500 and T may be on the order of 100), the TF  tensor comprising a TF representation, e.g. STFT, of each of M digitized signal streams x,  each stream corresponding to a combined acoustic signal captured by one of M  microphones of the microphone array (the term “combined” indicating that the captured  acoustic signal may include contributions from any combination of one or more of the K  acoustic sources), where each element Xftm of the tensor X, f being an integer from a set  {1, … ,F}, t being an integer from a set {1, .., T}, and m being an integer from a set {1, …,  M}, is configured to comprise a complex value indicative of measured magnitude and  phase of a portion of a digitized stream x corresponding to a frequency component f at a  time frame t for a microphone m; 
b) initializing a DOA tensor ( Θ), the DOA tensor being of dimensions 3xK (i.e. it is a  second order tensor, or a matrix) and comprising estimated DOA information for each of  the K acoustic sources, where each element  Θik of the DOA tensor (i being an integer  from a set {1, 2, 3}, k being an integer from a set {1, .., K}) is configured to comprise a real  value indicative of orientation of a particular acoustic source k with respect to the  microphone array (in a 3‐dimensional space around the microphone array 106) in  dimension i (the columns  Θ:k of  Θ are vectors of length 1); 
c) computing (equation (26) above) a correlation tensor (R) based on values of the TF  tensor, the correlation tensor being of dimensions MxMxF and comprising information  indicative of correlation of the combined acoustic signals captured by different     
microphones of the microphone array, where each element Rm1m2f of the correlation  tensor (m1 and m2 each being integers from a set {1, … M} and f being an integer from a  set {1, …, F}) is configured to comprise a complex value indicative of estimated  correlation between a portion of the digitized stream x as acquired by microphone m1  (m1 being an integer from a set {1, … M}) and a portion of the digitized stream x as  acquired by microphone m2 (m2 being an integer from a set {1, … M}) for a particular  frequency component f (f being an integer from a set {1, …, F}); 
d) computing (equation (29) above) a steering tensor (A) based on values of the DOA  tensor, the steering tensor being of dimensions MxKxF, where each element Amkf of the  steering tensor (m being an integer from a set {1, …, M}, k being an integer from a set {1,  .., K}, and f being an integer from a set {1, …, F}) is configured to comprise a complex  value indicative of the magnitude and phase response of a microphone m to an acoustic  source located at  Θ:k at a frequency component f; 
e) computing (equation (8) above) a projector tensor (B) based on values of the  steering tensor, the projector tensor being of dimensions MxMxF and comprising  information indicative of which one or more portions of the TF tensor determined in step  a) originate from localizable sources (i.e. sources for which it is possible to determine  orientation with respect to the microphone array; in other words ‐ directional sources; in  other words – sources that may be approximated as point sources for which it is possible  to identify their location; e.g. ambient noise coming from all different directions would  not be associated with a localizable source because it’s not possible to identify or  estimate a single direction of arrival of that sound).  Each element Bm1m2f of the projector  tensor (m1 and m2 both being integers from a set {1, …, M} and f being an integer from a  set {1, …, F}) is configured to comprise a complex value indicative of a set (subspace) of  data vectors Xft: that correspond to signals originating from the estimated orientations in  Θ at frequency component f (the product B::f * Xft: results in a vector that approximates  the directional components in the signal at time t and frequency f); 
f) computing (equation (33) above) a DOA gradient tensor (G) based on values of the  steering tensor, values of the projector tensor, and values of the correlation tensor, the  DOA gradient tensor being of dimensions 3xK (i.e. a matrix or a second rank tensor) and     
comprising information indicative of a change to the DOA matrix for modifying/improving  the estimated DOA information, where each element Gik of the DOA gradient tensor (i  being an integer from a set {1, 2, 3}, k being an integer from a set {1, .., K}) is configured  to comprise a real value indicative of an estimated change in the DOA tensor for  improving orientation estimates of an acoustic source k (i.e. an estimated change in the  DOA matrix  Θ that is necessary to improve the source orientation estimates); 
g) updating (i.e. re‐computing the values of) the DOA tensor based on values of the  DOA gradient tensor;  
h) iterating steps d)‐g) two or more times; and 
i) following the iterations, determining the DOA of an acoustic source k based on a  column  Θ:k of the DOA tensor (i.e. a DOA vector for any source k is then obtained from  the column  Θ:k of the DOA matrix). 
[00114] In one further embodiment, the source localization method summarized  above could further include steps e’) and e’’) to be iterated together with steps d)‐g),  steps e’) and e’’) being as follows: 
e’) computing a TF weight tensor (W) based on values of the projector tensor B and  TF tensor X, the weight tensor being of dimensions FxTxK, where each element Wftk of  the weight tensor is configured to comprise a real value between 0 and 1 indicative of  the degree to which acoustic source k is active in the (f,t)th bins of the TF tensor X (i.e.  indicating a percentage of energy in the (f,t)th bin for each of M microphones that is  attributable to the acoustic signal generated by the acoustic source k), and 
e’’) re‐computing (equation (20)) the correlation tensor R based on values of the TF  tensor X and the TF weight tensor W. 
[00115] The summary provided above is applicable to both the MultSRP and  MoSRP approaches described herein.  These approaches begin to differ in how the TF  weight tensor is computed in  step e’).  In the MultSRP method, the TF weight tensor is  computed using equation (19), while, in the MoSRP method, the weight tensor is  computed using both equations (16) and (19). 
[00116] In various embodiments, iterations of steps summarized above may be  performed until one or more predefined, or dynamically defined, criteria are met.  In an     
embodiment, the one or more predefined criteria may include a predefined threshold  value indicating improvement, e.g. percentage improvement, of a likelihood value  indicating how well the estimated orientations in  Θ explain the observed data given the  assumed data model (see equation (9)). 
 
Examples 
[00117] Example 1 provides a method for determining a direction of arrival (DOA)  of an acoustic signal generated by an acoustic source k of K acoustic sources, the DOA  indicating a DOA of the acoustic signal at a microphone array including M microphones,  each of K and M being an integer equal to or greater than 2, the method including: a)  determining a time‐frequency (TF) tensor of FxTxM dimensions, where F is an integer  indicating a number of frequency components f and T is an integer indicating a number  of time frames t, the TF tensor including a TF representation of each of M digitized signal  streams x, each digitized stream corresponding to a combined acoustic signal captured by  one of M microphones of the microphone array; b) initializing a DOA matrix of  dimensions 3xK, the DOA matrix including estimated DOA information for each of the K  acoustic sources; c) based on values of the TF tensor, computing a correlation tensor of  dimensions MxMxF, the correlation tensor including information indicative of correlation  of the combined acoustic signals captured by different microphones of the microphone  array; d) based on values of the DOA matrix, computing a steering tensor of dimensions  MxKxF, the steering tensor including information indicative of phase and magnitude  response of each microphone of the microphone array to each acoustic source of the K  acoustic sources; e) based on values of the steering tensor, computing a projector tensor  of dimensions MxMxF, the projector tensor including information indicative of which one  or more portions of the TF tensor determined in step a) originate from localizable  sources; f) based on values of the steering tensor, values of the projector tensor, and  values of the correlation tensor, computing a DOA gradient matrix of dimensions 3xK, the  DOA gradient matrix including information indicative of a change to the DOA matrix for  modifying the estimated DOA information; g) updating the DOA matrix based on values  of the DOA gradient matrix; h) iterating steps d)‐g) two or more times; and i) following     
the iterations, determining the DOA of an acoustic source k based on a column  Θ:k of the  DOA matrix. 
[00118] Example 2 provides the method according to Example 1, where each  element Xftm of the TF tensor is configured to include a complex value indicative of  measured magnitude and phase of a portion of a digitized stream x corresponding to a  frequency component f at a time frame t for a microphone m. 
[00119] Example 3 provides the method according to Examples 1 or 2, where each  element  Θik of the DOA matrix is configured to include a real value indicative of  orientation of the acoustic source k with respect to the microphone array in dimension i. 
[00120] Example 4 provides the method according to any one of the preceding  Examples, where each element Rm1m2f of the correlation tensor is configured to include a  complex value indicative of correlation between a portion of the digitized stream x as  acquired by microphone m1 and a portion of the digitized stream x as acquired by  microphone m2 for a particular frequency component f. 
[00121] Example 5 provides the method according to any one of the preceding  Examples, where each element Amkf of the steering tensor is configured to include a  complex value indicative of a magnitude and a phase response of a microphone m to an  acoustic source k at a frequency component f. 
[00122] Example 6 provides the method according to any one of the preceding  Examples, where each element Bm1m2f of the projector tensor is configured to include a  complex value indicative of a set of data vectors Xft: that correspond to localizable signals  with steering matrix A::f at a frequency component f. 
[00123] Example 7 provides the method according to any one of the preceding  Examples, where each element Gik of the DOA gradient matrix is configured to include a  real value indicative of an estimated change in the DOA tensor for improving orientation  estimate of the acoustic source k. 
[00124] Example 8 provides the method according to any one of the preceding  Examples, further including: e’) based on values of the projector tensor and values of the  TF tensor, computing a TF weight tensor of dimensions FxTxK, where each element Wftk  of the TF weight tensor is configured to include a real value between 0 and 1 indicative of     
a degree to which the acoustic source k is active in the (f,t)th bin, and e’’) re‐computing  the correlation tensor based on the values of the TF tensor and values of the TF weight  tensor, where the iterations include iterating steps d‐g, e’, and e’’. 
[00125] Example 9 provides the method according to Example 8, where computing  the TF weight tensor includes using a Wiener mask. 
[00126] Example 10 provides the method according to Example 8, where  computing the TF weight tensor includes using a Wiener mask and defining source‐ specific correlation matrices in terms of posterior probabilities using a Wiener mask. 
[00127] Example 11 provides the method according to any one of the preceding  Examples, where the iterations are performed until one or more predefined criteria are  met. 
[00128] Example 12 provides a method for identifying a first direction of arrival of  sound waves from a first acoustic source and a second direction of arrival of sound waves  from a second acoustic source, the method including: receiving, at a microphone array,  acoustic signals including the sound waves from the first and second acoustic sources;  converting the received acoustic signals from a time domain to a time‐frequency domain;  processing the converted acoustic signals to determine an estimated first angle  representing the first direction of arrival and an estimated second angle representing the  second direction of arrival; and updating the estimated first and second angles; where  processing includes localizing, separating and Wiener post‐filtering the converted  acoustic signals using time‐frequency weighting and outputting a time‐frequency  weighted signal for estimating the first and second angles. 
[00129] Example 13 provides the method according to Example 12, further  including combining the time‐frequency weighted signal with the converted acoustic  signals to generate a correlation matrix. 
[00130] Example 14 provides the method according to Example 13, where  updating the estimated first and second angles includes utilizing the correlation matrix  and the estimated first and second angles and outputting updated estimated first and  second angles. 
   
[00131] Example 15 provides the method according to Example 12, where  converting the received acoustic signals from a time domain to a time‐frequency domain  includes using a short time Fourier transform.   
[00132] Example 16 provides the method according to Example 12, where  processing the converted acoustic signals to determine the estimated first and second  angles includes decomposing the converted acoustic signals to identify signals from each  of the first and second acoustic sources by accounting for interference between the first  and second acoustic sources in forming the acoustic signals.  
[00133] Example 17 provides the method according to Example 12, where  processing the converted acoustic signals and updating the first and second estimated  angles includes iteratively decomposing the converted acoustic signals to simultaneously  determine the first and second directions of arrival.  
[00134] Example 18 provides the method according to Example 12, where  processing the converted acoustic signals includes processing using steered response  power localization. 
[00135] Example 19 provides the method according to Example 12, further  including using an inverse STFT to convert the processed converted acoustic signals back  into the time domain and separating the sound waves from the first acoustic source from  the sound waves from the second acoustic source. 
[00136] Example 20 provides a system comprising means for implementing the  method according to any one of the preceding Examples. 
[00137] Example 21 provides a data structure for assisting implementation of the  method according to any one of the preceding Examples.  
[00138] Example 22 provides a system for determining a DOA of an acoustic signal  generated by an acoustic source k of K acoustic sources, the DOA indicating a DOA of the  acoustic signal at a microphone array comprising M microphones, each of K and M being  an integer equal to or greater than 2, the system including at least one memory element  configured to store computer executable instructions, and at least one processor coupled  to the at least one memory element and configured, when executing the instructions, to  carry out the method according to any one of Examples 1‐11.     
[00139] Example 23 provides one or more non‐transitory tangible media encoding  logic that include instructions for execution that, when executed by a processor, are  operable to perform operations for determining a DOA of an acoustic signal generated by  an acoustic source k of K acoustic sources, the DOA indicating a DOA of the acoustic  signal at a microphone array comprising M microphones, each of K and M being an  integer equal to or greater than 2, the operations comprising operations of the method  according to any one of Examples 1‐11. 
[00140] Example 24 provides a system for identifying a first direction of arrival of  sound waves from a first acoustic source and a second direction of arrival of sound waves  from a second acoustic source, the system including at least one memory element  configured to store computer executable instructions, and at least one processor coupled  to the at least one memory element and configured, when executing the instructions, to  carry out the method according to any one of Examples 12‐19. 
[00141] Example 25 provides one or more non‐transitory tangible media encoding  logic that include instructions for execution that, when executed by a processor, are  operable to perform operations for identifying a first direction of arrival of sound waves  from a first acoustic source and a second direction of arrival of sound waves from a  second acoustic source, the operations comprising operations of the method according  to any one of Examples 12‐19. 
 
Variations and implementations 
[00142] In the discussions of the embodiments above, components can readily be  replaced, substituted, or otherwise modified in order to accommodate particular circuitry  needs.  Moreover,  it should be noted that  the use of  complementary electronic devices,  hardware, software, etc. offer an equally viable option  for  implementing  the teachings  of the present disclosure.   
[00143] In one example embodiment, any number of electrical circuits used to  implement the systems and methods of the FIGURES may be implemented on a board of  an associated electronic device.  The board can be a general circuit board that can hold  various components of the internal electronic system of the electronic device and,     
further, provide connectors for other peripherals.  More specifically, the board can  provide the electrical connections by which the other components of the system can  communicate electrically.  Any suitable processors (inclusive of digital signal processors,  microprocessors, supporting chipsets, etc.), computer‐readable non‐transitory memory  elements, etc. can be suitably coupled to the board based on particular configuration  needs, processing demands, computer designs, etc.  Other components such as external  storage, additional sensors, controllers for audio/video display, and peripheral devices  may be attached to the board as plug‐in cards, via cables, or integrated into the board  itself.  In various embodiments, the functionalities described herein may be implemented  in emulation form as software or firmware running within one or more configurable (e.g.,  programmable) elements arranged in a structure that supports these functions.  The  software or firmware providing the emulation may be provided on non‐transitory  computer‐readable storage medium comprising instructions to allow a processor to carry  out those functionalities. 
[00144] In another example embodiment, the systems and methods of the  FIGURES may be implemented as stand‐alone modules (e.g., a device with associated  components and circuitry configured to perform a specific application or function) or  implemented as plug‐in modules into application specific hardware of electronic devices.   Note that particular embodiments of the present disclosure may be readily included in a  system on chip (SOC) package, either in part, or in whole.  An SOC represents an IC that  integrates components of a computer or other electronic system into a single chip.  It  may contain digital, analog, mixed‐signal, and often radio frequency functions: all of  which may be provided on a single chip substrate.  Other embodiments may include a  multi‐chip‐module (MCM), with a plurality of separate ICs located within a single  electronic package and configured to interact closely with each other through the  electronic package.  In various other embodiments, the identification, localization and  separation functionalities may be implemented in one or more silicon cores in 
Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs),  and other semiconductor chips. 
   
[00145] It is also imperative to note that all of the specifications, dimensions, and  relationships outlined herein (e.g., the number of processors, logic operations, etc.) have  only been offered for purposes of example and teaching only.  Such information may be  varied considerably without departing from the spirit of the present disclosure, or the  scope of the appended claims or examples.  The specifications apply only to one non‐ limiting example and, accordingly, they should be construed as such.  In the foregoing  description, example embodiments have been described with reference to particular  processor and/or component arrangements.  Various modifications and changes may be  made to such embodiments without departing from the scope of the appended claims or  examples.  The description and drawings are, accordingly, to be regarded in an illustrative  rather than in a restrictive sense.   
[00146] Note that the activities discussed above with reference to the FIGURES are  applicable to any integrated circuits that involve signal processing, particularly those that  can execute specialized software programs, or algorithms, some of which may be  associated with processing digitized real‐time data.  Certain embodiments can relate to  multi‐DSP signal processing, floating point processing, signal/control processing, fixed‐ function processing, microcontroller applications, etc. 
[00147] In certain contexts, the features discussed herein can be applicable to  medical systems, scientific instrumentation, wireless and wired communications, radar,  industrial process control, audio and video equipment, current sensing, instrumentation  (which can be highly precise), and other digital‐processing‐based systems.   
[00148] Moreover, certain embodiments discussed above can be provisioned in  digital signal processing technologies for medical imaging, patient monitoring, medical  instrumentation, and home healthcare.  This could include pulmonary monitors,  accelerometers, heart rate monitors, pacemakers, etc.  Other applications can involve  automotive technologies for safety systems (e.g., stability control systems, driver  assistance systems, braking systems, infotainment and interior applications of any kind).   Furthermore, powertrain systems (for example, in hybrid and electric vehicles) can use  high‐precision data conversion products in battery monitoring, control systems, reporting  controls, maintenance activities, etc.     
[00149] In yet other example scenarios, the teachings of the present disclosure can  be applicable in the industrial markets that include process control systems that help  drive productivity, energy efficiency, and reliability.  In consumer applications, the  teachings of the signal processing circuits discussed above can be used for image  processing, auto focus, and image stabilization (e.g., for digital still cameras, camcorders,  etc.).  Other consumer applications can include audio and video processors for home  theater systems, DVD recorders, and high‐definition televisions.  Yet other consumer  applications can involve advanced touch screen controllers (e.g., for any type of portable  media device).  Hence, such technologies could readily part of smartphones, tablets,  security systems, PCs, gaming technologies, virtual reality, simulation training, etc. 
[00150] Note that with the numerous examples provided herein, interaction may  be described in terms of two, three, four, or more electrical components.  However, this  has been done for purposes of clarity and example only.  It should be appreciated that  the system can be consolidated in any suitable manner.  Along similar design alternatives,  any of the illustrated components, modules, and elements of the FIGURES may be  combined in various possible configurations, all of which are clearly within the broad  scope of this Specification.  In certain cases, it may be easier to describe one or more of  the functionalities of a given set of flows by only referencing a limited number of  electrical elements.  It should be appreciated that the electrical circuits of the FIGURES  and its teachings are readily scalable and can accommodate a large number of  components, as well as more complicated/sophisticated arrangements and 
configurations.  Accordingly, the examples provided should not limit the scope or inhibit  the broad teachings of the electrical circuits as potentially applied to a myriad of other  architectures. 
[00151] Note that in this Specification, references to various features (e.g.,  elements, structures, modules, components, steps, operations, characteristics, etc.)  included in “one embodiment”, “example embodiment”, “an embodiment”, “another  embodiment”, “some embodiments”, “various embodiments”, “other embodiments”,  “alternative embodiment”, and the like are intended to mean that any such features are 
   
included in one or more embodiments of the present disclosure, but may or may not  necessarily be combined in the same embodiments.   
[00152] It is also important to note that the functions related to acoustic source  localization and separation, illustrate only some of the possible localization and  separation functions that may be executed by, or within, systems illustrated in the  FIGURES.  Some of these operations may be deleted or removed where appropriate, or  these operations may be modified or changed considerably without departing from the  scope of the present disclosure.  In addition, the timing of these operations may be  altered considerably. The preceding operational flows have been offered for purposes of  example and discussion. Substantial flexibility is provided by embodiments described  herein in that any suitable arrangements, chronologies, configurations, and timing  mechanisms may be provided without departing from the teachings of the present  disclosure. 
[00153] Numerous other changes, substitutions, variations, alterations, and  modifications may be ascertained to one skilled in the art and it is intended that the  present disclosure encompass all such changes, substitutions, variations, alterations, and  modifications as falling within the scope of the appended claims.   
[00154] Although the claims are presented in single dependency format in the  style used before the USPTO, it should be understood that any claim can depend on and  be combined with any preceding claim of the same type unless that is clearly technically  infeasible. 
 
OTHER NOTES, EXAMPLES, AND IMPLEMENTATIONS 
[00155] Note that all optional features of the apparatus described above may also  be implemented with respect to the method or process described herein and specifics in  the examples may be used anywhere in one or more embodiments. 
[00156] In a first example, a system is provided (that can include any suitable  circuitry, dividers, capacitors, resistors, inductors, ADCs, DFFs, logic gates, software,  hardware, links, etc.) that can be part of any type of computer, which can further include 
   
a circuit board coupled to a plurality of electronic components.  The system can include  means for clocking data from the digital core onto a first data output of a macro using a  first clock, the first clock being a macro clock; means for clocking the data from the first  data output of the macro into the physical interface using a second clock, the second  clock being a physical interface clock; means for clocking a first reset signal from the  digital core onto a reset output of the macro using the macro clock, the first reset signal  output used as a second reset signal; means for sampling the second reset signal using a  third clock, which provides a clock rate greater than the rate of the second clock, to  generate a sampled reset signal; and means for resetting the second clock to a  predetermined state in the physical interface in response to a transition of the sampled  reset signal. 
[00157]   The ‘means for’ in these instances (above) can include (but is not limited  to) using any suitable component discussed herein, along with any suitable software,  circuitry, hub, computer code, logic, algorithms, hardware, controller, interface, link, bus,  communication pathway, etc.  In a second example, the system includes memory that  further comprises machine‐readable instructions that when executed cause the system  to perform any of the activities discussed above.   
 

Claims

  WHAT IS CLAIMED IS: 
  1.  A method for determining a direction of arrival (DOA) of an acoustic signal  generated by an acoustic source k of K acoustic sources, the DOA indicating a DOA of the  acoustic signal at a microphone array comprising M microphones, each of K and M being  an integer equal to or greater than 2, the method comprising:    a) determining a time‐frequency (TF) tensor of FxTxM dimensions, where F is an  integer indicating a number of frequency components f and T is an integer indicating  a number of time frames t, the TF tensor comprising a TF representation of each of M  digitized signal streams x, each digitized stream corresponding to a combined  acoustic signal captured by one of M microphones of the microphone array;  b) initializing a DOA matrix of dimensions 3xK, the DOA matrix comprising  estimated DOA information for each of the K acoustic sources;  c) based on values of the TF tensor, computing a correlation tensor of dimensions  MxMxF, the correlation tensor comprising information indicative of correlation of the  combined acoustic signals captured by different microphones of the microphone  array;  d) based on values of the DOA matrix, computing a steering tensor of dimensions  MxKxF, the steering tensor comprising information indicative of phase and magnitude  response of each microphone of the microphone array to each acoustic source of the  K acoustic sources;  e) based on values of the steering tensor, computing a projector tensor of  dimensions MxMxF, the projector tensor comprising information indicative of which  one or more portions of the TF tensor determined in step a) originate from localizable  sources;  f) based on values of the steering tensor, values of the projector tensor, and  values of the correlation tensor, computing a DOA gradient matrix of dimensions 3xK,     
the DOA gradient matrix comprising information indicative of a change to the DOA  matrix for modifying the estimated DOA information;  g) updating the DOA matrix based on values of the DOA gradient matrix;   h) iterating steps d)‐g) two or more times; and  i) following the iterations, determining the DOA of an acoustic source k based on  a column  Θ:k of the DOA matrix. 
2.  The method according to claim 1, where each element Xftm of the TF tensor is  configured to comprise a complex value indicative of measured magnitude and phase of  a portion of a digitized stream x corresponding to a frequency component f at a time  frame t for a microphone m. 
3.  The method according to claim 1, where each element  Θik of the DOA matrix is  configured to comprise a real value indicative of orientation of the acoustic source k with  respect to the microphone array in dimension i. 
4.  The method according to claim 1, where each element Rm1m2f of the correlation  tensor is configured to comprise a complex value indicative of correlation between a  portion of the digitized stream x as acquired by microphone m1 and a portion of the  digitized stream x as acquired by microphone m2 for a particular frequency component f. 
5.  The method according to claim 1, where each element Amkf of the steering tensor  is configured to comprise a complex value indicative of a magnitude and a phase  response of a microphone m to an acoustic source k at a frequency component f. 
6.  The method according to claim 1, wherein each element Bm1m2f of the projector  tensor is configured to comprise a complex value indicative of a set of data vectors Xft:  that correspond to localizable signals with steering matrix A::f at a frequency component  f.   
   
7.  The method according to claim 1, wherein each element Gik of the DOA gradient  matrix is configured to comprise a real value indicative of an estimated change in the  DOA tensor for improving orientation estimate of the acoustic source k. 
8.  The method according to any one of claims 1‐7, further comprising:  e’) based on values of the projector tensor and values of the TF tensor, computing  a TF weight tensor of dimensions FxTxK, where each element Wftk of the TF weight  tensor is configured to comprise a real value between 0 and 1 indicative of a degree  to which the acoustic source k is active in the (f,t)th bin, and  e’’) re‐computing the correlation tensor based on the values of the TF tensor and  values of the TF weight tensor,  wherein the iterations comprise iterating steps d‐g, e’, and e’’. 
9.  The method according to claim 8, wherein computing the TF weight tensor  comprises using a Wiener mask. 
10.  The method according to claim 8, wherein computing the TF weight tensor  comprises using a Wiener mask and defining source‐specific correlation matrices in terms  of posterior probabilities using a Wiener mask. 
11.  The method according to any one of claims 1‐7, wherein the iterations are  performed until one or more predefined criteria are met. 
12.  A method for identifying a first direction of arrival of sound waves from a first  acoustic source and a second direction of arrival of sound waves from a second acoustic  source, the method comprising:    receiving, at a microphone array, acoustic signals including the sound waves from  the first and second acoustic sources;  converting the received acoustic signals from a time domain to a time‐frequency  domain;  
   
processing the converted acoustic signals to determine an estimated first angle  representing the first direction of arrival and an estimated second angle representing  the second direction of arrival; and  updating the estimated first and second angles;   wherein processing includes localizing, separating and Wiener post‐filtering the  converted acoustic signals using time‐frequency weighting and outputting a time‐ frequency weighted signal for estimating the first and second angles. 
13.   The method according to claim 12, further comprising combining the time‐
frequency weighted signal with the converted acoustic signals to generate a  correlation matrix. 
14.   The method according to claim 13, wherein updating the estimated first and 
second angles comprises utilizing the correlation matrix and the estimated first and  second angles and outputting updated estimated first and second angles. 
15.   The method according to claim 12, wherein converting the received acoustic  signals from a time domain to a time‐frequency domain includes using a short time  Fourier transform.   
16.   The method according to claim 12, wherein processing the converted acoustic  signals to determine the estimated first and second angles includes decomposing the  converted acoustic signals to identify signals from each of the first and second  acoustic sources by accounting for interference between the first and second acoustic  sources in forming the acoustic signals.  
17.   The method according to claim 12, wherein processing the converted acoustic  signals and updating the first and second estimated angles includes iteratively  decomposing the converted acoustic signals to simultaneously determine the first  and second directions of arrival.    
   
18.   The method according to claim 12, wherein processing the converted acoustic  signals includes processing using steered response power localization. 
19.   The method according to claim 12, further comprising using an inverse STFT to  convert the processed converted acoustic signals back into the time domain and  separating the sound waves from the first acoustic source from the sound waves  from the second acoustic source.     
 
PCT/US2015/066012 2014-12-18 2015-12-16 Systems and methods for source localization and separation WO2016100460A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201462093903P 2014-12-18 2014-12-18
US62/093,903 2014-12-18

Publications (1)

Publication Number Publication Date
WO2016100460A1 true WO2016100460A1 (en) 2016-06-23

Family

ID=56127517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/066012 WO2016100460A1 (en) 2014-12-18 2015-12-16 Systems and methods for source localization and separation

Country Status (1)

Country Link
WO (1) WO2016100460A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107132503A (en) * 2017-03-23 2017-09-05 哈尔滨工程大学 Acoustic vector circle battle array broadband coherent source direction estimation method based on vector singular value decomposition
WO2018056214A1 (en) * 2016-09-23 2018-03-29 Jfeスチール株式会社 Ultrasound wave source azimuth orienting device, and method of analyzing superimposed image
CN109239665A (en) * 2018-07-10 2019-01-18 北京大学深圳研究生院 A kind of more sound source consecutive tracking method and apparatus based on signal subspace similarity spectrum and particle filter
CN109856593A (en) * 2018-12-21 2019-06-07 南京理工大学 Intelligent miniature array sonic transducer and its direction-finding method towards sound source direction finding
CN110876100A (en) * 2018-08-29 2020-03-10 北京嘉楠捷思信息技术有限公司 Sound source orientation method and system
CN111060875A (en) * 2019-12-12 2020-04-24 北京声智科技有限公司 Method and device for acquiring relative position information of equipment and storage medium
CN111133511A (en) * 2017-07-19 2020-05-08 音智有限公司 Sound source separation system
WO2020060519A3 (en) * 2018-09-17 2020-06-04 Aselsan Elektroni̇k Sanayi̇ Ve Ti̇caret Anoni̇m Şi̇rketi̇ Joint source localization and separation method for acoustic sources
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
CN112106069A (en) * 2018-06-13 2020-12-18 赫尔实验室有限公司 Streaming data tensor analysis using blind source separation
WO2021013346A1 (en) * 2019-07-24 2021-01-28 Huawei Technologies Co., Ltd. Apparatus for determining spatial positions of multiple audio sources
CN113138367A (en) * 2020-01-20 2021-07-20 中国科学院上海微***与信息技术研究所 Target positioning method and device, electronic equipment and storage medium
US11134348B2 (en) 2017-10-31 2021-09-28 Widex A/S Method of operating a hearing aid system and a hearing aid system
WO2022042864A1 (en) 2020-08-31 2022-03-03 Proactivaudio Gmbh Method and apparatus for measuring directions of arrival of multiple sound sources
CN114460541A (en) * 2022-02-10 2022-05-10 国网上海市电力公司 Method and device for positioning noise source of electrical equipment and sound source positioning equipment
CN115015830A (en) * 2022-06-01 2022-09-06 北京中安智能信息科技有限公司 Underwater acoustic signal processing algorithm based on machine learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140023199A1 (en) * 2012-07-23 2014-01-23 Qsound Labs, Inc. Noise reduction using direction-of-arrival information
US20140192999A1 (en) * 2013-01-08 2014-07-10 Stmicroelectronics S.R.L. Method and apparatus for localization of an acoustic source and acoustic beamforming
US20140226838A1 (en) * 2013-02-13 2014-08-14 Analog Devices, Inc. Signal source separation

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140023199A1 (en) * 2012-07-23 2014-01-23 Qsound Labs, Inc. Noise reduction using direction-of-arrival information
US20140192999A1 (en) * 2013-01-08 2014-07-10 Stmicroelectronics S.R.L. Method and apparatus for localization of an acoustic source and acoustic beamforming
US20140226838A1 (en) * 2013-02-13 2014-08-14 Analog Devices, Inc. Signal source separation

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018056214A1 (en) * 2016-09-23 2018-03-29 Jfeスチール株式会社 Ultrasound wave source azimuth orienting device, and method of analyzing superimposed image
JPWO2018056214A1 (en) * 2016-09-23 2018-09-20 Jfeスチール株式会社 Ultrasonic source orientation locating device and overlay image analysis method
CN107132503B (en) * 2017-03-23 2019-09-27 哈尔滨工程大学 Acoustic vector circle battle array broadband coherent source direction estimation method based on vector singular value decomposition
CN107132503A (en) * 2017-03-23 2017-09-05 哈尔滨工程大学 Acoustic vector circle battle array broadband coherent source direction estimation method based on vector singular value decomposition
CN111133511B (en) * 2017-07-19 2023-10-27 音智有限公司 sound source separation system
CN111133511A (en) * 2017-07-19 2020-05-08 音智有限公司 Sound source separation system
US11134348B2 (en) 2017-10-31 2021-09-28 Widex A/S Method of operating a hearing aid system and a hearing aid system
US11218814B2 (en) 2017-10-31 2022-01-04 Widex A/S Method of operating a hearing aid system and a hearing aid system
US11146897B2 (en) 2017-10-31 2021-10-12 Widex A/S Method of operating a hearing aid system and a hearing aid system
CN112106069A (en) * 2018-06-13 2020-12-18 赫尔实验室有限公司 Streaming data tensor analysis using blind source separation
CN109239665A (en) * 2018-07-10 2019-01-18 北京大学深圳研究生院 A kind of more sound source consecutive tracking method and apparatus based on signal subspace similarity spectrum and particle filter
CN109239665B (en) * 2018-07-10 2022-04-15 北京大学深圳研究生院 Multi-sound-source continuous positioning method and device based on signal subspace similarity spectrum and particle filter
CN110876100A (en) * 2018-08-29 2020-03-10 北京嘉楠捷思信息技术有限公司 Sound source orientation method and system
CN110876100B (en) * 2018-08-29 2022-12-09 嘉楠明芯(北京)科技有限公司 Sound source orientation method and system
US11482239B2 (en) 2018-09-17 2022-10-25 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Joint source localization and separation method for acoustic sources
WO2020060519A3 (en) * 2018-09-17 2020-06-04 Aselsan Elektroni̇k Sanayi̇ Ve Ti̇caret Anoni̇m Şi̇rketi̇ Joint source localization and separation method for acoustic sources
CN109856593B (en) * 2018-12-21 2023-01-03 南京理工大学 Sound source direction-finding-oriented miniature intelligent array type acoustic sensor and direction-finding method thereof
CN109856593A (en) * 2018-12-21 2019-06-07 南京理工大学 Intelligent miniature array sonic transducer and its direction-finding method towards sound source direction finding
WO2021013346A1 (en) * 2019-07-24 2021-01-28 Huawei Technologies Co., Ltd. Apparatus for determining spatial positions of multiple audio sources
US11921198B2 (en) 2019-07-24 2024-03-05 Huawei Technologies Co., Ltd. Apparatus for determining spatial positions of multiple audio sources
CN111060875A (en) * 2019-12-12 2020-04-24 北京声智科技有限公司 Method and device for acquiring relative position information of equipment and storage medium
CN111060875B (en) * 2019-12-12 2022-07-15 北京声智科技有限公司 Method and device for acquiring relative position information of equipment and storage medium
CN113138367A (en) * 2020-01-20 2021-07-20 中国科学院上海微***与信息技术研究所 Target positioning method and device, electronic equipment and storage medium
CN111724801A (en) * 2020-06-22 2020-09-29 北京小米松果电子有限公司 Audio signal processing method and device and storage medium
WO2022042864A1 (en) 2020-08-31 2022-03-03 Proactivaudio Gmbh Method and apparatus for measuring directions of arrival of multiple sound sources
CN114460541A (en) * 2022-02-10 2022-05-10 国网上海市电力公司 Method and device for positioning noise source of electrical equipment and sound source positioning equipment
CN115015830A (en) * 2022-06-01 2022-09-06 北京中安智能信息科技有限公司 Underwater acoustic signal processing algorithm based on machine learning

Similar Documents

Publication Publication Date Title
WO2016100460A1 (en) Systems and methods for source localization and separation
Gannot et al. A consolidated perspective on multimicrophone speech enhancement and source separation
US20170251301A1 (en) Selective audio source enhancement
US9706298B2 (en) Method and apparatus for localization of an acoustic source and acoustic beamforming
US9099096B2 (en) Source separation by independent component analysis with moving constraint
US10192568B2 (en) Audio source separation with linear combination and orthogonality characteristics for spatial parameters
US20130297296A1 (en) Source separation by independent component analysis in conjunction with source direction information
US20130294611A1 (en) Source separation by independent component analysis in conjuction with optimization of acoustic echo cancellation
US20170140771A1 (en) Information processing apparatus, information processing method, and computer program product
US20160073198A1 (en) Spatial audio apparatus
JP6363213B2 (en) Apparatus, method, and computer program for signal processing for removing reverberation of some input audio signals
CN113113034A (en) Multi-source tracking and voice activity detection for planar microphone arrays
Koldovský et al. Spatial source subtraction based on incomplete measurements of relative transfer function
CN110088835B (en) Blind source separation using similarity measures
Nesta et al. Convolutive underdetermined source separation through weighted interleaved ICA and spatio-temporal source correlation
US10718742B2 (en) Hypothesis-based estimation of source signals from mixtures
WO2014079484A1 (en) Method for determining a dictionary of base components from an audio signal
JP2022135451A (en) Acoustic processing device, acoustic processing method, and program
Salvati et al. Power method for robust diagonal unloading localization beamforming
JP5406866B2 (en) Sound source separation apparatus, method and program thereof
GB2510650A (en) Sound source separation based on a Binary Activation model
Girin et al. Audio source separation into the wild
Nesta et al. Unsupervised spatial dictionary learning for sparse underdetermined multichannel source separation
Li et al. Low complex accurate multi-source RTF estimation
Fakhry et al. Underdetermined source detection and separation using a normalized multichannel spatial dictionary

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15870955

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15870955

Country of ref document: EP

Kind code of ref document: A1