CN101778322A

CN101778322A - Microphone array postfiltering sound enhancement method based on multi-models and hearing characteristic

Info

Publication number: CN101778322A
Application number: CN200910250393A
Authority: CN
Inventors: 刘文举; 程宁; 李超
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2009-12-07
Filing date: 2009-12-07
Publication date: 2010-07-14
Anticipated expiration: 2029-12-07
Also published as: CN101778322B

Abstract

The invention discloses a microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. It addresses two important factors that affect the performance of microphone array post-filtering speech enhancement: accurate estimation of the signal parameters, and a suitable compromise between improving noise reduction and reducing speech distortion. The scheme comprises the following steps: time-aligning the signals collected by the microphone array, then applying a short-time Fourier transform and an eigenvalue analysis based on the power spectrum; determining the dimension of the signal subspace by maximizing the presence probability of the target speech signal in the noisy speech signal; adaptively selecting a distribution model for the noise power spectrum in the noisy speech signal; estimating the noise power spectrum using conditional probability; estimating an auditory masking threshold based on the signal subspace; and estimating the postfilter using a Lagrange multiplier in accordance with auditory perception characteristics.

Description

Microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics

Technical field

The present invention relates to signal subspace methods, the auditory masking effect, and the design of postfilters for microphone arrays.

Background technology

Speech in real environments is usually corrupted by ambient noise, and multichannel speech enhancement methods have received wide attention in recent years. The advantage of microphone array speech enhancement over single-channel speech enhancement is that it can exploit the correlation between the channels to estimate the signal characteristics more accurately, and thereby achieve better enhancement. Among these methods, microphone array post-filtering has been widely used in recent years because of its outstanding noise reduction performance. Simmer et al. (Reference 1: K. Uwe Simmer, et al., "Post-filtering techniques", in Microphone Arrays, M. Brandstein and D. Ward, Eds. New York: Springer, ch. 3, pp. 36-60, 2001.) proved that the optimal multichannel speech enhancement solution in the minimum mean square error sense can be decomposed into a minimum variance distortionless response (MVDR) beamformer followed by a single-channel Wiener postfilter. Although the optimality of post-filtering has been proved in theory, in practice it is difficult to estimate the power spectra of the speech signal and the noise signal accurately enough to obtain the ideal postfilter, which limits the performance of the method. A reasonable postfilter design and accurate power spectrum estimation can therefore significantly improve speech enhancement performance.

Zelinski (Reference 2: R. Zelinski, "A microphone array with adaptive post-filtering for noise reduction in reverberant rooms", in Proc. of ICASSP-88, 1988, Vol. 5, pp. 2578-2581.) assumed that the noise signals on the array elements are uncorrelated and proposed a postfilter design method. In real environments, however, the noise on different array elements is correlated to some degree, so this method performs poorly. McCowan (Reference 3: Iain A. McCowan, Hervé Bourlard, "Microphone array post-filter based on noise field coherence", IEEE Transactions on Speech and Audio Processing, Vol. 11, pp. 709-715, Nov. 2003.) took the correlation between the noise signals into account and, using the properties of the diffuse noise field, proposed a postfilter design method with better speech enhancement performance. Because that method relies on the diffuse noise field assumption, its performance degrades significantly when the actual noise field is not diffuse.

The present invention uses the auditory masking effect of the human ear to propose a postfilter design method based on auditory perception characteristics. To estimate the noise power spectrum more accurately, the invention decomposes the noisy signal space into a signal subspace and a noise subspace, proposes estimating the subspace dimensions by maximizing the presence probability of the target speech signal, which gives a reasonable estimate of the dimensions of the signal subspace and the noise subspace, and proposes estimating the noise power spectrum on the noise subspace using conditional probability. Experiments show that the proposed noise estimation method is more accurate than earlier noise estimation methods, and that the proposed postfilter based on auditory perception characteristics is more effective than conventional postfilters.

Suppose the frequency-domain representation of the noisy speech signal vector received by an array of L microphones is X = [X_{1}, ..., X_{L}]^{H}. The frequency-domain representation of the enhanced speech signal, obtained as a weighted sum of the array input signals, is:

Y = w^{H} X = w^{H} [S d + N] \qquad (1)

where w is the vector of array weight coefficients, S is the target signal, d = [d_{1}, ..., d_{L}]^{T} is the propagation vector, N = [N_{1}, ..., N_{L}]^{H} is the noise signal vector, and [·]^{H} is the conjugate transpose operator.

The power of the error signal e = S - w^{H} X is:

φ_{ee} = E [(S - w^{H} X) (S^{H} - X^{H} w)] = φ_{SS} - w^{H} φ_{XS} - φ_{XS}^{H} w + w^{H} Φ_{XX} w \qquad (2)

where Φ_{XX} is the cross-power spectral matrix of the multichannel noisy speech signal X, φ_{XS} is the cross-power spectrum between X and the single-channel target signal S, and φ_{SS} is the power spectrum of the single-channel target speech signal S.

Setting the derivative of φ_{ee} with respect to the weights w to zero gives the optimal weight coefficients:

w_{opt} = Φ_{XX}^{-1} φ_{XS} \qquad (3)

Under the assumption that the target speech signal and the noise are uncorrelated, formula (3) becomes:

w_{opt} = Φ_{XX}^{-1} φ_{SS} d = [φ_{SS} d d^{H} + Φ_{NN}]^{-1} φ_{SS} d \qquad (4)

Using the Sherman-Morrison-Woodbury identity, formula (4) can be rewritten as:

w_{opt} = [\frac{φ_{SS}}{φ_{SS} + (d^{H} Φ_{NN}^{-1} d)^{-1}}] \frac{Φ_{NN}^{-1} d}{d^{H} Φ_{NN}^{-1} d} = [\frac{φ_{SS}}{φ_{SS} + φ_{NN}}] \frac{Φ_{NN}^{-1} d}{d^{H} Φ_{NN}^{-1} d} \qquad (5)

where φ_{NN} is the auto-power spectrum of the single-channel noise and Φ_{NN} is the multichannel noise cross-power spectral matrix. Formula (5) can be viewed as a minimum variance distortionless response (MVDR) beamformer Φ_{NN}^{-1} d / (d^{H} Φ_{NN}^{-1} d) followed by a single-channel Wiener postfilter φ_{SS} / (φ_{SS} + φ_{NN}).
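The factorization in formula (5) can be checked numerically. The following is a minimal sketch, assuming the special case Φ_{NN} = σ² I (spatially white noise), which is the same assumption the description makes later in formula (7); the function names and the test values are illustrative and do not come from the patent.

```python
def mvdr_wiener_weights(d, phi_ss, sigma2):
    """Factor w_opt of formula (5), assuming Phi_NN = sigma2 * I (white noise).

    d      : propagation vector (list of complex numbers)
    phi_ss : power spectrum of the target signal
    sigma2 : per-channel noise power (assumed equal on all channels)
    """
    # d^H Phi_NN^{-1} d reduces to sum(|d_i|^2) / sigma2 for white noise.
    dh_inv_d = sum(abs(x) ** 2 for x in d) / sigma2
    # MVDR beamformer part: Phi_NN^{-1} d / (d^H Phi_NN^{-1} d).
    w_mvdr = [(x / sigma2) / dh_inv_d for x in d]
    # Single-channel noise power at the beamformer output.
    phi_nn = 1.0 / dh_inv_d
    # Wiener postfilter part: phi_SS / (phi_SS + phi_NN).
    wiener_gain = phi_ss / (phi_ss + phi_nn)
    w_opt = [wiener_gain * w for w in w_mvdr]
    return w_mvdr, wiener_gain, w_opt


def apply_weights(w, x):
    """Compute w^H x for two equal-length lists."""
    return sum(wi.conjugate() * xi for wi, xi in zip(w, x))
```

The MVDR part passes the target direction without distortion (w_mvdr^H d = 1), so the overall response to the target equals the Wiener gain alone.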

Summary of the invention

To solve the problems of the prior art, the object of the invention is to design the single-channel postfilter, using adaptive selection among multiple distribution models together with auditory characteristics to obtain a new postfilter. The design of a single-channel postfilter has to address two aspects: good noise reduction and low distortion of the target speech signal. In general, a postfilter may increase the distortion of the target speech signal while it reduces noise, so a reasonable compromise between the two is a problem that postfilter design must address.

To achieve this object, the invention provides a microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. The specific steps of the method are as follows:

Step a: a microphone array of L microphones collects multichannel noisy speech signals; the noisy speech signal of each channel is time-aligned; a short-time discrete Fourier transform expresses each aligned channel as a complex-valued frequency-domain signal; the power spectral matrix of the multichannel array signal is computed, and eigenvalue decomposition of this matrix yields an eigenvalue matrix and an eigenvector matrix;

Step b: determine the dimension Q of the signal subspace, with Q ≤ L, by maximizing the presence probability of the target speech signal in the noisy speech signal;

Step c: based on the stationarity of the spectrum, adaptively select the distribution model of the noise power spectrum in the noisy speech signal;

Step d: estimate the noise power spectrum using conditional probability;

Step e: from the signal subspace dimension and the noise power spectrum estimate, and using the auditory masking effect, estimate the auditory masking threshold of each frequency based on the signal subspace;

Step f: from the noise power spectrum and the auditory masking threshold, estimate the postfilter using a Lagrange multiplier, so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, which removes the audible effect of the residual noise while keeping the distortion of the target speech signal as small as possible; this completes the microphone array post-filtering speech enhancement.

The eigenvalue decomposition of the power spectral matrix comprises:

Eigenvalue decomposition divides the noisy speech signal space into two subspaces: the signal subspace, which contains the target speech signal and noise, and the noise subspace, which contains only noise. The eigenvalue decomposition of the power spectral matrix Φ_{XX}(k, t) of the noisy speech signal X at time frame t and frequency k is:

Φ_{XX}(k, t) = U Λ_{XX} U^{H} = U (Λ_{SS} + φ_{NN}(k, t) I) U^{H}

where X = S + N, X is the noisy speech signal, S is the target speech signal, and N is the noise; Λ_{XX} is the eigenvalue matrix of the noisy speech power spectrum with eigenvalues in descending order, Λ_{SS} is the eigenvalue matrix of the target speech power spectrum with eigenvalues in descending order, U is the eigenvector matrix, φ_{NN}(k, t) is the noise power at time frame t and frequency k, I is the identity matrix of order L, and [·]^{H} is the conjugate transpose operator.

The signal subspace dimension is determined by choosing the value of Q that maximizes the probability that the target speech signal is present in the noisy speech. It is computed using conditional probability, as follows:

Define the mutually exclusive events H_{0} and H_{1}:

Event H_{0}: the noisy speech signal contains only noise and no target speech signal;

Event H_{1}: the noisy speech signal contains both the target speech signal and noise;

The signal subspace dimension Q is defined as:

\underset{Q}{\arg \max} P (S (k, t) | H_{1})

where S(k, t) is the power spectrum of the target speech signal at the k-th frequency point of frame t, P(·) is the distribution function of the target speech spectrum, and argmax(·) is the operator that returns the parameter value giving the maximum.

The adaptive selection of the noise power spectrum distribution model in the noisy speech signal, based on the stationarity of the spectrum, comprises the following steps:

Step c1: define a discriminant function Ω that expresses the stationarity of the power spectrum:

Ω = \frac{\sqrt[L - Q]{Π_{i = Q + 1}^{L} λ_{X_{i}}}}{\frac{1}{L - Q} Σ_{i = Q + 1}^{L} λ_{X_{i}}}

That is, Ω is the ratio of the geometric mean of the noise-subspace eigenvalues to their arithmetic mean, where λ_{X_{i}} is the i-th eigenvalue of the noisy speech power spectrum eigenvalue matrix Λ_{XX}, i ∈ {Q+1, ..., L} is the index of the eigenvalue, and the value of Ω lies between 0 and 1;

Step c2: compare the discriminant value with preset thresholds to determine the noise power spectrum distribution model that applies to the noisy speech signal.

The comparison of the discriminant value with the preset thresholds comprises:

Step c21: determine two preset thresholds Ω_{1} and Ω_{2}, with Ω_{1} < Ω_{2};

Step c22: compare the discriminant function with the preset thresholds: if it is less than Ω_{1}, select the zero-mean Gaussian distribution; if it is greater than Ω_{2}, select the Gamma distribution; otherwise select the Laplacian distribution.
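Steps c1 to c22 can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the threshold values passed in are placeholders (the patent does not state Ω_1 and Ω_2), and the geometric mean is computed in the log domain, which is an implementation choice made here to avoid overflow in the product.

```python
import math


def spectral_flatness(noise_eigs):
    """Omega: geometric mean over arithmetic mean of the noise-subspace eigenvalues."""
    n = len(noise_eigs)
    arith = sum(noise_eigs) / n
    # A non-positive eigenvalue makes the geometric mean zero (or undefined).
    if arith <= 0.0 or min(noise_eigs) <= 0.0:
        return 0.0
    log_geo = sum(math.log(v) for v in noise_eigs) / n
    return math.exp(log_geo) / arith


def select_noise_model(eigs_desc, q, omega1, omega2):
    """Pick the noise model from the eigenvalues with index Q+1..L (descending order)."""
    omega = spectral_flatness(eigs_desc[q:])
    if omega < omega1:
        return "gaussian"      # zero-mean Gaussian distribution
    if omega > omega2:
        return "gamma"         # Gamma distribution
    return "laplacian"         # Laplacian distribution otherwise
```

For example, with eigenvalues [10, 4, 1] and Q = 1, the noise-subspace eigenvalues are 4 and 1, so Ω = 2 / 2.5 = 0.8.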

The estimation of the noise power spectrum using conditional probability comprises:

For each frame of the noisy speech signal, the probability that it contains only noise is P(H_{0} | X), and the probability that it contains both noise and the target speech signal is P(H_{1} | X). For these two cases, the noise power spectrum is estimated respectively as:

\begin{cases} H_{0} : φ_{NN}^{0} = \frac{1}{L} Σ_{i = 1}^{L} λ_{X_{i}} \\ H_{1} : φ_{NN}^{1} = \frac{1}{L - Q} Σ_{i = Q + 1}^{L} λ_{X_{i}} \end{cases}

where φ_{NN}^{0} and φ_{NN}^{1} are the noise power spectra when the mutually exclusive events H_{0} and H_{1} occur, respectively, and i ∈ {1, ..., L} is the index of the eigenvalue;

According to the conditional probability formula, the noise power spectrum is estimated as:

\tilde{φ}_{NN} = P (H_{0} | X) φ_{NN}^{0} + P (H_{1} | X) φ_{NN}^{1}
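The weighted estimate above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, the posterior probability P(H_0 | X) is taken as an input because this passage does not say how it is computed, and P(H_1 | X) is set to 1 - P(H_0 | X) on the assumption that the two events are complementary.

```python
def estimate_noise_power(eigs_desc, q, p_h0):
    """Noise power estimate as the probability-weighted mix of the two hypotheses.

    eigs_desc : eigenvalues of the noisy power spectral matrix, descending
    q         : signal subspace dimension Q (must satisfy q < len(eigs_desc))
    p_h0      : P(H0 | X), probability that the frame contains only noise
    """
    L = len(eigs_desc)
    phi_h0 = sum(eigs_desc) / L                # H0: average of all L eigenvalues
    phi_h1 = sum(eigs_desc[q:]) / (L - q)      # H1: average of eigenvalues Q+1..L
    p_h1 = 1.0 - p_h0                          # assumed complementary events
    return p_h0 * phi_h0 + p_h1 * phi_h1
```

With eigenvalues [10, 1, 1, 1], Q = 1 and P(H_0 | X) = 0.25, the two hypotheses give 3.25 and 1, and the combined estimate is 0.25 · 3.25 + 0.75 · 1 = 1.5625.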

The estimation of the auditory masking threshold comprises:

Step f1: divide the auditory frequency range 0-15500 Hz into several critical sub-bands;

Step f2: calculate the auditory masking threshold in each sub-band.

The auditory masking threshold in each sub-band is calculated as follows: compute the energy of each frequency in each sub-band; compute the spreading coefficients of the basilar membrane of the human ear for the sound in each band; multiply the energies in each sub-band by the spreading coefficients to obtain the excitation energy on the basilar membrane; and finally compute the masking threshold from the functional relation between the basilar membrane excitation energy and the auditory masking threshold.

The estimation of the postfilter G using a Lagrange multiplier comprises the following steps:

Step fa: minimize the distortion of the target speech signal under the constraint that the residual noise power is below the masking threshold, and set up the optimization problem accordingly;

Step fb: solve the problem using a Lagrange multiplier to obtain the optimal estimate of the postfilter;

Step fc: substitute the auditory masking threshold and the noise power spectrum estimate to complete the design of the postfilter.

Beneficial effects of the invention: the invention uses the auditory masking effect of the human ear to propose a reasonable compromise, and designs a new postfilter based on auditory perception characteristics. Conventional noise estimation is based on voice activity detection (VAD): the pure-noise frames in the noisy speech are detected, and the average power spectrum over those frames is used as the noise power spectrum on the frames that contain both speech and noise. Because noise varies over time, the noise on each frame is in fact different, so using the average noise power spectrum of the pure-noise frames for all frames can cause large estimation errors. To address this, the invention proposes a noise power spectrum estimation method based on subspace decomposition of the noisy signal, which estimates the noise power spectrum on every frame and greatly reduces the noise estimation error. The invention then uses the auditory masking effect of the human ear to design the postfilter, so that the residual noise in the enhanced speech is masked by the target speech and the distortion of the target speech is also reduced during noise reduction.

Description of drawings

Further features and advantages of the invention are described below with reference to the illustrative drawings.

Fig. 1 is an example flow chart of an application of the microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics;

Fig. 2 is a flow chart of a method for determining the signal subspace dimension;

Fig. 3 is a flow chart for determining the noise power spectrum distribution model in the noisy speech signal;

Fig. 4 is a flow chart for estimating the noise power spectrum using conditional probability;

Fig. 5 is a flow chart for calculating the auditory masking threshold;

Fig. 6 is a flow chart for designing the postfilter.

Embodiment

It should be understood that the following detailed description of the examples and drawings is not intended to limit the invention to the particular illustrative embodiments; the embodiments described only illustrate the steps of the invention, whose scope is defined by the appended claims.

The invention uses the auditory masking effect of the human ear to propose a reasonable compromise, and designs a new postfilter based on auditory perception characteristics. The auditory masking effect means that, under normal conditions, the target speech signal is strong and the background noise is relatively weak, so the auditory system determines an auditory masking threshold in the frequency domain from the particular target speech signal. If the residual noise after filtering is kept below this auditory masking threshold, the noise cannot be perceived by the human ear, and the noisy speech signal is thereby enhanced. The specific steps are as follows:

A new microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics comprises the following steps:

Step b: determine the dimension Q of the signal subspace by maximizing the presence probability of the target speech signal in the noisy speech signal;

Step d: estimate the noise power spectrum using conditional probability;

The commonly used noise estimation method is based on VAD: the pure-noise frames in the noisy speech are detected, and the average power spectrum over those frames is used as the noise power spectrum on the frames that contain both speech and noise. Because noise varies over time, the noise on each frame is in fact different, so using the average noise power spectrum of the pure-noise frames for all frames can cause large estimation errors.

To address this, step b) and step d) of the invention use a method based on subspace decomposition of the noisy signal to estimate the dimension of the noise subspace and the noise power spectrum. The noise power spectrum is estimated on every frame, which greatly reduces the noise estimation error.

Under the assumption that the target speech signal and the noise are uncorrelated, the power spectral matrix Φ_{XX}(k, t) of the noisy speech signal at time frame t and frequency k can be expressed as the sum of the target speech power spectral matrix Φ_{SS}(k, t) and the noise power spectral matrix Φ_{NN}(k, t):

Φ_{XX}(k, t) = Φ_{SS}(k, t) + Φ_{NN}(k, t) \qquad (6)

For microphone array signals, it can be assumed that the auto-power spectrum of the noise signal is equal on all array elements and that the noise signals on different array elements are uncorrelated, so that:

Φ_{NN}(k, t) = φ_{NN}(k, t) I \qquad (7)

where I is the identity matrix of order L and φ_{NN}(k, t) is the auto-power spectrum of the single-channel noise.

Let the eigenvalue decomposition of the target speech power spectral matrix be:

Φ_{SS}(k, t) = U Λ_{SS} U^{H} \qquad (8)

where Λ_{SS} is the eigenvalue matrix with eigenvalues in descending order, U is the corresponding eigenvector matrix, and Q is the rank of the matrix, with Q ≤ L.

Eigenvalue decomposition divides the noisy signal space into two subspaces: the signal subspace (containing the target speech signal and noise) and the noise subspace (containing only noise). Let the eigenvalue decomposition of the noisy signal power spectral matrix be:

Φ_{XX}(k, t) = U Λ_{XX} U^{H} = U (Λ_{SS} + φ_{NN}(k, t) I) U^{H} \qquad (9)

where Λ_{XX} is the eigenvalue matrix of the noisy speech power spectrum with eigenvalues in descending order and I is the identity matrix of order L.
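Written out per eigenvalue, formula (9) makes the split into the two subspaces explicit. The block below is a worked restatement under the rank-Q assumption of formula (8); the symbol λ_{S_i} for the i-th eigenvalue of Φ_{SS} is notation introduced here for illustration and does not appear in the original text.

```latex
Λ_{XX} = \mathrm{diag}(λ_{X_1}, \dots, λ_{X_L}), \qquad
λ_{X_i} =
\begin{cases}
λ_{S_i} + φ_{NN}(k, t), & 1 \le i \le Q \quad \text{(signal subspace)} \\
φ_{NN}(k, t),           & Q < i \le L \quad \text{(noise subspace)}
\end{cases}
```

Since Φ_{SS} has rank Q, its last L - Q eigenvalues are zero, so the trailing eigenvalues of Φ_{XX} all equal the noise power. This is why the noise power can be read off as an average over the noise-subspace eigenvalues.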

The invention proposes a method for estimating the noise auto-power spectrum φ_{NN} from the noise subspace. The dimension Q of the signal subspace and the dimension P of the noise subspace must be determined first.

Step b) provides a method for determining Q by maximizing the presence probability of the target speech signal in the noisy speech signal, that is, choosing the value of Q that maximizes the probability that the target speech signal is present.

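The computation uses conditional probability. Define the mutually exclusive events H_{0} (the noisy speech signal contains only noise) and H_{1} (the noisy speech signal contains both the target speech signal and noise).

The signal subspace dimension Q is defined as:

\underset{Q}{\arg \max} P (S (k, t) | H_{1}) \qquad (10)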

Step c) provides a method for adaptively selecting the noise power spectrum distribution model in the noisy speech signal based on the stationarity of the spectrum. The method comprises the following steps:

First, define the discriminant function Ω:

Ω = \frac{\sqrt[L - Q]{Π_{i = Q + 1}^{L} λ_{X_{i}}}}{\frac{1}{L - Q} Σ_{i = Q + 1}^{L} λ_{X_{i}}} \qquad (11)

That is, Ω is the ratio of the geometric mean of the noise-subspace eigenvalues to their arithmetic mean, where λ_{X_{i}} is the i-th eigenvalue of the noisy speech power spectrum eigenvalue matrix Λ_{XX}, i ∈ {Q+1, ..., L} is the index of the eigenvalue, and the value of Ω lies between 0 and 1.

Then, determine two preset thresholds Ω_{1} and Ω_{2} (Ω_{1} < Ω_{2}) and compare the discriminant function with them: if it is less than Ω_{1}, select the zero-mean Gaussian distribution; if it is greater than Ω_{2}, select the Gamma distribution; otherwise select the Laplacian distribution.

Step d) provides a method for estimating the noise power spectrum using conditional probability. For each frame of the noisy speech signal, the probability that it contains only noise is P(H_{0} | X), and the probability that it contains both noise and the target speech signal is P(H_{1} | X). For these two cases, the noise power spectrum is estimated respectively as:

\begin{cases} H_{0} : φ_{NN}^{0} = \frac{1}{L} Σ_{i = 1}^{L} λ_{X_{i}} \\ H_{1} : φ_{NN}^{1} = \frac{1}{L - Q} Σ_{i = Q + 1}^{L} λ_{X_{i}} \end{cases} \qquad (12)

where i ∈ {1, ..., L} is the index of the eigenvalue, and φ_{NN}^{0} and φ_{NN}^{1} are the noise power spectra when the mutually exclusive events H_{0} and H_{1} occur, respectively.

According to the conditional probability formula, the noise power spectrum is estimated as:

\tilde{φ}_{NN} = P (H_{0} | X) \cdot φ_{NN}^{0} + P (H_{1} | X) \cdot φ_{NN}^{1} \qquad (13)

Step e) provides a method for estimating the auditory masking threshold of each frequency based on the signal subspace, from the signal subspace dimension and the noise power spectrum estimate and using the auditory masking effect.

The auditory frequency range is 0 to 15500 Hz and covers 24 critical sub-bands; the auditory masking threshold must be calculated in each sub-band. First, the energy of each frequency in each sub-band is computed. Next, the spreading coefficients of the basilar membrane of the human ear for the sound in each band are computed, and the energies in each sub-band are multiplied by the spreading coefficients to obtain the excitation energy on the basilar membrane. Finally, the masking threshold is computed from the functional relation between the basilar membrane excitation energy and the auditory masking threshold.
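The three stages of this calculation (band energies, spreading, threshold) can be sketched as follows. This is a simplified illustration and not the patent's exact procedure: the text does not give the band edges, the spreading coefficients, or the excitation-to-threshold relation, so the sketch assumes the standard Zwicker critical-band edges, the Schroeder spreading function, and a fixed threshold offset of 5.5 dB. All function names and the `offset_db` parameter are hypothetical.

```python
import math

# Assumed Zwicker critical-band edges in Hz: 25 edges give 24 bands up to 15500 Hz.
BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]


def band_energies(power_spectrum, sample_rate):
    """Sum the power of each frequency bin into its critical band."""
    n_bins = len(power_spectrum)  # bins span 0 .. sample_rate / 2, n_bins >= 2
    energies = [0.0] * (len(BARK_EDGES) - 1)
    for k, p in enumerate(power_spectrum):
        freq = k * (sample_rate / 2.0) / (n_bins - 1)
        for b in range(len(energies)):
            if BARK_EDGES[b] <= freq < BARK_EDGES[b + 1]:
                energies[b] += p
                break  # bins at or above 15500 Hz fall in no band and are ignored
    return energies


def spreading_gain(delta_bark):
    """Schroeder spreading function as a linear power gain (assumed choice)."""
    x = delta_bark + 0.474
    db = 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)
    return 10.0 ** (db / 10.0)


def masking_thresholds(power_spectrum, sample_rate, offset_db=5.5):
    """Per-band masking threshold: spread excitation lowered by a fixed offset."""
    energies = band_energies(power_spectrum, sample_rate)
    n = len(energies)
    thresholds = []
    for i in range(n):
        # Excitation in band i: all band energies weighted by the spreading gain.
        excitation = sum(energies[j] * spreading_gain(i - j) for j in range(n))
        thresholds.append(excitation * 10.0 ** (-offset_db / 10.0))
    return thresholds
```

In the method described here, the input power spectrum would be the target speech estimate obtained from the signal subspace; a practical model would typically also make the offset depend on the tonality of the signal and compare the result with the absolute threshold of hearing.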

Step f) provides a method for estimating the postfilter G(e^{jω}) from the noise power spectrum and the auditory masking threshold using a Lagrange multiplier, so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, which removes the audible effect of the residual noise while keeping the distortion of the target speech signal as small as possible. This completes the microphone array post-filtering speech enhancement.

Suppose the output signal of the MVDR beamformer is \tilde{S}(e^{jω}) and the target speech signal is S(e^{jω}). The error between the post-filtered enhanced speech signal and the target speech signal can then be expressed as:

E (e^{jω}) = G (e^{jω}) \tilde{S} (e^{jω}) - S (e^{jω}) = [G (e^{jω}) - 1] S (e^{jω}) + G (e^{jω}) \tilde{N} (e^{jω}) \qquad (14)

where \tilde{N}(e^{jω}) is the noise in \tilde{S}(e^{jω}).

The first term of formula (14) describes the distortion of the target speech signal in the enhanced speech, and the second term describes the amount of residual noise in the enhanced speech. A suitable postfilter G(e^{jω}) can be calculated so that the residual noise in the enhanced speech is below the auditory masking threshold of the human ear, which eliminates its audible effect. Based on formula (14), the invention proposes the following objective:

\min E_{T} = [G (e^{jω}) - 1]^{2} S^{2} (e^{jω}) + G^{2} (e^{jω}) \tilde{N}^{2} (e^{jω}) \qquad (15)

subject to the constraint:

G^{2} (e^{jω}) \tilde{N}^{2} (e^{jω}) \leq C_{thr} \qquad (16)

where C_{thr} is the auditory masking threshold.

To solve with the method of Lagrange multipliers, let:

J = E_{T} + μ (G^{2} (e^{jω}) \tilde{N}^{2} (e^{jω}) - C_{thr}) \qquad (17)

where μ is the Lagrange multiplier.

Setting the derivative of J with respect to G(e^{jω}) to zero gives:

G (e^{jω}) = \frac{S^{2} (e^{jω})}{S^{2} (e^{jω}) + (1 + μ) \tilde{N}^{2} (e^{jω})} \qquad (18)
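The intermediate step from (17) to (18) is short enough to write out. In the block below the argument e^{jω} is dropped for brevity, and G, S and \tilde{N} are treated as real magnitudes, which is the reading implied by the squared terms in (15).

```latex
\frac{\partial J}{\partial G}
  = 2 (G - 1) S^{2} + 2 G \tilde{N}^{2} + 2 μ G \tilde{N}^{2} = 0
\;\Longrightarrow\;
G \left[ S^{2} + (1 + μ) \tilde{N}^{2} \right] = S^{2}
\;\Longrightarrow\;
G = \frac{S^{2}}{S^{2} + (1 + μ) \tilde{N}^{2}}
```

Setting μ = 0 (constraint inactive) recovers the ordinary Wiener gain S^{2} / (S^{2} + \tilde{N}^{2}); a positive μ weights the noise term more heavily and lowers the gain.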

Formula (18) shows that, under the objective and constraint of the invention, the postfilter based on auditory perception characteristics takes the form of a Wiener filter with a more reasonable estimate of the noise.

Setting the derivative of J with respect to μ to zero gives:

G (e^{jω}) = \sqrt{\frac{C_{thr}}{\tilde{N}^{2} (e^{jω})}} \qquad (19)

Equating formulas (18) and (19) gives:

1 + μ = \frac{S^{2} (e^{jω})}{\tilde{N}^{2} (e^{jω})} \max (\sqrt{\frac{\tilde{N}^{2} (e^{jω})}{C_{thr}}} - 1, 0) \qquad (20)

Substituting (20) into (18), and replacing \tilde{N}^{2}(e^{jω}) with the estimate \tilde{φ}_{NN} from formula (13), gives the proposed postfilter based on auditory perception characteristics:

G (e^{jω}) = \frac{1}{1 + \max (\sqrt{\frac{\tilde{φ}_{NN}}{C_{thr}}} - 1, 0)} \qquad (21)
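Formula (21) is a one-line computation per frequency. The sketch below is illustrative: the function name is hypothetical, and the guard for zero noise power is an added assumption to keep the function well defined.

```python
import math


def perceptual_postfilter_gain(noise_power, mask_threshold):
    """Postfilter gain of formula (21) for one frequency.

    noise_power    : estimated noise power (the estimate of formula (13))
    mask_threshold : auditory masking threshold C_thr, must be positive
    """
    if noise_power <= 0.0:
        return 1.0  # assumed guard: no noise, so leave the signal untouched
    ratio = math.sqrt(noise_power / mask_threshold)
    # Noise already below the threshold gives max(..., 0) = 0 and a gain of 1.
    return 1.0 / (1.0 + max(ratio - 1.0, 0.0))
```

The formula is equivalent to min(1, sqrt(C_thr / noise_power)): when the noise exceeds the threshold, the gain is exactly the value that brings the residual noise power G² · noise_power down to C_thr, and otherwise the signal passes unchanged. For example, a noise power of 4 with a threshold of 1 gives a gain of 0.5 and a residual noise power of 0.25 · 4 = 1.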

Fig. 1 shows a flow chart of an application of the microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics. The system comprises a microphone array of at least two microphones 101.

The microphones of the array can be arranged in different ways; in particular, the microphones 101 may be placed in a row, each at a preset distance from its neighbor. For example, the distance between two microphones may be approximately 5 cm. The microphone array may be positioned as appropriate for different application environments and technical requirements.

The speech signals collected by the microphones 101 are sent to the signal processing unit 102. Before being sent to the signal processing unit, the speech signals may be preprocessed by a low-pass filter.

The signal processing unit 102 applies delay compensation to the speech signals input from the different microphones to time-align them. A short-time discrete Fourier transform expresses each aligned microphone signal as a complex-valued frequency-domain signal; the power spectral matrix Φ_{XX}(k, t) of the multichannel noisy speech signal collected by the array at time frame t and frequency k is computed, and eigenvalue decomposition of this matrix yields the eigenvalue matrix Λ_{XX} and the eigenvector matrix U.
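The last part of this step, forming the power spectral matrix and taking its eigenvalues, can be sketched for the two-microphone case. This is an illustration under stated assumptions: the text does not say how the expectation in Φ_{XX} is estimated, so the sketch averages over a few neighboring frames; the closed-form eigenvalue routine only handles L = 2; and the function names are hypothetical.

```python
import math


def power_spectral_matrix(frames):
    """Estimate Phi_XX = E[X X^H] at one frequency by averaging over frames.

    frames : list of snapshots, each a list of L complex STFT coefficients
             (one per microphone) taken at the same frequency.
    """
    L = len(frames[0])
    T = len(frames)
    return [[sum(x[i] * x[j].conjugate() for x in frames) / T
             for j in range(L)] for i in range(L)]


def hermitian_2x2_eigenvalues(phi):
    """Eigenvalues of a 2 x 2 Hermitian matrix, returned in descending order."""
    a = phi[0][0].real
    d = phi[1][1].real
    b = phi[0][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + abs(b) ** 2)
    return [mean + radius, mean - radius]
```

For two perfectly coherent channels the second eigenvalue is zero, which corresponds to a one-dimensional signal subspace and no noise. For larger arrays a general Hermitian eigensolver would be used instead of the closed form.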

In the following step 103, the eigenvalue matrix Λ_{XX} is used to determine the dimension Q of the signal subspace by maximizing the presence probability of the target speech signal in the noisy speech signal.

Step 104 then uses the signal subspace dimension Q to adaptively select the noise power spectrum distribution model in the noisy speech signal, based on the stationarity of the spectrum.

Step 105 uses the signal subspace dimension Q and the noise power spectrum distribution model to estimate the noise power spectrum by conditional probability.

Step 106 uses the signal subspace dimension and the noise power spectrum estimate, together with the auditory masking effect, to estimate the auditory masking threshold of each frequency based on the signal subspace.

Finally, step 107 uses the noise power spectrum estimate and the auditory masking threshold to design the postfilter with a Lagrange multiplier.

Fig. 2 describes the flow of a method for determining the signal subspace dimension. This method corresponds to step 103 in Fig. 1.

After steps 101 and 102, the speech signals collected by the microphone array have been time-aligned and transformed by the short-time Fourier transform, and the power spectral matrix Φ_{XX} of the multichannel noisy speech signal has been eigenvalue-decomposed to obtain the eigenvalue matrix Λ_{XX} and the eigenvector matrix U. By formula (9), the eigenvalue matrix of the noisy signal power spectrum decomposes into the sum of the signal power spectrum eigenvalues and the noise power spectrum eigenvalues, where Q is the dimension of the signal subspace.

In the first step 201, the dimension Q of the signal subspace is initialized to 1.

Next, step 202 updates the noise power spectrum and the target speech power spectrum. Because the eigenvalues in the noisy speech power spectrum eigenvalue matrix Λ_{XX} are in descending order, and the signal is assumed to be stronger than the noise, the noise power when the signal subspace dimension is Q is:

φ_{NN} = \frac{1}{L - Q} Σ_{i = Q + 1}^{L} λ_{X_{i}} \qquad (22)

where i ∈ {Q+1, ..., L} is the index of the eigenvalue.

The power of the target speech signal is:

S = \frac{1}{Q} Σ_{i = 1}^{Q} (λ_{X_{i}} - φ_{NN})^{\frac{1}{2}} \qquad (23)

where i ∈ {1, ..., Q} is the index of the eigenvalue.

The variance of the target speech signal is therefore:

v_{s} = \begin{cases} λ_{X_{1}} - φ_{NN} & Q = 1 \\ \frac{1}{Q} Σ_{i = 1}^{Q} [(λ_{X_{i}} - φ_{NN})^{\frac{1}{2}} - S]^{2} & Q > 1 \end{cases} \qquad (24)

where i ∈ {1, ..., Q} is the index of the eigenvalue.

Step 203 selects any one of the Gaussian, Laplacian and Gamma models to describe the spectral distribution of the target speech signal, and calculates the conditional probability of the target speech signal. In particular, when the Gaussian model is selected, the conditional probability P_{G}(S(k, t) | H_{1}) is:

P_{G} (S (k, t) | H_{1}) = \frac{1}{\sqrt{2 π v_{s} (k, t)}} \exp \{- \frac{S^{2} (k, t)}{2 v_{s} (k, t)}\}

Step 204 realizes that variable Q and j's adds computing certainly:

Q＝Q+1

Then step 205 is judged loop termination condition Q＞L, especially, when condition does not satisfy, returns step 202; Otherwise carry out step 206.

Step 206 uses formula (10) of the invention to finally determine the signal subspace dimension Q, namely

\underset{Q}{\arg \max} P (S (k, t) | H_{1}).
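The loop of steps 201 to 206 can be sketched as follows, using the Gaussian model of step 203. The names `gaussian_likelihood` and `estimate_dimension` are hypothetical. Two choices here are assumptions rather than part of the text: the loop stops at Q = L − 1, because formula (22) divides by L − Q, and the variance is floored at a small `eps` to avoid division by zero.

```python
import math

def gaussian_likelihood(S, v_s):
    """Gaussian conditional probability P_G(S | H1) with variance v_s."""
    return math.exp(-S * S / (2.0 * v_s)) / math.sqrt(2.0 * math.pi * v_s)

def estimate_dimension(eigvals, eps=1e-12):
    """Return the Q in 1..L-1 that maximizes P_G(S | H1).

    eigvals: eigenvalues of the noisy power spectrum, sorted descending.
    """
    L = len(eigvals)
    best_Q, best_p = 1, -1.0
    for Q in range(1, L):                      # steps 201, 204 and 205
        # Step 202: update noise power, speech power and speech variance.
        phi_nn = sum(eigvals[Q:]) / (L - Q)
        amps = [math.sqrt(max(lam - phi_nn, 0.0)) for lam in eigvals[:Q]]
        S = sum(amps) / Q
        if Q == 1:
            v_s = eigvals[0] - phi_nn
        else:
            v_s = sum((a - S) ** 2 for a in amps) / Q
        v_s = max(v_s, eps)                    # assumed numerical floor
        # Step 203: conditional probability under the Gaussian model.
        p = gaussian_likelihood(S, v_s)
        if p > best_p:                         # step 206: keep the arg max
            best_Q, best_p = Q, p
    return best_Q
```

The Laplacian or gamma model could be substituted by replacing `gaussian_likelihood`; their densities are not given in this passage, so they are left out of the sketch.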

Fig. 3 shows the flow chart for determining the distribution model of the noise power spectrum in the noisy speech signal. This method corresponds to step 104 in Fig. 1.

The Gaussian, Laplacian and gamma models can all describe the spectral coefficients of speech and noise signals, but noise characteristics differ between noise types, so the model should be selected specifically according to the characteristics of the target noise. In this example, based on statistics of computer fan noise, a model selection method based on the stationarity of the spectrum is given.

In step 301, the discriminant value Ω is computed by formula (11).

Step 302 tests whether the discriminant value Ω is less than Ω_{1}. If so, the Gaussian model is selected; otherwise step 303 tests whether Ω is less than Ω_{2}. If so, the Laplacian model is selected; otherwise the gamma model is selected.

The adaptive model selection algorithm of the invention is based on statistics from a large number of experiments on computer fan noise. The experiments found that the Gaussian model is optimal when Ω is small and the Laplacian model is optimal when Ω is large, while the gamma model has the lowest total average noise estimation error. The invention therefore performs model selection by the threshold comparisons of steps 302 and 303.
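The threshold logic of steps 301 to 303 can be sketched as below. The discriminant is computed as the ratio of the geometric mean to the arithmetic mean of the noise-subspace eigenvalues, as stated in claim 4; treating that ratio as formula (11) is an assumption. The function names and the threshold values used in the example call are hypothetical, since the text does not give numeric values for Ω_1 and Ω_2.

```python
import math

def discriminant(eigvals, Q):
    """Ratio of geometric to arithmetic mean of eigenvalues Q+1..L.

    eigvals must be positive and sorted descending, with 1 <= Q < L.
    """
    noise = eigvals[Q:]
    n = len(noise)
    # Geometric mean computed in the log domain for numerical stability.
    geo = math.exp(sum(math.log(x) for x in noise) / n)
    arith = sum(noise) / n
    return geo / arith

def select_model(omega, omega1, omega2):
    """Steps 302 and 303: pick a distribution model from two thresholds."""
    if omega < omega1:
        return "gaussian"
    if omega < omega2:
        return "laplacian"
    return "gamma"

# Illustrative call with made-up thresholds omega1 = 0.3 and omega2 = 0.6.
model = select_model(discriminant([9.0, 4.0, 1.0, 1.0], 1), 0.3, 0.6)
```

Because the geometric mean never exceeds the arithmetic mean, the discriminant lies in (0, 1] for positive eigenvalues, which bounds the range in which the two thresholds are meaningful.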

Fig. 4 shows the flow chart of a method that estimates the noise power spectrum using conditional probability. This method corresponds to step 105 in Fig. 1.

Step 401 computes the average power spectrum φ_{NN}^{pre} of the pure-noise frames in the initial segment of the noisy speech signal.

Step 402 computes the power spectrum of the current frame

φ_{NN}^{cur} = \frac{1}{L} Σ_{i = 1}^{L} λ_{X_{i}}

where i ∈ {1, …, L} indexes the eigenvalues.

Next, step 403 computes the ratio of the current-frame power spectrum to the pure-noise power spectrum

r = \frac{φ_{NN}^{cur}}{φ_{NN}^{pre}}

Steps 403 through 408 together compute the conditional probability P (H_{0} | X). First, r is compared with a preset threshold α, which takes a small value slightly greater than 1; specifically, α is set to 1.2. When r < α, the current frame is more likely a pure-noise frame, so P (H_{0} | X) should take a larger value, and the invention sets its lower bound to 0.8. When r > α, the current frame is more likely a speech frame, and P (H_{0} | X) should take a suitable value. Because the energy of the signal is distributed unevenly across frequencies, different values of P (H_{0} | X) are used for different frequencies. At low frequencies the value of P (H_{0} | X) should be greater than at high frequencies, because the energy of the signal is mostly concentrated in the low-frequency region. That is,

P (H_{0} | X) = \begin{cases} \max \left( \frac{1}{1 + r β_{1}}, 0.8 \right), & r \leq 1.2 \\ \frac{1}{1 + r β_{2}}, & r > 1.2 \text{ and } f \leq f_{thr} \\ \frac{1}{1 + r β_{3}}, & r > 1.2 \text{ and } f > f_{thr} \end{cases} \qquad (26)

where f_{thr} is the threshold frequency separating low and high frequencies, and β_{1}, β_{2} and β_{3} are weighting coefficients.

Step 409 computes the conditional probability P (H_{1} | X) = 1 - P (H_{0} | X).

Having obtained the conditional probabilities P (H_{0} | X) and P (H_{1} | X), step 410 uses formula (13) to obtain the noise power spectrum estimate \tilde{φ}_{NN}.
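Steps 401 to 410 can be sketched as follows. The probability follows formula (26), and the final combination follows the weighted sum stated in claim 6, which is assumed here to be formula (13). The function names are hypothetical, and the values passed for `f_thr` and the β coefficients in any call are placeholders, since the text gives no numeric values for them.

```python
def prob_noise_only(r, f, f_thr, beta1, beta2, beta3, alpha=1.2, floor=0.8):
    """Formula (26): P(H0 | X) from the power ratio r and the frequency f."""
    if r <= alpha:
        # Frame is probably noise only: keep the probability at or above 0.8.
        return max(1.0 / (1.0 + r * beta1), floor)
    # Frame is probably speech: use a frequency-dependent weight.
    beta = beta2 if f <= f_thr else beta3
    return 1.0 / (1.0 + r * beta)

def estimate_noise_power(eigvals, Q, phi_pre, f, f_thr, beta1, beta2, beta3):
    """Blend the noise estimates under H0 and H1 by their probabilities.

    eigvals: eigenvalues sorted descending; Q: signal subspace dimension,
    1 <= Q < L; phi_pre: average power of the initial pure-noise frames.
    """
    L = len(eigvals)
    phi0 = sum(eigvals) / L               # H0: all eigenvalues are noise
    phi1 = sum(eigvals[Q:]) / (L - Q)     # H1: only the L - Q smallest are
    r = phi0 / phi_pre                    # step 403: current / pure-noise
    p0 = prob_noise_only(r, f, f_thr, beta1, beta2, beta3)
    p1 = 1.0 - p0                         # step 409
    return p0 * phi0 + p1 * phi1          # step 410
```

Since the result is a convex combination, the estimate always lies between the noise power under H_1 and the noise power under H_0.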

Fig. 5 shows the flow chart of a method for computing the auditory masking threshold. This method corresponds to step 106 in Fig. 1. To mask the noise in the signal and thereby enhance the target speech signal, the noise must be kept below this threshold.

Step 501 divides the human auditory range of 0 to 15500 Hz into 24 sub-bands so that the auditory masking threshold can be computed in each sub-band.
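The sub-band division of step 501 can be illustrated with a lookup table. The text does not list the band edges; the sketch below assumes the commonly used Zwicker critical-band (Bark) edges, which also span 0 to 15500 Hz in 24 bands. The names `BARK_EDGES_HZ` and `band_index` are hypothetical.

```python
from bisect import bisect_right

# Assumed band edges in Hz (25 edges delimit 24 sub-bands).
BARK_EDGES_HZ = [
    0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480, 1720,
    2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500, 12000, 15500,
]

def band_index(freq_hz):
    """Return the 1-based sub-band index j in 1..24 containing freq_hz."""
    if not 0 <= freq_hz <= BARK_EDGES_HZ[-1]:
        raise ValueError("frequency outside the 0 to 15500 Hz range")
    # bisect_right counts the edges <= freq_hz; the top edge maps to band 24.
    return min(bisect_right(BARK_EDGES_HZ, freq_hz), 24)
```

Each STFT frequency bin can then be assigned to its sub-band j by passing the bin's center frequency to `band_index`.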

In step 502, the energy at each frequency bin is computed using the signal subspace dimension obtained in step 206. H (j, b) denotes the energy at the b-th frequency bin in the j-th sub-band, and can be computed from the eigenvalues and eigenvectors of the signal subspace.

H (j, b) = mean \left( \frac{1}{L} Σ_{i = 1}^{Q} λ_{S_{i}} | U_{1, i} |^{2} \right) \qquad (27)

where λ_{S_{i}} is the estimated eigenvalue of the spectral power matrix of the target speech signal, U_{1, i} is the i-th basis vector of the signal subspace, i ∈ {1, …, Q} indexes the eigenvalues, and mean(·) is the averaging operator.

SF (j) is the function expressing the spreading characteristic of the basilar membrane of the human ear in the j-th sub-band, j ∈ {1, …, 24}.

In step 503, the spreading function of each sub-band is computed:

SF (j) = 15.81 + 7.5 (j + 0.474) - 17.5 \sqrt{1 + (j + 0.474)^{2}}, \quad j ∈ \{1, …, 24\} \qquad (28)

Next, step 504 computes the excitation energy on the basilar membrane of the human ear:

C (j, b) = SF (j) * H (j, b), \quad j ∈ \{1, …, 24\} \qquad (29)

Step 505 computes the auditory masking threshold

C_{thr} = 10^{\log_{10} | C (j, b) | - \left| \frac{O (j)}{10} \right| - \left| \frac{\tilde{φ}_{NN}}{10} \right|} \qquad (30)

where O (j) is an offset and j ∈ {1, …, 24} denotes the j-th sub-band.
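Formulas (28) to (30) can be evaluated directly, as in the sketch below. The function names are hypothetical, and the "*" in (29) is taken as a plain product, following the wording of claim 8. The offset O(j) is passed in as a parameter because its values are not given in the text.

```python
import math

def spreading(j):
    """Formula (28): spreading function of sub-band j, j in 1..24."""
    x = j + 0.474
    return 15.81 + 7.5 * x - 17.5 * math.sqrt(1.0 + x * x)

def masking_threshold(h_jb, j, offset_j, phi_nn_est):
    """Formulas (29) and (30) for one frequency bin.

    h_jb: energy H(j, b); offset_j: offset O(j); phi_nn_est: noise estimate.
    """
    c = spreading(j) * h_jb                        # formula (29)
    if c == 0.0:
        raise ValueError("excitation energy must be non-zero for log10")
    exponent = (math.log10(abs(c))
                - abs(offset_j / 10.0)
                - abs(phi_nn_est / 10.0))
    return 10.0 ** exponent                        # formula (30)
```

With a zero offset and a zero noise estimate the threshold reduces to |C(j, b)|; any non-zero offset or noise estimate lowers it.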

Fig. 6 shows the flow chart for designing the postfilter. This method corresponds to step 107 in Fig. 1.

The goal is to minimize the distortion of the target speech signal under the condition that the power of the residual noise in the enhanced speech stays below the auditory masking threshold.

Step 601 states the constrained optimization problem as follows:

Objective:

\min E_{T} = [G (e^{jω}) - 1]^{2} S (e^{jω})^{2} + G (e^{jω})^{2} \tilde{N} (e^{jω})^{2}

Constraint:

G (e^{jω})^{2} \tilde{N} (e^{jω})^{2} \leq C_{thr}

Step 602 solves the problem by the method of Lagrange multipliers. Let

J = E_{T} + μ (G (e^{jω})^{2} \tilde{N} (e^{jω})^{2} - C_{thr})

Setting the derivatives of J with respect to G (e^{jω}) and μ to zero gives:

\begin{cases} G (e^{jω}) = \frac{S (e^{jω})^{2}}{S (e^{jω})^{2} + (1 + μ) \tilde{N} (e^{jω})^{2}} \\ G (e^{jω}) = \sqrt{\frac{C_{thr}}{\tilde{N} (e^{jω})^{2}}} \end{cases}

Step 603 solves these equations to obtain the optimal estimate of the postfilter, namely:

G (e^{jω}) = \frac{1}{1 + \max (\sqrt{\frac{\tilde{φ}_{NN}}{C_{thr}}} - 1, 0)}

The noise power spectrum estimate \tilde{φ}_{NN} obtained in step 410 and the auditory masking threshold C_{thr} obtained in step 505 are then substituted in, and step 604 completes the design of the postfilter.
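The closed-form gain of step 603 is straightforward to evaluate per frequency bin. The sketch below is illustrative only; the function name `postfilter_gain` and the input checks are assumptions.

```python
import math

def postfilter_gain(phi_nn_est, c_thr):
    """Postfilter gain from the noise estimate and the masking threshold."""
    if phi_nn_est < 0.0 or c_thr <= 0.0:
        raise ValueError("need phi_nn_est >= 0 and c_thr > 0")
    # Noise already below the threshold: the max term is 0 and the gain is 1.
    # Otherwise the gain equals sqrt(c_thr / phi_nn_est).
    return 1.0 / (1.0 + max(math.sqrt(phi_nn_est / c_thr) - 1.0, 0.0))
```

When the noise estimate exceeds the threshold, the gain is exactly the value at which the residual noise power, gain squared times the noise estimate, equals C_thr, which matches the second equation of step 602. When the noise is already masked, the signal passes unchanged.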

From this specification, further modifications and variations of the invention will be apparent to those skilled in the art. This description is therefore to be regarded as illustrative, and its purpose is to teach those of ordinary skill in the art the general manner of carrying out the invention. It should be understood that the forms of the invention shown and described herein are to be taken as the presently preferred embodiments.

Claims

1. A microphone array post-filtering speech enhancement method based on multiple models and auditory characteristics, characterized in that it comprises the following steps:

Step d: estimating the noise power spectrum using conditional probability;

2. the method for claim 1 is characterized in that, described spectral power matrix is carried out characteristic value decomposition, comprising:

Φ _XX(k，t)＝UΛ _XXU ^H＝U(Λ _SS+φ _NN(k，t)I)U ^H

3. the method for claim 1 is characterized in that, described definite signal subspace dimension is to get the probability maximum that only Q value makes that the target voice signal exists in the noisy speech; Utilize conditional probability to calculate, step comprises:

Definition exclusive events H ₀And H ₁:

Signal subspace dimension Q is defined as:

\underset{Q}{\arg \max} P (S (k, t) | H_{1})

4. the method for claim 1 is characterized in that, described stationarity based on spectrum, and noise power spectrum distributed model in the adaptively selected Noisy Speech Signal may further comprise the steps:

Ω = \frac{(L - Q) \sqrt{Π_{i = Q + 1}^{L} λ_{X_{i}}}}{\frac{1}{L - Q} Σ_{i = Q + 1}^{L} λ_{X_{i}}}

That is, Ω is a geometric average

To arithmetic average

Ratio, wherein,

5. The method of claim 4, characterized in that the step of comparing the discriminant value with the preset thresholds comprises:

6. The method of claim 1, characterized in that the step of estimating the noise power spectrum using conditional probability comprises:

\begin{cases} H_{0} : & φ_{NN}^{0} = \frac{1}{L} Σ_{i = 1}^{L} λ_{X_{i}} \\ H_{1} : & φ_{NN}^{1} = \frac{1}{L - Q} Σ_{i = Q + 1}^{L} λ_{X_{i}} \end{cases}

Wherein,

With

{\tilde{φ}}_{NN} = P (H_{0} | X) φ_{NN}^{0} + P (H_{1} | X) φ_{NN}^{1} .

7. the method for claim 1 is characterized in that, the step of described estimation auditory masking threshold comprises:

8. method as claimed in claim 7, it is characterized in that, auditory masking threshold in each sub-band of described calculating is the energy that calculates each frequency on each sub-band, calculate the propagation coefficient of people's ear basement membrane for each frequency range sound, then the propagation coefficient of the energy of each frequency on each sub-band and each frequency range sound being multiplied each other obtains the epilamellar excitation energy value of people's ear, and the functional relation according to epilamellar excitation energy value of people's ear and auditory masking threshold calculates masking threshold again.

9. the method for claim 1 is characterized in that, described step in conjunction with Lagrange multiplier estimation postfilter G is as follows: