CN102074236B - Speaker clustering method for distributed microphone - Google Patents

Speaker clustering method for distributed microphone

Info

Publication number
CN102074236B
CN102074236B CN2010105683868A CN201010568386A
Authority
CN
China
Prior art keywords
frame
point
formula
subband
time delay
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2010105683868A
Other languages
Chinese (zh)
Other versions
CN102074236A (en)
Inventor
杨毅 (Yang Yi)
刘加 (Liu Jia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co Ltd
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN2010105683868A priority Critical patent/CN102074236B/en
Publication of CN102074236A publication Critical patent/CN102074236A/en
Application granted granted Critical
Publication of CN102074236B publication Critical patent/CN102074236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a speaker clustering method for distributed microphones, which comprises the following steps: first, preprocessing the signals acquired by the distributed microphones; then applying a time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors; next, excluding erroneous data and performing speaker segmentation; and finally performing speaker clustering according to the speaker segmentation result. The distributed microphones serve as the signal acquisition and output devices for computing the time-delay vectors of the speech signal segments; the time-delay estimation accuracy is improved by excluding erroneous data, and a clustering algorithm is applied to the time-delay vectors so that the speech signal segments are classified by speaker identity. The devices are inexpensive and convenient to use, and the speaker clustering method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.

Description

A speaker clustering method for distributed microphones
Technical field
The invention belongs to the field of speech technology, and in particular relates to a speaker clustering method for distributed microphones.
Background technology
With the continuous development of network and communication technology, existing multimedia, network, communication and distributed processing technology can realize multi-person, multi-party dialogue in complex acoustic environments. Traditional sound-source input and recording devices include head-mounted microphones, omnidirectional and directional single microphones, microphone arrays and so on. A single microphone, as a traditional sound-source input and recording device, has advantages such as small size and low cost, but it can neither handle environmental noise nor localize sound sources. A microphone array is composed of multiple microphones placed at specific geometric positions and performs joint time-space processing of the spatial signal; its capabilities include identifying and separating sound sources, sound-source localization under reverberant conditions, speech signal enhancement, and so on.
A distributed microphone system is a sound signal acquisition system composed of multiple single microphones: each microphone is controlled by a different device, no restriction is placed on the arrangement and spacing of the microphones, and the signals collected by the microphones are not perfectly synchronized in the time domain. Distributed microphones are simple in structure, easy to use and low in cost; they meet the requirements of multi-source, multi-direction complex dialogue scenes and can effectively support applications such as speaker clustering, recognition and localization. Unlike a microphone array system, a distributed microphone system places no constraint on the positions and placement of the microphones; furthermore, the positions of the sound sources and microphones in the distributed system are unknown.
Automatic classification of acoustic information is one of the research topics in the field of speech signal processing, and speaker segmentation (Speaker Segmentation) and speaker clustering (Speaker Clustering) are important components of it. The usual approach is: speaker segmentation divides the whole test speech into a series of speech segments, each of which belongs to only one specific speaker; speaker clustering is responsible for grouping the scattered speech segments that belong to the same speaker into one class.
Traditional speaker segmentation methods are mostly based on a sliding-window statistical approach with Gaussian models: different distance measures are selected, and the segmentation points are obtained by merging based on the Bayesian information criterion. Speaker clustering can adopt the evolutive Hidden Markov Model (EHMM) method, which updates the segmentation result by weighing path scores. When the number of speakers is not limited, hierarchical clustering can be used for speaker clustering.
The speaker clustering method of a microphone array mainly exploits the differences in speakers' spatial positions for classification. The basic principle is: the time-delay estimate vector is taken as the speaker's spatial feature, and these features are integrated and classified in a GMM/HMM (Gaussian mixture model / hidden Markov model) framework. The time-delay estimation algorithms for microphone arrays mainly include the GCC (generalized cross-correlation) method and the LMS (least mean square error) method. The GCC method is seriously affected by reverberation; improved versions include the CEP (cepstral pre-filtering) method and the pitch-weighted GCC method, while EVD (eigenvalue decomposition) and the time-delay estimation method based on the ATF (acoustic transfer function) use subspace techniques and transfer-function ratios, respectively. However, a microphone array system is sensitive to sampling errors between devices during computation, so it imposes very strict requirements on the synchrony of the audio data; moreover, in a common multi-person, multi-party conference scene the number of sound sources, the microphone positions and the room acoustics are all unknown, that is, the audio data must be processed in a scene where both temporal and spatial prior information are lacking.
A single microphone, as a traditional sound-source input and recording device, is cheap and simple in structure, but it is susceptible to environmental interference and cannot localize sound sources; conventional microphone array systems have been widely researched, and the main reasons they have not been commercialized are the high price of the dedicated hardware and the high complexity of the algorithms.
Summary of the invention
In order to overcome the shortcomings of the above prior art, the object of the invention is to propose a speaker clustering method for distributed microphones. The distributed microphones serve as the signal acquisition and output devices; the time-delay vectors of the speech signal segments are computed, the time-delay estimation accuracy is improved by excluding erroneous data, and a clustering algorithm is applied to the time-delay vectors so that the speech signal segments are classified by speaker identity. The devices are inexpensive and convenient to use, and the method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.
A speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals acquired by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed. The signals are divided into frames and transformed with the fast Fourier transform (FFT), and endpoint detection is then performed to divide the signals into two classes, sound-source signal and non-sound-source signal. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. A subband spectral entropy algorithm can be used for speech endpoint detection: the spectrum of each speech frame is first divided into n subbands (n is an integer greater than zero) and the spectral entropy of each subband is computed; the spectral entropy of every frame is then obtained from the subband spectral entropies of n successive frames through a group of order-statistics filters, and the input speech is classified according to the value of the spectral entropy. The concrete steps are: each frame of the speech signal is passed through the fast Fourier transform (FFT) to obtain $N_{FFT}$ points $Y_i$ ($0 \le i \le N_{FFT}$) on the power spectrum; the probability density of each point in the spectral domain can be expressed by formula (1):
$$p_i = Y_i \Big/ \sum_{k=0}^{N_{FFT}-1} Y_k \qquad (1)$$
where $Y_k$ is the k-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is the i-th point of the speech signal on the power spectrum after the FFT, $N_{FFT}$ is the number of points i, and $p_i$ is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):
$$H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k) \qquad (2)$$
where $p_k$ is the probability density of the k-th point in the spectral domain, $N_{FFT}$ is the number of points i, and H is the entropy function in the spectral domain.
The $N_{FFT}$ points on the frequency domain are divided into K non-overlapping frequency ranges, called subbands, and the probability of each point in the spectral domain of the l-th frame is computed as in formula (3):
$$p_l[k,i] = (Y_i + Q) \Big/ \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q) \qquad (3)$$
where $Y_j$ is the j-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is a point in the k-th subband, $m_k$ ($0 \le k \le K-1$, $m_k \le i \le m_{k+1}-1$) is the lower boundary of the subband, Q is a constant, and $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame.
According to the definition of information entropy, the spectral entropy of the k-th subband of the l-th frame is given by formula (4):
$$E_s[l,k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k,i]\,\log\big(p_l[k,i]\big) \quad (0 \le k \le K-1) \qquad (4)$$
where $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame and $E_s[l,k]$ is the spectral entropy of the k-th subband of the l-th frame.
The spectral information entropy of the l-th frame can then be computed according to formula (5):
$$H_l = -\frac{1}{K}\sum_{k=0}^{K-1} E_h[l,k] \qquad (5)$$
where $E_h[l,k]$ is the spectral entropy of the k-th subband of the l-th frame after filter smoothing, K is the number of subbands, and $H_l$ is the spectral information entropy of the l-th frame; $E_h[l,k]$ is defined as shown in formula (6):
$$E_h[l,k] = (1-\lambda)\,E_{s(h)}[l,k] + \lambda\,E_{s(h+1)}[l,k] \quad (0 \le k \le K-1) \qquad (6)$$
where $E_{s(h)}[l,k]$ is obtained as follows: the order-statistics filter of each subband acts on a group of L subband information entropies $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; this group of subband information entropies is sorted in ascending order, and $E_{s(h)}[l,k]$ is the h-th largest value among $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; $\lambda$ is a constant, and $E_h[l,k]$ is the information entropy of the k-th subband of the l-th frame after filter smoothing.
From formula (5), every frame has a spectral entropy $H_l$; when the value of $H_l$ is greater than a preset threshold T, the l-th frame is judged to be a speech frame, otherwise it is judged to be a non-speech frame. The threshold T is defined as $T = \beta \cdot Avg + \theta$, where $Avg = \frac{1}{K}\sum_{k=0}^{K-1} E_m[k]$, $\beta = 0.01$, $\theta = 0.1$, $E_m[k]$ is the median of $E_s[0,k],\ldots,E_s[N-1,k]$, and Avg is the noise estimate over the first N frames of the input signal.
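To make this endpoint-detection step concrete, the following is a minimal Python/NumPy sketch of the subband spectral-entropy computation of formulas (1)–(6) and the threshold test on $H_l$; the specific parameter values (K, Q, $\lambda$, N, h, the number of noise frames) and the boundary handling at the start of the signal are illustrative assumptions of this sketch, not values fixed by the patent.

```python
import numpy as np

def subband_spectral_entropy_vad(frames, K=8, Q=1e-6, lam=0.5, N=2, h=2,
                                 beta=0.01, theta=0.1, noise_frames=10):
    """Minimal sketch of subband spectral-entropy endpoint detection.

    frames : (num_frames, frame_len) array of windowed time-domain frames.
    Returns a boolean array, True where a frame is judged to be speech.
    """
    Y = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # power spectrum per frame
    num_frames = Y.shape[0]
    edges = np.linspace(0, Y.shape[1], K + 1, dtype=int)   # subband boundaries m_k

    # Formulas (3)-(4): per-subband probabilities and entropies E_s[l, k].
    E_s = np.zeros((num_frames, K))
    for k in range(K):
        band = Y[:, edges[k]:edges[k + 1]] + Q
        p = band / band.sum(axis=1, keepdims=True)
        E_s[:, k] = np.sum(p * np.log(p), axis=1)

    # Formula (6): order-statistics filtering over 2N+1 neighbouring frames.
    E_h = np.zeros_like(E_s)
    for l in range(num_frames):
        window = np.sort(E_s[max(0, l - N):l + N + 1], axis=0)[::-1]  # descending
        e_h  = window[min(h - 1, len(window) - 1)]    # h-th largest value
        e_h1 = window[min(h,     len(window) - 1)]    # (h+1)-th largest value
        E_h[l] = (1 - lam) * e_h + lam * e_h1

    # Formula (5) and the threshold T = beta * Avg + theta.
    H = -E_h.mean(axis=1)
    E_m = np.median(E_s[:noise_frames], axis=0)       # median entropy of first frames
    T = beta * E_m.mean() + theta                     # Avg = mean over subbands of E_m[k]
    return H > T
```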
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First the spatial coordinates are established. The concrete method is: the microphones are numbered in order as M1, M2, ..., Mn, where n is an integer greater than 1; the two microphones initially numbered 1 and 2, M1 and M2, are selected; the position of microphone M1 is taken as the coordinate origin, and the direction from M1 to M2 as the initial coordinate axis. Then every 50 frames of the speech signal are treated as one speech segment, and the time-delay estimation method is used to estimate the delay difference between each pair of microphones for every segment, yielding n(n-1) time-delay estimates, as shown in formula (7):
τ k = τ ^ 12 τ ^ 13 L τ ^ ij T - - - ( 7 )
where $\hat{\tau}_{ij}$ is the estimated delay difference between the i-th microphone and the j-th microphone, and $\tau_k$ is the delay-difference estimate vector.
The time delay is estimated with the PHAT (phase transform) weighting algorithm; its weighting coefficient is shown in formula (8), and the time-delay estimation is given by formulas (9)–(10):
W ( ω ) = 1 | X 1 ( ω ) X 2 * ( ω ) | - - - ( 8 )
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, and $*$ denotes complex conjugation.
$$R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega)\cdot X_1(\omega)\cdot X_2^{*}(\omega)\big) \qquad (9)$$
$$\hat{\tau} = \arg\max_{n} R_{x_1 x_2}(n) \qquad (10)$$
where $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals and $\hat{\tau}$ is the time-delay estimate between $x_1$ and $x_2$.
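As an illustration of formulas (8)–(10), here is a minimal Python/NumPy sketch of PHAT-weighted generalized cross-correlation for one microphone pair; the zero-padding length, the small regularization constant and the conversion of the lag index to seconds are assumptions of this sketch rather than details given in the patent. The delay-difference vector of formula (7) would stack such estimates for every microphone pair of a 50-frame segment.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """GCC-PHAT sketch of formulas (8)-(10): delay of x2 relative to x1, in seconds."""
    n = len(x1) + len(x2)                        # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)                     # X1(w) * X2*(w)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)   # formulas (8)-(9)
    max_shift = n // 2 if max_tau is None else min(int(max_tau * fs), n // 2)
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))  # lags -max_shift..max_shift
    shift = np.argmax(np.abs(r)) - max_shift                 # formula (10): arg max
    return shift / fs
```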
Step 3: exclude erroneous data and perform speaker segmentation
First the invalid data must be removed; the time delay is computed according to formula (11):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{\mathrm{SNR}} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{\mathrm{SNR}} \end{cases} \qquad (11)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the signal-to-noise ratio at a given moment is less than the threshold $Thr_{\mathrm{SNR}}$, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment. The delay is then further computed according to formula (12):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases} \qquad (12)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the time-delay estimate at a given moment is less than the threshold Thr, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment.
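A minimal sketch of this error-data exclusion, following formulas (11)–(12), assuming a per-frame SNR estimate is available from the preprocessing step; the threshold values and the element-wise handling of vector-valued delays are assumptions of this sketch.

```python
import numpy as np

def clean_delays(raw_tau, snr, thr_snr=10.0, thr_tau=0.0):
    """Formulas (11)-(12): reuse the previous delay when the current frame is unreliable.

    raw_tau : (num_frames, d) raw delay-difference estimates, one row per frame.
    snr     : (num_frames,) per-frame signal-to-noise ratios.
    """
    tau = raw_tau.copy()
    for n in range(1, len(tau)):
        if snr[n] < thr_snr:                     # formula (11): low-SNR frame
            tau[n] = tau[n - 1]
        elif np.all(raw_tau[n] < thr_tau):       # formula (12): estimate below threshold
            tau[n] = tau[n - 1]
    return tau
```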
Speaker segmentation over the different spatial positions is then computed. First the posterior probability $\beta_i(\tau_k)$ is calculated, as shown in formula (13):
$$\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)}{\alpha_1\, g(\tau_k;\mu_1,\sigma_1^2) + \alpha_2\, g(\tau_k;\mu_2,\sigma_2^2) + \cdots + \alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)} \qquad (13)$$
where $\mu_i$ and $\sigma_i^2$ are the defined parameters, $\alpha_i = 1/i$, i denotes the number of GMM components, the initial values of $\mu_i$ and $\sigma_i^2$ are computed with the K-means algorithm, $\tau_k$ is the time-delay estimate vector obtained from formula (7), and $\beta_i(\tau_k)$ is the posterior probability.
Formula (14) is the parameter update algorithm:
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,\tau_k}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{d}\,\frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,(\tau_k-\mu_i)^T(\tau_k-\mu_i)}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\alpha}_i = \frac{1}{n}\sum_{k=1}^{n}\beta_i(\tau_k) \qquad (14)$$
where $\hat{\mu}_i$, $\hat{\sigma}_i^2$ and $\hat{\alpha}_i$ are the estimates of the GMM model parameters and $\beta_i(\tau_k)$ is the posterior probability computed by formula (13); the parameter update stops when the change in the estimates falls below min, where min is a constant representing the minimum tolerance value.
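This segmentation step can be sketched as a small EM loop over a Gaussian mixture on the delay vectors, following formulas (13)–(14); the use of spherical (scalar) variances, the random choice of initial means in place of a full K-means initialisation, and the stopping tolerance are simplifying assumptions of this sketch.

```python
import numpy as np

def gmm_segment(taus, num_speakers, num_iter=50, tol=1e-4):
    """Sketch of formulas (13)-(14): GMM posteriors and parameter updates on delay vectors.

    taus : (num_segments, d) array of delay estimate vectors from formula (7).
    Returns per-segment component labels (one component per spatial speaker position).
    """
    n, d = taus.shape
    rng = np.random.default_rng(0)
    mu = taus[rng.choice(n, num_speakers, replace=False)]     # stand-in for K-means init
    sigma2 = np.full(num_speakers, taus.var() + 1e-6)
    alpha = np.full(num_speakers, 1.0 / num_speakers)

    for _ in range(num_iter):
        # Formula (13): posteriors beta_i(tau_k) from spherical Gaussian components.
        d2 = ((taus[:, None, :] - mu[None]) ** 2).sum(-1)                 # (n, i)
        log_g = -0.5 * (d2 / sigma2 + d * np.log(2 * np.pi * sigma2))
        log_w = np.log(alpha) + log_g
        beta = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        beta /= beta.sum(axis=1, keepdims=True)

        # Formula (14): update means, variances and weights.
        nk = beta.sum(axis=0)
        new_mu = (beta.T @ taus) / nk[:, None]
        new_sigma2 = (beta * d2).sum(axis=0) / (d * nk) + 1e-8
        new_alpha = nk / n

        converged = np.abs(new_mu - mu).max() < tol           # tolerance "min"
        mu, sigma2, alpha = new_mu, new_sigma2, new_alpha
        if converged:
            break

    return beta.argmax(axis=1)
```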
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First the local density of each set is computed; the point with the largest density is taken as the first initial point, the next initial point is the point farthest from the first initial point, and so on, until the required number of initial points is reached.
Next the distance from each sample point to the set centers is computed and the center values are updated; the sample points satisfying formula (15) are selected as the new set centers:
$$\mathrm{Func} = \sum_{j=1}^{J}\sum_{n=1}^{M} \big\| \hat{\tau}[n] - \tau_j \big\|^{2} \qquad (15)$$
where $\|\hat{\tau}[n]-\tau_j\|$ is the distance between the time-delay estimate vector $\hat{\tau}[n]$ of each speech segment and the cluster center $\tau_j$, $\tau_j$ is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
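Step 4 can be sketched as a density-seeded K-means over the per-segment delay vectors that reduces the objective Func of formula (15); the neighbourhood radius used for the density, the rule for picking the remaining initial points and the convergence test are simplifying assumptions of this sketch.

```python
import numpy as np

def density_seeded_kmeans(taus, num_speakers, radius=1.0, num_iter=100):
    """Sketch of step 4: density-based initialisation followed by K-means refinement.

    taus : (num_segments, d) delay estimate vectors of the segmented speech pieces.
    Returns (labels, centers), where labels assigns each segment to a speaker cluster.
    """
    dist = np.linalg.norm(taus[:, None] - taus[None], axis=-1)   # pairwise distances

    # Initial centers: the densest point first, then points farthest from it.
    density = (dist < radius).sum(axis=1)
    first = density.argmax()
    order = np.argsort(-dist[first])                             # farthest from first
    centers = np.vstack([taus[first], taus[order[:num_speakers - 1]]])

    # K-means refinement: each pass lowers the objective Func of formula (15).
    for _ in range(num_iter):
        labels = np.linalg.norm(taus[:, None] - centers[None], axis=-1).argmin(axis=1)
        new_centers = np.array([taus[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(num_speakers)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```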
The present invention has the following advantages:
(1) The distributed asynchronous acoustic sensors proposed by the present invention place no strict restriction on the spatial positions of the sensors and impose only low requirements on the synchrony of the acquired signals, so their application is more flexible and broader than that of a microphone array;
(2) The present invention makes full use of the multiple delay differences between microphones, and between sound sources and microphones, for information fusion, and performs speaker segmentation with the time-delay estimate vectors, which reduces the complexity of traditional speaker segmentation algorithms while increasing robustness;
(3) The present invention makes full use of the advantages of distributed microphones in the spatial domain and performs speaker clustering on the time-delay estimate vectors of single-speaker speech segments, which reduces the complexity of traditional speaker clustering algorithms;
(4) The speaker clustering method for distributed microphones of the present invention can be applied to a variety of multi-person, multi-party dialogue scenes, has good robustness, and adapts to a variety of acoustic environments. The present invention can be realized on current handheld PCs, personal digital assistants (PDAs) or mobile phones, and its range of application is very broad.
Description of drawings
Fig. 1 is a schematic flow diagram of the present invention.
Fig. 2 is a schematic flow diagram of the endpoint detection of the present invention.
Fig. 3 is a schematic diagram of the sound-source time-delay estimation of the present invention.
Fig. 4 is a schematic flow diagram of the speaker segmentation and clustering of the present invention.
Embodiment
The present invention is described in detail below in conjunction with the accompanying drawings.
With reference to Fig. 1, a speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals acquired by the distributed microphones
With reference to Fig. 2, the multichannel sound-source signals obtained by the distributed microphones are first preprocessed. The signals are divided into frames and transformed with the fast Fourier transform (FFT), and endpoint detection is then performed to divide the signals into two classes, sound-source signal and non-sound-source signal. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. Early methods based on energy and zero-crossing rate can distinguish speech from noise accurately, but real speech is usually contaminated by considerable environmental noise, so a subband spectral entropy algorithm can be used for speech endpoint detection: the spectrum of each speech frame is first divided into n subbands (n is an integer greater than zero) and the spectral entropy of each subband is computed; the spectral entropy of every frame is then obtained from the subband spectral entropies of n successive frames through a group of order-statistics filters, and the input speech is classified according to the value of the spectral entropy. The concrete steps are: each frame of the speech signal is passed through the fast Fourier transform (FFT) to obtain $N_{FFT}$ points $Y_i$ ($0 \le i \le N_{FFT}$) on the power spectrum; the probability density of each point in the spectral domain can be expressed by formula (1):
$$p_i = Y_i \Big/ \sum_{k=0}^{N_{FFT}-1} Y_k \qquad (1)$$
where $Y_k$ is the k-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is the i-th point of the speech signal on the power spectrum after the FFT, $N_{FFT}$ is the number of points i, and $p_i$ is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):
$$H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k) \qquad (2)$$
where $p_k$ is the probability density of the k-th point in the spectral domain, $N_{FFT}$ is the number of points i, and H is the entropy function in the spectral domain.
The $N_{FFT}$ points on the frequency domain are divided into K non-overlapping frequency ranges, called subbands, and the probability of each point in the spectral domain of the l-th frame is computed as in formula (3):
$$p_l[k,i] = (Y_i + Q) \Big/ \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q) \qquad (3)$$
where $Y_j$ is the j-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is a point in the k-th subband, $m_k$ ($0 \le k \le K-1$, $m_k \le i \le m_{k+1}-1$) is the lower boundary of the subband, Q is a constant, and $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame.
According to the definition of information entropy, the spectral entropy of the k-th subband of the l-th frame is given by formula (4):
$$E_s[l,k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k,i]\,\log\big(p_l[k,i]\big) \quad (0 \le k \le K-1) \qquad (4)$$
where $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame and $E_s[l,k]$ is the spectral entropy of the k-th subband of the l-th frame.
The spectral information entropy of the l-th frame can then be computed according to formula (5):
$$H_l = -\frac{1}{K}\sum_{k=0}^{K-1} E_h[l,k] \qquad (5)$$
where $E_h[l,k]$ is the spectral entropy of the k-th subband of the l-th frame after filter smoothing, K is the number of subbands, and $H_l$ is the spectral information entropy of the l-th frame; $E_h[l,k]$ is defined as shown in formula (6):
$$E_h[l,k] = (1-\lambda)\,E_{s(h)}[l,k] + \lambda\,E_{s(h+1)}[l,k] \quad (0 \le k \le K-1) \qquad (6)$$
where $E_{s(h)}[l,k]$ is obtained as follows: the order-statistics filter of each subband acts on a group of L subband information entropies $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; this group of subband information entropies is sorted in ascending order, and $E_{s(h)}[l,k]$ is the h-th largest value among $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; $\lambda$ is a constant, and $E_h[l,k]$ is the information entropy of the k-th subband of the l-th frame after filter smoothing.
From formula (5), every frame has a spectral entropy $H_l$; when the value of $H_l$ is greater than a preset threshold T, the l-th frame is judged to be a speech frame, otherwise it is judged to be a non-speech frame. The threshold T is defined as $T = \beta \cdot Avg + \theta$, where $Avg = \frac{1}{K}\sum_{k=0}^{K-1} E_m[k]$, $\beta = 0.01$, $\theta = 0.1$, $E_m[k]$ is the median of $E_s[0,k],\ldots,E_s[N-1,k]$, and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
With reference to Fig. 3, the spatial coordinates are first established. The concrete method is: the microphones are numbered in order as M1, M2, ..., Mn, where n is an integer greater than 1; the two microphones initially numbered 1 and 2, M1 and M2, are selected; the position of microphone M1 is taken as the coordinate origin, and the direction from M1 to M2 as the initial coordinate axis. Then every 50 frames of the speech signal are treated as one speech segment, and the time-delay estimation method is used to estimate the delay difference between each pair of microphones for every segment, yielding n(n-1) time-delay estimates, as shown in formula (7):
$$\tau_k = \begin{bmatrix} \hat{\tau}_{12} & \hat{\tau}_{13} & \cdots & \hat{\tau}_{ij} \end{bmatrix}^T \qquad (7)$$
where $\hat{\tau}_{ij}$ is the estimated delay difference between the i-th microphone and the j-th microphone, and $\tau_k$ is the delay-difference estimate vector.
The time delay is estimated with the PHAT (phase transform) weighting algorithm; its weighting coefficient is shown in formula (8), and the time-delay estimation is given by formulas (9)–(10):
$$W(\omega) = \frac{1}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|} \qquad (8)$$
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, and $*$ denotes complex conjugation.
$$R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega)\cdot X_1(\omega)\cdot X_2^{*}(\omega)\big) \qquad (9)$$
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, $*$ denotes complex conjugation, IFFT is the inverse FFT, and $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals.
$$\hat{\tau} = \arg\max_{n} R_{x_1 x_2}(n) \qquad (10)$$
where $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals and $\hat{\tau}$ is the time-delay estimate between $x_1$ and $x_2$.
Step 3: exclude erroneous data and perform speaker segmentation
With reference to Fig. 4, the invalid data must first be removed; the time delay is computed according to formula (11):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{\mathrm{SNR}} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{\mathrm{SNR}} \end{cases} \qquad (11)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the signal-to-noise ratio at a given moment is less than the threshold $Thr_{\mathrm{SNR}}$, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment. The delay is then further computed according to formula (12):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases} \qquad (12)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the time-delay estimate at a given moment is less than the threshold Thr, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment.
Speaker segmentation over the different spatial positions is then computed. First the posterior probability $\beta_i(\tau_k)$ is calculated, as shown in formula (13):
$$\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)}{\alpha_1\, g(\tau_k;\mu_1,\sigma_1^2) + \alpha_2\, g(\tau_k;\mu_2,\sigma_2^2) + \cdots + \alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)} \qquad (13)$$
where $\mu_i$ and $\sigma_i^2$ are the defined parameters, $\alpha_i = 1/i$, i denotes the number of GMM components, the initial values of $\mu_i$ and $\sigma_i^2$ are computed with the K-means algorithm, $\tau_k$ is the time-delay estimate vector obtained from formula (7), and $\beta_i(\tau_k)$ is the posterior probability.
Formula (14) is the parameter update algorithm:
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,\tau_k}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{d}\,\frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,(\tau_k-\mu_i)^T(\tau_k-\mu_i)}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\alpha}_i = \frac{1}{n}\sum_{k=1}^{n}\beta_i(\tau_k) \qquad (14)$$
where $\hat{\mu}_i$, $\hat{\sigma}_i^2$ and $\hat{\alpha}_i$ are the estimates of the GMM model parameters and $\beta_i(\tau_k)$ is the posterior probability computed by formula (13); the parameter update stops when the change in the estimates falls below min, where min is a constant representing the minimum tolerance value.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments; this algorithm overcomes the defect that the performance of the standard K-means algorithm is strongly influenced by the initial values and by outliers.
First the local density of each set is computed; the point with the largest density is taken as the first initial point, the next initial point is the point farthest from the first initial point, and so on, until the required number of initial points is reached.
Next the distance from each sample point to the set centers is computed and the center values are updated; the sample points satisfying formula (15) are selected as the new set centers:
$$\mathrm{Func} = \sum_{j=1}^{J}\sum_{n=1}^{M} \big\| \hat{\tau}[n] - \tau_j \big\|^{2} \qquad (15)$$
where $\|\hat{\tau}[n]-\tau_j\|$ is the distance between the time-delay estimate vector $\hat{\tau}[n]$ of each speech segment and the cluster center $\tau_j$, $\tau_j$ is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
In the accompanying drawings, the indicated vectors are the position vector of one single sound source, the position vector of another single sound source, and the position vectors of the single microphones $M_i$, $M_k$ and $M_j$, respectively.

Claims (1)

1. A speaker clustering method for distributed microphones, characterized by comprising the following steps:
Step 1: preprocess the signals acquired by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed. The signals are divided into frames and transformed with the fast Fourier transform (FFT), and endpoint detection is then performed to divide the signals into two classes, sound-source signal and non-sound-source signal. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. A subband spectral entropy algorithm can be used for speech endpoint detection: the spectrum of each speech frame is first divided into n subbands, n being an integer greater than zero, and the spectral entropy of each subband is computed; the spectral entropy of every frame is then obtained from the subband spectral entropies of n successive frames through a group of order-statistics filters, and the input speech is classified according to the value of the spectral entropy. The concrete steps are: each frame of the speech signal is passed through the fast Fourier transform (FFT) to obtain $N_{FFT}$ points $Y_i$ ($0 \le i \le N_{FFT}$) on the power spectrum; the probability density of each point in the spectral domain can be expressed by formula (1):
$$p_i = Y_i \Big/ \sum_{k=0}^{N_{FFT}-1} Y_k \qquad (1)$$
where $Y_k$ is the k-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is the i-th point of the speech signal on the power spectrum after the FFT, $N_{FFT}$ is the number of points i, and $p_i$ is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):
$$H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k) \qquad (2)$$
where $p_k$ is the probability density of the k-th point in the spectral domain, $N_{FFT}$ is the number of points i, and H is the entropy function in the spectral domain.
The $N_{FFT}$ points on the frequency domain are divided into K non-overlapping frequency ranges, called subbands, and the probability of each point in the spectral domain of the l-th frame is computed as in formula (3):
$$p_l[k,i] = (Y_i + Q) \Big/ \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q) \qquad (3)$$
where $Y_j$ is the j-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is a point in the k-th subband, $m_k$ ($0 \le k \le K-1$, $m_k \le i \le m_{k+1}-1$) is the lower boundary of the subband, Q is a constant, and $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame.
According to the definition of information entropy, the spectral entropy of the k-th subband of the l-th frame is given by formula (4):
$$E_s[l,k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k,i]\,\log\big(p_l[k,i]\big) \quad (0 \le k \le K-1) \qquad (4)$$
where $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame and $E_s[l,k]$ is the spectral entropy of the k-th subband of the l-th frame.
The spectral information entropy of the l-th frame can then be computed according to formula (5):
$$H_l = -\frac{1}{K}\sum_{k=0}^{K-1} E_h[l,k] \qquad (5)$$
where $E_h[l,k]$ is the spectral entropy of the k-th subband of the l-th frame after filter smoothing, K is the number of subbands, and $H_l$ is the spectral information entropy of the l-th frame; $E_h[l,k]$ is defined as shown in formula (6):
$$E_h[l,k] = (1-\lambda)\,E_{s(h)}[l,k] + \lambda\,E_{s(h+1)}[l,k] \quad (0 \le k \le K-1) \qquad (6)$$
where $E_{s(h)}[l,k]$ is obtained as follows: the order-statistics filter of each subband acts on a group of L subband information entropies $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; this group of subband information entropies is sorted in ascending order, and $E_{s(h)}[l,k]$ is the h-th largest value among $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; $\lambda$ is a constant, and $E_h[l,k]$ is the information entropy of the k-th subband of the l-th frame after filter smoothing.
From formula (5), every frame has a spectral entropy $H_l$; when the value of $H_l$ is greater than a preset threshold T, the l-th frame is judged to be a speech frame, otherwise it is judged to be a non-speech frame. The threshold T is defined as $T = \beta \cdot Avg + \theta$, where $Avg = \frac{1}{K}\sum_{k=0}^{K-1} E_m[k]$, $\beta = 0.01$, $\theta = 0.1$, $E_m[k]$ is the median of $E_s[0,k],\ldots,E_s[N-1,k]$, and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First the spatial coordinates are established. The concrete method is: the microphones are numbered in order as M1, M2, ..., Mn, where n is an integer greater than 1; the two microphones initially numbered 1 and 2, M1 and M2, are selected; the position of M1 is taken as the coordinate origin, and the direction from M1 to M2 as the initial coordinate axis. Then every 50 frames of the speech signal are treated as one speech segment, and the time-delay estimation method is used to estimate the delay difference between each pair of microphones for every segment, yielding n(n-1) delay-difference estimates, as shown in formula (7):
$$\tau_k = \begin{bmatrix} \hat{\tau}_{12} & \hat{\tau}_{13} & \cdots & \hat{\tau}_{ij} \end{bmatrix}^T \qquad (7)$$
where $\hat{\tau}_{ij}$ is the estimated delay difference between the i-th microphone and the j-th microphone, and $\tau_k$ is the delay-difference estimate vector.
The time delay is estimated with the PHAT (phase transform) weighting algorithm; its weighting coefficient is shown in formula (8), and the time-delay estimation is given by formulas (9)–(10):
$$W(\omega) = \frac{1}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|} \qquad (8)$$
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, and $*$ denotes complex conjugation.
$$R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega)\cdot X_1(\omega)\cdot X_2^{*}(\omega)\big) \qquad (9)$$
$$\hat{\tau} = \arg\max_{n} R_{x_1 x_2}(n) \qquad (10)$$
where $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals and $\hat{\tau}$ is the time-delay estimate between $x_1$ and $x_2$.
Step 3: exclude erroneous data and perform speaker segmentation
First the invalid data must be removed; the time delay is computed according to formula (11):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{\mathrm{SNR}} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{\mathrm{SNR}} \end{cases} \qquad (11)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the signal-to-noise ratio at a given moment is less than the threshold $Thr_{\mathrm{SNR}}$, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment. The delay is then further computed according to formula (12):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases} \qquad (12)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the time-delay estimate at a given moment is less than the threshold Thr, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment.
Speaker segmentation over the different spatial positions is then computed. First the posterior probability $\beta_i(\tau_k)$ is calculated, as shown in formula (13):
$$\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)}{\alpha_1\, g(\tau_k;\mu_1,\sigma_1^2) + \alpha_2\, g(\tau_k;\mu_2,\sigma_2^2) + \cdots + \alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)} \qquad (13)$$
where $\mu_i$ and $\sigma_i^2$ are the defined parameters, $\alpha_i = 1/i$, i denotes the number of GMM components, the initial values of $\mu_i$ and $\sigma_i^2$ are computed with the K-means algorithm, $\tau_k$ is the time-delay estimate vector obtained from formula (7), and $\beta_i(\tau_k)$ is the posterior probability.
Formula (14) is the parameter update algorithm:
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,\tau_k}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{d}\,\frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,(\tau_k-\mu_i)^T(\tau_k-\mu_i)}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\alpha}_i = \frac{1}{n}\sum_{k=1}^{n}\beta_i(\tau_k) \qquad (14)$$
where $\hat{\mu}_i$, $\hat{\sigma}_i^2$ and $\hat{\alpha}_i$ are the estimates of the GMM model parameters and $\beta_i(\tau_k)$ is the posterior probability computed by formula (13); the parameter update stops when the change in the estimates falls below min, where min is a constant representing the minimum tolerance value.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First the local density of each set is computed; the point with the largest density is taken as the first initial point, the next initial point is the point farthest from the first initial point, and so on, until the required number of initial points is reached.
Next the distance from each sample point to the set centers is computed and the center values are updated; the sample points satisfying formula (15) are selected as the new set centers:
$$\mathrm{Func} = \sum_{j=1}^{J}\sum_{n=1}^{M} \big\| \hat{\tau}[n] - \tau_j \big\|^{2} \qquad (15)$$
where $\|\hat{\tau}[n]-\tau_j\|$ is the distance between the time-delay estimate vector $\hat{\tau}[n]$ of each speech segment and the cluster center $\tau_j$, $\tau_j$ is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
CN2010105683868A 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone Active CN102074236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105683868A CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105683868A CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Publications (2)

Publication Number Publication Date
CN102074236A CN102074236A (en) 2011-05-25
CN102074236B true CN102074236B (en) 2012-06-06

Family

ID=44032754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105683868A Active CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Country Status (1)

Country Link
CN (1) CN102074236B (en)

Families Citing this family (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102300140B (en) 2011-08-10 2013-12-18 歌尔声学股份有限公司 Speech enhancing method and device of communication earphone and noise reduction communication earphone
CN102509548B (en) * 2011-10-09 2013-06-12 清华大学 Audio indexing method based on multi-distance sound sensor
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
US9185199B2 (en) * 2013-03-12 2015-11-10 Google Technology Holdings LLC Method and apparatus for acoustically characterizing an environment in which an electronic device resides
CN103175897B (en) * 2013-03-13 2015-08-05 西南交通大学 A kind of high-speed switch hurt recognition methods based on vibration signal end-point detection
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104347068B (en) * 2013-08-08 2020-05-22 索尼公司 Audio signal processing device and method and monitoring system
CN103439688B (en) * 2013-08-27 2015-04-22 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN104575498B (en) * 2015-01-30 2018-08-17 深圳市云之讯网络技术有限公司 Efficient voice recognition methods and system
CN104767739B (en) * 2015-03-23 2018-01-30 电子科技大学 The method that unknown multi-protocols blended data frame is separated into single protocol data frame
CN104766093B (en) * 2015-04-01 2018-02-16 中国科学院上海微***与信息技术研究所 A kind of acoustic target sorting technique based on microphone array
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN105388459B (en) * 2015-11-20 2017-08-11 清华大学 The robust sound source space-location method of distributed microphone array network
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN106981289A (en) * 2016-01-14 2017-07-25 芋头科技(杭州)有限公司 A kind of identification model training method and system and intelligent terminal
CN105869645B (en) * 2016-03-25 2019-04-12 腾讯科技(深圳)有限公司 Voice data processing method and device
CN109155130A (en) * 2016-05-13 2019-01-04 伯斯有限公司 Handle the voice from distributed microphone
US10249305B2 (en) * 2016-05-19 2019-04-02 Microsoft Technology Licensing, Llc Permutation invariant training for talker-independent multi-talker speech separation
CN106405499A (en) * 2016-09-08 2017-02-15 南京阿凡达机器人科技有限公司 Method for robot to position sound source
CN107886951B (en) * 2016-09-29 2021-07-23 百度在线网络技术(北京)有限公司 Voice detection method, device and equipment
CN106504773B (en) * 2016-11-08 2023-08-01 上海贝生医疗设备有限公司 Wearable device and voice and activity monitoring system
CN106940997B (en) * 2017-03-20 2020-04-28 海信集团有限公司 Method and device for sending voice signal to voice recognition system
CN107202976B (en) * 2017-05-15 2020-08-14 大连理工大学 Low-complexity distributed microphone array sound source positioning system
CN109215667B (en) 2017-06-29 2020-12-22 华为技术有限公司 Time delay estimation method and device
CN107393549A (en) * 2017-07-21 2017-11-24 北京华捷艾米科技有限公司 Delay time estimation method and device
CN107885323B (en) * 2017-09-21 2020-06-12 南京邮电大学 VR scene immersion control method based on machine learning
CN108364637B (en) * 2018-02-01 2021-07-13 福州大学 Audio sentence boundary detection method
CN108665894A (en) * 2018-04-06 2018-10-16 东莞市华睿电子科技有限公司 A kind of voice interactive method of household appliance
CN108872939B (en) * 2018-04-29 2020-09-29 桂林电子科技大学 Indoor space geometric outline reconstruction method based on acoustic mirror image model
CN109087648B (en) * 2018-08-21 2023-10-20 平安科技(深圳)有限公司 Counter voice monitoring method and device, computer equipment and storage medium
CN109658948B (en) * 2018-12-21 2021-04-16 南京理工大学 Migratory bird migration activity-oriented acoustic monitoring method
CN109618273B (en) * 2018-12-29 2020-08-04 北京声智科技有限公司 Microphone quality inspection device and method
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110290468B (en) * 2019-07-04 2020-09-22 英华达(上海)科技有限公司 Virtual sound insulation communication method, device, system, electronic device and storage medium
CN110428842A (en) * 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN110501674A (en) * 2019-08-20 2019-11-26 长安大学 A kind of acoustical signal non line of sight recognition methods based on semi-supervised learning
CN111063341B (en) * 2019-12-31 2022-05-06 思必驰科技股份有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112735385B (en) * 2020-12-30 2024-05-31 中国科学技术大学 Voice endpoint detection method, device, computer equipment and storage medium
CN112684437B (en) * 2021-01-12 2023-08-11 浙江大学 Passive ranging method based on time domain warping transformation
CN112684412B (en) * 2021-01-12 2022-09-13 中北大学 Sound source positioning method and system based on pattern clustering
CN113096669B (en) * 2021-03-31 2022-05-27 重庆风云际会智慧科技有限公司 Speech recognition system based on role recognition
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113573212B (en) * 2021-06-04 2023-04-25 成都千立智能科技有限公司 Sound amplifying system and microphone channel data selection method
CN113380234B (en) * 2021-08-12 2021-12-17 明品云(北京)数据科技有限公司 Method, device, equipment and medium for generating form based on voice recognition
CN113808612B (en) * 2021-11-18 2022-02-11 阿里巴巴达摩院(杭州)科技有限公司 Voice processing method, device and storage medium
CN116030815B (en) * 2023-03-30 2023-06-20 北京建筑大学 Voice segmentation clustering method and device based on sound source position

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02209027A (en) * 1989-02-09 1990-08-20 Fujitsu Ltd Acoustic echo canceller
JPH1097276A (en) * 1996-09-20 1998-04-14 Canon Inc Method and device for speech recognition, and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Also Published As

Publication number Publication date
CN102074236A (en) 2011-05-25

Similar Documents

Publication Publication Date Title
CN102074236B (en) Speaker clustering method for distributed microphone
CN102103200B (en) Acoustic source spatial positioning method for distributed asynchronous acoustic sensor
CN108464015B (en) Microphone array signal processing system
CN108731886B (en) A kind of more leakage point acoustic fix ranging methods of water supply line based on iteration recursion
CN106373589B (en) A kind of ears mixing voice separation method based on iteration structure
JP4816711B2 (en) Call voice processing apparatus and call voice processing method
CN102565759B (en) Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN111429939B (en) Sound signal separation method of double sound sources and pickup
EP3387648A1 (en) Localization algorithm for sound sources with known statistics
CN101593522A (en) A kind of full frequency domain digital hearing aid method and apparatus
CN103901401A (en) Binaural sound source positioning method based on binaural matching filter
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN109859749A (en) A kind of voice signal recognition methods and device
Al-Karawi et al. Early reflection detection using autocorrelation to improve robustness of speaker verification in reverberant conditions
Han et al. Robust GSC-based speech enhancement for human machine interface
CN103901400A (en) Binaural sound source positioning method based on delay compensation and binaural coincidence
WO2004084187A1 (en) Object sound detection method, signal input delay time detection method, and sound signal processing device
Wang et al. Localization based sequential grouping for continuous speech separation
CN111179959B (en) Competitive speaker number estimation method and system based on speaker embedding space
CN111429916B (en) Sound signal recording system
JP2017067948A (en) Voice processor and voice processing method
Gburrek et al. A meeting transcription system for an ad-hoc acoustic sensor network
Himawan et al. Clustering of ad-hoc microphone arrays for robust blind beamforming
Imoto et al. Spatial-feature-based acoustic scene analysis using distributed microphone array
Venkatesan et al. Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20181115

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Patentee after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 Beijing 100084 box 82 box, Tsinghua University Patent Office

Patentee before: Tsinghua University