CN102074236B - Speaker clustering method for distributed microphone - Google Patents

Speaker clustering method for distributed microphone

Info

Publication number
CN102074236B
CN102074236B CN2010105683868A CN201010568386A
Authority
CN
China
Prior art keywords
frame
point
formula
subband
time delay
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2010105683868A
Other languages
Chinese (zh)
Other versions
CN102074236A (en)
Inventor
杨毅 (Yang Yi)
刘加 (Liu Jia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co Ltd
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN2010105683868A priority Critical patent/CN102074236B/en
Publication of CN102074236A publication Critical patent/CN102074236A/en
Application granted granted Critical
Publication of CN102074236B publication Critical patent/CN102074236B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a speaker clustering method for distributed microphones, which comprises the following steps: first, preprocessing the signals acquired by the distributed microphones; then applying a time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors; next, excluding erroneous data and performing speaker segmentation; and finally performing speaker clustering according to the speaker segmentation result. The distributed microphones serve as the signal acquisition and output devices for computing the time-delay vectors of the speech signal segments; the time-delay estimation accuracy is improved by excluding erroneous data, and a clustering algorithm is applied to the time-delay vectors so that the speech signal segments are classified by speaker identity. The devices are inexpensive and convenient to use, and the speaker clustering method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.

Description

A speaker clustering method for distributed microphones
Technical field
The invention belongs to the field of speech technology, and in particular relates to a speaker clustering method for distributed microphones.
Background technology
With the continuous development of network and communication technology, existing multimedia, network, communication and distributed processing technology can realize multi-person, multi-party dialogue in complex acoustic environments. Traditional sound-source input and recording devices include head-mounted microphones, omnidirectional and directional single microphones, microphone arrays and so on. A single microphone, as a traditional sound-source input and recording device, has advantages such as small size and low cost, but it can neither handle environmental noise nor localize sound sources. A microphone array is composed of multiple microphones placed at specific geometric positions and performs joint time-space processing of the spatial signal; its capabilities include identifying and separating sound sources, sound-source localization under reverberant conditions, speech signal enhancement, and so on.
A distributed microphone system is a sound signal acquisition system composed of multiple single microphones: each microphone is controlled by a different device, no restriction is placed on the arrangement and spacing of the microphones, and the signals collected by the microphones are not perfectly synchronized in the time domain. Distributed microphones are simple in structure, easy to use and low in cost; they meet the requirements of multi-source, multi-direction complex dialogue scenes and can effectively support applications such as speaker clustering, recognition and localization. Unlike a microphone array system, a distributed microphone system places no constraint on the positions and placement of the microphones; furthermore, the positions of the sound sources and microphones in the distributed system are unknown.
Automatic classification of acoustic information is one of the research topics in the field of speech signal processing, and speaker segmentation (Speaker Segmentation) and speaker clustering (Speaker Clustering) are important components of it. The usual approach is: speaker segmentation divides the whole test speech into a series of speech segments, each of which belongs to only one specific speaker; speaker clustering is responsible for grouping the scattered speech segments that belong to the same speaker into one class.
Traditional speaker segmentation methods are mostly based on a sliding-window statistical approach with Gaussian models: different distance measures are selected, and the segmentation points are obtained by merging based on the Bayesian information criterion. Speaker clustering can adopt the evolutive Hidden Markov Model (EHMM) method, which updates the segmentation result by weighing path scores. When the number of speakers is not limited, hierarchical clustering can be used for speaker clustering.
The speaker clustering method of a microphone array mainly exploits the differences in speakers' spatial positions for classification. The basic principle is: the time-delay estimate vector is taken as the speaker's spatial feature, and these features are integrated and classified in a GMM/HMM (Gaussian mixture model / hidden Markov model) framework. The time-delay estimation algorithms for microphone arrays mainly include the GCC (generalized cross-correlation) method and the LMS (least mean square error) method. The GCC method is seriously affected by reverberation; improved versions include the CEP (cepstral pre-filtering) method and the pitch-weighted GCC method, while EVD (eigenvalue decomposition) and the time-delay estimation method based on the ATF (acoustic transfer function) use subspace techniques and transfer-function ratios, respectively. However, a microphone array system is sensitive to sampling errors between devices during computation, so it imposes very strict requirements on the synchrony of the audio data; moreover, in a common multi-person, multi-party conference scene the number of sound sources, the microphone positions and the room acoustics are all unknown, that is, the audio data must be processed in a scene where both temporal and spatial prior information are lacking.
A single microphone, as a traditional sound-source input and recording device, is cheap and simple in structure, but it is susceptible to environmental interference and cannot localize sound sources; conventional microphone array systems have been widely researched, and the main reasons they have not been commercialized are the high price of the dedicated hardware and the high complexity of the algorithms.
Summary of the invention
In order to overcome the shortcomings of the above prior art, the object of the invention is to propose a speaker clustering method for distributed microphones. The distributed microphones serve as the signal acquisition and output devices; the time-delay vectors of the speech signal segments are computed, the time-delay estimation accuracy is improved by excluding erroneous data, and a clustering algorithm is applied to the time-delay vectors so that the speech signal segments are classified by speaker identity. The devices are inexpensive and convenient to use, and the method can be applied to multi-person, multi-party dialogue scenes in complex acoustic environments.
A speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals acquired by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed. The signals are divided into frames and transformed with the fast Fourier transform (FFT), and endpoint detection is then performed to divide the signals into two classes, sound-source signal and non-sound-source signal. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. A subband spectral entropy algorithm can be used for speech endpoint detection: the spectrum of each speech frame is first divided into n subbands (n is an integer greater than zero) and the spectral entropy of each subband is computed; the spectral entropy of every frame is then obtained from the subband spectral entropies of n successive frames through a group of order-statistics filters, and the input speech is classified according to the value of the spectral entropy. The concrete steps are: each frame of the speech signal is passed through the fast Fourier transform (FFT) to obtain $N_{FFT}$ points $Y_i$ ($0 \le i \le N_{FFT}$) on the power spectrum; the probability density of each point in the spectral domain can be expressed by formula (1):
$$p_i = Y_i \Big/ \sum_{k=0}^{N_{FFT}-1} Y_k \qquad (1)$$
where $Y_k$ is the k-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is the i-th point of the speech signal on the power spectrum after the FFT, $N_{FFT}$ is the number of points i, and $p_i$ is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):
$$H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k) \qquad (2)$$
where $p_k$ is the probability density of the k-th point in the spectral domain, $N_{FFT}$ is the number of points i, and H is the entropy function in the spectral domain.
The $N_{FFT}$ points on the frequency domain are divided into K non-overlapping frequency ranges, called subbands, and the probability of each point in the spectral domain of the l-th frame is computed as in formula (3):
$$p_l[k,i] = (Y_i + Q) \Big/ \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q) \qquad (3)$$
where $Y_j$ is the j-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is a point in the k-th subband, $m_k$ ($0 \le k \le K-1$, $m_k \le i \le m_{k+1}-1$) is the lower boundary of the subband, Q is a constant, and $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame.
According to the definition of information entropy, the spectral entropy of the k-th subband of the l-th frame is given by formula (4):
$$E_s[l,k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k,i]\,\log\big(p_l[k,i]\big) \quad (0 \le k \le K-1) \qquad (4)$$
where $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame and $E_s[l,k]$ is the spectral entropy of the k-th subband of the l-th frame.
The spectral information entropy of the l-th frame can then be computed according to formula (5):
$$H_l = -\frac{1}{K}\sum_{k=0}^{K-1} E_h[l,k] \qquad (5)$$
where $E_h[l,k]$ is the spectral entropy of the k-th subband of the l-th frame after filter smoothing, K is the number of subbands, and $H_l$ is the spectral information entropy of the l-th frame; $E_h[l,k]$ is defined as shown in formula (6):
$$E_h[l,k] = (1-\lambda)\,E_{s(h)}[l,k] + \lambda\,E_{s(h+1)}[l,k] \quad (0 \le k \le K-1) \qquad (6)$$
where $E_{s(h)}[l,k]$ is obtained as follows: the order-statistics filter of each subband acts on a group of L subband information entropies $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; this group of subband information entropies is sorted in ascending order, and $E_{s(h)}[l,k]$ is the h-th largest value among $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; $\lambda$ is a constant, and $E_h[l,k]$ is the information entropy of the k-th subband of the l-th frame after filter smoothing.
From formula (5), every frame has a spectral entropy $H_l$; when the value of $H_l$ is greater than a preset threshold T, the l-th frame is judged to be a speech frame, otherwise it is judged to be a non-speech frame. The threshold T is defined as $T = \beta \cdot Avg + \theta$, where $Avg = \frac{1}{K}\sum_{k=0}^{K-1} E_m[k]$, $\beta = 0.01$, $\theta = 0.1$, $E_m[k]$ is the median of $E_s[0,k],\ldots,E_s[N-1,k]$, and Avg is the noise estimate over the first N frames of the input signal.
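To make this endpoint-detection step concrete, the following is a minimal Python/NumPy sketch of the subband spectral-entropy computation of formulas (1)–(6) and the threshold test on $H_l$; the specific parameter values (K, Q, $\lambda$, N, h, the number of noise frames) and the boundary handling at the start of the signal are illustrative assumptions of this sketch, not values fixed by the patent.

```python
import numpy as np

def subband_spectral_entropy_vad(frames, K=8, Q=1e-6, lam=0.5, N=2, h=2,
                                 beta=0.01, theta=0.1, noise_frames=10):
    """Minimal sketch of subband spectral-entropy endpoint detection.

    frames : (num_frames, frame_len) array of windowed time-domain frames.
    Returns a boolean array, True where a frame is judged to be speech.
    """
    Y = np.abs(np.fft.rfft(frames, axis=1)) ** 2           # power spectrum per frame
    num_frames = Y.shape[0]
    edges = np.linspace(0, Y.shape[1], K + 1, dtype=int)   # subband boundaries m_k

    # Formulas (3)-(4): per-subband probabilities and entropies E_s[l, k].
    E_s = np.zeros((num_frames, K))
    for k in range(K):
        band = Y[:, edges[k]:edges[k + 1]] + Q
        p = band / band.sum(axis=1, keepdims=True)
        E_s[:, k] = np.sum(p * np.log(p), axis=1)

    # Formula (6): order-statistics filtering over 2N+1 neighbouring frames.
    E_h = np.zeros_like(E_s)
    for l in range(num_frames):
        window = np.sort(E_s[max(0, l - N):l + N + 1], axis=0)[::-1]  # descending
        e_h  = window[min(h - 1, len(window) - 1)]    # h-th largest value
        e_h1 = window[min(h,     len(window) - 1)]    # (h+1)-th largest value
        E_h[l] = (1 - lam) * e_h + lam * e_h1

    # Formula (5) and the threshold T = beta * Avg + theta.
    H = -E_h.mean(axis=1)
    E_m = np.median(E_s[:noise_frames], axis=0)       # median entropy of first frames
    T = beta * E_m.mean() + theta                     # Avg = mean over subbands of E_m[k]
    return H > T
```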
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First the spatial coordinates are established. The concrete method is: the microphones are numbered in order as M1, M2, ..., Mn, where n is an integer greater than 1; the two microphones initially numbered 1 and 2, M1 and M2, are selected; the position of microphone M1 is taken as the coordinate origin, and the direction from M1 to M2 as the initial coordinate axis. Then every 50 frames of the speech signal are treated as one speech segment, and the time-delay estimation method is used to estimate the delay difference between each pair of microphones for every segment, yielding n(n-1) time-delay estimates, as shown in formula (7):
τ k = τ ^ 12 τ ^ 13 L τ ^ ij T - - - ( 7 )
where $\hat{\tau}_{ij}$ is the estimated delay difference between the i-th microphone and the j-th microphone, and $\tau_k$ is the delay-difference estimate vector.
The time delay is estimated with the PHAT (phase transform) weighting algorithm; its weighting coefficient is shown in formula (8), and the time-delay estimation is given by formulas (9)–(10):
W ( ω ) = 1 | X 1 ( ω ) X 2 * ( ω ) | - - - ( 8 )
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, and $*$ denotes complex conjugation.
$$R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega)\cdot X_1(\omega)\cdot X_2^{*}(\omega)\big) \qquad (9)$$
$$\hat{\tau} = \arg\max_{n} R_{x_1 x_2}(n) \qquad (10)$$
where $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals and $\hat{\tau}$ is the time-delay estimate between $x_1$ and $x_2$.
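As an illustration of formulas (8)–(10), here is a minimal Python/NumPy sketch of PHAT-weighted generalized cross-correlation for one microphone pair; the zero-padding length, the small regularization constant and the conversion of the lag index to seconds are assumptions of this sketch rather than details given in the patent. The delay-difference vector of formula (7) would stack such estimates for every microphone pair of a 50-frame segment.

```python
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """GCC-PHAT sketch of formulas (8)-(10): delay of x2 relative to x1, in seconds."""
    n = len(x1) + len(x2)                        # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)                     # X1(w) * X2*(w)
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)   # formulas (8)-(9)
    max_shift = n // 2 if max_tau is None else min(int(max_tau * fs), n // 2)
    r = np.concatenate((r[-max_shift:], r[:max_shift + 1]))  # lags -max_shift..max_shift
    shift = np.argmax(np.abs(r)) - max_shift                 # formula (10): arg max
    return shift / fs
```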
Step 3: exclude erroneous data and perform speaker segmentation
First the invalid data must be removed; the time delay is computed according to formula (11):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{\mathrm{SNR}} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{\mathrm{SNR}} \end{cases} \qquad (11)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the signal-to-noise ratio at a given moment is less than the threshold $Thr_{\mathrm{SNR}}$, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment. The delay is then further computed according to formula (12):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases} \qquad (12)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the time-delay estimate at a given moment is less than the threshold Thr, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment.
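A minimal sketch of this error-data exclusion, following formulas (11)–(12), assuming a per-frame SNR estimate is available from the preprocessing step; the threshold values and the element-wise handling of vector-valued delays are assumptions of this sketch.

```python
import numpy as np

def clean_delays(raw_tau, snr, thr_snr=10.0, thr_tau=0.0):
    """Formulas (11)-(12): reuse the previous delay when the current frame is unreliable.

    raw_tau : (num_frames, d) raw delay-difference estimates, one row per frame.
    snr     : (num_frames,) per-frame signal-to-noise ratios.
    """
    tau = raw_tau.copy()
    for n in range(1, len(tau)):
        if snr[n] < thr_snr:                     # formula (11): low-SNR frame
            tau[n] = tau[n - 1]
        elif np.all(raw_tau[n] < thr_tau):       # formula (12): estimate below threshold
            tau[n] = tau[n - 1]
    return tau
```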
Speaker segmentation over the different spatial positions is then computed. First the posterior probability $\beta_i(\tau_k)$ is calculated, as shown in formula (13):
$$\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)}{\alpha_1\, g(\tau_k;\mu_1,\sigma_1^2) + \alpha_2\, g(\tau_k;\mu_2,\sigma_2^2) + \cdots + \alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)} \qquad (13)$$
where $\mu_i$ and $\sigma_i^2$ are the defined parameters, $\alpha_i = 1/i$, i denotes the number of GMM components, the initial values of $\mu_i$ and $\sigma_i^2$ are computed with the K-means algorithm, $\tau_k$ is the time-delay estimate vector obtained from formula (7), and $\beta_i(\tau_k)$ is the posterior probability.
Formula (14) is the parameter update algorithm:
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,\tau_k}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{d}\,\frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,(\tau_k-\mu_i)^T(\tau_k-\mu_i)}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\alpha}_i = \frac{1}{n}\sum_{k=1}^{n}\beta_i(\tau_k) \qquad (14)$$
where $\hat{\mu}_i$, $\hat{\sigma}_i^2$ and $\hat{\alpha}_i$ are the estimates of the GMM model parameters and $\beta_i(\tau_k)$ is the posterior probability computed by formula (13); the parameter update stops when the change in the estimates falls below min, where min is a constant representing the minimum tolerance value.
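This segmentation step can be sketched as a small EM loop over a Gaussian mixture on the delay vectors, following formulas (13)–(14); the use of spherical (scalar) variances, the random choice of initial means in place of a full K-means initialisation, and the stopping tolerance are simplifying assumptions of this sketch.

```python
import numpy as np

def gmm_segment(taus, num_speakers, num_iter=50, tol=1e-4):
    """Sketch of formulas (13)-(14): GMM posteriors and parameter updates on delay vectors.

    taus : (num_segments, d) array of delay estimate vectors from formula (7).
    Returns per-segment component labels (one component per spatial speaker position).
    """
    n, d = taus.shape
    rng = np.random.default_rng(0)
    mu = taus[rng.choice(n, num_speakers, replace=False)]     # stand-in for K-means init
    sigma2 = np.full(num_speakers, taus.var() + 1e-6)
    alpha = np.full(num_speakers, 1.0 / num_speakers)

    for _ in range(num_iter):
        # Formula (13): posteriors beta_i(tau_k) from spherical Gaussian components.
        d2 = ((taus[:, None, :] - mu[None]) ** 2).sum(-1)                 # (n, i)
        log_g = -0.5 * (d2 / sigma2 + d * np.log(2 * np.pi * sigma2))
        log_w = np.log(alpha) + log_g
        beta = np.exp(log_w - log_w.max(axis=1, keepdims=True))
        beta /= beta.sum(axis=1, keepdims=True)

        # Formula (14): update means, variances and weights.
        nk = beta.sum(axis=0)
        new_mu = (beta.T @ taus) / nk[:, None]
        new_sigma2 = (beta * d2).sum(axis=0) / (d * nk) + 1e-8
        new_alpha = nk / n

        converged = np.abs(new_mu - mu).max() < tol           # tolerance "min"
        mu, sigma2, alpha = new_mu, new_sigma2, new_alpha
        if converged:
            break

    return beta.argmax(axis=1)
```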
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First the local density of each set is computed; the point with the largest density is taken as the first initial point, the next initial point is the point farthest from the first initial point, and so on, until the required number of initial points is reached.
Next the distance from each sample point to the set centers is computed and the center values are updated; the sample points satisfying formula (15) are selected as the new set centers:
$$\mathrm{Func} = \sum_{j=1}^{J}\sum_{n=1}^{M} \big\| \hat{\tau}[n] - \tau_j \big\|^{2} \qquad (15)$$
where $\|\hat{\tau}[n]-\tau_j\|$ is the distance between the time-delay estimate vector $\hat{\tau}[n]$ of each speech segment and the cluster center $\tau_j$, $\tau_j$ is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
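Step 4 can be sketched as a density-seeded K-means over the per-segment delay vectors that reduces the objective Func of formula (15); the neighbourhood radius used for the density, the rule for picking the remaining initial points and the convergence test are simplifying assumptions of this sketch.

```python
import numpy as np

def density_seeded_kmeans(taus, num_speakers, radius=1.0, num_iter=100):
    """Sketch of step 4: density-based initialisation followed by K-means refinement.

    taus : (num_segments, d) delay estimate vectors of the segmented speech pieces.
    Returns (labels, centers), where labels assigns each segment to a speaker cluster.
    """
    dist = np.linalg.norm(taus[:, None] - taus[None], axis=-1)   # pairwise distances

    # Initial centers: the densest point first, then points farthest from it.
    density = (dist < radius).sum(axis=1)
    first = density.argmax()
    order = np.argsort(-dist[first])                             # farthest from first
    centers = np.vstack([taus[first], taus[order[:num_speakers - 1]]])

    # K-means refinement: each pass lowers the objective Func of formula (15).
    for _ in range(num_iter):
        labels = np.linalg.norm(taus[:, None] - centers[None], axis=-1).argmin(axis=1)
        new_centers = np.array([taus[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(num_speakers)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```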
The present invention has the following advantages:
(1) The distributed asynchronous acoustic sensors proposed by the present invention place no strict restriction on the spatial positions of the sensors and impose only low requirements on the synchrony of the acquired signals, so their application is more flexible and broader than that of a microphone array;
(2) The present invention makes full use of the multiple delay differences between microphones, and between sound sources and microphones, for information fusion, and performs speaker segmentation with the time-delay estimate vectors, which reduces the complexity of traditional speaker segmentation algorithms while increasing robustness;
(3) The present invention makes full use of the advantages of distributed microphones in the spatial domain and performs speaker clustering on the time-delay estimate vectors of single-speaker speech segments, which reduces the complexity of traditional speaker clustering algorithms;
(4) The speaker clustering method for distributed microphones of the present invention can be applied to a variety of multi-person, multi-party dialogue scenes, has good robustness, and adapts to a variety of acoustic environments. The present invention can be realized on current handheld PCs, personal digital assistants (PDAs) or mobile phones, and its range of application is very broad.
Description of drawings
Fig. 1 is a schematic flow diagram of the present invention.
Fig. 2 is a schematic flow diagram of the endpoint detection of the present invention.
Fig. 3 is a schematic diagram of the sound-source time-delay estimation of the present invention.
Fig. 4 is a schematic flow diagram of the speaker segmentation and clustering of the present invention.
Embodiment
The present invention is described in detail below in conjunction with the accompanying drawings.
With reference to Fig. 1, a speaker clustering method for distributed microphones comprises the following steps:
Step 1: preprocess the signals acquired by the distributed microphones
With reference to Fig. 2, the multichannel sound-source signals obtained by the distributed microphones are first preprocessed. The signals are divided into frames and transformed with the fast Fourier transform (FFT), and endpoint detection is then performed to divide the signals into two classes, sound-source signal and non-sound-source signal. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. Early methods based on energy and zero-crossing rate can distinguish speech from noise accurately, but real speech is usually contaminated by considerable environmental noise, so a subband spectral entropy algorithm can be used for speech endpoint detection: the spectrum of each speech frame is first divided into n subbands (n is an integer greater than zero) and the spectral entropy of each subband is computed; the spectral entropy of every frame is then obtained from the subband spectral entropies of n successive frames through a group of order-statistics filters, and the input speech is classified according to the value of the spectral entropy. The concrete steps are: each frame of the speech signal is passed through the fast Fourier transform (FFT) to obtain $N_{FFT}$ points $Y_i$ ($0 \le i \le N_{FFT}$) on the power spectrum; the probability density of each point in the spectral domain can be expressed by formula (1):
$$p_i = Y_i \Big/ \sum_{k=0}^{N_{FFT}-1} Y_k \qquad (1)$$
where $Y_k$ is the k-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is the i-th point of the speech signal on the power spectrum after the FFT, $N_{FFT}$ is the number of points i, and $p_i$ is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):
$$H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k) \qquad (2)$$
where $p_k$ is the probability density of the k-th point in the spectral domain, $N_{FFT}$ is the number of points i, and H is the entropy function in the spectral domain.
The $N_{FFT}$ points on the frequency domain are divided into K non-overlapping frequency ranges, called subbands, and the probability of each point in the spectral domain of the l-th frame is computed as in formula (3):
$$p_l[k,i] = (Y_i + Q) \Big/ \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q) \qquad (3)$$
where $Y_j$ is the j-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is a point in the k-th subband, $m_k$ ($0 \le k \le K-1$, $m_k \le i \le m_{k+1}-1$) is the lower boundary of the subband, Q is a constant, and $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame.
According to the definition of information entropy, the spectral entropy of the k-th subband of the l-th frame is given by formula (4):
$$E_s[l,k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k,i]\,\log\big(p_l[k,i]\big) \quad (0 \le k \le K-1) \qquad (4)$$
where $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame and $E_s[l,k]$ is the spectral entropy of the k-th subband of the l-th frame.
The spectral information entropy of the l-th frame can then be computed according to formula (5):
$$H_l = -\frac{1}{K}\sum_{k=0}^{K-1} E_h[l,k] \qquad (5)$$
where $E_h[l,k]$ is the spectral entropy of the k-th subband of the l-th frame after filter smoothing, K is the number of subbands, and $H_l$ is the spectral information entropy of the l-th frame; $E_h[l,k]$ is defined as shown in formula (6):
$$E_h[l,k] = (1-\lambda)\,E_{s(h)}[l,k] + \lambda\,E_{s(h+1)}[l,k] \quad (0 \le k \le K-1) \qquad (6)$$
where $E_{s(h)}[l,k]$ is obtained as follows: the order-statistics filter of each subband acts on a group of L subband information entropies $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; this group of subband information entropies is sorted in ascending order, and $E_{s(h)}[l,k]$ is the h-th largest value among $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; $\lambda$ is a constant, and $E_h[l,k]$ is the information entropy of the k-th subband of the l-th frame after filter smoothing.
From formula (5), every frame has a spectral entropy $H_l$; when the value of $H_l$ is greater than a preset threshold T, the l-th frame is judged to be a speech frame, otherwise it is judged to be a non-speech frame. The threshold T is defined as $T = \beta \cdot Avg + \theta$, where $Avg = \frac{1}{K}\sum_{k=0}^{K-1} E_m[k]$, $\beta = 0.01$, $\theta = 0.1$, $E_m[k]$ is the median of $E_s[0,k],\ldots,E_s[N-1,k]$, and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
With reference to Fig. 3, the spatial coordinates are first established. The concrete method is: the microphones are numbered in order as M1, M2, ..., Mn, where n is an integer greater than 1; the two microphones initially numbered 1 and 2, M1 and M2, are selected; the position of microphone M1 is taken as the coordinate origin, and the direction from M1 to M2 as the initial coordinate axis. Then every 50 frames of the speech signal are treated as one speech segment, and the time-delay estimation method is used to estimate the delay difference between each pair of microphones for every segment, yielding n(n-1) time-delay estimates, as shown in formula (7):
$$\tau_k = \begin{bmatrix} \hat{\tau}_{12} & \hat{\tau}_{13} & \cdots & \hat{\tau}_{ij} \end{bmatrix}^T \qquad (7)$$
where $\hat{\tau}_{ij}$ is the estimated delay difference between the i-th microphone and the j-th microphone, and $\tau_k$ is the delay-difference estimate vector.
The time delay is estimated with the PHAT (phase transform) weighting algorithm; its weighting coefficient is shown in formula (8), and the time-delay estimation is given by formulas (9)–(10):
$$W(\omega) = \frac{1}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|} \qquad (8)$$
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, and $*$ denotes complex conjugation.
$$R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega)\cdot X_1(\omega)\cdot X_2^{*}(\omega)\big) \qquad (9)$$
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, $*$ denotes complex conjugation, IFFT is the inverse FFT, and $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals.
$$\hat{\tau} = \arg\max_{n} R_{x_1 x_2}(n) \qquad (10)$$
where $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals and $\hat{\tau}$ is the time-delay estimate between $x_1$ and $x_2$.
Step 3: exclude erroneous data and perform speaker segmentation
With reference to Fig. 4, the invalid data must first be removed; the time delay is computed according to formula (11):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{\mathrm{SNR}} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{\mathrm{SNR}} \end{cases} \qquad (11)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the signal-to-noise ratio at a given moment is less than the threshold $Thr_{\mathrm{SNR}}$, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment. The delay is then further computed according to formula (12):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases} \qquad (12)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the time-delay estimate at a given moment is less than the threshold Thr, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment.
Speaker segmentation over the different spatial positions is then computed. First the posterior probability $\beta_i(\tau_k)$ is calculated, as shown in formula (13):
$$\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)}{\alpha_1\, g(\tau_k;\mu_1,\sigma_1^2) + \alpha_2\, g(\tau_k;\mu_2,\sigma_2^2) + \cdots + \alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)} \qquad (13)$$
where $\mu_i$ and $\sigma_i^2$ are the defined parameters, $\alpha_i = 1/i$, i denotes the number of GMM components, the initial values of $\mu_i$ and $\sigma_i^2$ are computed with the K-means algorithm, $\tau_k$ is the time-delay estimate vector obtained from formula (7), and $\beta_i(\tau_k)$ is the posterior probability.
Formula (14) is the parameter update algorithm:
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,\tau_k}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{d}\,\frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,(\tau_k-\mu_i)^T(\tau_k-\mu_i)}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\alpha}_i = \frac{1}{n}\sum_{k=1}^{n}\beta_i(\tau_k) \qquad (14)$$
where $\hat{\mu}_i$, $\hat{\sigma}_i^2$ and $\hat{\alpha}_i$ are the estimates of the GMM model parameters and $\beta_i(\tau_k)$ is the posterior probability computed by formula (13); the parameter update stops when the change in the estimates falls below min, where min is a constant representing the minimum tolerance value.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments; this algorithm overcomes the defect that the performance of the standard K-means algorithm is strongly influenced by the initial values and by outliers.
First the local density of each set is computed; the point with the largest density is taken as the first initial point, the next initial point is the point farthest from the first initial point, and so on, until the required number of initial points is reached.
Next the distance from each sample point to the set centers is computed and the center values are updated; the sample points satisfying formula (15) are selected as the new set centers:
$$\mathrm{Func} = \sum_{j=1}^{J}\sum_{n=1}^{M} \big\| \hat{\tau}[n] - \tau_j \big\|^{2} \qquad (15)$$
where $\|\hat{\tau}[n]-\tau_j\|$ is the distance between the time-delay estimate vector $\hat{\tau}[n]$ of each speech segment and the cluster center $\tau_j$, $\tau_j$ is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
In the accompanying drawings, the indicated vectors are the position vector of one single sound source, the position vector of another single sound source, and the position vectors of the single microphones $M_i$, $M_k$ and $M_j$, respectively.

Claims (1)

1. A speaker clustering method for distributed microphones, characterized by comprising the following steps:
Step 1: preprocess the signals acquired by the distributed microphones
First, the multichannel sound-source signals obtained by the distributed microphones are preprocessed. The signals are divided into frames and transformed with the fast Fourier transform (FFT), and endpoint detection is then performed to divide the signals into two classes, sound-source signal and non-sound-source signal. The purpose of endpoint detection is to distinguish speech from non-speech in the digital audio signal. A subband spectral entropy algorithm can be used for speech endpoint detection: the spectrum of each speech frame is first divided into n subbands, n being an integer greater than zero, and the spectral entropy of each subband is computed; the spectral entropy of every frame is then obtained from the subband spectral entropies of n successive frames through a group of order-statistics filters, and the input speech is classified according to the value of the spectral entropy. The concrete steps are: each frame of the speech signal is passed through the fast Fourier transform (FFT) to obtain $N_{FFT}$ points $Y_i$ ($0 \le i \le N_{FFT}$) on the power spectrum; the probability density of each point in the spectral domain can be expressed by formula (1):
$$p_i = Y_i \Big/ \sum_{k=0}^{N_{FFT}-1} Y_k \qquad (1)$$
where $Y_k$ is the k-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is the i-th point of the speech signal on the power spectrum after the FFT, $N_{FFT}$ is the number of points i, and $p_i$ is the probability density of the i-th point in the spectral domain.
The entropy function of the corresponding signal in the spectral domain is defined by formula (2):
$$H = -\sum_{k=0}^{N_{FFT}-1} p_k \log(p_k) \qquad (2)$$
where $p_k$ is the probability density of the k-th point in the spectral domain, $N_{FFT}$ is the number of points i, and H is the entropy function in the spectral domain.
The $N_{FFT}$ points on the frequency domain are divided into K non-overlapping frequency ranges, called subbands, and the probability of each point in the spectral domain of the l-th frame is computed as in formula (3):
$$p_l[k,i] = (Y_i + Q) \Big/ \sum_{j=m_k}^{m_{k+1}-1} (Y_j + Q) \qquad (3)$$
where $Y_j$ is the j-th point of the speech signal on the power spectrum after the FFT, $Y_i$ is a point in the k-th subband, $m_k$ ($0 \le k \le K-1$, $m_k \le i \le m_{k+1}-1$) is the lower boundary of the subband, Q is a constant, and $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame.
According to the definition of information entropy, the spectral entropy of the k-th subband of the l-th frame is given by formula (4):
$$E_s[l,k] = \sum_{i=m_k}^{m_{k+1}-1} p_l[k,i]\,\log\big(p_l[k,i]\big) \quad (0 \le k \le K-1) \qquad (4)$$
where $p_l[k,i]$ is the probability of each point in the spectral domain of the l-th frame and $E_s[l,k]$ is the spectral entropy of the k-th subband of the l-th frame.
The spectral information entropy of the l-th frame can then be computed according to formula (5):
$$H_l = -\frac{1}{K}\sum_{k=0}^{K-1} E_h[l,k] \qquad (5)$$
where $E_h[l,k]$ is the spectral entropy of the k-th subband of the l-th frame after filter smoothing, K is the number of subbands, and $H_l$ is the spectral information entropy of the l-th frame; $E_h[l,k]$ is defined as shown in formula (6):
$$E_h[l,k] = (1-\lambda)\,E_{s(h)}[l,k] + \lambda\,E_{s(h+1)}[l,k] \quad (0 \le k \le K-1) \qquad (6)$$
where $E_{s(h)}[l,k]$ is obtained as follows: the order-statistics filter of each subband acts on a group of L subband information entropies $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; this group of subband information entropies is sorted in ascending order, and $E_{s(h)}[l,k]$ is the h-th largest value among $E_s[l-N,k],\ldots,E_s[l,k],\ldots,E_s[l+N,k]$; $\lambda$ is a constant, and $E_h[l,k]$ is the information entropy of the k-th subband of the l-th frame after filter smoothing.
From formula (5), every frame has a spectral entropy $H_l$; when the value of $H_l$ is greater than a preset threshold T, the l-th frame is judged to be a speech frame, otherwise it is judged to be a non-speech frame. The threshold T is defined as $T = \beta \cdot Avg + \theta$, where $Avg = \frac{1}{K}\sum_{k=0}^{K-1} E_m[k]$, $\beta = 0.01$, $\theta = 0.1$, $E_m[k]$ is the median of $E_s[0,k],\ldots,E_s[N-1,k]$, and Avg is the noise estimate over the first N frames of the input signal.
Step 2: apply the time-delay estimation method to the sound-source signal segments to obtain the corresponding time-delay estimate vectors
First the spatial coordinates are established. The concrete method is: the microphones are numbered in order as M1, M2, ..., Mn, where n is an integer greater than 1; the two microphones initially numbered 1 and 2, M1 and M2, are selected; the position of M1 is taken as the coordinate origin, and the direction from M1 to M2 as the initial coordinate axis. Then every 50 frames of the speech signal are treated as one speech segment, and the time-delay estimation method is used to estimate the delay difference between each pair of microphones for every segment, yielding n(n-1) delay-difference estimates, as shown in formula (7):
$$\tau_k = \begin{bmatrix} \hat{\tau}_{12} & \hat{\tau}_{13} & \cdots & \hat{\tau}_{ij} \end{bmatrix}^T \qquad (7)$$
where $\hat{\tau}_{ij}$ is the estimated delay difference between the i-th microphone and the j-th microphone, and $\tau_k$ is the delay-difference estimate vector.
The time delay is estimated with the PHAT (phase transform) weighting algorithm; its weighting coefficient is shown in formula (8), and the time-delay estimation is given by formulas (9)–(10):
$$W(\omega) = \frac{1}{\left|X_1(\omega)\,X_2^{*}(\omega)\right|} \qquad (8)$$
where $X_1(\omega)$ and $X_2(\omega)$ are the FFT outputs of the two time-domain signals, and $*$ denotes complex conjugation.
$$R_{x_1 x_2}(n) = \mathrm{IFFT}\big(W(\omega)\cdot X_1(\omega)\cdot X_2^{*}(\omega)\big) \qquad (9)$$
$$\hat{\tau} = \arg\max_{n} R_{x_1 x_2}(n) \qquad (10)$$
where $R_{x_1 x_2}(n)$ is the generalized cross-correlation function of the two signals and $\hat{\tau}$ is the time-delay estimate between $x_1$ and $x_2$.
Step 3: exclude erroneous data and perform speaker segmentation
First the invalid data must be removed; the time delay is computed according to formula (11):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \mathrm{SNR} < Thr_{\mathrm{SNR}} \\ \hat{\tau}[n], & \mathrm{SNR} \ge Thr_{\mathrm{SNR}} \end{cases} \qquad (11)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the signal-to-noise ratio at a given moment is less than the threshold $Thr_{\mathrm{SNR}}$, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment. The delay is then further computed according to formula (12):
$$\tau[n] = \begin{cases} \hat{\tau}[n-1], & \hat{\tau}[n] < Thr \\ \hat{\tau}[n], & \hat{\tau}[n] \ge Thr \end{cases} \qquad (12)$$
where n is the index of a frame, $\tau$ is the delay data corresponding to that frame, and $\hat{\tau}$ is the delay data estimated for that frame; when the time-delay estimate at a given moment is less than the threshold Thr, the estimated time delay of the previous moment is adopted as the time-delay estimate for the current moment.
Speaker segmentation over the different spatial positions is then computed. First the posterior probability $\beta_i(\tau_k)$ is calculated, as shown in formula (13):
$$\beta_i(\tau_k) = \frac{\alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)}{\alpha_1\, g(\tau_k;\mu_1,\sigma_1^2) + \alpha_2\, g(\tau_k;\mu_2,\sigma_2^2) + \cdots + \alpha_i\, g(\tau_k;\mu_i,\sigma_i^2)} \qquad (13)$$
where $\mu_i$ and $\sigma_i^2$ are the defined parameters, $\alpha_i = 1/i$, i denotes the number of GMM components, the initial values of $\mu_i$ and $\sigma_i^2$ are computed with the K-means algorithm, $\tau_k$ is the time-delay estimate vector obtained from formula (7), and $\beta_i(\tau_k)$ is the posterior probability.
Formula (14) is the parameter update algorithm:
$$\hat{\mu}_i = \frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,\tau_k}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\sigma}_i^2 = \frac{1}{d}\,\frac{\sum_{k=1}^{n}\beta_i(\tau_k)\,(\tau_k-\mu_i)^T(\tau_k-\mu_i)}{\sum_{k=1}^{n}\beta_i(\tau_k)}, \qquad \hat{\alpha}_i = \frac{1}{n}\sum_{k=1}^{n}\beta_i(\tau_k) \qquad (14)$$
where $\hat{\mu}_i$, $\hat{\sigma}_i^2$ and $\hat{\alpha}_i$ are the estimates of the GMM model parameters and $\beta_i(\tau_k)$ is the posterior probability computed by formula (13); the parameter update stops when the change in the estimates falls below min, where min is a constant representing the minimum tolerance value.
Step 4: perform speaker clustering according to the speaker segmentation result
A K-means-based algorithm is used to cluster the segmented speech segments. First the local density of each set is computed; the point with the largest density is taken as the first initial point, the next initial point is the point farthest from the first initial point, and so on, until the required number of initial points is reached.
Next the distance from each sample point to the set centers is computed and the center values are updated; the sample points satisfying formula (15) are selected as the new set centers:
$$\mathrm{Func} = \sum_{j=1}^{J}\sum_{n=1}^{M} \big\| \hat{\tau}[n] - \tau_j \big\|^{2} \qquad (15)$$
where $\|\hat{\tau}[n]-\tau_j\|$ is the distance between the time-delay estimate vector $\hat{\tau}[n]$ of each speech segment and the cluster center $\tau_j$, $\tau_j$ is the center vector, J is the number of speakers, and M is the number of microphones.
Finally, the speech segments of speakers in different spatial positions are classified and labeled according to the distance between the set center vectors and the segment vectors.
CN2010105683868A 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone Active CN102074236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105683868A CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105683868A CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Publications (2)

Publication Number Publication Date
CN102074236A CN102074236A (en) 2011-05-25
CN102074236B true CN102074236B (en) 2012-06-06

Family

ID=44032754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105683868A Active CN102074236B (en) 2010-11-29 2010-11-29 Speaker clustering method for distributed microphone

Country Status (1)

Country Link
CN (1) CN102074236B (en)

Families Citing this family (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102300140B (en) 2011-08-10 2013-12-18 歌尔声学股份有限公司 Speech enhancing method and device of communication earphone and noise reduction communication earphone
CN102509548B (en) * 2011-10-09 2013-06-12 清华大学 Audio indexing method based on multi-distance sound sensor
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
US9185199B2 (en) * 2013-03-12 2015-11-10 Google Technology Holdings LLC Method and apparatus for acoustically characterizing an environment in which an electronic device resides
CN103175897B (en) * 2013-03-13 2015-08-05 西南交通大学 A kind of high-speed switch hurt recognition methods based on vibration signal end-point detection
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN104347068B (en) * 2013-08-08 2020-05-22 索尼公司 Audio signal processing device and method and monitoring system
CN103439688B (en) * 2013-08-27 2015-04-22 大连理工大学 Sound source positioning system and method used for distributed microphone arrays
CN104575498B (en) * 2015-01-30 2018-08-17 深圳市云之讯网络技术有限公司 Efficient voice recognition methods and system
CN104767739B (en) * 2015-03-23 2018-01-30 电子科技大学 The method that unknown multi-protocols blended data frame is separated into single protocol data frame
CN104766093B (en) * 2015-04-01 2018-02-16 中国科学院上海微***与信息技术研究所 A kind of acoustic target sorting technique based on microphone array
CN105161093B (en) * 2015-10-14 2019-07-09 科大讯飞股份有限公司 A kind of method and system judging speaker's number
CN105388459B (en) * 2015-11-20 2017-08-11 清华大学 The robust sound source space-location method of distributed microphone array network
CN106887231A (en) * 2015-12-16 2017-06-23 芋头科技(杭州)有限公司 A kind of identification model update method and system and intelligent terminal
CN106981289A (en) * 2016-01-14 2017-07-25 芋头科技(杭州)有限公司 A kind of identification model training method and system and intelligent terminal
CN105869645B (en) * 2016-03-25 2019-04-12 腾讯科技(深圳)有限公司 Voice data processing method and device
CN109155130A (en) * 2016-05-13 2019-01-04 伯斯有限公司 Handle the voice from distributed microphone
US10249305B2 (en) * 2016-05-19 2019-04-02 Microsoft Technology Licensing, Llc Permutation invariant training for talker-independent multi-talker speech separation
CN106405499A (en) * 2016-09-08 2017-02-15 南京阿凡达机器人科技有限公司 Method for robot to position sound source
CN107886951B (en) * 2016-09-29 2021-07-23 百度在线网络技术(北京)有限公司 Voice detection method, device and equipment
CN106504773B (en) * 2016-11-08 2023-08-01 上海贝生医疗设备有限公司 Wearable device and voice and activity monitoring system
CN106940997B (en) * 2017-03-20 2020-04-28 海信集团有限公司 Method and device for sending voice signal to voice recognition system
CN107202976B (en) * 2017-05-15 2020-08-14 大连理工大学 Low-complexity distributed microphone array sound source positioning system
CN109215667B (en) 2017-06-29 2020-12-22 华为技术有限公司 Time delay estimation method and device
CN107393549A (en) * 2017-07-21 2017-11-24 北京华捷艾米科技有限公司 Delay time estimation method and device
CN107885323B (en) * 2017-09-21 2020-06-12 南京邮电大学 VR scene immersion control method based on machine learning
CN108364637B (en) * 2018-02-01 2021-07-13 福州大学 Audio sentence boundary detection method
CN108665894A (en) * 2018-04-06 2018-10-16 东莞市华睿电子科技有限公司 A kind of voice interactive method of household appliance
CN108872939B (en) * 2018-04-29 2020-09-29 桂林电子科技大学 Indoor space geometric outline reconstruction method based on acoustic mirror image model
CN109087648B (en) * 2018-08-21 2023-10-20 平安科技(深圳)有限公司 Counter voice monitoring method and device, computer equipment and storage medium
CN109658948B (en) * 2018-12-21 2021-04-16 南京理工大学 Migratory bird migration activity-oriented acoustic monitoring method
CN109618273B (en) * 2018-12-29 2020-08-04 北京声智科技有限公司 Microphone quality inspection device and method
CN110021302A (en) * 2019-03-06 2019-07-16 厦门快商通信息咨询有限公司 A kind of Intelligent office conference system and minutes method
CN110290468B (en) * 2019-07-04 2020-09-22 英华达(上海)科技有限公司 Virtual sound insulation communication method, device, system, electronic device and storage medium
CN110428842A (en) * 2019-08-13 2019-11-08 广州国音智能科技有限公司 Speech model training method, device, equipment and computer readable storage medium
CN110501674A (en) * 2019-08-20 2019-11-26 长安大学 A kind of acoustical signal non line of sight recognition methods based on semi-supervised learning
CN111063341B (en) * 2019-12-31 2022-05-06 思必驰科技股份有限公司 Method and system for segmenting and clustering multi-person voice in complex environment
CN112581941A (en) * 2020-11-17 2021-03-30 北京百度网讯科技有限公司 Audio recognition method and device, electronic equipment and storage medium
CN112735385B (en) * 2020-12-30 2024-05-31 中国科学技术大学 Voice endpoint detection method, device, computer equipment and storage medium
CN112684437B (en) * 2021-01-12 2023-08-11 浙江大学 Passive ranging method based on time domain warping transformation
CN112684412B (en) * 2021-01-12 2022-09-13 中北大学 Sound source positioning method and system based on pattern clustering
CN113096669B (en) * 2021-03-31 2022-05-27 重庆风云际会智慧科技有限公司 Speech recognition system based on role recognition
CN113178196B (en) * 2021-04-20 2023-02-07 平安国际融资租赁有限公司 Audio data extraction method and device, computer equipment and storage medium
CN113573212B (en) * 2021-06-04 2023-04-25 成都千立智能科技有限公司 Sound amplifying system and microphone channel data selection method
CN113380234B (en) * 2021-08-12 2021-12-17 明品云(北京)数据科技有限公司 Method, device, equipment and medium for generating form based on voice recognition
CN113808612B (en) * 2021-11-18 2022-02-11 阿里巴巴达摩院(杭州)科技有限公司 Voice processing method, device and storage medium
CN116030815B (en) * 2023-03-30 2023-06-20 北京建筑大学 Voice segmentation clustering method and device based on sound source position

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH02209027A (en) * 1989-02-09 1990-08-20 Fujitsu Ltd Acoustic echo canceller
JPH1097276A (en) * 1996-09-20 1998-04-14 Canon Inc Method and device for speech recognition, and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452704A (en) * 2007-11-29 2009-06-10 中国科学院声学研究所 Speaker clustering method based on information transfer

Also Published As

Publication number Publication date
CN102074236A (en) 2011-05-25

Similar Documents

Publication Publication Date Title
CN102074236B (en) Speaker clustering method for distributed microphone
CN102103200B (en) Acoustic source spatial positioning method for distributed asynchronous acoustic sensor
CN108464015B (en) Microphone array signal processing system
CN108731886B (en) A kind of more leakage point acoustic fix ranging methods of water supply line based on iteration recursion
CN106373589B (en) A kind of ears mixing voice separation method based on iteration structure
JP4816711B2 (en) Call voice processing apparatus and call voice processing method
CN102565759B (en) Binaural sound source localization method based on sub-band signal to noise ratio estimation
CN111429939B (en) Sound signal separation method of double sound sources and pickup
EP3387648A1 (en) Localization algorithm for sound sources with known statistics
CN101593522A (en) A kind of full frequency domain digital hearing aid method and apparatus
CN103901401A (en) Binaural sound source positioning method based on binaural matching filter
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN109859749A (en) A kind of voice signal recognition methods and device
Al-Karawi et al. Early reflection detection using autocorrelation to improve robustness of speaker verification in reverberant conditions
Han et al. Robust GSC-based speech enhancement for human machine interface
CN103901400A (en) Binaural sound source positioning method based on delay compensation and binaural coincidence
WO2004084187A1 (en) Object sound detection method, signal input delay time detection method, and sound signal processing device
Wang et al. Localization based sequential grouping for continuous speech separation
CN111179959B (en) Competitive speaker number estimation method and system based on speaker embedding space
CN111429916B (en) Sound signal recording system
JP2017067948A (en) Voice processor and voice processing method
Gburrek et al. A meeting transcription system for an ad-hoc acoustic sensor network
Himawan et al. Clustering of ad-hoc microphone arrays for robust blind beamforming
Imoto et al. Spatial-feature-based acoustic scene analysis using distributed microphone array
Venkatesan et al. Deep recurrent neural networks based binaural speech segregation for the selection of closest target of interest

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20181115

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Patentee after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 Beijing 100084 box 82 box, Tsinghua University Patent Office

Patentee before: Tsinghua University