1262433

the energy of the source signal, finally comparing the energy at the beamformer output to determine the angle of the sound source. Although this method can be used in noisy environments, it still cannot handle a source blocked by an object, or sources in the same direction but at different distances, and it can only be used when the microphones are matched to one another.

4. U.S. Patent No. 6,243,471, by Brandstein et al., proposes using two or more microphones as a group and applying simple geometric relations to recover three-dimensional spatial information; with a plurality of such groups, a plurality of three-dimensional estimates can be produced, so the sound-source direction can be estimated without the problems of blocking or of sources in the same direction but at different distances.
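The "simple geometric relations" step just described can be illustrated with a minimal far-field sketch: an inter-microphone time delay implies an arrival angle through θ = arccos(c·τ/d). This is a generic illustration, not the patent's actual procedure, and the function name and parameters are hypothetical:

```python
import numpy as np

def angle_from_delay(tau_s, spacing_m, c=343.0):
    """Far-field geometric relation: a delay tau_s (seconds) between
    two microphones spacing_m apart implies an arrival angle
    theta = arccos(c * tau / d), returned here in degrees.
    Illustrative only -- not the patent's method."""
    # Clamp for numerical safety before arccos.
    cos_theta = np.clip(c * tau_s / spacing_m, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

# A source broadside to the pair arrives simultaneously (zero delay).
print(angle_from_delay(0.0, 0.1))  # → 90.0
```

With several such pairs (groups), the individual angle estimates can be intersected to recover a three-dimensional position, which is the geometric idea the patent builds on.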
However, when this method is used in a complicated environment, the number of microphone arrays required becomes considerable. Moreover, because it relies on time-difference-of-arrival (TDOA) angle localization, angular errors arise under strong reflection and during transients, and solving with simple geometric relations magnifies those inaccuracies. Although computing the variance under a Gaussian-distribution assumption can be used to build different weights and reduce the error, once noise is present the Gaussian assumption cannot hold for noise and source at the same time, so the estimated angle is in error; the method also requires the microphones to be matched to one another.

5. U.S. Patent No. 5,778,082, by Chu et al., proposes roughly classifying the sound to be estimated in order to find the noise-only segments, estimating the cross-correlation matrix of the noise in advance, and then subtracting that matrix from the cross-correlation matrix of the sound source to be estimated, so as to cancel the influence of the noise. However, because this method is not designed around speech detection, it cannot reliably follow the occurrence of speech; moreover, if the noise cross-correlation matrix is estimated inaccurately, the result will be in error. It also cannot distinguish a blocked source and must use matched microphones.

6. U.S. Patent No. 5,465,302 proposes using a plurality of microphones, computing the time delay between each pair, and then applying a non-planar-wave assumption to calculate the position of the sound source relative to the microphone array.
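The pairwise time-delay computation just described can be sketched as the lag of the cross-correlation peak between two channels. This is a generic textbook estimate, not the patent's exact procedure; the function name and signal values are illustrative:

```python
import numpy as np

def pairwise_delay(x, y, fs):
    """Estimate the delay of channel y relative to channel x
    (in seconds) as the lag of their cross-correlation peak.
    Illustrative sketch of the pairwise time-delay step."""
    corr = np.correlate(y, x, mode="full")
    lag = np.argmax(corr) - (len(x) - 1)  # convert index to signed lag
    return lag / fs

# Synthetic check: y is x delayed by 5 samples at 8 kHz.
fs = 8000
rng = np.random.default_rng(1)
x = rng.normal(size=1024)
y = np.roll(x, 5)
print(pairwise_delay(x, y, fs))  # → 0.000625
```

In practice the delays from all microphone pairs are combined (here under the non-planar-wave assumption) to solve for the source position relative to the array.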
Although this method can estimate the position of the sound source, it requires the microphones to be matched to one another, and it cannot cope with environmental noise, reflections, or obstacles between the sound source and the microphone array.

7. U.S. Patent No. 4,333,170, by Mathews et al., proposes using the phase difference slope to calculate the angular relation between the sound source and the microphone array, using the magnitude of the signal spectrum to find suitable frequencies for grouping. Because this method produces errors when the sound and the noise occupy the same frequency band, the estimation result is inaccurate; it likewise requires matched microphones and cannot distinguish a blocked sound source, or sources at the same angle but at different distances.
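The phase-difference-slope idea rests on the fact that, for a far-field source, the inter-microphone phase difference grows linearly with frequency, Δφ(f) = 2πfτ, so the time delay τ falls out of the slope of a line fit. The sketch below is a generic illustration under that assumption, not the patent's implementation:

```python
import numpy as np

def delay_from_phase_slope(freqs_hz, phase_diff_rad):
    """Least-squares slope of the inter-microphone phase difference
    versus frequency; since dphi/df = 2*pi*tau, dividing the slope
    by 2*pi gives the time delay tau. Illustrative sketch only."""
    slope = np.polyfit(freqs_hz, phase_diff_rad, 1)[0]
    return slope / (2 * np.pi)

# Synthetic check: a 0.2 ms delay yields phase differences 2*pi*f*tau.
f = np.linspace(300, 3000, 20)
tau_true = 2e-4
print(delay_from_phase_slope(f, 2 * np.pi * f * tau_true))  # ≈ 2e-4
```

The delay recovered this way then maps to an angle exactly as in the geometric relation above; the weakness the text notes is that noise sharing the source's frequency band corrupts the measured phase differences and hence the fitted slope.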
8. In "Robust Joint Audio-Video Localization in Video Conferencing Using Reliability Information," published by Lo et al. in IEEE Transactions on Instrumentation and Measurement, Vol. 53, No. 4, August 2004, image and sound signals are used together to localize the speaker. The audio-localization part uses only a simple delay-and-sum beamformer to compute the sound energy of different sections and takes the maximum-energy section as the probable speaker position, which is then fused with the image-based judgment to localize the speaker. However, the method requires a circular microphone array placed in the middle of the speakers in order to separate
the sound energy of the different sections; since those sections are divided radially from the center of the microphone array, the arrangement is inconvenient to use, and achieving more stable results requires integration with the image, so the system architecture is complicated and expensive, which discourages its use.

9. In "A Unified Neural-Network-Based Speaker Localization Technique," published by Guner Arslan et al. in IEEE Transactions on Neural Networks, Vol. 11, No. 4, July 2000, a neural-network-based technique is used for sound-source localization. When the signal-to-noise ratio (SNR) exceeds 20 dB it performs well even at large angles, and it can be applied in both near-field and far-field situations. However, it cannot be used when the ambient noise is strong, it cannot solve the problems of blocked objects or of sources in the same direction but at different distances, and it requires the microphones to be matched to one another.

10. In "Array Optimization Applied in the Near Field of a Microphone Array," published by James G. Ryan et al. in IEEE Transactions on Speech and Audio Processing, Vol. 8, No. 2, March 2000, it is shown that the best results are obtained when the inter-microphone spacing equals half a wavelength (d = λ/2), so the case d = λ/2 is treated separately from other spacings when localizing. This method works best when the sound source is in the near field and the noise is in the far field; it likewise cannot solve the problems of blocked objects or of sources in the same direction but at different distances, and it requires matched microphones.
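The half-wavelength design rule d = λ/2 is a one-line computation from a chosen design frequency. The sketch below is illustrative only (the 1 kHz design frequency is an assumption, not a value from the paper):

```python
def half_wavelength_spacing(freq_hz, c=343.0):
    """Microphone spacing d = lambda / 2 = c / (2 * f) for a given
    design frequency. Illustrative helper, not from the paper."""
    return c / (2.0 * freq_hz)

# For a design frequency of 1 kHz this gives about 17 cm spacing.
print(half_wavelength_spacing(1000.0))  # → 0.1715
```

Spacings larger than λ/2 at the frequency of interest introduce spatial aliasing (ambiguous steering directions), which is why the half-wavelength spacing is the preferred operating point.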
11. In "A Biomimetic System for Localization and Separation of Multiple Sound Sources," published by Huang et al. in IEEE Transactions on Instrumentation and Measurement, Vol. 44, No. 3, June 1995, a method based on Arrival Temporal Disparities (ATD) is proposed to calculate the sound-source angle. The method must detect the onset of the sound source and use that onset