TWI485697B - Environmental sound recognition method - Google Patents

Environmental sound recognition method

Info

Publication number
TWI485697B
Authority
TW
Taiwan
Prior art keywords
frequency
sound
energy
value
scale
Prior art date
Application number
TW101119270A
Other languages
Chinese (zh)
Other versions
TW201349223A (en)
Inventor
Jia Ching Wang
Chang Hong Lin
Min Kang Tsai
Original Assignee
Univ Nat Central
Priority date
Filing date
Publication date
Application filed by Univ Nat Central filed Critical Univ Nat Central
Priority to TW101119270A priority Critical patent/TWI485697B/en
Publication of TW201349223A publication Critical patent/TW201349223A/en
Application granted granted Critical
Publication of TWI485697B publication Critical patent/TWI485697B/en


Landscapes

  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Description

Environmental sound recognition method

The present invention relates to the field of sound recognition, and in particular to an environmental sound recognition method based on a novel time-frequency parameter.

Over the past few decades, sound recognition has been applied in a wide range of applications, such as speech content analysis, video indexing, robot walking, the digital home, classification of movie or television soundtracks, and services on portable devices. For robot walking, most techniques focus on vision, but what can be achieved with vision alone is quite limited; for example, when a robot's visual information is disturbed or corrupted, there is no additional information to help the robot move, so combining sound recognition with vision has been proposed to improve a robot's walking ability. In digital-home applications, specific sounds in the environment, such as a child crying, an object breaking, or an elderly person falling, can be recognized. Movie or television soundtracks can be classified by sounds such as explosions, machine-gun fire, or animal sounds to determine the rating of a film (for example, general audience, parental guidance, or restricted). As for services on portable devices, a mobile phone can determine the user's current environment and switch modes automatically instead of manually; for example, when the user is in a meeting, the phone switches to vibrate mode according to the surrounding ambient sound. Sound recognition is therefore an important technology.

The goal of sound recognition is to identify the characteristics of a sound signal, including speech, music, and other sound events. A well-designed algorithm makes it easier for users to search and manage sound files, and the recognition results can be further processed for many applications that make everyday life more convenient. In recent years, as the demand for environment-aware computing has grown, sound recognition research has gradually shifted toward environmental characteristics, and environmental sound recognition has therefore received considerable attention.

Most research in this field focuses on exploring useful sound feature parameters for the system, extracting five different features from the sound signal: pitch, harmonicity, loudness, brightness, and bandwidth. These sound features can be extracted with frequency-domain, time-domain, or time-frequency-domain methods to analyze changes in the spectrum of the sound signal, and such methods can have time-local properties.

The drawbacks of time-domain feature extraction are that the data to be analyzed are very large and easily affected by noise; loudness and the ratio of voiced to silent segments can only be analyzed roughly; the types of sound that can be analyzed are limited; and the individual frequency components cannot be analyzed.

Frequency-domain feature extraction methods use, for example, the Fourier transform to convert between the time and frequency domains and analyze the energy of each frequency component. Such methods can build parameters based on models of human hearing and of the vocal tract, such as MFCC (Mel-scale Frequency Cepstral Coefficients) and LPCC (Linear Prediction Cepstral Coefficients), to analyze sound by simulating how humans hear and produce speech. The drawback of these methods is that their ability to recognize surrounding environmental sounds is limited, and their analysis of frequency components over short time intervals is weak.

In view of the above problems, an object of the present invention is to provide an environmental sound recognition method that can analyze environmental sounds using less analysis data, uses the sound feature parameters of the environmental sound to produce new parameter values, and classifies these new parameter values with a sound classifier to identify the category of the environmental sound. This improves the recognizability of surrounding environmental sounds and also strengthens the analysis of frequency components over short time intervals.

The present invention provides an environmental sound recognition method in which the following steps are performed in an environmental sound recognition device: (a) establishing a sound feature matrix in a sound feature dictionary, the sound feature matrix consisting of a plurality of vector sequences, where each combination of one of a plurality of waveform pattern center positions, one of a plurality of frequencies, and one of a plurality of waveform pattern lengths is operated on to obtain the vector sequence of the corresponding waveform pattern; (b) performing an inner-product operation between an iteration vector sequence and each of the vector sequences to obtain a plurality of similarity coefficients and a plurality of inner-product vector sequences respectively corresponding to the similarity coefficients, where in the first iteration the iteration vector sequence is one of at least one sound signal vector sequence; (c) operating on the iteration vector sequence and the inner-product vector sequence corresponding to the largest similarity coefficient to obtain a similar-signal vector sequence, and subtracting the similar-signal vector sequence from the iteration vector sequence to obtain a residual-signal vector sequence; (d) repeating steps (b) and (c) a predetermined number of times to obtain the same number of largest similarity coefficients, where in each subsequent iteration the iteration vector sequence is the residual-signal vector sequence; (e) summing the largest similarity coefficients that share the same frequency and waveform pattern length to obtain a plurality of energy values; (f) operating on the energy values, the frequencies, and the waveform pattern lengths to obtain at least one scale-frequency descriptor, one for each of the at least one sound signal vector sequence, where each scale-frequency descriptor contains a maximum energy ratio, a plurality of waveform-pattern-length energy values, a plurality of frequency energy values, a waveform-pattern-length centroid value, a frequency centroid value, a waveform-pattern-length spread value, and a frequency spread value; and (g) averaging the at least one scale-frequency descriptor to obtain a mean scale-frequency descriptor.

To enable those of ordinary skill in the art to which the present invention pertains to further understand the invention, embodiments of the invention are described in detail below, together with the accompanying drawings, to explain the content of the invention and the effects it is intended to achieve.

FIG. 1 is a block diagram of the environmental sound recognition device of the present invention. In FIG. 1, the environmental sound recognition device 10 includes a sound pre-processing module 12, a Gabor dictionary 14, a matching pursuit module 16, a scale-frequency map module 18, an energy estimation module 20, and a sound classifier 22.

FIG. 2 is a schematic diagram of the environmental sound processing procedure of the present invention. FIG. 3A is a schematic diagram of the sound waveform of one frame in FIG. 2, and FIGS. 3B to 3I are schematic diagrams of waveform patterns in the Gabor dictionary of the present invention.

The sound pre-processing module 12 receives the ambient sound of the surrounding environment (shown as the sound waveform at the top of FIG. 2) and converts the received ambient sound into at least one sound file in a sound file format such as mp3 or wav (N sound files in this embodiment). The sound pre-processing module 12 then samples and quantizes the sound signals of the N sound files, for example using Matlab, to obtain N sound signal vector sequences in mathematical form (frame 1, frame 2, ..., frame N in FIG. 2).
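A minimal preprocessing sketch, assuming 256-sample frames (matching the time index t = 0–255 of the Gabor atoms described below) and the soundfile package for reading wav files; the hop size, the mono mix-down, and the function name are illustrative choices not specified in the text.

```python
import numpy as np
import soundfile as sf  # assumed I/O library choice

def frame_signal(wav_path, frame_len=256, hop=256):
    signal, sr = sf.read(wav_path)      # sampled, quantized waveform and its rate
    if signal.ndim > 1:                 # mix multi-channel audio down to mono
        signal = signal.mean(axis=1)
    n_frames = (len(signal) - frame_len) // hop + 1   # assumes len(signal) >= frame_len
    frames = np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames, sr                   # frames: N x 256 sound signal vector sequences
```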

The Gabor dictionary 14 is a sound feature dictionary; however, the present invention is not limited to this, and any dictionary that can generate sound features is applicable. A sound feature matrix is built in the Gabor dictionary 14, and the matrix consists of the vector sequences of a plurality of waveform patterns. The vector sequences of the waveform patterns in the Gabor dictionary 14 (shown in FIGS. 3B to 3I) are generated by the Gabor function of formula (1).

In this embodiment the Gabor function has four main parameters, ρ, μ, f, and θ, where ρ is the waveform pattern length or scale, which controls the width of the Gabor function; μ is the waveform pattern center position or time, which controls the center of the Gabor function; f is the frequency; and θ is the phase. Two further quantities are K_{ρ,f,θ} and t: K_{ρ,f,θ} is a normalization constant chosen so that ‖G_{ρ,μ,f,θ}‖ = 1, and t is the time index.

The waveform pattern vector sequences G_{ρ,μ,f,θ} are generated with the above parameter settings; for example, the parameter values used are ρ = {2^j | j = 1, ..., 8}, μ = {0, 64, 128, 192}, f = {150, 450, 840, 1370, 2150, 3400, 5800}, θ = 0, and t = 0–255, so the Gabor dictionary 14 contains a total of 224 waveform pattern vector sequences G_{ρ,μ,f,θ} (8 scales × 7 frequencies × 4 center positions).
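A sketch of how such a 224-atom dictionary could be generated with the parameter values listed above. Because formula (1) itself is not reproduced in this text, a standard Gabor atom (a Gaussian envelope multiplied by a cosine carrier, normalized to unit norm so that ‖G‖ = 1) is assumed; the 16 kHz sampling rate comes from the following paragraph.

```python
import numpy as np

FS = 16000                                  # sampling rate in Hz (from the text)
SCALES = [2 ** j for j in range(1, 9)]      # rho = 2^j, j = 1..8
CENTRES = [0, 64, 128, 192]                 # mu, waveform pattern centre positions
FREQS = [150, 450, 840, 1370, 2150, 3400, 5800]   # f in Hz, critical-band spaced
T = np.arange(256)                          # time index t = 0..255

def gabor_atom(rho, mu, f, theta=0.0):
    # Assumed atom shape: Gaussian envelope times a cosine carrier.
    envelope = np.exp(-np.pi * ((T - mu) / rho) ** 2)
    atom = envelope * np.cos(2 * np.pi * f * (T - mu) / FS + theta)
    return atom / np.linalg.norm(atom)      # unit norm plays the role of K_{rho,f,theta}

# Dictionary D: 224 x 256 matrix (8 scales * 7 frequencies * 4 centre positions),
# with the (scale, frequency) label of each atom kept for the scale-frequency map.
dictionary = np.stack([gabor_atom(r, m, f)
                       for r in SCALES for f in FREQS for m in CENTRES])
labels = [(r, f) for r in SCALES for f in FREQS for m in CENTRES]
```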

The frequency parameter is not set in the usual uniform manner but in the form of critical bands, because critical bands have the characteristics of the human auditory model. In general the range of human hearing is 20 to 20000 Hz; however, the sampling rate is 16 kHz, so the suitable hearing range here is 20 to 8000 Hz. In this embodiment the non-uniformly distributed critical bands cover 0 to 8000 Hz, a range that contains 21 critical bands, but experiments show that using only seven critical bands gives better results on the database, so the Gabor dictionary 14 is generated from seven critical bands.

The matching pursuit module 16 uses a matching pursuit algorithm to compute the inner product of the sound signal vector sequence of one frame of the environmental sound with the vector sequence of each waveform pattern in the sound feature matrix of the Gabor dictionary 14.

A typical matching pursuit algorithm combines two aspects during its iterations: selecting waveform patterns, and decomposing the signal (for example, a frame). In either aspect, the iteration of the matching pursuit algorithm can be terminated by one of three stopping criteria. The first uses the correlation coefficients produced by the inner products between the waveform patterns and the decomposed signal, stopping when the correlation coefficient falls below a preset threshold; the second stops when the number of selected waveform patterns reaches a set count; the third stops when the ratio of the residual signal's energy to the total energy falls below a preset threshold. This embodiment adopts the second criterion as the stopping condition of the matching pursuit iteration, taking the selection of 60 waveform patterns as an example.
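As an illustration of these three stopping criteria, the small helper below checks them for one matching-pursuit run; the threshold values and all names are assumptions, and this embodiment effectively relies only on the second criterion (a fixed count of 60 atoms).

```python
def should_stop(last_coeff, n_selected, residual_energy, total_energy,
                coeff_thresh=1e-3, max_atoms=60, energy_ratio_thresh=0.01):
    by_coefficient = abs(last_coeff) < coeff_thresh                            # criterion 1
    by_atom_count = n_selected >= max_atoms                                    # criterion 2 (used here)
    by_residual_ratio = residual_energy / total_energy < energy_ratio_thresh   # criterion 3
    return by_coefficient or by_atom_count or by_residual_ratio
```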

In selecting waveform patterns, the matching pursuit module 16 selects one waveform pattern from the Gabor dictionary 14, namely the waveform pattern most similar to the original signal. Let D be the Gabor dictionary 14, containing a complete and useful set of waveform patterns, which can be expressed by formula (2):

D = {σ_γ | γ ∈ Γ}   (2)

where γ is the parameter vector of a waveform pattern vector sequence σ and Γ is the set of parameter vectors γ. In the decomposition part of the matching pursuit algorithm, the inner products of the original signal with the vector sequences of all waveform patterns in the Gabor dictionary 14 are computed to obtain the similarity coefficients, and the waveform pattern vector sequence with the largest similarity coefficient is then selected, as determined by formula (3):

σ* = arg max_{σ_γ ∈ D} |σ_γ^T s|   (3)

where s is the original signal, σ_γ^T s denotes the inner-product operation, T denotes vector transposition, |σ_γ^T s| is the similarity coefficient (which is also an energy), and σ* is the waveform pattern vector sequence most similar to the original signal (the one with the largest similarity coefficient). After the waveform pattern vector sequence with the largest similarity coefficient has been selected, the vector sequence σ* is used to reduce the residual signal computed in the previous iteration, and the updated residual-signal vector sequence is obtained with formula (4):

R_s(n) = R_s(n−1) − (σ*^T R_s(n−1)) σ*   (4)

where R_s(n) is the vector sequence of the updated residual signal after n iterations; when n = 1, R_s(0) is the vector sequence of the original signal s. Because the matching pursuit module 16 selects the most similar waveform pattern vector sequence in every iteration, the residual signal becomes very small after n iterations (60 in this embodiment); in other words, when the signal is reconstructed from the selected waveform pattern vector sequences σ*, the reconstructed signal differs only slightly from the original signal.
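A sketch of the matching-pursuit loop described above, assuming the `dictionary` and `labels` structures of the earlier dictionary sketch: in each of 60 iterations the atom with the largest absolute inner product is selected and the residual is updated per formula (4).

```python
import numpy as np

def matching_pursuit(frame, dictionary, labels, n_iter=60):
    residual = frame.astype(float).copy()
    picks = []                                    # (|similarity|, scale, frequency) per iteration
    for _ in range(n_iter):
        coeffs = dictionary @ residual            # inner products with every atom
        best = int(np.argmax(np.abs(coeffs)))     # index of the most similar atom sigma*
        similarity = coeffs[best]
        residual = residual - similarity * dictionary[best]   # formula (4), unit-norm atoms
        rho, f = labels[best]
        picks.append((abs(similarity), rho, f))
    return picks, residual
```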

FIGS. 4A to 4D are scale-frequency maps of the present invention. FIG. 4A is the scale-frequency map of a doorbell sound with non-uniformly distributed frequencies, FIG. 4B is the scale-frequency map of a doorbell sound with uniformly distributed frequencies, FIG. 4C is the scale-frequency map of a dog bark with non-uniformly distributed frequencies, and FIG. 4D is the scale-frequency map of a dog bark with uniformly distributed frequencies.

The scale-frequency map module 18 builds a scale-frequency map of 56 blocks (8 scales × 7 frequencies) whose frequencies follow the non-uniform critical-band distribution according to ρ = {2^j | j = 1, ..., 8} and f = {150, 450, 840, 1370, 2150, 3400, 5800}, as shown in FIGS. 4A and 4C. However, this embodiment is not limited to this; the scale-frequency map module 18 can also build a scale-frequency map with uniformly distributed frequencies, as shown in FIGS. 4B and 4D.

After 60 iterations, the matching pursuit module 16 produces the 60 largest similarity coefficients (that is, energy values). The scale-frequency map module 18 adds together those of the 60 largest similarity coefficients (energies) that fall in the same waveform-pattern-length (scale) and frequency block of the scale-frequency map, to obtain the energy value of each block of the scale-frequency map, as shown in FIGS. 4A to 4D. From FIGS. 4A to 4D it can be seen that, in FIGS. 4A and 4C, the energy distribution observed in the scale-frequency map with the non-uniform critical-band distribution matches the energy distribution of the environmental sound to be analyzed over the range of human hearing.
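A sketch of how the 60 selected coefficients could be accumulated into the 8 × 7 scale-frequency map, with coefficients sharing the same scale and frequency summed into one block; `picks`, SCALES, and FREQS follow the assumptions of the earlier sketches, and the absolute coefficients are treated as energies, as in the text.

```python
import numpy as np

def scale_frequency_map(picks, scales=SCALES, freqs=FREQS):
    sfm = np.zeros((len(scales), len(freqs)))       # block energies E(s_i, f_k), 8 x 7
    for energy, rho, f in picks:
        sfm[scales.index(rho), freqs.index(f)] += energy
    return sfm
```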

After the energy value of each block of the scale-frequency map is obtained, the energy estimation module 20 operates on the frequencies, the waveform pattern lengths, and the energy value of each block of the scale-frequency map to obtain one scale-frequency descriptor for one frame (that is, one sound signal vector sequence). The scale-frequency descriptor contains a maximum energy ratio, a plurality of waveform-pattern-length energy values, a plurality of frequency energy values, a waveform-pattern-length centroid value, a frequency centroid value, a waveform-pattern-length spread value, and a frequency spread value. The operations of the above modules are repeated to obtain the N scale-frequency descriptors corresponding to the N frames of FIG. 2.

The energy estimation module 20 adds up the energy values of all blocks of the scale-frequency map to obtain a total energy value, and computes the ratio of the largest energy value in the scale-frequency map to the total energy value to obtain the maximum energy ratio. The larger the maximum energy ratio, the more the environmental sound resembles a monotone sound.

Next, the energy estimation module 20 computes a frequency energy value for each frequency of the scale-frequency map; that is, for each frequency, the energy values over all waveform pattern lengths are summed to obtain the 7 frequency energy values E(f_k) shown in formula (5):

E(f_k) = Σ_i E(s_i, f_k)   (5)

The energy estimation module 20 computes a scale (waveform-pattern-length) energy value for each scale of the scale-frequency map; that is, for each scale, the energy values over all frequencies are summed to obtain the 8 scale energy values E(s_i) shown in formula (6):

E(s_i) = Σ_k E(s_i, f_k)   (6)

The energy estimation module 20 operates on all the scales of the scale-frequency map, the energy value of each block, and the total energy value using formula (7) to obtain the scale (waveform-pattern-length) centroid value SC.

The energy estimation module 20 operates on all the frequencies of the scale-frequency map, the energy value of each block, and the total energy value using formula (8) to obtain the frequency centroid value FC.

The energy estimation module 20 operates on all the scales of the scale-frequency map, the energy value of each block, the total energy value, and the scale centroid value SC using formula (9) to obtain the scale (waveform-pattern-length) spread value SS.

The energy estimation module 20 operates on all the frequencies of the scale-frequency map, the energy value of each block, the total energy value, and the frequency centroid value FC using formula (10) to obtain the frequency spread value FS.

The energy estimation module 20 operates on the frequencies, the waveform pattern lengths, and the energy value of each block of the scale-frequency map according to formulas (5) to (10) to obtain, for one frame (one sound signal vector sequence), a scale-frequency descriptor containing one maximum energy ratio, 8 scale (waveform-pattern-length) energy values E(s_i), 7 frequency energy values E(f_k), one scale centroid value SC, one frequency centroid value FC, one scale spread value SS, and one frequency spread value FS; the operations of the above modules thus yield the N scale-frequency descriptors corresponding to the N frames of FIG. 2. The energy estimation module 20 then averages the N scale-frequency descriptors to obtain one mean scale-frequency descriptor; that is, the N maximum energy ratios are averaged to obtain a mean maximum energy ratio, the N×8 scale energy values are averaged to obtain 8 mean scale energy values, the N×7 frequency energy values E(f_k) are averaged to obtain 7 mean frequency energy values EE(f_k), the N scale centroid values SC are averaged to obtain a mean scale centroid value ESC, the N frequency centroid values FC are averaged to obtain a mean frequency centroid value EFC, the N scale spread values SS are averaged to obtain a mean scale spread value ESS, and the N frequency spread values FS are averaged to obtain a mean frequency spread value EFS.
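A sketch of computing one frame's scale-frequency descriptor from its map `sfm`. The marginal sums correspond to formulas (5) and (6); the centroid and spread definitions below (energy-weighted mean and energy-weighted standard deviation over the scale and frequency values) are an assumed reading of formulas (7) to (10), which are not reproduced in this text.

```python
import numpy as np

def sfm_descriptor(sfm, scales=SCALES, freqs=FREQS):
    total = sfm.sum()                                    # total energy of the map
    max_ratio = sfm.max() / total                        # maximum energy ratio
    scale_energy = sfm.sum(axis=1)                       # 8 scale energy values, formula (6)
    freq_energy = sfm.sum(axis=0)                        # 7 frequency energy values, formula (5)
    s = np.asarray(scales, dtype=float)
    f = np.asarray(freqs, dtype=float)
    sc = (s * scale_energy).sum() / total                # scale centroid SC
    fc = (f * freq_energy).sum() / total                 # frequency centroid FC
    ss = np.sqrt((((s - sc) ** 2) * scale_energy).sum() / total)   # scale spread SS
    fs_ = np.sqrt((((f - fc) ** 2) * freq_energy).sum() / total)   # frequency spread FS
    return np.concatenate(([max_ratio], scale_energy, freq_energy, [sc, fc, ss, fs_]))

# Averaging over the N frames then gives the mean scale-frequency descriptor:
# mean_descriptor = np.mean([sfm_descriptor(m) for m in per_frame_maps], axis=0)
```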

The sound classifier 22 can be implemented with currently known techniques, for example a support vector machine (SVM). The sound classifier 22 is first trained with data of several environmental sound categories, so that the trained sound classifier 22 can recognize data of the environmental sound categories to be classified.

In this embodiment, the energy estimation module 20 transmits the computed mean scale-frequency descriptor, which contains a mean maximum energy ratio, 8 mean scale energy values, 7 mean frequency energy values EE(f_k), a mean scale centroid value ESC, a mean frequency centroid value EFC, a mean scale spread value ESS, and a mean frequency spread value EFS, to the sound classifier 22, and the sound classifier 22 performs recognition on the received mean scale-frequency descriptor to classify the category of the environmental sound.
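A sketch of the classification stage. The text names a support vector machine as one possible classifier; scikit-learn's SVC is used here purely as an illustrative choice, and the kernel, variable names, and 20-dimensional descriptor layout (1 + 8 + 7 + 4 values) are assumptions.

```python
from sklearn.svm import SVC

def train_and_classify(train_descriptors, train_labels, query_descriptor):
    # train_descriptors: one averaged scale-frequency descriptor per training sound
    # train_labels: the environmental sound category of each training sound
    clf = SVC(kernel="rbf")
    clf.fit(train_descriptors, train_labels)
    return clf.predict([query_descriptor])[0]   # predicted environmental sound category
```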

The operation of the environmental sound recognition method of the present invention is described below with reference to the above figures.

FIG. 5 is a flowchart of the environmental sound recognition method of the present invention. In FIG. 5, a sound feature matrix is built in the Gabor dictionary 14; the sound feature matrix consists of the vector sequences of a plurality of waveform patterns, and the vector sequences of the waveform patterns in the Gabor dictionary 14 (shown in FIGS. 3B to 3I), as expressed by formula (2), are generated by the Gabor function of formula (1) (step S50).

In this embodiment the four main parameters of the Gabor function are ρ, μ, f, and θ, as described above, and the two further quantities are the normalization constant K_{ρ,f,θ} and the time index t. The parameters are set to ρ = {2^j | j = 1, ..., 8}, μ = {0, 64, 128, 192}, f = {150, 450, 840, 1370, 2150, 3400, 5800}, θ = 0, and t = 0–255 to generate the waveform pattern vector sequences G_{ρ,μ,f,θ}, so the Gabor dictionary 14 contains a total of 224 waveform pattern vector sequences G_{ρ,μ,f,θ} (8 scales × 7 frequencies × 4 center positions).

The frequency parameter is set in the form of critical bands because critical bands have the characteristics of the human auditory model. The non-uniformly distributed critical bands cover 0 to 8000 Hz, and experiments show that using only seven critical bands gives better results on the database, so the Gabor dictionary 14 is generated from seven critical bands.

The sound pre-processing module 12 receives the ambient sound of the surrounding environment (shown as the sound waveform at the top of FIG. 2) and converts the received ambient sound into N sound files of a sound file format such as mp3 or wav (step S52). Then the sound pre-processing module 12, for example using Matlab, sets a sampling frequency and a number of quantization bits, samples the sound signals of the N sound files according to the sampling frequency, and quantizes the sampled sound signals according to the number of quantization bits to obtain N sound signal vector sequences in mathematical form (frame 1, frame 2, ..., frame N in FIG. 2) (step S54).

In the first iteration, the matching pursuit module 16, using the matching pursuit algorithm, computes the inner product of the sound signal vector sequence of the original signal s of one frame of the environmental sound with the vector sequence of each waveform pattern in the sound feature matrix of the Gabor dictionary 14 according to formula (3), obtaining 224 similarity coefficients (energies) and the 224 corresponding inner-product vector sequences. In the second and subsequent iterations, the matching pursuit module 16 computes the inner product of the iterated vector sequence R_s(n−1) of formula (4) with the vector sequence of each waveform pattern in the sound feature matrix of the Gabor dictionary 14 according to formula (3), again obtaining 224 similarity coefficients and the 224 corresponding inner-product vector sequences (step S56).

Using formula (4), in each iteration the matching pursuit module 16 selects the inner-product vector sequence σ* with the largest similarity coefficient and operates on the sound signal vector sequence of the original signal s (or the iterated vector sequence R_s(n−1)) and the inner-product vector sequence σ* corresponding to the largest similarity coefficient to obtain a similar-signal vector sequence; the matching pursuit module 16 then subtracts the similar-signal vector sequence from the iterated vector sequence R_s(n−1) to obtain the residual-signal vector sequence R_s(n) (step S58). In the next iteration, the residual-signal vector sequence R_s(n) is set as the iterated vector sequence R_s(n−1).

Because the matching pursuit module 16 selects the most similar waveform pattern vector sequence in every iteration, the residual signal becomes very small after n iterations (60 in this embodiment); in other words, when the signal is reconstructed from the selected most-similar waveform pattern vector sequences σ*, the reconstructed signal differs only slightly from the original signal.

The matching pursuit module 16 repeats steps S56 and S58 iteratively a predetermined number of times (60 in this embodiment) to obtain the 60 largest similarity coefficients of the 60 iterations (step S60).

The scale-frequency map module 18 builds a scale-frequency map of 56 blocks (8 scales × 7 frequencies) whose frequencies follow the non-uniform critical-band distribution according to ρ = {2^j | j = 1, ..., 8} and f = {150, 450, 840, 1370, 2150, 3400, 5800} (as shown in FIGS. 4A and 4C).

After 60 iterations, the matching pursuit module 16 produces the 60 largest similarity coefficients (energy values). The scale-frequency map module 18 adds together those of the 60 largest similarity coefficients (energies) that fall in the same waveform-pattern-length (scale) and frequency block of the scale-frequency map, to obtain the energy value E(s_i, f_k) of each block of the scale-frequency map, as shown in FIGS. 4A and 4C (step S62).

After the energy value E(s_i, f_k) of each block of the scale-frequency map is obtained, the energy estimation module 20 operates on the frequencies, the waveform pattern lengths, and the energy value E(s_i, f_k) of each block of the scale-frequency map to obtain one scale-frequency descriptor for one frame (one sound signal vector sequence), where the scale-frequency descriptor contains one maximum energy ratio, 8 scale (waveform-pattern-length) energy values, 7 frequency energy values E(f_k), one waveform-pattern-length centroid value SC, one frequency centroid value FC, one scale (waveform-pattern-length) spread value SS, and one frequency spread value FS (step S64). The operations of steps S56 to S64 are repeated to obtain the N scale-frequency descriptors corresponding to the N frames of FIG. 2.

The energy estimation module 20 adds up the energy values E(s_i, f_k) of all blocks of the scale-frequency map to obtain a total energy value, and computes the ratio of the largest energy value in the scale-frequency map to the total energy value to obtain the maximum energy ratio.

Next, the energy estimation module 20 computes a frequency energy value for each frequency of the scale-frequency map; that is, using formula (5), for each frequency the energy values over all scales (waveform pattern lengths) are summed to obtain the 7 frequency energy values E(f_k).

The energy estimation module 20 computes a scale energy value for each scale (waveform pattern length) of the scale-frequency map; that is, using formula (6), for each scale the energy values over all frequencies are summed to obtain the 8 scale energy values E(s_i).

The energy estimation module 20 operates on all the scales of the scale-frequency map, the energy value of each block, and the total energy value using formula (7) to obtain a scale centroid value SC.

The energy estimation module 20 operates on all the frequencies of the scale-frequency map, the energy value E(s_i, f_k) of each block, and the total energy value using formula (8) to obtain a frequency centroid value FC.

The energy estimation module 20 operates on all the scales of the scale-frequency map, the energy value E(s_i, f_k) of each block, the total energy value, and the scale centroid value SC using formula (9) to obtain a scale spread value SS.

The energy estimation module 20 operates on all the frequencies of the scale-frequency map, the energy value E(s_i, f_k) of each block, the total energy value, and the frequency centroid value FC using formula (10) to obtain a frequency spread value FS.

The energy estimation module 20 operates on the frequencies, the waveform pattern lengths, and the energy value E(s_i, f_k) of each block of the scale-frequency map according to formulas (5) to (10) to obtain, for one frame (one sound signal vector sequence), a scale-frequency descriptor containing one maximum energy ratio, 8 scale (waveform-pattern-length) energy values, 7 frequency energy values E(f_k), one scale centroid value SC, one frequency centroid value FC, one scale spread value SS, and one frequency spread value FS; the operations of the above modules thus yield the N scale-frequency descriptors corresponding to the N frames of FIG. 2. The energy estimation module 20 then averages the N scale-frequency descriptors to obtain one mean scale-frequency descriptor (step S66); that is, the N maximum energy ratios are averaged to obtain a mean maximum energy ratio, the N×8 scale energy values are averaged to obtain 8 mean scale energy values, the N×7 frequency energy values E(f_k) are averaged to obtain 7 mean frequency energy values EE(f_k), the N scale centroid values SC are averaged to obtain a mean scale centroid value ESC, the N frequency centroid values FC are averaged to obtain a mean frequency centroid value EFC, the N scale spread values SS are averaged to obtain a mean scale spread value ESS, and the N frequency spread values FS are averaged to obtain a mean frequency spread value EFS. The mean scale-frequency descriptor therefore contains a mean maximum energy ratio, 8 mean scale energy values, 7 mean frequency energy values EE(f_k), a mean scale centroid value ESC, a mean frequency centroid value EFC, a mean scale spread value ESS, and a mean frequency spread value EFS.

The energy estimation module 20 transmits the computed mean scale-frequency descriptor, which contains a mean maximum energy ratio, 8 mean scale energy values, 7 mean frequency energy values EE(f_k), a mean scale centroid value ESC, a mean frequency centroid value EFC, a mean scale spread value ESS, and a mean frequency spread value EFS, to the sound classifier 22, and the sound classifier 22 performs recognition on the received mean scale-frequency descriptor to classify the category of the environmental sound (step S68).

The present invention provides an environmental sound recognition method whose advantage is that environmental sounds can be analyzed using less analysis data; the sound feature parameters of the environmental sound are used to produce the descriptor parameter values of a scale-frequency map with a non-uniform frequency distribution, and the descriptors are classified with a sound classifier to identify the category of the environmental sound. This improves the recognizability of surrounding environmental sounds and also strengthens the analysis of frequency components over short time intervals.

Although the present invention has been described above with reference to preferred embodiments and illustrative drawings, the description should not be regarded as limiting. Those skilled in the art may make various modifications, omissions, and changes to its form and details without departing from the claimed scope of the present invention.

10‧‧‧Environmental sound recognition device
12‧‧‧Sound pre-processing module
14‧‧‧Gabor dictionary
16‧‧‧Matching pursuit module
18‧‧‧Scale-frequency map module
20‧‧‧Energy estimation module
22‧‧‧Sound classifier

FIG. 1 is a block diagram of the environmental sound recognition device of the present invention; FIG. 2 is a schematic diagram of the environmental sound processing procedure of the present invention; FIG. 3A is a schematic diagram of the sound waveform of one frame in FIG. 2; FIGS. 3B to 3I are schematic diagrams of waveform patterns in the Gabor dictionary of the present invention; FIGS. 4A to 4D are scale-frequency maps of the present invention, where FIG. 4A is the scale-frequency map of a doorbell sound with non-uniformly distributed frequencies, FIG. 4B is the scale-frequency map of a doorbell sound with uniformly distributed frequencies, FIG. 4C is the scale-frequency map of a dog bark with non-uniformly distributed frequencies, and FIG. 4D is the scale-frequency map of a dog bark with uniformly distributed frequencies; and FIG. 5 is a flowchart of the environmental sound recognition method of the present invention.

Claims (7)

1. An environmental sound recognition method, the following steps of the method being performed in an environmental sound recognition device: (a) establishing a sound feature matrix in a sound feature dictionary, the sound feature matrix consisting of a plurality of vector sequences, wherein each combination of one of a plurality of waveform pattern center positions, one of a plurality of frequencies, and one of a plurality of waveform pattern lengths is operated on to obtain the vector sequence of the corresponding waveform pattern; (b) performing an inner-product operation between an iteration vector sequence and each of the vector sequences to obtain a plurality of similarity coefficients and a plurality of inner-product vector sequences respectively corresponding to the similarity coefficients, wherein in the first iteration the iteration vector sequence is one of at least one sound signal vector sequence; (c) operating on the iteration vector sequence and the inner-product vector sequence corresponding to the largest similarity coefficient to obtain a similar-signal vector sequence, and subtracting the similar-signal vector sequence from the iteration vector sequence to obtain a residual-signal vector sequence; (d) repeating steps (b) and (c) a predetermined number of times to obtain the same number of largest similarity coefficients as the predetermined number of times, wherein in the next iteration the iteration vector sequence is the residual-signal vector sequence; (e) summing the largest similarity coefficients having the same frequency and waveform pattern length to obtain a plurality of energy values; (f) operating on the energy values, the frequencies, and the waveform pattern lengths to obtain at least one scale-frequency descriptor equal in number to the at least one sound signal vector sequence, wherein each of the at least one scale-frequency descriptor contains a maximum energy ratio, a plurality of waveform-pattern-length energy values, a plurality of frequency energy values, a waveform-pattern-length centroid value, a frequency centroid value, a waveform-pattern-length spread value, and a frequency spread value; and (g) averaging the at least one scale-frequency descriptor to obtain a mean scale-frequency descriptor.
2. The method of claim 1, further comprising the following steps: (h) before step (b), converting an environmental sound into at least one sound file; (i) setting a sampling frequency and a number of quantization bits, sampling the sound signal of the at least one sound file according to the sampling frequency, and quantizing the sampled sound signal according to the number of quantization bits to obtain the at least one sound signal vector sequence; and (j) after step (g), comparing the mean scale-frequency descriptor in order to classify an environmental sound category of the mean scale-frequency descriptor.
3. The method of claim 1, wherein in step (f): the energy values are summed to obtain a total energy value; the largest of the energy values is operated on with the total energy value to obtain the maximum energy ratio; for each of the frequencies, the energy values over all waveform pattern lengths are summed to obtain a plurality of frequency energy values; for each of the waveform pattern lengths, the energy values over all frequencies are summed to obtain a plurality of waveform-pattern-length energy values; the energy values, the waveform pattern lengths, and the total energy value are operated on to obtain the waveform-pattern-length centroid value; the energy values, the frequencies, and the total energy value are operated on to obtain the frequency centroid value; the energy values, the waveform pattern lengths, the total energy value, and the waveform-pattern-length centroid value are operated on to obtain the waveform-pattern-length spread value; and the energy values, the frequencies, the total energy value, and the frequency centroid value are operated on to obtain the frequency spread value.
4. The method of claim 1, wherein in step (g) the mean scale-frequency descriptor comprises a mean maximum energy ratio, a plurality of mean waveform-pattern-length energy values, a plurality of mean frequency energy values, a mean waveform-pattern-length centroid value, a mean frequency centroid value, a mean waveform-pattern-length spread value, and a mean frequency spread value.
5. The method of claim 1, wherein the frequencies serve as a frequency coordinate axis and the waveform pattern lengths serve as a scale coordinate axis to build a scale-frequency map.
6. The method of claim 5, wherein the frequencies are set, according to a human auditory model, to non-linearly distributed frequencies audible to the human ear.
7. The method of claim 1, wherein in step (a) each combination of one of the waveform pattern center positions, one of the frequencies, and one of the waveform pattern lengths is operated on together with a plurality of time indices to obtain the vector sequence of the corresponding waveform pattern.
TW101119270A 2012-05-30 2012-05-30 Environmental sound recognition method TWI485697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW101119270A TWI485697B (en) 2012-05-30 2012-05-30 Environmental sound recognition method


Publications (2)

Publication Number Publication Date
TW201349223A TW201349223A (en) 2013-12-01
TWI485697B true TWI485697B (en) 2015-05-21

Family

ID=50157487

Family Applications (1)

Application Number Title Priority Date Filing Date
TW101119270A TWI485697B (en) 2012-05-30 2012-05-30 Environmental sound recognition method

Country Status (1)

Country Link
TW (1) TWI485697B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI550298B (en) * 2015-03-09 2016-09-21 MStar Semiconductor, Inc. Echo discriminating device and method thereof
TWI579836B (en) * 2016-01-15 2017-04-21 Real-time music emotion recognition system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200915300A (en) * 2007-09-26 2009-04-01 Fraunhofer Ges Forschung Apparatus and method for extracting an ambient signal in an apparatus and method for obtaining weighting coefficients for extracting an ambient signal and computer program
EP1724755B1 (en) * 2005-05-18 2009-07-15 Gfk Eurisko S.R.L. Method and system for comparing audio signals and identifying an audio source
CN101636783B (en) * 2007-03-16 2011-12-14 Panasonic Corporation Voice analysis device, voice analysis method, voice analysis program, and system integration circuit
CN1655234B (en) * 2004-02-10 2012-01-25 Samsung Electronics Co., Ltd. Apparatus and method for distinguishing vocal sound from other sounds


Also Published As

Publication number Publication date
TW201349223A (en) 2013-12-01
