CN105469807A - Multi-fundamental frequency extraction method and multi-fundamental frequency extraction device - Google Patents
Multi-fundamental frequency extraction method and multi-fundamental frequency extraction device
- Publication number: CN105469807A
- Application number: CN201511023725.3A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
Abstract
The invention discloses a multi-fundamental frequency extraction method and device based on empirical mode decomposition and a hidden Markov model. The method comprises the steps of: filtering a speech signal with an auditory filter bank and framing the filtered signal; computing an autocorrelation function in each time-frequency unit of the auditory spectrum; computing the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by empirical mode decomposition; computing a frequency matching function from each instantaneous frequency; building the likelihood probability of each fundamental frequency state from the frequency matching function, and using a corpus to estimate the transition probabilities between fundamental frequency states; and enhancing the likelihood probability of each fundamental frequency state, combining the enhanced likelihood probabilities with the corresponding transition probabilities, and extracting the multi-fundamental-frequency track of the speech signal with the hidden Markov model.
Description
Technical field
The present invention relates to empirical mode decomposition in digital signal processing, auditory filter-bank analysis of speech signals, fundamental frequency extraction from speech, and the construction of likelihood and transition probabilities for hidden Markov models.
Background art
The extraction of the fundamental frequency (pitch) and the tracking of its contour are important for many speech and audio signal processing technologies, such as audio retrieval and classification, Chinese tone recognition, and single-channel speech separation. A number of algorithms now perform well at extracting a single fundamental frequency from clean speech or speech with a small amount of noise. However, the single-fundamental-frequency assumption makes such algorithms unusable when several fundamental frequencies are present in the signal at the same time, for example when two speakers talk simultaneously or when music plays in the speaker's background. In computational auditory scene analysis (CASA), which is devoted to the cocktail-party problem, multi-fundamental-frequency extraction and tracking is an important basis for effective speech segmentation and organization.
Hidden Markov models (HMMs) have long been used to track the continuity of pitch contours. As early as the 1980s, HMMs were used to decide the number of fundamental frequencies in each frame of mixed speech. In 2003, the distance between the true pitch period and the peaks of the autocorrelation function was used to model the likelihood function of the fundamental frequency states, yielding a multi-pitch tracking algorithm based on a hidden Markov model. In 2013, the height of the autocorrelation peaks was used directly to build the potential function of the fundamental frequency distribution, further improving the performance of multi-fundamental-frequency extraction. A common feature of these algorithms is that the mid-level representations used during multi-pitch tracking are all based on the auditory spectrogram (cochleagram); specifically, the likelihood functions of the fundamental frequency states are all extracted from the autocorrelation spectrogram (correlogram), and they mainly use local features near its peaks.
The main characteristic of the auditory spectrogram is that its frequency resolution is high at low frequencies and low at high frequencies; a high-frequency channel of the auditory filter bank usually responds to several harmonics of the speech at once, so its response is typically affected by amplitude modulation (AM). In single-fundamental-frequency extraction, both the amplitude envelope produced by the AM effect in the high-frequency channels and the peaks of that envelope carry information about the original fundamental frequency. In multi-fundamental-frequency extraction, however, higher harmonics of similar energy that belong to different fundamental frequencies may coexist in one time-frequency unit; the resulting amplitude modulation rate then corresponds to no harmonic of either fundamental frequency, which shifts the peak positions and distorts the peak heights of the corresponding autocorrelation function and thus harms the extraction of multiple fundamental frequencies.
Summary of the invention
In view of this, in order to overcome the recurrent period-doubling errors in fundamental frequency extraction and the other problems mentioned above, the present invention proposes a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model.
According to one aspect of the present invention, a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model is provided, characterized by comprising the following steps:
Step 1: filter the speech signal with an auditory filter bank and frame the filtered signal, obtaining the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: apply empirical mode decomposition to the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
Step 4: compute a frequency matching function from each instantaneous frequency;
Step 5: build the likelihood probability of each fundamental frequency state from the frequency matching function, and use a corpus to estimate the transition probabilities between fundamental frequency states; the fundamental frequency states include single-fundamental-frequency states and double-fundamental-frequency states;
Step 6: enhance the likelihood probability of each fundamental frequency state, combine the enhanced likelihood probabilities with the corresponding transition probabilities, and use the hidden Markov model to extract the multiple pitch contours of the speech signal.
According to another aspect of the present invention, a multi-fundamental-frequency extraction device based on empirical mode decomposition and a hidden Markov model is provided, characterized by comprising:
A preprocessing module, which filters the speech signal with an auditory filter bank and frames the filtered signal, obtaining the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
An autocorrelation function computation module, which computes an autocorrelation function in each time-frequency unit of the auditory spectrum;
An instantaneous frequency computation module, which applies empirical mode decomposition to the autocorrelation function of each time-frequency unit and computes the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
A frequency matching function computation module, which computes the frequency matching function from each instantaneous frequency;
A likelihood and transition probability computation module, which builds the likelihood probability of each fundamental frequency state from the frequency matching function and uses a corpus to estimate the transition probabilities between fundamental frequency states; the fundamental frequency states include single-fundamental-frequency states and double-fundamental-frequency states;
A trajectory extraction module, which enhances the likelihood probability of each fundamental frequency state, combines the enhanced likelihood probabilities with the corresponding transition probabilities, and uses the hidden Markov model to extract the multiple pitch contours of the speech signal.
In the above scheme, to suppress the adverse amplitude modulation effect that occurs in the high-frequency channels of the gammatone filter bank during multi-fundamental-frequency extraction, the frequency matching function replaces the autocorrelation function when computing the likelihood probabilities of the pitch states in the hidden Markov model. Moreover, compared with the positions and heights of autocorrelation peaks, the average instantaneous frequency of a time-frequency unit is less affected by noise and by the amplitude modulation effect, so the frequency matching function that the present invention extracts from the average instantaneous frequency behaves more reliably during multi-fundamental-frequency extraction, which ultimately improves the performance of the algorithm.
In addition, period doubling is a common error in fundamental frequency extraction. For this problem, the proposed method applies an enhancement procedure that reduces the peak height of the frequency matching function at period-doubled positions, lowering the likelihood probability of period-doubled candidates and thus the probability that period-doubling errors occur.
In summary, the present invention suppresses the adverse amplitude modulation effect and the probability of period-doubling errors, combines the result with pitch-state transition probabilities estimated from a corpus, and obtains the tracks of two simultaneous fundamental frequencies by hidden Markov model decoding.
Brief description of the drawings
Further features and advantages of the present invention are described below with reference to the illustrative drawings.
Fig. 1 is a flow chart of the proposed multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model;
Fig. 2 is a flow chart of computing the autocorrelation function of each time-frequency unit in the prior art;
Fig. 3 is a flow chart of computing the frequency matching function of each time-frequency unit in the present invention;
Fig. 4 is a flow chart of building the fundamental frequency state likelihood probabilities from the frequency matching function;
Fig. 5 is a flow chart of enhancing the fundamental frequency state likelihood functions;
Fig. 6 is a flow chart of multi-fundamental-frequency extraction with the hidden Markov model.
Detailed description of the embodiments
It should be appreciated that the following detailed description of the examples and drawings is not intended to limit the present invention to the particular illustrative embodiments; the described embodiments merely illustrate the steps of the present invention, whose scope is defined by the appended claims.
The present invention applies empirical mode decomposition to the autocorrelation function of each time-frequency unit in the two-dimensional auditory spectrogram of the speech, obtains the dominant instantaneous frequency, and computes the frequency matching function from it. Compared with the autocorrelation function, the frequency matching function overcomes the amplitude modulation effect that is harmful to multi-fundamental-frequency extraction in the high-frequency gammatone filter channels, so the fundamental frequency state likelihood functions built on it are more stable and reliable. Using the constructed likelihood functions together with the fundamental frequency state transition functions estimated from a pitch-labeled corpus, multiple pitch contours can be extracted with the hidden Markov model by Viterbi decoding.
As shown in Figure 1, the present invention proposes a multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model, with the following concrete steps:
Step 1: filter the speech signal with an auditory filter bank and frame the filtered signal to obtain the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: apply empirical mode decomposition to the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the first intrinsic mode function obtained by the decomposition;
Step 4: compute the frequency matching function from the instantaneous frequency;
Step 5: build the likelihood probability of each frame's fundamental frequency states from the frequency matching function, and use a corpus to estimate the transition probabilities between fundamental frequency states;
Step 6: enhance the likelihood probabilities to reduce period-doubling errors, then combine the enhanced likelihood probabilities with the transition probabilities, and extract the multiple pitch contours of the current speech with the hidden Markov model.
In step 1, the one-dimensional speech signal is filtered by the auditory filter bank and then windowed and framed, yielding the two-dimensional time-frequency representation, the auditory spectrum; one dimension of this representation is time (the speech frame index) and the other is frequency (the channel index).
The auditory filter bank is a model that imitates the auditory perception mechanism of the cochlea. The time-domain impulse response of each filter is given by formula (1), with filter center frequencies distributed between 0 Hz and 3000 Hz:
g(t) = t^(l-1) exp(-2πbt) cos(2πft), t ≥ 0 (1)
where t denotes time, the filter order l = 4, f is the filter center frequency, and b is the equivalent rectangular bandwidth.
As shown in Figure 2, the speech signal passes through the above filter bank, and the output of each filter is a time-domain signal of the same length as the original speech signal. The output of each filter channel is then windowed, with a typical window length of 20 ms, which yields the two-dimensional time-frequency representation of the original speech, the auditory spectrogram or cochleagram. It can be written C(c, m), where c is the filter channel index and m the speech frame index.
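As an illustrative sketch of step 1 (the channel count, frequency layout, and ERB formula below are assumptions for demonstration, not taken from the patent), a gammatone filter bank followed by 20 ms framing can be written as:

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.04, order=4):
    """4th-order gammatone impulse response: t^(l-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)  # Glasberg-Moore equivalent rectangular bandwidth
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, centers, frame_ms=20):
    """Filter x with each gammatone channel, then frame it: C[c, m] is frame m of channel c."""
    frame_len = int(fs * frame_ms / 1000)
    n_frames = len(x) // frame_len
    C = np.empty((len(centers), n_frames, frame_len))
    for c, fc in enumerate(centers):
        y = np.convolve(x, gammatone_ir(fc, fs))[:len(x)]
        C[c] = y[:n_frames * frame_len].reshape(n_frames, frame_len)
    return C

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220 * t)           # 220 Hz test tone, 1 second
centers = np.geomspace(80, 3000, 32)      # 32 illustrative channels, 80 Hz - 3000 Hz
C = cochleagram(x, fs, centers)
print(C.shape)                            # (32, 50, 160)
```

Each C[c, m] corresponds to one time-frequency unit u(c, m); a 220 Hz tone produces the strongest response in the channels whose center frequency is nearest 220 Hz.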
In step 2, the autocorrelation function is computed in each time-frequency unit of the auditory spectrum according to formula (2):
A(c, m, τ) = Σ_{n=0}^{W-1} h(c, mT + n) · h(c, mT + n + τ) (2)
where h(c) is the output of the gammatone filter of channel c, m is the speech frame index, n the discrete time index, τ a lag, T the number of samples per speech frame, and W the number of discrete points.
Because each channel has a different filter, the outputs of the channel filters have different delays. Computing the autocorrelation function has the effect of aligning the phases of the channels.
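A minimal sketch of the per-unit autocorrelation of step 2 (the normalization and exact summation limits are assumptions; the patent's formula (2) may differ in detail):

```python
import numpy as np

def unit_acf(h_c, m, frame_len, max_lag):
    """Autocorrelation of channel output h_c over frame m, for lags 0..max_lag-1
    (a common correlogram form: sum over frame_len samples per lag)."""
    start = m * frame_len
    seg = h_c[start:start + frame_len + max_lag]
    return np.array([np.dot(seg[:frame_len], seg[tau:tau + frame_len])
                     for tau in range(max_lag)])

fs = 8000
t = np.arange(fs) / fs
h = np.sin(2 * np.pi * 200 * t)   # stand-in for one channel output; period = 40 samples
acf = unit_acf(h, m=5, frame_len=160, max_lag=60)
peak = np.argmax(acf[20:]) + 20   # first major peak away from lag 0
print(peak)                       # 40
```

The peak at lag 40 corresponds to the 200 Hz period (8000 / 200 samples), which is how pitch-period candidates are read off the correlogram.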
In step 3, empirical mode decomposition is applied to the autocorrelation function of each time-frequency unit, and the instantaneous frequency of the dominant sound source of each unit is computed from the first intrinsic mode function obtained by the decomposition. Specifically:
The Hilbert-Huang transform decomposes the original autocorrelation function into a series of intrinsic mode functions, and, following the auditory masking effect, the frequency of the first intrinsic mode function is chosen as the instantaneous frequency of the sound source dominating the time-frequency unit.
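Step 3 can be illustrated as follows. This is a deliberately simplified sift (fixed iteration count, no formal stopping criterion), not a full EMD implementation; the mean instantaneous frequency is then taken from the Hilbert phase of the first intrinsic mode function:

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import hilbert, argrelextrema

def first_imf(x, n_sift=5):
    """Tiny EMD sketch: repeatedly subtract the mean of the upper/lower spline
    envelopes and return the resulting first intrinsic mode function."""
    h = x.copy()
    idx = np.arange(len(x))
    for _ in range(n_sift):
        mx = argrelextrema(h, np.greater)[0]
        mn = argrelextrema(h, np.less)[0]
        if len(mx) < 4 or len(mn) < 4:
            break
        upper = CubicSpline(mx, h[mx])(idx)
        lower = CubicSpline(mn, h[mn])(idx)
        h = h - (upper + lower) / 2.0
    return h

def mean_inst_freq(imf, fs):
    """Mean instantaneous frequency from the Hilbert phase (edges dropped
    to avoid end effects of the analytic signal)."""
    phase = np.unwrap(np.angle(hilbert(imf)))
    inst = np.diff(phase) * fs / (2 * np.pi)
    q = len(inst) // 4
    return float(np.mean(inst[q:-q]))

fs = 8000
t = np.arange(1600) / fs
x = np.sin(2 * np.pi * 500 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)  # fast + slow component
f = mean_inst_freq(first_imf(x), fs)
print(f)
```

The fast 500 Hz component dominates the extrema, so the first IMF isolates it and the mean instantaneous frequency lands near 500 Hz, mirroring how the dominant source frequency is read from the decomposed autocorrelation function.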
As shown in Figure 3, in step 4 the frequency matching function is computed from the instantaneous frequency according to formula (3). This function serves as the mid-level representation of the fundamental frequency extraction process: it describes the degree to which the average instantaneous frequency of the current time-frequency unit matches each candidate pitch frequency. In formula (3), f̄(c, m) denotes the average instantaneous frequency of the time-frequency unit of channel c in frame m, τ denotes a candidate pitch period (a lag within the considered range), and int() is the rounding function returning the nearest integer.
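Formula (3) itself is not reproduced in this text, so the following is only a hypothetical matching score built from the property the description implies: when the dominant frequency is a harmonic of the candidate pitch 1/τ, their product lies near an integer, and int() measures the deviation:

```python
import numpy as np

def freq_match(f_bar, tau):
    """Hypothetical matching score (NOT the patent's formula (3)): f_bar * tau is
    near an integer exactly when f_bar is a harmonic of the candidate pitch 1/tau."""
    d = np.abs(f_bar * tau - np.rint(f_bar * tau))  # distance to nearest integer, in [0, 0.5]
    return 1.0 - 2.0 * d                            # 1 = perfect harmonic match, 0 = worst

# A unit dominated by 600 Hz matches a 200 Hz candidate (3rd harmonic)
# far better than a 170 Hz candidate.
print(freq_match(600.0, 1 / 200), freq_match(600.0, 1 / 170))   # ~1.0 vs ~0.06
```

Whatever the exact functional form, the key design choice is that it depends only on the average instantaneous frequency of the unit, not on autocorrelation peak heights, which is why it resists the amplitude modulation effect.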
As shown in Figure 4, in step 5 the likelihood probability of each frame's fundamental frequency states is built from the frequency matching function, and a corpus is used to estimate the transition probabilities between fundamental frequency states. Specifically:
First, the likelihood probability of each fundamental frequency state, i.e. the observation probability, is built from the frequency matching function. The likelihood probability that a single pitch period τ1 is present in time-frequency unit u(c, m) is given by formula (4), and the likelihood probability that two pitch periods τ1 and τ2 are present simultaneously in u(c, m) is given by formula (5). In these formulas, x denotes the observed speech signal; ω1 and ω2 denote the single-fundamental-frequency and double-fundamental-frequency states respectively; L(c, m) is the normalized loudness of each time-frequency unit; and Φc is the set of channel positions in the two-dimensional time-frequency representation.
The normalized loudness L(c, m) of each time-frequency unit is computed by formula (6), where E(c, m) denotes the energy of time-frequency unit u(c, m) and N is the number of channels of the filter bank.
Second, the fundamental frequency state of each frame may lie in one of three spaces, namely the zero-, single-, and double-fundamental-frequency spaces:
Ω = Ω0 ∪ Ω1 ∪ Ω2
The transition probabilities between the three fundamental frequency states are obtained by statistics over a pitch-labeled database, where Ωi is a fundamental frequency state space and pij denotes the transition probability from state space Ωi to state space Ωj.
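The corpus statistics for the transition probabilities can be sketched as simple transition counting over per-frame pitch-count labels (the toy corpus and the maximum-likelihood estimator below are illustrative assumptions):

```python
import numpy as np

def state_transition_probs(pitch_labels):
    """Estimate the 3x3 transition matrix between per-frame pitch-count states
    (0, 1, or 2 fundamentals) by counting consecutive-frame transitions in a
    labeled corpus and normalizing each row."""
    counts = np.zeros((3, 3))
    for track in pitch_labels:                 # each track: per-frame pitch counts
        for a, b in zip(track[:-1], track[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy corpus: each list is one utterance labeled with the number of simultaneous
# fundamental frequencies per frame.
corpus = [[0, 0, 1, 1, 1, 2, 2, 1, 0],
          [1, 1, 2, 2, 2, 1, 1, 0, 0]]
P = state_transition_probs(corpus)
print(np.round(P, 2))   # each row sums to 1
```

Row i of P corresponds to p_ij, the probability of moving from state space Ω_i to Ω_j between consecutive frames; self-transitions dominate because pitch-count states persist across frames.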
In step 6 the likelihood probabilities are enhanced to reduce period-doubling errors, the enhanced likelihood probabilities are combined with the transition probabilities, and the multiple pitch contours of the current speech are extracted with the hidden Markov model. As shown in Figure 5, the steps are as follows:
First, the single-fundamental-frequency likelihood function is enhanced according to formula (7), in which m ranges over 2 to 7, meaning that only period-doubling errors caused by the 2nd to 7th harmonics are addressed, and α is a preset coefficient taking a value between 0.6 and 0.8.
This formula enhances the likelihood probability of a single-fundamental-frequency state as follows: among the values of the function of formula (4) at the 1/m positions of the candidate pitch period τ1, the maximum is found; this maximum is multiplied by a coefficient (a value between 0.6 and 0.8) and the product is taken as the frequency matching adjustment; the adjustment is then subtracted from the frequency matching value at the original candidate pitch period point to obtain the enhanced frequency matching value.
Second, the double-fundamental-frequency likelihood function is enhanced. To this end, the double-fundamental-frequency likelihood is first written as the sum of two functions, as shown in formula (8):
g(x|{τ1, τ2}) = p(x|{τ1}) + p_r(τ1, τ2) (8)
where p_r(τ1, τ2) is given by formula (9). In these formulas, τ1 and τ2 are the two candidate pitch period points, and g(x|{τ1, τ2}) denotes the likelihood probability of observing the speech signal x at these two candidate pitch period points.
Then, the two functions are each enhanced by the method of formula (7), giving the result of formula (10):
g_en(x|{τ1, τ2}) = p_en(x|{τ1}) + p_r_en(τ1, τ2) (10)
where p_en(x|{τ1}) denotes the result of enhancing p(x|{τ1}) and p_r_en(τ1, τ2) the result of enhancing p_r(τ1, τ2).
As shown in Figure 6, the enhanced likelihood probabilities obtained in this step are combined with the three fundamental frequency state transition probabilities obtained in step 5, and the multiple pitch contours are obtained by the Viterbi decoding procedure of the hidden Markov model. Note that although the fundamental frequency state space contains three kinds of states, this step only computes the likelihood probabilities of two of them, because the likelihood probability of the zero-fundamental-frequency state (the state with no fundamental frequency) is a preset constant and needs neither computation nor enhancement.
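The final decoding over the three fundamental-frequency state types can be sketched with a generic log-domain Viterbi (all probabilities below are toy values, not taken from the patent):

```python
import numpy as np

def viterbi(log_lik, log_P, log_init):
    """Standard Viterbi decoding: log_lik[t, s] is the (enhanced) log likelihood of
    state s at frame t, log_P the log transition matrix; returns the best state path."""
    T, S = log_lik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_P      # scores[i, j]: come from i, go to j
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_lik[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                   # backtrack through the pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy run over the three pitch-count states (0, 1, or 2 fundamentals per frame).
log_lik = np.log(np.array([[0.8, 0.1, 0.1],
                           [0.2, 0.7, 0.1],
                           [0.1, 0.7, 0.2],
                           [0.1, 0.2, 0.7]]))
log_P = np.log(np.array([[0.7, 0.25, 0.05],
                         [0.2, 0.6, 0.2],
                         [0.05, 0.35, 0.6]]))
log_init = np.log(np.array([0.6, 0.3, 0.1]))
print(viterbi(log_lik, log_P, log_init))   # [0, 1, 1, 2]
```

In the patent's setting the decoded state per frame determines how many fundamental frequencies are present, and the winning candidate pitch periods within each state form the extracted multi-pitch contours.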
Further modifications and variations of the present invention will be apparent to those skilled in the art from this specification. The specification is therefore to be regarded as illustrative, its purpose being to teach those skilled in the art the general manner of carrying out the present invention, and the forms of the invention shown and described herein are to be taken as the presently preferred embodiments.
Claims (8)
1. A multi-fundamental-frequency extraction method based on empirical mode decomposition and a hidden Markov model, characterized by comprising the following steps:
Step 1: filter the speech signal with an auditory filter bank and frame the filtered signal, obtaining the two-dimensional time-frequency representation of the speech signal, the auditory spectrum;
Step 2: compute an autocorrelation function in each time-frequency unit of the auditory spectrum;
Step 3: apply empirical mode decomposition to the autocorrelation function of each time-frequency unit, and compute the instantaneous frequency of the dominant sound source in each time-frequency unit from the intrinsic mode functions obtained by the decomposition;
Step 4: compute a frequency matching function from each instantaneous frequency;
Step 5: build the likelihood probability of each fundamental frequency state from the frequency matching function, and use a corpus to estimate the transition probabilities between fundamental frequency states; the fundamental frequency states include single-fundamental-frequency states and double-fundamental-frequency states;
Step 6: enhance the likelihood probability of each fundamental frequency state, combine the enhanced likelihood probabilities with the corresponding transition probabilities, and use the hidden Markov model to extract the multiple pitch contours of the speech signal.
2. the method for claim 1, it is characterized in that, auditoiy filterbank is utilized to carry out filtering to voice signal in step 1, the output of each wave filter in described auditoiy filterbank is the time-domain signal identical with described voice signal length, to the output windowing sub-frame processing of described each wave filter, the two-dimentional time-frequency obtaining described voice signal is expressed.
3. the method for claim 1, is characterized in that, calculates described in step 2 at the autocorrelation function of each time frequency unit of sense of hearing spectrum by following formula:
Wherein, h (c) is the output of respective filter in described auditoiy filterbank in c filter channel, and m is speech frame sequence number, and n represents discrete time point, and τ is some time delay.
4. the method for claim 1, is characterized in that, on the autocorrelation function of each time frequency unit, carry out empirical mode decomposition described in step 3, step comprises:
Utilize Hilbert-Huang transform that described autocorrelation function is decomposed into a series of essential mode function, and according to auditory masking effect, the instantaneous frequency of frequency leading sound source in this time frequency unit of the essential mode function that first is decomposited.
5. the method for claim 1, is characterized in that, the function of frequency matching described in step 4 is for the degree of the average instantaneous frequency and each candidate pitch frequency matching that describe current time frequency unit, and its computing formula is as follows:
Wherein,
represent the average instantaneous frequency being positioned at the time frequency unit of m frame c passage, τ represents the pitch period of candidate, and int () is bracket function, returns nearest round values.
6. The method of claim 1, characterized in that step 5 specifically comprises:
First, constructing the likelihood probability of each fundamental-frequency state on the basis of the frequency matching function; in the time-frequency unit u(c, m) of channel c of frame m, the likelihood probability of a single pitch period τ1 is as follows:
The likelihood probability of two pitch periods τ1 and τ2 existing simultaneously in the time-frequency unit u(c, m) is as follows:
wherein x denotes the speech signal; ω1 and ω2 denote the single-fundamental-frequency state and the double-fundamental-frequency state, respectively; L(c, m) is the normalized loudness of each time-frequency unit; Φc is the set of channel positions in the two-dimensional time-frequency representation; and F(c, m, τ1) is the frequency matching function;
In the above formulas, the normalized loudness L(c, m) of each time-frequency unit is computed as follows:
wherein E(c, m) denotes the energy of the time-frequency unit u(c, m), and N is the number of channels in the filter bank;
Secondly, the fundamental-frequency state of each frame lies in one of three spaces, namely the zero-fundamental-frequency, single-fundamental-frequency, and double-fundamental-frequency spaces:
Ω = Ω0 ∪ Ω1 ∪ Ω2
The transition probabilities among the three fundamental-frequency states are obtained by statistics over a database annotated with fundamental frequencies:
wherein Ωi is a fundamental-frequency state space, and pij denotes the transition probability from state space Ωi to state space Ωj.
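The transition statistics described at the end of claim 6 amount to counting frame-to-frame state changes in an annotated database and normalizing each row. A minimal sketch, with a toy label sequence standing in for real fundamental-frequency annotations:

```python
# Hedged sketch: estimating the 3x3 transition matrix p_ij between the
# zero-, single- and double-fundamental-frequency state spaces by counting
# consecutive-frame transitions in an annotated corpus. The label sequence
# below is a toy stand-in for real per-frame annotations.
from collections import Counter

STATES = (0, 1, 2)  # number of simultaneous fundamentals in a frame

def estimate_transitions(label_seq):
    counts = Counter(zip(label_seq, label_seq[1:]))
    matrix = []
    for i in STATES:
        row_total = sum(counts[(i, j)] for j in STATES)
        matrix.append([counts[(i, j)] / row_total if row_total else 0.0
                       for j in STATES])
    return matrix

labels = [0, 0, 1, 1, 1, 2, 2, 1, 0, 0, 1, 1, 2, 2, 2, 1]
P = estimate_transitions(labels)
# Each row of P sums to 1 (a proper stochastic matrix).
print([round(sum(row), 6) for row in P])
```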
7. The method of claim 1, characterized in that enhancing the likelihood probability of each fundamental-frequency state in step 6 comprises the following concrete steps:
First, enhancing the single-fundamental-frequency likelihood function according to the formula:
wherein m ranges from 2 to 7, indicating that the enhancement counteracts only the period-doubling errors caused by harmonics of order 2 to 7; α is a predetermined coefficient; p(x|{τ1}) denotes the single-fundamental-frequency likelihood probability of observing the current speech signal x when the pitch period is τ1; and p_en(x|{τ1}) is the enhanced single-fundamental-frequency likelihood probability;
Secondly, enhancing the likelihood function of the double-fundamental-frequency state; to this end, the likelihood probability of the double-fundamental-frequency state is first written as the sum of two functions p(x|{τ1}) and p_r(τ1, τ2):
g(x|{τ1, τ2}) = p(x|{τ1}) + p_r(τ1, τ2)    (8)
wherein p(x|{τ1}) is the likelihood probability of the single-fundamental-frequency state, F(c, m, τi) is the frequency matching function, L(c, m) is the normalized loudness of each time-frequency unit, and c is the channel index;
Then, the two functions p(x|{τ1}) and p_r(τ1, τ2) are each enhanced by the method of formula (7), yielding the likelihood probability of the double-fundamental-frequency state:
g_en(x|{τ1, τ2}) = p_en(x|{τ1}) + p_r_en(τ1, τ2)    (10)
wherein g_en(x|{τ1, τ2}) is the enhanced likelihood probability of the double-fundamental-frequency state, and p_en(x|{τ1}) and p_r_en(τ1, τ2) are the values of p(x|{τ1}) and p_r(τ1, τ2) after enhancement, respectively.
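Formula (7) itself is not reproduced in this text, so the sketch below only illustrates the shape of the enhancement step in claim 7 under a stated assumption: likelihood mass at integer multiples m = 2..7 of a candidate period (period-doubling errors) is folded back onto that candidate, weighted by the coefficient α, and the double-fundamental likelihood is then recombined as in formula (10). The fold-back rule is an assumption, not the patent's formula.

```python
# Hedged sketch of the enhancement in claim 7. ASSUMPTION: enhancement adds
# alpha-weighted likelihood mass from integer multiples m = 2..7 of each
# candidate pitch period (octave/doubling errors) back onto the candidate.
def enhance(p, alpha=0.5):
    """p: dict mapping candidate pitch period (samples) -> likelihood."""
    out = {}
    for tau, val in p.items():
        folded = sum(p.get(m * tau, 0.0) for m in range(2, 8))
        out[tau] = val + alpha * folded
    return out

def double_state_likelihood(p_single, p_r, alpha=0.5):
    """g_en(x|{tau1, tau2}) = p_en(x|{tau1}) + p_r_en(tau1, tau2), as in (10)."""
    p_en = enhance(p_single, alpha)
    p_r_en = enhance(p_r, alpha)
    return {tau: p_en.get(tau, 0.0) + p_r_en.get(tau, 0.0)
            for tau in set(p_en) | set(p_r_en)}

p_single = {40: 0.6, 80: 0.3, 120: 0.1}   # 80 and 120 are multiples of 40
p_en = enhance(p_single)
# Mass at the doubled/tripled periods is folded onto tau = 40:
print(round(p_en[40], 3))   # 0.6 + 0.5 * (0.3 + 0.1) = 0.8
```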
8. A multi-fundamental-frequency extraction device based on empirical mode decomposition and a hidden Markov model, characterized by comprising:
a preprocessing module, which filters the speech signal through an auditory filter bank and frames the filtered signal, obtaining the two-dimensional time-frequency representation and the auditory spectrum of the speech signal;
an autocorrelation function computation module, which computes the autocorrelation function in each time-frequency unit of the auditory spectrum;
an instantaneous frequency computation module, which performs empirical mode decomposition on the autocorrelation function of each time-frequency unit, and computes the instantaneous frequency of the dominant sound source of each time-frequency unit on the basis of the intrinsic mode functions obtained by the decomposition;
a frequency matching function computation module, which computes the frequency matching function on the basis of each instantaneous frequency;
a likelihood and transition probability computation module, which constructs the likelihood probability of each fundamental-frequency state with the frequency matching function, and obtains by statistics over a corpus the transition probabilities between the fundamental-frequency states and between fundamental-frequency values; each fundamental-frequency state comprises a single-fundamental-frequency state and a double-fundamental-frequency state;
a trajectory extraction module, which enhances the likelihood probability of each fundamental-frequency state, combines the enhanced likelihood probabilities with the corresponding transition probabilities, and extracts the multiple pitch contours of the speech signal using the hidden Markov model.
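The module chain of claim 8 can be sketched structurally as below. This is a skeleton only: the bodies are deliberately thin placeholders (the channel "filtering" merely replicates frames, and the EMD, matching, enhancement and HMM stages are elided), since the patent's actual modules would use a gammatone auditory filter bank and a Viterbi decoder over the fundamental-frequency states.

```python
# Hedged structural sketch of the claimed device: one function per module,
# chained in the claimed order. Placeholder bodies, not a real implementation.
import numpy as np

def preprocess(signal, n_channels=4, frame_len=160):
    """Filter bank + framing -> units[channel][frame].
    Placeholder: replicates frames per channel instead of band-filtering."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return [[frames[m] for m in range(n_frames)] for _ in range(n_channels)]

def autocorrelation(unit):
    """Autocorrelation of one time-frequency unit, non-negative lags only."""
    ac = np.correlate(unit, unit, mode="full")
    return ac[len(unit) - 1:]

def extract_contours(signal):
    units = preprocess(signal)
    acs = [[autocorrelation(u) for u in channel] for channel in units]
    # ... EMD, frequency matching, likelihood enhancement, HMM decoding ...
    return acs

fs = 8000
t = np.arange(fs) / fs
acs = extract_contours(np.sin(2 * np.pi * 100 * t))
print(len(acs), len(acs[0]))   # channels x frames
```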
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511023725.3A CN105469807B (en) | 2015-12-30 | 2015-12-30 | A kind of more fundamental frequency extracting methods and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105469807A true CN105469807A (en) | 2016-04-06 |
CN105469807B CN105469807B (en) | 2019-04-02 |
Family
ID=55607432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511023725.3A Expired - Fee Related CN105469807B (en) | 2015-12-30 | 2015-12-30 | A kind of more fundamental frequency extracting methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105469807B (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7092881B1 (en) * | 1999-07-26 | 2006-08-15 | Lucent Technologies Inc. | Parametric speech codec for representing synthetic speech in the presence of background noise |
JP2001282267A (en) * | 2000-03-29 | 2001-10-12 | Mega Chips Corp | Speech processing system and speech processing method |
CN101567188A (en) * | 2009-04-30 | 2009-10-28 | 上海大学 | Multi-pitch estimation method for mixed audio signals with combined long frame and short frame |
CN104036785A (en) * | 2013-03-07 | 2014-09-10 | 索尼公司 | Speech signal processing method, speech signal processing device and speech signal analyzing system |
Non-Patent Citations (3)
Title |
---|
YANG SHAO,DELIANG WANG: "Co-channel speaker identification using usable speech extraction based on multi-pitch tracking", 《IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS SPEECH AND SIGNAL PROCESSING》 * |
李仕涛: "Research on Multi-Pitch Detection Algorithms" (in Chinese), 《China Masters' Theses Full-text Database》 * |
李鹏, 关勇, 刘文举, 徐波: "Monaural Mixed Speech Separation Based on Multi-Pitch Tracking" (in Chinese), 《Application Research of Computers》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107316653B (en) * | 2016-04-27 | 2020-06-26 | 南京理工大学 | Improved empirical wavelet transform-based fundamental frequency detection method |
CN107316653A (en) * | 2016-04-27 | 2017-11-03 | 南京理工大学 | A kind of fundamental detection method based on improved experience wavelet transformation |
CN106205638B (en) * | 2016-06-16 | 2019-11-08 | 清华大学 | A kind of double-deck fundamental tone feature extracting method towards audio event detection |
CN106205638A (en) * | 2016-06-16 | 2016-12-07 | 清华大学 | A kind of double-deck fundamental tone feature extracting method towards audio event detection |
CN106448630A (en) * | 2016-09-09 | 2017-02-22 | 腾讯科技(深圳)有限公司 | Method and device for generating digital music file of song |
US10923089B2 (en) | 2016-09-09 | 2021-02-16 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating digital score file of song, and storage medium |
CN106448630B (en) * | 2016-09-09 | 2020-08-04 | 腾讯科技(深圳)有限公司 | Method and device for generating digital music score file of song |
CN111048110A (en) * | 2018-10-15 | 2020-04-21 | 杭州网易云音乐科技有限公司 | Musical instrument identification method, medium, device and computing equipment |
CN109036376A (en) * | 2018-10-17 | 2018-12-18 | 南京理工大学 | A kind of the south of Fujian Province language phoneme synthesizing method |
CN109839272A (en) * | 2019-03-25 | 2019-06-04 | 湖南工业大学 | It is extracted and the average Method for Bearing Fault Diagnosis of auto-correlated population based on failure impact |
CN109839272B (en) * | 2019-03-25 | 2021-01-08 | 湖南工业大学 | Bearing fault diagnosis method based on fault impact extraction and self-correlation ensemble averaging |
CN111312258A (en) * | 2019-12-16 | 2020-06-19 | 随手(北京)信息技术有限公司 | User identity authentication method, device, server and storage medium |
CN114897236A * | 2022-05-09 | 2022-08-12 | 中南大学 | Hidden Markov inference method for magma channel entrance under survey data constraint |
CN114897236B (en) * | 2022-05-09 | 2024-06-07 | 中南大学 | Hidden Markov inference method for magma channel entrance under investigation data constraint |
Also Published As
Publication number | Publication date |
---|---|
CN105469807B (en) | 2019-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105469807A (en) | Multi-fundamental frequency extraction method and multi-fundamental frequency extraction device | |
US11030998B2 (en) | Acoustic model training method, speech recognition method, apparatus, device and medium | |
CN109147796B (en) | Speech recognition method, device, computer equipment and computer readable storage medium | |
Wang et al. | Channel pattern noise based playback attack detection algorithm for speaker recognition | |
CN104835498B (en) | Method for recognizing sound-groove based on polymorphic type assemblage characteristic parameter | |
Hu et al. | Pitch‐based gender identification with two‐stage classification | |
CN108520753B (en) | Voice lie detection method based on convolution bidirectional long-time and short-time memory network | |
Mitra et al. | Medium-duration modulation cepstral feature for robust speech recognition | |
Umesh et al. | Scale transform in speech analysis | |
CN104900235A (en) | Voiceprint recognition method based on pitch period mixed characteristic parameters | |
Dua et al. | Performance evaluation of Hindi speech recognition system using optimized filterbanks | |
Wanli et al. | The research of feature extraction based on MFCC for speaker recognition | |
CN103077728B (en) | A kind of patient's weak voice endpoint detection method | |
CN102436809A (en) | Network speech recognition method in English oral language machine examination system | |
Müller et al. | Contextual invariant-integration features for improved speaker-independent speech recognition | |
CN106373559A (en) | Robustness feature extraction method based on logarithmic spectrum noise-to-signal weighting | |
CN111798846A (en) | Voice command word recognition method and device, conference terminal and conference terminal system | |
GROZDIĆ et al. | Comparison of Cepstral Normalization Techniques in Whispered Speech Recognition. | |
Adam et al. | Wavelet cesptral coefficients for isolated speech recognition | |
Singhal et al. | Automatic speech recognition for connected words using DTW/HMM for English/Hindi languages | |
CN104064197A (en) | Method for improving speech recognition robustness on basis of dynamic information among speech frames | |
Patel et al. | Development and implementation of algorithm for speaker recognition for gujarati language | |
Zouhir et al. | Speech Signals Parameterization Based on Auditory Filter Modeling | |
Bharali et al. | Zero crossing rate and short term energy as a cue for sex detection with reference to Assamese vowels | |
Seman et al. | Evaluating endpoint detection algorithms for isolated word from Malay parliamentary speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant ||
CF01 | Termination of patent right due to non-payment of annual fee ||
Granted publication date: 20190402 Termination date: 20211230 |