CN101853661B

CN101853661B - Noise spectrum estimation and voice activity detection method based on unsupervised learning

Info

Publication number: CN101853661B
Application number: CN2010101781664A
Authority: CN
Inventors: 应冬文; 颜永红; 付强; 潘接林
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2010-05-14
Filing date: 2010-05-14
Publication date: 2012-05-30
Anticipated expiration: 2030-05-14
Also published as: CN101853661A

Abstract

The present invention relates to a noise power spectrum estimation and voice activity detection method based on unsupervised learning, comprising the following steps: 1) for the log-magnitude feature of the speech signal at each frequency bin, establish a GMM; 2) for a segment of speech data, set up an M-frame buffer, store the first M frames of the input signal in the buffer, extract the log-magnitude spectra of the M buffered frames, and substitute them into the GMM of step 1) for initialization, obtaining the initialized model λ_{0,k}; 3) after obtaining the initialized model λ_{0,k}, starting from frame M+1, update the GMM frame by frame using incremental learning, recursively obtaining λ_{i,k}, and from it the noise estimate μ_{i,k}^{(0)} and the probability that the speech signal is present at the k-th frequency bin of the i-th frame.

The present invention is a tightly coupled solution for noise power spectrum estimation and voice activity detection, and can enhance the adaptability of speech application systems to noise environments. It does not depend on the "noise-start" assumption, and it also provides a description of voice activity in the two-dimensional time-frequency space.

Description

Noise spectrum estimation and voice activity detection method based on unsupervised learning

Technical field

The present invention relates to the field of speech processing technology, and in particular to a noise power spectrum estimation and voice activity detection method based on unsupervised learning. Voice activity detection is an algorithm that determines, along the time axis, whether speech is present; it can give a binary "yes" or "no" answer, or it can describe the presence of speech by a speech presence probability.

Background art

Most speech application systems must cope with interference from ambient noise. Many methods have been proposed to remove the effect of noise on speech systems, and almost all of them rely on voice activity detection and noise power spectrum estimation. These two modules are closely linked, and their accuracy directly determines the overall noise robustness of the system. Traditional solutions have the following problems:

1. In typical noise-robust algorithms, voice activity detection and noise power spectrum estimation are loosely coupled in cascade: voice activity is computed first, and the noise power spectrum is then estimated from it. The sensitivity of the voice activity detector to the speech signal therefore directly affects the accuracy of the noise power spectrum estimate.

If the voice activity detector is too sensitive, the noise power spectrum tends to be underestimated; conversely, if it is too insensitive, the noise power spectrum tends to be overestimated. Traditional schemes therefore usually have to tune the detector's sensitivity to the noise environment, which limits the system's adaptability to changing noise conditions.

2. Traditional solutions are based on semi-supervised learning. In the initial phase, the system generally has to make the "noise-start" assumption, i.e., it assumes that every utterance begins with a segment of non-speech signal.

This non-speech segment can be regarded as manually labeled ambient-noise samples, and the initial noise model is built from these labeled samples; this is a supervised learning method. Its drawback is that the assumption is hard to satisfy in some applications. For example, when an utterance begins with speech, the noise model is initialized incorrectly, which in turn makes both speech detection and noise power spectrum estimation inaccurate. In the subsequent phase, after the initial noise model has been built, traditional solutions mostly update the model using their own detection and estimation results; this learning method is decision-directed and is a form of unsupervised learning.

Decision-directed learning feeds the output of the estimator/detector back to update the model. However, it easily feeds incorrect results back into the model, which degrades the model's accuracy and in turn degrades the accuracy of estimation/detection. Errors thus accumulate over time, and system performance gradually declines. Supervised learning in the initial phase combined with unsupervised learning in the subsequent phase forms a semi-supervised learning process, and the problems of both phases stem from this semi-supervised approach.

3. Most conventional voice activity detectors describe voice activity only along the time dimension and lack a description along the frequency dimension, so they cannot support further refined processing of the noise.

Summary of the invention

Addressing the shortcomings of conventional voice activity detectors and noise power spectrum estimators, the present invention proposes a tightly coupled solution in which voice activity detection and noise power spectrum estimation are unified within an unsupervised learning framework, thereby strengthening the adaptability of speech application systems to noise environments. In addition, the invention does not depend on the "noise-start" assumption, which makes it more practical than traditional methods; it also describes voice activity in the time-frequency space, which supports further refined processing of the noise.

To achieve the above objects, the present invention provides a noise power spectrum estimation and voice activity detection method based on unsupervised learning which, as shown in Fig. 2, comprises the following steps:

1) For the log-magnitude feature of the speech signal at each frequency bin, establish a GMM, expressed mathematically as follows:

p (x_{i, k} | λ_{i, k}) = w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k});

where each Gaussian component of the GMM is expressed as:

p (x_{i, k} | h, λ_{i, k}) = \frac{1}{\sqrt{2 π κ_{i, k}^{(h)}}} \exp {- \frac{{(x_{i, k} - μ_{i, k}^{(h)})}^{2}}{2 κ_{i, k}^{(h)}}},

where x_{i, k} denotes the log-magnitude spectrum at the k-th frequency bin of the i-th frame and h ∈ {0, 1} labels the Gaussian component; w_{i, k}^{(h)} denotes the GMM weight coefficient, and μ_{i, k}^{(h)} and κ_{i, k}^{(h)} denote the mean and variance, respectively; h = 1 denotes the speech component and h = 0 the noise component;

λ_{i, k} = {{μ}_{i, k}^{(1)}, μ_{i, k}^{(0)}, κ_{i, k}^{(1)}, κ_{i, k}^{(0)}, w_{i, k}^{(1)}, w_{i, k}^{(0)}}

denotes the parameter set of the Gaussian mixture model;

2) For a segment of speech data, set up an M-frame buffer, store the first M frames of the input signal in the buffer, extract the log-magnitude spectra of the M buffered frames, and substitute them into the GMM of step 1) for initialization, obtaining the initialized model λ_{0, k}; the initialization uses a constrained EM algorithm;

3) After obtaining the initialized model λ_{0, k}, starting from frame M+1, update the GMM frame by frame using incremental learning, recursively obtaining

λ_{i, k} = {{μ}_{i, k}^{(1)}, μ_{i, k}^{(0)}, κ_{i, k}^{(1)}, κ_{i, k}^{(0)}, w_{i, k}^{(1)}, w_{i, k}^{(0)}},

and thereby the noise estimate μ_{i, k}^{(0)} and the probability that the speech signal is present at the k-th frequency bin of the i-th frame:

p (h = 1 | x_{i, k}, λ_{i, k}) = \frac{w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}{w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})},

where i = 1, 2, 3, ...

The incremental learning algorithm comprises recursive updates of the weight coefficients, the means and the variances;

The weight coefficients are updated recursively as:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k});

The means are updated recursively as:

μ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} μ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k}}{w_{i + 1, k}^{(h)}};

Or

μ_{i + 1, k}^{(h)} = α_{μ} μ_{i, k}^{(h)} + (1 - α_{μ}) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k};

The variances are updated recursively as:

κ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} κ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2}}{w_{i + 1, k}^{(h)}};

Or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2};

Or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i, k}^{(h)})}^{2};

where α, α_μ and α_κ are smoothing factors less than 1 and close to 1.
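The three recursions above can be sketched in Python for a single frequency bin, using the first, weight-normalized form for the mean and the variance. This is a minimal sketch, assuming the posterior is evaluated with the old model λ_{i, k} as the formulas state. The `Component` dataclass and the function names are hypothetical, and the weight and variance constraints are not applied here.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """One Gaussian of the two-component model (hypothetical container)."""
    w: float    # weight w^{(h)}
    mu: float   # mean mu^{(h)} of the log-magnitude (dB)
    var: float  # variance kappa^{(h)}


def log_gaussian(x: float, mu: float, var: float) -> float:
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


def posteriors(x: float, comps: list) -> list:
    """p(h | x, lambda) for each component, computed in the log domain."""
    logs = [math.log(c.w) + log_gaussian(x, c.mu, c.var) for c in comps]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def incremental_update(x_next: float, comps: list, alpha: float = 0.99) -> list:
    """One step lambda_{i,k} -> lambda_{i+1,k} with the weight-normalized recursions."""
    post = posteriors(x_next, comps)  # evaluated with the old model lambda_{i,k}
    updated = []
    for c, p in zip(comps, post):
        w_new = alpha * c.w + (1.0 - alpha) * p
        mu_new = (alpha * c.w * c.mu + (1.0 - alpha) * p * x_next) / w_new
        # the variance uses the freshly updated mean mu_{i+1,k}
        var_new = (alpha * c.w * c.var + (1.0 - alpha) * p * (x_next - mu_new) ** 2) / w_new
        updated.append(Component(w_new, mu_new, var_new))
    return updated


# Usage: a noise component starting at 25 dB tracks a steady 20 dB noise floor.
model = [Component(0.5, 25.0, 9.0), Component(0.5, 45.0, 36.0)]  # h = 0 noise, h = 1 speech
for _ in range(500):
    model = incremental_update(20.0, model)
```

The posteriors sum to 1, and each new weight is α w + (1 − α) p. As a result, the two weights stay normalized without an explicit renormalization step, provided they summed to 1 initially.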

Compared with the prior art, the present invention has the following technical effects:

The present invention is a tightly coupled scheme for voice activity detection and noise power spectrum estimation, which strengthens the adaptability of speech application systems to noise environments. In addition, it does not depend on the "noise-start" assumption and is therefore more practical. It also describes voice activity in the two-dimensional time-frequency space, which supports further refined processing of the noise.

Brief description of the drawings

Fig. 1 shows the time-domain waveform and spectrogram of a noise-corrupted speech segment;

Panel (a) is the spectrogram of a speech segment corrupted by white noise at an SNR of 0 dB; panel (b) is the speech presence probability map, in which the gray level indicates the probability that speech is present. Comparing (a) and (b) shows that the presence probability output by this method accurately describes the structure of the spectrogram.

Fig. 2 is a flow chart of the noise power spectrum estimation and voice activity detection method based on unsupervised learning according to the present invention.

Embodiment

The present invention proposes a noise power spectrum estimation and voice activity detection method based on an unsupervised learning framework. The key feature of the framework is that the noise and speech models are built in an unsupervised manner: neither model initialization nor model updating relies on manually labeled information. Specifically, it has the following features:

● In the initialization phase, it does not depend on the noise-start assumption, so it applies to a wider range of applications than conventional solutions.

● During updating, no feedback information is needed, so the problem of error accumulation is alleviated to some extent.

● Voice activity detection and noise estimation are tightly coupled: the system provides voice activity information and noise power spectrum information at the same time and can be tuned through only a few parameters. In a loosely coupled system, by contrast, the voice activity module and the noise estimation module each have their own tuning parameters, and with more parameters the system is harder to tune.

● Voice activity is described as two-dimensional "time-frequency" information, whereas other voice activity detection algorithms describe only the presence of speech along the time dimension.

In one embodiment, the unsupervised learning framework is carried by a two-component Gaussian mixture model (GMM). One component represents the distribution of speech energy and the other represents the distribution of noise energy. The present invention divides the frequency band into 8 subbands according to the mel scale, extracts the energy envelope on each subband, and establishes a corresponding GMM for each subband. The GMM is first initialized with the EM algorithm and then updated progressively by incremental learning. From the GMM, the voice activity on each subband and the noise power spectrum information are derived.
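The mel-scale subband split described above could look roughly like the sketch below. The text states only that 8 mel-scale subbands are used. Everything else here is an assumption made for illustration: the mel formula m = 2595 log10(1 + f/700), equal mel-width bands spanning 0 to fs/2, rectangular (non-overlapping) assignment of FFT bins to bands, the log floor, and the function names.

```python
import math


def hz_to_mel(f: float) -> float:
    # A common mel formula; an assumption, since the text does not give one.
    return 2595.0 * math.log10(1.0 + f / 700.0)


def subband_log_energy(power, fs: float, n_fft: int, n_bands: int = 8,
                       floor: float = 1e-12) -> list:
    """Sum |Y_k|^2 over equal-width mel bands and return 10*log10 of each band energy.

    `power` holds |Y_k|^2 for k = 0 .. n_fft // 2 (the non-negative frequencies).
    """
    top = hz_to_mel(fs / 2.0)
    energy = [0.0] * n_bands
    for k, p in enumerate(power):
        f = k * fs / n_fft
        band = min(int(n_bands * hz_to_mel(f) / top), n_bands - 1)
        energy[band] += p
    # the floor (hypothetical value) avoids log10(0) for silent bands
    return [10.0 * math.log10(max(e, floor)) for e in energy]


# Usage: a flat power spectrum at fs = 8 kHz with a 256-point FFT.
envelope = subband_log_energy([1.0] * 129, fs=8000.0, n_fft=256)
```

Equal mel widths correspond to wider Hz ranges at high frequencies, so the upper bands aggregate more FFT bins than the lower ones.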

The present invention fits the spectral envelope of speech with a GMM subject to constraints.

During fitting, the means, weights and variances of the GMM are each constrained. Both in the EM algorithm and during incremental learning, the following are required:

κ_{i + 1, k}^{(1)} = \max {κ_{i + 1, k}^{(0)}, κ_{i + 1, k}^{(1)}},

w_{i + 1, k}^{(1)} = \max {w_{i + 1, k}^{(1)}, ϵ},

And

w_{i + 1, k}^{(0)} = 1 - w_{i + 1, k}^{(1)} .
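Applied to one frequency bin, the three constraints amount to three operations: a variance ordering, a floor on the speech weight, and renormalization of the noise weight. This is a minimal sketch with a hypothetical function name. It assumes ϵ = 0.01, since the text gives no value for ϵ.

```python
def apply_constraints(w1: float, var0: float, var1: float, eps: float = 0.01):
    """Return (w0, w1, var0, var1) after the constraints on one frequency bin.

    eps is an assumed value for the weight floor epsilon.
    """
    var1 = max(var0, var1)   # speech variance is never smaller than noise variance
    w1 = max(w1, eps)        # floor on the speech weight
    w0 = 1.0 - w1            # the two weights sum to one
    return w0, w1, var0, var1


# A bin whose speech weight has collapsed and whose speech variance fell below the noise variance.
w0, w1, var0, var1 = apply_constraints(w1=0.001, var0=10.0, var1=5.0)
```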

The incremental learning algorithm for the GMM specifically comprises computing the recursive weight coefficients, recursive means and recursive variances.

1) Recursive weight coefficient:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) .

where α is a smoothing factor less than 1 but close to 1, e.g. α = 0.99.

2) Recursive mean:

μ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} μ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k}}{w_{i + 1, k}^{(h)}};

Or

μ_{i + 1, k}^{(h)} = α_{μ} μ_{i, k}^{(h)} + (1 - α_{μ}) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k} .

where α_μ is a smoothing factor less than 1 but close to 1, e.g. α_μ = 0.99.

3) Recursive variance:

κ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} κ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2}}{w_{i + 1, k}^{(h)}};

Or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2};

Or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i, k}^{(h)})}^{2} .

where α_κ is a smoothing factor less than 1 but close to 1, e.g. α_κ = 0.99.
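The alternative recursions with separate smoothing factors α_μ and α_κ can be sketched as follows. This is a literal transcription of the printed formulas for one component, with hypothetical function names. The `use_old_mean` flag selects the last variance variant, which (reading μ_{i+k} as μ_{i,k}) uses the previous mean.

Taken literally, this mean recursion multiplies x by the posterior without renormalizing. If the posterior is held constant at p, the mean therefore settles at p·x rather than x.

```python
def smooth_mean(mu: float, x: float, post: float, a_mu: float = 0.99) -> float:
    """mu_{i+1} = a_mu * mu_i + (1 - a_mu) * p(h | x) * x, as printed."""
    return a_mu * mu + (1.0 - a_mu) * post * x


def smooth_var(var: float, x: float, post: float, mu_new: float, mu_old: float,
               a_kappa: float = 0.99, use_old_mean: bool = False) -> float:
    """kappa_{i+1} = a_kappa * kappa_i + (1 - a_kappa) * p(h | x) * (x - mu)^2."""
    mu = mu_old if use_old_mean else mu_new
    return a_kappa * var + (1.0 - a_kappa) * post * (x - mu) ** 2


# One update of a noise component (posterior 1.0) observing x = 20 dB.
mu_old, var_old, x = 30.0, 9.0, 20.0
mu_new = smooth_mean(mu_old, x, post=1.0)
var_with_new_mean = smooth_var(var_old, x, 1.0, mu_new, mu_old)
var_with_old_mean = smooth_var(var_old, x, 1.0, mu_new, mu_old, use_old_mean=True)
```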

The present invention is further described below with reference to a preferred embodiment.

The principle of the present invention is as follows：

For the log-magnitude feature of the speech signal at each frequency bin, a Gaussian mixture model (GMM) is established; the model evolves over time as the input signal changes. The model is expressed mathematically as follows:

p (x_{i, k} | λ_{i, k}) = w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})

where each Gaussian component of the GMM is expressed as:

p (x_{i, k} | h, λ_{i, k}) = \frac{1}{\sqrt{2 π κ_{i, k}^{(h)}}} \exp {- \frac{{(x_{i, k} - μ_{i, k}^{(h)})}^{2}}{2 κ_{i, k}^{(h)}}}

Here x_{i, k} denotes the log-magnitude spectrum at the k-th frequency bin of the i-th frame, and h ∈ {0, 1} denotes the class of the Gaussian component; w_{i, k}^{(h)} denotes the GMM weight coefficient, and μ_{i, k}^{(h)} and κ_{i, k}^{(h)} denote the mean and variance, respectively. h = 1 denotes the speech component and h = 0 the noise component.

λ_{i, k} = {{μ}_{i, k}^{(1)}, μ_{i, k}^{(0)}, κ_{i, k}^{(1)}, κ_{i, k}^{(0)}, w_{i, k}^{(1)}, w_{i, k}^{(0)}}

denotes the parameter set of the Gaussian mixture model.

In this model, μ_{i, k}^{(0)} is the noise to be estimated. In addition, the probability that the speech signal is present at the k-th frequency bin of the i-th frame can be derived:

p (h = 1 | x_{i, k}, λ_{i, k}) = \frac{w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}{w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}
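The model density and the speech presence probability above can be evaluated for one time-frequency point with the short Python sketch below. It assumes the parameters are passed as plain floats, and the argument names are hypothetical. It computes the posterior in the log domain, a numerical-safety choice not described in the text, so that very small Gaussian densities do not collapse to 0/0.

```python
import math


def log_gaussian(x: float, mu: float, var: float) -> float:
    """log N(x; mu, var), where var is the variance kappa."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


def mixture_density(x, w0, mu0, var0, w1, mu1, var1) -> float:
    """p(x | lambda) = w0 N(x; mu0, var0) + w1 N(x; mu1, var1)."""
    return (w0 * math.exp(log_gaussian(x, mu0, var0))
            + w1 * math.exp(log_gaussian(x, mu1, var1)))


def speech_presence_prob(x, w0, mu0, var0, w1, mu1, var1) -> float:
    """p(h = 1 | x, lambda), written as a logistic function of the log-ratio d."""
    d = ((math.log(w0) + log_gaussian(x, mu0, var0))
         - (math.log(w1) + log_gaussian(x, mu1, var1)))
    if d > 0:                      # noise dominates: use exp(-d) to avoid overflow
        e = math.exp(-d)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(d))


# Noise around 20 dB, speech around 40 dB, equal weights and variances.
params = (0.5, 20.0, 9.0, 0.5, 40.0, 9.0)
p_mid = speech_presence_prob(30.0, *params)   # equidistant point
p_far = speech_presence_prob(1e4, *params)    # both densities underflow in the direct formula
```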

Based on the above principle, according to one embodiment of the present invention, the noise power spectrum estimation and voice activity detection method comprises the following steps, as shown in Fig. 2:

Step 100: Set up an M-frame buffer, store the first M frames of the input signal in the buffer, and extract the magnitude spectra of the M buffered frames. The magnitude spectrum of a frame is extracted as follows:

First pre-process the digitized sound signal of the frame (depending on the actual system, this may include windowing, pre-emphasis, etc.). If each frame is F samples long, zero-pad it to N samples (where N ≥ F, N = 2^j, j is an integer and j ≥ 8), then apply an N-point discrete Fourier transform to obtain the discrete spectrum

Y_{i, k} = Σ_{n = 0}^{N - 1} y_{i, n} e^{- \mathrm{j} 2 π n k / N},

where y_{i, n} denotes the n-th sample of the i-th frame in the buffer and Y_{i, k} denotes the k-th Fourier transform value of the i-th frame in the buffer (k = 0, 1, ..., N - 1). The log-magnitude is then computed as x_{i, k} = 20 log₁₀|Y_{i, k}|.
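Step 100 for a single frame can be sketched as below. This sketch makes several assumptions. Pre-processing (windowing, pre-emphasis) is assumed to have been applied already. A direct O(N²) DFT is used to keep the example self-contained, where a real system would use an FFT such as `numpy.fft.fft`. Finally, |Y| = 0 is clamped to a hypothetical `floor_db` value because 20 log10 0 is undefined. The function names are illustrative.

```python
import cmath
import math


def fft_size(frame_len: int, min_exp: int = 8) -> int:
    """Smallest N = 2**j with N >= frame_len and j >= min_exp."""
    j = max(min_exp, (frame_len - 1).bit_length())
    return 1 << j


def log_magnitude_spectrum(frame, floor_db: float = -200.0) -> list:
    """Zero-pad to N points, take an N-point DFT, return x_k = 20*log10|Y_k| for k = 0..N-1."""
    n_fft = fft_size(len(frame))
    y = list(frame) + [0.0] * (n_fft - len(frame))   # zero padding from F to N samples
    out = []
    for k in range(n_fft):
        y_k = sum(y[n] * cmath.exp(-2j * math.pi * n * k / n_fft) for n in range(n_fft))
        mag = abs(y_k)
        out.append(20.0 * math.log10(mag) if mag > 0.0 else floor_db)
    return out


# A 200-sample frame is padded to N = 256; a cosine at bin 8 peaks there.
frame = [math.cos(2.0 * math.pi * 8 * n / 256) for n in range(200)]
spectrum = log_magnitude_spectrum(frame)
```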

Step 200: GMM initialization. A two-component Gaussian mixture model λ_{i, k} is initialized at each frequency bin k, where the subscript i denotes time and λ_{i=0, k} denotes the initialized model. Initialization uses a constrained EM algorithm; at a given frequency bin k, the specific initialization steps are as follows:

Step 201: Divide the M+1 samples into two classes by a clustering method (e.g. LBG unsupervised clustering, or fuzzy clustering):

{x_{i_{j}, k}^{(1)} | j = 0,1, . . ., M_{1}}

and

{x_{i_{j}, k}^{(0)} | j = 0,1, . . ., M_{0}},

where M_0 + M_1 + 1 = M, since the two classes contain M_1 + 1 and M_0 + 1 samples and together hold all M + 1 samples. The class with the larger mean is labeled with superscript (1) and the other with superscript (0). The means of the two classes are

The mean of the lower-energy class is

where

The variances of the two classes are:

{\bar{κ}}_{0, k}^{(0)} = \frac{1}{M_{0} + 1} Σ_{j = 0}^{M_{0}} {(x_{i_{j}, k}^{(0)} - {\bar{μ}}_{0, k}^{(0)})}^{2},

{\bar{κ}}_{0, k}^{(1)} = \frac{1}{M_{1} + 1} Σ_{j = 0}^{M_{1}} {(x_{i_{j}, k}^{(1)} - {\bar{μ}}_{0, k}^{(1)})}^{2} .

The initial weight coefficients of the two classes are:

Compute the likelihood of the new model:

In the following iterations, the old model parameter set is denoted λ'_{0, k}, and the new model parameter set is:

{\bar{λ}}_{0, k} = {{\bar{μ}}_{0, k}^{(1)}, {\bar{μ}}_{0, k}^{(0)}, {\bar{κ}}_{0, k}^{(1)}, {\bar{κ}}_{0, k}^{(0)}, {\bar{w}}_{0, k}^{(1)}, {\bar{w}}_{0, k}^{(0)}} .

Before the iterations start, the old likelihood L'_k is set to a very small (large negative) number, e.g. L'_k = -10000. The iterations then proceed as follows.

Step 202: Compute the posterior probabilities of noise and speech:

p (h | x_{i, k}, λ_{0, k}^{'}) = \frac{w_{0, k}^{(h)} p (x_{i, k} | h, λ_{0, k}^{'})}{Σ_{h} w_{0, k}^{(h)} p (x_{i, k} | h, λ_{0, k}^{'})}, h ∈ {0,1};

Step 203: Compute the new weight coefficients:

{\bar{w}}_{0, k}^{(h)} = \frac{1}{M + 1} Σ_{j = 0}^{M} p (h | x_{j}, λ_{0, k}^{'});

Step 204: If

then stop iterating and set λ_{0, k} = λ'_{0, k}, where υ is a small positive number close to 0, e.g. υ = 0.05.

Step 205: Compute the new means:

{\bar{μ}}_{0, k}^{(h)} = \frac{Σ_{j = 0}^{M} x_{j} p (h | x_{j}, λ_{0, k}^{'})}{(M + 1) {\bar{w}}_{0, k}^{(h)}};

Step 206: Apply the constraint to the new means:

where δ is a constant in the range 1 to 10.

Step 207: Compute the new variances:

{\bar{κ}}_{0, k}^{(h)} = \frac{Σ_{j = 0}^{M} {(x_{j} - {\bar{μ}}_{0, k}^{(h)})}^{2} p (h | x_{j}, λ_{0, k}^{'})}{(M + 1) {\bar{w}}_{0, k}^{(h)}};

Step 208: Constrain the new variances: {\bar{κ}}_{0, k}^{(1)} = \max {{\bar{κ}}_{0, k}^{(0)}, {\bar{κ}}_{0, k}^{(1)}}.

Step 209: Compute the likelihood of the new model:

Step 210: If the condition

is satisfied, terminate the iteration, where ε is a small number, e.g. ε = 0.1. Otherwise, return to Step 202 and iterate again.
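Steps 201 to 210 for one frequency bin can be sketched as below. The sketch rests on several assumptions:

- A simple 1-D two-means clustering stands in for the LBG or fuzzy clustering of Step 201.
- The initial weights are taken as class fractions, and the likelihood as the mean log-likelihood per sample, because those formulas are not reproduced here.
- The stopping test of Step 204 and the mean constraint of Step 206 (with δ) are omitted for the same reason.
- The variance and weight constraints are the ones stated earlier for both EM and incremental learning, with an assumed weight floor `eps_w = 0.01` and a hypothetical variance floor `min_var`.

All function names are hypothetical.

```python
import math


def log_gauss(x, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


def two_means(samples, iters=50):
    """Step 201 stand-in: split 1-D samples into a low class (h=0) and a high class (h=1)."""
    c0, c1 = min(samples), max(samples)
    if c0 == c1:
        raise ValueError("need at least two distinct values")
    for _ in range(iters):
        low = [x for x in samples if abs(x - c0) <= abs(x - c1)]
        high = [x for x in samples if abs(x - c0) > abs(x - c1)]
        new_c0, new_c1 = sum(low) / len(low), sum(high) / len(high)
        if (new_c0, new_c1) == (c0, c1):
            break
        c0, c1 = new_c0, new_c1
    return low, high


def class_stats(groups, total, min_var):
    """[weight, mean, variance] per class; the variance divides by M_h + 1 as in the text."""
    params = []
    for g in groups:
        mu = sum(g) / len(g)
        var = max(sum((x - mu) ** 2 for x in g) / len(g), min_var)
        params.append([len(g) / total, mu, var])
    return params


def posterior_and_loglik(x, params):
    logs = [math.log(w) + log_gauss(x, mu, var) for w, mu, var in params]
    top = max(logs)
    s = sum(math.exp(l - top) for l in logs)
    return [math.exp(l - top) / s for l in logs], top + math.log(s)


def constrained_em(samples, eps_w=0.01, tol=0.1, max_iter=100, min_var=1e-3):
    n = len(samples)
    params = class_stats(two_means(samples), n, min_var)
    prev_ll = -10000.0                      # L'_k starts very low
    for _ in range(max_iter):
        posts = [posterior_and_loglik(x, params)[0] for x in samples]      # Step 202
        new = []
        for h in (0, 1):
            r = [p[h] for p in posts]
            mass = max(sum(r), 1e-12)       # equals (M + 1) * w^{(h)}
            mu = sum(ri * x for ri, x in zip(r, samples)) / mass                # Step 205
            var = sum(ri * (x - mu) ** 2 for ri, x in zip(r, samples)) / mass   # Step 207
            new.append([sum(r) / n, mu, max(var, min_var)])                     # Step 203
        new[1][2] = max(new[0][2], new[1][2])   # Step 208: speech variance >= noise variance
        new[1][0] = max(new[1][0], eps_w)       # floor on the speech weight
        new[0][0] = 1.0 - new[1][0]
        params = new
        ll = sum(posterior_and_loglik(x, params)[1] for x in samples) / n      # Step 209
        if ll - prev_ll < tol:                  # Step 210
            break
        prev_ll = ll
    return params
```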

Step 300: Progressive GMM updating. After the initialized model λ_{0, k} has been established, the GMM is updated frame by frame using incremental learning, starting from frame M+1. The recursion can be expressed as follows: at each frequency bin k, given λ_{i, k} and the current observation x_{i+1, k}, infer λ_{i+1, k}. Apply the Fourier transform to frame i+1 to obtain Y_{i+1, k}, where 0 ≤ k < N, and compute the log-magnitude spectrum x_{i+1, k} = 20 log₁₀|Y_{i+1, k}| at each frequency bin k. For the k-th frequency bin, the specific recursion steps are as follows:

Step 301: Compute the posterior probabilities of noise and speech:

p (h | x_{i + 1, k}, λ_{i, k}) = \frac{w_{i, k}^{(h)} p (x_{i + 1, k} | h, λ_{i, k})}{Σ_{h} w_{i, k}^{(h)} p (x_{i + 1, k} | h, λ_{i, k})},

h ∈ {0, 1}.

Step 302: Compute the new weight coefficients:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) .

where α is a smoothing factor less than 1 but close to 1, e.g. α = 0.99.

Step 303: Constrain the new weight coefficients: w_{i + 1, k}^{(1)} = \max {w_{i + 1, k}^{(1)}, ϵ} and w_{i + 1, k}^{(0)} = 1 - w_{i + 1, k}^{(1)}.

Step 304: Compute the new means:

μ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} μ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k}}{w_{i + 1, k}^{(h)}} .

Step 305: Apply the constraint to the new means:

Step 306: Compute the new variances:

κ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} κ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2}}{w_{i + 1, k}^{(h)}}

Step 307: Constrain the new variances: κ_{i + 1, k}^{(1)} = \max {κ_{i + 1, k}^{(0)}, κ_{i + 1, k}^{(1)}}.

The above sub-steps yield all parameters of λ_{i+1, k}. From these we obtain the corresponding speech presence probability p(h = 1 | x_{i+1, k}, λ_{i, k}) and the noise power spectrum estimate μ_{i+1, k}^{(0)}.
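Steps 301 to 307 can be combined into a per-frame tracker over all frequency bins. For each bin, it returns the noise estimate μ^{(0)} and the speech presence probability. This is a sketch under several assumptions:

- The parameter layout, class name and method names are hypothetical.
- ϵ = 0.01 is an assumed weight floor.
- The mean constraint of Step 305 is omitted because its formula is not reproduced here.
- The mean and variance are normalized by the Step 302 weight before the Step 303 floor is applied. This keeps each update a convex combination. Dividing by the floored weight instead would pull the speech mean toward 0 whenever the floor is active.

```python
import math


def _log_gauss(x, mu, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


class OnlineNoiseTracker:
    """Per-bin two-component GMM updated frame by frame (hypothetical interface)."""

    def __init__(self, init_params, alpha=0.99, eps_w=0.01):
        # init_params: one [w0, mu0, var0, w1, mu1, var1] list per frequency bin
        self.params = [list(p) for p in init_params]
        self.alpha = alpha
        self.eps_w = eps_w

    def process(self, frame):
        """Consume x_{i+1,k} for all bins k; return (noise estimates, speech probabilities)."""
        a = self.alpha
        noise, speech = [], []
        for k, x in enumerate(frame):
            w0, mu0, v0, w1, mu1, v1 = self.params[k]
            # Step 301: p(h=1 | x_{i+1,k}, lambda_{i,k}) as a numerically stable logistic
            d = (math.log(w0) + _log_gauss(x, mu0, v0)) - (math.log(w1) + _log_gauss(x, mu1, v1))
            p1 = math.exp(-d) / (1.0 + math.exp(-d)) if d > 0 else 1.0 / (1.0 + math.exp(d))
            p0 = 1.0 - p1
            # Step 302: recursive weights (unconstrained values used as normalizers)
            s0 = a * w0 + (1.0 - a) * p0
            s1 = a * w1 + (1.0 - a) * p1
            # Steps 304 and 306: recursive means, then variances around the new means
            m0 = (a * w0 * mu0 + (1.0 - a) * p0 * x) / s0
            m1 = (a * w1 * mu1 + (1.0 - a) * p1 * x) / s1
            n0 = (a * w0 * v0 + (1.0 - a) * p0 * (x - m0) ** 2) / s0
            n1 = (a * w1 * v1 + (1.0 - a) * p1 * (x - m1) ** 2) / s1
            # Step 303: weight floor and renormalization; Step 307: variance ordering
            nw1 = max(s1, self.eps_w)
            self.params[k] = [1.0 - nw1, m0, n0, nw1, m1, max(n0, n1)]
            noise.append(m0)    # noise estimate mu^{(0)}_{i+1,k}
            speech.append(p1)   # p(h=1 | x_{i+1,k}, lambda_{i,k})
        return noise, speech


# Usage: one bin, noise mean initialized at 25 dB, fed a steady 20 dB floor.
tracker = OnlineNoiseTracker([[0.5, 25.0, 9.0, 0.5, 45.0, 36.0]])
for _ in range(800):
    noise, speech = tracker.process([20.0])
```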

The noise power spectrum estimation performance of the algorithm in the above embodiment was evaluated on 8 utterances each from male and female speakers in the TIMIT database. These were mixed with white Gaussian noise, F16 cockpit noise and babble noise from the NOISEX-92 noise database at signal-to-noise ratios of 0, 5 and 10 dB. The evaluation metric is the segmental spectral error, defined as follows:

SegError = \frac{1}{M} Σ_{l = 1}^{M} {10 \log_{10} Σ_{k = 0}^{N - 1} D^{2} (k, l) / Σ_{k = 0}^{N - 1} {[D (k, l) - \hat{D} (k, l)]}^{2}}

where D(k, l) denotes the actual noise magnitude spectrum and \hat{D}(k, l) denotes the estimated noise magnitude spectrum. A smaller SegError value indicates that the estimate is closer to the actual value, i.e., the estimate is more accurate. The algorithm is compared with three current mainstream noise power spectrum estimation algorithms: MS denotes the minimum statistics algorithm, MCRA the minima controlled recursive averaging algorithm, IMCRA the improved minima controlled recursive averaging algorithm, and TV-GMM the algorithm of the present invention. Table 1 lists the SegError results.

Table 1

As the table shows, the proposed algorithm has a clear advantage over all three current mainstream algorithms.
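The evaluation metric can be computed with the sketch below. Its inputs are per-frame true and estimated noise magnitude spectra as nested lists, an assumed data layout, and `seg_error` is a hypothetical name. The sketch follows the formula exactly as printed, with the true-noise energy in the numerator and the estimation-error energy in the denominator. Read that way, a more accurate estimate gives a larger value, and a frame with zero error is undefined.

```python
import math


def seg_error(true_frames, est_frames):
    """Average over frames l of 10*log10( sum_k D^2 / sum_k (D - D_hat)^2 ), as printed."""
    total = 0.0
    for d_frame, e_frame in zip(true_frames, est_frames):
        ref = sum(d * d for d in d_frame)
        err = sum((d - e) ** 2 for d, e in zip(d_frame, e_frame))
        if err == 0.0:
            raise ValueError("perfect estimate: the per-frame ratio is undefined")
        total += 10.0 * math.log10(ref / err)
    return total / len(true_frames)


# Two frames, each estimated 10% low: every frame contributes 10*log10(100) = 20 dB.
score = seg_error([[1.0, 1.0], [2.0, 2.0]], [[0.9, 0.9], [1.8, 1.8]])
```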

Claims

1. A noise power spectrum estimation and voice activity detection method based on unsupervised learning, comprising the following steps:

p (x_{i, k} | λ_{i, k}) = w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k});

where each Gaussian component of the GMM is expressed as:

p (x_{i, k} | h, λ_{i, k}) = \frac{1}{\sqrt{2 π κ_{i, k}^{(h)}}} \exp {- \frac{{(x_{i, k} - μ_{i, k}^{(h)})}^{2}}{2 κ_{i, k}^{(h)}}},

w_{i, k}^{(h)} denotes the GMM weight coefficient, and μ_{i, k}^{(h)} and κ_{i, k}^{(h)} denote the mean and variance, respectively, where h = 1 denotes the speech component and h = 0 the noise component;

λ_{i, k} = {{μ}_{i, k}^{(1)}, μ_{i, k}^{(0)}, κ_{i, k}^{(1)}, κ_{i, k}^{(0)}, w_{i, k}^{(1)}, w_{i, k}^{(0)}}

denotes the parameter set of the Gaussian mixture model;

3) after obtaining the initialized model λ_{0, k}, starting from frame M+1, update the GMM of each frequency band frame by frame using incremental learning, recursively obtaining λ_{i, k}, and thereby the noise estimate μ_{i, k}^{(0)} and the probability that the speech signal is present at the k-th frequency bin of the i-th frame:

p (h = 1 | x_{i, k}, λ_{i, k}) = \frac{w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})}{w_{i, k}^{(0)} p (x_{i, k} | h = 0, λ_{i, k}) + w_{i, k}^{(1)} p (x_{i, k} | h = 1, λ_{i, k})},

where i = 1, 2, 3, ...

2. The noise power spectrum estimation and voice activity detection method according to claim 1, characterized in that the incremental learning algorithm comprises recursive updates of the weight coefficients, the means and the variances;

The weight coefficients are updated recursively as:

w_{i + 1, k}^{(h)} = α w_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k});

The means are updated recursively as:

μ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} μ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k}}{w_{i + 1, k}^{(h)}};

Or

μ_{i + 1, k}^{(h)} = α_{μ} μ_{i, k}^{(h)} + (1 - α_{μ}) p (h | x_{i + 1, k}, λ_{i, k}) x_{i + 1, k};

The variances are updated recursively as:

κ_{i + 1, k}^{(h)} = \frac{α w_{i, k}^{(h)} κ_{i, k}^{(h)} + (1 - α) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2}}{w_{i + 1, k}^{(h)}};

Or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i + 1, k}^{(h)})}^{2};

Or

κ_{i + 1, k}^{(h)} = α_{κ} κ_{i, k}^{(h)} + (1 - α_{κ}) p (h | x_{i + 1, k}, λ_{i, k}) {(x_{i + 1, k} - μ_{i, k}^{(h)})}^{2};

where α, α_μ and α_κ are smoothing factors less than 1 and close to 1.