CN103714818A - Speaker recognition method based on noise-masking kernel - Google Patents

Speaker recognition method based on noise-masking kernel

Info

Publication number
CN103714818A
Authority
CN
China
Prior art keywords
GMM, noise, short, Gaussian, noise shielding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310681894.0A
Other languages
Chinese (zh)
Other versions
CN103714818B (en)
Inventor
张卫强 (Zhang Weiqiang)
刘加 (Liu Jia)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201310681894.0A
Publication of CN103714818A
Application granted
Publication of CN103714818B
Active legal status
Anticipated expiration

Abstract

The invention discloses a speaker recognition method based on a noise-masking kernel in the field of speech signal processing. The method comprises the following steps: step 1, inputting audio data and extracting short-time features from the audio data frame by frame; step 2, training a GMM with M Gaussian mixture components on the short-time features of speech data, denoted the speech GMM; step 3, training a GMM with N Gaussian mixture components on the short-time features of noise data, denoted the noise GMM; step 4, splicing the speech GMM and the noise GMM into a mixed GMM; step 5, using the mixed GMM to generate a noise-masking supervector; and step 6, carrying out SVM training and testing with the generated noise-masking supervector to complete speaker training and recognition. The method automatically masks the noise contained in the audio, is simple to implement, and effectively improves speaker recognition performance under noisy conditions.

Description

Speaker recognition method based on noise-masking kernel
Technical field
The invention belongs to the field of speech signal processing, and in particular relates to a speaker recognition method based on a noise-masking kernel.
Background technology
Speaker recognition technology identifies a speaker's identity from speech and is widely applied in fields such as remote identity authentication and information security. At present, in the speaker recognition field, GSV-SVM (support vector machine based on Gaussian mixture model mean supervectors) is a commonly used method: it first uses a UBM (universal background model) to generate a GSV (Gaussian mixture model mean supervector), and then uses an SVM (support vector machine) to perform speaker recognition. This method is easily affected by noise. To address this problem, speech enhancement is generally performed at the front end, or channel compensation techniques are adopted during modeling. However, these approaches all require extra modules to handle the noise and are relatively complicated to implement.
Summary of the invention
In view of the problems in the above prior art, the present invention proposes a speaker recognition method based on a noise-masking kernel, characterized in that the method specifically comprises the following steps:
Step 1: input audio data and extract short-time features from the audio data frame by frame;
Step 2: train a GMM with M Gaussian mixture components on the short-time features of speech data, denoted the speech GMM;
Step 3: train a GMM with N Gaussian mixture components on the short-time features of noise data, denoted the noise GMM;
Step 4: splice the speech GMM and the noise GMM into a mixed GMM;
Step 5: generate a noise-masking supervector with the mixed GMM;
Step 6: use the generated noise-masking supervector for SVM training and testing, completing speaker training and recognition.
In said step 1, the short-time features are short-time cepstral features, the type of which is linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), or perceptual linear prediction (PLP) coefficients.
In said step 1, the short-time features may also be short-time energy, short-time zero-crossing rate, or short-time autocorrelation coefficients.
In said steps 2 and 3, the GMM models are trained with the EM algorithm.
In said step 2, M is several hundred to several thousand; in said step 3, N is several tens to several hundred; and M is more than 10N.
In said step 4, the GMM splicing method is: let the speech GMM parameters be

$\{w_m^s, \mu_m^s, \Sigma_m^s\}, \quad m = 1, \dots, M$

and the noise GMM parameters be

$\{w_m^n, \mu_m^n, \Sigma_m^n\}, \quad m = 1, \dots, N$

where $w$ is a Gaussian mixture weight, $\mu$ is a Gaussian mixture mean vector, $\Sigma$ is a Gaussian mixture covariance matrix, the subscript $m$ indexes the mixture components, and the superscripts $s$ and $n$ denote speech and noise respectively. The parameters of the mixed GMM are:

$\{w_m, \mu_m, \Sigma_m\} = \begin{cases} \{\tfrac{1}{2} w_m^s,\ \mu_m^s,\ \Sigma_m^s\}, & m = 1, \dots, M \\ \{\tfrac{1}{2} w_{m-M}^n,\ \mu_{m-M}^n,\ \Sigma_{m-M}^n\}, & m = M+1, \dots, M+N \end{cases}$
In said step 5, the noise-masking supervector is generated by computing only the dimensions corresponding to the first M mixture components, masking the dimensions corresponding to noise.
In said step 5, the specific procedure for generating the noise-masking supervector is as follows:
Step 501: suppose the short-time cepstral features of an audio segment are $\{x_t,\ t = 1, \dots, T\}$, where $x_t$ is a frame feature vector, the subscript $t$ is the frame index, and $T$ is the total number of frames. Compute the posterior probability of each Gaussian mixture component frame by frame, for $t = 1, \dots, T$ and $m = 1, \dots, M$:

$\gamma_m(t) = \dfrac{w_m\, p_m(x_t)}{\sum_{m'=1}^{M+N} w_{m'}\, p_{m'}(x_t)}$

where $p_m(x_t)$ is the Gaussian probability density of the $m$-th mixture component, computed as

$p_m(x_t) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_m|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_t - \mu_m)^T \Sigma_m^{-1} (x_t - \mu_m) \right\}$

with $D$ the feature dimension;
Step 502: compute the updated mean vector of each Gaussian mixture component, for $m = 1, \dots, M$:

$\xi_m = \dfrac{\sum_{t=1}^{T} \gamma_m(t)\, x_t}{\sum_{t=1}^{T} \gamma_m(t)};$

Step 503: normalize each updated mean vector using the GMM weights and covariances, for $m = 1, \dots, M$:

$\xi_m' = w_m\, \Sigma_m^{-1/2}\, \xi_m;$

Step 504: concatenate the M normalized vectors to produce the noise-masking supervector:

$\zeta = \begin{bmatrix} \xi_1' \\ \xi_2' \\ \vdots \\ \xi_M' \end{bmatrix}$
The kernel function for SVM training and testing is a linear kernel.
The beneficial effects of the invention are: the noise-masking supervector automatically masks the noise contained in the audio, and the processing follows the framework of the GSV-SVM method, so the method is simple to implement. With this method, speaker recognition performance under noisy conditions can be effectively improved.
Brief description of the drawings
Fig. 1 is a flowchart of training the mixed GMM in the present invention;
Fig. 2 is a flowchart of generating the noise-masking supervector in the present invention.
Embodiment
The preferred embodiment is described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope or application of the invention.
Fig. 1 is a flowchart of training the mixed GMM provided by the invention. The method specifically comprises the following steps:
Step 1: extract short-time features from the audio data frame by frame.
The short-time features may be short-time cepstral features; the extraction methods (as described in speech signal processing textbooks) yield feature types such as linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), or perceptual linear prediction (PLP) coefficients.
The short-time features may also be short-time energy, short-time zero-crossing rate, short-time autocorrelation coefficients, and the like.
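As an illustration only, MFCC features of the kind mentioned above can be extracted frame by frame with an off-the-shelf library such as librosa; the library choice and the parameter values (sampling rate, frame length, number of coefficients) are assumptions for this sketch and are not specified by the patent.

```python
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Frame-by-frame MFCC extraction (illustrative parameters)."""
    y, sr = librosa.load(wav_path, sr=sr)
    # 25 ms frames with a 10 ms hop are common choices, not mandated by the patent.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    return mfcc.T  # shape (T, D): one short-time feature vector x_t per row
```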
Step 2: train a GMM with M Gaussian mixture components on the short-time features of speech data, denoted the speech GMM. M is generally several hundred to several thousand; typical values are 2048, 1024, and 512.
Step 3: train a GMM with N Gaussian mixture components on the short-time features of noise data, denoted the noise GMM. N is generally several tens to several hundred; typical values are 128, 64, and 32.
The GMM training method (as described in speech signal processing textbooks) is the EM (expectation-maximization) algorithm.
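A minimal sketch of steps 2 and 3, assuming scikit-learn's GaussianMixture (which fits GMMs by EM) with diagonal covariances; the component counts in the commented example follow the typical values listed above but are otherwise illustrative.

```python
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components):
    """Fit a diagonal-covariance GMM by EM on stacked frame features of shape (T, D)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                          max_iter=100, random_state=0)
    gmm.fit(features)
    return gmm

# Step 2: speech GMM with M components; step 3: noise GMM with N components, e.g.
# speech_gmm = train_gmm(speech_feats, n_components=1024)   # M
# noise_gmm  = train_gmm(noise_feats,  n_components=64)     # N
```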
Step 4: splice the speech GMM and the noise GMM into a mixed GMM. The specific splicing method is as follows: let the speech GMM parameters be $\{w_m^s, \mu_m^s, \Sigma_m^s\},\ m = 1, \dots, M$, and the noise GMM parameters be $\{w_m^n, \mu_m^n, \Sigma_m^n\},\ m = 1, \dots, N$, where $w$ is a Gaussian mixture weight, $\mu$ is a Gaussian mixture mean vector, $\Sigma$ is a Gaussian mixture covariance matrix, the subscript $m$ indexes the mixture components, and the superscripts $s$ and $n$ denote speech and noise respectively. The parameters of the mixed GMM are:

$\{w_m, \mu_m, \Sigma_m\} = \begin{cases} \{\tfrac{1}{2} w_m^s,\ \mu_m^s,\ \Sigma_m^s\}, & m = 1, \dots, M \\ \{\tfrac{1}{2} w_{m-M}^n,\ \mu_{m-M}^n,\ \Sigma_{m-M}^n\}, & m = M+1, \dots, M+N \end{cases}$
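A minimal sketch of the splicing in step 4, assuming the scikit-learn GMMs from the previous sketch with diagonal covariances: both weight sets are halved and the weights, means, and covariances are concatenated, with the speech components occupying the first M positions.

```python
import numpy as np

def splice_gmms(speech_gmm, noise_gmm):
    """Mixed-GMM parameters: speech components first, then noise, weights halved."""
    weights = np.concatenate([0.5 * speech_gmm.weights_,   # (1/2) w^s
                              0.5 * noise_gmm.weights_])   # (1/2) w^n
    means = np.concatenate([speech_gmm.means_, noise_gmm.means_], axis=0)
    covars = np.concatenate([speech_gmm.covariances_, noise_gmm.covariances_], axis=0)
    M = speech_gmm.n_components  # only these first M components enter the supervector
    return weights, means, covars, M
```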
Step 5: generate a Gaussian mixture model mean supervector with the mixed GMM, but compute only the dimensions corresponding to the first M mixture components, masking the dimensions corresponding to noise; the result is called the noise-masking supervector.
The specific flow of generating the noise-masking supervector is shown in Fig. 2 and comprises the following steps:
Step 501: suppose the short-time cepstral features of an audio segment are $\{x_t,\ t = 1, \dots, T\}$, where $x_t$ is a frame feature vector, the subscript $t$ is the frame index, and $T$ is the total number of frames. Compute the posterior probability of each Gaussian mixture component frame by frame, for $t = 1, \dots, T$ and $m = 1, \dots, M$:

$\gamma_m(t) = \dfrac{w_m\, p_m(x_t)}{\sum_{m'=1}^{M+N} w_{m'}\, p_{m'}(x_t)}$

where $p_m(x_t)$ is the Gaussian probability density of the $m$-th mixture component, computed as

$p_m(x_t) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_m|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_t - \mu_m)^T \Sigma_m^{-1} (x_t - \mu_m) \right\}$

with $D$ the feature dimension;
Step 502: compute the updated mean vector of each Gaussian mixture component, for $m = 1, \dots, M$:

$\xi_m = \dfrac{\sum_{t=1}^{T} \gamma_m(t)\, x_t}{\sum_{t=1}^{T} \gamma_m(t)};$

Step 503: normalize each updated mean vector using the GMM weights and covariances, for $m = 1, \dots, M$:

$\xi_m' = w_m\, \Sigma_m^{-1/2}\, \xi_m;$

Step 504: concatenate the M normalized vectors to produce the noise-masking supervector:

$\zeta = \begin{bmatrix} \xi_1' \\ \xi_2' \\ \vdots \\ \xi_M' \end{bmatrix}$
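A minimal sketch of steps 501-504, assuming the diagonal-covariance mixed-GMM parameters returned by the splicing sketch above; scipy's multivariate_normal supplies the Gaussian density, and the normalization follows the formula given in step 503.

```python
import numpy as np
from scipy.stats import multivariate_normal

def masking_supervector(weights, means, covars, M, feats):
    """Steps 501-504: noise-masking supervector from frame features of shape (T, D).

    weights/means/covars describe the mixed GMM (M speech + N noise components);
    covars holds diagonal covariances as variance vectors, one per component."""
    T, K = feats.shape[0], len(weights)          # K = M + N
    lik = np.empty((T, K))
    for m in range(K):                           # w_m * p_m(x_t) for every component
        lik[:, m] = weights[m] * multivariate_normal.pdf(feats, mean=means[m],
                                                         cov=np.diag(covars[m]))
    gamma = lik / lik.sum(axis=1, keepdims=True) # step 501: posteriors over all M+N components
    pieces = []
    for m in range(M):                           # noise dimensions are simply never computed
        xi_m = (gamma[:, m:m + 1] * feats).sum(axis=0) / gamma[:, m].sum()   # step 502
        pieces.append(weights[m] * xi_m / np.sqrt(covars[m]))                # step 503
    return np.concatenate(pieces)                # step 504: supervector of length M * D
```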
Finally, the generated noise-masking supervector is used for SVM training and testing to complete speaker training and recognition.
In the present invention, because the mixed GMM contains noise mixture components, these components automatically absorb noise. When noise is encountered, the Gaussian probability densities of the noise components become larger while those of the speech components become smaller, which makes the posterior probabilities of the speech components in step 501 smaller and hence reduces their contribution to the vectors in step 502, thereby achieving noise masking.
Step 6: use the generated noise-masking supervector for SVM training and testing, completing speaker training and recognition.
The SVM training and testing methods are described in standard pattern recognition textbooks; the kernel function is a linear kernel.
Using the noise-masking supervector for SVM training and testing can effectively improve speaker recognition performance under noisy conditions.
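As an illustration, step 6 can be realized with scikit-learn's linear-kernel SVM; the one-vs-rest setup and the parameter C=1.0 are assumptions for this sketch, not values specified by the patent.

```python
import numpy as np
from sklearn.svm import SVC

def train_speaker_svm(supervectors, speaker_labels):
    """Step 6: linear-kernel SVM trained on noise-masking supervectors."""
    clf = SVC(kernel='linear', C=1.0, decision_function_shape='ovr')
    clf.fit(np.vstack(supervectors), speaker_labels)
    return clf

# Recognition: score a test utterance's supervector against the enrolled speakers, e.g.
# scores  = clf.decision_function(test_supervector.reshape(1, -1))
# speaker = clf.predict(test_supervector.reshape(1, -1))
```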
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the scope of the claims.

Claims (9)

1. A speaker recognition method based on a noise-masking kernel, characterized in that the method specifically comprises the following steps:
Step 1: input audio data and extract short-time features from the audio data frame by frame;
Step 2: train a GMM with M Gaussian mixture components on the short-time features of speech data, denoted the speech GMM;
Step 3: train a GMM with N Gaussian mixture components on the short-time features of noise data, denoted the noise GMM;
Step 4: splice the speech GMM and the noise GMM into a mixed GMM;
Step 5: generate a noise-masking supervector with the mixed GMM;
Step 6: use the generated noise-masking supervector for SVM training and testing, completing speaker training and recognition.
2. The speaker recognition method based on a noise-masking kernel according to claim 1, characterized in that, in said step 1, the short-time features are short-time cepstral features, the type of which is linear prediction cepstral coefficients (LPCC), Mel-frequency cepstral coefficients (MFCC), or perceptual linear prediction (PLP) coefficients.
3. The speaker recognition method based on a noise-masking kernel according to claim 1 or 2, characterized in that, in said step 1, the short-time features may also be short-time energy, short-time zero-crossing rate, or short-time autocorrelation coefficients.
4. The speaker recognition method based on a noise-masking kernel according to claim 1, characterized in that, in said steps 2 and 3, the GMM models are trained with the EM algorithm.
5. The speaker recognition method based on a noise-masking kernel according to claim 1, characterized in that, in said step 2, M is several hundred to several thousand; in said step 3, N is several tens to several hundred; and M is more than 10N.
6. The speaker recognition method based on a noise-masking kernel according to claim 1, characterized in that, in said step 4, the GMM splicing method is: let the speech GMM parameters be $\{w_m^s, \mu_m^s, \Sigma_m^s\},\ m = 1, \dots, M$, and the noise GMM parameters be $\{w_m^n, \mu_m^n, \Sigma_m^n\},\ m = 1, \dots, N$, where $w$ is a Gaussian mixture weight, $\mu$ is a Gaussian mixture mean vector, $\Sigma$ is a Gaussian mixture covariance matrix, the subscript $m$ indexes the mixture components, and the superscripts $s$ and $n$ denote speech and noise respectively; the parameters of the mixed GMM are:

$\{w_m, \mu_m, \Sigma_m\} = \begin{cases} \{\tfrac{1}{2} w_m^s,\ \mu_m^s,\ \Sigma_m^s\}, & m = 1, \dots, M \\ \{\tfrac{1}{2} w_{m-M}^n,\ \mu_{m-M}^n,\ \Sigma_{m-M}^n\}, & m = M+1, \dots, M+N \end{cases}$
7. The speaker recognition method based on a noise-masking kernel according to claim 1, characterized in that, in said step 5, the noise-masking supervector is generated by computing only the dimensions corresponding to the first M mixture components, masking the dimensions corresponding to noise.
8. The speaker recognition method based on a noise-masking kernel according to claim 1 or 7, characterized in that, in said step 5, the specific procedure for generating the noise-masking supervector is as follows:
Step 501: suppose the short-time cepstral features of an audio segment are $\{x_t,\ t = 1, \dots, T\}$, where $x_t$ is a frame feature vector, the subscript $t$ is the frame index, and $T$ is the total number of frames; compute the posterior probability of each Gaussian mixture component frame by frame, for $t = 1, \dots, T$ and $m = 1, \dots, M$:

$\gamma_m(t) = \dfrac{w_m\, p_m(x_t)}{\sum_{m'=1}^{M+N} w_{m'}\, p_{m'}(x_t)}$

where $p_m(x_t)$ is the Gaussian probability density of the $m$-th mixture component, computed as

$p_m(x_t) = \dfrac{1}{(2\pi)^{D/2} |\Sigma_m|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_t - \mu_m)^T \Sigma_m^{-1} (x_t - \mu_m) \right\};$

Step 502: compute the updated mean vector of each Gaussian mixture component, for $m = 1, \dots, M$:

$\xi_m = \dfrac{\sum_{t=1}^{T} \gamma_m(t)\, x_t}{\sum_{t=1}^{T} \gamma_m(t)};$

Step 503: normalize each updated mean vector using the GMM weights and covariances, for $m = 1, \dots, M$:

$\xi_m' = w_m\, \Sigma_m^{-1/2}\, \xi_m;$

Step 504: concatenate the M normalized vectors to produce the noise-masking supervector:

$\zeta = \begin{bmatrix} \xi_1' \\ \xi_2' \\ \vdots \\ \xi_M' \end{bmatrix}$
9. The speaker recognition method based on a noise-masking kernel according to claim 1, characterized in that the kernel function for SVM training and testing is a linear kernel.
CN201310681894.0A 2013-12-12 2013-12-12 Speaker recognition method based on noise-masking kernel Active CN103714818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310681894.0A CN103714818B (en) 2013-12-12 2013-12-12 Speaker recognition method based on noise-masking kernel

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310681894.0A CN103714818B (en) 2013-12-12 2013-12-12 Speaker recognition method based on noise-masking kernel

Publications (2)

Publication Number Publication Date
CN103714818A true CN103714818A (en) 2014-04-09
CN103714818B CN103714818B (en) 2016-06-22

Family

ID=50407724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310681894.0A Active CN103714818B (en) 2013-12-12 2013-12-12 Speaker recognition method based on noise-masking kernel

Country Status (1)

Country Link
CN (1) CN103714818B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000075888A (en) * 1998-09-01 2000-03-14 Oki Electric Ind Co Ltd Learning method of hidden markov model and voice recognition system
CN1343968A (en) * 2000-09-18 2002-04-10 日本先锋公司 Speech identification system
CN1924998A (en) * 2005-08-29 2007-03-07 摩托罗拉公司 Method and system for verifying speakers
CN101241699A (en) * 2008-03-14 2008-08-13 北京交通大学 A speaker identification system for remote Chinese teaching
CN101640043A (en) * 2009-09-01 2010-02-03 清华大学 Speaker recognition method based on multi-coordinate sequence kernel and system thereof

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106448661A (en) * 2016-09-23 2017-02-22 华南理工大学 Audio type detection method based on pure voice and background noise two-level modeling
CN106448661B (en) * 2016-09-23 2019-07-16 华南理工大学 Audio type detection method based on two-level modeling of clean speech and background noise
CN107424248A (en) * 2017-04-13 2017-12-01 成都步共享科技有限公司 Voiceprint unlocking method for a shared bicycle

Also Published As

Publication number Publication date
CN103714818B (en) 2016-06-22

Similar Documents

Publication Publication Date Title
CN107103903B (en) Acoustic model training method and device based on artificial intelligence and storage medium
CN102723078B (en) Emotion speech recognition method based on natural language comprehension
Xia et al. Using i-Vector Space Model for Emotion Recognition.
De Leon et al. Detection of synthetic speech for the problem of imposture
Yu et al. Uncertainty propagation in front end factor analysis for noise robust speaker recognition
CN107527620A (en) Electronic installation, the method for authentication and computer-readable recording medium
CN102799892B (en) Mel frequency cepstrum coefficient (MFCC) underwater target feature extraction and recognition method
CN104376842A (en) Neural network language model training method and device and voice recognition method
CN105161092B (en) A kind of audio recognition method and device
CN106104674A (en) Mixing voice identification
CN107705802A (en) Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing
CN104809103A (en) Man-machine interactive semantic analysis method and system
CN105845140A (en) Speaker confirmation method and speaker confirmation device used in short voice condition
CN104575519A (en) Feature extraction method and device as well as stress detection method and device
CN105023570B (en) A kind of method and system for realizing sound conversion
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN110459232A (en) A kind of phonetics transfer method generating confrontation network based on circulation
CN103594084A (en) Voice emotion recognition method and system based on joint penalty sparse representation dictionary learning
CN102664010A (en) Robust speaker distinguishing method based on multifactor frequency displacement invariant feature
CN105845141A (en) Speaker confirmation model, speaker confirmation method and speaker confirmation device based on channel robustness
Kumar et al. Significance of GMM-UBM based modelling for Indian language identification
Sheng et al. GANs for children: A generative data augmentation strategy for children speech recognition
CN103559289A (en) Language-irrelevant keyword search method and system
CN105845143A (en) Speaker confirmation method and speaker confirmation system based on support vector machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20161201

Address after: 100084 Zhongguancun Haidian District East Road No. 1, building 8, floor 8, A803B,

Patentee after: Beijing Hua Kong Chuang Wei Information Technology Co., Ltd.

Address before: 100084 Beijing, Beijing, 100084-82 mailbox

Patentee before: Tsinghua University

CP02 Change in the address of a patent holder
CP02 Change in the address of a patent holder

Address after: 100084 Tsinghua University, Haidian District, Beijing

Patentee after: BEIJING HUA KONG CHUANG WEI INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 100084 Zhongguancun Haidian District East Road No. 1, building 8, floor 8, A803B,

Patentee before: BEIJING HUA KONG CHUANG WEI INFORMATION TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20200430

Address after: 100084 Beijing city Haidian District Shuangqing Road No. 30 box 100084-82

Patentee after: TSINGHUA University

Address before: 100084 Tsinghua University, Haiding District, Haidian District, Beijing

Patentee before: BEIJING HUA KONG CHUANG WEI INFORMATION TECHNOLOGY Co.,Ltd.