CN1758332A - Speaker recognition method based on MFCC linear emotion compensation - Google Patents

Speaker recognition method based on MFCC linear emotion compensation

Info

Publication number
CN1758332A
CN1758332A · CNA2005100613603A · CN200510061360A
Authority
CN
China
Prior art keywords
mfcc
compensation
sigma
fundamental frequency
linear
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005100613603A
Other languages
Chinese (zh)
Other versions
CN100440315C (en)
Inventor
吴朝晖
杨莹春
吴甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CNB2005100613603A priority Critical patent/CN100440315C/en
Publication of CN1758332A publication Critical patent/CN1758332A/en
Application granted granted Critical
Publication of CN100440315C publication Critical patent/CN100440315C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Complex Calculations (AREA)

Abstract

This invention relates to a speaker recognition method based on MFCC linear emotion compensation, comprising: 1) pre-processing the speech signal; 2) feature extraction on each speech frame: the MFCC and the fundamental frequency are extracted from the speaker's speech, the signal stream is divided into voiced and unvoiced segments according to whether a fundamental frequency exists, and any unvoiced frame is discarded; 3) applying a linear compensation to the MFCC of each frame according to the change of its fundamental frequency; 4) compensating the MFCC with the coefficient that maximizes the probability obtained by maximum-likelihood estimation and training on the compensated features; 5) recognition.

Description

Speaker recognition method based on MFCC linear emotion compensation
Technical field
The present invention relates to biometric identification technology, and in particular to a speaker recognition method based on MFCC linear emotion compensation.
Background technology
Biometric identification refers to verifying a person's identity by computer using human physiological or behavioural characteristics. It relies on unique, reliable and stable physiological traits of the human body (such as fingerprint, iris, face or palmprint) or behavioural traits (speech, keystroke dynamics, gait, signature, etc.), and applies the computing power of computers and network technology to signal processing and pattern recognition in order to establish a person's identity. Voiceprint recognition, or speaker recognition, is one such technology: it automatically identifies the speaker from the speech parameters in the waveform that reflect the speaker's physiological and behavioural characteristics.
Human speech carries not only speaker identity and linguistic content but also emotion and mood. On emotionally influenced speech, the recognition rate of traditional speaker recognition methods drops sharply, because the emotional factors contained in the voice are not taken into account and the role of prosody and paralanguage is ignored. Traditional voiceprint feature extraction takes only physiological characteristics from the speech signal, so voiceprint recognition systems rely mainly on low-level acoustic features. Since the extracted information cannot portray the speaker's personal characteristics comprehensively, the performance of existing voiceprint recognition systems is unstable.
Summary of the invention
The present invention addresses the above defects by providing a speaker recognition method for emotional speech that uses a cepstral feature compensation linear in the fundamental frequency; by linearly compensating the speaker's cepstral features, the robustness of speaker recognition under the influence of emotional factors is improved.
The technical solution adopted by the present invention to solve this technical problem is a speaker recognition method based on MFCC linear emotion compensation whose main steps are:
1) pre-processing of the speech signal: mainly sampling and quantization, pre-emphasis and windowing;
2) feature extraction on each speech frame: the cepstral feature MFCC and the fundamental frequency are extracted from the speaker's speech; according to whether a fundamental frequency exists, the signal stream is divided into voiced and unvoiced segments, and any frame judged to be unvoiced is discarded;
3) the MFCC of each frame is linearly compensated according to the change of its fundamental frequency; during this step the compensation coefficient is adjusted repeatedly until the probability of the maximum-likelihood estimate in the EM algorithm is maximal, and the compensation coefficient is determined accordingly;
4) the MFCC are compensated with the coefficient that maximizes the maximum-likelihood probability, and the model is trained on the compensated speech features;
5) recognition: after the speech to be tested is input, feature extraction yields a feature-vector sequence; this sequence is fed into the GMM with the corresponding user's model parameters, and a similarity value is obtained and used to score the user.
The technical scheme of the present invention can be further refined. In the cepstral-feature linear compensation, each dimension of the MFCC feature of every frame is corrected using the fundamental frequency of that frame, so that it characterizes the speaker's personal traits as well as possible and reduces the within-speaker feature variation caused by emotion change. The compensation coefficient is the factor describing how fundamental-frequency variation affects the MFCC features during cepstral compensation; the best compensation coefficient is obtained by running the EM algorithm repeatedly. The repeated EM algorithm determines the optimal compensation coefficient by estimating the hidden-state probabilities of the MFCC compensated with different candidate coefficients and selecting the coefficient that maximizes the probability as the one used to train the model.
The beneficial effect of the present invention is that the cepstral feature compensation based on fundamental frequency exploits the variation pattern of prosodic features in emotional speech, so that the compensated MFCC features of emotional speech become more stable speaker characteristics and the within-speaker speech differences caused by emotion are reduced as far as possible. The best compensation coefficient is selected by calling the EM algorithm repeatedly during training of the Gaussian mixture model (GMM); in this way the coefficient that best describes the relation between fundamental-frequency variation and the original MFCC features can be found.
Description of drawings
Fig. 1 is the process of linear compensation EM training algorithm of the present invention;
Fig. 2 is an algorithm flow chart of the present invention;
Embodiment
The invention is further described below with reference to the drawings and an embodiment. The method of the present invention is divided into the following steps.
The first step: voice signal pre-service
1, Sampling and quantization
A) Filter the speech signal with a sharp anti-aliasing filter so that its Nyquist frequency F_N is 4 kHz;
B) Set the speech sampling rate F = 2F_N;
C) Sample the analog speech signal s_a(t) at this rate to obtain the amplitude sequence of the digital speech signal, s(n) = s_a(n/F);
D) Quantize s(n) using pulse code modulation (PCM) to obtain the quantized amplitude sequence s'(n).
2, Pre-emphasis
A) Set the pre-emphasis factor a in the digital filter with transfer function H(z) = 1 - a z^{-1}; a takes a value slightly smaller than 1;
B) Pass s'(n) through this filter to obtain a sequence s''(n) whose low-, mid- and high-frequency amplitudes are balanced.
3, Windowing
A) Compute the frame length N of a speech frame, which must satisfy 20 ms ≤ N/F ≤ 30 ms, where F is the speech sampling rate in Hz;
B) With frame length N and frame shift N/2, divide s''(n) into a sequence of speech frames F_m, each containing N samples;
C) Compute the Hamming window function:
\omega(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad n = 0, 1, \ldots, N-1
D) Apply the Hamming window to each speech frame F_m:
F_m'(n) = \omega(n) \times F_m(n), \quad n = 0, 1, \ldots, N-1.
Second step: feature extraction
Feature extraction on the speech frame comprises the extraction of fundamental frequency (pitch) and Mel cepstrum coefficient (MFCC).
1, fundamental frequency (pitch):
A) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
B) Set the valid range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
C) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
D) For each candidate frequency f, compute the subharmonic-to-harmonic ratio (SHR):
SHR = SS / SH
where SS = \sum_{n=1}^{N} X\left(\left(n - \tfrac{1}{2}\right) f\right), \quad SH = \sum_{n=1}^{N} X(n f), \quad N = f_{ceiling} / f;
E) Find the frequency f_1 at which the SHR is highest;
F) If f_1 > f_max, or if SS - SH < 0 at f_1, the frame is considered non-speech and the fundamental frequency is 0 (Pitch = 0);
G) In the interval [1.9375 f_1, 2.0625 f_1], find the frequency f_2 of the local maximum of the SHR;
H) If f_2 > f_max, or the SHR at f_2 is greater than 0.2, then Pitch = f_1;
I) Otherwise, Pitch = f_2;
J) Verify the resulting fundamental frequency by autocorrelation: starting from the midpoint of the frame, take segments of length 1/Pitch before and after it and compute their correlation value C; if C < 0.2 the pitch value is considered unreliable and Pitch = 0;
K) Finally, apply median smoothing filtering to the whole sequence of Pitch values.
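For illustration, a simplified Python sketch of the SHR pitch decision for a single windowed frame, following steps A)-I) above; the 1 Hz candidate grid and the linear interpolation of the magnitude spectrum are implementation assumptions, and the autocorrelation check J) and median smoothing K) are omitted for brevity.

```python
import numpy as np

def shr_pitch(frame, fs, f_floor=50.0, f_ceiling=1250.0, f_max=550.0):
    """Pitch of one windowed frame via the subharmonic-to-harmonic ratio (SHR)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    def X(f):
        # Magnitude spectrum at an arbitrary frequency, by linear interpolation.
        return np.interp(f, freqs, spectrum)

    def ss_sh(f):
        # SS sums the spectrum at half-odd multiples of f, SH at integer multiples.
        n = np.arange(1, int(f_ceiling / f) + 1)
        return np.sum(X((n - 0.5) * f)), np.sum(X(n * f))

    candidates = np.arange(f_floor, f_ceiling, 1.0)        # 1 Hz grid (a choice)
    ratios = np.array([ss / sh if sh > 0 else 0.0
                       for ss, sh in (ss_sh(f) for f in candidates)])

    f1 = candidates[np.argmax(ratios)]                     # step E)
    ss1, sh1 = ss_sh(f1)
    if f1 > f_max or ss1 - sh1 < 0:                        # step F): non-speech frame
        return 0.0

    mask = (candidates >= 1.9375 * f1) & (candidates <= 2.0625 * f1)   # step G)
    if not mask.any():
        return f1
    f2 = candidates[mask][np.argmax(ratios[mask])]
    if f2 > f_max or ratios[mask].max() > 0.2:             # step H)
        return f1
    return f2                                              # step I)
```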
2, Extraction of the MFCC:
A) Set the order p of the Mel cepstral coefficients;
B) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
C) Compute the Mel-domain scale:
M_i = \frac{i}{p} \times 2595 \log_{10}\left(1 + \frac{8000/2.0}{700.0}\right), \quad i = 0, 1, 2, \ldots, p
D) Compute the corresponding frequency-domain scale:
f_i = 700 \times \left(e^{\frac{M_i \ln 10}{2595}} - 1\right), \quad i = 0, 1, 2, \ldots, p
E) Compute the log energy spectrum on each Mel-domain channel \phi_j:
E_j = \sum_{k=0}^{K/2 - 1} \phi_j(k) \, |X(k)|^2
where \sum_{k=0}^{K/2 - 1} \phi_j(k) = 1;
F) Apply the discrete cosine transform (DCT).
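A compact Python sketch of steps A)-F) for one windowed frame. The triangular channel shape, the use of p + 2 boundary points to obtain p channels, and the unnormalized DCT-II are assumptions; the text above only fixes the Mel/frequency mappings and requires each channel's weights to sum to one.

```python
import numpy as np

def mfcc_frame(frame, fs=8000, p=16, n_fft=512):
    """Mel cepstral coefficients of one windowed frame (sketch of steps A-F)."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # |X(k)|^2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # Mel-domain scale and its frequency-domain counterpart (formulas above);
    # p + 2 boundary points are used here so that p triangular channels result.
    i = np.arange(p + 2)
    mel = i / (p + 1) * 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0)
    f = 700.0 * (np.exp(mel * np.log(10.0) / 2595.0) - 1.0)

    log_energy = np.empty(p)
    for j in range(1, p + 1):
        left, center, right = f[j - 1], f[j], f[j + 1]
        rise = (freqs - left) / (center - left)
        fall = (right - freqs) / (right - center)
        phi = np.clip(np.minimum(rise, fall), 0.0, None)     # triangular channel
        if phi.sum() > 0:
            phi /= phi.sum()                                  # sum_k phi_j(k) = 1
        log_energy[j - 1] = np.log(phi @ power + 1e-12)       # log channel energy E_j

    # DCT-II of the log channel energies gives the cepstral coefficients.
    n = np.arange(p)
    basis = np.cos(np.pi * np.outer(n, n + 0.5) / p)
    return basis @ log_energy
```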
The 3rd step, cepstrum feature compensation
1, Aligning the cepstral features and the fundamental frequency
The voiced signal is a quasi-periodic signal whose repetition rate is the fundamental frequency. According to whether a fundamental frequency exists, the signal stream is divided into voiced and unvoiced segments; any frame judged to be unvoiced is discarded.
2, Determining the optimal compensation coefficient with the EM algorithm
For each candidate compensation coefficient α_k, the latent-state probability calculation below is repeated, and the coefficient giving the highest probability is selected as the optimal compensation coefficient.
A) Linear compensation of the cepstral features of each frame with coefficient α_k
Let x(t) be the cepstral feature at time t, Y(t) the fundamental frequency at time t, x_opt(t) the compensated cepstral feature, and E(Y(t)) the average fundamental frequency:
x_{opt}(t) = x(t) - \alpha_k \times \frac{|Y(t) - E(Y(t))|}{|E(Y(t))|}
B) Estimate the latent-state probabilities
P_i' = \frac{\sum_{t=1}^{T} T_t(i)}{\sum_{t=1}^{T} \sum_{i=1}^{M} T_t(i)} = \frac{1}{T} \sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)
where
P(i_t = i \mid z_t, \lambda) = \frac{P_i \, p(z_t \mid i_t = i, \lambda)}{p(z_t \mid \lambda)} = \frac{P_i \, b_i(z_t)}{\sum_{i=1}^{M} P_i \, b_i(z_t)}
C) Repeat the calculation until the coefficient \hat{\alpha} is found that satisfies
\hat{\alpha} = \arg\max_{\alpha} \{ P(i_t = i \mid z_t, \lambda) \}
D) Estimate the GMM parameters P_i', \mu_i' and R_i' (i.e. \lambda') with the local maximum criterion:
\mu_i' = \frac{\sum_{t=1}^{T} T_t(i) \, z_t}{\sum_{t=1}^{T} T_t(i)} = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, z_t}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}
R_i' = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, (z_t - \mu_i')^T (z_t - \mu_i')}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}
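A hedged sketch of steps A)-D) as a search over candidate coefficients: each candidate α_k is applied to the voiced-frame MFCC, a GMM is re-estimated on the compensated features, and the candidate with the highest likelihood is kept. The candidate grid, the use of scikit-learn's GaussianMixture as a stand-in for the EM re-estimation formulas above, and the application of the same scalar shift to every MFCC dimension are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def compensate(mfcc, pitch, alpha):
    """x_opt(t) = x(t) - alpha * |Y(t) - E(Y(t))| / |E(Y(t))| on voiced frames."""
    voiced = pitch > 0
    mean_pitch = pitch[voiced].mean()                       # E(Y(t))
    shift = alpha * np.abs(pitch[voiced] - mean_pitch) / abs(mean_pitch)
    return mfcc[voiced] - shift[:, None]                    # same shift on each dim

def best_alpha(mfcc, pitch, candidate_alphas=np.linspace(0.0, 1.0, 11), n_mix=32):
    """Pick the compensation coefficient whose compensated features score highest."""
    best = (None, -np.inf, None)
    for alpha in candidate_alphas:
        z = compensate(mfcc, pitch, alpha)
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                              max_iter=50, random_state=0).fit(z)
        score = gmm.score(z)                                # mean log-likelihood
        if score > best[1]:
            best = (alpha, score, gmm)
    return best                                             # (alpha_hat, score, GMM)
```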
The 4th step, training
Each speaker's speech features form a specific distribution in feature space, and the distribution of the compensated features describes the speaker's individuality better. A Gaussian mixture model (GMM) approximates the speaker's feature distribution with a linear combination of several Gaussian distributions.
The probability density function has the same functional form for every speaker; only the parameters of the function differ. An M-th order Gaussian mixture model GMM describes the distribution of the frame features in feature space with a linear combination of M single Gaussian distributions, that is:
p(x) = \sum_{i=1}^{M} P_i \, b_i(x)
b_i(x) = N(x, \mu_i, R_i) = \frac{1}{(2\pi)^{p/2} |R_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T R_i^{-1} (x - \mu_i) \right\}
where p is the feature dimension, b_i(x) is the kernel function, namely a Gaussian distribution with mean vector \mu_i and covariance matrix R_i, and M (typically 16 or 32) is the order of the GMM, fixed as an integer before the speaker model is built. \lambda = \{P_i, \mu_i, R_i \mid i = 1, 2, \ldots, M\} is the parameter set of the speaker's GMM feature distribution. As the weighting coefficients of the Gaussian mixture, the P_i must satisfy
\int_{-\infty}^{+\infty} p(x \mid \lambda) \, dx = 1
that is,
\sum_{i=1}^{M} P_i = 1
Because computing p(x) in the GMM requires inverting the p × p matrices R_i (i = 1, 2, ..., M), the computational load is large. For this reason, R_i is set to be a diagonal matrix, so that the matrix inversion reduces to taking element-wise reciprocals and the computation is faster.
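As an illustration of why the diagonal restriction helps, here is a numpy-only sketch of evaluating log p(x) for a diagonal-covariance GMM; the array shapes (weights (M,), means and variances (M, p)) are my assumption.

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """log p(x) for a diagonal-covariance GMM: p(x) = sum_i P_i * N(x; mu_i, R_i).

    With diagonal R_i, the matrix inverse in b_i(x) becomes an element-wise
    reciprocal of the variances, which is the speed-up mentioned above.
    """
    p = x.shape[-1]
    diff = x[None, :] - means                              # (M, p)
    inv_var = 1.0 / variances                              # reciprocal, not inverse
    log_det = np.sum(np.log(variances), axis=1)            # log|R_i| for diagonal R_i
    exponent = -0.5 * np.sum(diff * diff * inv_var, axis=1)
    log_b = -0.5 * p * np.log(2 * np.pi) - 0.5 * log_det + exponent
    return np.logaddexp.reduce(np.log(weights) + log_b)    # log sum_i P_i b_i(x)
```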
The 5th step, identification
After the speech to be tested is input, feature extraction yields a feature-vector sequence. This sequence is fed into the GMM with the corresponding user's model parameters, and the resulting similarity value is used to score the user.
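A minimal sketch of this identification step, assuming one trained GMM per enrolled speaker with a .score() method returning the average log-likelihood (as scikit-learn's GaussianMixture provides); the dictionary interface is an assumption for the sketch.

```python
def identify(test_features, speaker_gmms):
    """Score a test feature sequence against each enrolled speaker's GMM.

    `speaker_gmms` maps a speaker id to a fitted model exposing .score();
    the best-matching speaker and all scores are returned.
    """
    scores = {spk: gmm.score(test_features) for spk, gmm in speaker_gmms.items()}
    return max(scores, key=scores.get), scores
```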
Experimental result
The system was tested on the Emotional Prosody Speech corpus. This corpus is an emotional speech database built by the Linguistic Data Consortium (LDC) according to its database standards for research on the pronunciation characteristics of different emotional speech. It was recorded by 7 professional actors (3 male and 4 female target speakers) reading aloud a series of given English utterances, mainly dates and numbers, covering 14 different emotion categories. The recording method was to let the actors express each emotion with different tones, intonations and speaking rates. The recording length per speaker and emotion varies between roughly 10 and 40 seconds, with only a few reaching 50 seconds; the total recording length per speaker is about 5 to 6 minutes.
We designed and ran two groups of experiments on this corpus. The first group is a baseline experiment with the classical MFCC-GMM approach: the model is trained on cepstral features without any compensation, and the GMM is trained with the ordinary EM algorithm. This group serves as the control group.
In the second group, the cepstral features are linearly compensated, repeated EM estimation is used to select the best compensation coefficient, and the corrected MFCC feature vectors are used to train the GMM model.
For performance assessment, the equal error rate (EER) and the identification rate (IR) are used as evaluation criteria for the speaker recognition system.
Computing the EER requires two other evaluation indices:
(1) false acceptance rate (FA): the number of utterances falsely accepted divided by the total number of utterances that should be rejected, giving the speaker verification false acceptance rate;
(2) false rejection rate (FR): the number of utterances falsely rejected divided by the total number of utterances that should be accepted, giving the speaker verification false rejection rate.
When FA = FR, or when |FA - FR| < δ (δ < 0.0001), the system's equal error rate (EER) is obtained, i.e. EER = FA or EER = FR.
The identification rate IR is computed as:
IR = (number of correctly identified test utterances) / (total number of test utterances) × 100%
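A hedged sketch of how EER and IR might be computed from verification scores and identification decisions; the genuine/impostor score arrays and label arrays are illustrative inputs, not data from the experiments.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores, delta=1e-4):
    """Sweep a decision threshold until FA and FR coincide (|FA - FR| < delta)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, best_eer = np.inf, None
    for th in thresholds:
        fa = np.mean(impostor_scores >= th)   # falsely accepted / should-be-rejected
        fr = np.mean(genuine_scores < th)     # falsely rejected / should-be-accepted
        if abs(fa - fr) < delta:
            return fa                          # EER = FA = FR (within delta)
        if abs(fa - fr) < best_gap:
            best_gap, best_eer = abs(fa - fr), 0.5 * (fa + fr)
    return best_eer

def identification_rate(predicted_ids, true_ids):
    """IR: fraction of test utterances whose speaker is identified correctly."""
    return float(np.mean(np.asarray(predicted_ids) == np.asarray(true_ids)))
```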
The experiment parameters were set as follows:
Window length: 32 ms
Frame shift: 16 ms
Pre-emphasis: 0.97
MFCC dimension: 16 MFCC + delta
GMM order: 32
The experimental results are as follows:
Method                 EER (%)   IR (%)
Benchmark experiment   32.41     62.94
This method            29.92     73.04
The results broken down by emotion are given in the table below, relative to the benchmark experiment; "+" indicates that the value increases and "-" that it decreases:
Affective state    Relative EER (%)   Relative IR (%)
Elation            -4.30              +6.29
Panic              -10.76             +19.86
Hot anger          -3.60              +9.35
Disgust            -3.70              +15.56
Cold anger         -1.92              +12.82
Anxiety            -3.92              +8.82
Interest           -1.41              +5.09
Despair            -2.79              +5.78
Contempt           -1.02              +10.0
Sadness            -3.53              +15.23
Pride              -2.76              +5.96
Shame              -1.35              +11.49
Boredom            -0.00              +10.39
Neutral            -0.00              +6.25
The experimental machine was configured with an AMD Athlon(tm) XP 2500+ CPU and 512 MB of DDR400 memory.
The experimental results show that the feature compensation method makes the cepstral features describe the speaker's individual information better, thereby improving speaker recognition performance: the equal error rate decreases and the identification rate increases. The experiments on the emotional corpus also show that the method works well for all of the emotional states considered.

Claims (6)

1. A speaker recognition method based on MFCC linear emotion compensation, characterized in that its main steps are:
1) pre-processing of the speech signal: mainly sampling and quantization, pre-emphasis and windowing;
2) feature extraction on each speech frame: the cepstral feature MFCC and the fundamental frequency are extracted from the speaker's speech; according to whether a fundamental frequency exists, the signal stream is divided into voiced and unvoiced segments, and any frame judged to be unvoiced is discarded;
3) the MFCC of each frame is linearly compensated according to the change of its fundamental frequency; during this step the compensation coefficient is adjusted repeatedly until the probability of the maximum-likelihood estimate in the EM algorithm is maximal, and the compensation coefficient is determined accordingly;
4) the MFCC are compensated with the coefficient that maximizes the maximum-likelihood probability, and the model is trained on the compensated speech features;
5) recognition: after the speech to be tested is input, feature extraction yields a feature-vector sequence; this sequence is fed into the GMM with the corresponding user's model parameters, and a similarity value is obtained and used to score the user.
2. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that in the cepstral-feature linear compensation each dimension of the MFCC feature of every frame is corrected using the fundamental frequency of that frame, so that it characterizes the speaker's personal traits as well as possible.
3. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that the compensation coefficient is the factor describing how fundamental-frequency variation affects the MFCC features during cepstral compensation, and the best compensation coefficient can be obtained by running the EM algorithm repeatedly.
4. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that the repeated EM algorithm determines the optimal compensation coefficient by estimating the hidden-state probabilities of the MFCC compensated with different candidate coefficients, and selecting the coefficient that maximizes the probability as the one used when training the model.
5. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that the feature extraction on each speech frame comprises extraction of the fundamental frequency (pitch) and of the Mel cepstral coefficients (MFCC);
1) Fundamental frequency:
A) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
B) Set the valid range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
C) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
D) For each candidate frequency f, compute the subharmonic-to-harmonic ratio (SHR):
SHR = SS / SH
where SS = \sum_{n=1}^{N} X\left(\left(n - \tfrac{1}{2}\right) f\right), \quad SH = \sum_{n=1}^{N} X(n f), \quad N = f_{ceiling} / f;
E) Find the frequency f_1 at which the SHR is highest;
F) If f_1 > f_max, or if SS - SH < 0 at f_1, the frame is considered non-speech or silent, and the fundamental frequency Pitch = 0;
G) In the interval [1.9375 f_1, 2.0625 f_1], find the frequency f_2 of the local maximum of the SHR;
H) If f_2 > f_max, or the SHR at f_2 is greater than 0.2, then Pitch = f_1;
I) Otherwise, Pitch = f_2;
J) Verify the resulting fundamental frequency by autocorrelation: starting from the midpoint of the frame, take segments of length 1/Pitch before and after it and compute their correlation value C; if C < 0.2 the pitch value is considered unreliable and Pitch = 0;
K) Finally, apply median smoothing filtering to the whole sequence of Pitch values;
2) Extraction of the MFCC:
A) Set the order p of the Mel cepstral coefficients;
B) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
C) Compute the Mel-domain scale:
M_i = \frac{i}{p} \times 2595 \log_{10}\left(1 + \frac{8000/2.0}{700.0}\right), \quad i = 0, 1, 2, \ldots, p
D) Compute the corresponding frequency-domain scale:
f_i = 700 \times \left(e^{\frac{M_i \ln 10}{2595}} - 1\right), \quad i = 0, 1, 2, \ldots, p
E) Compute the log energy spectrum on each Mel-domain channel \phi_j:
E_j = \sum_{k=0}^{K/2 - 1} \phi_j(k) \, |X(k)|^2
where \sum_{k=0}^{K/2 - 1} \phi_j(k) = 1;
F) Apply the discrete cosine transform (DCT).
6. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, 2, 3 or 4, characterized in that the optimal compensation coefficient is determined by the EM algorithm: for the different candidate compensation coefficients α_k, the latent-state probability calculation below is repeated in order to obtain the optimal compensation coefficient;
A) Linear compensation of the cepstral features of each frame with coefficient α_k
Let x(t) be the cepstral feature at time t, Y(t) the fundamental frequency at time t, x_opt(t) the compensated cepstral feature, and E(Y(t)) the average fundamental frequency:
x_{opt}(t) = x(t) - \alpha_k \times \frac{|Y(t) - E(Y(t))|}{|E(Y(t))|}
B) Estimate the latent-state probabilities
P_i' = \frac{\sum_{t=1}^{T} T_t(i)}{\sum_{t=1}^{T} \sum_{i=1}^{M} T_t(i)} = \frac{1}{T} \sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)
where
P(i_t = i \mid z_t, \lambda) = \frac{P_i \, p(z_t \mid i_t = i, \lambda)}{p(z_t \mid \lambda)} = \frac{P_i \, b_i(z_t)}{\sum_{i=1}^{M} P_i \, b_i(z_t)}
C) Repeat the calculation until the coefficient \hat{\alpha} is found that satisfies
\hat{\alpha} = \arg\max_{\alpha} \{ P(i_t = i \mid z_t, \lambda) \}
D) Estimate the GMM parameters P_i', \mu_i' and R_i' (i.e. \lambda') with the local maximum criterion:
\mu_i' = \frac{\sum_{t=1}^{T} T_t(i) \, z_t}{\sum_{t=1}^{T} T_t(i)} = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, z_t}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}
R_i' = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, (z_t - \mu_i')^T (z_t - \mu_i')}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}.
CNB2005100613603A 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation Expired - Fee Related CN100440315C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100613603A CN100440315C (en) 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100613603A CN100440315C (en) 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation

Publications (2)

Publication Number Publication Date
CN1758332A true CN1758332A (en) 2006-04-12
CN100440315C CN100440315C (en) 2008-12-03

Family

ID=36703669

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100613603A Expired - Fee Related CN100440315C (en) 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation

Country Status (1)

Country Link
CN (1) CN100440315C (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102201237A (en) * 2011-05-12 2011-09-28 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN1975856B (en) * 2006-10-30 2011-11-09 邹采荣 Speech emotion identifying method based on supporting vector machine
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
WO2013040981A1 (en) * 2011-09-23 2013-03-28 浙江大学 Speaker recognition method for combining emotion model based on near neighbour principles
CN101547261B (en) * 2008-03-27 2013-06-05 富士通株式会社 Association apparatus and association method
CN103594091A (en) * 2013-11-15 2014-02-19 深圳市中兴移动通信有限公司 Mobile terminal and voice signal processing method thereof
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 A kind of speech emotional feature selection approach based on Standard of Environmental Noiseization conversion
CN103943104B (en) * 2014-04-15 2017-03-01 海信集团有限公司 A kind of voice messaging knows method for distinguishing and terminal unit
CN109346087A (en) * 2018-09-17 2019-02-15 平安科技(深圳)有限公司 Fight the method for identifying speaker and device of the noise robustness of the bottleneck characteristic of network
CN109564759A (en) * 2016-08-03 2019-04-02 思睿逻辑国际半导体有限公司 Speaker Identification
CN110931022A (en) * 2019-11-19 2020-03-27 天津大学 Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics
CN111462759A (en) * 2020-04-01 2020-07-28 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
CN111681664A (en) * 2020-07-24 2020-09-18 北京百瑞互联技术有限公司 Method, system, storage medium and equipment for reducing audio coding rate
CN113409762A (en) * 2021-06-30 2021-09-17 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium
CN113567969A (en) * 2021-09-23 2021-10-29 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60213195T8 (en) * 2002-02-13 2007-10-04 Sony Deutschland Gmbh Method, system and computer program for speech / speaker recognition using an emotion state change for the unsupervised adaptation of the recognition method

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856B (en) * 2006-10-30 2011-11-09 邹采荣 Speech emotion identifying method based on supporting vector machine
CN101547261B (en) * 2008-03-27 2013-06-05 富士通株式会社 Association apparatus and association method
CN102201237A (en) * 2011-05-12 2011-09-28 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN102201237B (en) * 2011-05-12 2013-03-13 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN102354496B (en) * 2011-07-01 2013-08-21 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
WO2013040981A1 (en) * 2011-09-23 2013-03-28 浙江大学 Speaker recognition method for combining emotion model based on near neighbour principles
CN103594091B (en) * 2013-11-15 2017-06-30 努比亚技术有限公司 A kind of mobile terminal and its audio signal processing method
CN103594091A (en) * 2013-11-15 2014-02-19 深圳市中兴移动通信有限公司 Mobile terminal and voice signal processing method thereof
CN103943104B (en) * 2014-04-15 2017-03-01 海信集团有限公司 A kind of voice messaging knows method for distinguishing and terminal unit
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN109564759B (en) * 2016-08-03 2023-06-09 思睿逻辑国际半导体有限公司 Speaker identification
CN109564759A (en) * 2016-08-03 2019-04-02 思睿逻辑国际半导体有限公司 Speaker Identification
US11735191B2 (en) 2016-08-03 2023-08-22 Cirrus Logic, Inc. Speaker recognition with assessment of audio frame contribution
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 A kind of speech emotional feature selection approach based on Standard of Environmental Noiseization conversion
CN109346087A (en) * 2018-09-17 2019-02-15 平安科技(深圳)有限公司 Fight the method for identifying speaker and device of the noise robustness of the bottleneck characteristic of network
CN109346087B (en) * 2018-09-17 2023-11-10 平安科技(深圳)有限公司 Noise robust speaker verification method and apparatus against bottleneck characteristics of a network
CN110931022A (en) * 2019-11-19 2020-03-27 天津大学 Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics
CN110931022B (en) * 2019-11-19 2023-09-15 天津大学 Voiceprint recognition method based on high-low frequency dynamic and static characteristics
CN111462759A (en) * 2020-04-01 2020-07-28 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
CN111462759B (en) * 2020-04-01 2024-02-13 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
CN111681664A (en) * 2020-07-24 2020-09-18 北京百瑞互联技术有限公司 Method, system, storage medium and equipment for reducing audio coding rate
CN113409762A (en) * 2021-06-30 2021-09-17 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium
CN113409762B (en) * 2021-06-30 2024-05-07 平安科技(深圳)有限公司 Emotion voice synthesis method, emotion voice synthesis device, emotion voice synthesis equipment and storage medium
CN113567969A (en) * 2021-09-23 2021-10-29 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals
CN113567969B (en) * 2021-09-23 2021-12-17 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals

Also Published As

Publication number Publication date
CN100440315C (en) 2008-12-03

Similar Documents

Publication Publication Date Title
CN1758332A (en) Speaker recognition method based on MFCC linear emotion compensation
CN101178897B (en) Speaking man recognizing method using base frequency envelope to eliminate emotion voice
US8930185B2 (en) Speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program
Shanthi et al. Review of feature extraction techniques in automatic speech recognition
CN1787075A (en) Method for distinguishing speek speek person by supporting vector machine model basedon inserted GMM core
CN110265063B (en) Lie detection method based on fixed duration speech emotion recognition sequence analysis
CN1787076A (en) Method for distinguishing speek person based on hybrid supporting vector machine
Torres-Boza et al. Hierarchical sparse coding framework for speech emotion recognition
CN100543840C (en) Method for distinguishing speek person based on emotion migration rule and voice correction
CN113111151A (en) Cross-modal depression detection method based on intelligent voice question answering
Quan et al. Reduce the dimensions of emotional features by principal component analysis for speech emotion recognition
Kandali et al. Vocal emotion recognition in five native languages of Assam using new wavelet features
Pao et al. Detecting emotions in Mandarin speech
Zheng et al. An improved speech emotion recognition algorithm based on deep belief network
Houari et al. Study the Influence of Gender and Age in Recognition of Emotions from Algerian Dialect Speech.
Meftah et al. Emotional speech recognition: A multilingual perspective
Lu et al. Physiological feature extraction for text independent speaker identification using non-uniform subband processing
Palo et al. Emotion Analysis from Speech of Different Age Groups.
Rao et al. Glottal excitation feature based gender identification system using ergodic HMM
Patil et al. A review on emotional speech recognition: resources, features, and classifiers
Hamiditabar et al. Determining the severity of depression in speech based on combination of acoustic-space and score-space features
Julia et al. Detection of emotional expressions in speech
Kexin et al. Research on Emergency Parking Instruction Recognition Based on Speech Recognition and Speech Emotion Recognition
Tanprasert et al. Comparative study of GMM, DTW, and ANN on Thai speaker identification system
Fernandez et al. Exploiting vocal-source features to improve ASR accuracy for low-resource languages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081203

Termination date: 20211031