CN1758332A - Speaker recognition method based on MFCC linear emotion compensation - Google Patents

Speaker recognition method based on MFCC linear emotion compensation

Info

Publication number
CN1758332A
CN1758332A · CNA2005100613603A · CN200510061360A
Authority
CN
China
Prior art keywords
mfcc
compensation
sigma
fundamental frequency
linear
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005100613603A
Other languages
Chinese (zh)
Other versions
CN100440315C (en)
Inventor
吴朝晖
杨莹春
吴甜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CNB2005100613603A priority Critical patent/CN100440315C/en
Publication of CN1758332A publication Critical patent/CN1758332A/en
Application granted granted Critical
Publication of CN100440315C publication Critical patent/CN100440315C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Complex Calculations (AREA)

Abstract

This invention relates to a speaker recognition method based on MFCC linear emotion compensation, comprising: 1) pre-processing the speech signal; 2) feature extraction on each speech frame: the MFCC and the fundamental frequency are extracted from the speaker's speech, the signal stream is divided into voiced and unvoiced segments according to whether a fundamental frequency exists, and any unvoiced frame is discarded; 3) applying a linear compensation to the MFCC of each frame according to the change of its fundamental frequency; 4) compensating the MFCC with the coefficient that maximizes the probability obtained by maximum-likelihood estimation and training on the compensated features; 5) recognition.

Description

Speaker recognition method based on MFCC linear emotion compensation
Technical field
The present invention relates to biometric identification technology, and in particular to a speaker recognition method based on MFCC linear emotion compensation.
Background technology
Biometric identification refers to verifying a person's identity by computer using human physiological or behavioural characteristics. It relies on unique, reliable and stable physiological traits of the human body (such as fingerprint, iris, face or palmprint) or behavioural traits (speech, keystroke dynamics, gait, signature, etc.), and applies the computing power of computers and network technology to signal processing and pattern recognition in order to establish a person's identity. Voiceprint recognition, or speaker recognition, is one such technology: it automatically identifies the speaker from the speech parameters in the waveform that reflect the speaker's physiological and behavioural characteristics.
Human speech carries not only speaker identity and linguistic content but also emotion and mood. On emotionally influenced speech, the recognition rate of traditional speaker recognition methods drops sharply, because the emotional factors contained in the voice are not taken into account and the role of prosody and paralanguage is ignored. Traditional voiceprint feature extraction takes only physiological characteristics from the speech signal, so voiceprint recognition systems rely mainly on low-level acoustic features. Since the extracted information cannot portray the speaker's personal characteristics comprehensively, the performance of existing voiceprint recognition systems is unstable.
Summary of the invention
The present invention addresses the above defects by providing a speaker recognition method for emotional speech that uses a cepstral feature compensation linear in the fundamental frequency; by linearly compensating the speaker's cepstral features, the robustness of speaker recognition under the influence of emotional factors is improved.
The technical solution adopted by the present invention to solve this technical problem is a speaker recognition method based on MFCC linear emotion compensation whose main steps are:
1) pre-processing of the speech signal: mainly sampling and quantization, pre-emphasis and windowing;
2) feature extraction on each speech frame: the cepstral feature MFCC and the fundamental frequency are extracted from the speaker's speech; according to whether a fundamental frequency exists, the signal stream is divided into voiced and unvoiced segments, and any frame judged to be unvoiced is discarded;
3) the MFCC of each frame is linearly compensated according to the change of its fundamental frequency; during this step the compensation coefficient is adjusted repeatedly until the probability of the maximum-likelihood estimate in the EM algorithm is maximal, and the compensation coefficient is determined accordingly;
4) the MFCC are compensated with the coefficient that maximizes the maximum-likelihood probability, and the model is trained on the compensated speech features;
5) recognition: after the speech to be tested is input, feature extraction yields a feature-vector sequence; this sequence is fed into the GMM with the corresponding user's model parameters, and a similarity value is obtained and used to score the user.
The technical scheme of the present invention can be further refined. In the cepstral-feature linear compensation, each dimension of the MFCC feature of every frame is corrected using the fundamental frequency of that frame, so that it characterizes the speaker's personal traits as well as possible and reduces the within-speaker feature variation caused by emotion change. The compensation coefficient is the factor describing how fundamental-frequency variation affects the MFCC features during cepstral compensation; the best compensation coefficient is obtained by running the EM algorithm repeatedly. The repeated EM algorithm determines the optimal compensation coefficient by estimating the hidden-state probabilities of the MFCC compensated with different candidate coefficients and selecting the coefficient that maximizes the probability as the one used to train the model.
The beneficial effect of the present invention is that the cepstral feature compensation based on fundamental frequency exploits the variation pattern of prosodic features in emotional speech, so that the compensated MFCC features of emotional speech become more stable speaker characteristics and the within-speaker speech differences caused by emotion are reduced as far as possible. The best compensation coefficient is selected by calling the EM algorithm repeatedly during training of the Gaussian mixture model (GMM); in this way the coefficient that best describes the relation between fundamental-frequency variation and the original MFCC features can be found.
Description of drawings
Fig. 1 is the process of linear compensation EM training algorithm of the present invention;
Fig. 2 is an algorithm flow chart of the present invention;
Embodiment
The invention is further described below with reference to the drawings and an embodiment. The method of the present invention is divided into the following steps.
The first step: voice signal pre-service
1, Sampling and quantization
A) Filter the speech signal with a sharp anti-aliasing filter so that its Nyquist frequency F_N is 4 kHz;
B) Set the speech sampling rate F = 2F_N;
C) Sample the analog speech signal s_a(t) at this rate to obtain the amplitude sequence of the digital speech signal, s(n) = s_a(n/F);
D) Quantize s(n) using pulse code modulation (PCM) to obtain the quantized amplitude sequence s'(n).
2, Pre-emphasis
A) Set the pre-emphasis factor a in the digital filter with transfer function H(z) = 1 - a z^{-1}; a takes a value slightly smaller than 1;
B) Pass s'(n) through this filter to obtain a sequence s''(n) whose low-, mid- and high-frequency amplitudes are balanced.
3, Windowing
A) Compute the frame length N of a speech frame, which must satisfy 20 ms ≤ N/F ≤ 30 ms, where F is the speech sampling rate in Hz;
B) With frame length N and frame shift N/2, divide s''(n) into a sequence of speech frames F_m, each containing N samples;
C) Compute the Hamming window function:
\omega(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad n = 0, 1, \ldots, N-1
D) Apply the Hamming window to each speech frame F_m:
F_m'(n) = \omega(n) \times F_m(n), \quad n = 0, 1, \ldots, N-1.
Second step: feature extraction
Feature extraction on the speech frame comprises the extraction of fundamental frequency (pitch) and Mel cepstrum coefficient (MFCC).
1, fundamental frequency (pitch):
A) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
B) Set the valid range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
C) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
D) For each candidate frequency f, compute the subharmonic-to-harmonic ratio (SHR):
SHR = SS / SH
where SS = \sum_{n=1}^{N} X\left(\left(n - \tfrac{1}{2}\right) f\right), \quad SH = \sum_{n=1}^{N} X(n f), \quad N = f_{ceiling} / f;
E) Find the frequency f_1 at which the SHR is highest;
F) If f_1 > f_max, or if SS - SH < 0 at f_1, the frame is considered non-speech and the fundamental frequency is 0 (Pitch = 0);
G) In the interval [1.9375 f_1, 2.0625 f_1], find the frequency f_2 of the local maximum of the SHR;
H) If f_2 > f_max, or the SHR at f_2 is greater than 0.2, then Pitch = f_1;
I) Otherwise, Pitch = f_2;
J) Verify the resulting fundamental frequency by autocorrelation: starting from the midpoint of the frame, take segments of length 1/Pitch before and after it and compute their correlation value C; if C < 0.2 the pitch value is considered unreliable and Pitch = 0;
K) Finally, apply median smoothing filtering to the whole sequence of Pitch values.
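For illustration, a simplified Python sketch of the SHR pitch decision for a single windowed frame, following steps A)-I) above; the 1 Hz candidate grid and the linear interpolation of the magnitude spectrum are implementation assumptions, and the autocorrelation check J) and median smoothing K) are omitted for brevity.

```python
import numpy as np

def shr_pitch(frame, fs, f_floor=50.0, f_ceiling=1250.0, f_max=550.0):
    """Pitch of one windowed frame via the subharmonic-to-harmonic ratio (SHR)."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)

    def X(f):
        # Magnitude spectrum at an arbitrary frequency, by linear interpolation.
        return np.interp(f, freqs, spectrum)

    def ss_sh(f):
        # SS sums the spectrum at half-odd multiples of f, SH at integer multiples.
        n = np.arange(1, int(f_ceiling / f) + 1)
        return np.sum(X((n - 0.5) * f)), np.sum(X(n * f))

    candidates = np.arange(f_floor, f_ceiling, 1.0)        # 1 Hz grid (a choice)
    ratios = np.array([ss / sh if sh > 0 else 0.0
                       for ss, sh in (ss_sh(f) for f in candidates)])

    f1 = candidates[np.argmax(ratios)]                     # step E)
    ss1, sh1 = ss_sh(f1)
    if f1 > f_max or ss1 - sh1 < 0:                        # step F): non-speech frame
        return 0.0

    mask = (candidates >= 1.9375 * f1) & (candidates <= 2.0625 * f1)   # step G)
    if not mask.any():
        return f1
    f2 = candidates[mask][np.argmax(ratios[mask])]
    if f2 > f_max or ratios[mask].max() > 0.2:             # step H)
        return f1
    return f2                                              # step I)
```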
2, Extraction of the MFCC:
A) Set the order p of the Mel cepstral coefficients;
B) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
C) Compute the Mel-domain scale:
M_i = \frac{i}{p} \times 2595 \log_{10}\left(1 + \frac{8000/2.0}{700.0}\right), \quad i = 0, 1, 2, \ldots, p
D) Compute the corresponding frequency-domain scale:
f_i = 700 \times \left(e^{\frac{M_i \ln 10}{2595}} - 1\right), \quad i = 0, 1, 2, \ldots, p
E) Compute the log energy spectrum on each Mel-domain channel \phi_j:
E_j = \sum_{k=0}^{K/2 - 1} \phi_j(k) \, |X(k)|^2
where \sum_{k=0}^{K/2 - 1} \phi_j(k) = 1;
F) Apply the discrete cosine transform (DCT).
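A compact Python sketch of steps A)-F) for one windowed frame. The triangular channel shape, the use of p + 2 boundary points to obtain p channels, and the unnormalized DCT-II are assumptions; the text above only fixes the Mel/frequency mappings and requires each channel's weights to sum to one.

```python
import numpy as np

def mfcc_frame(frame, fs=8000, p=16, n_fft=512):
    """Mel cepstral coefficients of one windowed frame (sketch of steps A-F)."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # |X(k)|^2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    # Mel-domain scale and its frequency-domain counterpart (formulas above);
    # p + 2 boundary points are used here so that p triangular channels result.
    i = np.arange(p + 2)
    mel = i / (p + 1) * 2595.0 * np.log10(1.0 + (fs / 2.0) / 700.0)
    f = 700.0 * (np.exp(mel * np.log(10.0) / 2595.0) - 1.0)

    log_energy = np.empty(p)
    for j in range(1, p + 1):
        left, center, right = f[j - 1], f[j], f[j + 1]
        rise = (freqs - left) / (center - left)
        fall = (right - freqs) / (right - center)
        phi = np.clip(np.minimum(rise, fall), 0.0, None)     # triangular channel
        if phi.sum() > 0:
            phi /= phi.sum()                                  # sum_k phi_j(k) = 1
        log_energy[j - 1] = np.log(phi @ power + 1e-12)       # log channel energy E_j

    # DCT-II of the log channel energies gives the cepstral coefficients.
    n = np.arange(p)
    basis = np.cos(np.pi * np.outer(n, n + 0.5) / p)
    return basis @ log_energy
```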
The 3rd step, cepstrum feature compensation
1, Aligning the cepstral features and the fundamental frequency
The voiced signal is a quasi-periodic signal whose repetition rate is the fundamental frequency. According to whether a fundamental frequency exists, the signal stream is divided into voiced and unvoiced segments; any frame judged to be unvoiced is discarded.
2, Determining the optimal compensation coefficient with the EM algorithm
For each candidate compensation coefficient α_k, the latent-state probability calculation below is repeated, and the coefficient giving the highest probability is selected as the optimal compensation coefficient.
A) Linear compensation of the cepstral features of each frame with coefficient α_k
Let x(t) be the cepstral feature at time t, Y(t) the fundamental frequency at time t, x_opt(t) the compensated cepstral feature, and E(Y(t)) the average fundamental frequency:
x_{opt}(t) = x(t) - \alpha_k \times \frac{|Y(t) - E(Y(t))|}{|E(Y(t))|}
B) Estimate the latent-state probabilities
P_i' = \frac{\sum_{t=1}^{T} T_t(i)}{\sum_{t=1}^{T} \sum_{i=1}^{M} T_t(i)} = \frac{1}{T} \sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)
where
P(i_t = i \mid z_t, \lambda) = \frac{P_i \, p(z_t \mid i_t = i, \lambda)}{p(z_t \mid \lambda)} = \frac{P_i \, b_i(z_t)}{\sum_{i=1}^{M} P_i \, b_i(z_t)}
C) Repeat the calculation until the coefficient \hat{\alpha} is found that satisfies
\hat{\alpha} = \arg\max_{\alpha} \{ P(i_t = i \mid z_t, \lambda) \}
D) Estimate the GMM parameters P_i', \mu_i' and R_i' (i.e. \lambda') with the local maximum criterion:
\mu_i' = \frac{\sum_{t=1}^{T} T_t(i) \, z_t}{\sum_{t=1}^{T} T_t(i)} = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, z_t}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}
R_i' = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, (z_t - \mu_i')^T (z_t - \mu_i')}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}
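A hedged sketch of steps A)-D) as a search over candidate coefficients: each candidate α_k is applied to the voiced-frame MFCC, a GMM is re-estimated on the compensated features, and the candidate with the highest likelihood is kept. The candidate grid, the use of scikit-learn's GaussianMixture as a stand-in for the EM re-estimation formulas above, and the application of the same scalar shift to every MFCC dimension are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def compensate(mfcc, pitch, alpha):
    """x_opt(t) = x(t) - alpha * |Y(t) - E(Y(t))| / |E(Y(t))| on voiced frames."""
    voiced = pitch > 0
    mean_pitch = pitch[voiced].mean()                       # E(Y(t))
    shift = alpha * np.abs(pitch[voiced] - mean_pitch) / abs(mean_pitch)
    return mfcc[voiced] - shift[:, None]                    # same shift on each dim

def best_alpha(mfcc, pitch, candidate_alphas=np.linspace(0.0, 1.0, 11), n_mix=32):
    """Pick the compensation coefficient whose compensated features score highest."""
    best = (None, -np.inf, None)
    for alpha in candidate_alphas:
        z = compensate(mfcc, pitch, alpha)
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                              max_iter=50, random_state=0).fit(z)
        score = gmm.score(z)                                # mean log-likelihood
        if score > best[1]:
            best = (alpha, score, gmm)
    return best                                             # (alpha_hat, score, GMM)
```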
The 4th step, training
Each speaker's speech features form a specific distribution in feature space, and the distribution of the compensated features describes the speaker's individuality better. A Gaussian mixture model (GMM) approximates the speaker's feature distribution with a linear combination of several Gaussian distributions.
The probability density function has the same functional form for every speaker; only the parameters of the function differ. An M-th order Gaussian mixture model GMM describes the distribution of the frame features in feature space with a linear combination of M single Gaussian distributions, that is:
p(x) = \sum_{i=1}^{M} P_i \, b_i(x)
b_i(x) = N(x, \mu_i, R_i) = \frac{1}{(2\pi)^{p/2} |R_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T R_i^{-1} (x - \mu_i) \right\}
where p is the feature dimension, b_i(x) is the kernel function, namely a Gaussian distribution with mean vector \mu_i and covariance matrix R_i, and M (typically 16 or 32) is the order of the GMM, fixed as an integer before the speaker model is built. \lambda = \{P_i, \mu_i, R_i \mid i = 1, 2, \ldots, M\} is the parameter set of the speaker's GMM feature distribution. As the weighting coefficients of the Gaussian mixture, the P_i must satisfy
\int_{-\infty}^{+\infty} p(x \mid \lambda) \, dx = 1
that is,
\sum_{i=1}^{M} P_i = 1
Because computing p(x) in the GMM requires inverting the p × p matrices R_i (i = 1, 2, ..., M), the computational load is large. For this reason, R_i is set to be a diagonal matrix, so that the matrix inversion reduces to taking element-wise reciprocals and the computation is faster.
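As an illustration of why the diagonal restriction helps, here is a numpy-only sketch of evaluating log p(x) for a diagonal-covariance GMM; the array shapes (weights (M,), means and variances (M, p)) are my assumption.

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """log p(x) for a diagonal-covariance GMM: p(x) = sum_i P_i * N(x; mu_i, R_i).

    With diagonal R_i, the matrix inverse in b_i(x) becomes an element-wise
    reciprocal of the variances, which is the speed-up mentioned above.
    """
    p = x.shape[-1]
    diff = x[None, :] - means                              # (M, p)
    inv_var = 1.0 / variances                              # reciprocal, not inverse
    log_det = np.sum(np.log(variances), axis=1)            # log|R_i| for diagonal R_i
    exponent = -0.5 * np.sum(diff * diff * inv_var, axis=1)
    log_b = -0.5 * p * np.log(2 * np.pi) - 0.5 * log_det + exponent
    return np.logaddexp.reduce(np.log(weights) + log_b)    # log sum_i P_i b_i(x)
```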
The 5th step, identification
After the speech to be tested is input, feature extraction yields a feature-vector sequence. This sequence is fed into the GMM with the corresponding user's model parameters, and the resulting similarity value is used to score the user.
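A minimal sketch of this identification step, assuming one trained GMM per enrolled speaker with a .score() method returning the average log-likelihood (as scikit-learn's GaussianMixture provides); the dictionary interface is an assumption for the sketch.

```python
def identify(test_features, speaker_gmms):
    """Score a test feature sequence against each enrolled speaker's GMM.

    `speaker_gmms` maps a speaker id to a fitted model exposing .score();
    the best-matching speaker and all scores are returned.
    """
    scores = {spk: gmm.score(test_features) for spk, gmm in speaker_gmms.items()}
    return max(scores, key=scores.get), scores
```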
Experimental result
The system was tested on the Emotional Prosody Speech corpus. This corpus is an emotional speech database built by the Linguistic Data Consortium (LDC) according to its database standards for research on the pronunciation characteristics of different emotional speech. It was recorded by 7 professional actors (3 male and 4 female target speakers) reading aloud a series of given English utterances, mainly dates and numbers, covering 14 different emotion categories. The recording method was to let the actors express each emotion with different tones, intonations and speaking rates. The recording length per speaker and emotion varies between roughly 10 and 40 seconds, with only a few reaching 50 seconds; the total recording length per speaker is about 5 to 6 minutes.
We designed and ran two groups of experiments on this corpus. The first group is a baseline experiment with the classical MFCC-GMM approach: the model is trained on cepstral features without any compensation, and the GMM is trained with the ordinary EM algorithm. This group serves as the control group.
In the second group, the cepstral features are linearly compensated, repeated EM estimation is used to select the best compensation coefficient, and the corrected MFCC feature vectors are used to train the GMM model.
For performance assessment, the equal error rate (EER) and the identification rate (IR) are used as evaluation criteria for the speaker recognition system.
Computing the EER requires two other evaluation indices:
(1) false acceptance rate (FA): the number of utterances falsely accepted divided by the total number of utterances that should be rejected, giving the speaker verification false acceptance rate;
(2) false rejection rate (FR): the number of utterances falsely rejected divided by the total number of utterances that should be accepted, giving the speaker verification false rejection rate.
When FA = FR, or when |FA - FR| < δ (δ < 0.0001), the system's equal error rate (EER) is obtained, i.e. EER = FA or EER = FR.
The identification rate IR is computed as:
IR = (number of correctly identified test utterances) / (total number of test utterances) × 100%
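A hedged sketch of how EER and IR might be computed from verification scores and identification decisions; the genuine/impostor score arrays and label arrays are illustrative inputs, not data from the experiments.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores, delta=1e-4):
    """Sweep a decision threshold until FA and FR coincide (|FA - FR| < delta)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, best_eer = np.inf, None
    for th in thresholds:
        fa = np.mean(impostor_scores >= th)   # falsely accepted / should-be-rejected
        fr = np.mean(genuine_scores < th)     # falsely rejected / should-be-accepted
        if abs(fa - fr) < delta:
            return fa                          # EER = FA = FR (within delta)
        if abs(fa - fr) < best_gap:
            best_gap, best_eer = abs(fa - fr), 0.5 * (fa + fr)
    return best_eer

def identification_rate(predicted_ids, true_ids):
    """IR: fraction of test utterances whose speaker is identified correctly."""
    return float(np.mean(np.asarray(predicted_ids) == np.asarray(true_ids)))
```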
The experiment parameters were set as follows:
Window length: 32 ms
Frame shift: 16 ms
Pre-emphasis: 0.97
MFCC dimension: 16 MFCC + delta
GMM order: 32
The experimental results are as follows:
Method                 EER (%)   IR (%)
Benchmark experiment   32.41     62.94
This method            29.92     73.04
The results broken down by emotion are given in the table below, relative to the benchmark experiment; "+" indicates that the value increases and "-" that it decreases:
Affective state    Relative EER (%)   Relative IR (%)
Elation            -4.30              +6.29
Panic              -10.76             +19.86
Hot anger          -3.60              +9.35
Disgust            -3.70              +15.56
Cold anger         -1.92              +12.82
Anxiety            -3.92              +8.82
Interest           -1.41              +5.09
Despair            -2.79              +5.78
Contempt           -1.02              +10.0
Sadness            -3.53              +15.23
Pride              -2.76              +5.96
Shame              -1.35              +11.49
Boredom            -0.00              +10.39
Neutral            -0.00              +6.25
The experimental machine was configured with an AMD Athlon(tm) XP 2500+ CPU and 512 MB of DDR400 memory.
The experimental results show that the feature compensation method makes the cepstral features describe the speaker's individual information better, thereby improving speaker recognition performance: the equal error rate decreases and the identification rate increases. The experiments on the emotional corpus also show that the method works well for all of the emotional states considered.

Claims (6)

1. A speaker recognition method based on MFCC linear emotion compensation, characterized in that its main steps are:
1) pre-processing of the speech signal: mainly sampling and quantization, pre-emphasis and windowing;
2) feature extraction on each speech frame: the cepstral feature MFCC and the fundamental frequency are extracted from the speaker's speech; according to whether a fundamental frequency exists, the signal stream is divided into voiced and unvoiced segments, and any frame judged to be unvoiced is discarded;
3) the MFCC of each frame is linearly compensated according to the change of its fundamental frequency; during this step the compensation coefficient is adjusted repeatedly until the probability of the maximum-likelihood estimate in the EM algorithm is maximal, and the compensation coefficient is determined accordingly;
4) the MFCC are compensated with the coefficient that maximizes the maximum-likelihood probability, and the model is trained on the compensated speech features;
5) recognition: after the speech to be tested is input, feature extraction yields a feature-vector sequence; this sequence is fed into the GMM with the corresponding user's model parameters, and a similarity value is obtained and used to score the user.
2. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that in the cepstral-feature linear compensation each dimension of the MFCC feature of every frame is corrected using the fundamental frequency of that frame, so that it characterizes the speaker's personal traits as well as possible.
3. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that the compensation coefficient is the factor describing how fundamental-frequency variation affects the MFCC features during cepstral compensation, and the best compensation coefficient can be obtained by running the EM algorithm repeatedly.
4. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that the repeated EM algorithm determines the optimal compensation coefficient by estimating the hidden-state probabilities of the MFCC compensated with different candidate coefficients, and selecting the coefficient that maximizes the probability as the one used when training the model.
5. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, characterized in that the feature extraction on each speech frame comprises extraction of the fundamental frequency (pitch) and of the Mel cepstral coefficients (MFCC);
1) Fundamental frequency:
A) Set the search range of the fundamental frequency: f_floor = 50 Hz, f_ceiling = 1250 Hz;
B) Set the valid range of the speech fundamental frequency: f_min = 50 Hz, f_max = 550 Hz;
C) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
D) For each candidate frequency f, compute the subharmonic-to-harmonic ratio (SHR):
SHR = SS / SH
where SS = \sum_{n=1}^{N} X\left(\left(n - \tfrac{1}{2}\right) f\right), \quad SH = \sum_{n=1}^{N} X(n f), \quad N = f_{ceiling} / f;
E) Find the frequency f_1 at which the SHR is highest;
F) If f_1 > f_max, or if SS - SH < 0 at f_1, the frame is considered non-speech or silent, and the fundamental frequency Pitch = 0;
G) In the interval [1.9375 f_1, 2.0625 f_1], find the frequency f_2 of the local maximum of the SHR;
H) If f_2 > f_max, or the SHR at f_2 is greater than 0.2, then Pitch = f_1;
I) Otherwise, Pitch = f_2;
J) Verify the resulting fundamental frequency by autocorrelation: starting from the midpoint of the frame, take segments of length 1/Pitch before and after it and compute their correlation value C; if C < 0.2 the pitch value is considered unreliable and Pitch = 0;
K) Finally, apply median smoothing filtering to the whole sequence of Pitch values;
2) Extraction of the MFCC:
A) Set the order p of the Mel cepstral coefficients;
B) Apply the fast Fourier transform (FFT) to convert the time-domain signal s(n) into the frequency-domain signal X(k);
C) Compute the Mel-domain scale:
M_i = \frac{i}{p} \times 2595 \log_{10}\left(1 + \frac{8000/2.0}{700.0}\right), \quad i = 0, 1, 2, \ldots, p
D) Compute the corresponding frequency-domain scale:
f_i = 700 \times \left(e^{\frac{M_i \ln 10}{2595}} - 1\right), \quad i = 0, 1, 2, \ldots, p
E) Compute the log energy spectrum on each Mel-domain channel \phi_j:
E_j = \sum_{k=0}^{K/2 - 1} \phi_j(k) \, |X(k)|^2
where \sum_{k=0}^{K/2 - 1} \phi_j(k) = 1;
F) Apply the discrete cosine transform (DCT).
6. The speaker recognition method based on MFCC linear emotion compensation according to claim 1, 2, 3 or 4, characterized in that the optimal compensation coefficient is determined by the EM algorithm: for the different candidate compensation coefficients α_k, the latent-state probability calculation below is repeated in order to obtain the optimal compensation coefficient;
A) Linear compensation of the cepstral features of each frame with coefficient α_k
Let x(t) be the cepstral feature at time t, Y(t) the fundamental frequency at time t, x_opt(t) the compensated cepstral feature, and E(Y(t)) the average fundamental frequency:
x_{opt}(t) = x(t) - \alpha_k \times \frac{|Y(t) - E(Y(t))|}{|E(Y(t))|}
B) Estimate the latent-state probabilities
P_i' = \frac{\sum_{t=1}^{T} T_t(i)}{\sum_{t=1}^{T} \sum_{i=1}^{M} T_t(i)} = \frac{1}{T} \sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)
where
P(i_t = i \mid z_t, \lambda) = \frac{P_i \, p(z_t \mid i_t = i, \lambda)}{p(z_t \mid \lambda)} = \frac{P_i \, b_i(z_t)}{\sum_{i=1}^{M} P_i \, b_i(z_t)}
C) Repeat the calculation until the coefficient \hat{\alpha} is found that satisfies
\hat{\alpha} = \arg\max_{\alpha} \{ P(i_t = i \mid z_t, \lambda) \}
D) Estimate the GMM parameters P_i', \mu_i' and R_i' (i.e. \lambda') with the local maximum criterion:
\mu_i' = \frac{\sum_{t=1}^{T} T_t(i) \, z_t}{\sum_{t=1}^{T} T_t(i)} = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, z_t}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}
R_i' = \frac{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda) \, (z_t - \mu_i')^T (z_t - \mu_i')}{\sum_{t=1}^{T} P(i_t = i \mid z_t, \lambda)}.
CNB2005100613603A 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation Expired - Fee Related CN100440315C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100613603A CN100440315C (en) 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005100613603A CN100440315C (en) 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation

Publications (2)

Publication Number Publication Date
CN1758332A true CN1758332A (en) 2006-04-12
CN100440315C CN100440315C (en) 2008-12-03

Family

ID=36703669

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100613603A Expired - Fee Related CN100440315C (en) 2005-10-31 2005-10-31 Speaker recognition method based on MFCC linear emotion compensation

Country Status (1)

Country Link
CN (1) CN100440315C (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102201237A (en) * 2011-05-12 2011-09-28 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN1975856B (en) * 2006-10-30 2011-11-09 邹采荣 Speech emotion identifying method based on supporting vector machine
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
WO2013040981A1 (en) * 2011-09-23 2013-03-28 浙江大学 Speaker recognition method for combining emotion model based on near neighbour principles
CN101547261B (en) * 2008-03-27 2013-06-05 富士通株式会社 Association apparatus and association method
CN103594091A (en) * 2013-11-15 2014-02-19 深圳市中兴移动通信有限公司 Mobile terminal and voice signal processing method thereof
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 A kind of speech emotional feature selection approach based on Standard of Environmental Noiseization conversion
CN103943104B (en) * 2014-04-15 2017-03-01 海信集团有限公司 A kind of voice messaging knows method for distinguishing and terminal unit
CN109346087A (en) * 2018-09-17 2019-02-15 平安科技(深圳)有限公司 Fight the method for identifying speaker and device of the noise robustness of the bottleneck characteristic of network
CN109564759A (en) * 2016-08-03 2019-04-02 思睿逻辑国际半导体有限公司 Speaker Identification
CN110931022A (en) * 2019-11-19 2020-03-27 天津大学 Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics
CN111462759A (en) * 2020-04-01 2020-07-28 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
CN111681664A (en) * 2020-07-24 2020-09-18 北京百瑞互联技术有限公司 Method, system, storage medium and equipment for reducing audio coding rate
CN113409762A (en) * 2021-06-30 2021-09-17 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium
CN113567969A (en) * 2021-09-23 2021-10-29 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60213195T8 (en) * 2002-02-13 2007-10-04 Sony Deutschland Gmbh Method, system and computer program for speech / speaker recognition using an emotion state change for the unsupervised adaptation of the recognition method

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1975856B (en) * 2006-10-30 2011-11-09 邹采荣 Speech emotion identifying method based on supporting vector machine
CN101547261B (en) * 2008-03-27 2013-06-05 富士通株式会社 Association apparatus and association method
CN102201237A (en) * 2011-05-12 2011-09-28 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN102201237B (en) * 2011-05-12 2013-03-13 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN102354496A (en) * 2011-07-01 2012-02-15 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN102354496B (en) * 2011-07-01 2013-08-21 中山大学 PSM-based (pitch scale modification-based) speech identification and restoration method and device thereof
CN102324232A (en) * 2011-09-12 2012-01-18 辽宁工业大学 Method for recognizing sound-groove and system based on gauss hybrid models
WO2013040981A1 (en) * 2011-09-23 2013-03-28 浙江大学 Speaker recognition method for combining emotion model based on near neighbour principles
CN103594091B (en) * 2013-11-15 2017-06-30 努比亚技术有限公司 A kind of mobile terminal and its audio signal processing method
CN103594091A (en) * 2013-11-15 2014-02-19 深圳市中兴移动通信有限公司 Mobile terminal and voice signal processing method thereof
CN103943104B (en) * 2014-04-15 2017-03-01 海信集团有限公司 A kind of voice messaging knows method for distinguishing and terminal unit
CN105679321A (en) * 2016-01-29 2016-06-15 宇龙计算机通信科技(深圳)有限公司 Speech recognition method and device and terminal
CN109564759B (en) * 2016-08-03 2023-06-09 思睿逻辑国际半导体有限公司 Speaker identification
CN109564759A (en) * 2016-08-03 2019-04-02 思睿逻辑国际半导体有限公司 Speaker Identification
US11735191B2 (en) 2016-08-03 2023-08-22 Cirrus Logic, Inc. Speaker recognition with assessment of audio frame contribution
CN106297823A (en) * 2016-08-22 2017-01-04 东南大学 A kind of speech emotional feature selection approach based on Standard of Environmental Noiseization conversion
CN109346087A (en) * 2018-09-17 2019-02-15 平安科技(深圳)有限公司 Fight the method for identifying speaker and device of the noise robustness of the bottleneck characteristic of network
CN109346087B (en) * 2018-09-17 2023-11-10 平安科技(深圳)有限公司 Noise robust speaker verification method and apparatus against bottleneck characteristics of a network
CN110931022A (en) * 2019-11-19 2020-03-27 天津大学 Voiceprint identification method based on high-frequency and low-frequency dynamic and static characteristics
CN110931022B (en) * 2019-11-19 2023-09-15 天津大学 Voiceprint recognition method based on high-low frequency dynamic and static characteristics
CN111462759A (en) * 2020-04-01 2020-07-28 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
CN111462759B (en) * 2020-04-01 2024-02-13 科大讯飞股份有限公司 Speaker labeling method, device, equipment and storage medium
CN111681664A (en) * 2020-07-24 2020-09-18 北京百瑞互联技术有限公司 Method, system, storage medium and equipment for reducing audio coding rate
CN113409762A (en) * 2021-06-30 2021-09-17 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium
CN113409762B (en) * 2021-06-30 2024-05-07 平安科技(深圳)有限公司 Emotion voice synthesis method, emotion voice synthesis device, emotion voice synthesis equipment and storage medium
CN113567969A (en) * 2021-09-23 2021-10-29 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals
CN113567969B (en) * 2021-09-23 2021-12-17 江苏禹治流域管理技术研究院有限公司 Illegal sand dredger automatic monitoring method and system based on underwater acoustic signals

Also Published As

Publication number Publication date
CN100440315C (en) 2008-12-03

Similar Documents

Publication Publication Date Title
CN1758332A (en) Speaker recognition method based on MFCC linear emotion compensation
CN101178897B (en) Speaking man recognizing method using base frequency envelope to eliminate emotion voice
US8930185B2 (en) Speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program
Shanthi et al. Review of feature extraction techniques in automatic speech recognition
CN1787075A (en) Method for distinguishing speek speek person by supporting vector machine model basedon inserted GMM core
CN110265063B (en) Lie detection method based on fixed duration speech emotion recognition sequence analysis
CN1787076A (en) Method for distinguishing speek person based on hybrid supporting vector machine
Torres-Boza et al. Hierarchical sparse coding framework for speech emotion recognition
CN100543840C (en) Method for distinguishing speek person based on emotion migration rule and voice correction
CN113111151A (en) Cross-modal depression detection method based on intelligent voice question answering
Quan et al. Reduce the dimensions of emotional features by principal component analysis for speech emotion recognition
Kandali et al. Vocal emotion recognition in five native languages of Assam using new wavelet features
Pao et al. Detecting emotions in Mandarin speech
Zheng et al. An improved speech emotion recognition algorithm based on deep belief network
Houari et al. Study the Influence of Gender and Age in Recognition of Emotions from Algerian Dialect Speech.
Meftah et al. Emotional speech recognition: A multilingual perspective
Lu et al. Physiological feature extraction for text independent speaker identification using non-uniform subband processing
Palo et al. Emotion Analysis from Speech of Different Age Groups.
Rao et al. Glottal excitation feature based gender identification system using ergodic HMM
Patil et al. A review on emotional speech recognition: resources, features, and classifiers
Hamiditabar et al. Determining the severity of depression in speech based on combination of acoustic-space and score-space features
Julia et al. Detection of emotional expressions in speech
Kexin et al. Research on Emergency Parking Instruction Recognition Based on Speech Recognition and Speech Emotion Recognition
Tanprasert et al. Comparative study of GMM, DTW, and ANN on Thai speaker identification system
Fernandez et al. Exploiting vocal-source features to improve ASR accuracy for low-resource languages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081203

Termination date: 20211031