CN106504772B - Speech-emotion recognition method based on weights of importance support vector machine classifier - Google Patents

Speech-emotion recognition method based on weights of importance support vector machine classifier

Info

Publication number
CN106504772B
CN106504772B CN201610969948.7A CN201610969948A
Authority
CN
China
Prior art keywords
frame
sample
weights
importance
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610969948.7A
Other languages
Chinese (zh)
Other versions
CN106504772A (en
Inventor
黄永明
吴奥
章国宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610969948.7A priority Critical patent/CN106504772B/en
Publication of CN106504772A publication Critical patent/CN106504772A/en
Application granted granted Critical
Publication of CN106504772B publication Critical patent/CN106504772B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L 25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a speech emotion recognition method based on an importance-weighted support vector machine (SVM) classifier. The method comprises quantifying the deviation between training samples and test samples, building an importance-weight coefficient model, and constructing an SVM based on the importance-weight coefficients. The deviation between training and test samples is quantified on the basis of the importance-weight coefficients, so that the deviation can be compensated at the classifier level. By constructing an importance-weight model for the training and test samples whose distributions differ in emotion classification, the invention quantifies the covariate shift between training and test speech samples; an SVM classifier built on the importance-weight model then adjusts the separating hyperplane, compensating the deviation at the classifier level and improving the accuracy and stability of speech emotion recognition.

Description

Speech emotion recognition method based on an importance-weighted support vector machine classifier
Technical field
The present invention relates to a speech emotion recognition method based on an importance-weighted support vector machine classifier, and belongs to the technical field of speech emotion recognition.
Background technique
With the rapid development of information technology and the rise of intelligent terminals, existing human-computer interaction systems face increasingly severe tests. To overcome the obstacles of human-computer interaction and make it more convenient and natural, the emotional intelligence of machines has received growing attention from researchers in many fields. Speech, an efficient interaction medium with great development potential, carries rich emotional information. Speech emotion recognition, an important research topic in affective computing, has broad application prospects in distance education, lie-detection assistance, automated call centers, clinical medicine, intelligent toys, and smartphones, and has attracted extensive attention from research institutions and researchers.
In practical speech emotion recognition, training and test samples are collected at different times and in different environments, so a covariate shift exists between them. To improve the precision and robustness of speech emotion recognition, it is essential to compensate for this deviation. Eliminating the deviation introduced by the recording environment, removing redundancy such as emotion-irrelevant linguistic content from the raw speech data, and extracting effective emotion information are the key points and difficulties in improving the robustness of speech emotion recognition systems.
As an emerging technique in speech signal processing, the importance-weight coefficient model has attracted increasing attention from researchers because of its flexibility and effectiveness. For classification problems, quantifying the deviation between training and test samples on the basis of importance-weight coefficients, and then adjusting for this deviation at the classifier level, reduces the influence of environmental factors on speech emotion recognition and improves its accuracy and stability. Compensating for the covariate shift between training and test samples at the classifier level is therefore of great significance in speech emotion recognition research.
Summary of the invention
Technical problem: the present invention provides a speech emotion recognition method based on an importance-weighted support vector machine classifier that improves the robustness of speech emotion recognition by compensating, at the classifier level, for the covariate shift between training samples and test samples. The method reduces the influence of information irrelevant to recognition, such as the recording environment and the speaker, and improves the precision and robustness of speech emotion recognition.
Technical solution: the speech emotion recognition method of the invention, based on an importance-weighted support vector machine classifier, comprises the following steps:
Step 1: pre-process the input speech signal and extract the feature vector d_i.
Step 2: divide the input sample set into a training sample set {x_i^tr} (i = 1, ..., n_tr) and a test sample set {x_j^te} (j = 1, ..., n_te), and randomly select b template points c_l from the test sample set, forming {c_l} (l = 1, ..., b), where x_i^tr is a sample of the training set, x_j^te is a sample of the test set, n_tr is the number of training samples, n_te is the number of test samples, i is the training-sample index, j is the test-sample index, and l is the index of a template point selected from the test set.
Step 3: compute the optimal Gaussian kernel width σ̂ of the basis functions, as follows.
Step 3.1: set the preset basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1.
Step 3.2: compute the pre-compensation parameter vector α according to the following procedure.
Step 3.2.1: with the Gaussian basis functions φ_l(x) = exp(−||x − c_l||² / (2σ²)), compute the b × b matrix Ĥ whose elements are
  Ĥ_{l,l'} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(x_i^tr) φ_{l'}(x_i^tr),
where l, l' = 1, 2, ..., b, c_{l'} is a point of the randomly selected template set {c_l}, and l' is the index of a randomly selected template point.
Step 3.2.2: compute the b-dimensional vector ĥ whose elements are
  ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(x_j^te).
Step 3.2.3: compute the pre-compensation parameter vector α: under the constraint α ≥ 0, solve the optimization problem min_α Ĵ(α), i.e. find the value of the parameter vector α that minimizes
  Ĵ(α) = (1/2) α' Ĥ α − ĥ' α,
where Ĵ(α) is the approximate expected squared error of the importance weights, α' is the transpose of the vector α, and ĥ' is the transpose of the vector ĥ.
Step 3.3: select the optimal basis-function Gaussian kernel width σ̂ by cross-validation.
Divide the training sample set {x_i^tr} and the test sample set {x_j^te} into R subsets {X_r^tr} and {X_r^te}, respectively, and compute the approximate expected squared error of the importance weights on the r-th fold as
  Ĵ_r = (1/(2 n_r^tr)) Σ_{s^tr ∈ X_r^tr} β̂(s^tr)² − (1/n_r^te) Σ_{s^te ∈ X_r^te} β̂(s^te),
where Ĵ_r is the approximate expected squared error on the r-th fold, r = 1, 2, ..., R, X_r^tr is the r-th training subset, X_r^te is the r-th test subset, n_r^tr and n_r^te are the numbers of samples in X_r^tr and X_r^te, s^tr is a sample of X_r^tr, s^te is a sample of X_r^te, and β̂(s^tr) and β̂(s^te) are the importance-weight estimates of these samples, computed as
  β̂(s) = Σ_{l=1}^{b} α_l φ_l(s),
where α_l is the l-th element of the pre-compensation parameter vector α obtained in step 3.2.3.
Substitute each of the 10 preset values σ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 in turn and compute the cross-validation score of the importance weights,
  Ĵ^CV = (1/R) Σ_{r=1}^{R} Ĵ_r,  r = 1, 2, ..., R;
take the σ giving the smallest Ĵ^CV as the optimal basis-function Gaussian kernel width σ̂.
Step 4: under the constraint α ≥ 0, solve the optimization problem min_α Ĵ(α) again, now with Ĥ and ĥ computed using the optimal kernel width σ̂, i.e.
  Ĥ_{l,l'} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(x_i^tr) φ_{l'}(x_i^tr),  ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(x_j^te),  l, l' = 1, 2, ..., b,
to obtain the optimal parameter vector α̂, where Ĥ_{l,l'} is the element in row l and column l' of the matrix Ĥ and ĥ_l is the l-th element of the vector ĥ.
Step 5: compute the importance weight β(s) by
  β(s) = Σ_{l=1}^{b} α̂_l φ_l(s),
where α̂_l is the l-th element of the optimal parameter vector α̂, s is a sample among the training and test sample points, and s ∈ D, D being the set of training and test sample points.
Step 6: establish the importance-weighted SVM classifier.
Using the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier gives the SVM classifier expression
  min_{w, b, ξ} (1/2)||w||² + C Σ_{i=1}^{L} β_i ξ_i,
which, together with the constraints
  y_i(<w, d_i> + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  1 ≤ i ≤ L,
constitutes the importance-weighted SVM classifier, where w is the normal vector of the separating hyperplane, ||w|| is the norm of w, C is the penalty parameter, d_i is the feature vector extracted from the pre-processed training sample set {x_i^tr}, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), ..., (d_L, y_L), β_i is the importance weight of the training sample point (d_i, y_i), and ξ_i is the slack variable of the training sample point (d_i, y_i).
Step 7: perform speech emotion recognition using the feature vectors extracted in step 1 and the importance-weighted SVM classifier established in step 6.
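As a concrete illustration of step 6 (not part of the patent text), the sketch below trains such a classifier with scikit-learn, whose SVC accepts per-sample weights that rescale the penalty parameter C sample by sample; passing the importance weights β_i as sample_weight therefore reproduces the objective (1/2)||w||² + C Σ_i β_i ξ_i. The data, shapes and function name are placeholders.

```python
# Minimal sketch, assuming scikit-learn; beta holds the importance weights
# beta_i obtained from the model of steps 3-5 (placeholder data below).
import numpy as np
from sklearn.svm import SVC

def train_importance_weighted_svm(D_train, y_train, beta, C=1.0):
    """SVM whose slack penalty for training point i is C * beta_i."""
    clf = SVC(kernel="rbf", C=C)
    # sample_weight rescales C per sample, i.e. the fitted objective is
    # 1/2 * ||w||^2 + C * sum_i beta_i * xi_i, as in step 6.
    clf.fit(D_train, y_train, sample_weight=beta)
    return clf

# Hypothetical shapes: 200 utterances, 384-dimensional statement-level vectors.
rng = np.random.default_rng(0)
D_train = rng.normal(size=(200, 384))            # feature vectors d_i
y_train = rng.integers(0, 2, size=200) * 2 - 1   # labels in {+1, -1}
beta = rng.uniform(0.5, 2.0, size=200)           # importance weights beta_i
model = train_importance_weighted_svm(D_train, y_train, beta)
```

In a multi-class emotion task the same weights would simply be passed to the multi-class SVC, which internally trains binary classifiers of the form above.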
Further, in the method of the present invention, the pre-processing in step 1 comprises the following steps:
Step 1.1: apply pre-emphasis to the digital speech signal X to obtain the pre-emphasized speech signal X̃:
  X̃(ñ) = X(ñ) − μ X(ñ − 1),  X(−1) = 0,
where ñ denotes the discrete-sample index of the digital speech signal X, Ñ is the length of X, X(ñ) and X(ñ − 1) are the values of X at the ñ-th and (ñ − 1)-th samples, X̃(ñ) is the value of the pre-emphasized speech signal X̃ at the ñ-th sample, and μ is the pre-emphasis coefficient.
Step 1.2: divide the pre-emphasized speech signal X̃ into frames by overlapping segmentation. The distance between the starting points of two successive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 samples at the sampling rate F_s = 16 kHz, and each frame is 16 ms long, i.e. 256 samples. Framing yields the speech frame set {x̃_{k'}} (1 ≤ k' ≤ K'), in which the n-th sample of the k'-th speech frame is
  x̃_{k'}(n) = X̃((k' − 1)·128 + n),  0 ≤ n ≤ 255,
where x̃_{k'} is the k'-th speech frame of the set, n is the sample index within a frame, k' is the frame index, and K' is the total number of frames, which satisfies K' = ⌊(Ñ − 256)/128⌋ + 1, ⌊·⌋ denoting rounding down.
Step 1.3: apply a Hamming window w of length 256 to each speech frame x̃_{k'} (1 ≤ k' ≤ K') to obtain the windowed speech frame x_{k'}:
  x_{k'}(n) = x̃_{k'}(n) · w(n),
where x_{k'}(n), x̃_{k'}(n) and w(n) are the values of x_{k'}, x̃_{k'} and w at the n-th sample, and the Hamming window function of length 256 is
  w(n) = 0.54 − 0.46 cos(2πn/255),  0 ≤ n ≤ 255.
Step 1.4: for each windowed speech frame x_{k'}, 1 ≤ k' ≤ K', compute the short-time energy E_{k'} and the short-time zero-crossing rate Z_{k'}:
  E_{k'} = Σ_{n=0}^{255} x_{k'}(n)²,  Z_{k'} = (1/2) Σ_{n=1}^{255} |sgn[x_{k'}(n)] − sgn[x_{k'}(n − 1)]|,
where E_{k'} is the short-time energy of the windowed speech frame x_{k'}, Z_{k'} is its short-time zero-crossing rate, x_{k'}(n) and x_{k'}(n − 1) are the values of x_{k'} at the n-th and (n − 1)-th samples, and sgn[x_{k'}(n)], sgn[x_{k'}(n − 1)] are the sign function applied to these values, that is,
  sgn(λ) = 1 for λ ≥ 0 and sgn(λ) = −1 for λ < 0,
λ being the argument of the sign function.
Step 1.5: determine the short-time energy threshold t_E and the short-time zero-crossing-rate threshold t_Z from the short-time energies and zero-crossing rates of all K' frames, K' being the total number of frames.
Step 1.6: for every windowed speech frame, first make a first-stage decision using the short-time energy: the windowed speech frames whose short-time energy exceeds the threshold t_E are marked as first-stage effective speech frames, the first-stage effective speech frame with the smallest frame index is taken as the start frame of the current effective speech frame set, and the one with the largest frame index is taken as its end frame.
Then make a second-stage decision using the short-time zero-crossing rate: starting from the start frame and moving in order of decreasing frame index, examine the frames one by one and mark the windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as effective speech frames; likewise, starting from the end frame and moving in order of increasing frame index, examine the frames one by one and mark the windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as effective speech frames.
The set of effective speech frames obtained after the two-stage decision is denoted {p_k} (1 ≤ k ≤ K), where k is the effective-speech-frame index, K is the total number of effective speech frames, and p_k is the k-th effective speech frame of the set.
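For illustration only, the following NumPy sketch carries out steps 1.1-1.6 on one utterance. The pre-emphasis coefficient (0.97) and the threshold rule in step 1.5 are assumptions, since the patent's exact formulas for μ, t_E and t_Z are not reproduced in this text; the frame length, frame shift and two-stage decision follow the steps above.

```python
import numpy as np

FS = 16000          # sampling rate F_s = 16 kHz (step 1.2)
FRAME_LEN = 256     # 16 ms frame length
FRAME_SHIFT = 128   # 8 ms frame shift

def preprocess(x, pre_emph=0.97):
    """Steps 1.1-1.6: pre-emphasis, framing, Hamming windowing and
    two-stage energy / zero-crossing-rate endpoint detection."""
    x = np.asarray(x, dtype=float)
    assert len(x) >= FRAME_LEN, "utterance shorter than one frame"

    # Step 1.1: X~(n) = X(n) - mu * X(n-1), with X(-1) = 0
    # (mu = 0.97 is an assumed value; the patent's coefficient is not given here).
    x = np.append(x[0], x[1:] - pre_emph * x[:-1])

    # Step 1.2: overlapping framing.
    n_frames = 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT
    frames = np.stack([x[k * FRAME_SHIFT: k * FRAME_SHIFT + FRAME_LEN]
                       for k in range(n_frames)])

    # Step 1.3: Hamming window of length 256.
    frames = frames * np.hamming(FRAME_LEN)

    # Step 1.4: short-time energy and short-time zero-crossing rate.
    energy = np.sum(frames ** 2, axis=1)
    signs = np.where(frames >= 0.0, 1.0, -1.0)       # sgn(lambda), sgn(0) = +1
    zcr = 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)

    # Step 1.5: thresholds t_E and t_Z (assumed rule: fractions of the means;
    # the patent's exact formulas are not reproduced in this text).
    t_E = 0.1 * energy.mean()
    t_Z = 0.5 * zcr.mean()

    # Step 1.6, first stage: frames with energy above t_E bound the segment.
    voiced = np.where(energy > t_E)[0]
    if voiced.size == 0:
        return frames[:0]
    start, end = int(voiced[0]), int(voiced[-1])

    # Step 1.6, second stage: extend outwards while the ZCR stays above t_Z.
    while start > 0 and zcr[start - 1] > t_Z:
        start -= 1
    while end < n_frames - 1 and zcr[end + 1] > t_Z:
        end += 1
    return frames[start:end + 1]      # effective speech frames p_k
```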
Further, in the method of the present invention, the feature vector d_i in step 1 is extracted as follows:
the frame-level short-time features and the first-order and second-order differences of the short-time features are used as low-level descriptors, and the statement-level features are obtained by computing statistics of these low-level descriptors over the sentence.
The statistical features of a sentence sample take the frame-level short-time features (such as fundamental frequency, frame energy, Mel-frequency cepstral coefficients, and the wavelet-packet cepstral coefficient features proposed herein) as low-level descriptors (LLD); the statement-level feature parameters are obtained by computing statistics of all the short-time features over the sentence.
Statistics commonly used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 Mel-frequency bands, 13th-order Mel-frequency cepstral coefficients, the positions of the maximum and minimum of the Mel correlation spectrum, and the 90%, 75%, 50% and 25% roll-off points of the Mel correlation spectrum.
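Table 1 itself is not reproduced in this text, so the sketch below assumes a typical set of statistical functionals (mean, standard deviation, minimum, maximum, range, skewness, kurtosis) applied to each low-level descriptor and to its first- and second-order differences; it only illustrates how a statement-level vector d_i is assembled from frame-level LLDs.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def statement_level_features(lld):
    """lld: array of shape (n_frames, n_lld) holding the frame-level
    low-level descriptors of one sentence (e.g. F0, log frame energy,
    band energies, 26 Mel-band cepstral energies, 13 MFCCs).
    Returns a single statement-level feature vector d_i."""
    # LLDs plus their first- and second-order differences.
    contours = [lld, np.diff(lld, n=1, axis=0), np.diff(lld, n=2, axis=0)]
    feats = []
    for c in contours:
        feats.extend([
            c.mean(axis=0), c.std(axis=0),
            c.min(axis=0), c.max(axis=0),
            c.max(axis=0) - c.min(axis=0),   # range
            skew(c, axis=0), kurtosis(c, axis=0),
        ])
    return np.concatenate(feats)
```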
Beneficial effects: compared with the prior art, the present invention has the following advantages.
Existing speech emotion recognition methods do not account for the covariate shift that exists between training samples and test samples in practical applications, so the performance of speech emotion recognition in real applications is worse than under laboratory conditions. The present invention establishes an importance-weight coefficient model that explicitly considers the differences between the test samples and the training samples encountered in practice, i.e. it quantifies the covariate shift between training and test samples; the computed importance-weight coefficient β is this quantized value, which directly expresses the deviation between training and test samples. In the subsequent speech emotion feature extraction and classifier construction, the deviation can be compensated through the quantized value β, so that the influence of the recording environment on speech emotion recognition is largely eliminated. Compared with other deviation-compensation methods for speech emotion recognition, building an importance-weight coefficient model to quantify the deviation between training and test samples reduces the computational complexity and difficulty of covariate-shift compensation.
Based on the importance-weight coefficient model, the deviation between training and test samples is compensated in the SVM classifier by introducing the importance-weight coefficients. Compared with other SVM-based recognition methods, this method introduces the importance weights into the objective function of the classical SVM classifier, which is equivalent to using a non-fixed penalty factor: according to the importance-weight coefficients, samples with large weights receive a larger penalty coefficient, and the separating hyperplane is adjusted accordingly. This reduces the influence of environmental factors on speech emotion recognition, improves the accuracy and stability of speech emotion recognition in practical applications, and gives better classification performance than a standard SVM.
Description of the drawings
Fig. 1 is the training flow chart of the importance-weighted SVM of the invention.
Fig. 2 is the importance-weight computation flow chart of the invention.
Specific embodiments
The present invention is further illustrated below with reference to the embodiments and the accompanying drawings.
The speech emotion recognition method of the invention, based on the importance-weighted support vector machine classifier, comprises the following steps:
Step 1: pre-process the input sample set to obtain the pre-processed training sample set {x_i^tr} and test sample set {x_j^te}, together with b template points {c_l} randomly selected from the pre-processed test sample set, where x_i^tr is a sample of the pre-processed training set, x_j^te is a sample of the pre-processed test set, n_tr is the number of training samples, n_te is the number of test samples, c_l is a template point randomly selected from {x_j^te}, i is the training-sample index, j is the test-sample index, and l is the index of the randomly selected template point.
The pre-processing specifically comprises the following steps:
Step 1.1: apply pre-emphasis to the digital speech signal X to obtain the pre-emphasized speech signal X̃:
  X̃(ñ) = X(ñ) − μ X(ñ − 1),  X(−1) = 0,
where ñ is the discrete-sample index of X, Ñ is the length of X, X(ñ) and X(ñ − 1) are the values of X at the ñ-th and (ñ − 1)-th samples, X̃(ñ) is the value of X̃ at the ñ-th sample, and μ is the pre-emphasis coefficient.
Step 1.2: divide the pre-emphasized speech signal X̃ into frames by overlapping segmentation. The distance between the starting points of two successive frames is called the frame shift; here the frame shift is 8 ms, i.e. 128 samples at the sampling rate F_s = 16 kHz, and each frame is 16 ms long, i.e. 256 samples. Framing yields the speech frame set {x̃_{k'}} (1 ≤ k' ≤ K'):
  x̃_{k'}(n) = X̃((k' − 1)·128 + n),  0 ≤ n ≤ 255,
where x̃_{k'} is the k'-th speech frame of the set, n is the sample index within a frame, k' is the frame index, and K' is the total number of frames, which satisfies K' = ⌊(Ñ − 256)/128⌋ + 1, ⌊·⌋ denoting rounding down.
Step 1.3: apply a Hamming window w of length 256 to each speech frame x̃_{k'} (1 ≤ k' ≤ K') to obtain the windowed speech frame x_{k'}:
  x_{k'}(n) = x̃_{k'}(n) · w(n),
where x_{k'}(n), x̃_{k'}(n) and w(n) are the values of x_{k'}, x̃_{k'} and w at the n-th sample, and the Hamming window function of length 256 is
  w(n) = 0.54 − 0.46 cos(2πn/255),  0 ≤ n ≤ 255.
Endpoint detection is then completed with the well-known energy / zero-crossing-rate double-threshold method, with the following specific steps:
Step 1.4: for each windowed speech frame x_{k'}, 1 ≤ k' ≤ K', compute the short-time energy E_{k'} and the short-time zero-crossing rate Z_{k'}:
  E_{k'} = Σ_{n=0}^{255} x_{k'}(n)²,  Z_{k'} = (1/2) Σ_{n=1}^{255} |sgn[x_{k'}(n)] − sgn[x_{k'}(n − 1)]|,
where E_{k'} is the short-time energy of the windowed speech frame x_{k'}, Z_{k'} is its short-time zero-crossing rate, x_{k'}(n) and x_{k'}(n − 1) are the values of x_{k'} at the n-th and (n − 1)-th samples, and sgn[·] is the sign function, sgn(λ) = 1 for λ ≥ 0 and sgn(λ) = −1 for λ < 0.
Step 1.5: determine the short-time energy threshold t_E and the short-time zero-crossing-rate threshold t_Z from the short-time energies and zero-crossing rates of all K' frames.
Step 1.6: for every windowed speech frame, first make a first-stage decision using the short-time energy: the windowed speech frames whose short-time energy exceeds the threshold t_E are marked as first-stage effective speech frames, the first-stage effective speech frame with the smallest frame index is taken as the start frame of the current effective speech frame set, and the one with the largest frame index as its end frame. Then make a second-stage decision using the short-time zero-crossing rate: starting from the start frame and moving in order of decreasing frame index, examine the frames one by one and mark the windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as effective speech frames; and starting from the end frame and moving in order of increasing frame index, examine the frames one by one and mark the windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as effective speech frames. The set of effective speech frames obtained after the two-stage decision is denoted {s_k} (1 ≤ k ≤ K), where k is the effective-speech-frame index, K is the total number of effective speech frames, and s_k is the k-th effective speech frame of the set.
Step 2: compute the optimal Gaussian kernel width σ̂ of the basis functions.
The closeness of the training-sample distribution to the test-sample distribution can be expressed by the importance weight β(s):
  β(s) = p_te(s) / p_tr(s),
where p_tr(s) denotes the distribution density of the pre-processed training sample set {x_i^tr} and p_te(s) denotes the distribution density of the pre-processed test sample set {x_j^te}.
Step 2.1: set the preset basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1.
Step 2.2: compute the pre-compensation parameter vector α.
β(s) is approximated by the linear model
  β̂(s) = Σ_{l=1}^{b} α_l φ_l(s),
where α = (α_1, α_2, ..., α_b)' and the φ_l are basis functions; b and φ_l can be determined from the samples {x_i^tr} and {x_j^te}.
The squared-error criterion J_0(α) between β̂(s) and β(s) is
  J_0(α) = (1/2) ∫ (β̂(s) − β(s))² p_tr(s) ds.
Its last term is a constant that does not depend on α and can be ignored; the first two terms are denoted J(α):
  J(α) = (1/2) α' H α − h' α,
where α' is the transpose of the vector α, H is the b × b matrix with elements H_{l,l'} = ∫ φ_l(s) φ_{l'}(s) p_tr(s) ds, and h is the b-dimensional vector with elements h_l = ∫ φ_l(s) p_te(s) ds.
Approximating the expectation in J(α) by sample averages gives the approximate expected squared error of the importance weights:
  Ĵ(α) = (1/2) α' Ĥ α − ĥ' α,
where Ĥ is the b × b matrix with elements Ĥ_{l,l'} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(x_i^tr) φ_{l'}(x_i^tr), ĥ is the b-dimensional vector with elements ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(x_j^te), and ĥ' is the transpose of ĥ.
Taking the non-negativity of the importance weight β(x) into account, this is converted into the optimization problem
  min_α Ĵ(α)  subject to  α ≥ 0;
the parameter vector α is the optimal solution of this problem.
When computing Ĥ and ĥ, φ_l is the Gaussian kernel function with kernel width σ:
  φ_l(s) = exp(−||s − c_l||² / (2σ²)).
Substituting φ_l into Ĥ and ĥ yields their entries Ĥ_{l,l'} and ĥ_l, where l, l' = 1, 2, ..., b, c_{l'} is a template point randomly selected from {x_j^te}, l' is the index of the randomly selected template point, and σ is one of the preset values.
Step 2.3: select the optimal basis-function Gaussian kernel width σ̂ by cross-validation.
Divide the pre-processed training sample set {x_i^tr} and test sample set {x_j^te} into R subsets {X_r^tr} and {X_r^te}, respectively, and compute
  Ĵ_r = (1/(2 n_r^tr)) Σ_{s^tr ∈ X_r^tr} β̂(s^tr)² − (1/n_r^te) Σ_{s^te ∈ X_r^te} β̂(s^te),
where Ĵ_r is the approximate expected squared error of the importance weights on the r-th fold, r = 1, 2, ..., R, X_r^tr is the r-th training subset, X_r^te is the r-th test subset, n_r^tr and n_r^te are the numbers of samples in X_r^tr and X_r^te, s^tr is a sample of X_r^tr, s^te is a sample of X_r^te, and β̂(s^tr), β̂(s^te) are the importance-weight estimates of these samples.
Compute the cross-validation score of the importance weights,
  Ĵ^CV = (1/R) Σ_{r=1}^{R} Ĵ_r,  r = 1, 2, ..., R.
Minimizing Ĵ^CV over the preset values σ = 0.1, 0.2, ..., 1 yields the optimal solution σ̂, which is the optimal basis-function Gaussian kernel width.
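A numerical sketch of steps 2.1-2.3 is given below (not part of the patent). It builds Ĥ and ĥ from Gaussian basis functions centred on the template points, obtains α by a ridge-stabilised least-squares solve clipped at zero (an assumed simplification of the constrained problem min Ĵ(α) subject to α ≥ 0), and selects the kernel width σ̂ by the cross-validated score Ĵ^CV.

```python
import numpy as np

def gaussian_design(X, centers, sigma):
    """phi_l(x) = exp(-||x - c_l||^2 / (2 sigma^2)) for every row x of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_alpha(X_tr, X_te, centers, sigma, ridge=1e-6):
    """Minimise (1/2) a' H a - h' a with a >= 0 (zero-clipped ridge solve)."""
    Phi_tr = gaussian_design(X_tr, centers, sigma)
    Phi_te = gaussian_design(X_te, centers, sigma)
    H = Phi_tr.T @ Phi_tr / len(X_tr)          # H_hat
    h = Phi_te.mean(axis=0)                    # h_hat
    alpha = np.linalg.solve(H + ridge * np.eye(len(centers)), h)
    return np.maximum(alpha, 0.0)              # enforce alpha >= 0

def j_score(X_tr, X_te, centers, sigma, alpha):
    """Empirical J = 1/2 * E_tr[beta^2] - E_te[beta] for a given alpha."""
    b_tr = gaussian_design(X_tr, centers, sigma) @ alpha
    b_te = gaussian_design(X_te, centers, sigma) @ alpha
    return 0.5 * np.mean(b_tr ** 2) - np.mean(b_te)

def select_sigma(X_tr, X_te, centers, sigmas, R=5, seed=0):
    """Cross-validate the kernel width over the preset sigma values."""
    rng = np.random.default_rng(seed)
    tr_folds = np.array_split(rng.permutation(len(X_tr)), R)
    te_folds = np.array_split(rng.permutation(len(X_te)), R)
    scores = []
    for sigma in sigmas:
        alpha = fit_alpha(X_tr, X_te, centers, sigma)
        j = [j_score(X_tr[tr], X_te[te], centers, sigma, alpha)
             for tr, te in zip(tr_folds, te_folds)]
        scores.append(np.mean(j))
    return sigmas[int(np.argmin(scores))]
```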
Step 3: compute the optimal parameter vector α̂.
Using the Gaussian basis functions obtained in step 2 with the optimal basis-function Gaussian kernel width σ̂, recompute Ĥ and ĥ as
  Ĥ_{l,l'} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(x_i^tr) φ_{l'}(x_i^tr),  ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(x_j^te),
where l, l' = 1, 2, ..., b and φ_l(s) = exp(−||s − c_l||² / (2σ̂²)).
With these quantities, solve the optimization problem min_α Ĵ(α) = (1/2) α' Ĥ α − ĥ' α under the constraint α ≥ 0; the solution is the optimal parameter vector α̂.
Step 4: compute the approximate importance weights.
From step 2, β(s) is modelled by the linear model β̂(s) = Σ_l α̂_l φ_l(s). Substituting the Gaussian basis functions gives
  β̂(s) = Σ_{l=1}^{b} α̂_l exp(−||s − c_l||² / (2σ̂²)),
where α̂_l is the l-th element of the vector α̂, s is a sample among the training and test sample points, and s ∈ D, D being the set of training and test sample points.
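Continuing the sketch after step 2.3 (the function names are those assumed there, not the patent's), the approximate importance weight of any sample is then evaluated as β̂(s) = Σ_l α̂_l exp(−||s − c_l||² / (2σ̂²)):

```python
def importance_weights(S, centers, sigma_hat, alpha_hat):
    """beta_hat(s) = sum_l alpha_hat_l * exp(-||s - c_l||^2 / (2 sigma_hat^2))
    for every row s of S (uses gaussian_design from the previous sketch)."""
    return gaussian_design(S, centers, sigma_hat) @ alpha_hat

# Illustrative use with the previous sketch (all names are assumptions):
# sigma_hat = select_sigma(X_tr, X_te, centers, sigmas=np.arange(0.1, 1.01, 0.1))
# alpha_hat = fit_alpha(X_tr, X_te, centers, sigma_hat)
# beta_tr   = importance_weights(X_tr, centers, sigma_hat, alpha_hat)  # weights beta_i
```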
Step 5: establish the importance-weighted SVM classifier model.
The importance weights are introduced as coefficients on the slack variables ξ of the standard SVM classifier:
  min_{w, b, ξ} (1/2)||w||² + C Σ_{i=1}^{L} β_i ξ_i,
with the constraints y_i(<w, d_i> + b) ≥ 1 − ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ L, where w is the normal vector of the separating hyperplane, ||w|| is the norm of w, ξ is the slack variable, C is the penalty parameter, d_i is the feature vector extracted from the training sample x_i^tr, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), ..., (d_L, y_L), and β_i is the importance weight of the training sample point (d_i, y_i).
The statistical features of a sentence sample take the frame-level short-time features (such as fundamental frequency, frame energy, Mel-frequency cepstral coefficients, and the wavelet-packet cepstral coefficient features proposed herein) as low-level descriptors (LLD); the statement-level feature parameters are obtained by computing statistics of all the short-time features over the sentence.
Statistics commonly used in speech emotion feature extraction are listed in Table 1:
Table 1
The short-time features are: fundamental frequency, logarithmic frame energy, band energies (0-250 Hz, 0-650 Hz, 250-650 Hz, 1-4 kHz), the cepstral energies of 26 Mel-frequency bands, 13th-order Mel-frequency cepstral coefficients, the positions of the maximum and minimum of the Mel correlation spectrum, and the 90%, 75%, 50% and 25% roll-off points of the Mel correlation spectrum. The objective above together with its constraints constitutes the importance-weighted SVM classifier model.
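For orientation, a hypothetical end-to-end flow combining the sketches above is shown below; every helper (preprocess, lld, statement_level_features, select_sigma, fit_alpha, importance_weights) refers to those illustrative sketches or is a placeholder, not to code disclosed by the patent.

```python
# Hypothetical end-to-end flow: features -> importance weights -> weighted SVM.
import numpy as np
from sklearn.svm import SVC

def recognize(train_wavs, train_labels, test_wavs, b=100, seed=0):
    # lld() is a placeholder for a frame-level descriptor extractor
    # (F0, energies, MFCCs, ...); preprocess() and statement_level_features()
    # refer to the earlier sketches.
    D_tr = np.stack([statement_level_features(lld(preprocess(w))) for w in train_wavs])
    D_te = np.stack([statement_level_features(lld(preprocess(w))) for w in test_wavs])

    # Template points c_l drawn at random from the test feature vectors (step 2).
    rng = np.random.default_rng(seed)
    centers = D_te[rng.choice(len(D_te), size=min(b, len(D_te)), replace=False)]

    # Importance-weight model (steps 2-4, using the earlier sketch).
    sigma_hat = select_sigma(D_tr, D_te, centers, sigmas=np.arange(0.1, 1.01, 0.1))
    alpha_hat = fit_alpha(D_tr, D_te, centers, sigma_hat)
    beta_tr = importance_weights(D_tr, centers, sigma_hat, alpha_hat)

    # Importance-weighted SVM (step 5) and recognition.
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(D_tr, train_labels, sample_weight=beta_tr)
    return clf.predict(D_te)
```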
The above embodiment is only a preferred embodiment of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and equivalent replacements can be made without departing from the principle of the present invention, and the technical solutions obtained by applying such improvements and equivalent replacements to the claims of the present invention all fall within the protection scope of the present invention.

Claims (3)

1. A speech emotion recognition method based on an importance-weighted support vector machine classifier, characterized in that the method comprises the following steps:
Step 1: pre-processing the input speech signal and extracting the feature vector d_i;
Step 2: dividing the input sample set into a training sample set {x_i^tr} and a test sample set {x_j^te}, and randomly selecting b template points c_l from the test sample set to form {c_l}, where x_i^tr is a sample of the training sample set, x_j^te is a sample of the test sample set, n_tr is the number of samples in the training sample set, n_te is the number of samples in the test sample set, i is the index of a training sample, j is the index of a test sample, and l is the index of a template point selected from the test sample set;
Step 3: computing the optimal Gaussian kernel width σ̂ of the basis functions, as follows:
Step 3.1: setting the preset basis-function Gaussian kernel widths σ to 0.1, 0.2, ..., 1;
Step 3.2: computing the pre-compensation parameter vector α according to the following procedure:
Step 3.2.1: computing, with the Gaussian basis functions φ_l(x) = exp(−||x − c_l||² / (2σ²)), the b × b matrix Ĥ whose elements are Ĥ_{l,l'} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(x_i^tr) φ_{l'}(x_i^tr), where l, l' = 1, 2, ..., b, c_{l'} is a point of the randomly selected template set, and l' is the index of a randomly selected template point;
Step 3.2.2: computing the b-dimensional vector ĥ whose elements are ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(x_j^te);
Step 3.2.3: computing the pre-compensation parameter vector α: under the constraint α ≥ 0, solving the optimization problem min_α Ĵ(α), i.e. finding the value of the parameter vector α that minimizes Ĵ(α) = (1/2) α' Ĥ α − ĥ' α, where Ĵ(α) is the approximate expected squared error of the importance weights, α' is the transpose of the vector α, and ĥ' is the transpose of the vector ĥ;
Step 3.3: selecting the optimal basis-function Gaussian kernel width σ̂ by cross-validation: dividing the training sample set {x_i^tr} and the test sample set {x_j^te} into R subsets {X_r^tr} and {X_r^te} respectively, and computing the approximate expected squared error of the importance weights on the r-th fold, Ĵ_r = (1/(2 n_r^tr)) Σ_{s^tr ∈ X_r^tr} β̂(s^tr)² − (1/n_r^te) Σ_{s^te ∈ X_r^te} β̂(s^te), where r = 1, 2, ..., R, X_r^tr is the r-th training subset, X_r^te is the r-th test subset, n_r^tr and n_r^te are the numbers of samples in X_r^tr and X_r^te, s^tr is a sample of X_r^tr, s^te is a sample of X_r^te, and β̂(s^tr), β̂(s^te) are the importance-weight estimates of these samples, computed as β̂(s) = Σ_{l=1}^{b} α_l φ_l(s), α_l being the l-th element of the pre-compensation parameter vector α obtained in step 3.2.3; substituting each of the 10 preset values σ = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 in turn, computing the cross-validation score Ĵ^CV = (1/R) Σ_{r=1}^{R} Ĵ_r, and taking the σ giving the smallest Ĵ^CV as the optimal basis-function Gaussian kernel width σ̂;
Step 4: under the constraint α ≥ 0, solving the optimization problem min_α Ĵ(α) with Ĥ and ĥ computed using the optimal kernel width σ̂, i.e. Ĥ_{l,l'} = (1/n_tr) Σ_{i=1}^{n_tr} φ_l(x_i^tr) φ_{l'}(x_i^tr) and ĥ_l = (1/n_te) Σ_{j=1}^{n_te} φ_l(x_j^te), l, l' = 1, 2, ..., b, to obtain the optimal parameter vector α̂, where Ĥ_{l,l'} is the element in row l and column l' of the matrix Ĥ and ĥ_l is the l-th element of the vector ĥ;
Step 5: computing the importance weight β(s) = Σ_{l=1}^{b} α̂_l φ_l(s), where α̂_l is the l-th element of the optimal parameter vector α̂, s is a sample among the training and test sample points, and s ∈ D, D being the set of training and test sample points;
Step 6: establishing the importance-weighted SVM classifier: using the importance weight β(s) as a coefficient on the slack variables ξ of the standard SVM classifier gives the SVM classifier expression min_{w, b, ξ} (1/2)||w||² + C Σ_{i=1}^{L} β_i ξ_i, which together with the constraints
y_i(<w, d_i> + b) ≥ 1 − ξ_i, ξ_i ≥ 0, 1 ≤ i ≤ L
constitutes the importance-weighted SVM classifier, where w is the normal vector of the separating hyperplane, ||w|| is the norm of w, C is the penalty parameter, d_i is the feature vector extracted from the pre-processed training sample set {x_i^tr}, y_i ∈ {+1, −1} is the class label, together they form the training samples (d_1, y_1), (d_2, y_2), ..., (d_L, y_L), β_i is the importance weight of the training sample point (d_i, y_i), and ξ_i is the slack variable of the training sample point (d_i, y_i);
Step 7: performing speech emotion recognition using the feature vectors extracted in step 1 and the importance-weighted SVM classifier established in step 6.
2. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1, characterized in that the pre-processing in step 1 comprises the following steps:
Step 1.1: applying pre-emphasis to the digital speech signal X to obtain the pre-emphasized speech signal X̃, X̃(ñ) = X(ñ) − μ X(ñ − 1) with X(−1) = 0, where ñ denotes the discrete-sample index of X, Ñ is the length of X, X(ñ) and X(ñ − 1) are the values of X at the ñ-th and (ñ − 1)-th samples, X̃(ñ) is the value of X̃ at the ñ-th sample, and μ is the pre-emphasis coefficient;
Step 1.2: dividing the pre-emphasized speech signal X̃ into frames by overlapping segmentation, the distance between the starting points of two successive frames being called the frame shift, here 8 ms, i.e. 128 samples at the sampling rate F_s = 16 kHz, each frame being 16 ms long, i.e. 256 samples; framing yields the speech frame set {x̃_{k'}} (1 ≤ k' ≤ K'), in which the n-th sample of the k'-th speech frame is x̃_{k'}(n) = X̃((k' − 1)·128 + n), 0 ≤ n ≤ 255, where x̃_{k'} is the k'-th speech frame of the set, n is the sample index within a frame, k' is the frame index, and K' is the total number of frames, which satisfies K' = ⌊(Ñ − 256)/128⌋ + 1, ⌊·⌋ denoting rounding down;
Step 1.3: applying a Hamming window w of length 256 to each speech frame x̃_{k'} (1 ≤ k' ≤ K') to obtain the windowed speech frame x_{k'}, x_{k'}(n) = x̃_{k'}(n) · w(n), where x_{k'}(n), x̃_{k'}(n) and w(n) are the values of x_{k'}, x̃_{k'} and w at the n-th sample, and the Hamming window function of length 256 is w(n) = 0.54 − 0.46 cos(2πn/255), 0 ≤ n ≤ 255;
Step 1.4: for each windowed speech frame x_{k'}, 1 ≤ k' ≤ K', computing the short-time energy E_{k'} = Σ_{n=0}^{255} x_{k'}(n)² and the short-time zero-crossing rate Z_{k'} = (1/2) Σ_{n=1}^{255} |sgn[x_{k'}(n)] − sgn[x_{k'}(n − 1)]|, where x_{k'}(n) and x_{k'}(n − 1) are the values of x_{k'} at the n-th and (n − 1)-th samples and sgn[·] is the sign function, sgn(λ) = 1 for λ ≥ 0 and sgn(λ) = −1 for λ < 0, λ being the argument of the sign function;
Step 1.5: determining the short-time energy threshold t_E and the short-time zero-crossing-rate threshold t_Z from the short-time energies and zero-crossing rates of all K' frames, K' being the total number of frames;
Step 1.6: for every windowed speech frame, first making a first-stage decision using the short-time energy, i.e. marking the windowed speech frames whose short-time energy exceeds the threshold t_E as first-stage effective speech frames, taking the first-stage effective speech frame with the smallest frame index as the start frame of the current effective speech frame set and the one with the largest frame index as its end frame;
then making a second-stage decision using the short-time zero-crossing rate, i.e. starting from the start frame and moving in order of decreasing frame index, examining the frames one by one and marking the windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as effective speech frames, and starting from the end frame and moving in order of increasing frame index, examining the frames one by one and marking the windowed speech frames whose short-time zero-crossing rate exceeds the threshold t_Z as effective speech frames;
the set of effective speech frames obtained after the two-stage decision being denoted {p_k} (1 ≤ k ≤ K), where k is the effective-speech-frame index, K is the total number of effective speech frames, and p_k is the k-th effective speech frame of the set.
3. The speech emotion recognition method based on an importance-weighted support vector machine classifier according to claim 1 or 2, characterized in that the feature vector d_i in step 1 is extracted as follows:
frame-level short-time features and the first-order and second-order differences of the short-time features are used as low-level descriptors, and the statement-level features are obtained by computing statistics of the low-level descriptors over the sentence.
CN201610969948.7A 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier Active CN106504772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610969948.7A CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610969948.7A CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Publications (2)

Publication Number Publication Date
CN106504772A CN106504772A (en) 2017-03-15
CN106504772B true CN106504772B (en) 2019-08-20

Family

ID=58322831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610969948.7A Active CN106504772B (en) 2016-11-04 2016-11-04 Speech-emotion recognition method based on weights of importance support vector machine classifier

Country Status (1)

Country Link
CN (1) CN106504772B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108735233A (en) * 2017-04-24 2018-11-02 北京理工大学 A kind of personality recognition methods and device
CN108364641A (en) * 2018-01-09 2018-08-03 东南大学 A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise
CN108831450A (en) * 2018-03-30 2018-11-16 杭州鸟瞰智能科技股份有限公司 A kind of virtual robot man-machine interaction method based on user emotion identification
WO2020024210A1 (en) * 2018-08-02 2020-02-06 深圳大学 Method and apparatus for optimizing window parameter of integrated kernel density estimator, and terminal device
CN110991238B (en) * 2019-10-30 2023-04-28 中科南京人工智能创新研究院 Speech assisting system based on speech emotion analysis and micro expression recognition
CN111415680B (en) * 2020-03-26 2023-05-23 心图熵动科技(苏州)有限责任公司 Voice-based anxiety prediction model generation method and anxiety prediction system
CN113434698B (en) * 2021-06-30 2022-08-02 华中科技大学 Relation extraction model establishing method based on full-hierarchy attention and application thereof
CN116801456A (en) * 2023-08-22 2023-09-22 深圳市创洺盛光电科技有限公司 Intelligent control method of LED lamp

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20080077720A (en) * 2007-02-21 2008-08-26 인하대학교 산학협력단 A voice activity detecting method based on a support vector machine(svm) using a posteriori snr, a priori snr and a predicted snr as a feature vector
KR20110021328A (en) * 2009-08-26 2011-03-04 인하대학교 산학협력단 The method to improve the performance of speech/music classification for 3gpp2 codec by employing svm based on discriminative weight training
CN102201237A (en) * 2011-05-12 2011-09-28 浙江大学 Emotional speaker identification method based on reliability detection of fuzzy support vector machine
CN103544963A (en) * 2013-11-07 2014-01-29 东南大学 Voice emotion recognition method based on core semi-supervised discrimination and analysis
CN104091602A (en) * 2014-07-11 2014-10-08 电子科技大学 Speech emotion recognition method based on fuzzy support vector machine

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Yongming Huang et al., "Extraction of adaptive wavelet packet filter-bank-based acoustic feature for speech emotion recognition," IET Signal Processing, vol. 9, no. 4, pp. 341-348, 15 June 2015.
Qin Yuqiang, Zhang Xueying, "Speech signal emotion recognition based on SVM," Journal of Circuits and Systems, vol. 17, no. 5, pp. 55-59, October 2012.

Also Published As

Publication number Publication date
CN106504772A (en) 2017-03-15

Similar Documents

Publication Publication Date Title
CN106504772B (en) Speech-emotion recognition method based on weights of importance support vector machine classifier
CN106328121B (en) Chinese Traditional Instruments sorting technique based on depth confidence network
CN109256150B (en) Speech emotion recognition system and method based on machine learning
CN101599271B (en) Recognition method of digital music emotion
CN109243494B (en) Children emotion recognition method based on multi-attention mechanism long-time memory network
CN109493886A (en) Speech-emotion recognition method based on feature selecting and optimization
CN105261367B (en) A kind of method for distinguishing speek person
CN108899049A (en) A kind of speech-emotion recognition method and system based on convolutional neural networks
CN112259104B (en) Training device for voiceprint recognition model
CN107393554A (en) In a kind of sound scene classification merge class between standard deviation feature extracting method
CN102509547A (en) Method and system for voiceprint recognition based on vector quantization based
CN109815892A (en) The signal recognition method of distributed fiber grating sensing network based on CNN
CN102723078A (en) Emotion speech recognition method based on natural language comprehension
WO2016119604A1 (en) Voice information search method and apparatus, and server
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
CN102655003B (en) Method for recognizing emotion points of Chinese pronunciation based on sound-track modulating signals MFCC (Mel Frequency Cepstrum Coefficient)
CN110767210A (en) Method and device for generating personalized voice
CN109243493A (en) Based on the vagitus emotion identification method for improving long memory network in short-term
CN104123933A (en) Self-adaptive non-parallel training based voice conversion method
CN103985381A (en) Voice frequency indexing method based on parameter fusion optimized decision
CN111128128B (en) Voice keyword detection method based on complementary model scoring fusion
CN108364641A (en) A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise
Zhang et al. Speech emotion recognition using combination of features
CN114783418B (en) End-to-end voice recognition method and system based on sparse self-attention mechanism
CN111128240B (en) Voice emotion recognition method based on anti-semantic-erasure

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant