CN110246518A - Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features - Google Patents
Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features
- Publication number
- CN110246518A CN110246518A CN201910496244.6A CN201910496244A CN110246518A CN 110246518 A CN110246518 A CN 110246518A CN 201910496244 A CN201910496244 A CN 201910496244A CN 110246518 A CN110246518 A CN 110246518A
- Authority
- CN
- China
- Prior art keywords
- frame
- dimension
- feature
- speech
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
The present invention provides a speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features. The speech emotion recognition method comprises the following steps: a first step, frame calculation, in which the prosodic features, spectrum-related features and voice-quality features of each frame are calculated frame by frame; and a second step, segment-granularity feature extraction, in which large-granularity static global features are calculated over the whole utterance while a Gaussian window is used to convolve adjacent frame features along the time axis, yielding multi-granularity time-varying dynamic features. The beneficial effects of the present invention are: the invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis that extracts features from speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time, making the extracted features more effective.
Description
Technical field
The present invention relates to the field of voice processing technology, and more particularly to a speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features.
Background technique
The traditional method first extracts acoustic features from the speech frame by frame, then statistically analyzes the features of all frames over the whole speech segment to obtain the final features. A support vector machine (SVM), perceptron or similar model is used as the classifier.
With traditional feature extraction, the extracted features are global static features of the whole speech segment and cannot capture how the speaker's speech emotion varies dynamically while speaking. Nor is the choice of classifier designed or optimized for the dynamically changing information in the speech.
Summary of the invention
The present invention provides a speech emotion recognition method based on multi-granularity dynamic-static fusion features, comprising the following steps: a first step, frame calculation: calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame; and a second step, segment-granularity feature extraction: calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The present invention also provides a speech emotion recognition device based on multi-granularity dynamic-static fusion features, comprising: a frame computing module, for calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame; and a segment-granularity feature extraction module, for calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis to obtain multi-granularity time-varying dynamic features.
The present invention also provides a speech emotion recognition system based on multi-granularity dynamic-static fusion features, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when called by the processor.
The present invention also provides a computer-readable storage medium storing a computer program, the computer program being configured to implement the steps of the method of the present invention when called by a processor.
The beneficial effects of the present invention are: according to the cognitive rules the human brain exhibits over time in speech emotion recognition, the invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis, extracting features from speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time, making the extracted features more effective.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description of the embodiments
The invention discloses a speech emotion recognition method based on multi-granularity dynamic-static fusion features. Using a multi-granularity dynamic-static feature fusion analysis technique, the prosodic features, spectral features and voice-quality features of each frame are first calculated frame by frame, and large-granularity static global features are then calculated over the whole utterance. At the same time, a Gaussian window is used to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The speech emotion recognition method based on multi-granularity dynamic-static fusion features comprises the following steps:
A first step, frame calculation: calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
A second step, segment-granularity feature extraction: calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The frame calculation step of the first step comprises the following sub-steps:
Sub-step 1, speech framing: using a Hamming window as the window function, with the frame length set to 25 ms and the frame shift to 10 ms, the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction;
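The framing sub-step above can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sample rate (the patent specifies only the 25 ms frame length and 10 ms shift):

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_len_ms=25, frame_shift_ms=10):
    """Split a 1-D speech signal into overlapping Hamming-windowed frames.

    The 25 ms frame length and 10 ms shift follow the patent's framing
    sub-step; the 16 kHz sample rate is an assumption for illustration.
    """
    frame_len = int(sr * frame_len_ms / 1000)      # 400 samples at 16 kHz
    frame_shift = int(sr * frame_shift_ms / 1000)  # 160 samples at 16 kHz
    window = np.hamming(frame_len)                 # Hamming window function
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape (T, frame_len): T frames, the minimum processing unit

# one second of audio yields 1 + (16000 - 400) // 160 = 98 frames
frames = frame_signal(np.zeros(16000))
```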
Sub-step 2, frame-granularity feature extraction: for each frame produced by the framing sub-step, a 65-dimension acoustic feature vector is extracted, including fundamental frequency, short-time energy, short-time average energy, zero-crossing rate, average magnitude difference, formants, MFCCs, etc. (the dimension breakdown is detailed below);
Here x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector; for each time signal containing T frames, a 65×T frame feature matrix is then obtained.
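The construction of the frame feature matrix can be sketched as below. Only a small assumed subset of the 65 features (zero-crossing rate, short-time energy, short-time average magnitude) is computed here; the full set (F0, MFCCs, jitter, shimmer, HNR, ...) would require an acoustic toolkit, so this sketch only illustrates the column-per-frame matrix layout:

```python
import numpy as np

def frame_features(frames):
    """Assemble a (dims x T) frame feature matrix X = (x_1, ..., x_T),
    one column per frame, mirroring the patent's 65 x T matrix.
    Only three illustrative features are computed per frame."""
    feats = []
    for frame in frames:
        # zero-crossing rate: fraction of adjacent samples changing sign
        zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
        energy = np.sum(frame ** 2)         # short-time energy
        avg_mag = np.mean(np.abs(frame))    # short-time average magnitude
        feats.append([zcr, energy, avg_mag])
    return np.array(feats).T  # shape (3, T): column x_t is frame t's vector

T = 10
X = frame_features(np.random.randn(T, 400))
```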
In the segment-granularity feature extraction step of the second step, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms, set in advance according to the auditory mechanism of the human brain, and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group. The final segment feature matrix S_(M×T) is calculated by the following formula:
S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T
Here (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L, ending at x_t. G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T); it is computed from a Gaussian formula in which T_D denotes the time delay between two adjacent convolution windows, here equal to the length of one frame, and σ_m is derived from predefined constants.
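The Gaussian-window convolution above can be sketched as follows. Since the patent's closed-form expressions for G_(m,t) and σ_m are not reproduced in this text, the width schedule sigma_m = win_frames / (2 * (m + 1)) and the reduction of each windowed block to a scalar by averaging are illustrative assumptions; a 300 ms segment covers roughly 30 frames at a 10 ms frame shift:

```python
import numpy as np

def segment_features(X, M=3, win_frames=30):
    """Convolve adjacent frame features with a bank of M Gaussian windows
    to form the segment feature matrix S (M x T), in the spirit of
    S_(m,t) = G_(m,t) * (x_{t-L+1}, ..., x_t)^T.

    Assumptions (not specified in this text): the sigma_m schedule below,
    and averaging the weighted frame vectors into one scalar per (m, t).
    """
    dims, T = X.shape
    n = np.arange(win_frames)
    S = np.zeros((M, T))
    for m in range(M):
        sigma = win_frames / (2.0 * (m + 1))   # assumed width schedule
        # Gaussian window centered on the last (current) frame of the block
        g = np.exp(-0.5 * ((n - (win_frames - 1)) / sigma) ** 2)
        g /= g.sum()                           # normalize window weights
        for t in range(T):
            lo = max(0, t - win_frames + 1)
            block = X[:, lo : t + 1]           # frames ending at x_t
            w = g[-block.shape[1]:]            # align weights to short blocks
            S[m, t] = np.mean(block @ w)       # weighted average -> scalar
    return S

S = segment_features(np.random.randn(65, 50))  # S has shape (M, T) = (3, 50)
```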
The invention also discloses a speech emotion recognition device based on multi-granularity dynamic-static fusion features, comprising:
A frame computing module, for calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
A segment-granularity feature extraction module, for calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The frame computing module comprises:
A speech framing module, for dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, using a Hamming window as the window function, the frames serving as the minimum processing granularity in feature extraction;
A frame-granularity feature extraction module, for extracting an acoustic feature vector of the set dimension from each frame produced by the framing module; for each time signal containing T frames, a frame feature matrix is obtained.
In the segment-granularity feature extraction module, the obtained frame feature matrix is convolved using a segment length set in advance according to the auditory mechanism of the human brain and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group, and the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T).
In the speech framing module, a Hamming window is used as the window function, the frame length is set to 25 ms and the frame shift to 10 ms; the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction.
In the frame-granularity feature extraction module, a 65-dimension acoustic feature vector is extracted from each frame produced by the framing module. The 65 dimensions comprise: smoothed fundamental frequency (1), voiced probability (1), zero-crossing rate (1), MFCC (14), loudness (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1), and harmonics-to-noise ratio (1). x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector; for each time signal containing T frames, the frame feature matrix is then obtained.
In the segment-granularity feature extraction module, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the group. The final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T), computed from a Gaussian formula in which T_D is the time delay between two adjacent convolution windows.
The invention also discloses a speech emotion recognition system based on multi-granularity dynamic-static fusion features, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when called by the processor.
The invention also discloses a computer-readable storage medium storing a computer program, the computer program being configured to implement the steps of the method of the present invention when called by a processor.
The present invention proposes a speech emotion feature extraction and analysis method based on auditory cognition rules, and a speech emotion recognition method built on it. It relates to using this method to solve speech emotion recognition problems, including but not limited to artificial intelligence technology involving speech emotion recognition running on computers and machine terminals.
According to the cognitive rules the human brain exhibits over time in speech emotion recognition, the invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis, extracting features from speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time, making the extracted features more effective.
The recognition algorithm uses a long short-term memory (LSTM) network model. The LSTM model can model time series effectively and make full use of the temporal information in the features. Moreover, the long- and short-term memory mechanism of the LSTM allows the network to selectively remember and recognize features from different moments, providing a feature fusion mechanism.
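The gating behavior described above can be sketched with a minimal LSTM forward pass. This is an illustrative, untrained cell, not the patent's recognizer; the hidden size, weight shapes and random inputs are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(xs, W, U, b, hidden=8):
    """Run a single LSTM cell over a feature sequence xs of shape
    (T, input_dim). W, U, b hold the stacked parameters of the four
    gates (input, forget, cell, output). Returns the final hidden state,
    which a classifier layer could map to emotion categories."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in xs:
        z = W @ x + U @ h + b              # all gate pre-activations at once
        i = sigmoid(z[:hidden])            # input gate
        f = sigmoid(z[hidden:2*hidden])    # forget gate: selective memory
        g = np.tanh(z[2*hidden:3*hidden])  # candidate cell state
        o = sigmoid(z[3*hidden:])          # output gate
        c = f * c + i * g                  # long-term memory update
        h = o * np.tanh(c)                 # short-term (output) state
    return h

rng = np.random.default_rng(0)
input_dim, hidden = 65, 8                  # 65-dim frame features, assumed hidden size
h = lstm_forward(rng.normal(size=(50, input_dim)),
                 W=rng.normal(scale=0.1, size=(4*hidden, input_dim)),
                 U=rng.normal(scale=0.1, size=(4*hidden, hidden)),
                 b=np.zeros(4*hidden))
```

The forget gate f is what lets the network selectively retain features from earlier moments while processing later ones, which is the fusion mechanism the text refers to.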
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, but the specific implementation of the invention shall not be considered limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or substitutions may be made without departing from the inventive concept, and all of these shall be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A speech emotion recognition method based on multi-granularity dynamic-static fusion features, characterized by comprising the following steps:
a first step, frame calculation: calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
a second step, segment-granularity feature extraction: calculating large-granularity static global features over the whole utterance, while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
2. The speech emotion recognition method according to claim 1, characterized in that the frame calculation step of the first step comprises:
sub-step 1, speech framing: using a Hamming window as the window function, dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, the frames serving as the minimum processing granularity in feature extraction;
sub-step 2, frame-granularity feature extraction: extracting an acoustic feature vector of the set dimension from each frame produced by the framing sub-step, a frame feature matrix being obtained for each time signal containing T frames;
and that in the segment-granularity feature extraction step of the second step, the obtained frame feature matrix is convolved using a segment length set in advance according to the auditory mechanism of the human brain and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group, and the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T) and (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L and ending at x_t.
3. The speech emotion recognition method according to claim 2, characterized in that in the speech framing sub-step, a Hamming window is used as the window function, the frame length is set to 25 ms and the frame shift to 10 ms, and the continuous speech segment to be recognized is divided into frames serving as the minimum processing granularity in feature extraction;
and that in the frame-granularity feature extraction sub-step, a 65-dimension acoustic feature vector is extracted from each frame produced by the framing sub-step, the 65 dimensions comprising: smoothed fundamental frequency (1), voiced probability (1), zero-crossing rate (1), MFCC (14), loudness (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1), and harmonics-to-noise ratio (1); x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector, and a frame feature matrix is obtained for each time signal containing T frames.
4. The speech emotion recognition method according to claim 3, characterized in that in the segment-granularity feature extraction step of the second step, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the group; the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T), computed from a Gaussian formula in which T_D is the time delay between two adjacent convolution windows.
5. A speech emotion recognition device based on multi-granularity dynamic-static fusion features, characterized by comprising:
a frame computing module, for calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
a segment-granularity feature extraction module, for calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
6. The speech emotion recognition device according to claim 5, characterized in that the frame computing module comprises:
a speech framing module, for dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, using a Hamming window as the window function, the frames serving as the minimum processing granularity in feature extraction;
a frame-granularity feature extraction module, for extracting an acoustic feature vector of the set dimension from each frame produced by the framing module, a frame feature matrix being obtained for each time signal containing T frames;
and that in the segment-granularity feature extraction module, the obtained frame feature matrix is convolved using a segment length set in advance according to the auditory mechanism of the human brain and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group, and the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T) and (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L and ending at x_t.
7. The speech emotion recognition device according to claim 6, characterized in that in the speech framing module, a Hamming window is used as the window function, the frame length is set to 25 ms and the frame shift to 10 ms, and the continuous speech segment to be recognized is divided into frames serving as the minimum processing granularity in feature extraction;
and that in the frame-granularity feature extraction module, a 65-dimension acoustic feature vector is extracted from each frame produced by the framing module, the 65 dimensions comprising: smoothed fundamental frequency (1), voiced probability (1), zero-crossing rate (1), MFCC (14), loudness (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1), and harmonics-to-noise ratio (1); x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector, and a frame feature matrix is obtained for each time signal containing T frames.
8. The speech emotion recognition device according to claim 7, characterized in that in the segment-granularity feature extraction module, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the group; the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T), computed from a Gaussian formula in which T_D is the time delay between two adjacent convolution windows.
9. A speech emotion recognition system based on multi-granularity dynamic-static fusion features, characterized by comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of any one of claims 1-4 when called by the processor.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being configured to implement the steps of the method of any one of claims 1-4 when called by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910496244.6A CN110246518A (en) | 2019-06-10 | 2019-06-10 | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910496244.6A CN110246518A (en) | 2019-06-10 | 2019-06-10 | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110246518A true CN110246518A (en) | 2019-09-17 |
Family
ID=67886454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910496244.6A Pending CN110246518A (en) | 2019-06-10 | 2019-06-10 | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110246518A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291640A (en) * | 2020-01-20 | 2020-06-16 | 北京百度网讯科技有限公司 | Method and apparatus for recognizing gait |
CN113255630A (en) * | 2021-07-15 | 2021-08-13 | 浙江大华技术股份有限公司 | Moving target recognition training method, moving target recognition method and device |
CN113808619A (en) * | 2021-08-13 | 2021-12-17 | 北京百度网讯科技有限公司 | Voice emotion recognition method and device and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101930735A (en) * | 2009-06-23 | 2010-12-29 | 富士通株式会社 | Speech emotion recognition equipment and speech emotion recognition method |
CN103258532A (en) * | 2012-11-28 | 2013-08-21 | 河海大学常州校区 | Method for recognizing Chinese speech emotions based on fuzzy support vector machine |
CN103531206A (en) * | 2013-09-30 | 2014-01-22 | 华南理工大学 | Voice affective characteristic extraction method capable of combining local information and global information |
CN104835508A (en) * | 2015-04-01 | 2015-08-12 | 哈尔滨工业大学 | Speech feature screening method used for mixed-speech emotion recognition |
CN108564942A (en) * | 2018-04-04 | 2018-09-21 | 南京师范大学 | One kind being based on the adjustable speech-emotion recognition method of susceptibility and system |
US20190074028A1 (en) * | 2017-09-01 | 2019-03-07 | Newton Howard | Real-time vocal features extraction for automated emotional or mental state assessment |
- 2019-06-10: CN application CN201910496244.6A filed; patent CN110246518A (en), status Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101930735A (en) * | 2009-06-23 | 2010-12-29 | Fujitsu Ltd | Speech emotion recognition equipment and speech emotion recognition method |
CN103258532A (en) * | 2012-11-28 | 2013-08-21 | Hohai University Changzhou Campus | Method for recognizing Chinese speech emotions based on fuzzy support vector machine |
CN103531206A (en) * | 2013-09-30 | 2014-01-22 | South China University of Technology | Voice affective characteristic extraction method combining local and global information |
CN104835508A (en) * | 2015-04-01 | 2015-08-12 | Harbin Institute of Technology | Speech feature screening method for mixed-speech emotion recognition |
US20190074028A1 (en) * | 2017-09-01 | 2019-03-07 | Newton Howard | Real-time vocal features extraction for automated emotional or mental state assessment |
CN108564942A (en) * | 2018-04-04 | 2018-09-21 | Nanjing Normal University | Sensitivity-adjustable speech emotion recognition method and system |
Non-Patent Citations (3)
Title |
---|
Xu Cong: "Research on Multi-Granularity Analysis and Processing of Time-Series Signals Based on Convolutional Long Short-Term Memory Neural Networks", China Master's Theses Full-Text Database (Medicine & Health Sciences) * |
Bo Hongjian et al.: "Research on Dimensionality Reduction of Speech Emotion Features Based on Convolutional Neural Network Learning", High Technology Letters * |
Chen Jing et al.: "Dimensional Speech Emotion Recognition Based on Multi-Granularity Feature Fusion", Journal of Signal Processing * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291640A (en) * | 2020-01-20 | 2020-06-16 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and apparatus for recognizing gait |
CN111291640B (en) * | 2020-01-20 | 2023-02-17 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and apparatus for recognizing gait |
CN113255630A (en) * | 2021-07-15 | 2021-08-13 | Zhejiang Dahua Technology Co., Ltd. | Moving target recognition training method, moving target recognition method and device |
CN113255630B (en) * | 2021-07-15 | 2021-10-15 | Zhejiang Dahua Technology Co., Ltd. | Moving target recognition training method, moving target recognition method and device |
CN113808619A (en) * | 2021-08-13 | 2021-12-17 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Voice emotion recognition method and device and electronic equipment |
CN113808619B (en) * | 2021-08-13 | 2023-10-20 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Voice emotion recognition method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cummins et al. | An image-based deep spectrum feature representation for the recognition of emotional speech | |
CN105632501B (en) | Automatic accent classification method and device based on deep learning | |
CN109326302A (en) | Speech enhancement method based on voiceprint comparison and generative adversarial networks | |
CN107945790A (en) | Emotion identification method and emotion recognition system | |
EP3469582A1 (en) | Neural network-based voiceprint information extraction method and apparatus | |
Mashao et al. | Combining classifier decisions for robust speaker identification | |
CN108900725A (en) | Voiceprint recognition method, device, terminal device and storage medium | |
CN108597496A (en) | Speech generation method and device based on generative adversarial networks | |
CN110246518A (en) | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic and static fusion features | |
CN112786052B (en) | Speech recognition method, electronic equipment and storage device | |
Sailor et al. | Filterbank learning using convolutional restricted Boltzmann machine for speech recognition | |
Paulose et al. | Performance evaluation of different modeling methods and classifiers with MFCC and IHC features for speaker recognition | |
Sarkar et al. | Time-contrastive learning based deep bottleneck features for text-dependent speaker verification | |
CN106653002A (en) | Text live broadcast method and platform | |
CN108986798A (en) | Voice data processing method, device and equipment | |
CN106297769B (en) | Discriminative feature extraction method applied to language identification | |
Sinha et al. | Acoustic-phonetic feature based dialect identification in Hindi Speech | |
López-Espejo et al. | Improved external speaker-robust keyword spotting for hearing assistive devices | |
CN109377986A (en) | Non-parallel corpus voice personalization conversion method | |
Mahesha et al. | LP-Hillbert transform based MFCC for effective discrimination of stuttering dysfluencies | |
CN104464738B (en) | Voiceprint recognition method for intelligent mobile devices | |
Selva Nidhyananthan et al. | Assessment of dysarthric speech using Elman back propagation network (recurrent network) for speech recognition | |
Liu et al. | Using bidirectional associative memories for joint spectral envelope modeling in voice conversion | |
Chakroun et al. | Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments | |
CN106875944A (en) | Voice-controlled smart home terminal system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190917 ||