CN110246518A - Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features - Google Patents
Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features
- Publication number
- CN110246518A CN110246518A CN201910496244.6A CN201910496244A CN110246518A CN 110246518 A CN110246518 A CN 110246518A CN 201910496244 A CN201910496244 A CN 201910496244A CN 110246518 A CN110246518 A CN 110246518A
- Authority
- CN
- China
- Prior art keywords
- frame
- dimension
- feature
- speech
- convolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Child & Adolescent Psychology (AREA)
- General Health & Medical Sciences (AREA)
- Hospice & Palliative Care (AREA)
- Psychiatry (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
Abstract
The present invention provides a speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features. The speech emotion recognition method comprises the following steps: a first step, frame calculation, in which the prosodic features, spectrum-related features and voice-quality features of each frame are calculated frame by frame; and a second step, segment-granularity feature extraction, in which large-granularity static global features are calculated over the whole utterance while a Gaussian window is used to convolve adjacent frame features along the time axis, yielding multi-granularity time-varying dynamic features. The beneficial effects of the present invention are: the invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis that extracts features from speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time, making the extracted features more effective.
Description
Technical field
The present invention relates to the field of voice processing technology, and more particularly to a speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features.
Background technique
The traditional method first extracts acoustic features from the speech frame by frame, then statistically analyzes the features of all frames over the whole speech segment to obtain the final features. A support vector machine (SVM), perceptron or similar model is used as the classifier.
With traditional feature extraction, the extracted features are global static features of the whole speech segment and cannot capture how the speaker's speech emotion varies dynamically while speaking. Nor is the choice of classifier designed or optimized for the dynamically changing information in the speech.
Summary of the invention
The present invention provides a speech emotion recognition method based on multi-granularity dynamic-static fusion features, comprising the following steps: a first step, frame calculation: calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame; and a second step, segment-granularity feature extraction: calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The present invention also provides a speech emotion recognition device based on multi-granularity dynamic-static fusion features, comprising: a frame computing module, for calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame; and a segment-granularity feature extraction module, for calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis to obtain multi-granularity time-varying dynamic features.
The present invention also provides a speech emotion recognition system based on multi-granularity dynamic-static fusion features, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when called by the processor.
The present invention also provides a computer-readable storage medium storing a computer program, the computer program being configured to implement the steps of the method of the present invention when called by a processor.
The beneficial effects of the present invention are: according to the cognitive rules the human brain exhibits over time in speech emotion recognition, the invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis, extracting features from speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time, making the extracted features more effective.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed description of the embodiments
The invention discloses a speech emotion recognition method based on multi-granularity dynamic-static fusion features. Using a multi-granularity dynamic-static feature fusion analysis technique, the prosodic features, spectral features and voice-quality features of each frame are first calculated frame by frame, and large-granularity static global features are then calculated over the whole utterance. At the same time, a Gaussian window is used to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The speech emotion recognition method based on multi-granularity dynamic-static fusion features comprises the following steps:
A first step, frame calculation: calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
A second step, segment-granularity feature extraction: calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The frame calculation step of the first step comprises the following sub-steps:
Sub-step 1, speech framing: using a Hamming window as the window function, with the frame length set to 25 ms and the frame shift to 10 ms, the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction;
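The framing sub-step above can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sample rate (the patent specifies only the 25 ms frame length and 10 ms shift):

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_len_ms=25, frame_shift_ms=10):
    """Split a 1-D speech signal into overlapping Hamming-windowed frames.

    The 25 ms frame length and 10 ms shift follow the patent's framing
    sub-step; the 16 kHz sample rate is an assumption for illustration.
    """
    frame_len = int(sr * frame_len_ms / 1000)      # 400 samples at 16 kHz
    frame_shift = int(sr * frame_shift_ms / 1000)  # 160 samples at 16 kHz
    window = np.hamming(frame_len)                 # Hamming window function
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames  # shape (T, frame_len): T frames, the minimum processing unit

# one second of audio yields 1 + (16000 - 400) // 160 = 98 frames
frames = frame_signal(np.zeros(16000))
```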
Sub-step 2, frame-granularity feature extraction: for each frame produced by the framing sub-step, a 65-dimension acoustic feature vector is extracted, including fundamental frequency, short-time energy, short-time average energy, zero-crossing rate, average magnitude difference, formants, MFCCs, etc. (the dimension breakdown is detailed below);
Here x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector; for each time signal containing T frames, a 65×T frame feature matrix is then obtained.
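The construction of the frame feature matrix can be sketched as below. Only a small assumed subset of the 65 features (zero-crossing rate, short-time energy, short-time average magnitude) is computed here; the full set (F0, MFCCs, jitter, shimmer, HNR, ...) would require an acoustic toolkit, so this sketch only illustrates the column-per-frame matrix layout:

```python
import numpy as np

def frame_features(frames):
    """Assemble a (dims x T) frame feature matrix X = (x_1, ..., x_T),
    one column per frame, mirroring the patent's 65 x T matrix.
    Only three illustrative features are computed per frame."""
    feats = []
    for frame in frames:
        # zero-crossing rate: fraction of adjacent samples changing sign
        zcr = np.mean(np.abs(np.diff(np.signbit(frame).astype(int))))
        energy = np.sum(frame ** 2)         # short-time energy
        avg_mag = np.mean(np.abs(frame))    # short-time average magnitude
        feats.append([zcr, energy, avg_mag])
    return np.array(feats).T  # shape (3, T): column x_t is frame t's vector

T = 10
X = frame_features(np.random.randn(T, 400))
```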
In the segment-granularity feature extraction step of the second step, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms, set in advance according to the auditory mechanism of the human brain, and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group. The final segment feature matrix S_(M×T) is calculated by the following formula:
S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T
Here (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L, ending at x_t. G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T); it is computed from a Gaussian formula in which T_D denotes the time delay between two adjacent convolution windows, here equal to the length of one frame, and σ_m is derived from predefined constants.
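The Gaussian-window convolution above can be sketched as follows. Since the patent's closed-form expressions for G_(m,t) and σ_m are not reproduced in this text, the width schedule sigma_m = win_frames / (2 * (m + 1)) and the reduction of each windowed block to a scalar by averaging are illustrative assumptions; a 300 ms segment covers roughly 30 frames at a 10 ms frame shift:

```python
import numpy as np

def segment_features(X, M=3, win_frames=30):
    """Convolve adjacent frame features with a bank of M Gaussian windows
    to form the segment feature matrix S (M x T), in the spirit of
    S_(m,t) = G_(m,t) * (x_{t-L+1}, ..., x_t)^T.

    Assumptions (not specified in this text): the sigma_m schedule below,
    and averaging the weighted frame vectors into one scalar per (m, t).
    """
    dims, T = X.shape
    n = np.arange(win_frames)
    S = np.zeros((M, T))
    for m in range(M):
        sigma = win_frames / (2.0 * (m + 1))   # assumed width schedule
        # Gaussian window centered on the last (current) frame of the block
        g = np.exp(-0.5 * ((n - (win_frames - 1)) / sigma) ** 2)
        g /= g.sum()                           # normalize window weights
        for t in range(T):
            lo = max(0, t - win_frames + 1)
            block = X[:, lo : t + 1]           # frames ending at x_t
            w = g[-block.shape[1]:]            # align weights to short blocks
            S[m, t] = np.mean(block @ w)       # weighted average -> scalar
    return S

S = segment_features(np.random.randn(65, 50))  # S has shape (M, T) = (3, 50)
```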
The invention also discloses a speech emotion recognition device based on multi-granularity dynamic-static fusion features, comprising:
A frame computing module, for calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
A segment-granularity feature extraction module, for calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
The frame computing module comprises:
A speech framing module, for dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, using a Hamming window as the window function, the frames serving as the minimum processing granularity in feature extraction;
A frame-granularity feature extraction module, for extracting an acoustic feature vector of the set dimension from each frame produced by the framing module; for each time signal containing T frames, a frame feature matrix is obtained.
In the segment-granularity feature extraction module, the obtained frame feature matrix is convolved using a segment length set in advance according to the auditory mechanism of the human brain and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group, and the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T).
In the speech framing module, a Hamming window is used as the window function, the frame length is set to 25 ms and the frame shift to 10 ms; the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction.
In the frame-granularity feature extraction module, a 65-dimension acoustic feature vector is extracted from each frame produced by the framing module. The 65 dimensions comprise: smoothed fundamental frequency (1), voiced probability (1), zero-crossing rate (1), MFCC (14), loudness (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1), and harmonics-to-noise ratio (1). x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector; for each time signal containing T frames, the frame feature matrix is then obtained.
In the segment-granularity feature extraction module, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the group. The final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T), computed from a Gaussian formula in which T_D is the time delay between two adjacent convolution windows.
The invention also discloses a speech emotion recognition system based on multi-granularity dynamic-static fusion features, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when called by the processor.
The invention also discloses a computer-readable storage medium storing a computer program, the computer program being configured to implement the steps of the method of the present invention when called by a processor.
The present invention proposes a speech emotion feature extraction and analysis method based on auditory cognition rules, and a speech emotion recognition method built on it. It relates to using this method to solve speech emotion recognition problems, including but not limited to artificial intelligence technology involving speech emotion recognition running on computers and machine terminals.
According to the cognitive rules the human brain exhibits over time in speech emotion recognition, the invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis, extracting features from speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both characterize the speaker's overall speech traits and describe how speech emotion features change over time, making the extracted features more effective.
The recognition algorithm uses a long short-term memory (LSTM) network model. The LSTM model can model time series effectively and make full use of the temporal information in the features. Moreover, the long- and short-term memory mechanism of the LSTM allows the network to selectively remember and recognize features from different moments, providing a feature fusion mechanism.
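The gating behavior described above can be sketched with a minimal LSTM forward pass. This is an illustrative, untrained cell, not the patent's recognizer; the hidden size, weight shapes and random inputs are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(xs, W, U, b, hidden=8):
    """Run a single LSTM cell over a feature sequence xs of shape
    (T, input_dim). W, U, b hold the stacked parameters of the four
    gates (input, forget, cell, output). Returns the final hidden state,
    which a classifier layer could map to emotion categories."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in xs:
        z = W @ x + U @ h + b              # all gate pre-activations at once
        i = sigmoid(z[:hidden])            # input gate
        f = sigmoid(z[hidden:2*hidden])    # forget gate: selective memory
        g = np.tanh(z[2*hidden:3*hidden])  # candidate cell state
        o = sigmoid(z[3*hidden:])          # output gate
        c = f * c + i * g                  # long-term memory update
        h = o * np.tanh(c)                 # short-term (output) state
    return h

rng = np.random.default_rng(0)
input_dim, hidden = 65, 8                  # 65-dim frame features, assumed hidden size
h = lstm_forward(rng.normal(size=(50, input_dim)),
                 W=rng.normal(scale=0.1, size=(4*hidden, input_dim)),
                 U=rng.normal(scale=0.1, size=(4*hidden, hidden)),
                 b=np.zeros(4*hidden))
```

The forget gate f is what lets the network selectively retain features from earlier moments while processing later ones, which is the fusion mechanism the text refers to.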
The above is a further detailed description of the present invention in conjunction with specific preferred embodiments, but the specific implementation of the invention shall not be considered limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or substitutions may be made without departing from the inventive concept, and all of these shall be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A speech emotion recognition method based on multi-granularity dynamic-static fusion features, characterized by comprising the following steps:
a first step, frame calculation: calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
a second step, segment-granularity feature extraction: calculating large-granularity static global features over the whole utterance, while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
2. The speech emotion recognition method according to claim 1, characterized in that the frame calculation step of the first step comprises:
sub-step 1, speech framing: using a Hamming window as the window function, dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, the frames serving as the minimum processing granularity in feature extraction;
sub-step 2, frame-granularity feature extraction: extracting an acoustic feature vector of the set dimension from each frame produced by the framing sub-step, a frame feature matrix being obtained for each time signal containing T frames;
and that in the segment-granularity feature extraction step of the second step, the obtained frame feature matrix is convolved using a segment length set in advance according to the auditory mechanism of the human brain and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group, and the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T) and (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L and ending at x_t.
3. The speech emotion recognition method according to claim 2, characterized in that in the speech framing sub-step, a Hamming window is used as the window function, the frame length is set to 25 ms and the frame shift to 10 ms, and the continuous speech segment to be recognized is divided into frames serving as the minimum processing granularity in feature extraction;
and that in the frame-granularity feature extraction sub-step, a 65-dimension acoustic feature vector is extracted from each frame produced by the framing sub-step, the 65 dimensions comprising: smoothed fundamental frequency (1), voiced probability (1), zero-crossing rate (1), MFCC (14), loudness (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1), and harmonics-to-noise ratio (1); x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector, and a frame feature matrix is obtained for each time signal containing T frames.
4. The speech emotion recognition method according to claim 3, characterized in that in the segment-granularity feature extraction step of the second step, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the group; the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T), computed from a Gaussian formula in which T_D is the time delay between two adjacent convolution windows.
5. A speech emotion recognition device based on multi-granularity dynamic-static fusion features, characterized by comprising:
a frame computing module, for calculating the prosodic features, spectrum-related features and voice-quality features of each frame frame by frame;
a segment-granularity feature extraction module, for calculating large-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features that can both characterize the speaker's overall speech traits and describe how speech emotion features change over time.
6. The speech emotion recognition device according to claim 5, characterized in that the frame computing module comprises:
a speech framing module, for dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, using a Hamming window as the window function, the frames serving as the minimum processing granularity in feature extraction;
a frame-granularity feature extraction module, for extracting an acoustic feature vector of the set dimension from each frame produced by the framing module, a frame feature matrix being obtained for each time signal containing T frames;
and that in the segment-granularity feature extraction module, the obtained frame feature matrix is convolved using a segment length set in advance according to the auditory mechanism of the human brain and a corresponding convolution function group G(M, T), where M is the number of convolution functions in the group, and the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T) and (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L and ending at x_t.
7. The speech emotion recognition device according to claim 6, characterized in that in the speech framing module, a Hamming window is used as the window function, the frame length is set to 25 ms and the frame shift to 10 ms, and the continuous speech segment to be recognized is divided into frames serving as the minimum processing granularity in feature extraction;
and that in the frame-granularity feature extraction module, a 65-dimension acoustic feature vector is extracted from each frame produced by the framing module, the 65 dimensions comprising: smoothed fundamental frequency (1), voiced probability (1), zero-crossing rate (1), MFCC (14), loudness (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1), and harmonics-to-noise ratio (1); x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector, and a frame feature matrix is obtained for each time signal containing T frames.
8. The speech emotion recognition device according to claim 7, characterized in that in the segment-granularity feature extraction module, each obtained 65×T frame feature matrix is convolved using the segment length L = 300 ms set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the group; the final segment feature matrix S_(M×T) is calculated as S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in G(M, T), computed from a Gaussian formula in which T_D is the time delay between two adjacent convolution windows.
9. A speech emotion recognition system based on multi-granularity dynamic-static fusion features, characterized by comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of any one of claims 1-4 when called by the processor.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program being configured to implement the steps of the method of any one of claims 1-4 when called by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910496244.6A CN110246518A (en) | 2019-06-10 | 2019-06-10 | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910496244.6A CN110246518A (en) | 2019-06-10 | 2019-06-10 | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110246518A true CN110246518A (en) | 2019-09-17 |
Family
ID=67886454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910496244.6A Pending CN110246518A (en) | 2019-06-10 | 2019-06-10 | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110246518A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291640A (en) * | 2020-01-20 | 2020-06-16 | 北京百度网讯科技有限公司 | Method and apparatus for recognizing gait |
CN113255630A (en) * | 2021-07-15 | 2021-08-13 | 浙江大华技术股份有限公司 | Moving target recognition training method, moving target recognition method and device |
CN113808619A (en) * | 2021-08-13 | 2021-12-17 | 北京百度网讯科技有限公司 | Voice emotion recognition method and device and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101930735A (en) * | 2009-06-23 | 2010-12-29 | 富士通株式会社 | Speech emotion recognition equipment and speech emotion recognition method |
CN103258532A (en) * | 2012-11-28 | 2013-08-21 | 河海大学常州校区 | Method for recognizing Chinese speech emotions based on fuzzy support vector machine |
CN103531206A (en) * | 2013-09-30 | 2014-01-22 | 华南理工大学 | Voice affective characteristic extraction method capable of combining local information and global information |
CN104835508A (en) * | 2015-04-01 | 2015-08-12 | 哈尔滨工业大学 | Speech feature screening method used for mixed-speech emotion recognition |
CN108564942A (en) * | 2018-04-04 | 2018-09-21 | 南京师范大学 | One kind being based on the adjustable speech-emotion recognition method of susceptibility and system |
US20190074028A1 (en) * | 2017-09-01 | 2019-03-07 | Newton Howard | Real-time vocal features extraction for automated emotional or mental state assessment |
- 2019-06-10: CN application CN201910496244.6A filed; patent CN110246518A (en), status Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101930735A (en) * | 2009-06-23 | 2010-12-29 | Fujitsu Ltd | Speech emotion recognition equipment and speech emotion recognition method |
CN103258532A (en) * | 2012-11-28 | 2013-08-21 | Hohai University Changzhou Campus | Method for recognizing Chinese speech emotions based on fuzzy support vector machine |
CN103531206A (en) * | 2013-09-30 | 2014-01-22 | South China University of Technology | Voice affective characteristic extraction method combining local and global information |
CN104835508A (en) * | 2015-04-01 | 2015-08-12 | Harbin Institute of Technology | Speech feature screening method for mixed-speech emotion recognition |
US20190074028A1 (en) * | 2017-09-01 | 2019-03-07 | Newton Howard | Real-time vocal features extraction for automated emotional or mental state assessment |
CN108564942A (en) * | 2018-04-04 | 2018-09-21 | Nanjing Normal University | Sensitivity-adjustable speech emotion recognition method and system |
Non-Patent Citations (3)
Title |
---|
Xu Cong: "Research on Multi-Granularity Analysis and Processing of Time-Series Signals Based on Convolutional Long Short-Term Memory Neural Networks", China Master's Theses Full-Text Database (Medicine & Health Sciences) * |
Bo Hongjian et al.: "Research on Dimensionality Reduction of Speech Emotion Features Based on Convolutional Neural Network Learning", High Technology Letters * |
Chen Jing et al.: "Dimensional Speech Emotion Recognition Based on Multi-Granularity Feature Fusion", Journal of Signal Processing * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291640A (en) * | 2020-01-20 | 2020-06-16 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and apparatus for recognizing gait |
CN111291640B (en) * | 2020-01-20 | 2023-02-17 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method and apparatus for recognizing gait |
CN113255630A (en) * | 2021-07-15 | 2021-08-13 | Zhejiang Dahua Technology Co., Ltd. | Moving target recognition training method, moving target recognition method and device |
CN113255630B (en) * | 2021-07-15 | 2021-10-15 | Zhejiang Dahua Technology Co., Ltd. | Moving target recognition training method, moving target recognition method and device |
CN113808619A (en) * | 2021-08-13 | 2021-12-17 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Voice emotion recognition method and device and electronic equipment |
CN113808619B (en) * | 2021-08-13 | 2023-10-20 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Voice emotion recognition method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cummins et al. | An image-based deep spectrum feature representation for the recognition of emotional speech | |
CN105632501B (en) | Automatic accent classification method and device based on deep learning | |
CN109326302A (en) | Speech enhancement method based on voiceprint comparison and generative adversarial networks | |
CN107945790A (en) | Emotion identification method and emotion recognition system | |
EP3469582A1 (en) | Neural network-based voiceprint information extraction method and apparatus | |
Mashao et al. | Combining classifier decisions for robust speaker identification | |
CN108900725A (en) | Voiceprint recognition method, device, terminal device and storage medium | |
CN108597496A (en) | Speech generation method and device based on generative adversarial networks | |
CN110246518A (en) | Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic and static fusion features | |
CN112786052B (en) | Speech recognition method, electronic equipment and storage device | |
Sailor et al. | Filterbank learning using convolutional restricted Boltzmann machine for speech recognition | |
Paulose et al. | Performance evaluation of different modeling methods and classifiers with MFCC and IHC features for speaker recognition | |
Sarkar et al. | Time-contrastive learning based deep bottleneck features for text-dependent speaker verification | |
CN106653002A (en) | Text live broadcast method and platform | |
CN108986798A (en) | Voice data processing method, device and equipment | |
CN106297769B (en) | Discriminative feature extraction method applied to language identification | |
Sinha et al. | Acoustic-phonetic feature based dialect identification in Hindi Speech | |
López-Espejo et al. | Improved external speaker-robust keyword spotting for hearing assistive devices | |
CN109377986A (en) | Non-parallel corpus voice personalization conversion method | |
Mahesha et al. | LP-Hillbert transform based MFCC for effective discrimination of stuttering dysfluencies | |
CN104464738B (en) | Voiceprint recognition method for intelligent mobile devices | |
Selva Nidhyananthan et al. | Assessment of dysarthric speech using Elman back propagation network (recurrent network) for speech recognition | |
Liu et al. | Using bidirectional associative memories for joint spectral envelope modeling in voice conversion | |
Chakroun et al. | Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments | |
CN106875944A (en) | Voice-controlled smart home terminal system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190917 ||