CN110246518A - Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features - Google Patents

Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features

Info

Publication number
CN110246518A
CN110246518A
Authority
CN
China
Prior art keywords
frame
dimension
feature
speech
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910496244.6A
Other languages
Chinese (zh)
Inventor
***
徐聪
马琳
薄洪健
丰上
陈婧
李洪伟
王子豪
孙聪珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Academy of Aerospace Technology
Original Assignee
Shenzhen Academy of Aerospace Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Academy of Aerospace Technology
Priority to CN201910496244.6A
Publication of CN110246518A
Pending legal status: Current

Links

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/45 — Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination
    • G10L25/63 — Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The present invention provides a speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features. The speech emotion recognition method includes the following steps: a first step, a frame calculation step, in which the prosodic features, spectrum-related features and voice quality features of each frame are calculated frame by frame; and a second step, a segment-granularity feature extraction step, in which coarse-granularity static global features are calculated over the whole utterance while a Gaussian window is used to convolve adjacent frame features along the time axis, yielding multi-granularity time-varying dynamic features. The beneficial effects of the present invention are: the present invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis, in which features are extracted from the speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both portray the speaker's overall speech characteristics and describe how the speech emotion features change over time, making the extracted features more effective.

Description

Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features
Technical field
The present invention relates to the field of speech processing technology, and in particular to a speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features.
Background technique
Traditional methods first extract acoustic features from the speech frame by frame, then perform statistical analysis over the frame features of the whole utterance to obtain the final features, and use a support vector machine (Support Vector Machine, SVM), a perceptron or the like as the classifier.
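For illustration only, a minimal sketch of such a traditional pipeline might look as follows; the mean/standard-deviation functionals and the use of scikit-learn's SVC are assumptions made for the example, not details taken from this patent.

```python
# Hypothetical sketch of the traditional pipeline described above: frame-level
# acoustic features are pooled into one global statistics vector per utterance
# and classified with an SVM.
import numpy as np
from sklearn.svm import SVC

def utterance_statistics(frame_features: np.ndarray) -> np.ndarray:
    """frame_features: (num_frames, num_features) matrix of frame-level descriptors."""
    # Global static functionals: mean and standard deviation over all frames.
    return np.concatenate([frame_features.mean(axis=0), frame_features.std(axis=0)])

def train_baseline(utterances, labels):
    """utterances: list of (num_frames, num_features) arrays; labels: emotion class ids."""
    X = np.stack([utterance_statistics(u) for u in utterances])
    clf = SVC(kernel="rbf")  # SVM classifier, as used by the traditional method
    clf.fit(X, labels)
    return clf
```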
With traditional feature extraction methods, the extracted features are global static features of the whole utterance and cannot reflect the dynamic variation of speech emotion while the speaker is talking; nor is the classifier designed or optimized for the dynamic variation information in the speech.
Summary of the invention
The present invention provides a speech emotion recognition method based on multi-granularity dynamic-static fusion features, comprising the following steps: a first step, a frame calculation step: calculating the prosodic features, spectrum-related features and voice quality features of each frame, frame by frame; and a second step, a segment-granularity feature extraction step: calculating coarse-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis to obtain multi-granularity time-varying dynamic features, so that the multi-granularity time-varying dynamic features can portray the speaker's overall speech characteristics and describe how the speech emotion features change over time.
The present invention also provides a speech emotion recognition device based on multi-granularity dynamic-static fusion features, comprising: a frame calculation module, for calculating the prosodic features, spectrum-related features and voice quality features of each frame, frame by frame; and a segment-granularity feature extraction module, for calculating coarse-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis to obtain multi-granularity time-varying dynamic features.
The present invention also provides a speech emotion recognition system based on multi-granularity dynamic-static fusion features, comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when called by the processor.
The present invention also provides a computer-readable storage medium, the computer-readable storage medium storing a computer program, and the computer program being configured to implement the steps of the method of the present invention when called by a processor.
The beneficial effects of the present invention are: in accordance with the temporal cognition rules that the human brain exhibits in speech emotion recognition, the present invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis, in which features are extracted from the speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can both portray the speaker's overall speech characteristics and describe how the speech emotion features change over time, making the extracted features more effective.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Specific embodiment
The invention discloses a speech emotion recognition method based on multi-granularity dynamic-static fusion features, which adopts a multi-granularity dynamic-static feature fusion analysis technique: the prosodic features, spectral features, voice quality features and the like of each frame are first calculated frame by frame, and coarse-granularity static global features are then calculated over the whole utterance. At the same time, a Gaussian window is used to convolve adjacent frame features along the time axis, yielding multi-granularity time-varying dynamic features, so that the features can portray the speaker's overall speech characteristics and describe how the speech emotion features change over time.
The speech emotion recognition method based on multi-granularity dynamic-static fusion features includes the following steps:
First step, frame calculation step: the prosodic features, spectrum-related features and voice quality features of each frame are calculated frame by frame;
Second step, segment-granularity feature extraction step: coarse-granularity static global features are calculated over the whole utterance while a Gaussian window is used to convolve adjacent frame features along the time axis, obtaining multi-granularity time-varying dynamic features, so that the multi-granularity time-varying dynamic features can portray the speaker's overall speech characteristics and describe how the speech emotion features change over time.
The first step, the frame calculation step, includes the following steps:
Step 1, speech framing step: using a Hamming window as the window function, with the frame length set to 25 ms and the frame shift set to 10 ms, the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction;
Step 2, frame-granularity feature extraction step: for each frame obtained in the speech framing step, 65-dimensional acoustic features are extracted, including fundamental frequency, short-time energy, short-time average energy, zero-crossing rate, average amplitude difference, formants, MFCC and so on, as listed below:
Smoothed fundamental frequency (1 dimension), voicing probability (1), zero-crossing rate (1), MFCC (14), energy (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1), harmonic-to-noise ratio (1); 65 dimensions in total.
Here x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector; for each time signal containing T frames, a frame feature matrix X_(65×T) = (x_1, x_2, …, x_T) is then obtained.
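As a minimal, illustrative sketch of this frame-level stage (assuming librosa as the signal-processing library and computing only a few of the 65 descriptors, namely MFCC, zero-crossing rate and RMS energy), the framing parameters above could be applied as follows:

```python
# Illustrative sketch (not the patent's exact extractor): frame the signal with
# a 25 ms Hamming window and a 10 ms shift, and stack a few per-frame
# descriptors into a (num_features x T) frame feature matrix.
import numpy as np
import librosa

def frame_feature_matrix(wav_path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)
    frame_length = int(0.025 * sr)   # 25 ms frame length
    hop_length = int(0.010 * sr)     # 10 ms frame shift

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=14,
                                n_fft=frame_length, hop_length=hop_length,
                                window="hamming")                     # (14, T)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_length,
                                             hop_length=hop_length)   # (1, T)
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)                   # (1, T)

    # Columns are the frame feature vectors x_t; the full method would use all
    # 65 descriptors (F0, voicing probability, jitter, shimmer, HNR, ...).
    T = min(mfcc.shape[1], zcr.shape[1], rms.shape[1])
    return np.vstack([mfcc[:, :T], zcr[:, :T], rms[:, :T]])
```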
In the second step, the segment-granularity feature extraction step, for each obtained frame feature matrix of size 65×T, we use the segment length L = 300 ms, set in advance according to the auditory mechanism of the human brain, and the corresponding convolution function group G(M, T) to perform the convolution, where M is the number of convolution functions in the convolution function group; the final segment feature matrix S_(M×T) is calculated by the following formula,
S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T
where (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L and ending at x_t, and G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T), whose width parameter σ_m is predefined; T_D denotes the time delay between two adjacent convolution windows and is here equal to the length of one frame.
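A small numpy sketch of this segment-granularity computation is given below, under stated assumptions: the patent does not reproduce the exact Gaussian and σ_m formulas here, so each G_m is taken to be a normalized Gaussian window over the frames of one 300 ms segment (30 frames at a 10 ms frame shift), the σ_m values are placeholders, and S is interpreted as a Gaussian-weighted combination of the frame feature vectors in the window ending at frame t.

```python
# Hypothetical sketch of the segment-granularity Gaussian-window convolution.
# Assumptions (not taken from the patent): normalized Gaussian windows, the
# listed sigma values, and S[m, :, t] as the Gaussian-weighted sum of the frame
# feature vectors x_{t-L+1}, ..., x_t.
import numpy as np

def gaussian_window(L: int, sigma: float) -> np.ndarray:
    tau = np.arange(L)
    g = np.exp(-0.5 * ((tau - (L - 1)) / sigma) ** 2)  # peak at the window end
    return g / g.sum()

def segment_features(X: np.ndarray, L: int = 30, sigmas=(3.0, 10.0, 30.0)) -> np.ndarray:
    """X: (D, T) frame feature matrix; L: segment length in frames
    (300 ms / 10 ms frame shift = 30 frames); returns an (M, D, T) tensor."""
    D, T = X.shape
    M = len(sigmas)
    X_pad = np.concatenate([np.zeros((D, L - 1)), X], axis=1)  # pad utterance start
    S = np.zeros((M, D, T))
    for m, sigma in enumerate(sigmas):
        g = gaussian_window(L, sigma)
        for t in range(T):
            window = X_pad[:, t:t + L]   # frames x_{t-L+1} ... x_t
            S[m, :, t] = window @ g      # Gaussian-weighted combination
    return S
```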
The invention also discloses a speech emotion recognition device based on multi-granularity dynamic-static fusion features, comprising:
A frame calculation module: for calculating the prosodic features, spectrum-related features and voice quality features of each frame, frame by frame;
A segment-granularity feature extraction module: for calculating coarse-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis to obtain multi-granularity time-varying dynamic features, so that the multi-granularity time-varying dynamic features can portray the speaker's overall speech characteristics and describe how the speech emotion features change over time.
The frame calculation module comprises:
A speech framing module: for dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, using a Hamming window as the window function, the frames serving as the minimum processing granularity in feature extraction;
A frame-granularity feature extraction module: for extracting acoustic features of a set dimension for each frame obtained by the speech framing module, whereby a frame feature matrix can be obtained for each time signal containing T frames.
In the segment-granularity feature extraction module, for the obtained frame feature matrix, the convolution is performed using the segment length set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the convolution function group, and the final segment feature matrix S_(M×T) is calculated by the formula S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T).
In the speech framing module, a Hamming window is used as the window function, the frame length is set to 25 ms, the frame shift is 10 ms, and the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction.
In the frame-granularity feature extraction module, 65-dimensional acoustic features are extracted for each frame obtained by the speech framing module. The 65-dimensional acoustic features include: smoothed fundamental frequency (1 dimension), voicing probability (1), zero-crossing rate (1), MFCC (14), energy (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1) and harmonic-to-noise ratio (1). x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector; a frame feature matrix X_(65×T) = (x_1, x_2, …, x_T) can then be obtained for each time signal containing T frames.
In the segment-granularity feature extraction module, for each obtained frame feature matrix of size 65×T, the convolution is performed using the segment length L = 300 ms, set in advance according to the auditory mechanism of the human brain, and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the convolution function group; the final segment feature matrix S_(M×T) is calculated by S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T) and T_D is the time delay between two adjacent convolution windows.
The invention also discloses a speech emotion recognition system based on multi-granularity dynamic-static fusion features, comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the present invention when called by the processor.
The invention also discloses a computer-readable storage medium, the computer-readable storage medium storing a computer program, and the computer program being configured to implement the steps of the method of the present invention when called by a processor.
The present invention proposes a speech emotion feature extraction and analysis method based on auditory cognition rules, and a speech emotion recognition method built upon it; it relates to using this method to solve the speech emotion recognition problem, including but not limited to artificial intelligence technologies involving speech emotion recognition that run on computers and machine terminals.
According to the temporal cognition rules that the human brain exhibits in speech emotion recognition, the present invention proposes a multi-granularity dynamic-static feature fusion technique for emotional speech analysis, in which features are extracted from the speech at three different granularities to obtain multi-granularity time-varying dynamic features, so that the features can portray the speaker's overall speech characteristics and describe how the speech emotion features change over time, making the extracted features more effective.
In the recognition algorithm, a Long Short-Term Memory (LSTM) network model is used. The LSTM model can model time series effectively and make full use of the temporal information in the features. In addition, the long and short-term memory mechanism of the LSTM allows the network to selectively remember and recognize the features of different moments, which provides a feature fusion mechanism.
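As an illustration of how such a classifier could consume the resulting feature sequence, the following PyTorch sketch is offered; the framework, layer sizes, number of emotion classes and the use of the final hidden state are assumptions made for the example, since the patent only states that an LSTM network model is used.

```python
# Illustrative LSTM emotion classifier over a sequence of fused features.
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 128, num_emotions: int = 4):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, T, input_dim) sequence of frame/segment feature vectors
        _, (h_n, _) = self.lstm(features)
        return self.classifier(h_n[-1])   # logits over emotion classes

# Usage sketch: a batch of 8 utterances, 200 frames each, 65-dim features.
model = EmotionLSTM(input_dim=65)
logits = model(torch.randn(8, 200, 65))
```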
The above content is a further detailed description of the present invention in conjunction with specific preferred embodiments, but it cannot be concluded that the specific implementation of the present invention is limited to these descriptions. For those of ordinary skill in the art to which the present invention belongs, a number of simple deductions or substitutions may also be made without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A speech emotion recognition method based on multi-granularity dynamic-static fusion features, characterized by comprising the following steps:
A first step, a frame calculation step: calculating the prosodic features, spectrum-related features and voice quality features of each frame, frame by frame;
A second step, a segment-granularity feature extraction step: calculating coarse-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis to obtain multi-granularity time-varying dynamic features, so that the multi-granularity time-varying dynamic features can portray the speaker's overall speech characteristics and describe how the speech emotion features change over time.
2. The speech emotion recognition method according to claim 1, characterized in that the first step, the frame calculation step, includes the following steps:
Step 1, a speech framing step: using a Hamming window as the window function and dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, the frames serving as the minimum processing granularity in feature extraction;
Step 2, a frame-granularity feature extraction step: extracting acoustic features of a set dimension for each frame obtained in the speech framing step, whereby a frame feature matrix can be obtained for each time signal containing T frames;
In the second step, the segment-granularity feature extraction step, for the obtained frame feature matrix, the convolution is performed using the segment length set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the convolution function group, and the final segment feature matrix S_(M×T) is calculated by the formula S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T) and (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L and ending at x_t.
3. The speech emotion recognition method according to claim 2, characterized in that in step 1, the speech framing step, a Hamming window is used as the window function, the frame length is set to 25 ms, the frame shift is 10 ms, and the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction;
In step 2, the frame-granularity feature extraction step, 65-dimensional acoustic features are extracted for each frame obtained in the speech framing step, the 65-dimensional acoustic features including: smoothed fundamental frequency (1 dimension), voicing probability (1), zero-crossing rate (1), MFCC (14), energy (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1) and harmonic-to-noise ratio (1); x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector, and a frame feature matrix X_(65×T) = (x_1, x_2, …, x_T) is then obtained for each time signal containing T frames.
4. The speech emotion recognition method according to claim 3, characterized in that in the second step, the segment-granularity feature extraction step, for each obtained frame feature matrix of size 65×T, the convolution is performed using the segment length L = 300 ms, set in advance according to the auditory mechanism of the human brain, and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the convolution function group; the final segment feature matrix S_(M×T) is calculated by S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T) and T_D is the time delay between two adjacent convolution windows.
5. A speech emotion recognition device based on multi-granularity dynamic-static fusion features, characterized by comprising:
A frame calculation module: for calculating the prosodic features, spectrum-related features and voice quality features of each frame, frame by frame;
A segment-granularity feature extraction module: for calculating coarse-granularity static global features over the whole utterance while using a Gaussian window to convolve adjacent frame features along the time axis to obtain multi-granularity time-varying dynamic features, so that the multi-granularity time-varying dynamic features can portray the speaker's overall speech characteristics and describe how the speech emotion features change over time.
6. The speech emotion recognition device according to claim 5, characterized in that the frame calculation module comprises:
A speech framing module: for dividing the continuous speech segment to be recognized into frames according to the set frame length and frame shift, using a Hamming window as the window function, the frames serving as the minimum processing granularity in feature extraction;
A frame-granularity feature extraction module: for extracting acoustic features of a set dimension for each frame obtained by the speech framing module, whereby a frame feature matrix can be obtained for each time signal containing T frames; in the segment-granularity feature extraction module, for the obtained frame feature matrix, the convolution is performed using the segment length set in advance according to the auditory mechanism of the human brain and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the convolution function group, and the final segment feature matrix S_(M×T) is calculated by S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T) and (x_(t-L+1), x_(t-L+2), …, x_t)^T is the frame feature matrix covered by the convolution window of segment length L and ending at x_t.
7. The speech emotion recognition device according to claim 6, characterized in that in the speech framing module, a Hamming window is used as the window function, the frame length is set to 25 ms, the frame shift is 10 ms, and the continuous speech segment to be recognized is divided into frames, which serve as the minimum processing granularity in feature extraction;
In the frame-granularity feature extraction module, 65-dimensional acoustic features are extracted for each frame obtained by the speech framing module, the 65-dimensional acoustic features including: smoothed fundamental frequency (1 dimension), voicing probability (1), zero-crossing rate (1), MFCC (14), energy (1), auditory spectrum filtering (28), spectral energy (15), local frequency jitter (1), inter-frame frequency jitter (1), local amplitude shimmer (1) and harmonic-to-noise ratio (1); x_t = (a_(t,1), a_(t,2), …, a_(t,65)) denotes the feature vector of the t-th frame, where 65 is the dimension of the frame feature vector, and a frame feature matrix X_(65×T) = (x_1, x_2, …, x_T) can then be obtained for each time signal containing T frames.
8. The speech emotion recognition device according to claim 7, characterized in that in the segment-granularity feature extraction module, for each obtained frame feature matrix of size 65×T, the convolution is performed using the segment length L = 300 ms, set in advance according to the auditory mechanism of the human brain, and the corresponding convolution function group G(M, T), where M is the number of convolution functions in the convolution function group; the final segment feature matrix S_(M×T) is calculated by S_(m,t) = G_(m,t) * (x_(t-L+1), x_(t-L+2), …, x_t)^T, where G_(m,t) is the m-th Gaussian function in the convolution function group G(M, T) and T_D is the time delay between two adjacent convolution windows.
9. A speech emotion recognition system based on multi-granularity dynamic-static fusion features, characterized by comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of any one of claims 1-4 when called by the processor.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and the computer program is configured to implement the steps of the method of any one of claims 1-4 when called by a processor.
CN201910496244.6A 2019-06-10 2019-06-10 Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features Pending CN110246518A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910496244.6A 2019-06-10 2019-06-10 Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910496244.6A 2019-06-10 2019-06-10 Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features

Publications (1)

Publication Number Publication Date
CN110246518A true CN110246518A (en) 2019-09-17

Family

ID=67886454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910496244.6A CN110246518A (en) 2019-06-10 2019-06-10 Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features

Country Status (1)

Country Link
CN (1) CN110246518A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291640A (en) * 2020-01-20 2020-06-16 北京百度网讯科技有限公司 Method and apparatus for recognizing gait
CN113255630A (en) * 2021-07-15 2021-08-13 浙江大华技术股份有限公司 Moving target recognition training method, moving target recognition method and device
CN113808619A (en) * 2021-08-13 2021-12-17 北京百度网讯科技有限公司 Voice emotion recognition method and device and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930735A (en) * 2009-06-23 2010-12-29 富士通株式会社 Speech emotion recognition equipment and speech emotion recognition method
CN103258532A (en) * 2012-11-28 2013-08-21 河海大学常州校区 Method for recognizing Chinese speech emotions based on fuzzy support vector machine
CN103531206A (en) * 2013-09-30 2014-01-22 华南理工大学 Voice affective characteristic extraction method capable of combining local information and global information
CN104835508A (en) * 2015-04-01 2015-08-12 哈尔滨工业大学 Speech feature screening method used for mixed-speech emotion recognition
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930735A (en) * 2009-06-23 2010-12-29 富士通株式会社 Speech emotion recognition equipment and speech emotion recognition method
CN103258532A (en) * 2012-11-28 2013-08-21 河海大学常州校区 Method for recognizing Chinese speech emotions based on fuzzy support vector machine
CN103531206A (en) * 2013-09-30 2014-01-22 华南理工大学 Voice affective characteristic extraction method capable of combining local information and global information
CN104835508A (en) * 2015-04-01 2015-08-12 哈尔滨工业大学 Speech feature screening method used for mixed-speech emotion recognition
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN108564942A (en) * 2018-04-04 2018-09-21 南京师范大学 One kind being based on the adjustable speech-emotion recognition method of susceptibility and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
徐聪: "Research on Multi-Granularity Analysis and Processing of Time-Series Signals Based on Convolutional Long Short-Term Memory Neural Networks", China Master's Theses Full-text Database (Medicine & Health Sciences) *
薄洪健 et al.: "Research on Dimensionality Reduction of Speech Emotion Features Based on Convolutional Neural Network Learning", High Technology Letters *
陈婧 et al.: "A Dimensional Speech Emotion Recognition Method with Multi-Granularity Feature Fusion", Journal of Signal Processing *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291640A (en) * 2020-01-20 2020-06-16 北京百度网讯科技有限公司 Method and apparatus for recognizing gait
CN111291640B (en) * 2020-01-20 2023-02-17 北京百度网讯科技有限公司 Method and apparatus for recognizing gait
CN113255630A (en) * 2021-07-15 2021-08-13 浙江大华技术股份有限公司 Moving target recognition training method, moving target recognition method and device
CN113255630B (en) * 2021-07-15 2021-10-15 浙江大华技术股份有限公司 Moving target recognition training method, moving target recognition method and device
CN113808619A (en) * 2021-08-13 2021-12-17 北京百度网讯科技有限公司 Voice emotion recognition method and device and electronic equipment
CN113808619B (en) * 2021-08-13 2023-10-20 北京百度网讯科技有限公司 Voice emotion recognition method and device and electronic equipment

Similar Documents

Publication Publication Date Title
Cummins et al. An image-based deep spectrum feature representation for the recognition of emotional speech
CN105632501B (en) A kind of automatic accent classification method and device based on depth learning technology
CN109326302A (en) A kind of sound enhancement method comparing and generate confrontation network based on vocal print
CN107945790A (en) A kind of emotion identification method and emotion recognition system
EP3469582A1 (en) Neural network-based voiceprint information extraction method and apparatus
Mashao et al. Combining classifier decisions for robust speaker identification
CN108900725A (en) A kind of method for recognizing sound-groove, device, terminal device and storage medium
CN108597496A (en) A kind of speech production method and device for fighting network based on production
CN110246518A (en) Speech emotion recognition method, device, system and storage medium based on multi-granularity dynamic-static fusion features
CN112786052B (en) Speech recognition method, electronic equipment and storage device
Sailor et al. Filterbank learning using convolutional restricted Boltzmann machine for speech recognition
Paulose et al. Performance evaluation of different modeling methods and classifiers with MFCC and IHC features for speaker recognition
Sarkar et al. Time-contrastive learning based deep bottleneck features for text-dependent speaker verification
CN106653002A (en) Literal live broadcasting method and platform
CN108986798A (en) Processing method, device and the equipment of voice data
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
Sinha et al. Acoustic-phonetic feature based dialect identification in Hindi Speech
López-Espejo et al. Improved external speaker-robust keyword spotting for hearing assistive devices
CN109377986A (en) A kind of non-parallel corpus voice personalization conversion method
Mahesha et al. LP-Hillbert transform based MFCC for effective discrimination of stuttering dysfluencies
CN104464738B (en) A kind of method for recognizing sound-groove towards Intelligent mobile equipment
Selva Nidhyananthan et al. Assessment of dysarthric speech using Elman back propagation network (recurrent network) for speech recognition
Liu et al. Using bidirectional associative memories for joint spectral envelope modeling in voice conversion
Chakroun et al. Efficient text-independent speaker recognition with short utterances in both clean and uncontrolled environments
CN106875944A (en) A kind of system of Voice command home intelligent terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190917

RJ01 Rejection of invention patent application after publication