CN110534133A - A speech emotion recognition system and speech emotion recognition method - Google Patents
- Publication number: CN110534133A
- Application number: CN201910803429.7A
- Authority: CN (China)
- Prior art keywords: time step, module, layer, emotion recognition, spectrogram feature
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
- G06N3/063 — Computing arrangements based on biological models; neural networks; physical realisation of neural networks, neurons or parts of neurons using electronic means
- G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
- G10L25/63 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a speech emotion recognition system comprising, connected in sequence: an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module. The invention also discloses a speech emotion recognition method applied to the system, comprising the following steps: 1. preprocess the speech to obtain a spectrogram feature map; 2. operate on the spectrogram feature map to build a feature map containing shallow audio information; 3. further process that feature map to obtain deeper semantic and contextual information; 4. process the feature map carrying the deeper semantic and contextual information to obtain the feature vector of the whole utterance that correlates most strongly with the speaker's emotion; 5. output the emotion category corresponding to the whole utterance. Compared with the prior art, the speech emotion recognition performance of the invention is considerably improved.
Description
Technical field
The present invention relates to the fields of artificial intelligence and speech recognition, and in particular to a speech emotion recognition system and a speech emotion recognition method. It is an end-to-end deep neural network technique that uses DFSMN as the base network and improves upon it.
Background technique
With the steady progress of speech recognition technology and the wide adoption of speech recognition devices, human-computer interaction has become increasingly common in daily life. However, most of these devices can only recognize the textual content of human language; they cannot identify the affective state of the speaker. Yet speech emotion recognition has many useful applications in human-centered services and human-computer interaction, such as intelligent service robots, automated call centers and distance education. It has therefore attracted considerable research attention, and many methods have been proposed. With the rapid development of machine learning in recent years (for example, CNNs and other deep neural networks), such methods have shown good performance after trials and refinements in many fields. How to apply deep learning to this field is still being explored, and in practical applications many problems remain to be solved for this challenging task. Putting speech emotion recognition technology into practice requires collecting massive amounts of complex and difficult audio data for study, and making audio recorded in clean environments closer to speech under real-world conditions is a major unsolved problem in the prior art.
A typical speech emotion recognition (SER) system takes a speech waveform as input and outputs one of several target emotion categories. Traditional SER systems use Gaussian mixture models (GMMs) (Neiberg D, Elenius K, Laskowski K. Emotion recognition in spontaneous speech using GMMs [C] // Ninth International Conference on Spoken Language Processing. 2006.), hidden Markov models (HMMs) (Nwe T L, Foo S W, De Silva L C. Speech emotion recognition using hidden Markov models [J]. Speech Communication, 2003, 41(4): 603-623.), support vector machines (SVMs) (Yang N, Yuan J, Zhou Y, et al. Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification [J]. International Journal of Speech Technology, 2017, 20(1): 27-41.) and long short-term memory networks (LSTMs) (Tao F, Liu G. Advanced LSTM: A study about better time dependency modeling in emotion recognition [C] // 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 2906-2910.). These systems share one significant problem: they all rely on mature hand-crafted speech features, and the choice of features strongly affects model performance. The features typically include frame-level spectral, cepstral, pitch and energy features; statistical functionals of these features are then applied across multiple frames to obtain an utterance-level feature vector.
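As an illustration of the frame-level-plus-functionals pipeline described above, the following sketch pools a matrix of frame-level features into one utterance-level vector. This is a minimal example, not code from the patent; the particular choice of functionals (mean, standard deviation, min, max) is an assumption.

```python
import numpy as np

def utterance_vector(frame_feats: np.ndarray) -> np.ndarray:
    """Pool a (T, D) matrix of frame-level features (e.g. cepstral, pitch,
    energy) into one utterance-level vector by applying statistical
    functionals over the time axis."""
    stats = [frame_feats.mean(axis=0),
             frame_feats.std(axis=0),
             frame_feats.min(axis=0),
             frame_feats.max(axis=0)]
    return np.concatenate(stats)  # shape (4 * D,)
```

A classifier such as an SVM would then be trained on these fixed-length vectors, one per utterance.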
With the explosive growth of deep learning, some researchers have explored deep learning methods to build more robust SER models. Zhang Z et al. (Zhang Z, Ringeval F, Han J, et al. Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks [C] // Proceedings INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association (ISCA). 2016: 3593-3597.) proposed a feature enhancement algorithm based on a long short-term memory (LSTM) autoencoder for extracting emotion information from speech. Recurrent neural networks (RNNs) have likewise proved to have strong sequence modeling ability, especially in speech recognition tasks. However, RNN training depends on backpropagation through time (BPTT), which, owing to its computational complexity, can be time-consuming and can suffer from vanishing and exploding gradients. To address these problems, the feedforward sequential memory network (hereinafter: FSMN) was proposed. In recent years, a large body of research has shown that in tasks such as speech recognition and language modeling, FSMN can model long-term dependencies without any recurrent feedback. Furthermore, Zhang S et al. (Zhang S, Lei M, Yan Z, et al. Deep-FSMN for large vocabulary continuous speech recognition [C] // 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5869-5873.), in order to build a deeper network structure, applied skip connections within FSMN and greatly improved the earlier model.
Research on SER can be traced back to the 1980s. But owing to variation in gender, speaker, language, recording environment and other factors, SER remains challenging in practical applications. Many researchers have tried to solve these problems by designing elaborate hand-crafted speech features, so as to strengthen the connection with human emotion. However, these manually extracted speech features are applicable only to specific tasks and generalize poorly. The resulting need to design different speech features for different speech-related tasks runs counter to the spirit of deep learning.
Summary of the invention
In view of the deficiencies of the prior art, an object of the present invention is to provide a speech emotion recognition system, which is an end-to-end feedforward deep neural network structure.
In view of the deficiencies of the prior art, a further object of the present invention is to provide a speech emotion recognition method applied to the speech emotion recognition system.
To achieve the object of the present invention, the following technical solution is adopted: a speech emotion recognition system comprising, connected in sequence, an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module. The CNN module has convolutional layers, and the pyramid FSMN module has a pyramid memory block structure.

The audio preprocessing module converts the received raw audio data into a spectrogram feature map.

The CNN module performs preliminary processing on the spectrogram feature map, building a feature map containing shallow information.

The pyramid FSMN module further processes the feature map containing shallow information to obtain deeper semantic and contextual information.

The time-step attention module attends to specific regions along the time-step dimension and computes the weight of each time step's influence on the final emotion recognition.

The output module has several emotion categories and outputs the emotion category that best matches the raw audio data.
The time-step attention module can be expressed by the following formulas:

a_t = Average(h_t),

s = softmax(W_2 · f(W_1 · a + b_1) + b_2),

Y = X · s,

where a_t is the mean of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 is the weight parameter of the first layer of the time-step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is any activation function, and a is the feature vector composed of all a_t; Y is the output result, and X is the input of the time-step attention module.
When a convolutional layer performs a convolution operation with a kernel of size k and a stride of size s, its output is calculated by the following formulas:

W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
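The output-size formulas above can be captured in a small helper. This is an illustrative sketch of the stated formulas (valid-style convolution without padding), not code from the patent.

```python
def conv_output_size(w_in, h_in, k, s):
    """Output width and height of a convolution (or pooling) layer with a
    k-by-k kernel moved with stride s and no padding, per the formulas
    W_out = (W_in - k)/s + 1 and H_out = (H_in - k)/s + 1."""
    w_out = (w_in - k) // s + 1
    h_out = (h_in - k) // s + 1
    return w_out, h_out
```

For example, a 3x3 kernel with stride 2 over a 128-wide, 300-high feature map yields a 63-by-149 output, so each such layer roughly halves the feature map in both dimensions.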
Using the pyramid memory block structure, the time steps h_t within a forward (look-back) window of length N_1 and a backward (look-ahead) window of length N_2 are encoded into a fixed-size representation, and their weighted sum is taken as the current output:

h̃_t = f(h_t + Σ_{i=1..N_1} a_i · h_{t-i} + Σ_{j=1..N_2} b_j · h_{t+j}),

where h̃_t is the output of the memory block at time t, f is any activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.

The pyramid memory block structure can use skip connections; the relationship between the input and output of a skip connection is given by:

h̃_t^l = H(h̃_t^{l-1}) + p_t^l + Σ_{i=1..N_1^l} a_i^l · p_{t-s_1·i}^l + Σ_{j=1..N_2^l} b_j^l · p_{t+s_2·j}^l,

h_t^{l+1} = f(W^l · h̃_t^l + b^{l+1}),

where h̃_t^l is the output of the l-th layer memory block at time t, H is any activation function, h̃_t^{l-1} is the output of the (l-1)-th layer memory block, p_t^l is the input of the l-th layer memory block at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, p_{t-s_1·i}^l is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, p_{t+s_2·j}^l is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer at layer l+1 and time t, W^l is the weight parameter of the l-th layer memory block, and b^{l+1} is its bias.
The convolutional layers may be two in number; the shallow information may be audio loudness or frequency; and the several emotion categories may be four: happy, sad, angry and neutral.
To realize another object of the present invention, the following technical solution is adopted: a speech emotion recognition method applied to the speech emotion recognition system, comprising the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and regularization on the received speech, obtains a spectrogram feature map, and feeds it to the CNN module.
Step 2: the CNN module performs convolution operations on the received spectrogram feature map, building a spectrogram feature map that contains shallow audio information (such as audio loudness and frequency).
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains the deeper semantic and contextual information in the feature map, such as the speaker's gender and the speaker's emotion contained in a segment of speech.
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic and contextual information. It first computes an attention score for each time step and then uses the scores to form a weighted sum of the whole feature map along the time-step dimension, obtaining the feature vector of the whole utterance that correlates most strongly with the speaker's emotion. This time-step attention module lets the model focus on the parts relevant to the speaker's emotion and improves model robustness.
Step 5: the feature vector is fed to the output module. Each dimension of the vector represents the probability of the corresponding emotion category, and the category whose dimension has the highest probability is taken as the final output, so that the emotion category corresponding to the whole utterance is produced, i.e. the model classifies the predicted emotion. The output module is a fully connected layer whose output is a feature vector of length 4.
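The five steps above chain the four processing modules into one forward pass. The sketch below shows that chaining; the module callables (`preprocess`, `cnn`, `pyramid_fsmn`, `attention`, `output_layer`) are hypothetical stand-ins for the components described in the text, not an implementation from the patent.

```python
def recognize_emotion(wav, preprocess, cnn, pyramid_fsmn, attention, output_layer):
    """End-to-end forward pass of the described system, one module per step."""
    spec = preprocess(wav)        # step 1: spectrogram feature map
    shallow = cnn(spec)           # step 2: shallow audio information
    deep = pyramid_fsmn(shallow)  # step 3: deeper semantic/contextual info
    pooled = attention(deep)      # step 4: emotion-relevant weighting
    return output_layer(pooled)   # step 5: predicted emotion category
```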
Technical problem solved by the invention: the speech emotion recognition problem is solved with deep learning. A segment of speech contains a great deal of information, for example the speaker's gender, background noise, the speech content and the speaker's affective state, which makes speech emotion recognition very difficult and challenging. Likewise, although deep-learning-based speech emotion recognition has received some study, most of that work is based on LSTM, which inherently suffers from problems such as a huge parameter count and difficult training. In summary, existing speech emotion recognition technology still faces many problems that have not been well solved.
Advantages of the present invention:
1. In tasks such as speech recognition and language modeling, FSMN can model long-term dependencies without any recurrent feedback. Building on these findings, the invention proposes an end-to-end feedforward deep neural network for the speech emotion recognition task. Dispensing with LSTM not only greatly increases the model's recognition speed but also effectively reduces training time. Compared with traditional methods, the invention does not use various manually extracted audio features as model input but directly uses the raw spectrogram, which contains more of the original speech information and makes the model generalize better. It also reduces the complexity of model construction, since different input features need not be designed for different models.
2. Unlike most deep-learning-based speech emotion recognition research, the invention uses neither recurrent neural networks nor their variants. Instead it uses DFSMN, a standard feedforward fully connected neural network, as the base network, and proposes a pyramid memory block structure on top of DFSMN, making the whole model more robust and able to extract higher-level semantic information as the network deepens.
3. The bottom of the model is a two-layer convolutional stack rather than DFSMN layers directly. In addition, as the network deepens, downsampling is used to extract more robust features and greatly reduce the feature size.
4. To make the model focus on emotion-relevant information and avoid interference from other information, the invention also proposes an attention mechanism over time steps, applying attention to the output of the pyramid FSMN. With attention, each element of the output sequence depends on specific elements of the input sequence; this increases the model's computational burden but yields a more accurate, better-performing model. The validity of the invention's end-to-end network is demonstrated on the IEMOCAP speech emotion dataset.
5. The proposed end-to-end deep neural network structure, and the methods designed for each problem, run effectively and were well confirmed in experiments; moreover, testing is 3.3 times faster than the original model. Analysis and verification show that the speech emotion recognition performance of the invention is considerably improved.
Brief description of the drawings
Fig. 1 is the structure diagram of the speech emotion recognition system; pFSMN in the figure denotes pyramid FSMN.
Fig. 2a is the structure diagram of FSMN.
Fig. 2b is the structure diagram of DFSMN.
Fig. 3 is the structure diagram of the time-step attention module.
Fig. 4 is the flow chart of the speech emotion recognition method.
Specific embodiment
Embodiment
The present invention is further illustrated below with reference to a specific embodiment.
As shown in Fig. 1, a speech emotion recognition system comprises, connected in sequence: an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module. The CNN module has convolutional layers, and the pyramid FSMN module has a pyramid memory block structure.
The speech emotion recognition system in this embodiment is an end-to-end feedforward deep neural network structure. The invention improves on the base network of the classical FSMN and DFSMN structures, adding convolutional layers to the base network to achieve lower-level feature extraction.
The audio data of the invention is in the mainstream wav format with a sampling frequency of 16000 Hz. The raw audio data is framed and a Fourier transform is applied; each frame is 25 ms long with a frame shift of 10 ms. Through this preliminary processing, the audio data is converted into 2-D spectrogram features used as model input. As shown in Fig. 1, the bottom of the model is a two-layer convolutional stack rather than DFSMN layers directly. In addition, as the network deepens, downsampling is used to extract more robust features and greatly reduce the feature size; this module significantly improves the accuracy of the model. When a convolution (or pooling) operation is performed with a kernel of size k and a stride of size s, the output can be calculated by the following formulas:
W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
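The framing and Fourier transform described above (25 ms windows, 10 ms hop, at 16 kHz) can be sketched as follows. This is a minimal illustration of the stated preprocessing, not the patent's implementation; the Hamming window is an assumption, since the text does not name a window function.

```python
import numpy as np

def spectrogram(wav: np.ndarray, sr: int = 16000,
                frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Frame the waveform (25 ms windows, 10 ms hop at 16 kHz) and take the
    magnitude of a per-frame FFT, giving the 2-D spectrogram the model consumes."""
    frame = int(sr * frame_ms / 1000)   # 400 samples per frame
    hop = int(sr * hop_ms / 1000)       # 160 samples between frames
    n_frames = 1 + (len(wav) - frame) // hop
    window = np.hamming(frame)
    frames = np.stack([wav[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape (n_frames, frame//2 + 1)
```

One second of 16 kHz audio thus yields a 98-frame by 201-bin spectrogram feature map.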
As shown in Fig. 2a, FSMN is a standard feedforward fully connected neural network with an additional memory module added to the hidden layer. A tapped-delay structure can encode, for each time step h_t, the N_1 forward (look-back) time steps and the N_2 backward (look-ahead) time steps into a fixed-size representation, and their weighted sum is taken as the current output:

h̃_t = f(h_t + Σ_{i=1..N_1} a_i · h_{t-i} + Σ_{j=1..N_2} b_j · h_{t+j}),

where h̃_t is the output of the memory module at time t, f is any activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.
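The memory-module sum above can be sketched directly. This is an illustrative scalar-weight version under the formula's definitions (with zero padding at the sequence edges, and with the activation f omitted for clarity); it is not code from the patent.

```python
import numpy as np

def fsmn_memory(h: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Scalar-FSMN memory block over a (T, D) hidden sequence: each output
    mixes the current step with N1 past and N2 future steps,
    out[t] = h[t] + sum_i a[i]*h[t-i-1] + sum_j b[j]*h[t+j+1]."""
    T, _ = h.shape
    out = h.copy()
    for t in range(T):
        for i in range(len(a)):          # forward (look-back) taps
            if t - i - 1 >= 0:
                out[t] += a[i] * h[t - i - 1]
        for j in range(len(b)):          # backward (look-ahead) taps
            if t + j + 1 < T:
                out[t] += b[j] * h[t + j + 1]
    return out
```

Because the taps are fixed learned weights rather than recurrent state, the whole block is feedforward and can be evaluated for all time steps in parallel.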
To make FSMN deeper, as shown in Fig. 2b, DFSMN differs from the original FSMN architecture in that it removes the direct forward connections between hidden layers, using only the memory modules as input; at the same time, it introduces skip connections to overcome the vanishing and exploding gradient problems. The relationship between input and output is given by:

h̃_t^l = H(h̃_t^{l-1}) + p_t^l + Σ_{i=1..N_1^l} a_i^l · p_{t-s_1·i}^l + Σ_{j=1..N_2^l} b_j^l · p_{t+s_2·j}^l,

h_t^{l+1} = f(W^l · h̃_t^l + b^{l+1}),

where h̃_t^l is the output of the l-th layer memory block at time t, H is any activation function, h̃_t^{l-1} is the output of the (l-1)-th layer memory block, p_t^l is the input of the l-th layer memory block at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, p_{t-s_1·i}^l is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, p_{t+s_2·j}^l is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer at layer l+1 and time t, W^l is the weight parameter of the l-th layer memory block, and b^{l+1} is its bias.
In the FSMN and DFSMN described above, the memory modules have the same length in every layer; that is, in the equations above, the N_1^l, N_2^l, s_1 and s_2 obtained through all hidden layers are identical. In such a memory block structure, the bottom layer already extracts the contextual information of a specific time step t, and the higher layers contain that information as well; long-term dependencies therefore end up being captured repeatedly, and there is no need to reintroduce redundant information at the top. The invention proposes a pyramid memory block structure, in which the model extracts more contextual information at deeper levels; the pyramid is realized by increasing N_1^l, N_2^l, s_1 and s_2 with depth. The bottom layers thus extract features from fine-grained information such as speaking rate and rhythm, while the top layers extract features from higher-level information such as emotion and gender. This pyramid memory block structure improves precision while reducing the number of parameters.
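The pyramid idea of growing orders and strides with depth can be made concrete with a per-layer configuration. The specific numbers below are hypothetical, chosen only to illustrate how the temporal context widens layer by layer; the patent does not state these values.

```python
# Hypothetical per-layer configuration for a pyramid memory block stack:
# deeper layers use larger filter orders (N1, N2) and larger strides (s1, s2),
# so each layer's memory covers a wider temporal context than the one below.
pyramid_layers = [
    {"N1": 2, "N2": 2, "s1": 1, "s2": 1},  # bottom layer: local context
    {"N1": 4, "N2": 4, "s1": 2, "s2": 2},
    {"N1": 8, "N2": 8, "s1": 4, "s2": 4},  # top layer: wide context
]

def receptive_field(layers):
    """Total past/future context (in time steps) accumulated over the stack."""
    past = sum(l["N1"] * l["s1"] for l in layers)
    future = sum(l["N2"] * l["s2"] for l in layers)
    return past, future
```

With this configuration the stack sees 42 past and 42 future time steps, while a flat stack at the bottom layer's setting would see only 6 each way for the same tap count.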
The invention also adds an attention mechanism, applied to the output of the pyramid FSMN. With attention, each element of the output sequence depends on specific elements of the input sequence. This increases the model's computational burden but yields a more accurate, better-performing model. In most implementations, attention is realized as a weight vector (usually the output of a softmax function) whose dimension equals the length of the input sequence. In this embodiment, a segment of speech is divided into many fragments, called time steps in the neural network. Clearly, when a segment of speech contains a lot of silence, not every time step is useful for the SER task, so the model needs to focus on specific regions. On this basis, a time-step attention module is constructed, as shown in Fig. 3, and can be described by the following formulas:
at=Average (ht),
Y=Xs,
Wherein, atIt is the mean value of t-th of time step, htIt is the feature vector of t-th of time step, Average is letter of averaging
Number;S is the output of attention mechanism,It is softmax activation primitive, W1It is the weight that time step pays attention to first layer in power module
Parameter, W2It is the weight parameter that time step pays attention to the second layer in power module, b1It is the biasing that time step pays attention to first layer in power module
Parameter, b2It is the offset parameter that time step pays attention to the second layer in power module, f is any activation primitive, and a is by all atIt constitutes
Feature vector;Y is the output of output module as a result, X is the input that time step pays attention to power module.
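The time-step attention formulas can be sketched as a small NumPy function. The choice of ReLU for f and the weight shapes are assumptions; the patent only requires f to be some activation and the outer function to be softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def time_step_attention(X, W1, b1, W2, b2):
    """Time-step attention sketch: a_t = Average(h_t),
    s = softmax(W2 f(W1 a + b1) + b2), Y = weighted sum of X by s.

    X : (T, D) feature map, one row (h_t) per time step.
    """
    a = X.mean(axis=1)                     # a_t = Average(h_t), shape (T,)
    hidden = np.maximum(0.0, W1 @ a + b1)  # f: ReLU assumed here
    s = softmax(W2 @ hidden + b2)          # one attention weight per time step
    Y = s @ X                              # Y = Xs: weighted sum over time
    return Y, s
```

With all-zero weights the attention collapses to a uniform average over time steps, which is a convenient sanity check.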
The output of the whole network is finally obtained through a fully connected layer; the optimization objective of the model is the standard cross-entropy loss function. The length of the model's output vector matches the number of emotion categories, the value at each position of the output vector corresponds to the probability of that emotion category, and the category with the highest probability is finally chosen as the output.
As shown in Figure 4, the speech emotion recognition process in this embodiment specifically includes the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and normalization on the received speech, obtains a spectrogram feature map, and feeds the spectrogram feature map to the CNN module.
Step 2: the CNN module performs convolution operations on the received spectrogram feature map and constructs a spectrogram feature map containing shallow audio information.
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains deeper semantic and contextual information in the spectrogram feature map, such as the speaker's gender and emotion contained in a segment of speech.
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic and contextual information. It first computes the attention score of each time step, then uses these scores to take a weighted sum over the time-step dimension of the whole spectrogram feature map, obtaining the feature vector most correlated with the speaker's emotion across the whole utterance. This time-step attention module makes the model focus on the parts relevant to the speaker's emotion and improves model robustness.
Step 5: the feature vector is fed to the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the category whose dimension has the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e. the model classifies the predicted emotion. The output module is a fully connected layer whose output is a feature vector of length 4.
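The decision rule of step 5 (length-4 output vector, softmax to probabilities, argmax to a label) can be sketched as follows; the class ordering is an assumption, since the text does not fix one.

```python
import numpy as np

EMOTIONS = ["happy", "sad", "angry", "neutral"]  # order assumed, not specified

def decide_emotion(logits):
    """Map the length-4 output vector to per-class probabilities and a label."""
    e = np.exp(logits - np.max(logits))          # numerically stable softmax
    probs = e / e.sum()
    return EMOTIONS[int(np.argmax(probs))], probs

label, probs = decide_emotion(np.array([0.2, 1.5, -0.3, 0.1]))
```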
By discarding the LSTM, the technical solution of the invention not only greatly improves the recognition speed of the model but also effectively reduces training time. In addition, unlike traditional speech emotion recognition systems, the invention does not use hand-crafted features as model input but directly uses the raw spectrogram as input, which contains more of the original information and therefore gives the model stronger generalization ability. To make the model focus on emotion-relevant information without interference from other information, the invention proposes a time-step-based attention mechanism and integrates it into the model.
(1) Dataset used by the invention.
The SER model of the invention is evaluated on the IEMOCAP corpus, which consists of several dialogues; in each session, two participants express specific types of emotion through conversation. The utterances are labeled as angry, fearful, excited, neutral, disgusted, surprised, sad, happy, frustrated, other, and XXX, where XXX marks cases in which the annotators could not agree on a label. In this embodiment only 5 classes are selected (angry, excited, happy, neutral and sad), for a total of 5531 utterances. To balance the sample sizes of the emotion classes, the excited class is merged into the happy class. In addition, 10% of the total data is randomly selected as the test set, the remaining data is used as training data, and 10% of the training data is held out as validation data to check whether early stopping is needed.
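The described split (10% of all data for test, then 10% of the remaining training data held out for validation) can be sketched in plain Python; the function name and seeding are illustrative.

```python
import random

def split_utterances(utts, seed=0):
    """10% test; of the rest, 10% validation (early stopping), 90% train."""
    rng = random.Random(seed)
    utts = list(utts)
    rng.shuffle(utts)
    n_test = int(0.1 * len(utts))
    test, rest = utts[:n_test], utts[n_test:]
    n_val = int(0.1 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

On the 5531 IEMOCAP utterances this yields 553 test, 497 validation and 4481 training items (integer truncation); the text's later figure of 4470 training items suggests a slightly different rounding, which the patent does not spell out.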
The corpus contains both video and audio channels; only the audio data is used in the invention. The audio was captured with high-quality microphones (Schoeps CMIT 5U) at a sampling rate of 48 kHz. It is downsampled to 16 kHz, and a 201-dimensional acoustic feature is extracted. Unlike other technical solutions, this embodiment uses only the spectrogram as input; the extraction is performed with a 25 ms window and a step size of 10 ms (100 fps). The whole-utterance speech data is also normalized.
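The frame arithmetic behind these numbers checks out, assuming the FFT size equals the 25 ms window (a plausible reading, since 400/2 + 1 = 201 one-sided bins; the text does not state the FFT size):

```python
# Window/hop arithmetic for the 201-dimensional spectrogram input.
sr = 16000                 # Hz, after downsampling from 48 kHz
win = int(0.025 * sr)      # 25 ms window  -> 400 samples
hop = int(0.010 * sr)      # 10 ms hop     -> 160 samples
n_bins = win // 2 + 1      # one-sided FFT bins, n_fft = win assumed -> 201
fps = sr // hop            # 100 frames per second, matching "100fps"
```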
(2) Description of the test procedure.
The PyTorch framework is used as the training tool. The network architecture is shown in Figure 1: two 5x5 conv layers are used at the front, and the hidden layers and memory blocks of the 4 FSMN blocks have 256 and 128 nodes respectively. To avoid over-fitting, the CNN and pFSMN layers are followed by batch normalization layers; the memory orders range from 4 to 32 and the strides from 1 to 2. The model of this embodiment is trained with PyTorch's Adam optimizer, with the batch size set to 32 and the learning rate fixed at 0.003. The 4470 preset training audio items are iterated over, and the model is evaluated on the validation set after every epoch; training stops early when the recognition accuracy on the validation set does not change for 3 consecutive epochs. All experiments are run on a workstation with one NVIDIA TITAN XP.
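The early-stopping rule (stop when validation accuracy is unchanged for 3 consecutive epochs) can be sketched as follows; `evaluate` stands in for one epoch of training plus validation scoring and is a hypothetical callback, not part of the patent.

```python
def train_with_early_stopping(evaluate, max_epochs=100, patience=3):
    """Stop once validation accuracy fails to improve `patience` epochs in a row."""
    best, stale, history = -1.0, 0, []
    for epoch in range(max_epochs):
        acc = evaluate(epoch)          # train one epoch, return validation accuracy
        history.append(acc)
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break                  # accuracy flat for `patience` epochs: stop
    return best, history
```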
(3) Test results.
To measure the performance of the system, the overall accuracy on the test samples (weighted accuracy, WA), the average recall over the emotion classes (unweighted accuracy, UA), and the recall of each individual class are computed.
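WA and UA as used here can be computed from labels as follows: WA is overall accuracy, UA is the unweighted mean of per-class recalls. A minimal stdlib sketch:

```python
from collections import defaultdict

def wa_ua(y_true, y_pred):
    """Weighted accuracy (overall) and unweighted accuracy (mean per-class recall)."""
    wa = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    stats = defaultdict(lambda: [0, 0])          # class -> [hits, count]
    for t, p in zip(y_true, y_pred):
        stats[t][1] += 1
        stats[t][0] += int(t == p)
    ua = sum(h / n for h, n in stats.values()) / len(stats)
    return wa, ua
```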
The test results show that, compared with the LSTM, the improved sequence model performs 2.47% better, indicating that the FSMN is the better sequence model for this task. HSF-CRNN (Luo D, Zou Y, Huang D. Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition [J]. Proc. Interspeech 2018, 2018: 152-156.) is an improved combined CNN-RNN method proposed by Luo that uses hand-crafted speech features as input; the model of this embodiment achieves absolute improvements of 0.53% and 3.99% over it on UA and WA respectively, demonstrating experimentally that useful information can be extracted automatically from the spectrogram without using the common hand-crafted speech features. The invention also builds a basic C-biLSTM model for comparison: its accuracy on the "sad" samples is better than other methods, but its recognition accuracy on the other classes is far worse. To illustrate the working principle of the attention mechanism, a C-pFSMN model is built that is identical to the model of the invention except that it has no attention mechanism. The results show that, compared with C-pFSMN, the attention mechanism proposed by the invention performs well on the SER task, improving UA by an absolute 6.3%; in addition, the front-end CNN layers can extract more complex features and thus improve model performance as expected.
The C-biLSTM is built from 2 CNN layers and 2 Bi-LSTM layers, with 256 nodes in the hidden layers. It is similar to the model of this embodiment and is widely used in sequence modeling tasks, so the computing resources of the C-biLSTM are also compared with the method of the invention. The results show that the model of the invention has only 1.85M parameters while its training time is 64 minutes, much faster than the C-LSTM model. This means the invention can achieve better performance while requiring fewer computing resources.
The technical solution of the invention solves the speech emotion recognition problem well: the recognition speed is greatly improved, and the training time is effectively reduced at the same time. In addition, unlike traditional speech emotion recognition systems, the invention does not use hand-crafted features as model input but directly uses the raw spectrogram as input, which contains more of the original information and gives the model stronger generalization ability. To make the model focus on emotion-relevant information without interference from other information, the invention proposes a time-step-based attention mechanism and integrates it into the model; experiments show that the model of the invention is effective and requires fewer computing resources.
The detailed description above illustrates possible embodiments of the present invention; the embodiments do not limit the patent scope of the invention, and all equivalent implementations or changes made without departing from the present invention are intended to be included within the patent scope of this case.
Claims (10)
1. A speech emotion recognition system, characterized by comprising:
an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module connected in sequence, the CNN module having convolutional layers;
the audio preprocessing module converts the received raw audio data into a spectrogram feature map;
the CNN module performs preliminary processing on the spectrogram feature map and constructs a feature map containing shallow information;
the pyramid FSMN module further processes the feature map containing shallow information to obtain deeper semantic information and contextual information;
the time-step attention module is used to focus on specific regions among the time steps and to calculate the weight of influence of different time steps on the final emotion recognition;
the output module has several emotion categories and is used to output the emotion category that best matches the raw audio data.
2. The speech emotion recognition system according to claim 1, characterized in that the time-step attention module is specifically given by the following formulas:
a_t = Average(h_t),
s = σ(W_2 f(W_1 a + b_1) + b_2),
Y = Xs,
where a_t is the mean of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer of the time-step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is an arbitrary activation function, and a is the feature vector formed by all a_t; Y is the output result of the output module and X is the input to the time-step attention module.
3. The speech emotion recognition system according to claim 1, characterized in that, when the convolutional layer performs a convolution operation with a kernel of size k and a stride of size s, the output of the convolutional layer is calculated by the following formulas:
W_out = (W_in - k)/s + 1,
H_out = (H_in - k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
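Claim 3's formulas assume a square kernel and no padding; a one-line helper makes the arithmetic concrete (the function name is illustrative):

```python
def conv_output_size(w_in, h_in, k, s):
    """Output width/height of a k x k convolution with stride s, no padding:
    W_out = (W_in - k)/s + 1, H_out = (H_in - k)/s + 1."""
    return (w_in - k) // s + 1, (h_in - k) // s + 1
```

For example, a 201 x 100 spectrogram patch passed through a 5 x 5 kernel with stride 1 yields a 197 x 96 map.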
4. The speech emotion recognition system according to claim 1, characterized in that the pyramid FSMN module has a pyramid memory block structure; using the pyramid memory block structure, a time step h_t together with its N_1 forward time steps and N_2 backward time steps is encoded into a fixed-size length, and the sum over the N_1 and N_2 terms is computed as the current output, which is specifically given by the following formula:
h~_t = f( Σ_{i=0..N_1} a_i h_{t-i} + Σ_{j=1..N_2} b_j h_{t+j} ),
where h~_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure uses skip connections, whose input-output relationship is given by the following formulas:
p^l_t = f( p^(l-1)_t + h^l_t + Σ_{i=0..N_1^l} a^l_i h^l_{t-s_1·i} + Σ_{j=1..N_2^l} c^l_j h^l_{t+s_2·j} ),
h^(l+1)_t = f( W^l p^l_t + b^(l+1) ),
where p^l_t is the output of the layer-l memory block at time t, f is an arbitrary activation function, p^(l-1)_t is the output of the layer-(l-1) memory block at time t, h^l_t is the input of the layer-l memory block, N_1^l is the number of forward time steps of layer l, a^l_i is the weight of the i-th forward time step of layer l, h^l_{t-s_1·i} is the i-th forward time step of layer l, s_1 is the forward time-step stride, c^l_j is the weight of the j-th backward time step of layer l, h^l_{t+s_2·j} is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h^(l+1)_t is the hidden-layer output of layer l+1 at time t, W^l is the weight parameter of the layer-l memory block, and b^(l+1) is the bias of the layer-l memory block.
5. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that there are two convolutional layers.
6. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that the shallow information is audio loudness or frequency.
7. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that the several emotion categories are four emotion categories.
8. The speech emotion recognition system according to claim 7, characterized in that the four emotion categories are happy, sad, angry and neutral.
9. A speech emotion recognition method applied to the speech emotion recognition system of claim 1, characterized by comprising the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and normalization on the received speech, obtains a spectrogram feature map, and feeds the spectrogram feature map to the CNN module;
Step 2: the CNN module performs convolution operations on the received spectrogram feature map and constructs a spectrogram feature map containing shallow audio information;
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains deeper semantic information and contextual information in the spectrogram feature map;
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic and contextual information: it first computes the attention score of each time step, then uses these scores to take a weighted sum over the time-step dimension of the whole spectrogram feature map, obtaining the feature vector most correlated with the speaker's emotion across the whole utterance;
Step 5: the feature vector is fed to the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so as to output the emotion category corresponding to the whole utterance.
10. The speech emotion recognition method according to claim 9, characterized in that, in step 5, the output module is a fully connected layer whose output is a feature vector of length 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910803429.7A CN110534133B (en) | 2019-08-28 | 2019-08-28 | Voice emotion recognition system and voice emotion recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110534133A true CN110534133A (en) | 2019-12-03 |
CN110534133B CN110534133B (en) | 2022-03-25 |
Family
ID=68664896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910803429.7A Active CN110534133B (en) | 2019-08-28 | 2019-08-28 | Voice emotion recognition system and voice emotion recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110534133B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143567A (en) * | 2019-12-30 | 2020-05-12 | 成都数之联科技有限公司 | Comment emotion analysis method based on improved neural network |
CN111539458A (en) * | 2020-04-02 | 2020-08-14 | 咪咕文化科技有限公司 | Feature map processing method and device, electronic equipment and storage medium |
CN112053007A (en) * | 2020-09-18 | 2020-12-08 | 国网浙江兰溪市供电有限公司 | Distribution network fault first-aid repair prediction analysis system and method |
CN112634947A (en) * | 2020-12-18 | 2021-04-09 | 大连东软信息学院 | Animal voice and emotion feature set sequencing and identifying method and system |
CN113255800A (en) * | 2021-06-02 | 2021-08-13 | 中国科学院自动化研究所 | Robust emotion modeling system based on audio and video |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113903327A (en) * | 2021-09-13 | 2022-01-07 | 北京卷心菜科技有限公司 | Voice environment atmosphere recognition method based on deep neural network |
CN115512693A (en) * | 2021-06-23 | 2022-12-23 | 中移(杭州)信息技术有限公司 | Audio recognition method, acoustic model training method, device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20090063202A (en) * | 2009-05-29 | 2009-06-17 | 포항공과대학교 산학협력단 | Method for apparatus for providing emotion speech recognition |
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
CN109036465A (en) * | 2018-06-28 | 2018-12-18 | 南京邮电大学 | Speech-emotion recognition method |
CN109285562A (en) * | 2018-09-28 | 2019-01-29 | 东南大学 | Speech-emotion recognition method based on attention mechanism |
CN109460737A (en) * | 2018-11-13 | 2019-03-12 | 四川大学 | A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network |
CN109637522A (en) * | 2018-12-26 | 2019-04-16 | 杭州电子科技大学 | A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph |
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | 中国传媒大学 | A kind of speech-emotion recognition method and system |
Non-Patent Citations (3)
Title |
---|
SHILIANG ZHANG ETC.: "DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION", 《2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING(ICASSP)》 * |
张园园: "基于深度学习的多模态情感识别方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王金华 等: "基于语谱图提取深度空间注意特征的语音情感识别算法", 《电信科学》 * |
Also Published As
Publication number | Publication date |
---|---|
CN110534133B (en) | 2022-03-25 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||