CN110534133A - Speech emotion recognition system and speech emotion recognition method - Google Patents

Speech emotion recognition system and speech emotion recognition method

Info

Publication number
CN110534133A
CN110534133A (application CN201910803429.7A)
Authority
CN
China
Prior art keywords
time step
module
layer
emotion recognition
spectrum signature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910803429.7A
Other languages
Chinese (zh)
Other versions
CN110534133B (en)
Inventor
殷绪成
曹秒
杨春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Wisdom Electronic Technology Co Ltd
Original Assignee
Zhuhai Wisdom Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Wisdom Electronic Technology Co Ltd filed Critical Zhuhai Wisdom Electronic Technology Co Ltd
Priority to CN201910803429.7A priority Critical patent/CN110534133B/en
Publication of CN110534133A publication Critical patent/CN110534133A/en
Application granted granted Critical
Publication of CN110534133B publication Critical patent/CN110534133B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Neurology (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a speech emotion recognition system comprising an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module connected in sequence. The invention also discloses a speech emotion recognition method applied to the speech emotion recognition system, comprising the following steps: 1. preprocess the speech to obtain a spectrogram feature map; 2. operate on the spectrogram feature map to build a spectrogram feature map containing shallow audio information; 3. further process the spectrogram feature map containing shallow audio information to obtain deeper semantic information and contextual information; 4. process the spectrogram feature map carrying the deeper semantic information and contextual information to obtain the feature vector of the whole utterance that is most strongly correlated with the speaker's emotion; 5. output the emotion category corresponding to the whole utterance. Compared with the prior art, the speech emotion recognition performance of the invention achieves a considerable improvement.

Description

Speech emotion recognition system and speech emotion recognition method
Technical field
The present invention relates to the technical fields of artificial intelligence and speech recognition, and in particular to a speech emotion recognition system and a speech emotion recognition method; it is an end-to-end deep neural network technique that takes DFSMN as the base network and improves on it.
Background technique
With the continuous progress of speech recognition technology and the wide use of speech recognition devices, human-computer interaction has become more and more common in daily life. However, most of these devices can only recognize the textual content of human language and cannot identify the affective state of the speaker, even though speech emotion recognition has many useful applications in people-oriented services and human-computer interaction, such as intelligent service robots, automated call centers and distance education. It has therefore attracted considerable research attention, and many methods have been proposed. With the rapid development of machine learning (such as CNNs and other deep neural networks) in recent years, these methods have shown good performance after trials and improvements in many fields. How to apply deep learning to this field is still being explored, and in practical applications this challenging task faces many problems that remain to be solved. Practical application of speech emotion recognition technology is a challenging task that requires collecting massive and complex audio data for study, and how to make audio recorded in a clean environment closer to speech in real scenes is a major problem that the prior art still needs to solve.
A typical speech emotion recognition (SER) system takes a speech waveform as input and then outputs one of the target emotion categories. Traditional SER systems use Gaussian mixture models (GMMs) (Neiberg D, Elenius K, Laskowski K. Emotion recognition in spontaneous speech using GMMs[C]//Ninth International Conference on Spoken Language Processing. 2006.), hidden Markov models (HMMs) (Nwe T L, Foo S W, De Silva L C. Speech emotion recognition using hidden Markov models[J]. Speech Communication, 2003, 41(4): 603-623.), support vector machines (SVMs) (Yang N, Yuan J, Zhou Y, et al. Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification[J]. International Journal of Speech Technology, 2017, 20(1): 27-41.) and long short-term memory (LSTM) (Tao F, Liu G. Advanced LSTM: A study about better time dependency modeling in emotion recognition[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 2906-2910.). These systems share one significant problem: they all rely on mature hand-crafted speech features, and the choice of features strongly affects the modeling result. These features generally include frame-level spectral, cepstral, pitch and energy features of the speech signal. Statistical functionals of these features are then applied across multiple frames to obtain an utterance-level feature vector.
With the explosive growth of deep learning technology, some researchers have explored deep learning methods to build more robust SER models. Zhang Z et al. (Zhang Z, Ringeval F, Han J, et al. Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks[C]//Proceedings INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association (ISCA). 2016: 3593-3597.) proposed a feature enhancement algorithm based on a long short-term memory (LSTM) neural-network autoencoder for extracting emotion information from speech. Correspondingly, recurrent neural networks (RNNs) have been shown to have strong sequence modeling ability, especially in speech recognition tasks. However, training an RNN relies on backpropagation through time (BPTT), which, because of its computational complexity, can bring problems such as long training time, gradient vanishing and gradient explosion. To solve these problems, the feedforward sequential memory network (hereinafter: FSMN) was proposed. In recent years, a large body of research has shown that in tasks such as speech recognition and language modeling, FSMN can model long-range dependencies without any recurrent feedback. In addition, Zhang S et al. (Zhang S, Lei M, Yan Z, et al. Deep-FSMN for large vocabulary continuous speech recognition[C]//2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5869-5873.), in order to build a deeper network structure, proposed applying skip connections in the FSMN, greatly improving the earlier model.
Research activities on SER can be traced back to the 1980s. However, due to factors such as variations in gender, speaker, language and recording environment, SER is still challenging in practical applications. Many researchers have tried to solve these problems by designing elaborate hand-written speech features to strengthen the connection with human emotion. However, these manually extracted speech features are only suitable for specific tasks and generalize poorly. This leads to the need to design different speech features for different speech-related tasks, which runs counter to the original intention of deep learning technology.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a speech emotion recognition system, which is an end-to-end feedforward deep neural network structure.
In view of the deficiencies of the prior art, a further object of the present invention is to provide a speech emotion recognition method applied to the speech emotion recognition system.
To achieve the object of the present invention, the following technical solution is adopted: a speech emotion recognition system comprising an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module connected in sequence; the CNN module has convolutional layers, and the pyramid FSMN module has a pyramid memory block structure;
The audio preprocessing module converts the received raw audio data into a spectrogram feature map;
The CNN module performs preliminary processing on the spectrogram feature map and builds a feature map containing shallow information;
The pyramid FSMN module further processes the feature map containing shallow information to obtain deeper semantic information and contextual information;
The time-step attention module focuses on specific regions along the time-step dimension and computes the influence weight of each time step on the final emotion recognition result;
The output module has several emotion categories and outputs the emotion category that best matches the raw audio data.
The time-step attention module can be described by the following formulas:
a_t = Average(h_t),
s = softmax(W_2 · f(W_1 · a + b_1) + b_2),
Y = X · s,
where a_t is the mean of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 is the weight parameter of the first layer in the time-step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is an arbitrary activation function, and a is the feature vector composed of all a_t; Y is the output result, and X is the input to the time-step attention module.
When the convolutional layer performs a convolution operation with a kernel of size k and a stride of size s, the output of the convolutional layer is calculated by the following formulas:
W_out = (W_in - k)/s + 1,
H_out = (H_in - k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, k is the convolution kernel size and s is the stride of the kernel; H_out is the height of the output feature map and H_in is the height of the input feature map.
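As a small illustration of the sizing formulas above (a sketch only; the 300-frame by 201-bin input and the 5x5 kernel are assumed values, not prescribed by the claims):

def conv_output_size(w_in, h_in, k, s):
    # W_out = (W_in - k) / s + 1 and H_out = (H_in - k) / s + 1
    return (w_in - k) // s + 1, (h_in - k) // s + 1

print(conv_output_size(300, 201, k=5, s=2))   # a 300 x 201 spectrogram patch -> (148, 99)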
Using the pyramid memory block structure, the time step h_t together with N_1 forward (past) time steps and N_2 backward (future) time steps is encoded into a fixed-size representation, and their weighted sum is taken as the current output, as shown by the following formula:
h̃_t = f( Σ_{i=0}^{N_1} a_i · h_{t-i} + Σ_{j=1}^{N_2} b_j · h_{t+j} ),
where h̃_t is the output of the memory block at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
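A minimal PyTorch sketch of this memory-block computation follows (an illustrative reading of the formula above, not the claimed implementation; the tensor layout and the choice f = ReLU are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FSMNMemoryBlock(nn.Module):
    # Encodes N1 past and N2 future time steps of h into the current output.
    def __init__(self, hidden, n1, n2):
        super().__init__()
        self.n1, self.n2 = n1, n2
        self.a = nn.Parameter(torch.randn(n1 + 1, hidden) * 0.01)   # a_i, i = 0..N1
        self.b = nn.Parameter(torch.randn(n2, hidden) * 0.01)       # b_j, j = 1..N2

    def forward(self, h):
        # h: (batch, time, hidden)
        T = h.size(1)
        pad = F.pad(h, (0, 0, self.n1, self.n2))      # pad the time axis on both sides
        out = torch.zeros_like(h)
        for i in range(self.n1 + 1):                  # forward taps h_{t-i}
            out = out + self.a[i] * pad[:, self.n1 - i:self.n1 - i + T]
        for j in range(1, self.n2 + 1):               # backward taps h_{t+j}
            out = out + self.b[j - 1] * pad[:, self.n1 + j:self.n1 + j + T]
        return torch.relu(out)                        # f taken as ReLU here

print(FSMNMemoryBlock(128, n1=4, n2=4)(torch.randn(2, 72, 128)).shape)   # torch.Size([2, 72, 128])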
The pyramid memory block structure can use skip connections; the relationship between the input and output of the skip connection is shown by the following formulas:
p̃_t^l = H(p̃_t^{l-1}) + p_t^l + Σ_{i=0}^{N_1^l} a_i^l · h^l_{t-s_1·i} + Σ_{j=1}^{N_2^l} b_j^l · h^l_{t+s_2·j},
h_t^{l+1} = f(W^l · p̃_t^l + b^{l+1}),
where p̃_t^l is the output of the memory block of layer l at time t, H is an arbitrary activation function applied in the skip connection, p̃_t^{l-1} is the output of the memory block of layer l-1 at time t, p_t^l is the input of the memory block of layer l at time t, N_1^l is the number of forward time steps of layer l, a_i^l is the weight of the i-th forward time step of layer l, h^l_{t-s_1·i} is the i-th forward time step of layer l, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of layer l, h^l_{t+s_2·j} is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer of layer l+1 at time t, W^l is the weight parameter of the memory block of layer l, and b^{l+1} is its bias.
The convolutional layers may be two layers; the shallow information may be audio loudness or frequency; the several emotion categories may be four emotion categories, and the four emotion categories may be happy, sad, angry and neutral.
To achieve the further object of the present invention, the following technical solution is adopted: a speech emotion recognition method applied to the speech emotion recognition system, comprising the following steps (an illustrative sketch of the whole pipeline is given after the steps):
Step 1: the audio preprocessing module performs preliminary feature extraction and normalization on the received speech, obtains a spectrogram feature map, and passes the spectrogram feature map to the CNN module;
Step 2: the CNN module performs convolution operations on the received spectrogram feature map and builds a spectrogram feature map containing shallow audio information (such as audio loudness and frequency);
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains the deeper semantic information and contextual information in the spectrogram feature map, such as the speaker's gender and the speaker's emotion contained in a segment of speech;
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic information and contextual information; it first computes the attention score of each time step and then uses these scores to take a weighted sum of the entire spectrogram feature map along the time-step dimension, obtaining the feature vector of the whole utterance that is most strongly correlated with the speaker's emotion; this time-step attention module lets the model focus on the parts relevant to the speaker's emotion and improves model robustness;
Step 5: the feature vector is fed to the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e., the model classifies the predicted emotion; the output module is a fully connected layer whose output is a feature vector of length 4.
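For illustration only, the five steps above might be wired together in PyTorch roughly as follows. This is a hedged sketch: the module sizes, the GRU placeholder standing in for the pyramid FSMN stack, and the per-step attention scorer are assumptions, not the claimed implementation.

import torch
import torch.nn as nn

class SERSkeleton(nn.Module):
    # Step 2: CNN front end; Step 3: sequence model placeholder; Step 4: time-step attention; Step 5: length-4 output.
    def __init__(self, freq_bins=201, hidden=128, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.BatchNorm2d(16), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.BatchNorm2d(32), nn.ReLU())
        cnn_freq = ((freq_bins - 5) // 2 + 1 - 5) // 2 + 1           # per the W_out/H_out formulas
        self.proj = nn.Linear(32 * cnn_freq, hidden)
        self.seq = nn.GRU(hidden, hidden, batch_first=True)          # stand-in for the pyramid FSMN stack
        self.score = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
        self.out = nn.Linear(hidden, n_classes)                      # fully connected output layer

    def forward(self, spec):
        # spec: (batch, 1, frames, freq_bins) spectrogram feature map from Step 1
        x = self.cnn(spec)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)               # one feature vector per time step
        h, _ = self.seq(self.proj(x))                                # deeper semantic / contextual information
        a = h.mean(dim=2, keepdim=True)                              # a_t = Average(h_t)
        s = torch.softmax(self.score(a), dim=1)                      # attention weight per time step
        utt = (h * s).sum(dim=1)                                     # weighted sum over the time-step dimension
        return self.out(utt)                                         # 4-dimensional emotion logits

model = SERSkeleton()
print(model(torch.randn(2, 1, 300, 201)).shape)                      # torch.Size([2, 4])

The length-4 output corresponds to the four emotion categories; the GRU here is only a placeholder where the pyramid FSMN layers of the invention would sit.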
Technical problem solved by the invention: speech emotion recognition is addressed on the basis of deep learning. A segment of speech contains a great deal of information, for example the speaker's gender, background noise, the spoken content, and the speaker's affective state, which makes speech emotion recognition very difficult and challenging. Likewise, although deep-learning-based speech emotion recognition has received some study, most of that research is based on LSTM, which inherently suffers from problems such as a huge number of parameters and difficult training. In summary, existing speech emotion recognition technology still faces many problems that have not been well solved.
Advantages of the present invention:
1. In tasks such as speech recognition and language modeling, an FSMN can model long-range dependencies without any recurrent feedback. Based on these research results, the invention proposes an end-to-end feedforward deep neural network for the speech emotion recognition task. After discarding LSTM, it not only greatly improves the recognition speed of the model but also effectively reduces training time. Compared with traditional methods, the invention does not take various hand-crafted audio features as model input but directly uses the raw spectrogram as input, which contains more of the original speech information and thus gives the model stronger generalization ability. It also reduces the complexity of model construction, since different input features no longer have to be designed for different models.
2. Unlike most deep-learning-based speech emotion recognition research, the invention does not use recurrent neural networks or their variants; it uses DFSMN, a standard feedforward fully connected neural network, as the base network, and proposes a pyramid memory block structure on top of DFSMN, making the whole model more robust and able to extract higher-level semantic information as the network deepens.
3. The bottom of the model of the invention is a two-layer convolution stack rather than DFSMN layers used directly; in addition, as the network deepens, downsampling is used to extract more robust features and to greatly reduce the feature size.
4. To make the model focus on emotion-relevant information without interference from other information, the invention also proposes an attention mechanism based on time steps and applies it to the output of the pyramid FSMN. With the attention mechanism, each element of the output sequence depends on specific elements of the input sequence; this increases the computational burden of the model but produces a more accurate, better-performing model. The effectiveness of the end-to-end network of the invention is verified on the IEMOCAP speech emotion dataset.
5. The end-to-end deep neural network structure proposed by the invention, together with the methods designed for each problem, runs effectively and is well confirmed by experiments; moreover, testing is 3.3 times faster than the original model. Analysis and verification show that the speech emotion recognition performance of the invention achieves a considerable improvement.
Detailed description of the invention
Fig. 1 is the structure diagram of the speech emotion recognition system; pFSMN in the figure denotes the pyramid FSMN.
Fig. 2a is the structure diagram of the FSMN.
Fig. 2b is the structure diagram of the DFSMN.
Fig. 3 is the structure diagram of the time-step attention module.
Fig. 4 is the flow chart of the speech emotion recognition method.
Specific embodiment
Embodiment
The present invention is further illustrated below with reference to an embodiment.
As shown in Fig. 1, a speech emotion recognition system comprises an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module connected in sequence; the CNN module has convolutional layers, and the pyramid FSMN module has a pyramid memory block structure.
The speech emotion recognition system in this embodiment is an end-to-end feedforward deep neural network structure. The invention improves on the base network of the classical FSMN and DFSMN structures, adding convolutional layers to the base network to extract lower-level features.
The audio data of the invention is in the mainstream wav format with a sampling frequency of 16000 Hz. The raw audio data is framed and Fourier-transformed; each frame is 25 ms long and the frame shift is 10 ms. Through this preliminary processing, the audio data is converted into a 2-D spectrogram feature used as the model input. As detailed in Fig. 1, the bottom of the model is a two-layer convolution stack rather than DFSMN layers used directly. In addition, as the network deepens, downsampling is used to extract more robust features and to greatly reduce the feature size; this module significantly improves the accuracy of the model. When a convolution (or pooling) operation is performed with a kernel of size k and a stride of size s, the output is calculated by the following formulas:
W_out = (W_in - k)/s + 1,
H_out = (H_in - k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, k is the convolution kernel size and s is the stride of the kernel; H_out is the height of the output feature map and H_in is the height of the input feature map.
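For illustration, the framing and Fourier transform described above (16000 Hz audio, 25 ms frames, 10 ms frame shift) might be carried out as in the following sketch; the window choice and the log-magnitude normalization are assumptions, not the claimed preprocessing code.

import torch

def wav_to_spectrogram(waveform, sample_rate=16000):
    # waveform: 1-D tensor of samples; returns a (frames, 201) normalized log-magnitude spectrogram
    win_length = int(0.025 * sample_rate)          # 25 ms frame length -> 400 samples
    hop_length = int(0.010 * sample_rate)          # 10 ms frame shift  -> 160 samples
    spec = torch.stft(waveform, n_fft=400, hop_length=hop_length, win_length=win_length,
                      window=torch.hann_window(win_length), return_complex=True)
    spec = spec.abs().clamp(min=1e-10).log()       # log magnitude, 400 // 2 + 1 = 201 frequency bins
    spec = (spec - spec.mean()) / (spec.std() + 1e-5)   # per-utterance normalization
    return spec.transpose(0, 1)                    # (frames, 201)

print(wav_to_spectrogram(torch.randn(16000)).shape)    # one second of audio -> torch.Size([101, 201])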
As shown in Fig. 2a, the FSMN is a standard feedforward fully connected neural network with an additional memory block added to the hidden layer. Using a tapped-delay structure, the N_1 forward time steps and N_2 backward time steps around time step h_t are encoded into a fixed-size representation, and their weighted sum is taken as the current output, as shown by the following formula:
h̃_t = f( Σ_{i=0}^{N_1} a_i · h_{t-i} + Σ_{j=1}^{N_2} b_j · h_{t+j} ),
where h̃_t is the output of the memory block at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.
To make the FSMN deeper, as shown in Fig. 2b and unlike the original FSMN architecture, DFSMN removes the direct feedforward connections between hidden layers and uses only the memory block as input; at the same time, skip connections are introduced to overcome the gradient vanishing and explosion problems. The relationship between its input and output is shown by the following formulas:
p̃_t^l = H(p̃_t^{l-1}) + p_t^l + Σ_{i=0}^{N_1^l} a_i^l · h^l_{t-s_1·i} + Σ_{j=1}^{N_2^l} b_j^l · h^l_{t+s_2·j},
h_t^{l+1} = f(W^l · p̃_t^l + b^{l+1}),
where p̃_t^l is the output of the memory block of layer l at time t, H is an arbitrary activation function applied in the skip connection, p̃_t^{l-1} is the output of the memory block of layer l-1 at time t, p_t^l is the input of the memory block of layer l at time t, N_1^l is the number of forward time steps of layer l, a_i^l is the weight of the i-th forward time step of layer l, h^l_{t-s_1·i} is the i-th forward time step of layer l, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of layer l, h^l_{t+s_2·j} is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer of layer l+1 at time t, W^l is the weight parameter of the memory block of layer l, and b^{l+1} is its bias.
In the FSMN and DFSMN described above, the length of the memory block is the same in every layer, which means that in the above equations the window lengths N_1^l, N_2^l and the strides s_1, s_2 are identical across all hidden layers. In such a memory block structure, the lower layers already extract the contextual information of a given time step t and pass it upward, so long-range dependencies would be captured repeatedly and there is no need to reintroduce redundant information at the top layers. The invention therefore proposes a pyramid memory block structure in which the model extracts more contextual information at deeper levels; this pyramid memory block structure is realized by increasing the memory window lengths and strides with depth. As a result, the lower layers extract features from fine-grained information such as speaking rate and rhythm, while the top layers extract features from higher-level information such as emotion and gender. This pyramid memory block structure improves accuracy while reducing the number of parameters.
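A hedged PyTorch sketch of one such memory-block layer, following the strided skip-connection formula given earlier (for simplicity the hidden output h^l and the memory-block input p^l are taken to be the same tensor, f is taken as ReLU, and all sizes are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFSMNLayer(nn.Module):
    # One layer: strided forward/backward memory taps plus a skip connection from the previous memory block.
    def __init__(self, hidden, n1, n2, s1=1, s2=1):
        super().__init__()
        self.n1, self.n2, self.s1, self.s2 = n1, n2, s1, s2
        self.a = nn.Parameter(torch.randn(n1 + 1, hidden) * 0.01)   # a_i^l
        self.b = nn.Parameter(torch.randn(n2, hidden) * 0.01)       # b_j^l
        self.w = nn.Linear(hidden, hidden)                          # W^l and b^(l+1)

    def forward(self, p, prev_mem=None):
        # p: (batch, T, hidden) input of this layer's memory block
        T = p.size(1)
        pad = F.pad(p, (0, 0, self.n1 * self.s1, self.n2 * self.s2))
        mem = p.clone()                                             # the p_t^l term
        for i in range(self.n1 + 1):                                # forward taps h_{t - s1*i}
            off = (self.n1 - i) * self.s1
            mem = mem + self.a[i] * pad[:, off:off + T]
        for j in range(1, self.n2 + 1):                             # backward taps h_{t + s2*j}
            off = self.n1 * self.s1 + j * self.s2
            mem = mem + self.b[j - 1] * pad[:, off:off + T]
        if prev_mem is not None:
            mem = mem + prev_mem                                    # skip connection from layer l-1
        return torch.relu(self.w(mem)), mem                         # h_t^(l+1) and this layer's memory output

# Pyramid: memory window lengths and strides grow with depth (illustrative values)
layers = [PyramidFSMNLayer(128, n1=4, n2=4, s1=1, s2=1),
          PyramidFSMNLayer(128, n1=16, n2=16, s1=2, s2=2)]
x, mem = torch.randn(2, 72, 128), None
for layer in layers:
    x, mem = layer(x, mem)
print(x.shape)                                                      # torch.Size([2, 72, 128])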
The invention also adds an attention mechanism, applying it to the output of the pyramid FSMN; with the attention mechanism, each element of the output sequence depends on specific elements of the input sequence. This increases the computational burden of the model but produces a more accurate, better-performing model. In most implementations, attention is realized as a weight vector (usually the output of a softmax function) whose dimension equals the length of the input sequence. In this embodiment, a segment of speech is divided into many segments, which are called time steps in the neural network. Clearly, when a segment of speech contains a large amount of silence, not every time step is useful for the SER task, so the model needs to focus on specific regions. On this basis, a time-step attention module is built; as shown in Fig. 3, the time-step attention module can be described by the following formulas:
a_t = Average(h_t),
s = softmax(W_2 · f(W_1 · a + b_1) + b_2),
Y = X · s,
where a_t is the mean of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 is the weight parameter of the first layer in the time-step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is an arbitrary activation function, and a is the feature vector composed of all a_t; Y is the output result, and X is the input to the time-step attention module.
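A minimal PyTorch sketch of this time-step attention module, read literally from the formulas above (a fixed number of time steps T and f = ReLU are assumptions made for the sketch, not requirements of the invention):

import torch
import torch.nn as nn

class TimeStepAttention(nn.Module):
    # a_t = Average(h_t); s = softmax(W2 · f(W1 · a + b1) + b2); Y = X · s
    def __init__(self, n_steps, bottleneck=64):
        super().__init__()
        self.w1 = nn.Linear(n_steps, bottleneck)     # W1, b1
        self.w2 = nn.Linear(bottleneck, n_steps)     # W2, b2

    def forward(self, x):
        # x: (batch, T, hidden), the output of the pyramid FSMN stack
        a = x.mean(dim=2)                                            # a_t = Average(h_t) -> (batch, T)
        s = torch.softmax(self.w2(torch.relu(self.w1(a))), dim=1)    # attention weight per time step
        return torch.bmm(s.unsqueeze(1), x).squeeze(1)               # weighted sum over time -> (batch, hidden)

att = TimeStepAttention(n_steps=72)
print(att(torch.randn(2, 72, 128)).shape)            # torch.Size([2, 128])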
The output of the whole network is obtained through a final fully connected layer, and the optimization target of the model is the standard cross-entropy loss function. The length of the model's output vector matches the number of emotion categories; the value at each position of the output vector corresponds to the probability of the corresponding emotion category, and the emotion category with the highest probability is finally chosen as the output.
As shown in Fig. 4, the speech emotion recognition process in this embodiment specifically comprises the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and normalization on the received speech, obtains a spectrogram feature map, and passes the spectrogram feature map to the CNN module;
Step 2: the CNN module performs convolution operations on the received spectrogram feature map and builds a spectrogram feature map containing shallow audio information;
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains the deeper semantic information and contextual information in the spectrogram feature map, such as the speaker's gender and the speaker's emotion contained in a segment of speech;
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic information and contextual information; it first computes the attention score of each time step and then uses these scores to take a weighted sum of the entire spectrogram feature map along the time-step dimension, obtaining the feature vector of the whole utterance that is most strongly correlated with the speaker's emotion; this time-step attention module lets the model focus on the parts relevant to the speaker's emotion and improves model robustness;
Step 5: the feature vector is fed to the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e., the model classifies the predicted emotion; the output module is a fully connected layer whose output is a feature vector of length 4.
After discarding LSTM, the technical solution of the invention not only greatly improves the recognition speed of the model but also effectively reduces training time. In addition, unlike traditional speech emotion recognition systems, the invention does not use hand-crafted features as model input but directly uses the raw spectrogram as input, which contains more of the original information and gives the model stronger generalization ability. To make the model focus on emotion-relevant information without interference from other information, the invention proposes an attention mechanism based on time steps and integrates it into the model.
(1) Dataset used by the invention;
The SER model of the invention is evaluated on the IEMOCAP corpus, which contains several sessions of dialogue; in each session, two participants express specific types of emotion through conversation. The utterances are labeled as anger, fear, excitement, neutral, disgust, surprise, sadness, happiness, frustration, other, and XXX, where XXX denotes cases in which the annotators could not agree on a label. In this embodiment only 5 classes are selected: anger, excitement, happiness, neutral and sadness, giving a total of 5531 utterances. To balance the sample size of each emotion class, excitement is merged into the happy class. In addition, 10% of the total data is randomly selected as the test set, the remaining data is used as training data, and 10% of the training data is used as validation data to decide whether training needs to be stopped early.
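A small sketch of the class selection, merging and splitting described above (the label strings and the sample list are placeholders; loading IEMOCAP itself is not shown):

import random

def prepare_splits(samples, seed=0):
    # samples: list of (utterance_id, raw_label) pairs
    keep = {"angry", "excited", "happy", "neutral", "sad"}
    data = [(utt, "happy" if lab == "excited" else lab)      # merge excited into the happy class
            for utt, lab in samples if lab in keep]
    random.Random(seed).shuffle(data)
    n_test = len(data) // 10                                 # 10% of all data for testing
    test, rest = data[:n_test], data[n_test:]
    n_val = len(rest) // 10                                  # 10% of the training data for validation
    return rest[n_val:], rest[:n_val], test                  # train, validation, test

toy = [(f"utt{i}", random.choice(["angry", "excited", "happy", "neutral", "sad", "frustrated"]))
       for i in range(5531)]
train, val, test = prepare_splits(toy)
print(len(train), len(val), len(test))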
The corpus contains both video and audio channels; only the audio data is used in the invention. The audio was collected with a high-quality microphone (Schoeps CMIT 5U) at a sampling rate of 48 kHz. It is downsampled to 16 kHz, and a 201-dimensional acoustic feature is extracted. Unlike other technical solutions, this embodiment only uses the spectrogram as input, extracted with a 25 ms window moved in steps of 10 ms (100 fps). The whole-sentence speech data is also normalized.
(2) Description of the test process;
The PyTorch framework is used as the training tool. The network architecture is shown in Fig. 1: two 5x5 conv layers are used at the front, and the hidden layers and memory blocks of the 4 FSMN blocks have 256 and 128 nodes respectively. To avoid overfitting, batch normalization layers follow the CNN and pFSMN layers. The memory orders range from 4 to 32 and the strides from 1 to 2. The model of this embodiment is trained with PyTorch using the Adam optimizer, with the batch size set to 32 and the learning rate fixed at 0.003. Training iterates over the preset 4470 training audio samples; after each epoch the model is evaluated on the validation set, and training is stopped early when the recognition accuracy on the validation set does not change for 3 consecutive epochs. All experiments are carried out on a workstation with 1 NVIDIA TITAN XP.
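The training setup described above might look roughly like this in PyTorch; the model and the data loaders are placeholders, while the Adam optimizer, the fixed 0.003 learning rate, the cross-entropy objective and the 3-epoch early-stopping rule follow the text (here read as stopping after 3 epochs without improvement):

import torch
import torch.nn as nn

def train(model, train_loader, val_loader, max_epochs=100, patience=3):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
    criterion = nn.CrossEntropyLoss()
    best_acc, stale = 0.0, 0
    for epoch in range(max_epochs):
        model.train()
        for spec, label in train_loader:               # batches of 32 spectrograms and labels
            optimizer.zero_grad()
            loss = criterion(model(spec), label)
            loss.backward()
            optimizer.step()
        model.eval()
        correct = total = 0
        with torch.no_grad():                          # evaluate on the validation set each epoch
            for spec, label in val_loader:
                correct += (model(spec).argmax(dim=1) == label).sum().item()
                total += label.numel()
        acc = correct / total
        if acc > best_acc:
            best_acc, stale = acc, 0
        else:
            stale += 1
        if stale >= patience:                          # early stopping
            break
    return best_acc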
(3) Test results;
To measure the performance of the system, the overall accuracy of the test samples (weighted accuracy, WA) and the average recall over the emotion classes (unweighted accuracy, UA) are computed, together with the corresponding recall of each class.
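For clarity, WA and UA as defined above can be computed from predicted and reference labels as in this short sketch:

from collections import defaultdict

def weighted_unweighted_accuracy(y_true, y_pred):
    # WA: overall accuracy over all samples; UA: unweighted mean of the per-class recalls
    wa = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = defaultdict(lambda: [0, 0])            # class -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][1] += 1
        per_class[t][0] += int(t == p)
    ua = sum(c / n for c, n in per_class.values()) / len(per_class)
    return wa, ua

print(weighted_unweighted_accuracy(["happy", "sad", "sad", "angry"],
                                    ["happy", "sad", "angry", "angry"]))   # (0.75, 0.8333...)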
The test results show that, compared with LSTM, the improved sequence model performs 2.47% better, indicating that FSMN has better sequence-modeling performance on this task. HSF-CRNN (Luo D, Zou Y, Huang D. Investigation on joint representation learning for robust feature extraction in speech emotion recognition[J]. Proc. Interspeech 2018, 2018: 152-156.) is an improved combined CNN-RNN method proposed by Luo that uses hand-crafted speech features as input; the model of this embodiment achieves absolute improvements of 0.53% and 3.99% on UA and WA respectively, demonstrating experimentally that useful information can be extracted automatically from the spectrogram without using the common hand-crafted speech features. The invention also builds a basic C-biLSTM model for comparison; its accuracy on the "sad" samples is better than other methods, while its recognition accuracy on the other classes differs greatly. To illustrate how the attention mechanism works, a C-pFSMN model is built that is identical to the model of the invention except that it has no attention mechanism. The results show that, compared with C-pFSMN, the attention mechanism proposed by the invention performs well on the SER task, with an absolute UA improvement of 6.3%; in addition, the front-end CNN layers can extract more complex features and thus improve model performance as expected.
C-biLSTM is built from 2 CNN layers and 2 Bi-LSTM layers with 256 nodes in the hidden layers. It is similar to the model of this embodiment and is widely used in sequence modeling tasks. Therefore, the computing resources of C-biLSTM are also compared with those of the method of the invention. The results show that the model of the invention has only 1.85M parameters and a training time of 64 minutes, which is much faster than the C-LSTM model. This means that the invention achieves better performance while requiring fewer computing resources.
The technical solution of the invention solves the speech emotion recognition problem very well: the speed of recognition is greatly improved and training time is effectively reduced. In addition, unlike traditional speech emotion recognition systems, the invention does not use hand-crafted features as model input but directly uses the raw spectrogram as input, which contains more of the original information and gives the model stronger generalization ability. To make the model focus on emotion-relevant information without interference from other information, the invention proposes an attention mechanism based on time steps and integrates it into the model; experiments show that the model of the invention works well and requires fewer computing resources.
The above detailed description is an illustration of possible embodiments of the present invention; these embodiments do not limit the patent scope of the present invention, and any equivalent implementation or modification that does not depart from the present invention is intended to be included in the patent scope of this case.

Claims (10)

1. A speech emotion recognition system, characterized by comprising:
an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module connected in sequence, the CNN module having convolutional layers;
the audio preprocessing module converts received raw audio data into a spectrogram feature map;
the CNN module performs preliminary processing on the spectrogram feature map and builds a feature map containing shallow information;
the pyramid FSMN module further processes the feature map containing shallow information to obtain deeper semantic information and contextual information;
the time-step attention module focuses on specific regions along the time-step dimension and computes the influence weight of each time step on the final emotion recognition result;
the output module has several emotion categories and outputs the emotion category that best matches the raw audio data.
2. The speech emotion recognition system according to claim 1, characterized in that the time-step attention module is described by the following formulas:
a_t = Average(h_t),
s = softmax(W_2 · f(W_1 · a + b_1) + b_2),
Y = X · s,
where a_t is the mean of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 is the weight parameter of the first layer in the time-step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is an arbitrary activation function, and a is the feature vector composed of all a_t; Y is the output result, and X is the input to the time-step attention module.
3. The speech emotion recognition system according to claim 1, characterized in that when the convolutional layer performs a convolution operation with a kernel of size k and a stride of size s, the output of the convolutional layer is calculated by the following formulas:
W_out = (W_in - k)/s + 1,
H_out = (H_in - k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, k is the convolution kernel size and s is the stride of the kernel; H_out is the height of the output feature map and H_in is the height of the input feature map.
4. The speech emotion recognition system according to claim 1, characterized in that the pyramid FSMN module has a pyramid memory block structure; using the pyramid memory block structure, the time step h_t together with N_1 forward time steps and N_2 backward time steps is encoded into a fixed-size representation, and their weighted sum is taken as the current output, as shown by the following formula:
h̃_t = f( Σ_{i=0}^{N_1} a_i · h_{t-i} + Σ_{j=1}^{N_2} b_j · h_{t+j} ),
where h̃_t is the output of the memory block at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure uses skip connections, and the relationship between the input and output of the skip connection is shown by the following formulas:
p̃_t^l = H(p̃_t^{l-1}) + p_t^l + Σ_{i=0}^{N_1^l} a_i^l · h^l_{t-s_1·i} + Σ_{j=1}^{N_2^l} b_j^l · h^l_{t+s_2·j},
h_t^{l+1} = f(W^l · p̃_t^l + b^{l+1}),
where p̃_t^l is the output of the memory block of layer l at time t, H is an arbitrary activation function applied in the skip connection, p̃_t^{l-1} is the output of the memory block of layer l-1 at time t, p_t^l is the input of the memory block of layer l at time t, N_1^l is the number of forward time steps of layer l, a_i^l is the weight of the i-th forward time step of layer l, h^l_{t-s_1·i} is the i-th forward time step of layer l, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of layer l, h^l_{t+s_2·j} is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer of layer l+1 at time t, W^l is the weight parameter of the memory block of layer l, and b^{l+1} is its bias.
5. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that the convolutional layers are two layers.
6. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that the shallow information is audio loudness or frequency.
7. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that the several emotion categories are four emotion categories.
8. The speech emotion recognition system according to claim 7, characterized in that the four emotion categories are happy, sad, angry and neutral.
9. A speech emotion recognition method applied to the speech emotion recognition system according to claim 1, characterized by comprising the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and normalization on the received speech, obtains a spectrogram feature map, and passes the spectrogram feature map to the CNN module;
Step 2: the CNN module performs convolution operations on the received spectrogram feature map and builds a spectrogram feature map containing shallow audio information;
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains the deeper semantic information and contextual information in the spectrogram feature map;
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic information and contextual information; it first computes the attention score of each time step and then uses these scores to take a weighted sum of the entire spectrogram feature map along the time-step dimension, obtaining the feature vector of the whole utterance that is most strongly correlated with the speaker's emotion;
Step 5: the feature vector is fed to the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output.
10. The speech emotion recognition method according to claim 9, characterized in that, in Step 5, the output module is a fully connected layer whose output is a feature vector of length 4.
CN201910803429.7A 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method Active CN110534133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910803429.7A CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910803429.7A CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Publications (2)

Publication Number Publication Date
CN110534133A true CN110534133A (en) 2019-12-03
CN110534133B CN110534133B (en) 2022-03-25

Family

ID=68664896

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910803429.7A Active CN110534133B (en) 2019-08-28 2019-08-28 Voice emotion recognition system and voice emotion recognition method

Country Status (1)

Country Link
CN (1) CN110534133B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143567A (en) * 2019-12-30 2020-05-12 成都数之联科技有限公司 Comment emotion analysis method based on improved neural network
CN111539458A (en) * 2020-04-02 2020-08-14 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN112053007A (en) * 2020-09-18 2020-12-08 国网浙江兰溪市供电有限公司 Distribution network fault first-aid repair prediction analysis system and method
CN112634947A (en) * 2020-12-18 2021-04-09 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN113255800A (en) * 2021-06-02 2021-08-13 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN113903327A (en) * 2021-09-13 2022-01-07 北京卷心菜科技有限公司 Voice environment atmosphere recognition method based on deep neural network
CN115512693A (en) * 2021-06-23 2022-12-23 中移(杭州)信息技术有限公司 Audio recognition method, acoustic model training method, device and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A (en) * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method for apparatus for providing emotion speech recognition
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090063202A (en) * 2009-05-29 2009-06-17 포항공과대학교 산학협력단 Method for apparatus for providing emotion speech recognition
CN108010514A (en) * 2017-11-20 2018-05-08 四川大学 A kind of method of speech classification based on deep neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109460737A (en) * 2018-11-13 2019-03-12 四川大学 A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network
CN109637522A (en) * 2018-12-26 2019-04-16 杭州电子科技大学 A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph
CN109767790A (en) * 2019-02-28 2019-05-17 中国传媒大学 A kind of speech-emotion recognition method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Shiliang Zhang et al.: "Deep-FSMN for large vocabulary continuous speech recognition", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
张园园: "Research on multimodal emotion recognition methods based on deep learning", China Master's Theses Full-text Database, Information Science and Technology *
王金华 et al.: "Speech emotion recognition algorithm based on deep spatial attention features extracted from spectrograms", Telecommunications Science *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111143567A (en) * 2019-12-30 2020-05-12 成都数之联科技有限公司 Comment emotion analysis method based on improved neural network
CN111539458A (en) * 2020-04-02 2020-08-14 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN111539458B (en) * 2020-04-02 2024-02-27 咪咕文化科技有限公司 Feature map processing method and device, electronic equipment and storage medium
CN112053007A (en) * 2020-09-18 2020-12-08 国网浙江兰溪市供电有限公司 Distribution network fault first-aid repair prediction analysis system and method
CN112053007B (en) * 2020-09-18 2022-07-26 国网浙江兰溪市供电有限公司 Distribution network fault first-aid repair prediction analysis system and method
CN112634947A (en) * 2020-12-18 2021-04-09 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN112634947B (en) * 2020-12-18 2023-03-14 大连东软信息学院 Animal voice and emotion feature set sequencing and identifying method and system
CN113255800A (en) * 2021-06-02 2021-08-13 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113255800B (en) * 2021-06-02 2021-10-15 中国科学院自动化研究所 Robust emotion modeling system based on audio and video
CN113257280A (en) * 2021-06-07 2021-08-13 苏州大学 Speech emotion recognition method based on wav2vec
CN115512693A (en) * 2021-06-23 2022-12-23 中移(杭州)信息技术有限公司 Audio recognition method, acoustic model training method, device and storage medium
CN113903327A (en) * 2021-09-13 2022-01-07 北京卷心菜科技有限公司 Voice environment atmosphere recognition method based on deep neural network

Also Published As

Publication number Publication date
CN110534133B (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN110534133A (en) Speech emotion recognition system and speech emotion recognition method
Fayek et al. Towards real-time speech emotion recognition using deep neural networks
Jiao et al. Simulating dysarthric speech for training data augmentation in clinical speech applications
Cai et al. Deep maxout neural networks for speech recognition
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
Li et al. Exploiting the potentialities of features for speech emotion recognition
CN108564942A (en) One kind being based on the adjustable speech-emotion recognition method of susceptibility and system
CN107972028B (en) Man-machine interaction method and device and electronic equipment
Han et al. Speech emotion recognition with a resnet-cnn-transformer parallel neural network
CN112466326A (en) Speech emotion feature extraction method based on transform model encoder
Wang et al. Research on speech emotion recognition technology based on deep and shallow neural network
Cardona et al. Online phoneme recognition using multi-layer perceptron networks combined with recurrent non-linear autoregressive neural networks with exogenous inputs
Jiang et al. Speech Emotion Recognition Using Deep Convolutional Neural Network and Simple Recurrent Unit.
CN117095702A (en) Multi-mode emotion recognition method based on gating multi-level feature coding network
Fan et al. Adaptive Domain-Aware Representation Learning for Speech Emotion Recognition.
Fan et al. The impact of student learning aids on deep learning and mobile platform on learning behavior
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Tang et al. A bimodal network based on Audio–Text-Interactional-Attention with ArcFace loss for speech emotion recognition
CN106297769A (en) A kind of distinctive feature extracting method being applied to languages identification
Akinpelu et al. Lightweight deep learning framework for speech emotion recognition
Liu et al. Dual-tbnet: Improving the robustness of speech features via dual-transformer-bilstm for speech emotion recognition
CN112329819A (en) Underwater target identification method based on multi-network fusion
Wang et al. Design of smart home system speech emotion recognition model based on ensemble deep learning and feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant