CN110534133A - A speech emotion recognition system and speech emotion recognition method - Google Patents
- Publication number: CN110534133A
- Application number: CN201910803429.7A
- Authority: CN (China)
- Prior art keywords: time step, module, layer, emotion recognition, spectrogram feature
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
- G06N3/063 — Computing arrangements based on biological models; neural networks; physical realisation of neural networks, neurons or parts of neurons using electronic means
- G10L25/30 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks
- G10L25/63 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, specially adapted for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a speech emotion recognition system comprising, connected in sequence: an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module. The invention also discloses a speech emotion recognition method applied to the system, comprising the following steps: 1. preprocess the speech to obtain a spectrogram feature map; 2. operate on the spectrogram feature map to build a feature map containing shallow audio information; 3. further process that feature map to obtain deeper semantic and contextual information; 4. process the feature map carrying the deeper semantic and contextual information to obtain the feature vector of the whole utterance that correlates most strongly with the speaker's emotion; 5. output the emotion category corresponding to the whole utterance. Compared with the prior art, the speech emotion recognition performance of the invention is considerably improved.
Description
Technical field
The present invention relates to the fields of artificial intelligence and speech recognition, and in particular to a speech emotion recognition system and a speech emotion recognition method. It is an end-to-end deep neural network technique that uses DFSMN as the base network and improves upon it.
Background technique
With the steady progress of speech recognition technology and the wide adoption of speech recognition devices, human-computer interaction has become increasingly common in daily life. However, most of these devices can only recognize the textual content of human language; they cannot identify the affective state of the speaker. Yet speech emotion recognition has many useful applications in human-centered services and human-computer interaction, such as intelligent service robots, automated call centers and distance education. It has therefore attracted considerable research attention, and many methods have been proposed. With the rapid development of machine learning in recent years (for example, CNNs and other deep neural networks), such methods have shown good performance after trials and refinements in many fields. How to apply deep learning to this field is still being explored, and in practical applications many problems remain to be solved for this challenging task. Putting speech emotion recognition technology into practice requires collecting massive amounts of complex and difficult audio data for study, and making audio recorded in clean environments closer to speech under real-world conditions is a major unsolved problem in the prior art.
A typical speech emotion recognition (SER) system takes a speech waveform as input and outputs one of several target emotion categories. Traditional SER systems use Gaussian mixture models (GMMs) (Neiberg D, Elenius K, Laskowski K. Emotion recognition in spontaneous speech using GMMs [C] // Ninth International Conference on Spoken Language Processing. 2006.), hidden Markov models (HMMs) (Nwe T L, Foo S W, De Silva L C. Speech emotion recognition using hidden Markov models [J]. Speech Communication, 2003, 41(4): 603-623.), support vector machines (SVMs) (Yang N, Yuan J, Zhou Y, et al. Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification [J]. International Journal of Speech Technology, 2017, 20(1): 27-41.) and long short-term memory networks (LSTMs) (Tao F, Liu G. Advanced LSTM: A study about better time dependency modeling in emotion recognition [C] // 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 2906-2910.). These systems share one significant problem: they all rely on mature hand-crafted speech features, and the choice of features strongly affects model performance. The features typically include frame-level spectral, cepstral, pitch and energy features; statistical functionals of these features are then applied across multiple frames to obtain an utterance-level feature vector.
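As an illustration of the frame-level-plus-functionals pipeline described above, the following sketch pools a matrix of frame-level features into one utterance-level vector. This is a minimal example, not code from the patent; the particular choice of functionals (mean, standard deviation, min, max) is an assumption.

```python
import numpy as np

def utterance_vector(frame_feats: np.ndarray) -> np.ndarray:
    """Pool a (T, D) matrix of frame-level features (e.g. cepstral, pitch,
    energy) into one utterance-level vector by applying statistical
    functionals over the time axis."""
    stats = [frame_feats.mean(axis=0),
             frame_feats.std(axis=0),
             frame_feats.min(axis=0),
             frame_feats.max(axis=0)]
    return np.concatenate(stats)  # shape (4 * D,)
```

A classifier such as an SVM would then be trained on these fixed-length vectors, one per utterance.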
With the explosive growth of deep learning, some researchers have explored deep learning methods to build more robust SER models. Zhang Z et al. (Zhang Z, Ringeval F, Han J, et al. Facing realism in spontaneous emotion recognition from speech: Feature enhancement by autoencoder with LSTM neural networks [C] // Proceedings INTERSPEECH 2016, 17th Annual Conference of the International Speech Communication Association (ISCA). 2016: 3593-3597.) proposed a feature enhancement algorithm based on a long short-term memory (LSTM) autoencoder for extracting emotion information from speech. Recurrent neural networks (RNNs) have likewise proved to have strong sequence modeling ability, especially in speech recognition tasks. However, RNN training depends on backpropagation through time (BPTT), which, owing to its computational complexity, can be time-consuming and can suffer from vanishing and exploding gradients. To address these problems, the feedforward sequential memory network (hereinafter: FSMN) was proposed. In recent years, a large body of research has shown that in tasks such as speech recognition and language modeling, FSMN can model long-term dependencies without any recurrent feedback. Furthermore, Zhang S et al. (Zhang S, Lei M, Yan Z, et al. Deep-FSMN for large vocabulary continuous speech recognition [C] // 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018: 5869-5873.), in order to build a deeper network structure, applied skip connections within FSMN and greatly improved the earlier model.
Research on SER can be traced back to the 1980s. But owing to variation in gender, speaker, language, recording environment and other factors, SER remains challenging in practical applications. Many researchers have tried to solve these problems by designing elaborate hand-crafted speech features, so as to strengthen the connection with human emotion. However, these manually extracted speech features are applicable only to specific tasks and generalize poorly. The resulting need to design different speech features for different speech-related tasks runs counter to the spirit of deep learning.
Summary of the invention
In view of the deficiencies of the prior art, an object of the present invention is to provide a speech emotion recognition system, which is an end-to-end feedforward deep neural network structure.
In view of the deficiencies of the prior art, a further object of the present invention is to provide a speech emotion recognition method applied to the speech emotion recognition system.
To achieve the object of the present invention, the following technical solution is adopted: a speech emotion recognition system comprising, connected in sequence, an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module. The CNN module has convolutional layers, and the pyramid FSMN module has a pyramid memory block structure.

The audio preprocessing module converts the received raw audio data into a spectrogram feature map.

The CNN module performs preliminary processing on the spectrogram feature map, building a feature map containing shallow information.

The pyramid FSMN module further processes the feature map containing shallow information to obtain deeper semantic and contextual information.

The time-step attention module attends to specific regions along the time-step dimension and computes the weight of each time step's influence on the final emotion recognition.

The output module has several emotion categories and outputs the emotion category that best matches the raw audio data.
The time-step attention module can be expressed by the following formulas:

a_t = Average(h_t),

s = softmax(W_2 · f(W_1 · a + b_1) + b_2),

Y = X · s,

where a_t is the mean of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, softmax is the softmax activation function, W_1 is the weight parameter of the first layer of the time-step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is any activation function, and a is the feature vector composed of all a_t; Y is the output result, and X is the input of the time-step attention module.
When a convolutional layer performs a convolution operation with a kernel of size k and a stride of size s, its output is calculated by the following formulas:

W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
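The output-size formulas above can be captured in a small helper. This is an illustrative sketch of the stated formulas (valid-style convolution without padding), not code from the patent.

```python
def conv_output_size(w_in, h_in, k, s):
    """Output width and height of a convolution (or pooling) layer with a
    k-by-k kernel moved with stride s and no padding, per the formulas
    W_out = (W_in - k)/s + 1 and H_out = (H_in - k)/s + 1."""
    w_out = (w_in - k) // s + 1
    h_out = (h_in - k) // s + 1
    return w_out, h_out
```

For example, a 3x3 kernel with stride 2 over a 128-wide, 300-high feature map yields a 63-by-149 output, so each such layer roughly halves the feature map in both dimensions.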
Using the pyramid memory block structure, the time steps h_t within a forward (look-back) window of length N_1 and a backward (look-ahead) window of length N_2 are encoded into a fixed-size representation, and their weighted sum is taken as the current output:

h̃_t = f(h_t + Σ_{i=1..N_1} a_i · h_{t-i} + Σ_{j=1..N_2} b_j · h_{t+j}),

where h̃_t is the output of the memory block at time t, f is any activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.

The pyramid memory block structure can use skip connections; the relationship between the input and output of a skip connection is given by:

h̃_t^l = H(h̃_t^{l-1}) + p_t^l + Σ_{i=1..N_1^l} a_i^l · p_{t-s_1·i}^l + Σ_{j=1..N_2^l} b_j^l · p_{t+s_2·j}^l,

h_t^{l+1} = f(W^l · h̃_t^l + b^{l+1}),

where h̃_t^l is the output of the l-th layer memory block at time t, H is any activation function, h̃_t^{l-1} is the output of the (l-1)-th layer memory block, p_t^l is the input of the l-th layer memory block at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, p_{t-s_1·i}^l is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, p_{t+s_2·j}^l is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer at layer l+1 and time t, W^l is the weight parameter of the l-th layer memory block, and b^{l+1} is its bias.
The convolutional layers may be two in number; the shallow information may be audio loudness or frequency; and the several emotion categories may be four: happy, sad, angry and neutral.
To realize another object of the present invention, the following technical solution is adopted: a speech emotion recognition method applied to the speech emotion recognition system, comprising the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and regularization on the received speech, obtains a spectrogram feature map, and feeds it to the CNN module.
Step 2: the CNN module performs convolution operations on the received spectrogram feature map, building a spectrogram feature map that contains shallow audio information (such as audio loudness and frequency).
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains the deeper semantic and contextual information in the feature map, such as the speaker's gender and the speaker's emotion contained in a segment of speech.
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic and contextual information. It first computes an attention score for each time step and then uses the scores to form a weighted sum of the whole feature map along the time-step dimension, obtaining the feature vector of the whole utterance that correlates most strongly with the speaker's emotion. This time-step attention module lets the model focus on the parts relevant to the speaker's emotion and improves model robustness.
Step 5: the feature vector is fed to the output module. Each dimension of the vector represents the probability of the corresponding emotion category, and the category whose dimension has the highest probability is taken as the final output, so that the emotion category corresponding to the whole utterance is produced, i.e. the model classifies the predicted emotion. The output module is a fully connected layer whose output is a feature vector of length 4.
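The five steps above chain the four processing modules into one forward pass. The sketch below shows that chaining; the module callables (`preprocess`, `cnn`, `pyramid_fsmn`, `attention`, `output_layer`) are hypothetical stand-ins for the components described in the text, not an implementation from the patent.

```python
def recognize_emotion(wav, preprocess, cnn, pyramid_fsmn, attention, output_layer):
    """End-to-end forward pass of the described system, one module per step."""
    spec = preprocess(wav)        # step 1: spectrogram feature map
    shallow = cnn(spec)           # step 2: shallow audio information
    deep = pyramid_fsmn(shallow)  # step 3: deeper semantic/contextual info
    pooled = attention(deep)      # step 4: emotion-relevant weighting
    return output_layer(pooled)   # step 5: predicted emotion category
```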
Technical problem solved by the invention: the speech emotion recognition problem is solved with deep learning. A segment of speech contains a great deal of information, for example the speaker's gender, background noise, the speech content and the speaker's affective state, which makes speech emotion recognition very difficult and challenging. Likewise, although deep-learning-based speech emotion recognition has received some study, most of that work is based on LSTM, which inherently suffers from problems such as a huge parameter count and difficult training. In summary, existing speech emotion recognition technology still faces many problems that have not been well solved.
Advantages of the present invention:
1. In tasks such as speech recognition and language modeling, FSMN can model long-term dependencies without any recurrent feedback. Building on these findings, the invention proposes an end-to-end feedforward deep neural network for the speech emotion recognition task. Dispensing with LSTM not only greatly increases the model's recognition speed but also effectively reduces training time. Compared with traditional methods, the invention does not use various manually extracted audio features as model input but directly uses the raw spectrogram, which contains more of the original speech information and makes the model generalize better. It also reduces the complexity of model construction, since different input features need not be designed for different models.
2. Unlike most deep-learning-based speech emotion recognition research, the invention uses neither recurrent neural networks nor their variants. Instead it uses DFSMN, a standard feedforward fully connected neural network, as the base network, and proposes a pyramid memory block structure on top of DFSMN, making the whole model more robust and able to extract higher-level semantic information as the network deepens.
3. The bottom of the model is a two-layer convolutional stack rather than DFSMN layers directly. In addition, as the network deepens, downsampling is used to extract more robust features and greatly reduce the feature size.
4. To make the model focus on emotion-relevant information and avoid interference from other information, the invention also proposes an attention mechanism over time steps, applying attention to the output of the pyramid FSMN. With attention, each element of the output sequence depends on specific elements of the input sequence; this increases the model's computational burden but yields a more accurate, better-performing model. The validity of the invention's end-to-end network is demonstrated on the IEMOCAP speech emotion dataset.
5. The proposed end-to-end deep neural network structure, and the methods designed for each problem, run effectively and were well confirmed in experiments; moreover, testing is 3.3 times faster than the original model. Analysis and verification show that the speech emotion recognition performance of the invention is considerably improved.
Brief description of the drawings
Fig. 1 is the structure diagram of the speech emotion recognition system; pFSMN in the figure denotes pyramid FSMN.
Fig. 2a is the structure diagram of FSMN.
Fig. 2b is the structure diagram of DFSMN.
Fig. 3 is the structure diagram of the time-step attention module.
Fig. 4 is the flow chart of the speech emotion recognition method.
Specific embodiment
Embodiment
The present invention is further illustrated below with reference to a specific embodiment.
As shown in Fig. 1, a speech emotion recognition system comprises, connected in sequence: an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module. The CNN module has convolutional layers, and the pyramid FSMN module has a pyramid memory block structure.
The speech emotion recognition system in this embodiment is an end-to-end feedforward deep neural network structure. The invention improves on the base network of the classical FSMN and DFSMN structures, adding convolutional layers to the base network to achieve lower-level feature extraction.
The audio data of the invention is in the mainstream wav format with a sampling frequency of 16000 Hz. The raw audio data is framed and a Fourier transform is applied; each frame is 25 ms long with a frame shift of 10 ms. Through this preliminary processing, the audio data is converted into 2-D spectrogram features used as model input. As shown in Fig. 1, the bottom of the model is a two-layer convolutional stack rather than DFSMN layers directly. In addition, as the network deepens, downsampling is used to extract more robust features and greatly reduce the feature size; this module significantly improves the accuracy of the model. When a convolution (or pooling) operation is performed with a kernel of size k and a stride of size s, the output can be calculated by the following formulas:
W_out = (W_in - k)/s + 1,

H_out = (H_in - k)/s + 1,

where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
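The framing and Fourier transform described above (25 ms windows, 10 ms hop, at 16 kHz) can be sketched as follows. This is a minimal illustration of the stated preprocessing, not the patent's implementation; the Hamming window is an assumption, since the text does not name a window function.

```python
import numpy as np

def spectrogram(wav: np.ndarray, sr: int = 16000,
                frame_ms: float = 25.0, hop_ms: float = 10.0) -> np.ndarray:
    """Frame the waveform (25 ms windows, 10 ms hop at 16 kHz) and take the
    magnitude of a per-frame FFT, giving the 2-D spectrogram the model consumes."""
    frame = int(sr * frame_ms / 1000)   # 400 samples per frame
    hop = int(sr * hop_ms / 1000)       # 160 samples between frames
    n_frames = 1 + (len(wav) - frame) // hop
    window = np.hamming(frame)
    frames = np.stack([wav[i * hop:i * hop + frame] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape (n_frames, frame//2 + 1)
```

One second of 16 kHz audio thus yields a 98-frame by 201-bin spectrogram feature map.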
As shown in Fig. 2a, FSMN is a standard feedforward fully connected neural network with an additional memory module added to the hidden layer. A tapped-delay structure can encode, for each time step h_t, the N_1 forward (look-back) time steps and the N_2 backward (look-ahead) time steps into a fixed-size representation, and their weighted sum is taken as the current output:

h̃_t = f(h_t + Σ_{i=1..N_1} a_i · h_{t-i} + Σ_{j=1..N_2} b_j · h_{t+j}),

where h̃_t is the output of the memory module at time t, f is any activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step.
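The memory-module sum above can be sketched directly. This is an illustrative scalar-weight version under the formula's definitions (with zero padding at the sequence edges, and with the activation f omitted for clarity); it is not code from the patent.

```python
import numpy as np

def fsmn_memory(h: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Scalar-FSMN memory block over a (T, D) hidden sequence: each output
    mixes the current step with N1 past and N2 future steps,
    out[t] = h[t] + sum_i a[i]*h[t-i-1] + sum_j b[j]*h[t+j+1]."""
    T, _ = h.shape
    out = h.copy()
    for t in range(T):
        for i in range(len(a)):          # forward (look-back) taps
            if t - i - 1 >= 0:
                out[t] += a[i] * h[t - i - 1]
        for j in range(len(b)):          # backward (look-ahead) taps
            if t + j + 1 < T:
                out[t] += b[j] * h[t + j + 1]
    return out
```

Because the taps are fixed learned weights rather than recurrent state, the whole block is feedforward and can be evaluated for all time steps in parallel.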
To make FSMN deeper, as shown in Fig. 2b, DFSMN differs from the original FSMN architecture in that it removes the direct forward connections between hidden layers, using only the memory modules as input; at the same time, it introduces skip connections to overcome the vanishing and exploding gradient problems. The relationship between input and output is given by:

h̃_t^l = H(h̃_t^{l-1}) + p_t^l + Σ_{i=1..N_1^l} a_i^l · p_{t-s_1·i}^l + Σ_{j=1..N_2^l} b_j^l · p_{t+s_2·j}^l,

h_t^{l+1} = f(W^l · h̃_t^l + b^{l+1}),

where h̃_t^l is the output of the l-th layer memory block at time t, H is any activation function, h̃_t^{l-1} is the output of the (l-1)-th layer memory block, p_t^l is the input of the l-th layer memory block at time t, N_1^l is the forward order of the l-th layer, a_i^l is the weight of the i-th forward time step of the l-th layer, p_{t-s_1·i}^l is the i-th forward time step of the l-th layer, s_1 is the forward time-step stride, b_j^l is the weight of the j-th backward time step of the l-th layer, p_{t+s_2·j}^l is the j-th backward time step of the l-th layer, and s_2 is the backward time-step stride; h_t^{l+1} is the output of the hidden layer at layer l+1 and time t, W^l is the weight parameter of the l-th layer memory block, and b^{l+1} is its bias.
In the FSMN and DFSMN described above, the memory modules have the same length in every layer; that is, in the equations above, the N_1^l, N_2^l, s_1 and s_2 obtained through all hidden layers are identical. In such a memory block structure, the bottom layer already extracts the contextual information of a specific time step t, and the higher layers contain that information as well; long-term dependencies therefore end up being captured repeatedly, and there is no need to reintroduce redundant information at the top. The invention proposes a pyramid memory block structure, in which the model extracts more contextual information at deeper levels; the pyramid is realized by increasing N_1^l, N_2^l, s_1 and s_2 with depth. The bottom layers thus extract features from fine-grained information such as speaking rate and rhythm, while the top layers extract features from higher-level information such as emotion and gender. This pyramid memory block structure improves precision while reducing the number of parameters.
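The pyramid idea of growing orders and strides with depth can be made concrete with a per-layer configuration. The specific numbers below are hypothetical, chosen only to illustrate how the temporal context widens layer by layer; the patent does not state these values.

```python
# Hypothetical per-layer configuration for a pyramid memory block stack:
# deeper layers use larger filter orders (N1, N2) and larger strides (s1, s2),
# so each layer's memory covers a wider temporal context than the one below.
pyramid_layers = [
    {"N1": 2, "N2": 2, "s1": 1, "s2": 1},  # bottom layer: local context
    {"N1": 4, "N2": 4, "s1": 2, "s2": 2},
    {"N1": 8, "N2": 8, "s1": 4, "s2": 4},  # top layer: wide context
]

def receptive_field(layers):
    """Total past/future context (in time steps) accumulated over the stack."""
    past = sum(l["N1"] * l["s1"] for l in layers)
    future = sum(l["N2"] * l["s2"] for l in layers)
    return past, future
```

With this configuration the stack sees 42 past and 42 future time steps, while a flat stack at the bottom layer's setting would see only 6 each way for the same tap count.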
The invention also adds an attention mechanism, applied to the output of the pyramid FSMN. With attention, each element of the output sequence depends on specific elements of the input sequence. This increases the model's computational burden but yields a more accurate, better-performing model. In most implementations, attention is realized as a weight vector (usually the output of a softmax function) whose dimension equals the length of the input sequence. In this embodiment, a segment of speech is divided into many fragments, called time steps in the neural network. Clearly, when a segment of speech contains a lot of silence, not every time step is useful for the SER task, so the model needs to focus on specific regions. On this basis, a time-step attention module is constructed, as shown in Fig. 3, and can be described by the following formulas:
at=Average (ht),
Y=Xs,
Wherein, atIt is the mean value of t-th of time step, htIt is the feature vector of t-th of time step, Average is letter of averaging
Number;S is the output of attention mechanism,It is softmax activation primitive, W1It is the weight that time step pays attention to first layer in power module
Parameter, W2It is the weight parameter that time step pays attention to the second layer in power module, b1It is the biasing that time step pays attention to first layer in power module
Parameter, b2It is the offset parameter that time step pays attention to the second layer in power module, f is any activation primitive, and a is by all atIt constitutes
Feature vector;Y is the output of output module as a result, X is the input that time step pays attention to power module.
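The time-step attention formulas can be sketched as a small NumPy function. The choice of ReLU for f and the weight shapes are assumptions; the patent only requires f to be some activation and the outer function to be softmax.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def time_step_attention(X, W1, b1, W2, b2):
    """Time-step attention sketch: a_t = Average(h_t),
    s = softmax(W2 f(W1 a + b1) + b2), Y = weighted sum of X by s.

    X : (T, D) feature map, one row (h_t) per time step.
    """
    a = X.mean(axis=1)                     # a_t = Average(h_t), shape (T,)
    hidden = np.maximum(0.0, W1 @ a + b1)  # f: ReLU assumed here
    s = softmax(W2 @ hidden + b2)          # one attention weight per time step
    Y = s @ X                              # Y = Xs: weighted sum over time
    return Y, s
```

With all-zero weights the attention collapses to a uniform average over time steps, which is a convenient sanity check.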
The output of the whole network is finally obtained through a fully connected layer; the optimization objective of the model is the standard cross-entropy loss function. The length of the model's output vector matches the number of emotion categories, the value at each position of the output vector corresponds to the probability of that emotion category, and the category with the highest probability is finally chosen as the output.
As shown in Figure 4, the speech emotion recognition process in this embodiment specifically includes the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and normalization on the received speech, obtains a spectrogram feature map, and feeds the spectrogram feature map to the CNN module.
Step 2: the CNN module performs convolution operations on the received spectrogram feature map and constructs a spectrogram feature map containing shallow audio information.
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains deeper semantic and contextual information in the spectrogram feature map, such as the speaker's gender and emotion contained in a segment of speech.
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic and contextual information. It first computes the attention score of each time step, then uses these scores to take a weighted sum over the time-step dimension of the whole spectrogram feature map, obtaining the feature vector most correlated with the speaker's emotion across the whole utterance. This time-step attention module makes the model focus on the parts relevant to the speaker's emotion and improves model robustness.
Step 5: the feature vector is fed to the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the category whose dimension has the highest probability is taken as the final output result, so that the emotion category corresponding to the whole utterance is output, i.e. the model classifies the predicted emotion. The output module is a fully connected layer whose output is a feature vector of length 4.
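The decision rule of step 5 (length-4 output vector, softmax to probabilities, argmax to a label) can be sketched as follows; the class ordering is an assumption, since the text does not fix one.

```python
import numpy as np

EMOTIONS = ["happy", "sad", "angry", "neutral"]  # order assumed, not specified

def decide_emotion(logits):
    """Map the length-4 output vector to per-class probabilities and a label."""
    e = np.exp(logits - np.max(logits))          # numerically stable softmax
    probs = e / e.sum()
    return EMOTIONS[int(np.argmax(probs))], probs

label, probs = decide_emotion(np.array([0.2, 1.5, -0.3, 0.1]))
```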
By discarding the LSTM, the technical solution of the invention not only greatly improves the recognition speed of the model but also effectively reduces training time. In addition, unlike traditional speech emotion recognition systems, the invention does not use hand-crafted features as model input but directly uses the raw spectrogram as input, which contains more of the original information and therefore gives the model stronger generalization ability. To make the model focus on emotion-relevant information without interference from other information, the invention proposes a time-step-based attention mechanism and integrates it into the model.
(1) Dataset used by the invention.
The SER model of the invention is evaluated on the IEMOCAP corpus, which consists of several dialogues; in each session, two participants express specific types of emotion through conversation. The utterances are labeled as angry, fearful, excited, neutral, disgusted, surprised, sad, happy, frustrated, other, and XXX, where XXX marks cases in which the annotators could not agree on a label. In this embodiment only 5 classes are selected (angry, excited, happy, neutral and sad), for a total of 5531 utterances. To balance the sample sizes of the emotion classes, the excited class is merged into the happy class. In addition, 10% of the total data is randomly selected as the test set, the remaining data is used as training data, and 10% of the training data is held out as validation data to check whether early stopping is needed.
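The described split (10% of all data for test, then 10% of the remaining training data held out for validation) can be sketched in plain Python; the function name and seeding are illustrative.

```python
import random

def split_utterances(utts, seed=0):
    """10% test; of the rest, 10% validation (early stopping), 90% train."""
    rng = random.Random(seed)
    utts = list(utts)
    rng.shuffle(utts)
    n_test = int(0.1 * len(utts))
    test, rest = utts[:n_test], utts[n_test:]
    n_val = int(0.1 * len(rest))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test
```

On the 5531 IEMOCAP utterances this yields 553 test, 497 validation and 4481 training items (integer truncation); the text's later figure of 4470 training items suggests a slightly different rounding, which the patent does not spell out.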
The corpus contains both video and audio channels; only the audio data is used in the invention. The audio was captured with high-quality microphones (Schoeps CMIT 5U) at a sampling rate of 48 kHz. It is downsampled to 16 kHz, and a 201-dimensional acoustic feature is extracted. Unlike other technical solutions, this embodiment uses only the spectrogram as input; the extraction is performed with a 25 ms window and a step size of 10 ms (100 fps). The whole-utterance speech data is also normalized.
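The frame arithmetic behind these numbers checks out, assuming the FFT size equals the 25 ms window (a plausible reading, since 400/2 + 1 = 201 one-sided bins; the text does not state the FFT size):

```python
# Window/hop arithmetic for the 201-dimensional spectrogram input.
sr = 16000                 # Hz, after downsampling from 48 kHz
win = int(0.025 * sr)      # 25 ms window  -> 400 samples
hop = int(0.010 * sr)      # 10 ms hop     -> 160 samples
n_bins = win // 2 + 1      # one-sided FFT bins, n_fft = win assumed -> 201
fps = sr // hop            # 100 frames per second, matching "100fps"
```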
(2) Description of the test procedure.
The PyTorch framework is used as the training tool. The network architecture is shown in Figure 1: two 5x5 conv layers are used at the front, and the hidden layers and memory blocks of the 4 FSMN blocks have 256 and 128 nodes respectively. To avoid over-fitting, the CNN and pFSMN layers are followed by batch normalization layers; the memory orders range from 4 to 32 and the strides from 1 to 2. The model of this embodiment is trained with PyTorch's Adam optimizer, with the batch size set to 32 and the learning rate fixed at 0.003. The 4470 preset training audio items are iterated over, and the model is evaluated on the validation set after every epoch; training stops early when the recognition accuracy on the validation set does not change for 3 consecutive epochs. All experiments are run on a workstation with one NVIDIA TITAN XP.
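The early-stopping rule (stop when validation accuracy is unchanged for 3 consecutive epochs) can be sketched as follows; `evaluate` stands in for one epoch of training plus validation scoring and is a hypothetical callback, not part of the patent.

```python
def train_with_early_stopping(evaluate, max_epochs=100, patience=3):
    """Stop once validation accuracy fails to improve `patience` epochs in a row."""
    best, stale, history = -1.0, 0, []
    for epoch in range(max_epochs):
        acc = evaluate(epoch)          # train one epoch, return validation accuracy
        history.append(acc)
        if acc > best:
            best, stale = acc, 0
        else:
            stale += 1
            if stale >= patience:
                break                  # accuracy flat for `patience` epochs: stop
    return best, history
```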
(3) Test results.
To measure the performance of the system, the overall accuracy on the test samples (weighted accuracy, WA), the average recall over the emotion classes (unweighted accuracy, UA), and the recall of each individual class are computed.
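WA and UA as used here can be computed from labels as follows: WA is overall accuracy, UA is the unweighted mean of per-class recalls. A minimal stdlib sketch:

```python
from collections import defaultdict

def wa_ua(y_true, y_pred):
    """Weighted accuracy (overall) and unweighted accuracy (mean per-class recall)."""
    wa = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    stats = defaultdict(lambda: [0, 0])          # class -> [hits, count]
    for t, p in zip(y_true, y_pred):
        stats[t][1] += 1
        stats[t][0] += int(t == p)
    ua = sum(h / n for h, n in stats.values()) / len(stats)
    return wa, ua
```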
The test results show that, compared with the LSTM, the improved sequence model performs 2.47% better, indicating that the FSMN is the better sequence model for this task. HSF-CRNN (Luo D, Zou Y, Huang D. Investigation on Joint Representation Learning for Robust Feature Extraction in Speech Emotion Recognition [J]. Proc. Interspeech 2018, 2018: 152-156.) is an improved combined CNN-RNN method proposed by Luo that uses hand-crafted speech features as input; the model of this embodiment achieves absolute improvements of 0.53% and 3.99% over it on UA and WA respectively, demonstrating experimentally that useful information can be extracted automatically from the spectrogram without using the common hand-crafted speech features. The invention also builds a basic C-biLSTM model for comparison: its accuracy on the "sad" samples is better than other methods, but its recognition accuracy on the other classes is far worse. To illustrate the working principle of the attention mechanism, a C-pFSMN model is built that is identical to the model of the invention except that it has no attention mechanism. The results show that, compared with C-pFSMN, the attention mechanism proposed by the invention performs well on the SER task, improving UA by an absolute 6.3%; in addition, the front-end CNN layers can extract more complex features and thus improve model performance as expected.
The C-biLSTM is built from 2 CNN layers and 2 Bi-LSTM layers, with 256 nodes in the hidden layers. It is similar to the model of this embodiment and is widely used in sequence modeling tasks, so the computing resources of the C-biLSTM are also compared with the method of the invention. The results show that the model of the invention has only 1.85M parameters while its training time is 64 minutes, much faster than the C-LSTM model. This means the invention can achieve better performance while requiring fewer computing resources.
The technical solution of the invention solves the speech emotion recognition problem well: the recognition speed is greatly improved, and the training time is effectively reduced at the same time. In addition, unlike traditional speech emotion recognition systems, the invention does not use hand-crafted features as model input but directly uses the raw spectrogram as input, which contains more of the original information and gives the model stronger generalization ability. To make the model focus on emotion-relevant information without interference from other information, the invention proposes a time-step-based attention mechanism and integrates it into the model; experiments show that the model of the invention is effective and requires fewer computing resources.
The detailed description above illustrates possible embodiments of the present invention; the embodiments do not limit the patent scope of the invention, and all equivalent implementations or changes made without departing from the present invention are intended to be included within the patent scope of this case.
Claims (10)
1. A speech emotion recognition system, characterized by comprising:
an audio preprocessing module, a CNN module, a pyramid FSMN module, a time-step attention module and an output module connected in sequence, the CNN module having convolutional layers;
the audio preprocessing module converts the received raw audio data into a spectrogram feature map;
the CNN module performs preliminary processing on the spectrogram feature map and constructs a feature map containing shallow information;
the pyramid FSMN module further processes the feature map containing shallow information to obtain deeper semantic information and contextual information;
the time-step attention module is used to focus on specific regions among the time steps and to calculate the weight of influence of different time steps on the final emotion recognition;
the output module has several emotion categories and is used to output the emotion category that best matches the raw audio data.
2. The speech emotion recognition system according to claim 1, characterized in that the time-step attention module is specifically given by the following formulas:
a_t = Average(h_t),
s = σ(W_2 f(W_1 a + b_1) + b_2),
Y = Xs,
where a_t is the mean of the t-th time step, h_t is the feature vector of the t-th time step, and Average is the averaging function; s is the output of the attention mechanism, σ is the softmax activation function, W_1 is the weight parameter of the first layer of the time-step attention module, W_2 is the weight parameter of the second layer, b_1 is the bias parameter of the first layer, b_2 is the bias parameter of the second layer, f is an arbitrary activation function, and a is the feature vector formed by all a_t; Y is the output result of the output module and X is the input to the time-step attention module.
3. The speech emotion recognition system according to claim 1, characterized in that, when the convolutional layer performs a convolution operation with a kernel of size k and a stride of size s, the output of the convolutional layer is calculated by the following formulas:
W_out = (W_in - k)/s + 1,
H_out = (H_in - k)/s + 1,
where W_out is the width of the output spectrogram feature map, W_in is the width of the input spectrogram feature map, H_out is the height of the output feature map, H_in is the height of the input feature map, k is the convolution kernel size, and s is the stride of the convolution kernel.
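Claim 3's formulas assume a square kernel and no padding; a one-line helper makes the arithmetic concrete (the function name is illustrative):

```python
def conv_output_size(w_in, h_in, k, s):
    """Output width/height of a k x k convolution with stride s, no padding:
    W_out = (W_in - k)/s + 1, H_out = (H_in - k)/s + 1."""
    return (w_in - k) // s + 1, (h_in - k) // s + 1
```

For example, a 201 x 100 spectrogram patch passed through a 5 x 5 kernel with stride 1 yields a 197 x 96 map.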
4. The speech emotion recognition system according to claim 1, characterized in that the pyramid FSMN module has a pyramid memory block structure; using the pyramid memory block structure, a time step h_t together with its N_1 forward time steps and N_2 backward time steps is encoded into a fixed-size length, and the sum over the N_1 and N_2 terms is computed as the current output, which is specifically given by the following formula:
h~_t = f( Σ_{i=0..N_1} a_i h_{t-i} + Σ_{j=1..N_2} b_j h_{t+j} ),
where h~_t is the output of the memory module at time t, f is an arbitrary activation function, a_i is the weight of the i-th forward time step, h_{t-i} is the i-th forward time step, b_j is the weight of the j-th backward time step, and h_{t+j} is the j-th backward time step;
the pyramid memory block structure uses skip connections, whose input-output relationship is given by the following formulas:
p^l_t = f( p^(l-1)_t + h^l_t + Σ_{i=0..N_1^l} a^l_i h^l_{t-s_1·i} + Σ_{j=1..N_2^l} c^l_j h^l_{t+s_2·j} ),
h^(l+1)_t = f( W^l p^l_t + b^(l+1) ),
where p^l_t is the output of the layer-l memory block at time t, f is an arbitrary activation function, p^(l-1)_t is the output of the layer-(l-1) memory block at time t, h^l_t is the input of the layer-l memory block, N_1^l is the number of forward time steps of layer l, a^l_i is the weight of the i-th forward time step of layer l, h^l_{t-s_1·i} is the i-th forward time step of layer l, s_1 is the forward time-step stride, c^l_j is the weight of the j-th backward time step of layer l, h^l_{t+s_2·j} is the j-th backward time step of layer l, and s_2 is the backward time-step stride; h^(l+1)_t is the hidden-layer output of layer l+1 at time t, W^l is the weight parameter of the layer-l memory block, and b^(l+1) is the bias of the layer-l memory block.
5. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that there are two convolutional layers.
6. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that the shallow information is audio loudness or frequency.
7. The speech emotion recognition system according to any one of claims 1 to 4, characterized in that the several emotion categories are four emotion categories.
8. The speech emotion recognition system according to claim 7, characterized in that the four emotion categories are happy, sad, angry and neutral.
9. A speech emotion recognition method applied to the speech emotion recognition system of claim 1, characterized by comprising the following steps:
Step 1: the audio preprocessing module performs preliminary feature extraction and normalization on the received speech, obtains a spectrogram feature map, and feeds the spectrogram feature map to the CNN module;
Step 2: the CNN module performs convolution operations on the received spectrogram feature map and constructs a spectrogram feature map containing shallow audio information;
Step 3: the pyramid FSMN module further processes the spectrogram feature map containing shallow audio information and, through the pyramid memory block structure, obtains deeper semantic information and contextual information in the spectrogram feature map;
Step 4: the time-step attention module processes the spectrogram feature map carrying the deeper semantic and contextual information: it first computes the attention score of each time step, then uses these scores to take a weighted sum over the time-step dimension of the whole spectrogram feature map, obtaining the feature vector most correlated with the speaker's emotion across the whole utterance;
Step 5: the feature vector is fed to the output module; each dimension of the feature vector represents the probability of the corresponding emotion category, and the emotion category corresponding to the dimension with the highest probability is taken as the final output result, so as to output the emotion category corresponding to the whole utterance.
10. The speech emotion recognition method according to claim 9, characterized in that, in step 5, the output module is a fully connected layer whose output is a feature vector of length 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910803429.7A CN110534133B (en) | 2019-08-28 | 2019-08-28 | Voice emotion recognition system and voice emotion recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110534133A true CN110534133A (en) | 2019-12-03 |
CN110534133B CN110534133B (en) | 2022-03-25 |
Family
ID=68664896
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910803429.7A Active CN110534133B (en) | 2019-08-28 | 2019-08-28 | Voice emotion recognition system and voice emotion recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110534133B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111143567A (en) * | 2019-12-30 | 2020-05-12 | 成都数之联科技有限公司 | Comment emotion analysis method based on improved neural network |
CN111539458A (en) * | 2020-04-02 | 2020-08-14 | 咪咕文化科技有限公司 | Feature map processing method and device, electronic equipment and storage medium |
CN112053007A (en) * | 2020-09-18 | 2020-12-08 | 国网浙江兰溪市供电有限公司 | Distribution network fault first-aid repair prediction analysis system and method |
CN112634947A (en) * | 2020-12-18 | 2021-04-09 | 大连东软信息学院 | Animal voice and emotion feature set sequencing and identifying method and system |
CN113255800A (en) * | 2021-06-02 | 2021-08-13 | 中国科学院自动化研究所 | Robust emotion modeling system based on audio and video |
CN113257280A (en) * | 2021-06-07 | 2021-08-13 | 苏州大学 | Speech emotion recognition method based on wav2vec |
CN113903327A (en) * | 2021-09-13 | 2022-01-07 | 北京卷心菜科技有限公司 | Voice environment atmosphere recognition method based on deep neural network |
CN115512693A (en) * | 2021-06-23 | 2022-12-23 | 中移(杭州)信息技术有限公司 | Audio recognition method, acoustic model training method, device and storage medium |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20090063202A (en) * | 2009-05-29 | 2009-06-17 | 포항공과대학교 산학협력단 | Method for apparatus for providing emotion speech recognition |
CN108010514A (en) * | 2017-11-20 | 2018-05-08 | 四川大学 | A kind of method of speech classification based on deep neural network |
CN109036465A (en) * | 2018-06-28 | 2018-12-18 | 南京邮电大学 | Speech-emotion recognition method |
CN109285562A (en) * | 2018-09-28 | 2019-01-29 | 东南大学 | Speech-emotion recognition method based on attention mechanism |
CN109460737A (en) * | 2018-11-13 | 2019-03-12 | 四川大学 | A kind of multi-modal speech-emotion recognition method based on enhanced residual error neural network |
CN109637522A (en) * | 2018-12-26 | 2019-04-16 | 杭州电子科技大学 | A kind of speech-emotion recognition method extracting deep space attention characteristics based on sound spectrograph |
CN109767790A (en) * | 2019-02-28 | 2019-05-17 | 中国传媒大学 | A kind of speech-emotion recognition method and system |
Non-Patent Citations (3)
Title |
---|
SHILIANG ZHANG ETC.: "DEEP-FSMN FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION", 《2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING(ICASSP)》 * |
张园园: "基于深度学习的多模态情感识别方法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王金华 等: "基于语谱图提取深度空间注意特征的语音情感识别算法", 《电信科学》 * |
Also Published As
Publication number | Publication date |
---|---|
CN110534133B (en) | 2022-03-25 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||