CN111583964A - Natural speech emotion recognition method based on multi-mode deep feature learning - Google Patents

Natural speech emotion recognition method based on multi-mode deep feature learning

Info

Publication number
CN111583964A
Authority
CN
China
Prior art keywords
neural network
dimensional
convolutional neural
emotion
dimensional convolutional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010290317.9A
Other languages
Chinese (zh)
Other versions
CN111583964B (en)
Inventor
张石清
赵小明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taizhou University
Original Assignee
Taizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taizhou University
Priority to CN202010290317.9A
Publication of CN111583964A
Application granted
Publication of CN111583964B
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a natural speech emotion recognition method based on multi-modal deep feature learning, which comprises the following steps: S1, generating appropriate multi-modal representations: three suitable audio representations are generated from the original one-dimensional speech signal as inputs to the subsequent different CNN models; S2, learning multi-modal features with a multiple deep convolutional neural network model; S3, integrating the classification results of the different CNN models by a score-level fusion method and outputting the final speech emotion recognition result. By fusing multiple deep convolutional neural networks to learn deep multi-modal features with complementary characteristics, the method significantly improves emotion classification performance and provides highly discriminative features for natural speech emotion recognition.

Description

Natural speech emotion recognition method based on multi-mode deep feature learning
Technical Field
The invention relates to the technical field of speech signal processing and pattern recognition, and in particular to a natural speech emotion recognition method based on multi-modal deep feature learning.
Background
In recent years, natural speech emotion recognition has become an active and challenging research topic in the fields of pattern recognition, speech signal processing, artificial intelligence, and the like. Unlike conventional input devices, it interacts with the computer directly through speech and aims to provide intelligent emotion services for voice call centers, healthcare, affective computing, and similar applications.
At present, a great deal of the preliminary work in the field of speech emotion recognition has been performed on simulated (acted) emotion, because building a simulated emotion database is much easier than building a natural one. In recent years, emotion recognition of natural speech in real environments has attracted increasing attention from researchers, because it is closer to reality and much more difficult than recognizing simulated emotion.
Speech emotion feature extraction is a key step in speech emotion recognition; it aims to extract, from the emotional speech signal, feature parameters that reflect the emotion expressed by the speaker. Currently, a large body of the speech emotion recognition literature employs manually designed features, such as prosodic features (fundamental frequency, amplitude, utterance duration), timbre features (formants, spectral energy distribution, harmonic-to-noise ratio), and spectral features (Mel-frequency cepstral coefficients (MFCC), linear predictive coefficients (LPC), and linear predictive cepstral coefficients (LPCC)). However, these manually designed speech emotion feature parameters are low-level features and exhibit a semantic gap with the emotion labels understood by humans, so it is necessary to develop high-level speech emotion feature extraction methods.
Emerging deep learning techniques may provide a way to address this problem: with deeper architectures, deep learning generally has advantages over traditional approaches, including the ability to automatically detect complex structures and features without manual feature engineering.
To date, various representative deep learning techniques, such as the deep neural network (DNN), the deep convolutional neural network (CNN), and the long short-term memory recurrent neural network (LSTM-RNN), have been applied to speech emotion recognition.
For example, a "speech emotion recognition method based on multi-scale deep convolution cyclic neural network" disclosed in the Chinese patent literature (publication No. CN108717856A) combines a deep Convolution Neural Network (CNN) and a long-term memory network (LSTM), and simultaneously considers the characteristics of different discriminative power of two-dimensional (2D) voice frequency spectrum segment information with different lengths on different emotion type identification, provides a multi-scale CNN + LSTM mixed deep learning model, is applied to natural voice emotion identification in actual environment, however, the speech emotion recognition method using 2D speech spectral fragment information as CNN input cannot capture dynamic change information expressed by features in 2D time-frequency (time-frequency) between consecutive frames in a sentence of speech, and thus cannot provide feature parameters with good discriminative power for natural speech emotion recognition. Although the LSTM-RNN can be used for modeling of temporal information, the temporal information is over-emphasized.
Disclosure of Invention
The invention provides a natural speech emotion recognition method based on multi-modal deep feature learning to overcome the defect of the prior art that the dynamic change information in the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured, so that highly discriminative feature parameters cannot be provided for natural speech emotion recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
a natural speech emotion recognition method based on multi-mode deep feature learning comprises the following steps:
s1, generating an appropriate multi-modal representation: generating three appropriate audio representations from the original one-dimensional speech signal for input of subsequent different CNN models;
s2, learning multi-modal characteristics by adopting a multi-depth convolution neural network model;
and S3, integrating the classification results of different CNN models by adopting a score level fusion method, and outputting the final speech emotion recognition result.
In the scheme of the invention, the deep multi-modal features learned by a multiple deep convolutional neural network model (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), are complementary to a certain degree and are used for natural speech emotion recognition. This solves the technical problem that the prior art cannot capture the dynamic change information in the 2D time-frequency feature representation between consecutive frames within an utterance. By fusing the multiple deep convolutional neural networks to learn complementary deep multi-modal features, the emotion classification performance is significantly improved and highly discriminative features are provided for natural speech emotion recognition.
Preferably, the step S1 includes the steps of:
s1.1, segmenting a one-dimensional original voice signal waveform into segments, inputting the segments into a one-dimensional convolution neural network, and setting the length of a voice segment;
s1.2, extracting a two-dimensional Mel frequency spectrogram from a one-dimensional original voice signal, and constructing a three-channel frequency spectrum segment similar to an RGB image as the input of a two-dimensional convolution neural network;
s1.3, forming a 3D dynamic segment similar to a video by a plurality of continuous two-dimensional Mel frequency spectrum segment sequences, and performing space-time feature learning by taking the 3D dynamic segment as the input of a three-dimensional convolution neural network.
Preferably, the step S2 includes the steps of:
s2.1, performing one-dimensional original voice signal waveform modeling by adopting a one-dimensional convolution neural network: constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model;
s2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the size of the generated three-channel Mel-spectrogram segments;
s2.3, modeling the space-time dynamic information based on the three-dimensional convolutional neural network: and executing a space-time characteristic learning task based on the extracted video-like 3D dynamic segment, and avoiding network overfitting by adopting a dropout regularization method.
Preferably, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer, and each one-dimensional convolutional layer comprises a batch normalization layer and a rectified linear unit (ReLU) activation function layer; that is, the input data are normalized before the one-dimensional convolutional neural network is trained.
Preferably, the AlexNet deep convolutional neural network model includes five convolutional layers, three maximum pooling layers, and two fully connected layers.
Preferably, the fine tuning in step S2.2 comprises the following steps:
1) copying the whole network parameters from a pre-trained AlexNet deep convolution neural network model to initialize the used two-dimensional convolution neural network;
2) replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose sample label vector corresponds to the number of emotion classes in the data set used;
3) the used two-dimensional convolutional neural network is retrained using a standard back propagation strategy.
Fine-tuning is widely used for transfer learning in computer vision and alleviates the problem of data insufficiency.
Preferably, the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, each containing a batch normalization layer and a rectified linear unit (ReLU) activation function layer, two three-dimensional max-pooling layers, two fully connected layers, and one softmax output layer.
Preferably, the step S3 includes the steps of:
s3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, and updating the network parameters;
s3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e. averaging the classification results of all segments within one utterance, so as to generate utterance-level emotion classification scores;
s3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
and S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models with a score-level fusion strategy to carry out the final emotion classification.
Preferably, step S3.4 may be expressed as:
score_fusion = λ_1·score_1D + λ_2·score_2D + λ_3·score_3D,
λ_1 + λ_2 + λ_3 = 1;
wherein λ_1, λ_2 and λ_3 represent the weight values of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively; the optimal values of λ_1, λ_2 and λ_3 are determined by searching over the range [0, 1] in steps of 0.1.
Preferably, the updating of the network parameters in step S3.1 is shown by the following expression:
(W*, θ*) = argmin_(W,θ) Σ_(i=1..N) H(y_i, softmax(W·γ(a_i; θ)));
where W represents the weight values of the softmax layer among the network parameters θ, N is the number of training segments, γ(a_i; θ) represents the output of the last fully connected (FC) layer for the input segment a_i, and y_i represents the class label vector of the i-th segment, which is equal to the emotion category of the whole utterance. H represents the softmax logarithmic loss function, expressed as:
H(y_i, p_i) = −Σ_(c=1..C) y_(i,c)·log p_(i,c);
where p_i = softmax(W·γ(a_i; θ)) is the predicted class-probability vector and C represents the total number of emotion categories.
The invention has the beneficial effects that it solves the technical problems of the prior art that the dynamic change information in the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured, so that highly discriminative feature parameters cannot be provided for natural speech emotion recognition, and that temporal information is over-emphasized.
Drawings
FIG. 1 is a general architectural block diagram of the present invention.
FIG. 2 is a confusion matrix of the recognition results of the score-level fusion of the present invention on the AFEW5.0 dataset.
FIG. 3 is a confusion matrix of the recognition results of the score-level fusion of the present invention on the BAUM-1s dataset.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Example 1: in this embodiment, a natural speech emotion recognition method based on multi-mode deep feature learning, as shown in fig. 1, includes the following steps:
s1, generating an appropriate multi-modal representation: three suitable audio representations are generated from the original one-dimensional speech signal for subsequent input of different CNN models.
Step S1 includes the following steps:
s1.1, segmenting the one-dimensional raw speech signal waveform into segments as input to the one-dimensional convolutional neural network (1D-CNN), and setting the length of each speech segment. For the best performance, the segment length is set to 625 frames; the original speech signal is down-sampled to 22 kHz and scaled to [-256, 256]. In this case the scaled data are naturally close to zero mean, so the mean does not need to be subtracted (a minimal sketch of this segmentation appears after step S1.3).
S1.2, extracting a two-dimensional Mel frequency spectrum graph from a one-dimensional original voice signal, and constructing a three-channel frequency spectrum segment similar to an RGB image as the input of a two-dimensional convolution neural network (2D-CNN);
s1.3, forming a 3D dynamic segment similar to a video by a plurality of continuous two-dimensional Mel frequency spectrum segment sequences, and performing space-time feature learning by taking the 3D dynamic segment as the input of a three-dimensional convolutional neural network (3D-CNN).
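Referring back to step S1.1, the following is a minimal sketch of the waveform segmentation, assuming that librosa is used for loading, that the stated 625-frame segment length refers to 625 raw samples, and that scaling to [-256, 256] is done by multiplying the float waveform (in [-1, 1]) by 256; none of these implementation details are fixed by the text itself.

```python
import librosa
import numpy as np

def make_1d_segments(wav_path, segment_len=625, sr=22050):
    """Split a raw speech waveform into fixed-length segments for the 1D-CNN."""
    # Load and down-sample the speech signal to 22 kHz (librosa returns floats in [-1, 1]).
    y, _ = librosa.load(wav_path, sr=sr)
    # Scale the waveform to roughly [-256, 256]; the data then stay close to zero mean.
    y = y * 256.0
    # Drop the tail that does not fill a whole segment and reshape into (num_segments, segment_len).
    n_segments = len(y) // segment_len
    segments = y[: n_segments * segment_len].reshape(n_segments, segment_len)
    return segments.astype(np.float32)
```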
In the experimental validation, we generated three-channel two-dimensional Mel-spectrogram segments of size 64×64×3 as the input of the two-dimensional convolutional neural network. Specifically, we extracted the whole log-Mel spectrogram of a speech signal in the range of 20 Hz to 8000 Hz using a bank of 64 Mel filters, with a 25 ms Hamming window and 10 ms overlap. Then, using a 64-frame window, the whole log-Mel spectrogram was divided into fixed-length segments, resulting in 64×64 static segments. Next, we computed the first and second regression coefficients of each static segment along the time axis, obtaining its first-derivative (delta) and second-derivative (delta-delta) coefficients. Finally, we generated three-channel Mel-spectrogram segments (static, first derivative and second derivative), similar to the color RGB images used in computer vision.
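A possible implementation of this three-channel Mel-spectrogram extraction is sketched below with librosa. The 64 Mel bands, the 20 Hz to 8000 Hz range, the 25 ms Hamming window and the 64-frame segments follow the text, while the 16 kHz sampling rate and the interpretation of the 10 ms overlap as a 10 ms frame shift are assumptions.

```python
import librosa
import numpy as np

def make_3channel_mel_segments(y, sr=16000, n_mels=64, seg_frames=64):
    """Build (3, 64, 64) log-Mel segments (static, delta, delta-delta) for the 2D-CNN."""
    win = int(0.025 * sr)   # 25 ms Hamming window
    hop = int(0.010 * sr)   # 10 ms frame shift (assumed)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
        window="hamming", n_mels=n_mels, fmin=20, fmax=8000)
    log_mel = librosa.power_to_db(mel)                  # static channel, shape (64, T)
    delta = librosa.feature.delta(log_mel, order=1)     # first-derivative channel
    delta2 = librosa.feature.delta(log_mel, order=2)    # second-derivative channel
    segments = []
    for start in range(0, log_mel.shape[1] - seg_frames + 1, seg_frames):
        sl = slice(start, start + seg_frames)
        # Stack the three channels like an RGB image: (3, 64, 64).
        segments.append(np.stack([log_mel[:, sl], delta[:, sl], delta2[:, sl]]))
    return np.asarray(segments, dtype=np.float32)
```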
And S2, learning the Multi-modal characteristics by adopting a Multi-depth convolution neural network model Multi-CNN.
Step S2 includes the following steps:
s2.1, performing one-dimensional original voice signal waveform modeling by adopting a one-dimensional convolution neural network: and constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model.
The stride of the convolutional layers and the max-pooling layers is set to 1. As shown in Table 1, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer. Each one-dimensional convolutional layer contains a batch normalization layer and a rectified linear unit (ReLU) activation function layer; that is, the input data are normalized before the one-dimensional convolutional neural network is trained. The output of the Softmax layer corresponds to the emotion categories of the data set used.
TABLE 1: architecture of the one-dimensional convolutional neural network (shown as an image in the original document).
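Since Table 1 is reproduced only as an image, the exact channel counts, kernel sizes and strides are not recoverable from the text; the sketch below therefore uses hypothetical filter settings (and default pooling strides rather than the stride of 1 stated above, purely to keep it compact) and only mirrors the stated structure: four 1D convolutional layers with batch normalization and ReLU, three max-pooling layers, two fully connected layers and a softmax output.

```python
import torch
import torch.nn as nn

class Emotion1DCNN(nn.Module):
    """1D-CNN over raw waveform segments; channel/kernel sizes are illustrative only."""
    def __init__(self, num_classes=7, segment_len=625):
        super().__init__()
        def block(c_in, c_out, k):
            # Each 1D convolutional layer is followed by batch normalization and ReLU.
            return nn.Sequential(nn.Conv1d(c_in, c_out, k), nn.BatchNorm1d(c_out), nn.ReLU())
        self.features = nn.Sequential(
            block(1, 32, 5), nn.MaxPool1d(2),
            block(32, 64, 5), nn.MaxPool1d(2),
            block(64, 128, 5), nn.MaxPool1d(2),
            block(128, 128, 5),
        )
        with torch.no_grad():  # infer the flattened size feeding the first FC layer
            n_flat = self.features(torch.zeros(1, 1, segment_len)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(n_flat, 256), nn.ReLU(),
            nn.Linear(256, num_classes),  # softmax is applied by the loss / at inference time
        )

    def forward(self, x):                 # x: (batch, 1, segment_len)
        h = self.features(x)
        return self.classifier(h.flatten(1))
```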
S2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning the existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel-spectrogram segments to the fixed input size of the AlexNet model. As shown in Table 2, the AlexNet deep convolutional neural network model comprises five convolutional layers, three max-pooling layers and two fully connected layers.
TABLE 2: architecture of the AlexNet deep convolutional neural network (shown as an image in the original document).
The fine tuning in step S2.2 comprises the following steps:
1) copying the whole network parameters from a pre-trained AlexNet deep convolution neural network model to initialize the used two-dimensional convolution neural network;
2) replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose sample label vector corresponds to the number of emotion classes in the data set used;
3) the used two-dimensional convolutional neural network is retrained using a standard back propagation strategy.
Fine-tuning is widely used for transfer learning in computer vision and alleviates the problem of data insufficiency.
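A sketch of this fine-tuning procedure using the torchvision implementation of AlexNet follows. Only the three steps listed above (copy the pre-trained weights, replace the softmax output layer, retrain with back-propagation) come from the text; the optimizer settings, the choice of seven emotion classes and the 224×224 input size of this particular AlexNet variant are assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_finetuned_alexnet(num_emotions):
    # 1) Initialise the 2D-CNN by copying all parameters from AlexNet pre-trained on ImageNet.
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    # 2) Replace the 1000-way output layer with one matching the number of emotion classes.
    net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_emotions)
    return net

net = build_finetuned_alexnet(num_emotions=7)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.9)

# 3) Retrain with standard back-propagation on the three-channel Mel-spectrogram segments,
#    resized beforehand to the network's expected input resolution (224x224 here).
def train_step(inputs, labels):
    optimizer.zero_grad()
    loss = criterion(net(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```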
S2.3, modeling the space-time dynamic information based on the three-dimensional convolutional neural network: and executing a space-time characteristic learning task based on the extracted video-like 3D dynamic segment, and avoiding network overfitting by adopting a dropout regularization method.
As shown in Table 3, the three-dimensional convolutional neural network comprises two three-dimensional convolutional layers, each containing a batch normalization layer and a rectified linear unit (ReLU) activation function layer, two three-dimensional max-pooling layers, two fully connected layers, and one softmax output layer.
TABLE 3: architecture of the three-dimensional convolutional neural network (shown as an image in the original document).
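As with Table 1, Table 3 is available only as an image, so the filter sizes and channel counts in the sketch below are hypothetical; it only reflects the stated layout (two 3D convolutional layers with batch normalization and ReLU, two 3D max-pooling layers, two fully connected layers with dropout regularization, and a softmax output) for clips of four consecutive 64×64 three-channel Mel segments.

```python
import torch
import torch.nn as nn

class Emotion3DCNN(nn.Module):
    """3D-CNN over video-like clips of consecutive Mel-spectrogram segments."""
    def __init__(self, num_classes=7, clip_len=4):
        super().__init__()
        self.features = nn.Sequential(
            # First 3D conv: the temporal depth of the kernel equals the clip length.
            nn.Conv3d(3, 32, kernel_size=(clip_len, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 1 * 16 * 16, 256), nn.ReLU(),
            nn.Dropout(0.5),              # dropout regularization against overfitting
            nn.Linear(256, num_classes),  # softmax applied by the loss at training time
        )

    def forward(self, x):   # x: (batch, 3, clip_len, 64, 64)
        return self.classifier(self.features(x))
```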
And S3, integrating the classification results of different CNN models by adopting a score level fusion method, and outputting a final speech emotion recognition result.
The feature representations obtained from the one-dimensional and three-dimensional convolutional neural networks capture acoustic characteristics that differ completely from those of the two-dimensional convolutional neural network, which takes the 2D time-frequency representation as input. This indicates that the emotion features learned by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks may be complementary to each other; therefore, they are integrated in a multiple convolutional neural network fusion framework, which can further improve speech emotion classification performance.
Step S3 includes the following steps:
s3.1, performing network training on the one-dimensional convolutional neural network, the two-dimensional convolutional neural network and the three-dimensional neural network, and updating network parameters, wherein the updated network parameters are shown in the following expression:
Figure BDA0002450147140000072
where W represents the weight value of the softmax layer of the network parameter theta,
Figure BDA0002450147140000073
the representation is associated with the input data aiCorresponding output of the last full connection layer (FC) layer, yiThe class label vector representing the ith segment is equal to the emotion category of the whole sentence of voice, H represents a softmax logarithmic loss function, and H is expressed by the following expression:
Figure BDA0002450147140000074
where C represents the total number of emotion categories.
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e. averaging the classification results of all segments within one utterance, so as to generate utterance-level emotion classification scores;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models (1D-CNN, 2D-CNN, 3D-CNN) with a score-level fusion strategy to perform the final emotion classification, which can be expressed as:
score_fusion = λ_1·score_1D + λ_2·score_2D + λ_3·score_3D,
λ_1 + λ_2 + λ_3 = 1;
wherein λ_1, λ_2 and λ_3 represent the weight values of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively; their optimal values are determined by searching over the range [0, 1] in steps of 0.1.
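A sketch of the score-level fusion and of the 0.1-step grid search over the weights is given below; it assumes that the utterance-level class-score matrices of the three networks and the reference labels of a validation set are already available as NumPy arrays.

```python
import itertools
import numpy as np

def fuse_scores(score_1d, score_2d, score_3d, weights):
    """Weighted sum of utterance-level class-score matrices, shape (n_utterances, n_classes)."""
    l1, l2, l3 = weights
    return l1 * score_1d + l2 * score_2d + l3 * score_3d

def search_fusion_weights(score_1d, score_2d, score_3d, labels):
    """Grid-search lambda_1, lambda_2, lambda_3 over [0, 1] in steps of 0.1, summing to 1."""
    best_acc, best_w = -1.0, None
    for l1, l2 in itertools.product(np.arange(0.0, 1.01, 0.1), repeat=2):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue  # skip combinations whose weights exceed 1
        w = (round(l1, 1), round(l2, 1), round(max(l3, 0.0), 1))
        pred = fuse_scores(score_1d, score_2d, score_3d, w).argmax(axis=1)
        acc = (pred == labels).mean()
        if acc > best_acc:
            best_acc, best_w = acc, w
    return best_w, best_acc
```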
To verify the effectiveness of the proposed method for natural speech emotion recognition, we performed experiments on two challenging natural emotion speech data sets, AFEW5.0 and BAUM-1s; no acted emotion speech data set was used.
The AFEW5.0 dataset contains 7 emotion categories, namely anger, happiness, sadness, disgust, surprise, fear, and neutral; three annotators were invited to annotate these emotions. The AFEW5.0 dataset is divided into three parts: a training set (Train, 723 samples), a validation set (Val, 383 samples) and a test set (Test, 539 samples). We did not use the test set because access to it is granted only to researchers participating in the corresponding competition.
BAUM-1s contains not only 6 basic emotion categories, namely anger, happiness, sadness, disgust, fear and surprise, but also other mental states such as uncertainty, thinking, concentration and confusion. Here we focus only on recognizing the 6 basic emotion categories, resulting in a subset of 521 emotional video samples.
A. Experimental settings
For the training of the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, the mini-batch size of the input data is 30, the maximum number of epochs is 300, and the learning rate is 0.001. To accelerate the training of the convolutional neural networks, an NVIDIA GTX TITAN X GPU with 12 GB of memory is used, and natural speech emotion recognition is performed with a speaker-independent cross-validation strategy, which matches real application scenarios.
During training, the experiments used the original training set (Train) of the AFEW5.0 dataset for training and its validation set (Val) as the test set. On the BAUM-1s dataset, which contains 31 Turkish speakers, a leave-one-speakers-group-out (LOSGO) cross-validation strategy with 5 speaker groups was employed, and the average recognition accuracy over the five tests is reported.
Note that in the experiments we divided the Mel spectrogram of each whole utterance extracted from the audio samples into a number of Mel-spectrogram segments and performed segment-level feature learning with the convolutional neural networks. In this case, the emotion label of each Mel-spectrogram segment was set to the utterance-level emotion label of the whole sentence.
B. Network training
Network training is carried out on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, and the network parameters are updated according to the following expression:
(W*, θ*) = argmin_(W,θ) Σ_(i=1..N) H(y_i, softmax(W·γ(a_i; θ)));
where W represents the weight values of the softmax layer among the network parameters θ, N is the number of training segments, γ(a_i; θ) represents the output of the last fully connected (FC) layer for the input segment a_i, and y_i represents the class label vector of the i-th segment, which is equal to the utterance-level emotion label. H represents the softmax logarithmic loss function:
H(y_i, p_i) = −Σ_(c=1..C) y_(i,c)·log p_(i,c);
where p_i = softmax(W·γ(a_i; θ)) is the predicted class-probability vector and C represents the total number of emotion categories.
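A minimal training-step sketch corresponding to this objective is given below, using PyTorch's cross-entropy loss, which combines the softmax and the logarithmic loss H. The mini-batch size of 30 and the learning rate of 0.001 follow the experimental settings above; the data-loading side, in which every segment inherits the emotion label of its whole utterance, and the use of SGD with momentum are assumptions.

```python
import torch
import torch.nn as nn

def train_cnn(model, segment_loader, num_epochs=300, lr=1e-3, device="cuda"):
    """Train one of the 1D/2D/3D CNNs on segments labelled with their utterance's emotion."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()          # softmax + logarithmic loss H
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(num_epochs):
        for segments, utterance_labels in segment_loader:    # mini-batches of size 30
            segments = segments.to(device)
            utterance_labels = utterance_labels.to(device)   # each segment carries its utterance label
            optimizer.zero_grad()
            loss = criterion(model(segments), utterance_labels)
            loss.backward()                                  # back-propagate through W and theta
            optimizer.step()
    return model
```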
C. Results and analysis
The length of the sample segments input to the one-dimensional convolutional neural network (1D-CNN) can have a significant impact on the performance of the 1D-CNN network. Therefore, we first studied the performance of different sample segment lengths as input to the 1D-CNN, i.e. we tested four different sample segment lengths (125, 625, 3125 and 15625 frames) together with the corresponding numbers of convolutional layers. Table 4 shows the recognition performance of the four sample segment lengths and the associated numbers of convolutional layers. Note that each convolutional layer is followed by a max-pooling layer, except the last convolutional layer, which acts as a fully connected layer.
TABLE 4: recognition performance of different sample segment lengths input to the 1D-CNN (shown as an image in the original document).
As shown in Table 4, the 625-frame sample segment length performed best among the four lengths on both the AFEW5.0 and BAUM-1s datasets. Specifically, our method achieved an accuracy of 24.02% on the AFEW5.0 dataset and 37.37% on the BAUM-1s dataset. A larger sample segment length helps improve performance; however, a segment length that is too large does not necessarily do so, possibly because it reduces the number of samples available for training the one-dimensional convolutional neural network. Therefore, the performance of the one-dimensional convolutional neural network does not always improve as the segment length increases, and we set its sample segment length to 625 frames.
Since the extracted three-channel Mel-spectrogram segments resemble RGB images and are used as the input of the 2D-CNN network, it is feasible to fine-tune existing deep models pre-trained on ImageNet data. To evaluate the fine-tuning effect of different pre-trained deep network models, we compared the fine-tuning recognition performance of three typical deep network models, namely AlexNet, VGG-16 and ResNet-50, on the target emotion data sets. The recognition results of these deep network models were obtained by averaging the classification scores of all segments and then taking the maximum.
Table 5 shows the fine-tuning recognition results of the three typical deep network models (AlexNet, VGG-16 and ResNet-50). As can be seen from Table 5, AlexNet performs slightly better than VGG-16 and ResNet-50: on the AFEW5.0 database the accuracy of AlexNet is 29.24%, compared with 28.16% for VGG-16 and 28.55% for ResNet-50, while on the BAUM-1s database AlexNet, VGG-16 and ResNet-50 achieve 42.29%, 41.73% and 41.97%, respectively. This indicates that the deeper network models such as VGG-16 and ResNet-50 do not bring a significant performance improvement over the shallower AlexNet, probably because the emotion data sets used are very limited, so the number of speech samples produced is not sufficient to train the deeper networks.
TABLE 5: fine-tuning recognition results of AlexNet, VGG-16 and ResNet-50 (shown as an image in the original document).
For spatio-temporal feature learning, a number of consecutive two-dimensional Mel-spectrogram segments are assembled into a video-like 3D dynamic segment that serves as the input of the three-dimensional convolutional neural network (3D-CNN). The length of the created video segment equals the number of consecutive two-dimensional Mel-spectrogram segments it contains, and this length also significantly affects the performance of the 3D-CNN network.
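A sketch of how such video-like clips might be assembled from the three-channel Mel segments produced earlier is given below; the non-overlapping grouping of consecutive segments used here is an assumption.

```python
import numpy as np

def make_3d_clips(segments, clip_len=4):
    """Group consecutive (3, 64, 64) Mel segments into clips of shape (3, clip_len, 64, 64)."""
    n_clips = len(segments) // clip_len
    clips = []
    for i in range(n_clips):
        group = segments[i * clip_len:(i + 1) * clip_len]   # clip_len consecutive segments
        # Move the channel axis first and stack time as the depth dimension.
        clips.append(np.stack(group, axis=1))               # result: (3, clip_len, 64, 64)
    return np.asarray(clips, dtype=np.float32)
```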
To evaluate the performance of different video segment lengths as input to the 3D-CNN network, we report the recognition results for 4 different video segment lengths (4, 6, 8 and 10 Mel-spectrogram segments). For these video segment lengths, the three-dimensional convolutional neural network has the same network structure except for the first convolutional layer: in the first convolutional layer (Conv1), the temporal depth of the three-dimensional filter (i.e. the number of consecutive Mel-spectrogram segments it spans) equals the corresponding video segment length. Table 6 gives the performance of the four video segment lengths and the corresponding temporal depth of the three-dimensional filter of the first convolutional layer.
TABLE 6: recognition performance of different video segment lengths input to the 3D-CNN, together with the first-layer 3D filter depth (shown as an image in the original document).
As can be seen from Table 6, the video segment length containing 4 consecutive Mel-spectrogram segments achieves the best performance, with accuracies of 28.46% on the AFEW5.0 database and 37.97% on the BAUM-1s database. As the video segment length increases, the performance of the three-dimensional convolutional neural network degrades, possibly because the number of video segments available for training decreases as the segment length increases.
In the experiments, we propose and compare two fusion methods for the multiple deep convolutional neural networks: feature-level fusion and score-level fusion. For feature-level fusion, we first extract an utterance-level speech feature for each convolutional neural network by average-pooling the segment features represented by the output of its last fully connected layer. The three utterance-level features from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are then directly concatenated into a 5376-dimensional feature vector, and finally a linear support vector machine (SVM) is used for the final emotion classification.
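A sketch of this feature-level fusion baseline follows, assuming that the per-segment FC-layer activations of each utterance are already available as arrays; the concatenated dimensionality (5376-D in the text) depends on the FC layer sizes of the three networks, and the SVM regularization constant is a placeholder.

```python
import numpy as np
from sklearn.svm import LinearSVC

def utterance_feature(segment_fc_outputs):
    """Average-pool the last-FC-layer features of all segments of one utterance."""
    return np.mean(segment_fc_outputs, axis=0)

def feature_level_fusion(feats_1d, feats_2d, feats_3d, labels):
    """Concatenate utterance-level features of the three CNNs and train a linear SVM."""
    fused = np.concatenate([feats_1d, feats_2d, feats_3d], axis=1)  # e.g. (n_utterances, 5376)
    clf = LinearSVC(C=1.0)
    clf.fit(fused, labels)
    return clf
```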
Table 7 lists the recognition results of the different fusion methods of the multiple deep convolutional neural networks and of the single convolutional neural network that achieves the best performance. For the AFEW5.0 database the optimal weight values are 0.3, 0.4; for the BAUM-1s database the optimal weight values are 0.2, 0.5, 0.3. From the results in Table 7 it can be seen that:
1) The two-dimensional convolutional neural network (2D-CNN) performs best, followed by the three-dimensional convolutional neural network (3D-CNN) and the one-dimensional convolutional neural network (1D-CNN). This shows that fine-tuning the existing ImageNet-pre-trained deep network model AlexNet with the generated two-dimensional Mel-spectrogram segments, which resemble RGB images, is effective and relieves the pressure that the scarcity of emotion data places on deep neural network training.
2) Score-level fusion performs better than feature-level fusion. This shows that score-level fusion is more suitable for fusing multiple deep neural networks.
3) Compared with the single one-dimensional, two-dimensional and three-dimensional convolutional neural networks, fusing the multiple convolutional neural networks at either the feature level or the score level achieves better performance. This shows that the multi-modal deep features learned by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are complementary, so integrating them in a multiple deep convolutional neural network fusion framework yields significantly improved emotion classification performance.
TABLE 7: recognition results of the different multiple-CNN fusion methods and of the best single convolutional neural network (shown as an image in the original document).
To show the recognition accuracy of each emotion, FIG. 2 and FIG. 3 present the confusion matrices of the recognition results. The score-level fusion method achieves recognition accuracies of 35.77% and 44.06% on the two data sets, respectively.
As shown in FIG. 2, the accuracies of the three emotions "angry", "neutral" and "fear" on the AFEW5.0 database are 56.25%, 50.79% and 43.48%, respectively, while the classification accuracies of the other four emotions, namely "disgust", "happy", "sad" and "surprise", are below 33%.
As can be seen from FIG. 3, on the BAUM-1s database the two emotions "sad" and "happy" are recognized with accuracies of 70.90% and 55.49%, respectively, while the accuracies of the other four emotions, namely "anger", "fear", "disgust" and "surprise", are below 29%.
In the scheme of the invention, the deep multi-modal features learned by a multiple deep convolutional neural network model (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), are complementary to a certain degree and are used for natural speech emotion recognition. This solves the technical problem that the prior art cannot capture the dynamic change information in the 2D time-frequency feature representation between consecutive frames within an utterance. By fusing the multiple deep convolutional neural networks to learn complementary deep multi-modal features, the emotion classification performance is significantly improved and highly discriminative features are provided for natural speech emotion recognition.

Claims (10)

1. A natural speech emotion recognition method based on multi-mode deep feature learning is characterized by comprising the following steps:
s1, generating an appropriate multi-modal representation: generating three appropriate audio representations from the original one-dimensional speech signal for input of subsequent different CNN models;
s2, learning multi-modal characteristics by adopting a multi-depth convolution neural network model;
and S3, integrating the classification results of different CNN models by adopting a score level fusion method, and outputting the final speech emotion recognition result.
2. The method for recognizing natural speech emotion based on multi-mode depth feature learning of claim 1, wherein said step S1 includes the following steps:
s1.1, segmenting a one-dimensional original voice signal waveform into segments, inputting the segments into a one-dimensional convolution neural network, and setting the length of a voice segment;
s1.2, extracting a two-dimensional Mel frequency spectrogram from a one-dimensional original voice signal, and constructing a three-channel frequency spectrum segment similar to an RGB image as the input of a two-dimensional convolution neural network;
s1.3, forming a 3D dynamic segment similar to a video by a plurality of continuous two-dimensional Mel frequency spectrum segment sequences, and performing space-time feature learning by taking the 3D dynamic segment as the input of a three-dimensional convolution neural network.
3. The method for recognizing natural speech emotion based on multi-mode depth feature learning of claim 1, wherein said step S2 includes the following steps:
s2.1, performing one-dimensional original voice signal waveform modeling by adopting a one-dimensional convolution neural network: constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model;
s2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the size of the generated three-channel Mel-spectrogram segments;
s2.3, modeling the space-time dynamic information based on the three-dimensional convolutional neural network: and executing a space-time characteristic learning task based on the extracted video-like 3D dynamic segment, and avoiding network overfitting by adopting a dropout regularization method.
4. The method according to claim 3, wherein the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer, and each one-dimensional convolutional layer comprises a batch normalization layer and a rectified linear unit (ReLU) activation function layer, namely, the input data are normalized before the one-dimensional convolutional neural network is trained.
5. The method according to claim 3, wherein the AlexNet deep convolutional neural network model comprises five convolutional layers, three maximum pooling layers and two full-connected layers.
6. The method for natural speech emotion recognition based on multimode deep feature learning as claimed in claim 3, wherein the fine tuning in step S2.2 comprises the following steps:
1) copying the whole network parameters from a pre-trained AlexNet deep convolution neural network model to initialize the used two-dimensional convolution neural network;
2) replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose sample label vector corresponds to the number of emotion classes in the data set used;
3) the used two-dimensional convolutional neural network is retrained using a standard back propagation strategy.
7. The method according to claim 1, wherein the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, two three-dimensional max-pooling layers, two fully connected layers and one softmax output layer, and each three-dimensional convolutional layer comprises a batch normalization layer and a rectified linear unit (ReLU) activation function layer.
8. The method for recognizing natural speech emotion based on multi-mode depth feature learning of claim 1, wherein said step S3 includes the following steps:
s3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, and updating the network parameters;
s3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e. averaging the classification results of all segments within one utterance, so as to generate utterance-level emotion classification scores;
s3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
and S3.4, combining classification score results obtained by different convolutional neural network models on the whole sentence voice level by utilizing a score level fusion strategy so as to carry out final emotion classification.
9. The method for recognizing natural speech emotion based on multi-mode deep feature learning as claimed in claim 8, wherein said step S3.4 can be expressed as:
score_fusion = λ_1·score_1D + λ_2·score_2D + λ_3·score_3D,
λ_1 + λ_2 + λ_3 = 1;
wherein λ_1, λ_2 and λ_3 represent the weight values of the different classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, and the optimal values of λ_1, λ_2 and λ_3 are determined by searching over the range [0, 1] in steps of 0.1.
10. The method for natural speech emotion recognition based on multimode deep feature learning of claim 8, wherein the step S3.1 of updating network parameters is represented by the following expression:
(W*, θ*) = argmin_(W,θ) Σ_(i=1..N) H(y_i, softmax(W·γ(a_i; θ)));
wherein W represents the weight values of the softmax layer among the network parameters θ, N is the number of training segments, γ(a_i; θ) represents the output of the last fully connected (FC) layer for the input segment a_i, and y_i represents the class label vector of the i-th segment, which is equal to the emotion category of the whole utterance; H represents the softmax logarithmic loss function, expressed as:
H(y_i, p_i) = −Σ_(c=1..C) y_(i,c)·log p_(i,c);
wherein p_i = softmax(W·γ(a_i; θ)) is the predicted class-probability vector and C represents the total number of emotion categories.
CN202010290317.9A 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning Active CN111583964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010290317.9A CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010290317.9A CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Publications (2)

Publication Number Publication Date
CN111583964A true CN111583964A (en) 2020-08-25
CN111583964B CN111583964B (en) 2023-07-21

Family

ID=72126539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010290317.9A Active CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Country Status (1)

Country Link
CN (1) CN111583964B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347910A (en) * 2020-11-05 2021-02-09 中国电子科技集团公司第二十九研究所 Signal fingerprint identification method based on multi-mode deep learning
CN113116300A (en) * 2021-03-12 2021-07-16 复旦大学 Physiological signal classification method based on model fusion
CN113409824A (en) * 2021-07-06 2021-09-17 青岛洞听智能科技有限公司 Speech emotion recognition method
CN113780107A (en) * 2021-08-24 2021-12-10 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model
CN113903362A (en) * 2021-08-26 2022-01-07 电子科技大学 Speech emotion recognition method based on neural network
CN114612810A (en) * 2020-11-23 2022-06-10 山东大卫国际建筑设计有限公司 Dynamic self-adaptive abnormal posture recognition method and device
CN114726802A (en) * 2022-03-31 2022-07-08 山东省计算中心(国家超级计算济南中心) Network traffic identification method and device based on different data dimensions
CN115195757A (en) * 2022-09-07 2022-10-18 郑州轻工业大学 Electric bus starting driving behavior modeling and recognition training method
CN117373491A (en) * 2023-12-07 2024-01-09 天津师范大学 Method and device for dynamically extracting voice emotion characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JAEBOK KIM 等: "Learning spectro-temporal features with 3D CNNs for speech emotion recognition", 《ARXIV:1708.05071V1》 *
JIANFENG ZHAO 等: "Learning deep features to recognise speech emotion using merged deep CNN", 《IET SIGNAL PROCESSING》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347910B (en) * 2020-11-05 2022-05-31 中国电子科技集团公司第二十九研究所 Signal fingerprint identification method based on multi-mode deep learning
CN112347910A (en) * 2020-11-05 2021-02-09 中国电子科技集团公司第二十九研究所 Signal fingerprint identification method based on multi-mode deep learning
CN114612810B (en) * 2020-11-23 2023-04-07 山东大卫国际建筑设计有限公司 Dynamic self-adaptive abnormal posture recognition method and device
CN114612810A (en) * 2020-11-23 2022-06-10 山东大卫国际建筑设计有限公司 Dynamic self-adaptive abnormal posture recognition method and device
CN113116300A (en) * 2021-03-12 2021-07-16 复旦大学 Physiological signal classification method based on model fusion
CN113409824A (en) * 2021-07-06 2021-09-17 青岛洞听智能科技有限公司 Speech emotion recognition method
CN113780107A (en) * 2021-08-24 2021-12-10 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model
CN113780107B (en) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model
CN113903362A (en) * 2021-08-26 2022-01-07 电子科技大学 Speech emotion recognition method based on neural network
CN113903362B (en) * 2021-08-26 2023-07-21 电子科技大学 Voice emotion recognition method based on neural network
CN114726802A (en) * 2022-03-31 2022-07-08 山东省计算中心(国家超级计算济南中心) Network traffic identification method and device based on different data dimensions
CN115195757A (en) * 2022-09-07 2022-10-18 郑州轻工业大学 Electric bus starting driving behavior modeling and recognition training method
CN115195757B (en) * 2022-09-07 2023-08-04 郑州轻工业大学 Electric bus starting driving behavior modeling and recognition training method
CN117373491A (en) * 2023-12-07 2024-01-09 天津师范大学 Method and device for dynamically extracting voice emotion characteristics
CN117373491B (en) * 2023-12-07 2024-02-06 天津师范大学 Method and device for dynamically extracting voice emotion characteristics

Also Published As

Publication number Publication date
CN111583964B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
Er A novel approach for classification of speech emotions based on deep and acoustic features
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Zhang et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching
Sun End-to-end speech emotion recognition with gender information
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
Schuller et al. Emotion recognition in the noise applying large acoustic feature sets
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN108717856A (en) A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN107972028B (en) Man-machine interaction method and device and electronic equipment
CN108269133A (en) A kind of combination human bioequivalence and the intelligent advertisement push method and terminal of speech recognition
CN103996155A (en) Intelligent interaction and psychological comfort robot service system
CN110534133B (en) Voice emotion recognition system and voice emotion recognition method
Dhuheir et al. Emotion recognition for healthcare surveillance systems using neural networks: A survey
Xu et al. Multi-type features separating fusion learning for Speech Emotion Recognition
Cornejo et al. Audio-visual emotion recognition using a hybrid deep convolutional neural network based on census transform
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
Zaferani et al. Automatic personality traits perception using asymmetric auto-encoder
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Naveenkumar et al. Audio based emotion detection and recognizing tool using mel frequency based cepstral coefficient
Ullah et al. Speech emotion recognition using deep neural networks
Li Robotic emotion recognition using two-level features fusion in audio signals of speech
Kumar et al. Machine learning technique-based emotion classification using speech signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant