CN111583964B - Natural voice emotion recognition method based on multimode deep feature learning - Google Patents

Natural voice emotion recognition method based on multimode deep feature learning

Info

Publication number
CN111583964B
CN111583964B (application CN202010290317.9A)
Authority
CN
China
Prior art keywords
neural network
convolutional neural
dimensional
dimensional convolutional
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010290317.9A
Other languages
Chinese (zh)
Other versions
CN111583964A (en
Inventor
张石清
赵小明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taizhou University
Original Assignee
Taizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taizhou University filed Critical Taizhou University
Priority to CN202010290317.9A priority Critical patent/CN111583964B/en
Publication of CN111583964A publication Critical patent/CN111583964A/en
Application granted granted Critical
Publication of CN111583964B publication Critical patent/CN111583964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a natural speech emotion recognition method based on multimodal deep feature learning, which comprises the following steps: S1, generating suitable multi-modal representations: three suitable audio representations are generated from the original one-dimensional speech signal as inputs for the different subsequent CNN models; S2, learning multi-modal features with a multiple deep convolutional neural network model; S3, integrating the classification results of the different CNN models with a score-level fusion method and outputting the final speech emotion recognition result. By fusing multiple deep convolutional neural networks, the invention learns complementary deep multi-modal features, thereby significantly improving emotion classification performance and providing well-discriminating features for natural speech emotion recognition.

Description

Natural voice emotion recognition method based on multimode deep feature learning
Technical Field
The invention relates to the technical field of voice signal processing and pattern recognition, in particular to a natural voice emotion recognition method based on multimode deep feature learning.
Background
In recent years, natural speech emotion recognition has been an active and challenging research topic in pattern recognition, speech signal processing, artificial intelligence, and related fields. Unlike conventional input devices, natural speech emotion recognition aims to provide intelligent emotion services, usable in voice call centers, healthcare, and affective computing, through direct voice interaction with a computer.
Currently, in the field of speech emotion recognition, a large amount of early work was done on acted (simulated) emotion, because creating a simulated emotion database is much easier than collecting natural emotion. In recent years, research on natural speech emotion recognition in real environments has attracted growing attention from researchers, because it is closer to real applications and more difficult than recognizing simulated emotion.
Speech emotion feature extraction is a key step in speech emotion recognition; it aims to extract feature parameters from emotional speech signals that reflect the speaker's emotional expression. Currently, a large body of speech emotion recognition literature relies on manually designed features, such as prosodic features (fundamental frequency, amplitude, pronunciation duration), timbre features (formants, spectral energy distribution, harmonic-to-noise ratio), and spectral features (Mel-frequency cepstral coefficients (MFCCs), linear prediction coefficients (LPCs), and linear prediction cepstral coefficients (LPCCs)). However, these hand-crafted speech emotion features are low-level features and suffer from a "semantic gap" with the emotion labels understood by humans, so it is necessary to develop high-level speech emotion feature extraction methods.
To address this problem, newly emerging deep learning techniques may provide clues. Owing to their deeper architectures, deep learning techniques generally have advantages over traditional approaches, including the ability to automatically detect complex structures and features without manual feature engineering.
Various representative deep learning techniques, such as deep neural networks (DNNs), deep convolutional neural networks (CNNs), and recurrent neural networks based on long short-term memory (LSTM-RNNs), have so far been used for speech emotion recognition.
For example, a "speech emotion recognition method based on a multi-scale deep convolutional recurrent neural network" (bulletin number CN108717856 a) disclosed in chinese patent literature combines a deep Convolutional Neural Network (CNN) with a long-short-time memory network (LSTM), and considers the characteristic that two-dimensional (2D) speech spectrum fragment information with different lengths have different discriminatory power for different emotion types, a multi-scale cnn+lstm hybrid deep learning model is provided and applied to natural speech emotion recognition in an actual environment, but the speech emotion recognition method using 2D speech spectrum fragment information as CNN input cannot capture dynamic variation information represented by features in 2D time-frequency (time-frequency) between consecutive frames in a sentence of speech, so that feature parameters with good discriminatory power cannot be provided for natural speech emotion recognition. Although LSTM-RNN can be used for modeling of time information, time information is over emphasized.
Disclosure of Invention
The invention provides a natural speech emotion recognition method based on multi-modal deep feature learning, which aims to overcome the defect of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured, so that feature parameters with good discriminative power cannot be provided for natural speech emotion recognition.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a natural speech emotion recognition method based on multi-mode deep feature learning, the method comprising the steps of:
S1, generating suitable multi-modal representations: three suitable audio representations are generated from the original one-dimensional speech signal as inputs for the different subsequent CNN models;
S2, learning multi-modal features by adopting a multiple deep convolutional neural network model;
S3, integrating the classification results of the different CNN models by adopting a score-level fusion method, and outputting the final speech emotion recognition result.
The scheme of the invention exploits the observation that the deep multi-modal features learned by a multiple deep convolutional neural network model (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), are to some extent complementary, and uses them for natural speech emotion recognition. It solves the technical problem of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured. By fusing multiple deep convolutional neural networks to learn complementary deep multi-modal features, emotion classification performance is significantly improved and well-discriminating features are provided for natural speech emotion recognition.
Preferably, the step S1 includes the steps of:
s1.1, dividing a one-dimensional original voice signal waveform into segments, inputting the segments into a one-dimensional convolutional neural network, and setting the length of the voice segments;
s1.2, extracting a two-dimensional Mel spectrogram from a one-dimensional original voice signal, and constructing three-channel spectral fragments similar to RGB images as input of a two-dimensional convolutional neural network;
s1.3, forming a plurality of continuous two-dimensional Mel spectrum segment sequences into a 3D dynamic segment similar to a video, and performing space-time feature learning by taking the 3D dynamic segment as the input of a three-dimensional convolutional neural network.
Preferably, the step S2 includes the steps of:
s2.1, modeling a one-dimensional original voice signal waveform by adopting a one-dimensional convolutional neural network: constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model;
s2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel-spectrum segments to the model's fixed input size;
s2.3, modeling space-time dynamic information based on a three-dimensional convolutional neural network: based on the extracted 3D dynamic segments of the class video, a space-time feature learning task is executed, and a dropout regularization method is adopted to avoid network overfitting.
Preferably, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer, where each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer; that is, the input data are normalized before the one-dimensional convolutional neural network is trained.
Preferably, the AlexNet deep convolutional neural network model comprises five convolutional layers, three max pooling layers and two fully connected layers.
Preferably, the fine tuning in step S2.2 includes the steps of:
1) Copying the whole network parameters from a pre-trained AlexNet deep convolutional neural network model to initialize a two-dimensional convolutional neural network;
2) Replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion categories in the dataset;
3) The used two-dimensional convolutional neural network is retrained using a standard back-propagation strategy.
Fine tuning is widely used for transfer learning in computer vision and alleviates the problem of insufficient data.
Preferably, the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, two three-dimensional max-pooling layers, two fully connected layers and one softmax output layer, where each three-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit activation layer.
Preferably, the step S3 includes the steps of:
S3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters;
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e., averaging the classification results of all segments within one utterance, so as to generate emotion classification scores at the whole-utterance level;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models with a score-level fusion strategy for the final emotion classification.
Preferably, step S3.4 can be expressed as:
score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D, with λ1 + λ2 + λ3 = 1;
where λ1, λ2 and λ3 denote the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and their optimal values are determined by a grid search over [0, 1] in steps of 0.1.
Preferably, the network parameters in step S3.1 are updated according to:
θ*, W* = argmin Σ_i H(softmax(W·a_i^θ), y_i);
where W denotes the weight matrix of the softmax layer on top of the network parameters θ, a_i^θ denotes the output of the last fully connected (FC) layer for the input segment a_i, and y_i is the class label vector of the i-th segment, set equal to the emotion label of the whole utterance; H denotes the softmax logarithmic loss function, H(p, y) = -Σ_{c=1}^{C} y_c·log(p_c), where C is the total number of emotion categories.
The beneficial effects of the invention are as follows: it solves the technical problems of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured, so that feature parameters with good discriminative power cannot be provided for natural speech emotion recognition, and that temporal information is over-emphasized. By fusing multiple deep convolutional neural networks to learn complementary deep multi-modal features, emotion classification performance is significantly improved and well-discriminating features are provided for natural speech emotion recognition.
Drawings
Fig. 1 is a block diagram of the overall architecture of the present invention.
FIG. 2 is a confusion matrix of the recognition results of the score-level fusion of the present invention on the AFEW5.0 dataset.
Fig. 3 is a confusion matrix of the recognition results of the score-level fusion of the present invention on the BAUM-1s dataset.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Example 1: the natural voice emotion recognition method based on multimode deep feature learning of the embodiment, as shown in fig. 1, comprises the following steps:
s1, generating a proper multi-modal representation: three suitable audio representations are generated from the original one-dimensional speech signal for subsequent input of different CNN models.
Step S1 comprises the steps of:
S1.1, dividing the one-dimensional raw speech waveform into segments and inputting them into a one-dimensional convolutional neural network (1D-CNN), with the segment length set in advance; for best performance, the segment length is set to 625 frames, and the raw speech signal is sampled at 22 kHz and scaled to [-256, 256]. In this case, the scaled data are naturally close to zero mean, so no mean subtraction is needed.
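As a concrete illustration of S1.1, the segmentation can be sketched as follows; this is a minimal sketch assuming librosa for audio loading, and the helper name, the peak-based scaling into [-256, 256], and the treatment of the 625-frame segment length as a raw-sample count are illustrative assumptions rather than the patented procedure itself:

```python
import numpy as np
import librosa

def waveform_segments(path, seg_len=625, sr=22050):
    """Load speech at 22 kHz, scale it to [-256, 256], and cut it into
    fixed-length segments for the 1D-CNN (hypothetical helper)."""
    y, _ = librosa.load(path, sr=sr)            # one-dimensional raw waveform
    y = y / (np.max(np.abs(y)) + 1e-8) * 256.0  # scale into [-256, 256]
    n = len(y) // seg_len                       # number of whole segments
    return y[: n * seg_len].reshape(n, seg_len)
```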
S1.2, extracting a two-dimensional Mel spectrogram from a one-dimensional original voice signal, and constructing three-channel spectrum segments similar to RGB images as input of a two-dimensional convolutional neural network (2D-CNN);
s1.3, forming a plurality of continuous two-dimensional Mel spectrum segment sequences into a 3D dynamic segment similar to a video, and performing space-time feature learning by using the 3D dynamic segment as the input of a three-dimensional convolutional neural network (3D-CNN).
In the experimental verification, we generate three-channel two-dimensional Mel-spectrogram segments of size 64×64×3 as the input of the two-dimensional convolutional neural network. Specifically, we extract the full logarithmic Mel spectrogram of the speech signal in the range 20 Hz to 8000 Hz using a bank of 64 Mel filters, with a 25 ms Hamming window and a 10 ms shift. The full logarithmic Mel spectrogram is then split into fixed-length segments using a 64-frame window, yielding 64×64 static segments. Next, we compute the first and second regression coefficients of each static segment along the time axis, producing its first-derivative (delta) and second-derivative (delta-delta) coefficients. Finally, similar to a color RGB image in computer vision, we can generate three-channel (static, first-derivative and second-derivative) Mel-spectrogram segments.
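The three-channel Mel-spectrogram extraction described above can be sketched as follows, assuming librosa; the function name, the non-overlapping 64-frame segmentation step, and reading the stated "10 ms overlap" as a 10 ms frame shift are assumptions:

```python
import numpy as np
import librosa

def mel_segments(y, sr=22050, n_mels=64, seg_frames=64):
    """Build RGB-like static / delta / delta-delta Mel-spectrogram segments
    of shape (64, 64, 3) from a one-dimensional speech signal y."""
    log_mel = librosa.power_to_db(librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, fmin=20, fmax=8000,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), window="hamming"))
    d1 = librosa.feature.delta(log_mel, order=1)   # first derivative (delta)
    d2 = librosa.feature.delta(log_mel, order=2)   # second derivative (delta-delta)
    segments = []
    for t in range(0, log_mel.shape[1] - seg_frames + 1, seg_frames):
        segments.append(np.stack(
            [m[:, t:t + seg_frames] for m in (log_mel, d1, d2)], axis=-1))
    return np.array(segments)                      # (n_segments, 64, 64, 3)
```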
S2, learning multi-modal features by adopting a multiple deep convolutional neural network model (Multi-CNN).
Step S2 comprises the steps of:
s2.1, modeling a one-dimensional original voice signal waveform by adopting a one-dimensional convolutional neural network: and constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model.
The stride of the convolutional layers and of the max-pooling layers is set to 1. As shown in Table 1, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer. Each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer, i.e., the input data are normalized before the one-dimensional convolutional neural network is trained, and the output of the Softmax layer corresponds to all emotion classes of the dataset used.
TABLE 1
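Since the exact channel counts and kernel sizes of Table 1 are not reproduced in this text, the following PyTorch sketch uses assumed filter sizes; only the layer layout (four 1D convolutional layers with batch normalization and ReLU, three max-pooling layers, two fully connected layers, and a softmax output handled by the loss) follows the description above:

```python
import torch
import torch.nn as nn

class Emotion1DCNN(nn.Module):
    """1D-CNN over raw-waveform segments; filter sizes are assumptions."""
    def __init__(self, num_classes=7, seg_len=625):
        super().__init__()
        def block(cin, cout, k, pool):
            layers = [nn.Conv1d(cin, cout, k), nn.BatchNorm1d(cout), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool1d(2))
            return layers
        self.features = nn.Sequential(              # 4 conv layers, 3 max-pool layers
            *block(1, 32, 5, True), *block(32, 64, 5, True),
            *block(64, 128, 5, True), *block(128, 128, 5, False))
        with torch.no_grad():                        # infer the flattened feature size
            n_flat = self.features(torch.zeros(1, 1, seg_len)).numel()
        self.classifier = nn.Sequential(             # 2 fully connected layers
            nn.Linear(n_flat, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, x):                            # x: (batch, 1, seg_len)
        return self.classifier(self.features(x).flatten(1))
```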
S2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel-spectrum segments to the fixed input size of the AlexNet model; as shown in Table 2, the AlexNet deep convolutional neural network model comprises five convolutional layers, three max-pooling layers, and two fully connected layers.
TABLE 2
The fine tuning in step S2.2 comprises the steps of:
1) Copying the whole network parameters from a pre-trained AlexNet deep convolutional neural network model to initialize a two-dimensional convolutional neural network;
2) Replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion categories in the dataset;
3) The used two-dimensional convolutional neural network is retrained using a standard back-propagation strategy.
Fine tuning is widely used for transfer learning in computer vision and alleviates the problem of insufficient data.
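Steps 1) to 3) correspond to standard transfer learning with torchvision's ImageNet-pretrained AlexNet; a minimal sketch follows (the three-channel Mel segments must first be resized to AlexNet's fixed input resolution, as noted in step S2.2, and the 7-class default is only an example):

```python
import torch.nn as nn
from torchvision import models

def build_finetuned_alexnet(num_classes=7):
    """Step 1: copy all parameters from ImageNet-pretrained AlexNet.
    Step 2: replace the 1000-way output layer with one sized to the
    number of emotion categories. Step 3: retrain with back-propagation."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_classes)
    return net
```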
S2.3, modeling space-time dynamic information based on a three-dimensional convolutional neural network: based on the extracted 3D dynamic segments of the class video, a space-time feature learning task is executed, and a dropout regularization method is adopted to avoid network overfitting.
As shown in Table 3, the three-dimensional convolutional neural network comprises two three-dimensional convolutional layers, two three-dimensional max-pooling layers, two fully connected layers, and one softmax output layer; each three-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer.
TABLE 3
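As with Table 1, the exact filter counts of Table 3 are not reproduced here, so the sketch below only mirrors the described layout (two 3D convolutional layers with batch normalization and ReLU, two 3D max-pooling layers, two fully connected layers, dropout, and a softmax output); the first layer's temporal depth equals the number of consecutive Mel slices in a clip, while the channel sizes are assumptions:

```python
import torch
import torch.nn as nn

class Emotion3DCNN(nn.Module):
    """3D-CNN over video-like Mel-spectrogram clips; channel counts are assumptions."""
    def __init__(self, num_classes=7, clip_len=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(clip_len, 3, 3)),   # temporal depth = clip length
            nn.BatchNorm3d(32), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(1, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(), nn.MaxPool3d((1, 2, 2)))
        with torch.no_grad():                                  # infer flattened size
            n_flat = self.features(torch.zeros(1, 3, clip_len, 64, 64)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(n_flat, 256), nn.ReLU(), nn.Dropout(0.5),  # dropout against overfitting
            nn.Linear(256, num_classes))

    def forward(self, x):   # x: (batch, 3, clip_len, 64, 64)
        return self.classifier(self.features(x).flatten(1))
```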
S3, integrating different CNN model classification results by adopting a score level fusion method, and outputting a final voice emotion recognition result.
The feature representations obtained from the one-dimensional and three-dimensional convolutional neural networks capture acoustic characteristics entirely different from those of the two-dimensional convolutional neural network, whose input is a 2D time-frequency representation. This suggests that the emotion features learned from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks may be complementary to each other, so it is worthwhile to integrate them in a multiple convolutional neural network fusion network, which can further improve speech emotion classification performance.
Step S3 comprises the steps of:
S3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters according to the following expression:
θ*, W* = argmin Σ_i H(softmax(W·a_i^θ), y_i);
where W denotes the weight matrix of the softmax layer on top of the network parameters θ, a_i^θ denotes the output of the last fully connected (FC) layer for the input segment a_i, and y_i is the class label vector of the i-th segment, which is set equal to the emotion label of the whole utterance; H denotes the softmax logarithmic loss function, H(p, y) = -Σ_{c=1}^{C} y_c·log(p_c), where C is the total number of emotion categories.
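In PyTorch terms, H with segment labels inherited from the utterance is the usual softmax cross-entropy; the one-hot sketch below mirrors the formula and is equivalent to nn.CrossEntropyLoss for integer labels (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def softmax_log_loss(W, a_i, y_i):
    """H(softmax(W·a_i), y_i) for one segment: a_i is the last-FC-layer
    output, W the softmax-layer weights, y_i a one-hot emotion vector."""
    p = F.softmax(W @ a_i, dim=-1)           # class posteriors over C emotions
    return -(y_i * torch.log(p)).sum()       # softmax logarithmic loss H
```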
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e., averaging the classification results of all segments within one utterance, so as to generate emotion classification scores at the whole-utterance level (utterance-level);
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models (1D-CNN, 2D-CNN, 3D-CNN) with a score-level fusion strategy for the final emotion classification, which can be expressed as follows:
score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D, with λ1 + λ2 + λ3 = 1;
where λ1, λ2 and λ3 denote the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and their optimal values are determined by a grid search over [0, 1] in steps of 0.1.
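A sketch of the utterance-level pooling and score-level fusion of S3.2 to S3.4, including the 0.1-step grid search for the weights, is given below; the function names are illustrative and the grid search is assumed to be run on validation data:

```python
import itertools
import numpy as np

def utterance_scores(segment_scores, utt_ids):
    """Average-pool segment-level class scores into one vector per utterance."""
    utt_ids = np.asarray(utt_ids)
    return np.stack([segment_scores[utt_ids == u].mean(axis=0)
                     for u in np.unique(utt_ids)])

def fuse_scores(score_1d, score_2d, score_3d, weights):
    """score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D."""
    l1, l2, l3 = weights
    return l1 * score_1d + l2 * score_2d + l3 * score_3d

def search_weights(scores, labels, step=0.1):
    """Grid-search (λ1, λ2, λ3) over [0, 1] with λ1 + λ2 + λ3 = 1."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best, best_acc = (1.0, 0.0, 0.0), -1.0
    for l1, l2 in itertools.product(grid, grid):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue                                   # outside the weight simplex
        pred = fuse_scores(*scores, (l1, l2, l3)).argmax(axis=1)
        acc = float(np.mean(pred == labels))
        if acc > best_acc:
            best, best_acc = (l1, l2, l3), acc
    return best, best_acc
```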
To verify the effectiveness of the proposed method for natural speech emotion recognition, we used two challenging natural emotional speech datasets, AFEW5.0 and BAUM-1s; no acted (simulated) emotional speech dataset was used in the experiments.
The AFEW5.0 dataset contains 7 emotion categories, namely angry, happy, sad, disgust, surprise, fear, and neutral; three annotators were invited to annotate these emotions. The AFEW5.0 dataset is divided into three parts: a training set Train (723 samples), a validation set Val (383 samples) and a test set Test (539 samples). We did not use the Test set because access to it is granted only to researchers participating in the competition.
BAUM-1s contains not only 6 basic emotion categories, namely angry, happy, sad, disgust, fear, and surprise, but also other mental states, such as uncertainty, thinking, concentration, and confusion. Here we focus on recognizing only the 6 basic emotion categories, resulting in a subset of 521 emotional video samples.
A. Experimental setup
For training of the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, the mini-batch size of the input data was 30, the maximum number of epochs was 300, and the learning rate was 0.001. To speed up the training of the convolutional neural networks, an NVIDIA GTX TITAN X GPU with 12 GB of memory was used, and a speaker-independent cross-validation strategy was adopted to implement natural speech emotion recognition, which is the setting mainly used in real scenes.
On the AFEW5.0 dataset, the experiments use the original training set (Train) for training and the validation set (Val) as the test set. On the BAUM-1s dataset, which contains 31 Turkish speakers, a leave-one-speakers-group-out (LOSGO) cross-validation strategy with 5 speaker groups was adopted; accordingly, the average recognition accuracy over the five tests is reported on the BAUM-1s dataset.
Note that in the experiments we split the Mel spectrogram of each whole utterance extracted from the audio samples into a number of Mel-spectrogram segments, and segment-level feature learning is performed with the convolutional neural networks. In this case, the emotion label of each Mel-spectrogram segment is set to the utterance-level emotion label.
B. Network training
Network training is performed on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, and the network parameters are updated according to the following expression:
θ*, W* = argmin Σ_i H(softmax(W·a_i^θ), y_i);
where W denotes the weight matrix of the softmax layer on top of the network parameters θ, a_i^θ denotes the output of the last FC layer for the input segment a_i, and y_i is the class label vector of the i-th segment, equal to the utterance-level emotion class; H denotes the softmax logarithmic loss function, H(p, y) = -Σ_{c=1}^{C} y_c·log(p_c), where C is the total number of emotion categories.
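Putting the pieces together, a minimal segment-level training loop with the hyper-parameters reported in the experimental setup (mini-batch size 30, 300 epochs, learning rate 0.001) could look as follows; the SGD-with-momentum optimizer is an assumption, since the patent does not name the optimizer:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_segment_model(model, dataset, batch_size=30, epochs=300, lr=1e-3,
                        device="cuda" if torch.cuda.is_available() else "cpu"):
    """Segment-level training: every segment carries its utterance's emotion label."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # optimizer assumed
    loss_fn = nn.CrossEntropyLoss()       # softmax + logarithmic loss H
    model.to(device).train()
    for _ in range(epochs):
        for segments, labels in loader:
            opt.zero_grad()
            loss_fn(model(segments.to(device)), labels.to(device)).backward()
            opt.step()
    return model
```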
C. Results and analysis
The length of the sample segments input to the one-dimensional convolutional neural network (1D-CNN) can have a significant impact on the performance of the 1D-CNN network. We therefore studied the performance of different sample segment lengths as inputs to the 1D-CNN network, i.e., we tested four different sample segment lengths (125, 625, 3125, 15625 frames) together with the corresponding number of convolutional layers. Table 4 shows the recognition performance of the four sample segment lengths and the associated number of convolutional layers. Note that each convolutional layer is followed by a max-pooling layer, except that the last convolutional layer acts as a fully connected layer.
TABLE 4
As shown in Table 4, the sample segment length of 625 frames performs best among the four lengths on both the AFEW5.0 and BAUM-1s datasets. Specifically, our method achieves an accuracy of 24.02% on the AFEW5.0 dataset and 37.37% on the BAUM-1s dataset. Larger sample segment lengths help to improve performance, but overly large segment lengths do not, possibly because a larger segment length reduces the number of samples available for training the one-dimensional convolutional neural network. The performance of the one-dimensional convolutional neural network therefore does not always improve as the segment length increases, so we set the sample segment length of the one-dimensional convolutional neural network to 625 frames.
Since the extracted three-channel Mel-spectrogram segments resemble RGB images, existing deep models pre-trained on ImageNet data can be fine-tuned and used as the 2D-CNN network. To evaluate the fine-tuning effect of different pre-trained deep network models, we compared the fine-tuning recognition performance of three typical deep network models, AlexNet, VGG-16 and ResNet-50, on the target emotion datasets. The recognition results of these deep network models are obtained by averaging the classification scores of all segments and then taking the maximum.
Table 5 shows the fine-tuning recognition results of the three typical deep network models (AlexNet, VGG-16 and ResNet-50). As can be seen from Table 5, AlexNet performs slightly better than VGG-16 and ResNet-50: on the AFEW5.0 database, AlexNet reaches an accuracy of 29.24%, versus 28.16% and 28.55% for VGG-16 and ResNet-50, respectively, and on the BAUM-1s database the accuracies are 42.29%, 41.73% and 41.97%, respectively. This suggests that deeper network models such as VGG-16 and ResNet-50 bring no significant performance improvement over the shallower AlexNet, perhaps because the emotion datasets used are very limited, so the number of generated speech samples is insufficient to train deeper networks.
TABLE 5
For spatio-temporal feature learning, we compose sequences of consecutive two-dimensional Mel-spectrogram segments into video-like 3D dynamic segments as the input of the three-dimensional convolutional neural network (3D-CNN). The created video clip length equals the number of consecutive two-dimensional Mel-spectrogram segments, and it can also significantly affect the performance of the 3D-CNN network.
To evaluate the performance of different video clip lengths as inputs to the 3D-CNN network, the experiments report the recognition results of 4 different video clip lengths (4, 6, 8, 10 Mel-spectrogram segments). For these clip lengths, the three-dimensional convolutional neural network has the same network structure except for the first convolutional layer. In the first convolutional layer (Conv1), the temporal depth of the three-dimensional filter (i.e., the number of consecutive Mel-spectrogram segments stacked in series) is equal to the corresponding video clip length. Table 6 shows the performance of the four video clip lengths (i.e., 4, 6, 8, 10 Mel-spectrogram segments) together with the temporal depth of the first convolutional layer's three-dimensional filter.
TABLE 6
As can be seen from table 6, the video clip lengths containing 4 consecutive speech Mel spectral clips gave the best performance in the AFEW5.0 database and the BAUM-1s database, with accuracy of 28.46% and 37.97%, respectively. As the length of the video segments increases, the performance of the three-dimensional convolutional neural network decreases, probably because the number of video segments used to train the three-dimensional convolutional neural network decreases as the length of the video segments increases.
In the experiments, we propose and compare two fusion methods for the multiple deep convolutional neural networks: feature-level fusion and score-level fusion. For feature-level fusion, we first extract the utterance-level features of each convolutional neural network by average-pooling the segment-level features represented by the output of its last fully connected layer. We then directly concatenate the three utterance-level feature vectors from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks into a 5376-D feature vector, and finally use a linear support vector machine (SVM) for the final emotion classification.
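The feature-level baseline described above amounts to concatenating the three utterance-level feature vectors and training a linear SVM; a sketch with scikit-learn follows (the 5376-D total comes from the text above, while the individual feature dimensions and the default SVM settings are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

def feature_level_fusion(feat_1d, feat_2d, feat_3d, labels):
    """Concatenate utterance-level CNN features (5376-D in total) and
    train a linear support vector machine for the final emotion decision."""
    X = np.concatenate([feat_1d, feat_2d, feat_3d], axis=1)
    clf = LinearSVC()
    clf.fit(X, labels)
    return clf
```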
Table 7 shows the recognition results of the different multiple deep convolutional neural network fusion methods and of the best-performing single convolutional neural network. For the AFEW5.0 database, the optimal weight values are 0.3, 0.3 and 0.4; for the BAUM-1s database, the optimal weights are 0.2, 0.5 and 0.3. From the results in Table 7, it can be seen that:
1) The two-dimensional convolutional neural network (2D-CNN) performs best, followed by the three-dimensional convolutional neural network (3D-CNN) and the one-dimensional convolutional neural network (1D-CNN). This shows that using the generated two-dimensional Mel-spectrogram segments, which resemble RGB images, to fine-tune the existing ImageNet-based deep network model AlexNet is effective and relieves the pressure that insufficient emotion data puts on deep neural network training.
2) Score-level fusion performs better than feature-level fusion. This indicates that score-level fusion is more suitable for fusing multiple deep neural networks.
3) Fusing the multiple convolutional neural networks, whether at the feature level or at the score level, performs better than any single convolutional neural network (one-dimensional, two-dimensional or three-dimensional). This suggests that the multi-modal deep features learned from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are complementary, so integrating them in a multiple deep convolutional neural network fusion network significantly improves emotion classification performance.
TABLE 7
To show the recognition accuracy for each emotion, Fig. 2 and Fig. 3 present the confusion matrices of the recognition results. The score-level fusion method achieves recognition accuracies of 35.77% and 44.06% on the two datasets, respectively.
As shown in FIG. 2, on the AFEW5.0 database the accuracies of the three emotions "angry", "neutral" and "fear" are 56.25%, 50.79% and 43.48%, respectively, while the classification accuracies of the other four emotions, namely "disgust", "happy", "sad" and "surprise", are below 33%.
As can be seen from FIG. 3, on the BAUM-1s database the two emotions "sad" and "happy" are recognized with accuracies of 70.90% and 55.49%, respectively, while the other four emotions, namely "angry", "fear", "disgust" and "surprise", have accuracies below 29%.
In summary, the scheme exploits the observation that the deep multi-modal features learned by a multiple deep convolutional neural network model (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), are to some extent complementary, and uses them for natural speech emotion recognition. It solves the technical problem of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured. By fusing multiple deep convolutional neural networks to learn complementary deep multi-modal features, emotion classification performance is significantly improved and well-discriminating features are provided for natural speech emotion recognition.

Claims (4)

1. The natural voice emotion recognition method based on multimode deep feature learning is characterized by comprising the following steps of:
S1, generating suitable multi-modal representations: three audio representation forms suitable for different deep convolutional neural network structures are generated from the original one-dimensional speech signal and used as inputs of the different deep convolutional neural network models;
the step S1 includes the steps of:
S1.1, scaling the waveform of the one-dimensional original speech signal to [-256, 256], dividing it into speech segments of length 625 frames, and inputting the segments into a one-dimensional convolutional neural network;
S1.2, extracting a two-dimensional Mel spectrogram from the one-dimensional original speech signal, and constructing three-channel Mel-spectrum segments similar to RGB images, i.e., 64×64×3 Mel-spectrum segments comprising static, first-derivative and second-derivative channels, as the input of a two-dimensional convolutional neural network;
S1.3, composing a plurality of consecutive two-dimensional Mel-spectrum segments, i.e., a sequence of the three-channel 64×64×3 spectrum segments, into a video-like 3D dynamic segment, and using the 3D dynamic segment as the input of a three-dimensional convolutional neural network for spatio-temporal feature learning;
S2, learning different multi-modal deep features by adopting the multiple deep convolutional neural network models established in steps S1.1-S1.3;
the step S2 includes the steps of:
S2.1, modeling the one-dimensional original speech waveform with a one-dimensional convolutional neural network: constructing a one-dimensional convolutional neural network model and training the constructed model; the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer, where each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit activation layer, i.e., the input data are normalized before the one-dimensional convolutional neural network is trained;
S2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning, on the target data, an AlexNet deep convolutional neural network model pre-trained in the existing computer vision field; the AlexNet deep convolutional neural network model comprises five convolutional layers, three max-pooling layers and two fully connected layers;
S2.3, modeling spatio-temporal dynamic information based on a three-dimensional convolutional neural network: performing a spatio-temporal feature learning task based on the extracted video-like 3D dynamic segments, and adopting a dropout regularization method to avoid network overfitting; the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, two three-dimensional max-pooling layers, two fully connected layers and one softmax output layer, where each three-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit activation layer;
S3, integrating the classification results of the different, complementary deep convolutional neural network models with a score-level fusion method, and outputting the final speech emotion recognition result.
2. The method for recognizing natural speech emotion based on multi-mode deep feature learning according to claim 1, wherein the fine tuning in step S2.2 comprises the steps of:
1) Copying the whole network parameters from a pre-trained AlexNet deep convolutional neural network model to initialize a two-dimensional convolutional neural network;
2) Replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion categories in the dataset;
3) The used two-dimensional convolutional neural network is retrained using a standard back-propagation strategy.
3. The method for recognizing natural speech emotion based on multi-mode deep feature learning according to claim 1, wherein said step S3 comprises the steps of:
S3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters;
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e., averaging the classification results of all segments within one utterance, so as to generate emotion classification scores at the whole-utterance level;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models with a score-level fusion strategy for the final emotion classification.
4. A method of natural speech emotion recognition based on multimode deep feature learning according to claim 3, wherein step S3.4 can be expressed as:
score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D, with λ1 + λ2 + λ3 = 1;
where λ1, λ2 and λ3 denote the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and their optimal values are determined by a grid search over [0, 1] in steps of 0.1.
CN202010290317.9A 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning Active CN111583964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010290317.9A CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010290317.9A CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Publications (2)

Publication Number Publication Date
CN111583964A CN111583964A (en) 2020-08-25
CN111583964B true CN111583964B (en) 2023-07-21

Family

ID=72126539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010290317.9A Active CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Country Status (1)

Country Link
CN (1) CN111583964B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347910B (en) * 2020-11-05 2022-05-31 中国电子科技集团公司第二十九研究所 Signal fingerprint identification method based on multi-mode deep learning
CN114612810B (en) * 2020-11-23 2023-04-07 山东大卫国际建筑设计有限公司 Dynamic self-adaptive abnormal posture recognition method and device
CN113116300A (en) * 2021-03-12 2021-07-16 复旦大学 Physiological signal classification method based on model fusion
CN113409824B (en) * 2021-07-06 2023-03-28 青岛洞听智能科技有限公司 Speech emotion recognition method
CN113780107B (en) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model
CN113903362B (en) * 2021-08-26 2023-07-21 电子科技大学 Voice emotion recognition method based on neural network
CN114726802A (en) * 2022-03-31 2022-07-08 山东省计算中心(国家超级计算济南中心) Network traffic identification method and device based on different data dimensions
CN115195757B (en) * 2022-09-07 2023-08-04 郑州轻工业大学 Electric bus starting driving behavior modeling and recognition training method
CN117373491B (en) * 2023-12-07 2024-02-06 天津师范大学 Method and device for dynamically extracting voice emotion characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning deep features to recognise speech emotion using merged deep CNN; Jianfeng Zhao et al.; IET Signal Processing; 2018-04-24 *
Learning spectro-temporal features with 3D CNNs for speech emotion recognition; Jaebok Kim et al.; arXiv:1708.05071v1; 2017-08-14 *

Also Published As

Publication number Publication date
CN111583964A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
Er A novel approach for classification of speech emotions based on deep and acoustic features
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Atila et al. Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
Sun End-to-end speech emotion recognition with gender information
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
Schuller et al. Emotion recognition in the noise applying large acoustic feature sets
Umamaheswari et al. An enhanced human speech emotion recognition using hybrid of PRNN and KNN
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110534133B (en) Voice emotion recognition system and voice emotion recognition method
Han et al. Speech emotion recognition with a resnet-cnn-transformer parallel neural network
Xu et al. Multi-type features separating fusion learning for Speech Emotion Recognition
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Wu et al. Speech synthesis with face embeddings
Akinpelu et al. Lightweight deep learning framework for speech emotion recognition
Mocanu et al. Speech emotion recognition using GhostVLAD and sentiment metric learning
Zaferani et al. Automatic personality traits perception using asymmetric auto-encoder
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Mohammed et al. Speech Emotion Recognition Using MELBP Variants of Spectrogram Image.
Zhu et al. Emotion Recognition of College Students Based on Audio and Video Image.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant