CN111583964B - Natural voice emotion recognition method based on multimode deep feature learning - Google Patents

Natural voice emotion recognition method based on multimode deep feature learning

Info

Publication number
CN111583964B
CN111583964B (application CN202010290317.9A)
Authority
CN
China
Prior art keywords
neural network
convolutional neural
dimensional
dimensional convolutional
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010290317.9A
Other languages
Chinese (zh)
Other versions
CN111583964A (en
Inventor
张石清
赵小明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taizhou University
Original Assignee
Taizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taizhou University filed Critical Taizhou University
Priority to CN202010290317.9A priority Critical patent/CN111583964B/en
Publication of CN111583964A publication Critical patent/CN111583964A/en
Application granted granted Critical
Publication of CN111583964B publication Critical patent/CN111583964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a natural speech emotion recognition method based on multimodal deep feature learning, which comprises the following steps: S1, generating suitable multi-modal representations: three suitable audio representations are generated from the original one-dimensional speech signal as inputs for the different subsequent CNN models; S2, learning multi-modal features with a multiple deep convolutional neural network model; S3, integrating the classification results of the different CNN models with a score-level fusion method and outputting the final speech emotion recognition result. By fusing multiple deep convolutional neural networks, the invention learns complementary deep multi-modal features, thereby significantly improving emotion classification performance and providing well-discriminating features for natural speech emotion recognition.

Description

Natural voice emotion recognition method based on multimode deep feature learning
Technical Field
The invention relates to the technical field of voice signal processing and pattern recognition, in particular to a natural voice emotion recognition method based on multimode deep feature learning.
Background
In recent years, natural speech emotion recognition has been an active and challenging research topic in pattern recognition, speech signal processing, artificial intelligence, and related fields. Unlike conventional input devices, natural speech emotion recognition aims to provide intelligent emotion services, usable in voice call centers, healthcare, and affective computing, through direct voice interaction with a computer.
Currently, in the field of speech emotion recognition, a large amount of early work was done on acted (simulated) emotion, because creating a simulated emotion database is much easier than collecting natural emotion. In recent years, research on natural speech emotion recognition in real environments has attracted growing attention from researchers, because it is closer to real applications and more difficult than recognizing simulated emotion.
Speech emotion feature extraction is a key step in speech emotion recognition; it aims to extract feature parameters from emotional speech signals that reflect the speaker's emotional expression. Currently, a large body of speech emotion recognition literature relies on manually designed features, such as prosodic features (fundamental frequency, amplitude, pronunciation duration), timbre features (formants, spectral energy distribution, harmonic-to-noise ratio), and spectral features (Mel-frequency cepstral coefficients (MFCCs), linear prediction coefficients (LPCs), and linear prediction cepstral coefficients (LPCCs)). However, these hand-crafted speech emotion features are low-level features and suffer from a "semantic gap" with the emotion labels understood by humans, so it is necessary to develop high-level speech emotion feature extraction methods.
To address this problem, newly emerging deep learning techniques may provide clues. Owing to their deeper architectures, deep learning techniques generally have advantages over traditional approaches, including the ability to automatically detect complex structures and features without manual feature engineering.
Various representative deep learning techniques, such as deep neural networks (DNNs), deep convolutional neural networks (CNNs), and recurrent neural networks based on long short-term memory (LSTM-RNNs), have so far been used for speech emotion recognition.
For example, a "speech emotion recognition method based on a multi-scale deep convolutional recurrent neural network" (bulletin number CN108717856 a) disclosed in chinese patent literature combines a deep Convolutional Neural Network (CNN) with a long-short-time memory network (LSTM), and considers the characteristic that two-dimensional (2D) speech spectrum fragment information with different lengths have different discriminatory power for different emotion types, a multi-scale cnn+lstm hybrid deep learning model is provided and applied to natural speech emotion recognition in an actual environment, but the speech emotion recognition method using 2D speech spectrum fragment information as CNN input cannot capture dynamic variation information represented by features in 2D time-frequency (time-frequency) between consecutive frames in a sentence of speech, so that feature parameters with good discriminatory power cannot be provided for natural speech emotion recognition. Although LSTM-RNN can be used for modeling of time information, time information is over emphasized.
Disclosure of Invention
The invention provides a natural speech emotion recognition method based on multi-modal deep feature learning, which aims to overcome the defect of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured, so that feature parameters with good discriminative power cannot be provided for natural speech emotion recognition.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a natural speech emotion recognition method based on multi-mode deep feature learning, the method comprising the steps of:
S1, generating suitable multi-modal representations: three suitable audio representations are generated from the original one-dimensional speech signal as inputs for the different subsequent CNN models;
S2, learning multi-modal features by adopting a multiple deep convolutional neural network model;
S3, integrating the classification results of the different CNN models by adopting a score-level fusion method, and outputting the final speech emotion recognition result.
The scheme of the invention exploits the observation that the deep multi-modal features learned by a multiple deep convolutional neural network model (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), are to some extent complementary, and uses them for natural speech emotion recognition. It solves the technical problem of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured. By fusing multiple deep convolutional neural networks to learn complementary deep multi-modal features, emotion classification performance is significantly improved and well-discriminating features are provided for natural speech emotion recognition.
Preferably, the step S1 includes the steps of:
s1.1, dividing a one-dimensional original voice signal waveform into segments, inputting the segments into a one-dimensional convolutional neural network, and setting the length of the voice segments;
s1.2, extracting a two-dimensional Mel spectrogram from a one-dimensional original voice signal, and constructing three-channel spectral fragments similar to RGB images as input of a two-dimensional convolutional neural network;
s1.3, forming a plurality of continuous two-dimensional Mel spectrum segment sequences into a 3D dynamic segment similar to a video, and performing space-time feature learning by taking the 3D dynamic segment as the input of a three-dimensional convolutional neural network.
Preferably, the step S2 includes the steps of:
s2.1, modeling a one-dimensional original voice signal waveform by adopting a one-dimensional convolutional neural network: constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model;
s2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel-spectrum segments to the model's fixed input size;
s2.3, modeling space-time dynamic information based on a three-dimensional convolutional neural network: based on the extracted 3D dynamic segments of the class video, a space-time feature learning task is executed, and a dropout regularization method is adopted to avoid network overfitting.
Preferably, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer, where each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer; that is, the input data are normalized before the one-dimensional convolutional neural network is trained.
Preferably, the AlexNet deep convolutional neural network model comprises five convolutional layers, three max pooling layers and two fully connected layers.
Preferably, the fine tuning in step S2.2 includes the steps of:
1) Copying the whole network parameters from a pre-trained AlexNet deep convolutional neural network model to initialize a two-dimensional convolutional neural network;
2) Replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion categories in the dataset;
3) The used two-dimensional convolutional neural network is retrained using a standard back-propagation strategy.
Fine tuning is widely used for transfer learning in computer vision and alleviates the problem of insufficient data.
Preferably, the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, two three-dimensional max-pooling layers, two fully connected layers and one softmax output layer, where each three-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit activation layer.
Preferably, the step S3 includes the steps of:
S3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters;
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e., averaging the classification results of all segments within one utterance, so as to generate emotion classification scores at the whole-utterance level;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models with a score-level fusion strategy for the final emotion classification.
Preferably, step S3.4 can be expressed as:
score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D, with λ1 + λ2 + λ3 = 1;
where λ1, λ2 and λ3 denote the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and their optimal values are determined by a grid search over [0, 1] in steps of 0.1.
Preferably, the network parameters in step S3.1 are updated according to:
θ*, W* = argmin Σ_i H(softmax(W·a_i^θ), y_i);
where W denotes the weight matrix of the softmax layer on top of the network parameters θ, a_i^θ denotes the output of the last fully connected (FC) layer for the input segment a_i, and y_i is the class label vector of the i-th segment, set equal to the emotion label of the whole utterance; H denotes the softmax logarithmic loss function, H(p, y) = -Σ_{c=1}^{C} y_c·log(p_c), where C is the total number of emotion categories.
The beneficial effects of the invention are as follows: it solves the technical problems of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured, so that feature parameters with good discriminative power cannot be provided for natural speech emotion recognition, and that temporal information is over-emphasized. By fusing multiple deep convolutional neural networks to learn complementary deep multi-modal features, emotion classification performance is significantly improved and well-discriminating features are provided for natural speech emotion recognition.
Drawings
Fig. 1 is a block diagram of the overall architecture of the present invention.
FIG. 2 is a confusion matrix of the recognition results of the score-level fusion of the present invention on the AFEW5.0 dataset.
Fig. 3 is a confusion matrix of the recognition results of the score-level fusion of the present invention on the BAUM-1s dataset.
Detailed Description
The technical scheme of the invention is further specifically described below through examples and with reference to the accompanying drawings.
Example 1: the natural voice emotion recognition method based on multimode deep feature learning of the embodiment, as shown in fig. 1, comprises the following steps:
s1, generating a proper multi-modal representation: three suitable audio representations are generated from the original one-dimensional speech signal for subsequent input of different CNN models.
Step S1 comprises the steps of:
S1.1, dividing the one-dimensional raw speech waveform into segments and inputting them into a one-dimensional convolutional neural network (1D-CNN), with the segment length set in advance; for best performance, the segment length is set to 625 frames, and the raw speech signal is sampled at 22 kHz and scaled to [-256, 256]. In this case, the scaled data are naturally close to zero mean, so no mean subtraction is needed.
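As a concrete illustration of S1.1, the segmentation can be sketched as follows; this is a minimal sketch assuming librosa for audio loading, and the helper name, the peak-based scaling into [-256, 256], and the treatment of the 625-frame segment length as a raw-sample count are illustrative assumptions rather than the patented procedure itself:

```python
import numpy as np
import librosa

def waveform_segments(path, seg_len=625, sr=22050):
    """Load speech at 22 kHz, scale it to [-256, 256], and cut it into
    fixed-length segments for the 1D-CNN (hypothetical helper)."""
    y, _ = librosa.load(path, sr=sr)            # one-dimensional raw waveform
    y = y / (np.max(np.abs(y)) + 1e-8) * 256.0  # scale into [-256, 256]
    n = len(y) // seg_len                       # number of whole segments
    return y[: n * seg_len].reshape(n, seg_len)
```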
S1.2, extracting a two-dimensional Mel spectrogram from a one-dimensional original voice signal, and constructing three-channel spectrum segments similar to RGB images as input of a two-dimensional convolutional neural network (2D-CNN);
s1.3, forming a plurality of continuous two-dimensional Mel spectrum segment sequences into a 3D dynamic segment similar to a video, and performing space-time feature learning by using the 3D dynamic segment as the input of a three-dimensional convolutional neural network (3D-CNN).
In the experimental verification, we generate three-channel two-dimensional Mel-spectrogram segments of size 64×64×3 as the input of the two-dimensional convolutional neural network. Specifically, we extract the full logarithmic Mel spectrogram of the speech signal in the range 20 Hz to 8000 Hz using a bank of 64 Mel filters, with a 25 ms Hamming window and a 10 ms shift. The full logarithmic Mel spectrogram is then split into fixed-length segments using a 64-frame window, yielding 64×64 static segments. Next, we compute the first and second regression coefficients of each static segment along the time axis, producing its first-derivative (delta) and second-derivative (delta-delta) coefficients. Finally, similar to a color RGB image in computer vision, we can generate three-channel (static, first-derivative and second-derivative) Mel-spectrogram segments.
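The three-channel Mel-spectrogram extraction described above can be sketched as follows, assuming librosa; the function name, the non-overlapping 64-frame segmentation step, and reading the stated "10 ms overlap" as a 10 ms frame shift are assumptions:

```python
import numpy as np
import librosa

def mel_segments(y, sr=22050, n_mels=64, seg_frames=64):
    """Build RGB-like static / delta / delta-delta Mel-spectrogram segments
    of shape (64, 64, 3) from a one-dimensional speech signal y."""
    log_mel = librosa.power_to_db(librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, fmin=20, fmax=8000,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr), window="hamming"))
    d1 = librosa.feature.delta(log_mel, order=1)   # first derivative (delta)
    d2 = librosa.feature.delta(log_mel, order=2)   # second derivative (delta-delta)
    segments = []
    for t in range(0, log_mel.shape[1] - seg_frames + 1, seg_frames):
        segments.append(np.stack(
            [m[:, t:t + seg_frames] for m in (log_mel, d1, d2)], axis=-1))
    return np.array(segments)                      # (n_segments, 64, 64, 3)
```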
S2, learning multi-modal features by adopting a multiple deep convolutional neural network model (Multi-CNN).
Step S2 comprises the steps of:
s2.1, modeling a one-dimensional original voice signal waveform by adopting a one-dimensional convolutional neural network: and constructing a one-dimensional convolutional neural network model, and training the constructed one-dimensional convolutional neural network model.
The stride of the convolutional layers and of the max-pooling layers is set to 1. As shown in Table 1, the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer. Each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer, i.e., the input data are normalized before the one-dimensional convolutional neural network is trained, and the output of the Softmax layer corresponds to all emotion classes of the dataset used.
TABLE 1
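Since the exact channel counts and kernel sizes of Table 1 are not reproduced in this text, the following PyTorch sketch uses assumed filter sizes; only the layer layout (four 1D convolutional layers with batch normalization and ReLU, three max-pooling layers, two fully connected layers, and a softmax output handled by the loss) follows the description above:

```python
import torch
import torch.nn as nn

class Emotion1DCNN(nn.Module):
    """1D-CNN over raw-waveform segments; filter sizes are assumptions."""
    def __init__(self, num_classes=7, seg_len=625):
        super().__init__()
        def block(cin, cout, k, pool):
            layers = [nn.Conv1d(cin, cout, k), nn.BatchNorm1d(cout), nn.ReLU()]
            if pool:
                layers.append(nn.MaxPool1d(2))
            return layers
        self.features = nn.Sequential(              # 4 conv layers, 3 max-pool layers
            *block(1, 32, 5, True), *block(32, 64, 5, True),
            *block(64, 128, 5, True), *block(128, 128, 5, False))
        with torch.no_grad():                        # infer the flattened feature size
            n_flat = self.features(torch.zeros(1, 1, seg_len)).numel()
        self.classifier = nn.Sequential(             # 2 fully connected layers
            nn.Linear(n_flat, 256), nn.ReLU(), nn.Linear(256, num_classes))

    def forward(self, x):                            # x: (batch, 1, seg_len)
        return self.classifier(self.features(x).flatten(1))
```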
S2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning an existing pre-trained AlexNet deep convolutional neural network model on the target data, and resampling the generated three-channel Mel-spectrum segments to the fixed input size of the AlexNet model; as shown in Table 2, the AlexNet deep convolutional neural network model comprises five convolutional layers, three max-pooling layers, and two fully connected layers.
TABLE 2
The fine tuning in step S2.2 comprises the steps of:
1) Copying the whole network parameters from a pre-trained AlexNet deep convolutional neural network model to initialize a two-dimensional convolutional neural network;
2) Replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion categories in the dataset;
3) The used two-dimensional convolutional neural network is retrained using a standard back-propagation strategy.
Fine tuning is widely used for transfer learning in computer vision and alleviates the problem of insufficient data.
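Steps 1) to 3) correspond to standard transfer learning with torchvision's ImageNet-pretrained AlexNet; a minimal sketch follows (the three-channel Mel segments must first be resized to AlexNet's fixed input resolution, as noted in step S2.2, and the 7-class default is only an example):

```python
import torch.nn as nn
from torchvision import models

def build_finetuned_alexnet(num_classes=7):
    """Step 1: copy all parameters from ImageNet-pretrained AlexNet.
    Step 2: replace the 1000-way output layer with one sized to the
    number of emotion categories. Step 3: retrain with back-propagation."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    net.classifier[6] = nn.Linear(net.classifier[6].in_features, num_classes)
    return net
```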
S2.3, modeling space-time dynamic information based on a three-dimensional convolutional neural network: based on the extracted 3D dynamic segments of the class video, a space-time feature learning task is executed, and a dropout regularization method is adopted to avoid network overfitting.
As shown in Table 3, the three-dimensional convolutional neural network comprises two three-dimensional convolutional layers, two three-dimensional max-pooling layers, two fully connected layers, and one softmax output layer; each three-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit (ReLU) activation layer.
TABLE 3
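As with Table 1, the exact filter counts of Table 3 are not reproduced here, so the sketch below only mirrors the described layout (two 3D convolutional layers with batch normalization and ReLU, two 3D max-pooling layers, two fully connected layers, dropout, and a softmax output); the first layer's temporal depth equals the number of consecutive Mel slices in a clip, while the channel sizes are assumptions:

```python
import torch
import torch.nn as nn

class Emotion3DCNN(nn.Module):
    """3D-CNN over video-like Mel-spectrogram clips; channel counts are assumptions."""
    def __init__(self, num_classes=7, clip_len=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(clip_len, 3, 3)),   # temporal depth = clip length
            nn.BatchNorm3d(32), nn.ReLU(), nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(1, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(), nn.MaxPool3d((1, 2, 2)))
        with torch.no_grad():                                  # infer flattened size
            n_flat = self.features(torch.zeros(1, 3, clip_len, 64, 64)).numel()
        self.classifier = nn.Sequential(
            nn.Linear(n_flat, 256), nn.ReLU(), nn.Dropout(0.5),  # dropout against overfitting
            nn.Linear(256, num_classes))

    def forward(self, x):   # x: (batch, 3, clip_len, 64, 64)
        return self.classifier(self.features(x).flatten(1))
```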
S3, integrating different CNN model classification results by adopting a score level fusion method, and outputting a final voice emotion recognition result.
The feature representations obtained from the one-dimensional and three-dimensional convolutional neural networks capture acoustic characteristics entirely different from those of the two-dimensional convolutional neural network, whose input is a 2D time-frequency representation. This suggests that the emotion features learned from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks may be complementary to each other, so it is worthwhile to integrate them in a multiple convolutional neural network fusion network, which can further improve speech emotion classification performance.
Step S3 comprises the steps of:
S3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters according to the following expression:
θ*, W* = argmin Σ_i H(softmax(W·a_i^θ), y_i);
where W denotes the weight matrix of the softmax layer on top of the network parameters θ, a_i^θ denotes the output of the last fully connected (FC) layer for the input segment a_i, and y_i is the class label vector of the i-th segment, which is set equal to the emotion label of the whole utterance; H denotes the softmax logarithmic loss function, H(p, y) = -Σ_{c=1}^{C} y_c·log(p_c), where C is the total number of emotion categories.
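In PyTorch terms, H with segment labels inherited from the utterance is the usual softmax cross-entropy; the one-hot sketch below mirrors the formula and is equivalent to nn.CrossEntropyLoss for integer labels (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def softmax_log_loss(W, a_i, y_i):
    """H(softmax(W·a_i), y_i) for one segment: a_i is the last-FC-layer
    output, W the softmax-layer weights, y_i a one-hot emotion vector."""
    p = F.softmax(W @ a_i, dim=-1)           # class posteriors over C emotions
    return -(y_i * torch.log(p)).sum()       # softmax logarithmic loss H
```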
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e., averaging the classification results of all segments within one utterance, so as to generate emotion classification scores at the whole-utterance level (utterance-level);
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models (1D-CNN, 2D-CNN, 3D-CNN) with a score-level fusion strategy for the final emotion classification, which can be expressed as follows:
score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D, with λ1 + λ2 + λ3 = 1;
where λ1, λ2 and λ3 denote the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and their optimal values are determined by a grid search over [0, 1] in steps of 0.1.
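A sketch of the utterance-level pooling and score-level fusion of S3.2 to S3.4, including the 0.1-step grid search for the weights, is given below; the function names are illustrative and the grid search is assumed to be run on validation data:

```python
import itertools
import numpy as np

def utterance_scores(segment_scores, utt_ids):
    """Average-pool segment-level class scores into one vector per utterance."""
    utt_ids = np.asarray(utt_ids)
    return np.stack([segment_scores[utt_ids == u].mean(axis=0)
                     for u in np.unique(utt_ids)])

def fuse_scores(score_1d, score_2d, score_3d, weights):
    """score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D."""
    l1, l2, l3 = weights
    return l1 * score_1d + l2 * score_2d + l3 * score_3d

def search_weights(scores, labels, step=0.1):
    """Grid-search (λ1, λ2, λ3) over [0, 1] with λ1 + λ2 + λ3 = 1."""
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    best, best_acc = (1.0, 0.0, 0.0), -1.0
    for l1, l2 in itertools.product(grid, grid):
        l3 = 1.0 - l1 - l2
        if l3 < -1e-9:
            continue                                   # outside the weight simplex
        pred = fuse_scores(*scores, (l1, l2, l3)).argmax(axis=1)
        acc = float(np.mean(pred == labels))
        if acc > best_acc:
            best, best_acc = (l1, l2, l3), acc
    return best, best_acc
```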
To verify the effectiveness of the proposed method for natural speech emotion recognition, we used two challenging natural emotional speech datasets, AFEW5.0 and BAUM-1s; no acted (simulated) emotional speech dataset was used in the experiments.
The AFEW5.0 dataset contains 7 emotion categories, namely angry, happy, sad, disgust, surprise, fear, and neutral; three annotators were invited to annotate these emotions. The AFEW5.0 dataset is divided into three parts: a training set Train (723 samples), a validation set Val (383 samples) and a test set Test (539 samples). We did not use the Test set because access to it is granted only to researchers participating in the competition.
BAUM-1s contains not only 6 basic emotion categories, namely angry, happy, sad, disgust, fear, and surprise, but also other mental states, such as uncertainty, thinking, concentration, and confusion. Here we focus on recognizing only the 6 basic emotion categories, resulting in a subset of 521 emotional video samples.
A. Experimental setup
For training of the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, the mini-batch size of the input data was 30, the maximum number of epochs was 300, and the learning rate was 0.001. To speed up the training of the convolutional neural networks, an NVIDIA GTX TITAN X GPU with 12 GB of memory was used, and a speaker-independent cross-validation strategy was adopted to implement natural speech emotion recognition, which is the setting mainly used in real scenes.
On the AFEW5.0 dataset, the experiments use the original training set (Train) for training and the validation set (Val) as the test set. On the BAUM-1s dataset, which contains 31 Turkish speakers, a leave-one-speakers-group-out (LOSGO) cross-validation strategy with 5 speaker groups was adopted; accordingly, the average recognition accuracy over the five tests is reported on the BAUM-1s dataset.
Note that in the experiments we split the Mel spectrogram of each whole utterance extracted from the audio samples into a number of Mel-spectrogram segments, and segment-level feature learning is performed with the convolutional neural networks. In this case, the emotion label of each Mel-spectrogram segment is set to the utterance-level emotion label.
B. Network training
Network training is performed on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, and the network parameters are updated according to the following expression:
θ*, W* = argmin Σ_i H(softmax(W·a_i^θ), y_i);
where W denotes the weight matrix of the softmax layer on top of the network parameters θ, a_i^θ denotes the output of the last FC layer for the input segment a_i, and y_i is the class label vector of the i-th segment, equal to the utterance-level emotion class; H denotes the softmax logarithmic loss function, H(p, y) = -Σ_{c=1}^{C} y_c·log(p_c), where C is the total number of emotion categories.
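Putting the pieces together, a minimal segment-level training loop with the hyper-parameters reported in the experimental setup (mini-batch size 30, 300 epochs, learning rate 0.001) could look as follows; the SGD-with-momentum optimizer is an assumption, since the patent does not name the optimizer:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train_segment_model(model, dataset, batch_size=30, epochs=300, lr=1e-3,
                        device="cuda" if torch.cuda.is_available() else "cpu"):
    """Segment-level training: every segment carries its utterance's emotion label."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # optimizer assumed
    loss_fn = nn.CrossEntropyLoss()       # softmax + logarithmic loss H
    model.to(device).train()
    for _ in range(epochs):
        for segments, labels in loader:
            opt.zero_grad()
            loss_fn(model(segments.to(device)), labels.to(device)).backward()
            opt.step()
    return model
```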
C. Results and analysis
The length of the sample segments input to the one-dimensional convolutional neural network (1D-CNN) can have a significant impact on the performance of the 1D-CNN network. We therefore studied the performance of different sample segment lengths as inputs to the 1D-CNN network, i.e., we tested four different sample segment lengths (125, 625, 3125, 15625 frames) together with the corresponding number of convolutional layers. Table 4 shows the recognition performance of the four sample segment lengths and the associated number of convolutional layers. Note that each convolutional layer is followed by a max-pooling layer, except that the last convolutional layer acts as a fully connected layer.
TABLE 4
As shown in Table 4, the sample segment length of 625 frames performs best among the four lengths on both the AFEW5.0 and BAUM-1s datasets. Specifically, our method achieves an accuracy of 24.02% on the AFEW5.0 dataset and 37.37% on the BAUM-1s dataset. Larger sample segment lengths help to improve performance, but overly large segment lengths do not, possibly because a larger segment length reduces the number of samples available for training the one-dimensional convolutional neural network. The performance of the one-dimensional convolutional neural network therefore does not always improve as the segment length increases, so we set the sample segment length of the one-dimensional convolutional neural network to 625 frames.
Since the extracted three-channel Mel-spectrogram segments resemble RGB images, existing deep models pre-trained on ImageNet data can be fine-tuned and used as the 2D-CNN network. To evaluate the fine-tuning effect of different pre-trained deep network models, we compared the fine-tuning recognition performance of three typical deep network models, AlexNet, VGG-16 and ResNet-50, on the target emotion datasets. The recognition results of these deep network models are obtained by averaging the classification scores of all segments and then taking the maximum.
Table 5 shows the fine-tuning recognition results of the three typical deep network models (AlexNet, VGG-16 and ResNet-50). As can be seen from Table 5, AlexNet performs slightly better than VGG-16 and ResNet-50: on the AFEW5.0 database, AlexNet reaches an accuracy of 29.24%, versus 28.16% and 28.55% for VGG-16 and ResNet-50, respectively, and on the BAUM-1s database the accuracies are 42.29%, 41.73% and 41.97%, respectively. This suggests that deeper network models such as VGG-16 and ResNet-50 bring no significant performance improvement over the shallower AlexNet, perhaps because the emotion datasets used are very limited, so the number of generated speech samples is insufficient to train deeper networks.
TABLE 5
For spatio-temporal feature learning, we compose sequences of consecutive two-dimensional Mel-spectrogram segments into video-like 3D dynamic segments as the input of the three-dimensional convolutional neural network (3D-CNN). The created video clip length equals the number of consecutive two-dimensional Mel-spectrogram segments, and it can also significantly affect the performance of the 3D-CNN network.
To evaluate the performance of different video clip lengths as inputs to the 3D-CNN network, the experiments report the recognition results of 4 different video clip lengths (4, 6, 8, 10 Mel-spectrogram segments). For these clip lengths, the three-dimensional convolutional neural network has the same network structure except for the first convolutional layer. In the first convolutional layer (Conv1), the temporal depth of the three-dimensional filter (i.e., the number of consecutive Mel-spectrogram segments stacked in series) is equal to the corresponding video clip length. Table 6 shows the performance of the four video clip lengths (i.e., 4, 6, 8, 10 Mel-spectrogram segments) together with the temporal depth of the first convolutional layer's three-dimensional filter.
TABLE 6
As can be seen from table 6, the video clip lengths containing 4 consecutive speech Mel spectral clips gave the best performance in the AFEW5.0 database and the BAUM-1s database, with accuracy of 28.46% and 37.97%, respectively. As the length of the video segments increases, the performance of the three-dimensional convolutional neural network decreases, probably because the number of video segments used to train the three-dimensional convolutional neural network decreases as the length of the video segments increases.
In the experiments, we propose and compare two fusion methods for the multiple deep convolutional neural networks: feature-level fusion and score-level fusion. For feature-level fusion, we first extract the utterance-level features of each convolutional neural network by average-pooling the segment-level features represented by the output of its last fully connected layer. We then directly concatenate the three utterance-level feature vectors from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks into a 5376-D feature vector, and finally use a linear support vector machine (SVM) for the final emotion classification.
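The feature-level baseline described above amounts to concatenating the three utterance-level feature vectors and training a linear SVM; a sketch with scikit-learn follows (the 5376-D total comes from the text above, while the individual feature dimensions and the default SVM settings are assumptions):

```python
import numpy as np
from sklearn.svm import LinearSVC

def feature_level_fusion(feat_1d, feat_2d, feat_3d, labels):
    """Concatenate utterance-level CNN features (5376-D in total) and
    train a linear support vector machine for the final emotion decision."""
    X = np.concatenate([feat_1d, feat_2d, feat_3d], axis=1)
    clf = LinearSVC()
    clf.fit(X, labels)
    return clf
```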
Table 7 shows the recognition results of the different multiple deep convolutional neural network fusion methods and of the best-performing single convolutional neural network. For the AFEW5.0 database, the optimal weight values are 0.3, 0.3 and 0.4; for the BAUM-1s database, the optimal weights are 0.2, 0.5 and 0.3. From the results in Table 7, it can be seen that:
1) The two-dimensional convolutional neural network (2D-CNN) performs best, followed by the three-dimensional convolutional neural network (3D-CNN) and the one-dimensional convolutional neural network (1D-CNN). This shows that using the generated two-dimensional Mel-spectrogram segments, which resemble RGB images, to fine-tune the existing ImageNet-based deep network model AlexNet is effective and relieves the pressure that insufficient emotion data puts on deep neural network training.
2) Score-level fusion performs better than feature-level fusion. This indicates that score-level fusion is more suitable for fusing multiple deep neural networks.
3) Fusing the multiple convolutional neural networks, whether at the feature level or at the score level, performs better than any single convolutional neural network (one-dimensional, two-dimensional or three-dimensional). This suggests that the multi-modal deep features learned from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks are complementary, so integrating them in a multiple deep convolutional neural network fusion network significantly improves emotion classification performance.
TABLE 7
To show the recognition accuracy for each emotion, Fig. 2 and Fig. 3 present the confusion matrices of the recognition results. The score-level fusion method achieves recognition accuracies of 35.77% and 44.06% on the two datasets, respectively.
As shown in FIG. 2, on the AFEW5.0 database the accuracies of the three emotions "angry", "neutral" and "fear" are 56.25%, 50.79% and 43.48%, respectively, while the classification accuracies of the other four emotions, namely "disgust", "happy", "sad" and "surprise", are below 33%.
As can be seen from FIG. 3, on the BAUM-1s database the two emotions "sad" and "happy" are recognized with accuracies of 70.90% and 55.49%, respectively, while the other four emotions, namely "angry", "fear", "disgust" and "surprise", have accuracies below 29%.
In summary, the scheme exploits the observation that the deep multi-modal features learned by a multiple deep convolutional neural network model (Multi-CNN), namely a one-dimensional convolutional neural network (1D-CNN), a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), are to some extent complementary, and uses them for natural speech emotion recognition. It solves the technical problem of the prior art that the dynamic variation information of the 2D time-frequency feature representation between consecutive frames within an utterance cannot be captured. By fusing multiple deep convolutional neural networks to learn complementary deep multi-modal features, emotion classification performance is significantly improved and well-discriminating features are provided for natural speech emotion recognition.

Claims (4)

1. The natural voice emotion recognition method based on multimode deep feature learning is characterized by comprising the following steps of:
S1, generating suitable multi-modal representations: three audio representation forms suitable for different deep convolutional neural network structures are generated from the original one-dimensional speech signal and used as inputs of the different deep convolutional neural network models;
the step S1 includes the steps of:
S1.1, scaling the waveform of the one-dimensional original speech signal to [-256, 256], dividing it into speech segments of length 625 frames, and inputting the segments into a one-dimensional convolutional neural network;
S1.2, extracting a two-dimensional Mel spectrogram from the one-dimensional original speech signal, and constructing three-channel Mel-spectrum segments similar to RGB images, i.e., 64×64×3 Mel-spectrum segments comprising static, first-derivative and second-derivative channels, as the input of a two-dimensional convolutional neural network;
S1.3, composing a plurality of consecutive two-dimensional Mel-spectrum segments, i.e., a sequence of the three-channel 64×64×3 spectrum segments, into a video-like 3D dynamic segment, and using the 3D dynamic segment as the input of a three-dimensional convolutional neural network for spatio-temporal feature learning;
S2, learning different multi-modal deep features by adopting the multiple deep convolutional neural network models established in steps S1.1-S1.3;
the step S2 includes the steps of:
S2.1, modeling the one-dimensional original speech waveform with a one-dimensional convolutional neural network: constructing a one-dimensional convolutional neural network model and training the constructed model; the one-dimensional convolutional neural network model comprises four one-dimensional convolutional layers, three max-pooling layers, two fully connected layers and one Softmax classification output layer, where each one-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit activation layer, i.e., the input data are normalized before the one-dimensional convolutional neural network is trained;
S2.2, performing two-dimensional Mel spectrogram modeling with a two-dimensional convolutional neural network: fine-tuning, on the target data, an AlexNet deep convolutional neural network model pre-trained in the existing computer vision field; the AlexNet deep convolutional neural network model comprises five convolutional layers, three max-pooling layers and two fully connected layers;
S2.3, modeling spatio-temporal dynamic information based on a three-dimensional convolutional neural network: performing a spatio-temporal feature learning task based on the extracted video-like 3D dynamic segments, and adopting a dropout regularization method to avoid network overfitting; the three-dimensional convolutional neural network in step S2.3 comprises two three-dimensional convolutional layers, two three-dimensional max-pooling layers, two fully connected layers and one softmax output layer, where each three-dimensional convolutional layer includes a batch normalization layer and a rectified linear unit activation layer;
S3, integrating the classification results of the different, complementary deep convolutional neural network models with a score-level fusion method, and outputting the final speech emotion recognition result.
2. The method for recognizing natural speech emotion based on multi-mode deep feature learning according to claim 1, wherein the fine tuning in step S2.2 comprises the steps of:
1) Copying the whole network parameters from a pre-trained AlexNet deep convolutional neural network model to initialize a two-dimensional convolutional neural network;
2) Replacing the softmax output layer of the AlexNet deep convolutional neural network model with a new output layer whose label vector corresponds to the number of emotion categories in the dataset;
3) The used two-dimensional convolutional neural network is retrained using a standard back-propagation strategy.
3. The method for recognizing natural speech emotion based on multi-mode deep feature learning according to claim 1, wherein said step S3 comprises the steps of:
S3.1, performing network training on the one-dimensional, two-dimensional and three-dimensional convolutional neural networks and updating the network parameters;
S3.2, applying an average pooling strategy to the segment-level classification results obtained from the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, i.e., averaging the classification results of all segments within one utterance, so as to generate emotion classification scores at the whole-utterance level;
S3.3, taking the maximum of the utterance-level emotion classification scores to obtain the emotion recognition result of each convolutional neural network;
S3.4, combining the utterance-level classification scores obtained by the different convolutional neural network models with a score-level fusion strategy for the final emotion classification.
4. A method of natural speech emotion recognition based on multimode deep feature learning according to claim 3, wherein step S3.4 can be expressed as:
score_fusion = λ1·score_1D + λ2·score_2D + λ3·score_3D, with λ1 + λ2 + λ3 = 1;
where λ1, λ2 and λ3 denote the weights of the classification scores obtained by the one-dimensional, two-dimensional and three-dimensional convolutional neural networks, respectively, and their optimal values are determined by a grid search over [0, 1] in steps of 0.1.
CN202010290317.9A 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning Active CN111583964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010290317.9A CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010290317.9A CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Publications (2)

Publication Number Publication Date
CN111583964A CN111583964A (en) 2020-08-25
CN111583964B true CN111583964B (en) 2023-07-21

Family

ID=72126539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010290317.9A Active CN111583964B (en) 2020-04-14 2020-04-14 Natural voice emotion recognition method based on multimode deep feature learning

Country Status (1)

Country Link
CN (1) CN111583964B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347910B (en) * 2020-11-05 2022-05-31 中国电子科技集团公司第二十九研究所 Signal fingerprint identification method based on multi-mode deep learning
CN114612810B (en) * 2020-11-23 2023-04-07 山东大卫国际建筑设计有限公司 Dynamic self-adaptive abnormal posture recognition method and device
CN113116300A (en) * 2021-03-12 2021-07-16 复旦大学 Physiological signal classification method based on model fusion
CN113409824B (en) * 2021-07-06 2023-03-28 青岛洞听智能科技有限公司 Speech emotion recognition method
CN113780107B (en) * 2021-08-24 2024-03-01 电信科学技术第五研究所有限公司 Radio signal detection method based on deep learning dual-input network model
CN113903362B (en) * 2021-08-26 2023-07-21 电子科技大学 Voice emotion recognition method based on neural network
CN114726802A (en) * 2022-03-31 2022-07-08 山东省计算中心(国家超级计算济南中心) Network traffic identification method and device based on different data dimensions
CN115195757B (en) * 2022-09-07 2023-08-04 郑州轻工业大学 Electric bus starting driving behavior modeling and recognition training method
CN117373491B (en) * 2023-12-07 2024-02-06 天津师范大学 Method and device for dynamically extracting voice emotion characteristics

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108717856A (en) * 2018-06-16 2018-10-30 台州学院 A kind of speech-emotion recognition method based on multiple dimensioned depth convolution loop neural network
CN109036465A (en) * 2018-06-28 2018-12-18 南京邮电大学 Speech-emotion recognition method
CN110534132A (en) * 2019-09-23 2019-12-03 河南工业大学 A kind of speech-emotion recognition method of the parallel-convolution Recognition with Recurrent Neural Network based on chromatogram characteristic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning deep features to recognise speech emotion using merged deep CNN; Jianfeng Zhao et al.; IET Signal Processing; 2018-04-24 *
Learning spectro-temporal features with 3D CNNs for speech emotion recognition; Jaebok Kim et al.; arXiv:1708.05071v1; 2017-08-14 *

Also Published As

Publication number Publication date
CN111583964A (en) 2020-08-25

Similar Documents

Publication Publication Date Title
CN111583964B (en) Natural voice emotion recognition method based on multimode deep feature learning
Er A novel approach for classification of speech emotions based on deep and acoustic features
CN112348075B (en) Multi-mode emotion recognition method based on contextual attention neural network
Atila et al. Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
Sun End-to-end speech emotion recognition with gender information
CN110400579B (en) Speech emotion recognition based on direction self-attention mechanism and bidirectional long-time and short-time network
Schuller et al. Emotion recognition in the noise applying large acoustic feature sets
Umamaheswari et al. An enhanced human speech emotion recognition using hybrid of PRNN and KNN
CN112784798A (en) Multi-modal emotion recognition method based on feature-time attention mechanism
Sultana et al. Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks
CN110675859B (en) Multi-emotion recognition method, system, medium, and apparatus combining speech and text
CN110534133B (en) Voice emotion recognition system and voice emotion recognition method
Han et al. Speech emotion recognition with a resnet-cnn-transformer parallel neural network
Xu et al. Multi-type features separating fusion learning for Speech Emotion Recognition
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN110348482A (en) A kind of speech emotion recognition system based on depth model integrated architecture
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Wu et al. Speech synthesis with face embeddings
Akinpelu et al. Lightweight deep learning framework for speech emotion recognition
Mocanu et al. Speech emotion recognition using GhostVLAD and sentiment metric learning
Zaferani et al. Automatic personality traits perception using asymmetric auto-encoder
Nanduri et al. A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data
Mohammed et al. Speech Emotion Recognition Using MELBP Variants of Spectrogram Image.
Zhu et al. Emotion Recognition of College Students Based on Audio and Video Image.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant