CN112800875A - Multi-mode emotion recognition method based on mixed feature fusion and decision fusion - Google Patents
- Publication number
- CN112800875A (application CN202110048664.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- text
- fusion
- emotion
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/174—Facial expression recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/254—Fusion techniques of classification results, e.g. of results related to same input data
- G06F18/256—Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Human Computer Interaction (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Image Analysis (AREA)
Abstract
A multi-modal emotion recognition method based on mixed feature fusion and decision fusion, belonging to the fields of pattern recognition and emotion recognition. The method comprises the following steps: first, an image emotion recognition network is constructed with a convolutional neural network framework to obtain image features and an image emotion state; second, a text emotion recognition network is constructed with a recurrent neural network framework to obtain text features and a text emotion state; third, a multi-modal information fusion emotion recognition network is constructed, in which a main classifier fuses the image and text emotion states to obtain a main emotion classification, an auxiliary classifier fuses the image and text features to obtain an auxiliary emotion classification, and the two classifications are fused to obtain the final emotion classification. By exploiting the complementarity among multi-modal information, the invention avoids the low recognition accuracy that single-modal information suffers when information is blurred or missing, and provides a new approach to multi-modal data fusion and emotion recognition.
Description
Technical Field
The invention relates to the fields of data fusion, neural networks, and emotion recognition, and in particular to a multi-modal information fusion emotion recognition method based on hybrid fusion.
Background
Humans express emotional information through multiple modalities such as facial expressions, posture, voice, and language, and emotional behavior is an important indicator of human satisfaction. With the development of artificial intelligence, emotion recognition has become an important means of achieving natural human-computer interaction. Emotion recognition extracts features from emotional signals to learn the mapping between the outward appearance of an emotion and the inner emotional state, thereby identifying the emotion category of the recognized subject. It has broad application prospects in machine services, healthcare, distance education, autonomous driving, and other fields.
A modality is a way of representing information, such as images, text, or sound. Multi-modality refers to a combination of two or more modalities. The same object can be expressed in different modalities, and the information carried by each modality is both independent of and potentially correlated with the others. At present, emotion recognition mainly acquires and analyzes single-modal emotional information to infer the subject's emotional state. Because single-modal information has weak anti-interference capability and is easily contaminated by redundant signals or degraded by missing information, classification accuracy is low and misclassification can occur.
Human cognition is inherently multi-modal: an individual perceives a scene through vision, hearing, and even touch, and obtains high-level information such as emotion by fusing and semantically interpreting these signals. Multi-modal information fusion aims to emulate this perceptual process by building models that can process, associate, and reason over information from multiple modalities, exploiting the complementarity among modalities to remove intra-modal redundancy, supplement missing information in a given modality, and capture the latent associations between different modalities.
By fusion level, multi-modal fusion is mainly divided into data-level, feature-level, and decision-level fusion. Data-level fusion is suitable only for signals of similar type and cannot handle signals that differ greatly, such as images and sound. Feature-level fusion converts the data of each modality into a high-dimensional feature representation, combines the high-order features of the different modalities into a new feature vector, and can thereby capture complementary information across modalities. Decision-level fusion feeds each modality into its own trained classifier and combines the individual classification results into a final decision vector; it fully accounts for the differences between modalities, and since its errors come from different classifiers whose errors are usually uncorrelated, the errors do not accumulate.
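To make the fusion levels concrete, the contrast between feature-level and decision-level fusion can be sketched in a few lines of plain Python (an illustrative toy outside the patent text; all vectors and scores below are invented):

```python
# Toy contrast between the two fusion levels described above.

def feature_level_fusion(image_feat, text_feat):
    # Combine high-order features of both modalities into one new
    # feature vector (here by simple concatenation), which a single
    # classifier would then consume.
    return image_feat + text_feat

def decision_level_fusion(image_probs, text_probs):
    # Each modality has already been classified separately; fuse the
    # per-class results (here by averaging) into a final decision vector.
    return [(p + q) / 2 for p, q in zip(image_probs, text_probs)]

image_feat = [0.2, 0.9]            # invented image feature vector
text_feat = [0.5, 0.1, 0.7]        # invented text feature vector
fused_feat = feature_level_fusion(image_feat, text_feat)

image_probs = [0.7, 0.2, 0.1]      # invented image classifier scores
text_probs = [0.5, 0.4, 0.1]       # invented text classifier scores
fused_probs = decision_level_fusion(image_probs, text_probs)
```

Data-level fusion has no analogue in this sketch because raw images and raw text cannot be meaningfully combined sample-by-sample, which is exactly the limitation noted above.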
Disclosure of Invention
The invention aims to overcome the weak anti-interference capability of existing single-modal emotion recognition methods and to provide a high-accuracy multi-modal emotion recognition method that exploits the complementarity among multi-modal information. The invention adopts an information fusion scheme that mixes feature-level fusion with decision-level fusion, constructing a multi-modal emotion recognition method based on mixed feature fusion and decision fusion.
The purpose of the invention is realized by the following technical scheme.
The invention discloses a multi-modal information fusion emotion recognition method based on hybrid fusion, which comprises the following steps:
Step 1: construct an image emotion recognition network based on a convolutional neural network (CNN) framework; extract features from the image information through a stacked convolutional structure, obtain the image features by capturing high-dimensional characteristics, and classify them to obtain the image emotion state;
Step 2: construct a text emotion recognition network based on a recurrent neural network (RNN) framework; the RNN takes the output of the previous node as the input of the next node, realizing a memory function that allows the model to extract features from long text information and recognize the text emotion state.
Step 3: construct a hybrid multi-modal information fusion network. A main classifier performs decision-level fusion of the image emotion label and the text emotion label to obtain a fused main classification result. An auxiliary classifier performs feature-level fusion of the image features and the text features to obtain an auxiliary classification result. The main and auxiliary classification results are then fused to obtain the final emotion state. A feature fusion layer and a decision fusion layer are constructed, comprehensively exploiting the correlation and complementarity between the two modalities to accomplish the final emotion recognition and classification task.
The implementation method of the step 1 comprises the following steps:
An image emotion recognition network is constructed with a convolutional neural network (CNN) to extract image features and obtain emotion classifications. This part can employ various image feature extraction networks, such as VGGNet or ResNet. Image data are input to the network in the format (B, C, H, W), where B is the batch size, i.e., the number of images input at the same time; C is the number of image channels (three for RGB color images, one for grayscale images); and H and W are the image height and width, respectively. The network extracts image features I1, which are fed into a fully connected layer to obtain the final image emotion state I, a vector of dimension [batch_size, num_class], where num_class is the number of predicted categories.
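The shape flow of this image branch can be sketched as follows (a NumPy stand-in, not the patent's network: global average pooling and a random linear layer replace the stacked convolutions, and all dimensions are example values):

```python
import numpy as np

rng = np.random.default_rng(0)

B, C, H, W = 4, 3, 32, 32          # batch size, channels, height, width
num_class = 5                      # number of predicted categories

images = rng.standard_normal((B, C, H, W))   # a batch in (B, C, H, W) format

# Stand-in for the stacked convolutional feature extractor: global
# average pooling over H and W yields image features I1 of shape (B, C).
I1 = images.mean(axis=(2, 3))

# Fully connected layer mapping I1 to the image emotion state I,
# a [batch_size, num_class] array.
fc_w = rng.standard_normal((C, num_class))
fc_b = np.zeros(num_class)
I = I1 @ fc_w + fc_b
```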
The step 2 is realized by the following steps:
A text emotion recognition network is constructed with a recurrent neural network (RNN) to extract text features and obtain emotion classifications. This part can adopt various mainstream text feature extraction frameworks, such as LSTM or BiLSTM. For text data, each word is input to a word embedding layer and encoded into a word vector; the network input has dimension [batch_size, seq_len], where batch_size is the batch size and seq_len is the sentence length. After the word embedding layer is randomly initialized, the word vectors have dimension [batch_size, seq_len, embed_size], where embed_size is the word vector dimension. The word vectors are input to the RNN to obtain the hidden vectors [batch_size, seq_len, hidden_size * 2] of all time steps, where hidden_size is the hidden layer size. The network extracts text features T1, which are fed into a fully connected layer to obtain the final text emotion state T, a vector of dimension [batch_size, num_class], where num_class is the number of predicted categories.
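The text branch's dimension bookkeeping can be sketched similarly (again a NumPy stand-in with example sizes: a random projection imitates the bidirectional RNN's [batch_size, seq_len, hidden_size * 2] output, and mean pooling over time stands in for the recurrent feature T1):

```python
import numpy as np

rng = np.random.default_rng(1)

batch_size, seq_len = 2, 6
vocab_size, embed_size, hidden_size = 100, 8, 16
num_class = 5

token_ids = rng.integers(0, vocab_size, size=(batch_size, seq_len))

# Randomly initialized word embedding layer:
# [batch_size, seq_len] -> [batch_size, seq_len, embed_size].
embedding = rng.standard_normal((vocab_size, embed_size))
word_vecs = embedding[token_ids]

# Stand-in for the bidirectional RNN: each time step is projected to
# hidden_size * 2, mimicking the hidden vectors of all time steps.
proj = rng.standard_normal((embed_size, hidden_size * 2))
hidden = word_vecs @ proj          # [batch_size, seq_len, hidden_size * 2]

# Pool over time for text features T1, then a fully connected layer
# yields the text emotion state T of shape [batch_size, num_class].
T1 = hidden.mean(axis=1)
fc = rng.standard_normal((hidden_size * 2, num_class))
T = T1 @ fc
```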
The implementation method of the step 3 is as follows:
Step 3.1: construct a main classifier for multi-modal information fusion. The image emotion state I and the text emotion state T are concatenated and fed into the main classifier to obtain a main classification result (Class) of dimension 1 × 4;
Step 3.2: obtain the image feature weights and text feature weights for feature fusion, and perform a concatenation operation on the image features and text features along the batch dimension. For the image data, the feature weight is:
where B is the batch size and C is the number of image data channels. For the text data, the feature weight is:
where B is the text batch size and S is the text length. Both are mapped to the interval [0, 1] by normalization to obtain the new feature Fused_feature:
The new feature is taken as the input of an auxiliary classifier to obtain the auxiliary classification result (Auxiliary).
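The patent's weight formulas appear only as figures and are not reproduced in this text, so the sketch below substitutes a placeholder weight (per-sample mean magnitude); only the surrounding steps of weighting, normalization to the 0-1 interval, and concatenation into Fused_feature follow the description:

```python
import numpy as np

rng = np.random.default_rng(2)

B = 4                                        # batch size
image_feat = rng.standard_normal((B, 16))    # invented image features
text_feat = rng.standard_normal((B, 32))     # invented text features

def min_max(x):
    # Map all values into the 0-1 interval, as the normalization
    # step above describes.
    return (x - x.min()) / (x.max() - x.min())

# Placeholder feature weights: the patent defines weights in terms of
# batch size and channel/sequence length (formulas elided); per-sample
# mean magnitude is used here purely for illustration.
img_w = np.abs(image_feat).mean(axis=1, keepdims=True)
txt_w = np.abs(text_feat).mean(axis=1, keepdims=True)

# Weight, normalize, and concatenate into the new feature Fused_feature,
# which would be fed to the auxiliary classifier.
fused_feature = np.concatenate(
    [min_max(img_w * image_feat), min_max(txt_w * text_feat)], axis=1)
```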
Step 3.3: the fusion layer routes the input vectors to a plurality of nodes by adopting a dynamic routing mode, and generates final fusion vectors through vector compression and splicing. Firstly, the input feature vector passes through a hidden layer:
u1=W1v1,u2=W2v2,
wherein v is1And v2W is the weight for the feature vector of the input text and image. Adopting dynamic route mode to make last oneThe feature vectors obtained in the step are routed to three nodes:
s_1 = c_11 u_1 + c_12 u_2,
s_2 = c_21 u_1 + c_22 u_2,
s_3 = c_31 u_1 + c_32 u_2,
An auxiliary classification vector of dimension 1 × 4 is generated by compressing and splicing the vectors:
v = Concat(Squash(s_i)),
Step 3.4: the main classification result and the auxiliary classification result are fused by a decision-level fusion method, and the final classification result is obtained with a softmax function:
Finally_class = softmax(Auxiliary + Class).
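Steps 3.3 and 3.4 can be sketched numerically as follows. This is an illustrative reading, not the patent's implementation: the routing coefficients c_ij are fixed at 0.5 instead of being updated iteratively, Squash is assumed to be the standard capsule-network squashing function, and all dimensions and scores are invented:

```python
import numpy as np

rng = np.random.default_rng(3)

def squash(s):
    # Capsule-style squash: rescales a vector so its norm lies in [0, 1).
    norm_sq = np.dot(s, s)
    return (norm_sq / (1.0 + norm_sq)) * (s / np.sqrt(norm_sq))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d = 8
v1 = rng.standard_normal(d)        # text feature vector v_1
v2 = rng.standard_normal(d)        # image feature vector v_2

W1 = rng.standard_normal((d, d))   # hidden-layer weight matrices
W2 = rng.standard_normal((d, d))
u1, u2 = W1 @ v1, W2 @ v2          # u_1 = W_1 v_1, u_2 = W_2 v_2

# Route to three nodes; fixed coefficients stand in for dynamic routing.
c = np.full((3, 2), 0.5)
s_nodes = [c[i, 0] * u1 + c[i, 1] * u2 for i in range(3)]

# Compress each node and splice them: v = Concat(Squash(s_i)).
v = np.concatenate([squash(s) for s in s_nodes])

# Step 3.4: decision-level fusion of the two classification results.
main_class = np.array([1.2, 0.3, -0.5, 0.1])    # invented main scores
auxiliary = np.array([0.4, 0.9, -0.2, 0.0])     # invented auxiliary scores
finally_class = softmax(auxiliary + main_class)
```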
compared with the prior art, the invention has the following advantages:
1. The mixed feature fusion and decision fusion multi-modal emotion recognition method disclosed by the invention extracts features from image and text information and recognizes their emotion classification results, constructs a main classifier based on decision fusion and an auxiliary classifier based on feature fusion, and obtains the final classification result by weighting the outputs of the two classifiers. This mitigates the poor performance caused by missing or ambiguous information in the single-modal case and achieves a good recognition effect;
2. The method constructs a fusion layer for the features of multiple modalities. The feature fusion layer is built with dynamic routing: the input vectors are routed to multiple nodes, and the fusion vector is generated through vector compression and splicing, fully accounting for both the correlation and the differences among the modal information;
3. In each modality, the method can substitute any network framework with good feature extraction capability, and therefore has good flexibility and extensibility.
Drawings
The invention is further described with reference to the following drawings and embodiments:
FIG. 1 is a flowchart of a mixed feature fusion and decision fusion multimodal emotion recognition method in an embodiment of the present invention;
FIG. 2 is a block diagram of a mixed feature fusion and decision fusion multimodal emotion recognition method according to an embodiment of the present invention;
FIG. 3 is a fusion layer framework diagram of a multi-modal emotion recognition method with hybrid feature fusion and decision fusion according to an embodiment of the present invention;
Detailed Description
The invention will be further described with reference to the accompanying drawings and specific embodiments, which are given by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart, Fig. 2 is a framework diagram, and Fig. 3 is a fusion layer framework diagram of the mixed feature fusion and decision fusion multi-modal emotion recognition method in an embodiment of the present invention. The specific implementation steps are as follows:
Step 1: synthesis of a multi-modal data set. The training data of the network model are divided into an image data set and a text data set, used to train the model and to verify the feasibility and superiority of the algorithm. The text data set is derived from yf_amazon and contains 720,000 shopping review/rating records from 140,000 users with 5 emotion classes; the data are cleaned to remove texts that are empty, garbled, or meaningless. The image data set is derived from original data sets such as the Kaggle FER2013 data set with 6 emotion classes; to match the text data set, 5 emotion classes are retained: angry, sad, neutral, happy, and surprised. The text and image data are placed in one-to-one correspondence to construct a triple data set with the structure <label, image, text> for training the multi-modal emotion recognition model.
Step 2: generation of the main classifier. As shown in Fig. 2, the embedding layer of the image adopts the ResNet50 residual network architecture; the original image vector is M × N, and a feature I of size S × T is obtained after embedding-layer encoding. The embedding network of the text adopts the BiLSTM long short-term memory architecture; the original text vector has length 128 × 1, padded with zeros when shorter and truncated when longer than 128, and a 256 × 1 feature T is obtained after embedding-layer encoding. The features I and T are dimensionally integrated through separate fully connected layers to produce features of the same dimension B1, which are finally spliced to generate the main classifier array of dimension B2.
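The fixed-length text handling described above (pad with zeros when shorter than 128, truncate when longer) can be sketched directly; the function name and pad value are illustrative, not from the patent:

```python
def pad_or_truncate(token_ids, length=128, pad_id=0):
    # Sentences shorter than `length` are padded with zeros; longer
    # ones are cut off, as in the embodiment above.
    if len(token_ids) >= length:
        return token_ids[:length]
    return token_ids + [pad_id] * (length - len(token_ids))

short = pad_or_truncate([5, 9, 2])          # padded up to 128
long = pad_or_truncate(list(range(200)))    # truncated down to 128
```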
Step 3: generation of the auxiliary classifier. As shown in Fig. 1, the feature vectors A and B of the image and text are input to a weighting layer to obtain shallow feature vectors; to retain as much semantic information as possible, a dot product with the original feature vectors is performed, and the two resulting vectors are used as the input of the fusion layer. As shown in Fig. 2, the fusion layer routes the input vectors to N nodes by dynamic routing, where N = 3, and then generates the final fusion vector through vector compression and splicing, fusing the image and text features as fully as possible.
Step 4: decision-level fusion of the main classifier and the auxiliary classifier, with a Softmax regression model identifying the emotion features to obtain the emotion category. There are 5 expression categories: angry, sad, neutral, happy, and surprised. The main and auxiliary classifiers are fused by three methods (mean fusion, DS evidence theory fusion, and dynamic weight fusion), and the Softmax regression model then yields the 5 emotion probabilities; the class with the maximum probability is the expression recognition result.
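Of the three fusion methods named above, mean fusion is the simplest to sketch (plain Python; the scores are invented, and DS evidence theory and dynamic weight fusion are not shown):

```python
import math

EMOTIONS = ["angry", "sad", "neutral", "happy", "surprised"]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_fusion(main_scores, aux_scores):
    # Average the main and auxiliary classifier scores per class, then
    # apply softmax to obtain the 5 emotion probabilities.
    averaged = [(m + a) / 2 for m, a in zip(main_scores, aux_scores)]
    return softmax(averaged)

main_scores = [2.0, 0.1, 0.5, 3.0, 0.2]   # invented main classifier output
aux_scores = [1.5, 0.3, 0.2, 2.5, 0.4]    # invented auxiliary output
probs = mean_fusion(main_scores, aux_scores)
predicted = EMOTIONS[probs.index(max(probs))]   # class of maximum probability
```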
Through the above steps, experiments are conducted on the pre-synthesized multi-modal data set, which is randomly divided into a training set (70% of the total), a validation set (15%), and a test set (15%). Three comparison experiments are performed. Experiment 1: an LSTM network is trained on the text data set alone to obtain the text emotion recognition accuracy. Experiment 2: a ResNet50 network is trained on the image data set alone to obtain the image emotion recognition accuracy. Experiment 3: the multi-modal emotion recognition model is trained on the image-text data set, with mean fusion, DS evidence theory fusion, and dynamic weight fusion as the data fusion methods, to obtain the accuracy of the multi-modal model. Compared with the results of Experiments 1 and 2, the results of Experiment 3 are improved by 3.22%, 3.68%, and 10.54%, respectively.
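The 70/15/15 random split used in the experiments can be sketched as follows (illustrative helper, not from the patent; the seed is arbitrary):

```python
import random

def split_dataset(samples, train=0.70, val=0.15, seed=42):
    # Shuffle once, then carve out 70% training, 15% validation,
    # and the remaining 15% test, mirroring the experiment setup.
    items = list(samples)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = round(n * train)
    n_val = round(n * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(1000))
```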
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to them; various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention. Any modification, equivalent replacement, or improvement made without departing from the spirit and principle of the present invention shall fall within its protection scope.
Claims (4)
1. A multi-modal emotion recognition method based on mixed feature fusion and decision fusion, characterized by comprising the following steps:
step 1: an image emotion recognition network is constructed based on a convolutional neural network (CNN) framework, and features are extracted from the image information through a stacked convolutional structure, capturing multi-dimensional characteristics to obtain the image features and to classify the image emotion state;
step 2: a text emotion recognition network is constructed based on a recurrent neural network (RNN) framework; the RNN takes the output of the previous node as the input of the next node, realizing a memory function that allows the model to better extract features from long text information and recognize the text emotion state;
step 3: a hybrid multi-modal information fusion network is constructed. A main classifier performs decision-level fusion of the image emotion label and the text emotion label to obtain a fused main classification result. An auxiliary classifier performs feature-level fusion of the image features and the text features to obtain an auxiliary classification result. The main and auxiliary classification results are fused to obtain the final emotion state. A feature fusion layer and a decision fusion layer are constructed, comprehensively exploiting the correlation and complementarity between the two modalities to accomplish the final emotion recognition and classification task.
2. The method according to claim 1, characterized in that step 1 is implemented as follows:
An image emotion recognition network is constructed with a convolutional neural network (CNN) to extract image features and obtain emotion classifications. This part can employ various image feature extraction networks, such as VGGNet or ResNet. Image data are input to the network in the format (B, C, H, W), where B is the batch size, i.e., the number of images input at the same time; C is the number of image channels (three for RGB color images, one for grayscale images); and H and W are the image height and width, respectively. The network extracts image features I1, which are fed into a fully connected layer to obtain the final image emotion state I, a vector of dimension [batch_size, num_class], where num_class is the number of predicted categories.
3. The method according to claim 1, characterized in that step 2 is implemented as follows:
A text emotion recognition network is constructed with a recurrent neural network (RNN) to extract text features and obtain emotion classifications. This part can adopt various mainstream text feature extraction frameworks, such as LSTM or BiLSTM. For text data, each word is input to a word embedding layer and encoded into a word vector; the network input has dimension [batch_size, seq_len], where batch_size is the batch size and seq_len is the sentence length. After the word embedding layer is randomly initialized, the word vectors have dimension [batch_size, seq_len, embed_size], where embed_size is the word vector dimension. The word vectors are input to the RNN to obtain the hidden vectors [batch_size, seq_len, hidden_size * 2] of all time steps, where hidden_size is the hidden layer size. The network extracts text features T1, which are fed into a fully connected layer to obtain the final text emotion state T, a vector of dimension [batch_size, num_class], where num_class is the number of predicted categories.
4. The method according to claim 1, characterized in that step 3 is implemented as follows:
step 3.1: a main classifier for multi-modal information fusion is constructed. The image emotion state I and the text emotion state T are concatenated and fed into the main classifier to obtain a main classification result (Class) of dimension 1 × 4;
step 3.2: the image feature weights and text feature weights for feature fusion are obtained, and a concatenation operation is performed on the image features and text features along the batch dimension. For the image data, the feature weight is:
where B is the image batch size and C is the number of image data channels. For the text data, the feature weight is:
where B is the text batch size and S is the text length. Both are mapped to the interval [0, 1] by normalization to obtain the new feature Fused_feature:
The new feature is taken as the input of an auxiliary classifier to obtain the auxiliary classification result (Auxiliary).
step 3.3: the fusion layer routes the input vectors to a plurality of nodes by dynamic routing and generates the final fusion vector through vector compression and splicing. First, the input feature vectors pass through a hidden layer:
u_1 = W_1 v_1, u_2 = W_2 v_2,
where v_1 and v_2 are the feature vectors of the input text and image, and W_1 and W_2 are weight matrices. The feature vectors obtained in the previous step are routed to three nodes by dynamic routing:
s_1 = c_11 u_1 + c_12 u_2,
s_2 = c_21 u_1 + c_22 u_2,
s_3 = c_31 u_1 + c_32 u_2,
An auxiliary classification vector of dimension 1 × 4 is generated by compressing and splicing the vectors:
v = Concat(Squash(s_i)),
step 3.4: the main classification result and the auxiliary classification result are fused by a decision-level fusion method, and the final classification result is obtained with a softmax function:
Finally_class = softmax(Auxiliary + Class).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110048664.5A CN112800875A (en) | 2021-01-14 | 2021-01-14 | Multi-mode emotion recognition method based on mixed feature fusion and decision fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110048664.5A CN112800875A (en) | 2021-01-14 | 2021-01-14 | Multi-mode emotion recognition method based on mixed feature fusion and decision fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112800875A true CN112800875A (en) | 2021-05-14 |
Family
ID=75810844
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110048664.5A Pending CN112800875A (en) | 2021-01-14 | 2021-01-14 | Multi-mode emotion recognition method based on mixed feature fusion and decision fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112800875A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673567A (en) * | 2021-07-20 | 2021-11-19 | 华南理工大学 | Panorama emotion recognition method and system based on multi-angle subregion self-adaption |
CN113688938A (en) * | 2021-09-07 | 2021-11-23 | 北京百度网讯科技有限公司 | Method for determining object emotion and method and device for training emotion classification model |
CN113988201A (en) * | 2021-11-03 | 2022-01-28 | 哈尔滨工程大学 | Multi-mode emotion classification method based on neural network |
CN114218380A (en) * | 2021-12-03 | 2022-03-22 | 淮阴工学院 | Multi-mode-based cold chain loading user portrait label extraction method and device |
CN114330454A (en) * | 2022-01-05 | 2022-04-12 | 东北农业大学 | Live pig cough sound identification method based on DS evidence theory fusion characteristics |
CN115034257A (en) * | 2022-05-09 | 2022-09-09 | 西北工业大学 | Cross-modal information target identification method and device based on feature fusion |
CN116383426A (en) * | 2023-05-30 | 2023-07-04 | 深圳大学 | Visual emotion recognition method, device, equipment and storage medium based on attribute |
CN116543283A (en) * | 2023-07-05 | 2023-08-04 | 合肥工业大学 | Multimode target detection method considering modal uncertainty |
CN116580436A (en) * | 2023-05-08 | 2023-08-11 | 长春理工大学 | Lightweight convolutional network facial emotion recognition method with auxiliary classifier |
CN116994069A (en) * | 2023-09-22 | 2023-11-03 | 武汉纺织大学 | Image analysis method and system based on multi-mode information |
CN117235605A (en) * | 2023-11-10 | 2023-12-15 | 湖南马栏山视频先进技术研究院有限公司 | Sensitive information classification method and device based on multi-mode attention fusion |
- 2021-01-14: CN application CN202110048664.5A filed; published as CN112800875A (en); status: Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109508640A (en) * | 2018-10-12 | 2019-03-22 | MIGU Culture Technology Co., Ltd. | Crowd sentiment analysis method, apparatus and storage medium |
US20190311188A1 (en) * | 2018-12-05 | 2019-10-10 | Sichuan University | Face emotion recognition method based on dual-stream convolutional neural network |
CN109934260A (en) * | 2019-01-31 | 2019-06-25 | Institute of Information Engineering, Chinese Academy of Sciences | Image, text and data fusion sentiment classification method and device based on random forest |
CN110674339A (en) * | 2019-09-18 | 2020-01-10 | Beijing University of Technology | Chinese song emotion classification method based on multi-modal fusion |
CN110826336A (en) * | 2019-09-18 | 2020-02-21 | South China Normal University | Emotion classification method, system, storage medium and device |
CN111881291A (en) * | 2020-06-19 | 2020-11-03 | Shandong Normal University | Text emotion classification method and system |
Non-Patent Citations (2)
Title |
---|
Zhang Ge: "Research on Multimodal Continuous Dimensional Emotion Recognition", China Master's Theses Full-text Database, Information Science and Technology series * |
Xu Zhidong et al.: "Research on Aspect-level Sentiment Classification Based on Capsule Networks", Chinese Journal of Intelligent Science and Technology * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673567B (en) * | 2021-07-20 | 2023-07-21 | South China University of Technology | Panorama emotion recognition method and system based on multi-angle sub-region self-adaption |
CN113673567A (en) * | 2021-07-20 | 2021-11-19 | South China University of Technology | Panorama emotion recognition method and system based on multi-angle sub-region self-adaption |
CN113688938A (en) * | 2021-09-07 | 2021-11-23 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method for determining object emotion, and method and device for training emotion classification model |
CN113688938B (en) * | 2021-09-07 | 2023-07-28 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method for determining object emotion, and method and device for training emotion classification model |
CN113988201A (en) * | 2021-11-03 | 2022-01-28 | Harbin Engineering University | Multi-modal emotion classification method based on neural network |
CN113988201B (en) * | 2021-11-03 | 2024-04-26 | Harbin Engineering University | Multi-modal emotion classification method based on neural network |
CN114218380A (en) * | 2021-12-03 | 2022-03-22 | Huaiyin Institute of Technology | Multi-modal cold chain loading user portrait label extraction method and device |
CN114218380B (en) * | 2021-12-03 | 2022-07-29 | Huaiyin Institute of Technology | Multi-modal cold chain loading user portrait label extraction method and device |
CN114330454A (en) * | 2022-01-05 | 2022-04-12 | Northeast Agricultural University | Live pig cough sound identification method based on DS evidence theory fusion features |
CN115034257A (en) * | 2022-05-09 | 2022-09-09 | Northwestern Polytechnical University | Cross-modal information target identification method and device based on feature fusion |
CN116580436A (en) * | 2023-05-08 | 2023-08-11 | Changchun University of Science and Technology | Lightweight convolutional network facial emotion recognition method with auxiliary classifier |
CN116383426A (en) * | 2023-05-30 | 2023-07-04 | Shenzhen University | Attribute-based visual emotion recognition method, device, equipment and storage medium |
CN116383426B (en) * | 2023-05-30 | 2023-08-22 | Shenzhen University | Attribute-based visual emotion recognition method, device, equipment and storage medium |
CN116543283A (en) * | 2023-07-05 | 2023-08-04 | Hefei University of Technology | Multimodal target detection method accounting for modal uncertainty |
CN116543283B (en) * | 2023-07-05 | 2023-09-15 | Hefei University of Technology | Multimodal target detection method accounting for modal uncertainty |
CN116994069A (en) * | 2023-09-22 | 2023-11-03 | Wuhan Textile University | Image analysis method and system based on multi-modal information |
CN116994069B (en) * | 2023-09-22 | 2023-12-22 | Wuhan Textile University | Image analysis method and system based on multi-modal information |
CN117235605A (en) * | 2023-11-10 | 2023-12-15 | Hunan Malanshan Video Advanced Technology Research Institute Co., Ltd. | Sensitive information classification method and device based on multi-modal attention fusion |
CN117235605B (en) * | 2023-11-10 | 2024-02-02 | Hunan Malanshan Video Advanced Technology Research Institute Co., Ltd. | Sensitive information classification method and device based on multi-modal attention fusion |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112800875A (en) | Multi-mode emotion recognition method based on mixed feature fusion and decision fusion | |
CN108596039B (en) | Bimodal emotion recognition method and system based on 3D convolutional neural network | |
CN110046656B (en) | Multi-mode scene recognition method based on deep learning | |
Mino et al. | LoGAN: Generating logos with a generative adversarial neural network conditioned on color | |
CN111292765B (en) | Bimodal emotion recognition method integrating multiple deep learning models | |
CN112818861A (en) | Emotion classification method and system based on multi-mode context semantic features | |
CN112131383A (en) | Specific target emotion polarity classification method | |
CN109829499B (en) | Image-text data fusion emotion classification method and device based on same feature space | |
CN113343974B (en) | Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement | |
CN111506732A (en) | Text multi-level label classification method | |
CN112580555B (en) | Spontaneous micro-expression recognition method | |
CN114662497A (en) | False news detection method based on cooperative neural network | |
CN114092742A (en) | Small sample image classification device and method based on multiple angles | |
CN112183465A (en) | Social relationship identification method based on character attributes and context | |
CN114283482A (en) | Facial expression recognition model of double-branch generation countermeasure network based on self-attention feature filtering classifier | |
CN113128284A (en) | Multi-mode emotion recognition method and device | |
CN111859925B (en) | Emotion analysis system and method based on probability emotion dictionary | |
Ruan et al. | Facial expression recognition in facial occlusion scenarios: A path selection multi-network | |
Sun et al. | Weak supervised learning based abnormal behavior detection | |
CN116758451A (en) | Audio-visual emotion recognition method and system based on multi-scale and global cross attention | |
CN112541469B (en) | Crowd counting method and system based on self-adaptive classification | |
CN112613405B (en) | Method for recognizing actions at any visual angle | |
Majumder et al. | Variational fusion for multimodal sentiment analysis | |
Soysal et al. | Facial action unit recognition using data mining integrated deep learning | |
Almana et al. | Real-time Arabic Sign Language Recognition using CNN and OpenCV |
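The patent title refers to a hybrid of feature-level (early) fusion and decision-level (late) fusion for multimodal emotion recognition. As a rough, illustrative sketch only (not the claimed method; the feature sizes, class count, fusion weights, and random stand-in classifiers below are all invented placeholders), early fusion concatenates per-modality feature vectors into one joint representation, while late fusion combines the probability outputs of separate classifiers:

```python
import math
import random

random.seed(0)

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative stand-ins for per-modality features
# (e.g. an image embedding and a text embedding).
image_feat = [random.gauss(0, 1) for _ in range(128)]
text_feat = [random.gauss(0, 1) for _ in range(64)]

# Feature-level (early) fusion: concatenate modality features
# into a single joint vector for a shared classifier.
fused_feat = image_feat + text_feat  # length 128 + 64 = 192

# Hypothetical classifier outputs over 3 emotion classes,
# one classifier per modality plus one on the fused features.
p_image = softmax([random.gauss(0, 1) for _ in range(3)])
p_text = softmax([random.gauss(0, 1) for _ in range(3)])
p_fused = softmax([random.gauss(0, 1) for _ in range(3)])

# Decision-level (late) fusion: a weighted average of the
# classifiers' probability distributions (weights sum to 1).
weights = [0.3, 0.3, 0.4]
p_final = [
    weights[0] * a + weights[1] * b + weights[2] * c
    for a, b, c in zip(p_image, p_text, p_fused)
]
predicted = max(range(3), key=lambda k: p_final[k])
```

Because each per-classifier distribution sums to 1 and the weights sum to 1, the fused distribution `p_final` is itself a valid probability distribution, which is why weighted averaging is a common baseline for decision-level fusion.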
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20210514 |