CN108921037B - Emotion recognition method based on BN-Inception two-stream network - Google Patents

Emotion recognition method based on BN-Inception two-stream network

Info

Publication number
CN108921037B
CN108921037B (application CN201810579049.5A)
Authority
CN
China
Prior art keywords
network
SPP
two-stream
individual
Inception
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810579049.5A
Other languages
Chinese (zh)
Other versions
CN108921037A (en)
Inventor
卿粼波 (Qing Linbo)
王露 (Wang Lu)
滕奇志 (Teng Qizhi)
何小海 (He Xiaohai)
熊文诗 (Xiong Wenshi)
吴晓红 (Wu Xiaohong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201810579049.5A
Publication of CN108921037A
Application granted
Publication of CN108921037B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an individual emotion recognition method based on posture information, in which individual posture is studied with deep learning methods to judge the emotion of an individual. The method comprises the following steps: first, a two-stream network model based on BN-Inception is introduced, and the static and dynamic features of the input sequence are extracted by learning from the original images and the optical flow images; then spatial pyramid pooling (SPP) is added on top of the two-stream network so that images can be fed to the network at their original size, reducing the impact of deformation on model performance. The invention first uses the two-stream network to learn the spatio-temporal features of the input sequence and introduces pyramid pooling to retain the original information of the video frames, so that the network can effectively learn the features of individual posture and emotion and achieve a higher recognition rate.

Description

Emotion recognition method based on BN-Inception two-stream network
Technical Field
The invention relates to the problem of emotion recognition in the field of deep learning, and in particular to an emotion analysis method based on a BN-Inception + SPP two-stream network.
Background
Emotion is a state that integrates human feelings, thoughts and behaviors, and it plays an important role in interpersonal communication. A person's emotional state can usually be judged from facial expression, but in certain settings, such as surveillance viewpoints or scenes where the face is occluded, a clear view of the face is not always available. In fact, real emotion is expressed not only through facial expressions; a person's body movements can also convey emotional information. The research of the present invention therefore focuses mainly on video-based emotion recognition from individual posture.
Emotion recognition is an important research topic and direction in computer vision; many authoritative international journals and top conferences carry related topics and content, and many well-known universities abroad offer related courses. Traditional video-based emotion recognition methods rely mainly on manually selected features, which is time-consuming and labor-intensive, yields model parameters with poor generalization, and limits the attainable recognition quality. Deep learning is an important component of the development of artificial intelligence and has become a very active research direction in the field in recent years. It has produced major breakthroughs in many areas (such as image recognition and speech recognition), and in video analysis in particular it offers high recognition rates and strong generalization. The present method therefore exploits the advantages of deep learning in video analysis to study individual emotion recognition in video.
Emotion recognition based on posture information has developed only in recent years, and related research is relatively scarce, concentrating mainly on traditional algorithms. Li et al. [1] perform behavior recognition and classification using raw skeleton coordinates and skeleton motion; Piana et al. [2] propose an automatic emotion recognition model and system based on whole-body motion, used to help autistic children learn to recognize and express emotion through whole-body movement. Others combine the motion features of human posture with higher-level kinematic geometric features for clustering and classification. Crenn et al. [3] obtain low-level features such as motion data from 3D human skeleton sequences, decompose the features into three types (geometric, motion and Fourier features), compute meta-features of the low-level features (such as means and standard deviations), and finally classify the meta-features with a classifier. Deep learning greatly improves on traditional methods in both recognition time and accuracy, but because emotion data sets related to posture are scarce, research on individual emotion recognition from posture information using deep learning remains rare.
Disclosure of Invention
The invention aims to provide an individual emotion recognition method based on posture, which combines deep learning with human posture in video, makes full use of the strengths of the BN-Inception + SPP network structure, and introduces a two-stream network to perform video-based individual emotion recognition, effectively learning the emotional characteristics of individual posture and achieving a higher recognition rate.
For convenience of explanation, the following concepts are first introduced:
An optical flow method: a simple and practical way of expressing image motion. Optical flow is generally defined as the apparent motion of brightness patterns in an image sequence, i.e. the projection of the motion velocity of points on the surface of objects in space onto the imaging plane of the visual sensor.
A convolutional neural network: a multilayer feed-forward neural network in which each layer consists of several two-dimensional planes whose neurons operate independently; a convolutional neural network contains convolutional layers and pooling layers.
Two-stream convolutional neural network: a network designed for extracting video behavior features. It takes single-frame RGB images and optical flow images derived from the video data as its two inputs, so as to represent the spatial appearance of the behaving subject and extract the temporal dynamics of the behavior.
Spatial pyramid pooling (SPP): an SPP layer combines several down-sampling layers that partition the input feature map from coarse to fine and convert it into a fixed-length feature vector, allowing the layer to extract local information at multiple scales (a minimal sketch follows these definitions).
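As an illustration of the SPP idea only (the patent's experiments were run in Caffe, and the pyramid configuration is not stated here), the following minimal PyTorch sketch pools at assumed levels {1, 2, 4}; the class name and level choice are hypothetical:

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Pool an input feature map at several grid sizes and
    concatenate the results into one fixed-length vector."""
    def __init__(self, levels=(1, 2, 4)):   # assumed pyramid levels
        super().__init__()
        self.pools = nn.ModuleList(
            nn.AdaptiveMaxPool2d(n) for n in levels)

    def forward(self, x):                   # x: (N, C, H, W), any H and W
        feats = [p(x).flatten(start_dim=1) for p in self.pools]
        return torch.cat(feats, dim=1)      # (N, C * sum(n * n))

# Feature maps of different spatial sizes yield the same output length.
spp = SpatialPyramidPooling()
for h, w in [(7, 7), (9, 13)]:
    out = spp(torch.randn(2, 1024, h, w))
    print(out.shape)                        # torch.Size([2, 21504]) both times
```

Because the output length depends only on the channel count and the pyramid levels, the fully connected layer that follows can accept video frames at their original size, which is exactly the property the method relies on.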
The invention specifically adopts the following technical scheme:
the emotion recognition method based on the BN-acceptance double-flow network is mainly characterized by comprising the following steps:
1. the individual pose data set is divided into four mood categories: boredom (bored), agitation (excited), pneumatosis (free), relaxation (relax);
2. adding Space Pyramid Pooling (SPP) in front of a full connection layer of the BN-acceptance dual-flow network, and respectively training a Space-time network on a data set;
The method mainly comprises the following steps:
(1) the individual posture sequence data set is divided into four emotion categories: boredom, excitement, anger and relaxation;
(2) an optical flow image sequence corresponding to the data set is generated with the optical flow algorithm of reference [4] to represent the motion characteristics of individual posture;
(3) the original data set and the optical flow data set are divided in proportion into a training set, a validation set and a test set;
(4) a two-stream convolutional neural network model based on BN-Inception is introduced, an SPP layer is added before the fully connected layer to optimize the BN-Inception network, the spatial and temporal networks are trained with the training and validation sets, and verified with the test set;
(5) the BN-Inception + SPP spatial-stream and temporal-stream networks are fused by averaging to obtain the accuracy ACC (accuracy) and the macro average precision MAP (macro Average precision) on the test set (a metric sketch follows these steps).
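To make step (5) concrete, the sketch below fuses the two streams' per-class scores by averaging and then computes the two reported metrics. MAP is computed here as the macro-averaged precision over the four emotion classes, matching the wording "macro Average precision" above; the function name, the use of NumPy and scikit-learn, and the toy data are illustrative assumptions rather than the patent's actual evaluation code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

def fuse_and_score(spatial_probs, temporal_probs, labels):
    """Average fusion of the two streams' softmax scores,
    then accuracy (ACC) and macro average precision (MAP)."""
    fused = (spatial_probs + temporal_probs) / 2.0   # (N, 4) score average
    pred = fused.argmax(axis=1)                      # fused class decision
    acc = accuracy_score(labels, pred)
    map_ = precision_score(labels, pred, average="macro", zero_division=0)
    return acc, map_

# Toy usage with random stand-in scores for the four emotion classes.
rng = np.random.default_rng(0)
spatial = rng.dirichlet(np.ones(4), size=100)    # stand-in spatial-stream softmax
temporal = rng.dirichlet(np.ones(4), size=100)   # stand-in temporal-stream softmax
labels = rng.integers(0, 4, size=100)
print(fuse_and_score(spatial, temporal, labels))
```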
Drawings
FIG. 1 is a schematic diagram of the overall framework for emotion recognition based on the BN-Inception + SPP two-stream network.
FIGS. 2-a and 2-b are the accuracy confusion matrices obtained by the present invention on the test set without the SPP layer, where 2-a is the test matrix of the spatial-stream BN-Inception network and 2-b is that of the temporal-stream BN-Inception network.
FIGS. 3-a and 3-b are the accuracy confusion matrices obtained on the test set with the SPP layer added, where 3-a is the test matrix of the spatial-stream BN-Inception + SPP network and 3-b is that of the temporal-stream BN-Inception + SPP network.
FIG. 4 shows the ACC and MAP obtained by the present invention on the test set after average fusion of the BN-Inception + SPP spatial and temporal streams.
Detailed Description
The present invention is described in further detail below with reference to the drawings and examples. It should be noted that the following examples are intended only to illustrate the invention and should not be construed as limiting its scope; insubstantial modifications and adaptations made by those skilled in the art in light of the above disclosure still fall within the scope of the invention.
In fig. 1, an emotion recognition method based on the BN-Inception + SPP two-stream network comprises the following steps:
(1) first, after the individual data set captured in a public space is obtained, an optical flow image sequence for the original data set is generated with the optical flow algorithm of reference [4] to represent the motion characteristics of individual posture (a stand-in optical flow sketch follows this list);
(2) the original data set and the resulting optical flow data set are divided in proportion into test, validation and training sets, and the corresponding emotion categories are assigned;
(3) with the SPP layer shown in FIG. 1 removed, the training and validation data are fed into the spatial and temporal networks separately for learning to obtain trained models, and the effect is verified on the test data;
(4) the SPP layer is added, the training set is fed into the spatial and temporal networks at its original size for learning to obtain trained models, and the effect is verified on the test data;
(5) after average fusion of the BN-Inception + SPP spatial-stream and temporal-stream networks, the ACC and MAP on the test set are obtained.
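The optical flow in step (1) uses the high-accuracy method of Brox et al. [4]. That algorithm is not bundled with standard OpenCV, so the minimal sketch below substitutes OpenCV's Farnebäck dense flow purely to illustrate turning a clip into the optical flow image sequence consumed by the temporal stream; the file names and the displacement-to-image mapping are assumptions:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("posture_clip.avi")   # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Dense flow between consecutive frames (Farnebäck stand-in for [4]).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Map the x/y displacement fields to 8-bit images, one per component,
    # to serve as the temporal-stream input frames.
    fx = cv2.normalize(flow[..., 0], None, 0, 255, cv2.NORM_MINMAX)
    fy = cv2.normalize(flow[..., 1], None, 0, 255, cv2.NORM_MINMAX)
    cv2.imwrite(f"flow_x_{idx:04d}.jpg", fx.astype(np.uint8))
    cv2.imwrite(f"flow_y_{idx:04d}.jpg", fy.astype(np.uint8))
    prev_gray = gray
    idx += 1
cap.release()
```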
the convolutional neural networks of two channels of spatial flow and time flow are separately trained by Caffe, and parameters of the time flow and spatial flow networks are set through experiments, as shown in Table 1. Because the number of the established samples of the individual posture emotion data set is small, in order to prevent the overfitting phenomenon, a method of data expansion and adding a Dropout layer in a network is adopted.
TABLE 1 Training parameter settings (the original table is an image; the values below are those recited in claim 1)
Basic learning rate (base_lr): 0.00000001
Learning rate change index (gamma): 0.01
Weight decay (weight_decay): 0.005
Maximum number of iterations (max_iter): 150000
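For illustration, these parameters would appear in a Caffe solver file roughly as follows; only the four values marked "from claim 1" come from the patent, while the net path, lr_policy, stepsize, momentum and snapshot settings are assumptions:

```
# solver.prototxt (sketch; only the values marked below come from claim 1)
net: "models/bn_inception_spp_train_val.prototxt"   # hypothetical path
base_lr: 0.00000001         # basic learning rate, from claim 1
lr_policy: "step"           # assumed policy
gamma: 0.01                 # learning rate change index, from claim 1
stepsize: 50000             # assumed step interval
momentum: 0.9               # assumed
weight_decay: 0.005         # from claim 1
max_iter: 150000            # from claim 1
snapshot_prefix: "snapshots/bn_inception_spp"       # hypothetical
solver_mode: GPU
```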
References:
[1] Li C, Zhong Q, Xie D, et al. Skeleton-based Action Recognition with Convolutional Neural Networks[C]//IEEE International Conference on Multimedia & Expo Workshops (ICMEW). 2017: 597-600.
[2] Piana S, Staglianò A, Odone F, et al. Adaptive Body Gesture Representation for Automatic Emotion Recognition[J]. ACM Transactions on Interactive Intelligent Systems (TiiS), 2016, 6(1): 6.
[3] Crenn A, Khan R A, Meyer A, et al. Body Expression Recognition from Animated 3D Skeleton[C]//International Conference on 3D Imaging. IEEE, 2017: 1-7.
[4] Brox T, Bruhn A, Papenberg N, et al. High Accuracy Optical Flow Estimation Based on a Theory for Warping[C]//European Conference on Computer Vision (ECCV). 2004: 25-36.

Claims (3)

1. An individual emotion recognition method based on a BN-Inception + SPP two-stream network, characterized by comprising the following steps:
a. the individual posture data set is divided into four emotion categories: bored, excited, angry and relaxed, and the emotion category of each sequence is given;
b. spatial pyramid pooling (SPP) is added before the fully connected layer of the BN-Inception two-stream network, and the spatial and temporal networks are trained separately on the data set;
c. the training parameters of the BN-Inception + SPP two-stream network are: basic learning rate base_lr: 0.00000001; learning rate change index gamma: 0.01; weight decay weight_decay: 0.005; maximum number of iterations max_iter: 150000;
the method mainly comprises the following steps:
(1) the data set is processed with an optical flow algorithm to generate the corresponding optical flow image sequence representing the motion characteristics of individual posture;
(2) the data set is divided into a training set, a validation set and a test set, and the emotion category of each sequence is given;
(3) a two-stream convolutional neural network model based on BN-Inception is introduced, an SPP layer is added before the fully connected layer to optimize the BN-Inception network, the spatial and temporal networks are trained with the training and validation sets, and verified with the test set;
(4) the BN-Inception + SPP spatial-stream and temporal-stream networks are fused by averaging to obtain the accuracy ACC and the macro average precision MAP on the test set.
2. The individual emotion recognition method based on the BN-Inception + SPP two-stream network according to claim 1, wherein in step (3) the spatio-temporal features of the data set are learned separately by the two streams of the network.
3. The individual emotion recognition method based on the BN-Inception + SPP two-stream network according to claim 1, wherein in step (3) the SPP layer is added before the fully connected layer of the BN-Inception two-stream network, so that the training set is fed into the network at its original size to avoid the loss of motion information caused by a fixed input size, and the spatial and temporal networks are then trained on the data set respectively.
CN201810579049.5A 2018-06-07 2018-06-07 Emotion recognition method based on BN-Inception two-stream network Active CN108921037B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810579049.5A CN108921037B (en) 2018-06-07 2018-06-07 Emotion recognition method based on BN-Inception two-stream network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810579049.5A CN108921037B (en) 2018-06-07 2018-06-07 Emotion recognition method based on BN-Inception two-stream network

Publications (2)

Publication Number Publication Date
CN108921037A (en) 2018-11-30
CN108921037B (en) 2022-06-03

Family

ID=64418934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810579049.5A Active CN108921037B (en) 2018-06-07 2018-06-07 Emotion recognition method based on BN-Inception two-stream network

Country Status (1)

Country Link
CN (1) CN108921037B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766856B (en) * 2019-01-16 2022-11-15 华南农业大学 Method for recognizing the postures of lactating sows with a two-stream RGB-D Faster R-CNN
CN109886160B (en) * 2019-01-30 2021-03-09 浙江工商大学 Face recognition method under unconstrained conditions
CN109814565A (en) * 2019-01-30 2019-05-28 上海海事大学 Intelligent navigation control method for unmanned boats based on spatio-temporal two-stream data-driven deep Q-learning
CN110147729A (en) * 2019-04-16 2019-08-20 深圳壹账通智能科技有限公司 User emotion recognition method and apparatus, computer device and storage medium
CN110175596B (en) * 2019-06-04 2022-04-22 重庆邮电大学 Micro-expression recognition and interaction method for virtual learning environments based on a two-stream convolutional neural network
CN112131908B (en) * 2019-06-24 2024-06-11 北京眼神智能科技有限公司 Action recognition method, apparatus, storage medium and device based on a two-stream network
CN110414561A (en) * 2019-06-26 2019-11-05 武汉大学 Construction method of a natural scene data set suitable for machine vision
CN111968091B (en) * 2020-08-19 2022-04-01 南京图格医疗科技有限公司 Method for detecting and classifying lesion regions in clinical images


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050265580A1 (en) * 2004-05-27 2005-12-01 Paul Antonucci System and method for a motion visualizer
CN103544963B (en) * 2013-11-07 2016-09-07 东南大学 Speech emotion recognition method based on kernel semi-supervised discriminant analysis
CN104732203B (en) * 2015-03-05 2019-03-26 中国科学院软件研究所 Emotion recognition and tracking method based on video information
CN106295568B (en) * 2016-08-11 2019-10-18 上海电力学院 Human natural emotion recognition method based on the bimodal combination of expression and behavior
CN106897671B (en) * 2017-01-19 2020-02-25 济南中磁电子科技有限公司 Micro-expression recognition method based on optical flow and Fisher vector coding
CN107784114A (en) * 2017-11-09 2018-03-09 广东欧珀移动通信有限公司 Recommendation method, apparatus, terminal and storage medium for facial expression images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102663429A (en) * 2012-04-11 2012-09-12 上海交通大学 Method for motion pattern classification and action recognition of moving target
CN107368798A (en) * 2017-07-07 2017-11-21 四川大学 Crowd emotion recognition method based on deep learning
CN107491731A (en) * 2017-07-17 2017-12-19 南京航空航天大学 Ground moving target detection and recognition method for precision strike
CN107944442A (en) * 2017-11-09 2018-04-20 北京智芯原动科技有限公司 Object detection device and method based on an improved convolutional neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kaiming He et al.; Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition; IEEE Transactions on Pattern Analysis and Machine Intelligence; 2015-01-09; p. 3 *
Chen Shengdi et al.; Human action recognition method based on an improved deep convolutional neural network (基于改进的深度卷积神经网络的人体动作识别方法); Application Research of Computers (《计算机应用研究》); 2018-02-07; pp. 2-3, 5 *

Also Published As

Publication number Publication date
CN108921037A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108921037B (en) Emotion recognition method based on BN-Inception two-stream network
CN108520535B (en) Object classification method based on depth recovery information
Liu et al. Two-stream 3d convolutional neural network for skeleton-based action recognition
Wang et al. Large-scale isolated gesture recognition using convolutional neural networks
Hu et al. 3D separable convolutional neural network for dynamic hand gesture recognition
CN108596039B (en) Bimodal emotion recognition method and system based on 3D convolutional neural network
CN105224942B (en) RGB-D image classification method and system
CN109919031A (en) Human behavior recognition method based on a deep neural network
CN111274921B (en) Method for recognizing human body behaviors by using gesture mask
Rioux-Maldague et al. Sign language fingerspelling classification from depth and color images using a deep belief network
CN109190479A (en) Video sequence expression recognition method based on interactive deep learning
CN110580500A (en) Character interaction-oriented network weight generation few-sample image classification method
Li et al. Sign language recognition based on computer vision
CN110046544A (en) Digital gesture identification method based on convolutional neural networks
CN113221663A (en) Real-time sign language intelligent identification method, device and system
CN107066979A (en) Human motion recognition method based on depth information and multi-dimensional convolutional neural networks
CN109086664A (en) Multi-form gesture recognition method based on static-dynamic fusion
CN112906520A (en) Gesture coding-based action recognition method and device
CN110490915A (en) Point cloud registration method based on convolutional restricted Boltzmann machines
CN110889335B (en) Human skeleton double interaction behavior identification method based on multichannel space-time fusion network
CN111401116B (en) Bimodal emotion recognition method based on enhanced convolution and space-time LSTM network
Agrawal et al. Redundancy removal for isolated gesture in Indian sign language and recognition using multi-class support vector machine
Özbay et al. 3D Human Activity Classification with 3D Zernike Moment Based Convolutional, LSTM-Deep Neural Networks.
Tur et al. Isolated sign recognition with a siamese neural network of RGB and depth streams
Ito et al. Efficient and accurate skeleton-based two-person interaction recognition using inter-and intra-body graphs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant