CN114582002A - Facial expression recognition method combining attention module and second-order pooling mechanism - Google Patents

Facial expression recognition method combining attention module and second-order pooling mechanism

Info

Publication number
CN114582002A
Authority
CN
China
Prior art keywords
face
image
layer
attention module
facial expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210403298.5A
Other languages
Chinese (zh)
Inventor
周婷
陈劲全
余卫宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202210403298.5A
Publication of CN114582002A
Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression recognition method combining an attention module and a second-order pooling mechanism, and relates to the field of deep learning. The method comprises the following steps: acquiring a face image; preprocessing the face image, where the preprocessing comprises face detection and alignment, data enhancement, and image normalization; and performing feature extraction on the preprocessed face image and completing expression classification. The invention effectively converts face pictures disturbed by illumination, head pose, occlusion, and other factors in natural environments into front-facing, unoccluded face pictures with suitable contrast, thereby mitigating the interference of expression-irrelevant factor variables on expression recognition in real-world environments.

Description

Facial expression recognition method combining attention module and second-order pooling mechanism
Technical Field
The invention relates to the field of deep learning, in particular to a facial expression recognition method combining an attention module and a second-order pooling mechanism.
Background
Facial expression is an important channel for conveying information in human communication. Advances in facial expression recognition technology can effectively promote related fields such as pattern recognition and image processing, giving it high scientific research value; its application scenarios include security surveillance, fatigue driving monitoring, criminal investigation, and human-computer interaction. With the rapid development of large-scale image data and computer hardware (especially GPUs), deep learning methods have achieved breakthrough results in image understanding; deep neural networks have strong feature expression capability, can learn discriminative features, and are increasingly applied to automatic facial expression recognition tasks. According to the type of data processed, deep facial expression recognition methods can be roughly divided into two categories: deep facial expression recognition networks based on static images, and those based on video.
In contrast to the current advanced static-image methods, video-based deep facial expression recognition mainly uses basic temporal networks such as LSTM and C3D to analyze the temporal information carried in a video sequence, or uses facial key-point trajectories to capture the dynamic changes of facial components across consecutive frames, combining a spatial network and a temporal network into a parallel multi-network. In addition, expression recognition can be extended to scenes of greater practical value by combining other expression models, such as the facial action unit model, and other multimedia modalities, such as audio and human physiological signals.
From 2013 onwards, expression recognition competitions such as FER2013 and EmotiW have collected relatively abundant training samples from challenging real-world scenes, facilitating the transition of facial expression recognition from laboratory-controlled environments to natural environments. In terms of research subjects, the field is rapidly developing from posed laboratory expressions to spontaneous real-world expressions, from long-lasting exaggerated expressions to instantaneous micro-expressions, and from basic expression classification to complex expression analysis.
As the task of facial expression recognition gradually moves from laboratory-controlled environments to challenging real-world environments, current deep facial expression recognition systems must address several issues:
1) overfitting problems due to lack of adequate training data;
2) interference problems caused by other expression independent factor variables (such as illumination, head pose and identity characteristics) in the real world environment;
3) how to improve the recognition accuracy of facial expression recognition systems in real environments.
Disclosure of Invention
In view of the above, the present invention provides a facial expression recognition method combining an attention module and a second-order pooling mechanism to solve the above problems in the background art.
In order to achieve the purpose, the invention adopts the following technical scheme:
a facial expression recognition method combining an attention module and a second-order pooling mechanism comprises the following steps:
acquiring a face image;
preprocessing a face image, wherein the preprocessing of the face image comprises face detection and alignment, data enhancement and image normalization;
and performing feature extraction on the preprocessed face image and completing expression classification.
Optionally, the face detection and alignment includes face detection, key point positioning, and face alignment, and specifically includes:
the input of the face detection module is a face expression picture, and the output is a detected face area;
performing face key point coordinate positioning according to the face detection area, and importing a five-point key point detection model by using a face key point detection interface in a dlib library to obtain five-point key point coordinates of the face;
and carrying out face alignment by using the coordinates of the key points of the five points.
Optionally, the calculation process of the face alignment is as follows:
firstly, from the four eye-corner coordinates (x1, y1), (x2, y2) of the left eye and (x3, y3), (x4, y4) of the right eye, the center coordinates of the left and right eyes are calculated:

$$x_{le} = \frac{x_1 + x_2}{2}, \qquad y_{le} = \frac{y_1 + y_2}{2}$$

$$x_{re} = \frac{x_3 + x_4}{2}, \qquad y_{re} = \frac{y_3 + y_4}{2}$$

after the two eye centers are obtained, they are first connected and the included angle θ between the connecting line and the horizontal is calculated; the rotation center coordinates (x_center, y_center) are then obtained by averaging the two eye centers and the point below the nose (x_nose, y_nose):

$$\theta = \arctan\frac{y_{re} - y_{le}}{x_{re} - x_{le}}$$

$$x_{center} = \frac{x_{le} + x_{re} + x_{nose}}{3}, \qquad y_{center} = \frac{y_{le} + y_{re} + y_{nose}}{3}$$

combining the rotation center coordinates (x_center, y_center) with the angle θ, an affine transformation matrix is obtained using OpenCV's affine-matrix interface, and an OpenCV interface function is called to apply the affine transformation to the image, yielding a face-aligned photo.
Optionally, the data enhancement dynamically applies random geometric or color transformation operations to the input image at the data-reading stage through the transforms interface of the deep learning framework PyTorch, and the transformed image is then input into the network for training, thereby realizing data expansion.
Optionally, the image normalization divides each pixel value of the image by 255, so that every pixel value of the normalized image lies in [0, 1].
Optionally, facial expression feature extraction is realized with an 18-layer resnet network; a softmax layer added at the end of the resnet network normalizes the network output into probability values over 7 expression classes, the maximum of which is the classification result.
Optionally, feature extraction and expression classification are implemented with an end-to-end deep neural network having the following structure: the first layer is a convolutional layer with a 7 × 7 kernel and 64 channels; the second layer is a pooling layer with a 3 × 3 pooling kernel and 64 channels; eight residual blocks fused with the convolutional attention module follow, finally outputting a 512-channel feature map; a second-order pooling layer then performs feature aggregation, and a fully connected layer and a softmax layer finally produce the classification result.
Compared with the prior art, the facial expression recognition method disclosed herein, which combines an attention module with a second-order pooling mechanism, has the following beneficial effects:
(1) By applying face detection and alignment followed by image normalization, the method effectively converts face pictures disturbed by illumination, head pose, occlusion, and other factors in natural environments into front-facing, unoccluded pictures with suitable contrast, thereby mitigating the interference of expression-irrelevant factor variables on recognition in real-world environments.
(2) Through data enhancement, operations such as random cropping, rotation, flipping, noise injection, and color changes are applied dynamically to the input image at the data-reading stage during network training, expanding the data to several times its original size and yielding better data diversity. This effectively alleviates the overfitting caused by the lack of sufficient training data.
(3) The resnet network structure is improved to better suit expression feature extraction: a convolutional attention module (CBAM) makes the network focus more on feature extraction of the object to be recognized, and a second-order pooling mechanism extracts second-order expression features that better capture the degree of distortion of facial expression muscles. Together these improve the network model's ability to extract facial expression features, addressing the problem of improving the recognition accuracy of facial expression recognition systems in real environments.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a diagram of the results of the face detection of the present invention;
FIGS. 3 a-3 b are diagrams of the detection results of the key points of the face according to the present invention;
FIGS. 4 a-4 b are schematic diagrams illustrating a front-to-back comparison of face alignment according to the present invention;
FIG. 5 is a block diagram of the improved resnet18 of the present invention;
FIG. 6 is a diagram of the ResBlock + CBAM structure of the present invention;
FIG. 7 is a block diagram of a channel attention module of the present invention;
FIG. 8 is a block diagram of a spatial attention module according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a facial expression recognition method combining an attention module and a second-order pooling mechanism, comprising two stages: data preprocessing, and feature extraction with expression classification. The input of the algorithm is a facial expression picture and the output is the picture's classification result, one of seven classes: anger, disgust, fear, happiness, sadness, surprise, and neutral.
Data preprocessing: the data preprocessing module of this scheme comprises three steps: face detection and alignment, data enhancement, and image normalization. Face detection and alignment itself comprises three steps: face detection, key-point localization, and face alignment. Face detection is realized with the dlib library's face detector interface based on HOG features and a linear classifier (dlib.get_frontal_face_detector) and its CNN-based face detector interface (dlib.cnn_face_detection_model_v1). The input of the face detection module is a facial expression picture and the output is the detected face region; the face detection result is shown in FIG. 2.
Next, face key-point coordinate localization is performed on the region detected in the previous step: a five-point key-point detection model is loaded through the dlib library's face key-point detection interface (dlib.shape_predictor) to obtain the five key-point coordinates of the face; the positions of the five points are shown in FIGS. 3a-3b.
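As a concrete illustration of these two steps, the following minimal sketch uses dlib's public interfaces; the model file name ("shape_predictor_5_face_landmarks.dat", dlib's publicly distributed 5-point predictor) and the image path are placeholders, not values fixed by this method.

```python
import cv2
import dlib

# HOG + linear-classifier detector and the 5-point landmark predictor.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_5_face_landmarks.dat")

image = cv2.imread("face.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# The second argument upsamples the image once so smaller faces are found.
for face_rect in detector(gray, 1):
    shape = predictor(gray, face_rect)
    # Five landmarks: two corners of each eye plus one point under the nose.
    points = [(shape.part(i).x, shape.part(i).y) for i in range(5)]
```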
After the coordinates are obtained, face alignment is performed using the key-point coordinates, implemented as follows. Firstly, from the four eye-corner coordinates (x1, y1), (x2, y2) of the left eye and (x3, y3), (x4, y4) of the right eye, the center coordinates of the left and right eyes are calculated:

$$x_{le} = \frac{x_1 + x_2}{2}, \qquad y_{le} = \frac{y_1 + y_2}{2}$$

$$x_{re} = \frac{x_3 + x_4}{2}, \qquad y_{re} = \frac{y_3 + y_4}{2}$$

After the two eye centers are obtained, they are first connected and the included angle θ between the connecting line and the horizontal is calculated; the rotation center coordinates (x_center, y_center) are then obtained by averaging the two eye centers and the point below the nose (x_nose, y_nose):

$$\theta = \arctan\frac{y_{re} - y_{le}}{x_{re} - x_{le}}$$

$$x_{center} = \frac{x_{le} + x_{re} + x_{nose}}{3}, \qquad y_{center} = \frac{y_{le} + y_{re} + y_{nose}}{3}$$
Given the rotation center coordinates and the included angle θ, an affine transformation matrix can be obtained using OpenCV's affine-matrix interface; an OpenCV interface function is then called to apply the affine transformation, and the image is adjusted to 224 × 224 pixels. The result of aligning the face of FIG. 4a is shown in FIG. 4b.
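A sketch of this alignment computation with OpenCV is given below. The landmark index order in the tuple unpacking is an assumption (check the predictor's convention), and cv2.getRotationMatrix2D stands in for the affine-matrix interface mentioned above.

```python
import cv2
import numpy as np

def align_face(image, points):
    # points: five (x, y) landmarks; assumed order: left-eye corners,
    # right-eye corners, point under the nose.
    (x1, y1), (x2, y2), (x3, y3), (x4, y4), (x5, y5) = points
    left_eye = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    right_eye = ((x3 + x4) / 2.0, (y3 + y4) / 2.0)

    # Angle between the line joining the eye centers and the horizontal;
    # arctan2 is the numerically robust form of the arctan in the text.
    theta = np.degrees(np.arctan2(right_eye[1] - left_eye[1],
                                  right_eye[0] - left_eye[0]))

    # Rotation center: mean of the two eye centers and the under-nose point.
    center = ((left_eye[0] + right_eye[0] + x5) / 3.0,
              (left_eye[1] + right_eye[1] + y5) / 3.0)

    M = cv2.getRotationMatrix2D(center, theta, 1.0)      # affine matrix
    aligned = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    return cv2.resize(aligned, (224, 224))               # network input size
```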
The data enhancement dynamically applies random geometric or color transformation operations to the input image at the data-reading stage through PyTorch's transforms interface, and the transformed image is then input into the network for training, realizing data expansion. The image normalization divides each pixel value of the image by 255, so that every pixel value of the normalized image lies between 0 and 1.
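Such a pipeline might look as follows with torchvision.transforms; the specific operations and parameters are illustrative, not the patent's exact configuration. Note that ToTensor() performs the divide-by-255 mapping into [0, 1], which realizes the normalization step.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random cropping
    transforms.RandomRotation(10),                         # small rotations
    transforms.RandomHorizontalFlip(),                     # random flipping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # color changes
    transforms.ToTensor(),  # HWC uint8 -> CHW float32 in [0, 1]
])
```

Passing this object as the transform argument of a torchvision dataset applies a fresh random transformation every time a sample is read, which is what makes the expansion dynamic rather than a fixed enlarged dataset.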
Feature extraction and expression classification: this module uses an improved 18-layer resnet network, shown in FIG. 5, to extract facial expression features; a softmax layer added at the end of the resnet network normalizes the output into probability values over the 7 expression classes, the maximum of which gives the classification result, thereby classifying the expression.
The whole module uses an end-to-end deep neural network to realize feature extraction and expression classification; its structure is shown in FIG. 5. The first layer is a convolutional layer with a 7 × 7 kernel and 64 channels; the second layer is a pooling layer with a 3 × 3 pooling kernel and 64 channels; eight residual blocks fused with the convolutional attention module follow, finally outputting a 512-channel feature map; a second-order pooling layer then performs feature aggregation, and a fully connected layer and a softmax layer finally produce the classification result. The network's input is the facial expression picture obtained after the data preprocessing of step 1. Features of the facial expression from low level to high level are extracted through the network's convolutional and pooling layers, converted by the fully connected layer into a 7-dimensional column vector, and normalized by the softmax layer into the final classification probability values over the seven classes.
(1) Residual structure incorporating a convolution attention module (ResBlock + CBAM module in FIG. 6)
The structure of the ResBlock + CBAM module is shown in FIG. 6. Let F be the feature extracted by the preceding convolutional layer. F is fed into the channel attention module to compute the channel attention map M_C, and M_C is multiplied with the input F to obtain the output F1 of the channel attention module. F1 is then fed into the spatial attention module to compute the spatial attention map M_S, and M_S is multiplied with F1 to obtain the final output F2 of the convolutional attention module, which is passed on to the following convolutional layers for further feature learning. By adding the convolutional attention module to the basic residual block of resnet18, the resnet network is made to focus more on the object to be recognized.
$$F_1 = M_C(F) \otimes F$$

$$F_2 = M_S(F_1) \otimes F_1$$

where ⊗ denotes element-wise multiplication (with broadcasting of the attention maps).
The convolutional attention module (CBAM) consists of a channel attention module and a spatial attention module. Given an intermediate feature map, CBAM sequentially infers attention maps along the channel and spatial dimensions, and each attention map is multiplied with the input feature map for adaptive feature refinement.
The structure of the channel attention module is shown in FIG. 7. The feature map extracted by the preceding layer undergoes global average pooling and global max pooling simultaneously to achieve compression along the spatial dimension; the resulting one-dimensional vectors are fed into a shared two-layer fully connected network, the two paths are summed element-wise, and a sigmoid activation function finally produces the channel attention map M_C.
The structure of the spatial attention module is shown in FIG. 8; the feature map output by the channel attention module serves as its input. Average pooling and max pooling are applied along the channel dimension, and the two extracted feature maps (each with one channel) are concatenated into a 2-channel feature map. A convolutional layer with a 7 × 7 kernel reduces this to a single channel, and a sigmoid activation function produces the spatial attention map M_S.
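A minimal PyTorch sketch of the two attention modules and the residual block of FIG. 6 follows. The reduction ratio of 16 in the channel MLP and the stride-1, equal-channel residual body are illustrative assumptions (resnet18 also contains downsampling blocks not shown here).

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention (FIG. 7): spatial squeeze via global average and max
    pooling, a shared two-layer MLP, element-wise sum, then sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))               # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))                # global max pooling branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)  # M_C

class SpatialAttention(nn.Module):
    """Spatial attention (FIG. 8): channel-wise mean and max maps concatenated,
    a 7x7 convolution down to 1 channel, then sigmoid."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_S

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, f):
        f1 = self.ca(f) * f   # F1 = M_C(F) ⊗ F
        f2 = self.sa(f1) * f1  # F2 = M_S(F1) ⊗ F1
        return f2

class ResBlockCBAM(nn.Module):
    """Basic residual block with CBAM applied before the skip addition (FIG. 6)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.cbam = CBAM(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.cbam(self.body(x)))
```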
(2) Second order pooling mechanism
Global second-order pooling computes the covariance matrix (second-order information) of the feature maps to select values representative of their data distribution. Suppose the preceding convolution operations yield a set of H × W feature maps F_i (i = 1, 2, …, C), where C is the number of channels. The idea of global covariance pooling is to treat each feature map as a random variable, with each element of the feature map being one sample of that variable. Each feature map F_i is flattened into a vector f_i of size (H × W, 1), and the covariance matrix of the set of feature maps is computed:
$$\Sigma_{ij} = \frac{1}{HW}\sum_{k=1}^{HW}\bigl(f_i^{(k)} - \bar{f}_i\bigr)\bigl(f_j^{(k)} - \bar{f}_j\bigr), \qquad i, j = 1, \dots, C$$

where $\bar{f}_i$ denotes the mean of the entries of $f_i$.
the physical significance of the covariance matrix is very significant, with the ith row representing the statistical correlation of channel i with all channels.
Second-order covariance pooling effectively exploits the inter-channel correlations learned by the deep neural network and carries richer feature information, so replacing the global average pooling layer of resnet18 with a global second-order pooling layer improves the network's feature expression capability. The implementation details are as follows: the 512-channel feature map output by the convolutional layer preceding the second-order pooling layer is first reduced to 256 channels with a 1 × 1 convolution kernel; the covariance matrix of this set of features is then computed and matrix square-root normalized, yielding a 32896-dimensional feature (the upper triangle of the 256 × 256 covariance matrix, 256 × 257 / 2 = 32896 values) that realizes the second-order pooling operation.
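A sketch of this pooling step is given below for batched feature maps of shape (B, 512, H, W). Computing the matrix square root by symmetric eigendecomposition is one possible choice made here for clarity; practical implementations such as iSQRT-COV instead use a Newton-Schulz iteration for speed and more stable gradients.

```python
import torch
import torch.nn as nn

class SecondOrderPooling(nn.Module):
    """Global covariance pooling sketch: 1x1 conv reduces 512 -> 256 channels,
    a 256x256 covariance matrix is computed per sample and normalized by its
    matrix square root, and the upper triangle (256 * 257 / 2 = 32896 values)
    is returned as the pooled feature vector."""
    def __init__(self, in_channels=512, reduced=256):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, reduced, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)                     # (B, 256, H, W)
        b, c, h, w = x.shape
        f = x.reshape(b, c, h * w)             # flatten each map to a vector
        f = f - f.mean(dim=2, keepdim=True)    # center each channel
        cov = f @ f.transpose(1, 2) / (h * w)  # (B, 256, 256) covariance

        # Matrix square-root normalization via symmetric eigendecomposition.
        eigvals, eigvecs = torch.linalg.eigh(cov)
        eigvals = eigvals.clamp(min=1e-10).sqrt()
        cov_sqrt = eigvecs @ torch.diag_embed(eigvals) @ eigvecs.transpose(1, 2)

        iu = torch.triu_indices(c, c)
        return cov_sqrt[:, iu[0], iu[1]]       # (B, 32896)
```

The 32896-dimensional output then feeds the fully connected layer in place of the 512-dimensional global-average-pooled vector of the original resnet18.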
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (7)

1. A facial expression recognition method combining an attention module and a second-order pooling mechanism is characterized by comprising the following steps:
acquiring a face image;
preprocessing a face image, wherein the preprocessing of the face image comprises face detection and alignment, data enhancement and image normalization;
and performing feature extraction on the preprocessed face image, and finishing expression classification.
2. The method of claim 1, wherein the face detection and alignment comprises face detection, key point positioning, and face alignment, and specifically comprises:
the input of the face detection module is a face expression picture, and the output is a detected face area;
performing face key point coordinate positioning according to the face detection area, and importing a five-point key point detection model by using a face key point detection interface in a dlib library to obtain five-point key point coordinates of the face;
and carrying out face alignment by using the coordinates of the key points of the five points.
3. The method for recognizing facial expressions by combining an attention module and a second-order pooling mechanism according to claim 2, wherein the face alignment is calculated as follows:
first, from the four eye-corner coordinates (x1, y1), (x2, y2) of the left eye and (x3, y3), (x4, y4) of the right eye, the center coordinates of the left and right eyes are calculated:

$$x_{le} = \frac{x_1 + x_2}{2}, \qquad y_{le} = \frac{y_1 + y_2}{2}$$

$$x_{re} = \frac{x_3 + x_4}{2}, \qquad y_{re} = \frac{y_3 + y_4}{2}$$

after the two eye centers are obtained, they are first connected and the included angle θ between the connecting line and the horizontal is calculated; the rotation center coordinates (x_center, y_center) are then obtained by averaging the two eye centers and the point below the nose (x_nose, y_nose):

$$\theta = \arctan\frac{y_{re} - y_{le}}{x_{re} - x_{le}}$$

$$x_{center} = \frac{x_{le} + x_{re} + x_{nose}}{3}, \qquad y_{center} = \frac{y_{le} + y_{re} + y_{nose}}{3}$$

combining the rotation center coordinates (x_center, y_center) with the angle θ, an affine transformation matrix is obtained using OpenCV's affine-matrix interface, and an OpenCV interface function is called to apply the affine transformation to the image, obtaining a face-aligned photo.
4. The method for recognizing facial expressions by combining an attention module and a second-order pooling mechanism according to claim 1, wherein the data enhancement dynamically applies random geometric or color transformation operations to the input image at the data-reading stage through the transforms interface of the deep learning framework PyTorch, and the transformed image is then input into the network for training, thereby realizing data expansion.
5. The method of claim 1, wherein the image normalization divides each pixel value of the image by 255, so that every pixel value of the normalized image lies in [0, 1].
6. The method for recognizing facial expressions by combining an attention module and a second-order pooling mechanism according to claim 1, wherein facial expression feature extraction is realized with an 18-layer resnet network, and a softmax layer added at the end of the resnet network normalizes the network output into probability values over 7 expression classes, the maximum of which is the classification result.
7. The method for recognizing facial expressions by combining an attention module and a second-order pooling mechanism according to claim 1, wherein feature extraction and expression classification are realized with an end-to-end deep neural network having the following structure: the first layer is a convolutional layer with a 7 × 7 kernel and 64 channels; the second layer is a pooling layer with a 3 × 3 pooling kernel and 64 channels; eight residual blocks fused with the convolutional attention module follow, finally outputting a 512-channel feature map; a second-order pooling layer then performs feature aggregation, and a fully connected layer and a softmax layer finally produce the classification result.
CN202210403298.5A 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism Pending CN114582002A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210403298.5A CN114582002A (en) 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210403298.5A CN114582002A (en) 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism

Publications (1)

Publication Number Publication Date
CN114582002A (en) 2022-06-03

Family

ID=81784744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210403298.5A Pending CN114582002A (en) 2022-04-18 2022-04-18 Facial expression recognition method combining attention module and second-order pooling mechanism

Country Status (1)

Country Link
CN (1) CN114582002A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140003709A1 (en) * 2012-06-28 2014-01-02 Honda Motor Co., Ltd. Road marking detection and recognition
CN106874861A (en) * 2017-01-22 2017-06-20 北京飞搜科技有限公司 A kind of face antidote and system
CN108805040A (en) * 2018-05-24 2018-11-13 复旦大学 It is a kind of that face recognition algorithms are blocked based on piecemeal
CN109344693A (en) * 2018-08-13 2019-02-15 华南理工大学 A kind of face multizone fusion expression recognition method based on deep learning
WO2021139557A1 (en) * 2020-01-08 2021-07-15 杭州未名信科科技有限公司 Portrait stick figure generation method and system, and drawing robot
CN111783622A (en) * 2020-06-29 2020-10-16 北京百度网讯科技有限公司 Method, device and equipment for recognizing facial expressions and computer-readable storage medium
CN112541422A (en) * 2020-12-08 2021-03-23 北京科技大学 Expression recognition method and device with robust illumination and head posture and storage medium
CN112766158A (en) * 2021-01-20 2021-05-07 重庆邮电大学 Multi-task cascading type face shielding expression recognition method
CN113076916A (en) * 2021-04-19 2021-07-06 山东大学 Dynamic facial expression recognition method and system based on geometric feature weighted fusion
CN113869229A (en) * 2021-09-29 2021-12-31 电子科技大学 Deep learning expression recognition method based on prior attention mechanism guidance
CN114299578A (en) * 2021-12-28 2022-04-08 杭州电子科技大学 Dynamic human face generation method based on facial emotion analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Xiaoyun Tong et al., "Data Augmentation and Second-Order Pooling for Facial Expression Recognition", IEEE, 17 June 2019, p. 86821.
Yang Zuobao et al., "Improved face recognition algorithm with multi-pose correction" (改进的多姿态矫正的人脸识别算法), Microcomputer & Its Applications (微型机与应用), vol. 35, no. 03, 10 February 2016, p. 56.

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115565159A (en) * 2022-09-28 2023-01-03 华中科技大学 Construction method and application of fatigue driving detection model
CN115565159B (en) * 2022-09-28 2023-03-28 华中科技大学 Construction method and application of fatigue driving detection model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination