Video expression recognition method based on a capsule-long short-term memory neural network
Technical Field
The invention belongs to the technical field of facial expression recognition, and particularly relates to a video expression recognition method based on a capsule-long short-term memory (LSTM) neural network.
Background
The human face is one of the most important biometric features of a person and carries a large amount of information, among which expression information is particularly significant. An expression is the intuitive reflection of human emotion, formed by the state of the facial muscles and features, and expression recognition, as an important part of human-computer interaction, has long been a major research direction in computer vision. With the progress of computer technology, the arrival of the big-data era, and the development of computer hardware such as GPUs (graphics processing units), achievements in the field of face recognition, such as face detection, facial feature extraction, and image classification, can serve as references for expression recognition. As a result, expression recognition has advanced considerably in both software and hardware, and related research institutions, expression databases, and new algorithms continue to multiply.
Expression recognition has a very wide range of applications and appears in fields such as human-computer interaction, robotics, healthcare, and distance education. Using artificial intelligence to recognize expressions allows the other party's emotion to be analyzed objectively, avoids misreading caused by the observer's own mood, compensates for the limits of human attention, and captures subtle expressions that the naked eye may miss. In criminal investigation, it assists the police in monitoring suspects' expressions to detect their real psychological state and helps solve cases; in clinical medicine, observing the expressions of autistic children reveals their psychological activities and assists doctors in devising more appropriate treatment plans to speed their recovery; in shopping malls, monitoring customer satisfaction with a given product helps staff design better promotion schemes; in road traffic, researchers study fatigue driving to prevent accidents caused by a driver's poor condition; and in distance education, detecting students' emotional changes in class helps teachers gauge how well students have mastered the knowledge points and adjust the pace of teaching scientifically, thereby better promoting learning.
At present, a common expression recognition method for video sequences combines a convolutional neural network (CNN) with a long short-term memory (LSTM) neural network to model the changes of facial expressions in video: typically a deep CNN extracts spatial information, and a multilayer LSTM captures temporal information. CNNs have strong feature-learning ability and many advantages, but practical applications also reveal shortcomings: (1) recognition consistency after image transformation is low, i.e., a CNN struggles to remain invariant to left-right translation, rotation, added borders, and the like, so the training set required by a CNN is very large, and although data augmentation helps, its improvement is limited; (2) all neurons in a CNN are treated equally, with no internal organizational structure, so the same object at different positions and angles cannot be identified consistently, and the relationships among the substructures captured by different convolution kernels cannot be extracted. CNNs perform well at extracting and detecting object features but ignore local and internal relative position information (such as relative position, orientation, and skew), thereby losing important information.
Disclosure of Invention
In order to overcome the above defects, the invention provides a video expression recognition method based on a capsule-long short-term memory neural network, which, as shown in fig. 1, specifically comprises the following steps:
converting a video including a face into a video frame;
detecting a face image in a video frame, and preprocessing the face image;
constructing a capsule network, extracting the characteristics of the face image by using a capsule network encoder and reconstructing the picture by using a capsule network decoder;
constructing a long short-term memory (LSTM) neural network, and taking the output of the capsule network encoder as the input of the LSTM;
and taking the expression class corresponding to the maximum probability value in the LSTM output as the label of the sequence.
Further, detecting a face image in the video frame, and preprocessing the face image includes:
carrying out face detection on the video frames, intercepting the face ROI (region of interest), and performing size normalization and graying;
detecting faces in the video frames with the MTCNN algorithm, locating the face in each frame, cropping the detected face to a fixed size, and performing graying;
and selecting a fixed number of frames from each video as a group of video sequences, completing the extraction and preprocessing of the face images.
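As an illustrative sketch of the preprocessing steps above (the helper names are hypothetical and not part of the claimed method; MTCNN face detection is assumed to have already produced square face crops), the graying, resizing, and frame selection can be outlined in Python:

```python
import numpy as np

def to_gray(frame: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB frame to grayscale via luminance weights."""
    return frame @ np.array([0.299, 0.587, 0.114])

def resize_nn(img: np.ndarray, size: int = 48) -> np.ndarray:
    """Nearest-neighbour resize of a 2-D grayscale image to size x size."""
    h, w = img.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def sample_frames(frames: list, n: int = 16) -> list:
    """Pick n frames evenly spaced over the whole clip."""
    idx = np.linspace(0, len(frames) - 1, n).astype(int)
    return [frames[i] for i in idx]

def preprocess(frames: list) -> np.ndarray:
    """Gray, resize, and sample a clip, yielding a (16, 48, 48) sequence."""
    picked = sample_frames(frames, 16)
    return np.stack([resize_nn(to_gray(f), 48) for f in picked])
```

A production pipeline would use a proper face detector and interpolating resize; this sketch only fixes the data shapes the later steps rely on.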
Furthermore, the capsule network uses three convolution layers, a convolutional capsule layer, and a digit capsule layer as its encoder, and four deconvolution layers as its decoder. The convolution layers extract the features of the picture, and the feature map after the last convolution operation is converted into primary capsules for use by the dynamic routing algorithm; the capsules are iterated by the dynamic routing algorithm and stacked along the last dimension. In the digit capsule layer, the length of each capsule vector represents the probability of the corresponding expression class and is used to compute the classification loss. The decoder is used to optimize the network: it reconstructs the image corresponding to the highest output probability, compares the Euclidean distance between the reconstructed image and the original image, and computes the reconstruction loss.
Further, the compression (squash) operation is represented as:
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖);
where v_j and s_j are both capsule vectors, and v_j is calculated from the preceding capsule s_j.
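A minimal numpy sketch of this compression operation, as standardly defined for capsule networks (the function name and the small epsilon guarding against division by zero are implementation details, not from the text):

```python
import numpy as np

def squash(s: np.ndarray, axis: int = -1, eps: float = 1e-8) -> np.ndarray:
    """Scale s_j so its norm lies in [0, 1) while keeping its direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)
```

Short vectors shrink toward zero and long vectors approach unit length, which is what lets a capsule's length act as a probability.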
Further, the dynamic routing algorithm is used for obtaining high-level capsules from the primary capsules, and is represented as:
s_j = Σ_i c_ij·û_(j|i), û_(j|i) = W_ij·u_i;
where s_j is a high-level capsule, u_i is a bottom-level capsule, c_ij is a coupling coefficient, and W_ij is a weight parameter.
Further, each pair of upper-level and lower-level capsules has a coupling coefficient c_ij, and the coupling coefficients of one lower-level capsule sum to 1. The coefficients are expressed as:
c_ij = softmax(b'_ij); b'_ij = b_ij + û_(j|i)·v_j;
where b'_ij denotes the updated value, b_ij is the pre-update value, which is initially zero, and v_j is the vector of the high-level capsule.
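Assuming the prediction vectors û_(j|i) = W_ij·u_i have already been formed, the routing iteration can be sketched in numpy as follows (the three-iteration default follows common capsule-network practice and is not stated in this text; the softmax is taken over the high-level capsules j, matching the normalization of the coupling coefficients described above):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat: np.ndarray, iters: int = 3) -> np.ndarray:
    """u_hat: (num_low, num_high, dim) prediction vectors W_ij u_i.
    Returns high-level capsules v_j of shape (num_high, dim)."""
    n_low, n_high, dim = u_hat.shape
    b = np.zeros((n_low, n_high))  # routing logits b_ij, initially zero
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # c_ij = softmax_j(b_ij)
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # s_j = sum_i c_ij u_hat
        v = squash(s)                                         # compress to length < 1
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # agreement update
    return v
```

Each iteration routes more weight to the high-level capsules whose outputs agree with a low-level capsule's predictions.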
Further, the loss function of the encoder is expressed as:
L_c = T_c·max(0, m⁺ − ‖v_c‖)² + λ·(1 − T_c)·max(0, ‖v_c‖ − m⁻)²;
where T_c indicates whether expression class c is present, taking the value 1 when present and 0 when absent; m⁺ and m⁻ are the upper and lower margins, respectively; and ‖v_c‖ is the modulus (length) of the capsule, i.e., the probability of expression class c.
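A numpy sketch of this margin loss; the defaults m⁺ = 0.9 and m⁻ = 0.1 follow the values given later in the description, while λ = 0.5 is the conventional capsule-network choice and is an assumption here:

```python
import numpy as np

def margin_loss(v_norms: np.ndarray, labels: np.ndarray,
                m_pos: float = 0.9, m_neg: float = 0.1, lam: float = 0.5) -> float:
    """v_norms: (num_classes,) capsule lengths; labels: one-hot vector T_c."""
    pos = labels * np.maximum(0.0, m_pos - v_norms) ** 2          # present-class term
    neg = lam * (1 - labels) * np.maximum(0.0, v_norms - m_neg) ** 2  # absent-class term
    return float(np.sum(pos + neg))
```

The loss is zero when the correct capsule is longer than m⁺ and all other capsules are shorter than m⁻.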
Further, the loss function of the decoder is expressed as:
L_r = (1/n)·Σ_i (r_i − a_i)²;
where n is the number of pixels, r_i is the reconstructed value of the i-th pixel finally obtained through the capsule-based facial expression recognition network and the decoder, and a_i is the true value of the i-th pixel.
Further, when the long short-term memory neural network is constructed, the cross entropy between the output vector and the actual label is computed, and the average cross entropy over all elements of the vector is used as the loss function of the LSTM. The cross entropy is expressed as:
L = −(1/n)·Σ_i y'_i·log(y_i);
where y'_i is the actual expression class label and y_i is the predicted expression probability for sample i.
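A minimal sketch of this averaged cross-entropy loss (the epsilon guard against log(0) is an implementation detail, not from the text):

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-12) -> float:
    """Mean cross entropy between one-hot labels y' and predicted probabilities y."""
    return float(-np.mean(y_true * np.log(y_pred + eps)))
```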
The invention has the following beneficial effects:
1. the improved capsule network replaces the convolutional neural network for facial expression image recognition;
2. three convolution layers extract features from the preprocessed image, so potential features are extracted more easily;
3. the capsule network is combined with the long short-term memory network: the capsule network extracts spatial information and the LSTM extracts temporal information, which effectively improves the accuracy of expression classification;
4. compared with the traditional recurrent neural network (RNN), the LSTM alleviates the vanishing-gradient problem and reduces the difficulty of model training.
Drawings
FIG. 1 is a flow chart of the video expression recognition method based on a capsule-long short-term memory neural network according to the present invention;
FIG. 2 is a schematic diagram illustrating the effect of the AFEW data set after face detection and preprocessing according to the present invention;
FIG. 3 is a network model of a capsule network encoder of the present invention;
fig. 4 is a network model structure diagram of a decoder in the capsule network according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention provides a video expression recognition method based on a capsule-long short-term memory neural network, which specifically comprises the following steps:
converting a video including a human face into a video frame;
detecting a face image in a video frame, and preprocessing the face image;
constructing a capsule network, extracting the features of the face image with the capsule network encoder and reconstructing the image with the decoder;
constructing a long short-term memory (LSTM) neural network, and extracting temporal features from the encoder outputs with this network;
and taking the expression class corresponding to the maximum probability value in the LSTM output as the label of the sequence.
Example 1
This embodiment further illustrates the video expression recognition method based on a capsule-long short-term memory neural network.
In this embodiment, the method is divided into three steps, which specifically include:
firstly, acquiring original data and preprocessing the original data
In this embodiment, the MMI data set and the AFEW data set are used, and ffmpeg converts each video in the data sets into video frames.
In this embodiment, MTCNN performs face localization on the video frames, and the detected face is cropped to a fixed size, with the image reduced to 48 × 48 by resize();
the reduced picture is then grayed, and 16 frames are selected from the frames of each video as a group of video sequences; the preprocessing result is shown in fig. 2.
(II) extracting characteristics by utilizing capsule network and reconstructing pictures
In this embodiment, the constructed capsule network comprises convolution layers, a convolutional capsule layer, a digit capsule layer, and deconvolution layers; as shown in fig. 3, there are 3 convolution layers and 4 deconvolution layers. The first convolution layer uses 5 × 5 convolution kernels with a stride of 1; the second convolution layer uses 5 × 5 kernels with a stride of 2; the third layer is the convolutional capsule layer, whose essence is to convert the feature map after a convolution operation into primary capsules for use by the dynamic routing algorithm. A convolution is performed first in this layer; in practice its kernel is 9 × 9 with a stride of 2, and every convolution layer uses the ReLU function as the activation function.
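Assuming "valid" convolutions with no padding (padding is not stated in the text), the spatial sizes produced by the three encoder convolutions on a 48 × 48 input can be traced with the standard output-size formula:

```python
def conv_out(size: int, kernel: int, stride: int, pad: int = 0) -> int:
    """Standard convolution output size: (size + 2*pad - kernel) // stride + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Tracing a 48 x 48 input through the three encoder convolutions
s1 = conv_out(48, 5, 1)   # first conv: 5x5, stride 1 -> 44
s2 = conv_out(s1, 5, 2)   # second conv: 5x5, stride 2 -> 20
s3 = conv_out(s2, 9, 2)   # conv-capsule layer: 9x9, stride 2 -> 6
print(s1, s2, s3)         # 44 20 6
```

So the primary capsules are formed over a 6 × 6 spatial grid under these assumptions.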
The digit capsule layer is the key to the capsule network. It takes the converted capsules from the third-layer convolutional capsule layer as input, iterates over the input capsules with the dynamic routing algorithm, stacks the extracted information along the last dimension, and then applies the compression operation, which normalizes each element of the vector to lie between 0 and 1. The compression function squash() is expressed as:
v_j = (‖s_j‖² / (1 + ‖s_j‖²)) · (s_j / ‖s_j‖);
where v_j and s_j are both capsule vectors, and v_j is calculated from the preceding capsule s_j.
The dynamic routing algorithm is used to obtain higher-level capsules from the primary capsules, and is represented as:
s_j = Σ_i c_ij·û_(j|i), û_(j|i) = W_ij·u_i;
where s_j is a high-level capsule, u_i is a bottom-level capsule, c_ij is a coupling coefficient, and W_ij is a weight parameter. Each pair of upper-level and lower-level capsules has a coupling coefficient c_ij, and the coupling coefficients of one lower-level capsule sum to 1. The coefficients are expressed as:
c_ij = softmax(b'_ij); b'_ij = b_ij + û_(j|i)·v_j;
where b'_ij denotes the updated value and b_ij the pre-update value, which is initially zero; c_ij is obtained from b'_ij through the softmax function; and v_j is the vector of the high-level capsule.
The final output value is also a vector representing entity characteristics, and the modulus of the vector represents the probability of the expression. The total loss function of the capsule network consists of two parts, the margin loss of the encoder and the reconstruction loss of the decoder, and the parameters of the capsule network are updated iteratively according to this loss function. The loss function of the encoder part of the capsule network is defined as:
L_c = T_c·max(0, m⁺ − ‖v_c‖)² + λ·(1 − T_c)·max(0, ‖v_c‖ − m⁻)²;
where T_c indicates whether expression class c is present, taking the value 1 when present and 0 when absent; m⁺ and m⁻ are the upper and lower margins, respectively, set to 0.9 and 0.1 in the invention; and ‖v_c‖ is the modulus of the capsule, i.e., the probability of expression class c.
The decoder network structure is shown in fig. 4; the features extracted by the encoder are reconstructed with the deconvolution layers and compared with the original image.
Before entering the decoder network, the output of the encoder is passed through a softmax operation to obtain the expression capsule with the maximum modulus; the capsules characterizing the other classes are then masked, and the masked result is input into the deconvolution-based decoder. First, a fully connected layer with 2304 output neurons is applied; this output is reshaped into 12 × 12 × 16 data, from which the decoder recovers the original image information. The deconvolution kernels in the deconvolution layers are set to 3 × 3 with a stride of 1, and a 48 × 48 data structure is obtained after the four deconvolution layers. Comparing the obtained reconstruction values with the true values of the corresponding pixels, the reconstruction loss function can be defined as follows:
L_r = (1/n)·Σ_i (r_i − a_i)²;
where n is the number of pixels, r_i is the reconstructed value finally obtained after the i-th pixel passes through the capsule-based facial expression recognition network and the decoder, and a_i is the true value of the i-th pixel.
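The decoder's size arithmetic can be checked against the standard transposed-convolution output formula (PyTorch convention assumed). The stride-2, kernel-4 configuration in the second example is purely illustrative, since the exact padding and stride choices that recover 48 × 48 from 12 × 12 are not spelled out in the text:

```python
def deconv_out(size: int, kernel: int, stride: int,
               pad: int = 0, out_pad: int = 0) -> int:
    """Transposed-convolution output size: (size-1)*stride - 2*pad + kernel + out_pad."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad

# A 3x3, stride-1 transposed convolution grows a map by 2 pixels per layer:
print(deconv_out(12, 3, 1))   # 12 -> 14

# One illustrative way to double 12 x 12 up to 48 x 48 is a pair of
# stride-2, kernel-4, padding-1 layers, each doubling the spatial size:
print(deconv_out(deconv_out(12, 4, 2, 1), 4, 2, 1))  # 12 -> 24 -> 48
```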
(III) adopting long-and-short-time memory neural network to extract time sequence characteristics
The output of the capsule network encoder is used as the input of the long short-term memory neural network, the number of hidden units of the LSTM is set to 128, and the cross entropy between the output vector and the actual label of the sample is computed as:
L = −(1/n)·Σ_i y'_i·log(y_i);
where y'_i is the actual expression class label and y_i is the predicted expression probability for sample i.
From the output of the LSTM, the expression class corresponding to the maximum probability value is selected as the label of the sequence sample, completing the video expression classification of the sequence sample.
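A minimal sketch of this final classification step (the class list shown is illustrative; the actual label set depends on the data set used):

```python
import numpy as np

def classify_sequence(lstm_out: np.ndarray, classes: list) -> str:
    """lstm_out: (num_classes,) logits from the last LSTM step.
    Softmax then argmax gives the label of the whole sequence."""
    e = np.exp(lstm_out - lstm_out.max())  # numerically stable softmax
    probs = e / e.sum()
    return classes[int(np.argmax(probs))]
```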
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.