CN117576753A - Micro-expression recognition method based on attention feature fusion of facial key points - Google Patents

Micro-expression recognition method based on attention feature fusion of facial key points

Info

Publication number
CN117576753A
CN117576753A (Application No. CN202311579931.7A)
Authority
CN
China
Prior art keywords
optical flow
micro
key points
facial
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311579931.7A
Other languages
Chinese (zh)
Inventor
邵艳利
郑万闯
王兴起
方景龙
魏丹
陈滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202311579931.7A priority Critical patent/CN117576753A/en
Publication of CN117576753A publication Critical patent/CN117576753A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G06V40/176: Dynamic expression
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a micro-expression recognition method based on attention feature fusion of facial key points. First, the face and its key points are located to obtain shallow optical flow features and a facial structure graph. Second, deep optical flow features and facial structure features are extracted from the optical flow features and the facial structure graph through neural networks. Finally, the micro-expression recognition result is obtained from the deep optical flow features and the facial structure features through multi-scale feature fusion combined with an attention mechanism. The invention focuses accurately on important features and contextual information, improves the generalization ability of the model and its ability to handle individual samples, and improves the accuracy and robustness of the micro-expression recognition task.

Description

Micro-expression recognition method based on attention feature fusion of facial key points
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a micro-expression recognition method based on attention feature fusion of facial key points.
Background
A micro-expression is a brief, subtle and involuntary change of facial expression that usually appears when a person tries to conceal their true emotion; it can therefore be regarded as a leakage of genuine emotion and reflects a person's underlying affective state. Detecting and recognizing micro-expressions thus has significant research and application value in scenarios such as criminal investigation, clinical medicine, business negotiation and public safety.
Current micro-expression recognition methods fall into two main categories: traditional hand-crafted feature extraction methods and deep-learning-based recognition methods. Traditional methods extract features manually and, depending on the algorithm, can be divided into texture-based features and geometric-transformation-based features. Texture-based methods use apparent texture as the micro-expression feature, whereas geometric-transformation-based methods use optical flow and facial key point information to extract discriminative micro-expression features and tend to achieve higher recognition rates than texture-based methods. In recent years, with the continuous development of deep learning, deep convolutional neural networks have achieved remarkable results in computer vision tasks such as object detection, semantic segmentation and image processing, owing to their strong feature extraction capability and ability to model complex problems. Features learned in the deep-learning setting can be grouped into those extracted by convolutional neural networks, by long short-term memory (LSTM) architectures, by graph convolutional neural networks, and by convolutional neural networks combined with attention mechanisms; thanks to the strong feature learning capability of these models, they generally deliver better performance.
Although deep-learning-based micro-expression recognition outperforms traditional methods built on hand-crafted features, micro-expression motion has a small amplitude, a short duration and is localized to specific facial regions, which makes micro-expressions hard to capture and analyze and increases the difficulty of detection and recognition. The data set samples also show obvious distribution differences: the features of minority classes cannot be fully characterized, so the model is not robust to these classes, classifies them poorly and overfits the data set, limiting its practicality. In addition, most micro-expression recognition research focuses on obtaining ever more refined expression features by continually deepening the network and increasing its complexity, which causes network instability and feature redundancy, gradually raises the hardware requirements and increases cost.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a micro-expression recognition method based on attention feature fusion of facial key points. The locations of micro-expression motion are positioned through the facial key points, features are extracted from the bimodal combination of optical flow and facial structure key points, an attention mechanism adaptively assigns weights to the two modalities, and multi-scale feature fusion is finally completed. This overcomes the locality of micro-expression motion and the subtle differences between data set samples, and improves the accuracy and efficiency of micro-expression recognition.
A micro-expression recognition method based on attention feature fusion of facial key points specifically comprises the following steps:
Step 1, locating the face and the facial key points, and obtaining shallow optical flow features and a facial structure graph.
Step 1-1, face clipping and face key point positioning
Face detection is the first step of automatic micro-expression analysis; to remove the influence of background noise, only the part of the image containing the face is retained. Face detection is performed with the face detection algorithm in the open-source Dlib library, which yields 68 facial key points; face cropping then takes the maximum and minimum horizontal and vertical coordinates of these 68 points to determine the cropping range. Each frame of every micro-expression video sample in the data set is cropped in this way to generate a micro-expression image sequence. Research shows that the key points along the facial contour change little between different micro-expression states, so, according to where micro-expression motion occurs, the n most representative key points around the mouth, eyebrows and nose are selected to form a topological graph of the facial structure and capture the subtle motion of facial micro-expressions. The facial correlations between these key points are quantified in the form of an adjacency matrix and used in the subsequent feature extraction steps.
Step 1-2, shallow optical flow features based on the TV-L1 optical flow model
A TV-L1 optical flow model extracts the optical flow information in the horizontal and vertical directions between the start frame and the peak frame of the micro-expression image sequence, and the optical strain feature is then calculated from it. The optical strain feature represents the degree of facial deformation and is not easily affected by illumination or facial occlusion; the optical flow information in the two directions is then superimposed with the optical strain feature to form the shallow optical flow feature, which effectively represents the local motion of the facial micro-expression. To reduce the negative influence of overall facial noise motion, more efficient features are obtained on the basis of the facial structure graph constructed in step 1-1: each selected key point is taken as a center coordinate and expanded outward into an m×m three-dimensional optical flow block, a size that captures fine and effective micro-expression motion features while avoiding the loss of useful information around the key points.
Step 1-3, constructing a facial structure graph based on the facial key point sequence
The start frame, the peak frame and the end frame of the micro-expression image sequence represent the key stages of facial muscle motion when a micro-expression occurs; these three frames contain rich motion information while removing a large number of redundant frames from the video. When micro-expression motion occurs, the facial muscles move and the key points move with them, so the facial key point coordinates contain the motion information of the micro-expression. On the basis of each frame, the facial key point information described in step 1-1 is combined to construct the spatial relationships between key points within a single frame and the motion relationships between key points of adjacent frames, which are the keys to recognizing micro-expression motion; a facial structure graph oriented to micro-expression state change is thus constructed on the basis of the start frame, the peak frame and the end frame.
Step 2, extracting deep optical flow features and facial structure features through neural networks based on the optical flow features and the facial structure graph.
Step 2-1, extracting deep optical flow features based on key points
According to step 1-2, each data set sample is converted into n shallow optical flow blocks of size m×m×3. A shallow triple-stream three-dimensional network (Shallow Triple Stream Three-dimensional Net, STSTNet) is selected as the baseline model to analyze each optical flow block and convert the shallow optical flow features into high-level features with deeper semantics, yielding deep feature vector representations, i.e. the deep optical flow features.
To extract the feature information of each key point in a data set sample with a graph convolutional network, according to the facial graph structure proposed in step 1-1, each key point serves as a vertex of the graph, the shallow optical flow feature extracted from the optical flow block corresponding to that key point serves as the node feature of the vertex, and the natural connection relationships between the facial key points serve as the adjacency matrix; finally, the GCN aggregates and extracts the discriminative deep spatio-temporal features contained in the optical flow features to obtain the deep optical flow features.
Step 2-2, extracting facial structural features based on key points
Based on the spatial and motion relationships formed by the facial key points in step 1-3, feature information is extracted from the spatio-temporal graph through a lightweight Shift graph convolutional network to obtain the facial structure features. The Shift graph convolutional network contains a spatial shift graph convolution module and a temporal shift graph convolution module, so it can fuse information between nodes of the same frame in the spatio-temporal graph as well as information of key points across frames. Shift-GCN is obtained by replacing the traditional convolution operator of a conventional GCN with a shift convolution operator, so it achieves better results with fewer parameters and less computation; the global shift graph convolution adaptively learns the relationships between different facial key points, and the learnable adjacency matrix it introduces increases the flexibility of model learning and removes the limitation of fixed node relationships in a predefined adjacency matrix.
Step 3, obtaining the micro-expression recognition result through multi-scale feature fusion combined with an attention mechanism, based on the deep optical flow features and the facial structure features.
Feature fusion strategies play a critical role in multi-modal learning approaches, and attention mechanisms have been used successfully to refine the fusion weights of different modalities. The deep optical flow features and the facial structure features obtained in step 2 are each fed into an encoder for embedded learning, a softmax activation function generates soft attention weights α for the different modalities, and the weights are multiplied with the original features to obtain new features. To preserve the original features, 1+α is adopted as the new weight after self-learning; the attention-weighted features of the two modalities are concatenated as the output of the feature fusion module, the final deep feature vector is obtained through a fully connected layer, and a classifier makes the final classification decision and outputs the micro-expression recognition result.
The invention has the following beneficial effects:
1. An end-to-end multi-channel network model based on facial key points is proposed, so that the model pays more attention to the facial regions that carry more micro-expression motion information, overcoming the locality and subtlety of micro-expression motion. The network model consists of two channels, optical flow features and facial structure features, both extracted on the basis of the facial key points; extracting temporal and spatial features from heterogeneous data enriches the sample features of the data set and improves the recognition performance of the network model.
2. The lightweight network based on Shift-GCN has few parameters, a small computation cost and a short inference time, so it is well suited to the small sample sizes of micro-expression data sets and weakens the dependence of network model training on the amount of data. While maintaining training speed, the global spatial shift graph convolution module lets the model adaptively learn the relationships between facial key points, overcoming the limitation of fixed connections in a predefined facial structure graph, and the temporal shift graph convolution module better captures the inter-frame motion information of the face sequence, helping the model extract discriminative high-level features and obtain deep feature vector representations of the samples.
3. The attention feature fusion module adaptively learns the weights of different features through the encoders and can dynamically allocate attention to data of different modalities, focusing more accurately on important features and contextual information; this improves the generalization ability of the model and its ability to handle individual samples, and improves the accuracy and robustness of the micro-expression recognition task.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is the superimposed optical flow map formed from the horizontal and vertical optical flow extracted by the optical flow method and the optical strain subsequently calculated from them;
FIG. 3 is the key-point-based facial structure graph constructed from the start frame, the peak frame and the end frame of the micro-expression image sequence.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in fig. 1, the micro-expression recognition method based on attention feature fusion of facial key points is divided into three parts. Part 1: preprocessing of the optical flow features and facial structure features based on the facial key points. Part 2: deep feature extraction through an end-to-end dual-channel network based on the facial key points. Part 3: the attention feature fusion module adaptively assigns weight coefficients to the channels, and a classifier produces the final result. The specific steps are as follows:
Step 1, locating the face and the facial key points, and obtaining shallow optical flow features and a facial structure graph.
Step 1-1, facial key point positioning
To remove the influence of background noise, face detection is performed on the micro-expression sequence, which facilitates the extraction of micro-expression feature information and reduces the size of the image data fed into the model. Face detection uses the open-source convolutional-neural-network-based Dlib algorithm: the micro-expression image sequence is input, the model detects the face and returns 68 facial key points, and the maximum and minimum horizontal and vertical coordinates of these key points determine the cropping range for face cropping. By analyzing where micro-expression motion occurs, 12 representative key points are selected to form the facial structure graph, comprising the coordinates of 6 eyebrow, 2 nose and 4 mouth key points; these locations carry a large amount of feature information when the micro-expression state changes. The facial correlations between these key points are quantified in the form of an adjacency matrix and used in the subsequent feature extraction steps.
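For illustration, a minimal Python sketch of this preprocessing step is given below, assuming the open-source Dlib frontal face detector and its standard 68-point shape predictor; the 12 landmark indices listed here are only an assumption for illustration, since the description specifies the counts (6 eyebrow, 2 nose and 4 mouth points) but not the exact indices.

```python
# Hypothetical preprocessing sketch: face cropping and key-point selection with Dlib.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Assumed indices into the 68-point Dlib layout (eyebrows, nose, mouth regions).
SELECTED_IDX = [17, 19, 21, 22, 24, 26,   # 6 eyebrow points
                31, 35,                    # 2 nose points
                48, 51, 54, 57]            # 4 mouth points

def crop_face_and_landmarks(frame_bgr):
    """Detect the face, return the cropped face image and the 12 selected key points."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None, None
    shape = predictor(gray, faces[0])
    pts = np.array([[p.x, p.y] for p in shape.parts()])   # (68, 2) landmark coordinates
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)                         # cropping range from coordinate extrema
    face = frame_bgr[y_min:y_max, x_min:x_max]
    # Selected key points are re-expressed in the cropped coordinate system.
    selected = pts[SELECTED_IDX] - np.array([x_min, y_min])
    return face, selected
```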
Step 1-2, shallow optical flow feature extraction
Optical flow feature extraction is performed on the micro-expression data in the data set on the basis of the key points. In fig. 1, the image sequence fed to the Input is the start frame, the peak frame and the end frame, where the start frame is the still frame before the micro-expression motion occurs, the peak frame is the frame at which the motion amplitude reaches its peak, and the end frame is the frame at which the face has returned to its normal state. The TV-L1 optical flow method extracts the local motion information of the facial micro-expression between the start frame and the peak frame, and the extracted two-dimensional optical flow field represents the magnitude and direction of motion of each pixel between the two frames. Strain measures the degree of deformation of an object under an external force and effectively reflects the regions where facial micro-expression motion occurs; for sufficiently small facial pixel motion it represents the deformation of the facial muscle tissue, which gives better performance in the micro-expression recognition task. Given a two-dimensional optical flow vector field, the optical strain feature can be derived to describe the facial motion pattern. By adding the optical strain feature to the optical flow field, each micro-expression sample can be represented as a triplet of optical flow features, and the three optical flow feature maps are superimposed as shown in fig. 2. According to the facial key points selected in step 1-1, each key point is taken as a center coordinate and expanded outward into an 11×11 rectangle; the resulting optical flow blocks represent the optical flow motion information around the key points, and all optical flow blocks finally serve as the input features of the optical flow channel in the subsequent steps.
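The following hedged Python sketch illustrates this step, assuming the TV-L1 implementation from opencv-contrib and one common definition of the optical strain magnitude; the description does not spell out the exact strain formula, so that part is an assumption.

```python
# A minimal sketch of the shallow optical flow feature: TV-L1 flow between start and peak
# frames, an assumed optical-strain magnitude, and 11x11 blocks around the key points.
import cv2
import numpy as np

def shallow_flow_features(start_gray, peak_gray, keypoints, patch=11):
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()       # requires opencv-contrib
    flow = tvl1.calc(start_gray, peak_gray, None)          # (H, W, 2): u, v components
    u, v = flow[..., 0], flow[..., 1]
    ux, uy = np.gradient(u, axis=1), np.gradient(u, axis=0)
    vx, vy = np.gradient(v, axis=1), np.gradient(v, axis=0)
    exy = 0.5 * (uy + vx)
    strain = np.sqrt(ux ** 2 + vy ** 2 + 2 * exy ** 2)     # assumed strain-magnitude formula
    stacked = np.stack([u, v, strain], axis=-1)            # (H, W, 3) shallow optical flow feature
    half = patch // 2
    blocks = []
    # Key-point coordinates are assumed to be integer pixels at least `half` px from the border.
    for (x, y) in keypoints:                               # the 12 selected key points
        blocks.append(stacked[y - half:y + half + 1, x - half:x + half + 1, :])
    return np.array(blocks)                                # (12, 11, 11, 3) optical flow blocks
```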
Step 1-3, facial structure feature extraction
A facial micro-expression is a dynamic process. To extract the temporal variation information of the key points in each frame with a graph convolutional network, a spatio-temporal graph G = (V, E) is constructed from the key point sequence, where V is the node set of the spatio-temporal graph, formed by the key points detected in the three consecutive expression frames, and E is the edge set of the spatio-temporal graph, composed of the edge set E_S between key points of the same frame and the edge set E_T between key points of adjacent frames.
The edge set E_S between key points of the same frame reflects the intrinsic relationships among the facial key points; different construction methods also determine the direction of information flow between key points of the same frame, which strongly affects the accuracy of the network model. Two connection modes are used for the construction. First, the eyebrows, mouth and nose are connected geometrically according to the geometric structure of the facial organs, as shown in fig. 3, which describes how each facial organ structure changes over time. Second, each key point is connected to all remaining key points of the same frame in a fully connected manner, because the facial organs are correlated with each other under different micro-expression states; for example, when the micro-expression type is positive, the eye corners drive the eyebrows to bend downward and the mouth corners rise, so even distant key points are correlated. The edges between key points of the same frame can be expressed as:

E_S = {v_{t,i} v_{t,j} | (i, j) ∈ G}  (1)

where N is the number of key points at time t in the sequence, G is the set of key point index pairs under the chosen construction method, each pair indicating that the two indexed key points are connected, and v_{t,i} v_{t,j} is a binary variable: v_{t,i} v_{t,j} = 1 when key point i is connected to key point j, and v_{t,i} v_{t,j} = 0 otherwise.
The information on the spatio-temporal graph must propagate not only between key points of the same frame but also between key points of different frames, so the key points of adjacent frames need to be connected. Connecting the key points with the same index between adjacent frames yields the edge set E_T, which can be expressed as:

E_T = {v_{t,i} v_{(t+1),j} | i, j ∈ [1, N]}  (2)

where v_{t,i} v_{(t+1),j} is also a binary variable: v_{t,i} v_{(t+1),j} = 1 when key point i = j, and v_{t,i} v_{(t+1),j} = 0 otherwise.
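A minimal sketch of how such a spatio-temporal adjacency matrix over 3 frames × 12 key points could be assembled is shown below; the helper name and the fully connected default are illustrative assumptions.

```python
# Sketch of the spatio-temporal graph G = (V, E): intra-frame edges E_S (here fully
# connected; an organ-wise pair list would replace `same_frame_pairs`) and inter-frame
# edges E_T linking the same key point in adjacent frames.
import numpy as np

T, N = 3, 12                                    # frames (start, peak, end) and key points

def build_adjacency(fully_connected=True, organ_pairs=None):
    A = np.zeros((T * N, T * N), dtype=np.float32)
    # E_S: edges between key points of the same frame.
    if fully_connected:
        same_frame_pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
    else:
        same_frame_pairs = organ_pairs           # pairs following the facial organ geometry
    for t in range(T):
        for i, j in same_frame_pairs:
            A[t * N + i, t * N + j] = 1.0
    # E_T: edges between the same key point in adjacent frames (both directions).
    for t in range(T - 1):
        for i in range(N):
            A[t * N + i, (t + 1) * N + i] = 1.0
            A[(t + 1) * N + i, t * N + i] = 1.0
    return A
```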
Step 2, extracting deep optical flow features and facial structure features through neural networks based on the optical flow features and the facial structure graph.
Step 2-1, extracting facial optical flow block features based on key points
According to step 1-2, each data set sample is converted into 12 shallow optical flow blocks of size 11×11×3. The STSTNet model is selected as the baseline to analyze each optical flow block and convert the shallow optical flow features into high-level features with deeper semantics; after this computation each optical flow block becomes a 64-dimensional deep feature vector, which serves as a node feature in the graph convolutional network of the next stage.
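A hedged PyTorch sketch of such a per-block encoder is given below. The three parallel shallow streams with 3, 5 and 8 filters follow the published STSTNet design, while the pooling layout and the 64-dimensional output head are assumptions made for this sketch rather than the exact architecture used here.

```python
# Sketch of a shallow three-stream block encoder in the spirit of STSTNet: one 11x11x3
# optical flow block is mapped to a 64-d node feature for the subsequent GCN.
import torch
import torch.nn as nn

class BlockEncoder(nn.Module):
    def __init__(self, out_dim=64):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, c, kernel_size=3, padding=1),
                          nn.ReLU(), nn.MaxPool2d(2))
            for c in (3, 5, 8)                     # three shallow parallel conv streams
        ])
        self.fc = nn.Linear((3 + 5 + 8) * 5 * 5, out_dim)

    def forward(self, x):                          # x: (B, 3, 11, 11) optical flow block
        feats = torch.cat([s(x) for s in self.streams], dim=1)   # (B, 16, 5, 5)
        return self.fc(feats.flatten(1))           # (B, 64) deep node feature
```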
Standard convolution operates on local regions of Euclidean-structured data and captures the most important pixel information in an image, whereas graph convolution learns the relationships between object nodes in non-Euclidean data. Graph convolution can be viewed as passing data between different nodes, with the goal of learning a function f that updates the node features of every node layer by layer. The adjacency matrix A and the node features X of the graph convolution can be expressed by the following formula:
A ∈ R^{n×n}, X ∈ R^{d×n}  (3)

where n is the number of nodes, d is the dimension of each node feature, and R is the set of real numbers.
In the graph convolution operation, the input of each layer, i.e. the node features, is updated by the convolution propagation function f and can be denoted H^l, where l is the index of the current convolution layer; the input of the first layer is the original node features, i.e. H^0 = X. In general, each graph convolution layer can be expressed as:

H^l = f(H^{l-1}, A)  (4)

Since f is a convolution propagation function, Eq. (4) can be further expanded to:

H^l = σ(A H^{l-1} W^{l-1})  (5)

where σ is a nonlinear activation function and W^{l-1} ∈ R^{d×d'} is the trainable weight matrix of the l-th graph convolution layer, with d and d' the input and output dimensions of that layer. Like standard convolutions, graph convolution operations can be stacked into multiple layers, and after several iterative aggregation operations the stacked GCN model can learn the dependencies between nodes.
To extract the feature information of each key point in a data set sample with the graph convolutional network, a spatial graph is constructed according to the facial structure graph built in step 1-1: each key point is a vertex of the graph, the feature vector extracted from the optical flow block corresponding to each key point is the node feature X ∈ R^{64×12} of that vertex, and the natural connection relationships between the facial key points form the adjacency matrix A ∈ R^{12×12}. Feature aggregation is performed with two graph convolution layers, where the input and output dimensions of the first layer are set to 64 and 32 and those of the second layer are set to 32 and 16. Finally, the 12×16 node features are converted by a reshape operation into a 192-dimensional deep optical flow feature with discriminative information, which is the output of the whole optical flow channel.
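The following PyTorch sketch illustrates this two-layer graph convolution of Eq. (5) with the dimensions stated above (64 to 32 to 16, reshaped to 192). Storing the node features node-major, i.e. as a (12, 64) matrix per sample, and omitting adjacency normalization to mirror Eq. (5) literally, are implementation assumptions.

```python
# Minimal two-layer GCN over the 12 key-point nodes, H^l = sigma(A H^{l-1} W^{l-1}).
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    def __init__(self, adjacency):                 # adjacency: (12, 12) float tensor
        super().__init__()
        self.register_buffer("A", adjacency)
        self.W1 = nn.Linear(64, 32, bias=False)    # first layer: 64 -> 32
        self.W2 = nn.Linear(32, 16, bias=False)    # second layer: 32 -> 16
        self.act = nn.ReLU()

    def forward(self, H0):                         # H0: (B, 12, 64) node features
        H1 = self.act(self.A @ self.W1(H0))        # (B, 12, 32)
        H2 = self.act(self.A @ self.W2(H1))        # (B, 12, 16)
        return H2.flatten(1)                       # (B, 192) deep optical flow feature
```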
Step 2-2, extracting facial structural features based on key points
Feature information is extracted from the spatio-temporal graph through a lightweight Shift graph convolutional network on the basis of the key-point facial structure graph built in step 1-3. Shift-GCN is obtained by replacing the traditional convolution operator of the spatio-temporal graph convolutional network (Spatial Temporal Graph Convolutional Networks, STGCN) with a shift convolution operator; by adding a shift operation to the original spatial and temporal graph convolutions, messages can be passed effectively between adjacent nodes, and better results are achieved with fewer parameters and less computation. The Shift graph convolutional network contains a spatial shift graph convolution module and a temporal shift graph convolution module, so it can fuse information between nodes of the same frame in the spatio-temporal graph as well as information of key points across frames.
The spatial shift graph convolution module comes in two forms, local and global shift graph convolution. In local shift graph convolution, the receptive field is defined by the physical facial key point structure predefined for the micro-expression data set; this considers only the inherent connections between key points and makes it difficult to mine potential long-range relationships. Global shift graph convolution removes the restriction of physical inherent connections and turns the single-frame facial structure graph into a complete graph, so that the receptive field of each key point covers the whole facial key point spatial graph. In global shift graph convolution the connection strength between different nodes is the same, yet the importance of different facial key points differs, so an adaptive global shift mechanism is introduced: the shifted features are multiplied element-wise with a learnable mask to mine the important connection information between facial key points, which can be expressed by the following formula:
F_M = F · Mask = F · (tanh(M) + 1)  (6)
where F is the node feature after the shift operation, M is the mask information, and F_M is the feature information obtained after adaptive importance weighting.
The lightweight Shift-GCN network consists mainly of a global spatial shift graph convolution module and a temporal shift graph convolution module. For the fully connected spatio-temporal graph of facial key points constructed in step 1-3, the input feature dimension of each data set sample is defined as T×V×C, where T is the number of key point frames, set to 3, representing the temporal dimension of the micro-expression motion change; V is the number of key point vertices in each frame of the sequence, set to 12; and C is the feature dimension of each key point, set to 2, representing the horizontal and vertical coordinates of a single facial key point. To match the dimension of the feature vector output by the optical flow channel, the number of output feature channels of the global spatial shift graph convolution in the network model is set to 16, and the final 12×16 node features are converted by a reshape operation into a 192-dimensional deep feature vector containing rich spatio-temporal information, which is the output of the facial structure feature channel and is used in the following steps.
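A hedged sketch of the adaptive global spatial shift of Eq. (6) is given below. Taking channel c of node v from node (v + c) mod V follows one common formulation of the non-local shift in Shift-GCN, and the point-wise linear transform and module name are assumptions of this sketch.

```python
# Sketch of a global spatial shift with adaptive importance weighting, F_M = F . (tanh(M) + 1).
import torch
import torch.nn as nn

class GlobalShiftGC(nn.Module):
    def __init__(self, num_nodes=12, in_dim=2, out_dim=16):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)            # point-wise feature transform
        self.mask = nn.Parameter(torch.zeros(num_nodes, out_dim))   # learnable mask M
        idx = torch.arange(num_nodes).unsqueeze(1) + torch.arange(out_dim).unsqueeze(0)
        self.register_buffer("shift_idx", idx % num_nodes)   # (V, C) source-node index per channel

    def forward(self, x):                                    # x: (B, T, V, in_dim) key-point coords
        f = self.linear(x)                                   # (B, T, V, C)
        shifted = torch.gather(
            f, dim=2,
            index=self.shift_idx.expand(f.shape[0], f.shape[1], -1, -1))
        return shifted * (torch.tanh(self.mask) + 1.0)       # adaptive importance weighting
```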
Step 3, obtaining the micro-expression recognition result through multi-scale feature fusion combined with an attention mechanism, based on the deep optical flow features and the facial structure features.
Simply concatenating features cannot reveal the actual importance of the individual modality information, and the feature fusion strategy plays a vital role in multi-modal learning approaches. Placing an attention mechanism on top of the extracted modality features helps the system concentrate on the informative modality, which can be understood intuitively as assigning a weighted score to each modality to represent the importance of that branch. For the deep optical flow feature F_flow and the facial structure feature F_landmark obtained in step 2, the embedded feature of each modality is first learned by an encoder composed of two fully connected layers with 64 and 1 output feature channels, and a softmax activation function then generates the soft attention learning weights α for the different modalities, which can be expressed by the following formula:
α = softmax(tanh(W_f [F_flow, F_landmark] + b_f))  (7)
where W_f and b_f are trainable fusion attention parameters and α is a 2-dimensional vector whose components are the soft attention weight coefficients of the optical flow modality and the key point spatio-temporal graph modality. To preserve the original features, 1+α is adopted as the new weight after self-learning; the attention-weighted features of the two modalities are concatenated as the output of the feature fusion module, the final deep feature vector is obtained through a fully connected layer, and a softmax classifier finally makes the classification decision.
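The following PyTorch sketch illustrates this fusion head. The per-modality encoders (fully connected layers with 64 and 1 output channels) and the 1+α re-weighting follow the description above, while the hidden size of the final fully connected layer and the number of expression classes are assumptions.

```python
# Sketch of the attention feature-fusion head: modality scores -> softmax weights alpha ->
# (1 + alpha) re-weighting -> concatenation -> fully connected layer -> classifier.
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    def __init__(self, feat_dim=192, num_classes=3):        # num_classes is assumed
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))
            for _ in range(2)                                # one encoder per modality
        ])
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),         # assumed hidden size
            nn.Linear(128, num_classes))

    def forward(self, f_flow, f_landmark):                   # each: (B, 192)
        scores = torch.cat([self.encoders[0](f_flow),
                            self.encoders[1](f_landmark)], dim=1)   # (B, 2)
        alpha = torch.softmax(scores, dim=1)                 # soft attention weights
        fused = torch.cat([(1 + alpha[:, :1]) * f_flow,
                           (1 + alpha[:, 1:]) * f_landmark], dim=1)
        return self.classifier(fused)                        # logits for the softmax classifier
```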
Micro-expression recognition was performed on the Full, SMIC, CASME II and SAMM data sets with the proposed method and with prior techniques; the results are shown in Table 1. Compared with a traditional method (Bi-WOOF) and deep-learning-based methods (AlexNet, OFF-ApexNet, CapsuleNet, Dual-Inception, RCN-A, STSTNet), the proposed method achieves the best performance. Compared with the single STSTNet method, the UF1 and UAR metrics on the Full composite data set improve by 4.51% and 1.76% respectively, and better results are also obtained on the SMIC and SAMM data sets, which demonstrates the superiority of the method and its ability to effectively improve the accuracy of micro-expression recognition.
TABLE 1

Claims (5)

1. The micro-expression recognition method based on the attention feature fusion of the facial key points is characterized by comprising the following steps of:
step 1, locating the face and the facial key points, and obtaining shallow optical flow features and a facial structure graph;
step 2, extracting deep optical flow features and facial structure features through neural networks based on the optical flow features and the facial structure graph;
and step 3, obtaining a micro-expression recognition result through multi-scale feature fusion combined with an attention mechanism, based on the deep optical flow features and the facial structure features.
2. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 1, wherein the specific process of step 1 is as follows:
step 1-1, face cropping and facial key point positioning
performing face detection through the face detection algorithm in the open-source Dlib library to obtain the facial key point information, solving the maximum and minimum horizontal and vertical coordinates of the key points, and determining the cropping range;
cropping each frame of every micro-expression video sample in the data set to generate a micro-expression image sequence, selecting the n most representative key points according to where micro-expression motion occurs to form a topological graph of the facial structure, and capturing the changes of facial micro-expression motion; quantifying the facial correlations between the key points in the form of an adjacency matrix;
step 1-2, shallow optical flow feature extraction based on the TV-L1 optical flow model
extracting optical flow information in the horizontal and vertical directions from the start frame and the peak frame of the micro-expression image sequence by using a TV-L1 optical flow model, and calculating the optical strain feature;
then superimposing the optical flow information in the two directions with the optical strain feature to form the shallow optical flow feature, representing the local motion of the facial micro-expression; taking each selected key point as a center coordinate and expanding it outward into an m×m three-dimensional optical flow block;
step 1-3, constructing a facial structure graph based on the facial key point sequence
on the basis of each frame of the micro-expression image sequence, combining the facial key point information of step 1-1 to construct the spatial relationships between key points within a single frame and the motion relationships between key points of adjacent frames, and constructing a facial structure graph oriented to micro-expression state change based on the start frame, the peak frame and the end frame.
3. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 2, wherein the specific process of step 2 is as follows:
step 2-1, extracting deep optical flow features based on key points
according to step 1-2, each data set sample is converted into n shallow optical flow blocks of size m×m×3, the shallow triple-stream three-dimensional network STSTNet is selected as the baseline model, and the shallow optical flow features are converted into high-level features with deeper semantics to obtain the deep optical flow features;
step 2-2, extracting facial structure features based on key points
extracting feature information from the spatio-temporal graph through a lightweight Shift graph convolutional network, based on the spatial and motion relationships formed by the facial key points in step 1-3, to obtain the facial structure features.
4. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 3, wherein the specific process of step 2-1 is as follows: according to the facial graph structure proposed in step 1-1, each key point serves as a vertex of the graph, the shallow optical flow feature vector extracted from the optical flow block corresponding to each key point serves as the node feature of that vertex, and the natural connection relationships between the facial key points serve as the adjacency matrix; after STSTNet, a GCN aggregates and extracts the discriminative deep spatio-temporal features contained in the optical flow features to obtain the deep optical flow features.
5. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 4, wherein the specific process of step 3 is as follows:
according to the deep optical flow features and the facial structure features obtained in step 2, each is fed into an encoder for embedded learning, a softmax activation function generates soft attention learning weights α for the different modalities, 1+α is adopted as the new weight after self-learning, the attention-weighted features of the two modalities are concatenated as the output of the feature fusion module, the final deep feature vector is obtained through a fully connected layer, and finally a classifier makes the classification decision and outputs the micro-expression recognition result.
CN202311579931.7A 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points Pending CN117576753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311579931.7A CN117576753A (en) 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311579931.7A CN117576753A (en) 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points

Publications (1)

Publication Number Publication Date
CN117576753A true CN117576753A (en) 2024-02-20

Family

ID=89860408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311579931.7A Pending CN117576753A (en) 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points

Country Status (1)

Country Link
CN (1) CN117576753A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974853A (en) * 2024-03-29 2024-05-03 成都工业学院 Self-adaptive switching generation method, system, terminal and medium for homologous micro-expression image
CN117974853B (en) * 2024-03-29 2024-06-11 成都工业学院 Self-adaptive switching generation method, system, terminal and medium for homologous micro-expression image

Similar Documents

Publication Publication Date Title
CN108520535B (en) Object classification method based on depth recovery information
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109360156A (en) Single image rain removing method based on the image block for generating confrontation network
Storey et al. 3DPalsyNet: A facial palsy grading and motion recognition framework using fully 3D convolutional neural networks
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN112507920B (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN112288627A (en) Recognition-oriented low-resolution face image super-resolution method
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN117576753A (en) Micro-expression recognition method based on attention feature fusion of facial key points
CN113689382A (en) Tumor postoperative life prediction method and system based on medical images and pathological images
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
CN113486700A (en) Facial expression analysis method based on attention mechanism in teaching scene
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN113989928A (en) Motion capturing and redirecting method
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
Liu et al. Stereo video object segmentation using stereoscopic foreground trajectories
Chen et al. Intra-and inter-reasoning graph convolutional network for saliency prediction on 360° images
CN114066844A (en) Pneumonia X-ray image analysis model and method based on attention superposition and feature fusion
CN113298018A (en) False face video detection method and device based on optical flow field and facial muscle movement
CN113763417A (en) Target tracking method based on twin network and residual error structure
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
Yaseen et al. A novel approach based on multi-level bottleneck attention modules using self-guided dropblock for person re-identification
CN112069943A (en) Online multi-person posture estimation and tracking method based on top-down framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination