CN117576753A - Micro-expression recognition method based on attention feature fusion of facial key points - Google Patents

Micro-expression recognition method based on attention feature fusion of facial key points

Info

Publication number
CN117576753A
CN117576753A (Application No. CN202311579931.7A)
Authority
CN
China
Prior art keywords
optical flow
micro
key points
facial
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311579931.7A
Other languages
Chinese (zh)
Inventor
邵艳利
郑万闯
王兴起
方景龙
魏丹
陈滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202311579931.7A priority Critical patent/CN117576753A/en
Publication of CN117576753A publication Critical patent/CN117576753A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174: Facial expression recognition
    • G06V40/176: Dynamic expression
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451: Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454: Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168: Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a micro-expression recognition method based on attention feature fusion of facial key points. First, the face and its key points are located to obtain shallow optical flow features and a facial structure graph. Second, deep optical flow features and facial structure features are extracted from the optical flow features and the facial structure graph through neural networks. Finally, the micro-expression recognition result is obtained from the deep optical flow features and the facial structure features through multi-scale feature fusion combined with an attention mechanism. The invention focuses accurately on important features and contextual information, improves the generalization ability of the model and its ability to handle individual samples, and improves the accuracy and robustness of the micro-expression recognition task.

Description

Micro-expression recognition method based on attention feature fusion of facial key points
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a micro-expression recognition method based on attention feature fusion of facial key points.
Background
A micro-expression is a brief, subtle and involuntary change of facial expression that usually appears when a person tries to conceal their true emotion; it can therefore be regarded as a leakage of genuine emotion and reflects a person's underlying affective state. Detecting and recognizing micro-expressions thus has significant research and application value in scenarios such as criminal investigation, clinical medicine, business negotiation and public safety.
Current micro-expression recognition methods fall into two main categories: traditional hand-crafted feature extraction methods and deep-learning-based recognition methods. Traditional methods extract features manually and, depending on the algorithm, can be divided into texture-based features and geometric-transformation-based features. Texture-based methods use apparent texture as the micro-expression feature, whereas geometric-transformation-based methods use optical flow and facial key point information to extract discriminative micro-expression features and tend to achieve higher recognition rates than texture-based methods. In recent years, with the continuous development of deep learning, deep convolutional neural networks have achieved remarkable results in computer vision tasks such as object detection, semantic segmentation and image processing, owing to their strong feature extraction capability and ability to model complex problems. Features learned in the deep-learning setting can be grouped into those extracted by convolutional neural networks, by long short-term memory (LSTM) architectures, by graph convolutional neural networks, and by convolutional neural networks combined with attention mechanisms; thanks to the strong feature learning capability of these models, they generally deliver better performance.
Although deep-learning-based micro-expression recognition outperforms traditional methods built on hand-crafted features, micro-expression motion has a small amplitude, a short duration and is localized to specific facial regions, which makes micro-expressions hard to capture and analyze and increases the difficulty of detection and recognition. The data set samples also show obvious distribution differences: the features of minority classes cannot be fully characterized, so the model is not robust to these classes, classifies them poorly and overfits the data set, limiting its practicality. In addition, most micro-expression recognition research focuses on obtaining ever more refined expression features by continually deepening the network and increasing its complexity, which causes network instability and feature redundancy, gradually raises the hardware requirements and increases cost.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a micro-expression recognition method based on attention feature fusion of facial key points. The locations of micro-expression motion are positioned through the facial key points, features are extracted from the bimodal combination of optical flow and facial structure key points, an attention mechanism adaptively assigns weights to the two modalities, and multi-scale feature fusion is finally completed. This overcomes the locality of micro-expression motion and the subtle differences between data set samples, and improves the accuracy and efficiency of micro-expression recognition.
A micro-expression recognition method based on attention feature fusion of facial key points specifically comprises the following steps:
Step 1, locating the face and the facial key points, and obtaining shallow optical flow features and a facial structure graph.
Step 1-1, face clipping and face key point positioning
Face detection is the first step of automatic micro-expression analysis; to remove the influence of background noise, only the part of the image containing the face is retained. Face detection is performed with the face detection algorithm in the open-source Dlib library, which yields 68 facial key points; face cropping then takes the maximum and minimum horizontal and vertical coordinates of these 68 points to determine the cropping range. Each frame of every micro-expression video sample in the data set is cropped in this way to generate a micro-expression image sequence. Research shows that the key points along the facial contour change little between different micro-expression states, so, according to where micro-expression motion occurs, the n most representative key points around the mouth, eyebrows and nose are selected to form a topological graph of the facial structure and capture the subtle motion of facial micro-expressions. The facial correlations between these key points are quantified in the form of an adjacency matrix and used in the subsequent feature extraction steps.
Step 1-2, shallow optical flow features based on the TV-L1 optical flow model
A TV-L1 optical flow model extracts the optical flow information in the horizontal and vertical directions between the start frame and the peak frame of the micro-expression image sequence, and the optical strain feature is then calculated from it. The optical strain feature represents the degree of facial deformation and is not easily affected by illumination or facial occlusion; the optical flow information in the two directions is then superimposed with the optical strain feature to form the shallow optical flow feature, which effectively represents the local motion of the facial micro-expression. To reduce the negative influence of overall facial noise motion, more efficient features are obtained on the basis of the facial structure graph constructed in step 1-1: each selected key point is taken as a center coordinate and expanded outward into an m×m three-dimensional optical flow block, a size that captures fine and effective micro-expression motion features while avoiding the loss of useful information around the key points.
Step 1-3, constructing a facial structure graph based on the facial key point sequence
The start frame, the peak frame and the end frame of the micro-expression image sequence represent the key stages of facial muscle motion when a micro-expression occurs; these three frames contain rich motion information while removing a large number of redundant frames from the video. When micro-expression motion occurs, the facial muscles move and the key points move with them, so the facial key point coordinates contain the motion information of the micro-expression. On the basis of each frame, the facial key point information described in step 1-1 is combined to construct the spatial relationships between key points within a single frame and the motion relationships between key points of adjacent frames, which are the keys to recognizing micro-expression motion; a facial structure graph oriented to micro-expression state change is thus constructed on the basis of the start frame, the peak frame and the end frame.
Step 2, extracting deep optical flow features and facial structure features through neural networks based on the optical flow features and the facial structure graph.
Step 2-1, extracting deep optical flow features based on key points
According to step 1-2, each data set sample is converted into n shallow optical flow blocks of size m×m×3. A shallow triple-stream three-dimensional network (Shallow Triple Stream Three-dimensional Net, STSTNet) is selected as the baseline model to analyze each optical flow block and convert the shallow optical flow features into high-level features with deeper semantics, yielding deep feature vector representations, i.e. the deep optical flow features.
To extract the feature information of each key point in a data set sample with a graph convolutional network, according to the facial graph structure proposed in step 1-1, each key point serves as a vertex of the graph, the shallow optical flow feature extracted from the optical flow block corresponding to that key point serves as the node feature of the vertex, and the natural connection relationships between the facial key points serve as the adjacency matrix; finally, the GCN aggregates and extracts the discriminative deep spatio-temporal features contained in the optical flow features to obtain the deep optical flow features.
Step 2-2, extracting facial structural features based on key points
Based on the spatial and motion relationships formed by the facial key points in step 1-3, feature information is extracted from the spatio-temporal graph through a lightweight Shift graph convolutional network to obtain the facial structure features. The Shift graph convolutional network contains a spatial shift graph convolution module and a temporal shift graph convolution module, so it can fuse information between nodes of the same frame in the spatio-temporal graph as well as information of key points across frames. Shift-GCN is obtained by replacing the traditional convolution operator of a conventional GCN with a shift convolution operator, so it achieves better results with fewer parameters and less computation; the global shift graph convolution adaptively learns the relationships between different facial key points, and the learnable adjacency matrix it introduces increases the flexibility of model learning and removes the limitation of fixed node relationships in a predefined adjacency matrix.
Step 3, obtaining the micro-expression recognition result through multi-scale feature fusion combined with an attention mechanism, based on the deep optical flow features and the facial structure features.
Feature fusion strategies play a critical role in multi-modal learning approaches, and attention mechanisms have been used successfully to refine the fusion weights of different modalities. The deep optical flow features and the facial structure features obtained in step 2 are each fed into an encoder for embedded learning, a softmax activation function generates soft attention weights α for the different modalities, and the weights are multiplied with the original features to obtain new features. To preserve the original features, 1+α is adopted as the new weight after self-learning; the attention-weighted features of the two modalities are concatenated as the output of the feature fusion module, the final deep feature vector is obtained through a fully connected layer, and a classifier makes the final classification decision and outputs the micro-expression recognition result.
The invention has the following beneficial effects:
1. An end-to-end multi-channel network model based on facial key points is proposed, so that the model pays more attention to the facial regions that carry more micro-expression motion information, overcoming the locality and subtlety of micro-expression motion. The network model consists of two channels, optical flow features and facial structure features, both extracted on the basis of the facial key points; extracting temporal and spatial features from heterogeneous data enriches the sample features of the data set and improves the recognition performance of the network model.
2. The lightweight network based on Shift-GCN has few parameters, a small computation cost and a short inference time, so it is well suited to the small sample sizes of micro-expression data sets and weakens the dependence of network model training on the amount of data. While maintaining training speed, the global spatial shift graph convolution module lets the model adaptively learn the relationships between facial key points, overcoming the limitation of fixed connections in a predefined facial structure graph, and the temporal shift graph convolution module better captures the inter-frame motion information of the face sequence, helping the model extract discriminative high-level features and obtain deep feature vector representations of the samples.
3. The attention feature fusion module adaptively learns the weights of different features through the encoders and can dynamically allocate attention to data of different modalities, focusing more accurately on important features and contextual information; this improves the generalization ability of the model and its ability to handle individual samples, and improves the accuracy and robustness of the micro-expression recognition task.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is the superimposed optical flow map formed from the horizontal and vertical optical flow extracted by the optical flow method and the optical strain subsequently calculated from them;
FIG. 3 is the key-point-based facial structure graph constructed from the start frame, the peak frame and the end frame of the micro-expression image sequence.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in fig. 1, the micro-expression recognition method based on attention feature fusion of facial key points is divided into three parts. Part 1: preprocessing of the optical flow features and facial structure features based on the facial key points. Part 2: deep feature extraction through an end-to-end dual-channel network based on the facial key points. Part 3: the attention feature fusion module adaptively assigns weight coefficients to the channels, and a classifier produces the final result. The specific steps are as follows:
Step 1, locating the face and the facial key points, and obtaining shallow optical flow features and a facial structure graph.
Step 1-1, facial key point positioning
To remove the influence of background noise, face detection is performed on the micro-expression sequence, which facilitates the extraction of micro-expression feature information and reduces the size of the image data fed into the model. Face detection uses the open-source convolutional-neural-network-based Dlib algorithm: the micro-expression image sequence is input, the model detects the face and returns 68 facial key points, and the maximum and minimum horizontal and vertical coordinates of these key points determine the cropping range for face cropping. By analyzing where micro-expression motion occurs, 12 representative key points are selected to form the facial structure graph, comprising the coordinates of 6 eyebrow, 2 nose and 4 mouth key points; these locations carry a large amount of feature information when the micro-expression state changes. The facial correlations between these key points are quantified in the form of an adjacency matrix and used in the subsequent feature extraction steps.
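For illustration, a minimal Python sketch of this preprocessing step is given below, assuming the open-source Dlib frontal face detector and its standard 68-point shape predictor; the 12 landmark indices listed here are only an assumption for illustration, since the description specifies the counts (6 eyebrow, 2 nose and 4 mouth points) but not the exact indices.

```python
# Hypothetical preprocessing sketch: face cropping and key-point selection with Dlib.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Assumed indices into the 68-point Dlib layout (eyebrows, nose, mouth regions).
SELECTED_IDX = [17, 19, 21, 22, 24, 26,   # 6 eyebrow points
                31, 35,                    # 2 nose points
                48, 51, 54, 57]            # 4 mouth points

def crop_face_and_landmarks(frame_bgr):
    """Detect the face, return the cropped face image and the 12 selected key points."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if len(faces) == 0:
        return None, None
    shape = predictor(gray, faces[0])
    pts = np.array([[p.x, p.y] for p in shape.parts()])   # (68, 2) landmark coordinates
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)                         # cropping range from coordinate extrema
    face = frame_bgr[y_min:y_max, x_min:x_max]
    # Selected key points are re-expressed in the cropped coordinate system.
    selected = pts[SELECTED_IDX] - np.array([x_min, y_min])
    return face, selected
```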
Step 1-2, shallow optical flow feature extraction
Optical flow feature extraction is performed on the micro-expression data in the data set on the basis of the key points. In fig. 1, the image sequence fed to the Input is the start frame, the peak frame and the end frame, where the start frame is the still frame before the micro-expression motion occurs, the peak frame is the frame at which the motion amplitude reaches its peak, and the end frame is the frame at which the face has returned to its normal state. The TV-L1 optical flow method extracts the local motion information of the facial micro-expression between the start frame and the peak frame, and the extracted two-dimensional optical flow field represents the magnitude and direction of motion of each pixel between the two frames. Strain measures the degree of deformation of an object under an external force and effectively reflects the regions where facial micro-expression motion occurs; for sufficiently small facial pixel motion it represents the deformation of the facial muscle tissue, which gives better performance in the micro-expression recognition task. Given a two-dimensional optical flow vector field, the optical strain feature can be derived to describe the facial motion pattern. By adding the optical strain feature to the optical flow field, each micro-expression sample can be represented as a triplet of optical flow features, and the three optical flow feature maps are superimposed as shown in fig. 2. According to the facial key points selected in step 1-1, each key point is taken as a center coordinate and expanded outward into an 11×11 rectangle; the resulting optical flow blocks represent the optical flow motion information around the key points, and all optical flow blocks finally serve as the input features of the optical flow channel in the subsequent steps.
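The following hedged Python sketch illustrates this step, assuming the TV-L1 implementation from opencv-contrib and one common definition of the optical strain magnitude; the description does not spell out the exact strain formula, so that part is an assumption.

```python
# A minimal sketch of the shallow optical flow feature: TV-L1 flow between start and peak
# frames, an assumed optical-strain magnitude, and 11x11 blocks around the key points.
import cv2
import numpy as np

def shallow_flow_features(start_gray, peak_gray, keypoints, patch=11):
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()       # requires opencv-contrib
    flow = tvl1.calc(start_gray, peak_gray, None)          # (H, W, 2): u, v components
    u, v = flow[..., 0], flow[..., 1]
    ux, uy = np.gradient(u, axis=1), np.gradient(u, axis=0)
    vx, vy = np.gradient(v, axis=1), np.gradient(v, axis=0)
    exy = 0.5 * (uy + vx)
    strain = np.sqrt(ux ** 2 + vy ** 2 + 2 * exy ** 2)     # assumed strain-magnitude formula
    stacked = np.stack([u, v, strain], axis=-1)            # (H, W, 3) shallow optical flow feature
    half = patch // 2
    blocks = []
    # Key-point coordinates are assumed to be integer pixels at least `half` px from the border.
    for (x, y) in keypoints:                               # the 12 selected key points
        blocks.append(stacked[y - half:y + half + 1, x - half:x + half + 1, :])
    return np.array(blocks)                                # (12, 11, 11, 3) optical flow blocks
```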
Step 1-3, facial structure feature extraction
A facial micro-expression is a dynamic process. To extract the temporal variation information of the key points in each frame with a graph convolutional network, a spatio-temporal graph G = (V, E) is constructed from the key point sequence, where V is the node set of the spatio-temporal graph, formed by the key points detected in the three consecutive expression frames, and E is the edge set of the spatio-temporal graph, composed of the edge set E_S between key points of the same frame and the edge set E_T between key points of adjacent frames.
The edge set E_S between key points of the same frame reflects the intrinsic relationships among the facial key points; different construction methods also determine the direction of information flow between key points of the same frame, which strongly affects the accuracy of the network model. Two connection modes are used for the construction. First, the eyebrows, mouth and nose are connected geometrically according to the geometric structure of the facial organs, as shown in fig. 3, which describes how each facial organ structure changes over time. Second, each key point is connected to all remaining key points of the same frame in a fully connected manner, because the facial organs are correlated with each other under different micro-expression states; for example, when the micro-expression type is positive, the eye corners drive the eyebrows to bend downward and the mouth corners rise, so even distant key points are correlated. The edges between key points of the same frame can be expressed as:

E_S = {v_{t,i} v_{t,j} | (i, j) ∈ G}  (1)

where N is the number of key points at time t in the sequence, G is the set of key point index pairs under the chosen construction method, each pair indicating that the two indexed key points are connected, and v_{t,i} v_{t,j} is a binary variable: v_{t,i} v_{t,j} = 1 when key point i is connected to key point j, and v_{t,i} v_{t,j} = 0 otherwise.
The information on the spatio-temporal graph must propagate not only between key points of the same frame but also between key points of different frames, so the key points of adjacent frames need to be connected. Connecting the key points with the same index between adjacent frames yields the edge set E_T, which can be expressed as:

E_T = {v_{t,i} v_{(t+1),j} | i, j ∈ [1, N]}  (2)

where v_{t,i} v_{(t+1),j} is also a binary variable: v_{t,i} v_{(t+1),j} = 1 when key point i = j, and v_{t,i} v_{(t+1),j} = 0 otherwise.
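A minimal sketch of how such a spatio-temporal adjacency matrix over 3 frames × 12 key points could be assembled is shown below; the helper name and the fully connected default are illustrative assumptions.

```python
# Sketch of the spatio-temporal graph G = (V, E): intra-frame edges E_S (here fully
# connected; an organ-wise pair list would replace `same_frame_pairs`) and inter-frame
# edges E_T linking the same key point in adjacent frames.
import numpy as np

T, N = 3, 12                                    # frames (start, peak, end) and key points

def build_adjacency(fully_connected=True, organ_pairs=None):
    A = np.zeros((T * N, T * N), dtype=np.float32)
    # E_S: edges between key points of the same frame.
    if fully_connected:
        same_frame_pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
    else:
        same_frame_pairs = organ_pairs           # pairs following the facial organ geometry
    for t in range(T):
        for i, j in same_frame_pairs:
            A[t * N + i, t * N + j] = 1.0
    # E_T: edges between the same key point in adjacent frames (both directions).
    for t in range(T - 1):
        for i in range(N):
            A[t * N + i, (t + 1) * N + i] = 1.0
            A[(t + 1) * N + i, t * N + i] = 1.0
    return A
```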
Step 2, extracting deep optical flow features and facial structure features through neural networks based on the optical flow features and the facial structure graph.
Step 2-1, extracting facial optical flow block features based on key points
According to step 1-2, each data set sample is converted into 12 shallow optical flow blocks of size 11×11×3. The STSTNet model is selected as the baseline to analyze each optical flow block and convert the shallow optical flow features into high-level features with deeper semantics; after this computation each optical flow block becomes a 64-dimensional deep feature vector, which serves as a node feature in the graph convolutional network of the next stage.
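A hedged PyTorch sketch of such a per-block encoder is given below. The three parallel shallow streams with 3, 5 and 8 filters follow the published STSTNet design, while the pooling layout and the 64-dimensional output head are assumptions made for this sketch rather than the exact architecture used here.

```python
# Sketch of a shallow three-stream block encoder in the spirit of STSTNet: one 11x11x3
# optical flow block is mapped to a 64-d node feature for the subsequent GCN.
import torch
import torch.nn as nn

class BlockEncoder(nn.Module):
    def __init__(self, out_dim=64):
        super().__init__()
        self.streams = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, c, kernel_size=3, padding=1),
                          nn.ReLU(), nn.MaxPool2d(2))
            for c in (3, 5, 8)                     # three shallow parallel conv streams
        ])
        self.fc = nn.Linear((3 + 5 + 8) * 5 * 5, out_dim)

    def forward(self, x):                          # x: (B, 3, 11, 11) optical flow block
        feats = torch.cat([s(x) for s in self.streams], dim=1)   # (B, 16, 5, 5)
        return self.fc(feats.flatten(1))           # (B, 64) deep node feature
```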
Standard convolution operates on local regions of Euclidean-structured data and captures the most important pixel information in an image, whereas graph convolution learns the relationships between object nodes in non-Euclidean data. Graph convolution can be viewed as passing data between different nodes, with the goal of learning a function f that updates the node features of every node layer by layer. The adjacency matrix A and the node features X of the graph convolution can be expressed by the following formula:
A ∈ R^{n×n}, X ∈ R^{d×n}  (3)

where n is the number of nodes, d is the dimension of each node feature, and R is the set of real numbers.
In the graph convolution operation, the input of each layer, i.e. the node features, is updated by the convolution propagation function f and can be denoted H^l, where l is the index of the current convolution layer; the input of the first layer is the original node features, i.e. H^0 = X. In general, each graph convolution layer can be expressed as:

H^l = f(H^{l-1}, A)  (4)

Since f is a convolution propagation function, Eq. (4) can be further expanded to:

H^l = σ(A H^{l-1} W^{l-1})  (5)

where σ is a nonlinear activation function and W^{l-1} ∈ R^{d×d'} is the trainable weight matrix of the l-th graph convolution layer, with d and d' the input and output dimensions of that layer. Like standard convolutions, graph convolution operations can be stacked into multiple layers, and after several iterative aggregation operations the stacked GCN model can learn the dependencies between nodes.
To extract the feature information of each key point in a data set sample with the graph convolutional network, a spatial graph is constructed according to the facial structure graph built in step 1-1: each key point is a vertex of the graph, the feature vector extracted from the optical flow block corresponding to each key point is the node feature X ∈ R^{64×12} of that vertex, and the natural connection relationships between the facial key points form the adjacency matrix A ∈ R^{12×12}. Feature aggregation is performed with two graph convolution layers, where the input and output dimensions of the first layer are set to 64 and 32 and those of the second layer are set to 32 and 16. Finally, the 12×16 node features are converted by a reshape operation into a 192-dimensional deep optical flow feature with discriminative information, which is the output of the whole optical flow channel.
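The following PyTorch sketch illustrates this two-layer graph convolution of Eq. (5) with the dimensions stated above (64 to 32 to 16, reshaped to 192). Storing the node features node-major, i.e. as a (12, 64) matrix per sample, and omitting adjacency normalization to mirror Eq. (5) literally, are implementation assumptions.

```python
# Minimal two-layer GCN over the 12 key-point nodes, H^l = sigma(A H^{l-1} W^{l-1}).
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    def __init__(self, adjacency):                 # adjacency: (12, 12) float tensor
        super().__init__()
        self.register_buffer("A", adjacency)
        self.W1 = nn.Linear(64, 32, bias=False)    # first layer: 64 -> 32
        self.W2 = nn.Linear(32, 16, bias=False)    # second layer: 32 -> 16
        self.act = nn.ReLU()

    def forward(self, H0):                         # H0: (B, 12, 64) node features
        H1 = self.act(self.A @ self.W1(H0))        # (B, 12, 32)
        H2 = self.act(self.A @ self.W2(H1))        # (B, 12, 16)
        return H2.flatten(1)                       # (B, 192) deep optical flow feature
```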
Step 2-2, extracting facial structural features based on key points
Feature information is extracted from the spatio-temporal graph through a lightweight Shift graph convolutional network on the basis of the key-point facial structure graph built in step 1-3. Shift-GCN is obtained by replacing the traditional convolution operator of the spatio-temporal graph convolutional network (Spatial Temporal Graph Convolutional Networks, STGCN) with a shift convolution operator; by adding a shift operation to the original spatial and temporal graph convolutions, messages can be passed effectively between adjacent nodes, and better results are achieved with fewer parameters and less computation. The Shift graph convolutional network contains a spatial shift graph convolution module and a temporal shift graph convolution module, so it can fuse information between nodes of the same frame in the spatio-temporal graph as well as information of key points across frames.
The spatial shift graph convolution module comes in two forms, local and global shift graph convolution. In local shift graph convolution, the receptive field is defined by the physical facial key point structure predefined for the micro-expression data set; this considers only the inherent connections between key points and makes it difficult to mine potential long-range relationships. Global shift graph convolution removes the restriction of physical inherent connections and turns the single-frame facial structure graph into a complete graph, so that the receptive field of each key point covers the whole facial key point spatial graph. In global shift graph convolution the connection strength between different nodes is the same, yet the importance of different facial key points differs, so an adaptive global shift mechanism is introduced: the shifted features are multiplied element-wise with a learnable mask to mine the important connection information between facial key points, which can be expressed by the following formula:
F_M = F · Mask = F · (tanh(M) + 1)  (6)
where F is the node feature after the shift operation, M is the mask information, and F_M is the feature information obtained after adaptive importance weighting.
The lightweight Shift-GCN network consists mainly of a global spatial shift graph convolution module and a temporal shift graph convolution module. For the fully connected spatio-temporal graph of facial key points constructed in step 1-3, the input feature dimension of each data set sample is defined as T×V×C, where T is the number of key point frames, set to 3, representing the temporal dimension of the micro-expression motion change; V is the number of key point vertices in each frame of the sequence, set to 12; and C is the feature dimension of each key point, set to 2, representing the horizontal and vertical coordinates of a single facial key point. To match the dimension of the feature vector output by the optical flow channel, the number of output feature channels of the global spatial shift graph convolution in the network model is set to 16, and the final 12×16 node features are converted by a reshape operation into a 192-dimensional deep feature vector containing rich spatio-temporal information, which is the output of the facial structure feature channel and is used in the following steps.
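A hedged sketch of the adaptive global spatial shift of Eq. (6) is given below. Taking channel c of node v from node (v + c) mod V follows one common formulation of the non-local shift in Shift-GCN, and the point-wise linear transform and module name are assumptions of this sketch.

```python
# Sketch of a global spatial shift with adaptive importance weighting, F_M = F . (tanh(M) + 1).
import torch
import torch.nn as nn

class GlobalShiftGC(nn.Module):
    def __init__(self, num_nodes=12, in_dim=2, out_dim=16):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)            # point-wise feature transform
        self.mask = nn.Parameter(torch.zeros(num_nodes, out_dim))   # learnable mask M
        idx = torch.arange(num_nodes).unsqueeze(1) + torch.arange(out_dim).unsqueeze(0)
        self.register_buffer("shift_idx", idx % num_nodes)   # (V, C) source-node index per channel

    def forward(self, x):                                    # x: (B, T, V, in_dim) key-point coords
        f = self.linear(x)                                   # (B, T, V, C)
        shifted = torch.gather(
            f, dim=2,
            index=self.shift_idx.expand(f.shape[0], f.shape[1], -1, -1))
        return shifted * (torch.tanh(self.mask) + 1.0)       # adaptive importance weighting
```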
Step 3, obtaining the micro-expression recognition result through multi-scale feature fusion combined with an attention mechanism, based on the deep optical flow features and the facial structure features.
Simply concatenating features cannot reveal the actual importance of the individual modality information, and the feature fusion strategy plays a vital role in multi-modal learning approaches. Placing an attention mechanism on top of the extracted modality features helps the system concentrate on the informative modality, which can be understood intuitively as assigning a weighted score to each modality to represent the importance of that branch. For the deep optical flow feature F_flow and the facial structure feature F_landmark obtained in step 2, the embedded feature of each modality is first learned by an encoder composed of two fully connected layers with 64 and 1 output feature channels, and a softmax activation function then generates the soft attention learning weights α for the different modalities, which can be expressed by the following formula:
α = softmax(tanh(W_f [F_flow, F_landmark] + b_f))  (7)
where W_f and b_f are trainable fusion attention parameters and α is a 2-dimensional vector whose components are the soft attention weight coefficients of the optical flow modality and the key point spatio-temporal graph modality. To preserve the original features, 1+α is adopted as the new weight after self-learning; the attention-weighted features of the two modalities are concatenated as the output of the feature fusion module, the final deep feature vector is obtained through a fully connected layer, and a softmax classifier finally makes the classification decision.
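The following PyTorch sketch illustrates this fusion head. The per-modality encoders (fully connected layers with 64 and 1 output channels) and the 1+α re-weighting follow the description above, while the hidden size of the final fully connected layer and the number of expression classes are assumptions.

```python
# Sketch of the attention feature-fusion head: modality scores -> softmax weights alpha ->
# (1 + alpha) re-weighting -> concatenation -> fully connected layer -> classifier.
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    def __init__(self, feat_dim=192, num_classes=3):        # num_classes is assumed
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim, 64), nn.Tanh(), nn.Linear(64, 1))
            for _ in range(2)                                # one encoder per modality
        ])
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),         # assumed hidden size
            nn.Linear(128, num_classes))

    def forward(self, f_flow, f_landmark):                   # each: (B, 192)
        scores = torch.cat([self.encoders[0](f_flow),
                            self.encoders[1](f_landmark)], dim=1)   # (B, 2)
        alpha = torch.softmax(scores, dim=1)                 # soft attention weights
        fused = torch.cat([(1 + alpha[:, :1]) * f_flow,
                           (1 + alpha[:, 1:]) * f_landmark], dim=1)
        return self.classifier(fused)                        # logits for the softmax classifier
```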
Micro-expression recognition was performed on the Full, SMIC, CASME II and SAMM data sets with the proposed method and with prior techniques; the results are shown in Table 1. Compared with a traditional method (Bi-WOOF) and deep-learning-based methods (AlexNet, OFF-ApexNet, CapsuleNet, Dual-Inception, RCN-A, STSTNet), the proposed method achieves the best performance. Compared with the single STSTNet method, the UF1 and UAR metrics on the Full composite data set improve by 4.51% and 1.76% respectively, and better results are also obtained on the SMIC and SAMM data sets, which demonstrates the superiority of the method and its ability to effectively improve the accuracy of micro-expression recognition.
TABLE 1

Claims (5)

1. The micro-expression recognition method based on the attention feature fusion of the facial key points is characterized by comprising the following steps of:
step 1, locating the face and the facial key points, and obtaining shallow optical flow features and a facial structure graph;
step 2, extracting deep optical flow features and facial structure features through neural networks based on the optical flow features and the facial structure graph;
and step 3, obtaining a micro-expression recognition result through multi-scale feature fusion combined with an attention mechanism, based on the deep optical flow features and the facial structure features.
2. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 1, wherein the specific process of step 1 is as follows:
step 1-1, face cropping and facial key point positioning
performing face detection through the face detection algorithm in the open-source Dlib library to obtain the facial key point information, solving the maximum and minimum horizontal and vertical coordinates of the key points, and determining the cropping range;
cropping each frame of every micro-expression video sample in the data set to generate a micro-expression image sequence, selecting the n most representative key points according to where micro-expression motion occurs to form a topological graph of the facial structure, and capturing the changes of facial micro-expression motion; quantifying the facial correlations between the key points in the form of an adjacency matrix;
step 1-2, shallow optical flow feature extraction based on the TV-L1 optical flow model
extracting optical flow information in the horizontal and vertical directions from the start frame and the peak frame of the micro-expression image sequence by using a TV-L1 optical flow model, and calculating the optical strain feature;
then superimposing the optical flow information in the two directions with the optical strain feature to form the shallow optical flow feature, representing the local motion of the facial micro-expression; taking each selected key point as a center coordinate and expanding it outward into an m×m three-dimensional optical flow block;
step 1-3, constructing a facial structure graph based on the facial key point sequence
on the basis of each frame of the micro-expression image sequence, combining the facial key point information of step 1-1 to construct the spatial relationships between key points within a single frame and the motion relationships between key points of adjacent frames, and constructing a facial structure graph oriented to micro-expression state change based on the start frame, the peak frame and the end frame.
3. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 2, wherein the specific process of step 2 is as follows:
step 2-1, extracting deep optical flow features based on key points
according to step 1-2, each data set sample is converted into n shallow optical flow blocks of size m×m×3, the shallow triple-stream three-dimensional network STSTNet is selected as the baseline model, and the shallow optical flow features are converted into high-level features with deeper semantics to obtain the deep optical flow features;
step 2-2, extracting facial structure features based on key points
extracting feature information from the spatio-temporal graph through a lightweight Shift graph convolutional network, based on the spatial and motion relationships formed by the facial key points in step 1-3, to obtain the facial structure features.
4. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 3, wherein the specific process of step 2-1 is as follows: according to the facial graph structure proposed in step 1-1, each key point serves as a vertex of the graph, the shallow optical flow feature vector extracted from the optical flow block corresponding to each key point serves as the node feature of that vertex, and the natural connection relationships between the facial key points serve as the adjacency matrix; after STSTNet, a GCN aggregates and extracts the discriminative deep spatio-temporal features contained in the optical flow features to obtain the deep optical flow features.
5. The micro-expression recognition method based on the attention feature fusion of the facial key points according to claim 4, wherein the specific process of step 3 is as follows:
according to the deep optical flow features and the facial structure features obtained in step 2, each is fed into an encoder for embedded learning, a softmax activation function generates soft attention learning weights α for the different modalities, 1+α is adopted as the new weight after self-learning, the attention-weighted features of the two modalities are concatenated as the output of the feature fusion module, the final deep feature vector is obtained through a fully connected layer, and finally a classifier makes the classification decision and outputs the micro-expression recognition result.
CN202311579931.7A 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points Pending CN117576753A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311579931.7A CN117576753A (en) 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311579931.7A CN117576753A (en) 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points

Publications (1)

Publication Number Publication Date
CN117576753A true CN117576753A (en) 2024-02-20

Family

ID=89860408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311579931.7A Pending CN117576753A (en) 2023-11-24 2023-11-24 Micro-expression recognition method based on attention feature fusion of facial key points

Country Status (1)

Country Link
CN (1) CN117576753A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117974853A (en) * 2024-03-29 2024-05-03 成都工业学院 Self-adaptive switching generation method, system, terminal and medium for homologous micro-expression image
CN117974853B (en) * 2024-03-29 2024-06-11 成都工业学院 Self-adaptive switching generation method, system, terminal and medium for homologous micro-expression image

Similar Documents

Publication Publication Date Title
CN108520535B (en) Object classification method based on depth recovery information
US11830230B2 (en) Living body detection method based on facial recognition, and electronic device and storage medium
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN109360156A (en) Single image rain removing method based on the image block for generating confrontation network
Storey et al. 3DPalsyNet: A facial palsy grading and motion recognition framework using fully 3D convolutional neural networks
CN112163498B (en) Method for establishing pedestrian re-identification model with foreground guiding and texture focusing functions and application of method
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN112507920B (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN112288627A (en) Recognition-oriented low-resolution face image super-resolution method
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN117576753A (en) Micro-expression recognition method based on attention feature fusion of facial key points
CN113689382A (en) Tumor postoperative life prediction method and system based on medical images and pathological images
Liu et al. APSNet: Toward adaptive point sampling for efficient 3D action recognition
CN113486700A (en) Facial expression analysis method based on attention mechanism in teaching scene
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN113989928A (en) Motion capturing and redirecting method
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
Liu et al. Stereo video object segmentation using stereoscopic foreground trajectories
Chen et al. Intra-and inter-reasoning graph convolutional network for saliency prediction on 360° images
CN114066844A (en) Pneumonia X-ray image analysis model and method based on attention superposition and feature fusion
CN113298018A (en) False face video detection method and device based on optical flow field and facial muscle movement
CN113763417A (en) Target tracking method based on twin network and residual error structure
CN112487926A (en) Scenic spot feeding behavior identification method based on space-time diagram convolutional network
Yaseen et al. A novel approach based on multi-level bottleneck attention modules using self-guided dropblock for person re-identification
CN112069943A (en) Online multi-person posture estimation and tracking method based on top-down framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination