CN117373095A - Facial expression recognition method and system based on local global information cross fusion - Google Patents

Facial expression recognition method and system based on local global information cross fusion Download PDF

Info

Publication number
CN117373095A
CN117373095A CN202311448453.6A
Authority
CN
China
Prior art keywords
feature
local
face
global
expression recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311448453.6A
Other languages
Chinese (zh)
Inventor
文杰
刘毅成
唐宜冰
唐瞻雁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen; Shenzhen Institute of Science and Technology Innovation, Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology Shenzhen; Shenzhen Institute of Science and Technology Innovation, Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen and Shenzhen Institute of Science and Technology Innovation, Harbin Institute of Technology
Priority to CN202311448453.6A priority Critical patent/CN117373095A/en
Publication of CN117373095A publication Critical patent/CN117373095A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/169Holistic features and representations, i.e. based on the facial image taken as a whole
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a facial expression recognition method and system based on local-global information cross fusion, comprising the following steps: S1, acquiring a face image dataset; S2, obtaining a facial expression recognition model according to the face image dataset; and S3, inputting a face image to be processed into the facial expression recognition model for real-time expression recognition. The technical scheme achieves more accurate expression recognition performance.

Description

Facial expression recognition method and system based on local global information cross fusion
Technical Field
The invention belongs to the technical field of pattern recognition, and particularly relates to a facial expression recognition method and system based on local global information cross fusion.
Background
In recent years, facial expression recognition has received much attention in the field of computer vision, and many scholars at home and abroad have conducted a great deal of research in this field. The main research methods can be roughly divided into two types: global feature-based methods and local feature-based methods. Global feature-based methods take the whole facial image as input. For example, the Deep Attentive Center Loss (DACL) method integrates an attention mechanism to learn attention weights related to feature importance, effectively improving the discriminability of the features; the Self-Cure Network (SCN) introduces a self-attention weight module to estimate the uncertainty of each training sample and, following the idea of pseudo labels, relabels samples with high uncertainty so that the model pays more attention to correctly labeled samples and the adverse effect of uncertain labels is reduced; Label Distribution Learning on Auxiliary Label Space Graphs (LDL-ALSG) learns label distributions for expression recognition using the topological information of labels in related tasks such as facial action unit recognition and facial landmark detection. These methods, however, only extract and process global facial features and ignore the feature information of particular facial regions. Facial expressions exhibit high inter-class similarity and large intra-class variation, which is rooted in the local subtlety of expressions, so information from local facial regions can provide an effective basis for discriminating expressions.
To overcome this problem, scholars have also proposed a number of expression recognition methods based on local features. Wang et al. designed an attention-based network structure that crops the face image into multiple regions and extracts feature information from each region; however, this method extracts local features by random cropping, which may miss facial regions strongly related to expression changes, and training a separate deep neural network for each region incurs considerable computational cost. Liu et al. proposed a Boosted Deep Belief Network (BDBN) that learns and selects a set of effective region features to describe expression-related changes in facial appearance; this method likewise requires feature extraction and computation for every region block of the face before a subset of region blocks is selected.
Considering that a facial expression is the joint result of several regional features, and that occlusion and pose variation are common in in-the-wild datasets, extracting only local features ignores the connection and complementarity between facial regions and cannot cope with the complex conditions appearing in images. Scholars have therefore designed expression recognition methods that fuse local and global features: Zhao et al. proposed a global multi-scale and local attention network to extract global and local features; Li et al. proposed a convolutional neural network with an attention mechanism through which the network learns an attention weight for each region. The drawbacks of these existing methods are: (1) local features are extracted in a random manner, which cannot guarantee that key facial regions are accurately captured and may introduce irrelevant information; (2) the fusion relies on simple weighted summation or attention modules, which can only highlight some local features and cannot extract the complementary information between local and global features.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a facial expression recognition method and system based on local-global information cross fusion that achieve more accurate expression recognition performance.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A facial expression recognition method based on local-global information cross fusion comprises the following steps:
s1, acquiring a face image training set;
s2, obtaining a facial expression recognition model according to the facial image training set;
and S3, inputting the face image to be processed into a facial expression recognition model, and carrying out real-time expression recognition.
Preferably, step S2 specifically includes:
s21, extracting global features and local features of a face image according to a face image training set;
and S22, obtaining, according to the global and local features of the face image, a facial expression recognition model that fuses local region information and global association information via a Transformer.
Preferably, in step S21, feature extraction is performed on the face image training set by using a convolutional neural network to obtain a global feature F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels of the feature map; 68 facial key point coordinates are extracted through the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are selected based on the 68 key point coordinates, corresponding to 5 regions of the face: the left eye, right eye, nose, left mouth corner and right mouth corner; for local feature decomposition, the global feature of the image is decomposed according to the 5 extracted key facial regions, taking the relative position of each region in the image as the center, and 5 local feature blocks of size 7×7 are segmented to obtain local features F_1, F_2, ..., F_k.
Preferably, in step S22, the global feature F_0 and the local features F_1, F_2, ..., F_k are preprocessed to obtain the global feature F_g and the local feature F_l in sequence form, and the two features are input into the Transformer module to obtain the fused representation F_out of the image.
The invention also provides a facial expression recognition system based on local-global information cross fusion, which comprises:
the acquisition device is used for acquiring the face image data set;
the processing device is used for obtaining a facial expression recognition model according to the facial image data set;
the recognition device is used for inputting the face image to be processed into the facial expression recognition model to perform real-time expression recognition.
Preferably, the processing device includes:
the extraction unit is used for extracting global features and local features of the face image according to the face image training set;
and the training unit is used for obtaining a facial expression recognition model according to the global features and the local features of the facial image.
Preferably, the extraction unit performs feature extraction on the face image training set by using a convolutional neural network to obtain a global feature F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels of the feature map; 68 facial key point coordinates are extracted through the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are selected based on the 68 key point coordinates, corresponding to 5 regions of the face: the left eye, right eye, nose, left mouth corner and right mouth corner; for local feature decomposition, the global feature of the image is decomposed according to the 5 extracted key facial regions, taking the relative position of each region in the image as the center, and 5 local feature blocks of size 7×7 are segmented to obtain local features F_1, F_2, ..., F_k.
Preferably, the training unit preprocesses the global feature F_0 and the local features F_1, F_2, ..., F_k to obtain the global feature F_g and the local feature F_l in sequence form, and inputs the two features into the Transformer module to obtain the fused representation F_out of the image.
In terms of local region feature extraction, the invention designs local feature extraction based on facial landmark points: using the obtained coordinates of the local facial regions, local features are adaptively extracted by a deep network. In terms of fusing local region features with global features, a Transformer-based cross-fusion module is designed to fuse the local and global feature information extracted by the network, yielding a feature representation with both local discriminability and global relevance and thereby achieving more accurate expression recognition performance.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a facial expression recognition method of local global information cross fusion in an embodiment of the present invention;
FIG. 2 is a block diagram illustrating a method of expression recognition by local global feature cross fusion in accordance with the present invention;
FIG. 3 is a schematic illustration of facial marker points;
FIG. 4 is a schematic diagram of the 5 marker points covering the main facial information;
FIG. 5 is a schematic illustration of facial marker point based segmentation;
FIG. 6 is a block diagram of the cross-fusion Transformer encoder;
FIG. 7 shows flow charts of network model training and sample testing according to the present invention, wherein (a) is the network model training flow chart and (b) is the sample testing flow chart.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1:
As shown in FIG. 1, an embodiment of the present invention provides a facial expression recognition method based on local-global information cross fusion, comprising the following steps:
s1, acquiring a face image training set;
s2, obtaining a facial expression recognition model according to the facial image training set;
and S3, inputting the face image to be processed into a facial expression recognition model, and carrying out real-time expression recognition.
As an implementation manner of the embodiment of the present invention, step S2 specifically includes:
s21, extracting global features and local features of a face image according to a face image training set;
and S22, obtaining, according to the global and local features of the face image, a facial expression recognition model that fuses local region information and global association information via a Transformer.
In step S21, feature extraction is performed on the face image training set by using a convolutional neural network to obtain a global feature F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels of the feature map; 68 facial key point coordinates are extracted through the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are selected based on the 68 key point coordinates, corresponding to 5 regions of the face: the left eye, right eye, nose, left mouth corner and right mouth corner; for local feature decomposition, the global feature of the image is decomposed according to the 5 extracted key facial regions, taking the relative position of each region in the image as the center, and 5 local feature blocks of size 7×7 are segmented to obtain local features F_1, F_2, ..., F_k.
As an implementation manner of the embodiment of the present invention, in step S22, the global feature F_0 and the local features F_1, F_2, ..., F_k are preprocessed to obtain the global feature F_g and the local feature F_l in sequence form, and the two features are input into the Transformer module to obtain the fused representation F_out of the image.
Example 2:
As shown in FIG. 2, the embodiment of the invention provides an expression recognition method based on local-global information cross fusion. The input data is an RGB face image I ∈ R^{H×W×3}, where H and W denote the height and width of the image, respectively. A convolutional neural network extracts a feature map of the input image, while a facial landmark extraction network (MobileFaceNet) extracts the coordinates of facial positions such as the eyes, nose and mouth. The feature map of the image is then divided into several local feature maps based on the obtained facial key point coordinates to obtain diverse local feature representations. Meanwhile, in order to obtain the global association information of the local regions of the image, the embodiment of the invention also extracts the global features of the image and designs a facial expression recognition model that fuses local region information and global association information via a Transformer.
For fusing local region feature information with global facial feature information, a Transformer-based local-global information cross-fusion network is designed. The self-attention mechanism in the traditional Transformer structure mainly processes a single feature sequence and strengthens relations within that sequence, whereas the method of the invention uses a cross-attention mechanism to process the local and global features jointly, so that the association information between them can be learned. The cross-attention mechanism aggregates and adjusts the two kinds of features, fusing the local association information into the final feature representation and thereby achieving higher expression recognition performance.
Facial landmark-based local feature decomposition:
In order to extract local features of expression-related facial regions, the embodiment of the invention designs a local feature segmentation module based on facial landmark points. A facial landmark extractor can accurately locate the key local regions of the face, and the local regions segmented from the facial features based on these coordinate points provide rich local discriminative information for the expression recognition task.
Facial landmark point extraction: given a face image, the positions of facial landmarks such as the eyes, nose and mouth are automatically detected, as shown in FIG. 3. The local areas surrounding these key points provide important information about facial pose, expression, gender, age and so on, and therefore have wide application in face recognition, face tracking, emotion recognition and similar scenarios. In general, a person's inner emotion can be revealed by subtle changes of facial regions such as the eyes, nose and mouth, so the embodiment of the invention focuses on the local information of these areas. MobileFaceNet is a lightweight deep-learning face recognition model, mainly applied to face recognition on mobile devices, with the advantages of efficiency, accuracy and portability. MobileFaceNet adds a global depthwise convolution layer after the last convolution layer of a common face recognition network to replace the traditional average pooling layer; the global depthwise convolution assigns different weights to different spatial positions of the face, yielding richer facial feature information. After being pre-trained on a dataset annotated with facial landmark points, it can efficiently and accurately output the landmark coordinates of a face, with good robustness and stability. The invention therefore adopts MobileFaceNet as the base model for facial landmark detection.
Local region block segmentation based on facial landmark points: the local features of facial muscle regions are important manifestations of facial expressions, so extracting this local region information helps improve expression classification performance. Inspired by this, the invention extracts region-block features of the face image that are highly discriminative and strongly related to expressions. To find the key facial regions related to expressions, the invention first detects 68 facial landmark points using MobileFaceNet and then selects, based on these 68 points, 5 key facial information points associated with expression changes, corresponding to the left eye, right eye, nose, left mouth corner and right mouth corner, as shown in FIG. 4. Two landmark points, one for the left mouth corner and one for the right mouth corner, are used to determine the position of the mouth, while each of the other regions is located with a single landmark point as the center of its local region, as illustrated in the sketch below.
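As an illustration of this selection step, the sketch below maps 68 detected landmarks to the 5 region centre points. The specific landmark indices (the common iBUG 68-point convention) and the averaging of the eye points are assumptions for illustration; the patent only names the five regions.

```python
import numpy as np

# Hypothetical index choices: the patent names the five regions but not the landmark indices.
def select_region_centers(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (68, 2) array of (x, y) coordinates on the original image."""
    left_eye  = landmarks[36:42].mean(axis=0)   # centre of the six left-eye points (assumed)
    right_eye = landmarks[42:48].mean(axis=0)   # centre of the six right-eye points (assumed)
    nose      = landmarks[30]                   # nose tip
    mouth_l   = landmarks[48]                   # left mouth corner
    mouth_r   = landmarks[54]                   # right mouth corner
    return np.stack([left_eye, right_eye, nose, mouth_l, mouth_r])  # (5, 2) region centres
```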
Local region feature extraction: for an input face image, the image feature extraction module network first processes the image and outputs a facial feature map. The region-block location information obtained from MobileFaceNet is defined on the original face image; to map it onto the feature map, the 5 location point coordinates obtained above are scaled proportionally to the size of the output feature map, and a square region with a set side length is then cropped around each scaled landmark coordinate, yielding 5 local facial region blocks. Compared with segmenting the original image into regions and then extracting local region features separately, this approach both reduces the total number of model parameters and improves computational efficiency, mainly because all regions of the image are processed by the same convolutional neural network and no separate convolutional network is needed for each region block. The feature map output by the feature extraction module has a size of 14×14 with 256 channels. After decomposition, 5 local feature maps of size 7×7 each are obtained, as shown in FIG. 5. During segmentation, the values of region positions beyond the range of the original feature map are set to 0; a minimal sketch of this cropping step follows.
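The sketch below illustrates the landmark-guided cropping, assuming a (B, 256, 14, 14) backbone feature map and region centres given in 224×224 input-image pixels; the rounding, clamping and zero-padding choices are assumptions consistent with the description above.

```python
import torch
import torch.nn.functional as F

def crop_local_blocks(feat, centers, img_size=224, block=7):
    """feat: (B, C, 14, 14) global feature map; centers: (B, 5, 2) pixel coordinates (x, y)."""
    B, C, H, W = feat.shape
    half = block // 2
    # zero-pad so window positions beyond the feature map are filled with 0, as described above
    padded = F.pad(feat, (half, half, half, half))
    scale = H / img_size                                   # e.g. 14 / 224
    blocks = []
    for k in range(centers.shape[1]):                      # one block per region point
        per_sample = []
        for b in range(B):
            cx = int(round(centers[b, k, 0].item() * scale))
            cy = int(round(centers[b, k, 1].item() * scale))
            cx = min(max(cx, 0), W - 1) + half             # clamp to the map, shift by padding offset
            cy = min(max(cy, 0), H - 1) + half
            per_sample.append(padded[b, :, cy - half:cy + half + 1, cx - half:cx + half + 1])
        blocks.append(torch.stack(per_sample))             # (B, C, 7, 7)
    return blocks                                          # list of 5 local feature blocks
```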
After facial feature extraction and local region block segmentation, a set of global and local region features of the face image is obtained. Denoting the convolutional backbone network as r(·; θ), the global feature F_0 of the image can be expressed as:
F_0 = r(I; θ)    (1)
where θ represents the parameters of the backbone network and I is the input face image.
Denoting by F_1, F_2, ..., F_k the k local region features obtained after cropping and segmentation, they can be combined with the global feature into a feature set X:
X = [F_0, F_1, F_2, ..., F_k]    (2)
In the present invention, 5 local regions are selected, and thus k = 5.
Transformer-based local global feature cross fusion:
The local region features extracted around the facial key points provide typical expression-related local features, while the global features contain the association information among the local regions. After the two kinds of features are obtained, a cross-fusion Transformer network is used to fuse them in order to fully mine the connections between them.
Feature map preprocessing: the Transformer was originally proposed in the field of natural language processing to handle sequence-shaped inputs. In order to use a Transformer for image feature fusion, the global feature map and the 5 local feature maps obtained above are first converted into sequence form before being input to the Transformer network for information fusion.
1) For the global feature map F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels, the map is first reduced in dimension so that its height and width become 1/2 of the original, in order to keep its size consistent with the local feature maps when input to the Transformer. The reduced feature map is then split along the channel dimension and rearranged into a feature vector sequence, briefly written as F_g ∈ R^{N×D}, where N = c_g is the number of patches input to the Transformer and D = (h_g/2)×(w_g/2) is the dimension of a single patch feature.
2) For the local feature maps, each map is first processed with a 1×1 convolution kernel so that its channel number becomes c/r, where r is a reduction coefficient used to reduce the feature dimension. The 5 feature maps are then concatenated along the channel dimension, giving a concatenated feature map with 5c/r channels; when the reduction coefficient r is set to 5, the concatenated map has c channels, the same as the original global feature map. The same reshaping operation as above is then applied to the concatenated feature map to obtain the final local feature F_l ∈ R^{N×D}. At this point, both features have been converted into a data format that can be input to the Transformer. Finally, a learnable class token is prepended to the input sequence, and learnable position encodings are added to the vector sequence to introduce positional information; a preprocessing sketch is given below.
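The sketch below illustrates this preprocessing, assuming a (B, C, 14, 14) global map and five (B, C, 7, 7) local blocks with C = 256. Applying the class token and position encoding to both sequences, and the extra 1×1 convolution that restores the channel count when C is not exactly divisible by r, are assumptions made here so that the two sequences have the same length; the patent does not fix these details.

```python
import torch
import torch.nn as nn

class SequencePreproc(nn.Module):
    def __init__(self, channels=256, num_local=5):
        super().__init__()
        reduced = channels // num_local                       # 1x1 conv reduces C -> C/r with r = 5
        self.pool = nn.AvgPool2d(2)                           # halve H and W of the global map: 14x14 -> 7x7
        self.reduce = nn.Conv2d(channels, reduced, 1)
        self.restore = nn.Conv2d(num_local * reduced, channels, 1)  # assumed: restore C so lengths match
        d = 7 * 7                                             # patch dimension D
        self.cls_g = nn.Parameter(torch.zeros(1, 1, d))       # learnable class tokens
        self.cls_l = nn.Parameter(torch.zeros(1, 1, d))
        self.pos = nn.Parameter(torch.zeros(1, channels + 1, d))  # learnable position encoding (shared here)

    def forward(self, global_map, local_blocks):
        B = global_map.size(0)
        f_g = self.pool(global_map).flatten(2)                # (B, C, 49): one token per channel
        f_l = self.restore(torch.cat([self.reduce(b) for b in local_blocks], dim=1)).flatten(2)
        f_g = torch.cat([self.cls_g.expand(B, -1, -1), f_g], dim=1) + self.pos
        f_l = torch.cat([self.cls_l.expand(B, -1, -1), f_l], dim=1) + self.pos
        return f_g, f_l                                       # (B, N, D) with N = C + 1, D = 49
```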
Cross attention: the self-attention mechanism in the traditional Transformer handles relationships within a single sequence well, and cross-attention mechanisms developed on this basis in many research works can learn the complementary relationship between two kinds of features. In order to fully mine the association between the local and global feature information, the invention uses a cross-attention mechanism to model the bidirectional correspondence between local and global features.
In the Transformer's self-attention operation, the input sequence is first mapped by linear transformations into three matrices: a query matrix Q, a key matrix K and a value matrix V, and the attention is then computed as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (3)

where d_k is the dimension of the key matrix K.
In order to realize complementary fusion between the two features, in this network the query matrix Q is generated from the local feature F_l, while the key matrix K and the value matrix V are generated from the global feature F_g:
Q_l = F_l W_Q,  K_g = F_g W_K,  V_g = F_g W_V    (4)
where W_Q, W_K, W_V ∈ R^{D×D} are linear transformation matrices, F_g is the global feature and F_l is the local feature.
After obtaining Q_l, K_g and V_g, the cross attention is computed as follows:

CrossAttention(Q_l, K_g, V_g) = softmax(Q_l K_g^T / √d_k) V_g    (5)

where softmax(·) is the activation function used to normalize the attention weights; a single-head sketch of this computation is given below.
With the introduction of cross attention, the global facial features are enriched with the regional position information carried by the local features, highlighting the information of the key regions; at the same time, the local feature blocks are guided by the global features and acquire more association information.
In the Transformer network, the multi-head attention mechanism performs several parallel cross-attention operations and concatenates the outputs of the attention heads into the final output. A residual connection is added after the multi-head attention, and a multi-layer perceptron (MLP) is then introduced to form a complete cross-fusion Transformer encoder, whose structure is shown in FIG. 6. The computation is as follows:
F' = CFMSA(Q_l, K_g, V_g) + F_g    (6)
F_out = MLP(Norm(F')) + F'    (7)
where CFMSA(·) denotes the multi-head cross-attention mechanism, Norm(·) denotes layer normalization, and MLP(·) denotes the multi-layer perceptron. F_out is the feature representation of the final output.
By stacking N layers of this fusion module, the holistic facial features and the local region features corresponding to the key points are continuously updated, forming the complete Transformer structure. Finally, the class token representation at the front of the model's last-layer output sequence is taken and classification prediction is performed through a fully connected layer; a sketch of a single fusion layer is given below.
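The sketch below follows Eqs. (6)-(7) for one cross-fusion layer: queries come from the local sequence, keys and values from the global sequence, a residual connects to F_g, and a normalized MLP with a second residual follows. Using nn.MultiheadAttention for CFMSA and the head count and MLP width below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossFusionLayer(nn.Module):
    def __init__(self, dim=49, heads=7, mlp_ratio=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # CFMSA (assumed implementation)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, f_l, f_g):
        fused, _ = self.attn(query=f_l, key=f_g, value=f_g)   # queries local, keys/values global
        f_prime = fused + f_g                                 # Eq. (6): residual to the global sequence
        return self.mlp(self.norm(f_prime)) + f_prime         # Eq. (7): Norm + MLP with residual
```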
The embodiment of the invention uses a label-smoothing cross-entropy loss function, which reduces label ambiguity by introducing a smoothing factor into the ground-truth label distribution and thereby improves model robustness. The core idea is to multiply each element of the true label distribution by a factor smaller than 1 and distribute the remaining probability mass over the other labels. In this way, the model does not overconfidently assign all probability mass to a single label but instead learns features shared with similar labels and spreads the probability distribution smoothly over multiple labels, which reduces the model's sensitivity to noise and improves its accuracy and generalization on test data.
The label-smoothing cross-entropy loss function can be expressed as:

L = -Σ_{i=1}^{N} ŷ_i log(y_i)    (8)

where N denotes the number of categories, y_i is the model's predicted probability for the i-th class, and ŷ_i is the smoothed label distribution. For label smoothing, the invention adopts the following smoothing formula:

ŷ_i = (1 - α)·q_i + α/m    (9)
where α is a smoothing factor, typically taking a small value such as α = 0.1 or α = 0.01, m is the number of categories, and q_i is the i-th element of the one-hot ground-truth label distribution; a loss sketch is given below.
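The sketch below implements Eqs. (8)-(9); the smoothed target (1 − α)·q_i + α/m follows the reconstruction above and is stated here as an assumption consistent with the description of spreading the remaining probability mass over all classes.

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, target, alpha=0.1):
    """logits: (B, m) raw class scores; target: (B,) integer labels; alpha: smoothing factor."""
    m = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)            # log y_i
    one_hot = F.one_hot(target, m).float()               # q_i: one-hot ground-truth distribution
    smoothed = (1.0 - alpha) * one_hot + alpha / m       # Eq. (9): smoothed label distribution
    return -(smoothed * log_probs).sum(dim=-1).mean()    # Eq. (8), averaged over the batch
```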
The specific implementation process comprises the following steps:
The implementation process comprises two stages: training and recognition of the neural network model. In the training stage, the embodiment of the invention sets the corresponding parameters for the neural network structure and the training process; the specific parameters are listed in Table 1.
TABLE 1
The implementation process of the embodiment of the invention is divided into two stages: model training and recognition. The flow charts of the two stages are shown in FIG. 7(a) and FIG. 7(b).
The training stage comprises the following detailed steps:
the specific training steps of the invention are introduced as follows:
The first step: image resizing. The input face image of arbitrary scale is resized to the input size expected by the network, namely 224×224.
The second step: facial image feature extraction, comprising global feature extraction, facial key point extraction and local feature decomposition. For global feature extraction, features are first extracted from the face image by the image feature extraction module used in the invention, namely the IR50 network, and the extracted global feature map is expressed as F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels. For facial key point extraction, 68 key point coordinates of the face are extracted by the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are then selected based on the 68 key point coordinates, covering the left eye, right eye, nose, left mouth corner and right mouth corner of the face. For local feature decomposition, based on the 5 extracted key facial regions and taking the relative position of each region in the image as the center, the global feature of the image is decomposed and 5 local feature blocks of size 7×7 are segmented, yielding the local features F_1, F_2, ..., F_k.
The third step: local-global feature cross fusion. The global feature F_0 and the local features F_1, F_2, ..., F_k are preprocessed to obtain the global feature F_g and the local feature F_l in sequence form; the two features are input into the cross-fusion Transformer module designed by the invention, and the fused representation F_out of the image is computed based on formulas (4), (5), (6) and (7).
The fourth step: prediction label output. The fused image representation obtained in the previous step is input into a multi-layer perceptron to compute the predicted label value of the image.
The fifth step: network loss calculation. Based on the predicted label value of the image, the loss value is computed with the label-smoothing cross-entropy loss of formulas (8) and (9).
The sixth step: gradient back-propagation optimization. According to the network loss value obtained in the previous step, the parameters of the entire network model are optimized in one pass using a gradient descent optimization algorithm.
The seventh step: convergence check. When the number of training iterations t reaches the set value, training stops and the network model parameters are output; otherwise t = t + 1 and execution jumps back to the second step. A sketch of this training loop is given below.
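The seven steps above can be summarized in the following training-loop sketch, assuming the components described earlier (backbone, landmark extractor, preprocessing, cross-fusion layers and classifier head) are wrapped in a single `model`, and reusing the `label_smoothing_ce` sketch above. The optimizer choice, learning rate and epoch count are illustrative assumptions; the concrete hyper-parameters are those of Table 1.

```python
import torch

def train(model, loader, epochs=40, lr=1e-4, alpha=0.1, device="cuda"):
    model.to(device).train()
    optim = torch.optim.Adam(model.parameters(), lr=lr)    # gradient-descent optimizer (step 6)
    for t in range(epochs):                                # stop when the set iteration count is reached (step 7)
        for images, labels in loader:                      # images already resized to 224x224 (step 1)
            images, labels = images.to(device), labels.to(device)
            logits = model(images)                         # steps 2-4: feature extraction, fusion, prediction
            loss = label_smoothing_ce(logits, labels, alpha)   # step 5: label-smoothing cross entropy
            optim.zero_grad()
            loss.backward()                                # step 6: back-propagate the loss
            optim.step()
    return model
```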
The recognition stage:
The specific recognition process is as follows:
The first step: image resizing. The input face image of arbitrary scale is resized to the input size expected by the network, namely 224×224.
The second step: feature extraction, comprising global feature extraction, facial key point extraction and local feature decomposition. Global feature extraction is performed by the image feature extraction module used in the invention, namely the IR50 network, and the extracted global feature map is expressed as F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels. Facial key point extraction obtains 68 key point coordinates of the face via the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are then selected based on the 68 key point coordinates, covering the left eye, right eye, nose, left mouth corner and right mouth corner of the face. Local feature decomposition decomposes the global feature of the image based on the 5 key facial regions obtained in the previous step, taking the relative position of each region in the image as the center, to obtain multiple local features F_1, F_2, ..., F_k.
The third step: local-global feature cross fusion. The global feature F_0 and the local features F_1, F_2, ..., F_k are preprocessed to obtain the global feature F_g and the local feature F_l in sequence form; the two features are input into the cross-fusion Transformer module designed by the invention, and the fused representation F_out of the image is computed based on formulas (4), (5), (6) and (7).
The fourth step: expression recognition output. The fused image representation obtained in the previous step is input into a multi-layer perceptron, the predicted label value of the image is computed, and the corresponding expression classification result is output. A sketch of the recognition stage is given below.
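A minimal sketch of the recognition stage is shown below, assuming the same trained `model` as in the training sketch; it returns the index of the predicted expression class.

```python
import torch

@torch.no_grad()
def predict(model, image, device="cuda"):
    """image: (3, 224, 224) tensor already resized to the network input size (step 1)."""
    model.to(device).eval()
    logits = model(image.unsqueeze(0).to(device))     # steps 2-4: fused representation -> MLP head
    return logits.argmax(dim=-1).item()               # index of the predicted expression class
```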
Facial expression recognition (FER) is receiving increasing attention in the computer vision community. Facial images pose two challenging problems for expression recognition: high inter-class similarity and large intra-class variation. To address these challenges and achieve better performance, the embodiment of the invention presents a local-global information cross-fusion Transformer network. Specifically, the method obtains a more discriminative facial representation by fully considering both the local information of multiple facial regions and the global holistic facial information. To extract the key local region features of the face, the embodiment of the invention designs a facial-landmark-based local feature decomposition module. In addition, a local-global information cross-fusion Transformer is designed, which strengthens the cooperative association between local and global facial feature information with a cross-attention mechanism and can focus on the key regions to the greatest extent while taking the association information among local regions into account. Extensive experiments on three mainstream expression recognition datasets show that the method outperforms existing facial expression recognition methods and can significantly improve recognition performance.
The facial expression recognition method based on local-global information cross fusion can recognize and distinguish the expression states shown in face images, thereby enabling the recognition and understanding of human emotion. Facial expression recognition technology has many application scenarios, mainly in human-computer interaction, intelligent companionship, safety monitoring, online education, judicial criminal investigation, communication and other fields. In human-computer interaction, it can be applied to companion robots, which capture facial expressions through a camera to recognize and analyze emotional changes and thus provide more intelligent and humanized companionship services. In intelligent driving, a small in-vehicle camera can capture changes in the driver's expression; if fatigue or inattention is recognized from the expression, a reminder or warning signal can be issued in time, prompting the driver to slow down or stop and avoiding traffic accidents. In online education, facial expression recognition can help teachers analyze students' mental states and expression changes in class, understand their learning situation and adjust the pace and difficulty of the course accordingly, improving the quality of online remote teaching. In criminal investigation, changes in facial expression often become important investigative clues; facial expression recognition can assist the police in interrogating suspects, judging psychological activity by analyzing expressions, conducting psychological tests and assessing honesty during questioning.
1) The embodiment of the invention extracts local features based on facial key points, accurately capturing the local regions most relevant to expressions without introducing excessive irrelevant information.
2) The local-global feature cross-fusion module provided by the embodiment of the invention adaptively fuses the two kinds of features and deeply extracts their complementary information, yielding a fused representation with both local discriminability and global relevance.
3) Comparative experiments against existing methods on three facial expression recognition datasets show that the embodiment of the invention achieves the best performance, as shown in Table 2.
TABLE 2
Example 3:
The embodiment of the invention also provides a facial expression recognition system based on local-global information cross fusion, which comprises:
the acquisition device is used for acquiring the face image data set;
the processing device is used for obtaining a facial expression recognition model according to the facial image data set;
the recognition device is used for inputting the face image to be processed into the facial expression recognition model to perform real-time expression recognition.
As one implementation of the embodiment of the present invention, the processing apparatus includes:
the extraction unit is used for extracting global features and local features of the face image according to the face image training set;
and the training unit is used for obtaining a facial expression recognition model according to the global features and the local features of the facial image.
As an implementation manner of the embodiment of the invention, the extraction unit performs feature extraction on the face image training set by using a convolutional neural network to obtain a global feature F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels of the feature map; 68 facial key point coordinates are extracted through the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are selected based on the 68 key point coordinates, corresponding to the left eye, right eye, nose, left mouth corner and right mouth corner of the face; for local feature decomposition, the global feature of the image is decomposed according to the 5 extracted key facial regions, taking the relative position of each region in the image as the center, and 5 local feature blocks of size 7×7 are segmented to obtain local features F_1, F_2, ..., F_k.
As an implementation manner of the embodiment of the invention, the training unit preprocesses the global feature F_0 and the local features F_1, F_2, ..., F_k to obtain the global feature F_g and the local feature F_l in sequence form, and inputs the two features into the Transformer module to obtain the fused representation F_out of the image.
The above embodiments are merely illustrative of the preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, but various modifications and improvements made by those skilled in the art to which the present invention pertains are made without departing from the spirit of the present invention, and all modifications and improvements fall within the scope of the present invention as defined in the appended claims.

Claims (8)

1. A facial expression recognition method based on local-global information cross fusion, characterized by comprising the following steps:
s1, acquiring a face image training set;
s2, obtaining a facial expression recognition model according to the facial image training set;
and S3, inputting the face image to be processed into a facial expression recognition model, and carrying out real-time expression recognition.
2. The facial expression recognition method of claim 1, wherein step S2 specifically comprises:
s21, extracting global features and local features of a face image according to a face image training set;
and S22, obtaining, according to the global and local features of the face image, a facial expression recognition model that fuses local region information and global association information via a Transformer.
3. The facial expression recognition method of claim 2, wherein in step S21, feature extraction is performed on the face image training set by using a convolutional neural network to obtain a global feature F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels of the feature map; 68 facial key point coordinates are extracted through the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are selected based on the 68 key point coordinates, corresponding to the left eye, right eye, nose, left mouth corner and right mouth corner of the face; the global feature of the image is decomposed according to the 5 extracted key facial regions, taking the relative position of each region in the image as the center, and 5 local feature blocks of size 7×7 are segmented to obtain local features F_1, F_2, ..., F_k.
4. The facial expression recognition method of claim 3, wherein in step S22, the global feature F_0 and the local features F_1, F_2, ..., F_k are preprocessed to obtain the global feature F_g and the local feature F_l in sequence form, and the two features are input into the Transformer module to obtain the fused representation F_out of the image.
5. A facial expression recognition system based on local-global information cross fusion, comprising:
the acquisition device is used for acquiring the face image data set;
the processing device is used for obtaining a facial expression recognition model according to the facial image data set;
the recognition device is used for inputting the face image to be processed into the facial expression recognition model to perform real-time expression recognition.
6. The facial expression recognition system of claim 5, wherein the processing device comprises:
the extraction unit is used for extracting global features and local features of the face image according to the face image training set;
and the training unit is used for obtaining a facial expression recognition model according to the global features and the local features of the facial image.
7. The facial expression recognition system of claim 6, wherein the extraction unit performs feature extraction on the face image training set by using a convolutional neural network to obtain a global feature F_0 ∈ R^{h_g×w_g×c_g}, where h_g and w_g respectively denote the height and width of the feature map and c_g denotes the number of channels of the feature map; 68 facial key point coordinates are extracted through the facial key point extraction network MobileFaceNet, and 5 facial region points associated with expressions are selected based on the 68 key point coordinates, corresponding to the left eye, right eye, nose, left mouth corner and right mouth corner of the face; the global feature of the image is decomposed according to the 5 extracted key facial regions, taking the relative position of each region in the image as the center, and 5 local feature blocks of size 7×7 are segmented to obtain local features F_1, F_2, ..., F_k.
8. The facial expression recognition system of claim 7, wherein the training unit preprocesses the global feature F_0 and the local features F_1, F_2, ..., F_k to obtain the global feature F_g and the local feature F_l in sequence form, and inputs the two features into the Transformer module to obtain the fused representation F_out of the image.
CN202311448453.6A 2023-11-02 2023-11-02 Facial expression recognition method and system based on local global information cross fusion Pending CN117373095A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311448453.6A CN117373095A (en) 2023-11-02 2023-11-02 Facial expression recognition method and system based on local global information cross fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311448453.6A CN117373095A (en) 2023-11-02 2023-11-02 Facial expression recognition method and system based on local global information cross fusion

Publications (1)

Publication Number Publication Date
CN117373095A true CN117373095A (en) 2024-01-09

Family

ID=89396448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311448453.6A Pending CN117373095A (en) 2023-11-02 2023-11-02 Facial expression recognition method and system based on local global information cross fusion

Country Status (1)

Country Link
CN (1) CN117373095A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784763A (en) * 2021-01-27 2021-05-11 南京邮电大学 Expression recognition method and system based on local and overall feature adaptive fusion
CN113887487A (en) * 2021-10-20 2022-01-04 河海大学 Facial expression recognition method and device based on CNN-Transformer
CN115311730A (en) * 2022-09-23 2022-11-08 北京智源人工智能研究院 Face key point detection method and system and electronic equipment
WO2023173646A1 (en) * 2022-03-17 2023-09-21 深圳须弥云图空间科技有限公司 Expression recognition method and apparatus

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784763A (en) * 2021-01-27 2021-05-11 南京邮电大学 Expression recognition method and system based on local and overall feature adaptive fusion
CN113887487A (en) * 2021-10-20 2022-01-04 河海大学 Facial expression recognition method and device based on CNN-Transformer
WO2023173646A1 (en) * 2022-03-17 2023-09-21 深圳须弥云图空间科技有限公司 Expression recognition method and apparatus
CN115311730A (en) * 2022-09-23 2022-11-08 北京智源人工智能研究院 Face key point detection method and system and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CE ZHENG et al.: "POSTER: A Pyramid Cross-Fusion Transformer Network for Facial Expression Recognition", arXiv:2204.04083v2 [cs.CV], 13 August 2023 (2023-08-13), pages 1-13 *

Similar Documents

Publication Publication Date Title
US10445602B2 (en) Apparatus and method for recognizing traffic signs
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN110119726A (en) A kind of vehicle brand multi-angle recognition methods based on YOLOv3 model
CN113723312B (en) Rice disease identification method based on visual transducer
CN113076891B (en) Human body posture prediction method and system based on improved high-resolution network
CN110516633A (en) A kind of method for detecting lane lines and system based on deep learning
Chang et al. Changes to captions: An attentive network for remote sensing change captioning
CN115497122A (en) Method, device and equipment for re-identifying blocked pedestrian and computer-storable medium
CN116452688A (en) Image description generation method based on common attention mechanism
CN114926796A (en) Bend detection method based on novel mixed attention module
CN112668493B (en) Reloading pedestrian re-identification, positioning and tracking system based on GAN and deep learning
CN114170686A (en) Elbow bending behavior detection method based on human body key points
CN117671617A (en) Real-time lane recognition method in container port environment
CN114944002B (en) Text description-assisted gesture-aware facial expression recognition method
CN117173777A (en) Learner front posture estimation method based on limb direction clue decoding network
CN117373095A (en) Facial expression recognition method and system based on local global information cross fusion
CN113051962B (en) Pedestrian re-identification method based on twin Margin-Softmax network combined attention machine
Sunney Real-Time Yoga Pose Detection using Machine Learning Algorithm
Mu et al. Algorithm Analysis of Face Recognition Robot Based on Deep Learning
CN116597419B (en) Vehicle height limiting scene identification method based on parameterized mutual neighbors
CN110458113A (en) A kind of non-small face identification method cooperated under scene of face
CN115115868B (en) Multi-mode collaborative scene recognition method based on triples
CN115601714B (en) Campus violent behavior identification method based on multi-modal data analysis
CN111461019B (en) Method, system and equipment for evaluating Chinese character writing quality
WO2024093466A1 (en) Person image re-identification method based on autonomous model structure evolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination