CN116363733A - Facial expression prediction method based on dynamic distribution fusion - Google Patents

Facial expression prediction method based on dynamic distribution fusion

Info

Publication number
CN116363733A
Authority
CN
China
Prior art keywords
distribution
sample
category
branch
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310357220.9A
Other languages
Chinese (zh)
Inventor
刘姝
许焱
万通明
王科选
奎晓燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202310357220.9A priority Critical patent/CN116363733A/en
Publication of CN116363733A publication Critical patent/CN116363733A/en
Current legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/84Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression prediction method based on dynamic distribution fusion, which comprises: acquiring a facial expression data set and preprocessing the face pictures in it to obtain a preprocessed data set; constructing an auxiliary branch and designing a dual-branch neural network model based on the auxiliary branch; extracting sample distributions from the preprocessed data set with the constructed auxiliary branch; constructing category distributions to mine the emotion information implicit in the extracted sample distributions; performing dynamic distribution fusion on the constructed category distributions and the extracted sample distributions; constructing a multi-task learning framework and optimizing the dual-branch neural network model; and predicting facial expressions with the optimized dual-branch neural network model. The invention introduces label distribution learning, which shows superiority over single-label learning, and proposes dynamic distribution fusion, which fully exploits the effectiveness of label distribution learning. The method has good prediction performance, high efficiency and low error.

Description

Facial expression prediction method based on dynamic distribution fusion
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a facial expression prediction method based on dynamic distribution fusion.
Background
Facial expression recognition is an important research direction in the field of computer vision. As a sub-field of emotion recognition, facial expression recognition judges the expression state of a face by analyzing a facial image and provides important support for fields such as human-computer interaction, affective computing and intelligent monitoring.
The facial expression recognition process mainly comprises facial expression image acquisition and preprocessing, facial expression feature extraction and facial expression classification. Facial expression preprocessing obtains the accurate position of the face from the acquired image through face detection and face alignment and eliminates interference from the picture background; its success rate is mainly affected by factors such as image quality, illumination intensity and occlusion. Common facial expression features include geometric features, appearance features, hybrid features and deep features. The former three, as traditional hand-crafted features, were widely used in the early stage of facial expression recognition research, but such methods often suffer from low accuracy and poor robustness; in recent years, with the rapid development of deep learning, deep features extracted by deep convolutional neural networks have achieved good performance on facial expression recognition tasks. Facial expression classification is the last step of facial expression recognition; traditional hand-crafted features are usually classified with the K-nearest-neighbor method, support vector machines, random forests, the Adaboost algorithm, Bayesian networks, single-layer perceptrons and the like, whereas under a deep learning framework expression recognition can be performed end to end, i.e. the deep neural network directly classifies and optimizes the features after learning them.
Facial expression models are mainly divided into 2D, 2.5D and 3D. A 2D face is an RGB face image captured by an ordinary camera, or an infrared image captured by an infrared camera; it records the color or texture observed from a single viewpoint and contains no depth information. A 2.5D face is a face depth image captured by a depth camera from a certain viewpoint; its surface information is discontinuous and it does not provide depth information for occluded regions. A 3D face is a point-cloud or mesh face image synthesized from depth images captured at multiple angles; it has complete surface information and contains depth information. 2D facial expression recognition has been studied for a long time, its software and hardware technologies are mature, and it has been widely used, but a 2D facial expression only reflects two-dimensional planar information without depth and therefore cannot fully express the real facial expression. Compared with a 2D face, a 3D face is not affected by factors such as illumination, occlusion or pose, is more robust, reflects the face more faithfully, and has been applied to tasks such as face synthesis and face transfer. 3D faces are generally acquired with professional equipment, mainly binocular cameras, RGB-D cameras based on the structured-light principle, and TOF cameras based on the time-of-flight principle. Owing to the easy availability of 2D faces, 2D facial expression recognition still dominates.
At present, most facial expression prediction methods adopt single-label learning to realize facial expression prediction. Although these methods have achieved good prediction performance, a single label contains too little emotion information to describe ambiguous or mislabeled samples and easily causes overfitting of the neural network, which makes it difficult to further improve prediction accuracy.
A few methods adopt label distribution learning to realize facial expression prediction. Unlike single-label learning methods, these methods train with label distributions instead of single labels. Compared with a single label, a label distribution contains richer emotion information and can effectively avoid overfitting during training, so it has remarkable advantages. However, label distribution annotations are often difficult to obtain, so facial expression data sets that provide only single-label annotations still dominate. In recent years, label distribution learning methods have focused on constructing label distributions from single labels, but the constructed label distributions are generally of low quality, so the advantages of label distribution learning cannot be fully exploited.
Disclosure of Invention
The invention aims to provide a facial expression prediction method based on dynamic distribution fusion that has good prediction performance, high efficiency and low error.
The facial expression prediction method based on dynamic distribution fusion provided by the invention comprises the following steps:
S1, acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set;
S2, constructing an auxiliary branch, and designing a dual-branch neural network model based on the auxiliary branch;
S3, extracting sample distributions from the preprocessed data set obtained in step S1 by using the auxiliary branch constructed in step S2;
S4, constructing category distributions, and mining emotion information from the sample distributions obtained in step S3;
S5, performing dynamic distribution fusion on the category distributions constructed in step S4 and the sample distributions obtained in step S3;
S6, constructing a multi-task learning framework, and optimizing the dual-branch neural network model designed in step S2;
S7, predicting facial expressions with the dual-branch neural network model optimized in step S6.
The step S1 of acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set specifically includes:
setting the facial expression data set as S = {(x_i, y_i) | i = 1, 2, …, N}, where the data set covers C class labels and N samples; performing face alignment with the MTCNN algorithm and outputting face pictures of a fixed size; scaling the output face pictures to a given size and performing data augmentation with the RandAugment technique; and normalizing the RGB channels of the face pictures with the mean and standard deviation of the ImageNet data set.
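A minimal sketch of this preprocessing pipeline is given below in Python, assuming the facenet-pytorch implementation of MTCNN and the 100×100 alignment size and 224×224 network input size stated later in the detailed description; the augmentation magnitudes and package choice are illustrative assumptions, not part of the claimed method.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN          # assumed MTCNN implementation
from torchvision import transforms

# Face detection and alignment: output fixed-size (100x100) face crops, as in step S1.
mtcnn = MTCNN(image_size=100, margin=0, post_process=False)

# Scale to 224x224, apply RandAugment, and normalize RGB channels with ImageNet statistics.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandAugment(),                                   # data augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],            # ImageNet mean
                         std=[0.229, 0.224, 0.225]),            # ImageNet standard deviation
])

def preprocess(path: str) -> torch.Tensor:
    """Align one face picture and return a normalized 3x224x224 tensor."""
    img = Image.open(path).convert("RGB")
    face = mtcnn(img)                        # aligned 3x100x100 face tensor, or None if no face found
    if face is None:
        raise ValueError(f"no face detected in {path}")
    face_img = transforms.ToPILImage()(face.clamp(0, 255).byte())
    return train_transform(face_img)
```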
The construction of the auxiliary branch and the design of the dual-branch neural network model based on the auxiliary branch in step S2 specifically include:
a dual-branch neural network model is constructed from a ResNet18 network model. The ResNet18 network model is divided into two parts: the first layer of the ResNet18 network model is frozen and used as the feature extractor, and the last three layers of the ResNet18 network model are used as the feature discriminator, which is defined as the target branch. The auxiliary branch is constructed based on the target branch, and the parameters and structure of the auxiliary branch are consistent with those of the target branch. The dual-branch neural network model is then designed and obtained from the feature extractor, the target branch and the constructed auxiliary branch.
The extraction of sample distributions in step S3, performed on the preprocessed data set obtained in step S1 with the auxiliary branch constructed in step S2, specifically includes:
taking the probability distribution output by the auxiliary branch constructed in step S2 as the sample distribution, which is expressed by the following formulas:
d_i = {d_i^(y_1), d_i^(y_2), …, d_i^(y_C)}
d_i^(y_j) = p_aux(y_j | x_i)
where d_i is the sample distribution of sample x_i, y_j is the j-th class label, d_i^(y_j) is the description degree of label y_j for sample x_i, and p_aux(y_j | x_i) is the prediction probability, given by the auxiliary branch, that sample x_i belongs to label y_j;
the auxiliary branch is trained with a cross-entropy loss to improve and maintain its ability to extract sample distributions; the cross-entropy loss function is expressed by the following formula:
L_ce = −(1/N) · Σ_{i=1..N} Σ_{c=1..C} y_i^c · log p_i^c
where L_ce is the cross-entropy loss function, y_i^c is the c-th value of the logical (one-hot) label y_i of sample x_i, and p_i^c is the prediction probability, given by the auxiliary branch, that sample x_i belongs to category c.
The construction of category distributions in step S4, which mines emotion information from the sample distributions obtained in step S3, specifically comprises the following steps:
using category distribution mining to find the emotion information implicit in the sample distributions and to eliminate the influence of sample distribution errors on model performance; the category distribution is expressed by the following formula:
L^c = (1/N_c) · Σ_{x_i ∈ category c} d_i
where L^c is the category distribution of category c, d_i is the sample distribution of a sample x_i belonging to category c, and N_c is the number of samples belonging to category c;
a threshold t is set to judge whether an output category distribution meets the set robustness requirement; if the description degree of label y_j for category c does not reach the threshold t, a threshold distribution T^c is used to temporarily substitute the category distribution L^c for training the model, where T^c denotes the threshold distribution of category c and L^c(y_j) denotes the description degree of label y_j for category c.
The step S5 of performing dynamic distribution fusion processing on the category distribution constructed in the step S4 and the sample distribution obtained in the step S3 specifically includes:
the dynamic distribution fusion is based on category distribution, and the category distribution and the sample distribution are adaptively fused according to the attention weight of each sample. The dynamic distribution fusion is divided into two steps: attention weight extraction and adaptive distribution fusion;
1) Attention weight extraction:
for attention weight extraction, two attention modules are embedded into the last layer of the two branches, respectively, to acquire the attention weight of each sample. Each attention module consists of a fully connected layer and a Sigmoid function; the features output by each branch are input to the corresponding attention module to extract the attention weight of each sample, the attention weight value is used to judge whether a sample is clear or ambiguous, and the weight value is used for adaptive distribution fusion; the features output by each branch are multiplied by the corresponding attention weight and then fed into the corresponding classifier;
the flow of attention weight extraction is as follows:
a. for a batch of samples, the face features output by the feature extractor are input to the auxiliary branches and the target branches at the same time;
b. the attention weights output by the two attention modules are averaged so as to benefit from the sample-ambiguity discrimination capability of both branches at the same time, and the averaged attention weight is expressed by the following formula:
w_i = (w_i^aux + w_i^tgt) / 2
where w_i^aux and w_i^tgt are the attention weights of sample x_i output by the attention modules of the two branches, and w_i is the averaged attention weight;
c. the attention weights are rank regularized to avoid degradation of the discrimination capability of the attention module:
w_H = (1/M) · Σ_{i ∈ top-M} w_i,  w_L = (1/(N−M)) · Σ_{i ∈ bottom-(N−M)} w_i
L_RR = max(0, δ − (w_H − w_L))
where w_H and w_L are the mean attention weights of the M high-weight samples and of the N−M low-weight samples, respectively, δ is a fixed margin, δ and M directly use the values of the SCN method, which employs the same attention module, and L_RR is the rank regularization loss;
d. the attention weights are normalized so that they are no smaller than a lower limit w_min, where w_i is the attention weight of sample x_i after rank regularization and ŵ_i is the attention weight of sample x_i after normalization;
2) Adaptive distribution fusion:
the mixed distribution ĥ_i of each sample x_i after fusion is obtained by combining the category distribution of x_i and the label distribution of x_i, weighted by the normalized attention weight ŵ_i of x_i.
The step S6 of constructing a multi-task learning framework and optimizing the dual-branch neural network model designed in step S2 specifically comprises the following steps:
(1) optimizing the target branches:
training the target branch by using KL divergence loss, and expressing the training process by using the following formula:
L_kld = (1/N) · Σ_{i=1..N} Σ_{j=1..C} ĥ_i^(y_j) · log( ĥ_i^(y_j) / q_i^(y_j) )
where L_kld is the KL divergence loss, ĥ_i^(y_j) is the description degree of label y_j for sample x_i, and q_i^(y_j) is the prediction probability, given by the target branch, that sample x_i belongs to label y_j;
(2) multi-task learning framework:
a multi-task learning framework is constructed, and a joint loss L is minimized through joint learning of distribution prediction and expression recognition so as to optimize the prediction performance of the model; the joint loss function is expressed by the following formula:
L = α_1 · L_kld + α_2 · L_ce + L_RR
where α_1 and α_2 are weighting ramp functions of the training epoch e, β is the epoch threshold, and α_1 and α_2 are introduced to optimize the training process.
The facial expression prediction in step S7, performed with the dual-branch neural network model optimized in step S6, specifically includes:
and (3) outputting probability distribution of each sample through the target branches by adopting the double-branch neural network model obtained by optimization in the step (S6) to predict the facial expression, and selecting the expression corresponding to the highest prediction probability from the output probability distribution as the predicted expression of the sample.
According to the facial expression prediction method based on dynamic distribution fusion provided by the invention, label distribution learning is introduced, and the rich emotion information contained in the label distribution effectively avoids overfitting during training, showing superiority over single-label learning; meanwhile, dynamic distribution fusion is proposed, which uses the extracted sample distributions and the mined category distributions to generate a high-quality mixed distribution close to the real distribution, so that the effectiveness of label distribution learning is fully exploited; the method has good prediction performance, high efficiency and low error.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
Detailed Description
A schematic flow chart of the method of the present invention is shown in FIG. 1: the facial expression prediction method based on dynamic distribution fusion provided by the invention comprises the following steps:
S1, acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set; this step specifically comprises:
assume that the facial expression data set is S = {(x_i, y_i) | i = 1, 2, …, N}, covering C class labels and N samples; because the sizes of face pictures differ between data sets, the MTCNN algorithm is used for face alignment and outputs face pictures of a fixed size, which in the invention is 100×100; the output face pictures are then scaled to a given size of 224×224 and augmented with the RandAugment technique; finally, the RGB channels of each face picture are normalized with the mean and standard deviation of the ImageNet data set;
s2, constructing auxiliary branches, and designing a double-branch neural network model based on the auxiliary branches, wherein the method specifically comprises the following steps of:
a dual-branch neural network model is constructed from a ResNet18 network model. The ResNet18 network model is divided into two parts: the first layer of the ResNet18 network model is frozen and used as the feature extractor, and the last three layers of the ResNet18 network model are used as the feature discriminator, which is defined as the target branch. The auxiliary branch is constructed based on the target branch, and the parameters and structure of the auxiliary branch are consistent with those of the target branch. The dual-branch neural network model is then designed and obtained from the feature extractor, the target branch and the constructed auxiliary branch;
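One plausible realization of the dual-branch model just described is sketched below, assuming that "the first layer" denotes the ResNet18 stem plus layer1 (frozen as the shared feature extractor) and that "the last three layers" denote layer2–layer4 followed by pooling and a classifier (the target branch); the auxiliary branch is a structurally identical copy, and each branch already carries the attention module that step S5 embeds in its last layer. The exact split point, the 512-dimensional feature size and the ImageNet initialization are assumptions based on the standard torchvision ResNet18.

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Branch(nn.Module):
    """Feature discriminator: ResNet18 layer2-layer4, pooling, attention module and classifier."""
    def __init__(self, backbone: nn.Module, num_classes: int):
        super().__init__()
        self.blocks = nn.Sequential(backbone.layer2, backbone.layer3, backbone.layer4)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attention = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())   # attention module (used in step S5)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, feat: torch.Tensor):
        x = self.pool(self.blocks(feat)).flatten(1)     # 512-dimensional feature
        w = self.attention(x)                           # per-sample attention weight in (0, 1)
        logits = self.classifier(x * w)                 # feature weighted by attention, then classified
        return logits, w.squeeze(1)

class DualBranchNet(nn.Module):
    """Frozen shared feature extractor plus target branch and auxiliary branch."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")    # ImageNet initialization is an assumption
        self.extractor = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                       backbone.maxpool, backbone.layer1)
        for p in self.extractor.parameters():           # freeze the feature extractor
            p.requires_grad = False
        self.target = Branch(backbone, num_classes)
        self.auxiliary = copy.deepcopy(self.target)     # same structure and parameters as the target branch

    def forward(self, x: torch.Tensor):
        feat = self.extractor(x)                        # shared face features
        tgt_logits, tgt_w = self.target(feat)
        aux_logits, aux_w = self.auxiliary(feat)
        return tgt_logits, aux_logits, tgt_w, aux_w
```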
s3, carrying out extraction sample distribution processing on the preprocessing data set acquired in the step S1 by adopting the auxiliary branches constructed in the step S2, wherein the method specifically comprises the following steps:
training the model directly with the probability distribution output by a single ResNet18 network would degrade model performance; therefore the probability distribution output by the auxiliary branch constructed in step S2 is taken as the sample distribution, which is expressed by the following formulas:
d_i = {d_i^(y_1), d_i^(y_2), …, d_i^(y_C)}
d_i^(y_j) = p_aux(y_j | x_i)
where d_i is the sample distribution of sample x_i, y_j is the j-th class label, d_i^(y_j) is the description degree of label y_j for sample x_i, and p_aux(y_j | x_i) is the prediction probability, given by the auxiliary branch, that sample x_i belongs to label y_j;
the auxiliary branch is trained with a cross-entropy loss to improve and maintain its ability to extract sample distributions; the cross-entropy loss function is expressed by the following formula:
L_ce = −(1/N) · Σ_{i=1..N} Σ_{c=1..C} y_i^c · log p_i^c
where L_ce is the cross-entropy loss function, y_i^c is the c-th value of the logical (one-hot) label y_i of sample x_i, and p_i^c is the prediction probability, given by the auxiliary branch, that sample x_i belongs to category c;
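A short sketch of step S3 under the assumptions of the model sketch above: the softmax of the auxiliary-branch logits is taken as the sample distribution, and the auxiliary branch is trained with the cross-entropy loss against the single logical labels.

```python
import torch
import torch.nn.functional as F

def sample_distribution(aux_logits: torch.Tensor) -> torch.Tensor:
    """Sample distribution d_i: the auxiliary branch's predicted probabilities over the C classes."""
    return F.softmax(aux_logits, dim=1)

def auxiliary_ce_loss(aux_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss L_ce computed against the logical (single) labels y_i."""
    return F.cross_entropy(aux_logits, labels)
```

The cross-entropy term keeps the auxiliary branch's ability to extract sample distributions from degrading as training proceeds.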
s4, constructing category distribution, and mining emotion information processing aiming at the sample distribution obtained in the step S3, wherein the method specifically comprises the following steps:
because deep neural networks are sensitive to ambiguous or mislabeled samples, category distribution mining is used to find the emotion information implicit in the sample distributions and to eliminate the influence of sample distribution errors on model performance; the category distribution is expressed by the following formula:
L^c = (1/N_c) · Σ_{x_i ∈ category c} d_i
where L^c is the category distribution of category c, d_i is the sample distribution of a sample x_i belonging to category c, and N_c is the number of samples belonging to category c;
the category distribution mining adds and averages the sample distributions of all samples belonging to a certain category to obtain the category distribution of that category; because the parameters of the auxiliary branch are unstable in the initial training stage, it cannot yet output category distributions meeting the set robustness requirement, and such category distributions cannot accurately describe each expression; to prevent erroneous category distributions from degrading the prediction performance of the model, a threshold t is set to judge whether an output category distribution meets the set robustness requirement, and if the description degree of label y_j for category c does not reach the threshold t, a threshold distribution temporarily substitutes the category distribution for training the model; the threshold is set between 0 and 1, and its specific value is determined through ablation experiments. The threshold is based on the following phenomenon: the stronger the model's feature extraction ability, the higher the value at the position of the corresponding sample's label in the label distribution, so whether the model's feature extraction is adequate can be judged by setting a threshold; this is described by the following formulas:
the formulas give the category distribution actually used for category c, the threshold distribution T^c of category c, and the description degree L^c(y_j) of label y_j for category c; when the description degree of the label of category c does not reach the threshold t, the threshold distribution T^c temporarily replaces L^c for training;
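A sketch of the category distribution mining of step S4: the sample distributions of each class are added and averaged, and a class whose own-label description degree stays below the threshold t falls back to a stand-in threshold distribution. The uniform stand-in and the value t = 0.5 are placeholders, because the patent gives the threshold distribution only as a formula image and fixes t by ablation experiments.

```python
import torch

def mine_category_distributions(sample_dists: torch.Tensor,   # (N, C) sample distributions d_i
                                labels: torch.Tensor,          # (N,)  logical labels
                                num_classes: int,
                                t: float = 0.5) -> torch.Tensor:
    """Average the sample distributions of every class; keep a class's average only
    if its own label is described with degree >= t (the robustness check)."""
    C = num_classes
    cat_dists = torch.full((C, C), 1.0 / C)                    # stand-in threshold distribution (assumed uniform)
    for c in range(C):
        mask = labels == c
        if mask.any():
            mean_dist = sample_dists[mask].mean(dim=0)         # L^c: mean over samples of category c
            if mean_dist[c] >= t:                              # description degree of category c's own label
                cat_dists[c] = mean_dist
    return cat_dists                                           # row c = distribution used for category c
```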
s5, carrying out dynamic distribution fusion processing on the category distribution constructed in the step S4 and the sample distribution obtained in the step S3, wherein the dynamic distribution fusion processing specifically comprises the following steps:
the dynamic distribution fusion takes the category distribution as its basis and adaptively fuses the category distribution with the sample distribution according to the attention weight of each sample. The dynamic distribution fusion is divided into two steps, attention weight extraction and adaptive distribution fusion, as illustrated by the combined sketch at the end of this step;
1) Attention weight extraction:
for attention weight extraction, two attention modules are embedded into the last layer of the two branches, respectively, to acquire the attention weight of each sample. Each attention module consists of a fully connected layer and a Sigmoid function; the features output by each branch are input to the corresponding attention module to extract the attention weight of each sample, the attention weight value is used to judge whether a sample is clear or ambiguous, and the weight value is used for adaptive distribution fusion; the features output by each branch are multiplied by the corresponding attention weight and then fed into the corresponding classifier;
the flow of attention weight extraction is as follows:
a. for a batch of samples, the face features output by the feature extractor are input to the auxiliary branches and the target branches at the same time;
b. the attention weights output by the two attention modules are averaged so as to benefit from the sample-ambiguity discrimination capability of both branches at the same time, and the averaged attention weight is expressed by the following formula:
w_i = (w_i^aux + w_i^tgt) / 2
where w_i^aux and w_i^tgt are the attention weights of sample x_i output by the attention modules of the two branches, and w_i is the averaged attention weight;
c. the attention weights are rank regularized to avoid degradation of the discrimination capability of the attention module:
w_H = (1/M) · Σ_{i ∈ top-M} w_i,  w_L = (1/(N−M)) · Σ_{i ∈ bottom-(N−M)} w_i
L_RR = max(0, δ − (w_H − w_L))
where w_H and w_L are the mean attention weights of the M high-weight samples and of the N−M low-weight samples, respectively, and δ is a fixed margin; to avoid repeating experiments, δ and M directly use the values of the SCN method, which employs the same attention module, and are set to 0.07 and 0.7N respectively in the present invention; L_RR is the rank regularization loss;
d. the attention weights are normalized so that they are no smaller than a lower limit w_min, where w_i is the attention weight of sample x_i after rank regularization and ŵ_i is the attention weight of sample x_i after normalization; the hyper-parameter w_min is set to prevent the ambiguity of low-attention-weight samples from deteriorating the model performance during fusion, since the lower the attention weight, the higher the sample ambiguity.
2) Adaptive distribution fusion:
for adaptive distribution fusion, the category distribution and the sample distribution are adaptively fused based on the acquired attention weights, so that both the robustness of the category distribution and the diversity of the sample distribution are taken into account; the mixed distribution ĥ_i of each sample x_i after fusion is obtained by combining the category distribution of x_i and the label distribution of x_i, weighted by the normalized attention weight ŵ_i of x_i.
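The combined sketch referred to above covers both sub-steps of step S5. The rank regularization follows the SCN-style margin loss with δ = 0.07 and M = 0.7N as stated; the lower limit w_min = 0.2 and the fusion direction (how much weight the sample distribution receives relative to the category distribution) are assumptions, since the normalization and fusion formulas appear in the patent only as images.

```python
import torch

def rank_regularization(weights: torch.Tensor, ratio: float = 0.7, delta: float = 0.07) -> torch.Tensor:
    """L_RR = max(0, delta - (w_H - w_L)), where w_H and w_L are the mean attention
    weights of the top-M and remaining N-M samples (delta = 0.07, M = 0.7N)."""
    n = weights.numel()
    m = max(1, int(ratio * n))
    if m >= n:                                           # degenerate batch: nothing to regularize
        return weights.new_zeros(())
    sorted_w, _ = torch.sort(weights, descending=True)
    w_high, w_low = sorted_w[:m].mean(), sorted_w[m:].mean()
    return torch.clamp(delta - (w_high - w_low), min=0.0)

def fuse_distributions(sample_dist: torch.Tensor,        # (B, C) sample distributions d_i
                       cat_dist: torch.Tensor,           # (B, C) category distribution of each sample's class
                       tgt_w: torch.Tensor,              # (B,)  attention weights from the target branch
                       aux_w: torch.Tensor,              # (B,)  attention weights from the auxiliary branch
                       w_min: float = 0.2) -> torch.Tensor:
    """Average the two branches' attention weights, enforce the lower limit w_min,
    and mix the two distributions per sample into the fused mixed distribution."""
    w = 0.5 * (tgt_w + aux_w)                            # step b: averaged attention weight
    w = torch.clamp(w, min=w_min).unsqueeze(1)           # step d: lower limit w_min (assumed form of normalization)
    # Assumed fusion direction: clearer samples (higher weight) lean on their own sample
    # distribution, while more ambiguous samples lean on the robust category distribution.
    return w * sample_dist + (1.0 - w) * cat_dist
```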
s6, constructing a multi-task learning frame, and optimizing a double-branch neural network model designed in the step S2, wherein the method specifically comprises the following steps:
(1) optimizing the target branches:
training the target branch by using KL divergence loss, and expressing the training process by using the following formula:
L_kld = (1/N) · Σ_{i=1..N} Σ_{j=1..C} ĥ_i^(y_j) · log( ĥ_i^(y_j) / q_i^(y_j) )
where L_kld is the KL divergence loss, ĥ_i^(y_j) is the description degree of label y_j for sample x_i, and q_i^(y_j) is the prediction probability, given by the target branch, that sample x_i belongs to label y_j;
(2) multi-task learning framework:
a multi-task learning framework is constructed, and a joint loss L is minimized through joint learning of distribution prediction and expression recognition so as to optimize the prediction performance of the model; the joint loss function is expressed by the following formula:
L = α_1 · L_kld + α_2 · L_ce + L_RR
where α_1 and α_2 are weighting ramp functions of the training epoch e, and β is the epoch threshold; α_1 and α_2 are introduced to optimize the training process: in the early stage of training, the auxiliary branch is trained with emphasis so that it can output sample distributions and category distributions meeting the set robustness requirement; in the later stage of training, the target branch is trained while overfitting of the auxiliary branch is avoided; in the inference stage, the auxiliary branch is removed and only the target branch is used to predict the expression of a sample;
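A sketch of one multi-task training step for step S6, reusing fuse_distributions and rank_regularization from the step S5 sketch and assuming a model with the DualBranchNet interface sketched under step S2; the linear ramp used for α_1 and α_2 around the epoch threshold β, as well as β itself and the optimizer settings, are assumptions, because the patent gives the ramp functions only as formula images.

```python
import torch
import torch.nn.functional as F

def ramp_weights(epoch: int, beta: int = 10):
    """Assumed linear ramp: emphasize the auxiliary branch (L_ce) early in training
    and the target branch (L_kld) later, switching around the epoch threshold beta."""
    a1 = min(1.0, epoch / beta)          # weight of the KL divergence loss grows
    a2 = 1.0 - a1                        # weight of the cross-entropy loss shrinks
    return a1, a2

def train_step(model, images, labels, cat_dists, optimizer, epoch, w_min=0.2):
    tgt_logits, aux_logits, tgt_w, aux_w = model(images)
    d_sample = F.softmax(aux_logits, dim=1)                             # step S3: sample distribution
    d_mixed = fuse_distributions(d_sample.detach(), cat_dists[labels],
                                 tgt_w, aux_w, w_min)                   # step S5: mixed distribution
    l_ce = F.cross_entropy(aux_logits, labels)                          # auxiliary-branch loss L_ce
    l_kld = F.kl_div(F.log_softmax(tgt_logits, dim=1), d_mixed,
                     reduction="batchmean")                             # target-branch loss L_kld
    l_rr = rank_regularization(0.5 * (tgt_w + aux_w))                   # rank regularization L_RR
    a1, a2 = ramp_weights(epoch)
    loss = a1 * l_kld + a2 * l_ce + l_rr                                # joint loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```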
s7, realizing facial expression prediction by adopting the double-branch neural network model obtained by optimizing in the step S6, wherein the method specifically comprises the following steps:
and (3) outputting probability distribution of each sample through the target branches by adopting the double-branch neural network model obtained by optimization in the step (S6) to predict the facial expression, and selecting the expression corresponding to the highest prediction probability from the output probability distribution as the predicted expression of the sample.

Claims (8)

1. A facial expression prediction method based on dynamic distribution fusion, comprising the following steps:
S1, acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set;
S2, constructing an auxiliary branch, and designing a dual-branch neural network model based on the auxiliary branch;
S3, extracting sample distributions from the preprocessed data set obtained in step S1 by using the auxiliary branch constructed in step S2;
S4, constructing category distributions, and mining emotion information from the sample distributions obtained in step S3;
S5, performing dynamic distribution fusion on the category distributions constructed in step S4 and the sample distributions obtained in step S3;
S6, constructing a multi-task learning framework, and optimizing the dual-branch neural network model designed in step S2;
S7, predicting facial expressions with the dual-branch neural network model optimized in step S6.
2. The facial expression prediction method based on dynamic distribution fusion according to claim 1, wherein the step S1 of obtaining a facial expression dataset, preprocessing a facial picture in the obtained dataset, and obtaining a preprocessed dataset specifically includes:
setting the facial expression data set as
S = {(x_i, y_i) | i = 1, 2, …, N}, where the data set covers C class labels and N samples; the MTCNN algorithm is used for face alignment and outputs face pictures of a fixed size; the output face pictures are scaled to a given size and augmented with the RandAugment technique; and the RGB channels of the face pictures are normalized with the mean and standard deviation of the ImageNet data set.
3. The facial expression prediction method based on dynamic distribution fusion according to claim 2, wherein the constructing auxiliary branches in step S2, and designing a dual-branch neural network model based on the auxiliary branches, specifically comprises:
a dual-branch neural network model is constructed from a ResNet18 network model. The ResNet18 network model is divided into two parts: the first layer of the ResNet18 network model is frozen and used as the feature extractor, and the last three layers of the ResNet18 network model are used as the feature discriminator, which is defined as the target branch. The auxiliary branch is constructed based on the target branch, and the parameters and structure of the auxiliary branch are consistent with those of the target branch. The dual-branch neural network model is then designed and obtained from the feature extractor, the target branch and the constructed auxiliary branch.
4. The facial expression prediction method based on dynamic distribution fusion according to claim 3, wherein the extracting sample distribution processing is performed on the preprocessing data set acquired in step S1 by the auxiliary branches constructed in step S2 in step S3, and specifically includes:
taking the probability distribution of the auxiliary branch output constructed in the step S2 as a sample distribution, and expressing the sample distribution by adopting the following formula:
d_i = {d_i^(y_1), d_i^(y_2), …, d_i^(y_C)}
d_i^(y_j) = p_aux(y_j | x_i)
where d_i is the sample distribution of sample x_i, y_j is the j-th class label, d_i^(y_j) is the description degree of label y_j for sample x_i, and p_aux(y_j | x_i) is the prediction probability, given by the auxiliary branch, that sample x_i belongs to label y_j;
the auxiliary branch is trained with a cross-entropy loss to improve and maintain its ability to extract sample distributions; the cross-entropy loss function is expressed by the following formula:
L_ce = −(1/N) · Σ_{i=1..N} Σ_{c=1..C} y_i^c · log p_i^c
where L_ce is the cross-entropy loss function, y_i^c is the c-th value of the logical (one-hot) label y_i of sample x_i, and p_i^c is the prediction probability, given by the auxiliary branch, that sample x_i belongs to category c.
5. The facial expression prediction method based on dynamic distribution fusion according to claim 4, wherein the constructing of the category distribution in step S4, performing mining emotion information processing on the sample distribution obtained in step S3, specifically includes:
using class distribution mining to find out implicit emotion information in sample distribution, eliminating influence of sample distribution errors on model performance, and expressing class distribution by adopting the following formula:
L^c = (1/N_c) · Σ_{x_i ∈ category c} d_i
where L^c is the category distribution of category c, d_i is the sample distribution of a sample x_i belonging to category c, and N_c is the number of samples belonging to category c;
setting a threshold t to judge whether the output category distribution meets the set robustness requirement; if the description degree of label y_j for category c does not reach the threshold t, a threshold distribution is used to temporarily substitute the category distribution for training the model, as described by the following formulas:
the formulas give the category distribution actually used for category c, the threshold distribution T^c of category c, and the description degree L^c(y_j) of label y_j for category c; when the description degree of the label of category c does not reach the threshold t, the threshold distribution T^c temporarily replaces L^c for training.
6. The facial expression prediction method based on dynamic distribution fusion according to claim 5, wherein the step S5 is characterized in that the dynamic distribution fusion processing is performed on the category distribution constructed in the step S4 and the sample distribution obtained in the step S3, and specifically includes:
the dynamic distribution fusion is based on category distribution, and the category distribution and the sample distribution are adaptively fused according to the attention weight of each sample. The dynamic distribution fusion is divided into two steps: attention weight extraction and adaptive distribution fusion;
1) Attention weight extraction:
for attention weight extraction, two attention modules are embedded into the last layer of the two branches, respectively, to acquire the attention weight of each sample. Each attention module consists of a fully connected layer and a Sigmoid function; the features output by each branch are input to the corresponding attention module to extract the attention weight of each sample, the attention weight value is used to judge whether a sample is clear or ambiguous, and the weight value is used for adaptive distribution fusion; the features output by each branch are multiplied by the corresponding attention weight and then fed into the corresponding classifier;
the flow of attention weight extraction is as follows:
a. for a batch of samples, the face features output by the feature extractor are input to the auxiliary branches and the target branches at the same time;
b. the attention weights output by the two attention modules are averaged so as to benefit from the sample-ambiguity discrimination capability of both branches at the same time, and the averaged attention weight is expressed by the following formula:
w_i = (w_i^aux + w_i^tgt) / 2
where w_i^aux and w_i^tgt are the attention weights of sample x_i output by the attention modules of the two branches, and w_i is the averaged attention weight;
c. the attention weights are rank regularized to avoid degradation of the discrimination capability of the attention module:
w_H = (1/M) · Σ_{i ∈ top-M} w_i,  w_L = (1/(N−M)) · Σ_{i ∈ bottom-(N−M)} w_i
L_RR = max(0, δ − (w_H − w_L))
where w_H and w_L are the mean attention weights of the M high-weight samples and of the N−M low-weight samples, respectively, δ is a fixed margin, δ and M directly use the values of the SCN method, which employs the same attention module, and L_RR is the rank regularization loss;
d. the attention weights are normalized so that they are no smaller than a lower limit w_min, where w_i is the attention weight of sample x_i after rank regularization and ŵ_i is the attention weight of sample x_i after normalization;
2) Adaptive distribution fusion:
the mixed distribution ĥ_i of each sample x_i after fusion is obtained by combining the category distribution of x_i and the label distribution of x_i, weighted by the normalized attention weight ŵ_i of x_i.
7. The facial expression prediction method based on dynamic distribution fusion according to claim 6, wherein the constructing a multi-task learning framework in step S6 optimizes the dual-branch neural network model designed in step S2, and specifically includes:
(1) optimizing the target branches:
training the target branch by using KL divergence loss, and expressing the training process by using the following formula:
L_kld = (1/N) · Σ_{i=1..N} Σ_{j=1..C} ĥ_i^(y_j) · log( ĥ_i^(y_j) / q_i^(y_j) )
where L_kld is the KL divergence loss, ĥ_i^(y_j) is the description degree of label y_j for sample x_i, and q_i^(y_j) is the prediction probability, given by the target branch, that sample x_i belongs to label y_j;
(2) multi-task learning framework:
a multi-task learning framework is constructed, and a joint loss L is minimized through joint learning of distribution prediction and expression recognition so as to optimize the prediction performance of the model; the joint loss function is expressed by the following formula:
L = α_1 · L_kld + α_2 · L_ce + L_RR
where α_1 and α_2 are weighting ramp functions of the training epoch, β is the epoch threshold, and α_1 and α_2 are introduced to optimize the training process.
8. The facial expression prediction method based on dynamic distribution fusion according to claim 7, wherein the facial expression prediction is realized by adopting the double-branch neural network model obtained by optimization in step S6 in step S7, and specifically comprises the following steps:
and (3) outputting probability distribution of each sample through the target branches by adopting the double-branch neural network model obtained by optimization in the step (S6) to predict the facial expression, and selecting the expression corresponding to the highest prediction probability from the output probability distribution as the predicted expression of the sample.
CN202310357220.9A 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion Pending CN116363733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310357220.9A CN116363733A (en) 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310357220.9A CN116363733A (en) 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion

Publications (1)

Publication Number Publication Date
CN116363733A true CN116363733A (en) 2023-06-30

Family

ID=86920731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310357220.9A Pending CN116363733A (en) 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion

Country Status (1)

Country Link
CN (1) CN116363733A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738120A (en) * 2023-08-11 2023-09-12 齐鲁工业大学(山东省科学院) Copper grade SCN modeling algorithm for X fluorescence grade analyzer
CN116738120B (en) * 2023-08-11 2023-11-03 齐鲁工业大学(山东省科学院) Copper grade SCN modeling algorithm for X fluorescence grade analyzer

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination