CN116363733A - Facial expression prediction method based on dynamic distribution fusion - Google Patents

Facial expression prediction method based on dynamic distribution fusion

Info

Publication number
CN116363733A
Authority
CN
China
Prior art keywords
distribution
sample
category
branch
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310357220.9A
Other languages
Chinese (zh)
Inventor
刘姝
许焱
万通明
王科选
奎晓燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202310357220.9A priority Critical patent/CN116363733A/en
Publication of CN116363733A publication Critical patent/CN116363733A/en
Current legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/84Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a facial expression prediction method based on dynamic distribution fusion, which comprises: acquiring a facial expression data set and preprocessing the face pictures in it to obtain a preprocessed data set; constructing an auxiliary branch and designing a dual-branch neural network model based on the auxiliary branch; extracting sample distributions from the preprocessed data set with the constructed auxiliary branch; constructing category distributions to mine the emotion information implicit in the extracted sample distributions; performing dynamic distribution fusion on the constructed category distributions and the extracted sample distributions; constructing a multi-task learning framework and optimizing the dual-branch neural network model; and predicting facial expressions with the optimized dual-branch neural network model. The invention introduces label distribution learning, which shows superiority over single-label learning, and proposes dynamic distribution fusion, which fully exploits the effectiveness of label distribution learning. The method has good prediction performance, high efficiency and low error.

Description

Facial expression prediction method based on dynamic distribution fusion
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a facial expression prediction method based on dynamic distribution fusion.
Background
Facial expression recognition is an important research direction in the field of computer vision. As a sub-field of emotion recognition, facial expression recognition judges the expression state of a face by analyzing a facial image and provides important support for fields such as human-computer interaction, affective computing and intelligent monitoring.
The facial expression recognition process mainly comprises facial expression image acquisition and preprocessing, facial expression feature extraction and facial expression classification. Facial expression preprocessing obtains the accurate position of the face from the acquired image through face detection and face alignment and eliminates interference from the picture background; its success rate is mainly affected by factors such as image quality, illumination intensity and occlusion. Common facial expression features include geometric features, appearance features, hybrid features and deep features. The former three, as traditional hand-crafted features, were widely used in the early stage of facial expression recognition research, but such methods often suffer from low accuracy and poor robustness; in recent years, with the rapid development of deep learning, deep features extracted by deep convolutional neural networks have achieved good performance on facial expression recognition tasks. Facial expression classification is the last step of facial expression recognition; traditional hand-crafted features are usually classified with the K-nearest-neighbor method, support vector machines, random forests, the Adaboost algorithm, Bayesian networks, single-layer perceptrons and the like, whereas under a deep learning framework expression recognition can be performed end to end, i.e. the deep neural network directly classifies and optimizes the features after learning them.
Facial expression models are mainly divided into 2D, 2.5D and 3D. A 2D face is an RGB face image captured by an ordinary camera, or an infrared image captured by an infrared camera; it records the color or texture observed from a single viewpoint and contains no depth information. A 2.5D face is a face depth image captured by a depth camera from a certain viewpoint; its surface information is discontinuous and it does not provide depth information for occluded regions. A 3D face is a point-cloud or mesh face image synthesized from depth images captured at multiple angles; it has complete surface information and contains depth information. 2D facial expression recognition has been studied for a long time, its software and hardware technologies are mature, and it has been widely used, but a 2D facial expression only reflects two-dimensional planar information without depth and therefore cannot fully express the real facial expression. Compared with a 2D face, a 3D face is not affected by factors such as illumination, occlusion or pose, is more robust, reflects the face more faithfully, and has been applied to tasks such as face synthesis and face transfer. 3D faces are generally acquired with professional equipment, mainly binocular cameras, RGB-D cameras based on the structured-light principle, and TOF cameras based on the time-of-flight principle. Owing to the easy availability of 2D faces, 2D facial expression recognition still dominates.
At present, most facial expression prediction methods adopt single-label learning to realize facial expression prediction. Although these methods have achieved good prediction performance, a single label contains too little emotion information to describe ambiguous or mislabeled samples and easily causes overfitting of the neural network, which makes it difficult to further improve prediction accuracy.
A few methods adopt label distribution learning to realize facial expression prediction. Unlike single-label learning methods, these methods train with label distributions instead of single labels. Compared with a single label, a label distribution contains richer emotion information and can effectively avoid overfitting during training, so it has remarkable advantages. However, label distribution annotations are often difficult to obtain, so facial expression data sets that provide only single-label annotations still dominate. In recent years, label distribution learning methods have focused on constructing label distributions from single labels, but the constructed label distributions are generally of low quality, so the advantages of label distribution learning cannot be fully exploited.
Disclosure of Invention
The invention aims to provide a facial expression prediction method based on dynamic distribution fusion that has good prediction performance, high efficiency and low error.
The facial expression prediction method based on dynamic distribution fusion provided by the invention comprises the following steps:
S1, acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set;
S2, constructing an auxiliary branch, and designing a dual-branch neural network model based on the auxiliary branch;
S3, extracting sample distributions from the preprocessed data set obtained in step S1 by using the auxiliary branch constructed in step S2;
S4, constructing category distributions, and mining emotion information from the sample distributions obtained in step S3;
S5, performing dynamic distribution fusion on the category distributions constructed in step S4 and the sample distributions obtained in step S3;
S6, constructing a multi-task learning framework, and optimizing the dual-branch neural network model designed in step S2;
S7, predicting facial expressions with the dual-branch neural network model optimized in step S6.
The step S1 of acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set specifically includes:
setting the facial expression data set as S = {(x_i, y_i) | i = 1, 2, …, N}, where the data set covers C class labels and N samples; performing face alignment with the MTCNN algorithm and outputting face pictures of a fixed size; scaling the output face pictures to a given size and performing data augmentation with the RandAugment technique; and normalizing the RGB channels of the face pictures with the mean and standard deviation of the ImageNet data set.
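A minimal sketch of this preprocessing pipeline is given below in Python, assuming the facenet-pytorch implementation of MTCNN and the 100×100 alignment size and 224×224 network input size stated later in the detailed description; the augmentation magnitudes and package choice are illustrative assumptions, not part of the claimed method.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN          # assumed MTCNN implementation
from torchvision import transforms

# Face detection and alignment: output fixed-size (100x100) face crops, as in step S1.
mtcnn = MTCNN(image_size=100, margin=0, post_process=False)

# Scale to 224x224, apply RandAugment, and normalize RGB channels with ImageNet statistics.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandAugment(),                                   # data augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],            # ImageNet mean
                         std=[0.229, 0.224, 0.225]),            # ImageNet standard deviation
])

def preprocess(path: str) -> torch.Tensor:
    """Align one face picture and return a normalized 3x224x224 tensor."""
    img = Image.open(path).convert("RGB")
    face = mtcnn(img)                        # aligned 3x100x100 face tensor, or None if no face found
    if face is None:
        raise ValueError(f"no face detected in {path}")
    face_img = transforms.ToPILImage()(face.clamp(0, 255).byte())
    return train_transform(face_img)
```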
The construction of the auxiliary branch and the design of the dual-branch neural network model based on the auxiliary branch in step S2 specifically include:
a dual-branch neural network model is constructed from a ResNet18 network model. The ResNet18 network model is divided into two parts: the first layer of the ResNet18 network model is frozen and used as the feature extractor, and the last three layers of the ResNet18 network model are used as the feature discriminator, which is defined as the target branch. The auxiliary branch is constructed based on the target branch, and the parameters and structure of the auxiliary branch are consistent with those of the target branch. The dual-branch neural network model is then designed and obtained from the feature extractor, the target branch and the constructed auxiliary branch.
The extraction of sample distributions in step S3, performed on the preprocessed data set obtained in step S1 with the auxiliary branch constructed in step S2, specifically includes:
taking the probability distribution output by the auxiliary branch constructed in step S2 as the sample distribution, which is expressed by the following formulas:
d_i = {d_i^(y_1), d_i^(y_2), …, d_i^(y_C)}
d_i^(y_j) = p_aux(y_j | x_i)
where d_i is the sample distribution of sample x_i, y_j is the j-th class label, d_i^(y_j) is the description degree of label y_j for sample x_i, and p_aux(y_j | x_i) is the prediction probability, given by the auxiliary branch, that sample x_i belongs to label y_j;
the auxiliary branch is trained with a cross-entropy loss to improve and maintain its ability to extract sample distributions; the cross-entropy loss function is expressed by the following formula:
L_ce = −(1/N) · Σ_{i=1..N} Σ_{c=1..C} y_i^c · log p_i^c
where L_ce is the cross-entropy loss function, y_i^c is the c-th value of the logical (one-hot) label y_i of sample x_i, and p_i^c is the prediction probability, given by the auxiliary branch, that sample x_i belongs to category c.
The construction of category distributions in step S4, which mines emotion information from the sample distributions obtained in step S3, specifically comprises the following steps:
using category distribution mining to find the emotion information implicit in the sample distributions and to eliminate the influence of sample distribution errors on model performance; the category distribution is expressed by the following formula:
L^c = (1/N_c) · Σ_{x_i ∈ category c} d_i
where L^c is the category distribution of category c, d_i is the sample distribution of a sample x_i belonging to category c, and N_c is the number of samples belonging to category c;
a threshold t is set to judge whether an output category distribution meets the set robustness requirement; if the description degree of label y_j for category c does not reach the threshold t, a threshold distribution T^c is used to temporarily substitute the category distribution L^c for training the model, where T^c denotes the threshold distribution of category c and L^c(y_j) denotes the description degree of label y_j for category c.
The step S5 of performing dynamic distribution fusion processing on the category distribution constructed in the step S4 and the sample distribution obtained in the step S3 specifically includes:
the dynamic distribution fusion is based on category distribution, and the category distribution and the sample distribution are adaptively fused according to the attention weight of each sample. The dynamic distribution fusion is divided into two steps: attention weight extraction and adaptive distribution fusion;
1) Attention weight extraction:
for attention weight extraction, two attention modules are embedded into the last layer of the two branches, respectively, to acquire the attention weight of each sample. Each attention module consists of a fully connected layer and a Sigmoid function; the features output by each branch are input to the corresponding attention module to extract the attention weight of each sample, the attention weight value is used to judge whether a sample is clear or ambiguous, and the weight value is used for adaptive distribution fusion; the features output by each branch are multiplied by the corresponding attention weight and then fed into the corresponding classifier;
the flow of attention weight extraction is as follows:
a. for a batch of samples, the face features output by the feature extractor are input to the auxiliary branches and the target branches at the same time;
b. the attention weights output by the two attention modules are averaged so as to benefit from the sample-ambiguity discrimination capability of both branches at the same time, and the averaged attention weight is expressed by the following formula:
w_i = (w_i^aux + w_i^tgt) / 2
where w_i^aux and w_i^tgt are the attention weights of sample x_i output by the attention modules of the two branches, and w_i is the averaged attention weight;
c. the attention weights are rank regularized to avoid degradation of the discrimination capability of the attention module:
w_H = (1/M) · Σ_{i ∈ top-M} w_i,  w_L = (1/(N−M)) · Σ_{i ∈ bottom-(N−M)} w_i
L_RR = max(0, δ − (w_H − w_L))
where w_H and w_L are the mean attention weights of the M high-weight samples and of the N−M low-weight samples, respectively, δ is a fixed margin, δ and M directly use the values of the SCN method, which employs the same attention module, and L_RR is the rank regularization loss;
d. the attention weights are normalized so that they are no smaller than a lower limit w_min, where w_i is the attention weight of sample x_i after rank regularization and ŵ_i is the attention weight of sample x_i after normalization;
2) Adaptive distribution fusion:
the mixed distribution ĥ_i of each sample x_i after fusion is obtained by combining the category distribution of x_i and the label distribution of x_i, weighted by the normalized attention weight ŵ_i of x_i.
The step S6 of constructing a multi-task learning framework and optimizing the dual-branch neural network model designed in step S2 specifically comprises the following steps:
(1) optimizing the target branches:
training the target branch by using KL divergence loss, and expressing the training process by using the following formula:
L_kld = (1/N) · Σ_{i=1..N} Σ_{j=1..C} ĥ_i^(y_j) · log( ĥ_i^(y_j) / q_i^(y_j) )
where L_kld is the KL divergence loss, ĥ_i^(y_j) is the description degree of label y_j for sample x_i, and q_i^(y_j) is the prediction probability, given by the target branch, that sample x_i belongs to label y_j;
(2) multi-task learning framework:
a multi-task learning framework is constructed, and a joint loss L is minimized through joint learning of distribution prediction and expression recognition so as to optimize the prediction performance of the model; the joint loss function is expressed by the following formula:
L = α_1 · L_kld + α_2 · L_ce + L_RR
where α_1 and α_2 are weighting ramp functions of the training epoch e, β is the epoch threshold, and α_1 and α_2 are introduced to optimize the training process.
The facial expression prediction in step S7, performed with the dual-branch neural network model optimized in step S6, specifically includes:
and (3) outputting probability distribution of each sample through the target branches by adopting the double-branch neural network model obtained by optimization in the step (S6) to predict the facial expression, and selecting the expression corresponding to the highest prediction probability from the output probability distribution as the predicted expression of the sample.
According to the facial expression prediction method based on dynamic distribution fusion provided by the invention, label distribution learning is introduced, and the rich emotion information contained in the label distribution effectively avoids overfitting during training, showing superiority over single-label learning; meanwhile, dynamic distribution fusion is proposed, which uses the extracted sample distributions and the mined category distributions to generate a high-quality mixed distribution close to the real distribution, so that the effectiveness of label distribution learning is fully exploited; the method has good prediction performance, high efficiency and low error.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
Detailed Description
A schematic flow chart of the method of the present invention is shown in FIG. 1: the facial expression prediction method based on dynamic distribution fusion provided by the invention comprises the following steps:
S1, acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set; this step specifically comprises:
assume that the facial expression data set is S = {(x_i, y_i) | i = 1, 2, …, N}, covering C class labels and N samples; because the sizes of face pictures differ between data sets, the MTCNN algorithm is used for face alignment and outputs face pictures of a fixed size, which in the invention is 100×100; the output face pictures are then scaled to a given size of 224×224 and augmented with the RandAugment technique; finally, the RGB channels of each face picture are normalized with the mean and standard deviation of the ImageNet data set;
s2, constructing auxiliary branches, and designing a double-branch neural network model based on the auxiliary branches, wherein the method specifically comprises the following steps of:
a dual-branch neural network model is constructed from a ResNet18 network model. The ResNet18 network model is divided into two parts: the first layer of the ResNet18 network model is frozen and used as the feature extractor, and the last three layers of the ResNet18 network model are used as the feature discriminator, which is defined as the target branch. The auxiliary branch is constructed based on the target branch, and the parameters and structure of the auxiliary branch are consistent with those of the target branch. The dual-branch neural network model is then designed and obtained from the feature extractor, the target branch and the constructed auxiliary branch;
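One plausible realization of the dual-branch model just described is sketched below, assuming that "the first layer" denotes the ResNet18 stem plus layer1 (frozen as the shared feature extractor) and that "the last three layers" denote layer2–layer4 followed by pooling and a classifier (the target branch); the auxiliary branch is a structurally identical copy, and each branch already carries the attention module that step S5 embeds in its last layer. The exact split point, the 512-dimensional feature size and the ImageNet initialization are assumptions based on the standard torchvision ResNet18.

```python
import copy
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Branch(nn.Module):
    """Feature discriminator: ResNet18 layer2-layer4, pooling, attention module and classifier."""
    def __init__(self, backbone: nn.Module, num_classes: int):
        super().__init__()
        self.blocks = nn.Sequential(backbone.layer2, backbone.layer3, backbone.layer4)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.attention = nn.Sequential(nn.Linear(512, 1), nn.Sigmoid())   # attention module (used in step S5)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, feat: torch.Tensor):
        x = self.pool(self.blocks(feat)).flatten(1)     # 512-dimensional feature
        w = self.attention(x)                           # per-sample attention weight in (0, 1)
        logits = self.classifier(x * w)                 # feature weighted by attention, then classified
        return logits, w.squeeze(1)

class DualBranchNet(nn.Module):
    """Frozen shared feature extractor plus target branch and auxiliary branch."""
    def __init__(self, num_classes: int = 7):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")    # ImageNet initialization is an assumption
        self.extractor = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                       backbone.maxpool, backbone.layer1)
        for p in self.extractor.parameters():           # freeze the feature extractor
            p.requires_grad = False
        self.target = Branch(backbone, num_classes)
        self.auxiliary = copy.deepcopy(self.target)     # same structure and parameters as the target branch

    def forward(self, x: torch.Tensor):
        feat = self.extractor(x)                        # shared face features
        tgt_logits, tgt_w = self.target(feat)
        aux_logits, aux_w = self.auxiliary(feat)
        return tgt_logits, aux_logits, tgt_w, aux_w
```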
s3, carrying out extraction sample distribution processing on the preprocessing data set acquired in the step S1 by adopting the auxiliary branches constructed in the step S2, wherein the method specifically comprises the following steps:
training the model directly with the probability distribution output by a single ResNet18 network would degrade model performance; therefore the probability distribution output by the auxiliary branch constructed in step S2 is taken as the sample distribution, which is expressed by the following formulas:
d_i = {d_i^(y_1), d_i^(y_2), …, d_i^(y_C)}
d_i^(y_j) = p_aux(y_j | x_i)
where d_i is the sample distribution of sample x_i, y_j is the j-th class label, d_i^(y_j) is the description degree of label y_j for sample x_i, and p_aux(y_j | x_i) is the prediction probability, given by the auxiliary branch, that sample x_i belongs to label y_j;
the auxiliary branch is trained with a cross-entropy loss to improve and maintain its ability to extract sample distributions; the cross-entropy loss function is expressed by the following formula:
L_ce = −(1/N) · Σ_{i=1..N} Σ_{c=1..C} y_i^c · log p_i^c
where L_ce is the cross-entropy loss function, y_i^c is the c-th value of the logical (one-hot) label y_i of sample x_i, and p_i^c is the prediction probability, given by the auxiliary branch, that sample x_i belongs to category c;
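A short sketch of step S3 under the assumptions of the model sketch above: the softmax of the auxiliary-branch logits is taken as the sample distribution, and the auxiliary branch is trained with the cross-entropy loss against the single logical labels.

```python
import torch
import torch.nn.functional as F

def sample_distribution(aux_logits: torch.Tensor) -> torch.Tensor:
    """Sample distribution d_i: the auxiliary branch's predicted probabilities over the C classes."""
    return F.softmax(aux_logits, dim=1)

def auxiliary_ce_loss(aux_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss L_ce computed against the logical (single) labels y_i."""
    return F.cross_entropy(aux_logits, labels)
```

The cross-entropy term keeps the auxiliary branch's ability to extract sample distributions from degrading as training proceeds.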
s4, constructing category distribution, and mining emotion information processing aiming at the sample distribution obtained in the step S3, wherein the method specifically comprises the following steps:
because deep neural networks are sensitive to ambiguous or mislabeled samples, category distribution mining is used to find the emotion information implicit in the sample distributions and to eliminate the influence of sample distribution errors on model performance; the category distribution is expressed by the following formula:
L^c = (1/N_c) · Σ_{x_i ∈ category c} d_i
where L^c is the category distribution of category c, d_i is the sample distribution of a sample x_i belonging to category c, and N_c is the number of samples belonging to category c;
the category distribution mining adds and averages the sample distributions of all samples belonging to a certain category to obtain the category distribution of that category; because the parameters of the auxiliary branch are unstable in the initial training stage, it cannot yet output category distributions meeting the set robustness requirement, and such category distributions cannot accurately describe each expression; to prevent erroneous category distributions from degrading the prediction performance of the model, a threshold t is set to judge whether an output category distribution meets the set robustness requirement, and if the description degree of label y_j for category c does not reach the threshold t, a threshold distribution temporarily substitutes the category distribution for training the model; the threshold is set between 0 and 1, and its specific value is determined through ablation experiments. The threshold is based on the following phenomenon: the stronger the model's feature extraction ability, the higher the value at the position of the corresponding sample's label in the label distribution, so whether the model's feature extraction is adequate can be judged by setting a threshold; this is described by the following formulas:
the formulas give the category distribution actually used for category c, the threshold distribution T^c of category c, and the description degree L^c(y_j) of label y_j for category c; when the description degree of the label of category c does not reach the threshold t, the threshold distribution T^c temporarily replaces L^c for training;
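A sketch of the category distribution mining of step S4: the sample distributions of each class are added and averaged, and a class whose own-label description degree stays below the threshold t falls back to a stand-in threshold distribution. The uniform stand-in and the value t = 0.5 are placeholders, because the patent gives the threshold distribution only as a formula image and fixes t by ablation experiments.

```python
import torch

def mine_category_distributions(sample_dists: torch.Tensor,   # (N, C) sample distributions d_i
                                labels: torch.Tensor,          # (N,)  logical labels
                                num_classes: int,
                                t: float = 0.5) -> torch.Tensor:
    """Average the sample distributions of every class; keep a class's average only
    if its own label is described with degree >= t (the robustness check)."""
    C = num_classes
    cat_dists = torch.full((C, C), 1.0 / C)                    # stand-in threshold distribution (assumed uniform)
    for c in range(C):
        mask = labels == c
        if mask.any():
            mean_dist = sample_dists[mask].mean(dim=0)         # L^c: mean over samples of category c
            if mean_dist[c] >= t:                              # description degree of category c's own label
                cat_dists[c] = mean_dist
    return cat_dists                                           # row c = distribution used for category c
```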
s5, carrying out dynamic distribution fusion processing on the category distribution constructed in the step S4 and the sample distribution obtained in the step S3, wherein the dynamic distribution fusion processing specifically comprises the following steps:
the dynamic distribution fusion takes the category distribution as its basis and adaptively fuses the category distribution with the sample distribution according to the attention weight of each sample. The dynamic distribution fusion is divided into two steps, attention weight extraction and adaptive distribution fusion, as illustrated by the combined sketch at the end of this step;
1) Attention weight extraction:
for attention weight extraction, two attention modules are embedded into the last layer of the two branches, respectively, to acquire the attention weight of each sample. Each attention module consists of a fully connected layer and a Sigmoid function; the features output by each branch are input to the corresponding attention module to extract the attention weight of each sample, the attention weight value is used to judge whether a sample is clear or ambiguous, and the weight value is used for adaptive distribution fusion; the features output by each branch are multiplied by the corresponding attention weight and then fed into the corresponding classifier;
the flow of attention weight extraction is as follows:
a. for a batch of samples, the face features output by the feature extractor are input to the auxiliary branches and the target branches at the same time;
b. the attention weights output by the two attention modules are averaged so as to benefit from the sample-ambiguity discrimination capability of both branches at the same time, and the averaged attention weight is expressed by the following formula:
w_i = (w_i^aux + w_i^tgt) / 2
where w_i^aux and w_i^tgt are the attention weights of sample x_i output by the attention modules of the two branches, and w_i is the averaged attention weight;
c. the attention weights are rank regularized to avoid degradation of the discrimination capability of the attention module:
w_H = (1/M) · Σ_{i ∈ top-M} w_i,  w_L = (1/(N−M)) · Σ_{i ∈ bottom-(N−M)} w_i
L_RR = max(0, δ − (w_H − w_L))
where w_H and w_L are the mean attention weights of the M high-weight samples and of the N−M low-weight samples, respectively, and δ is a fixed margin; to avoid repeating experiments, δ and M directly use the values of the SCN method, which employs the same attention module, and are set to 0.07 and 0.7N respectively in the present invention; L_RR is the rank regularization loss;
d. the attention weights are normalized so that they are no smaller than a lower limit w_min, where w_i is the attention weight of sample x_i after rank regularization and ŵ_i is the attention weight of sample x_i after normalization; the hyper-parameter w_min is set to prevent the ambiguity of low-attention-weight samples from deteriorating the model performance during fusion, since the lower the attention weight, the higher the sample ambiguity.
2) Adaptive distribution fusion:
for adaptive distribution fusion, the category distribution and the sample distribution are adaptively fused based on the acquired attention weights, so that both the robustness of the category distribution and the diversity of the sample distribution are taken into account; the mixed distribution ĥ_i of each sample x_i after fusion is obtained by combining the category distribution of x_i and the label distribution of x_i, weighted by the normalized attention weight ŵ_i of x_i.
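The combined sketch referred to above covers both sub-steps of step S5. The rank regularization follows the SCN-style margin loss with δ = 0.07 and M = 0.7N as stated; the lower limit w_min = 0.2 and the fusion direction (how much weight the sample distribution receives relative to the category distribution) are assumptions, since the normalization and fusion formulas appear in the patent only as images.

```python
import torch

def rank_regularization(weights: torch.Tensor, ratio: float = 0.7, delta: float = 0.07) -> torch.Tensor:
    """L_RR = max(0, delta - (w_H - w_L)), where w_H and w_L are the mean attention
    weights of the top-M and remaining N-M samples (delta = 0.07, M = 0.7N)."""
    n = weights.numel()
    m = max(1, int(ratio * n))
    if m >= n:                                           # degenerate batch: nothing to regularize
        return weights.new_zeros(())
    sorted_w, _ = torch.sort(weights, descending=True)
    w_high, w_low = sorted_w[:m].mean(), sorted_w[m:].mean()
    return torch.clamp(delta - (w_high - w_low), min=0.0)

def fuse_distributions(sample_dist: torch.Tensor,        # (B, C) sample distributions d_i
                       cat_dist: torch.Tensor,           # (B, C) category distribution of each sample's class
                       tgt_w: torch.Tensor,              # (B,)  attention weights from the target branch
                       aux_w: torch.Tensor,              # (B,)  attention weights from the auxiliary branch
                       w_min: float = 0.2) -> torch.Tensor:
    """Average the two branches' attention weights, enforce the lower limit w_min,
    and mix the two distributions per sample into the fused mixed distribution."""
    w = 0.5 * (tgt_w + aux_w)                            # step b: averaged attention weight
    w = torch.clamp(w, min=w_min).unsqueeze(1)           # step d: lower limit w_min (assumed form of normalization)
    # Assumed fusion direction: clearer samples (higher weight) lean on their own sample
    # distribution, while more ambiguous samples lean on the robust category distribution.
    return w * sample_dist + (1.0 - w) * cat_dist
```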
s6, constructing a multi-task learning frame, and optimizing a double-branch neural network model designed in the step S2, wherein the method specifically comprises the following steps:
(1) optimizing the target branches:
training the target branch by using KL divergence loss, and expressing the training process by using the following formula:
L_kld = (1/N) · Σ_{i=1..N} Σ_{j=1..C} ĥ_i^(y_j) · log( ĥ_i^(y_j) / q_i^(y_j) )
where L_kld is the KL divergence loss, ĥ_i^(y_j) is the description degree of label y_j for sample x_i, and q_i^(y_j) is the prediction probability, given by the target branch, that sample x_i belongs to label y_j;
(2) multi-task learning framework:
a multi-task learning framework is constructed, and a joint loss L is minimized through joint learning of distribution prediction and expression recognition so as to optimize the prediction performance of the model; the joint loss function is expressed by the following formula:
L = α_1 · L_kld + α_2 · L_ce + L_RR
where α_1 and α_2 are weighting ramp functions of the training epoch e, and β is the epoch threshold; α_1 and α_2 are introduced to optimize the training process: in the early stage of training, the auxiliary branch is trained with emphasis so that it can output sample distributions and category distributions meeting the set robustness requirement; in the later stage of training, the target branch is trained while overfitting of the auxiliary branch is avoided; in the inference stage, the auxiliary branch is removed and only the target branch is used to predict the expression of a sample;
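A sketch of one multi-task training step for step S6, reusing fuse_distributions and rank_regularization from the step S5 sketch and assuming a model with the DualBranchNet interface sketched under step S2; the linear ramp used for α_1 and α_2 around the epoch threshold β, as well as β itself and the optimizer settings, are assumptions, because the patent gives the ramp functions only as formula images.

```python
import torch
import torch.nn.functional as F

def ramp_weights(epoch: int, beta: int = 10):
    """Assumed linear ramp: emphasize the auxiliary branch (L_ce) early in training
    and the target branch (L_kld) later, switching around the epoch threshold beta."""
    a1 = min(1.0, epoch / beta)          # weight of the KL divergence loss grows
    a2 = 1.0 - a1                        # weight of the cross-entropy loss shrinks
    return a1, a2

def train_step(model, images, labels, cat_dists, optimizer, epoch, w_min=0.2):
    tgt_logits, aux_logits, tgt_w, aux_w = model(images)
    d_sample = F.softmax(aux_logits, dim=1)                             # step S3: sample distribution
    d_mixed = fuse_distributions(d_sample.detach(), cat_dists[labels],
                                 tgt_w, aux_w, w_min)                   # step S5: mixed distribution
    l_ce = F.cross_entropy(aux_logits, labels)                          # auxiliary-branch loss L_ce
    l_kld = F.kl_div(F.log_softmax(tgt_logits, dim=1), d_mixed,
                     reduction="batchmean")                             # target-branch loss L_kld
    l_rr = rank_regularization(0.5 * (tgt_w + aux_w))                   # rank regularization L_RR
    a1, a2 = ramp_weights(epoch)
    loss = a1 * l_kld + a2 * l_ce + l_rr                                # joint loss L
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```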
s7, realizing facial expression prediction by adopting the double-branch neural network model obtained by optimizing in the step S6, wherein the method specifically comprises the following steps:
and (3) outputting probability distribution of each sample through the target branches by adopting the double-branch neural network model obtained by optimization in the step (S6) to predict the facial expression, and selecting the expression corresponding to the highest prediction probability from the output probability distribution as the predicted expression of the sample.

Claims (8)

1. A facial expression prediction method based on dynamic distribution fusion, comprising the following steps:
S1, acquiring a facial expression data set, preprocessing the face pictures in the acquired data set, and obtaining a preprocessed data set;
S2, constructing an auxiliary branch, and designing a dual-branch neural network model based on the auxiliary branch;
S3, extracting sample distributions from the preprocessed data set obtained in step S1 by using the auxiliary branch constructed in step S2;
S4, constructing category distributions, and mining emotion information from the sample distributions obtained in step S3;
S5, performing dynamic distribution fusion on the category distributions constructed in step S4 and the sample distributions obtained in step S3;
S6, constructing a multi-task learning framework, and optimizing the dual-branch neural network model designed in step S2;
S7, predicting facial expressions with the dual-branch neural network model optimized in step S6.
2. The facial expression prediction method based on dynamic distribution fusion according to claim 1, wherein the step S1 of obtaining a facial expression dataset, preprocessing a facial picture in the obtained dataset, and obtaining a preprocessed dataset specifically includes:
setting the facial expression data set as
S = {(x_i, y_i) | i = 1, 2, …, N}, where the data set covers C class labels and N samples; the MTCNN algorithm is used for face alignment and outputs face pictures of a fixed size; the output face pictures are scaled to a given size and augmented with the RandAugment technique; and the RGB channels of the face pictures are normalized with the mean and standard deviation of the ImageNet data set.
3. The facial expression prediction method based on dynamic distribution fusion according to claim 2, wherein the constructing auxiliary branches in step S2, and designing a dual-branch neural network model based on the auxiliary branches, specifically comprises:
a dual-branch neural network model is constructed from a ResNet18 network model. The ResNet18 network model is divided into two parts: the first layer of the ResNet18 network model is frozen and used as the feature extractor, and the last three layers of the ResNet18 network model are used as the feature discriminator, which is defined as the target branch. The auxiliary branch is constructed based on the target branch, and the parameters and structure of the auxiliary branch are consistent with those of the target branch. The dual-branch neural network model is then designed and obtained from the feature extractor, the target branch and the constructed auxiliary branch.
4. The facial expression prediction method based on dynamic distribution fusion according to claim 3, wherein the extracting sample distribution processing is performed on the preprocessing data set acquired in step S1 by the auxiliary branches constructed in step S2 in step S3, and specifically includes:
taking the probability distribution of the auxiliary branch output constructed in the step S2 as a sample distribution, and expressing the sample distribution by adopting the following formula:
d_i = {d_i^(y_1), d_i^(y_2), …, d_i^(y_C)}
d_i^(y_j) = p_aux(y_j | x_i)
where d_i is the sample distribution of sample x_i, y_j is the j-th class label, d_i^(y_j) is the description degree of label y_j for sample x_i, and p_aux(y_j | x_i) is the prediction probability, given by the auxiliary branch, that sample x_i belongs to label y_j;
the auxiliary branch is trained with a cross-entropy loss to improve and maintain its ability to extract sample distributions; the cross-entropy loss function is expressed by the following formula:
L_ce = −(1/N) · Σ_{i=1..N} Σ_{c=1..C} y_i^c · log p_i^c
where L_ce is the cross-entropy loss function, y_i^c is the c-th value of the logical (one-hot) label y_i of sample x_i, and p_i^c is the prediction probability, given by the auxiliary branch, that sample x_i belongs to category c.
5. The facial expression prediction method based on dynamic distribution fusion according to claim 4, wherein the constructing of the category distribution in step S4, performing mining emotion information processing on the sample distribution obtained in step S3, specifically includes:
using class distribution mining to find out implicit emotion information in sample distribution, eliminating influence of sample distribution errors on model performance, and expressing class distribution by adopting the following formula:
L^c = (1/N_c) · Σ_{x_i ∈ category c} d_i
where L^c is the category distribution of category c, d_i is the sample distribution of a sample x_i belonging to category c, and N_c is the number of samples belonging to category c;
setting a threshold t to judge whether the output category distribution meets the set robustness requirement; if the description degree of label y_j for category c does not reach the threshold t, a threshold distribution is used to temporarily substitute the category distribution for training the model, as described by the following formulas:
the formulas give the category distribution actually used for category c, the threshold distribution T^c of category c, and the description degree L^c(y_j) of label y_j for category c; when the description degree of the label of category c does not reach the threshold t, the threshold distribution T^c temporarily replaces L^c for training.
6. The facial expression prediction method based on dynamic distribution fusion according to claim 5, wherein the step S5 is characterized in that the dynamic distribution fusion processing is performed on the category distribution constructed in the step S4 and the sample distribution obtained in the step S3, and specifically includes:
the dynamic distribution fusion is based on category distribution, and the category distribution and the sample distribution are adaptively fused according to the attention weight of each sample. The dynamic distribution fusion is divided into two steps: attention weight extraction and adaptive distribution fusion;
1) Attention weight extraction:
for attention weight extraction, two attention modules are embedded into the last layer of the two branches, respectively, to acquire the attention weight of each sample. Each attention module consists of a fully connected layer and a Sigmoid function; the features output by each branch are input to the corresponding attention module to extract the attention weight of each sample, the attention weight value is used to judge whether a sample is clear or ambiguous, and the weight value is used for adaptive distribution fusion; the features output by each branch are multiplied by the corresponding attention weight and then fed into the corresponding classifier;
the flow of attention weight extraction is as follows:
a. for a batch of samples, the face features output by the feature extractor are input to the auxiliary branches and the target branches at the same time;
b. the attention weights output by the two attention modules are averaged so as to benefit from the sample-ambiguity discrimination capability of both branches at the same time, and the averaged attention weight is expressed by the following formula:
w_i = (w_i^aux + w_i^tgt) / 2
where w_i^aux and w_i^tgt are the attention weights of sample x_i output by the attention modules of the two branches, and w_i is the averaged attention weight;
c. the attention weights are rank regularized to avoid degradation of the discrimination capability of the attention module:
w_H = (1/M) · Σ_{i ∈ top-M} w_i,  w_L = (1/(N−M)) · Σ_{i ∈ bottom-(N−M)} w_i
L_RR = max(0, δ − (w_H − w_L))
where w_H and w_L are the mean attention weights of the M high-weight samples and of the N−M low-weight samples, respectively, δ is a fixed margin, δ and M directly use the values of the SCN method, which employs the same attention module, and L_RR is the rank regularization loss;
d. the attention weights are normalized so that they are no smaller than a lower limit w_min, where w_i is the attention weight of sample x_i after rank regularization and ŵ_i is the attention weight of sample x_i after normalization;
2) Adaptive distribution fusion:
the mixed distribution ĥ_i of each sample x_i after fusion is obtained by combining the category distribution of x_i and the label distribution of x_i, weighted by the normalized attention weight ŵ_i of x_i.
7. The facial expression prediction method based on dynamic distribution fusion according to claim 6, wherein the constructing a multi-task learning framework in step S6 optimizes the dual-branch neural network model designed in step S2, and specifically includes:
(1) optimizing the target branches:
training the target branch by using KL divergence loss, and expressing the training process by using the following formula:
L_kld = (1/N) · Σ_{i=1..N} Σ_{j=1..C} ĥ_i^(y_j) · log( ĥ_i^(y_j) / q_i^(y_j) )
where L_kld is the KL divergence loss, ĥ_i^(y_j) is the description degree of label y_j for sample x_i, and q_i^(y_j) is the prediction probability, given by the target branch, that sample x_i belongs to label y_j;
(2) multi-task learning framework:
a multi-task learning framework is constructed, and a joint loss L is minimized through joint learning of distribution prediction and expression recognition so as to optimize the prediction performance of the model; the joint loss function is expressed by the following formula:
L = α_1 · L_kld + α_2 · L_ce + L_RR
where α_1 and α_2 are weighting ramp functions of the training epoch, β is the epoch threshold, and α_1 and α_2 are introduced to optimize the training process.
8. The facial expression prediction method based on dynamic distribution fusion according to claim 7, wherein the facial expression prediction is realized by adopting the double-branch neural network model obtained by optimization in step S6 in step S7, and specifically comprises the following steps:
and (3) outputting probability distribution of each sample through the target branches by adopting the double-branch neural network model obtained by optimization in the step (S6) to predict the facial expression, and selecting the expression corresponding to the highest prediction probability from the output probability distribution as the predicted expression of the sample.
CN202310357220.9A 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion Pending CN116363733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310357220.9A CN116363733A (en) 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310357220.9A CN116363733A (en) 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion

Publications (1)

Publication Number Publication Date
CN116363733A true CN116363733A (en) 2023-06-30

Family

ID=86920731

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310357220.9A Pending CN116363733A (en) 2023-04-06 2023-04-06 Facial expression prediction method based on dynamic distribution fusion

Country Status (1)

Country Link
CN (1) CN116363733A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738120A (en) * 2023-08-11 2023-09-12 齐鲁工业大学(山东省科学院) Copper grade SCN modeling algorithm for X fluorescence grade analyzer
CN116738120B (en) * 2023-08-11 2023-11-03 齐鲁工业大学(山东省科学院) Copper grade SCN modeling algorithm for X fluorescence grade analyzer

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination