Method for quickly constructing human face action unit recognition data set
Technical Field
The invention relates to face action unit recognition technology, and in particular to a method for quickly constructing a face action unit recognition data set.
Background
In the development of computer vision, faces have received great attention and facial expressions have been studied continuously. One branch of this research classifies expressions, dividing them into happiness, sadness, fear, surprise, anger, disgust, contempt, and others; this is one of the most basic classification schemes.
in fact, when a person has subtle psychological activity, correspondingly subtle expression changes appear involuntarily on the face, and the variety of these changes goes far beyond the categories above. To describe them, the psychologist Paul Ekman and his research partner Wallace V. Friesen defined mutually independent Action Units (AUs) according to the anatomical characteristics of the human face and the actions of the facial muscles, dividing facial changes into units that are commonly understood as micro-expressions, and defined the Facial Action Coding System (FACS); many research directions have been extended on this basis, for example:
researchers at the University of Maryland and Dartmouth College published the paper "Deception Detection in Videos" and created a system named DARE (Deception Analysis and Reasoning Engine) for detecting micro-expressions and, further, for forensic lie detection, with efficacy significantly higher than that of humans under equal conditions;
Jeffrey M. Girard et al. published the article "Social Risk and Depression: Evidence from Manual and Automatic Facial Expression Analysis", in which face action unit detection and analysis was applied to the detection of depression, a topic that continues to receive high attention;
the detection methods for human face action units can be divided into hand-crafted-feature methods and deep learning methods, wherein:
the hand-crafted-feature methods generally require locating the landmark points of a human face, which are defined as fixed marks for the positions of the face contour, eyebrows, eyes, nose, mouth, and so on; the method used in this patent comprises 68 landmark points. After the landmarks are located, regional features such as HOG (Histogram of Oriented Gradients) and LBPH (Local Binary Pattern Histogram) are extracted near the landmark points corresponding to each AU, and a classifier such as an SVM (Support Vector Machine) then judges whether the AU is present;
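The landmark-plus-features idea above can be sketched as follows. This is an illustrative simplification, not the patent's implementation: `mini_hog` computes a single orientation histogram over one patch (real HOG adds cells and block normalization), the landmark coordinate and image are invented for the example, and the SVM classification stage is omitted.

```python
import numpy as np

def mini_hog(patch, n_bins=8):
    """Simplified HOG-like descriptor: one gradient-orientation histogram over a
    patch, weighted by gradient magnitude and L2-normalized."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)               # unsigned orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def patch_around(img, point, half=8):
    """Crop a patch centered on a landmark coordinate, clipped to the image."""
    x, y = point
    h, w = img.shape
    return img[max(0, y - half):min(h, y + half), max(0, x - half):min(w, x + half)]

# toy 64x64 grayscale "face" and a hypothetical inner-brow landmark (AU1 region)
rng = np.random.default_rng(0)
img = rng.random((64, 64))
feat = mini_hog(patch_around(img, (30, 20)))
print(feat.shape)  # (8,)
```

In a full pipeline, descriptors like `feat` from the landmark regions of many pictures would be fed to an SVM to decide AU presence.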
the Deep Learning method may not include a process of detecting landmark points, such as 8 × 8 division of a face Region directly by a paper "Deep Region and Multi-label Learning for Facial Action Unit Detection" published by kao kaili, and then Multi-AU Detection is performed through a convolutional neural network, or fine Detection may be performed using landmark point information, such as patent 201810441544.X, and Multi-AU Detection is performed through a convolutional neural network after a target Region of a specific size is taken according to landmark points. The methods all need a data set containing face action unit labels, and the data set needs to contain a face picture and AU labels corresponding to the face picture;
however, in the prior art there are few data sets containing AU labels. The published data sets include the CK+ data set described in the article "The Extended Cohn-Kanade Dataset (CK+): A complete dataset for action unit and emotion-specified expression" by Jeffrey F. Cohn et al.; its AU-labeled portion comprises 593 image sequences. Moreover, AU labeling is a subjective and specialized process that requires fairly detailed knowledge of the FACS system;
the method adopted when the CK+ data set was established was manual labeling followed by confirmation of labeling consistency by a certified FACS labeling expert: 17% of the CK+ data was re-labeled by a certified FACS expert, and kappa consistency was calculated. Since the kappa method can only measure the consistency of 2 raters and is difficult to apply to 3 or more, in the establishment of CK+ a single FACS expert had to compute pairwise consistency against all the labels; the kappa method fails when consistency must be computed over several experts, or over all labels jointly;
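The two-rater limitation follows directly from the definition of Cohen's kappa, which compares exactly one pair of label sequences. A minimal sketch (the AU presence/absence ratings below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for exactly two raters over the same items."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                 # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n ** 2  # chance agreement
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# two raters marking AU presence (1) / absence (0) on 10 pictures
r1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
r2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(r1, r2), 3))  # → 0.6
```

With three or more raters, only pairwise kappa values can be produced, which is exactly the situation described for CK+; a group-level statistic such as the ICC used below avoids this.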
moreover, the labeling method chosen for CK+ is not suited to establishing a large-scale data set; yet with the development of technologies such as deep learning, the support of massive data is increasingly needed. The number of face action units is large, with dozens in common use, the labeling must be refined and specialized, the labor cost is high, and the existing data collection methods are not up to the task.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a method for quickly constructing a human face action unit recognition data set. It adopts machine labeling, manual correction, grouped labeling, multiple consistency criteria and a labeled sample set, ensuring labeling accuracy, reducing labeling complexity and keeping different labeling personnel consistent while reducing the manual labeling cost as much as possible.
The method for quickly constructing the human face action unit recognition data set comprises the following steps:
step one, establishing a labeling manual;
according to the FACS standard, its content is translated into Chinese, the comments of a labeling expert are incorporated, and common points of attention in the labeling process are added, forming a labeling manual that is used during labeling training and labeling;
step two, AU selection and processing;
selecting the AUs to be labeled, and dividing the target AUs and the labeling members into a corresponding number of groups;
acquiring all the original video files to be labeled, and splitting the video files into frames to obtain the original human face pictures;
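The frame-splitting step might be sketched as below. The patent does not specify a sampling rate, so `every_seconds` is an assumption, and the actual decoding (which a video library such as OpenCV would handle) is left out; only the index arithmetic is shown:

```python
def frame_indices(total_frames, fps, every_seconds=0.5):
    """Indices of the frames to keep when sampling a video every `every_seconds`
    seconds; the decoded frames at these indices become the face pictures."""
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))

# hypothetical 10-second clip at 30 fps, sampled twice per second
idx = frame_indices(total_frames=300, fps=30)
print(len(idx))  # → 20
```

Sampling (rather than keeping every frame) keeps the picture count manageable while still covering each AU's rise and fall in intensity.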
selecting an AU intensity detector D, and setting consistency thresholds δ_1 and δ_2;
detecting all the original human face pictures to be labeled with the intensity detector D to obtain the machine labeling result D_L;
screening, from all the original face pictures to be labeled, sample pictures of each person showing the different AUs at different intensities; a labeling expert corrects the corresponding labeling results in D_L to form a labeled sample set, and all labeling personnel receive AU cognition training for their respective groups using the FACS standard and the labeled sample set;
for any one of the AU groups, all members and the labeling expert of that group's labeling team take the machine labeling results of the same group of pictures and correct them according to the FACS standard, using that group's AU machine labeling results in D_L and the labeled sample set, to obtain each person's corrected labeling result;
as an illustration, the same group preferably contains 120 pictures;
for the correction results of the previous step, consistency is calculated using the ICC (Intraclass Correlation Coefficient) together with an absolute-difference consistency criterion;
when the consistency requirement is met, step three is entered; if the consistency requirement is not met, the operations of step two are repeated;
step three: the AUs to be labeled and the grouping of the labeling members are the same as in step two;
all the original video files to be labeled are obtained and split into frames to obtain the original human face pictures, in the same way as in step two;
an AU intensity detector D is selected, and the consistency thresholds δ_1 and δ_2 are set;
all the original human face pictures to be labeled are detected with the intensity detector D to obtain the machine labeling result D_L; 10% of each AU group's pictures are reserved, and the remaining 90% are divided equally among all members of the AU group, i.e. the labeling members within a group each label a different subset of pictures; all labeling members and the labeling expert jointly label the reserved 10%, and consistency over it is calculated to make the consistency determination;
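The reserve-10%/divide-90% scheme can be sketched as follows; the shuffle, the seed, and the round-robin assignment to members are assumptions not stated in the patent:

```python
import random

def split_batch(pictures, members, reserved_ratio=0.1, seed=0):
    """Reserve `reserved_ratio` of a batch as the shared consistency test set,
    and divide the remaining pictures round-robin among the group's members."""
    pics = list(pictures)
    random.Random(seed).shuffle(pics)
    n_reserved = max(1, int(len(pics) * reserved_ratio))
    reserved, rest = pics[:n_reserved], pics[n_reserved:]
    assignment = {m: rest[i::len(members)] for i, m in enumerate(members)}
    return reserved, assignment

# hypothetical batch of 100 picture ids split among 3 labeling members
reserved, jobs = split_batch(range(100), ["A", "B", "C"])
print(len(reserved), [len(v) for v in jobs.values()])  # → 10 [34, 33, 33]
```

Everyone (plus the expert) labels the reserved pictures, so consistency can be computed on a common set, while the divided 90% avoids duplicated labeling effort.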
if the consistency meets the requirement, execution continues; otherwise the process returns to step two, and returns to step three to label this group of pictures once the consistency requirement is met;
the AU labeling results of the different AU groups are fused into a complete set of AU results; a data set for training human face action unit recognition is established from the original pictures and the fused overall AU labels, and the detector D can be optimized to further accelerate data set construction, completing the establishment of the whole data set;
further, taking the labeling results of 2 labeling personnel plus 1 labeling expert as an example, the consistency calculation comprises the following steps:
S1: for the labeling results of the same AU group, first calculate the ICC score;
wherein: n is the number of pictures labeled by the current group; the ICC score is compared against the threshold δ_1, and the ICC consistency requirement is judged to be met when the ICC score is higher than δ_1;
S2: in the case where few AU intensity labels are greater than 0, the ICC score can take a negative value; the absolute difference between all labels is then calculated:
wherein: n is the number of pictures labeled by the current group, and diff is the absolute difference to be obtained, compared against the threshold δ_2; the absolute-difference consistency requirement is judged to be met when the absolute difference is lower than δ_2; the overall consistency determination criterion is shown in Fig. 2;
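Since formula 1 and formula 2 are not reproduced in the text, the sketch below assumes a concrete form for each criterion: a two-way random, single-rater ICC(2,1) for the ICC score, a pairwise mean absolute difference for diff, and illustrative thresholds in place of δ_1 and δ_2.

```python
import numpy as np

def icc_2_1(Y):
    """Two-way random, single-rater ICC(2,1); Y has shape (n pictures, k raters).
    The patent does not name its ICC variant, so this form is an assumption."""
    n, k = Y.shape
    grand = Y.mean()
    msr = k * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between pictures
    msc = n * ((Y.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between raters
    sse = ((Y - Y.mean(axis=1, keepdims=True)
              - Y.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def mean_abs_diff(Y):
    """Average absolute difference over all rater pairs and all pictures."""
    n, k = Y.shape
    return float(np.mean([np.abs(Y[:, a] - Y[:, b]).mean()
                          for a in range(k) for b in range(a + 1, k)]))

# 2 labelers + 1 expert rating AU intensity (0-5) on 6 pictures (invented data)
Y = np.array([[0, 0, 0], [2, 2, 1], [3, 3, 3], [5, 4, 5], [1, 1, 1], [0, 1, 0]], float)
consistent = icc_2_1(Y) > 0.8 or mean_abs_diff(Y) < 0.5   # δ_1, δ_2 assumed
```

When nearly all intensities are 0, the between-picture variance collapses and the ICC becomes unstable or negative, which is why the absolute-difference fallback is needed.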
for example, for the intensity detector D, the invention selects the OpenFace method (free and open source face recognition with deep neural networks) as the detection method of detector D; other methods capable of detecting face AU intensity can equally serve as the detector D in this patent.
Advantageous Effects
the method adopts the method of machine labeling, manual correction, grouping labeling, multiple consistency criteria and labeling sample sets, ensures the labeling accuracy, reduces the labeling complexity and keeps the consistency of different labeling personnel while reducing the manual labeling cost as much as possible, can quickly construct the human face action unit identification data set in a shorter time, greatly reduces the labor cost, and well solves the defect that the consistency is difficult to unify when multiple persons are labeled in the prior art.
Drawings
FIG. 1 is a flowchart of the labeling training part of an embodiment of the method for quickly constructing a human face action unit recognition data set according to the present invention.
FIG. 2 is a flowchart of the consistency determination of an embodiment of the method for quickly constructing a human face action unit recognition data set according to the present invention.
FIG. 3 is a labeling flowchart of an embodiment of the method for quickly constructing a human face action unit recognition data set according to the present invention.
FIG. 4 shows the machine-detected human face AU intensity results of an embodiment of the method for quickly constructing a human face action unit recognition data set.
FIG. 5 illustrates the manual correction process of an embodiment of the method for quickly constructing a human face action unit recognition data set according to the present invention.
FIG. 6 shows the merged labeling result of an embodiment of the method for quickly constructing a human face action unit recognition data set according to the present invention.
Detailed Description
The technical solutions of the present invention are specifically described below. It should be noted that they are not limited to the embodiments described in the examples; those skilled in the art who, with reference to the contents of the technical solutions of the present invention, make improvements and designs on the basis of the present invention shall fall within its protection scope.
The method for quickly constructing the human face action unit recognition data set comprises the following steps:
step one, establishing a labeling manual;
according to the FACS standard, its content is translated into Chinese, the comments of a labeling expert are incorporated, and common points of attention in the labeling process are added, forming a labeling manual that is used during labeling training and labeling;
step two, AU selection and processing;
selecting the AUs to be labeled, and dividing the target AUs and the labeling members into a corresponding number of groups;
acquiring all the original video files to be labeled, and splitting the video files into frames to obtain the original human face pictures;
selecting an AU intensity detector D, and setting consistency thresholds δ_1 and δ_2;
detecting all the original human face pictures to be labeled with the intensity detector D to obtain the machine labeling result D_L;
screening, from all the original face pictures to be labeled, sample pictures of each person showing the different AUs at different intensities; a labeling expert corrects the corresponding labeling results in D_L to form a labeled sample set, and all labeling personnel receive AU cognition training for their respective groups using the FACS standard and the labeled sample set;
for any one of the AU groups, all members and the labeling expert of that group's labeling team take the machine labeling results of the same group of pictures and correct them according to the FACS standard, using that group's AU machine labeling results in D_L and the labeled sample set, to obtain each person's corrected labeling result;
as an illustration, the same group preferably contains 120 pictures;
for the correction results of the previous step, consistency is calculated using the ICC (Intraclass Correlation Coefficient) together with an absolute-difference consistency criterion;
when the consistency requirement is met, step three is entered; if the consistency requirement is not met, the operations of step two are repeated;
step three: the AUs to be labeled and the grouping of the labeling members are the same as in step two;
all the original video files to be labeled are obtained and split into frames to obtain the original human face pictures, in the same way as in step two;
an AU intensity detector D is selected, and the consistency thresholds δ_1 and δ_2 are set;
all the original human face pictures to be labeled are detected with the intensity detector D to obtain the machine labeling result D_L; 10% of each AU group's pictures are reserved, and the remaining 90% are divided equally among all members of the AU group, i.e. the labeling members within a group each label a different subset of pictures; all labeling members and the labeling expert jointly label the reserved 10%, and consistency over it is calculated to make the consistency determination;
if the consistency meets the requirement, execution continues; otherwise the process returns to step two, and returns to step three to label this group of pictures once the consistency requirement is met;
the AU labeling results of the different AU groups are fused into a complete set of AU results; a data set for training human face action unit recognition is established from the original pictures and the fused overall AU labels, and the detector D can be optimized to further accelerate data set construction, completing the establishment of the whole data set;
as an example, taking the labeling results of 2 labeling personnel plus 1 labeling expert, the consistency calculation comprises the following steps:
S1: for the labeling results of the same AU group, first calculate the ICC score;
wherein: n is the number of pictures labeled by the current group; the ICC score is compared against the threshold δ_1, and the ICC consistency requirement is judged to be met when the ICC score is higher than δ_1;
S2: in the case where few AU intensity labels are greater than 0, the ICC score can take a negative value; the absolute difference between all labels is then calculated:
wherein: n is the number of pictures labeled by the current group, and diff is the absolute difference to be obtained, compared against the threshold δ_2; the absolute-difference consistency requirement is judged to be met when the absolute difference is lower than δ_2; the overall consistency determination criterion is shown in Fig. 2;
as an example, for the intensity detector D, the invention selects the OpenFace method (free and open source face recognition with deep neural networks) as the detector D; other methods capable of detecting face AU intensity can equally serve as the detector D in this patent.
A preferred embodiment is described below:
as shown in fig. 1, corresponding to step one, in the training of the labeling personnel, the labeling expert first translates the FACS content into Chinese and, according to the expert's opinion, adds common points of attention in the labeling process to form the labeling manual. Next, suppose there are P videos of M persons, giving Q pictures to be labeled after frame splitting; 13 AUs (AU1, AU2, AU4, AU6, AU7, AU9, AU10, AU12, AU14, AU15, AU17, AU23 and AU25) are selected and divided into a front group of 6 and a rear group of 7, and the labeling personnel are correspondingly divided into group 1, which labels the first 6 AUs, and group 2, which labels the last 7 AUs, with 3 persons in each group; the labeling expert finds, as far as possible from the Q pictures to be labeled, pictures of each of the M persons showing every selected AU at different intensities, forming a labeled example set, and labels them. The labeling manual and the example set are used both in the training process and in the formal labeling process;
according to the labeling manual and the example set, all labeling personnel receive explanation and training; N pictures (for example N = 120) are selected from the Q pictures to be labeled as a verification set for training consistency, and the intensity detector D performs machine detection on the selected N pictures, with the detector giving a detection result for each AU, as shown in FIG. 4; the labeling expert and group 2 manually correct the last 7 AUs of the selected N pictures according to the labeling manual and the example set, as shown in FIG. 5; similarly, the labeling expert and group 1 correct the first 6 AUs, and labeling result 1 and labeling result 2 are combined to obtain the total labeling result, as shown in fig. 6;
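The combination of labeling result 1 and labeling result 2 amounts to a per-picture union of the two groups' AU dictionaries, since each group labels a disjoint set of AUs. A minimal sketch (picture ids and intensities invented for illustration; the AU split follows the 6 + 7 grouping above):

```python
# each group labels only its own AUs for every picture
group1_aus = ["AU1", "AU2", "AU4", "AU6", "AU7", "AU9"]
group2_aus = ["AU10", "AU12", "AU14", "AU15", "AU17", "AU23", "AU25"]

def merge_labels(result1, result2):
    """result1/result2 map picture id -> {AU name: intensity} for one group's AUs;
    the merge is a per-picture dictionary union into the total labeling result."""
    merged = {}
    for pic in result1.keys() | result2.keys():
        merged[pic] = {**result1.get(pic, {}), **result2.get(pic, {})}
    return merged

r1 = {"img_001": {"AU1": 2.0, "AU4": 0.0}}
r2 = {"img_001": {"AU12": 3.5, "AU25": 1.0}}
full = merge_labels(r1, r2)
print(len(full["img_001"]))  # → 4
```

Because the AU sets of the two groups are disjoint, the union never overwrites a label, so the merge order does not matter.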
respectively calculating the ICC value and the absolute-difference diff value according to formula 1 and formula 2, labeling consistency is determined according to the consistency determination flow of fig. 2; if the consistency requirement is met, step one is finished and formal labeling begins; if not, the steps of explanation training, labeling and consistency testing are repeated for group 1, group 2, or both, as the situation requires.
As shown in fig. 3, corresponding to step two, the formal labeling process is carried out. Step two is similar to step one, except that the parts building the labeling manual and the sample set are omitted: the manual and sample set from step one are used directly, and steps for batching all the data and allocating each batch are added. All human face pictures to be labeled are divided 90% / 10% into a labeling set and a test set, with the AU grouping and labeling-personnel grouping the same as in step one. For the labeling-set part, the different members of group 1 and group 2 each label different pictures; after the labeling set is labeled, the test set is labeled, with the members of groups 1 and 2 labeling the same pictures, and the consistency test is then performed. If the consistency requirement is met, the batch of data is validly labeled and the next batch proceeds; if not, the batch is invalid, and the explanation training, labeling and consistency testing of step one are repeated for group 1, group 2, or both, as the situation requires.
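The batch-level gate described above can be sketched as a loop: a batch is accepted only if its shared test set passes the consistency check, otherwise the group is retrained and the batch redone. `label_fn`, `check_consistency` and `retrain` are placeholder hooks, and the toy run merely simulates one failed first attempt per batch:

```python
def label_batches(batches, label_fn, check_consistency, retrain):
    """Formal-labeling loop: each batch is valid only once its labels pass the
    consistency check; on failure the group is retrained and the batch redone."""
    accepted = []
    for batch in batches:
        while True:
            labels = label_fn(batch)
            if check_consistency(labels):
                accepted.append(labels)
                break
            retrain()  # repeat the explanation training of step one

    return accepted

# toy run: odd-numbered labeling attempts "fail", even-numbered ones "pass"
state = {"tries": 0}

def fake_label(batch):
    state["tries"] += 1
    return (batch, state["tries"])

out = label_batches([1, 2], fake_label,
                    check_consistency=lambda lab: lab[1] % 2 == 0,
                    retrain=lambda: None)
print(out)  # → [(1, 2), (2, 4)]
```

In practice `check_consistency` would apply the ICC and absolute-difference criteria of figure 2 to the shared 10% test set.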
The method adopts machine labeling, manual correction, grouped labeling, multiple consistency criteria and a labeled sample set; it ensures labeling accuracy, reduces labeling complexity and keeps different labeling personnel consistent while reducing the manual labeling cost as much as possible. It can quickly construct a human face action unit recognition data set in a short time, greatly reduces the labor cost, and effectively overcomes the prior-art difficulty of keeping consistency unified when multiple persons label.
The above embodiments are only preferred embodiments of the present invention. It should be understood that they are intended only to assist in understanding the method and core idea of the present invention, not to limit its scope; any modifications, equivalents and the like made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.