CN109508654A

CN109508654A - Merge the human face analysis method and system of multitask and multiple dimensioned convolutional neural networks

Info

Publication number: CN109508654A
Application number: CN201811260674.XA
Authority: CN
Inventors: 刘袁缘; 周顺平; 张香兰; 方芳; 郭明强; 姚尧; 彭济耀
Original assignee: China University of Geosciences
Current assignee: China University of Geosciences
Priority date: 2018-10-26
Filing date: 2018-10-26
Publication date: 2019-03-22
Anticipated expiration: 2038-10-26
Also published as: CN109508654B

Abstract

The invention discloses a kind of human face analysis method and system for merging multitask combining multi-scale convolutional neural networks, the picture to be learned for being first N × N to a size, the face area-of-interest for extracting K different scale from the picture using key area searching algorithm, as the input in multiple dimensioned tri- channels CNN；Then feature extraction is carried out to the K face area-of-interests respectively using CNN, obtains the face characteristic of different scale, the face characteristic of the different scale extracted is merged using cascade mode, obtains fused feature representation；The loss function of multiple tasks is merged again to obtain associated losses function, is inputted using the feature representation as study, the optimal solution about the associated losses function is obtained, to obtain the processing result of the multiple task.The present invention is promoted mutually using the correlation between task to improve the predictablity rate of individual task.

Description

Merge the human face analysis method and system of multitask and multiple dimensioned convolutional neural networks

Technical field

The present invention relates to face character fields, more specifically to a kind of fusion multitask and multiple dimensioned convolutional Neural The human face analysis method and system of network.

Background technique

In recent years, with the development of internet and artificial intelligence the relevant technologies achieve in practical applications centainly at Fruit, artificial intelligence field have attracted the concern of more and more researchers, and the application range of artificial intelligence technology is also more next It is wider.In computer vision field, Face datection and analysis are always the popular direction studied.Although being based on depth convolution recently The research of neural network (Convolutional Neural network, CNN) achieves significant achievement, and is widely answered For fields such as recognition of face, face tracking, Face datections, but for Face datection task, still it is difficult to from comprising extreme appearance Facial key point, head pose, gender and expression information are obtained in the facial image of state, illumination and change resolution.Face inspection It surveys, the task of crucial point location, Attitude estimation, Gender Classification and Expression Recognition is normally used as individual Resolving probiems.Recently, Research shows that learning multiple relevant tasks simultaneously can be improved the performance of individual task.

Summary of the invention

The technical problem to be solved in the present invention is that carrying out multitask (Multi-Task to the relevant attribute of face Learning, MTL) study, the correlation between attribute, and the multitask convolutional Neural for passing through multiple channels are excavated deeper into ground The case where same picture of e-learning a variety of resolution ratio, enriches the extraction of face character feature with this, thus by part attribute Accuracy of identification promoted.

The technical solution adopted by the present invention to solve the technical problems is: constructing a kind of fusion multitask combining multi-scale volume The human face analysis method of product neural network, comprises the following steps:

(1) multiple dimensioned face attention region extraction step:

The picture to be learned for being N × N to a size, is mentioned from the picture using key area searching algorithm The face area-of-interest for taking out K different scale, as the input in multiple dimensioned tri- channels CNN；Wherein, N indicates picture Plain size, K >=2 and be integer；

(2) multiple dimensioned learning procedure, including feature extraction sub-step and Fusion Features sub-step are merged；

Feature extraction sub-step: feature extraction is carried out to the K face area-of-interests respectively using CNN, is obtained not With the face characteristic of scale；

Fusion Features sub-step: using the face spy for the different scale that cascade mode extracts feature extraction sub-step Sign is merged, and fused feature representation is obtained；

(3) multitask human face analysis step:

The loss function of multiple tasks is merged to obtain associated losses function, the feature obtained with step (3) Expression obtains the optimal solution about the associated losses function, to obtain the processing of the multiple task as study input As a result.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention, In step (1), when extracting face area-of-interest, the face sense for positioning 3 scales is helped using " attention " mechanism Interest region.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention, Specifically include in the feature extraction sub-step of step (2): the feature channel in K channel of building, each feature channel are right respectively The face area-of-interest of one scale independently carries out feature extraction, extracts the face characteristic of K different scale altogether.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention, Feature representation in step (3), between the multiple task parallel training and shared inter-related task.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention, It is characterized in that, K=3.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention, In step (3), it is described merged the loss function of multiple tasks to obtain associated losses function specifically refer to: will be the multiple The loss function of task is weighted.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention, The multiple task refers to: Face datection, Attitude estimation, crucial point location, gender identification and five tasks of Expression Recognition.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention,

(a) Face datection task returns to face frame coordinate, loss function for detecting and positioning the face in picture are as follows:

loss_D=-(1-l) log (l-p_D)-l·log(p_D),

It is face wherein when l=1；L=0 is non-face；p_DIt is the probability of face for face key area；

(b) gender of gender identification mission face picture for identification, loss function are as follows:

loss_G=-(1-g) log (1-p_g0)-g·log(p_g1),

Wherein, male g=0, women g=1, (p_g0, p_g1) face picture be men and women probability；

(c) face key point location tasks are used for the face feature point location of face alignment, are on the basis of Face datection Upper progress, to the positioning feature point on face；Face key area is described with (x, y, w, h), wherein (x, y) is face The coordinate of regional center point, (w, h) are the width and height in the region respectively, and each visible key point is relative in region Heart point (x, y) displacement, and as follows according to (w, h) normalized formula:

Wherein, N_LIt is face key point number, (a_i,b_i) be training when label, (x_i,y_i) it is true value, face is crucial The loss function of point location are as follows:

It is the prediction result of i-th of face key point, v_iFor the visible factor of i-th of face key point, key point As it can be seen that v_i=1, otherwise v_i=0；

It can be seen that whether the factor is for characterizing face key point as it can be seen that being additional key point location, loss function are as follows:

(d) head pose estimation task is used for face picture head in roll (p¹), pitch (p₂) and yaw (p₃) three Prediction on direction, loss function are as follows:

Wherein,It is the estimated result of the angle of three directions head portion posture, (p₁,p₂,p₃) it is institute State the training label of the angle of three direction head portion postures；

(e) Expression Recognition task from given still image or dynamic video sequence for isolating specific expression shape State, so that it is determined that the mental emotion of identified object, loss function are as follows:

Wherein, M is the type number of expression, and M is more than or equal to 2,Indicate the classification of expression Training label, (p_e1, p_e2..., p_eM) indicate expression classification estimated result.

Associated losses function are as follows: loss_full=λ_Dloss_D+λ_Lloss_L+λ_Vloss_V+λ_Ploss_P+λ_Gloss_G+λ_Eloss_E； Wherein λ indicates weight in formula, and λ is greater than 0.

Further, in the human face analysis method of fusion multitask combining multi-scale convolutional neural networks of the invention, (e) in,

Wherein the type of expression shares 6 kinds, respectively represents 6 kinds of basic facial expressions: anger is detested, fears, is happy, is sad, is frightened It is surprised.

The present invention is to solve its technical problem, is additionally provided a kind of in fusion multitask combining multi-scale convolution of the invention The human face analysis system of neural network, the system use the fusion multitask combining multi-scale convolutional neural networks of any of the above-described Human face analysis method carry out human face analysis.

The human face analysis method and system for implementing fusion multitask combining multi-scale convolutional neural networks of the invention, have Below the utility model has the advantages that the present invention uses end-to-end training method, while multiple tasks to be used for, multiple tasks parallel training is simultaneously shared Characterization between inter-related task promotes multiple tasks to learn jointly, is promoted mutually using the correlation between task to improve list The predictablity rate of a task.

Detailed description of the invention

Present invention will be further explained below with reference to the attached drawings and examples, in attached drawing:

Fig. 1 is the human face analysis method task definition figure for merging multitask combining multi-scale convolutional neural networks；

Fig. 2 is result of study exemplary diagram；

Fig. 3 is the human face analysis method network structure for merging multitask combining multi-scale convolutional neural networks；

Fig. 4 is the exemplary diagram that element is searched in iteration region.

Specific embodiment

For a clearer understanding of the technical characteristics, objects and effects of the present invention, now control attached drawing is described in detail A specific embodiment of the invention.

With reference to Fig. 1-Fig. 3, the human face analysis method of the fusion multitask combining multi-scale convolutional neural networks of the present embodiment It specifically includes the following steps.

(1) multiple dimensioned face attention region extraction step:

The picture to be detected for being N × N to a size extracts the people of three scales using attention detection network Face area-of-interest, size are respectively 227 × 227,147 × 147,59 × 59, as tri- channels multiple dimensioned CNN Input；Wherein, N indicates pixel size

(2) multiple dimensioned learning procedure is merged, the fusion multitask Triple network learning model in three channels is constructed, including Feature extraction sub-step and Fusion Features sub-step；

Feature extraction sub-step: the face key area for three scales for respectively extracting above-mentioned steps (1) is as defeated Enter, they respectively correspond three sub-networks of table 1:

1 network detail parameters table of table

The input scale of sub-network one is 227 × 227, has used 8 convolutional layers, 3 pond layers, 1 full articulamentum.By It is layer distributed in the characteristic information that CNN is extracted, low convolutional layer includes more corner information, is suitble to study face key point Positioning and head pose assess task, and high convolutional layer is then suitble to study Face datection, Gender Classification and Expression Recognition etc., thus This patent is in the performance for promoting crucial point location and head state to assess by multi-scale feature fusion.The present invention has merged pond Change layer pool1 layers of (27 × 27 × 96), conv3 layers of convolutional layer (13 × 13 × 384), layer pool5 layers of pond (6 × 6 × 256). Further, it since their dimensions are inconsistent, cannot be directly connected to, so being convolution conv1a to pool1 layers and conv3 layers With convolution conv3a, the vector of 6 × 6 × 192 dimensions then is obtained to result dimensionality reduction, finally obtains 2048 by full articulamentum Fc6 The output vector of dimension.

The input scale 147 × 147 of sub-network two has used 4 convolutional layers, 3 pond layers and 2 full articulamentums, warp It crosses full articulamentum Fc61 and obtains 512 dimension output vectors.

The input scale of sub-network three is 59 × 59, has used 2 convolutional layers, 2 pond layers and 2 full articulamentums, is passed through It crosses full articulamentum Fc42 and obtains 512 dimension output vectors.

Fusion Features sub-step: using the face spy for the different scale that cascade mode extracts feature extraction sub-step Sign is merged, and fused feature representation is obtained.Three sub- network layers are merged to obtain 3072 dimension output vectors.Finally incite somebody to action 6 independent branches are divided into according to task to 3072 dimensional vectors: being Face datection, crucial point location, visibility detection, table respectively Feelings identification, Attitude estimation, gender identification.Wherein visibility detection is assisted face key point location.They respectively correspond Fc_ Detection, Fc_landmarks, Fc_visibility, Fc_pose, Fc_gender, Fc_expression six connect entirely Layer is connect, each branch carries out primary full connection respectively again and obtains the recognition result of each task, and see Table 1 for details for network parameter table.

(3) multitask human face analysis step:

Multi-task learning provides multiple supervision messages (label), is promoted mutually using the correlation between task, and more One feature of business is single tensor input (X), multiple outputs (task _ 1, task _ 2...), in the present invention as face is examined Survey, Expression Recognition etc., it is therefore desirable to have multiple loss functions for calculating separately the loss of each task.The present invention identifies simultaneously Five tasks, and for additional key point location task, crucial visibility of a point is proposed, therefore there are six loss functions:

(1) Face datection detects and positions the face in picture, high-precision face frame coordinate is returned to, as the people of prediction Face frame and the intersection of real human face range are when being greater than 0.5 with the union (IOU) of real human face range than the face frame of upper prediction Positive example (l=1), less than the negative example of 0.35 conduct (l=0), loss function is as follows:

loss_D=-(1-l) log (l-p_D)-l·log(p_D)

Wherein p_DIt is the probability of face for the face key area, is acquired from above-mentioned Fc_detection.

(2) gender identification identifies the gender of face on picture, and IOU is greater than 0.5 conduct positive example, and loss function is such as Under:

loss_G=-(1-g) log (1-p_g0)-g·log(p_g1)

Wherein male g=0, women then g=1, (p_g0, p_g1) it is the probability that face picture is men and women；

(3) face key point location, i.e. the face feature point location for face alignment, are on the basis of Face datection It carries out, characteristic point on the face such as corners of the mouth, canthus etc. is positioned, when IOU is used greater than 0.35 conduct positive example, All key point coordinates are normalized, and using relative value, mapping function is as follows:

(x_i, y_i) it is key point coordinate, (x, y) is the coordinate at face center, and w, h are that face is wide and high, for sightless Key point is set as (0,0).Loss function is calculated using Euclidean distance, as follows:

For the relative coordinate values after the normalization of prediction, N is keypoint quantity, and AFLW data set includes 21 passes Key point information, i.e. N=21, v_iIndicate the key point in this region as it can be seen that being otherwise 0 when being 1, that is to say, that not for those Visible point is not involved in operation.

(4) head pose estimation is the prediction to face picture head on tri- directions pitch, yaw and roll, is used Euclidean distance loss function training roll (p1), pitch (p2), it is positive example that yaw (p3), IOU, which are greater than 0.5, and loss function is public Formula is as follows:

WhereinFor the predicted value of the angle of three directions head portion posture, (p₁,p₂,p₃) it is described The training label of the angle of three direction head portion postures.

(5) Expression Recognition, which refers to, isolates specific emotional state from given still image or dynamic video sequence, So that it is determined that the mental emotion of identified object, similarly, when IOU is greater than 0.5 conduct positive example, loss function formula is as follows:

Wherein p_ejFor the predicted value of the classification of expression,For the training label of the classification of expression, j 0,1,2,3,5 divides Do not represent 6 kinds of basic facial expressions: angry, detest, fear, is happy, is sad, is surprised etc..

Further, associated losses function is the weighted sum of five task loss functions, formula are as follows:

loss_full=λ_Dloss_D+λ_Lloss_L+λ_Vloss_V+λ_Ploss_P+λ_Gloss_G+λ_Eloss_E

Wherein λ is that the significance level according to each task in general assignment is determined, λ is greater than 0.The weight of this patent is arranged Are as follows:

λ_D=1, λ_L=5, λ_E=2, λ_P=5, λ_G=2

(4) last handling process

Face datection algorithm of the invention faces two challenges: first, network is different surely to capture lesser face；The Two, human face region cannot be accurately positioned in the face frame predicted.Because confidence score is greater than 0.5 candidate regions by the present invention It is considered as comprising face, being considered as less than 0.35 is non-face, so the two challenges will lead to the reduction of Face datection rate, while shadow Ring the recognition result of other tasks.Therefore, the present invention promotes the accuracy of Face detection by last handling process.

The present invention has abandoned traditional bounding box regression algorithm, but proposes the iteration range searching suitable for this task With the non-maxima suppression algorithm based on key point.Iteration range searching is generated more by using the face feature point information of prediction Recall rate is improved in more face candidate area.Non-maxima suppression based on key point then by using it is predicted that facial markers Point information readjusts the bounding box of detection to promote the accuracy of positioning, reuses non-maxima suppression algorithm and goes out de-redundancy side Frame.Both methods does not need any training process.

(1) iteration range searching: when the correct Face datection score of key point is low, according to it is predicted that face feature point believe Breath generates new candidate regions using FaceRectCalculator, and it is carried out propagated forward in a model.This is new The candidate regions of generation can obtain higher score in this task of Face datection, to improve recall rate, realize process such as Fig. 4.

(2) based on the non-maxima suppression of key point: with the difference of other non-maxima suppression algorithms.Based on key point Non-maxima suppression algorithm not using face frame coordinate as parameter, but using four angular coordinates of minimum in key point as ginseng Number can avoid adjacent face from being merged or situations such as missing inspection with compact face in this way.

The embodiment of the present invention is described with above attached drawing, but the invention is not limited to above-mentioned specific Embodiment, the above mentioned embodiment is only schematical, rather than restrictive, those skilled in the art Under the inspiration of the present invention, without breaking away from the scope protected by the purposes and claims of the present invention, it can also make very much Form, all of these belong to the protection of the present invention.

Claims

1. a kind of human face analysis method for merging multitask combining multi-scale convolutional neural networks, which is characterized in that comprising as follows Step:

(1) multiple dimensioned face attention region extraction step:

The picture to be learned for being N × N to a size, is extracted from the picture using key area searching algorithm The face area-of-interest of K different scale, as the input in multiple dimensioned tri- channels CNN；Wherein, N indicates that pixel is big It is small, K >=2 and be integer；

Feature extraction sub-step: feature extraction is carried out to the K face area-of-interests respectively using CNN, obtains different rulers The face characteristic of degree；

Fusion Features sub-step: the face characteristic of the different scale extracted feature extraction sub-step using cascade mode into Row fusion, obtains fused feature representation；

(3) multitask human face analysis step:

The loss function of multiple tasks is merged to obtain associated losses function, the feature representation obtained with step (3) It is inputted as study, the optimal solution about the associated losses function is obtained, to obtain the processing result of the multiple task.

2. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 1, special Sign is, in step (1), when extracting face area-of-interest, helps to position 3 scales using " attention " mechanism Face area-of-interest.

3. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 1, special Sign is, specifically includes in the feature extraction sub-step of step (2): the feature channel in K channel of building, each feature channel Feature extraction is independently carried out to the face area-of-interest of a scale respectively, extracts the face characteristic of K different scale altogether.

4. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 1, special Sign is, the feature representation in step (3), between the multiple task parallel training and shared inter-related task.

5. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 1, special Sign is, K=3.

6. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 1, step (3) in, it is described merged the loss function of multiple tasks to obtain associated losses function specifically refer to: by the multiple task Loss function be weighted.

7. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 6, special Sign is that the multiple task refers to: Face datection, Attitude estimation, crucial point location, gender identification and Expression Recognition five appoint Business.

8. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 7, special Sign is,

loss_D=-(1-l) log (l-p_D)-l·log(p_D),

loss_G=-(1-g) log (1-p_g0)-g·log(p_g1),

(c) face key point location tasks be used for face alignment face feature point location, be on the basis of Face datection into Capable, to the positioning feature point on face；Face key area is described with (x, y, w, h), wherein (x, y) is human face region The coordinate of central point, (w, h) are the width and height in the region respectively, and each visible key point is relative to regional center point (x, y) displacement, and as follows according to (w, h) normalized formula:

Wherein, N_LIt is face key point number, (a_i,b_i) be training when label, (x_i,y_i) it is true value, face key point is fixed The loss function of position are as follows:

It is the prediction result of i-th of face key point, v_iFor the visible factor of i-th of face key point, key point can See, v_i=1, otherwise v_i=0；

(d) head pose estimation task is used for face picture head in roll (p₁), pitch (p₂) and yaw (p₃) three directions On prediction, loss function are as follows:

Wherein,It is the estimated result of the angle of three directions head portion posture, (p₁,p₂,p₃) it is described three The training label of the angle of a direction head portion posture；

(e) Expression Recognition task is used to isolate specific emotional state from given still image or dynamic video sequence, So that it is determined that the mental emotion of identified object, loss function are as follows:

Wherein, M is the type number of expression, and M is more than or equal to 2,Indicate the training of the classification of expression Label, (p_e1, p_e2..., p_eM) indicate expression classification estimated result.

Associated losses function are as follows: loss_full=λ_Dloss_D+λ_Lloss_L+λ_Vloss_V+λ_Ploss_P+λ_Gloss_G+λ_Eloss_E；Wherein formula Middle λ indicates weight, and λ is greater than 0.

9. the human face analysis method of fusion multitask combining multi-scale convolutional neural networks according to claim 8, special Sign is, (e) in,

Wherein the type of expression shares 6 kinds, respectively represents 6 kinds of basic facial expressions: anger is detested, fears, is happy, is sad, is surprised.

10. a kind of human face analysis system for merging multitask combining multi-scale convolutional neural networks, which is characterized in that using as weighed Benefit requires the human face analysis method of the described in any item fusion multitask combining multi-scale convolutional neural networks of 1-9 to carry out face point Analysis.