WO2016026063A1

WO2016026063A1 - A method and a system for facial landmark detection based on multi-task

Info

Publication number: WO2016026063A1
Application number: PCT/CN2014/000769
Authority: WO
Inventors: Xiaoou Tang; Zhanpeng Zhang; Ping Luo; Chen Change Loy
Original assignee: Xiaoou Tang
Priority date: 2014-08-21
Filing date: 2014-08-21
Publication date: 2016-02-25
Also published as: CN106575367A; CN106575367B

Abstract

The present application disclosed a method and system for detecting facial landmarks of a face image. The method may comprise extracting multiple feature maps from at least one facial region of the face image and/or the whole face image; generating a shared facial feature vector from the extracted multiple feature maps; and predicting facial landmark locations of the face image from the generated shared facial feature vector. With the present method and system, the facial landmark detection can be optimized together with heterogeneous but subtly related task, so that the detection robustness can be improved through multi-task learning.

Description

A METHOD AND A SYSTEM FOR FACIAL LANDMARK DETECTION

BASED ON MULTI-TASK

Technical Field

[0001] The present application relates to face alignment, in particular, to a method and a system for facial landmark detection.

Background

[0002] Facial landmark detection is a fundamental component in many face analysis tasks, such as facial attribute inference, face verification, and face recognition, but has long been impeded by problems of occlusion and pose variation.

[0003] Accurate facial landmark detection can be performed using a cascaded CNN (Convolutional Neural Network), in which faces are divided into different parts by pre-partition, each of which is processed by separate deep CNNs. The resulting outputs are subsequently averaged and channeled to separate cascaded layers to process each facial landmark individually.

[0004] In addition, the facial landmark detection is not a standalone problem, and its estimation may be influenced by a number of heterogeneous and subtly correlated factors. For example, when a kid is smiling, his/her mouth is widely opened. Effectively discovering and exploiting such an intrinsically correlated facial attribute would help in detecting the mouth corners more accurately. Also, the inter-ocular distance is smaller in faces with large yaw rotation. Such pose information may be leveraged as additional source of information to constrain the solution space of landmark estimation. Given the rich set of plausible related tasks, treating facial landmark detection in isolation is counterproductive.

[0005] However, different tasks are inherently different in learning difficulties and have different convergence rates. Further, certain tasks are likely to be over-fitting earlier than the others when learning simultaneously, which will jeopardizes the learning convergence of the whole model. Summary

[0006] In one aspect of the present application, disclosed is a method for detecting facial landmarks of a face image. The method may comprise extracting multiple feature maps from at least one facial region of the face image; generating a shared facial feature vector from the extracted multiple feature maps; and predicting facial landmark locations of the face image from the generated shared facial feature vector.

[0007] In another aspect of the present application, disclosed is a system for detecting facial landmarks of a face image. The system may comprise a feature extractor and a predictor. The feature extractor may extract multiple feature maps from at least one facial region of the face image and generate a shared facial feature vector from the extracted multiple feature maps. The predictor may predict facial landmark locations of the face image from the shared facial feature vector generated by the feature extractor.

[0008] According to the present application, there is a method for training a convolutional feature network for performing simultaneously facial landmark detection and at least one associated auxiliary task. The method may comprise 1) sampling a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set; 2) comparing dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error; 3) comparing dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error; 4) back-propagating the generated landmark error and all the training task errors through the convolutional neural network to adjust weights on connections between neurons of the convolutional neural network; 5) sampling a validating face image and its ground-truth target for each auxiliary task from a predetermined validation set; 6) comparing dissimilarities between the target prediction and the ground-truth target to generate a validating task error; and 7) determining if the generated training task error is less than a first predetermined value and the generated validating task error is less than a second predetermined value. If yes, the method for training the convolutional neural network will be terminated, otherwise, the steps l)-7) will be repeated.

[0009] According to the present application, there is further provided a computer-readable medium for storing the instructions executable by one or more processors to implement the above processer of the method.

[0010] In contrast to existing methods, the facial landmark detection can be optimized together with heterogeneous but subtly auxiliary tasks, so that the detection robustness can be improved through multi-task learning, especially in dealing with faces with severe occlusion and pose variation.

[0011] According to the present application, only one single CNN is used, and thus complexity of the required system/device can be reduced. Neither pre-partition of faces nor cascaded convolutional neural layers are required, leading to drastic reduction in model complexity, whilst still achieving comparable or even better accuracy.

[0012] As training proceeds, certain related tasks which are no longer beneficial to the main task when they reach their peak performance, and thus their training process can be halted. According to the present application, the training process of the CNN is conducted with an "early stopping" to stop the related tasks which begin to over-fit the training set and thus harm the main task, so as to facilitate learning convergence.

Brief Description of the Drawing

[0013] Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.

[0014] Fig. 1 is a schematic diagram illustrating a system for facial landmark detection consistent with some disclosed embodiments.

[0015] Fig.2 is a schematic diagram illustrating a training unit as shown in Fig. 1 consistent with some disclosed embodiments. [0016] Fig. 3 is a schematic diagram illustrating an example of a system for facial landmark detection consistent with some disclosed embodiments, in which an example of a convolutional neural network is shown.

[0017] Fig. 4 is a schematic diagram illustrating a system for facial landmark detection when it is implemented in software consistent with some disclosed embodiments.

[0018] Fig. 5 is a schematic flowchart illustrating a method for facial landmark detection consistent with some disclosed embodiments.

[0019] Fig. 6 is a schematic flowchart illustrating a training process of the multi-task convolutional neural network consistent with some disclosed embodiments.

Detailed Description

[0020] Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts.

[0021] Fig. 1 is a schematic diagram illustrating an exemplary system 1000 for facial landmark detection consistent with some disclosed embodiments. According to the system 1000, facial landmark detection (hereinafter, also referred to as main task) is optimized jointly with at least one related/auxiliary task. The facial landmark detection means that detect 2D-locations, i.e, 2D coordinates (x and y) of facial region of a face image. Examples of facial landmark may include, but not limited to, left and right centers of the eyes, nose, left and right corners of the mouth of a face image. Examples of the auxiliary task may include, but not limited to, head pose estimation, demographic such as gender classification, age estimation, facial expression recognition such as smiling or facial attribute inference such as wearing glasses. It shall be appreciated that the number or type of the auxiliary tasks are not limited to those mentioned herein.

[0022] Referring to Fig. 1 again, where the system 1000 is implemented by the hardware, it may comprise a feature extractor 100, a training unit 200 and a predictor 300. The feature extractor 100 may extract multiple feature maps from at least one facial region of the face image and/or the whole face image. Then, a shared facial feature vector may be generated by the feature extractor 100 from the extracted multiple feature maps.

[0023] The predictor 300 may predict facial landmark locations of the face image from the shared facial feature vector extracted by the feature extractor 100. Simultaneously, the predictor 300 may further, from the shared facial feature vector, predict corresponding target of at least one auxiliary task associated with the facial landmark detection. According to the system 1000, the facial landmark detection can be optimized jointly with the auxiliary tasks.

[0024] According to an embodiment, the feature extractor 100 may comprise a convolutional neural network. The network may comprise a plurality of convolution-pooling layers and a fully connect layer. In the network, each of the plurality of convolution-pooling layers may perform convolution and max-pooling operations, and the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps. The fully connect layer may generate the shared facial feature vector from all the extracted multiple feature maps.

[0025] An example of the network is shown in Fig. 3, in which the convolutional neural network comprises an input layer, a plurality of (for example, three) convolution-pooling layers comprising one or more (for example, three) convolutional layers and one or more (for example, three) pooling layers, one convolutional layer and one fully connected layer. It is noted that the network is shown for exemplary, and the convolutional neural network in the feature extractor is not limited to it. As shown in Fig. 3, a 40 X 40 (for example) gray-scale face image is inputted in the input layer. The first convolution-pooling layer extracts feature maps from the inputted image. Then, the second convolution-pooling layer takes the output of first layer as input, to generate different feature maps. This process is continued by using all three convolution-pooling layers. At the end, the multiple layers of feature maps are used by the fully connected layer to generate the shared facial feature vector. That is, the shared facial feature vector is generated by performing multiple times of convolution and max pooling operations. Each layer contains a plurality of neurons with local or global receptive fields, and the weights on connection between the neurons of the convolutional neural network may be adjusted, so that the network is trained accordingly.

[0026] According to an embodiment, the system 1000 may further comprise a training unit 200. The training unit 200 may train, with a predetermined training set, the feature extractor so as to adjust the weights on connections between the neurons of the convolutional neural network such that the trained feature extractor is capable of extracting the shared facial feature vector. According to an embodiment of the present application shown in Fig. 2, the training unit 200 may comprise a sampler 201, a comparator 202 and a back-propagator 203.

[0027] As shown in Fig. 2, the sampler 201 may sample a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set. According to an embodiment, five ground-truth landmarks, that is, centers of the eyes, nose tip, corners of the mouth may be annotated directly on each training face image. According to another embodiment, the ground-truth target for each auxiliary task may be labeled manually. For example, for gender classification, the ground-truth target may be labeled as female (F) or male (M). For facial attribute inference, such as wearing glasses, the ground-truth target may be labeled as wearing (Y) or not wearing (N). For head pose estimation,

(0° , ±30°, ±60°) may be labeled and for expression recognition, such as smiling, yes/no may labeled accordingly.

[0028] The comparator 202 may compare dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error. The landmark error may be obtained by using, for example, least square method. The comparator 203 may further compare dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error. According to another embodiment, the training task error may be obtained by using, for example, cross-entropy method.

[0029] The back-propagator 203 may back-propagate the generated landmark error and all the training task errors through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network.

[0030] According to an embodiment, the training unit 200 may further comprise a determiner 204. The determiner 204 may determine whether the training process of the facial landmark detection is converged. According to another embodiment, the determiner 204 may further determine whether the training process of each task is converged, which will be discussed later.

[0031] Hereinafter, components in the training unit 200 as mentioned above will be discussed in detail. For purpose of illustration, we will describe an embodiment in which T tasks are trained jointly by the training unit 200. For the T tasks, facial landmark detection, i.e., main task is denoted as r, and one of at least one related/auxiliary task is denoted as a, where

[0032] For each of the tasks, the training data is denoted as ,

where N represents number of the training data. In

particular, for the facial landmark detection r, the training data is denoted as

where is the 2D coordinates of the five landmarks. For the task a, the

training data is denoted as . In the embodiment, four tasks and s are

shown and represent inferences of 'pose', 'gender', 'wear glasses', and 'smiling', respectively. Thus, represents five different poses

and are binary attributes and represent female/male, not

wearing/wearing glasses and not smiling/smiling, respectively. Different weights are assigned to the main task r and each auxiliary task a, and are denoted as W and respectively.

[0033] Then, an objection function of all the tasks is formulated as below to optimize the main task r and the auxiliary task a:

where, is a linear function of and a weight vector

represents loss function; represents importance coefficient of a-th task's error; and represents a shared facial feature vector.

[0034] According to an embodiment, least square and cross-entropy functions are used as the loss function /(·) for the main task r and the auxiliary task a, respectively, to generate corresponding landmark error and training task errors. Therefore, the above objective function can be rewritten as below:

[0035] In the in the first term is a linear function. The

second term is a posterior probability function

wherein, denotes the column of a weight matrix of the task a, w. The third

term penalizes large weights

[0036] According to an embodiment, weights of all the task may be updated accordingly. In particular, the weight matrix of the facial landmark detection is updated by where η represents the learning rate (such as η = 0.003),

and Also, the weight matrix of each task a may be

calculated in a similar manner as

[0037] Then, the generated landmark error and the training task errors may be back-propagated layer by layer until the lowest layer by the back-propagator 203 through the convolutional neural network to adjust weights on connections between neurons of the convolutional neural network.

[0038] According to an embodiment, the error may be propagated back through the network followin a back-propagation strategy as below:

[0039] In the

q represent all the error in the layer with

. For example, ε represents the error of the lowest layer, and

represents the error of the second lowest layer. The errors of the lower la ers are computed following Eq.(3). For instance,

where is the gradient of the activation function of the

network

[0040] The above training process is repeated until the training process of the facial landmark detection is determined by the determiner 204 to be converged. In other words, if the error is less than a predetermined value, the training process will be determined to be converged. With the above training process, the feature extractor 100 is capable of extracting the shared facial feature vector from a given face image.

According to an embodiment, for any face image the trained feature extractor

100 extracts a shared feature vector Then, the landmark locations is predicated by

and the prediction targets of the auxiliary tasks are obtained by

[0041] During the above training process, at least one auxiliary task is trained simultaneously. However, different tasks have different loss functions and learning difficulties, and thus have different convergence rates. According to another embodiment, the determiner 204 may further determine whether the training process of the auxiliary tasks is converged.

[0042] In particular, represent values of the loss function of task a

on a validation set and the training set, respectively, if one task's measure exceeds a threshold as below, the task will stop:

[0043] In the Eq.(4), t represents the current iteration, k represents a training length, and

represents the importance coefficient of a-th task's error. The 'meet denotes the function for calculating median value. The first term in Eq.(4) represents the tendency of the training task error of the task a. If the training error drops rapidly within a period of length k, the value of the first term is small, which indicates that training process of the task can be continued as the task is still valuable. Otherwise, the first term is large, then the task is more likely to be stopped. From this, an auxiliary task can be switched off during the training process before it over-fit, so that the task can be "early stopped" before it begins to over-fit the training set and thus harm the main task.

[0044] With the above training process, the feature extractor 100 is capable of extracting a shared facial feature vector from any face image. For example, a face image x° is inputted in the input layer of the convolutional neural network as for example shown in Fig. 3. There are multiple sets of convolutional filters plus an activation function applied to a face image in each convolutional layer in the CNN and they are applied sequentially to project the face image to the higher layer. That is, the face image is projected to higher layer gradually by learning a sequence of non-linear mappings as below to obtain the shared facial feature vector x

[0045] Here, σ(·) and W^s represent the non-linear activation function applied to the face image and the filters needed to be learned in the layer / of CNN. For instance, Referring to Fig. 3 again, the shared facial feature vector can

be used for landmark detection and auxiliary/related tasks simultaneously in the estimation stage.

[0046] It shall be appreciated that the system 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.

[0047] In the case that the system 1000 is implemented with software, the system 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 4, the system 1000 may include one or more processors (processors 102, 104, 106 etc.), a memory 112, a storage device 116, and a bus to facilitate information exchange among various components of system 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphic processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods that will be explained in greater detail below.

[0048] Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106. It is noted that although only one block is shown in Fig. 1, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.

[0049] The system for facial landmark detection is described in the above. The following will describe a method for facial landmark detection with reference to Figs. 5 and 6.

[0050] Fig. 5 shows a schematic flowchart of the method for facial landmark detection and Fig. 6 shows a schematic flowchart of a training process of the multi-task convolutional neural network by the training unit 200.

[0051] In Figs. 5 and 6, methods 500 and 600 comprise a series of steps that may be performed by one or more of processors 102-106 or each module/unit of the system 1000 to implement a data processing operation. For purpose of description, the following discussion is made in reference to the situation where each module/unit of the system 1000 is made in hardware or the combination of hardware and software. The skilled in the art shall appreciate that other suitable devices or systems shall be applicable to carry out the following process and the system 1000 are just used to be an illustration to carry out the process.

[0052] As shown in Fig. 5, multiple feature maps are extracted by the feature extractor 100 from at least one facial region of the face image in step S501. In another embodiment, the multiple feature maps may be extracted from the whole face image in step S501. Then, in step S502, a shared facial feature vector is generated from the multiple feature maps extracted in step S501. In step S503, facial landmark locations of the face image is predicted from the shared facial feature vector generated in step S502. According to another embodiment, the shared facial feature vector may be used to predict corresponding target of at least one auxiliary task associated with the facial landmark detection. Then, the target predictions of all the auxiliary tasks are obtained simultaneously.

[0053] According to an embodiment, the feature extractor comprises a convolutional neural network comprising a plurality of convolution-pooling layers and a fully connect layer. Each of convolution-pooling layers is configured to perform convolution and max-pooling operations,. In the embodiment, in step S501, the multiple feature maps may be extracted by the plurality of convolution-pooling layer consecutively, wherein the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps. In step S502, the shared facial feature vector may be generated by the fully connect layer from all the multiple feature maps extracted in step S501.

[0054] In the embodiment, the method 500 further comprises a training step (not shown in Fig. 5), which will be discussed with reference to Fig. 6.

[0055] As shown in Fig. 6, in step S601, a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task are sampled from the predetermined training set. For the training face image, its facial landmark prediction and the target predictions of all the auxiliary tasks may be obtained from the predicator 300 accordingly in step S602. Then, the dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations is compared to generate a landmark error in step S603. In step S604, dissimilarities between the target predictions and the ground-truth target for each auxiliary task are compared respectively to generate at least one training task error. Then, the generated landmark error and all the training task errors are back-propagated through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network in step S605. In step S606, it is determined that one of the auxiliary tasks is converged. If no, the process 600 turns back to step S606. If yes, the training process of the task is stopped in step S607 and proceeds to step S608. In the step S608, it is determined that the training process of the facial landmark detection is converged. If yes, the process 600 ends. If no, the process 600 turns back to step S601.

[0056] From this, the facial landmark detection can be optimized together with heterogeneous but subtly related tasks.

[0057] Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims is intended to be considered as comprising the preferred examples and all the variations or modifications fell into the scope of the present invention.

[0058] Obviously, those skilled in the art can make variations or modifications to the present invention without departing the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent technique, they may also fall into the scope of the present invention.

Claims

What is claimed is:

1. A method for detecting facial landmarks of a face image, comprising:

extracting multiple feature maps from at least one facial region of the face image;

generating a shared facial feature vector from the extracted multiple feature maps; and

predicting facial landmark locations of the face image from the generated shared facial feature vector.

2. A method of claim 1, wherein the facial landmark comprises at least one selected from a group consisting of centers of the eyes, nose, corners of the mouth of a face image.

3. A method of claim 1, wherein in the step of predicting, the shared facial feature vector is used to predict corresponding target of at least one auxiliary task associated with the facial landmark detection, so as to obtain target predictions of all the auxiliary tasks simultaneously.

4. A method of claim 3, wherein the auxiliary tasks comprises at least one selected from a group consisting of head pose estimation, gender classification, age estimation, facial expression recognition or facial attribute inference.

5. A method of claim 4, wherein the step of extracting and generating are performed by a convolutional neural network comprising a plurality of convolution-pooling layers, each of which is configured to perform convolution and max-pooling operations, and

wherein the step of extracting further comprises:

extracting the multiple feature maps by the plurality of convolution-pooling layer consecutively, wherein the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps.

6. A method of claim 5, wherein the convolutional neural network further comprises a fully connect layer, and in the step of generating, the shared facial feature vector is generated by the fully connect layer from all the extracted multiple feature maps.

7. A method of claim 6, wherein each layer of the convolutional neural network has a plurality of neurons, and wherein the method further comprises:

training, with a predetermined training set, the network so as to adjust each weight on connections between the neurons of the network such that the shared facial feature vector is generated by the network with the adjusted weight.

8. A method of claim 7, wherein the step of training further comprises:

sampling a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set;

comparing dissimilarities between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error;

comparing dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error; and

back-propagating the generated landmark error and the generated training task error through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network;

repeating the steps of sampling, comparing and back-propagating until the generated landmark error is less than a first predetermined value and the generated training task error is less than a second predetermined value.

9. A method of claim 8, wherein the comparison to generate a landmark error is performed by rule of a least square process and the comparison to generate a training task error is performed by rule of a cross-entropy process.

10. A method of claim 8, wherein, for each auxiliary task, the step of training further comprising:

sampling a validating face image and its ground-truth target for each auxiliary tasks from a predetermined validation set;

comparing dissimilarity between the target prediction and the ground-truth target to generate a validating task error;

repeating the sampling and the comparing until the generated training task error is less than a third predetermined value and the generated validating task error is less than a fourth predetermined value.

11. A method of claim 1, wherein in the step of predicting, the predicted facial landmark locations of the face image is determined by rule of

where W represents weight assigned to the facial landmark detection; and

represents the shared facial feature vector, and T represents transpose.

12. A system for detecting facial landmarks of a face image, comprising:

a feature extractor configured to,

extract multiple feature maps from at least one facial region of the face image; and

generate a shared facial feature vector from the extracted multiple feature maps; and

a predictor configured to predict facial landmark locations of the face image from the shared facial feature vector generated by the feature extractor.

13. A system of claim 12, wherein the predictor is further configured to obtain target predictions of at least one auxiliary task associated with the facial landmark detection by using the shared facial feature vector simultaneously.

14. A system of claim 12, wherein the feature extractor further comprises a convolutional neural network, wherein the convolutional neural network comprises:

a plurality of convolution-pooling layers configured to perform convolution and max-pooling operations, and wherein the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps; and

a fully connect layer configured to generate the shared facial feature vector from all the extracted multiple feature maps.

15. A system of claim 13, wherein each layer of the convolutional neural network has a plurality of neurons, and wherein the system further comprises:

a training unit configured to train, with a predetermined training set, the network so as to adjust the weights on connections between the neurons of the network such that the trained network is capable of extracting the shared facial feature vector.

16. A system of claim 15, wherein the training unit further comprises:

a sampler configured to sample a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set;

a comparator is configured to compare dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error and to compare dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error; and

a back-propagator configured to back-propagate the generated landmark error and the training task errors through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network.

17. A system of claim 15, wherein the training unit further comprises:

a determiner configured to determine whether training process of the facial landmark detection is converged and whether training process of each task is converged.

18. A system of claim 12, wherein the facial landmark comprises at least one selected from a group consisting of centers of the eyes, nose, corners of the mouth of a face image.

19. A system of claim 13, wherein the auxiliary tasks comprises at least one selected from a group consisting of head pose estimation, gender classification, age estimation, facial expression recognition or facial attribute inference

20. A method for training a convolutional neural network for performing simultaneously facial landmark detection and at least one associated auxiliary task, comprising:

1) sampling a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set;

2) comparing dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error;

3) comparing dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error;

4) back-propagating the generated landmark error and all the training task errors through the convolutional neural network to adjust weights on connections between neurons of the convolutional neural network;

5) sampling a validating face image and its ground-truth target for each auxiliary task from a predetermined validation set;

6) comparing dissimilarities between the target prediction and the ground-truth target to generate a validating task error;

7) determining if the generated training task error is less than a first predetermined value and the generated validating task error is less than a second predetermined value; and

if yes, the method for training the convolutional neural network will be terminated, otherwise, the steps l)-7) will be repeated.