WO2016026063A1 - A method and a system for facial landmark detection based on multi-task - Google Patents


Info

Publication number
WO2016026063A1
Authority
WO
WIPO (PCT)
Prior art keywords
facial
training
task
landmark
error
Prior art date
Application number
PCT/CN2014/000769
Other languages
French (fr)
Inventor
Xiaoou Tang
Zhanpeng Zhang
Ping Luo
Chen Change Loy
Original Assignee
Xiaoou Tang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoou Tang
Priority to PCT/CN2014/000769 (WO2016026063A1)
Priority to CN201480081241.1A (CN106575367B)
Publication of WO2016026063A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161: Detection; Localisation; Normalisation
    • G06V 40/165: Detection; Localisation; Normalisation using facial parts and geometric relationships

Definitions

  • the sampler 201 may sample a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set.
  • According to an embodiment, five ground-truth landmarks, that is, the centers of the eyes, the nose tip, and the corners of the mouth, may be annotated directly on each training face image.
  • the ground-truth target for each auxiliary task may be labeled manually.
  • for gender classification, the ground-truth target may be labeled as female (F) or male (M).
  • for facial attribute inference, such as wearing glasses, the ground-truth target may be labeled as wearing (Y) or not wearing (N).
  • for head pose estimation, (0°, ±30°, ±60°) may be labeled and, for expression recognition, such as smiling, yes/no may be labeled accordingly.
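The annotations above can be encoded as one integer class target per auxiliary task. The particular mappings below are hypothetical, chosen only to illustrate the labeling scheme described in the text:

```python
# Hypothetical encodings for the auxiliary-task annotations described above.
GENDER = {"F": 0, "M": 1}
GLASSES = {"N": 0, "Y": 1}
POSE = {-60: 0, -30: 1, 0: 2, 30: 3, 60: 4}   # (0, +/-30, +/-60) degrees
SMILING = {"no": 0, "yes": 1}

def encode_targets(gender, glasses, pose, smiling):
    """Map manual annotations to one integer target per task."""
    return [GENDER[gender], GLASSES[glasses], POSE[pose], SMILING[smiling]]

print(encode_targets("F", "Y", 30, "yes"))   # [0, 1, 3, 1]
```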
  • the comparator 202 may compare dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error.
  • the landmark error may be obtained by using, for example, least square method.
  • the comparator 202 may further compare dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error.
  • the training task error may be obtained by using, for example, cross-entropy method.
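The two error terms above can be sketched in Python. The least-squares form for the landmark error and the cross-entropy form for a classification task error are the ones the text names, but the exact reductions (mean versus sum) are assumptions:

```python
import numpy as np

def landmark_error(pred, gt):
    """Least-squares landmark error: mean squared difference between
    predicted and ground-truth (x, y) coordinates (reduction assumed)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean((pred - gt) ** 2))

def task_error(probs, label):
    """Cross-entropy error for one auxiliary classification task:
    negative log-probability of the ground-truth class."""
    return float(-np.log(np.asarray(probs, float)[label]))

# Five landmarks -> ten coordinates; a perfect prediction gives zero error.
gt = [0.2, 0.3, 0.8, 0.3, 0.5, 0.5, 0.3, 0.8, 0.7, 0.8]
print(landmark_error(gt, gt))   # 0.0
```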
  • the back-propagator 203 may back-propagate the generated landmark error and all the training task errors through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network.
  • the training unit 200 may further comprise a determiner 204.
  • the determiner 204 may determine whether the training process of the facial landmark detection is converged.
  • the determiner 204 may further determine whether the training process of each task is converged, which will be discussed later.
  • According to an embodiment, T tasks, comprising the main task (denoted r) and the auxiliary tasks (each denoted a), are trained jointly by the training unit 200 on the N samples of the training data.
  • the weights of all the tasks may be updated accordingly, and the weight matrix of each auxiliary task a may be calculated in a similar manner as that of the main task r.
  • the generated landmark error and the training task errors may be back-propagated layer by layer until the lowest layer by the back-propagator 203 through the convolutional neural network to adjust weights on connections between neurons of the convolutional neural network.
  • the error may be propagated back through the network following a back-propagation strategy (Eq. (3)), in which the error of each lower layer is computed from the error of the layer above it, mapped back through the connection weights and multiplied by the gradient of the activation function of that layer.
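The layer-by-layer propagation described above can be sketched as follows. Since Eq. (3) is not reproduced in the extracted text, the sigmoid activation and the weight shapes here are assumptions:

```python
import numpy as np

def backprop_deltas(weights, zs, top_delta):
    """Propagate an error term from the top layer down to the lowest layer:
    each lower layer's error is the layer above's error mapped back through
    the weights and scaled by the activation gradient at that layer."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    deltas = [np.asarray(top_delta, float)]
    for W, z in zip(reversed(weights), reversed(zs)):
        grad = sigmoid(z) * (1.0 - sigmoid(z))     # sigma'(z), assumed sigmoid
        deltas.append((W.T @ deltas[-1]) * grad)
    return deltas[::-1]                             # lowest layer first
```

Here `zs[l]` holds the pre-activations of layer l and `weights[l]` maps layer l to layer l + 1; both names are illustrative.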
  • the above training process is repeated until the training process of the facial landmark detection is determined by the determiner 204 to be converged. In other words, if the error is less than a predetermined value, the training process will be determined to be converged.
  • the feature extractor 100 is capable of extracting the shared facial feature vector from a given face image.
  • the determiner 204 may further determine whether the training process of the auxiliary tasks is converged.
  • In Eq. (4), t represents the current iteration, k represents a training length, and med denotes the function for calculating the median value.
  • the first term in Eq. (4) represents the tendency of the training task error of the task a. If the training error drops rapidly within a period of length k, the value of the first term is small, which indicates that training of the task can be continued as the task is still valuable. Otherwise, the first term is large and the task is more likely to be stopped. In this way, an auxiliary task can be switched off, i.e. "early stopped", during the training process before it begins to over-fit the training set and thus harm the main task.
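Because Eq. (4) itself is not reproduced in the extracted text, the following is only an illustrative reading of the described behaviour: a trend term that is small while the training error still drops over a window of length k, multiplied by a generalization term that grows once the validation error rises above its minimum. The exact form and the threshold are assumptions:

```python
import numpy as np

def should_stop(train_errors, val_errors, k=5, threshold=10.0):
    """Illustrative early-stopping test for one auxiliary task."""
    if len(train_errors) <= k:
        return False
    window = np.asarray(train_errors[-k:], float)
    drop = max(window[0] - window[-1], 1e-12)       # progress within the window
    trend = float(np.median(window)) / drop          # small while still improving
    val = np.asarray(val_errors, float)
    overfit = (val[-1] - val.min()) / max(val.min(), 1e-12)
    return bool(trend * overfit > threshold)
```

A task whose training error still falls steadily is kept; a task whose training error has flattened while its validation error climbs is halted.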
  • the feature extractor 100 is capable of extracting a shared facial feature vector from any face image.
  • As an example of extracting the shared facial feature vector, a face image x⁰ is inputted in the input layer of the convolutional neural network, as for example shown in Fig. 3.
  • σ(·) and W respectively represent the non-linear activation function applied to the input and the filters to be learned in layer l of the CNN.
  • By applying the filters and the activation function layer by layer, the shared facial feature vector can be obtained.
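The forward computation just described amounts to repeated application of x = σ(W x). In this sketch, tanh stands in for σ and the layer widths are assumptions:

```python
import numpy as np

def shared_feature_vector(x0, weights):
    """Apply x = sigma(W @ x) layer by layer; the final activation vector
    plays the role of the shared facial feature vector."""
    x = np.asarray(x0, float)
    for W in weights:
        x = np.tanh(W @ x)    # tanh stands in for the activation sigma
    return x
```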
  • system 1000 may be implemented using certain hardware, software, or a combination thereof.
  • embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
  • the system 1000 may include a general purpose computer, a computer cluster, a mainframe computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
  • the system 1000 may include one or more processors (processors 102, 104, 106 etc.), a memory 112, a storage device 116, and a bus to facilitate information exchange among various components of system 1000.
  • processors 102-106 may include a central processing unit ("CPU"), a graphic processing unit ("GPU"), or other suitable information processing devices.
  • processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods that will be explained in greater detail below.
  • Memory 112 can include, among other things, a random access memory (“RAM”) and a read-only memory (“ROM”). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106. It is noted that although only one block is shown in Fig. 1, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
  • Fig. 5 shows a schematic flowchart of the method for facial landmark detection.
  • Fig. 6 shows a schematic flowchart of a training process of the multi-task convolutional neural network by the training unit 200.
  • methods 500 and 600 comprise a series of steps that may be performed by one or more of processors 102-106 or each module/unit of the system 1000 to implement a data processing operation.
  • the following discussion is made in reference to the situation where each module/unit of the system 1000 is made in hardware or the combination of hardware and software.
  • those skilled in the art shall appreciate that other suitable devices or systems are also applicable to carry out the following processes; the system 1000 is merely used as an illustration.
  • multiple feature maps are extracted by the feature extractor 100 from at least one facial region of the face image in step S501.
  • the multiple feature maps may be extracted from the whole face image in step S501.
  • In step S502, a shared facial feature vector is generated from the multiple feature maps extracted in step S501.
  • facial landmark locations of the face image are predicted from the shared facial feature vector generated in step S502.
  • the shared facial feature vector may be used to predict corresponding target of at least one auxiliary task associated with the facial landmark detection. Then, the target predictions of all the auxiliary tasks are obtained simultaneously.
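The step above, producing the landmark prediction and all auxiliary targets from one shared vector, can be sketched with one linear regression head plus one softmax head per auxiliary task. The head shapes and the linear/softmax choice are assumptions:

```python
import numpy as np

def predict_all(shared, W_landmarks, task_heads):
    """One shared facial feature vector feeds every head: a linear head
    for the 2D landmark coordinates and a softmax head per auxiliary task."""
    landmarks = W_landmarks @ shared
    targets = []
    for W in task_heads:
        logits = W @ shared
        scores = np.exp(logits - np.max(logits))   # numerically stable softmax
        targets.append(scores / scores.sum())
    return landmarks, targets
```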
  • the feature extractor comprises a convolutional neural network comprising a plurality of convolution-pooling layers and a fully connected layer.
  • each of the convolution-pooling layers is configured to perform convolution and max-pooling operations.
  • the multiple feature maps may be extracted by the plurality of convolution-pooling layers consecutively, wherein the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps.
  • the shared facial feature vector may be generated by the fully connected layer from all the multiple feature maps extracted in step S501.
  • the method 500 further comprises a training step (not shown in Fig. 5), which will be discussed with reference to Fig. 6.
  • In step S601, a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task are sampled from the predetermined training set.
  • For the training face image, its facial landmark prediction and the target predictions of all the auxiliary tasks may be obtained from the predictor 300 accordingly in step S602.
  • the dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations is compared to generate a landmark error in step S603.
  • In step S604, dissimilarities between the target predictions and the ground-truth target for each auxiliary task are compared respectively to generate at least one training task error.
  • step S606 it is determined that one of the auxiliary tasks is converged. If no, the process 600 turns back to step S606. If yes, the training process of the task is stopped in step S607 and proceeds to step S608. In the step S608, it is determined that the training process of the facial landmark detection is converged. If yes, the process 600 ends. If no, the process 600 turns back to step S601.
  • the facial landmark detection can be optimized together with heterogeneous but subtly related tasks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present application discloses a method and system for detecting facial landmarks of a face image. The method may comprise extracting multiple feature maps from at least one facial region of the face image and/or the whole face image; generating a shared facial feature vector from the extracted multiple feature maps; and predicting facial landmark locations of the face image from the generated shared facial feature vector. With the present method and system, the facial landmark detection can be optimized together with heterogeneous but subtly related tasks, so that the detection robustness can be improved through multi-task learning.

Description

A METHOD AND A SYSTEM FOR FACIAL LANDMARK DETECTION
BASED ON MULTI-TASK
Technical Field
[0001] The present application relates to face alignment, in particular, to a method and a system for facial landmark detection.
Background
[0002] Facial landmark detection is a fundamental component in many face analysis tasks, such as facial attribute inference, face verification, and face recognition, but has long been impeded by problems of occlusion and pose variation.
[0003] Accurate facial landmark detection can be performed using a cascaded CNN (Convolutional Neural Network), in which faces are divided into different parts by pre-partition, each of which is processed by separate deep CNNs. The resulting outputs are subsequently averaged and channeled to separate cascaded layers to process each facial landmark individually.
[0004] In addition, facial landmark detection is not a standalone problem; its estimation may be influenced by a number of heterogeneous and subtly correlated factors. For example, when a kid is smiling, his/her mouth is widely open. Effectively discovering and exploiting such an intrinsically correlated facial attribute would help in detecting the mouth corners more accurately. Also, the inter-ocular distance is smaller in faces with large yaw rotation. Such pose information may be leveraged as an additional source of information to constrain the solution space of landmark estimation. Given the rich set of plausible related tasks, treating facial landmark detection in isolation is counterproductive.
[0005] However, different tasks are inherently different in learning difficulty and have different convergence rates. Further, certain tasks are likely to over-fit earlier than the others when learned simultaneously, which jeopardizes the learning convergence of the whole model.
Summary
[0006] In one aspect of the present application, disclosed is a method for detecting facial landmarks of a face image. The method may comprise extracting multiple feature maps from at least one facial region of the face image; generating a shared facial feature vector from the extracted multiple feature maps; and predicting facial landmark locations of the face image from the generated shared facial feature vector.
[0007] In another aspect of the present application, disclosed is a system for detecting facial landmarks of a face image. The system may comprise a feature extractor and a predictor. The feature extractor may extract multiple feature maps from at least one facial region of the face image and generate a shared facial feature vector from the extracted multiple feature maps. The predictor may predict facial landmark locations of the face image from the shared facial feature vector generated by the feature extractor.
[0008] According to the present application, there is a method for training a convolutional neural network for performing simultaneously facial landmark detection and at least one associated auxiliary task. The method may comprise 1) sampling a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set; 2) comparing dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error; 3) comparing dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error; 4) back-propagating the generated landmark error and all the training task errors through the convolutional neural network to adjust weights on connections between neurons of the convolutional neural network; 5) sampling a validating face image and its ground-truth target for each auxiliary task from a predetermined validation set; 6) comparing dissimilarities between the target prediction and the ground-truth target to generate a validating task error; and 7) determining if the generated training task error is less than a first predetermined value and the generated validating task error is less than a second predetermined value. If yes, the method for training the convolutional neural network will be terminated; otherwise, the steps 1)-7) will be repeated.
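The seven numbered steps of [0008] can be arranged as a loop. Every callable below is a placeholder standing in for the corresponding CNN operation, and the sampling and tolerance details are assumptions:

```python
def train_multitask(sample_train, sample_val, predict, landmark_error,
                    task_errors, backprop, tol_train=1e-3, tol_val=1e-3,
                    max_iters=1000):
    """Steps 1)-7): sample, compare, back-propagate, validate, and stop
    once both training and validating task errors fall below tolerance."""
    for it in range(1, max_iters + 1):
        image, gt_lm, gt_targets = sample_train()              # step 1
        pred_lm, pred_targets = predict(image)
        lm_err = landmark_error(pred_lm, gt_lm)                # step 2
        t_errs = task_errors(pred_targets, gt_targets)         # step 3
        backprop(lm_err, t_errs)                               # step 4
        v_image, v_targets = sample_val()                      # step 5
        _, v_pred = predict(v_image)
        v_errs = task_errors(v_pred, v_targets)                # step 6
        if max(t_errs) < tol_train and max(v_errs) < tol_val:  # step 7
            return it                                          # converged
    return None
```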
[0009] According to the present application, there is further provided a computer-readable medium for storing instructions executable by one or more processors to implement the steps of the method described above.
[0010] In contrast to existing methods, the facial landmark detection can be optimized together with heterogeneous but subtly related auxiliary tasks, so that the detection robustness can be improved through multi-task learning, especially in dealing with faces with severe occlusion and pose variation.
[0011] According to the present application, only one single CNN is used, and thus complexity of the required system/device can be reduced. Neither pre-partition of faces nor cascaded convolutional neural layers are required, leading to drastic reduction in model complexity, whilst still achieving comparable or even better accuracy.
[0012] As training proceeds, certain related tasks are no longer beneficial to the main task once they reach their peak performance, and thus their training process can be halted. According to the present application, the training process of the CNN is conducted with "early stopping" to halt the related tasks that begin to over-fit the training set and thus harm the main task, so as to facilitate learning convergence.
Brief Description of the Drawing
[0013] Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
[0014] Fig. 1 is a schematic diagram illustrating a system for facial landmark detection consistent with some disclosed embodiments.
[0015] Fig. 2 is a schematic diagram illustrating a training unit as shown in Fig. 1 consistent with some disclosed embodiments.
[0016] Fig. 3 is a schematic diagram illustrating an example of a system for facial landmark detection consistent with some disclosed embodiments, in which an example of a convolutional neural network is shown.
[0017] Fig. 4 is a schematic diagram illustrating a system for facial landmark detection when it is implemented in software consistent with some disclosed embodiments.
[0018] Fig. 5 is a schematic flowchart illustrating a method for facial landmark detection consistent with some disclosed embodiments.
[0019] Fig. 6 is a schematic flowchart illustrating a training process of the multi-task convolutional neural network consistent with some disclosed embodiments.
Detailed Description
[0020] Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts.
[0021] Fig. 1 is a schematic diagram illustrating an exemplary system 1000 for facial landmark detection consistent with some disclosed embodiments. According to the system 1000, facial landmark detection (hereinafter, also referred to as the main task) is optimized jointly with at least one related/auxiliary task. Facial landmark detection means detecting the 2D locations, i.e., the 2D coordinates (x and y), of facial landmarks in the facial region of a face image. Examples of facial landmarks may include, but are not limited to, the centers of the left and right eyes, the nose, and the left and right corners of the mouth of a face image. Examples of the auxiliary task may include, but are not limited to, head pose estimation, demographic estimation such as gender classification, age estimation, facial expression recognition such as smiling, or facial attribute inference such as wearing glasses. It shall be appreciated that the number or type of the auxiliary tasks is not limited to those mentioned herein.
[0022] Referring to Fig. 1 again, where the system 1000 is implemented in hardware, it may comprise a feature extractor 100, a training unit 200 and a predictor 300. The feature extractor 100 may extract multiple feature maps from at least one facial region of the face image and/or the whole face image. Then, a shared facial feature vector may be generated by the feature extractor 100 from the extracted multiple feature maps.
[0023] The predictor 300 may predict facial landmark locations of the face image from the shared facial feature vector extracted by the feature extractor 100. Simultaneously, the predictor 300 may further, from the shared facial feature vector, predict corresponding target of at least one auxiliary task associated with the facial landmark detection. According to the system 1000, the facial landmark detection can be optimized jointly with the auxiliary tasks.
[0024] According to an embodiment, the feature extractor 100 may comprise a convolutional neural network. The network may comprise a plurality of convolution-pooling layers and a fully connected layer. In the network, each of the plurality of convolution-pooling layers may perform convolution and max-pooling operations, and the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps. The fully connected layer may generate the shared facial feature vector from all the extracted multiple feature maps.
[0025] An example of the network is shown in Fig. 3, in which the convolutional neural network comprises an input layer, a plurality of (for example, three) convolution-pooling layers comprising one or more (for example, three) convolutional layers and one or more (for example, three) pooling layers, one further convolutional layer and one fully connected layer. It is noted that the network is shown for exemplary purposes, and the convolutional neural network in the feature extractor is not limited to it. As shown in Fig. 3, a 40 × 40 (for example) gray-scale face image is inputted into the input layer. The first convolution-pooling layer extracts feature maps from the inputted image. Then, the second convolution-pooling layer takes the output of the first layer as input to generate different feature maps. This process continues through all three convolution-pooling layers. At the end, the resulting feature maps are used by the fully connected layer to generate the shared facial feature vector. That is, the shared facial feature vector is generated by performing multiple rounds of convolution and max-pooling operations. Each layer contains a plurality of neurons with local or global receptive fields, and the weights on the connections between the neurons of the convolutional neural network may be adjusted, so that the network is trained accordingly.
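The convolution and max-pooling operations described above can be sketched in a minimal, illustrative form. The 6 × 6 image, the 3 × 3 filter and all values below are toy stand-ins, not the patented 40 × 40 network or its learned filters:

```python
# Illustrative sketch (not the patented network): one convolution-plus-
# max-pooling stage of the kind stacked in Fig. 3, followed by flattening
# into a feature vector. All sizes and filter values are made up.

def conv2d_valid(img, kernel):
    """'Valid' 2-D convolution (strictly, cross-correlation) on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    return [[sum(img[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool2x2(fm):
    """Non-overlapping 2x2 max-pooling of a feature map."""
    return [[max(fm[2 * i][2 * j], fm[2 * i][2 * j + 1],
                 fm[2 * i + 1][2 * j], fm[2 * i + 1][2 * j + 1])
             for j in range(len(fm[0]) // 2)]
            for i in range(len(fm) // 2)]

image = [[(i * 7 + j * 3) % 10 for j in range(6)] for i in range(6)]  # toy 6x6 "face"
edge_filter = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]  # hypothetical 3x3 filter

fmap = conv2d_valid(image, edge_filter)   # 4x4 feature map
pooled = max_pool2x2(fmap)                # 2x2 after max-pooling
shared_vector = [v for row in pooled for v in row]  # flattened, as a fully
                                                    # connected layer would consume
```

In the real network several such stages are stacked, and the fully connected layer combines all resulting feature maps into the shared facial feature vector.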
[0026] According to an embodiment, the system 1000 may further comprise a training unit 200. The training unit 200 may train, with a predetermined training set, the feature extractor so as to adjust the weights on connections between the neurons of the convolutional neural network such that the trained feature extractor is capable of extracting the shared facial feature vector. According to an embodiment of the present application shown in Fig. 2, the training unit 200 may comprise a sampler 201, a comparator 202 and a back-propagator 203.
[0027] As shown in Fig. 2, the sampler 201 may sample a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set. According to an embodiment, five ground-truth landmarks, that is, the centers of the eyes, the nose tip and the corners of the mouth, may be annotated directly on each training face image. According to another embodiment, the ground-truth target for each auxiliary task may be labeled manually. For example, for gender classification, the ground-truth target may be labeled as female (F) or male (M). For facial attribute inference, such as wearing glasses, the ground-truth target may be labeled as wearing (Y) or not wearing (N). For head pose estimation, one of five poses (0°, ±30°, ±60°) may be labeled, and for expression recognition, such as smiling, yes/no may be labeled accordingly.
[0028] The comparator 202 may compare the dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error. The landmark error may be obtained by using, for example, a least square method. The comparator 202 may further compare the dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error. According to another embodiment, the training task error may be obtained by using, for example, a cross-entropy method.
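The two error types can be illustrated with a minimal sketch. The landmark coordinates and class probabilities below are made-up values for demonstration only:

```python
import math

def landmark_error(pred, truth):
    """Least-square landmark error: sum of squared coordinate differences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth))

def cross_entropy_error(probs, label):
    """Cross-entropy for one auxiliary task: minus the log-probability the
    model assigns to the ground-truth class."""
    return -math.log(probs[label])

# Ten coordinates = (x, y) for five landmarks, normalized to [0, 1] (toy values).
pred_landmarks = [0.31, 0.40, 0.70, 0.41, 0.52, 0.55, 0.35, 0.80, 0.66, 0.79]
true_landmarks = [0.30, 0.40, 0.71, 0.40, 0.50, 0.55, 0.36, 0.80, 0.65, 0.80]
lmk_err = landmark_error(pred_landmarks, true_landmarks)

gender_probs = [0.2, 0.8]                        # predicted P(female), P(male)
task_err = cross_entropy_error(gender_probs, 1)  # ground truth: male
```

Both error values are then handed to the back-propagator, which uses them jointly to adjust the shared network weights.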
[0029] The back-propagator 203 may back-propagate the generated landmark error and all the training task errors through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network.
[0030] According to an embodiment, the training unit 200 may further comprise a determiner 204. The determiner 204 may determine whether the training process of the facial landmark detection is converged. According to another embodiment, the determiner 204 may further determine whether the training process of each task is converged, which will be discussed later.
[0031] Hereinafter, the components in the training unit 200 as mentioned above will be discussed in detail. For purposes of illustration, we will describe an embodiment in which T tasks are trained jointly by the training unit 200. Among the T tasks, the facial landmark detection, i.e., the main task, is denoted as r, and each of the at least one related/auxiliary task is denoted as a, where a ∈ A = {a_1, ..., a_{T-1}}.
[0032] For each of the tasks t ∈ {r} ∪ A, the training data is denoted as {(x_i^t, y_i^t)}_{i=1}^N, where N represents the number of the training samples. In particular, for the facial landmark detection r, the training data is denoted as {(x_i^r, y_i^r)}_{i=1}^N, where y_i^r ∈ R^10 is the 2D coordinates of the five landmarks. For the task a, the training data is denoted as {(x_i^a, y_i^a)}_{i=1}^N. In the embodiment, four tasks a_1, a_2, a_3 and a_4 are shown and represent inferences of 'pose', 'gender', 'wear glasses' and 'smiling', respectively. Thus, y^{a_1} ∈ {0°, ±30°, ±60°} represents five different poses, and y^{a_2}, y^{a_3} and y^{a_4} ∈ {0, 1} are binary attributes and represent female/male, not wearing/wearing glasses and not smiling/smiling, respectively. Different weight matrices are assigned to the main task r and each auxiliary task a, and are denoted as W^r and W^a, respectively.
[0033] Then, an objective function over all the tasks is formulated as below to jointly optimize the main task r and the auxiliary tasks a:

argmin_{W^r, {W^a}} Σ_{i=1}^N l(y_i^r, f(x_i; W^r)) + Σ_{a ∈ A} λ^a Σ_{i=1}^N l(y_i^a, f(x_i; W^a))    (Eq. 1)

where f(x; W) = W^T x is a linear function of the shared facial feature vector x and a weight vector W; l(·) represents a loss function; λ^a represents the importance coefficient of the a-th task's error; and x_i represents the shared facial feature vector extracted from the i-th training face image.
[0034] According to an embodiment, the least square and cross-entropy functions are used as the loss function l(·) for the main task r and the auxiliary tasks a, respectively, to generate the corresponding landmark error and training task errors. Therefore, the above objective function can be rewritten as below:

argmin_{W^r, {W^a}} Σ_{i=1}^N (1/2)||y_i^r − (W^r)^T x_i||^2 − Σ_{a ∈ A} λ^a Σ_{i=1}^N y_i^a log p(y_i^a | x_i; W^a) + Σ_{t ∈ {r} ∪ A} ||W^t||_2^2    (Eq. 2)

[0035] In Eq.(2), f(x_i; W^r) = (W^r)^T x_i in the first term is a linear function. The second term is a posterior probability function

p(y_i^a = m | x_i; W^a) = exp((w_m^a)^T x_i) / Σ_j exp((w_j^a)^T x_i)

wherein w_m^a denotes the m-th column of the weight matrix W^a of the task a. The third term penalizes large weights.
[0036] According to an embodiment, the weights of all the tasks may be updated accordingly. In particular, the weight matrix of the facial landmark detection is updated by W^r = W^r − η ∂E^r/∂W^r, where η represents the learning rate (such as η = 0.003) and E^r = Σ_{i=1}^N (1/2)||y_i^r − (W^r)^T x_i||^2. Also, the weight matrix of each task a may be updated in a similar manner as W^a = W^a − η ∂E^a/∂W^a.
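The update rule above is plain gradient descent on the task weight matrices. A minimal sketch with made-up weight and gradient values (η = 0.003 as in the text):

```python
def sgd_update(W, grad, lr=0.003):
    """One gradient step W <- W - eta * dE/dW on a matrix stored as nested lists."""
    return [[w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(W, grad)]

W = [[0.5, -0.2],      # toy 2x2 weight matrix
     [0.1, 0.3]]
grad = [[1.0, 0.0],    # toy gradient dE/dW
        [0.0, -1.0]]
W_new = sgd_update(W, grad)  # entries with zero gradient are left unchanged
```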
[0037] Then, the generated landmark error and the training task errors may be back-propagated by the back-propagator 203 through the convolutional neural network, layer by layer down to the lowest layer, to adjust the weights on the connections between the neurons of the convolutional neural network.
[0038] According to an embodiment, the errors may be propagated back through the network following a back-propagation strategy as below:

ε^{l−1} = (W^l)^T ε^l ∘ σ'(x^{l−1}),  l = q, q−1, ..., 2    (Eq. 3)

[0039] In Eq.(3), ε^1, ..., ε^q represent the errors in the respective layers, with ε^q, the error of the highest layer, aggregating the landmark error and all the training task errors. For example, ε^1 represents the error of the lowest layer, and ε^2 represents the error of the second lowest layer. The errors of the lower layers are computed following Eq.(3), where σ'(·) is the gradient of the activation function of the network and ∘ denotes element-wise multiplication.
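The layer-wise propagation of Eq.(3) can be sketched as follows. A sigmoid activation is assumed purely for illustration (the text does not fix a particular σ), and all weights, errors and activations are made-up values:

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def backprop_layer(W, eps_upper, x_lower):
    """eps_{l-1} = (W^T eps_l) ∘ sigma'(x_{l-1}), assuming a sigmoid
    activation so that sigma'(u) = s(u) * (1 - s(u))."""
    n_in = len(W[0])
    # (W^l)^T eps^l: project the upper-layer error back through the weights.
    wt_eps = [sum(W[o][i] * eps_upper[o] for o in range(len(W)))
              for i in range(n_in)]
    # Element-wise product with the activation gradient at the lower layer.
    return [we * sigmoid(x) * (1.0 - sigmoid(x))
            for we, x in zip(wt_eps, x_lower)]

W = [[0.2, -0.1, 0.4],   # toy layer: 3 input units -> 2 output units
     [0.0, 0.3, -0.2]]
eps_upper = [0.5, -1.0]  # error arriving at the upper layer
x_lower = [0.1, -0.4, 0.8]
eps_lower = backprop_layer(W, eps_upper, x_lower)  # error for the lower layer
```

Applying this step repeatedly, from the shared feature layer down to the first convolutional layer, distributes the combined landmark and task errors to every weight in the network.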
[0040] The above training process is repeated until the training process of the facial landmark detection is determined by the determiner 204 to be converged. In other words, if the error is less than a predetermined value, the training process will be determined to be converged. With the above training process, the feature extractor 100 is capable of extracting the shared facial feature vector from a given face image.
According to an embodiment, for any face image x^0, the trained feature extractor 100 extracts a shared feature vector x. Then, the landmark locations are predicted by y^r = (W^r)^T x, and the prediction targets of the auxiliary tasks are obtained by y^a = argmax_m p(y^a = m | x; W^a).
[0041] During the above training process, at least one auxiliary task is trained simultaneously. However, different tasks have different loss functions and learning difficulties, and thus have different convergence rates. According to another embodiment, the determiner 204 may further determine whether the training process of the auxiliary tasks is converged.
[0042] In particular, E_val^a(t) and E_tr^a(t) represent the values of the loss function of the task a on a validation set and on the training set at iteration t, respectively. If one task's measure exceeds a threshold ε as below, the task will be stopped:

[k · med_{j=t−k}^t E_tr^a(j) / (Σ_{j=t−k}^t E_tr^a(j) − k · min_{j=t−k}^t E_tr^a(j))] × [(E_val^a(t) − min_{j=1..t} E_val^a(j)) / (λ^a · min_{j=1..t} E_val^a(j))] > ε    (Eq. 4)

[0043] In Eq.(4), t represents the current iteration, k represents a training length, and λ^a represents the importance coefficient of the a-th task's error. The 'med' denotes the function for calculating a median value. The first term in Eq.(4) represents the tendency of the training task error of the task a. If the training error drops rapidly within a period of length k, the value of the first term is small, which indicates that the training of the task can be continued as the task is still valuable. Otherwise, the first term is large and the task is more likely to be stopped. The second term represents the generalization error of the task a on the validation set relative to its best value so far. From this, an auxiliary task can be switched off during the training process, i.e., 'early stopped', before it begins to over-fit the training set and thus harm the main task.
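One plausible reading of the stopping measure can be sketched as follows. The exact window convention and constants are assumptions made for illustration; only the qualitative behaviour described above (a training-error "tendency" term times a validation "generalization" term) is intended:

```python
from statistics import median

def should_stop(E_tr, E_val, lam, k, threshold):
    """Illustrative task-stopping measure: a training-error tendency term
    over the last k iterations times a validation generalization term.
    The task is stopped when the product exceeds the threshold."""
    window = E_tr[-k:]
    # Small when the training error is still dropping fast, large when flat.
    tendency = k * median(window) / (sum(window) - k * min(window) + 1e-12)
    # Large when the current validation error has drifted above its best value.
    generalization = (E_val[-1] - min(E_val)) / (lam * min(E_val))
    return tendency * generalization > threshold

# A task whose training error has flattened while its validation error rises:
E_tr = [1.0, 0.6, 0.41, 0.40, 0.40, 0.40]
E_val = [1.1, 0.7, 0.55, 0.56, 0.60, 0.65]
stop = should_stop(E_tr, E_val, lam=1.0, k=4, threshold=3.0)  # task is switched off
```

A task whose training and validation errors are both still dropping yields a measure near zero and keeps training.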
[0044] With the above training process, the feature extractor 100 is capable of extracting a shared facial feature vector from any face image. For example, a face image x^0 is inputted into the input layer of the convolutional neural network as shown, for example, in Fig. 3. In each convolutional layer of the CNN, multiple sets of convolutional filters plus an activation function are applied to the input, and the layers are applied sequentially to project the face image to higher layers. That is, the face image is gradually projected to higher layers by learning a sequence of non-linear mappings as below to obtain the shared facial feature vector x:

x = σ((W^q)^T σ(⋯ σ((W^1)^T x^0)))    (Eq. 5)

[0045] Here, σ(·) and W^l represent the non-linear activation function applied in the layer l of the CNN and the filters to be learned in that layer, respectively, and q represents the number of layers. Referring to Fig. 3 again, the shared facial feature vector can be used for the landmark detection and the auxiliary/related tasks simultaneously in the estimation stage.
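The sequence of non-linear mappings of Eq.(5) amounts to repeatedly applying σ(W^T x). A toy sketch, with tanh as the assumed activation and made-up two-layer filters:

```python
import math

def layer(W, x):
    """One non-linear mapping sigma(W^T x), with tanh assumed as sigma."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(col, x))) for col in W]

x0 = [0.2, -0.5, 1.0]                     # toy input image, flattened to a vector
W1 = [[0.3, 0.1, -0.2], [0.0, 0.4, 0.5]]  # made-up filters, layer 1: 3 -> 2
W2 = [[1.0, -1.0], [0.5, 0.5]]            # made-up filters, layer 2: 2 -> 2

shared = layer(W2, layer(W1, x0))  # x = sigma(W2^T sigma(W1^T x0)), per Eq.(5)
```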
[0046] It shall be appreciated that the system 1000 may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
[0047] In the case that the system 1000 is implemented with software, the system 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 4, the system 1000 may include one or more processors (processors 102, 104, 106 etc.), a memory 112, a storage device 116, and a bus to facilitate information exchange among various components of system 1000. Processors 102-106 may include a central processing unit ("CPU"), a graphic processing unit ("GPU"), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods that will be explained in greater detail below.
[0048] Memory 112 can include, among other things, a random access memory ("RAM") and a read-only memory ("ROM"). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106. It is noted that although only one block is shown in Fig. 4, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
[0049] The system for facial landmark detection is described in the above. The following will describe a method for facial landmark detection with reference to Figs. 5 and 6.
[0050] Fig. 5 shows a schematic flowchart of the method for facial landmark detection and Fig. 6 shows a schematic flowchart of a training process of the multi-task convolutional neural network by the training unit 200.
[0051] In Figs. 5 and 6, methods 500 and 600 comprise a series of steps that may be performed by one or more of processors 102-106 or by each module/unit of the system 1000 to implement a data processing operation. For purposes of description, the following discussion is made with reference to the situation where each module/unit of the system 1000 is implemented in hardware or in a combination of hardware and software. Those skilled in the art shall appreciate that other suitable devices or systems would also be applicable to carry out the following processes, and the system 1000 is merely used as an illustration.
[0052] As shown in Fig. 5, multiple feature maps are extracted by the feature extractor 100 from at least one facial region of the face image in step S501. In another embodiment, the multiple feature maps may be extracted from the whole face image in step S501. Then, in step S502, a shared facial feature vector is generated from the multiple feature maps extracted in step S501. In step S503, facial landmark locations of the face image are predicted from the shared facial feature vector generated in step S502. According to another embodiment, the shared facial feature vector may also be used to predict the corresponding target of at least one auxiliary task associated with the facial landmark detection. Then, the target predictions of all the auxiliary tasks are obtained simultaneously.
[0053] According to an embodiment, the feature extractor comprises a convolutional neural network comprising a plurality of convolution-pooling layers and a fully connected layer. Each of the convolution-pooling layers is configured to perform convolution and max-pooling operations. In the embodiment, in step S501, the multiple feature maps may be extracted by the plurality of convolution-pooling layers consecutively, wherein the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps. In step S502, the shared facial feature vector may be generated by the fully connected layer from all the multiple feature maps extracted in step S501.
[0054] In the embodiment, the method 500 further comprises a training step (not shown in Fig. 5), which will be discussed with reference to Fig. 6.
[0055] As shown in Fig. 6, in step S601, a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task are sampled from the predetermined training set. For the training face image, its facial landmark prediction and the target predictions of all the auxiliary tasks may be obtained from the predictor 300 accordingly in step S602. Then, the dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations is compared to generate a landmark error in step S603. In step S604, the dissimilarities between the target predictions and the ground-truth target for each auxiliary task are compared, respectively, to generate at least one training task error. Then, the generated landmark error and all the training task errors are back-propagated through the convolutional neural network to adjust the weights on the connections between the neurons of the convolutional neural network in step S605. In step S606, it is determined whether one of the auxiliary tasks is converged. If no, the process 600 proceeds to step S608. If yes, the training process of that task is stopped in step S607 and the process proceeds to step S608. In step S608, it is determined whether the training process of the facial landmark detection is converged. If yes, the process 600 ends. If no, the process 600 turns back to step S601.
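The control flow of steps S601-S608 can be sketched as follows. The network and its errors are replaced by dummy decaying values, so only the convergence logic is illustrated (the thresholds and decay rates are made up):

```python
# Schematic of the flow S601-S608: auxiliary tasks are switched off as they
# converge, and the loop ends when the main (landmark) task converges.

def train_until_converged(main_eps=0.05, aux_eps=0.1, max_iters=200):
    main_err = 1.0
    aux_errs = {"pose": 1.0, "gender": 1.0}
    active = set(aux_errs)                  # auxiliary tasks still training
    for it in range(max_iters):
        # S601-S602: sample a batch and run the forward pass (simulated).
        # S603-S604: compute landmark error and task errors (simulated decay).
        main_err *= 0.9
        for a in active:
            aux_errs[a] *= 0.8
        # S605: back-propagate the errors (omitted in this sketch).
        # S606-S607: stop any auxiliary task whose error has converged.
        active = {a for a in active if aux_errs[a] >= aux_eps}
        # S608: end when the main task converges, otherwise loop to S601.
        if main_err < main_eps:
            return it + 1, active
    return max_iters, active

iters, still_active = train_until_converged()
```

With these toy decay rates the auxiliary tasks converge (and are switched off) well before the main task does, mirroring the "early stop" behaviour described above.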
[0056] From this, the facial landmark detection can be optimized together with heterogeneous but subtly related tasks.
[0057] Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon learning of the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.
[0058] Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and the equivalent technique, they may also fall within the scope of the present invention.

Claims

What is claimed is:
1. A method for detecting facial landmarks of a face image, comprising:
extracting multiple feature maps from at least one facial region of the face image;
generating a shared facial feature vector from the extracted multiple feature maps; and
predicting facial landmark locations of the face image from the generated shared facial feature vector.
2. A method of claim 1, wherein the facial landmarks comprise at least one selected from a group consisting of the centers of the eyes, the nose, and the corners of the mouth of a face image.
3. A method of claim 1, wherein in the step of predicting, the shared facial feature vector is used to predict corresponding target of at least one auxiliary task associated with the facial landmark detection, so as to obtain target predictions of all the auxiliary tasks simultaneously.
4. A method of claim 3, wherein the auxiliary tasks comprise at least one selected from a group consisting of head pose estimation, gender classification, age estimation, facial expression recognition and facial attribute inference.
5. A method of claim 4, wherein the steps of extracting and generating are performed by a convolutional neural network comprising a plurality of convolution-pooling layers, each of which is configured to perform convolution and max-pooling operations, and
wherein the step of extracting further comprises:
extracting the multiple feature maps by the plurality of convolution-pooling layers consecutively, wherein the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps.
6. A method of claim 5, wherein the convolutional neural network further comprises a fully connected layer, and in the step of generating, the shared facial feature vector is generated by the fully connected layer from all the extracted multiple feature maps.
7. A method of claim 6, wherein each layer of the convolutional neural network has a plurality of neurons, and wherein the method further comprises:
training, with a predetermined training set, the network so as to adjust each weight on connections between the neurons of the network such that the shared facial feature vector is generated by the network with the adjusted weight.
8. A method of claim 7, wherein the step of training further comprises:
sampling a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set;
comparing dissimilarities between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error;
comparing dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error; and
back-propagating the generated landmark error and the generated training task error through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network; and
repeating the steps of sampling, comparing and back-propagating until the generated landmark error is less than a first predetermined value and the generated training task error is less than a second predetermined value.
9. A method of claim 8, wherein the comparison to generate a landmark error is performed by rule of a least square process and the comparison to generate a training task error is performed by rule of a cross-entropy process.
10. A method of claim 8, wherein, for each auxiliary task, the step of training further comprising:
sampling a validating face image and its ground-truth target for each auxiliary tasks from a predetermined validation set;
comparing dissimilarity between the target prediction and the ground-truth target to generate a validating task error;
repeating the sampling and the comparing until the generated training task error is less than a third predetermined value and the generated validating task error is less than a fourth predetermined value.
11. A method of claim 1, wherein, in the step of predicting, the predicted facial landmark locations of the face image are determined by rule of y^r = (W^r)^T x, where W^r represents the weight matrix assigned to the facial landmark detection, x represents the shared facial feature vector, and T represents transpose.
12. A system for detecting facial landmarks of a face image, comprising:
a feature extractor configured to,
extract multiple feature maps from at least one facial region of the face image; and
generate a shared facial feature vector from the extracted multiple feature maps; and
a predictor configured to predict facial landmark locations of the face image from the shared facial feature vector generated by the feature extractor.
13. A system of claim 12, wherein the predictor is further configured to obtain target predictions of at least one auxiliary task associated with the facial landmark detection by using the shared facial feature vector simultaneously.
14. A system of claim 12, wherein the feature extractor further comprises a convolutional neural network, wherein the convolutional neural network comprises:
a plurality of convolution-pooling layers configured to perform convolution and max-pooling operations, and wherein the feature maps extracted by a previous layer of the convolution-pooling layers are inputted into a next layer of the convolution-pooling layers to extract feature maps different from the previously extracted feature maps; and
a fully connected layer configured to generate the shared facial feature vector from all the extracted multiple feature maps.
15. A system of claim 13, wherein each layer of the convolutional neural network has a plurality of neurons, and wherein the system further comprises:
a training unit configured to train, with a predetermined training set, the network so as to adjust the weights on connections between the neurons of the network such that the trained network is capable of extracting the shared facial feature vector.
16. A system of claim 15, wherein the training unit further comprises:
a sampler configured to sample a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set;
a comparator configured to compare the dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error, and to compare the dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error; and
a back-propagator configured to back-propagate the generated landmark error and the training task errors through the convolutional neural network to adjust weights on connections between the neurons of the convolutional neural network.
17. A system of claim 15, wherein the training unit further comprises:
a determiner configured to determine whether training process of the facial landmark detection is converged and whether training process of each task is converged.
18. A system of claim 12, wherein the facial landmarks comprise at least one selected from a group consisting of the centers of the eyes, the nose, and the corners of the mouth of a face image.
19. A system of claim 13, wherein the auxiliary tasks comprise at least one selected from a group consisting of head pose estimation, gender classification, age estimation, facial expression recognition and facial attribute inference.
20. A method for training a convolutional neural network for performing simultaneously facial landmark detection and at least one associated auxiliary task, comprising:
1) sampling a training face image, its ground-truth landmark locations and its ground-truth target for each auxiliary task from the predetermined training set;
2) comparing dissimilarity between the predicted facial landmark locations and the ground-truth landmark locations to generate a landmark error;
3) comparing dissimilarities between the target predictions and the ground-truth target for each auxiliary task, respectively, to generate at least one training task error;
4) back-propagating the generated landmark error and all the training task errors through the convolutional neural network to adjust weights on connections between neurons of the convolutional neural network;
5) sampling a validating face image and its ground-truth target for each auxiliary task from a predetermined validation set;
6) comparing dissimilarities between the target prediction and the ground-truth target to generate a validating task error;
7) determining if the generated training task error is less than a first predetermined value and the generated validating task error is less than a second predetermined value; and
if yes, the method for training the convolutional neural network will be terminated; otherwise, the steps 1)-7) will be repeated.
PCT/CN2014/000769 2014-08-21 2014-08-21 A method and a system for facial landmark detection based on multi-task WO2016026063A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/000769 WO2016026063A1 (en) 2014-08-21 2014-08-21 A method and a system for facial landmark detection based on multi-task
CN201480081241.1A CN106575367B (en) 2014-08-21 2014-08-21 Method and system for the face critical point detection based on multitask

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/000769 WO2016026063A1 (en) 2014-08-21 2014-08-21 A method and a system for facial landmark detection based on multi-task

Publications (1)

Publication Number Publication Date
WO2016026063A1 true WO2016026063A1 (en) 2016-02-25

Family

ID=55350056

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/000769 WO2016026063A1 (en) 2014-08-21 2014-08-21 A method and a system for facial landmark detection based on multi-task

Country Status (2)

Country Link
CN (1) CN106575367B (en)
WO (1) WO2016026063A1 (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105957095A (en) * 2016-06-15 2016-09-21 电子科技大学 Gray-scale image based Spiking angular point detection method
CN106951840A (en) * 2017-03-09 2017-07-14 北京工业大学 A kind of facial feature points detection method
JP2017211799A (en) * 2016-05-25 2017-11-30 キヤノン株式会社 Information processing device and information processing method
JP2018055377A (en) * 2016-09-28 2018-04-05 日本電信電話株式会社 Multitask processing device, multitask model learning device, and program
WO2018090905A1 (en) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Automatic identity detection
CN108073910A (en) * 2017-12-29 2018-05-25 百度在线网络技术(北京)有限公司 For generating the method and apparatus of face characteristic
CN108292363A (en) * 2016-07-22 2018-07-17 日电实验室美国公司 In vivo detection for anti-fraud face recognition
CN109145798A (en) * 2018-08-13 2019-01-04 浙江零跑科技有限公司 A kind of Driving Scene target identification and travelable region segmentation integrated approach
CN109635750A (en) * 2018-12-14 2019-04-16 广西师范大学 A kind of compound convolutional neural networks images of gestures recognition methods under complex background
CN110163098A (en) * 2019-04-17 2019-08-23 西北大学 Based on the facial expression recognition model construction of depth of seam division network and recognition methods
EP3529747A4 (en) * 2016-10-19 2019-10-09 Snap Inc. Neural networks for facial modeling
US10467459B2 (en) 2016-09-09 2019-11-05 Microsoft Technology Licensing, Llc Object detection based on joint feature extraction
WO2019221739A1 (en) * 2018-05-17 2019-11-21 Hewlett-Packard Development Company, L.P. Image location identification
CN111191675A (en) * 2019-12-03 2020-05-22 深圳市华尊科技股份有限公司 Pedestrian attribute recognition model implementation method and related device
WO2020199931A1 (en) * 2019-04-02 2020-10-08 腾讯科技(深圳)有限公司 Face key point detection method and apparatus, and storage medium and electronic device
WO2021036726A1 (en) * 2019-08-29 2021-03-04 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for using face alignment model based on multi-task convolutional neural network-obtained data
CN112820382A (en) * 2021-02-04 2021-05-18 上海小芃科技有限公司 Breast cancer postoperative intelligent rehabilitation training method, device, equipment and storage medium
CN107871106B (en) * 2016-09-26 2021-07-06 北京眼神科技有限公司 Face detection method and device
WO2022003982A1 (en) * 2020-07-03 2022-01-06 日本電気株式会社 Detection device, learning device, detection method, and storage medium
US11776323B2 (en) 2022-02-15 2023-10-03 Ford Global Technologies, Llc Biometric task network
US11954881B2 (en) 2018-08-28 2024-04-09 Apple Inc. Semi-supervised learning using clustering as an additional constraint

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107145857B (en) * 2017-04-29 2021-05-04 深圳市深网视界科技有限公司 Face attribute recognition method and device and model establishment method
CN107038429A (en) * 2017-05-03 2017-08-11 四川云图睿视科技有限公司 A kind of multitask cascade face alignment method based on deep learning
CN106951888B (en) * 2017-05-09 2020-12-01 安徽大学 Relative coordinate constraint method and positioning method of human face characteristic point
CN107358149B (en) * 2017-05-27 2020-09-22 深圳市深网视界科技有限公司 Human body posture detection method and device
CN107578055B (en) * 2017-06-20 2020-04-14 北京陌上花科技有限公司 Image prediction method and device
CN108229288B (en) * 2017-06-23 2020-08-11 北京市商汤科技开发有限公司 Neural network training and clothes color detection method and device, storage medium and electronic equipment
CN107563279B (en) * 2017-07-22 2020-12-22 复旦大学 Model training method for adaptive weight adjustment aiming at human body attribute classification
US11341631B2 (en) 2017-08-09 2022-05-24 Shenzhen Keya Medical Technology Corporation System and method for automatically detecting a physiological condition from a medical image of a patient
CN107423727B (en) * 2017-08-14 2018-07-10 河南工程学院 Complex facial expression recognition method based on neural networks
CN107704848A (en) * 2017-10-27 2018-02-16 深圳市唯特视科技有限公司 A dense face alignment method based on multi-constraint convolutional neural networks
CN108196535B (en) * 2017-12-12 2021-09-07 清华大学苏州汽车研究院(吴江) Automatic driving system based on reinforcement learning and multi-sensor fusion
CN107992864A (en) * 2018-01-15 2018-05-04 武汉神目信息技术有限公司 A liveness detection method and device based on image texture
CN110060296A (en) * 2018-01-18 2019-07-26 北京三星通信技术研究有限公司 Pose estimation method, electronic device, and method and apparatus for displaying virtual objects
CN108399373B (en) * 2018-02-06 2019-05-10 北京达佳互联信息技术有限公司 Model training and detection method and device for facial key points
US10990820B2 (en) * 2018-03-06 2021-04-27 Dus Operating Inc. Heterogeneous convolutional neural network for multi-problem solving
CN108416314B (en) * 2018-03-16 2022-03-08 中山大学 Method for detecting important faces in pictures
CN108615016B (en) * 2018-04-28 2020-06-19 北京华捷艾米科技有限公司 Face key point detection method and face key point detection device
CN109147940B (en) * 2018-07-05 2021-05-25 科亚医疗科技股份有限公司 Apparatus and system for automatically predicting physiological condition from medical image of patient
CN109522910B (en) * 2018-12-25 2020-12-11 浙江商汤科技开发有限公司 Key point detection method and device, electronic equipment and storage medium
CN109829431B (en) * 2019-01-31 2021-02-12 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN111563397B (en) * 2019-02-13 2023-04-18 阿里巴巴集团控股有限公司 Detection method, detection device, intelligent equipment and computer storage medium
CN109902641B (en) * 2019-03-06 2021-03-02 中国科学院自动化研究所 Semantic alignment-based face key point detection method, system and device
CN110136828A (en) * 2019-05-16 2019-08-16 杭州健培科技有限公司 A method for realizing multi-task auxiliary diagnosis of medical images based on deep learning
CN110705419A (en) * 2019-09-24 2020-01-17 新华三大数据技术有限公司 Emotion recognition method, early warning method, model training method and related device
CN111339813B (en) * 2019-09-30 2022-09-27 深圳市商汤科技有限公司 Face attribute recognition method and device, electronic equipment and storage medium
CN111860101A (en) * 2020-04-24 2020-10-30 北京嘀嘀无限科技发展有限公司 Training method and device for face key point detection model
KR102538804B1 (en) * 2020-11-16 2023-06-01 상명대학교 산학협력단 Device and method for landmark detection using artificial intelligence
CN112488003A (en) * 2020-12-03 2021-03-12 深圳市捷顺科技实业股份有限公司 Face detection method, model creation method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1352436A (en) * 2000-11-15 2002-06-05 星创科技股份有限公司 Real-time face identification system
CN101673340A (en) * 2009-08-13 2010-03-17 重庆大学 Method for identifying the human ear by combining multi-direction and multi-dimension analysis with a BP neural network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4778158B2 (en) * 2001-05-31 2011-09-21 オリンパス株式会社 Image selection support device
CN102831382A (en) * 2011-06-15 2012-12-19 北京三星通信技术研究有限公司 Face tracking apparatus and method
CN103824054B (en) * 2014-02-17 2018-08-07 北京旷视科技有限公司 A facial attribute recognition method based on cascaded deep neural networks

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10909455B2 (en) 2016-05-25 2021-02-02 Canon Kabushiki Kaisha Information processing apparatus using multi-layer neural network and method therefor
JP2017211799A (en) * 2016-05-25 2017-11-30 キヤノン株式会社 Information processing device and information processing method
CN105957095A (en) * 2016-06-15 2016-09-21 电子科技大学 Spiking corner detection method based on gray-scale images
CN108292363B (en) * 2016-07-22 2022-05-24 日电实验室美国公司 Living body detection for spoof-proof facial recognition
CN108292363A (en) * 2016-07-22 2018-07-17 日电实验室美国公司 Liveness detection for anti-spoofing face recognition
JP2019508801A (en) * 2016-07-22 2019-03-28 NEC Laboratories America, Inc. Biometric detection for anti-spoofing face recognition
US10467459B2 (en) 2016-09-09 2019-11-05 Microsoft Technology Licensing, Llc Object detection based on joint feature extraction
CN107871106B (en) * 2016-09-26 2021-07-06 北京眼神科技有限公司 Face detection method and device
JP2018055377A (en) * 2016-09-28 2018-04-05 日本電信電話株式会社 Multitask processing device, multitask model learning device, and program
EP4266249A3 (en) * 2016-10-19 2024-01-17 Snap Inc. Neural networks for facial modeling
US11100311B2 (en) 2016-10-19 2021-08-24 Snap Inc. Neural networks for facial modeling
EP3529747A4 (en) * 2016-10-19 2019-10-09 Snap Inc. Neural networks for facial modeling
US10460153B2 (en) 2016-11-15 2019-10-29 Futurewei Technologies, Inc. Automatic identity detection
WO2018090905A1 (en) * 2016-11-15 2018-05-24 Huawei Technologies Co., Ltd. Automatic identity detection
CN106951840A (en) * 2017-03-09 2017-07-14 北京工业大学 A facial feature point detection method
CN108073910A (en) * 2017-12-29 2018-05-25 百度在线网络技术(北京)有限公司 Method and apparatus for generating facial features
WO2019221739A1 (en) * 2018-05-17 2019-11-21 Hewlett-Packard Development Company, L.P. Image location identification
CN109145798A (en) * 2018-08-13 2019-01-04 浙江零跑科技有限公司 An integrated method for driving scene target recognition and travelable region segmentation
CN109145798B (en) * 2018-08-13 2021-10-22 浙江零跑科技股份有限公司 Driving scene target identification and travelable region segmentation integration method
US11954881B2 (en) 2018-08-28 2024-04-09 Apple Inc. Semi-supervised learning using clustering as an additional constraint
CN109635750A (en) * 2018-12-14 2019-04-16 广西师范大学 A gesture image recognition method based on composite convolutional neural networks under complex backgrounds
US11734851B2 (en) 2019-04-02 2023-08-22 Tencent Technology (Shenzhen) Company Limited Face key point detection method and apparatus, storage medium, and electronic device
WO2020199931A1 (en) * 2019-04-02 2020-10-08 腾讯科技(深圳)有限公司 Face key point detection method and apparatus, and storage medium and electronic device
CN110163098A (en) * 2019-04-17 2019-08-23 西北大学 Facial expression recognition model construction and recognition method based on a deep seam-division network
WO2021036726A1 (en) * 2019-08-29 2021-03-04 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for using face alignment model based on multi-task convolutional neural network-obtained data
US12033364B2 (en) 2019-08-29 2024-07-09 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method, system, and computer-readable medium for using face alignment model based on multi-task convolutional neural network-obtained data
CN111191675B (en) * 2019-12-03 2023-10-24 深圳市华尊科技股份有限公司 Pedestrian attribute recognition model implementation method and related device
CN111191675A (en) * 2019-12-03 2020-05-22 深圳市华尊科技股份有限公司 Pedestrian attribute recognition model implementation method and related device
WO2022003982A1 (en) * 2020-07-03 2022-01-06 日本電気株式会社 Detection device, learning device, detection method, and storage medium
JP7513094B2 (en) 2020-07-03 2024-07-09 日本電気株式会社 DETECTION APPARATUS, LEARNING APPARATUS, DETECTION METHOD, AND PROGRAM
CN112820382A (en) * 2021-02-04 2021-05-18 上海小芃科技有限公司 Breast cancer postoperative intelligent rehabilitation training method, device, equipment and storage medium
US11776323B2 (en) 2022-02-15 2023-10-03 Ford Global Technologies, Llc Biometric task network

Also Published As

Publication number Publication date
CN106575367A (en) 2017-04-19
CN106575367B (en) 2018-11-06

Similar Documents

Publication Publication Date Title
WO2016026063A1 (en) A method and a system for facial landmark detection based on multi-task
US20220392234A1 (en) Training neural networks for vehicle re-identification
US9811718B2 (en) Method and a system for face verification
CN106415594B (en) Method and system for face verification
US11288835B2 (en) Lighttrack: system and method for online top-down human pose tracking
EP3074918B1 (en) Method and system for face image recognition
US11836931B2 (en) Target detection method, apparatus and device for continuous images, and storage medium
CN109271958B (en) Face age identification method and device
JP2023134499A (en) Robust training in presence of label noise
Glauner Deep convolutional neural networks for smile recognition
Gong et al. Model-based oversampling for imbalanced sequence classification
US11488309B2 (en) Robust machine learning for imperfect labeled image segmentation
US20120243779A1 (en) Recognition device, recognition method, and computer program product
US10592786B2 (en) Generating labeled data for deep object tracking
CN111914878B (en) Feature point tracking training method and device, electronic equipment and storage medium
Dong et al. Adaptive cascade deep convolutional neural networks for face alignment
CN113196303A (en) Inappropriate neural network input detection and processing
US11625589B2 (en) Residual semi-recurrent neural networks
CN111223128A (en) Target tracking method, device, equipment and storage medium
Zhai et al. Face verification across aging based on deep convolutional networks and local binary patterns
CN114998592A (en) Method, apparatus, device and storage medium for instance partitioning
CN112836753A (en) Methods, apparatus, devices, media and products for domain adaptive learning
Boursinos et al. Improving prediction confidence in learning-enabled autonomous systems
WO2017079972A1 (en) A method and a system for classifying objects in images
Jo et al. RANSAC versus CS-RANSAC

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14900141

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14900141

Country of ref document: EP

Kind code of ref document: A1