WO2016090522A1 - Method and apparatus for predicting face attributes - Google Patents

Method and apparatus for predicting face attributes

Info

Publication number
WO2016090522A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
location
inputted
face image
filters
Prior art date
Application number
PCT/CN2014/001120
Other languages
French (fr)
Inventor
Xiaoou Tang
Ziwei Liu
Ping Luo
Xiaogang Wang
Original Assignee
Xiaoou Tang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoou Tang
Priority to PCT/CN2014/001120 (WO2016090522A1)
Priority to CN201480083724.5A (CN107004116B)
Publication of WO2016090522A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/164Detection; Localisation; Normalisation using holistic features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/179Human faces, e.g. facial parts, sketches or expressions metadata assisted face recognition

Definitions

  • Fig. 7 illustrates a flow chart for training an attribute prediction model for the attribute prediction pre-training device 202 according to one embodiment of the present application.
  • the attribute prediction pre-training device 202 operates to randomly initialize neuron weights between the connections of each two convolution layers in the LNet0, LNets and ANet of the predictor 10.
  • the attribute prediction pre-training device 202 calculates identification errors by classifying each image into one of a plurality of (N) human face identities. Specifically, if the human face identity is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error.
  • the attribute prediction pre-training device 202 operates to back-propagate the identification errors through the layers in the LNet0, LNets and ANet of the predictor 10, so as to update the neuron weights between the connections of each two convolution layers.
  • the attribute prediction pre-training device 202 determines if the currently obtained identification error is less than a predetermined threshold, if yes, at step S705, the process terminates and the pre-trained attribute prediction model formed by the updated weights for the predictor 10 is obtained, otherwise, the process returns to step s702.
  • Fig. 8 illustrates the flow chart for the fine-tuning of the pre-trained attribute prediction model according to one embodiment of the present application.
  • the fine-tuning device 203 initializes the weights of the predictor 10 using the weights in the pre-trained attribute prediction model.
  • the fine-tuning device 203 calculates classification errors by predicting attribute tags for each image. Specifically, if the attribute tag is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error.
  • the fine-tuning device 203 back-propagates the classification errors through the predictor 10 so as to update the weights, and then, at step 803, the fine-tuning device 203 determines if the currently obtained classification error is less than a predetermined threshold; if yes, at step S805, the process terminates and the attribute prediction model formed by the updated weights for the predictor 10 is obtained; otherwise, the process returns to step s802.
  • the fine-tuned face localization model and fine-tuned attribute prediction model are concatenated together to form the final model.
  • the face localization model takes a raw web image as input and outputs the position of the face region in it.
  • the attribute prediction model takes the located face region as input and outputs the predicted attribute tags, so the final integrated model is the concatenation of the face localization model and the attribute prediction model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed are an apparatus and a method for predicting facial attribute tags. The apparatus may comprise a first location prediction device for predicting a head-shoulder location in an inputted face image; a second location prediction device for predicting a face location in the inputted face image from the predicted head-shoulder location; and an attribute prediction device for extracting one or more face representations from the face location and classifying desired attributes for the inputted face image from the extracted face representations.

Description

Method and Apparatus for Predicting Face Attributes
Technical Field
The present application relates to a method and an apparatus for predicting face attributes in the wild.
Background
Face attributes, such as expression, race, and hair style, are beneficial for many applications such as image tagging and face verification. Predicting face attributes from images on the web is challenging because of large background clutters and face variations, such as scale, pose, and illumination. Existing methods for attribute recognition first detect faces and their landmarks, and then extract high-dimensional features, such as HOG (Histogram of Oriented Gradient) or LBP (Local Binary Patterns), from image patches centered on the landmarks. These features are concatenated to train classifiers.
Although this pipeline is suitable for controlled environments, it has drawbacks when dealing with web images. It heavily depends on the precision of face and landmark detection, which is not reliable in web images; most state-of-the-art face detection and alignment methods fail because features are extracted at wrong landmark positions. Face detection itself is also ambiguous in such images.
Extracting hand-crafted features, such as HOG (Histogram of Oriented Gradient), LBP (Local Binary Patterns) and GIST (Spatial Envelope), at pre-defined landmarks is a standard step in attribute recognition. It has been proposed to combine HOG and color histograms to train logistic regression for object search and tagging based on attributes. It has also been proposed to extract HOG-like features on various face regions to tackle attribute classification and face verification. To improve the discriminativeness of hand-crafted features for a specific task, a three-level SVM system has been built to extract higher-level information, and various hand-crafted features have been combined to obtain an intermediate representation for a particular domain.
Recently, deep learning methods have achieved great success in attribute inference, due to their ability to learn compact and discriminative features. It has been demonstrated that off-the-shelf features learned by the Convolutional Neural Network (CNN) of ImageNet can be effectively adapted to attribute classification. It has also been shown that better performance can be achieved by ensembling the learned features of multiple pose-normalized CNNs. Specific network structures have also been designed for attribute prediction; for example, a deep sum-product architecture has been introduced to account for occlusion during attribute inference. The main drawback of the above methods is that they heavily rely on accurate landmark detection and pose estimation in both the training and testing steps. Even though a recent work can perform part localization automatically during testing, it still requires landmark annotations of the training data.
Summary
The present application proposes a deep learning framework for predicting face attributes in the wild. Specifically, the goal is to automatically label raw web images with facial attribute tags (e.g. "Male", "Young", "Smiling", "Wearing Hat", "Big Eyes", "Oval Face" and "Mustache").
The proposed deep learning framework does not rely on face and landmark detection. Instead, it cascades two convolutional neural networks (CNNs): one (LNet) locates the face region, and the other (ANet) extracts a high-level face representation from the entire located face region (without landmarks) for attribute prediction.
The LNet and ANet are trained in a weakly supervised manner, i.e. only the attribute tags of the training images are provided. This is fundamentally different from training face and landmark detectors, where face bounding boxes and landmark positions are needed, and it makes the preparation of training data much easier. The LNet and ANet are first pre-trained differently and then jointly trained with attribute labels.
Secondly, different pre-training and fine-tuning strategies are designed for the LNet and ANet. Different from training face detectors with positive (face) and negative (non-face) samples, the LNet is pre-trained by classifying massive general object categories; thus, its pre-trained features have good generalization capability for handling various background clutters. The LNet is then fine-tuned by predicting attributes. Features learned by attribute prediction can capture rich face variations and are effective for face localization; they can also better distinguish subtle differences between human faces and analogous patterns, such as a cat face. The ANet is pre-trained by classifying massive face identities, to obtain a discriminative face representation, and is then fine-tuned by the attribute prediction task.
Thirdly, to make face localization and attribute prediction real-time, a fast feed-forward scheme is proposed that evaluates web images of arbitrary size. If the filters are globally shared, this can be done by convolving the images with the filters. It becomes non-trivial if the filters are locally shared, yet, as known in the art, locally shared filters perform better in face-related tasks. This is solved by the proposed interweaved operation.
Besides proposing new methods, the present application also reveals valuable facts on learning face representation. They not only motivate this invention but also benefit future research on face and deep learning.
(1) It shows how supervised pre-training with massive object categories and massive identities can improve feature learning of the LNet and ANet for face localization and attribute recognition, respectively.
(2) It demonstrates that although the filters of LNet are fine-tuned by attribute prediction, their response maps over the entire image give a strong indication of the face's location. Good features for face localization should be able to capture rich face variations, and more supervised information on these variations improves the learning process. To understand this, consider the examples in Fig. 1. If only a single detector is used to classify all the positive and negative samples in Fig. 1 (a), it is difficult to handle complex face variations. Therefore, multi-view face detectors were developed, as in Fig. 1 (b), i.e. face images in different views are handled by different detectors. View labels were used in training the detectors, and the whole training set was divided into subsets according to views. If views are treated as one type of face attribute, learning face representation by predicting attributes with deep models actually extends this idea to the extreme. As shown in Fig. 1 (c), a filter (or a group of filters) functions as a detector of an attribute. When a subset of neurons is activated, they indicate the existence of face images which have a particular attribute configuration. The neurons at different layers can form many activation patterns, implying that the whole set of face images can be divided into many subsets based on attribute configurations, and each activation pattern corresponds to one subset (e.g. 'pointy nose', 'rosy cheek', and 'smiling'). Therefore, it is not surprising that filters learned by attribute prediction lead to effective representations for face localization. By simply averaging and thresholding the response maps, good face localization is achieved; a minimal sketch of this step is given after point (3) below.
(3) This application also discloses that the high-level hidden neurons of the ANet after pre-training implicitly learn and discover semantic concepts that are related to identity, such as race, gender, and age. These concepts are significantly expanded after fine-tuning for attribute classification. This fact indicates that when a deep model is pre-trained for face recognition, it is also implicitly learning attributes. The performance of attribute prediction drops without the pre-training stage. With this strategy, each face attribute is well explained by a sparse linear combination of these semantic concepts. By analyzing the coefficients of such combinations, attributes show clear grouping patterns, which could be well interpreted semantically.
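The averaging-and-thresholding step described in point (2) can be pictured with a short NumPy sketch. The function name, the fixed threshold and the min-max normalization below are illustrative assumptions, not details taken from the filing.

```python
import numpy as np

def localize_by_response(response_maps, threshold=0.5):
    """Average per-filter response maps, threshold them, and return the
    bounding box (top, left, bottom, right) of the high-response region.
    response_maps: array of shape (num_filters, H, W)."""
    avg = response_maps.mean(axis=0)
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)  # normalize to [0, 1]
    mask = avg > threshold
    if not mask.any():
        return None  # no position exceeds the threshold
    ys, xs = np.nonzero(mask)
    return ys.min(), xs.min(), ys.max(), xs.max()
```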
In one aspect of the present application, disclosed is an apparatus for predicting facial attribute tags, comprising:
a first location prediction device for predicting a head-shoulder location in an inputted face image;
a second location prediction device for predicting a face location in the inputted face image from the predicted head-shoulder location;
an attribute prediction device for extracting one or more face representations from the face location and classifying desired attributes for the inputted face image from the extracted face representations.
In another aspect of the present application, disclosed is a method for predicting facial attribute tags, comprising:
predicting a head-shoulder location in an inputted face image;
predicting a face location in the inputted face image from the predicted head-shoulder location;
extracting one or more face representations from the face location; and
classifying desired attributes for the inputted face image from the extracted face representations.
In one embodiment of the present application, the step of predicting the head-shoulder location further comprises: calculating a geodesic distance in a response for each position in the inputted face image, the response being obtained from a first neural network applied to the inputted face image; and determining that the position belongs to the head-shoulder location if the calculated distance is larger than a predetermined threshold.
In one embodiment of the present application, the method may further comprise a step of training, which may include:
retrieving a general object dataset with category annotations and face dataset with identity and attribute annotations;
determining a pre-trained face localization model based on the general object dataset and its category annotations; and
determining a pre-trained attribute prediction model based on the face dataset and its identity annotations,
wherein the pre-trained face localization model and the pre-trained attribute prediction model are combined as a final model for predicting facial attribute tags.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating face localization with a single detector (a), with multi-view detectors (b), and by attributes (c).
Fig. 2 is a schematic diagram illustrating an apparatus for predicting face attributes consistent with some disclosed embodiments.
Fig. 3 is a schematic diagram illustrating the proposed pipeline of attribute inference for the predictor shown in Fig. 2, consistent with some disclosed embodiments.
Fig. 4 is a schematic diagram illustrating an interweaved operation of the predictor according to one embodiment of the present application.
Fig. 5 is a schematic flowchart illustrating the training stage of the face localization model consistent with some disclosed embodiments.
Fig. 6 is a schematic flowchart illustrating the fine-tuning of the face localization model consistent with some disclosed embodiments.
Fig. 7 is a schematic flowchart illustrating the training stage of the attribute prediction model consistent with some disclosed embodiments.
Fig. 8 is a schematic flowchart illustrating the fine-tuning of the attribute prediction model consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit, ” , “device” , “module” or “system. ” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
It is further understood that the use of relational terms such as first and second, and the like, if any, are used solely to distinguish one from another entity, item, or action without necessarily requiring or implying any actual such relationship or order between such entities, items or actions.
Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in software or integrated circuits (ICs), such as a digital signal processor and software therefor, or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such software instructions or ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
Fig. 2 is a schematic diagram illustrating an exemplary apparatus 100 for predicting face attributes in the wild according to one embodiment of the present application. As shown, the apparatus 100 may comprise a predictor 10 and a trainer 20.
The predictor 10 may comprise an attribute inference system that consists of a plurality of stages (for example, four stages as shown in Fig. 3) formed by cascading a first location prediction device 101, a second location prediction device 102 and an attribute prediction device 103.
The first location prediction device 101 is used for predicting the location of the head-shoulder in an inputted face image. In one embodiment of the present application, the first location prediction device 101 is configured to obtain a response h0 from the inputted face image and to calculate a geodesic distance in the response for each position in the image. If the calculated distance is larger than a predetermined threshold, the first location prediction device 101 will determine that this position in the image belongs to the head-shoulder location, as will be discussed later.
As shown in Fig. 3, the first location prediction device 101 may comprise a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers (C1 to C5), wherein the convolutional layers are configured with globally shared filters, the filters being recurrently applied at every location of the image so as to account for translation and scaling of the face images.
The second location prediction device 102 is used for predicting the location of the face in the inputted face image from the predicted head-shoulder location. As shown in Fig. 3, the second location prediction device 102 comprises a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers (C1 to C5), wherein the convolutional layers of the device 102 are configured with globally shared filters, the filters likewise being recurrently applied at every location of the image so as to account for translation and scaling of the face images.
The attribute prediction device 103 is used for extracting one or more face representations y from the face location and classifying the desired attributes. The attribute prediction device 103 may comprise one or more (for example, four as shown) convolutional layers (C1 to C4), wherein each of the convolutional layers is configured with one or more filters; the filters at a first convolutional layer C1 and a second convolutional layer C2 are globally shared, while the filters at a third convolutional layer C3 and a fourth convolutional layer C4 are locally shared. The attribute prediction device 103 further comprises one or more (for example, three) max-pooling layers, each of which is connected to a corresponding convolutional layer and configured to render the whole system robust to local translations. The device 103 further comprises one fully-connected layer (FC) cascaded to the last convolutional layer (for example, C4 as shown) to classify the extracted attributes and learn a compact and discriminative face representation.
Referring to Fig. 3, given a face image x0 of arbitrary size, the first location prediction device 101 calculates a response map h0, which indicates the location of the head-shoulder, as shown in Fig. 3 (a). x0 is then combined with h0 to crop the head-shoulder region, denoted as xs. The second location prediction device 102 utilizes xs as input and outputs a response map hs, which designates the region of the face. Similarly, hs is combined with xs to locate the face region xf. The location prediction devices 101 and 102 are thus cascaded to propose the face location in a coarse-to-fine manner.
The attribute prediction device 103 is applied to the face region xf to extract response maps for classifying attributes, and may comprise a fully-connected layer to classify the attributes y. High responses in these maps are associated with different facial components, implying that the device (ANet) 103 is able to capture subtle face differences, such as the shapes of lips and eyebrows. In the last stage, as illustrated in Fig. 3 (d), several candidate windows are selected to pool the feature vectors through the fully-connected layer. These features are then concatenated as ha to train linear classifiers for attribute recognition.
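The coarse-to-fine data flow of Fig. 3 can be summarized in a small Python sketch. The callables lnet0, lnets, anet, crop_by_response and the per-attribute classifiers are hypothetical stand-ins for the components described above, not implementations taken from the filing.

```python
def predict_attributes(x0, lnet0, lnets, anet, classifiers, crop_by_response):
    """Coarse-to-fine attribute inference following Fig. 3.

    lnet0, lnets and anet are assumed to be callables returning response
    maps / features; crop_by_response(image, response) is assumed to cut
    out the highest-response region of `image`."""
    h0 = lnet0(x0)                    # (a) head-shoulder response map
    xs = crop_by_response(x0, h0)     # crop the head-shoulder region
    hs = lnets(xs)                    # (b) face response map
    xf = crop_by_response(xs, hs)     # crop the face region
    ha = anet(xf)                     # (c)/(d) pooled multi-view FC features
    # one binary classifier per attribute tag ("Male", "Smiling", ...)
    return {name: clf(ha) for name, clf in classifiers.items()}
```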
As shown in Fig. 3, the first location prediction device 101, the second location prediction device 102 and the attribute prediction device 103 may be implemented as different neural networks, represented as LNet0 101, LNets 102 and ANet 103 in Fig. 3. Each of LNet0 101, LNets 102 and ANet 103 may be implemented by software, integrated circuits (ICs) or a combination thereof. In one embodiment of the present application, as shown in Figs. 3 (a) and (b), the network structures of LNet0 101 and LNets 102 may be the same, for example stacking two max-pooling layers and five convolutional layers (C1 to C5) with globally shared filters. These filters are recurrently applied at every location of the image and are able to account for large face translation and scaling. The ANet 103 stacks, for example, four convolutional layers (C1 to C4), three max-pooling layers and one fully-connected layer (FC), where the filters at C1 and C2 are globally shared, while the filters at C3 and C4 are locally shared. As shown in Fig. 3 (c), the response maps at C2 and C3 are divided into grids with non-overlapping cells, each of which learns different filters. The fully-connected layer of the ANet 103 transforms the response maps produced by convolution into a compact and discriminative feature representation; for example, it may produce a feature representation suitable for classification (predicting attribute tags, e.g. "Male", "Young", "Smiling", "Wearing Hat", "Big Eyes", "Oval Face" and "Mustache").
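For concreteness, the globally shared trunk of LNet0/LNets could look like the PyTorch sketch below. Only the 96 filters of size 11x11x3 at C1 of LNet0 are specified in the text; every other channel count, kernel size and stride here is an assumption chosen for illustration.

```python
import torch.nn as nn

# Sketch of the LNet0 / LNets trunk: five convolutional layers (C1-C5) with
# globally shared filters and two max-pooling layers.  Being fully
# convolutional, it accepts input images of arbitrary size.
lnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # C1
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # C2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(), # C3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(), # C4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), # C5
)
```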
Locally shared filters have been proved effective for face-related problems, because they can capture different information from different face parts. The network structures are specified in Fig. 3. For instance, the filters at C1 of LNet0 101 have a plurality of (for example, 96) channels and the filter size in each channel may be 11x11x3, as the input image x0 contains three color channels.
Both LNet0 101 and LNets 102 have five convolutional layers, each of which utilizes the output of the previous layer as input and may be formulated as
$h_v^{(l)} = \operatorname{relu}\big(b_v^{(l)} + \sum_u k_{vu}^{(l)} * h_u^{(l-1)}\big)$    (1)
where relu(x) = max(0, x) is the rectified linear function and * denotes the convolution operator; $h_u^{(l-1)}$ and $h_v^{(l)}$ stand for the u-th input channel at the (l-1)-th layer and the v-th output channel at the l-th layer, respectively. $k_{vu}^{(l)}$ and $b_v^{(l)}$ denote the filters and the bias, wherein the filters capture translation-invariant structures while the bias represents the overall energy level.
The max-pooling operations at C1 and C2 partition the feature maps into a grid of overlapping cells, which are formulated as
$h_{v,(i,j)}^{(l)} = \max_{(p,q) \in \Omega_{(i,j)}} h_{v,(p,q)}^{(l)}$    (2)
where (i, j) indicates the cell with index (i, j) and (p, q) is a position index within $\Omega_{(i,j)}$. The maximum value is pooled over each small cell, as expressed in Eqn. (2).
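Eqns. (1) and (2) can be written out as a minimal NumPy sketch for a single (non-batched) input; the pooling cell size and stride are illustrative, and, as is conventional for CNNs, the "convolution" is implemented as cross-correlation.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(h_prev, k, b):
    """Eqn (1): h_v = relu(b_v + sum_u k_vu * h_u).
    h_prev: (U, H, W) input channels; k: (V, U, kh, kw) filters; b: (V,) biases."""
    out = []
    for v in range(k.shape[0]):
        acc = sum(correlate2d(h_prev[u], k[v, u], mode="valid")
                  for u in range(h_prev.shape[0]))
        out.append(np.maximum(acc + b[v], 0.0))  # rectified linear unit
    return np.stack(out)

def max_pool(h, cell=3, stride=2):
    """Eqn (2): pool the maximum over each (possibly overlapping) cell."""
    V, H, W = h.shape
    rows = range(0, H - cell + 1, stride)
    cols = range(0, W - cell + 1, stride)
    return np.array([[[h[v, i:i + cell, j:j + cell].max() for j in cols]
                      for i in rows] for v in range(V)])
```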
After the response map h0 is obtained, an important issue is how to crop the head-shoulder patch from x0. A simple solution is to crop the region whose responses in h0 are larger than a threshold. However, difficulty exists when multiple faces are presented, such that there may be multiple regions with evenly high responses. Therefore, in one embodiment of the present application, a fast density peak identifying technique is devised. It calculates a special geodesic distance di for each position i in h0, based on the density intensity ρi at position i and on σi = min_{j: ρj > ρi}(sij), where sij is the spatial distance between position i and position j. σi measures the distance from position i to the nearest position which has a larger density intensity. Density peaks are then identified by selecting positions with extremely large di. This process can be further accelerated, as the response map is sparse. The correct window can then be proposed by cropping the region with the highest density. Note that the face image xf can be cropped in a similar fashion as above.
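A plain NumPy sketch of the density-peak idea follows. Taking the density intensity rho directly from the response map and scoring peaks by the product rho_i * sigma_i are assumptions; the filing only defines sigma_i and states that positions with extremely large d_i are selected. The quadratic loop is kept for clarity; as noted above, the computation can be accelerated because the response map is sparse.

```python
import numpy as np

def density_peaks(h0, top_k=1):
    """Return the top_k density-peak positions (row, col) of response map h0."""
    H, W = h0.shape
    pos = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)
    pos = pos.reshape(-1, 2).astype(float)
    rho = h0.reshape(-1)                      # density intensity rho_i
    order = np.argsort(-rho)                  # positions sorted by decreasing rho
    sigma = np.full(rho.shape, np.inf)
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]                 # positions with larger density than i
        sigma[i] = np.linalg.norm(pos[higher] - pos[i], axis=1).min()
    sigma[order[0]] = sigma[np.isfinite(sigma)].max()  # densest point: largest sigma
    score = rho * sigma                       # assumed combination into d_i
    peaks = np.argsort(-score)[:top_k]
    return [tuple(pos[p].astype(int)) for p in peaks]
```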
ANet 103 utilizes the estimated face region xf as input. As shown in Fig. 3 (c), the filters of C1 and C2 in the ANet 103 are globally shared and can be formulated in the same way as Eqns. (1) and (2). The locally shared filters at C3 and C4 are learned to capture different local information in specific facial regions (cells); for example, the highlighted cells A in Fig. 3 (c) correspond to the left eye and the left mouth corner, respectively. These locally shared filters can be formulated as in Eqn. (1), but with a separate filter bank and bias for each cell,
$h_{v,(p,q)}^{(l)} = \operatorname{relu}\big(b_{v,(p,q)}^{(l)} + \sum_u k_{vu,(p,q)}^{(l)} * h_{u,(p,q)}^{(l-1)}\big)$    (4)
where (p, q) is the cell index. However, as shown in Fig. 3 (c), the estimated face region xf is not well-aligned, because of the large variation presented in web images. If we simply apply Eqn. (4), the subsequent face features may contain noise. A simple solution is to densely crop image patches and apply ANet 103 to each of them, but this leads to redundant computations (e.g. at C1 and C2). Therefore, the present application proposes the interweaved operation, which can account for misalignment without cropping multiple patches.
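The cell-wise (locally shared) convolution of Eqn. (4) can be sketched as follows, assuming a single channel and a 2x2 grid of non-overlapping cells; the grid size and helper names are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def locally_shared_conv(h_prev, cell_filters, cell_biases, grid=(2, 2)):
    """Each cell (p, q) of the input map is convolved with its own filter.
    h_prev: (H, W); cell_filters[p][q]: 2-D kernel; cell_biases[p][q]: scalar."""
    H, W = h_prev.shape
    gh, gw = grid
    ch, cw = H // gh, W // gw
    out = np.zeros_like(h_prev)
    for p in range(gh):
        for q in range(gw):
            cell = h_prev[p * ch:(p + 1) * ch, q * cw:(q + 1) * cw]
            resp = correlate2d(cell, cell_filters[p][q], mode="same")
            out[p * ch:(p + 1) * ch, q * cw:(q + 1) * cw] = np.maximum(
                resp + cell_biases[p][q], 0.0)
    return out
```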
To better visualize the process, the network structure of C2, C3 and C4 is illustrated again in Fig. 4 (a), where each filter in C3 corresponds to four local regions in C2. These regions can overlap, and the same relationship applies between C4 and C3. For clarity, we consider four filters k^(3)_1, k^(3)_2, k^(3)_3 and k^(3)_4 in C3 and one filter k^(4)_1 in C4, and assume there is only one channel. After obtaining the response map h^(2) in C2, the device 103 applies each filter in C3, using Eqn. (1), to the entire response map h^(2), resulting in the response maps h^(3)_1, h^(3)_2, h^(3)_3 and h^(3)_4, as shown in Fig. 4 (b). In the next step, the device 103 needs to apply k^(4)_1 to these maps. Difficulty exists because the filters in C3 have a spatial relationship: for instance, the response of one C3 filter should lie at the left hand side of the response of its right-hand neighbour. To compensate for these geometric constraints, the interweaved map of C3, h^(3)_inter, is constructed as depicted in Fig. 4 (c), where responses belonging to the same cell are padded together.
Then, the feature map of C4 is calculated as a standard convolution, using Eqn. (1), of the interweaved map h^(3)_inter with the filter k^(4)_1, yielding h^(4)_1. Similarly, feature maps can be obtained for the other locally shared filters in C4.
Since the filters in C4 are assumed to have one channel, the redundant parts of h^(4)_1 are filter responses at other possible spatial positions. To find the desired position, the interweaved map h^(4)_inter of C4 is constructed so as to preserve the geometric constraint, and its maximum component is searched for. The whole process can be viewed as implicitly combining different part detectors (the locally shared filters) under geometric constraints (the interweaved operation) to facilitate accurate localization. The present application then properly crops and pools at different positions to generate multiple views of h^(4); this further suppresses residual misalignment and yields a size that fits the fully-connected layer. Feeding these multi-view response maps into the FC layer of the ANet leads to multi-view representations of the face region. The present application concatenates all the multi-view representations together to obtain the final face representation h.
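One plausible reading of the interweaved map, sketched below for a 2x2 grid of single-channel C3 responses: the responses that the four locally shared filters produce at the same spatial location are placed next to each other in a small block, so that their left/right and top/bottom relationships are preserved; a standard convolution with the C4 filter can then be applied to the result as in Eqn. (1). The exact layout is an assumption based on the description of Fig. 4 (c).

```python
import numpy as np

def interweave(resp_maps, grid=(2, 2)):
    """Build an interweaved map from per-filter response maps.
    resp_maps: (gh * gw, H, W); result: (gh * H, gw * W), where the response
    of filter (p, q) at position (i, j) ends up at (i * gh + p, j * gw + q)."""
    gh, gw = grid
    n, H, W = resp_maps.shape
    assert n == gh * gw
    blocks = resp_maps.reshape(gh, gw, H, W)
    # (gh, gw, H, W) -> (H, gh, W, gw) -> (H * gh, W * gw)
    return blocks.transpose(2, 0, 3, 1).reshape(H * gh, W * gw)
```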
To make the predictor 10 work effectively, the attribute inference system with the cascaded LNet0 101, LNets 102 and ANet 103 in the predictor 10 shall be trained first. To this end, the trainer 20 may receive a general object dataset with category annotations and a face dataset with identity and attribute annotations. The general object dataset and its category annotations are fed to the predictor 10 to obtain a pre-trained face localization model, and the face dataset and its identity annotations are fed into the predictor 10 so as to obtain a pre-trained attribute prediction model. The obtained pre-trained face localization model and pre-trained attribute prediction model are further inputted into the predictor 10, along with the face dataset and its attribute annotations, so as to obtain the final model.
To this end, as shown in Fig. 2, the trainer 20 may comprise a face localization pre-training device 201, an attribute prediction pre-training device 202 and a fine-tuning device 203.
The face localization pre-training device 201 operates to receive the general object dataset with category annotations and the face dataset with identity and attribute annotations, and then to feed the general object dataset and its category annotations into the predictor 10 so as to obtain the pre-trained face localization model. Fig. 5 illustrates a flow chart for the face localization pre-training device 201 to train the predictor 10 to obtain the pre-trained face localization model. At step s501, the face localization pre-training device 201 operates to randomly initialize the neuron weights between the connections of each two convolution layers in the LNet0 101, LNets 102 and ANet 103 of the predictor 10. At step s502, the face localization pre-training device 201 calculates classification errors by classifying each image into one of a plurality of (N) general object categories. Specifically, if the object category is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error. At step s503, the face localization pre-training device 201 operates to back-propagate the classification errors through the layers in the LNet0, LNets and ANet 103 of the predictor 10, so as to update the neuron weights between the connections of each two convolution layers. At step s504, the face localization pre-training device 201 determines whether the training has converged by comparing the currently obtained classification errors with a predetermined threshold; if yes, at step s505, the pre-trained face localization model for the predictor 10 is obtained; otherwise, the process returns to step s502.
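The loop of Fig. 5 may be sketched as follows, with a toy softmax classifier standing in for the cascaded LNet0, LNets and ANet; the feature dimension, learning rate and threshold are assumed values chosen only so that the listing runs, while the 1/N error bookkeeping and the convergence test follow steps s501 to s505.

# Toy version of steps s501-s505: random initialization, +1/N error per
# misclassified image, back-propagation, convergence check.
import numpy as np

rng = np.random.default_rng(0)
N = 5                                                   # number of general object categories
X = rng.standard_normal((200, 16))                      # stand-in image features
y = (X @ rng.standard_normal((16, N))).argmax(axis=1)   # stand-in category annotations

W = 0.01 * rng.standard_normal((16, N))                 # s501: random weight initialization
lr, threshold = 0.5, 0.1 * len(X) / N                   # assumed hyper-parameters

for epoch in range(2000):
    logits = X @ W
    error = np.count_nonzero(logits.argmax(axis=1) != y) / N   # s502: +1/N per wrong category
    if error < threshold:                               # s504: compare with the threshold
        break                                           # s505: pre-trained localization weights obtained
    p = np.exp(logits - logits.max(axis=1, keepdims=True))     # s503: back-propagate the error
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1
    W -= lr * (X.T @ p) / len(X)

print("stopped after", epoch, "epochs with classification error", error)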
As is well known in the art, a standard training process iteratively updates the weights of a model so that the output of the model (the prediction) approaches the ground-truth annotation. For example, in the face attribute prediction task, the goal of model training is to predict the presence of certain attributes (e.g. “male” , “smiling” , etc. ) in a way that aligns with reality (the annotations) . Usually, the weights of the model are initialized randomly. The pre-training performed by the face localization pre-training device 201 resembles standard training, except that the task in pre-training is different from the final task. For example, the task used here for pre-training is to predict the object category (e.g. “car” , “dog” , “mountain” , “flower” , etc. ) present in each image, while the final task is to predict face attributes.
The fine-tuning device 203 proceeds with a fine-tuning that resembles standard training, except that the weights are not initialized randomly but are initialized using the weights of the pre-trained model. Fig. 6 illustrates the flow chart of the fine-tuning performed by the fine-tuning device 203 according to one embodiment of the present application. At step s601, the fine-tuning device 203 initializes the weights of the predictor 10 using the weights of the pre-trained model. At step s602, the fine-tuning device 203 calculates classification errors by predicting attribute tags for each image. Specifically, if the attribute tag is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error. At step s603, the fine-tuning device 203 back-propagates the classification errors through the predictor 10 so as to update the weights. Then, at step s604, the fine-tuning device 203 determines whether the currently obtained classification error is less than a predetermined threshold; if yes, at step s605, the process terminates and the face localization model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s602.
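Because the only structural difference from the pre-training loop is the initialization at step s601, the fine-tuning can be sketched as a routine that starts from the pre-trained weights; the helper name fine_tune, the hyper-parameters and the use of integer attribute tags are assumptions made for illustration only. The same routine, invoked with identity labels during pre-training and attribute tags during fine-tuning, also mirrors the flows of Figs. 7 and 8 described below.

# Toy version of steps s601-s605: start from pre-trained weights and
# optimize attribute-tag errors with the same +1/N bookkeeping.
import numpy as np

def fine_tune(pretrained_W, X, tags, lr=0.5, threshold=1.0, max_epochs=2000):
    """tags: integer attribute labels assumed to lie in range(pretrained_W.shape[1])."""
    W = pretrained_W.copy()                              # s601: initialize from the pre-trained model
    N = W.shape[1]
    for _ in range(max_epochs):
        logits = X @ W
        error = np.count_nonzero(logits.argmax(axis=1) != tags) / N   # s602: +1/N per wrong tag
        if error < threshold:                            # s604: convergence check
            break
        p = np.exp(logits - logits.max(axis=1, keepdims=True))        # s603: back-propagation
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(tags)), tags] -= 1
        W -= lr * (X.T @ p) / len(X)
    return W                                             # s605: fine-tuned model weights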
Fig. 7 illustrates a flow chart for training an attribute prediction model by the attribute prediction pre-training device 202 according to one embodiment of the present application. At step s701, the attribute prediction pre-training device 202 operates to randomly initialize the neuron weights between the connections of each two convolution layers in the LNet0, LNets and ANet of the predictor 10. At step s702, the attribute prediction pre-training device 202 calculates identification errors by classifying each image into one of a plurality of (N) human face identities. Specifically, if the human face identity is predicted correctly, the identification error is zero; otherwise, 1/N is added to the identification error. At step s703, the attribute prediction pre-training device 202 operates to back-propagate the identification errors through the layers in the LNet0, LNets and ANet of the predictor 10, so as to update the neuron weights between the connections of each two convolution layers. At step s704, the attribute prediction pre-training device 202 determines whether the currently obtained identification error is less than a predetermined threshold; if yes, at step s705, the process terminates and the pre-trained attribute prediction model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s702.
Fig. 8 illustrates the flow chart of the fine-tuning of the pre-trained attribute prediction model according to one embodiment of the present application. At step s801, the fine-tuning device 203 initializes the weights of the predictor 10 using the weights of the pre-trained attribute prediction model. At step s802, the fine-tuning device 203 calculates classification errors by predicting attribute tags for each image. Specifically, if the attribute tag is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error. At step s803, the fine-tuning device 203 back-propagates the classification errors through the predictor 10 so as to update the weights. Then, at step s804, the fine-tuning device 203 determines whether the currently obtained classification error is less than a predetermined threshold; if yes, at step s805, the process terminates and the attribute prediction model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s802.
Finally, the fine-tuned face localization model and the fine-tuned attribute prediction model are concatenated together to form the final model. The face localization model takes a raw web image as input and outputs the position of the face region in it. Then, the attribute prediction model takes the targeted face region as input and outputs the predicted attribute tags. The final integrated model is thus the concatenation of the face localization model and the attribute prediction model.
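At inference time the concatenation therefore behaves as a two-stage pipeline, which can be outlined as follows; localize_face, predict_attributes and predict_face_attributes are hypothetical stand-ins for the trained sub-models rather than the actual implementation.

# Illustrative two-stage inference pipeline: localization model first, then
# attribute prediction on the cropped face region (all helpers are placeholders).
from typing import Dict, Tuple

def localize_face(web_image) -> Tuple[int, int, int, int]:
    """Stand-in for the fine-tuned face localization model (LNet0 and LNets)."""
    ...

def predict_attributes(face_region) -> Dict[str, bool]:
    """Stand-in for the fine-tuned attribute prediction model (ANet)."""
    ...

def predict_face_attributes(web_image):
    x0, y0, x1, y1 = localize_face(web_image)      # raw web image -> face region position
    face_region = web_image[y0:y1, x0:x1]          # crop the targeted face region
    return predict_attributes(face_region)         # face region -> predicted attribute tags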
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (15)

  1. An apparatus for predicting facial attribute tags, comprising:
    a first location prediction device for predicting a head-shoulder location in an inputted face image;
    a second location prediction device for predicting a face location in the inputted face image from the predicted head-shoulder location; and
    an attribute prediction device for extracting one or more face representations from the face location and classifying desired attributes for the inputted face image from the extracted face representations.
  2. The apparatus according to claim 1, wherein the first location prediction device is configured to calculate a geodesic distance for each position in the inputted face image, and to determine whether the calculated distance is larger than a predetermined threshold; if yes, the first location prediction device determines that this position in the inputted face image belongs to the head-shoulder location.
  3. The apparatus according to claim 2, wherein the geodesic distance is determined based on a density intensity for each position in the inputted face image and a spatial distance between each position and its nearest position in the inputted face image.
  4. The apparatus according to claim 1 or 2, wherein the first location prediction device comprises a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the filters being recurrently applied at each location of the face image to translate and scale the inputted face image.
  5. The apparatus according to claim 1 or 2, wherein the second location prediction device comprises a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the filters being recurrently applied at each location of the image to translate and scale the inputted face image.
  6. The apparatus according to claim 1 or 2, wherein the attribute prediction device comprises a plurality of convolutional layers,
    wherein each of the convolutional layers is configured with one or more filters for translating and scaling the inputted face image, the filters at a first layer of the convolutional layers and a second layer of the convolutional layers are globally shared, while the filters at a third layer and a fourth layer of the convolutional layers are locally shared.
  7. The apparatus according to claim 1 or 2, wherein the attribute prediction device further comprises a plurality of max-pooling layers, each of which is connected to a corresponding convolutional layer of the convolutional layers and configured to partition received feature maps into a grid with overlapping cells such that the following convolutional layers translate and scale the partitioned grids of the inputted face image in an interweaved way.
  8. The apparatus according to claim 1 or 2, wherein the attribute prediction device further comprises:
    a fully-connected layer connected to a last layer of the convolutional layers and configured to transform response maps for the inputted image, which are produced by the convolutional layers, into a compact and discriminative feature representation.
  9. The apparatus according to any one of claims 1-7, further comprising a trainer configured to:
    feed a general object dataset and its category annotations to the first and the second location prediction devices to obtain a pre-trained face localization model, and feed a face dataset and its identity annotations into the attribute prediction device to obtain a pre-trained attribute prediction model, wherein the obtained pre-trained face localization model and pre-trained attribute prediction model are combined as a final model.
  10. A method for predicting facial attribute tags, comprising:
    predicting a head-shoulder location in an inputted face image;
    predicting a face location in the inputted face image from the predicted head-shoulder location;
    extracting one or more face representations from the predicted face location; and
    classifying desired attributes for the inputted face image from the extracted face representations.
  11. The method according to claim 10, wherein the step of predicting the head-shoulder location further comprises:
    calculating a geodesic distance in a response for each position in the inputted face image, the response being obtained from a first neural network with the inputted face image; and
    determining that the position belongs to the head-shoulder location if said calculated distance is larger than a predetermined threshold.
  12. The method according to claim 11, wherein the geodesic distance is determined based on a density intensity for each position in the inputted face image and a spatial distance between each position and its nearest position in the inputted face image.
  13. The method according to claim 11 or 12, wherein the first neural network comprises a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the method further comprises:
    applying recurrently the filters to each location of the inputted face image so as to translate and scale the inputted face image.
  14. The method according to claim 12, wherein the face location is predicted from the predicted head-shoulder location by a second neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the method further comprises:
    applying recurrently the filters to each location of the inputted face image so as to translate and scale the inputted face image.
  15. The method according to claim 10 or 11, further comprising:
    retrieving a general object dataset with category annotations and a face dataset with identity and attribute annotations;
    determining a pre-trained face localization model based on the general object dataset and its category annotations; and
    determining a pre-trained attribute prediction model based on the face dataset and its identity annotations,
    wherein the pre-trained face localization model and pre-trained attribute prediction model are combined as a final model for predicting facial attribute tags of the inputted face image.
PCT/CN2014/001120 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes WO2016090522A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/001120 WO2016090522A1 (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes
CN201480083724.5A CN107004116B (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face's attribute

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/001120 WO2016090522A1 (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes

Publications (1)

Publication Number Publication Date
WO2016090522A1 true WO2016090522A1 (en) 2016-06-16

Family

ID=56106393

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/001120 WO2016090522A1 (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes

Country Status (2)

Country Link
CN (1) CN107004116B (en)
WO (1) WO2016090522A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549842B (en) * 2018-03-21 2020-08-04 珠海格力电器股份有限公司 Method and device for classifying figure pictures
CN110390295B (en) * 2019-07-23 2022-04-01 深圳市道通智能航空技术股份有限公司 Image information identification method and device and storage medium
CN111783574B (en) * 2020-06-17 2024-02-23 李利明 Meal image recognition method, device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7606777B2 (en) * 2006-09-01 2009-10-20 Massachusetts Institute Of Technology High-performance vision system exploiting key features of visual cortex
CN102054159A (en) * 2009-10-28 2011-05-11 腾讯科技(深圳)有限公司 Method and device for tracking human faces
CN103824054A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded depth neural network-based face attribute recognition method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018113261A1 (en) * 2016-12-22 2018-06-28 深圳光启合众科技有限公司 Target object recognition method and apparatus, and robot
US10783393B2 (en) 2017-06-20 2020-09-22 Nvidia Corporation Semi-supervised learning for landmark localization
US10783394B2 (en) 2017-06-20 2020-09-22 Nvidia Corporation Equivariant landmark transformation for landmark localization
CN109492571A (en) * 2018-11-02 2019-03-19 北京地平线机器人技术研发有限公司 Identify the method, apparatus and electronic equipment at human body age
WO2020211398A1 (en) * 2019-04-16 2020-10-22 深圳壹账通智能科技有限公司 Portrait attribute model creating method and apparatus, computer device and storage medium
CN114581458A (en) * 2020-12-02 2022-06-03 中强光电股份有限公司 Method for generating image recognition model and electronic device using same

Also Published As

Publication number Publication date
CN107004116A (en) 2017-08-01
CN107004116B (en) 2018-09-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14907875

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14907875

Country of ref document: EP

Kind code of ref document: A1