WO2016090522A1 - Method and apparatus for predicting face attributes - Google Patents

Method and apparatus for predicting face attributes

Info

Publication number
WO2016090522A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
location
inputted
face image
filters
Prior art date
Application number
PCT/CN2014/001120
Other languages
French (fr)
Inventor
Xiaoou Tang
Ziwei Liu
Ping Luo
Xiaogang Wang
Original Assignee
Xiaoou Tang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoou Tang
Priority to PCT/CN2014/001120 (WO2016090522A1)
Priority to CN201480083724.5A (CN107004116B)
Publication of WO2016090522A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/164Detection; Localisation; Normalisation using holistic features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/179Human faces, e.g. facial parts, sketches or expressions metadata assisted face recognition

Definitions

  • Fig. 7 illustrates a flow chart for training an attribute prediction model for the attribute prediction pre-training device 202 according to one embodiment of the present application.
  • the attribute prediction pre-training device 202 operates to randomly initialize neuron weights between the connections of each two convolution layers in the LNet0, LNets and ANet of the predictor 10.
  • the attribute prediction pre-training device 202 calculates identification errors by classifying each image into one of a plurality of (N) human face identities. Specifically, if the human face identity is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error.
  • the attribute prediction pre-training device 202 operates to back-propagate the identification errors through the layers in the LNet0, LNets and ANet of the predictor 10, so as to update the neuron weights between the connections of each two convolution layers.
  • the attribute prediction pre-training device 202 determines if the currently obtained identification error is less than a predetermined threshold, if yes, at step S705, the process terminates and the pre-trained attribute prediction model formed by the updated weights for the predictor 10 is obtained, otherwise, the process returns to step s702.
  • Fig. 8 illustrates the flow chart for the fine-tuning of the pre-trained attribute prediction model according to one embodiment of the present application.
  • the fine-tuning device 203 initializes the weights of the predictor 10 using the weights in the pre-trained attribute prediction model.
  • the fine-tuning device 203 calculates classification errors by predicting attribute tags for each image. Specifically, if the attribute tag is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error.
  • the fine-tuning device 203 back-propagates the classification errors through the predictor 10 so as to update the weights, and then, at step 803, the fine-tuning device 203 determines if the currently obtained classification error is less than a predetermined threshold; if yes, at step S805, the process terminates and the attribute prediction model formed by the updated weights for the predictor 10 is obtained; otherwise, the process returns to step s802.
  • the fine-tuned face localization model and fine-tuned attribute prediction model are concatenated together to form the final model.
  • the face localization model takes a raw web image as input and outputs the position of the face region in it.
  • the attribute prediction model takes the located face region as input and outputs the predicted attribute tags, so the final integrated model is the concatenation of the face localization model and the attribute prediction model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed are an apparatus and a method for predicting facial attribute tags. The apparatus may comprise a first location prediction device for predicting a head-shoulder location in an inputted face image; a second location prediction device for predicting a face location in the inputted face image from the predicted head-shoulder location; and an attribute prediction device for extracting one or more face representations from the face location and classifying desired attributes for the inputted face image from the extracted face representations.

Description

Method and Apparatus for Predicting Face Attributes
Technical Field
The present application relates to a method and an apparatus for predicting face attributes in the wild.
Background
Face attributes, such as expression, race, and hair style, are beneficial for many applications such as image tagging and face verification. Predicting face attributes from images on the web is challenging because of large background clutters and face variations, such as scale, pose, and illumination. Existing methods for attribute recognition first detect faces and their landmarks, and then extract high-dimensional features, such as HOG (Histogram of Oriented Gradient) or LBP (Local Binary Patterns), from image patches centered on the landmarks. These features are concatenated to train classifiers.
Although this pipeline is suitable for controlled environments, it has drawbacks when dealing with web images. It heavily depends on the precision of face and landmark detection, which is not reliable in web images; most state-of-the-art face detection and alignment methods fail because features are extracted at wrong landmark positions. Face detection itself is also ambiguous in such images.
Extracting hand-crafted features, such as HOG (Histogram of Oriented Gradient), LBP (Local Binary Patterns) and GIST (Spatial Envelope), at pre-defined landmarks is a standard step in attribute recognition. It has been proposed to combine HOG and color histograms to train logistic regression for object search and tagging based on attributes. It has also been proposed to extract HOG-like features on various face regions to tackle attribute classification and face verification. To improve the discriminativeness of hand-crafted features for a specific task, a three-level SVM system has been built to extract higher-level information, and various hand-crafted features have been combined to obtain an intermediate representation for a particular domain.
Recently, deep learning methods have achieved great success in attribute inference, due to their ability to learn compact and discriminative features. It has been demonstrated that off-the-shelf features learned by the Convolutional Neural Network (CNN) of ImageNet can be effectively adapted to attribute classification. It has also been shown that better performance can be achieved by ensembling the learned features of multiple pose-normalized CNNs. Specific network structures have also been designed for attribute prediction; for example, a deep sum-product architecture has been introduced to account for occlusion during attribute inference. The main drawback of the above methods is that they heavily rely on accurate landmark detection and pose estimation in both the training and testing steps. Even though a recent work can perform part localization automatically during testing, it still requires landmark annotations of the training data.
Summary
The present application proposes a deep learning framework for predicting face attributes in the wild. Specifically, the goal is to automatically label raw web images with facial attribute tags (e.g. "Male", "Young", "Smiling", "Wearing Hat", "Big Eyes", "Oval Face" and "Mustache").
The proposed deep learning framework does not rely on face and landmark detection. Instead, it cascades two convolutional neural networks (CNNs): one (LNet) locates the face region, and the other (ANet) extracts a high-level face representation from the entire located face region (without landmarks) for attribute prediction.
The LNet and ANet are trained in a weakly supervised manner, i.e. only the attribute tags of the training images are provided. This is fundamentally different from training face and landmark detectors, where face bounding boxes and landmark positions are needed, and it makes the preparation of training data much easier. The LNet and ANet are first pre-trained differently and then jointly trained with attribute labels.
Secondly, different pre-training and fine-tuning strategies are designed for the LNet and ANet. Different from training face detectors with positive (face) and negative (non-face) samples, the LNet is pre-trained by classifying massive general object categories; thus, its pre-trained features have good generalization capability for handling various background clutters. The LNet is then fine-tuned by predicting attributes. Features learned by attribute prediction can capture rich face variations and are effective for face localization; they can also better distinguish subtle differences between human faces and analogous patterns, such as a cat face. The ANet is pre-trained by classifying massive face identities, to obtain a discriminative face representation, and is then fine-tuned by the attribute prediction task.
Thirdly, to make face localization and attribute prediction real-time, a fast feed-forward scheme is proposed that evaluates web images of arbitrary size. If the filters are globally shared, this can be done by convolving the images with the filters. It becomes non-trivial if the filters are locally shared, yet, as known in the art, locally shared filters perform better in face-related tasks. This is solved by the proposed interweaved operation.
Besides proposing new methods, the present application also reveals valuable facts on learning face representation. They not only motivate this invention but also benefit future research on face and deep learning.
(1) It shows how supervised pre-training with massive object categories and massive identities can improve feature learning of the LNet and ANet for face localization and attribute recognition, respectively.
(2) It demonstrates that although the filters of LNet are fine-tuned by attribute prediction, their response maps over the entire image give a strong indication of the face's location. Good features for face localization should be able to capture rich face variations, and more supervised information on these variations improves the learning process. To understand this, consider the examples in Fig. 1. If only a single detector is used to classify all the positive and negative samples in Fig. 1 (a), it is difficult to handle complex face variations. Therefore, multi-view face detectors were developed, as in Fig. 1 (b), i.e. face images in different views are handled by different detectors. View labels were used in training the detectors, and the whole training set was divided into subsets according to views. If views are treated as one type of face attribute, learning face representation by predicting attributes with deep models actually extends this idea to the extreme. As shown in Fig. 1 (c), a filter (or a group of filters) functions as a detector of an attribute. When a subset of neurons is activated, they indicate the existence of face images which have a particular attribute configuration. The neurons at different layers can form many activation patterns, implying that the whole set of face images can be divided into many subsets based on attribute configurations, and each activation pattern corresponds to one subset (e.g. 'pointy nose', 'rosy cheek', and 'smiling'). Therefore, it is not surprising that filters learned by attribute prediction lead to effective representations for face localization. By simply averaging and thresholding the response maps, good face localization is achieved; a minimal sketch of this step is given after point (3) below.
(3) This application also discloses that the high-level hidden neurons of the ANet after pre-training implicitly learn and discover semantic concepts that are related to identity, such as race, gender, and age. These concepts are significantly expanded after fine-tuning for attribute classification. This fact indicates that when a deep model is pre-trained for face recognition, it is also implicitly learning attributes. The performance of attribute prediction drops without the pre-training stage. With this strategy, each face attribute is well explained by a sparse linear combination of these semantic concepts. By analyzing the coefficients of such combinations, attributes show clear grouping patterns, which could be well interpreted semantically.
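The averaging-and-thresholding step described in point (2) can be pictured with a short NumPy sketch. The function name, the fixed threshold and the min-max normalization below are illustrative assumptions, not details taken from the filing.

```python
import numpy as np

def localize_by_response(response_maps, threshold=0.5):
    """Average per-filter response maps, threshold them, and return the
    bounding box (top, left, bottom, right) of the high-response region.
    response_maps: array of shape (num_filters, H, W)."""
    avg = response_maps.mean(axis=0)
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)  # normalize to [0, 1]
    mask = avg > threshold
    if not mask.any():
        return None  # no position exceeds the threshold
    ys, xs = np.nonzero(mask)
    return ys.min(), xs.min(), ys.max(), xs.max()
```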
In one aspect of the present application, disclosed is an apparatus for predicting facial attribute tags, comprising:
a first location prediction device for predicting a head-shoulder location in an inputted face image;
a second location prediction device for predicting a face location in the inputted face image from the predicted head-shoulder location;
an attribute prediction device for extracting one or more face representations from the face location and classifying desired attributes for the inputted face image from the extracted face representations.
In another aspect of the present application, disclosed is a method for predicting facial attribute tags, comprising:
predicting a head-shoulder location in an inputted face image;
predicting a face location in the inputted face image from the predicted head-shoulder location;
extracting one or more face representations from the face location; and
classifying desired attributes for the inputted face image from the extracted face representations.
In one embodiment of the present application, the step of predicting the head-shoulder location further comprises: calculating a geodesic distance in a response for each position in the inputted face image, the response being obtained from a first neural network applied to the inputted face image; and determining that the position belongs to the head-shoulder location if the calculated distance is larger than a predetermined threshold.
In one embodiment of the present application, the method may further comprise a step of training, which may include:
retrieving a general object dataset with category annotations and face dataset with identity and attribute annotations;
determining a pre-trained face localization model based on the general object dataset and its category annotations; and
determining a pre-trained attribute prediction model based on the face dataset and its identity annotations,
wherein the pre-trained face localization model and the pre-trained attribute prediction model are combined as a final model for predicting facial attribute tags.
Brief Description of the Drawing
Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
Fig. 1 is a schematic diagram illustrating face localization with a single detector (a), with multi-view detectors (b), and by attributes (c).
Fig. 2 is a schematic diagram illustrating an apparatus for predicting face attributes consistent with some disclosed embodiments.
Fig. 3 is a schematic diagram illustrating the proposed pipeline of attribute inference for the predictor shown in Fig. 2, consistent with some disclosed embodiments.
Fig. 4 is a schematic diagram illustrating an interweaved operation of the predictor according to one embodiment of the present application.
Fig. 5 is a schematic flowchart illustrating the training stage of the face localization model consistent with some disclosed embodiments.
Fig. 6 is a schematic flowchart illustrating the fine-tuning of the face localization model consistent with some disclosed embodiments.
Fig. 7 is a schematic flowchart illustrating the training stage of the attribute prediction model consistent with some disclosed embodiments.
Fig. 8 is a schematic flowchart illustrating the fine-tuning of the attribute prediction model consistent with some disclosed embodiments.
Detailed Description
Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc. ) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit, ” , “device” , “module” or “system. ” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
It is further understood that the use of relational terms such as first and second, and the like, if any, are used solely to distinguish one from another entity, item, or action without necessarily requiring or implying any actual such relationship or order between such entities, items or actions.
Much of the inventive functionality and many of the inventive principles, when implemented, are best supported with or in software or integrated circuits (ICs), such as a digital signal processor and software therefor, or application specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such software instructions or ICs with minimal experimentation. Therefore, in the interest of brevity and minimization of any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, will be limited to the essentials with respect to the principles and concepts used by the preferred embodiments.
Fig. 2 is a schematic diagram illustrating an exemplary apparatus 100 for predicting face attributes in the wild according to one embodiment of the present application. As shown, the apparatus 100 may comprise a predictor 10 and a trainer 20.
The predictor 10 may comprise an attribute inference system that consists of a plurality of stages (for example, four stages as shown in Fig. 3) formed by cascading a first location prediction device 101, a second location prediction device 102 and an attribute prediction device 103.
The first location prediction device 101 is used for predicting the location of the head-shoulder in an inputted face image. In one embodiment of the present application, the first location prediction device 101 is configured to obtain a response h0 from the inputted face image and to calculate a geodesic distance in the response for each position in the image. If the calculated distance is larger than a predetermined threshold, the first location prediction device 101 will determine that this position in the image belongs to the head-shoulder location, as will be discussed later.
As shown in Fig. 3, the first location prediction device 101 may comprise a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers (C1 to C5), wherein the convolutional layers are configured with globally shared filters, the filters being recurrently applied at every location of the image so as to account for translation and scaling of the face images.
The second location prediction device 102 is used for predicting the location of the face in the inputted face image from the predicted head-shoulder location. As shown in Fig. 3, the second location prediction device 102 comprises a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers (C1 to C5), wherein the convolutional layers of the device 102 are configured with globally shared filters, the filters likewise being recurrently applied at every location of the image so as to account for translation and scaling of the face images.
The attribute prediction device 103 is used for extracting one or more face representations y from the face location and classifying the desired attributes. The attribute prediction device 103 may comprise one or more (for example, four as shown) convolutional layers (C1 to C4), wherein each of the convolutional layers is configured with one or more filters; the filters at a first convolutional layer C1 and a second convolutional layer C2 are globally shared, while the filters at a third convolutional layer C3 and a fourth convolutional layer C4 are locally shared. The attribute prediction device 103 further comprises one or more (for example, three) max-pooling layers, each of which is connected to a corresponding convolutional layer and configured to render the whole system robust to local translations. The device 103 further comprises one fully-connected layer (FC) cascaded to the last convolutional layer (for example, C4 as shown) to classify the extracted attributes and learn a compact and discriminative face representation.
Referring to Fig. 3, given a face image x0 of arbitrary size, the first location prediction device 101 calculates a response map h0, which indicates the location of the head-shoulder, as shown in Fig. 3 (a). x0 is then combined with h0 to crop the head-shoulder region, denoted as xs. The second location prediction device 102 utilizes xs as input and outputs a response map hs, which designates the region of the face. Similarly, hs is combined with xs to locate the face region xf. The location prediction devices 101 and 102 are thus cascaded to propose the face location in a coarse-to-fine manner.
The attribute prediction device 103 is applied to the face region xf to extract response maps for classifying attributes, and may comprise a fully-connected layer to classify the attributes y. High responses in these maps are associated with different facial components, implying that the device (ANet) 103 is able to capture subtle face differences, such as the shapes of lips and eyebrows. In the last stage, as illustrated in Fig. 3 (d), several candidate windows are selected to pool the feature vectors through the fully-connected layer. These features are then concatenated as ha to train linear classifiers for attribute recognition.
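The coarse-to-fine data flow of Fig. 3 can be summarized in a small Python sketch. The callables lnet0, lnets, anet, crop_by_response and the per-attribute classifiers are hypothetical stand-ins for the components described above, not implementations taken from the filing.

```python
def predict_attributes(x0, lnet0, lnets, anet, classifiers, crop_by_response):
    """Coarse-to-fine attribute inference following Fig. 3.

    lnet0, lnets and anet are assumed to be callables returning response
    maps / features; crop_by_response(image, response) is assumed to cut
    out the highest-response region of `image`."""
    h0 = lnet0(x0)                    # (a) head-shoulder response map
    xs = crop_by_response(x0, h0)     # crop the head-shoulder region
    hs = lnets(xs)                    # (b) face response map
    xf = crop_by_response(xs, hs)     # crop the face region
    ha = anet(xf)                     # (c)/(d) pooled multi-view FC features
    # one binary classifier per attribute tag ("Male", "Smiling", ...)
    return {name: clf(ha) for name, clf in classifiers.items()}
```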
As shown in Fig. 3, the first location prediction device 101, the second location prediction device 102 and the attribute prediction device 103 may be implemented as different neural networks, represented as LNet0 101, LNets 102 and ANet 103 in Fig. 3. Each of LNet0 101, LNets 102 and ANet 103 may be implemented by software, integrated circuits (ICs) or a combination thereof. In one embodiment of the present application, as shown in Figs. 3 (a) and (b), the network structures of LNet0 101 and LNets 102 may be the same, for example stacking two max-pooling layers and five convolutional layers (C1 to C5) with globally shared filters. These filters are recurrently applied at every location of the image and are able to account for large face translation and scaling. The ANet 103 stacks, for example, four convolutional layers (C1 to C4), three max-pooling layers and one fully-connected layer (FC), where the filters at C1 and C2 are globally shared, while the filters at C3 and C4 are locally shared. As shown in Fig. 3 (c), the response maps at C2 and C3 are divided into grids with non-overlapping cells, each of which learns different filters. The fully-connected layer of the ANet 103 transforms the response maps produced by convolution into a compact and discriminative feature representation; for example, it may produce a feature representation suitable for classification (predicting attribute tags, e.g. "Male", "Young", "Smiling", "Wearing Hat", "Big Eyes", "Oval Face" and "Mustache").
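For concreteness, the globally shared trunk of LNet0/LNets could look like the PyTorch sketch below. Only the 96 filters of size 11x11x3 at C1 of LNet0 are specified in the text; every other channel count, kernel size and stride here is an assumption chosen for illustration.

```python
import torch.nn as nn

# Sketch of the LNet0 / LNets trunk: five convolutional layers (C1-C5) with
# globally shared filters and two max-pooling layers.  Being fully
# convolutional, it accepts input images of arbitrary size.
lnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),    # C1
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),  # C2
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(), # C3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(), # C4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), # C5
)
```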
Locally shared filters have been proved effective for face-related problems, because they can capture different information from different face parts. The network structures are specified in Fig. 3. For instance, the filters at C1 of LNet0 101 have a plurality of (for example, 96) channels and the filter size in each channel may be 11x11x3, as the input image x0 contains three color channels.
Both LNet0 101 and LNets 102 have five convolutional layers, each of which utilizes the output of the previous layer as input and may be formulated as
$h_v^{(l)} = \operatorname{relu}\big(b_v^{(l)} + \sum_u k_{vu}^{(l)} * h_u^{(l-1)}\big)$    (1)
where relu(x) = max(0, x) is the rectified linear function and * denotes the convolution operator; $h_u^{(l-1)}$ and $h_v^{(l)}$ stand for the u-th input channel at the (l-1)-th layer and the v-th output channel at the l-th layer, respectively. $k_{vu}^{(l)}$ and $b_v^{(l)}$ denote the filters and the bias, wherein the filters capture translation-invariant structures while the bias represents the overall energy level.
The max-pooling operations at C1 and C2 partition the feature maps into a grid of overlapping cells, which are formulated as
$h_{v,(i,j)}^{(l)} = \max_{(p,q) \in \Omega_{(i,j)}} h_{v,(p,q)}^{(l)}$    (2)
where (i, j) indicates the cell with index (i, j) and (p, q) is a position index within $\Omega_{(i,j)}$. The maximum value is pooled over each small cell, as expressed in Eqn. (2).
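Eqns. (1) and (2) can be written out as a minimal NumPy sketch for a single (non-batched) input; the pooling cell size and stride are illustrative, and, as is conventional for CNNs, the "convolution" is implemented as cross-correlation.

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(h_prev, k, b):
    """Eqn (1): h_v = relu(b_v + sum_u k_vu * h_u).
    h_prev: (U, H, W) input channels; k: (V, U, kh, kw) filters; b: (V,) biases."""
    out = []
    for v in range(k.shape[0]):
        acc = sum(correlate2d(h_prev[u], k[v, u], mode="valid")
                  for u in range(h_prev.shape[0]))
        out.append(np.maximum(acc + b[v], 0.0))  # rectified linear unit
    return np.stack(out)

def max_pool(h, cell=3, stride=2):
    """Eqn (2): pool the maximum over each (possibly overlapping) cell."""
    V, H, W = h.shape
    rows = range(0, H - cell + 1, stride)
    cols = range(0, W - cell + 1, stride)
    return np.array([[[h[v, i:i + cell, j:j + cell].max() for j in cols]
                      for i in rows] for v in range(V)])
```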
After the response map h0 is obtained, an important issue is how to crop the head-shoulder patch from x0. A simple solution is to crop the region whose responses in h0 are larger than a threshold. However, difficulty exists when multiple faces are presented, such that there may be multiple regions with evenly high responses. Therefore, in one embodiment of the present application, a fast density peak identifying technique is devised. It calculates a special geodesic distance di for each position i in h0, based on the density intensity ρi at position i and on σi = min_{j: ρj > ρi}(sij), where sij is the spatial distance between position i and position j. σi measures the distance from position i to the nearest position which has a larger density intensity. Density peaks are then identified by selecting positions with extremely large di. This process can be further accelerated, as the response map is sparse. The correct window can then be proposed by cropping the region with the highest density. Note that the face image xf can be cropped in a similar fashion as above.
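A plain NumPy sketch of the density-peak idea follows. Taking the density intensity rho directly from the response map and scoring peaks by the product rho_i * sigma_i are assumptions; the filing only defines sigma_i and states that positions with extremely large d_i are selected. The quadratic loop is kept for clarity; as noted above, the computation can be accelerated because the response map is sparse.

```python
import numpy as np

def density_peaks(h0, top_k=1):
    """Return the top_k density-peak positions (row, col) of response map h0."""
    H, W = h0.shape
    pos = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), -1)
    pos = pos.reshape(-1, 2).astype(float)
    rho = h0.reshape(-1)                      # density intensity rho_i
    order = np.argsort(-rho)                  # positions sorted by decreasing rho
    sigma = np.full(rho.shape, np.inf)
    for rank, i in enumerate(order[1:], start=1):
        higher = order[:rank]                 # positions with larger density than i
        sigma[i] = np.linalg.norm(pos[higher] - pos[i], axis=1).min()
    sigma[order[0]] = sigma[np.isfinite(sigma)].max()  # densest point: largest sigma
    score = rho * sigma                       # assumed combination into d_i
    peaks = np.argsort(-score)[:top_k]
    return [tuple(pos[p].astype(int)) for p in peaks]
```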
ANet 103 utilizes the estimated face region xf as input. As shown in Fig. 3 (c), the filters of C1 and C2 in the ANet 103 are globally shared and can be formulated in the same way as Eqns. (1) and (2). The locally shared filters at C3 and C4 are learned to capture different local information in specific facial regions (cells); for example, the highlighted cells A in Fig. 3 (c) correspond to the left eye and the left mouth corner, respectively. These locally shared filters can be formulated as in Eqn. (1), but with a separate filter bank and bias for each cell,
$h_{v,(p,q)}^{(l)} = \operatorname{relu}\big(b_{v,(p,q)}^{(l)} + \sum_u k_{vu,(p,q)}^{(l)} * h_{u,(p,q)}^{(l-1)}\big)$    (4)
where (p, q) is the cell index. However, as shown in Fig. 3 (c), the estimated face region xf is not well-aligned, because of the large variation presented in web images. If we simply apply Eqn. (4), the subsequent face features may contain noise. A simple solution is to densely crop image patches and apply ANet 103 to each of them, but this leads to redundant computations (e.g. at C1 and C2). Therefore, the present application proposes the interweaved operation, which can account for misalignment without cropping multiple patches.
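The cell-wise (locally shared) convolution of Eqn. (4) can be sketched as follows, assuming a single channel and a 2x2 grid of non-overlapping cells; the grid size and helper names are illustrative.

```python
import numpy as np
from scipy.signal import correlate2d

def locally_shared_conv(h_prev, cell_filters, cell_biases, grid=(2, 2)):
    """Each cell (p, q) of the input map is convolved with its own filter.
    h_prev: (H, W); cell_filters[p][q]: 2-D kernel; cell_biases[p][q]: scalar."""
    H, W = h_prev.shape
    gh, gw = grid
    ch, cw = H // gh, W // gw
    out = np.zeros_like(h_prev)
    for p in range(gh):
        for q in range(gw):
            cell = h_prev[p * ch:(p + 1) * ch, q * cw:(q + 1) * cw]
            resp = correlate2d(cell, cell_filters[p][q], mode="same")
            out[p * ch:(p + 1) * ch, q * cw:(q + 1) * cw] = np.maximum(
                resp + cell_biases[p][q], 0.0)
    return out
```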
To better visualize the process, the network structure of C2, C3 and C4 is illustrated again in Fig. 4 (a), where each filter in C3 corresponds to four local regions in C2. These regions can overlap, and the same relationship applies between C4 and C3. For clarity, we consider four filters k^(3)_1, k^(3)_2, k^(3)_3 and k^(3)_4 in C3 and one filter k^(4)_1 in C4, and assume there is only one channel. After obtaining the response map h^(2) in C2, the device 103 applies each filter in C3, using Eqn. (1), to the entire response map h^(2), resulting in the response maps h^(3)_1, h^(3)_2, h^(3)_3 and h^(3)_4, as shown in Fig. 4 (b). In the next step, the device 103 needs to apply k^(4)_1 to these maps. Difficulty exists because the filters in C3 have a spatial relationship: for instance, the response of one C3 filter should lie at the left hand side of the response of its right-hand neighbour. To compensate for these geometric constraints, the interweaved map of C3, h^(3)_inter, is constructed as depicted in Fig. 4 (c), where responses belonging to the same cell are padded together.
Then, the feature map of C4 is calculated as a standard convolution, using Eqn. (1), of the interweaved map h^(3)_inter with the filter k^(4)_1, yielding h^(4)_1. Similarly, feature maps can be obtained for the other locally shared filters in C4.
Since the filters in C4 are assumed to have one channel, the redundant parts of h^(4)_1 are filter responses at other possible spatial positions. To find the desired position, the interweaved map h^(4)_inter of C4 is constructed so as to preserve the geometric constraint, and its maximum component is searched for. The whole process can be viewed as implicitly combining different part detectors (the locally shared filters) under geometric constraints (the interweaved operation) to facilitate accurate localization. The present application then properly crops and pools at different positions to generate multiple views of h^(4); this further suppresses residual misalignment and yields a size that fits the fully-connected layer. Feeding these multi-view response maps into the FC layer of the ANet leads to multi-view representations of the face region. The present application concatenates all the multi-view representations together to obtain the final face representation h.
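One plausible reading of the interweaved map, sketched below for a 2x2 grid of single-channel C3 responses: the responses that the four locally shared filters produce at the same spatial location are placed next to each other in a small block, so that their left/right and top/bottom relationships are preserved; a standard convolution with the C4 filter can then be applied to the result as in Eqn. (1). The exact layout is an assumption based on the description of Fig. 4 (c).

```python
import numpy as np

def interweave(resp_maps, grid=(2, 2)):
    """Build an interweaved map from per-filter response maps.
    resp_maps: (gh * gw, H, W); result: (gh * H, gw * W), where the response
    of filter (p, q) at position (i, j) ends up at (i * gh + p, j * gw + q)."""
    gh, gw = grid
    n, H, W = resp_maps.shape
    assert n == gh * gw
    blocks = resp_maps.reshape(gh, gw, H, W)
    # (gh, gw, H, W) -> (H, gh, W, gw) -> (H * gh, W * gw)
    return blocks.transpose(2, 0, 3, 1).reshape(H * gh, W * gw)
```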
To make the predictor 10 work effectively, the attribute inference system with the cascaded LNet0 101, LNets 102 and ANet 103 in the predictor 10 shall be trained first. To this end, the trainer 20 may receive a general object dataset with category annotations and a face dataset with identity and attribute annotations. The general object dataset and its category annotations are fed to the predictor 10 to obtain a pre-trained face localization model, and the face dataset and its identity annotations are fed into the predictor 10 so as to obtain a pre-trained attribute prediction model. The obtained pre-trained face localization model and pre-trained attribute prediction model are further inputted into the predictor 10, along with the face dataset and its attribute annotations, so as to obtain the final model.
To this end, as shown in Fig. 2, the trainer 20 may comprise a face localization pre-training device 201, an attribute prediction pre-training device 202 and a fine-tuning device 203.
The face localization pre-training device 201 operates to receive the general object dataset with category annotations and the face dataset with identity and attribute annotations, and then to feed the general object dataset and its category annotations into the predictor 10 so as to obtain the pre-trained face localization model. Fig. 5 illustrates a flow chart for the face localization pre-training device 201 to train the predictor 10 to obtain the pre-trained face localization model. At step s501, the face localization pre-training device 201 operates to randomly initialize the neuron weights between the connections of each two convolution layers in the LNet0 101, LNets 102 and ANet 103 of the predictor 10. At step s502, the face localization pre-training device 201 calculates classification errors by classifying each image into one of a plurality of (N) general object categories. Specifically, if the object category is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error. At step s503, the face localization pre-training device 201 operates to back-propagate the classification errors through the layers in the LNet0, LNets and ANet 103 of the predictor 10, so as to update the neuron weights between the connections of each two convolution layers. At step s504, the face localization pre-training device 201 determines whether the training has converged by comparing the currently obtained classification errors with a predetermined threshold; if yes, at step s505, the pre-trained face localization model for the predictor 10 is obtained; otherwise, the process returns to step s502.
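The loop of Fig. 5 may be sketched as follows, with a toy softmax classifier standing in for the cascaded LNet0, LNets and ANet; the feature dimension, learning rate and threshold are assumed values chosen only so that the listing runs, while the 1/N error bookkeeping and the convergence test follow steps s501 to s505.

# Toy version of steps s501-s505: random initialization, +1/N error per
# misclassified image, back-propagation, convergence check.
import numpy as np

rng = np.random.default_rng(0)
N = 5                                                   # number of general object categories
X = rng.standard_normal((200, 16))                      # stand-in image features
y = (X @ rng.standard_normal((16, N))).argmax(axis=1)   # stand-in category annotations

W = 0.01 * rng.standard_normal((16, N))                 # s501: random weight initialization
lr, threshold = 0.5, 0.1 * len(X) / N                   # assumed hyper-parameters

for epoch in range(2000):
    logits = X @ W
    error = np.count_nonzero(logits.argmax(axis=1) != y) / N   # s502: +1/N per wrong category
    if error < threshold:                               # s504: compare with the threshold
        break                                           # s505: pre-trained localization weights obtained
    p = np.exp(logits - logits.max(axis=1, keepdims=True))     # s503: back-propagate the error
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1
    W -= lr * (X.T @ p) / len(X)

print("stopped after", epoch, "epochs with classification error", error)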
As is well known in the art, a standard training process iteratively updates the weights of a model so that the output of the model (the prediction) approaches the ground-truth annotation. For example, in the face attribute prediction task, the goal of model training is to predict the presence of certain attributes (e.g. “male” , “smiling” , etc. ) in a way that aligns with reality (the annotations) . Usually, the weights of the model are initialized randomly. The pre-training performed by the face localization pre-training device 201 resembles standard training, except that the task in pre-training is different from the final task. For example, the task used here for pre-training is to predict the object category (e.g. “car” , “dog” , “mountain” , “flower” , etc. ) present in each image, while the final task is to predict face attributes.
The fine-tuning device 203 proceeds with a fine-tuning that resembles standard training, except that the weights are not initialized randomly but are initialized using the weights of the pre-trained model. Fig. 6 illustrates the flow chart of the fine-tuning performed by the fine-tuning device 203 according to one embodiment of the present application. At step s601, the fine-tuning device 203 initializes the weights of the predictor 10 using the weights of the pre-trained model. At step s602, the fine-tuning device 203 calculates classification errors by predicting attribute tags for each image. Specifically, if the attribute tag is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error. At step s603, the fine-tuning device 203 back-propagates the classification errors through the predictor 10 so as to update the weights. Then, at step s604, the fine-tuning device 203 determines whether the currently obtained classification error is less than a predetermined threshold; if yes, at step s605, the process terminates and the face localization model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s602.
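Because the only structural difference from the pre-training loop is the initialization at step s601, the fine-tuning can be sketched as a routine that starts from the pre-trained weights; the helper name fine_tune, the hyper-parameters and the use of integer attribute tags are assumptions made for illustration only. The same routine, invoked with identity labels during pre-training and attribute tags during fine-tuning, also mirrors the flows of Figs. 7 and 8 described below.

# Toy version of steps s601-s605: start from pre-trained weights and
# optimize attribute-tag errors with the same +1/N bookkeeping.
import numpy as np

def fine_tune(pretrained_W, X, tags, lr=0.5, threshold=1.0, max_epochs=2000):
    """tags: integer attribute labels assumed to lie in range(pretrained_W.shape[1])."""
    W = pretrained_W.copy()                              # s601: initialize from the pre-trained model
    N = W.shape[1]
    for _ in range(max_epochs):
        logits = X @ W
        error = np.count_nonzero(logits.argmax(axis=1) != tags) / N   # s602: +1/N per wrong tag
        if error < threshold:                            # s604: convergence check
            break
        p = np.exp(logits - logits.max(axis=1, keepdims=True))        # s603: back-propagation
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(tags)), tags] -= 1
        W -= lr * (X.T @ p) / len(X)
    return W                                             # s605: fine-tuned model weights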
Fig. 7 illustrates a flow chart for training an attribute prediction model by the attribute prediction pre-training device 202 according to one embodiment of the present application. At step s701, the attribute prediction pre-training device 202 operates to randomly initialize the neuron weights between the connections of each two convolution layers in the LNet0, LNets and ANet of the predictor 10. At step s702, the attribute prediction pre-training device 202 calculates identification errors by classifying each image into one of a plurality of (N) human face identities. Specifically, if the human face identity is predicted correctly, the identification error is zero; otherwise, 1/N is added to the identification error. At step s703, the attribute prediction pre-training device 202 operates to back-propagate the identification errors through the layers in the LNet0, LNets and ANet of the predictor 10, so as to update the neuron weights between the connections of each two convolution layers. At step s704, the attribute prediction pre-training device 202 determines whether the currently obtained identification error is less than a predetermined threshold; if yes, at step s705, the process terminates and the pre-trained attribute prediction model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s702.
Fig. 8 illustrates the flow chart of the fine-tuning of the pre-trained attribute prediction model according to one embodiment of the present application. At step s801, the fine-tuning device 203 initializes the weights of the predictor 10 using the weights of the pre-trained attribute prediction model. At step s802, the fine-tuning device 203 calculates classification errors by predicting attribute tags for each image. Specifically, if the attribute tag is predicted correctly, the classification error is zero; otherwise, 1/N is added to the classification error. At step s803, the fine-tuning device 203 back-propagates the classification errors through the predictor 10 so as to update the weights. Then, at step s804, the fine-tuning device 203 determines whether the currently obtained classification error is less than a predetermined threshold; if yes, at step s805, the process terminates and the attribute prediction model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s802.
Finally, the fine-tuned face localization model and the fine-tuned attribute prediction model are concatenated together to form the final model. The face localization model takes a raw web image as input and outputs the position of the face region in it. Then, the attribute prediction model takes the targeted face region as input and outputs the predicted attribute tags. The final integrated model is thus the concatenation of the face localization model and the attribute prediction model.
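At inference time the concatenation therefore behaves as a two-stage pipeline, which can be outlined as follows; localize_face, predict_attributes and predict_face_attributes are hypothetical stand-ins for the trained sub-models rather than the actual implementation.

# Illustrative two-stage inference pipeline: localization model first, then
# attribute prediction on the cropped face region (all helpers are placeholders).
from typing import Dict, Tuple

def localize_face(web_image) -> Tuple[int, int, int, int]:
    """Stand-in for the fine-tuned face localization model (LNet0 and LNets)."""
    ...

def predict_attributes(face_region) -> Dict[str, bool]:
    """Stand-in for the fine-tuned attribute prediction model (ANet)."""
    ...

def predict_face_attributes(web_image):
    x0, y0, x1, y1 = localize_face(web_image)      # raw web image -> face region position
    face_region = web_image[y0:y1, x0:x1]          # crop the targeted face region
    return predict_attributes(face_region)         # face region -> predicted attribute tags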
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (15)

  1. An apparatus for predicting facial attribute tags, comprising:
    a first location prediction device for predicting a head-shoulder location in an inputted face image;
    a second location prediction device for predicting a face location in the inputted face image from the predicted head-shoulder location; and
    an attribute prediction device for extracting one or more face representations from the face location and classifying desired attributes for the inputted face image from the extracted face representations.
  2. The apparatus according to claim 1, wherein the first location prediction device is configured to calculate a geodesic distance for each position in the inputted face image, and to determine whether the calculated distance is larger than a predetermined threshold; if yes, the first location prediction device determines that this position in the inputted face image belongs to the head-shoulder location.
  3. The apparatus according to claim 2, wherein the geodesic distance is determined based on a density intensity for each position in the inputted face image and a spatial distance between each position and its nearest position in the inputted face image.
  4. The apparatus according to claim 1 or 2, wherein the first location prediction device comprises a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the filters being recurrently applied at each location of the face image to translate and scale the inputted face image.
  5. The apparatus according to claim 1 or 2, wherein the second location prediction device comprises a neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the filters being recurrently applied at each location of the image to translate and scale the inputted face image.
  6. The apparatus according to claim 1 or 2, wherein the attribute prediction device comprises a plurality of convolutional layers,
    wherein each of the convolutional layers is configured with one or more filters for translating and scaling the inputted face image, the filters at a first layer of the convolutional layers and a second layer of the convolutional layers are globally shared, while the filters at a third layer and a fourth layer of the convolutional layers are locally shared.
  7. The apparatus according to claim 1 or 2, wherein the attribute prediction device further comprises a plurality of max-pooling layers, each of which is connected to a corresponding convolutional layer of the convolutional layers and configured to partition received feature maps into a grid with overlapping cells such that the following convolutional layers translate and scale the partitioned grids of the inputted face image in an interweaved way.
  8. The apparatus according to claim 1 or 2, wherein the attribute prediction device further comprises:
    a fully-connected layer connected to a last layer of the convolutional layers and configured to transform response maps for the inputted image, which are produced by the convolutional layers, into a compact and discriminative feature representation.
  9. The apparatus according to any one of claims 1-7, further comprising a trainer configured to:
    feed a general object dataset and its category annotations to the first and the second location prediction devices to obtain a pre-trained face localization model, and feed a face dataset and its identity annotations into the attribute prediction device to obtain a pre-trained attribute prediction model, wherein the obtained pre-trained face localization model and pre-trained attribute prediction model are combined as a final model.
  10. A method for predicting facial attribute tags, comprising:
    predicting a head-shoulder location in an inputted face image;
    predicting a face location in the inputted face image from the predicted head-shoulder location;
    extracting one or more face representations from the predicted face location; and
    classifying desired attributes for the inputted face image from the extracted face representations.
  11. The method according to claim 10, wherein the step of predicting the head-shoulder location further comprises:
    calculating a geodesic distance in a response for each position in the inputted face image, the response being obtained from a first neural network with the inputted face image; and
    determining that the position belongs to the head-shoulder location if said calculated distance is larger than a predetermined threshold.
  12. The method according to claim 11, wherein the geodesic distance is determined based on a density intensity for each position in the inputted face image and a spatial distance between each position and its nearest position in the inputted face image.
  13. The method according to claim 11 or 12, wherein the first neural network comprises a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the method further comprises:
    applying recurrently the filters to each location of the inputted face image so as to translate and scale the inputted face image.
  14. The method according to claim 12, wherein the face location is predicted from the predicted head-shoulder location by a second neural network consisting of a plurality of max-pooling layers and a plurality of convolutional layers, wherein the convolutional layers are configured with globally shared filters, the method further comprises:
    applying recurrently the filters to each location of the inputted face image so as to translate and scale the inputted face image.
  15. The method according to claim 10 or 11, further comprising:
    retrieving a general object dataset with category annotations and a face dataset with identity and attribute annotations;
    determining a pre-trained face localization model based on the general object dataset and its category annotations; and
    determining a pre-trained attribute prediction model based on the face dataset and its identity annotations,
    wherein the pre-trained face localization model and pre-trained attribute prediction model are combined as a final model for predicting facial attribute tags of the inputted face image.
PCT/CN2014/001120 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes WO2016090522A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/001120 WO2016090522A1 (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes
CN201480083724.5A CN107004116B (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face's attribute

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/001120 WO2016090522A1 (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes

Publications (1)

Publication Number Publication Date
WO2016090522A1 true WO2016090522A1 (en) 2016-06-16

Family

ID=56106393

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/001120 WO2016090522A1 (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes

Country Status (2)

Country Link
CN (1) CN107004116B (en)
WO (1) WO2016090522A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549842B (en) * 2018-03-21 2020-08-04 珠海格力电器股份有限公司 Method and device for classifying figure pictures
CN110390295B (en) * 2019-07-23 2022-04-01 深圳市道通智能航空技术股份有限公司 Image information identification method and device and storage medium
CN111783574B (en) * 2020-06-17 2024-02-23 李利明 Meal image recognition method, device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7606777B2 (en) * 2006-09-01 2009-10-20 Massachusetts Institute Of Technology High-performance vision system exploiting key features of visual cortex
CN102054159A (en) * 2009-10-28 2011-05-11 腾讯科技(深圳)有限公司 Method and device for tracking human faces
CN103824054A (en) * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded depth neural network-based face attribute recognition method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018113261A1 (en) * 2016-12-22 2018-06-28 深圳光启合众科技有限公司 Target object recognition method and apparatus, and robot
US10783393B2 (en) 2017-06-20 2020-09-22 Nvidia Corporation Semi-supervised learning for landmark localization
US10783394B2 (en) 2017-06-20 2020-09-22 Nvidia Corporation Equivariant landmark transformation for landmark localization
CN109492571A (en) * 2018-11-02 2019-03-19 北京地平线机器人技术研发有限公司 Identify the method, apparatus and electronic equipment at human body age
WO2020211398A1 (en) * 2019-04-16 2020-10-22 深圳壹账通智能科技有限公司 Portrait attribute model creating method and apparatus, computer device and storage medium
CN114581458A (en) * 2020-12-02 2022-06-03 中强光电股份有限公司 Method for generating image recognition model and electronic device using same

Also Published As

Publication number Publication date
CN107004116A (en) 2017-08-01
CN107004116B (en) 2018-09-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14907875

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14907875

Country of ref document: EP

Kind code of ref document: A1