CN107004116A - Method and apparatus for predicting face attributes - Google Patents

Method and apparatus for predicting face attributes

Info

Publication number
CN107004116A
CN107004116A (application CN201480083724.5A)
Authority
CN
China
Prior art keywords
face
attribute
face image
input
training
Prior art date
Legal status
Granted
Application number
CN201480083724.5A
Other languages
Chinese (zh)
Other versions
CN107004116B (en)
Inventor
汤晓鸥
刘子纬
罗平
王晓刚
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Publication of CN107004116A
Application granted
Publication of CN107004116B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/164Detection; Localisation; Normalisation using holistic features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/179Human faces, e.g. facial parts, sketches or expressions metadata assisted face recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Disclosed are an apparatus and a method for predicting face attribute tags. The apparatus may comprise: a first position prediction device for predicting a head-shoulder position in an input face image; a second position prediction device for predicting a face position in the input face image from the predicted head-shoulder position; and an attribute prediction device for extracting one or more face representations from the face position and classifying desired attributes of the input face image according to the extracted face representations.

Description

Method and apparatus for predicting face attributes
Technical field
The present application relates to a method and an apparatus for predicting face attributes in the wild, that is, under natural, unconstrained conditions.
Background
Face attributes such as expression, race, and hairstyle are useful in many applications, for example image tagging and face verification. Predicting face attributes from web images is difficult because of cluttered backgrounds and face variations such as scale, pose, and illumination. Existing attribute recognition methods first detect the face and its keypoints, and then extract high-dimensional features, such as HOG (histograms of oriented gradients) or LBP (local binary patterns), from image patches centered at the keypoints. The concatenated features are used to train classifiers.
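For illustration only, this conventional keypoint-based pipeline might look like the following sketch; the keypoint coordinates, patch size, and the choice of HOG features with a linear SVM are assumptions rather than details fixed by this application:

```python
import numpy as np
from skimage.feature import hog          # hand-crafted HOG features
from sklearn.svm import LinearSVC        # classifier trained on them

def keypoint_features(gray_image, keypoints, patch=32):
    """Concatenate HOG features of patches centered at detected keypoints."""
    feats = []
    for (x, y) in keypoints:             # keypoints come from a separate detector
        top = max(y - patch // 2, 0)
        left = max(x - patch // 2, 0)
        crop = gray_image[top:top + patch, left:left + patch]
        feats.append(hog(crop, pixels_per_cell=(8, 8), cells_per_block=(2, 2)))
    return np.concatenate(feats)

# X: stacked per-image feature vectors, y: binary attribute labels
# attribute_classifier = LinearSVC().fit(X, y)
```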
Although the above pipeline is suitable for controlled environments, it has drawbacks when handling web images. It depends heavily on the accuracy of face and keypoint detection, which is unreliable on web images. Most prior art methods fail because features are extracted at erroneous keypoint positions, so face detection and alignment cannot be accomplished successfully. Face detection itself is also ambiguous.
Hand-crafted features at predefined keypoints, such as HOG, LBP, and GIST (spatial envelope), are the standard first step of attribute recognition. It has been proposed to combine HOG with color histograms for logistic regression, so as to perform attribute-based object search and tagging. Extracting HOG-like features on various face regions has also been proposed to address the problems of attribute classification and face verification. To improve the discriminativeness of hand-crafted features for a given task, a three-level SVM system has been built to extract higher-level information. Various hand-crafted features have also been combined to obtain intermediate representations for a particular domain.
Recently, deep learning methods have achieved great success in attribute inference, owing to their ability to learn compact and discriminative features. It has been shown that off-the-shelf features learned by a convolutional network (CNN) trained on ImageNet can be effectively adapted to attribute classification. It has also been shown that better performance can be achieved by ensembling the features learned by multiple pose-normalized CNNs. Specific network structures have been devised for attribute prediction; in this line of work, a deep sum-product architecture has been introduced to handle occlusions during attribute inference. The major drawback of the above methods is that they depend heavily on accurate keypoint detection and pose estimation in both the training and testing steps. Even though recent work can localize parts automatically during testing, it still requires keypoint annotations of the training data.
Summary of the invention
The present application proposes a deep learning framework for face attribute prediction in the wild. The present invention solves the problem of predicting face attributes under natural, unconstrained conditions. Specifically, the goal is to automatically tag raw web images with face attribute tags (for example, "male", "young", "smiling", "wearing hat", "big eyes", "oval face", "mustache", etc.).
The proposed deep learning framework does not rely on face or keypoint detection. Instead, it cascades two convolutional networks (CNNs): one (LNet) locates the face region, and the other (ANet) extracts a high-level face representation from the entire located face region (without keypoints) for attribute prediction.
LNet and ANet are trained in a weakly supervised manner, i.e., only the attribute tags of the training images are provided. This is fundamentally different from training face and keypoint detectors, which require face bounding boxes and keypoint positions, and it makes the preparation of training data easy. LNet and ANet are first pre-trained in different manners and then jointly fine-tuned with the attribute tags.
Second, different pre-training and fine-tuning strategies are designed for LNet and ANet. Unlike a face detector trained with positive (face) and negative (non-face) samples, LNet is pre-trained by classifying a large number of general object categories. The pre-trained features therefore generalize well when handling diverse background clutter. LNet is then fine-tuned by predicting attributes. The features learned through attribute prediction capture rich face variations and can be used effectively for face localization; they also better distinguish the subtle differences between faces and similar patterns (such as cat faces). ANet is pre-trained by classifying a large number of face identities so as to obtain a discriminative face representation, and is then fine-tuned on the attribute prediction task.
Third, a fast feed-forward scheme is proposed to perform face localization and attribute prediction in real time. It evaluates web images of arbitrary size. If the filters are globally shared, this can be done by convolving the image with the filters. If the filters are locally shared, the evaluation becomes non-trivial, and, as known in the art, locally shared filters perform better on face-related tasks. This problem is solved by the provided interleaving operation.
In addition to proposing new methods, the application also discloses valuable facts about learning face representations. These facts not only motivated the present invention, but also benefit future research on faces and deep learning.
(1) It shows how supervised pre-training with a large number of object categories and a large number of identities can improve the feature learning of LNet and ANet, for face localization and attribute recognition respectively.
(2) It is demonstrated that, although the filters of LNet are fine-tuned by attribute prediction, their response maps over the entire image effectively indicate the position of the face. A desirable feature for face localization should be able to capture rich face variations, and more supervised information about these variations is used to improve the learning process. To understand this, consider the example in Fig. 1. If, as in Fig. 1 (a), only a single detector is used to classify all positive and negative samples, it is difficult to handle complex face variations. Multi-view face detectors were therefore developed, as in Fig. 1 (b): face images of different views are handled by different detectors. View labels are used to train the detectors, and the whole training set is partitioned into subsets according to view. If views are regarded as one type of face attribute, learning a face representation by predicting attributes with a deep model pushes this idea to the extreme. As shown in Fig. 1 (c), a filter (or a group of filters) functions as a detector of an attribute. When a subset of neurons is activated, they indicate the presence of a face image with a particular attribute configuration. Neurons at different layers can form many activation patterns, implying that the whole set of face images can be divided into many subsets based on attribute configurations, each activation pattern corresponding to one subset (for example, 'pointy nose', 'rosy cheeks', and 'smiling'). It is therefore no wonder that filters learned by attribute prediction yield an effective representation for face localization. Good face localization is achieved simply by taking the average of the response maps and thresholding it.
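A minimal sketch of the averaging-and-thresholding rule just stated follows; the array shapes, the min-max normalization, and the threshold value are illustrative assumptions:

```python
import numpy as np

def localize_from_responses(response_maps, thresh=0.5):
    """response_maps: (C, H, W) activations from the last LNet layer."""
    mean_map = response_maps.mean(axis=0)                  # average over filters
    mean_map = (mean_map - mean_map.min()) / (mean_map.max() - mean_map.min() + 1e-8)
    ys, xs = np.nonzero(mean_map > thresh)                 # threshold the mean map
    if len(ys) == 0:
        return None                                        # no face found
    return xs.min(), ys.min(), xs.max(), ys.max()          # face bounding box
```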
(3) The application also discloses that the high-level hidden neurons of ANet, after pre-training, implicitly learn and discover semantic concepts related to identity (such as race, gender, and age). These concepts are significantly expanded after fine-tuning for attribute classification. This fact shows that attributes are also implicitly learned when a deep model is pre-trained for face recognition. Without the pre-training stage, the performance of attribute prediction drops. With this strategy, each face attribute is well explained by a sparse linear combination of these semantic concepts. By analyzing the coefficients of such combinations, it is shown that attributes form semantically well-interpretable, distinct group patterns.
One aspect of the application discloses an apparatus for predicting face attribute tags, comprising:

a first position prediction device for predicting a head-shoulder position in an input face image;

a second position prediction device for predicting a face position in the input face image from the predicted head-shoulder position; and

an attribute prediction device for extracting one or more face representations from the face position, and classifying desired attributes of the input face image according to the extracted face representations.
Another aspect of the application discloses a method for predicting face attribute tags, comprising:

predicting a head-shoulder position in an input face image;

predicting a face position in the input face image from the predicted head-shoulder position;

extracting one or more face representations from the face position; and

classifying desired attributes of the input face image according to the extracted face representations.
In one embodiment of the application, the step of predicting the head-shoulder position further comprises: for each position in the input face image, calculating a geodesic distance in a response map, the response map being obtained by a first neural network after the face image is input to it; and, if the calculated distance is greater than a predetermined threshold, determining that the position belongs to the head-shoulder position.
In one embodiment of the application, the method may further comprise a training step, which may comprise:

obtaining a general object dataset with category annotations and a face dataset with identity and attribute annotations;

determining a pre-trained face localization model based on the general object dataset and its category annotations; and

determining a pre-trained attribute prediction model based on the face dataset and its identity annotations,

wherein the pre-trained face localization model and the pre-trained attribute prediction model are combined into a final model for predicting the face attribute tags.
Brief description of the drawings
Exemplary non-limiting embodiments of the present invention are described below with reference to the accompanying drawings. The drawings are illustrative and generally not drawn to exact scale. The same or similar elements in different figures are referenced by identical reference numerals.

Fig. 1 is a schematic diagram showing face localization using a single detector (a), multiple views (b), and face localization according to attributes (c).

Fig. 2 is a schematic diagram showing the apparatus for predicting face attributes, consistent with some disclosed embodiments.

Fig. 3 is a schematic diagram showing the proposed pipeline of attribute inference performed by the predictor shown in Fig. 2, consistent with some disclosed embodiments.

Fig. 4 is a schematic diagram showing the interleaving operation of the predictor according to an embodiment of the application.

Fig. 5 is an exemplary flowchart showing the training stage of the face localization model, consistent with some disclosed embodiments.

Fig. 6 is a schematic flowchart showing the fine-tuning of the face localization model, consistent with some disclosed embodiments.

Fig. 7 is an exemplary flowchart showing the training stage of the attribute prediction model, consistent with some disclosed embodiments.

Fig. 8 is a schematic flowchart showing the fine-tuning of the attribute prediction model, consistent with some disclosed embodiments.
Detailed description of the embodiments
Reference will now be made in detail to some specific embodiments of the invention, including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that this is not intended to limit the invention to the described embodiments. On the contrary, the intention is to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail so as not to unnecessarily obscure the present invention.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

As will be appreciated by those skilled in the art, the present invention may be embodied as a system, a method, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects, which may all generally be referred to herein as a "circuit", "device", "module", or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.

It will also be understood that relational terms such as first and second, if any, are used solely to distinguish one entity, item, or action from another, without necessarily requiring or implying any actual relationship or order between such entities, items, or actions.

Much of the inventive functionality and many of the inventive principles are best implemented with or in software or integrated circuits (ICs), such as digital signal processors with software, or application-specific ICs. It is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein, will be readily capable of generating such software instructions or ICs with minimal experimentation. Therefore, in the interest of brevity and of minimizing any risk of obscuring the principles and concepts according to the present invention, further discussion of such software and ICs, if any, is limited to the essentials of the principles and concepts used by the preferred embodiments.
Fig. 2 is a schematic diagram showing an exemplary apparatus 100 for predicting face attributes in the wild according to an embodiment of the application. As shown, the apparatus 100 may include a predictor 10 and a trainer 20.
The predictor 10 may include an attribute inference system composed of multiple stages formed by cascading a first position prediction device 101, a second position prediction device 102, and an attribute prediction device 103 (for example, the four stages shown in Fig. 3).
The first position prediction device 101 is used to predict the head-shoulder position in the input face image. In one embodiment of the application, the first position prediction device 101 is configured to obtain a response map h0 for the input face image, and to calculate, for each position in the image, a geodesic distance in the response map. If the calculated distance is greater than a predetermined threshold, the first position prediction device 101 determines that this position in the image belongs to the head-shoulder position, as will be discussed later.
As shown in Fig. 3, the first position prediction device 101 may include a neural network composed of multiple max pooling layers and multiple convolutional layers (C1 to C5), where the convolutional layers are configured with globally shared filters that are cyclically applied to every position of the image, so as to handle translation and scaling of the face image.
The second position prediction device 102 is used to predict the face position in the input face image from the predicted head-shoulder position. As shown in Fig. 3, the second position prediction device 102 includes a neural network composed of multiple max pooling layers and multiple convolutional layers (C1 to C5), where the convolutional layers of the device 102 are configured with globally shared filters that are cyclically applied to every position of the image, so as to handle translation and scaling of the face image.
The attribute prediction device 103 is used to extract one or more face representations from the face position, for classifying the extracted attributes. The attribute prediction device 103 may include one or more (for example, four, as depicted) convolutional layers (C1 to C4), where each convolutional layer is configured with one or more filters; the filters at the first convolutional layer C1 and the second convolutional layer C2 are globally shared, while the filters at the third convolutional layer C3 and the fourth convolutional layer C4 are locally shared. The attribute prediction device 103 also includes one or more (for example, three) max pooling layers, each connected to a corresponding one of the convolutional layers and configured so that the overall system is robust to local translations. The device 103 further includes a fully connected layer (FC) cascaded to the last convolutional layer (for example, C4, as depicted), to classify the extracted attributes and to learn a compact and discriminative face representation.
Referring to Fig. 3, given a face image x0 of arbitrary size, the first position prediction device 101 computes a response map h0 that indicates the position of the head-shoulder region, as shown in Fig. 3 (a). x0 is then combined with h0 to crop the head-shoulder region, denoted xs. The second position prediction device 102 takes xs as input and outputs a response map hs that indicates the face region. Similarly, hs is combined with xs to locate the face region xf. The position prediction devices 101 and 102 are cascaded so as to provide the face position in a coarse-to-fine manner.
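The coarse-to-fine cascade can be summarized by the following sketch, in which lnet0, lnet, anet, classifier, and crop_peak_region are placeholder callables standing for the components described in this section:

```python
def predict_attributes(x0, lnet0, lnet, anet, classifier, crop_peak_region):
    h0 = lnet0(x0)                 # response map indicating the head-shoulder
    xs = crop_peak_region(x0, h0)  # crop the head-shoulder region
    hs = lnet(xs)                  # response map indicating the face
    xf = crop_peak_region(xs, hs)  # crop the face region
    ha = anet(xf)                  # concatenated multi-view face features
    return classifier(ha)          # predicted attribute tags y
```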
The attribute prediction device 103 is applied to the face region xf to extract response maps, from which the extracted attributes are classified. The attribute prediction device 103 may include a fully connected layer that classifies the attributes y. High responses in these maps are associated with different face parts, showing that the device (ANet) 103 captures subtle face differences, such as the shapes of the lips and eyebrows. In the last stage, as shown in Fig. 3 (d), several candidate windows are selected, and feature vectors are pooled through the fully connected layer. These features are then concatenated into ha to train linear classifiers for attribute recognition.
As shown in Fig. 3, the first position prediction device 101, the second position prediction device 102, and the attribute prediction device 103 may be implemented as different neural networks, denoted LNet0 101, LNet 102, and ANet 103 in Fig. 3. Each of LNet0 101, LNet 102, and ANet 103 may be implemented by software, by an integrated circuit (IC), or by a combination thereof. In an embodiment of the application, as shown in Fig. 3 (a) and Fig. 3 (b), the network structures of LNet0 101 and LNet 102 may be identical, for example stacking two max pooling layers and five convolutional layers (C1 to C5) with globally shared filters. These filters are cyclically applied to every position of the image, so larger face translations and scalings can be accounted for. ANet 103 stacks, for example, four convolutional layers (C1 to C4), three max pooling layers, and one fully connected layer (FC), where the filters at C1 and C2 are globally shared and the filters at C3 and C4 are locally shared. As shown in Fig. 3 (c), the response maps at C2 and C3 are divided into grids of non-overlapping cells, and each cell learns a different filter. The fully connected layer of ANet 103 transforms the response maps generated by the convolutions into a compact and discriminative feature representation. For example, the fully connected layer in the attribute prediction ANet 103 can generate a feature representation suitable for classification (predicting attribute tags such as "male", "young", "smiling", "wearing hat", "big eyes", "oval face", "mustache", etc.).
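A rough PyTorch sketch of the LNet0/LNet stack described above follows. The patent fixes only the general layout (five convolutional layers with globally shared filters, two max pooling layers, and 96 filters of 11 x 11 x 3 at C1), so the remaining channel widths, kernel sizes, and strides are illustrative assumptions:

```python
import torch.nn as nn

lnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),     # C1, 96 filters of 11x11x3
    nn.MaxPool2d(kernel_size=3, stride=2),                     # max pooling 1
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),   # C2
    nn.MaxPool2d(kernel_size=3, stride=2),                     # max pooling 2
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),  # C3
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),  # C4
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),  # C5 -> response map
)
```

An ordinary nn.Conv2d already realizes a globally shared filter, since the same kernel is applied cyclically at every position of the input.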
Locally shared filters have proven effective for face-related problems, because they can capture different information from different face parts. The network structure is indicated in Fig. 3. For example, the filters at C1 of LNet0 101 have multiple (for example, 96) channels, and each filter may be of size 11 × 11 × 3, since the input image x0 contains three color channels.
Since both LNet0 101 and LNet 102 have five convolutional layers, each layer takes the output of the previous layer as input and is formulated as:

$$h_v^{(l)} = \mathrm{relu}\Big(b_v^{(l)} + \sum_u k_{vu}^{(l)} * h_u^{(l-1)}\Big) \qquad (1)$$

where $\mathrm{relu}(x) = \max(0, x)$ is the rectified linear function and $*$ denotes the convolution operator; $h_u^{(l-1)}$ and $h_v^{(l)}$ denote the u-th input channel at layer (l−1) and the v-th output channel at layer l, respectively. $k_{vu}^{(l)}$ and $b_v^{(l)}$ denote the filters and the bias, where the filters capture translation-invariant structures and the bias represents the overall energy level.
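As a direct numpy transcription of equation (1) for a single layer (a sketch assuming 'valid'-mode convolution; scipy's correlate2d plays the role of the * operator, matching the cross-correlation convention used by CNNs):

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(h_prev, k, b):
    """h_prev: (U, H, W) input channels; k: (V, U, kh, kw) filters; b: (V,) biases."""
    V, U = k.shape[0], k.shape[1]
    out = []
    for v in range(V):
        s = sum(correlate2d(h_prev[u], k[v, u], mode='valid') for u in range(U))
        out.append(np.maximum(0.0, b[v] + s))   # relu(b_v + sum_u k_vu * h_u)
    return np.stack(out)                        # (V, H', W') output channels
```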
The max pooling operations at C1 and C2 partition the feature map into a grid of overlapping cells, formulated as:

$$h_{v,(i,j)}^{(l)} = \max_{(p,q)\,\in\,\Omega_{(i,j)}} h_{v,(p,q)}^{(l-1)} \qquad (2)$$

Here, $(i, j)$ denotes the cell with index $(i, j)$, and $(p, q)$ is a location index within the region $\Omega$. The maximum is taken over each small cell, as expressed by equation (2).
After the response map (for example, $h^s$) is obtained, an important problem is how to crop the head-shoulder image patch from x0. A simple solution is to crop the regions whose responses in $h^s$ exceed a threshold. However, difficulties arise when multiple faces are present, since multiple regions with uniformly high responses may be obtained. Therefore, in one embodiment of the application, a fast density peak identification technique is contemplated. A special geodesic distance is computed for each position i in $h^s$:

$$d_i = \rho_i\,\sigma_i, \qquad \sigma_i = \min_{j:\,\rho_j > \rho_i} s_{ij} \qquad (3)$$

where $\rho_i$ is the density intensity at position i, and $s_{ij}$ is the spatial distance between positions i and j. $\sigma_i$ measures the distance from position i to its nearest position with greater density intensity. Density peaks are then identified by selecting the positions with large $d_i$. This process can be further accelerated because $h^s$ is sparse. The bounding window can be proposed by cropping the region with maximum density. Note that the face image $x_f$ can be cropped in a manner similar to the above.
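The density-peak selection can be sketched in numpy as follows; treating the response value itself as the density intensity ρ is an assumption here (the patent only calls ρ the "density intensity"), and the fallback σ for the globally densest position follows the usual density-peak convention:

```python
import numpy as np

def density_peaks(response, top_k=1):
    ys, xs = np.nonzero(response > 0)             # exploit sparsity of the map
    rho = response[ys, xs]                        # density intensity rho_i
    pos = np.stack([ys, xs], axis=1).astype(float)
    d = np.zeros(len(rho))
    for i in range(len(rho)):
        higher = rho > rho[i]
        if higher.any():                          # sigma_i: distance to nearest
            d[i] = rho[i] * np.linalg.norm(pos[higher] - pos[i], axis=1).min()
        else:                                     # densest point: use max distance
            d[i] = rho[i] * np.linalg.norm(pos - pos[i], axis=1).max()
    order = np.argsort(-d)[:top_k]                # positions with largest d_i
    return pos[order].astype(int)                 # peak coordinates (y, x)
```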
ANet 103 takes the estimated face region $x_f$ as input. The filters of C1 and C2 in ANet 103 are globally shared and can be formulated in the same manner as equations (1) and (2). The locally shared filters at C3 and C4 learn to capture different local information in specific face regions (cells); for example, the highlighted cells A in Fig. 3 (c) correspond to the left eye and the left corner of the mouth, respectively. These locally shared filters can be formulated as:

$$h_{v,(p,q)}^{(l)} = \mathrm{relu}\Big(b_{v,(p,q)}^{(l)} + \sum_u k_{vu,(p,q)}^{(l)} * h_{u,(p,q)}^{(l-1)}\Big) \qquad (4)$$

where $(p, q)$ is the cell index.
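Equation (4) can be sketched by splitting the input map into a grid of cells and giving each cell its own filter bank; the 2 x 2 grid and the exact stitching of cell outputs are illustrative assumptions:

```python
import numpy as np
from scipy.signal import correlate2d

def local_conv_layer(h_prev, k_cells, b_cells, grid=(2, 2)):
    """h_prev: (U, H, W); k_cells[p][q]: (V, U, kh, kw); b_cells[p][q]: (V,)."""
    U, H, W = h_prev.shape
    gh, gw = H // grid[0], W // grid[1]            # non-overlapping cell sizes
    rows = []
    for p in range(grid[0]):
        row = []
        for q in range(grid[1]):
            cell = h_prev[:, p * gh:(p + 1) * gh, q * gw:(q + 1) * gw]
            k, b = k_cells[p][q], b_cells[p][q]    # this cell's own filter bank
            out = [np.maximum(0.0, b[v] + sum(
                       correlate2d(cell[u], k[v, u], mode='valid')
                       for u in range(U)))
                   for v in range(k.shape[0])]     # equation (4) per cell (p, q)
            row.append(np.stack(out))
        rows.append(np.concatenate(row, axis=2))   # stitch cells along width
    return np.concatenate(rows, axis=1)            # then along height
```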
However, as shown in Fig. 3 (c), the estimated face region $x_f$ is not well aligned, since web images exhibit large variations. If equation (4) is applied naively, the subsequent face features may contain noise. A simple solution is to densely crop image patches and apply ANet 103 to each of them, but this introduces redundant computation (for example, at C1 and C2). The application therefore proposes an interleaving operation that accounts for misalignment without cropping multiple image patches.
To better visualize this process, the network structure of C2, C3, and C4 is shown again in Fig. 4 (a), where each filter in C3 corresponds to one of four local regions in C2. These regions may overlap, and the same relation applies between C4 and C3. For clarity, consider four filters in C3, $k_1^{(3)}$, $k_2^{(3)}$, $k_3^{(3)}$, and $k_4^{(3)}$, and one filter in C4, $k_1^{(4)}$, and assume there is only one channel. After the response map $h^{(2)}$ in C2 is obtained, the device 103 applies each filter in C3 to the whole response map $h^{(2)}$ using equation (1), yielding response maps $h_1^{(3)}$, $h_2^{(3)}$, $h_3^{(3)}$, and $h_4^{(3)}$, as shown in Fig. 4 (b). In the next step, the device 103 needs to apply $k_1^{(4)}$ to these maps. A difficulty arises because the filters in C3 have spatial relations; for example, the responses of $k_1^{(3)}$ should lie on the left-hand side of those of $k_2^{(3)}$. To preserve these geometric constraints, an interleaved map of C3, $\tilde{h}^{(3)}$, is constructed as shown in Fig. 4 (c), where the responses in small cells are tiled together.

Then the feature map of C4 is computed by standard convolution using equation (1), $h_1^{(4)} = \mathrm{relu}\big(b_1^{(4)} + k_1^{(4)} * \tilde{h}^{(3)}\big)$. Similarly, the feature maps of the other locally shared filters in C4, $h_2^{(4)}, h_3^{(4)}, \ldots$, can be obtained.

Since the filter in C4 is assumed to have one channel, the redundant parts in $h_i^{(4)}$ are the filter's responses at the other possible spatial positions. To find the desired positions, the interleaved map of C4, $\tilde{h}^{(4)}$, is constructed so as to preserve the geometric constraints, and the largest components are searched for.

The whole process can be regarded as implicitly combining the detectors of different parts (the locally shared filters) under geometric constraints (the interleaving operation), which contributes to accurate localization. The application then crops and pools at suitably different locations to generate multiple views of $h^{(4)}$. This further suppresses residual misalignment and keeps the size of the fully connected layer manageable. Feeding these multi-view response maps to the FC layer of ANet yields multi-view representations of the face region. The application concatenates all the multi-view representations to obtain the final face representation h.
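For the single-channel, 2 x 2 case discussed above, the interleaving can be sketched as follows. This is an illustrative reconstruction, not necessarily the patent's exact algorithm: each C3 filter is applied to the whole map, the interleaved map places each filter's responses at the offsets it geometrically owns, one standard convolution with the C4 filter then evaluates all candidate alignments at once, and the largest component is kept:

```python
import numpy as np
from scipy.signal import correlate2d

def interleave_2x2(maps):
    """maps: list of four (H, W) responses [h3_1, h3_2, h3_3, h3_4]."""
    H, W = maps[0].shape
    tilde = np.zeros((2 * H, 2 * W))
    for idx, m in enumerate(maps):
        dy, dx = divmod(idx, 2)        # geometric offset owned by filter idx
        tilde[dy::2, dx::2] = m
    return tilde                       # interleaved map of C3

def c4_response(maps, k4, b4):
    tilde_h3 = interleave_2x2(maps)    # preserve the geometric constraints
    h4 = np.maximum(0.0, b4 + correlate2d(tilde_h3, k4, mode='valid'))
    return h4.max()                    # keep the largest component
```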
For the predictor 10 to work effectively, the attribute inference system with the cascaded LNet0 101, LNet 102, and ANet 103 should first be trained. To this end, the trainer 20 may receive a general object dataset with category annotations and a face dataset with identity and attribute annotations. It then feeds the general object dataset and its category annotations to the predictor 10 to obtain a pre-trained face localization model, and feeds the face dataset and its identity annotations to the predictor 10 to obtain a pre-trained attribute prediction model. The obtained pre-trained face localization model and pre-trained attribute prediction model are further input to the predictor 10, together with the face dataset and its attribute annotations, to obtain the final model.
Accordingly, as shown in Fig. 2, the trainer 20 may include a face localization pre-training device 201, an attribute prediction pre-training device 202, and a fine-tuning device 203.
The face localization pre-training device 201 operates to receive the general object dataset with category annotations and the face dataset with identity and attribute annotations, and then feeds the general object dataset and its category annotations to the predictor 10 to obtain the pre-trained face localization model. Fig. 5 shows a flowchart of the face localization pre-training device 201 training the predictor 10 to obtain the pre-trained face localization model. At step s501, the face localization pre-training device 201 operates to randomly initialize the neuron weights of the connections between every two convolutional layers in LNet0 101, LNet 102, and ANet 103 of the predictor 10. At step s502, the face localization pre-training device 201 calculates a classification error by classifying each image into one of multiple (N) general object categories. Specifically, if the object category is correctly predicted, the classification error is zero; otherwise the classification error increases by 1/N. At step s503, the face localization pre-training device 201 operates to back-propagate the classification error through the layers of LNet0, LNet, and ANet of the predictor 10, so as to update the neuron weights of the connections between every two convolutional layers. At step s504, the face localization pre-training device 201 determines whether training has converged by comparing the currently obtained classification error with a predetermined threshold; if it has converged, the pre-trained face localization model of the predictor 10 is obtained at step s505; otherwise the process returns to step s502.
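A hedged PyTorch sketch of this pre-training loop follows; the optimizer, learning rate, and the use of cross-entropy in place of the patent's 0/1-per-image error count are assumptions, and the dataset and model objects are placeholders:

```python
import torch
import torch.nn.functional as F

def pretrain(model, loader, n_epochs=10, eps=1e-3, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # weights start random (s501)
    for _ in range(n_epochs):
        total_err = 0.0
        for images, labels in loader:                  # labels: object categories
            loss = F.cross_entropy(model(images), labels)  # classification error (s502)
            opt.zero_grad()
            loss.backward()                            # back-propagation (s503)
            opt.step()                                 # weight update
            total_err += loss.item()
        if total_err / len(loader) < eps:              # convergence test (s504)
            break
    return model                                       # pre-trained model (s505)
```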
As known in the art, a standard training procedure updates the weights of a model so that the model's outputs (predictions) repeatedly come closer to the ground-truth annotations. For example, in the face attribute prediction task, the goal of model training is that the predictions align with the actual (annotated) presence of certain attributes (for example, "male", "smiling", etc.). Usually, the weights of the model are randomly initialized. The pre-training performed by the face localization pre-training device 201 is similar to standard training, except that the pre-training task differs from the final task. Here, for example, the pre-training task is to predict the object categories present in each image (for example, "car", "dog", "mountain", "flower", etc.), while the final task is to predict face attributes.
The fine-tuning device 203 proceeds with fine-tuning similar to standard training, except that the weights are not randomly initialized but are initialized with the weights of the pre-trained model. Fig. 6 shows a flowchart of the fine-tuning by the fine-tuning device 203 according to an embodiment of the application. At step s601, the fine-tuning device 203 initializes the weights of the predictor 10 with those of the pre-trained model. At step s602, the fine-tuning device 203 calculates a classification error by predicting the attribute tags of each image. Specifically, if the attribute tags are correctly predicted, the classification error is zero; otherwise the classification error increases by 1/N. At step s603, the fine-tuning device 203 back-propagates the classification error through the predictor 10 to update the weights. Then, at step s604, the fine-tuning device 203 determines whether the currently obtained classification error is less than a predetermined threshold; if so, at step s605 the process terminates and the face localization model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s602.
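The fine-tuning loop differs from the pre-training sketch above only in the initialization and the targets; a sketch follows, where scoring multi-label attribute tags with a per-attribute sigmoid loss is a common choice assumed here, not a detail fixed by the patent:

```python
import torch
import torch.nn.functional as F

def finetune(model, pretrained_state, loader, n_epochs=10, eps=1e-3, lr=0.001):
    model.load_state_dict(pretrained_state)            # init from pre-training (s601)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        total_err = 0.0
        for images, attrs in loader:                   # attrs: 0/1 attribute tags
            loss = F.binary_cross_entropy_with_logits( # attribute error (s602)
                model(images), attrs.float())
            opt.zero_grad()
            loss.backward()                            # back-propagation (s603)
            opt.step()
            total_err += loss.item()
        if total_err / len(loader) < eps:              # threshold test (s604)
            break
    return model                                       # fine-tuned model (s605)
```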
Fig. 7 shows a flowchart of the attribute prediction pre-training device 202 training the attribute prediction model according to an embodiment of the application. At step s701, the attribute prediction pre-training device 202 operates to randomly initialize the neuron weights of the connections between every two convolutional layers in LNet0, LNet, and ANet of the predictor 10. At step s702, the attribute prediction pre-training device 202 calculates an identification error by classifying each image into one of multiple (N) face identities. Specifically, if the face identity is correctly predicted, the identification error is zero; otherwise the identification error increases by 1/N. At step s703, the attribute prediction pre-training device 202 operates to back-propagate the identification error through the layers of LNet0, LNet, and ANet of the predictor 10, so as to update the neuron weights of the connections between every two convolutional layers. At step s704, the attribute prediction pre-training device 202 determines whether the currently obtained identification error is less than a predetermined threshold; if so, at step s705 the process terminates and the pre-trained prediction model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s702.
Fig. 8 shows a flowchart of the fine-tuning of the pre-trained attribute prediction model according to an embodiment of the application. At step s801, the fine-tuning device 203 initializes the weights of the predictor 10 with those of the pre-trained attribute prediction model. At step s802, the fine-tuning device 203 calculates a classification error by predicting the attribute tags of each image. Specifically, if the attribute tags are correctly predicted, the classification error is zero; otherwise the classification error increases by 1/N. At step s803, the fine-tuning device 203 back-propagates the classification error through the predictor 10 to update the weights. Then, at step s804, the fine-tuning device 203 determines whether the currently obtained classification error is less than a predetermined threshold; if so, at step s805 the process terminates and the attribute prediction model formed by the updated weights of the predictor 10 is obtained; otherwise, the process returns to step s802.
Finally, the fine-tuned face localization model and the fine-tuned attribute prediction model are linked together to form the final model. The face localization model takes a raw web image as input and outputs the position of the face region in the image. The attribute prediction model then takes the target face region as input and outputs the predicted attribute tags. The final combined model is thus the concatenation of the face localization model and the attribute prediction model.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or to limit the invention to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (15)

1. An apparatus for predicting face attribute tags, comprising:

a first position prediction device for predicting a head-shoulder position in an input face image;

a second position prediction device for predicting a face position in the input face image from the predicted head-shoulder position; and

an attribute prediction device for extracting one or more face representations from the face position, and classifying desired attributes of the input face image according to the extracted face representations.
2. The apparatus according to claim 1, wherein the first position prediction device is configured to calculate a geodesic distance for each position in the input face image and to determine whether the calculated distance is greater than a predetermined threshold; if it is, the first position prediction device determines that this position in the input image belongs to the head-shoulder position.
3. The apparatus according to claim 2, wherein the geodesic distance is determined based on a density intensity of each position in the input face image and a spatial distance between each position and the nearest position in the input face image.
4. The apparatus according to claim 1 or 2, wherein the first position prediction device comprises a neural network composed of multiple max pooling layers and multiple convolutional layers, wherein the convolutional layers are configured with globally shared filters, the filters being cyclically applied to every position of the face image so as to translate and scale the input face image.
5. The apparatus according to claim 1 or 2, wherein the second position prediction device comprises a neural network composed of multiple max pooling layers and multiple convolutional layers, wherein the convolutional layers are configured with globally shared filters, the filters being cyclically applied to every position of the image so as to translate and scale the input face image.
6. The apparatus according to claim 1 or 2, wherein the attribute prediction device comprises multiple convolutional layers,

wherein each convolutional layer is configured with one or more filters for translating and scaling the input face image, the filters at the first layer and the second layer of the convolutional layers being globally shared, and the filters at the third layer and the fourth layer of the convolutional layers being locally shared.
7. The apparatus according to claim 1 or 2, wherein the attribute prediction device further comprises multiple pooling layers, each pooling layer being connected to a corresponding one of the convolutional layers and configured to divide a received feature map into a grid of overlapping cells, such that subsequent convolutional layers translate and scale the divided grid of the input face image in an interleaved manner.
8. The apparatus according to claim 1 or 2, wherein the attribute prediction device further comprises:

a fully connected layer, connected to the last of the convolutional layers and configured to transform the response maps of the input image generated by the convolutional layers into a compact and discriminative feature representation.
9. The apparatus according to any one of claims 1 to 7, further comprising a trainer configured to:

feed a general object dataset and its category annotations to the first position prediction device and the second position prediction device to obtain a pre-trained face localization model, and feed a face dataset and its identity annotations to the attribute prediction device to obtain a pre-trained attribute prediction model, wherein the obtained pre-trained face localization model and pre-trained attribute prediction model are combined into a final model.
10. A method for predicting face attribute tags, comprising:

predicting a head-shoulder position in an input face image;

predicting a face position in the input face image from the predicted head-shoulder position;

extracting one or more face representations from the predicted face position; and

classifying desired attributes of the input face image according to the extracted face representations.
11. The method according to claim 10, wherein the step of predicting the head-shoulder position further comprises:

for each position in the input face image, calculating a geodesic distance in a response map, wherein a first neural network obtains the response map after the face image is input to it; and

if the calculated distance is greater than a predetermined threshold, determining that the position in question belongs to the head-shoulder position.
12. The method according to claim 11, wherein the geodesic distance is determined based on a density intensity of each position in the input face image and a spatial distance between each position and the nearest position in the input face image.
13. The method according to claim 11 or 12, wherein the first neural network comprises multiple max pooling layers and multiple convolutional layers, the convolutional layers being configured with globally shared filters, the method further comprising:

cyclically applying the filters to every position of the input face image so as to translate and scale the input face image.
14. The method according to claim 13, wherein the face position is predicted from the predicted head-shoulder position by a second neural network, the second neural network being composed of multiple max pooling layers and multiple convolutional layers, wherein the convolutional layers are configured with globally shared filters, the method further comprising:

cyclically applying the filters to every position of the input face image so as to translate and scale the input face image.
15. The method according to claim 10 or 11, further comprising:

obtaining a general object dataset with category annotations and a face dataset with identity and attribute annotations;

determining a pre-trained face localization model based on the general object dataset and its category annotations; and

determining a pre-trained attribute prediction model based on the face dataset and its identity annotations,

wherein the pre-trained face localization model and the pre-trained attribute prediction model are combined into a final model for predicting the face attribute tags of the input face image.
CN201480083724.5A 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes Active CN107004116B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/001120 WO2016090522A1 (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes

Publications (2)

Publication Number Publication Date
CN107004116A true CN107004116A (en) 2017-08-01
CN107004116B CN107004116B (en) 2018-09-21

Family

ID=56106393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201480083724.5A Active CN107004116B (en) 2014-12-12 2014-12-12 Method and apparatus for predicting face attributes

Country Status (2)

Country Link
CN (1) CN107004116B (en)
WO (1) WO2016090522A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229263B (en) * 2016-12-22 2021-03-02 杭州光启人工智能研究院 Target object identification method and device and robot
US10783394B2 (en) 2017-06-20 2020-09-22 Nvidia Corporation Equivariant landmark transformation for landmark localization
CN109492571B (en) * 2018-11-02 2020-10-09 北京地平线机器人技术研发有限公司 Method and device for identifying human age and electronic equipment
CN110135263A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Portrait attribute model construction method, device, computer equipment and storage medium
CN114581458A (en) * 2020-12-02 2022-06-03 中强光电股份有限公司 Method for generating image recognition model and electronic device using same

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US7606777B2 (en) * 2006-09-01 2009-10-20 Massachusetts Institute Of Technology High-performance vision system exploiting key features of visual cortex

Patent Citations (2)

Publication number Priority date Publication date Assignee Title
CN102054159A (en) * 2009-10-28 2011-05-11 腾讯科技(深圳)有限公司 Method and device for tracking human faces
CN103824054A * 2014-02-17 2014-05-28 北京旷视科技有限公司 Cascaded deep neural network-based face attribute recognition method

Non-Patent Citations (1)

Title
ZIWEI LIU ET AL.: "Deep Learning Face Attributes in the Wild", eprint arXiv *

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN108549842A (en) * 2018-03-21 2018-09-18 珠海格力电器股份有限公司 Method and device for classifying figure pictures
CN108549842B (en) * 2018-03-21 2020-08-04 珠海格力电器股份有限公司 Method and device for classifying figure pictures
CN110390295A (en) * 2019-07-23 2019-10-29 深圳市道通智能航空技术有限公司 Image information recognition method, device and storage medium
CN110390295B (en) * 2019-07-23 2022-04-01 深圳市道通智能航空技术股份有限公司 Image information identification method and device and storage medium
CN111783574A (en) * 2020-06-17 2020-10-16 李利明 Meal image recognition method and device and storage medium
CN111783574B (en) * 2020-06-17 2024-02-23 李利明 Meal image recognition method, device and storage medium

Also Published As

Publication number Publication date
WO2016090522A1 (en) 2016-06-16
CN107004116B (en) 2018-09-21

Similar Documents

Publication Publication Date Title
CN107004116B (en) Method and apparatus for predicting face attributes
Xu et al. Reasoning-rcnn: Unifying adaptive global reasoning into large-scale object detection
Wei et al. Convolutional pose machines
Paul et al. Combining deep neural network and traditional image features to improve survival prediction accuracy for lung cancer patients from diagnostic CT
Amer et al. Sum-product networks for modeling activities with stochastic structure
Islam et al. Solid waste bin detection and classification using Dynamic Time Warping and MLP classifier
CN106778856A Object identification method and device
CN109978918A Trajectory tracking method, apparatus and storage medium
KR20160091786A (en) Method and apparatus for managing user
KR20190126857A (en) Detect and Represent Objects in Images
Shin et al. Organ detection using deep learning
Zhang et al. Whole slide image classification via iterative patch labelling
Hamza et al. An integrated parallel inner deep learning models information fusion with Bayesian optimization for land scene classification in satellite images
Xu et al. Representative feature alignment for adaptive object detection
Deng et al. View-invariant gait recognition based on deterministic learning and knowledge fusion
CN109543590B (en) Video human behavior recognition algorithm based on behavior association degree fusion characteristics
Kampffmeyer Advancing Segmentation and Unsupervised Learning Within the Field of Deep Learning
Natesan et al. Prediction of Healthy and Unhealthy Food Items using Deep Learning
Rangkuti et al. Structured support vector machine learning of conditional random fields
Baisa Single to multiple target, multiple type visual tracking
Ahmed Contextual Scene Understanding: Template Objects Detector and Feature Descriptors for Indoor/Outdoor Scenarios
Ngan et al. Closing the Neural-Symbolic Cycle: Knowledge Extraction, User Intervention and Distillation from Convolutional Neural Networks
Zhelezniakov Automatic image-based identification of saimaa ringed seals
Liu Exploiting multispectral and contextual information to improve human detection
Karacs et al. Learning hierarchical spatial semantics for visual orientation devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant