CN115588217A - Face attribute detection method based on deep self-attention network

Info

Publication number
CN115588217A
CN115588217A
Authority
CN
China
Prior art keywords
face
attribute
layer
face image
face attribute
Prior art date
Legal status
Pending
Application number
CN202210720368.XA
Other languages
Chinese (zh)
Inventor
刘德成
彭春蕾
何维杰
张鼎文
王楠楠
李洁
高新波
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210720368.XA
Publication of CN115588217A
Legal status: Pending

Classifications

    • G06V40/161 Human faces, e.g. facial parts, sketches or expressions: Detection; Localisation; Normalisation
    • G06V10/26 Image preprocessing: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/764 Recognition or understanding using pattern recognition or machine learning: using classification, e.g. of video objects
    • G06V10/765 Recognition or understanding using pattern recognition or machine learning: using rules for classification or partitioning the feature space
    • G06V10/774 Processing image or video features in feature spaces: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/806 Fusion: combining data from various sources at the level of extracted features
    • G06V10/82 Recognition or understanding using pattern recognition or machine learning: using neural networks


Abstract

The invention relates to a face attribute detection method based on a deep self-attention network, which comprises the following steps: step 1, acquiring a training sample set, wherein the training sample set comprises N face images and the identity information of each face image, each face image comprises A face attribute labels, and N and A are natural numbers larger than 0; step 2, training a face attribute detection model by using the training sample set to obtain a trained deep face attribute detection model, wherein the deep face attribute detection model comprises a shared attribute feature learning module and a specific attention feature learning module; and step 3, inputting the face image to be detected into the trained deep face attribute detection model to obtain a detection result. The invention further provides an identity-related hierarchical face attribute loss function: by inputting face attributes and face identities simultaneously, the auxiliary task of learning the relationship between them guides the model to learn the face attribute detection task better, thereby improving detection accuracy.

Description

Face attribute detection method based on deep self-attention network
Technical Field
The invention belongs to the technical field of face attribute detection, and relates to a face attribute detection method based on a deep self-attention network.
Background
Face attribute detection analyzes semantic information (such as age and gender) in face image data. It is widely applied in fields such as video surveillance, face retrieval and social media. In recent years, with the development of deep learning, image classification has made great progress; however, because different face attributes differ greatly from one another, the accuracy of detecting the various attributes contained in face images still falls short of what practical applications require. Deep-learning-based face attribute detection combines feature extraction and attribute classification into end-to-end learning, which improves the detection rate and has made face attribute detection an increasingly active topic in image processing.
The face attribute can be used as a separate task of face recognition and can also be used as auxiliary information to assist other tasks. Although current face attribute detection algorithms have achieved good performance, there are still some challenging problems and shortcomings to be addressed.
Face image analysis plays an important role in biometric security and computer vision. Face attributes can be regarded as key semantic information and applied to many real-world scenarios (such as surveillance, image retrieval and face attribute tampering). The core challenge of face attribute detection is to extract appropriate features that bridge two different domains: visual description words and image pixels. Although the advent of convolutional neural networks has brought tremendous progress to face attribute detection, many challenges remain in real-world settings.
Because captured face images come from widely varying scenes, their backgrounds are complex and the pose and illumination changes between images are large, which inevitably affects face attribute detection accuracy.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a face attribute detection method based on a deep self-attention network. The technical problem to be solved by the invention is realized by the following technical scheme:
the embodiment of the invention provides a face attribute detection method based on a deep self-attention network, which comprises the following steps:
step 1, acquiring a training sample set, wherein the training sample set comprises N human face images and identity information of each human face image, each human face image comprises A human face attribute labels, and N and A are natural numbers larger than 0;
step 2, training a face attribute detection model by using the training sample set to obtain a trained deep face attribute detection model, wherein the deep face attribute detection model comprises a shared attribute feature learning module and a specific attention feature learning module;
and 3, inputting the face image to be detected into the trained deep face attribute detection model to obtain a detection result.
In one embodiment of the present invention, the step 2 comprises:
step 2.1, segmenting the face image in the training sample set into a plurality of non-overlapping windows;
2.2, inputting the segmented face image and the identity information of the face image into the deep face attribute detection model so as to establish a hierarchical identity information limiting loss function according to the output of the deep face attribute detection model;
step 2.3, minimizing the hierarchical identity information limiting loss function by using a stochastic gradient descent algorithm;
and 2.4, obtaining the trained deep face attribute detection model from the minimized hierarchical identity information limiting loss function.
In an embodiment of the present invention, the shared attribute feature learning module includes a linear embedding layer, m image block fusion layers, (m+1) Swin Transformer layers, a pooling layer and a first fully connected layer, wherein the linear embedding layer, the (m+1) Swin Transformer layers and the first fully connected layer are connected in sequence, an image block fusion layer is disposed before each of the 2nd to (m+1)th Swin Transformer layers, and the pooling layer is disposed after the (m+1)th Swin Transformer layer.
In one embodiment of the invention, the attention-specific feature learning module comprises: a global attribute branch module and a plurality of local region branch modules and identity branch modules.
In an embodiment of the present invention, the global attribute branching module and each of the local region branching modules include a second fully-connected layer, a third fully-connected layer, a first ReLU activation function layer, a first dropout layer, and a first batch normalization layer, which are connected in sequence.
In an embodiment of the present invention, the identity branching module includes a fourth full connection layer, a fifth full connection layer, a second ReLU activation function layer, a second dropout layer, and a second batch normalization layer, which are connected in sequence, outputs of the second full connection layers of all the local area branching modules are connected and then input to the identity branching module, and an output of the identity branching module is used to calculate a global identity loss.
In one embodiment of the invention, the hierarchical identity information limiting loss function is:

$$\mathcal{L} = \lambda\left(\alpha\,\mathrm{Loss}_F + (1-\alpha)\,\mathrm{Loss}_C\right) + \beta\,\mathrm{Loss}_G$$

wherein $\alpha$, $\lambda$ and $\beta$ are weight parameters;

$$\mathrm{Loss}_F = \sum_{i,j} w_{i,j}\,\left\lVert f_i - f_j \right\rVert_2^2$$

$$\mathrm{Loss}_C = -\frac{1}{N}\sum_{n=1}^{N}\log p\!\left(y_n^{\mathrm{id}} \mid x_n\right)$$

$$\mathrm{Loss}_G = -\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G}\sum_{a=1}^{A_g}\left[y_{i,a}^{g}\log p_{i,a}^{g} + \left(1-y_{i,a}^{g}\right)\log\left(1-p_{i,a}^{g}\right)\right]$$

wherein $C$ is the number of identities in the training sample set, $x_n$ is the $n$th face image, $y_n^{\mathrm{id}}$ is the identity information of the $n$th face image, $w_{i,j}$ is 1 if the identities of the two input face images are the same and 0 otherwise, $f_i$ and $f_j$ are the corresponding features obtained for the $i$th and $j$th face images through the deep face attribute detection model, $G$ is the number of attribute groups divided according to the face attribute grouping strategy, $A_g$ is the number of face attributes in the $g$th attribute group, $y_{i,a}^{g}$ is the $a$th attribute label within the $g$th attribute group of the $i$th face image, and $p_{i,a}^{g}$ is the probability of the $a$th attribute within the $g$th attribute group of the $i$th face image.
In an embodiment of the present invention, step 3 is preceded by:
acquiring a test sample set, wherein the test sample set comprises M face images and the identity information of each face image, and each face image comprises A face attribute labels;
and detecting the face attributes of the test sample set by using the trained deep face attribute detection model to obtain a test result.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention provides a semi-automatic face attribute grouping strategy. Because face attributes are interrelated, a grouping strategy is applied to the face attribute detection task: a multi-task face attribute detection method groups attributes according to the correlation among them, so that each attribute group corresponds to one learning task. How to group these diverse face attributes reasonably, however, remains a challenge. According to the face attribute visualization heat maps obtained from the deep self-attention neural network, face attributes with similar heat maps are assigned to the same attribute group. Thus, attributes that describe global facial features, such as attractive, young and chubby, are placed in one attribute group, while attributes that mainly describe the same local facial region, such as bald and bangs for the head region or big nose for the nose region, are placed in the corresponding local attribute groups. Unlike previous face attribute grouping methods, which rely on prior knowledge (such as data type, semantics and subjectivity), the proposed strategy effectively divides the face attributes into suitable classes according to their different semantic information. Specifically, the 40 face attributes are divided into 7 attribute groups, which strengthens the intrinsic spatial relationship within each attribute group and helps the model better learn the intrinsic relationships among different attributes in the same semantic group.
2. The invention uses a model based on a deep self-attention mechanism to extract shared face attribute features. An attention mechanism lets the model weight different contents of an image differently during processing, so that, with limited attention, it can extract meaningful information from large amounts of image data. For the model to process a face image efficiently, the image must first be converted into a sequence: the original face image undergoes an image blocking operation that splits it into n smaller image blocks; each block is flattened and can then be regarded as a one-dimensional feature; position information is embedded into these features; and finally category information is embedded into the overall features, converting the image into sequence features (a minimal sketch of this conversion follows this list). Unlike traditional convolutional models, which attend only to the features of adjacent regions, the model used in the invention can attend to the global features of the picture. To strengthen this further, it uses not only an attention mechanism but also shifted windows: a shifted-window attention mechanism makes the model focus more on the global features of the picture.
3. Considering the internal relationship between the semantic attribute and the face identity, the invention provides an identity-related hierarchical face attribute loss function to model the correlation between the face attribute and the face identity. Usually, there is an internal association between two face images with the same identity, and these face images have a face attribute with a very high similarity, for example, both face images will have a high cheekbone attribute. The face images with different identities have face attributes with lower similarity, for example, one face image has a face attribute with a big nose, and the other face image does not have the face attribute with the big nose. The high-level identity information is helpful for the model to learn the discriminative characteristics for the face attribute detection to a certain extent. In order to model the relationship between semantic attributes and face identities, the invention provides a hierarchical face attribute loss function related to the identity, and the task of learning the relationship between the face attributes and the face identities can guide the model to better learn the face attribute detection task by simultaneously inputting the face attributes and the face identities, so that the detection accuracy is improved.
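By way of illustration, the following minimal PyTorch-style sketch shows the image-to-sequence conversion described in point 2 above (patch splitting, flattening, position embedding and a class token). The patch size, embedding dimension and module name are illustrative assumptions, not values fixed by this disclosure.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Image-to-sequence conversion: split the image into patches, flatten
    each patch, add position information, and prepend a class token.
    Patch size and embedding width are illustrative."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=96):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # patchify + flatten + linear projection, done as one strided convolution
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, N, D): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # embed the category information
        return x + self.pos_embed              # embed the position information
```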
Other aspects and features of the present invention will become apparent from the following detailed description, which proceeds with reference to the accompanying drawings. It is to be understood, however, that the drawings are designed solely for purposes of illustration and not as a definition of the limits of the invention, for which reference should be made to the appended claims. It should be further understood that the drawings are not necessarily drawn to scale and that, unless otherwise indicated, they are merely intended to conceptually illustrate the structures and procedures described herein.
Drawings
Fig. 1 is a schematic flowchart of a face attribute detection method based on a deep self-attention network according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a deep face attribute detection model according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a face attribute detection method based on a deep self-attention network according to an embodiment of the present invention, where the face attribute detection method includes:
step 1, a training sample set is obtained, wherein the training sample set comprises N face images and the identity information of each face image, each face image comprises A face attribute labels, and N and A are natural numbers larger than 0. The face attribute labels annotate the attributes of the face image, which cover regions including the head, eyes, nose, mouth, cheeks and neck, and the identity information represents the identity of the face image.
In addition, this embodiment further includes a test sample set, which comprises M face images and the identity information of each face image, wherein each face image comprises A face attribute labels and M is a natural number greater than 0.
And 2, training a face attribute detection model by using the training sample set to obtain a trained deep face attribute detection model, wherein the deep face attribute detection model comprises a shared attribute feature learning module and a specific attention feature learning module.
In one embodiment, step 2 may comprise:
and 2.1, segmenting the face image in the training sample set into a plurality of non-overlapping windows.
Specifically, given an input face image, it is processed into a plurality of non-overlapping windows using a conventional window partitioning strategy; for example, each local window has a size of 4 × 4, as in the sketch below.
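A minimal sketch of this window partitioning step, following the standard Swin Transformer formulation with the 4 × 4 window size used in the example above; the function name and tensor layout are assumptions.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int = 4) -> torch.Tensor:
    """Split a feature map (B, H, W, C) into non-overlapping windows, as in
    the standard Swin Transformer; the 4x4 size follows the example above."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    # -> (num_windows * B, window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(
        -1, window_size, window_size, C)
```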
And 2.2, inputting the segmented face image and the identity information of the face image into a deep face attribute detection model so as to establish a hierarchical identity information limiting loss function according to the output of the deep face attribute detection model.
In one embodiment, referring to fig. 2, the shared attribute feature learning module includes a linear embedding layer, m image block fusion layers, (m+1) Swin Transformer layers (a shifted-window deep self-attention model), a pooling layer and a first fully connected layer, wherein the linear embedding layer, the (m+1) Swin Transformer layers and the first fully connected layer are connected in sequence, an image block fusion layer is disposed before each of the 2nd to (m+1)th Swin Transformer layers, and the pooling layer is disposed after the (m+1)th Swin Transformer layer. The linear embedding layer maps the segmented face image into 96-dimensional features; within each Swin Transformer layer, standard multi-head self-attention (MSA) processes the input and a multi-layer perceptron (MLP) improves the transformation capacity; finally, the features output by the first fully connected layer are fed to each local region branch module of the specific attention feature learning module, as illustrated in the skeleton below.
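The following PyTorch-style skeleton illustrates one possible layout of the shared attribute feature learning module as just described. PatchMerging and SwinStage are simplified stand-ins: the fusion is done by pairwise token concatenation and the stage omits the shifted-window masking, so this is a sketch of the structure rather than a faithful Swin implementation; all widths and depths are illustrative.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Stand-in for an image block fusion layer: fuses neighbouring tokens
    (here by pairwise concatenation along the sequence) and projects to the
    next width. A real Swin patch merging fuses 2x2 spatial neighbours."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.reduce = nn.Linear(2 * dim_in, dim_out)

    def forward(self, x):                      # x: (B, N, C), N even
        B, N, C = x.shape
        return self.reduce(x.reshape(B, N // 2, 2 * C))

class SwinStage(nn.Module):
    """Stand-in for one Swin Transformer layer: multi-head self-attention
    followed by an MLP (the shifted-window masking is omitted here)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)

    def forward(self, x):
        return self.block(x)

class SharedAttributeBackbone(nn.Module):
    """Skeleton of the shared attribute feature learning module: linear
    embedding -> (m+1) stages with block fusion before stages 2..m+1 ->
    pooling -> first fully connected layer. Widths and depth illustrative."""
    def __init__(self, m=3, embed_dim=96, out_dim=2048):
        super().__init__()
        self.linear_embed = nn.Linear(4 * 4 * 3, embed_dim)   # 4x4 RGB windows -> 96-d
        dims = [embed_dim * 2 ** i for i in range(m + 1)]
        stages = []
        for i in range(m + 1):
            layers = [] if i == 0 else [PatchMerging(dims[i - 1], dims[i])]
            layers.append(SwinStage(dims[i]))
            stages.append(nn.Sequential(*layers))
        self.stages = nn.ModuleList(stages)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.fc = nn.Linear(dims[-1], out_dim)                # first fully connected layer

    def forward(self, x):                      # x: (B, N, 48), N divisible by 2**m
        x = self.linear_embed(x)
        for stage in self.stages:
            x = stage(x)
        x = self.pool(x.transpose(1, 2)).squeeze(-1)          # (B, C)
        return self.fc(x)
```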
In one embodiment, referring to fig. 2, the specific attention feature learning module comprises a global attribute branch module, a plurality of local region branch modules and an identity branch module. The global attribute branch module outputs global attributes, i.e., attributes of the image as a whole, such as round face, heavy makeup and oval face; the local region branch modules output local attributes, such as those of the eyes or nose. Specifically, the local region branch modules comprise a head region branch module, an eye region branch module, a nose region branch module, a mouth region branch module, a cheek region branch module and a neck region branch module. The global attribute branch module and each local region branch module respectively comprise a second fully connected layer, a third fully connected layer, a first ReLU activation function layer, a first dropout layer and a first batch normalization layer connected in sequence; the identity branch module comprises a fourth fully connected layer, a fifth fully connected layer, a second ReLU activation function layer, a second dropout layer and a second batch normalization layer connected in sequence. The output of the global attribute branch module and the outputs of the second fully connected layers of all the local region branch modules are concatenated and input into the identity branch module, whose output is used to calculate the global identity loss.
According to the proposed semi-automatic face attribute grouping strategy, the specific attention feature learning module adds a corresponding specific local region branch to detect the strongly correlated face attributes within the same attribute group. In each local region branch, the features from the shared attribute feature learning module are fed into the two fully connected (FC) layers of the branch module, whose numbers of neurons (i.e., of the second and third fully connected layers) are 2048 and 512, respectively. To avoid over-fitting and enhance the nonlinear fitting capability of the model, a ReLU (rectified linear unit) activation function, a dropout layer (dropout probability = 0.5) and a batch normalization (BN) layer are added after the fully connected layers. In addition to the local region branches, an identity-related branch is introduced into the specific attention feature learning module in order to model the relationship between face attributes and identity from a global perspective. This branch imposes a constraint from the global angle: the features from the first fully connected layer of each local region branch module (the second fully connected layer defined above) are concatenated and followed by two fully connected layers (the fourth and fifth fully connected layers) with 4096 and 2048 neurons, respectively, again followed by a ReLU activation function, a dropout layer (dropout probability = 0.5) and a batch normalization layer. The features obtained from the identity-related branch are used to calculate the global identity loss and help the model further mine the internal relationship between face attributes and face identity. A sketch of this module follows.
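A sketch of the specific attention feature learning module under the layer sizes given above (two fully connected layers of 2048 and 512 neurons per branch, followed by ReLU, dropout with probability 0.5 and batch normalization; an identity branch with 4096 and 2048 neurons fed by the concatenated branch features). The tap point for the identity branch, the per-group attribute head sizes and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def branch(in_dim, h1, h2):
    """Two fully connected layers followed by ReLU, dropout (p=0.5) and batch
    normalization, matching the branch layout described above; the exact
    placement of ReLU/dropout/BN relative to each FC layer is assumed."""
    return nn.Sequential(
        nn.Linear(in_dim, h1),
        nn.Linear(h1, h2),
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.5),
        nn.BatchNorm1d(h2),
    )

class SpecificAttentionHead(nn.Module):
    """Global attribute branch, local region branches, and an identity branch
    fed by the concatenated branch features. Group sizes (summing to the 40
    attributes) and the identity-branch tap point are illustrative."""
    def __init__(self, feat_dim=2048, group_sizes=(13, 5, 6, 2, 5, 5, 4)):
        super().__init__()
        num_local = len(group_sizes) - 1                  # one global + six local groups
        self.global_branch = branch(feat_dim, 2048, 512)
        self.local_branches = nn.ModuleList(
            branch(feat_dim, 2048, 512) for _ in range(num_local))
        self.identity_branch = branch(512 * (num_local + 1), 4096, 2048)
        self.attr_heads = nn.ModuleList(
            nn.Linear(512, g) for g in group_sizes)       # per-group attribute logits

    def forward(self, shared):                            # shared: (B, feat_dim)
        feats = [self.global_branch(shared)] + \
                [b(shared) for b in self.local_branches]
        attr_logits = [head(f) for head, f in zip(self.attr_heads, feats)]
        id_feat = self.identity_branch(torch.cat(feats, dim=1))
        return attr_logits, id_feat
```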
In one embodiment, the hierarchical identity information limiting loss function is:

$$\mathcal{L} = \lambda\left(\alpha\,\mathrm{Loss}_F + (1-\alpha)\,\mathrm{Loss}_C\right) + \beta\,\mathrm{Loss}_G$$

wherein $\alpha$, $\lambda$ and $\beta$ are weight parameters;

$$\mathrm{Loss}_F = \sum_{i,j} w_{i,j}\,\left\lVert f_i - f_j \right\rVert_2^2$$

$$\mathrm{Loss}_C = -\frac{1}{N}\sum_{n=1}^{N}\log p\!\left(y_n^{\mathrm{id}} \mid x_n\right)$$

$$\mathrm{Loss}_G = -\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G}\sum_{a=1}^{A_g}\left[y_{i,a}^{g}\log p_{i,a}^{g} + \left(1-y_{i,a}^{g}\right)\log\left(1-p_{i,a}^{g}\right)\right]$$

wherein $C$ is the number of identities in the training sample set, $x_n$ is the $n$th face image, $y_n^{\mathrm{id}}$ is the identity information of the $n$th face image, $w_{i,j}$ is 1 if the identities of the two input face images are the same and 0 otherwise, $f_i$ and $f_j$ are the corresponding features obtained for the $i$th and $j$th face images through the deep face attribute detection model, $G$ is the number of attribute groups divided according to the face attribute grouping strategy, $A_g$ is the number of face attributes in the $g$th attribute group, $y_{i,a}^{g}$ is the $a$th attribute label within the $g$th attribute group of the $i$th face image, and $p_{i,a}^{g}$ is the probability of the $a$th attribute within the $g$th attribute group of the $i$th face image. After a face image passes through the deep face attribute detection model, the model outputs a value between 0 and 1 for each single attribute; this value is regarded as the probability, and if it is larger than 0.5 the image is regarded as having that face attribute, otherwise not. Here $\alpha\,\mathrm{Loss}_F + (1-\alpha)\,\mathrm{Loss}_C$ is the global identity loss function. The attribute groups include a global attribute group, a head group, an eye group, a nose group, a mouth group, a cheek group and a neck group; the specific contents of the attribute groups are shown in Table 1.
TABLE 1 Face attributes and their corresponding attribute groups
[Table 1 is reproduced as an image in the original publication; it assigns each of the 40 face attributes to one of the seven attribute groups: global, head, eyes, nose, mouth, cheeks and neck.]
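Under the reconstruction of the loss given above, a minimal PyTorch-style implementation might look as follows; the pairwise term, the identity cross-entropy and the grouped binary cross-entropy mirror the three components, and all weight values are illustrative.

```python
import torch
import torch.nn.functional as F

def hierarchical_identity_loss(attr_logits, attr_labels, id_logits, id_labels,
                               feats, alpha=0.5, lam=1.0, beta=1.0):
    """Sketch of the hierarchical identity information limiting loss as
    reconstructed above; the weight values are illustrative.

    attr_logits / attr_labels: per-group lists of (B, A_g) tensors.
    id_logits: (B, C) identity logits; id_labels: (B,) identity indices.
    feats: (B, D) features used for the pairwise identity term.
    """
    # Loss_F: pull together features of images sharing an identity (w_ij = 1).
    w = (id_labels.unsqueeze(0) == id_labels.unsqueeze(1)).float()
    loss_f = (w * torch.cdist(feats, feats) ** 2).sum() / w.sum().clamp(min=1)

    # Loss_C: softmax cross-entropy over the C identities.
    loss_c = F.cross_entropy(id_logits, id_labels)

    # Loss_G: binary cross-entropy over the grouped attribute predictions.
    loss_g = sum(F.binary_cross_entropy_with_logits(lg, lb.float())
                 for lg, lb in zip(attr_logits, attr_labels))

    return lam * (alpha * loss_f + (1 - alpha) * loss_c) + beta * loss_g
```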
According to the invention, semantic-attention-specific region branches are added according to the attribute grouping strategy to detect strongly correlated face attributes, and attributes with similar attention regions are grouped together. The face attributes are thus divided into seven attention-specific attribute groups according to the proposed semi-automatic grouping strategy.
Step 2.3, minimizing the hierarchical identity information limiting loss function by using a stochastic gradient descent algorithm;
and 2.4, obtaining the trained deep face attribute detection model from the minimized hierarchical identity information limiting loss function.
That is, stochastic gradient descent is used to minimize the hierarchical identity information limiting loss function; the parameters of the deep face attribute detection model at that point are the parameters of the finally trained model, which can then be used directly to detect face attributes, as in the sketch below.
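A minimal stochastic gradient descent loop over the sketches above; the model is assumed to return the grouped attribute logits and the identity feature, and id_head is a hypothetical classifier mapping the identity feature to the C identity logits.

```python
import torch

def train(model, id_head, train_loader, epochs=10, lr=0.01):
    """Minimal SGD loop for the sketches above; model is assumed to return
    (attr_logits, id_feat), and id_head is a hypothetical linear classifier
    producing the C identity logits. Hyperparameters are illustrative."""
    params = list(model.parameters()) + list(id_head.parameters())
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, attr_labels, id_labels in train_loader:
            opt.zero_grad()
            attr_logits, id_feat = model(images)
            loss = hierarchical_identity_loss(
                attr_logits, attr_labels, id_head(id_feat), id_labels, id_feat)
            loss.backward()
            opt.step()
```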
The invention applies a hierarchical identity information limiting loss function that mines face identity information and the relations between identities from both global and local perspectives; by considering the internal relation between semantic attributes and face identity, it helps the model learn robust, discriminative information.
And 3, detecting the face attributes of the test sample set by using the trained deep face attribute detection model to obtain a test result, from which the accuracy of the trained model in detecting face attributes can be determined, for example as follows.
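A sketch of how the test accuracy might be computed with the conventions of the earlier snippets, assuming each batch's attribute labels are given as a single (batch, 40) tensor ordered by attribute group and a 0.5 probability threshold is used.

```python
import torch

@torch.no_grad()
def attribute_accuracy(model, test_loader):
    """Overall per-attribute accuracy under the conventions of the earlier
    sketches; labels assumed as one (B, 40) tensor ordered by group."""
    model.eval()
    correct = total = 0
    for images, attr_labels, _ in test_loader:
        attr_logits, _ = model(images)
        preds = torch.cat([lg.sigmoid() > 0.5 for lg in attr_logits], dim=1)
        correct += (preds == attr_labels.bool()).sum().item()
        total += attr_labels.numel()
    return correct / total
```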
And 4, inputting the face image to be detected into the trained deep face attribute detection model to obtain a detection result.
In this embodiment, the face image to be detected is first segmented in the same way as during training; the segmented image is then input into the trained deep face attribute detection model, which outputs the corresponding attributes, as in the sketch below.
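A corresponding inference sketch: each attribute probability is binarized at 0.5 as described above (the model follows the conventions of the earlier sketches).

```python
import torch

@torch.no_grad()
def detect_attributes(model, image):
    """Run the trained model on one face image and binarize each attribute
    probability at 0.5, as described above."""
    model.eval()
    attr_logits, _ = model(image.unsqueeze(0))           # add a batch dimension
    probs = torch.cat([lg.sigmoid() for lg in attr_logits], dim=1).squeeze(0)
    return probs > 0.5                                   # True where the attribute is present
```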
1. At present, most face attribute detection algorithms are based on convolutional neural networks. Although models based on deep convolutional neural networks can achieve a good face attribute detection rate, a convolution operation attends only to a local neighborhood and extracts only local features, which limits its ability to capture global feature representations. Addressing this shortcoming, the invention provides a face attribute detection method based on a deep self-attention network, which better captures long-range feature dependencies and effectively improves the face attribute detection rate.
2. Most existing face attribute detection algorithms treat face attribute detection as an independent multi-label classification task and ignore the internal association between attributes and face identity information. To mine the relationship between face attributes and identity information, the invention provides a hierarchical identity-information-related loss function. By considering the internal association between face identity and semantic attributes, this loss function helps the model learn features that carry robust discriminative information.
3. Unlike most previous face attribute detection methods, which use prior knowledge (such as data type, semantics and subjectivity) to classify attributes, the invention uses a semi-automatic attribute grouping strategy: a Gradient-weighted Class Activation Map (Grad-CAM) visualization technique draws heat maps for the different face attributes, and attributes with similar regions of interest are grouped together, as sketched below. The face attributes are thus divided into 7 groups: the global attribute group, head group, eyes group, nose group, mouth group, cheek group and neck group. This grouping strategy helps the model better learn the intrinsic associations between different attributes within the same attribute group.
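The grouping step in point 3 can be illustrated with the following sketch, which greedily clusters attributes whose Grad-CAM heat maps are similar; the cosine-similarity measure, the threshold and the greedy strategy are illustrative assumptions, since the disclosure does not specify the exact similarity criterion.

```python
import numpy as np

def group_attributes(heatmaps, threshold=0.6):
    """Greedily cluster attributes whose Grad-CAM heat maps are similar.

    heatmaps: dict mapping attribute name -> 2-D activation map (numpy array).
    The cosine similarity and the 0.6 threshold are illustrative assumptions.
    """
    names = list(heatmaps)
    flat = {n: heatmaps[n].ravel() / (np.linalg.norm(heatmaps[n]) + 1e-8)
            for n in names}
    groups = []                      # each group is a list of attribute names
    for name in names:
        for g in groups:
            if float(flat[name] @ flat[g[0]]) > threshold:
                g.append(name)       # similar focus region -> same group
                break
        else:
            groups.append([name])    # start a new attribute group
    return groups
```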
1. As shown in Table 2, compared with the current best prior art, the invention achieves the best recognition accuracy on 20 of the 40 face attributes. For example, when detecting the bald and bangs attributes, the method achieves recognition accuracies of 91.04% and 96.50%, which are 6.04% and 5.50% higher, respectively, than the current best method DMTL (Deep Multi-Task Learning network). By using the semi-automatic face attribute grouping strategy, face attributes with similar visualization heat maps are placed in the same attribute group, helping the model learn the attribute features within each group better. These features are more robust and discriminative, so the model performs better on the detection rate of individual face attributes.
TABLE 2 Per-attribute recognition accuracy compared with prior methods
[Table 2 is reproduced as an image in the original publication.]
2. To reduce the impact of large face variations and the complexity of face pictures, the best prior art, DMTL, introduces an additional face normalization step: it first uses the SeetaFaceEngine to detect facial key points and crops the face according to them, which increases the computational cost. The present method instead inputs the face image directly into the backbone model, without an additional face image processing step, and extracts features from the whole input face image, reducing the computational cost of face attribute detection.
3. DMTL uses AlexNet, a deep convolutional neural network, as the backbone of its model; because each convolution attends only to positions adjacent to the region of interest, the model focuses more on local information in the picture. The invention extracts shared face attribute features with a network based on a deep self-attention mechanism, which weights different parts of the picture differently, so the model can effectively extract meaningful information from large amounts of image data with limited attention. With an attention mechanism, the model can learn both global and local associations between elements. Compared with DMTL, the model used in the invention attends more to global information in the image during feature extraction, rather than only to information from adjacent regions.
4. DMTL considers only the face attribute detection task and ignores the association between face attributes and identity information, whereas the invention uses a face identity information constraint loss function to guide the face attribute detection task. Usually there is an internal association between two face images with the same identity: such images have highly similar face attributes; for example, both may have the high cheekbones attribute. Face images with different identities have less similar face attributes; for example, one face image may have the big nose attribute while the other does not. High-level identity information thus helps the model, to a certain extent, learn discriminative features for face attribute detection. To model the relationship between semantic attributes and face identity, the invention provides an identity-related hierarchical face attribute loss function that takes this internal relationship into account; by inputting face attributes and face identities simultaneously, the auxiliary task of learning their relationship guides the model to learn the face attribute detection task better.
In the description of the present invention, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, such schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and various embodiments or examples described in this specification can be combined by those skilled in the art.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions may be made without departing from the spirit of the invention, which should be construed as belonging to the scope of the invention.

Claims (8)

1. A face attribute detection method based on a deep self-attention network is characterized by comprising the following steps:
step 1, acquiring a training sample set, wherein the training sample set comprises N face images and identity information of each face image, each face image comprises A face attribute labels, and N and A are natural numbers larger than 0;
step 2, training a face attribute detection model by using the training sample set to obtain a trained deep face attribute detection model, wherein the deep face attribute detection model comprises a shared attribute feature learning module and a specific attention feature learning module;
and 3, inputting the face image to be detected into the trained deep face attribute detection model to obtain a detection result.
2. The method for detecting the face attribute based on the deep self-attention network as claimed in claim 1, wherein the step 2 comprises:
2.1, segmenting the face image in the training sample set into a plurality of non-overlapping windows;
2.2, inputting the segmented face image and the identity information of the face image into the deep face attribute detection model so as to establish a hierarchical identity information limiting loss function according to the output of the deep face attribute detection model;
step 2.3, minimizing the hierarchical identity information limiting loss function by using a stochastic gradient descent algorithm;
and 2.4, obtaining the trained deep face attribute detection model from the minimized hierarchical identity information limiting loss function.
3. The method according to claim 2, wherein the shared attribute feature learning module includes a linear embedding layer, m image block fusion layers, (m+1) Swin Transformer layers, a pooling layer and a first fully connected layer, the linear embedding layer, the (m+1) Swin Transformer layers and the first fully connected layer being connected in sequence, an image block fusion layer being disposed before each of the 2nd to (m+1)th Swin Transformer layers, and the pooling layer being disposed after the (m+1)th Swin Transformer layer.
4. The method of claim 3, wherein the specific attention feature learning module comprises: a global attribute branch module and a plurality of local region branch modules and identity branch modules.
5. The method according to claim 4, wherein the global attribute branching module and each local region branching module comprise a second fully connected layer, a third fully connected layer, a first ReLU activation function layer, a first dropout layer and a first batch normalization layer, which are connected in sequence.
6. The method according to claim 5, wherein the identity branching module comprises a fourth full connection layer, a fifth full connection layer, a second ReLU activation function layer, a second dropout layer and a second batch normalization layer, which are connected in sequence, outputs of the second full connection layers of all the local region branching modules are connected and then input to the identity branching module, and an output of the identity branching module is used for calculating global identity loss.
7. The method according to claim 6, wherein the hierarchical identity information constraint loss function is:
$$\mathcal{L} = \lambda\left(\alpha\,\mathrm{Loss}_F + (1-\alpha)\,\mathrm{Loss}_C\right) + \beta\,\mathrm{Loss}_G$$

wherein $\alpha$, $\lambda$ and $\beta$ are weight parameters;

$$\mathrm{Loss}_F = \sum_{i,j} w_{i,j}\,\left\lVert f_i - f_j \right\rVert_2^2$$

$$\mathrm{Loss}_C = -\frac{1}{N}\sum_{n=1}^{N}\log p\!\left(y_n^{\mathrm{id}} \mid x_n\right)$$

$$\mathrm{Loss}_G = -\frac{1}{N}\sum_{i=1}^{N}\sum_{g=1}^{G}\sum_{a=1}^{A_g}\left[y_{i,a}^{g}\log p_{i,a}^{g} + \left(1-y_{i,a}^{g}\right)\log\left(1-p_{i,a}^{g}\right)\right]$$

wherein $C$ is the number of identities in the training sample set, $x_n$ is the $n$th face image, $y_n^{\mathrm{id}}$ is the identity information of the $n$th face image, $w_{i,j}$ is 1 if the identities of the two input face images are the same and 0 otherwise, $f_i$ and $f_j$ are the corresponding features obtained for the $i$th and $j$th face images through the deep face attribute detection model, $G$ is the number of attribute groups divided according to the face attribute grouping strategy, $A_g$ is the number of face attributes in the $g$th attribute group, $y_{i,a}^{g}$ is the $a$th attribute label within the $g$th attribute group of the $i$th face image, and $p_{i,a}^{g}$ is the probability of the $a$th attribute within the $g$th attribute group of the $i$th face image.
8. The method for detecting the face attribute based on the deep self-attention network as claimed in claim 1, further comprising before step 3:
acquiring a test sample set, wherein the test sample set comprises M face images and the identity information of each face image, and each face image comprises A face attribute labels;
and detecting the face attributes of the test sample set by using the trained deep face attribute detection model to obtain a test result.
CN202210720368.XA 2022-06-23 2022-06-23 Face attribute detection method based on deep self-attention network Pending CN115588217A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210720368.XA CN115588217A (en) 2022-06-23 2022-06-23 Face attribute detection method based on deep self-attention network


Publications (1)

Publication Number Publication Date
CN115588217A (en) 2023-01-10

Family

ID=84772149


Country Status (1)

Country Link
CN (1) CN115588217A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524258A (en) * 2023-04-25 2023-08-01 云南师范大学 Landslide detection method and system based on multi-label classification



Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination