CN111401294B - Multi-task face attribute classification method and system based on adaptive feature fusion - Google Patents

Multi-task face attribute classification method and system based on adaptive feature fusion

Info

Publication number
CN111401294B
Authority
CN
China
Prior art keywords
fusion
adaptive feature
feature fusion
face
self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010228805.7A
Other languages
Chinese (zh)
Other versions
CN111401294A (en)
Inventor
崔超然
申朕
黄瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Finance and Economics
Original Assignee
Shandong University of Finance and Economics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Finance and Economics filed Critical Shandong University of Finance and Economics
Priority to CN202010228805.7A
Publication of CN111401294A
Application granted
Publication of CN111401294B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/179Human faces, e.g. facial parts, sketches or expressions metadata assisted face recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-task face attribute classification method and system based on adaptive feature fusion. The method comprises the following steps: acquiring a face image to be classified; preprocessing the face image to be classified; and inputting the preprocessed face image into a multi-task face attribute classification model based on adaptive feature fusion to obtain, for each face attribute, the probability of the image belonging to each class, and selecting the class with the maximum probability as the classification result for the corresponding attribute. By constructing adaptive feature fusion layers, the network branches of different tasks are connected into a unified multi-task deep convolutional neural network, so that information can be effectively shared among the different tasks and the classification accuracy is remarkably improved.

Description

Multi-task face attribute classification method and system based on adaptive feature fusion
Technical Field
The disclosure relates to the technical field of computer vision and machine learning, in particular to a multitask face attribute classification method and system based on adaptive feature fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
In recent years, deep convolutional neural networks have achieved breakthroughs in many computer vision tasks, such as object detection, semantic segmentation, and depth prediction. Multi-task deep convolutional neural networks aim to handle several related tasks jointly, which improves learning efficiency while also improving prediction accuracy and generalization through feature interaction between tasks, helping to prevent overfitting.
When implementing a multi-task deep convolutional neural network, the most common scheme is to construct a network architecture based on hard parameter sharing. In this scheme, different tasks share the lower network layers and maintain separate branches at the higher layers. The shared layers must be specified manually, based on experience, before training. This approach lacks theoretical guidance, and an unreasonable choice of shared layers can severely degrade performance.
In view of this, many researchers have proposed automatically building shared network layers by learning the optimal feature combinations for different tasks at a given network layer, thereby avoiding the costly enumeration and repeated model training required by hard parameter sharing.
For example, in the Cross-Stitch method (see Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3994-4003, 2016), researchers learn linear combinations of the feature maps from different tasks through learnable cross-stitch units; in the NDDR method (see Yuan Gao, Jiayi Ma, Mingbo Zhao, Wei Liu, and Alan L. Yuille. NDDR-CNN: Layerwise feature fusing in multi-task CNNs by neural discriminative dimensionality reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3205-3214, 2019), researchers stack the feature maps from different tasks along the channel dimension and reduce them with a 1 × 1 convolution to meet the channel-size requirements of the subsequent network branches.
In the course of implementing the present disclosure, the inventors found that the following technical problems exist in the prior art:
Although the above works have been shown experimentally to achieve better performance, they essentially all learn a fixed feature fusion strategy: after training is complete, all input samples share the same set of feature fusion weights, so the fused features cannot adequately express the characteristics of each individual image.
Disclosure of Invention
To overcome the shortcomings of the prior art, the present disclosure provides a multi-task face attribute classification method and system based on adaptive feature fusion. In multi-task face attribute classification, for some samples the features that need to be fused across tasks may be very similar, while for other samples the features may be very different or even complementary to each other. Therefore, when fusing features for multi-task learning, the characteristics of the features to be fused should be fully considered. Motivated by this observation, the present disclosure introduces a dynamic feature fusion mechanism into the design of the multi-task deep convolutional neural network, adaptively fusing features according to the dependencies among them so as to realize feature sharing and interaction between tasks.
In a first aspect, the present disclosure provides a multi-task face attribute classification method based on adaptive feature fusion.
The multi-task face attribute classification method based on adaptive feature fusion comprises the following steps:
acquiring a face image to be classified;
preprocessing the face image to be classified;
inputting the preprocessed face image to be classified into a multi-task face attribute classification model based on adaptive feature fusion to obtain, for each face attribute, the probability of the image belonging to each class, and selecting the class with the maximum probability as the classification result for the corresponding attribute.
In a second aspect, the present disclosure provides a multi-task face attribute classification system based on adaptive feature fusion.
The multi-task face attribute classification system based on adaptive feature fusion comprises:
an acquisition module configured to: acquire a face image to be classified;
a preprocessing module configured to: preprocess the face image to be classified;
a classification module configured to: input the preprocessed face image to be classified into a multi-task face attribute classification model based on adaptive feature fusion to obtain, for each face attribute, the probability of the image belonging to each class, and select the class with the maximum probability as the classification result for the corresponding attribute.
In a third aspect, the present disclosure also provides an electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
Compared with the prior art, the beneficial effects of the present disclosure are:
The method and system take into account the relations among the feature maps of different tasks in the multi-task deep convolutional neural network; that is, when feature fusion is performed, the degree to which feature information is shared or retained is determined according to the characteristics of the feature maps themselves.
In implementation, adaptive feature fusion layers are constructed to connect the network branches of different tasks into a unified multi-task deep convolutional neural network, so that information can be effectively shared among the different tasks and the classification accuracy is remarkably improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flowchart of a deep multi-task learning method based on adaptive feature fusion according to a first embodiment of the present disclosure.
FIG. 2 is a schematic diagram of a network branch connecting two tasks by using an adaptive feature fusion layer to form a unified multitask deep convolutional neural network according to a first embodiment of the present disclosure;
FIG. 3 is a schematic view of the internal connection of a feature fusion layer according to the first embodiment of the disclosure;
fig. 4 is a schematic diagram of an internal connection relationship of a channel level fusion module according to a first embodiment of the present disclosure;
fig. 5 is a schematic diagram of a spatial hierarchy fusion module according to a first embodiment of the present disclosure.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiment 1 provides a multi-task face attribute classification method based on adaptive feature fusion.
as shown in fig. 1, the multi-task face attribute classification method based on adaptive feature fusion includes:
s1: acquiring a face image to be classified;
s2: carrying out preprocessing operation on the face image to be classified;
s3: inputting the preprocessed face images to be classified into a multitask face attribute classification model based on self-adaptive feature fusion to obtain the probability of different classes of the images on each face attribute, and selecting the class with the maximum probability as a classification result on the corresponding attribute.
As one or more embodiments, the preprocessing operation specifically includes:
first, all images are scaled to 224 × 224 pixels;
then, the pixel-wise mean of the training-set images is computed, and this mean is subtracted from each face image to be classified as a normalization operation.
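By way of illustration only, the following is a minimal PyTorch sketch of steps S1-S3 together with the preprocessing described above; the model file, the mean-image file, and the assumption that the trained model returns one probability vector per attribute (age, then gender) are hypothetical placeholders, not part of the disclosure.

```python
import numpy as np
import torch
from PIL import Image

# Hypothetical artifacts: a trained adaptive-fusion model and the training-set pixel mean.
model = torch.load("fusion_model.pt", map_location="cpu")   # placeholder path
model.eval()
mean_image = np.load("train_pixel_mean.npy")                # assumed shape (224, 224, 3)

AGE_GROUPS = ["0-2", "4-6", "8-12", "15-20", "25-32", "38-43", "48-53", "60+"]  # Adience age bins
GENDERS = ["male", "female"]

def classify(path: str):
    # S2: scale to 224 x 224 and subtract the training-set pixel mean.
    img = Image.open(path).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32) - mean_image
    x = torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0)    # (1, 3, 224, 224)
    # S3: forward pass yields per-attribute class probabilities; take the arg-max class.
    with torch.no_grad():
        p_age, p_sex = model(x)
    return AGE_GROUPS[p_age.argmax(1).item()], GENDERS[p_sex.argmax(1).item()]
```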
As one or more embodiments, the multi-task face attribute classification model based on adaptive feature fusion is obtained as follows:
constructing a multi-task neural network model based on adaptive feature fusion;
constructing a training set comprising a plurality of face images, each face image having at least two known attributes;
preprocessing the training-set images: first, all images are scaled to 224 × 224 pixels; then, the pixel-wise mean of the training-set images is computed and subtracted from each image as a normalization operation; finally, before each round of training, the training images are horizontally flipped and Gaussian-blurred with a set probability;
training the multi-task neural network model based on adaptive feature fusion with the preprocessed images to obtain the trained model, i.e. the multi-task face attribute classification model based on adaptive feature fusion.
The beneficial effects of the above technical scheme are: through the preprocessing step, the number of training samples can be effectively expanded, and the diversity of the training samples is improved.
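As a sketch of the training-time preprocessing described above (scaling to 224 × 224, mean subtraction, random horizontal flipping, and Gaussian blurring), assuming PyTorch/torchvision; the flip and blur probabilities and the blur kernel size are illustrative choices, since the disclosure only specifies "a set probability".

```python
import numpy as np
import torch
from torchvision import transforms

FLIP_PROB, BLUR_PROB = 0.5, 0.3   # illustrative values for the "set probability"

def make_train_transform(mean_image: np.ndarray):
    """mean_image: pixel-wise mean of the training set, assumed shape (224, 224, 3)."""
    mean = torch.from_numpy(mean_image).float().permute(2, 0, 1)   # (3, 224, 224)
    return transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.RandomHorizontalFlip(p=FLIP_PROB),
        transforms.RandomApply([transforms.GaussianBlur(kernel_size=5)], p=BLUR_PROB),
        transforms.PILToTensor(),
        transforms.Lambda(lambda x: x.float() - mean),   # subtract the training-set pixel mean
    ])
```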
It is to be understood that the known attributes include, for example, one or more of the following: age, gender, expression, etc.
It should be appreciated that in this embodiment, the Adience dataset is selected so that age classification and gender classification are performed on the face image simultaneously. In the Adience dataset, the age classification task has eight categories: 0-2, 4-6, 8-12, 15-20, 25-32, 38-43, 48-53, and 60+; gender classification has two categories, male and female.
it should be understood that the criterion for model training is that the loss function reaches a minimum. Defining the loss in gender classification as L using a cross-entropy loss functionageThe loss in age classification is LsexThen the total loss function is L ═ λ Lage+Lsex. Wherein, lambda is a hyperparameter of two types of losses of the balance model. Considering that gender classification is a two-classification problem and age classification is a multiple-classification problem, the value of λ is set to 1/2. Training the network by adopting a random gradient descent algorithm, and determining the network weight which can minimize the loss function;
as one or more embodiments, the adaptive feature fusion based multitasking neural network model comprises:
two network branches in parallel: a first network branch and a second network branch;
a first network branch comprising: the system comprises a convolution layer group A1, a convolution layer group A2, a convolution layer group A3, a convolution layer group A4, a convolution layer group A5, a full connection layer A6 and a softmax layer A7 which are connected in sequence;
a second network branch comprising: a convolution layer group B1, a convolution layer group B2, a convolution layer group B3, a convolution layer group B4, a convolution layer group B5, a full connection layer B6 and a Softmax layer B7 which are connected in sequence;
and the convolution layer groups corresponding to the first network branch and the second network branch are connected through four self-adaptive feature fusion layers.
Further, the convolution layer groups corresponding to the first network branch and the second network branch are connected through four adaptive feature fusion layers, specifically:
the output end of convolution layer group A1 and the output end of convolution layer group B1 are both connected to the input end of the first adaptive feature fusion layer;
the input end of convolution layer group A2 and the input end of convolution layer group B2 are both connected to the output end of the first adaptive feature fusion layer;
the output end of convolution layer group A2 and the output end of convolution layer group B2 are both connected to the input end of the second adaptive feature fusion layer;
the input end of convolution layer group A3 and the input end of convolution layer group B3 are both connected to the output end of the second adaptive feature fusion layer;
the output end of convolution layer group A3 and the output end of convolution layer group B3 are both connected to the input end of the third adaptive feature fusion layer;
the input end of convolution layer group A4 and the input end of convolution layer group B4 are both connected to the output end of the third adaptive feature fusion layer;
the output end of convolution layer group A4 and the output end of convolution layer group B4 are both connected to the input end of the fourth adaptive feature fusion layer;
the input end of convolution layer group A5 and the input end of convolution layer group B5 are both connected to the output end of the fourth adaptive feature fusion layer.
It should be understood that the working principle of the above multitask neural network model based on adaptive feature fusion is as follows:
the first network branch and the second network branch receive the same input image, the first network branch is responsible for classifying the age of the face in the input image, the second network branch is responsible for classifying the gender of the face in the input image, and the output of the network branches represents the probability that the input image belongs to each category on the corresponding attribute;
the first network branch and the second network branch are identical in structure and are based on the ResNet101 network structure (see Kaiming He, Xiangyu Zhuang, Shaoqingren, and Jianan Sun. deep residual learning for image Recognition in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016). Each network branch consists of five convolutional layer groups, one fully-connected layer and one softmax layer. Wherein each convolution layer group comprises a plurality of continuous convolution layers and a maximum pooling layer.
And respectively introducing a first adaptive feature fusion layer, a second adaptive feature fusion layer and a fourth adaptive feature fusion layer, and connecting the convolution layer groups corresponding to the first network branch and the second network branch, thereby realizing feature interaction between two tasks and constructing a uniform multi-task deep convolution neural network, wherein the structure of the network is shown in figure 2.
Further, the fully-connected layer A6 of the first network branch performs a nonlinear transformation on the input feature map and maps it into a column vector; the dimension of the column vector is equal to the number of categories of the age attribute, with each dimension corresponding to a specific age category.
Further, the fully-connected layer B6 of the second network branch performs a nonlinear transformation on the input feature map and maps it into a column vector; the dimension of the column vector is equal to the number of categories of the gender attribute, with each dimension corresponding to a specific gender category.
Further, the Softmax layer A7 of the first network branch converts each dimension of the input vector into a probability value, representing the probability of the input image belonging to each category of the age attribute.
Further, the Softmax layer B7 of the second network branch converts each dimension of the input vector into a probability value, representing the probability of the input image belonging to each category of the gender attribute.
for one or more embodiments, the first adaptive feature fusion layer, the second adaptive feature fusion layer, the third adaptive feature fusion layer, and the fourth adaptive feature fusion layer are identical in structure.
As one or more embodiments, as shown in Fig. 3, the first adaptive feature fusion layer includes:
a channel-level fusion module and a spatial-level fusion module connected in sequence, where the input end of the channel-level fusion module is the input end of the current adaptive feature fusion layer, and the output end of the spatial-level fusion module is the output end of the current adaptive feature fusion layer.
As one or more embodiments, the channel-level fusion module includes:
a first average pooling layer and a second average pooling layer in parallel;
the output ends of the first average pooling layer and the second average pooling layer are connected to a concatenation unit;
the concatenation unit is connected to the first fully-connected layer, and the first fully-connected layer is connected to the second fully-connected layer;
the second fully-connected layer is connected to the third fully-connected layer and the fourth fully-connected layer, respectively;
the third fully-connected layer is connected to the first Softmax function layer;
the fourth fully-connected layer is connected to the second Softmax function layer;
the first Softmax function layer is connected to the first multiplier and the second multiplier, respectively;
the second Softmax function layer is connected to the third multiplier and the fourth multiplier, respectively;
the first multiplier and the second multiplier are both connected to the first adder;
the third multiplier and the fourth multiplier are both connected to the second adder.
As one or more embodiments, as shown in Fig. 4, the working principle of the channel-level fusion module is as follows:
First, in the channel-level fusion module, the original feature maps x_A and x_B input from the two network branches are each average-pooled along the channel dimension (each channel being averaged over its spatial extent) to obtain the channel descriptor vectors s_A and s_B, and s_A and s_B are concatenated together.
Then, the concatenated result is passed through the first fully-connected layer and the second fully-connected layer, respectively, for dimensionality reduction, yielding two guide vectors g_A and g_B.
Passing g_A through the third fully-connected layer yields the fusion weight vectors u_A and u_B corresponding to x_A and x_B, respectively; passing g_B through the fourth fully-connected layer yields the fusion weight vectors v_A and v_B corresponding to x_A and x_B, respectively. The dimensions of u_A and u_B are equal to the number of channels of the original feature map x_A, and the dimensions of v_A and v_B are equal to the number of channels of the original feature map x_B.
A Softmax operation is applied pairwise to the corresponding elements of u_A and u_B, so that u_A + u_B = 1 element-wise; likewise, a Softmax operation is applied pairwise to the corresponding elements of v_A and v_B, so that v_A + v_B = 1 element-wise.
Finally, the original feature maps are multiplied channel-wise by the fusion weight vectors and summed, yielding the channel-level fused feature maps x̃_A = u_A ⊙ x_A + u_B ⊙ x_B and x̃_B = v_A ⊙ x_A + v_B ⊙ x_B, where ⊙ denotes channel-wise multiplication.
x̃_A and x̃_B are input to the spatial-level fusion module.
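A minimal sketch of the channel-level fusion module just described, following the working-principle text above; the exact wiring of the four fully-connected layers and the guide-vector dimension are interpretations/illustrative choices, and the two branches are assumed to have the same number of channels, as they do for identical ResNet-101 branches.

```python
import torch
import torch.nn as nn

class ChannelLevelFusion(nn.Module):
    """Channel-level adaptive fusion sketch: per-channel fusion weights from pooled descriptors."""
    def __init__(self, channels: int, guide_dim: int = None):
        super().__init__()
        guide_dim = guide_dim or max(channels // 4, 8)        # illustrative reduction ratio
        self.fc_guide_a = nn.Linear(2 * channels, guide_dim)  # first FC layer  -> guide vector g_A
        self.fc_guide_b = nn.Linear(2 * channels, guide_dim)  # second FC layer -> guide vector g_B
        self.fc_w_a = nn.Linear(guide_dim, 2 * channels)      # third FC layer  -> weights u_A, u_B
        self.fc_w_b = nn.Linear(guide_dim, 2 * channels)      # fourth FC layer -> weights v_A, v_B

    def forward(self, x_a, x_b):
        n, c = x_a.shape[:2]
        # Average each channel over its spatial extent and concatenate the two descriptors.
        s = torch.cat([x_a.mean(dim=(2, 3)), x_b.mean(dim=(2, 3))], dim=1)   # (n, 2c)
        g_a, g_b = self.fc_guide_a(s), self.fc_guide_b(s)
        # Pairwise softmax so the two weights at each channel position sum to 1.
        w_a = self.fc_w_a(g_a).view(n, 2, c, 1, 1).softmax(dim=1)
        w_b = self.fc_w_b(g_b).view(n, 2, c, 1, 1).softmax(dim=1)
        fused_a = w_a[:, 0] * x_a + w_a[:, 1] * x_b   # channel-weighted sum for branch A
        fused_b = w_b[:, 0] * x_a + w_b[:, 1] * x_b   # channel-weighted sum for branch B
        return fused_a, fused_b
```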
As one or more embodiments, the spatial-level fusion module includes:
a third average pooling layer and a fourth average pooling layer in parallel;
the output ends of the third average pooling layer and the fourth average pooling layer are connected to a stacking unit;
the stacking unit is connected to the first convolution layer and the second convolution layer, respectively;
the first convolution layer is connected to the fifth fully-connected layer, and the second convolution layer is connected to the sixth fully-connected layer;
the fifth fully-connected layer is connected to the third Softmax function layer; the sixth fully-connected layer is connected to the fourth Softmax function layer;
the third Softmax function layer is connected to the fifth multiplier and the sixth multiplier, respectively;
the fourth Softmax function layer is connected to the seventh multiplier and the eighth multiplier, respectively;
the fifth multiplier and the sixth multiplier are both connected to the third adder;
the seventh multiplier and the eighth multiplier are both connected to the fourth adder.
As one or more embodiments, as shown in Fig. 5, the working principle of the spatial-level fusion module is as follows:
First, in the spatial-level fusion module, the input feature maps x̃_A and x̃_B are each average-pooled along the spatial dimension (each spatial position being averaged across channels) to obtain the spatial descriptor maps t_A and t_B, and t_A and t_B are stacked together.
Then, the stacked result is passed through two convolution layers, respectively, each convolution layer having only a single 1 × 1 convolution kernel, to obtain two guide matrices M_A and M_B.
M_A is vectorized and passed through a fully-connected layer (the fifth fully-connected layer) to obtain the fusion weight vectors p_A and p_B corresponding to x̃_A and x̃_B, respectively; M_B is vectorized and passed through a fully-connected layer (the sixth fully-connected layer) to obtain the fusion weight vectors q_A and q_B corresponding to x̃_A and x̃_B, respectively.
p_A and p_B are reshaped into matrices P_A and P_B whose size is equal to the spatial size of the input feature map x̃_A; q_A and q_B are reshaped into matrices Q_A and Q_B whose size is equal to the spatial size of the input feature map x̃_B.
A Softmax operation is applied pairwise to the corresponding elements of P_A and P_B, so that P_A + P_B = 1 element-wise; likewise, a Softmax operation is applied pairwise to the corresponding elements of Q_A and Q_B, so that Q_A + Q_B = 1 element-wise.
Finally, the input feature maps are multiplied position-wise by the fusion weight matrices and summed, yielding the fused feature maps y_A = P_A ⊙ x̃_A + P_B ⊙ x̃_B and y_B = Q_A ⊙ x̃_A + Q_B ⊙ x̃_B, where ⊙ here denotes position-wise (spatial) multiplication.
y_A and y_B are fed into the next convolution layer group of the first network branch and the second network branch, respectively.
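Correspondingly, a minimal sketch of the spatial-level fusion module and of an adaptive feature fusion layer composed of the two modules (the `AdaptiveFusionLayer` used in the earlier network sketch); fixing the spatial size at construction time so that the fully-connected layers can be dimensioned, and reusing `ChannelLevelFusion` from the previous sketch, are assumptions of this illustration.

```python
import torch
import torch.nn as nn

class SpatialLevelFusion(nn.Module):
    """Spatial-level adaptive fusion sketch: per-position fusion weights from channel-pooled maps."""
    def __init__(self, spatial):
        super().__init__()
        h, w = spatial
        self.conv_a = nn.Conv2d(2, 1, kernel_size=1)   # first 1x1 conv  -> guide matrix M_A
        self.conv_b = nn.Conv2d(2, 1, kernel_size=1)   # second 1x1 conv -> guide matrix M_B
        self.fc_a = nn.Linear(h * w, 2 * h * w)        # fifth FC layer  -> weights P_A, P_B
        self.fc_b = nn.Linear(h * w, 2 * h * w)        # sixth FC layer  -> weights Q_A, Q_B

    def forward(self, x_a, x_b):
        n, _, h, w = x_a.shape
        # Average across channels at each position and stack the two spatial maps.
        s = torch.stack([x_a.mean(dim=1), x_b.mean(dim=1)], dim=1)   # (n, 2, h, w)
        m_a = self.conv_a(s).flatten(1)                               # vectorized guide matrix M_A
        m_b = self.conv_b(s).flatten(1)                               # vectorized guide matrix M_B
        # Pairwise softmax so the two weights at each spatial position sum to 1.
        w_a = self.fc_a(m_a).view(n, 2, 1, h, w).softmax(dim=1)
        w_b = self.fc_b(m_b).view(n, 2, 1, h, w).softmax(dim=1)
        fused_a = w_a[:, 0] * x_a + w_a[:, 1] * x_b   # position-weighted sum for branch A
        fused_b = w_b[:, 0] * x_a + w_b[:, 1] * x_b   # position-weighted sum for branch B
        return fused_a, fused_b

class AdaptiveFusionLayer(nn.Module):
    """Channel-level fusion followed by spatial-level fusion, as in Fig. 3."""
    def __init__(self, channels: int, spatial):
        super().__init__()
        self.channel = ChannelLevelFusion(channels)   # from the previous sketch
        self.spatial = SpatialLevelFusion(spatial)

    def forward(self, x_a, x_b):
        return self.spatial(*self.channel(x_a, x_b))
```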
The method thus takes into account the relations among the feature maps of different tasks in the multi-task deep convolutional neural network; that is, when feature fusion is performed, the degree to which feature information is shared or retained is determined according to the characteristics of the feature maps themselves, thereby realizing adaptive feature fusion.
Embodiment 2 provides a multi-task face attribute classification system based on adaptive feature fusion.
The multi-task face attribute classification system based on adaptive feature fusion comprises:
an acquisition module configured to: acquire a face image to be classified;
a preprocessing module configured to: preprocess the face image to be classified;
a classification module configured to: input the preprocessed face image to be classified into a multi-task face attribute classification model based on adaptive feature fusion to obtain, for each face attribute, the probability of the image belonging to each class, and select the class with the maximum probability as the classification result for the corresponding attribute.
In a third embodiment, the present invention further provides an electronic device, which includes a memory, a processor, and computer instructions stored in the memory and executed on the processor, where the computer instructions, when executed by the processor, implement the method in the first embodiment.
In a fourth embodiment, the present embodiment further provides a computer-readable storage medium for storing computer instructions, and the computer instructions, when executed by a processor, implement the method of the first embodiment.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (9)

1. A multi-task face attribute classification method based on adaptive feature fusion, characterized by comprising the following steps:
acquiring a face image to be classified;
preprocessing the face image to be classified;
inputting the preprocessed face image to be classified into a multi-task face attribute classification model based on adaptive feature fusion to obtain, for each face attribute, the probability of the image belonging to each class, and selecting the class with the maximum probability as the classification result for the corresponding attribute;
wherein the multi-task face attribute classification model based on adaptive feature fusion is obtained by:
constructing a multi-task neural network model based on adaptive feature fusion;
the multi-task neural network model based on adaptive feature fusion comprising:
two parallel network branches: a first network branch and a second network branch;
the convolution layer groups corresponding to the first network branch and the second network branch being connected through four adaptive feature fusion layers;
each adaptive feature fusion layer comprising:
a channel-level fusion module and a spatial-level fusion module connected in sequence;
the working principle of the channel-level fusion module being as follows:
first, in the channel-level fusion module, the original feature maps x_A and x_B input from the two network branches are each average-pooled along the channel dimension (each channel being averaged over its spatial extent) to obtain the channel descriptor vectors s_A and s_B, and s_A and s_B are concatenated together;
then, the concatenated result is passed through the first fully-connected layer and the second fully-connected layer, respectively, for dimensionality reduction, yielding two guide vectors g_A and g_B;
g_A is passed through the third fully-connected layer to obtain the fusion weight vectors u_A and u_B corresponding to x_A and x_B, respectively; g_B is passed through the fourth fully-connected layer to obtain the fusion weight vectors v_A and v_B corresponding to x_A and x_B, respectively; wherein the dimensions of u_A and u_B are equal to the number of channels of the original feature map x_A, and the dimensions of v_A and v_B are equal to the number of channels of the original feature map x_B;
a Softmax operation is applied pairwise to the corresponding elements of u_A and u_B, so that u_A + u_B = 1 element-wise; a Softmax operation is applied pairwise to the corresponding elements of v_A and v_B, so that v_A + v_B = 1 element-wise;
finally, the original feature maps are multiplied channel-wise by the fusion weight vectors and summed, yielding the channel-level fused feature maps x̃_A = u_A ⊙ x_A + u_B ⊙ x_B and x̃_B = v_A ⊙ x_A + v_B ⊙ x_B, where ⊙ denotes channel-wise multiplication;
x̃_A and x̃_B are input to the spatial-level fusion module.
2. The method according to claim 1, wherein the preprocessing operation comprises:
first, all images are scaled to 224 × 224 pixels;
then, computing the pixel-wise mean of the training-set images, and subtracting this mean from each face image to be classified as a normalization operation.
3. The method of claim 1, wherein the obtaining of the multi-tasking face attribute classification model based on adaptive feature fusion further comprises:
constructing a training set, wherein the training set comprises: the method comprises the following steps of (1) obtaining a plurality of face images, wherein each face image comprises at least two known attributes;
preprocessing the images in the training set, including: first, scaling all images to 224 × 224 pixels; then, computing the pixel-wise mean of the training-set images and subtracting it from each image as a normalization operation; finally, before each round of training, horizontally flipping and Gaussian-blurring the training images with a set probability;
training the multi-task neural network model based on the adaptive feature fusion by using the image after the preprocessing operation to obtain a trained multi-task neural network model based on the adaptive feature fusion; namely, the multi-task human face attribute classification model based on the self-adaptive feature fusion.
4. The method as set forth in claim 3,
a first network branch comprising: the system comprises a convolution layer group A1, a convolution layer group A2, a convolution layer group A3, a convolution layer group A4, a convolution layer group A5, a full connection layer A6 and a softmax layer A7 which are connected in sequence;
a second network branch comprising: connected in sequence are convolution layer group B1, convolution layer group B2, convolution layer group B3, convolution layer group B4, convolution layer group B5, full connection layer B6 and Softmax layer B7.
5. The method as set forth in claim 4, wherein,
the working principle of the multitask neural network model based on the self-adaptive feature fusion is as follows:
the first network branch and the second network branch receive the same input image, the first network branch is responsible for classifying the age of the face in the input image, the second network branch is responsible for classifying the gender of the face in the input image, and the output of the network branches represents the probability that the input image belongs to each category on the corresponding attribute;
the input end of the channel level fusion module is the input end of the current adaptive feature fusion layer; and the output end of the spatial hierarchy fusion module is the output end of the current self-adaptive feature fusion layer.
6. The method according to claim 5, wherein the working principle of the spatial-level fusion module comprises:
first, in the spatial-level fusion module, the input feature maps x̃_A and x̃_B are each average-pooled along the spatial dimension (each spatial position being averaged across channels) to obtain the spatial descriptor maps t_A and t_B, and t_A and t_B are stacked together;
then, the stacked result is passed through two convolution layers, respectively, each convolution layer having only a single 1 × 1 convolution kernel, to obtain two guide matrices M_A and M_B;
M_A is vectorized and passed through a fully-connected layer to obtain the fusion weight vectors p_A and p_B corresponding to x̃_A and x̃_B, respectively; M_B is vectorized and passed through a fully-connected layer to obtain the fusion weight vectors q_A and q_B corresponding to x̃_A and x̃_B, respectively;
p_A and p_B are reshaped into matrices P_A and P_B whose size is equal to the spatial size of the input feature map x̃_A; q_A and q_B are reshaped into matrices Q_A and Q_B whose size is equal to the spatial size of the input feature map x̃_B;
a Softmax operation is applied pairwise to the corresponding elements of P_A and P_B, so that P_A + P_B = 1 element-wise; a Softmax operation is applied pairwise to the corresponding elements of Q_A and Q_B, so that Q_A + Q_B = 1 element-wise;
finally, the input feature maps are multiplied position-wise by the fusion weight matrices and summed, yielding the fused feature maps y_A = P_A ⊙ x̃_A + P_B ⊙ x̃_B and y_B = Q_A ⊙ x̃_A + Q_B ⊙ x̃_B;
y_A and y_B are fed into the next convolution layer group of the first network branch and the second network branch, respectively.
7. A multi-task face attribute classification system based on adaptive feature fusion, characterized by comprising:
an acquisition module configured to: acquire a face image to be classified;
a preprocessing module configured to: preprocess the face image to be classified;
a classification module configured to: input the preprocessed face image to be classified into a multi-task face attribute classification model based on adaptive feature fusion to obtain, for each face attribute, the probability of the image belonging to each class, and select the class with the maximum probability as the classification result for the corresponding attribute;
wherein the multi-task face attribute classification model based on adaptive feature fusion is obtained by:
constructing a multi-task neural network model based on adaptive feature fusion;
the multi-task neural network model based on adaptive feature fusion comprising:
two parallel network branches: a first network branch and a second network branch;
the convolution layer groups corresponding to the first network branch and the second network branch being connected through four adaptive feature fusion layers;
each adaptive feature fusion layer comprising:
a channel-level fusion module and a spatial-level fusion module connected in sequence;
the working principle of the channel-level fusion module being as follows:
first, in the channel-level fusion module, the original feature maps x_A and x_B input from the two network branches are each average-pooled along the channel dimension (each channel being averaged over its spatial extent) to obtain the channel descriptor vectors s_A and s_B, and s_A and s_B are concatenated together;
then, the concatenated result is passed through the first fully-connected layer and the second fully-connected layer, respectively, for dimensionality reduction, yielding two guide vectors g_A and g_B;
g_A is passed through the third fully-connected layer to obtain the fusion weight vectors u_A and u_B corresponding to x_A and x_B, respectively; g_B is passed through the fourth fully-connected layer to obtain the fusion weight vectors v_A and v_B corresponding to x_A and x_B, respectively; wherein the dimensions of u_A and u_B are equal to the number of channels of the original feature map x_A, and the dimensions of v_A and v_B are equal to the number of channels of the original feature map x_B;
a Softmax operation is applied pairwise to the corresponding elements of u_A and u_B, so that u_A + u_B = 1 element-wise; a Softmax operation is applied pairwise to the corresponding elements of v_A and v_B, so that v_A + v_B = 1 element-wise;
finally, the original feature maps are multiplied channel-wise by the fusion weight vectors and summed, yielding the channel-level fused feature maps x̃_A = u_A ⊙ x_A + u_B ⊙ x_B and x̃_B = v_A ⊙ x_A + v_B ⊙ x_B, where ⊙ denotes channel-wise multiplication;
x̃_A and x̃_B are input to the spatial-level fusion module.
8. An electronic device comprising a memory and a processor and computer instructions stored on the memory and executable on the processor, the computer instructions when executed by the processor performing the steps of any of the methods of claims 1-6.
9. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the method of any one of claims 1 to 6.
CN202010228805.7A 2020-03-27 2020-03-27 Multi-task face attribute classification method and system based on adaptive feature fusion Active CN111401294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010228805.7A CN111401294B (en) 2020-03-27 2020-03-27 Multi-task face attribute classification method and system based on adaptive feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010228805.7A CN111401294B (en) 2020-03-27 2020-03-27 Multi-task face attribute classification method and system based on adaptive feature fusion

Publications (2)

Publication Number Publication Date
CN111401294A CN111401294A (en) 2020-07-10
CN111401294B true CN111401294B (en) 2022-07-15

Family

ID=71432935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010228805.7A Active CN111401294B (en) 2020-03-27 2020-03-27 Multi-task face attribute classification method and system based on adaptive feature fusion

Country Status (1)

Country Link
CN (1) CN111401294B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832522B (en) * 2020-07-21 2024-02-27 深圳力维智联技术有限公司 Face data set construction method, system and computer readable storage medium
CN112215157B (en) * 2020-10-13 2021-05-25 北京中电兴发科技有限公司 Multi-model fusion-based face feature dimension reduction extraction method
CN112651960A (en) * 2020-12-31 2021-04-13 上海联影智能医疗科技有限公司 Image processing method, device, equipment and storage medium
CN112784776B (en) * 2021-01-26 2022-07-08 山西三友和智慧信息技术股份有限公司 BPD facial emotion recognition method based on improved residual error network

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529402B (en) * 2016-09-27 2019-05-28 中国科学院自动化研究所 The face character analysis method of convolutional neural networks based on multi-task learning
CN106815566B (en) * 2016-12-29 2021-04-16 天津中科智能识别产业技术研究院有限公司 Face retrieval method based on multitask convolutional neural network
CN107766850B (en) * 2017-11-30 2020-12-29 电子科技大学 Face recognition method based on combination of face attribute information
CN108615010B (en) * 2018-04-24 2022-02-11 重庆邮电大学 Facial expression recognition method based on parallel convolution neural network feature map fusion
CN109919097A (en) * 2019-03-08 2019-06-21 中国科学院自动化研究所 Face and key point combined detection system, method based on multi-task learning
CN109978074A (en) * 2019-04-04 2019-07-05 山东财经大学 Image aesthetic feeling and emotion joint classification method and system based on depth multi-task learning
CN110119689A (en) * 2019-04-18 2019-08-13 五邑大学 A kind of face beauty prediction technique based on multitask transfer learning
CN110197217B (en) * 2019-05-24 2020-12-18 中国矿业大学 Image classification method based on deep interleaving fusion packet convolution network
CN110796239A (en) * 2019-10-30 2020-02-14 福州大学 Deep learning target detection method based on channel and space fusion perception

Also Published As

Publication number Publication date
CN111401294A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111401294B (en) Multi-task face attribute classification method and system based on adaptive feature fusion
CN111767979B (en) Training method, image processing method and image processing device for neural network
CN110175671B (en) Neural network construction method, image processing method and device
Gao et al. Global second-order pooling convolutional networks
CN112926641B (en) Three-stage feature fusion rotating machine fault diagnosis method based on multi-mode data
CN112215332B (en) Searching method, image processing method and device for neural network structure
CN108710906B (en) Real-time point cloud model classification method based on lightweight network LightPointNet
CN114255361A (en) Neural network model training method, image processing method and device
CN113298235A (en) Neural network architecture of multi-branch depth self-attention transformation network and implementation method
CN112561028A (en) Method for training neural network model, and method and device for data processing
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN114898171B (en) Real-time target detection method suitable for embedded platform
WO2022227024A1 (en) Operational method and apparatus for neural network model and training method and apparatus for neural network model
CN113239949A (en) Data reconstruction method based on 1D packet convolutional neural network
Obeso et al. Introduction of explicit visual saliency in training of deep cnns: Application to architectural styles classification
CN116246110A (en) Image classification method based on improved capsule network
CN115330759B (en) Method and device for calculating distance loss based on Hausdorff distance
CN113516580B (en) Method and device for improving neural network image processing efficiency and NPU
CN114118415B (en) Deep learning method of lightweight bottleneck attention mechanism
CN113688946B (en) Multi-label image recognition method based on spatial correlation
CN112529064B (en) Efficient real-time semantic segmentation method
CN115169548A (en) Tensor-based continuous learning method and device
Huang et al. Algorithm of image classification based on Atrous-CNN
CN112560824A (en) Facial expression recognition method based on multi-feature adaptive fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant