CN117152823A - Multi-task age estimation method based on dynamic cavity convolution pyramid attention - Google Patents

Multi-task age estimation method based on dynamic cavity convolution pyramid attention

Info

Publication number
CN117152823A
Authority
CN
China
Prior art keywords
attention
task
convolution
image
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311163492.1A
Other languages
Chinese (zh)
Inventor
黄原丰
胡春龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University of Science and Technology
Original Assignee
Jiangsu University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University of Science and Technology filed Critical Jiangsu University of Science and Technology
Priority to CN202311163492.1A priority Critical patent/CN117152823A/en
Publication of CN117152823A publication Critical patent/CN117152823A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/178Human faces, e.g. facial parts, sketches or expressions estimating age from face image; using age information for improving recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-task age estimation method based on dynamic cavity convolution pyramid attention. A pre-designed lightweight attention neural network is trained iteratively to learn features of the input image and produce an age feature map; a label-distribution task and a regression-flow task are performed simultaneously, with a KL loss function, a mean-square-error loss function and a quantile loss function constraining one another until the model converges, yielding the age value of the image under test. The invention designs a new convolutional neural network structure that applies dynamic cavity convolution pyramid attention to extract features of different scales from the image information, and on this basis combines them with a new set of mutually constraining loss functions, so that age estimation can be performed faster while keeping high prediction accuracy.

Description

Multi-task age estimation method based on dynamic cavity convolution pyramid attention
Technical Field
The invention relates to the technical field of image classification and recognition, in particular to a multi-task age estimation method based on dynamic cavity convolution (i.e., dilated convolution) pyramid attention.
Background
Age information, an inherent attribute of face information, is an important human biological feature. With the continuous development of technology, human-computer interaction is widely applied and face recognition technology has been broadly adopted. Age estimation research can assist the face recognition task and thus improve the safety and convenience of human-computer interaction. Age estimation analyzes and predicts face images with existing algorithms; prediction methods can be roughly divided into regression, ranking and classification approaches. Although age estimation has been studied for many years, accurately locating and extracting feature information from images to support subsequent prediction accuracy remains a very challenging problem.
Two types of methods are commonly used in the age estimation field to extract feature information. The first relies on manual feature extraction, which typically requires the extractor to have strong prior knowledge. The second uses the deep-learning feature extraction of a neural network, mapping features into another feature space through network learning. Because of the limited receptive field of the convolution kernel, such methods often fail to extract features adequately: fine biological details in the face image, such as wrinkles and spots, are ignored, which challenges the accuracy of the age estimation task.
Disclosure of Invention
The invention aims to provide a multi-task age estimation method based on dynamic cavity convolution pyramid attention, applied to age estimation of face images, which jointly improves age estimation accuracy through a new extraction network and multi-loss-function constraints.
In order to achieve this technical aim, the invention adopts the following technical scheme: a multi-task age estimation method based on dynamic cavity convolution pyramid attention. Different image information is first preprocessed; the preprocessed images are input into the constructed feature extraction neural network, in which dynamic cavity convolution pyramid attention improves feature extraction efficiency; the extracted feature maps are input into a multi-task module, where a newly designed combination of regression-flow and label-distribution-flow loss functions imposes a joint constraint, completing face image age estimation with improved accuracy. The method mainly comprises the following steps:
step 1: preprocess the images so that pixels of different images are unified;
step 2: extract image age features by feeding the images obtained in step 1 into the attention network model to obtain an age feature map;
step 3: perform multi-task learning on the two-scale feature maps obtained in step 2, and reduce the age estimation error through back propagation under the joint constraint of a KL loss function, an MSE loss function and a quantile loss function, obtaining the final result.
The invention is further realized by the following technical measures:
the input image preprocessing operation is carried out, so that unified learning characteristics of the network model are facilitated; the method specifically comprises the following steps:
step 1: process the input target image, reshaping it to a 224×224 size specification;
step 2: apply random flipping and random rotation to the reshaped image, each with probability 0.5;
step 3: apply random erasing to the R channel of the image with probability 0.5, an area-ratio range of 0.01 to 0.15 and an aspect-ratio range of 0.3 to 3.3, and normalize the image.
The beneficial effects are that: compared with the prior art, the invention has the following advantages:
aiming at the problem of insufficient feature extraction capability in the existing face image age estimation scheme, the invention provides a face image age estimation method based on attention mechanism and fusing feature information.
In a first aspect, the invention proposes a new convolutional neural network structure into which an attention module is introduced; through automatic learning, an importance weight is assigned to each feature propagated by the neural network, improving the network's ability to extract key feature information. The invention further introduces techniques such as a grouping perceptron and channel pruning, which strengthen the retention of effective features, remove interfering feature information and improve the generalization level of the model. Dynamic convolution is also used in the attention layer to retain the channel attention features learned by the original network, further improving model performance.
In a second aspect, the invention designs a novel set of network loss functions; combining quantile loss with KL divergence further improves task accuracy.
The feature extraction neural network is constructed as follows: a feature extraction module is formed by stacking a convolution layer, a convolution residual layer, an attention layer and a max-pooling layer; stacking this module three times yields the first-stage extraction network; a fully convolutional feature extraction module, formed by stacking a convolution block and an attention layer, is then attached to the first-stage network. A global pooling layer is added after the last layer of the overall feature extraction network to analyze the task flows, forming the feature extraction neural network module.
The feature map first undergoes one channel-dimension compression to reduce the parameter count of the attention method; features are then pooled with dynamic cavity convolution kernels of different sizes, restored to the original size by bilinear interpolation, and concatenated along the channel dimension so that feature information from different receptive fields is considered jointly. A channel-importance dependency analysis is introduced, and the processed result is passed into a Coordinate Attention (CA) layer to obtain the final attention feature output. The module is computed as:

$F' = \mathrm{CA}(\mathrm{cat}(f(\mathrm{DynamicConv}_{ks}(W_2(F)))))$

where $W_1$ and $W_2$ are both information-compressing convolution operations, $\mathrm{DynamicConv}_{ks}$ denotes dynamic cavity convolution operations with different kernel sizes, $f$ denotes the bilinear interpolation operation, and CA denotes the Coordinate Attention layer.
The dynamic convolution kernel of the dynamic cavity convolution layer is computed as:

$\tilde{W}(x) = \sum_{k=1}^{K}\pi_k(x)\,W_k, \qquad \tilde{b}(x) = \sum_{k=1}^{K}\pi_k(x)\,b_k$

where $W_k$ represents the weight value of a single convolution kernel, $b_k$ represents the deviation (bias) value of a single convolution kernel, and $\pi_k(x)$ is the input-dependent attention weight of the $k$-th kernel, with $\sum_{k}\pi_k(x)=1$.
The bilinear function interpolation used to restore the image size is computed as:

$f(x,y) = \frac{1}{(x_2-x_1)(y_2-y_1)}\big[(x_2-x)(y_2-y)f(Q_{11}) + (x-x_1)(y_2-y)f(Q_{21}) + (x_2-x)(y-y_1)f(Q_{12}) + (x-x_1)(y-y_1)f(Q_{22})\big]$

where $Q_{11}=(x_1,y_1)$, $Q_{21}=(x_2,y_1)$, $Q_{12}=(x_1,y_2)$ and $Q_{22}=(x_2,y_2)$ are the four grid points surrounding $(x,y)$.
the CA attention layer maps the original spatial information of the image in a parallel two one-dimensional coding mode, and independently aggregates and codes the features from the vertical direction and the horizontal direction to form spatial dependency analysis. And splitting the dependence relationship of the channel dimension so that the attention dependence relationship of the channel dimension introduces the attention dependence analysis of the space dimension to form a cross attention perception graph. The formula is shown below.
f=δ(F 1 ([z h ,z w ]))
Wherein the method comprises the steps ofThe feature map obtained in the width and height directions is represented by F, and the feature map F obtained by normalizing the feature map and transmitting the feature map to a Sigmod function is represented by F. At this time, the attention weights of the height and width in the space dimension of the feature map are d h =σ(F h (f h )),d w =σ(F w (f w )). The final dynamic cavity convolution pyramid attention output feature map formula is as follows.
The multi-task age estimation method based on dynamic cavity convolution pyramid attention performs multi-task learning on the finally extracted feature map: a regression-flow task and an asymmetric label-distribution task are combined under a new loss-function constraint consisting of MSE loss, quantile loss and KL label-distribution loss. The total loss function is $Loss = \lambda L_{kl} + L_{qloss} + L_{mse}$, where the MSE loss, quantile loss and KL label-distribution loss are respectively:

$L_{mse} = \frac{1}{n}\sum_{i=1}^{n}(y'_i - y_i)^2$

$L_{qloss} = \frac{1}{n}\sum_{i=1}^{n}\max\big(\tau\,(y_i - y'_i),\ (\tau-1)(y_i - y'_i)\big)$

$L_{kl} = \sum_{i} p_i \log\frac{p_i}{\hat{p}_i}$

where $\hat{p}$ represents the predictive distribution, $p$ represents the prior (label) distribution, $y'_i$ represents the predicted value of the regression task, $y_i$ is the original label value, $\tau$ is the quantile level, and $\lambda$ is the weight parameter balancing the relative importance of the losses.
Drawings
FIG. 1 is a flow chart of a face image age estimation method of the present invention;
FIG. 2 is a schematic diagram of a feature extraction neural network of the present invention;
FIG. 3 is a schematic diagram of a single convolutional residual layer of the present invention;
FIG. 4 is a schematic diagram of a dynamic cavity convolution pyramid attention module of the present invention;
FIG. 5 is a schematic diagram of a dynamic hole convolution of the present invention;
fig. 6 is a schematic diagram of the Coordinate Attention attention layer of the present invention.
Detailed Description of Embodiments
The invention will be further described with reference to the drawings and specific examples. It should be understood that the invention may be embodied in various forms and is not limited to the exemplary, non-limiting embodiments shown in the drawings and described below.
In the multi-task age estimation method based on dynamic cavity convolution pyramid attention, different image information is first preprocessed; the preprocessed images are input into the constructed neural network, in which dynamic cavity convolution pyramid attention improves network feature extraction efficiency; the extracted feature maps are input into a multi-task module; and face image age estimation is completed through the joint constraint of regression-flow task loss and label-distribution-flow loss. As shown in fig. 1, the method specifically comprises the following steps:
step 1: preprocess the images so that pixels of different images are unified;
step 2: extract image age features by feeding the images obtained in step 1 into the attention network model to obtain an age feature map;
step 3: perform multi-task learning on the two-scale feature maps obtained in step 2, and reduce the age estimation error through back propagation under the joint constraint of a KL loss function, an MSE loss function and a quantile loss function, obtaining the final result.
S1, preprocessing data.
Each sample picture in the public age sample data set is preprocessed: the image is first reshaped to a 224×224 size specification; random flipping and random rotation are applied to the reshaped image, each with probability 0.5; random erasing is then applied to the R channel of the image with probability 0.5, an area-ratio range of 0.01 to 0.15 and an aspect-ratio range of 0.3 to 3.3; and the image is normalized to obtain the preprocessed picture.
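The preprocessing above can be sketched in NumPy. The random-erasing step below is a minimal illustration of the stated parameters (probability 0.5, area ratio 0.01 to 0.15, aspect ratio 0.3 to 3.3, applied to the R channel); the function names and the retry loop are illustrative, not the patent's implementation.

```python
import numpy as np

def random_erase_r_channel(img, p=0.5, area_range=(0.01, 0.15),
                           aspect_range=(0.3, 3.3), rng=None):
    """Randomly erase one rectangle on the R channel of an HxWx3 image.

    Area ratio and aspect ratio are drawn from the given ranges, matching
    the preprocessing of S1 (a sketch, not the patent's exact code).
    """
    rng = rng or np.random.default_rng()
    img = img.copy()
    if rng.random() >= p:
        return img
    h, w = img.shape[:2]
    for _ in range(10):  # retry until the sampled rectangle fits
        area = rng.uniform(*area_range) * h * w
        aspect = rng.uniform(*aspect_range)
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if 0 < eh <= h and 0 < ew <= w:
            top = rng.integers(0, h - eh + 1)
            left = rng.integers(0, w - ew + 1)
            img[top:top + eh, left:left + ew, 0] = rng.integers(0, 256)
            return img
    return img

def normalize(img, mean, std):
    """Per-channel normalization: (img/255 - mean) / std."""
    return (img.astype(np.float32) / 255.0 - mean) / std
```

In practice the flip/rotation/erasing steps would be chained inside a data-loading pipeline; only the erasing and normalization steps are shown because the others are standard.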
S2, constructing a face image extraction network based on an attention mechanism.
Referring to fig. 2, fig. 2 shows the feature extraction network model constructed by the invention. It is mainly based on a convolution layer, a convolution residual layer, an attention layer and a max-pooling layer, which are stacked to form a feature extraction module; stacking this module three times yields the first-stage extraction network, after which a fully convolutional feature extraction module, formed by stacking a convolution block and an attention layer, is attached to the first-stage network. A global pooling layer is added after the last layer of the overall feature extraction network to analyze the task flows, forming the feature extraction neural network module.
The dynamic cavity convolution pyramid attention layer performs feature learning at different scales on the newly learned feature map by means of dynamic cavity convolution, merges and analyzes the multi-scale learning results, extracts spatial importance dependencies, and obtains an importance weight assigned to each pixel of the image. This attention mechanism improves the information extraction capability of the feature extraction network and the robustness of the whole face age estimation model.
S21, in the dynamic cavity convolution pyramid attention layer, the image first undergoes one channel-dimension compression to reduce the parameter count of the attention method; features are pooled with dynamic cavity convolution kernels of different sizes, restored to the original size by bilinear interpolation, and concatenated along the channel dimension so that feature information from different receptive fields is considered jointly; a channel-importance dependency analysis is introduced, and the processed result is passed into the Coordinate Attention (CA) layer to obtain the final attention feature output. The module is computed as:

$F' = \mathrm{CA}(\mathrm{cat}(f(\mathrm{DynamicConv}_{ks}(W_2(F)))))$

where $W_1$ and $W_2$ are both information-compressing convolution operations, $\mathrm{DynamicConv}_{ks}$ denotes dynamic cavity convolution operations with different kernel sizes, $f$ denotes the bilinear interpolation operation, and CA denotes the Coordinate Attention layer.
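As a rough sketch of the S21 pipeline (channel compression, multi-scale pooling, size restoration, channel-wise concatenation), assuming plain average pooling and nearest-neighbor restoration in place of the patent's dynamic cavity convolutions and bilinear interpolation:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing matmul; x is (C, H, W), w is (C_out, C)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def avg_pool(x, k):
    """Non-overlapping k x k average pooling (assumes H and W divisible by k)."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def upsample_nearest(x, k):
    """Restore the spatial size after k x k pooling by nearest-neighbor repeat."""
    return x.repeat(k, axis=1).repeat(k, axis=2)

def pyramid_branch(x, w_compress, scales=(1, 2, 4)):
    """Channel-compress, pool at several scales, restore, and concatenate."""
    z = conv1x1(x, w_compress)
    branches = [upsample_nearest(avg_pool(z, k), k) if k > 1 else z
                for k in scales]
    return np.concatenate(branches, axis=0)  # cat along the channel dimension
```

The scales (1, 2, 4) and the compression matrix are placeholders; in the patent the multi-scale pooling is carried out by dynamic cavity convolution kernels of different sizes.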
S22, in the dynamic cavity convolution pyramid attention layer, the dynamic convolution kernel of the dynamic cavity convolution layer is computed as:

$\tilde{W}(x) = \sum_{k=1}^{K}\pi_k(x)\,W_k, \qquad \tilde{b}(x) = \sum_{k=1}^{K}\pi_k(x)\,b_k$

where $W_k$ represents the weight value of a single convolution kernel, $b_k$ represents the deviation (bias) value of a single convolution kernel, and $\pi_k(x)$ is the input-dependent attention weight of the $k$-th kernel, with $\sum_{k}\pi_k(x)=1$.
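The kernel aggregation of S22 can be sketched as follows; the softmax attention over kernels is the standard dynamic-convolution formulation and is an assumption here, since the patent text does not spell out how the weights π_k are produced.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def aggregate_dynamic_kernel(kernels, biases, logits):
    """Combine K kernels W_k and biases b_k with attention weights pi_k.

    kernels: (K, ...) stack of single-kernel weight tensors
    biases:  (K,)     single-kernel deviation (bias) values
    logits:  (K,)     attention scores computed from the input
    Returns the aggregated kernel W~ and bias b~ used for one convolution.
    """
    pi = softmax(logits)                        # sum(pi) == 1
    w = np.tensordot(pi, kernels, axes=(0, 0))  # W~ = sum_k pi_k W_k
    b = float(pi @ biases)                      # b~ = sum_k pi_k b_k
    return w, b
```

A single ordinary convolution with the aggregated (W~, b~) then realizes the dynamic convolution, which is what makes the operation cheap at inference time.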
S23, in the dynamic cavity convolution pyramid attention layer, the image size is restored by bilinear interpolation for subsequent spatial-dimension attention learning. The bilinear function interpolation is computed as:

$f(x,y) = \frac{1}{(x_2-x_1)(y_2-y_1)}\big[(x_2-x)(y_2-y)f(Q_{11}) + (x-x_1)(y_2-y)f(Q_{21}) + (x_2-x)(y-y_1)f(Q_{12}) + (x-x_1)(y-y_1)f(Q_{22})\big]$

where $Q_{11}=(x_1,y_1)$, $Q_{21}=(x_2,y_1)$, $Q_{12}=(x_1,y_2)$ and $Q_{22}=(x_2,y_2)$ are the four grid points surrounding $(x,y)$.
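A minimal bilinear resize implementing the S23 interpolation (align-corners sampling is an assumption here; frameworks differ on corner handling):

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinearly resize a (H, W) map to (out_h, out_w), align-corners style."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)   # sample positions in the source grid
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]             # fractional offsets = blend weights
    wx = (xs - x0)[None, :]
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x1] * wx
    bot = x[y1][:, x0] * (1 - wx) + x[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

Each multi-scale branch would apply this per channel to restore the pooled map to the input resolution before concatenation.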
and S24, the CA attention layer maps the original spatial information of the image in a parallel two-dimensional coding mode, and independently performs aggregation coding on the features from the vertical direction and the horizontal direction to form spatial dependency analysis. And splitting the dependence relationship of the channel dimension so that the attention dependence relationship of the channel dimension introduces the attention dependence analysis of the space dimension to form a cross attention perception graph. The formula is shown below.
f=δ(F 1 ([z h ,z w ]))
Wherein the method comprises the steps ofThe characteristic diagrams in the width direction and the height direction are respectively obtained, and f is the characteristic diagram f obtained after the characteristic diagram subjected to normalization processing is transmitted into a Sigmod function. At this time, the attention weights of the height and width in the space dimension of the feature map are d h =σ(F h (f h )),d w =σ(F w (f w )). The final dynamic cavity convolution pyramid attention output feature map formula is as follows.
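The S24 coordinate-attention computation can be sketched as follows; the encodings F1, F_h and F_w are simplified to identity maps here, so the sketch shows only the pool-concatenate-split-sigmoid-reweight structure, not the patent's learned transforms.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def coordinate_attention(x):
    """Simplified Coordinate Attention on a (C, H, W) feature map.

    z^h pools over the width, z^w over the height; the learned encodings
    F1, F_h, F_w of the patent are taken as identity for brevity.
    """
    c, h, w = x.shape
    z_h = x.mean(axis=2)                    # (C, H): aggregation along width
    z_w = x.mean(axis=1)                    # (C, W): aggregation along height
    f = np.concatenate([z_h, z_w], axis=1)  # shared encoding input [z^h, z^w]
    f_h, f_w = f[:, :h], f[:, h:]           # split back into the two directions
    d_h = sigmoid(f_h)[:, :, None]          # height attention, (C, H, 1)
    d_w = sigmoid(f_w)[:, None, :]          # width attention,  (C, 1, W)
    return x * d_h * d_w                    # cross attention perception map
```

The output reweights every pixel by its row and column attention, which is the F'' = F' × d^h × d^w step of the formula above.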
S3, multi-task learning.
The invention analyzes and learns the final feature information map extracted by the feature extraction network with an asymmetric label-distribution task and a regression-flow task, constrained by a new loss-function combination of MSE loss, quantile loss and KL label-distribution loss. The total loss function is $Loss = \lambda L_{kl} + L_{qloss} + L_{mse}$, where the MSE loss, quantile loss and KL label-distribution loss are respectively:

$L_{mse} = \frac{1}{n}\sum_{i=1}^{n}(y'_i - y_i)^2$

$L_{qloss} = \frac{1}{n}\sum_{i=1}^{n}\max\big(\tau\,(y_i - y'_i),\ (\tau-1)(y_i - y'_i)\big)$

$L_{kl} = \sum_{i} p_i \log\frac{p_i}{\hat{p}_i}$

where $\hat{p}$ represents the predictive distribution, $p$ represents the prior (label) distribution, $y'_i$ represents the predicted value of the regression task, $y_i$ is the original label value, $\tau$ is the quantile level, and $\lambda$ is the weight parameter balancing the relative importance of the losses.
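The total loss Loss = λ·L_kl + L_qloss + L_mse can be sketched as below; the quantile level τ and the exact direction of the KL divergence are assumptions where the patent text is not explicit.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    return float(np.mean((y_pred - y_true) ** 2))

def quantile_loss(y_pred, y_true, tau=0.5):
    """Pinball loss at quantile level tau (tau is an assumed hyperparameter)."""
    e = y_true - y_pred
    return float(np.mean(np.maximum(tau * e, (tau - 1.0) * e)))

def kl_loss(p_label, p_pred, eps=1e-12):
    """KL(label distribution || predicted distribution), averaged over samples."""
    p_label = np.clip(p_label, eps, None)
    p_pred = np.clip(p_pred, eps, None)
    return float(np.mean(np.sum(p_label * np.log(p_label / p_pred), axis=-1)))

def total_loss(y_pred, y_true, p_label, p_pred, lam=1.0, tau=0.5):
    """Loss = lambda * L_kl + L_qloss + L_mse, as in the patent's combination."""
    return lam * kl_loss(p_label, p_pred) + quantile_loss(y_pred, y_true, tau) \
        + mse_loss(y_pred, y_true)
```

During training this scalar would be back-propagated through both the regression head and the label-distribution head, so the three terms constrain each other as described.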

Claims (9)

1. A multi-task age estimation method based on dynamic cavity convolution pyramid attention, characterized in that: different image information is first preprocessed; the preprocessed images are input into the constructed neural network, in which dynamic cavity convolution pyramid attention improves network feature extraction efficiency; the extracted feature maps are input into a multi-task module; and face image age estimation is completed through the joint constraint of regression-flow task loss and label-distribution-flow loss.
2. The multi-task age estimation method based on dynamic cavity convolution pyramid attention according to claim 1, characterized in that the method mainly comprises the following steps:
step 1: preprocess the images so that pixels of different images are unified;
step 2: extract image age features by feeding the images obtained in step 1 into the attention network model to obtain an age feature map;
step 3: perform multi-task learning on the two-scale feature maps obtained in step 2, and reduce the age estimation error through back propagation under the joint constraint of a KL loss function, an MSE loss function and a quantile loss function, obtaining the final result.
3. The multi-task age estimation method based on dynamic cavity convolution pyramid attention according to claim 2, characterized in that the image preprocessing comprises the following steps:
step 1: process the input target image, reshaping it to a 224×224 size specification;
step 2: apply random flipping and random rotation to the reshaped image, each with probability 0.5;
step 3: apply random erasing to the R channel of the image with probability 0.5, an area-ratio range of 0.01 to 0.15 and an aspect-ratio range of 0.3 to 3.3, and normalize the image.
4. The multi-task age estimation method according to claim 2, characterized in that the feature extraction neural network is constructed as follows: a feature extraction module is formed by stacking a convolution layer, a convolution residual layer, an attention layer and a max-pooling layer; stacking this module three times yields the first-stage extraction network; a fully convolutional feature extraction module, formed by stacking a convolution block and an attention layer, is attached to the first-stage network; and a global pooling layer is added after the last layer of the overall feature extraction network to analyze the task flows, forming the feature extraction neural network module.
5. The multi-task age estimation method based on dynamic cavity convolution pyramid attention according to claim 2, characterized in that: the dynamic cavity convolution pyramid attention layer of step 2 performs one channel-dimension compression on the feature map to reduce the parameter count of the attention method, pools the features with dynamic cavity convolution kernels of different sizes, restores the processed features to the original size by bilinear interpolation, concatenates them along the channel dimension so that feature information from different receptive fields is considered jointly, introduces a channel-importance dependency analysis, and passes the processed result into a Coordinate Attention (CA) layer to obtain the final attention feature output; the module is computed as:

$F' = \mathrm{CA}(\mathrm{cat}(f(\mathrm{DynamicConv}_{ks}(W_2(F)))))$

where $W_1$ and $W_2$ are both information-compressing convolution operations, $\mathrm{DynamicConv}_{ks}$ denotes dynamic cavity convolution operations with different kernel sizes, $f$ denotes the bilinear interpolation operation, and CA denotes the Coordinate Attention layer.
6. The multi-task age estimation method based on dynamic cavity convolution pyramid attention according to claim 4, characterized in that the dynamic convolution kernel of the dynamic cavity convolution layer is computed as:

$\tilde{W}(x) = \sum_{k=1}^{K}\pi_k(x)\,W_k, \qquad \tilde{b}(x) = \sum_{k=1}^{K}\pi_k(x)\,b_k$

where $W_k$ represents the weight value of a single convolution kernel, $b_k$ represents the deviation (bias) value of a single convolution kernel, and $\pi_k(x)$ is the input-dependent attention weight of the $k$-th kernel.
7. The multi-task age estimation method based on dynamic cavity convolution pyramid attention according to claim 4, characterized in that the bilinear function interpolation is computed as:

$f(x,y) = \frac{1}{(x_2-x_1)(y_2-y_1)}\big[(x_2-x)(y_2-y)f(Q_{11}) + (x-x_1)(y_2-y)f(Q_{21}) + (x_2-x)(y-y_1)f(Q_{12}) + (x-x_1)(y-y_1)f(Q_{22})\big]$

where $Q_{11}=(x_1,y_1)$, $Q_{21}=(x_2,y_1)$, $Q_{12}=(x_1,y_2)$ and $Q_{22}=(x_2,y_2)$ are the four grid points surrounding $(x,y)$.
8. The multi-task age estimation method based on dynamic cavity convolution pyramid attention according to claim 4, characterized in that the Coordinate Attention layer maps the original spatial information of the image through two parallel one-dimensional encodings, aggregating and encoding the features independently along the vertical and horizontal directions to form a spatial dependency analysis; the channel-dimension dependency is split so that the attention dependency of the channel dimension incorporates the attention dependency analysis of the spatial dimension, forming a cross attention perception map:

$f = \delta(F_1([z^h, z^w]))$

where $z^h$ and $z^w$ respectively represent the feature maps obtained along the height and width directions, and $f$ represents the feature map obtained after the normalized concatenation is passed through the Sigmoid nonlinearity $\delta$. The attention weights of height and width in the spatial dimension are then $d^h = \sigma(F_h(f^h))$ and $d^w = \sigma(F_w(f^w))$, and the final dynamic cavity convolution pyramid attention output feature map is

$F''_c(i,j) = F'_c(i,j) \times d^h_c(i) \times d^w_c(j)$.
9. The multi-task age estimation method based on dynamic cavity convolution pyramid attention according to claim 1, characterized in that multi-task learning is performed on the finally extracted feature map: a regression-flow task and an asymmetric label-distribution task are combined under a new loss-function constraint consisting of MSE loss, quantile loss and KL label-distribution loss. The total loss function is $Loss = \lambda L_{kl} + L_{qloss} + L_{mse}$, where the MSE loss, quantile loss and KL label-distribution loss are respectively:

$L_{mse} = \frac{1}{n}\sum_{i=1}^{n}(y'_i - y_i)^2$

$L_{qloss} = \frac{1}{n}\sum_{i=1}^{n}\max\big(\tau\,(y_i - y'_i),\ (\tau-1)(y_i - y'_i)\big)$

$L_{kl} = \sum_{i} p_i \log\frac{p_i}{\hat{p}_i}$

where $\hat{p}$ represents the predictive distribution, $p$ represents the prior (label) distribution, $y'_i$ represents the predicted value of the regression task, $y_i$ is the original label value, $\tau$ is the quantile level, and $\lambda$ is the weight parameter balancing the relative importance of the losses.
CN202311163492.1A 2023-09-11 2023-09-11 Multi-task age estimation method based on dynamic cavity convolution pyramid attention Pending CN117152823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311163492.1A CN117152823A (en) 2023-09-11 2023-09-11 Multi-task age estimation method based on dynamic cavity convolution pyramid attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311163492.1A CN117152823A (en) 2023-09-11 2023-09-11 Multi-task age estimation method based on dynamic cavity convolution pyramid attention

Publications (1)

Publication Number Publication Date
CN117152823A true CN117152823A (en) 2023-12-01

Family

ID=88909796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311163492.1A Pending CN117152823A (en) 2023-09-11 2023-09-11 Multi-task age estimation method based on dynamic cavity convolution pyramid attention

Country Status (1)

Country Link
CN (1) CN117152823A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117973645A (en) * 2024-04-02 2024-05-03 华东交通大学 Photovoltaic power prediction method
CN117973645B (en) * 2024-04-02 2024-07-05 华东交通大学 Photovoltaic power prediction method

Similar Documents

Publication Publication Date Title
CN114202672A (en) Small target detection method based on attention mechanism
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN111160108A (en) Anchor-free face detection method and system
CN111209921A (en) License plate detection model based on improved YOLOv3 network and construction method
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN113536925B (en) Crowd counting method based on attention guiding mechanism
CN111027576A (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN113033454B (en) Method for detecting building change in urban video shooting
CN110991444A (en) Complex scene-oriented license plate recognition method and device
CN112733693B (en) Multi-scale residual error road extraction method for global perception high-resolution remote sensing image
CN117152823A (en) Multi-task age estimation method based on dynamic cavity convolution pyramid attention
CN115410059B (en) Remote sensing image part supervision change detection method and device based on contrast loss
CN112084911A (en) Human face feature point positioning method and system based on global attention
CN115311502A (en) Remote sensing image small sample scene classification method based on multi-scale double-flow architecture
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN117079098A (en) Space small target detection method based on position coding
CN115830449A (en) Remote sensing target detection method with explicit contour guidance and spatial variation context enhancement
CN114299305A (en) Salient object detection algorithm for aggregating dense and attention multi-scale features
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
CN113743593A (en) Neural network quantization method, system, storage medium and terminal
CN115995079A (en) Image semantic similarity analysis method and homosemantic image retrieval method
Sánchez et al. Robust multiband image segmentation method based on user clues
Yu et al. Construction of garden landscape design system based on multimodal intelligent computing and deep neural network
CN117058437B (en) Flower classification method, system, equipment and medium based on knowledge distillation
CN118135405B (en) Optical remote sensing image road extraction method and system based on self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination