CN115984934A - Training method of face pose estimation model, face pose estimation method and device - Google Patents

Training method of face pose estimation model, face pose estimation method and device

Info

Publication number
CN115984934A
Authority
CN
China
Prior art keywords
layer
loss value
feature
feature map
pose estimation
Legal status
Pending
Application number
CN202310008849.2A
Other languages
Chinese (zh)
Inventor
祁晓婷
黄泽元
杨战波
蒋召
Current Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Original Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Application filed by Beijing Longzhi Digital Technology Service Co Ltd
Priority to CN202310008849.2A
Publication of CN115984934A

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure relates to the field of artificial intelligence, and provides a training method of a face pose estimation model, a face pose estimation method and a face pose estimation device. The method comprises the following steps: acquiring a face image and inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image; and performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model. The network structure of the initial face pose estimation model provided by the disclosure is lightweight and has strong feature extraction capability; the model has a high inference speed and high accuracy, places only modest demands on computing power, and can be deployed for use on edge devices with limited computing power.

Description

Training method of face pose estimation model, face pose estimation method and device
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to a training method of a face pose estimation model, a face pose estimation method and a face pose estimation device.
Background
Face pose estimation is a key step in application scenarios such as face recognition systems, human-computer interaction systems and access control systems, and accurate face pose estimation helps to improve the accuracy of face recognition and the overall performance of such systems.
At present, conventional face pose estimation models generally involve a large amount of computation, have a slow inference speed and low accuracy, place high demands on computing power, and are therefore difficult to deploy on edge devices with limited computing power.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a training method for a face pose estimation model, a face pose estimation method, and a corresponding device, so as to solve the problems that existing face pose estimation models are generally computation-heavy, slow in inference, low in accuracy, demanding in computing power, and difficult to deploy on edge devices with limited computing power.
In a first aspect of the embodiments of the present disclosure, a training method for a face pose estimation model is provided, including:
acquiring a face image, and inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch of normalization layers, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
In a second aspect of the embodiments of the present disclosure, a face pose estimation method is provided, including:
acquiring a face image to be recognized;
inputting the face image to be recognized into a final face pose estimation model, and outputting a yaw angle, a pitch angle and a roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model according to the first aspect.
In a third aspect of the embodiments of the present disclosure, a training apparatus for a face pose estimation model is provided, including:
the acquisition module is configured to acquire a face image, and input the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
the training module is configured to perform iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
In a fourth aspect of the embodiments of the present disclosure, there is provided a face pose estimation apparatus, including:
the image acquisition module is configured to acquire a face image to be recognized;
and the recognition module is configured to input the face image to be recognized into the final face pose estimation model and output the yaw angle, the pitch angle and the roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model according to the first aspect.
In a fifth aspect of the embodiments of the present disclosure, there is provided an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a sixth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.
Compared with the prior art, the beneficial effects of the embodiment of the disclosure at least comprise: obtaining a yaw angle loss value, a pitch angle loss value and a roll angle loss value of a face image by acquiring the face image and inputting it into a pre-constructed initial face pose estimation model; performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model; the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer. The network structure of the initial face pose estimation model provided by the disclosure is lightweight and has strong feature extraction capability; the model has a high inference speed and high accuracy, places only modest demands on computing power, and can be deployed for use on edge devices with limited computing power.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive efforts.
Fig. 1 is a schematic flowchart of a training method for a face pose estimation model according to an embodiment of the present disclosure;
fig. 2 is a schematic network structure diagram of an initial face pose estimation model in a training method for a face pose estimation model according to an embodiment of the present disclosure;
fig. 3 is a schematic network structure diagram of a feature fusion network in the training method of the face pose estimation model according to the embodiment of the present disclosure;
fig. 4 is a schematic network structure diagram of a loss calculation network in the training method of the face pose estimation model according to the embodiment of the present disclosure;
fig. 5 is a schematic flow chart of a face pose estimation method provided by the embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a training device for a face pose estimation model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a face pose estimation apparatus provided in the embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
The following describes a training method of a face pose estimation model, a face pose estimation method, and an apparatus according to embodiments of the present disclosure in detail with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a training method for a face pose estimation model according to an embodiment of the present disclosure. As shown in fig. 1, the training method of the face pose estimation model includes:
step S101, a face image is obtained and input into a pre-constructed initial face pose estimation model, and a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image are obtained.
The face image generally refers to an image including a face, which is acquired by an image pickup device (such as a monocular camera, a binocular camera, etc.).
Fig. 2 is a schematic network structure diagram of an initial face pose estimation model provided in the embodiment of the present disclosure. As shown in fig. 2, the initial face pose estimation model includes a first feature extraction network 201 (hereinafter referred to as "block1"), a second feature extraction network 202 (hereinafter referred to as "block2"), a third feature extraction network 203 (hereinafter referred to as "block3"), a feature fusion network 204 (hereinafter referred to as "Multi-features Fusion"), and a loss calculation network 205; the first feature extraction network 201 includes a first depth separable convolution layer (Depthwise Separable Conv) 2011, a first batch normalization layer (BN) 2012, a first activation function layer (ReLU) 2013, and a first average pooling layer (Average Pooling) 2014; the second feature extraction network 202 includes a second depth separable convolution layer (Depthwise Separable Conv) 2021, a second batch normalization layer (BN) 2022, a second activation function layer (ReLU) 2023, a first attention layer (Transformer encoder) 2024, and a second average pooling layer (Average Pooling) 2025; the third feature extraction network 203 includes a third depth separable convolution layer (Depthwise Separable Conv) 2031, a third batch normalization layer (BN) 2032, a third activation function layer (ReLU) 2033, and a second attention layer (Transformer encoder) 2034.
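By way of illustration, the following is a minimal PyTorch sketch of the depthwise separable convolution building block and of one stage of the first feature extraction network (depthwise separable convolution, batch normalization, ReLU and average pooling); the kernel sizes, strides and intermediate channel numbers are assumptions made for the sketch and are not prescribed by this embodiment.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 convolution
    followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class Block1Stage(nn.Module):
    """One stage of the first feature extraction network:
    depthwise separable conv -> BN -> ReLU -> 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_channels, out_channels)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(self.relu(self.bn(self.conv(x))))

# Usage: three such stages take a 128x128x3 face image down to a 16x16x128 feature map.
if __name__ == "__main__":
    x = torch.randn(1, 3, 128, 128)                      # (N, C, H, W)
    stages = nn.Sequential(Block1Stage(3, 32),
                           Block1Stage(32, 64),
                           Block1Stage(64, 128))
    print(stages(x).shape)                               # torch.Size([1, 128, 16, 16])
```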
Step S102, performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model.
As an example, suppose that N face images (N is a positive integer larger than or equal to 1) are collected from a camera device installed in a hall of a building. The N face images are input into the pre-constructed initial face pose estimation model, and after a first round of training a yaw angle loss value (Loss_yaw), a pitch angle loss value (Loss_pitch) and a roll angle loss value (Loss_roll) are output for each face image. The Loss_yaw, Loss_pitch and Loss_roll obtained after the first round of training are then back-propagated to guide further optimization of the initial face pose estimation model, so that a better face pose prediction effect is obtained. A second round of training is performed according to the same steps, and so on, until a preset training round threshold (which can be set according to actual conditions, for example, 50 rounds, 60 rounds, 70 rounds and the like) or a preset model precision is reached; the preset iteration termination condition is then considered to be met, and the final face pose estimation model is obtained.
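As a rough illustration of this training procedure, a minimal PyTorch training loop could look like the sketch below; the optimizer, learning rate, data-loader interface and the simple summation of the three loss values before back-propagation are assumptions made for the sketch, not requirements of this embodiment.

```python
import torch

def train_pose_model(model, dataloader, num_epochs=50, lr=1e-3, device="cpu"):
    """Iteratively trains an initial face pose estimation model that returns the three
    loss values (Loss_yaw, Loss_pitch, Loss_roll) for a batch of images and targets."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(num_epochs):                        # preset training-round threshold
        for images, yaw_gt, pitch_gt, roll_gt in dataloader:
            images = images.to(device)
            targets = (yaw_gt.to(device), pitch_gt.to(device), roll_gt.to(device))
            loss_yaw, loss_pitch, loss_roll = model(images, targets)
            loss = loss_yaw + loss_pitch + loss_roll       # back-propagate the three losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```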
According to the technical scheme provided by the embodiment of the disclosure, a yaw angle loss value, a pitch angle loss value and a roll angle loss value of a face image are obtained by acquiring the face image and inputting it into a pre-constructed initial face pose estimation model; iterative training is then performed on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model. The initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer. The network structure of the initial face pose estimation model provided by the disclosure is lightweight and has strong feature extraction capability; the model has a high inference speed and high accuracy, places only modest demands on computing power, and can be deployed for use on edge devices with limited computing power.
In some embodiments, the method includes inputting a face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value, and a roll angle loss value of the face image, and specifically includes:
carrying out feature extraction on the face image by using a first feature extraction network to obtain a first feature map;
inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map;
inputting the second feature map into a third feature extraction network for feature extraction to obtain a third feature map;
inputting the second feature map and the third feature map into a feature fusion network for feature fusion to obtain a fusion feature map;
and inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram.
With reference to fig. 2, the first feature extraction network 201 is composed of three block1 units, where each block1 includes a first depth separable convolution layer (Depthwise Separable Conv) 2011, a first batch normalization layer (BN) 2012, a first activation function layer (ReLU) 2013, and a first average pooling layer (Average Pooling) 2014.
Suppose that a face image f with dimensions [128, 128, 3] (H, W, C) is input into the first feature extraction network 201, where H represents the picture height, W represents the picture width, and C represents the number of channels. After the face image f is subjected to feature extraction by the first feature extraction network 201, a first feature map f1 with dimensions [16, 16, 128] (H, W, C) is output. The first feature map f1 is then input into the second feature extraction network 202 for feature extraction, and a second feature map f2 with dimensions [8, 8, 256] (H, W, C) is obtained. The second feature map f2 is then input into the third feature extraction network 203 for feature extraction, and a third feature map f3 with dimensions [4, 4, 512] (H, W, C) is obtained. The second feature map f2 and the third feature map f3 are then input into the feature fusion network 204 for feature fusion, so as to obtain a fused feature map (fused feature) with dimensions [8, 8, 256] (H, W, C). Finally, the fused feature map is input into the loss calculation network 205, and the yaw angle loss value Loss_yaw, the pitch angle loss value Loss_pitch, and the roll angle loss value Loss_roll of the fused feature map are calculated.
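The data flow just described can be summarized by the following sketch (shapes given in the (N, C, H, W) order used by PyTorch; the five sub-networks are assumed to be callable modules with the interfaces shown):

```python
def forward_flow(block1, block2, block3, fusion, loss_net, face_image, targets):
    """Forward pass of the initial face pose estimation model.

    face_image: tensor of shape (N, 3, 128, 128) -- input face image f
    targets: ground-truth yaw/pitch/roll labels used by the loss calculation network
    """
    f1 = block1(face_image)        # first feature map,  approx. (N, 128, 16, 16)
    f2 = block2(f1)                # second feature map, approx. (N, 256, 8, 8)
    f3 = block3(f2)                # third feature map,  approx. (N, 512, 4, 4)
    fused = fusion(f2, f3)         # fused feature map,  approx. (N, 256, 8, 8)
    # loss calculation network returns Loss_yaw, Loss_pitch, Loss_roll
    return loss_net(fused, targets)
```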
In some embodiments, inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map specifically includes:
inputting the first feature map into a second depth separable convolution layer, a second batch normalization layer and a second activation function layer for feature extraction to obtain a first extracted feature map;
inputting the first extracted feature map into a first attention layer to obtain first global feature information of the first extracted feature map;
and inputting the first global feature information and the first extracted feature map into a second average pooling layer, and outputting a second feature map.
Referring to fig. 2, the second feature extraction network 202 includes two Depthwise Separable Conv (second depth separable convolution layer 2021) + BN (second batch normalization layer 2022) + ReLU (second activation function layer 2023) network layers, a first attention layer (Transformer encoder) 2024 and a second average pooling layer (Average Pooling) 2025.
In connection with the above example, the first feature map f1 with dimensions [16, 16, 128] (H, W, C) is input into the two Depthwise Separable Conv (second depth separable convolution layer 2021) + BN (second batch normalization layer 2022) + ReLU (second activation function layer 2023) network layers of the second feature extraction network 202 for feature extraction, resulting in a first extracted feature map with dimensions [16, 16, 256] (H, W, C). The first extracted feature map is then input into the first attention layer 2024, so as to obtain first global feature information of the first extracted feature map. Then, the first global feature information and the first extracted feature map with dimensions [16, 16, 256] (H, W, C) are input into the second average pooling layer 2025, and the second feature map f2 with dimensions [8, 8, 256] (H, W, C) is output.
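A possible PyTorch sketch of the second feature extraction network is given below; nn.MultiheadAttention is used here only as a stand-in for the Transformer-encoder attention layer (the exact single-head formulation of this embodiment is given by formulas (1) and (2) below), the channel numbers are assumptions, and the element-wise addition used to combine the global feature information with the first extracted feature map before pooling is likewise an assumption.

```python
import torch
import torch.nn as nn

class Block2(nn.Module):
    """Second feature extraction network: two (depthwise separable conv + BN + ReLU)
    stages, an attention layer over the flattened spatial positions, and 2x2 average pooling."""
    def __init__(self, in_channels=128, out_channels=256):
        super().__init__()
        def dw_stage(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False),  # depthwise
                nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True))
        self.stage1 = dw_stage(in_channels, out_channels)
        self.stage2 = dw_stage(out_channels, out_channels)
        # stand-in for the first attention layer (Transformer encoder)
        self.attn = nn.MultiheadAttention(embed_dim=out_channels, num_heads=1,
                                          batch_first=True)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):
        x = self.stage2(self.stage1(x))                    # first extracted feature map
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (N, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)    # first global feature information
        x = x + attn_out.transpose(1, 2).reshape(n, c, h, w)
        return self.pool(x)                                # second feature map

# Block2()(torch.randn(1, 128, 16, 16)).shape -> torch.Size([1, 256, 8, 8])
```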
In some embodiments, inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map comprises:
performing matrix transformation processing on the first extracted characteristic graph to obtain a first transformation matrix;
inputting the first transformation matrix into the first attention layer to obtain a first characteristic parameter, a second characteristic parameter and a third characteristic parameter;
and determining first global feature information of the first extracted feature map according to the first feature parameter, the second feature parameter and the third feature parameter.
In combination with the above example, the first extracted feature map with dimensions [16, 16, 256] (H, W, C) is subjected to matrix transformation processing, that is, it is processed by using a reshape function (a reshape function, as in MATLAB, transforms a given matrix into a matrix of a specified shape without changing the number of elements, readjusting the number of rows, columns and dimensions of the matrix), so as to obtain a first transformation matrix with feature dimensions [16 × 16, 256] (H × W, C).
Then, the first transformation matrix is input into the first attention layer 2024 to obtain a first feature parameter W_k, a second feature parameter W_q and a third feature parameter W_v, and the corresponding key, query and value are generated. The key represents the other pixel points in the face image f, the query represents the current pixel point in the face image f, and the value represents the similarity between the current pixel point and the other pixel points.
Let K = f*W_k, Q = f*W_q, V = f*W_v, where W_k ∈ R^(C×C'), W_q ∈ R^(C×C') and W_v ∈ R^(C×C). For the generated Q and K, a weight coefficient a is calculated using the following formula (1):
a = softmax(Q*K^T / sqrt(C'))      (1)
where C' represents the dimension of Q and K, and K^T represents the transpose of K.
The calculated weight coefficient a is then substituted into the following formula (2) to calculate the first global feature information f_att:
f_att = a*V      (2)
Then, the calculated first global feature information f_att and the first extracted feature map with dimensions [16, 16, 256] (H, W, C) are input into the second average pooling layer 2025, and the second feature map f2 with dimensions [8, 8, 256] (H, W, C) is output.
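A minimal sketch of this attention computation, assuming learnable projection matrices W_q, W_k and W_v implemented as linear layers and an assumed projection dimension C' (here 64), is shown below.

```python
import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Attention of formulas (1) and (2): K = f*W_k, Q = f*W_q, V = f*W_v,
    a = softmax(Q*K^T / sqrt(C')), f_att = a*V."""
    def __init__(self, channels, proj_dim):
        super().__init__()
        self.w_q = nn.Linear(channels, proj_dim, bias=False)   # W_q in R^{C x C'}
        self.w_k = nn.Linear(channels, proj_dim, bias=False)   # W_k in R^{C x C'}
        self.w_v = nn.Linear(channels, channels, bias=False)   # W_v in R^{C x C}
        self.proj_dim = proj_dim

    def forward(self, feature_map):
        # feature_map: (N, C, H, W) -> first transformation matrix of shape (N, H*W, C)
        n, c, h, w = feature_map.shape
        f = feature_map.flatten(2).transpose(1, 2)
        q, k, v = self.w_q(f), self.w_k(f), self.w_v(f)
        a = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.proj_dim), dim=-1)  # (1)
        f_att = a @ v                                                                # (2)
        return f_att                                       # first global feature information

# Usage on the first extracted feature map of shape (N, 256, 16, 16):
# attn = SingleHeadAttention(channels=256, proj_dim=64)
# f_att = attn(torch.randn(2, 256, 16, 16))   # -> (2, 256, 256), i.e. (N, H*W, C)
```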
Referring to fig. 2, the third feature extraction network 203 includes a third depth separable convolution layer (Depthwise Separable Conv) 2031, a third batch normalization layer (BN) 2032, a third activation function layer (ReLU) 2033, and a second attention layer (Transformer encoder) 2034.
The second feature map f2 obtained as described above is input into the third feature extraction network 203, and feature extraction is performed using the third depth separable convolution layer 2031, the third batch normalization layer 2032, the third activation function layer 2033, and the second attention layer 2034, thereby obtaining a third feature map f3.
The feature extraction processes of the first attention layer 2024 and the second attention layer 2034 are substantially the same.
In the embodiment of the present disclosure, the first attention layer 2024 and the second attention layer 2034 which are arranged in the second feature extraction network 202 and the third feature extraction network 203 can effectively enhance the feature extraction capability of the networks, obtain more comprehensive image features, and facilitate improvement of subsequent accuracy of face pose estimation.
In some embodiments, the feature fusion network includes an upsampling layer, a fourth average pooling layer, a first fully connected layer, a fourth activation function layer.
Inputting the second feature diagram and the third feature diagram into a feature fusion network for feature fusion to obtain a fusion feature diagram, wherein the fusion feature diagram comprises:
processing the third feature map by utilizing an upsampling layer, a fourth average pooling layer, a first fully connected layer and a fourth activation function layer to obtain a fourth feature map;
and fusing the fourth feature map and the second feature map to obtain a fused feature map.
Fig. 3 is a schematic network structure diagram of a feature fusion network 204 provided in an embodiment of the present disclosure. As shown in fig. 3, the feature fusion network 204 includes an upsampling layer (Upsample) 301, a fourth Average Pooling layer (Average Pooling) 302, a first fully connected layer (FC) 303, and a fourth activation function layer (sigmoid function layer) 304.
In some embodiments, processing the third feature map with the upsampling layer, the fourth average pooling layer, the first full link layer, and the fourth activation function layer to obtain a fourth feature map includes:
processing the third characteristic diagram by using an up-sampling layer to obtain an up-sampling characteristic diagram;
inputting the up-sampling feature map into a fourth average pooling layer, and outputting an average feature map;
inputting the average characteristic diagram into the first full-connection layer and the fourth activation function layer in sequence to obtain a channel weight coefficient of the average characteristic diagram;
and calculating to obtain a fourth feature map according to the channel weight coefficient and the third feature map.
As an example, the third feature map f3 with dimensions [4, 4, 512] (H, W, C) obtained above is input into the upsampling layer (upsample) 301 for processing, so as to obtain an upsampled feature map with dimensions [8, 8, 256] (H, W, C). Then, the upsampled feature map is input into the fourth Average Pooling layer (Average Pooling) 302, and the average feature map is output. Then, the average feature map is input into the first fully connected layer (FC) 303 and the fourth activation function layer (sigmoid function layer) 304 in sequence for feature extraction to obtain the channel weight coefficient of the average feature map; the channel weight coefficient of the average feature map is then multiplied by the third feature map f3 to obtain a fourth feature map f4. Finally, the fourth feature map f4 is added to the second feature map f2 to obtain a fused feature map (fused feature) with dimensions [8, 8, 256] (H, W, C).
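A sketch of this fusion procedure under the shapes given above follows. Where the text leaves details open, assumptions are made for the sketch: the channel reduction from 512 to 256 is folded into the upsampling step as a 1×1 convolution, and the channel weight coefficients are applied to the upsampled feature map so that the shapes match the fused feature map of [8, 8, 256].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureFusion(nn.Module):
    """Feature fusion network 204: upsample f3, derive per-channel weights with
    average pooling + FC + sigmoid, reweight, and add the result to f2."""
    def __init__(self, f2_channels=256, f3_channels=512):
        super().__init__()
        # channel reduction assumed to be part of the upsampling step (1x1 conv)
        self.reduce = nn.Conv2d(f3_channels, f2_channels, kernel_size=1, bias=False)
        self.fc = nn.Linear(f2_channels, f2_channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f2, f3):
        # upsampling layer: (N, 512, 4, 4) -> (N, 256, 8, 8)
        up = self.reduce(F.interpolate(f3, size=f2.shape[-2:], mode="bilinear",
                                       align_corners=False))
        # fourth average pooling layer -> average feature map (one value per channel)
        avg = F.adaptive_avg_pool2d(up, 1).flatten(1)          # (N, 256)
        # first fully connected layer + sigmoid -> channel weight coefficients
        weights = self.sigmoid(self.fc(avg)).unsqueeze(-1).unsqueeze(-1)
        f4 = up * weights                                      # fourth feature map
        return f4 + f2                                         # fused feature map

# MultiFeatureFusion()(torch.randn(1, 256, 8, 8), torch.randn(1, 512, 4, 4)).shape
# -> torch.Size([1, 256, 8, 8])
```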
The feature fusion network provided by the embodiment of the disclosure can provide different importance degrees or association degrees for the prediction target of the current layer according to the features of different layers, and provide different weight coefficients for the features of different layers according to the importance degrees, so that the fused features contain richer semantic information. Specifically, the second feature map f2 usually contains low-level semantic information of spatial features of the face image, and the third feature map f3 usually contains high-level semantic information of some abstract features of the face image. By fusing the second feature map f2 and the third feature map f3 in the above manner, richer semantic information can be obtained, and the face pose estimation accuracy of the model can be improved.
In some embodiments, the loss calculation network includes a yaw loss calculation layer, a pitch loss calculation layer, and a roll loss calculation layer;
inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram, wherein the method comprises the following steps:
inputting the fusion characteristic diagram into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value;
inputting the fusion characteristic diagram into a pitch angle loss calculation layer, and calculating to obtain a pitch angle loss value;
and inputting the fusion feature map into a rolling angle loss calculation layer, and calculating to obtain a rolling angle loss value.
Fig. 4 is a schematic network structure diagram of a loss calculation network 205 according to an embodiment of the present disclosure. As shown in fig. 4, the loss calculation network 205 includes a yaw angle loss calculation layer 2051, a pitch angle loss calculation layer 2052, and a roll angle loss calculation layer 2053. The yaw angle loss calculation layer 2051 includes an FC fully connected layer and a softmax layer; the pitch angle loss calculation layer 2052 includes an FC fully connected layer and a softmax layer; the roll angle loss calculation layer 2053 includes an FC fully connected layer and a softmax layer.
As an example, the obtained fused feature map (fused feature) is input to the yaw angle loss calculation layer 2051, the pitch angle loss calculation layer 2052, and the roll angle loss calculation layer 2053, respectively, and a yaw angle loss value, a pitch angle loss value, and a roll angle loss value are calculated.
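A sketch of the three heads of the loss calculation network is given below; each head is an FC layer followed by a softmax over M angle categories (M = 18 in the example further below). How the 8×8×256 fused feature map is flattened before the FC layers is not stated, so global average pooling is assumed here; the mean-variance-softmax loss attached to each head is sketched after formulas (3) to (8).

```python
import torch
import torch.nn as nn

class PoseLossHeads(nn.Module):
    """Loss calculation network 205: one FC + softmax head each for yaw, pitch and roll.
    Each head outputs a probability distribution over M angle categories."""
    def __init__(self, in_channels=256, num_bins=18):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # assumed: collapse the 8x8 fused map
        self.yaw_fc = nn.Linear(in_channels, num_bins)
        self.pitch_fc = nn.Linear(in_channels, num_bins)
        self.roll_fc = nn.Linear(in_channels, num_bins)

    def forward(self, fused):
        x = self.pool(fused).flatten(1)              # (N, 256)
        p_yaw = torch.softmax(self.yaw_fc(x), dim=1)
        p_pitch = torch.softmax(self.pitch_fc(x), dim=1)
        p_roll = torch.softmax(self.roll_fc(x), dim=1)
        return p_yaw, p_pitch, p_roll                # fed into the loss of formula (8)
```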
In some embodiments, inputting the fused feature map into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value, includes:
calculating a cross entropy loss value, a mean loss value and a variance loss value of the fusion characteristic graph;
determining a first weight coefficient of the mean loss value and a second weight coefficient of the variance loss value;
and calculating the yaw angle loss value of the fusion characteristic diagram according to the cross entropy loss value, the mean loss value, the variance loss value, the first weight coefficient and the second weight coefficient.
The following description will be made in detail by taking as an example the case in which the fused feature map (fused feature) is input into the yaw angle loss calculation layer 2051 to calculate the yaw angle loss value Loss_yaw.
The fused feature map (fused feature) is input into the yaw angle loss calculation layer 2051 for processing, and the cross entropy loss value L_c of the fused feature map is calculated according to the following formula (3):
L_c = -(1/N) * Σ_i log(p_i,yi)      (3)
where p_i,yi is the probability value predicted for the true yaw angle category y_i of the ith sample.
The mean loss value L_m of the fused feature map is calculated according to the following formulas (4) and (5). The mean loss is used to penalize the difference between the mean of the estimated angle distribution and the true angle:
m_i = Σ_j j * p_ij      (4)
L_m = (1/(2N)) * Σ_i (m_i - y_i)^2      (5)
The variance loss value L_v of the fused feature map is calculated according to the following formulas (6) and (7). The variance loss is used to penalize the variance of the estimated angle distribution so as to obtain a concentrated distribution:
v_i = Σ_j p_ij * (j - m_i)^2      (6)
L_v = (1/N) * Σ_i v_i      (7)
In the above formulas (1) to (7), N represents the number of samples in the current batch, M represents the number of categories, y_i represents the true yaw angle of the ith sample, p_ij represents the probability value of the jth yaw angle category of the ith sample output through the softmax layer, i represents the sample index value, j represents the category index value, m_i represents the mean of the yaw angle distribution of the ith sample, and v_i represents the variance of the yaw angle distribution of the ith sample.
The categories here are obtained by dividing the yaw angle range [-90°, 90°] into a plurality of angle intervals. For example, it may be divided into 18 angle intervals [-90°, -80°], [-80°, -70°], ..., [80°, 90°], with one angle interval representing one yaw angle category (e.g., [-90°, -80°] represents category 1, [-80°, -70°] represents category 2, ..., [80°, 90°] represents category 18), in which case the category index j takes the values 1, 2, 3, ..., 18 and M = 18.
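For example, converting a continuous yaw angle into such a category index could be done as follows (a bin width of 10° and categories 1 to 18, matching the example above; the clamping behaviour at +90° is an assumption).

```python
def yaw_to_category(yaw_deg: float, bin_width: float = 10.0) -> int:
    """Maps a yaw angle in [-90, 90] degrees to a category index in 1..18."""
    idx = int((yaw_deg + 90.0) // bin_width) + 1   # [-90,-80) -> 1, [-80,-70) -> 2, ...
    return min(max(idx, 1), 18)                    # clamp so that +90 falls into category 18

# yaw_to_category(-90.0) -> 1, yaw_to_category(-75.0) -> 2, yaw_to_category(90.0) -> 18
```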
The yaw angle loss value Loss_yaw of the fused feature map is calculated according to the following formula (8):
Loss_yaw = L_c + α1*L_m + α2*L_v      (8)
In formula (8), α1 represents the first weight coefficient of the mean loss value, and α2 represents the second weight coefficient of the variance loss value.
When α1 is not less than 0.1 and α2 is not less than 0.01, the final face pose estimation model estimates the face pose more accurately. The first weight coefficient and the second weight coefficient may be determined empirically through parameter tuning and are not particularly limited here.
Similarly, the pitch angle loss value Loss_pitch and the roll angle loss value Loss_roll of the fused feature map are determined in the same way as the yaw angle loss value Loss_yaw. The fused feature map can therefore be input into the pitch angle loss calculation layer 2052 and the roll angle loss calculation layer 2053 respectively, with reference to the above steps, to calculate the pitch angle loss value Loss_pitch and the roll angle loss value Loss_roll; the specific calculation process is not repeated here.
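A sketch of the per-head loss of formulas (3) to (8) is given below, written for the yaw head and equally applicable to the pitch and roll heads; the normalization constants follow the formulas as reconstructed above and should be read as one possible choice.

```python
import torch

def mean_variance_softmax_loss(probs, target_bins, alpha1=0.1, alpha2=0.01):
    """probs: (N, M) softmax output of one loss head.
    target_bins: (N,) long tensor of true category indices in 1..M.
    Returns Loss = L_c + alpha1 * L_m + alpha2 * L_v as in formula (8)."""
    n, m = probs.shape
    bins = torch.arange(1, m + 1, dtype=probs.dtype, device=probs.device)  # category indices j
    target = target_bins.to(probs.dtype)

    # formula (3): cross entropy (softmax) loss over the true categories
    rows = torch.arange(n, device=probs.device)
    l_c = -torch.log(probs[rows, target_bins - 1] + 1e-12).mean()
    # formulas (4)-(5): mean loss
    mean = (probs * bins).sum(dim=1)                             # m_i
    l_m = 0.5 * ((mean - target) ** 2).mean()
    # formulas (6)-(7): variance loss
    var = (probs * (bins - mean.unsqueeze(1)) ** 2).sum(dim=1)   # v_i
    l_v = var.mean()

    return l_c + alpha1 * l_m + alpha2 * l_v                     # formula (8)

# Example: probs = torch.softmax(torch.randn(4, 18), dim=1); y = torch.tensor([1, 5, 9, 18])
# loss = mean_variance_softmax_loss(probs, y)
```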
In the embodiment of the present disclosure, the yaw angle loss value Loss_yaw, the pitch angle loss value Loss_pitch, and the roll angle loss value Loss_roll of the fused feature map are calculated by the yaw angle loss calculation layer 2051, the pitch angle loss calculation layer 2052, and the roll angle loss calculation layer 2053 in the loss calculation network 205, respectively. The loss value of each angle fully takes into account the mean loss, the variance loss, and the softmax loss of the fused feature map, and the three loss values are used to guide and further optimize the iterative training of the model, so that the estimation accuracy of the final face pose estimation model on the face pose can be effectively improved, and the reliability and practicability of the model can be further improved.
Fig. 5 is a schematic flow chart of a face pose estimation method provided by the embodiment of the disclosure. As shown in fig. 5, the face pose estimation method includes:
step S501, a face image to be recognized is obtained.
Step S502, inputting the face image to be recognized into a final face pose estimation model, and outputting the yaw angle, the pitch angle and the rolling angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model.
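At inference time an angle has to be decoded from each head's category probabilities; the patent text does not spell this step out, so the sketch below uses the expectation over the bin centres (consistent with the mean m_i used during training) under the assumption of 18 bins of 10° covering [-90°, 90°], and assumes the trained model returns the three probability vectors.

```python
import torch

@torch.no_grad()
def estimate_pose(model, face_image):
    """face_image: preprocessed tensor of shape (1, 3, 128, 128).
    Returns (yaw, pitch, roll) in degrees, decoded as the expected bin centre."""
    model.eval()
    p_yaw, p_pitch, p_roll = model(face_image)             # each of shape (1, 18)
    # bin centres: [-85, -75, ..., 85] degrees for 18 bins of width 10 over [-90, 90]
    centres = torch.arange(-85.0, 95.0, 10.0, device=p_yaw.device)
    decode = lambda p: float((p * centres).sum(dim=1))
    return decode(p_yaw), decode(p_pitch), decode(p_roll)
```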
According to the face pose estimation method provided by the embodiment of the disclosure, a lightweight and efficient feature extraction network is designed by organically combining a depth separable convolution and an attention mechanism, the feature extraction network can effectively extract global information of a face image, has low requirements on computational power, can be deployed at an edge end for use, and has high model inference speed and high accuracy; in addition, the mean loss, the variance loss and the softmax loss are jointly applied to iterative training of the face pose estimation model, so that network optimization of the initial face pose estimation model can be better guided, a final face pose estimation model with higher accuracy is obtained, and the practicability and reliability of the final face pose estimation model are improved.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 6 is a schematic diagram of a training device for a face pose estimation model according to an embodiment of the present disclosure.
As shown in fig. 6, the training device for the face pose estimation model includes:
the acquiring module 601 is configured to acquire a face image, and input the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
a training module 602 configured to perform iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value, and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch of normalization layers, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
In some embodiments, inputting a face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value, and a roll angle loss value of the face image, including:
carrying out feature extraction on the face image by using a first feature extraction network to obtain a first feature map;
inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map;
inputting the second feature map into a third feature extraction network for feature extraction to obtain a third feature map;
inputting the second feature map and the third feature map into a feature fusion network for feature fusion to obtain a fusion feature map;
and inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram.
In some embodiments, inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map, including:
inputting the first feature map into a second depth separable convolution layer, a second batch normalization layer and a second activation function layer for feature extraction to obtain a first extracted feature map;
inputting the first extracted feature map into a first attention layer to obtain first global feature information of the first extracted feature map;
and inputting the first global feature information and the first extracted feature map into a second average pooling layer, and outputting a second feature map.
In some embodiments, inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map comprises:
performing matrix transformation processing on the first extracted characteristic graph to obtain a first transformation matrix;
inputting the first transformation matrix into the first attention layer to obtain a first characteristic parameter, a second characteristic parameter and a third characteristic parameter;
and determining first global feature information of the first extracted feature map according to the first feature parameter, the second feature parameter and the third feature parameter.
In some embodiments, the feature fusion network includes an upsampling layer, a fourth average pooling layer, a first fully connected layer, a fourth activation function layer.
Inputting the second feature diagram and the third feature diagram into a feature fusion network for feature fusion to obtain a fusion feature diagram, wherein the fusion feature diagram comprises:
processing the third feature map by utilizing an upsampling layer, a fourth average pooling layer, a first fully connected layer and a fourth activation function layer to obtain a fourth feature map;
and fusing the fourth feature map and the second feature map to obtain a fused feature map.
In some embodiments, processing the third feature map with the upsampling layer, the fourth average pooling layer, the first full link layer, and the fourth activation function layer to obtain a fourth feature map includes:
processing the third characteristic diagram by using an up-sampling layer to obtain an up-sampling characteristic diagram;
inputting the up-sampling feature map into a fourth average pooling layer, and outputting an average feature map;
inputting the average characteristic diagram into the first full-connection layer and the fourth activation function layer in sequence to obtain a channel weight coefficient of the average characteristic diagram;
and calculating to obtain a fourth feature map according to the channel weight coefficient and the third feature map.
In some embodiments, the loss calculation network 205 includes a yaw angle loss calculation layer 2051, a pitch angle loss calculation layer 2052, and a roll angle loss calculation layer 2053;
inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram, wherein the method comprises the following steps:
inputting the fusion characteristic diagram into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value;
inputting the fusion characteristic diagram into a pitch angle loss calculation layer, and calculating to obtain a pitch angle loss value;
and inputting the fusion feature map into a rolling angle loss calculation layer, and calculating to obtain a rolling angle loss value.
In some embodiments, inputting the fused feature map into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value, includes:
calculating a cross entropy loss value, a mean loss value and a variance loss value of the fusion characteristic graph;
determining a first weight coefficient of the mean loss value and a second weight coefficient of the variance loss value;
and calculating the yaw angle loss value of the fusion characteristic diagram according to the cross entropy loss value, the mean loss value, the variance loss value, the first weight coefficient and the second weight coefficient.
Fig. 7 is a schematic diagram of a face pose estimation apparatus provided in an embodiment of the present disclosure. As shown in fig. 7, the face pose estimation apparatus includes:
an image acquisition module 701 configured to acquire a face image to be recognized;
and the recognition module 702 is configured to input the face image to be recognized into the final face pose estimation model, and output the yaw angle, pitch angle and roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
Fig. 8 is a schematic diagram of an electronic device 8 provided by an embodiment of the disclosure. As shown in fig. 8, the electronic apparatus 8 of this embodiment includes: a processor 801, a memory 802, and a computer program 803 stored in the memory 802 and operable on the processor 801. The steps in the various method embodiments described above are implemented when the computer program 803 is executed by the processor 801. Alternatively, the processor 801 implements the functions of the respective modules/units in the above-described respective apparatus embodiments when executing the computer program 803.
The electronic device 8 may be a desktop computer, a notebook, a palm computer, a cloud server, or other electronic devices. The electronic device 8 may include, but is not limited to, a processor 801 and a memory 802. Those skilled in the art will appreciate that fig. 8 is merely an example of electronic device 8, does not constitute a limitation of electronic device 8, and may include more or fewer components than shown, or different components.
The Processor 801 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc.
The memory 802 may be an internal storage unit of the electronic device 8, for example, a hard disk or a memory of the electronic device 8. The memory 802 may also be an external storage device of the electronic device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 8. The memory 802 may also include both internal and external storage units of the electronic device 8. The memory 802 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the method in the above embodiments, and may also be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of the above methods and embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution media, and the like. It should be noted that the computer readable medium may be subject to suitable additions or deletions as required by legislative and patent practice within the jurisdiction; for example, in some jurisdictions, computer readable media may not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims (13)

1. A training method of a face pose estimation model is characterized by comprising the following steps:
acquiring a face image, and inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch of normalization layers, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
2. The method of claim 1, wherein inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image comprises:
extracting the features of the face image by using the first feature extraction network to obtain a first feature map;
inputting the first feature map into the second feature extraction network for feature extraction to obtain a second feature map;
inputting the second feature map into the third feature extraction network for feature extraction to obtain a third feature map;
inputting the second feature map and the third feature map into the feature fusion network for feature fusion to obtain a fusion feature map;
and inputting the fusion characteristic diagram into the loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram.
3. The method of claim 2, wherein inputting the first feature map into the second feature extraction network for feature extraction to obtain a second feature map comprises:
inputting the first feature map into a second depth separable convolution layer, a second batch normalization layer and a second activation function layer for feature extraction to obtain a first extracted feature map;
inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map;
and inputting the first global feature information and the first extracted feature map into the second average pooling layer, and outputting a second feature map.
4. The method of claim 3, wherein inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map comprises:
performing matrix transformation processing on the first extracted characteristic diagram to obtain a first transformation matrix;
inputting the first transformation matrix into the first attention layer to obtain a first characteristic parameter, a second characteristic parameter and a third characteristic parameter;
and determining first global feature information of the first extracted feature map according to the first feature parameter, the second feature parameter and the third feature parameter.
5. The method of claim 2, wherein the feature fusion network comprises an upsampling layer, a fourth average pooling layer, a first fully connected layer, a fourth activation function layer;
inputting the second feature map and the third feature map into the feature fusion network for feature fusion to obtain a fusion feature map, including:
processing the third feature map by utilizing an upsampling layer, a fourth average pooling layer, a first fully connected layer and a fourth activation function layer to obtain a fourth feature map;
and fusing the fourth feature map and the second feature map to obtain a fused feature map.
6. The method of claim 5, wherein processing the third feature map using an upsampling layer, a fourth average pooling layer, a first full-link layer, and a fourth activation function layer to obtain a fourth feature map comprises:
processing the third feature map by using the up-sampling layer to obtain an up-sampling feature map;
inputting the up-sampling feature map into the fourth average pooling layer, and outputting an average feature map;
inputting the average characteristic diagram into the first full-connection layer and the fourth activation function layer in sequence to obtain a channel weight coefficient of the average characteristic diagram;
and calculating to obtain a fourth feature map according to the channel weight coefficient and the third feature map.
7. The method of claim 2, wherein the loss calculation network comprises a yaw angle loss calculation layer, a pitch angle loss calculation layer, and a roll angle loss calculation layer;
inputting the fused feature map into the loss calculation network, and calculating a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fused feature map comprises:
inputting the fused feature map into the yaw angle loss calculation layer, and calculating a yaw angle loss value;
inputting the fused feature map into the pitch angle loss calculation layer, and calculating a pitch angle loss value;
and inputting the fused feature map into the roll angle loss calculation layer, and calculating a roll angle loss value.
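The loss calculation network of claim 7 is simply three parallel heads applied to the same fused feature map, one per Euler angle. A minimal sketch, assuming each head maps the fused feature map to angle-bin logits and computes the loss detailed for the yaw head in claim 8:

```python
import torch.nn as nn

class LossCalculationNetwork(nn.Module):
    """Claim 7: three parallel angle heads over the same fused feature map."""
    def __init__(self, yaw_head, pitch_head, roll_head):
        super().__init__()
        self.yaw_head = yaw_head        # yaw angle loss calculation layer
        self.pitch_head = pitch_head    # pitch angle loss calculation layer
        self.roll_head = roll_head      # roll angle loss calculation layer

    def forward(self, fused_feature_map, yaw_label, pitch_label, roll_label):
        yaw_loss = self.yaw_head(fused_feature_map, yaw_label)
        pitch_loss = self.pitch_head(fused_feature_map, pitch_label)
        roll_loss = self.roll_head(fused_feature_map, roll_label)
        return yaw_loss, pitch_loss, roll_loss
```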
8. The method according to claim 7, wherein inputting the fused feature map into the yaw angle loss calculation layer, and calculating a yaw angle loss value comprises:
calculating a cross entropy loss value, a mean loss value and a variance loss value of the fused feature map;
determining a first weight coefficient of the mean loss value and a second weight coefficient of the variance loss value;
and calculating the yaw angle loss value of the fused feature map according to the cross entropy loss value, the mean loss value, the variance loss value, the first weight coefficient and the second weight coefficient.
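Claim 8 composes the yaw loss from a cross-entropy term and weighted mean and variance terms, in the style of binned-angle pose estimators. The sketch below assumes a specific binning scheme (66 bins of 3 degrees over roughly ±99 degrees) and example weight values, and assumes the per-bin logits come from a classification layer applied to the fused feature map; none of these choices are stated in the claim.

```python
import torch
import torch.nn.functional as F

def yaw_angle_loss(logits, target_angle, w_mean=1.0, w_var=0.05,
                   num_bins=66, bin_width=3.0, angle_offset=99.0):
    """Cross-entropy loss + weighted mean loss + weighted variance loss (claim 8).
    Binning scheme and weight coefficients are illustrative assumptions."""
    # cross entropy over the hard bin index of the ground-truth angle
    target_bin = ((target_angle + angle_offset) / bin_width).long().clamp(0, num_bins - 1)
    ce_loss = F.cross_entropy(logits, target_bin)

    # mean loss: expected angle of the soft bin distribution vs. ground truth
    probs = F.softmax(logits, dim=1)
    bin_centers = torch.arange(num_bins, device=logits.device,
                               dtype=logits.dtype) * bin_width - angle_offset
    pred_mean = (probs * bin_centers).sum(dim=1)
    mean_loss = F.l1_loss(pred_mean, target_angle)

    # variance loss: spread of the predicted distribution around its mean
    var_loss = (probs * (bin_centers - pred_mean.unsqueeze(1)) ** 2).sum(dim=1).mean()

    # first and second weight coefficients applied to the mean and variance terms
    return ce_loss + w_mean * mean_loss + w_var * var_loss
```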
9. A face pose estimation method is characterized by comprising the following steps:
acquiring a face image to be recognized;
inputting the face image to be recognized into a final face pose estimation model, and outputting the yaw angle, pitch angle and roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of a face pose estimation model according to any one of claims 1 to 8.
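A short usage sketch of the inference step in claim 9; the (3, H, W) input layout and the convention that the deployed model returns the three angles directly (rather than loss values) are illustrative assumptions.

```python
import torch

def estimate_face_pose(model, face_image):
    """Claim 9 inference path: feed a face image to the trained model and
    read out yaw, pitch and roll angles."""
    model.eval()
    with torch.no_grad():
        yaw, pitch, roll = model(face_image.unsqueeze(0))  # add a batch dimension
    return yaw, pitch, roll
```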
10. A training device for a face pose estimation model is characterized by comprising:
an acquisition module configured to acquire a face image and input the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
a training module configured to perform iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, to obtain a final face pose estimation model;
wherein the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depthwise separable convolution layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depthwise separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; and the third feature extraction network comprises a third depthwise separable convolution layer, a third batch normalization layer, a third activation function layer and a second attention layer.
11. A face pose estimation device, characterized by comprising:
the image acquisition module is configured to acquire a face image to be recognized;
a recognition module configured to input the face image to be recognized into a final face pose estimation model and output a yaw angle, a pitch angle and a roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of a face pose estimation model according to any one of claims 1 to 8.
12. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 9 when executing the computer program.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202310008849.2A 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device Pending CN115984934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310008849.2A CN115984934A (en) 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310008849.2A CN115984934A (en) 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device

Publications (1)

Publication Number Publication Date
CN115984934A true CN115984934A (en) 2023-04-18

Family

ID=85966519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310008849.2A Pending CN115984934A (en) 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device

Country Status (1)

Country Link
CN (1) CN115984934A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711040A (en) * 2023-05-24 2024-03-15 荣耀终端有限公司 Calibration method and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination