CN115984934A - Training method of face pose estimation model, face pose estimation method and device - Google Patents

Training method of face pose estimation model, face pose estimation method and device

Info

Publication number
CN115984934A
Authority
CN
China
Prior art keywords
layer
loss value
feature
feature map
pose estimation
Legal status
Pending
Application number
CN202310008849.2A
Other languages
Chinese (zh)
Inventor
祁晓婷
黄泽元
杨战波
蒋召
Current Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Original Assignee
Beijing Longzhi Digital Technology Service Co Ltd
Application filed by Beijing Longzhi Digital Technology Service Co Ltd
Priority to CN202310008849.2A
Publication of CN115984934A

Landscapes

  • Image Analysis (AREA)

Abstract

The disclosure relates to the field of artificial intelligence, and provides a training method of a face pose estimation model, a face pose estimation method and a face pose estimation device. The method comprises the following steps: acquiring a face image and inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image; and performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model. The network structure of the initial face pose estimation model provided by the disclosure is lightweight and has strong feature extraction capability; the model has a high inference speed and high accuracy, places only modest demands on computing power, and can be deployed for use on edge devices with limited computing power.

Description

Training method of face pose estimation model, face pose estimation method and device
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to a training method of a face pose estimation model, a face pose estimation method and a face pose estimation device.
Background
Face pose estimation is a key step in application scenarios such as face recognition systems, human-computer interaction systems and access control systems, and accurate face pose estimation helps to improve the accuracy of face recognition and the overall performance of such systems.
At present, conventional face pose estimation models generally involve a large amount of computation, have a slow inference speed and low accuracy, place high demands on computing power, and are therefore difficult to deploy on edge devices with limited computing power.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a training method for a face pose estimation model, a face pose estimation method, and a corresponding device, so as to solve the problems that existing face pose estimation models are generally computation-heavy, slow in inference, low in accuracy, demanding in computing power, and difficult to deploy on edge devices with limited computing power.
In a first aspect of the embodiments of the present disclosure, a training method for a face pose estimation model is provided, including:
acquiring a face image, and inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch of normalization layers, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
In a second aspect of the embodiments of the present disclosure, a face pose estimation method is provided, including:
acquiring a face image to be recognized;
inputting the face image to be recognized into a final face pose estimation model, and outputting a yaw angle, a pitch angle and a roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model according to the first aspect.
In a third aspect of the embodiments of the present disclosure, a training apparatus for a face pose estimation model is provided, including:
the acquisition module is configured to acquire a face image, and input the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
the training module is configured to perform iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
In a fourth aspect of the embodiments of the present disclosure, there is provided a face pose estimation apparatus, including:
the image acquisition module is configured to acquire a face image to be recognized;
and the recognition module is configured to input the face image to be recognized into the final face pose estimation model and output the yaw angle, the pitch angle and the roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model according to the first aspect.
In a fifth aspect of the embodiments of the present disclosure, there is provided an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a sixth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, which stores a computer program, which when executed by a processor, implements the steps of the above-mentioned method.
Compared with the prior art, the beneficial effects of the embodiment of the disclosure at least comprise: obtaining a yaw angle loss value, a pitch angle loss value and a roll angle loss value of a face image by acquiring the face image and inputting it into a pre-constructed initial face pose estimation model; performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model; the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer. The network structure of the initial face pose estimation model provided by the disclosure is lightweight and has strong feature extraction capability; the model has a high inference speed and high accuracy, places only modest demands on computing power, and can be deployed for use on edge devices with limited computing power.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed for the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings can be obtained by those skilled in the art without inventive efforts.
Fig. 1 is a schematic flowchart of a training method for a face pose estimation model according to an embodiment of the present disclosure;
fig. 2 is a schematic network structure diagram of an initial face pose estimation model in a training method for a face pose estimation model according to an embodiment of the present disclosure;
fig. 3 is a schematic network structure diagram of a feature fusion network in the training method of the face pose estimation model according to the embodiment of the present disclosure;
fig. 4 is a schematic network structure diagram of a loss calculation network in the training method of the face pose estimation model according to the embodiment of the present disclosure;
fig. 5 is a schematic flow chart of a face pose estimation method provided by the embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a training device for a face pose estimation model according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a face pose estimation apparatus provided in the embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
The following describes a training method of a face pose estimation model, a face pose estimation method, and an apparatus according to embodiments of the present disclosure in detail with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a training method for a face pose estimation model according to an embodiment of the present disclosure. As shown in fig. 1, the training method of the face pose estimation model includes:
step S101, a face image is obtained and input into a pre-constructed initial face pose estimation model, and a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image are obtained.
The face image generally refers to an image including a face, which is acquired by an image pickup device (such as a monocular camera, a binocular camera, etc.).
Fig. 2 is a schematic network structure diagram of an initial face pose estimation model provided in the embodiment of the present disclosure. As shown in fig. 2, the initial face pose estimation model includes a first feature extraction network 201 (hereinafter referred to as "block1"), a second feature extraction network 202 (hereinafter referred to as "block2"), a third feature extraction network 203 (hereinafter referred to as "block3"), a feature fusion network 204 (hereinafter referred to as "Multi-features Fusion"), and a loss calculation network 205; the first feature extraction network 201 includes a first depth separable convolution layer (Depthwise Separable Conv) 2011, a first batch normalization layer (BN) 2012, a first activation function layer (ReLU) 2013, and a first average pooling layer (Average Pooling) 2014; the second feature extraction network 202 includes a second depth separable convolution layer (Depthwise Separable Conv) 2021, a second batch normalization layer (BN) 2022, a second activation function layer (ReLU) 2023, a first attention layer (Transformer encoder) 2024, and a second average pooling layer (Average Pooling) 2025; the third feature extraction network 203 includes a third depth separable convolution layer (Depthwise Separable Conv) 2031, a third batch normalization layer (BN) 2032, a third activation function layer (ReLU) 2033, and a second attention layer (Transformer encoder) 2034.
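By way of illustration, the following is a minimal PyTorch sketch of the depthwise separable convolution building block and of one stage of the first feature extraction network (depthwise separable convolution, batch normalization, ReLU and average pooling); the kernel sizes, strides and intermediate channel numbers are assumptions made for the sketch and are not prescribed by this embodiment.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) 3x3 convolution
    followed by a 1x1 pointwise convolution that mixes channels."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class Block1Stage(nn.Module):
    """One stage of the first feature extraction network:
    depthwise separable conv -> BN -> ReLU -> 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = DepthwiseSeparableConv(in_channels, out_channels)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):
        return self.pool(self.relu(self.bn(self.conv(x))))

# Usage: three such stages take a 128x128x3 face image down to a 16x16x128 feature map.
if __name__ == "__main__":
    x = torch.randn(1, 3, 128, 128)                      # (N, C, H, W)
    stages = nn.Sequential(Block1Stage(3, 32),
                           Block1Stage(32, 64),
                           Block1Stage(64, 128))
    print(stages(x).shape)                               # torch.Size([1, 128, 16, 16])
```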
Step S102, performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model.
As an example, suppose that N face images (N is a positive integer larger than or equal to 1) are collected from a camera device installed in a hall of a building. The N face images are input into the pre-constructed initial face pose estimation model, and after a first round of training a yaw angle loss value (Loss_yaw), a pitch angle loss value (Loss_pitch) and a roll angle loss value (Loss_roll) are output for each face image. The Loss_yaw, Loss_pitch and Loss_roll obtained after the first round of training are then back-propagated to guide further optimization of the initial face pose estimation model, so that a better face pose prediction effect is obtained. A second round of training is performed according to the same steps, and so on, until a preset training round threshold (which can be set according to actual conditions, for example, 50 rounds, 60 rounds, 70 rounds and the like) or a preset model precision is reached; the preset iteration termination condition is then considered to be met, and the final face pose estimation model is obtained.
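As a rough illustration of this training procedure, a minimal PyTorch training loop could look like the sketch below; the optimizer, learning rate, data-loader interface and the simple summation of the three loss values before back-propagation are assumptions made for the sketch, not requirements of this embodiment.

```python
import torch

def train_pose_model(model, dataloader, num_epochs=50, lr=1e-3, device="cpu"):
    """Iteratively trains an initial face pose estimation model that returns the three
    loss values (Loss_yaw, Loss_pitch, Loss_roll) for a batch of images and targets."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(num_epochs):                        # preset training-round threshold
        for images, yaw_gt, pitch_gt, roll_gt in dataloader:
            images = images.to(device)
            targets = (yaw_gt.to(device), pitch_gt.to(device), roll_gt.to(device))
            loss_yaw, loss_pitch, loss_roll = model(images, targets)
            loss = loss_yaw + loss_pitch + loss_roll       # back-propagate the three losses
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```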
According to the technical scheme provided by the embodiment of the disclosure, a yaw angle loss value, a pitch angle loss value and a roll angle loss value of a face image are obtained by acquiring the face image and inputting it into a pre-constructed initial face pose estimation model; iterative training is then performed on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model. The initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer. The network structure of the initial face pose estimation model provided by the disclosure is lightweight and has strong feature extraction capability; the model has a high inference speed and high accuracy, places only modest demands on computing power, and can be deployed for use on edge devices with limited computing power.
In some embodiments, the method includes inputting a face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value, and a roll angle loss value of the face image, and specifically includes:
carrying out feature extraction on the face image by using a first feature extraction network to obtain a first feature map;
inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map;
inputting the second feature map into a third feature extraction network for feature extraction to obtain a third feature map;
inputting the second feature map and the third feature map into a feature fusion network for feature fusion to obtain a fusion feature map;
and inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram.
With reference to fig. 2, the first feature extraction network 201 is composed of three block1 units, where each block1 includes a first depth separable convolution layer (Depthwise Separable Conv) 2011, a first batch normalization layer (BN) 2012, a first activation function layer (ReLU) 2013, and a first average pooling layer (Average Pooling) 2014.
Suppose that a face image f with dimensions [128, 128, 3] (H, W, C) is input into the first feature extraction network 201, where H represents the picture height, W represents the picture width, and C represents the number of channels. After the face image f is subjected to feature extraction by the first feature extraction network 201, a first feature map f1 with dimensions [16, 16, 128] (H, W, C) is output. The first feature map f1 is then input into the second feature extraction network 202 for feature extraction, and a second feature map f2 with dimensions [8, 8, 256] (H, W, C) is obtained. The second feature map f2 is then input into the third feature extraction network 203 for feature extraction, and a third feature map f3 with dimensions [4, 4, 512] (H, W, C) is obtained. The second feature map f2 and the third feature map f3 are then input into the feature fusion network 204 for feature fusion, so as to obtain a fused feature map (fused feature) with dimensions [8, 8, 256] (H, W, C). Finally, the fused feature map is input into the loss calculation network 205, and the yaw angle loss value Loss_yaw, the pitch angle loss value Loss_pitch, and the roll angle loss value Loss_roll of the fused feature map are calculated.
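The data flow just described can be summarized by the following sketch (shapes given in the (N, C, H, W) order used by PyTorch; the five sub-networks are assumed to be callable modules with the interfaces shown):

```python
def forward_flow(block1, block2, block3, fusion, loss_net, face_image, targets):
    """Forward pass of the initial face pose estimation model.

    face_image: tensor of shape (N, 3, 128, 128) -- input face image f
    targets: ground-truth yaw/pitch/roll labels used by the loss calculation network
    """
    f1 = block1(face_image)        # first feature map,  approx. (N, 128, 16, 16)
    f2 = block2(f1)                # second feature map, approx. (N, 256, 8, 8)
    f3 = block3(f2)                # third feature map,  approx. (N, 512, 4, 4)
    fused = fusion(f2, f3)         # fused feature map,  approx. (N, 256, 8, 8)
    # loss calculation network returns Loss_yaw, Loss_pitch, Loss_roll
    return loss_net(fused, targets)
```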
In some embodiments, inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map specifically includes:
inputting the first feature map into a second depth separable convolution layer, a second batch normalization layer and a second activation function layer for feature extraction to obtain a first extracted feature map;
inputting the first extracted feature map into a first attention layer to obtain first global feature information of the first extracted feature map;
and inputting the first global feature information and the first extracted feature map into a second average pooling layer, and outputting a second feature map.
Referring to fig. 2, the second feature extraction network 202 includes two Depthwise Separable Conv (second depth separable convolution layer 2021) + BN (second batch normalization layer 2022) + ReLU (second activation function layer 2023) network layers, a first attention layer (Transformer encoder) 2024 and a second average pooling layer (Average Pooling) 2025.
In connection with the above example, the first feature map f1 with dimensions [16, 16, 128] (H, W, C) is input into the two Depthwise Separable Conv (second depth separable convolution layer 2021) + BN (second batch normalization layer 2022) + ReLU (second activation function layer 2023) network layers of the second feature extraction network 202 for feature extraction, resulting in a first extracted feature map with dimensions [16, 16, 256] (H, W, C). The first extracted feature map is then input into the first attention layer 2024, so as to obtain first global feature information of the first extracted feature map. Then, the first global feature information and the first extracted feature map with dimensions [16, 16, 256] (H, W, C) are input into the second average pooling layer 2025, and the second feature map f2 with dimensions [8, 8, 256] (H, W, C) is output.
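A possible PyTorch sketch of the second feature extraction network is given below; nn.MultiheadAttention is used here only as a stand-in for the Transformer-encoder attention layer (the exact single-head formulation of this embodiment is given by formulas (1) and (2) below), the channel numbers are assumptions, and the element-wise addition used to combine the global feature information with the first extracted feature map before pooling is likewise an assumption.

```python
import torch
import torch.nn as nn

class Block2(nn.Module):
    """Second feature extraction network: two (depthwise separable conv + BN + ReLU)
    stages, an attention layer over the flattened spatial positions, and 2x2 average pooling."""
    def __init__(self, in_channels=128, out_channels=256):
        super().__init__()
        def dw_stage(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False),  # depthwise
                nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True))
        self.stage1 = dw_stage(in_channels, out_channels)
        self.stage2 = dw_stage(out_channels, out_channels)
        # stand-in for the first attention layer (Transformer encoder)
        self.attn = nn.MultiheadAttention(embed_dim=out_channels, num_heads=1,
                                          batch_first=True)
        self.pool = nn.AvgPool2d(kernel_size=2)

    def forward(self, x):
        x = self.stage2(self.stage1(x))                    # first extracted feature map
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)              # (N, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)    # first global feature information
        x = x + attn_out.transpose(1, 2).reshape(n, c, h, w)
        return self.pool(x)                                # second feature map

# Block2()(torch.randn(1, 128, 16, 16)).shape -> torch.Size([1, 256, 8, 8])
```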
In some embodiments, inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map comprises:
performing matrix transformation processing on the first extracted characteristic graph to obtain a first transformation matrix;
inputting the first transformation matrix into the first attention layer to obtain a first characteristic parameter, a second characteristic parameter and a third characteristic parameter;
and determining first global feature information of the first extracted feature map according to the first feature parameter, the second feature parameter and the third feature parameter.
In combination with the above example, the first extracted feature map with dimensions [16, 16, 256] (H, W, C) is subjected to matrix transformation processing, that is, it is processed by using a reshape function (a reshape function, as in MATLAB, transforms a given matrix into a matrix of a specified shape without changing the number of elements, readjusting the number of rows, columns and dimensions of the matrix), so as to obtain a first transformation matrix with feature dimensions [16 × 16, 256] (H × W, C).
Then, the first transformation matrix is input into the first attention layer 2024 to obtain a first feature parameter W_k, a second feature parameter W_q and a third feature parameter W_v, and the corresponding key, query and value are generated. The key represents the other pixel points in the face image f, the query represents the current pixel point in the face image f, and the value represents the similarity between the current pixel point and the other pixel points.
Let K = f*W_k, Q = f*W_q, V = f*W_v, where W_k ∈ R^(C×C'), W_q ∈ R^(C×C') and W_v ∈ R^(C×C). For the generated Q and K, a weight coefficient a is calculated using the following formula (1):
a = softmax(Q*K^T / sqrt(C'))      (1)
where C' represents the dimension of Q and K, and K^T represents the transpose of K.
The calculated weight coefficient a is then substituted into the following formula (2) to calculate the first global feature information f_att:
f_att = a*V      (2)
Then, the calculated first global feature information f_att and the first extracted feature map with dimensions [16, 16, 256] (H, W, C) are input into the second average pooling layer 2025, and the second feature map f2 with dimensions [8, 8, 256] (H, W, C) is output.
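A minimal sketch of this attention computation, assuming learnable projection matrices W_q, W_k and W_v implemented as linear layers and an assumed projection dimension C' (here 64), is shown below.

```python
import math
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Attention of formulas (1) and (2): K = f*W_k, Q = f*W_q, V = f*W_v,
    a = softmax(Q*K^T / sqrt(C')), f_att = a*V."""
    def __init__(self, channels, proj_dim):
        super().__init__()
        self.w_q = nn.Linear(channels, proj_dim, bias=False)   # W_q in R^{C x C'}
        self.w_k = nn.Linear(channels, proj_dim, bias=False)   # W_k in R^{C x C'}
        self.w_v = nn.Linear(channels, channels, bias=False)   # W_v in R^{C x C}
        self.proj_dim = proj_dim

    def forward(self, feature_map):
        # feature_map: (N, C, H, W) -> first transformation matrix of shape (N, H*W, C)
        n, c, h, w = feature_map.shape
        f = feature_map.flatten(2).transpose(1, 2)
        q, k, v = self.w_q(f), self.w_k(f), self.w_v(f)
        a = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.proj_dim), dim=-1)  # (1)
        f_att = a @ v                                                                # (2)
        return f_att                                       # first global feature information

# Usage on the first extracted feature map of shape (N, 256, 16, 16):
# attn = SingleHeadAttention(channels=256, proj_dim=64)
# f_att = attn(torch.randn(2, 256, 16, 16))   # -> (2, 256, 256), i.e. (N, H*W, C)
```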
Referring to fig. 2, the third feature extraction network 203 includes a third depth separable convolution layer (Depthwise Separable Conv) 2031, a third batch normalization layer (BN) 2032, a third activation function layer (ReLU) 2033, and a second attention layer (Transformer encoder) 2034.
The second feature map f2 obtained as described above is input into the third feature extraction network 203, and feature extraction is performed using the third depth separable convolution layer 2031, the third batch normalization layer 2032, the third activation function layer 2033, and the second attention layer 2034, thereby obtaining a third feature map f3.
The feature extraction processes of the first attention layer 2024 and the second attention layer 2034 are substantially the same.
In the embodiment of the present disclosure, the first attention layer 2024 and the second attention layer 2034 which are arranged in the second feature extraction network 202 and the third feature extraction network 203 can effectively enhance the feature extraction capability of the networks, obtain more comprehensive image features, and facilitate improvement of subsequent accuracy of face pose estimation.
In some embodiments, the feature fusion network includes an upsampling layer, a fourth average pooling layer, a first fully connected layer, a fourth activation function layer.
Inputting the second feature diagram and the third feature diagram into a feature fusion network for feature fusion to obtain a fusion feature diagram, wherein the fusion feature diagram comprises:
processing the third feature map by utilizing an upsampling layer, a fourth average pooling layer, a first fully connected layer and a fourth activation function layer to obtain a fourth feature map;
and fusing the fourth feature map and the second feature map to obtain a fused feature map.
Fig. 3 is a schematic network structure diagram of a feature fusion network 204 provided in an embodiment of the present disclosure. As shown in fig. 3, the feature fusion network 204 includes an upsampling layer (Upsample) 301, a fourth Average Pooling layer (Average Pooling) 302, a first fully connected layer (FC) 303, and a fourth activation function layer (sigmoid function layer) 304.
In some embodiments, processing the third feature map with the upsampling layer, the fourth average pooling layer, the first full link layer, and the fourth activation function layer to obtain a fourth feature map includes:
processing the third characteristic diagram by using an up-sampling layer to obtain an up-sampling characteristic diagram;
inputting the up-sampling feature map into a fourth average pooling layer, and outputting an average feature map;
inputting the average characteristic diagram into the first full-connection layer and the fourth activation function layer in sequence to obtain a channel weight coefficient of the average characteristic diagram;
and calculating to obtain a fourth feature map according to the channel weight coefficient and the third feature map.
As an example, the third feature map f3 with dimensions [4, 4, 512] (H, W, C) obtained above is input into the upsampling layer (upsample) 301 for processing, so as to obtain an upsampled feature map with dimensions [8, 8, 256] (H, W, C). Then, the upsampled feature map is input into the fourth Average Pooling layer (Average Pooling) 302, and the average feature map is output. Then, the average feature map is input into the first fully connected layer (FC) 303 and the fourth activation function layer (sigmoid function layer) 304 in sequence for feature extraction to obtain the channel weight coefficient of the average feature map; the channel weight coefficient of the average feature map is then multiplied by the third feature map f3 to obtain a fourth feature map f4. Finally, the fourth feature map f4 is added to the second feature map f2 to obtain a fused feature map (fused feature) with dimensions [8, 8, 256] (H, W, C).
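A sketch of this fusion procedure under the shapes given above follows. Where the text leaves details open, assumptions are made for the sketch: the channel reduction from 512 to 256 is folded into the upsampling step as a 1×1 convolution, and the channel weight coefficients are applied to the upsampled feature map so that the shapes match the fused feature map of [8, 8, 256].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureFusion(nn.Module):
    """Feature fusion network 204: upsample f3, derive per-channel weights with
    average pooling + FC + sigmoid, reweight, and add the result to f2."""
    def __init__(self, f2_channels=256, f3_channels=512):
        super().__init__()
        # channel reduction assumed to be part of the upsampling step (1x1 conv)
        self.reduce = nn.Conv2d(f3_channels, f2_channels, kernel_size=1, bias=False)
        self.fc = nn.Linear(f2_channels, f2_channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f2, f3):
        # upsampling layer: (N, 512, 4, 4) -> (N, 256, 8, 8)
        up = self.reduce(F.interpolate(f3, size=f2.shape[-2:], mode="bilinear",
                                       align_corners=False))
        # fourth average pooling layer -> average feature map (one value per channel)
        avg = F.adaptive_avg_pool2d(up, 1).flatten(1)          # (N, 256)
        # first fully connected layer + sigmoid -> channel weight coefficients
        weights = self.sigmoid(self.fc(avg)).unsqueeze(-1).unsqueeze(-1)
        f4 = up * weights                                      # fourth feature map
        return f4 + f2                                         # fused feature map

# MultiFeatureFusion()(torch.randn(1, 256, 8, 8), torch.randn(1, 512, 4, 4)).shape
# -> torch.Size([1, 256, 8, 8])
```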
The feature fusion network provided by the embodiment of the disclosure can provide different importance degrees or association degrees for the prediction target of the current layer according to the features of different layers, and provide different weight coefficients for the features of different layers according to the importance degrees, so that the fused features contain richer semantic information. Specifically, the second feature map f2 usually contains low-level semantic information of spatial features of the face image, and the third feature map f3 usually contains high-level semantic information of some abstract features of the face image. By fusing the second feature map f2 and the third feature map f3 in the above manner, richer semantic information can be obtained, and the face pose estimation accuracy of the model can be improved.
In some embodiments, the loss calculation network includes a yaw loss calculation layer, a pitch loss calculation layer, and a roll loss calculation layer;
inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram, wherein the method comprises the following steps:
inputting the fusion characteristic diagram into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value;
inputting the fusion characteristic diagram into a pitch angle loss calculation layer, and calculating to obtain a pitch angle loss value;
and inputting the fusion feature map into a rolling angle loss calculation layer, and calculating to obtain a rolling angle loss value.
Fig. 4 is a schematic network structure diagram of a loss calculation network 205 according to an embodiment of the present disclosure. As shown in fig. 4, the loss calculation network 205 includes a yaw angle loss calculation layer 2051, a pitch angle loss calculation layer 2052, and a roll angle loss calculation layer 2053. The yaw angle loss calculation layer 2051 includes an FC fully connected layer and a softmax layer; the pitch angle loss calculation layer 2052 includes an FC fully connected layer and a softmax layer; the roll angle loss calculation layer 2053 includes an FC fully connected layer and a softmax layer.
As an example, the obtained fused feature map (fused feature) is input to the yaw angle loss calculation layer 2051, the pitch angle loss calculation layer 2052, and the roll angle loss calculation layer 2053, respectively, and a yaw angle loss value, a pitch angle loss value, and a roll angle loss value are calculated.
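A sketch of the three heads of the loss calculation network is given below; each head is an FC layer followed by a softmax over M angle categories (M = 18 in the example further below). How the 8×8×256 fused feature map is flattened before the FC layers is not stated, so global average pooling is assumed here; the mean-variance-softmax loss attached to each head is sketched after formulas (3) to (8).

```python
import torch
import torch.nn as nn

class PoseLossHeads(nn.Module):
    """Loss calculation network 205: one FC + softmax head each for yaw, pitch and roll.
    Each head outputs a probability distribution over M angle categories."""
    def __init__(self, in_channels=256, num_bins=18):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # assumed: collapse the 8x8 fused map
        self.yaw_fc = nn.Linear(in_channels, num_bins)
        self.pitch_fc = nn.Linear(in_channels, num_bins)
        self.roll_fc = nn.Linear(in_channels, num_bins)

    def forward(self, fused):
        x = self.pool(fused).flatten(1)              # (N, 256)
        p_yaw = torch.softmax(self.yaw_fc(x), dim=1)
        p_pitch = torch.softmax(self.pitch_fc(x), dim=1)
        p_roll = torch.softmax(self.roll_fc(x), dim=1)
        return p_yaw, p_pitch, p_roll                # fed into the loss of formula (8)
```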
In some embodiments, inputting the fused feature map into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value, includes:
calculating a cross entropy loss value, a mean loss value and a variance loss value of the fusion characteristic graph;
determining a first weight coefficient of the mean loss value and a second weight coefficient of the variance loss value;
and calculating the yaw angle loss value of the fusion characteristic diagram according to the cross entropy loss value, the mean loss value, the variance loss value, the first weight coefficient and the second weight coefficient.
The following description will be made in detail by taking as an example the case in which the fused feature map (fused feature) is input into the yaw angle loss calculation layer 2051 to calculate the yaw angle loss value Loss_yaw.
The fused feature map (fused feature) is input into the yaw angle loss calculation layer 2051 for processing, and the cross entropy loss value L_c of the fused feature map is calculated according to the following formula (3):
L_c = -(1/N) * Σ_i log(p_i,yi)      (3)
where p_i,yi is the probability value predicted for the true yaw angle category y_i of the ith sample.
The mean loss value L_m of the fused feature map is calculated according to the following formulas (4) and (5). The mean loss is used to penalize the difference between the mean of the estimated angle distribution and the true angle:
m_i = Σ_j j * p_ij      (4)
L_m = (1/(2N)) * Σ_i (m_i - y_i)^2      (5)
The variance loss value L_v of the fused feature map is calculated according to the following formulas (6) and (7). The variance loss is used to penalize the variance of the estimated angle distribution so as to obtain a concentrated distribution:
v_i = Σ_j p_ij * (j - m_i)^2      (6)
L_v = (1/N) * Σ_i v_i      (7)
In the above formulas (1) to (7), N represents the number of samples in the current batch, M represents the number of categories, y_i represents the true yaw angle of the ith sample, p_ij represents the probability value of the jth yaw angle category of the ith sample output through the softmax layer, i represents the sample index value, j represents the category index value, m_i represents the mean of the yaw angle distribution of the ith sample, and v_i represents the variance of the yaw angle distribution of the ith sample.
The categories here are obtained by dividing the yaw angle range [-90°, 90°] into a plurality of angle intervals. For example, it may be divided into 18 angle intervals [-90°, -80°], [-80°, -70°], ..., [80°, 90°], with one angle interval representing one yaw angle category (e.g., [-90°, -80°] represents category 1, [-80°, -70°] represents category 2, ..., [80°, 90°] represents category 18), in which case the category index j takes the values 1, 2, 3, ..., 18 and M = 18.
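For example, converting a continuous yaw angle into such a category index could be done as follows (a bin width of 10° and categories 1 to 18, matching the example above; the clamping behaviour at +90° is an assumption).

```python
def yaw_to_category(yaw_deg: float, bin_width: float = 10.0) -> int:
    """Maps a yaw angle in [-90, 90] degrees to a category index in 1..18."""
    idx = int((yaw_deg + 90.0) // bin_width) + 1   # [-90,-80) -> 1, [-80,-70) -> 2, ...
    return min(max(idx, 1), 18)                    # clamp so that +90 falls into category 18

# yaw_to_category(-90.0) -> 1, yaw_to_category(-75.0) -> 2, yaw_to_category(90.0) -> 18
```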
The yaw angle loss value Loss_yaw of the fused feature map is calculated according to the following formula (8):
Loss_yaw = L_c + α1*L_m + α2*L_v      (8)
In formula (8), α1 represents the first weight coefficient of the mean loss value, and α2 represents the second weight coefficient of the variance loss value.
When α1 is not less than 0.1 and α2 is not less than 0.01, the final face pose estimation model estimates the face pose more accurately. The first weight coefficient and the second weight coefficient may be determined empirically through parameter tuning and are not particularly limited here.
Similarly, the pitch angle loss value Loss_pitch and the roll angle loss value Loss_roll of the fused feature map are determined in the same way as the yaw angle loss value Loss_yaw. The fused feature map can therefore be input into the pitch angle loss calculation layer 2052 and the roll angle loss calculation layer 2053 respectively, with reference to the above steps, to calculate the pitch angle loss value Loss_pitch and the roll angle loss value Loss_roll; the specific calculation process is not repeated here.
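A sketch of the per-head loss of formulas (3) to (8) is given below, written for the yaw head and equally applicable to the pitch and roll heads; the normalization constants follow the formulas as reconstructed above and should be read as one possible choice.

```python
import torch

def mean_variance_softmax_loss(probs, target_bins, alpha1=0.1, alpha2=0.01):
    """probs: (N, M) softmax output of one loss head.
    target_bins: (N,) long tensor of true category indices in 1..M.
    Returns Loss = L_c + alpha1 * L_m + alpha2 * L_v as in formula (8)."""
    n, m = probs.shape
    bins = torch.arange(1, m + 1, dtype=probs.dtype, device=probs.device)  # category indices j
    target = target_bins.to(probs.dtype)

    # formula (3): cross entropy (softmax) loss over the true categories
    rows = torch.arange(n, device=probs.device)
    l_c = -torch.log(probs[rows, target_bins - 1] + 1e-12).mean()
    # formulas (4)-(5): mean loss
    mean = (probs * bins).sum(dim=1)                             # m_i
    l_m = 0.5 * ((mean - target) ** 2).mean()
    # formulas (6)-(7): variance loss
    var = (probs * (bins - mean.unsqueeze(1)) ** 2).sum(dim=1)   # v_i
    l_v = var.mean()

    return l_c + alpha1 * l_m + alpha2 * l_v                     # formula (8)

# Example: probs = torch.softmax(torch.randn(4, 18), dim=1); y = torch.tensor([1, 5, 9, 18])
# loss = mean_variance_softmax_loss(probs, y)
```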
In the embodiment of the present disclosure, the yaw angle loss value Loss_yaw, the pitch angle loss value Loss_pitch, and the roll angle loss value Loss_roll of the fused feature map are calculated by the yaw angle loss calculation layer 2051, the pitch angle loss calculation layer 2052, and the roll angle loss calculation layer 2053 in the loss calculation network 205, respectively. The loss value of each angle fully takes into account the mean loss, the variance loss, and the softmax loss of the fused feature map, and the three loss values are used to guide and further optimize the iterative training of the model, so that the estimation accuracy of the final face pose estimation model on the face pose can be effectively improved, and the reliability and practicability of the model can be further improved.
Fig. 5 is a schematic flow chart of a face pose estimation method provided by the embodiment of the disclosure. As shown in fig. 5, the face pose estimation method includes:
step S501, a face image to be recognized is obtained.
Step S502, inputting the face image to be recognized into a final face pose estimation model, and outputting the yaw angle, the pitch angle and the rolling angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model.
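At inference time an angle has to be decoded from each head's category probabilities; the patent text does not spell this step out, so the sketch below uses the expectation over the bin centres (consistent with the mean m_i used during training) under the assumption of 18 bins of 10° covering [-90°, 90°], and assumes the trained model returns the three probability vectors.

```python
import torch

@torch.no_grad()
def estimate_pose(model, face_image):
    """face_image: preprocessed tensor of shape (1, 3, 128, 128).
    Returns (yaw, pitch, roll) in degrees, decoded as the expected bin centre."""
    model.eval()
    p_yaw, p_pitch, p_roll = model(face_image)             # each of shape (1, 18)
    # bin centres: [-85, -75, ..., 85] degrees for 18 bins of width 10 over [-90, 90]
    centres = torch.arange(-85.0, 95.0, 10.0, device=p_yaw.device)
    decode = lambda p: float((p * centres).sum(dim=1))
    return decode(p_yaw), decode(p_pitch), decode(p_roll)
```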
According to the face pose estimation method provided by the embodiment of the disclosure, a lightweight and efficient feature extraction network is designed by organically combining a depth separable convolution and an attention mechanism, the feature extraction network can effectively extract global information of a face image, has low requirements on computational power, can be deployed at an edge end for use, and has high model inference speed and high accuracy; in addition, the mean loss, the variance loss and the softmax loss are jointly applied to iterative training of the face pose estimation model, so that network optimization of the initial face pose estimation model can be better guided, a final face pose estimation model with higher accuracy is obtained, and the practicability and reliability of the final face pose estimation model are improved.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
The following are embodiments of the disclosed apparatus that may be used to perform embodiments of the disclosed methods. For details not disclosed in the embodiments of the apparatus of the present disclosure, refer to the embodiments of the method of the present disclosure.
Fig. 6 is a schematic diagram of a training device for a face pose estimation model according to an embodiment of the present disclosure.
As shown in fig. 6, the training device for the face pose estimation model includes:
the acquiring module 601 is configured to acquire a face image, and input the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
a training module 602 configured to perform iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value, and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch of normalization layers, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
In some embodiments, inputting a face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value, and a roll angle loss value of the face image, including:
carrying out feature extraction on the face image by using a first feature extraction network to obtain a first feature map;
inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map;
inputting the second feature map into a third feature extraction network for feature extraction to obtain a third feature map;
inputting the second feature map and the third feature map into a feature fusion network for feature fusion to obtain a fusion feature map;
and inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram.
In some embodiments, inputting the first feature map into a second feature extraction network for feature extraction to obtain a second feature map, including:
inputting the first feature map into a second depth separable convolution layer, a second batch normalization layer and a second activation function layer for feature extraction to obtain a first extracted feature map;
inputting the first extracted feature map into a first attention layer to obtain first global feature information of the first extracted feature map;
and inputting the first global feature information and the first extracted feature map into a second average pooling layer, and outputting a second feature map.
In some embodiments, inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map comprises:
performing matrix transformation processing on the first extracted characteristic graph to obtain a first transformation matrix;
inputting the first transformation matrix into the first attention layer to obtain a first characteristic parameter, a second characteristic parameter and a third characteristic parameter;
and determining first global feature information of the first extracted feature map according to the first feature parameter, the second feature parameter and the third feature parameter.
In some embodiments, the feature fusion network includes an upsampling layer, a fourth average pooling layer, a first fully connected layer, a fourth activation function layer.
Inputting the second feature diagram and the third feature diagram into a feature fusion network for feature fusion to obtain a fusion feature diagram, wherein the fusion feature diagram comprises:
processing the third feature map by utilizing an upsampling layer, a fourth average pooling layer, a first fully connected layer and a fourth activation function layer to obtain a fourth feature map;
and fusing the fourth feature map and the second feature map to obtain a fused feature map.
In some embodiments, processing the third feature map with the upsampling layer, the fourth average pooling layer, the first full link layer, and the fourth activation function layer to obtain a fourth feature map includes:
processing the third characteristic diagram by using an up-sampling layer to obtain an up-sampling characteristic diagram;
inputting the up-sampling feature map into a fourth average pooling layer, and outputting an average feature map;
inputting the average characteristic diagram into the first full-connection layer and the fourth activation function layer in sequence to obtain a channel weight coefficient of the average characteristic diagram;
and calculating to obtain a fourth feature map according to the channel weight coefficient and the third feature map.
In some embodiments, the loss calculation network 205 includes a yaw angle loss calculation layer 2051, a pitch angle loss calculation layer 2052, and a roll angle loss calculation layer 2053;
inputting the fusion characteristic diagram into a loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram, wherein the method comprises the following steps:
inputting the fusion characteristic diagram into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value;
inputting the fusion characteristic diagram into a pitch angle loss calculation layer, and calculating to obtain a pitch angle loss value;
and inputting the fusion feature map into a rolling angle loss calculation layer, and calculating to obtain a rolling angle loss value.
In some embodiments, inputting the fused feature map into a yaw angle loss calculation layer, and calculating to obtain a yaw angle loss value, includes:
calculating a cross entropy loss value, a mean loss value and a variance loss value of the fusion characteristic graph;
determining a first weight coefficient of the mean loss value and a second weight coefficient of the variance loss value;
and calculating the yaw angle loss value of the fusion characteristic diagram according to the cross entropy loss value, the mean loss value, the variance loss value, the first weight coefficient and the second weight coefficient.
Fig. 7 is a schematic diagram of a face pose estimation apparatus provided in an embodiment of the present disclosure. As shown in fig. 7, the face pose estimation apparatus includes:
an image acquisition module 701 configured to acquire a face image to be recognized;
and the recognition module 702 is configured to input the face image to be recognized into the final face pose estimation model, and output the yaw angle, pitch angle and roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of the face pose estimation model.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
Fig. 8 is a schematic diagram of an electronic device 8 provided by an embodiment of the disclosure. As shown in fig. 8, the electronic apparatus 8 of this embodiment includes: a processor 801, a memory 802, and a computer program 803 stored in the memory 802 and operable on the processor 801. The steps in the various method embodiments described above are implemented when the computer program 803 is executed by the processor 801. Alternatively, the processor 801 implements the functions of the respective modules/units in the above-described respective apparatus embodiments when executing the computer program 803.
The electronic device 8 may be a desktop computer, a notebook, a palm computer, a cloud server, or other electronic devices. The electronic device 8 may include, but is not limited to, a processor 801 and a memory 802. Those skilled in the art will appreciate that fig. 8 is merely an example of electronic device 8, does not constitute a limitation of electronic device 8, and may include more or fewer components than shown, or different components.
The Processor 801 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc.
The memory 802 may be an internal storage unit of the electronic device 8, for example, a hard disk or a memory of the electronic device 8. The memory 802 may also be an external storage device of the electronic device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the electronic device 8. The memory 802 may also include both internal and external storage units of the electronic device 8. The memory 802 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the method in the above embodiments, and may also be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of the above methods and embodiments. The computer program may comprise computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution media, and the like. It should be noted that the computer readable medium may be subject to suitable additions or deletions as required by legislative and patent practice within the jurisdiction; for example, in some jurisdictions, computer readable media may not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
The above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present disclosure, and are intended to be included within the scope of the present disclosure.

Claims (13)

1. A training method of a face pose estimation model is characterized by comprising the following steps:
acquiring a face image, and inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
performing iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, so as to obtain a final face pose estimation model;
the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depth separable convolutional layer, a first batch of normalization layers, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depth separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; the third feature extraction network includes a third depth separable convolution layer, a third batch normalization layer, a third activation function layer, and a second attention layer.
2. The method of claim 1, wherein inputting the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image comprises:
extracting the features of the face image by using the first feature extraction network to obtain a first feature map;
inputting the first feature map into the second feature extraction network for feature extraction to obtain a second feature map;
inputting the second feature map into the third feature extraction network for feature extraction to obtain a third feature map;
inputting the second feature map and the third feature map into the feature fusion network for feature fusion to obtain a fusion feature map;
and inputting the fusion characteristic diagram into the loss calculation network, and calculating to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fusion characteristic diagram.
3. The method of claim 2, wherein inputting the first feature map into the second feature extraction network for feature extraction to obtain a second feature map comprises:
inputting the first feature map into a second depth separable convolution layer, a second batch normalization layer and a second activation function layer for feature extraction to obtain a first extracted feature map;
inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map;
and inputting the first global feature information and the first extracted feature map into the second average pooling layer, and outputting a second feature map.
4. The method of claim 3, wherein inputting the first extracted feature map into the first attention layer to obtain first global feature information of the first extracted feature map comprises:
performing matrix transformation processing on the first extracted characteristic diagram to obtain a first transformation matrix;
inputting the first transformation matrix into the first attention layer to obtain a first characteristic parameter, a second characteristic parameter and a third characteristic parameter;
and determining first global feature information of the first extracted feature map according to the first feature parameter, the second feature parameter and the third feature parameter.
5. The method of claim 2, wherein the feature fusion network comprises an upsampling layer, a fourth average pooling layer, a first fully connected layer, a fourth activation function layer;
inputting the second feature map and the third feature map into the feature fusion network for feature fusion to obtain a fusion feature map, including:
processing the third feature map by utilizing an upsampling layer, a fourth average pooling layer, a first fully connected layer and a fourth activation function layer to obtain a fourth feature map;
and fusing the fourth feature map and the second feature map to obtain a fused feature map.
6. The method of claim 5, wherein processing the third feature map using an upsampling layer, a fourth average pooling layer, a first full-link layer, and a fourth activation function layer to obtain a fourth feature map comprises:
processing the third feature map by using the up-sampling layer to obtain an up-sampling feature map;
inputting the up-sampling feature map into the fourth average pooling layer, and outputting an average feature map;
inputting the average characteristic diagram into the first full-connection layer and the fourth activation function layer in sequence to obtain a channel weight coefficient of the average characteristic diagram;
and calculating to obtain a fourth feature map according to the channel weight coefficient and the third feature map.
7. The method of claim 2, wherein the loss calculation network comprises a yaw angle loss calculation layer, a pitch angle loss calculation layer, and a roll angle loss calculation layer;
inputting the fused feature map into the loss calculation network, and calculating a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the fused feature map comprises:
inputting the fused feature map into the yaw angle loss calculation layer, and calculating a yaw angle loss value;
inputting the fused feature map into the pitch angle loss calculation layer, and calculating a pitch angle loss value;
and inputting the fused feature map into the roll angle loss calculation layer, and calculating a roll angle loss value.
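The loss calculation network of claim 7 is simply three parallel heads applied to the same fused feature map, one per Euler angle. A minimal sketch, assuming each head maps the fused feature map to angle-bin logits and computes the loss detailed for the yaw head in claim 8:

```python
import torch.nn as nn

class LossCalculationNetwork(nn.Module):
    """Claim 7: three parallel angle heads over the same fused feature map."""
    def __init__(self, yaw_head, pitch_head, roll_head):
        super().__init__()
        self.yaw_head = yaw_head        # yaw angle loss calculation layer
        self.pitch_head = pitch_head    # pitch angle loss calculation layer
        self.roll_head = roll_head      # roll angle loss calculation layer

    def forward(self, fused_feature_map, yaw_label, pitch_label, roll_label):
        yaw_loss = self.yaw_head(fused_feature_map, yaw_label)
        pitch_loss = self.pitch_head(fused_feature_map, pitch_label)
        roll_loss = self.roll_head(fused_feature_map, roll_label)
        return yaw_loss, pitch_loss, roll_loss
```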
8. The method according to claim 7, wherein inputting the fused feature map into the yaw angle loss calculation layer, and calculating a yaw angle loss value comprises:
calculating a cross entropy loss value, a mean loss value and a variance loss value of the fused feature map;
determining a first weight coefficient of the mean loss value and a second weight coefficient of the variance loss value;
and calculating the yaw angle loss value of the fused feature map according to the cross entropy loss value, the mean loss value, the variance loss value, the first weight coefficient and the second weight coefficient.
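Claim 8 composes the yaw loss from a cross-entropy term and weighted mean and variance terms, in the style of binned-angle pose estimators. The sketch below assumes a specific binning scheme (66 bins of 3 degrees over roughly ±99 degrees) and example weight values, and assumes the per-bin logits come from a classification layer applied to the fused feature map; none of these choices are stated in the claim.

```python
import torch
import torch.nn.functional as F

def yaw_angle_loss(logits, target_angle, w_mean=1.0, w_var=0.05,
                   num_bins=66, bin_width=3.0, angle_offset=99.0):
    """Cross-entropy loss + weighted mean loss + weighted variance loss (claim 8).
    Binning scheme and weight coefficients are illustrative assumptions."""
    # cross entropy over the hard bin index of the ground-truth angle
    target_bin = ((target_angle + angle_offset) / bin_width).long().clamp(0, num_bins - 1)
    ce_loss = F.cross_entropy(logits, target_bin)

    # mean loss: expected angle of the soft bin distribution vs. ground truth
    probs = F.softmax(logits, dim=1)
    bin_centers = torch.arange(num_bins, device=logits.device,
                               dtype=logits.dtype) * bin_width - angle_offset
    pred_mean = (probs * bin_centers).sum(dim=1)
    mean_loss = F.l1_loss(pred_mean, target_angle)

    # variance loss: spread of the predicted distribution around its mean
    var_loss = (probs * (bin_centers - pred_mean.unsqueeze(1)) ** 2).sum(dim=1).mean()

    # first and second weight coefficients applied to the mean and variance terms
    return ce_loss + w_mean * mean_loss + w_var * var_loss
```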
9. A face pose estimation method is characterized by comprising the following steps:
acquiring a face image to be recognized;
inputting the face image to be recognized into a final face pose estimation model, and outputting the yaw angle, pitch angle and roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of a face pose estimation model according to any one of claims 1 to 8.
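A short usage sketch of the inference step in claim 9; the (3, H, W) input layout and the convention that the deployed model returns the three angles directly (rather than loss values) are illustrative assumptions.

```python
import torch

def estimate_face_pose(model, face_image):
    """Claim 9 inference path: feed a face image to the trained model and
    read out yaw, pitch and roll angles."""
    model.eval()
    with torch.no_grad():
        yaw, pitch, roll = model(face_image.unsqueeze(0))  # add a batch dimension
    return yaw, pitch, roll
```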
10. A training device for a face pose estimation model is characterized by comprising:
an acquisition module configured to acquire a face image and input the face image into a pre-constructed initial face pose estimation model to obtain a yaw angle loss value, a pitch angle loss value and a roll angle loss value of the face image;
a training module configured to perform iterative training on the initial face pose estimation model based on the yaw angle loss value, the pitch angle loss value and the roll angle loss value until a preset iteration termination condition is met, to obtain a final face pose estimation model;
wherein the initial face pose estimation model comprises a first feature extraction network, a second feature extraction network, a third feature extraction network, a feature fusion network and a loss calculation network; the first feature extraction network comprises a first depthwise separable convolution layer, a first batch normalization layer, a first activation function layer and a first average pooling layer; the second feature extraction network comprises a second depthwise separable convolution layer, a second batch normalization layer, a second activation function layer, a first attention layer and a second average pooling layer; and the third feature extraction network comprises a third depthwise separable convolution layer, a third batch normalization layer, a third activation function layer and a second attention layer.
11. A face pose estimation device, characterized by comprising:
the image acquisition module is configured to acquire a face image to be recognized;
a recognition module configured to input the face image to be recognized into a final face pose estimation model and output a yaw angle, a pitch angle and a roll angle of the face image to be recognized, wherein the final face pose estimation model is obtained by the training method of a face pose estimation model according to any one of claims 1 to 8.
12. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 9 when executing the computer program.
13. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN202310008849.2A 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device Pending CN115984934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310008849.2A CN115984934A (en) 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310008849.2A CN115984934A (en) 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device

Publications (1)

Publication Number Publication Date
CN115984934A true CN115984934A (en) 2023-04-18

Family

ID=85966519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310008849.2A Pending CN115984934A (en) 2023-01-04 2023-01-04 Training method of face pose estimation model, face pose estimation method and device

Country Status (1)

Country Link
CN (1) CN115984934A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711040A (en) * 2023-05-24 2024-03-15 荣耀终端有限公司 Calibration method and electronic equipment



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination