CN111353381B — 2D image-oriented human body 3D pose estimation method

Publication number: CN111353381B (granted publication of application CN111353381A)
Application number: CN202010021822.3A (filed by Zhejiang Shuike Culture Group Co ltd)
Authority: CN (China)
Legal status: Active (granted)
Other languages: Chinese (zh)
Inventors: 刘龙, 杨乐
Current assignees: Xi'an Huaqi Zhongxin Technology Development Co ltd; Zhejiang Shuike Culture Group Co ltd
Original assignee: Zhejiang Shuike Culture Group Co ltd


Classifications

    • G06V40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/20081 — Training; learning
    • G06T2207/20084 — Artificial neural networks [ANN]
    • G06T2207/30196 — Human being; person


Abstract

The invention discloses a 2D image-oriented human body 3D pose estimation method comprising the following steps: step 1, perform convolution, normalization and activation operations on the 2D image in sequence and output the resulting image; step 2, perform convolution, normalization and activation operations in sequence on the image output by step 1 and output the resulting image; step 3, feed the image output by step 2 into subnet one for processing and output feature maps C1 and C2; step 4, feed feature maps C1 and C2 into subnet two for processing and output feature maps D1, D2 and D3; step 5, feed feature maps D1, D2 and D3 into subnet three for processing and output feature maps E1, E2 and E3; step 6, process feature maps E1, E2 and E3 to obtain the matrix P, i.e., the estimated pose. The method provided by the invention estimates depth accurately, uses few algorithm parameters, and generalizes strongly.

Description

2D image-oriented human body 3D pose estimation method
Technical Field
The invention belongs to the technical field of human body pose estimation, and particularly relates to a 2D image-oriented human body 3D pose estimation method.
Background
Among deep neural network approaches, the methods for 3D estimation of the human pose in an image mainly include CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory network), GCN (Graph Convolutional Network) and GAN (Generative Adversarial Network), of which CNN is currently the mainstream.
The initial application of convolutional neural networks to pose estimation was 2D pose estimation of human bodies in images, i.e., estimating the joint points of a single person or multiple persons from a single picture and connecting the related joint points. As algorithm performance has been optimized and extended, the accuracy and evaluation indexes of 2D pose estimation have approached their bottleneck, and attention has turned to 3D pose estimation of human bodies in images. In recent years, methods for estimating the 3D pose of a human body in a single image have mainly been based on Stacked Hourglass (ECCV 2016), CPM (Convolutional Pose Machine, from Yaser Sheikh's group at CMU), MSPN (Multi-Stage Pose Network, proposed by Face++, COCO keypoint detection champion in 2018), HRNet, and the like. Stacked Hourglass stacks several hourglass modules, each using multiple residual blocks (He Kaiming, 2015), and trains the whole framework stage by stage to estimate the 3D pose; CPM computes a response map for each joint point at each stage and takes the position of the maximum response as the joint position; MSPN fuses multi-stage features of different scales, combining the semantic information of small-scale features with the local details of large-scale features to predict the joint positions. These methods can complete 3D pose estimation and score highly on various evaluation standards, but they have the following shortcomings:
(1) In predicting depth, the algorithms cannot accurately estimate the depth value of each joint point;
(2) The relations among the joints of the human body are not fully considered, so some estimated poses are wrong and do not conform to the kinematic relations among the joints of the human body (such as wrongly estimated knee bending);
(3) The algorithm models have a large number of parameters.
Disclosure of Invention
The invention aims to provide a 2D image-oriented human body 3D pose estimation method with accurate depth estimation, few algorithm parameters and strong generalization.
The technical scheme adopted by the invention is a 2D image-oriented human body 3D pose estimation method implemented according to the following steps:
Step 1, perform convolution, normalization and activation operations on the 2D image in sequence, and output the resulting image;
Step 2, perform convolution, normalization and activation operations in sequence on the image output by step 1, and output the resulting image;
Step 3, feed the image output by step 2 into subnet one for processing, and output feature maps C1 and C2;
Step 4, feed feature maps C1 and C2 into subnet two for processing, and output feature maps D1, D2 and D3;
Step 5, feed feature maps D1, D2 and D3 into subnet three for processing, and output feature maps E1, E2 and E3;
Step 6, process feature maps E1, E2 and E3 to obtain the matrix P, i.e., the estimated pose.
The invention is further characterized as follows:
the step 1 is specifically implemented according to the following steps:
step 1.1, carrying out the following operations on the 2D image simultaneously:
(1) The convolution operation is performed by using a convolution kernel of 3×3, and the number of channels is (1-a) in -b in ) X64, obtain high frequency characteristic diagram A 1 =[128,128,(1-a in -b in )×64]The method comprises the steps of carrying out a first treatment on the surface of the Wherein a is in Is a low frequency channel number coefficient; b in Is the intermediate frequency channel number coefficient;
(2) Downsampling by 1/2 times, and obtaining the number of channels b in X64, obtain intermediate frequency characteristic diagram A 2 =[64,64,b in ×64];
(3) Downsampling by 1/4 times, and obtaining channel number a in X64, obtain low frequency characteristic diagram A 3 =[32,32,a in ×64];
Step 1.2, performing the following operations on each image output in step 1.1:
first, the average μ of the image pixels is calculated 1 The method comprises the steps of carrying out a first treatment on the surface of the Then, the variance sigma of the image pixels is calculated 1 The method comprises the steps of carrying out a first treatment on the surface of the Then carrying out normalization processing on the image pixels to obtainFinally, each pixel is activated by a linear rectification function to obtain +.>
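For concreteness, the following is a minimal PyTorch-style sketch of this multi-frequency stem. The module name MultiFreqStem, the 3-channel input, the average-pool downsampling, and the per-branch convolutions are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFreqStem(nn.Module):
    """Split a 2D image into high/mid/low-frequency feature maps (step 1 sketch)."""
    def __init__(self, a_in=0.25, b_in=0.25, width=64):
        super().__init__()
        c_high = int((1 - a_in - b_in) * width)  # high-frequency channels
        c_mid = int(b_in * width)                # intermediate-frequency channels
        c_low = int(a_in * width)                # low-frequency channels
        self.conv_high = nn.Conv2d(3, c_high, 3, padding=1)
        self.conv_mid = nn.Conv2d(3, c_mid, 3, padding=1)
        self.conv_low = nn.Conv2d(3, c_low, 3, padding=1)

    def forward(self, x):
        # step 1.1: full-, half- and quarter-resolution branches
        a1 = self.conv_high(x)
        a2 = self.conv_mid(F.avg_pool2d(x, 2))
        a3 = self.conv_low(F.avg_pool2d(x, 4))
        # step 1.2: per-map normalization followed by ReLU activation
        def norm_act(t, eps=1e-3):
            mu, var = t.mean(), t.var(unbiased=False)
            return F.relu((t - mu) / torch.sqrt(var + eps))
        return norm_act(a1), norm_act(a2), norm_act(a3)
```

This split into full-, half- and quarter-resolution branches is what the later steps repeatedly exchange and merge.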
Step 2 is specifically implemented according to the following steps:
Step 2.1, extract features from the high-frequency, intermediate-frequency and low-frequency feature maps output by step 1, i.e., perform the following operations simultaneously:
convolve the high-frequency map with a 3×3 convolution kernel to obtain feature map B1_conv;
downsample the high-frequency map by a factor of 2 to obtain feature map B1_down;
downsample the high-frequency map by a factor of 4 to obtain feature map B1_down2;
upsample the intermediate-frequency map by a factor of 2 to obtain feature map B2_up;
convolve the intermediate-frequency map with a 3×3 convolution kernel to obtain feature map B2_conv;
downsample the intermediate-frequency map by a factor of 2 to obtain feature map B2_down;
upsample the low-frequency map by a factor of 4 to obtain feature map B3_up2;
upsample the low-frequency map by a factor of 2 to obtain feature map B3_up;
convolve the low-frequency map with a 3×3 convolution kernel to obtain feature map B3_conv;
Step 2.2, channel merging:
merge feature maps B1_conv, B2_up and B3_up2 along the channel dimension to obtain the high-frequency feature map B1 = [64, 64, (1-a_in-b_in)×64];
merge feature maps B1_down, B2_conv and B3_up along the channel dimension to obtain the intermediate-frequency feature map B2 = [32, 32, b_in×64];
merge feature maps B1_down2, B2_down and B3_conv along the channel dimension to obtain the low-frequency feature map B3 = [16, 16, a_in×64];
Step 2.3, perform the following operations on each feature map output by step 2.2:
first, calculate the mean μ2 of the pixels; then calculate the variance σ2 of the pixels; then normalize the pixels; finally, activate each pixel with the linear rectification function to obtain the output of step 2.
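The nine per-branch operations of step 2.1 together with the merging of step 2.2 form a three-branch exchange unit. Continuing the sketch above, below is a minimal version of one such unit, assuming bilinear upsampling, average-pool downsampling, and a 1×1 projection after each concatenation so the merged maps keep the patent's stated per-branch channel widths; all module names are illustrative.

```python
class ExchangeUnit(nn.Module):
    """Exchange features among the high/mid/low-frequency branches (steps 2.1-2.2 sketch)."""
    def __init__(self, c_high, c_mid, c_low):
        super().__init__()
        c_all = c_high + c_mid + c_low
        # same-resolution 3x3 convolution per source branch
        self.conv_h = nn.Conv2d(c_high, c_high, 3, padding=1)
        self.conv_m = nn.Conv2d(c_mid, c_mid, 3, padding=1)
        self.conv_l = nn.Conv2d(c_low, c_low, 3, padding=1)
        # 1x1 projections restoring the per-branch widths after merging (assumption)
        self.proj_h = nn.Conv2d(c_all, c_high, 1)
        self.proj_m = nn.Conv2d(c_all, c_mid, 1)
        self.proj_l = nn.Conv2d(c_all, c_low, 1)

    def forward(self, h, m, l):
        up = lambda t, s: F.interpolate(t, scale_factor=s, mode="bilinear", align_corners=False)
        down = lambda t, s: F.avg_pool2d(t, s)
        # step 2.2: merge the three contributions arriving at each resolution
        b1 = self.proj_h(torch.cat([self.conv_h(h), up(m, 2), up(l, 4)], dim=1))
        b2 = self.proj_m(torch.cat([down(h, 2), self.conv_m(m), up(l, 2)], dim=1))
        b3 = self.proj_l(torch.cat([down(h, 4), down(m, 2), self.conv_l(l)], dim=1))
        return b1, b2, b3
```

Note that in the patent the step 2 outputs halve in resolution relative to step 1 (B1 is 64×64 while A1 is 128×128); a strided variant of the same unit would account for that.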
In step 1.2 and step 2.3, the mean of the pixels is calculated as follows:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$$

where $x_i$ is the image input of each layer and $m$ is the number of pixels;

the variance of the pixels is calculated as follows:

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu\right)^2$$

where $x_i$ is the image input of each layer and $m$ is the number of pixels;

the normalization formula is as follows:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

where $\varepsilon$ is a very small number between 0.0001 and 0.01;

the linear rectification activation function is as follows:

$$f(x) = \max(0, x)$$
the step 3 is specifically implemented according to the following steps:
step 3.1, image is processedInputting the first residual block in the subnet one for processing
Step 3.1.1, image pairThe high-frequency characteristic diagram, the medium-frequency characteristic diagram and the low-frequency characteristic diagram in the (a) are subjected to characteristic extraction, namely the following operations are performed simultaneously:
performing convolution operation on the high-frequency image by adopting a convolution check of 3×3 to obtain a feature map C 1_conv
Performing 1/2 times downsampling operation on the high-frequency image to obtain a characteristic diagram C 1_down
Downsampling the high-frequency image by 1/4 times to obtain a feature image C 1_down2
Up-sampling the intermediate frequency image by 2 times to obtain a characteristic diagram C 2_up
Performing convolution operation on the intermediate frequency image by adopting a convolution check of 3 multiplied by 3 to obtain a characteristic diagram C 2_conv
Downsampling the intermediate frequency image by 1/2 times to obtain a characteristic diagram C 2_down
4 times up-sampling operation is carried out on the low-frequency image to obtain a characteristic diagram C 3_up2
2 times up-sampling operation is carried out on the low-frequency image to obtain a characteristic diagram C 3_up
Convolving the low-frequency image by using a convolution check of 3×3 to obtain a feature map C 3_conv
Step 3.1.2 channel merger
For characteristic diagram C 1_conv 、C 2_up 、C 3_up2 Combining the channel numbers to obtain a characteristic diagram C first_1_H
For characteristic diagram C 1_down 、C 2_conv 、C 3_up Combining the channel numbers to obtain a characteristic diagram C first_1_M
For characteristic diagram C 1_down2 、C 2_down 、C 3_conv Combining the channel numbers to obtain a characteristic diagram C first_1_L
Step 3.1.3, feature map C is obtained by the method of step 1.2 first_1_H Performing corresponding operation to obtain a characteristic diagram C first_2_H
The characteristic diagram C is subjected to the method of step 1.2 first_1_M Performing corresponding operation to obtain a characteristic diagram C first_2_M
The characteristic diagram C is subjected to the method of step 1.2 first_1_L Performing corresponding operation to obtain a characteristic diagram C first_2_L
Step 3.1.4, the method from step 3.1.1 to step 3.1.3 is adopted for the characteristic diagram C first_2_H Performing corresponding operation to obtain a characteristic diagram C first_3_H The method comprises the steps of carrying out a first treatment on the surface of the The characteristic diagram C is subjected to a method from step 3.1.1 to step 3.1.3 first_2_M Performing corresponding operation to obtain a characteristic diagram C first_3_M The method comprises the steps of carrying out a first treatment on the surface of the The characteristic diagram C is subjected to a method from step 3.1.1 to step 3.1.3 first_2_L Performing corresponding operation to obtain a characteristic diagram C first_3_L
Step 3.1.5, adopting the method from step 3.1.1 to step 3.1.3,for characteristic diagram C first_3_H 、C first_3_M 、C first_3_L Performing corresponding operation to obtain a characteristic diagram C first_4_H 、C first_4_M 、C first_4_L
Step 3.1.6, image is takenHigh-frequency characteristic diagram and characteristic diagram C first_4_H Adding to obtain a characteristic diagram C 1_first The method comprises the steps of carrying out a first treatment on the surface of the Image +.>Mid-frequency signature and signature C first_4_M Adding to obtain a characteristic diagram C 2_first The method comprises the steps of carrying out a first treatment on the surface of the Image +.>Low frequency characteristic diagram and characteristic diagram C first_4_L Adding to obtain a characteristic diagram C 3_first
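Steps 3.1.1 through 3.1.6 thus describe one residual block: three consecutive exchange units, each followed by the normalization and activation of step 1.2, with an identity skip added per frequency branch. A compact sketch reusing the illustrative ExchangeUnit defined after step 2:

```python
class MultiFreqResidualBlock(nn.Module):
    """One residual block of subnet one (step 3.1 sketch): three exchange units
    with per-map normalization and ReLU, plus an identity skip on each branch."""
    def __init__(self, c_high, c_mid, c_low):
        super().__init__()
        self.units = nn.ModuleList(ExchangeUnit(c_high, c_mid, c_low) for _ in range(3))

    @staticmethod
    def norm_act(t, eps=1e-3):
        mu, var = t.mean(), t.var(unbiased=False)
        return F.relu((t - mu) / torch.sqrt(var + eps))

    def forward(self, h, m, l):
        x_h, x_m, x_l = h, m, l
        for unit in self.units:  # steps 3.1.1-3.1.5
            x_h, x_m, x_l = unit(x_h, x_m, x_l)
            x_h, x_m, x_l = map(self.norm_act, (x_h, x_m, x_l))
        # step 3.1.6: residual addition per frequency branch
        return h + x_h, m + x_m, l + x_l
```

The skip addition requires the exchange units to preserve per-branch shapes, which the 1×1 projections assumed above guarantee.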
Step 3.2, feed the output of step 3.1 into the second residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_first, C2_first and C3_first to obtain feature maps C1_second, C2_second and C3_second;
Step 3.3, feed the output of step 3.2 into the third residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_second, C2_second and C3_second to obtain feature maps C1_third, C2_third and C3_third;
Step 3.4, feed the output of step 3.3 into the fourth residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_third, C2_third and C3_third to obtain feature maps C1_fourth, C2_fourth and C3_fourth;
Step 3.5, feed the output of step 3.4 into the conversion layer in subnet one for processing:
convolve feature maps C1_fourth, C2_fourth and C3_fourth with a 3×3 convolution kernel to obtain feature maps C1_fifth_1, C2_fifth_1 and C3_fifth_1, denoted C1; that is, C1 comprises feature maps C1_fifth_1, C2_fifth_1 and C3_fifth_1;
downsample feature maps C1_fourth, C2_fourth and C3_fourth by a factor of 2 to obtain feature maps C1_fifth_2, C2_fifth_2 and C3_fifth_2, denoted C2; that is, C2 comprises feature maps C1_fifth_2, C2_fifth_2 and C3_fifth_2.
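The conversion layer of step 3.5 thus forks the three-branch stream into a same-resolution output C1 and a half-resolution output C2. A sketch under the same assumptions as above:

```python
class SubnetOneTransition(nn.Module):
    """Step 3.5 sketch: emit C1 (3x3 convolution, same resolution) and
    C2 (downsampled by a factor of 2) from the three fourth-block branches."""
    def __init__(self, channels):  # channels = (c_high, c_mid, c_low)
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(c, c, 3, padding=1) for c in channels)

    def forward(self, h, m, l):
        c1 = tuple(conv(t) for conv, t in zip(self.convs, (h, m, l)))  # C*_fifth_1
        c2 = tuple(F.avg_pool2d(t, 2) for t in (h, m, l))              # C*_fifth_2
        return c1, c2
```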
Step 4 is specifically implemented according to the following steps:
step 4.1, processing the first residual block in the output/input subnet II of the step 3, and adopting the method of the step 3.1 to carry out C 1 Performing corresponding operation to obtain D 1_first
C using the method of step 3.1 2 Performing corresponding operation to obtain D 2_first
Step 4.2, processing the second residual block in the output/input sub-network II of the step 4.1
Pair D using the procedure of step 3.1 1_first Performing corresponding operation to obtain D 1_sec ond
Pair D using the procedure of step 3.1 2_first Performing corresponding operation to obtain D 2_sec ond
Step 4.3, processing the third residual block in the output/input subnet II of the step 4.2
Pair D using the procedure of step 3.1 1_sec ond Performing corresponding operation to obtain D 1_third
Pair D using the procedure of step 3.1 2_sec ond Performing corresponding operation to obtain D 2_third
Step 4.4, processing the third residual block in the second output/input subnet of the step 4.3
Pair D using the procedure of step 3.1 1_third Performing corresponding operation to obtain D 1_fourth
Pair D using the procedure of step 3.1 2_third Performing corresponding operation to obtain D 2_fourth
Step 4.5, processing the conversion layer in the output/input sub-network II of the step 4.4
Using a 3 x 3 convolution check D 1_fourth Performing convolution operation to obtain D 1_fifth_1 The method comprises the steps of carrying out a first treatment on the surface of the Pair D 2_fourth Performing up-sampling operation by 2 times to obtain D 2_fifth_1 The method comprises the steps of carrying out a first treatment on the surface of the Will D 1_fifth_1 And D 2_fifth_1 Adding to obtain D 1
Pair D 1_fourth Performing 1/2 times downsampling operation to obtain D 1_fifth_2 The method comprises the steps of carrying out a first treatment on the surface of the Using a 3 x 3 convolution check D 2_fourth Performing convolution operation to obtain D 2_fifth_2 The method comprises the steps of carrying out a first treatment on the surface of the Will D 1_fifth_2 And D 2_fifth_2 Adding to obtain D 2
Pair D 1_fourth Performing 1/4 times downsampling operation to obtain D 1_fifth_3 The method comprises the steps of carrying out a first treatment on the surface of the Pair D 2_fourth Performing 1/2 times downsampling operation to obtain D 2_fifth_3 The method comprises the steps of carrying out a first treatment on the surface of the Will D 1_fifth_3 And D 2_fifth_3 Adding to obtain D 3
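The conversion layer of step 4.5 fuses the two resolution streams into three output scales by bringing them to a common resolution and adding them; step 5.5 does the same with three input streams. Below is a sketch of the two-stream case, treating each stream as a single tensor for brevity (in the patent each stream itself carries three frequency components) and assuming both streams share a channel width so the additions are shape-compatible:

```python
class TwoStreamTransition(nn.Module):
    """Step 4.5 sketch: fuse D1_fourth and D2_fourth into D1, D2, D3."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, d1_fourth, d2_fourth):
        up = lambda t, s: F.interpolate(t, scale_factor=s, mode="bilinear", align_corners=False)
        d1 = self.conv1(d1_fourth) + up(d2_fourth, 2)                 # full resolution
        d2 = F.avg_pool2d(d1_fourth, 2) + self.conv2(d2_fourth)       # 1/2 resolution
        d3 = F.avg_pool2d(d1_fourth, 4) + F.avg_pool2d(d2_fourth, 2)  # 1/4 resolution
        return d1, d2, d3
```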
Step 5 is specifically implemented according to the following steps:
step 5.1, processing the first residual block in the output/input subnet III of the step 4
Pair D using the procedure of step 3.1 1 Performing corresponding operation to obtain E 1_first
Pair D using the procedure of step 3.1 2 Performing corresponding operation to obtain E 2_first
Pair D using the procedure of step 3.1 3 Performing corresponding operation to obtain E 3_first
Step 5.2, processing the second residual block in the output/input subnet III of the step 5.1
Pair E using the procedure of step 3.1 1_first Performing corresponding operationsObtaining E 1_sec ond
Pair E using the procedure of step 3.1 2_first Performing corresponding operation to obtain E 2_sec ond
Pair E using the procedure of step 3.1 3_first Performing corresponding operation to obtain E 3_sec ond
Step 5.3, processing the third residual block in the output/input subnet III of the step 5.2
Pair E using the procedure of step 3.1 1_sec ond Performing corresponding operation to obtain E 1_third
Pair E using the procedure of step 3.1 2_sec ond Performing corresponding operation to obtain E 2_third
Pair E using the procedure of step 3.1 3_sec ond Performing corresponding operation to obtain E 3_third
Step 5.4, processing the fourth residual block in the output/input subnet III of the step 5.3
Pair E using the procedure of step 3.1 1_third Performing corresponding operation to obtain E 1_fourth
Pair E using the procedure of step 3.1 2_third Performing corresponding operation to obtain E 2_fourth
Pair E using the procedure of step 3.1 3_third Performing corresponding operation to obtain E 3_fourth
Step 5.5, processing the conversion layer in the output/input subnet III of step 5.4
E was checked using a 3X 3 convolution 1_fourth Performing convolution operation to obtain E 1_fifth_1 The method comprises the steps of carrying out a first treatment on the surface of the Pair E 2_fourth Performing up-sampling operation by 2 times to obtain E 2_fifth_1 The method comprises the steps of carrying out a first treatment on the surface of the Pair E 3_fourth 4 times of up-sampling operation is carried out to obtain E 3_fifth_1 The method comprises the steps of carrying out a first treatment on the surface of the Will E 1_fifth_1 、E 2_fifth_1 、E 3_fifth_1 Adding to obtain E 1
Pair E 1_fourth Performing 1/2 times downsampling operation to obtain E 1_fifth_2 The method comprises the steps of carrying out a first treatment on the surface of the E was checked using a 3X 3 convolution 2_fourth Performing convolution operation to obtain E 2_fifth_2 The method comprises the steps of carrying out a first treatment on the surface of the Pair E 3_fourth Performing up-sampling operation by 2 times to obtain E 3_fifth_2 The method comprises the steps of carrying out a first treatment on the surface of the Will E 1_fifth_2 、E 2_fifth_2 、E 3_fifth_2 Adding to obtain E 2
Pair E 1_fourth Performing 1/4 times downsampling operation to obtain E 1_fifth_3 The method comprises the steps of carrying out a first treatment on the surface of the Pair E 2_fourth Performing 1/2 times downsampling operation to obtain E 2_fifth_3 The method comprises the steps of carrying out a first treatment on the surface of the E was checked using a 3X 3 convolution 3_fourth Performing convolution operation to obtain E 3_fifth_3 The method comprises the steps of carrying out a first treatment on the surface of the Will E 1_fifth_3 、E 2_fifth_3 、E 3_fifth_3 Adding to obtain E 3
The specific process of step 6 is as follows:
Step 6.1, convolve E1 with a 3×3 convolution kernel to obtain E1_conv; upsample E2 by a factor of 2 to obtain E2_up; upsample E3 by a factor of 4 to obtain E3_up2; add E1_conv, E2_up and E3_up2 to obtain P_pre;
Step 6.2, apply a matrix transformation to P_pre to obtain the feature map P_pre_trans = [64, 64, 64, AllJoint]; apply the softmax operation over the first three dimensions of P_pre_trans to obtain the feature map H;
Step 6.3, extract the joint coordinates from the feature map H; the operation is expressed as:

$$P_x = \sum_{w=1}^{W}\sum_{h=1}^{H}\sum_{d=1}^{D} w \cdot H(w,h,d),\qquad P_y = \sum_{w=1}^{W}\sum_{h=1}^{H}\sum_{d=1}^{D} h \cdot H(w,h,d),\qquad P_z = \sum_{w=1}^{W}\sum_{h=1}^{H}\sum_{d=1}^{D} d \cdot H(w,h,d)$$

where W, H, D are respectively the width, height and depth (number) of the feature map;
Step 6.4, splice P_x, P_y and P_z to obtain the matrix P, i.e., the estimated pose.
The softmax is expressed as:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

where $x_i$ is the pixel value of the i-th pixel.
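Step 6 is in effect a soft-argmax readout: the softmax turns each joint's 64×64×64 volume into a probability distribution, and the expected index along each axis gives that joint's coordinate. A minimal sketch, continuing the imports above and assuming P_pre_trans is laid out as [W, H, D, AllJoint]; the expectation formula reconstructed above is itself an assumption where the patent's own formula image is unavailable:

```python
def soft_argmax_3d(p_pre_trans):
    """Steps 6.2-6.4 sketch: p_pre_trans has shape [W, H, D, AllJoint].

    Returns a [AllJoint, 3] matrix P of (x, y, z) joint coordinates.
    """
    w_dim, h_dim, d_dim, num_joints = p_pre_trans.shape
    vol = p_pre_trans.permute(3, 0, 1, 2).reshape(num_joints, -1)
    probs = torch.softmax(vol, dim=1).reshape(num_joints, w_dim, h_dim, d_dim)  # feature map H
    ws = torch.arange(w_dim, dtype=probs.dtype)
    hs = torch.arange(h_dim, dtype=probs.dtype)
    ds = torch.arange(d_dim, dtype=probs.dtype)
    # expected index along each axis = soft-argmax coordinate
    p_x = (probs.sum(dim=(2, 3)) * ws).sum(dim=1)
    p_y = (probs.sum(dim=(1, 3)) * hs).sum(dim=1)
    p_z = (probs.sum(dim=(1, 2)) * ds).sum(dim=1)
    return torch.stack([p_x, p_y, p_z], dim=1)  # matrix P
```

Because the expectation is differentiable, the joint coordinates can be trained directly with the coordinate losses described later in the document.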
The beneficial effects of the invention are as follows:
(1) Compared with the traditional convolution, the improved convolution can extract different features from the samples with fewer parameters, making the network lightweight;
(2) Through the self-built neural network, the invention extracts large-scale details and small-scale global features in a targeted manner, and adopts a shallow-to-deep-to-shallow process to achieve more accurate 3D pose estimation;
(3) The invention uses the idea of residual networks, reducing the number of algorithm parameters and avoiding gradient explosion and vanishing gradients;
(4) By preprocessing the dataset, optimizing the loss function, and training the network with this loss function, the convolutional neural network algorithm of the invention is better suited to estimating human poses in reality;
(5) The relay supervision adopted in the loss function not only makes the training process of the network observable but also improves the convergence rate of the network.
Drawings
FIG. 1 is a structural block diagram of the convolutional neural network algorithm of the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 2 is a convolution schematic diagram of step 1 of the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 3 is a schematic diagram of step 3 of the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 4 is a schematic diagram of step 4 of the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 5 is a schematic diagram of step 5 of the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 6 is a human torso mobility diagram in the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 7 is a schematic view of the joint points in the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 8 is a conventional 3D pose diagram estimated based on the Hourglass framework;
FIG. 9 is a front view of a conventional 3D pose estimated based on the Hourglass framework;
FIG. 10 is a right side view of a conventional 3D pose estimated based on the Hourglass framework;
FIG. 11 is a left side view of a conventional 3D pose estimated based on the Hourglass framework;
FIG. 12 is a front view of a 3D pose estimated by the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 13 is a right side view of a 3D pose estimated by the 2D image-oriented human body 3D pose estimation method of the present invention;
FIG. 14 is a left side view of a 3D pose estimated by the 2D image-oriented human body 3D pose estimation method of the present invention.
Detailed Description
The invention is described in detail below with reference to the drawings and specific embodiments.
As shown in fig. 1, the 2D image-oriented human body 3D pose estimation method is implemented according to the following steps:
Step 1, perform convolution, normalization and activation operations on the 2D image in sequence, and output the resulting image.
As shown in fig. 2, step 1 is specifically implemented according to the following steps:
Step 1.1, perform the following operations on the 2D image simultaneously:
(1) Convolve with a 3×3 convolution kernel, the number of output channels being (1-a_in-b_in)×64, to obtain the high-frequency feature map A1 = [128, 128, (1-a_in-b_in)×64], where a_in is the low-frequency channel-number coefficient and b_in is the intermediate-frequency channel-number coefficient;
(2) Downsample by a factor of 2, the number of channels being b_in×64, to obtain the intermediate-frequency feature map A2 = [64, 64, b_in×64];
(3) Downsample by a factor of 4, the number of channels being a_in×64, to obtain the low-frequency feature map A3 = [32, 32, a_in×64];
Step 1.2, perform the following operations on each feature map output by step 1.1:
first, calculate the mean μ1 of the pixels; then calculate the variance σ1 of the pixels; then normalize the pixels; finally, activate each pixel with the linear rectification function to obtain the output of step 1.
Step 2, perform convolution, normalization and activation operations in sequence on the image output by step 1, and output the resulting image.
Step 2 is specifically implemented according to the following steps:
Step 2.1, extract features from the high-frequency, intermediate-frequency and low-frequency feature maps output by step 1, i.e., perform the following operations simultaneously:
convolve the high-frequency map with a 3×3 convolution kernel to obtain feature map B1_conv;
downsample the high-frequency map by a factor of 2 to obtain feature map B1_down;
downsample the high-frequency map by a factor of 4 to obtain feature map B1_down2;
upsample the intermediate-frequency map by a factor of 2 to obtain feature map B2_up;
convolve the intermediate-frequency map with a 3×3 convolution kernel to obtain feature map B2_conv;
downsample the intermediate-frequency map by a factor of 2 to obtain feature map B2_down;
upsample the low-frequency map by a factor of 4 to obtain feature map B3_up2;
upsample the low-frequency map by a factor of 2 to obtain feature map B3_up;
convolve the low-frequency map with a 3×3 convolution kernel to obtain feature map B3_conv;
Step 2.2, channel merging:
merge feature maps B1_conv, B2_up and B3_up2 along the channel dimension to obtain the high-frequency feature map B1 = [64, 64, (1-a_in-b_in)×64];
merge feature maps B1_down, B2_conv and B3_up along the channel dimension to obtain the intermediate-frequency feature map B2 = [32, 32, b_in×64];
merge feature maps B1_down2, B2_down and B3_conv along the channel dimension to obtain the low-frequency feature map B3 = [16, 16, a_in×64];
Step 2.3, perform the following operations on each feature map output by step 2.2:
first, calculate the mean μ2 of the pixels; then calculate the variance σ2 of the pixels; then normalize the pixels; finally, activate each pixel with the linear rectification function to obtain the output of step 2.
Step 3, feed the image output by step 2 into subnet one for processing, and output feature maps C1 and C2.
As shown in fig. 3, step 3 is specifically implemented according to the following steps:
Step 3.1, feed the image output by step 2 into the first residual block in subnet one for processing.
Step 3.1.1, extract features from the high-frequency, intermediate-frequency and low-frequency feature maps of the input, i.e., perform the following operations simultaneously:
convolve the high-frequency map with a 3×3 convolution kernel to obtain feature map C1_conv;
downsample the high-frequency map by a factor of 2 to obtain feature map C1_down;
downsample the high-frequency map by a factor of 4 to obtain feature map C1_down2;
upsample the intermediate-frequency map by a factor of 2 to obtain feature map C2_up;
convolve the intermediate-frequency map with a 3×3 convolution kernel to obtain feature map C2_conv;
downsample the intermediate-frequency map by a factor of 2 to obtain feature map C2_down;
upsample the low-frequency map by a factor of 4 to obtain feature map C3_up2;
upsample the low-frequency map by a factor of 2 to obtain feature map C3_up;
convolve the low-frequency map with a 3×3 convolution kernel to obtain feature map C3_conv;
Step 3.1.2, channel merging:
merge feature maps C1_conv, C2_up and C3_up2 along the channel dimension to obtain feature map C_first_1_H;
merge feature maps C1_down, C2_conv and C3_up along the channel dimension to obtain feature map C_first_1_M;
merge feature maps C1_down2, C2_down and C3_conv along the channel dimension to obtain feature map C_first_1_L;
Step 3.1.3, apply the operations of step 1.2 to feature map C_first_1_H to obtain feature map C_first_2_H; apply the operations of step 1.2 to feature map C_first_1_M to obtain feature map C_first_2_M; apply the operations of step 1.2 to feature map C_first_1_L to obtain feature map C_first_2_L;
Step 3.1.4, apply the operations of steps 3.1.1 to 3.1.3 to feature map C_first_2_H to obtain feature map C_first_3_H; apply the operations of steps 3.1.1 to 3.1.3 to feature map C_first_2_M to obtain feature map C_first_3_M; apply the operations of steps 3.1.1 to 3.1.3 to feature map C_first_2_L to obtain feature map C_first_3_L;
Step 3.1.5, apply the operations of steps 3.1.1 to 3.1.3 to feature maps C_first_3_H, C_first_3_M and C_first_3_L to obtain feature maps C_first_4_H, C_first_4_M and C_first_4_L;
Step 3.1.6, add the high-frequency feature map of the input to feature map C_first_4_H to obtain feature map C1_first; add the intermediate-frequency feature map of the input to feature map C_first_4_M to obtain feature map C2_first; add the low-frequency feature map of the input to feature map C_first_4_L to obtain feature map C3_first.
Step 3.2, feed the output of step 3.1 into the second residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_first, C2_first and C3_first to obtain feature maps C1_second, C2_second and C3_second;
Step 3.3, feed the output of step 3.2 into the third residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_second, C2_second and C3_second to obtain feature maps C1_third, C2_third and C3_third;
Step 3.4, feed the output of step 3.3 into the fourth residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_third, C2_third and C3_third to obtain feature maps C1_fourth, C2_fourth and C3_fourth;
Step 3.5, feed the output of step 3.4 into the conversion layer in subnet one for processing:
convolve feature maps C1_fourth, C2_fourth and C3_fourth with a 3×3 convolution kernel to obtain feature maps C1_fifth_1, C2_fifth_1 and C3_fifth_1, denoted C1; that is, C1 comprises feature maps C1_fifth_1, C2_fifth_1 and C3_fifth_1;
downsample feature maps C1_fourth, C2_fourth and C3_fourth by a factor of 2 to obtain feature maps C1_fifth_2, C2_fifth_2 and C3_fifth_2, denoted C2; that is, C2 comprises feature maps C1_fifth_2, C2_fifth_2 and C3_fifth_2.
Step 4, feed feature maps C1 and C2 into subnet two for processing, and output feature maps D1, D2 and D3.
As shown in fig. 4, step 4 is specifically implemented according to the following steps:
Step 4.1, feed the output of step 3 into the first residual block in subnet two for processing:
apply the operations of step 3.1 to C1 to obtain D1_first;
apply the operations of step 3.1 to C2 to obtain D2_first;
Step 4.2, feed the output of step 4.1 into the second residual block in subnet two for processing:
apply the operations of step 3.1 to D1_first to obtain D1_second;
apply the operations of step 3.1 to D2_first to obtain D2_second;
Step 4.3, feed the output of step 4.2 into the third residual block in subnet two for processing:
apply the operations of step 3.1 to D1_second to obtain D1_third;
apply the operations of step 3.1 to D2_second to obtain D2_third;
Step 4.4, feed the output of step 4.3 into the fourth residual block in subnet two for processing:
apply the operations of step 3.1 to D1_third to obtain D1_fourth;
apply the operations of step 3.1 to D2_third to obtain D2_fourth;
Step 4.5, feed the output of step 4.4 into the conversion layer in subnet two for processing:
convolve D1_fourth with a 3×3 convolution kernel to obtain D1_fifth_1; upsample D2_fourth by a factor of 2 to obtain D2_fifth_1; add D1_fifth_1 and D2_fifth_1 to obtain D1;
downsample D1_fourth by a factor of 2 to obtain D1_fifth_2; convolve D2_fourth with a 3×3 convolution kernel to obtain D2_fifth_2; add D1_fifth_2 and D2_fifth_2 to obtain D2;
downsample D1_fourth by a factor of 4 to obtain D1_fifth_3; downsample D2_fourth by a factor of 2 to obtain D2_fifth_3; add D1_fifth_3 and D2_fifth_3 to obtain D3.
Step 5, feed feature maps D1, D2 and D3 into subnet three for processing, and output feature maps E1, E2 and E3.
As shown in fig. 5, step 5 is specifically implemented according to the following steps:
Step 5.1, feed the output of step 4 into the first residual block in subnet three for processing:
apply the operations of step 3.1 to D1 to obtain E1_first;
apply the operations of step 3.1 to D2 to obtain E2_first;
apply the operations of step 3.1 to D3 to obtain E3_first;
Step 5.2, feed the output of step 5.1 into the second residual block in subnet three for processing:
apply the operations of step 3.1 to E1_first to obtain E1_second;
apply the operations of step 3.1 to E2_first to obtain E2_second;
apply the operations of step 3.1 to E3_first to obtain E3_second;
Step 5.3, feed the output of step 5.2 into the third residual block in subnet three for processing:
apply the operations of step 3.1 to E1_second to obtain E1_third;
apply the operations of step 3.1 to E2_second to obtain E2_third;
apply the operations of step 3.1 to E3_second to obtain E3_third;
Step 5.4, feed the output of step 5.3 into the fourth residual block in subnet three for processing:
apply the operations of step 3.1 to E1_third to obtain E1_fourth;
apply the operations of step 3.1 to E2_third to obtain E2_fourth;
apply the operations of step 3.1 to E3_third to obtain E3_fourth;
Step 5.5, feed the output of step 5.4 into the conversion layer in subnet three for processing:
convolve E1_fourth with a 3×3 convolution kernel to obtain E1_fifth_1; upsample E2_fourth by a factor of 2 to obtain E2_fifth_1; upsample E3_fourth by a factor of 4 to obtain E3_fifth_1; add E1_fifth_1, E2_fifth_1 and E3_fifth_1 to obtain E1;
downsample E1_fourth by a factor of 2 to obtain E1_fifth_2; convolve E2_fourth with a 3×3 convolution kernel to obtain E2_fifth_2; upsample E3_fourth by a factor of 2 to obtain E3_fifth_2; add E1_fifth_2, E2_fifth_2 and E3_fifth_2 to obtain E2;
downsample E1_fourth by a factor of 4 to obtain E1_fifth_3; downsample E2_fourth by a factor of 2 to obtain E2_fifth_3; convolve E3_fourth with a 3×3 convolution kernel to obtain E3_fifth_3; add E1_fifth_3, E2_fifth_3 and E3_fifth_3 to obtain E3.
Step 6, process feature maps E1, E2 and E3 to obtain the matrix P, i.e., the estimated pose.
Step 6.1, convolve E1 with a 3×3 convolution kernel to obtain E1_conv; upsample E2 by a factor of 2 to obtain E2_up; upsample E3 by a factor of 4 to obtain E3_up2; add E1_conv, E2_up and E3_up2 to obtain P_pre;
Step 6.2, apply a matrix transformation to P_pre to obtain the feature map P_pre_trans = [64, 64, 64, AllJoint]; apply the softmax operation over the first three dimensions of P_pre_trans to obtain the feature map H;
Step 6.3, extract the joint coordinates from the feature map H; the operation is expressed as:

$$P_x = \sum_{w=1}^{W}\sum_{h=1}^{H}\sum_{d=1}^{D} w \cdot H(w,h,d),\qquad P_y = \sum_{w=1}^{W}\sum_{h=1}^{H}\sum_{d=1}^{D} h \cdot H(w,h,d),\qquad P_z = \sum_{w=1}^{W}\sum_{h=1}^{H}\sum_{d=1}^{D} d \cdot H(w,h,d)$$

where W, H, D are respectively the width, height and depth (number) of the feature map;
Step 6.4, splice P_x, P_y and P_z to obtain the matrix P, i.e., the estimated pose;
where the softmax is expressed as:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

where $x_i$ is the pixel value of the i-th pixel.
In step 1.2 and step 2.3, the mean of the pixels is calculated as follows:

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i$$

where $x_i$ is the image input of each layer and $m$ is the number of pixels;

the variance of the pixels is calculated as follows:

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu\right)^2$$

where $x_i$ is the image input of each layer and $m$ is the number of pixels;

the normalization formula is as follows:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

where $\varepsilon$ is a very small number between 0.0001 and 0.01;

the linear rectification activation function is as follows:

$$f(x) = \max(0, x)$$
1. Calculating the loss using the optimized loss function
The loss terms comprise the relay supervision loss, the symmetry loss, the motion loss and the depth loss.
(1) Relay supervision loss
The relay loss is predicted for each joint point of the human body; the predicted coordinates of the i-th joint point are $J_i = (\hat{x}, \hat{y})$, and $G_i = (x, y)$ denotes the true joint point coordinates; the relay supervision loss is calculated with the following formula:

$$Loss_{middle} = \sum_{i=1}^{AllJoint} \left\| J_i - G_i \right\|_2^2$$

where AllJoint is the number of predicted human joint points;
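A sketch of this term, assuming (as is typical for relay/intermediate supervision) a squared L2 penalty between predicted and ground-truth joint coordinates summed over all joints; the patent's own formula image is not available, so the exact form may differ:

```python
import torch

def relay_supervision_loss(pred_joints, gt_joints):
    """pred_joints, gt_joints: [AllJoint, 2] tensors of (x, y) joint coordinates."""
    return ((pred_joints - gt_joints) ** 2).sum()
```

In relay supervision this penalty is typically applied to each intermediate stage's prediction as well as to the final output, which is what makes the training process observable stage by stage.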
(2) Symmetry loss
As shown in fig. 7, the lower arms, upper arms, shoulders, thighs and lower legs of the human body are symmetric; for example, the lengths of arm segments 7-8 and 12-13 are equal. The symmetry loss Loss_symmetry is calculated with the following formula:

$$Loss_{symmetry} = \sum_{(a,b)}^{all\_s} \Big| \left\| L_a \right\| - \left\| L_b \right\| \Big|$$

where all_s is the number of symmetric limb pairs, and $L_a$, $L_b$ denote a pair of symmetric limb segments between predicted joint points (e.g., 1-2 in fig. 8, left foot to left knee, and 6-5 in fig. 8, right foot to right knee).
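A sketch of the symmetry term under the same caveat (the formula image is unavailable): each entry pairs a limb with its mirror limb and penalizes the difference of their lengths. The joint indices below follow the numbering of fig. 7/fig. 8 only by way of example:

```python
def symmetry_loss(joints, symmetric_pairs):
    """joints: [AllJoint, 3] predicted coordinates;
    symmetric_pairs: limb index pairs, e.g. (((7, 8), (12, 13)),) for the two lower arms."""
    loss = joints.new_zeros(())
    for (i, j), (k, l) in symmetric_pairs:
        len_a = torch.linalg.norm(joints[i] - joints[j])  # length of one limb
        len_b = torch.linalg.norm(joints[k] - joints[l])  # length of the mirror limb
        loss = loss + torch.abs(len_a - len_b)
    return loss
```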
(3) Motion loss
As shown in fig. 6, the range of motion of each human joint is defined by statistics over the mainstream datasets. With the human body in a normal standing posture, the left-shoulder direction is taken as the positive x-axis direction, the forward-facing direction as the positive y-axis direction, and the direction from the feet to the head as the positive z-axis direction; the range of motion of each joint is then obtained and represented in spherical coordinates: $\gamma_1 \in (\gamma_{min}, \gamma_{max})$ means the radial length of joint 1 lies in $(\gamma_{min}, \gamma_{max})$; $\varphi_1 \in (\varphi_{min}, \varphi_{max})$ means the horizontal angle of joint 1 lies in $(\varphi_{min}, \varphi_{max})$; $\theta_1 \in (\theta_{min}, \theta_{max})$ means the pitch angle of joint 1 lies in $(\theta_{min}, \theta_{max})$; Sphere_coord is the set of such ranges per joint (e.g., lower legs 1 and 4, upper legs 2 and 3, etc. in fig. 9).
Judge whether each predicted joint lies within its range of motion; if so, the loss is 0; if not, a penalty λ is given. The motion loss is calculated with the following formula:

$$Loss_{Sph\_c} = \sum_{i=1}^{AllJoint} \lambda \cdot \mathbf{1}\big[(\gamma_i, \varphi_i, \theta_i) \notin Sphere\_coord_i\big]$$
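A sketch of the range check, assuming the predicted joints have already been converted to (γ, φ, θ) spherical coordinates relative to their parent joints and that the penalty is a fixed λ per out-of-range joint, as the text describes:

```python
def motion_loss(joints_sph, ranges, penalty):
    """joints_sph: [AllJoint, 3] spherical coordinates (gamma, phi, theta) per joint;
    ranges: [AllJoint, 3, 2] per-coordinate (min, max) bounds; penalty: the scalar lambda."""
    lo, hi = ranges[..., 0], ranges[..., 1]
    outside = ((joints_sph < lo) | (joints_sph > hi)).any(dim=1)  # joint outside its range?
    return penalty * outside.float().sum()
```

The hard indicator is not differentiable, so a training implementation would likely substitute a surrogate such as a hinge on the violation amount.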
(4) Depth loss
The predicted coordinates of the i-th joint point of the human body are $J^{3d}_i = (\hat{x}, \hat{y}, \hat{z})$, and $G^{3d}_i = (x, y, z)$ denotes the true joint point coordinates; the depth loss is calculated with the following formula:

$$Loss_{3D} = \sum_{i=1}^{AllJoint} \left\| J^{3d}_i - G^{3d}_i \right\|_2^2$$

The total loss is:

$$Loss_{total} = Loss_{middle} + Loss_{symmetry} + Loss_{Sph\_c} + Loss_{3D}$$
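Putting the four terms together; the depth term below mirrors the reconstruction above (a squared L2 over the full 3D coordinates), which is an assumption where the patent's formula image is missing:

```python
def total_loss(pred_2d, gt_2d, pred_3d, gt_3d, joints_sph, ranges,
               symmetric_pairs, penalty):
    loss_middle = relay_supervision_loss(pred_2d, gt_2d)     # relay supervision loss
    loss_symmetry = symmetry_loss(pred_3d, symmetric_pairs)  # symmetry loss
    loss_sph_c = motion_loss(joints_sph, ranges, penalty)    # motion loss
    loss_3d = ((pred_3d - gt_3d) ** 2).sum()                 # depth loss (assumed L2)
    return loss_middle + loss_symmetry + loss_sph_c + loss_3d
```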
2. Comparison of the method of the invention with the traditional Hourglass-based network
The predicted P = [m, AllJoint, 3] is visualized. Fig. 8 shows the joint points of a human body; figs. 9 to 11 are conventional 3D poses estimated based on the Hourglass network, in which it can be seen that the knees bend forward, which does not conform to the kinematics of the human body; figs. 12 to 14 are 3D poses estimated by the method of the invention, and from the different viewing angles it can be seen that the predicted results conform to the kinematic relations of the human body.
3. Memory consumption and floating-point test of the convolutional neural network algorithm of the invention
When a_in and b_in take different values, the memory consumed and the floating-point operations per second are shown in the following table, where a_in and b_in are respectively the low-frequency and intermediate-frequency channel coefficients.
As can be seen from the table, when the parameters are set to 0 the consumed memory and the floating-point operations per second are unchanged, but as the set values increase, the consumed memory and the floating-point operations per second begin to decrease, indicating that the improved convolution scheme takes effect.

Claims (8)

1. A 2D image-oriented human body 3D pose estimation method, characterized by comprising the following steps:
Step 1, perform convolution, normalization and activation operations on the 2D image in sequence, and output the resulting image;
Step 2, perform convolution, normalization and activation operations in sequence on the image output by step 1, and output the resulting image;
Step 3, feed the image output by step 2 into subnet one for processing, and output feature maps C1 and C2;
Step 4, feed feature maps C1 and C2 into subnet two for processing, and output feature maps D1, D2 and D3;
Step 5, feed feature maps D1, D2 and D3 into subnet three for processing, and output feature maps E1, E2 and E3;
Step 6, process feature maps E1, E2 and E3 to obtain the matrix P, i.e., the estimated pose;
wherein step 3 is specifically implemented according to the following steps:
Step 3.1, feed the image output by step 2 into the first residual block in subnet one for processing.
Step 3.1.1, extract features from the high-frequency, intermediate-frequency and low-frequency feature maps of the input, i.e., perform the following operations simultaneously:
convolve the high-frequency map with a 3×3 convolution kernel to obtain feature map C1_conv;
downsample the high-frequency map by a factor of 2 to obtain feature map C1_down;
downsample the high-frequency map by a factor of 4 to obtain feature map C1_down2;
upsample the intermediate-frequency map by a factor of 2 to obtain feature map C2_up;
convolve the intermediate-frequency map with a 3×3 convolution kernel to obtain feature map C2_conv;
downsample the intermediate-frequency map by a factor of 2 to obtain feature map C2_down;
upsample the low-frequency map by a factor of 4 to obtain feature map C3_up2;
upsample the low-frequency map by a factor of 2 to obtain feature map C3_up;
convolve the low-frequency map with a 3×3 convolution kernel to obtain feature map C3_conv;
Step 3.1.2, channel merging:
merge feature maps C1_conv, C2_up and C3_up2 along the channel dimension to obtain feature map C_first_1_H;
merge feature maps C1_down, C2_conv and C3_up along the channel dimension to obtain feature map C_first_1_M;
merge feature maps C1_down2, C2_down and C3_conv along the channel dimension to obtain feature map C_first_1_L;
Step 3.1.3, apply the operations of step 1.2 to feature map C_first_1_H to obtain feature map C_first_2_H; apply the operations of step 1.2 to feature map C_first_1_M to obtain feature map C_first_2_M; apply the operations of step 1.2 to feature map C_first_1_L to obtain feature map C_first_2_L;
Step 3.1.4, apply the operations of steps 3.1.1 to 3.1.3 to feature map C_first_2_H to obtain feature map C_first_3_H; apply the operations of steps 3.1.1 to 3.1.3 to feature map C_first_2_M to obtain feature map C_first_3_M; apply the operations of steps 3.1.1 to 3.1.3 to feature map C_first_2_L to obtain feature map C_first_3_L;
Step 3.1.5, apply the operations of steps 3.1.1 to 3.1.3 to feature maps C_first_3_H, C_first_3_M and C_first_3_L to obtain feature maps C_first_4_H, C_first_4_M and C_first_4_L;
Step 3.1.6, add the high-frequency feature map of the input to feature map C_first_4_H to obtain feature map C1_first; add the intermediate-frequency feature map of the input to feature map C_first_4_M to obtain feature map C2_first; add the low-frequency feature map of the input to feature map C_first_4_L to obtain feature map C3_first;
Step 3.2, feed the output of step 3.1 into the second residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_first, C2_first and C3_first to obtain feature maps C1_second, C2_second and C3_second;
Step 3.3, feed the output of step 3.2 into the third residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_second, C2_second and C3_second to obtain feature maps C1_third, C2_third and C3_third;
Step 3.4, feed the output of step 3.3 into the fourth residual block in subnet one for processing:
apply the operations of step 3.1 to feature maps C1_third, C2_third and C3_third to obtain feature maps C1_fourth, C2_fourth and C3_fourth;
Step 3.5, feed the output of step 3.4 into the conversion layer in subnet one for processing:
convolve feature maps C1_fourth, C2_fourth and C3_fourth with a 3×3 convolution kernel to obtain feature maps C1_fifth_1, C2_fifth_1 and C3_fifth_1, denoted C1; that is, C1 comprises feature maps C1_fifth_1, C2_fifth_1 and C3_fifth_1;
downsample feature maps C1_fourth, C2_fourth and C3_fourth by a factor of 2 to obtain feature maps C1_fifth_2, C2_fifth_2 and C3_fifth_2, denoted C2; that is, C2 comprises feature maps C1_fifth_2, C2_fifth_2 and C3_fifth_2.
2. The 2D-image-oriented human body 3D pose estimation method according to claim 1, wherein step 1 is specifically implemented according to the following steps:
Step 1.1, the following operations are performed on the 2D image in parallel (an illustrative sketch follows this list):
(1) A convolution operation with a 3×3 convolution kernel and (1−a_in−b_in)×64 output channels, yielding the high-frequency feature map A_1 = [128, 128, (1−a_in−b_in)×64], where a_in is the low-frequency channel-number coefficient and b_in is the mid-frequency channel-number coefficient;
(2) A 1/2 downsampling with b_in×64 output channels, yielding the mid-frequency feature map A_2 = [64, 64, b_in×64];
(3) A 1/4 downsampling with a_in×64 output channels, yielding the low-frequency feature map A_3 = [32, 32, a_in×64].
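A minimal PyTorch sketch of this three-branch split is given below. The module name FrequencySplit, the 128×128 input size, the use of average pooling for downsampling, the 1×1 projections that set the channel counts of the downsampled branches, and the example coefficient values are all assumptions of this sketch, not specified by the claim.

```python
import torch.nn as nn
import torch.nn.functional as F

class FrequencySplit(nn.Module):
    """Sketch of step 1.1: split a 2D image into high/mid/low-frequency maps."""
    def __init__(self, in_ch=3, base=64, a_in=0.25, b_in=0.25):
        super().__init__()
        high_ch = int((1 - a_in - b_in) * base)  # (1 - a_in - b_in) x 64
        mid_ch = int(b_in * base)                # b_in x 64
        low_ch = int(a_in * base)                # a_in x 64
        self.conv_high = nn.Conv2d(in_ch, high_ch, 3, padding=1)
        self.proj_mid = nn.Conv2d(in_ch, mid_ch, 1)
        self.proj_low = nn.Conv2d(in_ch, low_ch, 1)

    def forward(self, x):                        # x: [N, 3, 128, 128]
        a1 = self.conv_high(x)                   # A_1: [N, high_ch, 128, 128]
        a2 = self.proj_mid(F.avg_pool2d(x, 2))   # A_2: [N, mid_ch, 64, 64]
        a3 = self.proj_low(F.avg_pool2d(x, 4))   # A_3: [N, low_ch, 32, 32]
        return a1, a2, a3
```

Splitting the channels across resolutions in this way keeps fine detail at full resolution while the smoother components are processed at lower cost.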
Step 1.2, performing the following operations on each image output in step 1.1:
first, the average μ of the image pixels is calculated 1 The method comprises the steps of carrying out a first treatment on the surface of the Then, the variance sigma of the image pixels is calculated 1 The method comprises the steps of carrying out a first treatment on the surface of the Then carrying out normalization processing on the image pixels to obtainFinally, each pixel is activated by a linear rectification function to obtain +.>
3. The 2D-image-oriented human body 3D pose estimation method according to claim 2, wherein step 2 is specifically implemented according to the following steps:
Step 2.1, feature extraction is performed on the high-frequency, mid-frequency, and low-frequency feature maps output by step 1.2, i.e., the following operations are performed in parallel (an illustrative sketch of steps 2.1 and 2.2 follows step 2.2):
The high-frequency image is convolved with a 3×3 convolution kernel to obtain the feature map B_1_conv;
The high-frequency image is downsampled by a factor of 1/2 to obtain the feature map B_1_down;
The high-frequency image is downsampled by a factor of 1/4 to obtain the feature map B_1_down2;
The mid-frequency image is upsampled by a factor of 2 to obtain the feature map B_2_up;
The mid-frequency image is convolved with a 3×3 convolution kernel to obtain the feature map B_2_conv;
The mid-frequency image is downsampled by a factor of 1/2 to obtain the feature map B_2_down;
The low-frequency image is upsampled by a factor of 4 to obtain the feature map B_3_up2;
The low-frequency image is upsampled by a factor of 2 to obtain the feature map B_3_up;
The low-frequency image is convolved with a 3×3 convolution kernel to obtain the feature map B_3_conv.
Step 2.2, channel merging:
The feature maps B_1_conv, B_2_up, and B_3_up2 are merged along the channel dimension to obtain the high-frequency feature map B_1 = [64, 64, (1−a_in−b_in)×64];
The feature maps B_1_down, B_2_conv, and B_3_up are merged along the channel dimension to obtain the mid-frequency feature map B_2 = [32, 32, b_in×64];
The feature maps B_1_down2, B_2_down, and B_3_conv are merged along the channel dimension to obtain the low-frequency feature map B_3 = [16, 16, a_in×64].
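A minimal PyTorch sketch of the steps 2.1–2.2 exchange-and-merge pattern follows. The function name exchange, average pooling for downsampling, nearest-neighbour interpolation for upsampling, and the assumption that the per-branch channel counts are chosen so that each concatenation matches the totals stated in step 2.2 are all assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def exchange(high, mid, low, conv_h, conv_m, conv_l):
    """Sketch of steps 2.1-2.2; conv_h/m/l are 3x3 convolutions per branch."""
    # Step 2.1: bring every branch to every resolution.
    b1_conv = conv_h(high)                        # B_1_conv
    b1_down = F.avg_pool2d(high, 2)               # B_1_down
    b1_down2 = F.avg_pool2d(high, 4)              # B_1_down2
    b2_up = F.interpolate(mid, scale_factor=2)    # B_2_up
    b2_conv = conv_m(mid)                         # B_2_conv
    b2_down = F.avg_pool2d(mid, 2)                # B_2_down
    b3_up2 = F.interpolate(low, scale_factor=4)   # B_3_up2
    b3_up = F.interpolate(low, scale_factor=2)    # B_3_up
    b3_conv = conv_l(low)                         # B_3_conv
    # Step 2.2: merge along the channel dimension at each resolution.
    b1 = torch.cat([b1_conv, b2_up, b3_up2], dim=1)      # B_1 (high)
    b2 = torch.cat([b1_down, b2_conv, b3_up], dim=1)     # B_2 (mid)
    b3 = torch.cat([b1_down2, b2_down, b3_conv], dim=1)  # B_3 (low)
    return b1, b2, b3
```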
Step 2.3, the following operations are performed on each image output in step 2.2:
First, the mean μ_2 of the image pixels is computed; then, the variance σ_2 of the image pixels is computed; the image pixels are then normalized; finally, each pixel is activated with a linear rectification (ReLU) function to obtain the output feature map.
4. The 2D-image-oriented human body 3D pose estimation method according to claim 3, wherein in step 1.2 and step 2.3 the mean of the pixels is computed as follows:

μ = (1/m) · Σ_{i=1}^{m} x_i

where x_i is the image input to each layer and m is the number of pixels;

the variance of the pixels is computed as follows:

σ² = (1/m) · Σ_{i=1}^{m} (x_i − μ)²

where x_i is the image input to each layer and m is the number of pixels;

the normalization formula is as follows:

x̂_i = (x_i − μ) / √(σ² + ε)

where ε is a very small number, between 0.0001 and 0.01;

the linear rectification activation function is as follows:

f(x) = max(0, x).
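Taken together, the four formulas describe a per-layer normalization followed by a ReLU. A minimal PyTorch sketch is given below; treating the spatial pixels of each channel as the m pixels is an assumption, since the claim says only "the image pixels".

```python
import torch

def normalize_and_activate(x, eps=1e-3):       # x: [N, C, H, W]
    mu = x.mean(dim=(2, 3), keepdim=True)      # mean over the m pixels
    var = x.var(dim=(2, 3), unbiased=False, keepdim=True)  # their variance
    x_hat = (x - mu) / torch.sqrt(var + eps)   # normalization with epsilon
    return torch.relu(x_hat)                   # linear rectification, max(0, x)
```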
5. The 2D-image-oriented human body 3D pose estimation method according to claim 1, wherein step 4 is specifically implemented according to the following steps:
step 4.1, processing the first residual block in the output/input subnet II of the step 3
C using the method of step 3.1 1 Performing corresponding operation to obtain D 1_first
C using the method of step 3.1 2 Performing corresponding operation to obtain D 2_first
Step 4.2, the output of step 4.1 is fed into the second residual block of sub-network II for processing:
The feature map D_1_first is processed by the method of step 3.1 to obtain the feature map D_1_second;
The feature map D_2_first is processed by the method of step 3.1 to obtain the feature map D_2_second.
Step 4.3, the output of step 4.2 is fed into the third residual block of sub-network II for processing:
The feature map D_1_second is processed by the method of step 3.1 to obtain the feature map D_1_third;
The feature map D_2_second is processed by the method of step 3.1 to obtain the feature map D_2_third.
Step 4.4, the output of step 4.3 is fed into the fourth residual block of sub-network II for processing:
The feature map D_1_third is processed by the method of step 3.1 to obtain the feature map D_1_fourth;
The feature map D_2_third is processed by the method of step 3.1 to obtain the feature map D_2_fourth.
Step 4.5, the output of step 4.4 is fed into the conversion layer of sub-network II for processing:
The feature map D_1_fourth is convolved with a 3×3 convolution kernel to obtain D_1_fifth_1; D_2_fourth is upsampled by a factor of 2 to obtain D_2_fifth_1; D_1_fifth_1 and D_2_fifth_1 are added to obtain D_1.
The feature map D_1_fourth is downsampled by a factor of 1/2 to obtain D_1_fifth_2; D_2_fourth is convolved with a 3×3 convolution kernel to obtain D_2_fifth_2; D_1_fifth_2 and D_2_fifth_2 are added to obtain D_2.
The feature map D_1_fourth is downsampled by a factor of 1/4 to obtain D_1_fifth_3; D_2_fourth is downsampled by a factor of 1/2 to obtain D_2_fifth_3; D_1_fifth_3 and D_2_fifth_3 are added to obtain D_3.
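A minimal sketch of the step 4.5 conversion layer follows: the two sub-network II branches are resampled to three target resolutions and fused by element-wise addition, opening a third branch. Average pooling, nearest-neighbour interpolation, and equal channel counts across the added maps are assumptions of this sketch.

```python
import torch.nn.functional as F

def subnet2_transition(d1_fourth, d2_fourth, conv1, conv2):
    """Sketch of step 4.5; conv1/conv2 are 3x3 convolutions."""
    d1 = conv1(d1_fourth) + F.interpolate(d2_fourth, scale_factor=2)
    d2 = F.avg_pool2d(d1_fourth, 2) + conv2(d2_fourth)
    d3 = F.avg_pool2d(d1_fourth, 4) + F.avg_pool2d(d2_fourth, 2)
    return d1, d2, d3
```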
6. The 2D-image-oriented human body 3D pose estimation method according to claim 5, wherein step 5 is specifically implemented as follows:
Step 5.1, the output of step 4 is fed into the first residual block of sub-network III for processing:
The feature map D_1 is processed by the method of step 3.1 to obtain the feature map E_1_first;
The feature map D_2 is processed by the method of step 3.1 to obtain the feature map E_2_first;
The feature map D_3 is processed by the method of step 3.1 to obtain the feature map E_3_first.
Step 5.2, the output of step 5.1 is fed into the second residual block of sub-network III for processing:
The feature map E_1_first is processed by the method of step 3.1 to obtain the feature map E_1_second;
The feature map E_2_first is processed by the method of step 3.1 to obtain the feature map E_2_second;
The feature map E_3_first is processed by the method of step 3.1 to obtain the feature map E_3_second.
Step 5.3, the output of step 5.2 is fed into the third residual block of sub-network III for processing:
The feature map E_1_second is processed by the method of step 3.1 to obtain the feature map E_1_third;
The feature map E_2_second is processed by the method of step 3.1 to obtain the feature map E_2_third;
The feature map E_3_second is processed by the method of step 3.1 to obtain the feature map E_3_third.
Step 5.4, the output of step 5.3 is fed into the fourth residual block of sub-network III for processing:
The feature map E_1_third is processed by the method of step 3.1 to obtain the feature map E_1_fourth;
The feature map E_2_third is processed by the method of step 3.1 to obtain the feature map E_2_fourth;
The feature map E_3_third is processed by the method of step 3.1 to obtain the feature map E_3_fourth.
Step 5.5, the output of step 5.4 is fed into the conversion layer of sub-network III for processing:
The feature map E_1_fourth is convolved with a 3×3 convolution kernel to obtain E_1_fifth_1; E_2_fourth is upsampled by a factor of 2 to obtain E_2_fifth_1; E_3_fourth is upsampled by a factor of 4 to obtain E_3_fifth_1; E_1_fifth_1, E_2_fifth_1, and E_3_fifth_1 are added to obtain E_1.
The feature map E_1_fourth is downsampled by a factor of 1/2 to obtain E_1_fifth_2; E_2_fourth is convolved with a 3×3 convolution kernel to obtain E_2_fifth_2; E_3_fourth is upsampled by a factor of 2 to obtain E_3_fifth_2; E_1_fifth_2, E_2_fifth_2, and E_3_fifth_2 are added to obtain E_2.
The feature map E_1_fourth is downsampled by a factor of 1/4 to obtain E_1_fifth_3; E_2_fourth is downsampled by a factor of 1/2 to obtain E_2_fifth_3; E_3_fourth is convolved with a 3×3 convolution kernel to obtain E_3_fifth_3; E_1_fifth_3, E_2_fifth_3, and E_3_fifth_3 are added to obtain E_3.
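Step 5.5 extends the step 4.5 pattern to three inputs: each branch is brought to every resolution (a 3×3 convolution at its own scale, up/downsampling otherwise) and the three aligned maps are added. A minimal sketch, under the same assumptions as the step 4.5 sketch:

```python
import torch.nn.functional as F

def subnet3_transition(e1, e2, e3, conv1, conv2, conv3):
    """Sketch of step 5.5; conv1/2/3 are 3x3 convolutions."""
    out1 = (conv1(e1)
            + F.interpolate(e2, scale_factor=2)
            + F.interpolate(e3, scale_factor=4))  # E_1
    out2 = (F.avg_pool2d(e1, 2)
            + conv2(e2)
            + F.interpolate(e3, scale_factor=2))  # E_2
    out3 = (F.avg_pool2d(e1, 4)
            + F.avg_pool2d(e2, 2)
            + conv3(e3))                          # E_3
    return out1, out2, out3
```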
7. The 2D-image-oriented human body 3D pose estimation method according to claim 6, wherein step 6 comprises the following specific procedures:
Step 6.1, the feature map E_1 is convolved with a 3×3 convolution kernel to obtain E_1_conv; E_2 is upsampled by a factor of 2 to obtain E_2_up; E_3 is upsampled by a factor of 4 to obtain E_3_up2; E_1_conv, E_2_up, and E_3_up2 are added to obtain P_pre.
Step 6.2, P_pre is subjected to a matrix transformation to obtain the feature map P_pre_trans = [64, 64, 64, Alljoint]; a softmax operation is applied over the first three dimensions of P_pre_trans to obtain the feature map H.
Step 6.3, the joint coordinates are extracted from the feature map H; the operation is expressed as follows:

P_x = Σ_{z=1}^{D} Σ_{y=1}^{H} Σ_{x=1}^{W} x · H(x, y, z)
P_y = Σ_{z=1}^{D} Σ_{y=1}^{H} Σ_{x=1}^{W} y · H(x, y, z)
P_z = Σ_{z=1}^{D} Σ_{y=1}^{H} Σ_{x=1}^{W} z · H(x, y, z)

where W, H, and D are the width, height, and depth (number of depth slices) of the feature map, respectively.
Step 6.4, P_x, P_y, and P_z are concatenated to obtain the matrix P, i.e., the estimated pose.
8. The 2D-image-oriented human body 3D pose estimation method according to claim 7, wherein the softmax is expressed as:

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}

where x_i is the pixel value of the i-th pixel and the sum runs over all pixels.
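Steps 6.2–6.4 together implement a softmax-weighted coordinate read-out (a soft-argmax). A minimal PyTorch sketch follows; the tensor layout [Alljoint, D, H, W] and the function name soft_argmax_3d are assumptions of this sketch.

```python
import torch

def soft_argmax_3d(p_pre_trans):                # [J, D, H, W] response maps
    J, D, H, W = p_pre_trans.shape
    # Step 6.2: softmax over the spatial volume of each joint.
    h = torch.softmax(p_pre_trans.reshape(J, -1), dim=1).reshape(J, D, H, W)
    zs = torch.arange(D, dtype=h.dtype)
    ys = torch.arange(H, dtype=h.dtype)
    xs = torch.arange(W, dtype=h.dtype)
    # Step 6.3: expected index along each axis.
    p_z = (h.sum(dim=(2, 3)) * zs).sum(dim=1)   # P_z
    p_y = (h.sum(dim=(1, 3)) * ys).sum(dim=1)   # P_y
    p_x = (h.sum(dim=(1, 2)) * xs).sum(dim=1)   # P_x
    # Step 6.4: splice into the pose matrix P.
    return torch.stack([p_x, p_y, p_z], dim=1)  # P: [J, 3]
```

Because the read-out is an expectation over a softmax volume rather than a hard argmax, it is differentiable, which allows the joint coordinates to be trained end-to-end.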