CN111311729A

CN111311729A - Natural scene three-dimensional human body posture reconstruction method based on bidirectional projection network

Info

Publication number: CN111311729A
Application number: CN202010056119.6A
Authority: CN
Inventors: 林杰; 崔健; 石光明; 刘丹华; 李甫
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-01-18
Filing date: 2020-01-18
Publication date: 2020-06-19
Anticipated expiration: 2040-01-18
Also published as: CN111311729B

Abstract

The invention discloses a natural scene three-dimensional human body posture reconstruction method based on a bidirectional projection network, which aims at solving the problem that the human body three-dimensional posture reconstruction process in the prior art still needs to be improved. The invention comprises the following steps: firstly, acquiring data by using a camera; secondly, sending the collected video and image data to a two-dimensional posture detector to obtain two-dimensional human body joint point coordinates of corresponding postures; designing two-way projection networks with two structures according to the existence of the three-dimensional attitude data tags in the training process; training the designed network by using a deep-antagonistic learning strategy, minimizing a network loss function, and iterating to finally obtain a trained three-dimensional posture generator; and fifthly, inputting the output result of the two-dimensional posture detector in the step two into the three-dimensional posture generator trained in the step four. The technology is low in cost, can assist VR and AR technologies in the 5G era, establishes portable somatosensory interaction equipment, and realizes large-scale popularization and application of three-dimensional motion reconstruction technologies.

Description

Natural scene three-dimensional human body posture reconstruction method based on bidirectional projection network

Technical Field

The invention relates to the technical field of computer vision, in particular to a natural scene three-dimensional human body posture reconstruction method based on a bidirectional projection network.

Background

In the virtual reality technology and somatosensory human-computer interaction, the motion of a human body is usually required to be accurately captured, and a moving human body three-dimensional skeleton is reconstructed. The existing method usually uses some hardware peripherals such as a professional motion capture device (MOCAP) or a somatosensory camera (Kinect) to complete the reconstruction of the three-dimensional human body posture. However, these professional devices are usually expensive and have extremely high requirements on experimental environments, which hinders the wide-range popularization and application of the three-dimensional pose reconstruction technology. The estimation of the 3D posture of the human body in the monocular image is a difficult task in computer vision, and the reconstruction of the three-dimensional posture based on the 2D joint point is a difficult pathological problem. Most of the existing methods usually rely on paired label data to perform supervised training on the network, and model performance is poor under the conditions of label data shortage and clear corresponding relation. Therefore, the human body three-dimensional posture reconstruction process is improved by utilizing the deep learning technology, so that the whole process can get rid of the dependence of professional hardware peripherals, and the three-dimensional human body posture reconstruction in a natural scene can be completed only by using a common mobile phone and a camera.

In the existing deep learning method, a network is usually trained by relying on paired human posture data with labels, the model is difficult to train and has poor generalization performance under the conditions of lacking of three-dimensional labels and clear corresponding relation, and reasonable three-dimensional reconstruction is difficult to be carried out on complicated and changeable special human postures in a natural environment. Therefore, the method for reconstructing the three-dimensional human body posture can be used for accurately reconstructing the three-dimensional human body posture in a natural scene, a deep learning scheme which does not depend on tag data in a training process is significant, professional motion capture equipment can be replaced with extremely low cost, and the reconstruction of the three-dimensional posture in the natural scene can be completed.

Disclosure of Invention

The invention overcomes the problem that the human body three-dimensional posture reconstruction process still needs to be improved in the prior art, and provides a natural scene three-dimensional human body posture reconstruction method based on a bidirectional projection network, which can carry out three-dimensional reconstruction on human body actions in a natural scene by using a monocular camera.

The technical scheme of the invention is to provide a natural scene three-dimensional human body posture reconstruction method based on a bidirectional projection network, which comprises the following steps: comprises the following steps:

acquiring a natural scene human motion video or image data by using a camera;

secondly, sending the collected video and image data to a two-dimensional posture detector to obtain two-dimensional human body joint point coordinates of corresponding postures;

designing two-way projection networks with two structures according to the existence of the three-dimensional attitude data tags in the training process;

training the designed network by using a deep antagonistic learning strategy, minimizing a network loss function, and finally obtaining a trained three-dimensional posture generator through iteration;

and step five, inputting the output result of the two-dimensional posture detector in the step two into the trained three-dimensional posture generator in the step four, and outputting the output result to be the three-dimensional posture data of the person in the video/image.

Preferably, in the first step, a common monocular optical camera or a mobile phone camera is used to complete the acquisition of the character motion data in a natural scene, and the data is in the form of pictures or videos.

Preferably, in the second step, the two-dimensional gesture detector is a two-dimensional gesture detection method of openpos, StackHourglass, or HRNet, and when the acquired data is a picture, the picture is directly input to obtain a two-dimensional joint detection result, and when the acquired data is a video, a two-dimensional joint detection sequence is obtained by inputting frame by frame.

Preferably, in the third step, two bidirectional projection networks with different structures A/B are selected according to whether the user has three-dimensional attitude tag data, when the three-dimensional attitude data is available, the bidirectional projection networks work in an A mode, the network is composed of two opposite dual branches, and a network module of the network comprises a three-dimensional attitude generator, a three-dimensional attitude discriminator, a two-dimensional attitude projection layer and a two-dimensional attitude discriminator; when no three-dimensional attitude data is available, the bidirectional projection network works in a B mode, the network is composed of two projection branches in different directions, and a network module of the bidirectional projection network comprises a three-dimensional attitude generator, a two-dimensional attitude projection layer and a two-dimensional attitude discriminator.

Preferably, the three-dimensional posture generator in the third step has the input of two-dimensional joint point coordinates and the output of three-dimensional joint point coordinates, the three-dimensional joint point coordinates comprise two depth residual error networks and a posture feature extraction layer, the depth residual error networks are formed by stacking four residual error blocks, the number of neurons in each layer is 1024, and the posture feature extraction layer completes the coding compression of the posture topological structure; the two-dimensional attitude discriminator and the three-dimensional attitude discriminator have the same network architecture, the two-dimensional/three-dimensional attitude feature extraction layer, the depth residual error network and a full connection layer are contained in the two-dimensional attitude discriminator and the three-dimensional attitude discriminator, and the two-dimensional/three-dimensional discriminator module inputs attitude vectors with different dimensions and outputs a unitary discrimination value; the two-dimensional attitude projection layer comprises two branches of residual error network forward projection and rotation transformation, the attitude is projected to different observation angles respectively according to functions, the input of the two-dimensional attitude projection layer module is three-dimensional attitude data, and the output is projected two-dimensional attitude data.

Preferably, said step four comprises the sub-steps of,

step 4.1, when three-dimensional posture data are available for network training, selecting a mode A network architecture for training;

step 4.1.1, taking the two-dimensional posture as input, firstly, outputting an initial depth estimation value through a residual error network in a three-dimensional posture generator to obtain an initial estimation result of the three-dimensional posture; then, an initial estimation result is transmitted into an attitude feature extraction layer, a feature vector is output through attitude prior topological structure feature extraction, the feature vector is transmitted into a depth residual error network again to output a final depth estimation value, and a final three-dimensional reconstruction attitude is generated;

step 4.1.2, one path of the generated three-dimensional reconstruction posture obtains forward projection through a two-dimensional posture projection layer, and calculates a posture error with the input two-dimensional posture, and the other path of the generated three-dimensional reconstruction posture is sent to a three-dimensional posture discriminator to calculate a distribution error;

step 4.1.3, taking the three-dimensional posture as input, firstly obtaining forward projection through a two-dimensional posture projection layer, wherein one path of the forward projection is sent to a three-dimensional posture generator to obtain a three-dimensional reconstruction result, and the three-dimensional posture generator and the input three-dimensional posture calculate a posture error, and the other path of the forward projection is sent to a two-dimensional posture discriminator to calculate a distribution error;

4.2, when no three-dimensional posture data is available for network training, selecting a mode B network architecture for training;

step 4.2.1, taking the two-dimensional posture as input, firstly, outputting an initial depth estimation value through a residual error network in a three-dimensional posture generator to obtain an initial estimation result of the three-dimensional posture; then, an initial estimation result is transmitted into an attitude feature extraction layer, a feature vector is output through attitude prior topological structure feature extraction, the feature vector is transmitted into a depth residual error network again to output a final depth estimation value, and a final three-dimensional reconstruction attitude is generated;

step 4.2.2, transmitting the three-dimensional reconstruction posture into a two-dimensional posture projection layer to respectively obtain a forward projection and a rotary projection, wherein the forward projection calculates a posture error with the input two-dimensional posture, and the rotary projection calculates a two-dimensional distribution error through a two-dimensional posture discriminator;

4.3, respectively calculating loss functions in the A/B modes, wherein the loss functions comprise an attitude loss function and a distribution loss function;

step 4.3.1, in the mode a, the overall loss function of the network is defined as:

loss_A＝L_GAN(G_3d,D_3d)+L_GAN(G_2d,D_2d)+L_dual(G_2d,G_3d) Wherein L is_GANRepresenting students with gradient penalty termsThe loss function of the countermeasure network reflects the distribution error, and the calculation formula is as follows:

L_dualrepresenting the bidirectional loss of the dual network, reflecting the attitude error, the calculation formula is as follows:

L_dual(G_2d,G_3d)＝||G_2d(G_3d(X_2d))-X_2d||₁+||G_3d(G_2d(X_3d))-X_3d||₁

λ is a neural network hyperparameter, G_3dRepresenting a three-dimensional pose generator, G_2dRepresenting a two-dimensional pose projection layer, D_3dAnd D_2dRespectively representing three-dimensional and two-dimensional attitude discriminators, X_2dAnd X_3dRespectively representing a true two-dimensional attitude and a three-dimensional attitude, A_3dRepresenting a random three-dimensional attitude on a connection line of a sampling point of the reconstructed three-dimensional attitude distribution and the real three-dimensional attitude distribution, A_2dRepresenting the random two-dimensional attitude on the connecting line of the projection two-dimensional attitude distribution and the real two-dimensional attitude distribution sampling point;

step 4.3.2, in the mode B, the overall loss function of the network is defined as:

loss_B＝L_GAN(G_R2dG_3d,D_2d)+L_pose(G_K2dG_3d) Wherein L is_GANRepresenting a loss function of the generation countermeasure network with a gradient penalty term, which reflects the distribution error, and the calculation formula is as follows:

L_poseto reconstruct the loss, which reflects the attitude error, the calculation formula is as follows:

L_pose(G_K2dG_3d)＝||G_K2dG_3d(X_2d)-X_2d||₁

λ is a neural network hyperparameter, G_3dRepresenting a three-dimensional pose generator, G_R2dRepresenting a two-dimensional pose projection layer rotation projection transformation, G_K2dRepresenting a two-dimensional pose projection layer forward projection transformation, D_2dRepresenting a two-dimensional attitude discriminator, X_2dRepresenting true two-dimensional attitude data, A_2dRepresenting the random two-dimensional attitude on the connecting line of the projection two-dimensional attitude distribution and the real two-dimensional attitude distribution sampling point;

and 4.4, adjusting a network parameter minimization error function by using a neural network optimizer, iterating for 20-40 EPOCH, and then converging a loss function to obtain the trained three-dimensional posture generator.

The step five comprises the following sub-steps,

step 5.1, transmitting video or image data acquired by a common camera into a two-dimensional attitude detector, and firstly obtaining two-dimensional joint point data;

step 5.2, normalizing the output result of the two-dimensional attitude detector to enable the output result to be directly used as the input of the three-dimensional attitude generator; the normalization process has the following substeps:

step 5.2.1, reconstructing the coordinates of the central neck by using the detected coordinates of the left shoulder joint and the right shoulder joint:

wherein: (x)_T,y_T) Represents the central neck coordinate, (x)_ls,y_ls) Represents the left shoulder coordinate, (x)_rs,y_rs) Representing the coordinates of the right shoulder;

and 5.2.2, reconstructing central spine coordinates by using the detected left and right shoulder joints and hip joints:

wherein: (x)_S,y_S) Represents the central spine coordinate (x)_ls,y_ls) Represents the left shoulder coordinate, (x)_rs,y_rs) Represents the coordinates of the right shoulder, (x)_lh,y_lh) Represents the left hip coordinate, (x)_rh,y_rh) Represents the right hip coordinate;

step 5.3, the normalized two-dimensional posture data is transmitted into a three-dimensional posture generator, the output result is the reconstructed three-dimensional posture, and when the input data is image data, the output result is a three-dimensional human body posture skeleton; when the input data is video data, the output result is three-dimensional human skeleton action.

Compared with the prior art, the natural scene three-dimensional human body posture reconstruction method based on the bidirectional projection network has the following advantages: (1) the deep neural network is trained in a data-driven mode, low-cost human body posture three-dimensional reconstruction can be achieved directly through the neural network, expensive hardware equipment is not needed, data can be collected only through a common camera or a mobile phone, three-dimensional posture reconstruction can be conducted on a moving human body based on a visual method, and the three-dimensional reconstruction of the human body posture can be completed instead of professional hardware peripherals. The cost is low, the use is convenient, the power can be assisted by VR and AR technologies in the 5G era, the portable somatosensory interaction equipment is established, and the large-scale popularization and application of the three-dimensional motion reconstruction technology are realized.

(2) A special neural network training mode is adopted, the physiological structure characteristics of human body posture data are fully utilized, new constraints are added to the network, therefore, the label-free deep learning training process can be realized in the training process of the network without depending on specific data labels and three-dimensional data sets, the trained model has good generalization performance, and the complex three-dimensional human body posture estimation task in a natural scene can be realized.

(3) The invention designs a bidirectional projection network by researching two characteristics of human body posture. The posture prior knowledge contained in the data set is used as a new constraint to be added into the training process of the network, so that the dependence of the model on real 3D data during training is reduced, the network can be trained without depending on tag data, and accurate 3D human body posture reconstruction in a natural scene can be realized.

Drawings

FIG. 1 is a schematic diagram of an A-mode network structure of a bidirectional projection network according to the present invention;

FIG. 2 is a schematic diagram of a B-mode network structure of the bidirectional projection network of the present invention;

FIG. 3 is a schematic diagram of the internal structure of the bidirectional projection network component module according to the present invention;

FIG. 4 is an overall flow chart of the present invention;

FIG. 5 is a three-dimensional human body posture reconstruction effect diagram under a natural scene.

Detailed Description

The method for reconstructing a three-dimensional human body pose of a natural scene based on a bidirectional projection network according to the present invention is further described with reference to the accompanying drawings and the specific embodiments, in which, as shown in the drawings, detailed descriptions of the present invention are presented in this embodiment.

Introduction to related Art

Reconstructing human postural movements in three-dimensional space is one of the main targets of computer vision, and this problem has been studied by relevant scholars in the last century [1 ]. To get rid of the dependence on professional equipment, some early methods were mostly based on feature engineering to reconstruct 3D poses by performing motor physiological modeling on human skeletal joints [2, 3], or search-based methods to output corresponding 3D poses [4, 5] for 2D poses using a database dictionary of 3D skeletons for nearest neighbor lookup. With the development of deep learning, researchers try to output the 3D pose of the human body directly from RGB images by establishing an end-to-end model [6, 7, 8, 9], but the complicated background of the images in natural scenes usually interferes with the end-to-end 3D pose reconstruction process. In recent years, great attention has been paid to the inference of 3D human body posture from monocular vision systems, and the technology can be widely applied to animation movies, virtual reality, behavior recognition and human-computer interaction. This is very challenging in computer vision tasks, as restoring the 3D pose from 2D observations is itself a morbid problem. Under a natural scene, under the influence of factors such as illumination, angles and complex backgrounds, it is very difficult to directly infer the 3D posture of a human body in an image, and the problem is split into two parts by some previous works: firstly, 2D postures are estimated from images through various advanced 2D human key point detectors, and then 3D human posture reconstruction is carried out on the basis of the obtained 2D postures. Wherein [10] firstly, a simple baseline algorithm is provided, 3D posture reconstruction is regarded as a regression task from a 2D joint point to a 3D coordinate point, and high-quality 3D posture reconstruction is completed by utilizing a neural network. [11] Further representation of the pose as a distance matrix translates this problem into a two-dimensional to three-dimensional distance matrix regression. [12] The human body posture is regarded as a kind of special topological graph data, a semantic graph convolution network (SemGCN) is designed, and the regression task of graph structure data is completed. However, these methods of training the network using three-dimensional label data have two serious limitations: (1) because the data of the 3D posture has high requirements on experimental conditions and usually expensive multi-angle motion capture equipment is needed to capture three-dimensional information of human motion indoors, it is usually difficult to obtain a large amount of 3D human posture data for training in real scenes; (2) strict corresponding relation in the process of training labeled data can cause an overfitting phenomenon on a single data set, the overfitting phenomenon is represented on the aspect that a model cannot be generalized to other special angles or unseen 2D postures, and on the other hand, the overfitting phenomenon is represented on the aspect that a network can only generate 3D posture data in a training set and cannot reasonably reconstruct complex posture actions in a natural scene. Both of these limitations are due to the reliance of the training process on 3D label data.

In recent years, the accuracy of 2D human joint detection algorithms is increasing, and real-time 2D pose estimation in natural scenes can be achieved. More and more researchers are therefore working on the reconstruction of 3D poses using these easily available 2D joint point data, i.e. divided into two steps: first 2D poses are obtained from the image using advanced 2D body joint detectors and then these 2D poses are lifted to 3D. The key point for solving the ill-conditioned problem is to add reasonable prior information as a constraint in combination with the problem characteristics, and in the traditional method, the constraint is provided by a manually designed regular term, and usually, only a solution of a single problem can be realized. In the deep learning era, the automatic learning of priori information constraint from data by using a network can be regarded as a new idea for solving the ill-conditioned problem, and the problems can be solved through a model trained by a large amount of data.

Therefore, abstracting the important characteristics of the pose data and serving as the constraint of the network is the main contribution point of the invention. Through the research on the physiological structure characteristics of the posture data, the invention designs a bidirectional projection network by utilizing a deep learning technology, the bidirectional projection network has two working modes of A/B, the network can be trained respectively under the condition of three-dimensional data labels and label-free data, and the trained network can complete a complex three-dimensional human posture reconstruction task under a natural scene.

Second, the proposed method

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1: referring to fig. 1 to 5, an overall flowchart of a natural scene three-dimensional human body posture reconstruction method based on a bidirectional projection network is shown in fig. 4, and after a picture or a video is obtained by shooting with a monocular camera, a corresponding two-dimensional human body posture is obtained through a two-dimensional posture detection network (openpos, HRNet, StackHourglass) and a two-dimensional joint detection result is obtained. Before data is sent to the three-dimensional gesture generator, a corresponding mode needs to be selected according to whether the data has three-dimensional label data or not to train the three-dimensional gesture generator. When the three-dimensional attitude tag data exists, the bidirectional projection network works in the mode A, and when the three-dimensional attitude tag data does not exist, the bidirectional projection network works in the mode B. The user can select a corresponding mode to train the network according to whether the user has three-dimensional human body posture data.

When the mode a is selected, the two-way projection network training process has the structure as shown in fig. 1, and the two-dimensional posture data and the three-dimensional posture data are respectively sent to two branches of the two-way network. In the first branch, the input two-dimensional posture generates a three-dimensional posture reconstruction result through a three-dimensional posture generator, and the reconstruction result generates a two-dimensional projection result again through a two-dimensional posture projection layer; in the second branch, the input three-dimensional posture is firstly transmitted into the two-dimensional posture projection layer, and then the output result is transmitted into the three-dimensional posture generator to complete the second reconstruction. Two branches complete a dual operation process, errors of the reconstructed attitude need to be calculated in the two processes respectively, the errors are divided into distribution errors and attitude errors, and the loss function of the whole network is as follows:

loss_A＝L_GAN(G_3d,D_3d)+L_GAN(G_2d,D_2d)+L_dual(G_2d,G_3d)，

wherein L is_GANRepresenting a loss function of the generation countermeasure network with a gradient penalty term, which reflects the distribution error, and the calculation formula is as follows:

L_dual(G_2d,G_3d)＝||G_2d(G_3d(X_2d))-X_2d||₁+||G_3d(G_2d(X_3d))-X_3d||₁

when the mode B is selected, the bidirectional projection network training process has the structure shown in fig. 2, the mode B does not need any label data, the input two-dimensional posture is firstly subjected to three-dimensional posture generator to obtain a reconstruction result, then the three-dimensional posture is subjected to two projection changes through the two-dimensional posture projection layer, one branch circuit projects the three-dimensional posture to a forward observation visual angle to obtain a forward two-dimensional projection result, and the other branch circuit performs rotary projection transformation on the three-dimensional posture result to obtain other visual angle observation results. Two branches at this moment finish two kinds of different observation processes, carry on two kinds of constraints to these two kinds of observation results respectively and can get attitude error and distribution error equally, the loss function of the whole network is:

loss_B＝L_GAN(G_R2dG_3d,D_2d)+L_pose(G_K2dG_3d)，

L_pose(G_K2dG_3d)＝||G_K2dG_3d(X_2d)-X_2d||₁

λ is a neural network hyperparameter, G_3dRepresenting a three-dimensional pose generator, G_R2dRepresenting a two-dimensional pose projection layer rotation projection transformation, G_K2dRepresenting a two-dimensional pose projection layer forward projection transformation, D_2dRepresenting a two-dimensional attitude discriminator, X_2dRepresentsTrue two-dimensional attitude data, A_2dRepresenting the random two-dimensional attitude on the connecting line of the projection two-dimensional attitude distribution and the real two-dimensional attitude distribution sampling point;

in the training process, the two A/B modes of the bidirectional projection network share the same network module, and the network module is shown in FIG. 3 and comprises a three-dimensional posture generator, a two/three-dimensional posture discriminator and a two-dimensional posture projection layer.

The three-dimensional posture generator comprises two depth residual error networks and a posture characteristic extraction layer, the depth residual error networks are formed by stacking four residual error blocks, the number of neurons in each layer is 1024, an initial depth estimation value is output by an input two-dimensional posture through the residual error networks to obtain an initial estimation result of the three-dimensional posture, then the initial estimation result is transmitted into the posture characteristic extraction layer, the three-dimensional posture is coded into a characteristic vector containing a space angle and depth information through posture prior topological structure characteristic extraction, the characteristic vector is transmitted into the depth residual error networks again to output a final depth estimation value, and a final three-dimensional reconstruction posture is generated.

The two-dimensional attitude discriminator and the three-dimensional attitude discriminator have the same network architecture, and the main difference is that the feature extraction layers are different, the attitudes of two dimensions are firstly coded into a feature vector containing a motion attitude topological structure through the corresponding attitude feature extraction layer, and then the final discrimination value is output through a depth residual error network and a full connection layer, so that the calculation of the difference between the two distributions is completed.

The two-dimensional attitude projection layer comprises two branch circuits, the attitude can be projected to different angles respectively, the observation of a forward visual angle is completed through a depth residual error network connected by a plurality of residual error blocks, and the observation of other rotating visual angles is realized through the attitude rotation conversion layer.

The transformation process of the forward projection is as follows:

X_2d＝G_2d(X_3d)

the transformation process of the rotational projection is as follows:

X_2d＝G_R2dX_3d

wherein X_2dRepresenting a two-dimensional attitude, X_3dRepresenting three-dimensional attitude, G_2dRepresenting depth residual network projective transformation, G_R2dRepresenting a rotational transformation.

Wherein the rotation transformation matrix:

the A, B two training modes of the bidirectional projection network can be formed by combining the modules, the corresponding mode is selected according to actual conditions to train the network, the error function is continuously iterated and minimized, and the trained three-dimensional posture generator can be finally obtained through 20-40 EPOCH network training.

The previously detected two-dimensional poses are then subjected to a normalization process as follows:

1. reconstructing central neck coordinates using the detected left and right shoulder joint coordinates:

2. reconstructing central spine coordinates using the detected left and right shoulder joints and hip joints:

the normalized two-dimensional human body posture is transmitted into a trained three-dimensional posture generator, the three-dimensional posture generator can output a three-dimensional human body skeleton which accords with the human body posture topological structure according to a two-dimensional detection result, and the skeleton sequence of each frame of video is connected, so that the reconstruction of the three-dimensional human body posture in the video can be realized. The reconstruction effect of the invented method is shown in fig. 5.

And thirdly, in the reference documents, the numbers carried in parentheses in the application documents refer to the documents corresponding to the numbers below.

[1]H.-J.Lee and Z.Chen.Determination of 3d human body postures from asingle view.Computer Vision,Graphics,and Image Processing,30(2):148–168,1985.

[2]V.Ramakrishna,T.Kanade,and Y.Sheikh.Reconstructing 3d human posefrom 2d image landmarks.In European Conference on Computer Vision(ECCV),pages573–586.Springer,2012.

[3]C.Ionescu,J.Carreira,and C.Sminchisescu.Iterated second-orderlabel sensitive pooling for 3d human pose estimation.In Conference onComputer Vision and Pattern Recognition(CVPR),pages 1661–1668,2014.2

[4]H.Jiang.3d human pose reconstruction using millions ofexemplars.In International Conference on Pattern Recognition(ICPR),pages1674–1677.IEEE,2010.

[5]C.-H.Chen and D.Ramanan.3D human pose estimation＝2D poseestimation+matching.In Conference on Computer Vision and Pattern Recognition(CVPR),pages 5759–5767,2017.

[6]S.Li and A.B.Chan.3d human pose estimation from monocular imageswith deep convolutional neural network.In Asian Conference on Computer Vision(ACCV),pages 332–347.Springer,2014.

[7]D.Mehta,S.Sridhar,O.Sotnychenko,H.Rhodin,M.Shafiei,H.-P.Seidel,W.Xu,D.Casas,and C.Theobalt.Vnect:Real-time 3d human pose estimation with asingle rgb camera.volume 36,72017.

[8]B.Tekin,I.Katircioglu,M.Salzmann,V.Lepetit,and P.Fua.Structuredprediction of 3d human pose with deep neural networks.In British MachineVision Conference(BMVC),2016.

[9]G.Pavlakos,X.Zhou,K.G.Derpanis,and K.Daniilidis.Coarse-to-finevolumetric prediction for single-image 3d human pose.In Conference onComputer Vision and Pattern Recognition(CVPR),pages 1263–1272.IEEE,2017.

[10]J.Martinez,R.Hossain,J.Romero,and J.J.Little.A simple yeteffective baseline for 3d human pose estimation.In ICCV,2017.

[11]F.Moreno-Noguer.3d human pose estimation from a single image viadistance matrix regression.In Proceedings of the Conference on ComputerVision and Pattern Recognition(CVPR),2017.1

[12]Zhao L,Peng X,Tian Y,et al.Semantic Graph Convolutional Networksfor 3D Human Pose Regression[C]//Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition.2019:3425-3435.

Claims

1. A natural scene three-dimensional human body posture reconstruction method based on a bidirectional projection network is characterized in that: comprises the following steps:

acquiring a natural scene human motion video or image data by using a camera;

2. The natural scene three-dimensional human body posture reconstruction method based on the bidirectional projection network as claimed in claim 1, characterized in that: in the first step, a common monocular optical camera or a mobile phone camera is adopted to complete the acquisition of character motion data in a natural scene, and the data is in the form of pictures or videos.

3. The natural scene three-dimensional human body posture reconstruction method based on the bidirectional projection network as claimed in claim 1, characterized in that: in the second step, the two-dimensional posture detector is a two-dimensional posture detection method of OpenPose, StackHourglass or HRNet, when the acquired data is a picture, the picture is directly input to obtain a two-dimensional joint detection result, and when the acquired data is a video, a two-dimensional joint detection sequence is obtained by inputting frame by frame.

4. The natural scene three-dimensional human body posture reconstruction method based on the bidirectional projection network as claimed in claim 1, characterized in that: in the third step, two bidirectional projection networks with different structures A/B are selected according to whether a user has three-dimensional attitude tag data or not, when the three-dimensional attitude data is used, the bidirectional projection networks work in an A mode, the networks are composed of two opposite dual branches, and the network modules comprise a three-dimensional attitude generator, a three-dimensional attitude discriminator, a two-dimensional attitude projection layer and a two-dimensional attitude discriminator; when no three-dimensional attitude data is available, the bidirectional projection network works in a B mode, the network is composed of two projection branches in different directions, and a network module of the bidirectional projection network comprises a three-dimensional attitude generator, a two-dimensional attitude projection layer and a two-dimensional attitude discriminator.

5. The natural scene three-dimensional human body posture reconstruction method based on the bidirectional projection network as claimed in claim 1, characterized in that: the three-dimensional posture generator in the third step has the input of two-dimensional joint point coordinates and the output of three-dimensional joint point coordinates, the three-dimensional posture generator internally comprises two depth residual error networks and a posture characteristic extraction layer, the depth residual error networks are formed by stacking four residual error blocks, the number of neurons in each layer is 1024, and the posture characteristic extraction layer completes the coding compression of a posture topological structure; the two-dimensional attitude discriminator and the three-dimensional attitude discriminator have the same network architecture, the two-dimensional/three-dimensional attitude feature extraction layer, the depth residual error network and a full connection layer are contained in the two-dimensional attitude discriminator and the three-dimensional attitude discriminator, and the two-dimensional/three-dimensional discriminator module inputs attitude vectors with different dimensions and outputs a unitary discrimination value; the two-dimensional attitude projection layer comprises two branches of residual error network forward projection and rotation transformation, the attitude is projected to different observation angles respectively according to functions, the input of the two-dimensional attitude projection layer module is three-dimensional attitude data, and the output is projected two-dimensional attitude data.

6. The natural scene three-dimensional human body posture reconstruction method based on the bidirectional projection network as claimed in claim 1, characterized in that: the fourth step comprises the following sub-steps,

loss_A＝L_GAN(G_3d,D_3d)+L_GAN(G_2d,D_2d)+L_dual(G_2d,G_3d) Wherein L is_GANRepresenting a loss function of the generation countermeasure network with a gradient penalty term, which reflects the distribution error, and the calculation formula is as follows:

L_dual(G_2d,G_3d)＝||G_2d(G_3d(X_2d))-X_2d||₁+||G_3d(G_2d(X_3d))-X_3d||₁

loss_B＝L_GAN(G_R2dG_3d,D_2d)+L_po_se(G_K2dG_3d) Wherein L is_GANRepresenting a loss function of the generation countermeasure network with a gradient penalty term, which reflects the distribution error, and the calculation formula is as follows:

L_pose(G_K2dG_3d)＝||G_K2dG_3d(X_2d)-X_2d||₁

7. The natural scene three-dimensional human body posture reconstruction method based on the bidirectional projection network as claimed in claim 1, characterized in that: the step five comprises the following sub-steps,